因果模型-02-因果树

主要参考

总结

处理

首先，基础模型选用的是CART回归树。
因果树改进了MSE，主动减去了一项模型参数无关的 $E[Y_i^2]$。
- $EMSE$ 可以被定义为 估计量的方差 + 估计量偏差的平方
构建树的过程中，train set切割为 $S^{tr}$ 和 $S^{est}$ 两部分，node的预测值由 $S^{est}$ 进行无偏估计。
- 虽然最终实现上 $S^{est}$ 用train set替代了
把改进的 $MSE$ 应用到CATE中来指导 节点分割 和 建立决策树。
- 使用 $\tau$ 替代了 $\mu$，对 处理效应 进行建模
  - $MSE{\tau} = \frac{1}{ #(S^{te})} \sum{i \in S^{te}} \left{ (\tau_i - \hat{\tau}(X_i; S^{est}, \prod))^2 - \tau_i^2 \right}$
- 真实 处理效应 通过 $\hat{\tau}$ 进行无偏估计
理论上causal tree 仅支持两个Treatment。

效果

无偏的节点估计，以准确估计异质性处理效应（HTE）
使用改进的MSE，因果树能够在每次节点分裂时，综合考虑偏差和方差的影响，从而使得每个节点的划分尽可能减少整体的误差（EMSE），提高了树的预测性能。

notions

在特征空间 $\mathcal{X}$ 下存在节点分裂方式的集合：

$\prod(\ell_1, \dots, \ell_{ \#(T)})$

其中以 $\ell(x; \prod)$ 表示叶子节点 $\ell$ 属于划分方式 $\prod$，此时该划分方式下的，node 的条件期望定义为：

$\mu(x; \prod) = E[Y_i | X_i \in \ell(x; \prod)]$

那么，自然如果给定样本 $S$，其对应节点无偏统计量为：

$\hat{\mu}(x; S, \prod) = \frac{1}{ \#(i \in S : X_i \in \ell(x; \prod))} \sum_{i \in S : X_i \in \ell(x; \prod)} Y_i$

改进后的 MSE 和 Honest 方法

改进的MSE
学习目标使用修改后的MSE，在标准mse的基础上多减去了一项和模型参数估计无关的 $E[Y_i^2]$ 。
样本划分(Honest)
测试样本test set $S^{te}$
此外训练即build tree阶段，train set被切分为两部分
一部分训练样本train set：$S^{tr}$
一部分是估计样本 est set $S^{est}$
这里有点特殊：和经典的树模型不一样的是，叶子节点上存储的值不是根据train set来的，而是划分好之后通过est set进行估计。所以，这也是文中为啥把这种方法叫做“Honest”的原因。
新的评估标准
假设已经根据训练样本得到划分方式$\prod$，那么评估这种划分方式好坏被定义为：
$MSE(S^{te}, S^{est}, \prod) = \frac{1}{ \#(S^{te})} \sum_{i \in S^{te}} \left( (Y_i - \hat{\mu}(X_i; S^{est}, \prod))^2 - Y_i^2 \right)$
整体求期望变成：
$EMSE(\prod) = E_{S^{te}, S^{est}} \left[ MSE(S^{te}, S^{est}, \prod) \right]$
算法的整体目标为：
$Q^H(\pi) = -E_{S^{est}, S^{te}, S^{tr}} \left[ MSE(S^{te}, S^{est}, \pi(S^{tr})) \right]$
其中，$\pi(S)$ 定义为：
$\pi(S) = \begin{cases} \{\{L, R\}\} & \text{if } \bar{Y}_L - \bar{Y}_R \leq c \\ \{\{L\}, \{R\}\} & \text{if } \bar{Y}_L - \bar{Y}_R > c \end{cases}$
其实就是比较节点在划分后，左右子节点的输出差异是否满足阈值 $c$，$\bar{Y}_L = \mu(L)$。

节点划分方式

($EMSE$ 可以被定义为 估计量的方差 + 估计量偏差的平方)

根据 改进后的 MSE 和 Honest 方法 给出节点划分时的 loss 计算标准：

$-EMSE(\prod) = -E_{(Y_i, X_i), S^{est}} \left[ (Y_i - \mu(X_i; \prod))^2 - Y_i^2 \right]-E_{X_i, S^{est}} \left[ \left( \hat{\mu}(X_i; S^{est}, \prod) - \mu(X_i; \prod) \right)^2 \right]$ $= E_{X_i} \left[ \mu^2(X_i; \prod) \right] - E_{S^{est}, X_i} \left[ \text{Var}(\hat{\mu}^2(X_i; S^{est}, \prod)) \right]$

推导一下：

$EMSE(\prod) = E_{S^{te}, S^{est}} \left[ \frac{1}{ \#(S^{te})} \sum_{i \in S^{te}} \left( (Y_i - \hat{\mu}(X_i; S^{est}, \prod))^2 - Y_i^2 \right) \right]$ $= E_{S^{te}, S^{est}} \left[ (Y_i - \hat{\mu}(X_i; S^{est}, \prod))^2 - Y_i^2 \right]$ $= E_{S^{te}, S^{est}} \left[ \left( Y_i - \mu(X_i; \prod) + \mu(X_i; \prod) - \hat{\mu}(X_i; S^{est}, \prod) \right)^2 - Y_i^2 \right]$ $= E_{S^{te}, S^{est}} \left[ \left( Y_i - \mu(X_i; \prod) \right)^2 - Y_i^2 \right]$

因为中间展开项的期望为 0，所以公式变成：

$EMSE(\prod) = E_{S^{te}, S^{est}} \left[ (Y_i - \mu(X_i; \prod))^2 - Y_i^2 \right]$

同样地，展开项的项期望为 0，由于无偏估计 $\Rightarrow \mu(Xi; \prod) = E{S^{est}}[\hat{\mu}(X_i; S^{est}, \prod)]$，最终公式变成：

$-EMSE(\prod) = E_{X_i} \left[ \mu^2(X_i; \prod) \right] - E_{S^{est}, X_i} \left[ \text{Var}(\hat{\mu}(X_i; S^{est}, \prod)) \right]$

其中，$E{S^{est}, X_i} \left[ \text{Var}(\hat{\mu}(X_i; S^{est}, \prod)) \right] = E{S^{te}, S^{est}} \left[ (\hat{\mu}(X_i; S^{est}, \prod) - \mu(X_i; \prod))^2 \right]$

公式中第一项可以理解为偏差的平方，第二项理解为方差。因此，$EMSE(\prod)$ 可以被理解成偏差和方差的组合。为什么 MSE 可以被理解成偏差和方差的组合，以及展开项为 0：

$MSE(\hat{\theta}) = E_{\theta}[(\hat{\theta} - \theta)^2]$ $= E_{\theta} \left[ (\hat{\theta} - E_{\theta}[\hat{\theta}] + E_{\theta}[\hat{\theta}] - \theta)^2 \right]$ $= E_{\theta} \left[ (\hat{\theta} - E_{\theta}[\hat{\theta}])^2 + 2(\hat{\theta} - E_{\theta}[\hat{\theta}])(E_{\theta}[\hat{\theta}] - \theta) + (E_{\theta}[\hat{\theta}] - \theta)^2 \right]$ $= E_{\theta} \left[ (\hat{\theta} - E_{\theta}[\hat{\theta}])^2 \right] + 2 E_{\theta} \left[ (\hat{\theta} - E_{\theta}[\hat{\theta}])(E_{\theta}[\hat{\theta}] - \theta) \right] + E_{\theta} \left[ (E_{\theta}[\hat{\theta}] - \theta)^2 \right]$ $= E_{\theta} \left[ (\hat{\theta} - E_{\theta}[\hat{\theta}])^2 \right] + (E_{\theta}[\hat{\theta}] - \theta)^2$ $= \text{Var}_{\theta}(\hat{\theta}) + \text{Bias}_{\theta}(\hat{\theta}, \theta)^2$

其中，

$\text{Var}_{\theta}(\hat{\theta})$ 表示估计值的方差；
$\text{Bias}_{\theta}(\hat{\theta}, \theta)^2$ 表示估计值的偏差的平方。
因此，均方误差可以被分解为方差和偏差的平方之和。

进一步分析 `偏差项` 和 `方差项`

$-EMSE(\prod) = E_{X_i} \left[ \mu^2(X_i; \prod) \right] - E_{S^{est}, X_i} \left[ \text{Var}(\hat{\mu}^2(X_i; S^{est}, \prod)) \right]$

偏差项

$E_{X_i} \left[ \mu^2(X_i; \prod) \right] = E_{X_i} \left\{ E_S \left( \hat{\mu}(X_i; S, \prod) \right)^2 \right\}$ $= E_{X_i} \left\{ E_S \left[ \hat{\mu}^2(X_i; S, \prod) \right] - V_S \left[ \hat{\mu}(X_i; S, \prod) \right] \right\}$ $= E_{X_i} \left\{ E_S \left[ \hat{\mu}^2(X_i; S, \prod) \right] \right\} - E_{X_i} \left\{ V_S \left[ \hat{\mu}(X_i; S, \prod) \right] \right\}$

第一项总体估计值的期望使用训练集的样本，即：

$\hat{\mu}^2(X_i, S^{tr}, \prod) = E_S[\hat{\mu}(X_i; S; \prod)]$

第二项方差项，叶子节点方差求均值

$V_S[\hat{\mu}^2(X_i; S; \prod)] = \frac{S_{S^{tr}}^2}{N^{tr}}$

对于最外层的期望：

$\hat{E}_{X_i} \left[ \mu^2(X_i; \prod) \right] = \sum_{\ell \in \prod} \frac{N_\ell^{tr}}{N^{tr}} \hat{\mu}^2(X_i, S^{tr}, \prod) - \sum_{\ell \in \prod} \frac{N_\ell^{tr}}{N^{tr}} \frac{S_{tr}^2 [\ell(x, \prod)]}{N_\ell^{tr}}$ $= \frac{1}{N^{tr}} \sum_{i \in S^{tr}} \hat{\mu}^2(X_i, S^{tr}, \prod) - \frac{1}{N^{tr}} \sum_{\ell \in \prod} S_{tr}^2[\ell(x, \prod)]$

方差项

$V(\hat{\mu}(X_i; S^{est}, \prod)) = \frac{S_{S^{tr}}^2(\ell(x; \prod))}{N_{est}(\ell(x; \prod))}$ $E_{S^{est}, X_i} \left[ V \left( \hat{\mu}^2(X_i; S^{est}, \prod) \mid i \in S^{te} \right) \right] \equiv \frac{1}{N_{est}} \cdot \sum_{\ell \in \prod} S_{S^{tr}}^2(\ell)$

整合，最终估计量为：

$-EMSE(\hat{S}^{tr}, \prod) = \frac{1}{N^{tr}} \sum_{i \in S^{tr}} \hat{\mu}^2(X_i, S^{tr}, \prod) - \left( \frac{1}{N^{tr}} + \frac{1}{N^{est}} \right) \cdot \sum_{\ell \in \prod} S_{S^{tr}}^2(\ell)$ $= \frac{1}{N^{tr}} \sum_{i \in S^{tr}} \hat{\mu}^2(X_i, S^{tr}, \prod) - \frac{2}{N^{tr}} \cdot \sum_{\ell \in \prod} S_{S^{tr}}^2(\ell)$

最后一个等式假设了 train set 和 est set 同分布。

Treatment Effect 引入划分准则：处理异质效应

前面定义了 $MSE$ 的范式，当需要考虑到异质效应时，定义异质效应为：

$\tau = \mu(1, x; \prod) - \mu(0; x; \prod)$

很显然，我们永远观测不到异质性处理效应，因为我们无法观测到反事实，我们只能估计处理效应，给出异质性处理效应的估计量：

$\hat{\tau}(w, x; S, \prod) = \hat{\tau}(1, x; S, \prod) - \hat{\tau}(0, x; S, \prod)$

因果效应下的 MSE 为：

$MSE_{\tau} = \frac{1}{ \#(S^{te})} \sum_{i \in S^{te}} \left\{ (\tau_i - \hat{\tau}(X_i; S^{est}, \prod))^2 - \tau_i^2 \right\}$ $- EMSE_{\tau}(\prod) = E_{X_i} \left[ \tau^2(X_i; \prod) \right] - E_{S^{est}, X_i} \left[ V \left( \hat{\tau}^2(X_i; S^{est}, \prod) \right) \right]$

使用 $\tau$ 替代了 $\mu$，

偏差项，带入整合公式：

$-EMSE_{\tau}^{\hat{S^{tr}}, \prod} = \frac{1}{N^{tr}} \sum_{i \in S^{tr}} \hat{\tau}^2(X_i; S^{tr}, \prod) - \frac{2}{N^{tr}} \sum_{\ell \in \prod} \left( \frac{S^2_{S^{tr}_{treat}}(\ell)}{p} + \frac{S^2_{S^{tr}_{control}}(\ell)}{1 - p} \right)$

其中，$p$ 表示相应 treatment 组的样本占比，该公式也是最终的计算节点分分类标准的公式。

notions 解释

$\tau = \mu(1, x; \prod) - \mu(0; x; \prod)$
- $\tau$ 表示 异质性处理效应（treatment effect），也就是在不同特征 $x$ 下的处理效果差异。
- $\mu(1, x; \prod)$ 表示在处理组（treatment group）下的期望结果，假设 $x$ 是一个特征向量，$\prod$ 是分裂方式（例如树结构中的分裂规则）。
- $\mu(0, x; \prod)$ 表示在对照组（control group）下的期望结果。
- 因此，$\tau$ 就是处理组和对照组之间的期望结果差异，即异质性处理效应。
$\hat{\tau}(w, x; S, \prod)$
- 这是一个估计量，表示我们无法直接观测的异质性处理效应的估计值。
- $w$ 表示一个二值变量，用于区分处理组和对照组的状态（$w=1$ 表示处理组，$w=0$ 表示对照组）。
- $S$ 表示样本集合，我们在该集合上估计 $\hat{\tau}$。
- $\prod$ 表示分裂方式，用于定义节点分割和结构。
$\hat{\tau}(1, x; S, \prod) - \hat{\tau}(0, x; S, \prod)$
- $\hat{\tau}(1, x; S, \prod)$ 表示在处理组条件下，对 $\tau$ 的估计（给定样本集 $S$ 和分裂方式 $\prod$）。
- $\hat{\tau}(0, x; S, \prod)$ 表示在对照组条件下，对 $\tau$ 的估计。
- 由于我们只能观测到一个个体的一个状态（即处理或对照），因此无法直接观测到反事实。通过这个公式，我们使用样本中的观测数据来估计反事实，即 处理效果的差异。

总结来说，这里的 $\hat{\tau}(w, x; S, \prod)$ 表示在特定条件下的处理效果估计，定义了我们观测到的结果和反事实结果之间的差异。在实际中，处理效果 $\tau$ 是观测不到的，而是通过估计值 $\hat{\tau}$ 来逼近。

实现Cython中的主要计算公式

计算当前节点的处理效应( $\tau$ ) $\tau = \frac{\text{tr\_y\_sum}}{\text{tr\_count}} - \frac{\text{ct\_y\_sum}}{\text{ct\_count}}$
计算给定数据的方差。其中：
```
 $E[Y] = \frac{\text{y\_sum}}{\text{count}}$
 $E[Y^2] = \frac{\text{y\_sq\_sum}}{\text{count}}$
```

当前节点的“不纯度”（impurity） $\text{impurity} = \left(\frac{\text{Var}(Y_{\text{tr}})}{N_{\text{tr}}} + \frac{\text{Var}(Y_{\text{ct}})}{N_{\text{ct}}}\right) - \tau^2 + \text{penalty}$ 其中：

 $\text{Var}(Y_{\text{tr}})$ 和 $\text{Var}(Y_{\text{ct}})$ 分别是处理组和对照组的方差。
 $N_{\text{tr}}$ 和 $N_{\text{ct}}$ 分别是处理组和对照组的样本数量。
 $\tau$ 是处理效应。
 $\text{penalty}$ 是分组惩罚项，用于调整不同分组样本数量对纯度的影响。