Why Do Attributes Propagate in Graph Convolutional Neural Networks

2021-06-04

字数统计: 4.8k | 阅读时长≈ 24 分钟

https://yangliang.github.io/pdf/aaai21.pdf

Why Do Attributes Propagate in Graph Convolutional Neural Networks，2021，AAAI

总结：本文偏理论一点，作者提出了一个图表示学习框架GRL，并基于此从数值优化角度（梯度下降和高阶近似）对GCN及其各种变体做出了解释。最后受共轭梯度法性能优于梯度下降法的启发，提出了图共轭卷积模型。

1. 简介

1.1 摘要

Many efforts have been paid to enhance Graph ConvolutionalNetwork from the perspective of propagation under the phi-losophy that “Propagation is the essence of the GCNNs”.Unfortunately, its adverse effect is over-smoothing, whichmakes the performance dramatically drop. To prevent theover-smoothing, many variants are presented. However, theperspective of propagation can’t provide an intuitive and uni-fied interpretation to their effect on prevent over-smoothing.In this paper, we aim at providing a novel explanation tothe question of“Why do attributes propagate in GCNNs?”.which not only gives the essence of the oversmoothing, butalso illustrates why the GCN extensions, including multi-scale GCN and GCN with initial residual, can improve theperformance. To this end, an intuitive Graph RepresentationLearning (GRL) framework is presented. GRL simply con-strains the node representation similar with the original at-tribute, and encourages the connected nodes possess similarrepresentations (pairwise constraint). Based on the proposedGRL, exiting GCN and its extensions can be proved as dif-ferent numerical optimization algorithms, such as gradientdescent, of our proposed GRL framework. Inspired by thesuperiority of conjugate gradient descent compared to com-mon gradient descent, a novel Graph Conjugate Convolu-tional (GCC) network is presented to approximate the solu-tion to GRL with fast convergence. Specifically, GCC adoptstheobtainedinformation of the last layer, which can be rep-resented as the difference between the input and output of thelast layer, as the input to the next layer. Extensive experimentsdemonstrate the superior performance of GCC.

基于“Propagation is the essence of the GCNNs”这一思想，研究人员付出了很多努力来提升图卷积网络的性能。不幸的是，受过平滑影响，模型性能下降很多。为了阻止过平滑现象，人们提出了很多变体。但是，从（信息）传播的角度很难解释这些变体为什么能防止过平滑。本文中作者尝试解释这样一个问题：“Why do attributes propagate in GCNNs”。不仅给出了过平滑的本质，同时解释了各种GCN变体为什么能提高模型性能。最后，作者展示了一种直观的图表示学习框架GRL。GRL只是简单执行两个约束：（1）节点表示和原始特征之间的相似度；（2）相连接点之间的相似度。基于提出的GRL框架，现有的GCN及其变体被证明了是各种不同的数值优化算法，比如梯度下降。受共轭梯度下降比普通梯度下降优越的启发，作者提出了一种新的图共轭卷积神经网络GCC，用于快速逼近GRL的最优解。具体来说，GCC利用前一层输入和输出之间的差作为当前层输入的一部分。大量实验证明了GCC的优越性。

1.2 本文工作

为了提高GCNs的性能，研究人员从propagation角度做出了很多努力，比如GAT，GaAN，Geom-GCN等等。一种普遍的观点是“Propagation is the essence of GCNNs“，并且将GCNs的成功归于propagation引起的拉普拉斯平滑。

针对GNNs的过平滑问题，研究人员也提出了很多方法，比如Disentangled GCN，DropEdge，PageRank-GCN，JKNet，DeepGCN，GCNII等等。但是从propagation角度无法提供一个关于阻止过拟合的直观解释。

本文作者尝试解答“Why do attributes propagate in GCNNs?”，不仅给出了过平滑的本质，并且切实了各类GCN变体为什么能提高模型性能。最后提出了一个直观的图表示学习框架GRL（假设图拓扑结构是准确的）和Robust GRL（考虑图拓扑结构中存在噪声）。

根据提出的GRL和Robust GRL，作者证明了现有的GCN及其变体可以看做不同的数值优化算法比如梯度下降。具体来说带有残差的GCN可以看做GRL的梯度下降，multi-scale GCN可以看做GRL的高阶近似分析方法，带注意力的GCN可以看做是Robust GRL的梯度下降。从数值优化角度理解GCN及其变体，作者得出以下结论：“The propagation as well as its weight learning are not the essence of the GCNs, but induced by the numerical optimization of pairwise similarity requirement.”。

基于这个发现，作者提出了一种新的图共轭卷积网络，用于快速逼近GRL的解。具体来说，鉴于共轭梯度下降的优越性，图共轭卷积层采用上一层输入和输出的差作为当前层输入的一部分。

2. 方法

2.1 先验知识

一、线性系统中的梯度下降

对称正定线性系统定义如下：

Au = x\tag 1

其中 $\mathbb{x,u}\in\mathbb R^N$ 是N维实向量， $\mathbf A\in\mathbb R^{N\times N}$ 为对称正定矩阵，A和x已知，u未知。上述方程组的一个解析解 $u^*=A^{-1}x$ 计算代价很大，复杂度为 $O(N^3)$ 。

鉴于A是对称正定阵，所以上述方程可以通过梯度下降法求解，通过最小化下列方程得到：

f(\mathbf{u})=\frac{1}{2} \mathbf{u}^{T} \mathbf{A} \mathbf{u}-\mathbf{u}^{T} \mathbf{x}\tag 2

通过迭代下列公式可以得到公式2的最小值（数值优化中的梯度下降法）：

\mathbf{u}^{(t+1)}=\mathbf{u}^{(t)}-\mu_{t} \nabla f\left(\mathbf{u}^{(t)}\right)=\mathbf{u}^{(t)}+\mu_{t}\left(\mathbf{x}-\mathbf{A} \mathbf{u}^{(t)}\right)\tag 3

其中 $\mu_t$ 表示步长。在实际应用中，对于对称正定线性方程组，应用更为广泛的方法是共轭梯度法：

\begin{aligned} \mathbf{u}^{(t+1)} &=\mathbf{u}^{(t)}-\mu_{t} \nabla f\left(\mathbf{u}^{(t)}\right)+\kappa_{t}\left(\mathbf{u}^{(t)}-\mathbf{u}^{(t-1)}\right) \\ &=\mathbf{u}^{(t)}+\mu_{t}\left(\mathbf{x}-\mathbf{A} \mathbf{u}^{(t)}\right)+\kappa_{t}\left(\mathbf{u}^{(t)}-\mathbf{u}^{(t-1)}\right) \end{aligned}\tag 4

共轭梯度法收敛速度更快。求解公式1定义的方程组的解还有一种更直接的方法，即计算A的逆矩阵：

\Delta(s)=|s \mathbf{I}-\mathbf{A}|=s^{N}+\alpha_{1} s^{N-1}+\alpha_{2} s^{N-2}+\ldots+\alpha_{N}\\ \Delta(\mathbf{A})=\mathbf{A}^{N}+\alpha_{1} \mathbf{A}^{N-1}+\alpha_{2} \mathbf{A}^{N-2}+\ldots+\alpha_{N} \mathbf{I}=0\\ \mathbf{A}^{-1}=-\frac{1}{\alpha_{N}} \mathbf{A}^{N-1}-\frac{\alpha_{1}}{\alpha_{N}} \mathbf{A}^{N-2}-\ldots-\frac{\alpha_{N-1}}{\alpha_{N}} \mathbf{I}\\ \mathbf{u}^{*}=-\sum_{k=0}^{N-1} \frac{\alpha_{N-k-1}}{\alpha_{N}} \mathbf{A}^{k} \mathbf{x}\tag 5

二、图卷积神经网络

17年Kipf和Welling提出的GCN
$\mathbf{H}^{(t+1)}=\operatorname{ReLU}\left(\mathbf{W}^{(t+1)} \mathbf{H}^{(t)} \hat{\mathbf{A}}\right)\tag 6$
其中 $\hat{\mathbf{A}}=\tilde{\mathbf{D}}^{-\frac{1}{2}} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-\frac{1}{2}}, \tilde{\mathbf{A}}=\mathbf{A}+\mathbf{I}$ ，上述公式可以写成：
$\mathbf{H}^{(t+1)}=\operatorname{ReLU}\left(\mathbf{W}^{(t+1)} \mathbf{H}^{(t)}(\mathbf{I}+\overline{\mathbf{A}})\right)\tag 7$
其中 $\bar A=D^{-\frac{1}{2}}AD^{-\frac{1}{2}}$ 。公式7的node-wise可以表示为：
$\mathbf{h}_{i}^{(t+1)}=\sigma\left(\mathbf{W}^{(t+1)} \sum_{N(i) \cup\{i\}} \alpha_{i j} \mathbf{h}_{j}^{(t)}\right)\tag 8$
其中 $\alpha_{ij}$ 是固定的（邻接矩阵中的各个值）， $N(i)$ 表示节点 $v_i$ 的所有邻居， $\sigma(·)$ 表示非线性函数（比如ReLU）。整个GCN模型唯一的参数就是权重矩阵 $W^{(t)}$ ，可以通过最小化交叉熵损失函数得到：
$\mathcal{F}=-\sum_{v_{i} \in \mathcal{V}_{l}} \sum_{k=1}^{K} z_{i k} \ln y_{i k}\tag 9$
其中 $\mathbf{Y}=\left[y_{i k}\right]=\mathbf{H}^{(L)}$ 表示最后一层输出。该GCN模型存在的一个主要缺点就是传播权重 $\alpha_{ij}$ 是固定的。
Multi-scale Extensions，不同尺度的GCNs
1. Simplifies GCN，丢掉模型非线性层，叠加每一层参数矩阵W
$\mathbf{H}^{(L)}=\mathbf{W X} \hat{\mathbf{A}}^{L}\tag {10}$
1. N-GCN和MixHop，拼接multiple hops的结果
  $\mathbf{H}^{(t+1)}=\|_{j=0}^{P} \sigma\left(\mathbf{W}^{(j)} \mathbf{H}^{(t)} \hat{\mathbf{A}}^{j}\right)\tag {11}$
2. LanczosNet和Krylov GCN，求和multiple hops的结果
  $\mathbf{H}^{(t+1)}=\sigma\left(\sum_{j=0}^{P} \mathbf{W}^{(j)} \mathbf{H}^{(t)} \hat{\mathbf{A}}^{j}\right)\tag {12}$
Initial Residual in GCN，GCN中使用残差连接（GCNII）
$\begin{aligned} \mathbf{H}^{t+1} &=\sigma\left(\mathbf{W}^{(t+1)}\left(\left(1-\gamma_{t}\right) \mathbf{H}^{t} \hat{\mathbf{A}}+\gamma_{t} \mathbf{X}\right)\right) \\ &=\sigma\left(\mathbf{W}^{(t+1)}\left(\left(1-\gamma_{t}\right) \mathbf{H}^{t}+\left(1-\gamma_{t}\right) \mathbf{H}^{t} \overline{\mathbf{A}}+\gamma_{t} \mathbf{X}\right)\right) \end{aligned}\tag {13}$
Learning Propagation Weight，带注意力的GCNs，比如GAT，GaAN，Probabilistic GCN，按照如下方式计算 $\alpha_{ij}$ ：
$\alpha_{i j}=\exp \left(e_{i j}\right) / \sum_{k \in N i} \exp \left(e_{i k}\right)\tag {14}$
其中 $e_{ij}$ 表示不同节点之间的相似度。上述三个方法计算 $e_{ij}$ 的方式分别为：
$e_{i j}^{G A T}=\quad \text { LeakyReLU }\left(\mathbf{b}\left[\mathbf{W h}_{i} \| \mathbf{W h}_{j}\right]\right)\tag {15}$ $e_{i j}^{G a A N}=\left(\mathbf{W h}_{i}\right)^{T} \mathbf{O}\left(\mathbf{W h}_{j}\right)\tag {16}$ $e_{i j}^{P G C N}=-\left(\mathbf{W h}_{i}-\mathbf{W h}_{j}\right)^{T} \boldsymbol{\Sigma}\left(\mathbf{W h}_{i}-\mathbf{W h}_{j}\right)\tag {17}$

2.2 GRL框架

给定图 $\mathcal{G=(V,E,\mathbf X)}$ ，为了得到节点表示，需要满足下面两个约束：

Unary Constraint，节点 $v_i$ 的表示 $u_i$ 需要和原始属性 $x_i$ 相似。
Pairwise Constraint，相邻节点之间需要有相似的表示

两个约束可以公式化为：

\mathcal{C}(\mathbf{U})=\sum_{i=1}^{N} \operatorname{sim}\left(\mathbf{x}_{i}, \mathbf{u}_{i}\right)+\lambda \sum_{(i, j) \in \mathcal{E}} o_{i j} \operatorname{dis}\left(\mathbf{u}_{i}, \mathbf{u}_{j}\right)\tag {18}

其中 $o_{ij}$ 表示相似度(即边权值)，sim和dis函数可以取欧式距离：

\mathcal{C}(\mathbf{U})=\sum_{i=1}^{N}\left\|\mathbf{x}_{i}-\mathbf{u}_{i}\right\|_{2}^{2}+\lambda \sum_{(i, j) \in \mathcal{E}} o_{i j}\left\|\mathbf{u}_{i}-\mathbf{u}_{j}\right\|_{2}^{2}\tag{19}

有时候也可以去inner pro作为sim函数：

\mathcal{L}(\mathbf{U})=-2 \sum_{i=1}^{N} \mathbf{x}_{i}^{T} \mathbf{u}_{i}+\lambda \sum_{(i, j) \in \mathcal{E}} o_{i j}\left\|\mathbf{u}_{i}-\mathbf{u}_{j}\right\|_{2}^{2}\tag{20}

用 $U=[u_i]\in\mathbb R^{F\times N}$ 表示节点表示集合，公式19和20可以通过求解下列最小线性而成问题来解决（这块不知道为什么）：

\begin{align} \mathbf{U(I+M)} &=\mathbf X \tag{21}\\ \mathbf{UM} &=\mathbf X\tag{22} \end{align}

其中拉普拉斯矩阵M定义为：

\mathbf{M}=\lambda \sum_{(i, j) \in \mathcal{E}} o_{i j}\left(\mathbf{e}_{i}-\mathbf{e}_{j}\right)\left(\mathbf{e}_{i}-\mathbf{e}_{j}\right)^{T}=\lambda\left(\mathbf{D}_{O}-\mathbf{O}\right)\tag {23}

其中M和I+M都是对称正定矩阵，公式21和22可以通过下式求解：

\mathbf{U}=\mathbf{X}(\mathbf{I}+\mathbf{M})^{-1}, \quad \mathbf{U}=\mathbf{X M}^{-1}\tag{24}

图拓扑结构带噪音下的GRL

上述第二个约束条件需要图的拓扑结构是完全准确的，但是实际应用中图拓扑结构往往都带有噪声，因此需要对公式19做出调整：

\mathcal{C}(\mathbf{U})=\sum_{i=1}^{N}\left\|\mathrm{x}_{i}-\mathbf{u}_{i}\right\|_{2}^{2}+\lambda \sum_{(i, j) \in \mathcal{E}} o_{i j} \rho\left(\left\|\mathbf{u}_{i}-\mathbf{u}_{j}\right\|_{2}\right)\tag {25}

其中 $\rho(·)$ 表示惩罚函数，比如 $l_o$ 范式或者Dirac delta函数。为了减轻noisy connection，引入一个辅助变量 $l_{ij}$ 来自动删除虚假连接，公式25可以转换为：

\begin{aligned} \mathcal{C}(\mathbf{U}, \mathbf{L}) &=\sum_{i=1}^{N}\left\|\mathbf{x}_{i}-\mathbf{u}_{i}\right\|_{2}^{2} \\ &+\lambda \sum_{(i, j) \in \mathcal{E}} o_{i j}\left(l_{i j}\left\|\mathbf{u}_{i}-\mathbf{u}_{j}\right\|_{2}^{2}+\psi\left(l_{i j}\right)\right), \end{aligned}\tag{26}

该公式是关于U、L双凸的，所以要分别优化L和U，即固定L优化U，固定U优化L。

2.3 理论分析

为了简化分析过程，根据SGC中的论述，作者去掉了权重矩阵和非线性函数，只关注传播策略本身。

定理1： The scheme of GCN with initial residual in Eq. (13) is equivalent to solving the Graph RepresentationLearning in Eq. (20), where pairwise similarity $O= [o_{ij}]$ is set as the symmetric Random-Walk Markov Matrix $\bar A=D^{−\frac{1}{2}}AD^{−\frac{1}{2}}$ , with gradient descent

证明： $\mathbf{UM=X}$ 等价于公式20表示的图表示学习，根据公式3可以求解：

\mathbf{U}^{(t+1)}=\mathbf{U}^{(t)}+\mu_{t}\left(\mathbf{X}-\mathbf{U}^{(t)} \mathbf{M}\right)=\mathbf{U}^{(t)}\left(\mathbf{I}-\mu_{t} \mathbf{M}\right)+\mu_{t} \mathbf{X}\tag{30}

根据公式23，以及 $O=\bar A=D^{−\frac{1}{2}}AD^{−\frac{1}{2}}$ ， $M=D_O-O$ ，公式30可以化为：

\mathbf{U}^{(t+1)}=\left(1-\mu_{t}\right) \mathbf{U}^{(t)}+\mu_{t} \mathbf{U}^{(t)} \overline{\mathbf{A}}+\mu_{t} \mathbf{X}\tag {31}

其中 $U^{(t)}$ ， $U^{(t)}\bar A$ 和X分别对应 $H^{(t)}$ ， $H^{(t)}\bar A$ 和X，所以和公式13表示的Residual GCN是一致的。

推论1：“ The term X in gradient descent is important to properly approximate the solution to GRL. Thus, the initial residual is important in GCNs ”

定理1和推论1解释了GCNII模型中残差连接的重要性，以及为什么带有残差连接的GCN层能提高模型性能。原文还说了一句“ In contrast, if the residual connection X lacks, the approximation error of each gradient descent step can’t be ignored, thus the cumulative error is very significant in approximating the solution to GRL. Thus, stacking of GCN without initial residual connection degrades the performance. ”

定理2：“ The scheme of the multi-scale extension to GCN in Eq. (12) is equivalent to solving the Graph Representation Learning in Eq. (20), where pairwise similarity $O= [o_{ij}]$ is set as the symmetric Random-Walk Markov Matrix with selfloop $\bar A= \tilde D^{−\frac{1}{2}} \tilde A \tilde D^{−\frac{1}{2}}$ , with the higher orderapproximation to the inverse of M in Eq. (5) ”

**证明：**根据Caley-Hamilton定理， $(I-A)^{-1}$ 可以近似表示为：

(\mathbf{I}-\mathbf{A})^{-1} \approx \sum_{j=0}^{J} \beta_{j} \mathbf{A}^{j}\tag 31

公式24的解析解为： $U=XM^{-1}$ ，而 $M=D_O-O=(I-\hat A)$ ，所以公式20的解可以表示为：

\mathbf{U}=\mathbf{X}(\mathbf{I}-\mathbf{A})^{-1} \approx \sum_{j=0}^{J} \beta_{j} \mathbf{X} \hat{\mathbf{A}}^{j}\tag{33}

GRL框架的近似解等价于公式12定义的multi-scale extension GCNs。

通过定理1和定理2，证明了GCNs的传播等价于作者提出的GRL框架中的优化步骤（梯度下降步骤或者高阶近似步骤）。“ That is, they are induced by the numerical optimization of pairwise similarity requirement. Thus, the pairwise similarity constraints, which determines how to propagate, is the key to GNNs ”。

推论2：“ The solution to our proposed Robust GraphRepresentation Learning with noisy topology refinementshown in Eq. (25) with ρ(·)as in Eq. (27) is similar to the GCNNs with learnable propagation weight as Eq. (17). ”

结合上述定理和推论，作者得出一个结论（即论文标题的答案）：

" The propagation as well as its weight learning are notthe essence of the GCNs, but induced by the numericaloptimization of pairwise similarity requirement. "