COMBINING LABEL PROPAGATION AND SIMPLE MODELS OUT-PERFORMS GRAPH NEURAL NETWORKS

2021-06-04

字数统计: 2.8k | 阅读时长≈ 12 分钟

https://arxiv.org/pdf/2010.13993v2.pdf

https://github.com/CUAI/CorrectAndSmooth

COMBINING LABEL PROPAGATION AND SIMPLE MODELS OUT-PERFORMS GRAPH NEURAL NETWORKS，ICLR，2021

总结：GNN作为图学习方面的主要技术，但是其模型越来越复杂，模型收益的可解释性很差。作者发现通过结合几个简单的方法（C&S后处理步骤），在多个数据集上能超过或者媲美当前最优GNN的性能。具体来说先用简单的模型比如MLP或者线性模块预测节点类型分布，然后通过两个标签传播对预测结果进行修正（Correct）和平滑（Smooth）处理。文章的发现或者带给我们的启发有一下几点：（1）图学习模型不一定要很复杂，一些简单模型的组合往往也能取得很好地效果；（2）之间将标签用于模型中，在直推式节点分类中有很好地效果（这也是作者认为该模型的收益来源）。

1. 简介

1.1 摘要

Graph Neural Networks (GNNs) are the predominant technique for learning over graphs. However, there is relatively little understanding of why GNNs are successful in practice and whether they are necessary for good performance. Here, we show that for many standard transductive node classification benchmarks, we can exceed or match the performance of state-of-the-art GNNs by combining shallow models that ignore the graph structure with two simple post-processing steps that exploit correlation in the label structure: (i) an “error correlation” that spreads residual errors in training data to correct errors in test data and (ii) a “prediction correlation” that smooths the predictions on the test data. We call this overall procedure Correct and Smooth (C&S), and the post-processing steps are implemented via simple modifications to standard label propagation techniques fromearly graph-based semi-supervised learning methods. Our approach exceeds or nearly matches the performance of state-of-the-art GNNs on a wide variety of benchmarks, with just a small fraction of the parameters and orders of magnitude faster runtime. For instance, we exceed the best known GNN performance on the OGB-Products dataset with 137 times fewer parameters and greater than 100 timesless training time. The performance of our methods highlights how directly incorporating label information into the learning algorithm (as was done in traditionaltechniques) yields easy and substantial performance gains. We can also incorporate our techniques into big GNN models, providing modest gains. Our code forthe OGB results is at https://github.com/CUAI/CorrectAndSmooth.

GNNs虽然是图学习上非常主流的方法，但是相比较其他方法，我们不知道GNN为什么在实际应用中能取得成功，以及它对于好的实验结果是否是必要的。本文，作者发现对于许多标准的transductive节点分类benchmarks，通过组合忽略图结构的浅层模型，利用两个简单关联的两个简单post-processing步骤：（1）“error correlation”，在训练数据中传递residual errors来修正测试数据中的错误；（2）“prediction correlation，平滑测试数据中的预测结果。作者称这个过程为Correct and Smooth（C&S）。其中两个post-processing技术是通过对早期基于图的半监督学习方法地标签传播技术进行简单修改得到的。在许多标准数据集上，该方法的性能能够超过或者达到当前最优GNNs方法，并且参数量更少，训练时间更短。例如，在OGB-Products数据集上，我们的方法优于当前最好的GNN方法，并且参数量减少137倍，训练时间减少100倍。该方法的性能凸显了如何直接将标签信息整合到学习算法种产生简单而显著的性能增益。我们还可以将我们的技术整合到大型GNN模型中，获得适度的收益。

1.2 本文工作

动机： GNN的发展通常都是围绕“构建一个比基本变体（比如GCN，GraphSAGE）更具表征能力的架构”，比如GAT，GIN。许多新的GNN架构的思想都来源于语言或者视觉模型中，比如注意力和深层CNNs等等。但是随着这些模型越来越复杂，越来越难以理解他们为什么能获得性能的提升，并且难以将其拓展到大规模数据集上。

问题定义： 无向图 $G=(V,E)$ ，其中 $n=|V|$ ， $X\in\mathbb R^{n\times p}$ 为特征矩阵，A为邻接矩阵，D为度矩阵，S表示正则化后的邻接矩阵 $D^{-1/2}AD^{-1/2}$ 。对于预测问题，节点集合V划分成有标签节点集合L和无标签节点集合U。 $Y\in\mathbb R^{n\times c}$ 表示one-hot标签向量，c为类型数量。将有标签节点进一步划分成训练集 $L_t$ 和验证集 $L_v$ 。transductive分类问题定义为：给定G，X，Y，预测节点 $j\in U$ 的类别。

本文工作： 作者提出了一个包含三部分的pipeline，（1）base prediction，用一个简单模型（比如MLP或者线性模型）忽略图结构，基于节点特征进行节点类型预测；（2）correction step，在图中传播uncertainties来修正base prediction；（3）对预测进行smooth操作。其中（2）（3）是post-processing步骤，不参与训练。

目标： 研究如何用更简单的方法来改善图学习中的预测性能，并更好地理解模型性能提高的原因。

“ However, our goal is not to say thatcurrent graph learning methods are poor or inappropriate. Instead, we aim to highlight easier waysin which to improve prediction performance in graph learning and to better understand the source of performance gains. Our main finding is that more direct incorporation of labels into the learning algorithms is key. ”

2. 方法

2.1 Simple Base Predictor

采用不依赖图结构的简单预测器预测节点类型。 具体来说，训练模型f，最小化 $\sum_{i\in L_t}l(f(x_i),y_i)$ ，其中 $x_i$ 表示X的第i行， $y_i$ 表示Y的第i行，l表示损失函数。本文中f表示MLP或者一个线性模块，l表示交叉熵损失。

通过f，可以得到预测结果 $Z\in\mathbb R^{n\times c}$ （经softmax得到的概率分布）。

注：这里的基预测器也可以选择基于GNN的方法。

2.2 Correcting for Error

通过整合labels来修正错误，提高base prediction Z的准确度。

基本思想： 我们期望base prediction中的errors和图中的边是正相关的，也就是说节点i中的error会增加其邻居节点出现相似error的概率。

作者受residual propagation的启发，定义一个error matrix $E\in\mathbb R^{n\times c}$ ：

E_{L_{t}}=Z_{L_{t}}-Y_{L_{t}}, \quad E_{L_{v}}=0, \quad E_{U}=0\tag 1

采用标签传播技术来smooth error，优化目标如下：

\hat{E}=\underset{W \in \mathbb{R}^{n \times c}}{\arg \min } \operatorname{trace}\left(W^{T}(I-S) W\right)+\mu\|W-E\|_{F}^{2}\tag 2

上述优化目标可以通过不断迭代 $E^{(t+1)}=(1-\alpha) E+\alpha S E^{(t)}$ 得到，其中 $\alpha=1/(1+\mu)$ ， $E^{(0)}=E$ 。

考虑对于分类问题，smoothed error $\hat E$ 范围可能不对：

\left\|E^{(t+1)}\right\|_{2} \leq(1-\alpha)\|E\|+\alpha\|S\|_{2}\left\|E^{(t)}\right\|_{2}=(1-\alpha)\|E\|_{2}+\alpha\left\|E^{(t)}\right\|_{2}\tag 3

当 $E^{(0)}=E$ 时，有 $||E^{(t)}||_2\leq||E||_2$ ，因此这种标签传播不能完全修正图中所有节点的error。为了解决这个问题，作者提出了两种scaling误差的变体：

（1）Autoscale

“Intuitively, we want to scale the size of errors in $\hat E$ to be approximately the size of theerrors in E. ”

因为只知道训练集有标签节点的error，所以作者用这部分节点的error来近似获取scale值：

（2）FDiff-scale

参考2003年一篇文章中的做法，固定训练集节点中的error。具体来说，迭代 $E_U^{(t+1)}=[D^{-1}AE^{(t)}]_U$ ，同时保持 $E_L^{(t)}=E_L$ 直到收敛得到 $\hat E$ 。

作者发现这种方式下也可以找到一个有效的超参s，得到 $Z^{(r)}=Z+s\hat E$ 。

\sigma=\frac{1}{\left|L_{t}\right|} \sum_{j \in L_{t}}\left\|e_{j}\right\|_{1}

Z_{i,:}^{(r)}=Z_{i,:}+\sigma \hat{E}_{:, i} /\left\|\hat{E}_{:, i}^{T}\right\|_{1} \text { for } i \in U

2.3 Smoothing Final Prediction

经过前面两步，可以得到修正后的预测结果 $Z^{(r)}$ 。为了得到最终的预测结果，需要对corrected prediction进行smooth操作。

动机： 相邻节点之间很可能拥有相似的label（同构图中）。

作者这里采用另一个标签传播对标签分布进行平滑处理。首先从最优标签猜测开始，即：

G_{L_{t}}=Y_{L_{t}}, \quad G_{L_{v}, U}=Z_{L_{v}, U}^{(r)}

然后重复执行 $G^{(t+1)}=(1-\alpha) G+\alpha S G^{(t)}$ with $G^{(0)}=G$ 直到收敛，得到最终预测 $\hat Y$ 。

2.4 Summary

采用图结构不相干的简单模型获取初始预测结果Z
利用error propagation修正基础预测得到 $Z^{(r)}=Z+\hat E$
通过另一个LP对 $Z^{(r)}$ 进行平滑处理得到最终预测

3. 实验

3.1 实验设置

数据集： 下表9个数据集

数据划分： training/validation/test比例不同数据集分别设置为40%/10%/50%和60%/20%/20%。

base predictors： 使用Linear和MLP作为基预测器（输入为节点特征和spectral embedding）。

对比方法： GCN，SGC，APPNP。Plain Linear（只使用raw features）和Label Propagation（只使用label）。

SOTA方法： Arxiv和Product数据集采用UniMp作为SOTA；Cora、Pubmed和Citeseer数据集采用GCNII作为SOTA；Rice31数据集采用GCN with spectral and node2vec embeddings作为SOTA；WikiCS选用APPNP作为SOTA。