Contrastive and Generative Graph Convolutional Networks for Graph-based Semi-Supervised Learning

2021-10-04


https://arxiv.org/pdf/2009.07111

Contrastive and Generative Graph Convolutional Networks for Graph-based Semi-Supervised Learning, 2020, AAAI

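Summary: The authors propose a semi-supervised graph contrastive learning method, $\text{CG}^3$. Two aspects of the model stand out. First, unlike the augmentation strategies of conventional contrastive learning methods, $\text{CG}^3$ does not alter the structure or attribute information of the original graph; instead it learns node embeddings from a local and a global perspective and contrasts the two. Second, the objective of $\text{CG}^3$ combines a contrastive loss, a generative loss, and a cross-entropy classification loss. The reported experimental results are good, but one likely drawback is that the loss function is quite complex, so training may be expensive.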

1. Introduction

1.1 Abstract

Graph-based Semi-Supervised Learning (SSL) aims to transfer the labels of a handful of labeled data to the remaining massive unlabeled data via a graph. As one of the most popular graph-based SSL approaches, the recently proposed Graph Convolutional Networks (GCNs) have gained remarkable progress by combining the sound expressiveness of neural networks with graph structure. Nevertheless, the existing graph-based methods do not directly address the core problem of SSL, i.e., the shortage of supervision, and thus their performances are still very limited. To accommodate this issue, a novel GCN-based SSL algorithm is presented in this paper to enrich the supervision signals by utilizing both data similarities and graph structure. Firstly, by designing a semi-supervised contrastive loss, improved node representations can be generated via maximizing the agreement between different views of the same data or the data from the same class. Therefore, the rich unlabeled data and the scarce yet valuable labeled data can jointly provide abundant supervision information for learning discriminative node representations, which helps improve the subsequent classification result. Secondly, the underlying determinative relationship between the data features and input graph topology is extracted as supplementary supervision signals for SSL via using a graph generative loss related to the input features. Intensive experimental results on a variety of real-world datasets firmly verify the effectiveness of our algorithm compared with other state-of-the-art methods.

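In short: graph-based SSL propagates the labels of a few labeled nodes to the many unlabeled nodes through a graph. GCNs are among the most popular approaches, but existing graph-based methods do not directly tackle the core difficulty of SSL, the shortage of supervision. The paper proposes a GCN-based algorithm that enriches the supervision signal in two ways. First, a semi-supervised contrastive loss maximizes the agreement between different views of the same node, or between nodes of the same class, so that unlabeled and labeled data jointly supervise the learning of discriminative node representations. Second, a graph generative loss tied to the input features extracts the relationship between node features and the input graph topology as a supplementary supervision signal.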

1.2 Contributions of the paper

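Background: Over the past few decades semi-supervised learning (SSL) has attracted growing attention, and many algorithms have been very successful in their respective areas. In graph-based SSL, all labeled and unlabeled examples are represented as nodes, and the relationships between them are described by edges. The semi-supervised problem on graphs is therefore to propagate the labels of a small set of labeled nodes to the much larger set of remaining unlabeled nodes. A common earlier approach uses graph Laplacian regularization to force nodes that are similar in feature space to share the same label. In recent years research has shifted toward learning discriminative network embeddings: many GCN-based algorithms have been proposed and shown to outperform the traditional methods.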

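Motivation: Although graph-based SSL methods have made notable progress in recent years, they do not address the core problem of SSL, the shortage of supervision. In other words, these methods do not give the model any additional supervision signal; their gains come mostly from more powerful models. The authors want to fully exploit the supervision information carried by the data itself and design an effective GCN-based SSL algorithm.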

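This work: The authors propose a new graph-based SSL algorithm, "Contrastive GCNs with Graph Generation" ($\text{CG}^3$). Concretely, the model performs local-global contrastive learning while using the small set of labeled nodes to compute a classification loss, and these losses are optimized jointly together with a graph generative loss.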

2. Method

The $\mathrm{CG}^{3}$ framework proposed by the authors is shown in Figure 1:

2.1 Constructing the contrastive views

Constructing two different views is a key step in contrastive learning. This section describes how the authors build their contrastive views.

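As the first half of Figure 1 shows, unlike other contrastive learning methods the authors do not change the graph structure or the node attributes. Instead they learn node representations from a local and a global perspective and then perform a local-global contrast.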

local view: the authors use a 2-layer GCN as the backbone of this branch:
$\mathbf{H}^{\phi_{1}}=\hat{\mathbf{A}} \sigma\left(\hat{\mathbf{A}} \mathbf{X} \mathbf{W}^{(0)}\right) \mathbf{W}^{(1)}$
where $\hat{\mathbf{A}}=\tilde{\mathbf{D}}^{-\frac{1}{2}} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-\frac{1}{2}}, \tilde{\mathbf{A}}=\mathbf{A}+\mathbf{I}, \tilde{\mathbf{D}}_{i i}=\sum_{j} \tilde{\mathbf{A}}_{i j}$, and $\mathbf{H}^{\phi_{1}}$ denotes the node representations learned under the local view.
global view: a hierarchical graph neural network (HGCN) is used, and $\mathbf{H}^{\phi_{2}}$ denotes the final node representations learned under the global view.
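
To make the local-view formula concrete, here is a minimal NumPy sketch of the normalized adjacency $\hat{\mathbf{A}}$ and the 2-layer forward pass. It is an illustration rather than the authors' implementation: the function names, the dense-matrix representation, the toy graph, and the choice of ReLU for $\sigma$ are all assumptions.

```python
import numpy as np

def normalize_adjacency(A):
    """Return D~^{-1/2} (A + I) D~^{-1/2} for a dense adjacency matrix A."""
    A_tilde = A + np.eye(A.shape[0])      # add self-loops
    deg = A_tilde.sum(axis=1)             # D~_ii = sum_j A~_ij, always >= 1
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]

def gcn_two_layer(A, X, W0, W1):
    """H = A_hat * sigma(A_hat X W0) * W1, with sigma assumed to be ReLU."""
    A_hat = normalize_adjacency(A)
    hidden = np.maximum(A_hat @ X @ W0, 0.0)
    return A_hat @ hidden @ W1

# Toy example (hypothetical data): path graph 0 - 1 - 2, 4-dim features.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
X = rng.normal(size=(3, 4))
W0 = rng.normal(size=(4, 8))
W1 = rng.normal(size=(8, 2))
H_local = gcn_two_layer(A, X, W0, W1)     # one 2-dim embedding per node
```

The global-view branch (HGCN) is not sketched here because these notes do not describe its layers.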

2.2 Contrastive loss

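Nothing unusual here; it follows other contrastive learning methods. The unsupervised contrastive loss, computed without labels over all $n$ nodes, is the InfoNCE loss: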

View 1 as the anchor, contrasted against view 2:
$\mathcal{L}_{u c}^{\phi_{1}}\left(\mathbf{x}_{i}\right)=-\log \frac{\exp \left(\left\langle\mathbf{h}_{i}^{\phi_{1}}, \mathbf{h}_{i}^{\phi_{2}}\right\rangle\right)}{\sum_{j=1}^{n} \exp \left(\left\langle\mathbf{h}_{i}^{\phi_{1}}, \mathbf{h}_{j}^{\phi_{2}}\right\rangle\right)}$
View 2 as the anchor, contrasted against view 1:
$\mathcal{L}_{u c}^{\phi_{2}}\left(\mathbf{x}_{i}\right)=-\log \frac{\exp \left(\left\langle\mathbf{h}_{i}^{\phi_{2}}, \mathbf{h}_{i}^{\phi_{1}}\right\rangle\right)}{\sum_{j=1}^{n} \exp \left(\left\langle\mathbf{h}_{i}^{\phi_{2}}, \mathbf{h}_{j}^{\phi_{1}}\right\rangle\right)}$
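
The two anchor directions above can be sketched in a few lines of NumPy. This is only an illustration under assumptions: `H1` and `H2` stand for the $n \times d$ embedding matrices of the two views, the function names are made up for this note, and the final averaging over nodes and directions is how the per-node terms are combined here.

```python
import numpy as np

def log_softmax_rows(S):
    """Numerically stable row-wise log-softmax."""
    S = S - S.max(axis=1, keepdims=True)
    return S - np.log(np.exp(S).sum(axis=1, keepdims=True))

def unsupervised_contrastive_loss(H1, H2):
    """InfoNCE in both directions, averaged over all n nodes."""
    S = H1 @ H2.T                              # S[i, j] = <h_i^{phi1}, h_j^{phi2}>
    loss_1 = -np.diag(log_softmax_rows(S))     # view 1 as anchor
    loss_2 = -np.diag(log_softmax_rows(S.T))   # view 2 as anchor
    return 0.5 * (loss_1 + loss_2).mean()

# Hypothetical embeddings for 5 nodes.
rng = np.random.default_rng(0)
H1 = rng.normal(size=(5, 3))
H2 = rng.normal(size=(5, 3))
loss_uc = unsupervised_contrastive_loss(H1, H2)
```

Each node's positive pair is its own embedding in the other view (the diagonal of `S`); every other node in the other view acts as a negative.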

The overall unsupervised contrastive loss averages the two terms above over all $n$ nodes:

$\mathcal{L}_{u c}=\frac{1}{2 n} \sum_{i=1}^{n}\left(\mathcal{L}_{u c}^{\phi_{1}}\left(\mathbf{x}_{i}\right)+\mathcal{L}_{u c}^{\phi_{2}}\left(\mathbf{x}_{i}\right)\right)$

Because the model is semi-supervised, a small subset of $l$ nodes also carries labels. For these labeled nodes the contrastive loss is defined as:

$\begin{array}{c} \mathcal{L}_{s c}^{\phi_{1}}\left(\mathbf{x}_{i}\right)=-\log \frac{\sum_{k=1}^{l} \mathbb{1}_{\left[y_{i}=y_{k}\right]} \exp \left(\left\langle\mathbf{h}_{i}^{\phi_{1}}, \mathbf{h}_{k}^{\phi_{2}}\right\rangle\right)}{\sum_{j=1}^{l} \exp \left(\left\langle\mathbf{h}_{i}^{\phi_{1}}, \mathbf{h}_{j}^{\phi_{2}}\right\rangle\right)} \\ \mathcal{L}_{s c}^{\phi_{2}}\left(\mathbf{x}_{i}\right)=-\log \frac{\sum_{k=1}^{l} \mathbb{1}_{\left[y_{i}=y_{k}\right]} \exp \left(\left\langle\mathbf{h}_{i}^{\phi_{2}}, \mathbf{h}_{k}^{\phi_{1}}\right\rangle\right)}{\sum_{j=1}^{l} \exp \left(\left\langle\mathbf{h}_{i}^{\phi_{2}}, \mathbf{h}_{j}^{\phi_{1}}\right\rangle\right)} \end{array}$

The overall supervised contrastive loss averages these over the $l$ labeled nodes:

$\mathcal{L}_{s c}=\frac{1}{2 l} \sum_{i=1}^{l}\left(\mathcal{L}_{s c}^{\phi_{1}}\left(\mathbf{x}_{i}\right)+\mathcal{L}_{s c}^{\phi_{2}}\left(\mathbf{x}_{i}\right)\right)$

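Note that positives and negatives are now determined by the labels: negatives are no longer taken blindly to be all other nodes. The semi-supervised contrastive loss is then the sum of the unsupervised and supervised losses:

$\mathcal{L}_{s s c}=\mathcal{L}_{u c}+\mathcal{L}_{s c}$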
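
The label-aware term can be sketched as follows. This is a hedged illustration, not the authors' code: `H1` and `H2` are assumed to hold the embeddings of the $l$ labeled nodes only, `y` their integer class labels, and the helper names are invented for this note. The indicator $\mathbb{1}_{[y_i=y_k]}$ becomes a boolean mask over the similarity matrix.

```python
import numpy as np

def masked_logsumexp_rows(S, mask):
    """Row-wise log(sum_j exp(S[i, j])) over the entries where mask is True."""
    masked = np.where(mask, S, -np.inf)
    m = masked.max(axis=1, keepdims=True)   # finite: each row has a True entry
    return (m + np.log(np.exp(masked - m).sum(axis=1, keepdims=True)))[:, 0]

def supervised_contrastive_loss(H1, H2, y):
    """H1, H2: (l, d) embeddings of the labeled nodes; y: (l,) class labels."""
    S = H1 @ H2.T                           # S[i, j] = <h_i^{phi1}, h_j^{phi2}>
    same = y[:, None] == y[None, :]         # indicator 1[y_i = y_k], symmetric
    everything = np.ones_like(same)
    # -log(numerator / denominator) = logsumexp(all) - logsumexp(same class)
    loss_1 = masked_logsumexp_rows(S, everything) - masked_logsumexp_rows(S, same)
    loss_2 = masked_logsumexp_rows(S.T, everything) - masked_logsumexp_rows(S.T, same)
    return 0.5 * (loss_1 + loss_2).mean()

# Hypothetical labeled subset: 6 nodes, 2 classes.
rng = np.random.default_rng(0)
H1 = rng.normal(size=(6, 3))
H2 = rng.normal(size=(6, 3))
y = np.array([0, 0, 1, 1, 0, 1])
loss_sc = supervised_contrastive_loss(H1, H2, y)
```

The semi-supervised loss $\mathcal{L}_{ssc}$ would then add this value to the unsupervised InfoNCE term computed over all nodes.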

2.3 Generative loss

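In this part the authors essentially fold the traditional reconstruction-based approach to unsupervised graph learning into their framework.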

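Concretely, the node representations obtained from the two views are used to reconstruct the graph structure, which is then compared with the original graph to obtain a reconstruction loss.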

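To reconstruct the original graph structure from the node representations as faithfully as possible, the following conditional probability is defined:

$p\left(\mathcal{G} \mid \mathbf{H}^{\phi_{1}}, \mathbf{H}^{\phi_{2}}\right)=\prod_{i, j} p\left(e_{i j} \mid \mathbf{H}^{\phi_{1}}, \mathbf{H}^{\phi_{2}}\right)$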

The probability that an edge $e_{ij}$ exists between node $i$ and node $j$ is assumed to depend only on $\mathbf{h}_{i}^{\phi_{1}}$ and $\mathbf{h}_{j}^{\phi_{2}}$, so $p\left(e_{i j} \mid \mathbf{H}^{\phi_{1}}, \mathbf{H}^{\phi_{2}}\right)=p\left(e_{i j} \mid \mathbf{h}_{i}^{\phi_{1}}, \mathbf{h}_{j}^{\phi_{2}}\right)$.

In practice, this conditional probability is parameterized as a logistic model:

$p\left(\mathcal{G} \mid \mathbf{H}^{\phi_{1}}, \mathbf{H}^{\phi_{2}}\right)=\prod_{i, j} p\left(e_{i j} \mid \mathbf{h}_{i}^{\phi_{1}}, \mathbf{h}_{j}^{\phi_{2}}\right)=\prod_{i, j} \delta\left(\left[\mathbf{h}_{i}^{\phi_{1}}, \mathbf{h}_{j}^{\phi_{2}}\right] \mathbf{w}\right)$

where $\delta(\cdot)$ is the logistic function and $\mathbf{w}$ is a learnable parameter vector. The generative loss can then be defined as $\mathcal{L}_{g^{2}}=-p\left(\mathcal{G} \mid \mathbf{H}^{\phi_{1}}, \mathbf{H}^{\phi_{2}}\right)$.
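
A small NumPy sketch of the logistic edge model may help. Several things here are assumptions and not taken from the paper: the function names, restricting the product to an explicit list of observed edges, and working with the log of the likelihood for numerical stability. The loss as written above is $-p(\mathcal{G} \mid \cdot)$, which would correspond to `-np.exp(log_p)`; minimizing `-log_p` is a common surrogate with the same minimizer.

```python
import numpy as np

def edge_logits(H1, H2, w):
    """logits[i, j] = [h_i^{phi1}, h_j^{phi2}] w, for w of length d1 + d2."""
    d = H1.shape[1]
    # The concatenated dot product splits into a row term plus a column term.
    return (H1 @ w[:d])[:, None] + (H2 @ w[d:])[None, :]

def log_sigmoid(z):
    """log(1 / (1 + exp(-z))), computed stably."""
    return -np.logaddexp(0.0, -z)

def graph_log_likelihood(H1, H2, w, edges):
    """log of prod over (i, j) in edges of delta([h_i, h_j] w).

    edges: (m, 2) integer array of observed edges (an assumption of this sketch).
    """
    logits = edge_logits(H1, H2, w)
    return log_sigmoid(logits[edges[:, 0], edges[:, 1]]).sum()

# Hypothetical embeddings and edge list for a 4-node graph.
rng = np.random.default_rng(0)
H1 = rng.normal(size=(4, 3))
H2 = rng.normal(size=(4, 3))
w = rng.normal(size=6)
edges = np.array([[0, 1], [1, 2], [2, 3]])
log_p = graph_log_likelihood(H1, H2, w, edges)
```

Because the logit is linear in the concatenated vector, the full $n \times n$ logit matrix is obtained from two matrix-vector products without building any concatenation explicitly.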