Graph Contrastive Learning Automated

2021-08-28


https://github.com/Shen-Lab/GraphCL_Automated

Graph Contrastive Learning Automated, ICML 2021

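Summary: This paper can be read as a follow-up that improves on "Graph Contrastive Learning with Augmentations" (GraphCL). The authors propose an automated graph contrastive learning strategy, JOAO, and a variant built on it, JOAOv2. Because graph datasets are heterogeneous, existing graph contrastive learning methods have to pick augmentations by hand for each dataset through trial and error, which limits the generality of GraphCL. Inspired by adversarial learning, the authors propose a bi-level optimization scheme that folds the sampling distribution over augmentations into the training objective and optimizes it by gradient descent, so that the augmentations best suited to the current dataset are selected automatically. Extensive experiments show that JOAO is effective and that the automatically selected augmentations largely agree with those found by manual trial and error.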

1. Introduction

1.1 Abstract

Self-supervised learning on graph-structured data has drawn recent interest for learning generalizable, transferable and robust representations from unlabeled graphs. Among many, graph contrastive learning (GraphCL) has emerged with promising representation learning performance. Unfortunately, unlike its counterpart on image data, the effectiveness of GraphCL hinges on adhoc data augmentations, which have to be manually picked per dataset, by either rules of thumb or trial-and-errors, owing to the diverse nature of graph data. That significantly limits the more general applicability of GraphCL. Aiming to fill in this crucial gap, this paper proposes a unified bi-level optimization framework to automatically, adaptively and dynamically select data augmentations when performing GraphCL on specific graph data. The general framework, dubbed JOint Augmentation Optimization (JOAO), is instantiated as min-max optimization. The selections of augmentations made by JOAO are shown to be in general aligned with previous “best practices” observed from handcrafted tuning: yet now being automated, more flexible and versatile. Moreover, we propose a new augmentation-aware projection head mechanism, which will route output features through different projection heads corresponding to different augmentations chosen at each training step. Extensive experiments demonstrate that JOAO performs on par with or sometimes better than the state-of-the-art competitors including GraphCL, on multiple graph datasets of various scales and types, yet without resorting to any laborious dataset-specific tuning on augmentation selection. We release the code at https://github.com/Shen-Lab/GraphCL_Automated.

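In plain terms: self-supervised learning on graphs aims to learn general, transferable and robust representations from unlabeled data, and graph contrastive learning does this well. Unlike contrastive learning on images, however, its effectiveness depends on the specific augmentations used, and because graph datasets are so diverse, those augmentations have to be picked by hand for each dataset through trial and error. This greatly limits the generality of GraphCL. The paper proposes a unified bi-level optimization framework, Joint Augmentation Optimization (JOAO), instantiated as a min-max problem, that selects augmentations automatically, adaptively and dynamically for a given dataset. The augmentations JOAO selects largely match the best hand-picked ones. The paper also proposes an augmentation-aware projection head mechanism that routes output features through the projection head corresponding to the augmentation chosen at each training step. Without any manual augmentation selection, JOAO performs on par with or better than SOTA methods across datasets of various types.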

1.2 This Paper's Work

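Background: In recent years, self-supervised models for graph-structured data, graph contrastive learning in particular, have been widely studied. Unlike image data, though, graph datasets are highly diverse: social networks, citation networks, biological networks and so on. Existing graph self-supervised models do not handle the challenges this diversity brings. For example, GraphCL, the current SOTA graph contrastive learning model, requires suitable augmentations to be picked by hand to generate the contrastive views.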

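Motivation: Choosing augmentations for GraphCL by trial and error greatly limits its generality and is sometimes infeasible in practice. Starting from this observation, the authors study how to let the model select suitable augmentations automatically, adaptively and dynamically.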

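Contributions: In short, the authors propose JOAO, a plug-and-play joint augmentation optimization method. It is a bi-level optimization framework, used here for the first time to select data augmentations automatically, adaptively and dynamically, and it performs on par with or sometimes better than state-of-the-art competitors, including GraphCL with hand-tuned augmentations.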

2. Method

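We first look at a commonly used graph contrastive learning framework. On top of it, the authors add a pluggable joint augmentation optimization module that lets the model choose the most suitable augmentations automatically and adaptively.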

2.1 Generic Graph Contrastive Learning Framework

The figure above shows the computation steps of the generic GraphCL framework:

$G\sim\mathbb P_G$ denotes a graph $G$ drawn from the distribution $\mathbb P_G$, and $A_1$ and $A_2$ denote two random augmentations. The set of augmentations is $\mathcal A=\{NodeDrop,Subgraph,EdgePert,AttrMask,Identical\}$, and each $A\in\mathcal A$ is a map $\mathcal G\rightarrow\mathcal G$.
After the augmentations produce two different views, a GNN computes their embeddings, and a projection function then maps the features into the contrastive space, where the contrastive loss is computed:
$\begin{aligned} & \min_{\theta} \mathcal{L}\left(\mathrm{G}, \mathrm{A}_{1}, \mathrm{A}_{2}, \theta\right) \\ ={} & \min_{\theta}\Big\{ -\mathbb{E}_{\mathbb{P}_{\mathrm{G}} \times \mathbb{P}_{\left(\mathrm{A}_{1}, \mathrm{A}_{2}\right)}} \operatorname{sim}(\overbrace{\mathrm{T}_{\theta, 1}(\mathrm{G}), \mathrm{T}_{\theta, 2}(\mathrm{G})}^{\text{Positive pairs}}) \\ & + \mathbb{E}_{\mathbb{P}_{\mathrm{G}} \times \mathbb{P}_{\mathrm{A}_{1}}} \log \Big( \mathbb{E}_{\mathbb{P}_{\mathrm{G}^{\prime}} \times \mathbb{P}_{\mathrm{A}_{2}}} \exp \big( \operatorname{sim}(\underbrace{\mathrm{T}_{\theta, 1}(\mathrm{G}), \mathrm{T}_{\theta, 2}(\mathrm{G}^{\prime})}_{\text{Negative pairs}}) \big) \Big) \Big\} \end{aligned}\tag 1$
where $sim$ is the cosine similarity function and $\mathrm{T}_{\theta,i}$ ($i=1,2$) composes the augmentation $\mathrm{A}_i$ with the GNN encoder and the projection function. In the existing GraphCL framework, the augmentations still have to be chosen by hand.
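To make Eq. (1) concrete, here is a minimal NumPy sketch of a batch estimate of the loss. It is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, no temperature is used, and the inner expectation over $\mathrm{G}^{\prime}$ is taken over the whole batch (including $\mathrm{G}$ itself).

```python
import numpy as np

def cosine_similarity_matrix(z1, z2):
    """Pairwise cosine similarity between the rows of z1 and the rows of z2."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    return z1 @ z2.T

def contrastive_loss(z1, z2):
    """Batch estimate of Eq. (1) (hypothetical helper, not from the paper's code).

    z1[k] and z2[k] are the projected embeddings of the two augmented
    views of graph k, i.e. T_{theta,1}(G_k) and T_{theta,2}(G_k).
    """
    sim = cosine_similarity_matrix(z1, z2)  # sim[k, l] = sim(T1(G_k), T2(G_l))
    # Positive pairs: the two views of the same graph (diagonal entries).
    positive = np.diag(sim).mean()
    # Negative term: log of the mean over G' of exp(sim), computed stably.
    # Assumption: G' ranges over every graph in the batch.
    row_max = sim.max(axis=1, keepdims=True)
    log_mean_exp = row_max[:, 0] + np.log(np.exp(sim - row_max).mean(axis=1))
    return float(-positive + log_mean_exp.mean())
```

With three mutually orthogonal embeddings used for both views, the positive term is 1 and each row of the similarity matrix contains one 1 and two 0s, so the loss is $-1+\log\frac{e+2}{3}$.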

2.2 The JOAO Framework

2.2.1 Optimization Framework

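The loss defined by Eq. (1) in Section 2.1 has two drawbacks: first, the augmentation distribution $\mathbb P_{A_1,A_2}$ must be predefined from prior knowledge; second, only a Dirac distribution is used (that is, each dataset uses a single pair of augmentations).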

The authors propose to learn $\mathbb P_{A_1,A_2}$ dynamically and automatically through the following bi-level optimization framework:

$\begin{array}{l} \min _{\theta} \quad \mathcal{L}\left(\mathrm{G}, \mathrm{A}_{1}, \mathrm{A}_{2}, \theta\right)\\ \text { s.t. } \quad \mathbb{P}_{\left(\mathrm{A}_{1}, \mathrm{A}_{2}\right)} \in \arg \min _{\mathbb{P}_{\left(\mathrm{A}_{1}^{\prime}, \mathrm{A}_{2}^{\prime}\right)}} \mathcal{D}\left(\mathrm{G}, \mathrm{A}_{1}^{\prime}, \mathrm{A}_{2}^{\prime}, \theta\right) \end{array}\tag 2$

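The authors call this formulation joint augmentation optimization (JOAO): the augmentation distribution becomes part of the optimization problem. The upper-level objective $\mathcal L$ in Eq. (2) is the same as the generic GraphCL objective, while the lower-level objective $\mathcal D$ jointly optimizes the sampling distribution $\mathbb P_{A_1,A_2}$ used for augmentation selection.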
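Selecting augmentations from the sampling distribution $\mathbb P_{A_1,A_2}$ can be sketched as drawing one pair from a $|\mathcal A|\times|\mathcal A|$ probability table. The snippet below is a hedged illustration: the function name, the table layout (`p[i, j]` as the probability of the pair $(A^i, A^j)$) and the use of NumPy are assumptions, not taken from the paper's code.

```python
import numpy as np

# Augmentation names as listed in Section 2.1.
AUGMENTATIONS = ["NodeDrop", "Subgraph", "EdgePert", "AttrMask", "Identical"]

def sample_augmentation_pair(p, rng):
    """Draw (A1, A2) from a joint table p, where p[i, j] = Prob(A1 = A^i, A2 = A^j).

    Hypothetical helper for illustration; p must be non-negative and sum to 1.
    """
    p = np.asarray(p, dtype=float)
    k = p.shape[0]
    # Sample a flat index over all k*k pairs, then unravel it into (i, j).
    flat_index = rng.choice(k * k, p=p.ravel())
    i, j = divmod(int(flat_index), k)
    return AUGMENTATIONS[i], AUGMENTATIONS[j]
```

A Dirac table with all mass on one entry reproduces the hand-picked single pair of the original GraphCL setup, while a uniform table samples every pair equally often.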

2.2.2 Instantiating the Framework

Inspired by adversarial training, the authors instantiate the framework above as a min-max optimization:

$\begin{array}{l} \min _{\theta} \quad \mathcal{L}\left(\mathrm{G}, \mathrm{A}_{1}, \mathrm{A}_{2}, \theta\right) \\ \text { s.t. } \quad \mathbb{P}_{\left(\mathrm{A}_{1}, \mathrm{A}_{2}\right)} \in \arg \max _{\mathbb{P}_{\left(\mathrm{A}_{1}^{\prime}, \mathrm{A}_{2}^{\prime}\right)}}\left\{\mathcal{L}\left(\mathrm{G}, \mathrm{A}_{1}^{\prime}, \mathrm{A}_{2}^{\prime}, \theta\right)\right. \\ \left.\quad-\frac{\gamma}{2} \operatorname{dist}\left(\mathbb{P}_{\left(\mathrm{A}_{1}^{\prime}, \mathrm{A}_{2}^{\prime}\right)}, \mathbb{P}_{\text {prior }}\right)\right\} \end{array}\tag 3$

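where $\gamma \in \mathcal{R}_{\geq 0}$, $\mathbb P_{prior}$ is the prior distribution over augmentations, and $dist:\mathcal{P} \times \mathcal{P} \rightarrow \mathcal{R}_{\geq 0}$ is a distance function between two distributions. The authors set $\mathbb P_{prior}$ to the uniform distribution and take $dist(\cdot,\cdot)$ to be the squared Euclidean distance, which gives $\operatorname{dist}\left(\mathbb{P}_{\left(\mathrm{A}_{1}, \mathrm{A}_{2}\right)}, \mathbb{P}_{\text {prior }}\right)=\sum_{i=1}^{|\mathcal{A}|} \sum_{j=1}^{|\mathcal{A}|}\left(p_{i j}-\frac{1}{|\mathcal{A}|^{2}}\right)^{2}$, where $p_{ij}=\operatorname{Prob}\left(A_1=A^i, A_2=A^j\right)$.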
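The regularizer in Eq. (3) is easy to check numerically. The sketch below computes the squared Euclidean distance to the uniform prior and the regularized lower-level objective for a given loss value. The function names are hypothetical, and the contrastive loss is passed in as a plain number rather than computed from a model.

```python
import numpy as np

def squared_distance_to_uniform(p):
    """dist(P, P_prior) = sum_ij (p_ij - 1/|A|^2)^2 with a uniform prior."""
    p = np.asarray(p, dtype=float)
    uniform = np.full_like(p, 1.0 / p.size)  # p.size = |A|^2 for a square table
    return float(np.sum((p - uniform) ** 2))

def lower_level_objective(loss_value, p, gamma):
    """Inner objective of Eq. (3): L - (gamma / 2) * dist(P, P_prior).

    loss_value stands in for L(G, A1', A2', theta) evaluated under p
    (an assumption made to keep this sketch self-contained).
    """
    return loss_value - 0.5 * gamma * squared_distance_to_uniform(p)
```

For $|\mathcal A|=5$, a uniform table has distance 0, while a Dirac table has distance $(1-\frac{1}{25})^2+24\cdot(\frac{1}{25})^2=0.96$, so a larger $\gamma$ penalizes collapsing onto a single augmentation pair more heavily.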

2.2.3 Optimizing the Objective

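(The lower-level optimization involves a fair amount of math that I did not fully follow. Interested readers can read the original paper and check the authors' derivation themselves.)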

The pseudocode for optimizing the model loss is shown in the figure below:

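Upper level: minimizing the contrastive loss

With the augmentation distribution from the previous step held fixed, update the GNN parameters:
$\theta^{(n)}=\theta^{(n-1)}-\alpha^{\prime} \nabla_{\theta} \mathcal{L}\left(\mathrm{G}, \mathrm{A}_{1}, \mathrm{A}_{2}, \theta\right)\tag 4$
where $\alpha^{\prime}$ is the learning rate.
Lower level: maximizing over the augmentation sampling distribution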

(I did not fully follow the transformations and derivations in this part; see the original paper if you are interested.)

Optimizing the lower-level sampling distribution directly is awkward, so the authors rewrite Eq. (1) as:

$\begin{aligned} \mathcal{L}\left(\mathrm{G}, \mathrm{A}_{1}, \mathrm{A}_{2}, \theta\right)=\sum_{i=1}^{|\mathcal{A}|} \sum_{j=1}^{|\mathcal{A}|} \overbrace{p_{i j}}^{\text{Targeted}} \Big\{ & -\mathbb{E}_{\mathbb{P}_{\mathrm{G}}} \operatorname{sim}\left(T_{\theta}^{i}(\mathrm{G}), T_{\theta}^{j}(\mathrm{G})\right) \\ & +\mathbb{E}_{\mathbb{P}_{\mathrm{G}}} \log \Big( \sum_{j^{\prime}=1}^{|\mathcal{A}|} \underbrace{p_{j^{\prime}}}_{\text{Undesired}} \mathbb{E}_{\mathbb{P}_{\mathrm{G}^{\prime}}} \exp \big( \operatorname{sim}\left(T_{\theta}^{i}(\mathrm{G}), T_{\theta}^{j^{\prime}}\left(\mathrm{G}^{\prime}\right)\right) \big) \Big) \Big\} \end{aligned}\tag 5$

where $T_{\theta}^{i}=A^{i} \circ f_{\theta^{\prime}} \circ g_{\theta^{\prime \prime}},(i=1, \ldots, 5)$ denotes the feature extractor and $p_{j^{\prime}}=p_{j}=\operatorname{Prob}\left(A_{2}=A^j\right)$ is the marginal probability of the second augmentation.
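To see how Eq. (5) depends on the probabilities $p_{ij}$, the following NumPy sketch evaluates it from precomputed similarity statistics. It is an assumption-laden illustration: the function name and array layouts are hypothetical, the similarities are supplied as inputs instead of coming from a GNN, and the marginal $p_{j^{\prime}}$ is obtained by summing the joint table over its first index.

```python
import numpy as np

def joao_loss_eq5(p, pos_sim, neg_exp):
    """Evaluate Eq. (5) for a batch of B graphs and K augmentations.

    p:       (K, K) joint table, p[i, j] = Prob(A1 = A^i, A2 = A^j)
    pos_sim: (B, K, K), pos_sim[b, i, j] = sim(T^i(G_b), T^j(G_b))
    neg_exp: (B, K, K), neg_exp[b, i, j'] = mean over G' of
             exp(sim(T^i(G_b), T^j'(G')))
    All three layouts are assumptions made for this sketch.
    """
    p = np.asarray(p, dtype=float)
    p_marginal = p.sum(axis=0)               # p_j' = Prob(A2 = A^j')
    positive = pos_sim.mean(axis=0)          # (K, K): E_G sim(T^i(G), T^j(G))
    # (B, K, K) @ (K,) -> (B, K): sum over j' of p_j' * E_G' exp(sim)
    negative = np.log(neg_exp @ p_marginal).mean(axis=0)  # (K,), indexed by i
    per_pair = -positive + negative[:, None]  # the bracketed term for each (i, j)
    return float(np.sum(p * per_pair))
```

The "Targeted" $p_{ij}$ appears as the outer weight, while the "Undesired" $p_{j^{\prime}}$ sits inside the logarithm, which is what makes direct optimization over the distribution awkward.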