GRAPH CONTRASTIVE LEARNING WITH PERSONALIZED AUGMENTATION

2022-10-05

字数统计: 2.8k | 阅读时长≈ 12 分钟

GRAPH CONTRASTIVE LEARNING WITH PERSONALIZED AUGMENTATION，2022，arxiv preprint

总结：本文提出了一种自适应增强图对比学习方法PGA。和现有自动GCL方法相比，它包含一个增强策略选择器，允许为每个图根据其自身特性选择最优的增强策略组合（印象中其他自动GCL也能实现这个效果）。从实验结果看，PGA性能还是不错的，不过个人认为本文提出的增强策略选择器的可解释较差，作者没有给出相关理论证明，甚至没有太多的关于为什么这样设计的文字解释，借鉴意义不大。

1. 简介

1.1 摘要

Graph contrastive learning (GCL) has emerged as an effective tool for learning unsupervised representations of graphs. The key idea is to maximize the agreement between two augmented views of each graph via data augmentation. Existing GCL models mainly focus on applying identical augmentation strategies for all graphs within a given scenario. However, real-world graphs are offten not monomorphic but abstractions of diverse natures. Even within the same scenario (e.g., macromolecules and online communities), different graphs might need diverse augmentations to perform effective GCL. Thus, blindly augmenting all graphs without considering their individual characteristics may undermine the performance of GCL arts. To deal with this, we propose the first principled framework, termed as Graph contrastive learning with Personalized Augmentation (GPA), to advance conventional GCL by allowing each graph to choose its own suitable augmentation operations. In essence, GPA infers tailored augmentation strategies for each graph based on its topology and node attributes via a learnable augmentation selector, which is a plug-and-play module and can be effectively trained with downstream GCL models end-to-end. Extensive experiments across 11 benchmark graphs from different types and domains demonstrate the superiority of GPA against state-of-the-art competitors. Moreover, by visualizing the learned augmentation distributions across different types of datasets, we show that GPA can effectively identify the most suitable augmentations for each graph based on its characteristics.

图对比学习已经成为一种有效的无监督图表示学习工具。对比学习的核心思想是最大化两个增强视角之间的一致性。对于增强策略的选择，给定某个场景，现有的GCL模型大多使用相同的增强策略。然而现实世界中的图经常不是单一的，而是具有不同性质的抽象。即使在相同场景下，为了设计有效的GCL模型，我们可能也需要不同的增强策略。因此，不单独考虑每个图的特性，而盲目地对所有的图使用相同的增强策略会降低GCL模型的性能。为了解决这个问题，我们提出了一个新的框架，称之为GPA，允许每个图选择最合适自己的增强策略。GPA包含一个可学习的增强策略选择器，基于图的拓扑结构和属性信息来定制增强策略。这个选择器是一个即插即用模块，可以通过下游GCL模型进行端到端的训练。11个标准数据及上的大量实验表明了，和SOTA方法相比，GPA性能更优。另外，通过可视化不同数据集上学习到的增强策略分布，我们发现GPA可以根据图的特性，为每个图选择最适合的增强策略。

1.2 本文工作

背景： 对比学习GCL最近引起了很大注意，其核心思想在于通过扰动生成两个增强视角，然后最大化两个视角之间的互信息来学习表示。众所周知，增强策略的选择会严重影响GCL的性能，比如edge perturbation在社交网络中可能很好用，但是在生物分子网络中性能较差。因此，手动选取适合特定场景的增强策略往往需要大量实验去试错。当然一些研究人员提出了自动化GCL，即让模型自动选取最适合的增强策略。

动机： 现实世界中，同一场景下每个graph可能都有自己独立的特征。因此，为某个场景选取最适合的增强策略往往还不够，我们需要为该场景下每个图根据其自身的characteristic，量身定制一种增强策略。作者考虑：“ What are the impacts of different augmentation strategies on a given graph (an instance in a graph dataset)? Can we build a stronger automated GCL by allowing each graph instance to choose its favorable augmentation types? ”

本文工作： 作者提出了一中有效的对比学习框架GPA，其包含一个personalized增强策略选择器，允许我们为每个图单独选择增强策略。

2. 方法

GPA框架如下图所示，可以看到，和常规GCL方法相比有两处不同：

个性化增强策略选择器：针对图自身特性，选择最合适的增强策略
选择器优化：不仅要优化GNN编码器，还要对选择器进行优化

下面我们看看GPA中这两块是如何设计的。

## 2.1 增强策略选择器

这里作者吹的比较厉害，说什么搜索空间大，不好设计啥啥啥的，但是最后给出的解决方案就是很朴素的MLP。

假设增强策略集合为 $\mathcal{A}=\left\{A_i: 1 \leq i \leq K\right\}$ ，对于query graph $\mathcal G_n$ ，选择器的目标是从候选集中选择两个最informative的增强策略。

作者的方案是这样的：

对于候选集中任意一个augmentation pair $A_{i, j}^n=\left(A_i^n, A_j^n\right)$ ，按照如下公式计算其importance score $\hat\alpha_{i,j}^n$ ：
$\begin{aligned} \hat{\alpha}_{i, j}^n &=\frac{\exp \left(\alpha_{i, j}^n\right)}{\sum_{i^{\prime} j^{\prime}} \exp \left(\alpha_{i^{\prime}, j^{\prime}}^n\right)} \\ \alpha_{i, j}^n &=g_\theta\left(f_w\left(A_i^n\left(\mathcal{G}_n\right) \| A_j^n\left(\mathcal{G}_n\right)\right)\right), \end{aligned}$
其中 $g_\theta$ 表示得分函数， $f_w(\cdot)$ 是GNN编码器， $||$ 是concatenation操作， $A_i^n(\mathcal G_n)$ 和 $A_j^n(\mathcal G_n)$ 表示增强视角。
通过上述计算方法，我们可以得到所有pair的得分：
$\alpha^n=\left[\hat{\alpha}_{1,1}^n, \cdots, \hat{\alpha}_{i, j}^n, \cdots, \hat{\alpha}_{K, K}^n\right]$

concatenation操作是啥？为什么这样算出来的得分最高的pair就是最优pair？？

作者并没有相关分析，也没有相关理论证明，个人感觉有点扯。

2.2 模型优化

本文并没有对GCL的对比损失进行改动，GPA的损失函数定义如下：

\mathcal{L}\left(\mathcal{G}_n\right)=-\log \frac{\exp \left(\operatorname{sim}\left(f_w\left(A_i^n\left(\mathcal{G}_n\right)\right), f_w\left(A_j^n\left(\mathcal{G}_n\right)\right)\right) / \tau\right)}{\sum_{n^{\prime}=1, n^{\prime} \neq n}^N \exp \left(\operatorname{sim}\left(f_w\left(A_i^n\left(\mathcal{G}_n\right)\right), f_w\left(A_j^{n^{\prime}}\left(\mathcal{G}_{n^{\prime}}\right)\right)\right) / \tau\right)}

和常规GCL框架的损失函数形式完全一样，就是正负样本对之间的InfoNCE损失。

GPA和常规GCL模型不同点是模型优化上。常规GCL框架只需要优化编码器即可，GPA多了一个策略选择器，也就是说要同时优化选择器和编码器。

一种简单的策略就是各自优化各自的，两者分开优化。但是这种方法显然不是最优的，因为这两个模块是息息相关的，一个好的GCL方法离不开好的增强策略，同时选择器也需要GCL提供信号，帮助选择最优组合。

作者采用bi-level方式对两个模块进行优化（很多方法里面都有用到，不是作者独创的）：

\begin{aligned} &\arg \min _\theta \mathcal{L}_{\text {valid }}\left(w^*(\theta), \theta\right) \\ &\text { s.t. } w^*(\theta)=\underset{w}{\arg \min } \mathcal{L}_{\text {train }}(w, \theta), \end{aligned}

其中 $\theta$ 是选择器的参数， $\mathcal w$ 是GCL方法的参数。目标函数分两层：

upper-level：给定最优 $w^*$ 的情况下，在验证集上寻找最优的 $\theta$ 最小化$\mathcal{L}_{\text {valid }}\left(w^(\theta), \theta\right) $。这里$ \theta $的优化比较困难，因为每当$ \theta $更新时，我们都要重新计算最优的$ w^(\theta) $，再去计算梯度，计算量非常大。这里用一次梯度$ w’(\theta) $代替$ w^*(\theta)$，即：
$\begin{aligned} \nabla_\theta \mathcal{L}_{\text {valid }}\left(w^*(\theta), \theta\right) & \approx \nabla_\theta \mathcal{L}_{\text {valid }}\left(w^{\prime}, \theta\right) \\ & \approx \nabla_\theta \mathcal{L}_{\text {valid }}\left(w-\xi \nabla_w \mathcal{L}_{\text {train }}(w, \theta), \theta\right) \end{aligned}$
根据链式规则，上式可转化为：
$\begin{aligned} \nabla_\theta \mathcal{L}_{\text {valid }}\left(w^*(\theta) \theta\right) & \approx \nabla_\theta \mathcal{L}_{\text {valid }}\left(w^{\prime}, \theta\right) \\ &-\xi \nabla_{\theta, w}^2 \mathcal{L}_{\text {train }}(w, \theta) \nabla_{w^{\prime}} \mathcal{L}_{\text {valid }}\left(w^{\prime}, \theta\right) \end{aligned}$
上式第二项二次梯度，计算量还是比较大，可以再做如下近似：
$\begin{aligned} &\xi \nabla_{\theta, w}^2 \mathcal{L}_{\text {train }}(w, \theta) \nabla_{w^{\prime}} \mathcal{L}_{\text {valid }}\left(w^{\prime}, \theta\right)\\ &\approx \frac{\nabla_\theta \mathcal{L}_{\text {train }}\left(w^{+}, \theta\right)-\nabla_\theta \mathcal{L}_{\text {train }}\left(w^{-}, \theta\right)}{2 \epsilon}, \end{aligned}$
这样最终的梯度公式为：
$\begin{aligned} \nabla_\theta \mathcal{L}_{\text {valid }}\left(w^*(\theta), \theta\right) & \approx \nabla_\theta \mathcal{L}_{\text {valid }}\left(w^{\prime}, \theta\right) \\ &-\xi \frac{\nabla_\theta \mathcal{L}_{\text {train }}\left(w^{+}, \theta\right)-\nabla_\theta \mathcal{L}_{\text {train }}\left(w^{-}, \theta\right)}{2 \epsilon} . \end{aligned}$
lower-level：训练集上寻找最优的 $w$ ，最小化对比损失（固定 $\theta$ ）。因为$\theta $是固定的，因此梯度下降也很简单：
$w^{\prime}=w-\xi \nabla_w \mathcal{L}_{\text {train }}(w, \theta)$