Few-Shot Knowledge Graph Completion

2021-06-04

字数统计: 2.8k | 阅读时长≈ 13 分钟

http://www.meng-jiang.com/pubs/fsrl-aaai20/fsrl-aaai20-paper.pdf

Few-Shot Knowledge Graph Completion ，AAAI，2020

1. 简介

1.1 摘要

Knowledge graphs (KGs) serve as useful resources for various natural language processing applications. Previous KGcompletion approaches require a large number of training instances (i.e., head-tail entity pairs) for every relation. The realcase is that for most of the relations, very few entity pairs areavailable. Existing work of one-shot learning limits method generalizability for few-shot scenarios and does not fully use the supervisory information; however, few-shot KG completion has not been well studied yet. In this work, we propose a novel few-shot relation learning model (FSRL) that aims at discovering facts of new relations with few-shot references. FSRL can effectively capture knowledge from heterogeneous graph structure, aggregate representations of few-shot references, and match similar entity pairs of reference set for every relation. Extensive experiments on two public datasets demonstrate that FSRL outperforms the state-of-the-art.

知识图谱（KG）在很多自然语言处理应用中是一种非常有用的形式。现有的KG补全方法需要每种关系都要大量的训练实例（head-tail实体对），但是在实际应用中只能获取到少量的实体对。现有的one-shot学习方法通用性差，并且没有充分利用监督信息，而few-shot KG补全还缺乏相关研究。在这篇工作中，作者提出了一种原生的小样本关系学习模型FSRL，旨在利用少量references来挖掘新的relations。FSRL能够从异构图结构中有效地捕捉知识，聚合few-shot references的表示，并为每个关系都匹配一个相似的reference set实体对。两个公共数据集上的大量实验表明FSRL的性能要优于state-of-the-art。

1.2 背景

现有的KG补全方法都需要大量实体对，但是在真实数据集中关系出现的频率往往呈现长尾分布，即很多关系只有少量实体对和其相关联。2018年Xiong等人提出了GMatching方法，通过一个局部邻居编码器来学习实体嵌入，在one-shot relation 推理上取得了不错的效果，但是存在一些弊端：

GMatching假设所有的局部邻居对于entity embedding的贡献度是一样的，然而这和事实是相违背的，导致模型不能充分学习图的结构知识，影响了模型性能
GMatching设定的场景是one-shot，将其拓展到few-shot场景下会影响模型性能

因此设计一个在few-shot场景下可以有效complete relation的模型十分重要，作者据此提出了Few-Shot Relation Learning模型（FSRL）。

1.3 问题定义

知识图谱G表示为： $\{(h,r,t)\}\subseteq \mathcal E\times\mathcal R\times\mathcal E$ ， $\mathcal E$ 和 $\mathcal R$ 表示表示实体集合和关系集合。

KG补全任务的目标：给定头实体 $h$ 和关系，预测可能的尾实体，即 $(h,r,?)$ ；或者给定尾实体和头实体，预测可能存在的关系，即 $(h,?,r)$ 。本文作者针对的任务场景为前者。（和其他KG补全方法不同，作者针对的是只有少量实体对的应用场景）

模型的学习目标是：让正确实体 $t_{true}$ 的排名高于错误实体 $t_{false}$ 的排名。

和标准小样本学习方法一样，构建一个training tasks集合。集合中每个元任务 $\mathcal T_{mtr}$ 都包含一种关系 $r\in\mathcal R$ ，及用于训练/测试的实体对 $D_r=\{P_r^{train},P_r^{test}\}$ ，其中 $P_r^{train}$ 只包含少量实体对 $(h_k,t_k)\in R_r$ ， $P_r^{test}=\{(h_i,t_i,C_{h_i,r}|(h_i,r,t_i)\in G\}$ 则包含所有真实的尾实体 $t_i$ 和所有候选实体 $t_i\in C_{h_i,r}$ 。模型关于关系r的排名损失表示为 $\mathcal L_\Theta(h_i,t_i|C_{h_i,r},P_r^{train})$ ，则模型学习目标为：

min_\Theta\mathbb E_{\mathcal T_{mtr}}\Big[\sum_{(h_i,t_i,C_{h_i,r})\in P_r^{test}}\frac{\mathcal L_\Theta(h_i,t_i|C_{h_i,r},P_r^{train})}{|P_r^{test}|}\Big]\tag 1

其中 $|P_r^{test}|$ 表示 $P_r^{test}$ 中的元组数量。

在元训练结束后，在新关系 $r^\prime\in\mathcal R^\prime$ 上对模型进行测试， $\mathcal R\cap\mathcal R^\prime=\empty$ 。元测试阶段和元训练时一样，会构建若干个元测试任务。

2. 模型

FSRL包含三部分：（1）encoding（2）aggregating（3）matching

## 2.1 Encoding

作者考虑邻居影响力的差异性，利用注意力机制，设计了一个relation-aware异构邻居编码器，节点嵌入h的计算方式如下：

\begin{array}{c} f_{\theta}(h)=\sigma\left(\sum_{i} \alpha_{i} e_{t_{i}}\right) \\ \alpha_{i}=\dfrac{\exp \left\{u_{r t}^{T}\left(\mathcal{W}_{r t}\left(e_{r_{i}} \oplus e_{t_{i}}\right)+b_{r t}\right)\right\}}{\sum_{j} \exp \left\{u_{r t}^{T}\left(\mathcal{W}_{r t}\left(e_{r_{j}} \oplus e_{t_{j}}\right)+b_{r t}\right)\right\}} \end{array}\tag 2

其中 $\sigma$ 表示激活单元， $e_{r_i},e_{t_i}\in\mathbb R^{d\times 1}$ 分别表示预训练好的 $t_i$ 和 $r_i$ 的嵌入。 $u_{rt}\in\mathbb R^{d\times 1}$ ， $\mathcal W_{rt}\in\mathbb R^{d\times 2d}$ 以及 $b_{rt}\in\mathbb R^{d\times 1}$ 为可学习参数。

2.2 Aggregating

经过Encoding后，我们可以得到所有实体对 $(h_k,t_k)\in R_r$ 的表示 $\mathcal E_{h_k,t_k}=[f_\theta(h_k) \oplus f_\theta(t_k)]$ 。如何只通过少量实体对学习到reference set $R_i$ 的表示是很困难的，因为需要对不同实体对之间的交互进行建模并累积它们的表征能力。作者参考NLP中的一些方法，通过聚合所有关系r下所有实体对的embedding来计算 $R_r$ 的embedding：

f_{\epsilon}\left(R_{r}\right)=\mathcal{A} \mathcal{G}_{\left(h_{k}, t_{k}\right) \in R_{r}}\left\{\mathcal{E}_{h_{k}, t_{k}}\right\}\tag 3

其中 $\mathcal{AG}$ 表示聚合函数，可以是一个池化操作，也可以是一个前馈神经网络。作者这里采用RNN的思想对实体对信息进行聚合，将 $\mathcal E_{h_k,t_k}\in R_r$ 喂给一个循环自编码器：

\mathcal{E}_{h_{1}, t_{1}} \rightarrow m_{1} \rightarrow \cdots \rightarrow m_{K} \rightarrow d_{K} \rightarrow \cdots \rightarrow d_{1}\tag 4

其中K表示reference set的大小（即few-shot size）。编码器和解码器hidden state计算方式如下:

\begin{array}{c} m_{k}=\mathrm{RNN}_{\text {encoder }}\left(\mathcal{E}_{h_{k}, t_{k}}, m_{k-1}\right) \\ d_{k-1}=\mathrm{RNN}_{\text {decoder }}\left(d_{k}\right) \end{array}\tag 5

自编码器的重构损失定义为：

\mathcal{L}_{r e}\left(R_{r}\right)=\sum_{k}\left\|d_{k}-\mathcal{E}_{h_{k}, t_{k}}\right\|_{2}^{2}\tag 6

为了聚合编码器每一层的hidden state，作者借鉴深度残差网络的思想， $f_\mathcal E(R_r)$ 计算方式如下：

\begin{array}{c} m_{k}^{\prime}=m_{k}+\mathcal{E}_{h_{k}, t_{k}} \\ \beta_{k}=\dfrac{\exp \left\{u_{R}^{T}\left(\mathcal{W}_{R} m_{k}^{\prime}+b_{R}\right)\right\}}{\sum_{k^{\prime}} \exp \left\{u_{R}^{T}\left(\mathcal{W}_{R} m_{k^{\prime}}^{\prime}+b_{R}\right)\right\}} \\ f_{\epsilon}\left(R_{r}\right)=\sum_{k} \beta_{k} m_{k}^{\prime} \end{array}\tag 7

其中 $u_R\in\mathbb R^{d\times 1},\mathcal W_R\in\mathbb R^{d\times 2d},b_R\in\mathbb R^{d\times 1}$ 为可学习参数。

2.3 Matching

经过encoder和aggregation后，我们可以得到两个向量：

$\mathcal E_{h_l,t_l}=[f_\theta(h_l)\oplus f_\theta(t_l)]$ ：表示实体对 $(h_l,t_l)$ 的embedding
$f_\mathcal E(R_r)$ ：关系的embedding

这里作者参考one-shot learning中匹配网络的方法，用RNN做一个multiple steps的匹配：

\begin{array}{c} g_{t}^{\prime}, c_{t}=\mathrm{RNN}_{\text {match }}\left(\mathcal{E}_{h_{l}, t_{l}},\left[g_{t-1} \oplus f_{\epsilon}\left(R_{r}\right)\right], c_{t-1}\right) \\ g_{t}=g_{t}^{\prime}+\mathcal{E}_{h_{l}, t_{l}} \end{array}\tag 8

其中 $RNN_{match}$ 是一个LSTM单元，输入为 $\mathcal E_{h_l,t_l}$ ，hidden state为 $g_{t-1}$ ，cell state为 $c_t$ 。最后一步（第T步）的隐藏状态 $g_T$ 作为实体对 $(h_l,t_l)$ 优化后的嵌入，然后计算 $g_t$ 和 $f_\mathcal E(R_r)$ 的内积作为最终的matching score。

2.4 学习目标

对于关系r，首先随机选取少量positive实体对 $\{(h_k,t_k)|(h_k,r,t_k)\in G\}$ ，将其视为支持集，其余的positive实体对 $\mathcal{PE}_r$ 用来测试。同时向测试样本里面添加一些negative实体对 $\mathcal{NE}_r$ 来污染测试样本。这样最终的排名损失定义如下：

\mathcal{L}_{\text {rank }}=\sum_{r} \sum_{\left(h_{l}, t_{l}\right) \in \mathcal{P} \mathcal{E}_{r}} \sum_{\left(h_{l}, t_{l}^{-}\right) \in \mathcal{N} \mathcal{E}_{r}}\left[\xi+s_{\left(h_{l}, t_{l}^{-}\right)}-s_{\left(h_{l}, t_{l}\right)}\right]_{+}\tag 9

其中 $[x]_+=max[0,x]$ （一种损失定义方式）， $\xi$ 表示safety margin距离， $s_{(h_l,t_l^-)}$ 和 $s_{(h_l,t_l)}$ 分别表示positive实体对和negative实体对的得分。再结合自编码器的重构损失 $\mathcal L_{re}$ ，最终模型整体的损失定义为：