‭Review Published Versions 1 Vol 1 (3) : 224-243 2019
Learning to Complete Knowledge Graphs with Deep Sequential Models
: 2018 - 12 - 19
: 2019 - 04 - 22
: 2019 - 05 - 05
: 2019 - 05 - 15
3263 138 0
Abstract & Keywords
Abstract: Knowledge graph (KG) completion aims at filling the missing facts in a KG, where a fact is typically represented as a triple in the form of \(\left(head,relation,tail\right)\). Traditional KG completion methods compel two-thirds of a triple provided (e.g., head and relation) to predict the remaining one. In this paper, we propose a new method that extends multi-layer recurrent neural networks (RNNs) to model triples in a KG as sequences. It obtains state-of-the-art performance on the common entity prediction task, i.e., giving head (or tail) and relation to predict the tail (or the head), using two benchmark data sets. Furthermore, the deep sequential characteristic of our method enables it to predict the relations given head (or tail) only, and even predict the whole triples. Our experiments on these two new KG completion tasks demonstrate that our method achieves superior performance compared with several alternative methods.
Keywords: Knowledge graph; entity prediction; triple prediction; recurrent neural network
1. Introduction
Knowledge graphs (KGs), such as DBpedia [1] and Freebase [2], often use triples, in the form of \(\left(h,r,t\right)\), to record billions of real-world facts, where \(h,t\) denote entities and \(r\) denotes a relation between \(h\) and \(t\). Despite the enormous effort put into KG creation and maintenance, current KGs are still far from complete. KG completion is proposed to deal with this problem. Previous methods focus on a general task called entity prediction (also known as link prediction) [3, 4], which requires one to complete a triple in a KG by predicting \(t\) given \(\left(h,r,?\right)\) or predicting \(h\) given \(\left(?,r,t\right)\). The left-most part of Figure 1 shows a commonly-used model for tail entity prediction. Input h, r, is firstly projected by some vectors or matrices, where the bold letters denote the corresponding embeddings of \(h\) and \(r\), and then combined to a continuous representation \({{o}}_{t}\) for predicting \(t\).
Although existing methods have shown good performance on entity prediction, in real world they may still be inadequate to complete a KG. Let us assume that there is a method which can effectively complete an entity \(h\) given a relation \(r\) explicitly. But if we do not provide any relations, this method would be incompetent to enrich \(h\), since it is incapable of knowing which relation should be used to complete this entity. An alternative way is to iterate over all relations in the KG, but such process is time-consuming and error-prone. We argue that this may prevent the existing methods from being useful in the real world.
The recurrent neural network (RNN) is a neural sequence model, which has achieved state-of-the-art performance on many natural language processing (NLP) tasks, such as language modeling, speech recognition and machine translation [5, 6]. A triple in a KG can be considered to be a sequence of length 3, which inspires us to use RNNs to model KGs. However, we are still challenged by the following two obstacles: (i) triples are not natural language. They model complex structures of KGs with a fixed expression \(\left(h,r,t\right)\). Such short sequences may be insufficient to provide adequate context for prediction, while it is time-consuming and difficult to construct valuable long sequences from a huge number of paths in KGs; and (ii) relations and entities are elements of two different types, and they appear in triples in a fixed order. It is over-simplified to treat them as the same type.
In this paper, we propose a new method, which employs RNNs to model triples in a KG as sequences. We first design BSKG, a basic sequential model for KGs, as an initial version to illustrate our method (Figure 1). We denote an RNN cell by \(c\), which receives its previous hidden state and the current element as input to predict the next. The cell in the entity layer processes entity embeddings like \(h\), while the cell in the relation layer processes relation embeddings like \(r\). In BSKG, we use one cell to sequentially process all input elements, and thus \(h,r\) are fed to the same cell \(c\) to obtain their respective output. Then, we use \({{o}}_{r}\) to predict relations of \(h\) and \({{o}}_{t}\) to predict tails of \(h\to r\).
However, BSKG lacks effectiveness to model complex structures, because it only uses a single RNN cell to process all input sequences. To improve the performance on complex KGs, a few existing methods like TransH [7] and TransR [8] suggest projecting entities by some vectors or matrices before combination. Different from them, we do not choose to extend BSKG by adding such a layer, since it requires creation of variables for each relation, which consumes considerable resources and may also make the models hard to converge. Instead, we propose DSKG inthis paper, a deep sequential model that extends multi-layer RNNs to model KGs. DSKG can smoothly model complex structures, since each cell in DSKG does not need to output an explicit prediction result, but gives its own comprehension by adding or removing information from the hidden state and conveys this processed hidden state to its next cell. Therefore, the original input is continually refined by each cell, and the complex features can be fluently processed.

Figure 1.   Different models for tail entity prediction. Note: White and black circles denote input and output, respectively. For BSKG and DSKG, light gray circles denote cells in the entity layer and dark gray circles denote cells in the relation layer.
Specifically, we design three different strategies to integrate the RNN cells in DSKG. The right-most part of Figure 1 shows their two-layer examples.
• The Joint strategy is the most prevalent way used in the NLP area. RNN cells are reused in different layers like BSKG, but the output hidden state of each cell is not only recurrently fed to itself next time, but also sequentially conveyed to its next cell.
• The Respective strategy uses independent RNN cells for the entity layer and the relation layer, that is to say, \({c}_{1},{c}_{2},{c}_{3}\mathrm{ }\mathrm{a}\mathrm{n}\mathrm{d}\mathrm{ }{c}_{4}\) are all different. We want this strategy to achieve better performance when relations are diverse and complex.
• The Cascade strategy also uses different RNN cells for the entity layer and the relation layer. Instead of parallel connecting the two layers, we decide to link the end cell of the entity layer to the first cell of the relation layer. As a result, each cell in this strategy can concentrate more on its current work.
The main contributions of this paper are listed as follows:
• We introduce a new method for KG completion, which extends multi-layer RNNs to model triples in a KG as sequences of length 3. Three distinct strategies are proposed to integrate RNN cells and show their different characteristics in our experiments.
• We design two new KG completion tasks, namely relation prediction and triple prediction, as complements to the entity prediction task. The relation prediction task aims to predict relations only given a head (or tail) entity as input. The triple prediction task is to predict the whole triples only given a head entity.
• Our experimental results show that our method achieves state-of-the-art performance for entity prediction on the benchmark data sets based on Freebase and WordNet. It also achieves promising results on the new relation prediction and triple prediction data sets.
The rest of this paper is organized as follows. Section 2 summarizes the related work. Section 3 describes the details of our method. Section 4 presents the experimental results. Finally, we conclude this paper in Section 5.
2. Related Work
We divide existing models into two sub-areas: translational models and non-translational models. We summarize several representative models in Table 1. Our models are also added for comparison.
Table 1.   Comparison of existing models and ours.
ModelsPreprepared dataEnergy functionsNotation explanations
TransH\(∥(h-w_r^T hw_r)+r-(t-w_r^T tw_r)∥_{L*}\)\({{w}}_{r}\) : relation-specific normalization vector
TransRTransE embeddings\({∥W_r h+r-W_r t∥ }_{L*}\)\({{W}}_{r}\) : relation-specific matrix
STransETransE embeddings\({∥W_(r,1) h+r-W_(r,2) t∥ }_{L*}\)\({W}_{r,1},{W}_{r,2}\) : relation-specific matrices
PTransETransE embeddings; paths\(h+r-t+1/(|P(h,t)|) ∑_(p∈P(h,t))( p-r)\)\(P\left(h,t\right)\) : path set of\(\left(h,...,t\right)\)
NTNWord embeddings\(u_r^T tanh(h^T M_r t+W_(r,1) h+W_(r,2) t+b_r)\)\(u_r,M_r,W_(r,1),W_(r,2),b_r\) : relation-specific variables
DISTMULT\(h^T W_rt\)\({{W}}_{r}\) : relation-specific diagonal matrix
NLFeatNode and link features\(Φ_(i,j,k)^TΘ\)\({{\Phi }}_{i,j,k}^{T}\) : feature vector; \({\Theta }\): weight vector
ConvE\(g(vec(g(concat(h,r)*Ω) )W)⋅t\)\(g\) : activation function; \(\mathrm{\Omega }:\) convolutional layer
ConvKB\(concat(g([h,r,t]*Ω) ).w\)\(g\) : activation function; \(\mathrm{\Omega }:\) convolutional layer
BSKG\({L}_{r\left(h\right)}+{L}_{t\left(h,r\right)}\)\({c}_{1}^{h}\to \left\{{c}_{1}^{r}\right\}\)
DSKG (joint)\({c}_{1}^{h}\to \left\{{c}_{2}^{h},{c}_{1}^{r}\right\},{c}_{2}^{h}\to \left\{{c}_{2}^{r}\right\},{c}_{1}^{r}\to \left\{{c}_{2}^{r}\right\}\)
DSKG (respective)\({c}_{1}^{h}\to \left\{{c}_{2}^{h},{c}_{3}^{r}\right\},{c}_{2}^{h}\to \left\{{c}_{4}^{r}\right\},{c}_{3}^{r}\to \left\{{c}_{4}^{r}\right\}\)
DSKG (cascade)\({c}_{1}^{h}\to \left\{{c}_{2}^{h}\right\},{c}_{2}^{h}\to \left\{{c}_{3}^{r}\right\},{c}_{3}^{r}\to \left\{{c}_{4}^{r}\right\}\)
Translational Models
Perhaps, TransE [4] is the most well-known embedding model for KG completion. It represents entities and relations as \(k\)-dimensional vectors in a unified space, and models a triple \(\left(h,r,t\right)\) as \({h}+{r}\approx {t}\). TransE works well for one-to-one relationship, but it fails to model more complex (e.g., one-to-many) relationships. TransH [7] resolves this problem by regarding each relation \(r\) as a vector on a hyperplane whose normalization vector is \({{w}}_{r}\). It projects entity embeddings \({h},{t}\) to this hyperplane by \({w}_{r}\), and uses the same energy function as TransE for training. TransR [8] uses relation-specific matrices to project entities. It creates a project matrix \({{W}}_{r}\) for each relation \(r\) and projects \({h},{t}\) by \({{W}}_{r}\). TransR also employs the same energy function for training.
To extend the above models, STransE [9] learns two project matrices \({{W}}_{r,1},{{W}}_{r,2}\) for each relation \(r\), where \({{W}}_{r,1}\) is used for projecting \({h}\) and \({{W}}_{r,2}\) is for projecting \({t}\). TranSparse [10] is similar to STransE, but it uses a more complex method to project \({h},{t}\). PTransE [11] is also a TransE-like model. It uses additional path information for training. For example, if there exist two triples \(\left({e}_{1},{r}_{1},{e}_{2}\right),\left({e}_{2},{r}_{2},{e}_{3}\right)\), which can be regarded as a path in a KG, and another triple \(\left({e}_{1},{r}_{x},{e}_{3}\right)\) holds simultaneously, then the path \({e}_{1}\to {r}_{1}\to {e}_{2}\to {r}_{2}\to {e}_{3}\) is a valuable path recorded as \(\left({e}_{1},{r}_{1},{r}_{2},{e}_{3}\right)\). However, preparing desirable paths needs to iterate over all possible paths, and thus this process may consume quadratic resources. We argue that PTransE and many path-based models may be inefficient to model large KGs.
All the aforementioned models choose to minimize an energy function that is used in or similar to TransE. Moreover, except TransE and TransH, the remaining models require pre-trained entity and relation embeddings from TransE as initial input, which increases the training expense and also blurs their actual performance.
Non-translational Models
There also exist a number of models that are very different from the translational models. A neural tensor network (NTN) [12] defines a different loss function and creates many variables for each relation. This makes NTN unsuitable for KGs having plenty of relations. Furthermore, NTN requires word embeddings as initial input, which severely increases the training expense. DISTMULT [13] is as simple as TransE, but it employs a completely different energy function. More specifically, it is based on the Bilinear model [14] and represents each relation as a diagonal matrix. Node+LinkFeat (abbr. NLFeat) [15] can also be regarded as a path-based model like PTransE, but it only needs to extract paths of length 1 for constructing node and link features. For example, if there exists a triple \(\left({e}_{i},{r}_{k},{e}_{j}\right)\) in a KG, and \(\left({e}_{i},{r}^{\text{'}},{e}_{j}\right)\) holds at the same time, then a binary feature is constructed as 1(\({r}^{\text{'}}\) & \({r}_{k}\)). Although using paths of length 1 to construct features is much easier than using longer paths, it still consumes considerable resources for large KGs.
Recently, the deep neural networks have received much attention in KG completion. Instead of using simple algebraic operations, the deep neural models stack a group of different neural layers to model complex patterns in KGs. R-GCN [16] leverages the conventional graph convolutional networks [17] and extends it to multi-relational data like KGs. ConvE [18] applies convolutional neural networks (CNNs) [19] for KG completion. ConvKB [20] is also based on CNNs but implemented with a novel fashion.
Other methods like [21, 22] use extra data that cannot be extracted from the original training data, such as text corpora or entity descriptions. Due to the focus of this paper, we do not consider them currently.
3. Basic and Deep Sequential Models
In this section, we first describe our basic sequential model BSKG. Then, we present our deep sequential model DSKG with three different integrating strategies. Finally, two optimization methods useful for accelerating convergence and preventing over-fitting are introduced.
Basic Sequential Model
We start with the basic sequential model, which has only one single RNN cell. Let \(T,E,R\) be the sets of triples, entities and relations in a KG, respectively. Given a triple \(\left(h,r,t\right)\in T\), we represent \(h,t\in E\) and \(r\in R\) all as \(k\)-dimensional vectors \({h},{t},{r}\), and use the tail entity prediction task for example to write the basic RNN layer as follows:
\[S_h =c(h,s_0)\]
\[S_r = c (r, s_h)\]
where \(c\) denotes an RNN cell, which receives its previous hidden state and the current element as input, and outputs the processed hidden state. \({{s}}_{h},{{s}}_{r}\) denote the processed hidden states of input \({h},{r}\), respectively. \({{s}}_{0}\) denotes the zero hidden state for initialization.
There are a variety of candidate RNN cells that can be used to implement our model, e.g., the gated recurrent unit (GRU) cell [23] or the long short-term memory (LSTM) cell [24]. For simplicity, we do not discuss the details here, but use
\[O_h = f (s_h, c)\]
\[O_r = ​​=f(s_r,c)\]
to abstractly represent the operations for calculating the output of RNN cells.
Then, we can respectively use \({{o}}_{h},{{o}}_{r}\) to predict the relations of \(h\) and the tail entities of \(h\to r\) as follows:
\[P_r = W_1 o_h+b_1\]
\[P_t = W_2 o_r+b_2\]
where \({{W}}_{1}\) denotes the weight matrix of relation prediction, which has a shape of \(|R|×k\). \({{b}}_{1}\) denotes the bias vector. \({{W}}_{2},{{b}}_{2}\) are defined similarly. Therefore, \({{p}}_{r},{{p}}_{t}\) can be directly regarded as the unscaled probabilities for relation prediction and tail entity prediction, respectively.
We respectively calculate the sampled softmax cross-entropy relation loss \({L}_{r}\) and entity loss \({L}_{t}\) [25] for the unscaled probabilities \({{p}}_{r},{{p}}_{t}\) as follows:

where \({n}_{i}\) denotes the normalized probability of the \(i\)-th relation with softmax function. \({y}_{i}\) is the label for \({n}_{i}\). Due to the fact that the \(r\)-th relation (the \(t\)-th entity) is the only correct one in this case, we can remove the zero components in the sum. \({N}_{r}\), \({N}_{t}\) denote the sets of sampled negative labels for relation prediction and entity prediction, respectively.
There may exist multiple correct labels for both relation prediction and entity prediction, so in this paper we only use the current input triple to provide the correct labels for both relation prediction and entity prediction. For example, let \(\left({h}_{0},{r}_{0},{t}_{0}\right)\) be the current training triple and \({t}_{1},\dots ,{t}_{m}\) be other correct labels for \(\left({h}_{0},{r}_{0},?\right)\), as these triples \(\left({h}_{0},{r}_{0},{t}_{i}\right),i=1,\dots ,m\) also exist in the training data. But we only consider \({t}_{0}\) as the exclusively right label and avoid adding it during sampling negative labels. We propose this method since it makes training much faster and does not need to prepare the label information for multi-label classification. This is also because the number of possible negative labels is much larger than that of correct ones.
Finally, we can jointly minimize both two losses \({L}_{r},{L}_{t}\), or just minimize \({L}_{t}\), as the final loss for training:
\[\begin{array}{rr}{L}_{j}& ={L}_{r}+{L}_{t}\\ {L}_{e}& ={L}_{t},\end{array}\]
where \({L}_{j}\) denotes the joint loss, and \({L}_{e}\) denotes the entity-only loss. Our experimental results show that minimizing \({L}_{j}\) performs better on the entity prediction task, and also makes our models capable of predicting relations and triples.
Deep Sequential Model
RNN can be considered to be a deep neural network with indefinite layers, since it can recurrently process input in arbitrary times. However, only using a single RNN cell to process all information may be inefficient for KGs. Multi-layer RNNs have shown promising performance on modeling complex hierarchical architectures in the NLP area [26], and KGs happen to have such architectures. Therefore, we propose DSKG, which uses multi-layer RNNs to model complex KGs. Three different integrating strategies, namely Joint, Respective and Cascade, are designed to integrate the RNN cells in DSKG. Figure1 illustrates a two-layer version of our DSKG model with the three strategies. We describe them in detail below:
This strategy uses two distinct RNN cells \({c}_{1},{c}_{2}\) to process both entities and relations. The output hidden state of each cell needs to be fed to itself next time. Note that \({c}_{1}\)’s output hidden state is also conveyed to \({c}_{2}\). By doing this, we do not need \({c}_{1}\) to give an explicit result for each input, but enable it to convey its own comprehension to \({c}_{2}\). \({c}_{2}\) performs prediction based on its previous hidden state and \({c}_{1}\)’s processed hidden state, which include both sequential and hierarchical information. We formalize this process as follows:
\[s_h^1 = c_1 (h,s_0)\]
\[s_h^2 = c_2 (s_h^1,s_0 )\]
\[s_r^1 = c_1 (r,s_h^1)\]
\[s_r^2 = c_2 (s_r^1,s_h^2)\]
We still use Equation (2) to calculate the output of RNN cells, and get two output pairs \(\) and \((o_h^1,o_h^2)\). For each pair, we can either combine its two output by weights or just use its last output to predict relations or tail entities. The former one has an advantage of modeling long hierarchical sequences, such as multiple brace-matching; while the latter one usually performs better when the sequences are short [26]. In this paper, we only use the output of the last cells for prediction, since triples in KGs are short.
Because entities and relations have very different characteristics, we believe that building multi-layer RNNs for them respectively may help our model capture very complex structures. Based on this intuition, we provide the relation layer with independent RNN cells. As shown in Figure 1, we still use \({c}_{1},{c}_{2}\) to process input \({h}\), but assign independent RNN cells \({c}_{3},{c}_{4}\) to process \({r}\):
\[s_r^1 = c_3 (r,s_h^1)\]
\[s_r^2 =c_4 (s_r^1,s_h^2)\]
This is the only difference compared with Equation (7). Our experiments show that this strategy improves the performance on completing KGs with more complex structures. For example, FB15K [4] is considered to be a complex data set since it has more than 1,000 different relations, while WN18 [4] only has 18 types of relations.
We propose this strategy because we believe that longer sequential RNN cells may model knowledge structures better, and the cells in the entity layer or the relation layer can concentrate more on its own current work. For comparison, \({c}_{1},{c}_{2}\) in Respective pass their output hidden states to \({c}_{3},{c}_{4}\), respectively. Differently, Cascade only feeds \({c}_{2}\)’s output hidden state to \({c}_{3}\). So, Cascade cuts down the conflicts between the entity layer and the relation layer. We believe that this can help improve performance on triple prediction. We formalize this strategy as follows:
\[s_r^1 = c_3 (r,s_h^2)\]
\[s_r^2 = c_4 (s_r^1,s_0)\]
However, \({h}\) needs to be sequentially conveyed four times to obtain the final output. Such long sequence may help the our model process \({h}\)’s features, but may also lose some information of \({h}\) during conveying.
In addition to the above three integrating strategies, we also design a simple variant, called Entity-only, for comparing with Joint. The only difference between them is that Entity-only only optimizes the entity-only loss, which is more like previous entity prediction models.
Batch Normalization and Dropout
To accelerate convergence and prevent over-fitting, we also employ two optimization methods in our models.
Batch Normalization
Batch normalization is widely used to alleviate the impact of improperly-initialized neural networks [27]. It enforces the input of next layers to a uniform Gaussian distribution. In our models, the batch normalization layers are placed before and after both the entity layer and the relation layer.
Dropout is a simple but effective method to prevent over-fitting [28]. It is implemented by keeping a neuron active with probability \({p}_{D}\) during training. In our models, we add the dropout layers before and after each RNN cell.
4. Experiments
We implemented our method with TensorFlow, and conducted three different experiments, namely entity prediction, relation prediction and triple prediction, to evaluate it. In this section, we first introduce the data sets and experiment settings. Then, we describe the procedure of each experiment and report the corresponding results.
By carrying out these experiments, we want to answer the following three questions:
1. Can our method achieve state-of-the-art performance on some benchmark data sets?
2. How does our method perform on the new KG completion tasks such as relation prediction and triple prediction?
3. What are the strengths and weaknesses of each integrating strategy in our deep sequential model?
Data Sets and Experiment Settings
In all our experiments, we chose FB15K and WN18 as our data sets, which were proposed in [4] and used by a large number of previous methods. FB15K has 1,345 different relations, while WN18 contains 18 distinct relations. The detailed statistical data of these two data sets are listed in Table 2.
Table 2.   Statistics of the experimental data sets.
# Entities14,95140,943
# Relations1,34518
# Training triples483,142141,442
# Validation triples50,0005,000
# Testing triples59,0715,000
For each data set, we used the Adam [29] optimizer to train one model for all the evaluation tasks and stopped training when the entity prediction result on the validation data is optimized. Thus, the relation prediction and triple prediction results may not be optimal. For both FB15K and WN18, we set the parameters as follows: learning rate \(\lambda =0.001\), embedding dimension \(k=512\), negative sample number \({n}_{S}=512\), batch size \({n}_{B}=2,048\), and the keep-probability for dropout layers \({p}_{D}=0.5\). We employed the LSTM cells in all our models and used two such cells in DSKG (joint), four in DSKG (respective) and DSKG (cascade), just as shown in Figure 1. Adding more cells for DSKG may improve the performance, but would cause slower training speed. Also, our experiments would demonstrate the effectiveness of the two-layer DSKG model.
Additionally, we added the reversed relations in the training data. This strategy is helpful to model relation pairs reversed to each other. There are many methods using reversed relations, such as PTransE [11] and ConvE [15]. We directly used the results reported in their papers. Specifically, for each triple \(\left(h,r,t\right)\) in the training data, we constructed a reversed triple \(\left(t,{r}^{-},h\right)\) and added it into the training data. Adding the reversed relations also enabled our models to predict head and tail entities in an integrated fashion, which means that our models can predict tail entities with input \(\left(h,r,?\right)\), and predict head entities with \(\left(t,{r}^{-},?\right)\) simultaneously. The reversed relations provide a convenient way to evaluate the head prediction. Also, they enhance the connectivity of KGs. Adding them can slightly improve the performance, while significantly boost the convergence speed.
Entity Prediction
Entity prediction aims to predict \(h\) (or \(t\)) given an incomplete triple \(\left(?,r,t\right)\) (or \(\left(h,r,?\right)\)). We evaluated our models with the same method used in [4]. For each triple \(\left(h,r,t\right)\) in the testing data, we constructed two incomplete triples \(\left(h,r,?\right),\left(t,{r}^{-},?\right)\) for tail prediction and head prediction, respectively.
Following [4] and many others, two evaluation metrics were used: the mean rank of correct entities (MR) and the percentage of correct entities in ranked top-10 (Hits@10). Besides, an incomplete triple like \(\left(h,r,?\right)\) may have multiple correct tails. Entity \(t\) in the current testing triple \(\left(h,r,t\right)\) is only one of the correct. Thus, we also employed the filtered mean rank (FMR) and filtered Hits@10 (FHits@10) [4], which removed all the correct entities except \(t\) during ranking.
The experimental results on FB15K and WN18 are shown in Table 3. We can observe that:
1) DSKG outperformed the other models on FB15K and also achieved superior performance on WN18 for Hits@10 and FHits@10.
2) Compared with BSKG, DSKG significantly improved the performance of FHits@10 on FB15K, which has a more complex structure than WN18.
3) DSKG (joint) outperformed DSKG (entity-only) on both FB15K and WN18, which proved that jointly optimizing relation prediction and entity prediction is helpful for predicting entities.
4) The three strategies in DSKG performed similarly on FB15K, but diverged for MR and FMR on WN18. For example, DSKG (cascade) achieved a better FHits10 than DSKG (joint) on WN18, but it performed worse on MR and FMR. We argue that WN18 only has 5,000 triples for testing, but it has about 40,000 different entities. So, if a model fails on one triple in the testing data, its MR and FMR would drop 8.0. However, for FB15K, its MR and FMR would only drop 0.25 at most.
5) DSKG (respective) outperformed DSKG (joint) on FB15K, which may show that using distinct RNN cells for the entity layer and the relation layer is helpful for modeling complex KGs.
Table 3.   Entity prediction results.
DSKG (entity-only)53.789.51924082.595.1248233
DSKG (joint)53.889.81883882.595.2217203
DSKG (respective)53.989.91883681.495.2354338
DSKG (cascade)54.189.31833681.595.3370353
“-” indicates the result unreported in literature. Models are ordered according to their publishing years.
Table 4 shows the detailed entity prediction results on FB15K. We separated results by relationship categories, e.g., 1:1 denotes the one-to-one relationship and 1:\(M\) denotes one-to-many. We can observe that BSKG and DSKG outperformed the others on the complex relationship categories (i.e., 1:\(M\), \(M\):1 and \(M\):\(N\)). The reasons are: (1) Our models are probability-based rather than margin-based, so they would not suffer from the problem of modeling the complex relationships in the margin-based models; (2) RNN cells can properly model triples that have complex relationships, since they can transfer and model entities or relations alternately.
Table 4.   Detailed entity prediction results on FB15K by relationship categories.
ModelsHead prediction (FHits@10)Tail prediction (FHits@10)
1:11:MM :1M :NOverall1:11:MM :1M :NOverall
DSKG (entity-only)
DSKG (joint)88.397.069.589.287.388.782.296.192.792.2
DSKG (respective)89.797.169.689.987.588.882.296.292.992.3
DSKG (cascade)89.497.268.389.286.989.780.996.192.391.8
Relation Prediction
Relation prediction aims to predict relations given a head (or tail) entity only. For each triple \(\left(h,r,t\right)\) in the same testing data as used in the entity prediction experiment, we took \({h}\) as input for predicting \(r\), and \({t}\) for \({r}^{-}\). Note that the testing data used here can be redundant. For example, \(\left({h}_{1},{r}_{1},{t}_{1}\right)\) and \(\left({h}_{1},{r}_{1},{t}_{2}\right)\) are two different triples for entity prediction, while relation prediction only considers \({{h}}_{1},{{r}}_{1}\). We did not remove the redundant testing triple \(\left({h}_{1},{r}_{1},{t}_{2}\right)\) in the relation prediction and triple prediction experiments, since the occurrence of \(\left({h}_{1},{r}_{1}\right)\) can be regarded as its weight, reflecting its prevalence and importance in a KG.
Since there are no previous models designed for this task, we proposed a specific relation prediction model for comparison, which has two fully-connected layers. Each layer in this model has a weight matrix and a bias vector, and uses ReLU as the activating function. Note that DSKG (entity-only) may not suit this experiment task, because it does not minimize the relation prediction loss during training. We used the same evaluation metrics in this experiment.
The relation prediction results are shown in Table 5. For FB15K, we can observe the following:
1) Predicting forward relations was more accurate than predicting backward relations for FHits@10 and FMR, due to the facts that KGs are usually constructed by humans, and \(\left(h,r,t\right)\) is a more natural representation than \(\left(t,{r}^{-},h\right)\).
2) DSKG outperformed the fully-connected two-layer model, which verified that jointly optimizing relation prediction and entity prediction can also help predict relations.
3) Both BSKG and DSKG achieved an FMR of 1.5 on predicting forward relations, which indicated that our models predicted the correct forward relations of an entity with a very high probability on average. We also evaluated our models on WN18. Because this data set has only 18 distinct relations, it is not surprising that all the models achieved similar performance. Still, DSKG showed a slight advantage for Hits@10 and FHits@10. It is worth noting that our models can also predict \(r\) given both \(h\) and \(t\) using the same method as in [11] (i.e., the aforementioned relationship prediction task).
Table 5.   Relation prediction results.
Data setsModelsForward (h → )Backward ( t → )
DSKG (joint)82.699.15.71.590.392.24.44.0
DSKG (respective)83.599.05.81.790.592.24.64.2
DSKG (cascade)
DSKG (joint)99.899.
DSKG (respective)99.899.92.71.599.899.92.81.8
DSKG (cascade)99.899.
Triple Prediction
DSKG is capable of not only predicting entities and relations, but also predicting the whole triples given an entity. So, triple prediction can be considered to be a more integrated task for KG completion. For each triple \(\left(h,r,t\right)\) in the same testing data as used in the entity prediction experiment, we only used \({h}\) as input to predict the triples about \(h\). Each model was first asked to predict the relation \({r}_{p}\) of \(h\) with the highest probability, and then used this relation to predict the most probable tail \({t}_{p}\). As a result, we got one output triple \(\left(h,{r}_{p},{t}_{p}\right)\) for each input \({h}\). Note that the models in this experiment only predicted triples in the forward direction, because: (i) we have shown that the backward relation prediction results on FMR were worse than the forward relation prediction results; and (ii) predicting forward triples of an entity is more natural to KG modeling.
We evaluated the output triples with the following method. Let \({S}_{R}\) denote the set of all triples in a KG, \({S}_{O}\) denote the set of output triples, \({S}_{P}\) denote the union of testing and validation triples. \({S}_{R}\) and \({S}_{P}\) are referred to as the representation and prediction triple sets, respectively. We calculate the correct representation number as \({n}_{R}=|{S}_{R}\cap {S}_{O}|\), and the correct prediction number as \({n}_{P}=|{S}_{P}\cap {S}_{O}|\).
So, the representation precision and the prediction precision are calculated as follows:
\[\begin{array}{rr}\mathrm{R}\mathrm{e}\mathrm{p}\mathrm{r}\mathrm{e}\mathrm{s}\mathrm{e}\mathrm{n}\mathrm{t}\mathrm{a}\mathrm{t}\mathrm{i}\mathrm{o}\mathrm{n}\mathrm{ }\mathrm{p}\mathrm{r}\mathrm{e}\mathrm{c}\mathrm{i}\mathrm{s}\mathrm{i}\mathrm{o}\mathrm{n}& =\frac{{n}_{R}}{|{S}_{O}|}\\ \mathrm{P}\mathrm{r}\mathrm{e}\mathrm{d}\mathrm{i}\mathrm{c}\mathrm{t}\mathrm{i}\mathrm{o}\mathrm{n}\mathrm{ }\mathrm{p}\mathrm{r}\mathrm{e}\mathrm{c}\mathrm{i}\mathrm{s}\mathrm{i}\mathrm{o}\mathrm{n}& =\frac{{n}_{P}}{\left(|{S}_{O}|-{n}_{R}+{n}_{P}\right)}.\end{array}\]
We slightly modified the evaluation procedures of TransE, TransR and ConvE for comparison, since they are unable to predict relations. We provided relations for them using four different providers, i.e., DSKG (joint), DSKG (respective), DSKG (cascade) and correct relations.
The experimental results are shown in Table 6. We can observe the following:
1) DSKG outperformed TransE and TransR, especially on WN18, even though providing the correct relations for them. This may be caused by the fact that both TransE and TransR are margin-based, and thus comparing distances for prediction may mislead them.
2) ConvE achieved similar performance compared with BSKG, but performed slightly weaker than DSKG.
3) Even we did not provide the correct relations, and DSKG still achieved good results, especially DSKG (cascade), which outperformed all the others on both FB15K and WN18.
4) For all the integrating strategies in DSKG, using their own providers to provide relations performed even better for Representation precision, which is probably because correct relations are actually extracted from testing data, while their own providers preferred to give more relations that have already been trained.
Table 6.   Triple prediction results.
Relation providersModelsFB15KWN18
DSKGTransE (joint)80.930.645.95.0
TransE (respective)81.632.910.60.7
TransE (cascade)80.831.230.61.8
TransR (joint)85.423.471.215.1
TransR (respective)87.429.355.815.6
TransR (cascade)85.724.471.815.5
ConvE (joint)
ConvE (respective)
ConvE (cascade)
BSKG (joint)
BSKG (respective)
BSKG (cascade)97.269.693.656.7
DSKG (joint)98.570.796.172.5
DSKG (respective)98.373.492.458.5
DSKG (cascade)98.573.499.189.8
Correct relationsTransE75.349.338.720.5
DSKG (joint)95.282.890.283.1
DSKG (respective)95.282.996.794.1
DSKG (cascade)
5. Conclusion and Future Work
In this paper, we proposed a new KG completion method that extends multi-layer RNNs to model triples in KGs as sequences. Our experimental results on FB15K and WN18 showed that our method achieved superior performance not only on the benchmark entity prediction task, but also on two new KG completion tasks to predict relations and triples given one single entity. For future work, we plan to explore the following three directions:
• Integrating the attention mechanism [30] in our method. While the attention mechanism has shown its power in the NLP area, applying it to KG completion has not been well studied. In future, we would like to extend our method with this mechanism to improve its inference ability.
• Using a provided KG to complete another KG. Recently, several methods start to leverage extra textual data for improving KG completion. However, textual data like entity descriptions and text corpora are written in natural language. Due to the ambiguity and heterogeneity of textual data, they may bring mistakes in prediction. Therefore, we think that taking existing KGs as another kind of extra data may improve performance.
• Embedding KGs for entity alignment. We also want to consider the problem of learning KG embeddings for entity alignment. Particularly, we look forward to learning embeddings of different KGs in a unified space and employing RNNs to explore neighboring context for entity alignment.
S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, & Z.G. Ives. DBpedia: A nucleus for a web of open data. In: KAberer et al. (eds) The Semantic Web. Berlin: Springer, 2007, pp. 722–735. doi: 10.1007/978-3-540-76298-0_52.
K.D. Bollacker, C. Evans, P. Paritosh, T. Sturge, & J. Taylor. Freebase: A collaboratively created graph database for structuring human knowledge. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, ACM, 2008, pp. 1247–1250. doi: 10.1145/1376616.1376746.
A. Bordes, J. Weston, R. Collobert, & Y. Bengio. Learning structured embeddings of knowledge bases. In: Proceedings of the 25th AAAI Conference on Artificial Intelligence, AAAI, 2011, pp. 301–306. Available at: https://www.aaai.org/ocs/index.php/AAAI/AAAI11/paper/view/3659/3898.
A. Bordes, N. Usunier, A. Garcia-Durán, J. Weston, & O. Yakhnenko. Translating embeddings for modeling multi-relational data. In Proceedings of the 26th International Conference on Neural Information Processing Systems (NIPS), NIPS Foundation, 2013, pp. 2787–2795. Available at: http://papers.nips.cc/paper/5071-translating-embeddings-for-modeling-multi-relational-data.pdf.
O. Vinyals, L. Kaiser, T. Koo, S. Petrov, I. Sutskever, & G.E. Hinton. Grammar as a foreign language. In: Proceedings of the 28th International Conference on Neural Information Processing Systems, MIT Press, 2015, pp. 2773–2781. Available at: http://papers.nips.cc/paper/5635-grammar-as-a-foreign-language.pdf.
R. Józefowicz, O. Vinyals, M. Schuster, N. Shazeer, & Y. Wu. Exploring the limits of language modeling. In: Proceedings of the 4th International Conference on Learning Representations, ICLR, 2016.
Z. Wang, J. Zhang, J. Feng, & Z. Chen. Knowledge graph embedding by translating on hyperplanes. In: Proceedings of the 28th AAAI Conference on Artificial Intelligence, AAAI, 2014, pp. 1112–1119. Available at: https://www.aaai.org/ocs/index.php/AAAI/AAAI14/paper/view/8531/8546.
Y. Lin, Z. Liu, M. Sun, Y. Liu, & X. Zhu. Learning entity and relation embeddings for knowledge graph completion. In: Proceedings of the 29th AAAI Conference on Artificial Intelligence, AAAI, 2015, pp. 2181–2187. doi: 10.1016/j.procs.2017.05.045.
D.Q. Nguyen, K. Sirts, L. Qu, & M. Johnson. STransE: A novel embedding model of entities and relationships in knowledge bases. In: Proceedings of the 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics, ACM, 2016, pp. 460–466. doi: 10.18653/v1/N16-1054.
G. Ji, K. Liu, S. He, & J. Zhao. Knowledge graph completion with adaptive sparse transfer matrix. In: Proceedings of the 30th AAAI Conference on Artificial Intelligence, AAAI, 2016, pp. 985–991. Available at: https://aaai.org/ocs/index.php/AAAI/AAAI16/paper/view/11982/11693.
Y. Lin, Z. Liu, H.-B. Luan, M. Sun, S. Rao, & S. Liu. Modeling relation paths for representation learning of knowledge bases. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, ACM, 2015, pp. 705–714. doi: 10.18653/v1/D15-1082.
R. Socher, D. Chen, C. Manning, D. Chen, & A. Ng. Reasoning with neural tensor networks for knowledge base completion. In: Proceedings of the 26th International Conference on Neural Information Processing Systems, MIT Press, 2013, pp. 926–934. doi: 10.1109/ICICIP.2013.6568119.
B. Yang, S.W. Yih, X. He, J. Gao, & L. Deng. Embedding entities and relations for learning and inference in knowledge bases. In: Proceedings of the 3rd International Conference on Learning Representations, ICLR, 2015. Available at: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/ICLR2015_updated.pdf.
M. Nickel, V. Tresp, & H.-P. Kriegel. A three-way model for collective learning on multi-relational data. In: L. Getoor, & T. Scheffer (eds.) Proceedings of the 28th International Conference on Machine Learning (ICML-11). Bellevue, Washington: ACM, pp. 809--816.
K. Toutanova, & D. Chen. Observed versus latent features for knowledge base and text inference. In: Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality, 2015, pp. 57–66. doi: 10.18653/v1/w15-4007.
M.S. Schlichtkrull, T.N. Kipf, P. Bloem, R. van den Berg, I. Titov, & M. Welling. Modeling relational data with graph convolutional networks. In: A. Gangemi et al. (eds.) The Semantic Web. ISWC 2018. Cham, Switzerland: Springer, 2018, pp. 593–607. doi: 10.1007/978-3-319-93417-4_38.
D.K. Duvenaud, D. Maclaurin, J. Aguilera-Iparraguirre, R.G_omez-Bombarelli, T. Hirzel, A. Aspuru-Guzik, & R.P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In: Proceedings of the 28th International Conference on Neural Information Processing Systems, MIT Press, 2015, pp. 2224–2232. Available at: http://dl.acm.org/citation.cfm?id=2969488.
T. Dettmers, P. Minervini, P. Stenetorp, & S. Riedel. Convolutional 2d knowledge graph embeddings. In: Proceedings of the 32nd AAAI Conference on Artificial Intelligence, AAAI, 2018, pp. 1811–1818. Available at: https://aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17366.
Y. LeCun, L. Bottou, Y. Bengio, & P. Haffner. Gradient-based learning applied to document recognition.Proceedings of the IEEE 86(11)(1998), 2278–2324. doi: 10.1109/5.726791.
D.Q. Nguyen, T.D. Nguyen, D.Q. Nguyen, & D.Q. Phung. A novel embedding model for knowledge base completion based on convolutional neural network. In: Proceedings of the 17th Annual Conference of the North American Chapter of the Association for Computational Linguistics, ACL, 2018, pp. 327–333. doi: 10.18653/v1/N18-2053.
R. Xie, Z. Liu, J. Jia, H. Luan, & M. Sun. Representation learning of knowledge graphs with entity descriptions. In: Proceedings of the 30th AAAI Conference on Artificial Intelligence, AAAI, 2016, pp. 2659–2665. Available at: tps://aaai.org/ocs/index.php/AAAI/AAAI16/paper/view/12216/12004.
H. Xiao, M. Huang, L. Meng, & X. Zhu. SSP: Semantic space projection for knowledge graph embedding with text descriptions. In: Proceedings of the 31st AAAI Conference on Artificial Intelligence, AAAI, 2017, pp. 3104–3110.
K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, & Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 2014, pp. 1724–1734. doi: 10.3115/v1/D14-1179.
S. Hochreiter, & J. Schmidhuber. Long short-term memory. Neural Computation, 1997, pp. 1735–1780. doi: 10.1162/neco.1997.9.8.1735.
S. Jean, K. Cho, R. Memisevic, & Y. Bengio. On using very large target vocabulary for neural machine translation. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, ACL, 2015, pp. 1–10.
M. Hermans, & B. Schrauwen. Training and analyzing deep recurrent neural networks. In: Proceedings of the 26th International Conference on Neural Information Processing Systems, MIT Press, 2013, pp. 190–198. Available at: http://papers.nips.cc/paper/5166-training-and-analysing-deep-recurrent-neural-networks.pdf.
S. Ioffe, & C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: Proceedings of the 32nd International Conference on Machine Learning, JMLR, 2015, pp. 448–456. Available at: https://dl.acm.org/citation.cfm?id=3045167%22.
N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, & R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 2014, pp. 1929–1958. Available at: http://jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf.
D.P. Kingma, & J. Ba. Adam: A method for stochastic optimization. In: Proceedings of the 3rd International Conference on Learning Representations, ICLR, 2015.
D. Bahdanau, K. Cho, & Y. Bengio. Neural machine translation by jointly learning to align and translate. In: Proceedings of the 3rd International Conference on Learning Representations, ICLR, 2015.
Article and author information
Cite As
L. Guo, Q. Zhang, W. Hu, Z. Sun, & Y. Qu. Learning to complete knowledge graphs with deep sequential models. Data Intelligence 1(2019), 224-243. doi: 10.1162/dint_a_00016
Lingbing Guo
L. Guo designed the model.
Lingbing Guo is currently a M.S. student in Department of Computer Science and Technology, Nanjing University, Nanjing, China. He received his B.S. degree in computer science and technology in 2016 from Henan University, China. His research interest is knowledge graph completion.
Qingheng Zhang
Q. Zhang conducted main experiments.
Qingheng Zhang is currently a M.S. student in Department of Computer Science and Technology, Nanjing University, Nanjing, China. He received his B.S. degree in computer science and technology in 2017 from Hohai University, China. His research interest is knowledge graph embedding.
Wei Hu
W. Hu (whu@nju.edu.cn, corresponding author) assembled the manuscript.
Wei Hu is currently an associate professor in State Key Laboratory for Novel Software Technology, Department of Computer Science and Technology, Nanjing University, China. He received his PhD degree in computer software and theory in 2009, and his B.S. degree in computer science and technology in 2005, both from Southeast University, China. His main research interests include knowledge graph, data integration and intelligent software.
Zequn Sun
Z. Sun and Y. Qu revised the whole paper.
Zequn Sun is currently a PhD student in Department of Computer Science and Technology, Nanjing University, China. He received his B.S. degree in computer science and technology in 2016 from Hohai University, China. His research interest is entity alignment.
Yuzhong Qu
Z. Sun and Y. Qu revised the whole paper.
Yuzhong Qu received his B.S. and M.S. degrees in mathematics from Fudan University, China, in 1985 and 1988 respectively, and got his Ph.D. degree in computer software from Nanjing University, Nanjing, China, in 1995. He is currently a full professor in State Key Laboratory for Novel Software Technology, Department of Computer Science and Technology, Nanjing University, China. His research interests include Semantic Web, question answering and novel software technology for the Web.
This work was supported by the National Natural Science Foundation of China under Grant No. 61872172.
Publication records
Published: May 15, 2019 (Versions1
Data Intelligence