In recent years, great efforts have been devoted to harvesting knowledge from the Web, and a variety of knowledge graphs (KGs) or knowledge bases (KBs) have been constructed, such as YAGO, DBpedia, Freebase and CN-DBpedia. These knowledge bases play important roles in many applications, such as search engines, recommendation systems and question answering.
However, knowledge bases are generally incomplete. Facts in current KBs (e.g., DBpedia, YAGO, Freebase and CN-DBpedia) are mainly obtained from the carefully edited structured texts (e.g., infobox and category information) of Web pages on online encyclopedia websites (e.g., Wikipedia and Baidu Baike). Since knowledge is rich but editors have limited editing capacity, many structured texts are incomplete, and consequently the facts extracted directly from them are incomplete as well. According to Catriple, only 44.2% of articles in Wikipedia have infobox information. Similarly, in Baidu Baike, the largest Chinese online encyclopedia website, almost 32% (nearly 3 million) of entities lack infobox and category information altogether.
An incomplete knowledge base leads to poor performance in many downstream applications, since they cannot find the corresponding facts in the knowledge base. For example, if the knowledge graph lacks the fact about Donald Trump's birthday, it cannot answer the question "When was Donald Trump born?".
To address this challenge, we propose an extraction-and-verification framework to enrich knowledge bases. Based on the existing knowledge base, we first extract new facts from the description texts of entities. However, not all newly extracted facts can be added directly to the knowledge base, because errors may be introduced during extraction [11, 12, 13]. Table 1 reports the F1-scores of state-of-the-art text-based extractors on the slot filling benchmark TAC data set, including a pattern-based method (PATdist), traditional machine learning methods (Mintz++, SVMskip), a graphical model (MIMLRE) and a neural network based method (CNNcontext); the facts extracted by these extractors still contain considerable noise. This motivates us to employ a novel crowdsourcing method to verify the extracted facts. Considering the human cost, we only verify the low-confidence facts. In the end, only two types of extracted facts are added to the knowledge base: facts with high confidence, and facts with low confidence that have been verified by humans as correct.
Table 1: F1 scores on the slot filling benchmark TAC data set (dev: data from 2012/2013, eval: data from 2014)

| Predicate | PATdist (dev) | PATdist (eval) | Mintz++ (dev) | Mintz++ (eval) | SVMskip (dev) | SVMskip (eval) | MIMLRE (dev) | MIMLRE (eval) | CNNcontext (dev) | CNNcontext (eval) |
|---|---|---|---|---|---|---|---|---|---|---|
| per:cause of death | .76 | .42 | .75 | .36 | .44 | .11 | .82 | .32 | .77 | .52 |
| per:date of birth | 1.0 | .60 | .99 | .60 | .67 | .57 | 1.0 | .67 | 1.0 | .77 |
| per:date of death | .67 | .45 | .67 | .45 | .30 | .32 | .79 | .54 | .72 | .48 |
| per:empl memb of | .38 | .36 | .41 | .37 | .24 | .22 | .42 | .36 | .41 | .37 |
| per:location of birth | .56 | .22 | .56 | .22 | .30 | .30 | .59 | .27 | .59 | .23 |
| per:loc of death | .65 | .41 | .66 | .43 | .13 | .00 | .64 | .34 | .63 | .28 |
| per:loc of residence | .14 | .11 | .15 | .18 | .10 | .03 | .31 | .33 | .20 | .23 |
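The triage step of the framework described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function and fact names are hypothetical, and the confidence threshold is an assumed value (the paper does not state one). Each extracted fact carries a confidence score from its extractor; high-confidence facts are added to the knowledge base directly, while low-confidence facts are routed to crowdsourced verification.

```python
# Hypothetical sketch of the extract-then-verify triage logic.
# Facts are (subject, predicate, object, confidence) tuples.

THRESHOLD = 0.9  # assumed cutoff; not specified in the paper

def triage(extracted_facts, threshold=THRESHOLD):
    """Split extracted facts into those added directly to the KB
    and those queued for crowdsourced verification."""
    accepted, to_verify = [], []
    for subj, pred, obj, conf in extracted_facts:
        if conf >= threshold:
            accepted.append((subj, pred, obj))   # added directly
        else:
            to_verify.append((subj, pred, obj))  # sent to humans
    return accepted, to_verify

# Illustrative facts with made-up confidence scores.
facts = [
    ("Donald Trump", "per:date of birth", "1946-06-14", 0.97),
    ("Donald Trump", "per:loc of residence", "Washington", 0.41),
]
accepted, to_verify = triage(facts)
# accepted holds the high-confidence fact; to_verify holds the rest.
```

A verified fact from `to_verify` would then rejoin `accepted` only if the crowd labels it correct, matching the two admission paths described above.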
The missing facts in a knowledge base mainly comprise relations between entities and relations between entities and concepts. In this paper, we use the description texts of entities to enrich the knowledge base through two subtasks, entity typing and slot filling. Our contributions are as follows:
1) For the entity typing subtask, we propose a multi-instance learning model that processes textual information as well as heterogeneous information.
2) For the slot filling subtask, we use a transfer learning strategy to extract the values of long-tailed predicates.
3) We propose a novel implicit crowdsourcing approach to verify low-confidence new facts.
4) We apply this framework to the existing knowledge base CN-DBpedia and release a new version, CN-DBpedia2, which additionally contains the facts extracted from the description texts of entities. As of April 2019, CN-DBpedia2 contains 16,024,656 entities and 228,499,155 facts.
The rest of this paper is organized as follows. Section 2 introduces the system architecture of CN-DBpedia2. Section 3 and Section 4 detail the methods of entity typing and slot filling. Section 5 introduces how to verify those low-confidence new facts. Section 6 presents the statistics of our new system. Finally, Section 7 concludes the paper.