Wikipedia has become one of the most accessible online encyclopedias, with extremely broad language coverage: it contains articles in 298 languages, among which the English version is the largest, with more than 5.6 million articles. Its "everyone can edit" policy keeps its knowledge constantly growing and evolving. However, knowledge in Wikipedia takes the form of free text or attribute-value pairs in infoboxes. Wikipedia's vast knowledge has inspired many knowledge base (KB) projects that structure this knowledge and link it across languages.
Several projects construct KBs from Wikipedia, e.g., DBpedia, YAGO and BabelNet. Nevertheless, they have different focuses. YAGO pays more attention to the semantic consistency of the same knowledge across languages. DBpedia does much work on the extraction and alignment of cross-lingual fact triples. BabelNet concentrates on entity concepts, senses and synsets.
The imbalanced sizes of the different Wikipedia language versions naturally lead to a highly imbalanced distribution of knowledge across languages. This imbalance carries over into the KBs built from Wikipedia: knowledge encoded in non-English languages is far scarcer than that in English. To address this issue, XLORE was built as the first large-scale cross-lingual KB with a balanced amount of Chinese and English knowledge. It offers a new way to build a knowledge graph across any two languages by exploiting cross-lingual links in Wikipedia. Although XLORE already has a relatively balanced amount of bilingual knowledge, a large number of facts are still missing and need to be supplemented. A review of XLORE's quality reveals three kinds of facts that clearly require enhancement:
1) The number of cross-lingual links between English instances and Chinese instances is limited. Discovering more cross-lingual links benefits knowledge sharing across languages;
2) Each language version maintains its own set of infoboxes with its own set of attributes, and sometimes provides different values for corresponding attributes. Therefore, attributes in different languages must be matched to obtain coherent knowledge;
3) The type information of instances is often incomplete. For example, Yao Ming should be assigned not only Person, Athlete and Basketball Player, but also Businessman.
Completing these three types of missing facts is a very challenging task. Existing cross-lingual knowledge linking methods depend heavily on the number of existing cross-lingual links, yet cross-lingual links in Wikipedia are quite sparse. Existing cross-lingual property matching methods achieve high precision, but the number of properties they align is quite small for such a large-scale KB. Existing type inference methods require the creation and maintenance of large-scale, high-quality annotated corpora, which are often difficult to obtain.
In this paper, we present XLORE2, an extension of XLORE, as a holistic approach to building a large-scale English-Chinese bilingual KB that addresses the above problems.
Our approach applies a cross-lingual knowledge linking method to find more cross-lingual links between equivalent instances in different languages, and a fine-grained type inference method to assign specific types to instances lacking type information. Further, we validate subClassOf and instanceOf relations in XLORE2 in order to build a high-quality taxonomy. Moreover, for cross-lingual property matching, we investigate several effective features and propose an entity-attribute factor graph model to find corresponding attributes between English and Chinese. This strategy uncovers many more facts by completing the attribute knowledge, and largely alleviates the obstacle of language imbalance. Last but not least, we design an efficient entity linking system, XLink, which links mentions in a document to entities in XLORE2. As a result, XLORE2 contains significantly more facts than XLORE.
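To make the property-matching idea concrete, the following is a minimal illustrative sketch, not the paper's actual implementation: it scores candidate English-Chinese attribute pairs by the overlap of their values across linked instance pairs, one of the kinds of features the entity-attribute factor graph model can combine. The attribute names and toy values below are hypothetical examples.

```python
def jaccard(a, b):
    """Jaccard overlap between two sets of observed values."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical toy data: values seen for infobox attributes of a pair
# of cross-lingually linked instances (e.g., Yao Ming's English and
# Chinese Wikipedia infoboxes).
en_attr_values = {"birth_place": {"Shanghai"}, "height": {"2.29 m"}}
zh_attr_values = {"出生地": {"Shanghai"}, "身高": {"2.29 m"}}

# Score every English-Chinese attribute pair by value overlap; a high
# score suggests the two attributes express the same property.
scores = {
    (en, zh): jaccard(ev, zv)
    for en, ev in en_attr_values.items()
    for zh, zv in zh_attr_values.items()
}
```

In the full model, such pairwise feature scores would be factor potentials, with additional factors enforcing consistency across the many entities that share an attribute, rather than the per-pair scoring shown here.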
The rest of this paper is organized as follows. Section 2 discusses related work. Section 3 presents the framework of XLORE2. Section 4 introduces our approaches to cross-lingual knowledge building. Section 5 introduces our methods for data quality improvement. Section 6 presents some practical applications of XLORE2. Section 7 gives a statistical analysis of XLORE2. Section 8 concludes the paper.