By June 2018, AMiner had collected from the Internet a large scholarly data set with more than 130,000,000 researcher profiles and 233,000,000 publications, along with a number of subsets constructed for different research purposes. These subsets are described below and are available at https://www.aminer.cn/data.
Citation Network. The citation data are extracted from DBLP, ACM DL and other sources. The data set contains 1,572,277 papers and 2,084,019 citation relationships. Each paper is associated with an abstract, authors, a year, a venue, and a title. The data set can be used for clustering with network and side information, studying influence in the citation network, finding the most influential papers, topic modeling analysis, etc.
Academic Social Network. These data include papers, paper citations, author information and author collaborations. The data set contains 1,712,433 authors, 2,092,356 papers, 8,024,869 citation relationships and 4,258,615 collaboration relationships between authors.
Advisor-advisee. The data set comprises 815,946 authors and 2,792,833 co-author relationships. To evaluate the performance of inferring advisor-advisee relationships between co-authors, we created a smaller ground-truth data set using the following method: (1) collecting the advisor-advisee information from the Mathematics Genealogy project and the AI Genealogy project; (2) manually collecting the advisor-advisee information from researchers’ homepages. In total, we labeled 1,534 co-author relationships, of which 514 are advisor-advisee relationships.
Topic-co-author. It is a topic-based co-author network, which contains 640,134 authors of 8 topics and 1,554,643 co-author relationships. The eight topics are: Data Mining/Association Rules, Web Services, Bayesian Networks/Belief Function, Web Mining/Information Fusion, Semantic Web/Description Logics, Machine Learning, Database Systems/XML Data and Information Retrieval.
Topic-paper-author. The data set was collected for the purpose of cross-domain recommendation. It contains 33,739 authors associated with 5 topics as well as 139,278 co-author relationships. The five topics are Data Mining (with 6,282 authors and 22,862 co-author relationships), Medical Informatics (with 9,150 authors and 31,851 co-author relationships), Theory (with 5,449 authors and 27,712 co-author relationships), Visualization (with 5,268 authors and 19,261 co-author relationships) and Database (with 7,590 authors and 37,592 co-author relationships).
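As a quick consistency check on the figures above, the per-topic counts can be summed and compared with the stated totals. The sketch below only restates numbers from the description; the variable names are illustrative. That the sums match suggests each author and each co-author relationship is counted under exactly one topic, although the description does not state this explicitly.

```python
# Per-topic (authors, co-author relationships), copied from the description above.
topic_stats = {
    "Data Mining": (6_282, 22_862),
    "Medical Informatics": (9_150, 31_851),
    "Theory": (5_449, 27_712),
    "Visualization": (5_268, 19_261),
    "Database": (7_590, 37_592),
}

# Sum each column across the five topics.
total_authors = sum(authors for authors, _ in topic_stats.values())
total_relationships = sum(edges for _, edges in topic_stats.values())

print(total_authors)        # 33739
print(total_relationships)  # 139278
```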
Topic-citation. It is a topic-based citation network which contains 2,329,760 papers of 10 topics and 12,710,347 citation relationships. The 10 topics are: Data Mining/Association Rules, Web Services, Bayesian Networks/Belief Function, Web Mining/Information Fusion, Semantic Web/Description Logics, Machine Learning, Database Systems/XML Data, Pattern Recognition/Image Analysis, Information Retrieval, and Natural Language System/Statistical Machine Translation.
Kernel Community. It is a co-authorship network with 822,415 nodes and 2,928,360 undirected edges. Each vertex represents an author and each edge represents a co-author relationship.
Dynamic Co-author. The data set contains 1,768,776 papers published during the time period from 1986 to 2012 with 1,629,217 authors involved. Each year is regarded as a time stamp and there are 27 time stamps in total. At each time stamp, we create an edge between two authors if they have co-authored at least one paper in the most recent three years (including the current year). We convert the undirected co-author network into a directed network by regarding each undirected edge as two symmetric directed edges.
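The construction described above can be sketched as follows. This is a minimal illustration, not the code used to build the data set: the input format (a list of `(year, authors)` tuples), the function name `build_dynamic_coauthor_edges`, and its parameters are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def build_dynamic_coauthor_edges(papers, first_year=1986, last_year=2012, window=3):
    """Return {year: set of directed (u, v) edges} for each time stamp.

    `papers` is an iterable of (year, authors) tuples (hypothetical format).
    """
    # Index undirected co-author pairs by publication year.
    pairs_by_year = defaultdict(set)
    for year, authors in papers:
        for u, v in combinations(sorted(set(authors)), 2):
            pairs_by_year[year].add((u, v))

    snapshots = {}
    for t in range(first_year, last_year + 1):
        directed = set()
        # The window covers the most recent `window` years, including t itself.
        for y in range(t - window + 1, t + 1):
            for u, v in pairs_by_year.get(y, ()):
                # Each undirected edge becomes two symmetric directed edges.
                directed.add((u, v))
                directed.add((v, u))
        snapshots[t] = directed
    return snapshots


# Toy input: three papers by three hypothetical authors.
toy_papers = [(1986, ["a", "b"]), (1988, ["b", "c"]), (1990, ["a", "c"])]
toy = build_dynamic_coauthor_edges(toy_papers, first_year=1986, last_year=1990)
print(sorted(toy[1989]))  # [('b', 'c'), ('c', 'b')]
```

With the default year range, the function yields one snapshot per year from 1986 to 2012, i.e. 27 time stamps, matching the description. In the toy example, the 1986 paper falls outside the 1987 to 1989 window, so only the pair from the 1988 paper remains at time stamp 1989.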
Expert Finding. This data set is a benchmark for expert finding. It contains 1,781 experts of 13 topics.
Association Search. This data set is used to evaluate the effectiveness of association search approaches. It contains 8,369 author pairs specific to nine topics. Each author pair consists of a source author and a target author.
Topic Model Results for AMiner Data Set. These are the results of the ACT model on the AMiner data set, containing the top 1,000,000 papers and authors of 200 topics.
Co-author. This is a co-author network on the AMiner system which contains 1,560,640 authors and 4,258,946 co-author relationships.
Disambiguation. This data set is used for studying name disambiguation in a digital library. It contains 110 authors and their affiliations as well as their disambiguation results (ground truth).