Building a Knowledge Graph for Recommending...

Building a Knowledge Graph for Recommending ExpertsBehnam [email protected]

University of PittsburghPittsburgh, PA

Peter [email protected]

University of PittsburghPittsburgh, PA

ABSTRACTRecommending experts is an important challenge in many contexts.In this paper, we present a method to build a knowledge graphby integrating data from Google Scholar and Wikipedia to helpstudents find a research advisor or thesis committee member. Thisknowledge graph is used to power the exploratory search interfaceto recommend similar keywords and relevant scholars to the stu-dents with a limited level of knowledge and familiarity with thesubject of research.

CCS CONCEPTS• Information systems → Graph-based database models; In-formation integration; Web searching and information dis-covery.

KEYWORDSData Integration, Knowledge Graph, Recommender SystemsACM Reference Format:Behnam Rahdari and Peter Brusilovsky. 2019. Building a Knowledge Graphfor Recommending Experts. In KI2KG ’19: 1ST INTERNATIONAL WORK-SHOP ON CHALLENGES AND EXPERIENCES FROM DATA INTEGRATIONTO KNOWLEDGE GRAPHS, August 05, 2019, Anchorage, AK. ACM, NewYork, NY, USA, 4 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTIONRecommending experts is an important challenge in many contexts.The nature of this challenge is to find a knowledgeable personwith an advance expertise in one or more target topics among alarge number of potential candidates. A well-explored examplesare finding an expert for a specific project within a large companyor finding a doctor with advance knowledge of a specific diseasein a large city. While in these two contexts large companies andhospitals use knowledge management techniques to catalogue keyareas of expertise and use it to represent information about eachexpert, finding experts in other contexts could be more challenging.

The context that we target out in this paper is finding a researchadvisor. Every year many undergraduate, master-level and PhDstudents are facing a difficult of finding a research advisor. Whilelarge universities have many highly knowledgeable faculty, finding

Permission to make digital or hard copies of all or part of this work for personal orclassroom use is granted without fee provided that copies are not made or distributedfor profit or commercial advantage and that copies bear this notice and the full citationon the first page. Copyrights for components of this work owned by others than theauthor(s) must be honored. Abstracting with credit is permitted. To copy otherwise, orrepublish, to post on servers or to redistribute to lists, requires prior specific permissionand/or a fee. Request permissions from [email protected] ’19, August 05, 2019, Anchorage, AK© 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM.ACM ISBN 978-x-xxxx-xxxx-x/YY/MM. . . $15.00https://doi.org/10.1145/nnnnnnn.nnnnnnn

one with the expertise that matches student interests, requirements,and preparation is a hard problem. Whether this search is for find-ing an advisor for a summer research project, a faculty sponsor foran independent study, or a committee member for a PhD thesis,online sources frequently fail the students and they resort to theword of mouth within a limited circle of instructors, classmates,and university staff. One problem is a wide variety of sources whereadvisor information could be found online (department directories,publication sites, funding agency pages, personal home pages, etc.)Each of theses sources covers only some aspects of faculty exper-tise and frequently represent only a subset of available advisors.Another problem is the lack of "expertise catalog". A university usu-ally offers a catalog of courses and majors, but not a fine-grainedcatalog of expertise areas covered by faculty. As a result studentsfrequently can’t even properly name their target area of interest orformulate a Web search query when looking for advisors.

The focus of our project is to offer a single-access-point ex-ploratory search system, which allows students to discover theirtarget areas of interest and find relevant advisors within these ar-eas. In its core, the platform uses a knowledge expertise graph,which represents multiple connections between research topics andprospective research advisors within a large university or a largeresearch field. We build this graph by processing several knowledgesources about faculty and their research interests. This paper brieflyreviews the kind of knowledge graph we built, the process of itsbuilding by information extraction, and the information explorationsystem powered by this knowledge.

2 BACKGROUNDThe attempts to built “a map of science” representing most impor-tant areas of research expertise and their connections with expertshave been done in the past, however, the lack of proper informationsources made it hard to produce maps that are suitable for findingadvisors. A good examples are attempts presented in [9] and [6],where academic journals are used as proxies of expertise areas anda map of science is built by clustering journals by co-publicationlinks. While this map is useful as a “big picture” of science, itsuse for advisor finding is problematic since it represents expertiseon a very coarse-grain level and misses many prospective advisorwho are not frequent journal authors. However, the emergence ofmodern sites powered by a combination of advanced informationprocessing and collective wisdom makes the task of building a fine-grain knowledge network of experts and expertise areas feasible.In our work we rely most extensively to two of these sites - GoogleScholar andWikipedia.

Google Scholar has been long recognized among the best freelyaccessible academic information sources in term of coverage andaccessibility [3]. It has been compared positively with numberof similar citation services namely Web of Science [1] , PubMed

https://doi.org/10.1145/nnnnnnn.nnnnnnn

https://doi.org/10.1145/nnnnnnn.nnnnnnn

KI2KG ’19, August 05, 2019, Anchorage, AK Rahdari and Brusilovsky.

Figure 1: High level Knowledge Graph schema: from left to right, "individual nodes" (blue) store scholar’s demographic in-formation , "keyword nodes" (red) keep detailed information about topic keywords and "category nodes"(green) convey thehierarchical association between categories and semantic relationship between keywords.

and Scopus [2, 5]. Yet, although Google Scholar contains nearly160 million documents [7] covering a large portion of publisheddocuments, the lack of semantic connections between conceptsand keywords within these documents makes it difficult to use thesystem for finding advisors, especially by less experienced students.

Wikipedia is commonly used by researchers to compute thesemantic relatedness within documents [8, 10, 11], extract OpenInformation [13] and mine meaning using relations, facts and de-scription to extracts and makes use of the concepts [4].

3 BUILDING THE KNOWLEDGE GRAPHTo support students in finding advisors we created a knowledgegraph using data from Google Scholar and enriching it semanti-cally using Wikipedia. In turn, this graph was used to power aninteractive exploratory recommendation interface, which make thetask of advisor-finding easier, especially for students with a limitedlevel of knowledge and familiarity with the subject of research. Tosupport several advisor-finding scenarios, we build several versionsof the graph. The graph presented in this paper is focused on taskof finding a top expert in a specific topic of interest within somebroad field of research (such as Artificial Intelligence) across manyuniversities. This is a typical task for a student selecting a PhDprogram to join or for a senior PhD student looking for an externalthesis committee member.

3.1 Data Sources3.1.1 Google Scholar. We exploited the information of 1000 activescholars in two popular fields of computer science; Artificial Intelli-gent and Computer Architecture (focusing on top 500 scholars perfield). For each individual, we extracted the following information(see Table 1):

• Name : Full name of the scholar.• Affiliation : The university or research institution the scholaraffiliated with.

• Verified Email Domain: It used to check the validity of thescholar profile.

• Self-Defined Keywords: A list of up to five keywords de-fined by scholars to describe their research interests.

• Citations : The total number of citations received by all thescholar’s publications.

• h-index : The h-index measures the citation impact andproductivity of a scholar’s publications. We use this mea-sure alongside with other quantitative scores to re-order theresults of the recommendations.

• i10-Index : i10-Index describes as the total number of scholar’spublications with 10 citations and more. This score, which isonly used by Google Scholar, also used to re-rank the resultsof the recommendations.

• Recent publications (20) : We used 20 most recent publica-tions to generate additional keywords representing currentinterests of each scholar. The keywords were extracted fromthe titles of recent publications as follows: After removingstop-words, we generated all the possible keyword candi-dates as uni-grams, bi-grams, and tri-grams. Next, we onlykept the keywords that have an entry in Wikipedia (seekeyword verification below).

• Top Co-Authors (10) : For each scholar, we extracted a listof top 10 co-authors from their Google Scholar profile.

3.1.2 Wikipedia. We used Wikipedia to add a semantic layer toprofiles extracted from Google Scholar. Throughout this process,we also obtained some useful information that led to a strongerconnection between keywords and enables us to add weight toeach scholar-keyword relation. Wikipedia API has been used forthe following purposes:

• Keyword verification : As mentioned before, we collectedtwo sets of keywords for each scholar: self-defined keywordsand keywords extracted from recent publications. WikipediaAPI has been used to verify the validity of these keywordsby using fuzzy match technique to find the Wikipedia entrydescribing the keyword. We removed all keywords that didnot match with any article in Wikipedia. While Wikipediamight miss articles for some less popular research topics,we need to have all topic keywords explainable for the stu-dent audience and a match to a Wikipedia article was the

Building a Knowledge Graph for Recommending Experts KI2KG ’19, August 05, 2019, Anchorage, AK

Figure 2: Interface Design of Research Advisor Recommendation

best way to assure it. For all remaining keywords, we calcu-lated association weight between a keyword and a scholaras cosine similarity between the full-text Wikipedia entry ofeach keyword and concatenated text from scholar’s recentpublications.

• Entry Summary : To offer student users a short descriptionof each topic keyword, we collected page summary for allkeywords using Wikipedia API.

• Top relevant keywords (10) : Most (if not all) Wikipediapages have multiple links to similar or related articles. Wecollected the top 10 links based on the number of their oc-currences in each page. We employ these links to create ahighly connected network of keywords.

• Entry Categories : Wikipedia uses categories to group sim-ilar articles. We extracted all categories associated with apage and used a full Wikipedia category hierarchy schemato find relationships between categories in our data set.

3.2 Graph RepresentationWe used Neo4j [12] graph database to represent information aboutall scholars and keywords. The overall schema of the knowledgegraph is represented in Figure 1. As is shown, there are three distinctnode types in the graph:

• Individual : This node type conveys demographic infor-mation about scholars including full name, affiliation, ver-ified email domain, and URLs of personal homepage andWikipedia page (if exist). Individual nodes are connected via"works_with" links which represent the co-authorship rela-tions between scholars. An individual node also connects toseveral keyword nodes that represent the scholar’s researchinterests and expertise. The individual nodes with a dashedborder represent scholars added via co-authorship extractionwho are not among the top 500 extracted scholars. Thesenodes are not considered in the final recommendations andonly used to indicate the connections of the top scholars.

• Keyword : There are three types of keyword nodes. Self-Defined keywords, keywords extracted from recent scholar’spublications, and relevant keywords that are represent theconnection between two other types (shown by a dashedborder) and will not appear in the recommendations. The

relationship between keywords represented by "has_link"arc is established if the target node has been mentioned inthe source node’s entry page. Keyword nodes are connectedto individual nodes via "has_key" and to category nodes via"belongs_to" arcs.

• Category :We employed a full hierarchical schema ofWikipediacategories to represent the inter-connectivity between cat-egories in our data set. These relations are presented as"has_child" arc in Figure 1. The category nodes are used tofind the semantic relationship between keywords.

4 USING THE KNOWLEDGE GRAPH4.1 Interface DesignThe knowledge graph is used to power the exploratory searchinterface for finding advisors. the interface consist of four mainsections.

4.1.1 Instant search box. User can use the search box to search fortopic keyword or scholars of interest (Figure 2:B). When the userstarts typing, a list of matching keywords and scholars will appear.When an item is selected from the list, it will automatically addedto proper location on left or right side of the interface. at the sametime, an updated list of recommendations will be presented to theuser.

4.1.2 Favorite keywords. This section (Figure 2:A) shows user’s“favorite” keywords. User add keywords to this list using the instantsearch box or by clicking on the plus button next to each recom-mended keyword. User can interact with three buttons on the rightside of each keywords to (1) see more information about that key-word (including Wikipedia summary, similar keywords and otherscholars with this research interest), (2) remove the keyword fromthe favorite list and (3) enable/disable the effect of this keyword inthe list of recommendations.

4.1.3 Favorite scholars. Similar to favorite keywords, the user canassemble a list of favorite scholars (Figure 2:C). A new favoritescholar could be added to the list from the instant search resultsor a list or recommended scholars by clicking on plus button nextto a recommended scholar. The three buttons on the right side ofeach favorite scholars could be used to obtain more details about

KI2KG ’19, August 05, 2019, Anchorage, AK Rahdari and Brusilovsky.

Artificial Intelligent Computer Architectureall unique all unique

Self-definedKeywords 1916 628 1671 493

PublicationConcepts 5946 1985 5889 1650

RelevantKeywords 37339 24712 30677 21355

WikipediaCategories 3488 857 2496 1775

Co-AuthorRelationship 5727 4771 5096 3287

Table 1: Data Statistics: number of item for each field

the scholar (affiliation, full list of research interests, and similarscholars), remove the scholar from the list, and enable/disable theeffect of this scholar in the final recommendation. Together with thefavorite keywords list explained above, the list of favorite scholarsform a user profile of interests, which the user gradually assembleswhile exploring possible areas of interests and scholars. In turn, theprofile of interests is used to generate further recommendations asexplained below.

4.1.4 Recommendations. This section (Figure 2:D) consist of twosubsections. Recommended keywords (Figure 2:D1) shows the listof three most relevant additional keywords, which are suggestedgiven already selected (and enabled) user’s favorite keywords andscholars. User can see more information about the keyword (simi-lar to favorite keyword section) and also add these recommendedkeyword to their favorite list using two circular button on the rightside of each keyword. Recommended scholars (Figure 2:D2) shows alist of recommended scholars, which are most relevant to the active(enabled) favorite topic keywords and most similar to the activefavorite scholars. For each recommended scholar the list showsbasic personal and academical information. User can also see thesimilarity between the recommended scholars and their profile ofinterests represented by favorite keywords and scholars.

4.2 Recommendation methodWe generate the recommendations using "Cypher Query Language"in Neo4j. In the following we explain how we generate recommen-dations for keywords and scholars.

4.2.1 Keyword Recommendation. In order to recommend similarkeywords, we use user’s favorite keywords and scholars. Each key-word is connected to other keywords in two ways: 1- Via the similarresearch interest between scholars and 2- Via similar relevant key-words and categories. We consider both of these relations to findsimilar keywords. In the final list, we sorted the keywords basedon the number of occurrences then we chose top three keywordsto be presented to the user.

4.2.2 Scholar Recommendation. Similar to keyword recommenda-tion, we use both favorite keywords and scholars. There are threecriteria for scholar recommendation: weighted scholar’s research

interests, co-authorship relationship between scholars, and con-nection between scholar interest through relevant keywords andcategories. After generating the list of candidate scholars, we sort itbased on the similarity score (calculated based on weighted similar-ity score for each of three criteria) and present the top ten resultsto the user.

5 DISCUSSION AND FUTUREWORKWe presented a method to build a knowledge graph by integratingdata from Google Scholar and Wikipedia to help students withlimited knowledge about the subject find a research advisor orthesis committee member. Although Google scholar covers a vari-ety of publications and patents, additional sources of information(scholar’s active research projects, funding information, etc.) couldmake the knowledge graph more connected and provide the userswith additional critical information when it comes to finding anadvisor. We also plan to refine our keyword extraction technique.More sophisticated methods of extraction using natural languageprocessing and machine learning could potentially improve thesemantic relation between concepts and provide users with a morerealistic set of research interests for scholars. We also designinga series of controlled user study and field studies to evaluate theusability and value of the exploratory search interface. We hope,that these user studies will provide valuable insights for improvingthe knowledge graph and the interface.

REFERENCES[1] Joost CF De Winter, Amir A Zadpoor, and Dimitra Dodou. 2014. The expansion

of Google Scholar versus Web of Science: a longitudinal study. Scientometrics 98,2 (2014), 1547–1565.

[2] Matthew E. Falagas, Eleni I. Pitsouni, George A. Malietzis, and Georgios Pappas.2008. Comparison of PubMed, Scopus, Web of Science, and Google Scholar:strengths and weaknesses. The FASEB Journal 22, 2 (2008), 338–342. https://doi.org/10.1096/fj.07-9492LSF arXiv:https://doi.org/10.1096/fj.07-9492LSF PMID:17884971.

[3] Philipp Mayr and Anne-Kathrin Walter. 2007. An exploratory study of GoogleScholar. Online Information Review 31, 6 (2007), 814–830.

[4] Olena Medelyan, David Milne, Catherine Legg, and Ian H Witten. 2009. Miningmeaning from Wikipedia. International Journal of Human-Computer Studies 67, 9(2009), 716–754.

[5] John Mingers and Martin Meyer. 2017. Normalizing Google Scholar data for usein research evaluation. Scientometrics 112, 2 (2017), 1111–1121.

[6] Colin Murray, Weimao Ke, and Katy Börner. 2006. Mapping Scientific Disciplinesand Author Expertise Based on Personal Bibliography Files. In Tenth InternationalConference on Information Visualisation (IV’06). 258–263. https://doi.org/10.1109/IV.2006.73

[7] Enrique Orduña-Malea, Juan Manuel Ayllón, Alberto Martín-Martín, andEmilio Delgado López-Cózar. 2014. About the size of Google Scholar: playingthe numbers. arXiv preprint arXiv:1407.6239 (2014).

[8] Behnam Rahdari and Peter Brusilovsky. 2019. User-controlled hybrid recommen-dation for academic papers. In Proceedings of the 24th International Conference onIntelligent User Interfaces: Companion. ACM, 99–100.

[9] Richard M. Shiffrin and Katy Börner. 2004. Mapping knowledge domains. Pro-ceedings of the National Academy of Sciences 101, suppl 1 (2004), 5183–5185.https://doi.org/10.1073/pnas.0307852100

[10] Michael Strube and Simone Paolo Ponzetto. 2006. WikiRelate! Computing se-mantic relatedness using Wikipedia. In AAAI, Vol. 6. 1419–1424.

[11] Max Völkel, Markus Krötzsch, Denny Vrandecic, Heiko Haller, and Rudi Studer.2006. Semantic wikipedia. In Proceedings of the 15th international conference onWorld Wide Web. ACM, 585–594.

[12] Wikipedia contributors. 2019. Neo4j — Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=Neo4j&oldid=892454531 [Online; accessed2-May-2019].

[13] Fei Wu and Daniel S. Weld. 2010. Open Information Extraction Using Wikipedia.In Proceedings of the 48th Annual Meeting of the Association for ComputationalLinguistics (ACL ’10). Association for Computational Linguistics, Stroudsburg,PA, USA, 118–127. http://dl.acm.org/citation.cfm?id=1858681.1858694

https://doi.org/10.1096/fj.07-9492LSF

https://doi.org/10.1096/fj.07-9492LSF

http://arxiv.org/abs/https://doi.org/10.1096/fj.07-9492LSF

https://doi.org/10.1109/IV.2006.73

https://doi.org/10.1109/IV.2006.73

https://doi.org/10.1073/pnas.0307852100

https://en.wikipedia.org/w/index.php?title=Neo4j&oldid=892454531

https://en.wikipedia.org/w/index.php?title=Neo4j&oldid=892454531

http://dl.acm.org/citation.cfm?id=1858681.1858694

Building a Knowledge Graph for Recommending...

Documents

Transcript of Building a Knowledge Graph for Recommending...