Learning to Construct and Reason with a Large KB of Extracted Information William W. Cohen Machine...




Slide 1

Learning to Construct and Reason with a Large KB of Extracted Information. William W. Cohen, Machine Learning Dept and Language Technology Dept. Joint work with: Tom Mitchell, Ni Lao, William Wang, Kathryn Rivard Mazaitis, Richard Wang, Frank Lin, Estevam Hruschka, Jr., Burr Settles, Partha Talukdar, Derry Wijaya, Edith Law, Justin Betteridge, Jayant Krishnamurthy, Bryan Kisiel, Andrew Carlson, Weam Abu Zaki, Bhavana Dalvi, Malcolm Greaves, Lise Getoor, Jay Pujara, Hui Miao.

Slide 2
Outline. Background: information extraction and NELL. Key ideas in NELL: coupled learning; multi-view, multi-strategy learning. Inference in NELL: inference as another learning strategy; learning in graphs; Path Ranking Algorithm; ProPPR; promotion as inference. Conclusions & summary.

Slide 3
But first... some backstory:

Slide 4
...an unrelated project...

Slide 5
...called SimStudent.

Slide 6
SimStudent learns rules to solve a problem step-by-step and then guides a student through solving problems step-by-step.

Slide 7
Quinlan's FOIL.

Slide 8
Summary of SimStudent. It is possible for a human author (e.g., a middle-school teacher) to build an ITS by building a GUI, then demonstrating problem solving and having the system learn how from examples. The rules learned by SimStudent can be used to construct a student model; with parameter tuning, this can predict how well individual students will learn, better than the state of the art in some cases! AI problem solving with a cognitively predictive model, with ILP as a key component.

Slide 9
Information Extraction. Goal: extract facts about the world automatically by reading text. IE systems are usually based on learning how to recognize facts in text,
and then (sometimes) aggregating the results. Latest-generation IE systems need not require large amounts of training, and IE does not necessarily require subtle analysis of any particular piece of text.

Slide 10
Never-Ending Language Learning (NELL). NELL is a broad-coverage IE system, simultaneously learning 500-600 concepts and relations (person, celebrity, emotion, acquiredBy, locatedIn, capitalCityOf, ...). Starting point: containment/disjointness relations between concepts, types for relations, and O(10) examples per concept/relation. It uses a 500M-web-page corpus plus live queries, and has been running (almost) continuously for over three years. It has learned over 50M beliefs, over 1M of them high-confidence; about 85% of the high-confidence beliefs are correct.

Slide 11
Demo: http://rtw.ml.cmu.edu/rtw/

Slides 12-14
NELL screenshots.

Slides 15-16
More examples of what NELL knows.

Slide 17
Outline (as in Slide 2).

Slide 18
Bootstrapped SSL learning of lexical patterns. Extract cities, given four seed examples of the class city: Paris, Pittsburgh, Seattle, Cupertino. Learned patterns such as "mayor of arg1", "live in arg1", "arg1 is home of", and "traits such as arg1" yield correct extractions (San Francisco, Austin, Berlin) mixed with errors (denial, anxiety, selfishness): it's underconstrained!!

Slide 19
One key to accurate semi-supervised learning. "Krzyzewski coaches the Blue Devils." Coupled predicates over typed noun-phrase pairs NP1, NP2 — coachesTeam(c,t) and playsForTeam(a,t), with types athlete, team, person, coach, sport — versus a single predicate over one NP in "Krzyzewski coaches the Blue Devils":
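The bootstrapping loop of Slide 18 alternates between two steps: use the known instances to harvest extraction patterns from the corpus, then use those patterns to extract new instances. A minimal Python sketch of that loop — the toy corpus and every string in it are invented for illustration, and real extractors score and filter patterns rather than trusting them all:

```python
import re

# Toy corpus (hypothetical sentences); NELL uses 500M web pages.
corpus = [
    "the mayor of Paris spoke",
    "the mayor of Berlin spoke",
    "people live in Pittsburgh happily",
    "people live in Berlin happily",
    "traits such as anxiety are common",
]

seeds = {"Paris", "Pittsburgh", "Seattle", "Cupertino"}

def bootstrap(corpus, seeds, rounds=2):
    """Alternate between (1) learning patterns: contexts in which
    known instances appear, and (2) extracting: new words that fill
    those contexts."""
    instances = set(seeds)
    for _ in range(rounds):
        # Step 1: every context around a known instance becomes a pattern.
        patterns = set()
        for sent in corpus:
            for inst in instances:
                if inst in sent:
                    patterns.add(sent.replace(inst, "ARG1"))
        # Step 2: any single word filling ARG1 becomes a new instance.
        for sent in corpus:
            for pat in patterns:
                regex = re.escape(pat).replace("ARG1", r"(\w+)")
                m = re.fullmatch(regex, sent)
                if m:
                    instances.add(m.group(1))
    return instances

print(bootstrap(corpus, seeds))  # the seeds plus "Berlin"
```

Here the loop happens to stay clean, but one bad extraction would donate all of its contexts as new patterns in the next round — exactly the underconstrained drift (denial, anxiety, selfishness) shown on the slide.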
learning coach(NP) alone is a hard (underconstrained) semi-supervised learning problem, while the coupled version — which adds teamPlaysSport(t,s) and playsSport(a,s) — is a much easier (more constrained) semi-supervised learning problem. Two lessons: (1) it is easier to learn many interrelated tasks than one isolated task; (2) it is also easier to learn using many different types of information.

Slide 20
Outline (as in Slide 2).

Slide 21
Another key idea: use multiple types of information. Architecture: the ontology and populated KB at the center, fed from the Web by CBL (text extraction patterns), SEAL (HTML extraction patterns), PRA (learned inference rules), and Morph (a morphology-based extractor), combined by evidence integration.

Slide 22
Extrapolating user-provided seeds. Set expansion (SEAL): given seeds (kdd, icml, icdm), formulate a query to a search engine and collect semi-structured web pages; detect lists on these pages; merge the results, ranking items that occur frequently on good lists highest. Details: Wang & Cohen, ICDM 2007, 2008; EMNLP 2008, 2009.

Slide 23
Sample semi-structured pages for the concept "dictators".

Slide 24
Another example of propagation: extrapolating seeds in SEAL (as in Slide 22).

Slide 25
Extrapolating user-provided seeds (as in Slide 22); 300 pages/concept > 100 pages/concept.

Slide 26
The architecture diagram again: ontology and
populated KB at the center, fed from the Web by CBL (text extraction patterns), SEAL (HTML extraction patterns), PRA (learned inference rules), and Morph (a morphology-based extractor), combined by evidence integration. Another key idea: use multiple types of information.

Slide 27
Outline (as in Slide 2), with one addition: background on learning in graphs.

Slide 28
Background: Personal Info Management as Similarity Queries on a Graph [SIGIR 2006, EMNLP 2008, TOIS 2010], Einat Minkov, Univ. Haifa. (Diagram: an email graph whose nodes include the terms "proposal", "CMU", "NSF", "graph", the person "William", the dates 6/17/07 and 6/18/07, and the address [email protected], connected by edges such as Sent-To and Term-In-Subject.)

Slide 29
Learning about graph similarity. Personalized PageRank, aka Random Walk with Restart (RWR): a similarity measure for nodes in a graph, analogous to TF-IDF for text in a WHIRL database; a natural extension of PageRank; amenable to learning the parameters of the walk (gradient search with various optimization metrics: Toutanova, Manning & Ng, ICML 2004; Nie et al., WWW 2005; Xi et al., SIGIR 2005), or to reranking, etc. Queries: given a type t* and a node x, find y such that T(y)=t* and y~x; given a type t* and a node set X, find y such that T(y)=t* and y~X.

Slide 30
Many tasks can be reduced to similarity queries. Person name disambiguation: [term "andy", file msgId] → person. Threading (what are the adjacent messages in this thread? — a proxy for finding more messages like this one): [file msgId] → file. Alias finding (what are the email addresses of Jason?): [term "Jason"] → email-address. Meeting attendee finder (which email addresses, i.e., persons, should I notify about this meeting?): [meeting mtgId] → email-address.

Slide 31
Results on a sample task: the Mgmt.
game corpus, person name disambiguation, with learning.

Slide 32
Learning about graph similarity: the next generation. Personalized PageRank / RWR queries as above. Ni Lao's thesis (2012): new, better learning methods; richer parameterization; faster PPR inference; structure learning. Other tasks: relation-finding in parsed text, information management for biologists, and inference in large noisy knowledge bases.

Slide 33
Lao: a learned random-walk strategy is a weighted set of random-walk "experts", each of which is a walk constrained by a path (i.e., a sequence of relations). Task: recommending papers to cite in a paper being prepared. Example experts: (1) papers co-cited with on-topic papers; (6) approximately standard IR retrieval; (7, 8) papers cited during the past two years; (12, 13) papers published during the past two years.

Slide 34
Another study: learning inference rules for a noisy KB (Lao, Cohen & Mitchell 2011; Lao et al. 2012), e.g., finding synonyms of the query team. The random-walk interpretation is crucial: it is worth 10-15 extra points in MRR.

Slide 35
Another key idea: use multiple types of information (the architecture diagram of Slide 21 again).

Slide 36
Outline (as in Slide 2), now at: PRA + FOL — ProPPR and joint learning for inference.

Slide 37
How can you extend PRA to non-binary predicates? To paths that include constants? To recursive rules? Current direction: using ideas from PRA in a general first-order logic: ProPPR.

Slide 38
athletePlaySportViaRule(Athlete,Sport) :- onTeamViaKB(Athlete,Team), teamPlaysSportViaKB(Team,Sport).
teamPlaysSportViaRule(Team,Sport) :- memberOfViaKB(Team,Conference), hasMemberViaKB(Conference,Team2), playsViaKB(Team2,Sport).
teamPlaysSportViaRule(Team,Sport) :- onTeamViaKB(Athlete,Team), athletePlaysSportViaKB(Athlete,Sport).
A limitation: paths are learned separately for each relation type, and one learned rule can't call another. PRA can learn this.

Slide 39
The same limitation, from the other side: PRA can't learn the mutually recursive program
athletePlaySport(Athlete,Sport) :- onTeam(Athlete,Team), teamPlaysSport(Team,Sport).
athletePlaySport(Athlete,Sport) :- athletePlaySportViaKB(Athlete,Sport).
teamPlaysSport(Team,Sport) :- memberOf(Team,Conference), hasMember(Conference,Team2), plays(Team2,Sport).
teamPlaysSport(Team,Sport) :- onTeam(Athlete,Team), athletePlaysSport(Athlete,Sport).
teamPlaysSport(Team,Sport) :- teamPlaysSportViaKB(Team,Sport).

Slide 40
Solution: a major extension of PRA to include a large subset of Prolog:
athletePlaySport(Athlete,Sport) :- onTeam(Athlete,Team), teamPlaysSport(Team,Sport).
athletePlaySport(Athlete,Sport) :- athletePlaySportViaKB(Athlete,Sport).
teamPlaysSport(Team,Sport) :- memberOf(Team,Conference), hasMember(Conference,Team2), plays(Team2,Sport).
teamPlaysSport(Team,Sport) :- onTeam(Athlete,Team), athletePlaysSport(Athlete,Sport).
teamPlaysSport(Team,Sport) :- teamPlaysSportViaKB(Team,Sport).

Slide 41
A sample ProPPR program: Horn rules, with features attached to each rule (variables from the head are OK in the features).

Slide 42
...and its search space. This is a graph!

Slide 43
The score of a query solution (e.g., Z=sport for about(a,Z)) depends on the probability of reaching its node: transition probabilities are learned from the features of the rules, and implicit reset transitions (taken with probability α) lead back to the query node, so we are looking for answers supported by many short proofs. Grounding size is O(1/αε), i.e., independent of DB size, allowing fast approximate incremental inference (Reid, Lang & Chung 2008). Learning is a supervised variant of personalized PageRank (Backstrom & Leskovec, 2011). The scoring is exactly as in Stochastic Logic Programs [Cussens, 2001].

Slide 44
Sample task: citation matching (cf. Alchemy: Poon & Domingos). Dataset: CORA, 1295 citations of 132 distinct papers. Training set: sections 1-4; test set: section 5. ProPPR program: translated from the corresponding Markov logic network (dropping non-Horn clauses). Number of rules: 21.
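Slide 43's claim that grounding size is independent of DB size comes from approximating personalized PageRank locally: starting from the query node, probability mass is "pushed" outward only while it is non-negligible, so only a bounded neighborhood of the query is ever materialized. A minimal sketch in the spirit of the push algorithms cited above — the graph is hypothetical, and this omits ProPPR's learned, feature-dependent edge weights, using uniform transitions instead:

```python
def approx_ppr(adj, start, alpha=0.15, eps=1e-4):
    """Local 'push' approximation of personalized PageRank with
    reset probability alpha. Each push converts an alpha-fraction
    of a node's residual mass into score and spreads the rest to
    its out-neighbors; nodes whose residual stays below eps are
    never expanded, so the work is bounded by roughly
    1/(alpha * eps), independent of the total graph size."""
    p = {}                    # approximate PPR scores
    r = {start: 1.0}          # residual (not-yet-pushed) mass
    frontier = [start]
    while frontier:
        v = frontier.pop()
        out = adj.get(v, [])
        rv = r.get(v, 0.0)
        if not out or rv <= eps * len(out):
            continue          # too little mass: don't expand
        r[v] = 0.0
        p[v] = p.get(v, 0.0) + alpha * rv
        share = (1 - alpha) * rv / len(out)
        for u in out:
            r[u] = r.get(u, 0.0) + share
            frontier.append(u)
    return p

# Tiny proof graph (hypothetical): query -> rule -> answer, with a
# self-loop standing in for the solution node's outgoing edge.
graph = {"query": ["rule1"], "rule1": ["answer"], "answer": ["answer"]}
scores = approx_ppr(graph, "query")
# scores["answer"] is close to 0.7225 = (1 - 0.15)**2
```

Answers reachable by short proofs from the query accumulate most of the mass, which is exactly the "answers supported by many short proofs" bias described on Slide 43.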
Slide 45
Task: citation matching.

Slide 46
Time: citation matching vs. Alchemy. Grounding is independent of DB size.

Slide 47
Accuracy: citation matching. AUC scores (0.0 = low, 1.0 = high); w=1 is before learning. Compared: UW rules vs. our rules.

Slide 48
It gets better. Learning uses many example queries, e.g., sameCitation(c120,X) with X=c123 labeled positive, X=c124 labeled negative, and so on. Each query is grounded to a separate small graph (for its proof), and the goal is to tune the weights on the edge features to optimize RWR on the query graphs. We can do SGD and run RWR separately on each query graph; since the graphs share edge features, some synchronization is needed.

Slide 49
Learning can be parallelized by splitting on the separate groundings of each query.

Slide 50
You can do more with ProPPR.

Slide 51
Back to NELL (the architecture diagram of Slide 21 again).

Slide 52
Experiment: take the top K paths for each predicate learned by Lao's PRA (I don't know how to do structure learning for ProPPR yet), convert them to a mutually recursive ProPPR program, and train weights on the entire program:
athletePlaySport(Athlete,Sport) :- onTeam(Athlete,Team), teamPlaysSport(Team,Sport).
athletePlaySport(Athlete,Sport) :- athletePlaySportViaKB(Athlete,Sport).
teamPlaysSport(Team,Sport) :- memberOf(Team,Conference), hasMember(Conference,Team2), plays(Team2,Sport).
teamPlaysSport(Team,Sport) :- onTeam(Athlete,Team), athletePlaysSport(Athlete,Sport).
teamPlaysSport(Team,Sport) :- teamPlaysSportViaKB(Team,Sport).

Slide 53
More details. Train on NELL's KB as of iteration 713; test on new facts from later iterations. Three subdomains of NELL are used: pick a seed entity S, pick the top M entities reached by a (simple, untyped) RWR from S, and project the KB to just these M entities. Three subdomains, six values of M.

Slides 54-55
Time to answer queries.

Slide 56
Time (seconds) to answer 10 recursive queries.

Slide 57
Time (seconds) to train on the top-5K DB.

Slide 58
ProPPR vs. Alchemy: Alchemy takes more than 4 days to train discriminatively on the recursive theory with a 500-entity sample, and pseudo-likelihood training fails on some recursive rule sets.

Slide 59
Outline (as in Slide 2).

Slide 60
More detail on NELL. For iteration i = 1, ..., 715: for each view (lexical patterns, ..., PRA), distantly train that view using KB_i, propose new candidate beliefs based on the learned view-specific classifier, then heuristically find the best candidate beliefs and promote them into KB_{i+1}. It is not obvious how to promote in a principled way.

Slide 61
Promotion: identifying new correct extractions from a pool of noisy extractions. Many types of noise are possible: co-referent entities; missing or spurious labels; missing or spurious relations; violations of the ontology (e.g., an athlete that is not a person). Identifying true extractions requires joint reasoning, e.g., pooling information about co-referent entities and enforcing mutual exclusion of labels and relations. Problem: how can we integrate extractions from multiple sources, in the presence of ontological constraints, at the scale of millions of extractions?
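The joint reasoning described in the slides that follow is made tractable in PSL by relaxing truth values to the [0,1] interval: conjunction becomes the Lukasiewicz t-norm, and each ground rule contributes a hinge-loss "distance to satisfaction" instead of a hard constraint. A minimal sketch, using the Kyrgyzstan example that follows (the numeric truth values are invented for illustration):

```python
def luk_and(a, b):
    """Lukasiewicz t-norm: soft conjunction on [0,1] truth values."""
    return max(0.0, a + b - 1.0)

def luk_or(a, b):
    """Lukasiewicz t-conorm: soft disjunction on [0,1] truth values."""
    return min(1.0, a + b)

def distance_to_satisfaction(body, head):
    """A ground rule body => head is satisfied when the head is at
    least as true as the body; the hinge penalty is the gap."""
    return max(0.0, body - head)

# One grounding of the mutual-exclusion constraint Mut(country, bird):
#   Lbl(Kyrgyzstan, country) AND Lbl(Kyrgyzstan, bird) => false
country = 0.9   # hypothetical extractor confidences, not real NELL values
bird = 0.4
violation = distance_to_satisfaction(luk_and(country, bird), 0.0)
# violation == 0.3 (up to floating point): this assignment pays a
# weighted penalty for labeling Kyrgyzstan as both country and bird.
```

Because each such penalty is a convex hinge in the atom truth values, MPE inference over millions of ground rules reduces to a single convex optimization, which is what lets the promotion step below run at scale.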
Slide 62
An example. Ontology: Dom(hasCapital, country); Mut(country, bird). Sample extractions: Lbl(Kyrgyzstan, bird); Lbl(Kyrgyzstan, country); Lbl(Kyrgyz Republic, country); Rel(Kyrgyz Republic, Bishkek, hasCapital). Entity resolution: SameEnt(Kyrgyz Republic, Kyrgyzstan). This is a knowledge-graph view of NELL's extractions; what you want is the clean graph in which Kyrgyzstan and Kyrgyz Republic are one entity labeled country, with Rel(hasCapital) pointing to Bishkek and the spurious bird label gone.

Slide 63
Representation as a noisy knowledge graph, before and after knowledge graph identification. (Knowledge graph identification: Lise Getoor, Jay Pujara, and Hui Miao @ UMD.)

Slide 64
Graph identification as joint reasoning: Probabilistic Soft Logic (PSL), a templating language for hinge-loss MRFs that is much more scalable. The model is specified as a collection of logical formulas; formulas are ground by substituting literal values; truth values of atoms are relaxed to the [0,1] interval; truth values of formulas are derived from the Lukasiewicz t-norm. Each ground rule r has a weighted potential φ_r corresponding to a distance to satisfaction. PSL defines a probability distribution over atom truth-value assignments I; most-probable-explanation (MPE) inference is convex, and running time scales linearly with the number of ground rules |R|.

Slide 65
PSL representation of heuristics for promotion: promote any candidate; promote hints (the old promotion strategy); be consistent about labels for duplicate entities.

Slide 66
PSL representation of ontological rules, adapted from Jiang et al., ICDM 2012: be consistent with the constraints from the ontology. (These are too expressive for ProPPR.)

Slide 67
Datasets & results. Evaluation on the NELL dataset from iteration 165: 1.7M candidate facts and 70K ontological constraints, with predictions on 25K facts from a 2-hop neighborhood around the test data. KGI-PSL beats the other methods and runs in just 10 seconds!
Method            F1    AUC
Baseline          .828  .873
NELL              .673  .765
MLN (Jiang, 12)   .836  .899
KGI-PSL           .853  .904

Slide 68
Summary. Background: information extraction and NELL. Key ideas in NELL: coupled learning; multi-view, multi-strategy learning. Inference in NELL: inference as another learning strategy; learning in graphs; Path Ranking Algorithm; ProPPR; promotion as inference. Conclusions & summary.