Thamme Gowda@Stanford University, Nov 3rd, 2016
My work in Information Retrieval,
Machine Learning and NLP
1
HELLO!
+ I’m Thamme Gowda
+ University of Southern California, Los Angeles - MSCS
+ NASA Jet Propulsion Laboratory - Intern
+ Apache Software Foundation - Volunteer
+ Datoin - Co-Founder
+ You can find me online: @thammegowda
2
OVERVIEW
+ In this presentation
  + USC IRDS - DARPA Memex
  + NASA Jet Propulsion Lab
  + Datoin
+ Research
  + Clustering Web Pages
  + Mars Target Encyclopedia
+ Research Interests and motivations
3
USC INFORMATION RETRIEVAL AND DATA SCIENCE
+ Dr. Chris Mattmann’s group
+ Contributions to Free and Open Source Software
+ Top Apache Projects:
  + Apache Tika, Nutch, Joshua (Incubating)
  + Sparkler - A web crawler on Apache Spark
+ Involvement in DARPA Memex program
4
NASA JET PROPULSION LABORATORY
+ Summer Intern (full time), Fall Co-Op (part time)
+ Continued involvement with DARPA MEMEX
+ DARPA Data Driven Discovery of Models (D3M)
+ Mars Target Encyclopedia
+ Mars Landmarks Classification
5
MEMEX
Collaborators: Dr. Chris Mattmann, Paul Ramirez, Kyle Hundman, et al.
6
DARPA MEMEX
+ Web Crawling, Information Retrieval
+ Apache Solr based Search Index
+ Information Extraction
  + Names of people, organizations, locations
  + From location names to GPS coordinates
+ Object Recognition - Models from ImageNet dataset
7
TIKA-1787: NAMED ENTITY RECOGNITION
+ NER on Memex Dataset
+ Added NER support to Apache Tika [1]
  + MaxEnt Classifier from Apache OpenNLP (default) [2]
  + CRF Classifier from Stanford CoreNLP [3]
  + MITIE - IE toolkit from MIT LL
[1] https://wiki.apache.org/tika/TikaAndNER
[2] https://opennlp.apache.org/documentation/manual/opennlp.html#tools.namefind
[3] http://nlp.stanford.edu/software/CRF-NER.shtml
8
https://wiki.apache.org/tika/TikaAndNER
9
TIKA-1993: OBJECT RECOGNITION
+ Image Recognition support
+ Integrated TensorFlow’s Inception-V3 model
+ Evaluated multiple ways of integration
  + Command Line Invocation → S-l-o-w as a turtle
  + Java Native Interface → Transitive dependency issues
  + gRPC Client Server → Dependency version issues
  + REST Client Server → Works best, please use this!
+ https://wiki.apache.org/tika/TikaAndVision
10
TIKA-1993 DEMO
<meta name="OBJECT" content="German shepherd, German shepherd dog, German police dog, alsatian (0.36203)"/>
<meta name="OBJECT" content="military uniform (0.13061)"/>
* Photo Credits - Wikimedia.org
11
LABELING THE WEAPONS DATASET
+ 1.3 million images of DARPA MEMEX dataset
+ Detected objects in the images
+ Two Experiments
  + 1st time - top 2 objects
  + 2nd time - top 2 objects + confidence threshold of 0.3
+ Improved the efficiency for large jobs
+ 1.3 million images took ~ 36 hours on 32 CPU cores
+ Improvements are upstreamed to Apache Tika
+ Pushed the results back to Imagecatdev Solr
  http://imagecat.dyndns.org/weapons/imagespace-dev/
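The two labeling experiments can be sketched roughly as follows. This is a minimal illustration, not the code that ran on the Memex images: `parse_object_meta` and `select_labels` are hypothetical helper names, the input is assumed to look like the `content` strings in the TIKA-1993 demo, and treating the 0.3 threshold as inclusive is an assumption.

```python
import re

# Matches an OBJECT content string such as "military uniform (0.13061)":
# free-text label followed by a confidence value in parentheses.
_OBJECT_RE = re.compile(r"^(?P<label>.+?)\s*\((?P<conf>[0-9]*\.?[0-9]+)\)$")


def parse_object_meta(content):
    """Split one OBJECT content string into (label, confidence)."""
    match = _OBJECT_RE.match(content.strip())
    if match is None:
        raise ValueError(f"unexpected OBJECT content: {content!r}")
    return match.group("label"), float(match.group("conf"))


def select_labels(contents, top_k=2, min_confidence=None):
    """Keep the top_k most confident labels, optionally above a threshold."""
    parsed = [parse_object_meta(c) for c in contents]
    parsed.sort(key=lambda pair: pair[1], reverse=True)
    kept = parsed[:top_k]
    if min_confidence is not None:
        # Assumption: a label exactly at the threshold is kept.
        kept = [pair for pair in kept if pair[1] >= min_confidence]
    return kept


demo = [
    "German shepherd, German shepherd dog, German police dog, alsatian (0.36203)",
    "military uniform (0.13061)",
]
first_run = select_labels(demo, top_k=2)                       # experiment 1
second_run = select_labels(demo, top_k=2, min_confidence=0.3)  # experiment 2
```

On the demo image, the first setting keeps both labels, while the second keeps only the German shepherd label because "military uniform" falls below 0.3.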
12
http://imagecat.dyndns.org/weapons/imagespace-dev/
13
MEMEX CP1: HT CLASSIFIER
+ Using SVM
+ Created custom vectors
+ Stanford CoreNLP for tokenization
+ Features:
  + Unigrams
  + Selected Bigrams
+ All grams are lemmatized
+ Classification is done at the cluster level
+ https://github.com/USCDataScience/svm-classifier-memex
* Photo credits http://scikit-learn.org/
14
MEMEX CP1: EVAL. DATASET AU ROC
81.7% AU-ROC for 487 Clusters (Next best result: 65%)
15
MEMEX CP1: Sample Features
+ Cluster classification instead of individual documents
+ Lemmatization
+ Selected Bigrams and N-grams:
  + Adjectives and nouns - together
  + Adverbs and verbs - together
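A rough sketch of the bigram selection described above, assuming the text has already been tokenized, lemmatized and POS-tagged (for example by Stanford CoreNLP) into `(lemma, tag)` pairs with Penn Treebank tags. The function name `extract_features`, the adjacency rule (adjective directly before noun, adverb directly before verb) and the `left_right` feature naming are assumptions for illustration, not the classifier's actual code.

```python
def extract_features(tagged_lemmas):
    """Return unigram plus selected-bigram features for one document.

    tagged_lemmas: list of (lemma, pos_tag) pairs; Penn Treebank tags assumed.
    """
    features = [lemma for lemma, _ in tagged_lemmas]  # every lemma is a unigram
    pairs = zip(tagged_lemmas, tagged_lemmas[1:])     # adjacent token pairs
    for (left, left_tag), (right, right_tag) in pairs:
        adj_noun = left_tag.startswith("JJ") and right_tag.startswith("NN")
        adv_verb = left_tag.startswith("RB") and right_tag.startswith("VB")
        if adj_noun or adv_verb:
            features.append(f"{left}_{right}")
    return features


doc = [
    ("quick", "JJ"), ("response", "NN"), ("always", "RB"),
    ("reply", "VBP"), ("by", "IN"), ("phone", "NN"),
]
features = extract_features(doc)
```

Here only `quick_response` and `always_reply` are added as bigrams; the other adjacent pairs do not match either pattern.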
16
SPARKLER
Collaborators: Karanjeet Singh
17
SPARKLER [1] (aka Spark-Crawler)
+ Redesigned and reimagined crawler
  + Taking the best parts of Apache Nutch
  + Combined with recent advancements in distributed computing
+ Crawler database is redesigned → indexed store using Apache Lucene/Solr
+ Crawler pipeline is designed → CrawldbRDD
[1] https://github.com/uscdataScience/sparkler
18
SPARKLER - ROADMAP
+ Partitioning for Fair Fetching
+ Apache Solr Backend for crawldb
+ Stores Data on FS
+ OSGI based Plugin Framework (Apache Felix)
  + Regex URL Filter, JavaScript Engine
+ Admin Dashboard
+ Apache Kafka Integration
+ ApacheCon EU 2016
+ TODO: More Plugins from Nutch
+ TODO: Apache Incubator
Quick Start: https://github.com/USCDataScience/sparkler/wiki/sparkler-0.1
19
SPARKLER PIPELINE [1]
[1] https://github.com/USCDataScience/sparkler/wiki/Sparkler-Internals
20
Co-Founder
21
DATOIN - DATA TO INFORMATION [1]
+ A platform for data flow pipelines
+ Drag-drop-connect the components
+ SDK to build reusable components
+ Machine Learning components are our interest
+ All-round experience - idea, design, implement, test, deploy, collaborate
[1] http://datoin.com
22
Demo: http://datoin.com/applications/extraction?type=custom
Pipeline: http://datoin.com/pipeline/viewPipeline/pipeline-9d3bc507-4549-4139-8f0f-dc38a0adf354
*Photo from http://datoin.com
23
RESEARCH EXPERIENCE
+ Clustering Web Pages based on Structure and Style
+ Object Recognition - Landmarks Classification
+ Named Entity Recognition - Mars Target Encyclopedia
24
Clustering Web Pages Based On Structure And Style Similarity
Thamme Gowda and Chris Mattmann, IEEE IRI 2016
USC Information Retrieval and Data Science Group (irds.usc.edu)
25
AUTO EXTRACTOR [1]
+ Unsupervised Learning for Information Extraction
+ First Step - Clustering based on structure and style
+ Kernels
  + Structural Similarity using Tree Edit Distance between DOM trees
  + Style Similarity of CSS class names
+ Shared Near Neighbor Clustering
+ Distributable on Apache Spark
[1] https://github.com/thammegowda/autoextractor
26
SAMPLE WEB PAGES
Credits: http://www.armslist.com, http://trec-dd.org/dataset.html, http://memex.jpl.nasa.gov
27-30
METHOD OVERVIEW
CLUSTERING
METHOD: STEP #1 - STRUCTURAL SIMILARITY
Input: web pages from a crawler like Apache Nutch
MINIMUM TREE EDIT DISTANCE
● 3 edit operations (insert, delete, relabel)
● Normalized distance
* Zhang, K., & Shasha, D. (1989). Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6), 1245-1262.
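To make the tree edit distance concrete, here is a small sketch on toy DOM-like trees. It uses a plain memoized recursion over ordered forests with unit costs, not the Zhang-Shasha dynamic program cited above, so it is only suitable for tiny trees. The tuple-based tree encoding, the function names, and the choice to normalize by the combined node count are all assumptions made for this example.

```python
from functools import lru_cache

# A tree is (label, children), where children is a tuple of trees.
# A forest is a tuple of trees. Tuples keep everything hashable for caching.


def tree_size(tree):
    """Number of nodes in a tree."""
    _, children = tree
    return 1 + sum(tree_size(child) for child in children)


@lru_cache(maxsize=None)
def forest_distance(forest_a, forest_b):
    """Unit-cost edit distance between two ordered forests."""
    if not forest_a and not forest_b:
        return 0
    if not forest_a:
        return sum(tree_size(t) for t in forest_b)  # insert everything
    if not forest_b:
        return sum(tree_size(t) for t in forest_a)  # delete everything
    (label_a, kids_a), (label_b, kids_b) = forest_a[-1], forest_b[-1]
    # Delete the rightmost root of forest_a; its children take its place.
    delete = forest_distance(forest_a[:-1] + kids_a, forest_b) + 1
    # Insert the rightmost root of forest_b (mirror image of delete).
    insert = forest_distance(forest_a, forest_b[:-1] + kids_b) + 1
    # Map the two rightmost roots onto each other, relabeling if needed.
    relabel = (
        forest_distance(forest_a[:-1], forest_b[:-1])
        + forest_distance(kids_a, kids_b)
        + (0 if label_a == label_b else 1)
    )
    return min(delete, insert, relabel)


def tree_edit_distance(tree_a, tree_b):
    return forest_distance((tree_a,), (tree_b,))


def structural_similarity(tree_a, tree_b):
    """1 - normalized distance; dividing by total node count is an assumption."""
    total = tree_size(tree_a) + tree_size(tree_b)
    return 1.0 - tree_edit_distance(tree_a, tree_b) / total


page_a = ("html", (("body", (("div", ()), ("p", ()))),))
page_b = ("html", (("body", (("div", ()), ("span", ()))),))  # p relabeled to span
page_c = ("html", (("div", ()), ("p", ())))                  # body removed
```

Both `page_b` and `page_c` are one edit away from `page_a`: a relabel in the first case and a delete in the second.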
33
METHOD: STEP #2 - STYLE SIMILARITY
Input: web pages from a crawler like Apache Nutch
• Similar web pages have similar CSS styles
• XPath: "//*[@class]/@class"
• Simple measure:
  • Jaccard similarity on CSS class names
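The style measure can be sketched in a few lines. This example collects class names with Python's standard `html.parser` instead of evaluating the XPath expression, which is a substitution made to keep the sketch dependency-free. The helper names and the convention that two pages without any classes score 1.0 are assumptions.

```python
from html.parser import HTMLParser


class ClassNameCollector(HTMLParser):
    """Collects every CSS class name used in an HTML document."""

    def __init__(self):
        super().__init__()
        self.class_names = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                # A class attribute may list several space-separated names.
                self.class_names.update(value.split())


def css_classes(html):
    collector = ClassNameCollector()
    collector.feed(html)
    collector.close()
    return collector.class_names


def jaccard(set_a, set_b):
    """Size of the intersection divided by size of the union."""
    if not set_a and not set_b:
        return 1.0  # assumed convention for two pages with no classes at all
    return len(set_a & set_b) / len(set_a | set_b)


def style_similarity(html_a, html_b):
    return jaccard(css_classes(html_a), css_classes(html_b))


page_a = '<div class="listing price"><span class="title">A</span></div>'
page_b = '<div class="listing"><span class="title bold">B</span></div>'
score = style_similarity(page_a, page_b)
```

The two toy pages share `listing` and `title` out of four distinct class names, so the score is 0.5.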
35
METHOD: STEP #3
AGGREGATED SIMILARITY
METHOD: STEP #4
SIMILARITY MATRIX → CLUSTERS
CLUSTERING: SHARED NEAR NEIGHBOR
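Steps #3 and #4 can be sketched together. The slides do not give the aggregation formula, so the weighted sum below is an assumption, as are the parameter names `weight`, `k` and `min_shared`. The clustering follows the Jarvis-Patrick form of shared near neighbor clustering: two pages are linked when each is in the other's k-nearest-neighbor list and the lists share enough members; clusters are the connected groups of linked pages.

```python
def aggregate(structural, style, weight=0.5):
    """Weighted mix of two square similarity matrices (lists of lists)."""
    n = len(structural)
    return [
        [weight * structural[i][j] + (1 - weight) * style[i][j] for j in range(n)]
        for i in range(n)
    ]


def nearest_neighbors(similarity, k):
    """For each item, the set of its k most similar other items."""
    neighbors = []
    for i, row in enumerate(similarity):
        others = sorted(
            (j for j in range(len(row)) if j != i),
            key=lambda j: row[j],
            reverse=True,
        )
        neighbors.append(set(others[:k]))
    return neighbors


def shared_near_neighbor_clusters(similarity, k, min_shared):
    """Jarvis-Patrick style clustering on a similarity matrix."""
    n = len(similarity)
    neighbors = nearest_neighbors(similarity, k)
    parent = list(range(n))  # union-find forest over item indices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            mutual = j in neighbors[i] and i in neighbors[j]
            shared = len(neighbors[i] & neighbors[j])
            if mutual and shared >= min_shared:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Toy data: pages 0-2 come from one template, pages 3-5 from another.
groups = [0, 0, 0, 1, 1, 1]
structural = [[1.0 if i == j else (0.9 if groups[i] == groups[j] else 0.1)
               for j in range(6)] for i in range(6)]
style = [[1.0 if i == j else (0.7 if groups[i] == groups[j] else 0.3)
          for j in range(6)] for i in range(6)]
combined = aggregate(structural, style, weight=0.5)
clusters = shared_near_neighbor_clusters(combined, k=2, min_shared=1)
```

With these toy matrices the two templates come out as two clusters, `[0, 1, 2]` and `[3, 4, 5]`. The all-pairs loop is quadratic in the number of pages, which is acceptable for a sketch but not for a large crawl.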
CHALLENGES
• TED is very expensive
• Zhang-Shasha’s TED:
  • O(|T1| × |T2| × min{depth(T1), leaves(T1)} × min{depth(T2), leaves(T2)})
• That’s O(n^4)
• Approx. 1000 HTML tags
• That’s O(10^12)
[Plot: time complexity vs. number of HTML tags]
38
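The arithmetic behind the estimate can be checked directly. The function below simply evaluates the bound quoted on the slide with constant factors ignored; its name and the "shallower page" numbers (depth 20, 400 leaves) are made up for illustration.

```python
def zhang_shasha_bound(size_a, depth_a, leaves_a, size_b, depth_b, leaves_b):
    """Operation-count bound from the quoted formula (constants ignored)."""
    return size_a * size_b * min(depth_a, leaves_a) * min(depth_b, leaves_b)


# Worst case from the slide: every factor on the order of n = 1000 tags.
n = 1000
worst_case = n ** 4

# Hypothetical shallower page shape: 1000 tags, depth 20, 400 leaves.
shallow = zhang_shasha_bound(1000, 20, 400, 1000, 20, 400)
```

The worst case is 10^12 basic operations for a single pair of pages, while the shallow-tree case is 4 × 10^8. Either way the cost applies per page pair, so it multiplies quickly across a crawl.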
LANDMARKS CLASSIFICATION AND
MARS TARGET ENCYCLOPEDIA
Jet Propulsion Laboratory, California Institute of Technology
Contributors: Dr. Kiri Wagstaff, Dr. Raymond Francis
39
LANDMARKS CLASSIFICATION
+ Goal: Classify landmark images from the High Resolution Imaging Science Experiment (HiRISE)
+ Classes: Crater, Dark Dune, Bright Dune, Streak, etc.
+ Trained a deep neural net for image classification
+ Compared with the results from a Caffe-based classifier
40
LANDMARKS CLASSIFICATION
+ Challenges:
  + Too little training data
  + Demands lots of CPU power
  + Labels are not precise
+ Solution: Transfer Learning
  + Start with the Inception-V3 net, a state-of-the-art model
  + Erase the weights of the last layer
  + Retrain the network for the new classes
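The core idea of the transfer learning recipe, keeping the pretrained layers fixed and fitting only a fresh last layer, can be illustrated without any deep learning library. The sketch below is a toy stand-in, not the Inception-V3 retraining itself: the two-dimensional "frozen features" are invented numbers playing the role of the pretrained network's output, and all function names and hyperparameters are assumptions.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(weights, biases, x):
    scores = [sum(w * v for w, v in zip(row, x)) + b
              for row, b in zip(weights, biases)]
    return softmax(scores)


def predict(weights, biases, x):
    probs = predict_proba(weights, biases, x)
    return probs.index(max(probs))


def train_last_layer(features, labels, num_classes, epochs=100, lr=0.5):
    """Fit a fresh softmax layer on fixed (frozen) feature vectors."""
    dim = len(features[0])
    weights = [[0.0] * dim for _ in range(num_classes)]  # the "erased" layer
    biases = [0.0] * num_classes
    for _ in range(epochs):
        grad_w = [[0.0] * dim for _ in range(num_classes)]
        grad_b = [0.0] * num_classes
        for x, y in zip(features, labels):
            probs = predict_proba(weights, biases, x)
            for c in range(num_classes):
                err = probs[c] - (1.0 if c == y else 0.0)  # cross-entropy gradient
                grad_b[c] += err
                for d in range(dim):
                    grad_w[c][d] += err * x[d]
        for c in range(num_classes):
            biases[c] -= lr * grad_b[c] / len(features)
            for d in range(dim):
                weights[c][d] -= lr * grad_w[c][d] / len(features)
    return weights, biases


# Invented frozen features for two new classes (0 = crater, 1 = dark_dune).
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]
weights, biases = train_last_layer(features, labels, num_classes=2)
predictions = [predict(weights, biases, x) for x in features]
```

Only the small final layer is trained here, which is why this approach copes with little training data and modest compute: the expensive feature extractor is reused as-is.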
41
INCEPTION-V3 ARCHITECTURE
* Photo Credits - Google Research
This Network has 5.64% top-5 error on ILSVRC 2012 validation dataset
42
LANDMARKS CLASSIFICATION
Model Evaluation
Judge↓ \ TFlo→    streak   other   dark_dune   bright_dune   crater   [TFlo.Tot]
streak                 1      55           1             0        1           58
other                  1    1562         143             2       60         1768
dark_dune              0      18         471             0        1          490
bright_dune            0       6           1             0       16           23
crater                 0     225           0             0      158          383
[Judge.Total]          2    1866         616             2      236       [2722]
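Summary metrics can be read off the confusion matrix above. The sketch below copies the counts from the table, taking rows as the judge's label and columns as the TensorFlow prediction, as the header arrows indicate. The function names are illustrative.

```python
labels = ["streak", "other", "dark_dune", "bright_dune", "crater"]

# Rows: judge (human) label; columns: TensorFlow prediction, same order as labels.
confusion = [
    [1, 55, 1, 0, 1],
    [1, 1562, 143, 2, 60],
    [0, 18, 471, 0, 1],
    [0, 6, 1, 0, 16],
    [0, 225, 0, 0, 158],
]


def accuracy(matrix):
    """Fraction of images where prediction and judge label agree."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def recall_per_class(matrix, names):
    """For each judge label, the fraction the model predicted correctly."""
    return {name: row[i] / sum(row)
            for i, (name, row) in enumerate(zip(names, matrix))}


overall = accuracy(confusion)
recall = recall_per_class(confusion, labels)
```

Overall agreement is 2192 of 2722 images, about 80.5%. Per-class recall is uneven: roughly 96% for dark_dune and 88% for other, but about 41% for crater, under 2% for streak, and zero for bright_dune, which matches the small number of examples in those classes.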
43
MARS TARGET ENCYCLOPEDIA (MTE)
+ Goal: Build a search engine for research articles related to planetary science.
  + Minerals, Elements, Targets, etc.
+ Contributions: parser and indexer tools
  + Apache Tika to extract text
  + Grobid parser to extract title, authors, affiliations, etc.
  + Stanford CoreNLP for NER
  + Apache Lucene/Solr inverted index
+ https://github.com/USCDataScience/parser-indexer-py
44
INFORMATION EXTRACTION [4]
+ Build custom NER model for planetary science
+ Entities include ELEMENTS, MINERALS, TARGETS, etc.
+ Annotated the documents published in Lunar and Planetary Science Conference [1] (LPSC) 2015 using brat [2]
+ Trained a model for Stanford CoreNLP’s CRFClassifier [3]
[1] http://www.hou.usra.edu/meetings/lpsc2015/
[2] http://brat.nlplab.org/
[3] http://nlp.stanford.edu/software/CRF-NER.shtml
[4] https://github.com/USCDataScience/parser-indexer-py/tree/master/src/corenlp
45
Gasda, P. J., et al. "Potential Link Between High-Silica Diagenetic Features in Both Eolian and Lacustrine Rock Units Measured in Gale Crater with MSL." Lunar and Planetary Science Conference. Vol. 47. 2016.
46
NER MODEL EVALUATION
● 8 test documents from LPSC 2015
● 100 training documents from LPSC 2015 and LPSC 2016
47
RESEARCH MOTIVATIONS AND INTERESTS
+ There is so much to learn!
+ Research in Question Answering (AI) is fascinating
+ Narrowing down:
  + Natural Language Understanding
  + Question Answering
  + Information Extraction
  + Knowledge Representation
48
QUESTION ANSWERING
+ Question Answering is an interface, not a single application: any human-interfacing system that has input and output can be converted to a sort of question-answering system
+ Natural Language Understanding
  + Identifying the entities (nouns)
  + Resolving the references (pronouns, context)
  + Updating the states (adjectives) of entities based on the actions (verbs)
49
KNOWLEDGE REPRESENTATION
+ Capture the mutable aspects of knowledge
+ A formal language to do (math) reasoning (induction, deduction)
+ A graphical language to visualize and explain
+ Storage and Retrieval
50
TIMELINE KR
+ Every entity has a timeline
+ Timelines intersect with each other
+ For example: this meeting
51
TIMELINE KR : Example
52
QUESTIONS ?
Thanks
53