My work in Information Retrieval, Machine Learning and NLP

Thamme Gowda @ Stanford University, Nov 3rd, 2016

HELLO!

+ I’m Thamme Gowda
  + University of Southern California, Los Angeles - MSCS
  + NASA Jet Propulsion Laboratory - Intern
  + Apache Software Foundation - Volunteer
  + Datoin - Co Founder
+ You can find me online: @thammegowda (https://twitter.com/thammegowda)

OVERVIEW

+ In this presentation
  + USC IRDS - DARPA Memex
  + NASA Jet Propulsion Lab
  + Datoin
+ Research
  + Clustering Web Pages
  + Mars Target Encyclopedia
+ Research Interests and motivations

USC INFORMATION RETRIEVAL AND DATA SCIENCE

+ Dr. Chris Mattmann’s group
+ Contributions to Free and Open Source Software
+ Top Apache Projects:
  + Apache Tika, Nutch, Joshua (Incubating)
  + Sparkler - A web crawler on Apache Spark
+ Involvement in DARPA Memex program

NASA JET PROPULSION LABORATORY

+ Summer Intern (Full time), Fall Co-Op (part time)
+ Continued involvement with DARPA MEMEX
+ DARPA Data Driven Discovery of Models (D3M)
+ Mars Target Encyclopedia
+ Mars Landmarks Classification

MEMEX

Collaborators: Dr. Chris Mattmann, Paul Ramirez, Kyle Hundman, et al.

DARPA MEMEX

+ Web Crawling, Information Retrieval
  + Apache Solr based Search Index
+ Information Extraction
  + Names of people, organizations, locations
  + From location names to GPS coordinates
+ Object Recognition - Models from ImageNet dataset

TIKA-1787: NAMED ENTITY RECOGNITION

+ NER on Memex Dataset
+ Added NER support to Apache Tika [1]
  + MaxEnt Classifier from Apache OpenNLP (default) [2]
  + CRF Classifier from Stanford CoreNLP [3]
  + MITIE - IE toolkit from MIT LL

[1] https://wiki.apache.org/tika/TikaAndNER
[2] https://opennlp.apache.org/documentation/manual/opennlp.html#tools.namefind
[3] http://nlp.stanford.edu/software/CRF-NER.shtml

TIKA-1993: OBJECT RECOGNITION

+ Image Recognition support
+ Integrated TensorFlow’s Inception-V3 model
+ Evaluated multiple ways of integration
  + Command Line Invocation → S-l-o-w as a turtle
  + Java Native Interface → Transitive dependency issues
  + gRPC Client Server → Dependency version issues
  + REST Client Server → Works best, please use this!
+ https://wiki.apache.org/tika/TikaAndVision


TIKA-1993 DEMO

<meta name="OBJECT" content="German shepherd, German shepherd dog, German police dog, alsatian (0.36203)"/>
<meta name="OBJECT" content="military uniform (0.13061)"/>

* Photo Credits - Wikimedia.org

LABELING THE WEAPONS DATASET

+ 1.3 Million Images of DARPA MEMEX dataset
+ Detected objects in the images
+ Two Experiments
  + 1st time - top 2 objects
  + 2nd time - top 2 objects + confidence threshold of 0.3
+ Improved the efficiency for large jobs
  + 1.3 million images took ~ 36 hours on 32 CPU cores
  + Improvements are upstreamed to Apache Tika
+ Pushed the results back to Imagecatdev Solr: http://imagecat.dyndns.org/weapons/imagespace-dev/
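The two labeling experiments differ only in how detected labels are filtered. A minimal sketch of that filtering, assuming labels arrive as strings in the "label (score)" form shown in the TIKA-1993 demo; the function names are hypothetical and not part of Apache Tika:

```python
import re

# Hypothetical helpers; the input format mirrors the OBJECT metadata values
# from the TIKA-1993 demo, e.g. "military uniform (0.13061)".
LABEL_RE = re.compile(r"^(?P<name>.+) \((?P<score>[0-9.]+)\)$")


def parse_label(value):
    """Split 'label text (score)' into (label, float score)."""
    match = LABEL_RE.match(value.strip())
    if match is None:
        raise ValueError(f"unrecognized label format: {value!r}")
    return match.group("name"), float(match.group("score"))


def select_labels(values, top_n=2, min_confidence=0.0):
    """Keep at most top_n labels, dropping those below min_confidence."""
    parsed = sorted((parse_label(v) for v in values),
                    key=lambda pair: pair[1], reverse=True)
    return [(name, score) for name, score in parsed[:top_n]
            if score >= min_confidence]


detections = [
    "German shepherd, German shepherd dog, German police dog, alsatian (0.36203)",
    "military uniform (0.13061)",
]
first_run = select_labels(detections, top_n=2)                       # experiment 1
second_run = select_labels(detections, top_n=2, min_confidence=0.3)  # experiment 2
```

On the demo image, the first setting keeps both labels, while the 0.3 threshold keeps only the German shepherd label.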

MEMEX CP1: HT CLASSIFIER

+ Using SVM
+ Created custom vectors
+ Stanford CoreNLP for tokenization
+ Features:
  + Unigrams
  + Selected Bigrams
+ All n-grams are lemmatized
+ Classification is done at the cluster level
+ https://github.com/USCDataScience/svm-classifier-memex

* Photo credits http://scikit-learn.org/


MEMEX CP1: EVAL. DATASET AU ROC

81.7% AU-ROC for 487 Clusters (Next best result: 65%)
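For readers unfamiliar with the metric: AU-ROC can be read as the probability that a randomly chosen positive item is scored above a randomly chosen negative one. A minimal sketch of that pairwise definition follows; the scores and labels are made up for illustration and are not the CP1 evaluation data:

```python
def au_roc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the fraction of (positive, negative) pairs in which the positive
    item receives the higher score; ties count as half a pair.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up classifier scores for five clusters (1 = positive, 0 = negative).
example_scores = [0.9, 0.8, 0.4, 0.35, 0.1]
example_labels = [1, 0, 1, 0, 0]
example_auc = au_roc(example_scores, example_labels)  # 5 of 6 pairs ordered correctly
```

The quadratic loop is fine for a few hundred clusters; a rank-based formula gives the same value faster on large sets.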

MEMEX CP1: Sample Features

+ Cluster classification instead of individual documents
+ Lemmatization
+ Selected Bigrams and N-grams:
  + Adjectives and nouns - together
  + Adverbs and verbs - together
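The feature scheme above can be sketched roughly as follows. This is an illustration only, not the code from the svm-classifier-memex repository: it assumes tokens are already lemmatized and tagged with Penn Treebank part-of-speech tags, and it assumes "together" means an adjective directly followed by a noun, or an adverb directly followed by a verb.

```python
from collections import Counter

# Assumption: (first tag prefix, second tag prefix) pairs kept as bigrams,
# using Penn Treebank prefixes (JJ adjective, NN noun, RB adverb, VB verb).
SELECTED_PAIRS = {("JJ", "NN"), ("RB", "VB")}


def extract_features(tagged_lemmas):
    """Count unigrams plus selected bigrams from (lemma, pos_tag) pairs."""
    features = Counter(lemma for lemma, _ in tagged_lemmas)
    for (left, left_tag), (right, right_tag) in zip(tagged_lemmas, tagged_lemmas[1:]):
        if (left_tag[:2], right_tag[:2]) in SELECTED_PAIRS:
            features[f"{left}_{right}"] += 1
    return features


def cluster_features(documents):
    """Sum per-document counts so one vector represents a whole cluster."""
    total = Counter()
    for doc in documents:
        total.update(extract_features(doc))
    return total


# Two made-up documents belonging to the same cluster.
doc_a = [("new", "JJ"), ("phone", "NN"), ("quickly", "RB"), ("sell", "VBD")]
doc_b = [("new", "JJ"), ("phone", "NN")]
cluster_vector = cluster_features([doc_a, doc_b])
```

Summing counts over a cluster's documents is one simple way to classify at the cluster level; the resulting counts would still need to be mapped to a fixed vocabulary before training an SVM.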

SPARKLER

Collaborators: Karanjeet Singh

SPARKLER [1] (aka Spark-Crawler)

+ Redesigned and reimagined crawler
  + Taking the best parts of Apache Nutch
  + Combined with recent advancements in distributed computing
+ Crawler database is redesigned → indexed store using Apache Lucene/Solr
+ Crawler pipeline is designed → CrawldbRDD

[1] https://github.com/uscdataScience/sparkler

SPARKLER - ROADMAP

+ Partitioning for Fair Fetching
+ Apache Solr Backend for crawldb
+ Stores Data on FS
+ OSGI based Plugin Framework (Apache Felix)
  + Regex URL Filter, JavaScript Engine
+ Admin Dashboard
+ Apache Kafka Integration
+ ApacheCon EU 2016
+ TODO: More Plugins from Nutch
+ TODO: Apache Incubator

Quick Start: https://github.com/USCDataScience/sparkler/wiki/sparkler-0.1
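The roadmap lists partitioning for fair fetching without saying how it is done. One common approach, assumed here purely for illustration, is to group pending URLs by host and cap each group, so that no single site dominates a fetch round. The function and parameter names below are hypothetical and are not Sparkler's API:

```python
from collections import defaultdict
from urllib.parse import urlparse


def partition_by_host(urls, max_per_host):
    """Group URLs by hostname, keeping at most max_per_host URLs per group."""
    groups = defaultdict(list)
    for url in urls:
        host = urlparse(url).hostname or ""
        if len(groups[host]) < max_per_host:
            groups[host].append(url)
    return dict(groups)


pending = [
    "http://example.com/a",
    "http://example.com/b",
    "http://example.com/c",
    "http://example.org/x",
]
fetch_groups = partition_by_host(pending, max_per_host=2)
```

In a distributed crawler, each group could then be handed to a separate worker that applies a per-host delay between requests.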

SPARKLER PIPELINE [1]

[1] https://github.com/USCDataScience/sparkler/wiki/Sparkler-Internals

Co-Founder


DATOIN - DATA TO INFORMATION [1]

+ A platform for data flow pipelines
+ Drag-drop-connect the components
+ SDK to build reusable components
+ Machine Learning components are our interest
+ All round experience - idea, design, implement, test, deploy, collaborate

[1] http://datoin.com

Demo: http://datoin.com/applications/extraction?type=custom

Pipeline: http://datoin.com/pipeline/viewPipeline/pipeline-9d3bc507-4549-4139-8f0f-dc38a0adf354

*Photo from http://datoin.com


RESEARCH EXPERIENCE

Clustering Web Pages based on Structure and Style

Object Recognition - Landmarks Classification

Named Entity Recognition - Mars Target Encyclopedia

Clustering Web Pages Based On Structure And Style Similarity

Thamme Gowda and Chris Mattmann, IEEE IRI 2016

USC Information Retrieval and Data Science Group (irds.usc.edu)


AUTO EXTRACTOR[1]

+ Unsupervised Learning for Information Extraction
+ First Step - Clustering based on structure and style
+ Kernels
  + Structural Similarity using Tree Edit Distance between DOM trees
  + Style Similarity of CSS class names
+ Shared Near Neighbor Clustering
+ Distributable on Apache Spark

[1] https://github.com/thammegowda/autoextractor

SAMPLE WEB PAGES

Credits: http://www.armslist.com, http://trec-dd.org/dataset.html, http://memex.jpl.nasa.gov

METHOD OVERVIEW

[Figure: method overview diagram; label: clustering]

METHOD: STEP #1

[Figure: method diagram, step 1; labels: web pages from a crawler like Apache Nutch, structural similarity]

STRUCTURAL SIMILARITY

● 3 Edit operations
● Normalized distance

[Figure: minimum tree edit distance]

* Zhang, K., & Shasha, D. (1989). Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6), 1245-1262.
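To make the idea concrete, here is a minimal sketch of an ordered tree edit distance with the three usual edit operations (insert, delete, relabel), each at unit cost. It is a plain memoized recursion, not the Zhang-Shasha dynamic program cited above, so it is only practical for small trees. The normalization (dividing by the total node count) is an assumption, since the slide does not say which one was used:

```python
from functools import lru_cache

# A tree is a (label, children) tuple, where children is a tuple of trees.


def size(tree):
    """Number of nodes in a tree."""
    return 1 + sum(size(child) for child in tree[1])


@lru_cache(maxsize=None)
def forest_distance(forest_a, forest_b):
    """Edit distance between two ordered forests (tuples of trees)."""
    if not forest_a and not forest_b:
        return 0
    if not forest_a:
        return sum(size(t) for t in forest_b)  # insert everything
    if not forest_b:
        return sum(size(t) for t in forest_a)  # delete everything
    (label_a, kids_a), rest_a = forest_a[-1], forest_a[:-1]
    (label_b, kids_b), rest_b = forest_b[-1], forest_b[:-1]
    return min(
        # delete the rightmost root of forest_a; its children move up
        forest_distance(rest_a + kids_a, forest_b) + 1,
        # insert the rightmost root of forest_b
        forest_distance(forest_a, rest_b + kids_b) + 1,
        # match the two rightmost roots, relabeling if the tags differ
        forest_distance(kids_a, kids_b)
        + forest_distance(rest_a, rest_b)
        + (0 if label_a == label_b else 1),
    )


def tree_edit_distance(tree_a, tree_b):
    return forest_distance((tree_a,), (tree_b,))


def normalized_distance(tree_a, tree_b):
    # Assumed normalization: divide by the combined number of nodes.
    return tree_edit_distance(tree_a, tree_b) / (size(tree_a) + size(tree_b))


# Tiny made-up DOM trees.
page_a = ("html", (("body", (("div", ()), ("p", ()))),))
page_b = ("html", (("body", (("div", ()),)),))
page_c = ("html", (("body", (("div", ()), ("span", ()))),))
```

Here `page_a` and `page_b` differ by one deleted `p` node, and `page_a` and `page_c` by one relabeled node, so both pairs are at distance 1.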

METHOD: STEP #2

[Figure: method diagram, step 2; labels: web pages from a crawler like Apache Nutch, style similarity]

STYLE SIMILARITY

• Similar web pages have similar CSS styles
• XPath: "//*[@class]/@class"
• Simple measure:
  • Jaccard Similarity on CSS class names
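A minimal sketch of this measure using only the Python standard library. Instead of evaluating the XPath expression, it collects `class` attributes with `html.parser`; splitting each attribute value on whitespace into individual class names is an assumption, as is treating two pages with no classes at all as fully similar:

```python
from html.parser import HTMLParser


class ClassNameCollector(HTMLParser):
    """Collects every CSS class name that appears in a class attribute."""

    def __init__(self):
        super().__init__()
        self.class_names = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.class_names.update(value.split())


def css_class_names(html):
    collector = ClassNameCollector()
    collector.feed(html)
    collector.close()
    return collector.class_names


def jaccard_similarity(set_a, set_b):
    """|A ∩ B| / |A ∪ B|."""
    if not set_a and not set_b:
        return 1.0  # assumption: two class-free pages count as identical in style
    return len(set_a & set_b) / len(set_a | set_b)


# Two made-up pages sharing the classes "listing" and "title".
page_x = '<div class="listing price"><span class="title">A</span></div>'
page_y = '<div class="listing"><p class="title desc">B</p></div>'
style_similarity = jaccard_similarity(css_class_names(page_x), css_class_names(page_y))
```

The two pages share 2 of 4 distinct class names, giving a similarity of 0.5.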

METHOD: STEP #3

[Figure: method diagram, step 3; labels: aggregate, aggregated similarity]

METHOD: STEP #4

[Figure: method diagram, step 4; labels: similarity matrix, clustering (shared near neighbor), clusters]
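Shared near neighbor clustering can be sketched as follows, in the Jarvis-Patrick style: two items are linked when each is among the other's k nearest neighbors and they share at least a minimum number of neighbors, and clusters are the connected components of those links. The parameters `k` and `min_shared` and the similarity matrix are illustrative; the slides do not state the settings actually used:

```python
def nearest_neighbors(similarity, k):
    """For each item, the set of its k most similar other items."""
    neighbors = []
    for i, row in enumerate(similarity):
        others = sorted((j for j in range(len(row)) if j != i),
                        key=lambda j: row[j], reverse=True)
        neighbors.append(set(others[:k]))
    return neighbors


def shared_near_neighbor_clusters(similarity, k, min_shared):
    """Cluster items from a square similarity matrix."""
    neighbors = nearest_neighbors(similarity, k)
    n = len(similarity)
    parent = list(range(n))  # union-find forest over item indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            mutual = j in neighbors[i] and i in neighbors[j]
            if mutual and len(neighbors[i] & neighbors[j]) >= min_shared:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Made-up aggregated similarities for six pages in two obvious groups.
similarity_matrix = [
    [1.0, 0.9, 0.8, 0.1, 0.2, 0.1],
    [0.9, 1.0, 0.85, 0.15, 0.1, 0.2],
    [0.8, 0.85, 1.0, 0.2, 0.1, 0.1],
    [0.1, 0.15, 0.2, 1.0, 0.9, 0.7],
    [0.2, 0.1, 0.1, 0.9, 1.0, 0.8],
    [0.1, 0.2, 0.1, 0.7, 0.8, 1.0],
]
page_clusters = shared_near_neighbor_clusters(similarity_matrix, k=2, min_shared=1)
```

With `k=2` and `min_shared=1`, the six pages fall into the two groups `[0, 1, 2]` and `[3, 4, 5]`. Unlike k-means, this approach does not need the number of clusters in advance.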

CHALLENGES

• TED very expensive
• Zhang-Shasha’s TED
  • O(|T1| x |T2| x Min{depth(T1), leaves(T1)} x Min{depth(T2), leaves(T2)})
  • That’s O(n^4)
  • Approx. 1000 HTML Tags
  • That’s O(10^12)

[Figure: plot of time complexity against number of HTML tags]
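The arithmetic behind these bullets, written out. Here n is taken to be the larger of the two tree sizes, which is an assumption about what the slide means by n; the bound follows because depth and leaf count can each be at most n:

```latex
\[
O\bigl(|T_1|\cdot|T_2|\cdot\min\{\mathrm{depth}(T_1),\mathrm{leaves}(T_1)\}
\cdot\min\{\mathrm{depth}(T_2),\mathrm{leaves}(T_2)\}\bigr)
\subseteq O(n^4),
\qquad n = \max(|T_1|,|T_2|)
\]
\[
n \approx 10^{3} \quad\Longrightarrow\quad n^{4} \approx (10^{3})^{4} = 10^{12}
\]
```

This is the worst case for a single pair of pages with about 1000 tags each.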

LANDMARKS CLASSIFICATION AND MARS TARGET ENCYCLOPEDIA

Jet Propulsion Laboratory, California Institute of Technology

Contributors: Dr. Kiri Wagstaff, Dr. Raymond Francis

LANDMARKS CLASSIFICATION

+ Goal: Classify Landmark images from High Resolution Imaging Science Experiment (HiRISE)
  + Classes: Crater, Dark Dune, Bright Dune, Streak etc.
+ Trained a deep neural net for image classification
+ Compared with the results from Caffe based classifier

LANDMARKS CLASSIFICATION

+ Challenges:
  + Too little training data
  + Demands lots of CPU power
  + Labels are not precise
+ Solution: Transfer Learning
  + Start with Inception-V3 Net using state-of-the-art model
  + Erase the weights of last layer
  + Retrain the network for new classes

INCEPTION-V3 ARCHITECTURE

* Photo Credits - Google Research

This network has 5.64% top-5 error on the ILSVRC 2012 validation dataset (http://www.image-net.org/challenges/LSVRC/2012/)

LANDMARKS CLASSIFICATION

Model Evaluation

Judge↓ \ TFlo→   streak   other  dark_dune  bright_dune  crater  [TFlo.Tot]
streak                1      55          1            0       1          58
other                 1    1562        143            2      60        1768
dark_dune             0      18        471            0       1         490
bright_dune           0       6          1            0      16          23
crater                0     225          0            0     158         383
[Judge.Total]         2    1866        616            2     236      [2722]
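The evaluation table can be summarized with a few standard metrics. The sketch below assumes, following the header arrows, that rows are the judge's labels and columns are the TensorFlow model's predictions; the metric choice is illustrative and not taken from the slides:

```python
CLASSES = ["streak", "other", "dark_dune", "bright_dune", "crater"]

# Counts copied from the evaluation table.
# Assumed reading: rows = judge's label, columns = TensorFlow prediction.
CONFUSION = [
    [1, 55, 1, 0, 1],
    [1, 1562, 143, 2, 60],
    [0, 18, 471, 0, 1],
    [0, 6, 1, 0, 16],
    [0, 225, 0, 0, 158],
]


def accuracy(matrix):
    """Fraction of images on which prediction and judge agree."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def precision_recall(matrix, index):
    """Precision and recall for the class at the given index."""
    true_positive = matrix[index][index]
    predicted = sum(row[index] for row in matrix)  # column total
    actual = sum(matrix[index])                    # row total
    precision = true_positive / predicted if predicted else 0.0
    recall = true_positive / actual if actual else 0.0
    return precision, recall


overall = accuracy(CONFUSION)  # 2192 agreements out of 2722 images
per_class = {name: precision_recall(CONFUSION, i) for i, name in enumerate(CLASSES)}
```

Under that reading, overall agreement is 2192 of 2722 images (about 80.5%), and none of the 23 bright_dune images is predicted as bright_dune.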

MARS TARGET ENCYCLOPEDIA (MTE)

+ Goal: Build a search engine for research articles related to planetary science.
  + Minerals, Elements, Targets etc.
+ Contributions: parser and indexer tools
  + Apache Tika to extract text
  + Grobid parser to extract title, authors, affiliations etc.
  + Stanford CoreNLP for NER
  + Apache Lucene/Solr inverted index
+ https://github.com/USCDataScience/parser-indexer-py

INFORMATION EXTRACTION [4]

+ Build custom NER model for planetary science
+ Entities include ELEMENTS, MINERALS, TARGETS, etc.
+ Annotated the documents published in Lunar and Planetary Science Conference [1] (LPSC) 2015 using brat [2]
+ Trained a model for Stanford CoreNLP’s CRFClassifier [3]

1. http://www.hou.usra.edu/meetings/lpsc2015/
2. http://brat.nlplab.org/
3. http://nlp.stanford.edu/software/CRF-NER.shtml
4. https://github.com/USCDataScience/parser-indexer-py/tree/master/src/corenlp
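Going from brat annotations to CRF training data needs a conversion step. The sketch below turns brat standoff annotations into one token and label per line, tab-separated, which is the kind of column layout CRFClassifier training files use. It is not the code from the parser-indexer-py repository: whitespace tokenization stands in for a real tokenizer, and the sentence and entity type names are made up for illustration.

```python
import re


def read_entities(ann_text):
    """Parse text-bound (T) annotations from brat standoff format."""
    entities = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, events, notes and so on
        _, type_and_span, _ = line.split("\t", 2)
        if ";" in type_and_span:
            continue  # discontinuous spans are ignored in this sketch
        label, start, end = type_and_span.split(" ")
        entities.append((int(start), int(end), label))
    return entities


def to_token_labels(text, entities, outside="O"):
    """Whitespace-tokenize text; label each token by the span covering it."""
    rows = []
    for match in re.finditer(r"\S+", text):
        label = outside
        for start, end, entity_label in entities:
            if match.start() >= start and match.end() <= end:
                label = entity_label
                break
        rows.append((match.group(), label))
    return rows


def to_tsv(rows):
    """One 'token<TAB>label' pair per line."""
    return "\n".join(f"{token}\t{label}" for token, label in rows)


# Made-up example sentence and annotations.
sample_text = "Windjana contains magnetite and iron ."
sample_ann = (
    "T1\tTarget 0 8\tWindjana\n"
    "T2\tMineral 18 27\tmagnetite\n"
    "T3\tElement 32 36\tiron\n"
)
training_rows = to_token_labels(sample_text, read_entities(sample_ann))
training_tsv = to_tsv(training_rows)
```

Tokens outside any annotated span get the background label `O`.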


Gasda, P. J., et al. "Potential Link Between High-Silica Diagenetic Features in Both Eolian and Lacustrine Rock Units Measured in Gale Crater with MSL." Lunar and Planetary Science Conference. Vol. 47. 2016.


NER MODEL EVALUATION

● 8 Test documents from LPSC 2015
● 100 training documents from LPSC 2015 and LPSC 2016

RESEARCH MOTIVATIONS AND INTERESTS

+ There is so much to learn!
+ Research in Question Answering (AI) is fascinating
+ Narrowing down:
  + Natural Language Understanding
  + Question Answering
  + Information Extraction
  + Knowledge Representation

QUESTION ANSWERING

+ Question Answering is an interface, not a single application
  + Any human interfacing system that has input and output can be converted to a sort of question-answering system
+ Natural Language Understanding
  + Identifying the entities (nouns)
  + Resolving the references (pronouns, context)
  + Updating the states (adjectives) of entities based on the actions (verbs)

KNOWLEDGE REPRESENTATION

+ Capture the mutable aspects of knowledge
+ A formal language to do (math) reasoning (Induction, deduction)
+ A graphical language to visualize and explain
+ Storage and Retrieval

TIMELINE KR

+ Every entity has a timeline
+ Timelines intersect with each other
+ For example: this meeting

TIMELINE KR : Example


QUESTIONS ?

Thanks
