My work in Information Retrieval, Machine Learning...

Thamme Gowda @ Stanford University, Nov 3rd, 2016

My work in Information Retrieval, Machine Learning and NLP

+ I’m Thamme Gowda
  + University of Southern California, Los Angeles - MSCS
  + NASA Jet Propulsion Laboratory - Intern
  + Apache Software Foundation - Volunteer
  + Datoin - Co-Founder
+ You can find me online: @thammegowda

HELLO!

OVERVIEW

+ In this presentation
  + USC IRDS - DARPA Memex
  + NASA Jet Propulsion Lab
  + Datoin
+ Research
  + Clustering Web Pages
  + Mars Target Encyclopedia

+ Research Interests and motivations

USC INFORMATION RETRIEVAL AND DATA SCIENCE

+ Dr. Chris Mattmann’s group
+ Contributions to Free and Open Source Software
+ Top Apache Projects:
  + Apache Tika, Nutch, Joshua (Incubating)
  + Sparkler - A web crawler on Apache Spark

+ Involvement in DARPA Memex program

+ Summer Intern (full time), Fall Co-Op (part time)
+ Continued involvement with DARPA MEMEX
+ DARPA Data Driven Discovery of Models (D3M)
+ Mars Target Encyclopedia
+ Mars Landmarks Classification

NASA JET PROPULSION LABORATORY

MEMEX

Collaborators: Dr. Chris Mattmann, Paul Ramirez, Kyle Hundman, et al.

DARPA MEMEX

+ Web Crawling, Information Retrieval
+ Apache Solr based Search Index
+ Information Extraction
  + Names of people, organizations, locations
  + From location names to GPS coordinates
+ Object Recognition - Models from ImageNet dataset

TIKA-1787: NAMED ENTITY RECOGNITION

+ NER on Memex Dataset
+ Added NER support to Apache Tika [1]
  + MaxEnt Classifier from Apache OpenNLP (default) [2]
  + CRF Classifier from Stanford CoreNLP [3]
  + MITIE - IE toolkit from MIT LL

[1] https://wiki.apache.org/tika/TikaAndNER
[2] https://opennlp.apache.org/documentation/manual/opennlp.html#tools.namefind
[3] http://nlp.stanford.edu/software/CRF-NER.shtml


TIKA-1993: OBJECT RECOGNITION

+ Image Recognition support
+ Integrated TensorFlow’s Inception-V3 model
+ Evaluated multiple ways of integration
  + Command Line Invocation → S-l-o-w as a turtle
  + Java Native Interface → Transitive dependency issues
  + GRPC Client Server → Dependency version issues
  + REST Client Server → Works best, please use this!

+ https://wiki.apache.org/tika/TikaAndVision

TIKA-1993 DEMO

<meta name="OBJECT" content="German shepherd, German shepherd dog, German police dog, alsatian (0.36203)"/>
<meta name="OBJECT" content="military uniform (0.13061)"/>
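The demo output above packs each recognized object and its confidence into one `content` attribute. A minimal sketch of pulling them apart, assuming the output always follows the `label (confidence)` pattern shown on the slide; `parse_objects` is a hypothetical helper, not part of Apache Tika:

```python
import re

# Matches the <meta name="OBJECT" .../> lines exactly as they appear in the demo output.
META_OBJECT = re.compile(r'<meta name="OBJECT" content="(.*?) \(([0-9.]+)\)"\s*/>')


def parse_objects(xhtml):
    """Return (label, confidence) pairs found in the parser output."""
    return [(label, float(conf)) for label, conf in META_OBJECT.findall(xhtml)]


demo_output = (
    '<meta name="OBJECT" content="German shepherd, German shepherd dog, '
    'German police dog, alsatian (0.36203)"/>'
    '<meta name="OBJECT" content="military uniform (0.13061)"/>'
)
objects = parse_objects(demo_output)
```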

* Photo Credits - Wikimedia.org

LABELING THE WEAPONS DATASET

+ 1.3 Million Images of DARPA MEMEX dataset
+ Detected objects in the images
+ Two Experiments
  + 1st time - top 2 objects
  + 2nd time - top 2 objects + confidence threshold of 0.3
+ Improved the efficiency for large jobs
+ 1.3 million images took ~ 36 hours on 32 CPU cores
+ Improvements are upstreamed to Apache Tika
+ Pushed the results back to Imagecatdev Solr: http://imagecat.dyndns.org/weapons/imagespace-dev/
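The two labeling experiments differ only in how predictions are filtered. A minimal sketch, assuming each image yields a list of `(label, confidence)` pairs and that the threshold is inclusive; `select_labels` and the "rifle" prediction are hypothetical, used here only for illustration:

```python
def select_labels(predictions, top_k=2, min_confidence=None):
    """Keep the top_k most confident labels, optionally dropping weak ones."""
    ranked = sorted(predictions, key=lambda p: p[1], reverse=True)[:top_k]
    if min_confidence is not None:
        # Assumption: a label is kept when its confidence is >= the threshold.
        ranked = [p for p in ranked if p[1] >= min_confidence]
    return [label for label, _ in ranked]


predictions = [
    ("military uniform", 0.13061),
    ("German shepherd", 0.36203),
    ("rifle", 0.05),  # hypothetical low-confidence prediction
]
first_run = select_labels(predictions, top_k=2)
second_run = select_labels(predictions, top_k=2, min_confidence=0.3)
```

With the threshold in place, an image can end up with fewer than two labels, or none at all.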


MEMEX CP1: HT CLASSIFIER

+ Using SVM
+ Created custom vectors
+ Stanford CoreNLP for tokenization
+ Features:
  + Unigrams
  + Selected Bigrams
+ All grams are lemmatized
+ Classification is done at the cluster level
+ https://github.com/USCDataScience/svm-classifier-memex

* Photo credits http://scikit-learn.org/

MEMEX CP1: EVAL. DATASET AU ROC

81.7% AU-ROC for 487 Clusters (Next best result: 65%)
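For readers unfamiliar with the metric: AU-ROC is the probability that a randomly chosen positive cluster is scored higher than a randomly chosen negative one. A minimal sketch of that definition (not the evaluation script used in the challenge; the labels and scores below are made-up toy values):

```python
def auc_roc(labels, scores):
    """AU-ROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Toy data: 1 = positive cluster, 0 = negative cluster.
toy_auc = auc_roc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```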

MEMEX CP1: Sample Features

+ Cluster classification instead of individual documents
+ Lemmatization
+ Selected Bigrams and N-grams:
  + Adjectives and nouns - together
  + Adverbs and verbs - together
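The feature selection above can be sketched as follows. This assumes the input is already lemmatized and POS-tagged with Penn Treebank tags (as Stanford CoreNLP produces), and that a "selected bigram" is an adjacent adjective+noun or adverb+verb pair in that order; the function names and the sample sentence are hypothetical, and the actual classifier code lives in the repository linked earlier:

```python
from collections import Counter

ADJ_NOUN = ("JJ", "NN")   # adjective followed by noun
ADV_VERB = ("RB", "VB")   # adverb followed by verb


def extract_features(tagged_lemmas):
    """Return all unigrams plus the selected bigrams of one document."""
    unigrams = [lemma for lemma, _ in tagged_lemmas]
    bigrams = []
    for (left, left_tag), (right, right_tag) in zip(tagged_lemmas, tagged_lemmas[1:]):
        # Compare coarse tags only, so JJR/JJS, NNS/NNP, VBD/VBZ etc. also match.
        if (left_tag[:2], right_tag[:2]) in (ADJ_NOUN, ADV_VERB):
            bigrams.append(f"{left} {right}")
    return unigrams + bigrams


def cluster_features(documents):
    """Pool feature counts over every document in a cluster."""
    counts = Counter()
    for doc in documents:
        counts.update(extract_features(doc))
    return counts


doc = [("young", "JJ"), ("girl", "NN"), ("quickly", "RB"), ("arrive", "VBD"), ("today", "NN")]
features = extract_features(doc)
```

Pooling counts per cluster mirrors the slide's point that classification happens at the cluster level rather than per document.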

SPARKLER

Collaborators: Karanjeet Singh

SPARKLER [1] (aka Spark-Crawler)

+ Redesigned and reimagined crawler
  + Taking the best parts of Apache Nutch
  + Combined with recent advancements in distributed computing
+ Crawler database is redesigned → indexed store using Apache Lucene/Solr
+ Crawler pipeline is designed → CrawldbRDD

[1] https://github.com/uscdataScience/sparkler

SPARKLER - ROADMAP

+ Partitioning for Fair Fetching
+ Apache Solr Backend for crawldb
+ Stores Data on FS
+ OSGI based Plugin Framework (Apache Felix)
  + Regex URL Filter, JavaScript Engine
+ Admin Dashboard
+ Apache Kafka Integration
+ ApacheCon EU 2016
+ TODO: More Plugins from Nutch
+ TODO: Apache Incubator

Quick Start: https://github.com/USCDataScience/sparkler/wiki/sparkler-0.1

SPARKLER PIPELINE[1]

[1] https://github.com/USCDataScience/sparkler/wiki/Sparkler-Internals

Co-Founder

DATOIN - DATA TO INFORMATION[1]

+ A platform for data flow pipelines
+ Drag-drop-connect the components
+ SDK to build reusable components
+ Machine Learning components are our interest
+ All-round experience - idea, design, implement, test, deploy, collaborate

[1] http://datoin.com

Demo: http://datoin.com/applications/extraction?type=custom

Pipeline: http://datoin.com/pipeline/viewPipeline/pipeline-9d3bc507-4549-4139-8f0f-dc38a0adf354

*Photo from http://datoin.com

RESEARCH EXPERIENCE

+ Clustering Web Pages based on Structure and Style
+ Object Recognition - Landmarks Classification
+ Named Entity Recognition - Mars Target Encyclopedia

Clustering Web Pages Based On Structure And Style Similarity

Thamme Gowda and Chris Mattmann, IEEE IRI 2016

USC Information Retrieval and Data Science Group (irds.usc.edu)

AUTO EXTRACTOR[1]

+ Unsupervised Learning for Information Extraction
+ First Step - Clustering based on structure and style
+ Kernels
  + Structural Similarity using Tree Edit Distance between DOM trees
  + Style Similarity of CSS class names
+ Shared Near Neighbor Clustering
+ Distributable on Apache Spark

[1] https://github.com/thammegowda/autoextractor

SAMPLE WEB PAGES

Credits: http://www.armslist.com, http://trec-dd.org/dataset.html, http://memex.jpl.nasa.gov


METHOD OVERVIEW

CLUSTERING

METHOD : STEP #1

WEB PAGES FROM CRAWLER LIKE APACHE NUTCH

STRUCTURAL SIMILARITY

● 3 Edit operations
● Normalized distance

* Zhang, K., & Shasha, D. (1989). Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6), 1245-1262.

MINIMUM TREE EDIT DISTANCE
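To make the structural kernel concrete, here is a minimal sketch of ordered tree edit distance with the three unit-cost operations (insert, delete, relabel). It uses the plain memoized forest recursion, not the optimized Zhang-Shasha algorithm cited above, so it is only suitable for tiny trees. The normalization (dividing by the combined node count) is an assumption, since the slides do not say how the distance is normalized. Trees are written as `(label, (children...))` tuples.

```python
from functools import lru_cache


def size(forest):
    """Number of nodes in a forest (a tuple of (label, children) trees)."""
    return sum(1 + size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Minimum number of insert/delete/relabel operations turning f into g."""
    if not f:
        return size(g)          # insert everything in g
    if not g:
        return size(f)          # delete everything in f
    (v_label, v_children), (w_label, w_children) = f[-1], g[-1]
    delete = forest_distance(f[:-1] + v_children, g) + 1
    insert = forest_distance(f, g[:-1] + w_children) + 1
    match = (
        forest_distance(v_children, w_children)
        + forest_distance(f[:-1], g[:-1])
        + (v_label != w_label)  # relabel costs 1 only when labels differ
    )
    return min(delete, insert, match)


def structural_similarity(tree_a, tree_b):
    """Assumed normalization: 1 - distance / (|A| + |B|)."""
    distance = forest_distance((tree_a,), (tree_b,))
    return 1.0 - distance / (size((tree_a,)) + size((tree_b,)))


page_a = ("html", (("body", (("div", ()), ("p", ()))),))
page_b = ("html", (("body", (("div", ()),)),))
distance = forest_distance((page_a,), (page_b,))
```

Here the two toy DOM trees differ by a single deleted `p` node.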

METHOD: STEP #2

WEB PAGES FROM CRAWLER LIKE APACHE NUTCH

STYLE SIMILARITY

• Similar web pages have similar CSS styles
• XPath: "//*[@class]/@class"
• Simple measure:
  • Jaccard Similarity on CSS class names
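A minimal sketch of the style kernel using only the Python standard library. Instead of evaluating the XPath, it collects `class` attributes with `html.parser`. Splitting multi-valued attributes on whitespace and returning 0.0 when neither page has any class are assumptions made for this illustration, and the two sample pages are made up.

```python
from html.parser import HTMLParser


class ClassNameCollector(HTMLParser):
    """Collects every CSS class name used in a page."""

    def __init__(self):
        super().__init__()
        self.class_names = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.class_names.update(value.split())


def css_classes(html):
    collector = ClassNameCollector()
    collector.feed(html)
    return collector.class_names


def style_similarity(html_a, html_b):
    """Jaccard similarity: |A ∩ B| / |A ∪ B| over CSS class names."""
    a, b = css_classes(html_a), css_classes(html_b)
    if not a and not b:
        return 0.0  # assumption: pages without classes share no style evidence
    return len(a & b) / len(a | b)


page_a = '<div class="post title"><p class="body">x</p></div>'
page_b = '<div class="post"><span class="body ad">y</span></div>'
similarity = style_similarity(page_a, page_b)
```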

STYLE SIMILARITY

METHOD : STEP #3

AGGREGATED SIMILARITY

AGGREGATE

METHOD : STEP #4

SIMILARITY MATRIX CLUSTERS

CLUSTERING

SHARED NEAR NEIGHBOR
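A minimal sketch of shared near neighbor clustering over a precomputed similarity matrix, in the Jarvis-Patrick style: two pages are linked when each is in the other's k-nearest-neighbor list and they share enough neighbors, and clusters are the connected components of those links. The parameter names, the mutual-neighbor condition and the toy matrix are assumptions; the project's own implementation is in the autoextractor repository.

```python
def shared_near_neighbor(sim, k, min_shared):
    """Cluster items given a symmetric similarity matrix `sim`."""
    n = len(sim)
    neighbors = []
    for i in range(n):
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: sim[i][j], reverse=True)
        neighbors.append(set(ranked[:k]))

    parent = list(range(n))  # union-find forest for connected components

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            mutual = j in neighbors[i] and i in neighbors[j]
            if mutual and len(neighbors[i] & neighbors[j]) >= min_shared:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Toy matrix: pages 0-2 resemble each other, as do pages 3-5.
HIGH, LOW = 0.9, 0.1
sim = [[1.0 if i == j else (HIGH if i // 3 == j // 3 else LOW) for j in range(6)]
       for i in range(6)]
clusters = shared_near_neighbor(sim, k=2, min_shared=1)
```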

• TED is very expensive
• Zhang-Shasha’s TED:
  • O(|T1| x |T2| x Min{depth(T1), leaves(T1)} x Min{depth(T2), leaves(T2)})
• That’s O(n^4)
• Approx. 1000 HTML tags
• That’s O(10^12)
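The arithmetic behind the estimate, written out as a worst-case bound. It assumes both trees have about n nodes and uses the fact that depth and leaf count can each be at most n:

```latex
\[
O\!\left(|T_1|\,|T_2|\,
  \min\{\mathrm{depth}(T_1),\mathrm{leaves}(T_1)\}\,
  \min\{\mathrm{depth}(T_2),\mathrm{leaves}(T_2)\}\right)
\;\subseteq\; O(n \cdot n \cdot n \cdot n) = O(n^4),
\qquad
n \approx 10^{3} \;\Rightarrow\; n^{4} \approx 10^{12}.
\]
```

This is the cost of comparing one pair of pages; a full similarity matrix repeats it for every pair.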

CHALLENGES

Number of HTML Tags

LANDMARKS CLASSIFICATION AND MARS TARGET ENCYCLOPEDIA

Jet Propulsion Laboratory, California Institute of Technology

Contributors: Dr. Kiri Wagstaff, Dr. Raymond Francis

+ Goal: Classify landmark images from the High Resolution Imaging Science Experiment (HiRISE)
  + Classes: Crater, Dark Dune, Bright Dune, Streak, etc.
+ Trained a deep neural net for image classification
+ Compared with the results from a Caffe based classifier

LANDMARKS CLASSIFICATION

LANDMARKS CLASSIFICATION

+ Challenges:
  + Too little training data
  + Demands lots of CPU power
  + Labels are not precise
+ Solution: Transfer Learning
  + Start with the state-of-the-art Inception-V3 network
  + Erase the weights of the last layer
  + Retrain the network for the new classes

INCEPTION-V3 ARCHITECTURE

* Photo Credits - Google Research

This Network has 5.64% top-5 error on ILSVRC 2012 validation dataset

LANDMARKS CLASSIFICATION

Model Evaluation

Judge↓ \ TFlo→   streak  other  dark_dune  bright_dune  crater  [TFlo.Tot]
streak                1     55          1            0       1          58
other                 1   1562        143            2      60        1768
dark_dune             0     18        471            0       1         490
bright_dune           0      6          1            0      16          23
crater                0    225          0            0     158         383
[Judge.Total]         2   1866        616            2     236      [2722]
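The table can be summarized with a few standard metrics. A minimal sketch, assuming rows are the human judge's labels and columns are the TensorFlow model's predictions (as the header arrows suggest); the numbers are copied from the table, the helper names are hypothetical:

```python
LABELS = ["streak", "other", "dark_dune", "bright_dune", "crater"]

# Rows: judge (reference) label. Columns: label predicted by the model.
MATRIX = [
    [1, 55, 1, 0, 1],        # streak
    [1, 1562, 143, 2, 60],   # other
    [0, 18, 471, 0, 1],      # dark_dune
    [0, 6, 1, 0, 16],        # bright_dune
    [0, 225, 0, 0, 158],     # crater
]


def accuracy(matrix):
    """Share of images where the model agrees with the judge."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / sum(map(sum, matrix))


def recall(matrix, i):
    """Of the images the judge put in class i, the share the model found."""
    row_total = sum(matrix[i])
    return matrix[i][i] / row_total if row_total else 0.0


def precision(matrix, i):
    """Of the images the model put in class i, the share the judge confirms."""
    column_total = sum(row[i] for row in matrix)
    return matrix[i][i] / column_total if column_total else 0.0


overall = accuracy(MATRIX)
per_class = {
    label: (precision(MATRIX, i), recall(MATRIX, i))
    for i, label in enumerate(LABELS)
}
```

Per-class values matter here because the "other" class dominates the data, so overall accuracy alone hides how the rare classes (streak, bright_dune) fare.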

+ Goal: Build a search engine for research articles related to planetary science
  + Minerals, Elements, Targets, etc.
+ Contributions: parser and indexer tools
  + Apache Tika to extract text
  + Grobid parser to extract title, authors, affiliations, etc.
  + Stanford CoreNLP for NER
  + Apache Lucene/Solr inverted index

+ https://github.com/USCDataScience/parser-indexer-py

MARS TARGET ENCYCLOPEDIA (MTE)

+ Build a custom NER model for planetary science
+ Entities include ELEMENTS, MINERALS, TARGETS, etc.
+ Annotated the documents published in the Lunar and Planetary Science Conference [1] (LPSC) 2015 using brat [2]
+ Trained a model for Stanford CoreNLP’s CRFClassifier [3]

INFORMATION EXTRACTION[4]

[1] http://www.hou.usra.edu/meetings/lpsc2015/
[2] http://brat.nlplab.org/
[3] http://nlp.stanford.edu/software/CRF-NER.shtml
[4] https://github.com/USCDataScience/parser-indexer-py/tree/master/src/corenlp

Gasda, P. J., et al. "Potential Link Between High-Silica Diagenetic Features in Both Eolian and Lacustrine Rock Units Measured in Gale Crater with MSL." Lunar and Planetary Science Conference. Vol. 47. 2016.

NER MODEL EVALUATION

● 8 test documents from LPSC 2015
● 100 training documents from LPSC 2015 and LPSC 2016

+ There is so much to learn!
+ Research in Question Answering (AI) is fascinating
+ Narrowing down:
  + Natural Language Understanding
  + Question Answering
  + Information Extraction
  + Knowledge Representation

RESEARCH MOTIVATIONS AND INTERESTS

+ Question Answering is an interface, not a single application
  + Any human-interfacing system that has input and output can be converted into a kind of question-answering system
+ Natural Language Understanding
  + Identifying the entities (nouns)
  + Resolving the references (pronouns, context)
  + Updating the states (adjectives) of entities based on the actions (verbs)

QUESTION ANSWERING

+ Capture the mutable aspects of knowledge
+ A formal language to do (math) reasoning (induction, deduction)
+ A graphical language to visualize and explain
+ Storage and Retrieval

KNOWLEDGE REPRESENTATION

TIMELINE KR

+ Every entity has a timeline+ Timelines intersect with each other+ For example: this meeting
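One way to picture the idea as a data structure is sketched below. This is an illustration only, not a design from the talk: the class names, fields and the second event are hypothetical. Each entity owns a time-ordered list of events, and an event shared by several entities is where their timelines intersect.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    time: str                      # ISO 8601 date, so string order is time order
    description: str
    participants: list = field(default_factory=list)


class TimelineKB:
    """Hypothetical store: one time-ordered list of events per entity."""

    def __init__(self):
        self.timelines = {}

    def add_event(self, event):
        for entity in event.participants:
            timeline = self.timelines.setdefault(entity, [])
            timeline.append(event)
            timeline.sort(key=lambda e: e.time)

    def intersections(self, entity_a, entity_b):
        """Events where the timelines of the two entities meet."""
        return [e for e in self.timelines.get(entity_a, [])
                if entity_b in e.participants]


kb = TimelineKB()
# "This meeting", taken from the title slide of this talk.
kb.add_event(Event("2016-11-03", "this meeting",
                   ["Thamme Gowda", "Stanford University"]))
# Hypothetical earlier event involving only one entity.
kb.add_event(Event("2016-01-01", "hypothetical earlier event", ["Thamme Gowda"]))
shared = kb.intersections("Thamme Gowda", "Stanford University")
```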

TIMELINE KR : Example

QUESTIONS ?

Thanks
