Memex - PyData Seattle

Post on 15-Aug-2015

© 2015 Continuum Analytics- Confidential & Proprietary

Memex: Mining the Deep Web

Katrina Riehl, PhD, Data Scientist, Continuum Analytics

July 25, 2015

Explaining THE DEEP WEB

When you ask the internet a question, who is answering?

An introduction to DARPA MEMEX

What is MEMEX?
• Today's web searches use a centralized, one-size-fits-all approach that searches the Internet with the same set of tools for all queries. While that model has been wildly successful commercially, it does not work well for many government use cases.
• DARPA launched the Memex program in September 2014.
• Memex seeks to develop software that advances online search capabilities
• Creation of a new domain-specific indexing and search paradigm
  • content discovery
  • information extraction
  • information retrieval
  • user collaboration
• Extension of current search capabilities
  • deep web
  • dark web
  • nontraditional (e.g. multimedia) content

Memex Search Domains
• Human/Labor Trafficking
• Weapons
• Material Research Science
• Financial Fraud
• Counterfeit Electronics
• Patent Trolling
• Child Exploitation

http://opencatalog.darpa.mil

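LARGE SCALE DATA ANALYTICS: An Overview of the Ecosystem

[Figure: ecosystem diagram with regions labeled BI - DB, DM/Stats/ML, Scientific Computing, and Distributed Systems. Tool names that survive from the image: Numba, bcolz, RHadoop.]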

THE ANALYTICS PIPELINE

Analytics Pipeline
• Web Crawlers & Scrapers
• Entity Extractors
• Indexers
• Visual Analytics
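
The first three pipeline stages above (scraping, entity extraction, indexing) can be sketched end to end with the Python standard library. This is a toy illustration, not the Memex implementation: the sample pages, the regular expressions, and the names `TextScraper`, `extract_entities`, and `build_index` are all assumptions made for the example, and the visual analytics stage is left out.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class TextScraper(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def scrape(html):
    """Scraper stage: reduce an HTML page to its visible text."""
    parser = TextScraper()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical, deliberately simple patterns; real extractors are far richer.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def extract_entities(text):
    """Entity extractor stage: pull emails and phone numbers from text."""
    return {"email": EMAIL.findall(text), "phone": PHONE.findall(text)}


def build_index(docs):
    """Indexer stage: map (entity type, value) to the page ids containing it."""
    index = defaultdict(set)
    for doc_id, html in docs.items():
        for kind, values in extract_entities(scrape(html)).items():
            for value in values:
                index[(kind, value)].add(doc_id)
    return index


# Made-up pages standing in for crawler output.
pages = {
    "page1": "<html><body><p>Call 555-010-1234 or mail seller@example.com</p></body></html>",
    "page2": "<html><script>var x = '555-010-9999';</script><p>Phone: 555-010-1234</p></html>",
}
index = build_index(pages)
```

Keying the index by entity rather than by page makes it cheap to ask which pages share a phone number or email address, which is the kind of cross-page link a domain-specific search would want to surface.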

Memex Explorer
• Pluggable Framework for Crawling & Data Discovery
• Django Web Application
• Elasticsearch Index
• Bokeh Visualizations for Crawling Stats
• Kibana Dashboards for Initial Data Exploration
• Apache Nutch Crawler
• NYU ACHE Crawler
• NYU Domain Discovery Tool

[Figure: three-layer diagram.
Data Storage: csv, HDF5, bcolz, DataFrame, HDFS, json.
Abstract expressions: selection, filter, group by, join, column wise.
Computational backend: Pandas, Streaming Python, Spark, MongoDB, SQLAlchemy.]
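
The diagram labels above suggest a separation between what is computed (abstract expressions such as filter and group by) and where it is computed (the backend). A minimal sketch of that idea, using only the standard library, follows. It does not reproduce the API of any library shown on the slide: `GroupSum`, `run_python`, `run_sqlite`, the `crawl` table, and the sample rows are hypothetical names and data chosen for illustration, with `sqlite3` standing in for a SQL backend.

```python
import sqlite3
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupSum:
    """Abstract expression: keep rows with value >= min_value, sum value per key."""
    key: str
    value: str
    min_value: float


def run_python(expr, rows):
    """Backend 1: evaluate the expression over an in-memory list of dicts."""
    totals = defaultdict(float)
    for row in rows:
        if row[expr.value] >= expr.min_value:
            totals[row[expr.key]] += row[expr.value]
    return dict(totals)


def run_sqlite(expr, conn, table):
    """Backend 2: translate the same expression to SQL and let SQLite run it."""
    # Column and table names are interpolated for brevity (trusted input only);
    # the filter threshold is passed as a bound parameter.
    sql = (
        f"SELECT {expr.key}, SUM({expr.value}) FROM {table} "
        f"WHERE {expr.value} >= ? GROUP BY {expr.key}"
    )
    return {k: float(v) for k, v in conn.execute(sql, (expr.min_value,))}


# Made-up crawl statistics, stored twice: as Python dicts and as a SQL table.
rows = [
    {"domain": "a.example", "pages": 10},
    {"domain": "b.example", "pages": 3},
    {"domain": "a.example", "pages": 7},
    {"domain": "b.example", "pages": 1},
]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE crawl (domain TEXT, pages INTEGER)")
conn.executemany("INSERT INTO crawl VALUES (:domain, :pages)", rows)

# One expression, two backends.
expr = GroupSum(key="domain", value="pages", min_value=3)
python_result = run_python(expr, rows)
sqlite_result = run_sqlite(expr, conn, "crawl")
```

Because the expression is plain data, the caller never writes backend-specific code; swapping the storage layer only means choosing a different `run_*` function.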

DATA ANALYSIS

Topic Modeling
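
The slides do not say which topic-modeling algorithm or library was used. As an illustration only, the sketch below assumes Latent Dirichlet Allocation (LDA) fitted by collapsed Gibbs sampling, written in pure Python. The toy documents, the hyperparameters `alpha` and `beta`, and the function names `lda_gibbs` and `top_words` are assumptions for the example; a real analysis would use an established library and far more data.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)

    # Count tables: document-topic, topic-word, and per-topic totals.
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]
    topic_total = [0] * n_topics

    # Start from a random topic assignment for every token.
    assign = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # p(z = k | rest) is proportional to
                # (n_dk + alpha) * (n_kw + beta) / (n_k + V * beta)
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                # Resample the topic and add the token back.
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


def top_words(topic_word, n=3):
    """Return the n most frequently assigned words for each topic."""
    return [sorted(counts, key=counts.get, reverse=True)[:n] for counts in topic_word]


# Toy corpus with two loose themes (crawling vs. search).
docs = [
    "crawler fetch page link crawler page".split(),
    "index search query index rank".split(),
    "page link fetch crawler link".split(),
    "query rank search index query".split(),
]
doc_topic, topic_word = lda_gibbs(docs, n_topics=2)
```

Calling `top_words(topic_word)` lists the dominant words per topic, and each row of `doc_topic` shows how many of a document's tokens were assigned to each topic. The sampler is stochastic, so the topics found depend on the seed and on the corpus.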

QUESTIONS? Thank you!!