Meshwork - Insight Data Engineering Project

RANKING THE INTERNETZ WITH MESHWORK
Justin Cano
Insight Data Engineering Fellow

Transcript of Meshwork - Insight Data Engineering Project


Motivation
• The internet is huge
• How does your page rank amongst others in your mesh?
• What is the reach of your website?
• Which pages are affecting your page rank?

Data Source
• Common Crawl Organization
• More than 7 years of web page data, over 500 TB
• CC April 2015 web corpus: ~168 TB
• Processed ~445 GB for this project
• Readily available in S3

Meshwork – your mesh in a network http://www.jcano.me/meshwork

Pipeline

[Pipeline diagram. Labels: Data from S3 (source of truth); Raw Data (WARC format); Extraction; Edge List; … …; REST]
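The extraction step, which turns raw page data into an edge list, could look roughly like the sketch below. It is not the project's code: it assumes the HTML body of each page has already been pulled out of its WARC record, treats each page URL as a vertex, and uses hypothetical names (`LinkCollector`, `extract_edges`) chosen only for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag seen in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_edges(page_url, html):
    """Return a sorted list of (source_url, target_url) edges for one page.

    Assumption: vertices are page URLs; the slides do not say whether
    the project ranked pages or whole domains.
    """
    parser = LinkCollector()
    parser.feed(html)
    edges = set()
    for href in parser.hrefs:
        # Resolve relative links against the URL of the page they appear on.
        target = urljoin(page_url, href)
        # Keep only web links, and drop links from a page to itself.
        if urlparse(target).scheme in ("http", "https") and target != page_url:
            edges.add((page_url, target))
    return sorted(edges)
```

Collecting edges in a set removes repeated links between the same two pages, so each pair appears in the edge list only once.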

Data Flow

Link edge data (vertexId, pageRank)
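To make the (vertexId, pageRank) output concrete, here is a minimal single-machine sketch of PageRank over an edge list. The project ran its ranking as a Spark job; this plain-Python version only illustrates the algorithm and is not that job. The function name `page_rank`, the damping factor of 0.85, and the fixed iteration count are assumptions made for the example.

```python
def page_rank(edges, damping=0.85, iterations=20):
    """Return {vertex_id: rank} for a directed edge list; ranks sum to 1."""
    vertices = sorted({v for edge in edges for v in edge})
    out_links = {v: [] for v in vertices}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(vertices)
    ranks = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        # Rank held by vertices with no out-links is spread evenly over all vertices.
        dangling = sum(ranks[v] for v in vertices if not out_links[v])
        new_ranks = {
            v: (1.0 - damping) / n + damping * dangling / n for v in vertices
        }
        # Each vertex splits its current rank equally among its out-links.
        for src in vertices:
            for dst in out_links[src]:
                new_ranks[dst] += damping * ranks[src] / len(out_links[src])
        ranks = new_ranks
    return ranks


# Usage: vertex 1 is linked to by vertices 3 and 4; vertex 4 has no inbound links.
edges = [(1, 2), (2, 3), (3, 1), (4, 1)]
for vertex_id, rank in sorted(page_rank(edges).items()):
    print(vertex_id, round(rank, 4))
```

A vertex with no inbound links ends up with only the baseline share `(1 - damping) / n`, which is why pages that nobody links to rank lowest.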

[Plot: axis label "Time (h)"; annotation ">3mil Records"]

Scaling up Page Rank job… spark-submit --class pageRank...

About Me
Justin Cano
UC Riverside, BS Computer Engineering

Previous work experience Software Engineer @

Hobbies
• I like building things! Hardware, software
• Learning and using new technologies
• Moviegoer
• Outdoor activities: biking, snowboarding
• Interests: design, app dev
• Favorite TV Shows: Futurama & The Daily Show

Embedded Systems Developer @

Software Engineer Intern @