Hw09 Enabling Ad Hoc Analytics At Web Scale

Post on 24-Jun-2015


rod smith (rod.smith@us.ibm.com)

© 2006 IBM Corporation

Enabling ad-hoc Analytic Apps with Hadoop


Friday, October 2, 2009

IBM Software Group, October 2009, SWG Emerging Internet Technology

Hadoop World ’09

Emerging Technology - What do we work on?

Making Hadoop accessible to business professionals


New Intelligence - Big Data

Nearly 15 petabytes of data are created every day, eight times more than the information in all the libraries in the U.S.

Volume of data in enterprises is doubling approximately every 3 years (Forrester Research)

• Includes structured and unstructured data, excludes rich media

The cost to find, collect & analyze data is decreasing significantly as web innovation proceeds

Content holds untapped value for business insights & intelligence


New Intelligence - New Class of Application on Horizon?

(Diagram labels: Gather, Extract, Explore)

Internet Evolution: A web of data sources, services for exploring & manipulating data, and ways that users can connect them together (Tom Coates/Yahoo™)

Enterprises recognizing potential of leveraging the broader web for business intelligence coverage - as well as for internal data

Next wave of content-centric webApps emerging

• Long(er) running data collection & analytic applications


New Intelligence - New Class of Application on Horizon?

Hear business users asking for the ability to directly manipulate, analyze & remix massive data sources & services

• LOB: “… Google whetted my appetite... I want more customizable analytics with me in the driver's seat…”

Leveraging easy-to-use, rich data manipulation metaphors like spreadsheets, etc.

Rich visualizations to quickly identify insights


Rich Spectrum: DIY Analytic Applications Emerging


Let's Talk Customer Scenarios - BBC

BBC Digital Democracy Project: Achieving Increased Government Transparency

Web Content To Gather:

• UK Parliament Web Site

• Timeframe: 10+ years

Business Questions:

• Name names: Who is doing what, who isn't doing what

• Overlay voting record with demographic & voting records over time

• Buzz - what are people talking about?

• Visualize content relationships

Knowledge of Interest:

• Members of Parliament (MPs)

• Bills, Debates, Voting Districts


Let's Talk Customer Scenarios - Thomson Reuters

Enrich Trader's Desktop Enhancement

Timely aggregation & analytics of content originating from public internet sites

Web Content To Gather:

• ~118 3rd-party financial news services and blogs, including: BBC, CNN, Yahoo News, Financial Times, NY Times, The Big Picture, Fox News, PR Newswire, Market Watch, World Press, Forbes, Google News, Wall Street Journal, MSNBC, The Sun, ZDNet

Business Questions:

• NewsBuzz: What are the headlines? What are not the headlines but still in focus?

• OpinionMonitor: Who is saying what? What are the debate topics?

• NewsTimeline: Chronology (pulse) of headline news?

• TopicCloud: Tag-based topic metrics

• IssueAnalytics: Link backs to semantically related news

Knowledge of Interest:

• People, places, events

Scenario:

• Gather unstructured data from anywhere between 200 and 2000 data sources - every 15 minutes

• Perform preprocessing (search, transform, index) over each source

• Publish harvested content for distributed content services and downstream Mashups
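The gather, preprocess, and index steps in the scenario above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Thomson Reuters or M2 implementation: the names `harvest_once`, `preprocess`, and `SAMPLE_PAGES` are hypothetical, fetching is replaced by an in-memory dictionary, and "index" is taken to mean a simple inverted index from word to source.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Set

# "every 15 minutes" from the scenario; a scheduler would wait this long between passes.
HARVEST_INTERVAL_SECONDS = 15 * 60


def preprocess(raw: str) -> List[str]:
    """Transform step: lowercase the text and keep only the letters of each word."""
    tokens = []
    for word in raw.lower().split():
        cleaned = "".join(ch for ch in word if ch.isalpha())
        if cleaned:
            tokens.append(cleaned)
    return tokens


def harvest_once(sources: Iterable[str], fetch: Callable[[str], str]) -> Dict[str, Set[str]]:
    """Run one gather -> preprocess -> index pass and return an inverted index."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for source in sources:
        for token in preprocess(fetch(source)):
            index[token].add(source)
    return dict(index)


# Hypothetical in-memory stand-in for real network fetches (assumption, not from the slides).
SAMPLE_PAGES = {
    "feed-a": "Markets rally as banks report earnings",
    "feed-b": "Banks face new rules; markets steady",
}

index = harvest_once(SAMPLE_PAGES, lambda name: SAMPLE_PAGES[name])
print(sorted(index["banks"]))
```

In a real deployment each source would be processed independently, which is what makes this kind of harvest a natural fit for a parallel framework.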


IBM Emerging Technology Project: M2

What is it?

An insight engine for enabling ad-hoc business insights for business users - at web scale

How does it work? Discovery Process:

1. Point M2 to data sources of interest

• Unstructured web data, feeds, XML, etc.

2. Transform data into a form that can be analyzed

• Unstructured data becomes semi-structured data

• Example: name: Rod Smith, employer: IBM, state: GA

• Apply analytics - enriching the data

3. “What if tooling” - browser-based visual front end - spreadsheet metaphor to create worksheets for exploring/visualizing the data

What's different?

• Unlocking insights embedded in unstructured data

• Analyzing data previously unavailable to analyze
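Step 2 of the discovery process can be illustrated with the slide's own example record. The slides do not say how M2 extracts or enriches fields, so the sketch below only shows the target shape: text turned into a key/value record, then enriched with a derived field. The functions `parse_record` and `enrich` and the `STATE_NAMES` lookup are hypothetical names introduced for illustration.

```python
from typing import Dict

# Hypothetical lookup table used only to illustrate an "enrichment" step.
STATE_NAMES = {"GA": "Georgia"}


def parse_record(line: str) -> Dict[str, str]:
    """Split 'key: value, key: value' text into a dictionary."""
    record = {}
    for field in line.split(","):
        key, _, value = field.partition(":")
        if value:
            record[key.strip()] = value.strip()
    return record


def enrich(record: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the record with a derived state_name field, when known."""
    enriched = dict(record)
    state_name = STATE_NAMES.get(record.get("state", ""))
    if state_name:
        enriched["state_name"] = state_name
    return enriched


record = enrich(parse_record("name: Rod Smith, employer: IBM, state: GA"))
print(record)
```

Once data is in this record form, a spreadsheet-style front end can treat each key as a column.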


M2 -> Demo

Project: Improve IP Portfolio Analysis for Mergers & Acquisitions

“...please collect all US Patent filings… then let’s do…”

Web Content To Gather:

• Gathered 1.4m patent docs from USPTO

• 1991-2007 case records from the United States Court of Appeals for the Federal Circuit (CAFC)

Business Questions:

• How much is a target company worth?

• What are the high-value areas of their portfolio?

• Explored cited patent topics, litigated patents

Knowledge of Interest:

• Patents ranked by citation, e.g., how often a patent is referenced determines its value

• Corporate genealogies, IP ownership roll-up

• Augment analysis with items affecting IP value, inventor affiliation, citation rank by time
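The "patents ranked by citation" idea above can be sketched as a simple count of how often each patent is referenced by others. This is a minimal illustration, not the demo's actual scoring: the function name `rank_by_citations` and the patent identifiers are hypothetical, and a plain citation count stands in for whatever valuation the project really used.

```python
from collections import Counter
from typing import Dict, List, Tuple


def rank_by_citations(citations: Dict[str, List[str]]) -> List[Tuple[str, int]]:
    """Count how often each patent is cited and sort by that count, highest first."""
    counts: Counter = Counter()
    for cited_list in citations.values():
        counts.update(set(cited_list))  # count each citing patent at most once
    # Sort by count descending, then by identifier so ties have a fixed order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical identifiers: citing patent -> patents it references.
SAMPLE_CITATIONS = {
    "P4": ["P1", "P2"],
    "P5": ["P1"],
    "P6": ["P1", "P3", "P2"],
}

ranking = rank_by_citations(SAMPLE_CITATIONS)
print(ranking)
```

Grouping the same counts by filing year would give the "citation rank by time" view mentioned in the list.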


What's Under the Covers: Hadoop

Emergence of map/reduce programming model for a new class of webApp

Hadoop: provides a framework for large scale parallel processing map/reduce apps (Apache project led by Yahoo)

• Offers simplicity of “programming” - Looks like a simple single threaded app model for developers

• Handles big data - scalable storage across machine clusters (think read-only file system)

• Deployment: no application knowledge of runtime or OS or cloud necessary

• Today - setting up, coding Hadoop jobs in Java, etc. is the domain of skilled Java engineers
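The "simple single threaded app model" of map/reduce can be shown with a word count run in one process. This is a sketch of the programming model only, written in plain Python rather than against the Hadoop API; the function names `map_phase`, `shuffle`, `reduce_phase`, and `run_job` are illustrative. In Hadoop, the framework performs the shuffle and runs many map and reduce tasks in parallel across the cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key (done by the framework in Hadoop)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def run_job(documents: Iterable[str]) -> Dict[str, int]:
    """Chain map, shuffle, and reduce sequentially in a single process."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = run_job(["big data big insights", "data at web scale"])
print(counts["big"], counts["data"])
```

The developer writes only the map and reduce functions, each of which sees one record or one key at a time; that is why the model reads like single-threaded code.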


IBM Emerging Technology Project: M2 Architectural Components

Expanding upon the Hadoop stack

• Visual tooling builds extensively on Pig

M2 Architecture Characteristics:

• Extensible via UDFs

• REST API for customer choice of analytic service/engine

• REST API for choice of visualization packages

• Export content as feeds, XML, etc.

• ...more to come


Conclusions

In God we trust


…all others bring data


Enterprises quickly evolving their thinking from a Database strategy to a Data Strategy encompassing unstructured & structured content

Repeatable business patterns in broad range of industries emerging

Hadoop has potential to be the platform for broad range of solutions from web-based analytics -> business event processing -> collaboration


Almost The End

Selecting customer proof of concept projects


www-01.ibm.com/software/ebusiness/jstart/about.html

INTERESTED?
