1© Cloudera, Inc. All rights reserved.
Alexander Bibighaus | Director of Engineering, Cloudera, Inc.
The Future of Data Management with Hadoop and the Enterprise Data Hub
2© Cloudera, Inc. All rights reserved.
3© Cloudera, Inc. All rights reserved.
Cloudera Snapshot
Founded: 2008, by former employees of [company logos]
Employees Today: 900+
World Class Support: 24x7 Global Staff; Pro-active & Predictive Support Programs
Mission Critical: Thousands of Enterprise Users; Over 600 Paying Subscription Customers
The Largest Ecosystem: 1,600+ Partners
Cloudera University: 100,000+ Trained
Open Source Leaders: Cloudera Employees are Leading Developers & Contributors
Total Capital Raised: $1B+ (from Intel, Google, Dell, T. Rowe Price, Accel, Greylock)
Mission: Help Organizations Leverage the Power of All Their Data to Ask Bigger Questions.
4© Cloudera, Inc. All rights reserved.
A Big Data Revolution is happening as we speak
Industrial Revolution Data Revolution
5© Cloudera, Inc. All rights reserved.
Data Drives Industries
Industries: Financial Services, Public Sector, Healthcare, Telecommunications, Retail
• Optimize network performance
• Money laundering detection
• Cyber security detection
• Product recommendations
• Personalized medicine
6© Cloudera, Inc. All rights reserved.
Data Drives Business
Functions: Sales, Operations, Product, Marketing, Customer Satisfaction
• Increase conversions by 2%
• Convert 5% more leads
• Reduce fraud by 3%
• Reduce churn by 1%
• Increase user adoption by 10%
7© Cloudera, Inc. All rights reserved.
Why is Big Data Happening Now?
Instrumentation: Everything that can be measured will be measured.
Personalization: Employees and customers expect more personal interactions, but not at the cost of their privacy. The age of “segment of 1”.
Advanced Analytics: The most innovative companies embrace experimentation, predictive analytics and agility.
8© Cloudera, Inc. All rights reserved.
Data is fueling this opportunity
• Web/Mobile Clickstream
• Social Media
• Sensor Networks
• Audio, Image & Video
9© Cloudera, Inc. All rights reserved.
Access to diverse analysis techniques
• SQL
• Video & Voice Processing
• Text Sentiment Analysis
• Social Graph Analysis
10© Cloudera, Inc. All rights reserved.
People require analytics
“80% of CEOs cite data mining and analytics as strategically important.”
- 2015 PwC CEO Survey
11© Cloudera, Inc. All rights reserved.
Big Data is Getting Bigger & More Multi-structured
• 1.8 trillion gigabytes of data was created in 2011*
• More than 90% is unstructured data
• Data volume doubles every year
[Chart: GB of data (in billions), axis from 0 to 10,000, for 2005, 2010 and 2015; series: structured data, unstructured data]
* Source: IDC 2011
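The "doubles every year" bullet describes compound growth, and a few lines of Python make the implication concrete. This is only an illustrative sketch: the 1.8 trillion GB figure for 2011 comes from the slide, while the function name `projected_volume_gb` and the assumption that strict doubling continues year after year are ours, not IDC's.

```python
def projected_volume_gb(base_gb: float, base_year: int, year: int) -> float:
    """Volume in GB if data doubles once per year from base_year (assumed)."""
    return base_gb * 2 ** (year - base_year)


BASE_GB = 1.8e12  # 1.8 trillion GB created in 2011, per the slide

# Naive projection: print the volume for each year from 2011 to 2015.
for year in range(2011, 2016):
    volume = projected_volume_gb(BASE_GB, 2011, year)
    print(f"{year}: {volume / 1e12:.1f} trillion GB")
```

Under strict doubling the volume grows 16-fold in four years, which is why storage cost per terabyte matters so much in the slides that follow.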
12© Cloudera, Inc. All rights reserved.
Hadoop Changes the Game: Storage & Compute Together
The Old Way
Expensive, Special purpose, “Reliable” Servers; Expensive Licensed Software
[Diagram: Compute (RDBMS, EDW), Network, Data Storage (SAN, NAS)]
• Hard to scale
• Network is a bottleneck
• Only handles relational data
• Difficult to add new fields & data types
$30,000+ per TB: Expensive & Unattainable

The Hadoop Way
Commodity “Unreliable” Servers; Hybrid Open Source Software
[Diagram: Compute (CPU), Memory, Storage (Disk)]
• Scales out forever
• No bottlenecks
• Easy to ingest any data
• Agile data access
$300-$1,000 per TB: Affordable & Attainable
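The per-terabyte prices on this slide can be turned into a quick back-of-the-envelope comparison. The sketch below uses only the figures quoted above ($30,000+ per TB versus $300-$1,000 per TB); the function names and the 100 TB example size are hypothetical, and the "old way" figure is treated as a lower bound because the slide says "$30,000+".

```python
OLD_WAY_PER_TB = 30_000       # slide: "$30,000+ per TB" (lower bound)
HADOOP_PER_TB = (300, 1_000)  # slide: "$300-$1,000 per TB"


def storage_cost(terabytes: float, price_per_tb: float) -> float:
    """Total cost in dollars of storing `terabytes` at a flat price per TB."""
    return terabytes * price_per_tb


def compare(terabytes: float) -> dict:
    """Compare the slide's two price points for a given data size."""
    old = storage_cost(terabytes, OLD_WAY_PER_TB)
    low = storage_cost(terabytes, HADOOP_PER_TB[0])
    high = storage_cost(terabytes, HADOOP_PER_TB[1])
    return {
        "old_way_min": old,
        "hadoop_low": low,
        "hadoop_high": high,
        "ratio_min": old / high,  # old way vs. most expensive Hadoop price
        "ratio_max": old / low,   # old way vs. cheapest Hadoop price
    }


print(compare(100))  # hypothetical 100 TB data set
```

With the slide's numbers, the gap is between 30x and 100x at any data size, since both prices scale linearly in this simplified model.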
13© Cloudera, Inc. All rights reserved.
The Legacy Way: Bringing Data to Applications
SERVERS MARTS EDWS DOCUMENTS STORAGE SEARCH ARCHIVE
ERP, CRM, RDBMS, MACHINES FILES, IMAGES, VIDEOS, LOGS, CLICKSTREAMS EXTERNAL DATA SOURCES
1. Can’t Retain Valuable Data
• Leaving data behind
• Risk and compliance
• High cost of storage
2. Can’t Meet ETL SLAs
• Up-front modeling
• Transforms slow
• Transforms lose data
3. Can’t Ask New Questions
• Existing systems strained
• No agility
• “BI backlog”
4. Can’t Get a 360 View
• Many special-purpose systems
• Moving data around
• No complete views
14© Cloudera, Inc. All rights reserved.
The Agile Way: Bringing Applications to Data
SERVERS MARTS EDWS DOCUMENTS STORAGE SEARCH ARCHIVE
ERP, CRM, RDBMS, MACHINES FILES, IMAGES, VIDEOS, LOGS, CLICKSTREAMS EXTERNAL DATA SOURCES
1. Active Archive
• Full fidelity original data
• Indefinite time, any source
• Lowest cost storage
2. Scalable Transformations
• One source of data for all analytics
• Persist state of transformed data
• Significantly faster & cheaper
3. Agile Exploration
• Simple search + BI tools
• “Schema on read” agility
• Reduce BI user backlog requests
4. Consolidated Architecture
• Bring applications to data
• Combine different workloads on common data (e.g. SQL + Search)
• True analytic agility
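"Schema on read" means the raw data is stored untouched and a schema is applied only when someone queries it, so new questions do not require re-modeling or reloading. The sketch below illustrates the idea in plain Python; it is not how Hadoop, Hive or Impala implement it, and the sample events, the `read_with_schema` helper and the `{field: (type, default)}` schema format are all made up for illustration.

```python
import json

# Raw events kept exactly as they arrived; the records do not share one shape.
RAW_EVENTS = [
    '{"user": "a", "page": "/home", "ms": 120}',
    '{"user": "b", "page": "/cart", "ms": 340, "coupon": "SPRING"}',
    '{"user": "c", "page": "/home"}',
]


def read_with_schema(raw_lines, schema):
    """Project each raw record onto `schema` ({field: (type, default)}) at read time."""
    for line in raw_lines:
        record = json.loads(line)
        yield {
            field: cast(record[field]) if field in record else default
            for field, (cast, default) in schema.items()
        }


# Two different "views" over the same stored data, each defined at query time.
latency_view = list(read_with_schema(RAW_EVENTS, {"page": (str, None), "ms": (int, 0)}))
coupon_view = list(read_with_schema(RAW_EVENTS, {"user": (str, None), "coupon": (str, None)}))

print(latency_view)
print(coupon_view)
```

The `coupon` field was never planned for up front, yet the second view can use it without touching the stored data, which is the agility the slide refers to.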
15© Cloudera, Inc. All rights reserved.
Hadoop is more than just Apache Hadoop
2006: Core Hadoop (HDFS, MR)
2008: HBase, ZooKeeper, Core Hadoop
2009: Hive, Pig, Mahout, HBase, ZooKeeper, Core Hadoop
2010: Sqoop, Whirr, Avro, Hive, Pig, Mahout, HBase, ZooKeeper, Core Hadoop
2011: Flume, Bigtop, Oozie, MRUnit, HCatalog, Sqoop, Whirr, Avro, Hive, Pig, Mahout, HBase, ZooKeeper, Core Hadoop + YARN
2012: Spark, Impala, Solr, Kafka, Flume, Bigtop, Oozie, MRUnit, HCatalog, Sqoop, Whirr, Avro, Hive, Pig, Mahout, HBase, ZooKeeper, Core Hadoop + YARN
Present: Parquet, Sentry, Spark, Impala, Solr, Kafka, Flume, Bigtop, Oozie, MRUnit, HCatalog, Sqoop, Whirr, Avro, Hive, Pig, Mahout, HBase, ZooKeeper, Core Hadoop + YARN
16© Cloudera, Inc. All rights reserved.
Cloudera Enterprise powered by Apache Hadoop
A new kind of data platform
• One place for unlimited any-type data
• Unified, multi-framework data access
Key Advantages:
• High performance
• Enterprise system and data management
• Secure by default
• Open source, Open standards
[Diagram: Process, Discover, Model, Serve; Unlimited Storage; Security and Administration]
Deployment Flexibility
• On-Premises
• Appliances
• Engineered Systems
• Public Cloud
• Private Cloud
• Hybrid Cloud
17© Cloudera, Inc. All rights reserved.
Data Drives Travel/Leisure
• Customer Segmentation
• Marketing Campaign Testing
• Regulatory Compliance
18© Cloudera, Inc. All rights reserved.
Data Drives Social
19© Cloudera, Inc. All rights reserved.
Data Drives Manufacturing
• Predictive maintenance
• Goods classification
20© Cloudera, Inc. All rights reserved.
Data Drives Healthcare
• Population Health
• Patient Monitoring
• Chronic Disease Management
21© Cloudera, Inc. All rights reserved.
Big Data takes on a lot of questions
MEDIA / ENTERTAINMENT: Viewers / advertising effectiveness
ON-LINE SERVICES / SOCIAL MEDIA: People & career matching; Website optimization
HEALTH CARE: Patient sensors, monitoring, EHRs; Quality of care
FINANCIAL SERVICES: Risk & portfolio analysis; New products
CONSUMER PACKAGED GOODS: Sentiment analysis of what’s hot, customer service
TRAVEL & TRANSPORTATION: Sensor analysis for optimal traffic flows; Customer sentiment
RETAIL: Consumer sentiment; Optimized marketing
EDUCATION & RESEARCH: Experiment sensor analysis
LIFE SCIENCES: Clinical trials; Genomics
AUTOMOTIVE: Auto sensors reporting location, problems
COMMUNICATIONS: Location-based advertising
HIGH TECHNOLOGY / INDUSTRIAL MFG.: Mfg quality; Warranty analysis
UTILITIES: Smart Meter analysis for network capacity
OIL & GAS: Drilling exploration sensor analysis
LAW ENFORCEMENT & DEFENSE: Threat analysis, Social media monitoring, Photo analysis
22© Cloudera, Inc. All rights reserved.
Ask Bigger Questions: How do we feed the world?
A Fortune 500 company specializing in agriculture and genomics can automate data-driven R&D decisions to reduce time to market from years to months.
23© Cloudera, Inc. All rights reserved.
Thank you!
Alexander Bibighaus | Director of Engineering