Teradata Partners Conference Oct 2014 Big Data Anti-Patterns
| 1
Big Data Anti-Patterns: Lessons from the Front Lines
Douglas Moore
Principal Data Architect
Think Big, a Teradata Company
| 2
About Douglas Moore
Think Big – 3 Years
- Roadmaps
- Delivery
  • BDW, Search, Streaming
- Tech Assessments
Before Big Data
- Data Warehousing
- OLTP
- Systems Architecture
- Electricity
- High End Graphics
- Supercomputers
- Numerical Analysis
Contact me at: @douglas_ma
| 3
Think Big
4-year-old “Big Data” Professional Services Firm
- Roadmaps
- Engineering
- Data Science
- Hands on Training
Recently acquired by Teradata
• Maintaining Independence
| 4
Content Drawn From Vast Amounts of Experience
…
50+ Clients
Leading security software vendor
Leading Discount Retailer
| 5
I started out with just 3 topics…
Then while on the road to Strata,
I met 7 big data architects
- Who had 7 clients
• Who had 7 projects
• That demonstrated 7 Anti-Patterns
Introduction
Big Data Anti-pattern: “Commonly applied but bad solution”
[Image source: I-95, Wikipedia]
| 6
Three Focus Areas
• Hardware and Infrastructure
• Tooling
• Big Data Warehousing
| 7
Hardware & Infrastructure
Reference Architecture Driven
- 90’s & 00’s data center patterns
- Servers MUST NOT FAIL
- Standard Server Config
  • $35,000/node
  • Dual Power supply
  • RAID
  • SAS 15K RPM
  • SAN
  • VMs for Production
  • Flat Network
[Image source: HP: The transformation to HP Converged Infrastructure]
Automated provisioning is a good thing!
| 8
#1 Locality
Locality Locality Locality
- Bring Computation to Data
- Co-locate data and compute
- Locally Attached Storage
- Localize & isolate network traffic
- Rack Awareness (a minimal topology-script sketch follows below)
[Diagram: VM Cluster vs. Hadoop Cluster]
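Rack awareness is usually wired in through a small topology script that HDFS calls via the net.topology.script.file.name property in core-site.xml: Hadoop passes host IPs or names as arguments and expects one rack path per argument on stdout. A minimal sketch; the IP-to-rack mapping is made up for illustration.

#!/usr/bin/env python
# Hypothetical HDFS rack-topology script (referenced by
# net.topology.script.file.name). Hadoop invokes it with one or more
# host IPs/names and expects one rack path per argument on stdout.
import sys

RACKS = {                      # made-up mapping for illustration
    "10.0.1.11": "/dc1/rack1",
    "10.0.1.12": "/dc1/rack1",
    "10.0.2.21": "/dc1/rack2",
}
DEFAULT_RACK = "/dc1/default-rack"

for host in sys.argv[1:]:
    print(RACKS.get(host, DEFAULT_RACK))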
| 9
#2 Sequential IO
Sequential IO >> Random Access
http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html
[Image credit: Wikipedia.org]
- Large block IO
- Append only writes
- JBOD
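A back-of-the-envelope comparison makes the point; the drive figures below are rough assumptions for a single 7200 RPM SATA disk, not measurements from the deck.

# Rough arithmetic: scanning 1 TiB sequentially vs. touching it with
# random 4 KiB reads on one assumed 7200 RPM SATA drive.
TOTAL_BYTES = 1 * 1024 ** 4        # 1 TiB
SEQ_MB_PER_SEC = 120.0             # assumed sequential throughput
SEEK_SEC = 0.010                   # assumed ~10 ms per random seek
RANDOM_IO_BYTES = 4 * 1024         # 4 KiB per random read

seq_hours = TOTAL_BYTES / (SEQ_MB_PER_SEC * 1024 ** 2) / 3600
rand_hours = (TOTAL_BYTES / RANDOM_IO_BYTES) * SEEK_SEC / 3600

print("sequential scan: %.1f hours" % seq_hours)   # ~2.4 hours
print("random 4 KiB:    %.0f hours" % rand_hours)  # ~746 hours, seek-bound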
| 10
#3 Increase Parallelism
Increase # parallel components
- Reduce component cost
Data block replication
- Performance
- Availability
Commodity++ (2014)
- High density data nodes
- $8-12,000
- ~12 drives
- ~12-16 cores
- Buy more servers for the cost of one (worked out in the sketch below)
  • 4-5x spindles
  • 4-5x cores
Hadoop Cluster
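Checking the “4-5x per dollar” claim with assumed configurations; the legacy and commodity specs below are illustrative, not the deck’s exact numbers.

# Sanity-check "buy more servers for the cost of one" with assumed specs.
reference = {"cost": 35000, "drives": 8, "cores": 12}   # assumed legacy enterprise node
commodity = {"cost": 10000, "drives": 12, "cores": 14}  # mid-point of $8-12k, ~12 drives, ~12-16 cores

nodes_for_same_spend = float(reference["cost"]) / commodity["cost"]             # ~3.5 nodes
spindle_ratio = nodes_for_same_spend * commodity["drives"] / reference["drives"]
core_ratio = nodes_for_same_spend * commodity["cores"] / reference["cores"]

print("commodity nodes per legacy node: %.1f" % nodes_for_same_spend)  # ~3.5
print("spindles per dollar: %.1fx" % spindle_ratio)                    # ~5.2x
print("cores per dollar:    %.1fx" % core_ratio)                       # ~4.1x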
| 11
#4 Failure
Expect Failure
Hadoop Cluster
- Rack Awareness
- Data Block Replication
- Task Retry
- Node Black Listing
- Monitor Everything
- Name Node HA
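To make task retry and node blacklisting concrete, here is a toy scheduler loop; this is not Hadoop’s actual implementation, and the node names and failure simulation are invented.

import random
import time

MAX_ATTEMPTS = 4      # Hadoop retries failed task attempts a few times by default
blacklist = set()     # nodes with repeated failures are avoided for later attempts

def execute_on(node, task):
    """Stand-in for launching a task attempt on a worker node."""
    if random.random() < 0.3:                        # simulate a flaky node
        raise RuntimeError("attempt failed on %s" % node)
    return "output of %s from %s" % (task, node)

def run_task(task, nodes):
    """Toy scheduler: retry a failed task, steering away from bad nodes."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        candidates = [n for n in nodes if n not in blacklist] or list(nodes)
        node = random.choice(candidates)
        try:
            return execute_on(node, task)
        except RuntimeError:
            blacklist.add(node)                      # naive one-strike blacklisting
            time.sleep(0.1 * attempt)                # brief back-off before retrying
    raise RuntimeError("%s failed after %d attempts" % (task, MAX_ATTEMPTS))

print(run_task("map-0001", ["node1", "node2", "node3"]))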
| 12
Tooling
Hadoop Ecosystem Tools
[Image source: http://en.wikipedia.org/wiki/File:Bicycle_multi-tool.JPG]
| 13
Tooling: Just looking inside the box
“If it came in the box then I should use it”
Example
- Oozie for scheduling
Best Practice:
• Use your current enterprise scheduler
| 14
Tooling: NoSQL
• “Now I have all of my log data in NoSQL, let’s do analytics over it”
Example
- Streaming data into MongoDB
  • Running aggregates
  • Running MR jobs
| 15
Best Practice
• Split the stream (see the sketch below)
  • Real-time access in NoSQL
  • Batch analytics in Hadoop
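One way a split stream can look in practice: every incoming event is written both to a NoSQL store for low-latency lookups and to an append-only spool that feeds Hadoop for batch analytics. A minimal sketch assuming MongoDB via pymongo; the connection string, spool path, and event shape are all illustrative.

import json
from pymongo import MongoClient

mongo = MongoClient("mongodb://localhost:27017")    # assumed local MongoDB
realtime = mongo.logs.events                        # serves operational, per-key lookups

# Append-only spool file, periodically pushed into HDFS for batch analytics.
batch_spool = open("/data/spool/events-000001.json", "a")

def handle_event(event):
    realtime.insert_one(dict(event))                # real-time path (NoSQL)
    batch_spool.write(json.dumps(event) + "\n")     # batch path (Hadoop)
    batch_spool.flush()

handle_event({"user": "u42", "action": "login", "ts": "2014-10-20T12:00:00Z"})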
| 18
Right Framework, Right Need…
Hadoop integrates any type of application tooling
- Java
- Python
- R
- C, C++
- Fortran
- Cobol
- Ruby
Hadoop Streaming (minimal mapper sketch below)
- Integrate legacy code
- Integrate analytic tools
  • Data science libs
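Hadoop Streaming’s contract is simply “read lines on stdin, emit tab-separated key/value pairs on stdout”, which is why almost any language or legacy binary can participate. A minimal Python mapper sketch; the log layout (HTTP status in the ninth field) is an assumption for illustration.

#!/usr/bin/env python
# Minimal Hadoop Streaming mapper: count web hits by HTTP status code.
# Launched with something like (paths and reducer name are placeholders):
#   hadoop jar hadoop-streaming-*.jar \
#     -input /logs/raw -output /logs/status_counts \
#     -mapper mapper.py -reducer sum_reducer.py
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split()
    if len(fields) > 8:                 # assumed Apache-style access log layout
        status = fields[8]
        print("%s\t1" % status)         # key = status code, value = a count of 1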
| 19
Right Use Case – ETL, Wrong Framework
Got to love Ruby
- Very Cool (or it was)
- Dynamic Language
- Expressive
- Compact
- Fast Iteration
Got to Hate Ruby
- Slow
- Hard to follow & debug
- Does not play well with threading
“It’s much faster to develop in, developer time is valuable, just throw a couple more boxes at it”
Bench tested Ruby ETL framework at 5,000 records / second
| 20
Right Use Case – ETL, Wrong Framework…
Best Practice:
• Write new code in the fastest execution framework
• For high-value legacy code and analytic tools, use Hadoop Streaming
• Innovation is Important: Test and Learn
DO THE MATH:
Storm + Java: ~1MM+ events / second / server
Storm + Ruby: 5,000 records/second × 12 cores = 60,000 events / second / server
→ 16.67 times more servers needed
bit.ly/1t0HXJH
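The arithmetic behind that ratio, spelled out (the throughput figures are the slide’s rough benchmarks, not new measurements):

# Do the math: servers needed for the same event rate, Ruby vs. Java on Storm.
ruby_records_per_core = 5000        # bench-tested Ruby ETL throughput
cores_per_server = 12
java_events_per_server = 1000000.0  # ~1MM+ events/sec/server with Storm + Java

ruby_events_per_server = ruby_records_per_core * cores_per_server    # 60,000
server_multiplier = java_events_per_server / ruby_events_per_server  # ~16.67

print("Ruby events/sec/server: %d" % ruby_events_per_server)
print("Extra servers needed vs. Java: %.2fx" % server_multiplier)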
| 21
Big Data Warehousing
Hadoop Use Cases
1. ETL Offload
2. Data Warehousing
Hadoop Data Types
1. Structured
2. Semi-structured
3. Multi- or Unstructured
| 22
First Principles: #5 Schema on Read
Don’t over curate:
“We are going to
- define and parse 1,000 attributes from the machine log files on ETL servers,
- load just what we need to,
- this will take 6 months”
- HCatalog
- Navigator, Loom, …
- UDFs, UDTFs
  • JSON, Regex built in
  • Custom Java
  • Hadoop Streaming (e.g. use Python, Perl)
- Hive Partitions
- Recursive directory reads
- Bucket Joins
- Columnar formats
  • ORC
  • Parquet
Best Practices:
• Define what you need to
• Parse on Demand (sketch below)
• Structure to optimize
• Beware the data palace fountain & data swamp
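“Parse on Demand” can be as simple as landing the raw line untouched and extracting only the attributes a given analysis needs at read time. A minimal Python sketch with a made-up log format; in practice this might live in a Hive UDF or a Hadoop Streaming step.

import json
import re

# A raw line lands in Hadoop exactly as emitted; nothing is pre-modeled.
raw = '2014-10-20T12:00:01Z app42 {"user":"u7","action":"checkout","cart_total":31.50}'

LINE = re.compile(r'^(?P<ts>\S+)\s+(?P<source>\S+)\s+(?P<payload>\{.*\})$')

def parse_on_demand(line, wanted=("user", "action")):
    """Extract only the attributes this query needs, when it needs them."""
    m = LINE.match(line)
    if not m:
        return None                     # tolerate oddballs; that's schema on read
    payload = json.loads(m.group("payload"))
    record = {"ts": m.group("ts"), "source": m.group("source")}
    record.update({k: payload.get(k) for k in wanted})
    return record

print(parse_on_demand(raw))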
| 23
Right Schema
[Diagram: Data Warehouse | OLTP | Hadoop]
| 24
Right Workload, Right Tool
Workload comparison across Hadoop | NoSQL | MPP, Reporting DBs, Mainframe:
- ETL
- Business Intelligence
- Cross business reporting
- Sub-set analytics
- Full scan analytics
- Decision Support (TBs-PBs vs. GB-TBs)
- Operational Reports
- Complex security requirements
- Search
- Fast Lookup
| 25
Summary
- Understand strengths & weaknesses of each choice
  • Get help as needed to make your first effort successful
- Deploy the right tool for the right workload
- Test and Learn
[Image source: http://www.keepcalm-o-matic.co.uk/p/keep-calm-and-climb-on-94/]
| 26
Thank You
Douglas Moore
@douglas_ma
Work with the best on a wide range of cool projects: [email protected]
DATA SCIENTISTS
DATA ARCHITECTS
DATA SOLUTIONS
Think Big Start Smart Scale Fast
Work with the Leading Innovator in Big Data