Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Transcript of "Hadoop world overview: trends and topics"
Trends and Topics
Valentyn Kropov, Solutions Architect, SAG, SoftServe
Agenda
1. Conference Overview
2. Bright Future of Hadoop MapReduce
3. Apache Spark Data Frames
4. Cloudera Kudu
5. Most Popular Reference Architecture
6. Use Cases
#1 Conference Overview
#2 Bright Future of Hadoop MapReduce
Spark is the Future
Cloudera Announces One Platform Initiative (Sep 9, 2015)
Spark is the Present
It appeared in 72% of presentations and use cases at the Hadoop World conference
Spark is Easier to Code
MapReduce / Java vs. Spark / Scala
Spark is Faster
Up to 100x faster!
Spark is Interactive
Spark is Real-Time
And they have Power
• 400 contributors
• From 100+ companies
• Databricks (1 year old, grown from 30 to 100 people, $47 million)
• Cloudera (370 patches, 43k lines of code)
Cloudera One Platform: Read More
http://goo.gl/jSK0h6
#3 Spark Data Frames
Most of Data is Still Structured!
• No Sorting?
• No Joins?
• No Aggregations?
• No Filtering?
• No cross-DB connections?
A Data Frame is…
• an API
• like a Table (RDBMS)
• or a Data Frame (Python/R)
• an abstraction layer over RDD
Construct a Data Frame

# Constructs a DataFrame from Hive
users = context.table("users")

# from JSON files in S3
logs = context.load("s3n://data.json", "json")
Filtering
# Create a new DataFrame that contains "young users" only
young = users.filter(users.age < 21)
Group By
# Count the number of young users by gender
young.groupBy("gender").count()
Joins!
# Join users with another DataFrame called logs
users.join(logs, logs.userId == users.userId, "left_outer")
Spark Languages
Spark Survey 2015
Why Not Python + RDD?
Data Frames and Python
• Compiled into JVM bytecode
• Data Never Leaves the JVM
• Python passes commands only
• Commands are pushed down
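The contrast above can be sketched as a toy model in plain Python. This is not PySpark's actual internals: the `Column` and `ToyDataFrame` classes here are invented for illustration. With an RDD, a Python lambda runs per row, so every row must cross into the Python process; with a DataFrame, Python only records a description of the command, and an engine (standing in for the JVM) executes the whole plan without the data ever leaving it.

```python
# Toy illustration of "commands are pushed down" (NOT real PySpark internals).

rows = [{"name": "ann", "age": 19}, {"name": "bob", "age": 35}]

# RDD style: a Python function is applied to every single row.
rdd_young = [r for r in rows if (lambda row: row["age"] < 21)(r)]

class Column:
    """Stands in for a DataFrame column reference."""
    def __init__(self, name):
        self.name = name
    def __lt__(self, value):
        # users.age < 21 does not compare anything; it builds a predicate object
        return ("lt", self.name, value)

class ToyDataFrame:
    def __init__(self, data):
        self.data = data
        self.plan = []                 # accumulated commands, nothing executed yet
    def __getattr__(self, name):
        return Column(name)            # users.age -> Column("age")
    def filter(self, predicate):
        df = ToyDataFrame(self.data)
        df.plan = self.plan + [("filter", predicate)]
        return df
    def collect(self):
        # The "engine" side: executes the recorded plan in one place.
        data = self.data
        for _op, (kind, col, value) in self.plan:
            if kind == "lt":
                data = [r for r in data if r[col] < value]
        return data

users = ToyDataFrame(rows)
young = users.filter(users.age < 21)   # only a command description is recorded here
print(young.collect())                 # rows are filtered inside the "engine"
```

Both styles return the same rows; the difference is where the filtering logic runs, which is what makes the DataFrame path so much cheaper for Python users.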
Data Frames Performance
Data Frames: Read More
http://www.slideshare.net/JonHaddad/enter-the-snake-pit-for-fast-and-easy-spark
#4 Cloudera’s Kudu
What’s Kudu?
• Columnar storage for Hadoop
• Not just a file format
• Supports low-latency random access (ms)
• A good alternative to Impala + Parquet
• Integrates with Spark, Hadoop, Impala
• It’s in beta now
Faster than Parquet
Kudu: Architecture
Kudu: Use Cases
• Writes: newly arrived data is immediately available to users
• Time-series applications that need to support both random and scattered reads
Kudu: Read More
http://getkudu.io/
#5 Most Popular Reference Architecture
Reference Architecture
YARN (90%) / Mesos (10%)
Kafka
• Highly-scalable
• Fault-tolerant (commit-log)
• Partition-based load-balancing
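The partition-based load balancing can be illustrated with a toy partitioner in plain Python. This is a sketch of the idea only, not Kafka's actual murmur2-based default partitioner: records with the same key always land in the same partition (preserving per-key ordering), while different keys spread across the available partitions.

```python
import hashlib

def toy_partition(key: str, num_partitions: int) -> int:
    """Map a record key to a partition; the same key always maps to
    the same partition (a stand-in for Kafka's key partitioner)."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Records for one user always hit the same partition, so per-key order
# is preserved, while the overall load spreads across 4 partitions.
assignments = {}
for user in ["user-1", "user-2", "user-3", "user-1"]:
    assignments.setdefault(toy_partition(user, 4), []).append(user)
print(assignments)
```

Because each partition is an independent commit log, consumers can scale out to one consumer per partition without coordinating on individual records.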
Spark Streaming
• Processes data in micro-batches (DStreams, sliding windows)
• Supports data locality with Cassandra
• Real-time data science (Data Frames, MLlib)
• BI Support (Spark SQL)
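The micro-batch model above can be sketched in plain Python. This is a toy, not Spark Streaming's DStream API: events are cut into fixed-size micro-batches, and a sliding window aggregates over the last few batches as each new one arrives.

```python
from collections import deque

def micro_batches(events, batch_size):
    """Cut a stream of events into fixed-size micro-batches."""
    for i in range(0, len(events), batch_size):
        yield events[i:i + batch_size]

def sliding_window_counts(events, batch_size, window_batches):
    """For each new micro-batch, report the event count over the last
    `window_batches` batches (a sliding window over micro-batches)."""
    window = deque(maxlen=window_batches)   # old batches fall out automatically
    counts = []
    for batch in micro_batches(events, batch_size):
        window.append(batch)
        counts.append(sum(len(b) for b in window))
    return counts

stream = list(range(10))                    # 10 events
print(sliding_window_counts(stream, 2, 3))  # window grows to 3 batches, then slides
```

In real Spark Streaming the batch interval and window/slide durations are wall-clock times rather than event counts, but the shape of the computation is the same.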
Cassandra
• No SPOF
• Masterless (easy operations and scaling)
• Replicates data across data centers
• Most mature and fastest growing
• Evolving toward NewSQL (transactions)
• SQL-like CQL
Spark
• Awesome for analytics (both real-time and batch)
Reference Architecture: Read More
http://www.datastax.com/dev/blog/streaming-big-data-with-spark-spark-streaming-kafka-cassandra-and-akka
#6 Netflix Big Data Platform
Netflix: Size
• 20 PB DW on S3
• Read ~10% of data daily
• Write ~10% of read data daily
• 500 billion events daily
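A quick sanity check on these figures, assuming "read ~10%" is a fraction of the 20 PB warehouse and "write ~10% of read data" is a fraction of the daily read volume (the slide doesn't spell this out):

```python
# Rough daily-volume arithmetic from the slide's figures.
warehouse_pb = 20                  # 20 PB data warehouse on S3
read_pb = warehouse_pb * 0.10      # ~10% of the warehouse read daily
written_pb = read_pb * 0.10        # ~10% of the read data written daily

print(f"read per day:  ~{read_pb} PB")     # ~2 PB
print(f"write per day: ~{written_pb} PB")  # ~0.2 PB, i.e. ~200 TB
```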
Netflix: Analyze
• 300 Data Scientists
• Python, R, Scala, etc.
Netflix: Compute and Storage
• Separate compute and storage (S3)
• To allow heterogeneous clusters
• And no-downtime upgrades
Netflix: Architecture
Netflix: Read More
http://strataconf.com/big-data-conference-ny-2015/public/schedule/detail/43373
#7 Big Data: Mission to Mars
Mission Orion
Mission Orion: Size
• 350k measurands
• 2 TB / hour
• 1,200 telemetry sensors
• 3 x 1GB networks busy
• Data retention is 25 years
Mach-5 Data Ingest for Orion (architecture diagram, summarized):
• Data Reader/Simulator: a C++ client (Splitter + Decom, GDS) reads telemetry packets and decommutates them into Packet Measurand GPB files; each file represents one or more packets and contains decommutated measurands with header metadata (apid:seqctr:time: value1 … apid:seqctr:time: valueN)
• Ingest: measurands (GPBs) flow over a Kafka message bus into the mach5-sample (Spark) job for Deduplication (Spark) and an HBase Writer (Spark)
• Storage: HDFS HFiles (HBase-RDD) and HBase
• Analytics: Aggregation (Spark), Alerting (Spark), Limit Checking (Spark)
• Web/UI: Tomcat, Glassfish, etc., with TraceFOSS widgets
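The deduplication stage of this pipeline can be sketched in plain Python. This is a toy standing in for the Spark job: the key fields follow the apid/seqctr/time header metadata shown in the diagram, but the exact dedup key the real system uses is an assumption.

```python
def deduplicate(measurands):
    """Drop repeated measurands, keeping the first occurrence of each
    (apid, seqctr, time) key -- a stand-in for the Spark dedup stage."""
    seen = set()
    unique = []
    for m in measurands:
        key = (m["apid"], m["seqctr"], m["time"])
        if key not in seen:
            seen.add(key)
            unique.append(m)
    return unique

# A redelivered telemetry packet produces a duplicate measurand record.
packets = [
    {"apid": 1, "seqctr": 10, "time": 100, "value": 3.5},
    {"apid": 1, "seqctr": 10, "time": 100, "value": 3.5},  # duplicate delivery
    {"apid": 1, "seqctr": 11, "time": 101, "value": 3.6},
]
print(len(deduplicate(packets)))  # 2
```

Deduplicating before the HBase writer matters here because a message bus like Kafka typically gives at-least-once delivery, so the same packet can arrive more than once.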
Orion: Read More
http://strataconf.com/big-data-conference-ny-2015/public/schedule/detail/43181
Thanks!