Big Data Overview Part 1
william-simms -
![Page 2: Big Data Overview Part 1](https://reader033.fdocuments.net/reader033/viewer/2022052508/559626fc1a28ab9b5a8b4616/html5/thumbnails/2.jpg)
Opening remarks
• Sponsors
  • Pluralsight
    • Free month gift card giveaway. Enter your name in the pot!
  • DevExpress
    • $250 in developer JustCode tools.
  • O’Reilly
    • Book giveaway. Enter your name in the pot!
• Boston Code Camp 22 (November 22nd)
  • http://www.bostoncodecamp.com/
• Thanks to 3thought for the space
![Page 3: Big Data Overview Part 1](https://reader033.fdocuments.net/reader033/viewer/2022052508/559626fc1a28ab9b5a8b4616/html5/thumbnails/3.jpg)
About Me
Software Developer
Agile Team Member
Team Lead
Agile Advocate
SDLC Implementer
![Page 4: Big Data Overview Part 1](https://reader033.fdocuments.net/reader033/viewer/2022052508/559626fc1a28ab9b5a8b4616/html5/thumbnails/4.jpg)
SDLC
![Page 5: Big Data Overview Part 1](https://reader033.fdocuments.net/reader033/viewer/2022052508/559626fc1a28ab9b5a8b4616/html5/thumbnails/5.jpg)
Big Data
“Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications.”
- Wikipedia
![Page 6: Big Data Overview Part 1](https://reader033.fdocuments.net/reader033/viewer/2022052508/559626fc1a28ab9b5a8b4616/html5/thumbnails/6.jpg)
The 3 Vs
• Volume
  • A few gigabytes -> petabytes
• Velocity
  • Arrives quickly
• Variety
  • Multiple sources
![Page 7: Big Data Overview Part 1](https://reader033.fdocuments.net/reader033/viewer/2022052508/559626fc1a28ab9b5a8b4616/html5/thumbnails/7.jpg)
Volume
• Traditional SQL architectures don’t scale to very large data volumes
• Maybe this isn’t so true…but the MPP (massively parallel processing) systems are expensive
![Page 8: Big Data Overview Part 1](https://reader033.fdocuments.net/reader033/viewer/2022052508/559626fc1a28ab9b5a8b4616/html5/thumbnails/8.jpg)
An example problem (Volume)
• You own a chain of stores
• … with 25,000 stores and 100,000 POS systems
• Need information on inventory changes
  • By region
  • By store
![Page 9: Big Data Overview Part 1](https://reader033.fdocuments.net/reader033/viewer/2022052508/559626fc1a28ab9b5a8b4616/html5/thumbnails/9.jpg)
Velocity
• Traditional solutions don’t handle fast inbound data
• Maybe this isn’t so true…but you lose data.
![Page 10: Big Data Overview Part 1](https://reader033.fdocuments.net/reader033/viewer/2022052508/559626fc1a28ab9b5a8b4616/html5/thumbnails/10.jpg)
Another example (Velocity)
• You host a website
• … on 10,000 servers
• Monitor logs for errors
![Page 11: Big Data Overview Part 1](https://reader033.fdocuments.net/reader033/viewer/2022052508/559626fc1a28ab9b5a8b4616/html5/thumbnails/11.jpg)
Variety
• Most traditional solutions don’t handle a variety of data types well
• Maybe this isn’t so true…but you need to write a custom importer for every type.
![Page 12: Big Data Overview Part 1](https://reader033.fdocuments.net/reader033/viewer/2022052508/559626fc1a28ab9b5a8b4616/html5/thumbnails/12.jpg)
A final example (Variety)
• You own a business
• With sales and marketing teams
• … in different regions around the world
• Correlate sales numbers against marketing expenses
![Page 13: Big Data Overview Part 1](https://reader033.fdocuments.net/reader033/viewer/2022052508/559626fc1a28ab9b5a8b4616/html5/thumbnails/13.jpg)
The First Problem: Computing Power
[Diagram: five rows, each labeled First, Second, Third]
Limited by cores (Scaling up)
![Page 14: Big Data Overview Part 1](https://reader033.fdocuments.net/reader033/viewer/2022052508/559626fc1a28ab9b5a8b4616/html5/thumbnails/14.jpg)
Solution: Scale out (not up!)
[Diagram: Coordinator with Server 1, Server 2, Server 3, and Server 4]
![Page 15: Big Data Overview Part 1](https://reader033.fdocuments.net/reader033/viewer/2022052508/559626fc1a28ab9b5a8b4616/html5/thumbnails/15.jpg)
Coordination
[Diagram: Job Coordinator with three Runners]
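The coordinator/runner split can be sketched on a single machine. This is a minimal sketch, not how Hadoop schedules work: the function name `run_job` is hypothetical, and a Python thread pool stands in for the runners that would be separate servers in a real cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def run_job(rows, runner_count, task):
    """Hypothetical coordinator: split rows into chunks, hand one chunk
    to each runner, and collect the partial results in chunk order."""
    chunk_size = max(1, -(-len(rows) // runner_count))  # ceiling division
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    # The thread pool is a local stand-in for runners on other servers.
    with ThreadPoolExecutor(max_workers=runner_count) as runners:
        return list(runners.map(task, chunks))


# Each of the 4 runners sums its own chunk; the coordinator combines them.
partial_sums = run_job(list(range(1, 101)), runner_count=4, task=sum)
total = sum(partial_sums)
print(partial_sums, total)
```

The point of the design is that each runner only sees its own chunk, so adding runners (scaling out) shrinks the work per runner without needing a bigger machine.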
![Page 16: Big Data Overview Part 1](https://reader033.fdocuments.net/reader033/viewer/2022052508/559626fc1a28ab9b5a8b4616/html5/thumbnails/16.jpg)
MapReduce
• A programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster. – Wikipedia
WHAT?
![Page 17: Big Data Overview Part 1](https://reader033.fdocuments.net/reader033/viewer/2022052508/559626fc1a28ab9b5a8b4616/html5/thumbnails/17.jpg)
Map and Reduce
• Map
  • Process data returning key value pairs
• Reduce
  • Aggregate/Filter key value pairs into result
[Diagram: Data -> Map (two instances) -> Reduce -> Result]
![Page 18: Big Data Overview Part 1](https://reader033.fdocuments.net/reader033/viewer/2022052508/559626fc1a28ab9b5a8b4616/html5/thumbnails/18.jpg)
Mapping
• Easy example
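• Store Sales
  • Find the highest sales total for each store in 2010

| Year | Month | Store Id | SalesTotal |
|------|-------|----------|------------|
| 2010 | 1     | 13       | 1,000      |
| 2010 | 3     | 43       | 12,000     |
| 2010 | 3     | 21       | 21,000     |
| 2010 | 4     | 13       | 3,000      |
| 2010 | 2     | 56       | 4,000      |
| 2010 | 6     | 32       | 12,000     |
| 2010 | 7     | 1        | 4,000      |
| 2010 | 2     | 23       | 2,000      |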
![Page 19: Big Data Overview Part 1](https://reader033.fdocuments.net/reader033/viewer/2022052508/559626fc1a28ab9b5a8b4616/html5/thumbnails/19.jpg)
Solution – Map
1. Mapper feeds document rows to your program
2. You return key value pairs
| StoreId | Sales  |
|---------|--------|
| 21      | 2,000  |
| 23      | 3,000  |
| 2       | 1,000  |
| 21      | 23,000 |
![Page 20: Big Data Overview Part 1](https://reader033.fdocuments.net/reader033/viewer/2022052508/559626fc1a28ab9b5a8b4616/html5/thumbnails/20.jpg)
Solution - Reduce
• Data is merged
  • Merged into Key/Values:
{21, [2,000, 23,000]}
{23, [3,000]}
{2, [1,000]}
• You process each row
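The map, merge, and reduce steps can be sketched end to end in plain Python. This is a single-process illustration under stated assumptions, not Hadoop code: the function names `map_row`, `shuffle`, and `reduce_store` are hypothetical, the rows are taken from the Store Sales table on the Mapping slide, and "most sales per store" is read as the largest single sales total for each store.

```python
from collections import defaultdict

# Rows from the Store Sales table: (year, month, store_id, sales_total)
ROWS = [
    (2010, 1, 13, 1000),
    (2010, 3, 43, 12000),
    (2010, 3, 21, 21000),
    (2010, 4, 13, 3000),
    (2010, 2, 56, 4000),
    (2010, 6, 32, 12000),
    (2010, 7, 1, 4000),
    (2010, 2, 23, 2000),
]


def map_row(row):
    """Map: emit a (store_id, sales_total) pair for each 2010 row."""
    year, _month, store_id, sales_total = row
    if year == 2010:
        yield store_id, sales_total


def shuffle(pairs):
    """Merge: group all values that share a key, e.g. {13: [1000, 3000]}."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_store(store_id, sales):
    """Reduce: keep the largest sales total seen for one store (assumed goal)."""
    return store_id, max(sales)


pairs = [pair for row in ROWS for pair in map_row(row)]
grouped = shuffle(pairs)
result = dict(reduce_store(store_id, sales) for store_id, sales in grouped.items())
print(result)
```

In a real cluster the framework performs the merge step between map and reduce; only the map and reduce functions are code you write. Swapping `max` for `sum` in `reduce_store` would give total sales per store instead.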
![Page 21: Big Data Overview Part 1](https://reader033.fdocuments.net/reader033/viewer/2022052508/559626fc1a28ab9b5a8b4616/html5/thumbnails/21.jpg)
Data Access
• Each process needs access to data
[Diagram: Typical vs. Desired data access]
![Page 22: Big Data Overview Part 1](https://reader033.fdocuments.net/reader033/viewer/2022052508/559626fc1a28ab9b5a8b4616/html5/thumbnails/22.jpg)
HDFS
• Hadoop Distributed File System
  • Open-source file system modeled on the Google File System (GFS)
Hard drives last about 1,000 days. So, if you have 1,000 hard drives, you’ll lose about one per day on average.
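The drive-failure arithmetic can be checked with a short calculation. This is a rough sketch that assumes failures are spread evenly over the average lifetime; the function name `expected_failures_per_day` is hypothetical.

```python
def expected_failures_per_day(drive_count, lifetime_days):
    """Average number of drives lost per day, assuming failures are
    spread evenly across the average drive lifetime."""
    return drive_count / lifetime_days


# The slide's numbers: 1,000 drives, each lasting about 1,000 days.
per_day = expected_failures_per_day(1000, 1000)
# A larger cluster with the same drives loses proportionally more.
per_day_large = expected_failures_per_day(10000, 1000)
print(per_day, per_day_large)
```

At cluster scale a failed drive is a routine daily event, which is why the file system has to tolerate losing a drive without losing data.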
![Page 23: Big Data Overview Part 1](https://reader033.fdocuments.net/reader033/viewer/2022052508/559626fc1a28ab9b5a8b4616/html5/thumbnails/23.jpg)
The ecosystem
• Hive
  • SQL-like query language
  • Define and enforce schema
• Pig
  • Data-flow scripting language (Pig Latin)
• Sqoop
  • SQL/Hadoop integration
• Oozie
  • Scheduling
• Mahout
  • Machine learning library
• Storm
  • Stream processing (MapReduce-style, but on live streams)
… and many others
![Page 24: Big Data Overview Part 1](https://reader033.fdocuments.net/reader033/viewer/2022052508/559626fc1a28ab9b5a8b4616/html5/thumbnails/24.jpg)
Vendors
• Hortonworks
  • Single click install of Sandbox
• Cloudera
  • Downloadable VM
• Syncfusion
  • Single click install of Syncfusion Big Data
• Amazon AWS
  • Elastic MapReduce
• Microsoft Azure
  • HDInsight