Big Data Processing with Spark

Transcript of Big Data Processing with Spark

  1. 1. SIKS Big Data Course Prof.dr.ir. Arjen P. de Vries arjen@acm.org Enschede, December 5, 2016
  2. 2. Big Data "If your organization stores multiple petabytes of data, if the information most critical to your business resides in forms other than rows and columns of numbers, or if answering your biggest question would involve a mashup of several analytical efforts, you've got a big data opportunity." http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century
  3. 3. Process Challenges in Big Data Analytics include: - capturing data, - aligning data from different sources (e.g., resolving when two objects are the same), - transforming the data into a form suitable for analysis, - modeling it, whether mathematically or through some form of simulation, - understanding the output, - visualizing and sharing the results. Attributed to IBM Research's Laura Haas in http://www.odbms.org/download/Zicari.pdf
  4. 4. How big is big? Facebook (Aug 2012): - 2.5 billion content items shared per day (status updates + wall posts + photos + videos + comments) - 2.7 billion Likes per day - 300 million photos uploaded per day
  5. 5. Big is very big! 100+ petabytes of disk space in one of FB's largest Hadoop (HDFS) clusters 105 terabytes of data scanned via Hive, Facebook's Hadoop query language, every 30 minutes 70,000 queries executed on these databases per day 500+ terabytes of new data ingested into the databases every day http://gigaom.com/data/facebook-is-collecting-your-data-500-terabytes-a-day/
  6. 6. Back of the Envelope Note: 105 terabytes of data scanned every 30 minutes A very, very fast disk can do 300 MB/s, so on one disk this would take (105 TB = 110,100,480 MB) / 300 MB/s ≈ 367,000 s ≈ 6,000 minutes So at least 200 disks are used in parallel! PS: the June 2010 estimate was that Facebook ran on 60K servers
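The slide's arithmetic is easy to reproduce. A minimal Python sketch with the same assumed numbers (105 TB scanned per 30-minute window, a single disk sustaining 300 MB/s sequential reads):

```python
# Back-of-the-envelope check of the slide's numbers (assumptions, not measurements).
MB_PER_TB = 1024 * 1024        # MB in one TB
scan_mb = 105 * MB_PER_TB      # 110,100,480 MB scanned per window
disk_mb_per_s = 300            # sequential throughput of one very fast disk
window_s = 30 * 60             # the 30-minute window, in seconds

one_disk_s = scan_mb / disk_mb_per_s      # ~367,000 s on a single disk
disks_needed = one_disk_s / window_s      # ~204 disks working in parallel

print(f"one disk: {one_disk_s:,.0f} s (~{one_disk_s / 60:,.0f} min)")
print(f"disks needed to finish within 30 min: {disks_needed:.0f}")
```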
  7. 7. Source: Google Data Center (is the Computer)
  8. 8. Source: NY Times (6/14/2006), http://www.nytimes.com/2006/06/14/technology/14search.html
  9. 9. FB's Data Centers Suggested further reading: - http://www.datacenterknowledge.com/the-facebook-data-center-faq/ - http://opencompute.org/ - Open hardware: server, storage, and data center - Claimed to be 38% more efficient and 24% less expensive to build and run than other state-of-the-art data centers
  10. 10. Building Blocks Source: Barroso, Clidaras and Hölzle (2013): DOI 10.2200/S00516ED2V01Y201306CAC024
  11. 11. Storage Hierarchy Source: Barroso, Clidaras and Hölzle (2013): DOI 10.2200/S00516ED2V01Y201306CAC024
  12. 12. Numbers Everyone Should Know: L1 cache reference 0.5 ns; branch mispredict 5 ns; L2 cache reference 7 ns; mutex lock/unlock 100 ns; main memory reference 100 ns; compress 1K bytes with Zippy 10,000 ns; send 2K bytes over 1 Gbps network 20,000 ns; read 1 MB sequentially from memory 250,000 ns; round trip within same datacenter 500,000 ns; disk seek 10,000,000 ns; read 1 MB sequentially from network 10,000,000 ns; read 1 MB sequentially from disk 30,000,000 ns; send packet CA->Netherlands->CA 150,000,000 ns. According to Jeff Dean
  13. 13. Storage Hierarchy Source: Barroso, Clidaras and Hölzle (2013): DOI 10.2200/S00516ED2V01Y201306CAC024
  14. 14. Storage Hierarchy Source: Barroso, Clidaras and Hölzle (2013): DOI 10.2200/S00516ED2V01Y201306CAC024
  15. 15. Quiz Time!! Consider a 1 TB database with 100 byte records - We want to update 1 percent of the records Plan A: Seek to the records and make the updates Plan B: Write out a new database that includes the updates Source: Ted Dunning, on Hadoop mailing list
  16. 16. Seeks vs. Scans Consider a 1 TB database with 100 byte records - We want to update 1 percent of the records Scenario 1: random access - Each update takes ~30 ms (seek, read, write) - 10^8 updates = ~35 days Scenario 2: rewrite all records - Assume 100 MB/s throughput - Time = 5.6 hours(!) Lesson: avoid random seeks! In the words of Prof. Peter Boncz (CWI & VU): "Latency is the enemy" Source: Ted Dunning, on Hadoop mailing list
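The two plans can be checked with the slide's assumptions (1 TB of 100-byte records, 1 percent updated, ~30 ms per random update, 100 MB/s sequential throughput); a small Python sketch:

```python
# Seeks vs. scans, using the assumptions stated on the slide.
TB = 10**12                        # database size in bytes
records = TB // 100                # 10^10 records of 100 bytes each
updates = records // 100           # 1 percent of the records -> 10^8 updates

plan_a_s = updates * 0.030         # ~30 ms per random update (seek, read, write)
plan_b_s = 2 * TB / (100 * 10**6)  # read the old 1 TB file and write the new one at 100 MB/s

print(f"Plan A (random seeks): {plan_a_s / 86400:.0f} days")   # ~35 days
print(f"Plan B (full rewrite): {plan_b_s / 3600:.1f} hours")   # ~5.6 hours
```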
  17. 17. Programming for Big Data / the Data Center
  18. 18. Emerging Big Data Systems Distributed Shared-nothing - None of the resources are logically shared between processes Data parallel - Exactly the same task is performed on different pieces of the data
  19. 19. Shared-nothing A collection of independent, possibly virtual, machines, each with local disk and local main memory, connected together on a high-speed network - Possible trade-off: large number of low-end servers instead of small number of high-end ones
  20. 20. @UT~1990
  21. 21. Data Parallel Remember: 0.5 ns (L1) vs. 500,000 ns (round trip in datacenter) is 6 orders of magnitude! With huge amounts of data (and the resources necessary to process it), we simply cannot expect to ship the data to the application; the application logic needs to ship to the data!
  22. 22. Gray's Laws How to approach data engineering challenges for large-scale scientific datasets: 1. Scientific computing is becoming increasingly data intensive 2. The solution is in a scale-out architecture 3. Bring computations to the data, rather than data to the computations 4. Start the design with the 20 queries 5. Go from "working to working" See: http://research.microsoft.com/en-us/collaboration/fourthparadigm/4th_paradigm_book_part1_szalay.pdf
  23. 23. Distributed File System (DFS) The exact location of the data is unknown to the programmer The programmer writes a program at an abstraction level above that of low-level data access - however, notice that the abstraction level offered is usually still rather low
  24. 24. GFS: Assumptions Commodity hardware over exotic hardware - Scale out, not up High component failure rates - Inexpensive commodity components fail all the time Modest number of huge files - Multi-gigabyte files are common, if not encouraged Files are write-once, mostly appended to - Perhaps concurrently Large streaming reads over random access - High sustained throughput over low latency GFS slides adapted from material by (Ghemawat et al., SOSP 2003)
  25. 25. GFS: Design Decisions Files stored as chunks - Fixed size (64MB) Reliability through replication - Each chunk replicated across 3+ chunkservers Single master to coordinate access, keep metadata - Simple centralized management No data caching - Little benefit due to large datasets, streaming reads Simplify the API - Push some of the issues onto the client (e.g., data layout) HDFS = GFS clone (same basic ideas)
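To make the chunking and replication decisions concrete, here is a toy Python sketch of the bookkeeping a GFS/HDFS-style master performs; the cluster size and the helper `place_replicas` are invented for the example and are not part of any real GFS or HDFS API:

```python
import random

CHUNK_SIZE = 64 * 1024 * 1024    # fixed 64 MB chunks, as in GFS
REPLICATION = 3                  # each chunk stored on 3+ chunkservers

chunkservers = [f"cs{i:02d}" for i in range(12)]   # hypothetical cluster

def place_replicas(file_size_bytes):
    """Toy model of the master's metadata: which servers hold which chunk.

    Clients would contact the chunkservers directly for the data itself;
    the master only coordinates access and keeps this mapping.
    """
    n_chunks = -(-file_size_bytes // CHUNK_SIZE)   # ceiling division
    return {chunk_id: random.sample(chunkservers, REPLICATION)
            for chunk_id in range(n_chunks)}

layout = place_replicas(10 * 1024**3)    # a 10 GB file -> 160 chunks
print(len(layout), "chunks; chunk 0 on", layout[0])
```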
  26. 26. A Prototype Big Data Analysis Task Iterate over a large number of records Extract something of interest from each Aggregate intermediate results - Usually, aggregation requires shuffling and sorting the intermediate results Generate final output Key idea: provide a functional abstraction for these two operations: Map and Reduce (Dean and Ghemawat, OSDI 2004)
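Spelled out sequentially, the prototype task looks like the word-count sketch below; the tiny in-memory input is just a stand-in for a huge on-disk dataset:

```python
from collections import defaultdict

records = ["to be or not to be", "to do is to be"]   # stand-in for a huge input

counts = defaultdict(int)
for record in records:            # iterate over a large number of records
    for word in record.split():   # extract something of interest from each
        counts[word] += 1         # aggregate intermediate results
print(dict(counts))               # generate final output
```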
  27. 27. Map / Reduce "A simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs" MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat, 2004 http://research.google.com/archive/mapreduce.html
  28. 28. MR Implementations Google invented their MR system, a proprietary implementation in C++ - Bindings in Java, Python Hadoop is an open-source re-implementation in Java - Original development led by Yahoo - Now an Apache open source project - Emerging as the de facto big data stack - Rapidly expanding software ecosystem
  29. 29. Map / Reduce Process data using special map() and reduce() functions - The map() function is called on every item in the input and emits a series of intermediate key/value pairs - All values associated with a given key are grouped together (keys arrive at each reducer in sorted order) - The reduce() function is called on every unique key, and its value list, and emits a value that is added to the output
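The same word count, expressed through the two user-supplied functions. The sketch below simulates the framework's shuffle-and-sort step in plain Python; map_fn and reduce_fn are illustrative stand-ins for the map() and reduce() callbacks a real MapReduce framework would invoke:

```python
from itertools import groupby
from operator import itemgetter

def map_fn(record):                 # emits intermediate key/value pairs
    for word in record.split():
        yield (word, 1)

def reduce_fn(key, values):         # called once per unique key and its value list
    return (key, sum(values))

records = ["to be or not to be", "to do is to be"]

# "Shuffle and sort": group all intermediate values by key, keys in sorted order.
intermediate = sorted((kv for r in records for kv in map_fn(r)), key=itemgetter(0))
output = [reduce_fn(key, [v for _, v in group])
          for key, group in groupby(intermediate, key=itemgetter(0))]
print(output)   # [('be', 3), ('do', 1), ('is', 1), ('not', 1), ('or', 1), ('to', 4)]
```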
  30. 30. [Figure: MapReduce execution overview. The user program submits a job to the master, which schedules map and reduce tasks on workers; map workers read the input splits and write intermediate files to local disk; reduce workers remotely read those files and write the output files. Adapted by Jimmy Lin from (Dean and Ghemawat, OSDI 2004)]
  31. 31. MapReduce [Figure: map tasks emit intermediate key/value pairs, a shuffle-and-sort phase aggregates the values by key, and reduce tasks produce the final output.]
  32. 32. MapReduce Runtime Handles scheduling - Assigns workers to map and reduce tasks Handles data distribution - Moves processes to data Handles synchronization - Gathers, sorts, and shuffles intermediate data Handles errors and faults - Detects worker failures and restarts Everything happens on top of a Distributed File System (DFS)
  33. 33. Q: Hadoop the Answer?
  34. 34. Data Juggling Operational reality of many organizations is that Big Data is constantly being pumped between different systems: - Key-value stores - General-purpose distributed file system - (Distributed) DBMSs - Custom (distributed) file organizations
  35. 35. Q: Hadoop the Answer? Not that easy to write efficient and scalable code!
  36. 36. Controlling Execution Cleverly-constructed data structures for keys and values - Carry partial results together through the pipeline Sort order of intermediate keys - Control the order in which reducers process values (see the sketch below)
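A minimal sketch of that idea in plain Python (not the Hadoop API): the so-called value-to-key conversion, where part of the value is folded into a composite key so that the framework's sort delivers each reducer's values in a controlled order. The station/temperature data is invented for the example:

```python
from itertools import groupby

# Intermediate (station, (day, temperature)) pairs, as a mapper might emit them.
pairs = [("ams", (3, 7.1)), ("ams", (1, 4.2)), ("rtm", (2, 6.0)), ("ams", (2, 5.5))]

# Value-to-key conversion: move the day into a composite key, then let the sort
# order of intermediate keys do the work. The "reducer" below sees each station's
# readings already ordered by day, without having to buffer and sort them itself.
composite = sorted(((station, day), temp) for station, (day, temp) in pairs)

for station, group in groupby(composite, key=lambda kv: kv[0][0]):
    ordered = [(day, temp) for (_, day), temp in group]
    print(station, ordered)   # ams [(1, 4.2), (2, 5.5), (3, 7.1)]  /  rtm [(2, 6.0)]
```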