Make Sense of Big Data — researched by JIANG Wen-rui, led by Prof. ZOU.


Transcript of "Make Sense of Big Data", researched by JIANG Wen-rui, led by Prof. ZOU.


Slide 1: Make Sense of Big Data — researched by JIANG Wen-rui, led by Prof. ZOU.

Slide 2: Three levels of Big Data — Data Analysis, Software Infrastructure, Hardware Infrastructure (SaaS, PaaS, IaaS).

Slide 3: The contradiction between the first and second levels. Data Analysis: machine learning, data warehousing, statistics. Software Infrastructure: MapReduce, Pregel, GraphLab, GraphBuilder, Spark.

Slide 4: Evolution of Big Data technology, from the software architecture level to the data intelligence level: HDFS, MapReduce, HBase, Hive, Pig, Mahout, Pregel, GraphLab, GraphBuilder, Spark, Shark, BDAS, MLBase, Cloudera, MapR, BC-PDM, graph applications.

Slide 5: The 4 Vs of Big Data — why?
Volume: Big Data means data sets so massive that typical software systems are incapable of economically storing, let alone managing and computing, the information. A Big Data platform must capture and readily provide such quantities in a comprehensive and uniform storage framework to enable straightforward management and development.
Variety: One of the tenets of Big Data is the exponential growth of unstructured data. The vast majority of data now originates from sources with limited or variable structure, such as social media and telemetry. A Big Data platform must accommodate the full spectrum of data types and forms.
Velocity: As organizations continue to seek new questions, patterns, and metrics within their data sets, they demand rapid and agile modeling and query capabilities. A Big Data platform should maintain the original format and precision of all ingested data to ensure full latitude for future analysis and processing cycles.
Value: Driving relevant value from data, whether as revenue or cost savings, is the primary motivator for many organizations. The popularity of long-tail business models has forced companies to examine their data in detail to find the patterns, affiliations, and connections that drive these new opportunities.

Slide 6: Model vs. framework — performance.
Google MapReduce: good at data-independent tasks, not at machine learning or graph processing (data-dependent, iterative tasks). Based on acyclic data flow. "Think like a key."
Google Pregel: good at iterative, data-dependent computations, including graph processing. Uses the BSP (Bulk Synchronous Parallel) model; a message-passing abstraction.
CMU GraphLab: good at iterative, data-dependent computations, especially natural-graph problems. Uses an asynchronous distributed shared-memory model; a shared-state abstraction. "Think like a vertex."
UC Berkeley BDAS Spark: good at iterative algorithms, interactive data mining, and OLAP reports. Uses the RDD (resilient distributed datasets) abstraction, built on in-memory cluster computing and a distributed-memory model.

Slide 7: MapReduce.

Slide 8: Map @ MapReduce.

Slide 9: Reduce @ MapReduce.

Slide 10: RPC @ MapReduce.

Slide 11: RPC @ MapReduce.

Slide 12: MapReduce + BSP.

Slide 13: The BSP model — processors, local computation, communication, barrier synchronization.

Slide 14: MapReduce + BSP.
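Slides 7 through 12 only name the Map, Reduce, and RPC stages, so here is a minimal, hedged illustration of the "think like a key" flow from slide 6: a single-process Python sketch of map, shuffle, and reduce for word counting. The function names and sample documents are ours, not part of Hadoop or any other framework.

    from collections import defaultdict

    def map_phase(documents):
        # Map: emit a (key, value) pair for every word occurrence.
        for doc in documents:
            for word in doc.split():
                yield (word, 1)

    def shuffle_phase(pairs):
        # Shuffle: group all values by key, as the framework does between map and reduce.
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(groups):
        # Reduce: aggregate each key's values independently ("think like a key").
        return {key: sum(values) for key, values in groups.items()}

    docs = ["big data big ideas", "big clusters"]
    print(reduce_phase(shuffle_phase(map_phase(docs))))
    # {'big': 3, 'data': 1, 'ideas': 1, 'clusters': 1}

Because every key is reduced independently, the model handles data-independent tasks well but offers no natural way to iterate, which is the limitation the later frameworks address.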
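Slides 13 and 14 describe the BSP cycle of local computation, communication, and barrier synchronization. The following is a small single-machine simulation of that cycle under assumed toy data: the graph, the initial values, and the averaging rule are invented for illustration, and the barrier is simply the end of each loop iteration.

    # Toy directed graph: vertex -> list of out-neighbors.
    out_neighbors = {0: [1, 2], 1: [2], 2: [0]}
    values = {0: 1.0, 1: 2.0, 2: 3.0}          # local state per vertex
    inbox = {v: [] for v in out_neighbors}      # messages delivered at the last barrier

    for superstep in range(3):
        outbox = {v: [] for v in out_neighbors}
        for v in out_neighbors:
            # 1) Local computation: process the messages received in the previous superstep.
            if inbox[v]:
                values[v] = sum(inbox[v]) / len(inbox[v])
            # 2) Communication: send the updated value to every out-neighbor.
            for nbr in out_neighbors[v]:
                outbox[nbr].append(values[v])
        # 3) Barrier synchronization: messages become visible only in the next superstep.
        inbox = outbox
        print("after superstep", superstep, values)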
Slide 15: GraphLab.

Slide 16: GraphLab — think like a vertex.

Slide 17: GraphLab working patterns and their functions.
MR (Map-Reduce): map_reduce_vertices, map_reduce_edges, transform_vertices, transform_edges.
GAS (Gather-Apply-Scatter): gather_edges, gather, apply, scatter_edges, scatter.

Slide 18: Distributed execution of a PowerGraph vertex program — the gather, apply, and scatter phases run across machines 1-4, with a master vertex and its mirrors.

Slide 19: GraphLab vs. Pregel — example. What is the popularity of this user? It depends on the popularity of her followers, which in turn depends on the popularity of their followers.

Slide 20: GraphLab vs. Pregel — PageRank. Update ranks in parallel and iterate until convergence: the rank of user i is a weighted sum of its neighbors' ranks.

Slide 21: The Pregel abstraction — vertex programs interact by sending messages.

Pregel_PageRank(i, messages):
    # Receive all the messages
    total = 0
    foreach msg in messages:
        total = total + msg
    # Update the rank of this vertex
    R[i] = 0.15 + total
    # Send new messages to neighbors
    foreach j in out_neighbors[i]:
        send msg(R[i] * w_ij) to vertex j

Malewicz et al. [PODC '09, SIGMOD '10]

Slide 22: The Pregel abstraction — compute, communicate, barrier.

Slide 23: The GraphLab abstraction — vertex programs directly read their neighbors' state.

GraphLab_PageRank(i):
    # Compute sum over neighbors
    total = 0
    foreach j in in_neighbors(i):
        total = total + R[j] * w_ji
    # Update the PageRank
    R[i] = 0.15 + total
    # Trigger neighbors to run again
    if R[i] not converged:
        foreach j in out_neighbors(i):
            signal vertex-program on j

Low et al. [UAI '10, VLDB '12]

Slide 24: GraphLab execution. A scheduler determines the order in which vertices are executed; the process repeats until the scheduler is empty.

Slide 25: GraphLab vs. Pregel (BSP) — multicore PageRank on 25M vertices and 355M edges; 51% of vertices were updated only once.

Slide 26: Graph-parallel abstractions — synchronous messaging versus asynchronous shared state, the latter labeled "better for ML".

Slide 27: Challenges. Asynchronous execution requires heavy locking (GraphLab); synchronous execution is prone to stragglers (Pregel). High-degree vertices touch a large fraction of the graph (GraphLab), force edges to be processed sequentially, send many messages (Pregel), and carry edge metadata too large for a single machine.

Slide 28: Berkeley Data Analytics Stack.

Slide 29: Berkeley Data Analytics Stack — HDFS; MapReduce, MPI, GraphLab, etc.; Mesos (cluster resource manager); shared RDDs (distributed memory); Spark; Shark (Spark + Hive) for SQL; BlinkDB (approximate queries); MLBase. Together they address Value, Volume, Velocity, and Variety.

Slide 30: Spark — motivation. Most current cluster programming models are based on acyclic data flow from stable storage to stable storage: input → map → reduce → output.

Slide 31: Spark targets iterative algorithms (including many machine learning algorithms and graph algorithms such as PageRank), interactive data mining (where a user loads data into RAM across a cluster and queries it repeatedly), and OLAP reports that run multiple aggregation queries on the same data.

Slide 32: Spark allows iterative computation on the same data, which would form a cycle if the jobs were visualized. Spark offers an abstraction called resilient distributed datasets (RDDs) to support these applications efficiently.

Slide 33: RDDs. A resilient distributed dataset (RDD) is an abstraction over raw data in which some data is kept in memory and cached for later use. Keeping data in RAM gives roughly a 20x speedup over disk-based MapReduce, and RDDs allow Spark to outperform existing models by up to 100x in multi-pass analytics. RDDs are immutable and are created through parallel transformations such as map, filter, groupBy, and reduce.
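Slides 31 to 33 argue that keeping a working set in memory as a cached RDD is what makes repeated queries and iterative jobs cheap. A hedged PySpark sketch of that pattern follows; it assumes a local Spark installation, and the data and queries are made up for illustration.

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "rdd-demo")

    # Create an RDD with a parallel transformation and pin it in memory.
    points = sc.parallelize(range(100000)).map(lambda x: (x % 10, float(x)))
    points.cache()                                   # kept in RAM after the first action

    # Interactive-style queries reuse the cached data instead of re-reading from disk.
    print(points.count())
    print(points.filter(lambda kv: kv[0] == 3).count())

    # Iterative-style reuse: repeated aggregations over the same cached RDD.
    total = 0.0
    for _ in range(5):
        total += points.map(lambda kv: kv[1]).reduce(lambda a, b: a + b)
    print(total)

    sc.stop()

Because the RDD is immutable and its lineage (parallelize, then map) is recorded, Spark can recompute lost partitions instead of replicating them, which is where the "resilient" in the name comes from.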
Slide 34: Functions — MapReduce vs. Spark.

Slide 35: Logistic regression performance — 127 s per iteration, versus 174 s for the first iteration and 6 s for each further iteration.

Slide 36: MLBase motivation — two gaps. In spite of the modern primacy of data, the complexity of existing ML algorithms is often overwhelming: many users do not understand the trade-offs and challenges of parameterizing and choosing between different learning techniques, and they need to tune and compare several suitable algorithms. Furthermore, existing scalable systems that support machine learning are typically not accessible to ML researchers without a strong background in distributed systems and low-level primitives. We therefore design a system that is extensible to novel ML algorithms.

Slide 37: MLBase — four pieces and their capabilities.
MQL: a simple declarative way to specify ML tasks.
ML-Library: a library of distributed algorithms and a set of high-level operators that let ML researchers scalably implement a wide range of ML methods without deep systems knowledge.
ML-Optimizer: a novel optimizer to select and dynamically adapt the choice of learning algorithm.
ML-Runtime: a new runtime optimized for the data-access patterns of these high-level operators.

Slide 38: MLBase architecture.

Slide 39: MLBase.

Slide 40: Error guide — just a Hadoop frame? In a sense, the distributed platforms are just a language: we cannot do without them, but we should not depend on them alone. What matters more is machine learning. Reading: Machine Learning: A Probabilistic Perspective; Deep Learning.

Slide 41: Parallel time series regression — led by Dr. Yang; group: LI Zhong-hua, WANG Yun-zhi, JIANG Wen-rui; FUJITSU.

Slide 42: Parallel time series regression — properties and performance.
Platform: Hadoop from Apache (Google's MapReduce model, open source) and GraphLab from Carnegie Mellon University (open source). Both are good at distributed parallel processing: MapReduce at acyclic data flow, GraphLab at iterative, data-dependent computations.
Volume: support for big data. The algorithm scales well: when a large amount of data arrives, it can be handled without modification, simply by increasing the number of cluster nodes.
Velocity: rapid and agile modeling and handling capabilities for big data.
Interface: input parameters are set through an XML file, allowing customers to set parameters intuitively.

Slide 43: Parallel time series regression pipeline — Decompose, CycLenCalcu, Indicative Frag, TBSCPro, Clustering (MapReduce, GraphLab), Choose Cluster (MapReduce).

Slide 44: Design of the parallel indicative fragment step. The indicative fragment step identifies the best length of the indicative fragment. Assume 90 days and a maximum indicative fragment length of 96, and compare serial and parallel time complexity: the serial version performs the 96 * C(90, 2) pairwise comparisons one after another (time complexity 96 * C(90, 2)), while the parallel version generates all 96 * (90 * 89 / 2) operation pairs before the parallel computation and processes them simultaneously (time complexity 1).

Slide 45: TBSCPro — fragments from all days (a1…a8, b1…b8, c1…c8, d1…d8, e1…e8) are processed while a heap with capacity 3 is maintained.

Slide 46: Parallel time series regression model.

Slide 47.

Slide 48: Thank you!