Pig: Data Analysis Tool in Cloud
jeff-zhang
![Page 2: Pig: Data Analysis Tool in Cloud](https://reader034.fdocuments.net/reader034/viewer/2022051612/54be2d314a79592f108b45ef/html5/thumbnails/2.jpg)
Agenda
• Background
• What is Pig
• Brief introduction to Pig internals
• Demo
• Q/A
![Page 3: Pig: Data Analysis Tool in Cloud](https://reader034.fdocuments.net/reader034/viewer/2022051612/54be2d314a79592f108b45ef/html5/thumbnails/3.jpg)
Data Explosion
• Web 2.0
• More digital terminals
![Page 4: Pig: Data Analysis Tool in Cloud](https://reader034.fdocuments.net/reader034/viewer/2022051612/54be2d314a79592f108b45ef/html5/thumbnails/4.jpg)
What we have for data analysis
• RDBMS (limited scalability)
• Parallel RDBMS (expensive)
• Programming languages (too complex)
• Hadoop MapReduce (still too complex for non-Hadoop users)
![Page 5: Pig: Data Analysis Tool in Cloud](https://reader034.fdocuments.net/reader034/viewer/2022051612/54be2d314a79592f108b45ef/html5/thumbnails/5.jpg)
Then Came Pig
![Page 6: Pig: Data Analysis Tool in Cloud](https://reader034.fdocuments.net/reader034/viewer/2022051612/54be2d314a79592f108b45ef/html5/thumbnails/6.jpg)
What is Pig
Apache Pig is a platform for analyzing large data sets. It consists of a high-level language (Pig Latin) for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
• Ease of programming
• Optimization opportunities
• Extensibility
• Built upon Hadoop
![Page 7: Pig: Data Analysis Tool in Cloud](https://reader034.fdocuments.net/reader034/viewer/2022051612/54be2d314a79592f108b45ef/html5/thumbnails/7.jpg)
A simple example of Pig-Latin
raw_data = load '/java_one/pv' using PigStorage(',') as (time_stamp:long, url:chararray);
pages = foreach raw_data generate url;
pages = group pages by url;
pages = foreach pages generate group as url, COUNT(pages.url) as pv;
pages = order pages by pv desc;
top10 = limit pages 10;
dump top10;
• Page view
• The 10 most popular pages
1291950309812, http://snda.com/page_1
1291950309822, http://snda.com/page_2
1291950309832, http://snda.com/page_3
….
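To make the semantics of the script concrete, here is a minimal plain-Java sketch of the same computation: count views per URL, order by count descending, keep the first n. It does not use Pig or Hadoop; the class name `PageViewSketch` and the method `topPages` are hypothetical names chosen for illustration, and the input is assumed to be `time_stamp,url` lines like the sample above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PageViewSketch {
    // Counts views per URL and returns the n most viewed URLs with their counts,
    // mirroring group by, COUNT, order by ... desc and limit in the script.
    static List<Map.Entry<String, Long>> topPages(List<String> lines, int n) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String line : lines) {
            // Each record is "time_stamp,url"; keep only the url field.
            String url = line.split(",", 2)[1].trim();
            counts.merge(url, 1L, Long::sum);
        }
        List<Map.Entry<String, Long>> sorted = new ArrayList<>(counts.entrySet());
        // Highest page-view count first; ties keep first-seen order (stable sort).
        sorted.sort(Map.Entry.<String, Long>comparingByValue().reversed());
        return sorted.subList(0, Math.min(n, sorted.size()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "1291950309812, http://snda.com/page_1",
            "1291950309822, http://snda.com/page_2",
            "1291950309832, http://snda.com/page_1");
        for (Map.Entry<String, Long> e : topPages(lines, 10)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

The Pig script expresses the same steps declaratively, and Pig decides how to run them as MapReduce jobs.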
![Page 8: Pig: Data Analysis Tool in Cloud](https://reader034.fdocuments.net/reader034/viewer/2022051612/54be2d314a79592f108b45ef/html5/thumbnails/8.jpg)
Operators in Pig-Latin
Load - a = load 'data' using PigStorage('\t') as (f1:int, f2:double, f3:chararray);
Store - store a into '/test/output' using PigStorage(',');
Dump - dump a;
Filter - b = filter a by f1 > 0 and f3 == 'java_one';
Foreach - b = foreach a generate f1, f3;
Group - b = group a by f3;
Join - c = join a by f1, b by f1;
Describe - describe b;
….
![Page 9: Pig: Data Analysis Tool in Cloud](https://reader034.fdocuments.net/reader034/viewer/2022051612/54be2d314a79592f108b45ef/html5/thumbnails/9.jpg)
Data Structure in Pig
• Cell (like a field in a database)
- Primitive types: int, long, float, double, bytearray, chararray, null
- Complex types: map, tuple, databag
• Tuple (like a row): (1, 1.2, "java")
• DataBag (like a table or view): { (1, 1.2, "java"), (2, 2.3, "c++"), (3, 4.5, "c") }
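The nesting of cells, tuples and bags can be modeled in a few lines of plain Java. This is only an illustrative sketch under stated assumptions: it uses `java.util.List` rather than Pig's own data classes, and `DataModelSketch`, `tuple`, `sampleBag` and `project` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DataModelSketch {
    // A tuple is an ordered list of cells; each cell holds one value.
    static List<Object> tuple(Object... cells) {
        return Arrays.asList(cells);
    }

    // A bag is a collection of tuples, comparable to a table or view.
    static List<List<Object>> sampleBag() {
        return Arrays.asList(
            tuple(1, 1.2, "java"),
            tuple(2, 2.3, "c++"),
            tuple(3, 4.5, "c"));
    }

    // Picks the cell at one position out of every tuple in the bag,
    // similar in spirit to projecting a single field with foreach ... generate.
    static List<Object> project(List<List<Object>> bag, int index) {
        List<Object> column = new ArrayList<>();
        for (List<Object> t : bag) {
            column.add(t.get(index));
        }
        return column;
    }

    public static void main(String[] args) {
        System.out.println(project(sampleBag(), 2));
    }
}
```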
![Page 10: Pig: Data Analysis Tool in Cloud](https://reader034.fdocuments.net/reader034/viewer/2022051612/54be2d314a79592f108b45ef/html5/thumbnails/10.jpg)
How to use Pig
Grunt (Interactive Shell)
Java API
Other languages (in future)
![Page 11: Pig: Data Analysis Tool in Cloud](https://reader034.fdocuments.net/reader034/viewer/2022051612/54be2d314a79592f108b45ef/html5/thumbnails/11.jpg)
Architecture of Pig
Parser (Pig Latin → LogicalPlan)
Optimizer (LogicalPlan → LogicalPlan)
Compiler (LogicalPlan → PhysicalPlan → MapReducePlan)
ExecutionEngine
PigContext
Hadoop
Grunt (Interactive shell)
PigServer (Java API)
![Page 12: Pig: Data Analysis Tool in Cloud](https://reader034.fdocuments.net/reader034/viewer/2022051612/54be2d314a79592f108b45ef/html5/thumbnails/12.jpg)
Three basic operations of Pig
• Group by
• Join
• Order
![Page 13: Pig: Data Analysis Tool in Cloud](https://reader034.fdocuments.net/reader034/viewer/2022051612/54be2d314a79592f108b45ef/html5/thumbnails/13.jpg)
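How Pig Does Group By
Data source: (A,1)(B,2)(C,3)(B,4)(B,5)(C,6)(A,7)(E,8)(D,9)
Split 1 (mapper 1): (A,1)(B,2)(C,3)
Split 2 (mapper 2): (B,4)(B,5)(C,6)
Split 3 (mapper 3): (A,7)(E,8)(D,9)
Reducer 1: (A,{(A,1),(A,7)})(C,{(C,3),(C,6)})
Reducer 2: (E,{(E,8)})
Reducer 3: (B,{(B,2),(B,4),(B,5)})(D,{(D,9)})
Stages: Data Source → Split → Mapper → Partition → Reducer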
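The group-by flow on this slide can be imitated in plain Java. This is a rough single-process sketch, not Pig's implementation: the class `GroupBySketch` and its methods are hypothetical, and hash-modulo partitioning is an assumption, so keys may land on different reducers than in the slide. What it does preserve is the key property: every record with the same key reaches the same reducer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class GroupBySketch {
    // Simulates split -> mapper -> partition -> reducer for "group by" on the key.
    // Each record is a String[]{key, value}; one result map per reducer.
    static List<Map<String, List<String>>> groupBy(List<List<String[]>> splits, int reducers) {
        List<Map<String, List<String>>> partitions = new ArrayList<>();
        for (int i = 0; i < reducers; i++) {
            partitions.add(new TreeMap<String, List<String>>());
        }
        for (List<String[]> split : splits) {          // one mapper per split
            for (String[] rec : split) {
                // Partitioner (assumed hash-modulo): same key, same reducer.
                int p = Math.floorMod(rec[0].hashCode(), reducers);
                // Reducer: collect all records of one key into a bag.
                partitions.get(p)
                          .computeIfAbsent(rec[0], k -> new ArrayList<String>())
                          .add("(" + rec[0] + "," + rec[1] + ")");
            }
        }
        return partitions;
    }

    static String[] rec(String key, String value) {
        return new String[] {key, value};
    }

    public static void main(String[] args) {
        List<List<String[]>> splits = Arrays.asList(
            Arrays.asList(rec("A", "1"), rec("B", "2"), rec("C", "3")),
            Arrays.asList(rec("B", "4"), rec("B", "5"), rec("C", "6")),
            Arrays.asList(rec("A", "7"), rec("E", "8"), rec("D", "9")));
        List<Map<String, List<String>>> out = groupBy(splits, 3);
        for (int i = 0; i < out.size(); i++) {
            System.out.println("Reducer " + (i + 1) + ": " + out.get(i));
        }
    }
}
```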
![Page 14: Pig: Data Analysis Tool in Cloud](https://reader034.fdocuments.net/reader034/viewer/2022051612/54be2d314a79592f108b45ef/html5/thumbnails/14.jpg)
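How Pig Does Join
Data source A: (1,A1)(4,A4)(3,A3)(5,A5)(2,A2)
Data source B: (5,B5)(1,B1)(3,B3)(2,B2)(4,B4)
Split 1 (mapper 1): (1,A1)(4,A4)(5,B5)(1,B1)
Split 2 (mapper 2): (3,A3)(5,A5)(3,B3)(2,B2)
Split 3 (mapper 3): (2,A2)(4,B4)
Reducer 1: ((1,A1),(1,B1))((3,A3),(3,B3))((5,A5),(5,B5))
Reducer 2: ((2,A2),(2,B2))((4,A4),(4,B4))
Stages: Data Source → Split → Mapper → Partition → Reducer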
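The join on this slide follows the usual reduce-side pattern: records from both inputs are brought together by key, and each key's records from one input are paired with those from the other. Below is a minimal single-process Java sketch of that idea. It is not Pig's code; `JoinSketch`, `join` and `rec` are hypothetical names, and the per-key maps stand in for the shuffle that Hadoop would perform.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class JoinSketch {
    // Inner equi-join of two inputs on the first field of each record.
    // Each record is a String[]{key, value}.
    static List<String> join(List<String[]> a, List<String[]> b) {
        // "Map and partition": bring the records of each input together by key.
        Map<String, List<String>> left = new TreeMap<>();
        Map<String, List<String>> right = new TreeMap<>();
        for (String[] r : a) {
            left.computeIfAbsent(r[0], k -> new ArrayList<String>()).add(r[1]);
        }
        for (String[] r : b) {
            right.computeIfAbsent(r[0], k -> new ArrayList<String>()).add(r[1]);
        }
        // "Reduce": for every key present on both sides, emit all pairings.
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : left.entrySet()) {
            String key = e.getKey();
            List<String> matches = right.get(key);
            if (matches == null) {
                continue; // inner join: keys missing on one side are dropped
            }
            for (String av : e.getValue()) {
                for (String bv : matches) {
                    out.add("((" + key + "," + av + "),(" + key + "," + bv + "))");
                }
            }
        }
        return out;
    }

    static String[] rec(String key, String value) {
        return new String[] {key, value};
    }

    public static void main(String[] args) {
        List<String[]> a = Arrays.asList(
            rec("1", "A1"), rec("4", "A4"), rec("3", "A3"), rec("5", "A5"), rec("2", "A2"));
        List<String[]> b = Arrays.asList(
            rec("5", "B5"), rec("1", "B1"), rec("3", "B3"), rec("2", "B2"), rec("4", "B4"));
        for (String row : join(a, b)) {
            System.out.println(row);
        }
    }
}
```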
![Page 15: Pig: Data Analysis Tool in Cloud](https://reader034.fdocuments.net/reader034/viewer/2022051612/54be2d314a79592f108b45ef/html5/thumbnails/15.jpg)
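How Pig Does Sort
Data source: (100)(200)(900)(50)(600)(800)(300)(400)
Split 1 (mapper 1): (100)(200)(900)
Split 2 (mapper 2): (50)(600)(800)
Split 3 (mapper 3): (300)(400)
Reducer 1: (50)(100)(200)(300)(400)
Reducer 2: (600)(800)(900)
Stages: Data Source → Split → Mapper → Range Partition → Reducer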
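The point of range partitioning is that each reducer receives a contiguous range of values, so concatenating the sorted reducer outputs yields a globally sorted result. The following plain-Java sketch illustrates this under stated assumptions: `RangeSortSketch` and `sort` are hypothetical names, and the boundary value 500 is picked by hand for this data set rather than derived the way Pig would choose it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class RangeSortSketch {
    // Routes each value to a partition by range, then sorts every partition.
    // boundaries must be ascending; partition i holds values below boundaries[i]
    // (and not below boundaries[i - 1]); the last partition holds the rest.
    static List<List<Integer>> sort(List<List<Integer>> splits, int[] boundaries) {
        List<List<Integer>> partitions = new ArrayList<>();
        for (int i = 0; i <= boundaries.length; i++) {
            partitions.add(new ArrayList<Integer>());
        }
        for (List<Integer> split : splits) {            // one mapper per split
            for (int value : split) {
                // Range partitioner: find the first boundary above the value.
                int p = 0;
                while (p < boundaries.length && value >= boundaries[p]) {
                    p++;
                }
                partitions.get(p).add(value);
            }
        }
        // Each reducer sorts only its own range.
        for (List<Integer> partition : partitions) {
            Collections.sort(partition);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<List<Integer>> splits = Arrays.asList(
            Arrays.asList(100, 200, 900),
            Arrays.asList(50, 600, 800),
            Arrays.asList(300, 400));
        // 500 is an assumed boundary separating the two reducers.
        System.out.println(sort(splits, new int[] {500}));
    }
}
```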
![Page 16: Pig: Data Analysis Tool in Cloud](https://reader034.fdocuments.net/reader034/viewer/2022051612/54be2d314a79592f108b45ef/html5/thumbnails/16.jpg)
UDF (User-Defined-Function)
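register myudf.jar;
raw_data = load '/java_one/udf' as (name:chararray);
firstnames = foreach raw_data generate myudf.FirstName(name);
store firstnames into '/java_one/udf_output';

package myudf;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Eval function called once per input tuple; returns the first name.
public class FirstName extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        String name = input.get(0).toString();
        ….
        return firstname;
    }
}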
![Page 17: Pig: Data Analysis Tool in Cloud](https://reader034.fdocuments.net/reader034/viewer/2022051612/54be2d314a79592f108b45ef/html5/thumbnails/17.jpg)
What Storage Pig Supports
• HDFS
- Plain text
- Binary format
- Customized format (XML, JSON, Protobuf, Thrift…)
• RDBMS (DBStorage)
• Cassandra (CassandraStorage)
• HBase (HBaseStorage)
![Page 18: Pig: Data Analysis Tool in Cloud](https://reader034.fdocuments.net/reader034/viewer/2022051612/54be2d314a79592f108b45ef/html5/thumbnails/18.jpg)
Where Pig Can Be Applied
• Data Analysis
• Text Processing
• ETL
• Machine Learning
![Page 20: Pig: Data Analysis Tool in Cloud](https://reader034.fdocuments.net/reader034/viewer/2022051612/54be2d314a79592f108b45ef/html5/thumbnails/20.jpg)
References
• http://pig.apache.org (Pig official site)
• http://hadoop.apache.org (Hadoop official site)
• https://github.com/zjffdu/RAF-PIG (Rich API for Pig)
![Page 21: Pig: Data Analysis Tool in Cloud](https://reader034.fdocuments.net/reader034/viewer/2022051612/54be2d314a79592f108b45ef/html5/thumbnails/21.jpg)
Demo
![Page 22: Pig: Data Analysis Tool in Cloud](https://reader034.fdocuments.net/reader034/viewer/2022051612/54be2d314a79592f108b45ef/html5/thumbnails/22.jpg)
Thank you!
Q&A