Efficient processing of large and complex XML documents in Hadoop
Uploaded by hadoopsummit
Efficient processing of large and complex XML documents in Hadoop
Sujoe Bose, Senior Principal, Sabre Holdings – June 2013
Presentation Outline
§ Motivation
§ ETL vs. ELT
§ Avro Format
§ Mapping from XML to Avro
§ Interfaces to access Avro
§ Performance and storage considerations
§ Other types of storage/processing formats
confidential
You will learn about …
§ A method to store and process complex XML data in Hadoop as Avro files
§ Interfaces to access and analyze data in Avro from Hive, Java and Pig
§ Variations of the method and their relative trade-offs in storage and processing
Motivation
§ Prevalence of XML and its derivatives
– Spurred by Web Services and SOA
– Preferred communication format until newer formats entered
– Data and logs represented in XML
§ XML: metadata combined with data
– Flexibility vs. complexity
§ Could be arbitrarily nested and large
§ Volumes of documents – Big Data
Challenges
§ Parsing XML is CPU-intensive
§ Certain parsers/parsing methods result in more memory consumption
§ Repeated parsing for each query
§ Large and deeply nested XMLs make the problem worse
§ Presence of tags in data results in high I/O due to storage size
§ Special handling of optional fields
ETL vs. ELT
§ Hadoop generally built for EL-T, aka Schema-on-Read
– Load as-is
– Transform on access/query
§ Compare with data warehouse ETL, aka Schema-on-Write
– Transform and load
– Queries are a lot simpler
– Transformation and cleansing done a priori
Mix of ETL and ELT
ELT:
§ Generally better in flexibility
§ More suitable for simpler and well-defined formats
§ More applicable for experimentation
§ XML data parsed on demand for every query

ETL:
§ Generally better in performance
§ More suitable when substantial cleansing and reformatting is needed
§ Repetitive queries and production workloads
§ XML data pre-parsed to minimize resource usage
Approaches
[Diagram: XML files are either pre-parsed (ETL) into Avro files using an Avro schema, or parsed on demand. Interfaces for processing the data on either side: Pig UDF, Hive SerDe, and MapReduce.]
ELT
[Same approaches diagram, highlighting the on-demand parsing (ELT) path.]
ETL
[Same approaches diagram, highlighting the pre-parsing (ETL) path.]
XML Pre-parsing
§ Nested elements and attributes
§ Representation of the parsed XML structure
§ Enter Avro!
Avro
§ Data serialization system
§ Specifically designed for Hadoop, but also used in other environments
§ Rich data structures: arrays, records, maps, etc.
§ Compact, fast, binary data format
§ Metadata stored at the file level, not per record
§ Splittable – ideal for MapReduce
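One concrete reason the format is compact: the Avro specification stores ints and longs as zigzag-encoded variable-length integers, so small-magnitude values take a single byte. A minimal sketch of that encoding (helper names are mine, and the sketch assumes 64-bit values):

```python
def zigzag(n: int) -> int:
    # Interleave signed values so small magnitudes get small unsigned codes:
    # 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ...
    # (assumes n fits in a signed 64-bit long, as Avro's long does)
    return (n << 1) ^ (n >> 63)

def avro_long(n: int) -> bytes:
    # Variable-length base-128 encoding: 7 payload bits per byte,
    # high bit set on every byte except the last.
    u = zigzag(n)
    out = bytearray()
    while u > 0x7F:
        out.append((u & 0x7F) | 0x80)
        u >>= 7
    out.append(u)
    return bytes(out)

print(avro_long(1))    # b'\x02'
print(avro_long(-64))  # b'\x7f'      (zigzag(-64) = 127, one byte)
print(avro_long(64))   # b'\x80\x01'  (zigzag(64) = 128, two bytes)
```

Values between -64 and 63 encode in one byte regardless of sign, which is why numeric-heavy records shrink so much compared to their XML text form.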
Avro APIs
§ Generic objects and pre-generated objects
– Easy API including simple gets and puts
§ APIs in several languages: Java, C#, C/C++, Python, Ruby
Use Case
§ FIXML – Financial Information eXchange
– http://www.fixprotocol.org/specifications/
§ XML Database Benchmark
– http://tpox.sourceforge.net/
§ Provides sample data for benchmarking
§ Data Generator for generating large and predictable datasets
FIXML
§ XML Data Generator
– http://tpox.sourceforge.net/tpoxdata.htm
§ Order: buy and sell orders of securities
Simple mapping
XML construct → Avro → Pig:
– Elements with repeated nested elements → Array → Bag
– Elements with attributes and text elements → Record → Tuple
– Attributes and text elements → Field → Field
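The mapping rules above can be sketched with the standard library: attributes and text become fields, a repeated child tag becomes an array, a single nested element becomes a record. Element and field names here are illustrative, and name collisions between attributes and child tags are ignored for brevity.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def element_to_record(elem):
    """Apply the simple mapping: attributes/text -> fields,
    repeated children -> arrays, single children -> nested records."""
    record = dict(elem.attrib)              # attributes -> fields
    text = (elem.text or "").strip()
    if text:
        record["_text"] = text              # text content -> field
    counts = Counter(child.tag for child in elem)
    for child in elem:
        value = element_to_record(child)
        if counts[child.tag] > 1:           # repeated element -> array
            record.setdefault(child.tag, []).append(value)
        else:                               # single element -> record
            record[child.tag] = value
    return record

doc = ET.fromstring(
    '<Order ID="1" Acct="A7"><Leg Qty="10"/><Leg Qty="20"/></Order>')
rec = element_to_record(doc)
# rec == {'ID': '1', 'Acct': 'A7', 'Leg': [{'Qty': '10'}, {'Qty': '20'}]}
```

The resulting dict mirrors the Avro record/array nesting, and the Pig bag/tuple shapes follow the same structure one level up.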
Avro Schema
{
  "type": "record",
  "name": "FIXOrder",
  "namespace": "com.sabre.fixml",
  "doc": "Definition and mapping for FIX Orders",
  "mapping": "/FIXML",
  "fields": [
    { "name": "v", "type": "string", "mapping": "@v" },
    { "name": "r", "type": "string", "mapping": "@r" },
    { "name": "s", "type": "string", "mapping": "@s" },
    { "name": "Order", "mapping": "Order",
      "type": {
        "name": "OrderRecord", "mapping": "Order", "type": "record",
        "fields": [
          { "name": "ID", "type": "string", "mapping": "@ID" },
          { "name": "ID2", "type": "string", "mapping": "@ID2" },
          { "name": "OrignDt", "type": "string", "mapping": "@OrignDt" },
          { "name": "TrdDt", "type": "string", "mapping": "@TrdDt" },
          { "name": "Acct", "type": "string", "mapping": "@Acct" },
          { "name": "AcctTyp", "type": "string", "mapping": "@AcctTyp" },
          { "name": "DayBkngInst", "type": "string", "mapping": "@DayBkngInst" },
          { "name": "BkngUnit", "type": "string", "mapping": "@BkngUnit" },
          { "name": "PreallocMeth", "type": "string", "mapping": "@PreallocMeth" },
          { "name": "AllocID", "type": "string", "mapping": "@AllocID" },
          { "name": "CshMgn", "type": "string", "mapping": "@CshMgn" },
          { "name": "ClrFeeInd", "type": "string", "mapping": "@ClrFeeInd" },
...
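The "mapping" entries are the deck's own annotation ("@x" for attribute x, otherwise a child-element name), not part of standard Avro; how a converter would consume them is my assumption. A sketch of schema-driven extraction over a trimmed-down version of the schema above:

```python
import xml.etree.ElementTree as ET

def extract(elem, schema):
    """Walk the XML element using the schema's 'mapping' annotations."""
    record = {}
    for field in schema["fields"]:
        mapping = field["mapping"]
        ftype = field["type"]
        if isinstance(ftype, dict) and ftype.get("type") == "record":
            child = elem.find(mapping)          # nested record element
            record[field["name"]] = (
                extract(child, ftype) if child is not None else None)
        elif mapping.startswith("@"):
            record[field["name"]] = elem.get(mapping[1:])  # attribute
    return record

# Trimmed-down FIXOrder schema, for illustration only.
schema = {
    "type": "record", "name": "FIXOrder", "mapping": "/FIXML",
    "fields": [
        {"name": "v", "type": "string", "mapping": "@v"},
        {"name": "Order", "mapping": "Order",
         "type": {"type": "record", "name": "OrderRecord", "mapping": "Order",
                  "fields": [{"name": "Acct", "type": "string",
                              "mapping": "@Acct"}]}},
    ],
}
doc = ET.fromstring('<FIXML v="5.0"><Order Acct="A7"/></FIXML>')
result = extract(doc, schema)
# result == {'v': '5.0', 'Order': {'Acct': 'A7'}}
```

Text-element mappings and arrays are omitted here; the point is that a single annotated schema can drive both the Avro record definition and the XML extraction.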
Pig Schema
FIXOrder: tuple (
  v: chararray,
  r: chararray,
  s: chararray,
  Order: tuple (
    ID: chararray,
    ID2: chararray,
    OrignDt: chararray,
    TrdDt: chararray,
    Acct: chararray,
    AcctTyp: chararray,
    DayBkngInst: chararray,
    BkngUnit: chararray,
    PreallocMeth: chararray,
    AllocID: chararray,
    CshMgn: chararray,
    ClrFeeInd: chararray,
Avro – Access Methods
§ Direct support for access from Hive (using SerDe)
CREATE EXTERNAL TABLE <TableName>
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'location-of-avro-files'
TBLPROPERTIES ('avro.schema.url'='location-of-schema-file.avsc')

§ Access via Pig – AvroStorage
§ Avro API – Java MapReduce
Test Data
§ Base Securities Order file: 500,000 records
§ Replicated for volume:
– 15x – 7.5 million records
– 30x – 15 million records
– 45x – 22.5 million records
– 60x – 30 million records
– 75x – 37.5 million records
Comparison
[Same approaches diagram: the two paths being compared.]
File sizes: Orders
§ Base data
– XML file size as-is: 749,337,916 bytes (750MB)
– gzip-compressed: 182,687,654 bytes (183MB)
§ After Avro conversion
– Avro + Snappy: 151,647,926 bytes (152MB)
– Avro + gzip: 107,898,177 bytes (108MB)
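The size reductions relative to the raw XML follow from quick arithmetic over the byte counts above:

```python
# Reduction relative to the 750MB raw XML, using the byte counts above.
raw = 749_337_916
for label, size in [("gzip XML",    182_687_654),
                    ("Avro+Snappy", 151_647_926),
                    ("Avro+gzip",   107_898_177)]:
    pct = 100 * (1 - size / raw)
    print(f"{label}: {pct:.0f}% smaller than raw XML")
# gzip XML:    76% smaller
# Avro+Snappy: 80% smaller
# Avro+gzip:   86% smaller
```

Dropping the repeated tag text before compression is what lets Avro+gzip beat gzip applied to the XML itself.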
Storage Size Comparison
[Chart comparing the storage sizes listed above.]
Test Environment
§ 18 nodes
§ Node configuration:
– 12 cores per node
– 48GB memory
– 36TB with 12 disks of 3TB each
§ CDH 4.1.2
Sample Query
§ Security Orders per Account
order_records = LOAD '$AVRO_INPUT' USING AVRO_LOAD AS (
    ------- Pig Schema goes here -------
);
order_projection = FOREACH order_records
    GENERATE Order.Acct AS Account, Order.OrdQty.Qty AS Quantity;
order_group = GROUP order_projection BY Account;
order_count = FOREACH order_group
    GENERATE group, SUM(order_projection.Quantity);
STORE order_count INTO '$PIG_OUTPUT' USING PigStorage(',');
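For reference, the same project–group–sum pipeline sketched in plain Python over already-parsed records (the sample values are made up; field names follow the FIXML mapping):

```python
from collections import defaultdict

# Hypothetical pre-parsed order records, shaped like the Avro mapping.
orders = [
    {"Order": {"Acct": "A1", "OrdQty": {"Qty": 100}}},
    {"Order": {"Acct": "A2", "OrdQty": {"Qty": 50}}},
    {"Order": {"Acct": "A1", "OrdQty": {"Qty": 25}}},
]

# Project Acct and Qty, group by account, sum the quantities.
qty_by_account = defaultdict(int)
for rec in orders:
    qty_by_account[rec["Order"]["Acct"]] += rec["Order"]["OrdQty"]["Qty"]

print(dict(qty_by_account))   # {'A1': 125, 'A2': 50}
```

In the pre-parsed run this navigation is cheap field access on Avro records; in the on-demand run every query pays the XML parse first.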
Run Types
§ Pre-parsed approach:
– XML-to-Avro materialization: xml-to-avro
• XML to Avro is run only once on the data
– Avro to Pig via UDF: avro-to-pig
§ Parse on demand:
– XML parsing using a Pig UDF: xml-to-pig
Runtime in Seconds
[Chart comparing three runs: analysis on raw XML (XML to Pig), pre-parsing XML (XML to Avro), analysis on parsed XML (Avro to Pig).]
CPU Usage Comparison
[Chart comparing the same three runs: XML to Pig, XML to Avro, Avro to Pig.]
Memory Usage Comparison: Total Memory Used (GB)
[Chart comparing the same three runs: XML to Pig, XML to Avro, Avro to Pig.]
Results
§ Analysis on pre-parsed data, compared to raw XML:
– Runtime reduced by more than 50%
– Memory and CPU consumption reduced by about 50%
§ The pre-parsing stage takes more resources and time than on-demand parsing
§ Repetitive queries will benefit from one-time pre-parsing
Caveats
§ Not all fields were extracted from the XML input (optional elements)
§ Challenge in keeping up with versions/changes of the XML
§ Performance numbers can depend on the type of data and the mapping used
Alternatives
§ Formats other than Avro may be more suitable
§ Record Columnar formats (RCFile and ORC File)
§ Trevni: a column file format supporting Avro
§ Parquet: another columnar storage format for Hadoop
Motivation for Columnar Format
§ MapReduce capability
§ Column projections reduce I/O
§ Column compression, due to similarity of data within a column, further reduces I/O
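A toy illustration of why projection helps: with one contiguous list per column, a query touching only the quantity field never reads the other fields, whereas a row layout drags every field through I/O (names and values here are illustrative):

```python
# Row layout: every read touches all fields of each record.
rows = [("A1", 100, "2013-06-01"),
        ("A2", 50,  "2013-06-02"),
        ("A1", 25,  "2013-06-03")]

# Columnar layout: one contiguous list per column.
columns = {name: [r[i] for r in rows]
           for i, name in enumerate(["Acct", "Qty", "TrdDt"])}

# Projection on Qty touches one list instead of every row.
total_qty = sum(columns["Qty"])   # 175

# Bonus: similar values now sit next to each other, so per-column
# compression (run-length, dictionary) works better than compressing
# mixed-type rows.
```

RCFile, ORC, Trevni and Parquet all build on this idea, layered with splittability and Hadoop-friendly block structure.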
Summary
§ The materialized version is well-suited for repeated queries
§ For ad-hoc/experimental queries, parse-on-demand is better
§ Mapping from XML to Avro can be automated
§ Hive, Pig and MapReduce interfaces to access Avro files
§ Relative trade-offs between flexibility and performance/storage