![Page 1: An Experimental Comparison of Complex Object Implementations …ruleml.org › talks › KiaTeymourian-ComplObjImps4BigDataSys... · 2018-06-28 · Partially Published in: Sourav](https://reader033.fdocuments.net/reader033/viewer/2022042315/5f03b8ea7e708231d40a7320/html5/thumbnails/1.jpg)
An Experimental Comparison of Complex Object Implementations for
Big Data Systems
Kia Teymourian
Joint work with Sourav Sikdar, Chris Jermaine
Partially Published in:
Sourav Sikdar, Kia Teymourian, and Chris Jermaine. 2017. An experimental comparison of complex object implementations for big data systems. In Proceedings of the 2017 Symposium on Cloud Computing (SoCC '17). ACM, New York, NY, USA, 432-444. DOI: https://doi.org/10.1145/3127479.3129248
Introduction
• Relational databases store records made of flat types.
  - integer, float, boolean, char, etc.
• All records have a fixed size.
• Example: a student database.

Last Name   First Name   Student ID   Net ID   SSN     …
Doe         John         S012141*     jd*      *4768   …
Roe         Jane         S012142*     jr*      *4321   …
…
• How do relational databases store complex objects, e.g., graphs?
  - Complex objects have variable size and are highly nested.
Graphs
Graph ID   …
1          …
…          …

Vertices
Graph ID   Vertex ID   …
1          1           …
1          2
…

Edges
Graph ID   from   to   …
1          1      2    …
1          2      3
1          3      4
…
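The decomposition into the Graphs, Vertices, and Edges tables can be sketched in Java. This is a minimal illustration rather than code from the talk: the class and method names (`Graph`, `GraphFlattener`, `vertexRows`, `edgeRows`) are hypothetical, and each relational row is modeled as a plain `int[]`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical nested graph object: variable size, owns its vertices and edges.
final class Graph {
    final int graphId;
    final List<Integer> vertexIds = new ArrayList<>();
    final List<int[]> edges = new ArrayList<>(); // each edge is {from, to}

    Graph(int graphId) { this.graphId = graphId; }

    Graph addVertex(int id) { vertexIds.add(id); return this; }

    Graph addEdge(int from, int to) { edges.add(new int[] {from, to}); return this; }
}

final class GraphFlattener {
    // One fixed-size row (graphId, vertexId) per vertex, as in the Vertices table.
    static List<int[]> vertexRows(Graph g) {
        List<int[]> rows = new ArrayList<>();
        for (int v : g.vertexIds) rows.add(new int[] {g.graphId, v});
        return rows;
    }

    // One fixed-size row (graphId, from, to) per edge, as in the Edges table.
    static List<int[]> edgeRows(Graph g) {
        List<int[]> rows = new ArrayList<>();
        for (int[] e : g.edges) rows.add(new int[] {g.graphId, e[0], e[1]});
        return rows;
    }

    public static void main(String[] args) {
        Graph g = new Graph(1)
                .addVertex(1).addVertex(2).addVertex(3).addVertex(4)
                .addEdge(1, 2).addEdge(2, 3).addEdge(3, 4);
        for (int[] r : edgeRows(g)) {
            System.out.println(r[0] + " " + r[1] + " " + r[2]);
        }
    }
}
```

Reassembling one graph from these rows requires joining the tables on Graph ID, which is the price of forcing a nested object into fixed-size records.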
• Modern programming languages provide a lot of useful features.
  - Generics (in Java), Templates (in C++).
• Outside relational database -
Big Data System: There are costs associated with
• Objectification
• Serialization
• Garbage Collection
Key Questions
Any big data system designer faces some important choices:
• Which data model to use?
• Which implementation of the data model to use?
• Which runtime environment to use?
Goal
Across a variety of data management tasks, experimentally compare the costs associated with various choices of complex object models and implementations.
Complex Object Models
• Host Language Objects
• Self-Describing Documents
• Custom Data Models
1. Host Language Objects
• Which runtime environment to use?
  - With automatic memory management or without it.
  - Managed (Java) vs. unmanaged (C++).
• Which serialization framework to use?
  - Serialization: conversion from the in-memory representation to the on-disk representation.
Java               C++
Java Default       C++ Hand-Coded
Java ByteBuffer    C++ Boost
Java Kryo          C++ InPlace
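To make the difference between two of the Java options concrete, here is a minimal sketch contrasting built-in Java serialization ("Java Default") with a hand-written `ByteBuffer` layout ("Java ByteBuffer"). It is an illustration under assumptions, not the code used in the experiments: the `Customer` class, its three fields, and the byte layout are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical record type used only for this illustration.
final class Customer implements Serializable {
    private static final long serialVersionUID = 1L;
    final int id;
    final String name;
    final double balance;

    Customer(int id, String name, double balance) {
        this.id = id;
        this.name = name;
        this.balance = balance;
    }
}

final class SerializationSketch {
    // "Java Default": built-in serialization. The output carries a stream header
    // and a class descriptor in addition to the field values.
    static byte[] javaDefault(Customer c) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(c);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // "Java ByteBuffer": hand-written layout holding only the field values,
    // plus a 4-byte length prefix for the variable-size name.
    static byte[] byteBuffer(Customer c) {
        byte[] name = c.name.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(4 + 4 + name.length + 8);
        buf.putInt(c.id).putInt(name.length).put(name).putDouble(c.balance);
        return buf.array();
    }

    // Reads the fields back in the same order they were written.
    static Customer fromByteBuffer(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        int id = buf.getInt();
        byte[] name = new byte[buf.getInt()];
        buf.get(name);
        double balance = buf.getDouble();
        return new Customer(id, new String(name, StandardCharsets.UTF_8), balance);
    }

    public static void main(String[] args) {
        Customer c = new Customer(42, "Jane Roe", 12.5);
        System.out.println("Java Default bytes:    " + javaDefault(c).length);
        System.out.println("Java ByteBuffer bytes: " + byteBuffer(c).length);
    }
}
```

The hand-written layout is smaller because nothing but the data is stored, but every change to the class must be mirrored by hand in both the writer and the reader.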
C++ InPlace
• We borrow the idea from relational databases.
  - On-disk representation = in-memory representation.
• Serialization and deserialization between main memory and disk: ~zero cost.
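The idea of keeping one representation for disk and memory can be approximated in Java with a memory-mapped file whose fields are read at fixed offsets. This is only an analogy to the C++ InPlace approach, not its implementation: the class name `InPlaceSketch`, the 12-byte record layout, and the helper names are assumptions made for this sketch.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class InPlaceSketch {
    // Assumed fixed record layout: int id (4 bytes) followed by double balance (8 bytes).
    static final int RECORD_SIZE = 12;

    static Path tempFile() {
        try {
            return Files.createTempFile("inplace", ".bin");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Write the records in exactly the layout the reader will use.
    static void write(Path file, int[] ids, double[] balances) {
        ByteBuffer buf = ByteBuffer.allocate(ids.length * RECORD_SIZE);
        for (int i = 0; i < ids.length; i++) {
            buf.putInt(ids[i]).putDouble(balances[i]);
        }
        try {
            Files.write(file, buf.array());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Map the file into memory. No per-record decoding or object allocation happens here;
    // the mapping stays valid after the channel is closed.
    static ByteBuffer map(Path file) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Field accessors read straight from the mapped bytes at computed offsets.
    static int id(ByteBuffer page, int record) {
        return page.getInt(record * RECORD_SIZE);
    }

    static double balance(ByteBuffer page, int record) {
        return page.getDouble(record * RECORD_SIZE + 4);
    }

    public static void main(String[] args) {
        Path file = tempFile();
        write(file, new int[] {7, 8}, new double[] {1.5, 2.5});
        ByteBuffer page = map(file);
        System.out.println(id(page, 1) + " " + balance(page, 1));
    }
}
```

The trade-off is that the stored bytes are tied to one layout, so the format is less compact and less portable than an encoded one.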
2. Self-Describing Documents
• JSON (optionally compressed with gzip)
• BSON
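A self-describing document repeats its field names in every record, which is why compression pays off in size and costs CPU time. The sketch below shows the effect with the JDK's gzip streams. It is an assumption-laden illustration: the class `JsonGzipSketch`, the field names `custKey`, `name`, and `acctBal`, and the hand-built JSON string are hypothetical, and no JSON library is involved.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class JsonGzipSketch {
    // Build a JSON array in which every record repeats the same field names.
    static String sampleJson(int records) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < records; i++) {
            if (i > 0) sb.append(',');
            sb.append("{\"custKey\":").append(i)
              .append(",\"name\":\"Customer#").append(i)
              .append("\",\"acctBal\":0.0}");
        }
        return sb.append(']').toString();
    }

    // Compress the UTF-8 bytes of the document.
    static byte[] gzip(String json) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(json.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Decompress back to text; parsing the JSON would be a further CPU cost.
    static String gunzip(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String json = sampleJson(100);
        System.out.println("plain bytes: " + json.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("gzip bytes:  " + gzip(json).length);
    }
}
```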
3. Custom Data Models
• Java Protocol Buffers
• C++ Protocol Buffers
• In both cases, a message schema is compiled into classes for the host language.
Summary: Object Implementations
• Host-language objects
  - Java Default
  - Java Kryo
  - Java ByteBuffer
  - C++ Boost
  - C++ HandCoded
  - C++ InPlace
• Self-Describing Documents
  - JSON
  - BSON
• Custom Nested Models
  - Java Protocol Buffers
  - C++ Protocol Buffers
Experiments
• Read from local disks
  - Sequential read (start from a random position in the file)
  - Random read (read random pages)
• Network IO
  - Read from RAM on 10 clients, push to a single server
  - Read from disk on 10 clients, push to a single server
• External sort
• Distributed data aggregation
Dataset
• Average size of a TPC-H Customer object on disk:

Implementation          Size (bytes)   On-disk content
Java JSON + gzip        8508           Data + Schema + Compression
Java Kryo               16176          Data + Light-weight encoding
Java Protocol Buffers   17305          Data + Light-weight encoding
C++ Protocol Buffers    17931          Data + Light-weight encoding
C++ HandCoded           19275          Data
Java ByteBuffer         19478          Data
Java Default            19556          Data + Headers
C++ Boost               21004          Data + Headers
C++ InPlace             25127          Memory Representation of Data
Java BSON               33879          Data + Schema
1. Sequential Read
Goal: Test the ability to support fast retrieval of objects.
Task: 3 million TPC-H Customer objects. Read 100K objects sequentially.
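As a rough back-of-the-envelope check, the volume of data touched by a 100K-object read can be estimated from the average on-disk sizes on the Dataset slide. The calculation assumes that those averages apply to the objects being read and uses decimal gigabytes; it says nothing about read time, only about bytes moved.

```latex
\begin{align*}
\text{C++ InPlace:}\quad      & 10^{5} \times 25127\ \text{B} = 2.5127 \times 10^{9}\ \text{B} \approx 2.51\ \text{GB} \\
\text{Java Kryo:}\quad        & 10^{5} \times 16176\ \text{B} = 1.6176 \times 10^{9}\ \text{B} \approx 1.62\ \text{GB} \\
\text{Java JSON + gzip:}\quad & 10^{5} \times 8508\ \text{B} = 8.508 \times 10^{8}\ \text{B} \approx 0.85\ \text{GB} \\
\text{InPlace / JSON + gzip:}\quad & 25127 / 8508 \approx 2.95
\end{align*}
```

Under these assumptions, InPlace moves roughly three times as many bytes as JSON + gzip for the same number of objects, so any advantage it has must come from avoiding decompression and parsing work rather than from reading less data.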
• The fastest C++ implementation (InPlace) is at least 1.5x faster than the fastest Java implementation (Kryo) for larger reads.
• The faster C++ implementations are up to 5x-10x faster than document models.
• C++ InPlace is IO bound.
• JSON + gzip is CPU bound.
2. External Sort
Goal: Sorting is a common workflow in data management systems.
Details: Sorting 3 million TPC-H Customer objects (~60 GB). The compute machine has 30 GB of RAM.
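Because the data does not fit in RAM, the sort has to spill to disk. A minimal sketch of a generic two-phase external merge sort follows, assuming integer sort keys as stand-ins for Customer objects. It is not the implementation measured in the talk; `ExternalSortSketch` and the `maxInMemory` parameter are hypothetical names, and the merged result is collected in a list only to keep the example short.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

final class ExternalSortSketch {
    // Phase 1: sort chunks of at most maxInMemory keys and spill each one to a run file.
    // Phase 2: merge all runs with a priority queue.
    static List<Integer> sort(Iterator<Integer> input, int maxInMemory) {
        try {
            List<Path> runs = new ArrayList<>();
            List<Integer> runLengths = new ArrayList<>();
            List<Integer> chunk = new ArrayList<>();
            while (input.hasNext()) {
                chunk.add(input.next());
                if (chunk.size() == maxInMemory || !input.hasNext()) {
                    Collections.sort(chunk);
                    runs.add(writeRun(chunk));
                    runLengths.add(chunk.size());
                    chunk.clear();
                }
            }
            return merge(runs, runLengths);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Serialize one sorted run to a temporary file.
    private static Path writeRun(List<Integer> sorted) throws IOException {
        Path file = Files.createTempFile("run", ".bin");
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            for (int key : sorted) out.writeInt(key);
        }
        return file;
    }

    // K-way merge: the heap holds {key, runIndex} for the current head of each run.
    private static List<Integer> merge(List<Path> runs, List<Integer> runLengths)
            throws IOException {
        List<DataInputStream> readers = new ArrayList<>();
        int[] remaining = new int[runs.size()];
        PriorityQueue<int[]> heap = new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        List<Integer> output = new ArrayList<>(); // a real system would stream this to disk
        try {
            for (int i = 0; i < runs.size(); i++) {
                readers.add(new DataInputStream(
                        new BufferedInputStream(Files.newInputStream(runs.get(i)))));
                remaining[i] = runLengths.get(i) - 1; // every run holds at least one key
                heap.add(new int[] {readers.get(i).readInt(), i});
            }
            while (!heap.isEmpty()) {
                int[] smallest = heap.poll();
                output.add(smallest[0]);
                int run = smallest[1];
                if (remaining[run] > 0) {
                    heap.add(new int[] {readers.get(run).readInt(), run});
                    remaining[run]--;
                }
            }
        } finally {
            for (DataInputStream r : readers) r.close();
            for (Path p : runs) Files.deleteIfExists(p);
        }
        return output;
    }
}
```

Every key is serialized once when its run is written and deserialized once during the merge, so the per-object serialization cost of the chosen implementation is paid at least twice.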
• The fastest C++ implementation (InPlace) is ~2x faster than the fastest Java implementation (Kryo).
• The faster C++ implementations are up to 5x-10x faster than document models.
Tweets Dataset
• Tweet objects are highly complex, nested graph objects.
• See their JSON format: https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/intro-to-tweet-json
import java.util.List;

public class TweetStatus {
    // Nested objects; a status can contain other statuses.
    private User user;
    private Coordinates coordinates;
    private Place place;
    private TweetStatus quotedStatus;
    private TweetStatus retweetedStatus;

    // Variable-length lists of entities.
    private List<HashtagEntity> hashtagEntities;
    private List<MediaEntity> mediaEntities;
    private List<URLEntity> urlEntities;
    private List<UserMentionEntity> userMentionEntities;
    private List<SymbolEntity> symbolEntities;
    …
}
Implementation as Java Class TweetStatus
Tweets Dataset – IO Experiments
[Figure: two bar charts of time in seconds (log scale), split into IO Time and CPU Time, for Java Default, Java JSON, Java JSON GZIP, Java BSON, Java ByteBuffer, Java Kryo, and Java Protocol Buffers.
Panel 1: Sequential Read, 1000K Tweet objects. Bar labels in order of appearance: 197.4, 240.9, 250.1, 786.9, 33.8, 35.6, 34.8.
Panel 2: Random Read, 1M Tweet objects. Bar labels in order of appearance: 11626, 4062, 3202, 5956, 2709, 2757, 2890.]
Conclusions
• Execution time in a memory-managed environment (Java) is significantly higher than in an unmanaged environment (C++ on Linux).
  - A 1.5x-2x performance penalty is incurred even before the system is designed.
• The costs are even higher for self-describing document formats like JSON.
  - Sorting JSON objects has a 5x-10x penalty compared to C++ solutions.
• There is value in the “classical database” way of doing things: keeping the in-memory and on-disk representations the same.
Use PlinyCompute
A platform for high-performance distributed tool and library development.
http://plinycompute.rice.edu/
https://github.com/riceplinygroup/plinycompute
Published in SIGMOD 2018
Jia Zou, R. Matthew Barnett, Tania Lorido-Botran, Shangyu Luo, Carlos Monroy, Sourav Sikdar, Kia Teymourian, Binhang Yuan, and Chris Jermaine. PlinyCompute: A Platform for High-Performance, Distributed, Data-Intensive Tool Development. In SIGMOD '18. DOI: https://doi.org/10.1145/3183713.3196933
Thank You