20140120 presto meetup_en
Our Presto use case and performance test
Hironori Ogibayashi
Shin Matsuura
About us
● Hironori Ogibayashi (@angostura11)
● Shin Matsuura
○ IT Infrastructure team in a Japanese telecommunications carrier
○ Mainly working on middleware: test, installation, deployment.
Today's Topics
● Presto use case
○ Deployment
○ Use case
○ Challenges
○ Future work
● Performance comparison between Hive+Tez and Presto
Presto use case
Log Collection Flow
[Diagram: Fluentd, Aggregator, WebHDFS, Hadoop Cluster, Application]
・1500 Fluentd instances
・25,000 msg / sec
・400GB / day
・150 types of log
Log Usage
● Systems Infrastructure team
○ Checking trends in server performance
○ Performance analysis of Oracle Database
● Application development team
○ Improving system and business operations.
Application for Oracle DB Performance Analysis
- Checks existing/potential problems of an Oracle database for a given system and period.
- Utilizes logs stored in HDFS. Queries were executed on Hive.
- But it took more than one hour to get the result...
- (So, we migrated to Presto.)
Why Presto?
● Frequent use of Interactive / ad-hoc queries.
● Of course, faster is better.
Presto Deployment
[Diagram: Application/Client, Presto Coordinator, Hive Metastore, and Hadoop Slaves each running DataNode, TaskTracker and Presto Worker]
● A dedicated physical machine serves as the Coordinator.
● Workers run on each Hadoop slave.
● Logs in HDFS are periodically converted to RCfiles.
● Presto versions
○ 0.66⇒0.73⇒0.75⇒0.82
Deployment Effect - Elapsed time of a single query
230sec
7sec
- Elapsed time of one of the queries issued by the application.
- Query was run on CDH4 (MRv1) cluster.
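As a rough worked example of the figures on this slide, the speedup can be computed as a simple ratio. This is only a sketch: it assumes 230 sec is the elapsed time before the migration and 7 sec is the elapsed time on Presto, and the function name speedup is made up for illustration.

```python
def speedup(before_sec: float, after_sec: float) -> float:
    """Return how many times faster the 'after' run is than the 'before' run."""
    if before_sec <= 0 or after_sec <= 0:
        raise ValueError("elapsed times must be positive")
    return before_sec / after_sec


# Elapsed times quoted on the slide (assumed order: before, after).
ratio = speedup(230.0, 7.0)
print(f"{ratio:.1f}x faster")  # prints "32.9x faster"
```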
Deployment and Operation
● Deployment
○ Easy; just extract binaries on each server and modify the configuration file.
○ Automated by Ansible + yum.
● What we use in operation
○ Query history
■ Coordinator Web UI
○ Logs
■ /var/presto/data/logs/{server.log,launcher.log}
○ Metrics
■ presto-metrics (https://github.com/xerial/presto-metrics)⇒Fluentd⇒Elasticsearch + Kibana
○ sys schema
Challenges
● Worker crash / hang.
○ OutOfMemory. When a worker hangs, we resort to “kill -9”.
○ We modified the memory parameters so that task.shard.max-threads × task.max-memory < -Xmx.
● At first, we set node-scheduler.include-coordinator=true. In that case, the Coordinator crashed under heavy queries.
● SQL differences from HiveQL
○ At first, our application used both Hive and Presto because we were using Presto experimentally. Hence the application had to support both HiveQL and Presto (ANSI SQL).
○ Now, the application no longer uses Hive.
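The memory rule mentioned under the worker crash item (task.shard.max-threads × task.max-memory < -Xmx) can be checked with a few lines of arithmetic. The following is a minimal sketch, not the speakers' tooling: the helper names parse_size and worker_memory_fits are made up, the example values are hypothetical, and size units are assumed to be powers of 1024.

```python
import re


def parse_size(text: str) -> int:
    """Parse sizes such as '1GB', '512MB' or the JVM-style '16G' into bytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMGT]?)B?\s*", text.upper())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, prefix = match.groups()
    # Assumption: K/M/G/T are powers of 1024.
    exponent = " KMGT".index(prefix) if prefix else 0
    return int(float(number) * 1024 ** exponent)


def worker_memory_fits(max_threads: int, task_max_memory: str, xmx: str) -> bool:
    """Check the slide's rule: max threads x per-task memory must stay below the heap size."""
    return max_threads * parse_size(task_max_memory) < parse_size(xmx)


# Hypothetical values, chosen only to illustrate the check.
print(worker_memory_fits(8, "1GB", "16G"))   # True: 8 GB is below 16 GB
print(worker_memory_fits(32, "1GB", "16G"))  # False: 32 GB exceeds 16 GB
```

A check like this could run before rolling out a new configuration, so that an oversized thread count or per-task limit is caught before a worker runs out of memory.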
Future work
● Improve Coordinator’s availability.
● Security
○ Now, all queries are executed as Presto’s daemon user.
● Resource isolation between Presto and Hadoop daemons.
Presto VS Hive+Tez
Contents
From a performance perspective: Presto VS Hive+Tez (without tuning any parameters)
Conclusion
Presto: Win, Hive+Tez: Lose
How fast?
Presto was 2.0~136 times faster than Hive+Tez.
More details
Testing environment
● Configurations: 2p12c, 64GB Mem, 36TB Disk
● Master: 3 nodes, Slave: 3 nodes
● Hadoop (HDP2.1): NN, NN, Metastore, DN × 3
● Presto (0.82): Coordinator, Worker × 3
Sample data
300GB csv file
50 columns
1.1B records
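To get a feel for the sample data, the average record size can be estimated from the three numbers above. This is back-of-the-envelope arithmetic only, assuming 1 GB = 10^9 bytes and ignoring delimiters and line breaks in the csv file.

```python
# Figures from the slide; GB is assumed to mean 10**9 bytes.
total_bytes = 300e9
records = 1.1e9
columns = 50

bytes_per_record = total_bytes / records
bytes_per_value = bytes_per_record / columns

print(f"about {bytes_per_record:.0f} bytes per record")       # about 273 bytes per record
print(f"about {bytes_per_value:.1f} bytes per column value")  # about 5.5 bytes per column value
```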
Performance measurement perspectives
• Query patterns
• Data format patterns
• Repetitive Querying
Query patterns
Queries
Query1: select count(*) from TestTBL
Query2: select * from TestTBL where col1 = 'XXX'
Query3: select * from TestTBL where col1 = 'XXX' and col2 = 'YYY'
Query4: select col1, count(*) from TestTBL group by col1
Query5: select col1, count(*) from TestTBL where col2 = 'YYY' group by col1
※Data format: Txt
Results: Query patterns
※Data format: Txt
100x faster
Presto processed every query faster than Hive+Tez.
Data format patterns
Data formats
• Text File (Textfile)
• Record Columnar File (RCfile)
• Optimized Row Columnar File (ORCfile)
Results: Data format patterns
※Query: Query2
Presto processed Query2 faster than Hive+Tez in all data formats.
Repetitive Querying
Change in processing time with repetitions (Presto)
※Query: Query2
※Data format: Txt
Became faster from the second run onward (2.5x faster). Cache?
Change in processing time with repetitions (Hive+Tez)
※Query: Query2
※Data format: Txt
No real change in processing time
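A repetition test like the one on these slides can be sketched as a small timing harness. This is an illustration under stated assumptions, not the speakers' benchmark script: run_query is a hypothetical callable that would submit Query2 to Presto or Hive+Tez, and fake_query is a stand-in so the sketch runs without a cluster.

```python
import time
from typing import Callable, List


def time_repetitions(run_query: Callable[[], object], repetitions: int) -> List[float]:
    """Run the same query several times and return each elapsed time in seconds."""
    if repetitions < 1:
        raise ValueError("repetitions must be at least 1")
    elapsed = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run_query()
        elapsed.append(time.perf_counter() - start)
    return elapsed


def first_run_ratio(elapsed: List[float]) -> float:
    """Ratio of the first run to the mean of later runs (above 1 means later runs were faster)."""
    later = elapsed[1:]
    if not later:
        raise ValueError("need at least two runs")
    return elapsed[0] / (sum(later) / len(later))


def fake_query() -> None:
    """Stand-in for a real query submission (hypothetical)."""
    time.sleep(0.01)


times = time_repetitions(fake_query, 3)
print(len(times))  # 3

# Made-up timings, not measurements from the slides.
print(first_run_ratio([6.0, 3.0, 3.0]))  # 2.0
```

Reporting the first run separately from the later runs keeps a possible caching effect visible instead of averaging it away.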
+α
Engine: Presto
Query × Data format
Is using RCfile the most stable and fastest way?
Summary
Result
● Presto was faster than Hive+Tez in all queries.
● Presto was faster than Hive+Tez in all data formats.
● With repetitive querying, Presto became faster.
● With RCfile, Presto was the most stable and fastest.
Next
● Benchmark from node scaling and data volume perspectives.
● Benchmark while using compression functions of ORCfile.
● Benchmark with HDP2.2.
Appendix
Faster from the second run onward under almost all conditions.