BDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use case
1 / 18
EndoMine System
Jewish General Hospital
by David Lauzon and Anton Zakharov
Big Data Montreal #9
February 5th 2013
2 / 18
Presentation
• Our Objectives
• Requirements and context
• Project scope
• Hadoop Solution
  – Big Data Solution Overview
  – Hive Table Schema
  – Compression Performance
  – Data Architecture in Hadoop
  – Hadoop/Impala Prototype Demo
• Oracle Solution
• Hadoop vs Oracle comparison
• What are expensive queries?
3 / 18
Our Objectives
• Lead an end-of-study project in an industrial context
  – Requirements elicitation
  – Implement a « proof-of-concept » prototype
• Experiment with big data technologies
  – Compare with RDBMS
4 / 18
Requirements and context
• Department of Medical Diagnostic (medical test results DB, e.g. blood, urine, ...)
  – Dr. Shaun Eintracht
    • « ad hoc » Query
    • ETL Query
  – Dr. Elizabeth Mac Namara
    • « business intelligence » requirements
    • Realtime Dashboard
• Department of Endocrinology
  – Dr. Mark Trifiro
    • Data mining
5 / 18
Project scope
• First iteration = improve ad-hoc queries
  – Slow analytical queries and ETL (MS Access)
  – Risk of « crashing » production DB
  – Some queries impossible to process
6 / 18
Production DB (Oracle)
7 / 18
Solutions
• Solution 1 : Hadoop + Impala
• Solution 2 : Tune the existing Oracle RDBMS
8 / 18
Big Data Solution Overview
9 / 18
Hive Table Schema
10 / 18
Compression Performance
[Figure: bar chart comparing Impala, Hive and Oracle across storage formats (Oracle FS, Text File, Sequence File, SeqFile + Gzip, SeqFile + Snappy); axis from 0 to 250]
11 / 18
Data Architecture in Hadoop
• All big tables are pre-joined
  – With specimen (1)
  – Without specimen (2)
• Partitioned using two schemes
  – Year-month (3)
  – Year and Test (4)
• 4 different versions of the same data:
  – stay_order_results_yearmonth
  – stay_order_results_year_and_test
  – stay_order_results_specimen_yearmonth
  – stay_order_results_specimen_year_and_test
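The reason for keeping the same data under two partitioning schemes can be illustrated with a small Python sketch. It is not the prototype's Hive or Impala code: the sample rows, the test codes (`GLU`, `TSH`) and all function names are hypothetical, and the in-memory dictionaries only stand in for partition directories. The sketch shows that a query is cheapest when its filter matches the partition key, because only the matching partitions need to be read.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sample rows: (result_date, test_code, value).
# These are illustrative only, not hospital data.
rows = [
    (date(2012, 1, 15), "GLU", 5.4),
    (date(2012, 1, 20), "TSH", 2.1),
    (date(2012, 2, 3), "GLU", 6.0),
    (date(2013, 1, 9), "GLU", 5.1),
]


def key_yearmonth(row):
    """Partition key for the year-month scheme."""
    result_date, _test, _value = row
    return (result_date.year, result_date.month)


def key_year_and_test(row):
    """Partition key for the year-and-test scheme."""
    result_date, test, _value = row
    return (result_date.year, test)


def build_partitions(rows, key_fn):
    """Group rows by partition key, as a stand-in for partition directories."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[key_fn(row)].append(row)
    return dict(partitions)


def partitions_to_scan(partitions, predicate):
    """Return the partition keys a query must read (partition pruning)."""
    return sorted(key for key in partitions if predicate(key))


by_yearmonth = build_partitions(rows, key_yearmonth)
by_year_and_test = build_partitions(rows, key_year_and_test)

# "All results for January 2012" matches the year-month key exactly.
jan_2012 = partitions_to_scan(by_yearmonth, lambda k: k == (2012, 1))

# "All GLU results in 2012" matches the year-and-test key exactly...
glu_2012 = partitions_to_scan(by_year_and_test, lambda k: k == (2012, "GLU"))

# ...but on the year-month layout it has to read every 2012 partition.
glu_2012_via_yearmonth = partitions_to_scan(by_yearmonth, lambda k: k[0] == 2012)
```

The trade-off is storage and ETL time: every extra layout is another full copy of the pre-joined data.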
12 / 18
Hadoop Prototype Demo
13 / 18
Oracle Solution
• Same tables as source DB
  – A big pre-joined table is not a good solution
• Techniques explored:
  – Partitioning
    • Partitions automatically created
  – Compression
    • Inefficient for joins
  – Clustering
  – Join multiple partitioned tables
14 / 18
Oracle Solution (continued)
• Avoid too many indexes on the big tables:
  – They take a lot of memory
  – They are slow to create
  – They may not be used if a query touches more than 5% of the rows
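The 5% rule of thumb can be written down as a toy decision function. This is only a sketch of the heuristic stated on the slide: a real Oracle optimizer is cost-based and weighs many more factors, and the function name, the return labels and the default threshold parameter are assumptions made for illustration.

```python
def choose_access_path(matching_rows, total_rows, threshold=0.05):
    """Toy model of the slide's rule of thumb, not Oracle's actual optimizer.

    An index lookup pays off only when a query touches a small fraction
    of the table; past the threshold, one sequential full scan is assumed
    to be cheaper than many random index reads.
    """
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    selectivity = matching_rows / total_rows
    return "index_lookup" if selectivity <= threshold else "full_scan"


# A query returning 1,000 of 1,000,000 rows (0.1%) would use the index.
narrow_query = choose_access_path(1_000, 1_000_000)

# A query returning 200,000 of 1,000,000 rows (20%) would scan the table,
# so an index built only for that query would be wasted memory and build time.
broad_query = choose_access_path(200_000, 1_000_000)
```

Analytical queries typically fall in the second case, which is why adding more indexes on the big tables helps them little.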
15 / 18
Comparison: Hadoop Solution
• Pros
  – Crunches massive amounts of data
  – Scalability
  – Free software
• Cons
  – Needs better UI and tune-ups
  – Maintenance cost
  – Requires ETL time to merge data into one table
  – BIG joins should be avoided
16 / 18
Comparison: Oracle Solution
• Pros
  – Just need to create a slave DB (just?)
  – Faster random lookups
  – Easier to find expertise
• Cons
  – Scalability only up to a certain point
  – Synchronisation with master DB:
    • Rebuilding indexes would take hours
17 / 18
What are expensive queries?
• If possible, avoid these constructs on large result sets
  – SELECT DISTINCT
  – ORDER BY
  – GROUP BY
  – JOIN big table with another big table
    • JOIN big table with multiple small tables should be OK
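One way to see why joining a big table with small tables is usually fine is a hash join in which the small side is loaded into memory once and the big side is streamed past it. The Python sketch below illustrates that general technique under assumed data. It is not a description of how Impala or Oracle execute joins, and the column names and sample rows are hypothetical.

```python
def hash_join(big_rows, small_rows, big_key, small_key):
    """Join a streamed big table against a small table held in memory.

    The small table is indexed once in a dict; each big row then needs
    only one dictionary lookup, so the big table is read a single time.
    """
    lookup = {}
    for small in small_rows:
        lookup.setdefault(small[small_key], []).append(small)
    for big in big_rows:
        for small in lookup.get(big[big_key], []):
            # Merge the two rows; rows with no match are dropped (inner join).
            yield {**big, **small}


# Hypothetical small dimension table (fits in memory).
tests = [
    {"code": "GLU", "name": "Glucose"},
    {"code": "TSH", "name": "Thyroid-stimulating hormone"},
]

# Hypothetical big fact table (in practice streamed from disk).
results = [
    {"test_code": "GLU", "value": 5.4},
    {"test_code": "TSH", "value": 2.1},
    {"test_code": "GLU", "value": 6.0},
    {"test_code": "XYZ", "value": 1.0},  # no matching test, dropped
]

joined = list(hash_join(results, tests, "test_code", "code"))
```

When both sides are big, neither fits in memory as the lookup table, so the engine has to sort, spill to disk or shuffle data between nodes, which is what makes big-to-big joins expensive.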
18 / 18
Conclusion
• Recommendation to use a “classic” RDBMS
  – The database fits on a single node
  – Existing expertise in-house
  – Acceptable performance with appropriate tune-ups
  – Stop using MS Access
• Disadvantage: limited scalability