BDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use case

1 / 18

EndoMine System
Jewish General Hospital

by David Lauzon and Anton Zakharov
Big Data Montreal #9
February 5th 2013

2 / 18

Presentation

• Our Objectives
• Requirements and context
• Project scope
• Hadoop Solution
  – Big Data Solution Overview
  – Hive Table Schema
  – Compression Performance
  – Data Architecture in Hadoop
  – Hadoop/Impala Prototype Demo
• Oracle Solution
• Hadoop vs Oracle comparison
• What are expensive queries?

3 / 18

Our Objectives

• Lead an end-of-study project in an industrial context
  – Requirements elicitation
  – Implement a « proof-of-concept » prototype
• Experiment with big data technologies
  – Compare with RDBMS

4 / 18

Requirements and context

• Department of Medical Diagnostic (medical test results DB, e.g. blood, urine, ...)
  – Dr. Shaun Eintracht
    • « ad hoc » queries
    • ETL queries
  – Dr. Elizabeth Mac Namara
    • « business intelligence » requirements
    • Real-time dashboard
• Department of Endocrinology
  – Dr. Mark Trifiro
    • Data mining

5 / 18

Project scope

• First iteration = improve ad-hoc queries
  – Slow analytical queries and ETL (MS Access)
  – Risk of « crashing » production DB
  – Some queries impossible to process

6 / 18

Production DB (Oracle)

7 / 18

Solutions

• Solution 1 : Hadoop + Impala

• Solution 2 : Tune the existing Oracle RDBMS

8 / 18

Big Data Solution Overview

9 / 18

Hive Table Schema

10 / 18

Compression Performance

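[Bar chart. Categories: Oracle FS, Text File, Sequence File, SeqFile + Gzip, SeqFile + Snappy. Legend: Impala, Hive, Oracle. Axis range: 0 to 250; bar values and units are not recoverable from the transcript.]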

11 / 18

Data Architecture in Hadoop

• All big tables are pre-joined
  – With specimen (1)
  – Without specimen (2)
• Partitioned using two schemes
  – Year-month (3)
  – Year and Test (4)
• 4 different versions of the same data:
  – stay_order_results_yearmonth
  – stay_order_results_year_and_test
  – stay_order_results_specimen_yearmonth
  – stay_order_results_specimen_year_and_test
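A minimal sketch of how a query layer might pick one of the four pre-joined tables. The table names come from the slide; the selection rule (use the Year and Test scheme when the query filters on a test, Year-month otherwise) and the shape of the partition key are assumptions made for illustration, not something the presentation states.

```python
from datetime import date
from typing import Optional, Tuple

# Table names are taken from the slide. The keys (needs_specimen, scheme)
# are a hypothetical way of indexing them.
TABLES = {
    (False, "yearmonth"): "stay_order_results_yearmonth",
    (False, "year_and_test"): "stay_order_results_year_and_test",
    (True, "yearmonth"): "stay_order_results_specimen_yearmonth",
    (True, "year_and_test"): "stay_order_results_specimen_year_and_test",
}


def choose_table(needs_specimen: bool, filters_on_test: bool) -> str:
    """Pick the table whose partitioning matches the query's filter.

    Assumption: a query that filters on a test benefits from the
    Year and Test scheme; any other query uses Year-month.
    """
    scheme = "year_and_test" if filters_on_test else "yearmonth"
    return TABLES[(needs_specimen, scheme)]


def partition_key(scheme: str, day: date, test_code: Optional[str] = None) -> Tuple:
    """Return an illustrative partition key for a result dated `day`."""
    if scheme == "yearmonth":
        return (day.year, day.month)
    if scheme == "year_and_test":
        if test_code is None:
            raise ValueError("test_code is required for the year_and_test scheme")
        return (day.year, test_code)
    raise ValueError(f"unknown scheme: {scheme}")
```

For example, `choose_table(needs_specimen=True, filters_on_test=False)` selects `stay_order_results_specimen_yearmonth`. The trade-off is storage: the same data is kept four times so that each query can read only the partitions it needs.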

12 / 18

Hadoop Prototype Demo

13 / 18

Oracle Solution

• Same tables as source DB
  – A big pre-joined table is not a good solution
• Techniques explored:
  – Partitioning
    • Partitions automatically created
  – Compression
    • Inefficient for joins
  – Clustering
  – Join multiple partitioned tables

14 / 18

Oracle Solution (continued)

• Avoid too many indexes on the big tables:
  – Take a lot of memory
  – Slow to create
  – May not be used if a query touches more than 5% of the rows
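The 5% figure above can be read as a rule of thumb on query selectivity. The sketch below encodes only that rule; the function name and the fixed threshold are illustrative assumptions. Oracle's optimizer is cost-based and does not apply a single hard-coded cutoff, so this is a way to reason about the slide, not a model of the database.

```python
def index_likely_used(matching_rows: int, total_rows: int,
                      threshold: float = 0.05) -> bool:
    """Rule-of-thumb check: an index helps only for selective queries.

    `threshold` defaults to the 5% figure quoted on the slide; it is a
    heuristic, not a value used by any real optimizer.
    """
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    if not 0 <= matching_rows <= total_rows:
        raise ValueError("matching_rows must be between 0 and total_rows")
    selectivity = matching_rows / total_rows
    return selectivity <= threshold
```

A lookup returning 1,000 rows out of 1,000,000 (0.1%) passes the check, while an analytical query touching 100,000 of those rows (10%) does not, which is why indexes add cost without helping the large ad-hoc queries in scope here.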

15 / 18

Comparison: Hadoop Solution

• Pros
  – Crunches massive amounts of data
  – Scalability
  – Free software
• Cons
  – Needs better UI and tune-ups
  – Maintenance cost
  – Requires ETL time to merge data into one table
  – BIG joins should be avoided

16 / 18

Comparison: Oracle Solution

• Pros
  – Just need to create a slave DB (just?)
  – Faster random lookups
  – Easier to find expertise
• Cons
  – Scales only up to a certain point
  – Synchronisation with master DB:
    • Rebuilding indexes would take hours

17 / 18

What are expensive queries?

• If possible, avoid these constructs on large result sets
  – SELECT DISTINCT
  – ORDER BY
  – GROUP BY
  – JOIN big table with another big table
    • JOIN big table with multiple small tables should be OK
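One way to see why joining a big table with several small tables is usually acceptable: the small tables can be held in memory as hash maps while the big table is streamed once. The sketch below is a plain-Python illustration of that idea. The column names (`test_id`, `unit_id`) and the lookup tables are hypothetical, and this is not how Impala or Oracle is implemented internally.

```python
from typing import Dict, Iterable, Iterator


def join_with_small_tables(big_rows: Iterable[dict],
                           tests: Dict[int, str],
                           units: Dict[int, str]) -> Iterator[dict]:
    """Inner-join a streamed big table against two in-memory lookups.

    The big table is read exactly once and never sorted or buffered;
    memory use depends only on the size of the small tables.
    """
    for row in big_rows:
        test_name = tests.get(row["test_id"])
        unit_name = units.get(row["unit_id"])
        if test_name is None or unit_name is None:
            continue  # inner join: drop rows without a match
        yield {**row, "test_name": test_name, "unit_name": unit_name}


# Hypothetical sample data for illustration only.
tests = {1: "glucose", 2: "sodium"}
units = {10: "mmol/L"}
results = [
    {"result_id": 100, "test_id": 1, "unit_id": 10, "value": 5.4},
    {"result_id": 101, "test_id": 3, "unit_id": 10, "value": 1.0},  # unknown test
]
joined = list(join_with_small_tables(results, tests, units))
```

Joining two big tables does not have this property: neither side fits in memory, so the engine has to sort, partition, or shuffle both, which is the same kind of work that makes SELECT DISTINCT, ORDER BY, and GROUP BY expensive on large result sets.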

18 / 18

Conclusion

• Recommendation to use a “classic” RDBMS
  – The database fits on a single node
  – Existing expertise in-house
  – Acceptable performance with appropriate tune-ups
  – Stop using MS Access
• Disadvantage: limited scalability
