HospETL - Delivering a Healthcare Analytics Platform


HospETL Healthcare Analytics Platform

Angela Razzell Insight Data Engineering Fellowship New York

My motivation

- Deliver an analytics platform in Amazon Redshift for randomly generated healthcare data.

- Take a deep dive into Amazon Redshift as a distributed data warehouse system.

- Redshift is widely used in business, and efficient analytics is important for supplying operational insight.

Technologies

Amazon Redshift

Total capacity: 640 GB

4 x dc1.large nodes

4 x $0.25 / hour
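The capacity and cost figures on this slide follow from per-node arithmetic. The sketch below is only an illustration: it assumes 160 GB of storage per dc1.large node (the figure given in the node specification later in the deck) and that the $0.25 hourly rate is charged per node. The variable names and the 30-day, always-on month are assumptions, not part of the talk.

```python
# Cluster sizing and cost arithmetic for a 4 x dc1.large cluster.
NODES = 4
STORAGE_PER_NODE_GB = 160      # dc1.large storage per node
PRICE_PER_NODE_HOUR = 0.25     # USD, rate quoted on the slide

total_capacity_gb = NODES * STORAGE_PER_NODE_GB
hourly_cost = NODES * PRICE_PER_NODE_HOUR
monthly_cost = hourly_cost * 24 * 30   # assumes a 30-day month, cluster always on

print(f"Total capacity: {total_capacity_gb} GB")
print(f"Hourly cost: ${hourly_cost:.2f}")
print(f"30-day cost: ${monthly_cost:.2f}")
```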

Schema

Reference tables: ref_doctor, ref_hospital, ref_eddisposal, ref_edcomplaint, ref_diagnosis

Table | Rows | Columns
patient | 10 million | 21
elective_bookings | 53+ million | 13
(ED) encounter | 40 million | 10
admissions & appts | 28 million | 12
patient_diagnosis | 40 million | 21

Columnar Compression

Type | How it works | Examples
Raw | No compression; use for large domains | Identifiers
Bytedict | Creates a dictionary of unique values; optimal for a limited number of unique values | Dept. code
LZO | Creates a dictionary of repeating character sequences; use for very long character strings | Comments
Runlength | Stores repeat value counts; use for consecutive repeating values | Doctor code
Text255 & Text32k | Creates a dictionary of unique words for repeating text | Address
Delta | Records the difference between values that follow each other; optimal for consecutive integer values | Gender code
Mostly | Stores values in a smaller standard storage size; optimal when the column's data type is larger than most values need | BIGINT columns
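To make two of these encodings concrete, here is a toy sketch of byte-dictionary and delta encoding in Python. It only illustrates the ideas and is not Redshift's on-disk format; the function names and the sample values are assumptions made for the example.

```python
def bytedict_encode(values):
    """Replace each value with a small index into a dictionary of unique values."""
    dictionary = []
    index_of = {}
    encoded = []
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        encoded.append(index_of[v])
    return dictionary, encoded


def delta_encode(values):
    """Keep the first value, then store only the difference from the previous one."""
    if not values:
        return []
    deltas = [values[0]]
    for prev, cur in zip(values, values[1:]):
        deltas.append(cur - prev)
    return deltas


def delta_decode(deltas):
    """Rebuild the original values by accumulating the differences."""
    values = []
    total = 0
    for d in deltas:
        total += d
        values.append(total)
    return values


dept_codes = ["CARD", "ONC", "CARD", "CARD", "ONC"]   # hypothetical department codes
dictionary, indexes = bytedict_encode(dept_codes)

ids = [1000, 1001, 1002, 1005, 1006]                   # hypothetical consecutive integers
deltas = delta_encode(ids)
```

Both schemes win when the encoded form is narrower than the raw value: a short index instead of a repeated string, or a small difference instead of a full-width integer.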


Columnar Compression: Runlength

Doctor code | Original size (bytes) | Compressed value | Compressed size (bytes)
C1 | 2 | {2,C1} | 3
C1 | 2 | | 0
C22 | 3 | {4,C22} | 4
C22 | 3 | | 0
C22 | 3 | | 0
C22 | 3 | | 0
C101 | 4 | {1,C101} | 5
Total | 20 | | 12
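The run-length example above can be reproduced with a few lines of Python. This is a sketch rather than Redshift's implementation: the size model (one byte per character of the code, plus one byte for each run's count) is an assumption inferred from the figures in the table, and the function name is illustrative.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (count, value) pairs."""
    return [(len(list(group)), value) for value, group in groupby(values)]


doctor_codes = ["C1", "C1", "C22", "C22", "C22", "C22", "C101"]
runs = run_length_encode(doctor_codes)

# Size model inferred from the table: one byte per character,
# plus one byte to store the count of each run.
original_bytes = sum(len(code) for code in doctor_codes)
compressed_bytes = sum(len(value) + 1 for _, value in runs)

print(runs)
print(original_bytes, compressed_bytes)
```

The saving depends entirely on repeats being adjacent, which is why the encoding suits columns whose values arrive in consecutive runs.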

No columnar compression or keys: 50%, 25%

Add columnar compression: 20%, 10%

Add columnar compression and keys: 15%

Challenges

- Creating the schema from scratch.

- Generating and loading large datasets.

- Learning Redshift and how to optimize it.
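As an illustration of the data generation challenge, the sketch below produces random patient-like rows and writes them as CSV. It is not the generator used for this project: the four columns, their value ranges, and the function names are hypothetical, and the real patient table has 21 columns. One common approach is to stage files like this in Amazon S3 and load them with Redshift's COPY command.

```python
import csv
import io
import random


def generate_patients(n_rows, seed=0):
    """Yield hypothetical patient rows: (patient_id, gender_code, birth_year, hospital_id)."""
    rng = random.Random(seed)   # seeded so repeated runs give the same data
    for patient_id in range(1, n_rows + 1):
        yield (
            patient_id,
            rng.choice(["F", "M", "U"]),
            rng.randint(1920, 2015),
            rng.randint(1, 50),
        )


def write_csv(rows, out):
    """Write rows as CSV to any file-like object and return the row count."""
    writer = csv.writer(out)
    count = 0
    for row in rows:
        writer.writerow(row)
        count += 1
    return count


buffer = io.StringIO()
written = write_csv(generate_patients(1000), buffer)
```

Using a generator keeps memory use flat, which matters once row counts reach the tens of millions.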

About me

- Worked in Data Migration for an IT system implementation project and in Business Intelligence at an NHS Trust.

- M.Eng in Engineering Mathematics from the University of Bristol.

- Interests include hiking and swimming.

Demo

- www.hospETL.website

Encryption

- AWS Key Management Service (KMS)
  - Automatically integrates with Redshift
  - $1 a month

- Hardware Security Module (HSM)
  - Need to use client and server certificates to configure a trusted connection to Amazon Redshift
  - Monthly fee plus $5000 initial cost

Redshift cluster

- Set up a Redshift cluster with 4 dc1.large nodes: four nodes with two slices each.

Node size | vCPU | ECU | RAM (GiB) | Slices per node | Storage per node | Node range | Total capacity
dc1.large | 2 | 7 | 15 | 2 | 160 GB SSD | 1-32 | 5.12 TB
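The slice count and the maximum capacity in the specification can be checked with a short calculation. This is a sketch under stated assumptions: the perfectly even spread of rows across slices is an idealization (the real split depends on the table's distribution style), and the helper name `rows_per_slice` is illustrative. The 10 million row figure is the patient table size from the schema slide.

```python
# Slice and capacity arithmetic from the dc1.large specification.
SLICES_PER_NODE = 2
STORAGE_PER_NODE_GB = 160
MAX_NODES = 32           # upper end of the 1-32 node range

nodes = 4
total_slices = nodes * SLICES_PER_NODE
max_capacity_gb = MAX_NODES * STORAGE_PER_NODE_GB   # 5120 GB, i.e. 5.12 TB


def rows_per_slice(total_rows, slices):
    """Rows each slice would hold if the table were spread perfectly evenly."""
    return total_rows // slices


patient_rows_per_slice = rows_per_slice(10_000_000, total_slices)
```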

Columnar Compression

Type | How it works | Use case | Examples
Raw | No compression | Large domain | Identifiers
Bytedict | Creates a dictionary of unique values | Limited unique values | Dept. code
LZO | Creates a dictionary of repeating character sequences | Very long character strings | Comments
Runlength | Stores repeated value counts | Consecutive repeating values | Dr code
Text255 & Text32k | Creates a dictionary of unique words for repeating text | Repeating words within a string | Address
Delta | Records the difference between values that follow each other | Consecutive integer values | Gender code
Mostly | Stores values in a smaller standard storage size | Column data type is larger than most values | BIGINT columns
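In Redshift these encodings are assigned per column with the ENCODE keyword in CREATE TABLE. The sketch below builds such a statement in Python. It is an illustration only: the table name, column names, and column types are hypothetical and chosen to mirror the Examples column, not taken from the HospETL schema. In particular, typing gender_code as SMALLINT is an assumption, made because delta encoding applies to numeric and date types.

```python
# Hypothetical columns chosen to mirror the "Examples" column of the table.
COLUMNS = [
    ("patient_id",  "BIGINT",        "raw"),        # large-domain identifier
    ("dept_code",   "CHAR(4)",       "bytedict"),   # few unique values
    ("comments",    "VARCHAR(2000)", "lzo"),        # long free text
    ("doctor_code", "VARCHAR(8)",    "runlength"),  # consecutive repeats
    ("address",     "VARCHAR(255)",  "text255"),    # recurring words
    ("gender_code", "SMALLINT",      "delta"),      # example given on the slide
    ("visit_count", "BIGINT",        "mostly32"),   # values usually fit in 32 bits
]


def build_create_table(table, columns, distkey, sortkey):
    """Assemble a Redshift CREATE TABLE statement with per-column ENCODE clauses."""
    column_lines = ",\n".join(
        f"    {name} {sql_type} ENCODE {encoding}"
        for name, sql_type, encoding in columns
    )
    return (
        f"CREATE TABLE {table} (\n{column_lines}\n)\n"
        f"DISTKEY({distkey})\nSORTKEY({sortkey});"
    )


ddl = build_create_table("encounter_demo", COLUMNS, "patient_id", "patient_id")
print(ddl)
```

The sort key column is left as raw here, since compressing the leading sort key is generally discouraged.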