HospETL - Delivering a Healthcare Analytics Platform
Uploaded by: angela-razzell
Category: Data & Analytics
Transcript of HospETL - Delivering a Healthcare Analytics Platform
My motivation
- Delivery of an analytics platform in Amazon Redshift for randomly generated healthcare data.
- Deep dive into Amazon Redshift as a distributed data warehouse system.
- Redshift is widely used in business, and efficient analytics is important for supplying operational insight.
Schema
Reference tables: ref_doctor, ref_hospital, ref_eddisposal, ref_edcomplaint, ref_diagnosis

Table | Rows | Columns
patient | 10 million | 21
elective_bookings | 53+ million | 13
(ED) encounter | 40 million | 10
admissions & appts | 28 million | 12
patient_diagnosis | 40 million | 21
Columnar Compression

Type | How it works | Example
Raw | N/A (no compression); use for large domains | Identifiers
Bytedict | Creates a dictionary of unique values; optimal for limited unique values | Dept. code
LZO | Creates a dictionary of repeating character sequences; use for very long character strings | Comments
Runlength | Stores repeat value counts; use for consecutive repeating values | Doctor code
Text255 & text32k | Creates a dictionary of unique words for repeating text | Address
Delta | Records the difference between values that follow each other; optimal for consecutive integer values | Gender code
Mostly | Stores values in a smaller standard storage size; optimal when the data type for a column is larger than most values | BIGINT columns
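To make two of the rows above concrete, here is a minimal Python sketch of dictionary (Bytedict-style) and delta encoding. It only illustrates the ideas described on the slide and is not Redshift's actual on-disk format; the function names and sample values are hypothetical.

```python
def bytedict_encode(values):
    """Replace each value with a small index into a dictionary of unique values."""
    dictionary = []   # unique values in first-seen order
    index = {}        # value -> position in dictionary
    codes = []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def delta_encode(values):
    """Keep the first value, then store only the difference to the previous value."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Rebuild the original values by accumulating the differences."""
    values = []
    total = 0
    for delta in deltas:
        total += delta
        values.append(total)
    return values


# Hypothetical sample data: a few department codes and consecutive integer ids.
dept_dictionary, dept_codes = bytedict_encode(["ED", "ICU", "ED", "ED"])
id_deltas = delta_encode([1001, 1002, 1003, 1005])
print(dept_dictionary, dept_codes)
print(id_deltas, delta_decode(id_deltas))
```

The dictionary approach pays off when a column has few distinct values, and the delta approach when neighbouring values are close together, which matches the use cases listed in the table.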
Columnar Compression: Runlength

Doctor code | Original size (bytes) | Compressed value | Compressed size (bytes)
C1 | 2 | {2,C1} | 3
C1 | 2 | | 0
C22 | 3 | {4,C22} | 4
C22 | 3 | | 0
C22 | 3 | | 0
C22 | 3 | | 0
C101 | 4 | {1,C101} | 5
Total | 20 | | 12
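The run-length example can be reproduced with a short Python sketch. It assumes, as the slide's numbers imply, one byte per character of the doctor code plus one byte for the repeat count; the function names are illustrative and this is not Redshift's internal implementation.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (count, value) pairs."""
    return [(len(list(group)), value) for value, group in groupby(values)]


def raw_size(values):
    # Assumption: one byte per character of each stored value.
    return sum(len(value) for value in values)


def encoded_size(runs):
    # Assumption: one byte for the count plus one byte per character of the value.
    return sum(1 + len(value) for _, value in runs)


doctor_codes = ["C1", "C1", "C22", "C22", "C22", "C22", "C101"]
runs = run_length_encode(doctor_codes)
print(runs)                                        # (count, value) pairs
print(raw_size(doctor_codes), encoded_size(runs))  # sizes in bytes
```

Under these assumptions the seven codes take 20 bytes raw and 12 bytes encoded, matching the totals in the table. The saving grows with the length of each run, which is why the encoding suits sorted or consecutively repeating values.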
Challenges
- Creating schema from scratch.
- Generating and loading large datasets.
- Learning Redshift and how to optimize it.
About me
- Worked in data migration for an IT system implementation project, and in business intelligence, at an NHS Trust.
- M.Eng in Engineering Mathematics from the University of Bristol.
- Interests include hiking and swimming.
Encryption
- AWS Key Management Service (KMS)
  - Automatically integrates with Redshift
  - $1 a month
- Hardware Security Module (HSM)
  - Need to use client and server certificates to configure a trusted connection to Amazon Redshift
  - Monthly fee plus $5000 initial cost
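As a rough illustration of the KMS option, the sketch below builds the arguments for boto3's Redshift `create_cluster` call with encryption switched on. The helper names, cluster identifier, key id and credentials are hypothetical placeholders, and the deck does not say how the cluster was actually created; the node type and count are taken from the cluster slide.

```python
def encrypted_cluster_params(cluster_id, kms_key_id, master_user, master_password):
    """Build keyword arguments for a KMS-encrypted Redshift cluster (illustrative only)."""
    return {
        "ClusterIdentifier": cluster_id,
        "ClusterType": "multi-node",
        "NodeType": "dc1.large",       # node type named in the deck; may no longer be offered
        "NumberOfNodes": 4,
        "MasterUsername": master_user,
        "MasterUserPassword": master_password,
        "Encrypted": True,             # turn on encryption at rest
        "KmsKeyId": kms_key_id,        # KMS key used to protect the cluster's keys
    }


def create_encrypted_cluster(params):
    """Send the request; needs boto3 installed and AWS credentials configured."""
    import boto3  # imported lazily so the parameter helper works without boto3

    client = boto3.client("redshift")
    return client.create_cluster(**params)


# Hypothetical placeholder values; nothing is sent to AWS here.
params = encrypted_cluster_params(
    "hospetl-demo", "hypothetical-kms-key-id", "admin_user", "Placeholder-Passw0rd"
)
print(params["Encrypted"], params["KmsKeyId"])
```

The HSM route would instead pass HSM certificate and configuration identifiers, which is where the client and server certificates mentioned above come in.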
Redshift cluster
- Set up a Redshift cluster with 4 dc1.large nodes: four nodes with two slices each (eight slices in total).

Node size | vCPU | ECU | RAM (GiB) | Slices per node | Storage per node | Node range | Total capacity
dc1.large | 2 | 7 | 15 | 2 | 160 GB SSD | 1-32 | 5.12 TB
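The figures for this cluster follow from simple arithmetic, sketched below in Python. The rows-per-slice estimate assumes perfectly even distribution across slices, which is an assumption for illustration and not a figure from the deck.

```python
NODES = 4                  # cluster size used in the project
SLICES_PER_NODE = 2        # dc1.large
STORAGE_PER_NODE_GB = 160  # dc1.large SSD storage
MAX_NODES = 32             # upper end of the dc1.large node range

total_slices = NODES * SLICES_PER_NODE                     # parallel units of work
cluster_storage_gb = NODES * STORAGE_PER_NODE_GB           # storage of this 4-node cluster
max_capacity_tb = MAX_NODES * STORAGE_PER_NODE_GB / 1000   # capacity at 32 nodes


def rows_per_slice(total_rows, slices):
    """Approximate rows per slice, assuming rows are spread perfectly evenly."""
    return total_rows // slices


print(total_slices, cluster_storage_gb, max_capacity_tb)
print(rows_per_slice(10_000_000, total_slices))  # the 10-million-row patient table
```

The 5.12 TB total capacity in the table is therefore the 32-node maximum, while the four-node cluster used here holds 640 GB.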
Columnar Compression

Type | How it works | Use case | Example
Raw | N/A (no compression) | Large domain | Identifiers
Bytedict | Creates a dictionary of unique values | Limited unique values | Dept. code
LZO | Creates a dictionary of repeating character sequences | Very long character strings | Comments
Runlength | Stores repeated value counts | Consecutive repeating values | Dr code
Text255 & text32k | Creates a dictionary of unique words for repeating text | Repeating words within a string | Address
Delta | Records the difference between values that follow each other | Consecutive integer values | Gender code
Mostly | Stores values in a smaller standard storage size | Column data type is larger than most values | BIGINT columns
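In Redshift these encodings are chosen per column with the `ENCODE` keyword in `CREATE TABLE`. The Python sketch below assembles such a statement as a string; the table name, column names and types are hypothetical (the deck does not list the real columns), and the allowed set covers only the encodings named on this slide and their size variants.

```python
# Encodings from the table above, with the size variants of Delta and Mostly.
ENCODINGS = {
    "raw", "bytedict", "lzo", "runlength", "text255", "text32k",
    "delta", "delta32k", "mostly8", "mostly16", "mostly32",
}


def create_table_ddl(table, columns):
    """Build a CREATE TABLE statement from (name, sql_type, encoding) triples."""
    lines = []
    for name, sql_type, encoding in columns:
        if encoding not in ENCODINGS:
            raise ValueError(f"unknown encoding: {encoding}")
        lines.append(f"    {name} {sql_type} ENCODE {encoding}")
    return f"CREATE TABLE {table} (\n" + ",\n".join(lines) + "\n);"


# Hypothetical columns chosen to mirror the "Example" column of the table.
ddl = create_table_ddl(
    "encounter_demo",
    [
        ("encounter_id", "BIGINT", "raw"),
        ("dept_code", "CHAR(4)", "bytedict"),
        ("doctor_code", "VARCHAR(8)", "runlength"),
        ("address", "VARCHAR(255)", "text255"),
        ("comments", "VARCHAR(1000)", "lzo"),
        ("visit_count", "BIGINT", "mostly16"),
    ],
)
print(ddl)
```

Generating the DDL from a list like this makes it easy to try a different encoding for one column and reload the table to compare storage.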