Our Dot on the Horizon
- Central point for delivering healthcare processes data for medical research
- Integrate various sources
- Historize, trace and pseudonymize all data used
Our Journey
- Learning and adapting to Data Vault: not everybody is a modeler (Shu Ha Ri)
- Script, code, build, try, test, throw away and start again
- Testing overrated?
- Architecture improvements: performance issues SAS/Microsoft, performance issues loading scripts, automate DV load
- From Chaos to SCRUM
Our Obstacles
- Registration for healthcare process vs. usability for research
- Questionnaires: sources or generic models?
- Performance: Do we really need all complete texts? Do we really need 20 years of lab results?
- The usual: conflicting interests, politics etc.
Our preliminary results
- 2013: selection of 5 major Studies as starting showcases proved difficult
- 2014: had to choose 5 new showcases from 25 applicants
- Started as Research Data Platform, now growing towards Enterprise Data Platform (including Education and BI)
- Architecture now stable
Lessons learned
• Automate when possible
• Invest in a team of skilled pioneers
• Models rule everything
• Adapt agility, teach agility
Presenter: Sander Robijns
Date: June 5, 2014
Company: Estrenuo BVBA
eMail: [email protected]
Twitter: @srobijns
The Issue
No enterprise-wide business keys
The Current Approach
Using recursive links on hubs to identify the same-as relationship
The Struggle
Getting the facts reported under a single business key
The Future Approach
Master Data Management will take away some of the struggles
The Lesson Learned
Get the enterprise-wide business keys in place first using data governance
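A minimal sketch of how same-as links can be collapsed so that facts report under a single business key. It is an illustration only, not the presenter's implementation: the key values, the `resolve_master_keys` helper and the rule "smallest key wins" are all assumptions made for the example.

```python
def resolve_master_keys(same_as_links):
    """Map every business key to one representative key (union-find)."""
    parent = {}

    def find(key):
        parent.setdefault(key, key)
        while parent[key] != key:
            parent[key] = parent[parent[key]]  # path halving keeps chains short
            key = parent[key]
        return key

    for left, right in same_as_links:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            # Assumption: the smallest key becomes the representative,
            # which keeps the outcome deterministic.
            low, high = sorted((root_left, root_right))
            parent[high] = low

    return {key: find(key) for key in parent}


# Hypothetical rows of a recursive (same-as) link on one hub.
same_as_links = [
    ("CRM-17", "ERP-0042"),
    ("ERP-0042", "LAB-9"),
    ("CRM-55", "ERP-0100"),
]
master = resolve_master_keys(same_as_links)

# Hypothetical facts registered under different source keys.
facts = [("CRM-17", 100), ("LAB-9", 25), ("CRM-55", 7)]
totals = {}
for key, amount in facts:
    reporting_key = master.get(key, key)
    totals[reporting_key] = totals.get(reporting_key, 0) + amount

print(totals)
```

The same-as chain CRM-17 / ERP-0042 / LAB-9 ends up under one reporting key, which is the work that enterprise-wide business keys or Master Data Management would make unnecessary.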
Presenter: Kasper de Graaf
Date: June 5, 2014
Company: Occurro
eMail: [email protected]
Twitter: kdgraaf
Groups of Links: context at hospital
Imagine the following:
• An operation (surgery) is executed by a group of people (first surgeon, second surgeon, assistant, anesthesiologist, etc.)
• An operation is planned a couple of weeks in advance
• Whenever the planning changes in the source the complete group is sent to the EDW
Group of Links: the Data
{Time}  operation_no  employee_no  role
T=1     19354         John         OP1
        19354         Jane         OP2
        19354         Chris        ANA
T=2     19354         John         OP1
        19354         Mary         ANA
T=3     19354         Jane         OP1
        19354         Chris        ANA
Please note: the actual operation with operation_no 19354 is executed by Jane (OP1) and Chris (ANA)
Groups of Links: the Problem
Standard Data Vault loading routines cannot handle this situation:
operation_no  employee_no  role  load_dts
19354         John         OP1   T=1
19354         Jane         OP2   T=1
19354         Chris        ANA   T=1
19354         Mary         ANA   T=2
19354         Jane         OP1   T=3
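The table above can be reproduced with a small sketch of an insert-only link load. This is an assumed simplification of a "standard" routine (a row is inserted the first time its key combination is seen, nothing is ever closed); the function name and the in-memory structures are hypothetical.

```python
# Deliveries from the slide: every change sends the complete group again.
deliveries = {
    1: [("19354", "John", "OP1"), ("19354", "Jane", "OP2"), ("19354", "Chris", "ANA")],
    2: [("19354", "John", "OP1"), ("19354", "Mary", "ANA")],
    3: [("19354", "Jane", "OP1"), ("19354", "Chris", "ANA")],
}


def load_link_insert_only(deliveries):
    """Insert-only link load: keep the first load_dts per key combination."""
    link = {}  # (operation_no, employee_no, role) -> load_dts
    for load_dts in sorted(deliveries):
        for row in deliveries[load_dts]:
            link.setdefault(row, load_dts)
    return link


link = load_link_insert_only(deliveries)
for (operation_no, employee_no, role), load_dts in link.items():
    print(operation_no, employee_no, role, f"T={load_dts}")

# The link holds five rows, but nothing in it says which of them
# form the group that actually performed the operation.
actual_group = set(deliveries[3])
print(set(link) == actual_group)
```

The link ends up with the five rows shown on the slide, while the actual group (Jane as OP1, Chris as ANA) is only two of them.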
Groups of Links: the Problem
Using end-dating of a link (preferably with a validity satellite) cannot handle this problem either:
operation_no  employee_no  role  load_dts  Active?
19354         John         OP1   T=1       No (T=3)
19354         Jane         OP2   T=1       Yes
19354         Chris        ANA   T=1       No (T=3)
19354         Mary         ANA   T=2       Yes
19354         Jane         OP1   T=3       Yes
BK of link used: operation_no + role
Groups of Links: our solution
1. Add a validity satellite to the link (for end-dating)
2. Tell the metadata of the automation tool that this is a group validity satellite with BK=operation_no
3. Whenever an existing operation_no is present in the staging layer, set all current links to Active=No
4. Process as usual
• Remark: because the same row can come back (e.g. John/OP1), it will be set to Active=No and Active=Yes at the same time. As a result, there can be no unique index on the BK of the validity satellite, and some cleaning up is required after loading.
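The four steps and the clean-up from the remark could be sketched as follows. This is not the code from the framework mentioned on the next slide; the row layout (`link`, `load_dts`, `end_dts`, `active`) and the function name are assumptions chosen to keep the example self-contained.

```python
def load_group_validity(validity, staging, load_dts):
    """Return the validity satellite after one complete-group delivery.

    validity: list of dicts with keys link, load_dts, end_dts, active
    staging:  list of (operation_no, employee_no, role) tuples
    """
    staged_operations = {operation_no for operation_no, _, _ in staging}

    # Step 3: set all current links of every staged operation_no to Active=No.
    rows = []
    for row in validity:
        if row["active"] and row["link"][0] in staged_operations:
            row = {**row, "active": False, "end_dts": load_dts}
        rows.append(row)

    # Step 4: process as usual, each staged link gets a new Active=Yes row.
    rows += [
        {"link": link, "load_dts": load_dts, "end_dts": None, "active": True}
        for link in staging
    ]

    # Clean-up: a link that was closed and reopened in the same load
    # (e.g. John/OP1 at T=2) did not really change.
    closed = {row["link"] for row in rows if row["end_dts"] == load_dts}
    opened = {row["link"] for row in rows if row["load_dts"] == load_dts}
    unchanged = closed & opened
    cleaned = []
    for row in rows:
        if row["link"] in unchanged:
            if row["load_dts"] == load_dts:
                continue  # drop the duplicate row opened in this load
            if row["end_dts"] == load_dts:
                row = {**row, "active": True, "end_dts": None}  # reopen original
        cleaned.append(row)
    return cleaned


deliveries = {
    1: [("19354", "John", "OP1"), ("19354", "Jane", "OP2"), ("19354", "Chris", "ANA")],
    2: [("19354", "John", "OP1"), ("19354", "Mary", "ANA")],
    3: [("19354", "Jane", "OP1"), ("19354", "Chris", "ANA")],
}

validity = []
for load_dts in sorted(deliveries):
    validity = load_group_validity(validity, deliveries[load_dts], load_dts)

active = {row["link"] for row in validity if row["active"]}
print(sorted(active))
```

After the three deliveries only Jane (OP1) and Chris (ANA) remain active, which matches the group that actually performed operation 19354. Note that between step 4 and the clean-up the same link can appear twice for one load, which is why the slide rules out a unique index on the BK of the validity satellite.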
Groups of Links: special thanks to …
St. Antonius Hospital (for having the problem)
Edwin Weber (for coding the solution)
Get your copy of the solution: http://sourceforge.net/projects/pdidatavaultfw/
Presenter: Juan-José van der Linden
Date: June 5, 2014
Note: DV, MPP
Company: QOSQO
eMail: [email protected]
Twitter: @delostilos
SMP => MPP => AMPP
SMP: Symmetric Multiprocessing
MPP: Massively Parallel Processing
AMPP: Asymmetric MPP (SMP + MPP)
Primary key => distribution key
hub -< satellite join
- data redistribution
- join local in parallel

Hub:
BK           SID
Ensemble     1
Dimensional  2

Satellite:
SID  LDTS        INFO
1    2001-01-01  My first DV
1    2014-06-05  DV Masters
2    1997-08-02  DM manifesto
Node 1 Node 2
Hub SID => distribution key
hub -< satellite join
- join local in parallel

Hub:
BK           SID
Ensemble     1
Dimensional  2

Satellite:
SID  LDTS        INFO
1    2001-01-01  First DV
1    2014-06-05  DV Masters
2    1997-08-02  DM manifesto
Node 1 Node 2
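A rough sketch of why the choice of distribution key decides whether the hub-satellite join needs redistribution. It assumes hash distribution over two nodes and uses `zlib.crc32` as a stand-in for whatever hash a real MPP database applies; the `node_for` and `rows_to_redistribute` helpers are hypothetical.

```python
import zlib

NODES = 2  # the slides show two nodes


def node_for(*key_parts):
    """Deterministic hash distribution: map a distribution key to a node."""
    text = "|".join(str(part) for part in key_parts)
    return zlib.crc32(text.encode("utf-8")) % NODES


# Data from the slides: hub rows are (BK, SID), satellite rows are (SID, LDTS, INFO).
hub = [("Ensemble", 1), ("Dimensional", 2)]
satellite = [
    (1, "2001-01-01", "My first DV"),
    (1, "2014-06-05", "DV Masters"),
    (2, "1997-08-02", "DM manifesto"),
]


def rows_to_redistribute(satellite_key):
    """Count satellite rows that are not on the node holding their hub row."""
    hub_node = {sid: node_for(sid) for _, sid in hub}
    return sum(
        1 for row in satellite if node_for(*satellite_key(row)) != hub_node[row[0]]
    )


# Satellite distributed on its full primary key (SID + LDTS): rows of one hub
# may land on different nodes, so the join may need data movement.
by_primary_key = rows_to_redistribute(lambda row: (row[0], row[1]))

# Satellite distributed on the hub SID only: every row sits next to its hub row.
by_hub_sid = rows_to_redistribute(lambda row: (row[0],))

print("rows to move, primary key as distribution key:", by_primary_key)
print("rows to move, hub SID as distribution key:", by_hub_sid)
```

With the hub SID as distribution key the count is always zero, because hub and satellite rows with the same SID hash to the same node. With the full primary key the count depends on the hash values.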
Link SID => distribution key
Default L_SID, 1:N & N:M
- data redistribution
- join local in parallel

Link:
H_MID  H_SID  L_SID
1      A      1
1      B      2

Link satellite:
L_SID  LDTS        LDTS_END    CURRENT
1      2001-01-01  2006-01-01  N
1      2014-06-05  9999-12-31  Y
2      2006-01-01  2014-06-05  N

1:N => H_MID on link satellite
- join local in parallel
H_MID is the ensemble identifier!

Link:
H_MID  H_SID  L_SID
1      A      1
1      B      2

Link satellite:
L_SID  H_MID  H_SID  LDTS        LDTS_END
1      1      A      2001-01-01  2006-01-01
1      1      B      2014-06-05  9999-12-31
2      1      A      2006-01-01  2014-06-05
Node 1 Node 2
Use the ensemble identifier if possible!
[Diagram: hub (H_SID) and satellite (H_SID, LDTS, INFO); link (L_SID?, H_SID, H_MID) and link satellite (H_SID?, L_SID?, LDTS, INFO)]
Distributing data efficiently to ensure good performance in an MPP database.
- If the distribution is uneven, one node may become a bottleneck for the whole execution
Try to minimize data movement between nodes
- Data redistribution may occur when joining tables
Ensemble
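Uneven distribution can be made visible with a small sketch that counts rows per node for two candidate distribution keys. The row counts are invented for illustration (900 of 1000 link satellite rows belonging to one ensemble), four nodes are assumed, and `zlib.crc32` again stands in for the database's hash function.

```python
from collections import Counter
import zlib


def node_for(key, nodes):
    """Map a distribution key value to a node number."""
    return zlib.crc32(str(key).encode("utf-8")) % nodes


def rows_per_node(keys, nodes):
    """Number of rows each node receives for the given key values."""
    counts = Counter(node_for(key, nodes) for key in keys)
    return [counts[node] for node in range(nodes)]


def skew_ratio(per_node):
    """Busiest node relative to a perfectly even spread (1.0 means no skew)."""
    return max(per_node) / (sum(per_node) / len(per_node))


NODES = 4  # assumption for this example

# Hypothetical link satellite: 1000 rows, 900 of them in ensemble H_MID = 1.
rows = [{"l_sid": i, "h_mid": 1 if i < 900 else i} for i in range(1000)]

on_h_mid = rows_per_node([row["h_mid"] for row in rows], NODES)
on_l_sid = rows_per_node([row["l_sid"] for row in rows], NODES)

print("H_MID:", on_h_mid, "skew", round(skew_ratio(on_h_mid), 2))
print("L_SID:", on_l_sid, "skew", round(skew_ratio(on_l_sid), 2))
```

In this made-up data set the ensemble identifier keeps joins local but puts at least 900 rows on one node, so the trade-off between co-location and even distribution has to be checked against the real key frequencies.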
Presenter: Remco Broekmans
Date: June 5, 2014
Note: Example for ReConnect
Company: Coarem
eMail: [email protected]
Twitter: RemcoBroekmans
SAP #Hana is a column store #database which brings #efficiency in storage and access - #in-memory.
SAP #Hana seems to benefit from its technical #architecture when using 1 broad Satellite per #Hub - #benefit: no need for #PIT, fewer tables
Splitting #Sat’s in #rate-of-change as efficient in storage as column store
#multiple Sat’s to prefer if data coming from multiple sources (#write efficiency)
#referential join will only perform the join if data from the joined tables is used
create 1 #PIT per #Hub (not as #SQL view)
#Lesson: DV is #efficient way of storing data
#Lesson: #SQL views can’t be read by Hana Studio
#Lesson: #Hana is still evolving