Untangling Healthcare With Spark and Dataflow - PhillyETE 2016
![Page 1: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/1.jpg)
Untangling Healthcare with Spark and Dataflow
Ryan Brush
@ryanbrush
![Page 2: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/2.jpg)
![Page 3: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/3.jpg)
Actual depiction of healthcare data
![Page 4: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/4.jpg)
One out of six dollars
![Page 5: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/5.jpg)
![Page 6: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/6.jpg)
Three Acts
(Mostly)
![Page 7: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/7.jpg)
Act I: Making sense of the pieces
![Page 8: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/8.jpg)
answer = askQuestion (allHealthData)
![Page 9: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/9.jpg)
How do we make sense of this?
55 million patients
3 petabytes of data
8,000 CPT codes
72,000 ICD-10 codes
63,000 SNOMED disease codes
Incomplete, conflicting data sets
No common person identifier
Standard data models and codes interpreted inconsistently
Different meanings in different contexts
![Page 10: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/10.jpg)
Claims
Medical Records
Pharma
Operational
Link Records
Semantic Integration
User-entered annotations
Condition Registries
Quality Measures
Analytics
Rules
![Page 11: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/11.jpg)
Link Records
Claims
Medical Records
Pharma
Operational
Semantic Integration
User-entered annotations
Condition Registries
Quality Measures
Analytics
Rules
![Page 12: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/12.jpg)
Medical Records
Document
Sections
Notes
Addenda
Order
. . .
Normalize Structure
Clean Data
![Page 13: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/13.jpg)
Link Records
Claims
Medical Records
Pharma
Operational
Semantic Integration
User-entered annotations
Condition Registries
Quality Measures
Analytics
Rules
![Page 14: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/14.jpg)
answer = askQuestion (allHealthData)
![Page 15: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/15.jpg)
linkedData = link(clean(pharma), clean(claims), clean(records))
normalized = normalize(linkedData)
answer = askQuestion(normalized)
![Page 16: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/16.jpg)
Rein in variance
http://fortune.com/2014/07/24/can-big-data-cure-cancer/
![Page 17: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/17.jpg)
Rein in variance
oral vs. axillary temperature
![Page 18: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/18.jpg)
Join all the things!
![Page 19: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/19.jpg)
JavaRDD<ExternalRecord> externalRecords = ...
![Page 20: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/20.jpg)
JavaRDD<ExternalRecord> externalRecords = ...
JavaPairRDD<ExternalRecord, ExternalRecord> cartesian =
    externalRecords.cartesian(externalRecords);
![Page 21: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/21.jpg)
JavaRDD<Similarity> matches = cartesian.map(t -> {
    ExternalRecord left = t._1();
    ExternalRecord right = t._2();
    double score = recordSimilarity(left, right);
    return Similarity.newBuilder()
        .setLeftRecordId(left.getExternalId())
        .setRightRecordId(right.getExternalId())
        .setScore(score)
        .build();
})
.filter(s -> s.getScore() > THRESHOLD);
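The slide leaves `recordSimilarity` and `THRESHOLD` undefined. As a rough illustration only, here is a plain-Java sketch (no Spark) of the same compare-and-filter step. The `Rec` and `Match` classes, the name and birth date fields, the scoring formula, and the threshold value are all assumptions made for this example, not the talk's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class PairwiseMatchSketch {
    // Hypothetical stand-in for ExternalRecord; the real fields are not shown in the talk.
    static final class Rec {
        final String externalId;
        final String name;
        final String birthDate;
        Rec(String externalId, String name, String birthDate) {
            this.externalId = externalId;
            this.name = name;
            this.birthDate = birthDate;
        }
    }

    // Hypothetical stand-in for the Similarity message built on the slide.
    static final class Match {
        final String leftId;
        final String rightId;
        final double score;
        Match(String leftId, String rightId, double score) {
            this.leftId = leftId;
            this.rightId = rightId;
            this.score = score;
        }
    }

    static final double THRESHOLD = 0.8; // assumed value

    // Toy score in [0, 1]: half from name-token overlap (Jaccard), half from equal birth dates.
    static double recordSimilarity(Rec left, Rec right) {
        Set<String> a = tokens(left.name);
        Set<String> b = tokens(right.name);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        double nameScore = union.isEmpty() ? 0.0 : (double) common.size() / union.size();
        double dateScore = left.birthDate.equals(right.birthDate) ? 1.0 : 0.0;
        return 0.5 * nameScore + 0.5 * dateScore;
    }

    private static Set<String> tokens(String name) {
        return new HashSet<>(Arrays.asList(name.toLowerCase().trim().split("\\s+")));
    }

    // Scores each unordered pair once and keeps pairs above the threshold.
    static List<Match> matches(List<Rec> records) {
        List<Match> out = new ArrayList<>();
        for (int i = 0; i < records.size(); i++) {
            for (int j = i + 1; j < records.size(); j++) {
                Rec left = records.get(i);
                Rec right = records.get(j);
                double score = recordSimilarity(left, right);
                if (score > THRESHOLD) {
                    out.add(new Match(left.externalId, right.externalId, score));
                }
            }
        }
        return out;
    }
}
```

Unlike `cartesian`, this loop visits each unordered pair once and skips self-comparisons; either way the work grows quadratically with the number of records.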
![Page 22: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/22.jpg)
| | Person 1 | Person 2 | Person 3 |
|---|---|---|---|
| Person 1 | 1 | 0.98 | 0.12 |
| Person 2 | | 1 | 0.55 |
| Person 3 | | | 1 |
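The talk does not say how pairwise scores become person links. One common option, swapped in here purely as an illustration, is to treat every pair above a threshold as an edge and take connected components with union-find. The class name, the 0.9 threshold used in the usage note, and the choice of the smallest ID as the cluster's universal ID are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

final class LinkClustersSketch {
    private final Map<String, String> parent = new HashMap<>();

    // Returns the representative ID of the cluster containing id.
    String find(String id) {
        parent.putIfAbsent(id, id);
        String p = parent.get(id);
        if (!p.equals(id)) {
            p = find(p);
            parent.put(id, p); // path compression
        }
        return p;
    }

    // Merges the clusters of two record IDs judged to be the same person.
    void union(String a, String b) {
        String ra = find(a);
        String rb = find(b);
        if (!ra.equals(rb)) {
            // Keep the lexicographically smaller root so results are deterministic.
            if (ra.compareTo(rb) < 0) {
                parent.put(rb, ra);
            } else {
                parent.put(ra, rb);
            }
        }
    }

    // Builds record ID -> cluster ID; scores[i][j] is only read for i < j.
    static Map<String, String> idToLink(String[] ids, double[][] scores, double threshold) {
        LinkClustersSketch uf = new LinkClustersSketch();
        for (int i = 0; i < ids.length; i++) {
            uf.find(ids[i]);
            for (int j = i + 1; j < ids.length; j++) {
                if (scores[i][j] > threshold) {
                    uf.union(ids[i], ids[j]);
                }
            }
        }
        Map<String, String> links = new HashMap<>();
        for (String id : ids) {
            links.put(id, uf.find(id));
        }
        return links;
    }
}
```

With the three scores from the slide (0.98, 0.12, 0.55) and an assumed threshold of 0.9, Person 1 and Person 2 end up in one cluster and Person 3 stays alone.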
![Page 23: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/23.jpg)
Reassembling Humpty Dumpty in Code
![Page 24: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/24.jpg)
JavaPairRDD<String,String> idToLink = . . .
JavaPairRDD<String,ExternalRecord> idToRecord = . . .
![Page 25: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/25.jpg)
JavaPairRDD<String,String> idToLink = . . .
JavaPairRDD<String,ExternalRecord> idToRecord = . . .
JavaRDD<Person> people = idToRecord.join(idToLink)
    .mapToPair(
        // Tuple of universal ID and external record.
        item -> new Tuple2<>(item._2._2, item._2._1))
    .groupByKey()
    .map(SparkExampleTest::mergeExternalRecords);
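To make the data flow of that join concrete, here is the same shape on in-memory maps with `java.util.stream`. This is a sketch under stated assumptions: records are plain strings, and collecting them into a list stands in for `mergeExternalRecords`, which the slides do not show.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class ReassembleSketch {
    // idToLink: external record ID -> universal person ID.
    // idToRecord: external record ID -> record payload (a String here, by assumption).
    // Returns universal person ID -> payloads of every record linked to that person.
    static Map<String, List<String>> reassemble(Map<String, String> idToLink,
                                                Map<String, String> idToRecord) {
        return idToRecord.entrySet().stream()
            // Inner join: drop records that have no link entry.
            .filter(e -> idToLink.containsKey(e.getKey()))
            // Re-key by universal ID and group, like mapToPair followed by groupByKey.
            .collect(Collectors.groupingBy(
                e -> idToLink.get(e.getKey()),
                TreeMap::new,
                Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
    }
}
```

A record with no link is dropped here, which mirrors the inner-join behavior of `join`; whether unlinked records should instead become single-record persons is a design decision the slide does not address.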
![Page 26: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/26.jpg)
SNOMED:388431003
HCPCS:J1815
ICD10:E13.9
CPT:3046F
SNOMED:43396009, value: 9.4
![Page 27: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/27.jpg)
SNOMED:388431003
HCPCS:J1815
InsulinMed InsulinMed
ICD10:E13.9
DiabetesCondition
Diabetic
CPT:3046F
SNOMED:43396009, value: 9.4
Retaking Rules for Developers, Strange Loop 2014
![Page 28: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/28.jpg)
![Page 29: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/29.jpg)
select * from outcomes where…
![Page 30: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/30.jpg)
![Page 31: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/31.jpg)
Start with the questions you want to ask and transform the data to fit.
![Page 32: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/32.jpg)
But what questions are we asking?
![Page 33: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/33.jpg)
![Page 34: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/34.jpg)
“The problem is we don’t understand the problem.”
-Paul MacCready
![Page 35: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/35.jpg)
cleanData = clean(allHealthData)
projected = projectForPurpose(cleanData)
answer = askQuestion(projected)
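The three steps compose like ordinary functions. A small sketch with `java.util.function.Function` shows the idea; the concrete cleaning rule, the ICD-10 projection, and the counting question are all invented for illustration and are not from the talk.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

final class ProjectionSketch {
    // Hypothetical cleaning step: trim whitespace and drop blank entries.
    static final Function<List<String>, List<String>> CLEAN = raw -> raw.stream()
        .map(String::trim)
        .filter(s -> !s.isEmpty())
        .collect(Collectors.toList());

    // Hypothetical projection for one purpose: keep only ICD-10 coded entries.
    static final Function<List<String>, List<String>> PROJECT_FOR_PURPOSE = clean -> clean.stream()
        .filter(s -> s.startsWith("ICD10:"))
        .collect(Collectors.toList());

    // Hypothetical question: how many entries survive the projection?
    static final Function<List<String>, Integer> ASK_QUESTION = List::size;

    // answer = askQuestion(projectForPurpose(clean(allHealthData)))
    static int answer(List<String> allHealthData) {
        return CLEAN.andThen(PROJECT_FOR_PURPOSE).andThen(ASK_QUESTION).apply(allHealthData);
    }
}
```

A different question would swap in a different projection while reusing the same cleaning step.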
![Page 36: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/36.jpg)
Sepsis
![Page 37: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/37.jpg)
![Page 38: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/38.jpg)
Early Lessons
![Page 39: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/39.jpg)
Make no assumptions about your data
![Page 40: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/40.jpg)
Your errors are a signal
![Page 41: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/41.jpg)
Data sources have a signature
![Page 42: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/42.jpg)
But the latency!
And the complexity!
![Page 43: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/43.jpg)
Act II: Putting health care together…fast
![Page 44: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/44.jpg)
JavaPairRDD<String,String> idToLink = . . .
JavaPairRDD<String,ExternalRecord> idToRecord = . . .
JavaRDD<Person> people = idToRecord.join(idToLink)
    .mapToPair(
        // Tuple of UUID and external record.
        item -> new Tuple2<>(item._2._2, item._2._1))
    .groupByKey()
    .map(SparkExampleTest::mergeExternalRecords);
![Page 45: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/45.jpg)
JavaPairDStream<String,String> idToLink = . . .
JavaPairDStream<String,ExternalRecord> idToRecord = . . .
idToLink.join(idToRecord);
![Page 46: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/46.jpg)
JavaPairDStream<String,String> idToLink = . . .
JavaPairDStream<String,ExternalRecord> idToRecord = . . .
StateSpec personAndLinkStateSpec = StateSpec.function(new BuildPersonState());
JavaDStream<Tuple2<List<ExternalRecord>,List<String>>> recordsWithLinks =
    idToRecord.cogroup(idToLink)
        .mapWithState(personAndLinkStateSpec);
// And a lot more...
![Page 47: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/47.jpg)
Link Updates
Record Updates
Grouped Record and Link Updates
PreviousState
Person Records
![Page 48: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/48.jpg)
Link Updates
Record Updates
Grouped Record and Link Updates
PreviousState
Person Records
What about deletes?
![Page 49: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/49.jpg)
Stream processing is not “fast” batch processing.
Little reuse beyond core functions
Different pipeline semantics
Must implement compensation logic
![Page 50: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/50.jpg)
Batch Processing
Stream Processing
Reusable Code
![Page 51: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/51.jpg)
If you're willing to restrict the flexibility of your approach, you can almost
always do something better.
-John Carmack
![Page 52: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/52.jpg)
public void rollingMap(EntityKey key, Long version, T value, Emitter emitter);
public void rollingReduce(EntityKey key, S state, Emitter emitter);
![Page 53: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/53.jpg)
emitter.emit(key,value);
emitter.tombstone(outdatedKey);
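Only the two calls `emit` and `tombstone` appear on the slide; the rest of the emitter is not shown. A minimal in-memory sketch of how such a pair could handle the "What about deletes?" case follows. The `Emitter` interface shape, the `MapEmitter` class, and the `relink` helper are hypothetical names introduced for this example.

```java
import java.util.HashMap;
import java.util.Map;

final class EmitterSketch {
    // Hypothetical interface; the talk shows only the emit and tombstone calls.
    interface Emitter<K, V> {
        void emit(K key, V value);
        void tombstone(K outdatedKey);
    }

    // In-memory emitter: emit upserts a value, tombstone removes a key that is no longer valid.
    static final class MapEmitter<K, V> implements Emitter<K, V> {
        final Map<K, V> output = new HashMap<>();

        @Override
        public void emit(K key, V value) {
            output.put(key, value);
        }

        @Override
        public void tombstone(K outdatedKey) {
            output.remove(outdatedKey);
        }
    }

    // When a record moves to a different person, write it under the new key
    // and retract the entry that was emitted under the old key.
    static void relink(String oldPersonId, String newPersonId, String record,
                       Emitter<String, String> emitter) {
        emitter.emit(newPersonId, record);
        if (!oldPersonId.equals(newPersonId)) {
            emitter.tombstone(oldPersonId);
        }
    }
}
```

The same `relink` logic can run against a batch-backed or a stream-backed emitter, which is the kind of reuse the domain-specific API on the next slide is after.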
![Page 54: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/54.jpg)
Domain-Specific API
Batch Host Streaming Host
Reusable Code
![Page 55: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/55.jpg)
So are we done?
Limited expressiveness
Not composable
Learning curve
Artificial complexity
![Page 56: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/56.jpg)
Act III: Reframing the problem
![Page 57: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/57.jpg)
Batch Processing
Stream Processing
Reusable Code
![Page 58: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/58.jpg)
Batch Processing
Stream Processing
Kappa Architecture
![Page 59: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/59.jpg)
It’s time for a POSIX of data processing
![Page 60: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/60.jpg)
Make everything a stream!
(If the technology can scale to your volume and historical data.)
(If your problem can be expressed in monoids.)
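A monoid is an associative combine operation with an identity element, which is what lets partial results be merged in any grouping as new data streams in. The sketch below uses a hypothetical count-and-max summary of lab values; the `Summary` class and its fields are illustrative assumptions, not something from the talk.

```java
import java.util.List;

final class MonoidSketch {
    // Hypothetical summary of lab values: how many were seen and the largest one.
    static final class Summary {
        // Identity element: combining with EMPTY changes nothing.
        static final Summary EMPTY = new Summary(0, Double.NEGATIVE_INFINITY);

        final long count;
        final double max;

        Summary(long count, double max) {
            this.count = count;
            this.max = max;
        }

        static Summary of(double value) {
            return new Summary(1, value);
        }

        // Associative: a.combine(b).combine(c) gives the same result as a.combine(b.combine(c)).
        Summary combine(Summary other) {
            return new Summary(count + other.count, Math.max(max, other.max));
        }
    }

    // Folds a list of values into one summary, starting from the identity.
    static Summary summarize(List<Double> values) {
        Summary s = Summary.EMPTY;
        for (double v : values) {
            s = s.combine(Summary.of(v));
        }
        return s;
    }
}
```

Because the combine is associative, summarizing a full batch and combining summaries of two smaller chunks give the same answer, so one definition can serve both batch and streaming paths.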
![Page 61: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/61.jpg)
Apache Beam (was: Google Cloud Dataflow)
![Page 62: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/62.jpg)
Potentially a POSIX of data processing
Composable Units (PTransforms)
Unification of Batch and Stream
Spark, Flink, Google Cloud Dataflow runners
![Page 63: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/63.jpg)
Bounded: a fixed dataset
Unbounded: continuously updating dataset
Window: time range of your data to process
Trigger: when to process a time range
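To make "window" concrete, here is a plain-Java sketch of window assignment by event time. It is not Beam's API; the class and method names are made up for this example, and windows are assumed to be aligned to the epoch.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

final class WindowSketch {
    // Start of the fixed window of the given size that contains the event time.
    static Instant fixedWindowStart(Instant eventTime, Duration size) {
        long sizeMs = size.toMillis();
        long t = eventTime.toEpochMilli();
        return Instant.ofEpochMilli(t - Math.floorMod(t, sizeMs));
    }

    // Starts of all sliding windows [start, start + size) that contain the event time,
    // where a new window opens every period. Listed from newest to oldest.
    static List<Instant> slidingWindowStarts(Instant eventTime, Duration size, Duration period) {
        long t = eventTime.toEpochMilli();
        long periodMs = period.toMillis();
        long lastStart = t - Math.floorMod(t, periodMs);
        List<Instant> starts = new ArrayList<>();
        for (long s = lastStart; s > t - size.toMillis(); s -= periodMs) {
            starts.add(Instant.ofEpochMilli(s));
        }
        return starts;
    }
}
```

An event at 09:42 falls in the 09:00 fixed one-hour window. With two-hour windows sliding every hour, the same event belongs to both the 09:00 and the 08:00 windows. A trigger is the separate decision of when each window's contents are emitted.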
![Page 64: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/64.jpg)
8:00 9:00 10:00 11:00 12:00
Windows and Triggers
![Page 65: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/65.jpg)
8:00 9:00 10:00 11:00 12:00
Windows and Triggers
![Page 66: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/66.jpg)
8:00 9:00 10:00 11:00 12:00
Windows and Triggers
![Page 67: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/67.jpg)
8:00 9:00 10:00 11:00 12:00
Windows and Triggers
![Page 68: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/68.jpg)
8:01 8:02 8:03 8:04 8:05
Windows and Triggers
![Page 69: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/69.jpg)
public class LinkRecordsTransform
    extends PTransform<PCollectionTuple,PCollection<Person>> {

  public static final TupleTag<RecordLink> LINKS = new TupleTag<>();
  public static final TupleTag<ExternalRecord> RECORDS = new TupleTag<>();

  @Override
  public PCollection<Person> apply(PCollectionTuple input) { . . . }
}
![Page 70: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/70.jpg)
apply implementation:

PCollection<KV<String,CoGbkResult>> cogrouped = KeyedPCollectionTuple
    .of(LINKS, idToLinks)
    .and(RECORDS, idToRecords)
    .apply(CoGroupByKey.create());

PCollection<KV<String,ExternalRecord>> uuidToRecs = cogrouped.apply(
    ParDo.of(new LinkExternalRecords()))
    .setCoder(KEY_REC_CODER);

// Combines by key AND window
return uuidToRecs.apply(
    Combine.<String,ExternalRecord,Person>perKey(
        new PersonCombineFn()))
    .setCoder(KEY_PERSON_CODER)
    .apply(Values.<Person>create());
![Page 71: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/71.jpg)
PCollection<RecordLink> windowedLinks = . . .
PCollection<ExternalRecord> windowedRecs = . . .

PCollection<Person> people = PCollectionTuple
    .of(LinkRecordsTransform.LINKS, windowedLinks)
    .and(LinkRecordsTransform.RECORDS, windowedRecs)
    .apply(new LinkRecordsTransform());
![Page 72: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/72.jpg)
PCollection<RecordLink> windowedLinks = . . .
PCollection<ExternalRecord> windowedRecs = . . .

PCollection<Person> people = PCollectionTuple
    .of(LinkRecordsTransform.LINKS, windowedLinks)
    .and(LinkRecordsTransform.RECORDS, windowedRecs)
    .apply(new LinkRecordsTransform());
![Page 73: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/73.jpg)
PCollection<RecordLink> windowedLinks = links.apply(
    Window.<RecordLink>into(
        FixedWindows.of(Duration.standardMinutes(60))));
PCollection<ExternalRecord> windowedRecs = . . .

PCollection<Person> people = PCollectionTuple
    .of(LinkRecordsTransform.LINKS, windowedLinks)
    .and(LinkRecordsTransform.RECORDS, windowedRecs)
    .apply(new LinkRecordsTransform());
![Page 74: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/74.jpg)
PCollection<RecordLink> windowedLinks = links.apply(
    Window.<RecordLink>into(
        FixedWindows.of(Duration.standardMinutes(60)))
    .withAllowedLateness(Duration.standardMinutes(15))
    .accumulatingFiredPanes());
PCollection<ExternalRecord> windowedRecs = . . .

PCollection<Person> people = PCollectionTuple
    .of(LinkRecordsTransform.LINKS, windowedLinks)
    .and(LinkRecordsTransform.RECORDS, windowedRecs)
    .apply(new LinkRecordsTransform());
![Page 75: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/75.jpg)
PCollection<RecordLink> windowedLinks = links.apply(
    Window.<RecordLink>into(
        FixedWindows.of(Duration.standardMinutes(60)))
    .withAllowedLateness(Duration.standardMinutes(15))
    .accumulatingFiredPanes()
    .triggering(Repeatedly.forever(
        AfterProcessingTime.pastFirstElementInPane()
            .plusDelayOf(Duration.standardMinutes(10)))));
PCollection<ExternalRecord> windowedRecs = . . .

PCollection<Person> people = PCollectionTuple
    .of(LinkRecordsTransform.LINKS, windowedLinks)
    .and(LinkRecordsTransform.RECORDS, windowedRecs)
    .apply(new LinkRecordsTransform());
![Page 76: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/76.jpg)
PCollection<RecordLink> windowedLinks = links.apply(
    Window.<RecordLink>into(
        SlidingWindows.of(Duration.standardMinutes(120)))
    .withAllowedLateness(Duration.standardMinutes(15))
    .accumulatingFiredPanes()
    .triggering(Repeatedly.forever(
        AfterProcessingTime.pastFirstElementInPane()
            .plusDelayOf(Duration.standardMinutes(10)))));
PCollection<ExternalRecord> windowedRecs = . . .

PCollection<Person> people = PCollectionTuple
    .of(LinkRecordsTransform.LINKS, windowedLinks)
    .and(LinkRecordsTransform.RECORDS, windowedRecs)
    .apply(new LinkRecordsTransform());
![Page 77: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/77.jpg)
PCollection<RecordLink> windowedLinks = links.apply(
    Window.<RecordLink>into(new GlobalWindows())
    .triggering(Repeatedly.forever(
        AfterProcessingTime.pastFirstElementInPane()
            .plusDelayOf(Duration.standardMinutes(5))))
    .accumulatingFiredPanes());
PCollection<ExternalRecord> windowedRecs = . . .

PCollection<Person> people = PCollectionTuple
    .of(LinkRecordsTransform.LINKS, windowedLinks)
    .and(LinkRecordsTransform.RECORDS, windowedRecs)
    .apply(new LinkRecordsTransform());
![Page 78: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/78.jpg)
Untangling Concerns
when was that data created?
what data have I received?
when should I process that data?
how should I group data to process?
what to do with late-arriving data?
should I emit preliminary results?
how should I amend those results?
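The questions about preliminary results and amending them correspond to how panes are emitted each time a trigger fires. The sketch below is a plain-Java toy, not Beam code, contrasting an accumulating mode (as `accumulatingFiredPanes()` is named in the earlier slides) with a discarding one; the class and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class PaneSketch {
    private final boolean accumulating;
    private final List<String> buffer = new ArrayList<>();

    PaneSketch(boolean accumulating) {
        this.accumulating = accumulating;
    }

    // Called for each element that arrives in the window.
    void add(String element) {
        buffer.add(element);
    }

    // Called when the trigger fires: returns the pane to emit downstream.
    List<String> fire() {
        List<String> pane = new ArrayList<>(buffer);
        if (!accumulating) {
            buffer.clear(); // discarding mode forgets elements it has already emitted
        }
        return pane;
    }
}
```

In accumulating mode each firing re-emits everything seen so far, so a later pane replaces the earlier preliminary result. In discarding mode each firing carries only the new elements, and the consumer has to combine panes itself.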
![Page 79: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/79.jpg)
Modular
Composable
Easier to reason about
Simple Made Easy, Rich Hickey, Strange Loop 2011
![Page 80: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/80.jpg)
But some caveats:
Runners at varying levels of maturity
Retraction not yet implemented (see BEAM-91)
APIs may change
![Page 81: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/81.jpg)
Spark offers a rich ecosystem
Spark SQL
MLlib
REPL
DataFrames
Genome Analysis Toolkit
![Page 82: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/82.jpg)
Two classes of problems:
Large, complex processing pipelines
Exploration and transformation of data
![Page 83: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/83.jpg)
Actual depiction of healthcare data
![Page 84: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/84.jpg)
[Chart: Understanding (y-axis) over Time (x-axis), with phases Orientation, Pattern Discovery, and Prescriptive Frameworks; Scalable Processing and Web Development are placed on the chart.]
![Page 85: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/85.jpg)
[Chart: Understanding (y-axis) over Time (x-axis), with phases Orientation, Pattern Discovery, and Prescriptive Frameworks; Scalable Processing and Web Development are placed on the chart.]
![Page 86: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/86.jpg)
Focus on the essence rather than the accidents.
![Page 87: Untangling Healthcare With Spark and Dataflow - PhillyETE 2016](https://reader033.fdocuments.net/reader033/viewer/2022052706/5a6d23147f8b9af8418b4ddf/html5/thumbnails/87.jpg)
Questions?
@ryanbrush