ABDW17-Lightning Talks track-Real-Time Insight Into a Billion Emails
Lightning Talks & Integrations Track - Real-Time Streaming of Events to Predict Noise @ ABDW17, Pune
Uploaded by datatorrent
High Throughput Raw Data Ingestion (~35,000 ops/sec)
Data Pre-Processing (<2 hrs for 96 GB data)
Visualization on raw and processed data
Data Access & Fast Query (400K rows/sec)
Achieved 89% Prediction Accuracy of parasitic events
/input/Perietids_1/Day_7.csv
Time,Variable3,Variable7
1,-0.000456,0.000293
2,0.016824,0.047206
3,0.022986,0.046584
4,0.027403,0.048821
5,0.026759,0.039114
6,0.016151,0.025287

/output/Perietids_1_Day_7.csv
Region,Well,Day,Time,Variable,Value
Perietids_1,1,Day_7,1,Variable3,-0.000456
Perietids_1,1,Day_7,1,Variable7,0.000293
Perietids_1,1,Day_7,2,Variable3,0.016824
Perietids_1,1,Day_7,2,Variable7,0.047206
Perietids_1,1,Day_7,3,Variable3,0.022986
Perietids_1,1,Day_7,3,Variable7,0.046584
Perietids_1,1,Day_7,4,Variable3,0.027403
Perietids_1,1,Day_7,4,Variable7,0.048821
Perietids_1,1,Day_7,5,Variable3,0.026759
Perietids_1,1,Day_7,5,Variable7,0.039114
Perietids_1,1,Day_7,6,Variable3,0.016151
Perietids_1,1,Day_7,6,Variable7,0.025287
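The reshaping shown above, from one column per variable to one row per reading, can be sketched as follows. This is a minimal illustration using Python's standard csv module, not the pipeline used in the talk. The function name wide_to_long, and the choice to pass Region, Well and Day in as arguments rather than parse them from the file path, are assumptions made for the example.

```python
import csv
import io


def wide_to_long(wide_csv, region, well, day):
    """Unpivot a wide per-day CSV (Time, <variable columns>) into long rows.

    Hypothetical helper: region, well and day are supplied by the caller,
    since the slides do not say how they are derived from the file path.
    """
    reader = csv.DictReader(io.StringIO(wide_csv))
    # Every column except Time is treated as a measured variable.
    variables = [name for name in reader.fieldnames if name != "Time"]
    rows = []
    for record in reader:
        for variable in variables:
            rows.append(
                [region, well, day, record["Time"], variable, record[variable]]
            )
    return rows


# First two readings from the input file shown on the slide.
sample = (
    "Time,Variable3,Variable7\n"
    "1,-0.000456,0.000293\n"
    "2,0.016824,0.047206\n"
)
long_rows = wide_to_long(sample, "Perietids_1", "1", "Day_7")

# Emit the long format with the same header as the output file.
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(["Region", "Well", "Day", "Time", "Variable", "Value"])
writer.writerows(long_rows)
print(out.getvalue())
```

Values are kept as strings so they are written back exactly as read, without any floating-point reformatting.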
Sr. no.  Algorithm               Area under ROC        Area under PR
1        Decision Tree           0.910859892634328     0.8998662900517511
2        Gradient Boosted Tree   0.9153602433919851    0.8982661492173283
3        Random Forest Tree      0.8921605540835773    0.8737343102424423
4        Support Vector Machine  0.45597755210613794   0.2680391212960384
(ROC = Receiver Operating Characteristic; PR = Precision Recall)
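For readers unfamiliar with the two metrics in the table, the sketch below computes them for a handful of made-up scores. The slides do not say how the areas were computed, so this is only one common formulation: the ROC area as the fraction of positive/negative pairs ranked correctly, and the PR area approximated by average precision. The function names and the toy labels and scores are assumptions for illustration, and the ranking step assumes no tied scores.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Tied pairs count as half a win. Equivalent to the area under the ROC curve.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision measured at each positive, highest score first.

    A step-wise approximation of the area under the precision-recall curve.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Toy data (hypothetical): 1 = event, 0 = no event.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
ap = average_precision(labels, scores)
print(auc, ap)
```

An ROC area of 0.5 corresponds to ranking events no better than chance, which puts the Support Vector Machine result in the table (about 0.456) at or below that baseline, while the three tree-based models sit near 0.9.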
Data Pre-processing job (~96 GB data)
• Time = 1 hr 53 mins, ~900 million input rows, with Min, Max, (Max-Min), Average features
• Time = 12 hr 29 mins, ~900 million input rows, with Min, Max, (Max-Min), Average, Distinct features – Spark ML-Lib
• Time = 52 mins, ~900 million input rows, with Min, Max, (Max-Min), Average, Distinct features – scikit-learn
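The aggregate features named above (Min, Max, Max-Min, Average, Distinct) can be illustrated on a few long-format rows. This is a plain-Python sketch, not the Spark or scikit-learn job that was timed. The grouping key (Region, Well, Day, Variable) and the reading of "Distinct" as a count of distinct values are assumptions, as is the helper name aggregate_features.

```python
from collections import defaultdict


def aggregate_features(long_rows):
    """Compute per-group summary features from long-format rows.

    Assumed grouping key: (Region, Well, Day, Variable).
    """
    groups = defaultdict(list)
    for region, well, day, _time, variable, value in long_rows:
        groups[(region, well, day, variable)].append(float(value))

    features = {}
    for key, values in groups.items():
        lo, hi = min(values), max(values)
        features[key] = {
            "min": lo,
            "max": hi,
            "range": hi - lo,                  # the (Max-Min) feature
            "avg": sum(values) / len(values),
            "distinct": len(set(values)),      # assumed: count of distinct values
        }
    return features


# First three Variable3 readings from the sample output file.
rows = [
    ["Perietids_1", "1", "Day_7", "1", "Variable3", "-0.000456"],
    ["Perietids_1", "1", "Day_7", "2", "Variable3", "0.016824"],
    ["Perietids_1", "1", "Day_7", "3", "Variable3", "0.022986"],
]
features = aggregate_features(rows)
for key, stats in features.items():
    print(key, stats)
```

Each group collapses to a single feature row, which is consistent with the much smaller row count quoted for model building after aggregation.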
Model Building & Save
• 2 mins for 9752 input rows post-aggregations

Event Prediction for 2 wells
• 3-4 mins, ~12 million input rows, 70 day files for 2 wells (140 records)
Overall Accuracy 93.45%
Precision 87.47%
Recall/TPR 90.07%
Specificity 94.73%

Total events = 1603
             Predicted (0)  Predicted (1)  Total
Actual (0)   TN=1079        FP=60          1139
Actual (1)   FN=45          TP=419         464
Total        1124           479            1603
Overall Accuracy 92.20%
Precision 88.90%
Recall/TPR 89.53%
Specificity 95.54%

Total events = 1977
             Predicted (0)  Predicted (1)  Total
Actual (0)   TN=1350        FP=63          1413
Actual (1)   FN=59          TP=505         564
Total        1409           568            1977
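The four summary figures quoted with each confusion matrix follow from the TN, FP, FN and TP counts by the standard definitions. The sketch below recomputes them from the counts on the slides, so any difference from the printed percentages is easy to spot. The function name confusion_metrics and the dictionary labels are made up for the example.

```python
def confusion_metrics(tn, fp, fn, tp):
    """Standard binary-classification metrics from confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,   # share of all events classified correctly
        "precision": tp / (tp + fp),     # share of predicted 1s that are real 1s
        "recall": tp / (tp + fn),        # true positive rate (TPR)
        "specificity": tn / (tn + fp),   # true negative rate
    }


# Counts copied from the two confusion matrices above.
matrices = {
    "1603 events": dict(tn=1079, fp=60, fn=45, tp=419),
    "1977 events": dict(tn=1350, fp=63, fn=59, tp=505),
}
results = {name: confusion_metrics(**counts) for name, counts in matrices.items()}

for name, metrics in results.items():
    print(name, {k: f"{100 * v:.2f}%" for k, v in metrics.items()})
```

For the first matrix, for example, accuracy is (419 + 1079) / 1603, which rounds to the 93.45% shown.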