Team Samba - uni-muenchen.de · • Python 3 • Sklearn • Numpy + Pandas • Linear Regression...

29
Team Samba ©Michael Fromm, Christian Lemke, Antoine Maurino, Mihaela Hanea, Sebastian Wagner 1

Transcript of Team Samba - uni-muenchen.de · • Python 3 • Sklearn • Numpy + Pandas • Linear Regression...

Page 1: Team Samba - uni-muenchen.de · • Python 3 • Sklearn • Numpy + Pandas • Linear Regression • SVR • TensorFlow (lowlevel) • Soft skills • IT-Security • Flink, Clusters

Team Samba©Michael Fromm, Christian Lemke, Antoine Maurino, Mihaela Hanea, Sebastian

Wagner

1

Page 2: Team Samba - uni-muenchen.de · • Python 3 • Sklearn • Numpy + Pandas • Linear Regression • SVR • TensorFlow (lowlevel) • Soft skills • IT-Security • Flink, Clusters

2

Agenda1. KDD CUP 2017

2. Data Preprocessing & Transformation

3. Big Data Science Tool

4. Summary

5. Demo

Page 3: Team Samba - uni-muenchen.de · • Python 3 • Sklearn • Numpy + Pandas • Linear Regression • SVR • TensorFlow (lowlevel) • Soft skills • IT-Security • Flink, Clusters

3

1. KDD CUP 2017

2. Data Preprocessing & Transformation

3. Big Data Science Tool

4. Summary

5. Demo

Page 4: Team Samba - uni-muenchen.de · • Python 3 • Sklearn • Numpy + Pandas • Linear Regression • SVR • TensorFlow (lowlevel) • Soft skills • IT-Security • Flink, Clusters

4

Overview• Topic: Highway tollgates traffic flow prediction• Task: Estimate the average travel time from

intersections to tollgates

• Topic: Highway tollgates traffic flow prediction• Task: Estimate the average travel time from

intersections to tollgates in time windows

Page 5: Team Samba - uni-muenchen.de · • Python 3 • Sklearn • Numpy + Pandas • Linear Regression • SVR • TensorFlow (lowlevel) • Soft skills • IT-Security • Flink, Clusters

5

Data

• 110000 data points• 3 months time range• 48 MB data size

Page 6: Team Samba - uni-muenchen.de · • Python 3 • Sklearn • Numpy + Pandas • Linear Regression • SVR • TensorFlow (lowlevel) • Soft skills • IT-Security • Flink, Clusters

6

Task

Page 7: Team Samba - uni-muenchen.de · • Python 3 • Sklearn • Numpy + Pandas • Linear Regression • SVR • TensorFlow (lowlevel) • Soft skills • IT-Security • Flink, Clusters

7

Our results

Page 8: Team Samba - uni-muenchen.de · • Python 3 • Sklearn • Numpy + Pandas • Linear Regression • SVR • TensorFlow (lowlevel) • Soft skills • IT-Security • Flink, Clusters

8

1. KDD CUP 2017

2. Data Preprocessing & Transformation

3. Big Data Science Tool

4. Summary

5. Demo

Page 9: Team Samba - uni-muenchen.de · • Python 3 • Sklearn • Numpy + Pandas • Linear Regression • SVR • TensorFlow (lowlevel) • Soft skills • IT-Security • Flink, Clusters

9

Data Preprocessing & Transformation

• Handling missing values

• Specification of the input and output

• Data aggregation

• Manual feature selection

Page 10: Team Samba - uni-muenchen.de · • Python 3 • Sklearn • Numpy + Pandas • Linear Regression • SVR • TensorFlow (lowlevel) • Soft skills • IT-Security • Flink, Clusters

10

Input FeaturesX = Time Information (3 Features)

X = Weather (7 Features)

Page 11: Team Samba - uni-muenchen.de · • Python 3 • Sklearn • Numpy + Pandas • Linear Regression • SVR • TensorFlow (lowlevel) • Soft skills • IT-Security • Flink, Clusters

11

Input FeaturesX = Current Situation (6·24 = 144 Features)

Page 12: Team Samba - uni-muenchen.de · • Python 3 • Sklearn • Numpy + Pandas • Linear Regression • SVR • TensorFlow (lowlevel) • Soft skills • IT-Security • Flink, Clusters

12

Output FeaturesY = Average Travel Time (6∙6 = 36 Features)

Page 13: Team Samba - uni-muenchen.de · • Python 3 • Sklearn • Numpy + Pandas • Linear Regression • SVR • TensorFlow (lowlevel) • Soft skills • IT-Security • Flink, Clusters

13

1. KDD CUP 2017

2. Data Preprocessing & Transformation

3. Big Data Science Tool

4. Summary

5. Demo

Page 14: Team Samba - uni-muenchen.de · • Python 3 • Sklearn • Numpy + Pandas • Linear Regression • SVR • TensorFlow (lowlevel) • Soft skills • IT-Security • Flink, Clusters

14

Prediction task• Algorithms

- Linear regression (Scikit-learn)

- Support vector machine (Scikit-learn)

- Feed-forward neural network (TensorFlow)

• Distribution of model learning (Apache Flink)

• Select best model for learning

Page 15: Team Samba - uni-muenchen.de · • Python 3 • Sklearn • Numpy + Pandas • Linear Regression • SVR • TensorFlow (lowlevel) • Soft skills • IT-Security • Flink, Clusters

15

Architecture

Page 16: Team Samba - uni-muenchen.de · • Python 3 • Sklearn • Numpy + Pandas • Linear Regression • SVR • TensorFlow (lowlevel) • Soft skills • IT-Security • Flink, Clusters

16

1. KDD CUP 2017

2. Data Preprocessing & Transformation

3. Big Data Science Tool

4. Summary

5. Demo

Page 17: Team Samba - uni-muenchen.de · • Python 3 • Sklearn • Numpy + Pandas • Linear Regression • SVR • TensorFlow (lowlevel) • Soft skills • IT-Security • Flink, Clusters

17

Challenges• Follow the motto:

“Do not separate responsibilities! Everyone is

responsible for everything.”

• Rotation of Scrum master

• Security issues

• Dynamic rescaling not supported by Flink 1.3

Page 18: Team Samba - uni-muenchen.de · • Python 3 • Sklearn • Numpy + Pandas • Linear Regression • SVR • TensorFlow (lowlevel) • Soft skills • IT-Security • Flink, Clusters

18

Learnings• Python 3

• Sklearn

• Numpy + Pandas

• Linear Regression

• SVR

• TensorFlow (lowlevel)

• Soft skills

• IT-Security

• Flink, Clusters

• MongoDB

• Scrum

• Web Dev

Page 19: Team Samba - uni-muenchen.de · • Python 3 • Sklearn • Numpy + Pandas • Linear Regression • SVR • TensorFlow (lowlevel) • Soft skills • IT-Security • Flink, Clusters

19

Expected outcome✓ Selection of models for traffic flow prediction problem

✓ Documentation of models and explanation of

hyperparameters

✓ Model selection framework in Flink

✓ GUI for model selection framework for arbitrary dataset

✓ Best model for traffic flow prediction problems

Page 20: Team Samba - uni-muenchen.de · • Python 3 • Sklearn • Numpy + Pandas • Linear Regression • SVR • TensorFlow (lowlevel) • Soft skills • IT-Security • Flink, Clusters

20

Future work• Adding more models

e.g. ensemble learning, recurrent networks

• Adding authentication

• Dashboards

• GPU computation for neural nets

• Distributed database

Page 21: Team Samba - uni-muenchen.de · • Python 3 • Sklearn • Numpy + Pandas • Linear Regression • SVR • TensorFlow (lowlevel) • Soft skills • IT-Security • Flink, Clusters

21

1. KDD CUP 2017

2. Data Preprocessing & Transformation

3. Big Data Science Tool

4. Summary

5. Demo

Page 22: Team Samba - uni-muenchen.de · • Python 3 • Sklearn • Numpy + Pandas • Linear Regression • SVR • TensorFlow (lowlevel) • Soft skills • IT-Security • Flink, Clusters

22

Live Demo

Page 23: Team Samba - uni-muenchen.de · • Python 3 • Sklearn • Numpy + Pandas • Linear Regression • SVR • TensorFlow (lowlevel) • Soft skills • IT-Security • Flink, Clusters

Thanks for yourattention

Page 24: Team Samba - uni-muenchen.de · • Python 3 • Sklearn • Numpy + Pandas • Linear Regression • SVR • TensorFlow (lowlevel) • Soft skills • IT-Security • Flink, Clusters

24

Scalability

Page 25: Team Samba - uni-muenchen.de · • Python 3 • Sklearn • Numpy + Pandas • Linear Regression • SVR • TensorFlow (lowlevel) • Soft skills • IT-Security • Flink, Clusters

25

KDD CUP 2017 - Data

Data statistics:● 110000 trajectories● 3 months● 48 MB

Page 26: Team Samba - uni-muenchen.de · • Python 3 • Sklearn • Numpy + Pandas • Linear Regression • SVR • TensorFlow (lowlevel) • Soft skills • IT-Security • Flink, Clusters

26

ResultsSklearn:• Linear Regression - TI - MAPE ?0.8?• SVR - TI - MAPE 0.200

TensorFlow:• Linear Regression - TI - MAPE 0.8• NN - CS - MAPE 0.55• DNN - MAPE ?

Page 27: Team Samba - uni-muenchen.de · • Python 3 • Sklearn • Numpy + Pandas • Linear Regression • SVR • TensorFlow (lowlevel) • Soft skills • IT-Security • Flink, Clusters

27

Neural network model

Page 28: Team Samba - uni-muenchen.de · • Python 3 • Sklearn • Numpy + Pandas • Linear Regression • SVR • TensorFlow (lowlevel) • Soft skills • IT-Security • Flink, Clusters

28

Training error

Page 29: Team Samba - uni-muenchen.de · • Python 3 • Sklearn • Numpy + Pandas • Linear Regression • SVR • TensorFlow (lowlevel) • Soft skills • IT-Security • Flink, Clusters

29

Learning process