Hochschule Düsseldorf Fachbereich ... · KNIME Analytics Platform SS 2016 - IT Applications in...
HSD Hochschule Düsseldorf
University of Applied Sciences
Fachbereich Wirtschaftswissenschaften
Faculty of Business Studies
IT Applications in Business Analytics
Business Analytics (M.Sc.)
IT in Business Analytics
SS2016 / Lecture 05 – Introduction to KNIME
Thomas Zeutschler
HSD Faculty of Business Studies
Thomas Zeutschler
Associate Lecturer
Let’s get started…
SS 2016 - IT Applications in Business Analytics - 5. Introduction to KNIME 2
Introduction
KNIME Analytics Platform
KNIME is an open source platform for analytical data modelling and processing.
KNIME was developed at the University of Konstanz in 2004-2006 and
initially focused on pharmaceutical research.
Today KNIME is a modular, highly scalable data processing platform
which allows easy integration of different modules for:
- data loading, processing, transformation
- data analysis
- visual data exploration
www.knime.org
KNIME Analytics Platform – Workflows
An analysis is defined by a graphical workflow.
Interlinked nodes define the individual steps of a workflow.
Hundreds of predefined nodes are available for various purposes:
- data loading, processing, transformation and data delivery
- data analysis and visualization
- interaction with other tools (e.g. run an R script)
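The idea of a workflow as a chain of nodes can be sketched in plain Python. This is only an illustration of the concept, not how KNIME is implemented: the function names (`read_data`, `drop_missing`, `add_doubled`, `run_workflow`) and the sample rows are hypothetical and made up for this sketch.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Table = List[Row]
Node = Callable[[Table], Table]


def read_data(_: Table) -> Table:
    # Stand-in for a reader node; the rows are made-up illustration data.
    return [
        {"name": "a", "value": 3.0},
        {"name": "b", "value": None},
        {"name": "c", "value": 5.0},
    ]


def drop_missing(table: Table) -> Table:
    # Stand-in for a data preparation node: remove rows with missing cells.
    return [row for row in table if all(v is not None for v in row.values())]


def add_doubled(table: Table) -> Table:
    # Stand-in for a transformation node: derive a new column.
    return [{**row, "doubled": row["value"] * 2} for row in table]


def run_workflow(nodes: List[Node]) -> Table:
    # Execute the nodes in order; each node's output table feeds the next node.
    table: Table = []
    for node in nodes:
        table = node(table)
    return table


result = run_workflow([read_data, drop_missing, add_doubled])
```

Each function plays the role of one node, and the list passed to `run_workflow` plays the role of the connections between them.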
KNIME Analytics Platform - Frontend
KNIME Analytics Platform – Real World Example
KNIME – Installation
Register, download and install KNIME from http://knime.org
www.knime.org
KNIME – Let’s get started…
KNIME – First Data Analysis
“Sleep in Mammals: Ecological and Constitutional Correlates”
Description: https://www.stat.auckland.ac.nz/~stats330/datasets.dir/sleep.txt
Dataset: https://www.stat.auckland.ac.nz/~stats330/datasets.dir/sleep.csv
“Titanic Survival Status”
Description: http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.html
Dataset: http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls
KNIME – Essential Nodes
Problem: Too many nodes…
Solution 1: You can search directly in the Node Repository.
Solution 2: Search https://tech.knime.org/forum for your problem.
Reading Data
Data Preparation
Partitioning: The input table is split row-wise into two partitions,
e.g. train and test data. The two partitions are available
at the two output ports.
Missing Value: This node helps handle missing values found in cells of
the input table.
Row Filter / Column Filter: These nodes filter rows or columns according to
certain criteria.
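The three preparation steps described above (row-wise partitioning, missing value handling, filtering) can be illustrated with a small plain-Python sketch. It is not KNIME code; the helper names, the 70/30 split, the mean-fill strategy and the sample rows are assumptions chosen only for this illustration.

```python
import random
from statistics import mean


def partition(rows, train_fraction=0.7, seed=42):
    # Shuffle a copy, then split row-wise into two partitions (train, test).
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def fill_missing_with_mean(rows, column):
    # Replace missing (None) cells of one numeric column with the column mean.
    present = [r[column] for r in rows if r[column] is not None]
    fill = mean(present)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


def filter_rows(rows, predicate):
    # Keep only the rows for which the predicate holds.
    return [r for r in rows if predicate(r)]


# Made-up rows: every fourth weight is missing.
rows = [{"id": i, "weight": float(i) if i % 4 else None} for i in range(1, 11)]

filled = fill_missing_with_mean(rows, "weight")
heavy = filter_rows(filled, lambda r: r["weight"] > 5)
train, test = partition(filled)
```

Fixing the seed keeps the split repeatable, which matters when the train and test partitions are compared across runs.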
First Statistical Data Analysis
Statistics: Calculates statistical moments such as minimum, maximum,
mean, standard deviation, variance, median, overall sum,
number of missing values and row count across all numeric
columns, and counts all nominal values together with their
occurrences.
Crosstab: Creates a cross table (also referred to as a contingency table
or cross tab). It can be used to analyze the relation between
two columns with categorical data and displays the
frequency distribution of the categorical variables in a
table.
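To make the two descriptions above concrete, here is a minimal plain-Python sketch of a numeric column summary and a cross table. It is an illustration only, not KNIME's implementation; the function names, the column names `group` and `outcome`, and the sample values are made up.

```python
from collections import Counter
from statistics import mean, median, stdev


def numeric_summary(values):
    # Summarise one numeric column; None marks a missing value.
    present = [v for v in values if v is not None]
    return {
        "min": min(present),
        "max": max(present),
        "mean": mean(present),
        "median": median(present),
        "std_dev": stdev(present),  # sample standard deviation
        "sum": sum(present),
        "missing": len(values) - len(present),
        "rows": len(values),
    }


def crosstab(rows, col_a, col_b):
    # Count how often each combination of two categorical values occurs.
    return Counter((r[col_a], r[col_b]) for r in rows)


summary = numeric_summary([2.0, 4.0, None, 6.0])

# Made-up categorical rows.
rows = [
    {"group": "A", "outcome": "yes"},
    {"group": "A", "outcome": "no"},
    {"group": "B", "outcome": "yes"},
    {"group": "A", "outcome": "yes"},
]
table = crosstab(rows, "group", "outcome")
```

The cross table is stored as a mapping from value pairs to counts; a combination that never occurs simply has a count of zero.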
KNIME – Data Mining Cheating…
http://scikit-learn.org/stable/tutorial/machine_learning_map/
Algorithms with their pros, cons, and what they are good at:

Linear regression
Pros:
- Very fast (prediction time does not depend on the size of the training set)
- Easy to understand the model
- Less prone to overfitting
Cons:
- Unable to model complex relationships
- Unable to capture nonlinear relationships without first transforming the inputs
Good at:
- The first look at a dataset
- Numerical data with lots of features

Decision trees
Pros:
- Fast
- Robust to noise and missing values
- Accurate
Cons:
- Complex trees are hard to interpret
- Duplication within the same sub-tree is possible
Good at:
- Star classification
- Medical diagnosis
- Credit risk analysis

Neural networks
Pros:
- Extremely powerful
- Can model even very complex relationships
- No need to understand the underlying data
- Almost works by “magic”
Cons:
- Prone to overfitting
- Long training time
- Requires significant computing power for large datasets
- Model is essentially unreadable
Good at:
- Images
- Video
- “Human-intelligence” type tasks like driving or flying
- Robotics

Support Vector Machines
Pros:
- Can model complex, nonlinear relationships
- Robust to noise (because they maximize margins)
Cons:
- Need to select a good kernel function
- Model parameters are difficult to interpret
- Sometimes numerical stability problems
- Requires significant memory and processing power
Good at:
- Classifying proteins
- Text classification
- Image classification
- Handwriting recognition

K-Nearest Neighbors
Pros:
- Simple
- Powerful
- No training involved (“lazy”)
- Naturally handles multiclass classification and regression
Cons:
- Expensive and slow to predict new instances
- Must define a meaningful distance function
- Performs poorly on high-dimensionality datasets
Good at:
- Low-dimensional datasets
- Computer security: intrusion detection
- Fault detection in semiconductor manufacturing
- Video content retrieval
- Gene expression
- Protein-protein interaction
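The K-Nearest Neighbors entry above can be made tangible with a short plain-Python sketch. It shows why no training step is needed and why the choice of distance function matters: the whole training set is scanned for every prediction. The function names, the Euclidean distance, `k=3` and the sample points are assumptions for illustration only.

```python
import math
from collections import Counter


def euclidean(a, b):
    # One possible distance function; the choice must suit the data.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train, query, k=3, distance=euclidean):
    # "Lazy" learner: nothing is fitted in advance. Every prediction sorts
    # the full training set by distance, which is why prediction is slow.
    neighbours = sorted(train, key=lambda item: distance(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Made-up two-dimensional training points with labels.
train = [
    ((1.0, 1.0), "small"),
    ((1.2, 0.8), "small"),
    ((0.9, 1.1), "small"),
    ((5.0, 5.0), "large"),
    ((5.2, 4.9), "large"),
    ((4.8, 5.1), "large"),
]

label_near_origin = knn_predict(train, (1.1, 1.0))
label_far_away = knn_predict(train, (5.1, 5.0))
```

Passing a different `distance` callable changes the notion of "nearest" without touching the rest of the code.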
http://www.kdnuggets.com/2015/07/good-data-science-machine-learning-cheat-sheets.html
https://github.com/soulmachine/machine-learning-cheat-sheet/raw/master/machine-learning-cheat-sheet.pdf
https://azure.microsoft.com/en-us/documentation/articles/machine-learning-algorithm-cheat-sheet/
Exercise in KNIME
First Exercise in KNIME
"Sleep in Mammals: Ecological and Constitutional Correlates" by Allison, T. and Cicchetti, D. (1976)
https://www.stat.auckland.ac.nz/~stats330/datasets.dir/sleep.txt
…/sleep.csv
Source: https://www.stat.auckland.ac.nz/~stats330/datasets.dir/
Training Video: https://www.youtube.com/watch?v=Uo1C7Iligw0
1. What is the average lifespan of the animals?
2. Which species lives the longest?
3. Can we have a histogram of lifespan?
4. What is the correlation between lifespan and the size of an animal?
5. Can we have a full correlation matrix of all variables (see figure 1)?
6. Can we have a scatter-plot of species size vs. danger factor (see figure 2)?
7. Split the dataset (train, test) and answer the following question: Can we predict “total-sleep”?
Figure 1
Figure 2
https://www.stat.auckland.ac.nz/~stats330/datasets.dir/sleep.csv
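Questions 1, 4 and 5 come down to a mean and to Pearson correlations between columns. The following plain-Python sketch shows the calculation on a tiny made-up table; the column names `body_weight`, `life_span` and `total_sleep` and all values are hypothetical and are not taken from sleep.csv. Rows with missing values would have to be dropped or imputed first.

```python
import math


def pearson(xs, ys):
    # Pearson correlation coefficient of two equally long numeric columns.
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    # Pairwise correlations of all columns, as a nested dictionary.
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up illustration data, not the real dataset.
columns = {
    "body_weight": [1.0, 2.0, 3.0, 4.0],
    "life_span": [2.0, 4.0, 6.0, 8.0],
    "total_sleep": [8.0, 6.0, 4.0, 2.0],
}

average_life_span = sum(columns["life_span"]) / len(columns["life_span"])
matrix = correlation_matrix(columns)
```

In this toy table lifespan rises exactly with body weight and total sleep falls exactly with it, so the coefficients sit at the two extremes of the scale; real data will land somewhere in between.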
Lecture Summary & Homework
Lessons Learned
KNIME is an easy path towards analytics.
A workflow-oriented way of working dramatically simplifies the data
analysis and modelling process.
Combine CRISP-DM and KNIME and you are able to solve complex
analytical problems in a well-organized and repeatable format.
First, try to understand which algorithm fits which problem, how the
algorithms behave, and what influences their behavior.
Second (if you are willing), try to understand how the algorithms work.
Resources
KNIME
KNIME Forum: https://tech.knime.org/forum
KNIME Training Video: https://www.youtube.com/user/KNIMETV
Data Mining Literature
Data Mining for the Masses: http://docs.rapidminer.com/downloads/DataMiningForTheMasses.pdf
Machine Learning Cheat Sheet: https://github.com/soulmachine/machine-learning-cheat-sheet/raw/master/machine-learning-cheat-sheet.pdf
Get Prepared (Homework)
Homework: Titanic Survival Status
http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls
Answer the following question:
“What was the probability of survival per class (1, 2, 3) and sex (male, female)?”
Create a KNIME workflow that answers the question based on the original Titanic
data set.
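The question is a grouped proportion: for each combination of class and sex, divide the number of survivors by the number of passengers. A minimal plain-Python sketch of that calculation follows. It is not a KNIME workflow and does not replace one; the column names `pclass`, `sex` and `survived` are an assumption about the spreadsheet layout, and the six rows are made up for illustration.

```python
from collections import defaultdict


def survival_rate_by_group(rows):
    # Count passengers and survivors per (class, sex) group, then divide.
    totals = defaultdict(int)
    survivors = defaultdict(int)
    for row in rows:
        key = (row["pclass"], row["sex"])
        totals[key] += 1
        survivors[key] += row["survived"]  # 1 = survived, 0 = did not
    return {key: survivors[key] / totals[key] for key in totals}


# Made-up rows, not taken from the Titanic data set.
rows = [
    {"pclass": 1, "sex": "female", "survived": 1},
    {"pclass": 1, "sex": "female", "survived": 1},
    {"pclass": 1, "sex": "male", "survived": 0},
    {"pclass": 1, "sex": "male", "survived": 1},
    {"pclass": 3, "sex": "male", "survived": 0},
    {"pclass": 3, "sex": "male", "survived": 0},
]

rates = survival_rate_by_group(rows)
```

The same grouping logic is what a group-by style node would perform inside a workflow.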
Submit your results as a KNIME archive file to
Hint:
Any Questions?