Transcript of "Advanced Data Science - Regression" (Technion), Idan Schwartz

Page 1:

Advanced Data Science - Regression

Idan Schwartz

Page 2:

Titanic - Machine Learning from Disaster

• On April 15, 1912, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew.

• One of the reasons the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class.

• Question: what sorts of people were likely to survive?

Page 3:

Data cleaning

• To apply Machine Learning models, data must be converted to a tabular form.

• This is often the most time-consuming and difficult part of the whole process.

Page 4:

Environment - Python

• We will work with Python. Fun to code!

• Clean syntax: no semicolons, dynamic typing

• Loads of data structures: dict = {}, list = []

• Easy to iterate: [f(x) for x in dict]

• Supports object-oriented and functional programming (see the snippet after this list)

• Awesome libraries - Python has all the libraries you need.

• Use the Anaconda distribution for an easy start - it includes over 100 of the most popular Python packages for data science.
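A quick illustrative snippet of the points above (a minimal sketch; the variable names are made up):

passengers = {'Jack': 20, 'Rose': 17}    # dict literal
ages = [20, 17, 35]                      # list literal
doubled = [2 * x for x in ages]          # list comprehension: [40, 34, 70]
print(doubled)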

Page 5:

Data Handling Tools - pandas

• Python's version of Excel.

• Easy to read and manipulate data.

• Column insertion and deletion

• Merging and joining

• Aggregation ("group by" engine)

• Easy statistics - one command to get all summary statistics (df.describe())

• Time-series functionality

• Easy plotting.

• Other resources for working with pandas: http://pandas.pydata.org/pandas-docs/stable/tutorials.html
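A minimal sketch of these operations on the Titanic data used below (column names SibSp, Parch, Pclass, and Survived come from that dataset):

import pandas as pd

df = pd.read_csv("data/train.csv")
df['FamilySize'] = df['SibSp'] + df['Parch']     # column insertion
print(df.describe())                             # all summary statistics in one command
print(df.groupby('Pclass')['Survived'].mean())   # aggregation via the group-by engine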

Page 6:

Data observation

import pandas as pd

df = pd.read_csv("data/train.csv")

• Results: (the slide shows the first rows of the loaded DataFrame)

Page 7:

Features – (or 𝑋)

• Each row is an observation.

• Each column tells us something about each of our observations, like their name, sex or age.

• These columns are called the features of our dataset.

• Most features have complete data for every observation, like the Survived feature, while some are missing information, like Age.

Page 8:

Types of features

• There are usually three types of variables:

• Numerical variables: Age, SibSp, Parch, etc.

• Categorical variables: Pclass, Sex, Embarked

• Variables with text inside them: Name

Page 9:

scikit-learn

• Collection of machine learning algorithms and tools in Python.

• Built on:

• NumPy – multi-dimensional arrays and matrices (ndarray)

• SciPy – key algorithms and functions core to Python's scientific computing capabilities.

• matplotlib - plotting.

• Used in academia and industry (Spotify, bit.ly, Evernote).

• http://scikit-learn.org/stable/

Page 10:

Processing features - categorical data

• Using scikit-learn Preprocessing library:

• LabelEncoder – converts categorical data to integer labels

• transforms [1, 1, 2, 6] to [0, 0, 1, 2]

• Another option: use pandas.get_dummies()

• OneHotEncoder – converts to a one-hot encoding (use when the category order is not meaningful)

• Example: fit on the dataset [[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]];

then transform([[0, 1, 1]]) gives [[1., 0., 0., 1., 0., 0., 1., 0., 0.]]
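A minimal sketch of both encoders reproducing the numbers above (assumes scikit-learn >= 0.20, where OneHotEncoder infers the categories from integer arrays):

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

le = LabelEncoder()
print(le.fit_transform([1, 1, 2, 6]))          # [0 0 1 2]

ohe = OneHotEncoder()
ohe.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
print(ohe.transform([[0, 1, 1]]).toarray())    # [[1. 0. 0. 1. 0. 0. 1. 0. 0.]]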

Page 11:

Processing features - text

• Bag of words (BOW) - converts text documents to vectors of word counts (in scikit-learn, use CountVectorizer)

• Stop words are words that are filtered out before or after processing natural language data (stop word lists can be obtained from NLTK).
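A minimal bag-of-words sketch (the two-document corpus is made up; stop-word filtering here uses scikit-learn's built-in English list rather than NLTK's):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the ship hit the iceberg", "women and children first"]
vec = CountVectorizer(stop_words='english')
counts = vec.fit_transform(corpus)     # sparse matrix of word counts
print(vec.get_feature_names_out())     # vocabulary (older versions: get_feature_names())
print(counts.toarray())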

Page 12:

tf-idf

• Weight the counts so that frequent tokens get lower weight (inverse document frequency).

• 𝑡𝑓(𝑡, 𝑑) is the term frequency.

• The simplest choice is the raw frequency of a term in a document, i.e. the number of times the term 𝑡 occurs in document 𝑑: |𝑡 ∈ 𝑑|

• 𝑖𝑑𝑓(𝑡, 𝐷) is the inverse document frequency.

• A measure of how much information the word provides, that is, whether the term is common or rare across all documents:

idf(𝑡, 𝐷) = log( 𝑁 / |𝑑 ∈ 𝐷: 𝑡 ∈ 𝑑| )

• 𝑁 = |𝐷| is the number of documents.

• |𝑑 ∈ 𝐷: 𝑡 ∈ 𝑑| is the number of documents in which the term 𝑡 appears.

Page 13:

tf-idf

• The tf-idf feature: tf-idf(𝑡, 𝑑, 𝐷) = tf(𝑡, 𝑑) ⋅ idf(𝑡, 𝐷)

• tf-idf performs better than raw counts most of the time.

• In scikit-learn, use TfidfVectorizer.
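The same sketch as before, with tf-idf weighting instead of raw counts:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the ship hit the iceberg", "women and children first"]
vec = TfidfVectorizer(stop_words='english')
weights = vec.fit_transform(corpus)
print(weights.toarray())    # tokens frequent across documents get lower weights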

Page 14:

Neural-network-based models

• Words which share common contexts in the corpus are located in close proximity to one another in the vector space.

• Packages:

• Word2vec: https://radimrehurek.com/gensim/models/word2vec.html

• Doc2vec (for documents instead of words): https://radimrehurek.com/gensim/models/doc2vec.html

• fastText: https://github.com/facebookresearch/fastText

• More details in next lectures.
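A minimal word2vec sketch with gensim (assumes gensim 4.x, where the parameter is vector_size= rather than the older size=; the toy corpus is made up):

from gensim.models import Word2Vec

sentences = [["women", "and", "children", "first"],
             ["the", "ship", "hit", "the", "iceberg"]]
model = Word2Vec(sentences, vector_size=50, min_count=1)
print(model.wv["iceberg"])    # the learned 50-dimensional vector for "iceberg"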

Page 15:

Natural Language Toolkit (NLTK)

• Over 50 corpora and lexical resources

• Text processing libraries

• Example: adding part-of-speech (POS) tags

• You should also check TextBlob for text processing; it is often simpler.

>>> import nltk
>>> sentence = """At eight o'clock on Thursday morning
... Arthur didn't feel very good."""
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning', 'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
>>> tagged = nltk.pos_tag(tokens)
>>> tagged[0:6]
[('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'), ('Thursday', 'NNP'), ('morning', 'NN')]

Page 16:

Cleaning the data

• The features Ticket and Cabin have many missing values, so they can't add much value to our analysis.

• To handle this we will drop them from the DataFrame to preserve the integrity of our dataset.

df = df.drop(['Ticket','Cabin'], axis=1)

Page 17:

What about names?

• Is there a difference between: Miss, Mrs? Master, Mr?

• Stemming – reduce a word to the part that is common to all its inflected variants (use NLTK)

• Examples:

• cats, catlike, catty, etc., based on the root "cat"

• waits, waited, waiting, based on the root "wait"

• Other things to consider (see the sketch after this list):

• Is the specific name really relevant?

• Do capital letters carry information?

• Punctuation symbols?
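One common way to use the Name column is to pull out the honorific title; a minimal sketch (the regex and the Title column are illustrative, not from the slides):

df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)  # "Braund, Mr. Owen" -> "Mr"
print(df['Title'].value_counts())                                      # Mr, Miss, Mrs, Master, ...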

Page 18:

NaN values

• Naively, just drop any observation containing a NaN value.

• In pandas, use df = df.dropna()

• Better – impute with the mean, median, or most frequent value.

• Check the Imputer class in scikit-learn.

• Even better – predict the missing values using the other features.

• For instance, Sex, family size, class, etc. can tell us the Age.

• There are many intelligent ways of handling uncertainty in data.

• Relevant course: Uncertainty in Databases by Benny Kimelfeld.
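A minimal mean-imputation sketch (SimpleImputer lives in sklearn.impute in scikit-learn >= 0.20; the Imputer class mentioned above was its older name):

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')            # also: 'median', 'most_frequent'
df[['Age']] = imputer.fit_transform(df[['Age']])    # fill missing ages with the mean age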

Page 19:

Labels – (or 𝑌)

• Labels are the data we want to predict given the features.

• In our case, we want to predict the Survived column.

• A single column of binary values:

• Survived – 1, Not survived – 0

Page 20:

Types of labels

• Classification problems:

• Single column, binary values (e.g. a review is positive or negative)

• Multiple columns, binary values

• Multilabel (e.g. a review is about kitchen/books/movies)

• Regression problems:

• Single column, real values (e.g. what is the salary of a 40-year-old man?)

• Multiple columns, real values

(The slide illustrates each type with example label values: 1/0; 1/0 1/0; 1/2/3…/n; 214.25; 214.25 335.5.)

Page 21:

Reminder: Regression

• Regression predicts a continuous-valued output for a given input.

• Data: pairs (𝑥𝑖, 𝑦𝑖)

• Goal: 𝑓: 𝑋 → 𝑌, modeling 𝑃(𝑌|𝑋) as 𝑌 = 𝑓(𝑥) + 𝜖, where 𝜖 ~ 𝑁(0, 𝜎) is the noise term.

Page 22:

Reminder: Linear Regression

• Assumes a linear relationship between the inputs and the outputs.

• Data: pairs (𝑥𝑖, 𝑦𝑖)

• Therefore: 𝑓(𝑥) = 𝑤0 + ∑𝑖 𝑤𝑖𝑥𝑖

• ⇒ 𝑓(𝑥) = 𝑤0 + 𝑤1𝑥 for a single feature

Page 23:

Example: predict the commute time for a new person who lives 1.1 miles from campus.

Page 24:

Now, you want to predict commute time for a new person, who lives 1.1 miles from campus.

(Reading off the fitted line at 𝑥 = 1.1 gives a predicted commute time of about 23 minutes.)

Page 25:

How can we find this line?

Page 26:

How can we find this line?

• Define

• xi: input, distance from campus

• yi: output, commute time

• We want to predict y for an unknown x

• Assume

• In general: y = f(x) + ε

• For 1-D linear regression: f(x) = w0 + w1x

• We want to learn the parameters w

Page 27:

We can learn w from the observed data by maximizing the conditional likelihood.

Page 28:

Maximizing the conditional likelihood is equivalent to minimizing the least-squares error.
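The algebra was an image on the slide; a standard reconstruction under the Gaussian noise assumption 𝜖 ~ 𝑁(0, 𝜎) from before:

ŵ = argmax𝑤 ∑𝑙 ln 𝑃(𝑦𝑙 | 𝑥𝑙, 𝑤) = argmax𝑤 ∑𝑙 [ −(𝑦𝑙 − 𝑓(𝑥𝑙; 𝑤))² / 2𝜎² ] + const = argmin𝑤 ∑𝑙 (𝑦𝑙 − 𝑓(𝑥𝑙; 𝑤))²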

Page 29:

Classification

• We've seen in the course:

• Naïve Bayes

• Decision trees

• Logistic Regression

Page 30:

Logistic regression is a discriminative approach to classification.

• Discriminative: directly estimates P(Y|X)

• Only concerned with discriminating (differentiating) between the classes Y

• In contrast, naïve Bayes is a generative classifier:

• Estimates P(Y) and P(X|Y), and uses Bayes' rule to calculate P(Y|X)

• Explains how the data are generated, given the class label Y

• Both logistic regression and naïve Bayes use their estimates of P(Y|X) to assign a class to an input X; the difference is in how they arrive at these estimates, and in their assumptions.

• Logistic regression doesn’t use the naïve Bayes assumption for training.

Page 31:

Assumption of logistic regression

• Consider learning f: X → Y, where

• X is a vector of real-valued features, <X1 … Xn>

• Y is boolean

• Assume all Xi are conditionally independent given Y

• Model P(Xi | Y = yk) as Gaussian N(𝜇ik, 𝜎i)

• Model P(Y) as Bernoulli(𝜋)

• What does that imply about the form of P(Y|X)?

• By using Bayes' rule, we get the sigmoid function.
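The resulting equation was an image on the slide; the standard result of this derivation is the sigmoid

𝑃(𝑌 = 1 | 𝑋) = 1 / (1 + exp(−(𝑤0 + ∑𝑖 𝑤𝑖𝑋𝑖)))

where the weights 𝑤𝑖 are functions of the Gaussian parameters 𝜇𝑖𝑘 and 𝜎𝑖.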

Page 32:

The logistic function

(The slide shows plots of the logistic function, annotated with parameters a and b.)

Page 33:

Logistic regression models probabilities with the logistic function.

• We want to predict Y = 1 for an input X when P(Y=1|X) ≥ 0.5

• (The slide plots P(Y=1|X); the decision boundary is the hyperplane 𝑤0 + ∑𝑤𝑖𝑥𝑖 = 0, with Y = 1 predicted on one side and Y = 0 on the other.)

Page 34:

Maximize the conditional likelihood to find the weights w = [w0,w1,…,wd].
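The likelihood expression was an image on the slide; the standard conditional log-likelihood for logistic regression, writing 𝑧𝑙 = 𝑤0 + ∑𝑖 𝑤𝑖𝑋𝑖𝑙, is

𝑙(𝑤) = ∑𝑙 [ 𝑌𝑙 𝑧𝑙 − ln(1 + exp(𝑧𝑙)) ]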

Page 35:

How can we optimize this function?

• Concave

• No closed-form solution for w

• Calculate the gradient: ∂𝑙(𝑤)/∂𝑤𝑖 = ∑𝑙 𝑥𝑖𝑙 (𝑌𝑙 − 𝑃(𝑌𝑙 = 1|𝑋𝑙, 𝑊))

Page 36:

Gradient descent can optimize differentiable functions.

• Suppose you have a differentiable function f(x)

• Gradient descent:

• Choose a starting point 𝑥(0)

• Repeat until no change: 𝑥(𝑡+1) ← 𝑥(𝑡) − 𝜂 ∇𝑓(𝑥(𝑡))

• (Updated value for the optimum = previous value − step size × gradient of f, evaluated at the current x.)
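A minimal 1-D gradient-descent sketch (the function f(x) = (x − 3)² and all names are illustrative, not from the slides):

def gradient_descent(grad, x0, eta=0.1, tol=1e-8, max_iter=10000):
    x = x0
    for _ in range(max_iter):
        step = eta * grad(x)     # step size times the gradient at the current x
        if abs(step) < tol:      # "repeat until no change"
            break
        x -= step
    return x

print(gradient_descent(lambda x: 2 * (x - 3), x0=0.0))   # converges to ~3.0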

Page 37:

Here is the trajectory of gradient descent on a quadratic function.

Page 38:

How does step size affect the result?

• (A step that is too small converges slowly; one that is too large can overshoot the minimum or even diverge.)

Page 39:

(This slide repeats the gradient-descent summary from Page 36.)

Page 40:

Regularization

• There are no constraints on the search space of 𝑤.

• This can cause the algorithm to overfit the training examples.

• We want to add a simple prior on the 𝑤 parameters.

Page 41:

MAP instead of MLE

• Put a Gaussian prior on the weights: 𝑤 ~ 𝑁(0, 1/(2𝜆))

• How it affects our GD step:

𝑤𝑖 ← 𝑤𝑖 − 2𝜂𝜆𝑤𝑖 + 𝜂 ∑𝑙 𝑥𝑖𝑙 (𝑌𝑙 − 𝑃(𝑌𝑙 = 1|𝑋𝑙, 𝑊))

• The regularization term −2𝜂𝜆𝑤𝑖 pulls the weights toward zero.

Page 42:

Why small weights

• We bias the weights to be small.

• ⇒ Simpler model: the sigmoid function is "less sure".

sigmoid(𝑥) = 1 / (1 + 𝑒^(−𝑤𝑥))

(The slide plots 𝑃(𝑥) against 𝑥: smaller weights 𝑤 give a flatter, less confident sigmoid.)

Page 43:

Back to our problem - visualizing the data

• Graph of how many survived.

• Important to see if the data is skewed.

alpha_bar_chart = 0.55   # bar transparency; any value in (0, 1] works
df.Survived.value_counts().plot(kind='bar', alpha=alpha_bar_chart)

Page 44:

Visualizing – understanding the data

df.Age[df.Pclass == 1].plot(kind='kde')
df.Age[df.Pclass == 2].plot(kind='kde')
df.Age[df.Pclass == 3].plot(kind='kde')

df.Embarked.value_counts().plot(kind='bar', alpha=alpha_bar_chart)

Page 45:

Who Survived?

• Women and upper-class passengers were more likely to survive.

• Understanding the most basic relationships is essential for building a more insightful model.

Page 46:

Classification

• It's very simple to run classification algorithms in scikit-learn.

• Simple call:

• You can replace LogisticRegression with any classification class.

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(C=1.0, penalty='l2', random_state=None)
clf.fit(X[train], y[train])
print(clf.score(X[test], y[test]))

Page 47:

Popular classification modules

Page 48:

How to choose?

• Also use GridSearchCV to search over hyperparameter settings.
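A minimal GridSearchCV sketch (the parameter grid is illustrative; X and y are the feature matrix and labels from before):

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

grid = GridSearchCV(LogisticRegression(),
                    param_grid={'C': [0.01, 0.1, 1, 10]},   # candidate regularization strengths
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)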

Page 49:

Use pipelines to easily combine techniques

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

pipeline = Pipeline([
    ('bow', CountVectorizer(analyzer=split_into_lemmas)),  # strings to token integer counts; split_into_lemmas is a user-defined tokenizer
    ('tfidf', TfidfTransformer()),                         # integer counts to weighted tf-idf scores
    ('classifier', MultinomialNB()),                       # train on tf-idf vectors with a Naive Bayes classifier
])

Page 50:

Pickle

• Use pickle to save the model after training.

import pickle

with open('survival_classifier.pkl', 'wb') as clf_file:
    pickle.dump(clf, clf_file)

Page 51:

Evaluation - Cross validation

• We split the data into two different parts, a training set and a validation set.

Page 52:

KFolds

• Split the dataset into k consecutive folds.

• Each fold is used as the validation set once, while the remaining k − 1 folds form the training set.
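A minimal KFold sketch (assumes X and y are NumPy arrays and clf is a classifier, as above):

from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in kf.split(X):
    clf.fit(X[train_idx], y[train_idx])
    print(clf.score(X[val_idx], y[val_idx]))    # one validation score per fold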

Page 53:

Skewed data sets

• Skewed datasets appear in classification problems when one class is over-represented in the dataset.

• Example: fraud detection (90% of activity is normal)

Page 54:

Cross validation – skewed datasets

• Use stratified splitting, which preserves the class proportions in each fold.
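The same loop as above with stratified splitting (StratifiedKFold needs y in order to preserve the class proportions in each fold):

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5)
for train_idx, val_idx in skf.split(X, y):
    clf.fit(X[train_idx], y[train_idx])
    print(clf.score(X[val_idx], y[val_idx]))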

Page 55:

Model in one command

from sklearn.model_selection import cross_val_score

scores = cross_val_score(
    pipeline,            # steps to convert raw messages into models
    X,                   # training data
    y,                   # training labels
    cv=10,               # split data randomly into 10 parts: 9 for training, 1 for scoring
    scoring='accuracy',  # which scoring metric?
    n_jobs=-1,           # -1 = use all cores = faster
)
print(scores)

Page 56:

Evaluation Metrics

• How are we going to evaluate our results? What is the evaluation metric or objective?

• Basic method: accuracy = (number of correct predictions) / (total number of predictions)

• Problem – accuracy is a bad metric for skewed datasets.

• We should use AUC instead; more on this in the next classes.