ORIE 4741: Learning with Big Messy Data

Feature Engineering

Professor Udell

Operations Research and Information Engineering, Cornell

September 30, 2019


Announcements

- Guest lecturer Thursday: Milinda Lakkam from LinkedIn
- Project proposals due this Thursday (midnight) on GitHub
- New project idea: course evaluations
- Homework 2 due Thursday next week
- Section this week: GitHub basics (for the project)
- Want to be a grader?


Outline

Feature engineering

Polynomial transformations

Boolean, nominal, ordinal, text, . . .

Time series


Linear models

To fit a linear model (= linear in the parameters w):

- pick a transformation φ : X → R^d
- predict y using a linear function of φ(x):

      h(x) = w^T φ(x) = ∑_{i=1}^d w_i (φ(x))_i

- we want h(x_i) ≈ y_i for every i = 1, …, n

Q: why do we want a model linear in the parameters w?
A: because the optimization problems are easy to solve! e.g., just use least squares.
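
As a concrete sketch in Julia (the course demos' language; the synthetic data and the particular φ below are illustrative assumptions, not from the slides), least squares recovers w from examples (x_i, y_i):

```julia
# fit w by least squares so that w'ϕ(x_i) ≈ y_i
ϕ(x) = [x; 1.0]                    # example feature map: append an offset

n = 100
X = randn(n, 2)                    # raw inputs x_i, one per row
y = X * [2.0, -1.0] .+ 0.5 .+ 0.1 .* randn(n)   # synthetic targets

Φ = vcat([ϕ(X[i, :])' for i in 1:n]...)   # n × d design matrix
w = Φ \ y                          # least-squares solution
h(x) = w' * ϕ(x)                   # the fitted linear model
```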


Feature engineering

How to pick φ : X → R^d?

- so the response y will depend linearly on φ(x)
- so d is not too big


if you think this looks like a hack: you’re right


examples:

- adding offset
- standardizing features
- polynomial fits
- products of features
- autoregressive models
- local linear regression
- transforming Booleans
- transforming ordinals
- transforming nominals
- transforming images
- transforming text
- concatenating data
- all of the above

https://xkcd.com/2048/


Adding offset

- X = R^{d−1}
- let φ(x) = (x, 1)
- now h(x) = w^T φ(x) = w_{1:d−1}^T x + w_d


Fitting a polynomial

- X = R
- let

      φ(x) = (1, x, x^2, x^3, …, x^{d−1})

  be the vector of all monomials in x of degree < d

- now h(x) = w^T φ(x) = w_1 + w_2 x + w_3 x^2 + ⋯ + w_d x^{d−1}
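
A minimal sketch in Julia (the data here is made up): build the Vandermonde-style design matrix of monomials and fit by least squares, as before.

```julia
# polynomial feature map: all monomials of degree < d
ϕ(x, d) = [x^j for j in 0:d-1]

# hypothetical 1-D data
n, d = 50, 4
x = range(-2, 2; length=n)
y = 1 .- x .+ 0.5 .* x.^3 .+ 0.1 .* randn(n)

Φ = vcat([ϕ(xi, d)' for xi in x]...)   # n × d design matrix
w = Φ \ y                              # least-squares fit
h(x) = w' * ϕ(x, d)
```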


Demo: Linear models

https://github.com/ORIE4741/demos


Fitting a multivariate polynomial

- X = R^2
- pick a maximum degree k
- let

      φ(x) = (1, x_1, x_2, x_1^2, x_1x_2, x_2^2, x_1^3, x_1^2x_2, x_1x_2^2, x_2^3, …, x_2^k)

  be the vector of all monomials in x_1 and x_2 of degree ≤ k

- now h(x) = w^T φ(x) can fit any polynomial of degree ≤ k on X

and similarly for X = R^d … (a sketch of generating these features follows below)
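
One way to generate these features programmatically (a sketch; the enumeration loop and function name are my own, not from the slides):

```julia
# all monomials x1^a * x2^(t-a) of total degree t ≤ k, for x ∈ R^2,
# ordered by total degree to match the slide: 1, x1, x2, x1^2, x1*x2, x2^2, …
function ϕ(x, k)
    feats = Float64[]
    for t in 0:k, a in t:-1:0
        push!(feats, x[1]^a * x[2]^(t - a))
    end
    feats
end

ϕ([1.0, 2.0], 2)   # → [1.0, 1.0, 2.0, 1.0, 2.0, 4.0]
```

Note that this produces d = (k + 1)(k + 2)/2 features, so d grows quickly with the degree and with the input dimension.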


Demo: Linear models

polynomial classification

https://github.com/ORIE4741/demos


Linear classification

[Figure: positive and negative examples in 1-D with a linear classification boundary]


Polynomial classification

[Figure: the same data with a polynomial classification boundary]


Example: fitting a multivariate polynomial

- X = R^2, Y = {−1, 1}
- let

      φ(x) = (1, x_1, x_2, x_1^2, x_1x_2, x_2^2)

  be the vector of all monomials of degree ≤ 2

- now let h(x) = sign(w^T φ(x))

Q: if h(x) = sign(5 − 3x_1 + 2x_2 + x_1^2 − x_1x_2 + 2x_2^2), what is {x : h(x) = 1}?

A: An ellipse! (The quadratic part x_1^2 − x_1x_2 + 2x_2^2 is positive definite, so the decision boundary w^T φ(x) = 0 is an ellipse.)


Example: fitting a multivariate polynomial

With the same φ and h as above:

Q: if h(x) = sign(5 − 3x_1 + 2x_2 + x_1^2 − x_1x_2 − 2x_2^2), what is {x : h(x) = −1}?

A: A hyperbola! (Now the quadratic part x_1^2 − x_1x_2 − 2x_2^2 is indefinite, so the boundary is a hyperbola.)


Notation: boolean indicator function

define

    1(statement) = { 1,  statement is true
                     0,  statement is false

examples:

- 1(1 < 0) = 0
- 1(17 = 17) = 1


Boolean variables

- X = {true, false}
- let φ(x) = 1(x)


Boolean expressions

- X = {true, false}^2 = {(true, true), (true, false), (false, true), (false, false)}
- let φ(x) = [1(x_1), 1(x_2), 1(x_1 and x_2), 1(x_1 or x_2)]
- equivalent: polynomials in [1(x_1), 1(x_2)] span the same space
- encodes logical expressions! (see the sketch below)
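
A sketch of this feature map in Julia (the tuple representation of X is my own choice):

```julia
# boolean pair → four features encoding x1, x2, AND, OR
ϕ(x::Tuple{Bool,Bool}) = Float64[x[1], x[2], x[1] && x[2], x[1] || x[2]]

ϕ((true, false))   # → [1.0, 0.0, 0.0, 1.0]
```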


Nominal values: one-hot encoding

- nominal data: e.g., X = {apple, orange, banana}
- let

      φ(x) = [1(x = apple), 1(x = orange), 1(x = banana)]

- called one-hot encoding: only one element is non-zero

extension: sets (see the sketch below)
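
A minimal one-hot encoder in Julia (the `levels` argument name is my own; for sets, the same pattern with `in` gives a multi-hot encoding):

```julia
# one-hot: indicator of x against each level of the nominal variable
onehot(x, levels) = Float64[x == v for v in levels]

fruits = ["apple", "orange", "banana"]
onehot("orange", fruits)                 # → [0.0, 1.0, 0.0]

# extension to sets: multi-hot encoding, one indicator per level present
multihot(s, levels) = Float64[v in s for v in levels]
multihot(["apple", "banana"], fruits)    # → [1.0, 0.0, 1.0]
```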


Nominal values: look up features!

why not use other information known about each item?

- X = {apple, orange, banana}: price, calories, weight, …
- X = zip code: average income, temperature in July, walk score, % residential, …
- …

database lingo: join tables on the nominal value (a sketch below)
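
For instance, with the DataFrames.jl package (an assumption; the slides don't name a library), looking up features amounts to a left join on the nominal column:

```julia
using DataFrames

# hypothetical observations and a feature table keyed on the nominal value
obs    = DataFrame(fruit = ["apple", "banana", "apple"], y = [1.0, 0.3, 0.9])
lookup = DataFrame(fruit = ["apple", "orange", "banana"],
                   price = [0.5, 0.8, 0.25], calories = [95, 45, 105])

# join tables on the nominal value, replacing the name with its features
leftjoin(obs, lookup, on = :fruit)
```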


Ordinal values: real encoding

- ordinal data: e.g., X = {Stage I, Stage II, Stage III, Stage IV}
- let

      φ(x) = { 1,  x = Stage I
               2,  x = Stage II
               3,  x = Stage III
               4,  x = Stage IV

- the default encoding


Ordinal values: real encoding

- X = {Stage I, Stage II, Stage III, Stage IV}
- Y = R, number of years lived after diagnosis
- use the real encoding φ to transform the ordinal data
- fit a linear model with offset to predict y as wφ(x) + b

Suppose the model predicts a person diagnosed with Stage II cancer will survive 2 more years, and a person diagnosed with Stage I cancer will survive 4 more years.

Q: What is w? b?
A: w + b = 4 and 2w + b = 2, so w = −2, b = 6.
Q: How long does the model predict a person with Stage IV cancer will survive?
A: −2 years: 4w + b = −8 + 6 = −2. (The real encoding forces equally spaced effects across stages.)


Ordinal values: boolean encoding

- ordinal data: e.g., X = {Stage I, Stage II, Stage III, Stage IV}
- let

      φ(x) = [1(x ≥ Stage II), 1(x ≥ Stage III), 1(x ≥ Stage IV)]
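
A sketch of this encoding in Julia (representing each stage by its rank 1–4; the rank representation is my own choice):

```julia
# boolean ("thermometer") encoding of an ordinal with 4 levels:
# one indicator per threshold the value meets or exceeds
ϕ(rank::Int) = Float64[rank ≥ 2, rank ≥ 3, rank ≥ 4]

ϕ(1)   # Stage I   → [0.0, 0.0, 0.0]
ϕ(3)   # Stage III → [1.0, 1.0, 0.0]
```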


Ordinal values: boolean encoding

- X = {Stage I, Stage II, Stage III, Stage IV}
- Y = R, number of years lived after diagnosis
- define the transformation φ : X → R^3 as

      φ(x) = [1(x ≥ Stage II), 1(x ≥ Stage III), 1(x ≥ Stage IV)]

- fit a linear model with offset to predict y as w^T φ(x) + b

Suppose the model predicts a person diagnosed with Stage II cancer will survive 2 more years, and a person diagnosed with Stage I cancer will survive 4 more years.

Q: What is w? b?
A: φ(Stage I) = (0, 0, 0) gives b = 4; φ(Stage II) = (1, 0, 0) gives w_1 + b = 2, so w_1 = −2; w_2 and w_3 are not determined.
Q: How long does the model predict a person with Stage IV cancer will survive?
A: can't say without more information: the prediction w_1 + w_2 + w_3 + b depends on the undetermined w_2 and w_3.


Text

X = sentences, documents, tweets, …

- bag of words model (one-hot encoding; see the sketch below):
  - pick a set of words {w_1, …, w_d}
  - φ(x) = [1(x contains w_1), …, 1(x contains w_d)]
  - ignores the order of words in the sentence
- pre-trained neural networks:
  - sentiment analysis: https://medium.com/@b.terryjack/nlp-pre-trained-sentiment-analysis-1eb52a9d742c
  - Universal Sentence Encoder (USE) embedding: https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/semantic_similarity_with_tf_hub_universal_encoder.ipynb
  - lots of others: https://modelzoo.co/
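
A minimal bag-of-words sketch in Julia (the whitespace tokenization via `split` and the tiny vocabulary are illustrative assumptions):

```julia
# bag of words: indicator of each vocabulary word appearing in the text
vocab = ["cat", "dog", "pizza"]
ϕ(text) = Float64[w in split(lowercase(text)) for w in vocab]

ϕ("My dog ate my pizza")   # → [0.0, 1.0, 1.0]
```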


Neural networks: whirlwind primer

NN(x) = σ(W_1 σ(W_2 ⋯ σ(W_ℓ x)))

- σ is a nonlinearity applied elementwise to a vector, e.g.
  - ReLU: σ(x) = max(x, 0)
  - sigmoid: σ(x) = 1/(1 + exp(−x))
- each W_i is a matrix
- trained on very large datasets, e.g., Wikipedia, YouTube
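
A toy forward pass in Julia matching this formula (the random weights and layer sizes are purely illustrative; real networks learn the W_i from data):

```julia
# two-layer network: NN(x) = σ(W1 σ(W2 x)) with elementwise ReLU
σ(x) = max.(x, 0.0)

W2 = randn(5, 3)    # first layer, applied to the input x ∈ R^3
W1 = randn(2, 5)    # outermost layer

NN(x) = σ(W1 * σ(W2 * x))
NN(randn(3))        # → a vector in R^2
```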


Why not use deep learning?

towards a solution: https://arxiv.org/abs/1907.10597


Fitting a time series

- given a time series x_t ∈ R, t = 1, …, T
- want to predict the value at the next time T + 1
- input: time index t and the time series x_{1:t} up to time t
- let

      φ(t, x) = (x_{t−1}, x_{t−2}, …, x_{t−d})

  (called the "lagged outcomes")

- now h(t, x) = w^T φ(t, x) = w_1 x_{t−1} + w_2 x_{t−2} + ⋯ + w_d x_{t−d}

also called an auto-regressive (AR) model (a fitting sketch follows below)
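
A least-squares AR(d) fit in Julia (a sketch on synthetic data; the lag-matrix construction and the toy AR(2) simulation are my own):

```julia
# fit an AR(d) model by least squares: x_t ≈ w_1 x_{t-1} + ... + w_d x_{t-d}
T, d = 200, 3
x = zeros(T)
for t in 3:T                           # simulate a toy AR(2) series
    x[t] = 0.6 * x[t-1] - 0.2 * x[t-2] + 0.1 * randn()
end

Φ = vcat([x[t-1:-1:t-d]' for t in d+1:T]...)   # rows are lagged outcomes
y = x[d+1:T]                           # targets: the next value
w = Φ \ y

x_next = w' * x[T:-1:T-d+1]            # prediction for time T + 1
```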


Fitting a local model: local neighbors

- X = R^m
- pick a set of points µ_i ∈ R^m, i = 1, …, d, and a radius δ ∈ R
- let φ(x) = [1(‖x − µ_1‖_2 ≤ δ), …, 1(‖x − µ_d‖_2 ≤ δ)]

often, d = n and µ_i = x_i (a sketch follows below)
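
In Julia (a sketch; `norm` comes from the standard LinearAlgebra library, and the centers here are made up):

```julia
using LinearAlgebra: norm

# indicator features: which centers μ_i lie within distance δ of x?
ϕ(x, μs, δ) = Float64[norm(x - μ) ≤ δ for μ in μs]

μs = [randn(2) for _ in 1:5]    # often the training points themselves
ϕ(randn(2), μs, 1.0)
```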


Fitting a local model: 1 nearest neighbor

- X = R^m
- pick a set of points µ_i ∈ R^m, i = 1, …, d
- let δ = min(‖x − µ_1‖_2, …, ‖x − µ_d‖_2)
- let φ(x) = [1(‖x − µ_1‖_2 ≤ δ), …, 1(‖x − µ_d‖_2 ≤ δ)]

often, d = n and µ_i = x_i


Fitting a local model: smoothing

- X = R^m
- pick a set of points µ_i ∈ R^m, i = 1, …, d, and a parameter α ∈ R
- let

      φ(x) = (exp(−α‖x − µ_1‖_2), …, exp(−α‖x − µ_d‖_2))

often, d = n and µ_i = x_i (a sketch follows below)
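
The same pattern with smooth weights instead of hard indicators (a sketch; this is a radial-basis-style feature map):

```julia
using LinearAlgebra: norm

# smooth local features: nearby centers get weight near 1, far ones near 0
ϕ(x, μs, α) = [exp(-α * norm(x - μ)) for μ in μs]

μs = [randn(2) for _ in 1:5]
ϕ(randn(2), μs, 2.0)
```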


Demo: Crime

preprocessing: https://juliabox.com/notebooks/demos/Crime.ipynb

predicting: https://juliabox.com/notebooks/demos/Predicting%20crime.ipynb


Review

- linear models are linear in the parameters w
- we can fit many different models by picking the feature mapping φ : X → R^d


What makes a good project?

- Clear outcome to predict
- Linear regression should do something interesting
- A data science project; not an NLP or Vision project
- A new, interesting model; not a Kaggle competition
