GDG DevFest Seoul 2017: Codelab - Time Series Analysis for Kaggle using TensorFlow

Post on 21-Jan-2018

Taegyun Jeon (전태균) and Seunghyun Jeon (전승현)

Developers at Satrec Initiative

Time Series Analysis: Build It in TensorFlow and Take On Kaggle

Time Series Analysis

Introduction to Kaggle

KaggleZeroToAll

Contents

After completing this codelab, you will:

1. Understand time series problems

2. Be able to solve problems on Kaggle

3. Have uploaded your own model to the Kaggle Leaderboard

Time Series Analysis

● Time Series Analysis

● Models for Time Series Analysis: AR, MA, ARMA, ARIMA, RNN

● TensorFlow TimeSeries API (TFTS)

Time series data

● Stock values

● Economic variables

● Weather

● Sensor: Internet-of-Things

● Energy demand

● Signal processing

● Sales forecasting

The problem

● Standard Supervised Learning

○ IID assumption

○ Same distribution for training and test data

○ Distributions fixed over time (stationarity)

● Time Series

○ None of these assumptions hold!

Autoregressive (AR) Models

● AR(p) model: $Y_t = \sum_{i=1}^{p} a_i Y_{t-i} + \epsilon_t$

: Linear generative model based on the pth-order Markov assumption

○ $\epsilon_t$: zero-mean uncorrelated random variables with variance $\sigma^2$

○ $a_1, \dots, a_p$: autoregressive coefficients

○ $Y_t$: observed stochastic process
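As a rough illustration of the AR(p) model, the sketch below simulates a series and recovers its coefficients by ordinary least squares. It assumes the standard form Y_t = a_1 Y_{t-1} + ... + a_p Y_{t-p} + eps_t with Gaussian noise; the function names, coefficient values, and zero initial conditions are made up for this example and do not come from the slides.

```python
import numpy as np

def simulate_ar(coeffs, n_steps, sigma=1.0, seed=0):
    """Draw n_steps samples from Y_t = sum_i coeffs[i-1] * Y_{t-i} + eps_t."""
    rng = np.random.default_rng(seed)
    p = len(coeffs)
    y = np.zeros(n_steps + p)  # the first p entries are zero initial values
    eps = rng.normal(0.0, sigma, size=n_steps + p)
    for t in range(p, n_steps + p):
        # y[t - p:t][::-1] is (Y_{t-1}, ..., Y_{t-p})
        y[t] = np.dot(coeffs, y[t - p:t][::-1]) + eps[t]
    return y[p:]

def fit_ar(y, p):
    """Estimate AR(p) coefficients by ordinary least squares."""
    # The row for target y[t] holds the lags (y[t-1], ..., y[t-p]).
    lags = np.column_stack([y[p - i:len(y) - i] for i in range(1, p + 1)])
    target = y[p:]
    coeffs, *_ = np.linalg.lstsq(lags, target, rcond=None)
    return coeffs

true_coeffs = np.array([0.6, -0.3])  # hypothetical, chosen to be stationary
series = simulate_ar(true_coeffs, n_steps=5000)
estimated = fit_ar(series, p=2)
print(estimated)
```

With a few thousand samples the estimates should land close to the coefficients used for the simulation.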

Moving Average (MA)

● MA(q) model: $Y_t = \epsilon_t + \sum_{j=1}^{q} b_j \epsilon_{t-j}$

: Linear generative model for the noise term based on the qth-order Markov assumption

○ $b_1, \dots, b_q$: moving average coefficients

ARMA Model

● ARMA(p,q) model: $Y_t = \sum_{i=1}^{p} a_i Y_{t-i} + \epsilon_t + \sum_{j=1}^{q} b_j \epsilon_{t-j}$

: Generative linear model that combines the AR(p) and MA(q) models
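The way ARMA combines the two models can be sketched in a few lines. This assumes the usual form Y_t = sum_i a_i Y_{t-i} + eps_t + sum_j b_j eps_{t-j}, with all values before the start of the series treated as zero; `simulate_arma` and the coefficients are hypothetical names and numbers for illustration only.

```python
import numpy as np

def simulate_arma(ar_coeffs, ma_coeffs, eps):
    """Y_t = sum_i a_i Y_{t-i} + eps_t + sum_j b_j eps_{t-j}, zero before t = 0."""
    p, q = len(ar_coeffs), len(ma_coeffs)
    y = np.zeros(len(eps))
    for t in range(len(eps)):
        ar_part = sum(ar_coeffs[i] * y[t - 1 - i] for i in range(p) if t - 1 - i >= 0)
        ma_part = sum(ma_coeffs[j] * eps[t - 1 - j] for j in range(q) if t - 1 - j >= 0)
        y[t] = ar_part + eps[t] + ma_part
    return y

rng = np.random.default_rng(0)
noise = rng.normal(size=200)
arma_series = simulate_arma([0.5], [0.4, 0.2], noise)  # ARMA(1, 2)
pure_ma = simulate_arma([], [0.4, 0.2], noise)         # p = 0 gives MA(2)
pure_ar = simulate_arma([0.5], [], noise)              # q = 0 gives AR(1)
```

Setting either coefficient list to empty recovers the pure MA or pure AR model, which is the sense in which ARMA generalizes both.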

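Stationarity

● Definition: a sequence of random variables $\{Z_t\}$ is stationary if its distribution is invariant to shifting in time, i.e. $(Z_t, \dots, Z_{t+m})$ and $(Z_{t+k}, \dots, Z_{t+m+k})$ have the same distribution for all $t$, $m$, and $k$.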

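Lag Operator

● Definition: the lag operator $L$ is defined by $L Y_t = Y_{t-1}$

● ARMA model in terms of the lag operator: $\left(1 - \sum_{i=1}^{p} a_i L^i\right) Y_t = \left(1 + \sum_{j=1}^{q} b_j L^j\right) \epsilon_t$

● The characteristic polynomial $P(z) = 1 - \sum_{i=1}^{p} a_i z^i$ can be used to study properties of this stochastic process.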
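One property the characteristic polynomial reveals is stationarity. The sketch below assumes the convention P(z) = 1 - a_1 z - ... - a_p z^p and uses the standard result that the AR part is stationary when every root of P lies strictly outside the unit circle. The helper names are hypothetical.

```python
import numpy as np

def ar_characteristic_roots(ar_coeffs):
    """Roots of P(z) = 1 - a_1 z - ... - a_p z^p."""
    # np.roots expects coefficients from the highest power down to the constant.
    poly = np.concatenate([-np.asarray(ar_coeffs, dtype=float)[::-1], [1.0]])
    return np.roots(poly)

def is_stationary(ar_coeffs, tol=1e-8):
    """True when every root of P(z) lies strictly outside the unit circle."""
    roots = ar_characteristic_roots(ar_coeffs)
    return bool(np.all(np.abs(roots) > 1.0 + tol))

stable = is_stationary([0.6, -0.3])  # both roots have modulus about 1.83
random_walk = is_stationary([1.0])   # P(z) = 1 - z has the unit root z = 1
explosive = is_stationary([1.2])     # root 1 / 1.2 lies inside the unit circle
```

The random walk case is the unit-root situation that motivates differencing.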

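ARIMA Model

● Definition: Non-stationary processes can be modeled using processes whose characteristic polynomial has unit roots.

● A characteristic polynomial with unit roots can be factored: $P(z) = R(z)(1 - z)^D$, where $R(z)$ has no unit roots.

● An ARIMA(p, D, q) model is an ARMA(p,q) model for $(1 - L)^D Y_t$: $\left(1 - \sum_{i=1}^{p} a_i L^i\right) (1 - L)^D Y_t = \left(1 + \sum_{j=1}^{q} b_j L^j\right) \epsilon_t$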
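In practice the "I" in ARIMA amounts to differencing the series D times before fitting an ARMA model. A minimal sketch, assuming a plain random walk as the non-stationary input (the `difference` helper is a hypothetical name wrapping `np.diff`):

```python
import numpy as np

def difference(y, d=1):
    """Apply (1 - L)^d to a series; each pass shortens it by one sample."""
    return np.diff(np.asarray(y, dtype=float), n=d)

rng = np.random.default_rng(0)
shocks = rng.normal(size=500)
random_walk = np.cumsum(shocks)           # Y_t = Y_{t-1} + eps_t, a unit-root process
recovered = difference(random_walk, d=1)  # (1 - L) Y_t = eps_t for t >= 1

quadratic = difference([1, 4, 9, 16], d=2)  # second differences of t^2 are constant
```

Differencing the random walk once gives back the underlying noise, which is stationary.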

Other Extensions

● Further variants:

○ Models with seasonal components (SARIMA)

○ Models with side information (ARIMAX)

○ Models with long-memory (ARFIMA)

○ Multi-variate time series model (VAR)

○ Models with time-varying coefficients

○ Other non-linear models

Recurrent Neural Networks

Is there an easy way to implement this?

TensorFlow TimeSeries

● tf.contrib.timeseries

○ Classic models (state space, autoregressive)

○ Flexible infrastructure

○ Data management

■ Chunking

■ Batching

■ Saving model

■ Truncated backpropagation
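The chunking and batching ideas in the list above can be pictured with plain NumPy. This is only a sketch of the general technique of cutting a long series into fixed-length windows and stacking random windows into a batch. It is not how tf.contrib.timeseries implements it, and `sample_random_windows` is a hypothetical helper.

```python
import numpy as np

def sample_random_windows(times, values, window_size, batch_size, seed=0):
    """Return a batch of contiguous windows cut at random start positions."""
    rng = np.random.default_rng(seed)
    max_start = len(values) - window_size
    starts = rng.integers(0, max_start + 1, size=batch_size)
    # Broadcasting builds a (batch_size, window_size) grid of indices.
    idx = starts[:, None] + np.arange(window_size)[None, :]
    return times[idx], values[idx]

times = np.arange(100)
values = np.sin(times / 5.0)  # made-up series
batch_times, batch_values = sample_random_windows(
    times, values, window_size=40, batch_size=32)
print(batch_times.shape, batch_values.shape)
```

Each row of the batch is a contiguous slice of the original series, so temporal order inside a window is preserved even though the windows themselves are drawn at random.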

But is it really that easy?

Let's start with an example

Introduction to Kaggle

https://www.kaggle.com/

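What is Kaggle?

A data playground where you can play with data to your heart's content

How to play on Kaggle

1. Pick a competition

2. Review and analyze the problem and the data

3. See how other people approach it

4. Build your own solution

Types of competitions

1. Featured: companies and organizations put up prize money

2. Research: competitions for research purposes

3. Playground: practice problems

4. Getting Started: practice problems

Some common competition rules

1. The number of submissions per day is limited

2. Only a fixed portion of the test set counts toward the Public Score

3. Final scores are revealed when the competition ends

4. The dataset remains accessible after the competition ends!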

https://www.kaggle.com/c/favorita-grocery-sales-forecasting

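Forecasting sales for brick-and-mortar grocery stores

If it looks complicated…

Build on someone else's good analysis: https://www.kaggle.com/headsortails/shopping-for-insights-favorita-eda

In most competitions, the most upvoted kernels are EDA (exploratory data analysis) kernels. When you first join a competition, start by reading the EDA.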

https://www.kaggle.com/towever/devfest

KaggleZeroToAll

# -*- coding: utf-8 -*-
import datetime
from datetime import timedelta

import numpy as np
import pandas as pd
import tensorflow as tf  # requires TensorFlow 1.x: tf.contrib was removed in 2.x
from tensorflow.contrib.timeseries.python.timeseries import NumpyReader
from tensorflow.contrib.timeseries.python.timeseries import estimators as tfts_estimators
from tensorflow.contrib.timeseries.python.timeseries import model as tfts_model

import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

Prepare

dtypes = {'id': 'int64', 'item_nbr': 'int32', 'store_nbr': 'int8'}
train = pd.read_csv('../input/train.csv', usecols=[1, 2, 3, 4], dtype=dtypes,
                    parse_dates=['date'],
                    skiprows=range(1, 101688780)  # skip initial dates
                    )
train.loc[train.unit_sales < 0, 'unit_sales'] = 0  # eliminate negatives
train['unit_sales'] = np.log1p(train['unit_sales'])  # log(1 + x) conversion
train['dow'] = train['date'].dt.dayofweek  # day of week, Monday = 0

Read Dataset
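The log conversion in this step is later undone with `expm1` before submission. A small round-trip sketch on made-up numbers shows why `log1p` is used instead of a plain logarithm: it is defined at zero, which matters once negative sales have been clipped to 0.

```python
import numpy as np

unit_sales = np.array([-2.0, 0.0, 1.0, 9.0, 99.0])  # made-up values
clipped = np.clip(unit_sales, 0, None)  # negatives become 0
logged = np.log1p(clipped)              # log(1 + x), finite at x = 0
restored = np.expm1(logged)             # exp(x) - 1 inverts log1p
print(logged)
print(restored)
```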

# Create records for all items, in all stores, on all dates
# so that daily unit sales averages are calculated correctly.
u_dates = train.date.unique()
u_stores = train.store_nbr.unique()
u_items = train.item_nbr.unique()
train.set_index(['date', 'store_nbr', 'item_nbr'], inplace=True)
train = train.reindex(
    pd.MultiIndex.from_product(
        (u_dates, u_stores, u_items),
        names=['date', 'store_nbr', 'item_nbr']
    )
)

Preprocess data
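To see what reindexing against the full date x store x item product does, here is a toy version with invented rows. Only three of the eight possible combinations are observed; the rest appear as new rows and are then filled with zero sales.

```python
import pandas as pd

# Toy data (made up): only 3 of the 2 x 2 x 2 = 8 combinations were observed.
sales = pd.DataFrame({
    "date": pd.to_datetime(["2017-08-14", "2017-08-14", "2017-08-15"]),
    "store_nbr": [1, 2, 1],
    "item_nbr": [100, 200, 100],
    "unit_sales": [3.0, 5.0, 4.0],
})

full_index = pd.MultiIndex.from_product(
    (sales["date"].unique(), sales["store_nbr"].unique(), sales["item_nbr"].unique()),
    names=["date", "store_nbr", "item_nbr"],
)
dense = (
    sales.set_index(["date", "store_nbr", "item_nbr"])
    .reindex(full_index)           # unobserved combinations appear as NaN rows
    .fillna({"unit_sales": 0.0})   # treat "no record" as zero sales
    .reset_index()
)
print(dense)
```

Without this step, a mean over the recorded rows would ignore days on which nothing was sold and overstate average sales.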

train['unit_sales'] = train['unit_sales'].fillna(0)  # fill NaNs
train.reset_index(inplace=True)  # reset index, restoring the key columns
lastdate = train['date'].iloc[-1]  # last day in the data
train.head()

Preprocess data

# Mean (log) sales per item, store, and day of week
tmp = train[['item_nbr', 'store_nbr', 'dow', 'unit_sales']]
ma_dw = tmp.groupby(['item_nbr', 'store_nbr', 'dow'])['unit_sales'].mean().to_frame('madw')
ma_dw.reset_index(inplace=True)
ma_dw.head()

Preprocess data

# Mean of the day-of-week means per item and store
tmp = ma_dw[['item_nbr', 'store_nbr', 'madw']]
ma_wk = tmp.groupby(['item_nbr', 'store_nbr'])['madw'].mean().to_frame('mawk')
ma_wk.reset_index(inplace=True)
ma_wk.head()

Preprocess data
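The two averages built in these steps, `madw` (mean per day of week) and `mawk` (mean of those seven means), are easiest to read on a tiny example. The numbers below are invented for one item in one store. The ratio `madw / mawk` is what later scales a forecast up or down for a given weekday.

```python
import pandas as pd

# Made-up log-sales for one item in one store over two weeks.
toy = pd.DataFrame({
    "item_nbr": 100,
    "store_nbr": 1,
    "dow": [0, 1, 2, 3, 4, 5, 6] * 2,
    "unit_sales": [1.0, 1.0, 1.0, 1.0, 1.0, 3.0, 2.0,
                   1.0, 1.0, 1.0, 1.0, 1.0, 5.0, 2.0],
})

# Mean per (item, store, day of week), as in ma_dw.
madw = (toy.groupby(["item_nbr", "store_nbr", "dow"])["unit_sales"]
        .mean().to_frame("madw").reset_index())
# Mean of the seven day-of-week means, as in ma_wk.
mawk = (madw.groupby(["item_nbr", "store_nbr"])["madw"]
        .mean().to_frame("mawk").reset_index())

# dow == 5 is Saturday because pandas counts Monday as 0.
saturday_ratio = madw.loc[madw["dow"] == 5, "madw"].iloc[0] / mawk["mawk"].iloc[0]
print(saturday_ratio)
```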

tmp = train[['item_nbr', 'store_nbr', 'unit_sales']]
ma_is = tmp.groupby(['item_nbr', 'store_nbr'])['unit_sales'].mean().to_frame('mais226')

Moving Average using Pandas

for i in [112, 56, 28, 14, 7, 3, 1]:
    tmp = train[train.date > lastdate - timedelta(int(i))]
    tmpg = tmp.groupby(['item_nbr', 'store_nbr'])['unit_sales'].mean().to_frame('mais' + str(i))
    ma_is = ma_is.join(tmpg, how='left')
del tmp, tmpg

Moving Average using Pandas

# Median across the window means gives one forecast per item and store
ma_is['mais'] = ma_is.median(axis=1)
ma_is.reset_index(inplace=True)
ma_is.head()

Moving Average using Pandas
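The "moving average" forecast here is the median of several trailing-window means. A compact sketch on one invented series, using fewer windows than the slides for brevity, shows how the median damps a last-day spike that would dominate the shortest window.

```python
import pandas as pd

# Made-up daily log-sales for one (item, store) pair, oldest first.
dates = pd.date_range("2017-08-01", periods=14, freq="D")
sales = pd.Series([0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 9],
                  index=dates, dtype=float)
last_date = sales.index[-1]

window_means = {}
for days in [14, 7, 3, 1]:
    # Keep only the most recent `days` observations.
    recent = sales[sales.index > last_date - pd.Timedelta(days=days)]
    window_means[f"mais{days}"] = recent.mean()

mais = pd.Series(window_means).median()
print(window_means, mais)
```

The 1-day window alone would forecast 9, while the median across windows lands between the 7-day and 3-day means.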

def data_to_npreader(store_nbr: int, item_nbr: int) -> tuple:
    """Return (times, values, NumpyReader) for one store/item series."""
    unit_sales = train[np.logical_and(train['store_nbr'] == store_nbr,
                                      train['item_nbr'] == item_nbr)].unit_sales
    x = np.asarray(range(len(unit_sales)))
    y = np.asarray(unit_sales)
    dataset = {
        tf.contrib.timeseries.TrainEvalFeatures.TIMES: x,
        tf.contrib.timeseries.TrainEvalFeatures.VALUES: y,
    }
    reader = NumpyReader(dataset)
    return x, y, reader

Make data trainable

x, y, reader = data_to_npreader(store_nbr=1, item_nbr=105574)
# window_size = input_window_size + output_window_size (30 + 10)
train_input_fn = tf.contrib.timeseries.RandomWindowInputFn(
    reader, batch_size=32, window_size=40)
ar = tf.contrib.timeseries.ARRegressor(
    periodicities=21, input_window_size=30, output_window_size=10,
    num_features=1,
    loss=tf.contrib.timeseries.ARModel.NORMAL_LIKELIHOOD_LOSS
)
ar.train(input_fn=train_input_fn, steps=16000)

TensorFlow TimeSeries - ARRegressor

evaluation_input_fn = tf.contrib.timeseries.WholeDatasetInputFn(reader)
# keys of evaluation: ['covariance', 'loss', 'mean', 'observed', 'start_tuple',
#                      'times', 'global_step']
evaluation = ar.evaluate(input_fn=evaluation_input_fn, steps=1)
(ar_predictions,) = tuple(ar.predict(
    input_fn=tf.contrib.timeseries.predict_continuation_input_fn(
        evaluation, steps=16)))

TensorFlow TimeSeries - ARRegressor

plt.figure(figsize=(15, 5))
plt.plot(x.reshape(-1), y.reshape(-1), label='origin')
plt.plot(evaluation['times'].reshape(-1), evaluation['mean'].reshape(-1), label='evaluation')
plt.plot(ar_predictions['times'].reshape(-1), ar_predictions['mean'].reshape(-1),
         label='prediction')
plt.xlabel('time_step')
plt.ylabel('values')
plt.legend(loc=4)
plt.show()

TensorFlow TimeSeries - ARRegressor

TensorFlow TimeSeries - LSTM

Get the LSTM model class (_LSTMModel): https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/timeseries/examples/lstm.py

TensorFlow TimeSeries - LSTM

x, y, reader = data_to_npreader(store_nbr=2, item_nbr=105574)
train_input_fn = tf.contrib.timeseries.RandomWindowInputFn(
    reader, batch_size=16, window_size=21)
estimator = tfts_estimators.TimeSeriesRegressor(
    model=_LSTMModel(num_features=1, num_units=32),
    optimizer=tf.train.AdamOptimizer(0.001))
estimator.train(input_fn=train_input_fn, steps=16000)
evaluation_input_fn = tf.contrib.timeseries.WholeDatasetInputFn(reader)
evaluation = estimator.evaluate(input_fn=evaluation_input_fn, steps=1)

TensorFlow TimeSeries - LSTM

(lstm_predictions,) = tuple(estimator.predict(
    input_fn=tf.contrib.timeseries.predict_continuation_input_fn(
        evaluation, steps=16)))

TensorFlow TimeSeries - LSTM

plt.figure(figsize=(15, 5))
plt.plot(x.reshape(-1), y.reshape(-1), label='origin')
plt.plot(evaluation['times'].reshape(-1), evaluation['mean'].reshape(-1), label='evaluation')
plt.plot(lstm_predictions['times'].reshape(-1), lstm_predictions['mean'].reshape(-1),
         label='prediction')
plt.xlabel('time_step')
plt.ylabel('values')
plt.legend(loc=4)
plt.show()

TensorFlow TimeSeries - LSTM

Forecasting test data

# Read test dataset
test = pd.read_csv('../input/test.csv', dtype=dtypes,
                   parse_dates=['date'])
test['dow'] = test['date'].dt.dayofweek

Forecasting test data

# Moving Average
test = pd.merge(test, ma_is, how='left', on=['item_nbr', 'store_nbr'])
test = pd.merge(test, ma_wk, how='left', on=['item_nbr', 'store_nbr'])
test = pd.merge(test, ma_dw, how='left', on=['item_nbr', 'store_nbr', 'dow'])
test['unit_sales'] = test.mais

# Autoregressive: clip negative forecasts, then overwrite store 1 / item 105574
ar_predictions['mean'][ar_predictions['mean'] < 0] = 0
test.loc[np.logical_and(test['store_nbr'] == 1, test['item_nbr'] == 105574), 'unit_sales'] = \
    ar_predictions['mean'].reshape(-1)

# LSTM: clip negative forecasts, then overwrite store 2 / item 105574
lstm_predictions['mean'][lstm_predictions['mean'] < 0] = 0
test.loc[np.logical_and(test['store_nbr'] == 2, test['item_nbr'] == 105574), 'unit_sales'] = \
    lstm_predictions['mean'].reshape(-1)

Forecasting test data

pos_idx = test['mawk'] > 0
test_pos = test.loc[pos_idx]
# Scale each forecast by its day-of-week ratio (madw / mawk)
test.loc[pos_idx, 'unit_sales'] = test_pos['unit_sales'] * test_pos['madw'] / test_pos['mawk']
test['unit_sales'] = test['unit_sales'].fillna(0)
test['unit_sales'] = np.expm1(test['unit_sales'])  # restore unit values
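The day-of-week adjustment and the inverse log transform can be traced on three invented rows. Note that the scaling happens in log1p space, as in the slide code, so doubling the log value does much more than double the unit count.

```python
import numpy as np
import pandas as pd

# Made-up rows: forecasts are in log1p space, as in the slides.
test = pd.DataFrame({
    "unit_sales": [np.log1p(10.0), np.log1p(10.0), np.nan],
    "madw": [2.0, 1.0, 0.0],
    "mawk": [1.0, 2.0, 0.0],
})

pos = test["mawk"] > 0  # avoid dividing by a zero weekly mean
test.loc[pos, "unit_sales"] *= test.loc[pos, "madw"] / test.loc[pos, "mawk"]
test["unit_sales"] = test["unit_sales"].fillna(0)   # no history: forecast zero
test["unit_sales"] = np.expm1(test["unit_sales"])   # back to unit counts
print(test["unit_sales"].tolist())
```

The first row goes from 10 units to 120 with a ratio of 2, the second drops to about 2.3 with a ratio of 0.5, and the row with no history ends at 0.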

Forecasting test data

holiday = pd.read_csv('../input/holidays_events.csv', parse_dates=['date'])
holiday = holiday.loc[holiday['transferred'] == False]  # keep holidays that were not transferred
test = pd.merge(test, holiday, how='left', on=['date'])
test['transferred'] = test['transferred'].fillna(True)  # dates with no holiday row
test.loc[test['transferred'] == False, 'unit_sales'] *= 1.2  # holiday uplift
test.loc[test['onpromotion'] == True, 'unit_sales'] *= 1.15  # promotion uplift
test[['id', 'unit_sales']].to_csv('submission.csv.gz', index=False, compression='gzip')

Thank you!