Kx for wine tasting

25
Kx for Wine Tas.ng Machine Learning in q/kdb+ Mark Lefevre Algorithmic Quan.ta.ve Analyst

Transcript of Kx for wine tasting

Page 1: Kx for wine tasting

Kx  for  Wine  Tas.ng  Machine  Learning  in  q/kdb+  

Mark  Lefevre  Algorithmic  Quan.ta.ve  Analyst  

Page 2: Kx for wine tasting
Page 3: Kx for wine tasting

Machine  Learning  Introduc.on  

•  ML  algorithms  can  be  grouped  by  learning  style  –  Supervised  Learning  – Unsupervised  Learning  –  Reinforcement  Learning  

•  Or,  alterna.vely,  by  similarity  –  Regression  –  Clustering  –  Classifica.on  – Neural  Networks  –  Etc.  

Page 4: Kx for wine tasting

Unsupervised  Learning  

•  Uses  a  dataset  with  known  inputs  and  unlabeled  outputs  –  In  a  true  applica.on,  it  is  impossible  to  evaluate  the  accuracy  of  the  algorithm’s  output  

•  Infers  a  func.on  to  describe  a  transforma.on  •  Typical  types  of  problems  are  classifica.on,  clustering,  anomaly/fraud  detec.on,  image  processing  and  topic  modeling  

Page 5: Kx for wine tasting

K-­‐Means  Clustering  Algorithm  •  Given  n  d-­‐dimensional  data  

points  (x1,  x2,  …,  xn),  par..on  the  n  observa.ons  into  k  (≤  n)  sets  S  =  {S1,  S2,  …,  Sk}  that  minimize  a  within-­‐cluster  distance  measure  

•  Using  a  Euclidean  distance  measure  (L2-­‐norm)  

argminS

x −µix∈Si

∑2

i=1

k

∑Cluster1   Cluster2  

Page 6: Kx for wine tasting

Lloyds  Algorithm  

•  A  simple,  useful  heuris.c  algorithm  is  widely  used  o\en  called  Lloyds  Algorithm  

0.    Ini.alize  centroids  Iterate  the  following  two  steps  un.l  convergence  1.  Assign  data  points  to  nearest  cluster  2.  Calculate  new  centroids  

Page 7: Kx for wine tasting

Simple  Example  (k=3)  

0.  Ini)alize  3  Centroids  

-­‐4  

-­‐3  

-­‐2  

-­‐1  

0  

1  

2  

3  

4  

5  

-­‐8   -­‐6   -­‐4   -­‐2   0   2   4   6  

1.  Cluster  Assignment  

-­‐4  

-­‐3  

-­‐2  

-­‐1  

0  

1  

2  

3  

4  

5  

-­‐8   -­‐6   -­‐4   -­‐2   0   2   4   6  

Page 8: Kx for wine tasting

Simple  Example  

2.  Calculate  New  Centroids  

-­‐4  

-­‐3  

-­‐2  

-­‐1  

0  

1  

2  

3  

4  

5  

-­‐8   -­‐6   -­‐4   -­‐2   0   2   4   6  

3.  Cluster  Assignment  

-­‐4  

-­‐3  

-­‐2  

-­‐1  

0  

1  

2  

3  

4  

5  

-­‐8   -­‐6   -­‐4   -­‐2   0   2   4   6  

Page 9: Kx for wine tasting

Simple  Example  

4.  Calculate  New  Centroids  

-­‐4  

-­‐3  

-­‐2  

-­‐1  

0  

1  

2  

3  

4  

5  

-­‐8   -­‐6   -­‐4   -­‐2   0   2   4   6  

5.  Cluster  Assignment  

-­‐4  

-­‐3  

-­‐2  

-­‐1  

0  

1  

2  

3  

4  

5  

-­‐8   -­‐6   -­‐4   -­‐2   0   2   4   6  

Page 10: Kx for wine tasting

Wine  Dataset  •  UCI  Machine  Learning  

Repository  •  hdp://archive.ics.uci.edu/ml  •  Irvine,  CA:  University  of  

California,  School  of  Informa.on  and  Computer  Science.  

•  Consists  of  178  instances,  13  chemical  analysis  adributes  and  a  column  indica.ng  the  actual  class  

1.  Alcohol    2.  Malic  acid    3.  Ash    4.  Alcalinity  of  ash    5.  Magnesium    6.  Total  phenols    7.  Flavanoids    8.  Nonflavanoid  phenols    9.  Proanthocyanins    10.  Color  intensity    11.  Hue    12.  OD280/OD315    13.  Proline    

Page 11: Kx for wine tasting

Quick  Look  at  Raw  Wine  Dataset  

•  Here  are  9  samples,  3  from  each  class  •  What  do  you  no.ce  about  the  data?  •  Could  you  find  a  padern  to  dis.nguish  the  3  cul.vars  from  each  other?  

Page 12: Kx for wine tasting

Wine  Dataset  Boxplots  (All  features)  

Page 13: Kx for wine tasting

Wine  Dataset  Boxplots  (-­‐Proline)  

Page 14: Kx for wine tasting

Wine  Dataset  Boxplots  (-­‐Proline,  -­‐Magnesium)  

Page 15: Kx for wine tasting

Alcohol  and  Malic  Acid  QQ  Plots  

Page 16: Kx for wine tasting

Q  Code  // Demonstration implementing k-means algorithm/Lloyds algorithm wds:flip (`$'14#.Q.A)!("J",13#"F";",") 0: `:wine.csv; actualGroup:wds[`A]; /X:delete A from X; wds:update g:178?3 from wds; f:{[X] // Lambda Function to find centroids by group (column name=g) C:{[t;b;ac;f] ?[t;();b;ac!f,/:ac]} [X;{x!x} raze `g;(cols X) except `g;avg]; // Group assignments newg:{{x?min x}x$'x} each (raze each delete g from X)-/:\:(raze each value C); update g:newg from X }; wds:(f/)wds;

Page 17: Kx for wine tasting

Principal  Component  Analysis  (PCA)  

•  PCA  is  a  sta.s.cal  procedure  that  u.lizes  orthogonal  transforma.ons  to  convert  a  set  of  observa.ons  of  possibly  correlated  variables  into  a  set  of  values  of  linearly  uncorrelated  variables  called  principal  components.  

•  In  a  word,  decorrela(on  •  The  principal  components  are  

the  eigenvectors  of  a  symmetric  variance-­‐covariance  matrix  

•  Eigenvectors  are  ordered  by  their  corresponding  eigenvalues  –  Amount  of  variance  explained  

by  the  component  •  Taking  a  few  of  principal  

components,  we  can  achieve  –  dimensionality  reduc)on  

•  This  is  very  useful  for  high  dimensionality  problems,  such  as  instantaneous  forward  curve  evolu.ons  analyzing  wine  

ADVANCED  APPLICATION  

Page 18: Kx for wine tasting

Principle  Components  

Page 19: Kx for wine tasting

Visualiza.on  of  Wine  Data  Using  Principal  Component  Analysis  

Page 20: Kx for wine tasting
Page 21: Kx for wine tasting

Another  Look  at  Raw  Wine  Dataset  

•  Here  are  9  samples,  3  from  each  class  •  Can  you  find  a  padern  to  dis.nguish  the  3  cul.vars,  if  you  knew  the  principle  components?  

Page 22: Kx for wine tasting

Weaknesses  of  K-­‐Means  

•  K  is  an  input  •  Sensi.vity  to  ini.aliza.on  – Mul.ple  runs  with  different  random  ini.aliza.ons  

–  Kmeans++  •  Empty  clusters  

–  Delete  cluster  –  Randomly  chose  another  centroid  

•  Hyperspherical  clusters  –  Cannot  handle  non  globular  clusters  well  

•  Outliers  –  K-­‐medians  algorithm  

•  No  guarantee  it  will  converge  to  global  op.mum  –  NP-­‐hard  

Page 23: Kx for wine tasting

K-­‐means++  •  Improved  ini.aliza.on  

algorithm  •  Addresses  poten.ally  bad  

ini.al  guesses  

1.  Choose  random  data  point  as  first  centroid  

2.  Compute  distance,  D(x),  from  ini.al  random  point  to  all  other  data  points  

3.  Choose  a  new  centroid  from  those  data  points  using  a  weighted  probability  distribu.on  propor.onal  to  D(x)2  

4.  Repeat  Steps  2  and  3  un.l  k  centers  have  been  chosen  

-­‐0.6  

-­‐0.4  

-­‐0.2  

0  

0.2  

0.4  

0.6  

-­‐15   -­‐10   -­‐5   0   5   10   15  

Cluster1   Cluster2  

Page 24: Kx for wine tasting

Conclusion  

•  Briefly  introduced  machine  learning,  the  concept  of  unsupervised  learning  and  the  k-­‐means  algorithm  

•  Showed  how  this  algorithm  can  be  easily  wriden  in  q  and  can  be  used  to  learn  how  to  categorize  wine  cul.vars  

•  Hopefully,  this  has  provide  an  interes.ng  look  at  the  opportuni.es  to  u.lize  q/kdb+  in  machine  learning  

Page 25: Kx for wine tasting

About  Me  •  Mark  is  currently  consul.ng  at  one  of  the  largest  banks  in  Tokyo  as  an  algorithmic  

quan.ta.ve  analyst  developing  high-­‐performance  algorithmic  trading  systems  on  the  e-­‐FX  desk.  Prior  to  moving  to  Japan,  he  worked  in  London  for  Unicredit  on  the  Equity-­‐Linked  Origina.on  desk  crea.ng  conver.ble  bonds  for  European  corporates,  consulted  in  the  US  on  e-­‐commerce  analy.cs  and  worked  for  several  high-­‐tech  so\ware  companies.  

•  Earlier  in  his  career,  he  worked  for  Mitsubishi  Semiconductor  America  designing  semiconductors  and  a  startup  developing  a  DSP.  He  then  moved  into  applica.ons  engineering  for  an  Electronic  Design  Automa.on  (EDA)  company  and,  subsequently,  internet  so\ware  companies  in  CA  and  Europe.      

•  Mark  has  a  bachelors  degree  in  Electrical  Engineering  and  Computer  Science  from  Duke  University,  a  masters  degree  in  Computer  Engineering  from  North  Carolina  State  University  and  an  MBA  in  Quan.ta.ve  Finance  from  the  Wharton  School  of  Business.  He  recently  completed  a  Cer.ficate  in  Quan.ta.ve  Finance  (CQF).  

•  He  dreams  of  the  day  when  he  can  create  so\ware  without  encountering  a  single  type  error