
TOPIC 0

Introduction

First data science example: species delimitation

Course: Math 535 (http://www.math.wisc.edu/~roch/mmids/) - Mathematical Methods in Data Science (MMiDS)
Author: Sebastien Roch (http://www.math.wisc.edu/~roch/), Department of Mathematics, University of Wisconsin-Madison
Updated: August 28, 2020
Copyright: © 2020 Sebastien Roch

Imagine that you are an evolutionary biologist studying irises and that you have collected measurements on a large number of iris samples. Your goal is to identify different species (https://en.wikipedia.org/wiki/Species) within this collection.

[Image of iris flowers omitted. Source: https://www.w3resource.com/machine-learning/scikit-learn/iris/index.php]

Here is a classical iris dataset (https://en.wikipedia.org/wiki/Iris_flower_data_set) first analyzed by Fisher. We will load the data in the form of a DataFrame -- similar to a spreadsheet -- where the columns are different measurements (or features) and the rows are different samples. Below, we show the first lines of the dataset.


In [1]: # Julia version: 1.5.1
        using CSV, DataFrames, Statistics, Plots, LinearAlgebra, StatsPlots

In [2]: df = CSV.read("iris-measurements.csv", DataFrame)
        first(df, 5)

There are 150 samples.

In [3]: nrow(df)

Here is a summary of the data.

In [4]: describe(df)

Let's first extract the columns into vectors, combine them into a matrix, and visualize the petal data. Below, each point is a sample. This is called a scatter plot (https://en.wikipedia.org/wiki/Scatter_plot).

In [5]: X = reduce(hcat, [df[:,:PetalLengthCm], df[:,:PetalWidthCm], df[:,:SepalLengthCm], df[:,:SepalWidthCm]]);

Out[2]: 5 rows × 5 columns

            Id     PetalLengthCm  PetalWidthCm  SepalLengthCm  SepalWidthCm
            Int64  Float64        Float64       Float64        Float64
        1   1      1.4            0.2           5.1            3.5
        2   2      1.4            0.2           4.9            3.0
        3   3      1.3            0.2           4.7            3.2
        4   4      1.5            0.2           4.6            3.1
        5   5      1.4            0.2           5.0            3.6

Out[3]: 150

Out[4]: 5 rows × 8 columns

            variable       mean     min  median  max  nunique  nmissing  eltype
            Symbol         Float64  Real Float64  Real Nothing  Nothing   DataType
        1   Id             75.5     1    75.5    150                      Int64
        2   PetalLengthCm  3.75867  1.0  4.35    6.9                      Float64
        3   PetalWidthCm   1.19867  0.1  1.3     2.5                      Float64
        4   SepalLengthCm  5.84333  4.3  5.8     7.9                      Float64
        5   SepalWidthCm   3.054    2.0  3.0     4.4                      Float64


In [6]: scatter(X[:,1], X[:,2], legend=false, xlabel="PetalLength", ylabel="PetalWidth")

Observe a clear cluster of samples on the bottom left. This may be an indication that these samples come from a separate species. What is a cluster (https://en.wikipedia.org/wiki/Cluster_analysis)? Intuitively, it is a group of samples that are close to each other, but far from every other sample.
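To make this intuition concrete, here is a small sketch on toy two-dimensional data (not the iris measurements): within a cluster, pairwise distances are small, while distances across clusters are large.

```julia
using LinearAlgebra

# Toy 2-D samples: three points near the origin and three points near (5, 5).
pts = [0.0 0.1; 0.2 0.0; 0.1 0.2; 5.0 5.1; 5.2 5.0; 5.1 5.2]

# Euclidean distance between rows i and j.
dist(i, j) = norm(pts[i, :] - pts[j, :])

within = maximum(dist(i, j) for i in 1:3, j in 1:3)   # largest distance inside the first group
between = minimum(dist(i, j) for i in 1:3, j in 4:6)  # smallest distance across the two groups

println((within, between))   # within ≈ 0.22, between ≈ 6.93
```

The first three points form a cluster precisely because `within` is much smaller than `between`.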

Now let's look at the full dataset. Visualizing the full 4-dimensional data is not so straightforward. One way to do this is to consider all pairwise scatter plots.

Out[6]: [scatter plot of PetalWidth against PetalLength]


In [7]: cornerplot(X, label=["PetalLength", "PetalWidth", "SepalLength", "SepalWidth"],
                   size=(500,500), markersize=2)

We would like a method that automatically identifies clusters -- whatever the dimension of the data. We will discuss a standard way to do this: k-means clustering.
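To give a preview, here is a minimal sketch of k-means (Lloyd's algorithm) in Julia. This is illustrative only -- the function name and its details are our own, not the implementation developed later in the course.

```julia
using LinearAlgebra, Random, Statistics

# A minimal k-means sketch. `X` is an n-by-d matrix of samples; `k` is the
# number of clusters. Alternates two steps: assign each sample to its nearest
# center, then move each center to the mean of its assigned samples.
function kmeans_sketch(X, k; iters=20, rng=MersenneTwister(535))
    n = size(X, 1)
    centers = Float64.(X[randperm(rng, n)[1:k], :])  # initialize at k random samples
    assign = zeros(Int, n)
    for _ in 1:iters
        # Assignment step: each sample joins its nearest center.
        for i in 1:n
            assign[i] = argmin([norm(X[i, :] - centers[c, :]) for c in 1:k])
        end
        # Update step: each center moves to the mean of its cluster.
        for c in 1:k
            members = findall(==(c), assign)
            isempty(members) || (centers[c, :] = vec(mean(X[members, :], dims=1)))
        end
    end
    return assign, centers
end
```

On two well-separated groups of points, the sketch recovers the grouping: for example, `kmeans_sketch([0.0 0.0; 0.1 0.0; 0.0 0.1; 5.0 5.0; 5.1 5.0; 5.0 5.1], 2)` assigns the first three rows to one cluster and the last three to the other.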

This topic has two main goals:

1. To review basic facts about Euclidean geometry, vector calculus, probability, and matrix algebra.
2. To introduce a first data science problem and highlight some relevant, surprising phenomena arising in high-dimensional space.

We will come back to the iris dataset in an accompanying tutorial notebook.


Out[7]: [corner plot of all pairwise scatter plots of the four measurements]


Optional reading

You may want to review basic linear algebra and probability. In particular, take a look at

Chapter 1 in [Sol] and Sections 1.2.1-1.2.4 in [Bis]

where, throughout this course, we will refer to the following textbooks available online:

[Sol] Solomon, Numerical Algorithms: Methods for Computer Vision, Machine Learning, and Graphics, CRC, 2015 (https://people.csail.mit.edu/jsolomon/share/book/numerical_book.pdf)

[Bis] Bishop, Pattern Recognition and Machine Learning, Springer, 2006 (https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf)

Material for this lecture is covered partly in:

Sections 1.4, 9.1 and 11.1.1 of [Bis]

References

Parts of this topic's notebooks are based on the following references.

[Carp] B. Carpenter, Typical Sets and the Curse of Dimensionality (https://mc-stan.org/users/documentation/case-studies/curse-dims.html)

[BHK] A. Blum, J. Hopcroft, R. Kannan, Foundations of Data Science, Cambridge University Press, 2020 (http://www.cs.cornell.edu/jeh/book%20no%20solutions%20March%202019.pdf)

[VMLS] S. Boyd and L. Vandenberghe. Introduction to Applied Linear Algebra: Vectors, Matrices, and Least Squares. Cambridge University Press, 2018 (http://vmls-book.stanford.edu)

[SSBD] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014 (https://www.cse.huji.ac.il/~shais/UnderstandingMachineLearning/index.html)