A Beginner's Guide to Machine Learning with Scikit-Learn

A Beginner’s Guide to Machine Learning with Scikit-Learn Sarah Guido PyTennessee 2014

description

This talk was given at the PyData NYC 2013 conference (http://vimeo.com/79517341) and will be given at PyTennessee 2014. Scikit-learn is one of the best-known machine learning modules for Python. But how does it work, and what, for that matter, is machine learning? For those who have programming experience but are new to machine learning, this talk gives a beginner-level overview of how machine learning can be useful, important machine learning concepts, and how to implement them with scikit-learn. We’ll use real-world data to look at supervised and unsupervised machine learning algorithms and why scikit-learn is useful for performing these tasks.

Transcript of A Beginner's Guide to Machine Learning with Scikit-Learn

Page 1: A Beginner's Guide to Machine Learning with Scikit-Learn

A Beginner’s Guide to Machine Learning with Scikit-Learn

Sarah Guido

PyTennessee 2014

Page 2: A Beginner's Guide to Machine Learning with Scikit-Learn

All about me

• Grad student at the University of Michigan
• Data analyst for HathiTrust
• Organizer of Ann Arbor PyLadies chapter

Page 3: A Beginner's Guide to Machine Learning with Scikit-Learn

My talk

• Machine learning and scikit-learn
• Supervised and unsupervised learning
• Preprocessing, validation and testing, strategies for machine learning

Page 4: A Beginner's Guide to Machine Learning with Scikit-Learn

What is machine learning?

• Application of algorithms that learn from examples

• Representation and generalization

Page 5: A Beginner's Guide to Machine Learning with Scikit-Learn

Why should we care?

• Useful in everyday life
  • Email spam, handwriting analysis, stock market analysis, Netflix
• Especially useful in data analysis
  • Feature extraction, linear regression, classification, clustering

Page 6: A Beginner's Guide to Machine Learning with Scikit-Learn

Machine Learning Vocab

• Instance
• Feature
• Class
• Categorical
  • Nominal
  • Ordinal
• Continuous
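
The vocabulary above can be made concrete with one of scikit-learn's built-in datasets. The sketch below uses the iris dataset, which is an assumption made for illustration and is not the dataset used in the talk. Rows are instances, columns are features, and the target holds the class.

```python
from sklearn.datasets import load_iris

# Illustrative choice of dataset; the talk itself uses other data.
iris = load_iris()

# Each row of iris.data is one instance (one measured flower).
n_instances, n_features = iris.data.shape

# Each column is a feature; these four are continuous measurements.
feature_names = list(iris.feature_names)

# The class is the label to predict; here it is categorical (nominal).
class_names = list(iris.target_names)

print(n_instances, n_features)
print(feature_names)
print(class_names)
```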

Page 7: A Beginner's Guide to Machine Learning with Scikit-Learn

Machine Learning Vocab

[Figure: example data annotated with the labels Feature, Class, and Instance]

Page 8: A Beginner's Guide to Machine Learning with Scikit-Learn

Scikit-Learn

• Machine learning module
• Open-source
• Built-in datasets
• Good resources for learning

Page 9: A Beginner's Guide to Machine Learning with Scikit-Learn

Scikit-Learn

• Model = EstimatorObject()
• Model.fit(dataset.data, dataset.target)
  • dataset.data = dataset
  • dataset.target = labels
• Model.predict(dataset.data)
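
The fit/predict pattern above might look like this in practice. This is a minimal sketch, assuming the built-in iris dataset and the GaussianNB estimator as stand-ins for "dataset" and "EstimatorObject"; both are illustrative choices rather than the ones shown in the talk.

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# dataset.data holds the feature values, dataset.target holds the labels.
dataset = load_iris()

# Create the estimator, then learn from the labeled examples.
model = GaussianNB()
model.fit(dataset.data, dataset.target)

# Predict a label for every instance (here, the same data we trained on).
predictions = model.predict(dataset.data)
print(predictions[:5])
```

Other scikit-learn estimators follow the same pattern, so swapping GaussianNB for a different estimator usually leaves the surrounding code unchanged.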

Page 10: A Beginner's Guide to Machine Learning with Scikit-Learn

Scikit-Learn

• Supervised
• Unsupervised
• Semi-supervised
• Reinforcement learning
• Neural networks
• …and many more!

Page 11: A Beginner's Guide to Machine Learning with Scikit-Learn

Supervised learning

• Labeled data
• You know what you’re looking for
• Classification: predict categorical labels
• Regression: predict continuous target variables
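
To illustrate the regression case, where the target is a continuous number rather than a category, here is a small sketch. The data is made up for illustration (it follows y = 2x + 1 exactly), and LinearRegression is an assumed choice of estimator.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data that follows y = 2x + 1 exactly.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

model = LinearRegression()
model.fit(X, y)

# The prediction is a continuous value, not a category.
prediction = model.predict(np.array([[4.0]]))[0]
print(prediction)
```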

Page 12: A Beginner's Guide to Machine Learning with Scikit-Learn

Classification

• Categorical variables
• Relationship between instance and feature
• Classification algorithms == classifiers

Page 13: A Beginner's Guide to Machine Learning with Scikit-Learn

Classification

• Naïve Bayes classifier
• Assumes features are independent of each other
• Fast performance
• Decent classifier
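
The independence assumption can be written as a formula. In the sketch below, y stands for a class and x_1, ..., x_n for the feature values of one instance; these symbols are introduced here for illustration and do not appear in the slides. Because the features are treated as independent given the class, the class score is a product of simple per-feature terms, and the predicted class is the one with the highest score.

```latex
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```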

Page 14: A Beginner's Guide to Machine Learning with Scikit-Learn

Classification

• Car evaluation dataset (UCI)
• Features: buying price, maintenance price, number of doors, number of seats, size of the trunk, and safety ranking

• Labels: unacceptable, acceptable, good, or very good
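
A classifier for data shaped like this could be sketched as follows. The rows below are made up to match the six features and are not real records from the UCI dataset. The use of OrdinalEncoder and CategoricalNB is also an assumption; the talk may have used a different encoding or a different Naïve Bayes variant.

```python
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# Made-up rows in the car evaluation layout (NOT real UCI records):
# buying price, maintenance price, doors, seats, trunk size, safety
rows = [
    ["vhigh", "vhigh", "2", "2", "small", "low"],
    ["high", "high", "2", "2", "small", "low"],
    ["med", "med", "4", "4", "med", "med"],
    ["med", "low", "4", "4", "big", "med"],
    ["low", "low", "4", "4", "big", "high"],
    ["low", "med", "4", "more", "big", "high"],
]
labels = [
    "unacceptable", "unacceptable",
    "acceptable", "acceptable",
    "very good", "very good",
]

# Turn each category string into an integer code, column by column.
encoder = OrdinalEncoder()
X = encoder.fit_transform(rows).astype(int)

# Naive Bayes for categorical features.
model = CategoricalNB()
model.fit(X, labels)

predicted = model.predict(X)
print(list(predicted))
```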

Page 18: A Beginner's Guide to Machine Learning with Scikit-Learn

Unsupervised algorithms

• Unlabeled data
• You might have no idea what you’re looking for
• Clustering: splitting observations into groups
• Dimensionality reduction: flatten data to fewer dimensions
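
As one example of dimensionality reduction, the sketch below projects four-dimensional data down to two dimensions. PCA and the iris dataset are assumptions chosen for illustration; the slides do not name a specific method. Note that the labels are never passed to the model.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()

# Keep the two directions that capture the most variance.
pca = PCA(n_components=2)
reduced = pca.fit_transform(iris.data)  # no labels are used

print(iris.data.shape, "->", reduced.shape)
print(pca.explained_variance_ratio_)
```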

Page 19: A Beginner's Guide to Machine Learning with Scikit-Learn

Clustering

• Exploring the data
• Similar objects in the same group
• Distance between data points

Page 20: A Beginner's Guide to Machine Learning with Scikit-Learn

Clustering

• K-means clustering
• Three steps
  • Chooses initial cluster centers
  • Assigns data instance to cluster
  • Recalculates cluster center
• Efficient
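
With scikit-learn, the three steps above run inside a single fit call. The sketch below uses six made-up 2-D points that form two obvious groups; the points and parameter values are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up points: three near (1, 1) and three near (8, 8).
points = np.array([
    [1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
    [8.0, 8.0], [8.5, 9.0], [9.0, 8.0],
])

# fit_predict chooses initial centers, then alternates between assigning
# points to the nearest center and recalculating each center.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

print(labels)
print(kmeans.cluster_centers_)
```

The cluster numbers themselves are arbitrary; what matters is which points end up sharing a number.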

Page 24: A Beginner's Guide to Machine Learning with Scikit-Learn

Data preprocessing

• Encoding categorical features
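
Most scikit-learn estimators expect numbers, so category strings need to be encoded first. The sketch below shows two common options on a made-up "safety" column: an ordinal encoding that keeps the low < med < high order, and a one-hot encoding that treats the categories as unordered. Which encoding the talk used is not stated, so both are shown as assumptions.

```python
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Made-up single-column data; each inner list is one instance.
safety = [["low"], ["med"], ["high"], ["med"]]

# Ordinal: one integer code per category, in the order given here.
ordinal = OrdinalEncoder(categories=[["low", "med", "high"]])
ordinal_codes = ordinal.fit_transform(safety)

# One-hot: one 0/1 column per category, with no implied order.
one_hot = OneHotEncoder()
one_hot_matrix = one_hot.fit_transform(safety).toarray()

print(ordinal_codes.ravel())
print(one_hot.categories_)
print(one_hot_matrix)
```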

Page 27: A Beginner's Guide to Machine Learning with Scikit-Learn

Data preprocessing

• Split the dataset into training and test data
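
Splitting could be done with scikit-learn's train_test_split helper. This is a minimal sketch, assuming the iris dataset, a 25% test share, and GaussianNB as the model; none of these values come from the slides.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

iris = load_iris()

# Hold out a quarter of the instances; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0
)

# Train only on the training portion, then score on the unseen test portion.
model = GaussianNB().fit(X_train, y_train)
accuracy = model.score(X_test, y_test)

print(len(X_train), len(X_test), accuracy)
```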

Page 28: A Beginner's Guide to Machine Learning with Scikit-Learn

Validation and testing

• Model evaluation

• Cross-validation
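
Cross-validation repeats the train/test idea several times on different slices of the data. The sketch below assumes the iris dataset, GaussianNB, and five folds, all chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

iris = load_iris()

# Five folds: each fold is used once for testing and four times for training.
scores = cross_val_score(GaussianNB(), iris.data, iris.target, cv=5)

print(scores)
print(scores.mean())
```

Reporting the mean together with the spread of the fold scores gives a steadier estimate than a single train/test split.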

Page 29: A Beginner's Guide to Machine Learning with Scikit-Learn

Good strategies

• Avoid overfitting
• Use lots of data
• Intuition fails in high dimensions

Page 30: A Beginner's Guide to Machine Learning with Scikit-Learn

My materials

• Scikit-learn.org documentation and tutorials
• Machine learning class at U of M
• Scikit-learn talks

Page 31: A Beginner's Guide to Machine Learning with Scikit-Learn

Resources

• Scikit-learn documentation and tutorials
  • scikit-learn.org/stable/documentation.html
• Other resources
  • http://archive.ics.uci.edu/ml/datasets.html
  • Mldata.org
• Videos
  • Scikit-learn tutorial: http://vimeo.com/53062607
  • Intro to scikit-learn: http://vimeo.com/72859487

Page 32: A Beginner's Guide to Machine Learning with Scikit-Learn

Contact me!

• @sarah_guido
• Linkedin.com/sarahguido
• github.com/sarguido