A Beginner's Guide to Machine Learning with Scikit-Learn
Sarah Guido
PyTennessee 2014
All about me
• Grad student at the University of Michigan
• Data analyst for HathiTrust
• Organizer of Ann Arbor PyLadies chapter
My talk
• Machine learning and scikit-learn
• Supervised and unsupervised learning
• Preprocessing, validation and testing, strategies for machine learning
What is machine learning?
• Application of algorithms that learn from examples
• Representation and generalization
Why should we care?
• Useful in everyday life
  • Email spam, handwriting analysis, stock market analysis, Netflix
• Especially useful in data analysis
  • Feature extraction, linear regression, classification, clustering
Machine Learning Vocab
• Instance
• Feature
• Class
• Categorical
  • Nominal
  • Ordinal
• Continuous
Machine Learning Vocab
[Figure: diagram labeling Feature, Class, and Instance]
Scikit-Learn
• Machine learning module
• Open-source
• Built-in datasets
• Good resources for learning
Scikit-Learn
• Model = EstimatorObject()
• Model.fit(dataset.data, dataset.target)
  • dataset.data = dataset
  • dataset.target = labels
• Model.predict(dataset.data)
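The construct / fit / predict pattern above can be sketched as follows. This is a minimal illustration, not code from the talk: the choice of the built-in iris dataset and of KNeighborsClassifier as the estimator are assumptions made only to have something concrete to run.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Built-in dataset: .data holds the feature matrix, .target holds the labels.
dataset = load_iris()

# Any scikit-learn estimator follows the same construct / fit / predict pattern.
# KNeighborsClassifier is an arbitrary choice for this sketch.
model = KNeighborsClassifier(n_neighbors=3)
model.fit(dataset.data, dataset.target)

# Predicting on the training data only demonstrates the call;
# it is not a fair evaluation of the model.
predictions = model.predict(dataset.data)
print(predictions[:5])
```

Swapping in a different estimator class should leave the fit and predict lines unchanged, which is the main point of the shared interface.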
Scikit-Learn
• Supervised
• Unsupervised
• Semi-supervised
• Reinforcement learning
• Neural networks
• …and many more!
Supervised learning
• Labeled data
• You know what you’re looking for
• Classification: predict categorical labels
• Regression: predict continuous target variables
Classification
• Categorical variables
• Relationship between instance and feature
• Classification algorithms == classifiers
Classification
• Naïve Bayes classifier
  • Assumes features are independent of one another given the class
  • Fast performance
  • Decent classifier
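A naïve Bayes classifier might be tried out like this. The sketch assumes the Gaussian variant (GaussianNB) and the built-in iris dataset, neither of which is named on the slide; they are picked here because iris has continuous features and ships with scikit-learn.

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

iris = load_iris()

# GaussianNB models each feature as an independent normal distribution per class.
clf = GaussianNB()
clf.fit(iris.data, iris.target)

# Predict class indices for the first three instances and map them to names.
predicted = clf.predict(iris.data[:3])
print(iris.target_names[predicted])

# Per-class probabilities; each row sums to 1.
probabilities = clf.predict_proba(iris.data[:3])
print(probabilities.round(3))
```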
Classification
• Car Evaluation dataset (UCI)
• Features: buying price, maintenance price, number of doors, number of seats, size of the trunk, and safety ranking
• Labels: unacceptable, acceptable, good, or very good
Unsupervised algorithms
• Unlabeled data
• You might have no idea what you’re looking for
• Clustering: splitting observations into groups
• Dimensionality reduction: flatten data to fewer dimensions
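As one possible illustration of dimensionality reduction, the sketch below projects the four iris features onto two dimensions. PCA is an assumption here; the slide does not name a specific technique, and the labels are deliberately left unused.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()

# Project the 4-dimensional feature matrix onto its first 2 principal components.
# Only iris.data is used; the labels play no part in the projection.
pca = PCA(n_components=2)
reduced = pca.fit_transform(iris.data)

print(reduced.shape)
# Fraction of the total variance retained by each component.
print(pca.explained_variance_ratio_)
```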
Clustering
• Exploring the data
• Similar objects in the same group
• Distance between data points
Clustering
• K-means clustering
• Three steps
  • Chooses initial cluster centers
  • Assigns each data instance to its nearest cluster center
  • Recalculates each cluster center as the mean of its assigned instances
• Efficient
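The steps above run inside the estimator's fit call. A minimal sketch, assuming synthetic data from make_blobs and a cluster count of 3 (both chosen for illustration, not taken from the talk):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D points drawn around 3 centers; the true labels are discarded.
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# n_init=10 repeats the whole procedure from 10 different initial centers
# and keeps the best result.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

# Final cluster centers, one row per cluster.
print(kmeans.cluster_centers_)
```

The number of clusters has to be chosen up front; K-means does not discover it.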
Data preprocessing
• Encoding categorical features
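Encoding categorical features could look something like the sketch below. The three rows are hypothetical values made up for illustration (they are not taken from any real dataset), and OrdinalEncoder and OneHotEncoder are one choice among several; both assume a reasonably recent scikit-learn release.

```python
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical rows, illustrative only: buying price, number of doors, safety ranking.
rows = [
    ["low", "2", "high"],
    ["high", "4", "low"],
    ["med", "4", "med"],
]

# OrdinalEncoder maps each category to an integer, one column per feature.
# By default the integers follow alphabetical order, not a meaningful ranking.
ordinal = OrdinalEncoder()
as_integers = ordinal.fit_transform(rows)
print(as_integers)

# OneHotEncoder creates one indicator column per category instead,
# which avoids implying an order between nominal values.
one_hot = OneHotEncoder()
as_indicators = one_hot.fit_transform(rows).toarray()
print(as_indicators.shape)
```

For genuinely ordinal features, the intended order can be passed explicitly through the `categories` parameter of OrdinalEncoder.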
Data preprocessing
• Split the dataset into training and test data
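A train/test split might be written as follows. This sketch assumes a recent scikit-learn release, where train_test_split lives in sklearn.model_selection; the 25% test fraction and the iris dataset are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# Hold out 25% of the instances for testing; stratify keeps class
# proportions similar in both parts, random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0, stratify=iris.target
)

print(X_train.shape, X_test.shape)
```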
Validation and testing
• Model evaluation
• Cross-validation
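Cross-validation can be sketched with cross_val_score, again assuming a recent scikit-learn release. The classifier (GaussianNB), the dataset (iris), and the 5 folds are assumptions picked for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

iris = load_iris()

# 5-fold cross-validation: the model is trained on 4 folds and scored on
# the held-out fold, rotating until every fold has been the test fold once.
scores = cross_val_score(GaussianNB(), iris.data, iris.target, cv=5)

print(scores)
print(scores.mean())
```

Reporting the mean together with the spread of the fold scores gives a steadier estimate than a single train/test split.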
Good strategies
• Avoid overfitting
• Use lots of data
• Intuition fails in high dimensions
My materials
• Scikit-learn.org documentation and tutorials
• Machine learning class at U of M
• Scikit-learn talks
Resources
• Scikit-learn documentation and tutorials
  • scikit-learn.org/stable/documentation.html
• Other resources
  • http://archive.ics.uci.edu/ml/datasets.html
  • Mldata.org
• Videos
  • Scikit-learn tutorial: http://vimeo.com/53062607
  • Intro to scikit-learn: http://vimeo.com/72859487
Contact me!
• @sarah_guido
• Linkedin.com/sarahguido
• github.com/sarguido