Data Science for Random Forests Meetup

Dr. Brand Niemann
Director and Senior Data Scientist/Data Journalist
Semantic Community

Data Science for Random Forests

November 2, 2015




Random Forests Defined

• Random forests are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random forests correct for decision trees' habit of overfitting to their training set.[1]:587–588

• The algorithm for inducing a random forest was developed by Leo Breiman[2] and Adele Cutler,[3] and "Random Forests" is their trademark. The method combines Breiman's "bagging" idea and the random selection of features, introduced independently by Ho[4][5] and Amit and Geman[6], in order to construct a collection of decision trees with controlled variance.

• The selection of a random subset of features is an example of the random subspace method, which, in Ho's formulation, is a way to implement classification proposed by Eugene Kleinberg.[7]

https://en.wikipedia.org/wiki/Random_forest
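The definition above names three ingredients: bootstrap sampling ("bagging"), a random subset of features, and aggregation by mode or mean. A minimal sketch of just those ingredients in plain Python follows. It does not grow any decision trees, and the function names and example labels are hypothetical, chosen only for illustration.

```python
import random
from collections import Counter
from statistics import mean


def bootstrap_sample(rows, rng):
    """Bagging step: draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(n_features, k, rng):
    """Random subspace step: pick k distinct feature indices."""
    return sorted(rng.sample(range(n_features), k))


def aggregate_classification(tree_votes):
    """Classification: return the mode of the individual trees' classes."""
    return Counter(tree_votes).most_common(1)[0][0]


def aggregate_regression(tree_outputs):
    """Regression: return the mean of the individual trees' predictions."""
    return mean(tree_outputs)


if __name__ == "__main__":
    rng = random.Random(0)
    # Each tree would be trained on its own bootstrap sample and feature subset.
    print(bootstrap_sample(list(range(10)), rng))
    print(random_feature_subset(n_features=8, k=3, rng=rng))
    # Made-up votes from three trees for a single observation.
    print(aggregate_classification(["spruce", "tundra", "spruce"]))
    print(aggregate_regression([1.0, 2.0, 3.0]))
```

In a real forest each tree is fitted to its own bootstrap sample, and the feature subset is usually redrawn at every split rather than once per tree.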


Machine learning and data mining

• Problems: Classification. Clustering. Regression. Anomaly detection. Association rules. Reinforcement learning. Structured prediction. Feature learning. Online learning. Semi-supervised learning. Unsupervised learning. Learning to rank. Grammar induction.

• Supervised learning (classification • regression): Decision trees. Ensembles (Bagging, Boosting, Random forest). k-NN. Linear regression. Naive Bayes. Neural networks. Logistic regression. Perceptron. Support vector machine (SVM). Relevance vector machine (RVM).

• Clustering: BIRCH. Hierarchical. k-means. Expectation-maximization (EM). DBSCAN. OPTICS. Mean-shift.

https://en.wikipedia.org/wiki/Random_forest


Introduction to Random Forests for Beginners – free ebook

• Random Forests is one of the most powerful and successful machine learning techniques. This free ebook, An Introduction to Random Forests for Beginners, will help beginners leverage the power of Random Forests.

• Random Forests is one of the top 2 methods used by Kaggle competition winners.

• Random Forests is an ensemble learning method for classification and regression that builds many decision trees at training time and combines their output for the final prediction.

• This ebook will help beginners leverage the power of multiple alternative analyses, randomization strategies, and ensemble learning with Random Forests. The 70-page ebook includes graphs, examples, and illustrations.

• Chapters include:
  • What is Random Forests?
  • Segment and cluster
  • Suited for wide data
  • Advantages of Random Forests
  • Case Study example

http://info.salford-systems.com/an-introduction-to-random-forests-for-beginners


Real World Example

• The Future of Alaska Project: Forecasting Alaska’s Ecosystem in the 22nd Century

• Analytics On a Grand Scale: Alaska over the next 100 years

• To assist long-term planning related to Alaska’s biological natural resources, researchers at the University of Alaska, led by Professor Falk Huettmann, have built models predicting the influence of climate change on many of Alaska’s plants and animals.

• An Associate Professor of Wildlife Ecology, Dr. Huettmann runs the EWHALE (Ecological Wildlife Habitat Data Analysis for the Land and Seascape) Lab with the Institute of Arctic Biology, Biology and Wildlife Department at the University of Alaska-Fairbanks (UAF).


Connecting Alaska Landscapes Into the Future

• We employed the Random Forests™ modeling algorithm to identify probable relationships between historic temperature and precipitation data and known distributions for species and biomes across Alaska. These relationships were then used to predict future species and biome distribution based on projected temperature and precipitation. This approach, known as ensemble modeling, takes the average of the outputs of multiple individual models, thus generally providing more robust predictions (Breiman 1998, 2001).


MODELING CLIMATE CHANGE ENVELOPES: RANDOM FORESTS™
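The workflow described on this slide (learn the relationship between historic climate and known biome distribution, then predict under projected climate) can be sketched with scikit-learn's `RandomForestClassifier`. This is only an illustration under stated assumptions: the data below are synthetic, the two biome labels and the labelling rule are invented, and the report's own software, variables, and settings are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# SYNTHETIC stand-in for historic climate per map cell (invented numbers):
# column 0 = mean temperature (deg C), column 1 = annual precipitation (mm).
n = 300
temperature = rng.uniform(-15.0, 5.0, size=n)
precipitation = rng.uniform(150.0, 1500.0, size=n)
X_historic = np.column_stack([temperature, precipitation])

# HYPOTHETICAL labelling rule, used only to give the toy data some structure.
biome = np.where(temperature < -5.0, "tundra", "boreal")

# Fit the forest on the historic climate -> known biome relationship.
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_historic, biome)

# HYPOTHETICAL projection: the same cells, 4 deg C warmer, same precipitation.
X_projected = X_historic + np.array([4.0, 0.0])
projected_biome = model.predict(X_projected)

# The forest's class probabilities are the average over its individual trees,
# which is the ensemble averaging the slide refers to.
proba = model.predict_proba(X_projected)
per_tree_mean = np.mean(
    [tree.predict_proba(X_projected) for tree in model.estimators_], axis=0
)

print("historic tundra cells: ", int((biome == "tundra").sum()))
print("projected tundra cells:", int((projected_biome == "tundra").sum()))
print("forest equals mean of trees:", bool(np.allclose(proba, per_tree_mean)))
```

Fixing `random_state` makes the sketch repeatable, which matters when trying to reproduce someone else's forest results.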


Commentary

• Dr. Falk Huettmann is very confident in the RandomForest software and its results for Alaska. However, the inventor, Professor Leo Breiman, describes his philosophy as follows:
  • RF is an example of a tool that is useful in doing analyses of scientific data.
  • But the cleverest algorithms are no substitute for human intelligence and knowledge of the data in the problem.
  • Take the output of random forests not as absolute truth, but as smart computer generated guesses that may be helpful in leading to a deeper understanding of the problem.

• We need to do an audit and see who is closer to the truth.


Wide Data: 317 Rows and 468 Columns

Alaska.csv
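A quick way to confirm the "wide" shape quoted above is to count rows and columns directly. The sketch below uses only the standard library and assumes the file has a single header row. The slide does not say whether the 468 columns include a target or identifier column, so the per-split feature count it prints is illustrative: it applies the square-root heuristic commonly used as a default for classification forests, not a setting taken from the Alaska analysis.

```python
import csv
import math


def csv_shape(path):
    """Return (data rows, columns) for a CSV file with one header row."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        n_rows = sum(1 for _ in reader)
    return n_rows, len(header)


def features_per_split(n_columns):
    """Square-root heuristic: floor(sqrt(p)) candidate features per split."""
    return max(1, math.isqrt(n_columns))


if __name__ == "__main__":
    # The path is an assumption: Alaska.csv must be in the working directory.
    rows, cols = csv_shape("Alaska.csv")
    print(rows, "rows x", cols, "columns")
    print("candidate features per split:", features_per_split(cols))
```

With far more columns than rows, only a small random subset of columns is considered at each split, which is one reason the method is described as suited to wide data.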


http://ckan.snap.uaf.edu/dataset


Conclusions and Recommendations

• I got a request from a new member of the Federal Big Data Working Group Meetup to "look over my shoulder" while I did my data science, to help him enter a Kaggle competition.

• He tried the Kaggle Titanic example with the data set and R script, which I had done with Spotfire in my Data Science for Statistics.com tutorial.

• Then we found the Introduction to Random Forests for Beginners free ebook and the Alaska data set.

• So far we have been unable to understand and reproduce Dr. Falk Huettmann’s Random Forest results in the Connecting Alaska Landscapes Into the Future (2010) report.

• Next we want to extend the reach of R to the enterprise and run another visualization bakeoff.


Agenda

• 6:30 p.m. Welcome and Introduction (New Tutorial and Mentoring)
  • Start with Video: Learning Path: Data Science with R, then see Kaggle Competition: How Much Did It Rain? II using Spotfire instead of R! See Slides and Slides for Data Science for Random Forests
  • Recent Addition: Data Science for Six World Series-Time Series Analysis and Forecasting
  • Also see new Data Science Data Publication: Homelessness in Metropolitan Washington for Data Science for Homeless Data Bakeoff Part II on November 4th (to be rescheduled)

• 7:15 p.m. Brief Member Introductions

• 7:30 p.m. Invited Presentation: Ujval Kamath (TIBCO Data Science Team Member for Louis Bajuk-Yorgan, TIBCO Spotfire Senior Director, Project Management) Slides

• 8:30 p.m. Open Discussion

• 8:45 p.m. Networking

• 9:00 p.m. Depart