Data Science for Random Forests Meetup Dr. Brand Niemann Director and Senior Data Scientist/Data...

1

Data Science for Random Forests Meetup

Dr. Brand Niemann

Director and Senior Data Scientist/Data Journalist

Semantic Community

Data Science

Data Science for Random Forests

November 2, 2015

http://semanticommunity.info/

http://semanticommunity.info/Data_Science

http://semanticommunity.info/Data_Science/Data_Science_for_Random_Forests

2

Random Forests Defined

• Random forests are an ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random forests correct for decision trees' habit of overfitting to their training set.[1]:587–588

• The algorithm for inducing a random forest was developed by Leo Breiman[2] and Adele Cutler,[3] and "Random Forests" is their trademark. The method combines Breiman's "bagging" idea and the random selection of features, introduced independently by Ho[4][5] and Amit and Geman[6], in order to construct a collection of decision trees with controlled variance.

• The selection of a random subset of features is an example of the random subspace method, which, in Ho's formulation, is a way to implement classification proposed by Eugene Kleinberg.[7]
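The three ingredients named in the definition above (bagging, a random subset of features, and a vote by mode) can be sketched in a few lines of standard-library Python. This is a minimal illustration, not Breiman and Cutler's implementation: each "tree" is only a one-split stump, so choosing features per tree and per split coincide, and the toy rows, labels, and function names are hypothetical.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Bagging step: draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]

def fit_stump(rows, feature_ids):
    """Fit a one-split tree that may only look at the given features.

    Each row is (features, label). Returns
    (feature, threshold, left_label, right_label).
    """
    best = None
    for f in feature_ids:
        for threshold in sorted({x[f] for x, _ in rows}):
            left = [y for x, y in rows if x[f] <= threshold]
            right = [y for x, y in rows if x[f] > threshold]
            if not left or not right:
                continue
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    if best is None:
        # No usable split in this sample: always predict the majority class.
        majority = Counter(y for _, y in rows).most_common(1)[0][0]
        return (feature_ids[0], float("inf"), majority, majority)
    return best[1:]

def predict_stump(stump, x):
    feature, threshold, left_label, right_label = stump
    return left_label if x[feature] <= threshold else right_label

def fit_forest(rows, n_trees=25, features_per_tree=2, seed=0):
    """Train each stump on its own bootstrap sample and feature subset."""
    rng = random.Random(seed)
    n_features = len(rows[0][0])
    forest = []
    for _ in range(n_trees):
        sample = bootstrap_sample(rows, rng)
        feature_ids = rng.sample(range(n_features), features_per_tree)
        forest.append(fit_stump(sample, feature_ids))
    return forest

def predict_forest(forest, x):
    """Classification output: the mode of the individual tree votes."""
    votes = [predict_stump(stump, x) for stump in forest]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical toy data: three numeric features, two classes.
rows = [
    ((1, 2, 1), "low"), ((2, 1, 3), "low"),
    ((3, 3, 2), "low"), ((2, 2, 2), "low"),
    ((7, 8, 9), "high"), ((8, 7, 7), "high"),
    ((9, 9, 8), "high"), ((8, 8, 8), "high"),
]
forest = fit_forest(rows)
print(predict_forest(forest, (0, 0, 0)))
print(predict_forest(forest, (100, 100, 100)))
```

For regression, the final step would return the mean of the individual tree outputs instead of the mode.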

https://en.wikipedia.org/wiki/Random_forest

https://en.wikipedia.org/wiki/Ensemble_learning

https://en.wikipedia.org/wiki/Statistical_classification

https://en.wikipedia.org/wiki/Regression_analysis

https://en.wikipedia.org/wiki/Decision_tree_learning

https://en.wikipedia.org/wiki/Mode_(statistics)

https://en.wikipedia.org/wiki/Overfitting

https://en.wikipedia.org/wiki/Random_forest#cite_note-elemstatlearn-1

https://en.wikipedia.org/wiki/Leo_Breiman

https://en.wikipedia.org/wiki/Random_forest#cite_note-breiman2001-2

https://en.wikipedia.org/wiki/Random_forest#cite_note-rpackage-3

https://en.wikipedia.org/wiki/Trademark

https://en.wikipedia.org/wiki/Bootstrap_aggregating

https://en.wikipedia.org/wiki/Random_forest#cite_note-4

https://en.wikipedia.org/wiki/Random_forest#cite_note-ho1998-5

https://en.wikipedia.org/wiki/Donald_Geman

https://en.wikipedia.org/wiki/Random_forest#cite_note-amitgeman1997-6

https://en.wikipedia.org/wiki/Random_subspace_method

https://en.wikipedia.org/wiki/Random_forest#cite_note-7


3

Machine learning and data mining

• Problems: Classification. Clustering. Regression. Anomaly detection. Association rules. Reinforcement learning. Structured prediction. Feature learning. Online learning. Semi-supervised learning. Unsupervised learning. Learning to rank. Grammar induction.

• Supervised learning (classification • regression): Decision trees. Ensembles (Bagging, Boosting, Random forest). k-NN. Linear regression. Naive Bayes. Neural networks. Logistic regression. Perceptron. Support vector machine (SVM). Relevance vector machine (RVM).

• Clustering: BIRCH. Hierarchical. k-means. Expectation-maximization (EM). DBSCAN. OPTICS. Mean-shift.

https://en.wikipedia.org/wiki/Random_forest


4

Introduction to Random Forests for Beginners – free ebook

• Random Forests is one of the most powerful and successful machine learning techniques. This free ebook, An Introduction to Random Forests for Beginners, will help beginners to leverage the power of Random Forests.

• Random Forests is one of the top 2 methods used by Kaggle competition winners.

• An Introduction to Random Forests: It is an ensemble learning method for classification and regression that builds many decision trees at training time and combines their output for the final prediction.

• This ebook will help beginners leverage the power of multiple alternative analyses, randomization strategies, and ensemble learning with Random Forests. The 70-page ebook includes graphs, examples, and illustrations.

• Chapters include:
  • What is Random Forests?
  • Segment and cluster
  • Suited for wide data
  • Advantages of Random Forests
  • Case Study example

http://info.salford-systems.com/an-introduction-to-random-forests-for-beginners

5

Real World Example

• The Future of Alaska Project: Forecasting Alaska’s Ecosystem in the 22nd Century

• Analytics On a Grand Scale: Alaska over the next 100 years

• To assist long-term planning related to Alaska’s biological natural resources, researchers at the University of Alaska, led by Professor Falk Huettmann, have built models predicting the influence of climate change on many of Alaska’s plants and animals.

• An Associate Professor of Wildlife Ecology, Dr. Huettmann runs the EWHALE (Ecological Wildlife Habitat Data Analysis for the Land and Seascape) Lab with the Institute of Arctic Biology, Biology and Wildlife Department at the University of Alaska-Fairbanks (UAF).

Real World Example

http://semanticommunity.info/Data_Science/Data_Science_for_Random_Forests#Real_World_Example

6

Connecting Alaska Landscapes Into the Future

• We employed the Random Forests™ modeling algorithm to identify probable relationships between historic temperature and precipitation data and known distributions for species and biomes across Alaska. These relationships were then used to predict future species and biome distribution based on projected temperature and precipitation. This approach, known as ensemble modeling, takes the average of the outputs of multiple individual models, thus generally providing more robust predictions (Breiman 1998, 2001).
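Why averaging many individual models tends to give more robust predictions can be illustrated with a small simulation. This is a sketch under strong assumptions, not the report's method: each "model" here is a hypothetical stand-in that returns a made-up true value plus independent Gaussian error. Trees in a real forest are correlated, so the reduction in spread is smaller in practice than this idealized case suggests.

```python
import random
import statistics

def noisy_model(true_value, rng, noise_sd=1.0):
    """Stand-in for one individual model: the true value plus random error."""
    return true_value + rng.gauss(0.0, noise_sd)

def ensemble_prediction(true_value, n_models, rng):
    """Ensemble output: the average of n_models individual model outputs."""
    return statistics.mean(noisy_model(true_value, rng) for _ in range(n_models))

rng = random.Random(0)
TRUE_VALUE = 10.0  # hypothetical quantity being predicted
N_TRIALS = 500

# Repeat the experiment many times to see how much each predictor varies.
single = [noisy_model(TRUE_VALUE, rng) for _ in range(N_TRIALS)]
averaged = [ensemble_prediction(TRUE_VALUE, 25, rng) for _ in range(N_TRIALS)]

single_spread = statistics.stdev(single)
ensemble_spread = statistics.stdev(averaged)
print(f"spread of one model:          {single_spread:.2f}")
print(f"spread of a 25-model average: {ensemble_spread:.2f}")
```

With independent errors, the spread of the average shrinks roughly with the square root of the number of models, which is the idealized version of the "more robust predictions" claim.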

Connecting Alaska Landscapes Into the Future

MODELING CLIMATE CHANGE ENVELOPES: RANDOM FORESTS™

http://semanticommunity.info/Data_Science/Data_Science_for_Random_Forests#Connecting_Alaska_Landscapes_Into_the_Future

http://semanticommunity.info/Data_Science/Data_Science_for_Random_Forests#MODELING_CLIMATE_CHANGE_ENVELOPES:_RANDOM_FORESTS.E2.84.A2

7

Commentary

• Dr. Falk Huettmann is very confident in the RandomForest software and results for Alaska. However, the inventor, Professor Leo Breiman, says his philosophy is:
  • RF is an example of a tool that is useful in doing analyses of scientific data.
  • But the cleverest algorithms are no substitute for human intelligence and knowledge of the data in the problem.
  • Take the output of random forests not as absolute truth, but as smart computer generated guesses that may be helpful in leading to a deeper understanding of the problem.

• We need to do an audit and see who is closer to the truth.

8

Wide Data: 317 Rows and 468 Columns

Alaska.csv

http://semanticommunity.info/@api/deki/files/35411/Alaska.csv?origin=mt-web
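A first check on a wide file like this is to confirm its shape and see how many candidate features a forest would consider at each split. The sketch below is illustrative only: the inline sample and its column names are hypothetical stand-ins for Alaska.csv, and the square-root and one-third rules are common rules of thumb rather than settings taken from the report. It also treats all 468 columns as features, although some are presumably identifiers or targets.

```python
import csv
import io
import math

def csv_shape(text_stream):
    """Return (n_rows, n_columns) for a CSV that has one header row."""
    reader = csv.reader(text_stream)
    header = next(reader)
    n_rows = sum(1 for _ in reader)
    return n_rows, len(header)

def features_per_split(n_features, task="classification"):
    """Rule-of-thumb number of features tried at each split."""
    if task == "classification":
        return max(1, math.isqrt(n_features))  # floor of the square root
    return max(1, n_features // 3)             # one third, for regression

# Hypothetical two-row sample standing in for the real file.
sample = io.StringIO(
    "site,temp,precip,biome\n"
    "A,1.5,300,tundra\n"
    "B,4.0,450,boreal\n"
)
print(csv_shape(sample))

# Using the column count from the slide title (468) as the feature count.
print(features_per_split(468))
print(features_per_split(468, task="regression"))
```

To inspect the real file, pass an opened copy of Alaska.csv to `csv_shape` in place of the inline sample.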

9

Web Player

https://spotfire.cloud.tibco.com/spotfire/wp/render/20389173303/analysis?file=/users/bniemann/Public/RandomForests-Spotfire&waid=VFSGNyt2TEibJMSNFxMGV-310729d31e-hFq

13

http://ckan.snap.uaf.edu/dataset

14

Statistics Essentials For Dummies and Statistics II For Dummies

Statistics Essentials For Dummies Statistics II For Dummies

http://semanticommunity.info/Data_Science/Data_Science_for_Random_Forests#Statistics_Essentials_For_Dummies

http://semanticommunity.info/Data_Science/Data_Science_for_Random_Forests#Statistics_II_For_Dummies

15

Conclusions and Recommendations

• I got a request from a new member of the Federal Big Data Working Group Meetup to "look over my shoulder" while I did my data science, to help him enter a Kaggle competition.

• He tried the Kaggle Titanic example with the data set and R script, which I had done with Spotfire in my Data Science for Statistics.com tutorial.

• Then we found the Introduction to Random Forests for Beginners – free ebook and the Alaska data set.

• So far we have been unable to understand and reproduce Dr. Falk Huettmann’s Random Forest results in the Connecting Alaska Landscapes Into the Future (2010) report.

• Next we want to Extend the Reach of R to the Enterprise and hold another Visualization Bakeoff.

http://semanticommunity.info/Data_Science/Federal_Big_Data_Working_Group_Meetup

http://semanticommunity.info/Data_Science/Data_Science_for_Statistics.com

http://semanticommunity.info/Data_Science/Data_Science_for_Random_Forests#Introduction_to_Random_Forests_for_Beginners_.E2.80.93_free_ebook

16

Agenda

• 6:30 p.m. Welcome and Introduction (New Tutorial and Mentoring). Start with Video: Learning Path: Data Science with R, then see Kaggle Competition: How Much Did It Rain? II using Spotfire instead of R! See Slides and Slides for Data Science for Random Forests.
  • Recent Addition: Data Science for Six World Series-Time Series Analysis and Forecasting
  • Also see new Data Science Data Publication: Homelessness in Metropolitan Washington for Data Science for Homeless Data Bakeoff Part II on November 4th (to be rescheduled)

• 7:15 p.m. Brief Member Introductions

• 7:30 p.m. Invited Presentation: Ujval Kamath (TIBCO Data Science Team Member, for Louis Bajuk-Yorgan, TIBCO Spotfire Senior Director, Project Management). Slides

• 8:30 p.m. Open Discussion

• 8:45 p.m. Networking

• 9:00 p.m. Depart

https://player.oreilly.com/videos/9781491940303?toc_id=220077

http://semanticommunity.info/Data_Science/Data_Science_for_Random_Forests#How_Much_Did_It_Rain.3F_II

http://semanticommunity.info/@api/deki/files/35935/BrandNiemann11022015A.pptx

http://semanticommunity.info/%40api/deki/files/35634/BrandNiemann11022015.pptx?origin=mt-web



http://semanticommunity.info/Data_Science/Data_Science_for_Six_World_Series-Time_Series_Analysis_and_Forecasting

http://semanticommunity.info/Data_Science/Data_Science_for_Homeless_Data/Homelessness_in_Metropolitan_Washington

http://semanticommunity.info/Data_Science/Data_Science_for_Homeless_Data

https://www.linkedin.com/pub/ujval-kamath/27/292/b9

http://semanticommunity.info/Data_Science/Data_Science_for_Random_Forests#Slides
