Using R for Predictive Analytics / Data Mining / Data Science - Oct 2009
Using R for Predictive Analytics
Szilard Pafka
Predictive Analytics World / DC useR Group
October 20, 2009
Introduction and agenda
Introduction:
• physics, finance, statistical modeling (PhD)
• data mining at a credit card processor (Chief Scientist)
• co-organizing the Los Angeles area R users group
Agenda:
• using R for predictive analytics
• benefits from R users groups
R and predictive analytics
R:
• data manipulation
• statistical analysis, exploratory data analysis
• computations, simulations
• modeling
• visualization
Predictive analytics (data mining for prediction problems):
• understanding the domain problem and related questions
• data exploration and cleaning
• building predictive models (supervised learning)
• model diagnostics, selection and evaluation
• deployment
R vs alternatives
R:
• open source
  • packages
  • latest cutting-edge methods
• command line interaction (vs GUI)
  • flexible
  • results easier to reproduce
• learning curve
  • books
  • R-help
• cost, multiple platforms
Alternatives:
• spreadsheets
• programming languages
• data mining software
• similar software environments
Data exploration
• data frames
• import from various sources (csv files, database connection)
• flexible data manipulation functionalities
  • subscripting
  • numerous transformations
  • text manipulation (e.g. regexp)
  • data aggregation
  • specialized packages (reshape, plyr etc.)
• graphics
• sample R code
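The slide's sample code is not part of this transcript. As a stand-in, here is a minimal base R sketch of the operations listed above (subscripting, a transformation, regexp-based text cleanup, aggregation). The `txns` data frame, its column names, and its values are hypothetical and made up purely for illustration.

```r
# Hypothetical toy data standing in for an imported table; with a real file
# one would typically start from read.csv() or a database query instead.
txns <- data.frame(
  id       = 1:6,
  merchant = c("ACME Store #12", "acme store #7", "Bolt Cafe",
               "BOLT CAFE", "Cobalt Gas", "Acme Store #3"),
  amount   = c(25.0, 310.5, 4.2, 6.8, 48.0, 1200.0),
  stringsAsFactors = FALSE
)

# Subscripting: rows with amount above 20, selected columns only
large <- txns[txns$amount > 20, c("id", "amount")]

# Transformation: add a log-scaled amount column
txns$log_amount <- log10(txns$amount)

# Text manipulation with regular expressions: lower-case the name and
# strip a trailing store number such as " #12"
txns$merchant_clean <- sub(" #[0-9]+$", "", tolower(txns$merchant))

# Aggregation: total amount and number of transactions per cleaned name
totals <- aggregate(amount ~ merchant_clean, data = txns, FUN = sum)
counts <- table(txns$merchant_clean)

print(large)
print(totals)
print(counts)
```

Packages such as reshape and plyr, mentioned above, offer more compact ways to express the aggregation step; the base functions are used here only to keep the sketch free of dependencies.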
Visualization
• powerful tool for data exploration, diagnostics etc.
• R graphics: base, lattice, ggplot2
• visual perception, principles of visual design
• ggplot2
  • based on the grammar of graphics
  • nice defaults
  • elements: mappings, geoms, stats, scales, facets
  • very expressive (sample code)
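The original sample code is not included in the transcript. The sketch below is one possible illustration of how the listed elements (mappings, geoms, stats, scales, facets) combine in a single ggplot2 call. It assumes the ggplot2 package is installed and uses the built-in `mtcars` data set as stand-in data, which is a choice made here and not taken from the talk.

```r
library(ggplot2)

# mtcars ships with R; it is used here only as stand-in data
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +  # mappings
  geom_point() +                            # geom: one point per car
  stat_smooth(method = "lm", se = FALSE) +  # stat: linear fit per colour group
  scale_x_log10() +                         # scale: log-transformed x axis
  facet_wrap(~ am) +                        # facets: one panel per transmission type
  labs(x = "weight (1000 lbs)", y = "miles per gallon", colour = "cylinders")

print(p)
```

Each `+` adds one layer or setting, so a plot can be refined step by step by adding or removing a single line.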
Building predictive models
• supervised learning: $y = f(x_1, x_2, \ldots, x_p)$
• regression (numeric y), classification (categorical y)
• spam detection example: classification (2 classes, Y/N)
• numerous methods: logistic regression, LDA, nearest neighbors,
decision trees, bagging, boosting, random forests, neural networks,
support vector machines etc.
• example: neural networks with nnet (sample code)
  • weight decay as regularization
  • helps convergence and guards against overfitting
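The talk's sample code and spam data are not reproduced in the transcript. As a rough sketch of the idea, the code below fits a small single-hidden-layer network with `nnet()` on simulated two-class data, once without and once with weight decay. The simulated data, the hidden layer size of 5, and the decay value of 0.05 are all assumptions chosen for illustration, not settings from the talk.

```r
library(nnet)  # a recommended package, included with standard R installations

set.seed(42)

# Simulated two-class data (an assumption standing in for the spam data):
# the class depends nonlinearly on two numeric predictors
n  <- 400
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- factor(ifelse(x1^2 + x2^2 + rnorm(n, sd = 0.5) > 1.5, "Y", "N"))
d  <- data.frame(x1, x2, y)

# Hold out 100 observations for testing
train_idx <- sample(n, 300)
train <- d[train_idx, ]
test  <- d[-train_idx, ]

# Fit the same small network without and with weight decay
fit_plain <- nnet(y ~ x1 + x2, data = train, size = 5,
                  decay = 0,    maxit = 500, trace = FALSE)
fit_decay <- nnet(y ~ x1 + x2, data = train, size = 5,
                  decay = 0.05, maxit = 500, trace = FALSE)

# Held-out misclassification rate for a fitted model
err <- function(fit) mean(predict(fit, test, type = "class") != test$y)
print(c(plain = err(fit_plain), decay = err(fit_decay)))

# Decay penalizes large weights, so the penalized fit typically ends up
# with a smaller sum of squared weights
print(c(plain = sum(fit_plain$wts^2), decay = sum(fit_decay$wts^2)))
```

Results depend on the random seed and on the starting weights, so a single run like this is only suggestive; in practice the decay value would be chosen by resampling.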
Model evaluation
99% accuracy?
• package caret: unified interface/wrapper for
  • training models, tuning (complexity parameters)
  • diagnosing, selecting, evaluating models
• model deployment
• Conclusion: R is a viable tool for predictive analytics
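One possible reading of the "99% accuracy?" question above is that accuracy alone can mislead when one class is rare. The base R sketch below illustrates that reading with simulated labels and a trivial classifier; the 1% positive rate and the always-"N" predictor are assumptions made here, not details from the talk.

```r
set.seed(1)

# Simulated labels with a rare positive class (about 1% "Y"); the 1% rate
# is an assumption chosen to echo the "99%" figure
n      <- 10000
actual <- factor(ifelse(runif(n) < 0.01, "Y", "N"), levels = c("N", "Y"))

# A useless "model" that always predicts the majority class
predicted <- factor(rep("N", n), levels = c("N", "Y"))

# Confusion matrix: rows are predictions, columns are actual classes
cm <- table(predicted = predicted, actual = actual)
print(cm)

accuracy    <- sum(diag(cm)) / sum(cm)        # share of correct predictions
sensitivity <- cm["Y", "Y"] / sum(cm[, "Y"])  # share of actual Y that were caught
specificity <- cm["N", "N"] / sum(cm[, "N"])  # share of actual N left alone

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity), 3))
```

The trivial classifier scores roughly 99% accuracy while catching none of the positive cases, which is why class-specific measures and a baseline comparison belong in any model evaluation.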
R users groups
• R users groups around the world, in the US: SF, NY, LA, DC...
• Los Angeles: 150+ members, 4 meetings so far
• talks, Q&A, discussions, exchange of knowledge
  • pointing people to where to look (packages, docs, etc.)
  • also, more generally, help with statistical methodology issues
• networking opportunities