Team: #19 Presenters: Xiaozhe Wang, Yue Gu
Ricardo: Integrating R and Hadoop
Agenda
• Background
• Introduction to R
• Disadvantages for Current Strategies
• Introduction to Ricardo
• Overview of Ricardo’s Architecture
• Evaluation
• Reference
Data Mining Examples
• Amazon: personalized product recommendations
• Netflix: recommends movies to each customer based on that customer's tastes
Background
R’s functionality for Data Mining
• Principal and independent component analysis
• k-means clustering
• SVM classification
• Generalized-linear, latent-factor, and Bayesian models
• Time-series analysis
Introduction to R
R: Simplified Method for Data Mining
K-means in R
K-means Algorithm
Introduction to R
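R exposes k-means through its built-in `kmeans` function, so the whole analysis is a single call. To make the algorithm itself concrete, here is a minimal sketch of the standard iterative (Lloyd's) procedure, written in Python for illustration rather than in R. The function name, the choice of the first k points as initial centers, and the iteration cap are assumptions of this sketch, not details from the slides.

```python
import math

def kmeans(points, k, iterations=20):
    """Cluster equal-length numeric tuples with Lloyd's algorithm.

    Assumption of this sketch: the first k points seed the centers,
    which keeps the result deterministic.
    """
    centers = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                new_centers.append(mean)
            else:
                # Keep the old center if a cluster ends up empty.
                new_centers.append(centers[c])
        if new_centers == centers:
            break  # Converged: centers stopped moving.
        centers = new_centers
    return centers, labels

points = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
centers, labels = kmeans(points, k=2)
```

Every iteration scans all points, which is why the method becomes costly once the dataset no longer fits in the memory of a single machine.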
Disadvantages for Current Strategies
• Exploit vertical scalability: limited, expensive
• Sample the dataset: loses important features, loses accuracy
• Large-scale data management system (DMS): less functionality
Disadvantages for Current Strategies in Scalability for Data Mining
Ricardo: R and Hadoop
Introduction to Ricardo
Overview of Ricardo’s Architecture
Architecture
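The Ricardo paper splits an analysis into a large-data part, which Jaql runs on Hadoop, and a small-data part, which runs in R on the aggregated results. The sketch below illustrates that division of labour with a simple linear regression. It is written in plain Python rather than Jaql or R, and the function names (`summarize`, `merge`, `fit_line`) and the toy partitions are hypothetical, chosen only for illustration.

```python
from functools import reduce

def summarize(partition):
    """Large-data part (map side): reduce one partition of (x, y)
    pairs to five sufficient statistics."""
    n = len(partition)
    sx = sum(x for x, _ in partition)
    sy = sum(y for _, y in partition)
    sxx = sum(x * x for x, _ in partition)
    sxy = sum(x * y for x, y in partition)
    return (n, sx, sy, sxx, sxy)

def merge(a, b):
    """Large-data part (reduce side): combine two partial summaries."""
    return tuple(u + v for u, v in zip(a, b))

def fit_line(stats):
    """Small-data part: least-squares slope and intercept computed
    from the merged summary alone."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

# Hypothetical data, spread over two partitions, lying on y = 2x + 1.
partitions = [[(1, 3), (2, 5)], [(3, 7), (4, 9)]]
stats = reduce(merge, map(summarize, partitions))
slope, intercept = fit_line(stats)
```

Only the five-number summary crosses from the data-parallel side to the modelling side, so the statistical step stays small no matter how many records the partitions hold.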
Performance and Scalability
• Objective: simulate a real recommender system
• Dataset: original Netflix competition dataset
• Jaql requires about twice as much time as raw Hadoop.
• In exchange, Jaql offers a higher level of abstraction.
Evaluation
Conclusion
Ricardo, a scalable platform
Reference
S. Das, Y. Sismanis, K. S. Beyer, R. Gemulla, P. J. Haas, and J. McPherson. Ricardo: Integrating R and Hadoop. In SIGMOD 2010. http://www.mpi-inf.mpg.de/~rgemulla/publications/das10ricardo.pdf
Questions?
Thanks!!!!