Team: #19 Presenters: Xiaozhe Wang, Yue Gu

Post on 09-Jan-2016



Ricardo: Integrating R and Hadoop

Agenda
• Background
• Introduction to R
• Disadvantages of Current Strategies
• Introduction to Ricardo
• Overview of Ricardo’s Architecture
• Evaluation
• Reference

Data Mining Examples
• Amazon: personalized product recommendations
• Netflix: movie recommendations based on each customer’s tastes

Background

R’s functionality for data mining:
• Principal and independent component analysis
• k-means clustering
• SVM classification
• Generalized-linear, latent-factor, and Bayesian models
• Time-series analysis

Introduction to R

R: Simplified Method for Data Mining

[Figures: K-means in R; K-means algorithm]

Introduction to R
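The k-means algorithm named on this slide can be sketched as follows. This is a minimal illustration of the standard Lloyd iteration, written in plain Python rather than R. The function name `kmeans`, the explicit starting centroids, and the sample points are assumptions made for the example, not taken from the slides. In R itself the same task is a single call to the built-in `kmeans` function.

```python
from math import dist  # Euclidean distance, Python 3.8+


def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm: alternate an assignment step and an update step."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        updated = []
        for old, members in zip(centroids, clusters):
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(old)  # an empty cluster keeps its old centroid
        if updated == centroids:  # no centroid moved: converged
            break
        centroids = updated
    return centroids, clusters


# Hypothetical 2-D data with two well-separated groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```

Passing the starting centroids explicitly keeps the run deterministic; a practical implementation would usually pick them at random and restart several times.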

Disadvantages of Current Strategies

• Exploit vertical scalability: limited and expensive
• Sample the dataset: loses important features and accuracy
• Use a large-scale data management system (DMS): less statistical functionality

Disadvantages of Current Strategies for Scaling Data Mining
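The point about sampling losing important features can be illustrated with a small sketch. The data below is entirely hypothetical: one popular item rated by many users and fifty niche items rated once each. A uniform 1% sample keeps the popular item but drops most of the rare ones, which are exactly what a recommender needs to learn about.

```python
import random
from collections import Counter


def item_counts(ratings):
    """Count how many ratings each item received."""
    return Counter(item for _user, item in ratings)


rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical (user, item) pairs: one blockbuster plus 50 items rated once each.
ratings = [(u, "blockbuster") for u in range(10_000)]
ratings += [(u, f"niche-{u}") for u in range(50)]

sample = rng.sample(ratings, k=len(ratings) // 100)  # 1% uniform sample

full_items = item_counts(ratings)
sample_items = item_counts(sample)
missing = set(full_items) - set(sample_items)  # items the sample never saw

print(f"{len(full_items)} items in the full data, {len(sample_items)} in the sample")
print(f"{len(missing)} items lost by sampling")
```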

Ricardo: R and Hadoop

Introduction to Ricardo

Overview of Ricardo’s Architecture

Architecture
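As a rough sketch of the idea behind this architecture, the cited paper describes splitting an analysis into a large-data part that is aggregated on Hadoop (through Jaql) and a small-data part that is modeled in R. The example below imitates that split in plain Python. It does not use any real Ricardo, Jaql, or Hadoop API, and all function names and data are hypothetical. Each partition is reduced to five sufficient statistics, the statistics are summed, and only that small aggregate reaches the modeling step, which fits a straight line.

```python
from functools import reduce


def map_partition(rows):
    """'Hadoop side': reduce one data partition to five sufficient statistics."""
    n = sx = sy = sxy = sxx = 0.0
    for x, y in rows:
        n += 1
        sx += x
        sy += y
        sxy += x * y
        sxx += x * x
    return (n, sx, sy, sxy, sxx)


def combine(a, b):
    """Reduce step: statistics from two partitions simply add up."""
    return tuple(u + v for u, v in zip(a, b))


def fit_line(stats):
    """'R side': least-squares fit of y = slope * x + intercept from the aggregate."""
    n, sx, sy, sxy, sxx = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Hypothetical data in three partitions; every row satisfies y = 2x + 1.
partitions = [[(x, 2 * x + 1) for x in range(start, start + 4)] for start in (0, 4, 8)]
stats = reduce(combine, map(map_partition, partitions))
slope, intercept = fit_line(stats)
print(slope, intercept)
```

The modeling step only ever sees five numbers, however many rows the partitions hold, which is why the large-data work can stay on the cluster.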

Performance and Scalability
• Objective: simulate a real recommender system
• Dataset: the original Netflix competition dataset
• Jaql requires about twice as much time as raw Hadoop.
• Jaql works at a higher level of abstraction.

Evaluation

Conclusion

Ricardo, a scalable platform


Reference

S. Das, Y. Sismanis, K. S. Beyer, R. Gemulla, P. J. Haas, and J. McPherson. Ricardo: Integrating R and Hadoop. In SIGMOD, 2010. http://www.mpi-inf.mpg.de/~rgemulla/publications/das10ricardo.pdf

Questions?

Thanks!!!!