Ricardo: Integrating R and Hadoop

Team: #19 Presenter: Xiaozhe Wang, Yue Gu


Transcript of Team: #19 Presenter: Xiaozhe Wang Yue Gu

Page 1: Ricardo: Integrating R and Hadoop

Team: #19
Presenter: Xiaozhe Wang, Yue Gu

Page 2: Agenda

• Background
• Introduction to R
• Disadvantages for Current Strategies
• Introduction to Ricardo
• Overview of Ricardo’s Architecture
• Evaluation
• Reference

Page 3: Background

Data Mining Examples

• Amazon: personalized product recommendations
• Netflix: recommends movies to each customer based on that customer's taste

Page 4: Introduction to R

R’s functionality for Data Mining

• Principal and independent component analysis
• k-means clustering
• SVM classification
• Generalized-linear, latent-factor, Bayesian, and time-series models

Page 5: Introduction to R

R: Simplified Method for Data Mining

K-means in R

K-means algorithm
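The k-means slide can be made concrete with a minimal sketch of the standard (Lloyd's) k-means iteration in plain Python. In R the same procedure is a single call to the built-in `kmeans` function, which is presumably the simplification the slide has in mind. The Python function name, its parameters, and the sample points below are illustrative assumptions and do not come from the presentation.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm on a list of coordinate tuples.

    Returns (centroids, labels). Illustrative sketch only.
    """
    rng = random.Random(seed)
    # Start from k distinct input points chosen at random.
    centroids = [tuple(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its points.
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid
        if updated == centroids:  # no centroid moved, so we have converged
            break
        centroids = updated
    return centroids, labels


# Hypothetical sample data: two well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(points, k=2)
```

Every point is scanned on each pass, so this in-memory approach only works while the dataset fits on one machine, which is the scalability limit the next slide discusses.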

Page 6: Disadvantages for Current Strategies

Disadvantages for Current Strategies in Scalability for Data Mining

• Exploit vertical scalability: limited and expensive
• Sample the dataset: loses important features and accuracy
• Large-scale data management system (DMS): less functionality

Page 7: Introduction to Ricardo

Ricardo: R and Hadoop

Page 8: Architecture

Overview of Ricardo’s Architecture
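The architecture figure is not preserved in this transcript. As described in the paper cited on the Reference slide, Ricardo splits an analysis into a large-data part that runs on Hadoop and a small-data part that runs in R. The following is a minimal Python sketch of that split, assuming a toy task of fitting a trend line to average rating per year. All function names and records are hypothetical stand-ins; this is not Ricardo's, Jaql's, or Hadoop's API.

```python
from collections import defaultdict


def map_rating(record):
    """Map step: emit (year, (rating, 1)) for one rating record."""
    year, rating = record
    return year, (rating, 1)


def reduce_by_year(pairs):
    """Reduce step: sum ratings and counts per year, then average."""
    totals = defaultdict(lambda: [0.0, 0])
    for year, (rating, count) in pairs:
        totals[year][0] += rating
        totals[year][1] += count
    return {year: total / n for year, (total, n) in totals.items()}


def fit_line(xs, ys):
    """Small-data step: ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical (year, rating) records standing in for a large dataset.
records = [(2000, 3.0), (2000, 4.0), (2001, 4.0), (2001, 5.0), (2002, 5.0)]

# Large-data part: aggregate down to one value per year (the Hadoop side).
avg = reduce_by_year(map(map_rating, records))

# Small-data part: fit a model on the tiny aggregate (the R side).
years = sorted(avg)
intercept, slope = fit_line(years, [avg[y] for y in years])
```

Only the per-year averages cross from the aggregation step to the model-fitting step, so the statistical code never has to hold the full dataset in memory.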

Page 9: Evaluation

Performance and Scalability

• Objective: simulate a real recommender system
• Original Netflix competition dataset
• Jaql requires about twice as much time as raw Hadoop.
• In exchange, Jaql offers a higher level of abstraction.

Page 10: Conclusion

Ricardo, a scalable platform

Page 11: Reference

S. Das, Y. Sismanis, K. S. Beyer, R. Gemulla, P. J. Haas, and J. McPherson. Ricardo: Integrating R and Hadoop. In SIGMOD, 2010. http://www.mpi-inf.mpg.de/~rgemulla/publications/das10ricardo.pdf

Page 12: Questions?

Page 13: Thanks!