Team: #19 Presenter: Xiaozhe Wang Yue Gu


Ricardo: Integrating R and Hadoop

Agenda

• Background
• Introduction to R
• Disadvantages for Current Strategies
• Introduction to Ricardo
• Overview of Ricardo’s Architecture
• Evaluation
• Reference

Background

Data Mining Examples

• Amazon: personalized product recommendations
• Netflix: recommends movies to each customer based on that customer's taste

Introduction to R

R’s functionality for Data Mining

• Principal and independent component analysis
• k-means clustering
• SVM classification
• Generalized-linear, latent-factor, and Bayesian models
• Time-series analysis

Introduction to R

R: Simplified Method for Data Mining

[Figure panels: Kmeans on R; Kmeans Algorithm]
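The slide contrasts the k-means algorithm itself with its use in R, where the whole procedure is a single call to the built-in `kmeans(x, centers)` function. The figures are not recoverable from the transcript, so the following is only a minimal sketch of the standard (Lloyd's) k-means iteration, written in Python for illustration. It is not the presenters' code; the function signature, the sample points, and the choice of initial centers are assumptions.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    centers = [tuple(c) for c in centers]
    assignment = []
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest center.
        new_assignment = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the centers are stable
        assignment = new_assignment
        # Update step: move each center to the mean of its assigned points.
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old center if a cluster ends up empty
                centers[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centers, assignment

# Hypothetical 2-D data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centers, assignment = kmeans(points, centers=[points[0], points[3]])
print(assignment)
```

Spelling out the loop by hand shows what the one-line R call hides, which is the sense in which the slide calls R a simplified method.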

Disadvantages for Current Strategies in Scalability for Data Mining

• Exploit vertical scalability: limited and expensive
• Sample the dataset: loses important features and accuracy
• Large-scale data management system (DMS): less functionality

Introduction to Ricardo

Ricardo: R and Hadoop

Architecture

Overview of Ricardo’s Architecture
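The transcript preserves only the title of the architecture slide. Going by the cited paper, Ricardo keeps R as the analyst's front end and pushes the large-data aggregation to Hadoop through Jaql queries, so that only small results return to R. The sketch below illustrates that division of labor in plain Python. It uses none of R, Jaql, or Hadoop, and every function name and data value in it is hypothetical.

```python
from functools import reduce

def map_partition(records):
    """Cluster side: summarize one partition as (n, sum_x, sum_y, sum_xx, sum_xy)."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in records:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return (n, sx, sy, sxx, sxy)

def combine(a, b):
    """Reduce step: partial summaries add component-wise."""
    return tuple(u + v for u, v in zip(a, b))

def fit_line(stats):
    """Analyst side: fit y = intercept + slope * x from the small summary alone."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

# Hypothetical partitions standing in for blocks of a large distributed file.
partitions = [
    [(1.0, 3.0), (2.0, 5.0)],
    [(3.0, 7.0), (4.0, 9.0)],
]

# Only five numbers cross from the "cluster" to the "analyst" side.
stats = reduce(combine, map(map_partition, partitions))
slope, intercept = fit_line(stats)
print(slope, intercept)
```

The point of the split is that the expensive pass over the data parallelizes across partitions, while the statistical modelling step works on a summary small enough to fit in memory.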

Evaluation

Performance and Scalability

• Objective: simulate a real recommender system
• Dataset: original Netflix competition dataset
• Jaql requires about twice as much time as raw Hadoop.
• Jaql provides a higher level of abstraction.

Conclusion

Ricardo, a scalable platform

Reference

S. Das, Y. Sismanis, K. S. Beyer, R. Gemulla, P. J. Haas, and J. McPherson. Ricardo: Integrating R and Hadoop. In SIGMOD, 2010. http://www.mpi-inf.mpg.de/~rgemulla/publications/das10ricardo.pdf