Los Angeles R users group - Nov 17 2010 - Part 2
Introduction to the Future of R
Avram Aelony
November 2010
Wednesday, November 17, 2010
Talk Outline:
I. Strengths
II. Criticisms
III. Challenges
IV. Remedies and Solutions
V. The Future
Quick disclaimer:
- I don’t consider myself an R expert
- I don’t have a crystal ball informing me of the future
- This talk is about polite observations
- The future is dynamic
YMMD <- your-mileage-may-differ()
R’s Strengths
- Many good things, too many to mention individually
... but let’s try...
Strengths of R
- A high quality statistical platform, yielding reproducible results
- Open Source, free and available
- Large, active community
- Intuitive language structure
- Data as rows and columns
- Package plugin architecture - there are many packages, top packages in widespread use
- Distributed contributions written, offered, and controlled by many individuals
- Data processing for most individual needs
- Emerging success and increasing corporate adoption, e.g. for some corporate needs (often prototyping and ad hoc analytics)
Strengths of R
More succinctly... based on a paraphrasing of a post by Ted Dunning *
I. Library
II. Language
III. Community
* http://www.stat.columbia.edu/~cook/movabletype/archives/2010/09/the_future_of_r.html
Criticisms of R
- Larger grievances: memory and inefficiency
“One of the most vexing issues in R is memory. For anyone who works with large datasets - even if you have 64-bit R running and lots (e.g., 18Gb) of RAM, memory can still confound, frustrate, and stymie even experienced R users.”
http://www.matthewckeller.com/html/memory.html
- Small grievances: syntax, elegance, and managing complexity
“Most packages are very good, but I regret to say some are pretty inefficient and others downright dangerous.”
-Bill Venables, quote from 2007 http://www.mail-archive.com/[email protected]/msg06853.html
“...R functions used to be lean and mean, and now they’re full of exception-handling and calls to other packages. R functions are spaghetti-like messes of connections in which I keep expecting to run into syntax like “GOTO 120...”
- comment taken from Gelman blog on the future of R. http://www.stat.columbia.edu/~cook/movabletype/archives/2010/09/the_future_of_r.html
However, greater challenges for R lie ahead
I. Big Data is coming...
II. Isn’t Big Data already here?
How can we imagine an ideal environment to address Big Data?
- What is Big Data?
"Every 2 Days We Create As Much Information As We Did Up To 2003" - Eric Schmidt, Chairman & CEO, Google.
http://techcrunch.com/2010/08/04/schmidt-data/
"Data is abundant, Information is useful, Knowledge is precious." http://hadoop-karma.blogspot.com/2010/03/how-much-data-is-generated-on-internet.html
- Freshness, this data will self destruct in 5 seconds... !!
"How Much Time Do You Have Before Web‐Generated Leads Go Cold?" http://www.matrixintegratedmarketing.com/MIT.pdf
Get ready: “Web Scale Big Data - 100’s of Terabytes”
-John Sichi, Facebook, on intended usage with Hive. http://www.slideshare.net/jsichi/hive-evolution-apachecon-2010 slide #6.
What is Big Data?
Wikipedia - http://en.wikipedia.org/wiki/Big_data
Solving the “Big” Data problem
... as I see it,
there are 5 competing possible solution “avenues”
The “Big” Data problem:
Solution #1
Use R in Conjunction with other specialized tools.
Examples:
- R remains a language for small datasets but has “hooks” and “bridges” that enable use with MapReduce style tools (Hadoop, Streaming, Hive, Pig, Cascading, others...)
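One way to picture such a “hook”: Hadoop Streaming runs any executable that reads lines on standard input and writes tab-separated key/value lines on standard output, so an R script can serve as the mapper or the reducer. The sketch below is only a local word-count illustration of that split; `map_line` and `reduce_pairs` are hypothetical names made up for this example, and nothing here talks to a real cluster.

```r
# Map step: turn one line of text into "word<TAB>1" records,
# the tab-separated key/value format Hadoop Streaming uses by default.
map_line <- function(line) {
  words <- strsplit(tolower(line), "[^a-z]+")[[1]]
  words <- words[nzchar(words)]            # drop empty tokens
  if (length(words) == 0L) return(character(0))
  paste(words, 1L, sep = "\t")
}

# Reduce step: parse the records back and sum the counts per key.
reduce_pairs <- function(pairs) {
  parts  <- strsplit(pairs, "\t", fixed = TRUE)
  keys   <- vapply(parts, `[`, character(1), 1L)
  counts <- as.integer(vapply(parts, `[`, character(1), 2L))
  vapply(split(counts, keys), sum, integer(1))
}

# Local stand-in for the cluster: map every line, then reduce.
input_lines <- c("Big data, big R", "R and big data")
pairs       <- unlist(lapply(input_lines, map_line))
word_counts <- reduce_pairs(pairs)
print(word_counts)
```

In an actual Streaming job each function would sit in its own script, looping over `readLines(file("stdin"))` and writing results with `cat()`; the framework handles the sort and shuffle between the two steps.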
The “Big” Data problem:
Solution #2
Packages that enable new functionality for reading and processing very large data sets
Examples:
- Saptarshi Guha’s RHIPE (R and Hadoop Integrated Programming Environment)
- Kane & Emerson’s bigmemory
- Adler et al.’s ff package
- Henrik Bengtsson’s R.huge package (deprecated)
- (many new yet-to-be-developed possibilities here)
So... enhanced functionality, but no enhancements to the core language
The “Big” Data problem:
Solution #3
Same language but have R “do the right thing” under the hood.
Examples:
- Out of memory algorithms, think: “I see you’re trying to analyze a sizable amount of data...”
- Either seamlessly or after user approval to go ahead...
# perhaps, perhaps...
d <- read.table(fn="s3://mybucket.name", enormous.data=TRUE)
or if possible, enhance core language as well as functionality!!!
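To make “out of memory algorithms” a little more concrete, here is a minimal sketch of the idea in today's base R: stream a file through a connection in fixed-size chunks and keep only running totals in memory. `chunked_mean` is a hypothetical helper written for this illustration, not an existing R function, and it assumes a file with one numeric value per line.

```r
# Hypothetical helper: compute a mean without loading the whole file.
# Only one chunk of lines plus two running totals are held in memory.
chunked_mean <- function(path, chunk_size = 1000L) {
  con <- file(path, open = "r")
  on.exit(close(con))                       # always release the connection
  total <- 0
  n <- 0L
  repeat {
    chunk <- readLines(con, n = chunk_size) # next chunk; empty at end of file
    if (length(chunk) == 0L) break
    values <- as.numeric(chunk)
    total <- total + sum(values)
    n <- n + length(values)
  }
  total / n
}

# Usage: write a small demo file, then average it three lines at a time.
demo_path <- tempfile(fileext = ".txt")
writeLines(as.character(1:10), demo_path)
print(chunked_mean(demo_path, chunk_size = 3L))
```

The same pattern extends to any statistic that can be updated incrementally (sums, counts, minima, maxima); statistics that need all the data at once, such as an exact median, call for different algorithms.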
The “Big” Data problem:
Solution #4 - Completely start over
http://www.stat.auckland.ac.nz/%7Eihaka/downloads/Compstat-2008.pdf
2008
The “Big” Data problem:
2010
http://www.stat.auckland.ac.nz/%7Eihaka/downloads/JSM-2010.pdf
The “Big” Data problem:
The Ihaka/Lang “Back to the Future” paper came out in 2008.
The Ihaka “Lessons Learned” 2010 paper mentions:
- the need for an “effective language for handling large-scale computations”
- nostalgia for Lisp
Have there been any Lisp-like advances since then?
What about Clojure ?
The “Big” Data problem:
Solution #5 - Does Clojure fit the bill ?
H0: Clojure already has many of the things Ross Ihaka would ask for
H1: Really?
-Rich Hickey http://clojure.org/rationale
Clojure may be seen as a solution, or as an example path for R to follow, improve upon, or choose to differ...
Clojure
-Rich Hickey http://clojure.org
- Core Clojure
- Incanter: "a Clojure-based, R-like platform for statistical computing and graphics" http://incanter.org/
- Infer: "a (Clojure) library for machine learning and statistical inference, designed to be used in real production systems." https://github.com/bradford/infer
- Cascalog: “Data processing on Hadoop without the hassle” “a Clojure-based query language for Hadoop”
The problem with many new languages is that initially there are no libraries...
Clojure already has many, and can use any Java library directly as necessary.
What will the Future really hold for R ?
Thanks for listening...
Appendix:
A few slides on Clojure, and three powerful Clojure libraries:
- Incanter
- Infer
- Cascalog
Clojure - a quick tour
-Rich Hickey http://clojure.org
Please see http://incanter.org/docs/data-sorcery-new.pdf for an excellent intro to Incanter.
David Edgar Liebke’s Incanter
Below are example snippets from Incanter
Bradford Cross’ Infer: "a (Clojure) library for machine learning and statistical inference, designed to be used in real production systems."
https://github.com/bradford/infer
Nathan Marz’s Cascalog: http://nathanmarz.com/blog/introducing-cascalog-a-clojure-based-query-language-for-hado.html