CS-495/595 Big Data processing concepts (part 1) Lecture...
Big Data Overview Concepts (part 1) Break Assignment Conclusion References
CS-495/595 Big Data processing concepts (part 1)
Lecture #2
Dr. Chuck Cartledge
21 Jan. 2015
Table of contents I
1 Big Data
2 Overview
3 Concepts (part 1)
4 Break
5 Assignment
6 Conclusion
7 References
What is it?
And, why is it interesting?
Big data has emerged as a technology term and trend that is complementary to and considered to be equally as transformational as the cloud computing model. ... represented as an “old” or “new” capability depending on the perspective of those defining it, ...
Lee Badger [10]
Big Data can be characterized by the three V’s: volume (large amounts of data), variety (includes different types of data), and velocity (constantly accumulating new data).
Jules J. Berman [3]
Notional definition
We’ll be covering virtually “bleeding edge” stuff.
Data too big for a single machine.
Processing too long for a single machine.
Question/analysis is parallelizable.
Where does it come from?
Lots of places, lots of it, and fast.
230,000,000 tweets per day [8]
2,700,000,000 Facebook likes per day [2]
100 hours of YouTube video every minute [14]
Clickstream left on servers
Where does it come from?
Anatomy of a tweet
Originator name and bio
Location, time
Followers??
How active is this originator??
Image from: http://www.slaw.ca/2011/11/17/the-anatomy-of-a-tweet-metadata-on-twitter/
Where does it come from?
How are tweets being used?
To monitor the affluence of an area.
The density of tweets
“Characterization” of hashtags
Frequency, time, and location
Collect data in real time, compare it to “old data,” and make predictions [12].
Combining location, time, and topic can give insight into trends based on freely given data.
Where does it come from?
Big data quality problems aren’t limited to big data.
What was the gender and age breakdown of the Titanic survivors? Was it really women and children to the lifeboats first?
Genders are: male, female, and "" (empty)
Ages are: "" (empty) and positive floats
Remember about sample size and statistics?
Data from: https://www.kaggle.com/c/titanic-gettingStarted/data
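The point about empty values and sample size can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class name `MissingValueSketch`, its methods, and the field values in `main` are hypothetical and are not rows copied from the Kaggle file.

```java
import java.util.List;
import java.util.OptionalDouble;

class MissingValueSketch {

    /** Parses an age field; an empty, non-numeric, or non-positive value counts as missing. */
    static OptionalDouble parseAge(String field) {
        String trimmed = field.trim();
        if (trimmed.isEmpty()) {
            return OptionalDouble.empty();
        }
        try {
            double age = Double.parseDouble(trimmed);
            return age > 0 ? OptionalDouble.of(age) : OptionalDouble.empty();
        } catch (NumberFormatException e) {
            return OptionalDouble.empty();
        }
    }

    /** Counts the age fields that hold a usable value, i.e. the effective sample size. */
    static long usableCount(List<String> ageFields) {
        return ageFields.stream().filter(f -> parseAge(f).isPresent()).count();
    }

    public static void main(String[] args) {
        // Hypothetical field values for illustration only.
        List<String> ages = List.of("22", "", "0.83", "", "54");
        System.out.println(usableCount(ages) + " of " + ages.size() + " ages are usable");
    }
}
```

Any statistic computed on age then rests on the usable records only, which is why the sample size question matters.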
The Big “Vs”
Volume, Velocity, Variety are hard problems.
Vocabulary started with e-commerce [9]:
Volume: lots of data
Velocity: data is created fast
Variety: data has different origins
Big Data addresses questions based on these Vs
Other Vs
Sometimes other Vs appear as well.
The original 3 Vs have been expanded by many [5]:
Veracity: is the data trustworthy?
Value: how “good” is the data?
Variability: is the data consistent?
Sometimes complexity gets added to the mix.
Other Vs
A 3Vs visual.
The further out on the rings, the more it is “Big Data” like.
Velocity: all data is real-time, only the interval changes
Variety: is the data structured, or not?
Volume: how much data has to be processed?
Changes in any of these vectors can cause a reevaluation of the current approach [4].
Other Vs
Velocity — what does it mean for Big Data?
Frequency of data generation/delivery
Think of data from a device, or sensor, robots, click logs
Real-time analysis is small (9%) [13].
Most Big Data analytics is batch
Takeaway: data is generated at high speed, so it must be analyzed before the next set of data is delivered. Little’s Law: L = λW [11]
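As a rough illustration of the takeaway, Little’s Law relates the average number of items in a system (L), the average arrival rate (λ), and the average time an item spends in the system (W). The numbers below are made-up values chosen only to keep the arithmetic simple; they do not come from the lecture or its references.

```latex
\[
L = \lambda W
\]
% Assumed, illustrative values: \lambda = 1000 records/s, W = 0.5 s
\[
L = 1000\ \frac{\text{records}}{\text{s}} \times 0.5\ \text{s} = 500\ \text{records}
\]
```

Under these assumptions about 500 records are in flight at any moment. If the processing time W grows while λ stays fixed, the backlog L grows in proportion.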
Other Vs
Variety — what does it mean for Big Data?
Not all data is the same.
Data from a multitude of different sources.
Not all data is useful.
Data is lost during “normalization”
Hopefully not important data; when in doubt, keep it somehow
Gets away from relational databases
Other Vs
Volume — what does it mean for Big Data?
How much is there? And, how do we store it?
Store relational records?
Store transactional records?
How long to keep data available?
How to access data?
How to migrate data?
Figure: Exponential data growth [7]
See http://en.wikipedia.org/wiki/Metric_prefix for list of prefixes.
Other Vs
The Big Data challenges.
Heterogeneity
Scale
Timeliness
Complexity
Privacy
The Big Data user changes the question [1].
The Vs
Our friends the Vs
Classic Vs
Additional Vs
Lots of data
Data sources
Government:
1 Medicare data (we’ll see more of this later)
2 NSA, DoD, NASA
Private:
1 Clickstream
2 FICO
3 Walmart
4 Android devices
Figure: Sample Clickstream [6]
What does data look like?
Data characteristics
Formatted/unformatted
Bits, bytes, tagged, freeform
Clean, messy
Complete, fragmented
What does data look like?
Torrents of data
Primary usage
Secondary usage
“Exhaust”
Storage
1 Accessibility
2 Longevity
3 Privacy
What does data look like?
Big data players
Brokers
Scientists
Visionaries
Break time.
Take about 10 minutes.
A “Hello World” level problem.
With a little license.
A simply stated problem: Count the number of unique words in Shakespeare’s Macbeth.
A few Java classes
A Hadoop environment
Process strings from a file
Summarize the results
Grad students have a little more to do.
A “Hello World” level problem.
The Hadoop “cook book”, simple things (on the surface).
Partition (parallelize) the source data
Create key/value pairs
1 Receive line of text
2 Parse the text in some way
3 Create key/value pairs
Behind the scenes key/value pairs are combined
Reduce key and multiple values
Produce something useful
Produce something useful
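The cook book steps above can be sketched in plain Java, without any Hadoop classes, to show the data flow. This is a minimal single-process illustration, not the Hadoop API: the class `MapReduceSketch` and its methods `map`, `shuffle`, `reduce`, and `run` are hypothetical names, and a real job would spread the map and reduce calls across many machines.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MapReduceSketch {

    /** Map step: receive one line of text, parse it, and emit (word, 1) pairs. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(Map.entry(token, 1));
            }
        }
        return pairs;
    }

    /** Shuffle step: group all values that share a key (the "behind the scenes" part). */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce step: collapse one key and its multiple values into a single count. */
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    /** Runs map, shuffle, and reduce in sequence over the given lines. */
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        shuffle(pairs).forEach((key, values) -> counts.put(key, reduce(key, values)));
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("double double toil and trouble",
                                       "fire burn and cauldron bubble")));
    }
}
```

Because each line is mapped independently and each key is reduced independently, both steps can run in parallel, which is what makes the problem a fit for this style of processing.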
Undergraduate level problem
Shakespeare’s Macbeth (mechanics)
Things that need to get done:
1 Get a copy of the play
2 Get it onto the Hadoop Distributed File System (HDFS)
3 Write and compile a Mapper class
4 Write and compile a Reducer class
5 Write and compile a main class
6 Run it on the ODU CS Hadoop farm
Undergraduate level problem
Undergrad results: a simple textual listing
Some words are more important than others (stop words)
Only base words (stems == stem)
Words sorted alphabetically
Number of occurrences per word
Words don’t have punctuation
Case insensitive words
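The requirements above can be sketched as a small normalization pass. This is a sketch under stated assumptions, not the assignment solution: the class `WordNormalizerSketch` is hypothetical, its stop-word list is deliberately tiny, and `stem` is a crude suffix stripper standing in for a real stemming algorithm.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class WordNormalizerSketch {

    // Hypothetical, deliberately tiny stop-word list; a real run would use a fuller one.
    static final Set<String> STOP_WORDS = Set.of("a", "an", "and", "the", "of", "to", "in", "is");

    /** Naive stand-in for a stemmer: strips a few common suffixes from longer words. */
    static String stem(String word) {
        for (String suffix : List.of("ing", "ed", "es", "s")) {
            if (word.length() > suffix.length() + 2 && word.endsWith(suffix)) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    /** Lowercases, strips punctuation, drops stop words, stems, and counts occurrences. */
    static Map<String, Integer> count(List<String> lines) {
        // A TreeMap keeps the words sorted alphabetically.
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            // Removes everything except a-z and whitespace, so "don't" becomes "dont".
            String cleaned = line.toLowerCase(Locale.ROOT).replaceAll("[^a-z\\s]", "");
            for (String token : cleaned.split("\\s+")) {
                if (token.isEmpty() || STOP_WORDS.contains(token)) {
                    continue;
                }
                counts.merge(stem(token), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        count(List.of("Double, double toil and trouble;", "Fire burn, and cauldron bubble."))
            .forEach((word, n) -> System.out.println(word + "\t" + n));
    }
}
```

In a Hadoop job the same cleaning would most naturally sit in the Mapper, before the key/value pairs are emitted.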
Graduate level problem
Graduate challenges: undergrads worked with one file, graduates with two
How do the vocabularies of Romeo and Juliet, and Macbeth compare?
Slightly more work to do with data:
Work with two files
Compare first 50 words of both plays
1 Order
2 Usage (relative not absolute)
Interested in how similar the vocabularies are across the two plays.
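One possible way to frame the comparison is sketched below. It assumes the per-play word counts already exist, and it uses the shared fraction of the two top-word lists (Jaccard similarity) as a similarity measure; that measure is an assumption of this sketch, not something the assignment prescribes. The class `VocabularyCompareSketch` and the counts in `main` are hypothetical toy values, not results from the plays.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class VocabularyCompareSketch {

    /** Returns the n most frequent words, most frequent first; ties are broken alphabetically. */
    static List<String> topWords(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    /** Relative usage: a word's count divided by the total of all counts in that play. */
    static double relativeUsage(Map<String, Integer> counts, String word) {
        int total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        return total == 0 ? 0.0 : (double) counts.getOrDefault(word, 0) / total;
    }

    /** Fraction of distinct words that the two lists share (Jaccard similarity). */
    static double overlap(List<String> a, List<String> b) {
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        Set<String> shared = new HashSet<>(a);
        shared.retainAll(b);
        return union.isEmpty() ? 0.0 : (double) shared.size() / union.size();
    }

    public static void main(String[] args) {
        // Made-up toy counts for illustration only.
        Map<String, Integer> playA = Map.of("love", 4, "night", 2, "sword", 1);
        Map<String, Integer> playB = Map.of("blood", 3, "night", 3, "sword", 1);
        List<String> topA = topWords(playA, 2);
        List<String> topB = topWords(playB, 2);
        System.out.println(topA + " vs " + topB + ", overlap = " + overlap(topA, topB));
        System.out.println("night: " + relativeUsage(playA, "night")
            + " vs " + relativeUsage(playB, "night"));
    }
}
```

Calling `topWords` with 50 gives the ordered lists the assignment asks about, and `relativeUsage` keeps the comparison fair when the two plays differ in length.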
What have we covered?
Big Data Vs
Big Data sources
Problems associated with Big Data
Assignment #1
Next time: Big Data processing concepts (part deux)
References I
[1] Divyakant Agrawal, Philip Bernstein, Elisa Bertino, Susan Davidson, Umeshwar Dayal, and Michael Franklin, Challenges and Opportunities with Big Data, Purdue e-Pubs (2011).
[2] Anson Alexander, Facebook User Statistics 2012 [Infographic], ansonAlex.com (2012).
[3] Jules J Berman, Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information, Newnes, 2013.
[4] Pinal Dave, Big Data Beginning Big Data Day 2 of 21,http://blog.sqlauthority.com/2013/10/02/, 2013.
References II
[5] Mike Ferguson, Architecting A Big Data Platform for Analytics, A Whitepaper Prepared for IBM (2012).
[6] Christian Hagen, Khalid Khan, Marco Ciobo, and Jason Miller, Big Data and the Creative Destruction of Today’s Business Models, http://www.atkearney.com/strategic-it/ideas-insights/article/-/asset_publisher/LCcgOeS4t85g/content/big-data-and-the-creative-destruction-of-today-s-business-models/10192, 2013.
[7] Applied Innovations, Track website visitors, http://www.appliedi.net/blog/track-website-visitors/, 2010.
References III
[8] Joab Jackson, The Big Promise of Big Data, Business Software (2012).
[9] Doug Laney, 3D data management: Controlling data volume, velocity and variety, META Group Research Note 6 (2001).
[10] Lee Badger, David Bernstein, and Robert Bohn, US Government Cloud Computing Technology Roadmap Volume I, Tech. report, National Institute of Standards and Technology, 2014.
[11] John DC Little, A Proof for the Queuing Formula: L = λW, Operations Research 9 (1961), no. 3, 383–387.
References IV
[12] Patrick Meier, Using big data to inform poverty reduction strategies, http://irevolution.net/2013/06/19/pulse-of-egypt-to-inform-poverty-reduction/, 2013.
[13] Philip Russom, Big Data Analytics, TDWI Best Practices Report, Fourth Quarter (2011).
[14] YouTube, Statistics, http://www.youtube.com/yt/press/statistics.html.