Big Data: Data Analysis Boot Camp What is Big Data?ccartled/Teaching/2018-Spring/DataAnalysi… ·...


  • 1/32

    Intro. What is Big Data Big Data’s Vs What sets BD apart Real definitions Ethics Q & A Conclusion References Files

    Big Data: Data Analysis Boot Camp
    What is Big Data?

    Chuck Cartledge, PhD

    19 January 2018

  • 2/32

    Table of contents (1 of 1)

    1 Intro.
    2 What is Big Data
        And, why is it interesting?
    3 Big Data’s Vs
        Classical definition
        Data sources and types
    4 What sets BD apart
        Statistics and BD
    5 Real definitions
        Pragmatic and practical
    6 Ethics
        A simple idea in pictures
    7 Q & A
    8 Conclusion
    9 References
    10 Files

  • 3/32

    What are we going to cover?

    We’re going to talk about:

    What is Big Data?

    What is Big Data, beyond the marketing hype?

    What sets Big Data apart?

    What is a practical definition of Big Data?

  • 4/32

    And, why is it interesting?


    “Big data has emerged as a technology term and trend that is complementary to and considered to be equally as transformational as the cloud computing model. . . . represented as an ’old’ or ’new’ capability depending on the perspective of those defining it, . . . ”

    Lee Badger [8]

    “Big Data can be characterized by the three V’s: volume (large amounts of data), variety (includes different types of data), and velocity (constantly accumulating new data).”

    Jules J. Berman [3]

  • 5/32

    Classical definition

    Doug Laney, META Group

    The origin of “Big Data” ideas and definitions.

    Started in the e-commerce Mergers and Acquisitions arena

    Used to explain why traditional Relational Database Management Systems (RDBMS) wouldn’t scale

    Intended audience was non-technical management

    Image from [7].

    Take away: traditional RDBMS don’t/won’t scale and different approaches are needed.

  • 6/32

    Classical definition

    Laney’s original BD Vs

    Image from [7].

  • 7/32

    Classical definition

    Volume — what does it mean for Big Data?

    How much is there? And, how do we store it?

    Store relational records?

    Store transactional records?

    How long to keep data available?

    How to access data?

    How to migrate data?

    Image from [4].

    See http://en.wikipedia.org/wiki/Metric_prefix for list of prefixes.

  • 8/32

    Classical definition

    Velocity — what does it mean for Big Data?

    Frequency of data generation/delivery

    Think of data from a device, or sensor, robots, click logs

    Real-time analysis is small (9%) [12].

    Most Big Data analytics is batch

    Known as “Little’s Law” [9]

    Take away: data is generated at high speed, and it must be analyzed before the next set of data is delivered.
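
    The reference to Little’s Law [9] can be made concrete with a small worked example. The law states L = λW: the average number of items in a system equals the arrival rate multiplied by the average time each item spends in the system. The following is a minimal sketch, not part of the original slides; the function names and the rates used are hypothetical.

```python
def items_in_system(arrival_rate_per_s: float, time_in_system_s: float) -> float:
    """Little's Law, L = lambda * W: average number of items in flight."""
    return arrival_rate_per_s * time_in_system_s


def time_budget(arrival_rate_per_s: float, max_items_in_system: float) -> float:
    """Little's Law rearranged, W = L / lambda: the average time each item
    may spend in the system if the backlog is to stay at or below the cap."""
    return max_items_in_system / arrival_rate_per_s


# Hypothetical sensor feed: 2000 records arrive per second and each record
# spends half a second being analyzed.
backlog = items_in_system(2000.0, 0.5)

# Hypothetical cap: no more than 500 records in flight at once.
budget = time_budget(2000.0, 500.0)

print(backlog, budget)
```

    Read against the take-away above: if the average time in the system grows while the arrival rate stays fixed, the backlog grows in proportion, which is why high-velocity data has to be processed quickly.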

  • 9/32

    Classical definition

    Variety — what does it mean for Big Data?

    Not all data is the same.

    Data from a multitude of different sources.

    Not all data is useful.

    Data is lost during “normalization”

    Hopefully not important data, when in doubt: keep it somehow

    Gets away from relational databases

  • 10/32

    Classical definition

    The original Vs have been expanded

    Lots more Vs.

    1 Vagueness
    2 Validity
    3 Value
    4 Variability
    5 Variety
    6 Velocity
    7 Venue
    8 Veracity
    9 Viability
    10 Vincularity
    11 Virility
    12 Viscosity
    13 Visibility
    14 Visible
    15 Visualization
    16 Vitality
    17 Vocabulary
    18 Volatility
    19 Volume

    We’ll talk about these later.

  • 11/32

    Data sources and types

    The Big Data challenges.

    Heterogeneity

    Scale

    Timeliness

    Complexity

    Privacy

    The Big Data user changes the question [1].

  • 12/32

    Statistics and BD

    Important ideas from statistics

    How “good” an answer do you want?

    Questions that need to be answered:

    How accurately do you need the answer?

    What level of confidence do you intend to use?

    What is your current estimate of the answer you’re after?

    Image from [6].

    The greater the tolerance for error, the fewer samples needed.
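
    The three questions above correspond to the inputs of the standard sample-size formula for estimating a proportion, n = z² p (1 − p) / E², where E is the margin of error, z is the critical value for the chosen confidence level, and p is the current estimate. The slide does not state this formula; the sketch below assumes it is the intended calculation, and the function name and example values are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size(margin_of_error: float,
                confidence: float = 0.95,
                estimate: float = 0.5) -> int:
    """Samples needed to estimate a proportion to within +/- margin_of_error.

    estimate=0.5 is the most conservative choice when no prior estimate
    of the answer is available.
    """
    # Two-sided critical value of the standard normal distribution
    # (roughly 1.96 for 95% confidence).
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = (z ** 2) * estimate * (1 - estimate) / margin_of_error ** 2
    return math.ceil(n)


# Loosening the tolerance for error shrinks the required sample.
for error in (0.03, 0.05, 0.10):
    print(error, sample_size(error))
```

    With a 95% confidence level and no prior estimate, a margin of ±3% needs roughly a thousand samples, while ±10% needs fewer than a hundred, which illustrates the closing point of the slide.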

  • 13/32

    Statistics and BD

    If you have some pre-knowledge of the “population” then you only need to sample a very small number of “individuals” to get a good enough answer [15].

  • 14/32

    Statistics and BD

    How sampling differs from “Big Data”

    Sampling – start with a preconceived idea of the outcome

    Sampling – few data points, extremely valuable (n = 1000)

    Big data – you don’t know what the data holds

    Big data – many data points, extremely cheap (n = all)

    Leadership role changes from investigator to data [10].

    Large data sets are messy, incomplete, inconsistent, and error prone. They require lots of data munging and data wrangling.

  • 15/32

    Statistics and BD

    We’ll be covering virtually “bleeding edge” stuff.

    Data too big for a single machine.

    Processing too long for a single machine.

    Question/analysis is parallelizable.

    Image from [13].

  • 16/32

    Statistics and BD

    Lots of places, lots of it, and fast.

    We are “drowning” in Big Data.

    230,000,000 tweets per day [5]

    2,700,000,000 Facebook likes per day [2]

    100 hours of YouTube video every minute [16]

    Clickstream left on servers

    Our wearable devices are contributing to this avalanche of data.

  • 17/32

    Statistics and BD

    With all this data, what kinds of questions can we ask?

    How is data from one dataset related to data in another?

    Are the relationships one-to-one, one-to-many, or many-to-many?

    Is the data “clean” or not?

    What are we trying to find from the data?

    The details of the questions depend on the data and what we are interested in finding.
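
    The question about relationship types can be checked mechanically once the linked keys from two datasets are available as pairs. The sketch below is illustrative only; the function name and the sample keys are hypothetical, and it assumes the pairs fit in memory.

```python
from collections import defaultdict


def relationship_kind(pairs):
    """Classify how keys of a left dataset relate to keys of a right dataset.

    pairs is an iterable of (left_key, right_key) tuples, one per observed link.
    """
    left_to_right = defaultdict(set)
    right_to_left = defaultdict(set)
    for left, right in pairs:
        left_to_right[left].add(right)
        right_to_left[right].add(left)

    # Does any single key on one side link to several keys on the other?
    left_fans_out = any(len(rights) > 1 for rights in left_to_right.values())
    right_fans_out = any(len(lefts) > 1 for lefts in right_to_left.values())

    if left_fans_out and right_fans_out:
        return "many-to-many"
    if left_fans_out:
        return "one-to-many"
    if right_fans_out:
        return "many-to-one"
    return "one-to-one"


# Hypothetical example: users linked to the tweets they wrote.
user_tweets = [("u1", "t1"), ("u1", "t2"), ("u2", "t3")]
print(relationship_kind(user_tweets))
```

    The same check also speaks to the “clean” question: a relationship that is expected to be one-to-one but classifies as many-to-many often points to duplicated or inconsistent keys.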

  • 18/32

    Statistics and BD

    Some questions are easily stated, . . .

    Which of these questions are amenable to Big Data processing (and why)?

    1 a[i] = b[i] + c[i]

    2 a[i] = f(b)

    3 a[i] = a[i − 1] + b[i − 1]

    4 a = b + c
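
    One way to reason about the question is to look at data dependencies. The sketch below contrasts the first and third expressions: in the first, each output element depends only on inputs at the same index, so the work can be split into independent chunks; in the third, each element depends on the previous result. This is an illustration rather than the slide’s official answer; the function names, the chunk size, and the use of a thread pool to stand in for separate machines are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def add_chunk(chunk):
    """Compute a[i] = b[i] + c[i] for one independent slice of the data."""
    b_part, c_part = chunk
    return [x + y for x, y in zip(b_part, c_part)]


def elementwise_sum_in_chunks(b, c, chunk_size=2):
    """Split b and c into slices, process each slice separately, and
    concatenate the results. No slice needs any other slice's output."""
    chunks = [(b[i:i + chunk_size], c[i:i + chunk_size])
              for i in range(0, len(b), chunk_size)]
    with ThreadPoolExecutor() as pool:
        parts = list(pool.map(add_chunk, chunks))  # map preserves chunk order
    return [value for part in parts for value in part]


def running_sum(a0, b):
    """Compute a[0] = a0 and a[i] = a[i - 1] + b[i - 1] for i >= 1.

    Every step needs the previous step's result, so the loop cannot simply
    be cut into chunks that run independently of one another."""
    a = [a0]
    for i in range(1, len(b) + 1):
        a.append(a[i - 1] + b[i - 1])
    return a


print(elementwise_sum_in_chunks([1, 2, 3, 4, 5], [10, 20, 30, 40, 50]))
print(running_sum(0, [1, 2, 3]))
```

    The recurrence is a prefix sum, which can still be parallelized with scan-style algorithms, but that takes extra coordination between chunks rather than plain independent splitting.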

  • 19/32

    Statistics and BD

    Does the tweet sentiment change over time?

  • 20/32

    Statistics and BD

    Who sends what type of tweet?

  • 21/32

    Statistics and BD

    Where do tweets come from?

  • 22/32

    Pragmatic and practical

    A pragmatic definition

    “. . . big data refers to things one can do at a large scale that cannot be done at a smaller one, to extract new insights or create new forms of value, in ways that change markets, organizations, the relationship between citizens and governments, and more.”

    Mayer-Schönberger and Cukier [10]

  • 23/32

    Pragmatic and practical

    A practical definition based on “people” time.

    If:

    Your data won’t fit into one machine or application, or

    You are waiting too long for an answer

    then:

    You have a Big Data problem that requires Big Data tools and techniques.
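
    The rule above is a simple either/or test and can be written down directly. The sketch below is only an illustration; the function name, parameter names, and example figures are hypothetical, and what counts as “too long” is left to the person waiting.

```python
def is_big_data_problem(data_bytes: int,
                        machine_capacity_bytes: int,
                        estimated_wait_s: float,
                        acceptable_wait_s: float) -> bool:
    """Apply the practical definition: the data does not fit on one machine,
    or the answer takes longer than the person is willing to wait."""
    too_big = data_bytes > machine_capacity_bytes
    too_slow = estimated_wait_s > acceptable_wait_s
    return too_big or too_slow


# Hypothetical case: 4 TB of data, a 1 TB machine, and a one-minute wait
# against a one-hour tolerance. Size alone makes it a Big Data problem.
TERABYTE = 10 ** 12
print(is_big_data_problem(4 * TERABYTE, 1 * TERABYTE, 60.0, 3600.0))
```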

  • 24/32

    A simple idea in pictures

    “In addition to the foundational and translational skills training that students receive, they would also benefit from a better understanding of ethics and social context of data . . . ”

    NAS [11]

    Image from [14].

  • 25/32

    A simple idea in pictures

    Ethics in Big Data [11]

    Data confidence – avoiding overconfidence and the inclination to draw stronger-than-appropriate conclusions

    Data context – understand the context of data sets before they are processed

    Fairness – treat all [data] equitably and avoid bias

    Privacy – privacy with respect to how data are collected and analyzed

    Stewardship – supervision of a data set at all stages of existence

    Validity – ensure that the data set contains valid information

    This area could be a full credit college course.

  • 26/32

    Q & A time.

    Q: Name two families whose kids won’t join the Marines.

    A: The Halls of Montezuma and the Shores of Tripoli.

  • 27/32

    What have we covered?

    Big Data is all around us.

    Big Data is about volume, variety, velocity, and getting answers quickly.

    Some Big Data questions are easy to state, but impossible to answer.

    Dealing with Big Data can raise real ethical questions.

    Next: What is R?

  • 28/32

    References (1 of 4)

    [1] Divyakant Agrawal, Philip Bernstein, Elisa Bertino, Susan Davidson, Umeshwar Dayal, and Michael Franklin, Challenges and Opportunities with Big Data, Purdue e-Pubs (2011).

    [2] Anson Alexander, Facebook User Statistics 2012 [Infographic], ansonAlex.com (2012).

    [3] Jules J Berman, Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information, Newnes, 2013.

    [4] Applied Innovations, Track website visitors, http://www.appliedi.net/blog/track-website-visitors/, 2010.

  • 29/32

    References (2 of 4)

    [5] Joab Jackson, The Big Promise of Big Data, Business Software (2012).

    [6] James Klurfeld, Making sense of the campaign: The truth about polling, http://drc.centerfornewsliteracy.org/resource/making-sense-campaign-truth-about-polling, 2016.

    [7] Doug Laney, 3D Data Management: Controlling Data Volume, Velocity and Variety, META Group Research Note 6 (2001).

    [8] Lee Badger, David Bernstein, and Robert Bohn, US Government Cloud Computing Technology Roadmap Volume I, Tech. report, National Institute of Standards and Technology, 2014.

  • 30/32

    References (3 of 4)

    [9] John DC Little, A Proof for the Queuing Formula: L = λW, Operations Research 9 (1961), no. 3, 383–387.

    [10] Viktor Mayer-Schönberger and Kenneth Cukier, Big data: A revolution that will transform how we live, work, and think, Houghton Mifflin Harcourt, 2013.

    [11] National Academies of Sciences, Engineering, and Medicine, Envisioning the data science discipline: The undergraduate perspective: Interim report, National Academies of Science, 2017.

    [12] Philip Russom, Big Data Analytics, TDWI Best Practices Report, Fourth Quarter (2011).

  • 31/32

    References (4 of 4)

    [13] Andy Shaw, Leading edge or bleeding edge? (reflecting on innovation), http://poopengineer.blogspot.com/2015/04/leading-edge-or-bleeding-edge.html, 2015.

    [14] European Union Staff, Ethics, https://ec.europa.eu/programmes/horizon2020/en/h2020-section/ethics, 2017.

    [15] Mario F Triola, Essentials of statistics, Pearson Addison Wesley, Boston, MA, USA, 2008.

    [16] YouTube, Statistics, http://www.youtube.com/yt/press/statistics.html.

  • 32/32

    Files of interest

    1 The V’s of Big data

  • How Many Vs are there in Big Data?

    Chuck Cartledge

    Sunday 6th March, 2016 at 13:32

    Contents

    1 Introduction

    2 Big Data Vs

    3 Conclusion

    List of Tables

    1 Big Data Vs.

    1 Introduction

    Doug Laney is widely credited with originating the first big Vs of Big Data: volume, variety, and velocity. Since then the number of Vs has grown and grown. I will list all the Vs that I can find in chronological order until I get tired, to see if there is a pattern in their growth.

    2 Big Data Vs

    Table 1 chronicles some of the Big Data Vs.

  • Table 1: Big Data Vs.

    Num. | Year | V | Definition | Source
    -----|------|---|------------|-------
    1 | 2001 | Variety | . . . no greater barrier to effective data management will exist than the variety of incompatible data formats, non-aligned data structures, and inconsistent data semantics. | [8, 10]
    2 | 2001 | Velocity | E-commerce has also increased point-of-interaction (POI) speed and, consequently, the pace data used to support interactions and generated by interactions. | [8]
    3 | 2001 | Volume | E-commerce channels increase the depth/breadth of data available about a transaction (or any point of interaction). | [8]
    4 | 2013 | Validity | . . . is the data correct and accurate for the intended use. | [2, 9, 10, 11, 14]
    5 | 2013 | Value | How to determine the prescriptive value of data? | [2, 5, 9, 13, 14, 15, 7, 6, 3, 1]
    6 | 2013 | Variability | Many options or variable interpretations can confuse interpretation. | [2, 5, 10, 13, 15]
    7 | 2013 | Veracity | . . . to the biases, noise and abnormality in data. | [2, 5, 9, 11, 14, 15, 12, 6, 3, 4, 1]
    8 | 2013 | Viability | . . . can the data be analyzed in a way that makes it decision-relevant? | [5, 10]
    9 | 2013 | Virility | . . . Defined by some users as the rate at which the data spreads; how often it is picked up and repeated by other users or events. | [15]
    10 | 2013 | Viscosity | . . . used to describe the latency or lag time in the data relative to the event being described. | [15]
    11 | 2013 | Visibility | . . . the state of being able to see or be seen - is implied. | [9, 14, 10]
    12 | 2013 | Visualization | Making all that vast amount of data comprehensible in a manner that is easy to understand and read. With the right analyses and visualizations, raw data can be put to use otherwise raw data remains essentially useless. | [13]
    13 | 2013 | Volatility | . . . how long is data valid and how long should it be stored. | [10, 11]
    14 | 2014 | Vagueness | . . . confusion over the meaning of big data (Is it Hadoop? Is it something that we’ve always had? What’s new about it? What are the tools? Which tools should I use? etc.) | [2]
    15 | 2014 | Venue | . . . distributed, heterogeneous data from multiple platforms, from different owners’ systems, with different access and formatting requirements, private vs. public cloud. | [2]
    16 | 2014 | Vocabulary | . . . schema, data models, semantics, ontologies, taxonomies, and other content- and context-based metadata that describe the data’s structure, syntax, content, and provenance. | [2]
    17 | 2015 | Vincularity | . . . it implies connectivity or linkage. | [10]
    18 | 2015 | Visible | We live in an increasingly visual world and the statistics of increase in the number of images and videos shared on the Internet is staggering. | [10]
    19 | 2015 | Vitality | . . . criticality of the data is another concept that is crucial and is embedded in the concept of Value. | [10]

  • 3 Conclusion

    Doug Laney’s initial 3Vs were based on his experiences in a business mergers and acquisitions environment. The terms and ideas were easy to understand and to relate to. Many, many people have hopped on the “V” bandwagon, and will probably continue to for the foreseeable future. From a practical standpoint, if your data won’t fit in one machine, or the machine takes too long to return an answer, then you have a Big Data problem, regardless of how many Vs apply.

    An alternative set of attributes of Big Data, based on [16], is:

    Size: the volume of the datasets is a critical factor, or

    Complexity: the structure, behaviour and permutations of the datasets is a critical factor, or

    Technologies: the tools and techniques which are used to process a sizable or complex dataset is a critical factor.

    Any two of these attributes meet the requirements of a Big Data problem.

  • References

    [1] Marcos D Assunção, Rodrigo N Calheiros, Silvia Bianchi, Marco AS Netto, and Rajkumar Buyya, Big data computing and clouds: Trends and future directions, Journal of Parallel and Distributed Computing 79 (2015), 3–15.

    [2] Kirk Borne, Top 10 big data challenges - a serious look at 10 big data v’s, https://www.mapr.com/blog/top-10-big-data-challenges-%E2%80%93-serious-look-10-big-data-v%E2%80%99s, 2014.

    [3] Yuri Demchenko, Paola Grosso, Cees De Laat, and Peter Membrey, Addressing big data issues in scientific data infrastructure, Collaboration Technologies and Systems (CTS), 2013 International Conference on, IEEE, 2013, pp. 48–55.

    [4] Xin Luna Dong and Divesh Srivastava, Big data integration, Data Engineering (ICDE), 2013 IEEE 29th International Conference on, IEEE, 2013, pp. 1245–1248.

    [5] Seth Grimes, Big data: Avoid ’wanna v’ confusion, http://www.informationweek.com/big-data/big-data-analytics/big-data-avoid-wanna-v-confusion/d/d-id/1111077?, 2013.

    [6] Pascal Hitzler and Krzysztof Janowicz, Linked data, big data, and the 4th paradigm, Semantic Web 4 (2013), no. 3, 233–235.

    [7] Stephen Kaisler, Frank Armour, Juan Antonio Espinosa, and William Money, Big data: Issues and challenges moving forward, System Sciences (HICSS), 2013 46th Hawaii International Conference on, IEEE, 2013, pp. 995–1004.

    [8] Doug Laney, 3d data management: Controlling data volume, velocity and variety, META Group Research Note 6 (2001).

    [9] Rob Livingstone, The 7 vs of big data, http://rob-livingstone.com/2013/06/big-data-or-black-hole/, 2013.

    [10] Rajiv Maheshwari, 3 vs or 7 vs - what’s the value of big data?, https://www.linkedin.com/pulse/3-vs-7-whats-value-big-data-rajiv-maheshwari, 2015.

    [11] Kevin Normandeau, Beyond volume, variety and velocity is the issue of big data veracity, http://insidebigdata.com/2013/09/12/beyond-volume-variety-velocity-issue-big-data-veracity/, 2013.

    [12] Wullianallur Raghupathi and Viju Raghupathi, Big data analytics in healthcare: promise and potential, Health Information Science and Systems 2 (2014), no. 1, 3.

    [13] BI Staff, Why the 3vs are not sufficient to describe big data, https://datafloq.com/read/3vs-sufficient-describe-big-data/166, 2013.

    [14] University of Technology Staff, The 7 vs of big data, http://mbitm.uts.edu.au/feed/7-vs-big-data, 2013.

    [15] Bill Vorhies, How many vs in big data the characteristics that define big data, http://data-magnum.com/how-many-vs-in-big-data-the-characteristics-that-define-big-data/, 2013.

    [16] Jonathan Stuart Ward and Adam Barker, Undefined by data: A survey of big data definitions, arXiv preprint arXiv:1309.5821 (2013).
