Defining & Characterizing Big Data

Defining & Characterizing Big Data. Big Data Crash Course, an event sponsored by the MIT-SDM Big Data Explorers Club. Jim Barkley, The MITRE Corporation, MIT SDM Fellow. September 20, 2014.

description

Presentation by James Barkley at the MIT Big Data Explorers "Crash Course" on 9/20/2014. "Defining & Characterizing Big Data". http://www.mitbigdataexplorers.com/

Transcript of Defining & Characterizing Big Data

Page 1: Defining & Characterizing Big Data

Defining & Characterizing Big Data
Big Data Crash Course, an event sponsored by the MIT-SDM Big Data Explorers Club
Jim Barkley
The MITRE Corporation
MIT SDM Fellow
September 20, 2014

Page 2: Defining & Characterizing Big Data

Defining Big Data

“Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications.”

-Wikipedia

“Big Data is when the size of the data itself becomes [a significant] part of the problem.”

-O’Reilly Media

“large, diverse, complex, longitudinal, and/or distributed datasets generated from instruments, sensors, Internet transactions, email, video, click streams, and/or all other digital sources available today and in the future.”

-National Science Foundation

Page 3: Defining & Characterizing Big Data

Characterizing “Big”

Page 4: Defining & Characterizing Big Data

“Numbers Everyone Should Know” – SoCC 2010 Keynote, Jeffrey Dean, Google
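The slide image itself is not reproduced in this transcript. As a rough illustration of why size becomes "part of the problem", the sketch below does back-of-envelope arithmetic with three latency figures that are commonly quoted from that talk. The constants are assumptions for illustration (order-of-magnitude values, not measurements), and the helper name `sequential_read_seconds` is hypothetical.

```python
# Approximate latencies in nanoseconds, as commonly quoted from the
# "Numbers Everyone Should Know" table. Treat them as order-of-magnitude
# assumptions, not measurements of any current hardware.
READ_1MB_FROM_MEMORY_NS = 250_000
READ_1MB_FROM_DISK_NS = 20_000_000
DISK_SEEK_NS = 10_000_000

BYTES_PER_MB = 1_000_000


def sequential_read_seconds(size_bytes, ns_per_mb):
    """Estimate the time to read size_bytes sequentially at a given cost per MB."""
    megabytes = size_bytes / BYTES_PER_MB
    return megabytes * ns_per_mb / 1e9


one_terabyte = 10**12
memory_seconds = sequential_read_seconds(one_terabyte, READ_1MB_FROM_MEMORY_NS)
disk_seconds = sequential_read_seconds(one_terabyte, READ_1MB_FROM_DISK_NS)

print(f"1 TB from memory: about {memory_seconds:.0f} s")
print(f"1 TB from one disk: about {disk_seconds / 3600:.1f} h")
```

Under these assumed figures, a single sequential pass over 1 TB takes minutes from memory and hours from one disk, which is the usual argument for spreading both data and computation across many machines.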

Page 5: Defining & Characterizing Big Data

Big Data and Government

• Preserving Privacy Values
• Responsible Educational Innovation
• Big Data & Discrimination
• Law Enforcement & Security
• Data as a Public Resource

http://www.whitehouse.gov/BigData

Page 6: Defining & Characterizing Big Data

Big Data Research and Development Initiative

• National Science Foundation
• Department of Defense
• National Institutes of Health
• Department of Energy
• US Geological Survey
• DARPA XDATA
• http://www.darpa.mil/OpenCatalog/index.html

Page 7: Defining & Characterizing Big Data

Big Data & Industry

http://blogs.the451group.com/information_management/2012/11/02/updated-database-landscape-graphic/

Page 8: Defining & Characterizing Big Data

Big Data & MIT

• Algorithms
• Bio-/Health-/Life- Sciences
• Infrastructure/City-Related
• Massive Scale/Data Optimization
• Risk; Privacy; Policy
• Social Media-Related Projects
• Visual/Scene Recognition

Page 9: Defining & Characterizing Big Data

Big Data & MIT

• LABS:
– CSAIL; Intel Science & Tech Center
– MIT Geospatial Data Center
– MIT Information Quality (MITIQ)
– MIT Initiative on the Digital Economy (IDE)
– Laboratory for Information & Decision Systems (LIDS)
– Operations Research Center (ORC); Accenture & MIT Alliance on Business Analytics
– W3C Consortium Big Data Community Group

• Example Research Project:

• TUNABLE FAST SIMILARITY SEARCH FOR HIGH-DIMENSIONAL DATA

• “Locality-Sensitive Hashing (LSH) is an efficient algorithm for finding pairs of similar (or highly correlated) objects in a database without enumerating all pairs of such objects. Example applications include searching for near-duplicate documents, similar images, highly correlated stocks etc.”
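
The quote above does not say which hash family the project uses. As a minimal sketch of the general idea, the example below uses a textbook variant, MinHash signatures with banding, for the near-duplicate document case. It is not the MIT project's method: all function names, the 3-word shingle size, and the 64-hash / 16-band settings are assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) of a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def stable_hash(seed, item):
    """Deterministic 64-bit hash of (seed, item); each seed acts as one hash function."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=64):
    """MinHash signature: for each hash function, the minimum hash over the set."""
    return [min(stable_hash(seed, item) for item in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def candidate_pairs(signatures, bands=16):
    """Return pairs of keys whose signatures agree on every row of at least one band."""
    rows = len(next(iter(signatures.values()))) // bands
    pairs = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for key, sig in signatures.items():
            buckets[tuple(sig[band * rows:(band + 1) * rows])].append(key)
        # Only items that share a bucket are ever compared with each other.
        for keys in buckets.values():
            pairs.update(combinations(sorted(keys), 2))
    return pairs


docs = {
    "a": "big data is a term for data sets too large for traditional tools",
    "b": "big data is a term for data sets too large for traditional software",
    "c": "locality sensitive hashing finds similar items without comparing all pairs",
}
signatures = {name: minhash_signature(shingles(text)) for name, text in docs.items()}
print(candidate_pairs(signatures))
```

Only documents that land in the same bucket for some band become candidate pairs, so most dissimilar pairs are never compared. The number of bands is the tuning knob in this sketch: more bands with fewer rows each catch lower-similarity pairs at the cost of more false candidates.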

Page 10: Defining & Characterizing Big Data

Big Data Explorers & You

• Thanks for coming today. You choose your own level of involvement
– “Some are naturally average, some settle, and some have mediocrity thrust upon them”
• Club Goals & Vision:
1. Serve as learning platform through planned activities, sharing, and collaboration
2. Serve as a networking tool
3. Serve as an incubator for projects, investigations, research, and startups
• Future ideas for club activities:
– Use the listserv!!! Share articles and have discussions.
– Monthly club meetings (1-2 hour, touch base on current efforts, make club members present) ?
– Small BOF sessions around specific technologies (e.g., MongoDB) or domains (e.g., Health Care)
– More full-day or weekend events. Hackathons, unconferences
– Subsidize local area conference attendance
– Kaggle team-ups and hacking for profit

“Cyberspace. A consensual hallucination experienced daily by billions of legitimate operators, in every nation, by children being taught mathematical concepts . . . A graphic representation of data abstracted from the banks of every computer in the human system.”