Mining Big Data in Real Time

Albert Bifet

description

Big Data is a new term for datasets that cannot be managed with current methodologies or data mining software tools because of their size and complexity. Big Data mining is the capability of extracting useful information from these large datasets or streams of data. New mining techniques are necessary because of the volume, variability, and velocity of such data.

Transcript of Mining Big Data in Real Time

Page 1: Mining Big Data in Real Time

Mining Big Data in Real Time

Albert Bifet

Page 2: Mining Big Data in Real Time

2 04/11/2023

Motivation

• BIG DATA is an OPEN SOURCE Software Revolution
• BIG DATA Analytics 2.0
  – What is happening right now
• Why do we need new tools?
• Improve decision making:
  – Measure and react in REAL-TIME

Page 3: Mining Big Data in Real Time

Real Time Decision Making

Companies need to know:

• what is happening right now, in real time, to be able to
  – react
  – anticipate and detect new business opportunities.

Page 4: Mining Big Data in Real Time

Big Data 6 Vs

• Volume
• Variety
• Velocity
• Value
• Variability
• Veracity

Page 5: Mining Big Data in Real Time

Controversy of Big Data

• All data is BIG now
• Hype to sell Hadoop-based systems
• Ethical concerns about accessibility
• Limited access to Big Data creates new digital divides

Page 6: Mining Big Data in Real Time

Controversy of Big Data

• Statistical significance:
  – When the number of variables grows, the number of spurious correlations also grows
  – Leinweber: S&P 500 stock index correlated with butter production in Bangladesh
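The statistical significance point can be illustrated with a small simulation. This is a minimal sketch, not taken from the talk: it draws one random "target" series and many unrelated random series, then reports the strongest correlation found by chance. The function names, the sample size of 20, and the seed are assumptions chosen for illustration.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def strongest_spurious_correlation(num_variables, num_samples=20, seed=0):
    """Largest |correlation| between a random target and unrelated random variables.

    Every series is independent Gaussian noise, so any correlation is spurious.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(num_samples)]
    best = 0.0
    for _ in range(num_variables):
        candidate = [rng.gauss(0, 1) for _ in range(num_samples)]
        best = max(best, abs(pearson(target, candidate)))
    return best


# With a fixed seed, a larger search includes every candidate of a smaller one,
# so the strongest chance correlation can only stay the same or grow.
for k in (10, 100, 1000):
    print(k, round(strongest_spurious_correlation(k), 3))
```

The more unrelated variables are searched, the stronger the best chance correlation tends to look, which is the effect the butter production example points at.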

Page 7: Mining Big Data in Real Time

Need for Big Data

• McKinsey Global Institute (MGI) Report on Big Data, 2011

Page 8: Mining Big Data in Real Time

Need for Big Data

• McKinsey Global Institute (MGI) Report on Big Data, 2011

Page 9: Mining Big Data in Real Time

More data or better models?

Xavier Amatriain, Netflix Research/Engineering Director

http://recsys.acm.org/more-data-or-better-models/

Page 10: Mining Big Data in Real Time

Future Challenges for Big Data

• Evaluation
• Time-evolving data
• Distributed mining
• Compression
• Visualization
• Hidden Big Data

Page 11: Mining Big Data in Real Time

HADOOP Architecture

Page 12: Mining Big Data in Real Time

Apache Mahout

Page 13: Mining Big Data in Real Time

Pig

Similar to SQL

Page 14: Mining Big Data in Real Time

Apache S4

Page 15: Mining Big Data in Real Time

Twitter Storm

Page 16: Mining Big Data in Real Time

Runaway Complexity

Page 17: Mining Big Data in Real Time

What is SAMOA?

• NEW Software framework for mining distributed data streams
• Big Data mining for evolving streams in REAL-TIME

Page 18: Mining Big Data in Real Time

Big Data Stream Mining

BIG DATA Streams

• Sequence is potentially infinite
• High amount of data, high speed of arrival
• Change over time
• Process elements from a data stream in only one pass
• Approximation algorithms
  – Small error rate with high probability
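One well-known instance of a one-pass approximation algorithm with a "small error with high probability" guarantee is the Count-Min sketch. The sketch below is illustrative only and is not taken from the talk or from MOA/SAMOA. The class name and parameters are assumptions, and salted SHA-256 hashes stand in for the pairwise-independent hash functions that the formal guarantee assumes.

```python
import hashlib
import math


class CountMinSketch:
    """Approximate item counts over a stream in one pass and fixed memory.

    With width = ceil(e / epsilon) and depth = ceil(ln(1 / delta)), an estimate
    never undercounts, and (under ideal hashing) overcounts by at most
    epsilon * N with probability at least 1 - delta, where N is the stream length.
    """

    def __init__(self, epsilon=0.01, delta=0.01):
        self.width = math.ceil(math.e / epsilon)
        self.depth = math.ceil(math.log(1 / delta))
        self.table = [[0] * self.width for _ in range(self.depth)]
        self.total = 0  # N, the number of elements seen so far

    def _column(self, row, item):
        # One salted hash per row; a practical stand-in for independent hashes.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item):
        """Process one stream element; the element itself is not stored."""
        self.total += 1
        for row in range(self.depth):
            self.table[row][self._column(row, item)] += 1

    def estimate(self, item):
        """Return an upper bound on how often `item` has been seen."""
        return min(
            self.table[row][self._column(row, item)] for row in range(self.depth)
        )


sketch = CountMinSketch(epsilon=0.01, delta=0.01)
for element in ["a", "b", "a", "c", "a", "b"]:
    sketch.add(element)
print(sketch.estimate("a"), sketch.total)
```

Memory depends only on epsilon and delta, not on the length of the stream, which is what makes this kind of structure usable on a potentially infinite sequence.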

Page 19: Mining Big Data in Real Time

Big Data Stream Mining

Distributed BIG DATA

• BIG DATA Analytics 2.0
  – Apache S4
    • Yahoo! 2010
  – Storm
    • Twitter 2011

Machine Learning

  Distributed
    Batch: Hadoop, Mahout
    Stream: S4, Storm, SAMOA

  Non-distributed
    Batch: R, WEKA
    Stream: MOA

Page 20: Mining Big Data in Real Time

SAMOA Architecture

• Use S4, Storm, or other distributed stream processing platform
• Use MOA, or other streaming machine learning library
• Easy to extend through PACKAGES

[Diagram: SAMOA architecture. Platforms: S4, Storm. SAMOA components: Classifier Methods, Clustering Methods, Frequent Pattern Mining.]
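The idea of keeping the learning algorithm separate from the stream processing platform can be sketched as follows. This is a hypothetical illustration and not SAMOA's actual API. `StreamLearner`, `MajorityClassLearner`, and `LocalEngine` are invented names, and the single-process engine merely stands in for a platform such as S4 or Storm.

```python
from collections import Counter
from typing import Hashable, Iterable, Optional, Protocol, Tuple

Instance = Tuple[dict, Hashable]  # (features, label)


class StreamLearner(Protocol):
    """Hypothetical interface that any streaming learner would implement."""

    def predict(self, features: dict) -> Optional[Hashable]: ...

    def train(self, features: dict, label: Hashable) -> None: ...


class MajorityClassLearner:
    """Trivial learner: predicts the most frequent label seen so far."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def predict(self, features: dict) -> Optional[Hashable]:
        return self.counts.most_common(1)[0][0] if self.counts else None

    def train(self, features: dict, label: Hashable) -> None:
        self.counts[label] += 1


class LocalEngine:
    """Hypothetical single-process stand-in for a stream processing platform."""

    def run(self, learner: StreamLearner, stream: Iterable[Instance]) -> float:
        correct = total = 0
        for features, label in stream:
            # Test-then-train: each instance is used for evaluation first,
            # then for learning, and is seen exactly once.
            correct += learner.predict(features) == label
            learner.train(features, label)
            total += 1
        return correct / total if total else 0.0


stream = [({}, "spam"), ({}, "spam"), ({}, "ham"), ({}, "spam")]
accuracy = LocalEngine().run(MajorityClassLearner(), stream)
print(accuracy)
```

Because the engine only depends on the `predict`/`train` interface, either side could in principle be swapped: a different learner, or a distributed engine in place of `LocalEngine`.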

Page 21: Mining Big Data in Real Time

Thanks!

http://samoa-project.net/

G. De Francisci Morales. SAMOA: A Platform for Mining Big Data Streams. Keynote Talk at RAMSS ’13: 2nd International Workshop on Real-Time Analysis and Mining of Social Streams @WWW, Rio de Janeiro, 2013.