Mining Big Data in Real Time
Albert Bifet
2 04/11/2023
Motivation
• BIG DATA is an OPEN SOURCE Software Revolution
• BIG DATA Analytics 2.0
• What is happening right now
• Why do we need new tools?
• Improve decision making:
  – Measure and react in REAL-TIME
Real Time Decision Making
Companies need to know:
• what is happening right now, in real time, to be able to
  – react
  – anticipate and detect new business opportunities.
Big Data 6 Vs
• Volume
• Variety
• Velocity
• Value
• Variability
• Veracity
Controversy of Big Data
• All data is BIG now
• Hype to sell Hadoop-based systems
• Ethical concerns about accessibility
• Limited access to Big Data creates new digital divides
Controversy of Big Data
• Statistical Significance:
  – When the number of variables grows, the number of fake correlations also grows
  – Leinweber: S&P 500 stock index correlated with butter production in Bangladesh
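The statistical-significance point can be illustrated with a small simulation. The sketch below uses made-up random data and is not Leinweber's analysis: it draws one random target series, compares it against a growing number of unrelated random series, and tracks the strongest absolute correlation found so far. The function names, series length, and seed are assumptions chosen for the example.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_fake_correlations(num_variables, num_samples=20, seed=7):
    """Running maximum of |r| between a random target and unrelated random variables.

    All series are independent noise, so any correlation found is spurious.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(num_samples)]
    best = 0.0
    running_max = []
    for _ in range(num_variables):
        candidate = [rng.gauss(0, 1) for _ in range(num_samples)]
        best = max(best, abs(pearson(target, candidate)))
        running_max.append(best)
    return running_max


if __name__ == "__main__":
    curve = strongest_fake_correlations(1000)
    for k in (1, 10, 100, 1000):
        print(f"{k:>5} variables: strongest |r| = {curve[k - 1]:.2f}")
```

Because the running maximum can only stay level or rise as more variables are tested, the printed values show how searching more variables makes a strong-looking but meaningless correlation more likely.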
Need for Big Data
• McKinsey Global Institute (MGI) Report on Big Data, 2011
More data or better models?
Xavier Amatriain, Netflix Research/Engineering Director
http://recsys.acm.org/more-data-or-better-models/
Future Challenges for Big Data
• Evaluation
• Time evolving data
• Distributed mining
• Compression
• Visualization
• Hidden Big Data
HADOOP Architecture
Apache Mahout
Pig
• Similar to SQL
Apache S4
Twitter Storm
Runaway Complexity
What is SAMOA?
• NEW Software framework for mining distributed data streams
• Big Data mining for evolving streams in REAL-TIME
Big Data Stream Mining
BIG DATA Streams
• Sequence is potentially infinite
• High amount of data, high speed of arrival
• Change over time
• Process elements from a data stream in only one pass
• Approximation algorithms
  – Small error rate with high probability
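One concrete instance of a one-pass approximation algorithm is the Count-Min sketch. The slides do not name it; it is used here only as an illustration of "small error rate with high probability". It estimates item frequencies in fixed memory: an estimate never undercounts, and it overcounts by more than epsilon × N (N being the number of elements seen) with probability at most delta. The class name, parameter defaults, and the use of salted BLAKE2 as a practical stand-in for the pairwise-independent hash functions assumed by the theory are all choices made for this sketch.

```python
import hashlib
import math


class CountMinSketch:
    """Fixed-memory frequency estimates over a stream, updated in one pass."""

    def __init__(self, epsilon=0.01, delta=0.01):
        # Standard Count-Min sizing: width = ceil(e / epsilon), depth = ceil(ln(1 / delta)).
        self.width = math.ceil(math.e / epsilon)
        self.depth = math.ceil(math.log(1 / delta))
        self.table = [[0] * self.width for _ in range(self.depth)]
        self.total = 0  # number of elements seen so far (N)

    def _columns(self, item):
        # One hash per row, obtained by salting BLAKE2b with the row index
        # (an assumption standing in for pairwise-independent hashing).
        for row in range(self.depth):
            digest = hashlib.blake2b(
                repr(item).encode("utf-8"),
                digest_size=8,
                salt=row.to_bytes(8, "little"),
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        """Process one stream element; memory use does not grow."""
        self.total += count
        for row, col in self._columns(item):
            self.table[row][col] += count

    def estimate(self, item):
        """Return an upper-biased frequency estimate (never below the true count)."""
        return min(self.table[row][col] for row, col in self._columns(item))


if __name__ == "__main__":
    sketch = CountMinSketch(epsilon=0.01, delta=0.01)
    stream = ["a"] * 500 + ["b"] * 50 + [f"user{i}" for i in range(2000)]
    for element in stream:  # single pass over the stream
        sketch.add(element)
    print("estimate for 'a':", sketch.estimate("a"))
    print("estimate for 'b':", sketch.estimate("b"))
```

The table size depends only on epsilon and delta, not on the length of the stream, which is what makes the approach usable on a potentially infinite sequence.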
Big Data Stream Mining
Distributed BIG DATA
• BIG DATA Analytics 2.0
  – Apache S4 (Yahoo!, 2010)
  – Storm (Twitter, 2011)
Machine Learning
• Distributed
  – Batch: Hadoop, Mahout
  – Stream: S4, Storm, SAMOA
• Non Distributed
  – Batch: R, WEKA, …
  – Stream: MOA
SAMOA Architecture
• Use S4, Storm, or other distributed stream processing platform
• Use MOA, or other streaming machine learning library
• Easy to extend through PACKAGES
[Diagram: SAMOA on top of S4, Storm, …; SAMOA components: Classifier Methods, Clustering Methods, Frequent Pattern Mining]
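The idea of keeping the execution platform and the learning library interchangeable can be sketched with two small interfaces. This is not SAMOA's API: every class and method name below is hypothetical, the "platform" is a single local process standing in for S4 or Storm, and the learner is a toy majority-class classifier standing in for a streaming library such as MOA.

```python
from abc import ABC, abstractmethod
from collections import Counter


class StreamLearner(ABC):
    """Hypothetical interface for a streaming learner (stand-in for MOA or similar)."""

    @abstractmethod
    def learn_one(self, features, label): ...

    @abstractmethod
    def predict_one(self, features): ...


class MajorityClassLearner(StreamLearner):
    """Toy learner: always predicts the most frequent label seen so far."""

    def __init__(self):
        self.label_counts = Counter()

    def learn_one(self, features, label):
        self.label_counts[label] += 1

    def predict_one(self, features):
        if not self.label_counts:
            return None  # nothing learned yet
        return self.label_counts.most_common(1)[0][0]


class StreamPlatform(ABC):
    """Hypothetical interface for an execution platform (stand-in for S4, Storm, ...)."""

    @abstractmethod
    def run(self, stream, learner): ...


class LocalPlatform(StreamPlatform):
    """Single-process platform: predict on each element first, then learn from it."""

    def run(self, stream, learner):
        correct = seen = 0
        for features, label in stream:
            if learner.predict_one(features) == label:
                correct += 1
            learner.learn_one(features, label)
            seen += 1
        return correct / seen if seen else 0.0


if __name__ == "__main__":
    # Synthetic stream: one "ham" for every three "spam" elements.
    stream = [({"x": i}, "ham" if i % 4 == 0 else "spam") for i in range(100)]
    accuracy = LocalPlatform().run(stream, MajorityClassLearner())
    print(f"accuracy of the toy learner: {accuracy:.2f}")
```

Swapping in another platform or another learner only requires a new subclass of the corresponding interface, which is the kind of extensibility the slide attributes to packages.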
Thanks!
http://samoa-project.net/
G. De Francisci Morales. SAMOA: A Platform for Mining Big Data Streams. Keynote Talk at RAMSS ’13: 2nd International Workshop on Real-Time Analysis and Mining of Social Streams @WWW, Rio de Janeiro, 2013.