Seminaire bigdata23102014

Big Data: Opportunities and Challenges Raja Chiky [email protected]


Transcript of Seminaire bigdata23102014

Page 1: Seminaire bigdata23102014

Big Data: Opportunities and Challenges

Raja Chiky – [email protected]

Page 2: Seminaire bigdata23102014
Page 3: Seminaire bigdata23102014

OUTLINE

• About me
• What is Big Data?
• Evolution of Business Intelligence
• Big Data opportunities
• Big Data challenges
• Conclusion

24/10/2014

Page 4: Seminaire bigdata23102014

About me

• Associate professor in Computer Science, LISITE-RDI
• Research interests: data stream mining, scalability and resource optimization in distributed architectures (e.g. cloud architectures), recommender systems
• Research field: large-scale data management

1. Real-time and distributed processing of various data sources
2. Use of semantic technologies to add a semantic layer
3. Recommender systems and collaborative data mining
4. Optimizing resources in large-scale systems
5. Modeling and validation of complex systems

Figure labels: heterogeneous and dynamic data streams; heterogeneous and static data; sensors

Page 5: Seminaire bigdata23102014

What is Big data?


Page 6: Seminaire bigdata23102014

Big Data: Buzzword!

Page 7: Seminaire bigdata23102014

New era

Page 8: Seminaire bigdata23102014

Where is all this data coming from?

Page 9: Seminaire bigdata23102014

More and more connected things

Page 10: Seminaire bigdata23102014

So, what is Big Data?

Big Data elements: Volume, Variety, Velocity

Volume of data created worldwide
• Dawn of time to 2003: 5 EB
• 2012: 2.7 ZB
• 2015: 10 ZB (E)

Units
• 1 YB = 10^24 bytes
• 1 ZB = 10^21 bytes
• 1 EB = 10^18 bytes
• 1 PB = 10^15 bytes
• 1 TB = 10^12 bytes
• 1 GB = 10^9 bytes

Velocity of data
• Walmart handles 1M transactions per hour
• Google processes 24 PB of data per day
• AT&T transfers 30 PB of data per day
• 90 trillion emails are sent per year
• World of Warcraft uses 1.3 PB of storage
• Facebook, when it had a user base of 900M users, had 25 PB of compressed data
• 400M tweets per day in June '12
• 72 hours of video are uploaded to YouTube every minute

Variety of data
• Radio
• TV
• News
• E-mails
• Facebook posts
• Tweets
• Blogs
• Photos
• Videos (user and paid)
• RSS feeds
• Wikipedia
• GPS data
• RFID
• POS scanners
• …

Source: Big Data & Analytics - Why Should We Care?, Vishwa Kolla

+ Veracity (IBM) - information uncertainty

Page 11: Seminaire bigdata23102014

Key factors

• Cheap storage
  - Recording everything is not expensive anymore
• Cloud computing
  - Cheap, on-demand computing resources from anywhere in the world and for everyone
• Business reasons
  - New insights arise that give competitive advantage
• Data in various forms everywhere: IoT and IoE, Social Networks, Open Data
• The way we interact with each other and with data / information
• …

Page 12: Seminaire bigdata23102014

Transforming our daily lives

Then: One size fits all
Now: Personalization & Targeted Selling

Source: Big Data Trends by David Feinleib

Page 13: Seminaire bigdata23102014

Fitness

Then: Manual tracking
Now: Focus on the goal

Source: Big Data Trends by David Feinleib

Page 14: Seminaire bigdata23102014

Customer service

Then: Reactive Customer Service
Now: Pro-active Customer Service

Source: Big Data Trends by David Feinleib

Page 15: Seminaire bigdata23102014

Customer service: 360-degree view of the customer

Questions: Why? What? How? When/Where? Who?

Data types: operational data, behavioral data, descriptive data, interaction data, contextual data

Page 16: Seminaire bigdata23102014

Big Data opportunities

Source: Big Data opportunities survey, Unisphere / SAP, May 2013.

Page 17: Seminaire bigdata23102014

Opportunities: big data use cases

360° view of the customer
• Integration of data from social networks, CRM, transactional data, etc.
• Example: T-Mobile, telecom operator -> customer churn reduced by 50% in one quarter

E-reputation
• Sentiment analysis, proactive monitoring of social networks
• Example: Nestlé, food group -> gained 4 places in the Reputation Institute's index thanks to 24/7 interaction

Optimisation
• Predictive analysis for anomaly detection, process optimization using sensors and operational data
• Example: Union Pacific Railroad -> fewer train derailments, more train shipments, reduced carbon emissions

Public security
• Monitoring of social networks, integration of spatial data and sensors
• Example: Serious Request 2012 -> monitoring of crowd movements with Twitter and sensors, location of security forces, integration with GIS

Page 18: Seminaire bigdata23102014

Evolution of Business Intelligence


Page 19: Seminaire bigdata23102014

Static Data, Semantic Data, Stream (Big) Data

Static Data
  Output: static report
  User interaction: ad-hoc queries, analytics
  Gathering information: ETL / batch processing
  Store: data warehouse
  Data sources: databases

Semantic Data
  Output: visual analytics
  User interaction: flexible queries / SPARQL
  Gathering information: semantic ETL / batch processing
  Store: triple store
  Data sources: structured/unstructured data

Stream (Big) Data
  Output: real-time analytics, real-time visual analytics
  User interaction: continuous queries / business rules
  Gathering information: semantic ETL, stream processing, knowledge enrichment, load shedding
  Store: databases / triplestores
  Data sources: sensors, static data, data streams
  Retroaction (feedback loop)

Page 20: Seminaire bigdata23102014

Static Data, Semantic Data, Stream (Big) Data (same diagram as on page 19)

Page 21: Seminaire bigdata23102014

Static Data, Semantic Data, Stream (Big) Data (same diagram as on page 19)

Page 22: Seminaire bigdata23102014

What are Big Data Challenges?


Page 23: Seminaire bigdata23102014

Big Data workflow

1.  Capture

2.  Store

3.  Analyze

4.  Visualize

Challenges arise in all these steps


Page 24: Seminaire bigdata23102014

Challenges: Data Collection

• Heterogeneity of sources
  - Company databases => silos
  - Sensor networks, intelligent objects
  - Data streams: social networks, financial information, etc.
• Data velocity
• Data provenance and quality

Page 25: Seminaire bigdata23102014

Type of data used in Big Data initiatives

Internal data Traditional sources

« New data »

Source: Big Data opportunities survey, Unisphere / SAP, May 2013.

Page 26: Seminaire bigdata23102014

Challenges: Data Collection

Velocity

Website logs, network monitoring, financial services, eCommerce, traffic control, power consumption, weather forecasting

Page 27: Seminaire bigdata23102014

What is a data stream?

• Golab & Özsu (2003): “A data stream is a real-time, continuous, ordered (implicitly by arrival time or explicitly by timestamp) sequence of items. It is impossible to control the order in which items arrive, nor is it feasible to locally store a stream in its entirety.”

• Massive volumes of data; items arrive at a high rate.
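Since a stream cannot be stored in its entirety, a summary has to be maintained in one pass and in constant memory. The sketch below illustrates that idea with a running mean and variance (Welford's method). The `sensor_stream` source and its values are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class RunningStats:
    """Constant-memory summary of all items seen so far (Welford's method)."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Population variance; 0.0 until at least one item has arrived.
        return self.m2 / self.count if self.count else 0.0


def sensor_stream(n: int) -> Iterator[float]:
    """Hypothetical source: yields readings one at a time, never as a list."""
    for i in range(n):
        yield 20.0 + (i % 5)


def summarize(stream: Iterable[float]) -> RunningStats:
    stats = RunningStats()
    for reading in stream:  # each item is seen once, then discarded
        stats.update(reading)
    return stats


stats = summarize(sensor_stream(10))
```

Memory use does not grow with the length of the stream, because only three numbers are kept.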


Page 28: Seminaire bigdata23102014

Data Stream Management Systems

DBMS vs DSMS

Data model
  DBMS: permanent updatable relations
  DSMS: streams and permanent updatable relations

Storage
  DBMS: data is stored on disk
  DSMS: permanent relations are stored on disk; streams are processed on the fly

Query
  DBMS: SQL language (creating structures; inserting/updating/deleting data; retrieving data with one-time queries)
  DSMS: SQL-like query language (standard SQL on permanent relations; extended SQL on streams with windowing; continuous queries)

Performance
  DBMS: large volumes of data
  DSMS: optimization of computer resources to deal with several streams and several queries; ability to face variations in arrival rates without crashing
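To make "continuous queries with windowing" concrete, here is a minimal sketch in plain Python rather than in any particular DSMS query language. It assumes events arrive in timestamp order. The function name `windowed_average` and the sample events are illustrative only.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Tuple

Event = Tuple[float, float]  # (timestamp in seconds, value)


def windowed_average(events: Iterable[Event],
                     window: float) -> Iterator[Tuple[float, float]]:
    """Continuous query: for each arriving event, emit the average value
    over the last `window` seconds (a time-based sliding window)."""
    buffer: Deque[Event] = deque()
    total = 0.0
    for ts, value in events:
        buffer.append((ts, value))
        total += value
        # Expire items that have fallen out of the window.
        while buffer and buffer[0][0] <= ts - window:
            _, old = buffer.popleft()
            total -= old
        yield ts, total / len(buffer)


results = list(windowed_average([(0, 10.0), (1, 20.0), (2, 30.0), (5, 40.0)],
                                window=3))
```

Unlike a one-time query, the result is re-evaluated on every arrival, and only the content of the current window is kept in memory.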

Page 29: Seminaire bigdata23102014

Challenges: Data Collection

Data provenance and quality

• Data provenance: provenance refers to the information that describes data in sufficient detail to facilitate reproduction and enable validation of results.

• Data quality: validity and consistency of the data. Is it up to date and fit for the targeted use case?

Source: Patrick McDaniel, Kevin Butler, Steve McLaughlin, Radu Sion, Erez Zadok, and Marianne Winslett, Towards a secure and efficient system for end-to-end provenance, 2010.

Page 30: Seminaire bigdata23102014

Challenges in data storage

• Large amounts of data
  - Need to use a highly distributed architecture
• Massive queries
  - Avoid joins since they are very time-consuming
• Evolutionary schema
  - Flexibility and scalability
• Predictable and low latency
• High availability
• Elasticity: horizontal extensibility (scale out)
• Not needed: transactions, strong consistency, complex queries
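One way to picture scale-out without joins is a key-value store whose keys are hash-partitioned across nodes, with related data embedded in a single record. The sketch below is a toy in-memory illustration under those assumptions. `ShardedStore` and the sample record are hypothetical, not the API of any real product.

```python
import hashlib
from typing import Any, Dict, List, Optional


class ShardedStore:
    """Toy key-value store spread over several in-memory 'nodes'."""

    def __init__(self, num_nodes: int) -> None:
        if num_nodes < 1:
            raise ValueError("need at least one node")
        self.nodes: List[Dict[str, Any]] = [{} for _ in range(num_nodes)]

    def _node_for(self, key: str) -> int:
        # Stable hash (the built-in hash() of str is salted per process).
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.nodes)

    def put(self, key: str, value: Any) -> None:
        self.nodes[self._node_for(key)][key] = value

    def get(self, key: str) -> Optional[Any]:
        return self.nodes[self._node_for(key)].get(key)


store = ShardedStore(num_nodes=4)
# Denormalized record: the customer's orders are embedded, so reading
# them is a single-key lookup on one node rather than a cross-node join.
store.put("customer:42", {"name": "Alice",
                          "orders": [{"id": 1, "total": 30.0}]})
```

Simple modulo placement moves most keys when the number of nodes changes. Consistent hashing is a common way to limit that movement.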

Page 31: Seminaire bigdata23102014

Limitation of RDBMS

“If the only tool you have is a hammer, you tend to see every problem as a nail.” Abraham Maslow

Page 32: Seminaire bigdata23102014

Limitation of RDBMS

Page 33: Seminaire bigdata23102014

NoSQL: Not Only Relational

• NoSQL => Not Only SQL
• SQL must not die, but other storage solutions should be considered for specific applications
• Exact name: non-relational DB

Page 34: Seminaire bigdata23102014

CAP theorem (E. Brewer, 2000; proved by S. Gilbert and N. Lynch, 2002)

“CAP Theorem”: C-A-P: choose two.

• C: Consistency
• A: Availability
• P: Partition-tolerance

Claim: every distributed system is on one side of the triangle.

• CA: available and consistent, unless there is a partition.

• AP: a reachable replica provides service even in a partition, but may be inconsistent.

• CP: always consistent, even in a partition, but a reachable replica may deny service without agreement of the others.
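The AP and CP behaviours can be illustrated with a deliberately simplified toy: two replicas of one register and a switchable partition. This is a sketch for intuition only, not a real replication protocol. `ReplicaPair` and `Unavailable` are hypothetical names, and reads are always served locally for simplicity.

```python
class Unavailable(Exception):
    """Raised when a replica refuses to serve a request."""


class ReplicaPair:
    """Two replicas of a single register, with a switchable partition."""

    def __init__(self, mode: str) -> None:
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.values = [0, 0]      # value held by replica 0 and replica 1
        self.partitioned = False  # True when the replicas cannot talk

    def write(self, replica: int, value: int) -> None:
        if not self.partitioned:
            self.values = [value, value]  # replicate synchronously
        elif self.mode == "CP":
            # Cannot reach the peer: refuse rather than diverge.
            raise Unavailable("write rejected during partition")
        else:
            # AP: stay available and accept a possibly inconsistent state.
            self.values[replica] = value

    def read(self, replica: int) -> int:
        return self.values[replica]


ap = ReplicaPair("AP")
ap.partitioned = True
ap.write(0, 7)               # accepted; the replicas now disagree

cp = ReplicaPair("CP")
cp.partitioned = True
try:
    cp.write(0, 7)
    cp_accepted = True
except Unavailable:
    cp_accepted = False      # rejected; the replicas still agree
```

In the AP case the write succeeds but the two replicas return different values. In the CP case the write is denied and both replicas keep the same value.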


Page 35: Seminaire bigdata23102014

NoSQL Taxonomy

Data models:
• Key-value
• Document
• Column
• Graph
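To give a feel for the four families, the sketch below represents the same small piece of information in each shape using plain Python structures. All names and values (Alice, Bob, `followed_by`) are hypothetical, and no real database API is implied.

```python
# Key-value: an opaque value looked up by a single key.
key_value = {"user:1": '{"name": "Alice", "city": "Paris"}'}

# Document: a nested, self-describing record queried by its fields.
document = {"id": 1, "name": "Alice", "address": {"city": "Paris"},
            "follows": [2]}

# Column (wide-column): row key -> column family -> column -> value.
wide_column = {"user:1": {"profile": {"name": "Alice", "city": "Paris"},
                          "social": {"follows:2": "2014-10-24"}}}

# Graph: nodes plus explicit, typed edges between them.
nodes = {1: {"name": "Alice"}, 2: {"name": "Bob"}}
edges = [(1, "FOLLOWS", 2)]


def followed_by(user_id: int) -> list:
    """Graph-style traversal: names of the users that `user_id` follows."""
    return [nodes[dst]["name"] for src, rel, dst in edges
            if src == user_id and rel == "FOLLOWS"]
```

The main difference is what the store can see: a key-value store knows only the key, a document or column store can address individual fields, and a graph store treats relationships as first-class data.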


Page 36: Seminaire bigdata23102014

Challenges in Data Analytics

• Problems in large-scale analytics
  - Distributed computation efficiency
  - Evaluate performance gains from distribution
  - Bringing data to the processor
  - Efficient parallel algorithms (statistics, summaries)
• Speed analytics
  - Streaming computations
  - Load balancing
  - Load shedding
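Load shedding means deliberately dropping part of the input when arrivals exceed processing capacity. A minimal sketch of one simple policy, uniform random shedding per time slice, is given below. The names `shed_load` and `estimated_sum`, the capacity, and the burst are assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple


def shed_load(batch: Sequence[float], capacity: int,
              rng: random.Random) -> Tuple[List[float], float]:
    """Keep at most `capacity` items from one time slice of the stream.

    Returns the kept items and the sampling rate actually applied,
    so downstream aggregates can be scaled back up.
    """
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    if len(batch) <= capacity:
        return list(batch), 1.0               # no overload: nothing dropped
    kept = rng.sample(list(batch), capacity)  # uniform random shedding
    return kept, capacity / len(batch)


def estimated_sum(kept: Sequence[float], rate: float) -> float:
    """Estimate the sum of the full slice from the surviving sample."""
    return sum(kept) / rate


rng = random.Random(0)
burst = [1.0] * 1000                  # 1000 arrivals, capacity for only 100
kept, rate = shed_load(burst, capacity=100, rng=rng)
approx_total = estimated_sum(kept, rate)
```

Random shedding trades accuracy for stability: the system keeps up with the burst, and aggregates become estimates rather than exact values.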


Page 37: Seminaire bigdata23102014

Challenges in Data Access and Visualization

• The main goal of data visualization is to communicate information clearly and effectively through graphical means

• Provide the results of the analytics workflow to faster systems, such as real-time query interfaces

“Visualization is a form of knowledge compression” - David McCandless

Page 38: Seminaire bigdata23102014

Big Data: Technological challenges

• Data infrastructure tools and platforms: data centers, cloud infrastructures, NoSQL databases, in-memory databases, Hadoop/MapReduce ecosystem

• New generation of front-end tools for BI and analytic systems: data visualization and visual analytics, self-service BI, mobile BI

• Data processing: supercomputers, distributed or massively parallel computing

Page 39: Seminaire bigdata23102014


Page 40: Seminaire bigdata23102014

Conclusion: Big Data challenges

• Semantic information aggregation
  - Information aggregation: “too much data to assimilate but not enough knowledge to act”
• Distributed and real-time processing
  - Design of real-time and distributed algorithms for stream processing and information aggregation
  - Distribution and parallelization of data mining algorithms
  - Optimizing resources
• Visual analytics and user modeling
  - Dynamic user model
  - Novel visualizations for very large datasets
• Data protection

Page 41: Seminaire bigdata23102014

IEEE Metro Area Smart Tech Workshop on Distributed Data Streaming, Dec 5, 2014, Paris

08h00: Registration - Breakfast
08h50: Room L012 - Welcome
09h00: Room L012 - Introduction to Distributed Data Streaming - Speaker: Raja Chiky (ISEP)
10h15: Coffee break
10h45: Room L012 - Real World Issues in Supervised Classification for Data Streams - Speaker: Vincent Lemaire (Orange Labs)
11h30: Room L012 - Use Case 1 - Finance - Speaker: Antoine Chambille (Quartet FS)
12h00: Room L012 - Use Case 2 - Smart metering - Speaker: Marie-Luce Picard (EDF R&D)
12h30: Lunch offsite
14h00: Rooms L305-L306 - 2 parallel lab sessions: Real-Time Data processing with open source DSMS - Speakers: Raja Chiky and Sylvain Lefebvre - 1st part
15h30: Coffee break
16h00: Rooms L305-L306 - 2 parallel lab sessions: Real-Time Data processing with open source DSMS - Speakers: Raja Chiky and Sylvain Lefebvre - 2nd part
17h30: Reception onsite

Page 42: Seminaire bigdata23102014

Thanks to Marie-Aude Aufaure (ECP) and Sylvain Lefebvre (ISEP)

Page 43: Seminaire bigdata23102014

Big Data
Volume, Variety, Velocity, Veracity, … Value
Integrate, aggregate, analyze, and visualize large data sets, whatever their type, provenance, or speed of flow …

Linked Data
Web of data, Semantic Web
- A set of principles and good practices for linking, publishing, and searching web data
- Structure and semantically enrich RDF data, with very high scalability
-> Big Linked Data

Big Linked Data / Linked Big Data

Our value proposition
- Semantic aggregation from textual and non-textual streams
- Manage semantic heterogeneity, real-time and distributed processing
- Ensure data quality and veracity
- Visual analytics

Semantic Technologies

Living Lab

Linked & Big Data Academic Chair