Centre de Calcul de l’Institut National de Physique Nucléaire et de Physique des Particules XLDB 2017 10/11/2017


Page 1:

Centre de Calcul de l’Institut National de Physique Nucléaire et de Physique des Particules

XLDB 2017 10/11/2017

Page 2: Modern data management

•  Very large amounts of data
•  Very volatile data
•  Data as streams
•  Schemaless data
•  Data quality

Page 3: XLDB

•  XLDB since 2011
•  XLDB 2017
   ◦  2.5 days
   ◦  140 attendees
   ◦  14 talks
   ◦  17 lightning talks
   ◦  2 poster sessions
   ◦  2 demonstrations
•  Speakers
   ◦  LIRIS
   ◦  Databricks (SPARK)
   ◦  CERN
   ◦  SAP Big Data
   ◦  Imperial College
   ◦  …

Page 4: Talks

•  SAP Hub
   ◦  Platform for connecting to Hadoop, …
•  LIRIS
   ◦  Evaluation of Hive and HadoopDB with LSST data
   ◦  Performance
      ▪  Loading: Hive is better
      ▪  Query response time: HadoopDB is better on large data volumes
   ◦  Scalability (25 to 50 machines): both scale well

Page 5: Talks

•  LIRIS proposal

Page 6: Talks

Page 7: Talks

•  Flink, like SPARK
   ◦  Batch, streaming, …
•  SPARK: new stream features
   ◦  SQL can be run on streams (Bullet)
   ◦  Checkpoints improve fault tolerance
   ◦  Aggregation by window
•  LeanXcale
   ◦  New database vendor
   ◦  Scalable transactional management: scales up to many millions of transactions per second
   ◦  OLTP and OLAP
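To make "aggregation by window" concrete, here is a minimal sketch in plain Python of a tumbling (fixed-size, non-overlapping) window sum. It is not Spark's API: the function name `tumbling_window_sum` and the `(timestamp, key, value)` event layout are assumptions made for illustration only.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical event layout: (timestamp in seconds, key, value).
Event = Tuple[float, str, float]


def tumbling_window_sum(
    events: Iterable[Event], window_seconds: int
) -> Dict[Tuple[int, str], float]:
    """Sum values per (window start, key) over fixed, non-overlapping windows."""
    totals: Dict[Tuple[int, str], float] = defaultdict(float)
    for timestamp, key, value in events:
        # Each event belongs to exactly one window, identified by its start time.
        window_start = int(timestamp // window_seconds) * window_seconds
        totals[(window_start, key)] += value
    return dict(totals)


events = [(1.0, "a", 2.0), (4.0, "b", 1.0), (9.5, "a", 3.0), (12.0, "a", 5.0)]
sums = tumbling_window_sum(events, window_seconds=10)
```

With 10-second windows, the first three events fall into the window starting at 0 and the last one into the window starting at 10. A streaming engine does the same grouping incrementally and uses checkpoints to recover the partial sums after a failure.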

Page 8: Talks

•  MonetDB
   ◦  Column storage
   ◦  Interesting product, but no map and not very well documented
•  CEPH presentation (CERN)
   ◦  Open-source product
   ◦  CEPH does not use replicated blocks
   ◦  Storage virtualisation

Page 9: Talks

•  CloudMdsQL
   ◦  Provides integrated access to multiple, heterogeneous cloud data stores such as NoSQL, HDFS and RDBMS
   ◦  Other polystores: SPARKSQL, Polybase, …
   ◦  Issue: executing joins between RDBMS, HDFS and NoSQL
   ◦  Not open source (LeanXcale)

[Figure: QueryProcessor, RDBMSWrapper, HDFSWrapper; query fragments: SELECT id, x FROM A; SCAN(…).MAP(…).REDUCE(…).FILTER(KEY IN (1,3)).PROJECT(…)]
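The join issue can be sketched as a small mediator in Python. This is an illustration of the polystore idea only, not CloudMdsQL's implementation: the three function names mirror the component labels on the slide, while the table `A`, the sample rows, the log lines and the assumption that the SQL fragment goes to the relational wrapper and the scan/map/reduce/filter chain to the file wrapper are all hypothetical.

```python
import sqlite3
from typing import Dict, Iterable, List, Tuple


def rdbms_wrapper(conn: sqlite3.Connection) -> List[Tuple[int, str]]:
    """Relational side: push a plain SQL sub-query down to the database."""
    return conn.execute("SELECT id, x FROM A ORDER BY id").fetchall()


def hdfs_wrapper(lines: Iterable[str], keys: Tuple[int, ...]) -> Dict[int, int]:
    """File side: scan raw lines, map to a key, reduce to a count, filter on keys."""
    counts: Dict[int, int] = {}
    for line in lines:                          # scan
        key = int(line.split(",")[0])           # map: extract the key
        counts[key] = counts.get(key, 0) + 1    # reduce: count per key
    return {k: v for k, v in counts.items() if k in keys}  # filter


def query_processor(
    conn: sqlite3.Connection, lines: Iterable[str], keys: Tuple[int, ...]
) -> List[Tuple[int, str, int]]:
    """Join the two intermediate results on id inside the mediator."""
    file_side = hdfs_wrapper(lines, keys)
    return [
        (row_id, x, file_side[row_id])
        for row_id, x in rdbms_wrapper(conn)
        if row_id in file_side
    ]


# Hypothetical sample data standing in for an RDBMS table and an HDFS file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE A (id INTEGER PRIMARY KEY, x TEXT)")
conn.executemany(
    "INSERT INTO A VALUES (?, ?)", [(1, "alpha"), (2, "beta"), (3, "gamma")]
)
log_lines = ["1,click", "3,view", "1,view", "2,click"]
joined = query_processor(conn, log_lines, keys=(1, 3))
```

The difficulty named on the slide shows up in `query_processor`: neither store can evaluate the join itself, so the mediator has to pull both intermediate results and combine them, which is why pushing filters down to each wrapper matters.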

Page 10: Talks

•  Knowledge Preservation in HEP (Notre Dame)
   ◦  Huge investment in producing data for science
   ◦  Data can be wasted or not reused
   ◦  Data preservation: “backing up your hard drive”
   ◦  Harder problem: software + “knowledge”
   ◦  Data And Software Preservation for Open Science
      ▪  CERN – DESY – CNRS
      ▪  Containers Portability = Preservation!
      ▪  CERN Open Data Portal
      ▪  CERN Analysis Preservation
   ◦  How can new analysis tools be preserved?

Page 11: Lightning talks

•  European Bioinformatics Institute: genomics based on ElasticSearch
   ◦  8 data nodes (2 cores / 32 GB RAM / 200 GB disk)
   ◦  3.2 billion documents / 782 GB
   ◦  Complex query on >100 million genes: ~500 ms
•  Bullet
   ◦  A real-time query engine that lets you run queries on very large data streams
   ◦  Open source
   ◦  Components: Storm, Kafka
•  Kafka Kloner, like MirrorMaker
   ◦  A dynamic high-speed inter-cluster Kafka replicator
   ◦  Developed by Yahoo for Yahoo
   ◦  150 billion events per day with an average latency of around 2 s
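The figures quoted on this slide are easier to compare once converted to per-second and per-document values. The short calculation below is a back-of-the-envelope sketch: it assumes the load is spread evenly over the day, that 1 GB means 10^9 bytes, and that Little's law (items in flight = rate × latency) applies in steady state. The variable and function names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86 400 seconds


def events_per_second(events_per_day: float) -> float:
    """Average rate, assuming events are spread evenly over the day."""
    return events_per_day / SECONDS_PER_DAY


# Kafka Kloner: 150 billion events per day, about 2 s average latency.
rate = events_per_second(150e9)   # roughly 1.7 million events per second
in_flight = rate * 2.0            # Little's law: roughly 3.5 million events in transit

# ElasticSearch index: 782 GB for 3.2 billion documents (1 GB taken as 10**9 bytes).
avg_doc_bytes = 782e9 / 3.2e9     # roughly 244 bytes per document
```

Real traffic is rarely uniform, so peak rates are likely higher than the average computed here.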

Page 12: Lightning talks

•  CERN Openlab project with SPARK
   ◦  Physics Data Analytics and Data Reduction with Apache Spark
   ◦  CMS Experiment
•  Oracle Database In-Memory
   ◦  Significant improvement for data warehouse appliances

Page 13: Lightning talks

•  AstroSpark
   ◦  SPARK for astronomical data: Cone-Search, Cross-Match, …
   ◦  Data partitioning and indexing with HEALPix
   ◦  Query optimizer for astronomical queries
   ◦  Astronomical Data Query Language support
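A cone search returns every catalogue source within a given angular radius of a sky position. The sketch below shows the operation in plain Python with a brute-force scan. It is not AstroSpark's implementation: a real system would first use the HEALPix index to restrict the scan to the partitions that overlap the cone. The function names and the three-source catalogue are hypothetical.

```python
import math
from typing import Iterable, List, Tuple

# Hypothetical catalogue row: (name, right ascension in degrees, declination in degrees).
Source = Tuple[str, float, float]


def angular_separation_deg(ra1: float, dec1: float, ra2: float, dec2: float) -> float:
    """Great-circle angle between two sky positions, using the haversine formula."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (
        math.sin((dec2 - dec1) / 2) ** 2
        + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2
    )
    # min() guards against rounding pushing sqrt(h) slightly above 1.
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def cone_search(
    catalog: Iterable[Source], ra: float, dec: float, radius_deg: float
) -> List[str]:
    """Names of all sources within radius_deg of the position (ra, dec)."""
    return [
        name
        for name, s_ra, s_dec in catalog
        if angular_separation_deg(ra, dec, s_ra, s_dec) <= radius_deg
    ]


catalog = [("A", 10.0, 20.0), ("B", 10.05, 20.02), ("C", 50.0, -30.0)]
nearby = cone_search(catalog, ra=10.0, dec=20.0, radius_deg=0.1)
```

A cross-match applies the same separation test between the sources of two catalogues instead of between one position and one catalogue.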

Page 14: Demonstration

•  QSERV
   ◦  Two queries were executed before the SSH connection to CC was lost
   ◦  Shared-nothing architecture
   ◦  Big challenge, but much remains to be done:
      ▪  Fault tolerance
      ▪  Data distribution
      ▪  Big queries
•  Wikidata
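The shared-nothing idea can be sketched as follows: each worker owns its own partition of the data, and a coordinator scatters a query to all workers and merges the partial answers. This is a generic Python illustration, not QSERV's code. The `Worker` and `Coordinator` classes are hypothetical, and the modulo placement of rows is an assumption chosen for brevity rather than the partitioning QSERV uses.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, float]


class Worker:
    """Owns its partition exclusively: no disk or memory shared with other workers."""

    def __init__(self) -> None:
        self.rows: List[Row] = []

    def count_where(self, predicate: Callable[[Row], bool]) -> int:
        """Evaluate the query on the local partition only."""
        return sum(1 for row in self.rows if predicate(row))


class Coordinator:
    """Distributes rows across workers and merges their partial results."""

    def __init__(self, n_workers: int) -> None:
        self.workers = [Worker() for _ in range(n_workers)]

    def load(self, rows: Iterable[Row]) -> None:
        for row in rows:
            # Hypothetical placement rule: row id modulo the number of workers.
            self.workers[int(row["id"]) % len(self.workers)].rows.append(row)

    def count_where(self, predicate: Callable[[Row], bool]) -> int:
        # Scatter the query to every worker, then gather and add the counts.
        return sum(worker.count_where(predicate) for worker in self.workers)


cluster = Coordinator(n_workers=3)
cluster.load({"id": i, "mag": float(i)} for i in range(10))
bright = cluster.count_where(lambda row: row["mag"] > 4.0)
```

The open points listed on the slide map directly onto this sketch: there is no recovery if a worker disappears (fault tolerance), the placement rule decides how evenly work is spread (data distribution), and a query that cannot be reduced to small partial results puts all the load on the coordinator (big queries).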