CERN openlab V preparation, Data Analytics (for research)
Many contributors, especially EN-ICE and IT-DB
Challenges
Online triggers and DAQ
Offline simulation and processing
Data storage architectures
Resource management and provisioning
Data analytics
Networks and connectivity
Use case: Quench Protection System
Critical system for LHC operation
• Major upgrade for LHC Run 2 (2015-2018)
High throughput for data storage requirement
• Constant load of 150k changes/s from 100k signals
Whole data set is transferred to long-term storage DB
• Query + Filter + Insertion
Analysis performed on both DBs
[Diagram labels: RDB Archive (16 projects around LHC); LHC Logging (long-term storage); Backup]
Credit: Kacper Szkudlarek EN-ICE
Use case: Quench Protection System
Nominal conditions
• Stable constant load of 150k changes/s
• 100 MB/s of I/O operations
• 500 GB of data stored each day
Peak performance
• Exceeded 1 million value changes per second
• 500-600 MB/s of I/O operations
All CERN production WinCC OA systems (accelerators, detectors and technical infrastructure, 600 servers) will benefit from these optimizations
Next challenge: ~10x increase
• Required for next major upgrade (2019-2020)
Credit: Kacper Szkudlarek EN-ICE
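The nominal figures quoted for the Quench Protection System can be cross-checked with simple arithmetic. The sketch below is only a back-of-the-envelope estimate: it assumes decimal units (1 GB = 10^9 bytes, 1 MB = 10^6 bytes) and assumes that the "~10x increase" applies to the nominal change rate, neither of which the slides state explicitly.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figures taken from the slide (nominal conditions).
nominal_changes_per_s = 150_000
stored_bytes_per_day = 500e9       # 500 GB/day, assuming decimal gigabytes
nominal_io_bytes_per_s = 100e6     # 100 MB/s, assuming decimal megabytes

# Value changes recorded per day at the nominal rate.
changes_per_day = nominal_changes_per_s * SECONDS_PER_DAY

# Average stored size per value change (roughly 39 bytes).
bytes_per_change = stored_bytes_per_day / changes_per_day

# Total I/O volume per day, which is larger than the stored volume
# because I/O also covers querying, filtering and re-insertion.
io_bytes_per_day = nominal_io_bytes_per_s * SECONDS_PER_DAY

# Assumption: the ~10x target scales the nominal change rate.
target_changes_per_s = 10 * nominal_changes_per_s

print(f"{changes_per_day:,} changes/day")
print(f"{bytes_per_change:.1f} bytes stored per change")
print(f"{io_bytes_per_day / 1e12:.2f} TB of I/O per day")
print(f"{target_changes_per_s:,} changes/s after a 10x increase")
```

Under these assumptions the target rate is above the peak of about 1 million changes per second already observed, which is consistent with the slide calling it the next challenge.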
Anomaly detection
>SVM: Support Vector Machines
Credit: Massimo Lamanna, Sebastien Ponce (IT-DSS), Stefano Alberto Russo (ex IT-DB)
Data Placement / ATLAS
>Use cases:
• Trace Mining (user interactions with Distributed Data Management)
• Popularity (used for deciding which data to delete)
• Accounting and popularity (reports on data contents/popularity)
• Log file aggregation
>ATLAS Distributed Data Management uses both SQL and NoSQL
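As an illustration of the popularity use case, the sketch below ranks datasets by access count and proposes the least accessed ones as deletion candidates. It is not the ATLAS implementation: the function name `deletion_candidates`, the `keep_top` parameter and the sample dataset names are hypothetical, and a real system would also weigh size, replica count and age.

```python
from collections import Counter

def deletion_candidates(datasets, access_log, keep_top):
    """Return the least popular datasets, least accessed first.

    datasets   -- all dataset names currently stored
    access_log -- one entry per recorded access (dataset name)
    keep_top   -- number of most popular datasets to always keep
    """
    counts = Counter(access_log)
    # Sort by access count, then by name so that ties are deterministic.
    ranked = sorted(datasets, key=lambda name: (counts[name], name))
    n_candidates = max(0, len(datasets) - keep_top)
    return ranked[:n_candidates]

# Hypothetical trace: A accessed 3 times, B twice, C once, D never.
datasets = ["A", "B", "C", "D"]
access_log = ["A", "A", "B", "A", "C", "B"]
candidates = deletion_candidates(datasets, access_log, keep_top=2)
print(candidates)
```

Datasets that never appear in the trace get a count of zero and therefore come first in the candidate list.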
Data Placement / CMS
>Intelligent data placement models for the CMS experiment
>Need to extract further knowledge from the monitoring data in order to implement an effective data placement
• Correlate file-access monitoring with site status (readiness, queue length, storage and CPU available)
• Classify analysis activities and needed resources
• Make recommendations
• Learn from the past trends and patterns
Data Placement / EMBL-EBI
>To support the diverse data analysis that will take place within ELIXIR, the ability to ‘push’ data from a provider to a major analysis centre, or for the major analysis centre to ‘pull’ the required data set from a nearby source, becomes a critical capability
Domain specific language
>LHC Logging (50+ TB/year)
>Perform analysis as close to data as possible, in-database analysis: built-in + ORE?
>Multi source extraction API
>Domain specific language
Credit: Chris Roderick BE-CO
Network monitoring
>Time correlation
• During a PS throughput test, was there any known activity in the same link?
• There is packet loss, does this appear as degraded performance somewhere at the same time?
>We observe loss of performance in some network link
• Is it a network problem and where?
• Is it a storage problem?
Credit: Simone Campana
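The time-correlation questions above amount to finding events on the same link whose time windows overlap. The following is a minimal sketch of that idea, not a description of any existing CERN tool: the `Event` record, the `overlapping` function, the `tolerance` parameter and all sample events are hypothetical.

```python
from datetime import datetime, timedelta
from typing import NamedTuple

class Event(NamedTuple):
    source: str        # which monitoring system reported it
    link: str          # network link identifier
    start: datetime
    end: datetime

def overlapping(event, others, tolerance=timedelta(0)):
    """Return events on the same link whose window overlaps `event`.

    `tolerance` widens the window of `event` on both sides, which helps
    when the clocks of the monitoring sources are not perfectly aligned.
    """
    lo = event.start - tolerance
    hi = event.end + tolerance
    return [o for o in others
            if o.link == event.link and o.start <= hi and lo <= o.end]

day = datetime(2014, 1, 1)
test = Event("throughput test", "link-A",
             day.replace(hour=10), day.replace(hour=10, minute=30))
others = [
    Event("packet loss", "link-A",
          day.replace(hour=10, minute=20), day.replace(hour=10, minute=25)),
    Event("storage maintenance", "link-B",
          day.replace(hour=10, minute=10), day.replace(hour=10, minute=40)),
    Event("backup transfer", "link-A",
          day.replace(hour=11), day.replace(hour=11, minute=30)),
]

same_time = [e.source for e in overlapping(test, others)]
nearby = [e.source for e in overlapping(test, others, timedelta(minutes=30))]
print(same_time, nearby)
```

Dropping the `o.link == event.link` condition would turn the same check into the broader question of whether a degradation elsewhere (for example in storage) coincides with the network event.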
ESA
>Envisage “intelligent” bots doing much of the researcher's work in scanning the archives to collect relevant information in a particular field.
>Such “automated bots” would present their results only when called upon and only focused on a problem at hand (e.g. give me serendipitous objects in the X-Ray range lying around the Crab Nebula, since an unexplained region of hot gas may have an effect on the infra-red region I am studying…).
>The bot may be further refined to extract only very good quality data from all X-Ray missions or for a given time
Credit: Salim Ansari
Analytics and Modelling for Availability Improvement in the FCC
>Near real-time modelling of the accelerator complex and its infrastructure services would further improve early warning capabilities, permit preventive maintenance and leverage co-scheduling of fault-prevention interventions
>Real-world use-cases taken from LHC accelerator operation shall serve as the basis to develop formal data analytics scenarios
Credit: Johannes Gutleber
Data analytics on scientific articles
>INSPIRE, ZENODO, ORCID
>Automated extraction of information (about authors, references, key words, etc.)
>Semantic analysis of text allowing identification of the main field, key words (not appearing in the text), sentiment of references; validation based on their importance within the context of the publication and the ability to join and correlate concepts from different domains and publications.
Credit: Tim Smith
Administrative Information System
>(among others)
>Make the data available using a bi-temporal model: one time dimension comes from the business (e.g. contractual dates); the other is purely technical, indicates when each piece of data was effectively part of the DWH, and allows writing queries using a “show data as of” date
Credit: Derek Mathieson
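A bi-temporal table carries two validity intervals per row, and a “show data as of” query filters on both. The sketch below uses an in-memory SQLite database purely for illustration; the `contract` table, its column names and the sample rows are hypothetical and do not reflect the actual DWH schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE contract (
        employee_id   INTEGER,
        valid_from    TEXT,  -- business time: contractual start date
        valid_to      TEXT,  -- business time: contractual end (exclusive)
        recorded_from TEXT,  -- technical time: row entered the DWH
        recorded_to   TEXT   -- technical time: row was superseded (exclusive)
    )
""")
conn.executemany("INSERT INTO contract VALUES (?, ?, ?, ?, ?)", [
    # First version: contract recorded as ending at the end of 2014.
    (1, "2014-01-01", "2015-01-01", "2014-01-10", "2014-06-01"),
    # Correction recorded on 2014-06-01: contract extended to end of 2015.
    (1, "2014-01-01", "2016-01-01", "2014-06-01", "9999-12-31"),
])

def contract_as_of(conn, employee_id, business_date, as_of):
    """Contracts valid on `business_date`, as the DWH knew them on `as_of`.

    Dates are ISO strings, so string comparison matches date order.
    """
    cur = conn.execute(
        """SELECT valid_from, valid_to FROM contract
           WHERE employee_id = ?
             AND valid_from <= ? AND ? < valid_to
             AND recorded_from <= ? AND ? < recorded_to""",
        (employee_id, business_date, business_date, as_of, as_of))
    return cur.fetchall()

# Was employee 1 under contract on 2015-03-01?
before_fix = contract_as_of(conn, 1, "2015-03-01", as_of="2014-03-01")
after_fix = contract_as_of(conn, 1, "2015-03-01", as_of="2014-07-01")
print(before_fix, after_fix)
```

The same business question gives different answers depending on the “as of” date, because the extension was not yet recorded in March 2014; this is what allows past reports to be reproduced exactly.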
Technology
Near real time processing
• Processing large amounts of data (gigabytes per second) with low latency (in the order of seconds) coming from different sources and domains
Batch processing (including predictive analytics)
• Linear and nonlinear modelling, classical statistical tests, complex time-series analysis and forecasting, classification, clustering
Data repositories, RDBMS and NoSQL
Integration Challenges (Data Analytics as a Service)
Analytics as a service
> “Analytics platform” or (Big data) “Analytics-as-a-service” (A3S ?):
> Data fed from multiple sources (live)
> Stored reliably
> Data processing with multiple systems
> Easy access, domain expert natural language (DSL)
> Visualisation
> Special interest from Human Brain Project
Credit: CERN EN-ICE
Education
“Data scientist” role type
Variety of tools and ideas, important theoretical/academic background
Implement a workshop/training along the lines of the one on multi-threading and parallelism
Clear need for and interest in data analytics education and information sharing
Conclusion
Interest from many parts of CERN: experiments, engineering, administrative, IT
Leverages the work done in openlab IV
Combined from the beginning with a multi department AaaS service
Education and outreach
Interest from other research laboratories and openlab partners
• Challenges
• Interest in shared research / investigation / deployment