
  • TrustwOrthy model-awaRE Analytics Data platfORm

    Beniamino Di Martino

    CINI - Università della Campania Luigi Vanvitelli

    BigData: programming analytics for multiple BD Platforms deployment -

    The Toreador Code-Based Approach

  • Toreador EC H2020 project


    TrustwOrthy model-awaRE Analytics Data platfORm

    Call: ICT-BIGDATA-2015

    Project Coordinator:

    Prof. Ernesto Damiani

    WP2 leader:

    Prof. Beniamino Di Martino

  • Objectives


    Specification of a fully declarative framework and a model set supporting Big Data analytics.

    MBDAaaS: Model-Based Big Data Analytics-as-a-Service, providing models of the entire Big Data analysis process and of its artefacts.

    Automation and commoditization of Big Data analytics, enabling it to be easily tailored to domain-specific customer requirements.

    SLA and assurance approaches to guarantee contractual quality, performance, and security of BDA.

    Design and development of automatic deployment of TOREADOR analytic solutions on multiple BD platforms.

  • Pilots


    SAP: The Application Log Analysis Pilot (security log).

    LIGHTSOURCE: The Energy Production Data Analysis Pilot (sensor-based management of photovoltaic plants).

    JoT: The Clickstream Analysis Pilot (e.g., fraud control).

    DTA: The Aerospace Products Manufacturing Analysis Pilot (data related to the manufacturing process).

  • Model-driven approach

    Abstract the typical procedural models (e.g., data pipeline) implemented in big data frameworks

    Develop model transformations to translate modelling decisions into actual provisioning on multiple Big Data Platforms

    TOREADOR Overview (Recall)

    [Figure: the three TOREADOR model layers (Declarative Models, Procedural Models, Deployment Models) with their roles: (non-)functional goals as the service goals of the Big Data pipeline, what the BDA should achieve and how to achieve its objectives, how the BDA process should work, and the multiple target platforms.]

  • Specify goals as service level objectives (SLO).

    Five areas: representation, preparation, analytics, processing, display and reporting. A single model, because aspects of different areas may impact the same procedural model template.

    Based on a controlled vocabulary: controlled names (e.g., anonymization) and values on an ordinal scale, expressed as strings or numbers (e.g., obfuscation, hashing, k-anonymity); see the sketch below.

    Insufficient on its own to run Big Data analytics: the declarative model is incomplete.
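    As a purely illustrative sketch (the keys, names and values below are hypothetical and do not reproduce the official TOREADOR controlled vocabulary), such a declarative model can be pictured as a small key/value structure grouped by the five areas:

    # Hypothetical declarative model: SLOs grouped by the five areas.
    # Names and ordinal values are illustrative, not the TOREADOR vocabulary.
    declarative_model = {
        "representation": {"data_model": "relational"},
        "preparation": {"anonymization": "k-anonymity"},  # ordinal scale, e.g. obfuscation < hashing < k-anonymity
        "analytics": {"task": "clustering", "quality": "high"},
        "processing": {"mode": "batch", "latency": "relaxed"},
        "display_reporting": {"visualization": "dashboard"},
    }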

    TOREADOR Overview: Declarative Model

    [Figure: the three-layer overview (Declarative, Procedural, Deployment Models), highlighting the Declarative Models: Service Level Objectives (SLOs) as the service goals of the Big Data pipeline.]

  • Complete the declarative model with all information needed for running analytics. Simple to map SLOs onto procedures. Platform independent.

    Specified using a function that takes SLOs as input and returns procedural templates as output; procedural templates are pre-calculated based on the defined SLOs (sketched below).

    Templates express the competences of data scientists and data technologists. Declarative models are used to select the (set of) proper templates.
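    A minimal sketch, assuming hypothetical template names and matching rules, of such a function mapping SLOs (here reusing the illustrative declarative_model above) to pre-calculated procedural templates:

    # Hypothetical pre-calculated procedural templates, indexed by SLO values.
    # Real templates encode the competences of data scientists and technologists.
    PROCEDURAL_TEMPLATES = {
        ("clustering", "batch"): ["ingest", "prepare", "kmeans", "report"],
        ("clustering", "stream"): ["ingest", "window", "streaming-kmeans", "report"],
    }

    def select_templates(declarative_model):
        """Select the (set of) proper procedural templates from the SLOs; platform independent."""
        key = (declarative_model["analytics"]["task"],
               declarative_model["processing"]["mode"])
        return PROCEDURAL_TEMPLATES[key]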

    TOREADOR Overview: Procedural Model

    [Figure: the three-layer overview (Declarative, Procedural, Deployment Models), highlighting the Procedural Models: how to achieve the objectives and how the BDA process should work, starting from the SLOs of the Declarative Models.]

  • Specify how procedural models are incarnated in a ready-to-be-deployed architecture. Drive analytics execution in real scenarios. To be defined for each TOREADOR application scenario. Platform dependent.
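    For illustration only (engine and step names are hypothetical and not taken from the TOREADOR pilots), a platform-dependent deployment model could bind each step of a selected procedural template to concrete components of one target platform:

    # Hypothetical deployment model: binds procedural steps to one target platform.
    deployment_model = {
        "platform": "hadoop-cluster-A",  # platform dependent
        "bindings": {
            "ingest": {"engine": "flume", "nodes": 2},
            "prepare": {"engine": "spark", "nodes": 4},
            "kmeans": {"engine": "spark", "nodes": 8},
            "report": {"engine": "hive", "nodes": 1},
        },
    }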

    TOREADOR Overview: Deployment Model

    [Figure: the three-layer overview (Declarative, Procedural, Deployment Models), highlighting the Deployment Models: mapping the BDA process onto the multiple target platforms.]

  • The Toreador Dev & Deploy Process

    [Diagram: the TOREADOR development and deployment pipeline. A Service Catalog (OWL-S Service Ontology) feeds Service Definition / Implementation (OWL-S Composition Model, Service Model, code of the services (jobs), APIs); this in turn feeds Deployment / Compiler (OWL-S Editor, OWL-S Selector, Code Compiler, JSON declarative and code-based models, Service Composition, Service Selection) and finally Platform / Execution.]
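    As a sketch only (field names are hypothetical and do not reproduce the project's OWL-S or JSON schemas), a service-catalog entry flowing through this pipeline might record the service model, its implementation (job code) and its API:

    # Hypothetical JSON-style service catalog entry (not the TOREADOR schema).
    service_entry = {
        "service": "kmeans-clustering",
        "ontology_class": "AnalyticsService",  # reference into the OWL-S service ontology
        "model": {"inputs": ["dataset"], "outputs": ["clusters"]},
        "implementation": {"job": "jobs/kmeans.py", "api": "REST"},
    }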

  • Two development approaches

    Service-based: no coding, for basic users. Analytics services are provided by the target TOREADOR platform; the Big Data campaign is built by composing existing services, based on model transformations.

    Code-based: for advanced users. Analytics algorithms are developed by the users; parallel computations are configured by the users.

    Both are driven by the same declarative model: code once, deploy everywhere.

    Support for batch and stream processing (batch first).

  • The Code-Based Approach: Motivation

    Code Once/Deploy Everywhere

    The TOREADOR user is an expert programmer, aware of the potentialities (flexibility and controllability) and purposes (analytics developed from scratch, or migration of legacy code) of a code-based approach. She expresses the parallel computation of a coded algorithm in terms of parallel primitives. TOREADOR distributes it among computational nodes hosted by different platforms/technologies in multi-platform Big Data and Cloud environments, using state-of-the-art orchestrators.

    [Figure: the three phases of the code-based approach (I. Code, II. Transform, III. Deploy), together with the Procedural Model, the Deployment Model and the Skeleton-Based Code Compiler.]

  • First phase: Code

    Code

    iteration = 0
    while iteration == 0 or previousCentroids != centroids:
        previousCentroids = centroids
        assigns = data_parallel_region(data, kmeans_assign, centroids)
        print("ASSIGN:", assigns)
        clusters = [[] for x in centroids]
        for x in assigns:
            clusters[x[1]].append(x[0])
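    The excerpt stops before the centroids are recomputed; a hypothetical continuation of the loop body (not shown on the slide) that would complete one k-means iteration is:

        # Hypothetical continuation of the loop body (not on the slide):
        # recompute each centroid as the mean of its cluster, keeping the
        # previous centroid when a cluster is empty.
        centroids = [
            [sum(dim) / len(cluster) for dim in zip(*cluster)] if cluster
            else previousCentroids[i]
            for i, cluster in enumerate(clusters)
        ]
        iteration += 1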

    Definition of a set of Parallel Primitives expressing Parallel Computational Patterns in a given programming language.


  • Second Phase: Transform

    Definition of Code Skeletons that realize the computational logic, workflow and interactions characterizing a Parallel Computational Pattern.

    Realization of a Skeleton-Based Code Compiler (a source-to-source transformer) which, incarnating the Skeletons, transforms the sequential code augmented with the parallel primitives into parallel versions, specific to different platforms/technologies, according to a 4-step workflow:

  • Second Phase: Transform

    Skeleton-based Code Compiler: first step

    1. Parallel Pattern Selection

    1. Selection of the Parallel Paradigm (based on the Declarative Model and the utilized primitives).
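    A minimal sketch (the primitive-to-paradigm mapping below is hypothetical) of how this selection could be driven by the primitives actually used in the code and by the declarative model:

    # Hypothetical mapping from used parallel primitives to parallel paradigms.
    PRIMITIVE_TO_PARADIGM = {
        "data_parallel_region": "data-parallel (map-style)",
        "task_parallel_region": "task-parallel",  # hypothetical second primitive
    }

    def select_parallel_paradigm(source_code, declarative_model):
        """Pick candidate paradigms from the primitives found in the user's code."""
        used = [p for p in PRIMITIVE_TO_PARADIGM if p in source_code]
        # The declarative model (e.g. batch vs. stream processing SLO)
        # further constrains the choice; here it is only attached to the result.
        mode = declarative_model["processing"]["mode"]
        return [(PRIMITIVE_TO_PARADIGM[p], mode) for p in used]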


  • Second Phase: Transform

    Skeleton-based Code Compiler: second step

    2. Transformation of the algorithm, coded using the aforementioned primitives, into a parallel agnostic version incarnating predefined code Skeletons. Technology independent (agnostic with respect to the target platforms/technologies).

    2. Incarnation of agnostic Skeletons

    import math
    import random  # needed by kmeans_init (missing in the slide)

    def data_parallel_region(distr, func, *repl):
        return [func(x, *repl) for x in distr]

    def distance(a, b):
        """Computes euclidean distance between two vectors"""
        return math.sqrt(sum([(x[1] - x[0]) ** 2 for x in zip(a, b)]))

    def kmeans_init(data, k):
        """Returns initial centroids configuration"""
        return random.sample(data, k)

    def kmeans_assign(p, centroids):
        """Assigns point p to the nearest centroid; the slide truncates here,
        so the body below is a plausible completion"""
        distances = [distance(p, c) for c in centroids]
        return (p, distances.index(min(distances)))


  • Second Phase: Transform

    Skeleton-based Code Compiler: third step

    3. Specialization of the agnostic Skeletons into multiple versions of the code to be executed, specific to the different target platforms/technologies.

    3. Production of technology-dependent Skeletons

    Amazon: ./elastic-mapreduce --create --stream --input s3n://ngramstest/input_gz --output s3n://ngramstest/output-streaming-fullrun --mapper s3n://ngramstest/code/mapper.py --reducer s3n://ngramstest/code/reducer.py --nu
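    Alongside the Amazon Elastic MapReduce example above, a Spark-specific specialization of the same data_parallel_region skeleton might look like the following sketch (an illustration only, not the actual output of the TOREADOR compiler):

    # Hypothetical PySpark specialization of the agnostic skeleton:
    # the sequential list comprehension becomes a distributed map over an RDD.
    from pyspark import SparkContext

    sc = SparkContext(appName="toreador-kmeans-sketch")

    def data_parallel_region(distr, func, *repl):
        return sc.parallelize(distr).map(lambda x: func(x, *repl)).collect()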