Paris Spark Meetup - Trifacta - 03_04_2017

  • date post

    16-Apr-2017
  • Category

    Technology

  • 1

    Data Wrangling on Hadoop with Spark

    Paris Spark Meetup, 03/04/17

    Victor Coustenoble, Technical Regional Manager EMEA

    victor@trifacta.com, @vizanalytics

  • DATA WRANGLING

    2

    QUESTION ANALYZE INSIGHT

    DISCOVER STRUCTURE CLEANSE ENRICH VALIDATE PUBLISH

    What is Data Wrangling?
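    To make the stages named on this slide concrete, here is a minimal sketch in plain Python, not Trifacta's implementation. The sample data, the column names, the `REGION_BY_COUNTRY` lookup and all function names are hypothetical and chosen only for illustration.

```python
import csv
import io

# Hypothetical raw input: stray whitespace, mixed casing, one missing value.
RAW = """customer_id, country ,monthly_spend
1, fr ,120.50
2,FR,
3, de,80
"""

# Hypothetical lookup table used by the enrich step.
REGION_BY_COUNTRY = {"FR": "EMEA", "DE": "EMEA"}


def structure(text):
    """Structure: parse raw text into dicts keyed by trimmed column names."""
    reader = csv.reader(io.StringIO(text))
    header = [name.strip() for name in next(reader)]
    return [dict(zip(header, row)) for row in reader if row]


def cleanse(rows):
    """Cleanse: trim whitespace, normalize country codes, parse numbers."""
    cleaned = []
    for row in rows:
        spend = row["monthly_spend"].strip()
        cleaned.append({
            "customer_id": int(row["customer_id"].strip()),
            "country": row["country"].strip().upper(),
            "monthly_spend": float(spend) if spend else None,
        })
    return cleaned


def enrich(rows):
    """Enrich: add a region column from the lookup table."""
    return [dict(row, region=REGION_BY_COUNTRY.get(row["country"])) for row in rows]


def validate(rows):
    """Validate: keep rows that have both a known region and a spend value."""
    return [r for r in rows
            if r["region"] is not None and r["monthly_spend"] is not None]


def wrangle(text):
    """Chain the stages in the order listed on the slide."""
    return validate(enrich(cleanse(structure(text))))
```

    Calling `wrangle(RAW)` keeps customers 1 and 3 and drops customer 2, whose spend is missing. The publish stage is left out because it depends on the target system.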

  • 3

    DATA

  • Company Overview

    Background: headquartered in San Francisco, with offices in Boston, London, Berlin and Paris; 100+ employees; created in 2012

    Focus: 100% focused on data wrangling and data preparation; accelerate time to value and business use of Big Data; visual, interactive and self-service data preparation

    4

  • 5

    Business System Data Machine Generated Data Third Party Data

    Reporting / BI

    Data Visualization

    LOB IT

    Explore Structure Clean Enrich Validate Publish

    Distributed Data Platform

    Predictive Analytics / Data Science

    Machine Data / Enterprise Processes

    Applications / processes

    Reporting / Data driven decision

    Recommendations / Data Mining

    Self-service access for business analysts to raw data, operated under IT control

  • 6

    INTERACTIVE & VISUAL

    PREDICTIVE & SUGGESTIONS

    INTEROPERABLE

    Trifacta Key Differentiators

  • Interoperable: Reduces Total Cost of Ownership

    7

    Interoperability with metadata repositories enables discoverability and lineage for compliance & audit

    Interoperability with existing security models prevents administering another app

    *Predictive Interaction for Data Transformation Heer, Hellerstein & Kandel; Stanford University & University of California, Berkeley (2015)

    Intelligent Execution* ensures Trifacta is highly performant both now and in the future

  • 8

    Execution Architecture

    Optimized processing for data not needing parallel processing

    Future Technologies

    Intelligent Execution

    In-memory

  • [Chart: execution latency (Immediate, Interactive, Batch) versus data volume (MBs, GBs, TBs, PBs)]

    Intelligent Execution Architecture

    Automatically selects the right execution engine for the data set being transformed
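    The idea of picking an engine from the size of the data set can be sketched as follows. This is only an illustration of the principle, not Trifacta's selection logic: the two engine names, the single size cutoff and its 1 GiB value are assumptions made for this example.

```python
from enum import Enum

GIB = 1024 ** 3

# Hypothetical cutoff for illustration; not a documented Trifacta value.
SINGLE_NODE_LIMIT_BYTES = 1 * GIB


class Engine(Enum):
    IN_MEMORY = "in-memory"  # single node, for data that needs no parallelism
    SPARK = "spark"          # distributed execution on the cluster


def select_engine(dataset_bytes, limit=SINGLE_NODE_LIMIT_BYTES):
    """Return the engine to use for a data set of the given size in bytes."""
    if dataset_bytes < 0:
        raise ValueError("dataset_bytes must be non-negative")
    # Small inputs stay on one node; larger inputs go to the cluster.
    return Engine.IN_MEMORY if dataset_bytes <= limit else Engine.SPARK
```

    With this cutoff, a 50 MiB file would be routed to the in-memory engine and a multi-terabyte data set to Spark. A real selector would probably weigh more than size, for example cluster load or the transformations in the recipe.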

  • TRIFACTA

    Trifacta Workflow in Hadoop

    Sample Scale Up

    Refine Sample

    Results

    Identify/Register Data

    1. Predictive Interaction

    2. Consume

    Schedulers

    Monitor and Adjust

    3. Schedule

    Visualization & Analysis

    Secure Access: Kerberos, LDAP

    CLI

  • How Does Trifacta's Spark Work?

    YARN resource manager, cluster deployment mode

  • 12

    Trifacta executes its own version of Spark in cluster deployment mode, using the Hadoop cluster's YARN resource manager. Trifacta's Spark job lives in its own YARN container, separate from other Spark jobs running on the same cluster.

    Trifacta submits the following to YARN for execution across the cluster:

      • Spark v2.1.0 libraries

      • Trifacta Transformation & Profiling libraries

      • Transformation logic (DAG)

    Libraries are distributed & cached by YARN after the initial load.

    Spark job parameters (can be set per user):

      • Executor parameters (memory size, number of vcores)

      • Dynamic allocation (enabled by default), so the number of executors varies with the YARN resources available

      • Jobs can be assigned to a specific YARN queue

    How Does Trifacta's Spark Work?
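    The settings listed on this slide map onto standard `spark-submit` options. The sketch below builds such a command line in Python. It shows how a generic Spark application would be submitted to YARN in cluster mode, not how Trifacta launches its jobs internally; the function name, the default values and the jar name `wrangle-job.jar` are hypothetical.

```python
def build_spark_submit(app_jar, queue="default", executor_memory="4g",
                       executor_cores=2, dynamic_allocation=True):
    """Build a spark-submit argument list for YARN cluster deployment mode."""
    args = [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",          # driver runs inside a YARN container
        "--queue", queue,                    # target YARN queue
        "--executor-memory", executor_memory,
        "--executor-cores", str(executor_cores),
        "--conf",
        "spark.dynamicAllocation.enabled="
        + ("true" if dynamic_allocation else "false"),
    ]
    if dynamic_allocation:
        # On YARN, dynamic allocation relies on the external shuffle service.
        args += ["--conf", "spark.shuffle.service.enabled=true"]
    args.append(app_jar)
    return args


if __name__ == "__main__":
    # Hypothetical jar and queue names, printed as a shell-style command.
    print(" ".join(build_spark_submit("wrangle-job.jar", queue="analytics")))
```

    Returning a list rather than a single string makes it straightforward to hand the command to `subprocess.run` without shell quoting issues.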

  • Trifacta Selected as OEM Partner for Google Cloud Dataprep Service

    Trifacta Interface & Photon Engine integrated within the Google Cloud ecosystem

    Access & publish data from/to Google Cloud Storage & BigQuery

    Compile recipes to Google Cloud Dataflow for fully-managed, auto-scaling execution

    Google Cloud Dataprep

    INPUT: Cloud Storage, BigQuery

    Cloud Dataprep (execution on Dataflow)

    OUTPUT: Cloud Storage, BigQuery

    https://cloud.google.com/dataprep/

  • Storage

    3rd Party: Experian, Nielsen, FICO

    IT

    LOB

    Discovering Structuring Cleaning Enriching Validating Publishing

    Ingestion Processing

    DATA LAKE

    Demonstration: Predict and Avoid Churn (Customer 360)

    Customer Data

    Account Activity

    Social Media

    CRM: Contact / Status

    Voice, Text Data

    Tweets, Handles

    ANALYSIS & VISUALIZATION

  • Trifacta: The Global Leader in Data Wrangling

    No. 1 by Analysts

    #1 End User Data Preparation Vendor

    2015

    Leader in Forrester Wave for Data Preparation Tools

    2017

    No. 1 by Users [chart axis: 0 to 50 000]

    No. 1 by Customers

    No. 1 by Partners

    2016

    [Chart time axis: Oct 2015, Oct 2016, Oct 2017]

    2017

  • Thank you. Questions?

    Download Trifacta Wrangler: trifacta.com/start-wrangling

    victor@trifacta.com, @vizanalytics