Paris Spark Meetup - Trifacta - 03_04_2017

Data Wrangling on Hadoop with Spark. Paris Spark Meetup, 03/04/17. Victor Coustenoble, Technical regional manager EMEA, [email protected], @vizanalytics

Transcript of Paris Spark Meetup - Trifacta - 03_04_2017

Page 1: Paris Spark Meetup - Trifacta - 03_04_2017

Data Wrangling on Hadoop with Spark

Paris Spark Meetup 03/04/17
Victor Coustenoble
Technical regional manager EMEA
[email protected]
@vizanalytics

Page 2: Paris Spark Meetup - Trifacta - 03_04_2017

DATA WRANGLING

QUESTION ANALYZE INSIGHT

DISCOVER STRUCTURE CLEANSE ENRICH VALIDATE PUBLISH

What is Data Wrangling?

Page 3: Paris Spark Meetup - Trifacta - 03_04_2017


DATA

Page 4: Paris Spark Meetup - Trifacta - 03_04_2017

Company Overview

Background
➔ Headquartered in San Francisco, with offices in Boston, London, Berlin, Paris
➔ 100+ employees
➔ Created in 2012

Focus
➔ 100% focused on Data Wrangling and Data Preparation
➔ Accelerate time to value and business use of Big Data
➔ Visual, interactive and Self-Service Data Preparation

Page 5: Paris Spark Meetup - Trifacta - 03_04_2017


Business System Data Machine Generated Data Third Party Data

Reporting / BI, Data Visualization

LOB IT

Explore Structure Clean Enrich Validate Publish

Distributed Data Platform

Predictive Analytics / Data Science

Machine Data / Enterprise Processes

Applications / processes

Reporting / Data driven decision

Recommendations / Data Mining

Self-service access for business analysts to raw data, operated under IT control

Page 6: Paris Spark Meetup - Trifacta - 03_04_2017

INTERACTIVE & VISUAL

PREDICTIVE & SUGGESTIONS

INTEROPERABLE

Trifacta Key Differentiators

Page 7: Paris Spark Meetup - Trifacta - 03_04_2017

Interoperable: Reduces Total Cost of Ownership

Interoperability with metadata repositories enables discoverability and lineage for compliance & audit

Interoperability with existing security models prevents administering another app

Intelligent Execution* ensures Trifacta is highly performant both now and in the future

*Predictive Interaction for Data Transformation – Heer, Hellerstein & Kandel; Stanford University & University of California, Berkeley (2015)

Page 8: Paris Spark Meetup - Trifacta - 03_04_2017


Execution Architecture

Optimized processing for data not needing parallel processing

Future Technologies

Intelligent Execution

In-memory

Page 9: Paris Spark Meetup - Trifacta - 03_04_2017

Chart: Data Volume (MBs, GBs, TBs, PBs) against Execution Latency (Immediate, Interactive, Batch)

Intelligent Execution Architecture

Automatically selects the right execution engine for the data set being transformed
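The selection idea on this slide can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the deck does not say how the choice is made, so the `Engine` names, the `select_engine` function and the 1 GiB cut-off are hypothetical and are not Trifacta settings.

```python
from enum import Enum


class Engine(Enum):
    """Hypothetical engine labels, loosely following the slide's vocabulary."""
    IN_MEMORY = "in-memory"  # single-node engine for small data, low latency
    SPARK = "spark"          # distributed engine for large data, batch latency


# Assumed cut-off for illustration only; not a documented product setting.
IN_MEMORY_LIMIT_BYTES = 1 * 1024**3  # 1 GiB


def select_engine(input_size_bytes: int,
                  limit: int = IN_MEMORY_LIMIT_BYTES) -> Engine:
    """Pick an engine from the input size alone (a deliberately naive rule)."""
    if input_size_bytes < 0:
        raise ValueError("input size must be non-negative")
    if input_size_bytes <= limit:
        return Engine.IN_MEMORY
    return Engine.SPARK


small_job = select_engine(50 * 1024**2)   # 50 MiB input
large_job = select_engine(5 * 1024**4)    # 5 TiB input
```

A real selector would likely weigh more than size (cluster load, recipe complexity), but a single threshold is enough to show the trade-off between latency and volume that the chart describes.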

Page 10: Paris Spark Meetup - Trifacta - 03_04_2017

TRIFACTA

Trifacta Workflow in Hadoop

Sample Scale Up

Refine Sample

Results

Identify/Register Data

1. Predictive Interaction

2. Consume

Schedulers

Monitor and Adjust

3.

Schedule

Visualization & Analysis

Secure Access: Kerberos, LDAP…

CLI
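The sample, refine and scale-up loop in this workflow can be sketched in a few lines. This is only an illustration under stated assumptions: the recipe is modelled as a plain list of row functions, and `take_sample`, `apply_recipe`, the two steps and the toy rows are all hypothetical names and data, not Trifacta's API.

```python
import random
from typing import Callable, Dict, Iterable, List

Row = Dict[str, str]
Step = Callable[[Row], Row]


def take_sample(rows: List[Row], n: int, seed: int = 0) -> List[Row]:
    """Return at most n rows, drawn reproducibly from the full data set."""
    if len(rows) <= n:
        return list(rows)
    return random.Random(seed).sample(rows, n)


def apply_recipe(rows: Iterable[Row], recipe: List[Step]) -> List[Row]:
    """Run every step of the recipe, in order, on every row."""
    result = []
    for row in rows:
        for step in recipe:
            row = step(row)
        result.append(row)
    return result


# Two toy transformation steps (hypothetical).
def trim_name(row: Row) -> Row:
    return {**row, "name": row["name"].strip()}


def upper_country(row: Row) -> Row:
    return {**row, "country": row["country"].upper()}


recipe = [trim_name, upper_country]

# Toy data standing in for a registered data set.
data = [{"name": f" user{i} ", "country": "fr"} for i in range(1000)]

preview = apply_recipe(take_sample(data, 10), recipe)  # refine on a sample
full = apply_recipe(data, recipe)                      # then scale up
```

The point of the design is that the same recipe object is used for the preview and for the full run, so what the user validates on the sample is what executes at scale.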

Page 11: Paris Spark Meetup - Trifacta - 03_04_2017

How Does Trifacta’s Spark Work?

§ YARN resource manager
§ Cluster deployment mode

Page 12: Paris Spark Meetup - Trifacta - 03_04_2017

How Does Trifacta’s Spark Work?

Trifacta executes its own version of Spark in “Cluster Deployment Mode” using the Hadoop cluster’s YARN resource manager.
§ Trifacta’s Spark job lives in its own YARN container, separate from other Spark jobs running on the same cluster.

Trifacta submits the following to YARN for execution across the cluster:
§ Spark v2.1.0 libraries
§ Trifacta Transformation & Profiling libraries
§ Transformation logic (DAG)
§ Libraries are distributed & cached by YARN after initial load.

Spark job parameters (can be set per user):
§ Executor parameters (memory size, number of vcores).
§ Dynamic allocation (by default), so the number of executors varies with the YARN resources available.
§ Jobs can be assigned to a specific YARN queue.
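For readers who want to see what these settings look like on a plain Spark installation, here is a minimal sketch that assembles a standard `spark-submit` command line. It is not how Trifacta launches its jobs (the deck does not show that); the helper `build_spark_submit`, the jar names and the main class are hypothetical. The flags themselves are standard `spark-submit` options, and on YARN dynamic allocation in Spark 2.x also requires the external shuffle service, which is why that setting is included.

```python
from typing import List, Optional, Sequence


def build_spark_submit(app_jar: str,
                       main_class: str,
                       executor_memory: str = "4g",
                       executor_cores: int = 2,
                       queue: Optional[str] = None,
                       dynamic_allocation: bool = True,
                       extra_jars: Sequence[str] = ()) -> List[str]:
    """Assemble a spark-submit argument list for YARN cluster mode."""
    cmd = [
        "spark-submit",
        "--master", "yarn",           # use the cluster's YARN resource manager
        "--deploy-mode", "cluster",   # driver runs inside a YARN container
        "--class", main_class,
        "--executor-memory", executor_memory,
        "--executor-cores", str(executor_cores),
    ]
    if queue:
        cmd += ["--queue", queue]     # submit to a specific YARN queue
    if dynamic_allocation:
        # Let Spark grow and shrink the number of executors with the load.
        cmd += ["--conf", "spark.dynamicAllocation.enabled=true",
                "--conf", "spark.shuffle.service.enabled=true"]
    if extra_jars:
        # Extra libraries shipped with the job to the cluster.
        cmd += ["--jars", ",".join(extra_jars)]
    cmd.append(app_jar)
    return cmd


# Hypothetical jar, class and queue names, for illustration only.
cmd = build_spark_submit("wrangle-job.jar", "com.example.WrangleJob",
                         queue="analysts",
                         extra_jars=["transform-lib.jar", "profiling-lib.jar"])
print(" ".join(cmd))
```

Building the command as a list rather than a string keeps each argument separate, which avoids quoting problems if it is later passed to `subprocess.run`.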

Page 13: Paris Spark Meetup - Trifacta - 03_04_2017

Trifacta Selected as OEM Partner for Google Cloud Dataprep Service

Trifacta Interface & Photon Engine Integrated within Google Cloud Ecosystem
● Access & publish data from/to Google Cloud Storage & BigQuery
● Compile recipes to Google Cloud Dataflow for fully-managed auto-scaling execution

Google Cloud Dataprep

INPUT: Cloud Storage, BigQuery

Cloud Dataprep, Dataflow

OUTPUT: Cloud Storage, BigQuery

https://cloud.google.com/dataprep/

Page 14: Paris Spark Meetup - Trifacta - 03_04_2017

Storage

3rd Party: Experian, Nielsen, FICO…

IT

LOB

Discovering Structuring Cleaning Enriching Validating Publishing

Ingestion Processing

DATA LAKE

Demonstration : Predict and Avoid Churn – Customer 360

Customer Data

Account Activity

Social Media

CRM: Contact / Status

Voice: Text Data

Tweets, Handles

ANALYSIS & VISUALIZATION

Page 15: Paris Spark Meetup - Trifacta - 03_04_2017

Trifacta: The Global Leader in Data Wrangling

No. 1 by Analysts

#1 End User Data Preparation Vendor (2015)

Leader in Forrester Wave for Data Preparation Tools (2017)

No. 1 by Users

Chart: 0 to 50 000, Oct 2015 to Oct 2017

No. 1 by Customers

No. 1 by Partners

Page 16: Paris Spark Meetup - Trifacta - 03_04_2017

Thank you. Questions?

Download Trifacta Wrangler: trifacta.com/start-wrangling

[email protected] @vizanalytics