HUG Ireland Event - HPCC Presentation Slides

Introduction to HPCC Systems® Powered by LexisNexis Risk Solutions Ignacio Calvo Senior Software Engineer 07/03/2016

Page 1: HUG Ireland Event - HPCC Presentation Slides

Introduction to HPCC Systems® Powered by LexisNexis Risk Solutions

Ignacio Calvo Senior Software Engineer

07/03/2016

Page 2: HUG Ireland Event - HPCC Presentation Slides

1. A brief history of HPCC

2. Architecture with use case

3. Integration

4. Q&A

Page 3: HUG Ireland Event - HPCC Presentation Slides

A brief history of HPCC

Mapflow : Geospatial

LexisNexis : global markets

HPCC : BigData

Page 4: HUG Ireland Event - HPCC Presentation Slides

Case study

Introduction to HPCC Systems 4

THE CHALLENGE

For a given X and Y coordinate, calculate the following within a specified radius:

• Total number of policies

• Total value of policies

Update each record with this information.
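The challenge can be sketched in a few lines of Python. This is a minimal brute-force illustration, not the ECL solution shown at the event. The `Policy` record, its field names, the use of planar Euclidean distance, and the choice to count a record in its own totals are all assumptions; the slide does not specify them.

```python
import math
from dataclasses import dataclass


@dataclass
class Policy:
    # Hypothetical record layout; the slide only mentions X, Y and policy value.
    policy_id: str
    x: float
    y: float
    value: float
    nearby_count: int = 0
    nearby_value: float = 0.0


def annotate_with_radius_totals(policies, radius):
    """Update each policy with the count and total value of policies within `radius`.

    Assumes planar coordinates and includes the policy itself in its own totals.
    """
    for p in policies:
        count = 0
        total = 0.0
        for q in policies:
            # Straight-line distance between the two coordinates.
            if math.hypot(p.x - q.x, p.y - q.y) <= radius:
                count += 1
                total += q.value
        p.nearby_count = count
        p.nearby_value = total
    return policies
```

Every record is compared with every other record, so the work grows with the square of the number of policies. That quadratic growth is what makes the problem a natural fit for a parallel batch engine.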

Page 5: HUG Ireland Event - HPCC Presentation Slides

Data Flow Oriented Big Data Platform

[Diagram labels: Raw data from several sources; Batch requests for scoring and analytics; Batch Processed Data; ESP Middleware Services; Batch Subscribers; Portal]

Thor (data refinery)

• Shared Nothing MPP Architecture

• Commodity Hardware

• Batch ETL and Analytics

ECL

• Easy to use

• Implicitly Parallel

• Compiles to C++

ROXIE (data delivery)

• Shared Nothing MPP Architecture

• Commodity Hardware

• Real-time Indexed Based Query

• Low Latency, Highly Concurrent and Highly Redundant

Page 6: HUG Ireland Event - HPCC Presentation Slides

Thor – The Batch Processing Analytics Engine

[Diagram labels: Raw data from several sources; Thor; ECL; ROXIE; Batch Subscribers; Reporting; Batch reporting requests]

Massively Parallel Extract Transform and Load (ETL) engine

• Built from the ground up as a parallel data environment

Enables data integration on a scale not previously available

• Current LexisNexis person data build process generates 350 billion intermediate results at peak

Suitable for:

• Massive joins/merges

• Massive sorts and transformations

• Any N² problem

“Identify and catalog all the stars in the Milky Way galaxy”
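One common way a shared-nothing system handles a massive join is to hash-partition both inputs on the join key, so that each node can join its own partition without talking to the others. The Python sketch below illustrates that general idea only. It is not Thor's implementation, and the function names and the `num_nodes` parameter are hypothetical.

```python
from collections import defaultdict


def hash_partition(records, key, num_nodes):
    """Route each record to a partition chosen by hashing its join key."""
    partitions = [[] for _ in range(num_nodes)]
    for rec in records:
        partitions[hash(rec[key]) % num_nodes].append(rec)
    return partitions


def local_join(left, right, key):
    """Hash join of two lists of dicts that sit on the same node."""
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def distributed_join(left, right, key, num_nodes=4):
    """Join two datasets by partitioning both on `key` and joining partition by partition."""
    left_parts = hash_partition(left, key, num_nodes)
    right_parts = hash_partition(right, key, num_nodes)
    result = []
    # Each loop iteration stands in for work one node could do independently.
    for lp, rp in zip(left_parts, right_parts):
        result.extend(local_join(lp, rp, key))
    return result
```

Because records with the same key always land in the same partition, no partition needs data held by another, which is the property that lets this kind of join scale out across commodity nodes.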

Page 7: HUG Ireland Event - HPCC Presentation Slides

ROXIE – The Real-Time Analytics Delivery Engine

[Diagram labels: Raw data from several sources; Thor; ECL; ROXIE; Batch Subscribers; Reporting; Batch reporting requests]

A massively parallel, high throughput, structured query response engine

Ultra fast due to its read-only nature

Allows indices to be built onto data for efficient multi-user retrieval of data

Suitable for:

• Volumes of structured queries

• Full text ranked Boolean search

“I want the star Alpha Centauri”
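The contrast with the batch engine can be illustrated with a small read-only index: the data is sorted once at build time and then only ever read, so each query is a fast keyed lookup rather than a scan. This is a conceptual Python sketch, not how ROXIE stores or serves its indices. The `ReadOnlyIndex` class and the sample star records are made up for the example.

```python
import bisect


class ReadOnlyIndex:
    """Sorted, immutable index over a list of dict records."""

    def __init__(self, records, key):
        # Sort once at build time; the rows are never modified afterwards.
        self._rows = sorted(records, key=lambda r: r[key])
        self._keys = [r[key] for r in self._rows]

    def lookup(self, value):
        """Return all rows whose key equals `value`, using binary search."""
        lo = bisect.bisect_left(self._keys, value)
        hi = bisect.bisect_right(self._keys, value)
        return self._rows[lo:hi]


stars = [
    {"name": "Sirius", "constellation": "Canis Major"},
    {"name": "Alpha Centauri", "constellation": "Centaurus"},
    {"name": "Vega", "constellation": "Lyra"},
]
star_index = ReadOnlyIndex(stars, "name")
match = star_index.lookup("Alpha Centauri")
```

Since nothing is written after the build, many readers can share the same index without locking, which is one reason a read-only design suits highly concurrent queries.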

Page 8: HUG Ireland Event - HPCC Presentation Slides

ECL – The Data Flow Oriented Programming Language

[Diagram labels: Raw data from several sources; Thor; ECL; ROXIE; Batch Subscribers; Reporting; Batch reporting requests]

• An easy to use, data-centric programming language optimized for large-scale data management and query processing

• Highly efficient — automatically distributes workload across all nodes.

• Industry analysts: “80% more efficient than C++, Java and SQL — 1/3 reduction in programmer time to maintain/enhance existing applications”

• Benchmark against SQL (5 times more efficient) for code generation

• Automatic parallelization and synchronization of sequential algorithms for parallel and distributed processing. Compiles to C++

• Large library of built-in modules to handle common data manipulation tasks. Can embed / import : C++, Python, JavaScript, R, Java

Declarative programming language … powerful, extensible, implicitly parallel, maintainable, complete and homogeneous
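"Implicitly parallel" means the programmer states what should happen to each record and leaves partitioning and scheduling to the platform. The Python sketch below mimics that programming model with a hypothetical `parallel_project` helper. It is not ECL, and it does not reflect how the ECL compiler distributes work. Python threads are used only to show the shape of the idea, not to demonstrate a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_project(records, transform, num_workers=4):
    """Apply `transform` to every record; the caller never mentions threads or partitions."""
    # Ceiling division, with a floor of 1 so an empty input still works.
    size = max(1, -(-len(records) // num_workers))
    chunks = [records[i:i + size] for i in range(0, len(records), size)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # pool.map yields results in chunk order, so input order is preserved.
        results = pool.map(lambda chunk: [transform(r) for r in chunk], chunks)
        return [row for chunk in results for row in chunk]


# The caller writes only the per-record rule.
doubled = parallel_project([1, 2, 3, 4, 5], lambda amount: amount * 2, num_workers=2)
```

The calling code reads the same whether the work runs on one worker or many, which is the property the slide attributes to ECL.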

Page 9: HUG Ireland Event - HPCC Presentation Slides

Graph viewer

Page 10: HUG Ireland Event - HPCC Presentation Slides

A Robust — and Proven — Platform for IoT

ROXIE

HPCC Systems Platform

Data Collection

Rules Execution

Alert Delivery

Search

BI

• Real-time indexed based search

• Real-time rules execution

• Alert call back

• Real-time store

• Real-time analytics on real-time data

• Long term store

• Batch analytics

Distributed Massively Parallel Architecture

Real-time Services

Thor

Cassandra

Page 11: HUG Ireland Event - HPCC Presentation Slides

Lambda architecture

Page 12: HUG Ireland Event - HPCC Presentation Slides

Lambda architecture

Page 13: HUG Ireland Event - HPCC Presentation Slides

HPCC: Internet of Things Architecture

ROXIE

• REST

• SOAP

• Websocket

• IPv6

• 6LoWPAN

• UDP

• uIP

• DTLS

• MQTT

• CoAP

• ROLL

• XMPP-IoT

• Mihini/M3DA

Thor

Index Updates

• AMQP

• DDS

• LLAP

• LWM2M

• SSI

• IOTDB

• SensorML

• IPSO

• Telehash

• TSMP

• NanoIP

• ONS 2.0

Adapter

[Figure: sample dashboard charts; axis labels AMT and DATE, status levels Good / Fair / Danger]

Page 14: HUG Ireland Event - HPCC Presentation Slides

HPCC Systems Technology: Big Data Is Our Core Competency

SPEED

• Scales to extreme workloads quickly and easily

• Increases speed of development, leading to faster production/delivery

• Improves developer productivity

Page 15: HUG Ireland Event - HPCC Presentation Slides

HPCC Systems Technology: Big Data Is Our Core Competency

SPEED

• Scales to extreme workloads quickly and easily

• Increases speed of development, leading to faster production/delivery

• Improves developer productivity

CAPACITY

• Enables massive joins, merges, transformations, sorts, or tough N² problems

• Increases business responsiveness

• Accelerates creation of new services via rapid prototyping capabilities

• Offers a platform for collaboration and innovation leading to better results

Page 16: HUG Ireland Event - HPCC Presentation Slides

HPCC Systems Technology: Big Data Is Our Core Competency

SPEED

• Scales to extreme workloads quickly and easily

• Increases speed of development, leading to faster production/delivery

• Improves developer productivity

CAPACITY

• Enables massive joins, merges, transformations, sorts, or tough N² problems

• Increases business responsiveness

• Accelerates creation of new services via rapid prototyping capabilities

• Offers a platform for collaboration and innovation leading to better results

COST SAVINGS

• Leverages commodity hardware so fewer people can do much more in less time

• Uses IT resources efficiently via sharing and higher system utilization

• Open source since 2011

Page 17: HUG Ireland Event - HPCC Presentation Slides

Our Solutions Are Powered by HPCC at Their Core

• Grid computing

• Data-centric language (ECL)

• Integrated delivery system that offers data plus analytics

Big Data

Structured Records

Unstructured Records

News Articles

Proprietary Data

Public Records

[Diagram columns: Unstructured and Structured Content; High Performance Computing Cluster Platform (HPCC); Analysis Applications; Key Capabilities]

• Over 4 petabytes of content

• 50 billion records

• 20,000 sources

• 8.9 billion unique name and address combinations

• Multi-bureau/multi-source models and bureau roll-over support

• Extensive experience leveraging atomic level data, combining and leveraging disparate data

• Approximately 400 models deployed (custom and flagship)

• Data and analytics

• Identity verification and authentication

• Fraud detection and prevention

• Investigation

• Screening

• Receivables management

Fusion

Linking

Refinery

Financial Services

Government

Health Care

Insurance

Legal

Retail

Open Source Components

Complex Analysis

Clustering Analysis

Link Analysis

Entity Resolution

Page 18: HUG Ireland Event - HPCC Presentation Slides

Example : Understanding People Relations Helps Us Predict Risk

8.9 B unique name/address combos

4 B property records

37 M unique businesses

417 M criminal records

269 M auto and home claim records

188.5 M unique cell phones

16.5 B consumer records

3.7 B motor vehicle registrations

SSN xxx-xx-xxxxx

321 High St. Chicago, IL 60540

2000 – 2013

Mobile Phone 630.555.9876

Boat License #414567

K.R. Jones

Kathy Jones

Kathy R. Jones

Kathy Schroeder

Car VIN #RGSWA04A87B1xxxxx

123 Avenue San Francisco, CA 94107

2013 – Present

Lived at …

Owns …

Aliases …

Personal info …

Involved in …

DUI Case #4859xxx-xxx

Felony Indictment Chicago C#0404-xxx

Bankruptcy September 12, 2013

Filed for … Loan Application January 30, 2015

Four Petabytes of Information :

• 50 billion records

• 20,000 sources

• Several million records added daily

Page 19: HUG Ireland Event - HPCC Presentation Slides

Example : Understanding People Relations Helps Us Predict Risk

8.9 B unique name/address combos

4 B property records

37 M unique businesses

417 M criminal records

269 M auto and home claim records

188.5 M unique cell phones

16.5 B consumer records

3.7 B motor vehicle registrations

• Collect largest, broadest, deepest, most accurate, up-to-date repository of public record and contributory data

• Clean and standardize the data

• Identify unique entities using sophisticated learning techniques

• Create the social relationships

SSN xxx-xx-xxxxx

321 High St. Chicago, IL 60540

2000 – 2013

Mobile Phone 630.555.9876

Boat License #414567

K.R. Jones

Kathy Jones

Kathy R. Jones

Kathy Schroeder

Car VIN #RGSWA04A87B1xxxxx

123 Avenue San Francisco, CA 94107

2013 – Present

Lived at …

Owns …

Aliases …

Personal info …

Involved in …

DUI Case #4859xxx-xxx

Felony Indictment Chicago C#0404-xxx

Bankruptcy September 12, 2013

Filed for … Loan Application January 30, 2015

Page 20: HUG Ireland Event - HPCC Presentation Slides

Performance

Intel Xeon / 16 cores

Rows         qsort      New merge sort
33M rows     11.464s    1.433s
503M rows    29.9s      24.2s

Power 8 / 160 execution threads

Rows         qsort      New merge sort
33M rows     26.5s      4.0s
503M rows    120.0s     18.0s
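The slide does not describe how the new merge sort works. As a rough illustration of why a merge sort can make use of many cores or threads, the Python sketch below sorts independent chunks concurrently and then merges the sorted runs. The `chunked_merge_sort` name and `num_chunks` parameter are hypothetical, and this is not the HPCC implementation.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def chunked_merge_sort(rows, num_chunks=4):
    """Sort `rows` by sorting independent chunks, then merging the sorted runs."""
    if not rows:
        return []
    size = -(-len(rows) // num_chunks)  # ceiling division
    chunks = [rows[i:i + size] for i in range(0, len(rows), size)]
    with ThreadPoolExecutor(max_workers=num_chunks) as pool:
        # Each chunk can be sorted without looking at any other chunk.
        sorted_chunks = list(pool.map(sorted, chunks))
    # A k-way merge combines the sorted runs into one sorted list.
    return list(heapq.merge(*sorted_chunks))
```

The chunk-sorting phase has no shared state, so it can spread across however many execution threads the hardware offers. The final merge is the step that brings the runs back together.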

Page 21: HUG Ireland Event - HPCC Presentation Slides

Integration

• Embed / import : C++, Python, JavaScript, R, Java

• HDFS to HPCC Connector

• Amazon Web Services (AWS)

• JDBC Driver

Page 22: HUG Ireland Event - HPCC Presentation Slides

Integration : JDBC Driver

Page 23: HUG Ireland Event - HPCC Presentation Slides

Why HPCC? • Efficient MPP + sub-second queries

• Consistent support, all in one platform

• Scales out to thousands of nodes

• Great learning curve

• Fast development

• Open source since 2011 : Apache 2.0

• Reliable, mature : 10+ years in production

Page 24: HUG Ireland Event - HPCC Presentation Slides

Next steps • Virtual Machine image

• Online training : vouchers available

• Documentation

• Forum : online community

• External testimonies and use cases

• Meetups

Page 25: HUG Ireland Event - HPCC Presentation Slides

Useful Links

• HPCC Meetups : http://www.meetup.com/HPCC-Dublin-Big-Data

• HPCC Systems: https://hpccsystems.com/

• Community forums: https://hpccsystems.com/bb

• The HPCC Systems blog: https://hpccsystems.com/resources/blog

• Online training: learn.lexisnexis.com/hpcc

• Summit: https://hpccsystems.com/community/events/2015-hpcc-systems-engineering-summit-community-day

• HPCC on YouTube: https://www.youtube.com/user/HPCCSystems/videos

• GitHub: https://github.com/hpcc-systems

• Lambda architecture : http://cdn.hpccsystems.com/whitepapers/Lambda.pdf

• Performance : https://hpccsystems.com/resources/blog/lchapman/look-whats-coming-soon-hpcc-systems-600-beta-2

• JDBC Driver : https://hpccsystems.com/download/third-party-integrations/hpcc-jdbc-driver

• HDFS to HPCC Connector : http://cdn.hpccsystems.com/install/h2h/1.4.4-1/docs/HDFS_to_HPCC_Connector-1.4.4-1.pdf

• HPCC on AWS : https://aws.hpccsystems.com/aws/getting_started/

Page 26: HUG Ireland Event - HPCC Presentation Slides

hpccsystems.com