Hadoop User Group (HUG) Ireland - Eddie Baggot Presentation April 2016
HUG Ireland Event - HPCC Presentation Slides
Uploaded by john-mulhall
Category: Technology
Transcript of HUG Ireland Event - HPCC Presentation Slides
Introduction to HPCC Systems® Powered by LexisNexis Risk Solutions
Ignacio Calvo Senior Software Engineer
07/03/2016
1. A brief history of HPCC
2. Architecture with use case
3. Integration
4. Q&A
Mapflow : Geospatial
LexisNexis : global markets
HPCC : BigData
A brief history of HPCC
Case study
Introduction to HPCC Systems 4
For a given X and Y coordinate, calculate within a specified radius the following :
• Total number of policies
• Total value of policies
Update each record with this information
THE CHALLENGE
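The challenge above can be sketched in a few lines. This is a minimal, naive illustration only, not the approach taken in the presentation: the `Policy` fields, the function name and the sample coordinates are all hypothetical, distance is plain Euclidean, and each record counts itself as being within its own radius. The nested loop compares every record with every other record, which is what makes the problem quadratic in the number of policies.

```python
import math
from dataclasses import dataclass


@dataclass
class Policy:
    # Hypothetical record layout, chosen for illustration only.
    policy_id: str
    x: float
    y: float
    value: float
    nearby_count: int = 0
    nearby_value: float = 0.0


def annotate_within_radius(policies, radius):
    """Update each policy with the count and total value of policies within radius."""
    for p in policies:
        count = 0
        total = 0.0
        for q in policies:  # compare against every record, including p itself
            if math.hypot(p.x - q.x, p.y - q.y) <= radius:
                count += 1
                total += q.value
        p.nearby_count = count
        p.nearby_value = total
    return policies


policies = [
    Policy("A", 0.0, 0.0, 100.0),
    Policy("B", 3.0, 4.0, 250.0),   # exactly 5 units from A
    Policy("C", 10.0, 0.0, 400.0),  # 10 units from A
]
annotate_within_radius(policies, radius=5.0)
```

A production version would normally replace the inner loop with a spatial index or grid so that each record is only compared with its neighbours.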
Data Flow Oriented Big Data Platform
[Diagram labels: ESP Middleware Services; Raw data from several sources; Batch Subscribers Portal]
Thor (data refinery)
• Shared-nothing MPP architecture
• Commodity hardware
• Batch ETL and analytics
ECL
• Easy to use
• Implicitly parallel
• Compiles to C++
ROXIE (data delivery)
• Shared-nothing MPP architecture
• Commodity hardware
• Real-time index-based query
• Low latency, highly concurrent and highly redundant
[Diagram labels: Batch requests for scoring and analytics; Batch processed data]
Thor – The Batch Processing Analytics Engine
[Diagram labels: Raw data from several sources; Thor; ECL; ROXIE; Reporting; Batch Subscribers; Batch reporting requests]
Massively parallel Extract, Transform and Load (ETL) engine
• Built from the ground up as a parallel data environment
Enables data integration on a scale not previously available
• Current LexisNexis person data build process generates 350 billion intermediate results at peak
Suitable for:
• Massive joins/merges
• Massive sorts and transformations
• Any N² problem
“Identify and catalog all the stars in the Milky Way galaxy”
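To make the "shared nothing" and "massive joins" points concrete, here is a small sketch of a hash-partitioned join. It is an illustration of the general technique only and does not describe how Thor is implemented. All function names and sample records are hypothetical, and the "nodes" are just Python lists processed one after another.

```python
from collections import defaultdict


def partition(records, key, num_nodes):
    """Assign each record to a node by hashing its join key."""
    nodes = [[] for _ in range(num_nodes)]
    for rec in records:
        nodes[hash(rec[key]) % num_nodes].append(rec)
    return nodes


def local_join(left, right, key):
    """Hash join over one node's partition, using only that node's data."""
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def distributed_join(left, right, key, num_nodes=4):
    """Partition both inputs the same way, then join each partition independently."""
    left_parts = partition(left, key, num_nodes)
    right_parts = partition(right, key, num_nodes)
    joined = []
    for lp, rp in zip(left_parts, right_parts):
        joined.extend(local_join(lp, rp, key))
    return joined


people = [{"id": 1, "name": "Person One"}, {"id": 2, "name": "Person Two"}]
policies = [
    {"id": 1, "policy": "P-100"},
    {"id": 1, "policy": "P-101"},
    {"id": 3, "policy": "P-300"},
]
joined = distributed_join(people, policies, key="id")
```

Because matching keys always hash to the same partition, no partition needs to see another partition's rows, which is the property that lets the work spread across independent machines.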
ROXIE – The Real-Time Analytics Delivery Engine
[Diagram labels: Raw data from several sources; Thor; ECL; ROXIE; Reporting; Batch Subscribers; Batch reporting requests]
A massively parallel, high throughput, structured query response engine
Ultra fast due to its read-only nature
Allows indices to be built onto data for efficient multi-user retrieval of data
Suitable for:
• Volumes of structured queries
• Full text ranked Boolean search
“I want the star Alpha Centauri”
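The idea of building indices over read-only data for fast retrieval can be sketched as follows. This is a toy, in-memory illustration under stated assumptions, not ROXIE's actual index format: `build_index` and the small star catalogue are hypothetical, and the index is wrapped in a read-only view to mirror the read-only nature mentioned above.

```python
from types import MappingProxyType


def build_index(records, field):
    """Build a lookup from a field value to the records carrying that value."""
    index = {}
    for rec in records:
        index.setdefault(rec[field], []).append(rec)
    # Expose a read-only view: queries can look keys up but cannot modify the index.
    return MappingProxyType(index)


stars = [
    {"name": "Alpha Centauri", "constellation": "Centaurus"},
    {"name": "Sirius", "constellation": "Canis Major"},
    {"name": "Hadar", "constellation": "Centaurus"},
]

by_name = build_index(stars, "name")
by_constellation = build_index(stars, "constellation")

# "I want the star Alpha Centauri": a single keyed lookup, with no scan of the catalogue.
alpha = by_name["Alpha Centauri"][0]
```

The cost of building the index is paid once, up front; every later query is a direct lookup, and several indices can be kept over the same data.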
ECL – The Data Flow Oriented Programming Language
[Diagram labels: Raw data from several sources; Thor; ECL; ROXIE; Reporting; Batch Subscribers; Batch reporting requests]
• An easy to use, data-centric programming language optimized for large-scale data management and query processing
• Highly efficient — automatically distributes workload across all nodes.
• Industry analysts: “80% more efficient than C++, Java and SQL — 1/3 reduction in programmer time to maintain/enhance existing applications”
• Benchmark against SQL (5 times more efficient) for code generation
• Automatic parallelization and synchronization of sequential algorithms for parallel and distributed processing. Compiles to C++
• Large library of built-in modules to handle common data manipulation tasks. Can embed / import : C++, Python, JavaScript, R, Java
Declarative programming language … powerful, extensible, implicitly parallel, maintainable, complete and homogeneous
Graph viewer
A Robust — and Proven — Platform for IoT
ROXIE
HPCC Systems Platform
Data Collection
Rules Execution
Alert Delivery
Search
BI
• Real-time indexed based search
• Real-time rules execution
• Alert call back
• Real-time store
• Real-time analytics on real-time data
• Long term store
• Batch analytics
Distributed Massively Parallel Architecture
Real-time Services
Thor
Cassandra
Lambda architecture
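As a rough sketch of the lambda pattern named here: events are appended to a long-term store, a batch job periodically recomputes a view from that store, and a real-time view covers whatever has arrived since the last batch run. The class and method names below are hypothetical, everything is held in memory, and the mapping to specific HPCC components is deliberately left out.

```python
from collections import Counter


class LambdaStore:
    """Toy lambda architecture: batch view plus real-time view, merged at query time."""

    def __init__(self):
        self.master = []                # long-term store: append-only event log
        self.batch_view = Counter()     # recomputed from the full log by run_batch()
        self.realtime_view = Counter()  # covers events since the last batch run

    def ingest(self, sensor_id, reading):
        self.master.append((sensor_id, reading))
        self.realtime_view[sensor_id] += reading

    def run_batch(self):
        """Rebuild the batch view from all stored events, then reset the real-time view."""
        view = Counter()
        for sensor_id, reading in self.master:
            view[sensor_id] += reading
        self.batch_view = view
        self.realtime_view = Counter()

    def query(self, sensor_id):
        """Total reading for a sensor, combining both views."""
        return self.batch_view[sensor_id] + self.realtime_view[sensor_id]


store = LambdaStore()
store.ingest("sensor-a", 1)
store.ingest("sensor-a", 2)
total_before_batch = store.query("sensor-a")
store.run_batch()
store.ingest("sensor-a", 4)
total_after_batch = store.query("sensor-a")
```

Queries stay correct whether or not the batch job has caught up, because the two views never cover the same events.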
HPCC: Internet of Things Architecture
ROXIE
• REST
• SOAP
• Websocket
• IPv6
• 6LoWPAN
• UDP
• uIP
• DTLS
• MQTT
• CoAP
• ROLL
• XMPP-IoT
• Mihini/M3DA
Thor
Index Updates
• AMQP
• DDS
• LLAP
• LWM2M
• SSI
• IOTDB
• SensorML
• IPSO
• Telehash
• TSMP
• NanoIP
• ONS 2.0
Adapter
[Figure: sample dashboard charts (AMT by DATE, Jan to Dec; category shares of 12.5% each; Good / Fair / Danger indicator)]
HPCC Systems Technology: Big Data Is Our Core Competency
SPEED
• Scales to extreme workloads quickly and easily
• Increased speed of development leads to faster production/delivery
• Improves developer productivity
CAPACITY
• Enables massive joins, merges, transformations, sorts, or tough N² problems
• Increases business responsiveness
• Accelerates creation of new services via rapid prototyping capabilities
• Offers a platform for collaboration and innovation leading to better results
COST SAVINGS
• Leverages commodity hardware so fewer people can do much more in less time
• Uses IT resources efficiently via sharing and higher system utilization
• Open source since 2011
• Grid computing
• Data-centric language (ECL)
• Integrated delivery system that offers data plus analytics
Our Solutions Are Powered by HPCC at Their Core
Big Data
Structured Records
Unstructured Records
News Articles
Proprietary Data
Public Records
Unstructured and Structured Content; High Performance Computing Cluster Platform (HPCC); Analysis Applications; Key Capabilities
• Over 4 petabytes of content
• 50 billion records
• 20,000 sources
• 8.9 billion unique name and address combinations
• Multi-bureau/multi-source models and bureau roll-over support
• Extensive experience leveraging atomic level data, combining and leveraging disparate data
• Approximately 400 models deployed (custom and flagship)
• Data and analytics
• Identity verification and authentication
• Fraud detection and prevention
• Investigation
• Screening
• Receivables management
Fusion
Linking
Refinery
Financial Services
Government
Health Care
Insurance
Legal
Retail
Open Source Components
Complex Analysis
Clustering Analysis
Link Analysis
Entity Resolution
Example : Understanding People Relations Helps Us Predict Risk
8.9 B unique name/address combos
4 B property records
37 M unique businesses
417 M criminal records
269 M auto and home claim records
188.5 M unique cell phones
16.5 B consumer records
3.7 B motor vehicle registrations
Personal info …
• SSN xxx-xx-xxxxx
• Mobile Phone 630.555.9876
Aliases …
• K.R. Jones
• Kathy Jones
• Kathy R. Jones
• Kathy Schroeder
Lived at …
• 321 High St. Chicago, IL 60540 (2000 – 2013)
• 123 Avenue San Francisco, CA 94107 (2013 – Present)
Owns …
• Boat License #414567
• Car VIN #RGSWA04A87B1xxxxx
Involved in …
• DUI Case #4859xxx-xxx
• Felony Indictment Chicago C#0404-xxx
Filed for …
• Bankruptcy September 12, 2013
• Loan Application January 30, 2015
Four Petabytes of Information :
• 50 billion records
• 20,000 sources
• Several million records added daily
• Collect largest, broadest, deepest, most accurate, up-to-date repository of public record and contributory data
• Clean and standardize the data
• Identify unique entities using sophisticated learning techniques
• Create the social relationships
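The "identify unique entities" step can be illustrated with a deliberately simple stand-in. The slide refers to sophisticated learning techniques; the sketch below swaps in plain exact-match linking with a union-find structure, so it shows only the shape of the problem. The function name, the field names, and the assumption that these records share a phone number or boat licence are all hypothetical.

```python
def link_records(records, keys):
    """Group records that share any identifier value, directly or transitively."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent pointers to the cluster root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    first_seen = {}  # (field, value) -> index of the first record carrying it
    for idx, rec in enumerate(records):
        for key in keys:
            value = rec.get(key)
            if value is None:
                continue
            token = (key, value)
            if token in first_seen:
                union(first_seen[token], idx)
            else:
                first_seen[token] = idx

    clusters = {}
    for idx, rec in enumerate(records):
        clusters.setdefault(find(idx), []).append(rec["name"])
    return sorted(clusters.values())


# Hypothetical input: which record carries which identifier is assumed for illustration.
records = [
    {"name": "Kathy Jones", "phone": "630.555.9876"},
    {"name": "K.R. Jones", "phone": "630.555.9876", "boat_license": "414567"},
    {"name": "Kathy Schroeder", "boat_license": "414567"},
    {"name": "Unrelated Person", "phone": "000.000.0000"},
]
entities = link_records(records, keys=["phone", "boat_license"])
```

Here the first and third records never share a field directly; they end up in the same entity because each is linked to the second record, which is the transitive behaviour that linking relies on.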
Performance
Intel Xeon / 16 cores
  Rows    qsort      New merge sort
  33M     11.464s    1.433s
  503M    29.9s      24.2s
Power 8 / 160 execution threads
  Rows    qsort      New merge sort
  33M     26.5s      4.0s
  503M    120.0s     18.0s
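The timings above are easier to compare as speed-up ratios (qsort time divided by new merge sort time). The short calculation below uses only the figures from the slide; the dictionary layout and variable names are just one convenient way to hold them.

```python
# (platform, data size) -> (qsort seconds, new merge sort seconds), as given on the slide.
timings = {
    ("Intel Xeon / 16 cores", "33M rows"): (11.464, 1.433),
    ("Intel Xeon / 16 cores", "503M rows"): (29.9, 24.2),
    ("Power 8 / 160 execution threads", "33M rows"): (26.5, 4.0),
    ("Power 8 / 160 execution threads", "503M rows"): (120.0, 18.0),
}

# Speed-up = old time / new time, rounded to one decimal place.
speedups = {
    case: round(qsort_s / merge_s, 1)
    for case, (qsort_s, merge_s) in timings.items()
}

for (platform, size), ratio in speedups.items():
    print(f"{platform}, {size}: {ratio}x faster")
```

On these figures the new merge sort is faster in all four cases, with the smallest gain on the 503M-row Intel Xeon run.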
Integration
• Embed / import : C++, Python, JavaScript, R, Java
• HDFS to HPCC Connector
• Amazon Web Services (AWS)
• JDBC Driver
Integration : JDBC Driver
Why HPCC?
• Efficient MPP + sub-second queries
• Consistent support, all in one platform
• Scales out to thousands of nodes
• Great learning curve
• Fast development
• Open source since 2011 : Apache 2.0
• Reliable, mature : 10+ years in production
Next steps
• Virtual Machine image
• Online training : vouchers available
• Documentation
• Forum : online community
• External testimonies and use cases
• Meetups
Useful Links
• HPCC Meetups : http://www.meetup.com/HPCC-Dublin-Big-Data
• HPCC Systems: https://hpccsystems.com/
• Community forums: https://hpccsystems.com/bb
• The HPCC Systems blog: https://hpccsystems.com/resources/blog
• Online training: learn.lexisnexis.com/hpcc
• Summit: https://hpccsystems.com/community/events/2015-hpcc-systems-engineering-summit-community-day
• HPCC on YouTube: https://www.youtube.com/user/HPCCSystems/videos
• GitHub: https://github.com/hpcc-systems
• Lambda architecture : http://cdn.hpccsystems.com/whitepapers/Lambda.pdf
• Performance : https://hpccsystems.com/resources/blog/lchapman/look-whats-coming-soon-hpcc-systems-600-beta-2
• JDBC Driver : https://hpccsystems.com/download/third-party-integrations/hpcc-jdbc-driver
• HDFS to HPCC Connector : http://cdn.hpccsystems.com/install/h2h/1.4.4-1/docs/HDFS_to_HPCC_Connector-1.4.4-1.pdf
• HPCC on AWS : https://aws.hpccsystems.com/aws/getting_started/
HPCC Systems - Online Resources 25
hpccsystems.com