Transcript of "Real Time Big Data Applications for Investment Banks & Financial Institutions" by Dev Lakhani, Data Scientist at Batch Insights

Real Time Big Data Applications for Investment Banks & Financial Institutions

Dev Lakhani

• 15 years of software architecture & development experience
• 7 years of Big Data experience
• Big Data architectures for banks, telecom, retail, media

• Deutsche Telekom
• ASOS
• Tier 1 investment banks in Canary Wharf
• Dentsu Aegis

• Contributor to Hadoop, Spark, Tachyon, HBase, Ignite
• uk.linkedin.com/in/devlakhani

• Overview of Big Data in financial institutions

• Architectural constraints in investment banking

• Implementation challenges

• Data model

• Future for financial applications

Introduction

• This talk has a technical focus

• This presentation is not representative of any client

• "Real time" redefined for Big Data

• Vendor neutral talk

Disclaimers

Real Time Definition (Modified)

[as modifier] Computing: Relating to a system in which input data is processed within a guaranteed response time, using up-to-date (latest version) information, and available on demand as feedback to the process from which it is coming.

Problem Domain

Big Data Drivers for Investment Banking & Financial Institutions
• Capturing billions of trades
• Quantifying risk and exposure
• Regulatory requirements
• Response to news and events
• Detecting fraud, rogue trading and anomalies
• Performing simulations & algorithmic trading
• Business analysis - PnL (profit and loss)
• Capital reserves and forecasting

Why Use Big Data?

[Architecture diagram: trade and reference-data feeds flowing into a load-balanced, cached service layer]

High Level Architecture

• Disaster avoidance (not recovery) through replication and redundancy
• High availability
• "Chinese Wall" policy and segmentation of information
  • Within the bank
  • External to the bank
• Security & role-based segmentation
• Responsiveness and throughput
• API or service-based architecture, transparent to quants/end users
• Data completeness: one lost trade can move a VaR estimate by anywhere from $1 to $10 million

Constraints

• Distributed File System, ingesting raw data (ingest sketch below)
  • Regulatory compliance & archiving
  • Last-option disaster recovery
  • Direct access for "power users" doing modelling and analysis

Big Data Solution Architecture Components
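
A minimal ingest sketch in Scala against the Hadoop FileSystem API; the feed and archive paths are illustrative assumptions, not paths from the talk.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Land a day's raw trade feed on HDFS. Block replication (dfs.replication)
// provides the redundancy behind "last-option disaster recovery".
val fs = FileSystem.get(new Configuration())
fs.copyFromLocalFile(
  new Path("/feeds/trades-2015-04-04.json"),  // assumed local drop directory
  new Path("/data/raw/trades/2015-04-04/"))   // assumed archive layout on HDFS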

• Distributed warehouse (write sketch below)
  • Not always highly transactional
  • The trading exchange worries about the trade/transaction
  • Eventual consistency is sufficient
• SQL vs NoSQL
• MPP (Massively Parallel Processing)
• In-memory vs on-disk tuning

Big Data Solution Architecture Components
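
A sketch of the warehouse write path: one trade persisted through the HBase 1.0 client API. The table name, column family and qualifiers are assumptions for illustration.

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
val table = conn.getTable(TableName.valueOf("trades"))  // assumed table name

// Row key = trade id; one cell per attribute under family "t".
val put = new Put(Bytes.toBytes("8400000-8cf0-11bd-b23e-10b96e4ef00d"))
put.addColumn(Bytes.toBytes("t"), Bytes.toBytes("type"), Bytes.toBytes("spotfxusd"))
put.addColumn(Bytes.toBytes("t"), Bytes.toBytes("value"), Bytes.toBytes("4999"))
table.put(put)

table.close()
conn.close()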

• Analytics and serving layer (summary sketch below)
  • Perform descriptive stats
  • Trade summaries
  • Risk calculation
  • Monte Carlo simulation
  • Machine learning
  • Expose APIs
  • Report/aggregate/present

Big Data Solution Architecture Components
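
A sketch of those descriptive stats and trade summaries in Spark SQL; the warehouse path and column names are assumptions.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, count, sum}

val spark = SparkSession.builder.appName("TradeSummaries").getOrCreate()
val trades = spark.read.parquet("hdfs:///data/warehouse/trades")  // assumed layout

// Descriptive stats over trade values: count, mean, stddev, min, max.
trades.describe("value").show()

// Trade summaries per instrument type, the kind of result the serving layer
// would expose through an API.
trades.groupBy("type")
  .agg(count("*").as("numTrades"), sum("value").as("notional"), avg("value").as("avgValue"))
  .show()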

Physical Processes and Daemons

• HDFS
  • Datanodes - store the data
  • Journalnodes - shared edits (HA)
  • Primary and secondary namenodes (HA)
  • Zookeeper - coordinates between namenodes

• YARN
  • Resource managers x 2
  • Node managers x (number of nodes)
  • Job history servers

Lower Level Architecture Components

Physical Processes and Daemons
• HBase (1.0.0)
  • N x HBase zookeepers
  • 2 x HBase masters
  • 2 x HBase master regionservers
  • N x regionservers
• Spark
  • Master (no HA)
  • N x slaves
• Monitoring
  • JMX monitoring

Lower Level Architecture Components

{"book":[{" trade:id":"8400000-8cf0-11bd-b23e-10b96e4ef00d","timestamp":"2015-04-04T14:56:45+00:00 ",

" type":"spotfxusd", "value":"4999"}

]}

• 20+ interbank systems, 100s of reference sets (e.g. exchange rates)
• Billions of these per day, 100TB+ (parsing sketch below)

Data model
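
A parsing sketch for records shaped like the one above, using Spark SQL's JSON reader; flattening the "book" array and renaming "trade:id" to tradeId are choices made here for illustration.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

val spark = SparkSession.builder.appName("TradeIngest").getOrCreate()
import spark.implicits._

// One message as above; in production, billions per day from 20+ systems.
val raw = Seq(
  """{"book":[{"trade:id":"8400000-8cf0-11bd-b23e-10b96e4ef00d","timestamp":"2015-04-04T14:56:45+00:00","type":"spotfxusd","value":"4999"}]}"""
).toDS()

// Infer the schema and flatten the "book" array into one row per trade.
val trades = spark.read.json(raw)
  .select(explode(col("book")).as("t"))
  .select(
    col("t.`trade:id`").as("tradeId"),  // backticks because the key contains a colon
    col("t.timestamp"),
    col("t.type"),
    col("t.value").cast("decimal(18,2)"))

trades.show(false)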

• Estimate Value at Risk (VaR)
  • Over a given timeframe: week, month, year
  • At a confidence level: 95%-99%
  • For a loss amount, e.g. £1m

What is the maximum potential loss, e.g. > £1m, over that time at that confidence level? (Formal definition below.)
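
Formally (a standard definition, not from the slides), for portfolio loss L over the chosen horizon and confidence level \alpha:

\mathrm{VaR}_{\alpha}(L) = \inf\{\, \ell \in \mathbb{R} : \Pr(L > \ell) \le 1 - \alpha \,\}

At \alpha = 0.95 with 1,000,000 simulated P&L outcomes, the 95% VaR is the loss at the boundary of the worst 50,000 outcomes, the "bottom 5%" taken in the simulation sketched below.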

• Using Spark, calculate the covariance matrix of past returns

• Use RDDs and parallel data structures to simulate various market conditions

• Sum, aggregate and take the bottom 5% (sketch below)

Analytics, Machine Learning & Simulation
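
A minimal end-to-end sketch of that pipeline in Spark (Scala). The sample returns, portfolio weights and trial count are illustrative assumptions, not figures from the talk; a real run would load billions of historical returns from the warehouse.

import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.commons.math3.distribution.MultivariateNormalDistribution

val spark = SparkSession.builder.appName("MonteCarloVaR").getOrCreate()
val sc = spark.sparkContext

// 1. Historical daily returns (rows = days, columns = instruments); assumed values.
val returns = sc.parallelize(Seq(
  Vectors.dense(0.0010, -0.0020, 0.0030),
  Vectors.dense(-0.0040, 0.0010, 0.0020),
  Vectors.dense(0.0020, 0.0030, -0.0010),
  Vectors.dense(0.0005, -0.0010, 0.0015),
  Vectors.dense(-0.0015, 0.0025, 0.0005),
  Vectors.dense(0.0030, -0.0005, -0.0020)))

// 2. Covariance matrix (and means) of past returns, computed by Spark.
val mat = new RowMatrix(returns)
val cov = mat.computeCovariance()
val means = mat.computeColumnSummaryStatistics().mean.toArray
val covArr = Array.tabulate(cov.numRows, cov.numCols)((i, j) => cov(i, j))

// 3. Simulate correlated return scenarios in parallel with RDDs.
val weights = Array(0.5, 0.3, 0.2)  // assumed portfolio weights
val numTrials = 1000000
val trialReturns = sc.parallelize(0 until numTrials, 100).mapPartitionsWithIndex {
  (partition, trials) =>
    val dist = new MultivariateNormalDistribution(means, covArr)
    dist.reseedRandomGenerator(partition)  // independent stream per partition
    trials.map { _ =>
      dist.sample().zip(weights).map { case (r, w) => r * w }.sum  // portfolio return
    }
}

// 4. Sum, aggregate and take the bottom 5%: the boundary of the worst 5%
// of simulated returns is the 95% one-day VaR (as a fraction of value).
val worst = trialReturns.takeOrdered(numTrials / 20)
val var95 = -worst.last
println(s"95% VaR (fraction of portfolio value): $var95")

takeOrdered ships only the worst 5% of trials back to the driver, so the final aggregation stays cheap even at millions of trials.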

Towards Real-Time / Streaming VaR
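
One way this could look, sketched with Spark Streaming (an assumption; the talk does not name a streaming stack): keep a sliding window of per-trade P&L and re-estimate an empirical VaR on every batch. The host, port and one-number-per-line feed are placeholders.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

val sc = new SparkContext(new SparkConf().setAppName("StreamingVaR"))
val ssc = new StreamingContext(sc, Seconds(10))

// Placeholder feed: one P&L figure per line from an upstream trade bus.
val pnl = ssc.socketTextStream("tradebus.internal", 9999).map(_.toDouble)

// Sliding one-hour window, re-evaluated every 10 seconds.
pnl.window(Minutes(60), Seconds(10)).foreachRDD { rdd =>
  val n = rdd.count()
  if (n >= 20) {  // need enough points for a 5% tail
    val var95 = -rdd.takeOrdered((n / 20).toInt).last
    println(s"rolling 95% VaR: $var95")
  }
}

ssc.start()
ssc.awaitTermination()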

• Keys have to be distributed evenly (salting sketch below)
• Encoding and compression choices have to be made
  • LZO, GZ, Snappy codecs
• Serialization choices and memory tuning
  • Java objects / JSON objects / JSON to Java
• Replication has to be managed and tested
  • Cross-cluster replication
  • Cross-data-center replication
  • Availability and throughput during replication
  • Rolling restarts and upgrades

Performance Challenges
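
Two of those choices sketched against the HBase 1.0 client API; the table layout, salt bucket count and the pick of Snappy are assumptions for illustration.

import org.apache.hadoop.hbase.{HBaseConfiguration, HColumnDescriptor, HTableDescriptor, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.io.compress.Compression

// Block compression is chosen per column family (Snappy: fast, modest ratio).
val admin = ConnectionFactory.createConnection(HBaseConfiguration.create()).getAdmin
val desc = new HTableDescriptor(TableName.valueOf("trades"))
desc.addFamily(new HColumnDescriptor("t").setCompressionType(Compression.Algorithm.SNAPPY))
admin.createTable(desc)

// Salting the row key spreads sequential trade ids evenly across regions,
// avoiding hot-spotting on a single regionserver.
def saltedKey(tradeId: String, buckets: Int = 16): String =
  f"${math.abs(tradeId.hashCode) % buckets}%02d-$tradeId"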

• In-memory tuning, off-heap and on-heap, region sizes
• Java tuning: heap, permgen, generations (for 20+ daemons!)
• HBase requires a functioning and performant HDFS cluster
• Cassandra requires tuning for compaction and replication
• Spark needs correct partitioning and persistence strategies (sketch below)
• Allocation of resources to nodes, network, disk etc.
• Role- and table-based segmentation - maintaining the Chinese Wall

Performance Challenges
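
A sketch of the Spark side of that tuning; the memory size, partition count and CSV layout are assumptions, not recommendations from the talk.

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("RiskEngine")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")  // compact, fast serialization
  .set("spark.executor.memory", "8g")                                     // assumed heap per executor
val sc = new SparkContext(conf)

// Key trades by book so downstream aggregations shuffle once, then keep the
// partitioned RDD cached, spilling serialized blocks to disk under pressure.
val tradesByBook = sc.textFile("hdfs:///data/raw/trades/2015-04-04")
  .map(line => (line.split(',')(0), line))  // assumed CSV: first field = book id
  .partitionBy(new HashPartitioner(128))
  .persist(StorageLevel.MEMORY_AND_DISK_SER)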

Once you solve that...

• Distributed File System for ingested/archived data
• MPP warehouse for querying and analytics
• Quant layer for machine learning and prediction
• Service layer to expose APIs for VaR, stress tests
• Response guarantees for real-time Big Data

[email protected]

livedemo.batchinsights.com
batchinsights.com