Dev Lakhani, Data Scientist at Batch Insights "Real Time Big Data Applications for Investment Banks...
Dev Lakhani
• 15 years Software Architecture & Development experience
• 7 years of Big Data experience
• Big Data architectures for banks, telecom, retail, media
• Deutsche Telekom
• ASOS
• Tier 1 investment banks in Canary Wharf
• Dentsu Aegis
• Contributor to Hadoop, Spark, Tachyon, HBase, Ignite
• uk.linkedin.com/in/devlakhani
• Overview of Big Data in financial institutions
• Architectural constraints in investment banking
• Implementation challenges
• Data model
• Future for financial applications
Introduction
• This talk has a technical focus
• This presentation is not representative of any client
• A re-definition of "real time" for Big Data
• Vendor neutral talk
Disclaimers
Real Time Definition
[AS MODIFIER] Computing. Relating to a system in which input data is processed within milliseconds so that it is available virtually immediately as feedback to the process from which it is coming, e.g. in a missile guidance system: real-time signal processing; real-time software.
http://www.oxforddictionaries.com/definition/english/real-time
Real Time Definition (Modified)
[AS MODIFIER] Computing. Relating to a system in which input data is processed within a guaranteed response time, using up-to-date (latest version) information, and available on demand as feedback to the process from which it is coming.
Big Data Drivers for Investment Banking & Financial Institutions
• Capturing billions of trades
• Quantifying risk and exposure
• Regulatory requirements
• Response to news and events
• Detecting fraud, rogue trading and anomalies
• Performing simulations & algorithmic trading
• Business analysis - PnL
• Capital reserves and forecasting
Why Use Big Data?
• Disaster avoidance (not recovery) through replication and redundancy
• High availability
• "Chinese Wall" policy and segmentation of information
  • Within the bank
  • External to the bank
• Security & role-based segmentation
• Responsiveness and throughput
• API- or service-based architecture, transparent to quants/end users
• Data completeness: 1 lost trade = $1 < x < $10 million in a VaR estimate
Constraints
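The data-completeness constraint above can be made concrete with a reconciliation check: compare the trade IDs ingested into the warehouse against the upstream source and flag any gap before risk runs. This is a minimal sketch; the function name and sample IDs are illustrative, not from the deck.

```python
# Hedged sketch of a completeness check: a single missing trade can move
# a VaR estimate anywhere from $1 to $10m, so ingestion is reconciled
# against the upstream source before any risk calculation runs.

def reconcile(source_ids, warehouse_ids):
    """Return trade IDs present upstream but missing from the warehouse."""
    return sorted(set(source_ids) - set(warehouse_ids))

missing = reconcile(["T1", "T2", "T3"], ["T1", "T3"])
print(missing)  # ['T2']
```

In practice the two ID sets would themselves come from distributed counts/joins, but the invariant checked is the same: the difference must be empty.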
• Distributed File System, ingests raw data
• Regulatory compliance & archiving
• Last-option disaster recovery
• Direct access for "power users" for modelling and analysis
Big Data Solution Architecture Components
• Distributed warehouse
• Not always highly transactional
  • The trading exchange worries about the trade/transaction
  • Eventually consistent is sufficient
• SQL vs NoSQL
• MPP (Massively Parallel Processing)
• In-memory vs on-disk tuning
Big Data Solution Architecture Components
• Analytics and serving layer
• Perform descriptive stats
  • Trade summaries
  • Risk calculation
• Monte Carlo simulation
• Machine learning
• Expose APIs
• Report/aggregate/present
Big Data Solution Architecture Components
Physical Processes and Daemons
• HDFS
  • Datanodes - store the data
  • Journalnodes - shared edits (HA)
  • Primary and Secondary namenode (HA)
  • Zookeeper - coordinates between namenodes
• YARN
  • Resource manager x 2
  • Node managers x (number of nodes)
  • Job history servers
Lower Level Architecture Components
Physical Processes and Daemons
• HBase (1.0.0)
  • N x HBase zookeepers
  • 2 x HBase masters
  • 2 x HBase master regionservers
  • N x regionservers
• Spark
  • Master (no HA)
  • N x slaves
• Monitoring
  • JMX monitoring
Lower Level Architecture Components
{"book": [
  {"trade:id": "8400000-8cf0-11bd-b23e-10b96e4ef00d",
   "timestamp": "2015-04-04T14:56:45+00:00",
   "type": "spotfxusd",
   "value": "4999"}
]}
• 20+ interbank systems, 100s of reference sets (e.g. exchange rates)
• Billions of these per day, 100TB+
Data model
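A record in the shape above can be consumed with a few lines of parsing. This is a minimal local sketch: the field names ("trade:id", "timestamp", "type", "value") follow the sample document, while the function name and the tuple layout are illustrative assumptions.

```python
import json

# Sample book document in the format shown above.
RAW = '''
{"book": [
  {"trade:id": "8400000-8cf0-11bd-b23e-10b96e4ef00d",
   "timestamp": "2015-04-04T14:56:45+00:00",
   "type": "spotfxusd",
   "value": "4999"}
]}
'''

def parse_trades(raw):
    """Return (trade_id, type, value) tuples from a book document.

    Note "value" arrives as a string in the sample and is cast here.
    """
    book = json.loads(raw)["book"]
    return [(t["trade:id"], t["type"], float(t["value"])) for t in book]

trades = parse_trades(RAW)
print(trades[0])  # ('8400000-8cf0-11bd-b23e-10b96e4ef00d', 'spotfxusd', 4999.0)
```

At billions of records per day the same mapping would run as a distributed job rather than a loop, but the per-record transformation is identical.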
• Estimate Value at Risk (VaR)
  • Over a given timeframe: week, month, year
  • At a confidence level: 95%-99%
  • For a loss amount, e.g. £1m
What is the maximum potential loss > £1m over that time?
• Using Spark calculate the covariance matrix of past returns
• Use RDDs and parallel data structures to simulate various conditions
• Sum, aggregate and take bottom 5%
Analytics, Machine Learning & Simulation
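The three steps above can be sketched end to end on a single machine: estimate the covariance of past returns, draw correlated scenarios, and read the loss off the bottom 5%. In the deck these steps run over Spark RDDs; plain Python stands in here, and every number (return histories, notional, scenario count) is illustrative.

```python
import math
import random
import statistics

random.seed(42)

# Toy history of 250 daily returns for two instruments.
hist_a = [random.gauss(0.0005, 0.01) for _ in range(250)]
hist_b = [random.gauss(0.0003, 0.02) for _ in range(250)]

# Step 1: covariance of past returns (2x2 case done by hand).
mean_a, mean_b = statistics.fmean(hist_a), statistics.fmean(hist_b)
var_a, var_b = statistics.pvariance(hist_a), statistics.pvariance(hist_b)
cov_ab = statistics.fmean((a - mean_a) * (b - mean_b)
                          for a, b in zip(hist_a, hist_b))

# Cholesky factor of the covariance matrix, to correlate the draws.
l11 = math.sqrt(var_a)
l21 = cov_ab / l11
l22 = math.sqrt(var_b - l21 ** 2)

PORTFOLIO = 1_000_000  # notional, split evenly across both instruments

# Step 2: simulate correlated P&L scenarios.
def simulate_pnl():
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    r_a = mean_a + l11 * z1
    r_b = mean_b + l21 * z1 + l22 * z2
    return PORTFOLIO * 0.5 * (r_a + r_b)

# Step 3: sum, aggregate and take the bottom 5%.
pnls = sorted(simulate_pnl() for _ in range(10_000))
var_95 = -pnls[int(0.05 * len(pnls))]  # loss at the 5th percentile
print(f"1-day 95% VaR: {var_95:,.0f}")
```

With Spark, the history would be an RDD of returns, the simulation a parallel map over scenario seeds, and the quantile a distributed sort/aggregate - the arithmetic per scenario is unchanged.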
• Keys have to be distributed evenly
• Encoding and compression choices have to be made
  • LZO, GZ, Snappy codecs
• Serialization choices and memory tuning
  • Java objects / JSON objects / JSON to Java
• Replication has to be managed and tested
  • Cross-cluster replication
  • Cross-data-center replication
  • Availability and throughput during replication
  • Rolling restarts and upgrades
Performance Challenges
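"Keys have to be distributed evenly" is commonly addressed by salting row keys: prefixing each key with a hash-derived bucket so that monotonically increasing keys (timestamps, sequential trade IDs) spread across region servers instead of hot-spotting one. This is a hedged sketch; the bucket count and key layout are illustrative choices, not the deck's scheme.

```python
import hashlib

NUM_BUCKETS = 16  # illustrative; typically sized to the region count

def salted_key(trade_id: str) -> str:
    """Prefix a row key with a deterministic salt bucket.

    The salt is derived from the id itself, so reads can recompute it.
    """
    digest = hashlib.md5(trade_id.encode()).hexdigest()
    bucket = int(digest, 16) % NUM_BUCKETS
    return f"{bucket:02d}|{trade_id}"

# Sequential ids fan out across buckets instead of one hot region.
keys = [salted_key(f"trade-{i}") for i in range(1000)]
buckets = {k.split("|")[0] for k in keys}
print(f"{len(buckets)} buckets in use")
```

The trade-off is that range scans must now fan out across all buckets, which is why the bucket count is kept small and fixed.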
• In-memory tuning, off-heap and on-heap, region sizes
• Java tuning: heap, permgen, generations (for 20+ daemons!)
• HBase requires a functioning and performant HDFS cluster
• Cassandra requires tuning for compaction, replication
• Spark needs correct partitioning and persistence strategies
• Allocation of resources to nodes, network, disk etc.
• Role- and table-based segmentation - maintaining the Chinese Wall
Performance Challenges
Once you solve that...
• Distributed File System for ingested/archived data
• MPP warehouse for querying and analytics
• Quant layer for machine learning and prediction
• Service layer to expose APIs for VaR, stress tests
• Response guarantees for real-time Big Data