Modern Big Data Systems for Machine Learning
![Page 1: Modern Big Data Systems for Machine Learning](https://reader030.fdocuments.net/reader030/viewer/2022032700/55d2d3fdbb61eb8e578b45e5/html5/thumbnails/1.jpg)
Modern Big Data Systems for
Machine Learning
Antonio Roldao, Ph.D. CQF.
10/July/2015, Thomson Reuters, London, UK
![Page 2: Modern Big Data Systems for Machine Learning](https://reader030.fdocuments.net/reader030/viewer/2022032700/55d2d3fdbb61eb8e578b45e5/html5/thumbnails/2.jpg)
About Me
http://anton.io @roldao
![Page 3: Modern Big Data Systems for Machine Learning](https://reader030.fdocuments.net/reader030/viewer/2022032700/55d2d3fdbb61eb8e578b45e5/html5/thumbnails/3.jpg)
This Talk on Big Data Systems
- Data
  - Big Data as a Buzzword and the useless 4V’s
  - Basic Aspects of Data
  - Advanced Aspects of Data
  - Small Data Innovations
- Algorithms for Machine Learning
  - ML Overview
  - Optimization Problems
  - Solving Systems of Linear Equations
  - Accelerating ML Using Different Technologies
- Distributed Computing
  - Computing at Scale
  - Platform Examples
![Page 4: Modern Big Data Systems for Machine Learning](https://reader030.fdocuments.net/reader030/viewer/2022032700/55d2d3fdbb61eb8e578b45e5/html5/thumbnails/4.jpg)
Big Data
4Vs of BD?!
- Volume
- Variety
- Velocity
- Veracity
Too simplistic and technically useless!
“Any amount of data that is too big for Excel to process.”
1956: hard drive with 5 MB
Mostly a marketing buzzword that means different things to different people.
![Page 5: Modern Big Data Systems for Machine Learning](https://reader030.fdocuments.net/reader030/viewer/2022032700/55d2d3fdbb61eb8e578b45e5/html5/thumbnails/5.jpg)
Understanding Data – Basic
- Storage formats
  - Uncompressed <-> Compressed
  - Unencrypted <-> Encrypted
  - Human-readable <-> Binary
  - Rigid <-> Templated <-> Self-describing
  - Mainly regular <-> Irregular
  - Different types and encodings…
- Generation (write) modes
  - parallel <-> sequential
  - append-only
  - in-place updates
  - random inserts…
- Consumption (read) modes
  - parallel <-> sequential
  - random <-> well defined access…
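To make the storage-format trade-offs above concrete, here is a minimal Python sketch that encodes the same records as human-readable, self-describing JSON and as rigid fixed-width binary, then compresses both. The tick data, the field names `ts` and `px`, and the 16-byte record layout are illustrative assumptions, not something taken from the talk.

```python
import json
import struct
import zlib

# Hypothetical sample: 1,000 (timestamp, price) ticks; the values are made up.
ticks = [(1436486400 + i, 100.0 + 0.25 * (i % 8)) for i in range(1000)]

# Human-readable and self-describing: every record repeats its field names.
as_json = json.dumps([{"ts": ts, "px": px} for ts, px in ticks]).encode("utf-8")

# Binary and rigid: fixed 16-byte records (little-endian int64 + float64),
# with no field names stored alongside the values.
record = struct.Struct("<qd")
as_binary = b"".join(record.pack(ts, px) for ts, px in ticks)

# Compressed variants of both encodings.
json_zipped = zlib.compress(as_json)
binary_zipped = zlib.compress(as_binary)

# Fixed-width records allow random access: record i starts at byte i * 16.
decoded = [record.unpack_from(as_binary, i * record.size) for i in range(len(ticks))]

for name, blob in [("json", as_json), ("binary", as_binary),
                   ("json+zlib", json_zipped), ("binary+zlib", binary_zipped)]:
    print(f"{name:12s} {len(blob):6d} bytes")
```

The rigid binary layout is smaller and supports random reads by offset, while the JSON form can be inspected by eye and tolerates irregular records; which matters more depends on the read and write modes listed above.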
![Page 6: Modern Big Data Systems for Machine Learning](https://reader030.fdocuments.net/reader030/viewer/2022032700/55d2d3fdbb61eb8e578b45e5/html5/thumbnails/6.jpg)
Understanding Data – Advanced
- Represents:
  - How concepts are connected (graph)
  - How connections evolve with time (time series)
- Bitemporal (e.g. value depends on time frame)
- Time value of data (e.g. useful today, but not tomorrow)
- Sensitivity (e.g. Medical, Economical, Political, Privacy…)
- Interdependency (e.g. one wrong bit destroys everything)
- Cleanliness (e.g. how noisy it is)
- Truthfulness (e.g. how accurate it is)
- Redundancy (e.g. how safe does it need to be)
- Density (e.g. how redundant it is)
- Accessibility (e.g. Local <-> Global)
- Cost / Budget
![Page 7: Modern Big Data Systems for Machine Learning](https://reader030.fdocuments.net/reader030/viewer/2022032700/55d2d3fdbb61eb8e578b45e5/html5/thumbnails/7.jpg)
Myriad of Data-stores/bases
- File-Systems
  - local, distributed, p2p, …
  - rom, tape, spindle, flash, ram, …
- Key-Value Stores
- Relational
- Object
- Geo-location
- Row-based
- Column-based
- Time-Series
- Graph-based
- ACID compliant or not
- Sharding Support
- Replication Support
- HA Support
- Blockchain
- LayerFS
- …
![Page 8: Modern Big Data Systems for Machine Learning](https://reader030.fdocuments.net/reader030/viewer/2022032700/55d2d3fdbb61eb8e578b45e5/html5/thumbnails/8.jpg)
Recent Innovations in “Small-data”
- XML (1996)
- YAML (2001)
- JSON
- BSON
- Google Protocol Buffers (initial release 2008)
- Cap’n Proto
- Thrift
- Avro
- FAST FIX/BFIX
- FlatBuffers
- Simple Binary Encoding (2014)
- Dynamically Adaptive Encoding (Future)
http://www.quora.com/What-are-the-pros-and-cons-of-different-serialization-formats-for-Hadoop
![Page 9: Modern Big Data Systems for Machine Learning](https://reader030.fdocuments.net/reader030/viewer/2022032700/55d2d3fdbb61eb8e578b45e5/html5/thumbnails/9.jpg)
Processing Data
![Page 10: Modern Big Data Systems for Machine Learning](https://reader030.fdocuments.net/reader030/viewer/2022032700/55d2d3fdbb61eb8e578b45e5/html5/thumbnails/10.jpg)
Machine Learning
![Page 11: Modern Big Data Systems for Machine Learning](https://reader030.fdocuments.net/reader030/viewer/2022032700/55d2d3fdbb61eb8e578b45e5/html5/thumbnails/11.jpg)
ML / AI – Boils down to…
Given an input (X) and/or state (S), produce an output (Y)
- X may include an index or time element (e.g. time series)
- S may include:
  - a feedback loop (e.g. reinforcement learning)
  - a model previously trained on a dataset (e.g. supervised learning)
- Y divides into two types:
  - predictions (e.g. weather, trading, ...)
  - categorizations
    - known categories (e.g. object/speech recognition, …)
    - unknown categories (e.g. insight generation, …)
![Page 12: Modern Big Data Systems for Machine Learning](https://reader030.fdocuments.net/reader030/viewer/2022032700/55d2d3fdbb61eb8e578b45e5/html5/thumbnails/12.jpg)
Dimensionality Reduction
Principal Component Analysis
First component
Subsequent components
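The slide's formulas did not survive in this transcript, so the following is only a sketch of one common way to compute the components, assuming NumPy. The first component is taken as the direction of maximum variance, and each subsequent component is found the same way after removing the variance already explained (deflation). The function name `principal_components` and the synthetic data are hypothetical.

```python
import numpy as np

def principal_components(X, n_components):
    """Return unit-length principal directions of X (rows are observations)."""
    Xc = X - X.mean(axis=0)                  # centre each column
    components = []
    for _ in range(n_components):
        # Direction of maximum variance of the current data: the eigenvector
        # of Xc^T Xc with the largest eigenvalue.
        eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc)
        w = eigvecs[:, -1]                   # eigh returns eigenvalues in ascending order
        components.append(w)
        # Deflation: subtract the part of the data explained along w, so the
        # next pass finds the next-best direction.
        Xc = Xc - np.outer(Xc @ w, w)
    return np.array(components)

rng = np.random.default_rng(0)
# Hypothetical data: 500 points stretched along the direction (1, 1).
t = rng.normal(size=500)
X = np.column_stack([t, t]) * 3.0 + rng.normal(scale=0.1, size=(500, 2))

W = principal_components(X, 2)
print(W)   # row 0 is the first component, row 1 the second
```

In practice all components are usually obtained at once from a single eigendecomposition or SVD; the loop here is written out only to mirror the "first component, then subsequent components" framing.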
![Page 13: Modern Big Data Systems for Machine Learning](https://reader030.fdocuments.net/reader030/viewer/2022032700/55d2d3fdbb61eb8e578b45e5/html5/thumbnails/13.jpg)
Clustering
k-Means
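Partition the observations x into k sets S = {S_1, …, S_k} so as to minimize the within-cluster sum of squares:

$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$$

where $\mu_i$ represents the mean of the points in $S_i$.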
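A minimal sketch of k-means using Lloyd's algorithm (alternating an assignment step and a mean-update step), assuming NumPy. The function name `k_means`, the random initialisation, and the two synthetic blobs are illustrative choices; the talk does not say which variant or initialisation it had in mind.

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignment and mean-update steps."""
    rng = np.random.default_rng(seed)
    # Initialise the k means with distinct observations chosen at random.
    centres = X[rng.choice(len(X), size=k, replace=False)]

    def assign(c):
        # Index of the nearest centre for every observation.
        dists = np.linalg.norm(X[:, None, :] - c[None, :, :], axis=2)
        return dists.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(centres)
        # Each new centre is the mean of its partition
        # (an empty partition keeps its previous centre).
        updated = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centres[i]
            for i in range(k)
        ])
        if np.allclose(updated, centres):
            break
        centres = updated
    return centres, assign(centres)

rng = np.random.default_rng(1)
# Hypothetical data: two well-separated blobs around (0, 0) and (10, 10).
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(100, 2)),
    rng.normal(loc=10.0, scale=0.5, size=(100, 2)),
])
centres, labels = k_means(X, k=2)
print(centres)
```

Lloyd's algorithm only finds a local minimum of the clustering objective, so real uses typically run it from several initialisations and keep the best result.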
![Page 14: Modern Big Data Systems for Machine Learning](https://reader030.fdocuments.net/reader030/viewer/2022032700/55d2d3fdbb61eb8e578b45e5/html5/thumbnails/14.jpg)
General AI
Genetic Algorithms
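For n mutations, select the mutation $m_i$ that minimizes the difference between its output $y_i$ and a given reference $r$:

$$m^{*} = \underset{m_i,\; i = 1, \dots, n}{\arg\min}\; \lvert y_i - r \rvert$$

where $y_i$ is the output produced by mutation $m_i$.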
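The selection rule described here can be sketched in a few lines of Python. This is a deliberately reduced mutation-and-selection loop without crossover or a population, so it is closer to a simple evolutionary strategy than a full genetic algorithm. The function `evolve`, the Gaussian mutation, and the target problem are all assumptions made for illustration.

```python
import random

def evolve(f, r, parent, n=20, generations=200, sigma=0.5, seed=0):
    """Keep the candidate whose output f(m) is closest to the reference r."""
    rng = random.Random(seed)
    best, best_err = parent, abs(f(parent) - r)
    for _ in range(generations):
        # Generate n mutations m_i of the current best candidate.
        mutations = [best + rng.gauss(0.0, sigma) for _ in range(n)]
        # Select the m_i whose output y_i = f(m_i) is closest to r.
        m = min(mutations, key=lambda m_i: abs(f(m_i) - r))
        err = abs(f(m) - r)
        # Elitism: only replace the current best if the mutation improves on it.
        if err < best_err:
            best, best_err = m, err
    return best, best_err

# Hypothetical problem: find m whose square is as close as possible to r = 2.
best, err = evolve(lambda m: m * m, r=2.0, parent=1.0)
print(best, err)
```

Because only improvements are accepted, the error never increases from one generation to the next; the price is that a fixed mutation size `sigma` makes late-stage progress slow.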
![Page 15: Modern Big Data Systems for Machine Learning](https://reader030.fdocuments.net/reader030/viewer/2022032700/55d2d3fdbb61eb8e578b45e5/html5/thumbnails/15.jpg)
Artificial Neural Networks
Deep Convolutional Neural Network (d-CNN)
Optimization involving Stochastic Gradient Descent + Back-propagation
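As a rough illustration of stochastic gradient descent combined with back-propagation, the sketch below trains a tiny fully connected network with one hidden layer, assuming NumPy. It is not a convolutional network, and the toy task (fitting `sin(x)`), layer size, learning rate, and batch size are arbitrary assumptions chosen to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy task: learn y = sin(x) on [-pi, pi].
X = rng.uniform(-np.pi, np.pi, size=(256, 1))
Y = np.sin(X)

# One hidden layer of 16 tanh units followed by a linear output.
W1 = rng.normal(scale=0.5, size=(1, 16)); b1 = np.zeros(16)
W2 = rng.normal(scale=0.5, size=(16, 1)); b2 = np.zeros(1)

def forward(x):
    h = np.tanh(x @ W1 + b1)
    return h, h @ W2 + b2

def mse(x, y):
    return float(np.mean((forward(x)[1] - y) ** 2))

loss_before = mse(X, Y)
lr, batch = 0.01, 32
for step in range(5000):
    idx = rng.integers(0, len(X), size=batch)    # stochastic mini-batch
    x, y = X[idx], Y[idx]
    h, out = forward(x)
    # Back-propagation: apply the chain rule from the loss back to each parameter.
    d_out = 2.0 * (out - y) / batch              # dLoss/d(out) for mean squared error
    dW2 = h.T @ d_out
    db2 = d_out.sum(axis=0)
    d_h = (d_out @ W2.T) * (1.0 - h ** 2)        # tanh'(z) = 1 - tanh(z)^2
    dW1 = x.T @ d_h
    db1 = d_h.sum(axis=0)
    # SGD update: step against the gradient estimated on this mini-batch.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
loss_after = mse(X, Y)

print(loss_before, loss_after)
```

A deep convolutional network follows the same pattern, with more layers and with convolutions in place of some of the matrix products.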
![Page 16: Modern Big Data Systems for Machine Learning](https://reader030.fdocuments.net/reader030/viewer/2022032700/55d2d3fdbb61eb8e578b45e5/html5/thumbnails/16.jpg)
All About Optimization
All these schemes involve solving for some constants that minimize or maximize some cost function.
They require fundamental optimization algorithms such as:
- Direct Methods
  - Combinatorial Algorithms
  - Greedy Algorithm
  - Minimax Algorithm with alpha-beta pruning
  - …
- Iterative Methods
  - Gradient Methods
  - Karmarkar’s Algorithm
  - …
![Page 17: Modern Big Data Systems for Machine Learning](https://reader030.fdocuments.net/reader030/viewer/2022032700/55d2d3fdbb61eb8e578b45e5/html5/thumbnails/17.jpg)
At the Core of Optimization…
…there is the solution of a system of linear equations of the form:

$$A x = b$$

with x subject to some constraints.
Solving it requires algorithms that fall into two categories:
- Direct Methods: Gaussian elimination, LU, QR, Cholesky, LDL, …
- Iterative Methods: MINRES, CG, BiCGSTAB, GMRES, ORTHOMIN, …
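To contrast the two categories, here is a hedged sketch that solves the same symmetric positive-definite system once with a direct method (Cholesky factorization) and once with an iterative method (conjugate gradient), assuming NumPy. The function names, tolerance, and the randomly generated test matrix are illustrative assumptions; constraints on x are not handled.

```python
import numpy as np

def solve_cholesky(A, b):
    """Direct method: factor A = L L^T, then solve two triangular systems."""
    L = np.linalg.cholesky(A)
    # np.linalg.solve is a general solver used here for brevity; a dedicated
    # triangular solver would exploit the structure of L.
    y = np.linalg.solve(L, b)
    return np.linalg.solve(L.T, y)

def solve_cg(A, b, tol=1e-10, max_iter=1000):
    """Iterative method: conjugate gradient for symmetric positive-definite A."""
    x = np.zeros_like(b)
    r = b - A @ x            # residual
    p = r.copy()             # search direction
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) < tol:
            break
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

rng = np.random.default_rng(0)
M = rng.normal(size=(50, 50))
A = M @ M.T + 50 * np.eye(50)    # hypothetical SPD test matrix
b = rng.normal(size=50)

x_direct = solve_cholesky(A, b)
x_iter = solve_cg(A, b)
print(np.max(np.abs(x_direct - x_iter)))
```

The direct solver does a fixed amount of work and returns an answer accurate to rounding error, while the iterative solver only needs matrix-vector products and can stop early at a chosen tolerance, which is what makes it attractive for large sparse systems and parallel hardware.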
![Page 18: Modern Big Data Systems for Machine Learning](https://reader030.fdocuments.net/reader030/viewer/2022032700/55d2d3fdbb61eb8e578b45e5/html5/thumbnails/18.jpg)
Accelerating Machine Learning
CPU, GPGPU, FPGA
Sequential Processing <-> Parallel Processing
High Flexibility, High Abstractions, Many Libraries, …, Direct Methods
Ultra-Low-Latency, High Bandwidth, Fine-grain optimization, ..., Iterative Methods, Neural Networks, Markov Chains, Monte Carlo
![Page 19: Modern Big Data Systems for Machine Learning](https://reader030.fdocuments.net/reader030/viewer/2022032700/55d2d3fdbb61eb8e578b45e5/html5/thumbnails/19.jpg)
Networked Computing Systems
- Mainframe Computing
- Cluster Computing
- Distributed Computing
- Grid Computing
- Orbital Computing
- Interstellar Computing
- Galactic Computing
- Inter-Universe Computing
- Cloud Computing
![Page 20: Modern Big Data Systems for Machine Learning](https://reader030.fdocuments.net/reader030/viewer/2022032700/55d2d3fdbb61eb8e578b45e5/html5/thumbnails/20.jpg)
Modern Big Data Systems – Basic Components
- Dynamic (abstraction) + Statically-Typed (speed) Languages
- Need to rethink and re-engineer main systems:
  - Data & Code Stores
  - Logging
  - Code Revision and Deployment
  - Compute Nodes and Brokers Management
  - Graceful Failure and Recovery
  - Credentials and Access Controls
  - Task Schedulers
  - Messaging Bus
  - Web/Mobile Interfaces
  - Regression Testing…
- Containerize and Standardize Services
![Page 21: Modern Big Data Systems for Machine Learning](https://reader030.fdocuments.net/reader030/viewer/2022032700/55d2d3fdbb61eb8e578b45e5/html5/thumbnails/21.jpg)
Examples – Modern Big Data Systems
- Finance
  - Athena/Hydra @ JP Morgan
  - Quartz/Sandra @ Bank of America
  - Slang/SecDB @ Goldman Sachs
  - Optimus/DAL @ Morgan Stanley
  - WSQ Tech @ n-prop shops &
  - datapark.io @ quants / prop-shops
- Machine Learning
  - Alpha/DL @ Muse.Ai
![Page 22: Modern Big Data Systems for Machine Learning](https://reader030.fdocuments.net/reader030/viewer/2022032700/55d2d3fdbb61eb8e578b45e5/html5/thumbnails/22.jpg)
Thank you
http://anton.io @roldao