Transcript of PyData Paris 2015 - Track 4.2, Jean Maynier

Page 1

Python, SQLAlchemy and Scrapy at Kpler

Jean Maynier

CTO

Page 2

Business: commodity markets

- Domain: trading / maritime transport

- Customers: physical traders / analysts

- Needs: follow cargo flows and their impact on prices

Page 3

Methodology: 4 data layers

- Commercial database

- “Meta” AIS (10 networks)

- Port authorities

- Gas inventories

The first “Cargo-Tracking” system, bridging the gap between LNG Shipping and Trading.

Page 4

Why Python?

- Simple, fast to prototype

- Data libraries (SQLAlchemy, pandas, scikit-learn)

- Data-oriented community

- Scrapy: the best scraping framework

Step 1

Page 5

Scrapy

- Simplifies crawling

- Reusable extractors (XPath, CSS selectors) (see the sketch below)

- PaaS: Scrapinghub

- Blacklisting: Tor or Crawlera for IP rotation
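As an illustration only (not Kpler's actual spider), a minimal Scrapy spider with reusable XPath/CSS extractors could look like the sketch below; the spider name, URL and fields are placeholders:

```python
import scrapy


class PortCallsSpider(scrapy.Spider):
    """Toy spider: the name, URL and selectors are hypothetical placeholders."""

    name = "port_calls"
    start_urls = ["https://example.com/port-calls"]  # placeholder source

    def parse(self, response):
        # One scraped item per table row; selectors depend on the page layout.
        for row in response.css("table.port-calls tr"):
            yield {
                "vessel": row.xpath("td[1]/text()").get(),
                "port": row.xpath("td[2]/text()").get(),
                "eta": row.xpath("td[3]/text()").get(),
            }
```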

Step 2

Page 6

Process raw data

- Clean and normalize data:

  - Positions

  - Port calls

  - Contracts

  - Prices

- Python pub/sub system

- Persist data: SQLAlchemy & PostgreSQL (see the sketch below)
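A minimal sketch of persisting a cleaned position with SQLAlchemy and PostgreSQL; the model, columns and connection string are assumptions for illustration, not Kpler's actual schema:

```python
from datetime import datetime

from sqlalchemy import Column, DateTime, Float, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()


class Position(Base):
    """Hypothetical table for cleaned vessel positions (not Kpler's schema)."""

    __tablename__ = "positions"

    id = Column(Integer, primary_key=True)
    imo = Column(String, index=True)          # vessel identifier
    lat = Column(Float)
    lon = Column(Float)
    received_at = Column(DateTime, default=datetime.utcnow)


# Connection string is a placeholder.
engine = create_engine("postgresql://user:password@localhost/kpler")
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

with Session() as session:
    session.add(Position(imo="1234567", lat=43.3, lon=5.3))
    session.commit()
```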

Step 3

Page 7

Enrich data

Done

- Vessel dock/undock events (toy sketch below)

- Predict next destination

- Find buyer/seller

- Basic routing

Todo / WIP

- Estimate price

- Routing with weather

- Portfolio optimisation

- Fleet optimisation
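As a toy illustration of the dock/undock idea (a made-up heuristic, not Kpler's actual detector), assuming each position carries a speed and a distance to the nearest berth:

```python
from dataclasses import dataclass


@dataclass
class VesselPosition:
    """Simplified AIS sample; fields and thresholds are illustrative."""

    imo: str
    timestamp: float
    speed_knots: float
    distance_to_berth_km: float


def dock_undock_events(positions, max_speed=0.5, max_distance=0.3):
    """Yield ("dock" | "undock", position) transitions for a single vessel.

    A vessel counts as docked when it is nearly stopped close to a berth;
    the thresholds are made-up defaults for the sake of the example.
    """
    docked = False
    for pos in sorted(positions, key=lambda p: p.timestamp):
        at_berth = (
            pos.speed_knots <= max_speed
            and pos.distance_to_berth_km <= max_distance
        )
        if at_berth and not docked:
            docked = True
            yield "dock", pos
        elif not at_berth and docked:
            docked = False
            yield "undock", pos
```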

Step 4

Page 8

Architecture

- Sources: websites and APIs

- Scraping: Python / Scrapy

- Cache: Elasticsearch (sketch below)

- Data: ETL + models in Python

- Web: REST API in Scala

- Clients: web client, mobile client, Excel
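For the cache layer, a hedged sketch of indexing a cleaned item into Elasticsearch with the official Python client (elasticsearch-py 8.x style; the index name and document shape are assumptions):

```python
from datetime import datetime, timezone

from elasticsearch import Elasticsearch

# Local cluster and index name are placeholders.
es = Elasticsearch("http://localhost:9200")

doc = {
    "imo": "1234567",   # dummy vessel identifier
    "lat": 43.3,
    "lon": 5.3,
    "received_at": datetime.now(timezone.utc).isoformat(),
}

# Cache the cleaned item so the ETL and API layers can re-read it later.
es.index(index="positions-cache", document=doc)
```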

Page 9

Metrics trends

Present

- DBs < 100 GB

- 50 sources

- 500 vessels

- Positions every 3 min

Future

- 1-10 TB

- 100 sources

- 10k to 100k vessels

- Positions every 30 s

Performance problem!
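A rough back-of-the-envelope (my numbers, not the speaker's): 500 vessels reporting every 3 minutes is about 3 positions per second, while 100,000 vessels every 30 seconds is over 3,000 per second, a roughly thousandfold jump in ingest rate, which is what pushes the pipeline from batch towards streaming.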

Page 10

Batch to data streaming

- Parallelization

- Granularity: item vs. source

- Handle failures, no data loss, monitoring

- Candidates: Akka Streams, Spark Streaming, Celery (sketch below), Storm
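For illustration, item-level parallelization with Celery (one of the candidates above) could look like this sketch; the broker URL, task body and retry policy are placeholders:

```python
from celery import Celery

# Broker URL is a placeholder; any Celery-supported broker works.
app = Celery("etl", broker="redis://localhost:6379/0")


@app.task(bind=True, max_retries=3)
def process_position(self, raw):
    """Clean one scraped position; retried on failure so no item is lost."""
    try:
        cleaned = {
            "imo": raw["imo"].strip(),
            "lat": float(raw["lat"]),
            "lon": float(raw["lon"]),
        }
        # ... persist or publish `cleaned` downstream ...
        return cleaned
    except (KeyError, ValueError) as exc:
        # Back off and retry instead of silently dropping the item.
        raise self.retry(exc=exc, countdown=10)


# Each scraped item becomes its own task, so failures are isolated per item:
# process_position.delay({"imo": " 1234567 ", "lat": "43.3", "lon": "5.3"})
```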

Step 5

Page 11

Storm at Kpler: POC

Topology (diagram): website and API sources → scraping (Python / Scrapy) → ETL + model steps: position, dock/undock, custom prices, match prices, next destination, trades, cargo volumes (bolt sketch below).
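Storm can host Python components via the streamparse library; streamparse is my assumption here (the talk only names Storm), and the sketch below shows what a dock/undock bolt in such a topology might look like, with illustrative field names and thresholds:

```python
from streamparse import Bolt


class DockUndockBolt(Bolt):
    """Emits an event whenever a vessel's docked/undocked state flips.

    Field names and thresholds are illustrative, not Kpler's actual topology.
    """

    outputs = ["imo", "event"]

    def initialize(self, conf, ctx):
        self.docked = {}  # imo -> bool

    def process(self, tup):
        imo, speed_knots, distance_to_berth_km = tup.values
        stopped = speed_knots <= 0.5 and distance_to_berth_km <= 0.3
        if stopped != self.docked.get(imo, False):
            self.docked[imo] = stopped
            self.emit([imo, "dock" if stopped else "undock"])
```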

Page 12

We are recruiting, join us!

[email protected]