Python as part of a production machine learning stack by Michael Manapat PyData SV 2014
Python as part of a production machine learning stack
Michael Manapat, @mlmanapat, Stripe
Outline
- Why we need ML at Stripe
- Simple models with sklearn
- Pipelines with Luigi
- Scoring as a service
Stripe is a technology company focusing on making payments easy
- Short application
Tokenization
[Diagram: Customer browser -> Stripe (Stripe.js) -> Token; Merchant server -> Stripe (API call) -> Result]
API Call
    import stripe
    stripe.Charge.create(
        amount=400,
        currency="usd",
        card="tok_103xnl2gR5VxTSB",
        [email protected]"
    )
Fraud / business violations
- Terms of service violations (weapons)
- Merchant fraud (card "cashers")
- Transaction fraud
- No machine learning a year ago
Fraud / business violations
- Terms of service violations: e-cigarettes, drugs, weapons, etc.
How do we find these automatically?
Merchant sign up flow
[Diagram: Application submission -> Website scraped -> Text scored -> Application reviewed]
Merchant sign up flow
[Diagram: Application submission -> Website scraped -> Text scored -> Application reviewed, annotated "Machine learning pipeline and service"]
Building a classifier: e-cigarettes
    import pandas
    data = pandas.read_pickle('ecigs')
    data.head()

                                                    text violator
    0  " please verify your age i am 21 years or older ...     True
    1  coming soon toggle me drag me with your mouse ...       False
    2  drink moscow mules cart 0 log in or create an ...       False
    3  vapors electronic cigarette buy now insuper st...        True
    4  t-shirts shorts hawaii about us silver coll...          False

    [5 rows x 2 columns]
Features for text classification
    from sklearn.feature_extraction.text import CountVectorizer
    cv = CountVectorizer()
    features = cv.fit_transform(data['text'])
Sparse matrix of word counts from input text (omitting feature selection)
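To make the "sparse matrix of word counts" concrete, here is a minimal sketch on a made-up three-document corpus. The documents are illustrative stand-ins, not Stripe data, and the snippet assumes scikit-learn is installed.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for scraped website text.
docs = [
    "vapors electronic cigarette buy now",
    "t-shirts shorts hawaii about us",
    "buy electronic cigarette vapors online now",
]

cv = CountVectorizer()
features = cv.fit_transform(docs)  # SciPy sparse matrix, one row per document

# vocabulary_ maps each token to its column index in the matrix.
vocab = cv.vocabulary_
n_docs, n_terms = features.shape

# Number of times "cigarette" occurs in the first document.
cig_count = features[0, vocab["cigarette"]]
```

With the default tokenizer, single-character tokens such as the "t" in "t-shirts" are dropped, so the vocabulary contains "shirts" but not "t".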
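Features for text classification
    # train_test_split lived in sklearn.cross_validation in 2014-era scikit-learn
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(
        features, data['violator'], test_size=0.2)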
- Avoid leakage
- Other cross-validation methods
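The slide does not say which form of leakage was meant. One common reading is that the vocabulary should be learned from training data only. A minimal sketch of that idea, assuming scikit-learn and a made-up eight-document corpus: wrapping the vectorizer and classifier in one pipeline means each cross-validation fold refits the vocabulary on its own training portion.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: four violators, four non-violators.
texts = [
    "vapors electronic cigarette buy now",
    "e-liquid vape starter kit cigarette",
    "electronic cigarette refills and vapors",
    "buy vape pens and e-liquid online",
    "t-shirts shorts hawaii about us",
    "drink moscow mules cart log in",
    "handmade silver jewelry about us",
    "coming soon toggle me with your mouse",
]
labels = [True, True, True, True, False, False, False, False]

# The vectorizer is fit inside each fold, never on held-out text.
pipeline = make_pipeline(CountVectorizer(), LogisticRegression())

# Stratified 4-fold cross-validation; each score is a fold's accuracy.
scores = cross_val_score(pipeline, texts, labels, cv=4)
```

On a corpus this small the scores are not meaningful; the point is only the structure, where feature extraction sits inside the cross-validated estimator.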
Training
    from sklearn.linear_model import LogisticRegression
    model = LogisticRegression()
    model.fit(X_train, y_train)
Serializer reads from model.intercept_ and model.coef_
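The slides do not show the serializer itself. As a rough sketch of why the intercept and coefficients are sufficient, the snippet below dumps them to JSON and rescoring with the logistic function reproduces `predict_proba` for a binary model. The JSON layout, the toy data, and the `score` helper are all assumptions made for illustration.

```python
import json
import math

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy training data: two features, binary labels.
X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 1.0],
              [3.0, 0.0], [0.5, 2.0], [2.5, 0.5]])
y = np.array([0, 0, 1, 1, 0, 1])

model = LogisticRegression()
model.fit(X, y)

# Serialize only the learned parameters (this layout is an assumption).
payload = json.dumps({
    "intercept": float(model.intercept_[0]),
    "coef": model.coef_[0].tolist(),
})


def score(serialized, x):
    """Return P(class 1) for feature vector x from serialized parameters."""
    params = json.loads(serialized)
    z = params["intercept"] + sum(w * v for w, v in zip(params["coef"], x))
    return 1.0 / (1.0 + math.exp(-z))


manual = score(payload, [2.0, 1.0])
reference = model.predict_proba(np.array([[2.0, 1.0]]))[0, 1]
```

Because scoring needs only a dot product and a sigmoid, a consumer of the serialized model does not need scikit-learn at all.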
Validation
    import matplotlib.pyplot
    from sklearn.metrics import roc_curve
    probs = model.predict_proba(X_test)
    fpr, tpr, thresholds = roc_curve(y_test, probs[:, 1])
    matplotlib.pyplot.plot(fpr, tpr)
ROC: Receiver operating characteristic
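For readers new to ROC curves, a small worked example may help. It uses four made-up labels and scores (not Stripe data) and assumes scikit-learn is installed. Each point on the curve is the false positive rate and true positive rate obtained at one score threshold, and the area under the curve summarizes ranking quality.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical labels and classifier scores.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# One (fpr, tpr) point per candidate threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under the ROC curve.
auc = roc_auc_score(y_true, y_score)
```

Here the AUC is 0.75: of the four (positive, negative) pairs, three are ranked correctly, and only the positive scored 0.35 falls below the negative scored 0.4.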
Pipeline
- Fetch website snapshots from S3
- Fetch classifications from SQL/Impala
- Sanitize text (strip HTML)
- Run feature generation and selection
- Train and serialize model
- Export validation statistics
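The "sanitize text (strip HTML)" step is not shown in the slides. A minimal sketch using only the standard library might look like the following; the `TextExtractor` class and `sanitize` function are hypothetical names, and lowercasing is an assumption based on the lowercase sample text shown earlier.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside script/style elements.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse whitespace runs and lowercase the result.
        return " ".join(" ".join(self._chunks).split()).lower()


def sanitize(html):
    """Return the lowercase visible text of an HTML document."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Vapors</h1><p>Buy <b>now</b>!</p>"
    "<script>var x = 1;</script></body></html>"
)
clean = sanitize(sample)
```

Real scraped pages are messier than this sample, so a production step would likely need more handling (malformed markup, hidden elements, encodings) than this sketch provides.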
Luigi
    import luigi

    class GetSnapshots(luigi.Task):
        def run(self):
            ...

    class GenFeatures(luigi.Task):
        def requires(self):
            return GetSnapshots()
Luigi runs tasks on Hadoop cluster
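To illustrate the idea behind `requires()` and `run()`, here is a plain-Python toy that resolves dependencies before running a task. It is not Luigi and does not use its API: there are no targets, no scheduler, and no Hadoop. The `Task` base class, the `build` function, and the `TrainModel` task are hypothetical names introduced for this sketch.

```python
class Task:
    """Toy stand-in for a pipeline task (not luigi.Task)."""

    def requires(self):
        # Tasks with no dependencies return an empty list.
        return []

    def run(self, log):
        # Record the task name instead of doing real work.
        log.append(type(self).__name__)


class GetSnapshots(Task):
    pass


class GenFeatures(Task):
    def requires(self):
        return [GetSnapshots()]


class TrainModel(Task):
    def requires(self):
        return [GenFeatures()]


def build(task, log, done=None):
    """Run dependencies depth-first, then the task; each class runs once."""
    done = set() if done is None else done
    name = type(task).__name__
    if name in done:
        return
    for dep in task.requires():
        build(dep, log, done)
    task.run(log)
    done.add(name)


order = []
build(TrainModel(), order)
```

Asking for the last task is enough: the runner walks back through the declared dependencies, so `order` ends up as snapshots, then features, then training.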
Scoring as a service
[Diagram: Application submission -> Website scraped -> Text scored -> Application reviewed; Thrift RPC to the Scoring Service]
Scoring as a service
    struct ScoringRequest {
      1: string text
      2: optional string model_name
    }

    struct ScoringResponse {
      1: double score
      // Experiments?
      2: double request_duration
    }
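The Thrift structs above can be mirrored in plain Python to show the request and response flow. This is a sketch only: it involves no Thrift transport, the `ScoringService` class and `keyword_model` function are hypothetical, and treating `request_duration` as seconds is an assumption, since the slides do not state the unit.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class ScoringRequest:
    text: str
    model_name: Optional[str] = None  # optional, as in the Thrift struct


@dataclass
class ScoringResponse:
    score: float
    request_duration: float  # assumed to be seconds


class ScoringService:
    """Dispatch a request to a named model and time the call."""

    def __init__(self, models: Dict[str, Callable[[str], float]], default: str):
        self._models = models
        self._default = default

    def score(self, request: ScoringRequest) -> ScoringResponse:
        start = time.perf_counter()
        # Fall back to the default model when no name is given.
        model = self._models[request.model_name or self._default]
        value = model(request.text)
        duration = time.perf_counter() - start
        return ScoringResponse(score=value, request_duration=duration)


def keyword_model(text: str) -> float:
    """Hypothetical stand-in for a trained classifier."""
    return 1.0 if "cigarette" in text else 0.0


service = ScoringService({"ecigs": keyword_model}, default="ecigs")
response = service.score(
    ScoringRequest(text="vapors electronic cigarette buy now")
)
```

Keeping `model_name` optional lets callers omit it in the common case while still allowing a specific model to be requested by name.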
Why a service?
- Same code base for training/scoring
- Reduced duplication/easier deploys
- Experimentation
- Log requests and responses (Parquet/Impala)
- Centralized monitoring (Graphite)
Summary
- Simple models with sklearn
- Pipelines with Luigi
- Scoring as a service
Thanks! @mlmanapat