Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

65
Reactive Feature Generation with Spark and MLlib Jeff Smith x.ai

Transcript of Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Page 1: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Reactive Feature Generation with Spark and MLlib

Jeff Smith x.ai

Page 2: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

x.ai is a personal assistant who schedules meetings for you

Page 3: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

M A N N I N G

Jeff Smith

Page 4: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

REACTIVE MACHINE LEARNING

Page 5: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Reactive Machine Learning

Page 6: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Machine Learning Systems

Page 7: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Traits of Reactive Systems

Page 8: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Responsive

Page 9: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Resilient

Page 10: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Elastic

Page 11: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Message-Driven

Page 12: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Reactive Strategies

Page 13: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Machine Learning Data

Page 14: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Machine Learning Data

Page 15: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Machine Learning Systems

Page 16: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

INTRODUCING FEATURES

Page 17: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Microblogging DataSquawks Squawkers Super

Page 18: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Feature Generation

Raw Data FeaturesFeature Generation Pipeline

Page 19: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

FEATURE TRANSFORMS

Page 20: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Feature Generation

Raw Data FeaturesFeature Generation Pipeline

Extract Transform

Page 21: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Features

trait FeatureType[V] { val name: String}

trait Feature[V] extends FeatureType[V] { val value: V}

Page 22: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Features

case class IntFeature(name: String, value: Int) extends Feature[Int]

case class BooleanFeature(name: String, value: Boolean) extends Feature[Boolean]

Page 23: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Named Transforms

def binarize(feature: IntFeature, threshold: Double) = { BooleanFeature("binarized-" + feature.name, feature.value > threshold)}

Page 24: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Non-Trivial Transformsdef categorize(thresholds: List[Double]) = { (rawFeature: DoubleFeature) => { IntFeature("categorized-" + rawFeature.name, thresholds.sorted .zipWithIndex .find { case (threshold, i) => rawFeature.value < threshold }.getOrElse((None, -1)) ._2) }}

Page 25: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Standardizing Namingtrait Named { def name(inputFeature: Feature[Any]) : String = { inputFeature.name + "-" + Thread.currentThread.getStackTrace()(3).getMethodName }}

object Binarizer extends Named { def binarize(feature: IntFeature, threshold: Double) = { BooleanFeature(name(feature), feature.value > threshold) }}

Page 26: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Lineages"cleaned-normalized-categorized-interactions"

interactions categorized normalized cleaned

Page 27: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

PIPELINES

Page 28: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Feature Generation

Raw Data FeaturesFeature Generation Pipeline

Page 29: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Multi-Stage Generation

Raw Data ExtractedExtract Transform Features

Page 30: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Pipeline Compositiondef extract(rawSquawks: RDD[JsonDocument]): RDD[IntFeature] = { ???} def transform(inputFeatures: RDD[Feature[Int]]): RDD[BooleanFeature] = { ???} val trainableFeatures = transform(extract(rawSquawks))

Page 31: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Feature Generation

Raw Data FeaturesFeature Generation Pipeline

Page 32: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Pipelines

Don’t orchestrate when you can compose

Page 33: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Pipeline Failure

Raw Data FeaturesFeature Generation Pipeline

Raw Data FeaturesFeature Generation Pipeline

Page 34: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Supervising Feature Generation

Raw Data FeaturesFeature Generation Pipeline

Supervision

Page 35: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Supervising Feature Generation

Raw Data FeaturesFeature Generation Pipeline

Reactive DB Drivers Cluster Managers Feature Validation

Page 36: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Reactive Database DriversCouchbase MongoDB Cassandra

Page 37: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Reactive Database Drivers

val rawSquawks: RDD[JsonDocument] = sc.couchbaseView( ViewQuery.from("squawks", "by_squawk_id")) .map(_.id) .couchbaseGet[JsonDocument]()

Page 38: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Cluster ManagersSpark Standalone Mesos YARN

Page 39: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Supervising Feature Generation

Raw Data FeaturesFeature Generation Pipeline

Reactive DB Drivers Cluster Managers Feature Validation

Page 40: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

FEATURE COLLECTIONS

Page 41: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Original Features

object SquawkLength extends FeatureType[Int]

object Super extends LabelType[Boolean]

val originalFeatures: Set[FeatureType] = Set(SquawkLength)val label = Super

Page 42: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Basic Features

object PastSquawks extends FeatureType[Int]

val basicFeatures = originalFeatures + PastSquawks

Page 43: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

More Features

object MobileSquawker extends FeatureType[Boolean]

val moreFeatures = basicFeatures + MobileSquawker

Page 44: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Feature Collections

case class FeatureCollection(id: Int, createdAt: DateTime, features: Set[_ <: FeatureType[_]], label: LabelType[_])

Page 45: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Feature Collectionsval earlierCollection = FeatureCollection(101, earlier, basicFeatures, label)

val latestCollection = FeatureCollection(202, now, moreFeatures, label)

val featureCollections = sc.parallelize( Seq(earlierCollection, latestCollection))

Page 46: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Feature Generation

Raw Data FeaturesFeature Generation Pipeline

Page 47: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Fallback Collections

val FallbackCollection = FeatureCollection(404, beginningOfTime, originalFeatures, label)

Page 48: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Fallback Collectionsdef validCollection(collections: RDD[FeatureCollection], invalidFeatures: Set[FeatureType[_]]) = { val validCollections = collections.filter( fc => !fc.features .exists(invalidFeatures.contains)) .sortBy(collection => collection.id) if (validCollections.count() > 0) { validCollections.first() } else FallbackCollection}

Page 49: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

VALIDATING FEATURES

Page 50: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Supervising Feature Generation

Raw Data FeaturesFeature Generation Pipeline

Reactive DB Drivers Cluster Managers Feature Validation

Page 51: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Predicting Super Squawkers

Page 52: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Training Instances

val instances = Seq( (123, Vectors.dense(0.2, 0.3, 16.2, 1.1), 0.0), (456, Vectors.dense(0.1, 1.3, 11.3, 1.2), 1.0), (789, Vectors.dense(1.2, 0.8, 14.5, 0.5), 0.0))

Page 53: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Selection Setup

val featuresName = "features"val labelName = "isSuper"

val instancesDF = sqlContext.createDataFrame(instances) .toDF("id", featuresName, labelName)

val K = 2

Page 54: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Feature Selection

val selector = new ChiSqSelector() .setNumTopFeatures(K) .setFeaturesCol(featuresName) .setLabelCol(labelName) .setOutputCol("selectedFeatures")

Page 55: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Feature Selection

val selector = new ChiSqSelector() .setNumTopFeatures(K) .setFeaturesCol(featuresName) .setLabelCol(labelName) .setOutputCol("selectedFeatures")

val selectedFeatures = selector.fit(instancesDF) .transform(instancesDF)

Page 56: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Back to RDDs

val labeledPoints = sc.parallelize(instances.map({ case (id, features, label) => LabeledPoint(label = label, features = features)}))

Page 57: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Validating Features

def validateSelection(labeledPoints: RDD[LabeledPoint], topK: Int, cutoff: Double) = { val pValues = Statistics.chiSqTest(labeledPoints) .map(result => result.pValue) .sorted pValues(topK) < cutoff}

Page 58: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Persisting Validationscase class ValidatedFeatureCollection(id: Int, createdAt: DateTime, features: Set[_ <: FeatureType[_]], label: LabelType[_], passedValidation: Boolean, cutoff: Double)

Page 59: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

SUMMARY

Page 60: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Machine Learning Systems

Page 61: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Traits of Reactive Systems

Page 62: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Reactive Strategies

Page 63: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Machine Learning Data

Page 64: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Feature Generation

Raw Data FeaturesFeature Generation Pipeline

Page 65: Reactive Feature Generation with Spark and MLlib by Jeffrey Smith (1)

Reactive Feature Generation with Spark and MLlib

reactivemachinelearning.com @jeffksmithjr @xdotai