Data Science Crash Course Hadoop Summit SJ

Robert Hryniewicz, Data Evangelist, @RobHryniewicz

Hands-on Intro to Data Science with Apache Spark

Crash Course

© Hortonworks Inc. 2011–2016. All Rights Reserved

Plan for Today
• Data Science & ML
• ML Examples
• Overview of ML methods
• K-means, Decision Trees & Random Forests
• Spark MLlib & ML
• Lab Overview


Data Science Examples


Predictive Analytics Pre-requisites

Sales Play 4: Predictive Analytics


Predictive Analytics Process and Tools


Machine Learning

“… science of how computers learn without being explicitly programmed” – Andrew Ng


Machine Learning Methods


Supervised vs Unsupervised Learning

Supervised learning: examples are labeled.

Unsupervised learning: examples are not labeled.


CLASSIFICATION
Identifying the category to which an object belongs.

Applications: spam detection, image recognition, ...

Algorithms: k-NN, decision trees, random forest, ...


REGRESSION
Predicting a continuous-valued attribute associated with an object.

Applications: drug response, stock prices, …

Algorithms: linear regression, …


CLUSTERING
Automatic grouping of similar objects into sets.

Applications: customer segmentation, topic modeling, …

Algorithms: k-means, LDA, …


COLLABORATIVE FILTERING
Fill in the missing entries of a user-item association matrix.

Applications: product recommendation, …

Algorithms: Alternating Least Squares (ALS)


DIMENSIONALITY REDUCTION
Reducing the number of random variables to consider.

Applications: visualization, increased efficiency, …

Algorithms: PCA, t-SNE, …


PREPROCESSING
Feature extraction and normalization.

Applications: transforming input data, such as text, into input for ML algorithms

Algorithms: TF-IDF, word2vec, one-hot encoding, …
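As a rough illustration of how text becomes numeric input, here is a minimal TF-IDF sketch in plain Python. The three sample documents are made up, and the smoothed IDF formula log((N + 1) / (df + 1)) is an assumption chosen for the sketch; library implementations may differ in detail.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        term_counts = Counter(doc)  # raw term frequency within this document
        weights.append({
            term: count * math.log((n_docs + 1) / (doc_freq[term] + 1))
            for term, count in term_counts.items()
        })
    return weights

# Hypothetical documents, tokenized on whitespace.
docs = [
    "spark makes machine learning simple".split(),
    "spark runs on hadoop".split(),
    "machine learning needs data".split(),
]
vectors = tf_idf(docs)
```

Terms that occur in fewer documents ("simple") receive a higher weight than terms shared across documents ("spark").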


MODEL SELECTION
Comparing, validating and choosing parameters and models.

Applications: improved accuracy via parameter tuning

Algorithms: grid search, metrics, …
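Grid search can be sketched in a few lines: try every combination of candidate parameter values and keep the one with the best validation score. Everything below is illustrative; the threshold "model", the parameter names, and the validation points are hypothetical stand-ins for a real estimator and dataset.

```python
from itertools import product

def accuracy(predict, data):
    """Fraction of (x, label) pairs that predict() gets right."""
    return sum(predict(x) == label for x, label in data) / len(data)

def grid_search(param_grid, build_model, validation_data):
    """Try every parameter combination; return the best (params, score)."""
    names = sorted(param_grid)
    best_params, best_score = None, -1.0
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = accuracy(build_model(**params), validation_data)
        if score > best_score:  # ties keep the earlier combination
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical model family: a one-feature threshold rule.
def build_threshold_model(threshold, positive_above):
    return lambda x: int((x > threshold) == positive_above)

validation = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
grid = {"threshold": [0.25, 0.5, 0.75], "positive_above": [True, False]}
best_params, best_score = grid_search(grid, build_threshold_model, validation)
```

The cost grows with the product of the grid sizes, which is why grids are usually kept small.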


Spark MLlib


Spark Machine Learning Library

• Clustering
  – k-means clustering
  – latent Dirichlet allocation (LDA)

• Dimensionality reduction
  – singular value decomposition (SVD)
  – principal component analysis (PCA)

• Feature Extractors & Transformers
  – word2vec

• Basic statistics
  – summary statistics
  – hypothesis testing
  – random number generation

• Classification and regression
  – linear models (SVMs, logistic & linear regression)
  – decision trees
  – ensembles of trees (Random Forests & GBTs)

• Collaborative filtering
  – alternating least squares (ALS)


K-Means Clustering (Unsupervised Learning)


Why K-Means

• Simple & fast algorithm to find clusters

• Common technique for anomaly detection

• Drawbacks
  – Doesn't work well with non-circular cluster shapes
  – Number of clusters and initial seed values need to be specified beforehand
  – Strong sensitivity to outliers and noise
  – Can get stuck in a local optimum


Initialize Cluster Centers

Randomly pick 3 cluster centers.


Assign Each Point

Assign each point to the nearest cluster center.


Recompute Cluster Centers

Move each cluster center to the mean of the points assigned to it.
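The three steps above (initialize, assign, recompute) can be sketched in plain Python. This is a minimal single-machine illustration, not Spark's implementation; the fixed iteration count, the seed, and the function names are assumptions made for the sketch. The six points are the ones on the deck's "Sample Dataset – K-Means" slide.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def kmeans(points, k, iterations=10, seed=42):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # 1. pick k random points as initial centers
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:  # 2. assign each point to its nearest center
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # 3. move each center to the mean of its points (keep it if the cluster is empty)
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

sample_points = [
    (0.0, 0.0, 0.0), (0.1, 0.1, 0.1), (0.2, 0.2, 0.2),
    (3.0, 3.0, 3.0), (3.1, 3.1, 3.1), (3.2, 3.2, 3.2),
]
centers, clusters = kmeans(sample_points, k=2)
```

On this well-separated data the centers settle near (0.1, 0.1, 0.1) and (3.1, 3.1, 3.1). On harder data the result depends on the initial centers, which is the local-optimum drawback listed earlier.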


K-means Clustering


San Francisco


Outline Each Neighborhood


Folium: choropleth map


SF Neighborhood Centers Calculated with K-Means


Sample Dataset – K-Means

0.0, 0.0, 0.0
0.1, 0.1, 0.1
0.2, 0.2, 0.2
3.0, 3.0, 3.0
3.1, 3.1, 3.1
3.2, 3.2, 3.2


Decision Trees & Random Forests (Supervised Learning)


Why Decision Trees?

• Simple to understand and interpret. (And to explain to executives.)

• Require little data preparation. (Other techniques often require data normalisation, creation of dummy variables, and removal of blank values.)

• Perform well with large datasets.


Visual Intro to Decision Trees

• http://www.r2d3.us/visual-intro-to-machine-learning-part-1


Random Forest (Ensemble Model)

• Main idea: build an ensemble of simple decision trees
• Each tree is simple and less likely to overfit
• Classify/predict by voting between all trees
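To make the idea concrete, here is a small sketch that trains depth-1 trees ("stumps") on bootstrap samples and classifies by majority vote. It is a deliberate simplification: a real random forest grows deeper trees and also samples a random subset of features at each split. The training rows, function names, and parameters are hypothetical.

```python
import random
from collections import Counter

def train_stump(rows):
    """Fit a depth-1 tree: choose the (feature, threshold) split with the fewest errors."""
    best = None
    for f in range(len(rows[0][0])):
        for features, _ in rows:
            t = features[f]  # candidate threshold taken from the data
            left = [label for x, label in rows if x[f] <= t]
            right = [label for x, label in rows if x[f] > t]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = (sum(l != left_label for l in left)
                      + sum(l != right_label for l in right))
            if best is None or errors < best[0]:
                best = (errors, f, t, left_label, right_label)
    _, f, t, left_label, right_label = best
    return lambda x: left_label if x[f] <= t else right_label

def train_forest(rows, n_trees=25, seed=42):
    """Train each stump on a bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    return [train_stump([rng.choice(rows) for _ in rows]) for _ in range(n_trees)]

def predict(forest, x):
    """Classify by majority vote across all trees."""
    return Counter(tree(x) for tree in forest).most_common(1)[0][0]

# Hypothetical training rows: ((feature_1, feature_2), label)
rows = [((0.1, 0.9), 0), ((0.2, 0.8), 0), ((0.3, 0.7), 0), ((0.4, 0.95), 0),
        ((0.6, 0.2), 1), ((0.7, 0.1), 1), ((0.8, 0.3), 1), ((0.9, 0.15), 1)]
forest = train_forest(rows)
```

Because every stump sees a different bootstrap sample, individual trees disagree near the class boundary, and the vote smooths those disagreements out.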


Decision Tree vs Random Forest


Why Do Ensembles Work?

Overcome limitations of a single hypothesis

[Figure: Decision Tree vs Model Averaging]
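One way to see why averaging helps is a back-of-the-envelope calculation. If each of n models is right with probability p and the models err independently, the majority vote is right with probability given by a binomial tail sum. The independence assumption is a strong one (trees trained on the same data are correlated), so treat the numbers below as an upper-bound intuition rather than a prediction.

```python
from math import comb

def majority_vote_accuracy(n_models, p_correct):
    """P(majority of n independent models is correct); n_models should be odd."""
    needed = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p_correct ** k * (1 - p_correct) ** (n_models - k)
        for k in range(needed, n_models + 1)
    )

single = majority_vote_accuracy(1, 0.6)
three = majority_vote_accuracy(3, 0.6)
many = majority_vote_accuracy(101, 0.6)
```

With p = 0.6, three voters already beat a single model, and the advantage grows as more independent voters are added.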


Diabetes Dataset – Decision Trees / Random Forest

Labeled set with 8 features

-1 1:-0.294118 2:0.487437 3:0.180328 4:-0.292929 5:-1 6:0.00149028 7:-0.53117 8:-0.0333333
+1 1:-0.882353 2:-0.145729 3:0.0819672 4:-0.414141 5:-1 6:-0.207153 7:-0.766866 8:-0.666667
-1 1:-0.0588235 2:0.839196 3:0.0491803 4:-1 5:-1 6:-0.305514 7:-0.492741 8:-0.633333
+1 1:-0.882353 2:-0.105528 3:0.0819672 4:-0.535354 5:-0.777778 6:-0.162444 7:-0.923997 8:-1
-1 1:-1 2:0.376884 3:-0.344262 4:-0.292929 5:-0.602837 6:0.28465 7:0.887276 8:-0.6
+1 1:-0.411765 2:0.165829 3:0.213115 4:-1 5:-1 6:-0.23696 7:-0.894962 8:-0.7
-1 1:-0.647059 2:-0.21608 3:-0.180328 4:-0.353535 5:-0.791962 6:-0.0760059 7:-0.854825 8:-0.833333

...
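Each record above appears to follow the sparse LIBSVM text layout: a label followed by index:value pairs with 1-based indices. Assuming that layout, a record can be parsed into a label and a dense feature list as sketched below; the function name and the default of 8 features are choices made for this example.

```python
def parse_libsvm_line(line, n_features=8):
    """Parse '<label> <index>:<value> ...' into (label, dense feature list)."""
    parts = line.split()
    label = int(parts[0])          # int("+1") == 1 and int("-1") == -1
    features = [0.0] * n_features  # indices missing from the line stay 0.0
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index) - 1] = float(value)  # convert 1-based index to 0-based
    return label, features

# Second record from the slide.
line = ("+1 1:-0.882353 2:-0.145729 3:0.0819672 4:-0.414141 "
        "5:-1 6:-0.207153 7:-0.766866 8:-0.666667")
label, features = parse_libsvm_line(line)
```

All feature values in the sample lie in [-1, 1], which suggests the data was scaled before it was written out.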


Machine Learning in Spark


Spark Ecosystem

Spark Core

Spark SQL, Spark Streaming, MLlib, GraphX


Machine Learning with Spark (MLlib & ML)

MLlib
• Original “lower-level” API
• Built on top of RDDs
• Maintenance mode starting with Spark 2.0

ML
• Newer “higher-level” API for constructing workflows
• Built on top of DataFrames

The algorithms in both APIs are implemented to take advantage of data parallelism


Supervised Learning: End-to-End Flow

[Diagram]
Training (batch): data items and labels → feature extraction → feature matrix (training set) → train the model → model
Predicting (real time or batch): data item → feature extraction → feature vector → model → predicted label


Spark ML: Spark API for building ML pipelines

[Diagram]
Pipeline stages: Feature transform 1, Feature transform 2 → Combine features → Random Forest
Input DataFrame (TRAIN) → Pipeline → PipelineModel
Input DataFrame (TEST) → PipelineModel → Output DataFrame (PREDICTIONS)


Spark ML Pipeline

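• A pipeline provides both fit() and transform() methods
  – fit() is for training: it takes the training DataFrame and returns a PipelineModel
  – transform() is for prediction: the PipelineModel takes the test DataFrame and returns a DataFrame of predictions

[Diagram]
Input DataFrame (TRAIN) → Pipeline → fit() → PipelineModel
Input DataFrame (TEST) → PipelineModel → transform() → Output DataFrame (PREDICTIONS)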

model = pipe.fit(trainData)           # Train model
results = model.transform(testData)   # Test model
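The fit()/transform() contract can be illustrated without Spark. The toy classes below are hypothetical and only mimic the shape of the API: fitting a pipeline walks the stages in order, trains any stage that has a fit() method on the data produced so far, and returns a fitted model that only knows how to transform().

```python
class Scaler:
    """Toy transformer: nothing to learn, just multiplies every value."""
    def __init__(self, factor):
        self.factor = factor
    def transform(self, rows):
        return [r * self.factor for r in rows]

class Shift:
    """Toy fitted model: adds a fixed offset learned during training."""
    def __init__(self, offset):
        self.offset = offset
    def transform(self, rows):
        return [r + self.offset for r in rows]

class MeanCenterer:
    """Toy estimator: fit() learns the training mean and returns a Shift model."""
    def fit(self, rows):
        return Shift(-sum(rows) / len(rows))

class ToyPipelineModel:
    def __init__(self, stages):
        self.stages = stages
    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows

class ToyPipeline:
    def __init__(self, stages):
        self.stages = stages
    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            # Estimators are trained on the data as transformed so far.
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)
            fitted.append(model)
        return ToyPipelineModel(fitted)

pipe = ToyPipeline([Scaler(10), MeanCenterer()])
model = pipe.fit([1.0, 2.0, 3.0])   # training mean after scaling is 20.0
results = model.transform([4.0])    # 4.0 * 10 - 20.0
```

Returning a separate fitted object keeps the training-time state (here, the mean) frozen, so test data is always transformed with parameters learned from the training data only.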


Spark ML – Simple Random Forest Example

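from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import HashingTF, StringIndexer, Tokenizer, VectorAssembler

# trainData and testData are assumed to be DataFrames with "district", "text-field" and "label" columns.
# Index the categorical column, then tokenize and hash the free-text column.
indexer = StringIndexer(inputCol="district", outputCol="dis-inx")
parser = Tokenizer(inputCol="text-field", outputCol="words")
hashingTF = HashingTF(numFeatures=50, inputCol="words", outputCol="hash-inx")

# Combine both feature columns into the single vector column the classifier reads.
vecAssembler = VectorAssembler(inputCols=["dis-inx", "hash-inx"], outputCol="features")

rf = RandomForestClassifier(numTrees=100, labelCol="label", seed=42)
pipe = Pipeline(stages=[indexer, parser, hashingTF, vecAssembler, rf])

model = pipe.fit(trainData)           # Train model
results = model.transform(testData)   # Test model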
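The HashingTF stage in the example relies on the hashing trick: each word is hashed to one of a fixed number of buckets, and the bucket counts become the feature vector. The sketch below shows the idea in plain Python. It uses zlib.crc32 as a stand-in hash purely for illustration, so bucket indices will not match what Spark produces.

```python
import zlib

def hashing_tf(words, num_features=50):
    """Map a list of words to a fixed-length vector of bucket counts."""
    vector = [0] * num_features
    for word in words:
        # Stand-in hash (an assumption of this sketch, not Spark's hash function).
        index = zlib.crc32(word.encode("utf-8")) % num_features
        vector[index] += 1
    return vector

words = "the quick brown fox jumps over the lazy dog".split()
vector = hashing_tf(words)
```

The vector length stays fixed regardless of vocabulary size. The trade-off is that different words can collide in the same bucket, which becomes more likely as num_features gets smaller.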


Apache Zeppelin – A Modern Web-based Data Science Studio

• Data exploration and discovery

• Visualization

• Deeply integrated with Spark and Hadoop

• Pluggable interpreters

• Multiple languages in one notebook: R, Python, Scala


Exporting ML Models - PMML

• Predictive Model Markup Language (PMML)
• Supported models
  – K-Means
  – Linear Regression
  – Ridge Regression
  – Lasso
  – SVM
  – Binary


Additional Resources

• Machine Learning
• Natural Language Processing (NLP)
• Scalable Machine Learning
• Introduction to Statistics


Lab Overview
tinyurl.com/hwx-intro-to-ml-with-spark


Hortonworks Community Connection

Read access for everyone, join to participate and be recognized

• Full Q&A Platform (like StackOverflow)

• Knowledge Base Articles

• Code Samples and Repositories

Robert Hryniewicz, @RobHryniewicz

Thanks!
