Apache Spark MLlib Guide for Pipelining (sschung/CIS612)

Posted: 22-May-2020


  • Here is the Apache Spark MLlib guide for pipelining:

    http://spark.apache.org/docs/latest/ml-pipeline.html

    An example pipeline can include the following stages: Tokenizer, HashingTF, and LogisticRegression.

    The first two, Tokenizer and HashingTF, are transformers, which implement the method transform() that converts one DataFrame into another by appending one or more columns.

    A feature transformer might take a DataFrame, read a column (e.g., text), map it into a new column (e.g., feature vectors), and output a new DataFrame with the mapped column appended.

    A learning model might take a DataFrame, read the column containing feature vectors, predict the label for each feature vector, and output a new DataFrame with predicted labels appended as a column.

    Logistic Regression is an Estimator. An Estimator abstracts the concept of a learning algorithm, or any algorithm that fits or trains on data. Technically, an Estimator implements a method fit(), which accepts a DataFrame and produces a Model, which is a Transformer. For example, a learning algorithm such as LogisticRegression is an Estimator, and calling fit() trains a LogisticRegressionModel, which is a Model and hence a Transformer (from the Apache Spark pipeline website).

    The bottom line is the workflow of data through the pipeline, stage by stage: we begin with rawText; after the Tokenizer we have words; after HashingTF we have feature vectors; and once we fit the model on the training set and apply it to the test set, we arrive at our predictions.

    This introduces both speed and efficiency. The methodology provides a flexible way to transform text columns into feature vectors, outputting a final DataFrame with the mapped feature-vector column appended.
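    The Transformer/Estimator contract above can be sketched in plain Python. This is a toy illustration, not the pyspark API: a "DataFrame" here is just a list of row dicts, and the estimator is an invented majority-class stand-in. It shows the two key ideas from the text: transformers append columns, and an estimator's fit() returns a model that is itself a transformer.

    ```python
    def tokenize(rows, in_col="rawText", out_col="words"):
        # Transformer stage 1: split raw text into words.
        return [{**r, out_col: r[in_col].lower().split()} for r in rows]

    def hashing_tf(rows, in_col="words", out_col="features", num_features=16):
        # Transformer stage 2: hashing trick -- each word is hashed into one
        # of num_features buckets; the vector holds term counts per bucket.
        out = []
        for r in rows:
            vec = [0] * num_features
            for word in r[in_col]:
                vec[hash(word) % num_features] += 1
            out.append({**r, out_col: vec})
        return out

    class MajorityEstimator:
        # Stand-in estimator: fit() inspects the data and produces a Model.
        def fit(self, rows, label_col="label"):
            labels = [r[label_col] for r in rows]
            return MajorityModel(max(set(labels), key=labels.count))

    class MajorityModel:
        # The fitted model is itself a transformer: it appends a
        # prediction column to whatever "DataFrame" it is given.
        def __init__(self, majority):
            self.majority = majority

        def transform(self, rows, out_col="prediction"):
            return [{**r, out_col: self.majority} for r in rows]

    # Wire the stages together, mirroring the workflow described above:
    # rawText -> words -> feature vectors -> predictions.
    training = [
        {"rawText": "spark is fast", "label": 1.0},
        {"rawText": "hadoop mapreduce", "label": 0.0},
        {"rawText": "spark streaming rocks", "label": 1.0},
    ]
    featurized = hashing_tf(tokenize(training))
    model = MajorityEstimator().fit(featurized)
    predicted = model.transform(featurized)
    ```

    Each stage takes rows in and hands rows out with one column added, which is why stages compose so naturally into a pipeline.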

  • Our model:

    We have a DataFrame consisting of tweet_text (cleaned) and sentiment, produced by our sentiment algorithm. The pipeline transforms the tweet_text column into feature vectors, predicts the label (sentiment) for each feature vector, and outputs a new DataFrame with the predicted labels appended as a column (prediction).
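    A sketch of how this model could be wired up with pyspark's Pipeline API. The column names tweet_text and sentiment come from the description above; the sample rows and parameter values (numFeatures, maxIter) are invented for illustration, and the real data would come from the cleaning and sentiment-labelling steps.

    ```python
    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import HashingTF, Tokenizer
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.master("local[1]").appName("sentiment").getOrCreate()

    # Hypothetical stand-in for the cleaned, sentiment-labelled tweets.
    train = spark.createDataFrame(
        [("great product love it", 1.0),
         ("terrible waste of money", 0.0),
         ("love the spark updates", 1.0),
         ("awful terrible service", 0.0)],
        ["tweet_text", "sentiment"])

    # The three stages from the text: tweet_text -> words -> features -> prediction.
    tokenizer = Tokenizer(inputCol="tweet_text", outputCol="words")
    hashing_tf = HashingTF(inputCol="words", outputCol="features", numFeatures=1024)
    lr = LogisticRegression(featuresCol="features", labelCol="sentiment", maxIter=10)

    pipeline = Pipeline(stages=[tokenizer, hashing_tf, lr])
    model = pipeline.fit(train)           # fit() returns a PipelineModel (a Transformer)
    predictions = model.transform(train)  # appends the "prediction" column
    ```

    In practice the fit would run on a held-out training split and transform() on the test split, exactly as the workflow above describes.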