apache-spark - RIP Tutorial

Table of Contents

1: Getting started with apache-spark
    Examples
2: Apache Spark DataFrames
    Examples: Spark DataFrames with JAVA; Spark Dataframe
3: Scala
    Examples: textFile
4: Spark DataFrame
    Examples: DataFrame in Scala; toDF; createDataFrame
5: Spark Launcher
    Examples: SparkLauncher
6: Spark SQL
    Examples
7: Spark Streaming
    Examples: PairDStreamFunctions.updateStateByKey; PairDStreamFunctions.mapWithState
8: Calling Scala jobs from pyspark
    Examples: python RDD to Scala code; spark-submit
9: Migrating from Spark 1.6 to Spark 2.0
    Examples: build.sbt; ML Vector
10: …
    Examples: Scala; Python
11: …
    Examples: RDD
12: …
    Examples: Spark
13: …
    Examples: Scala + JUnit
14: Spark and JSON
    Examples: JSON with Gson
15: Apache Spark …
    Examples
16: …
    Examples: Spark Client and Cluster modes
17: Apache Spark SQL
    Examples: Spark SQL Shuffle
18: 'sparkR' / '.bin\sparkR' not recognized
    Examples: Spark for R
You can share this PDF with anyone you feel could benefit from it; download the latest version from: http://riptutorial.com/ebook/apache-spark

It is an unofficial and free apache-spark ebook created for educational purposes. All the content is extracted from Stack Overflow Documentation, which is written by many hardworking individuals at Stack Overflow. It is neither affiliated with Stack Overflow nor with official apache-spark.

The content is released under Creative Commons BY-SA, and the list of contributors to each chapter is provided in the credits section at the end of this book. Images may be copyright of their respective owners unless otherwise specified. All trademarks and registered trademarks are the property of their respective company owners.

Use the content presented in this book at your own risk; it is not guaranteed to be correct or accurate. Please send your feedback and corrections to info@zzzprojects.com.

    https://riptutorial.com/zh-CN/home 1

Ebook: http://riptutorial.com/ebook/apache-spark
Documentation archive: https://archive.org/details/documentation-dump.7z
Feedback: info@zzzprojects.com

1: Getting started with apache-spark

Apache Spark is an open-source cluster-computing framework for large-scale data processing. This chapter covers getting started with apache-spark.

Versions

Version   Release date
2.2.0     2017-07-11
2.1.1     2017-05-02
2.1.0     2016-12-28
2.0.1     2016-10-03
2.0.0     2016-07-26
1.6.0     2016-01-04
1.5.0     2015-09-09
1.4.0     2015-06-11
1.3.0     2015-03-13
1.2.0
1.1.0     2014-09-11
1.0.0
0.9.0     2014-02-02
0.8.0     2013-09-25
0.7.0     2013-02-27
0.6.0     2012-10-15

Examples

aggregate: zeroValue, seqOp and combOp

aggregate() lets you take an RDD and generate a single value that is of a different type than what was stored in the original RDD. It takes three parameters:

1. zeroValue: the initialization value for your result, in the desired format.
2. seqOp: the operation applied to the elements of a partition; it is applied once per element within the partition.
3. combOp: the operation that combines the per-partition results of seqOp into the final result.

Example: compute the sum of a list together with the length of that list, returning the pair (sum, length).

In a Spark shell, first create a list with 4 elements, distributed over 2 partitions:

    listRDD = sc.parallelize([1,2,3,4], 2)

Then define seqOp:

    seqOp = (lambda local_result, list_element: (local_result[0] + list_element, local_result[1] + 1) )

And combOp:

    combOp = (lambda some_local_result, another_local_result: (some_local_result[0] + another_local_result[0], some_local_result[1] + another_local_result[1]) )

listRDD.aggregate( (0, 0), seqOp, combOp)
Out[8]: (10, 4)

Focus on the first partition, [1, 2]. seqOp is applied to each of its elements, maintaining a local result of the form (sum, length).

local_result is initialized to the zeroValue we provided to aggregate(), i.e. (0, 0), and list_element is the first element of the partition:

0 + 1 = 1
0 + 1 = 1

The local result is (1, 1): after processing the first element, the running sum is 1 and the running length is 1. local_result went from (0, 0) to (1, 1). Then for the second element:

1 + 2 = 3
1 + 1 = 2

so the local result of the first partition is (3, 2). In the same way, the second partition [3, 4] produces (7, 2). combOp then combines the local results:

(3, 2) + (7, 2) = (10, 4)

Describing the whole process with a 'figure':

            (0, 0) <-- zeroValue

[1, 2]                  [3, 4]

0 + 1 = 1               0 + 3 = 3
0 + 1 = 1               0 + 1 = 1

1 + 2 = 3               3 + 4 = 7
1 + 1 = 2               1 + 1 = 2
    |                       |
    v                       v
  (3, 2)                  (7, 2)
      \                    /
       \                  /
        \                /
         \              /
           ------------
           |  combOp  |
           ------------
                 |
                 v
              (10, 4)
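Because seqOp and combOp are ordinary functions, the partition-by-partition walkthrough above can be checked in plain Python, without a cluster. The aggregate function below is a local stand-in written for illustration, not a Spark API:

```python
from functools import reduce

def aggregate(partitions, zero_value, seq_op, comb_op):
    # Fold each partition with seq_op starting from zero_value,
    # then merge the per-partition results with comb_op.
    local_results = [reduce(seq_op, partition, zero_value) for partition in partitions]
    return reduce(comb_op, local_results)

seq_op = lambda local_result, list_element: (local_result[0] + list_element,
                                             local_result[1] + 1)
comb_op = lambda a, b: (a[0] + b[0], a[1] + b[1])

# Same data and partitioning as sc.parallelize([1,2,3,4], 2)
result = aggregate([[1, 2], [3, 4]], (0, 0), seq_op, comb_op)
# result == (10, 4), matching the Spark output above
```

Note that Spark only guarantees the same answer for any partitioning when combOp is commutative and associative, since partition results may be combined in any order.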

Spark works on the concept of lazy evaluation: it does not do any work until it really has to. A transformation only records what should be done; the computation happens when an action is invoked. This lets Spark avoid unnecessary work and optimize the whole chain of operations.

In [1]: lines = sc.textFile(file)   # will run instantly, regardless of the file's size
In [2]: errors = lines.filter(lambda line: line.startswith("error"))   # run instantly
In [3]: errorCount = errors.count()   # an action occurred, let the party start!
Out[3]: 0   # no line with 'error', in this example

In [1], we told Spark to read the file into an RDD named lines. Spark heard us and answered "yes, I will do it", but in fact it did not read the file yet; lines is only a "pointer" to it.

In [2], we define a new RDD, errors, holding the elements of lines that start with the word error. Again, nothing is computed; Spark only records that errors derives from lines by a filter.

In [3], we ask Spark to count the elements of the RDD errors. count() is an action, which leaves Spark no choice but to actually perform the operations, so that it can produce the result of count().

As a result, when [3] is reached, [1] and [2] are actually performed. Then, and only then:

1. the file is read by textFile() (because of [1]);
2. lines is filter()'ed (because of [2]);
3. count() executes (because of [3]).

(See transformations and actions: http://spark.apache.org/docs/latest/programming-guide.html#transformations)

Laziness lets Spark optimize across the whole chain [1]/[2]/[3]. For example, as soon as startswith() in [2] rejects a line, Spark no longer needs it for [3]; Spark may even decide to merge the reading of [1] with the filtering of [2], so that the file is traversed only once.

One note about [3]: if we invoke count() again later, lines and errors will be recomputed from scratch, since RDDs are not kept in memory by Spark after the action completes. To avoid recomputing an RDD that is reused, persist it with cache().
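The value of materializing a lazy pipeline can be seen outside Spark with a loose plain-Python analogy (this is not Spark API): a generator pipeline is lazy and can only be consumed once, while a materialized list, playing the role of a cache()d RDD, can be counted repeatedly. The analogy is imperfect: Spark would silently recompute the correct answer, while the generator simply exhausts.

```python
data = ["error: disk full", "ok", "error: timeout"]

# Lazy "pipeline": the filter runs only when something iterates over it,
# much like a Spark transformation waiting for an action.
errors = (line for line in data if line.startswith("error"))

first_count = sum(1 for _ in errors)   # triggers the work: 2
second_count = sum(1 for _ in errors)  # generator already exhausted: 0

# Materializing once -- loosely analogous to errors.cache() -- makes
# repeated counting cheap and correct.
cached_errors = [line for line in data if line.startswith("error")]
repeat_count = len(cached_errors)      # 2, every time
```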

Checking the Spark version

In spark-shell:


sc.version

Or, from a program, via SparkContext:

SparkContext.version

Using spark-submit:

    spark-submit --version

Read Getting started with apache-spark online: https://riptutorial.com/zh-CN/apache-spark/topic/833/apache-spark

2: Apache Spark DataFrames

Examples

Spark DataFrames with JAVA

A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database. DataFrames can be constructed from structured data files, Hive tables, external databases, or existing RDDs.

Reading an Oracle RDBMS table into Spark:

SparkConf sparkConf = new SparkConf().setAppName("SparkConsumer");
sparkConf.registerKryoClasses(new Class[]{
    Class.forName("org.apache.hadoop.io.Text"),
    Class.forName("packageName.className")
});
JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);
SQLContext sqlcontext = new SQLContext(sparkContext);

Map<String, String> options = new HashMap<>();
options.put("driver", "oracle.jdbc.driver.OracleDriver");
options.put("url", "jdbc:oracle:thin:username/password@host:port:orcl"); // oracle url to connect
options.put("dbtable", "DbName.tableName");

DataFrame df = sqlcontext.load("jdbc", options);
df.show(); // this will print content in tabular format
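For comparison, a hypothetical PySpark sketch of the same JDBC read. The connection details are placeholders taken from the Java example above, and it assumes an existing SQLContext named sqlContext with the Oracle JDBC driver on the classpath:

```python
# Sketch only: requires a running SparkContext/SQLContext and an Oracle driver.
df = sqlContext.read.format("jdbc").options(
    driver="oracle.jdbc.driver.OracleDriver",
    url="jdbc:oracle:thin:username/password@host:port:orcl",  # placeholder URL
    dbtable="DbName.tableName",                               # placeholder table
).load()
df.show()  # prints the contents in tabular format
```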

To convert the DataFrame back to an RDD:

JavaRDD<Row> rdd = df.javaRDD();

public class LoadSaveTextFile {
    // static schema class
    public static class Schema implements Serializable {
        public String getTimestamp() {
            return timestamp;
        }
        public void setTimestamp(String timestamp) {
            this.timestamp = timestamp;
        }
        public String getMachId() {
            return machId;
        }
        public void setMachId(String machId) {
            this.machId = machId;
        }
        public String getSensorType() {
            return sensorType;
        }
        public void setSensorType(String sensorType) {
            this.sensorType = sensorType;
        }

        // instance variables
        private String timestamp;
        private String machId;
        private String sensorType;
    }

    public static void main(String[] args) throws ClassNotFoundException {
        SparkConf sparkConf = new SparkConf().setAppName("SparkConsumer");
        sparkConf.registerKryoClasses(new Class[]{
            Class.forName("org.apache.hadoop.io.Text"),