
Big Data for Engineers – Exercises

    Spring 2019 – Week 9 – ETH Zurich

    Spark + MongoDB

    1. Spark DataFrames + SQL

1.1 Set up the Spark cluster on Azure

Create a cluster

Sign in to the Azure portal (portal.azure.com). Search for "HDInsight clusters" using the search box at the top and click "+ Add". Give the cluster a unique name. Under "Select Cluster Type", choose Spark and the Standard cluster tier (finish by pressing "Select"). In step 2, the container name will be filled in for you automatically; if you want to do the exercise sheet in several sittings, change it to something you can remember or write it down. Set up a Spark cluster with the default configuration. It should cost around 3.68 CHF/h. Wait about 20 minutes for your cluster to be ready.

    Important

    Remember to delete the cluster once you are done. If you want to stop doing the exercises at any point, delete it and recreate it using the same container name as you used the first time, so that the resources are still there.

Access your cluster

Make sure you can access your cluster (the NameNode) via SSH:

$ ssh <ssh-user-name>@<cluster-name>-ssh.azurehdinsight.net

If you are using Linux or macOS, you can use your standard terminal. If you are using Windows, you can use one of the following:

- the PuTTY SSH client and the PSCP tool (see the PuTTY download link below)
- this notebook server's terminal (click on the Jupyter logo, then go to New -> Terminal)
- the Azure cloud terminal (see the HBase exercise sheet for details)

    The cluster has its own Jupyter server. We will use it. You can access it through the following link:

https://<cluster-name>.azurehdinsight.net/jupyter

You can access the cluster's YARN UI in your browser:

https://<cluster-name>.azurehdinsight.net/yarnui/hn/cluster

The Spark UI can be accessed via the Azure portal; see the Spark job debugging link below.

You need to upload this notebook to your cluster's Jupyter in order to execute the Python code blocks. To do this, just open Jupyter through the link given above and use the "Upload" button.

1.2. The Great Language Game

This week you will again be using the language confusion dataset. You will write queries with Spark DataFrames and SQL. You will have to submit the results of this exercise to Moodle to obtain the weekly bonus. For each query you will need:

- the query you wrote
- something related to its output (which you will be graded on)
- the time it took you to write it
- the time it took you to run it

As you might have observed in the sample queries above, the time a job took to run is displayed in the rightmost column of its output. If a job consists of several stages, however, you will need the sum of their times. The easiest approach is to just take the execution time of the whole query.

Links:

- PuTTY and PSCP: http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html
- Spark job debugging: https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-job-debugging
- The Great Language Game dataset: http://lars.yencken.org/datasets/languagegame/

Of course, you will not be evaluated on the time it took you to write the queries (nor on the time it took them to run), but this is useful to us in order to measure the increase in performance when using Sparksoniq. There is a cell that outputs the time you started working before every query. Use this if you find it useful.

    For this exercise, we strongly suggest that you use the Azure cluster as described above.

    Log in to your cluster using SSH as explained above and run the following commands:

    wget http://data.greatlanguagegame.com.s3.amazonaws.com/confusion-2014-03-02.tbz2

    tar -jxvf confusion-2014-03-02.tbz2 -C /tmp

    hdfs dfs -copyFromLocal /tmp/confusion-2014-03-02/confusion-2014-03-02.json /confusion.json

This downloads the archive file to the cluster, decompresses it, and uploads it to HDFS. Now, create an RDD from the file containing the entries:

    In [ ]:

    data = sc.textFile('wasb:///confusion.json')
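As a quick sanity check (a sketch, not part of the exercise), you can peek at one raw line of the RDD; note that counting the entries scans the whole file:

In [ ]:

# Optional: look at a single raw JSON line (one game entry per line).
data.take(1)

# Optional: count the entries -- this triggers a full pass over the file.
data.count()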

Last week you loaded the JSON data with the following snippet:

    In [ ]:

import json

entries = data.map(json.loads)
type(entries)

    This week you will use DataFrames:

    In [ ]:

entries_df = spark.read.json(data).cache()
type(entries_df)
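The cache() call is lazy: the DataFrame is only cached once an action runs over it. If you want to materialize the cache right away, you can force an action such as a count (a sketch, not required):

In [ ]:

# Optional: force evaluation so the cached DataFrame is materialized.
entries_df.count()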

    You can check the schema by executing the following code:

    In [ ]:

    entries_df.printSchema()
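Based on the queries below, you can expect top-level fields such as choices (an array of possible answers), country, date, guess, sample, and target; the authoritative list is whatever printSchema reports. As a sketch, you could look at a few of these columns directly (column names assumed from the exercise questions below):

In [ ]:

# Peek at a few columns (names assumed from the exercise questions;
# adjust them to whatever printSchema actually shows).
entries_df.select("guess", "target", "country", "date").show(5)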

You can register the data as a temporary table with the following code:

    In [ ]:

    entries_df.registerTempTable("entries")
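On Spark 2.x, registerTempTable is deprecated in favour of createOrReplaceTempView; both make the DataFrame queryable under the name entries, so the following is an equivalent alternative:

In [ ]:

# Equivalent, non-deprecated call on Spark 2.x.
entries_df.createOrReplaceTempView("entries")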

    Now, you can use normal SQL, with sql magic (%%sql), to perform queries on the table entries. For example:

    In [ ]:

%%sql
SELECT * FROM entries WHERE country == "CH"
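The same filter can also be expressed with the DataFrame API instead of SQL (a sketch, not required for the graded queries):

In [ ]:

# Equivalent query using the DataFrame API.
entries_df.filter(entries_df.country == "CH").show(5)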

    Good! Let's get to work. A few last things:

- This week, you should not have issues with the output being too long, since the sql magic limits its size automatically.
- Remember to delete the cluster if you want to stop working! You can recreate it using the same container name and your resources will still be there.

    And now to the actual queries:

    1. Find all games such that the guessed language is correct (=target), and such that this language is Russian.

    In [ ]:

from datetime import datetime

# Started working:
print(datetime.now().time())

    In [ ]:

    %%sql

    2. List all chosen answers to games where the guessed language is correct (=target).

    In [ ]:

# Started working:
print(datetime.now().time())

    In [ ]:

    %%sql

    3. Find all distinct values of languages (the target field).

    In [ ]:

# Started working:
print(datetime.now().time())

    In [ ]:

    %%sql

    4. Return the top three games where the guessed language is correct (=target) ordered by language (ascending), then country (ascending), then date (ascending).

    In [ ]:

# Started working:
print(datetime.now().time())

    In [ ]:

    %%sql

    5. Aggregate all games by country and target language, counting the number of guesses for each pair (country, target).

    In [ ]:

# Started working:
print(datetime.now().time())

    In [ ]:

    %%sql

    6. Find the overall percentage of correct guesses when the first answer (amongst the array of possible answers) was the correct one.

    In [ ]:

# Started working:
print(datetime.now().time())

    In [ ]:

    %%sql

    7. Sort the languages by increasing overall percentage of correct guesses.

    In [ ]:

# Started working:
print(datetime.now().time())

    In [ ]:


    %%sql

    8. Group the games by the index of the correct answer in the choices array and output all counts.

The following code snippet will create a user-defined SQL function, which you can use in your SQL queries. You may call it in your queries as array_position(x, y), where x is an array (for example an entry of the column choices) and y is the value whose position/index in the array you want to find.

    In [ ]:

    spark.udf.register("array_position", lambda x,y: x.index(y))
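As a sketch of how the UDF can be called (not one of the graded queries; note that a Python UDF registered without an explicit return type returns its result as a string):

In [ ]:

# Hypothetical usage of the registered UDF: index of the target language
# within the choices array for a few entries.
spark.sql("""
    SELECT choices, target, array_position(choices, target) AS target_index
    FROM entries
    LIMIT 5
""").show(truncate=False)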

    In [ ]:

# Started working:
print(datetime.now().time())

    In [ ]:

    %%sql

    9. What is the language of the sample that has the highest successful guess rate?

    In [ ]:

# Started working:
print(datetime.now().time())

    In [ ]:

    %%sql

    10. Return all games played on the latest day.

    In [ ]:

# Started working:
print(datetime.now().time())

    In [ ]:

    %%sql

2. Document stores

A record in a document store is a document. Document encoding schemes include XML, YAML, JSON, and BSON, as well as binary forms like PDF and Microsoft Office documents (MS Word, Excel, and so on). MongoDB documents are similar to JSON objects. Documents are composed of field-value pairs and have the following structure:
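(The snippet below shows the generic field-value layout from the MongoDB documentation; the field names are placeholders.)

{
   field1: value1,
   field2: value2,
   field3: value3,
   ...
   fieldN: valueN
}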

The values of fields may include other documents, arrays, and arrays of documents. Data in MongoDB has a flexible schema: documents in the same collection do not need to have the same set of fields or structure, and common fields in a collection's documents may hold different types of data.

2.1 General Questions

1. What are advantages of document stores over relational databases?
2. Can the data in document stores be normalized?
3. How does denormalization affect performance?

2.2 True/False Questions

Say whether the following statements are true or false.

1. Document stores expose only a key-value interface.
2. Different relationships between data can be represented by references and embedded documents.
3. MongoDB does not support schema validation.
4. MongoDB stores documents in the XML format.
5. In document stores, you must determine and declare a table's schema before inserting data.
6. MongoDB performance degrades when the number of documents increases.
7. Document stores are column stores with flexible schema.
8. There are no joins in MongoDB.

    3. MongoDB

3.1 Set up MongoDB

Navigate to https://www.mongodb.com/download-center and register:

    Create a free-tier cluster. You are free to use any cloud provider and region of your preference. For example, you could choose AWS and Frankfurt:


Create a database user:

    Whitelist your IP address (or all IP addresses if using the Azure notebook service):

    Create a test database with a restaurants collection:

Install pymongo and dnspython:

    In [ ]:

!pip install pymongo[tls]
!pip install dnspython

To connect to the database, copy the connection string for your version of Python by navigating
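Once you have the connection string from the Atlas UI, the connection itself looks roughly like this (a sketch; the connection string below is a placeholder, and the database and collection names are the ones created above):

In [ ]:

import pymongo

# Placeholder connection string -- replace it with the one copied from the Atlas UI.
connection_string = "mongodb+srv://<user>:<password>@<cluster-address>/test?retryWrites=true"

client = pymongo.MongoClient(connection_string)
db = client.test                    # the test database created above
restaurants = db.restaurants        # the restaurants collection created above
print(db.list_collection_names())   # quick check that the connection works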