
Big Data for Engineers – Exercises

Spring 2019 – Week 9 – ETH Zurich

Spark + MongoDB

1. Spark DataFrames + SQL

1.1 Setup the Spark cluster on Azure

Create a cluster

- Sign into the Azure portal (portal.azure.com).
- Search for "HDInsight clusters" using the search box at the top.
- Click on "+ Add".
- Give the cluster a unique name.
- In "Select Cluster Type", choose Spark and the standard Cluster Tier (finish by pressing "Select").
- In step 2, the container name will be filled in for you automatically. If you want to do the exercise sheet in several sittings, change it to something you can remember or write it down.
- Set up a Spark cluster with the default configuration. It should cost around 3.68 sFr/h.
- Wait about 20 minutes until your cluster is ready.

Important

Remember to delete the cluster once you are done. If you want to stop doing the exercises at any point, delete it and recreate it using the same container name as you used the first time, so that the resources are still there.


Access your cluster

Make sure you can access your cluster (the NameNode) via SSH:

$ ssh <ssh_user_name>@<cluster_name>-ssh.azurehdinsight.net

If you are using Linux or macOS, you can use your standard terminal. If you are using Windows, you can use one of the following:

- The PuTTY SSH client and PSCP tool (get them here).
- This notebook server's terminal (click on the Jupyter logo and then go to New -> Terminal).
- The Azure Cloud Terminal (see the HBase exercise sheet for details).

The cluster has its own Jupyter server. We will use it. You can access it through the following link:

https://<cluster_name>.azurehdinsight.net/jupyter

You can access the cluster's YARN UI in your browser:

https://<cluster_name>.azurehdinsight.net/yarnui/hn/cluster

The Spark UI can be accessed via the Azure Portal; see "Spark job debugging".

You need to upload this notebook to your cluster's Jupyter server in order to execute the Python code blocks. To do this, just open Jupyter through the link given above and use the "Upload" button.

1.2 The Great Language Game

This week you will again be using the language confusion dataset. You will write queries with Spark DataFrames and SQL. You will have to submit the results of this exercise to Moodle to obtain the weekly bonus. For each query, you will need the following:

- The query you wrote
- Something related to its output (which you will be graded on)
- The time it took you to write it
- The time it took you to run it

As you might have observed in the sample queries above, the time a job took to run is displayed in the rightmost column of its output. If the job consists of several stages, however, you will need to sum them. The easiest approach is to just take the execution time of the whole query:


Of course, you will not be evaluated on the time it took you to write the queries (nor on the time it took them to run), but this information is useful to us in order to measure the increase in performance when using Sparksoniq. Before every query there is a cell that outputs the time you started working; use it if you find it helpful.

For this exercise, we strongly suggest that you use the Azure cluster as described above.

Log in to your cluster using SSH as explained above and run the following commands:

wget http://data.greatlanguagegame.com.s3.amazonaws.com/confusion-2014-03-02.tbz2

tar -jxvf confusion-2014-03-02.tbz2 -C /tmp

hdfs dfs -copyFromLocal /tmp/confusion-2014-03-02/confusion-2014-03-02.json /confusion.json

This downloads the archive file to the cluster, decompresses it, and uploads it to HDFS. Now, create an RDD from the file containing the entries:

In [ ]:

data = sc.textFile('wasb:///confusion.json')
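If you want a quick sanity check that the file was read (optional; not part of the original sheet), you can peek at the first raw line of the RDD:

# Optional sanity check: show the first raw JSON line.
data.take(1)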

Last week you loaded the json data with the following snippet:

In [ ]:

import json
entries = data.map(json.loads)
type(entries)

This week you will use DataFrames:

In [ ]:

entries_df = spark.read.json(data).cache()
type(entries_df)

You can check the schema by executing the following code:

In [ ]:

entries_df.printSchema()
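To also look at a few rows (an optional check, not required by the exercise):

# Optional: display the first three rows without truncating long values.
entries_df.show(3, truncate=False)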

You can place the data into a temporary table with the following code:

In [ ]:

entries_df.registerTempTable("entries")
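If registerTempTable is reported as deprecated on your Spark version, entries_df.createOrReplaceTempView("entries") is the equivalent call.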

Now, you can use normal SQL, with sql magic (%%sql), to perform queries on the table entries. For example:

In [ ]:

%%sql
SELECT *
FROM entries
WHERE country == "CH"
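If you prefer to measure a query's running time yourself from Python, rather than reading it off the %%sql output, here is a minimal sketch (it assumes the entries table registered above; the COUNT query is just an example):

from datetime import datetime

# Time a whole query end-to-end; spark.sql runs the same SQL as the %%sql magic,
# and .show() forces execution so the elapsed time covers the full job.
start = datetime.now()
spark.sql("SELECT COUNT(*) FROM entries").show()
print("Elapsed:", datetime.now() - start)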

Good! Let's get to work. A few last things:

- This week, you should not have issues with the output being too long, since the sql magic limits its size automatically.
- Remember to delete the cluster if you want to stop working! You can recreate it using the same container name and your resources will still be there.

And now to the actual queries:

1. Find all games such that the guessed language is correct (=target), and such that this language is Russian.

In [ ]:

from datetime import datetime


# Started working:
print(datetime.now().time())

In [ ]:

%%sql

2. List all chosen answers to games where the guessed language is correct (=target).

In [ ]:

# Started working:
print(datetime.now().time())

In [ ]:

%%sql

3. Find all distinct values of languages (the target field).

In [ ]:

# Started working:
print(datetime.now().time())

In [ ]:

%%sql

4. Return the top three games where the guessed language is correct (=target) ordered by language (ascending), then country (ascending), then date (ascending).

In [ ]:

# Started working:
print(datetime.now().time())

In [ ]:

%%sql

5. Aggregate all games by country and target language, counting the number of guesses for each pair (country, target).

In [ ]:

# Started working:
print(datetime.now().time())

In [ ]:

%%sql

6. Find the overall percentage of correct guesses when the first answer (amongst the array of possible answers) was the correct one.

In [ ]:

# Started working:
print(datetime.now().time())

In [ ]:

%%sql

7. Sort the languages by increasing overall percentage of correct guesses.

In [ ]:

# Started working:
print(datetime.now().time())

In [ ]:


%%sql

8. Group the games by the index of the correct answer in the choices array and output all counts.

The following code snippet will create a user-defined SQL function, which you can use in your SQL queries. You may call it as array_position(x, y), where x is an array (for example an entry of the column choices) and y is the value whose position/index in the array you want to find.

In [ ]:

spark.udf.register("array_position", lambda x,y: x.index(y))
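As a small usage sketch (not the answer to the question), the registered UDF can be called from Spark SQL like any built-in function; the column names choices and target below come from the dataset schema:

# Hypothetical illustration: index of the target language within the choices array.
spark.sql("""
SELECT array_position(choices, target) AS correct_index
FROM entries
LIMIT 5
""").show()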

In [ ]:

# Started working:
print(datetime.now().time())

In [ ]:

%%sql

9. What is the language of the sample that has the highest successful guess rate?

In [ ]:

# Started working:
print(datetime.now().time())

In [ ]:

%%sql

10. Return all games played on the latest day.

In [ ]:

# Started working:
print(datetime.now().time())

In [ ]:

%%sql

2. Document stores

A record in a document store is a document. Document encoding schemes include XML, YAML, JSON, and BSON, as well as binary forms like PDF and Microsoft Office documents (MS Word, Excel, and so on). MongoDB documents are similar to JSON objects. Documents are composed of field-value pairs and have the following structure:
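For illustration only (the field names and values below are made up), such a document could look like this, written as a Python dictionary:

{
    "name": "Alice",                 # simple field-value pair
    "address": {                     # value that is an embedded document
        "city": "Zurich",
        "zipcode": "8092"
    },
    "grades": [                      # value that is an array of documents
        {"grade": "A", "score": 11},
        {"grade": "B", "score": 25}
    ]
}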

The values of fields may include other documents, arrays, and arrays of documents. Data in MongoDB has a flexible schema: documents in the same collection do not need to have the same set of fields or structure, and common fields in a collection's documents may hold different types of data.

2.1 General Questions

1. What are the advantages of document stores over relational databases?
2. Can the data in document stores be normalized?
3. How does denormalization affect performance?

2.2 True/False Questions

Say whether the following statements are true or false.

1. Document stores expose only a key-value interface.
2. Different relationships between data can be represented by references and embedded documents.
3. MongoDB does not support schema validation.
4. MongoDB stores documents in the XML format.
5. In document stores, you must determine and declare a table's schema before inserting data.
6. MongoDB performance degrades when the number of documents increases.
7. Document stores are column stores with flexible schema.
8. There are no joins in MongoDB.

3. MongoDB

3.1 Setup MongoDB

Navigate to https://www.mongodb.com/download-center and register:

Create a free-tier cluster. You are free to use any cloud provider and region of your preference. For example, you could choose AWS and Frankfurt:


Create a database user:

Whitelist your IP address (or all IP addresses if using the Azure notebook service):

Create a test database with a restaurants collection:


Install pymongo and dnspython:

In [ ]:

!pip install pymongo[tls]
!pip install dnspython

To connect to the database, copy the connection string for your version of Python from the MongoDB website, as shown in the following pictures:


In [ ]:

import pymongo
import dns
from pprint import pprint
import urllib.request
import json
import dateutil.parser
from datetime import datetime, timezone, timedelta

If you get an error when importing one of the modules above, then try: !pip install <module_name>

In [ ]:

client = pymongo.MongoClient("mongodb+srv://<username>:<password>@<cluster_address>/test?retryWrites=true")
db = client.test
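An optional connectivity check (a sketch; it assumes you replaced the placeholders above with your own credentials):

# Raises an exception if the cluster is unreachable or the credentials/IP whitelist are wrong.
# list_collection_names requires pymongo 3.6 or newer.
print(client.server_info()["version"])
print(db.list_collection_names())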

Import the restaurants dataset:

In [ ]:

file = urllib.request.urlopen("https://raw.githubusercontent.com/mongodb/docs-assets/primer-dataset/primer-dataset.json")
file_data = []

for line in file:
    record = json.loads(line)
    grades = record['grades']
    new_grades = []
    for g in grades:
        new_g = {}
        for k, v in g.items():
            if k == 'date':
                new_g['date'] = datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(milliseconds=v['$date'])
            else:
                new_g[k] = v
        new_grades.append(new_g)
    record['grades'] = new_grades
    file_data.append(record)

db.restaurants.insert_many(file_data)
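To verify the import (optional, not part of the original sheet), you can count the documents in the collection; count_documents requires pymongo 3.7 or newer:

# Count all documents currently in the restaurants collection.
print(db.restaurants.count_documents({}))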

Try to insert a document into the restaurants collection. This also lets you see the structure of the documents in the collection:

In [ ]:

db.restaurants.insert_one(
    {
        "address": {
            "street": "2 Avenue",
            "zipcode": "10075",
            "building": "1480",
            "coord": [-73.9557413, 40.7720266]
        },
        "borough": "Manhattan",
        "cuisine": "Italian",
        "grades": [
            {"date": dateutil.parser.parse("2014-10-01T00:00:00Z"), "grade": "A", "score": 11},
            {"date": dateutil.parser.parse("2014-01-16T00:00:00Z"), "grade": "A", "score": 17}
        ],
        "name": "Vella",
        "restaurant_id": "41704620"
    }
)

Query all documents in a collection:

db.restaurants.find()

Query one document in a collection:

db.restaurants.find_one()

To format the result, you can use pprint, as in the following:

for doc in db.restaurants.find().limit(3):
    pprint(doc)

Query Documents

For the db.collection.find() method, you can specify the following optional fields:

- a query filter to specify which documents to return,
- a query projection to specify which fields from the matching documents to return (the projection limits the amount of data that MongoDB returns to the client over the network),
- optionally, a cursor modifier to impose limits, skips, and sort orders.
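A minimal sketch combining all three (the field names come from the restaurants dataset imported above; the specific values are just examples):

cursor = (db.restaurants
          .find({"borough": "Manhattan"},                 # query filter
                {"name": 1, "cuisine": 1, "_id": 0})      # projection
          .sort("name", 1)                                # cursor modifiers:
          .limit(5))                                      # sort and limit
for doc in cursor:
    pprint(doc)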

3.4 Questions

Write queries in MongoDB that return the following:

1. All restaurants in the borough (a town) "Brooklyn" with cuisine (a style of cooking) "Hamburgers".
2. The number of restaurants in the borough "Brooklyn" with cuisine "Hamburgers".
3. All restaurants with zipcode 11225.
4. Names of restaurants with zipcode 11225 that have at least one grade "C".
5. Names of restaurants with zipcode 11225 that have "C" as the first grade and "A" as the second grade.
6. Names and streets of restaurants that don't have an "A" grade.
7. All restaurants with a grade C and a score greater than 50.
8. All restaurants with a grade C or a score greater than 50.
9. All restaurants that have only A grades.

You can read more about MongoDB here:

https://docs.mongodb.com/getting-started/shell/query/

4. Indexing in MongoDB

Indexes support the efficient resolution of queries. Without indexes, MongoDB must scan every document of a collection to select those documents that match the query statement. Such scans can be highly inefficient and require MongoDB to process a large volume of data.

Indexes are special data structures that store a small portion of the data set in an easy-to-traverse form. The index stores the value of a specific field or set of fields, ordered by the value of the field as specified in the index.

MongoDB supports indexes that contain either a single field or multiple fields, depending on the operations that this index type supports.

By default, MongoDB creates the _id index, which is an ascending unique index on the _id field, for all collections when the collection is created. You cannot remove the index on the _id field.

Managing indexes in MongoDB

The explain() operator provides information on the query plan. It returns a document that describes the process and indexes used to answer the query. This may provide useful insight when attempting to optimize a query.

db.restaurants.find({"borough" : "Brooklyn"}).explain()

You can create an index by calling the create_index() method.

db.restaurants.create_index([("borough", 1)])

Now you can retrieve a new query plan for the indexed data.

db.restaurants.find({"borough" : "Brooklyn"}).explain()

The value of the field in the index specification describes the kind of index for that field. For example, a value of 1 specifies an index that orders items in ascending order. A value of -1 specifies an index that orders items in descending order.
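For illustration only (this compound index is not needed for the questions below), the two directions can be mixed in a single compound index:

# A compound index: ascending on borough, descending on cuisine.
db.restaurants.create_index([("borough", 1), ("cuisine", -1)])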

To remove all indexes, you can use db.collection.drop_indexes(). To remove a specific index, you can use db.collection.drop_index(), for example db.restaurants.drop_index([("borough", 1)]).

4.1 Questions

Please answer questions 1 and 2 in Moodle.

1) Which queries will use the following index:

db.restaurants.create_index([("borough", 1)])

A. db.restaurants.find({"addresses.city" : "Boston"})
B. db.restaurants.find({}, {"borough" : 1})
C. db.restaurants.find().sort([("borough", 1)])
D. db.restaurants.find({"cuisine" : "Italian" }, {"borough" : 1})

2) Which queries will use the following index:

db.restaurants.create_index([("address", -1)])

A. db.restaurants.find({"address.zipcode" : "11225"})
B. db.restaurants.find({"addresses.city" : "Boston"})
C. db.restaurants.find({"addresses.city" : "Boston"}, {"address" : 1 })
D. db.restaurants.find({"address" : 1 })

3) Write a command for creating an index on the "zipcode" field.

4) Write an index to speed up the following query:

db.restaurants.find({"grades.grade" : { "$ne" : "A"}}, {"name" : 1 , "address.street": 1})

5) Write an index to speed up the following query:

db.restaurants.find({"grades.score" : {"$gt" : 50}, "grades.grade" : "C"})

In [ ]: