Clustering Uber Pickups using Apache Spark's MLlib


Vincent Trost & Kevin Baik
April 28, 2017
With the rise of technology in society, disruptive innovation is all around us. Taxis used to be a staple of life in the city, but the development of the smartphone changed all of that. Along came a ride-sharing app called "Uber" that allows users to hail a ride from their exact location using only their smartphone. Not only that, but the car one gets picked up in can be a regular car instead of a bright yellow taxi. Drivers register their own cars, work whenever they want by flipping a switch in the app, and all payment is handled digitally. The scientists, Kevin and Vincent, were interested in running a k-means clustering analysis using Spark's MLlib to identify the densest pickup locations in a certain area (for this study, New York City and its surroundings) while also utilizing the parallelization methods provided by the Apache Spark framework.
These data are freely available on GitHub via FiveThirtyEight. The company submitted a Freedom of Information Law (FOIL) request to the NYC Taxi & Limousine Commission in order to obtain these data on July 20, 2015 [1]. The commission was required by law to comply, and after obtaining the data, FiveThirtyEight cleaned it and published it on GitHub for the public.
The Data
As mentioned, the data were obtained from fivethirtyeight’s GitHub page. The data were from the months of April 2014 through August 2014, and contained pickup locations in and around New York City. The files contained five columns: ID, DateTime, Latitude, Longitude, and Base.
Figure 1: Snapshot of the data
These are all relatively self-explanatory except for the Base variable. In New York City, some taxi and limousine companies that own their own fleets of cars will allow people to rent them out and use them for Uber. This allows the companies to make money on their assets around the clock and also allows the drivers to drive for Uber without using their own car. The Base variable records which base the car performing that pickup came from. For the purposes of this project, it was not very relevant. Overall, the data were small in size: only around 48 MB. This made exploring scalability measures challenging, but that will be covered later in the report. Though small in size, the data still represented well over one million unique pickups (1,048,575 to be exact).
Performing K-Means Clustering
The first objective in performing K-Means clustering was to clean the data. To do that, the scientists used the following command (here `data` stands in for the RDD of raw CSV lines; the receiver name was elided in the original report):

    val parsedData = data.map { line =>
      Vectors.dense(
        line.split(",")
          .slice(3, 5)
          .map(_.toDouble))
    }.cache()
More will be explained later on why cache() was chosen. This command split each line of the .csv file on commas, kept only the latitude and longitude, and mapped every value to type Double. That step is important because the K-Means implementation only accepts numeric vectors. The next step is to train the K-Means model.
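The per-line transformation can be sanity-checked without Spark. Below is a sketch under an assumed row layout; given the five columns described above (ID, DateTime, Latitude, Longitude, Base), latitude and longitude would sit at split indices 2 and 3, so the sample row and slice indices here are illustrative assumptions rather than the report's exact code:

```scala
// Sketch: parse one CSV row into a numeric (latitude, longitude) pair.
// The sample row and the slice bounds (2, 4) are assumptions for the
// five-column layout described in the text; adjust to the real file.
def parseLine(line: String): Array[Double] =
  line.split(",").slice(2, 4).map(_.toDouble)

val sample = "1,4/30/2014 23:22:00,40.7640,-73.9744,B02512"
val coords = parseLine(sample) // Array(40.764, -73.9744)
```

In the actual job, the same function body runs inside the RDD's map, and the resulting arrays are wrapped in Vectors.dense for MLlib.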
The most important parameter in the K-Means algorithm is the choice of K, the number of clusters, which the scientists wanted to optimize for their use case. One way to evaluate a choice of K is by computing the "cost": the sum of squared distances of each pickup from its cluster center. The lower the cost, the better the clusters fit the data. Of course, an ever-higher K converges toward fitting the data perfectly, so cost alone cannot pick K. There is no perfect way to choose K, but one widely used heuristic is the "Elbow Method": choose K at the point beyond which increasing it yields a negligible return.
Figure 2: Kmeans Cost as K Increases
The scientists saw this visualization and settled on k = 200, because increasing k from 200 toward 1000 clusters provided no significant additional reduction in cost.
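The elbow choice can be expressed as a simple rule over (k, cost) pairs: take the first k after which the marginal cost reduction per added cluster drops below some threshold. The sketch below uses illustrative cost values and a hypothetical threshold; the real costs come from the runs behind Figure 2:

```scala
// Elbow heuristic sketch: walk consecutive (k, cost) pairs (cost decreasing)
// and return the first k beyond which each extra cluster buys less than
// `threshold` in cost reduction.
def elbow(points: Seq[(Int, Double)], threshold: Double): Int =
  points.sliding(2).collectFirst {
    case Seq((k1, c1), (k2, c2)) if (c1 - c2) / (k2 - k1) < threshold => k1
  }.getOrElse(points.last._1)

// Hypothetical cost curve shaped like Figure 2 (values are illustrative only)
val costs = Seq((10, 500.0), (50, 120.0), (200, 30.0), (1000, 25.0))
val k = elbow(costs, threshold = 0.1) // k = 200
```

With these made-up numbers the marginal gains per extra cluster are 9.5, 0.6, and about 0.006, so the rule stops at k = 200, mirroring the visual judgment made from the plot.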
Scalability Tests
In order to evaluate how efficient the program is, different configurations of spark-submit commands must be tested. Along with that, different code implementations utilizing commands like cache() and persist() with different flags must be evaluated as well. The following tables summarize those tests. The K-Means run-time is clocked before and after training the model as shown below (the original code divided by 10e9; it has been corrected to 1e9, the factor that converts nanoseconds to seconds):

    val iterationCount = 200 // held constant
    val clusterCount = 200   // held constant
    val start = System.nanoTime
    val model = KMeans.train(parsedData, clusterCount, iterationCount)
    val end = System.nanoTime
    println("KMeans Run-Time: " + (end - start) / 1e9 + "s")
The RDD storage method used in the code for these tests is cache(). Each parameter in the spark-submit configuration was tested by holding two constant and varying one at a time. An average was taken across three trials to help account for variability in cluster usage affecting the run-time.
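Each reported average is simply the arithmetic mean of the three trials; as a minimal sketch, using the 8-core row of the core-count table:

```scala
// Arithmetic mean of run-time trials, as used for the "Average" columns.
def meanOf(trials: Seq[Double]): Double = trials.sum / trials.size

val avg = meanOf(Seq(13.9311, 13.3010, 15.8827)) // ≈ 14.3716
```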
Varying the number of cores within the executors

    spark-submit --master yarn-client
      --driver-memory 2g          // always constant
      --executor-memory 2g        // constant
      --num-executors 4           // constant
      --executor-cores (2, 8, 16)

    Number of Cores   Trial 1   Trial 2   Trial 3   Average
    2                 13.9541   10.2777   16.5499   13.5940
    8                 13.9311   13.3010   15.8827   14.3716
    16                11.5307   13.8614   16.1953   13.8625
Here we observe only a small difference between the tests. Because the dataset being processed is small, a lower core count likely fits this use case best.
Varying the number of executors used

    spark-submit --master yarn-client
      --driver-memory 2g           // always constant
      --executor-memory 2g         // constant
      --num-executors (3, 15, 30)
      --executor-cores 2           // constant

    Number of Executors   Trial 1   Trial 2   Trial 3   Average
    3                     11.1275   15.8867    9.4692   12.1611
    15                    14.7817   13.8307   15.0752   14.5625
    30                    15.3249   15.8669   23.6597   18.2838
Here we observe that the fewer executors, the better. This is most likely because using so many executors on such a small dataset creates overhead in the aggregation stage.
Varying the amount of executor memory used

    spark-submit --master yarn-client
      --driver-memory 2g                  // always constant
      --executor-memory (512MB, 2g, 8g)
      --num-executors 4                   // constant
      --executor-cores 2                  // constant

    Executor Memory   Trial 1   Trial 2   Trial 3   Average
    512MB             15.2041   18.2490   17.5473   17.0001
    2g                11.3612   12.9340   12.8551   12.3834
    8g                 9.2242   10.8015   14.0420   11.3559
Here we observe that the higher the memory, the better, though the gain is small for such a steep increase in memory (from 2g to 8g). It makes sense that, for such a small dataset, performance would plateau beyond a certain amount of memory. Nonetheless, 8g ran the fastest.
Based on these findings, the number of executor cores did not make much of a difference, but the number of executors and the executor memory did. We can now try an "ensemble" command for our use case: since 3 executors performed best and 8g of executor memory performed best, together they should make a fast combination.

    spark-submit --master yarn-client
      --driver-memory 2g   // always constant
      --executor-memory 8g
      --num-executors 3
      --executor-cores 2

    Trial 1   Trial 2   Trial 3   Average
    11.8125   12.9690   10.8062   11.8626
The ensemble of best performers just missed out-performing the 8g configuration that used one more executor (four instead of three). This is interesting, because the earlier trials suggested that fewer executors perform better; with more trials this might not hold. The performance times were very close nonetheless, and the ensemble's variability was much lower.
In-Code Scalability Measures
It was noted that cache() was the RDD storage method used in the code. Other implementations using persist() were tested with different storage-level flags. They were run using the "ensemble" spark-submit configuration because, even though the 8g-memory, four-executor configuration was ultimately the fastest, its variability was much higher, with run-times ranging from roughly 9 to 14 seconds, which was concerning.
persist(StorageLevel.MEMORY_ONLY)

    Trial 1   Trial 2   Trial 3   Average
    10.0722   14.6886   11.8092   12.1900

persist(StorageLevel.MEMORY_AND_DISK)

    Trial 1   Trial 2   Trial 3   Average
    18.1217   13.9961   16.7736   16.2971

persist(StorageLevel.MEMORY_AND_DISK_SER)

    Trial 1   Trial 2   Trial 3   Average
    16.0902   12.5814   25.9199   18.1972
Here we see MEMORY_ONLY performing the best, followed by MEMORY_AND_DISK, and then MEMORY_AND_DISK_SER. The serialized format comes in last because, although it is more space-efficient, it is more CPU-intensive to read. Comparing the persist() methods to cache():
cache()

    Trial 1   Trial 2   Trial 3   Average
    12.3200   10.3350   14.0219   12.2256
Not surprisingly, cache() performs very similarly to persist(StorageLevel.MEMORY_ONLY).
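This similarity is expected: in Spark, cache() on an RDD is simply shorthand for persist(StorageLevel.MEMORY_ONLY). A sketch of the variants compared above (each storage level was applied in a separate run of the program, since an RDD's storage level cannot be changed once set):

    import org.apache.spark.storage.StorageLevel

    // Each of these was used in a separate run:
    parsedData.cache()                                    // equivalent to persist(StorageLevel.MEMORY_ONLY)
    parsedData.persist(StorageLevel.MEMORY_AND_DISK)      // spill partitions to disk when memory is full
    parsedData.persist(StorageLevel.MEMORY_AND_DISK_SER)  // store serialized: smaller, but CPU-heavier to read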
The Scala program [3] output the cluster centers and cluster sizes to text files, which were then taken off the cluster and loaded into R for visualization. Using an R package called leaflet, an interactive map was made to visualize the cluster centers and their sizes [2]. The color scale goes from white to blue, with white marking the least dense cluster and dark blue the most dense.
Figure 3: Map of the cluster centers (interactive version available via [2])
The scientists initially hoped their cluster centers would correspond to much more specific locations on the map. This was not the case; the centers corresponded to popular pickup areas instead. Another idea the scientists entertained was subsetting the data by time, to see whether pickup locations at night differed from the overall pattern, but this proved infeasible because the resulting subsets were too small.
Top 15 most dense clusters
    Cluster Rank   Cluster ID   Cluster Size   Latitude      Longitude
    1              2            29255          41.06940278   -73.84038958
    2              70           26224          40.69540156   -73.81338677
    3              174          22890          40.7431948    -73.98634259
    4              93           22433          41.2932       -74.09022
    5              54           22301          40.64510801   -73.97185006
    6              95           20533          40.74824038   -73.93974121
    7              122          20483          40.72990924   -73.98668317
    8              190          20181          40.7449109    -73.95434998
    9              47           19897          40.59828986   -73.97944906
    10             68           19372          40.67323883   -73.97517343
    11             114          19340          41.0281712    -73.6224416
    12             13           19140          40.69084395   -73.98746458
    13             141          19117          40.76327588   -73.97673154
    14             157          19097          40.82725843   -74.08289625
    15             32           18669          40.77653512   -73.95471698
Anyone who wishes can plot the latitude/longitude coordinates to see what surrounds each cluster center. The centers of most interest to the scientists were the ones in Manhattan, because of a suspicion that they might correspond to popular destinations. It was found that they correspond to popular areas above all else. Below are the top three cluster centers in Manhattan.
Figure 4: The third most dense cluster center - First in Manhattan - near Madison Square Park
All the orange dots on the map are either bars or restaurants. This was a common theme amongst the clusters in Manhattan.
Figure 5: The seventh most dense cluster - Second in Manhattan - at 2nd Ave and 10th Street
Figure 6: The thirteenth most dense cluster - Third in Manhattan - at 56th and 6th
We still see a high number of orange dots, but also more hotels. These findings can suggest popular places to eat and drink to anyone who might be interested. They can also help Uber drivers know where to linger in order to maximize their probability of landing a ride. Plenty of other hypotheses could be posed to explain these results; the scientists, however, were pleased that the clusters corresponded to areas with similar characteristics.
Another interesting discovery was that Uber was notably popular in Long Island City. The 6th and 8th most dense clusters encompass that entire neighborhood, with over forty thousand pickups in this small area. It might be a good place to start driving for Uber!
Figure 7: The sixth and eighth most dense clusters- Long Island City
The findings for this project were interesting, but the scalability measures were the focus of the class. By testing the algorithm with different spark-submit configurations, and by confirming that cache() was effective at cutting down run-time, the scientists were able to pinpoint an optimal configuration. With more data, many things could be improved: the scalability measures could be further refined, the scientists would have more opportunity to apply techniques such as mapPartitions() to further parallelize the work, and more specific locations could be pinpointed with a higher number of cluster centers. DateTime could also be added as an extra dimension on which to cluster, adding more complexity to the algorithm. With more data, subsetting to spot night-time trends would also become feasible. The scientists could have generated synthetic data using Bayesian methods, or, with enough time, submitted another FOIL request to the NYC Taxi and Limousine Commission. But time was of the essence, and the potential of this project is exciting. The scientists are considering submitting a FOIL request to Uber for data on other cities, in order to apply this algorithm and visualize similar results in different places. Though the dataset was small, plenty of knowledge was gained about how Spark goes about parallelizing its tasks.
[1] FiveThirtyEight (2015), uber-tlc-foil-response, GitHub repository.
[2] Vincent Trost (2017), GitHub repository.
[3] Kevin Baik (2017), DS 410, GitHub repository.