VisHive: Creating Ad-hoc Computational Clusters using ... · 7 JT) JWF DMVTUFS PG DPOOFDUFE EFWJDFT...

VisHive: Creating Ad-hoc Computational Clusters usingMobile Devices in Web-based Visualization

Shivalik Sen∗

Birla Institute of Technology and ScienceGoa, India

Sriram Karthik Badam†

University of MarylandCollege Park, MD, USA

Niklas Elmqvist‡

University of MarylandCollege Park, MD, USA

刀攀猀甀氀琀猀愀爀攀瀀愀猀猀攀搀戀愀挀欀琀漀琀栀攀漀爀椀最椀渀䐀攀瘀椀挀攀爀攀焀甀椀爀攀猀

挀漀洀瀀氀攀砀挀漀洀瀀甀琀愀琀椀漀渀昀漀爀瘀椀猀甀愀氀椀稀椀渀最戀椀最搀愀琀愀

䌀漀洀瀀甀琀愀琀椀漀渀椀猀猀瀀氀椀琀愀渀搀瀀愀猀猀攀搀琀漀琀栀攀挀漀渀渀攀挀琀攀搀搀攀瘀椀挀攀挀氀甀猀琀攀爀 ⠀栀椀瘀攀⤀

刀攀猀甀氀琀猀昀爀漀洀琀栀攀䠀椀瘀攀椀渀琀攀最爀愀琀攀搀椀渀琀漀瘀椀猀甀愀氀椀稀愀琀椀漀渀

嘀椀猀䠀椀瘀攀挀氀甀猀琀攀爀漀昀挀漀渀渀攀挀琀攀搀搀攀瘀椀挀攀猀

Fig. 1. VisHive creates ad-hoc and opportunistic clusters from the local devices available to a user. Here, laptops, smartphones,and tablet devices are connected into a cluster to handle complex computations. Connected devices contribute computational powerusing VisHive.

Abstract—Current web-based visualizations are designed for single computers and cannot make use of multiple devices, even whena user has access to not only a laptop, but also a tablet and a smartphone. We present VisHive, a JavaScript toolkit for constructingweb-based visualization applications that can transparently connect multiple devices—called cells—into an ad-hoc cluster—called ahive—for local computation. Cells are organized into a master-slave architecture, where the master provides the visual interface tothe user and controls the slaves, and the slaves mainly perform computation. VisHive is built entirely using current web technologies,runs in the native browser of each cell device, and requires no specific download and install on the involved devices. We demonstrateVisHive using a time-series prediction visual analytics tool that utilizes connected slave cells to continuously perform additionalpredictions while the master awaits user input.

Index Terms—Parallel computing, web-based visualization, junkyard cluster, ad-hoc clusters, JavaScript toolkit.

1 INTRODUCTIONModern browsers have a lot to offer visualization developers, includ-ing advanced accelerated graphics; support for multi-touch, gesture-based, and pen-based computing; and seamless integration with theentire web ecosystem, including remote databases, sophisticated webservices, and online geographical map systems, to name just a few ex-amples. Most importantly, the web browser is now ubiquitous on alldevices, from laptop to smartphone, tablet to smartwatch, and requireno specific download or install to run sophisticated applications. Forthese reasons, it is not surprising that the web is quickly becoming

∗E-mail: [email protected]†E-mail: [email protected]‡E-mail: [email protected]

HCIL Technical Report.

one of the most popular target platforms for visualization. Accord-ingly, a host of toolkits, frameworks, and middlewares have recentlybecome available for visualization development, such as D3 [6] forcreating visual representations of data using declarative syntax, Poly-Chrome [2] for duplicating visualizations across multiple devices, andVisDock [11] for providing cross-cutting interaction techniques to vi-sualizations in a standardized tooldock similar to Adobe Photoshop.

However, the browser is still not an ideal computational environ-ment for executing the kind of heavy algorithms—such as clusteranalysis, graph layout, or probabilistic topic modeling—that manyvisualization and, in particular, visual analytics (VA) applicationsmay require. One reason for this lies with JavaScript (the program-ming language of the web browser) itself: it is an interpreted andweakly/dynamically typed scripting language that was never intendedfor high-performance computation. While it is possible to write high-performance code using JavaScript, this requires special care andin-depth knowledge to avoid the many pitfalls inherent in common

JavaScript coding idioms. JavaScript also long had no support for mul-tithreading, which is critical for concurrent implementations of manypopular algorithms. As a result, JavaScript libraries for scientific com-puting and other heavy computational domains were basically incon-ceivable. Fortunately, the HTML5 standard brings the Web WorkerAPI1, which allows for splitting computation across multiple concur-rent threads of execution. This API, as well as the continuous improve-ment of browser JavaScript interpreters such as Google’s V8 engine(which is used in Google Chrome), has belatedly led to JavaScript andthe browser now also becoming a platform for scientific computing.

One major gap in web-based visualization development remains:parallelization of computation across multiple devices. A commonuse-case of a web-based visualizations involves a user that has accessto multiple devices in their immediate physical surroundings. For ex-ample, if the user is accessing the visualization using a laptop, theymay also have a smartphone in their pocket, a tablet in their backpack,and a desktop computer under their office desk. While offloading com-putation to a server-side or cloud-based component is certainly possi-ble, it would make a lot of sense if the user was also able to fire upthe additional devices and use them for serendipitously offloading anyheavy computation required by the visualization tool, such as a senti-ment analysis, a complex graph layout, or a dimensionality reductionalgorithm. Compared to the server-based or cloud-based solution, this“local cloud” of co-located physical devices brings benefits to bothend-users and developers. More specifically, the end-user can avoidany mobile network fees while minimizing latency by confining thecommunication to the local Wi-Fi network, whereas the visualizationdeveloper can implement concurrent computation using JavaScript inthe web-based visualization application itself and without having toworry about creating separate server-side software for this purpose.

In this paper, we propose VISHIVE, a JavaScript toolkit for cre-ating this type of ad-hoc, opportunistic clusters consisting of local,networked devices that is directly integrated in a web application, thusrequiring no server-side components or client-side download and in-stall. Using the VisHive toolkit is simple for developers: they simplycreate a single web application using the library that also includes thecomputational code. Multiple devices—each called a cell—join a ses-sion through a simple discovery mechanism to form a cluster—calleda hive—that can be dynamically reconfigured as cells join and leave.The user manually designates one cell as a master—typically the de-vice that the user launches the visualization on first, though this canchange over the duration of a session—and the others become slaves.The master, in addition to also participating in the computation, is re-sponsible for partitioning the problem (or dataset) into smaller chunks,assigning the chunks to cells as they become available (or recoveringa lost chunk if the cell is lost from the hive), and finally recombiningthe results and presenting them to the user in the visualization on themaster device.

The VisHive toolkit is in no way a replacement to existing softwarearchitectures, frameworks, or middleware for scientific computing,machine learning, or other computational engineering applications. Inparticular, it does not replace server-based, cloud-based, or cluster-based computational frameworks that are already being used to powerweb applications and visualizations today. Rather, VisHive providesan entirely new opportunity to cheaply, conveniently, and easily buildvisualizations that also integrate opportunistic offloading of computa-tion to networked devices that also happen to be available in the samephysical environment of the device being used to display a visualiza-tion. To facilitate this mission, our implementation of VisHive is builtin JavaScript and uses standard web technologies, such as WebRTC,WebSockets, and JQuery, thus requiring no separate download or in-stallation for each participating cell. In other words, the main objectiveof the VisHive toolkit is to provide a “poor man’s cluster” solution in-side the very same web framework that developers routinely use whenbuilding web-based visualizations, while at the same time eliminatingthe need for server-side components.

To demonstrate the utility of the VisHive toolkit, we present

1http://www.w3.org/TR/workers/

STOCKHIVE, a time-series prediction visual analytics tool for stockmarket data. StockHive supports a mixed-initiative interaction modelwhere the tool provides predictions of future changes for differentstocks, and the analyst can choose to either accept a specific predic-tion or make a new one. The tool then responds by computing newpredictions based on the selected one. Temporal as well as inter-stockpredictions generate a prediction path into the future for each stock,which is a computationally expensive operation performed using neu-ral networks. Since user choice will cause predictions to become abranching decision tree of alternative possibilities, StockHive offloadsthe predictions in a breadth-first way to available VisHive cells. Whilewaiting for user input, the tool continuously adds new predictions asthe cells finish their computation and return with a result. When theuser makes a choice, the prediction tree is invalidated and a new oneis created and iteratively refined. We also show performance mea-surements for VisHive for various combinations of devices, such asa laptop, a tablet, and a smartphone device. In summary, our perfor-mance results show a significant completion time improvement usingthe VisHive toolkit compared to single-thread computation.

The remainder of this paper is structured as follows: We first reviewthe relevant background on distributed computing and web-based vi-sualization, as well as the application to time-series prediction. Wethen discuss our design framework for ad-hoc clusters in web-basedvisualization. We present the VisHive toolkit, and then exemplify itsuse with the StockHive visual analytics tool. This is followed by aperformance evaluation. We close the paper with a discussion, ourconclusions, and our plans for future work.

2 BACKGROUNDIn this section, we discuss existing literature on distributed computing,especially on mobile devices, web-based visualization methods, andvisual analytics for big data. We focus on the specific componentsrelevant to the design and construction of VisHive.

2.1 Distributed Computing On Mobile DevicesLin et al. [26] first proposed a mobile network where nodes would beorganized into non-overlapping clusters that are independently con-trolled and dynamically loaded. The proposed cluster algorithm wasrobust in the face of node failure or insertion/deletion. Wang et al. [33]worked on a bandwidth adaptive clustering approach for mobile ad-hoc networks that maintains clusters using local topology informationonly. In their approach, the member nodes forward only the mainte-nance messages probabilistically based on available bandwidth. Thisensured adaptability to network conditions and reduces message over-head. This yielded the idea to construct our system to forward chunksof data for calculation to the member nodes in a probabilistic mannerdepending on the topology information. That is, the member nodeswith more computational power and faster response time get morepreference automatically. Lee et al. [25] discussed the challenges andadvantages of utilizing mobile devices for distributed analytics basedon an implementation of the Hadoop analytic framework. After a per-formance analysis of their implementation, they concluded that currentmobile devices face significant limitations on transmitting and receiv-ing reliable TCP data streams, which is required to avoid interruptionswhile performing distributed analytics.

A number of computational offloading frameworks have been pro-posed for computationally intensive mobile applications, such as byHung et al. [22], Shiraz et al. [30], and Goyal et al. [19]. Such ap-plications are said to be elastic in nature, and each approach partitionproblems at different levels of granularity at runtime. In most cases,the distributed application processing platform is composed of a mo-bile device that runs a local application, a wireless network medium,and a remote cloud server node. In cases where there are insufficientresources on the mobile device, an elastic mobile application can bepartitioned such that any computationally intensive components of theapplication can be offloaded dynamically during runtime. Hassan etal. [20] showed in their study of computing-intensive mobile applica-tions that outsourcing these computations to nearby residential com-puters or devices may be more advantageous than public clouds due

to network impact. Cuckoo, a computation offloading framework forsmartphones developed by Kemp et al. [23], allows computation of-floading for Android phones to a remote server. Shiraz et al. [30]showed that current mobile computational offloading frameworks im-plement resource-intensive procedures for offloading. This involvesthe overhead of transmitting application binary code as well as deploy-ing distributed platforms at runtime. However, runtime computationaloffloading is useful in decentralized distributed platforms, such as mo-bile ad-hoc networks. On the other hand, as Shiraz et al. mention intheir conclusion, remote server nodes are unpredictable and compu-tational offloading should therefore be performed on ad-hoc basis, atruntime. This motivated us to design our framework as an ad-hoc net-work of mobile devices that perform computations as needed.

2.2 Visualization on the Web

In the early 1990s, information visualization was still considered anemerging discipline. Towards the turn of the century, however, thepervasiveness of the World Wide Web had led to many changes, in-cluding one important application: visualization on the web. Rohrer etal. [29] notes that the Web had progressed as a source of informationas well as the underlying delivery mechanism for interactive informa-tion visualization. They also note that the web, as a fundamentallynew medium for visualization, was changing the way visualization ap-plications were developed and used.

Most recent work includes Protovis [5], the JavaScript InfoVistoolkit [4], and additional toolkits for web visualization such as Pro-cessing.js, Raphaël, and Paper.js. Protovis aimed at making visual-izations more accessible to the web and interaction designers, uses‘marks’ to define visualization primitives such as bars (for bar charts),lines (for line charts), and labels. It provides the ability to assign dataand visual properties such as position, color, and opacity both stati-cally and dynamically to each mark, making for a succinct way forexpressing visualizations over the web. In contrast, the JavaScriptInfoVis (TheJIT) toolkit provides chart-level abstractions instead ofshapes. However, of most interest to us is D3. Bostock et al [6] devel-oped the D3 visualization toolkit to enable a direct binding between theinput data and the document object model. This is ideal for browser-based visualizations because browsers do not provide the same opti-mization opportunities as compiled programming languages. It is alsotwice as fast as Protovis. For these reasons, we use D3 as the targetplatform for web-based visualization in VisHive.

2.3 Big Data Analytics

When it comes to big data and its visualization techniques, one set ofrules might not apply across all datasets. It is usually the case that re-searchers develop or use a few specialized techniques to tackle differ-ent types of datasets. For example, Fisher et al. [16] show techniquesfor tackling business intelligence, Wong et al. [34] discuss challengesfacing extreme-scale visual analytics, and Steed et al. [31] developed avisual analytics system for specific application to the analysis of com-plex earth system simulation datasets.

Nevertheless, some common problems exist when it comes to vi-sual analysis of big data. For example, traditional data analysis toolsfor big data rarely allow real-time visualization, whereas most real-time visualization tools perform poorly with big data [21]. Hence,tackling big data visualization involves two main challenges: percep-tual and interactive scalability. Support for perceptual scalability hasbeen discussed by Ahlberg and Snheiderman [1] using filtering, byDas Sarma et al. [13] using spatial sampling, and by Carr et al. [9]using aggregation for scatterplots. For interactive scalability, Liu etal. [27] developed a VA system called imMens, which uses WebGLfor data processing and is based on the principle that interactive scala-bility should be limited by the chosen resolution of the visualized dataand not the total number of records. The computational requirement isanother stumbling block that hinders real-time interactive visualizationof big data. Choo and Park [12] propose methods such as data scaleconfinement, classification of pre-clustered data, and linear transformsof higher dimensions to deal with computational complexity.

3 DESIGN FRAMEWORK: AD-HOC COMPUTATIONAL CLUS-TERS FOR WEB-BASED VISUALIZATION

The web is becoming a ubiquitous medium for sensemaking throughvisualizations [6], sharing visual insights from data, and harnessingcollective intelligence [32]. The goal of the VisHive framework is tofacilitate the creation of ad-hoc and opportunistic device clusters overthe web that forms a distributed system for handling computationallyintensive algorithms driven by visual analytics applications. The driv-ing scenario behind VisHive is the fact that people tend to carry morethan a single device with them at all times. Leveraging these devicestogether can help scale our analytics applications to the challenges ofbig data. Many open and public distributed platforms exist for similarpurposes; however, these are not built for visual sensemaking and datascience, which yields its own unique requirements. Visualizations fol-low a transformative pipeline that turn data into interactive graphicalrepresentations through multiple stages [10].

To target visual analytics of big data, we need distributed frame-works capable of coupling with the visualization pipeline and aidingeach step in the pipeline using connected local devices to generate avisual representation and handle user interaction. VisHive is our at-tempt at building such a framework using modern web technologies.For this purpose, we list seven design guidelines categorized into threegroups for developing the VisHive framework.

3.1 Networked DevicesThe fundamental requirement for creating a distributed system is anetwork of connected devices. To utilize their devices including per-sonal computers and mobile phones for distributed computation, theVisHive framework should be capable of seamlessly connecting mul-tiple devices into a distributed system.

D1 Cross-platform support: The devices used by analysts for per-sonal computing and sensemaking can be diverse, ranging frompersonal and public computers to mobile devices. Therefore, theframework should work independent of the underlying platformsof these devices. A distributed computational cluster should bedeveloped using the available devices irrespective of their plat-form, modality, and physicality.

D2 Ad-hoc connectivity: A user should be capable of opportunis-tically creating a cluster from available devices. This includesadding to or removing devices from the clusters at any point.

Peer-to-peer networks are ideal for this purpose [3, 17] as they donot set a hierarchy among the devices, and they do not require addi-tional infrastructure in the form of a server to create clusters.

3.2 Responsive DistributionOnce the devices are connected into a distributed system, supportingvisual sensemaking on the device cluster would require an intelligentmanagement of the connected devices. The challenge in this case isto ensure that opportunistic maintenance of the clusters—adding orremoving devices at any point—does not interfere with user activitywithin the visual analytics system.

D3 Responsive computation management: The devices that are notactively used are free to engage in potentially hard computa-tional activities compared to the active ones. Computation jobsassigned to the devices within a cluster should not only be basedon the processing power and available memory on the device,but also based on their current use. Therefore, the frameworkshould identify activity on these devices and manage distributionaccordingly.

D4 Fault-tolerance: Devices entering or leaving the cluster shouldbe able to take up new jobs or pass on existing jobs. The frame-work should be fault tolerant: adaptive to the event of failure atsome devices and new devices in the cluster.

Computation Job Sharing Channel (P2P)

Hive

Chunks

Cells

Fig. 2. Example VisHive application network architecture.

3.3 Supporting Visualization and Interaction

Visual analytics systems often utilize computationally complex algo-rithms. For example, to understand the web traversal history of users,spanning trees from browsing histories were generated and visual-ized [15]. Machine learning and data mining models are also usedto identify specific features, visualize interesting patterns, and promptuser exploration [18, 28]. While some of these models are inherentlyparallelizable, it should also be possible to configure how the underly-ing algorithm can be spread across the clusters.

D5 Parallelization: The algorithms driving visual analytics shouldbe parallelized to take full advantage of the computational clus-ter. The framework should support defining a parallel versionof an algorithm at each stage of the visualization pipeline withfeatures to adapt the algorithm to the cluster’s size and elements.

D6 Data-driven distribution: Data can come from multiple locationsover time (spatiotemporal), multiple data sources (multi-modal),and can comprise several attributes per data point (multi-variate).Computations in the visualization pipeline involve transformingdata of one form (input) to another (output) at each step. It shouldbe possible to create computational job chunks for devices in thecluster based on a split at any dimension of the data. For examplein spatiotemporal data, jobs can be created either by splitting databased on time or space.

D7 Handling user interaction: Interaction is an integral part of vi-sualization and visual analytics. Users often perform continu-ous interaction to explore different segments of data. These in-teractions generate computations for the cluster to handle, andnew interactions can render the previous computations obsolete.Therefore, the framework should be able to preempt and restartcomputations at each device in the cluster upon user interaction.

4 THE VISHIVE TOOLKIT

The VisHive toolkit was developed for building ad-hoc and oppor-tunistic clusters of computing devices for web-based visualization. Itis implemented completely in JavaScript targeting the web platformand thus providing cross-platform support (D1). It uses the WebRTCAPI by W3C2 for establishing peer-to-peer connections across webbrowsers. Since the web is the target platform, the devices—calledcells—are connected into a device cluster—known as a hive—as soonas they open a VisHive application webpage on the web browser (D2).The toolkit provides modules for structural definitions of distributedalgorithms based upon the attributes of the hive (D3), and handles en-tering/leaving cells in the hive (D4). The toolkit integrates closely withthe visualization pipeline, allowing developers to handle the stages inthe pipeline in parallel using the connected devices (D5, D6, D7). Fig-ure 2 shows the network architecture of an VisHive toolkit example.

2http://www.w3.org/TR/webrtc/

4.1 System/Network ArchitectureThe VisHive toolkit consists of four components based on the designguidelines D1 through D7:

C1 Job partition layer that divides a high-level computation opera-tion into computation jobs (known as chunks);

C2 Communication layer to share the chunks across the cells;

C3 Integration layer that combine the results from all the cells andpass them to the web visualization; and

C4 Job control layer that handles fault-tolerance i.e., cells enteringand leaving the hive.

Figure 3 depicts the VisHive architecture covering these compo-nents. The VisHive network architecture is a peer-to-peer (P2P) con-nection established across the browsers of the cells using WebRTCtechnology, popularly used for real-time video calls over the webbrowser3. Our implementation uses the open source PeerJS frame-work4 for establishing the P2P connections across the cells. TheP2P connection creates the communication layer (C2) for transferringchunks to the cells within the hive.

The control of the distributed system is inherently with the instancethat the user actively interacts with. Currently, we support just oneactive user interacting with VisHive on a device—the framework isnot intended for collaborative visualization. That way, the control-ling instance, or master, takes the help of other idle devices, or slaves,and distributes the computation jobs amongst them. The slaves areour cells, and, along with the master instance, make up the hive. Thecells use their browsers and join the VisHive server by navigating toa VisHive application URL. As these cells join the hive, they sharedetails on their free CPU power and resource allocation, which themaster then uses to allocate the jobs amongst them. Since the VisHivetoolkit targets ad hoc and opportunistic device clusters (for example,between an analyst’s smartwatch, smart phone, and laptop), this reg-istration process ensures that distribution happens in an environment-aware fashion.

After the cell registration process, the individual cells are capableof distributing the computations involved in each stage of the pipelinefor generating a visualization (C1). The VisHive toolkit also maintainscommon data structures across cells. For example, this can contain thecomputational models used for each step of the pipeline. When a mas-ter shares computation jobs with the slaves in the hive, the cells acceptthe jobs and look up the input data from the job definition. All the cellsincluding the master will then perform the required computations onthe input using the shared computational models and send the outputback to the master. The master then handles the integration process byrecombining the results from the cells using the integration modules(C3). By default, the integration modules combine the results from thecells into a list, however, this logic can be overriden by the developerbased on the context of use.

4.2 Job Allocation and ControlJob allocation and control within the VisHive distributed system ishandled by the C1 and C4 components of the toolkit. Each compu-tation job (chunk) is treated as a mapping from input to output, gen-erated by shared computation logic. This is similar to the MapReducemodel [14] for processing big data on parallel and distributed systems.The default configuration for job partitioning involves splitting the in-put data for a high-level computation into jobs that work on parts ofthe data. The job allocation module creates the chunks based on theavailable resources on each cell and the number of cells in the hive(including the master and the slave cells). Furthermore, explicit ap-plication logic created by the VisHive application developer (end-userdeveloper) for splitting a computation (and input) into chunks is alsosupported.

3https://apprtc.appspot.com/4http://peerjs.com/

Communication Layer (P2P)

Job Partition Layer

Integration Layer

Job Control Layer

Fig. 3. VisHive toolkit infrastructure containing the four components tocreate and manage distributed computation jobs (chunks).

The VisHive toolkit maintains a shared memory containing thecommon data structures for performing computation jobs on eachcell. For example, a pattern recognition model that translates multi-dimensional input vectors part of a large dataset into features, can beshared with other cells to perform individual input to feature mapping.Once the computation jobs are finished the results are sent back to theorigin, where the results are compiled and transferred to the visualiza-tion pipeline.

The recovery modules (C4) are responsible for handling the failurescenarios in the VisHive distributed systems. This automatically han-dles the failure scenarios when existing cells leave or new cells enterthe hive. For these scenarios, a cell that leaves a hive passes on itsassigned computational jobs uniformly to the other cells in the hive,and a cell that enters a hive during a current computation process hasto wait till the next batch of jobs.

Distributed systems also face issues with conflicts and concurrency.In our case, the VisHive toolkit inherently splits the input data basedon the application logic and passes it to the cells in the hive. Thismeans each cell handles a specific job and, therefore, cannot by de-sign cause conflicts. When simultaneous responses are sent back tothe origin, the event-driven and asynchronous I/O nature of JavaScripthandles the incoming results in a queue. This provides the toolkit withthe capability of concurrency managed distribution. Apart from this,we assume that any conflicts occuring during the integration processthat are application specific are handled by explicit application logicdeveloped by the toolkit users for combining the results from the hivein their application.

4.3 Visual InterfaceThe VisHive toolkit works in close coupling with the visualizationpipeline, as described in the design guidelines (Section 3). A stan-dard visualization pipeline [8, 10] consists of data processing, datafiltering, spatial mapping, and rendering stages. Beyond this, user in-teraction requires going through these stages to re-render the new vi-sual representation. Following this model, each stage of the pipelineinvolves transforming an input into an output, which is exactly howVisHive computation jobs are created. For example, data cleansinginvolves converting the raw data into a structured data structure. Thisrequires going through individual data points and parsing them. Forlarge datasets, this operation can be expensive due to the sheer amountof data. In a distributed system, this operation is embarrassingly par-allel. VisHive can split the data into computational jobs that can beprocessed across the cells in the connected hive (component C1).

After the results from the cells are received, they are integrated atthe master. VisHive supports definition of the computation chunks asdependent or independent of each other. If they are independent, theresults are chunked together and passed to the visualization pipelineas soon as they are received. If otherwise, an explicit logic to integratethe results is expected from the end-user developer (VisHive API user).As soon as the computation job is finished, the shared data structurescan be destroyed if they are longer required.

Beyond data cleansing, operations in data filtering can involve si-multaneously analyzing multiple data points to find interesting data.The VisHive toolkit is flexible to share the data and computation model

required for any part of this filtering process. Beyond this, the toolkitcan works with generating spatial layout across multiple cells. Forthis, information about the display configurations including resolu-tion, color schemes, and representation types about the origin cell areshared as part of the chunk by the master. The rendering layer of thepipeline, however, is generated on the master, as this is very muchlinked to the graphics modules on the master.

Component C1 and C3 of the toolkit also handle the interactionhappening on the master. A user interaction is linked to the earlierstages in the visualization pipeline, and connected to the original data.For handling the user interaction, the pipeline can be executed againbased on the new data of interest. VisHive toolkit can be configured tomaintain if new interactions are dependent of the previous interactions,in which case the data of interest can be combined with correspondingdata from the previous interaction. For example, in case of brush andlink style interaction with visualizations [7], every new brush can beindependent or dependent on the previous brushes based on the joininglogic (AND, OR). Independent operations are handled by VisHive bypreempting current computation jobs on the cells.

5 EXAMPLE: STOCKHIVE

We developed a web visual analytics tool, STOCKHIVE, for testing theVisHive toolkit. StockHive is an application for visual prediction ofstock market data based on suggestions from machine learning mod-els. It is built completely using web technologies, including HTML5,JavaScript, and CSS, with data from the Yahoo! Finance API. We col-lected 2 years of stock market data, covering parts of years 2013, 2014,and 2015, for the internet information providers subsector, and addedoptions in StockHive to manage a portfolio of stocks from this sub-sector. The stocks in the portfolio are visualized as line charts and thepatterns of future behavior are generated. The visualization pipelinefor this application includes parsing the CSV files of stock prices,converting them into (date, price) pairs, training the learning models,calculating the predictions for each stock in the portfolio, mapping theparsed data onto a 2D line, and then finally creating a tree layout fromthe predictions (Fig. 4). The users can interact with the visualizedprediction space to make their own predictions.

StockHive explores a user guided prediction model that relies uponuser knowledge to pick a prediction from multiple possibilites gener-ated by neural networks. A feed-forward neural network that is trainedon the past 2 years of stock data provides multiple predictions for thefuture given the stock prices within the last week (temporal predic-tions). This model generates predictions of different confidences. Fur-thermore, predictions from the neural network can be chained togetherto create a temporal prediction path in the future, which resembles atree (Fig. 4).

When a user interacts with the system by making a prediction basedon either the suggestions from prediction tree or external informationsuch as quarterly reports, news, and social media, the prediction treeis modified to show predictions based on the historical correlation be-tween the stocks. For this, we used a self organizing map [24] trainedon the co-occurences of a group of stocks is used. Thus, the user canpredict the behavior of a stock in his portfolio using the temporal andcorrelation models. For example, if Google stock has three temporalpredictions and Yahoo stock has two temporal predictions, then thecorrelation predictions obtained from choosing a temporal predictionfor Google can help the user make a prediction for Yahoo.

StockHive uses VisHive toolkit to generate the prediction spaceconsisting of the tree structure. There are three complex operationsthat are modified into chunks and distributed across the hive,

O1 Prediction tree: The prediction tree contains more than N(L+1)nodes when predicting L days into future with N number of pos-sibilities (branches) at each node in the tree. Creating this treefor each stock requires calculating all the nodes in the tree, whichinvolves operations of exponential complexity. We created twostrategies with VisHive for this operation: (1) splitting the pro-cess of generation of this tree based on stocks, so each device

Prediction tree generated by splitting computations

based on stock or the tree branchesStock B

Stock A

Hive

Stock A

Stock B

User Prediction

Prediction space is recomputed based on user interaction with one stock

Fig. 4. StockHive application visualizes stock market data and provides predictions of future behavior. These predictions of different confidencesgenerated for each stock form a tree structure (left). This operation is distributed across the hive by splitting the computation based on stock or thepath in the tree. Once the user interacts (right) by predicting a stock value for the future, the prediction trees for the remaining stocks need to berecomputed based on the correlation between stocks. This process is also handled by the VisHive toolkit.

gets a list of stocks it needs to predict; and (2) splitting the pro-cess based on time, where each cell in the cluster generates someprediction paths (branches) in the tree.

O2 Prediction fitting: Once a prediction tree is generated, whenuser interacts by making a prediction for a stock, a path in thetree which fits this prediction should be found. For example, ifthe user says that Apple will increase by 10% in the next 10 days,a path in the prediction tree that best fits this estimate should befound. This process involve finding the closest and highest confi-dence paths from a list of NL paths. The paths object maintainedon the master device is split into chunks and passed to all otherdevices for filtering the best possible fit on each cell, which arefurther verified to find the best predidiction fit by the master.

O3 Correlation predictions: After the best prediction fit is calcu-lated, the prediction tree for the rest of the stocks are modifiedto reflect the correlation between stocks. This process involvesrecreating the prediction tree using the correlation predictions ateach time step, in contrast to the temporal prediction tree genera-tion. However, similar to previous operation, this process is par-allelized by splitting the tree generation computation into chunksbased on the rest of the stocks or prediction paths/branches thatare handled by each cell.

While O1 represents an operation happening when the user visu-alizes stock market data for any time period by adding a stock to hisportfolio, the remaining operations (O2 and O3) happen when the userinteracts with specific stocks of interest. The operations performed inthe visualization pipeline to handle the stock data are inherently paral-lelizable. Therefore, we use VisHive to split the data belonging to thestocks that are in the user’s portfolio based on the strategies describedabove and generate the chunks that need to handled by each cell.

For tree generation (O1), VisHive defines chunks as the past sevenvalues that are used for prediction and along the strategy that needs tobe followed to handle this data. On the other hand for handling userinteraction (O2, O3), the chunks contain the predictions made by theuser and links to the computational models to perform the correpond-ing operation. Once the chunks are received by the cells, they processthe chunks using the computational models (trained neural networks)that are maintained on the shared memory of each cell. StockHive isable to seamlessly handle generation of these complex prediction pathswith VisHive and utilize the computational power of the opportunisticcluster. We tested the performance of this application using multipledevice combinations: laptop, smartphones, and tablets, and results ofthis study are described in the next section.

6 PERFORMANCE EVALUATION

We executed StockHive on an ad hoc network hosted by a laptop run-ning Google Chrome that served as the master instance. Worker cellsare mobile devices that joined the ad hoc network and were addedto the StockHive cluster. A preliminary performance evaluation wasdone to see how the number of cells in the hive, apart from the mas-ter, impacted the performance for a computation that can be split intochunks just by splitting the data. For the performance evaluation, wechose to create the chunks by letting each cells handle a subset of thestocks of interest. The performance of VisHive was measured as thetotal time taken to finish all the predictions for the stock values of thecompanies chosen by the user.

During the performance evaluation, the system was started and theuser chose 2, 4 or 8 stocks as their portfolio. After the stocks havebeen chosen, their future values have to be calculated by StockHiveand the prediction tree should be drawn for each. We predict 20 daysinto the future, using the trained neural network on the past values.This was repeated for the scenario of one master instance, one masterand a slave (a tablet) and one master and two slaves (a tablet and aphone). Results are reported in Fig. 5.

When the timing results were analyzed, we found that the timetaken for all the predictions to finish reduced (almost) linearly withthe number of cells added, with few overheads that need to be furtherinvestigated. Although this is expected in embarrassingly parallel sce-narios, it validates our system as the performance of the hive improveswith added cells and more computation power.

On closer analysis, we found that the time taken for the commu-nication between the cells and the master or vice versa was in factnegligible (order of 10ms) since the chunks are fairly small in size (or-der of Kb). This means that there was a small overhead due to thecommuncation layer itself. However it is to be noted that since thiswas a locally hosted ad hoc network, the communication was muchfaster and more stable than other wireless connections (for instance,public Wifi).

The processing power of the devices plays a big role in the perfor-mance, as the phone outperformed the tablet due to its superior pro-cessor. However mobile devices are inherently throttled to avoid over-heating under continued stress and this could be a major reason for thedelays on both devices, along with their limited computing power.

7 DISCUSSION

One contentious topic that we have not yet covered in this paper is theexistence of the IPython5 development environment, which throughthe guise of IPython Notebooks—a web-based interactive shell for

5http://www.ipython.org/

Fig. 5. StockHive performance for three scenarios: only master, onemaster-one slave (tab) and one master-two slaves (tab, phone).

Python—is quickly becoming the main platform for scientific comput-ing in the web browser. Meanwhile, our work on the VisHive toolkitin this project is squarely aimed at distributing JavaScript code acrossmultiple devices. One of the reasons for the success of IPython forscientific computing is the immense ecosystem of Python packagesavailable for all conceivable computational needs. Obviously, our in-tention with VisHive is in no way to call for replacing IPython in itsundisputed leadership position. Rather, we see VisHive as filling aniche that is very different from the greater mandate of IPython: in-tegrating computation in a web-based visualization setting, which isalready going to be JavaScript-based given the current state of visu-alization toolkits for the web. IPython, in contrast, is still specialitysoftware that is not considered useful for the general population, istherefore not integrated with standard browser installations, and thusrequires a separate download.

The same argument extends to general server-based, cloud-based,or cluster-based computational platforms. VisHive is not intended toreplace such platforms in any shape or form, and this paper is in noway a call to abolish them in favor of local computation. Instead, ourtoolkit provides an example solution for the common situation when auser has access to multiple local devices that could be formed into anad-hoc cluster to help with computation performed in the browser run-ning one of them. Since mobile devices as well as personal computersare exclusively designed for focused use—i.e., with one user using asingle device, and not many devices at once—these additional devicesare underutilized anyway. Our work here simply offers a suggestionon how to leverage these available devices in a way that is lightweightand easily integrated with current web development practices.

In addition to these caveats, it is also important to realize thatour VisHive toolkit is still a research prototype and not yet readyfor production use. A production toolkit would need robust mecha-nisms for discovery, matchmaking, and session management, some-thing VisHive currently does not have. We deemed such functionalityoutside the scope of our research contributions for this project, so wehave not spent any effort on them. However, we anticipate that a pro-duction version of our toolkit, or another toolkit that incorporates someour toolkit’s functionality, would be able to provide such mechanismsrelatively easily.

Another limitation is that VisHive does not provide any explicit sup-port for how to distribute computation so that it can be assigned intomanageable chunks, sent off to separate cells, performed separately,and then recombined correctly by the master. In fact, our examplein the StockHive application is almost embarrassingly parallel, in thatpredictions for one branch are not dependent upon predictions in otherbranches. Our focus in this work has been on the parallelization mech-anism itself, and not the distributed algorithms you would run on theindividual cells. There exists vast amount of work in fields such as

parallel computing, distributed systems, and high-performance com-puting that can begin to guide the design of suitable algorithms thatcan be run on top of VisHive.

8 CONCLUSION AND FUTURE WORKWe have presented VisHive, a JavaScript toolkit that allows for con-necting multiple devices into an ad-hoc cluster using just the webbrowser as the computational platform. Devices become cells in ahive where a master allocates and recombines jobs to slaves that per-form the actual calculation. The communication between the cells isperformed using direct browser-to-browser connections in a peer-to-peer architecture, thus requiring no central server or connection to theinternet. To showcase the utility of the technique, we presented Stock-Hive, a visual analytics tool for stock market prediction that employsmixed-initiative interaction. Instead of performing a set number ofpredictions and then waiting for user input, StockHive splits the pre-dictions across the available devices and keeps performing them untilinterrupted by user input. Our performance evaluation using the Stock-Hive application show a significant speedup basically linear with thenumber of connected cells.

We see many potential refinements and improvements of theVisHive toolkit in the future. For example, we envision continuingour work on parallelizing all steps of the visualization pipeline, in-cluding not just data transformations and the visual encoding, but alsothe view transformations and input management. Furthermore, we areinterested in investigating how to use the slave cells not just as head-less computational units, but also for collaboration (for multiple users)or for supporting the main device with additional views and input sur-faces (for a single user with multiple devices). Finally, we would liketo study the usability aspects of firing up multiple devices to offload amain device, and how this discovery process can be streamlined.

ACKNOWLEDGMENTSThis work was partially supported by U.S. National Science Founda-tion award IIS-1253863. Any opinions, findings, and conclusions orrecommendations expressed in this article are those of the authors anddo not necessarily reflect the views of the funding agency.

REFERENCES[1] C. Ahlberg and B. Shneiderman. Visual information seeking: Tight cou-

pling of dynamic query filters with starfield displays. In Proceedings ofthe ACM Conference on Human Factors in Computing Systems, pages313–317, 1994.

[2] S. K. Badam and N. Elmqvist. PolyChrome: A cross-device frameworkfor collaborative web visualization. In Proceedings of the ACM Confer-ence on Interactive Tabletops and Surfaces, pages 109–118, 2014.

[3] S. K. Badam, E. Fisher, and N. Elmqvist. Munin: A peer-to-peer mid-dleware for ubiquitous analytics and visualization spaces. IEEE Transac-tions on Visualization and Computer Graphics, 21(2):215–228, Feb 2015.

[4] N. G. Belmonte. The JavaScript InfoVis Toolkit. http://philogb.github.io/jit/, accessed March 2014.

[5] M. Bostock and J. Heer. Protovis: A graphical toolkit for visual-ization. IEEE Transactions on Visualization and Computer Graphics,15(6):1121–1128, 2009.

[6] M. Bostock, V. Ogievetsky, and J. Heer. D3: Data-driven docu-ments. IEEE Transactions on Visualization and Computer Graphics,17(12):2301–2309, 2011.

[7] A. Buja, J. A. McDonald, J. Michalak, and W. Stuetzle. Interactive datavisualization using focusing and linking. In Proceedings of the IEEEConference on Visualization, pages 156–163, 1991.

[8] S. K. Card, J. D. Mackinlay, and B. Shneiderman. Readings in Informa-tion Visualization: Using Vision to Think. Morgan Kaufmann, 1999.

[9] D. B. Carr, R. J. Littlefield, W. Nicholson, and J. Littlefield. Scatterplotmatrix techniques for large n. Journal of the American Statistical Associ-ation, 82(398):424–436, 1987.

[10] E. H. Chi. A taxonomy of visualization techniques using the data statereference model. In Proceedings of the IEEE Symposium on InformationVisualization, pages 69–75, 2000.

[11] J. Choi, D. Park, Y. Wong, E. Fisher, and N. Elmqvist. VisDock: Atoolkit for cross-cutting interactions in visualization. IEEE Transactionson Visualization and Computer Graphics, PP(99):1–1, 2015.

[12] J. Choo and H. Park. Customizing computational methods for visual ana-lytics with big data. IEEE Computer Graphics & Applications, 33(4):22–28, 2013.

[13] A. Das Sarma, H. Lee, H. Gonzalez, J. Madhavan, and A. Halevy. Ef-ficient spatial sampling of large geographical tables. In Proceedings ofthe ACM SIGMOD Conference on Management of Data, pages 193–204,2012.

[14] J. Dean and S. Ghemawat. MapReduce: simplified data processing onlarge clusters. Communications of the ACM, 51(1):107–113, 2008.

[15] P. Doemel. WebMap: a graphical hypertext navigation tool. ComputerNetworks and ISDN Systems, 28(1):85–97, 1995.

[16] D. Fisher, R. DeLine, M. Czerwinski, and S. M. Drucker. Interactionswith big data analytics. ACM Interactions, 19(3):50–59, 2012.

[17] E. R. Fisher, S.-K. Badam, and N. Elmqvist. Designing peer-to-peerdistributed user interfaces: Case studies on building distributed appli-cations. International Journal of Human-Computer Studies, 72(1):100–110, 2014.

[18] R. Fuchs, J. Waser, and M. E. Gröller. Visual human+ machine learn-ing. IEEE Transactions on Visualization and Computer Graphics,15(6):1327–1334, 2009.

[19] S. Goyal and J. Carter. A lightweight secure cyber foraging infrastructurefor resource-constrained device. In Proceedings of the IEEE Workshopon Mobile Computing Systems and Applications, pages 186–195, 2004.

[20] M. A. Hassan and S. Chen. Mobile MapReduce: Minimizing responsetime of computing intensive mobile applications. In Mobile Computing,Applications, and Services, pages 41–59. 2012.

[21] J. Heer and S. Kandel. Interactive analysis of big data. ACM XRDS:Crossroads, 19(1):50–54, 2012.

[22] S.-H. Hung, C.-S. Shih, J.-P. Shieh, C.-P. Lee, and Y.-H. Huang. Execut-ing mobile applications on the cloud: Framework and issues. Computers& Mathematics with Applications, 63(2):573–587, 2012.

[23] R. Kemp, N. Palmer, T. Kielmann, and H. Bal. Cuckoo: a computationoffloading framework for smartphones. In Mobile Computing, Applica-tions, and Services, pages 59–79. 2012.

[24] T. Kohonen. Self-organizing maps, 1995.[25] S. Lee, K. Grover, and A. Lim. Enabling actionable analytics for mobile

devices: performance issues of distributed analytics on Hadoop mobileclusters. Journal of Cloud Computing: Advances, Systems and Applica-tions, 2(1):15, 2013.

[26] C. R. Lin and M. Gerla. Adaptive clustering for mobile wireless networks.IEEE Journal on Selected Areas in Communications, 15(7):1265–1275,1997.

[27] Z. Liu, B. Jiang, and J. Heer. imMens: Real-time visual querying of bigdata. In Computer Graphics Forum, pages 421–430, 2013.

[28] A. Malik, R. Maciejewski, Y. Jang, W. Huang, N. Elmqvist, and D. Ebert.A correlative analysis process in a visual analytics environment. In Pro-ceedings of the IEEE Conference on Visual Analytics Science and Tech-nology, pages 33–42, 2012.

[29] R. M. Rohrer and E. Swing. Web-based information visualization. IEEEComputer Graphics & Applications, 17(4):52–59, 1997.

[30] M. Shiraz, M. Sookhak, A. Gani, and S. A. A. Shah. A study on thecritical analysis of computational offloading frameworks for mobile cloudcomputing. Journal of Network and Computer Applications, 47:47–60,2015.

[31] C. A. Steed, D. M. Ricciuto, G. Shipman, B. Smith, P. E. Thornton,D. Wang, X. Shi, and D. N. Williams. Big data visual analytics for ex-ploratory earth system simulation analysis. Computers & Geosciences,61:71–82, 2013.

[32] F. B. Viégas, M. Wattenberg, F. Van Ham, J. Kriss, and M. McKeon.ManyEyes: A site for visualization at internet scale. IEEE Transactionson Visualization and Computer Graphics, 13(6):1121–1128, 2007.

[33] Y. Wang and M. S. Kim. Bandwidth-adaptive clustering for mobile ad hocnetworks. In Proceedings of the IEEE Conference on Computer Commu-nications and Networks, pages 103–108, 2007.

[34] P. C. Wong, H.-W. Shen, C. R. Johnson, C. Chen, and R. B. Ross. The top10 challenges in extreme-scale visual analytics. IEEE Computer Graph-

ics & Applications, 32(4):63, 2012.

VisHive: Creating Ad-hoc Computational Clusters using ... · 7 JT) JWF DMVTUFS PG DPOOFDUFE EFWJDFT...

Documents

Transcript of VisHive: Creating Ad-hoc Computational Clusters using ... · 7 JT) JWF DMVTUFS PG DPOOFDUFE EFWJDFT...