THE UNIVERSITY OF CHICAGO

ANALYTICS-ORIENTED VIDEO STREAMING STACK

A DISSERTATION SUBMITTED TO

THE FACULTY OF THE DIVISION OF THE PHYSICAL SCIENCES

IN CANDIDACY FOR THE DEGREE OF

MASTER OF SCIENCE (MS)

DEPARTMENT OF COMPUTER SCIENCE

BY

AHSAN PERVAIZ

CHICAGO, ILLINOIS

15TH JUNE 2019


Copyright © 2019 by Ahsan Pervaiz

All Rights Reserved


Dedication Text


Epigraph Text


TABLE OF CONTENTS

LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi

LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii

ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii

ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix

1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

2 MOTIVATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1 Distributed Video Analytics Pipelines . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Limitation of existing solutions . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 The case for a Custom Video Streaming Stack . . . . . . . . . . . . . . . . . . . 8

3 OVERVIEW OF DDS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.1 DDS: DNN Driven Streaming . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.2 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

4 DESIGN AND IMPLEMENTATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

5 EVALUATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5.2 Accuracy vs Bandwidth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.3 Delay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.4 Microbenchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5.5 Adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

6 FUTURE WORK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

7 RELATED WORK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

8 CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38


LIST OF FIGURES

1.1 A Client-Side heuristics based protocol . . . . . . . . . . . . . . . . . . . . . 2
1.2 A DNN feedback based streaming protocol . . . . . . . . . . . . . . . . . . . . . 3

2.1 Diminishing returns of increasing resolution . . . . . . . . . . . . . . . . . . 7
2.2 Scaling down and Spatially Cropping Frames . . . . . . . . . . . . . . . . . . . 10

3.1 Iterative control flow of DDS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

4.1 Components and Interfaces of DDS . . . . . . . . . . . . . . . . . . . . . . . . . 19

5.1 F1 Against Bandwidth for various configurations . . . . . . . . . . . . . . . . . 24
5.2 1-σ ellipse for DDS and Baselines . . . . . . . . . . . . . . . . . . . . . . . . 25
5.3 Worse Cases for DDS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.4 Batch Processing Time (Delay) . . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.5 DDS maintains high accuracy in presence of bandwidth fluctuations . . . . . . . . 30

6.1 Impact on Accuracy on stitching 4 images together . . . . . . . . . . . . . . . . 32


LIST OF TABLES

4.1 Responsibilities of different parts of the Analytics Pipelines . . . . . . . . . . . . 21

5.1 Resource Utilization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28


ACKNOWLEDGMENTS


ABSTRACT

As deep learning becomes the de-facto approach to live video analytics, we are seeing a surge

in efforts to develop distributed analytics pipelines that leverage the cloud resources to run

deep neural networks (DNNs) on live videos from edge cameras. Prior work has focused on

offloading computation from the edge devices to the cloud. This paper highlights a crucial yet

missing piece in the distributed video analytics pipeline – a custom video streaming stack

to save bandwidth between the camera and analytics server while ensuring high inference

accuracy. Through empirical studies, we have found unexploited opportunities to customize

today’s video streaming stack for video analytics. Inspired by the findings, we present

DDS, a video streaming stack that caters to distributed video analytics. Unlike previous

video streaming mechanisms designed to meet user experience or bandwidth budget, DDS

is built on a novel DNN-driven workflow that relies on server-side DNN logic to explicitly

balance the video analytics accuracy and bandwidth savings. At the core of DDS is an

interactive streaming protocol built on the active learning framework to make DNN-driven

workflow practical. We implement DDS as an underlying video streaming layer which could

be combined with a wide variety of distributed video analytics pipelines. Through evaluation

on real-world video datasets, we find that DDS increases the inference accuracy by up to 2x

while consuming the same or less bandwidth than other video analytics

pipelines.


CHAPTER 1

INTRODUCTION

In recent years there has been an increased interest in smart devices and surveillance cameras.

In 2018 alone, approximately 130 million networked cameras were shipped worldwide [34].

With the increase in video content produced by smart camera devices, there is a growing need

to extract meaningful information from the large input in an automated and cost-effective

way. Deep neural networks have improved the accuracy of such computer vision tasks dra-

matically, in some cases even outperforming human beings [15]. Using deep neural networks

(DNNs) on the devices that are typically producing the video content is prohibitively expen-

sive [28, 27]. Allowing these smart devices to be able to efficiently run these DNNs would

require engineering a large amount of compute and storage resources in them, which would

increase the cost of such devices. Hence, researchers have been developing offloading schemes

and distributed pipelines for such analytics which take the computationally expensive task

of running the DNNs away from the device and on to the cloud. In such schemes, instead

of the server consuming raw frames produced by the camera, the client compresses the raw

frames and selects a subset to send to the server, which then runs the DNNs and sends

results back to the client.

Prior work on offloading schemes has missed a crucial component - a custom video stream-

ing stack. The streaming stack has a significant impact on the accuracy, bandwidth utiliza-

tion, and delay of such protocols.

In this work, we use the key insight that streaming video content to a neural network

rather than a human being has fundamentally different requirements. Traditional video

streaming protocols (e.g. [30]) focus on the user-perceived quality. So one goal for such

streaming protocols is to stream video in the highest possible resolution and quality uninterrupt-

edly. In contrast, playback smoothness does not affect analytics in the same way. Rather

than focusing on the content as a whole, video analytics focuses on the temporal and


Figure 1.1: A Client-Side heuristics based protocol

spatial features of a video across time.

Previously proposed schemes make use of client side logic to apply a variety of heuristics

to send only important frames to the server for further analysis [8, 38, 21]. The schemes

then use the returned results to further process frames not sent to the server. Client side

heuristics are inherently less accurate than server side DNNs, causing either too many frames

to be dropped, in which case the accuracy of the whole system decreases, or too many frames

to be sent to the server, in which case the system consumes more bandwidth than is needed

for the task. Furthermore, these protocols do not exploit other avenues for saving bandwidth

with the help of compression because they work on the granularity of frames produced by

the smart device [8, 38].

The limitations of existing solutions motivate a different approach to solve the problem

of video analytics by making use of unexplored opportunities to achieve a better accu-

racy/bandwidth trade-off.

In this paper we present DDS, a new DNN-driven streaming stack that systemati-

cally balances bandwidth consumption and inference accuracy of distributed video analytics

pipelines. Unlike previous approaches which utilize client side logic, DDS is an iterative

protocol which makes use of the inference results from the server to determine the best com-

pression and streaming strategy. This approach allows DDS to maximize analytics accuracy

while saving bandwidth based on the results from the server rather than relying on client-side

heuristics which might not be good proxies for the content needed by the server to maximize

accuracy.


Figure 1.2: A DNN feedback based streaming protocol

The DDS protocol works iteratively. DDS, in its first iteration, sends a segment of the

video encoded in low resolution to the server. The server runs a complex DNN on these

low resolution frames and compiles a list of confirmed results along with a list of regions

in each frame in the segment that seem interesting and require further investigation. The

server sends both of these back to the client. In the second iteration, the client goes over

each frame and crops out the regions requested by the server. It then encodes these regions

in high resolution and sends them to the server. The server runs the DNN on these specific

regions rather than the whole frame and sends the results back to the client. The client places

the results onto the frames. This approach allows DDS to minimize the amount of data sent

to the server across iterations (low resolution and the cropped regions) while allowing us to

maximize accuracy, as DDS allows the server to further investigate only those regions which

it thinks can have objects of interest. This iterative process solves a fundamental issue in

other distributed video analytics pipelines: the analytics server cannot provide results for

what is not sent by the client, and the client cannot know what information is important for

the DNN to maximize accuracy.

As part of the paper, we present an implementation of DDS along with the evaluation

that shows the efficacy of this protocol. We provide an end-to-end evaluation of our implemen-

tation that studies the accuracy/bandwidth trade-off along with the delay of the protocol. The

contributions of this work are as follows:

• We demonstrate the common limitations of prior approaches by empirically looking

at the drawbacks of using client-side heuristics, which result in a sub-optimal accu-


racy/bandwidth tradeoff.

• We present DDS, a novel server-driven approach to video streaming protocols that uses

feedback from the server to send only those parts of the frame that are most likely to

increase accuracy.

• We address several challenges that are inherent to a server-driven approach. We con-

cretely define the notion of ‘feedback’, delineate concrete ways to reduce bandwidth

consumption and minimize delay in the presence of multiple iterations.

• We provide a comprehensive technical design of DDS and a concrete implementation to

demonstrate the practicality of our proposed protocol.

• We also perform an end-to-end evaluation of the implementation. The results of the

evaluation empirically demonstrate the benefits of incorporating server feedback into

the streaming stack.

The DDS protocol in itself is complete enough to produce good results without the need

to incorporate client side heuristics. At the same time, it is flexible enough to be extended

using client side heuristics to further improve the results, illustrating the flexibility of a server

driven design.

The following sections of the paper motivate the need for a custom streaming stack

along with discussing the limitations of existing offloading schemes (§2). We then provide

an overview (§3) and implementation (§4) of DDS along with the end-to-end evaluation (§5).

Towards the end, we discuss potential future work (§6) and prior work in this area most

closely related to DDS (§7).


CHAPTER 2

MOTIVATION

In this section, we explore the need for advancement in video streaming analytics. We

look at the flaws common to existing solutions and explain why they

cannot be used to optimize the bandwidth/accuracy trade-off effectively. Finally, we show the potential

of customizing the video streaming stack.

2.1 Distributed Video Analytics Pipelines

The function of a live video analytics pipeline is to perform multi-class object detection

on a live video stream and extract objects of interest. The object classes which are of interest

to an application are defined in an analytics query.

Different applications have different analytics queries, e.g. a traffic monitoring application

might be interested in the ‘number of cars on a road’, a coffee shop might be interested in

the ‘number of people in the queue at the counter’, and a surveillance application might be

interested in the ‘number of people entering or exiting a building’ or ‘location of a particular

object for the duration of the video stream’. Getting real-time results from these queries has

tangible benefits for the application owner. In the case of a traffic monitoring application,

the result from the analytics pipeline can be used to dictate the flow of traffic so as to

minimize traffic congestion. For a coffee shop application, getting an accurate count of the

number of people in the queue can allow the owner to decide whether or not to open a new

counter. And in the case of a surveillance application, the information obtained from the

analytics pipeline can be used to detect intruders.

In most applications that make use of live video analytics, as in the case of aforementioned

examples, the accuracy and latency of results are of significant importance because the data

is being used to make real-time decisions that will have real-world impact.


Considering the impact of the accuracy of these results, vision based analytics applica-

tions are increasingly making use of deep neural networks that provide more accurate results

than traditional methods [13, 21]. But obtaining results from these DNNs at an acceptable

rate requires expensive, dedicated GPUs or a GPU-enabled VM from a cloud provider, which

costs $850 per month on average [4, 2, 1].

Additionally, for location-dependent object recognition, the client must run a tracking

algorithm to update the locations of objects across those frames that are not sent to the server [8].

2.2 Limitation of existing solutions

Traditional streaming protocols such as DASH [30] and video encoding standards such as

H.264 are designed to optimize quality of experience of human users. This experience de-

teriorates if the video stalls or drops frames. Traditional streaming stacks aim to provide

the highest possible resolution (bitrate) while avoiding delay between frames. In contrast,

stalling and startup delays do not impact the performance of video analytics in the same

way as they impact the experience of a user. Additionally, it is difficult to identify which re-

gions are important for a human watching a video. However, this can be reliably determined

in the case of video analytics (as we will see in the following sections). Therefore, the video

analytics backend does not require the entirety of the frame to perform accurate analytics. In

fact, sending a frame in a higher resolution than is actually needed has diminishing gains, as

illustrated in Figure 2.1. Hence, a video analytics backend does not need the maximum

possible resolution of a video. Furthermore, prior studies have shown that compression (even with

imperceptible changes to a frame for a human being) can impact the accuracy of a neural

network [31]. So applying certain compression algorithms might not impact the quality of

experience of a human being but might affect the inference accuracy of a neural network.

Traditional video streaming, however, does not take these factors into account.

Realizing that not all frames are equally important to the server, some pipelines perform


[Plot: F1 score against resolution scale, for scales from 0.15 to 1.0 ("Change in F1 With Resolution").]

Figure 2.1: Diminishing returns of increasing resolution

computation on the client side to filter out frames that do not contain useful information.

However, the client is often resource constrained and thus the heuristics performed at the

client side are inherently simpler than the server-side DNN. Hence, these client-side heuristics

cannot be used as appropriate proxies for server-side DNNs, because client-side heuristics

cannot accurately determine which frames the server will need. Such approaches end up

sending either too few (which results in lower accuracy) or too many (which results in

excessive bandwidth consumption) frames to the server. For example, in Glimpse [8], the

client decides which frames to send to the server based on the pixel level difference between

frames. This simple approach would result in a frame not being sent to the server if an

object enters the scene but is small enough that it does not change the value of pixels by a

significant enough amount, resulting in lower accuracy. Conversely, changes in the brightness

of the scene would result in a significant pixel level difference (another such scenario is that

of a panning camera without a change in objects in the frame), so the client will send such


frames to the server, resulting in excessive bandwidth consumption. In the case of Vigil [38],

the client-side neural network is fast but not sufficiently accurate. It can fail to recognize

several objects in a scene as well as give several false positives, causing the client-side

heuristic to make mistakes while picking the frames to be sent to the server. AWStream [36]

also makes use of client side logic, but in a different way than Vigil [38], Glimpse [8] and

NoScope [21]. AWStream uses the client side logic to perform offline and online profiling

to learn a profile that predicts the accuracy and bandwidth trade-off. When it is given the

task to stream a video to the server, it uses the previously constructed profiles to adjust

the application data rate to match bandwidth while achieving maximum possible accuracy.

AWStream does not make use of any feedback from the server, hence it cannot actively

determine what the server deems to be important [36].

Furthermore, client-side heuristics, such as those mentioned above, are typically designed

for a specific application [21]. They cannot be trivially ported to a different analytics task

or pipeline. Hence, these schemes leave much to be desired in terms of flexibility. This

motivates the need to develop a more flexible video streaming stack that can be used with

a variety of video analytics pipelines with minimal porting effort.

2.3 The case for a Custom Video Streaming Stack

Contrary to recent work which focuses on designing client-side logic, our work considers a

server driven design. The high level idea behind the server driven design is that the server

has access to a large amount of resources which it can use to make better decisions about

what is important to achieve higher accuracy.

We propose an iterative approach to video analytics. During the first iteration, the client

provides the server just enough data to hone in on the parts that seem the most promising.

During the second iteration, the client only sends the promising parts of the data requested

by the server and the server returns the final results after processing this extra information.


This feedback based approach allows the pipeline to selectively send bits of data that are

of interest to the server without involving heuristics that need to act as proxies for the

server-side DNN. This provides us the opportunity to minimize bandwidth consumption

while achieving higher accuracy.

Additionally, we look at an equally important part of any analytics pipeline, the streaming

stack. The streaming stack influences the three key metrics, which we refer to as measures:

• Inference Accuracy: Number of objects of interest successfully identified.

• Bandwidth Consumption: Total data sent between the client and the server.

• Response Delay: Interval between the production of the frame and the final compilation

of its results.

The ideal streaming stack would retain just enough information in a video that allows the

server-side DNN to detect all objects of interest, to obtain the maximum possible accuracy.

Similarly, the ideal streaming stack would minimize bandwidth consumption by curtailing

the total data transferred between the client and the server including the content of the video,

the results and control signal (if any). Along with the aforementioned measures, an ideal

streaming stack would have minimum response time. Response time is important because

the results generated by the analytics pipeline are to be used to control real-world systems

and hence should not be so stale that they do not remain actionable in the real-world.

We balance inference accuracy and bandwidth consumption by compressing the video

along the following dimensions, which we refer to as control knobs.

• Resolution: The resolution that the raw image is compressed in before sending it to

the server.

• Spatial Cropping: Cropping out specified regions in a given video frame and sending

them in a given resolution.


• Compression Level: Encoding selected regions of the frame with a specified com-

pression level (quantization parameter).

Knob selection to perform dynamic adaptation is a general concept that is used to op-

timize one measure while meeting constraints for another (e.g., [20, 17, 18]). This allows

for dynamic adaptation in the presence of changing conditions and goals. To put this in

the context of video streaming, recent research looks into the tuning of knobs that

are used in encoding, resolution and compression to reduce the size of the data sent over the

network while maximizing accuracy [36, 37, 30].
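To make the knobs concrete, they could be grouped into a single configuration object along the following lines; this is an illustrative sketch in Python, and all names are hypothetical rather than taken from the DDS implementation.

from dataclasses import dataclass, field
from typing import List, Tuple

Region = Tuple[int, int, int, int]  # (x, y, width, height) of a cropped region, in pixels

@dataclass
class StreamConfig:
    low_res_scale: float = 0.375       # resolution knob for the first, full-frame pass
    high_res_scale: float = 0.75       # resolution knob for the re-requested regions
    qp: int = 28                       # compression level knob (encoder quantization parameter)
    regions: List[Region] = field(default_factory=list)  # spatial cropping knob: regions to keep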

Figure 2.2: Scaling down and Spatially Cropping Frames

Figure 2.2 provides a general idea (which will be explained in greater detail in the fol-

lowing section) of how our proposed system encodes a video frame to save bandwidth usage.

Notice how the encoded frames in the figure are likely to cause the quality of experience of

a user to drop; however, they will not cause the accuracy of a neural network to drop.


CHAPTER 3

OVERVIEW OF DDS

In this section we provide an overview of the design of a novel server driven video analytics

protocol. We also look at the challenges that we need to solve in order to make such a

protocol practical.

3.1 DDS: DNN Driven Streaming

The fundamental flaw in the client-driven protocols is that their decisions and adaptations are

agnostic to the performance of the server-side DNN. This results in such protocols choosing

sub-optimal configurations. In contrast, the idea behind DDS is that of a server-side DNN-

driven workflow – it is the server-side DNN logic, rather than the client-side video encoding

or heuristics, that drives the streaming stack. In DDS the server side DNN determines the

settings for the control knobs. This difference is what sets DDS apart from existing protocols

(as illustrated using Figures 1.1 and 1.2). The key advantage that this approach provides

over existing protocols is that the DNN itself can explicitly maximize the accuracy within a

bandwidth constraint. And compared to simple video encoding (e.g. [36]), DDS integrates

the object detection results so that encoding is aware of the spatial requirements of the DNN

to perform accurate inference. DDS uses encoders such as H.264/MPEG to further compress

the frames that are sent to the server. DDS does not have to rely on less accurate client-side

heuristics to determine which frames are important.

Furthermore, this design allows DDS to be flexible enough to be used for a number of

different video analytics applications, while also providing the opportunity to seamlessly

incorporate client-side logic into the protocol as an extension (§6).


3.2 Challenges

The aforementioned design has some inherent challenges that need to be solved in order to

make this approach possible.

What ’feedback’ does the server provide to the client? We need to concretely

define the notion of control signals that the client must act upon in order to maximize the

inference accuracy while minimizing bandwidth. Needless to say, the total data sent

back by the client as a result of the feedback needs to have just enough information that it

can be used by the server to make correct inferences. In DDS this feedback is the set of

regions that the server needs to inspect more closely. This set of regions is computed using

a combination of detection and tracking (the complete details are presented in the following

section). The server compiles a list of regions in the frames that it requires in a higher

resolution, and sends this list back to the client. The client spatially crops the frames to

only include the regions that the server deems are worth investigating further. The client

then encodes these spatially cropped frames and sends these frames to the server in a higher

resolution.
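As an illustration, the per-batch feedback could be represented as a structure like the one below; this is a hypothetical layout written as a Python literal, not DDS's actual message format (the implementation is discussed in §4).

# Hypothetical feedback for one batch: confirmed detections plus regions to re-send.
feedback = {
    "batch_id": 17,
    "confirmed_results": [
        {"frame_id": 255, "label": "car", "confidence": 0.93, "bbox": [102, 310, 64, 48]},
    ],
    "requested_regions": [
        {"frame_id": 257, "bbox": [611, 352, 120, 96]},  # needs a second look in high resolution
    ],
}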

This prompts an iterative design (illustrated with Figure 3.1) for the protocol. In the first

iteration the client sends some information (frames) to allow the server to decide what more

information is needed (regions in high resolution). The client then, in the second iteration,

sends only the requested information (spatially cropped frames). This iterative design leads

to two more challenges that are discussed in the following paragraphs.

How do we minimize bandwidth consumption? To minimize bandwidth consump-

tion, DDS, during the first iteration, sends frames in just high enough resolution that the

server is able to make accurate decisions about regions of interest. And during the second

iteration, DDS blacks out the entire frame except the regions requested by the server and

sends the frames at only enough resolution that the server is able to make accurate inferences.

Additionally, instead of storing each region in a frame of its own, the client takes a union


of all the regions that are in a single frame and masks the frame except the area covered by

the union of the regions. The client then encodes and compresses these frames and sends

them to the server.
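A minimal sketch of this masking step, assuming frames are numpy arrays and regions are pixel rectangles; the helper name is ours, not DDS's.

import numpy as np

def mask_frame(frame: np.ndarray, regions) -> np.ndarray:
    """Black out everything except the union of the requested regions.

    frame: H x W x 3 array; regions: iterable of (x, y, w, h) rectangles in pixels.
    """
    masked = np.zeros_like(frame)
    for (x, y, w, h) in regions:
        masked[y:y + h, x:x + w] = frame[y:y + h, x:x + w]  # keep only the requested pixels
    return masked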

How do we minimize the delay of the protocol? To minimize the delay, DDS makes

use of a number of optimizations. The first optimization stems from the fact that a typical

DNN resizes its input image to 600×600 regardless of the original size of the image.

If the size of the frames sent to the server is less than 600×600 then the server stitches these

frames together and runs object recognition on several frames in a single go. This is done as

part of the first iteration of the protocol. We study the effect of stitching on the F1 score

and the inference time in the evaluation section (§5). Additionally, the tracking done by

the server to compute regions of interest also adds delay to the pipeline. We recognize that

tracking is an independent task. Therefore, we do tracking in parallel to reduce the total

time required for the tracking part of the pipeline. The choice of the tracking algorithm is

also important, a less accurate but fast tracker will give us low delay but lower accuracy of

the regions of interest, while a highly accurate but slow tracker will result in high accuracy

of the regions of interest but a much greater delay.

There are several other network side optimizations that are done to reduce the network

delay of the protocol, but they are not core to the design of DDS; hence they will be discussed

in the Implementation section (§4).
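The stitching idea described above can be sketched as follows, assuming the cropped frames are small enough to all fit on a single 600×600 canvas; the packing logic is simplified and only meant to convey the idea, not to reproduce DDS's implementation.

import numpy as np

def stitch(crops, canvas_size=600):
    """Pack several small crops onto one canvas so the DNN can process them in a single pass.

    crops: list of H x W x 3 uint8 arrays. Returns the canvas and each crop's (x, y) offset,
    which is needed to map detections on the canvas back to the original frames.
    """
    canvas = np.zeros((canvas_size, canvas_size, 3), dtype=np.uint8)
    offsets, x, y, row_h = [], 0, 0, 0
    for crop in crops:
        h, w = crop.shape[:2]
        if x + w > canvas_size:            # current row is full: start a new one
            x, y, row_h = 0, y + row_h, 0
        canvas[y:y + h, x:x + w] = crop
        offsets.append((x, y))
        x, row_h = x + w, max(row_h, h)
    return canvas, offsets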


Figure 3.1: Iterative control flow of DDS


CHAPTER 4

DESIGN AND IMPLEMENTATION

In this section we describe the core DDS algorithm and the solution to the technical challenges

in greater detail. We also provide a concrete implementation that is used for the end-to-end

evaluation of the protocol.

4.1 Algorithm

As mentioned in the last section, DDS is an iterative protocol which relies on the feedback

from the server. The feedback in our case is the regions that the server deems are important

and should be investigated further. This allows the server to infer objects of interest with high accuracy

while the client sends only a small amount of data. We formalize the iterative protocol by looking at

the iterations separately and discussing the algorithm behind the iterations.

During the first iteration the client sends a batch of s frames to the server in a low

resolution. The value of the low resolution is a knob that can be set dynamically to adapt to

changing video content and available bandwidth. Upon receiving the low resolution frames

the server runs object recognition using a deep neural network. The server then compiles

two lists of detected objects. In one list (A) it adds all objects that have detection confidence

greater than a certain threshold; these are the objects that the server is sure about and they

require no further investigation. In the second list (B) it adds all objects that have detection

confidence between the two thresholds; these are the objects that need to be investigated further.

Any results that are not a part of these two lists are discarded because the server decides that

there is little value in investigating them further. The maximum and minimum confidence thresholds

used to generate these lists are also knobs that can be dynamically adjusted by the server.

For every result in list B the server tracks the objects in the backward and forward direction

using an optical motion tracking algorithm. The bounding boxes retrieved using tracking


are checked against the results in list A, and the bounding boxes are discarded if there

exists a result in list A that has the same bounding box as the bounding box obtained using

tracking. After the server has computed all bounding boxes from list B it sends information

about these bounding boxes back to the client along with results in list A, which contains

partial results for the frames in the batch. The client stores these partial results and uses the

bounding boxes’ information to prepare frames for the next iteration. For all the bounding

boxes in the same frame, the client takes a union of the bounding boxes and blacks out everything

in the original frame except the regions covered by the union of the bounding boxes. The

client does this for every frame for which there is a bounding box requested by the server.

The client compresses and encodes the frames in a higher resolution and sends them to the

server. Like the low resolution, high resolution is also a knob that can be set dynamically

for adaptation. The server receives and decodes the higher resolution frames and runs object

detection on them. The results from this object detection are sent back to the client, which

merges the results of the second iteration with the partial results sent during the first

iteration to obtain final object detection results for frames in the batch. The algorithm

for the process is given below (Algorithm 1). The track statement in Algorithm 1 calls the

track routine from Algorithm 2.

Another important check that the server performs while compiling the list of regions that

it thinks are important is that it ensures that the size of a region of interest is not greater

than a percentage of the frame size, because such regions might result in the majority of the

frame being sent back in higher resolution, increasing bandwidth consumption. The

likelihood of objects that large not being confidently recognized during the low resolution

iteration is quite low. Hence such regions can safely be left out during the second iteration.

However, the maximum and minimum thresholds, along with the maximum region size, must be chosen carefully to balance accuracy against bandwidth.


Data: lowRes, highRes, maxThres, minThres, maxSize
Result: boundingBoxes for each frame

acceptedResults = ∅, regions = ∅
lowResVideo = getVideo(lowRes)
lowResResults = runDNN(lowResVideo)
if lowResResults is ∅ then
    regions = getVideo(highRes)
else
    for result in lowResResults do
        if result.confidence > maxThres then
            acceptedResults = acceptedResults ∪ result
        end
    end
    for result in lowResResults do
        if minThres < result.confidence ≤ maxThres and result.boxSize < maxSize then
            regions = regions ∪ result
            regions = regions ∪ track(result, acceptedResults, lowResVideo)
        end
    end
end
highResCroppedVideo = getCroppedFrames(highRes, regions)
boundingBoxes = runDNN(highResCroppedVideo)
return boundingBoxes

Algorithm 1: Algorithm for the iterative process. track calls Algorithm 2.


Data: result, acceptedResults, lowResVideo, trackerLen, maxSize
Result: regions of interest

regions = ∅, frameNum = result.frameNum
for f = frameNum to frameNum + trackerLen do
    newResult = trackingAlgo(f, result)
    if newResult not in acceptedResults and newResult.boxSize < maxSize then
        regions = regions ∪ newResult
    end
end
for f = frameNum to frameNum − trackerLen do
    newResult = trackingAlgo(f, result)
    if newResult not in acceptedResults and newResult.boxSize < maxSize then
        regions = regions ∪ newResult
    end
end
return regions

Algorithm 2: Algorithm for the tracking phase of the server.

4.2 Implementation

Figure 4.1 shows the key components and the interfaces of the DDS protocol. The protocol is

amenable to a variety of video analytics applications. In principle, the protocol can be used

with any camera capturing a feed and a DNN running on a remote server.

Client Side DDS: As in other video analytics pipelines, a video feed is encoded by the camera into a

sequence of frames in high quality using either H.264 or ffmpeg. For our implementation

we use ffmpeg. DDS assigns frame IDs to the frames; these are used by DDS to fulfill regions

of interest queries (for the second iteration) sent by the server. The camera-facing API is

also responsible for changing the resolution, cropping and re-encoding the frames as required

by the algorithm described above. DDS also keeps the frames in a batch in memory until the


two iterations for the batch have been completed.
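For example, the re-encoding step on the client could be driven by ffmpeg roughly as follows; the exact filters and parameters used by DDS are not listed in this chapter, so treat this invocation as an assumed sketch.

import subprocess

def encode_frames(frame_pattern: str, out_path: str, scale: float = 0.375, qp: int = 30) -> None:
    """Re-encode a sequence of (possibly masked) frames into an H.264 segment at a lower resolution."""
    subprocess.run([
        "ffmpeg", "-y",
        "-i", frame_pattern,                        # e.g. "batch_0007/frame_%05d.png"
        "-vf", f"scale=trunc(iw*{scale}/2)*2:-2",   # scale down; keep even dimensions for H.264
        "-c:v", "libx264",
        "-qp", str(qp),                             # constant quantization parameter
        out_path,
    ], check=True)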

Server Side DDS: The server-side code of DDS queries the DNN and retrieves the results

for object detection using the DNN-facing API. Instead of relying on the pretrained DNN

to actively control the streaming, DDS passively queries the DNN using the aforementioned

API, and the interface returns the results in a format that can be consumed by DDS. For our

implementation the DNN returns the results as a key-value map in which the keys are the

frame IDs and the values are lists of objects that were detected in the frame. The result for

each object is stored as a 3-tuple whose elements are the frame ID, the confidence and the

bounding box coordinates. The server-side code uses the 3-tuples and tracks the objects

represented by a 3-tuple using a high speed tracker with kernel correlation filters (KCF) [10].

We use the KCF tracker because, to the best of our knowledge, it provides a good balance between

the tracking time and accuracy. Another tracking algorithm, based on multiple instance

learning [5], is more accurate but is significantly slower than the KCF algorithm [10].
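A sketch of how one of these 3-tuples could be propagated with OpenCV's KCF tracker follows; depending on the OpenCV build, the constructor is cv2.TrackerKCF_create or cv2.legacy.TrackerKCF_create, and the surrounding helper is ours rather than the actual DDS code.

import cv2

def track_forward(frames, start_idx, bbox, tracker_len):
    """Propagate a detection's bounding box through the following frames using KCF.

    frames: list of decoded frames; bbox: (x, y, w, h) from the low-resolution detection.
    Returns one propagated box per successfully tracked frame.
    """
    tracker = cv2.TrackerKCF_create()        # cv2.legacy.TrackerKCF_create() on newer builds
    tracker.init(frames[start_idx], bbox)
    boxes = []
    for frame in frames[start_idx + 1 : start_idx + 1 + tracker_len]:
        ok, box = tracker.update(frame)
        if not ok:                           # tracker lost the object, e.g. on a lighting change
            break
        boxes.append(tuple(int(v) for v in box))
    return boxes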

Figure 4.1: Components and Interfaces of DDS

Network: DDS maintains two TCP connections between the client and the server. One

of the connections, C1, is used for sending encoded (cropped) video data from the client to

the server. The other connection, C2, is used for receiving control or feedback messages

along with the results from the server. We implement two types of control signals. A


‘heartbeat’ is sent by both the client and the server to provide status updates and is also used

as a keep-alive message to keep the connections open. The second type of control message

is a ‘request’ for the regions that the server needs for the second iteration of the protocol.

Upon receiving a request from the server, the client retrieves the corresponding frames from

local storage, performs cropping, downsizing and compression using ffmpeg and sends the

resulting encoded video to the server using C1.
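The control messages on C2 could be framed roughly as below; this is a hypothetical length-prefixed JSON framing used for illustration, not the exact wire format of DDS.

import json
import socket
import struct

def send_msg(sock: socket.socket, msg: dict) -> None:
    """Send one length-prefixed JSON control message (a heartbeat or a region request)."""
    payload = json.dumps(msg).encode()
    sock.sendall(struct.pack("!I", len(payload)) + payload)

# Example messages (illustrative fields):
# {"type": "heartbeat", "sender": "client", "batch_id": 12}
# {"type": "request", "batch_id": 12,
#  "regions": [{"frame_id": 183, "bbox": [611, 352, 120, 96]}]}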

Optimizations: DDS’s iterative process requires repeated fetching of short flows. If the

connections were closed and reopened for every flow we would have to suffer from the slow

start phase of TCP every time. This can increase the delay significantly. So to avoid the

slow-start phase we use persistent TCP-based sockets, which maintain a persistent

congestion control session. Additionally, instead of running inference on low resolution im-

ages one by one the server stitches images together based on their size and runs inference

on several images in one go. This decreases the inference time in the low resolution phase

of the protocol. The same optimization can also be made for the high resolution, but for

the current implementation we only used stitching for the first phase. As mentioned earlier,

tracking is done in parallel to speed up the tracking phase of the batch processing. Now

that we have a clear understanding of the DDS protocol, we can compare the division of work

between the camera, the client-side logic and the server side logic. Table 4.1 summarizes the

responsibilities of the components of the streaming stack for each protocol.


           Camera           Client-Side                      Server-Side
Glimpse    Produce Frames   Heuristics (Differencing),       DNN Detection
                            Tracking, Cache Management
Vigil      Produce Frames   Heuristics (NN),                 DNN Detection
                            Tracking, Cache Management
AWStream   Produce Frames   Encode and Compress Frames       DNN Detection
DDS        Produce Frames   Encode and Compress Frames,      DNN Detection,
                            Spatially Crop Frames,           Tracking
                            Batch Management

Table 4.1: Responsibilities of different parts of the Analytics Pipelines


CHAPTER 5

EVALUATION

We evaluate the performance of DDS against several baselines. By evaluating the

end-to-end performance of DDS and through microbenchmarking, we show that:

• DDS shows promising results by improving the bandwidth-accuracy tradeoffs over the latest

video analytics pipelines as well as traditional video streaming.

• The optimization on each control knob contributes a substantial fraction of improve-

ment of DDS.

• DDS achieves stable bandwidth-accuracy tradeoffs even in the presence of dynamic network

conditions.

• Compared to baselines, DDS introduces only marginal computing overhead on the

server and client sides and only marginal inference delay.

5.1 Methodology

Dataset: The object detection model uses the vehicle class. We use twenty-seven videos

from the KITTI dataset [12], each at 10 frames per second.

Server/Client Setup: We use 30% as the low confidence threshold, 80% as the high

confidence threshold and a batch size of 15 for our experiments. We run the client and server

on a gcloud VM with an Nvidia P100 GPU [3], 16 GB of RAM, and a 4-core Haswell CPU.

For delay measurements the client is run on a laptop running ffmpeg and the DDS client

to encode videos and communicate with the server. The code for the client and the server

is written in Python, and for the tracking algorithm we use the library implementation of the

KCF tracker from OpenCV [6]. The DNN-facing API is written using TensorFlow [32] with

ResNet 101 as the DNN model [14]. The implementation of DDS used for evaluation only


includes the parallel tracking optimization. The stitching optimization was not included in

the implementation used to obtain the measurements; the reason for this is discussed in

§6.

Baseline setup: For our evaluation we use Glimpse, Vigil and AWStream as baselines.

For Glimpse [8] we use the frame difference detector as mentioned in the paper along with

the parameter values that were used in the paper. For Vigil [38], we use MobileNet SSD [19]

with a mean average precision of 21 [32] as the less accurate but fast client side neural

network. For AWStream [36] we try out multiple configurations of ffmpeg and choose the

best configuration to represent the result of AWStream. For our experiments the

ffmpeg-based AWStream and DDS were allowed to change only the resolution of the frames.

The client-side and server-side logic for all of the baselines run on the same machine

that was used for DDS.
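For reference, the F1 score reported throughout this chapter is the standard harmonic mean of detection precision and recall, F1 = 2 · precision · recall / (precision + recall), where precision = TP / (TP + FP) and recall = TP / (TP + FN), with detections presumably matched to ground-truth objects using an overlap (IoU) criterion.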

5.2 Accuracy vs Bandwidth

We present results for the accuracy/bandwidth trade-off of our system on a set of four videos

from the KITTI dataset. Figure 5.1 shows the change in accuracy/bandwidth with different

configurations. Each point on the plots corresponds to a different value of the high and low

resolution control knobs. We can see that even under different configurations the F1 score

remains largely stable while the bandwidth consumption varies significantly. The points to

the top left represent the best possible tradeoff. This shows the potential for a server driven

protocol. This shows (as we will demonstrate later) that the accuracy can be kept largely

stable in the face of fluctuations in available bandwidth. By considering the behavior of the

protocol from the plots we can see that we need to choose the value of the control knobs

carefully to achieve the best accuracy/bandwidth trade-off.

We are able to achieve 2x greater accuracy than client-side heuristics based baselines.

But when it comes to AWStream, a client-side adaptation based protocol, we are able to


[Four plots: F1 accuracy against bandwidth (Kbps) on four videos, comparing MPEG, DDS, Vigil, and Glimpse.]

Figure 5.1: F1 Against Bandwidth for various configurations

perform slightly better than the best setting that AWStream could use. In doing

so we use either strictly less bandwidth than, or the same amount as, the baselines.

To better illustrate the results we extend our experiments to a set of 22 videos and

use a DDS simulation to draw the 1-σ ellipses (mean and one standard deviation) of the best

configuration points of DDS and the baselines across videos. Based on the first plot in Figure

5.2 we can see that in the average case DDS provides better accuracy while utilizing lower

bandwidth. DDS consistently provides high accuracy (F1 Score > 0.90) across all videos.

The second plot in Figure 5.2 shows the good cases in which DDS outperforms all baselines

in terms of both bandwidth consumption and accuracy. A typical characteristic of these


[Two plots: 1-σ ellipses of F1 score against bandwidth (Kbps) for AWStream, DDS, Vigil, and Glimpse, over all videos (left) and the good cases (right).]

Figure 5.2: 1-σ ellipse for DDS and Baselines

videos is that the objects in the frames are far apart from each other and they do not occlude

each other. Furthermore, the bounding boxes for the objects do not overlap.

Unfortunately there are videos for which AWStream performs slightly better than DDS.

The results from these videos are shown in Figure 5.3. For these videos, although DDS

performs better than Vigil and Glimpse, AWStream is able to beat DDS in terms of accuracy

and bandwidth utilization. The main characteristic of such videos is that these are largely

empty, with a small number of objects of interest in the frames. The reason DDS does worse

in these scenarios is that during the low resolution phase of DDS, the DNN reports false

positives that have confidence above the minimum threshold, so the bounding boxes for these

false positives are requested during the second iteration of DDS, increasing the bandwidth

utilization. Another important characteristic of these videos is that the objects that appear

in the videos are often occluded by other objects. Along with that, the lighting conditions

under which these objects are detected change frequently in the video (e.g., an object

moving from sunlight into the shade of a tree). This causes problems at

several levels of the protocol. First and foremost, when performing object recognition on low

resolution images, an object detected in one frame, while still present in the next frame, might

not be successfully detected. As mentioned earlier, tracking is used to rectify this mistake.


[Plot: 1-σ ellipses of F1 score against bandwidth (Kbps) for AWStream, DDS, Vigil, and Glimpse on the worse cases for DDS.]

Figure 5.3: Worse Cases for DDS

However, tracking algorithms are susceptible to failing when the object moves from one

lighting condition to another; hence, in the case of these videos, tracking is unable to track

such objects. As a result of the tracking failure the bounding box for the object is not added

to the regions of interest and hence is not requested from the client in high resolution. This

means that the mistake made by DDS in low resolution can never be rectified.

5.3 Delay

We measure the delay for DDS and all baselines across five videos and report the average

delay. Delay is defined as the average time for the results of a frame to be finalized. In the

context of our measurements this is the average amount of time required to process a batch

of 15 frames. There is an important distinction in the way the delay measurement is made

in the Glimpse and Vigil papers’ evaluations. The delay in the original papers is defined

as the interval between the production of a frame and the results for that frame to appear

on screen. In both of these protocols there are caches that are used to store frames while

waiting for the result of a prior frame from the server. Once the result is received both Vigil

and Glimpse track the objects in the result through frames in the cache and show bounding

boxes on the latest frame. They report the interval between the production of the latest


[Bar plot: delay in seconds for DDS, AWStream, Vigil, and Glimpse.]

Figure 5.4: Batch Processing Time (Delay)

frame and the time it takes to finalize the results for this frame using tracking. Hence,

this interval ends up being equal to the amount of time it takes to track objects through

the frames in the cache. We report numbers for delay that include the time required to

process each and every frame (even the ones in the cache).

We can see that Vigil and Glimpse have a far greater delay than DDS and AWStream.

This is because the client-side logic is running on a computationally constrained device. Due

to this constraint Vigil and Glimpse cannot parallelize the most time consuming part of the

client side logic, tracking. Furthermore, they have to pay cache management overhead as

well. Since tracking through each and every frame in the cache would not allow Vigil and

Glimpse to catch up to the latest frame, they include extra functions in their logic that allow

them to select frames intelligently from the cache. Running these functions adds non-trivial

overhead to Glimpse and Vigil.

DDS and AWStream take about the same time to process the batch. DDS takes a little

more time because while AWStream does inference just once and does not do any tracking,

DDS has to track objects along with performing an extra round of inference in the high

resolution phase of the protocol. Tracking objects in parallel greatly reduces the delay due

to tracking. Adding the stitching optimization can provide further reduction in the delay


(this is discussed further in §6).

5.4 Microbenchmark

Next we look at the resource utilization for each of these schemes. It is important to recall

the division of labour within these protocols. This helps us appreciate the resource utilization

measurements for these schemes.

                    DDS   AWStream   Glimpse   Vigil
Client CPU (%)       22         18        38      82
Server CPU (%)       62         20        18      18
Memory (# frames)    15          1        ∼8      ∼8

Table 5.1: Resource Utilization

The resource utilization measurements in Table 5.1 are in complete agreement with the

division of labor shown in Table 4.1. The client-side DDS logic uses more CPU than

AWStream because it also has the responsibility of cropping images, while the

CPU utilization for Glimpse and Vigil is much greater because of the client side heuristics.

Notably, Vigil has a significantly higher CPU utilization on the client because of it’s use of a

neural network. However, on the server-side the DDS uses the most CPU because tracking has

also been shifted to the server. The parallelization optimization further increases the CPU

utilization on the server-side of DDS. On the other hand the only responsibility of the server

on the baseline protocols is object recognition. However, since the server is not running in a

constrained environment it is much better to utilize more resources on the server side than

as compared to the client side.

Memory utilization for DDS is much higher than the baselines. Because the server might request regions from the frames that it is currently processing, the DDS client must keep the frames of the batch in memory. Since the batch size in our experiments was 15, DDS has to keep 15 frames in memory at any given time. Both Glimpse and Vigil save all frames that are produced between the time when they send a frame to the server and the time they receive the results for that frame. The number of frames that will be saved depends upon how long it takes to receive a response from the server; in our experiments both Glimpse and Vigil saved roughly 8 frames in their caches. AWStream uses the least memory of all the protocols: because it does not need to run tracking or respond to the server's feedback, it has no need to store any frames other than the one it has to send to the server.
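The CPU and memory numbers in Table 5.1 could be collected with a sampler along the following lines; this psutil-based sketch is illustrative rather than the exact measurement harness, and note that the table reports memory as the number of buffered frames rather than bytes.

import psutil

def sample_utilization(pid, interval=1.0, samples=60):
    # Sample process CPU% and resident memory for the given pid and
    # return the averages; a generic sampler, not the Table 5.1 harness.
    proc = psutil.Process(pid)
    cpu, rss = [], []
    for _ in range(samples):
        cpu.append(proc.cpu_percent(interval=interval))  # % over the interval
        rss.append(proc.memory_info().rss)               # resident bytes
    return sum(cpu) / len(cpu), sum(rss) / len(rss)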

5.5 Adaptation

Figure 5.5 shows how the F1 score changes over time as the available bandwidth fluctuates.

The experiment has three intervals. During the first interval DDS has a large amount of bandwidth available, so it chooses the configuration that provides the best F1 score. In the second interval the available bandwidth decreases and DDS recognizes this based on the increased response delay; to cope with this change, DDS chooses a different configuration that uses less bandwidth while providing the highest possible accuracy. In the third interval, the available bandwidth increases but does not return to its original level, so DDS chooses an intermediate configuration and recovers some of the accuracy lost in the second interval. This shows the potential advantages and the amenability of DDS to adaptation schemes.

The methodology for this experiment was as follows: the first several segments of the video were used to profile the performance of each configuration; after profiling, DDS was run on the rest of the video and decided which configuration to use based on the response time. Figure 5.5 shows results from a single video with fairly consistent content.
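A minimal sketch of this strawman selection logic, assuming each profiled configuration records its F1 score and the bytes it sends per batch; the Config fields and the delay target are illustrative assumptions, not the exact values used in the experiment.

from dataclasses import dataclass

@dataclass
class Config:
    name: str
    f1_in_profile: float     # F1 score measured on the profiling segments
    bytes_per_batch: float   # average bytes sent per 15-frame batch in profiling

def pick_config(configs, measured_throughput_bps, delay_target=1.0):
    # Estimate each configuration's per-batch transfer time from the
    # throughput implied by recent response times, keep those that fit
    # within the delay target, and pick the most accurate of them.
    feasible = [c for c in configs
                if c.bytes_per_batch / measured_throughput_bps <= delay_target]
    if not feasible:
        # Nothing fits: fall back to the cheapest configuration.
        return min(configs, key=lambda c: c.bytes_per_batch)
    return max(feasible, key=lambda c: c.f1_in_profile)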

This simple profiling method, however, is not expected to work in general, because for most videos the content is not consistent. So we need to design a better way to adapt to changes in available bandwidth.


[Figure: F1 score (0–1.0) over time as the available bandwidth changes from 200 KBps to 100 KBps to 150 KBps]

Figure 5.5: DDS maintains high accuracy in the presence of bandwidth fluctuations

There are further challenges associated with this strawman profiling approach; these challenges are discussed in §6.


CHAPTER 6

FUTURE WORK

While we have shown the potential gains of using a server-driven streaming protocol, there remains work to be done.

Investigate the bad cases. As shown in the evaluation section, there are certain kinds of videos for which DDS does not perform well. We need to develop a greater understanding of the reasons for this poor performance. Our investigation so far has revealed that the confidence threshold, in conjunction with the type of scene in the video, plays a huge role in the performance of DDS on these videos. The videos for which DDS does not perform well have objects moving through regions of different brightness. In such frames the DNN is susceptible to producing false positives. If the confidence of these false positives is greater than the low confidence threshold, the server requests the bounding boxes for these false positives in higher resolution, consuming high bandwidth. A way forward would be to find the distribution of objects detected in low resolution in each confidence range, along with the distribution of false positives in each confidence range. We need to confirm whether these distributions are roughly the same for the worst-case videos. Then, based on these distributions, we can decide the optimal values for the low- and high-resolution thresholds.
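As a sketch of this analysis, assuming each low-resolution detection can be matched against ground truth to label it as a false positive (e.g., IoU below 0.5), the distributions and a candidate threshold could be computed as follows; the bin count and the false-positive fraction target are illustrative choices, not values we have settled on.

import numpy as np

def confidence_distributions(confidences, is_false_positive, bins=10):
    # Histogram all detections and the false positives by confidence.
    conf = np.asarray(confidences, dtype=float)
    fp = np.asarray(is_false_positive, dtype=bool)
    edges = np.linspace(0.0, 1.0, bins + 1)
    all_counts, _ = np.histogram(conf, bins=edges)
    fp_counts, _ = np.histogram(conf[fp], bins=edges)
    return edges, all_counts, fp_counts

def pick_low_threshold(edges, all_counts, fp_counts, max_fp_fraction=0.3):
    # Choose the lowest confidence bin whose false-positive fraction stays
    # under the target; detections below this confidence would not be
    # requested in high resolution.
    for i, total in enumerate(all_counts):
        frac = fp_counts[i] / total if total else 0.0
        if frac <= max_fp_fraction:
            return edges[i]
    return edges[-1]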

Study the effect of stitching on the overall system. Currently we have studied the effect of stitching images on accuracy in isolation. As shown in Figure 6.1, stitching images gives much better results when the original resolution of the images is low, and as the image resolution increases the gap between stitched and non-stitched results increases. These results were obtained from an experiment using a set of 20 vehicle images from ImageNet [9] along with 2 videos from the KITTI dataset [12]. These results, while promising, need to be investigated further. We need to empirically demonstrate that stitching images does not significantly degrade detection accuracy for images with a variety of different features.


Figure 6.1: Impact on accuracy of stitching 4 images together
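As a minimal sketch of this direction, assuming stitching means placing four equally sized images into a 2x2 grid so that one inference call covers all of them, the forward and reverse mappings could look like the following; the grid layout and the box format (x, y, w, h) are our assumptions for illustration.

import numpy as np

def stitch_four(images):
    # Stitch 4 equally sized H x W x 3 arrays into one 2H x 2W x 3 grid.
    assert len(images) == 4
    top = np.hstack(images[:2])
    bottom = np.hstack(images[2:])
    return np.vstack([top, bottom])

def unstitch_box(box, h, w):
    # Map a detection (x, y, bw, bh) on the stitched image back to
    # (source image index, box in that image's coordinates).
    x, y, bw, bh = box
    col, row = int(x >= w), int(y >= h)
    return row * 2 + col, (x - col * w, y - row * h, bw, bh)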

Develop a method for dynamic adaptation. Figure 5.5 shows the potential benefits of adding adaptation to DDS. Developing a more sophisticated and robust dynamic adaptation mechanism would allow DDS to perform well in the presence of bandwidth fluctuations and changes in video content. Before we can add dynamic adaptation to DDS, we will need to find answers to some questions. Essentially, we need to develop a notion of feedback measures that would provide the adaptation mechanism with real-time information about the performance of DDS. This is a non-trivial problem to solve because DDS would not have information about accuracy under the current configuration. This means that we would need to develop other measures that can be used as a proxy for the accuracy of the current configurations. This would be different from the adaptation done in AWStream [36], which performs adaptation to match changes in bandwidth; in DDS we will perform adaptation for response delay and accuracy as well. Furthermore, adaptation is also necessary on the client side, which is running an encoder in an energy-constrained environment. CoAdapt [17] bears direct relevance to the adaptation requirements of DDS. CoAdapt coordinates accuracy-aware applications, which have accuracy/performance tradeoffs, with power-aware systems, which expose power/performance tradeoffs. DDS has both the accuracy-aware and power-aware aspects as separate components of the overall system. In summary, adding adaptation for energy would also allow DDS to operate in energy-aware setups that work under a given energy budget.

Investigate other video encoding and compression knobs. Currently we are only looking at frame resolution as a knob. We need to investigate the effect of changing the quantization parameter. We can look at the cross product of quantization parameter and resolution to increase the configuration space. This would also allow us to perform a direct comparison between DDS and AWStream (since the quantization parameter is one of the knobs in AWStream). However, this is not quite straightforward. As shown in recent work [31], compression can impact the accuracy of the DNN in unexpected ways, so we will need to study the impact of changes in the quantization parameter and bitrate on the performance of DDS in more detail. Furthermore, while adding more compression can help us decrease bandwidth usage, it also changes the time it takes to compress frames, so the delay will need to be considered in addition to the overall accuracy.
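The resolution x quantization-parameter configuration space could be generated with ffmpeg along the following lines; the resolution and QP values below are illustrative, not the configurations we would necessarily profile.

import itertools
import subprocess

RESOLUTIONS = [240, 360, 480, 720]   # output heights (illustrative)
QPS = [22, 28, 34, 40]               # H.264 quantization parameters (illustrative)

def encode_configuration(src, height, qp, out):
    # Re-encode src at the given resolution and constant QP using libx264.
    subprocess.run([
        "ffmpeg", "-y", "-i", src,
        "-vf", f"scale=-2:{height}",   # keep aspect ratio, force even width
        "-c:v", "libx264", "-qp", str(qp),
        out,
    ], check=True)

def generate_config_space(src):
    # One encoded variant per (resolution, QP) pair for profiling.
    for height, qp in itertools.product(RESOLUTIONS, QPS):
        encode_configuration(src, height, qp, f"encoded_{height}p_qp{qp}.mp4")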

Incorporating client-side logic. Existing analytics pipelines use either client-side heuristics based on pixel-level differences ([8, 21]) or visual cues ([38]) to determine an important frame and send that frame to the server. Upon receiving results from the server, the client-side logic propagates the results to the frames that were not sent to the server. Based on the general design of DDS, one can easily incorporate client-side logic into DDS to add another layer of frame selection. As part of future work we need to develop a natural way to integrate client-side logic into DDS as an extension to the pipeline. After that we need to investigate whether such an extension would provide noticeable benefits.
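A sketch of such a pixel-level difference heuristic, in the spirit of the frame selection in [8, 21]: a frame is sent only when it differs enough from the last frame that was sent. The gray-level difference threshold is an illustrative value, not one taken from those systems.

import cv2
import numpy as np

def select_important_frames(frames, diff_threshold=12.0):
    # frames: list of BGR images; returns the indices of frames to send.
    selected = [0]                                     # always send the first frame
    last_sent = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    for i, frame in enumerate(frames[1:], start=1):
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        diff = float(np.mean(cv2.absdiff(gray, last_sent)))
        if diff > diff_threshold:
            selected.append(i)
            last_sent = gray
    return selected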


CHAPTER 7

RELATED WORK

DDS attempts to bridge the gap between deep video analytics and video streaming via a novel iterative process that allows it to rectify its mistakes. Here, we discuss the most closely related lines of work.

Object Detection and Tracking: Vision research has a deep literature on object detection [25] and tracking [35], and more recently deep-learning-based approaches have been gaining traction ([28, 19]). Most deep-learning-based approaches are slow enough that they cannot keep up with the frames that are produced in real time. Hence, recent research has also focused on carrying information from one frame to another using a relatively faster method (e.g., with the help of tracking). In [16], deep feature extraction is run on a key frame and the impression feature is propagated down to the next key frame. Similarly, work has been done to generate feature maps with a joint convolutional recurrent unit formed by combining a standard convolutional layer with an LSTM [23]; the recurrent layers are able to propagate temporal cues across frames, allowing the network to access a progressively greater quantity of information as it processes the video. Furthermore, there is growing interest in making DNN architectures more scalable [37]. All of these advances can be used to improve the capability of the server side of DDS, allowing the server to detect regions of interest more accurately while reducing the delay.

Iterative rectification: Iterative rectification of results over unlabeled data using active learning has shown promising results in several computer vision tasks related to object classification and scene understanding [22, 33]. These works focus on processing a small amount of information and figuring out the best regions to process in the next steps. Advancements in this area can help the iterative pipeline of DDS by allowing DDS to determine the best regions to retrieve in higher resolution. Such advancements can help us solve the bad cases for DDS discussed in §5.


Dynamic Adaptation: Modern computing systems must provide a certain quality of service with minimal energy in the presence of fluctuations that impact the performance of the system. Metronome [29] provides a framework that extends operating systems with self-adaptive capabilities. Recent work in dynamic adaptation [20, 17, 18] has shown that adaptation can be used for general applications (including streaming applications, such as H.264 and ffmpeg) to optimize an objective function while meeting constraints. In [11], the authors extend the idea of general adaptation to allow optimization of multiple objectives. Furthermore, [17] provides a system that allows coordination between accuracy-aware applications and power-aware systems, which usually have competing objectives. CALOREE [24] provides a control system, whose parameters are defined by a machine learning framework, that performs adaptation in the presence of a dynamic environment; it learns control parameters to meet latency requirements with minimal energy in systems that have complex interactions with hardware and face unpredictable changes in the operating environment and inputs. All of the aforementioned works have direct relevance to the adaptation requirements of DDS. As in [20], adaptation in DDS must be done on a general-purpose application, a video encoder, on the client side. And as in [17], the objectives of the client side must be coordinated with the overall objectives of the DDS pipeline, i.e., maximizing accuracy and minimizing latency. Hence, work from prior research can be used to meet the adaptation requirements of DDS.

Video streaming: Recent work in this area has yielded better compression gains in recent encoding standards. Furthermore, advances in scalable coding [26] and region-of-interest encoding [7] provide the most direct benefit to DDS. Region-of-interest encoding requires the viewer to specify a region of interest, while scalable coding allows efficient utilization of bandwidth. Unfortunately, these video compression approaches are content-agnostic. This is perhaps the part of the literature that is lacking the most: DDS focuses on a fundamentally different target, as delineated in the previous sections. More research needs to be done on streaming video to targets that are computational logic rather than human beings.


CHAPTER 8

CONCLUSION

The use of deep learning for video analytics is on the rise. We present DDS, a way to make efficient use of cloud computing resources for live video analytics. Our analytics pipeline provides high accuracy while making efficient use of the available bandwidth. We base our protocol on two key insights: 1) streaming to computational logic presents many unique opportunities compared to traditional video streaming protocols, and 2) using the deterministic nature of computational logic, we can exploit these opportunities by obtaining feedback from the server and then acting upon this feedback by sending only the important regions of frames in higher resolution. We provide solutions to the inherent problems of such a server-driven iterative protocol. To further demonstrate the practicality of our proposed protocol, we provide a complete implementation. Our end-to-end evaluation makes use of our implementation and shows improvements in terms of both accuracy and bandwidth utilization over the baselines in the average case. The evaluation, in general, validates our initial hypothesis that incorporating server feedback into the live video analytics pipeline can provide tangible accuracy/bandwidth benefits. However, there still remain cases that need to be investigated further to improve the general performance of DDS. Nonetheless, our evaluation shows the potential benefits of using a custom streaming stack that takes an iterative approach based on server feedback.


REFERENCES

[1] EC2 instance pricing – Amazon Web Services (AWS).

[2] Google Compute Engine pricing.

[3] NVIDIA Tesla P100: The most advanced data center accelerator.

[4] Pricing – Linux virtual machines – Microsoft Azure.

[5] Boris Babenko, Ming-Hsuan Yang, and Serge Belongie. Robust object tracking with online multiple instance learning. 2011.

[6] G. Bradski. The OpenCV Library. Dr. Dobb's Journal of Software Tools, 2000.

[7] Mingliang Chen, Weiyao Lin, and Xiaozhen Zheng. An efficient coding method for coding region-of-interest locations in AVS2. CoRR, abs/1503.00118, 2015.

[8] Tiffany Yu-Han Chen, Hari Balakrishnan, Lenin Ravindranath, and Paramvir Bahl. Glimpse: Continuous, real-time object recognition on mobile devices. GetMobile: Mobile Comp. and Comm., 20(1):26–29, July 2016.

[9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.

[10] F. Feng, X. Wu, and T. Xu. Object tracking with kernel correlation filters based on mean shift. In 2017 International Smart Cities Conference (ISC2), pages 1–7, Sep. 2017.

[11] Antonio Filieri, Henry Hoffmann, and Martina Maggio. Automated multi-objective control for self-adaptive software design. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2015, pages 13–24, New York, NY, USA, 2015. ACM.

[12] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. International Journal of Robotics Research (IJRR), 2013.

[13] Seungyeop Han, Haichen Shen, Matthai Philipose, Sharad Agarwal, Alec Wolman, and Arvind Krishnamurthy. MCDNN: An approximation-based execution framework for deep stream processing under resource constraints. In Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services, MobiSys '16, pages 123–136, New York, NY, USA, 2016. ACM.

[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.

[15] Alex Hern. Computers are now better than humans at recognising images, May 2015.

[16] Congrui Hetang, Hongwei Qin, Shaohui Liu, and Junjie Yan. Impression network for video object detection. CoRR, abs/1712.05896, 2017.

[17] H. Hoffmann. CoAdapt: Predictable behavior for accuracy-aware applications running on power-aware systems. In 2014 26th Euromicro Conference on Real-Time Systems, pages 223–232, July 2014.

[18] Henry Hoffmann, Stelios Sidiroglou, Michael Carbin, Sasa Misailovic, Anant Agarwal, and Martin Rinard. Dynamic knobs for responsive power-aware computing. In Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVI, pages 199–212, New York, NY, USA, 2011. ACM.

[19] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.

[20] C. Imes, D. H. K. Kim, M. Maggio, and H. Hoffmann. POET: A portable approach to minimizing energy under soft real-time constraints. In 21st IEEE Real-Time and Embedded Technology and Applications Symposium, pages 75–86, April 2015.

[21] Daniel Kang, John Emmons, Firas Abuzaid, Peter Bailis, and Matei Zaharia. Optimizing deep CNN-based queries over video streams at scale. CoRR, abs/1703.02529, 2017.

[22] Sam Leroux, Pavlo Molchanov, Pieter Simoens, Bart Dhoedt, Thomas Breuel, and Jan Kautz. IamNN: Iterative and adaptive mobile neural network for efficient image classification. CoRR, abs/1804.10123, 2018.

[23] Mason Liu and Menglong Zhu. Mobile video object detection with temporally-aware feature maps. CoRR, abs/1711.06368, 2017.

[24] Nikita Mishra, Connor Imes, John D. Lafferty, and Henry Hoffmann. CALOREE: Learning control for predictable latency and low energy. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '18, pages 184–198, New York, NY, USA, 2018. ACM.

[25] P. K. Mishra and G. P. Saroha. A study on video surveillance system for object detection and tracking. In 2016 3rd International Conference on Computing for Sustainable Global Development (INDIACom), pages 221–226, March 2016.

[26] J.-R. Ohm. Advances in scalable video coding. Proceedings of the IEEE, 93(1):42–56, Jan 2005.

[27] Joseph Redmon, Santosh Kumar Divvala, Ross B. Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. CoRR, abs/1506.02640, 2015.

[28] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137–1149, June 2017.

[29] F. Sironi, D. B. Bartolini, S. Campanoni, F. Cancare, H. Hoffmann, D. Sciuto, and M. D. Santambrogio. Metronome: Operating system level performance management via self-adaptive computing. In DAC Design Automation Conference 2012, pages 856–865, June 2012.

[30] Thomas Stockhammer. Dynamic adaptive streaming over HTTP: Standards and design principles. In Proceedings of the Second Annual ACM Conference on Multimedia Systems, MMSys '11, pages 133–144, New York, NY, USA, 2011. ACM.

[31] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014.

[32] TensorFlow. tensorflow/models.

[33] J. Wang, O. Russakovsky, and D. Ramanan. The more you look, the more you see: Towards general object understanding through recursive refinement. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1794–1803, March 2018.

[34] Josh Woodhouse. Market for storage used for video surveillance worth 1.7bn in 2017, Dec 2017.

[35] Alper Yilmaz, Omar Javed, and Mubarak Shah. Object tracking: A survey. ACM Comput. Surv., 38(4), December 2006.

[36] Ben Zhang, Xin Jin, Sylvia Ratnasamy, John Wawrzynek, and Edward A. Lee. AWStream: Adaptive wide-area streaming analytics. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication, SIGCOMM '18, pages 236–252, New York, NY, USA, 2018. ACM.

[37] Haoyu Zhang, Ganesh Ananthanarayanan, Peter Bodik, Matthai Philipose, Paramvir Bahl, and Michael J. Freedman. Live video analytics at scale with approximation and delay-tolerance. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), pages 377–392, Boston, MA, 2017. USENIX Association.

[38] Tan Zhang, Aakanksha Chowdhery, Paramvir (Victor) Bahl, Kyle Jamieson, and Suman Banerjee. The design and implementation of a wireless video surveillance system. In Proceedings of the 21st Annual International Conference on Mobile Computing and Networking, MobiCom '15, pages 426–438, New York, NY, USA, 2015. ACM.
