
Unsupervised Modeling, Detection and Localization of Anomalies in Surveillance Videos

Abhijit Sharangᵃ, Deepak Pathakᵃ, Amitabha Mukerjeeᵃ

ᵃ Department of Computer Science and Engineering, IIT Kanpur

Abstract

Most techniques today focus either on trajectory clustering or on capturing intrinsic scene features to detect and identify abnormal content in videos. Following the latter paradigm, we model the usual and dominant behavior in videos using unsupervised probabilistic topic models, and identify as “anomalous” the events that fall outside it. We make the following contributions: (a) We design the visual vocabulary by combining location and size information with quantized spatio-temporal descriptors, which is particularly relevant for static-camera scenes. Video clips over this vocabulary are then represented in a latent topic space using models such as pLSA. (b) We propose an algorithm that quantifies the anomalous content in a video clip by projecting the information learned from training onto the clip. (c) Based on this algorithm, we detect whether a video clip is abnormal and, if so, localize the anomaly in the spatio-temporal domain. The performance of our approach is demonstrated by experimental evaluation on a surveillance video dataset.

Keywords: vision, usual events, lda, plsa, topic modelling

1. Introduction

In the present era of big data, the large volume of un-annotated data on the web has motivated research in unsupervised data classification, recognition and segmentation. Owing to enhanced security mechanisms and the deployment of surveillance cameras, the need for automated analysis of videos to detect abnormal and anomalous events has recently given rise to active research in the computer vision and machine learning communities. Since a potentially infinite number of actions can be classified as anomalous, modeling abnormal actions directly seems hopeless. Moreover, the lack of annotation adds to the difficulty of this challenge.

Approaches to anomaly detection follow two broad techniques. The first tracks object motion over a series of frames and then works in the trajectory space to identify deviant points, which are potential candidates for being anomalous. The second captures the intrinsic patterns of the scene using feature descriptors. In this paper, we work with traffic surveillance videos and address the problem of detecting anomalous clips. The term “abnormal” can be ambiguous if defined without context, and is subject to personal bias. We define anomalies as the events which are not ‘usual’ in the video, i.e. we model the dominant behavior in the video, and events which are not prevalent are classified as anomalous.

Preprint submitted to Undergraduate Project April 18, 2014

At a high level, we learn parametric models from the extracted features and then project the information modeled from training data onto the test video clip to identify the anomalous content present in it. We suggest using location information, supplemented by spatio-temporal flow and gradient descriptors, as the features defining the complete visual vocabulary. We believe location information is of particular significance in videos from static cameras, such as those mostly used for security and surveillance. Foreground segregation in such scenes greatly reduces the complexity of the problem, with location information capturing the spatial localization of dynamic actions. It therefore makes sense to combine it with standard video point descriptors, which have been shown to perform well in the literature. This forms the underlying concept for spatio-temporal localization of unusual events in video in an unsupervised setting.

The preliminary idea for this work is based on the model suggested in Varadarajan and Odobez (2009). The authors use the likelihood obtained from a parametric topic model to determine whether a complete video clip is anomalous. This approach is highly sensitive to the quantity of anomalous content present in the video. Generally, if a video clip is anomalous, the anomaly is confined to a comparatively small domain, both spatially and temporally. Because the anomalous content in an abnormal clip is so small, the overall likelihoods of usual and unusual clips are not clearly separated. Instead, we scrutinize every visual word in the given clip individually by projecting the information learned from ‘usual’ data. The words which do not correspond to significant projections are the most likely candidates for unusualness. We call this the ‘projection model algorithm’; with it we can both detect whether a clip is abnormal and localize the anomaly, conditioned on the clip being classified as unusual.

Figure 1: Sample frames from the dataset (Varadarajan and Odobez, 2009). Note: The left image shows a usual/normal scene, while the right image shows an anomaly in which a car stops after the stop line.


The major challenges in a real-world setting are frequent occlusions and the large number of possible normal behaviors. Figure 1 shows sample frames from the dataset we use (Varadarajan and Odobez, 2009). In this paper, we make the following three contributions: (a) We design the visual vocabulary by combining location and blob size information with quantized spatio-temporal descriptors, which is particularly relevant for static-camera scenes. The blob size helps in differentiating individuals from vehicles. Video clips over this vocabulary are then represented in a latent topic space using models such as pLSA. (b) We propose an algorithm that quantifies the anomalous content in a video clip by projecting the information learned from training onto the clip. (c) Based on this algorithm, we detect whether a video clip is abnormal and, if so, localize the anomaly in the spatio-temporal domain.

The rest of this paper is organized as follows. Section 2 surveys the literature in general and the previous approaches relevant to our problem. Section 3 presents the methodology, i.e. the construction of visual words, pLSA and the ‘projection model algorithm’ for detecting and localizing anomalies. Section 4 presents the experiments and an analysis of the results with respect to the baseline. Section 5 discusses conclusions and applications.

2. Related Work

Abnormality detection has broadly been tackled in two ways. The first uses trajectory clustering techniques, which involve tracking object motion across frames. The second extracts intrinsic features such as texture patterns and flow patterns, and then models these features using parametric techniques. Discarding tracking information in the latter approach leads to a significant loss of information, but it offers a more promising way to build generalizable real-life models.

Trajectory clustering

In the first approach, the trajectories of the different objects in the video are extracted and clustered into groups; an object following a path which is not close to any existing cluster is classified as anomalous. Zhang et al. (2013) discuss techniques for better object tracking and trajectory clustering in traffic surveillance videos.

Trajectory clusters can also be used to extract semantic contextual information from videos. Le et al. (2008) and Hu et al. (2007) extract physical and semantic characteristics of the objects present in the scene and build efficient indexes over the video based on these characteristics for retrieval.

Unsupervised behaviour modelling

The second approach models the usual behaviours in a scene using feature descriptors. We initially explored the literature on unsupervised action recognition and scene understanding. Niebles et al. (2008) is a well-known paper discussing the results obtained by applying topic models to action recognition and classification. They model the features in terms of visual words, and use pLSA and LDA models to predict the correct actions in an unsupervised manner. Competitive results on real-world datasets showed a promising future for topic models.

Detecting Abnormal Behaviour

Works which attempt to detect unusual events in videos begin by building a model of the normal events occurring in them. Mahadevan et al. (2010), which aims at anomaly detection, and its extension Li et al. (2013), which aims at both detection and localisation of anomalies, rely on dynamic texture models to build a joint temporal and spatial model from which a saliency measure for the events in the video is constructed. A rare or unusual event is expected to have temporal and spatial saliency values that deviate significantly from the expected saliency values.

Roshtkhari and Levine (2013) also build a joint model of spatial and temporal dominant behaviour by constructing spatio-temporal volumes centred around every pixel. Low-level features are extracted from these volumes to generate a codebook. The codebook is clustered to generate centres of usual events using a fuzzy C-means clustering algorithm. Anomalous events are detected based on the distances of the words occurring in the events to these centres.

Varadarajan and Odobez (2009) utilise topic modelling to understand the usual events occurring in the video. It is assumed that in a given domain the set of usual events is fixed and can be mined from the distribution of visual words over the video clips in the domain. A video clip containing anomalous events would then be expected to have a low likelihood under the learnt model.

3. Methodology

We parametrically model the usual events in a given surveillance video in a completely unsupervised setting, then use the learned behavior to detect anomalous video clips, and finally localize the ‘anomaly’ in the spatio-temporal domain. The preliminary processing stage is similar to that of Varadarajan and Odobez (2009), with more extensive descriptive features incorporated to yield a more generic framework. The complete pipeline can be summarized in the following steps:

• We extract context-based, finite-dimensional, discrete-domain words from the video. The video is divided into clips, which are analogous to documents, and these are represented as histograms over the finite vocabulary.

• The term-frequency matrix of these document clips over the vocabulary is modeled using a Bayesian approach with parametric topic models such as pLSA and LDA.

• Each document is now represented as a probability distribution over the latent topic space. For a new video document, we investigate the ‘usualness’ of each visual word by comparing it with the projected word histograms of the nearest training documents in topic space.

• Using the proposed ‘projection model algorithm’, we then identify and localize the anomaly in the given test video document.

We now discuss each of these steps in detail in the following subsections.

3.1. Formation of Visual Words

We first describe how we extract words from a single frame, i.e. an image. These words are called ‘visual words’. Currently, the words used for training the pLSA model have three dimensions: location in the frame, quantized bins of spatio-temporal gradient-flow descriptors, and connected blob size.

The dataset we use comes from Varadarajan and Odobez (2009). It is a continuous 45-minute traffic scene video, as shown in Figure 1. We process each frame separately to find its set of visual words. We first find the foreground pixels in each frame, and then consider words only at these foreground pixels. It is therefore important not to lose foreground objects such as persons, cars and bicycles, although we can tolerate some extra pixels in the foreground. For this, we use the ViBe¹ foreground extraction technique suggested in Barnich and Van Droogenbroeck (2011). This approach builds the background model while ignoring the insertion time of the pixels entering it. This generalizable approach gives state-of-the-art performance in many domains.

Figure 2: Foreground pixels identified in the image frames. Note: This is using a simple frame-difference background subtraction method.

The main benefit of ViBe, or of any foreground model, should be to keep objects such as vehicles in the foreground even if they have stopped moving for some time, for example when cars wait at a crossing for the signal to turn green. We then use median filtering, which assigns to each pixel the median intensity value of its nearby pixels, and repeated morphological filters to smooth the image, remove noise and connect the blobs belonging to a single object. Figure 2 depicts the result of this pre-processing on sample frames.

¹ The implementation of ViBe has been made available by the authors at http://www.motiondetection.org/

We now discuss how the information in each dimension of these visual words is obtained.

3.1.1. Location Information:

The first dimension of a word is its location in the frame, conditioned on being in the foreground. We divide each frame into disjoint grid cells of 20 × 20 pixels, and the origin of each cell is stored as the first variable of the word. The size of each frame is 288 × 360, so we get 15 × 18 possible locations, and hence this many discrete values in the domain. Every grid cell in a frame containing at least one foreground pixel is a candidate for defining a visual word.
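The mapping from a foreground pixel to its grid location can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that 288 is the frame height and 360 the width, that partial cells at the border count as cells (which is what yields 15 rows), and the function names are hypothetical.

```python
import math

FRAME_H, FRAME_W = 288, 360   # frame size from the text (height/width order is assumed)
CELL = 20                     # side of one grid cell in pixels

N_ROWS = math.ceil(FRAME_H / CELL)   # 288 / 20 = 14.4, so the last row is a partial cell
N_COLS = math.ceil(FRAME_W / CELL)   # 360 / 20 = 18

def cell_of(row, col):
    """Return (grid_row, grid_col) of the cell that contains a pixel."""
    return row // CELL, col // CELL

def location_index(row, col):
    """Flatten the cell coordinates into a single id in [0, N_ROWS * N_COLS)."""
    grid_row, grid_col = cell_of(row, col)
    return grid_row * N_COLS + grid_col

def active_locations(foreground_pixels):
    """Location ids of all cells that hold at least one foreground pixel."""
    return {location_index(r, c) for r, c in foreground_pixels}
```

Only the cells returned by `active_locations` would go on to produce visual words for the frame.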

3.1.2. Spatio-Temporal Descriptor Information:

We use the space-time extension of the HOG descriptor complemented with quantized optical flow features, i.e. the HOG-HOF descriptor suggested in Laptev et al. (2008). We compute this descriptor around a pixel selected randomly from the foreground pixels in each grid cell. The idea of the descriptor is to consider a spatio-temporal volume around the interest point, which is further divided into disjoint cells. In each cell, gradient orientations are quantized into a 4-bin histogram (HOG) and optical flow into a 5-bin histogram (HOF). These histograms are normalized within cells and then concatenated over the complete volume. We use the executable from the authors’ website². The spatial scale (σ²) and temporal scale (τ²) parameters used are 4 and 2 respectively. The total length of the descriptor is 162, i.e. a 72-dimensional HOG vector and a 90-dimensional HOF vector. Overall, the orientation-quantized bins capture gradient and texture information, while the optical flow histogram captures the motion content in the neighborhood.

These descriptors are then clustered using the k-means algorithm. We randomly pick 200K HOG-HOF descriptors from the training documents and quantize them into 20 centers. Each grid cell containing a foreground pixel can then be assigned to a cluster center, so the second dimension of a word has 20 possible values.
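The quantization step can be illustrated with a small pure-Python k-means (Lloyd's algorithm). This is only a sketch under stated assumptions: the text does not say which k-means variant, initialization or iteration count was used, so the random initialization, the fixed number of iterations and the function names below are hypothetical. In the paper's setting `points` would hold the 162-dimensional descriptors and `k` would be 20.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_center(descriptor, centers):
    """Index of the cluster center closest to the descriptor."""
    return min(range(len(centers)),
               key=lambda k: squared_distance(descriptor, centers[k]))

def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd iterations; initial centers are k distinct sample points."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest_center(p, centers)].append(p)
        for j, group in enumerate(groups):
            if group:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(group) for col in zip(*group)]
    return centers
```

After training, `nearest_center(descriptor, centers)` gives the value of the second word dimension for a grid cell.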

3.1.3. Blob Size Information:

The final dimension corresponds to the size of the connected component containing the foreground pixel under consideration. We find the connected ‘blob’ of the pixel using a contour detection algorithm (Suzuki et al., 1985). Contours are then filled and the area enclosed by the contour is computed, which we quantize with a threshold into large and small.

We believe that location information complemented with quantized HOG-HOF centers is of particular significance in videos from static cameras, such as those mostly used for security and surveillance. This combination captures the informative content in every dimension of the video.
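As a rough illustration of the blob-size feature, the sketch below measures a blob by flood-filling its 8-connected component in a binary foreground mask. Note that this is a stand-in, not the border-following contour algorithm cited above: counting component pixels differs from the filled-contour area whenever a blob has holes. The mask layout, the threshold value and the function names are assumptions.

```python
from collections import deque

def blob_area(mask, start):
    """Number of foreground pixels 8-connected to `start` (breadth-first fill)."""
    rows, cols = len(mask), len(mask[0])
    r0, c0 = start
    if not mask[r0][c0]:
        return 0
    seen = {start}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (0 <= nr < rows and 0 <= nc < cols
                        and mask[nr][nc] and (nr, nc) not in seen):
                    seen.add((nr, nc))
                    queue.append((nr, nc))
    return len(seen)

def size_label(area, threshold):
    """Quantize the area into two values: 1 for large, 0 for small."""
    return 1 if area >= threshold else 0
```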

² http://www.di.ens.fr/~laptev/#stip


3.2. Construction of vocabulary and documents

These words intuitively capture every possible combination in the traffic video dataset we are dealing with. This property is crucial for capturing context-specific information; for instance, a car will always run on the road in either direction. Since the camera is static, location, HOG-HOF quantization and connected blob size capture exactly this behavior. Another instance, pedestrians crossing the road only at the zebra crossing, is captured almost completely by the location and size dimensions.

Currently, our vocabulary consists of all possible values of these words. Each word is a triplet of the features discussed above, giving (15 × 18) × 20 × 2 = 10800 words in all. The complete video is then divided into clips of l seconds each. These video clips are the documents for the topic model. We represent each video clip as a histogram over the vocabulary. Typically, we experiment with l = 4 to 10 seconds; the traffic surveillance video runs at 25 frames per second.
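One way to index this vocabulary is to flatten each (location, descriptor, size) triplet into a single integer. The sketch below is illustrative: the ordering of the three dimensions inside the id and the function names are assumptions, and any bijection onto 0..10799 would serve equally well.

```python
from collections import Counter

N_LOC, N_DESC, N_SIZE = 15 * 18, 20, 2
VOCAB_SIZE = N_LOC * N_DESC * N_SIZE   # 270 * 20 * 2 = 10800

def word_id(loc, desc, size):
    """Flatten a (location, descriptor cluster, size) triplet into one id."""
    return (loc * N_DESC + desc) * N_SIZE + size

def decode_word(w):
    """Inverse of word_id: recover the triplet from the flat id."""
    rest, size = divmod(w, N_SIZE)
    loc, desc = divmod(rest, N_DESC)
    return loc, desc, size

def clip_histogram(word_ids):
    """Document representation: frequency of each word id in one clip."""
    return Counter(word_ids)

def frames_per_clip(l_seconds, fps=25):
    """Number of frames in a clip of l seconds at the stated frame rate."""
    return l_seconds * fps
```

For example, a 4-second clip at 25 frames per second spans 100 frames, all of whose words are accumulated into one histogram.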

3.3. Probabilistic Latent Semantic Analysis

Probabilistic topic modelling is widely studied in statistical machine learning. Building on Latent Semantic Analysis (LSA), its probabilistic graphical counterpart pLSA was suggested by Hofmann (1999), and subsequently LDA, a fully generative parametric model with a Dirichlet prior, was suggested by Blei et al. (2003). The literature suggests that pLSA and LDA give similar results in such a scenario (Varadarajan and Odobez, 2009), so we discuss only the pLSA model for topic discovery.

Say we represent each word as w ∈ W = {w_1, w_2, ..., w_M} and each document as d ∈ D = {d_1, d_2, ..., d_N}; then we have an N × M term-frequency matrix N, where N(i, j) is the frequency of w_j in d_i. LSA factorizes this matrix into a lower-dimensional vector space by computing the SVD of N and keeping only the significant diagonal terms. pLSA is a probabilistic version of LSA that represents each document as a probability distribution over a space of latent factors called topics, z ∈ Z = {z_1, z_2, ..., z_K}. The conditional independence assumption in the pLSA model is that, given the topic z, the variables w and d are independent. The joint distribution of documents and words under this independence assumption is given by

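$$
\begin{aligned}
P(d, w) &= P(d)\,P(w \mid d) = P(d) \sum_{z \in Z} P(w, z \mid d) \\
        &= P(d) \sum_{z \in Z} P(w \mid z, d)\,P(z \mid d) \\
        &= P(d) \sum_{z \in Z} P(w \mid z)\,P(z \mid d)
\end{aligned}
$$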

The parameters of the model are estimated using the EM algorithm, as suggested in Hofmann (1999). The log-likelihood of the term-frequency matrix, where n(d, w) is the frequency of word w in document d, is

$$
L(\theta; N) = \sum_{d \in D} \sum_{w \in W} n(d, w) \log P(d, w)
$$
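The EM estimation can be sketched in a few lines of plain Python. This is a toy illustration of the standard pLSA updates rather than the implementation used in this work: the random initialization, fixed iteration count and function names are assumptions, and the log-likelihood below omits the P(d) term, which does not depend on P(w|z) or P(z|d).

```python
import math
import random

def normalize(values):
    """Scale non-negative values to sum to 1 (uniform if they are all zero)."""
    total = sum(values)
    if total <= 0:
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]

def plsa(counts, n_topics, iterations=50, seed=0):
    """counts[d][w] = n(d, w). Returns (P(w|z) as [z][w], P(z|d) as [d][z])."""
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    p_w_z = [normalize([rng.random() for _ in range(n_words)]) for _ in range(n_topics)]
    p_z_d = [normalize([rng.random() for _ in range(n_topics)]) for _ in range(n_docs)]
    for _ in range(iterations):
        acc_w_z = [[0.0] * n_words for _ in range(n_topics)]
        acc_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        for d in range(n_docs):
            for w in range(n_words):
                n = counts[d][w]
                if n == 0:
                    continue
                # E-step: posterior P(z | d, w), proportional to P(w|z) P(z|d)
                post = normalize([p_w_z[z][w] * p_z_d[d][z] for z in range(n_topics)])
                # M-step accumulators: expected counts n(d, w) P(z | d, w)
                for z in range(n_topics):
                    acc_w_z[z][w] += n * post[z]
                    acc_z_d[d][z] += n * post[z]
        p_w_z = [normalize(row) for row in acc_w_z]
        p_z_d = [normalize(row) for row in acc_z_d]
    return p_w_z, p_z_d

def log_likelihood(counts, p_w_z, p_z_d):
    """Sum of n(d, w) log sum_z P(w|z) P(z|d), i.e. L without the P(d) term."""
    total = 0.0
    for d, row in enumerate(counts):
        for w, n in enumerate(row):
            if n:
                total += n * math.log(sum(p_w_z[z][w] * p_z_d[d][z]
                                          for z in range(len(p_w_z))))
    return total
```

Each EM iteration cannot decrease this log-likelihood, which gives a simple sanity check when experimenting with the number of topics.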


We estimate the distributions P(z|d) and P(w|z) from the training data. For the test set, we modify the EM algorithm to estimate only P(z|d), keeping P(w|z) fixed at its training-set estimate. Thus we use the following form of the likelihood function:

$$
L(\theta; N) = \sum_{d \in D} \sum_{w \in W} n(d, w) \log\Big( P(d) \sum_{z \in Z} P(w \mid z)\,P(z \mid d) \Big)
$$

A limitation of the pLSA model is that it is generative only for the training data, where the full set of documents is known. The workaround above allows us to fit the words of test documents with a distribution of the same form as that of the training documents.
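The test-time estimation described above, with P(w|z) held fixed, can be sketched as follows. The uniform initialization, iteration count and function name are assumptions made for illustration; only P(z|d_test) is updated.

```python
def fold_in(doc_counts, p_w_z, iterations=50):
    """Estimate P(z | d_test) by EM while keeping P(w | z) fixed.

    doc_counts[w] = n(d_test, w); p_w_z[z][w] = P(w | z) learned on training data.
    """
    n_topics = len(p_w_z)
    p_z = [1.0 / n_topics] * n_topics          # uniform start (an assumption)
    for _ in range(iterations):
        acc = [0.0] * n_topics
        for w, n in enumerate(doc_counts):
            if n == 0:
                continue
            joint = [p_w_z[z][w] * p_z[z] for z in range(n_topics)]
            total = sum(joint)
            if total == 0:
                continue                        # word has zero probability under every topic
            for z in range(n_topics):
                acc[z] += n * joint[z] / total  # expected count for topic z
        norm = sum(acc)
        if norm == 0:
            break                               # nothing to update from
        p_z = [a / norm for a in acc]
    return p_z
```

For instance, with two topics that each put all their mass on a disjoint pair of words, a test clip containing only words of the first pair is assigned entirely to the first topic.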

3.4. Anomaly Detection: Projection Model Algorithm

Although topic modeling gives us an overall likelihood value for every document, using it for abnormality detection is not a robust approach. It is sensitive to the amount of anomaly, i.e. the number of anomalous words present in the document. In general, the abnormal event in a clip is confined to a small spatio-temporal region, so the clip contains few anomalous words relative to its total number of words. As a result, there is not much difference between the likelihoods of anomalous test clips and usual ones. We therefore propose an algorithm that evaluates each visual word in the test video document individually. We call this the projection model algorithm because, for every word, we search the information projected from the nearest training documents in topic space. The algorithm is as follows.

1. The distribution of each document over the topic space, i.e. P(z|d), is given by the pLSA model. Using this, we can represent every document d_x as a topic vector (θ^x_1, θ^x_2, ..., θ^x_K). Let D_train be the set of all such topic vectors for the training documents.

2. Given a new test document d_test, represent it as a topic vector. In the topic space, find the m nearest training documents d_i ∈ D_train using the Bhattacharyya metric. The Bhattacharyya distance between two documents d_x and d_y is defined as

$$
D_B(d_x, d_y) = -\log\left( \sum_{i=1}^{K} \sqrt{\theta^x_i\,\theta^y_i} \right)
$$

3. Let the word histogram of document d_x be H_x. Stack (i.e. add the frequency of each bin) the histograms of these m nearest training documents into one combined histogram H_0.

4. Now consider every word w_test that occurs at least once in the test document d_test (words that do not occur are ignored right away). If the frequency of the bin corresponding to w_test in H_0 is at least a certain threshold, call it usual, i.e.

if H_0(w_test) ≥ th_cur, then w_test is a usual word.

5. Consider the eight neighbors of w_test in the image grid, in all possible spatial directions (up, down, left, etc.). Call this set N(w_test). If H_0(w) ≥ th_nbr holds for at least l neighbors w ∈ N(w_test), then w_test is called usual. Note that l is any integer from 1 to 8.

6. If neither step 4 nor step 5 holds for w_test, call it an anomalous word.

Figure 3: Anomalous frames identified and anomalous words localised by the algorithm. Note: In our current implementation, we highlight the anomalous events in test documents as shown above.

In the above algorithm, if the training data does not contain any abnormal event, we keep th_cur = 1. The final parameters to tune are thus {m, th_nbr, l}. Through experiments, we observed that setting m to around one fourth of the total number of training documents and l to 3 gives decent performance.

The test document clip d_test is called abnormal if the number of anomalous words in it exceeds a threshold. We vary this threshold and present the precision-recall curve in the results section.
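The six steps and the clip-level decision can be put together in a short sketch. Several details below are assumptions made for illustration only: a word is represented as a tuple ((grid_row, grid_col), descriptor, size); the neighbors of a word are taken to be the words with the same descriptor and size at the eight adjacent grid cells; a word counts as usual when either step 4 or step 5 holds; and all function names are hypothetical.

```python
import math
from collections import Counter

def bhattacharyya(theta_x, theta_y):
    """D_B = -log(sum_i sqrt(theta_x[i] * theta_y[i])) between two topic vectors."""
    bc = sum(math.sqrt(a * b) for a, b in zip(theta_x, theta_y))
    return -math.log(bc) if bc > 0 else math.inf

def stacked_histogram(theta_test, train_thetas, train_hists, m):
    """Steps 2-3: sum the word histograms of the m nearest training documents."""
    order = sorted(range(len(train_thetas)),
                   key=lambda i: bhattacharyya(theta_test, train_thetas[i]))
    h0 = Counter()
    for i in order[:m]:
        h0.update(train_hists[i])
    return h0

def neighbours(word, n_rows=15, n_cols=18):
    """Assumed neighbourhood: same descriptor and size, eight adjacent grid cells."""
    (row, col), desc, size = word
    result = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            r, c = row + dr, col + dc
            if (dr, dc) != (0, 0) and 0 <= r < n_rows and 0 <= c < n_cols:
                result.append(((r, c), desc, size))
    return result

def anomalous_words(test_hist, h0, th_cur=1, th_nbr=1, l=3):
    """Steps 4-6: words of the test clip supported neither directly nor by neighbours."""
    flagged = []
    for word, count in test_hist.items():
        if count == 0:
            continue
        if h0.get(word, 0) >= th_cur:                       # step 4
            continue
        support = sum(1 for nb in neighbours(word) if h0.get(nb, 0) >= th_nbr)
        if support >= l:                                    # step 5
            continue
        flagged.append(word)                                # step 6
    return flagged

def is_abnormal(test_hist, h0, clip_threshold, **params):
    """Clip-level decision: more anomalous words than the threshold."""
    return len(anomalous_words(test_hist, h0, **params)) > clip_threshold
```

Sweeping `clip_threshold` over a range of values yields the operating points of a precision-recall curve.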

3.5. Anomaly Localization

We get the localization of the anomaly as a direct by-product of our projection model algorithm. If the overall test document d_test is called abnormal, we simply mark the words flagged as anomalous in the clip d_test. This is possible because words contain spatial location information as their first dimension. To localize these words temporally, we only need some book-keeping while creating the word histogram of each document: we save the list of frame numbers for every word-document pair.

We now present the results of this analysis on the dataset. It is crucial that the training set contain no or very few anomalies, so that the resulting normalized likelihood of usual events is high.
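The book-keeping mentioned above can be as simple as storing, next to each histogram bin, the frames in which the word was observed. The sketch below assumes a word is a tuple whose first element is its grid cell; the data layout and function names are hypothetical.

```python
from collections import defaultdict

def index_clip(frame_words):
    """Build the clip histogram and, per word, the list of frames where it occurs.

    frame_words: iterable of (frame_number, word) pairs for one clip.
    """
    hist = defaultdict(int)
    frames_of = defaultdict(list)
    for frame, word in frame_words:
        hist[word] += 1
        frames_of[word].append(frame)
    return dict(hist), dict(frames_of)

def localize(flagged_words, frames_of):
    """Sorted (frame_number, grid_cell) pairs for the words flagged as anomalous."""
    return sorted((frame, word[0])
                  for word in flagged_words
                  for frame in frames_of.get(word, []))
```

Each returned pair names a frame and a grid cell, which is enough to highlight the anomalous region in the clip.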

4. Experiments and Results

The dataset consists of a single video of 45 minutes duration, shot from a camera perched at the top of a building at a traffic junction. Anomalous events occurring in the video have been marked in a separate file which states the start time of each anomalous event, its end time and the kind of anomaly. There are four kinds of anomalous actions in the video, namely: jaywalking, a car stopping after the stop line on the road, people crossing the road away from the zebra crossing, and a car entering the pedestrian area (see Figure 3).

Figure 4: ROC curves for the three models.

The experiments were conducted with the number of actions in the video set to 20. The number of actions served as the number of topics in the model. The video was divided into contiguous clips of 4 s duration, each clip serving as a document in the model. These values were obtained from experiments carried out during BTP phase-1, so we do not include the corresponding curves here. Anomalous video clips were separated from the rest of the video clips for testing. Of the remaining video clips, 75% were used for training and the other 25% were included in the test data along with the anomalous clips. In the test data, anomalous clips were considered positive examples and non-anomalous clips negative examples.

By applying pLSA to the training documents, the distribution of each document over the topic space was generated. For each topic, documents were sorted according to the probability P(d|z), and the top k documents were manually examined to interpret the topic. Some semantically relevant topics which were discovered included people walking on the zebra crossing from left to right, and cars moving in the right lane of the road.

For anomaly detection, the baseline was the likelihood model with a feature set consisting of location information, quantized optical flow information and the size of the moving object. The feature space was then transformed by replacing the quantized optical flow information with quantized Histogram of Gradient (HOG) and Histogram of Flow (HOF) information. The likelihood model and the projection model were then constructed in this space. The receiver operating characteristic (ROC) curves and the precision-recall curves for the three models were constructed by varying the threshold on the log-normalized likelihood of the test documents in the likelihood models, and on the number of anomalous words present in the test documents in the projection model (see Figure 4 and Figure 5).

Figure 5: Precision-recall curves for the three models.
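The way such curves are traced can be sketched as a threshold sweep over per-clip scores. This is a generic illustration, not the evaluation script used here: it assumes a score where higher means more anomalous (for the projection model, the number of anomalous words; for a likelihood model, one could use the negated log-likelihood), it predicts "anomalous" when the score is at least the threshold, and the function name is hypothetical.

```python
def sweep(scores, labels):
    """One operating point per distinct threshold.

    scores: anomaly score per test clip (higher = more anomalous).
    labels: 1 for anomalous (positive) clips, 0 for usual (negative) clips.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for t in sorted(set(scores)):
        predicted = [s >= t for s in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y)
        fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
        recall = tp / positives if positives else 0.0
        points.append({
            "threshold": t,
            "tpr": recall,                                   # ROC y-axis
            "fpr": fp / negatives if negatives else 0.0,     # ROC x-axis
            "recall": recall,
            "precision": tp / (tp + fp) if tp + fp else 1.0,
        })
    return points
```

Plotting `fpr` against `tpr` gives an ROC curve, and `recall` against `precision` gives a precision-recall curve.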

5. Discussion

In this work, we employed pLSA to model the usual events occurring in a video. The model was then used to identify video clips containing events which are rare and do not fit any of the events learned.

The main highlight of this work is its three-fold contribution. Firstly, we present a novel design for the visual codebook that combines location information with quantized spatio-temporal HOG-HOF descriptors. This combination is relevant mainly in static-camera scenes. Secondly, our proposed projection model algorithm quantifies the anomalous content in a test video clip. This approach is quite robust to the quantity of anomalous content present in the video. Thirdly, this algorithm leads simultaneously to detection and localization of anomalous events in the scene.


References

Olivier Barnich and Marc Van Droogenbroeck. ViBe: A universal background subtraction algorithm for video sequences. Image Processing, IEEE Transactions on, 20(6):1709–1724, 2011.
This paper discusses a technique for foreground extraction in static camera videos. The authors claim it is the current state-of-the-art technique for modeling foreground in real time. The source code is available at http://www.motiondetection.org/.

David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022, 2003.
LDA is a generative model for topic discovery in documents. Here, topics are drawn from a mixture of multinomials from a Dirichlet distribution. A document is a combination of words, which in turn are a combination of these topics.

Gunnar Farneback. Fast and accurate motion estimation using orientation tensors and parametric motion models. In Pattern Recognition, 2000. Proceedings. 15th International Conference on, volume 1, pages 135–139. IEEE, 2000.
This algorithm proposed by Farneback computes dense optical flow. 3D orientation tensors are computed, and velocity is parameterised over the tensor values using an affine model in a local neighborhood.

Thomas Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pages 50–57. ACM, 1999.
pLSA is a semi-generative paradigm for topic discovery. It is a probabilistic extension of LSA.

Weiming Hu, Dan Xie, Zhouyu Fu, Wenrong Zeng, and Steve Maybank. Semantic-based surveillance video retrieval. Image Processing, IEEE Transactions on, 16(4):1168–1181, 2007.
This work aims at querying and indexing a video database for mining relevant semantic information from the videos. The semantic information is obtained from the clusters of motion trajectories.

Ivan Laptev, Marcin Marszalek, Cordelia Schmid, and Benjamin Rozenfeld. Learning realistic human actions from movies. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008.
Introduced the Hollywood dataset (the smaller one). This is the main paper by Laptev reporting the accuracy of HOG, HOF and their combination on the Hollywood and KTH datasets. Note that the accuracy is reported as average accuracy for the KTH dataset and as per-class Average Precision (AP) for the Hollywood dataset. Source code is available.

Thi-Lan Le, Monique Thonnat, Alain Boucher, and Francois Bremond. A query language combining object features and semantic events for surveillance video retrieval. In Advances in Multimedia Modeling, pages 307–317. Springer, 2008.
This work attempts to design a query language that enables the user to obtain images from a database. The query can be either syntactic or semantic in nature. Database indexing and querying is independent of the method used for object and event recognition.

Weixin Li, Vijay Mahadevan, and Nuno Vasconcelos. Anomaly detection and localization in crowded scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence, page 1, 2013.
This work proposed a joint spatial and temporal model for detection and localisation of anomalies in videos. The model uses a mixture of dynamic texture models. Anomaly localisation and detection is done by computing saliency scores. Temporal saliency scores are computed by learning a model of normal events occurring in the video. Spatial saliency scores are computed through a discriminant saliency detector. While a temporal anomaly is flagged if an event has a low probability of occurrence, a spatial anomaly is flagged by an unusual saliency score.


Vijay Mahadevan, Weixin Li, Viral Bhalodia, and Nuno Vasconcelos. Anomaly detection in crowded scenes. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 1975–1981. IEEE, 2010.
This work proposed a joint spatial and temporal model for detection and localisation of anomalies in videos. The model uses a mixture of dynamic texture models. Anomaly localisation and detection is done by computing saliency scores. Temporal saliency scores are computed by learning a model of normal events occurring in the video. Spatial saliency scores are computed through a discriminant saliency detector.

Juan Carlos Niebles, Hongcheng Wang, and Li Fei-Fei. Unsupervised learning of human action categories using spatial-temporal words. International Journal of Computer Vision, 79(3):299–318, 2008.
This seminal paper leverages the idea of topic modelling, used in semantic analysis of text documents, for the analysis of scenes in a video. Similar to words drawn from a natural-language vocabulary which comprise a text document, a video clip consists of visual words drawn from a bag of words extracted from the video. Moreover, topics present in the text documents are analogous to events/actions occurring in the set of videos. Hence, a term-document matrix can be mined using a model like pLSA or LDA to discover the events/actions. Besides categorising the actions/events, they can also be localised in the document.

Mehrsan Javan Roshtkhari and Martin D Levine. Online dominant and anomalous behavior detection in videos. Computer Vision and Image Understanding, 2013.
This work learns a joint model of spatial and temporal behaviour by considering spatio-temporal volumes centred around every pixel in the video. Features learnt on these volumes are used to generate a codebook, which in turn is used to generate cluster centres of the dominant activities.

Satoshi Suzuki et al. Topological structural analysis of digitized binary images by border following. Computer Vision, Graphics, and Image Processing, 30(1):32–46, 1985.
This is a seminal work on contour detection in images.

Jagannadan Varadarajan and J-M Odobez. Topic models for scene analysis and abnormality detection. In Computer Vision Workshops (ICCV Workshops), 2009 IEEE 12th International Conference on, pages 1338–1345. IEEE, 2009.
This work exploits topic modelling for scene segmentation and anomaly detection. A model of usual events occurring in the video is learnt by maximising the log-likelihood of the joint probability distribution of the words (visual words) and the documents (video clips). On a new video clip, the learnt parameters are used to obtain the distribution of the topics given the document. It is expected that the log-likelihood for a video clip containing anomalous events will be low.

Tianzhu Zhang, Si Liu, Changsheng Xu, and Hanqing Lu. Mining semantic context information for intelligent video surveillance of traffic scenes. 2013.
This work employs a trajectory clustering model to obtain semantic contextual information from videos. The scene model for normal events is learnt from the clusters of trajectories observed over a long period of time. Any event whose trajectory does not fit into any of the available trajectory clusters is marked anomalous.
