
Procedia Computer Science, Volume 51, 2015, Pages 2503–2517
ICCS 2015 International Conference On Computational Science
doi: 10.1016/j.procs.2015.05.359
Selection and peer-review under responsibility of the Scientific Programme Committee of ICCS 2015. © The Authors. Published by Elsevier B.V.

Dynamic Data-Driven Application System (DDDAS) for Video Surveillance User Support

Erik P. Blasch, Alex J. Aved
Air Force Research Laboratory, Information Directorate
Rome, NY, USA
[email protected], [email protected]

Abstract
Human-machine interaction mixed initiatives require pragmatic coordination between different systems. Context understanding is established from the content, analysis, and guidance of query-based coordination between users and machines. Inspired by Level 5 Information Fusion ‘user refinement’, a live-video computing (LVC) structure is presented for user-based query access to database-managed information. Information access includes multimedia fusion of query-based text, images, and exploited tracks, which can be utilized for context assessment, content-based information retrieval (CBIR), and situation awareness. In this paper, we explore new developments in dynamic data-driven application systems (DDDAS) of context analysis for user support. Using a common image processing data set, a system-level time savings is demonstrated using a query-based approach in a context, control, and semantic-aware information fusion design.

Keywords: DDDAS, CBIR, live-video computing, information fusion

1 Introduction
Dynamic Data-Driven Application Systems (DDDAS) require system-level coordination between applications modeling, measurement systems, statistical algorithms, and software methods. Applications modeling includes physical, geometrical, or relational models that support control techniques such as target tracking (Uzkent, et al., 2013; Fujimoto, et al., 2014). Measurement systems are used to gather information about the environment, such as a distributed sensor network (Andreopoulos, et al., 2014), vision-based systems (Bhattacharyya, et al., 2014), or user queries (Aved, 2013). Statistical algorithms assess the information for data correlation, feature association, and information fusion for applications such as face recognition (Metaxas, et al., 2004). Finally, software systems include the architectures for cloud-based data access, message passing, data mining, and security (Weissman, et al., 2007; Xiong, et al., 2013; Liu, B., et al., 2014; Wu, et al., 2014).


While each of these DDDAS constructs is important, it is the orchestration of these elements that provides timely and actionable information for user support.

Human-machine interaction (HMI) incorporates decision support, human-centered design, and user augmentation. In each of these approaches, a user interacts with the system for which DDDAS would enable system analysis. If we use information fusion as an example, kinematic models support estimation of the incoming physical data for target detection (Blasch, Seetharaman, et al., 2013). Likewise, semantic models process the human-derived measurement data. The data is fused using a statistical method such as Bayes rule for a given software architecture (Blasch, Al-Nashif, 2014). The software architecture includes a database, services, and methods of access. To access the information, information fusion modeling is needed to provide context (Nguyen, et al., 2013).

Information fusion has been applied to many applications. One commonly accepted model is the Data Fusion Information Group (DFIG) model (shown in Figure 1) as a common processing framework (Blasch, Lambert, et al., 2012). The levels (L) determine the processing in the system, such as L0 data registration, L1 object tracking and identification assessment (Ling, et al., 2010), L2 situation awareness activity analysis (Blasch, Wang, et al., 2013), and L3 impact assessment (Chen, et al., 2007). The complementary control levels are: L4 sensor management, L5 user refinement, and L6 mission (SUM) management (Blasch, 2006). The information fusion processing levels are similar to the DDDAS framework (Blasch, Seetharaman, et al., 2013).

Fig. 1. Data Fusion Information Group Model (L = Level).

Utilization of contextual information by a machine includes the database system, the sensor type (e.g., video), the context data, the extracted features (e.g., the target), and the scenes (e.g., the environment). These operating conditions of the sensor, target, and environment need to be established together to support information exploitation and contextual analysis (Blasch, Steinberg, et al., 2013).

Fig. 2. Man-Machine Systems operation for DDDAS query-based analysis.


The remainder of this paper proceeds as follows. Section 2 describes multimedia database context analysis, and Section 3 describes context indexing for the live-video computing database management system (LVC-DBMS), starting with the data representation for indexing and retrieval. Section 4 provides an example and Section 5 draws conclusions.

2 Multimedia Database Context Analysis
Early database systems were designed to efficiently manage the storage, retrieval, and querying of alphanumeric data (Date, 1977). Figure 3 compares a traditional database management system (DBMS) with a multimedia database system (MMDBS). A typical DBMS implementation, Figure 3 (left), supports business applications by persisting application state, resolving queries, and facilitating transactions to mitigate concurrency errors. Figure 3 (right) illustrates an MMDBS, which can utilize a traditional DBMS to manage metadata and indices, but also encompasses additional technologies and services not typically present in DBMSs, including video on demand, document management and imaging, spatial data, specialized query languages, face recognition, and relevance feedback, to name a few. Because multimedia content, and video in particular, can be quite large and its communication bandwidth intensive, MMDBSs are often paired with specialized communication frameworks, such as the HeRO protocol discussed in (Tantaoui, et al., 2004), in order to provide content delivery to a multitude of concurrent users without overwhelming the physical communication medium.

Multimedia constructs include content (data), entities (features), and scenes (context). Context-enhanced information fusion examples include imagery (Liu, et al., 2012), user queries (Blasch, et al., 2004), text and tracking (Blasch, Bosse, Lambert, 2012), and content-based image retrieval (CBIR). The multiple applications of fusion require resource management (Blasch, et al., 2008) so that user-defined queries can be resolved by the information management system.

Fig. 3 A typical database architecture (left) vs. a multimedia database (right).

2.1 Context Data
Data can be classified as structured or unstructured. Structured data is organized in accordance with a data model (Hoberman, 2005). Some examples of structured data include tabular data stored in a relational database or in an Extensible Markup Language (XML) file. Structured data is recognizable (by both humans and computers) due to the determined ontology. Unstructured data is not inherently organized by an identifiable structure. Examples of unstructured data include audio (e.g., MP3 files), images, and video.


Unstructured data can be categorized by its inherent dimensionality. The simplest type of unstructured data consists of alphabetic characters, and the most complex is video. Table 1 provides a list of different types of unstructured data. The data listed as “Continuous” in the state column consists of data that is related temporally, and one or more of these classifications (or types) of data may be combined and still be considered multimedia (Grimes & Potel, 1991).

Data Dimensionality   Example of Data               State
0                     Characters, text              Discrete
1                     Audio, output from sensor     Continuous
2                     Image, Graphics               Discrete
3                     Video, Animations             Continuous

Table 1: Unstructured Data by State

As previously stated, features represent a measurable property of a type of data that can be observed. Typically more than one feature is extracted to represent an item of multimedia data, and taken together these features form a vector which can correspond to a point in a multidimensional Euclidean feature space. The process of identifying and calculating features from multimedia data is called feature extraction.

2.2 Context Features
MMDBS frameworks typically consist of three primary components, or phases (Zhang, 2008): feature extraction, knowledge representation, and information analysis. At each stage, context can be used, such as determining which features to extract, the knowledge ontology, and the analysis needed.

The first component entails representing the raw multimedia data as a point in an abstract, n-dimensional space termed a feature space, where n is the number of features that describes the data item. The process of representing the data as a point in the feature space is called feature extraction. Similar items should be grouped together (e.g., Figure 4); thus, the feature selection and extraction methods affect the grouping and compactness of the data points. The data compactness in the feature space has ramifications for the effectiveness of retrieval (e.g., k-nearest neighbor (k-NN)) and classification (e.g., the application of support vector machines).
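As an illustration of how raw imagery can be mapped to points in a low-dimensional feature space (as in Figure 4), the following sketch computes two simple features per image; the specific features (mean intensity and a crude edge density) are illustrative assumptions, not the features used by the paper's system:

```python
import numpy as np

def feature_vector(image: np.ndarray) -> np.ndarray:
    """Map a grayscale image (2-D uint8 array) to a point in a 2-D feature space.

    The two features here (mean intensity, edge density) are illustrative only;
    any measurable properties could be substituted.
    """
    intensity = image.mean() / 255.0                  # feature 1: normalized mean brightness
    gy, gx = np.gradient(image.astype(float))         # finite-difference gradients
    edge_density = (np.hypot(gx, gy) > 32).mean()     # feature 2: fraction of "edge" pixels
    return np.array([intensity, edge_density])

# Visually similar images should land near each other in this space,
# which is what makes k-NN retrieval and classification effective.
```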

Fig. 4 Multimedia data (images) represented as points in a 2-dimensional feature space.

The second component of the framework is knowledge representation. A feature represents a measurable property of the multimedia data item (e.g., the number of target pixels in an image (Ling, et al., 2010)). Features are typically represented as numeric data, though they can also be a string or a graph representation. Numeric features are usually chosen, as they can be operated upon mathematically.


Discriminative features should be chosen, and the effectiveness by which the multimedia data may be represented by the selected features will have a significant impact on the performance of the MMDBS.

The third framework component performs some type of analysis or retrieval on the multimedia data that is represented in the feature space, for example, categorization (applying class labels or keywords), retrieval (k-NN), data mining, and image fusion (Liu, et al., 2012).

Each phase is dependent on the feature type, and some features are applicable only to certain modalities of data. Three types of features are described here: geometric, statistical, and meta. Geometric features apply to specific objects that have been identified within a unit of multimedia data (such as a frame of video). Before objects can have features calculated for them, a previous processing step must have been executed to identify the objects contained in the data item. An example of a geometric feature is a moment. In image processing, a moment is a weighted average of the intensities of the pixels that represent the appearance of an object. Features that can be derived from the moments include the area (the number of pixels that contribute to the object's representation) and the centroid (the coordinates of the center of the object). Another simple geometric feature is the shape number. The shape number represents the contour of a shape, and is a sequence that describes the directions of line segments that one would encounter when tracing the shape of an object, having started from some particular boundary point. For details about shape and image processing the reader is referred to a computer vision text, for example (Nixon & Aguado, 2012).
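The moment-derived features just described can be computed directly from a binary object mask; a minimal sketch, assuming the mask comes from a prior segmentation step:

```python
import numpy as np

def area_and_centroid(mask: np.ndarray):
    """Compute two simple geometric features from a binary object mask.

    mask: 2-D boolean array where True marks pixels belonging to the object.
    Returns (area, (row_centroid, col_centroid)); the centroid is the
    first-order moments divided by the zeroth-order moment (the area).
    """
    rows, cols = np.nonzero(mask)
    area = rows.size                       # number of object pixels
    if area == 0:
        return 0, (None, None)
    centroid = (rows.mean(), cols.mean())  # center of the object's pixel support
    return area, centroid
```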

A statistical feature is another type of feature that can describe an image. Statistical features are generally applied to the image as a whole; however, image regions can be used to minimize processing, handle occlusions, and support multisource fusion (Mei, et al., 2011; Mei, et al., 2012). A histogram is an example of a statistical feature that can represent a property of an image, for example, the intensities of the pixel values that represent the appearance of the image. Consider, for example, a grayscale image, which is a two-dimensional image whose pixels represent shades of gray with intensities ranging from 0 (black) to 255 (white). A histogram representing a particular grayscale image could have 256 bins, one for each possible pixel intensity, and the value of each bin would be the number of pixels contained in the image with that particular value. To make the histogram more compact, the bins can be generalized to represent non-overlapping ranges of pixel values. Other features that fit into the statistical category are edges (e.g., the number of pixels that represent edges in the image, as output by some edge detection algorithm) and interest points within the image.
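The histogram feature described above is straightforward to compute; a short sketch in which the number of bins controls how coarsely intensities are grouped into non-overlapping ranges (the default of 256 matches the one-bin-per-intensity case):

```python
import numpy as np

def intensity_histogram(gray: np.ndarray, n_bins: int = 256) -> np.ndarray:
    """Normalized histogram of pixel intensities for an 8-bit grayscale image.

    With n_bins=256 each bin counts a single intensity value; a smaller n_bins
    groups ranges of values into each bin, giving a more compact feature.
    """
    hist, _ = np.histogram(gray.ravel(), bins=n_bins, range=(0, 256))
    return hist / hist.sum()   # normalize so images of different sizes are comparable
```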

Meta features are another class of features that can describe data. Meta features apply to the data as a whole. For example, for an audio recording of music a meta feature could be the name of the artist who recorded the work. For an image, a meta feature could be the focal length of the lens used to capture the image, or the model of camera. For video, frame rate, aspect ratio, language, producer, etc. are all examples of meta features.

As indicated in Table 1, the term multimedia encompasses a number of different modalities of data. In the remainder of this work the modalities of data that are of primary consideration are video and the images (i.e., frames) extracted from the video. It is important to also note that the data (and metadata) generation techniques considered in this work are those that are primarily automated. For example, some algorithms for image segmentation require a human to provide “seed” parameters, but we would still consider such a technique to be automated; as opposed to a technique in which a human observes some data and performs some manual transformation, such as determining relevant labels to associate with said data. This includes user correction (such as correcting a metadata value that is incorrect) or applying (i.e., associating) context with an object (e.g., marking whether or not a video sequence contains a representation of a particular person), other than for purposes such as determining a ground truth baseline. The features are dependent on the contextual scene (Wu, et al., 2011).


2.3 Context Sensors
Background subtraction is the process of separating objects (or portions thereof) of interest in an image from the rest of the image (Ling, et al., 2011). The output from the background segmentation process is a mask image of binary values that indicates which pixels (in the corresponding image) represent the foreground object (or, said another way, the pixels which are detected to not represent the scene background). Frame differencing is the simplest case of background subtraction, in which the foreground pixels of a scene can be determined by taking two images (converted to grayscale to simplify handling the separate color channels) and subtracting (or, finding the absolute difference between) the pixels in the images. Frame differencing can be improved upon by computing the average pixel value from the last n frames, and slowly updating the background model over time to account for slow changes to the illumination of the scene.
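A minimal sketch of frame differencing against a slowly updated running-average background model, assuming grayscale frames supplied as NumPy arrays (the alpha and threshold values are illustrative, not tuned values from this work):

```python
import numpy as np

class RunningAverageBackground:
    """Frame differencing against a running-average background model.

    alpha controls how quickly the background adapts to slow illumination changes;
    threshold is the absolute-difference level above which a pixel is foreground.
    """
    def __init__(self, first_frame: np.ndarray, alpha: float = 0.05, threshold: float = 25.0):
        self.background = first_frame.astype(float)
        self.alpha = alpha
        self.threshold = threshold

    def apply(self, frame: np.ndarray) -> np.ndarray:
        frame = frame.astype(float)
        mask = np.abs(frame - self.background) > self.threshold   # binary foreground mask
        # Update the background only where the scene appears static,
        # so moving objects do not bleed into the model.
        self.background = np.where(
            mask, self.background,
            (1 - self.alpha) * self.background + self.alpha * frame)
        return mask
```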

The background models just discussed model each pixel independently from its neighbors and base the color model on each pixel's recent history, such as the weighted average of the previous n frames. They do not account for complex scenes with moving background elements, like branches moving in the wind, moving water, or clouds passing overhead. Background subtraction methods that improve upon these base the value of background pixels on a probability distribution function (PDF) that follows a Gaussian distribution (Wren, et al., 1997) or a Mixture of Gaussians (MOG) (Stauffer & Grimson, 1999). The downside of MOG is that it does not adapt well to fast-changing backgrounds like waves, or to cases where more than a few Gaussians might be required. The Codebook (Kim, et al., 2005) background segmentation model takes into consideration periodic background variations over a long period of time. In order to conserve the amount of memory required to implement the algorithm, a codebook is constructed by associating with each pixel one or more codewords, which can be thought of as clusters of colors at that pixel, and the clusters need not correspond to a Gaussian or any other parametric distribution. That is, Codebook still encodes the background representation on a pixel-by-pixel basis. Classification of a pixel as background or foreground is done by comparing a pixel's value to the corresponding codewords; if its color distribution is sufficiently close to one of the codewords and its brightness is within a range of the corresponding codeword, the pixel is considered to be part of the background, else it is classified as a foreground pixel. For additional information pertaining to background subtraction methods the reader is referred to the works of (Piccardi, 2004) and (Cheung & Kamath, 2004).
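For comparison, the MOG approach is available as a ready-made background subtractor in OpenCV; a brief usage sketch, assuming OpenCV is installed (the video file name is hypothetical and the parameter values are OpenCV defaults rather than settings used in this paper):

```python
import cv2

cap = cv2.VideoCapture("surveillance.avi")            # hypothetical input video
subtractor = cv2.createBackgroundSubtractorMOG2(
    history=500, varThreshold=16, detectShadows=True)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg_mask = subtractor.apply(frame)                  # per-pixel foreground/background labels
    # A light morphological opening reduces isolated noise pixels in the mask.
    fg_mask = cv2.morphologyEx(fg_mask, cv2.MORPH_OPEN,
                               cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3)))
cap.release()
```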

It should also be noted that the pixels in the foreground mask might not always represent the object completely; that is, there may be some error due to noise. For example, as can be observed in Figure 5, in some situations the pixels that represent the appearance of the object can match the color of the background. The foreground/background error can be mitigated by introducing a post-processing step to reduce noise in the binary foreground mask image, or also group together nearby disconnected components that could correspond to the same object (Parks & Fels, 2008). Further information on context scenes is discussed in (Duda, et al., 1995).

Fig. 5 Aerial scene from the VIRAT dataset (left), and corresponding foreground mask (right).


3 Multimedia Data Representation for Context Indexing
Multimedia constructs include content (data), entities (features), and scenes (context) that are utilized through indexing and retrieval. Collections of multimedia information can grow to very large sizes, consuming many gigabytes of storage space. In order to utilize multimedia content it must be retrieved, whether the retrieval is to find a movie based upon its title, or one is looking for images, clips of audio, or video segments showing a particular subject or class of objects. As an example, consider a table of records in a traditional relational database. Each record in the table can be considered as a point in a multidimensional space (Samet, 2006). Consider a record for an object-track relation with the following fields: {object_id, track_id, situation_id, start_date, end_date}. In this case, records in this table correspond with points in a 5-dimensional space, where three of the dimensions refer to, say, integers (object_id, track_id and situation_id) and the other two dimensions are of type date-time (i.e., start_date, end_date). The DBMS manages the collection of these records and stores them in a file on some persistent media. In order to facilitate efficient retrieval of records in the database, indexes can be created.

The index itself is simply another table (or, correspondingly, a file created and maintained by the DBMS). For example, an index over the field object_id could contain only object_id values and the locations of associated records in the corresponding object-track file. By utilizing the index file to resolve queries, less data needs to be loaded and processed, since the index file contains primarily object_id data (and not other data fields such as situation_id). To further enable efficient retrieval, an ordering can be imposed upon the records, either in the primary data file or in the index. However, to accommodate future record operations on the primary data table (e.g., delete, insert, update) it is often more efficient to impose the ordering only on the data in the index files. For numeric fields, the ordering can be based upon numeric value. For character fields, the order can be based upon corresponding ASCII or UNICODE numeric values, or based upon lexicographic order. For other types of data, such as color, the ordering could be based upon the corresponding hexadecimal value (e.g., red is “ff0000”) or the color's wavelength.
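To make the object-track indexing example concrete, here is a small sketch using Python's built-in sqlite3 module; the table and index names mirror the example fields above and are otherwise illustrative (the DBMS decides when the index is actually used):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE object_track (
        object_id    INTEGER,
        track_id     INTEGER,
        situation_id INTEGER,
        start_date   TEXT,
        end_date     TEXT
    )""")
# Secondary index over object_id: queries filtering on object_id can be resolved
# by scanning the smaller, ordered index rather than the full table.
conn.execute("CREATE INDEX idx_object_id ON object_track(object_id)")
conn.execute("INSERT INTO object_track VALUES (1, 10, 100, "
             "'2015-01-01T00:00:00', '2015-01-01T00:05:00')")
rows = conn.execute(
    "SELECT track_id, start_date FROM object_track WHERE object_id = ?", (1,)).fetchall()
```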

There are many different ways data can be represented, and considering questions such as these can guide the process of designing an implementation.

When considering multimedia for browsing and searching, an index is also required. Some fundamental questions pertaining to multimedia data are what, which, and how. At what granularity should the item be indexed (e.g., frame or interval)? Which refers to which items should be indexed: should all pixels shown in each frame of video be represented somewhere in an index, or should only moving objects be stored? Should the time index at which an object appears or disappears be recorded? How to index an item pertains to selecting and extracting features to be indexed. Data indexing, and more specifically multimedia data indexing, is a multifaceted and difficult problem, and as such, there is a significant quantity of research and, correspondingly, of solutions, indexing algorithms, and data structures. Some works that address the issues of multimedia indexing holistically are (Bolle, et al., 1998; Brunelli, et al., 1999; Wang, et al., 2000; Snoek & Worring, 2005).

To illustrate multimedia indexing, consider the information that can be extracted from a video: the visual component (the visual content represented by pixels in the frames), the auditory information (i.e., audio tracks), and text (text that can be extracted, and metadata pertaining to the video itself such as genre, actors, etc.). A multitude of semantic properties of the video can be extracted from the metadata pertaining to its content: the type of video (e.g., education, training, entertainment), the time period the video covers, major actors who appear, and so forth (Jain & Hampapur, 1994). To index content that is depicted visually in the video, pattern recognition approaches can be employed, for example, template matching, Bayes classifiers, decision trees, Hidden Markov Models, and face and people detection. The reader is referred to (Jain, et al., 2000) for a comprehensive review of pattern recognition techniques. To index videos, they can be decomposed into a series of semantic shots, and each shot can be individually indexed (Ide, et al., 1999). Pertaining to audio data, a number of different techniques can be employed; for example, sounds can be analyzed to detect discussions (Wold, et al., 1996).
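As a concrete illustration of shot decomposition, one simple heuristic is to declare a shot boundary wherever the intensity histograms of consecutive frames differ sharply; the sketch below assumes grayscale frames as NumPy arrays and an illustrative threshold:

```python
import numpy as np

def shot_boundaries(frames, n_bins=64, threshold=0.4):
    """Return indices where a new shot is assumed to start.

    frames: iterable of 2-D grayscale arrays. A boundary is declared when the
    L1 distance between normalized intensity histograms of consecutive frames
    exceeds the threshold; each detected shot can then be indexed separately.
    """
    boundaries = []
    prev_hist = None
    for i, frame in enumerate(frames):
        hist, _ = np.histogram(frame.ravel(), bins=n_bins, range=(0, 256))
        hist = hist / max(hist.sum(), 1)
        if prev_hist is not None and np.abs(hist - prev_hist).sum() > threshold:
            boundaries.append(i)
        prev_hist = hist
    return boundaries
```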

3.1 Content-Based Image Retrieval
Multimedia context is established through content (data), which has been termed content-based image retrieval (CBIR). In the literature there are many ways in which CBIR is described. In particular, it is the application of computer vision techniques to extract information from an image in an automated fashion for the purposes of retrieval. Also referred to as query by image content (QBIC) (Flickner, et al., 1995), it pertains to the retrieval of images based upon what they visually depict, not by metadata or human-ascribed annotations, whose assignment can vary from person to person, culture to culture, reflect personal biases, etc. In CBIR systems, image data is represented by features corresponding to its visual appearance: color, texture, shape, edges, etc. Early work in CBIR was done with pictorial databases (Blaser, 1979; Chang & Fu, 1980).

Present-day CBIR systems facilitate retrieval by accommodating a variety of query methods, including query by example, sketching an image by hand, random browsing, text search (i.e., keyword, speech/voice recognition), and hierarchical navigation by category (Chang, Eleftheriadis, McClintock, 1998). Objects in CBIR systems are represented by features associated with their content (Ezekiel, et al., 2013). As such, feature extraction is an important step inherent to CBIR systems. Features (color, shape, texture, edges, regions, etc.) are extracted and stored in a multidimensional index (feature vectors can range from very few to hundreds of dimensions). Figure 6 provides an example of a system architecture for generic CBIR systems. A user submits an image as a query through a user interface. The query image is parsed and its representative features are extracted. The features from the query image are mapped to a multidimensional query point in the index, and similar images are returned back to the user as the query result.
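A minimal sketch of the retrieval step in Figure 6, in which the query image's feature vector is compared against an in-memory index of precomputed feature vectors and the k nearest images are returned (the flat array index layout is an assumption for illustration, not the paper's index structure):

```python
import numpy as np

def query_by_example(query_features: np.ndarray,
                     index_features: np.ndarray,
                     image_ids: list,
                     k: int = 5):
    """Return ids of the k indexed images closest to the query in feature space.

    index_features: (N, d) array of precomputed feature vectors.
    image_ids: list of N identifiers aligned with the rows of index_features.
    """
    distances = np.linalg.norm(index_features - query_features, axis=1)
    nearest = np.argsort(distances)[:k]           # k smallest Euclidean distances
    return [(image_ids[i], float(distances[i])) for i in nearest]
```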

Fig. 6 Representative architecture of a typical CBIR system.

There are presently many research and commercial CBIR systems; a few representative early examples include QBIC, Virage (Bach, et al., 1996), Photobook (Pentland, et al., 1996), and the Multimedia Analysis and Retrieval System (MARS) (Huang, 1997), to name a few. Additionally there are many good surveys on CBIR techniques and systems (Rui, et al., 1999; Zhao & Grosky, 2002; Y. Liu, et al., 2007).

Although CBIR was originally applied to images, content-based video retrieval (CBVR) is another active area of research due to the commoditization of compute and storage capacity. CBVR is semantically similar to CBIR except its domain is that of video, rather than images. Videos are segmented into shots, which may be represented by key frames (Sato, et al., 1999), and features are extracted and indexed. At that point retrieval is similar to the workflow for CBIR (Geetha & Narayanan, 2008). Of course, video adds the potential to fuse additional data modalities not available in traditional CBIR into the indexing and retrieval process, such as correlation with audio tracks (Foote, 1999; Z. Liu & Huang, 2000; Makhoul, et al., 2000).

CBIR is enabled by the database structure for contextual analysis.

3.2 LVC-DBMS
The LVC-DBMS uses a query optimizer and associated execution environment that is designed for the LVC environment (Aved, 2013). It performs query optimization at runtime, taking a new query, finding any possible overlap with the existing queries in the system, and rewriting the new query in order to minimize duplicate subexpressions and optimize the utilization of the query execution engine (Figure 7). Query optimization methods in the LVC environment reduce query execution overhead by merging the physical algebra query trees. The merging of query trees is done through context associations. In the LVC-DBMS prototype, query optimization is done in the spatial and stream processing layers (Figure 8). To facilitate performance evaluation of query optimization, a query cost metric was derived and used to present optimization performance results.
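The LVC-DBMS optimizer itself is detailed in (Aved, 2013); purely to illustrate the idea of merging query trees so that duplicate subexpressions are evaluated only once, the following hypothetical sketch caches operator nodes by a structural key (the class and operator names are assumptions, not the LVC-DBMS implementation):

```python
class Operator:
    """A node in a physical query tree (e.g., a detector or filter over one camera)."""
    def __init__(self, name, *children):
        self.name = name
        self.children = children

    def key(self):
        # Structural signature used to detect duplicate subexpressions.
        return (self.name, tuple(c.key() for c in self.children))

class QueryRegistry:
    """Merges new query trees into the running set by reusing identical subtrees."""
    def __init__(self):
        self._nodes = {}

    def register(self, op: Operator) -> Operator:
        children = tuple(self.register(c) for c in op.children)
        k = (op.name, tuple(c.key() for c in children))
        if k not in self._nodes:
            self._nodes[k] = Operator(op.name, *children)
        return self._nodes[k]

# Registering a second query that shares a subtree (e.g., person detection on the
# same camera) returns the already-registered node, so the shared work runs once.
```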

Fig. 7. Query lifecycle; from the inception of a query to results delivered to the issuer.


Fig. 8. LVC-DBMS major system components.

4 Results
To evaluate the performance of the cross-camera tracking, videos from the DARPA VIRAT project are utilized as shown in Figure 9. The VIRAT video collection shows a number of different scenarios with different objects from multiple vantage points. These scenarios have different backgrounds but the same objects moving about in them, being observed from different angles.

Fig. 9. Sample frames from the VIRAT dataset: Video 1 – Bus Moves through Parking Lot; Video 2 – People Move Around a Truck; Video 3 – People Move Around Building; Video 4 – Truck Load and Unload in Lot.

The LVC-DBMS receives the raw imagery from the camera and adds contextual metadata pertaining to the frames and objects observed. The raw imagery is a temporally ordered series of frames of video that is simply data: a two-dimensional matrix of pixel values corresponding to what was sensed by the imaging device. The camera adapter implements algorithms to model the scene background. As salient objects move across the background, a segmentation algorithm attempts to determine which pixels belong to the scene background and which pixels do not. Pixels that are not recorded as part of the background are grouped together into a blob. Enhanced methods can make use of context to more accurately determine which pixels belong to the target.

Blobs are assigned an identifier that is unique to each instance of a camera adapter (and thus, unique to each video stream), as shown in Figure 10. A frame-to-frame tracker maintains correspondences between objects, as they move, from one frame to the next. By maintaining these correspondences, the image analysis module computes a feature vector corresponding to the visual appearance of each object in each frame of video. The output of the image analysis module is contextual metadata describing each frame, e.g. a monotonically increasing frame number, timestamp when the frame was received, the number of objects in the frame and their identifiers, privacy filters associated with the camera, etc.
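Purely as an illustration of the kind of per-frame contextual metadata the image analysis module emits, a hypothetical record structure is sketched below (field names are assumptions, not the LVC-DBMS schema):

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class FrameMetadata:
    """Contextual metadata describing one analyzed frame of video."""
    frame_number: int                    # monotonically increasing per camera adapter
    timestamp: float                     # time the frame was received
    camera_id: str
    object_ids: List[int] = field(default_factory=list)       # blob identifiers in this frame
    features: Dict[int, list] = field(default_factory=dict)   # object_id -> appearance feature vector
    privacy_filters: List[str] = field(default_factory=list)  # filters associated with the camera
```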

Fig. 10. Video Exploitation Examples.

The continuous queries (i.e., subqueries) in the LVC-DBMS are evaluated periodically. The interval in which they are evaluated is referred to as the resolution of the query. The amount of time required to evaluate each subquery should be less than its resolution, else subquery evaluations will be missed due to the evaluation time running over into the next evaluation time slot. To test the subquery evaluation performance of the execution engine and related metadata structures, an evaluation scenario comprised executing queries simultaneously for a fixed duration. The evaluation was done to detect objects, classify type (by size), and discern activities (Hammoud, Sahin, Blasch, et al., 2014). The LVC-DBMS could also be applicable to other image collections, such as plume detection (Ravela, 2013).
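A sketch of the scheduling constraint just described: each continuous subquery is evaluated once per resolution interval, and any evaluation that runs longer than the resolution counts as a missed slot (the evaluate callable stands in for real subquery execution):

```python
import time

def run_continuous_query(evaluate, resolution_s: float, duration_s: float) -> int:
    """Evaluate a subquery once per resolution interval for a fixed duration.

    Returns the number of evaluations whose running time exceeded the
    resolution, i.e. cases where the next scheduled slot would be missed.
    """
    missed = 0
    end = time.monotonic() + duration_s
    next_slot = time.monotonic()
    while time.monotonic() < end:
        start = time.monotonic()
        evaluate()                          # resolve the subquery against current metadata
        if time.monotonic() - start > resolution_s:
            missed += 1                     # evaluation ran into the next time slot
        next_slot += resolution_s
        time.sleep(max(0.0, next_slot - time.monotonic()))
    return missed
```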

In the videos, the various scenes presented different scenarios, of which the evaluation is shown in Figure 11. For a single target in a simple scene, tracking through activity analysis is timely. As the number of objects increases, detection time is about the same; however, discerning the activity type requires interactions and relationships between targets. Finally, as the number of objects grows and the scene complexity increases, there could be a long delay in discerning the activities. However, if a query is used to capture only the designated type of target in the complex scene, there could be a timely response. Furthermore, the query could actually reduce the time needed, as the kinematic and physical object properties (e.g., features) aid the algorithms in processing.


Fig. 11 Time Analysis of Object Detection and Activity Discerning for Evaluation.

For the first analysis, the videos were analyzed for complexity based on the valid detections, as plotted in Figure 12. Videos 1 and 2 had simple activities, which allowed for more valid detections. The complex activities challenged the system such that there were a low number of discerned activities, such as a continuous track.

Fig. 12 Evaluation scene complexity: number of detections versus time (s) for Videos 1–4.

One of the challenges in video analysis is the interplay between scene complexity and object activity. The four videos established the four cases from which to evaluate a given tracking analysis. Using the same tracker to detect, classify, and discern objects provided a fair analysis. However, absolute scoring was difficult, so we used a relative analysis, as shown in Table 2.

Video                                  Scene     Activity   Normalized Switches
Bus Moves through Parking Lot (1)      Simple    Simple      2.86
People Move Around Truck (2)           Complex   Simple      4.93
People Move Around Building (3)        Complex   Complex    15.19
Truck Load and Unload in Lot (4)       Simple    Complex    11.52

Table 2. Analysis of DDDAS-queries in the Selected Videos


5 Conclusions
Large-scale multimedia applications will require dynamic data-driven application systems (DDDAS) approaches to bring together context, activity, and semantic analysis. Contextual analysis includes feature extraction, scene content, and continuity of target handoff. To efficiently transfer contextual data for information fusion exploitation, information management was developed using content-based image retrieval (CBIR) and live-video computing (LVC). Context-aware tracking is established with the tracking methods, and mapping linguistic queries to image features facilitates semantic-aware information fusion. The LVC-DBMS includes efficient query processing techniques, web-service communication, and a scalable application. Experimental results show that the LVC-DBMS can effectively recognize events observed in video streams and efficiently process semantic queries. The key elements described include a prototype LVC-DBMS implementation with:

• Context-aware: a high-level query language for specifying events and interacting with the LVC-DBMS,
• Semantic-aware: efficient query processing and execution techniques to maximize compute and memory resource usage over a large set of linguistic information, and
• Activity-aware: infrastructures permitting the specification of object activities and their relationships in a real-time stream processing environment.

Future efforts include using the information exploitation system for big-data problems including physics-based and human-based video-to-text information fusion (Hammoud, et al., 2014), pattern of life analysis (Gao, et al., 2013), and other data sources (Panasyuk, et al., 2013).

Acknowledgements
This work is partly supported by the Air Force Office of Scientific Research (AFOSR) under the Dynamic Data Driven Application Systems program and the Air Force Research Lab.

References
Andreopoulos, I., Atan, O., Tekin, C., van der Schaar, M. (2014). Bandit framework for systematic learning in wireless video-based face recognition. IEEE Journal of Selected Topics in Signal Processing, Vol. 99.
Aved, A. J. (2013). Scene Understanding for Real Time Processing of Queries over Big Data Streaming Video, PhD Dissertation, University of Central Florida.
Bach, J.R., Fuller, C., Gupta, A., Hampapur, A., Horowitz, B., Humphrey, R., Jain, R., Shu, C.F. (1996). The Virage image search engine: An open framework for image management. Proc. SPIE, Vol. 2670.
Bhattacharyya, S. S., van der Schaar, M., Atan, O., Tekin, C., Sudusinghe, K. (2014). Data-Driven Stream Mining Systems for Computer Vision, Ch. 12 in B. Kisacanin and M. Gelautz (eds.), Advances in Embedded Computer Vision, Springer International Publishing Switzerland.
Blasch, E., Plano, S. (2004). Cognitive Fusion Analysis Based on Context, Proc. of SPIE, Vol. 5434.
Blasch, E. (2006). Sensor, User, Mission (SUM) Resource Management and their interaction with Level 2/3 fusion, Int. Conf. on Info Fusion.
Blasch, E., Kadar, I., Hintz, K., Biermann, J., et al. (2008). Resource Management Coordination with Level 2/3 Fusion Issues and Challenges, IEEE Aerospace and Elec. Sys. Mag., Vol. 23, No. 3, pp. 32-46, Mar.
Blasch, E., Lambert, D. A., Valin, P., et al. (2012). High Level Information Fusion (HLIF) Survey of Models, Issues, and Grand Challenges, IEEE Aerospace and Electronic Systems Mag., Vol. 27, No. 9, Sept.
Blasch, E., Al-Nashif, Y., Hariri, S. (2014). Static versus Dynamic Data Information Fusion analysis using DDDAS for Cyber Trust, International Conference on Computational Science.
Blasch, E., Seetharaman, G., et al. (2013). Dynamic Data Driven Applications Systems (DDDAS) modeling for Automatic Target Recognition, Proc. SPIE, Vol. 8744.


Blasch, E. P., Bosse, E., Lambert, D. A. (2012). High-Level Information Fusion Management and Systems Design, Artech House, Norwood, MA.
Blasch, E., Seetharaman, G., Reinhardt, K. (2013). Dynamic Data Driven Applications System concept for Information Fusion, International Conference on Computational Science.
Blasch, E., Steinberg, A., Das, S., Llinas, J., Chong, C-Y., Kessler, O., Waltz, E., White, F. (2013). Revisiting the JDL model for Information Exploitation, International Conf. on Information Fusion.
Blasch, E., Wang, Z., Ling, H., Palaniappan, K., Chen, G., Shen, D., Aved, A. (2013). Video-Based Activity Analysis Using the L1 tracker on VIRAT data, IEEE Applied Imagery Pattern Recognition Workshop.
Blaser, A. (1979). Data base techniques for pictorial applications. Springer, Florence.
Bolle, R.M., Yeo, B.L., Yeung, M. (1998). Video query: Research directions, IBM J. Res. Dev. 42, 233–252.
Brunelli, R., Mich, O., Modena, C.M. (1999). A Survey on the Automatic Indexing of Video Data. J. Vis. Communication Image Represent. 10, 78–112.
Chang, N.S., Fu, K.S. (1980). Query-by-pictorial-example. IEEE Trans. Software Eng. 519–524.
Chang, S.F., Eleftheriadis, A., McClintock, R. (1998). Next-generation content representation, creation, and searching for new-media applications in education. Proc. IEEE, 86, 884–904.
Chen, G., Shen, D., Kwan, C., et al. (2007). Game Theoretic Approach to Threat Prediction and Situation Awareness, Journal of Advances in Information Fusion, Vol. 2, No. 1, 1-14, June.
Cheung, S.C.S., Kamath, C. (2004). Robust techniques for background subtraction in urban traffic video. Proc. SPIE, Vol. 5308, pp. 881–892.
Date, C. J. (1977). An Introduction to Database Systems (2nd ed.). Addison-Wesley Publishing Company, Inc.
Duda, R. O., Hart, P. E., Stork, D. G. (1995). Pattern Classification and Scene Analysis, 2nd ed., Wiley, NY.
Ezekiel, S., Alford, M. G., Ferris, D., Jones, E., et al. (2013). Multi-Scale Decomposition Tool for Content Based Image Retrieval, IEEE Applied Imagery Pattern Recognition Workshop.
Flickner, M., Sawhney, H., Niblack, W., Ashley, J., Huang, Q., Dom, B., Gorkani, M., Hafner, J., Lee, D., Petkovic, D. (1995). Query by image and video content: The QBIC system, Computer, 28, 23–32.
Foote, J. (1999). An overview of audio information retrieval. Multimedia Systems, 7, 2–10.
Fujimoto, R., Guin, A., Hunter, M., Park, H., Kanitkar, G. (2014). A Dynamic Data Driven Application System for Vehicle Tracking, International Conference on Computational Science.
Gao, J., Ling, H., et al. (2013). Pattern of life from WAMI objects tracking based on visual context-aware tracking and infusion network models, Proc. SPIE, Vol. 8745.
Geetha, P., Narayanan, V. (2008). A Survey of Content-Based Video Retrieval. J. Comput. Sci. 4, 474–486.
Grimes, J., Potel, M. (1991). What is multimedia? Computer Graphics and Applications, IEEE, 11(1), 49–52.
Hammoud, R. I., Sahin, C. S., Blasch, E., et al. (2014). Multi-Source Multi-Modal Activity Recognition in Aerial Video Surveillance, Int. Computer Vision and Pattern Recognition (ICVPR) Workshop.
Hammoud, R. I., Sahin, C. S., et al. (2014). Automatic Association of Chats and Video Tracks for Activity Learning and Recognition in Aerial Video Surveillance, Sensors, 14, 19843-19860.
Hoberman, S. (2005). Data modeling made simple: A practical guide for business & information technology professionals, Technics Publications.
Huang, T., Mehrotra, S., Ramchandran, K. (1997). Multimedia analysis and retrieval system (MARS) project.
Ide, I., Yamamoto, K., Tanaka, H. (1999). Automatic video indexing based on shot classification, Adv. Multimedia Content Process, 87–102.
Jain, R., Hampapur, A. (1994). Metadata in video databases, ACM Sigmod Rec., 23, 27–33.
Jain, A.K., Duin, R.P.W., Mao, J. (2000). Statistical pattern recognition: A review, IEEE T. Pattern Analysis and Machine Intelligence, 22, 4–37.
Kim, K., Chalidabhongse, T.H., Harwood, D., Davis, L. (2005). Real-time foreground–background segmentation using codebook model. Real-Time Imaging 11, 172–185.
Ling, H., Bai, L., Blasch, E., Mei, X. (2010). Robust Infrared Vehicle Tracking Across Target Pose Change using L1 regularization, Int'l Conf. on Info Fusion.
Ling, H., Wu, Y., Blasch, E., Chen, G., Bai, L. (2011). Evaluation of Visual Tracking in Extremely Low Frame Rate Wide Area Motion Imagery, International Conference on Information Fusion.
Liu, B., Chen, Y., et al. (2014). Information Fusion in a Cloud Computing Era: A Systems-Level Perspective, IEEE Aerospace and Electronic Sys. Mag., Vol. 29, No. 10, pp. 16-24, Oct.
Liu, Y., Zhang, D., Lu, G., Ma, W.Y. (2007). A survey of content-based image retrieval with high-level semantics, Pattern Recognition, 40, 262–282.
Liu, Z., Huang, Q. (2000). Content-based indexing and retrieval-by-example in audio. IEEE International Conference on Multimedia and Expo, pp. 877–880.


Liu, Z., Blasch, E., Xue, Z., Langaniere, R., Wu, W. (2012). Objective Assessment of Multiresolution Image Fusion Algorithms for Context Enhancement in Night Vision: A Comparative Survey, IEEE Trans. Pattern Analysis and Machine Intelligence, 34(1):94-109.
Makhoul, J., Kubala, F., Leek, T., Liu, D., Nguyen, L., Schwartz, R., Srivastava, A. (2000). Speech and language technologies for audio indexing and retrieval, Proc. IEEE, 88, 1338–1353.
Mei, X., Ling, H., Wu, Y., et al. (2011). Minimum Error Bounded Efficient L1 Tracker with Occlusion Detection, IEEE Computer Vision and Pattern Recognition.
Mei, X., Ling, H., Wu, Y., et al. (2013). Efficient Minimum Error Bounded Particle Resampling L1 Tracker with Occlusion Detection, IEEE Trans. on Image Processing (T-IP), Vol. 22, Issue 7, 2661–2675.
Metaxas, D., Venkataraman, S., Vogler, C. (2004). Image-based stress recognition using a model-based dynamic face tracking system, International Conference on Computational Science.
Nixon, M., Aguado, A.S. (2012). Feature Extraction & Image Processing for Computer Vision. Academic Press.
Nguyen, N., Khan, M. M. H. (2013). Context Aware Data Acquisition Framework for Dynamic Data Driven Applications Systems (DDDAS), IEEE MILCOM.
Panasyuk, A., Blasch, E., Kase, S. E., Bowman, L. (2013). Extraction of Semantic Activities from Twitter Data, Proc. International Conference on Semantic Technologies for Intelligence, Defense, and Security (STIDS).
Parks, D.H., Fels, S.S. (2008). Evaluation of background subtraction algorithms with post-processing, IEEE International Conference on Advanced Video and Signal Based Surveillance.
Pentland, A., Picard, R.W., Sclaroff, S. (1996). Photobook: Content-based manipulation of image databases, Int. J. Computer Vision, 18, 233–254.
Piccardi, M. (2004). Background subtraction techniques: a review, IEEE International Conference on Systems, Man and Cybernetics, pp. 3099–3104.
Ravela, S., Sleder, I. (2013). Mapping Coherent Atmospheric Structures with Small Unmanned Aircraft Systems, AIAA Infotech at Aerospace Conference.
Rui, Y., Huang, T.S., Chang, S.F. (1999). Image retrieval: Current techniques, promising directions, and open issues, J. Vis. Commun. Image Represent. 10, 39–62.
Samet, H. (2006). Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann.
Sato, T., Kanade, T., Hughes, E.K., Smith, M.A., Satoh, S. (1999). Video OCR: indexing digital news libraries by recognition of superimposed captions. Multimedia Systems, 7, 385–395.
Snoek, C.G.M., Worring, M. (2005). Multimodal video indexing: A review of the state-of-the-art, Multimedia Tools Appl. 25, 5–35.
Stauffer, C., Grimson, W.E.L. (1999). Adaptive background mixture models for real-time tracking. IEEE Conf. on Computer Vision and Pattern Recognition.
Tantaoui, M. A., Hua, K. A., Do, T. T. (2004). BroadCatch: a periodic broadcast technique for heterogeneous video-on-demand, IEEE Transactions on Broadcasting, 50(3), 289–301.
Uzkent, B., Hoffman, M. J., Vodacek, A., Kerekes, J. P., Chen, B. (2013). Feature Matching and Adaptive Prediction Models in an Object Tracking DDDAS, International Conference on Computational Science.
Wang, Y., Liu, Z., Huang, J.C. (2000). Multimedia content analysis using both audio and visual clues. IEEE Signal Processing Mag. 17, 12–36.
Weissman, J. B., Kumar, V., Chandola, V., et al. (2007). DDDAS/ITR: A Data Mining and Exploration Middleware for Grid and Distributed Computing, International Conference on Computational Science.
Wold, E., Blum, T., Keislar, D., Wheaten, J. (1996). Content-based classification, search, and retrieval of audio, IEEE Multimedia, 3, 27–36.
Wren, C.R., Azarbayejani, A., Darrell, T., Pentland, A.P. (1997). Pfinder: Real-time tracking of the human body, IEEE Trans. on Pattern Analysis and Machine Intelligence, 19, 780–785.
Wu, R., Chen, Y., et al. (2014). A Container-based Elastic Cloud Architecture for Real-Time Full-Motion Video (FMV) Target Tracking, IEEE Applied Imagery Pattern Recognition Workshop.
Wu, Y., Blasch, E., Chen, G., Bai, L., Ling, H. (2011). Multiple Source Data Fusion via Sparse Representation for Robust Visual Tracking, Int. Conf. on Information Fusion.
Xiong, L., Sunderam, V., Fan, L., et al. (2013). PREDICT: Privacy and Security Enhancing Dynamic Information Collection and Monitoring, Int'l Conf. on Computational Science, pp. 1979–1988.
Zhang, Z., Zhang, R. (2008). Multimedia Data Mining: A Systematic Introduction to Concepts and Theory. Chapman & Hall/CRC.
Zhao, R., Grosky, W.I. (2002). Negotiating the semantic gap: from feature maps to semantic landscapes. Pattern Recognition, 35, 593–600.
