Analysis of Incident-Level Crime Data Using Clustering with Hybrid Metrics

Mehmet Sait VURAL Computer Engineering Dept., Mersin University ([email protected])

Mustafa GÖK Electric-Electronic Engineering Dept., Çukurova University ([email protected])

Zeki YETGİN Computer Engineering Department, Mersin University ([email protected])

Abstract Law enforcement agencies use modern crime analysis software to solve and prevent crimes. Developing crime analysis tools requires access to incident-level crime data (criminals' IDs, time and place of incidents, etc.). However, such data is very hard to obtain in practice because crime data is confidential. Moreover, crime analysis usually processes aggregate-level information, such as the frequency of crimes occurring over a particular geography, rather than incident-level data. In this paper, a decision-making method is proposed to infer closely related incidents using clustering with hybrid similarity metrics. The incident-level crime data is generated artificially using a parametric GIS model. The motivation for this approach is that crime analysis methods in general do not require a fully realistic dataset for developing and testing decision-making algorithms. The motivation for using hybrid metrics is that incident-level crime data includes both numeric and categorical information, with feature vectors of varying length. To evaluate the proposed method, a novel performance metric, called Crime-Driven Causal Relation Performance (CDCRP), is introduced. The results show that the proposed method identifies causally related incidents well.

Keywords: Crime Analysis, GIS, Causal Relation, Decision-Making, Clustering Algorithms.

1. Introduction

Crime incidents continue to grow in rate and complexity. Criminology supports law enforcement agencies and government forces in analyzing and preventing crime incidents through interdisciplinary studies, such as statistics and artificial intelligence. Crime analysis has no universally agreed definition; in practice it ranges from statistical information (Enzmann & Podana 2010) to more complex data analyses such as decision-making (Horváth & Kolomazníková 2003) or pattern recognition (Nath 2006) in crime data. The objective of most crime analysis is to find meaningful information in the crime data to assist investigators in their work on criminal activities.

There are two general issues in crime analysis. The first is the data acquisition process, which has many practical limitations. Because of official regulations, crime inspectors and law enforcement agencies cannot share incident-level data that includes the details of the incidents, such as the criminals' identities, the time and place of incidents, and the criminals' acquaintances. Thus, in practice, crime-related data is released at the aggregate level, with the incident-level data hidden. Moreover, incident-level data includes heterogeneous data types, such as numerical (e.g., time and place of incidents) and categorical (e.g., crime types and criminals' IDs). Importantly, the acquaintance list is not fixed in length, so the feature vectors describing the crime incidents vary in length too. This implicitly requires that the dataset contain sufficient characteristics and sufficient observations to develop and test crime analysis methods over a GIS map.

The other issue is the inference engine that derives decisions from the crime data. A decision may take many forms, such as a response to a GIS query, statistical information, or clustering and classification of crime data. In general, decision-making processes in crime analysis learn from observations in either a supervised or an unsupervised manner. Supervised methods learn a model by training on the data, whereas unsupervised methods use the data directly without a training phase. Here, we use the unsupervised approach to cluster similar incidents. Clustering-based crime analysis can be used to find criminals as well as past and future trends. For example, finding the criminals of an incident simply requires analyzing the cluster to which the incident belongs. In general, clustering-based crime analysis can be used to find interrelations among incidents. However, feature vectors of varying length make the clustering process hard and require new similarity metrics that account for the heterogeneous nature of the vectors.

In this paper, we use a GIS model to generate the incident-level crime dataset. The datasets are generated by modeling a real-life population. We then demonstrate an unsupervised decision-making process that uses hierarchical clustering algorithms to find closely related incidents. The clustering process is based on hybrid similarity metrics that account for the varying-length and heterogeneous nature of the feature vectors. The rest of the paper is organized as follows: Section 2 reviews the related work, Section 3 provides the system model and the formulation of the problem, and Section 4 presents the experimental results of the proposed method. Finally, the conclusion and future directions are given.

2. Related Work

In the literature, work on crime analysis uses well-known crime datasets that are publicly available through web sites (The City of Oakland's Crime Watch 2013, View Crime Statistics 2013, Interpol 2013, UNODC 2013, European Sourcebook 2013). These data sources commonly provide statistical information about the crimes occurring over a geographical region. For example, the Oakland Police Department provides periodic crime data to the public through its web site (The City of Oakland's Crime Watch 2013), where criminal-related information is hidden for the criminals' safety. The type of information given in the datasets also varies. Some datasets provide only the crime types and their frequencies in a particular region. Others provide additional data such as locality (town/city/county), latitude, and further information as given in (Al-Janabi 2011). The datasets in the literature can be broadly classified according to whether they provide incident-level data (The City of Oakland's Crime Watch 2013, Al-Janabi 2011) or aggregate-level data (Al-Janabi 2011, View Crime Statistics 2013). In

GAU Journal of Soc. & App. Sciences, vol. 6, issue 10

incident-level datasets, data is given separately for each crime incident, whereas in aggregate-level datasets, incidents are grouped and summary data is given for a particular criterion such as geographical region or time period. From the decision-making perspective, crime analysis uses either a supervised approach, such as analysis using neural networks (Memon & Mehboob 2003), or an unsupervised approach, such as analysis using clustering techniques (Singhal & Seborg 2005, Corduas & Piccolo 2008). Clustering techniques group similar data items according to a given similarity metric, where the feature extraction and the choice of similarity metric strongly affect the final results. These techniques are effective in crime association and prediction because they explore the relationship between geography and crime. For example, several fields of criminology, including crime mapping (Harries 1999), dasymetric mapping (Poulsen & Kennedy 2004), geographic profiling (Rossmo 2000), and crime forecasting (Gorr & Harries 2003), arose from the relations between geography and other crime data. Similarly, grouping statistically significant crimes in an area into clusters and flagging problematic areas are studied in (Memon & Mehboob 2003, Corduas & Piccolo 2008). Cluster analysis is also used in mining associations (Singhal & Seborg 2005, Corduas & Piccolo 2008, Nath 2006, Keogh et al. 2003, Everitt 1993, Wang et al. 2006), where the association of the clusters with the important crime attributes is considered, which leads to enhanced crime prevention strategies. A comparison of various clustering methods in crime analysis, including k-means, DBSCAN, and hierarchical clustering, is given in (Memon & Mehboob 2003, Corduas & Piccolo 2008).
All the aforementioned works assume aggregate-level data, and hence their analysis focuses primarily on aggregate-level functionalities, such as finding interrelations among grouped entities rather than among incidents. In this paper, we provide a model to generate incident-level data, and hence a cluster analysis that focuses primarily on incident-level functionalities, such as finding interrelations among incidents by considering any crime attribute, e.g. the criminal and the social network of criminals.

3. System Model

3.1. Dataset generation

In this section, the proposed GIS model that generates the incident-level data is formulated. The generated crime data involves the following crime attributes: crime type, incident place, incident time, criminal identification, and the criminal's acquaintances who are or were criminals too. Figure 1 shows the relations among the crime attributes. The model consists of three sequential phases: the first phase creates the geographical map, the second phase distributes the crime incidents over the map, and the third phase distributes the criminals and their acquaintances over the crimes. The model supports the generation of random maps, where the map parameters, such as the number of regions and the map size, are given via the user interface of the model.

Figure 1: Crime attributes considered in the dataset

In the second phase, all the incidents are distributed over the map according to a Gaussian mixture and instantiated with the incident time, place, and crime type. Each region center is used as the mean of its Gaussian component, and the variances are formed randomly using the parameters given via the user interface. The main reason for using a mixture of Gaussians is that in real life crimes usually occur with higher density around population centers. Moreover, a Gaussian mixture can model any population structure if suitable Gaussian parameters are given. In the last phase, all the criminals and their associated acquaintances are distributed over the crimes. The acquaintances are distributed around several localities, as shown in Figure 2. The majority of the acquaintances are selected randomly around the home locality (e.g. municipality/town/city), and some around the incident localities. A minority of the acquaintances are selected across the map locality. The home and incident localities are modeled by a mixture of Gaussians, whereas the map locality is modeled by a uniform distribution. The reason for this approach is that people usually have more acquaintances in local regions, e.g. where they live and where the incidents occur. Detailed information about the phases of the crime data generation is given in (Vural et al. 2013).

Figure 2: Distribution of the criminal’s acquaintances on localities.
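The second and third phases described above can be illustrated with a minimal Python sketch. It is not the simulation system of the paper: the function names, the equal mixture weights, the isotropic standard deviations, the uniform incident times, and the locality probabilities (0.6 / 0.3 / 0.1) are all assumptions chosen only for illustration.

```python
import random

def generate_incidents(region_centers, sigmas, n_incidents, t_max, crime_types, seed=0):
    """Draw incidents from a mixture of 2-D Gaussians, one component per region.

    Assumptions (not from the paper): equal component weights, one isotropic
    standard deviation per region, and incident times uniform on [0, t_max].
    """
    rng = random.Random(seed)
    incidents = []
    for incident_id in range(n_incidents):
        k = rng.randrange(len(region_centers))   # pick a mixture component
        cx, cy = region_centers[k]               # region center = component mean
        incidents.append({
            "id": incident_id,
            "region": k,
            "x": rng.gauss(cx, sigmas[k]),
            "y": rng.gauss(cy, sigmas[k]),
            "time": rng.uniform(0.0, t_max),
            "crime": rng.choice(crime_types),
        })
    return incidents

def pick_acquaintance_locality(rng, p_home=0.6, p_incident=0.3):
    """Choose the locality an acquaintance is drawn from.

    The probabilities are hypothetical; they only encode "majority around home,
    some around the incident, a minority anywhere on the map".
    """
    u = rng.random()
    if u < p_home:
        return "home"
    if u < p_home + p_incident:
        return "incident"
    return "map"

if __name__ == "__main__":
    demo = generate_incidents([(0.0, 0.0), (100.0, 100.0)], [5.0, 8.0],
                              n_incidents=5, t_max=365.0,
                              crime_types=["theft", "assault"])
    for inc in demo:
        print(inc)
```

A "home" or "incident" draw would then sample a position from the corresponding Gaussian component, while a "map" draw would sample uniformly over the whole map.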

3.2. Problem Formulation

One challenge of the dataset is that the feature vectors are not uniform in either length or data format. So computing the similarity between any pair of vectors requires sophisticated distance metrics or transformations. One solution, provided here, is to project (transform) the feature vector space onto the so-called similarity space, where each point is the similarity between an incident pair. The idea is that the similarity between a pair of incidents is based on the inter-similarities between features of the same type. So the similarity space consists of quadruples (SimCrime, SimCriminal, SimPlaceTime, SimAcq), where each dimension describes a similarity from the given perspective. Let Incident_i and Incident_j be two feature vectors. The similarity between them is denoted by Sim(i,j) and formulated in Eq. 1-6:

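$$\mathrm{Sim}(i,j) = \mathrm{SimCrime}(i,j) + \mathrm{SimCriminal}(i,j) + \mathrm{SimPlaceTime}(i,j) + \mathrm{SimAcq}(i,j) \tag{1}$$

$$\mathrm{SimCrime}_{i,j} = \frac{\left|\bigcup_{(X\,op\,Y)_i}\mathrm{Criminal}\;\cap\;\bigcup_{(X\,op\,Y)_j}\mathrm{Criminal}\right|}{\mathrm{Max}\left(\left|\bigcup_{(X\,op\,Y)_i}\mathrm{Criminal}\right|,\;\left|\bigcup_{(X\,op\,Y)_j}\mathrm{Criminal}\right|\right)} \tag{2}$$

$$\mathrm{SimCriminal}_{i,j} = \frac{\left|\bigcup_{(X\,op\,Y)_i}\left(\mathrm{Criminal}\in\ldots\right)\;\cap\;\bigcup_{(X\,op\,Y)_j}\left(\mathrm{Criminal}\in\ldots\right)\right|}{\mathrm{Max}\left(\left|\bigcup_{(X\,op\,Y)_i}\left(\mathrm{Criminal}\in\ldots\right)\right|,\;\left|\bigcup_{(X\,op\,Y)_j}\left(\mathrm{Criminal}\in\ldots\right)\right|\right)} \tag{3}$$

$$\mathrm{DistSimPlace} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2 + (t_i - t_j)^2} \tag{4}$$

$$\mathrm{SimPlaceTime} = \frac{\mathrm{DistSimPlace}}{\mathrm{Max}(\mathrm{DistSimPlace})} \tag{5}$$

$$\mathrm{SimAcq}_{i,j} = \frac{\left|\mathrm{Acquaintances}_i \cap \mathrm{Acquaintances}_j\right|}{\mathrm{Max}\left(\left|\mathrm{Acquaintances}_i\right|,\;\left|\mathrm{Acquaintances}_j\right|\right)} + \frac{1}{2}\cdot\frac{\left|\mathrm{Acquaintances}(\mathrm{Acquaintances}_i) \cap \mathrm{Acquaintances}(\mathrm{Acquaintances}_j)\right|}{\mathrm{Max}\left(\left|\mathrm{Acquaintances}(\mathrm{Acquaintances}_i)\right|,\;\left|\mathrm{Acquaintances}(\mathrm{Acquaintances}_j)\right|\right)} \tag{6}$$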

where ⋃_{X op Y} Z is the set of values of the attribute Z that belong to the incidents selected by the operation "X op Y". That is, "X op Y" selects the incidents in the dataset, and Z selects the attribute in those incidents. The similarity with respect to the time and place of the incidents becomes meaningful for finding a relation between incidents only if the incidents are close in both. So time and place are grouped as one perspective, SimPlaceTime, in Eq. 4-5. Since they are numeric, the Euclidean distance is used to define the similarity from the time and place perspective. SimCrime and SimCriminal are both categorical, so their similarity measures are based on set operations. Their formulas in Eq. 2-3 were empirically found to perform better than the Jaccard similarity. The similarity between two acquaintance sets is defined in Eq. 6. Taking acquaintances of acquaintances is a recursive process that requires a stopping criterion, here a depth limit. Thus, the set of all acquaintances reachable from the criminal can be considered the social network of the criminal under a depth constraint, here 2.
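To make the set-based and numeric components concrete, the following Python sketch shows the overlap form |A ∩ B| / Max(|A|, |B|) shared by Eq. 2, 3 and 6 next to the Jaccard similarity, the depth-2 acquaintance similarity of Eq. 6, and Eq. 4 read as a Euclidean distance. The function names, the dictionary-based incident layout, and the `acq_of` lookup table are hypothetical. The helpers take already-selected sets as input, so the sketch makes no claim about how the sets in Eq. 2-3 are selected.

```python
import math

def overlap_sim(a, b):
    """Set overlap normalized by the larger set: |a & b| / max(|a|, |b|)."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0  # assumption: an empty set shares nothing with any set
    return len(a & b) / max(len(a), len(b))

def jaccard_sim(a, b):
    """Jaccard similarity |a & b| / |a | b|, shown only for comparison."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def second_level(acq_of, people):
    """Acquaintances of acquaintances (depth 2).

    acq_of is a hypothetical lookup: person ID -> set of acquaintance IDs.
    """
    reached = set()
    for person in people:
        reached |= set(acq_of.get(person, ()))
    return reached

def sim_acq(acq_i, acq_j, acq_of):
    """Eq. 6: direct overlap plus half the overlap of the depth-2 sets."""
    direct = overlap_sim(acq_i, acq_j)
    indirect = overlap_sim(second_level(acq_of, acq_i), second_level(acq_of, acq_j))
    return direct + 0.5 * indirect

def place_time_distance(inc_i, inc_j):
    """Eq. 4 read as a Euclidean distance over (x, y, t)."""
    return math.sqrt((inc_i["x"] - inc_j["x"]) ** 2
                     + (inc_i["y"] - inc_j["y"]) ** 2
                     + (inc_i["t"] - inc_j["t"]) ** 2)

if __name__ == "__main__":
    print(overlap_sim({1, 2, 3}, {2, 3, 4, 5}))   # 2 shared / max(3, 4) = 0.5
    print(jaccard_sim({1, 2, 3}, {2, 3, 4, 5}))   # 2 shared / 5 in union = 0.4
    acq_of = {"a": {"x"}, "b": {"y"}, "c": {"y", "z"}}
    print(sim_acq({"a", "b"}, {"b", "c"}, acq_of))  # 0.5 + 0.5 * 0.5 = 0.75
```

The overlap form is never smaller than the Jaccard value for the same pair of sets, because Max(|A|, |B|) is at most |A ∪ B|.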

3.3. Performance Metric: Crime-Driven Causal Relation

Since the original cluster labels are assumed to be unavailable, measuring the clustering performance of the proposed method is a challenge. In this paper, we first show the performance of the proposed method empirically. Then, we provide a quantitative measurement, called Crime-Driven Causal Relation Performance. The causal relation defines performance from one important perspective, not from all perspectives. If we know that one perspective creates a true relation between incidents, any successful algorithm is expected not to violate the truths of this perspective. That is, a successful algorithm should not conflict with the true relation from a particular perspective. We define such a perspective from the causal relation between incidents. Let Incident_i = {Criminal_i, Crime_i, Place_i, Time_i, Acq_list_i}, where Criminal_i is the criminal ID of the i-th incident, Crime_i is the crime type of the i-th incident, and so on. Incident_i is causally related to Incident_j, denoted simply as i → j, if the following conditions hold:

1- Time_i < Time_j
2- Acq_list_i ∩ Criminal_j ≠ ∅

Incident_i is crime-driven causally related to Incident_j if the following additional condition holds:

3- {Crime | Crime is committed by Criminal_i before Time_j} ∩ Crime_j ≠ ∅

The Crime-Driven Causal Relation partitions the overall set into overlapping clusters, where each cluster contains pairs of causally related incidents. The overlapping sets are converted to disjoint sets by making simple conflict resolution assumptions such as "closer time is better to define a true perspective". For example, if both i → k and j → k exist, then i → k is kept if Time_k − Time_i < Time_k − Time_j, and j → k is kept otherwise. If the cluster labels generated by the Crime-Driven Causal Relation are used as the original cluster labels, then the labels produced by the proposed algorithm can be compared against them using the False-Alarm Rate (Yetgin & Gözükara 2014). False alarms are the incident pairs that the proposed algorithm clusters as related although they are not related according to the original cluster labels. A successful algorithm should make the False-Alarm Rate as small as possible, which in turn shows the degree to which the algorithm validates the truth of the Crime-Driven Causal Relation perspective.
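The relation and the comparison step can be sketched in Python as follows. Several points are assumptions rather than the paper's definitions: the index reading of conditions 2 and 3 (the criminal of the later incident appears in the acquaintance list of the earlier one, and the criminal of the earlier incident has already committed the later incident's crime type), the use of union-find to turn the retained links into disjoint reference clusters, and the denominator of the False-Alarm Rate (here, all pairs that are unrelated in the reference labels). All function and field names are hypothetical.

```python
from itertools import combinations

def causally_related(inc_i, inc_j):
    """Conditions 1-2 (assumed indices): i precedes j, and the criminal of j
    is in the acquaintance list of i."""
    return inc_i["time"] < inc_j["time"] and inc_j["criminal"] in inc_i["acq_list"]

def crime_driven(inc_i, inc_j, incidents):
    """Condition 3 (assumed indices): the criminal of i committed the crime
    type of j at some time before j."""
    earlier = {inc["crime"] for inc in incidents
               if inc["criminal"] == inc_i["criminal"] and inc["time"] < inc_j["time"]}
    return causally_related(inc_i, inc_j) and inc_j["crime"] in earlier

def reference_labels(incidents):
    """Keep only the closest-in-time cause of each incident, then label the
    connected groups (disjoint clusters) with union-find."""
    parent = list(range(len(incidents)))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for k, inc_k in enumerate(incidents):
        causes = [i for i, inc_i in enumerate(incidents)
                  if crime_driven(inc_i, inc_k, incidents)]
        if causes:
            # Conflict resolution: "closer time is better".
            best = min(causes, key=lambda i: inc_k["time"] - incidents[i]["time"])
            parent[find(k)] = find(best)
    return [find(k) for k in range(len(incidents))]

def false_alarm_rate(predicted, reference):
    """Share of reference-unrelated pairs that the algorithm put together."""
    false_alarms = unrelated = 0
    for a, b in combinations(range(len(reference)), 2):
        if reference[a] != reference[b]:
            unrelated += 1
            if predicted[a] == predicted[b]:
                false_alarms += 1
    return false_alarms / unrelated if unrelated else 0.0

if __name__ == "__main__":
    data = [
        {"time": 1, "criminal": "A", "crime": "theft", "acq_list": {"B"}},
        {"time": 2, "criminal": "B", "crime": "theft", "acq_list": {"C"}},
        {"time": 3, "criminal": "C", "crime": "theft", "acq_list": set()},
        {"time": 4, "criminal": "Z", "crime": "fraud", "acq_list": set()},
    ]
    ref = reference_labels(data)
    print(ref, false_alarm_rate([0, 0, 1, 1], ref))
```

In the demo data, the first three incidents form one chain of causes and the fourth stands alone, so a clustering that groups the third and fourth incidents produces one false alarm out of three unrelated pairs.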

4. Experimental Work

Three clustering experiments are performed using the parameters in Table 1. The user interface of the simulation system is shown in Figure 3. The user interface is divided into four panels: crime events, crime types, map/incident locations, and criminals. The crime-events panel sets the number of incidents and the time interval in which the incidents occur.

Table 1: Parameter values used in the experiments

Figure 3: User-interface of the simulation system

The map/incident location panel includes the user parameters that describe the generation of random maps and the incident distribution across the map. The map is created with the given map size under the constraint that the minimum distance between region centers conforms to the user-defined parameter. The incident locations are generated randomly using the given distribution type. In the experiments, the incident locations are distributed around the centers using a mixture of Gaussians. The criminals panel holds the user-defined parameters for criminal-related information, such as the number of criminals, their acquaintances, and the distribution of acquaintances around the localities. For demonstration purposes, we selected 300 incidents, which are distributed across the 5 regions shown in Figures 4-7. Figure 5 shows the distribution of the crime types over the geographical map. Figure 6 and Table 2 show the result of the clustering process, where only three clusters (cluster#1, cluster#12, and cluster#20) are shown for demonstration purposes. The clustering is obtained by applying a hierarchical clustering method with the proposed similarity metric for crime incidents. The hierarchical clustering used here is agglomerative. The idea of this method is to build a hierarchy of clusters, where each iterative step takes the two closest sub-clusters and merges them. This process usually continues until there is one large cluster containing all the vectors. We show the performance of the hierarchical clustering empirically in Table 3, where the related features are underlined for visualization. In Table 3, we consider only the incidents in Cluster#20. The results show that the proposed method groups related incidents well. The results also conform to the Crime-Driven Causal Relation, with 96% performance at worst, as shown in Table 4.
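The merge loop described above can be sketched in a few lines of Python that work directly on a pairwise similarity matrix such as Sim(i,j). The linkage rule is not stated here, so average linkage is an assumption, as is stopping at a requested number of clusters instead of merging down to a single cluster; the function name and its parameters are hypothetical.

```python
def agglomerative(sim, n_clusters):
    """Agglomerative clustering over a symmetric similarity matrix.

    sim[i][j] is the similarity between items i and j (higher = closer).
    Assumptions: average linkage, and merging stops at n_clusters.
    Returns one cluster label per item.
    """
    clusters = [[i] for i in range(len(sim))]   # start with singleton clusters

    def linkage(a, b):
        # Average pairwise similarity between two clusters.
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the two closest (most similar) clusters; q is always > p.
        p, q = max(
            ((p, q) for p in range(len(clusters)) for q in range(p + 1, len(clusters))),
            key=lambda pq: linkage(clusters[pq[0]], clusters[pq[1]]),
        )
        clusters[p] = clusters[p] + clusters[q]  # merge q into p
        del clusters[q]

    labels = [0] * len(sim)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

if __name__ == "__main__":
    sim = [
        [4.0, 3.5, 0.2, 0.1],
        [3.5, 4.0, 0.3, 0.2],
        [0.2, 0.3, 4.0, 3.0],
        [0.1, 0.2, 3.0, 4.0],
    ]
    print(agglomerative(sim, 2))   # items 0,1 in one cluster and 2,3 in the other
```

This naive loop recomputes every linkage at each step, which is adequate for a few hundred incidents; a library implementation would be preferable for larger datasets.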

Figure 4: Distribution of crime incidents over the geographical map.

Figure 5: Distribution of the crime types over the geographical map.

Figure 6: The results of clustering with 3 clusters demonstrated.

Table 2: Demonstration of incident-level data and their clusters

Figure 7: The results of clustering over GIS map with one cluster demonstrated.

Table 3: Demonstration of empirical analysis for clustering performance

Table 4: Causal Relation Performance of the Experiments

5. Conclusion

We developed a crime simulation system that generates incident-level crime data and supports query-based crime analysis on a GIS map. The proposed system is developed to provide incident-level data, which is hard to obtain in practice. Although the data generated by our tool is not fully realistic, it is sufficient for developing decision-making algorithms for crime analysis. With the resulting model, various GIS-related queries can be demonstrated on the GIS map to enable visual analysis of the incidents. Moreover, we provided clustering-based decision-making algorithms for crime analysis. With the model, incidents can be clustered from one or more related perspectives, and the clustering performance can be measured using the proposed performance metric.

References

Al-Janabi KBS, 2011. A Proposed framework for analyzing crime data set using decision tree and simple k-means mining algorithms. Journal of Kufa for Math. and Computer, volume 1, issue 3, pp. 8-24.

Corduas M, Piccolo D, 2008. Time series clustering and classification by the autoregressive metric. Computational Statistics & Data Analysis, volume 52, issue 4, pp. 1860-1872.

Enzmann D, Podana Z, 2010. Official crime statistics and survey data: Comparing trends of youth violence between 2000 and 2006 in cities of the Czech Republic, Germany, Poland, Russia, and Slovenia. European Journal on Criminal Policy and Research, volume 16, issue 3, pp. 191-205.

European Sourcebook of crime and criminal justice statistics. [WWW document; retrieved 2013]. URL http://www.europeansourcebook.org

Everitt B, 1993. Cluster Analysis. John Wiley & Sons, New York.

Gorr W, Harries R, 2003. Introduction to crime forecasting. International Journal of Forecasting (Special Section on Crime Forecasting), volume 19, issue 4, pp. 551-555.

Harries K, 1999. Mapping Crime: Principles and Practice. National Institute of Justice, Washington D.C.

Horváth R, Kolomazníková E, 2003. Individual Decision-Making to commit a crime: A Survey of early models. Finance a úvěr - Czech Journal of Economics and Finance, volume 53, issue 3-4, pp. 154-168.

Interpol. [WWW document; retrieved 2013]. URL http://www.interpol.int

Keogh E, Lin J, 2005. Clustering of time series subsequences is meaningless: Implications for past and future research. Journal of Knowledge and Information Systems, volume 8, issue 2, pp. 154-157.

Nath SV, 2006. Crime Pattern Detection using Data Mining. IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology WI-IATW'06, pp. 41-44.

North Carolina Department of Justice, View Crime Statistics. [WWW document; retrieved 2013]. URL http://www.ncdoj.gov/Crime/View-Crime-Statistics.aspx

Memon QA, Mehboob S, 2003. Crime investigation and analysis using neural nets. 7th International Multi Topic Conference, INMIC, pp. 346-350.

Poulsen E, Kennedy LW, 2004. Using dasymetric mapping for spatially aggregated. Journal of Quantitative Criminology, volume 20, issue 3, pp. 243-262.

Rossmo DK, 2000. Geographic profiling. Boca Raton, FL: CRC Press.

Singhal A, Seborg DE, 2005. Clustering multivariate time-series data. Journal of Chemometrics, volume 19, issue 8, pp. 427-438.

The City of Oakland's Crime Watch. [WWW document; retrieved 2013]. URL http://gismaps.oaklandnet.com/crimewatch/

UNODC, United Nations Office on Drugs and Crime. [WWW document; retrieved 2013]. URL http://www.unodc.org/unodc/en/crime_cicp_surveys.html

Vural MS, Gök M, Yetgin Z, 2013. Generating Incident-Level Artificial Data Using GIS Based Crime Simulation. ICECCO 2013, Ankara, pp. 243-246.

Wang X, Smith K, Hyndman R, 2006. Characteristic-Based Clustering For Time Series Data. Data Mining and Knowledge Discovery, volume 13, issue 3, pp. 335-364.

Yetgin Z, Gözükara F, 2014. New metrics for clustering of identical products over an imperfect data. Turkish Journal of Electrical Engineering & Computer Sciences, to be published. DOI: 10.3906/elk-1307-127.