Georgetown Data Analytics Project (Team DC) Capstone Paper

Identifying Types of DC Crime by Neighborhood and Time

December 13, 2014

Trooper Boyd, Douglas Byrne, and Noah Turner

Introduction/Problem

Washington DC, like any other major metropolitan area, is vulnerable to criminal activity. Although the Washington DC Metropolitan Police Department provides crime data on its website, the data is not necessarily user friendly for the average citizen. DC business owners and metro area residents are at greater risk of becoming victims of crime if they are unaware of the frequency and types of crime in their neighborhoods of interest.

- Imperfect information - Persons visiting or buying a home in an unfamiliar neighborhood face imperfect information regarding crime and its impact on safety and property values.

- A DC business owner who knows that armed robberies have increased in his neighborhood, particularly on the weekend, can take the necessary steps to lessen the chance that his business is robbed.

Our project aims to create a report identifying the types of crime that occur in specific Washington DC neighborhoods, along with when and how often they occur, so that residents can be better educated about crime in their areas. A longer-term goal, beyond the scope of this project, would be to use the aggregated data to create a web-based application so that residents can conduct their own research online.

It is also possible that the results of our project could help the Washington DC Metropolitan Police Department allocate resources. However, without knowing specifically how the department currently uses data analytics, we cannot presume our results would be useful to it.

Data Science Pipeline/Project Implementation

Architecture

To perform the analysis for our capstone project, we had to build an effective data science pipeline. The pipeline needed to take raw data and insert it into a relational database for storage and wrangling, allow us to perform statistical analysis on the data, and finally enable us to present our findings graphically. Building it required us to download, install, and configure a variety of software packages. To automate importing our raw data into the database, we developed modules in Python 2.7 and used the psycopg2 driver (DB API 2.0) to connect to the database. For data storage and wrangling, we chose the PostgreSQL 9.3.5 relational database. For data analysis, we used a variety of statistical software but relied primarily on Microsoft Excel, R, and SOFA. Finally, for data visualization, we used Tableau.
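The import step described above can be sketched as follows. This is a minimal illustration, not the team's actual module: it uses Python 3 and the standard-library sqlite3 driver as a stand-in for psycopg2 (both follow DB API 2.0), and the table name, column names, and sample rows are hypothetical.

```python
import csv
import io
import sqlite3  # stand-in for psycopg2; both follow DB API 2.0

# Hypothetical sample rows; the real column names in the MPD export may differ.
SAMPLE_CSV = """offense,report_date,block_x,block_y
THEFT/OTHER,01/05/2013 02:30:00 PM,397000.0,137000.0
ROBBERY,01/06/2013 11:15:00 PM,399500.5,136250.0
"""


def load_crimes(conn, csv_file):
    """Create a crimes table and bulk-insert every row read from csv_file."""
    cur = conn.cursor()
    cur.execute(
        "CREATE TABLE IF NOT EXISTS crimes ("
        "offense TEXT, report_date TEXT, block_x REAL, block_y REAL)"
    )
    reader = csv.DictReader(csv_file)
    rows = [
        (r["offense"], r["report_date"], float(r["block_x"]), float(r["block_y"]))
        for r in reader
    ]
    # sqlite3 uses "?" placeholders; psycopg2 would use "%s" instead.
    cur.executemany("INSERT INTO crimes VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
inserted = load_crimes(conn, io.StringIO(SAMPLE_CSV))
count = conn.execute("SELECT COUNT(*) FROM crimes").fetchone()[0]
print(inserted, count)
```

With PostgreSQL, the same structure would apply after swapping the connection call and the placeholder style; the parameterized insert keeps values out of the SQL string either way.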

Data Collection/Wrangling

The data we obtained from the DC Metropolitan Police Department website and the NeighborhoodInfoDC (US census data) website were CSV files. Although working with structured data files was beneficial, we still spent considerable time cleaning, normalizing, and aggregating the data using Python and SQL. Because the files were structured CSV, we found it easiest to use Google Drive to “warehouse” our original raw CSV data and SQL databases to wrangle it.

Wrangling consisted of creating new fields to break up the “long date” of each crime record and using SQL to join our tables so that the final analysis table included neighborhood names. We also ran into the problem of outdated latitude/longitude coordinates, which we updated using Python (specifically the pyproj and proj.4 libraries).
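As a rough sketch of the date-splitting step, the snippet below breaks one timestamp string into separate fields. The input format is an assumption, since the paper does not show the exact layout of the “long date”, and the WEEKDAY field is an illustrative extra rather than something the paper lists.

```python
from datetime import datetime

# Assumed input layout; the actual "long date" format in the MPD export is not given.
LONG_DATE_FORMAT = "%m/%d/%Y %I:%M:%S %p"

# Indexed by datetime.weekday(), where Monday is 0.
WEEKDAY_NAMES = (
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
)


def split_long_date(long_date):
    """Split one timestamp string into YEAR, MONTH, DAY, TIME, and WEEKDAY."""
    parsed = datetime.strptime(long_date, LONG_DATE_FORMAT)
    return {
        "YEAR": parsed.year,
        "MONTH": parsed.month,
        "DAY": parsed.day,
        "TIME": parsed.strftime("%H:%M"),  # 24-hour clock
        "WEEKDAY": WEEKDAY_NAMES[parsed.weekday()],
    }


fields = split_long_date("12/13/2014 11:45:00 PM")
print(fields)
```

Storing the parts as separate columns makes it straightforward to group incidents by month, hour, or day of the week in SQL or Tableau.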

Data Analysis/Methodologies

To better understand the data, the team used a range of techniques, from geographic plotting to statistical models. We used Tableau, R, SOFA, and Excel to explore the data visually and to run models such as regression and time series analysis. Tableau maps and charts gave us a better understanding of the data, the neighborhood clusters, and the level of crime within each. We ran linear regressions across 39 distinct clusters (2012 median property value against total crime per category). In these regressions, property value was the dependent variable and each type of crime was an independent variable, in an effort to determine whether the various types of crime had a significant “effect” on property value. We also used R to explore time series data on median home price and violent crime per 1,000 residents from 2000 to 2011. Finally, we compared regression coefficients across crime areas to assess the impact of violent crime events on property value and to test whether, as hypothesized, there is a diminishing marginal negative return to violent crime.
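To make the regression setup concrete, here is a minimal single-variable least-squares sketch with property value as the dependent variable and a crime count as the independent variable. The team ran its models in R, Excel, and SOFA; this Python version is only a stand-in, and the five data points are invented for illustration and are not results from the study.

```python
from math import sqrt


def simple_ols(x, y):
    """Fit y = intercept + slope * x by least squares; also return Pearson's r."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r


# Invented toy data: crime counts per cluster and median property value
# in thousands of dollars. These are NOT figures from the paper.
crime_counts = [10, 20, 30, 40, 50]
median_values = [520, 480, 410, 390, 300]

slope, intercept, r = simple_ols(crime_counts, median_values)
print(slope, intercept, r)
```

A negative slope here would mean that clusters with more incidents of that crime type tend to have lower median values. As with the study itself, the fit describes an association and does not establish cause.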

Maps are a useful way to research and evaluate crime activity in a particular city or area of the country, but they have limitations. For instance, maps provide very little insight into critical factors affecting crime rates, such as commercial business activity. Of particular interest are population density and the number of liquor licenses (as a proxy for commercial activity).

Regarding the statistical models implemented in our study, the linear regressions identified only two crime groups (theft, and theft from motor vehicle) that were weakly positively correlated with property value. Conversely, the other seven categories of crime activity reported strong negative correlations with property value. This conforms to our expectations about the impact of violent crime on property value. The correlation itself is fairly weak, but its sign is interesting.

Our research suggests that violent crime affects property values, and that the strength of the effect depends on the percentage change in the total number of violent crimes. For example, we posit that a 50% increase in violent crime (from 2 incidents to 3) has a far greater impact than an increase of one incident from 10 to 11. We also found that non-violent crime is weakly associated with increasing property values, which is unsurprising. This distorts the comparison of total crime to property value, because the incidence of non-violent theft is high relative to other crimes.

We expected that high-crime areas would see lower property value growth (not just lower property values as a starting point). As a public policy matter, if true, this would be another tax on persons living in low-income, high-crime areas. Asset value accumulation is an important part of breaking the cycle of poverty, and violent crime has a severely detrimental effect on the ability to accumulate wealth. However, high-crime areas do not show a trend of lower growth. The reason, which requires further analysis, may be that although crime is indeed higher, the relative crime rate in those neighborhoods is falling quickly, and potential homeowners are more influenced by this decline than by the overall level of crime. We posit that although crime is high, it was higher in the past, and some high-crime areas are playing “catch up” in terms of property values (a.k.a. gentrification).
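The comparison between a rise from 2 to 3 incidents and a rise from 10 to 11 can be checked with a short calculation. The helper name below is hypothetical; it simply expresses the same one-incident increase as a percentage of each starting count.

```python
def percent_change(before, after):
    """Return the change from before to after as a percentage of before."""
    return 100.0 * (after - before) / before


small_base = percent_change(2, 3)    # one extra incident on a base of 2
large_base = percent_change(10, 11)  # one extra incident on a base of 10
print(small_base, large_base)  # prints: 50.0 10.0
```

The absolute increase is identical in both cases, but the relative increase is five times larger in the low-crime neighborhood, which is the asymmetry the hypothesis relies on.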

Final Product

We created a Tableau mapping tool and an analytical presentation that provide an overview of DC crime data and property values, including charts showing DC crime data by neighborhood as well as by day of the week.

Challenges We Faced

We came across several problems during our project:

- Although we were using structured data (CSV files) we still had difficulty cleaning, normalizing and aggregating the data in order to merge the data files.

- Specifically, the “long date” which was provided in the DC Crime data raised issues with the software tools we were using. This was resolved by breaking out the long date into YEAR, MONTH, DAY, and TIME fields.

- Lack of knowledge of the tools we needed to use (PostgreSQL, Tableau)

- It wasn’t until we started using the data analytics and visualization tools that we recognized we still had some errors in our datasets, which required us to redo some of our work.

- Tableau didn’t recognize the latitude/longitude coordinates, which needed to be updated. We updated the coordinates using Python (pyproj and proj.4 libraries).

- Also, the DC Crime data did not contain neighborhood names or property values. Therefore, we used SQL to inner join the tables and populate these fields in the final dataset.
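The join mentioned in the last item can be sketched as below. The table and column names are hypothetical, the rows are invented, and sqlite3 stands in for PostgreSQL so that the example is self-contained; the SQL itself is standard.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema and invented rows; not the project's actual tables.
conn.executescript("""
CREATE TABLE crimes (offense TEXT, cluster_id INTEGER);
CREATE TABLE clusters (
    cluster_id INTEGER PRIMARY KEY, name TEXT, median_value REAL
);
INSERT INTO crimes VALUES
    ('ROBBERY', 1), ('THEFT/OTHER', 1), ('BURGLARY', 2), ('ARSON', 99);
INSERT INTO clusters VALUES
    (1, 'Cluster A', 400000), (2, 'Cluster B', 650000);
""")

# The inner join attaches a neighborhood name and property value to each crime,
# then the GROUP BY counts crimes per neighborhood.
rows = conn.execute("""
SELECT cl.name, cl.median_value, COUNT(*) AS crime_count
FROM crimes AS cr
INNER JOIN clusters AS cl ON cl.cluster_id = cr.cluster_id
GROUP BY cl.name, cl.median_value
ORDER BY cl.name
""").fetchall()
print(rows)
```

Note that an inner join silently drops any crime whose cluster has no match (the ARSON row with cluster 99 here), so comparing row counts before and after the join is a useful check.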

Conclusion

Original Problem

DC homebuyers and metro area residents are at greater risk of becoming victims of crime if they are unaware of the frequency and types of crime in their neighborhoods of interest.

Solution

By implementing effective data analytic techniques, we were able to expose important trends and relationships regarding crime activity in relation to the different neighborhoods and associated property values throughout the DC metro area. This information can provide important insight for business owners, potential homebuyers, as well as the DC Metro Police Department.

We are aware there are many factors that affect crime, which makes it inherently difficult to predict criminal activities. However, using even just a few factors in our data analysis still allowed us to create a product that consumers may find useful in evaluating crime data in their neighborhoods of interest.

Future Steps

If time had allowed, the team would have liked to perform additional research in the areas identified below:

- Drilling down to one number for the impact of violent crime on property value

- Decay rate of violent crime negative impact on property value growth rate - when does violent crime “wear off”?

- Quantifying negative wealth accumulation effect over time of violent crime

- Determining factors that precede this property value “catch up” - for potential investors and homebuyers.

- Create a web-based application which can graphically represent trends, patterns, and crime frequencies in near-real time.

References

Data Sources

Washington DC Metropolitan Police crime data (http://crimemap.dc.gov/): Data reflects crimes reported at least two business days before today's date. All statistics are subject to change for a variety of reasons, such as a change in classification, the determination that certain offense reports were unfounded, or late reporting.

Washington DC Census Data (http://www.neighborhoodinfodc.org): Neighborhood Info DC provides more detailed U.S. government census data specific to Washington DC

Tools Used: Access, Excel, Python, PostgreSQL, Tableau, R.

Github Repository: https://github.com/DCGT/Crime-Data-Project1

Glossary

DC Crime Terms

Arson: The malicious burning, or attempt to burn, any structure, vessel, vehicle, railroad car or property of another.

Assault Dangerous Weapon (ADW)-Aggravated Assault: Knowingly or purposely causing serious bodily injury, threatening to do so, or knowingly engaging in conduct that creates a grave risk of serious bodily injury to another person.

Burglary: The unlawful entry of a structure, vessel, watercraft, railroad car or yard where chattels are deposited with the intent to commit any criminal offense.

Homicide: Killing of another purposely, or otherwise, with malice aforethought.

Robbery: The taking of anything of value from another person by force, violence or fear.

Sex Abuse: Engaging in or causing another person to submit to a sexual act by force, threat or reasonable fear.

Motor Vehicle Theft: The theft of any self-propelled, motor driven vehicle that is primarily intended to transport persons and property on a highway.

Theft F/Auto: Theft of items from within a vehicle, excluding motor vehicle parts and accessories.

Theft/Other: A broad inclusion of Theft offenses including embezzlement, theft of services and fraud/false pretenses. The Theft/Other category excludes theft of items from a motor vehicle or the motor vehicle itself.