Crime Trends in Large and Medium Jurisdictions

By Aaron Margolis

This notebook updates a Mean Shift analysis I first performed in 2017 to examine recent increases in crime following the historic decrease of the late 1990s and early 2000s. For instance, was crime increasing everywhere, or only in certain cities, such as Baltimore? By grouping cities into clusters based on their crime patterns over time, we can see where crime is continuing to fall and where it is rising.

This notebook will look at crime rates in jurisdictions with over 250,000 people. These 131 jurisdictions account for approximately 30% of the US population. They are a mix of cities, urban suburbs, and suburban counties. Urban areas are over-represented, but there are enough lower-density jurisdictions to support the analysis.

This analysis could be expanded to smaller jurisdictions, especially with a more powerful backend. I incorporated TensorFlow 2.0 and its eager execution capability. Because there are only 600 columns for 131 jurisdictions, or 78,600 data points, this notebook uses CPUs rather than GPUs or TPUs. If more jurisdictions were incorporated, a more powerful backend could be added.

Results

Where crime was highest in the late 1990s, such as in New York and other large cities, it remains down considerably. But in smaller cities such as Anchorage and Wichita, crime today is much higher than it was 25 years ago, despite the large overall decrease. Most of these jurisdictions have also seen an increase in the last few years. The wide variation in crime trends across jurisdictions helps explain the differing perceptions of crime overall.

Methodology

We start by loading csv files that I created using an API on cloud.gov's Crime Data Explorer, which hosts FBI Uniform Crime Reporting data in a machine-readable format. ORI stands for Originating Agency Identifier, the code for the police department providing the data.

Each row includes a police department identifier (ORI), a year, a crime category, the number of offenses reported (actual), and the number cleared by arrest (cleared). This already introduces uncertainty about the true level of crime, because many crimes go unreported.
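A minimal sketch of the loading step follows; the file name crime_by_ori.csv and the column names ori, year, offense, actual, and cleared are assumptions about the export's layout.

```python
import pandas as pd

# Assumed layout: one row per (ORI, year, offense) with reported ("actual")
# and arrest ("cleared") counts.
crime = pd.read_csv("crime_by_ori.csv")
print(crime.head())
```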

We will use a pivot table to arrange the data along 4 axes (ORI, year, crime, actual vs. cleared).
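A sketch of that pivot, continuing with the assumed column names above. Putting year innermost keeps each offense's years adjacent in column order, which helps with interpolation later.

```python
# Rows: ORI. Columns: a (measure, offense, year) MultiIndex, i.e. the other
# three axes flattened into 2 x 12 x 25 = 600 columns.
pivot = crime.pivot_table(index="ori",
                          columns=["offense", "year"],
                          values=["actual", "cleared"])
```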

Next we look at the null values. Departments either report no values for a year or all 24, so the null counts should come in multiples of 24. Let's see which departments have the most nulls.
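One way to tally them, as a sketch over the pivot built above:

```python
# Total nulls per department; a fully missing year contributes 24 nulls
# (12 offense categories x actual/cleared).
null_counts = pivot.isnull().sum(axis=1)
print(null_counts.sort_values(ascending=False).head(10))
```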

We will remove the 5 police departments that have multiple missing years (Alaska State Troopers, Louisville, Raleigh, and two Long Island counties). These null values show the importance of using jurisdiction-level data rather than state data: Kentucky and North Carolina will show much lower crime rates in years when their major cities did not report. Even the one year in which Cincinnati did not provide data may affect analysis of Ohio crime data.

For departments with only one missing year, we will interpolate.
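A sketch of both steps, using the null counts computed above (24 nulls means one missing year; 48 or more means multiple missing years):

```python
# Keep departments missing at most one year, then linearly interpolate.
pivot = pivot.loc[null_counts < 48]
# With year as the innermost column level, interpolating across columns
# fills a single missing year from the same offense's adjacent years.
pivot = pivot.interpolate(axis=1)
```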

To ease comparison, we will look at crime rates per 100,000 people.
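A sketch of the conversion; ori_guide.csv and its population column are assumptions about the ORI guide file, which is also used later for the map.

```python
# Population by ORI from the (assumed) ORI guide file.
ori_guide = pd.read_csv("ori_guide.csv", index_col="ori")
population = ori_guide["population"]

# Counts -> rates per 100,000 residents.
rates = pivot.div(population, axis=0) * 100_000
```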

Before moving from a Pandas dataframe to TensorFlow, we need to reshape the data via NumPy to expose all 4 axes. We will print the last 24 values of the first row in both Pandas and NumPy to confirm the reshaping is correct.
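A sketch, assuming 2 measures, 12 offenses, and 25 years (600 columns):

```python
import numpy as np

# Axes: (jurisdiction, measure, offense, year).
arr = rates.to_numpy().reshape(len(rates), 2, 12, 25)

# Confirm the reshape preserved column order: the last 24 values of the
# first row should match in both representations.
print(rates.iloc[0, -24:].to_numpy())
print(arr[0].reshape(-1)[-24:])
```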

We import TensorFlow, convert the NumPy array, and then normalize it.
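A sketch of that step; per-column z-scoring is one reasonable reading of "normalize".

```python
import tensorflow as tf

# Flatten back to (jurisdictions, features) and z-score each column so no
# single offense dominates the distance computation.
X = tf.convert_to_tensor(arr.reshape(len(arr), -1), dtype=tf.float32)
X = (X - tf.reduce_mean(X, axis=0)) / tf.math.reduce_std(X, axis=0)
```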

Now we perform Mean Shift analysis in TensorFlow. The idea is to gradually shift each data point toward its neighbors until all points converge with their neighbors. This implementation uses a Gaussian kernel with a given "bandwidth" to weight nearer neighbors more heavily. We implement it in TensorFlow because the process is O(r²·c), where r is the number of rows and c is the number of columns. The cluster_step function returns both the new data and the squared magnitude of the change.
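A sketch of one such step under those definitions (the internal variable names are mine):

```python
def cluster_step(data, bandwidth):
    # Pairwise squared distances between all rows, shape (r, r);
    # broadcasting makes this the O(r^2 * c) part.
    sq_dists = tf.reduce_sum((data[:, None, :] - data[None, :, :]) ** 2, axis=-1)
    # Gaussian kernel: nearer neighbors get exponentially more weight.
    weights = tf.exp(-sq_dists / (2.0 * bandwidth ** 2))
    # Shift every point to the weighted mean of its neighbors.
    new_data = tf.matmul(weights, data) / tf.reduce_sum(weights, axis=1, keepdims=True)
    # Squared magnitude of the total shift, used as the convergence test.
    change = tf.reduce_sum((new_data - data) ** 2)
    return new_data, change
```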

We will keep clustering until the change is less than 0.01, which is also the value we use for the bandwidth. We will also time the loop.
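A sketch of the loop, using 0.01 for both the bandwidth and the stopping threshold as described above:

```python
import time

start = time.time()
shifted, change = X, tf.constant(1.0)
while change > 0.01:
    shifted, change = cluster_step(shifted, bandwidth=0.01)
print(f"Converged in {time.time() - start:.2f} seconds")
```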

Now that TensorFlow has done the math-intensive part, we use sklearn to label the points based on where their means have shifted.
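One way to do that labeling (a sketch, not necessarily the original approach): after convergence, points in the same cluster sit nearly on top of each other, so DBSCAN with a small radius groups them. The eps value here is an assumption.

```python
from sklearn.cluster import DBSCAN

# min_samples=1 ensures every jurisdiction receives a label.
labels = DBSCAN(eps=0.1, min_samples=1).fit_predict(shifted.numpy())
```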

We group the jurisdictions by creating a list of lists.
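A sketch, reusing the labels from above:

```python
# One list of ORIs per cluster label.
groups = [rates.index[labels == k].tolist() for k in sorted(set(labels))]
for k, members in enumerate(groups):
    print(k, len(members), members[:5])
```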

We will create a table showing how reported aggravated assaults have changed over time in each group. Aggravated assault is a relatively common crime, so it is a good indicator of overall trends. We will take each group's average rate in order to chart assaults over time.
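A sketch of the table; the offense label "aggravated-assault" is an assumption about the dataset's naming.

```python
# Reported (actual) aggravated-assault rates: rows = jurisdictions,
# columns = years.
assault = rates["actual"]["aggravated-assault"]
# Average rate per year within each cluster.
assault_by_group = assault.groupby(labels).mean()
```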

Now we will use Bokeh to create a chart. We immediately see one group (brown) where assaults were highest in the late 1990s but have fallen by about half over the past 25 years. We also see another group (bright red) where this crime started low but increased.
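A sketch of such a chart; the palette is arbitrary, so its colors will not necessarily match the brown and bright red groups described above.

```python
from bokeh.plotting import figure, show
from bokeh.palettes import Category10

p = figure(title="Aggravated assaults per 100,000 by cluster",
           x_axis_label="Year", y_axis_label="Rate per 100,000")
for k, row in assault_by_group.iterrows():
    p.line(assault_by_group.columns.astype(int), row.to_numpy(),
           legend_label=f"Group {k}", color=Category10[10][k % 10],
           line_width=2)
show(p)
```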

Now we'll create an interactive map of these jurisdictions, using the population and geographic data from the ORI guide, which comes from the Department of Justice's National Justice Information System. Some locations report their coordinates as whole-number latitude and longitude, without minutes or seconds, so they may appear slightly off on the map.

Using Bokeh, we create an interactive map in which each jurisdiction is represented by a circle. The area of each circle is proportional to the population, and the color of the outline shows the group it belongs to. You can hover over a circle to see the jurisdiction. The background map is taken from Google Maps.
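A sketch under the assumptions already made (an ori_guide frame with latitude, longitude, and population columns, plus a Google Maps API key supplied in place of GOOGLE_API_KEY):

```python
import numpy as np
from bokeh.models import ColumnDataSource, GMapOptions, HoverTool
from bokeh.palettes import Category10
from bokeh.plotting import gmap, show

guide = ori_guide.loc[rates.index]
source = ColumnDataSource(data=dict(
    lat=guide["latitude"],
    lon=guide["longitude"],
    name=rates.index,
    # Scale the radius by sqrt(population) so circle *area* tracks population.
    size=8 * np.sqrt(guide["population"] / guide["population"].min()),
    color=[Category10[10][k % 10] for k in labels],
))
options = GMapOptions(lat=39.5, lng=-98.4, map_type="roadmap", zoom=4)
m = gmap("GOOGLE_API_KEY", options, title="Jurisdictions by cluster")
m.circle(x="lon", y="lat", size="size", line_color="color",
         fill_color="color", fill_alpha=0.2, source=source)
m.add_tools(HoverTool(tooltips=[("Jurisdiction", "@name")]))
show(m)
```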

Conclusion

The stark contrast between the large cities, which saw decreases in crime over 25 years, and the smaller cities and less dense jurisdictions, which have seen an increase, is apparent from the chart and map above. Further research using more jurisdictions may shed even more light. The use of TensorFlow for Mean Shift analysis allows that scaling to be done at speed.