
englandcrimeassociations's Introduction

Association Rule Mining for reported street crimes in England & Wales

The aim is to see whether there are any associations between the reported aspects of street crime, such as month of year, location and crime type. The analysis is done in PySpark because of the size of the data, but it can still be run on a single machine in local mode.

The data can be downloaded from here: https://data.police.uk/data/.

The date range for this data is December 2010 - July 2019 and all constabularies in England & Wales were selected (we will be excluding British Transport Police and the Police Service of Northern Ireland).

What is an Association Rule?

Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using some measures of interestingness. This rule-based approach also generates new rules as it analyzes more data. The ultimate goal, assuming a large enough dataset, is to help a machine mimic the human brain’s feature extraction and abstract association capabilities from new uncategorized data.

We will be looking for rules with a high level of confidence.

Confidence is an indication of how often the rule has been found to be true. For a rule (A -> B), it can be interpreted as an estimate of the conditional probability P(B | A): the probability of seeing the consequent B given that the antecedent A is present.
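
To make the definition concrete, here is a minimal sketch on a handful of made-up transactions. The item values and the helper names support and confidence are illustrative assumptions, not taken from the project code. The confidence of a rule (A -> B) is computed as support(A and B) / support(A).

```python
# Toy transactions, invented purely for illustration.
transactions = [
    {"Burglary", "No suspect identified", "Jan"},
    {"Burglary", "No suspect identified", "Feb"},
    {"Burglary", "Offender given a caution", "Jan"},
    {"Drugs", "Offender given a caution", "Feb"},
]


def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


# 2 of the 3 burglary transactions have no suspect identified.
conf = confidence({"Burglary"}, {"No suspect identified"}, transactions)
print(round(conf, 3))
```

A rule is only as trustworthy as the number of transactions behind it, which is why a minimum support threshold is normally applied alongside the confidence threshold.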

import glob
import os
import pandas as pd
import matplotlib.pyplot as plt
import calendar
import seaborn as sns
%load_ext autoreload
%autoreload 2
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload

Set up Spark

Running Spark locally using 7 worker threads (local[7]) on an 8-core machine

from pyspark import SparkContext
from pyspark.sql import SQLContext, SparkSession
spark = SparkSession.builder\
        .master("local[7]")\
        .appName("Crime Associations")\
        .config("spark.executor.memory", "6g")\
        .config("spark.memory.fraction", 0.7)\
        .getOrCreate()
sc = spark.sparkContext
# Set up a SQL Context
sqlCtx = SQLContext(sc)
#sc.stop()

Load Data into Spark

from p01_load import load_data

The police data comes as several CSV files, with a folder for each Year-Month. Within each folder, there is a CSV file for each constabulary. We will concatenate these.

path = glob.glob(os.getcwd() + "/all_data/*/*-street.csv")
police_data_df = load_data(file_locations=path, sqlcontext=sqlCtx)
Loading CSV files to sqlcontext...
Load Complete
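
The folder layout described above can be illustrated with a small, self-contained sketch. The directory and file names below are invented to mimic that layout, and the sketch does not reproduce load_data itself (its source is not shown here). It only demonstrates which files a "*/*-street.csv" glob pattern picks up.

```python
import glob
import os
import tempfile

# Build a throwaway directory tree that mimics the layout:
# one folder per Year-Month, one CSV per constabulary (names are invented).
root = tempfile.mkdtemp()
layout = {
    "2012-08": ["2012-08-metropolitan-street.csv",
                "2012-08-kent-street.csv",
                "2012-08-kent-outcomes.csv"],
    "2012-09": ["2012-09-metropolitan-street.csv"],
}
for folder, files in layout.items():
    os.makedirs(os.path.join(root, folder))
    for name in files:
        open(os.path.join(root, folder, name), "w").close()

# Same pattern shape as in the notebook: every folder, street files only.
street_files = sorted(glob.glob(os.path.join(root, "*", "*-street.csv")))
for path in street_files:
    print(os.path.relpath(path, root))
```

Files that do not end in -street.csv are skipped, so other file types sitting in the same monthly folders would not be loaded.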

Inspecting the data

police_data_df.select(police_data_df.columns[1:]).show()
+-------+--------------------+--------------------+---------+---------+--------------------+---------+--------------------+--------------------+---------------------+-------+
|  Month|         Reported by|        Falls within|Longitude| Latitude|            Location|LSOA code|           LSOA name|          Crime type|Last outcome category|Context|
+-------+--------------------+--------------------+---------+---------+--------------------+---------+--------------------+--------------------+---------------------+-------+
|2012-08|Metropolitan Poli...|Metropolitan Poli...|-0.508053|50.809718|On or near Claigm...|E01031464|           Arun 007F|       Violent crime|  Under investigation|   null|
|2012-08|Metropolitan Poli...|Metropolitan Poli...| -1.01393|51.899297|On or near St Mic...|E01017673| Aylesbury Vale 010C|         Other crime|  Under investigation|   null|
|2012-08|Metropolitan Poli...|Metropolitan Poli...| 0.964612|52.045416|On or near Barnes...|E01029896|        Babergh 004E|       Violent crime|  Under investigation|   null|
|2012-08|Metropolitan Poli...|Metropolitan Poli...| 0.134947|51.588063|On or near Mead G...|E01000027|Barking and Dagen...|Anti-social behav...|                 null|   null|
|2012-08|Metropolitan Poli...|Metropolitan Poli...| 0.140634|51.583427|On or near Rams G...|E01000027|Barking and Dagen...|Anti-social behav...|                 null|   null|
|2012-08|Metropolitan Poli...|Metropolitan Poli...| 0.134947|51.588063|On or near Mead G...|E01000027|Barking and Dagen...|Anti-social behav...|                 null|   null|
|2012-08|Metropolitan Poli...|Metropolitan Poli...| 0.140634|51.583427|On or near Rams G...|E01000027|Barking and Dagen...|Anti-social behav...|                 null|   null|
|2012-08|Metropolitan Poli...|Metropolitan Poli...| 0.134947|51.588063|On or near Mead G...|E01000027|Barking and Dagen...|Anti-social behav...|                 null|   null|
|2012-08|Metropolitan Poli...|Metropolitan Poli...| 0.145888|51.593835|On or near Provid...|E01000027|Barking and Dagen...|Anti-social behav...|                 null|   null|
|2012-08|Metropolitan Poli...|Metropolitan Poli...| 0.141143|51.590873|On or near Furze ...|E01000027|Barking and Dagen...|Anti-social behav...|                 null|   null|
|2012-08|Metropolitan Poli...|Metropolitan Poli...| 0.140634|51.583427|On or near Rams G...|E01000027|Barking and Dagen...|Anti-social behav...|                 null|   null|
|2012-08|Metropolitan Poli...|Metropolitan Poli...| 0.140634|51.583427|On or near Rams G...|E01000027|Barking and Dagen...|Anti-social behav...|                 null|   null|
|2012-08|Metropolitan Poli...|Metropolitan Poli...| 0.134947|51.588063|On or near Mead G...|E01000027|Barking and Dagen...|Anti-social behav...|                 null|   null|
|2012-08|Metropolitan Poli...|Metropolitan Poli...| 0.134947|51.588063|On or near Mead G...|E01000027|Barking and Dagen...|Anti-social behav...|                 null|   null|
|2012-08|Metropolitan Poli...|Metropolitan Poli...| 0.140035|51.589112|On or near Beansl...|E01000027|Barking and Dagen...|Anti-social behav...|                 null|   null|
|2012-08|Metropolitan Poli...|Metropolitan Poli...| 0.134947|51.588063|On or near Mead G...|E01000027|Barking and Dagen...|Anti-social behav...|                 null|   null|
|2012-08|Metropolitan Poli...|Metropolitan Poli...| 0.137065|51.583672|On or near Police...|E01000027|Barking and Dagen...|Anti-social behav...|                 null|   null|
|2012-08|Metropolitan Poli...|Metropolitan Poli...| 0.137065|51.583672|On or near Police...|E01000027|Barking and Dagen...|Anti-social behav...|                 null|   null|
|2012-08|Metropolitan Poli...|Metropolitan Poli...| 0.135866|51.587336|On or near Gibbfi...|E01000027|Barking and Dagen...|Anti-social behav...|                 null|   null|
|2012-08|Metropolitan Poli...|Metropolitan Poli...| 0.134947|51.588063|On or near Mead G...|E01000027|Barking and Dagen...|Anti-social behav...|                 null|   null|
+-------+--------------------+--------------------+---------+---------+--------------------+---------+--------------------+--------------------+---------------------+-------+
only showing top 20 rows

Each dataset contains the following columns:

police_data_df.printSchema()
root
 |-- Crime ID: string (nullable = true)
 |-- Month: string (nullable = true)
 |-- Reported by: string (nullable = true)
 |-- Falls within: string (nullable = true)
 |-- Longitude: double (nullable = true)
 |-- Latitude: double (nullable = true)
 |-- Location: string (nullable = true)
 |-- LSOA code: string (nullable = true)
 |-- LSOA name: string (nullable = true)
 |-- Crime type: string (nullable = true)
 |-- Last outcome category: string (nullable = true)
 |-- Context: string (nullable = true)

The Data Dictionary is as follows

dictionary = pd.read_csv('data_dictionary.csv')
pd.set_option('display.max_colwidth', None)  # None disables truncation; -1 is deprecated since pandas 1.0
for elem in dictionary.to_records(index=False):
    print(elem[0] + ": " + elem[1])
Reported by: The force that provided the data about the crime.
Falls within: At present, also the force that provided the data about the crime. This is currently being looked into and is likely to change in the near future.
Longitude and Latitude: The anonymised coordinates of the crime. See Location Anonymisation for more information.
LSOA code and LSOA name: References to the Lower Layer Super Output Area that the anonymised point falls into, according to the LSOA boundaries provided by the Office for National Statistics.
Crime type: One of the crime types listed in the Police.UK FAQ.
Last outcome category: A reference to whichever of the outcomes associated with the crime occurred most recently. For example, this crime's 'Last outcome category' would be 'Formal action is not in the public interest'.
Context: A field provided for forces to provide additional human-readable data about individual crimes. Currently, for newly added CSVs, this is always empty.

NOTE: LSOA (Lower Layer Super Output Area)

From NHS Data Dictionary (https://www.datadictionary.nhs.uk/data_dictionary/nhs_business_definitions/l/lower_layer_super_output_area_de.asp?shownav=1)

"A Lower Layer Super Output Area (LSOA) is a GEOGRAPHIC AREA. Lower Layer Super Output Areas are a geographic hierarchy designed to improve the reporting of small area statistics in England and Wales. Lower Layer Super Output Areas are built from groups of contiguous Output Areas and have been automatically generated to be as consistent in population size as possible, and typically contain from four to six Output Areas. The Minimum population is 1000 and the mean is 1500. There is a Lower Layer Super Output Area for each POSTCODE in England and Wales"

How many Rows do we have?

num_rows = police_data_df.count()
num_rows
52835178

Cleaning the Data

from p02_clean import clean_months, clean_location, clean_non_england
# The month column in the data is actually a Year-Month, here we will split that on the - delimiter and create a Year and Month_of_Year Column
police_data_clean = clean_months(police_data_df)
# Now let's create a Location and Town/City column
police_data_clean = clean_location(police_data_clean)
police_data_clean = clean_non_england(police_data_clean)
Cleaning Year and Month Columns...
Creating Month_of_Year and Year columns...
Cleaning Complete
Cleaning Location and Town and City...
Cleaning Complete
Removing non England and Wales entries
Removal Complete
police_data_clean.printSchema()
root
 |-- Crime ID: string (nullable = true)
 |-- Month: string (nullable = true)
 |-- Reported by: string (nullable = true)
 |-- Falls within: string (nullable = true)
 |-- Longitude: double (nullable = true)
 |-- Latitude: double (nullable = true)
 |-- Location: string (nullable = true)
 |-- LSOA code: string (nullable = true)
 |-- LSOA name: string (nullable = true)
 |-- Crime type: string (nullable = true)
 |-- Last outcome category: string (nullable = true)
 |-- Context: string (nullable = true)
 |-- Year: integer (nullable = true)
 |-- Month_of_Year: string (nullable = true)
 |-- Town_City: string (nullable = true)
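
The source of p02_clean is not shown, so the following is only a plain-Python sketch of the kind of transformation the new columns suggest. It assumes that Month_of_Year is the three-letter month abbreviation (the item sets later contain values such as Aug) and that Town_City is the LSOA name with its trailing area code removed. The helper names split_month and town_from_lsoa are hypothetical.

```python
import calendar


def split_month(year_month):
    """Split a 'YYYY-MM' string into (year, month abbreviation)."""
    year, month = year_month.split("-")
    return int(year), calendar.month_abbr[int(month)]


def town_from_lsoa(lsoa_name):
    """Drop the trailing area code, e.g. 'Arun 007F' -> 'Arun'."""
    if lsoa_name is None:
        return None
    return lsoa_name.rsplit(" ", 1)[0]


print(split_month("2012-08"))                 # (2012, 'Aug')
print(town_from_lsoa("Aylesbury Vale 010C"))  # Aylesbury Vale
```

In Spark the same idea would be expressed with column functions rather than Python helpers, so that the work stays distributed across the 52 million rows.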

Exploratory Analysis

import pyspark.sql.functions as F
import p03_eda as eda

Are there any cases where the constabulary that reported the crime differs from the constabulary the crime falls within?

police_data_clean\
    .where(F.col("Reported by") != F.col("Falls within"))\
    .count()
0

Number of Incidents over time

plt.figure(figsize=(20, 5))
crime_over_time_plot = eda.plot_crime_time_series(police_data_clean, read=True)
plt.show()
Setting Month to a categorical variable...
Setting complete
Converting to Series object...
Conversion complete
Creating plot object
Complete.. Plotting...

[Figure: number of reported incidents over time]

The number of reported crimes looks roughly stationary with some periodicity, although not every year in this dataset is complete. This suggests there is a seasonal pattern in the level of street crime.

Most Common Crime and Outcome Category Combination

plt.figure(figsize=(10, 10))
outcome_category_plot = eda.plot_crime_type_and_category_counts(police_data_clean, read=True)
plt.show()
Converting to Series
Collapsing Multi Index
Plotting...

[Figure: counts of crime type and last outcome category combinations]

It seems the most common crime type and outcome combination is anti-social behaviour with no recorded outcome.

The Most Common Type of Crime

plt.figure(figsize=(10, 5))
crime_type_plot = eda.plot_crime_counts(police_data_clean, read=True)
plt.show()
Converting to Series object...
Plotting...

[Figure: counts by crime type]

Anti-social behaviour makes up about 35% of the reported crime in this dataset, which is expected. More concerning is that violence and sexual offences is in second place.

Which Town or City has the most crime?

plt.figure(figsize=(10,5))
crime_town_city_counts = eda.plot_crime_town_city_counts(police_data_clean, read=True)
plt.xticks(rotation=90)
plt.show()
Converting to Series
Plotting...

[Figure: crime counts by town or city]

Feature Engineering

Let's look at what the feature engineering code is actually doing

%%bash
cat p04_feature_engineer.py
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, StringType

clear_duplicates = F.udf(lambda x: list(set(x)), ArrayType(StringType()))

def get_modelling_data(df):
    
    select_cols = ["Falls within", "Town_City", "Crime type", "Last outcome category", "Month_of_Year"]
    
    # Remove the crimes with no Crime ID or no outcome category.
    # Then select the features of interest
    
    print('Filtering data with no Crime ID and no outcome category..')
    police_data_modelling = df\
        .filter(df["Crime ID"].isNotNull() & df["Last outcome category"].isNotNull())\
        .select(select_cols)
    print('Filtering complete')
    return police_data_modelling

def make_item_sets(df):
    # FP-growth needs each row's items collected into a single list/array column: one "transaction" per row.
    print('Making item sets...')
    print('Collapsing data to list of transactions')
    police_item_set = df.withColumn("items_temp", F.array(df["Falls within"],
                                                     df["Town_City"],
                                                     df["Crime type"],
                                                     df["Last outcome category"],
                                                     df["Month_of_Year"]))
    
    police_item_set = police_item_set.withColumn("items", clear_duplicates(police_item_set["items_temp"]))
    # Select the items column and id
    print('Adding increasing id column...')
    
    police_item_set = police_item_set\
        .select("items")\
        .withColumn("id", F.monotonically_increasing_id())
    print('Itemset creation complete')
    
    return police_item_set


def feature_engineer(df):
    """Invoke the full feature engineering pipeline"""
    print('Starting Feature Engineering pipeline...')
    selected_data = get_modelling_data(df)
    item_sets = make_item_sets(selected_data)
    print('Feature Engineering complet')
    return item_sets
from p04_feature_engineer import feature_engineer
# Drop crimes with no Crime ID or no outcome category, then build the item sets.
police_item_set = feature_engineer(police_data_clean)
Starting Feature Engineering pipeline...
Filtering data with no Crime ID and no outcome category..
Filtering complete
Making item sets...
Collapsing data to list of transactions
Adding increasing id column...
Itemset creation complete
Feature Engineering complet

The FP-growth algorithm used for association rule mining needs each record's items collected into a list/array, giving one "transaction" per row.

police_item_set.show(truncate=False)
+----------------------------------------------------------------------------------------------------------------------------------+
|items                                                                                                                             |
+----------------------------------------------------------------------------------------------------------------------------------+
|[Metropolitan Police Service, Aug, Arun, Violent crime, Under investigation]                                                      |
|[Other crime, Metropolitan Police Service, Aug, Aylesbury Vale, Under investigation]                                              |
|[Metropolitan Police Service, Aug, Violent crime, Under investigation, Babergh]                                                   |
|[Burglary, Barking and Dagenham, Metropolitan Police Service, Aug, Under investigation]                                           |
|[Barking and Dagenham, Metropolitan Police Service, Investigation complete; no suspect identified, Aug, Criminal damage and arson]|
|[Offender given a drugs possession warning, Drugs, Barking and Dagenham, Metropolitan Police Service, Aug]                        |
|[Barking and Dagenham, Metropolitan Police Service, Aug, Vehicle crime, Under investigation]                                      |
|[Barking and Dagenham, Metropolitan Police Service, Aug, Vehicle crime, Under investigation]                                      |
|[Barking and Dagenham, Metropolitan Police Service, Aug, Vehicle crime, Under investigation]                                      |
|[Barking and Dagenham, Metropolitan Police Service, Investigation complete; no suspect identified, Aug, Violent crime]            |
|[Offender sent to prison, Barking and Dagenham, Metropolitan Police Service, Aug, Violent crime]                                  |
|[Barking and Dagenham, Metropolitan Police Service, Aug, Violent crime, Under investigation]                                      |
|[Offender given a caution, Barking and Dagenham, Metropolitan Police Service, Aug, Violent crime]                                 |
|[Barking and Dagenham, Metropolitan Police Service, Aug, Violent crime, Under investigation]                                      |
|[Barking and Dagenham, Metropolitan Police Service, Aug, Violent crime, Under investigation]                                      |
|[Burglary, Barking and Dagenham, Metropolitan Police Service, Aug, Under investigation]                                           |
|[Burglary, Barking and Dagenham, Metropolitan Police Service, Aug, Under investigation]                                           |
|[Burglary, Barking and Dagenham, Metropolitan Police Service, Aug, Under investigation]                                           |
|[Burglary, Barking and Dagenham, Metropolitan Police Service, Aug, Under investigation]                                           |
|[Barking and Dagenham, Metropolitan Police Service, Aug, Vehicle crime, Under investigation]                                      |
+----------------------------------------------------------------------------------------------------------------------------------+
only showing top 20 rows
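
Two details of make_item_sets are worth spelling out. Spark's FP-growth implementation requires the items within a transaction to be unique, which is why duplicates are cleared. Also, list(set(x)) does not preserve order, which is probably why the same columns appear in different positions across the rows above. A plain-Python sketch with a made-up record follows; the values and the helper to_transaction are illustrative only.

```python
def to_transaction(record, columns):
    """Collect the chosen column values into a duplicate-free item list."""
    values = [record[c] for c in columns]
    # dict.fromkeys keeps the first occurrence of each value, in order;
    # list(set(values)) would also deduplicate but in arbitrary order.
    return list(dict.fromkeys(values))


columns = ["Falls within", "Town_City", "Crime type",
           "Last outcome category", "Month_of_Year"]

# Hypothetical record in which two columns happen to hold the same string.
record = {
    "Falls within": "Example Force",
    "Town_City": "Example Force",
    "Crime type": "Burglary",
    "Last outcome category": "Under investigation",
    "Month_of_Year": "Aug",
}

items = to_transaction(record, columns)
print(items)  # ['Example Force', 'Burglary', 'Under investigation', 'Aug']
```

Item order has no effect on the mined rules, so the unordered set-based UDF in the project gives the same result as an order-preserving version.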

Modelling: Create the FP growth algorithm

For Association rules

from p05_model import build_association_rule_model, extract_model_rules
# Use a low support as we have a large dataset
model = build_association_rule_model(police_item_set, min_support=0.01, min_confidence=0.6)
Fitting FPGrowth....
Fit Complete
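
build_association_rule_model lives in p05_model, which is not shown; its log output suggests it fits Spark's FPGrowth. To illustrate what the min_support and min_confidence thresholds do, here is a brute-force stand-in in plain Python. It is not FP-growth: it enumerates every candidate itemset, which only works on tiny data. The transactions are invented and the function name mine_rules is hypothetical. Like Spark's rule generation, it only emits rules with a single-item consequent.

```python
from itertools import combinations


def mine_rules(transactions, min_support, min_confidence):
    """Brute-force frequent itemsets, then single-consequent rules."""
    n = len(transactions)
    items = sorted(set().union(*transactions))

    # Support of every itemset that clears the min_support threshold.
    frequent = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            count = sum(set(candidate) <= t for t in transactions)
            if count / n >= min_support:
                frequent[frozenset(candidate)] = count / n

    rules = []
    for itemset, supp in frequent.items():
        if len(itemset) < 2:
            continue
        for consequent in itemset:
            # Every subset of a frequent itemset is itself frequent,
            # so the antecedent is guaranteed to be in `frequent`.
            antecedent = itemset - {consequent}
            conf = supp / frequent[antecedent]
            if conf >= min_confidence:
                rules.append((sorted(antecedent), consequent, conf))
    return rules


transactions = [
    {"Leeds", "West Yorkshire Police", "Burglary"},
    {"Leeds", "West Yorkshire Police", "Drugs"},
    {"Bradford", "West Yorkshire Police", "Burglary"},
    {"Bristol", "Avon and Somerset Constabulary", "Burglary"},
]

rules = mine_rules(transactions, min_support=0.5, min_confidence=0.6)
for antecedent, consequent, conf in rules:
    print(antecedent, "->", consequent, round(conf, 2))
```

With min_support=0.01, as used in the notebook, an itemset has to appear in at least 1% of the transactions before any rule can be built from it.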

Extract the Association Rules

rules_df_pd = extract_model_rules(model)
Extracting Rules...
Rule extraction complete
Collecting Rules to Pandas...
Collection Complete...
rules_df_pd
    antecedent                                         consequent                                     confidence  lift
0   Theft from the person                              Investigation complete; no suspect identified  0.601021    1.274382
1   Greater Manchester Police                          Investigation complete; no suspect identified  0.603717    1.280098
2   West Midlands Police                               Investigation complete; no suspect identified  0.606495    1.285988
3   Birmingham                                         Investigation complete; no suspect identified  0.607874    1.288914
4   Birmingham,West Midlands Police                    Investigation complete; no suspect identified  0.608291    1.289796
5   Unable to prosecute suspect                        Violence and sexual offences                   0.621860    2.453091
6   Criminal damage and arson                          Investigation complete; no suspect identified  0.641662    1.360556
7   Manchester                                         Investigation complete; no suspect identified  0.650860    1.380060
8   Manchester,Greater Manchester Police               Investigation complete; no suspect identified  0.651027    1.380413
9   Other theft                                        Investigation complete; no suspect identified  0.666711    1.413669
10  Vehicle crime                                      Investigation complete; no suspect identified  0.692018    1.467329
11  Burglary                                           Investigation complete; no suspect identified  0.711254    1.508117
12  Bicycle theft                                      Investigation complete; no suspect identified  0.719270    1.525113
13  Birmingham                                         West Midlands Police                           0.998434    20.240328
14  Birmingham,Investigation complete; no suspect ...  West Midlands Police                           0.999117    20.254189
15  Sheffield                                          South Yorkshire Police                         0.999125    35.791218
16  Leeds                                              West Yorkshire Police                          0.999156    18.917528
17  Westminster                                        Metropolitan Police Service                    0.999549    5.338211
18  Bradford                                           West Yorkshire Police                          0.999584    18.925635
19  Liverpool                                          Merseyside Police                              0.999654    38.381717
20  Manchester                                         Greater Manchester Police                      0.999672    16.812479
21  Bristol                                            Avon and Somerset Constabulary                 0.999688    34.073433
22  Manchester,Investigation complete; no suspect ...  Greater Manchester Police                      0.999928    16.816784
rules_df_pd.to_csv('crime_associations.csv')
# Stop the Spark Session
sc.stop()

Rules Analysis

As you can see, the rules with confidence above 98% don't really tell us anything: they essentially map a town or city to the force that polices it, e.g. (Birmingham -> West Midlands Police). Let's remove those from the analysis.

useful_rules_df = rules_df_pd[rules_df_pd['confidence'] < 0.98]\
    .sort_values(by="confidence", ascending = False)
useful_rules_df
    antecedent                                         consequent                                     confidence  lift
12  Bicycle theft                                      Investigation complete; no suspect identified  0.719270    1.525113
11  Burglary                                           Investigation complete; no suspect identified  0.711254    1.508117
10  Vehicle crime                                      Investigation complete; no suspect identified  0.692018    1.467329
9   Other theft                                        Investigation complete; no suspect identified  0.666711    1.413669
8   Manchester,Greater Manchester Police               Investigation complete; no suspect identified  0.651027    1.380413
7   Manchester                                         Investigation complete; no suspect identified  0.650860    1.380060
6   Criminal damage and arson                          Investigation complete; no suspect identified  0.641662    1.360556
5   Unable to prosecute suspect                        Violence and sexual offences                   0.621860    2.453091
4   Birmingham,West Midlands Police                    Investigation complete; no suspect identified  0.608291    1.289796
3   Birmingham                                         Investigation complete; no suspect identified  0.607874    1.288914
2   West Midlands Police                               Investigation complete; no suspect identified  0.606495    1.285988
1   Greater Manchester Police                          Investigation complete; no suspect identified  0.603717    1.280098
0   Theft from the person                              Investigation complete; no suspect identified  0.601021    1.274382

The rule with the highest confidence is now (Bicycle theft -> Investigation complete; no suspect identified). This means that, given that a crime is a bicycle theft, the estimated probability that the investigation is completed with no suspect identified is around 72%.

The next three crime-type rules, all above 65% confidence (conditional probability), follow a similar pattern:

  • (Other theft -> Investigation complete; no suspect identified)
  • (Burglary -> Investigation complete; no suspect identified)
  • (Vehicle Crime -> Investigation complete; no suspect identified)

So the estimated probability that no suspect is identified is about 67% after an other theft, 69% after a vehicle crime and 71% after a burglary. Criminal damage and arson follows the same pattern at about 64%.

Another interesting rule is (Manchester -> Investigation complete; no suspect identified). The model estimates that, given that a reported crime occurred in Manchester, the probability that the investigation is completed with no suspect identified is around 65%.

Another rule is (Unable to prosecute suspect -> Violence and sexual offences), which sounds worrying but doesn't really say much. The conditional probability of a crime being a violence and sexual offence, given that the suspect could not be prosecuted, is around 62%.
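
The lift column helps put these confidences in context. Lift is the rule's confidence divided by the overall frequency of the consequent, so dividing confidence by lift recovers that baseline frequency. A small sketch using values copied from the rules table (the variable names are illustrative):

```python
# (antecedent, consequent, confidence, lift) copied from the rules table.
rules = [
    ("Bicycle theft", "Investigation complete; no suspect identified",
     0.719270, 1.525113),
    ("Theft from the person", "Investigation complete; no suspect identified",
     0.601021, 1.274382),
    ("Unable to prosecute suspect", "Violence and sexual offences",
     0.621860, 2.453091),
]

# lift = confidence / P(consequent)  =>  P(consequent) = confidence / lift
baselines = {}
for antecedent, consequent, confidence, lift in rules:
    baselines[(antecedent, consequent)] = confidence / lift
    print(f"{antecedent} -> {consequent}: baseline {confidence / lift:.3f}")
```

Read this way, roughly 47% of all the crimes used for modelling end with no suspect identified, so the 72% figure for bicycle theft is about 1.5 times that baseline. Violence and sexual offences make up roughly 25% of the modelled crimes, which is why a 62% confidence corresponds to a lift of about 2.5.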

sc.stop()

englandcrimeassociations's People

Contributors

alistairlr112
