
orie5741_project's Introduction

ORIE5741_project

Yang Chang(yc2567)

Dongting Ma(dm892)

Project description:

Who's ready to leave? Criminal edition. Prisoners are often released on parole for good behaviour, but some immediately commit crimes and return to prison. The percentage of prisoners who return (within a short time window) is called the recidivism rate. Your goal is to help the government understand which inmates are likely to commit a subsequent crime, and which inmates are ready for parole. Your solution must not discriminate on the basis of a protected class, such as race, sex, or national origin.

Project goal:

Our goal is to analyze crime history data, identify a criminal defendant’s likelihood of becoming a recidivist, and help decide which inmates are ready for parole. We plan to answer the following questions.

orie5741_project's People

Contributors

yangchang0


orie5741_project's Issues

Midterm peer review

The project is about predicting the likelihood of re-offending and making recommendations about who is ready for parole. They use the COMPAS data set to make the prediction.

Advantages:

  1. They produced informative graphs describing the features of the data set, which helps address significant model-fitting problems such as collinearity.
  2. They made full use of the techniques learned in the course to handle categorical data and to fix missing data.
  3. They started with a subset of features rather than all of them, which is a good way to avoid overfitting and may help them refine their model.

Improvements:

  1. The description of the data set could be more comprehensive: it would help to explain the data structure and the features chosen for the model. How the features were selected is also a question I am interested in.
  2. The data-cleaning step could include more statistical description to make it more convincing; the percentage of dropped records and missing values should be presented in the report.
  3. It would be better to compare the random forest model against other models as well, which may lead to a better model choice.

Proposal Peer Review (ybg3)

This project is about analysing inmates on parole who reoffend within 2 years of release. The dataset being used is a GitHub dataset from a study on a similar idea. The objective of this project is to analyse different features to predict the probability of recidivism in order to inform parole decisions.

What I liked:

  • The aim of this project is very unconventional, and it uniquely aims to solve a problem with high importance to human life using data science.
  • The applications of the results of this project are multifarious, and if successfully executed, can be extended to behavioural predictive modeling.
  • The proposal does a good job of putting the project's importance into context and explaining the expected outcomes should the project succeed.

Areas for improvement:

  • The timeframe of the dataset is a bit vague: the years the data spans are not mentioned, and crime rates have changed drastically over time.
  • A brief explanation of COMPAS could be useful, even if it's a single line, since most reviewers will not know what this acronym means.
  • It is unclear how the features will be used to predict the data, or how they will fit into the model. The kind of model to be used could also be expanded upon a little in the proposal.

Peer review yp387

Who is ready to leave?
The project is developed to use algorithms to assess a criminal defendant’s likelihood of becoming a recidivist. Another goal of the project is to check whether the current COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) system has a strong bias against people of color and minorities. The project uses three datasets covering each prisoner’s demographics (sex, race, age, etc.), criminal history, jail and prison time, and COMPAS risk scores from Broward County. The research and analysis results of the project's model can be considered reasonable, reliable evidence to both support the prison system’s work and evaluate COMPAS.

Three advantages:

  1. In the project, the authors dropped over 15 features that had more than half of their values missing, as well as features with only one category, because these were not helpful for fitting the model. This data cleaning excludes uninformative features and prevents them from affecting the final result.
  2. Among the 30 features remaining after data cleaning, the authors used backward selection to select the best 15. For example, they found features that are unique to every observation, such as name and date of birth, and judged these variables not useful; features of this kind were dropped as well.
  3. They used three models for prediction and compared their results clearly. They also explained why the random forest model produced more precise results than the other two models.
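The feature-dropping step the reviewer describes can be sketched as follows. This is a minimal illustration using pandas; the 50% threshold and the helper name `drop_sparse_features` are assumptions for illustration, not the authors' actual code.

```python
import pandas as pd


def drop_sparse_features(df, max_missing_frac=0.5):
    """Drop columns whose fraction of missing values exceeds the threshold,
    plus columns with only one distinct value (uninformative for a model)."""
    keep = []
    for col in df.columns:
        missing_frac = df[col].isna().mean()
        n_unique = df[col].nunique(dropna=True)
        if missing_frac <= max_missing_frac and n_unique > 1:
            keep.append(col)
    return df[keep]
```

Backward selection would then iteratively remove the feature whose removal hurts validation performance least, stopping at the desired feature count.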

Three areas for improvement:

  1. Try more models. At present, the prediction quality is far from usable in real life; more complex models, such as deep learning models, could be helpful.
  2. Try different combinations of features. In the project, the authors used all cleaned features during training. How about trying different feature combinations and taking the average of these models' results as the final result?
  3. Add titles to the figures. This would help the reader grasp what each figure shows more quickly.

Peer Review - jc3472

In this project, the team is trying to predict whether a prisoner will commit crimes again after being released on parole. The dataset comes from GitHub. The motivation of this project is to build a model that can help reduce the high recidivism rate in the U.S.

What I like about this project:

  1. The project has great societal importance. If successful, it could help the government reduce both prison spending and the crime rate.
  2. The project has a decent number of data points.
  3. The team also states some detailed questions to be answered beyond the classification problem.

What may need improvement:

  1. The team did not give details on the origin of the dataset (e.g., who collected the data?). Understanding the origin of the data helps us judge its quality.
  2. The team did not describe the types of the features (numerical? set? text?).
  3. Certain acronyms need to be explained, e.g., what is the COMPAS score? Why is it useful, and why is it important for this project?

Peer review yp387

Who is ready to leave?

The project is developed to help prisons identify which prisoners can be allowed to leave because they are relatively low-risk. Another goal of the project is to check whether the current COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) system has a strong bias against people of color and minorities. The project uses three datasets covering each prisoner’s demographics (sex, race, age, etc.), criminal history, jail and prison time, and COMPAS risk scores from Broward County. The research and analysis results of the project's model can be considered reasonable, reliable evidence to both support the prison system’s work and evaluate COMPAS.

Three advantages:
(1) The dataset is preprocessed well. In this project, they not only remove invalid values such as “null” but also drop samples that cannot be considered COMPAS cases, along with unrelated features. A clean and reliable dataset is necessary for an accurate model.

(2) Some features are selected for the first few analyses rather than applying all features. Feature selection is helpful at the beginning of training, since it is not yet clear which features matter more and which less. Too many features can lead to overfitting, and training with that many features takes a long time.

(3) Quantitative results are used to show that bias exists in the COMPAS algorithm. In the report, they use the false-negative (FN) and false-positive (FP) rates of the COMPAS algorithm to check whether it is biased against people of color and minorities. This is compelling, direct evidence.
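The group-wise error-rate check described above can be sketched in plain Python. The function name `error_rates_by_group` and the toy labels are illustrative assumptions, not the report's actual code; the idea is simply to compare FP and FN rates across subgroups.

```python
def error_rates_by_group(y_true, y_pred, groups):
    """Compute false-positive and false-negative rates per subgroup.

    A large gap in FPR or FNR between subgroups is the kind of
    disparity the COMPAS critique is based on.
    """
    stats = {}
    for g in set(groups):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        fp = sum(1 for i in idx if y_pred[i] == 1 and y_true[i] == 0)
        fn = sum(1 for i in idx if y_pred[i] == 0 and y_true[i] == 1)
        neg = sum(1 for i in idx if y_true[i] == 0)
        pos = sum(1 for i in idx if y_true[i] == 1)
        stats[g] = {
            "fpr": fp / neg if neg else float("nan"),
            "fnr": fn / pos if pos else float("nan"),
        }
    return stats
```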

Three areas for improvement:
(1) Split the dataset into training and test sets. Training on every sample is time-consuming and carries a higher risk of overfitting. Why not randomly hold out some data points as a test set to validate the model after training?
(2) Try different combinations of features. In the project, ten features are selected to train the model, but other features may also influence the predicted result.
(3) The four histograms could be merged into one. It is easier to compare results across races when they are plotted in one chart with a common coordinate system.
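A simple random hold-out split, as suggested in point (1), can be sketched in plain Python. This is a minimal sketch; the function name `random_holdout` and the 20% test fraction are assumptions.

```python
import random


def random_holdout(rows, test_frac=0.2, seed=0):
    """Randomly hold out a fraction of the rows as a test set."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = int(len(rows) * test_frac)
    test = [rows[i] for i in indices[:n_test]]
    train = [rows[i] for i in indices[n_test:]]
    return train, test
```

The model is then fit on `train` only and scored once on `test`, so the reported accuracy reflects performance on unseen data.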

Peer review

The project trains a model by analyzing criminal history data to determine the likelihood of criminal defendants becoming habitual offenders. It also makes recommendations to determine who can be released on parole.

This is an interesting project. It has a very clear structure and a specific data-cleaning process. It also does a good job of data visualization: the different data charts strongly support the analysis. Drawing on multiple features also increased the robustness of the model.

The chart types could be more diversified. In addition, because the data is relatively sparse, the preprocessing and how the sparsity was resolved could be discussed in more detail, and some explanation of the methods and principles behind model parameter tuning could be added.

Midterm Report Peer Review - xl477

The project is about predicting the likelihood of a defendant being a recidivist in the future, with the hope of helping officers and the nation decide which inmates are ready for parole. The dataset they used is from COMPAS and contains two years' worth of data for Broward County. They tried to predict scores such as 'Risk of Recidivism' to identify the possibility of a defendant being a recidivist.

What I like about the report:

  1. They clearly outlined the steps they followed to generate results, from 1 to 6.
  2. It clearly explained how they dealt with missing values, and for each Exploratory Data Analysis, they included a figure with the explanation in words, which is easy to absorb.
  3. They clearly indicated the next steps, which are feature engineering and model optimization.

What I don't like about the report:

  1. It would be beneficial to include the source (link) of the dataset and a brief introduction to it, such as the years covered and the types of data included (e.g., demographics, criminal types, etc.).
  2. It would be beneficial to explain how they chose the features for training the model (e.g., why these features instead of others).
  3. It might be beneficial to add model selection to the next steps: include more classification models beyond trees (e.g., logistic regression), or apply ensemble methods (e.g., gradient boosting) to the tree model to pick the best performer.
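The model-selection step suggested above can be sketched with scikit-learn. The synthetic data and the particular model list are illustrative assumptions, not the team's actual setup; the point is simply to score several classifiers under the same cross-validation and pick the best.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Stand-in for the cleaned COMPAS feature matrix and recidivism labels.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Mean 5-fold cross-validated accuracy for each candidate model.
scores = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}
best = max(scores, key=scores.get)
```

Comparing models under identical folds keeps the comparison fair; the best performer is then refit on the full training set.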

Midterm review - zz274

The project aims to predict whether a criminal is likely to re-offend, helping the nation's probation and parole officers decide which inmates are ready for parole.

Things I like:

  1. They did a basic analysis to check the quality of the dataset and applied some preprocessing, such as dropping NULLs and encoding categorical data.
  2. They included some data visualizations to examine the correlations.
  3. They started with a relatively sparse model and then tried adding more features.
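The categorical-encoding step mentioned above can be sketched with pandas one-hot encoding. This is a minimal illustration; the column names are hypothetical stand-ins for the dataset's demographic fields.

```python
import pandas as pd

# Toy frame standing in for the cleaned prisoner data.
df = pd.DataFrame({
    "age": [25, 40, 33],
    "race": ["A", "B", "A"],
    "sex": ["M", "F", "M"],
})

# One-hot encode the categorical columns; drop_first avoids the
# redundant column that would introduce perfect collinearity.
encoded = pd.get_dummies(df, columns=["race", "sex"], drop_first=True)
```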

Things that can be improved:

  1. Beyond checking for missing data, they could check each feature for possible outliers using a histogram or boxplot. Removing outliers can help improve model performance.
  2. Identifying collinearity in the heatmap might be helpful. Also, from the heatmap, they could print the correlation coefficients and set a threshold for feature selection.
  3. They could explain more about how they intend to improve the score of the random forest classifier, such as grid search and parameter tuning.
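The grid-search tuning suggested above can be sketched with scikit-learn's `GridSearchCV`. The parameter grid and the synthetic data are illustrative assumptions, not the team's actual configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in for the cleaned feature matrix and recidivism labels.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# A small grid over two common random forest hyperparameters.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

# Exhaustively evaluate every combination with 3-fold cross-validation.
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
```

After fitting, `search.best_params_` holds the winning combination and `search.best_estimator_` is already refit on all of `X, y`.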

Midterm peer review

The project aims to train a model using crime history data, predict a criminal defendant’s likelihood of becoming a recidivist, and help decide which inmates are ready for parole. The dataset is two years’ worth of COMPAS scores from the Broward County Sheriff’s Office in Florida, together with criminal defendants’ data. The project could be applied to crime prediction in the U.S., and it is a topic that really interests me.

Things that I like very much:

  1. Very clear structure with subtitles and descriptions.
  2. Specific data cleaning process, especially explaining solutions to missing data.
  3. Good future plan including feature engineering and model optimization.

Things that can be improved:

  1. The data visualization graph types could be more diverse.
  2. Provide more specific information about the time-series analysis and the features shown in the heatmap in the data visualization part.
  3. Screenshots of code are not recommended; a description of the algorithm or model would be preferable.

Proposal peer review

This project analyzes criminal history data to determine the likelihood of criminal defendants becoming repeat offenders and to help determine which inmates are eligible for parole. They chose data from a study by Larson J. (https://github.com/propublica/compas-analysis/). They used COMPAS scores, the number of prior convictions, length of time in custody, and other characteristics to help predict whether an offender would recidivate, thereby reducing recidivism rates.
What I like:
1. This is an interesting project, because crime and social safety concern all of us; if crime rates really can be reduced through studies like this, that would be very beneficial.
2. The sample size and characteristics are very consistent with big, messy data.
3. The two research questions are very clear, and it is immediately apparent what they want to do.

Improvements:
1. I was expecting to see an explanation of the COMPAS scores, since I have no relevant background knowledge.
2. How valuable is this project for real-world problems? I think it still needs solid theoretical support.
3. Is the data sensitive? Does it need privacy protection?
