Critical Analysis Of Our Code

This section examines the source code behind the algorithm presented here. Although it technically takes in data and at some points produces reasonable results, the algorithm is deeply flawed. Breaking the code down reveals several clear mistakes and assumptions. They are displayed here in an obvious manner, but at scale and in real implementations it takes significant diligence to avoid bias.

Data

The first critical point of error is the data. The adage "garbage in, garbage out" captures one of the most fundamental principles of avoiding bias: if poorly generated, poorly collected, or biased data is fed in, the resulting algorithm inherits that bias.

            
# open_url comes from Pyodide/PyScript; in plain Python, pd.read_csv(url) alone suffices
from pyodide.http import open_url
import pandas as pd

# fetch the (relative) URL and read it into a DataFrame
url = "towns.csv"
towns = pd.read_csv(open_url(url))
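Before modelling, it helps to audit what was just loaded. The sketch below uses an invented stand-in for towns.csv (the column names are assumed from the analysis code; the values are purely illustrative) and runs two basic checks that make skewed or incomplete data easier to spot up front:

```python
import pandas as pd

# Hypothetical stand-in for towns.csv: the column names mirror those used
# later in the analysis, but every value here is invented for illustration.
towns = pd.DataFrame({
    "Town": ["A", "B", "C", "D"],
    "Crimes": [120, 45, 300, 80],
    "Population": [10000, 8000, 25000, 9000],
    "Ethnicity": [0.2, 0.6, 0.3, 0.5],
    "Policing needed": [3, 1, 5, 2],
})

# Count missing values per column and summarize the numeric columns;
# both are quick ways to notice gaps or heavy skew before training.
missing_counts = towns.isna().sum()
summary = towns.describe()

print(missing_counts)
print(summary)
```

These checks do not prove the data is unbiased, but skipping them guarantees that any garbage going in stays invisible.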

The following is the first five entries of the synthetic towns.csv data file:
[synthetic example table not reproduced here]

Feature Selection

The selection of features (variables) used for prediction is crucial. If relevant factors related to policing needs are omitted or underrepresented in the dataset, the predictions can be biased. In this case, the features of crimes, population, and ethnicity are too simplistic to construct a detailed understanding of the relationships between the data and policing needs.

            
# set up analysis variables: target and features
y = towns['Policing needed']
x = towns.drop(["Policing needed", "Town"], axis=1)
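One quick way to see how much weight each of these few features will carry is to check their correlation with the target. The sketch below uses invented data with the same assumed column names; a strong correlation between the target and a proxy variable such as ethnicity is exactly the kind of warning sign this section describes:

```python
import pandas as pd

# Hypothetical towns data with the columns the analysis assumes;
# all values are invented for illustration.
towns = pd.DataFrame({
    "Town": ["A", "B", "C", "D", "E"],
    "Crimes": [120, 45, 300, 80, 150],
    "Population": [10000, 8000, 25000, 9000, 12000],
    "Ethnicity": [0.2, 0.6, 0.3, 0.5, 0.4],
    "Policing needed": [3, 1, 5, 2, 3],
})

y = towns["Policing needed"]
x = towns.drop(["Policing needed", "Town"], axis=1)

# Pearson correlation of each feature with the target: with only three
# features, any single strong correlation will dominate the model.
correlations = x.corrwith(y)
print(correlations)
```

In this toy data, crime counts correlate heavily with the target, so the model would lean almost entirely on that one feature.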

Model Selection

Linear regression is a simple model that may not capture complex relationships in the data, and this choice can lead to biased predictions when the true underlying relationships are non-linear. In the case of predictive policing, they almost certainly are. Given only three features, a linear regression model is not the most effective choice for this situation, and this is a clear point where bias can slip into the model.

            
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# split variables into training and test sets
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.2, random_state=10)

# note: `global` has no effect at module level; lr is already global here
lr = LinearRegression()
lr.fit(X_train, Y_train)
y_lr_train_pred = lr.predict(X_train)
y_lr_test_pred = lr.predict(X_test)
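To make the underfitting concrete, the sketch below compares linear regression against a random forest (a swapped-in non-linear model, not part of the original code) on invented data where the target is a quadratic function of a single feature. The linear model cannot represent the curve at all:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Invented, clearly non-linear data: the target is the square of the feature.
rng = np.random.default_rng(10)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2

# Fit both models on the same data and score them where they were trained.
lr = LinearRegression().fit(X, y)
rf = RandomForestRegressor(random_state=10).fit(X, y)

lr_r2 = r2_score(y, lr.predict(X))
rf_r2 = r2_score(y, rf.predict(X))
print(f"linear R2: {lr_r2:.3f}, forest R2: {rf_r2:.3f}")
```

A non-linear model is not automatically less biased, but the gap in fit shows how much structure a linear model silently throws away.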

Error Testing

The following code is used for error analysis. Often the singular goal of an AI developer is to make the algorithm fit their already existing tests. This tunnel vision on a single definition of success can keep developers from truly analyzing their results through a lens of bias.

            
from sklearn.metrics import mean_squared_error, r2_score

# compute training and test error metrics
lr_train_mse = mean_squared_error(Y_train, y_lr_train_pred)
lr_train_r2 = r2_score(Y_train, y_lr_train_pred)

lr_test_mse = mean_squared_error(Y_test, y_lr_test_pred)
lr_test_r2 = r2_score(Y_test, y_lr_test_pred)

# collect the metrics in a one-row summary table
lr_results = pd.DataFrame(['linear regression', lr_train_mse, lr_train_r2, lr_test_mse, lr_test_r2]).transpose()
lr_results.columns = ['Method', 'Training MSE', 'Training R2', 'Test MSE', 'Test R2']
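Aggregate MSE and R2 are exactly the kind of single success metric that hides bias. A sketch of one alternative: break the error down per group. The groups, predictions, and targets below are all invented for illustration, with the model deliberately made worse for group B:

```python
import pandas as pd

# Invented predictions and targets for two hypothetical demographic groups;
# the point is the per-group breakdown, not the numbers themselves.
results = pd.DataFrame({
    "group": ["A"] * 4 + ["B"] * 4,
    "actual": [3, 1, 5, 2, 3, 2, 4, 1],
    "predicted": [3.1, 1.2, 4.8, 2.1, 4.0, 3.1, 5.2, 2.2],
})
results["sq_error"] = (results["actual"] - results["predicted"]) ** 2

# A single aggregate MSE can look acceptable while one group's
# error is an order of magnitude worse than another's.
per_group_mse = results.groupby("group")["sq_error"].mean()
print(per_group_mse)
```

Here the overall MSE averages away a large disparity: group B's error dwarfs group A's, which a single test score would never reveal.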
Daniel Blatner & Julian Mariscal