Critical Analysis Of Our Code

This section examines the source code behind the algorithm presented here. Although it technically takes in data and at some points produces reasonable results, the algorithm is deeply flawed. Breaking the code down reveals several clear mistakes and assumptions. They are displayed here in an obvious manner, but at scale and in real implementations it takes significant diligence to avoid bias.

Data

The first critical point of error is the data. The adage "garbage in, garbage out" captures one of the most fundamental principles of avoiding bias: if poorly generated, poorly collected, or biased data is fed in, the resulting algorithm inherits that bias.

            
# open_url comes from Pyodide/PyScript; in plain Python, pd.read_csv(url) alone suffices
from pyodide.http import open_url
import pandas as pd

# fetch the (relative) URL and read it into a DataFrame
url = "towns.csv"
towns = pd.read_csv(open_url(url))
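Before modelling, it helps to audit what was just loaded. The sketch below uses an invented stand-in for towns.csv (the column names are assumed from the analysis code; the values are purely illustrative) and runs two basic checks that make skewed or incomplete data easier to spot up front:

```python
import pandas as pd

# Hypothetical stand-in for towns.csv: the column names mirror those used
# later in the analysis, but every value here is invented for illustration.
towns = pd.DataFrame({
    "Town": ["A", "B", "C", "D"],
    "Crimes": [120, 45, 300, 80],
    "Population": [10000, 8000, 25000, 9000],
    "Ethnicity": [0.2, 0.6, 0.3, 0.5],
    "Policing needed": [3, 1, 5, 2],
})

# Count missing values per column and summarize the numeric columns;
# both are quick ways to notice gaps or heavy skew before training.
missing_counts = towns.isna().sum()
summary = towns.describe()

print(missing_counts)
print(summary)
```

These checks do not prove the data is unbiased, but skipping them guarantees that any garbage going in stays invisible.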

The following is the first five entries of the synthetic towns.csv data file:
[synthetic example table not reproduced here]

Feature Selection

The selection of features (variables) used for prediction is crucial. If relevant factors related to policing needs are omitted or underrepresented in the dataset, the predictions can be biased. In this case, the features of crimes, population, and ethnicity are too simplistic to construct a detailed understanding of the relationships between the data and policing needs.

            
# set up analysis variables: target and features
y = towns['Policing needed']
x = towns.drop(["Policing needed", "Town"], axis=1)
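One quick way to see how much weight each of these few features will carry is to check their correlation with the target. The sketch below uses invented data with the same assumed column names; a strong correlation between the target and a proxy variable such as ethnicity is exactly the kind of warning sign this section describes:

```python
import pandas as pd

# Hypothetical towns data with the columns the analysis assumes;
# all values are invented for illustration.
towns = pd.DataFrame({
    "Town": ["A", "B", "C", "D", "E"],
    "Crimes": [120, 45, 300, 80, 150],
    "Population": [10000, 8000, 25000, 9000, 12000],
    "Ethnicity": [0.2, 0.6, 0.3, 0.5, 0.4],
    "Policing needed": [3, 1, 5, 2, 3],
})

y = towns["Policing needed"]
x = towns.drop(["Policing needed", "Town"], axis=1)

# Pearson correlation of each feature with the target: with only three
# features, any single strong correlation will dominate the model.
correlations = x.corrwith(y)
print(correlations)
```

In this toy data, crime counts correlate heavily with the target, so the model would lean almost entirely on that one feature.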

Model Selection

Linear regression is a simple model that may not capture complex relationships in the data, and this choice can lead to biased predictions when the true underlying relationships are non-linear. In the case of predictive policing, they almost certainly are. Given only three features, a linear regression model is not the most effective choice for this situation, and this is a clear point where bias can slip into the model.

            
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# split variables into training and test sets
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.2, random_state=10)

# note: `global` has no effect at module level; lr is already global here
lr = LinearRegression()
lr.fit(X_train, Y_train)
y_lr_train_pred = lr.predict(X_train)
y_lr_test_pred = lr.predict(X_test)
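To make the underfitting concrete, the sketch below compares linear regression against a random forest (a swapped-in non-linear model, not part of the original code) on invented data where the target is a quadratic function of a single feature. The linear model cannot represent the curve at all:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Invented, clearly non-linear data: the target is the square of the feature.
rng = np.random.default_rng(10)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2

# Fit both models on the same data and score them where they were trained.
lr = LinearRegression().fit(X, y)
rf = RandomForestRegressor(random_state=10).fit(X, y)

lr_r2 = r2_score(y, lr.predict(X))
rf_r2 = r2_score(y, rf.predict(X))
print(f"linear R2: {lr_r2:.3f}, forest R2: {rf_r2:.3f}")
```

A non-linear model is not automatically less biased, but the gap in fit shows how much structure a linear model silently throws away.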

Error Testing

The following code is used for error analysis. Often the singular goal of an AI developer is to make the algorithm fit their already existing tests. This tunnel vision on a single definition of success can keep developers from truly analyzing their results through a lens of bias.

            
from sklearn.metrics import mean_squared_error, r2_score

# compute training and test error metrics
lr_train_mse = mean_squared_error(Y_train, y_lr_train_pred)
lr_train_r2 = r2_score(Y_train, y_lr_train_pred)

lr_test_mse = mean_squared_error(Y_test, y_lr_test_pred)
lr_test_r2 = r2_score(Y_test, y_lr_test_pred)

# collect the metrics in a one-row summary table
lr_results = pd.DataFrame(['linear regression', lr_train_mse, lr_train_r2, lr_test_mse, lr_test_r2]).transpose()
lr_results.columns = ['Method', 'Training MSE', 'Training R2', 'Test MSE', 'Test R2']
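Aggregate MSE and R2 are exactly the kind of single success metric that hides bias. A sketch of one alternative: break the error down per group. The groups, predictions, and targets below are all invented for illustration, with the model deliberately made worse for group B:

```python
import pandas as pd

# Invented predictions and targets for two hypothetical demographic groups;
# the point is the per-group breakdown, not the numbers themselves.
results = pd.DataFrame({
    "group": ["A"] * 4 + ["B"] * 4,
    "actual": [3, 1, 5, 2, 3, 2, 4, 1],
    "predicted": [3.1, 1.2, 4.8, 2.1, 4.0, 3.1, 5.2, 2.2],
})
results["sq_error"] = (results["actual"] - results["predicted"]) ** 2

# A single aggregate MSE can look acceptable while one group's
# error is an order of magnitude worse than another's.
per_group_mse = results.groupby("group")["sq_error"].mean()
print(per_group_mse)
```

Here the overall MSE averages away a large disparity: group B's error dwarfs group A's, which a single test score would never reveal.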
Daniel Blatner & Julian Mariscal