Fair machine learning Python package.
This Python package implements machine learning methods and functionality for evaluating machine learning models in terms of fairness. See the sections below for each module and its classes.
Module for running Bayesian networks
class latentLabelClassifier
Classifier that constructs a Bayesian network modelling the discrimination process. It assumes that the labels in the training dataset are biased and are generated from a probability distribution
$$ P(D | D_f, S) $$
where $D$ is the observed (possibly biased) label, $D_f$ is the fair latent label, and $S$ denotes the sensitive attributes. The joint distribution over these variables and the remaining features $X$ factorises as
$$ P(D, D_f, S, X) = P(D|D_f, S)P(X|D_f, S)P(D_f)P(S) $$
Prediction is done by estimating
$$ P(D_f | X, S) $$
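To make the prediction step concrete, here is a small self-contained sketch (all variables binary, all probability tables made up for illustration) that computes $P(D_f | X, S)$ by enumeration under the factorisation above. Since $D$ is summed out and $P(S)$ cancels in the normalisation, only $P(X | D_f, S)$ and $P(D_f)$ are needed:

```python
import numpy as np

# Illustrative (made-up) tables for binary D_f, S, X.
p_df = np.array([0.6, 0.4])                # P(D_f)
p_x_given = np.array([                     # P(X | D_f, S), indexed [d_f, s, x]
    [[0.7, 0.3], [0.5, 0.5]],              # D_f = 0, with S = 0 / S = 1
    [[0.2, 0.8], [0.4, 0.6]],              # D_f = 1, with S = 0 / S = 1
])

def posterior_df(x, s):
    # P(D_f | X=x, S=s) is proportional to P(X=x | D_f, S=s) * P(D_f).
    unnorm = p_x_given[:, s, x] * p_df
    return unnorm / unnorm.sum()
```

With these numbers, posterior_df(1, 0) returns the distribution [0.36, 0.64] over the fair label.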
latentLabelClassifier()
Description: Constructor for a new latentLabelClassifier
Parameters:
- df: Training dataset
- sensitives: List of sensitive attributes (must be in df)
- label: Dataset labels (must be in df)
- atol: Accepted tolerance for expectation maximization
- classes: Number of classes in the labels
fit()
Description: Constructs the Bayesian network with a fair latent variable.
Structure learning: Hill Climb Search
Parameter learning: Expectation Maximization (EM)
predict_probability()
Description: Predict and return probabilities for unobserved nodes.
Parameters:
- test: Test dataset (without data labels, with sensitive attributes)
predict()
Description: Predict and return labels for unobserved nodes.
Parameters:
- test: Test dataset (without data labels, with sensitive attributes)
load()
Description: Load a learned model from pickle file. See save() for file format.
Parameters:
- file: File path of learned model.
save()
Description: Save a learned model using pickle.
with open(file, "wb") as f:
    pickle.dump(self.model, f)
Parameters:
- file: File path of learned model.
check_model()
Description: Checks if the model is valid. Returns True or False.
Module for data preprocessing.
translate_categorical()
Description: Takes a pandas dataframe, translates all categorical attributes to numerical values, and returns the encoded dataframe.
Parameters:
- dataframe (pandas dataframe): Dataframe to translate
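A minimal sketch of such a translation (illustrative implementation using pandas category codes; the package's own encoding may differ):

```python
import pandas as pd

def translate_categorical(dataframe):
    # Replace every object/categorical column with integer category codes.
    df = dataframe.copy()
    for col in df.select_dtypes(include=["object", "category"]).columns:
        df[col] = df[col].astype("category").cat.codes
    return df
```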
extract_sensitive()
Description: Takes a pandas dataframe and extracts the sensitive attributes given in a list of attribute names.
Parameters:
- dataframe (pandas dataframe): Dataframe to extract attributes from
- attributes (list): List of sensitive attributes
encode_dummies()
Description: Dummy encodes the dataframe.
dummy = pd.get_dummies(df, prefix_sep=".", drop_first=True)
return dummy
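For example, pandas names the resulting columns with the "." separator, and drop_first=True drops the first category of each attribute:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red"]})
# The first category ("blue") is dropped, leaving one indicator column "color.red".
dummy = pd.get_dummies(df, prefix_sep=".", drop_first=True)
```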
Module for fairness evaluation
parity_score()
Description: Demographic parity is defined as
$$ P(\hat{Y} | S = 0) = P(\hat{Y} | S = 1) $$
where $\hat{Y}$ is the model prediction and $S$ is the sensitive attribute.
This can be generalised to the multiclass case with
$$ P(\hat{Y} | S = i) = P(\hat{Y} | S = j) \qquad i, j \in \{0, \dots, K-1\} $$
We want to condense the set of group-conditional probabilities
$$ L = \{ P(\hat{Y} | S = 0), \dots, P(\hat{Y} | S = K-1) \} $$
into a single metric between 0 and 1, and for that we use the function
$$ f = \frac{\text{geometric mean}(L)}{\text{mean}(L)} $$
Parameters:
- probabilities (list): list of sensitive conditional probabilities.
import numpy as np

def parity_score(probabilities):
    a = np.array(probabilities)
    return a.prod() ** (1.0 / len(a)) / a.mean()
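A quick worked example (the function is repeated so the snippet is self-contained; input values chosen for illustration). Equal group probabilities give a perfect score of 1, while unequal ones are penalised:

```python
import numpy as np

def parity_score(probabilities):
    a = np.array(probabilities)
    return a.prod() ** (1.0 / len(a)) / a.mean()

equal = parity_score([0.5, 0.5])   # geometric mean equals mean, score 1.0
skewed = parity_score([0.2, 0.8])  # geometric mean 0.4 over mean 0.5, score ~0.8
```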
fairness_report()
Description: Calculates fairness and performance metrics from test labels and predictions. Returns a dataframe of results.
Parameters:
- y (array): Dataset Labels.
- y_pred (array): Model predictions.
- sensitives (dataframe): Test dataset of sensitive attributes.
- model_name (string): Name of model.
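The sensitive conditional probabilities fed to parity_score can be estimated directly from predictions. A sketch of how (illustrative helper, not necessarily part of the package):

```python
import numpy as np

def group_probabilities(y_pred, sensitive):
    # Estimate P(Y_hat = 1 | S = s) for every sensitive group s.
    y_pred = np.asarray(y_pred)
    sensitive = np.asarray(sensitive)
    return [y_pred[sensitive == s].mean() for s in np.unique(sensitive)]
```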
Module for tree-based models.
Fair Decision Tree Classifier. The code for this class is borrowed from this GitHub repository. This decision tree evaluates candidate splits using the Splitting Criterion AUC for Fairness (SCAFF). See their paper for more details.
fit()
Description:
Trains the decision tree using the traditional algorithm of generating candidate splits, evaluating each split in terms of the chosen splitting criterion (SCAFF), and selecting the best one.
Parameters:
- X (pandas.DataFrame or np.array, any dimension): numerical/categorical features
- y (pandas.DataFrame or np.array, one-dimensional): binary labels only
- b (pandas.DataFrame or np.array, any dimension): sensitive (bias) attributes, treated as strings
predict_proba()
Description:
Predict the class of feature vectors. Predictions are calculated as probabilities of belonging to each class.
Parameters:
- X (pandas.DataFrame or np.array, any dimension): numerical/categorical features
predict()
Description: Predict the class of feature vectors. Predictions are returned as the class each feature vector is assigned to.
Parameters:
- X (pandas.DataFrame or np.array, any dimension): numerical/categorical features
Fair Random Forest Classifier. This classifier learns several decision trees using traditional random forest methods: each tree is trained on a bootstrapped dataset with a random subset of the features removed.
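The bootstrapping step can be sketched as follows (illustrative helper, not the repository's code): each tree receives rows sampled with replacement and a random subset of the columns.

```python
import numpy as np

def bootstrap_with_feature_subset(X, n_features, rng):
    # Draw len(X) rows with replacement and keep a random column subset.
    rows = rng.integers(0, len(X), size=len(X))
    cols = rng.choice(X.shape[1], size=n_features, replace=False)
    return X[np.ix_(rows, cols)], cols
```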
fit()
Description:
Trains the random forest using the traditional algorithm of bootstrapping datasets and removing features from the dataset.
Parameters:
- X (pandas.DataFrame or np.array, any dimension): numerical/categorical features
- y (pandas.DataFrame or np.array, one-dimensional): binary labels only
- b (pandas.DataFrame or np.array, any dimension): sensitive (bias) attributes, treated as strings
predict_proba()
Description:
Predict the class of feature vectors. Predictions are calculated as probabilities of belonging to each class.
Parameters:
- X (pandas.DataFrame or np.array, any dimension): numerical/categorical features
predict()
Description: Predict the class of feature vectors. Predictions are returned as the class each feature vector is assigned to.
Parameters:
- X (pandas.DataFrame or np.array, any dimension): numerical/categorical features
Module for dataset generation
datasetgen_numerical()
Description:
Creates a dataset with two sensitive columns (Race and Gender) as well as four Gaussian features.
Parameters:
- n_samples (integer): Number of datapoints in dataset.
- informative (bool): Whether the generated features are informative (True or False).
- seperability (float): Parameter for separating the sensitive classes.
Returns:
- df: Pandas DataFrame.
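A sketch of how such a generator can look (illustrative implementation under the parameter descriptions above; the package's actual distributions may differ):

```python
import numpy as np
import pandas as pd

def datasetgen_numerical(n_samples, informative=True, seperability=1.0, seed=0):
    rng = np.random.default_rng(seed)
    race = rng.integers(0, 2, size=n_samples)    # sensitive column "Race"
    gender = rng.integers(0, 2, size=n_samples)  # sensitive column "Gender"
    # Four Gaussian features; when informative, the sensitive class
    # shifts the feature means by `seperability`.
    shift = seperability * race[:, None] if informative else 0.0
    features = rng.normal(size=(n_samples, 4)) + shift
    df = pd.DataFrame(features, columns=[f"x{i}" for i in range(4)])
    df["Race"] = race
    df["Gender"] = gender
    return df
```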