Welcome to the documentation of flat classification of nodes!¶
Written by Miguel Romero
Last update: 07/07/21
Classification of nodes with structural properties¶
This package evaluates whether the structural (topological) properties of a network are useful for predicting node attributes (i.e., node classification). It combines several machine learning techniques, such as XGBoost and the SMOTE sampling technique.
Installation¶
The xgbfnc package can be installed using pip; the requirements are installed automatically:
python3 -m pip install xgbfnc
The source code and examples can be found in the GitHub repository.
Example¶
The example illustrates how the algorithm can be used to check whether the structural properties of the gene co-expression network improve the performance of the prediction of gene functions for rice (Oryza sativa Japonica). In this example, a gene co-expression network gathered from ATTED II is used.
How to run the example?¶
The complete source code of the example can be found in the GitHub repository. First, the xgbfnc package needs to be imported:
from xgbfnc import xgbfnc
from xgbfnc import data
After creating the adjacency matrix adj for the network, the structural
properties are computed using the data module of the package:
df, strc_cols = data.compute_strc_prop(adj)
This method returns a DataFrame with the structural properties of the network and a list of the names of these properties (i.e., column names). After adding the additional features of the network to the DataFrame, the xgbfnc module is used to instantiate the XGBfnc class:
test = xgbfnc.XGBfnc()
test.load_data(df, strc_cols, y, term, output_path='output')
ans, pred, params = test.structural_test()
The data of the network is loaded using the load_data method, and the
structural test is executed using the structural_test method. The test
returns a boolean value indicating whether the structural properties
help to improve the prediction performance, the prediction of the model
that includes the structural properties, and the best parameters of that model.
To run the example, execute the following commands:
cd test/
python3 test_small.py
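The walkthrough above assumes that an adjacency matrix adj already exists. As a minimal sketch, assuming the network is available as an undirected edge list over nodes numbered from 0, a square and symmetric 0/1 matrix could be built as follows. The helper names build_adjacency and is_symmetric and the toy edges are hypothetical and not part of xgbfnc. Plain lists keep the sketch dependency-free; since the reference below lists the expected type as np.matrix[int], a conversion would likely be needed before calling compute_strc_prop.

```python
def build_adjacency(n_nodes, edges):
    """Return a symmetric 0/1 adjacency matrix (list of lists) for an undirected graph."""
    adj = [[0] * n_nodes for _ in range(n_nodes)]
    for u, v in edges:
        if u == v:
            continue  # self-loops are skipped in this sketch
        adj[u][v] = 1
        adj[v][u] = 1  # mirror the edge so the matrix stays symmetric
    return adj


def is_symmetric(adj):
    """Check that the matrix is square and equal to its transpose."""
    n = len(adj)
    return all(len(row) == n for row in adj) and all(
        adj[i][j] == adj[j][i] for i in range(n) for j in range(n)
    )


# Toy co-expression network with four genes (hypothetical data).
edges = [(0, 1), (1, 2), (2, 3), (0, 2)]
adj = build_adjacency(4, edges)
```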
XGBfnc package¶
xgbfnc.xgbfnc module¶
Module for flat node classification and testing the importance of the structural properties of the network.
- class xgbfnc.xgbfnc.XGBfnc¶
Bases: object
Class for flat node classification. This class builds two XGBoost binary classifiers for the attribute prediction using two datasets with different features: one including the structural properties of the network and the other without them.
- Variables
df (DataFrame) – Dataset with all features of the network.
orig_cols (List[string]) – List of feature names (columns of df) not related to structural properties of the network.
strc_cols (List[string]) – List of feature names (columns of df) related to structural properties of the network. The intersection between orig_cols and strc_cols must be empty.
y (Series) – Series representing the node attribute to be predicted.
ylabel (string) – Name of the node attribute to be predicted.
output_path (string) – Path where the output of the algorithm will be stored.
figs_path (string) – Path where the figures will be stored.
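The requirement that orig_cols and strc_cols do not overlap can be illustrated with a small sketch. It assumes that orig_cols is simply every column of df that is not listed in strc_cols; that derivation, the helper name split_columns, and the column names are illustrative assumptions rather than documented behavior of the package.

```python
def split_columns(all_cols, strc_cols):
    """Split column names into non-structural and structural groups."""
    missing = [c for c in strc_cols if c not in all_cols]
    if missing:
        raise ValueError(f"structural columns not in dataset: {missing}")
    strc = set(strc_cols)
    orig_cols = [c for c in all_cols if c not in strc]  # keeps the original order
    # The two groups must not share any column.
    assert set(orig_cols).isdisjoint(strc)
    return orig_cols, list(strc_cols)


# Hypothetical column names, for illustration only.
all_cols = ["expr_leaf", "expr_root", "degree", "clustering", "emb_0"]
orig_cols, strc_cols = split_columns(all_cols, ["degree", "clustering", "emb_0"])
```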
- compare_plots(a, b, labels=['without', 'with'])¶
Plot the ROC curve, precision-recall curve and confusion matrices for the predictions of both models, i.e., without and with structural properties.
- Parameters
a (np.array[float]) – Predicted probabilities for the model without the structural properties of the network.
b (np.array[float]) – Predicted probabilities for the model including the structural properties of the network.
labels (List[string]) – Labels of both models for the plots.
- create_classifier(n_iter=5, n_jobs_cv=None, n_jobs_xgb=2, eval_metric='aucpr', scoring='recall', seed=None)¶
Build the binary classifier within a hyper-parameter tuning model.
- Parameters
n_iter (int) – Number of iterations in cross validation for hyper-parameters tuning, defaults to 5
n_jobs_cv (int) – Number of parallel jobs running for hyper-parameters tuning, defaults to None
n_jobs_xgb (int) – Number of parallel jobs running for training the classifier, defaults to 2
eval_metric (string) – Evaluation metric for training the classifier, defaults to “aucpr”
scoring (string) – Scoring metric for hyper-parameters tuning, defaults to “recall”
seed (int) – Random number seed, defaults to None
- Returns
Hyper-parameter tuning model with an XGBoost binary classifier
- Return type
RandomizedSearchCV
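RandomizedSearchCV evaluates a fixed number of randomly sampled parameter combinations instead of the full grid. The idea can be sketched without any dependencies as shown here. The parameter grid, the toy scoring function, and the helper name random_search are hypothetical; the sketch only mimics the sampling step, not the cross-validation that RandomizedSearchCV performs, and it says nothing about the grid that xgbfnc actually searches.

```python
import random


def random_search(param_grid, score_fn, n_iter=5, seed=None):
    """Sample n_iter parameter combinations and return the best one with its score."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        # Draw one value per hyper-parameter (with replacement, for simplicity).
        params = {name: rng.choice(values) for name, values in param_grid.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Hypothetical score that stands in for a cross-validated metric."""
    return -abs(params["max_depth"] - 5) - abs(params["learning_rate"] - 0.1)


# Hypothetical grid, for illustration only.
grid = {"max_depth": [3, 5, 7], "learning_rate": [0.01, 0.1, 0.3]}
best_params, best_score = random_search(grid, toy_score, n_iter=5, seed=0)
```

Fixing the seed makes the sampled combinations, and therefore the selected parameters, reproducible across runs.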
- create_path(path)¶
Create a path.
- Parameters
path (string) – Relative path to be created.
- Raises
OSError – the path already exists
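A minimal sketch of the documented behavior, assuming create_path is a thin wrapper around os.makedirs (an assumption; the actual implementation is not shown in this reference). os.makedirs raises FileExistsError, a subclass of OSError, when the directory already exists.

```python
import os
import tempfile


def create_path(path):
    """Create the directory `path`; raise OSError if it already exists."""
    os.makedirs(path)  # exist_ok defaults to False, so an existing path raises


# Demonstrate inside a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as base:
    target = os.path.join(base, "output", "figs")
    create_path(target)
    created = os.path.isdir(target)
    try:
        create_path(target)  # second call hits the existing directory
        raised = False
    except OSError:
        raised = True
```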
- evaluate(y_orig, y_pred_prob)¶
Evaluate the performance of the prediction using metrics such as the ROC AUC, average precision score, precision, recall and F1 score.
- Parameters
y_orig (np.array[int]) – Truth values of the prediction.
y_pred_prob (np.array[float]) – Predicted probabilities with the XGBoost classifier.
- Returns
Evaluation metrics for the prediction.
- Return type
dict[string->float]
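As an illustration of three of the metrics named above, the following sketch computes precision, recall and F1 score from true labels and predicted probabilities. The 0.5 threshold, the dictionary keys, and the helper name binary_metrics are assumptions made for the example and may differ from what evaluate actually uses.

```python
def binary_metrics(y_true, y_prob, threshold=0.5):
    """Precision, recall and F1 for probabilities cut at `threshold`."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy labels and probabilities (hypothetical data).
y_true = [1, 0, 1, 1, 0, 0]
y_prob = [0.9, 0.6, 0.7, 0.4, 0.2, 0.1]
scores = binary_metrics(y_true, y_prob)
```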
- load_data(df, strc_cols, y, ylabel, output_path=None, figs_path=None)¶
Load the data of the network.
- Parameters
df (DataFrame) – Dataset with all node features.
strc_cols (List[string]) – List of features related to structural properties.
y (Series) – Node attribute to be predicted. Should be the same size as the df.
ylabel (string) – Name of the node attribute to be predicted.
output_path (string) – Path to save output, defaults to “YYYY-MM-DD/”.
figs_path (string) – Path to save figs, defaults to “YYYY-MM-DD/”.
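The default paths are documented as "YYYY-MM-DD/". Assuming this pattern refers to the current date (the reference does not say so explicitly), such a directory name could be produced as in the sketch below; the helper name default_output_path is hypothetical.

```python
from datetime import date


def default_output_path(today=None):
    """Return a date-stamped directory name such as '2021-07-07/'."""
    today = today or date.today()
    return today.strftime("%Y-%m-%d") + "/"


# A fixed date keeps the example reproducible.
example = default_output_path(date(2021, 7, 7))
```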
- opt_threshold(y_orig, y_pred)¶
Compute the classification from probabilities based on the optimum threshold according to the precision-recall curve, that is, the threshold that maximizes the F1 score.
- Parameters
y_orig (np.array[int]) – Truth values of the prediction.
y_pred (np.array[float]) – Predicted probabilities with the XGBoost classifier.
- Returns
Classification for the input array of probabilities that maximizes the F1 score.
- Return type
np.array[int]
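The threshold selection described above can be sketched in plain Python: every distinct predicted probability is tried as a cut-off, and the one with the highest F1 score is kept. This is an illustrative sketch with hypothetical helper names (f1_at, labels_at_best_f1), not the package's implementation, which may derive the candidates from the precision-recall curve differently.

```python
def f1_at(y_true, y_prob, threshold):
    """F1 score when probabilities >= threshold are labelled as positive."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def labels_at_best_f1(y_true, y_prob):
    """Return (labels, threshold) for the cut-off with the highest F1 score."""
    best_thr, best_f1 = 1.0, -1.0
    for thr in sorted(set(y_prob)):
        f1 = f1_at(y_true, y_prob, thr)
        if f1 > best_f1:  # ties keep the lowest threshold seen first
            best_thr, best_f1 = thr, f1
    labels = [1 if p >= best_thr else 0 for p in y_prob]
    return labels, best_thr


# Toy labels and probabilities (hypothetical data).
y_true = [1, 0, 1, 1, 0, 0]
y_prob = [0.9, 0.6, 0.7, 0.4, 0.2, 0.1]
labels, threshold = labels_at_best_f1(y_true, y_prob)
```

On this toy input, the cut-off at 0.4 recovers all three positives at the cost of one false positive, which gives the highest F1 among the candidates.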
- plot_performance(a, label)¶
Plot the ROC curve, precision-recall curve and confusion matrices for a prediction.
- Parameters
a (np.array[float]) – Predicted probabilities for the model.
label (string) – Label of the model for the plots.
- print_performance(scores, title)¶
Print the evaluation metrics for the prediction.
- Parameters
scores (dict[string->float]) – Evaluation metrics for the prediction.
title (string) – Name of the model or experiment.
- structural_test(n_splits=5, seed=None, log=False, csv=True)¶
Test whether the structural properties of the network help to improve the prediction performance by building two different models and comparing their results. One model includes the structural properties, whereas the other does not.
- Parameters
n_splits (int) – Number of folds for cross-validation, defaults to 5
seed (int) – Random number seed, defaults to None
log (bool) – Flag for logging of the results of the test, defaults to False
csv (bool) – Flag for saving the results of the test in a csv file, defaults to True
- Returns
Result of the structural test (True if the structural properties improve the prediction performance, False otherwise), predicted labels and best parameter combination for the classifier including the structural properties.
- Return type
Tuple(bool, np.array[int], dict[string->float])
- train(X, n_splits=5, seed=None, n_iter=5, n_jobs_cv=None, n_jobs_xgb=2, eval_metric='aucpr', scoring='recall')¶
Train the XGBoost classifier on the input dataset, using cross-validation and hyper-parameter tuning, and compute its predictions.
- Parameters
X (DataFrame) – Input dataset for the prediction.
n_splits (int) – Number of folds for cross-validation, defaults to 5
seed (int) – Random number seed, defaults to None
n_iter (int, optional) – Number of iterations in cross validation for hyper-parameters tuning, defaults to 5
n_jobs_cv (int) – Number of parallel jobs running for hyper-parameters tuning, defaults to None
n_jobs_xgb (int) – Number of parallel jobs running for training the classifier, defaults to 2
eval_metric (string) – Evaluation metric for training the classifier, defaults to “aucpr”
scoring (string) – Scoring metric for hyper-parameters tuning, defaults to “recall”
- Returns
Predicted probabilities with the XGBoost classifier, feature importance measured by total gain, and best parameter combination for the classifier.
- Return type
Tuple(np.array[float], dict[string->float], dict[string->float])
- write_csv(a, b, labels=['without', 'with'])¶
Save the evaluation metrics for prediction of both models, i.e., without and with structural properties.
- Parameters
a (np.array[float]) – Predicted probabilities for the model without the structural properties of the network.
b (np.array[float]) – Predicted probabilities for the model including the structural properties of the network.
labels (List[string]) – Labels of both models for the plots.
xgbfnc.data module¶
Module for computing the structural properties of a network.
- xgbfnc.data.compute_strc_prop(adj_mad, dimensions=16, p=1, q=0.5, path=None, log=False, seed=None)¶
Compute multiple structural properties of the input network. Two types of properties are computed: hand-crafted and node embeddings.
- Parameters
adj_mad (np.matrix[int]) – Adjacency matrix representation of the network, square and symmetric matrix.
dimensions (int) – Dimension of the node embedding, defaults to 16
p (float) – Return parameter of node2vec, defaults to 1
q (float) – In-out parameter of node2vec, defaults to 0.5
path (string) – Relative path where the dataset will be saved, defaults to current path
log (bool) – Flag for logging of the results of the test, defaults to False
seed (float) – Random number seed, defaults to None
- Returns
Dataset with scaled features representing the structural properties of the network and list of labels (names) of the features.
- Return type
Tuple(DataFrame, List[string])
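The roles of the node2vec parameters p and q can be illustrated with a short sketch of the second-order random walk bias as it is commonly defined for node2vec on an unweighted graph: a step back to the previous node has weight 1/p, a step to a node that is also adjacent to the previous node has weight 1, and a step further away has weight 1/q. The helper name node2vec_step_probs and the toy graph are hypothetical, and the sketch does not show how xgbfnc computes its embeddings.

```python
def node2vec_step_probs(adj, prev, curr, p=1.0, q=0.5):
    """Probabilities of stepping from `curr` to each neighbour, given the walk came from `prev`."""
    weights = {}
    for nxt, connected in enumerate(adj[curr]):
        if not connected:
            continue
        if nxt == prev:
            weights[nxt] = 1.0 / p  # return to the previous node
        elif adj[prev][nxt]:
            weights[nxt] = 1.0      # stay at distance 1 from the previous node
        else:
            weights[nxt] = 1.0 / q  # move outward, to distance 2 from the previous node
    total = sum(weights.values())
    return {node: w / total for node, w in weights.items()}


# Toy graph with edges 0-1, 0-2, 1-2 and 1-3 (hypothetical data).
toy_adj = [
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
]
# The walk moved from node 0 to node 1; where does it go next?
probs = node2vec_step_probs(toy_adj, prev=0, curr=1, p=1.0, q=0.5)
```

With q below 1, the outward step to node 3 receives a larger share of the probability than the steps that stay close to node 0, so the walk tends to explore.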
- xgbfnc.data.scale_data(data)¶
Scale the data of a dataset without modifying the distribution of data.
- Parameters
data (DataFrame) – Dataset
- Returns
Dataset with scaled features
- Return type
DataFrame
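The reference does not name the scaling method. One common choice that rescales values without changing the shape of their distribution is min-max scaling, sketched below for a single feature column; treating scale_data as min-max scaling is an assumption, and the helper name min_max_scale is hypothetical.

```python
def min_max_scale(column):
    """Linearly map the values of a column to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant column: no spread to rescale
    return [(x - lo) / (hi - lo) for x in column]


# Toy node degrees (hypothetical data).
degrees = [2, 4, 6, 10]
scaled = min_max_scale(degrees)
```

Because the mapping is linear, the ordering and the relative spacing of the values are preserved.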
xgbfnc.plots module¶
Module for plotting the results of the prediction.
- xgbfnc.plots.plot_conf_matrix(cm, filename, path, labels=[0, 1])¶
Plot a confusion matrix and save it in a PDF file
- Parameters
cm (np.matrix[float]) – Confusion matrix
filename (string) – Name of the PDF file
path (string) – Path where the plot will be stored.
labels (List[float]) – Labels of the classes used in both axes of the matrix, defaults to [0,1].
- xgbfnc.plots.plot_pr(rec, prc, ap, filename, path)¶
Plot a precision-recall curve and save it in a PDF file
- Parameters
rec (np.array[float]) – Array of recall values
prc (np.array[float]) – Array of precision values
ap (float) – Average precision score
filename (string) – Name of the PDF file
path (string) – Path where the plot will be stored.
- xgbfnc.plots.plot_prs(recl, prcl, apl, labels, filename, path)¶
Plot multiple precision-recall curves in the same figure and save it in a PDF file
- Parameters
recl (np.array[np.array[float]]) – Array of arrays of recall values for multiple predictions
prcl (np.array[np.array[float]]) – Array of arrays of precision values for multiple predictions
apl (np.array[float]) – Array of average precision values for multiple predictions
labels (List[string]) – Labels of the multiple models plotted
filename (string) – Name of the PDF file
path (string) – Path where the plot will be stored.
- xgbfnc.plots.plot_roc(fpr, tpr, auc, filename, path)¶
Plot a ROC curve and save it in a PDF file
- Parameters
fpr (np.array[float]) – Array of false positive rate values
tpr (np.array[float]) – Array of true positive rate values
auc (float) – Area under the ROC curve
filename (string) – Name of the PDF file
path (string) – Path where the plot will be stored.
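The fpr, tpr and auc inputs of plot_roc can be derived from true labels and predicted probabilities. The sketch below lowers the threshold one distinct score at a time and integrates the resulting curve with the trapezoidal rule. The helper names roc_points and trapezoid_auc are hypothetical, and the package may well obtain these values from a library instead.

```python
def roc_points(y_true, y_prob):
    """Return (fpr, tpr) lists, starting at (0, 0) and ending at (1, 1).

    Assumes both classes are present in y_true.
    """
    pos = sum(y_true)
    neg = len(y_true) - pos
    fpr, tpr = [0.0], [0.0]
    for thr in sorted(set(y_prob), reverse=True):
        tp = sum(1 for t, s in zip(y_true, y_prob) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, y_prob) if s >= thr and t == 0)
        tpr.append(tp / pos)
        fpr.append(fp / neg)
    return fpr, tpr


def trapezoid_auc(fpr, tpr):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
        for i in range(1, len(fpr))
    )


# Toy labels and probabilities (hypothetical data).
y_true = [1, 0, 1, 1, 0, 0]
y_prob = [0.9, 0.6, 0.7, 0.4, 0.2, 0.1]
fpr, tpr = roc_points(y_true, y_prob)
auc = trapezoid_auc(fpr, tpr)
```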
- xgbfnc.plots.plot_rocs(fprl, tprl, aucl, labels, filename, path)¶
Plot multiple ROC curves in the same figure and save it in a PDF file
- Parameters
fprl (np.array[np.array[float]]) – Array of arrays of false positive rate values for multiple predictions
tprl (np.array[np.array[float]]) – Array of arrays of true positive rate values for multiple predictions
aucl (np.array[float]) – Array of area under the ROC curve values for multiple predictions
labels (List[string]) – Labels of the multiple models plotted
filename (string) – Name of the PDF file
path (string) – Path where the plot will be stored.