Welcome to the documentation of flat classification of nodes!

  • Written by Miguel Romero

  • Last update: 07/07/21

Classification of nodes with structural properties

This package evaluates whether the structural (topological) properties of a network are useful for predicting node attributes (i.e., node classification). It combines several machine learning techniques, such as XGBoost and the SMOTE sampling technique.

Installation

The xgbfnc package can be installed using pip; the requirements are installed automatically:

python3 -m pip install xgbfnc

The source code and examples can be found in the GitHub repository.

Documentation

Documentation of the package can be found here.

Example

This example illustrates how the algorithm can be used to check whether the structural properties of a gene co-expression network improve the prediction of gene functions for rice (Oryza sativa Japonica). It uses a gene co-expression network gathered from ATTED-II.

How to run the example?

The complete source code of the example can be found in the GitHub repository. First, the xgbfnc package needs to be imported:

from xgbfnc import xgbfnc
from xgbfnc import data

After creating the adjacency matrix adj for the network, the structural properties are computed using the data module of the package:

df, strc_cols = data.compute_strc_prop(adj)

This function returns a DataFrame with the structural properties of the network and a list of the names of these properties (i.e., column names). After adding the additional features of the network to the DataFrame, the XGBfnc class from the xgbfnc module is instantiated:

test = xgbfnc.XGBfnc()
test.load_data(df, strc_cols, y, term, output_path='output')
ans, pred, params = test.structural_test()

The data of the network is loaded using the load_data method, and the structural test is executed using the structural_test method. The test returns a boolean value indicating whether the structural properties help to improve the prediction performance, the prediction of the model that includes the structural properties, and its best parameters.
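
The steps above assume that an adjacency matrix adj already exists. As a minimal, dependency-free sketch, a square and symmetric 0/1 matrix can be built from an undirected edge list as shown below. The helper name build_adjacency and the toy edge list are illustrative and not part of xgbfnc; converting the result with numpy.asarray before passing it to data.compute_strc_prop is an assumption about the expected input type.

```python
def build_adjacency(n_nodes, edges):
    """Return a symmetric 0/1 adjacency matrix as a list of lists."""
    adj = [[0] * n_nodes for _ in range(n_nodes)]
    for u, v in edges:
        if u == v:
            continue  # self-loops are skipped in this sketch
        adj[u][v] = 1
        adj[v][u] = 1  # undirected network: mirror every edge
    return adj


# Toy network with 4 nodes (hypothetical data, for illustration only)
edges = [(0, 1), (1, 2), (2, 3), (0, 2)]
adj = build_adjacency(4, edges)
```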

To run the example, execute the following commands:

cd test/
python3 test_small.py

XGBfnc package

xgbfnc.xgbfnc module

Module for flat node classification and testing the importance of the structural properties of the network.

class xgbfnc.xgbfnc.XGBfnc

Bases: object

Class for flat node classification. This class builds two XGBoost binary classifiers for the attribute prediction using two datasets with different features, one including structural properties of the network and the other one without them.

Variables
  • df (DataFrame) – Dataset with all features of the network.

  • orig_cols (List[string]) – List of feature names (columns of df) not related to structural properties of the network.

  • strc_cols (List[string]) – List of feature names (columns of df) related to structural properties of the network. The intersection between orig_cols and strc_cols must be empty.

  • y (Series) – Series representing the node attribute to be predicted.

  • ylabel (string) – Name of the node attribute to be predicted.

  • output_path (string) – Path where the output of the algorithm will be stored.

  • figs_path (string) – Path where the figures will be stored.
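
The requirement that the two column lists do not overlap can be illustrated with a small sketch. It assumes, as a guess about the internals, that the non-structural columns are simply the columns of df that are not listed as structural; split_columns and the column names are hypothetical. With a pandas DataFrame, all_cols would be list(df.columns).

```python
def split_columns(all_cols, strc_cols):
    """Split column names into non-structural and structural lists."""
    missing = [c for c in strc_cols if c not in all_cols]
    if missing:
        raise ValueError(f"structural columns not in dataset: {missing}")
    strc = set(strc_cols)
    orig_cols = [c for c in all_cols if c not in strc]
    # By construction the two lists share no column name
    assert not set(orig_cols) & strc
    return orig_cols, list(strc_cols)


# Hypothetical column names, for illustration only
all_cols = ["expr_mean", "expr_var", "degree", "emb_0"]
orig_cols, strc_cols = split_columns(all_cols, ["degree", "emb_0"])
```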

compare_plots(a, b, labels=['without', 'with'])

Plot the ROC curve, precision-recall curve and confusion matrices for the predictions of both models, i.e., without and with structural properties.

Parameters
  • a (np.array[float]) – Predicted probabilities for the model without the structural properties of the network.

  • b (np.array[float]) – Predicted probabilities for the model including the structural properties of the network.

  • labels (List[string]) – Labels of both models for the plots.

create_classifier(n_iter=5, n_jobs_cv=None, n_jobs_xgb=2, eval_metric='aucpr', scoring='recall', seed=None)

Builds the binary classifier within a hyper-parameter tuning model.

Parameters
  • n_iter (int) – Number of iterations in cross-validation for hyper-parameter tuning, defaults to 5

  • n_jobs_cv (int) – Number of parallel jobs running for hyper-parameter tuning, defaults to None

  • n_jobs_xgb (int) – Number of parallel jobs running for training the classifier, defaults to 2

  • eval_metric (string) – Evaluation metric for training the classifier, defaults to “aucpr”

  • scoring (string) – Scoring metric for hyper-parameter tuning, defaults to “recall”

  • seed (int) – Random number seed, defaults to None

Returns

Hyper-parameter tuning model with XGBoost binary classifier

Return type

RandomizedSearchCV
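
To give an intuition for what n_iter and seed control in a randomized search, the following dependency-free sketch draws n_iter random parameter combinations from a search space. It is not the package's implementation: the search space is hypothetical (the grid used by xgbfnc is not documented here), and unlike scikit-learn's RandomizedSearchCV, which samples lists of values without replacement, this sketch may draw the same combination twice.

```python
import random


def sample_candidates(param_distributions, n_iter=5, seed=None):
    """Draw n_iter random parameter combinations from lists of values."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return [
        {name: rng.choice(values) for name, values in param_distributions.items()}
        for _ in range(n_iter)
    ]


# Hypothetical search space, for illustration only
space = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1, 0.3],
    "n_estimators": [100, 200],
}
candidates = sample_candidates(space, n_iter=5, seed=0)
```

Each candidate would then be scored with cross-validation, and the best-scoring combination kept.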

create_path(path)

Create a path.

Parameters

path (string) – Relative path to be created.

Raises

OSError – if the path already exists
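
The documented behavior matches that of os.makedirs called without exist_ok; whether xgbfnc relies on that function is an assumption. The sketch below reproduces the behavior in a temporary directory.

```python
import os
import tempfile


def create_path(path):
    """Create a directory, failing if it already exists."""
    # os.makedirs raises FileExistsError (a subclass of OSError)
    # when the target exists and exist_ok keeps its default of False
    os.makedirs(path)


base = tempfile.mkdtemp()  # scratch directory for the demonstration
target = os.path.join(base, "output")
create_path(target)  # first call succeeds
try:
    create_path(target)  # second call hits the existing path
    raised = False
except OSError:
    raised = True
```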

evaluate(y_orig, y_pred_prob)

Evaluate the performance of the prediction using metrics such as the ROC AUC, average precision score, precision, recall and F1 score.

Parameters
  • y_orig (np.array[int]) – True labels of the nodes.

  • y_pred_prob (np.array[float]) – Predicted probabilities with the XGBoost classifier.

Returns

Evaluation metrics for the prediction.

Return type

dict[string->float]
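
As a reminder of how three of these metrics are defined, here is a minimal sketch that computes precision, recall and F1 score from binary labels. It takes hard 0/1 predictions, so predicted probabilities would first have to be thresholded; the function name and the dictionary keys are illustrative and may differ from those used by the package.

```python
def binary_scores(y_true, y_pred):
    """Precision, recall and F1 for 0/1 labels (positive class is 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy labels: 2 true positives, 1 false positive, 1 false negative
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
scores = binary_scores(y_true, y_pred)
```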

load_data(df, strc_cols, y, ylabel, output_path=None, figs_path=None)

Load the data of the network.

Parameters
  • df (DataFrame) – Dataset with all node features.

  • strc_cols (List[string]) – List of features related to structural properties.

  • y (Series) – Node attribute to be predicted. Must have the same number of rows as df.

  • ylabel (string) – Name of the node attribute to be predicted.

  • output_path (string) – Path to save output, defaults to “YYYY-MM-DD/”.

  • figs_path (string) – Path to save figures, defaults to “YYYY-MM-DD/”.
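
The default “YYYY-MM-DD/” presumably stands for a directory named after the current date; this reading is an assumption. Under that assumption, such a default could be built as follows (default_output_path is a hypothetical helper, not part of the package).

```python
from datetime import date


def default_output_path(today=None):
    """Build a 'YYYY-MM-DD/' directory name from a date."""
    today = today or date.today()
    return today.isoformat() + "/"  # isoformat() yields YYYY-MM-DD


path = default_output_path(date(2021, 7, 7))
```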

opt_threshold(y_orig, y_pred)

Compute the classification from probabilities based on the optimum threshold according to the precision-recall curve, that is, the threshold that maximizes the F1 score.

Parameters
  • y_orig (np.array[int]) – True labels of the nodes.

  • y_pred (np.array[float]) – Predicted probabilities with the XGBoost classifier.

Returns

Classification for the input array of probabilities which maximizes the F1 score.

Return type

np.array[int]
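
The idea of choosing the threshold that maximizes the F1 score can be sketched without any dependency. The version below tries every distinct predicted probability as a threshold; the package may instead derive the candidates from a precision-recall curve, so this is an illustration of the principle under that assumption, and the function names are hypothetical.

```python
def f1_score(y_true, y_pred):
    """F1 score for 0/1 labels, computed as 2TP / (2TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def classify_at_best_f1(y_true, y_prob):
    """Return (labels, threshold) for the threshold with the highest F1."""
    best_f1, best_thr = -1.0, None
    # Every distinct predicted probability is a candidate threshold
    for thr in sorted(set(y_prob)):
        labels = [1 if p >= thr else 0 for p in y_prob]
        score = f1_score(y_true, labels)
        if score > best_f1:  # ties keep the lowest threshold
            best_f1, best_thr = score, thr
    return [1 if p >= best_thr else 0 for p in y_prob], best_thr


y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]
labels, thr = classify_at_best_f1(y_true, y_prob)
```

In this toy example the threshold 0.35 gives an F1 score of 0.8, higher than any other candidate, at the cost of one false positive.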

plot_performance(a, label)

Plot the ROC curve, precision-recall curve and confusion matrix for a prediction.

Parameters
  • a (np.array[float]) – Predicted probabilities for the model.

  • label (string) – Label of the model for the plots.

print_performance(scores, title)

Print the evaluation metrics for the prediction.

Parameters
  • scores (dict[string->float]) – Evaluation metrics for the prediction.

  • title (string) – Name of the model or experiment.

structural_test(n_splits=5, seed=None, log=False, csv=True)

Test whether the structural properties of the network help to improve the prediction performance by building two different models and comparing their results. One model includes the structural properties, whereas the other does not.

Parameters
  • n_splits (int) – Number of folds for cross-validation, defaults to 5

  • seed (int) – Random number seed, defaults to None

  • log (bool) – Flag for logging the results of the test, defaults to False

  • csv (bool) – Flag for saving the results of the test in a csv file, defaults to True

Returns

Result of the structural test (True if the structural properties improve the prediction performance, False otherwise), predicted labels and best parameter combination for the classifier including the structural properties.

Return type

Tuple(bool, np.array[int], dict[string->float])

train(X, n_splits=5, seed=None, n_iter=5, n_jobs_cv=None, n_jobs_xgb=2, eval_metric='aucpr', scoring='recall')

Train the XGBoost binary classifier on the input dataset, tuning its hyper-parameters with cross-validation, and compute the predicted probabilities.

Parameters
  • X (DataFrame) – Input dataset for the prediction.

  • n_splits (int) – Number of folds for cross-validation, defaults to 5

  • seed (int) – Random number seed, defaults to None

  • n_iter (int, optional) – Number of iterations in cross-validation for hyper-parameter tuning, defaults to 5

  • n_jobs_cv (int) – Number of parallel jobs running for hyper-parameter tuning, defaults to None

  • n_jobs_xgb (int) – Number of parallel jobs running for training the classifier, defaults to 2

  • eval_metric (string) – Evaluation metric for training the classifier, defaults to “aucpr”

  • scoring (string) – Scoring metric for hyper-parameter tuning, defaults to “recall”

Returns

Predicted probabilities with the XGBoost classifier, feature importance measured by total gain, and best parameter combination for the classifier.

Return type

Tuple(np.array[float], dict[string->float], dict[string->float])
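
One possible use of the returned feature importance dictionary is to rank the features and check how many structural properties appear near the top. The sketch below assumes the dictionary maps feature names to total-gain values, as the return type suggests; the numbers, column names and the helper top_features are made up for illustration.

```python
def top_features(importance, k=3):
    """Return the k feature names with the largest importance values."""
    ranked = sorted(importance.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical total-gain values, for illustration only
importance = {"expr_mean": 12.5, "degree": 40.2, "emb_0": 7.1, "expr_var": 18.0}
strc_cols = ["degree", "emb_0"]

best = top_features(importance, k=2)
structural_in_top = [name for name in best if name in strc_cols]
```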

write_csv(a, b, labels=['without', 'with'])

Save the evaluation metrics for prediction of both models, i.e., without and with structural properties.

Parameters
  • a (np.array[float]) – Predicted probabilities for the model without the structural properties of the network.

  • b (np.array[float]) – Predicted probabilities for the model including the structural properties of the network.

  • labels (List[string]) – Labels of both models for the plots.

xgbfnc.data module

Module for computing the structural properties of a network.

xgbfnc.data.compute_strc_prop(adj_mad, dimensions=16, p=1, q=0.5, path=None, log=False, seed=None)

Compute multiple structural properties of the input network. Two types of properties are computed: hand-crafted properties and node embeddings.

Parameters
  • adj_mad (np.matrix[int]) – Adjacency matrix representation of the network; must be a square, symmetric matrix.

  • dimensions (int) – Dimension of the node embedding, defaults to 16

  • p (float) – Return parameter of node2vec, defaults to 1

  • q (float) – In-out parameter of node2vec, defaults to 0.5

  • path (string) – Relative path where the dataset will be saved, defaults to current path

  • log (bool) – Flag for logging the results of the test, defaults to False

  • seed (float) – Random number seed, defaults to None

Returns

Dataset with scaled features representing the structural properties of the network and list of labels (names) of the features.

Return type

Tuple(DataFrame, List[string])
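
This page does not list which hand-crafted properties are computed. As an assumed example of one such property, the sketch below validates that a matrix is square and symmetric and then computes the degree of each node; both helper names are hypothetical and the matrix is stored as a plain list of lists to keep the example dependency-free.

```python
def check_adjacency(adj):
    """Raise ValueError unless adj is square and symmetric."""
    n = len(adj)
    if any(len(row) != n for row in adj):
        raise ValueError("adjacency matrix must be square")
    for i in range(n):
        for j in range(i + 1, n):
            if adj[i][j] != adj[j][i]:
                raise ValueError("adjacency matrix must be symmetric")


def node_degrees(adj):
    """Degree of each node, i.e. the sum of its row in a 0/1 matrix."""
    check_adjacency(adj)
    return [sum(row) for row in adj]


# Toy 4-node network, for illustration only
adj = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
degrees = node_degrees(adj)
```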

xgbfnc.data.scale_data(data)

Scale the features of a dataset without changing the distribution of the data.

Parameters

data (DataFrame) – Dataset

Returns

Dataset with scaled features

Return type

DataFrame
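
The scaling method is not named here. One common choice that rescales values linearly, and therefore keeps the shape of their distribution, is min-max scaling; the sketch below applies it to a single column and is an assumption, not a description of the package's implementation.

```python
def min_max_scale(values):
    """Linearly rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to spread
    return [(v - lo) / (hi - lo) for v in values]


column = [2.0, 4.0, 6.0, 10.0]
scaled = min_max_scale(column)
```

The relative spacing between values is unchanged: 4.0 lies a quarter of the way between the minimum and the maximum both before and after scaling.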

xgbfnc.plots module

Module for plotting the results of the prediction.

xgbfnc.plots.plot_conf_matrix(cm, filename, path, labels=[0, 1])

Plot a confusion matrix and save it in a PDF file

Parameters
  • cm (np.matrix[float]) – Confusion matrix

  • filename (string) – Name of the PDF file

  • path (string) – Path where the plot will be stored.

  • labels (List[float]) – Labels of the classes used on both axes of the matrix, defaults to [0, 1].
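
For reference, the cm argument can be obtained by counting label pairs. The sketch below follows the layout used by scikit-learn (rows are true classes, columns are predicted classes); that plot_conf_matrix expects this layout is an assumption, and the helper itself is not part of the package.

```python
def confusion_matrix(y_true, y_pred, labels=(0, 1)):
    """Count (true, predicted) pairs; rows are true classes."""
    index = {label: i for i, label in enumerate(labels)}
    cm = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        cm[index[t]][index[p]] += 1
    return cm


y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
cm = confusion_matrix(y_true, y_pred)
```

With labels (0, 1), cm[0][0] holds the true negatives, cm[0][1] the false positives, cm[1][0] the false negatives and cm[1][1] the true positives.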

xgbfnc.plots.plot_pr(rec, prc, ap, filename, path)

Plot a precision-recall curve and save it in a PDF file

Parameters
  • rec (np.array[float]) – Array of recall values

  • prc (np.array[float]) – Array of precision values

  • ap (float) – Average precision score

  • filename (string) – Name of the PDF file

  • path (string) – Path where the plot will be stored.

xgbfnc.plots.plot_prs(recl, prcl, apl, labels, filename, path)

Plot multiple precision-recall curves in the same figure and save it in a PDF file

Parameters
  • recl (np.array[np.array[float]]) – Array of arrays of recall values for multiple predictions

  • prcl (np.array[np.array[float]]) – Array of arrays of precision values for multiple predictions

  • apl (np.array[float]) – Array of average precision values for multiple predictions

  • labels (List[string]) – Labels of the multiple models plotted

  • filename (string) – Name of the PDF file

  • path (string) – Path where the plot will be stored.

xgbfnc.plots.plot_roc(fpr, tpr, auc, filename, path)

Plot a ROC curve and save it in a PDF file

Parameters
  • fpr (np.array[float]) – Array of false positive rate values

  • tpr (np.array[float]) – Array of true positive rate values

  • auc (float) – Area under the ROC curve

  • filename (string) – Name of the PDF file

  • path (string) – Path where the plot will be stored.
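
The fpr, tpr and auc arguments are typically computed from the true labels and the predicted probabilities. The following dependency-free sketch shows one way to obtain them, by sweeping the threshold over the distinct probabilities and integrating with the trapezoidal rule; the function names are illustrative, and in practice a library routine would normally be used instead.

```python
def roc_points(y_true, y_prob):
    """Return (fpr, tpr) lists obtained by lowering the threshold step by step.

    Assumes 0/1 labels with at least one positive and one negative.
    """
    pos = sum(y_true)
    neg = len(y_true) - pos
    fpr, tpr = [0.0], [0.0]  # the curve starts at the origin
    for thr in sorted(set(y_prob), reverse=True):
        tp = sum(1 for t, p in zip(y_true, y_prob) if p >= thr and t == 1)
        fp = sum(1 for t, p in zip(y_true, y_prob) if p >= thr and t == 0)
        tpr.append(tp / pos)
        fpr.append(fp / neg)
    return fpr, tpr


def trapezoid_auc(fpr, tpr):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
        for i in range(1, len(fpr))
    )


y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_points(y_true, y_prob)
auc = trapezoid_auc(fpr, tpr)
```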

xgbfnc.plots.plot_rocs(fprl, tprl, aucl, labels, filename, path)

Plot multiple ROC curves in the same figure and save it in a PDF file

Parameters
  • fprl (np.array[np.array[float]]) – Array of arrays of false positive rate values for multiple predictions

  • tprl (np.array[np.array[float]]) – Array of arrays of true positive rate values for multiple predictions

  • aucl (np.array[float]) – Array of area under the ROC curve values for multiple predictions

  • labels (List[string]) – Labels of the multiple models plotted

  • filename (string) – Name of the PDF file

  • path (string) – Path where the plot will be stored.
