A package for performing dynamic recursive feature elimination with sklearn.
dRFEtools - dynamic Recursive Feature Elimination
dRFEtools is a package for dynamic recursive feature elimination with sklearn. It currently supports random forest classification and regression, and linear models (linear, lasso, ridge, and elastic net).
Authors: Apuã Paquola, Kynon Jade Benjamin, and Tarun Katipalli
Package developed in Python 3.7+.
In addition to scikit-learn, dRFEtools is also built with NumPy, SciPy, Pandas, matplotlib, plotnine, and statsmodels.
This package has several functions to run dynamic recursive feature elimination (dRFE) for random forest and linear model classification and regression. For random forest, it assumes Out-of-Bag (OOB) is set to True. For linear models, it generates a developmental set. For both classification and regression, three measurements are calculated for feature selection:
Classification:
 Normalized mutual information
 Accuracy
 Area under the ROC curve (AUC)
Regression:
 R2 (this can be negative if the model is arbitrarily worse)
 Explained variance
 Mean squared error
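To make the note about negative R2 concrete, here is a minimal sketch in plain Python. It is not the package's code, and the helper names `r2_score_sketch` and `mse_sketch` are hypothetical. R2 compares the model's squared error with the error of always predicting the mean, so predictions that are worse than the mean push the score below zero.

```python
def r2_score_sketch(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mse_sketch(y_true, y_pred):
    """Mean of the squared differences between truth and prediction."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [1.0, 2.0, 3.0]
perfect = r2_score_sketch(y_true, [1.0, 2.0, 3.0])      # predictions match exactly
reversed_r2 = r2_score_sketch(y_true, [3.0, 2.0, 1.0])  # worse than predicting the mean
reversed_mse = mse_sketch(y_true, [3.0, 2.0, 1.0])
```

With these toy values the perfect predictions score 1.0, while the reversed predictions score -3.0.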
The package has been split into four additional scripts for:
 Random forest feature elimination (AP)
 Linear model regression feature elimination (KJB)
 Rank features function (TK)
 Lowess redundant selection (KJB)
Citation
Installation
pip install --user dRFEtools
Reference Manual
dRFEtools main functions

dRFE - Random Forest
rf_rfe
Runs random forest feature elimination step over iterator process.
Args:
 estimator: Random forest classifier object
 X: a data frame of training data
 Y: a vector of sample labels from training data set
 features: a vector of feature names
 fold: current fold
 out_dir: output directory. default '.'
 elimination_rate: percent rate to reduce feature list. default .2
 RANK: Output feature ranking. default=True (Boolean)
Yields:
 dict: a dictionary with number of features, normalized mutual information score, accuracy score, and array of the indexes for features to keep

dRFE - Linear Models
dev_rfe
Runs recursive feature elimination for linear model step over iterator process assuming developmental set is needed.
Args:
 estimator: regressor or classifier linear model object
 X: a data frame of training data
 Y: a vector of sample labels from training data set
 features: a vector of feature names
 fold: current fold
 out_dir: output directory. default '.'
 elimination_rate: percent rate to reduce feature list. default .2
 dev_size: developmental set size. default '0.20'
 RANK: run feature ranking, default 'True'
 SEED: random state. default 'True'
Yields:
 dict: a dictionary with number of features, r2 score, mean square error, explained variance, and array of the indices for features to keep
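As a rough illustration of what holding out a developmental set means, the sketch below splits sample indices into training and developmental parts with a seeded shuffle. It is an assumption-laden stand-in, not the package's implementation: the function name `dev_split_sketch` and the seed value are hypothetical, and in practice scikit-learn's `train_test_split` (with `test_size` and `random_state`) covers the same job.

```python
import random


def dev_split_sketch(n_samples, dev_size=0.20, seed=13):
    """Return (train_idx, dev_idx) index lists for a shuffled hold-out split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # seeded, so the split is repeatable
    n_dev = max(1, round(n_samples * dev_size))   # always hold out at least one sample
    return sorted(indices[n_dev:]), sorted(indices[:n_dev])


train_idx, dev_idx = dev_split_sketch(10, dev_size=0.20, seed=13)
repeat_train, repeat_dev = dev_split_sketch(10, dev_size=0.20, seed=13)
```

The model would then be fitted on the `train_idx` rows and scored on the `dev_idx` rows at each elimination step.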

Feature Rank Function
feature_rank_fnc
This function ranks features within the feature elimination loop.
Args:
 features: A vector of feature names
 rank: A vector with feature ranks based on absolute value of feature importance
 n_features_to_keep: Number of features to keep. (Int)
 fold: Fold to be analyzed. (Int)
 out_dir: Output directory for text file. Default '.'
 RANK: Boolean (True or False)
Yields:
 Text file: Ranked features by fold tab-delimited text file, only if RANK=True
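The sketch below shows one way such a ranking file could be produced: order features by the absolute value of their importance and write one tab-delimited row per feature. The function name, the output file name `rank_features_sketch.txt`, and the column layout (feature, rank, fold) are all assumptions made for illustration; the package's actual output may differ.

```python
import csv
import os
import tempfile


def rank_features_sketch(features, importances, fold, out_dir="."):
    """Rank features by |importance| (1 = most important) and write a TSV."""
    order = sorted(range(len(features)),
                   key=lambda i: abs(importances[i]), reverse=True)
    rows = [(features[i], rank, fold) for rank, i in enumerate(order, start=1)]
    path = os.path.join(out_dir, "rank_features_sketch.txt")  # hypothetical name
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(rows)
    return rows, path


# Write into a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    rows, path = rank_features_sketch(["g1", "g2", "g3"], [0.1, -0.7, 0.3],
                                      fold=0, out_dir=tmp)
    with open(path) as handle:
        first_line = handle.readline().rstrip("\n")
```

Here `g2` ranks first because its importance has the largest magnitude, even though it is negative.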

N Feature Iterator
n_features_iter
Determines the features to keep.
Args:
 nf: current number of features
 keep_rate: percentage of features to keep
Yields:
 int: number of features to keep
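A minimal sketch of such an iterator is shown below. It assumes that `keep_rate` is the complement of `elimination_rate` (for example 0.8 when 20% of features are dropped per step) and that iteration stops once a single feature remains; the name `n_features_iter_sketch` is hypothetical and the package's own function may behave differently at the edges.

```python
def n_features_iter_sketch(nf, keep_rate):
    """Yield successively smaller feature counts until one feature remains."""
    if not 0 < keep_rate < 1:
        raise ValueError("keep_rate must be strictly between 0 and 1")
    while nf > 1:
        nf = max(1, int(nf * keep_rate))   # never drop below one feature
        yield nf


counts = list(n_features_iter_sketch(10, 0.5))
```

Because the step size is a fraction of the current feature count, many features are removed early on and only a few at a time near the end, which is what makes the elimination dynamic.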
Redundant features functions

Run lowess
run_lowess
This function runs the lowess function and caches it to memory.
Args:
 x: the x-values of the observed points
 y: the y-values of the observed points
 frac: the fraction of the data used when estimating each y-value. default 3/10
Yields:
 z: 2D array of results

Convert array to tuple
array_to_tuple
This function attempts to convert a numpy array to a tuple.
Args:
 np_array: numpy array
Yields:
 tuple
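A likely reason for converting arrays to tuples is caching: `functools.lru_cache` needs hashable arguments, and lists and NumPy arrays are not hashable. The sketch below illustrates that idea with plain Python lists. The names `to_nested_tuple` and `expensive_sum` are hypothetical, and the link between this conversion and the cached lowess call is an assumption based on the descriptions above.

```python
from functools import lru_cache


def to_nested_tuple(values):
    """Recursively convert nested sequences to tuples; leave scalars alone."""
    if isinstance(values, (str, bytes)):   # strings iterate forever, so stop here
        return values
    try:
        return tuple(to_nested_tuple(v) for v in values)
    except TypeError:                      # scalars are not iterable
        return values


calls = {"count": 0}


@lru_cache(maxsize=None)
def expensive_sum(points):
    """Stand-in for a costly computation; counts how often it really runs."""
    calls["count"] += 1
    return sum(sum(row) for row in points)


data = [[1, 2], [3, 4]]
key = to_nested_tuple(data)
first = expensive_sum(key)
second = expensive_sum(to_nested_tuple(data))  # same key, served from the cache
```

The second call returns the stored result without running the function body again.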

Extract dRFE as a dataframe
get_elim_df_ordered
This function converts the dRFE dictionary to a pandas dataframe.
Args:
 d: dRFE dictionary
 multi: is this for multiple classes. (True or False)
Yields:
 df_elim: dRFE as a dataframe with log10 transformed features

Calculate lowess curve
cal_lowess
This function calculates the lowess curve.
Args:
 d: dRFE dictionary
 frac: the fraction of the data used when estimating each y-value
 multi: is this for multiple classes. (True or False)
Yields:
 x: dRFE log10 transformed features
 y: dRFE metrics
 z: 2D numpy array with lowess curve
 xnew: increased intervals
 ynew: interpolated metrics for xnew

Calculate lowess curve for log10
cal_lowess
This function calculates the rate of change on the lowess fitted curve with log10 transformed input.
Args:
 d: dRFE dictionary
 frac: the fraction of the data used when estimating each y-value
 multi: is this for multiple classes. default False
Yields:
 data frame: dataframe with n_features, lowess value, and rate of change (DxDy)
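The rate of change can be pictured as the slope of the smoothed metric against log10 of the feature count. The sketch below computes a simple finite-difference slope between neighbouring points. It assumes the smoothed values are already available (for example from a lowess fit), returns plain tuples instead of a data frame, and uses a hypothetical function name; the package's exact difference scheme may differ.

```python
import math


def rate_of_change_log10(n_features, smoothed):
    """Finite-difference slope of `smoothed` against log10(n_features)."""
    x = [math.log10(n) for n in n_features]
    rows = []
    for i in range(1, len(x)):
        slope = (smoothed[i] - smoothed[i - 1]) / (x[i] - x[i - 1])
        rows.append((n_features[i], smoothed[i], slope))
    return rows


# Toy curve: the metric rises quickly at first, then flattens out.
rows = rate_of_change_log10([1, 10, 100, 1000], [0.50, 0.80, 0.90, 0.91])
```

In this toy curve the slope shrinks from roughly 0.3 to 0.01 per tenfold increase in features, the kind of flattening that signals additional features are adding little.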

Extract max lowess
extract_max_lowess
This function extracts the max features based on rate of change of log10 transformed lowess fit curve.
Args:
 d: dRFE dictionary
 frac: the fraction of the data used when estimating each y-value. default 3/10
 multi: is this for multiple classes. default False
Yields:
 int: number of max features (smallest subset)

Extract redundant lowess
extract_redundant_lowess
This function extracts the redundant features based on rate of change of log10 transformed lowess fit curve.
Args:
 d: dRFE dictionary
 frac: the fraction of the data used when estimating each y-value. default 3/10
 step_size: rate of change step size to analyze for extraction. default 0.05
 multi: is this for multiple classes. default False
Yields:
 int: number of redundant features

Optimize lowess plot
plot_with_lowess_vline
Redundant set selection optimization plot. This will be ROC AUC for multiple classification (3+), NMI for binary classification, or R2 for regression. The plot returned has fraction and step size as well as lowess smoothed curve and indication of predicted redundant set.
Args:
 d: feature elimination class dictionary
 fold: current fold
 out_dir: output directory. default '.'
 frac: the fraction of the data used when estimating each y-value. default 3/10
 step_size: rate of change step size to analyze for extraction. default 0.05
 classify: is this a classification algorithm. default True
 multi: does this have multiple (3+) classes. default True
Yields:
 graph: plot of dRFE with estimated redundant set indicated as well as fraction and set size used. It automatically saves files as pdf, png, and svg

Plot lowess vline
plot_with_lowess_vline
Plot feature elimination results with the redundant set indicated. This will be ROC AUC for multiple classification (3+), NMI for binary classification, or R2 for regression.
Args:
 d: feature elimination class dictionary
 fold: current fold
 out_dir: output directory. default '.'
 frac: the fraction of the data used when estimating each y-value. default 3/10
 step_size: rate of change step size to analyze for extraction. default 0.05
 classify: is this a classification algorithm. default True
 multi: does this have multiple (3+) classes. default True
Yields:
 graph: plot of dRFE with estimated redundant set indicated, automatically saves files as pdf, png, and svg
Plotting functions

Save plots
save_plots
This function saves a plot as svg, png, and pdf with a specific label and dimensions.
Args:
 p: plotnine object
 fn: file name without extensions
 w: width, default 7
 h: height, default 7
Yields: SVG, PNG, and PDF of plotnine object

Plot dRFE Accuracy
plot_acc
Plot feature elimination results for accuracy.
Args:
 d: feature elimination class dictionary
 fold: current fold
 out_dir: output directory. default '.'
Yields:
 graph: plot of feature by accuracy, automatically saves files as pdf, png, and svg

Plot dRFE NMI
plot_nmi
Plot feature elimination results for normalized mutual information.
Args:
 d: feature elimination class dictionary
 fold: current fold
 out_dir: output directory. default '.'
Yields:
 graph: plot of feature by NMI, automatically saves files as pdf, png, and svg

Plot dRFE ROC AUC
plot_roc
Plot feature elimination results for AUC ROC curve.
Args:
 d: feature elimination class dictionary
 fold: current fold
 out_dir: output directory. default '.'
Yields:
 graph: plot of feature by AUC, automatically saves files as pdf, png, and svg

Plot dRFE R2
plot_r2
Plot feature elimination results for R2 score. Note that this can be negative if the model is arbitrarily worse.
Args:
 d: feature elimination class dictionary
 fold: current fold
 out_dir: output directory. default '.'
Yields:
 graph: plot of feature by R2, automatically saves files as pdf, png, and svg

Plot dRFE MSE
plot_mse
Plot feature elimination results for mean squared error score.
Args:
 d: feature elimination class dictionary
 fold: current fold
 out_dir: output directory. default '.'
Yields:
 graph: plot of feature by mean squared error, automatically saves files as pdf, png, and svg

Plot dRFE Explained Variance
plot_evar
Plot feature elimination results for explained variance score.
Args:
 d: feature elimination class dictionary
 fold: current fold
 out_dir: output directory. default '.'
Yields:
 graph: plot of feature by explained variance, automatically saves files as pdf, png, and svg
Metric functions

OOB Prediction
oob_predictions
Extracts out-of-bag (OOB) predictions from random forest classifier classes.
Args:
 estimator: Random forest classifier object
Yields:
 vector: OOB predicted labels
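In scikit-learn, a random forest classifier fitted with `oob_score=True` exposes per-sample OOB class probabilities as `oob_decision_function_` and the class labels as `classes_`. The sketch below shows, with plain lists standing in for those attributes, how predicted labels and an accuracy score could be derived from them. The helper names are hypothetical, and whether the package follows exactly these steps is an assumption.

```python
def labels_from_oob_sketch(classes, oob_decision_function):
    """Pick the most probable class for each row of OOB class probabilities."""
    predictions = []
    for row in oob_decision_function:
        best = max(range(len(row)), key=row.__getitem__)   # argmax of the row
        predictions.append(classes[best])
    return predictions


def accuracy_sketch(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


classes = ["case", "control"]                                   # stand-in for classes_
oob_probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]   # stand-in for oob_decision_function_
y_true = ["case", "control", "control", "control"]
y_pred = labels_from_oob_sketch(classes, oob_probs)
acc = accuracy_sketch(y_true, y_pred)
```

Three of the four toy samples are labelled correctly, giving an accuracy of 0.75.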

OOB Accuracy Score
oob_score_accuracy
Calculates the accuracy score from the OOB predictions.
Args:
 estimator: Random forest classifier object
 Y: a vector of sample labels from training data set
Yields:
 float: accuracy score

OOB Normalized Mutual Information Score
oob_score_nmi
Calculates the normalized mutual information score from the OOB predictions.
Args:
 estimator: Random forest classifier object
 Y: a vector of sample labels from training data set
Yields:
 float: normalized mutual information score

OOB Area Under ROC Curve Score
oob_score_roc
Calculates the area under the ROC curve score for the OOB predictions.
Args:
 estimator: Random forest classifier object
 Y: a vector of sample labels from training data set
Yields:
 float: AUC ROC score

OOB R2 Score
oob_score_r2
Calculates the r2 score from the OOB predictions.
Args:
 estimator: Random forest regressor object
 Y: a vector of sample labels from training data set
Yields:
 float: r2 score

OOB Mean Squared Error Score
oob_score_mse
Calculates the mean squared error score from the OOB predictions.
Args:
 estimator: Random forest regressor object
 Y: a vector of sample labels from training data set
Yields:
 float: mean squared error score

OOB Explained Variance Score
oob_score_evar
Calculates the explained variance score for the OOB predictions.
Args:
 estimator: Random forest regressor object
 Y: a vector of sample labels from training data set
Yields:
 float: explained variance score

Developmental Test Set Predictions
dev_predictions
Extracts predictions using a development fold for linear regressor.
Args:
 estimator: Linear model regression classifier object
 X: a data frame of normalized values from developmental dataset
Yields:
 vector: Development set predicted labels

Developmental Test Set R2 Score
dev_score_r2
Calculates the r2 score from the developmental dataset predictions.
Args:
 estimator: Linear model regressor object
 X: a data frame of normalized values from developmental dataset
 Y: a vector of sample labels from developmental dataset
Yields:
 float: r2 score

Developmental Test Set Mean Squared Error Score
dev_score_mse
Calculates the mean squared error score from the developmental dataset predictions.
Args:
 estimator: Linear model regressor object
 X: a data frame of normalized values from developmental dataset
 Y: a vector of sample labels from developmental dataset
Yields:
 float: mean squared error score

Developmental Test Set Explained Variance Score
dev_score_evar
Calculates the explained variance score for the developmental dataset predictions.
Args:
 estimator: Linear model regressor object
 X: a data frame of normalized values from developmental dataset
 Y: a vector of sample labels from developmental data set
Yields:
 float: explained variance score

DEV Accuracy Score
dev_score_accuracy
Calculates the accuracy score from the DEV predictions.
Args:
 estimator: Linear model classifier object
 X: a data frame of normalized values from developmental dataset
 Y: a vector of sample labels from training data set
Yields:
 float: accuracy score

DEV Normalized Mutual Information Score
dev_score_nmi
Calculates the normalized mutual information score from the DEV predictions.
Args:
 estimator: Linear model classifier object
 X: a data frame of normalized values from developmental dataset
 Y: a vector of sample labels from training data set
Yields:
 float: normalized mutual information score

DEV Area Under ROC Curve Score
dev_score_roc
Calculates the area under the ROC curve score for the DEV predictions.
Args:
 estimator: Linear model classifier object
 X: a data frame of normalized values from developmental dataset
 Y: a vector of sample labels from training data set
Yields:
 float: AUC ROC score
Linear model classes for dRFE

Lasso Class
Lasso and LassoCV
Add feature importance to Lasso class similar to random forest output. LassoCV uses cross-validation for alpha tuning.

Ridge Class
Ridge and RidgeCV
Add feature importance to Ridge class similar to random forest output. RidgeCV uses cross-validation for alpha tuning.

ElasticNet Class
ElasticNet and ElasticNetCV
Add feature importance to ElasticNet class similar to random forest output. ElasticNetCV uses cross-validation to choose alpha.

LinearRegression Class
LinearRegression
Add feature importance to LinearRegression class similar to random forest output.

LogisticRegression
LogisticRegression
Adds feature importance to LogisticRegression class similar to random forest output. This was originally modified from Apuã Paquola's script.
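The general pattern behind these classes can be sketched as a subclass that calls the parent `fit` and then stores a `feature_importances_` attribute derived from the coefficients. To keep the example self-contained, the base class below (`SlopeModel`) is a hypothetical stand-in that fits one univariate slope per feature; with scikit-learn the base class would be an estimator such as `sklearn.linear_model.Lasso`. Using the absolute value of each coefficient as its importance is an assumption, in line with the ranking by absolute importance described earlier.

```python
class SlopeModel:
    """Hypothetical stand-in for a linear estimator: one slope per feature."""

    def fit(self, X, y):
        n = len(y)
        mean_y = sum(y) / n
        self.coef_ = []
        for j in range(len(X[0])):
            col = [row[j] for row in X]
            mean_x = sum(col) / n
            cov = sum((x - mean_x) * (t - mean_y) for x, t in zip(col, y))
            var = sum((x - mean_x) ** 2 for x in col)  # a constant column would give zero here
            self.coef_.append(cov / var)
        return self


class SlopeModelWithImportance(SlopeModel):
    """Adds `feature_importances_`, mirroring the random forest attribute."""

    def fit(self, X, y):
        super().fit(X, y)
        # Assumption: importance is the absolute value of each coefficient.
        self.feature_importances_ = [abs(c) for c in self.coef_]
        return self


X = [[0.0, 3.0], [1.0, 2.0], [2.0, 1.0], [3.0, 0.0]]
y = [0.0, 2.0, 4.0, 6.0]
model = SlopeModelWithImportance().fit(X, y)
```

Both toy features receive the same importance because their slopes have equal magnitude and opposite sign, which lets the same elimination loop rank linear models and random forests through one attribute.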
SVM model classes for dRFE

LinearSVC Class
LinearSVC
Add feature importance to linear SVC class similar to random forest output.

LinearSVR Class
LinearSVR
Add feature importance to linear SVR class similar to random forest output.

SGDClassifier Class
SGDClassifier
Add feature importance to stochastic gradient descent classification class similar to random forest output.

SGDRegressor Class
SGDRegressor
Add feature importance to stochastic gradient descent regression class similar to random forest output.
Random forest helper functions

dRFE Subfunction
rf_fe
Iterate over features to be eliminated by step.
Args:
 estimator: Random forest classifier object
 X: a data frame of training data
 Y: a vector of sample labels from training data set
 n_features_iter: iterator for number of features to keep loop
 features: a vector of feature names
 fold: current fold
 out_dir: output directory. default '.'
 RANK: Boolean (True or False)
Yields:
 list: a list with number of features, normalized mutual information score, accuracy score, and array of the indices for features to keep

dRFE Step function
rf_fe_step
Apply random forest to training data, rank features, conduct feature elimination.
Args:
 estimator: Random forest classifier object
 X: a data frame of training data
 Y: a vector of sample labels from training data set
 n_features_to_keep: number of features to keep
 features: a vector of feature names
 fold: current fold
 out_dir: output directory. default '.'
 RANK: Boolean (True or False)
Yields:
 dict: a dictionary with number of features, normalized mutual information score, accuracy score, and selected features
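The two helpers above amount to a loop: fit, rank the surviving features, keep the top fraction, and repeat. The sketch below shows that loop in plain Python with a caller-supplied importance function standing in for refitting a random forest at every step. The function name, the `keep_rate` parameter, and the dictionary keys are illustrative assumptions rather than the package's actual interface.

```python
def rfe_loop_sketch(features, importance_fn, keep_rate=0.8):
    """Repeatedly score the surviving features and keep the top fraction."""
    if not 0 < keep_rate < 1:
        raise ValueError("keep_rate must be strictly between 0 and 1")
    history = []
    current = list(features)
    while len(current) > 1:
        scores = importance_fn(current)              # one "fit" per step
        n_keep = max(1, int(len(current) * keep_rate))
        ranked = sorted(current, key=lambda f: abs(scores[f]), reverse=True)
        current = ranked[:n_keep]                    # drop the lowest-ranked features
        history.append({"n_features": n_keep, "selected": list(current)})
    return history


# Toy importance function: fixed weights, standing in for a refitted model.
weights = {"a": 0.5, "b": -0.1, "c": 0.9, "d": 0.05}
history = rfe_loop_sketch(list(weights),
                          lambda feats: {f: weights[f] for f in feats},
                          keep_rate=0.5)
```

In a real run each step would also record performance metrics (for example NMI and accuracy from the OOB predictions) alongside the selected features.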
Linear model helper functions

dRFE Subfunction
regr_fe
Iterate over features to be eliminated by step.
Args:
 estimator: regressor or classifier linear model object
 X: a data frame of training data
 Y: a vector of sample labels from training data set
 n_features_iter: iterator for number of features to keep loop
 features: a vector of feature names
 fold: current fold
 out_dir: output directory. default '.'
 dev_size: developmental test set proportion of training
 SEED: random state
 RANK: Boolean (True or False)
Yields:
 list: a list with number of features, r2 score, mean square error, explained variance, and array of the indices for features to keep

dRFE Step function
regr_fe_step
Splits the training data to create a developmental dataset, applies the estimator to the developmental dataset, ranks features, and conducts a single feature elimination step.
Args:
 estimator: regressor or classifier linear model object
 X: a data frame of training data
 Y: a vector of sample labels from training data set
 n_features_to_keep: number of features to keep
 features: a vector of feature names
 fold: current fold
 out_dir: output directory. default '.'
 dev_size: developmental test set proportion of training
 SEED: random state
 RANK: Boolean (True or False)
Yields:
 dict: a dictionary with number of features, r2 score, mean square error, explained variance, and selected features
