A single line of Python code: multi-class classification with visualization
I don’t think necessity is the mother of invention. Invention, in my opinion, arises directly from idleness, possibly also from laziness — to save oneself trouble.
— Agatha Christie
I couldn’t agree more with the best-selling author of all time.
Open-source Python packages are so rich that one hardly needs to write lengthy code, which is a boon to lazy programmers. Or, to put it another way, it frees up a lot of a programmer's time. Small snippets of code do wonders in a Jupyter Notebook.
For my Data Science project, I was trying out several classification algorithms. While writing code for my purpose was not difficult, I found myself re-using a large part of my code across projects. Therefore, I, the Indolent, decided to write generalized functions that can be copy-pasted into a new project to jump-start it.
I have a dataset, and after loading it into a pandas dataframe, the initial analysis is performed to take care of a) missing values and b) feature selection. My dataframe now has a target variable (dependent variable, with categorical data) and the predictor (independent) variables, whose values are all numeric.
I am ready to perform classification modelling. I want to test out several algorithms one after another, get the various scores, plot the corresponding confusion matrices, plot feature importance (if that is available for the algorithm) and finally compare the accuracies for all the algorithms of interest. And I am impatient.
Can I get a preview of how different classification algorithms are faring, right away, without coding?
These are the steps one must follow in the normal course (and which are to be automated):
1. Import all the necessary packages
2. Divide the dataset between the target (y) and the features (X)
3. Scale X
4. Split the data to train and test sets
5. Select a classification model
6. Train the model using the training dataset
7. Test the model
8. Get several scores to evaluate the model, using the test and predicted data
9. Plot the confusion matrix
10. Plot the feature importance (if that function is available for the selected algorithm)
11. Select another classification model and go to step 6, until the list of classifiers is exhausted
12. Compare the accuracies of all the models tried out in the form of a bar plot
Each of the above steps requires coding… and in totality, it results in quite a chunk of code. Wouldn’t you love to have it all by calling a single function? And have that in an installable package?
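The steps above can be sketched as one such function. The following is my own minimal simplification using scikit-learn, not the package's actual code; the function name and its signature here are illustrative assumptions, and the plotting steps are omitted for brevity:

```python
# Hypothetical sketch of the automated pipeline (not the package's actual code).
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

def train_test_report(df, target, models):
    """Run the listed steps for every model in `models` (a dict of name -> estimator)."""
    y = df[target]                               # step 2: target
    X = df.drop(columns=[target])                # step 2: features
    X = StandardScaler().fit_transform(X)        # step 3: scale (before the split, as in the article)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42)   # step 4: split
    accuracies = {}
    for name, model in models.items():           # steps 5 and 11: loop over classifiers
        model.fit(X_train, y_train)              # step 6: train
        y_pred = model.predict(X_test)           # step 7: test
        accuracies[name] = accuracy_score(y_test, y_pred)  # step 8: score
        print(name, classification_report(y_test, y_pred), sep='\n')
    return accuracies                            # step 12: compare (plotting omitted here)
```

A single call on a prepared dataframe then replaces the whole chunk of per-model boilerplate.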
Running the code
where calling the function “train_test_plot_def” spits out:
Don’t worry if you do not understand the dataset and, consequently, the significance of the plots; they are shown here for illustration purposes only, although a real-world dataset has been used.
To explain it in a nutshell: we are passing the dataframe, the name of the target column, the first letters of the classification algorithms (comma separated) and the size of the plots to this function. That’s all.
The models are evaluated by
a) Accuracy (proportion of true results among all classified cases)
b) Precision (proportion of predicted positives that are truly positive)
c) Recall (proportion of actual positives that are correctly classified)
d) F-score (harmonic mean of precision and recall)
e) Support (number of actual occurrences of the class in the specified dataset)
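A tiny hand-made example (my own illustration, not taken from the article's dataset) makes these five metrics concrete:

```python
# Toy multi-class example to illustrate accuracy, precision, recall, F-score, support.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 0, 1, 1, 1, 2]   # actual classes
y_pred = [0, 1, 1, 1, 2, 2]   # predicted classes

accuracy = accuracy_score(y_true, y_pred)  # 4 of the 6 cases are classified correctly
prec, rec, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=[0, 1, 2])
# For class 0: precision 1.0 (the one sample predicted as 0 is truly 0),
#              recall 0.5 (only one of the two actual 0s was found),
#              support 2 (two actual occurrences of class 0).
```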
By printing the docstring of the function we get the full documentation.
Performs the following operations:
1. Splits the dataframe into target (dependent variable) and predictors (independent variables)
2. Scales the values of independent variables (all input values must be numeric)
3. Splits the dataset into training and testing sets
4. Loops through the list of classification algorithms to
a) Train the model using the training set
b) Test the model on the test set
c) Evaluate and report performance
d) Plot Confusion Matrix
e) Plot feature importance (if it is available for this particular algorithm)
5. Shows comparative plot of accuracies for all the algorithms
df (pandas dataframe): the whole dataset containing observations for both target and predictor variables
target (string): column name of the target variable in df, e.g. 'Species'
algos (comma separated character string): the first letters of the classification algorithms to be applied, e.g. 'l,r,x'
l: Logistic Regression
k: K-Neighbors
s: Support Vector Machine
d: Decision Tree
r: Random Forest
x: XGBoost
size (int): size of the plots, typical values are 5, 10, 15
train_test_plot_def(iris_df, 'Species', 'l,r,x', 5)
iris_df: input dataframe, e.g. iris_df = pd.read_csv('Iris.csv')
'Species': name of the target column in iris_df
'l,r,x': first letters of (L)ogisticRegression, (R)andomForestClassifier and (X)GBClassifier (case insensitive)
5: size of the plots generated
Each generalization has its limits. Here, we are limited to 6 classification algorithms (Logistic Regression, K-Neighbors, Support Vector Machine, Decision Tree, Random Forest and XGBoost). Moreover, no hyperparameter tuning has been done; the classifiers are called mostly with their default parameters (hence the suffix ‘_def’ in the function name). This means, as we already know, that classifiers with default settings may not be suitable for special cases like imbalanced data, where one or more classes may be underrepresented.
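As one mitigation (my own illustration, not part of the package), scikit-learn classifiers can counter class imbalance via their class_weight parameter:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# class_weight='balanced' reweights classes inversely to their frequency,
# so minority classes are not drowned out by the default settings.
lr = LogisticRegression(class_weight='balanced', max_iter=500)
rf = RandomForestClassifier(class_weight='balanced', random_state=0)
```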
This function can easily be modified to take the classification models of choice, with prescribed hyperparameters, as input. Let us call the new function train_test_plot.
This is how the function has to be called:
In summary, for each classification model, the relevant Python package first has to be imported. Then the model, with its chosen hyperparameters, is added to a pandas series, which is passed to the new function. So here we are passing objects, in contrast with the earlier function, where the first letters of the pre-defined model names were used.
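This is how such a call might be shaped; a sketch under my own assumptions, as the published package's exact signature may differ:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Models with hand-picked hyperparameters, keyed by a display name.
models = pd.Series({
    'Logistic Regression': LogisticRegression(C=0.5, max_iter=500),
    'Random Forest': RandomForestClassifier(n_estimators=300, max_depth=6),
})

# Hypothetical call shape, mirroring the earlier function but taking model objects:
# train_test_plot(iris_df, 'Species', models, 5)
```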
The packages have been deployed on PyPI and can be installed using:
pip install train_test_plot_def
pip install train_test_plot