13.4.5 Sequential Feature Selection -- Code Examples (L13: Feature Selection)

Okay, it's time to conclude our discussion on feature selection. In this video, I will demonstrate how to use sequential feature selection in Python. I'll start with an example using the MLxtend library, which contains the original implementation I developed several years ago. Later, I'll also demonstrate how to achieve similar results using scikit-learn, which offers a more streamlined implementation.

Before we dive into the code, I encourage you to check out the documentation, which contains additional examples that I won't cover in this video to avoid making it too lengthy and overwhelming. It's always helpful to refer to the documentation for more in-depth information.

First, let's load the "watermark" extension, which I developed to keep track of the software versions used in my notebooks over the years. Checking that the version numbers match our expectations is good practice, since some options may behave differently or stop working across versions. We will also be using matplotlib later, so let's import it and make sure plots display correctly in the notebook.
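A setup cell along these lines should do the trick (the exact list of packages passed to watermark here is just my guess, not necessarily what the notebook shows):

# Load the watermark extension and print Python and package versions
%load_ext watermark
%watermark -v -p numpy,pandas,matplotlib,sklearn,mlxtend

# Make sure matplotlib figures render inside the notebook
import matplotlib.pyplot as plt
%matplotlib inline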

Now, let's prepare the dataset for feature selection. As in previous videos, we will be using the wine dataset, which we load from the UCI Machine Learning Repository using pandas. After loading it, we print out some basic information to confirm that everything loaded correctly, and we check that the class labels are represented as integers.
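A minimal loading cell could look roughly like this (the URL is the standard UCI location for the wine data; the printouts are placeholders for whatever checks the notebook actually runs):

import pandas as pd

# Load the wine data directly from the UCI repository (the file has no header row)
df_wine = pd.read_csv('https://archive.ics.uci.edu/ml/'
                      'machine-learning-databases/wine/wine.data', header=None)

print('Shape:', df_wine.shape)               # 178 rows: 1 label column + 13 feature columns
print('Class labels:', df_wine[0].unique())  # integer labels 1, 2, 3 in the first column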

Next, we divide the dataset into an 80% training set and a 20% test set, as we have done in previous videos. We also standardize both sets, because the K-nearest neighbor classifier we will be using is sensitive to feature scaling.
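Sketched out, it might look as follows (the random seed and the stratified split are assumptions on my part):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = df_wine.iloc[:, 1:].values  # the 13 feature columns
y = df_wine.iloc[:, 0].values   # the class label column

# 80/20 split; stratify keeps the class proportions the same in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)

# Fit the scaler on the training data only, then standardize both sets;
# here the standardized arrays simply replace the original ones
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)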

To establish a baseline before feature selection, we fit a K-nearest neighbor classifier on the standardized dataset and compute the training and test accuracies. In this example, we arbitrarily choose five neighbors for the classifier, but this parameter could be subject to grid search for optimal performance. Although we won't perform grid search here to keep the code and video simpler, combining grid search with sequential feature selection is a common approach. You can find examples of this in the documentation.
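The baseline cell could look like this, with k=5 as mentioned and everything else left at scikit-learn's defaults:

from sklearn.neighbors import KNeighborsClassifier

# Baseline: KNN trained on all 13 standardized features
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

print('Training accuracy:', knn.score(X_train, y_train))
print('Test accuracy:', knn.score(X_test, y_test))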

The baseline results show that we achieve 98.6% accuracy on the training set and 94% accuracy on the test set. The performance is quite good using all 13 features from the wine dataset. However, there may be some overfitting due to the curse of dimensionality associated with K-nearest neighbor classifiers. To mitigate this, we can select a smaller subset of features to potentially improve performance.

Now, let's demonstrate how to use sequential feature selection to select a subset of five features. We import the SequentialFeatureSelector class from the MLxtend library and shorten it to SFS for convenience. This class takes the model, the desired feature subset size, and the selection strategy as input; the strategy is controlled by the forward and floating flags, and here we configure plain sequential forward selection. The verbose parameter controls how much output is displayed during training, which is useful for monitoring progress. We specify accuracy as the scoring metric and use 5-fold cross-validation to evaluate the feature subsets. Parallel processing can be enabled by setting the n_jobs parameter to a positive integer, or to -1 to utilize all available CPU cores; in this case, we set it to 8 for faster execution.
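Put together, the selector might be set up roughly like this, with the argument values following the description above:

from mlxtend.feature_selection import SequentialFeatureSelector as SFS

sfs = SFS(knn,
          k_features=5,        # target size of the feature subset
          forward=True,        # forward selection ...
          floating=False,      # ... without the floating variant
          verbose=2,           # print progress while the search runs
          scoring='accuracy',  # metric used to score candidate subsets
          cv=5,                # 5-fold cross-validation for each subset
          n_jobs=8)            # run the evaluations on 8 CPU cores

sfs = sfs.fit(X_train, y_train)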

The output shows the progress of the feature selection process, starting with one feature and gradually increasing the number of features until reaching the desired subset size of five. The performance of each feature subset is also displayed, indicating improvement as more features are added.

After completion, we can access the selected feature indices and the corresponding feature names through the k_feature_idx_ and k_feature_names_ attributes of the fitted sfs object. The cross-validation score of the selected subset is stored in the k_score_ attribute, and the full step-by-step history of the search is kept in subsets_. Let's print out the selected feature indices, names, and the subset's score:

print('Selected feature indices:', sfs.k_feature_idx_)
print('Selected feature names:', sfs.k_feature_names_)
print('Selected subset CV score:', sfs.k_score_)

The output shows the indices and names of the five selected features along with the cross-validation score of that subset.

Next, we can retrain the K-nearest neighbor classifier on the selected feature subset. To do this, we create new training and test sets that contain only the selected features, using the transform method of the sfs object:

X_train_selected = sfs.transform(X_train)
X_test_selected = sfs.transform(X_test)

After transforming the datasets, we can fit a new K-nearest neighbor classifier on the selected feature subset and compute the training and test accuracies. Let's print out the results:

# Refit the classifier using only the five selected features
knn_selected = KNeighborsClassifier(n_neighbors=5)
knn_selected.fit(X_train_selected, y_train)

train_acc_selected = knn_selected.score(X_train_selected, y_train)
test_acc_selected = knn_selected.score(X_test_selected, y_test)

print('Training accuracy on selected features:', train_acc_selected)
print('Test accuracy on selected features:', test_acc_selected)

The output shows the training and test accuracies achieved using only the five selected features.

By comparing the results with the baseline accuracies, we can evaluate the impact of feature selection on the classifier's performance. In some cases, feature selection can lead to better generalization and improved model performance by reducing overfitting and removing irrelevant or redundant features.

That's it for the demonstration using the MLxtend library. Now, let's move on to using scikit-learn for sequential feature selection.

In scikit-learn, the SequentialFeatureSelector class is available in the feature_selection module. We import it as follows:

from sklearn.feature_selection import SequentialFeatureSelector

The usage of the scikit-learn version is similar to the MLxtend version, with some differences in parameter names and attributes. The scikit-learn implementation plugs directly into pipelines and grid search, while the MLxtend implementation additionally offers the floating variants and records the score history of the entire search.
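As a rough sketch of the equivalent call (the concrete values for scoring, cv, and n_jobs are just illustrative):

from sklearn.feature_selection import SequentialFeatureSelector

# Same idea as before, expressed with scikit-learn's parameter names
sfs_sk = SequentialFeatureSelector(knn,
                                   n_features_to_select=5,  # instead of k_features
                                   direction='forward',     # instead of forward=True
                                   scoring='accuracy',
                                   cv=5,
                                   n_jobs=-1)               # use all available cores
sfs_sk.fit(X_train, y_train)

print('Selected feature mask:', sfs_sk.get_support())       # boolean mask over the 13 features
X_train_selected = sfs_sk.transform(X_train)                # keep only the chosen columns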

I hope this demonstration helps you understand how to use sequential feature selection in Python. Remember to refer to the documentation for additional examples and information.
