In support vector machines, \(C\) is a hyperparameter determining the penalty for misclassifying an observation. One method for handling imbalanced classes in support vector machines is to weight \(C\) by classes, so that
where \(C\) is the penalty for misclassification, \(w_j\) is a weight inversely proportional to class \(j\)'s frequency, and \(C_j\) is the \(C\) value for class \(j\). The general idea is to increase the penalty for misclassifying minority classes to prevent them from being "overwhelmed" by the majority class.
In scikit-learn, when using
SVC we can set the values for \(C_j\) automatically by setting
balanced argument automatically weighs classes such that:
where \(w_j\) is the weight to class \(j\), \(n\) is the number of observations, \(n_j\) is the number of observations in class \(j\), and \(k\) is the total number of classes.
# Load libraries from sklearn.svm import SVC from sklearn import datasets from sklearn.preprocessing import StandardScaler import numpy as np
Load Iris Flower Dataset
#Load data with only two classes iris = datasets.load_iris() X = iris.data[:100,:] y = iris.target[:100]
Imbalanced Iris Flower Classes
# Make class highly imbalanced by removing first 40 observations X = X[40:,:] y = y[40:] # Create target vector indicating if class 0, otherwise 1 y = np.where((y == 0), 0, 1)
# Standarize features scaler = StandardScaler() X_std = scaler.fit_transform(X)
Train Support Vector Classifier With Weighted Classes
# Create support vector classifier svc = SVC(kernel='linear', class_weight='balanced', C=1.0, random_state=0) # Train classifier model = svc.fit(X_std, y)