Using a Support Vector Machine to Classify Income Levels (> 50k or <= 50k) (Part 3/5)

邱麗安 Annette Chiu
7 min read · Dec 24, 2018


We decided to use an SVM classifier to learn and predict whether a person’s income would exceed $50,000 a year, based on various features (such as age, race, and education).

SVM to Classify Income Levels (Part 1/5) Feature Analysis

SVM to Classify Income Levels (Part 2/5) Maximize the Margins

SVM to Classify Income Levels (Part 3/5) SVM Method Implementation

SVM to Classify Income Levels (Part 4/5) L1 Regularization

SVM to Classify Income Levels (Part 5/5) Packaged SVM Classifier

Implementation details

The main function that implements the traditional SVM method is learnSvmW. As input the function takes:

x_train : matrix of observations (rows) and features (columns) used for training

y_train : column vector of actual class values used for training

x_test : matrix of observations (rows) and features (columns) used for checking the error of the trained parameters

y_test : column vector of actual class values used for checking error of trained parameters

epsilon : the size of each gradient descent step (smaller value = smaller steps)

epochs : number of times to cycle through the data during training

And returns :

wfinal : final set of feature weights for trained model

error : vector of error rates (computed every 10 records) from applying the trained weights to the testing data
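As an illustration, a call might look like this (a sketch only; x_train, y_train, x_test, and y_test are assumed to already hold the encoded census data, and the epsilon value here is just an example):

% Example call to learnSvmW (illustrative values)
epsilon = 0.001;   % gradient-descent step size
epochs  = 1;       % one pass over the ~32,000 training rows
[wfinal, error] = learnSvmW(x_train, y_train, x_test, y_test, epsilon, epochs);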

The function starts by adding a column of 1’s to the training data to account for b being included in the w-vector.

Weights are initialized by setting w equal to the mean for each feature. The innermost loops in the function iterate through each training example, i, and each feature value, j, and use the update formula derived above to re-calculate and update the weight for that data point and feature (w(j)).

Because we are minimizing, we use gradient descent and thus subtract (step down) from the previous w(j).

Gamma (ɣ) is calculated differently depending on whether the actual class for that data point is 1 or -1:
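The exact formula comes from the derivation in Part 2 and is not reproduced here; one form consistent with this description (the sub-gradient of the hinge loss, which the sketches below assume) is:

w_j \leftarrow w_j - \varepsilon\,\gamma,
\qquad
\gamma =
\begin{cases}
-\,x_{ij} & \text{if } y_i = +1 \text{ and } \mathbf{w}\cdot\mathbf{x}_i < 1,\\
+\,x_{ij} & \text{if } y_i = -1 \text{ and } \mathbf{w}\cdot\mathbf{x}_i > -1,\\
0 & \text{otherwise.}
\end{cases}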

For every 10 updates (of all features, j, for a row, i) we call a function, calcAccuracy, passing the testing data (all rows, features, and classes) and the most recent weights, w, to determine the error (1 - accuracy). We use this error rate to decide when to stop training (in the outermost loop of the main function): if the error rate goes up, we exit training and the latest weights, w, become our final weight values. We check the error on the testing data because, the longer you train, the lower the error on the training data tends to go, but there is a risk of overfitting the model to that specific data, so it would perform poorly in the real world on different data. The testing data therefore stands in for that ‘real world’ other data set. Once the testing error starts to increase, it indicates the model is starting to overfit the training data.

Between the error check (outer loop) and the model fitting (inner two loops) there is another loop, controlled by the epochs variable, that lets us run the fitting over multiple passes through the same data. Our data has 32,000 records, so we did not use any value other than 1 for our final version, but when testing with smaller data sets it was useful to iterate through the data more than once.
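Putting the pieces together, a minimal MATLAB sketch of this structure (not our exact code; it assumes the hinge sub-gradient form of gamma shown above, and calcAccuracy is sketched further down) looks like this:

function [wfinal, error] = learnSvmW(x_train, y_train, x_test, y_test, epsilon, epochs)
    % Fold b into w by appending a column of 1's to both data sets
    x_train = [x_train ones(size(x_train, 1), 1)];
    x_test  = [x_test  ones(size(x_test,  1), 1)];
    w = mean(x_train, 1)';            % initialize each weight to the feature mean
    error = [];
    prevErr = Inf;
    for e = 1:epochs                  % optional extra passes over the same data
        for i = 1:size(x_train, 1)    % each training example
            for j = 1:numel(w)        % each feature (including the bias column)
                if y_train(i) * (x_train(i, :) * w) < 1
                    gamma = -y_train(i) * x_train(i, j);   % inside the margin / misclassified
                else
                    gamma = 0;                             % outside the margin: no penalty
                end
                w(j) = w(j) - epsilon * gamma;             % gradient descent: step down
            end
            if mod(i, 10) == 0        % every 10 rows, check error on the testing data
                err = 1 - calcAccuracy(x_test, y_test, w);
                error(end + 1) = err; %#ok<AGROW>
                if err > prevErr      % testing error went up: stop to avoid overfitting
                    wfinal = w;
                    return;
                end
                prevErr = err;
            end
        end
    end
    wfinal = w;
end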

Another function, selectBestFeatures, helps with feature selection. It wraps learnSvmW, so it takes all of the same inputs, plus one additional input that determines which of our two implementations is used,

Traditional or Traditional with L1 Regularization:

lambda : a hyper parameter for L1 Regularization (if lambda = 0, the traditional implementation is used, i.e. no L1 penalty)

The function returns:

featsIdx : the index (into the cell array of feature groups) that points to the full list of features that performed best

featsOut : the full list of features that performed best

final_acc : the accuracy of the best set of features

error : vector of error rates (for every 10 records) of trained weights used on testing data

runtime : time taken to train the model (only the time for the last, i.e. the best, trained model is returned)
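A call to the wrapper might look like this (a sketch; the argument order and the epsilon/epochs values are assumptions, not our exact interface):

% lambda = 0 selects the traditional (non-L1) implementation
lambda = 0;
[featsIdx, featsOut, final_acc, error, runtime] = ...
    selectBestFeatures(x_train, y_train, x_test, y_test, epsilon, epochs, lambda);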

Not accounting for the one-hot encoding, and after removing 2 features (for the reasons stated before), we are left with 12 features. Given the amount of data we have, that is not an unreasonable number of features, but we were not sure how useful they all were. This function iterates through each individual feature, calculates the accuracy, and saves the feature with the highest accuracy. It then takes the best set(s) and iterates through all the remaining features again, calculating the accuracy, and as long as one combination has a higher accuracy it saves that new combination. It continues until the accuracy decreases, and returns the best set. We call this function for multiple epsilon (step size) values (results provided below).

Because of our one-hot encoding of the categorical variables, we had to treat each ‘feature’ as a ‘group of features’. This is done with a cell data type in MATLAB, which lets us have a key (e.g. ‘Sex’) that contains a vector of 2 encoded columns (‘Sex_male’ and ‘Sex_female’). Then, when adding features, we can add all of the encoded columns for a group together. This function calls the main learnSvmW function to learn the weight values, w, for that list of features and, as described above, learnSvmW also calls calcAccuracy and returns the accuracy. The accuracy is used to determine when to stop adding more features. The best set of features (both the keys and the list of encoded features) for that epsilon value, along with its accuracy, is returned.
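The greedy search and the grouping might be sketched as follows (the group names and column indices below are illustrative, not our real encoding; x_train, y_train, x_test, y_test, epsilon, and epochs are assumed to be in the workspace):

% Each row of the cell array pairs a key with the columns of its one-hot encoding
featGroups = { 'Age',          1;
               'Sex',          [2 3];
               'Education',    4:19;
               'HoursPerWeek', 20 };

selected = [];                          % encoded columns chosen so far
selectedKeys = {};                      % keys of the chosen groups
bestAcc = -Inf;
improved = true;
while improved
    improved = false;
    bestK = 0;
    for k = 1:size(featGroups, 1)
        if any(ismember(featGroups{k, 2}, selected)), continue; end
        cols = [selected featGroups{k, 2}];          % try adding this whole group
        [~, err] = learnSvmW(x_train(:, cols), y_train, ...
                             x_test(:, cols),  y_test, epsilon, epochs);
        acc = 1 - err(end);
        if acc > bestAcc                % keep the combination only if accuracy improved
            bestAcc = acc;
            bestK = k;
            improved = true;
        end
    end
    if improved
        selected = [selected featGroups{bestK, 2}];
        selectedKeys{end + 1} = featGroups{bestK, 1}; %#ok<AGROW>
    end                                 % otherwise accuracy decreased, so we stop
end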

The final function, calcAccuracy, calculates the projected class (via a dot product) for each row based on the input data (feature values) and the input weights for each feature (including ‘b’). It then compares the projected class against the actual class and counts the number of correctly classified records. It divides by the total number of records to get the accuracy, which it returns.

It takes as input :

x : a matrix of data records to be classified (rows) and features (columns)

y : column vector of the actual class values for each data record

w : feature weights (typically learned from the training process)

And returns:

accuracy : number of correctly classified records divided by total number of records
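A minimal sketch of calcAccuracy, assuming the column of 1’s (for b) has already been appended to x so that it matches w:

function accuracy = calcAccuracy(x, y, w)
    pred = sign(x * w);                      % projected class (+1 / -1) for every row
    accuracy = sum(pred == y) / numel(y);    % correctly classified / total records
end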

The parameters used for the traditional implementation are :

w — the weights, one per feature, that are learned using the training data and are then used to predict the class of new data.

b — the y-intercept (offset from 0) for the w-vector

gamma (ɣ) — a penalty on the weight for a point, depending on how far away from the separator the point is.

The hyper parameters used in the traditional implementation are :

epsilon (ε) — controls the size of each step in the gradient-descent process. We tried 5 different epsilon values to determine which provided the best accuracy (see the table below for results).

epochs — the number of times to iterate through the full data set in the process of learning the w’s. Because we had so much data, we only ran for a single epoch, but this would be useful when testing with smaller sets of data.
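The sweep over step sizes can be sketched like this (the five values shown are illustrative; the ones we actually used appear in the results below):

epsilons = [0.01 0.005 0.001 0.0005 0.0001];
acc = zeros(size(epsilons));
for k = 1:numel(epsilons)
    [~, err] = learnSvmW(x_train, y_train, x_test, y_test, epsilons(k), 1);
    acc(k) = 1 - err(end);      % accuracy of the final trained weights on the testing data
end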

Below are the results of the traditional implementation for each epsilon (ε) value, using all features (Age, WorkClass, Education, MaritalStatus, Occupation, Relationship, Race, Gender, CapitalGain, CapitalLoss, HoursPerWeek, NativeCountry).

Looking at the same results, but without exiting learning when the testing error goes up, we can see that the accuracy decreases for the same features and the same epsilon values, so we are likely overfitting our data. The run times are also 4–10x longer.

For each epsilon (ε) we also found the best set of features, i.e. the set with the highest accuracy. The table below shows the best set of features, runtime, and accuracy for each epsilon (ε).

The runtime for 0.0001 being the longest makes sense: it takes the smallest steps, so it ends up needing more steps and thus more time. Except for epsilon = 0.001, every epsilon value has a higher accuracy with the ‘best features’ than with ‘all features’. At epsilon = 0.001 the values are very close, but the difference could be because the error rate increased when one feature was added (which stopped the learning) and then would have gone back down when a subsequent feature was added. The chart below shows the error rate, stopping when the error increases. We see that ‘all features’ has a more gradual decrease before stopping. It also exits much earlier (after ~800 iterations) than the ‘best features’ (~1,200 iterations).

Team:

Annette Chiu ( Python Coding )

Mark Giroux ( Project Management / Math )

Spencer Sigtryggsson ( Math )
