Machine Learning Practice


19th November 2018 Machine Learning Python

I have spent a lot of time recently watching videos and completing programming exercises in the deeplearning.ai online course on Coursera. All of that time I have had the feeling that I need to apply some of what I’m learning to my own project. So, I decided to try the techniques I have learned on some data that is not part of the course, to see how I was progressing.

These are the courses I have completed so far.

https://www.coursera.org/learn/machine-learning

https://www.coursera.org/specializations/deep-learning

I’m going to stick with Python for now, but I want to create my own Neural Network classes, based on the teachings from the course, rather than use the code directly from the exercises. I’ve been itching to reduce the number of parameters being passed around and make the code more to my own style of programming. I appreciate that there are frameworks out there that make this simpler, but I’m still learning and want to build everything from the ground up. It also gives me a bit more practice with Python.

For the data, I decided to keep it simple and use the Titanic: Machine Learning from Disaster exercise from Kaggle. This is one of their “Getting Started” exercises. It is nice and simple, and there are lots of discussions around it in case I need any help.

The task

Kaggle have provided a database of 891 passengers on the Titanic with the following details about each:

Name
Male or Female
Age
Ticket Class
…
Whether they survived or not

The task I am setting myself is to use a machine learning algorithm to create a predictor that will use the properties of a passenger to try to determine whether they survived or not. Morbid, but hopefully quite straightforward.

Kaggle also provides a test set, where the passengers’ survival is not specified. This is used to submit your results to them to see how you fared.

My approach

As I also want to build some neural network classes, I’m going to use a neural network. To start with I will use no hidden layers, which effectively reduces it to logistic regression (a single sigmoid output unit fed directly by the input features). Then I’ll try changing the hyperparameters and adding hidden layers to see if I can improve my results.

On first looking at the data I discovered that it is not entirely "clean". The initial problems I spotted were:

The age is only defined for 715 of the 891 passengers
The ticket numbers have different formats
The cabin is only defined for 205 of the 891 passengers in the training set
A typical cabin is marked as "C99", whereas some just have a single letter (e.g. "D"), and some passengers have multiple cabins marked (e.g. "C23 C25 C27"). Some are even more confusing (e.g. "F G63")
Two of the passengers do not have "embarked" defined

There are similar issues on the test data set.

Cleaning the data

At first, I'm going to try and keep things simple, so I'll only deal with properties that can be treated as a number. Where something is not defined, I'll treat that as just another number. This means losing the "Name", "Ticket" and “Cabin” properties from the data. I can add these in later, if needed. That leaves me with the following columns:

Pclass. All rows contain either 1, 2 or 3
Sex. All rows contain either "male" or "female". I’ll convert these to 0 and 1 respectively.
Age. Ranges from 0.42 to 80. Where it is not defined pandas will use NaN, which I’ll set to an impossible value (0), to see what that does. I might need some guidance on what to do here; I may even investigate creating an extra "hasAge" parameter.
SibSp. Number of Siblings/Spouses aboard. All rows in range from 0 to 8.
Parch. Number of Parents/Children aboard. All rows in range from 0 to 6.
Fare. All rows range from 0 to 512.3292.
Embarked. "C", "Q" or "S" or undefined. Mapped to 1, 2 or 3 or 0 respectively.
Survived. 0 or 1. Will be used for training, to track how well the model is predicting the training set results and adjusting the model.

I also decided to ignore "PassengerId", which I believe is just a unique number and not related to the tragedy. I can always add it in again later if needed. In Python I used the "pandas" library to read the CSV file into a "numpy" (a mathematical library that I’ll be using) matrix.

import numpy as np
import pandas
import TitanicConverters as tc
import random

dictionary = pandas.read_csv("all\\train.csv", 
                             quotechar='"', 
                             skipinitialspace=True, 
                             usecols=["Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"],
                             dtype={"Survived": float, "Pclass": float, "SibSp": float, "Parch": float, "Fare": float},   # np.float was removed in NumPy 1.24
                             converters={"Sex": tc.SexConverter,"Age": tc.AgeConverter, "Embarked": tc.EmbarkedConverter})
data = dictionary.values

In the above code, the "usecols" parameter restricts which columns are imported, the "dtype" forces the integer columns to be treated as floats, and the "converters" parameter specifies the functions that will convert the string columns to floats.

The final line in the code converts the data from the pandas DataFrame (which I named "dictionary") to a numpy matrix. The converter functions are simple enough to write, and I kept them in a separate Python file. For example:

def EmbarkedConverter(embarkedAsString):
    if embarkedAsString == "C":
        return 1
    elif embarkedAsString == "Q":
        return 2
    elif embarkedAsString == "S":
        return 3
    else:
        return 0
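
The other two converters are not shown here. A minimal sketch of how they might look, assuming pandas hands each raw CSV field to the converter as a string and that a missing age arrives as an empty string. The function bodies are illustrative guesses that follow the mappings in the column list above, not the contents of the actual TitanicConverters file.

```python
def SexConverter(sexAsString):
    # "male" -> 0, "female" -> 1 (anything unexpected is treated as 0 in this sketch)
    if sexAsString.strip().lower() == "female":
        return 1.0
    return 0.0


def AgeConverter(ageAsString):
    # Missing ages become the "impossible" value 0, as described above
    text = ageAsString.strip()
    if text == "":
        return 0.0
    return float(text)


print(SexConverter("female"), SexConverter("male"))
print(AgeConverter("0.42"), AgeConverter(""))
```

Returning floats from every converter keeps the whole matrix in a single numeric type once the DataFrame is converted.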

Splitting the data

I want to split the Kaggle "train" data into a training set and a dev set. I will use the latter to measure how well I’m doing and to tune my model’s hyperparameters. In this example, there is no need for my own test set, as Kaggle have already provided one. This will be used as a final measure of how well I did.

From the suggestions in the courses I have done I’m going to put 20% of the Kaggle "train" data into my dev set, with the remaining data being put into my training set. The 20% will need to be randomly selected from the Kaggle data. I used the following code to randomly shuffle the data and then divide it into the two sets. I saved the division into two new data files, so as to keep the same data for all of my testing.

np.random.shuffle(data)

dev_split = int(0.2 * len(data))

dev_data = data[:dev_split]
train_data = data[dev_split:]

np.save("all\\dev.npy", dev_data)
np.save("all\\train.npy", train_data)
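
Saving the split to disk is one way to keep the same data between runs. Another option is to seed the shuffle so that the split can be regenerated on demand. The sketch below is an alternative rather than the code used for this post; the function name `split_dev_train` and the seed value are made up for illustration.

```python
import numpy as np


def split_dev_train(data, dev_fraction=0.2, seed=0):
    """Shuffle the rows with a seeded generator, then split off a dev set."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(data)          # Shuffled copy; the input is left untouched
    dev_split = int(dev_fraction * len(shuffled))
    return shuffled[:dev_split], shuffled[dev_split:]


example = np.arange(40, dtype=float).reshape(10, 4)   # Stand-in for the real matrix
dev_data, train_data = split_dev_train(example)
print(dev_data.shape, train_data.shape)
```

With the same seed and the same input, the function returns the same two sets every time.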

My Neural Network Classes

The neural network that I created is based on the teachings of the Coursera courses that I have done (see above). I decided to take the code presented and use Python classes to implement it, to avoid the passing of so many parameters between the functions (often passed as dictionaries). I cannot include all of the code here, as it is based on that from the courses, and I am not allowed to repeat it (I signed up to an agreement not to as part of the courses).

I’ll show the class definitions and provide some comments to show what they implement, along with any original code that I wrote.

My starting point is to create a NeuralNetwork class, which can be instantiated with all of the parameters required and has methods to run the training and predictions.

from NeuralNetworkLayer import NeuralNetworkLayer, ActivationType
import numpy as np

class NeuralNetwork(object):

    def __init__(self, X, Y, layer_dimensions, learning_rate = 0.1, l2_lambda = 0):
        """layer_dimensions is an array of length number_layers + 1, where [0] is the input layer size and [-1] is the output layer size."""
        
        self._layer_dimensions = layer_dimensions
        self._number_layers = len(layer_dimensions) - 1     # Excludes input layer
        self.Y = Y
        self.l2_lambda = l2_lambda
        self.learning_rate = learning_rate

        # Normalise X and record the mean and variances as "self" variables (code not shown)

        # Initialise the parameters in the layers
        # Remember that layer_dimensions[] is one bigger than layers[]/_number_layers
        self._layers = []
        for l in range(self._number_layers):
            layer = NeuralNetworkLayer(self._layer_dimensions[l], self._layer_dimensions[l+1])      # (previous layer size, this layer size)
            self._layers.append(layer)

        # Set the final layer's activation function as Sigmoid. The rest will default to ReLU
        self._layers[self._number_layers - 1].set_activation_type(ActivationType.SIGMOID)


    def predict(self, X):
        # Normalise X, using the stored mean and variances
        # Calculate the activations for the given X with the current parameters (W and b in the layers)
        # Return the predictions (0 or 1)


    def train_model(self, num_iterations):
        for x in range(num_iterations):
            self.update_model()


    def update_model(self):
        
        # Calculate the activations with the current parameters (W and b in the layers)
        # Uses layer.calculate_forward_activations

        cost = self.compute_cost()
        print("Cost: " + str(cost))

        # Backward propagation (calculate the gradients for gradient descent)
        # Uses layer.calculate_backward_gradients

        # Update the parameters in the layers using the calculated gradients
        for layer in self._layers:
            layer.update_parameters(self.learning_rate)
		

    def compute_cost(self):
        
        # Compute the current cost using L2 regularisation (uses stored l2_lambda)
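
The normalisation step is left out of the listing above. As a rough illustration of the idea (fit the mean and variance on the training inputs, then reuse the same values at prediction time), here is a generic sketch. The function names and the small `epsilon` guard are assumptions for this example and are not taken from the class itself.

```python
import numpy as np


def fit_normaliser(X):
    """X has shape (n_features, n_examples). Returns per-feature mean and variance."""
    mean = X.mean(axis=1, keepdims=True)
    variance = X.var(axis=1, keepdims=True)
    return mean, variance


def apply_normaliser(X, mean, variance, epsilon=1e-8):
    """Scale X with previously stored statistics; epsilon avoids division by zero."""
    return (X - mean) / np.sqrt(variance + epsilon)


X_train = np.array([[1.0, 2.0, 3.0],
                    [10.0, 20.0, 30.0]])
mean, variance = fit_normaliser(X_train)
X_train_norm = apply_normaliser(X_train, mean, variance)

# New data is scaled with the training statistics, not its own
X_new = np.array([[2.0], [40.0]])
print(apply_normaliser(X_new, mean, variance))
```

Reusing the stored statistics matters because the dev and test inputs must go through exactly the same transformation as the inputs the parameters were trained on.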

This main NeuralNetwork class relies on NeuralNetworkLayer, which implements the individual layer functionality. The activation functions are recorded as variables, set in “set_activation_type” for the layer.

import numpy as np
from enum import Enum

class ActivationType(Enum):
    RELU = 1
    SIGMOID = 2

class NeuralNetworkLayer(object):
    """An individual layer in a neural network"""

    def __init__(self, previous_layer_size, this_layer_size):
        # Initialise W and b to the correct sizes
        # self.W = ...
        # self.b = ...
        self.Z = []
        self.A = []
        self.dA = []
        self.dZ = []
        self.dW = []
        self.db = []
        self.set_activation_type(ActivationType.RELU)


    def set_activation_type(self, type):
        self.activation_type = type
        if type == ActivationType.RELU:
            self.activation_function = NeuralNetworkLayer.relu
            self.backward_gradient_function = NeuralNetworkLayer.relu_backward
        else:
            self.activation_function = NeuralNetworkLayer.sigmoid
            self.backward_gradient_function = NeuralNetworkLayer.sigmoid_backward


    def calculate_forward_activations(self, prev_A):
        # Implement the linear part of a layer's forward propagation and then calculate the activations.


    def calculate_backward_gradients(self, dA, A_prev):
        # Calculate the gradients for gradient descent


    def update_parameters(self, learning_rate):
        # Update the parameters using the gradients calculated


    @staticmethod
    def sigmoid(Z):
        # Return the sigmoid activation

    @staticmethod
    def sigmoid_backward(A):
        # Return the sigmoid activation gradient

    @staticmethod
    def relu(Z):
        # Return the ReLU activation

    @staticmethod
    def relu_backward(A):
        # Return the ReLU activation gradient
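
The bodies of the activation functions are omitted above. For readers who have not met them, the standard textbook definitions can be sketched as follows. This is a generic illustration written for this post, not the course code, and the function names are chosen for the example. Both gradient functions take the activation A rather than Z, matching the signatures in the listing.

```python
import numpy as np


def sigmoid(Z):
    # Squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-Z))


def sigmoid_gradient(A):
    # Derivative of sigmoid with respect to Z, written in terms of its output A
    return A * (1.0 - A)


def relu(Z):
    # Passes positive values through and clamps negatives to 0
    return np.maximum(0.0, Z)


def relu_gradient(A):
    # Derivative of ReLU with respect to Z: 1 where the unit is active, else 0
    return (A > 0).astype(float)


Z = np.array([[-2.0, 0.0, 3.0]])
print(sigmoid(Z))
print(relu(Z), relu_gradient(relu(Z)))
```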

Training the Model

Here’s the code that loads in the data I cleaned earlier and then trains the neural network.

import numpy as np
from NeuralNetwork import NeuralNetwork

dev_data = np.load("all\\dev.npy")
train_data = np.load("all\\train.npy")

train_data_transpose = train_data.T
X = train_data_transpose[1:,:]
Y = train_data_transpose[0,:].reshape(1,X.shape[1])
l2_lambda = 0
learning_rate = 1
layer_dimensions = [X.shape[0],1]

model = NeuralNetwork(X, Y, layer_dimensions, learning_rate, l2_lambda)

model.train_model(1000)

predictions = model.predict(X)

actuals = (Y > 0.5)
matching_rows = (Y == predictions)
accuracy = np.sum(matching_rows) * 100.0 / matching_rows.shape[1]
print("Train accuracy: " + str(accuracy))

data_transpose = dev_data.T
X = data_transpose[1:,:]
Y = data_transpose[0,:].reshape(1,X.shape[1])
predictions = model.predict(X)

actuals = (Y > 0.5)
matching_rows = (Y == predictions)
accuracy = np.sum(matching_rows) * 100.0 / matching_rows.shape[1]
print("Dev accuracy: " + str(accuracy))
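
The accuracy calculation appears twice in the script, so it could be pulled out into a small helper. A minimal sketch, assuming the predictions are 0/1 (or boolean) values with the same shape as Y; the name `accuracy_percent` is made up for this example.

```python
import numpy as np


def accuracy_percent(predictions, Y):
    """Percentage of examples where the 0/1 prediction matches the label."""
    predicted = np.asarray(predictions).astype(bool)
    actuals = np.asarray(Y) > 0.5
    return float(np.mean(predicted == actuals) * 100.0)


Y_example = np.array([[1.0, 0.0, 1.0, 0.0]])
predictions_example = np.array([[1, 0, 0, 0]])
print(accuracy_percent(predictions_example, Y_example))   # 75.0
```

The training and dev measurements then become one call each, e.g. `accuracy_percent(model.predict(X), Y)`.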

It’s initially set up with just one output layer node (no hidden layers). After some experimentation (learning rate = 1.0, iterations = 1000) I got the following results.

Training set accuracy: 78.4%
Dev set accuracy: 79.77%

By adding a hidden layer of 7 units to the model, I was able to improve the accuracy as follows.

Training set accuracy: 80.08%
Dev set accuracy: 81.46%

Neither increasing the number of units in the hidden layer nor adding further hidden layers improved the results, so I decided to take another look at the data I was using.
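
With the constructor shown earlier, adding a hidden layer only means changing layer_dimensions, for example from [7, 1] to [7, 7, 1] for the seven input features. The sketch below lists the parameter shapes each choice implies. It assumes the common convention that W has shape (this layer size, previous layer size) and b has shape (this layer size, 1); the initialisation code is not shown in this post, so treat that convention and the helper name `parameter_shapes` as assumptions.

```python
def parameter_shapes(layer_dimensions):
    """Return a list of (W shape, b shape) pairs, one per non-input layer."""
    shapes = []
    for l in range(1, len(layer_dimensions)):
        w_shape = (layer_dimensions[l], layer_dimensions[l - 1])
        b_shape = (layer_dimensions[l], 1)
        shapes.append((w_shape, b_shape))
    return shapes


print(parameter_shapes([7, 1]))      # No hidden layer
print(parameter_shapes([7, 7, 1]))   # One hidden layer of 7 units
```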

Improving the Features

In order to improve my model, I then decided to take a look at the input features to see if I could add more information for the model to utilise. First, I decided to look at “Age”. Only about 80% of the passengers (715 of 891) have their age specified and I had decided to set the others to 0, which may have reduced the usefulness of the feature.

My hunch is that adding a “hasAge” feature will help this. This will be set to 1 for those passengers where an age is specified and set to 0 where it is not. Hopefully, this gives the model some extra ammunition to work with.

Here is the code that I added after loading the original CSV file, to add the hasAge column. I saved the dev and train data sets to some different files, to be loaded for use in the neural network.

new_data = np.zeros((data.shape[0], data.shape[1] + 1))
has_age = data[:,3] > 0.0
new_data[:,:-1] = data
new_data[:, -1] = has_age
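
To see what the extra column looks like, here is the same idea applied to a tiny made-up matrix, written with np.column_stack. The helper name `add_has_age` and the toy values are for illustration only; the age column index is passed in rather than fixed.

```python
import numpy as np


def add_has_age(data, age_column):
    """Append a column that is 1 where an age is present (> 0) and 0 otherwise."""
    has_age = (data[:, age_column] > 0.0).astype(float)
    return np.column_stack([data, has_age])


toy = np.array([[1.0, 3.0, 0.0, 22.0],
                [0.0, 1.0, 1.0, 0.0]])   # The second passenger has no age recorded
print(add_has_age(toy, age_column=3))
```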

This change doesn’t seem to have helped. With the one hidden layer, a reduced learning rate and more iterations, the best results I could get were:

Training set accuracy: 82.04%
Dev set accuracy: 78.09%

Notice, though, that this has begun over-fitting, i.e. it doesn’t generalise to the dev set very well. Introducing some regularisation brought the variance down, but resulted in much lower scores.

My next thought was to include the cabin information in the features. Rather than trying to decode the cabin names, I decided to give each cabin its own index (e.g. 0 = "", 1 = "A14", 2 = "B57 B59 B63 B66"…). Again, I added some code after loading the original file and saved some new dev and train sets.

cabins = np.unique(dictionary["Cabin"].values).tolist()

cabin_indexes = []
for cabin in dictionary["Cabin"]:
    index = cabins.index(cabin)
    cabin_indexes.append(index)

cabin_index_dict = dictionary.assign(CabinIndex = cabin_indexes)
no_cabin_dict = cabin_index_dict.drop(columns="Cabin")    # Positional axis argument was removed in pandas 2.0

data = no_cabin_dict.values
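
The lookup loop above can also be done in a single call, because np.unique can return the index of each original value within the sorted list of unique values. A small sketch on made-up cabin strings, assuming missing cabins have already been read in as empty strings:

```python
import numpy as np

cabin_values = np.array(["", "C85", "", "C123", "C85"])   # Made-up sample, not real rows

# return_inverse gives, for every input value, its position in the unique list
cabins, cabin_indexes = np.unique(cabin_values, return_inverse=True)

print(cabins.tolist())          # ['', 'C123', 'C85']
print(cabin_indexes.tolist())   # [0, 2, 0, 1, 2]
```

This avoids calling list.index once per passenger, which matters little for 891 rows but scales better on larger files.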

This gave a slight improvement, but nothing to write home about. After tuning, my best results were around the following (learning rate = 0.3, iterations = 8000, 4 units in one hidden layer).

Training set accuracy: 81.91%
Dev set accuracy: 80.33%

Each run varied slightly, but the results were generally around 82% for both train and dev.

I have some concerns with using this method, as it assumes all of the possible cabin strings were included in the training set. An alternative approach may be to encode the strings as 15 new features, each new feature being an integer representation of the nth character in the string. Here’s my code to do that.

cabins = dictionary["Cabin"]
ascii_cabins = np.zeros((cabins.shape[0], 15))
for i, cabin in enumerate(cabins):
    padded = cabin.rjust(15, "+")                           # Left-pad with "+" to 15 characters
    chars = np.array(list(padded))                          # One Unicode character per element
    ascii_cabins[i] = chars.view(np.int32).astype(float)    # Code point of each character (np.int and np.float were removed in NumPy 1.24)

no_cabin_dict = dictionary.drop(columns="Cabin")            # Remove the "Cabin" string column
data = np.append(no_cabin_dict.values, ascii_cabins, axis=1)
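
To make the encoding concrete, here is the same transformation for a single cabin string, written with Python's built-in ord instead of a numpy view. The function name `encode_cabin` is made up for this example; it is meant to produce the same values as one row of the loop above.

```python
import numpy as np


def encode_cabin(cabin, width=15, pad="+"):
    """Left-pad the cabin string to a fixed width and return its character codes."""
    padded = cabin.rjust(width, pad)
    return np.array([ord(ch) for ch in padded], dtype=float)


encoded = encode_cabin("C85")
print(encoded)
```

For "C85" the result is twelve copies of 43 (the code for "+") followed by 67, 56 and 53 for "C", "8" and "5".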

These are about the best results I could get for a model with those features.

Training set accuracy: 83.73%
Dev set accuracy: 83.15%

With this model I started to get numerical stability problems. Quite often the cost will come out as NaN (not a number). I did manage to make it a bit more stable with the use of numpy.nan_to_num in the cost function, but I think I’m getting to the limits of my code for now. The model I had built was also quite inconsistent, so it was difficult to compare like with like when each run using the same data and settings could produce different results.
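
A common cause of NaN costs in this kind of model is taking the log of an activation that has saturated at exactly 0 or 1. One widely used alternative to numpy.nan_to_num is to clip the activations slightly away from 0 and 1 before taking the log. The sketch below shows that idea with an optional L2 term; it is a generic illustration rather than the cost function from my classes, and the function name, the `weights` argument and the `epsilon` value are assumptions.

```python
import numpy as np


def cross_entropy_cost(A, Y, weights=(), l2_lambda=0.0, epsilon=1e-12):
    """Binary cross-entropy with optional L2 penalty. A and Y have shape (1, m)."""
    m = Y.shape[1]
    A = np.clip(A, epsilon, 1.0 - epsilon)     # Keep log() away from exactly 0
    cross_entropy = -np.sum(Y * np.log(A) + (1.0 - Y) * np.log(1.0 - A)) / m
    l2_penalty = (l2_lambda / (2.0 * m)) * sum(np.sum(np.square(W)) for W in weights)
    return float(cross_entropy + l2_penalty)


A = np.array([[1.0, 0.0, 0.9]])    # Two saturated predictions
Y = np.array([[1.0, 0.0, 1.0]])
print(np.isfinite(cross_entropy_cost(A, Y)))   # True
```

Without the clip, the first two examples would each produce 0 * log(0), which numpy evaluates to NaN.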

Conclusion

This was a really useful exercise for me, to build my own model from scratch. I’m a bit disappointed with the results (I was hoping for high 80%s) and it didn’t really seem that the addition of features made a great deal of difference to my results.

I’m really glad I went through this process and am looking forward to learning about the frameworks that are out there and experimenting with some other data sets. Watch this space!