nntrain (0/n): Preliminaries

foundations
PyTorch
nn.Module
Author

Lucas van Walstijn

Published

August 9, 2023

In this series, I want to discuss the creation of a small PyTorch-based library for training neural networks: nntrain. It’s based on the excellent part 2 of Practical Deep Learning for Coders by Jeremy Howard, in which (roughly lessons 13 to 18) the development of the miniai library is discussed.

We’ll try to build everything as much as possible from scratch to understand how things work. Once the main functionality of components is implemented and verified, we can switch over to PyTorch’s version. This is similar to how things are done in the course. However, this is not just a “copy / paste” of the course: on many occasions I take a different route, and most of the code is my own. That is not to say that all of this is meant to be extremely innovative; instead, I had the following goals:

nbdev is another great project from the fastai community, which allows Python libraries to be written in Jupyter notebooks. This may sound a bit weird and controversial, but it has the advantage that we can create the source code for our library in the very same environment in which we want to experiment and interact with our methods, objects and structure while we are building the library. For more details on why this is a good idea and other nice features of nbdev, see here.

So without further ado, let’s start with some data!

Data

To keep things simple, let’s use the Fashion-MNIST dataset. We can get the data from the Hugging Face datasets library:

from datasets import load_dataset,load_dataset_builder

name = "fashion_mnist"
ds_builder = load_dataset_builder(name)
print(ds_builder.info.description)
Fashion-MNIST is a dataset of Zalando's article images—consisting of a training set of
60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image,
associated with a label from 10 classes. We intend Fashion-MNIST to serve as a direct drop-in
replacement for the original MNIST dataset for benchmarking machine learning algorithms.
It shares the same image size and structure of training and testing splits.
ds = load_dataset(name, split='train')
Reusing dataset fashion_mnist (/root/.cache/huggingface/datasets/fashion_mnist/fashion_mnist/1.0.0/8d6c32399aa01613d96e2cbc9b13638f359ef62bb33612b077b4c247f6ef99c1)

ds is a Dataset object. These kinds of objects appear in many Deep Learning libraries and have two main functionalities: you can index into them and they have a length:

ds[0]
{'image': <PIL.PngImagePlugin.PngImageFile image mode=L size=28x28>,
 'label': 9}
len(ds)
60000
ds
Dataset({
    features: ['image', 'label'],
    num_rows: 60000
})

Hugging Face datasets (as opposed to PyTorch datasets) also have some properties, in this case num_rows, which is the length of the dataset (60000), and features, a dictionary giving metadata on what is returned when we index into the dataset:

ds.features
{'image': Image(decode=True, id=None),
 'label': ClassLabel(num_classes=10, names=['T - shirt / top', 'Trouser', 'Pullover', 'Dress', 'Coat', 'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot'], id=None)}

Let’s visualize one single item:

import matplotlib.pyplot as plt

image = ds[0]['image']
label = ds[0]['label']

figure, axs = plt.subplots()

axs.imshow(ds[0]['image'], cmap='Greys')
axs.set_title(f'Image of the first item in the dataset: label={label} -> "{ds.features["label"].int2str(label)}"');
axs.axis('off');

Since we want to start simple and only later get to Datasets and Dataloaders, let’s pull the data out into a tensor so we can push it through some linear layers.

import torchvision.transforms.functional as TF   # to transform from PIL to tensor
import torch

x_train = [TF.to_tensor(i).view(-1) for i in ds['image']]
y_train = [torch.tensor(i) for i in ds['label']]

len(x_train), len(y_train), len(x_train[0])
(60000, 60000, 784)

So x_train and y_train are both lists of length 60000, and an element in x_train has length 784 (28x28 pixels).

Linear layers

Now that we have the data, let’s create our very first network operation: a linear layer which takes the flattened 784-element image vector and maps it to an output vector of length 10:

import torch

def lin(x, a, b):
    return x@a + b

a = torch.randn(784, 10)
b = torch.randn(10)

out = lin(x_train[0], a, b)
out.shape
torch.Size([10])
Note

For details on matrix multiplications, check out this post I wrote earlier.

Let’s do the same for all our training data at once:

x_train = torch.stack(x_train)
out = lin(x_train, a,b)
out.shape
torch.Size([60000, 10])

Nice, that’s basically a forward pass through our model on all our training data!

Now, if we want to increase the depth of our network by adding an additional layer, we need to add a non-linearity in the middle. Why? See for example the first paragraphs of this answer.
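To make this concrete, here is a small check (just a sketch, reusing the lin helper and the stacked x_train from above; the tensor names are purely illustrative): two stacked linear layers are equivalent to a single linear layer with combined weights, so stacking them adds no expressive power.

a1, b1 = torch.randn(784, 50), torch.randn(50)
a2, b2 = torch.randn(50, 10), torch.randn(10)

two_layers = lin(lin(x_train[:5], a1, b1), a2, b2)       # two linear layers back to back
one_layer  = lin(x_train[:5], a1 @ a2, b1 @ a2 + b2)     # one layer with combined weights and bias

assert torch.allclose(two_layers, one_layer, atol=1e-3)  # loose tolerance for float32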

Non-linearities are simple element-wise operations that we apply to (a tensor of) values. A common one is the ReLU non-linearity:

def relu(x):
    return x.clamp_min(0.0)

And let’s combine these into our first “model”, consisting of two linear layers and a ReLU non-linearity in the middle:

n_in = 784 # number of input units (28x28)
n_h = 50   # number of hidden units
n_out = 10 # number of output units

w1 = torch.randn(n_in, n_h)
b1 = torch.zeros(n_h)
w2 = torch.randn(n_h, n_out)
b2 = torch.zeros(n_out)

def model(x):
    a1 = lin(x, w1, b1)
    z1 = relu(a1)
    return lin(z1, w2, b2)
out = model(x_train)
out.shape
torch.Size([60000, 10])

Our “model” currently only does a forward pass through the network. And as a matter of fact, it’s doing a forward pass with random weights. When training a neural network, we want to change these parameters in a way that the outputs of the network align with the labels (y_train). We thus need a way to compare the outputs of the network (the predictions) with the labels: the loss function.

Since the outputs represent scores for each of the 10 classes the image can belong to, cross entropy is a straightforward loss function. Some details about cross entropy loss can be found in a post I wrote earlier.
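As a rough sketch of what cross entropy does (for illustration only; later we will simply use PyTorch’s F.cross_entropy): it turns the 10 outputs into log-probabilities with a log-softmax, and then takes the negative log-probability of the correct class, averaged over the batch.

def cross_entropy_sketch(logits, targets):
    # logits: [batch, 10] raw model outputs, targets: [batch] integer class labels
    log_probs = logits.log_softmax(dim=1)                           # log of the softmax probabilities
    return -log_probs[torch.arange(len(targets)), targets].mean()   # mean negative log-likelihood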

Once the loss is computed, training the network means that we want to change the weights in a way that minimizes the loss. Perhaps you remember from high school that we can find the extreme values of a function \(f(x)\) by finding the derivative of \(f\) and setting it to zero: \(f'(x) = 0\). Furthermore, we could use the second derivative to check whether we have found a minimum or maximum. Unfortunately, this doesn’t work in the context of neural networks because of the complexity of the function and the non-linearities. Instead, we use something called gradient descent which is an algorithm that calculates the gradient (derivative) of the loss function with respect to the parameters (weights and biases) of the neural network. It then adjusts the parameters in the opposite direction of the gradient to minimize the loss. This process is repeated iteratively until convergence or a stopping criterion is met.
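In symbols, a single gradient descent step updates every parameter \(\theta\) with learning rate \(\eta\) as

\[\theta \leftarrow \theta - \eta \, \frac{\partial \mathcal{L}}{\partial \theta}\]

which is exactly the weight update we will implement in the training loops later on.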

The way to compute these gradients is by making use of backpropagation, which is an algorithm to find the gradients of a loss function with respect to the weights. I won’t go into the details of backpropagation, but here is a great video by Andrej Karpathy which in my opinion gives one of the best explanations of how this works.
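As a tiny illustration of the chain rule that backpropagation is built on (a sketch; the variables are only for this example), we can compare a hand-derived gradient with what PyTorch’s autograd computes:

# h(x) = f(g(x)) with g(x) = 3x + 2 and f(u) = u**2, so h'(x) = f'(g(x)) * g'(x) = 2*(3x + 2) * 3
x_demo = torch.tensor(1.5, requires_grad=True)
h = (3 * x_demo + 2) ** 2
h.backward()                                          # autograd applies the chain rule for us

manual_grad = 2 * (3 * 1.5 + 2) * 3                   # the same derivative, computed by hand
assert torch.isclose(x_demo.grad, torch.tensor(manual_grad))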

Below, I would like to show how we can backpropagate the gradients through the network. However, since doing manual backpropagation of the cross entropy loss function is not super trivial, let’s use a much easier loss function for now: mean squared error (MSE). This obviously doesn’t make any sense in the context of our data, but mathematically it’s possible. We just have to end up with a single output activation from our model instead of 10:

n_out = 1  # number of output units changed to 1

w2 = torch.randn(n_h, n_out)
b2 = torch.zeros(n_out)

def model(x):
    a1 = lin(x, w1, b1)
    z1 = relu(a1)
    return lin(z1, w2, b2)

out = model(x_train)
out.shape
torch.Size([60000, 1])

From this we see that the outputs have a trailing dimension of size 1. y_train doesn’t have this, so we have to squeeze out this dimension when computing the MSE:

def mse(pred, targ): 
    return (pred.squeeze(-1)-targ).pow(2).mean() 

y_train = torch.stack(y_train)
mse(out, y_train)
tensor(7615.3135)

The next step will be to add the backward pass. But let’s refactor our code to put things into classes, that way the backward pass can be added more easily:

class Linear():
    def __init__(self, n_in, n_out):
        self.w = torch.randn(n_in, n_out, requires_grad=True)
        self.b = torch.zeros(n_out)
    
    def __call__(self, x):
        self.inp = x                      # storing this for the backward pass
        self.out = x@self.w + self.b      # storing this for the backward pass
        return self.out
    
class Relu():
    def __call__(self, x):
        self.inp = x                      # storing this for the backward pass
        self.out = x.clamp_min(0.)        # storing this for the backward pass
        return self.out
    
class MSE():
    def __call__(self, pred, targ):
        self.pred = pred                   # storing this for the backward pass
        self.targ = targ                   # storing this for the backward pass
        self.out = (pred.squeeze(-1)-targ).pow(2).mean()
        return self.out
    
class Model():
    def __init__(self, n_in, n_h, n_out):
        self.layers = [Linear(n_in, n_h), Relu(), Linear(n_h, n_out)]
        self.loss = MSE()
        
    def __call__(self, x, y):
        for l in self.layers: x = l(x)
        return self.loss(x, y)
x_train.shape
torch.Size([60000, 784])
m = Model(n_in, n_h, n_out)
l = m(x_train, y_train)

To add in the functionality for the backward pass, redefining the whole class is a nuisance. So instead we’ll patch the classes. We can do this very easily by using the fastcore library. Let’s see a small example:

import fastcore.all as fc

class A():
    def hi(self): print('hello 😎')
    
a = A()
a.hi()

@fc.patch
def hi(self:A): print('howdy 🤠')

a.hi()
hello 😎
howdy 🤠

So with fc.patch we can extend or change the behavior of classes that have been defined elsewhere, even for instances that have already been created. Nice!

@fc.patch
def backward(self: Linear):
    self.inp.g = self.out.g @ self.w.t()
    self.w.g = self.inp.t() @ self.out.g
    self.b.g = self.out.g.sum(0)
    
@fc.patch
def backward(self: Relu):
    self.inp.g = (self.inp>0).float() * self.out.g
    
@fc.patch
def backward(self: MSE):
    self.pred.g = 2. * (self.pred.squeeze() - self.targ).unsqueeze(-1) / self.targ.shape[0]
    
@fc.patch
def backward(self: Model):
    self.loss.backward()
    for l in reversed(self.layers): l.backward()
m = Model(n_in, n_h, n_out)
l = m(x_train, y_train)
m.backward()

The actual operations in the backward methods you will just have to take for granted, as I am not going to derive them. If you want, you can have some fun (?) trying to derive them yourself. What I think is most important about these formulas:

  1. Notice that each layer has a reference to its inputs and its outputs
  2. During the backward pass, each layer takes the gradient of its outputs and uses it to set the gradient of its inputs
  3. The inputs of layer \(n\) are the outputs of layer \(n-1\), so when the gradients are being set on the inputs of layer \(n\), the gradients of layer \(n-1\)’s outputs are being set at the same time
  4. This is the fundamental point about backpropagation of the gradient: in reverse order, layer by layer, the gradients are propagated back through the network using the chain rule
  5. Although we don’t derive the operations, we can see that such operations exist. They are not magical, they are just the result of calculus: not very different from the fact that if \(f(x) = x^2\) then \(f'(x) = 2x\), and if \(h(x) = f(g(x))\) then \(h'(x) = f'(g(x)) \cdot g'(x)\)

By calling backward, we have effectively computed the gradients of the loss with respect to the parameters in our network. With that, we know how the loss will change if we make a small change in these parameters and we can thus change them in such a way that the loss will decrease a tiny bit.
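As a sanity check (a sketch, not part of the nntrain code), we can compare the gradients from our manual backward pass with what PyTorch’s autograd computes for the same forward pass; the tolerances are chosen loosely to allow for float32 rounding:

lin1, _, lin2 = m.layers                              # the layers of the model used above

# redo the forward pass with plain tensor ops on copies that require grad
w1a = lin1.w.detach().clone().requires_grad_(True)
b1a = lin1.b.detach().clone().requires_grad_(True)
w2a = lin2.w.detach().clone().requires_grad_(True)
b2a = lin2.b.detach().clone().requires_grad_(True)

hidden = (x_train @ w1a + b1a).clamp_min(0.)
pred = hidden @ w2a + b2a
loss = (pred.squeeze(-1) - y_train).pow(2).mean()
loss.backward()

# autograd's gradients should match our manual .g attributes
assert torch.allclose(w1a.grad, lin1.w.g, rtol=1e-3, atol=1e-3)
assert torch.allclose(b2a.grad, lin2.b.g, rtol=1e-3, atol=1e-3)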

First refactor: Module baseclass and training loop

Now let’s see how we can make this a little better. One thing that seems a bit silly is that in each of the Linear, MSE and Relu classes, we are explicitly storing the inputs and outputs when doing a forward call. As mentioned, we need this to backpropagate the gradients. However, we’d rather not store these explicitly every time we create a new layer.

So let’s create a base class that takes care of this:

  • put the forward functionality of each layer in a dedicated forward method
  • let the storing of inputs and outputs be done in the __call__ method of the baseclass, and call the self.forward method in between.

This works, but there is one caveat: most layers just have one input when they are called (x), but the loss has 2 (pred and targ). To make this storing of the inputs generic, we can store them all (as *args) on the base class, and also pass them as positional arguments to _backward. This way, forward and _backward receive the same arguments.

class Module():
    def __call__(self, *args):
        self.args = args
        self.out = self.forward(*args)
        return self.out
    
    def backward(self): self._backward(*self.args)

    
class Linear(Module):
    def __init__(self, n_in, n_out):
        self.w = torch.randn(n_in, n_out)
        self.b = torch.zeros(n_out)
    
    def forward(self, x):
        return x@self.w + self.b
    
    def _backward(self, inp):
        inp.g = self.out.g @ self.w.t()
        self.w.g = inp.t() @ self.out.g
        self.b.g = self.out.g.sum(0)
    
    
class Relu(Module):
    def forward(self, x):
        return x.clamp_min(0.)
    
    def _backward(self, inp):
        inp.g = (inp>0).float() * self.out.g

    
class MSE(Module):
    def forward(self, pred, targ):
        return (pred.squeeze(-1)-targ).pow(2).mean()
    
    def _backward(self, pred, targ):
        pred.g = 2. * (pred.squeeze() - targ).unsqueeze(-1) / targ.shape[0]
    
    
class Model(Module):
    def __init__(self, n_in, n_h, n_out):
        self.layers = [Linear(n_in, n_h), Relu(), Linear(n_h, n_out)]
        self.loss = MSE()
        
    def forward(self, x, y):
        for l in self.layers: x = l(x)
        return self.loss(x, y)
    
    def backward(self):
        self.loss.backward()
        for l in reversed(self.layers): l.backward()

With these objects, let’s create our first training loop:

epochs = 5                              # train for nr of epochs
bs     = 1024                           # batch-size
lr     = 0.01                           # learning rate
m = Model(n_in, n_h, n_out)             # instantiate our model

for epoch in range(epochs):             # iterate through epochs
    for i in range(0,len(x_train), bs): # iterate through the batches
        xb = x_train[i:i+bs]            # get minibatch 
        yb = y_train[i:i+bs]
        
        loss = m(xb, yb)                # forward pass
        m.backward()                    # backward pass
        
        for l in m.layers:              # iterate through the layers
            if isinstance(l, Linear):   # only update the linear layers
                l.w += - lr * l.w.g     # update the weights
                l.b += - lr * l.b.g     # update the bias

                l.w.g = None            # reset the gradients
                l.b.g = None
    print(f'{epoch=} | {loss=:.1f}')
epoch=0 | loss=4853.3
epoch=1 | loss=518.7
epoch=2 | loss=54.7
epoch=3 | loss=12.3
epoch=4 | loss=8.5

Awesome, the loss is decreasing, i.e. the model is training!

Second refactor: simplify the weight update

Let’s try to simplify our training loop, and make it more generic. By adding functionality to our Module class so that it has a reference to its trainable parameters, we can update the weights as shown below.

def fit(epochs):
    for epoch in range(epochs):
        for i in range(0,len(x_train), bs):
            xb = x_train[i:i+bs]
            yb = y_train[i:i+bs]

            loss = m(xb, yb)
            m.backward()

            for p in m.parameters():    # model has a reference to the trainable parameters
                p -= lr * p.g           
            m.zero_grad()               # model can reset the gradients
        print(f'{epoch=} | {loss=:.1f}')

To do so, we will create a new baseclass (NNModule), from which our model and all the layers will inherit. We have the following conditions and properties:

  1. The class will hold a dictionary _named_args, in which all the named arguments are stored that are set on the Module.
  2. This is done by defining a __setattr__ method, which stores any named argument that doesn’t start with an _ in this dictionary
  3. For the Linear, these named arguments will be the parameters w and b
  4. For the Model, these named arguments will be layers (an array containing the layer objects) and loss containing the MSE object.
  5. Because we want to get the parameters directly out of a layer, as well as out of the model, we need to implement some logic in _parameters() to iterate through the lowest “level” and get the actual parameters out
  6. Last but not least we have to implement a zero_grad() method to zero the gradients on the parameters
class NNModule:
    def __init__(self):
        self._named_args = {}                           # [1]
        
    def __setattr__(self, name, value):                 # [2]
        if not name.startswith("_"): self._named_args[name] = value
        super().__setattr__(name, value)
        
    def _parameters(self, obj):                         # [5]
        for i in obj:
            if isinstance(i, torch.Tensor): yield i
            if isinstance(i, NNModule):
                yield from iter(self._parameters(i._named_args.values()))
            if isinstance(i, list):
                yield from iter(self._parameters(i))
        
    def parameters(self):
        return list(self._parameters(self._named_args.values()))
    
    def zero_grad(self):
        for p in self.parameters():
            p.g = None                                   # [6]
        
    def __call__(self, *args):
        self._args = args                                # NOT stored under _named_args
        self._out = self.forward(*args)                  # since "_args" starts with "_"
        return self._out
    
    def backward(self): self._backward(*self._args)
class Linear(NNModule):
    def __init__(self, n_in, n_out):
        super().__init__()
        self.w = torch.randn(n_in, n_out)               # [3] stored under _named_args 
        self.b = torch.zeros(n_out)                     # [3] stored under _named_args
    
    def forward(self, x):
        return x@self.w + self.b
    
    def _backward(self, inp):
        inp.g = self._out.g @ self.w.t()
        self.w.g = inp.t() @ self._out.g
        self.b.g = self._out.g.sum(0)
        
        
class Relu(NNModule):
    def forward(self, x):
        return x.clamp_min(0.)
    
    def _backward(self, inp):
        inp.g = (inp>0).float() * self._out.g

    
class MSE(NNModule):
    def forward(self, pred, targ):
        return (pred.squeeze(-1)-targ).pow(2).mean()
    
    def _backward(self, pred, targ):
        pred.g = 2. * (pred.squeeze() - targ).unsqueeze(-1) / targ.shape[0]
        
        
class Model(NNModule):
    def __init__(self, n_in, n_h, n_out):
        super().__init__()
        self.layers = [Linear(n_in, n_h), Relu(), Linear(n_h, n_out)]
        self.loss = MSE()                              # [4] < and ^ are stored under _named_args
        
    def forward(self, x, y):
        for l in self.layers: x = l(x)
        return self.loss(x, y)
    
    def backward(self):
        self.loss.backward()
        for l in reversed(self.layers): l.backward()

And now we can indeed call parameters on both the model as well as on individual layers:

m = Model(n_in, n_h, n_out)
[p.shape for p in m.parameters()]
[torch.Size([784, 50]), torch.Size([50]), torch.Size([50, 1]), torch.Size([1])]
[p.shape for p in Linear(n_in, n_h).parameters()]
[torch.Size([784, 50]), torch.Size([50])]

Let’s fit with our new training loop:

fit(5)
epoch=0 | loss=2118316928.0
epoch=1 | loss=195283376.0
epoch=2 | loss=18002500.0
epoch=3 | loss=1659511.5
epoch=4 | loss=152958.9

Third refactor: use nn.Module

Finally we are in a position to use PyTorch’s nn.Module, since we understand all of its behavior! We can simplify:

import torch.nn as nn

class Model(nn.Module):
    def __init__(self, n_in, n_h, n_out):
        super().__init__()
        self.layers = [nn.Linear(n_in, n_h), nn.ReLU(), nn.Linear(n_h, n_out)]
        for i,l in enumerate(self.layers):               # ^ we use the nn.Linear and nn.ReLU from PyTorch
            self.add_module(f'layer_{i}', l)             # we need to register the modules explicitly
        self.loss = nn.MSELoss()                         # we use the MSELoss from PyTorch
        
    def forward(self, x, y):
        for l in self.layers: x = l(x)
        return self.loss(x.squeeze(-1), y)
# Autograd needs all tensors to be float
x_train = x_train.to(torch.float32)
y_train = y_train.to(torch.float32)
m = Model(n_in, n_h, n_out)
def fit(epochs):
    for epoch in range(epochs):
        for i in range(0,len(x_train), bs):
            xb = x_train[i:i+bs]
            yb = y_train[i:i+bs]

            loss = m(xb, yb)
            loss.backward()

            with torch.no_grad():
                for p in m.parameters():
                    p -= lr * p.grad
                m.zero_grad()
        print(f'{epoch=} | {loss=:.1f}')

Fourth refactor: nn.ModuleList and nn.Sequential

To simplify the storing of the layers array and the registration of the modules, we can use nn.ModuleList. Up till now, we have computed the loss as part of the forward pass of the model; let’s change that and let the model return the predictions. With these predictions we can now also compute a metric: accuracy, which will represent the percentage of images correctly classified by the model.

class Model(nn.Module):
    def __init__(self, n_in, n_h, n_out):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(n_in, n_h), nn.ReLU(), nn.Linear(n_h, n_out)])
        
    def forward(self, x):
        for l in self.layers: x = l(x)
        return x

This turns out to be such an elementary operation, that PyTorch has a module for it: nn.Sequential.

import torch.nn.functional as F


layers = [nn.Linear(n_in,n_h), nn.ReLU(), nn.Linear(n_h, n_out)]
model = nn.Sequential(*layers)

And let’s update our training loop as we mentioned:

  • The loss needs to be computed separately, since we took it out of the model
  • Let’s now also use a loss function that actually makes sense: cross entropy loss instead of MSE
  • We then need to switch back to using 10 output activations corresponding to the 10 categories
n_out = 10

layers = [nn.Linear(n_in,n_h), nn.ReLU(), nn.Linear(n_h, n_out)]
model = nn.Sequential(*layers)

Let’s also add a metric: accuracy, to see how our model is doing. For this, we need to find the class that our model predicts. However, the model does not output a single class; it outputs logits: unnormalized scores for each of the 10 classes. When applying a softmax to these logits, we turn them into 10 probabilities: the probability that our model assigns to each class.

When computing the accuracy, we can actually just use the logits instead of the probabilities: since the softmax is a monotonically increasing function, the largest logit will also have the largest probability.
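For reference, the softmax turns the logits \(z_1, \dots, z_{10}\) into probabilities via

\[p_i = \frac{e^{z_i}}{\sum_j e^{z_j}}\]

and since \(e^{z}\) is monotonically increasing, the ordering of the logits is preserved.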

x0 = x_train[0]
logits = model(x0)

print(f'{logits=}')                           # Logit output of the model

probs = logits.softmax(dim=0)

print(f'{probs=}')                            # class probabilites

assert torch.allclose(probs.sum(),            # probabilities sum to 1
                      torch.tensor(1.0))      

assert torch.all(probs > 0)                   # no negative probabilities

assert (logits.argmax() == probs.argmax())
logits=tensor([-0.1345,  0.1549, -0.0635,  0.0619,  0.0516, -0.0358,  0.1625, -0.0322,
        -0.0614,  0.1931], grad_fn=<AddBackward0>)
probs=tensor([0.0844, 0.1127, 0.0906, 0.1027, 0.1016, 0.0931, 0.1136, 0.0935, 0.0908,
        0.1171], grad_fn=<SoftmaxBackward0>)
def accuracy(preds, targs):
    return (preds.argmax(dim=1) == targs).float().mean()
loss_func = F.cross_entropy
y_train = y_train.to(torch.long)

for epoch in range(epochs):
    for i in range(0,len(x_train), bs):
        xb = x_train[i:i+bs]
        yb = y_train[i:i+bs]

        preds = model(xb)
        acc = accuracy(preds, yb)
        loss = loss_func(preds, yb)
        loss.backward()

        with torch.no_grad():
            for p in model.parameters():
                p -= lr * p.grad
            model.zero_grad()
    print(f'{epoch=} | {loss=:.3f} | {acc=:.3f}')
epoch=0 | loss=1.071 | acc=0.684
epoch=1 | loss=0.992 | acc=0.681
epoch=2 | loss=0.934 | acc=0.688
epoch=3 | loss=0.889 | acc=0.697
epoch=4 | loss=0.853 | acc=0.706

Fifth refactor: add an Optimizer

We can further refactor the training loop by adding an Optimizer: an object that has access to the parameters and takes care of updating the weights (step) and zeroing the gradients (zero_grad). Most notably, we want to go from:

# ...
# with torch.no_grad():
#     for p in model.parameters():
#         p -= lr * p.grad
#     model.zero_grad()
# ...

to:

# opt.step()
# opt.zero_grad()

So that the training loop becomes:

def fit(epochs):
    for epoch in range(epochs):
        for i in range(0,len(x_train), bs):
            xb = x_train[i:i+bs]
            yb = y_train[i:i+bs]

            preds = model(xb)
            acc = accuracy(preds, yb)
            loss = loss_func(preds, yb)
            loss.backward()

            opt.step()                       # optimizer takes care of the weight update
            opt.zero_grad()                  # as well as zeroing the grad
        print(f'{epoch=} | {loss=:.3f} | {acc=:.3f}')

So we introduce the Optimizer, which has exactly these two methods:

class Optimizer():
    def __init__(self, params, lr=0.5):
        self.params = list(params)
        self.lr = lr
        
    def step(self):
        with torch.no_grad():
            for p in self.params: p -= self.lr * p.grad
        
    def zero_grad(self):
        with torch.no_grad():
            for p in self.params: p.grad.zero_()
layers = [nn.Linear(n_in,n_h), nn.ReLU(), nn.Linear(n_h, n_out)]
model = nn.Sequential(*layers)
opt = Optimizer(model.parameters(), lr)
fit(5)
epoch=0 | loss=2.074 | acc=0.447
epoch=1 | loss=1.832 | acc=0.582
epoch=2 | loss=1.571 | acc=0.653
epoch=3 | loss=1.354 | acc=0.676
epoch=4 | loss=1.195 | acc=0.673

The optimizer we just created is basically the SGD optimizer from PyTorch, so let’s use that:

def get_model():
    layers = [nn.Linear(n_in, n_h), nn.ReLU(), nn.Linear(n_h, n_out)]
    model = nn.Sequential(*layers)
    
    opt = torch.optim.SGD(model.parameters(), lr)
    
    return model, opt

model, opt = get_model()
fit(5)
epoch=0 | loss=2.026 | acc=0.456
epoch=1 | loss=1.751 | acc=0.559
epoch=2 | loss=1.502 | acc=0.605
epoch=3 | loss=1.314 | acc=0.630
epoch=4 | loss=1.179 | acc=0.635

End

We have come a long way, and covered a lot of ground. We have seen many of the fundamental components of training a neural network: the data, a simple model, training loops, loss functions, metrics and optimizers. We have seen why things like nn.Module exist, and we understand its behavior. Furthermore, we have seen that nn.Module and torch.optim arise from the need to simplify the training loop.

In the next post, we will get to datasets and dataloaders as a way to further improve the training loop, and we will start adding our first things into the nntrain library 🕺.