Introduction: the Newton-Raphson method

The first time I was introduced to the Newton-Raphson method was a couple of years ago. I was studying physics, and during a course on computational physics we had to optimize some function related to the N-body problem. About a week ago, I encountered Newton’s method for the second time, in the Stanford course on machine learning by Andrew Ng. Prof. Ng introduces the Newton-Raphson method as an alternative to the well-known gradient descent algorithm. I really like the algorithm, because the process is so intuitive. It’s best explained by simply looking at the iterations:

[Figure: successive iterations of the Newton-Raphson method]

In short, the goal is to find the root $x^*$ of a function $f$, i.e. the point where $f(x^*) = 0$. You start at some random point $x_0$. Take the derivative to compute the tangent at $x_0$. Compute the intersection between the tangent and zero, which gives $x_1 = x_0 - f(x_0)/f'(x_0)$. This x-coordinate is at least as close to $x^*$ as $x_0$, and repeating the process converges to the root.
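The iteration above can be sketched in a few lines of Python (a minimal illustration of the method, not the course's code):

```python
def newton_raphson(f, f_prime, x0, tol=1e-10, max_iter=100):
    """Find a root of f starting from x0 via Newton-Raphson iterations."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / f_prime(x)  # tangent at x intersects zero at x - step
        x -= step
        if abs(step) < tol:
            break
    return x

# Example: the positive root of f(x) = x^2 - 2, i.e. sqrt(2)
root = newton_raphson(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0)
```

Each step replaces the function by its tangent line, so convergence near the root is quadratic.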

Logistic regression

In one of the exercises of the Stanford course, students are asked to first prove that the empirical loss function for logistic regression:

\begin{equation} J(\theta) = \frac{1}{m}\sum_{i} \log\left(1+e^{-y^i \theta^T x^i}\right) \end{equation}

is convex (its Hessian is positive semi-definite) and thus has a single global optimum. Secondly, some data is provided and students have to implement the Newton-Raphson method to determine the parameters of a logistic regression.

I proved the former by showing that the quadratic form of the Hessian is greater than or equal to zero, and in doing so I had already computed the first- and second-order derivatives:

\begin{align} \frac{\partial}{\partial \theta_l} J(\theta) &= \frac{-1}{m}\sum_{i}g(-y^i \theta^T x^i)y^i x^i_l \\
\frac{\partial^2}{\partial \theta_l \partial \theta_k} J(\theta) &= \frac{1}{m}\sum_{i}g(-y^i \theta^T x^i)\left(1 - g(-y^i \theta^T x^i)\right) x^i_k x^i_l \\
\textrm{where } g(z) &= \frac{1}{1 + e^{-z}} \textrm{, the sigmoid function} \end{align}


I used these results for my Python implementation of the Newton-Raphson method, which was unvectorized. Unvectorized implementations are always a little unsatisfactory, since they tend to be long and have lots of (nested) loops. With all those indices around, it is often difficult to debug when something is not working as expected. It is also well known that vectorized code performs much faster. So, as a little test, I decided to do the vectorization and measure how much faster the solution would be. Short answer: about 250 times faster.
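My original code is not reproduced here, but an unvectorized implementation of the derivatives above looks roughly like the following sketch (the function names and structure are my own illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_unvectorized(theta, X, y):
    """dJ/d theta_l = -(1/m) sum_i g(-y^i theta^T x^i) y^i x^i_l, one loop per index."""
    m, n = X.shape
    grad = np.zeros(n)
    for l in range(n):
        for i in range(m):
            grad[l] -= sigmoid(-y[i] * (theta @ X[i])) * y[i] * X[i, l]
    return grad / m

def hessian_unvectorized(theta, X, y):
    """d2J/(d theta_l d theta_k) = (1/m) sum_i g(1 - g) x^i_k x^i_l."""
    m, n = X.shape
    H = np.zeros((n, n))
    for l in range(n):
        for k in range(n):
            for i in range(m):
                g = sigmoid(-y[i] * (theta @ X[i]))
                H[l, k] += g * (1.0 - g) * X[i, k] * X[i, l]
    return H / m

def newton_step(theta, X, y):
    """One Newton-Raphson update: theta <- theta - H^{-1} grad."""
    return theta - np.linalg.solve(hessian_unvectorized(theta, X, y),
                                   gradient_unvectorized(theta, X, y))
```

The triple-nested loop in the Hessian is exactly the kind of code that is hard to debug and slow to run.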

Now, depending on the actual functions involved, vectorization is not always easy. It requires a good knowledge of matrix and vector multiplications and a certain amount of intuition.

So I started with the argument of the sigmoid function. Currently, $-y^i \theta^T x^i$ gives a scalar: one scalar is computed for each observation $i$. Instead of a single scalar, a column vector covering all observations should be computed. To achieve this, the $\theta^T x^i$ is replaced with $X\theta$, since:

\begin{align} X\theta = \begin{bmatrix} (x^1)^T \\
\vdots \\
(x^m)^T \end{bmatrix} \theta = \begin{bmatrix} (x^1)^T \theta \\
\vdots \\
(x^m)^T \theta \end{bmatrix} \end{align}

This results in a column vector with our estimate of $\theta^T x^i$ for each observation. Next, I had to multiply each element in this vector with $-y^i$. Here, I can’t use a dot product, since that sums everything up and collapses the result to a scalar. Instead, an element-wise multiplication is used: $-y \circ X\theta$. Accordingly, I updated my sigmoid function implementation so that it takes vectors and matrices as input and returns objects of the same size.

For the $y^i x^i_l$ part outside of the sigmoid function, I also used an element-wise multiplication for the $y$ factor, so that we are now at $g(-y \circ X\theta) \circ y$, which is still a column vector. Finally, this vector needs to be multiplied with $X$ while collapsing on the observations ($i$) and expanding on the features ($l$). To do this we need to take the inner product with respect to $i$ and the outer product with respect to $l$. To refresh, the inner product between $x$ and $y$ is defined as:

\begin{align} x^Ty = \begin{bmatrix} x_1 & \cdots & x_n \end{bmatrix} \cdot \begin{bmatrix} y_1 \\
\vdots \\
y_n \end{bmatrix} = \sum_i x_i y_i \end{align}

Since we know that $X$ has each observation as a single row, and we need to take the inner product over the observations, we need a matrix with the individual observations as columns. Therefore we transpose $X$ and multiply it in front, ending up with:

\begin{align} \nabla_{\theta} J(\theta) = \frac{-1}{m} X^T \left( g(-y\circ X\theta)\circ y \right) \end{align}
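A sketch of this vectorized gradient, checked against the original summation form on small random data (illustrative names, not the actual course code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_vectorized(theta, X, y):
    """grad J = -(1/m) X^T (g(-y ∘ X·theta) ∘ y)."""
    m = X.shape[0]
    return -(X.T @ (sigmoid(-y * (X @ theta)) * y)) / m

# sanity check against the per-observation summation
rng = np.random.default_rng(1)
X = rng.normal(size=(10, 3))
y = rng.choice([-1.0, 1.0], size=10)
theta = rng.normal(size=3)

loop_grad = np.zeros(3)
for i in range(10):
    loop_grad -= sigmoid(-y[i] * (theta @ X[i])) * y[i] * X[i]
loop_grad /= 10
```

The `X.T @ (...)` does exactly what the derivation says: an inner product over the observations and an outer product over the features.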

For the Hessian, we have to translate the $x^i_k x^i_l$ term. Since the summation is over $i$, there is an implied inner product over the observations, and an outer product over the features: $X^T X$. However, the $g(-y^i \theta^T x^i)\left(1 - g(-y^i \theta^T x^i)\right)$ factor also needs to be considered, and more specifically during the inner product (the summation over $i$). This means this term has to be put in between. The simplest way to do this is to construct a diagonal matrix with the values of this column vector on the diagonal:

\begin{align} \begin{bmatrix} a_1 \\
\vdots \\
a_m \end{bmatrix} \Rightarrow \begin{bmatrix} a_1 & 0 & \cdots & 0 \\
0 & a_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & a_m \end{bmatrix} = A \textrm{, where } a_i = g(-y^i \theta^T x^i)\left(1 - g(-y^i \theta^T x^i)\right) \end{align}

With this matrix the expression for the Hessian becomes:

\begin{align} \nabla_{\theta}^2 J(\theta) = \frac{1}{m} X^T A X \end{align}
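As a sketch (again assuming NumPy), the diagonal matrix can be built with `np.diag`; as a side note, broadcasting the same vector against the rows of $X$ computes the identical Hessian without materializing the m-by-m matrix:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hessian_vectorized(theta, X, y):
    """Hessian = (1/m) X^T A X, with A = diag(g(-y∘Xθ) * (1 - g(-y∘Xθ)))."""
    m = X.shape[0]
    g = sigmoid(-y * (X @ theta))
    A = np.diag(g * (1.0 - g))       # m x m diagonal matrix
    return (X.T @ A @ X) / m

def hessian_broadcast(theta, X, y):
    """Same Hessian without the m x m matrix: scale each row of X by a_i."""
    m = X.shape[0]
    g = sigmoid(-y * (X @ theta))
    a = g * (1.0 - g)
    return (X.T @ (a[:, None] * X)) / m
```

For small data sets either version works; the broadcast form just avoids the memory cost of the explicit diagonal matrix.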

Coming back to the code, these vectorized versions are implemented here. The last thing to do is to run both methods and compare the timings. With the provided data, the unvectorized version takes around 600 to 800 ms on my system. The vectorized implementation takes around 2 to 3 ms: on average, around 250 times as fast!
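For completeness, the whole vectorized Newton-Raphson loop for logistic regression can be sketched as follows (the synthetic data and all names are my own illustration, not the course's code or data):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logistic(X, y, n_iter=10):
    """Fit logistic regression (labels in {-1, +1}) by Newton-Raphson."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iter):
        g = sigmoid(-y * (X @ theta))
        grad = -(X.T @ (g * y)) / m            # -(1/m) X^T (g(-y∘Xθ) ∘ y)
        a = g * (1.0 - g)
        H = (X.T @ (a[:, None] * X)) / m       # (1/m) X^T A X via broadcasting
        theta -= np.linalg.solve(H, grad)      # Newton update: θ <- θ - H⁻¹∇J
    return theta

# synthetic data: labels drawn from a logistic model with a known parameter
rng = np.random.default_rng(2)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 2))])
true_theta = np.array([0.5, 2.0, -1.0])
y = np.where(rng.random(100) < sigmoid(X @ true_theta), 1.0, -1.0)
theta_hat = newton_logistic(X, y)
```

Note the use of `np.linalg.solve` rather than explicitly inverting the Hessian, which is both faster and numerically more stable.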