Stochastic Optimization

In a world where data can be collected continuously and storage is cheap, the growing size of interesting datasets poses a problem unless we have the right tools for the task. Indeed, with streaming data it might be impossible to wait until the "end" before fitting our model, since that end may never come. Alternatively, it might be impractical to even hold all of the data, scattered across many different servers, in memory before using it. Instead it would be preferable to perform an update each time new data (or a small batch of it) arrives. Similarly, we might find ourselves in an offline situation where the number of training examples is very large and traditional approaches, such as gradient descent, become too slow for our needs.

Stochastic gradient descent (SGD) offers an easy solution to all of these problems.

In this post we explore the convergence properties of stochastic gradient descent and a few of its variants, namely

  • Polyak-Ruppert averaged stochastic gradient;
  • Adaptive Gradient, also known as AdaGrad;
  • Stochastic Average Gradient, also known as SAG.

 

Stochastic Gradient Descent for Logistic Regression

In this post we will consider the classification dataset quantum.mat, which I used for an assignment in the course CPSC 540 - Machine Learning at UBC. It contains a matrix $X$ of dimensions $50000 \times 79$, as well as a vector $y$ taking values $\pm 1$, which we will classify using logistic regression. The true minimum of our model's cost function is $f(w^*) = 2.7068 \times 10^4$, but the various implementations of SGD approach it with drastically different degrees of success, as we will see.

The cost function for logistic regression with L2-regularization can be written in a form well-suited to our minimization objective:

\begin{equation} f(w) = \sum_{i=1}^n f_i(w), \;\;\;\;\; \text{with} \;\; f_i(w) = \log \Big( 1 + \exp \left( - y_i X_{ia} w_a \right) \Big) + \frac{\lambda}{2 n} \lVert w \rVert^2, \end{equation}

where summation over the repeated feature index $a$ is implied.

The key idea behind SGD is to approximate the true gradient $\nabla f(w)$ by the successive gradients at individual training examples $\nabla f_i(w)$, which naturally befits online learning. The rationale is that computing the exact direction of the gradient is very costly, whereas a good but noisy estimate can be obtained by looking at only a few examples. This noise has the added benefit of preventing SGD from getting stuck in the shallow local minima that may be present in non-convex objectives (such as neural networks), at the cost of never truly converging to the minimum but rather to a neighborhood around it.
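
Schematically, a single pass of SGD is only a few lines. Here is a minimal sketch (not the implementation we build below), where grad_fi(w, i) stands for any routine returning $\nabla f_i(w)$:

import numpy as np

def sgd_epoch(w, grad_fi, n, eta):
    # one pass over the n training examples in random order;
    # each step follows the noisy single-example gradient
    for i in np.random.permutation(n):
        w = w - eta * grad_fi(w, i)
    return w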

We now proceed to build a LogisticRegressionSGD() class that implements the logistic regression cost function written above. We will then include different stochastic optimization methods and see how close they can get to the true minimum after 10 epochs.

In [1]:
import numpy as np
from scipy.io import loadmat

# load data from MATLAB file
datamat = loadmat('quantum.mat')
X = datamat['X']
y = datamat['y']
In [2]:
class LogisticRegressionSGD(object):
    
    def __init__(self, X, y, progTol=1e-4, nEpochs=10):
        self.X = X
        self.y = y
        self.n, self.d = X.shape
        # define convergence parameters here so all methods can use them
        self.progTol = progTol
        self.nEpochs = nEpochs
        
    def LogReg(self, w, Lambda, i=None):
        """Logistic regression cost function with L2-regularization.
        Returns the negative log-likelihood (nll) and the gradient (grad)."""
        if i is not None:   # single training example
            X = self.X[[i], :]
            y = self.y[[i]]
        else:               # full dataset
            X = self.X
            y = self.y
            
        yXw = y*X.dot(w)
        sigmoid = 1/(1 + np.exp(-yXw))
        
        nll = -np.sum(np.log(sigmoid)) + 0.5*Lambda*w.T.dot(w)
        grad = -X.T.dot(y*(1-sigmoid)) + Lambda*w
        
        return nll, grad

Regular Stochastic Gradient Descent

Schematically the standard SGD algorithm takes the form

  1. Initialize weights $w$ and learning rate $\eta_t$.

  2. Randomly permute the training examples, then for $i = 1 : n$ do $w \leftarrow w - \eta_t \nabla f_i(w)$.

  3. Repeat step 2 until convergence.

Many theory papers use the step size $\eta_t = 1/(\lambda t)$, which offers the best convergence rate in the worst-case scenario. However, choosing this learning rate typically leads the SGD algorithm into regions of parameter space afflicted by numerical overflow of the cost function, before it ultimately "converges" to $f(w) \geq 4.5 \times 10^4$, far away from the global minimum.

Alternatives include choosing a constant learning rate (we found that $\eta = 10^{-4}$ gave reasonable results) or an iteration-dependent rate that slowly converges to 0, such as

\begin{equation} \eta_t = \frac{\sqrt{n}}{\sqrt{n} + t}. \end{equation}

We find that the last two approaches yield similar results, although the latter requires fine-tuning both the numerator and the denominator to work optimally, and also makes it hard to decide when to stop, since later iterations move very slowly.
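
To see why the worst-case-optimal schedule behaves so poorly here, it helps to compare the three step sizes at a few iteration counts (a quick illustration with $\lambda = 1$ and $n = 50000$, as in our problem):

import numpy as np

n, Lambda = 50000, 1
print('       t   1/(lambda*t)   constant   sqrt(n)/(sqrt(n)+t)')
for t in [1, 100, n, 10*n]:
    print('%8d   %12.2e   %8.0e   %19.2e'
          % (t, 1/(Lambda*t), 1e-4, np.sqrt(n)/(np.sqrt(n) + t)))

The $1/(\lambda t)$ schedule starts with a huge step $\eta_1 = 1$ (hence the overflow) and then decays below the constant rate within the first epoch, long before convergence.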

In [3]:
def RegularSGD(self, case):
    # Initialize
    n = self.n
    w = np.zeros((self.d, 1))
    w_old = w
    Lambda = 1
    
    # Randomly shuffle training data
    arr = np.arange(0, n)
    np.random.shuffle(arr) # shuffles arr directly
        
    for t in range(1, self.nEpochs*n + 1):
        # Compute nll and grad for one random training example
        nll, grad = self.LogReg(w=w, Lambda=Lambda/n, i=arr[np.mod(t, n)])
        if case==1:
            eta = 1/(Lambda*t)
        elif case==2:
            eta = 1e-4
        elif case==3:
            eta = np.sqrt(n)/(np.sqrt(n) + t)   # step size
        else:
            print("The variable 'case' is not specified correctly; abort.")
            break
        w = w - eta*grad   # gradient step
            
        # One epoch has passed: check for convergence
        if np.mod(t, n) == 0:
            change = np.linalg.norm(w-w_old, ord=np.inf)
            print('Passes = %d, function = %e, change = %f' %(t//n, self.LogReg(w=w, Lambda=Lambda)[0], change))
            if change < self.progTol:
                print('Parameters changed by less than progress tolerance on pass')
                break
            np.random.shuffle(arr)   # reshuffle
            w_old = w
            
# Add method to our class
LogisticRegressionSGD.RegularSGD = RegularSGD
In [4]:
print('---------------------------------------------')
print('  Case 1: best rate for worst case scenario  ')
print('---------------------------------------------')
LogisticRegressionSGD(X, y).RegularSGD(case=1)
print('---------------------------------------------')
print('  Case 2: small and constant step size       ')
print('---------------------------------------------')
LogisticRegressionSGD(X, y).RegularSGD(case=2)
print('---------------------------------------------')
print('  Case 3: monotonically decreasing step size ')
print('---------------------------------------------')
LogisticRegressionSGD(X, y).RegularSGD(case=3)
---------------------------------------------
  Case 1: best rate for worst case scenario  
---------------------------------------------
Passes = 1, function = 6.497187e+04, change = 4.257697
Passes = 2, function = 6.332083e+04, change = 0.071927
Passes = 3, function = 6.241682e+04, change = 0.041528
Passes = 4, function = 6.179935e+04, change = 0.029383
Passes = 5, function = 6.133539e+04, change = 0.022594
Passes = 6, function = 6.096411e+04, change = 0.018459
Passes = 7, function = 6.065550e+04, change = 0.015570
Passes = 8, function = 6.039239e+04, change = 0.013437
Passes = 9, function = 6.016341e+04, change = 0.011837
Passes = 10, function = 5.996084e+04, change = 0.010564
---------------------------------------------
  Case 2: small and constant step size       
---------------------------------------------
Passes = 1, function = 2.827318e+04, change = 0.428875
Passes = 2, function = 2.767742e+04, change = 0.148022
Passes = 3, function = 2.744054e+04, change = 0.082473
Passes = 4, function = 2.731625e+04, change = 0.055343
Passes = 5, function = 2.723884e+04, change = 0.045643
Passes = 6, function = 2.719352e+04, change = 0.038638
Passes = 7, function = 2.715925e+04, change = 0.037987
Passes = 8, function = 2.713423e+04, change = 0.027871
Passes = 9, function = 2.711764e+04, change = 0.023492
Passes = 10, function = 2.710850e+04, change = 0.024259
---------------------------------------------
  Case 3: monotonically decreasing step size 
---------------------------------------------
Passes = 1, function = 2.775579e+04, change = 0.995766
Passes = 2, function = 2.726339e+04, change = 0.156621
Passes = 3, function = 2.729197e+04, change = 0.070505
Passes = 4, function = 2.724176e+04, change = 0.077701
Passes = 5, function = 2.723945e+04, change = 0.052928
Passes = 6, function = 2.718166e+04, change = 0.051606
Passes = 7, function = 2.711573e+04, change = 0.032869
Passes = 8, function = 2.713621e+04, change = 0.056568
Passes = 9, function = 2.715035e+04, change = 0.042247
Passes = 10, function = 2.709499e+04, change = 0.034909

Polyak-Ruppert Averaged Stochastic Gradient

Rather than relying only on the weights $w_t$ of the current iteration, it is often an improvement to also track a running average of the iterates, which keeps a memory of previous iterations:

\begin{equation} \overline{w}_t = \overline{w}_{t-1} - \frac{1}{t} \left( \overline{w}_{t-1} - w_t \right). \end{equation}

Doing so results in a slight improvement over regular SGD. Note that convergence improves if we start averaging after nAvg $\geq 2$ passes in order to smooth out the initial irregularities.
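
The recursion above is just an incremental way of computing the arithmetic mean of the iterates, as a quick sanity check on toy data confirms:

import numpy as np

ws = np.random.randn(10, 3)   # pretend iterates w_1, ..., w_10
w_avg = ws[0]
for t in range(2, 11):
    w_avg = w_avg - (w_avg - ws[t-1])/t   # running-average update
print(np.allclose(w_avg, ws.mean(axis=0)))   # True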

In [5]:
def AverageSGD(self, nAvg=2):
    # Initialize
    n = self.n
    w = np.zeros((self.d, 1))
    w_old = w
    w_avg = w   # averaged weights
    Lambda = 1
    nPasses = 0
    
    # Randomly shuffle training data
    arr = np.arange(0, n)
    np.random.shuffle(arr) # shuffles arr directly
        
    for t in range(1, self.nEpochs*n + 1):
        # Compute nll and grad for one random training example
        nll, grad = self.LogReg(w=w, Lambda=Lambda/n, i=arr[np.mod(t, n)])
        eta = 1e-4   # step size
        w = w - eta*grad   # gradient step
            
        if nPasses >= nAvg:
            w_avg = w_avg - 1/(t-nAvg*n+1)*(w_avg - w)
            
        # One epoch has passed: check for convergence
        if np.mod(t, n) == 0:
            nPasses = nPasses + 1
            change = np.linalg.norm(w-w_old, ord=np.inf)
            print('Passes = %d, function = %e, change = %f' %(t//n, self.LogReg(w=w, Lambda=Lambda)[0], change))
            if change < self.progTol:
                print('Parameters changed by less than progress tolerance on pass')
                break
            np.random.shuffle(arr)   # reshuffle
            w_old = w
            
LogisticRegressionSGD.AverageSGD = AverageSGD
In [6]:
LogisticRegressionSGD(X, y).AverageSGD()
Passes = 1, function = 2.826491e+04, change = 0.430516
Passes = 2, function = 2.767581e+04, change = 0.146381
Passes = 3, function = 2.744244e+04, change = 0.078595
Passes = 4, function = 2.731937e+04, change = 0.058312
Passes = 5, function = 2.724000e+04, change = 0.045239
Passes = 6, function = 2.719524e+04, change = 0.039755
Passes = 7, function = 2.715876e+04, change = 0.033641
Passes = 8, function = 2.713543e+04, change = 0.025909
Passes = 9, function = 2.711776e+04, change = 0.021958
Passes = 10, function = 2.710598e+04, change = 0.020059

Adaptive Gradient (AdaGrad)

One of the main drawbacks of the stochastic optimization methods outlined above is the need to manually choose the optimal learning rate for the problem at hand. AdaGrad, an algorithm proposed in 2011, sidesteps this problem by computing an appropriate learning rate for each coordinate direction $\hat{w}_a$ in $\mathbb{R}^d$.

AdaGrad automatically assigns a higher learning rate to rare/sparse features, which typically have a higher predictive power than common ones. We can understand this intuitively by thinking about words in a story: rare words like Daenerys and dragons provide significantly more information and context for the audience of Game of Thrones than common ones such as the or a. Therefore AdaGrad ensures that the most predictive features have larger updates (i.e. the associated weights increase/decrease proportionally to their importance) than the ones providing irrelevant information.

The weight update for AdaGrad is given by

\begin{equation} w_{t+1} = w_t - \eta_t D_t \nabla f(w_t), \end{equation}

where the diagonal matrix $D_t$ has elements

\begin{equation} (D_t)_{jj} = \frac{1}{\sqrt{\delta + \sum_{k=0}^t \nabla_j f_{i_k}(w_k)^2}}. \end{equation}

Here $i_k$ denotes the training example chosen randomly at iteration $k$, $\nabla_j$ is the $j$th element of the gradient, and $\delta$ is a small number that prevents division by 0. All we need to do is fiddle with the constant learning rate $\eta_t = \eta$, since $D_t$ automatically takes care of assigning higher importance to sparse features.
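
Since $D_t$ is diagonal, multiplying by it amounts to an elementwise division, which avoids forming a $d \times d$ matrix. A quick check with toy numbers:

import numpy as np

grad = np.array([[0.5], [-2.0], [0.1]])         # toy gradient, shape (d, 1)
sumGrad_sq = np.array([[4.0], [1.0], [0.25]])   # accumulated squared gradients
delta = 1e-3

D = np.diag(1/np.sqrt(delta + sumGrad_sq.ravel()))
print(np.allclose(D.dot(grad), grad/np.sqrt(delta + sumGrad_sq)))   # True

The implementation below builds the diagonal matrix explicitly to mirror the math; the elementwise form is the cheaper choice when $d$ is large.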

In [7]:
def AdaGrad(self, eta = 0.025, delta=1e-3):
    # Initialize
    n = self.n
    w = np.zeros((self.d, 1))
    w_old = w
    Lambda = 1
    
    # keep sum of squared gradients in memory
    sumGrad_sq = 0
    
    # Randomly shuffle training data
    arr = np.arange(0, n)
    np.random.shuffle(arr) # shuffles arr directly
    
    for t in range(1, self.nEpochs*n + 1):
        # Compute nll and grad for one random training example
        nll, grad = self.LogReg(w=w, Lambda=Lambda/n, i=arr[np.mod(t, n)])
        sumGrad_sq = sumGrad_sq + grad**2
        D = np.diag(1/np.sqrt(delta + sumGrad_sq.ravel()))
        w = w - eta*D.dot(grad)   # gradient step
            
        # One epoch has passed: check for convergence
        if np.mod(t, n) == 0:
            change = np.linalg.norm(w-w_old, ord=np.inf)
            print('Passes = %d, function = %e, change = %f' %(t//n, self.LogReg(w=w, Lambda=Lambda)[0], change))
            if change < self.progTol:
                print('Parameters changed by less than progress tolerance on pass')
                break
            np.random.shuffle(arr)   # reshuffle
            w_old = w
            
LogisticRegressionSGD.AdaGrad = AdaGrad
In [8]:
LogisticRegressionSGD(X, y).AdaGrad()
Passes = 1, function = 2.724693e+04, change = 0.753612
Passes = 2, function = 2.715153e+04, change = 0.081374
Passes = 3, function = 2.712245e+04, change = 0.041782
Passes = 4, function = 2.710405e+04, change = 0.028955
Passes = 5, function = 2.709053e+04, change = 0.021070
Passes = 6, function = 2.708448e+04, change = 0.018139
Passes = 7, function = 2.708505e+04, change = 0.011846
Passes = 8, function = 2.707827e+04, change = 0.010523
Passes = 9, function = 2.707679e+04, change = 0.009694
Passes = 10, function = 2.707562e+04, change = 0.006893

Stochastic Average Gradient (SAG)

Last but not least, we now discuss the SAG algorithm, a variant of SGD published in 2015. The basic implementation of this method can be explained schematically as follows:

  1. Randomly select $i_t$ and compute the gradient $\nabla f_{i_t} (w_t)$.

  2. Update the weights by taking a step towards the average of all the gradients computed so far

    $$ w_{t+1} = w_t - \eta_t \left( \frac{1}{n} \sum_{i=1}^n G_i^t \right), $$

    where $G_i^t$ stores the most recent gradient $\nabla f_i(w)$ computed for example $i$ before iteration $t$ (it is overwritten whenever example $i$ is revisited); see the sketch after this list.

  3. Repeat.
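
Naively recomputing the sum of the $n$ stored gradients at every iteration would cost $O(nd)$. The trick, used in the implementation below, is to maintain the running sum $\sum_i G_i^t$ and correct it in $O(d)$ whenever one entry changes. A minimal sketch, assuming a hypothetical grad_fi(w, i) that returns $\nabla f_i(w)$:

import numpy as np

def sag_step(w, i, dvec, G, eta, n, grad_fi):
    # grad_fi(w, i) is assumed to return the gradient of f_i at w
    g_new = grad_fi(w, i)
    dvec = dvec - G[i] + g_new   # correct the running sum in O(d)
    G[i] = g_new                 # overwrite the stored gradient of example i
    w = w - eta*dvec/n           # step toward the average of the stored gradients
    return w, dvec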

Additionally, in contrast to the methods outlined before, SAG leverages a property we have not used so far: Lipschitz continuity of the gradient of the convex cost function $f$,

$$ \lVert \nabla f(x) - \nabla f(y) \rVert \; \leq \; L \lVert x - y \rVert. $$

By choosing the learning rate to be inversely proportional to the maximal Lipschitz constant over all training examples,

$$ L = \frac{1}{4} \max_{1 \leq i \leq n} \left( \lVert x^i \rVert^2 \right) + \lambda, \;\;\; \eta_t = 1/L $$

(here $x^i$ denotes a row of $X$), SAG converges much faster than all of the methods discussed above. In fact, it is the only method outlined in this post that reaches the global minimum to 5 significant figures.
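
The factor of $1/4$ traces back to the curvature of the logistic loss; here is a short sketch, using the $f_i$ defined earlier. Since the sigmoid satisfies $\sigma'(z) = \sigma(z)(1 - \sigma(z)) \leq 1/4$, the Hessian of each $f_i$ obeys

$$ \nabla^2 f_i(w) \; \preceq \; \frac{1}{4} x^i (x^i)^\top + \frac{\lambda}{n} I, $$

so each $\nabla f_i$ is Lipschitz with constant at most $\frac{1}{4} \lVert x^i \rVert^2 + \lambda/n \leq \frac{1}{4} \lVert x^i \rVert^2 + \lambda$, which the $L$ above bounds uniformly over all training examples.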

A caveat on randomness

The implementation of SAG below approaches the random updating of the gradients in two different ways, with surprising consequences.

  1. In case = 1, the index $i_t$ is sampled randomly with replacement, meaning that not all training examples are necessarily visited after an epoch has been completed. This choice of sampling leads to the best convergence properties.

  2. In case = 2, the index $i_t$ is sampled randomly without replacement, so that all training examples are cycled through exactly once during each pass. It turns out that simply reshuffling the cycle after each pass, the approach used by all the methods above, yields much worse performance for SAG.

It can be verified that random sampling with replacement barely affects the other SGD algorithms, but it remains somewhat of a mystery to me why this choice in randomness affects convergence so drastically.
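
For concreteness, the two sampling schemes differ only in how the index array is generated; an illustration for small $n$:

import numpy as np

n = 10
idx_with = np.random.randint(n, size=n)   # with replacement: repeats possible
idx_without = np.random.permutation(n)    # without replacement: each index once
print(np.sort(idx_with))      # typically contains duplicates and gaps
print(np.sort(idx_without))   # always 0, 1, ..., n-1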

In [9]:
def SAG(self, case=1):
    # Initialize
    n = self.n
    d = self.d
    w = np.zeros((d, 1))
    w_old = w
    Lambda = 1
    
    
    # SAG parameters
    G = np.zeros((n, d))
    dvec = np.zeros((d, 1))
    
    L = 0.25*np.max(np.sum(self.X**2, axis=1)) + Lambda
    eta = 1/L
    
    # strange property of random numbers with SAG
    if case==1:
        # much faster to generate all at once
        arr = np.random.randint(n, size=(n,))
    elif case==2:
        arr = np.arange(0, n)
        np.random.shuffle(arr) # shuffles arr directly
    else:
        print("The variable 'case' is not specified correctly; abort.")
        return
    
    for t in range(1, self.nEpochs*n + 1):
        # Compute grad for one random training example
        i = arr[np.mod(t, n)]
        grad = self.LogReg(w=w, Lambda=Lambda/n, i=i)[1]

        # SAG algorithm
        dvec = dvec - G[[i], :].T + grad
        G[[i], :] = grad.T
        w = w - eta*dvec/n
            
        # One epoch has passed: check for convergence
        if np.mod(t, n) == 0:
            change = np.linalg.norm(w-w_old, ord=np.inf)
            print('Passes = %d, function = %e, change = %f' %(t//n, self.LogReg(w=w, Lambda=Lambda)[0], change))
            if change < self.progTol:
                print('Parameters changed by less than progress tolerance on pass')
                break
            w_old = w
            # careful with random numbers
            if case==1:
                arr = np.random.randint(n, size=(n,))
            elif case==2:
                np.random.shuffle(arr) # shuffles arr directly
                
LogisticRegressionSGD.SAG = SAG
In [10]:
print('-----------------------------------------------------------------')
print('  Case 1: completely random walk through training examples       ')
print('-----------------------------------------------------------------')
LogisticRegressionSGD(X, y).SAG(case=1)
print('-----------------------------------------------------------------')
print('  Case 2: visiting every training example exactly once per pass  ')
print('-----------------------------------------------------------------')
LogisticRegressionSGD(X, y).SAG(case=2)
-----------------------------------------------------------------
  Case 1: completely random walk through training examples       
-----------------------------------------------------------------
Passes = 1, function = 2.827896e+04, change = 0.800740
Passes = 2, function = 2.750555e+04, change = 0.333843
Passes = 3, function = 2.719657e+04, change = 0.231760
Passes = 4, function = 2.715866e+04, change = 0.142823
Passes = 5, function = 2.710138e+04, change = 0.073561
Passes = 6, function = 2.708099e+04, change = 0.047769
Passes = 7, function = 2.707263e+04, change = 0.039457
Passes = 8, function = 2.706958e+04, change = 0.015866
Passes = 9, function = 2.706824e+04, change = 0.011306
Passes = 10, function = 2.706798e+04, change = 0.006513
-----------------------------------------------------------------
  Case 2: visiting every training example exactly once per pass  
-----------------------------------------------------------------
Passes = 1, function = 2.997070e+04, change = 1.087841
Passes = 2, function = 2.791141e+04, change = 0.457067
Passes = 3, function = 2.798343e+04, change = 0.471225
Passes = 4, function = 2.926349e+04, change = 0.359653
Passes = 5, function = 2.934598e+04, change = 0.418543
Passes = 6, function = 2.826330e+04, change = 0.320090
Passes = 7, function = 2.722061e+04, change = 0.136731
Passes = 8, function = 2.847064e+04, change = 0.192135
Passes = 9, function = 3.097148e+04, change = 0.350279
Passes = 10, function = 3.087167e+04, change = 0.466785

Whereas regular stochastic gradient descent has been known for a long time, some of the variants discussed in this post are quite recent: AdaGrad and SAG were first described in peer-reviewed publications in 2011 and 2015 respectively. It is interesting to note that the latter is already one of the main solvers available for logistic and ridge regression in scikit-learn because of its efficiency on large datasets. The rapid rise of online learning and artificial neural networks, where stochastic optimization shines brightest, is sure to stimulate research into even better methods in the near future.
