Batch L-BFGS

This document provides a walkthrough of the L-BFGS example. To run the application, first install these dependencies.

pip install tensorflow
pip install scipy

Then you can run the example as follows.

python ray/examples/lbfgs/driver.py

Optimization is at the heart of many machine learning algorithms. Much of machine learning involves specifying a loss function and finding the parameters that minimize the loss. If we can compute the gradient of the loss function, then we can apply a variety of gradient-based optimization algorithms. L-BFGS is one such algorithm. It is a quasi-Newton method that uses gradient information to approximate the inverse Hessian of the loss function in a computationally efficient manner.
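Concretely, at step k a quasi-Newton method updates the parameters along a direction given by an approximate inverse Hessian applied to the gradient. As a rough sketch of the standard update rule (background only, not code from this example):

\theta_{k+1} = \theta_k - \alpha_k H_k \nabla f(\theta_k)

Here f is the loss, \alpha_k is a step size chosen by a line search, and L-BFGS reconstructs the approximation H_k \approx (\nabla^2 f(\theta_k))^{-1} from only the most recent parameter and gradient differences rather than storing a full matrix.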

The serial version

First we load the data in batches. Here, each element in batches is a tuple whose first component is a batch of 100 images and whose second component is a batch of the 100 corresponding labels. For simplicity, we use TensorFlow’s built-in methods for loading the data.

from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
batch_size = 100
num_batches = mnist.train.num_examples // batch_size
batches = [mnist.train.next_batch(batch_size) for _ in range(num_batches)]
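As a quick sanity check, each batch should consist of 100 flattened 28 x 28 images and 100 one-hot labels (the shapes below assume the loader’s default flattened, one-hot format):

xs, ys = batches[0]
print(xs.shape)  # (100, 784)
print(ys.shape)  # (100, 10)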

Now, suppose we have defined a function which takes a set of model parameters theta and a batch of data (both images and labels) and computes the loss for that choice of model parameters on that batch of data. Similarly, suppose we’ve also defined a function that takes the same arguments and computes the gradient of the loss for that choice of model parameters.

def loss(theta, xs, ys):
  # compute the loss on a batch of data
  return loss

def grad(theta, xs, ys):
  # compute the gradient on a batch of data
  return grad

def full_loss(theta):
  # compute the loss on the full data set
  return sum([loss(theta, xs, ys) for (xs, ys) in batches])

def full_grad(theta):
  # compute the gradient on the full data set
  return sum([grad(theta, xs, ys) for (xs, ys) in batches])

Since we are working with a small dataset, we don’t actually need to separate these methods into the part that operates on a batch and the part that operates on the full dataset, but doing so will make the distributed version clearer.
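For concreteness, here is a minimal sketch of what the batch-level functions could look like for a plain softmax regression model written in NumPy. This is only an illustration of the shapes involved; the model, the parameter layout, and the unpack helper are assumptions of this sketch, not the network used by the actual example.

import numpy as np

num_pixels, num_classes = 784, 10
dim = num_pixels * num_classes + num_classes  # total number of parameters

def unpack(theta):
  # Interpret the flat parameter vector as a weight matrix and a bias vector.
  W = theta[:num_pixels * num_classes].reshape((num_pixels, num_classes))
  b = theta[num_pixels * num_classes:]
  return W, b

def loss(theta, xs, ys):
  # Average cross-entropy of the softmax predictions over the batch.
  W, b = unpack(theta)
  logits = xs.dot(W) + b
  logits -= logits.max(axis=1, keepdims=True)  # for numerical stability
  log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
  return -np.mean(np.sum(ys * log_probs, axis=1))

def grad(theta, xs, ys):
  # Gradient of the loss above with respect to the flat parameter vector.
  W, b = unpack(theta)
  logits = xs.dot(W) + b
  logits -= logits.max(axis=1, keepdims=True)
  probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
  delta = (probs - ys) / xs.shape[0]
  return np.concatenate([xs.T.dot(delta).ravel(), delta.sum(axis=0)])

With definitions like these, full_loss and full_grad above work unchanged.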

Now, if we wish to optimize the loss function using L-BFGS, we simply plug these functions, along with an initial choice of model parameters, into scipy.optimize.fmin_l_bfgs_b.

import numpy as np
import scipy.optimize

theta_init = 1e-2 * np.random.normal(size=dim)  # dim is the number of model parameters
result = scipy.optimize.fmin_l_bfgs_b(full_loss, theta_init, fprime=full_grad)

The distributed version

In this example, the computation of the gradient itself can be done in parallel on a number of workers or machines.
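Before creating any remote objects or actors, we need to start Ray on the driver. Calling ray.init() with no arguments starts Ray locally on a single machine; connecting to an existing cluster takes additional arguments that we do not show here.

import ray

ray.init()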

Next, let’s turn the data into a collection of remote objects.

batch_ids = [(ray.put(xs), ray.put(ys)) for (xs, ys) in batches]

We can load the data on the driver and distribute it this way because MNIST easily fits on a single machine. However, for larger data sets, we will need to use remote functions to distribute the loading of the data.
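For example, the loading could itself be wrapped in a remote function, so that each batch is read by a worker and placed in the object store without ever passing through the driver. The following is only a sketch of the pattern; read_batch_from_storage stands in for a user-provided loader and is not part of the example.

@ray.remote
def load_batch(i):
  # Load the i-th batch on a worker instead of on the driver.
  return read_batch_from_storage(i)  # hypothetical user-provided loader

# Each element is an object ID for one (xs, ys) batch loaded remotely.
remote_batch_ids = [load_batch.remote(i) for i in range(num_batches)]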

Now, let’s turn loss and grad into methods of an actor that will contain our network.

@ray.remote
class Network(object):
  def __init__(self, xs, ys):
    # Initialize the network and store this actor's batch of data.
    self.xs = xs
    self.ys = ys

  def loss(self, theta):
    # Compute the loss on this actor's batch of data.
    return loss

  def grad(self, theta):
    # Compute the gradient on this actor's batch of data.
    return grad
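The functions full_loss and full_grad below iterate over a list of these actors, one per batch. A minimal sketch, assuming each actor is constructed with the object IDs of its batch:

actors = [Network.remote(xs_id, ys_id) for (xs_id, ys_id) in batch_ids]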

Now, it is easy to speed up the computation of the full loss and the full gradient.

def full_loss(theta):
  theta_id = ray.put(theta)
  loss_ids = [actor.loss.remote(theta_id) for actor in actors]
  return sum(ray.get(loss_ids))

def full_grad(theta):
  theta_id = ray.put(theta)
  grad_ids = [actor.grad.remote(theta_id) for actor in actors]
  return sum(ray.get(grad_ids)).astype("float64")  # This conversion is necessary for use with fmin_l_bfgs_b.

Note that we turn theta into a remote object with the line theta_id = ray.put(theta) before passing it to the actor methods. If we had written

[actor.loss.remote(theta) for actor in actors]

instead of

theta_id = ray.put(theta)
[actor.loss.remote(theta_id) for actor in actors]

then each task sent to the scheduler (one for every actor, and hence one for every batch) would have had its own copy of theta serialized inside of it. Since theta here consists of the parameters of a potentially large model, this is inefficient. Large objects should be passed to remote functions and actor methods by object ID, not by value.

We use remote actors and remote objects internally in the implementation of full_loss and full_grad, but the user-facing behavior of these methods is identical to the behavior in the serial version.

We can now optimize the objective with the same function call as before.

theta_init = 1e-2 * np.random.normal(size=dim)
result = scipy.optimize.fmin_l_bfgs_b(full_loss, theta_init, fprime=full_grad)
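As with any call to scipy.optimize.fmin_l_bfgs_b, the result is a tuple containing the optimized parameters, the final loss value, and a dictionary of convergence information, which can be unpacked and inspected as follows.

theta_opt, opt_loss, info = result
print(opt_loss)          # value of full_loss at the optimized parameters
print(info["warnflag"])  # 0 indicates that the optimizer converged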