In this post on integrating SigOpt with machine learning frameworks, we will show you how to use SigOpt and TensorFlow to efficiently search for an optimal configuration of a convolutional neural network (CNN). There are a large number of tunable parameters associated with defining and training deep neural networks (Bergstra), and SigOpt accelerates searching through these settings to find optimal configurations. This search is typically slow and expensive, especially when using standard techniques like grid or random search, since evaluating each configuration can take multiple hours. SigOpt finds good combinations far more efficiently than these standard methods by employing an ensemble of state-of-the-art Bayesian optimization techniques, allowing users to arrive at the best models faster and more cheaply.
In this example, we consider the same optical character recognition task on the SVHN dataset as discussed in a previous post. Our goal is to build a model capable of recognizing digits (0-9) in small, real-world images of house numbers. We use SigOpt to efficiently find a good structure and training configuration for a convolutional neural net. Check out the code here if you’d like to start experimenting!
Convolutional Neural Net Structure
The structure and topology of a deep neural network can have dramatic implications for performance on a given task (Bengio). Many small decisions go into the connectivity and aggregation strategies for each of the layers that make up a deep neural net. These parameters can be non-intuitive to choose in an optimal, or even acceptable, fashion. In this experiment we used a TensorFlow CNN example designed for the MNIST dataset as a starting point. Figure 1 represents a typical CNN structure, highlighting the parameters we chose to vary in this experiment. A more complete discussion of these architectural decisions can be found in an online course from Stanford (Li). It should be noted that Figure 1 is an approximation of the architecture used in this example, and the code serves as a more complete reference.
Figure 1: Representative convolutional neural net topology. Important parameters include the width and depth of the convolutional filters, as well as dropout probability. (Sermanet)
TensorFlow has greatly simplified the effort required to build and experiment with deep neural network (DNN) designs. Tuning these networks, however, is still an incredibly important part of creating a successful model. The optimal structural parameters often highly depend on the dataset under consideration. SigOpt offers Bayesian optimization as a service to minimize the amount of trial and error required to find good structural parameters for DNNs and CNNs.
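To make structural parameters tunable at all, they first have to be pulled out of the model code and given explicit ranges. The sketch below shows one way to express such a search space in plain Python; the parameter names and bounds are illustrative, not the exact ones used in this experiment.

```python
import random

# Hypothetical search space for the CNN's structural and training
# parameters. Names and ranges are illustrative only -- not the exact
# bounds used in the experiment described in this post.
SEARCH_SPACE = {
    "filter_width":  {"type": "int",    "min": 3,    "max": 11},
    "filter_depth":  {"type": "int",    "min": 16,   "max": 128},
    "dropout_prob":  {"type": "double", "min": 0.2,  "max": 0.8},
    "learning_rate": {"type": "double", "min": 1e-4, "max": 1.0},
}

def sample_configuration(space, rng=random):
    """Draw one configuration uniformly at random from the space."""
    config = {}
    for name, spec in space.items():
        if spec["type"] == "int":
            config[name] = rng.randint(spec["min"], spec["max"])
        else:
            config[name] = rng.uniform(spec["min"], spec["max"])
    return config

config = sample_configuration(SEARCH_SPACE)
```

Uniform sampling like this is exactly what random search does; a Bayesian optimization service consumes the same bounds but chooses each next configuration based on the results observed so far.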
Stochastic Gradient Descent Parameters ($\alpha, \beta, \gamma$)
Once the structure of the neural net has been selected, an optimization strategy based on stochastic gradient descent (SGD) is used to fit the weight parameters of the convolutional neural net. There is no shortage of SGD algorithm variations implemented in TensorFlow and several parametrizations of RMSProp, a particular SGD variation, are compared in Figure 2.
Figure 2: Progression of RMSProp gradient descent with different parametrizations. left: Various decay rates with other parameters fixed: purple = .01, black = .5, red = .93. center: Various learning rates with other parameters fixed: purple = .016, black = .1, red = .6. right: Various momentums with other parameters fixed: purple = .2, black = .6, red = .93.
It can be a counterintuitive and time-consuming task to optimally configure a particular SGD algorithm for a given model and dataset. To simplify this tedious process, we expose to SigOpt the parameters that govern the RMSProp optimization algorithm. Important parameters governing its behavior are the learning rate ($\alpha$), momentum ($\beta$) and decay ($\gamma$) terms. These parameters define the RMSProp gradient update step:
Algorithm 1: Pseudocode for RMSProp stochastic gradient descent. Stochastic gradient refers to the fact that we are estimating the loss function gradient using a subsample (batch) of the entire training data
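The update rule summarized above can be sketched in a few lines of plain Python. This is a textbook RMSProp-with-momentum formulation and may differ in minor details (e.g., epsilon placement) from TensorFlow's implementation.

```python
import math

def rmsprop_update(theta, grad, cache, velocity,
                   alpha=0.05, beta=0.5, gamma=0.9, eps=1e-8):
    """One RMSProp step with momentum for a scalar parameter.

    alpha: learning rate, beta: momentum, gamma: decay --
    the three parameters exposed to the optimizer in this post.
    """
    # Exponentially decayed running average of squared gradients.
    cache = gamma * cache + (1.0 - gamma) * grad * grad
    # Momentum term built from the magnitude-normalized gradient.
    velocity = beta * velocity + alpha * grad / (math.sqrt(cache) + eps)
    return theta - velocity, cache, velocity

# Toy demonstration: minimize f(x) = x^2 (gradient 2x) from x = 5.0.
theta, cache, velocity = 5.0, 0.0, 0.0
for _ in range(300):
    theta, cache, velocity = rmsprop_update(theta, 2.0 * theta, cache, velocity)
```

Because the gradient is normalized by the root of its running average of squares, the effective step size depends much less on the raw gradient scale, which is why the three knobs above dominate the algorithm's behavior.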
For this example, we used only a single epoch of the training data, where one epoch refers to a complete presentation of the entire training data (~500K images in our example). Batch size refers to the number of training examples used in the computation of each stochastic gradient (10K images in our example). One epoch is made up of several batch-sized updates, which limits the in-memory resources required for the optimization (Hinton). Using only a single epoch can be detrimental to performance, but we did so in the interest of time for this example.
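With the figures above, the epoch/batch bookkeeping is a one-line calculation:

```python
def updates_per_epoch(n_examples, batch_size):
    """Number of stochastic gradient updates in one epoch, counting a
    final partial batch when the sizes don't divide evenly."""
    return -(-n_examples // batch_size)  # ceiling division

# Figures from this example: ~500K training images, 10K-image batches,
# so a single epoch performs 50 gradient updates.
n_updates = updates_per_epoch(500_000, 10_000)
```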
To compare tuning the CNN's hyperparameters with random search versus SigOpt, we ran 5 experiments using each method and compared the median best-seen trace. The objective was the classification accuracy on a single 80 / 20 fold of the training and “extra” sets of the SVHN dataset (71K + 500K images). The median best-seen trace for each optimization strategy is shown below in Figure 3.
In our experiment we allowed SigOpt and random search to perform 80 function evaluations (each representing a different proposed configuration of the CNN). A progression of the best seen objective at each evaluation for both methods is shown below in Figure 3. We include, as a baseline, the accuracy of an untuned TensorFlow CNN using the default parameters suggested in the official TensorFlow example . We also include the performance of a random forest classifier using sklearn defaults.
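The "best seen trace" plotted in Figure 3 is simply the running maximum of the objective over successive evaluations. The sketch below computes one for a random search run; the two-parameter toy objective stands in for the real (expensive) cross-validation accuracy, and its parameter names are hypothetical.

```python
import random

def best_seen_trace(observed_values):
    """Running maximum of the objective: the curve plotted in Figure 3."""
    best, trace = float("-inf"), []
    for value in observed_values:
        best = max(best, value)
        trace.append(best)
    return trace

# Stand-in objective: a noisy function of two hypothetical
# hyperparameters, in place of the real CV accuracy.
def toy_objective(lr, dropout, rng):
    return 1.0 - (lr - 0.1) ** 2 - (dropout - 0.5) ** 2 + rng.gauss(0, 0.01)

rng = random.Random(0)
observations = [
    toy_objective(rng.uniform(0.001, 1.0), rng.uniform(0.2, 0.8), rng)
    for _ in range(80)  # 80 function evaluations, as in the experiment
]
trace = best_seen_trace(observations)
```

A Bayesian optimizer is evaluated with the same trace; the difference is only in how each next configuration is chosen, so the traces are directly comparable.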
Figure 3: Median best seen trace of CV accuracy over 5 independent optimization runs using SigOpt, random search as well as two baselines where no tuning was performed
After hyperparameter optimization was completed for each method, we compared accuracy using a completely held out data set (SVHN test set, 26k images) using the best configuration found in the tuning phase. The best hyperparameter configurations for each method in each of the 5 optimization runs were used for evaluation. The mean of these accuracies is reported in the table below. We also include the same baseline models described above and report their performance on the held out evaluation set.
| | SigOpt (TensorFlow CNN) | Random Search (TensorFlow CNN) | No Tuning (sklearn RF) | No Tuning (TensorFlow CNN) |
| --- | --- | --- | --- | --- |
| Hold-out ACC | 0.8130 (+315.2%) | 0.5690 | 0.5278 | 0.1958 |
Table 1: Comparison of model accuracy on the held out dataset after different tuning strategies
Using SigOpt to optimize deep learning architectures instead of a standard approach like random search can translate to real savings in the total cost of tuning a model. This is especially true when expensive computational resources (for example GPU EC2 instances) are required by your modelling efforts.
We compare the cost required to reach specific performances on the CV accuracy objective metric in our example experiment. Quickly finding optimal configurations has a direct savings on computational costs associated with tuning on top of the performance benefits of having a better model. Here we assume each observation costs $2.60, which is the cost per hour of using a single on-demand g2.8xlarge instance in EC2.
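The cost comparison reduces to simple arithmetic over observation counts. In the sketch below the observation counts are hypothetical, chosen only to illustrate the calculation; the $2.60 hourly rate is the one quoted above.

```python
HOURLY_RATE = 2.60  # on-demand g2.8xlarge, $/hour; one observation ~ one hour

def tuning_cost(n_observations, hourly_rate=HOURLY_RATE, gpus_per_observation=1):
    """Total tuning cost. gpus_per_observation models production training
    runs that need many instances per configuration (e.g. 50)."""
    return n_observations * hourly_rate * gpus_per_observation

# Hypothetical counts: suppose random search needs 60 observations to
# reach an accuracy threshold that SigOpt reaches in 15.
random_cost = tuning_cost(60)
sigopt_cost = tuning_cost(15)
savings = 1.0 - sigopt_cost / random_cost  # fraction saved
```

Note that parallelizing a tuning run across machines shortens wall-clock time but not total instance-hours, so the savings fraction depends only on how many observations each method needs.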
| Model Performance (CV Acc. threshold) | Random Search Cost | SigOpt Cost | SigOpt Cost Savings | Potential Production Savings |
| --- | --- | --- | --- | --- |
Table 2: Required costs for achieving same performance when tuning with SigOpt and random search. For CNNs in production more epochs are traditionally used; for this example we assume 50 GPUs and that the results scale perfectly with the parallelism.
We observe that SigOpt offers a drastic discount in cost to achieve equivalent performance levels when compared with a standard method like random search. While this experiment required only a relatively modest amount of computational resources, more sophisticated models and larger datasets require more instances training for up to weeks at a time, as was the case for the AlphaGo DNN, which used 50 GPUs for training. In this setting, an 80% reduction in computational costs could easily translate to tens of thousands of dollars in savings.
Deep learning has quickly become an exciting new area in applied machine learning. Development and innovation is often slowed by the complexity and effort required to find optimal structure and training strategies for deep learning architectures. Optimal configurations for one dataset don’t necessarily translate to others, and using default parameters can often lead to suboptimal results. This inhibited R&D cycle can be frustrating for practitioners, but it also carries a very real monetary cost. SigOpt offers Bayesian optimization as a service to assist machine learning engineers and data scientists in being more cost effective in their modelling efforts. Start building state-of-the-art machine learning models on a budget today!
SigOpt is offering a free 30 day trial to TensorFlow users. To get started, download the code:
More from SigOpt
- SigOpt’s 90 Second Tutorial demonstrates our optimization engine with no sign-up required.
- Using Model Tuning to Beat Vegas (sklearn)
- Automatically Tuning Text Classifiers (sklearn)
- Unsupervised Learning with Bayesian Optimization (xgboost)
Bergstra: James Bergstra, Rémi Bardenet, Yoshua Bengio and Balázs Kégl. Algorithms for Hyper-Parameter Optimization. Advances in Neural Information Processing Systems. 2011. [PDF]
Sermanet: Pierre Sermanet, Soumith Chintala and Yann LeCun. Convolutional Neural Networks Applied to House Numbers Digit Classification. International Conference on Pattern Recognition (ICPR). 2012. [PDF]
Bengio: Yoshua Bengio. Learning Deep Architectures for AI. Foundations and Trends in Machine Learning. 2009. [PDF]
Hinton: Geoffrey Hinton, Nitish Srivastava and Kevin Swersky. Neural Networks for Machine Learning. University of Toronto course slides. [LINK]
Li: Fei-Fei Li, Andrej Karpathy and Justin Johnson. Convolutional Neural Networks for Visual Recognition. Stanford online course. [LINK]