Deep learning has advanced performance in several machine learning problems through the use of large labeled datasets. However, labels are expensive to collect and at times lead to biases in the trained model arising from how the data was labeled. One technique that has gained interest in recent years to address these issues is self-supervised learning (self-SL). The goal of self-SL is to learn representations from data using the data itself as supervision (leveraging inductive biases that we assume about the system), i.e., a model is trained to produce representations such that the extracted features perform well on an artificially generated task for which ground truth is available.
How does one quantify the representations obtained?
Standard protocols for benchmarking self-SL models involve training a linear or weighted k-nearest neighbor (KNN) classifier on features extracted from the learned model, whose parameters are not updated. However, both these evaluations are sensitive to hyperparameters, which complicates evaluation and comparison. For example, in linear evaluation, one often applies selected augmentations to the input when training the linear classifier on top of the feature extractor, in addition to choosing hyperparameters for training the classifier itself. A weighted KNN classifier, on the other hand, avoids data augmentations but still suffers from the selection of the hyperparameter k. One observes large variance in accuracy in both these evaluation frameworks depending on the hyperparameters/augmentations chosen, which obscures the comparison of feature extractors. Ideally, one would want an evaluation protocol that does not require augmentations or hyperparameter tuning and can be run quickly given the features of the data. In this post, I hope to convince you that an alternative neighborhood-based classifier satisfies these requirements exactly. In fact, we will see that the proposed classifier is not only robust but is also capable of performing on par with, if not better than, the previous evaluation methods in terms of classification accuracy.
In NNK (non-negative kernel regression), neighborhood selection is formulated as a signal representation problem, where each data point is to be approximated using a dictionary formed by its neighbors. This formulation leads to an adaptive and principled approach to the choice of neighbors and their weights for each data point. While KNN is used as an initialization, NNK performs an optimization akin to orthogonal matching pursuit in kernel space, resulting in a robust representation with a geometric interpretation. Geometrically, the kernel ratio interval (KRI) reduces to a series of hyperplane conditions, one per NNK neighbor, which applied inductively lead to a convex polytope around each data point, as shown in the figure below. In words, NNK ignores neighbors that are farther away along a direction similar to that of an already chosen point and looks for neighbors in orthogonal or different directions.
Note that this perspective provides an alternative to the conventional view of local neighborhoods as hyperspheres or balls of a certain radius in space. NNK makes this possible by taking into account the relative positions of nodes j and k, using the metric on (j,k), in addition to the metrics on (i,j) and (i,k) that were previously used for the selection and weighting of neighbors j, k to node i.
Application of NNK to classification is further interesting from a recent theoretical point of view: interpolative classifiers, namely classifiers with training error 0, are capable of generalizing well to test data, contrary to conventional wisdom (interpolation ⇒ poor generalization). Note that overparameterized neural network models today are increasingly trained to zero training error and can be considered to fall under this category of interpolative classifiers. In the experiments that follow, we will use a normalized cosine kernel (the cosine kernel shifted and scaled to take values between 0 and 1) for NNK and KNN.
The table below lists Top-1 linear, weighted KNN, and NNK classification accuracy on ImageNet using a fixed backbone architecture, ResNet-50, trained with different self-SL strategies. The evaluation protocol follows a standard setup where one evaluates performance on the validation dataset based on the labeled training dataset. The KNN and NNK evaluations were done using the self-SL framework VISSL with officially released model weights, setting the parameter k, the number of neighbors, to 50. The results listed for linear evaluation are as reported in the corresponding self-SL papers and were not validated by us.
Method | Linear | KNN | NNK |
---|---|---|---|
Supervised | 76.1 | 74.5 | 75.4 |
MoCo-v2 | 71.1 | 60.3 | 64.9 |
DeepClusterV2 | 75.2 | 65.8 | 70.7 |
SwAV | 75.3 | 63.2 | 68.7 |
DINO | 75.3 | 65.6 | 71.1 |
To evaluate the robustness of NNK relative to KNN, we use a recently introduced self-SL model, DINO (distillation with no labels), as our baseline and compare Top-1 accuracy on ImageNet for different values of k. As can be seen in the figure below, NNK not only consistently outperforms a weighted KNN classifier but does so in a robust manner. Further, unlike KNN, whose performance decreases as k increases, NNK improves with k. This can be explained by the fact that NNK accommodates new neighbors only if they belong to a new direction in space that improves its interpolation, whereas KNN simply interpolates with all k neighbors. The NNK classifier in this setup achieves performance on par with, if not better than, the linear classifier, with the small ViT model achieving an ImageNet top-1 accuracy of 79.8%, the best performance by a non-parametric classifier in conjunction with self-SL models.
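For classification, the NNK weights are used to interpolate the labels of the selected neighbors. A sketch of this, reusing the hypothetical `nnk_neighbors` from earlier (names and defaults are mine):

```python
import numpy as np

def nnk_classify(train_feats, train_labels, query, k=50, n_classes=1000):
    """Weighted label interpolation over the NNK neighbors of `query`."""
    # KNN initialization: top-k candidates by cosine similarity.
    sims = (train_feats @ query) / (
        np.linalg.norm(train_feats, axis=1) * np.linalg.norm(query))
    cand = np.argsort(-sims)[:k]
    theta = nnk_neighbors(query, train_feats[cand])
    # Accumulate each class's total NNK weight and pick the largest.
    scores = np.bincount(train_labels[cand], weights=theta,
                         minlength=n_classes)
    return int(np.argmax(scores))
```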
Several machine learning methods leverage the idea of a local neighborhood, using methods such as KNN and ε-neighborhoods, to better design and evaluate pattern recognition models. However, the choice of parameters such as k is often made experimentally, e.g., via cross-validation, leading to local neighborhoods without a clear interpretation. The NNK formulation allows us to overcome these shortcomings, resulting in a superior framework that is adaptive to the local distribution of samples. As shown in this post and our related works, NNK exhibits robust and superior performance relative to standard local methods. Ultimately, our goal is for the NNK framework to encourage and revisit the use of interpretable and robust neighborhood methods in machine learning - consider, e.g., the problem of model distillation or clustering, where NNK could be used to enforce better local consistency regularization.
Source code for the experiments in this post is available at github.com/shekkizh/VISSL_NNK_Benchmark
Image processing over the years has evolved from simple linear averaging filters to highly adaptive nonlinear filtering operations such as the bilateral filter, moving least squares, BM3D, and LARK, to name a few. However, a shortcoming of these powerful methods is the lack of interpretability afforded by conventional non-adaptive grid-based transform techniques such as Fourier and DCT. In recent years, graphs and their associated spectral decompositions have emerged as a unified representation for image analysis and processing. This area of research is broadly categorized under Graph Signal Processing (GSP), an emerging field that has heralded several algorithms in various topics (including neural networks - graph convolutional networks). Note that with any graph-based algorithm, the runtime is generally a function of the number of edges in the graph, so it is often desirable to have a sparse graph that captures the content of the underlying data.
In this post, we will focus on the problem of representing images using graphs: What is the right graph to represent images? And is the graph construction method scalable? An image graph is constructed with each pixel as a node, with edges between nodes capturing their similarity. A standard, naive approach to image graph construction is a window-based method, i.e., each pixel (node) is connected to all neighboring pixels within a window centered at that pixel. These edges are then weighted using a kernel to quantify similarity. This method is similar to the more general k-nearest neighbor (KNN) graph construction, where we set k to the window size and choose neighbors based only on spatial distance in the image. A problem with this representation is that the sparsity of the resulting image graph depends entirely on the choice of window size and is not adaptive to the local structure of the image. This is problematic because the choice is often ad hoc, with no clear methodology for making it, as in the case of KNN graphs.
As an alternative, in our work we bring the sparse signal approximation perspective of non-negative kernel regression (NNK) graphs to the domain of images.
The outline and motivation for NNK graph construction are as follows.
As a consequence, we obtain a stable, sparse representation in which the relative positions of the underlying data are taken into consideration. This property can be explained theoretically and geometrically using the kernel ratio interval (KRI), as shown in the figure below. Simply put, given a node j connected to i, NNK looks for additional neighbors that are in different, i.e., orthogonal, directions.
This geometric interpretation, along with the pixel position regularity in images and specific characteristics of the kernel, allows us to learn image graphs in a fast and efficient manner (10x faster than naive NNK).
We will use the bilateral filter kernel to construct the image graph, though any kernel with values in the range [0, 1] can be integrated into the NNK image graph framework. As shown in the paper, the bilateral filter with KRI gives us a simple threshold condition (computed offline) on intensity to determine whether a pixel k is to be connected, given that pixel j is already connected to the center pixel. We apply this condition with a positive threshold going radially outwards from the center pixel, starting with the four-connected neighbors. We confine ourselves to positive thresholds since negative thresholds correspond to pixels on the opposite side of the pixel window, which are less likely to be affected by the connectivity of the current pixel.
The figure below presents the NNK image graph algorithm for a 7x7 window centered at pixel i. Note that, conventionally, one would connect i to all pixels in the window. In NNK, we start with one of the closest pixels, namely j, and assume it is connected. We observe intensity differences and compare them with the precomputed threshold to prune pixels that will not be connected given the connection to j. We perform this step iteratively, moving outwards through the window.
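As a rough illustration of this loop, here is a sketch of the window pruning under stated assumptions: the actual threshold comes from the KRI derivation in the paper, so `threshold` below is a placeholder, and the direction test is a simplification of mine.

```python
import numpy as np

def nnk_window_mask(window, threshold):
    """Illustrative sketch: decide connectivity of pixels in a square
    window around its center pixel, visiting pixels radially outwards.
    A pixel is pruned if an already-connected pixel lies in a similar
    direction and their intensity difference is within `threshold`."""
    r = window.shape[0] // 2
    connected = []  # offsets (dy, dx) kept so far
    mask = np.zeros_like(window, dtype=bool)
    offsets = sorted(((dy, dx) for dy in range(-r, r + 1)
                      for dx in range(-r, r + 1) if (dy, dx) != (0, 0)),
                     key=lambda o: o[0] ** 2 + o[1] ** 2)
    for dy, dx in offsets:
        keep = True
        for cy, cx in connected:
            similar_direction = (cy * dy + cx * dx) > 0
            close_intensity = abs(window[r + dy, r + dx]
                                  - window[r + cy, r + cx]) < threshold
            if similar_direction and close_intensity:
                keep = False
                break
        if keep:
            connected.append((dy, dx))
            mask[r + dy, r + dx] = True
    return mask
```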
NNK image graphs have far fewer edges than their naive KNN-like counterparts (a 90% reduction in edges for an 11x11 window). This massive reduction in the number of edges speeds up graph filtering operations on images by at least 15x without loss in representation. In fact, we show that graph processing and transforms based on NNK image graphs are much better at capturing image content, with correspondingly better filtering performance.
Further, spectral image denoising based on NNK graphs shows promising performance. We observe that, unlike bilateral filter (BF) graph based denoising, whose performance worsens relative to the original non-graph denoiser, NNK image graph based filtering improves performance, achieving metrics close to more complex methods such as BM3D.
In this article, we looked at a scalable, efficient graph construction framework for images with interpretable connectivity and robust performance. Further, the local nature of the algorithm allows for parallelized execution. We believe we are only scratching the surface with bilateral filter kernels and that better performance and representation are possible by incorporating more complex kernels such as non-local means and BM3D, to name a few.
Source code available at github.com/STAC-USC/NNK_Image_graph
Graphs provide a generic setup to describe and analyze complex patterns in data: instead of observing data as an isolated set of points, we can view it as a network where data are entities/nodes with relationships/edges between them. Graph-driven machine learning (graph-ML) has seen a surge of interest in recent years, with applications in social sciences, biology, and network analysis, to name a few. Unlike standard learning methods that fit a function pointwise to the data (where locality and connectivity assumptions are implicit), graph-ML aims to explicitly leverage the information across points to obtain better functions. Further, by associating graphs with data without labels or task-specific details, one is able to use the data structure in semi-supervised and unsupervised learning scenarios.
A critical problem in the field of graph-ML is the definition of the graph itself. In some scenarios, a graph structure is intuitive and at times given alongside the data. In most scenarios, however, no graph is given a priori, and one has to infer and construct a graph to fit the data and the task to be solved. In this post, I will mostly talk about working with weighted graphs, where the weights represent the degree of association/influence between two nodes.
In a typical graph learning problem, we are given N data points and the goal is to learn an efficient or sparse graph representation of the data. The key word here is efficient: an efficient graph can be defined as one with a number of edges of the same order as the number of nodes. If not for efficiency, the problem could be trivially solved by connecting each data point to every other data point with edge weights proportional to the distance/similarity between them. A sparse graph construction leads to faster downstream processing and allows for a better understanding of the local neighborhood structure of the data.
The most popular graph construction methods for this problem are the k-nearest neighbor (KNN) and ε-neighborhood (ε-graph) methods. In these methods, one connects each data point, based on a predefined similarity metric, to its k most similar neighbors or to all neighbors that are at least ε-similar to it. However, the parameters k and ε, which explicitly define the sparsity and connectivity of the graph, are unknown and are often assigned values in an ad hoc manner or via cross-validation, leading to graphs that are not robust. Moreover, in some cases it is not clear geometrically what the choice of these parameters amounts to. In this post, I will try to convince you of the suboptimal nature of these standard methods and present possible improvements from a dictionary-based data approximation perspective.
Consider a collection of signals/functions that are unit normalized to form the building blocks of your signal space - we will refer to this set as a dictionary and to its elements as atoms. A dictionary is complete when its atoms can represent the entire space of signals, and redundant when the atoms are linearly dependent, i.e., when an atom can be represented by a linear combination of the remaining atoms. In general, we often work with dictionaries that are redundant.
Sparse signal recovery or approximation involves finding the simplest possible explanation of the signal using a linear combination of the atoms in the dictionary. The ability of a dictionary to provide a sparse representation for a signal depends on how well the signal matches the characteristics of the atoms. This problem is NP-hard in general, and various relaxations and greedy approaches have been proposed. A naive, basic approach is to compute the correlation between the signal and the atoms in the dictionary and use all atoms whose correlation is above a threshold for the representation. A better approach, matching pursuit (MP), uses a greedy iterative selection procedure: at each iteration, it finds the atom maximally correlated with the residual (the part of the signal not yet represented), until the signal is fully represented or no improvement can be made. This method does not guarantee optimality but works reasonably well in most real-world applications and is easily better than thresholding. A second approach, orthogonal matching pursuit (OMP), works very similarly to MP, with one additional step at each iteration: orthogonalization, i.e., reweighting the contributions of the atoms selected so far. Improved variations of these algorithms exist, involving the ability to add groups of atoms at each step (instead of one per iteration) and to prune several of the selected atoms as redundant.
KNN and ε-neighborhood methods rely on a similarity measure obtained using a positive definite kernel (e.g., the Gaussian kernel) to select and weight the neighbors of a particular data point. A positive definite kernel corresponds to a transformation of the data points into a possibly infinite-dimensional space such that similarities are dot products in this transformed space. The dot product space associated with a kernel, referred to as a reproducing kernel Hilbert space (RKHS), has well-established properties and applications but has not been sufficiently studied for the purpose of graph learning.
Under the distinction between the data space and the transformed space (RKHS), for similarity kernels with maximum value 1, the inner product can be viewed as the projection of one node onto another, i.e., at a node i:
An additional detail unique to our problem setting is the non-negativity of the coefficients of the approximation - the edge weights. To account for this, we take into consideration the sign of the correlation/projection of the atoms during selection. Now, a k-nearest neighbor procedure (equivalently, an ε-neighborhood procedure) at a node i can be reduced to the following steps:
Thus, the KNN and ε-graph frameworks are signal approximation methods based on thresholding, with thresholds set either globally (ε) or by the desired sparsity (k). In the context of sparse dictionary-based representation, it is well known that selection via thresholding is suboptimal in general and is optimal only when the dictionary is not redundant. As discussed earlier, there exist several alternative methods better suited to sparse signal approximation, such as MP and OMP. A straightforward adaptation of the MP/OMP algorithm for graph construction at a node i can be obtained as below:
1. Initialize the residual to the signal being approximated (node i in the transformed space).
2. Compute the correlation between the residual and each atom (the candidate neighbors of i).
3. Select the atom with the maximum positive correlation.
4. Update the weight of the selected atom and recompute the residual.

Repeat steps 2 - 4 until the approximation is exact or no atom exists with positive correlation. The difference between MP and OMP is at step 4: in MP, one uses the correlation as the weight for the atom selected in step 3, while in OMP, one performs a least squares fit with all selected atoms to reweigh their contributions. In reality, with the non-negativity constraint, one can obtain OMP-like results while avoiding iterative selection by performing a least squares fit on the k-nearest neighbor or ε-neighborhood selection in a single-step optimization - the procedure we proposed in NNK graphs. A key benefit of NNK graphs is their robustness to parameters such as k and ε. This is because the number of neighbors assigned non-zero weights is not predetermined and instead depends on the local geometry of the data, as in the figure below.
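A minimal sketch of the MP variant of these steps in kernel space, assuming a precomputed kernel matrix K with unit diagonal (the function name and stopping constants are mine):

```python
import numpy as np

def kernel_mp(K, i, S, max_iter=50, tol=1e-6):
    """Matching pursuit at node i: greedily approximate phi(x_i) with
    non-negative weights on the atoms {phi(x_j) : j in S}."""
    S = np.asarray(S)
    w = np.zeros(len(S))
    for _ in range(max_iter):
        # Correlation of each atom with the current residual:
        # <phi_i - sum_k w_k phi_k, phi_j> = K[i, j] - sum_k w_k K[k, j].
        corr = K[i, S] - K[np.ix_(S, S)] @ w
        j = int(np.argmax(corr))
        if corr[j] <= tol:   # no atom left with positive correlation
            break
        w[j] += corr[j]      # MP update: the correlation becomes the weight
    return {int(S[j]): w[j] for j in range(len(S)) if w[j] > 0}
```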
In this post, we looked at conventional graph and neighborhood construction methods from a sparse signal approximation perspective, along with the possible improvements this perspective suggests. The area of sparse signal approximation is vast - NNK graphs merely scratch the surface when it comes to connecting this huge area with graph learning. Several further improvements and ideas from dictionary-based representation could be adapted, such as
Python and Matlab source code for NNK graph is available at github.com/STAC-USC/
For a basic introduction to generative models and GANs, refer to my notes here.
Setup
The basic setup of a GAN is two networks, G(z) and D(x), racing against each other to reach an optimum, more specifically a Nash equilibrium. The definition of Nash equilibrium as per Wikipedia is
(in economics and game theory) a stable state of a system involving the interaction of different participants, in which no participant can gain by a unilateral change of strategy if the strategies of the others remain unchanged.
If you think about it, this is exactly what we are trying to do with a GAN: the generator and discriminator reach a state where they cannot improve further given that the other is kept unchanged.
Now, the setup of gradient descent is to take a step in a direction that reduces the loss defined on the problem - we are by no means enforcing that the networks reach a Nash equilibrium in a GAN, which has a non-convex objective in a high-dimensional space with continuous parameters. The networks take successive steps to minimize a non-convex cost and can end up in an oscillating process rather than decreasing the underlying true objective. There is an excellent paper by Goodfellow that tries to explain this problem: On Distinguishability Criteria for Estimating Generative Models.
Setting aside the above issue in a wishful manner, let's just train the GAN by gradient descent - but be mindful of the issue and the fact that your GAN model may not converge.
Importance of initialization and model setup
As mentioned above, the setup is already unstable, so it's absolutely crucial to set up the networks in the best way possible. I tried to follow the DCGAN model setup by Radford et al. in their paper but suffered from bad initialization. In most cases you can right away figure out something is wrong with your model when your discriminator attains a loss that is almost zero. The biggest headache is figuring out what is wrong :weary:
Another practical trick used while training GANs is to stall one network or purposefully make it learn slower so that the other network can catch up. Most of the time it's the generator that lags behind, so we usually make the discriminator wait - this is fine to some extent, but remember that for your generator to get better you need a good discriminator and vice versa. Ideally you would want both networks to learn at a rate where both get better over time. The ideal point for the discriminator is an output of 0.5 - this is where the generated images are indistinguishable from the real images from the perspective of the discriminator.
Feature matching
This is an idea proposed in the Improved Techniques for Training GANs paper. The idea is to use the features at the intermediate layers of the discriminator to match real and fake images and make this a supervisory signal to train the generator. I found training based on this feature matching metric alone ineffective, contrary to what is mentioned in the paper - the discriminator attains almost zero loss right at the beginning when I tried it. Instead, a combination of both the feature matching loss and the discriminator output loss for the generator was quite effective. I set up this combined loss such that, initially, the discriminator output loss for the generator dominates; this prevented the discriminator loss from reaching zero in the early stages of training.
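A sketch of the kind of combination I mean - the linear ramp and warmup length here are hypothetical stand-ins, not the exact schedule from my experiments:

```python
def generator_loss(adv_loss, fm_loss, step, warmup_steps=2000):
    """Blend the adversarial (discriminator output) loss for the
    generator with the feature matching loss: the adversarial term
    dominates early, and feature matching is phased in over the warmup."""
    alpha = min(step / warmup_steps, 1.0)  # ramps from 0 to 1
    return (1.0 - 0.5 * alpha) * adv_loss + 0.5 * alpha * fm_loss
```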
The model used for the results below consists of a 4-layer network each for the generator and discriminator, with batch norm. Leaky ReLUs were used in the discriminator with a sigmoid activation at the output layer, while ReLUs were used in the generator with a tanh for the final layer. The filter depth was changed in multiples of 2. Images were resized to 64x64. The results below are for 15 epochs of training on a successful attempt. The latent space dimension was set to 100.
The generator loss is a combination of the discriminator loss for the generator and the feature matching loss, hence its value seems high. I think this is fine, since with feature matching we are trying to match statistics of the dataset fed in batches, and there is bound to be some residual loss.
Let’s take a look at how the predictions of discriminator vary with training on real and generated images - Seems like the discriminator is confused and this is essential for training. Also the gradients for the first layer in both discriminator and generator are shown below to give an idea of how an good gradient flow are when the model is setup well.
Ahh, the above would have been a beautiful result if I had gotten it on the first try, but what follows is what I actually got: a discriminator loss that reaches values so small that no gradient flows to the discriminator, and generator gradients tiny relative to the scale of its weights. The plots below show exactly what I mean. As always, looking at the gradients tells you that something is wrong with the scale of the weights for the generator and that learning is just too slow.
Key takeaways to train a successful GAN:
CelebA
Flowers
Sampling random points in latent space
Now let's try taking a random walk along just one latent space dimension.
Dimension 25
Dimension 50
Dimension 75
From what I was able to observe, it seems the samples generated while taking random walks in latent space differ depending on whether we are on the negative or positive side of the dimension. This can be seen from the above results as well.
Edit: Feedback I got on the above statement was that while walking the latent space can get you from one image to another, there can be significant semantic relationships between those images (e.g., the same subject under a change of lighting, a rotation, etc.). Some dimensions can encode information about the geometry of the "scene", and others may encode information about how that "scene" is rendered. Of course, the features are anonymous, so some finessing and reverse-engineering is needed to figure that out. (credits to reddit u/Ameren)
I agree completely with the comment and do believe the relationships captured by the latent space get complicated and less interpretable - I was just trying to point out the interpretation for the dimensions shown above.
Code for DCGAN in tensorflow can be found at TensorflowProjects/Unsupervised_learning
As always let me know if you have comments or ideas.
In this post, we look at an experiment based on the results published by Yarin Gal et al. on the interpretation of dropout as model uncertainty. The central idea of the paper is to interpret dropout as uncertainty in models: by applying dropout at test time, we can draw mean and variance information for an input, which helps us concretely explain how our deep learning model interprets that input. Conversely, when we train a network with dropout, by the above interpretation we force the network to learn under some uncertainty, which helps the model avoid overfitting.
Paper: Dropout as a Bayesian Approximation
The results below are obtained on MNIST with dropout in the fully connected layer. The red line corresponds to inference with no dropout at test time. The blue dotted lines correspond to the mean and variance of inference over 100 iterations.
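Before looking at the plots, here is a minimal sketch of this test-time procedure, assuming `forward_pass` is a model callable that applies dropout on every call (the function names are mine):

```python
import numpy as np

def mc_dropout_predict(forward_pass, x, n_samples=100):
    """Run several stochastic forward passes with dropout left active
    and summarize the predictions with their mean and variance."""
    preds = np.stack([forward_pass(x) for _ in range(n_samples)])
    return preds.mean(axis=0), preds.var(axis=0)
```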
The model architecture for the results below is: 2 conv layers - fc layer with dropout - softmax layer. Training was done with dropout probability 0.1. This means that at test time, with the same dropout probability, the model should be fairly confident in its inference, as can be seen in the result below.
Now, if we introduce higher dropout, the model should show signs of uncertainty, as it was not trained to overcome this - the higher the dropout, the higher the uncertainty. Some examples of uncertain predictions the model makes are below.
In the result below, we see the types of inputs where our model is uncertain and those where it is confident. This gives us valuable information about our model. Such insights give us an idea of whether we need to train our model further, include more data, and so on.
Deep neural networks have found their way into a wide variety of applications, but only recently have these networks been applied to unsupervised and semi-supervised learning, leveraging large amounts of unlabeled data.
Generative modeling is a branch of machine learning that attempts to model the probability distribution of high-dimensional data, for example, images. Images present themselves as high-dimensional data points that are highly correlated. Take, for instance, a character image, specifically the digits 0-9. It is highly unlikely that, given that the left half of the image contains the left half of a 0, the right half would contain the left half of, say, a 5. The generative model's job is to capture these kinds of dependencies between pixels. In the case of images, training a model to generate images as realistic as possible, or as close to the actual dataset as possible, is a good way to capture these dependencies.
Intuitively, say we have a vector of latent variables z and assume we can find a function f(z, θ), where θ is a vector of parameters. Training a generative model then amounts to defining this function such that for every data point X in our dataset, there is one setting of the latent variables able to generate something very similar to X. By obtaining such a function, we have found the underlying/defining low-dimensional space, which can potentially be used for various other tasks like classification and clustering.
An autoencoder is a method that tries to learn a function f(X|θ) = X, i.e., an identity mapping. This seems like a trivial task, but by enforcing specific constraints, such as compressing the input from a high dimension to a lower dimension and then reconstructing it, we can discover interesting structure in the data if it exists. In fact, a simple autoencoder can learn features similar to those of principal component analysis. Autoencoders have recently received attention with the advent of neural networks, which have proved to be very good function approximators with the ability to encode information efficiently. Training an autoencoder is theoretically nothing but a maximum likelihood technique to learn a function. Several improvements to the basic autoencoder have been studied and applied, primarily by adding constraints such as sparsity on the latent space (sparse autoencoders) or by defining a different loss function to train the network, e.g., by using a Kullback-Leibler divergence term in the loss, we can train a network with a strong prior distribution on the latent variables (variational autoencoder).
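For concreteness, a minimal NumPy sketch of the VAE objective - a reconstruction term plus the closed-form KL divergence between N(μ, σ²) and N(0, I); the function name and squared-error reconstruction are my choices:

```python
import numpy as np

def vae_loss(x, x_recon, mu, log_var):
    """Reconstruction loss plus KL(N(mu, sigma^2) || N(0, I));
    the KL term enforces the strong prior on the latent variables."""
    recon = np.sum((x - x_recon) ** 2)
    kl = -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))
    return recon + kl
```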
Generative adversarial networks are methods based on game theory. The idea is to have two networks:
- A generator network G(z|θ) that produces samples from the data distribution by transforming a noise vector z.
- A discriminator network D(x) that learns to distinguish samples produced by the generator from samples of the real data.
By jointly training these two networks to play a cat-and-mouse game, we hope to achieve a representation that is useful for describing the dataset. GANs are very unstable to train and so require careful selection of the model and its activations. The problem is mainly due to the fact that the optimization techniques used for training these networks are not meant for finding a Nash equilibrium, which is the ideal point where we want the networks to end up after training.
Simple AutoEncoder
The figure below presents results obtained on MNIST using a simple autoencoder, for the purpose of visualization, with an L2 loss between the generated image and the actual image as the supervisory signal. The model is made up of 6 fully connected layers with a latent dimension of 3. It is interesting to notice how the network has separated the digits and formed clusters.
Variational Auto Encoder
Variational autoencoders differ from other autoencoders in that they have a strong probabilistic interpretation, with priors on the latent variable space, and are significantly faster to train compared to the simple autoencoder. The figure below presents the result of a VAE on MNIST data. The encoder is made up of 4 fully connected layers, with the first two layers shared between the mean and log-variance encoder layers. The decoder network is made up of 3 layers.
Observations on training VAE:
Sample generated images:
Activations on mean and log variance encoder:
Generative Adversarial Networks
GANs are such a pain to train in that they are very unstable, and training requires careful tuning of parameters - I will mostly create separate notes on training GANs and their results. And I did - the notes can be found here: link
Code for experiments available at: TensorflowProjects/Unsupervised_Learning
Logs for the purpose of visualization using Tensorboard can be found in the logs folder
References:
Model pruning in neural networks was the answer I ended up with when I got to wondering about the workings of dropout, dropconnect, and papers like Do Deep Nets Really Need to be Deep? and the follow-up on convolutional networks being deep.
My basic thoughts were along these lines: models are getting bigger and bigger, but are we capturing information with each parameter, and do we really need the N parameters we choose? Are we choosing more than we need, or are we choosing too few? The fact that dropout, dropconnect, and the like work just proves it should be possible to remove certain parameters without affecting the final objective. Also, it seemed like I was just abusing my system by making it work with more and more parameters :grin:
These experiments are very preliminary and are just for understanding how these methods work. The core idea in pruning is to define a saliency for the weight parameters and remove those with low saliency, with the belief that these will affect the model least.
It seems model pruning in neural networks was studied and worked on from very early times, and LeCun's Optimal Brain Damage (OBD) is one of the classics on the topic. The second-order derivatives, i.e., the Hessian of the loss objective with respect to the weights, are by definition a very good saliency metric. But calculating these derivatives would require a lot of compute power and time, especially for large models, the likes of the deep learning models we believe to work. The paper derives an approximation for saliency in terms of the second derivatives on weights, with the consideration that the pruning process should be done within reasonable compute resources.
Doing a Taylor expansion of the error function with respect to the weights, we get
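(The original post showed the expansion as an image; restating it from the OBD paper, with \(g_i = \partial E/\partial w_i\) and \(h_{ij} = \partial^2 E/\partial w_i \partial w_j\):)

$$\delta E = \sum_i g_i \, \delta w_i + \frac{1}{2} \sum_i h_{ii} \, \delta w_i^2 + \frac{1}{2} \sum_{i \neq j} h_{ij} \, \delta w_i \, \delta w_j + O\!\left(\|\delta w\|^3\right)$$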
Now, if we make the following assumptions (as in the OBD paper), the loss objective can be further simplified: the extremal assumption (training has converged, so the first-order terms vanish), the diagonal assumption (the cross terms of the Hessian are negligible), and the quadratic assumption (higher-order terms are negligible).
This leaves us with the saliency of each weight as
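(Restated from the OBD paper:)

$$s_i = \frac{h_{ii} \, w_i^2}{2}$$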
The above term is fairly easy to calculate with backprop and doesn't cost a lot of compute resources.
All experiments were done on the MNIST dataset with a fairly simple 4-layer fully connected model using Tensorflow. The total number of weights in the model is about 50K. Pruning was done in percentiles, and the performance of the model on the test dataset was recorded (1 percentile ≈ 500 weights).
Experiments were done separately with magnitude and with the second derivative (as derived in OBD) as the saliency. Observations were also made with layer-wise pruning, meaning weights are removed by saliency for each layer separately, and all-layer pruning, where weights are removed by saliency over all weights in the model. One additional experiment was done on the same model with dropout (keep probability = 0.8) with all-layer pruning based on second derivative saliency - the thought being that if the model could be trained to work with dropout, then we should be able to prune more easily.
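A minimal sketch of the all-layer percentile pruning step (my own illustration; names are mine):

```python
import numpy as np

def prune_by_saliency(weights, saliency, percentile):
    """Zero out weights whose saliency falls below the given percentile,
    computed over all weights in the model (all-layer pruning)."""
    threshold = np.percentile(saliency, percentile)
    mask = saliency >= threshold
    return weights * mask, mask
```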
The plot below shows percentage of weights pruned vs. test accuracy. Pruning in all cases was done after training the model on the train dataset.
Interesting results! All-layer OBD-based saliency for the win, ehh?! Now let's look at how the all-layer OBD models perform with even more pruning.
Observations:
Also, while at it, I would like to clear up something most people think about dropout - that it produces an ensemble of models. Well, yes, but not the sort of ensemble you get with random forests. Remember that you share the same weights and are still training a single model with dropout - I believe dropout makes the neurons in the model not rely solely on a particular sequence of ops on the input and encourages the model to be more aware of different parts of the same model.
I hope to work on more experiments on pruning with retraining and pruning in convolutional networks - the above results just touch the basics. Let me know if you have ideas :v:
Logs for the results can be found in the logs folder
Code for the experiments can be found in the folder here
Neural networks can be viewed as layers of building blocks - neurons - made up of weights, biases, and an activation function. A fundamental understanding of how these basic units function can help one achieve one's objective. In this study of activation functions, an attempt is made to throw some light on how the choice of activation affects a model.
Sigmoid: The sigmoid activation function squashes the real number line to values in [0, 1]. This nonlinear activation was historically used for a long time since it fit nicely with the biological neuron's firing rate. Nowadays it is rarely used and has fallen well out of favor due to its limitations, as shown in the experiments below. Sigmoid functions suffer from the saturating gradient problem and a non-zero-centered response. The gradients at a sigmoid activation can be very small when the units are saturated - this effectively kills the neuron and prevents the network from learning. One can usually escape the saturation behavior in shallow networks, but with deep networks the progressively dying gradients are certain to affect training. The other issue, the non-zero-centered response, can cause the weights to swing all positive or all negative during backprop, but this can be overcome by normalizing responses before passing them to later layers.
Math form: 1/(1 + exp(-x))
Tanh: Tanh, as shown in the image above, squashes real values to the range [-1, 1]. This activation is still used in the final layer of image generative models but is rarely used in earlier layers. Tanh can be interpreted as a scaled version of the sigmoid. Similar to the sigmoid, tanh suffers from saturated gradients, but as mentioned, it is nowadays used only in final layers, so its gradients rarely take small values.
Math form: (exp(2x) - 1) / (exp(2x) + 1)
ReLu: Rectified Linear Units are considered to be one of the reasons for the resurgence of neural networks in recent years. It is simply a clipping function that zeros out all negative values. (Left image below)
Math form: max(0, x)
Leaky ReLu: ReLu’s are unstable and can die when large gradients flowing throw them cause the update to happen in such a way that they never get activated again. Leaky ReLus were introduced to overcome the “dying ReLu” problem and as the name suggests allows for small negative values. But these don’t provide any improvement in performance with the recent advances in defining network architectures namely local response normalization and batch normalization as the gradients are scaled before being used for backprop updates. Nevertheless monitoring dead ReLU is still a good thing to do when training and replacing them with a leaky one if needed. The reason one might avoid a leaky ReLu during training is to reduce small valued floating point computations when backpropogation as compared to the standard ReLu. (Right image above)
Math form: max(ax, x), a in [0, 1)
ELu: Exponential Linear Units, similar to leaky ReLUs, try to overcome the "dying ReLU" problem and are claimed to learn faster. But as shown in the experiments below, their convergence is very close to that of the standard ReLU.
Math form: a(exp(x) - 1) for x < 0, and x for x ≥ 0
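Collecting the math forms above as NumPy one-liners for reference (a minimal sketch; the default slope values are illustrative):

```python
import numpy as np

def sigmoid(x):            return 1.0 / (1.0 + np.exp(-x))
def tanh(x):               return np.tanh(x)
def relu(x):               return np.maximum(0.0, x)
def leaky_relu(x, a=0.01): return np.maximum(a * x, x)
def elu(x, a=1.0):         return np.where(x < 0, a * (np.exp(x) - 1.0), x)
```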
Legend:
MNIST
The architecture used with the MNIST data for this experiment was a 4-layer network (2 conv layers, 1 fully connected, 1 softmax output layer) with pooling layers and dropout. The number of units, optimizer, and learning rate were all fixed, and only the activation function for the convolution and fully connected layers was varied. All the activation functions described above converged to relatively close values, as seen from the graph below.
Activation Fn. | Test Accuracy |
---|---|
ELu | 0.9894 |
LReLu | 0.9893 |
ReLu | 0.9921 |
Tanh | 0.9868 |
Sigmoid | 0.9814 |
CIFAR10
Experiments on CIFAR10 were able to expose some of the issues discussed above. The model is made up of six layers (2 conv layers, 2 fully connected, 1 softmax output layer) with pooling and local response normalization layers. Note that normalization layers, unlike pooling layers, affect the gradient flow during backpropagation - so effectively the model can be viewed as 8-layered. The loss objective is made up of cross entropy and a regularization loss on the weights in the fully connected layers.
Notice how the cross entropy in the case of the sigmoid activation doesn't improve at all - the total loss reduction is just the improvement in the regularization term. At this point, one would throw in the towel and say the sigmoid function is just not suitable for this model, but let's dig deeper. I chose to look at the gradients in the first convolution layer, since in backpropagation this would be the most affected if the model is ill-posed.
Well, the above plot pretty much says it all - we are facing the case of saturated sigmoid activations, which directly affects the gradients and updates in backpropagation. As described in the introduction, as we increase the number of layers in the model, sigmoids become more susceptible to weak gradient flow. (I have to claim responsibility here, as I purposefully, for the sake of the experiment, set up the model to expose this type of issue.) While we are at it, let's look at the gradients for the other activation functions as well.
Notice how similar the gradients for these activation functions are - this directly reflects how the model minimizes its loss objective. One can very well look at the gradients and activations and tell how well a model is training. If there is one core takeaway message from this article, apart from the choice of activations, it is this:
“Gradients are key to understanding models”
Looking at just the loss objective rarely gives insight into the training process. I strongly suggest visualizing gradients and activations as an important part of training neural networks.
Notice also from the above activations how little the leakiness in ReLU affects this model. This aspect is debatable, but even the small leakiness provides slightly better generalization in the model. Overall, I believe ELU > Leaky ReLU > ReLU, but all these activations perform very closely, each only slightly better than the next.
Logs for the experiments can be found in the github repository and can be viewed using Tensorboard.
References: The images used in this post were borrowed from the Stanford CS231n notes, Wikipedia, and Wolfram.