Deep Learning and Computer Vision in Low Power Devices

Lokesh Vadlamudi
7 min read · May 9, 2020

Generally, neural networks require a lot of power and memory, which makes them hard to deploy on smaller devices. In this article I summarize a few methods that make it possible to run DNNs on such devices. The techniques fall into four major categories: (1) parameter quantization and pruning, (2) compressed convolutional filters and matrix factorization, (3) network architecture search, and (4) knowledge distillation. For each, we look at the advantages, disadvantages, and potential improvements.

1. Parameter Quantization and Pruning:

A. Quantization of Deep Neural Networks:

The first method is quantization, which reduces the bit-width used to store the parameters of a DNN. Experiments show that this method can be very effective.

As the parameter bit-width is reduced, the energy requirement drops with it, but the test error increases.

There are approaches that help find the optimal bit-width for DNN parameters, such as LightNN and CompactNet.

CompactNet takes a pre-trained CNN model and makes it more resource-efficient: given a clear target accuracy, it removes redundant filters without dropping below that target. Other networks perform similar operations to save memory and energy.
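To make the idea concrete, here is a minimal sketch of symmetric 8-bit post-training quantization of a single weight tensor in PyTorch. This is a simplified illustration rather than the LightNN or CompactNet method, and the tensor shapes are made up for the example.

```python
import torch

def quantize_weights_int8(weight: torch.Tensor):
    """Symmetric post-training quantization of one weight tensor to 8-bit integers.

    Returns the int8 tensor plus the scale needed to recover approximate
    float values (dequantized = q_weight.float() * scale).
    """
    scale = weight.abs().max() / 127.0                     # map [-max, max] onto [-127, 127]
    q_weight = torch.clamp((weight / scale).round(), -127, 127).to(torch.int8)
    return q_weight, scale

# Quantize a hypothetical fully connected layer's weights and check the error.
w = torch.randn(256, 512)
q_w, scale = quantize_weights_int8(w)
w_restored = q_w.float() * scale
print("max abs error:", (w - w_restored).abs().max().item())
print("memory (bytes): float32 =", w.numel() * 4, ", int8 =", q_w.numel())
```

The storage drops by 4x here; real schemes also quantize activations and may retrain to recover accuracy.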

Advantages:

Lowering the bit-width also acts as a form of regularization during training, so accuracy does not drop drastically. On custom hardware, shift and XNOR operations can replace energy-hungry multiplications.

Disadvantages and Potential Improvements:

Retraining the model numerous times after quantization can be costly. Different layers are sensitive to different features, which makes it even harder to apply a single quantization scheme across the whole network.

B. Pruning Parameters and Connections

To cut down the huge number of connections stored in memory, we can remove insignificant parameters. A Hessian-weighted distortion measure helps identify which parameters are necessary; this approach prunes only the fully connected layers. Combined with encoding, these methods can shrink a model by up to 95%.

Tree-based hierarchical DNNs can be pruned at the path level. However, removing connections creates sparse matrices, which require custom data structures and are hard to map efficiently onto GPUs.
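As a concrete illustration, the sketch below prunes a weight matrix by simple magnitude, zeroing out the smallest weights. This is a cruder criterion than the Hessian-weighted distortion mentioned above, and the matrix size and sparsity level are made up for the example.

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the smallest-magnitude entries so that `sparsity` fraction is removed."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight.clone()
    threshold = weight.abs().flatten().kthvalue(k).values   # k-th smallest magnitude
    mask = (weight.abs() > threshold).float()
    return weight * mask

w = torch.randn(512, 512)
pruned = magnitude_prune(w, sparsity=0.9)      # keep only the largest ~10% of weights
print("non-zero weights left:", int((pruned != 0).sum()))
```

In practice the network is retrained after pruning, and the sparse matrix is re-encoded so the zeros actually save memory.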

Advantages:

VGG-16 can be compressed to about 2 percent of its original size through pruning, quantization, and encoding. Pruning also helps avoid overfitting.

Disadvantages and Potential Improvements:

Pruning shares most of quantization's disadvantages, so retraining multiple times is again an expensive practice. Channel-level pruning avoids the need for custom hardware and custom data structures.

2. Compressed Convolutional Filters and Matrix Factorization:

The majority of parameters lie in the fully connected layers, so it makes sense to compress these layers to save memory and eliminate unnecessary matrix operations.

A. Convolutional Filter Compression

Smaller filters have fewer parameters than larger ones. But if we eliminate all the large filters, accuracy suffers badly (and the translation invariance property is hurt as well). Instead, we can replace only a few non-essential large filters with smaller ones for a boost in speed. Let's look at one example of this technique, SqueezeNet.

SqueezeNet replaces 3 by 3 filters with 1 by 1 filters using three strategies that reduce the parameter count. It is described in detail in the paper "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size," listed in the references. SqueezeNet is roughly 50 times smaller than AlexNet, yet it is not a miniature copy of AlexNet; it has a completely different layer architecture. On the ImageNet dataset, the two reach similar accuracy.
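Below is a minimal PyTorch sketch of SqueezeNet's Fire module, the building block behind those strategies: a 1x1 "squeeze" convolution reduces the channel count, then parallel 1x1 and 3x3 "expand" convolutions restore it. The channel sizes here are illustrative rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """A Fire module in the spirit of SqueezeNet: squeeze with 1x1 convs,
    then expand with parallel 1x1 and 3x3 convs."""
    def __init__(self, in_ch, squeeze_ch, expand_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))
        return torch.cat([self.relu(self.expand1x1(x)),
                          self.relu(self.expand3x3(x))], dim=1)

# A 96-channel feature map passes through a Fire module with 16 squeeze channels.
x = torch.randn(1, 96, 55, 55)
out = Fire(96, squeeze_ch=16, expand_ch=64)(x)   # output has 64 + 64 = 128 channels
print(out.shape)
```

Because most channels only ever see 1x1 kernels, the parameter count stays small even as the network gets deeper.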

Advantages:

Bottleneck convolution filters drastically reduce memory usage and latency. Combining filter compression with techniques such as pruning and quantization lowers the energy required even further.

The core idea of a bottleneck layer is to shrink the input tensor with 1x1 kernels that reduce the number of channels before the larger convolutions are applied.

(Figure: bottleneck layer, based on Andrew Ng's lectures at DeepLearning.ai)
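Here is a minimal sketch of that idea in PyTorch: a 1x1 convolution thins a 256-channel tensor down to 64 channels before the expensive 3x3 convolution runs, then another 1x1 convolution widens it again. The channel counts are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

# Bottleneck block: cheap 1x1 convs around a 3x3 conv on a thinner tensor.
bottleneck = nn.Sequential(
    nn.Conv2d(256, 64, kernel_size=1),            # reduce 256 channels -> 64
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),  # 3x3 conv on the thin tensor
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 256, kernel_size=1),            # restore 64 channels -> 256
)

x = torch.randn(1, 256, 28, 28)
print(bottleneck(x).shape)
params = sum(p.numel() for p in bottleneck.parameters())
naive = 256 * 256 * 3 * 3                         # weights of a direct 3x3 conv at full width
print(f"bottleneck params: {params}, naive 3x3 params: {naive}")
```

The bottleneck version needs roughly an eighth of the weights of the full-width 3x3 convolution while keeping the same input and output shapes.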

Disadvantages and Potential Improvements:

There are two main disadvantages here. First, depthwise separable convolutions have low arithmetic intensity (little computation per byte of data moved), which leads to poor hardware utilization. Second, relying heavily on 1x1 convolutions can hurt accuracy. Improving the spatial and temporal locality of parameters can reduce the number of memory accesses.
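For reference, the sketch below builds a depthwise separable convolution (a per-channel 3x3 depthwise step followed by a 1x1 pointwise step) and compares its parameter count to a full 3x3 convolution; the channel counts are made up. The parameter saving is large, but each step does little arithmetic per weight loaded, which is exactly the low arithmetic intensity problem described above.

```python
import torch.nn as nn

def separable_conv(in_ch, out_ch):
    """Depthwise separable convolution: per-channel 3x3 conv, then 1x1 pointwise conv."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch),  # depthwise
        nn.Conv2d(in_ch, out_ch, kernel_size=1),                          # pointwise
    )

sep = separable_conv(128, 256)
full = nn.Conv2d(128, 256, kernel_size=3, padding=1)
count = lambda m: sum(p.numel() for p in m.parameters())
print("separable:", count(sep), "vs full 3x3:", count(full))   # roughly an 8-9x reduction
```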

B. Matrix Factorization

Matrix factorization and tensor decomposition express large weight tensors in a sum-product form to speed up computation: big multi-dimensional tensors are decomposed into smaller matrices, eliminating redundant operations. Because the resulting parameter matrices are dense, this avoids the locality problems of unstructured sparse multiplication and can speed up DNNs by a factor of up to four.

Factorization is performed layer by layer to limit the accuracy loss, and the reconstruction error of one layer informs how the following layers are factorized. Because layers are processed one at a time, the technique is hard to extend to huge DNNs with very large numbers of parameters.
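Here is a minimal sketch of the idea for one fully connected layer, using a truncated SVD: the (out x in) weight matrix is approximated by the product of two thin matrices, so one big matrix multiply becomes two much smaller ones. The matrix size and rank are made up, and a random matrix is a worst case; trained weights are usually much closer to low rank.

```python
import torch

def factorize_linear(weight: torch.Tensor, rank: int):
    """Approximate an (out x in) weight matrix as A @ B with a truncated SVD,
    turning one large layer into two smaller ones."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    A = U[:, :rank] * S[:rank]          # (out x rank), singular values folded in
    B = Vh[:rank, :]                    # (rank x in)
    return A, B

w = torch.randn(1024, 1024)
A, B = factorize_linear(w, rank=64)
error = (w - A @ B).norm() / w.norm()
print(f"params: {w.numel()} -> {A.numel() + B.numel()}, relative error: {error:.3f}")
```

With rank 64 the layer stores roughly an eighth of the original parameters; the rank is the hyperparameter that trades accuracy against size.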

Advantages:

Computation cost is cut significantly because fewer operations are needed. The same kind of factorization can be applied to both fully connected and convolutional layers.

Disadvantages and Potential Improvements:

Even though matrix factorization reduces the amount of computation, the overhead can outweigh the gains when the number of operations is already small. Training gigantic DNNs with matrix factorization also takes a very long time, because the search space of decomposition hyperparameters is huge; learning the factorization during training can reduce the time required.

3. Network Architecture Search:

This technique saves a lot of time by automatically trying out different architectures for a neural network. Network Architecture Search (NAS) employs an RNN controller to produce combinations of layers that best fit our requirement, i.e., the performance of the model.

The main signal for improving the architecture is validation accuracy. Once a candidate architecture is trained, its validation accuracy determines whether to refine it further, and the next architecture is generated based on it. MnasNet, developed by Tan et al., applies a multi-objective reward to search for a combination of layers that satisfies the performance criteria (accuracy and latency) of the model. It has far fewer parameters than NASNet, is about twice as fast, and despite its smaller size it can also be more accurate. However, MnasNet needs more than fifty thousand GPU hours to find a good combination of layers for the ImageNet dataset.
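The sketch below is a heavily simplified stand-in for NAS: a toy random search over a made-up architecture space, scored with an MnasNet-style reward that multiplies accuracy by a soft latency penalty. The search space, the evaluate placeholder, and the reward constants are all assumptions for illustration; a real NAS system trains each candidate and drives the search with an RNN controller rather than random sampling.

```python
import random

# Hypothetical search space: each candidate architecture is a dict of choices.
search_space = {
    "num_layers": [4, 8, 12],
    "width": [32, 64, 128],
    "kernel": [3, 5],
}

def sample_architecture():
    return {k: random.choice(v) for k, v in search_space.items()}

def evaluate(arch):
    # Placeholder: a real search trains the candidate and measures
    # validation accuracy plus on-device latency.
    accuracy = random.uniform(0.6, 0.9)
    latency_ms = arch["num_layers"] * arch["width"] * 0.01
    return accuracy, latency_ms

def reward(accuracy, latency_ms, target_ms=50.0, beta=-0.07):
    # Accuracy scaled by a soft latency penalty; the exponent is illustrative.
    return accuracy * (latency_ms / target_ms) ** beta

best = max((sample_architecture() for _ in range(20)),
           key=lambda a: reward(*evaluate(a)))
print("best candidate:", best)
```

The expensive part is `evaluate`: every candidate has to be trained, which is why NAS consumes so many GPU hours.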

Advantages:

On a variety of mobile devices, NAS achieves excellent accuracy and overall performance without any manual intervention. It can keep all the performance metrics (latency, accuracy) in check so that nothing is compromised.

Disadvantages and Potential Improvements:

When this technique is applied to very large datasets, every candidate architecture produced by the reward-driven search must be trained. Hence it can take a very long time and requires a great deal of computation.

4. Knowledge Distillation:

With routine backpropagation alone, it is difficult for a small network to learn the complicated functions of a parameter-heavy DNN. This makes it hard for miniature DNNs to fully replicate gigantic DNNs, and this is where a technique called knowledge transfer comes into the picture.

The main idea of this method is to have the small network mimic the class probabilities the large network assigns to each image. If the miniature DNN learns to reproduce these probabilities, it can imitate the parent DNN through training. A closely related method is knowledge distillation (KD), introduced by Hinton. In KD the training procedure is more straightforward: the teacher is the larger DNN, and the student DNN learns the teacher's functions. With only a small cost in accuracy, the student can perform the challenging tasks of the teacher.
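Here is a minimal sketch of the distillation loss in PyTorch: teacher and student outputs are softened with a temperature T, the student is penalized for diverging from the teacher's soft probabilities, and that term is mixed with the ordinary cross-entropy on the true labels. The temperature, mixing weight, and batch shapes are illustrative choices.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Blend a soft-target KL term (teacher vs. student at temperature T)
    with the usual cross-entropy on the hard labels."""
    soft_targets = F.softmax(teacher_logits / T, dim=1)
    soft_student = F.log_softmax(student_logits / T, dim=1)
    kd = F.kl_div(soft_student, soft_targets, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# Example with random logits for a batch of 8 images and 10 classes.
student = torch.randn(8, 10, requires_grad=True)
teacher = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student, teacher, labels)
loss.backward()
print(loss.item())
```

In practice the teacher's logits come from a frozen pre-trained network, and only the student's parameters are updated with this loss.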

Advantages:

Training a large model from scratch is expensive; knowledge transfer and knowledge distillation eliminate most of the time and energy required to train a small model to comparable performance. These techniques are useful outside deep learning as well.

Disadvantages and Potential Improvements:

Knowledge distillation usually requires the student and teacher to have similar DNN architectures, which is not always possible. Instead of imitating only the output values, the student can learn the sequence in which neurons are activated, which eliminates the dependence on the softmax output layer.

Summary of the above methods:

Conclusion:

Even though the above methods help reduce memory and energy needs, there is still a long way to go before miniature DNNs perform as well as their parent DNNs. No single technique stands out as a clear winner when it comes to building a mini DNN.

References:

https://en.wikipedia.org/wiki/ImageNet
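
https://arxiv.org/abs/1602.07360 (SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size)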
