Roy Miles

VeLoRA: Memory Efficient Training using Rank-1 Sub-Token Projections

May 28, 2024

Roy Miles, Pradyumna Reddy, Ismail Elezi, Jiankang Deng

Abstract: Large language models (LLMs) have recently emerged as powerful tools for tackling many language-processing tasks. Despite their success, training and fine-tuning these models is still far too computationally and memory intensive. In this paper, we identify and characterise the important components needed for effective model convergence using gradient descent. In doing so we find that the intermediate activations used to implement backpropagation can be aggressively compressed without incurring any degradation in performance. This result leads us to a cheap and memory-efficient algorithm for both fine-tuning and pre-training LLMs. The proposed algorithm simply divides the tokens up into smaller sub-tokens before projecting them onto a fixed 1-dimensional subspace during the forward pass. These features are then coarsely reconstructed during the backward pass to implement the update rules. We confirm the effectiveness of our algorithm as being complementary to many state-of-the-art PEFT methods on the VTAB-1k fine-tuning benchmark. Furthermore, we outperform QLoRA for fine-tuning LLaMA and show competitive performance against other memory-efficient pre-training methods on the large-scale C4 dataset.
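
The sub-token projection described in the abstract can be illustrated with a minimal sketch. It assumes a token is a flat list of floats and that the fixed direction `v` is a unit vector supplied by the caller (the abstract does not say how it is chosen). The names `compress` and `reconstruct` are hypothetical and are not the authors' API.

```python
def compress(token, v, m):
    """Split `token` into sub-tokens of length `m` and keep one scalar per
    sub-token: its coefficient along the fixed direction `v` (assumed unit norm)."""
    assert len(token) % m == 0 and len(v) == m
    return [
        sum(a * b for a, b in zip(token[i:i + m], v))
        for i in range(0, len(token), m)
    ]


def reconstruct(coeffs, v):
    """Coarsely rebuild the token: each sub-token becomes coefficient * v."""
    out = []
    for c in coeffs:
        out.extend(c * b for b in v)
    return out


# Hypothetical example: an 8-dimensional token split into two sub-tokens of size 4.
v = [0.5, 0.5, 0.5, 0.5]                      # unit-norm direction (illustrative choice)
token = [1.0, 1.0, 1.0, 1.0, 2.0, 0.0, 2.0, 0.0]
saved = compress(token, v, 4)                 # 2 floats stored instead of 8
approx = reconstruct(saved, v)                # used in place of `token` in the backward pass
```

In this sketch the stored activation shrinks by a factor of `m`. The first sub-token lies along `v` and is recovered exactly, while the second is only approximated, which is the coarse reconstruction the abstract refers to.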

Learning to Project for Cross-Task Knowledge Distillation

Mar 21, 2024

Dylan Auty, Roy Miles, Benedikt Kolbeinsson, Krystian Mikolajczyk

Abstract: Traditional knowledge distillation (KD) relies on a proficient teacher trained on the target task, which is not always available. In this setting, cross-task distillation can be used, enabling the use of any teacher model trained on a different task. However, many KD methods prove ineffective when applied to this cross-task setting. To address this limitation, we propose a simple modification: the use of an inverted projection. We show that this drop-in replacement for a standard projector is effective by learning to disregard any task-specific features which might degrade the student's performance. We find that this simple modification is sufficient for extending many KD methods to the cross-task setting, where the teacher and student tasks can be very different. In doing so, we obtain up to a 1.9% improvement in the cross-task setting compared to the traditional projection, at no additional cost. Our method can obtain significant performance improvements (up to 7%) when using even a randomly-initialised teacher on various tasks such as depth estimation, image translation, and semantic segmentation, despite the lack of any learned knowledge to transfer. To provide conceptual and analytical insights into this result, we show that using an inverted projection allows the distillation loss to be decomposed into a knowledge transfer and a spectral regularisation component. Through this analysis we are additionally able to propose a novel regularisation loss that allows teacher-free distillation, enabling performance improvements of up to 8.57% on ImageNet with no additional training costs.
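
The difference between a standard and an inverted projector can be sketched as follows. This is an illustration under a stated assumption: "inverted" is read here as applying the learned linear map to the teacher features (teacher space to student space) rather than to the student features, which the abstract does not spell out. All function names and the toy matrices are hypothetical.

```python
def matvec(P, x):
    """Multiply matrix P (list of rows) by vector x."""
    return [sum(p * xi for p, xi in zip(row, x)) for row in P]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    assert len(a) == len(b)
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)


def standard_projection_loss(student, teacher, P):
    """Standard projector: map student features into the teacher's space."""
    return mse(matvec(P, student), teacher)


def inverted_projection_loss(student, teacher, P_inv):
    """Inverted projector (assumed reading): map teacher features into the
    student's space, so the map can drop teacher dimensions it does not need."""
    return mse(student, matvec(P_inv, teacher))


# Toy features: 2-d student, 3-d teacher whose last dimension is task-specific.
student = [1.0, 2.0]
teacher = [1.0, 2.0, 3.0]
P = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]       # student -> teacher space
P_inv = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]     # teacher -> student space
```

With these toy values the inverted map can simply ignore the third teacher dimension, whereas the standard map is penalised for failing to reproduce it.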

$V_kD:$ Improving Knowledge Distillation using Orthogonal Projections

Mar 10, 2024

Roy Miles, Ismail Elezi, Jiankang Deng

Abstract: Knowledge distillation is an effective method for training small and efficient deep learning models. However, the efficacy of a single method can degenerate when transferring to other tasks, modalities, or even other architectures. To address this limitation, we propose a novel constrained feature distillation method. This method is derived from a small set of core principles, which results in two emerging components: an orthogonal projection and a task-specific normalisation. Equipped with both of these components, our transformer models can outperform all previous methods on ImageNet and reach up to a 4.4% relative improvement over the previous state-of-the-art methods. To further demonstrate the generality of our method, we apply it to object detection and image generation, whereby we obtain consistent and substantial performance improvements over state-of-the-art. Code and models are publicly available: https://github.com/roymiles/vkd

* CVPR 2024. Code available at https://github.com/roymiles/vkd
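
What an orthogonal projection buys can be seen in a small sketch: a map with orthonormal columns preserves the norm of the features it projects. Gram-Schmidt is used here only as a simple stand-in for building such a map. It is not taken from the abstract, and the authors' repository may enforce orthogonality differently. All names below are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gram_schmidt(vectors):
    """Orthonormalise a list of linearly independent vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm < 1e-12:
            raise ValueError("vectors must be linearly independent")
        basis.append([wi / norm for wi in w])
    return basis


def project(columns, x):
    """Compute y = sum_j x[j] * columns[j], i.e. y = P x with P's columns given."""
    dim = len(columns[0])
    return [sum(x[j] * columns[j][i] for j in range(len(columns))) for i in range(dim)]


# Map 2-d student features into a 3-d teacher space with orthonormal columns.
columns = gram_schmidt([[1.0, 1.0, 0.0], [1.0, 0.0, 1.0]])
y = project(columns, [3.0, 4.0])
norm_y = math.sqrt(dot(y, y))   # equals the input norm up to rounding
```

Because the columns are orthonormal, `norm_y` matches the norm of the input `[3.0, 4.0]` up to floating-point error, so the projection cannot rescale or collapse the student features.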

A closer look at the training dynamics of knowledge distillation

Mar 20, 2023

Roy Miles, Krystian Mikolajczyk

Abstract: In this paper we revisit the efficacy of knowledge distillation as a function matching and metric learning problem. In doing so we verify three important design decisions, namely the normalisation, soft maximum function, and projection layers as key ingredients. We theoretically show that the projector implicitly encodes information on past examples, enabling relational gradients for the student. We then show that the normalisation of representations is tightly coupled with the training dynamics of this projector, which can have a large impact on the student's performance. Finally, we show that a simple soft maximum function can be used to address any significant capacity gap problems. Experimental results on various benchmark datasets demonstrate that using these insights can lead to superior or comparable performance to state-of-the-art knowledge distillation techniques, despite being much more computationally efficient. In particular, we obtain these results across image classification (CIFAR100 and ImageNet), object detection (COCO2017), and on more difficult distillation objectives, such as training data efficient transformers, whereby we attain a 77.2% top-1 accuracy with DeiT-Ti on ImageNet.
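
The abstract does not define its soft maximum function. One standard smooth approximation of the maximum is LogSumExp, sketched below purely as an assumption about what such a function could look like. The name `soft_maximum` and the `temperature` parameter are hypothetical.

```python
import math


def soft_maximum(values, temperature=1.0):
    """LogSumExp: temperature * log(sum(exp(v / temperature))).

    Subtracting the true maximum first keeps the exponentials from overflowing.
    The result lies between max(values) and max(values) + temperature * log(n).
    """
    m = max(values)
    total = sum(math.exp((v - m) / temperature) for v in values)
    return m + temperature * math.log(total)


# Hypothetical per-feature distances between student and teacher.
distances = [0.1, 0.5, 2.0]
smooth = soft_maximum(distances)          # smooth upper bound on the largest distance
sharp = soft_maximum(distances, 0.01)     # approaches max(distances) as temperature -> 0
```

Unlike a hard maximum, this function passes gradient to every entry, weighted towards the largest ones, and lowering the temperature moves it closer to the hard maximum.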

MobileVOS: Real-Time Video Object Segmentation Contrastive Learning meets Knowledge Distillation

Mar 14, 2023

Roy Miles, Mehmet Kerim Yucel, Bruno Manganelli, Albert Saa-Garriga

Abstract: This paper tackles the problem of semi-supervised video object segmentation on resource-constrained devices, such as mobile phones. We formulate this problem as a distillation task, whereby we demonstrate that small space-time-memory networks with finite memory can achieve competitive results with state of the art, but at a fraction of the computational cost (32 milliseconds per frame on a Samsung Galaxy S22). Specifically, we provide a theoretically grounded framework that unifies knowledge distillation with supervised contrastive representation learning. These models are able to jointly benefit from both pixel-wise contrastive learning and distillation from a pre-trained teacher. We validate this loss by achieving competitive J&F to state of the art on both the standard DAVIS and YouTube benchmarks, despite running up to 5x faster, and with 32x fewer parameters.

* CVPR 2023

Information Theoretic Representation Distillation

Dec 01, 2021

Roy Miles, Adrián López Rodríguez, Krystian Mikolajczyk

Abstract: Despite the empirical success of knowledge distillation, it still lacks a theoretical foundation that can naturally lead to computationally inexpensive implementations. To address this concern, we forge an alternative connection between information theory and knowledge distillation using a recently proposed entropy-like functional. In doing so, we introduce two distinct complementary losses which aim to maximise the correlation and mutual information between the student and teacher representations. Our method achieves competitive performance to state-of-the-art on the knowledge distillation and cross-model transfer tasks, while incurring significantly lower training overhead than closely related and similarly performing approaches. We further demonstrate the effectiveness of our method on a binary distillation task, whereby we shed light on a new state-of-the-art for binary quantisation. The code, evaluation protocols, and trained models will be publicly available.

Network compression and faster inference using spatial basis filters

Oct 25, 2021

Roy Miles, Krystian Mikolajczyk

Abstract: We present an efficient alternative to the convolutional layer through utilising spatial basis filters (SBF). SBF layers exploit the spatial redundancy in the convolutional filters across the depth to achieve overall model compression, while maintaining the top-end accuracy of their dense counterparts. Training SBF-Nets is modelled as a simple pruning problem, but instead of zeroing out the pruned channels, they are replaced with inexpensive transformations from the set of non-pruned features. To enable an adoption of these SBF layers, we provide a flexible training pipeline and an efficient implementation in CUDA with low latency. To further demonstrate the effective capacity of these models, we apply semi-supervised knowledge distillation that leads to significant performance improvements over the baseline networks. Our experiments show that SBF-Nets are effective and achieve comparable or improved performance to state-of-the-art across CIFAR10, CIFAR100, Tiny-ImageNet, and ILSVRC-2012.

Cascaded channel pruning using hierarchical self-distillation

Aug 16, 2020

Roy Miles, Krystian Mikolajczyk

Abstract: In this paper, we propose an approach for filter-level pruning with hierarchical knowledge distillation based on the teacher, teaching-assistant, and student framework. Our method makes use of teaching assistants at intermediate pruning levels that share the same architecture and weights as the target student. We propose to prune each model independently using the gradient information from its corresponding teacher. By considering the relative sizes of each student-teacher pair, this formulation provides a natural trade-off between the capacity gap for knowledge distillation and the bias of the filter saliency updates. Our results show improvements in the attainable accuracy and model compression across the CIFAR10 and ImageNet classification tasks using the VGG16 and ResNet50 architectures. We provide an extensive evaluation that demonstrates the benefits of using a varying number of teaching assistant models at different sizes.

* BMVC 2020

Compression of convolutional neural networks for high performance image matching tasks on mobile devices

Jan 09, 2020

Roy Miles, Krystian Mikolajczyk

Abstract: Deep neural networks have demonstrated state-of-the-art performance for feature-based image matching through the advent of new large and diverse datasets. However, there has been little work on evaluating the computational cost, model size, and matching accuracy tradeoffs for these models. This paper explicitly addresses these practical constraints by considering the state-of-the-art L2Net architecture. We observe a significant redundancy in the L2Net architecture, which we exploit through the use of depthwise separable layers and an efficient Tucker decomposition. We demonstrate that a combination of these methods is more effective, but still sacrifices the top-end accuracy. We therefore propose the Convolution-Depthwise-Pointwise (CDP) layer, which provides a means of interpolating between the standard and depthwise separable convolutions. With this proposed layer, we are able to achieve up to an 8 times reduction in the number of parameters on the L2Net architecture and a 13 times reduction in the computational complexity, while sacrificing less than 1% on the overall accuracy across the HPatches benchmarks. To further demonstrate the generalisation of this approach, we apply it to the SuperPoint model. We show that CDP layers improve upon the accuracy while using significantly fewer parameters and floating-point operations for inference.
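
The two endpoints that the CDP layer interpolates between differ sharply in parameter count, as the short calculation below shows. It covers only standard and depthwise separable convolutions (bias terms ignored). The CDP layer itself is not specified in the abstract and is not modelled here, and the layer sizes are hypothetical rather than taken from L2Net.

```python
def standard_conv_params(k, c_in, c_out):
    """A k x k convolution has one k x k x c_in filter per output channel."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """One k x k filter per input channel (depthwise), then a 1 x 1
    convolution mixing c_in channels into c_out channels (pointwise)."""
    return k * k * c_in + c_in * c_out


# Hypothetical layer: 3 x 3 kernel, 64 input channels, 128 output channels.
dense = standard_conv_params(3, 64, 128)             # 73728
separable = depthwise_separable_params(3, 64, 128)   # 8768
reduction = dense / separable                        # roughly 8.4x fewer parameters
```

A layer that sits between these two forms trades some of this saving back for accuracy, which is the trade-off the abstract describes.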
