Abstract: Neural networks are often challenging to work with due to their large size and complexity. To address this, various methods aim to reduce model size by sparsifying or decomposing weight matrices, such as magnitude pruning and low-rank or block-diagonal factorization. In this work, we present Double Sparse Factorization (DSF), where we factorize each weight matrix into two sparse matrices. Although solving this problem exactly is computationally infeasible, we propose an efficient heuristic based on alternating minimization via ADMM that achieves state-of-the-art results, enabling unprecedented sparsification of neural networks. For instance, in a one-shot pruning setting, our method can reduce the size of the LLaMA2-13B model by 50% while maintaining better performance than the dense LLaMA2-7B model. We also compare favorably with Optimal Brain Compression, the state-of-the-art layer-wise pruning approach for convolutional neural networks. Furthermore, accuracy improvements of our method persist even after further model fine-tuning. Code available at: https://github.com/usamec/double_sparse.
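The abstract does not spell out the optimization, so the following is only a minimal NumPy sketch of the alternating structure: each factor is refit by least squares and then pruned to a target density. The paper's actual method solves each subproblem with ADMM; the helper names `sparsify` and `double_sparse_factorize` and the magnitude-pruning step are assumptions made for illustration.

```python
import numpy as np

def sparsify(M, density):
    """Keep only the largest-magnitude entries of M (hypothetical helper)."""
    k = max(1, int(density * M.size))
    thresh = np.sort(np.abs(M), axis=None)[-k]
    return np.where(np.abs(M) >= thresh, M, 0.0)

def double_sparse_factorize(W, density=0.25, iters=20, rank=None):
    """Alternately refit A and B so that W ~ A @ B with both factors sparse."""
    m, n = W.shape
    r = rank or min(m, n)
    rng = np.random.default_rng(0)
    A = rng.standard_normal((m, r)) * 0.01
    B = rng.standard_normal((r, n)) * 0.01
    for _ in range(iters):
        # Fix B, refit A by least squares, then prune A to the target density.
        A = sparsify(W @ np.linalg.pinv(B), density)
        # Fix A, refit B by least squares, then prune B.
        B = sparsify(np.linalg.pinv(A) @ W, density)
    return A, B
```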
Abstract: Pruning large language models (LLMs) is a challenging task due to their enormous size. The primary difficulty is fine-tuning the model after pruning, which is needed to recover the performance lost by dropping weights. Recent approaches have either ignored fine-tuning entirely, focusing on efficient pruning criteria, or attempted layer-wise weight updates, preserving the behavior of each layer. However, even layer-wise weight updates can be costly for LLMs, and previous works have resorted to various approximations. In our paper, we propose a fast and optimal weight update algorithm for pruned layers based on the Alternating Direction Method of Multipliers (ADMM). Coupled with a simple iterative pruning mask selection, our algorithm achieves state-of-the-art pruning performance across a wide range of LLMs. Code is available at https://github.com/fmfi-compbio/admm-pruning.
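As a rough illustration of the layer-wise reconstruction idea, the sketch below uses ADMM to solve a least-squares weight update constrained to a fixed pruning mask, so that the pruned layer's outputs on calibration data stay close to the dense layer's. The shapes, the quadratic objective, and the name `admm_masked_update` are assumptions; the paper additionally iterates the mask selection itself.

```python
import numpy as np

def admm_masked_update(X, W_dense, mask, rho=1.0, iters=20):
    """
    Find weights W with support in `mask` (0/1 array) whose layer output
    X @ W stays close to the dense output X @ W_dense.
    Assumed shapes: X is (samples, in), W_dense and mask are (in, out).
    """
    Y = X @ W_dense                              # dense outputs to match
    H_inv = np.linalg.inv(X.T @ X + rho * np.eye(X.shape[1]))
    XtY = X.T @ Y
    Z = W_dense * mask
    U = np.zeros_like(W_dense)
    for _ in range(iters):
        # W-update: ridge-regularized least squares pulled toward Z - U.
        W = H_inv @ (XtY + rho * (Z - U))
        # Z-update: project onto the pruning mask (zero out pruned weights).
        Z = (W + U) * mask
        # Dual update.
        U = U + W - Z
    return Z
```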
Abstract: We propose a simple scheme for merging two neural networks trained from different starting initializations into a single network of the same size as the originals. We do this by carefully selecting channels from each input network. Our procedure can serve as a finalization step after training with multiple starting seeds, avoiding commitment to an unlucky one. We also show that training two networks and merging them leads to better performance than training a single network for an extended period of time. Availability: https://github.com/fmfi-compbio/neural-network-merging
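The abstract does not describe the selection criterion, so the following is only a toy PyTorch illustration of stitching a merged layer from two same-shaped layers by a per-channel choice. The weight-norm criterion and the name `merge_layers` are invented for the example, and the real procedure must also keep the choice consistent across consecutive layers.

```python
import torch
import torch.nn as nn

def merge_layers(layer_a, layer_b):
    """Toy merge of two same-shaped Linear layers: for each output channel,
    keep the row with the larger weight norm (invented criterion)."""
    wa, wb = layer_a.weight.data, layer_b.weight.data
    take_a = wa.norm(dim=1) >= wb.norm(dim=1)            # per-channel choice
    merged = nn.Linear(layer_a.in_features, layer_a.out_features)
    merged.weight.data = torch.where(take_a[:, None], wa, wb)
    merged.bias.data = torch.where(take_a, layer_a.bias.data, layer_b.bias.data)
    # The same `take_a` mask must be propagated to the next layer's input
    # channels so the merged network stays consistent.
    return merged, take_a
```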
Abstract: In nanopore sequencing, an electrical signal is measured as DNA molecules pass through the sequencing pores. Translating these signals into DNA bases (base calling) is a highly non-trivial task, and its quality has a large impact on sequencing accuracy. The most successful nanopore base callers to date use convolutional neural networks (CNNs) to accomplish the task. Convolutional layers in CNNs are typically composed of filters with a constant window size and perform best when analyzing signals with uniform speed. However, the speed of nanopore sequencing varies greatly both within reads and between sequencing runs. Here, we present dynamic pooling, a novel neural network component that addresses this problem by adaptively adjusting the pooling ratio. To demonstrate the usefulness of dynamic pooling, we developed two base callers: Heron and Osprey. Heron improves accuracy beyond Bonito, the experimental high-accuracy base caller developed by Oxford Nanopore. Osprey is a fast base caller that competes in accuracy with Guppy's high-accuracy mode but does not require GPU acceleration, achieving near real-time speed on common desktop CPUs. Availability: https://github.com/fmfi-compbio/osprey, https://github.com/fmfi-compbio/heron Keywords: nanopore sequencing, base calling, convolutional neural networks, pooling
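A simplified sketch of the idea behind adaptive pooling ratios, not Heron's or Osprey's actual component: a small convolution predicts a per-timestep "advance", and timesteps are kept only where the cumulative advance crosses an integer, so regions with a small predicted advance are pooled more aggressively than regions with a large one. The class name and the resampling rule are assumptions for illustration.

```python
import torch
import torch.nn as nn

class DynamicPooling(nn.Module):
    """Illustrative pooling layer whose downsampling ratio varies along the
    signal, driven by a learned per-timestep advance in (0, 1)."""
    def __init__(self, channels):
        super().__init__()
        self.rate = nn.Conv1d(channels, 1, kernel_size=5, padding=2)

    def forward(self, x):                                 # x: (1, channels, time)
        advance = torch.sigmoid(self.rate(x)).squeeze(1)  # (1, time)
        position = torch.cumsum(advance, dim=-1)          # monotone "clock"
        # Keep a timestep each time the clock passes an integer boundary.
        keep = position.floor().diff(prepend=position[:, :1].floor()) >= 1
        return x[:, :, keep[0]]
```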
Abstract: We developed DeepNano-coral, a new base caller for nanopore sequencing optimized to run on the Coral Edge Tensor Processing Unit, a small USB-attached hardware accelerator. To achieve this goal, we designed new versions of two key components used in convolutional neural networks for speech recognition and base calling. In our components, we propose a new factorization of a full convolution into smaller operations, which decreases memory accesses, a bottleneck on this device. DeepNano-coral achieves real-time base calling during sequencing with accuracy slightly better than the fast mode of the Guppy base caller and is extremely energy efficient, using only 10 W of power. Availability: https://github.com/fmfi-compbio/coral-basecaller
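For context, the sketch below shows the standard depthwise-separable factorization of a full Conv1d, i.e. the general kind of decomposition the abstract refers to. The paper proposes its own, different factorization tailored to the Edge TPU's memory-access pattern, which is not reproduced here; the helper name `separable_conv1d` is assumed for the example.

```python
import torch.nn as nn

def separable_conv1d(in_ch, out_ch, kernel_size):
    """Replace a full Conv1d with a depthwise filter followed by a pointwise
    channel-mixing convolution (standard factorization, shown as a reference)."""
    return nn.Sequential(
        nn.Conv1d(in_ch, in_ch, kernel_size, padding=kernel_size // 2,
                  groups=in_ch, bias=False),   # depthwise: one filter per channel
        nn.Conv1d(in_ch, out_ch, 1, bias=False),  # pointwise: mixes channels
    )
```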