Yuma Ichikawa

More Than Bits: Multi-Envelope Double Binary Factorization for Extreme Quantization

Dec 31, 2025

Yuma Ichikawa, Yoshihiko Fujisawa, Yudai Fujimoto, Akira Sakai, Katsuki Fujisawa

Abstract:For extreme low-bit quantization of large language models (LLMs), Double Binary Factorization (DBF) is attractive as it enables efficient inference without sacrificing accuracy. However, the scaling parameters of DBF are too restrictive; after factoring out signs, all rank components share the same magnitude profile, resulting in performance saturation. We propose Multi-envelope DBF (MDBF), which retains a shared pair of 1-bit sign bases but replaces the single envelope with a rank-$l$ envelope. By sharing sign matrices among envelope components, MDBF effectively maintains a binary carrier and utilizes the limited memory budget for magnitude expressiveness. We also introduce a closed-form initialization and an alternating refinement method to optimize MDBF. Across the LLaMA and Qwen families, MDBF enhances perplexity and zero-shot accuracy over previous binary formats at matched bits per weight while preserving the same deployment-friendly inference primitive.

* 14 pages, 2 figures

PHOTON: Hierarchical Autoregressive Modeling for Lightspeed and Memory-Efficient Language Generation

Dec 22, 2025

Yuma Ichikawa, Naoya Takagi, Takumi Nakagawa, Yuzi Kanazawa, Akira Sakai

Abstract:Transformers operate as horizontal token-by-token scanners; at each generation step, the model attends to an ever-growing sequence of token-level states. This access pattern increases prefill latency and makes long-context decoding increasingly memory-bound, as KV-cache reads and writes dominate inference throughput rather than arithmetic computation. We propose Parallel Hierarchical Operation for Top-down Networks (PHOTON), a hierarchical autoregressive model that replaces flat scanning with vertical, multi-resolution context access. PHOTON maintains a hierarchy of latent streams: a bottom-up encoder progressively compresses tokens into low-rate contextual states, while lightweight top-down decoders reconstruct fine-grained token representations. This design reduces decode-time KV-cache traffic, yielding up to $10^{3}\times$ higher throughput per unit memory. Experimental results show that PHOTON outperforms competitive Transformer-based language models on the throughput-quality trade-off, with significant advantages in long-context and multi-query tasks.

* 12 pages, 5 figures

Quantization Error Propagation: Revisiting Layer-Wise Post-Training Quantization

Apr 13, 2025

Yamato Arai, Yuma Ichikawa

Abstract:Layer-wise post-training quantization has emerged as a widely used technique for compressing large language models (LLMs) without retraining. However, recent progress in this line of research is saturating, underscoring the need to revisit its core limitation and explore further improvements. This study identifies a critical bottleneck in existing layer-wise PTQ methods: the accumulation of quantization errors across layers significantly degrades performance, particularly in low-bit regimes. To address this, we propose Quantization Error Propagation (QEP), a lightweight and general framework that enhances layer-wise PTQ by explicitly propagating the quantization error, which enables compensation for accumulated quantization errors. Additionally, we introduce a tunable propagation mechanism that allows for control over both propagation strength and computational overhead, making the framework adaptable to various architectures and resource constraints. Empirical evaluations on LLaMA2 models (7B, 13B, 70B) demonstrate that incorporating QEP into standard layer-wise PTQ pipelines outperforms standard PTQ methods. Notably, QEP yields substantial performance improvements under extreme low-bit quantization settings.

* 16 pages, 1 figure
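
The bottleneck named in this abstract, quantization error accumulating across layers, can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the tiny ReLU network, the round-to-nearest quantizer, and the names `quantize` and `layer_errors` are hypothetical and are not the QEP method itself. The sketch only measures how far the quantized path drifts from the full-precision path after each layer.

```python
import random


def quantize(w, bits):
    """Round-to-nearest uniform quantization over a symmetric range (illustrative)."""
    max_abs = max(abs(v) for row in w for v in row)
    if max_abs == 0.0:
        return [row[:] for row in w]
    step = 2.0 * max_abs / (2 ** bits - 1)  # 2**bits evenly spaced levels
    return [[round((v + max_abs) / step) * step - max_abs for v in row] for row in w]


def matvec(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def relu(v):
    return [max(0.0, a) for a in v]


def layer_errors(weights, x, bits):
    """L2 gap between full-precision and quantized activations after each layer.

    Each layer is quantized independently, but the quantized path is fed the
    output of the previously quantized layers, so upstream errors carry over.
    """
    x_fp, x_q = x[:], x[:]
    errors = []
    for w in weights:
        x_fp = relu(matvec(w, x_fp))
        x_q = relu(matvec(quantize(w, bits), x_q))
        errors.append(sum((a - b) ** 2 for a, b in zip(x_fp, x_q)) ** 0.5)
    return errors


random.seed(0)
dim, depth = 8, 6
weights = [[[random.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
           for _ in range(depth)]
x0 = [random.uniform(-1, 1) for _ in range(dim)]

errors = layer_errors(weights, x0, bits=3)
for i, e in enumerate(errors, 1):
    print(f"layer {i}: activation error {e:.4f}")
```

Comparing the printed values for different `bits` settings gives a rough feel for why low-bit regimes are the hardest case for independent, layer-by-layer quantization.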

Ratio Divergence Learning Using Target Energy in Restricted Boltzmann Machines: Beyond Kullback--Leibler Divergence Learning

Sep 12, 2024

Yuichi Ishida, Yuma Ichikawa, Aki Dote, Toshiyuki Miyazawa, Koji Hukushima

Abstract:We propose ratio divergence (RD) learning for discrete energy-based models, a method that utilizes both training data and a tractable target energy function. We apply RD learning to restricted Boltzmann machines (RBMs), which are a minimal model that satisfies the universal approximation theorem for discrete distributions. RD learning combines the strengths of both forward and reverse Kullback-Leibler divergence (KLD) learning, effectively addressing the "notorious" issues of underfitting with the forward KLD and mode-collapse with the reverse KLD. Since the summation of forward and reverse KLD seems to be sufficient to combine the strengths of both approaches, we include this learning method as a direct baseline in numerical experiments to evaluate its effectiveness. Numerical experiments demonstrate that RD learning significantly outperforms other learning methods in terms of energy function fitting, mode-covering, and learning stability across various discrete energy-based models. Moreover, the performance gaps between RD learning and the other learning methods become more pronounced as the dimensions of target models increase.

* 14 pages, 19 figures

Statistical Mechanics of Min-Max Problems

Sep 09, 2024

Yuma Ichikawa, Koji Hukushima

Abstract:Min-max optimization problems, also known as saddle point problems, have attracted significant attention due to their applications in various fields, such as fair beamforming, generative adversarial networks (GANs), and adversarial learning. However, understanding the properties of these min-max problems has remained a substantial challenge. This study introduces a statistical mechanical formalism for analyzing the equilibrium values of min-max problems in the high-dimensional limit, while appropriately addressing the order of operations for min and max. As a first step, we apply this formalism to bilinear min-max games and simple GANs, deriving the relationship between the amount of training data and generalization error and indicating the optimal ratio of fake to real data for effective learning. This formalism provides a groundwork for a deeper theoretical analysis of the equilibrium properties in various machine learning methods based on min-max problems and encourages the development of new algorithms and architectures.

* 16 pages, 1 figure
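
The abstract stresses that the order of min and max matters. A minimal sketch of why, using a finite payoff matrix with pure strategies (an assumption chosen for illustration; the paper itself works in the high-dimensional limit, and the function names here are hypothetical): the row player minimizes, the column player maximizes, and swapping the order of the two operations can change the value.

```python
def min_max(payoff):
    """Row player (minimizer) commits first; column player then maximizes."""
    return min(max(row) for row in payoff)


def max_min(payoff):
    """Column player (maximizer) commits first; row player then minimizes."""
    columns = list(zip(*payoff))
    return max(min(col) for col in columns)


# Matching pennies: no pure-strategy saddle point, so the order matters.
pennies = [[1, -1],
           [-1, 1]]

# A matrix with a saddle point at row 1, column 1 (value 2): the order is irrelevant.
saddle = [[3, 5],
          [1, 2]]

print(min_max(pennies), max_min(pennies))  # 1 -1
print(min_max(saddle), max_min(saddle))    # 2 2
```

In general `max_min(payoff) <= min_max(payoff)`, with equality exactly when a saddle point exists, which is why an analysis of equilibrium values has to fix the order explicitly.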

Optimization by Parallel Quasi-Quantum Annealing with Gradient-Based Sampling

Sep 02, 2024

Yuma Ichikawa, Yamato Arai

Abstract:Learning-based methods have gained attention as general-purpose solvers because they can automatically learn problem-specific heuristics, reducing the need for manually crafted heuristics. However, these methods often face challenges with scalability. To address these issues, the improved Sampling algorithm for Combinatorial Optimization (iSCO) using discrete Langevin dynamics has been proposed, demonstrating better performance than several learning-based solvers. This study proposes a different approach that integrates gradient-based updates through continuous relaxation, combined with Quasi-Quantum Annealing (QQA). QQA smoothly transitions the objective function from a simple convex form, where half-integral solutions dominate, to the original objective function, where the variables are restricted to 0 or 1. Furthermore, we incorporate communication between parallel runs leveraging GPUs, enhancing exploration capabilities and accelerating convergence. Numerical experiments demonstrate that our approach is a competitive general-purpose solver, achieving comparable performance to iSCO across various benchmark problems. Notably, our method exhibits superior trade-offs between speed and solution quality for large-scale instances compared to iSCO, commercial solvers, and specialized algorithms.

* 18 pages, 3 figures

Training-Free Time-Series Anomaly Detection: Leveraging Image Foundation Models

Aug 27, 2024

Nobuo Namura, Yuma Ichikawa

Abstract:Recent advancements in time-series anomaly detection have relied on deep learning models to handle the diverse behaviors of time-series data. However, these models often suffer from unstable training and require extensive hyperparameter tuning, leading to practical limitations. Although foundation models present a potential solution, their use in time series is limited. To overcome these issues, we propose an innovative image-based, training-free time-series anomaly detection (ITF-TAD) approach. ITF-TAD converts time-series data into images using wavelet transform and compresses them into a single representation, leveraging image foundation models for anomaly detection. This approach achieves high-performance anomaly detection without unstable neural network training or hyperparameter tuning. Furthermore, ITF-TAD identifies anomalies across different frequencies, providing users with a detailed visualization of anomalies and their corresponding frequencies. Comprehensive experiments on five benchmark datasets, including univariate and multivariate time series, demonstrate that ITF-TAD offers a practical and effective solution with performance exceeding or comparable to that of deep models.
Continuous Tensor Relaxation for Finding Diverse Solutions in Combinatorial Optimization Problems

Feb 03, 2024

Yuma Ichikawa, Hiroaki Iwashita

Abstract:Finding the best solution is the most common objective in combinatorial optimization (CO) problems. However, a single solution may not be suitable in practical scenarios, as the objective functions and constraints are only approximations of original real-world situations. To tackle this, finding (i) "heterogeneous solutions", diverse solutions with distinct characteristics, and (ii) "penalty-diversified solutions", variations in constraint severity, are natural directions. This strategy provides the flexibility to select a suitable solution during post-processing. However, discovering these diverse solutions is more challenging than identifying a single solution. To overcome this challenge, this study introduces Continual Tensor Relaxation Annealing (CTRA) for unsupervised-learning-based (UL-based) CO solvers. CTRA addresses various problems simultaneously by extending the continuous relaxation approach, which transforms discrete decision variables into continuous tensors. This method finds heterogeneous and penalty-diversified solutions through mutual interactions, where the choice of one solution affects the other choices. Numerical experiments show that CTRA enables UL-based solvers to find heterogeneous and penalty-diversified solutions much faster than existing UL-based solvers. Moreover, these experiments reveal that CTRA enhances the exploration ability.

* 16 pages, 10 figures

Learning Dynamics in Linear VAE: Posterior Collapse Threshold, Superfluous Latent Space Pitfalls, and Speedup with KL Annealing

Oct 24, 2023

Yuma Ichikawa, Koji Hukushima

Abstract:Variational autoencoders (VAEs) face a notorious problem wherein the variational posterior often aligns closely with the prior, a phenomenon known as posterior collapse, which hinders the quality of representation learning. To mitigate this problem, an adjustable hyperparameter $\beta$ and a strategy for annealing this parameter, called KL annealing, have been proposed. This study presents a theoretical analysis of the learning dynamics in a minimal VAE. It is rigorously proved that the dynamics converge to a deterministic process in the limit of large input dimensions, thereby enabling a detailed dynamical analysis of the generalization error. Furthermore, the analysis shows that the VAE initially learns entangled representations and gradually acquires disentangled representations. A fixed-point analysis of the deterministic process reveals that when $\beta$ exceeds a certain threshold, posterior collapse becomes inevitable regardless of the learning period. Additionally, the superfluous latent variables for the data-generative factors lead to overfitting of the background noise; this adversely affects both generalization and learning convergence. The analysis further reveals that appropriately tuned KL annealing can accelerate convergence.

* 24 pages, 5 figures
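
For readers unfamiliar with KL annealing, the idea can be sketched as a schedule for the weight $\beta$ on the KL term of the VAE objective. The linear ramp below is one common choice and is an assumption here; the abstract does not specify the schedule analyzed in the paper, and the names `beta_schedule` and `beta_vae_loss` are hypothetical.

```python
def beta_schedule(step, warmup_steps, beta_max=1.0):
    """Linearly ramp beta from 0 to beta_max over warmup_steps, then hold."""
    if warmup_steps <= 0:
        return beta_max
    return beta_max * min(1.0, step / warmup_steps)


def beta_vae_loss(reconstruction_loss, kl_divergence, step, warmup_steps, beta_max=1.0):
    """Objective of the form: reconstruction + beta(step) * KL."""
    beta = beta_schedule(step, warmup_steps, beta_max)
    return reconstruction_loss + beta * kl_divergence


for step in (0, 50, 100, 200):
    print(step, beta_schedule(step, warmup_steps=100))
```

Early in training the KL term is weighted lightly, so the model is not pushed toward the prior before the decoder has learned to use the latent variables; the final value `beta_max` is what the abstract's threshold statement concerns.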

Controlling Continuous Relaxation for Combinatorial Optimization

Sep 29, 2023

Yuma Ichikawa

Abstract:Recent advancements in combinatorial optimization (CO) problems emphasize the potential of graph neural networks (GNNs). The physics-inspired GNN (PI-GNN) solver, which finds approximate solutions through unsupervised learning, has attracted significant attention for large-scale CO problems. Nevertheless, there has been limited discussion on the performance of the PI-GNN solver for CO problems on relatively dense graphs where the performance of greedy algorithms worsens. In addition, since the PI-GNN solver employs a relaxation strategy, an artificial transformation from the continuous space back to the original discrete space is necessary after learning, potentially undermining the robustness of the solutions. This paper numerically demonstrates that the PI-GNN solver can be trapped in a local solution, where all variables are zero, in the early stage of learning for CO problems on dense graphs. Then, we address these problems by controlling the continuity and discreteness of relaxed variables while avoiding the local solution: (i) introducing a new penalty term that controls the continuity and discreteness of the relaxed variables and eliminates the local solution; (ii) proposing a new continuous relaxation annealing (CRA) strategy. This annealing first prioritizes continuous solutions and intensifies exploration by leveraging continuity while avoiding the local solution. It then schedules the penalty term to prioritize discrete solutions until the relaxed variables are almost discrete, which eliminates the need for an artificial transformation from the continuous to the original discrete space. Empirically, better results are obtained for CO problems on dense graphs, where the PI-GNN solver struggles to find reasonable solutions, and for those on relatively sparse graphs. Furthermore, the computational time scaling is identical to that of the PI-GNN solver.

* 15 pages, 6 figures
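
A penalty term that "controls the continuity and discreteness of the relaxed variables" can be sketched as follows. The specific form used here, $\gamma \sum_i \bigl(1 - (2x_i - 1)^{\alpha}\bigr)$ with even $\alpha$, and the linear schedule for $\gamma$ are assumptions made for illustration; the abstract does not state the exact expression, and the names `penalty`, `penalized_loss`, and `gamma_schedule` are hypothetical.

```python
def penalty(x, alpha=2):
    """Zero when every x_i is 0 or 1; largest when every x_i is 0.5 (alpha even)."""
    return sum(1.0 - (2.0 * xi - 1.0) ** alpha for xi in x)


def penalized_loss(loss_value, x, gamma, alpha=2):
    """Relaxed objective plus the annealed penalty term."""
    return loss_value + gamma * penalty(x, alpha)


def gamma_schedule(step, total_steps, gamma_start=-1.0, gamma_end=1.0):
    """Linear schedule from a negative to a positive penalty strength."""
    t = min(1.0, max(0.0, step / total_steps))
    return gamma_start + t * (gamma_end - gamma_start)


half = [0.5, 0.5, 0.5]
binary = [0.0, 1.0, 1.0]

# Early (gamma < 0): the half-integral point is rewarded, encouraging exploration.
print(penalized_loss(0.0, half, gamma_schedule(0, 100)),
      penalized_loss(0.0, binary, gamma_schedule(0, 100)))

# Late (gamma > 0): the half-integral point is penalized, pushing toward 0/1 values.
print(penalized_loss(0.0, half, gamma_schedule(100, 100)),
      penalized_loss(0.0, binary, gamma_schedule(100, 100)))
```

With a negative strength the penalty favors continuous values near 0.5; as the strength becomes positive, only near-binary assignments avoid the penalty, so no separate rounding step is needed at the end of such a schedule.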
