Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Chi-Ying Tsui

Partial Knowledge Distillation for Alleviating the Inherent Inter-Class Discrepancy in Federated Learning

Nov 23, 2024

Xiaoyu Gan, Xizi Chen, Jingyang Zhu, Xiaomeng Wang, Jingbo Jiang, Chi-Ying Tsui

Figure 1 for Partial Knowledge Distillation for Alleviating the Inherent Inter-Class Discrepancy in Federated Learning

Figure 2 for Partial Knowledge Distillation for Alleviating the Inherent Inter-Class Discrepancy in Federated Learning

Figure 3 for Partial Knowledge Distillation for Alleviating the Inherent Inter-Class Discrepancy in Federated Learning

Figure 4 for Partial Knowledge Distillation for Alleviating the Inherent Inter-Class Discrepancy in Federated Learning

Abstract:Substantial efforts have been devoted to alleviating the impact of the long-tailed class distribution in federated learning. In this work, we observe an interesting phenomenon that weak classes consistently exist even for class-balanced learning. These weak classes, different from the minority classes in the previous works, are inherent to data and remain fairly consistent for various network structures and learning paradigms. The inherent inter-class accuracy discrepancy can reach over 36.9% for federated learning on the FashionMNIST and CIFAR-10 datasets, even when the class distribution is balanced both globally and locally. In this study, we empirically analyze the potential reason for this phenomenon. Furthermore, a class-specific partial knowledge distillation method is proposed to improve the model's classification accuracy for weak classes. In this approach, knowledge transfer is initiated upon the occurrence of specific misclassifications within certain weak classes. Experimental results show that the accuracy of weak classes can be improved by 10.7%, reducing the inherent interclass discrepancy effectively.

Via

Access Paper or Ask Questions

FedAQ: Communication-Efficient Federated Edge Learning via Joint Uplink and Downlink Adaptive Quantization

Jun 26, 2024

Linping Qu, Shenghui Song, Chi-Ying Tsui

Figure 1 for FedAQ: Communication-Efficient Federated Edge Learning via Joint Uplink and Downlink Adaptive Quantization

Figure 2 for FedAQ: Communication-Efficient Federated Edge Learning via Joint Uplink and Downlink Adaptive Quantization

Figure 3 for FedAQ: Communication-Efficient Federated Edge Learning via Joint Uplink and Downlink Adaptive Quantization

Figure 4 for FedAQ: Communication-Efficient Federated Edge Learning via Joint Uplink and Downlink Adaptive Quantization

Abstract:Federated learning (FL) is a powerful machine learning paradigm which leverages the data as well as the computational resources of clients, while protecting clients' data privacy. However, the substantial model size and frequent aggregation between the server and clients result in significant communication overhead, making it challenging to deploy FL in resource-limited wireless networks. In this work, we aim to mitigate the communication overhead by using quantization. Previous research on quantization has primarily focused on the uplink communication, employing either fixed-bit quantization or adaptive quantization methods. In this work, we introduce a holistic approach by joint uplink and downlink adaptive quantization to reduce the communication overhead. In particular, we optimize the learning convergence by determining the optimal uplink and downlink quantization bit-length, with a communication energy constraint. Theoretical analysis shows that the optimal quantization levels depend on the range of model gradients or weights. Based on this insight, we propose a decreasing-trend quantization for the uplink and an increasing-trend quantization for the downlink, which aligns with the change of the model parameters during the training process. Experimental results show that, the proposed joint uplink and downlink adaptive quantization strategy can save up to 66.7% energy compared with the existing schemes.

* This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

Via

Access Paper or Ask Questions

How Robust is Federated Learning to Communication Error? A Comparison Study Between Uplink and Downlink Channels

Oct 25, 2023

Linping Qu, Shenghui Song, Chi-Ying Tsui, Yuyi Mao

Abstract:Because of its privacy-preserving capability, federated learning (FL) has attracted significant attention from both academia and industry. However, when being implemented over wireless networks, it is not clear how much communication error can be tolerated by FL. This paper investigates the robustness of FL to the uplink and downlink communication error. Our theoretical analysis reveals that the robustness depends on two critical parameters, namely the number of clients and the numerical range of model parameters. It is also shown that the uplink communication in FL can tolerate a higher bit error rate (BER) than downlink communication, and this difference is quantified by a proposed formula. The findings and theoretical analyses are further validated by extensive experiments.

* Submitted to IEEE for possible publication

Via

Access Paper or Ask Questions

Step-GRAND: A Low Latency Universal Soft-input Decoder

Jul 27, 2023

Syed Mohsin Abbas, Marwan Jalaleddine, Chi-Ying Tsui, Warren J. Gross

Abstract:GRAND features both soft-input and hard-input variants that are well suited to efficient hardware implementations that can be characterized with achievable average and worst-case decoding latency. This paper introduces step-GRAND, a soft-input variant of GRAND that, in addition to achieving appealing average decoding latency, also reduces the worst-case decoding latency of the corresponding hardware implementation. The hardware implementation results demonstrate that the proposed step-GRAND can decode CA-polar code $(128,105+11)$ with an average information throughput of $47.7$ Gbps at the target FER of $\leq10^{-7}$. Furthermore, the proposed step-GRAND hardware is $10\times$ more area efficient than the previous soft-input ORBGRAND hardware implementation, and its worst-case latency is $\frac{1}{6.8}\times$ that of the previous ORBGRAND hardware.

* Submitted to 2023 IEEE Globecom Workshops

Via

Access Paper or Ask Questions

A 137.5 TOPS/W SRAM Compute-in-Memory Macro with 9-b Memory Cell-Embedded ADCs and Signal Margin Enhancement Techniques for AI Edge Applications

Jul 19, 2023

Xiaomeng Wang, Fengshi Tian, Xizi Chen, Jiakun Zheng, Xuejiao Liu, Fengbin Tu, Jie Yang, Mohamad Sawan, Kwang-Ting Cheng, Chi-Ying Tsui

Abstract:In this paper, we propose a high-precision SRAM-based CIM macro that can perform 4x4-bit MAC operations and yield 9-bit signed output. The inherent discharge branches of SRAM cells are utilized to apply time-modulated MAC and 9-bit ADC readout operations on two bit-line capacitors. The same principle is used for both MAC and A-to-D conversion ensuring high linearity and thus supporting large number of analog MAC accumulations. The memory cell-embedded ADC eliminates the use of separate ADCs and enhances energy and area efficiency. Additionally, two signal margin enhancement techniques, namely the MAC-folding and boosted-clipping schemes, are proposed to further improve the CIM computation accuracy.

* Submitted to IEEE ASSCC 2023

Via

Access Paper or Ask Questions

FedDQ: Communication-Efficient Federated Learning with Descending Quantization

Oct 13, 2021

Linping Qu, Shenghui Song, Chi-Ying Tsui

Figure 1 for FedDQ: Communication-Efficient Federated Learning with Descending Quantization

Figure 2 for FedDQ: Communication-Efficient Federated Learning with Descending Quantization

Figure 3 for FedDQ: Communication-Efficient Federated Learning with Descending Quantization

Figure 4 for FedDQ: Communication-Efficient Federated Learning with Descending Quantization

Abstract:Federated learning (FL) is an emerging privacy-preserving distributed learning scheme. Due to the large model size and frequent model aggregation, FL suffers from critical communication bottleneck. Many techniques have been proposed to reduce the communication volume, including model compression and quantization. Existing adaptive quantization schemes use ascending-trend quantization where the quantizaion level increases with the training stages. In this paper, we formulate the problem as optimizing the training convergence rate for a given communication volume. The result shows that the optimal quantizaiton level can be represented by two factors, i.e., the training loss and the range of model updates, and it is preferable to decrease the quantization level rather than increase. Then, we propose two descending quantization schemes based on the training loss and model range. Experimental results show that proposed schemes not only reduce the communication volume but also help FL converge faster, when compared with current ascending quantization.

* Submitted to ICASSP 2022

Via

Access Paper or Ask Questions

Microshift: An Efficient Image Compression Algorithm for Hardware

Apr 20, 2021

Bo Zhang, Pedro V. Sander, Chi-Ying Tsui, Amine Bermak

Figure 1 for Microshift: An Efficient Image Compression Algorithm for Hardware

Figure 2 for Microshift: An Efficient Image Compression Algorithm for Hardware

Figure 3 for Microshift: An Efficient Image Compression Algorithm for Hardware

Figure 4 for Microshift: An Efficient Image Compression Algorithm for Hardware

Abstract:In this paper, we propose an image compression algorithm called Microshift. We employ an algorithm hardware co-design methodology, yielding a hardware-friendly compression approach with low power consumption. In our method, the image is first micro-shifted, then the sub-quantized values are further compressed. Two methods, the FAST and MRF model, are proposed to recover the bit-depth by exploiting the spatial correlation of natural images. Both methods can decompress images progressively. Our compression algorithm compresses images to 1.25 bits per pixel on average with PSNR of 33.16 dB, outperforming other on-chip compression algorithms. Then, we propose a hardware architecture and implement the algorithm on an FPGA and ASIC. The results on the VLSI design further validate the low hardware complexity and high power efficiency, showing our method is promising, particularly for low-power wireless vision sensor networks.

* Accepted to IEEE Transactions on Circuits and Systems for Video Technology

Via

Access Paper or Ask Questions

Tight Compression: Compressing CNN Through Fine-Grained Pruning and Weight Permutation for Efficient Implementation

Apr 03, 2021

Xizi Chen, Jingyang Zhu, Jingbo Jiang, Chi-Ying Tsui

Figure 1 for Tight Compression: Compressing CNN Through Fine-Grained Pruning and Weight Permutation for Efficient Implementation

Figure 2 for Tight Compression: Compressing CNN Through Fine-Grained Pruning and Weight Permutation for Efficient Implementation

Figure 3 for Tight Compression: Compressing CNN Through Fine-Grained Pruning and Weight Permutation for Efficient Implementation

Figure 4 for Tight Compression: Compressing CNN Through Fine-Grained Pruning and Weight Permutation for Efficient Implementation

Abstract:The unstructured sparsity after pruning poses a challenge to the efficient implementation of deep learning models in existing regular architectures like systolic arrays. On the other hand, coarse-grained structured pruning is suitable for implementation in regular architectures but tends to have higher accuracy loss than unstructured pruning when the pruned models are of the same size. In this work, we propose a model compression method based on a novel weight permutation scheme to fully exploit the fine-grained weight sparsity in the hardware design. Through permutation, the optimal arrangement of the weight matrix is obtained, and the sparse weight matrix is further compressed to a small and dense format to make full use of the hardware resources. Two pruning granularities are explored. In addition to the unstructured weight pruning, we also propose a more fine-grained subword-level pruning to further improve the compression performance. Compared to the state-of-the-art works, the matrix compression rate is significantly improved from 5.88x to 14.13x. As a result, the throughput and energy efficiency are improved by 2.75 and 1.86 times, respectively.

* Previous Conference Version: 2020 57th ACM/IEEE Design Automation Conference (DAC)

Via

Access Paper or Ask Questions

A Reconfigurable Winograd CNN Accelerator with Nesting Decomposition Algorithm for Computing Convolution with Large Filters

Feb 26, 2021

Jingbo Jiang, Xizi Chen, Chi-Ying Tsui

Figure 1 for A Reconfigurable Winograd CNN Accelerator with Nesting Decomposition Algorithm for Computing Convolution with Large Filters

Figure 2 for A Reconfigurable Winograd CNN Accelerator with Nesting Decomposition Algorithm for Computing Convolution with Large Filters

Figure 3 for A Reconfigurable Winograd CNN Accelerator with Nesting Decomposition Algorithm for Computing Convolution with Large Filters

Figure 4 for A Reconfigurable Winograd CNN Accelerator with Nesting Decomposition Algorithm for Computing Convolution with Large Filters

Abstract:Recent literature found that convolutional neural networks (CNN) with large filters perform well in some applications such as image semantic segmentation. Winograd transformation helps to reduce the number of multiplications in a convolution but suffers from numerical instability when the convolution filter size gets large. This work proposes a nested Winograd algorithm to iteratively decompose a large filter into a sequence of 3x3 tiles which can then be accelerated with a 3x3 Winograd algorithm. Compared with the state-of-art OLA-Winograd algorithm, the proposed algorithm reduces the multiplications by 1.41 to 3.29 times for computing 5x5 to 9x9 convolutions.

Via

Access Paper or Ask Questions

Polyimide-Based Flexible Coupled-Coils Design and Load-Shift Keying Analysis

Feb 02, 2021

Yuan Yao, Wing-Hung Ki, Chi-Ying Tsui

Figure 1 for Polyimide-Based Flexible Coupled-Coils Design and Load-Shift Keying Analysis

Figure 2 for Polyimide-Based Flexible Coupled-Coils Design and Load-Shift Keying Analysis

Figure 3 for Polyimide-Based Flexible Coupled-Coils Design and Load-Shift Keying Analysis

Figure 4 for Polyimide-Based Flexible Coupled-Coils Design and Load-Shift Keying Analysis

Abstract:Wireless power transfer using inductive coupling is commonly used for medical implantable devices. The design of the secondary coil on the implantable device is important as it will affect the power transfer efficiency, the size of the implant, and also the data transmission between the implant and the in-vitro controller. In this paper, we present a design of the secondary coil on a polyimide-based flexible substrate to achieve high power transfer efficiency. Load shift keying modulation is used for the data communication between the primary and secondary coils. A thorough analysis is done for the ideal and practical scenario and it shows that a mismatched secondary LC tank will affect the communication range and communication correctness. A solution to achieve robust data transmission is proposed and then verified by SPICE simulations.

* 4 pages, 8 figures

Via

Access Paper or Ask Questions