Abstract: Owing to their powerful semantic reasoning capabilities, Large Language Models (LLMs) have been effectively utilized as recommenders, achieving impressive performance. However, the high inference latency of LLMs significantly restricts their practical deployment. To address this issue, this work investigates knowledge distillation from cumbersome LLM-based recommendation models to lightweight conventional sequential models. This task poses three challenges: 1) the teacher's knowledge may not always be reliable; 2) the capacity gap between the teacher and student makes it difficult for the student to assimilate the teacher's knowledge; 3) the divergence between the teacher's and student's semantic spaces makes it difficult to distill knowledge from embeddings. To tackle these challenges, this work proposes a novel distillation strategy, DLLM2Rec, specifically tailored for knowledge distillation from LLM-based recommendation models to conventional sequential models. DLLM2Rec comprises: 1) importance-aware ranking distillation, which filters reliable and student-friendly knowledge by weighting instances according to teacher confidence and student-teacher consistency; 2) collaborative embedding distillation, which integrates knowledge from teacher embeddings with collaborative signals mined from the data. Extensive experiments demonstrate the effectiveness of the proposed DLLM2Rec, which boosts three typical sequential models by an average of 47.97% and even enables them to surpass LLM-based recommenders in some cases.
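The idea of weighting instances by teacher confidence and student-teacher consistency can be illustrated with a minimal sketch. The abstract does not give the actual weighting formula or loss, so everything below is an assumption made for illustration: the function names (`importance_weights`, `weighted_distillation_loss`), the additive `bonus` for consistency, the softmax normalization, and the log-sigmoid loss are hypothetical and not the authors' implementation.

```python
import math


def softmax(values):
    """Normalize raw scores into positive weights that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def importance_weights(teacher_items, teacher_conf, student_items, bonus=1.0):
    """Weight each teacher-recommended item (hypothetical rule).

    Raw importance is the teacher's confidence, plus `bonus` when the
    student also ranks the item in its own top list (consistency).
    """
    student_set = set(student_items)
    raw = [
        conf + (bonus if item in student_set else 0.0)
        for item, conf in zip(teacher_items, teacher_conf)
    ]
    return softmax(raw)


def weighted_distillation_loss(student_scores, teacher_items, weights):
    """Weighted negative log-sigmoid of the student's scores on teacher items."""
    loss = 0.0
    for item, weight in zip(teacher_items, weights):
        log_sigmoid = -math.log1p(math.exp(-student_scores[item]))
        loss -= weight * log_sigmoid
    return loss


# Toy example with made-up items and scores.
teacher_items = ["a", "b", "c"]        # teacher's top-3 recommendations
teacher_conf = [0.9, 0.5, 0.5]         # teacher confidence per item
student_items = ["b", "z"]             # student's own top-2 list
student_scores = {"a": 2.0, "b": 0.1, "c": -1.0}

weights = importance_weights(teacher_items, teacher_conf, student_items)
loss = weighted_distillation_loss(student_scores, teacher_items, weights)
```

In this toy setup, item `b` receives the largest weight because both models rank it, so the student is pushed hardest toward knowledge it already partly agrees with.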
Abstract: This paper presents a Long Range (LoRa) physical-layer data aggregation system (LoRaPDA) that aggregates data (e.g., sum, average, min, max) directly in the physical layer. In particular, after coordinating a few nodes to transmit their data simultaneously, the gateway leverages a new multi-packet reception (MPR) approach to compute aggregate data from the phase-asynchronous superimposed signal. Unlike the analog approach, which requires additional power synchronization and phase synchronization, our MPR-based digital approach is compatible with commercial LoRa nodes and is more reliable. Unlike traditional MPR approaches, which are designed for the collision decoding scenario, our new MPR approach allows simultaneous transmissions with small packet arrival time offsets and addresses a new co-located peak problem through the following components: 1) an improved channel and offset estimation algorithm that enables accurate phase tracking in each symbol, 2) a new symbol demodulation algorithm that finds the maximum likelihood sequence of nodes' data, and 3) a soft-decision packet decoding algorithm that utilizes the likelihoods of several sequences to improve decoding performance. Trace-driven simulation results show that the symbol demodulation algorithm outperforms the state-of-the-art MPR decoder by 5.3$\times$ in terms of physical-layer throughput, and that the soft decoder is more robust to the adverse phase misalignment and estimation error that are unavoidable in practice. Moreover, LoRaPDA outperforms the state-of-the-art MPR scheme by at least 2.1$\times$ for all SNRs in terms of network throughput, demonstrating quick and reliable data aggregation.
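The co-located peak problem can be illustrated with the textbook LoRa demodulation procedure (multiply by a down-chirp, then take a DFT), which is not the LoRaPDA algorithm itself. The sketch below assumes an idealized, noise-free, perfectly time-aligned baseband model with one sample per chip and a small spreading factor; the helper names (`upchirp`, `superimpose`, `dechirp_spectrum`, `top_peaks`) and all numeric values are hypothetical.

```python
import cmath
import math


def upchirp(symbol, n_bins):
    """Idealized LoRa up-chirp carrying `symbol`, one sample per chip."""
    return [
        cmath.exp(2j * math.pi * (n * n / (2 * n_bins) + (symbol / n_bins - 0.5) * n))
        for n in range(n_bins)
    ]


def superimpose(symbols, gains, n_bins):
    """Sum the chirps of several nodes, each scaled by a complex channel gain."""
    chirps = [upchirp(s, n_bins) for s in symbols]
    return [
        sum(g * chirp[n] for g, chirp in zip(gains, chirps))
        for n in range(n_bins)
    ]


def dechirp_spectrum(signal):
    """Multiply by the conjugate base chirp, then return DFT magnitudes."""
    n_bins = len(signal)
    base = upchirp(0, n_bins)
    tones = [x * b.conjugate() for x, b in zip(signal, base)]
    return [
        abs(sum(tones[n] * cmath.exp(-2j * math.pi * k * n / n_bins)
                for n in range(n_bins)))
        for k in range(n_bins)
    ]


def top_peaks(spectrum, count):
    """Indices of the `count` largest DFT bins, in ascending order."""
    order = sorted(range(len(spectrum)), key=lambda k: spectrum[k], reverse=True)
    return sorted(order[:count])


N_BINS = 32                                   # spreading factor 5 (assumed)
gains = [1.0, 0.8 * cmath.exp(1.1j)]          # arbitrary amplitudes and phases

# Two nodes sending different symbols: two separate peaks.
distinct = dechirp_spectrum(superimpose([5, 20], gains, N_BINS))
distinct_peaks = top_peaks(distinct, 2)

# Two nodes sending the same symbol: the peaks merge into one bin.
colocated = dechirp_spectrum(superimpose([7, 7], gains, N_BINS))
colocated_peak = top_peaks(colocated, 1)
```

When both nodes send the same symbol, a peak-counting receiver sees a single bin whose height depends on the relative phase of the two channels, which is why accurate phase tracking matters for recovering both nodes' data.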
Abstract: Learned image compression (LIC) has achieved extraordinary rate-distortion performance in PSNR and MS-SSIM compared to traditional methods. However, it is computationally intensive, which is intolerable for real-world applications and has so far limited its industrial adoption. In this paper, we use neural architecture search (NAS) to design more efficient networks with lower latency, and leverage quantization to accelerate the inference process. Engineering optimizations such as multi-threading and SIMD further improve efficiency. Optimized using a hybrid loss of PSNR and MS-SSIM for better visual quality, our model obtains much higher MS-SSIM than JPEG, JPEG XL and AVIF over all bit rates, and a PSNR between that of JPEG XL and AVIF. Our software implementation of LIC achieves inference speed comparable to or even faster than jpeg-turbo while being multiple times faster than JPEG XL and AVIF. In addition, our implementation of LIC reaches a throughput of 145 fps for encoding and 208 fps for decoding on a Tesla T4 GPU for 1080p images. On CPU, the latency of our implementation is comparable to that of JPEG XL.
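The abstract does not state the form of the hybrid loss, so the following is only a sketch of one common way to combine the two quality measures. The PSNR definition is standard; the combination in `hybrid_distortion`, its `alpha` weight, and the choice to take an already computed MS-SSIM value as input are assumptions made for illustration.

```python
import math


def mean_squared_error(original, reconstructed):
    """Mean squared error between two equally sized flat pixel lists."""
    squared = [(a - b) ** 2 for a, b in zip(original, reconstructed)]
    return sum(squared) / len(squared)


def psnr(original, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    mse = mean_squared_error(original, reconstructed)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


def hybrid_distortion(mse, ms_ssim, alpha=0.5, max_value=255.0):
    """Hypothetical hybrid distortion term (lower is better).

    Blends normalized MSE (the quantity PSNR is derived from) with
    1 - MS-SSIM. `ms_ssim` is assumed to be computed elsewhere and to
    lie in [0, 1].
    """
    return alpha * mse / max_value ** 2 + (1.0 - alpha) * (1.0 - ms_ssim)


# Toy example with made-up pixel values and a made-up MS-SSIM score.
original = [10.0, 20.0, 30.0, 40.0]
reconstructed = [12.0, 18.0, 33.0, 40.0]
example_mse = mean_squared_error(original, reconstructed)
example_psnr = psnr(original, reconstructed)
example_loss = hybrid_distortion(example_mse, ms_ssim=0.97)
```

In a full training objective this distortion term would typically be added to a rate term; the abstract gives no details on that, so it is omitted here.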