Peking University, China
Abstract: We present Vinci, a vision-language system designed to provide real-time, comprehensive AI assistance on portable devices. At its core, Vinci leverages EgoVideo-VL, a novel model that integrates an egocentric vision foundation model with a large language model (LLM), enabling advanced functionalities such as scene understanding, temporal grounding, video summarization, and future planning. To enhance its utility, Vinci incorporates a memory module for processing long video streams in real time while retaining contextual history, a generation module for producing visual action demonstrations, and a retrieval module that bridges egocentric and third-person perspectives to provide relevant how-to videos for skill acquisition. Unlike existing systems that often depend on specialized hardware, Vinci is hardware-agnostic, supporting deployment across a wide range of devices, including smartphones and wearable cameras. In our experiments, we first demonstrate the superior performance of EgoVideo-VL on multiple public benchmarks, showcasing its vision-language reasoning and contextual understanding capabilities. We then conduct a series of user studies to evaluate the real-world effectiveness of Vinci, highlighting its adaptability and usability in diverse scenarios. We hope Vinci can establish a new framework for portable, real-time egocentric AI systems, empowering users with contextual and actionable insights. All code for Vinci, including the frontend, backend, and models, is available at https://github.com/OpenGVLab/vinci.
Abstract: Pinching-antenna systems have recently been proposed as a new candidate for flexible-antenna systems, not only inheriting the reconfiguration capability but also offering a unique feature: establishing line-of-sight links to mitigate large-scale path loss. However, sophisticated optimization of the placement of pinching antennas has very high complexity, which is challenging for practical implementation. This paper proposes a low-complexity placement design, providing a closed-form expression for the placement of pinching antennas, to maximize the sum rate of multiple downlink users. Both orthogonal multiple access (OMA) and non-orthogonal multiple access (NOMA) are investigated when the pinching-antenna system is equipped with a single antenna, while only the OMA case is studied when it is equipped with multiple antennas. Simulation results indicate that pinching-antenna systems can outperform conventional fixed-antenna systems and are more suitable for large service areas.
Abstract: Federated learning (FL) has gained significant attention in recent years due to its distributed nature and privacy-preserving benefits. However, a key limitation of conventional FL is that it learns and distributes a common global model to all participants, which fails to provide customized solutions for diverse task requirements. Federated meta-learning (FML) offers a promising solution to this issue by enabling devices to fine-tune local models after receiving a shared meta-model from the server. In this paper, we propose a task-oriented FML framework over non-orthogonal multiple access (NOMA) networks. A novel metric, termed value of learning (VoL), is introduced to assess the individual training needs across devices. Moreover, a task-level weight (TLW) metric is defined based on task requirements and fairness considerations, guiding the prioritization of edge devices during FML training. The formulated problem, which maximizes the sum of TLW-based VoL across devices, is a non-convex mixed-integer non-linear programming (MINLP) problem. It is solved using a parameterized deep Q-network (PDQN) algorithm that handles both discrete and continuous variables. Simulation results demonstrate that our approach significantly outperforms baseline schemes, underscoring the advantages of the proposed framework.
Abstract: Secure communication is crucial in many emerging systems enabled by unmanned aerial vehicle (UAV) communication networks. To protect legitimate communication in a chaotic UAV environment, where both eavesdropping and jamming become straightforward for multiple adversaries with line-of-sight signal propagation, a new reliable and integrated physical layer security mechanism is proposed in this paper for a massive multiple-input-multiple-output (MIMO) UAV system. In particular, a physical layer fingerprint, also called a tag, is first embedded into each message for authentication purposes. We then propose to reuse the tag as a reference to encode each message, enhancing confidentiality at a low cost. Specifically, we create a new dual-reference symmetric tag generation mechanism by inputting an encoding-insensitive feature of the plaintext along with the key into a hash function. At a legitimate receiver, an expected tag, reliable for decoding, can be symmetrically regenerated based on the received ciphertext, and authentication can be performed by comparing the regenerated reference tag to the received tag. However, an illegitimate receiver can only receive the fuzzy tag, which cannot be used to decode the received message. Additionally, we introduce artificial noise (AN) to degrade eavesdropping and further decrease message leakage. To verify the efficiency of our proposed tag-based encoding (TBE) scheme, we formulate two optimization problems: ergodic sum secrecy rate maximization and authentication fail probability minimization. The power allocation solutions are derived by difference-of-convex (DC) programming and the Lagrange method, respectively. The simulation results demonstrate the superior performance of the proposed TBE approach compared to the prior AN-aided tag embedding scheme.
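The tag regeneration idea in the abstract above can be illustrated with a toy sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the "encoding-insensitive feature" is modeled as the sorted bytes of the message, the "encoding" is a cyclic rotation whose offset is derived from the tag, and HMAC-SHA256 stands in for the keyed hash. The rotation offers no real secrecy, and the physical-layer aspects (fuzzy tags, artificial noise, massive MIMO) are not modeled.

```python
import hashlib
import hmac


def feature(data: bytes) -> bytes:
    """Hypothetical encoding-insensitive feature: the sorted bytes.

    Any permutation of the bytes (such as the rotation below) leaves it unchanged.
    """
    return bytes(sorted(data))


def make_tag(key: bytes, data: bytes) -> bytes:
    """Tag = keyed hash of the feature (HMAC-SHA256 is an assumed choice)."""
    return hmac.new(key, feature(data), hashlib.sha256).digest()


def rotate(data: bytes, shift: int) -> bytes:
    """Cyclic left rotation; a negative shift undoes a positive one."""
    if not data:
        return data
    shift %= len(data)
    return data[shift:] + data[:shift]


def send(key: bytes, plaintext: bytes):
    """Generate the tag, then use it as the reference for the toy encoding."""
    tag = make_tag(key, plaintext)
    shift = int.from_bytes(tag[:4], "big")
    return rotate(plaintext, shift), tag


def receive(key: bytes, ciphertext: bytes, received_tag: bytes):
    """Regenerate the tag from the ciphertext, decode, and authenticate."""
    expected = make_tag(key, ciphertext)  # feature survives the encoding
    shift = int.from_bytes(expected[:4], "big")
    plaintext = rotate(ciphertext, -shift)
    authentic = hmac.compare_digest(expected, received_tag)
    return plaintext, authentic
```

Because the feature is unchanged by the encoding, a receiver holding the key can rebuild the same tag from the ciphertext alone, use it to decode, and compare it with the received tag for authentication.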
Abstract: Despite the advantage of preserving data privacy, federated learning (FL) still suffers from the straggler issue due to the limited computing resources of distributed clients and the unreliable wireless communication environment. By effectively imitating the distributed resources, digital twin (DT) shows great potential in alleviating this issue. In this paper, we leverage DT in the FL framework over a non-orthogonal multiple access (NOMA) network to assist the FL training process, considering malicious attacks on model updates from clients. A reputation-based client selection scheme is proposed, which accounts for client heterogeneity in multiple aspects and effectively mitigates the risks of poisoning attacks in FL systems. To minimize the total latency and energy consumption in the proposed system, we then formulate a Stackelberg game by considering clients and the server as the leader and the follower, respectively. Specifically, the leader aims to minimize the energy consumption while the objective of the follower is to minimize the total latency during FL training. The Stackelberg equilibrium is achieved to obtain the optimal solutions. We first derive the strategies for the follower-level problem and include them in the leader-level problem, which is then solved via problem decomposition. Simulation results verify the superior performance of the proposed scheme.
Abstract: Effective task-oriented semantic communications rely on perfect knowledge alignment between transmitters and receivers for accurate recovery of task-related semantic information, which can be susceptible to knowledge misalignment and performance degradation in practice. To tackle this issue, continual knowledge updating and sharing are crucial to adapt to evolving task- and user-related demands, despite the incurred resource overhead and increased latency. In this paper, we propose a novel collaborative knowledge sharing-empowered semantic transmission mechanism in a two-tier edge network, exploiting edge cooperation and bit communications to address knowledge base (KB) mismatch. By deriving a generalized effective semantic transmission rate (GESTR) that considers both semantic accuracy and overhead, we formulate a mixed integer nonlinear programming problem to maximize the GESTR of all mobile devices by optimizing knowledge sharing decisions, extraction ratios, and BS/subchannel allocations, subject to task accuracy and delay requirements. The jointly optimal solution can be obtained efficiently by the proposed fractional programming-based branch and bound algorithm and a modified Kuhn-Munkres algorithm. Simulation results demonstrate the superior performance of the proposed solution, especially in low signal-to-noise conditions.
Abstract: In task-oriented semantic communications, the transmitters are designed to deliver task-related semantic information rather than every signal bit to receivers, which alleviates the spectrum pressure by reducing network traffic loads. Effective semantic communications depend on the perfect alignment of shared knowledge between transmitters and receivers. However, this alignment cannot always be guaranteed in practice. To tackle this challenge, we propose a novel knowledge sharing-enabled task-oriented hybrid semantic and bit communications mechanism, where a mobile device (MD) can proactively share and upload the task-related mismatched knowledge to its associated small base station (SBS). Traditional bit communications can be adopted as an aid to transmit the remaining data related to unshared mismatched knowledge, guaranteeing the effective execution of target tasks. Considering the heterogeneous transceivers in multi-cell networks, target task demands, and channel conditions, an optimization problem is formulated to maximize the generalized effective semantic transmission rate of all MDs by jointly optimizing knowledge sharing, semantic extraction ratio, and SBS association, while satisfying the semantic accuracy requirements and delay tolerances of MD target tasks. The formulated mixed integer nonlinear programming problem is equivalently decomposed into multiple subproblems. An optimum algorithm is proposed, and another efficient algorithm is further developed using hierarchical class partitioning and monotonic optimization. Simulation results demonstrate the validity and superior performance of the proposed solutions.
Abstract: Semantic communication focuses on transmitting the meaning of data, aiming for efficient, relevant communication, while non-orthogonal multiple access (NOMA) enhances spectral efficiency by allowing multiple users to share the same spectrum. Integrating semantic users into a NOMA network with bit-based users improves both transmission and spectrum efficiency. However, the performance metric for semantic communication differs significantly from that of traditional communication, posing challenges in simultaneously meeting individual user demands and minimizing transmission power, especially in scenarios with coexisting semantic and bit-based users. Furthermore, the different hardware architectures of semantic and bit-based users complicate the implementation of successive interference cancellation (SIC). To address these challenges, in this paper, we propose a clustered framework to mitigate the complexity of SIC and two multiple access (MA) schemes, namely pure cluster-based NOMA (P-CNOMA) and hybrid cluster-based NOMA (H-CNOMA), to minimize the total transmission power. The P-CNOMA scheme can achieve the minimum transmission power, but may not satisfy high quality of service (QoS) requirements. In contrast, H-CNOMA addresses this issue at the cost of a slight increase in power and a reduced semantic rate. These two schemes complement each other, enabling an adaptive MA selection mechanism that adapts to specific network conditions and user requirements.
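As background for the power-minimization objective in the abstract above, the following sketch computes the minimum transmit powers for a conventional two-user downlink NOMA pair with SIC, using the standard Shannon-rate model. It is a textbook bit-user baseline under stated assumptions, not the P-CNOMA or H-CNOMA schemes of the paper: semantic rate metrics and clustering are not modeled, and the function name, parameter names, and normalized units are hypothetical.

```python
import math


def min_power_two_user_noma(rate_strong, rate_weak, gain_strong, gain_weak, noise=1.0):
    """Minimum powers (p_strong, p_weak) meeting target rates in bit/s/Hz.

    Assumes gain_strong >= gain_weak. The strong user removes the weak user's
    signal via SIC; the weak user treats the strong user's signal as noise.
    """
    if gain_strong < gain_weak:
        raise ValueError("gain_strong must be at least gain_weak")
    # After SIC the strong user sees only noise: R_s = log2(1 + p_s * g_s / N).
    p_strong = (2 ** rate_strong - 1) * noise / gain_strong
    # Weak user: R_w = log2(1 + p_w * g_w / (p_s * g_w + N)).
    p_weak = (2 ** rate_weak - 1) * (p_strong * gain_weak + noise) / gain_weak
    return p_strong, p_weak


def achieved_rates(p_strong, p_weak, gain_strong, gain_weak, noise=1.0):
    """Rates obtained with the given powers under the same decoding order."""
    r_strong = math.log2(1 + p_strong * gain_strong / noise)
    r_weak = math.log2(1 + p_weak * gain_weak / (p_strong * gain_weak + noise))
    return r_strong, r_weak
```

The weak user needs more power because it sees the strong user's signal as interference, which is why the order of SIC decoding matters for the total power.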
Abstract: We introduce Vinci, a real-time embodied smart assistant built upon an egocentric vision-language model. Designed for deployment on portable devices such as smartphones and wearable cameras, Vinci operates in an "always on" mode, continuously observing the environment to deliver seamless interaction and assistance. Users can wake up the system and engage in natural conversations to ask questions or seek assistance, with responses delivered through audio for hands-free convenience. With its ability to process long video streams in real time, Vinci can answer user queries about current observations and historical context while also providing task planning based on past interactions. To further enhance usability, Vinci integrates a video generation module that creates step-by-step visual demonstrations for tasks that require detailed guidance. We hope that Vinci can establish a robust framework for portable, real-time egocentric AI systems, empowering users with contextual and actionable insights. We release the complete implementation for the development of the device, together with a demo web platform for testing uploaded videos, at https://github.com/OpenGVLab/vinci.
Abstract: Learning robot manipulation skills in real-world environments is extremely challenging. Recent research on imitation learning and visuomotor policies has significantly enhanced the ability of robots to perform manipulation tasks. In this paper, we propose Admit Policy, a visuo-proprioceptive imitation learning framework with force compliance, designed to reduce contact force fluctuations during robot execution of contact-rich manipulation tasks. This framework also includes a hand-arm teleoperation system with vibrotactile feedback for efficient data collection. Our framework utilizes RGB images, robot joint positions, and contact forces as observations and leverages a consistency-constrained teacher-student probabilistic diffusion model to generate future trajectories for end-effector positions and contact forces. An admittance model is then employed to track these trajectories, enabling effective force-position control across various tasks. We validated our framework on five challenging contact-rich manipulation tasks. Across these tasks, while improving success rates, our approach reduced the mean contact force required to complete the tasks by up to 53.92% and decreased the standard deviation of contact force fluctuations by 76.51% compared to imitation learning algorithms without dynamic contact force prediction and tracking.