Abstract: Large language models (LLMs) have been shown to memorize and reproduce content from their training data, raising significant privacy concerns, especially with web-scale datasets. Existing methods for detecting memorization are largely sample-specific: they rely on manually crafted or discretely optimized memory-inducing prompts generated on a per-sample basis, an approach that becomes impractical for dataset-level detection because of the prohibitive computational cost of iterating over all samples. In real-world scenarios, data owners may need to verify whether a susceptible LLM has memorized their dataset, particularly if the LLM may have collected the data from the web without authorization. To address this, we introduce \textit{MemHunter}, which trains a memory-inducing LLM and employs hypothesis testing to efficiently detect memorization at the dataset level, without requiring sample-specific memory inducing. Experiments on models such as Pythia and Llama-2 demonstrate that \textit{MemHunter} can extract up to 40\% more training data than existing methods under constrained time resources and reduce search time by up to 80\% when integrated as a plug-in. Crucially, \textit{MemHunter} is the first method capable of dataset-level memorization detection, providing an indispensable tool for assessing privacy risks in LLMs that are powered by vast web-sourced datasets.
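The abstract does not say which hypothesis test is used, so the following is only a generic sketch of how a dataset-level decision could be made from a sample of extraction attempts. It assumes a one-sided exact binomial test; the function names, the null extraction rate `null_rate`, the significance level `alpha`, and the definition of a "successful extraction" are all hypothetical choices made for illustration, not details from the paper.

```python
from math import comb


def binomial_tail_p_value(successes, trials, null_rate):
    """P(X >= successes) for X ~ Binomial(trials, null_rate).

    Under the null hypothesis, the model reproduces a sampled item
    only with the (assumed) chance-level probability `null_rate`.
    """
    return sum(
        comb(trials, k) * null_rate**k * (1 - null_rate) ** (trials - k)
        for k in range(successes, trials + 1)
    )


def dataset_memorized(successes, trials, null_rate=0.01, alpha=0.05):
    """Reject the null (declare memorization) if the p-value is below alpha.

    `null_rate` and `alpha` are illustrative assumptions, not values
    taken from the abstract.
    """
    return binomial_tail_p_value(successes, trials, null_rate) < alpha
```

With these assumed settings, `dataset_memorized(12, 200)` rejects the null hypothesis while `dataset_memorized(2, 200)` does not. The point of such a test is that only a sample of the dataset has to be probed, rather than every item.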
Abstract: Coherent technology, with its additional degrees of freedom, is considered a competitive solution for next-generation ultra-high-speed short-reach optical interconnects. However, the main barriers to deploying a conventional coherent system in short-reach optical interconnects are cost, footprint, and power consumption. A self-homodyne coherent system can reduce the power consumption of the receiver-side digital signal processing (Rx-DSP) by delivering the local oscillator (LO) from the transmitter. However, an automatic polarization controller (APC) is then required in the remote LO link to avoid polarization fading, which adds cost. To address the polarization fading issue, this paper proposes a simplified self-homodyne coherent system enabled by Alamouti coding. Thanks to the Alamouti coding between the two polarizations, a polarization-insensitive receiver comprising only a 3 dB coupler, a 90° hybrid, and two balanced photodiodes (BPDs) is sufficient for reception. The APC in the LO link is no longer needed, which significantly simplifies the receiver structure. In addition, the digital subcarrier multiplexing (DSCM) technique is adopted to reduce the computational complexity of chromatic dispersion compensation (CDC), one of the dominant power-consuming modules in the Rx-DSP. The transmission performance of a 50 Gbaud 4-subcarrier 16/32QAM (4SC-16/32QAM) DSCM signal over the proposed simplified self-homodyne coherent system is investigated experimentally. The results show that the bit-error-ratio (BER) degradation caused by CD can be compensated by adding 4 taps to the equalizer for 80 km single-mode fiber (SMF) transmission, without a separate CDC stage, so the system operates with low complexity.
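As a rough illustration of why Alamouti coding across the two polarizations removes the need for polarization control, the sketch below implements the textbook 2x1 Alamouti scheme in plain Python. It is not the paper's DSP chain: it assumes a noiseless link, a receiver that knows the channel, and a simplified model in which the receiver sees the projection `h_x = cos(theta)`, `h_y = sin(theta) * exp(j*phi)` of the two polarizations, with `theta` and `phi` as hypothetical polarization-state parameters.

```python
import cmath
import math


def alamouti_encode(s1, s2):
    """Return the (X-pol, Y-pol) symbol pairs for two consecutive time slots."""
    slot1 = (s1, s2)
    slot2 = (-s2.conjugate(), s1.conjugate())
    return slot1, slot2


def receive(slot, h_x, h_y):
    """Single-axis receiver: both polarizations are projected onto one axis."""
    x, y = slot
    return h_x * x + h_y * y


def alamouti_decode(r1, r2, h_x, h_y):
    """Linear Alamouti combining, assuming h_x and h_y are known."""
    gain = abs(h_x) ** 2 + abs(h_y) ** 2  # equals 1 for any theta in this model
    s1_hat = (h_x.conjugate() * r1 + h_y * r2.conjugate()) / gain
    s2_hat = (h_y.conjugate() * r1 - h_x * r2.conjugate()) / gain
    return s1_hat, s2_hat


def round_trip(s1, s2, theta, phi):
    """Encode, pass through the assumed polarization projection, and decode."""
    h_x = complex(math.cos(theta))
    h_y = math.sin(theta) * cmath.exp(1j * phi)
    slot1, slot2 = alamouti_encode(s1, s2)
    r1 = receive(slot1, h_x, h_y)
    r2 = receive(slot2, h_x, h_y)
    return alamouti_decode(r1, r2, h_x, h_y)
```

In this model the combined gain `|h_x|^2 + |h_y|^2` is 1 for every `theta`, so `round_trip(1 + 1j, -3 + 1j, theta, phi)` recovers both symbols even at `theta = 0` or `theta = pi / 2`, where one polarization would otherwise fade completely. The cost is that two symbols occupy two time slots on both polarizations, so the scheme gives up polarization multiplexing in exchange for polarization insensitivity.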
Abstract: Artificial Intelligence Generated Content (AIGC) is one of the latest achievements in AI development. The content generated by related applications, such as text, images and audio, has sparked heated discussion. Various derived AIGC applications are also gradually entering all walks of life, bringing unimaginable impact to people's daily lives. However, the rapid development of such generative tools has also raised concerns about privacy and security issues, and even copyright issues, in AIGC. We note that advanced technologies such as blockchain and privacy computing can be combined with AIGC tools, but no work has yet investigated their relevance and prospects in a systematic and detailed way. Therefore, it is necessary to investigate how they can be used to protect the privacy and security of data in AIGC by fully exploring the aforementioned technologies. In this paper, we first systematically review the concept, classification and underlying technologies of AIGC. Then, we discuss the privacy and security challenges faced by AIGC from multiple perspectives and list the countermeasures that currently exist. We hope our survey will help researchers and industry build more secure and robust AIGC systems.