Abstract:Vision-language models (VLMs) have shown remarkable advancements in multimodal reasoning tasks. However, they still often generate inaccurate or irrelevant responses due to issues like hallucinated image understandings or unrefined reasoning paths. To address these challenges, we introduce Critic-V, a novel framework inspired by the Actor-Critic paradigm to boost the reasoning capability of VLMs. This framework decouples the reasoning process and critic process by integrating two independent components: the Reasoner, which generates reasoning paths based on visual and textual inputs, and the Critic, which provides constructive critique to refine these paths. In this approach, the Reasoner generates reasoning responses according to text prompts, which can evolve iteratively as a policy based on feedback from the Critic. This interaction process was theoretically driven by a reinforcement learning framework where the Critic offers natural language critiques instead of scalar rewards, enabling more nuanced feedback to boost the Reasoner's capability on complex reasoning tasks. The Critic model is trained using Direct Preference Optimization (DPO), leveraging a preference dataset of critiques ranked by Rule-based Reward~(RBR) to enhance its critic capabilities. Evaluation results show that the Critic-V framework significantly outperforms existing methods, including GPT-4V, on 5 out of 8 benchmarks, especially regarding reasoning accuracy and efficiency. Combining a dynamic text-based policy for the Reasoner and constructive feedback from the preference-optimized Critic enables a more reliable and context-sensitive multimodal reasoning process. Our approach provides a promising solution to enhance the reliability of VLMs, improving their performance in real-world reasoning-heavy multimodal applications such as autonomous driving and embodied intelligence.
Abstract:In occluded person re-identification(ReID), severe occlusions lead to a significant amount of irrelevant information that hinders the accurate identification of individuals. These irrelevant cues primarily stem from background interference and occluding interference, adversely affecting the final retrieval results. Traditional discriminative models, which rely on the specific content and positions of the images, often misclassify in cases of occlusion. To address these limitations, we propose the Data Distribution Reconstruction Network (DDRN), a generative model that leverages data distribution to filter out irrelevant details, enhancing overall feature perception ability and reducing irrelevant feature interference. Additionally, severe occlusions lead to the complexity of the feature space. To effectively handle this, we design a multi-center approach through the proposed Hierarchical SubcenterArcface (HS-Arcface) loss function, which can better approximate complex feature spaces. On the Occluded-Duke dataset, we achieved a mAP of 62.4\% (+1.1\%) and a rank-1 accuracy of 71.3\% (+0.6\%), surpassing the latest state-of-the-art methods(FRT) significantly.
Abstract:In this correspondence, a new single-carrier waveform, called chirped discrete Fourier transform spread orthogonal frequency division multiplexing (DFT-s-OFDM), is proposed for the sixth generation of communications. By chirping DFT-s-OFDM in the time domain, the proposed waveform maintains the low peak-to-average-power ratio (PAPR) of DFT-s-OFDM. Thanks to full-band transmission and symbols retransmission enabled by chirping and discrete Fourier transform (DFT) precoding, the proposed waveform can enhance noise suppression of linear minimum mean square error equalization. Its bit error rate (BER) upper bound and diversity order are derived using pairwise error probability. Simulation results confirm that the proposed waveform outperforms the state-of-the-art waveforms in terms of BER, output signal-to-noise-ratio, and PAPR.
Abstract:Traditional spectral imaging methods are constrained by the time-consuming scanning process, limiting the application in dynamic scenarios. One-shot spectral imaging based on reconstruction has been a hot research topic recently and the primary challenges still lie in both efficient fabrication techniques suitable for mass production and the high-speed, high-accuracy reconstruction algorithm for real-time spectral imaging. In this study, we introduce an innovative on-chip real-time hyperspectral imager that leverages nanophotonic film spectral encoders and a Massively Parallel Network (MP-Net), featuring a 4 * 4 array of compact, all-dielectric film units for the micro-spectrometers. Each curved nanophotonic film unit uniquely modulates incident light across the underlying 3 * 3 CMOS image sensor (CIS) pixels, enabling a high spatial resolution equivalent to the full CMOS resolution. The implementation of MP-Net, specially designed to address variability in transmittance and manufacturing errors such as misalignment and non-uniformities in thin film deposition, can greatly increase the structural tolerance of the device and reduce the preparation requirement, further simplifying the manufacturing process. Tested in varied environments on both static and moving objects, the real-time hyperspectral imager demonstrates the robustness and high-fidelity spatial-spectral data capabilities across diverse scenarios. This on-chip hyperspectral imager represents a significant advancement in real-time, high-resolution spectral imaging, offering a versatile solution for applications ranging from environmental monitoring, remote sensing to consumer electronics.
Abstract:Our paper focuses on automating the generation of medical reports from chest X-ray image inputs, a critical yet time-consuming task for radiologists. Unlike existing medical re-port generation efforts that tend to produce human-readable reports, we aim to generate medical reports that are both fluent and clinically accurate. This is achieved by our fully differentiable and end-to-end paradigm containing three complementary modules: taking the chest X-ray images and clinical his-tory document of patients as inputs, our classification module produces an internal check-list of disease-related topics, referred to as enriched disease embedding; the embedding representation is then passed to our transformer-based generator, giving rise to the medical reports; meanwhile, our generator also pro-duces the weighted embedding representation, which is fed to our interpreter to ensure consistency with respect to disease-related topics.Our approach achieved promising results on commonly-used metrics concerning language fluency and clinical accuracy. Moreover, noticeable performance gains are consistently ob-served when additional input information is available, such as the clinical document and extra scans of different views.