Abstract:We propose a compressive yet effective mesh representation, Blocked and Patchified Tokenization (BPT), facilitating the generation of meshes exceeding 8k faces. BPT compresses mesh sequences by employing block-wise indexing and patch aggregation, reducing their length by approximately 75\% compared to the original sequences. This compression milestone unlocks the potential to utilize mesh data with significantly more faces, thereby enhancing detail richness and improving generation robustness. Empowered with the BPT, we have built a foundation mesh generative model training on scaled mesh data to support flexible control for point clouds and images. Our model demonstrates the capability to generate meshes with intricate details and accurate topology, achieving SoTA performance on mesh generation and reaching the level for direct product usage.
Abstract:In recent years, multimodal benchmarks for general domains have guided the rapid development of multimodal models on general tasks. However, the financial field has its peculiarities. It features unique graphical images (e.g., candlestick charts, technical indicator charts) and possesses a wealth of specialized financial knowledge (e.g., futures, turnover rate). Therefore, benchmarks from general fields often fail to measure the performance of multimodal models in the financial domain, and thus cannot effectively guide the rapid development of large financial models. To promote the development of large financial multimodal models, we propose MME-Finance, an bilingual open-ended and practical usage-oriented Visual Question Answering (VQA) benchmark. The characteristics of our benchmark are finance and expertise, which include constructing charts that reflect the actual usage needs of users (e.g., computer screenshots and mobile photography), creating questions according to the preferences in financial domain inquiries, and annotating questions by experts with 10+ years of experience in the financial industry. Additionally, we have developed a custom-designed financial evaluation system in which visual information is first introduced in the multi-modal evaluation process. Extensive experimental evaluations of 19 mainstream MLLMs are conducted to test their perception, reasoning, and cognition capabilities. The results indicate that models performing well on general benchmarks cannot do well on MME-Finance; for instance, the top-performing open-source and closed-source models obtain 65.69 (Qwen2VL-72B) and 63.18 (GPT-4o), respectively. Their performance is particularly poor in categories most relevant to finance, such as candlestick charts and technical indicator charts. In addition, we propose a Chinese version, which helps compare performance of MLLMs under a Chinese context.
Abstract:Dropout, a network operator, when enabled is likely to dramatically impact the performance of Flash-Attention, which in turn increases the end-to-end training time of Large-Language-Models (LLMs). The main contributor to such performance degradation is the Random Number Generation (RNG) phase that is traditionally fused into the Flash-Attention kernel. As RNG and Attention have the same hardware bottlenecks, RNG latency can hardly be hidden within the Attention kernel. We propose overlapping RNG with previous GEMM layers in the network to hide RNG runtime and improve end-to-end performance. RNG and GEMM have distinct resource requirements and hardware bottlenecks, so they can run in parallel without compromising each other's performance. Our fine-grained performance model, cross-validated by silicon results, shows 1.14x speedup on one transformer block (including multi-head attention and feed-forward layers) for Llama2, and up to 1.23x speedup when varying workload sizes, on GH100 GPUs with FP8 precision. Further, we extend our theoretical model to different RNG implementations and hardware architectures, and discuss the widely applicable benefits for overlapping RNG with GEMM layers.
Abstract:Federated Learning (FL) is a technique that allows multiple parties to train a shared model collaboratively without disclosing their private data. It has become increasingly popular due to its distinct privacy advantages. However, FL models can suffer from biases against certain demographic groups (e.g., racial and gender groups) due to the heterogeneity of data and party selection. Researchers have proposed various strategies for characterizing the group fairness of FL algorithms to address this issue. However, the effectiveness of these strategies in the face of deliberate adversarial attacks has not been fully explored. Although existing studies have revealed various threats (e.g., model poisoning attacks) against FL systems caused by malicious participants, their primary aim is to decrease model accuracy, while the potential of leveraging poisonous model updates to exacerbate model unfairness remains unexplored. In this paper, we propose a new type of model poisoning attack, EAB-FL, with a focus on exacerbating group unfairness while maintaining a good level of model utility. Extensive experiments on three datasets demonstrate the effectiveness and efficiency of our attack, even with state-of-the-art fairness optimization algorithms and secure aggregation rules employed.
Abstract:Monocular object pose estimation, as a pivotal task in computer vision and robotics, heavily depends on accurate 2D-3D correspondences, which often demand costly CAD models that may not be readily available. Object 3D reconstruction methods offer an alternative, among which recent advancements in 3D Gaussian Splatting (3DGS) afford a compelling potential. Yet its performance still suffers and tends to overfit with fewer input views. Embracing this challenge, we introduce SGPose, a novel framework for sparse view object pose estimation using Gaussian-based methods. Given as few as ten views, SGPose generates a geometric-aware representation by starting with a random cuboid initialization, eschewing reliance on Structure-from-Motion (SfM) pipeline-derived geometry as required by traditional 3DGS methods. SGPose removes the dependence on CAD models by regressing dense 2D-3D correspondences between images and the reconstructed model from sparse input and random initialization, while the geometric-consistent depth supervision and online synthetic view warping are key to the success. Experiments on typical benchmarks, especially on the Occlusion LM-O dataset, demonstrate that SGPose outperforms existing methods even under sparse view constraints, under-scoring its potential in real-world applications.
Abstract:Point clouds are crucial for capturing three-dimensional data but often suffer from incompleteness due to limitations such as resolution and occlusion. Traditional methods typically rely on point-based approaches within discriminative frameworks for point cloud completion. In this paper, we introduce \textbf{Diffusion-Occ}, a novel framework for Diffusion Point Cloud Completion. Diffusion-Occ utilizes a two-stage coarse-to-fine approach. In the first stage, the Coarse Density Voxel Prediction Network (CDNet) processes partial points to predict coarse density voxels, streamlining global feature extraction through voxel classification, as opposed to previous regression-based methods. In the second stage, we introduce the Occupancy Generation Network (OccGen), a conditional occupancy diffusion model based on a transformer architecture and enhanced by our Point-Voxel Fuse (PVF) block. This block integrates coarse density voxels with partial points to leverage both global and local features for comprehensive completion. By thresholding the occupancy field, we convert it into a complete point cloud. Additionally, our method employs diverse training mixtures and efficient diffusion parameterization to enable effective one-step sampling during both training and inference. Experimental results demonstrate that Diffusion-Occ outperforms existing discriminative and generative methods.
Abstract:The visual projector serves as an essential bridge between the visual encoder and the Large Language Model (LLM) in a Multimodal LLM (MLLM). Typically, MLLMs adopt a simple MLP to preserve all visual contexts via one-to-one transformation. However, the visual tokens are redundant and can be considerably increased when dealing with high-resolution images, impairing the efficiency of MLLMs significantly. Some recent works have introduced resampler or abstractor to reduce the number of resulting visual tokens. Unfortunately, they fail to capture finer details and undermine the visual reasoning capabilities of MLLMs. In this work, we propose a novel visual projector, which adopts a coarse-to-fine scheme to inject the enriched characteristics to generate the condensed visual tokens. In specific, we first interpolate the visual features as a low-resolution point query, providing the overall visual representation as the foundation. Then, we introduce a region-to-point injection module that utilizes high-resolution, multi-level region-based cues as fine-grained reference keys and values, allowing them to be fully absorbed within the corresponding local context region. This step effectively updates the coarse point query, transforming it into an enriched one for the subsequent LLM reasoning. Extensive experiments demonstrate that our approach compresses the visual tokens by 75%~89%, while achieves comparable or even better performance across diverse benchmarks with significantly higher efficiency. The source codes can be found at https://github.com/CircleRadon/TokenPacker.
Abstract:In recent years, employing layer attention to enhance interaction among hierarchical layers has proven to be a significant advancement in building network structures. In this paper, we delve into the distinction between layer attention and the general attention mechanism, noting that existing layer attention methods achieve layer interaction on fixed feature maps in a static manner. These static layer attention methods limit the ability for context feature extraction among layers. To restore the dynamic context representation capability of the attention mechanism, we propose a Dynamic Layer Attention (DLA) architecture. The DLA comprises dual paths, where the forward path utilizes an improved recurrent neural network block, named Dynamic Sharing Unit (DSU), for context feature extraction. The backward path updates features using these shared context representations. Finally, the attention mechanism is applied to these dynamically refreshed feature maps among layers. Experimental results demonstrate the effectiveness of the proposed DLA architecture, outperforming other state-of-the-art methods in image recognition and object detection tasks. Additionally, the DSU block has been evaluated as an efficient plugin in the proposed DLA architecture.The code is available at https://github.com/tunantu/Dynamic-Layer-Attention.
Abstract:The progression of deep learning and the widespread adoption of sensors have facilitated automatic multi-view fusion (MVF) about the cardiovascular system (CVS) signals. However, prevalent MVF model architecture often amalgamates CVS signals from the same temporal step but different views into a unified representation, disregarding the asynchronous nature of cardiovascular events and the inherent heterogeneity across views, leading to catastrophic view confusion. Efficient training strategies specifically tailored for MVF models to attain comprehensive representations need simultaneous consideration. Crucially, real-world data frequently arrives with incomplete views, an aspect rarely noticed by researchers. Thus, the View-Centric Transformer (VCT) and Multitask Masked Autoencoder (M2AE) are specifically designed to emphasize the centrality of each view and harness unlabeled data to achieve superior fused representations. Additionally, we systematically define the missing-view problem for the first time and introduce prompt techniques to aid pretrained MVF models in flexibly adapting to various missing-view scenarios. Rigorous experiments involving atrial fibrillation detection, blood pressure estimation, and sleep staging-typical health monitoring tasks-demonstrate the remarkable advantage of our method in MVF compared to prevailing methodologies. Notably, the prompt technique requires finetuning less than 3% of the entire model's data, substantially fortifying the model's resilience to view missing while circumventing the need for complete retraining. The results demonstrate the effectiveness of our approaches, highlighting their potential for practical applications in cardiovascular health monitoring. Codes and models are released at URL.
Abstract:Surround-view depth estimation is a crucial task aims to acquire the depth maps of the surrounding views. It has many applications in real world scenarios such as autonomous driving, AR/VR and 3D reconstruction, etc. However, given that most of the data in the autonomous driving dataset is collected in daytime scenarios, this leads to poor depth model performance in the face of out-of-distribution(OoD) data. While some works try to improve the robustness of depth model under OoD data, these methods either require additional training data or lake generalizability. In this report, we introduce the DINO-SD, a novel surround-view depth estimation model. Our DINO-SD does not need additional data and has strong robustness. Our DINO-SD get the best performance in the track4 of ICRA 2024 RoboDepth Challenge.