Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Dong Yi

A Benchmark for Crime Surveillance Video Analysis with Large Models

Feb 13, 2025

Haoran Chen, Dong Yi, Moyan Cao, Chensen Huang, Guibo Zhu, Jinqiao Wang

Abstract:Anomaly analysis in surveillance videos is a crucial topic in computer vision. In recent years, multimodal large language models (MLLMs) have outperformed task-specific models in various domains. Although MLLMs are particularly versatile, their abilities to understand anomalous concepts and details are insufficiently studied because of the outdated benchmarks of this field not providing MLLM-style QAs and efficient algorithms to assess the model's open-ended text responses. To fill this gap, we propose a benchmark for crime surveillance video analysis with large models denoted as UCVL, including 1,829 videos and reorganized annotations from the UCF-Crime and UCF-Crime Annotation datasets. We design six types of questions and generate diverse QA pairs. Then we develop detailed instructions and use OpenAI's GPT-4o for accurate assessment. We benchmark eight prevailing MLLMs ranging from 0.5B to 40B parameters, and the results demonstrate the reliability of this bench. Moreover, we finetune LLaVA-OneVision on UCVL's training set. The improvement validates our data's high quality for video anomaly analysis.

Via

Access Paper or Ask Questions

MME-Industry: A Cross-Industry Multimodal Evaluation Benchmark

Jan 28, 2025

Dongyi Yi, Guibo Zhu, Chenglin Ding, Zongshu Li, Dong Yi, Jinqiao Wang

Abstract:With the rapid advancement of Multimodal Large Language Models (MLLMs), numerous evaluation benchmarks have emerged. However, comprehensive assessments of their performance across diverse industrial applications remain limited. In this paper, we introduce MME-Industry, a novel benchmark designed specifically for evaluating MLLMs in industrial settings.The benchmark encompasses 21 distinct domain, comprising 1050 question-answer pairs with 50 questions per domain. To ensure data integrity and prevent potential leakage from public datasets, all question-answer pairs were manually crafted and validated by domain experts. Besides, the benchmark's complexity is effectively enhanced by incorporating non-OCR questions that can be answered directly, along with tasks requiring specialized domain knowledge. Moreover, we provide both Chinese and English versions of the benchmark, enabling comparative analysis of MLLMs' capabilities across these languages. Our findings contribute valuable insights into MLLMs' practical industrial applications and illuminate promising directions for future model optimization research.

* 9 pages,2 figures

Via

Access Paper or Ask Questions

AnyDesign: Versatile Area Fashion Editing via Mask-Free Diffusion

Aug 21, 2024

Yunfang Niu, Lingxiang Wu, Dong Yi, Jie Peng, Ning Jiang, Haiying Wu, Jinqiao Wang

Figure 1 for AnyDesign: Versatile Area Fashion Editing via Mask-Free Diffusion

Figure 2 for AnyDesign: Versatile Area Fashion Editing via Mask-Free Diffusion

Figure 3 for AnyDesign: Versatile Area Fashion Editing via Mask-Free Diffusion

Figure 4 for AnyDesign: Versatile Area Fashion Editing via Mask-Free Diffusion

Abstract:Fashion image editing aims to modify a person's appearance based on a given instruction. Existing methods require auxiliary tools like segmenters and keypoint extractors, lacking a flexible and unified framework. Moreover, these methods are limited in the variety of clothing types they can handle, as most datasets focus on people in clean backgrounds and only include generic garments such as tops, pants, and dresses. These limitations restrict their applicability in real-world scenarios. In this paper, we first extend an existing dataset for human generation to include a wider range of apparel and more complex backgrounds. This extended dataset features people wearing diverse items such as tops, pants, dresses, skirts, headwear, scarves, shoes, socks, and bags. Additionally, we propose AnyDesign, a diffusion-based method that enables mask-free editing on versatile areas. Users can simply input a human image along with a corresponding prompt in either text or image format. Our approach incorporates Fashion DiT, equipped with a Fashion-Guidance Attention (FGA) module designed to fuse explicit apparel types and CLIP-encoded apparel features. Both Qualitative and quantitative experiments demonstrate that our method delivers high-quality fashion editing and outperforms contemporary text-guided fashion editing methods.

Via

Access Paper or Ask Questions

Recurrent Context Compression: Efficiently Expanding the Context Window of LLM

Jun 10, 2024

Chensen Huang, Guibo Zhu, Xuepeng Wang, Yifei Luo, Guojing Ge, Haoran Chen, Dong Yi, Jinqiao Wang

Abstract:To extend the context length of Transformer-based large language models (LLMs) and improve comprehension capabilities, we often face limitations due to computational resources and bounded memory storage capacity. This work introduces a method called Recurrent Context Compression (RCC), designed to efficiently expand the context window length of LLMs within constrained storage space. We also investigate the issue of poor model responses when both instructions and context are compressed in downstream tasks, and propose an instruction reconstruction method to mitigate this problem. We validated the effectiveness of our approach on multiple tasks, achieving a compression rate of up to 32x on text reconstruction tasks with a BLEU4 score close to 0.95, and nearly 100\% accuracy on a passkey retrieval task with a sequence length of 1M. Finally, our method demonstrated competitive performance in long-text question-answering tasks compared to non-compressed methods, while significantly saving storage resources in long-text inference tasks. Our code, models, and demo are available at https://github.com/WUHU-G/RCC_Transformer

Via

Access Paper or Ask Questions

PFDM: Parser-Free Virtual Try-on via Diffusion Model

Feb 05, 2024

Yunfang Niu, Dong Yi, Lingxiang Wu, Zhiwei Liu, Pengxiang Cai, Jinqiao Wang

Figure 1 for PFDM: Parser-Free Virtual Try-on via Diffusion Model

Figure 2 for PFDM: Parser-Free Virtual Try-on via Diffusion Model

Figure 3 for PFDM: Parser-Free Virtual Try-on via Diffusion Model

Figure 4 for PFDM: Parser-Free Virtual Try-on via Diffusion Model

Abstract:Virtual try-on can significantly improve the garment shopping experiences in both online and in-store scenarios, attracting broad interest in computer vision. However, to achieve high-fidelity try-on performance, most state-of-the-art methods still rely on accurate segmentation masks, which are often produced by near-perfect parsers or manual labeling. To overcome the bottleneck, we propose a parser-free virtual try-on method based on the diffusion model (PFDM). Given two images, PFDM can "wear" garments on the target person seamlessly by implicitly warping without any other information. To learn the model effectively, we synthesize many pseudo-images and construct sample pairs by wearing various garments on persons. Supervised by the large-scale expanded dataset, we fuse the person and garment features using a proposed Garment Fusion Attention (GFA) mechanism. Experiments demonstrate that our proposed PFDM can successfully handle complex cases, synthesize high-fidelity images, and outperform both state-of-the-art parser-free and parser-based models.

* Accepted by IEEE ICASSP 2024

Via

Access Paper or Ask Questions

Large-scale Bisample Learning on ID vs. Spot Face Recognition

Jun 11, 2018

Xiangyu Zhu, Hao Liu, Zhen Lei, Hailin Shi, Fan Yang, Dong Yi, Stan Z. Li

Figure 1 for Large-scale Bisample Learning on ID vs. Spot Face Recognition

Figure 2 for Large-scale Bisample Learning on ID vs. Spot Face Recognition

Figure 3 for Large-scale Bisample Learning on ID vs. Spot Face Recognition

Figure 4 for Large-scale Bisample Learning on ID vs. Spot Face Recognition

Abstract:In many face recognition applications, there is large amount of face data with two images for each person. One is an ID photo for face enrollment, and the other is a probe photo captured on spot. Most existing methods are designed for training data with limited breadth (relatively small class number) and sufficient depth (many samples for each class). They would meet great challenges when applied on this ID vs. Spot (IvS) data, including the under-represented intra-class variations and the excessive demand on computing devices. In this paper, we propose a deep learning based large-scale bisample learning (LBL) method for IvS face recognition. To tackle the bisample problem that there are only two samples for each class, a classification-verification-classification (CVC) training strategy is proposed to progressively enhance the IvS performance. Besides, a dominant prototype softmax (DP-softmax) is incorporated to make the deep learning applicable on large-scale classes. We conduct LBL on a IvS face dataset with more than two million identities. Experimental results show the proposed method achieves superior performance than previous ones, validating the effectiveness of LBL on IvS face recognition.

Via

Access Paper or Ask Questions

When Face Recognition Meets with Deep Learning: an Evaluation of Convolutional Neural Networks for Face Recognition

Apr 09, 2015

Guosheng Hu, Yongxin Yang, Dong Yi, Josef Kittler, William Christmas, Stan Z. Li, Timothy Hospedales

Figure 1 for When Face Recognition Meets with Deep Learning: an Evaluation of Convolutional Neural Networks for Face Recognition

Figure 2 for When Face Recognition Meets with Deep Learning: an Evaluation of Convolutional Neural Networks for Face Recognition

Figure 3 for When Face Recognition Meets with Deep Learning: an Evaluation of Convolutional Neural Networks for Face Recognition

Figure 4 for When Face Recognition Meets with Deep Learning: an Evaluation of Convolutional Neural Networks for Face Recognition

Abstract:Deep learning, in particular Convolutional Neural Network (CNN), has achieved promising results in face recognition recently. However, it remains an open question: why CNNs work well and how to design a 'good' architecture. The existing works tend to focus on reporting CNN architectures that work well for face recognition rather than investigate the reason. In this work, we conduct an extensive evaluation of CNN-based face recognition systems (CNN-FRS) on a common ground to make our work easily reproducible. Specifically, we use public database LFW (Labeled Faces in the Wild) to train CNNs, unlike most existing CNNs trained on private databases. We propose three CNN architectures which are the first reported architectures trained using LFW data. This paper quantitatively compares the architectures of CNNs and evaluate the effect of different implementation choices. We identify several useful properties of CNN-FRS. For instance, the dimensionality of the learned features can be significantly reduced without adverse effect on face recognition accuracy. In addition, traditional metric learning method exploiting CNN-learned features is evaluated. Experiments show two crucial factors to good CNN-FRS performance are the fusion of multiple CNNs and metric learning. To make our work reproducible, source code and models will be made publicly available.

* 7 pages, 4 figures, 7 tables

Via

Access Paper or Ask Questions

Learning Face Representation from Scratch

Nov 28, 2014

Dong Yi, Zhen Lei, Shengcai Liao, Stan Z. Li

Figure 1 for Learning Face Representation from Scratch

Figure 2 for Learning Face Representation from Scratch

Figure 3 for Learning Face Representation from Scratch

Figure 4 for Learning Face Representation from Scratch

Abstract:Pushing by big data and deep convolutional neural network (CNN), the performance of face recognition is becoming comparable to human. Using private large scale training datasets, several groups achieve very high performance on LFW, i.e., 97% to 99%. While there are many open source implementations of CNN, none of large scale face dataset is publicly available. The current situation in the field of face recognition is that data is more important than algorithm. To solve this problem, this paper proposes a semi-automatical way to collect face images from Internet and builds a large scale dataset containing about 10,000 subjects and 500,000 images, called CASIAWebFace. Based on the database, we use a 11-layer CNN to learn discriminative representation and obtain state-of-theart accuracy on LFW and YTF. The publication of CASIAWebFace will attract more research groups entering this field and accelerate the development of face recognition in the wild.

Via

Access Paper or Ask Questions

Deep Metric Learning for Practical Person Re-Identification

Jul 18, 2014

Dong Yi, Zhen Lei, Stan Z. Li

Figure 1 for Deep Metric Learning for Practical Person Re-Identification

Figure 2 for Deep Metric Learning for Practical Person Re-Identification

Figure 3 for Deep Metric Learning for Practical Person Re-Identification

Figure 4 for Deep Metric Learning for Practical Person Re-Identification

Abstract:Various hand-crafted features and metric learning methods prevail in the field of person re-identification. Compared to these methods, this paper proposes a more general way that can learn a similarity metric from image pixels directly. By using a "siamese" deep neural network, the proposed method can jointly learn the color feature, texture feature and metric in a unified framework. The network has a symmetry structure with two sub-networks which are connected by Cosine function. To deal with the big variations of person images, binomial deviance is used to evaluate the cost between similarities and labels, which is proved to be robust to outliers. Compared to existing researches, a more practical setting is studied in the experiments that is training and test on different datasets (cross dataset person re-identification). Both in "intra dataset" and "cross dataset" settings, the superiorities of the proposed method are illustrated on VIPeR and PRID.

Via

Access Paper or Ask Questions

Shared Representation Learning for Heterogeneous Face Recognition

Jun 05, 2014

Dong Yi, Zhen Lei, Shengcai Liao, Stan Z. Li

Figure 1 for Shared Representation Learning for Heterogeneous Face Recognition

Figure 2 for Shared Representation Learning for Heterogeneous Face Recognition

Figure 3 for Shared Representation Learning for Heterogeneous Face Recognition

Figure 4 for Shared Representation Learning for Heterogeneous Face Recognition

Abstract:After intensive research, heterogenous face recognition is still a challenging problem. The main difficulties are owing to the complex relationship between heterogenous face image spaces. The heterogeneity is always tightly coupled with other variations, which makes the relationship of heterogenous face images highly nonlinear. Many excellent methods have been proposed to model the nonlinear relationship, but they apt to overfit to the training set, due to limited samples. Inspired by the unsupervised algorithms in deep learning, this paper proposes an novel framework for heterogeneous face recognition. We first extract Gabor features at some localized facial points, and then use Restricted Boltzmann Machines (RBMs) to learn a shared representation locally to remove the heterogeneity around each facial point. Finally, the shared representations of local RBMs are connected together and processed by PCA. Two problems (Sketch-Photo and NIR-VIS) and three databases are selected to evaluate the proposed method. For Sketch-Photo problem, we obtain perfect results on the CUFS database. For NIR-VIS problem, we produce new state-of-the-art performance on the CASIA HFB and NIR-VIS 2.0 databases.

Via

Access Paper or Ask Questions