Bo Feng

Breaking Down Video LLM Benchmarks: Knowledge, Spatial Perception, or True Temporal Understanding?

May 20, 2025

Bo Feng, Zhengfeng Lai, Shiyu Li, Zizhen Wang, Simon Wang, Ping Huang, Meng Cao

Abstract:Existing video understanding benchmarks often conflate knowledge-based and purely image-based questions, rather than clearly isolating a model's temporal reasoning ability, which is the key aspect that distinguishes video understanding from other modalities. We identify two major limitations that obscure whether higher scores truly indicate stronger understanding of the dynamic content in videos: (1) strong language priors, where models can answer questions without watching the video; and (2) shuffling invariance, where models maintain similar performance on certain questions even when video frames are temporally shuffled. To alleviate these issues, we propose VBenchComp, an automated pipeline that categorizes questions into different domains: LLM-Answerable, Semantic, and Temporal. Specifically, LLM-Answerable questions can be answered without viewing the video; Semantic questions remain answerable even when the video frames are shuffled; and Temporal questions require understanding the correct temporal order of frames. The rest of the questions are labeled as Others. This can enable fine-grained evaluation of different capabilities of a video LLM. Our analysis reveals nuanced model weaknesses that are hidden by traditional overall scores, and we offer insights and recommendations for designing future benchmarks that more accurately assess video LLMs.
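The four-way split described in this abstract can be sketched as a small decision rule. This is only an illustrative sketch, not the authors' pipeline: the `ProbeResult` fields, the `categorize` function, and the order in which the checks are applied are all assumptions, since the abstract does not state how each probe is run or how ties are resolved.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResult:
    """Hypothetical per-question outcomes of three probes of a model."""
    correct_without_video: bool         # text-only probe, no frames given
    correct_with_shuffled_frames: bool  # frames given in random order
    correct_with_ordered_frames: bool   # frames given in original order


def categorize(result: ProbeResult) -> str:
    """Assign a question to one category (assumed precedence order)."""
    if result.correct_without_video:
        # Language priors suffice, so the video is not needed.
        return "LLM-Answerable"
    if result.correct_with_shuffled_frames:
        # Frame content suffices, so frame order is not needed.
        return "Semantic"
    if result.correct_with_ordered_frames:
        # Only the correctly ordered video yields the answer.
        return "Temporal"
    return "Others"


if __name__ == "__main__":
    probes = [
        ProbeResult(True, True, True),
        ProbeResult(False, True, True),
        ProbeResult(False, False, True),
        ProbeResult(False, False, False),
    ]
    # Count how many questions fall into each category.
    print(Counter(categorize(p) for p in probes))
```

Reporting accuracy separately for each category, rather than one overall score, is the kind of fine-grained evaluation the abstract argues for.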

Large Language Models Are More Persuasive Than Incentivized Human Persuaders

May 14, 2025

Philipp Schoenegger, Francesco Salvi, Jiacheng Liu, Xiaoli Nan, Ramit Debnath, Barbara Fasolo, Evelina Leivada, Gabriel Recchia, Fritz Günther, Ali Zarifhonarvar(+30 more)

Abstract:We directly compare the persuasion capabilities of a frontier large language model (LLM; Claude Sonnet 3.5) against incentivized human persuaders in an interactive, real-time conversational quiz setting. In this preregistered, large-scale incentivized experiment, participants (quiz takers) completed an online quiz where persuaders (either humans or LLMs) attempted to persuade quiz takers toward correct or incorrect answers. We find that LLM persuaders achieved significantly higher compliance with their directional persuasion attempts than incentivized human persuaders, demonstrating superior persuasive capabilities in both truthful (toward correct answers) and deceptive (toward incorrect answers) contexts. We also find that LLM persuaders significantly increased quiz takers' accuracy, leading to higher earnings, when steering quiz takers toward correct answers, and significantly decreased their accuracy, leading to lower earnings, when steering them toward incorrect answers. Overall, our findings suggest that AI's persuasion capabilities already exceed those of humans that have real-money bonuses tied to performance. Our findings of increasingly capable AI persuaders thus underscore the urgency of emerging alignment and governance frameworks.

StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant

May 08, 2025

Haibo Wang, Bo Feng, Zhengfeng Lai, Mingze Xu, Shiyu Li, Weifeng Ge, Afshin Dehghan, Meng Cao, Ping Huang

Abstract:We present StreamBridge, a simple yet effective framework that seamlessly transforms offline Video-LLMs into streaming-capable models. It addresses two fundamental challenges in adapting existing models into online scenarios: (1) limited capability for multi-turn real-time understanding, and (2) lack of proactive response mechanisms. Specifically, StreamBridge incorporates (1) a memory buffer combined with a round-decayed compression strategy, supporting long-context multi-turn interactions, and (2) a decoupled, lightweight activation model that can be effortlessly integrated into existing Video-LLMs, enabling continuous proactive responses. To further support StreamBridge, we construct Stream-IT, a large-scale dataset tailored for streaming video understanding, featuring interleaved video-text sequences and diverse instruction formats. Extensive experiments show that StreamBridge significantly improves the streaming understanding capabilities of offline Video-LLMs across various tasks, outperforming even proprietary models such as GPT-4o and Gemini 1.5 Pro. Simultaneously, it achieves competitive or superior performance on standard video understanding benchmarks.

TrojText: Test-time Invisible Textual Trojan Insertion

Mar 03, 2023

Yepeng Liu, Bo Feng, Qian Lou

Abstract:In Natural Language Processing (NLP), intelligent neuron models can be susceptible to textual Trojan attacks. Such attacks occur when Trojan models behave normally for standard inputs but generate malicious output for inputs that contain a specific trigger. Syntactic-structure triggers, which are invisible, are becoming more popular for Trojan attacks because they are difficult to detect and defend against. However, these types of attacks require a large corpus of training data to generate poisoned samples with the necessary syntactic structures for Trojan insertion. Obtaining such data can be difficult for attackers, and the process of generating syntactic poisoned triggers and inserting Trojans can be time-consuming. This paper proposes a solution called TrojText, which aims to determine whether invisible textual Trojan attacks can be performed more efficiently and cost-effectively without training data. The proposed approach, called the Representation-Logit Trojan Insertion (RLI) algorithm, uses smaller sampled test data instead of large training data to achieve the desired attack. The paper also introduces two additional techniques, namely the accumulated gradient ranking (AGR) and Trojan Weights Pruning (TWP), to reduce the number of tuned parameters and the attack overhead. The TrojText approach was evaluated on three datasets (AG's News, SST-2, and OLID) using three NLP models (BERT, XLNet, and DeBERTa). The experiments demonstrated that the TrojText approach achieved a 98.35% classification accuracy for test sentences in the target class on the BERT model for the AG's News dataset. The source code for TrojText is available at https://github.com/UCF-ML-Research/TrojText.

* ICLR 2023 Camera Ready

GTrans: Spatiotemporal Autoregressive Transformer with Graph Embeddings for Nowcasting Extreme Events

Jan 18, 2022

Bo Feng, Geoffrey Fox

Abstract:Spatiotemporal time series nowcasting should preserve temporal and spatial dynamics, in the sense that new sequences generated by models respect the covariance relationships in the history. Conventional feature extractors are built with deep convolutional neural networks (CNNs). However, CNN models are limited to image-like applications where data can be arranged as high-dimensional arrays. In contrast, in applications such as social networks, road traffic, physics, and chemical property prediction, data features can be organized as nodes and edges of graphs. The Transformer architecture is an emerging method for predictive models, bringing high accuracy and efficiency due to its attention mechanism design. This paper proposes a spatiotemporal model, namely GTrans, that transforms data features into graph embeddings and predicts temporal dynamics with a transformer model. Our experiments demonstrate that GTrans can model spatial and temporal dynamics and nowcast extreme events in the datasets studied. Furthermore, in all the experiments, GTrans achieves higher F1 and F2 scores than the baseline models in binary-class prediction tests.
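For readers unfamiliar with the F1 and F2 scores mentioned in this abstract, both are instances of the standard F-beta score, where beta = 2 weights recall more heavily than precision (useful when missing an extreme event is costlier than a false alarm). The sketch below uses the standard definition; the function name and the example counts are illustrative and are not taken from the paper.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F-beta score from binary confusion counts.

    Uses F_beta = (1 + b^2) * TP / ((1 + b^2) * TP + b^2 * FN + FP),
    which is algebraically equal to
    (1 + b^2) * P * R / (b^2 * P + R) for precision P and recall R.
    """
    b2 = beta * beta
    denominator = (1 + b2) * tp + b2 * fn + fp
    if denominator == 0:
        # No positives predicted or present; the score is undefined,
        # and 0.0 is returned here by convention.
        return 0.0
    return (1 + b2) * tp / denominator


if __name__ == "__main__":
    # Illustrative counts only: 8 true positives, 2 false positives,
    # 4 false negatives.
    print(f"F1 = {f_beta(8, 2, 4, beta=1.0):.4f}")
    print(f"F2 = {f_beta(8, 2, 4, beta=2.0):.4f}")
```

With these counts, recall (8/12) is lower than precision (8/10), so F2 comes out below F1 because it penalizes missed positives more.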

Earthquake Nowcasting with Deep Learning

Dec 18, 2021

Geoffrey Fox, John Rundle, Andrea Donnellan, Bo Feng

Abstract:We review previous approaches to nowcasting earthquakes and introduce new approaches based on deep learning using three distinct models based on recurrent neural networks and transformers. We discuss different choices for observables and measures, presenting promising initial results for a region of Southern California from 1950 to 2020. Earthquake activity is predicted as a function of 0.1-degree spatial bins for time periods varying from two weeks to four years. The overall quality is measured by the Nash-Sutcliffe Efficiency, which compares the deviation between nowcast and observation with the variance over time in each spatial region. The software is available as open source, together with the preprocessed data from the USGS.
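The Nash-Sutcliffe Efficiency named in this abstract has a standard definition: one minus the ratio of the squared nowcast error to the variance of the observations around their mean. The sketch below computes it for a single time series, for example one spatial bin; the function name and sample values are illustrative, and how the paper aggregates across bins is not stated in the abstract.

```python
from typing import Sequence


def nash_sutcliffe_efficiency(
    observed: Sequence[float], predicted: Sequence[float]
) -> float:
    """NSE = 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2).

    1.0 means a perfect match; 0.0 means the prediction is no better
    than always predicting the mean of the observations.
    """
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    error = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    variance = sum((o - mean_obs) ** 2 for o in observed)
    if variance == 0:
        raise ValueError("NSE is undefined for constant observations")
    return 1.0 - error / variance


if __name__ == "__main__":
    # Illustrative values only, not data from the paper.
    obs = [1.0, 2.0, 3.0, 4.0]
    print(nash_sutcliffe_efficiency(obs, obs))          # perfect nowcast
    print(nash_sutcliffe_efficiency(obs, [2.5] * 4))    # mean-only nowcast
```

Values below zero are possible and indicate a nowcast that does worse than the observed mean.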

Top 3 in FG 2021 Families In the Wild Kinship Verification Challenge

Oct 27, 2021

Junyi Huang, Maxwell Benjamin Strome, Ian Jenkins, Parker Williams, Bo Feng, Yaning Wang, Roman Wang, Vaibhav Bagri, Newman Cheng, Iddo Drori

Abstract:Kinship verification is the task of determining whether a parent-child, sibling, or grandparent-grandchild relationship exists between two people. It is important in social media applications, forensic investigations, finding missing children, and reuniting families. We demonstrate high-quality kinship verification by participating in the 2021 Recognizing Families in the Wild challenge, which provides the largest publicly available dataset in the field. Our approach is among the top 3 winning entries in the competition. We ensemble models written by both human experts and OpenAI Codex. We make our models and code publicly available.

* IEEE International Conference on Automatic Face and Gesture Recognition, Recognizing Families In the Wild Kinship Verification Challenge, 2021

Inverse Kinematics and Dexterous Workspace Formulation for 2-Segment Continuum Robots with Inextensible Segments

Oct 05, 2021

Yifan Wang, Zhonghao Wu, Longfei Wang, Bo Feng, Kai Xu

Abstract:The inverse kinematics (IK) problem of continuum robots has been investigated in depth in the past decades. Under the constant-curvature bending assumption, a closed-form IK solution has been obtained for continuum robots with variable segment lengths. Attempting to close the gap towards a complete solution, this paper presents an efficient solution for the IK problem of 2-segment continuum robots with one or two inextensible segments (i.e., constant segment lengths). By representing the robot's shape as piecewise line segments, the configuration variables are separated from the IK formulation such that solving a one-variable nonlinear equation leads to the solution of the entire IK problem. Furthermore, an in-depth investigation was conducted into the boundaries of the end effector's dexterous workspace, which arise from the configuration variable limits and from the angular velocity singularities of the continuum robots. This dexterous workspace formulation, which is derived for the first time to the best of the authors' knowledge, is particularly useful for finding the closest orientation to a target pose when the target orientation is outside the dexterous workspace. In comparative simulation studies between the proposed method and the Jacobian-based IK method involving 500,000 cases, the proposed variable separation method solved 100% of the IK problems with much higher computational efficiency.

* Submitted to IEEE Robotics and Automation Letters

Spatial Attention-based Non-reference Perceptual Quality Prediction Network for Omnidirectional Images

Mar 10, 2021

Li Yang, Mai Xu, Deng Xin, Bo Feng

Abstract:Due to the strong correlation between visual attention and perceptual quality, many methods attempt to use human saliency information for image quality assessment. Although this mechanism can achieve good performance, the networks require human saliency labels, which are not easily accessible for omnidirectional images (ODIs). To alleviate this issue, we propose a spatial attention-based perceptual quality prediction network for non-reference quality assessment on ODIs (SAP-net). To drive our SAP-net, we establish a large-scale IQA dataset of ODIs (IQA-ODI), which is composed of subjective scores of 200 subjects on 1,080 ODIs. In IQA-ODI, there are 120 high-quality ODIs as references, and 960 ODIs with impairments in both JPEG compression and map projection. Without any human saliency labels, our network can adaptively estimate human perceptual quality on impaired ODIs in a self-attention manner, which significantly improves the prediction performance of quality scores. Moreover, our method greatly reduces the computational complexity of the quality assessment task on ODIs. Extensive experiments validate that our network outperforms 9 state-of-the-art methods for quality assessment on ODIs. The dataset and code are available at https://github.com/yanglixiaoshen/SAP-Net.

* Accepted by IEEE ICME 2021

TSEQPREDICTOR: Spatiotemporal Extreme Earthquakes Forecasting for Southern California

Dec 20, 2020

Bo Feng, Geoffrey C. Fox

Abstract:Over the past few decades, seismology has used the most advanced technologies and equipment to monitor seismic events globally. However, forecasting disasters such as earthquakes remains an underdeveloped topic. Recent research in spatiotemporal forecasting has revealed some possibilities of successful prediction, making it an important topic in many scientific research fields. Many of these studies apply deep neural networks successfully. In geoscience, earthquake prediction is one of the world's most challenging problems, and cutting-edge deep learning technologies may help to discover useful patterns. In this project, we propose a joint deep learning modeling method for earthquake forecasting, namely TSEQPREDICTOR. In TSEQPREDICTOR, we combine deep learning technologies with domain knowledge in seismology and approach the prediction problem using encoder-decoder and temporal convolutional neural networks. Compared to some state-of-the-art recurrent neural networks, our experiments show that our method is promising in predicting major shocks for earthquakes in Southern California.
