Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Fanjie Kong

Beyond Speaker Identity: Text Guided Target Speech Extraction

Jan 15, 2025

Mingyue Huo, Abhinav Jain, Cong Phuoc Huynh, Fanjie Kong, Pichao Wang, Zhu Liu, Vimal Bhat

Abstract:Target Speech Extraction (TSE) traditionally relies on explicit clues about the speaker's identity like enrollment audio, face images, or videos, which may not always be available. In this paper, we propose a text-guided TSE model StyleTSE that uses natural language descriptions of speaking style in addition to the audio clue to extract the desired speech from a given mixture. Our model integrates a speech separation network adapted from SepFormer with a bi-modality clue network that flexibly processes both audio and text clues. To train and evaluate our model, we introduce a new dataset TextrolMix with speech mixtures and natural language descriptions. Experimental results demonstrate that our method effectively separates speech based not only on who is speaking, but also on how they are speaking, enhancing TSE in scenarios where traditional audio clues are absent. Demos are at: https://mingyue66.github.io/TextrolMix/demo/

* Accepted by ICASSP 2025

Via

Access Paper or Ask Questions

Autonomous Driving in Unstructured Environments: How Far Have We Come?

Oct 10, 2024

Chen Min, Shubin Si, Xu Wang, Hanzhang Xue, Weizhong Jiang, Yang Liu, Juan Wang, Qingtian Zhu, Qi Zhu, Lun Luo(+27 more)

Figure 1 for Autonomous Driving in Unstructured Environments: How Far Have We Come?

Figure 2 for Autonomous Driving in Unstructured Environments: How Far Have We Come?

Figure 3 for Autonomous Driving in Unstructured Environments: How Far Have We Come?

Figure 4 for Autonomous Driving in Unstructured Environments: How Far Have We Come?

Abstract:Research on autonomous driving in unstructured outdoor environments is less advanced than in structured urban settings due to challenges like environmental diversities and scene complexity. These environments-such as rural areas and rugged terrains-pose unique obstacles that are not common in structured urban areas. Despite these difficulties, autonomous driving in unstructured outdoor environments is crucial for applications in agriculture, mining, and military operations. Our survey reviews over 250 papers for autonomous driving in unstructured outdoor environments, covering offline mapping, pose estimation, environmental perception, path planning, end-to-end autonomous driving, datasets, and relevant challenges. We also discuss emerging trends and future research directions. This review aims to consolidate knowledge and encourage further research for autonomous driving in unstructured environments. To support ongoing work, we maintain an active repository with up-to-date literature and open-source projects at: https://github.com/chaytonmin/Survey-Autonomous-Driving-in-Unstructured-Environments.

* Survey paper; 38 pages

Via

Access Paper or Ask Questions

Hyperbolic Learning with Synthetic Captions for Open-World Detection

Apr 07, 2024

Fanjie Kong, Yanbei Chen, Jiarui Cai, Davide Modolo

Figure 1 for Hyperbolic Learning with Synthetic Captions for Open-World Detection

Figure 2 for Hyperbolic Learning with Synthetic Captions for Open-World Detection

Figure 3 for Hyperbolic Learning with Synthetic Captions for Open-World Detection

Figure 4 for Hyperbolic Learning with Synthetic Captions for Open-World Detection

Abstract:Open-world detection poses significant challenges, as it requires the detection of any object using either object class labels or free-form texts. Existing related works often use large-scale manual annotated caption datasets for training, which are extremely expensive to collect. Instead, we propose to transfer knowledge from vision-language models (VLMs) to enrich the open-vocabulary descriptions automatically. Specifically, we bootstrap dense synthetic captions using pre-trained VLMs to provide rich descriptions on different regions in images, and incorporate these captions to train a novel detector that generalizes to novel concepts. To mitigate the noise caused by hallucination in synthetic captions, we also propose a novel hyperbolic vision-language learning approach to impose a hierarchy between visual and caption embeddings. We call our detector ``HyperLearner''. We conduct extensive experiments on a wide variety of open-world detection benchmarks (COCO, LVIS, Object Detection in the Wild, RefCOCO) and our results show that our model consistently outperforms existing state-of-the-art methods, such as GLIP, GLIPv2 and Grounding DINO, when using the same backbone.

* CVPR 2024

Via

Access Paper or Ask Questions

Neural Insights for Digital Marketing Content Design

Feb 02, 2023

Fanjie Kong, Yuan Li, Houssam Nassif, Tanner Fiez, Shreya Chakrabarti, Ricardo Henao

Abstract:In digital marketing, experimenting with new website content is one of the key levers to improve customer engagement. However, creating successful marketing content is a manual and time-consuming process that lacks clear guiding principles. This paper seeks to close the loop between content creation and online experimentation by offering marketers AI-driven actionable insights based on historical data to improve their creative process. We present a neural-network-based system that scores and extracts insights from a marketing content design, namely, a multimodal neural network predicts the attractiveness of marketing contents, and a post-hoc attribution method generates actionable insights for marketers to improve their content in specific marketing locations. Our insights not only point out the advantages and drawbacks of a given current content, but also provide design recommendations based on historical data. We show that our scoring model and insights work well both quantitatively and qualitatively.

Via

Access Paper or Ask Questions

Efficient Classification of Very Large Images with Tiny Objects

Jun 04, 2021

Fanjie Kong, Ricardo Henao

Figure 1 for Efficient Classification of Very Large Images with Tiny Objects

Figure 2 for Efficient Classification of Very Large Images with Tiny Objects

Figure 3 for Efficient Classification of Very Large Images with Tiny Objects

Figure 4 for Efficient Classification of Very Large Images with Tiny Objects

Abstract:An increasing number of applications in the computer vision domain, specially, in medical imaging and remote sensing, are challenging when the goal is to classify very large images with tiny objects. More specifically, these type of classification tasks face two key challenges: $i$) the size of the input image in the target dataset is usually in the order of megapixels, however, existing deep architectures do not easily operate on such big images due to memory constraints, consequently, we seek a memory-efficient method to process these images; and $ii$) only a small fraction of the input images are informative of the label of interest, resulting in low region of interest (ROI) to image ratio. However, most of the current convolutional neural networks (CNNs) are designed for image classification datasets that have relatively large ROIs and small image size (sub-megapixel). Existing approaches have addressed these two challenges in isolation. We present an end-to-end CNN model termed Zoom-In network that leverages hierarchical attention sampling for classification of large images with tiny objects using a single GPU. We evaluate our method on two large-image datasets and one gigapixel dataset. Experimental results show that our model achieves higher accuracy than existing methods while requiring less computing resources.

Via

Access Paper or Ask Questions

Quantum Tensor Network in Machine Learning: An Application to Tiny Object Classification

Jan 08, 2021

Fanjie Kong, Xiao-yang Liu, Ricardo Henao

Figure 1 for Quantum Tensor Network in Machine Learning: An Application to Tiny Object Classification

Figure 2 for Quantum Tensor Network in Machine Learning: An Application to Tiny Object Classification

Figure 3 for Quantum Tensor Network in Machine Learning: An Application to Tiny Object Classification

Figure 4 for Quantum Tensor Network in Machine Learning: An Application to Tiny Object Classification

Abstract:Tiny object classification problem exists in many machine learning applications like medical imaging or remote sensing, where the object of interest usually occupies a small region of the whole image. It is challenging to design an efficient machine learning model with respect to tiny object of interest. Current neural network structures are unable to deal with tiny object efficiently because they are mainly developed for images featured by large scale objects. However, in quantum physics, there is a great theoretical foundation guiding us to analyze the target function for image classification regarding to specific objects size ratio. In our work, we apply Tensor Networks to solve this arising tough machine learning problem. First, we summarize the previous work that connects quantum spin model to image classification and bring the theory into the scenario of tiny object classification. Second, we propose using 2D multi-scale entanglement renormalization ansatz (MERA) to classify tiny objects in image. In the end, our experimental results indicate that tensor network models are effective for tiny object classification problem and potentially will beat state-of-the-art. Our codes will be available online https://github.com/timqqt/MERA_Image_Classification.

* https://tensorworkshop.github.io/NeurIPS2020/CFP.html
* 8 pages, 7 figures

Via

Access Paper or Ask Questions

Proactive Pseudo-Intervention: Causally Informed Contrastive Learning For Interpretable Vision Models

Dec 06, 2020

Dong Wang, Yuewei Yang, Chenyang Tao, Fanjie Kong, Ricardo Henao, Lawrence Carin

Figure 1 for Proactive Pseudo-Intervention: Causally Informed Contrastive Learning For Interpretable Vision Models

Figure 2 for Proactive Pseudo-Intervention: Causally Informed Contrastive Learning For Interpretable Vision Models

Figure 3 for Proactive Pseudo-Intervention: Causally Informed Contrastive Learning For Interpretable Vision Models

Figure 4 for Proactive Pseudo-Intervention: Causally Informed Contrastive Learning For Interpretable Vision Models

Abstract:Deep neural networks have shown significant promise in comprehending complex visual signals, delivering performance on par or even superior to that of human experts. However, these models often lack a mechanism for interpreting their predictions, and in some cases, particularly when the sample size is small, existing deep learning solutions tend to capture spurious correlations that compromise model generalizability on unseen inputs. In this work, we propose a contrastive causal representation learning strategy that leverages proactive interventions to identify causally-relevant image features, called Proactive Pseudo-Intervention (PPI). This approach is complemented with a causal salience map visualization module, i.e., Weight Back Propagation (WBP), that identifies important pixels in the raw input image, which greatly facilitates the interpretability of predictions. To validate its utility, our model is benchmarked extensively on both standard natural images and challenging medical image datasets. We show this new contrastive causal representation learning model consistently improves model performance relative to competing solutions, particularly for out-of-domain predictions or when dealing with data integration from heterogeneous sources. Further, our causal saliency maps are more succinct and meaningful relative to their non-causal counterparts.

Via

Access Paper or Ask Questions

Physics-enhanced machine learning for virtual fluorescence microscopy

Apr 21, 2020

Colin L. Cooke, Fanjie Kong, Amey Chaware, Kevin C. Zhou, Kanghyun Kim, Rong Xu, D. Michael Ando, Samuel J. Yang, Pavan Chandra Konda, Roarke Horstmeyer

Figure 1 for Physics-enhanced machine learning for virtual fluorescence microscopy

Figure 2 for Physics-enhanced machine learning for virtual fluorescence microscopy

Figure 3 for Physics-enhanced machine learning for virtual fluorescence microscopy

Figure 4 for Physics-enhanced machine learning for virtual fluorescence microscopy

Abstract:This paper introduces a new method of data-driven microscope design for virtual fluorescence microscopy. Our results show that by including a model of illumination within the first layers of a deep convolutional neural network, it is possible to learn task-specific LED patterns that substantially improve the ability to infer fluorescence image information from unstained transmission microscopy images. We validated our method on two different experimental setups, with different magnifications and different sample types, to show a consistent improvement in performance as compared to conventional illumination methods. Additionally, to understand the importance of learned illumination on inference task, we varied the dynamic range of the fluorescent image targets (from one to seven bits), and showed that the margin of improvement for learned patterns increased with the information content of the target. This work demonstrates the power of programmable optical elements at enabling better machine learning algorithm performance and at providing physical insight into next generation of machine-controlled imaging systems.

* 12 pages, 13 figures

Via

Access Paper or Ask Questions

The Synthinel-1 dataset: a collection of high resolution synthetic overhead imagery for building segmentation

Jan 15, 2020

Fanjie Kong, Bohao Huang, Kyle Bradbury, Jordan M. Malof

Figure 1 for The Synthinel-1 dataset: a collection of high resolution synthetic overhead imagery for building segmentation

Figure 2 for The Synthinel-1 dataset: a collection of high resolution synthetic overhead imagery for building segmentation

Figure 3 for The Synthinel-1 dataset: a collection of high resolution synthetic overhead imagery for building segmentation

Figure 4 for The Synthinel-1 dataset: a collection of high resolution synthetic overhead imagery for building segmentation

Abstract:Recently deep learning - namely convolutional neural networks (CNNs) - have yielded impressive performance for the task of building segmentation on large overhead (e.g., satellite) imagery benchmarks. However, these benchmark datasets only capture a small fraction of the variability present in real-world overhead imagery, limiting the ability to properly train, or evaluate, models for real-world application. Unfortunately, developing a dataset that captures even a small fraction of real-world variability is typically infeasible due to the cost of imagery, and manual pixel-wise labeling of the imagery. In this work we develop an approach to rapidly and cheaply generate large and diverse virtual environments from which we can capture synthetic overhead imagery for training segmentation CNNs. Using this approach, generate and publicly-release a collection of synthetic overhead imagery - termed Synthinel-1 with full pixel-wise building labels. We use several benchmark dataset to demonstrate that Synthinel-1 is consistently beneficial when used to augment real-world training imagery, especially when CNNs are tested on novel geographic locations or conditions.

* Preprint of paper accepted for publication at Winter Conference on Applications of Computer Vision (WACV) 2020

Via

Access Paper or Ask Questions