Kazuhito Koishida

VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks

Oct 24, 2024

Zero-Shot Text-to-Speech from Continuous Text Streams

Oct 01, 2024

Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale

Sep 12, 2024

LiveSpeech: Low-Latency Zero-shot Text-to-Speech via Autoregressive Modeling of Audio Discrete Codes

Jun 05, 2024

Weakly-supervised Audio Separation via Bi-modal Semantic Similarity

Apr 02, 2024

uaMix-MAE: Efficient Tuning of Pretrained Audio Transformers with Unsupervised Audio Mixtures

Mar 14, 2024

Learned Image Compression with Text Quality Enhancement

Feb 13, 2024

Single-channel speech enhancement using learnable loss mixup

Dec 20, 2023

Automatic Disfluency Detection from Untranscribed Speech

Nov 01, 2023

Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation

Sep 19, 2023