Victor Zhong

Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows

Nov 12, 2024, via arXiv

Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?

Jul 15, 2024, via arXiv

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

Apr 11, 2024, via arXiv

Policy Improvement using Language Feedback Models

Feb 25, 2024, via arXiv

Text2Reward: Automated Dense Reward Function Generation for Reinforcement Learning

Sep 21, 2023, via arXiv

When Not to Trust Language Models: Investigating Effectiveness and Limitations of Parametric and Non-Parametric Memories

Dec 20, 2022, via arXiv

RoMQA: A Benchmark for Robust, Multi-evidence, Multi-answer Question Answering

Oct 25, 2022, via arXiv

M2D2: A Massively Multi-domain Language Modeling Dataset

Oct 13, 2022, via arXiv

Improving Policy Learning via Language Dynamics Distillation

Sep 30, 2022, via arXiv

Improving Intrinsic Exploration with Language Abstractions

Feb 17, 2022, via arXiv