Arman Cohan

MMVU: Measuring Expert-Level Multi-Discipline Video Understanding

Jan 21, 2025

ChemAgent: Self-updating Library in Large Language Models Improves Chemical Reasoning

Jan 11, 2025

Re-evaluating Automatic LLM System Ranking for Alignment with Human Preference

Dec 31, 2024

HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation

Dec 30, 2024

ChemSafetyBench: Benchmarking LLM Safety on Chemistry Domain

Nov 23, 2024

FinDVer: Explainable Claim Verification over Long and Hybrid-Content Financial Documents

Nov 08, 2024

SciDQA: A Deep Reading Comprehension Dataset over Scientific Papers

Nov 08, 2024

Bayesian Calibration of Win Rate Estimation with LLM Evaluators

Nov 07, 2024

M3SciQA: A Multi-Modal Multi-Document Scientific QA Benchmark for Evaluating Foundation Models

Nov 06, 2024

MDCure: A Scalable Pipeline for Multi-Document Instruction-Following

Oct 30, 2024