Guijin Son

MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models

Oct 23, 2024

LLM-as-a-Judge & Reward Model: What They Can and Cannot Do

Sep 17, 2024

The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models

Jun 09, 2024

ESG Classification by Implicit Rule Learning via GPT-4

Mar 22, 2024

KMMLU: Measuring Massive Multitask Language Understanding in Korean

Feb 18, 2024

Multi-Task Inference: Can Large Language Models Follow Multiple Instructions at Once?

Feb 18, 2024

HAE-RAE Bench: Evaluation of Korean Knowledge in Language Models

Sep 15, 2023

Beyond Classification: Financial Reasoning in State-of-the-Art Language Models

Apr 30, 2023

Removing Non-Stationary Knowledge From Pre-Trained Language Models for Entity-Level Sentiment Classification in Finance

Jan 25, 2023