Picture for Juyoung Suk

Juyoung Suk

LLM-AS-AN-INTERVIEWER: Beyond Static Testing Through Dynamic LLM Evaluation

Add code
Dec 10, 2024
Viaarxiv icon

Evaluating Language Models as Synthetic Data Generators

Add code
Dec 04, 2024
Figure 1 for Evaluating Language Models as Synthetic Data Generators
Figure 2 for Evaluating Language Models as Synthetic Data Generators
Figure 3 for Evaluating Language Models as Synthetic Data Generators
Figure 4 for Evaluating Language Models as Synthetic Data Generators
Viaarxiv icon

MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models

Add code
Oct 23, 2024
Figure 1 for MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models
Figure 2 for MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models
Figure 3 for MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models
Figure 4 for MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models
Viaarxiv icon

The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models

Add code
Jun 09, 2024
Figure 1 for The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models
Figure 2 for The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models
Figure 3 for The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models
Figure 4 for The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models
Viaarxiv icon

Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models

Add code
May 02, 2024
Viaarxiv icon

CLIcK: A Benchmark Dataset of Cultural and Linguistic Intelligence in Korean

Add code
Mar 15, 2024
Viaarxiv icon