Hyunsoo Ha

LP Data Pipeline: Lightweight, Purpose-driven Data Pipeline for Large Language Models

Nov 18, 2024
Via arXiv

1 Trillion Token (1TT) Platform: A Novel Framework for Efficient Data Sharing and Compensation in Large Language Models

Sep 30, 2024
Via arXiv

Rethinking KenLM: Good and Bad Model Ensembles for Efficient Text Quality Filtering in Large Web Corpora

Sep 15, 2024
Via arXiv