Jillian Bommarito

The KL3M Data Project: Copyright-Clean Training Resources for Large Language Models

Apr 10, 2025
Via arXiv

Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary

Apr 05, 2025
Via arXiv

KL3M Tokenizers: A Family of Domain-Specific and Character-Level Tokenizers for Legal, Financial, and Preprocessing Applications

Mar 21, 2025
Via arXiv

Towards Best Practices for Open Datasets for LLM Training

Jan 14, 2025
Via arXiv

GPT as Knowledge Worker: A Zero-Shot Evaluation of CPA Capabilities

Jan 11, 2023
Via arXiv