Kaiyue Wen

Symmetrical Visual Contrastive Optimization: Aligning Vision-Language Models with Minimal Contrastive Images

Feb 19, 2025

Task Generalization With AutoRegressive Compositional Structure: Can Learning From $D$ Tasks Generalize to $D^{T}$ Tasks?

Feb 13, 2025

Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models

Jan 21, 2025

Understanding Warmup-Stable-Decay Learning Rates: A River Valley Loss Landscape Perspective

Oct 07, 2024

From Sparse Dependence to Sparse Attention: Unveiling How Chain-of-Thought Enhances Transformer Sample Efficiency

Oct 07, 2024

RNNs are not Transformers: The Key Bottleneck on In-context Retrieval

Feb 29, 2024

Transformers are uninterpretable with myopic methods: a case study with bounded Dyck grammars

Dec 03, 2023

Sharpness Minimization Algorithms Do Not Only Minimize Sharpness To Achieve Better Generalization

Jul 23, 2023

Practically Solving LPN in High Noise Regimes Faster Using Neural Networks

Mar 14, 2023

Finding Skill Neurons in Pre-trained Transformer-based Language Models

Nov 14, 2022