FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning
…positional biases and attention sinks cause the model to allocate most of its attention to positionally privileged tokens rather than semantically relevant content. This training-time attention dilution (the starvation of content…