Paper page - Dynamic Latent Routing
…DLR searches for useful codes, trains the model to reuse them, and lets codes compose into longer thoughts. Across low-data fine-tuning settings, DLR matches or outperforms SFT, with learned codes…
…arxiv.org/abs/2605.09877…
…We present StateSMix, a fully self-contained lossless compressor that couples an online-trained Mamba-style State Space Model (SSM) with sparse n-gram context mixing and arithmetic coding…
…In heterogeneous training systems, the total importance ratio should ideally be decomposed into two semantically distinct factors: a training-inference discrepancy term that aligns inference-side and training-side distributions at the…
…writing! I tried this training on a subset of the data (AllNLI, GooAQ, MS MARCO, PAQ, S2ORC) with batch size 16384. It took 5 hours. W&B: https://api.wandb.ai/links/arunarumugam411-sui/dkcwm6gs…
…Self-improving language models construct environments for training rather than generating data, utilizing stable solve-verify asymmetry to maintain informative rewards during learning. We pursue a vision for self…
…training approach called D-OPSD enables efficient supervised fine-tuning for diffusion models by leveraging on-policy self-distillation with text and multimodal features while preserving few-step inference capabilities.
…However, existing work remains limited on both evaluation and training: benchmarks such as BRIGHT provide narrow gold sets and evaluate retrievers in isolation, while synthetic training corpora often optimize single-passage relevance…
…It represents memory across four orthogonal relational graphs — Semantic, Temporal, Causal, and Entity — and introduces a co-evolutionary training framework that jointly optimizes trainable edge features and a query-conditioned QueryRouter MLP…
…TS-DFM achieves the best perplexity of any discrete-generation baseline we compare against, including methods trained on 6x more data or using 5x larger models.