Deep Learning Weekly: Issue 449
Gemini 3.1 Flash Live, Cohere Transcribe: state-of-the-art speech recognition, a paper on IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse, and many more!
This week in deep learning, we bring you Gemini 3.1 Flash Live, Cohere Transcribe: state-of-the-art speech recognition, and a paper on IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse.
You may also enjoy Mistral AI’s Voxtral, How Kimi, Cursor, and Chroma Train Agentic Models with RL, a paper on Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis, and more!
As always, happy reading and hacking. If you have something you think should be in next week’s issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Gemini 3.1 Flash Live: Making audio AI more natural and reliable
Google launches Gemini 3.1 Flash Live, its highest-quality real-time audio model, scoring 90.8% on ComplexFuncBench Audio and 36.1% on AudioMultiChallenge.
Mistral launches Voxtral TTS, a 4B-parameter multilingual text-to-speech model supporting 9 languages with 70ms latency, voice cloning from 3-second samples, and more.
Meta updates SAM 3 to SAM 3.1, adding object multiplexing to double video processing speed to 32 FPS on a single H100 for its open-source text-prompted segmentation and tracking model.
Cohere Transcribe: state-of-the-art speech recognition
Cohere launches Transcribe, a 2B-parameter open-source ASR model that tops the HuggingFace Open ASR Leaderboard with a 5.42% average word error rate across 14 languages.
Granola raises $125M at $1.5B valuation for its AI note-taking app
Granola raises $125M Series C at a $1.5B valuation led by Index Ventures, following a quarter of 250% revenue growth, with plans to expand its AI meeting notes app toward agentic task automation.
MLOps/LLMOps
Deploying Disaggregated LLM Inference Workloads on Kubernetes
A technical guide to deploying disaggregated LLM inference (separate prefill, decode, and router services) on Kubernetes using NVIDIA Grove, KAI Scheduler, and NVIDIA Dynamo.
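The core idea behind disaggregation can be sketched in a few lines: compute-bound prefill and memory-bandwidth-bound decode run on separate worker pools, with a router handing sequences off between them. The class and method names below are illustrative only, not part of the NVIDIA Grove, KAI Scheduler, or Dynamo APIs.

```python
# Minimal sketch of disaggregated LLM serving: a router sends new
# requests to a prefill pool, then hands completed prefills (with their
# KV cache) off to a separate decode pool.
from collections import deque

class Router:
    def __init__(self, n_prefill: int, n_decode: int):
        self.prefill = [deque() for _ in range(n_prefill)]
        self.decode = [deque() for _ in range(n_decode)]

    def _least_loaded(self, pool):
        return min(pool, key=len)

    def submit(self, request_id: str, prompt_tokens: int):
        # New requests are compute-bound (prefill): route to prefill pool.
        self._least_loaded(self.prefill).append((request_id, prompt_tokens))

    def handoff(self, request_id: str, kv_cache_ref: str):
        # After prefill, the KV cache is transferred and token-by-token
        # decoding continues on a decode worker.
        self._least_loaded(self.decode).append((request_id, kv_cache_ref))

router = Router(n_prefill=2, n_decode=4)
router.submit("req-1", prompt_tokens=512)
router.handoff("req-1", kv_cache_ref="kv://req-1")
```

Scaling the two pools independently is the main payoff: prefill capacity tracks incoming request rate while decode capacity tracks the number of in-flight sequences.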
Learning
Best Embedding Model for RAG 2026: 10 Models Compared
A practical benchmarking guide comparing 10 embedding models across four production-critical RAG dimensions — cross-modal, cross-lingual, long-document retrieval, and MRL compression.
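The MRL (Matryoshka Representation Learning) dimension above has a simple mechanical core: an MRL-trained embedding can be truncated to a prefix of its dimensions and renormalized, trading retrieval accuracy for index size. A minimal sketch, using a random vector as a stand-in for a real model output:

```python
# Hedged sketch of MRL-style compression: keep a prefix of the embedding
# and L2-renormalize it. Only meaningful for models trained with MRL,
# where early dimensions carry most of the signal.
import numpy as np

def mrl_truncate(embedding: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` dimensions and L2-renormalize."""
    prefix = embedding[:dim]
    return prefix / np.linalg.norm(prefix)

rng = np.random.default_rng(0)
full = rng.normal(size=1024)
full /= np.linalg.norm(full)

small = mrl_truncate(full, 256)  # 4x smaller index footprint
assert small.shape == (256,)
assert np.isclose(np.linalg.norm(small), 1.0)
```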
How Kimi, Cursor, and Chroma Train Agentic Models with RL
A technical synthesis of three recent agentic RL training reports — Kimi K2.5, Cursor Composer 2, and Chroma Context-1 — distilling shared patterns around production-environment training, context management, and reward design.
Multimodal Embeddings and RAG: A Practical Guide
A practical guide to multimodal embeddings and RAG covering the core theory (contrastive learning, modality gap, MRL), three concrete build patterns (audio, PDF, video), and when multimodal actually outperforms text-only pipelines.
Five techniques to reach the efficient frontier of LLM inference
A practical guide to LLM inference optimization framed around the “efficient frontier” concept — five techniques that move production systems toward the latency/throughput Pareto boundary without additional hardware spend.
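The "efficient frontier" framing is just Pareto optimality over measured serving configs: keep a configuration only if no other one is both lower-latency and higher-throughput. A small sketch with made-up numbers:

```python
# Given measured (latency, throughput) points for candidate serving
# configs, keep only the Pareto-optimal ones. Config names and numbers
# are illustrative, not from the article.
def pareto_frontier(points):
    """points: list of (name, latency_ms, throughput_tok_s)."""
    frontier = []
    for name, lat, tput in points:
        dominated = any(
            (l2 <= lat and t2 >= tput) and (l2 < lat or t2 > tput)
            for _, l2, t2 in points
        )
        if not dominated:
            frontier.append(name)
    return frontier

configs = [
    ("batch=1", 40, 500),
    ("batch=8", 70, 2400),
    ("batch=32", 160, 5100),
    ("batch=16-nocache", 150, 2300),  # slower AND lower-throughput than batch=8
]
frontier = pareto_frontier(configs)  # the dominated config drops out
```

Optimization techniques then either move a config along the frontier (batching trades latency for throughput) or shift the whole frontier outward (e.g. KV-cache reuse, quantization).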
Libraries & Code
An open-source LLM evaluation tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
Plano is an AI-native proxy and data plane for agentic apps — with built-in orchestration, safety, observability, and smart LLM routing so you stay focused on your agent's core logic.
Papers & Publications
Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis
Abstract:
Unified multimodal models provide a natural and promising architecture for understanding diverse and complex real-world knowledge while generating high-quality images. However, they still rely primarily on frozen parametric knowledge, which makes them struggle with real-world image generation involving long-tail and knowledge-intensive concepts. Inspired by the broad success of agents on real-world tasks, we explore agentic modeling to address this limitation. Specifically, we present Unify-Agent, a unified multimodal agent for world-grounded image synthesis, which reframes image generation as an agentic pipeline consisting of prompt understanding, multimodal evidence searching, grounded recaptioning, and final synthesis. To train our model, we construct a tailored multimodal data pipeline and curate 143K high-quality agent trajectories for world-grounded image synthesis, enabling effective supervision over the full agentic generation process. We further introduce FactIP, a benchmark covering 12 categories of culturally significant and long-tail factual concepts that explicitly requires external knowledge grounding. Extensive experiments show that our proposed Unify-Agent substantially improves over its base unified model across diverse benchmarks and real-world generation tasks, while approaching the world knowledge capabilities of the strongest closed-source models. As an early exploration of agent-based modeling for world-grounded image synthesis, our work highlights the value of tightly coupling reasoning, searching, and generation for reliable open-world agentic image synthesis.
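The four-stage pipeline the abstract describes (prompt understanding, evidence searching, grounded recaptioning, synthesis) can be sketched as a plain function composition. Everything below is a hypothetical placeholder, not the authors' implementation:

```python
# Skeleton of a world-grounded synthesis pipeline in the shape the
# Unify-Agent abstract describes. All stage functions are stand-ins
# supplied by the caller; none are real Unify-Agent APIs.
def world_grounded_synthesis(prompt, understand, search, recaption, synthesize):
    concepts = understand(prompt)                   # find concepts needing grounding
    evidence = [search(c) for c in concepts]        # fetch external evidence
    grounded = recaption(prompt, evidence)          # rewrite prompt with facts
    return synthesize(grounded, evidence)           # generate the final image

# Toy stand-ins to show the data flow:
image = world_grounded_synthesis(
    "a photo of the national bird of New Zealand",
    understand=lambda p: ["national bird of New Zealand"],
    search=lambda c: {"concept": c, "caption": "kiwi, a flightless bird"},
    recaption=lambda p, ev: p + " (grounded: kiwi)",
    synthesize=lambda p, ev: f"<image for: {p}>",
)
```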
IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse
Abstract:
Long-context agentic workflows have emerged as a defining use case for large language models, making attention efficiency critical for both inference speed and serving cost. Sparse attention addresses this challenge effectively, and DeepSeek Sparse Attention (DSA) is a representative production-grade solution: a lightweight lightning indexer selects the top-k most relevant tokens per query, reducing core attention from O(L²) to O(Lk). However, the indexer itself retains O(L²) complexity and must run independently at every layer, despite the fact that the resulting top-k selections are highly similar across consecutive layers. We present IndexCache, which exploits this cross-layer redundancy by partitioning layers into a small set of Full layers that run their own indexers and a majority of Shared layers that simply reuse the nearest Full layer's top-k indices. We propose two complementary approaches to determine and optimize this configuration. Training-free IndexCache applies a greedy search algorithm that selects which layers retain indexers by directly minimizing language modeling loss on a calibration set, requiring no weight updates. Training-aware IndexCache introduces a multi-layer distillation loss that trains each retained indexer against the averaged attention distributions of all layers it serves, enabling even simple interleaved patterns to match full-indexer accuracy. Experimental results on a 30B DSA model show that IndexCache can remove 75% of indexer computations with negligible quality degradation, achieving up to 1.82× prefill speedup and 1.48× decode speedup compared to standard DSA. These positive results are further confirmed by our preliminary experiments on the production-scale GLM-5 model.
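The Full/Shared layer split can be illustrated in a few lines: only designated Full layers run their indexer and refresh the cached top-k indices, while every other layer reuses the most recent cache. This is a sketch of the mechanism as described in the abstract, not the authors' code, and it reuses the nearest *preceding* Full layer's indices as one simple policy:

```python
# Illustrative sketch of IndexCache-style cross-layer index reuse:
# "Full" layers compute top-k token indices; "Shared" layers reuse them.
import numpy as np

def topk_indices(scores: np.ndarray, k: int) -> np.ndarray:
    return np.argsort(scores)[-k:]  # indices of the k highest indexer scores

def sparse_attention_indices(layer_scores, full_layers, k):
    """layer_scores: per-layer indexer scores over L tokens."""
    cache, out = None, []
    for layer, scores in enumerate(layer_scores):
        if layer in full_layers:          # run this layer's own indexer
            cache = topk_indices(scores, k)
        out.append(cache)                 # Shared layers reuse the cache
    return out

rng = np.random.default_rng(0)
scores = [rng.normal(size=100) for _ in range(8)]
idx = sparse_attention_indices(scores, full_layers={0, 4}, k=16)
# Layers 1-3 reuse layer 0's indices; layers 5-7 reuse layer 4's,
# so only 2 of 8 indexer passes actually run (75% removed).
```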