Deep Learning Weekly: Issue 456

Gemini 3.5: frontier intelligence with action, Codex-maxxing, a paper on Lance: Unified Multimodal Modeling by Multi-Task Synergy, and many more!

May 21, 2026

This week in deep learning, we bring you Gemini 3.5: frontier intelligence with action, Codex-maxxing and a paper on Lance: Unified Multimodal Modeling by Multi-Task Synergy.

You may also enjoy Introducing Command A+, SLEIGHT-Bench: Finding Blind Spots in AI Monitors, a paper on CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence, and more!

As always, happy reading and hacking. If you have something you think should be in next week’s issue, find us on Twitter: @dl_weekly.

Until next week!

Industry

Gemini 3.5: frontier intelligence with action

Google launched Gemini 3.5, leading with 3.5 Flash—a model delivering flagship-tier agentic and coding performance at under half the cost.

Introducing Command A+

Cohere released Command A+ as an open-source model: a 218B-parameter MoE with 25B active parameters for enterprise agentic workflows that runs on as few as two H100s or one Blackwell GPU, supports 48 languages, and adds multimodal reasoning.

Introducing Gemini Omni

Google introduced Gemini Omni, a natively multimodal generation model debuting with Omni Flash—it creates and conversationally edits video from any combination of image, audio, video, and text inputs, rolling out across the Gemini app, Flow, and YouTube Shorts.

Introducing Grok Build

xAI launched Grok Build, a coding CLI powered by Grok 4.3 Heavy, featuring a 2M-token context window, 8 parallel subagents, and more.

FLUX Outpainting: Extend any image, in any direction

Black Forest Labs launched FLUX Outpainting, a purpose-built API endpoint that extends images in any direction without prompts.

Introducing Composer 2.5

Cursor released Composer 2.5, a coding model (built on Moonshot’s Kimi K2.5) with gains on long-horizon agentic tasks.

MLOps/LLMOps/AgentOps

LLM Cost Tracking Solution: How to Monitor and Control AI Spend in Agentic Systems

A guide on treating LLM cost as an observability problem in agentic systems, using span/trace/project-level tracing to pinpoint token-burning prompts and routing.
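To make the span/trace/project idea concrete, here is a minimal sketch of rolling per-span token usage up into trace-level and project-level cost. It is not taken from the guide: the `Span` fields, the model names, and the per-1K-token prices are all hypothetical placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical (input, output) prices in USD per 1K tokens.
# Real prices differ by provider and model; these are placeholders.
PRICE_PER_1K = {
    "small-model": (0.0005, 0.0015),
    "large-model": (0.005, 0.015),
}


@dataclass
class Span:
    """One LLM call inside a trace (field names are assumptions)."""
    project: str
    trace_id: str
    name: str
    model: str
    input_tokens: int
    output_tokens: int


def span_cost(span: Span) -> float:
    """Cost of a single span from its token counts and model price."""
    in_price, out_price = PRICE_PER_1K[span.model]
    return (span.input_tokens / 1000) * in_price + (span.output_tokens / 1000) * out_price


def rollup(spans):
    """Sum span costs per trace and per project."""
    by_trace = defaultdict(float)
    by_project = defaultdict(float)
    for span in spans:
        cost = span_cost(span)
        by_trace[span.trace_id] += cost
        by_project[span.project] += cost
    return dict(by_trace), dict(by_project)


def most_expensive_span(spans):
    """Return the span that burned the most money."""
    return max(spans, key=span_cost)
```

With spans logged this way, `most_expensive_span` points at the single prompt or routing decision worth optimizing first, while `rollup` gives the totals a budget alert would watch.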

Context is all you need: Introducing Redis Iris

Redis launched Redis Iris, a context engine sitting between agents and enterprise data—bundling five tools (two new: Context Retriever and Agent Memory) to deliver navigable, fresh, low-latency context with semantic caching that cuts token costs up to 90%.

Learning

Text Analysis for Hybrid Search: Tokenization, Stopwords & Accent Folding

A technical guide on how Weaviate v1.37 makes BM25 tokenization observable and per-property configurable—covering accent folding, per-language stopwords, and more.
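For readers new to these terms, the following is a rough standalone sketch of accent folding and per-language stopword removal using only the Python standard library. It is not Weaviate's implementation or API: the function names, the whitespace tokenizer, and the tiny stopword lists are illustrative assumptions.

```python
import unicodedata

# Tiny illustrative stopword lists; real lists are much longer.
STOPWORDS = {
    "en": {"the", "a", "of"},
    "fr": {"le", "la", "de"},
}


def fold_accents(text: str) -> str:
    """Strip diacritics so that 'café' and 'cafe' produce the same token."""
    decomposed = unicodedata.normalize("NFKD", text)
    # Combining marks (accents) have a non-zero combining class; drop them.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text: str, language: str = "en") -> list[str]:
    """Fold accents, lowercase, split on whitespace, drop stopwords."""
    tokens = fold_accents(text).lower().split()
    stop = STOPWORDS.get(language, set())
    return [tok for tok in tokens if tok not in stop]
```

For example, `analyze("Le café de la gare", "fr")` keeps only `cafe` and `gare`, so a query typed without the accent still matches the indexed text.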

Running PyTorch Models on Apple Silicon GPUs with the ExecuTorch MLX Delegate

PyTorch released the experimental ExecuTorch MLX delegate, a backend that runs PyTorch models on Apple Silicon GPUs via Apple’s MLX framework.

Codex-maxxing

A power user’s playbook for extracting more value from Codex—using durable threads, file-based memory, verifiable goals, and self-scheduling loops to turn it into a workspace where long-running knowledge work keeps progressing between sessions.

SLEIGHT-Bench: Finding Blind Spots in AI Monitors

Anthropic researchers released SLEIGHT-Bench, a benchmark of 40 synthetic attacks across 11 categories that exploit “blind spots” in frontier AI monitors—on the Opus 4.6 monitor, 50% of attacks evaded all 10 trials and only 8 of 40 were reliably caught.

Libraries & Code

comet-ml/opik

An open-source AI observability tool used to debug, evaluate, and monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.

generalaction/emdash

Emdash is the Open-Source Agentic Development Environment. Run multiple coding agents in parallel. Use any provider.

Papers & Publications

Lance: Unified Multimodal Modeling by Multi-Task Synergy

Abstract:

We present Lance, a lightweight native unified model supporting multimodal understanding, generation, and editing for both images and videos. Rather than relying on model capacity scaling or text-image-dominant designs, Lance explores a practical paradigm for unified multimodal modeling via collaborative multi-task training. It is grounded in two core principles: unified context modeling and decoupled capability pathways. Specifically, Lance is trained from scratch and employs a dual-stream mixture-of-experts architecture on shared interleaved multimodal sequences, enabling joint context learning while decoupling the pathways for understanding and generation. We further introduce modality-aware rotary positional encoding to mitigate interference among heterogeneous visual tokens and boost cross-task alignment. During training, Lance adopts a staged multi-task training paradigm with capability-oriented objectives and adaptive data scheduling to strengthen both semantic comprehension and visual generation performance. Experimental results demonstrate that Lance substantially outperforms existing open-source unified models in image and video generation, while retaining strong multimodal understanding capabilities.

CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

Abstract:

Multimodal Large Language Models (MLLMs) have significantly advanced document understanding, yet current Doc-VQA evaluations score only the final answer and leave the supporting evidence unchecked. This answer-only approach masks a critical failure mode: a model can land on the correct answer while grounding it in the wrong passage -- a critical risk in high-stakes domains like law, finance, and medicine, where every conclusion must be traceable to a specific source region. To address this, we introduce CiteVQA, a benchmark that requires models to return element-level bounding-box citations alongside each answer, evaluating both jointly. CiteVQA comprises 1,897 questions across 711 PDFs spanning seven domains and two languages, averaging 40.6 pages per document. To ensure fidelity and scalability, the ground-truth citations are generated by an automated pipeline, which identifies crucial evidence via masking ablation, and are subsequently validated through expert review. At the core of our evaluation is Strict Attributed Accuracy (SAA), which credits a prediction only when the answer and the cited region are both correct. Auditing 20 MLLMs reveals a pervasive Attribution Hallucination: models frequently produce the right answer while citing the wrong region. The strongest system (Gemini-3.1-Pro-Preview) achieves an SAA of only 76.0, and the strongest open-source MLLM reaches just 22.5. Ultimately, towards trustworthy document intelligence, CiteVQA exposes a reliability gap that answer-only evaluations overlook, providing the instrumentation needed to close it.
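The abstract defines Strict Attributed Accuracy only as crediting a prediction when both the answer and the cited region are correct. One way to read that is sketched below. The case-insensitive answer match, the IoU threshold of 0.5 for judging a region correct, and the dictionary keys are all assumptions made for illustration, not the paper's actual scoring rules.

```python
def iou(box_a, box_b) -> float:
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def strict_attributed_accuracy(examples, iou_threshold: float = 0.5) -> float:
    """Percentage of examples where the answer AND the cited box are both right.

    Assumed here: answers match case-insensitively, and a cited box counts
    as correct when its IoU with the gold box meets the threshold.
    """
    if not examples:
        return 0.0
    credited = 0
    for ex in examples:
        answer_ok = ex["pred_answer"].strip().lower() == ex["gold_answer"].strip().lower()
        region_ok = iou(ex["pred_box"], ex["gold_box"]) >= iou_threshold
        if answer_ok and region_ok:
            credited += 1
    return 100.0 * credited / len(examples)
```

Under this reading, a model that answers correctly but cites the wrong region earns nothing for that question, which is why SAA can sit well below answer-only accuracy.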

A guest post by

Miko Planas
