Deep Learning Weekly: Issue 409
Mistral Compute, BountyBench: Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems, a paper on Beyond Text Compression: Evaluating Tokenizers Across Scales, and more!
This week in deep learning, we bring you Stateful Voice Agents, How we built our multi-agent research system from Anthropic, and a paper on Time Blindness: Why Video-Language Models Can't See What Humans Can?
You may also enjoy Mistral Compute, BountyBench: Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems, a paper on Beyond Text Compression: Evaluating Tokenizers Across Scales, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Letta now enables stateful voice agents with persistent memory while maintaining the response times needed for natural conversation.
The Mistral team introduced Mistral Compute – a new AI infrastructure offering that will provide customers with a private, integrated stack.
Unpacking the bias of large language models
MIT researchers discovered the underlying cause of position bias, a phenomenon that causes large language models to overemphasize the beginning or end of a document or conversation.
The MiniMax team released MiniMax – an open-weight model that sets new standards in long-context reasoning, agentic tool use, and compute efficiency.
MLOps & LLMOps
How we built our multi-agent research system \ Anthropic
A detailed blog post from Anthropic where they outline the architecture and engineering challenges of their multi-agent research system.
Building a Data Team that Never Sleeps with Pydantic AI
An article that explains how Definite leverages Pydantic AI to build its agent, achieving high-accuracy data engineering and real-time insights through strategies like model hot-swapping, selective context filtering, and multi-agent introspection.
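For a rough flavor of the model hot-swapping strategy, here is a minimal sketch using Pydantic AI's `Agent` and `run_sync`; the model identifiers, routing rule, and prompt are illustrative assumptions, not Definite's actual setup.

```python
from pydantic_ai import Agent

# Minimal hot-swapping sketch: the same agent definition is instantiated
# against different model backends, so a cheaper model handles routine
# queries and a stronger one handles the hard cases.
# Model names below are illustrative, not Definite's configuration.

def build_agent(model_name: str) -> Agent:
    return Agent(
        model_name,
        system_prompt="You are a data engineering assistant. Answer concisely.",
    )

fast_agent = build_agent("openai:gpt-4o-mini")
strong_agent = build_agent("openai:gpt-4o")

def answer(question: str, hard: bool = False) -> str:
    agent = strong_agent if hard else fast_agent
    result = agent.run_sync(question)
    return str(result.output)  # attribute name may differ across pydantic-ai versions

if __name__ == "__main__":
    print(answer("Which tables changed in the last 24 hours?"))
```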
Learning
Understanding and Coding the KV Cache in LLMs from Scratch
An article that provides an accessible explanation and from-scratch code implementation of the KV cache in LLMs.
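To make the concept concrete, here is a compact single-head sketch in PyTorch (a minimal illustration, not the article's implementation): keys and values for each newly generated token are appended to a cache, so earlier positions are never re-projected or recomputed.

```python
import torch

class CachedSelfAttention(torch.nn.Module):
    """Single-head self-attention with a growing KV cache for decoding."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q = torch.nn.Linear(d_model, d_model, bias=False)
        self.k = torch.nn.Linear(d_model, d_model, bias=False)
        self.v = torch.nn.Linear(d_model, d_model, bias=False)
        self.cache_k = None  # (batch, seq_so_far, d_model)
        self.cache_v = None

    def forward(self, x_new: torch.Tensor) -> torch.Tensor:
        # x_new: (batch, 1, d_model) -- only the newly generated token
        k_new, v_new = self.k(x_new), self.v(x_new)
        if self.cache_k is None:
            self.cache_k, self.cache_v = k_new, v_new
        else:
            # Append the new token's K/V instead of recomputing past positions
            self.cache_k = torch.cat([self.cache_k, k_new], dim=1)
            self.cache_v = torch.cat([self.cache_v, v_new], dim=1)
        q = self.q(x_new)
        scores = q @ self.cache_k.transpose(1, 2) / self.cache_k.shape[-1] ** 0.5
        weights = torch.softmax(scores, dim=-1)
        return weights @ self.cache_v  # (batch, 1, d_model)

attn = CachedSelfAttention(d_model=16)
for _ in range(5):  # simulate 5 decoding steps
    out = attn(torch.randn(1, 1, 16))
print(out.shape)  # torch.Size([1, 1, 16])
```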
BountyBench: Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems
A blog post that introduces BountyBench, a cybersecurity benchmark designed to evaluate offensive and defensive AI cyber-capabilities and quantify their economic impact.
A blog post that defines AI agents as LLMs with environmental feedback loops (e.g., bash, patch tools), demonstrating how they transform programming by enabling iterative code development.
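As a hedged illustration of that feedback loop, the sketch below wires a hypothetical `call_llm` placeholder to bash-style tools (applying a patch, running the test suite) and feeds the test output back into the next prompt; it does not reflect any specific framework's API.

```python
import subprocess

def call_llm(prompt: str) -> str:
    # Hypothetical placeholder: plug in your model client here.
    raise NotImplementedError

def apply_patch(patch: str) -> None:
    # Apply the model's proposed unified diff to the working tree.
    subprocess.run(["git", "apply", "-"], input=patch, text=True, check=True)

def agent_loop(task: str, max_steps: int = 5) -> bool:
    feedback = ""
    for _ in range(max_steps):
        patch = call_llm(
            f"Task: {task}\nPrevious test output:\n{feedback}\nReturn a unified diff."
        )
        apply_patch(patch)
        result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
        if result.returncode == 0:
            return True  # tests pass: task considered done
        feedback = result.stdout + result.stderr  # environmental feedback for next step
    return False
```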
Libraries & Code
An open-source LLM evaluation tool for debugging, evaluating, and monitoring LLM applications, RAG systems, and agentic workflows, with comprehensive tracing, automated evaluations, and production-ready dashboards.
A multi-agent trading framework that mirrors the dynamics of real-world trading firms.
Papers & Publications
Beyond Text Compression: Evaluating Tokenizers Across Scales
Abstract:
Tokenizer design significantly impacts language model performance, yet evaluating tokenizer quality remains challenging. While text compression has emerged as a common intrinsic metric, recent work questions its reliability as a quality indicator. We investigate whether evaluating tokenizers on smaller models (350M parameters) reliably predicts their impact at larger scales (2.7B parameters). Through experiments with established tokenizers from widely-adopted language models, we find that tokenizer choice minimally affects English tasks but yields significant, scale-consistent differences in machine translation performance. Based on these findings, we propose additional intrinsic metrics that correlate more strongly with downstream performance than text compression. We combine these metrics into an evaluation framework that enables more reliable intrinsic tokenizer comparisons.
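As a point of reference for the text-compression metric the authors question, a bytes-per-token measurement can be sketched as below; the tokenizer names and tiny corpus are illustrative, and this is not the paper's evaluation framework.

```python
from transformers import AutoTokenizer

def bytes_per_token(tokenizer_name: str, texts: list[str]) -> float:
    """Compression proxy: UTF-8 bytes of text per produced token (higher = stronger compression)."""
    tok = AutoTokenizer.from_pretrained(tokenizer_name)
    n_bytes = sum(len(t.encode("utf-8")) for t in texts)
    n_tokens = sum(len(tok.encode(t, add_special_tokens=False)) for t in texts)
    return n_bytes / n_tokens

# Illustrative sample corpus and tokenizers, not the paper's setup
corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "Die Tokenisierung beeinflusst die Modellleistung.",
]
for name in ["gpt2", "bert-base-uncased"]:
    print(name, round(bytes_per_token(name, corpus), 2))
```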
Time Blindness: Why Video-Language Models Can't See What Humans Can?
Abstract:
Recent advances in vision-language models (VLMs) have made impressive strides in understanding spatio-temporal relationships in videos. However, when spatial information is obscured, these models struggle to capture purely temporal patterns. We introduce SpookyBench, a benchmark where information is encoded solely in temporal sequences of noise-like frames, mirroring natural phenomena from biological signaling to covert communication. Interestingly, while humans can recognize shapes, text, and patterns in these sequences with over 98% accuracy, state-of-the-art VLMs achieve 0% accuracy. This performance gap highlights a critical limitation: an over-reliance on frame-level spatial features and an inability to extract meaning from temporal cues. Furthermore, when trained on datasets with low spatial signal-to-noise ratios (SNR), models' temporal understanding degrades more rapidly than human perception, especially in tasks requiring fine-grained temporal reasoning. Overcoming this limitation will require novel architectures or training paradigms that decouple spatial dependencies from temporal processing. Our systematic analysis shows that this issue persists across model scales and architectures. We release SpookyBench to catalyze research in temporal pattern recognition and bridge the gap between human and machine video understanding.
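For intuition, here is a toy construction of the kind of stimulus the abstract describes (our own sketch, not the benchmark's actual generation code): every frame is spatially indistinguishable noise, but pixels inside a hidden mask refresh their noise each frame while background pixels stay frozen, so the shape is only recoverable from temporal statistics across frames.

```python
import numpy as np

rng = np.random.default_rng(0)
T, H, W = 64, 32, 32
mask = np.zeros((H, W), dtype=bool)
mask[8:24, 12:20] = True  # hidden rectangle

background = rng.standard_normal((H, W))          # frozen noise outside the mask
frames = np.repeat(background[None], T, axis=0)   # (T, H, W) video of pure noise
frames[:, mask] = rng.standard_normal((T, int(mask.sum())))  # refreshed noise inside

# Per-pixel variance over time exposes the shape; any single frame does not.
recovered = frames.var(axis=0) > 0.5
print("pixels recovered correctly:", float((recovered == mask).mean()))
```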