Deep Learning Weekly: Issue 401
Generate videos in Gemini and Whisk with Veo 2, Speeding up Transformers by 80% by Removing Self Attention, a paper on Packing Input Frame Context in Next-Frame Prediction Models for Video Generation, and more.
This week in deep learning, we bring you Generate videos in Gemini and Whisk with Veo 2, Unlocking Gen AI at the Edge: Speeding up Transformers by 80% by Removing Self Attention, and a paper on Packing Input Frame Context in Next-Frame Prediction Models for Video Generation.
You may also enjoy Training LLMs to self-detoxify their language, Allie: A Human-Aligned Chess Bot, a paper on ReTool: Reinforcement Learning for Strategic Tool Use in LLMs, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Generate videos in Gemini and Whisk with Veo 2
Gemini Advanced users can now generate and share videos using Google’s state-of-the-art video model, Veo 2.
Training LLMs to self-detoxify their language
A new method from the MIT-IBM Watson AI Lab helps large language models steer their own responses toward safer, more ethical, value-aligned outputs.
Nari Labs introduced Dia, a 1.6 billion parameter text-to-speech (TTS) model designed to produce naturalistic dialogue directly from text prompts.
Manychat taps $140M to boost its business messaging platform with AI
Manychat, which provides a tool for managing and automating conversations and engagement across multiple messaging channels, has picked up $140 million in a Series B round led by Summit Partners.
Making AI-generated code more accurate in any language
A new approach developed by researchers at MIT and elsewhere automatically guides an LLM to generate text that adheres to the rules of the relevant language.
MLOps & LLMOps
Optimizing Mixtral 8x7B on Amazon SageMaker with AWS Inferentia2
A post that demonstrates how to deploy and serve the Mixtral 8x7B language model on AWS Inferentia2 instances for cost-effective, high-performance inference.
OpenAI Codex CLI, how does it work?
A technical breakdown explaining the core components and workflow of the OpenAI Codex CLI, detailing its agent loop, tool execution, and how it manages context and prompts.
Learning
Unlocking Gen AI at the Edge: Speeding up Transformers by 80% by Removing Self Attention
A deep dive into FNet, FFT-based mixing, and why the future of AI might belong to fixed-structure models that don’t even try to learn what they can encode.
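FNet's central move is to replace each learned self-attention sublayer with a parameter-free 2D Fourier transform over the token and hidden dimensions, keeping only the real part. A minimal numpy sketch of that mixing step (toy shapes, no surrounding feed-forward layers):

```python
import numpy as np

def fnet_mixing(x):
    """FNet-style token mixing: a fixed 2D FFT instead of self-attention.

    x: (seq_len, hidden) array of token embeddings.
    Applies an FFT along the hidden dimension, then along the sequence
    dimension, and keeps only the real part -- a parameter-free mixing
    step with no learned attention weights to train or store.
    """
    return np.real(np.fft.fft(np.fft.fft(x, axis=-1), axis=0))

# Mix a toy sequence of 4 tokens with hidden size 8.
x = np.random.randn(4, 8)
mixed = fnet_mixing(x)
assert mixed.shape == x.shape  # mixing preserves the layout
```

Because the FFT is fixed rather than learned, the sublayer costs O(n log n) in sequence length and adds zero parameters, which is where the edge-deployment speedups come from.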
Allie: A Human-Aligned Chess Bot
A research blog post introducing Allie, a human-aligned chess bot designed to play in a skill-calibrated manner by training a transformer and using adaptive Monte-Carlo Tree Search.
The State of Reinforcement Learning for LLM Reasoning
A comprehensive overview discussing the state of reinforcement learning for improving LLM reasoning, explaining methods like GRPO and summarizing recent research insights on topics like verifiable rewards and emergent abilities.
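One detail worth pinning down from that overview: GRPO replaces the learned value-function baseline of PPO with a group-relative one. A minimal sketch of the advantage computation, assuming a group of sampled completions per prompt with scalar (e.g., verifiable) rewards:

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages as used in GRPO (minimal sketch).

    GRPO samples a group of completions for one prompt and normalizes
    each completion's reward by the group's mean and standard deviation,
    so no separate value network is needed as a baseline.
    """
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    return [(r - mu) / sigma for r in rewards]

# Four sampled completions for one prompt, scored 1.0 (correct) or 0.0.
advantages = grpo_advantages([1.0, 0.0, 1.0, 0.0])  # -> [1.0, -1.0, 1.0, -1.0]
```

Completions above the group mean get positive advantage and are reinforced; those below are pushed down, which is what makes sparse verifiable rewards workable.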
Libraries & Code
An open-source LLM evaluation tool used to debug, evaluate, monitor LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.
A comprehensive benchmark for long-context language models covering seven diverse categories of tasks.
Papers & Publications
Packing Input Frame Context in Next-Frame Prediction Models for Video Generation
Abstract:
We present a neural network structure, FramePack, to train next-frame (or next-frame-section) prediction models for video generation. FramePack compresses input frames to make the transformer context length a fixed number regardless of the video length. As a result, we are able to process a large number of frames using video diffusion with a computation bottleneck similar to image diffusion. This also allows significantly higher training video batch sizes (batch sizes become comparable to image diffusion training). We also propose an anti-drifting sampling method that generates frames in inverted temporal order with early-established endpoints to avoid exposure bias (error accumulation over iterations). Finally, we show that existing video diffusion models can be finetuned with FramePack, and their visual quality may be improved because the next-frame prediction supports more balanced diffusion schedulers with less extreme flow shift timesteps.
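The bounded-context claim follows from compressing frames more aggressively the further they sit in the past, so the total token count is a geometric series. An illustrative sketch of that packing idea, assuming a simple average-pooling compressor and a halving budget per step (the paper's actual compression kernels differ):

```python
import numpy as np

def pack_frames(frames, base_tokens=256):
    """Illustrative FramePack-style packing (not the paper's exact kernels).

    frames: list of (tokens, hidden) arrays, ordered oldest -> newest.
    The newest frame keeps its full token budget; each step further into
    the past is pooled twice as aggressively, so total context length is
    bounded by ~2 * base_tokens regardless of how many frames come in.
    """
    packed = []
    for age, frame in enumerate(reversed(frames)):  # newest frame first
        keep = max(1, base_tokens // (2 ** age))    # halve the budget per step
        # Average-pool groups of tokens down to `keep` tokens (assumed compressor).
        groups = np.array_split(frame, keep, axis=0)
        packed.append(np.stack([g.mean(axis=0) for g in groups]))
    return np.concatenate(packed, axis=0)

frames = [np.random.randn(256, 16) for _ in range(6)]
ctx = pack_frames(frames)
# 256 + 128 + 64 + 32 + 16 + 8 = 504 tokens, always under 2 * 256.
assert ctx.shape[0] < 2 * 256
</imports>

This fixed bound is what lets video diffusion run with an image-diffusion-sized compute bottleneck, however long the clip.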
ReTool: Reinforcement Learning for Strategic Tool Use in LLMs
Abstract:
While reasoning models (e.g., DeepSeek R1) trained with reinforcement learning (RL) excel in textual reasoning, they struggle in scenarios requiring structured problem-solving, such as geometric reasoning, precise computation, or complex equation solving, areas where computational tools like code interpreters (CI) demonstrate distinct advantages. To bridge this gap, we propose ReTool, which enhances long-form reasoning with tool-integrated learning, including two key features: (1) dynamic interleaving of real-time code execution within natural language reasoning processes, and (2) an automated RL paradigm that allows policy rollouts with multi-turn real-time code execution and teaches the model when and how to invoke tools based on outcome feedback. ReTool employs a systematic training framework, beginning with synthetic cold-start data generation to produce code-augmented long-form reasoning traces for fine-tuning base models. Subsequent RL training leverages task outcomes as rewards to iteratively refine the model's tool use strategy, enabling autonomous discovery of optimal tool invocation patterns without human priors. Experiments on the challenging MATH Olympiad benchmark AIME demonstrate ReTool's superiority: our 32B model achieves 67% accuracy with 400 training steps, outperforming the text-based RL baseline (40% accuracy, 1080 steps) in efficiency and performance. Remarkably, ReTool-32B attains 72.5% accuracy in extended settings, surpassing OpenAI's o1-preview by 27.9%. Further analysis reveals emergent behaviors such as code self-correction, signaling an "aha moment" in which the model autonomously masters adaptive tool use. These findings highlight the promise of outcome-driven tool integration for advancing complex mathematical reasoning and offer new insights into hybrid neuro-symbolic systems.
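The "dynamic interleaving" in feature (1) amounts to a rollout loop: the model generates until it emits a code block, the interpreter runs it, and the result is appended to the context before generation resumes. A toy sketch of that loop (the `<code>`/`<output>` tag names and the stub model are illustrative, not ReTool's actual format):

```python
import re

def rollout(model, prompt, max_turns=8):
    """Sketch of ReTool-style interleaved reasoning + code execution.

    `model` is any callable text -> text. Whenever a turn contains a
    <code>...</code> block, we execute it and append the interpreter
    output before asking the model to continue; a turn without a code
    block ends the rollout.
    """
    transcript = prompt
    for _ in range(max_turns):
        step = model(transcript)
        transcript += step
        match = re.search(r"<code>(.*?)</code>", step, re.DOTALL)
        if not match:
            return transcript  # no tool call: reasoning finished
        scope = {}
        exec(match.group(1), scope)  # real systems sandbox this step
        transcript += f"\n<output>{scope.get('result')}</output>\n"
    return transcript

# Toy model: first requests a computation, then reads the interpreter output.
def toy_model(ctx):
    if "<output>" not in ctx:
        return "Let me compute it. <code>result = 6 * 7</code>"
    return "The answer is 42."

trace = rollout(toy_model, "What is 6 * 7? ")
```

During RL training, entire traces like this are scored by the final task outcome, so the policy learns when a tool call pays off rather than being told.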
Byte Latent Transformer: Patches Scale Better Than Tokens
Abstract:
We introduce the Byte Latent Transformer (BLT), a new byte-level LLM architecture that, for the first time, matches tokenization-based LLM performance at scale with significant improvements in inference efficiency and robustness. BLT encodes bytes into dynamically sized patches, which serve as the primary units of computation. Patches are segmented based on the entropy of the next byte, allocating more compute and model capacity where increased data complexity demands it. We present the first FLOP controlled scaling study of byte-level models up to 8B parameters and 4T training bytes. Our results demonstrate the feasibility of scaling models trained on raw bytes without a fixed vocabulary. Both training and inference efficiency improve due to dynamically selecting long patches when data is predictable, along with qualitative improvements on reasoning and long tail generalization. Overall, for fixed inference costs, BLT shows significantly better scaling than tokenization-based models, by simultaneously growing both patch and model size.
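The entropy-based segmentation can be sketched in a few lines: given per-byte next-byte entropies (from a small byte LM in the paper), a new patch starts wherever entropy crosses a threshold. The threshold value and the list-of-pairs output format below are illustrative assumptions:

```python
def segment_patches(byte_entropies, threshold=2.0):
    """Sketch of BLT-style entropy patching (threshold is illustrative).

    A new patch starts at each byte whose predicted next-byte entropy
    exceeds the threshold, so hard-to-predict regions get short patches
    (more compute per byte) and predictable regions get long ones.
    """
    boundaries = [0]
    for i, h in enumerate(byte_entropies):
        if i > 0 and h > threshold:
            boundaries.append(i)
    # Return patches as (start, end) byte-index pairs.
    return list(zip(boundaries, boundaries[1:] + [len(byte_entropies)]))

# Toy entropies: the spike at position 4 splits the bytes into two patches.
patches = segment_patches([0.5, 0.4, 0.3, 0.2, 3.1, 0.6, 0.5, 0.4])
assert patches == [(0, 4), (4, 8)]
```

Since predictable runs collapse into long patches, the latent transformer takes fewer, larger steps on easy text, which is the source of the inference-efficiency gains the abstract describes.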