Deep Learning Weekly: Issue 400
OpenAI launches o3 and o4-mini, An Overview of Late Interaction Retrieval Models, a paper on Less-to-More Generalization: Unlocking More Controllability by In-Context Generation, and many more!
This week in deep learning, we bring you OpenAI launches o3 and o4-mini, An Overview of Late Interaction Retrieval Models: ColBERT, ColPali, and ColQwen, and a paper on Less-to-More Generalization: Unlocking More Controllability by In-Context Generation.
You may also enjoy Claude Research and Google Workspace Integration, Circuit Tracing: Revealing Computational Graphs in Language Models, a paper on PixelFlow: Pixel-Space Generative Models with Flow, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
OpenAI launches o3 and o4-mini, AI models that 'think with images' and use tools autonomously
OpenAI launched o3 and o4-mini, its most advanced reasoning models yet, capable of integrating images, searching the web, running code, analyzing files, and generating images within a unified workflow.
Could LLMs help design our next medicines and materials?
Researchers developed a multimodal tool that combines a large language model with graph-based AI models to efficiently find new molecules with desired properties.
Captions launched Mirage Edit — an AI tool that lets anyone go from a text prompt to a fully-edited talking video, featuring actors that don’t exist.
Claude takes research to new places
Anthropic introduced two new capabilities that make Claude a more informed and capable collaborator — Research and a Google Workspace integration.
Introducing GPT-4.1 in the API
OpenAI launched a new series of GPT models featuring major improvements on coding, instruction following, and long context—plus a nano model.
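For those who want to kick the tires, a minimal call via the official Python SDK might look like this (the model IDs are from the announcement; the prompt is just a placeholder):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# "gpt-4.1" is the flagship; "gpt-4.1-mini" and "gpt-4.1-nano"
# trade capability for lower cost and latency.
response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "Refactor this function: ..."}],
)
print(response.choices[0].message.content)
```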
Introducing Embed 4: Multimodal search for business
The Cohere team released Embed 4: their latest state-of-the-art multimodal embedding model that enables enterprises to add frontier search and retrieval capabilities to AI applications.
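A minimal sketch with the Cohere Python SDK; note that the `embed-v4.0` model ID below is our assumption, so check Cohere's docs for the exact identifier:

```python
import cohere

co = cohere.ClientV2()  # reads the API key from the CO_API_KEY environment variable

response = co.embed(
    model="embed-v4.0",                   # assumed ID for Embed 4
    texts=["Q3 revenue summary for the retail division"],
    input_type="search_document",         # use "search_query" at query time
    embedding_types=["float"],
)
```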
MLOps & LLMOps
A whitepaper about AI agents explores how generative models can be trained to use external tools and execute tasks independently, similar to how humans supplement their pattern recognition abilities with resources like books and calculators.
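The core loop the whitepaper describes (the model proposes a tool call, the runtime executes it, and the result goes back into context) can be sketched in a few lines; the `call_model` function and the tool set here are hypothetical stand-ins, not from the whitepaper:

```python
# Minimal agent loop: the model either answers or requests a tool call.
TOOLS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),  # demo only
}

def run_agent(task, call_model, max_steps=5):
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_model(history)  # returns e.g. {"tool": "calculator", "input": "2+2"}
        if reply.get("tool"):
            result = TOOLS[reply["tool"]](reply["input"])
            history.append({"role": "tool", "content": result})
        else:
            return reply["content"]  # final answer
    return None
```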
Efficient Federated Learning in the Era of LLMs with Message Quantization and Streaming
A blog post demonstrating how message quantization and streaming can be integrated into federated learning frameworks to improve efficiency and reduce communication overhead when training large language models.
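The gist is that clients shrink each update before it crosses the network. A generic int8 sketch of the quantization half (not the post's exact wire format):

```python
import numpy as np

def quantize_update(update, bits=8):
    # Symmetric uniform quantization: float32 update -> int8 values + one scale.
    qmax = 2 ** (bits - 1) - 1
    scale = max(float(np.abs(update).max()) / qmax, 1e-12)
    q = np.round(update / scale).clip(-qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize_update(q, scale):
    return q.astype(np.float32) * scale

update = np.random.randn(1_000_000).astype(np.float32)  # ~4 MB in fp32
q, scale = quantize_update(update)                      # ~1 MB on the wire
restored = dequantize_update(q, scale)                  # server-side
```

Streaming then complements this by sending the payload in incremental chunks rather than as one large message.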
Learning
An Overview of Late Interaction Retrieval Models: ColBERT, ColPali, and ColQwen
An overview of late interaction retrieval models, such as ColBERT, ColPali, and ColQwen, highlighting their mechanisms for achieving accurate and scalable semantic retrieval across various data modalities.
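At the heart of all three models is the MaxSim operator: instead of comparing one query vector to one document vector, every query token is matched against its best document token, and the matches are summed. A minimal sketch with stand-in embeddings:

```python
import torch
import torch.nn.functional as F

def maxsim_score(q_emb, d_emb):
    # q_emb: (query_tokens, dim), d_emb: (doc_tokens, dim), both L2-normalized.
    sim = q_emb @ d_emb.T               # all token-to-token cosine similarities
    return sim.max(dim=1).values.sum()  # best doc token per query token, summed

q = F.normalize(torch.randn(8, 128), dim=-1)    # stand-in query token embeddings
d = F.normalize(torch.randn(100, 128), dim=-1)  # stand-in document token embeddings
print(maxsim_score(q, d))
```

ColPali and ColQwen apply the same scoring over patch embeddings of document images rather than text tokens.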
Copilot Arena: A Platform for Code LLM Evaluation in the Wild
A blog post from ML@CMU describing how Copilot Arena was designed to collect real-world human preferences to evaluate code-generating language models.
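Pairwise votes of this kind are typically aggregated into a leaderboard with a Bradley-Terry or Elo-style update; a generic sketch (not the post's exact pipeline, and the model names are placeholders):

```python
def elo_update(r_a, r_b, a_won, k=16.0):
    # Expected win probability for A under the Bradley-Terry/Elo model.
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    return r_a + k * (score_a - expected_a), r_b - k * (score_a - expected_a)

ratings = {"model_x": 1000.0, "model_y": 1000.0}  # hypothetical models
ratings["model_x"], ratings["model_y"] = elo_update(
    ratings["model_x"], ratings["model_y"], a_won=True
)
```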
Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research
A post presenting negative results on the utility of sparse autoencoders (SAEs) for downstream tasks like out-of-distribution harmful intent detection.
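For context, the usual setup in such evaluations is to encode model activations with a trained SAE and fit a simple probe on the sparse latents; a toy sketch, with random placeholders standing in for trained SAE weights and real data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model, d_sae, n = 512, 4096, 1000
W_enc = rng.normal(size=(d_model, d_sae))  # placeholder for trained SAE weights
b_enc = np.zeros(d_sae)

def sae_features(acts):
    # SAE encoder: sparse latent codes via ReLU(acts @ W_enc + b_enc).
    return np.maximum(acts @ W_enc + b_enc, 0.0)

acts = rng.normal(size=(n, d_model))  # stand-in residual-stream activations
labels = rng.integers(0, 2, size=n)   # stand-in harmful/benign labels
probe = LogisticRegression(max_iter=1000).fit(sae_features(acts), labels)
```

The post's finding is that probes built this way did not beat simpler baselines on the tasks tested.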
Circuit Tracing: Revealing Computational Graphs in Language Models
A methodological article on circuit tracing, a novel approach using transcoders to identify interpretable features and construct attribution graphs for revealing computational graphs within language models.
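A transcoder, roughly, is a sparse module trained to imitate an MLP layer's input-output map so that its hidden units can serve as interpretable features; a simplified torch sketch (illustrative dimensions and loss weights, not Anthropic's exact architecture or training recipe):

```python
import torch
import torch.nn as nn

class Transcoder(nn.Module):
    # Sparse stand-in for an MLP layer: input -> overcomplete ReLU features -> output.
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        f = torch.relu(self.enc(x))  # sparse features used in attribution graphs
        return self.dec(f), f

mlp = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))
tc = Transcoder(768, 8192)
x = torch.randn(64, 768)
y_hat, f = tc(x)
# Match the frozen MLP's output; the L1 term encourages sparse, nameable features.
loss = ((y_hat - mlp(x).detach()) ** 2).mean() + 1e-3 * f.abs().mean()
loss.backward()
```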
To Make Language Models Work Better, Researchers Sidestep Language
An article exploring the trend of researchers focusing on the internal mathematical representations of language models, like embeddings, to improve their capabilities.
Libraries & Code
Flexible Lookup Table Engine for LUT-quantized LLMs
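The idea behind LUT-based engines: group activations into small chunks, precompute each chunk's dot product against every possible low-bit weight pattern once, then replace multiply-accumulate with table lookups. A toy numpy sketch for 2-bit weights (illustrative codebook and shapes, not the library's actual kernels):

```python
import numpy as np

g = 4                                       # activation chunk size
levels = np.array([-1.5, -0.5, 0.5, 1.5])   # toy 2-bit weight codebook

def build_lut(x_chunk):
    # One table entry per possible 2-bit pattern of a g-element weight chunk.
    patterns = np.indices((4,) * g).reshape(g, -1).T  # (4**g, g) weight codes
    return levels[patterns] @ x_chunk                 # (4**g,) partial dot products

x = np.random.randn(g)                      # activation chunk
codes = np.random.randint(0, 4, size=g)     # quantized weight chunk (2-bit codes)
lut = build_lut(x)
idx = int(np.ravel_multi_index(tuple(codes), (4,) * g))  # pattern -> table index
assert np.isclose(lut[idx], levels[codes] @ x)  # lookup equals the real dot product
```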
foundationagents/awesome-foundation-agents
A curated collection of papers exploring the path towards Foundation Agents.
Papers & Publications
Less-to-More Generalization: Unlocking More Controllability by In-Context Generation
Abstract:
Although subject-driven generation has been extensively explored in image generation due to its wide applications, it still faces challenges in data scalability and subject expansibility. For the first challenge, moving from curating single-subject datasets to multiple-subject ones and scaling them is particularly difficult. For the second, most recent methods center on single-subject generation, making them hard to apply in multi-subject scenarios. In this study, we propose a highly-consistent data synthesis pipeline to tackle these challenges. This pipeline harnesses the intrinsic in-context generation capabilities of diffusion transformers and generates high-consistency multi-subject paired data. Additionally, we introduce UNO, a multi-image conditioned subject-to-image model iteratively trained from a text-to-image model, which consists of progressive cross-modal alignment and universal rotary position embedding. Extensive experiments show that our method can achieve high consistency while ensuring controllability in both single-subject and multi-subject driven generation.
PixelFlow: Pixel-Space Generative Models with Flow
Abstract:
We present PixelFlow, a family of image generation models that operate directly in the raw pixel space, in contrast to the predominant latent-space models. This approach simplifies the image generation process by eliminating the need for a pre-trained Variational Autoencoder (VAE) and making the whole model trainable end-to-end. Through efficient cascade flow modeling, PixelFlow keeps computation costs affordable in pixel space. It achieves an FID of 1.98 on the 256×256 class-conditional ImageNet generation benchmark. The qualitative text-to-image results demonstrate that PixelFlow excels in image quality, artistry, and semantic control. We hope this new paradigm will inspire and open up new opportunities for next-generation visual generation models.
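For readers new to flow models, a training step in its simplest, non-cascade form is sketched below; this is a generic flow-matching example on toy data, not PixelFlow's architecture:

```python
import torch
import torch.nn as nn

# Learn a velocity field v(x_t, t) that transports noise x0 toward images x1.
net = nn.Sequential(nn.Linear(3 * 32 * 32 + 1, 512), nn.SiLU(),
                    nn.Linear(512, 3 * 32 * 32))

x1 = torch.rand(16, 3 * 32 * 32)   # stand-in image batch (flattened pixels)
x0 = torch.randn_like(x1)          # noise sample
t = torch.rand(16, 1)              # random time in [0, 1]
xt = (1 - t) * x0 + t * x1         # linear interpolation path
v_target = x1 - x0                 # constant velocity along that path
v_pred = net(torch.cat([xt, t], dim=-1))
loss = ((v_pred - v_target) ** 2).mean()
loss.backward()
```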