Deep Learning Weekly: Issue 385
OpenAI’s o3 model, BERTScore For LLM Evaluation, a paper on Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder, and many more!
This week in deep learning, we bring you OpenAI's announcement of its new o3 model, BERTScore For LLM Evaluation, and a paper on Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference.
You may also enjoy Pre-Deployment Evaluation of OpenAI's o1 Model, The AI Agent Spectrum, a paper on Understanding Transformer Reasoning Capabilities via Graph Algorithms, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
OpenAI announces new o3 model
OpenAI announced the successors to its o1 reasoning model family: o3 and o3-mini. While the models are not yet widely available, safety researchers can sign up for a preview.
Pre-Deployment Evaluation of OpenAI's o1 Model
The U.S. Artificial Intelligence Safety Institute (US AISI) and the UK Artificial Intelligence Safety Institute (UK AISI) conducted a joint pre-deployment evaluation of OpenAI’s o1.
GitHub is making its AI programming Copilot free for VS Code developers
GitHub has announced GitHub Copilot Free, which integrates directly into Visual Studio Code.
Anysphere reportedly raises $100M for its AI-driven Cursor code editor
Anysphere, the startup behind the popular AI-powered code editor Cursor, has reportedly raised $100 million in funding.
MLOps & LLMOps
Intro to LLM Observability: What to Monitor & How to Get Started
A blog post discussing LLM observability and its importance for real-world LLM applications.
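Below is a minimal sketch of the kind of instrumentation such a post typically recommends: wrapping each model call to capture latency, token usage, and errors. It assumes the OpenAI Python SDK (v1) response shape; the wrapper itself is illustrative, not taken from the article.

```python
import logging
import time
from dataclasses import dataclass

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("llm_observability")

@dataclass
class LLMCallRecord:
    prompt: str
    response: str
    latency_s: float
    prompt_tokens: int
    completion_tokens: int

def observed_call(client, prompt: str, model: str = "gpt-4o-mini") -> LLMCallRecord:
    """Wrap a chat-completion call and log the core observability signals:
    latency, token usage, and failures."""
    start = time.perf_counter()
    try:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
    except Exception:
        logger.exception("LLM call failed")
        raise
    latency = time.perf_counter() - start
    record = LLMCallRecord(
        prompt=prompt,
        response=resp.choices[0].message.content,
        latency_s=latency,
        prompt_tokens=resp.usage.prompt_tokens,
        completion_tokens=resp.usage.completion_tokens,
    )
    logger.info(
        "model=%s latency=%.2fs prompt_tokens=%d completion_tokens=%d",
        model, record.latency_s, record.prompt_tokens, record.completion_tokens,
    )
    return record
```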
Building effective agents (Anthropic)
A practical blog post about building effective agents that use simple and composable patterns, along with guidelines for prompt engineering tools used by AI agents.
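As a flavor of what "simple and composable" means in practice, here is a minimal sketch of a prompt-chaining workflow: each step's output feeds the next, with a cheap programmatic gate in between. The `llm` helper is a hypothetical stand-in for whatever completion call your stack provides, not an API from the post.

```python
def llm(prompt: str) -> str:
    # Hypothetical helper: replace with a call to your model provider.
    raise NotImplementedError

def summarize_then_translate(document: str, language: str = "French") -> str:
    # Step 1: summarize the document.
    summary = llm(f"Summarize the following document in 3 bullet points:\n\n{document}")

    # Gate: a cheap programmatic check before spending more tokens.
    if summary.count("\n") < 2:
        raise ValueError("summary does not look like 3 bullet points; stopping the chain")

    # Step 2: translate the validated summary.
    return llm(f"Translate these bullet points into {language}:\n\n{summary}")
```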
The AI Agent Spectrum
An article discussing the different types of AI agents and the spectrum of their complexity.
Learning
BERTScore For LLM Evaluation
An article that highlights how BERTScore improves upon traditional evaluation methods and describes its role in the hierarchy of LLM evaluation metrics.
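For context, here is a minimal sketch of computing the metric with the open-source bert-score package (pip install bert-score); the example sentences are illustrative. Because BERTScore compares contextual token embeddings rather than exact n-gram overlap, paraphrased outputs still score well.

```python
from bert_score import score

candidates = ["The cat sat on the mat."]
references = ["A cat was sitting on the mat."]

# Returns precision, recall, and F1 tensors, one entry per candidate/reference pair.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1.mean().item():.3f}")
```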
Controlling Language Model Generation with NVIDIA's LogitsProcessorZoo
A post about steering language model outputs at decoding time with NVIDIA's LogitsProcessorZoo, a library of ready-made logits processors.
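The zoo builds on Hugging Face's logits-processor hook. Rather than reproduce the zoo's own classes, here is a generic sketch of the underlying mechanism: a custom processor that bans a set of token ids at every decoding step. The model choice (GPT-2) and banned phrase are arbitrary examples.

```python
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          LogitsProcessor, LogitsProcessorList)

class BanTokensProcessor(LogitsProcessor):
    """Masks out a fixed set of token ids before sampling."""
    def __init__(self, banned_token_ids):
        self.banned_token_ids = banned_token_ids

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        # Setting a logit to -inf removes the token from consideration.
        scores[:, self.banned_token_ids] = float("-inf")
        return scores

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any causal LM works here
model = AutoModelForCausalLM.from_pretrained("gpt2")

banned = tokenizer(" awesome", add_special_tokens=False).input_ids
inputs = tokenizer("This library is", return_tensors="pt")
out = model.generate(
    **inputs,
    max_new_tokens=20,
    logits_processor=LogitsProcessorList([BanTokensProcessor(banned)]),
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```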
How to fine-tune open LLMs in 2025 with Hugging Face
A practical guide on how to fine-tune open LLMs in 2025 using Hugging Face, with a focus on optimization, distributed training, and customization.
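As a rough companion to the guide, here is a minimal sketch of parameter-efficient fine-tuning with transformers, peft, and datasets. The checkpoint, dataset, and LoRA hyperparameters are illustrative assumptions rather than the article's setup, and the distributed-training and optimization details the guide covers are omitted.

```python
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "Qwen/Qwen2.5-0.5B"  # assumption: any small open causal LM checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Attach LoRA adapters so only a small fraction of the weights are trained.
model = get_peft_model(
    model,
    LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"),
)
model.print_trainable_parameters()

# Tiny text dataset, tokenized for causal language modeling.
dataset = load_dataset("Abirate/english_quotes", split="train[:500]")
dataset = dataset.map(
    lambda ex: tokenizer(ex["quote"], truncation=True, max_length=256),
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="lora-out", per_device_train_batch_size=4,
                           num_train_epochs=1, learning_rate=2e-4, logging_steps=10),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```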
Libraries & Code
A semantic query engine for fast and easy LLM-powered data processing
A comprehensive library for learning neural operators in PyTorch
Papers & Publications
Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference
Abstract:
Encoder-only transformer models such as BERT offer a great performance-size tradeoff for retrieval and classification tasks with respect to larger decoder-only models. Despite being the workhorse of numerous production pipelines, there have been limited Pareto improvements to BERT since its release. In this paper, we introduce ModernBERT, bringing modern model optimizations to encoder-only models and representing a major Pareto improvement over older encoders. Trained on 2 trillion tokens with a native 8192 sequence length, ModernBERT models exhibit state-of-the-art results on a large pool of evaluations encompassing diverse classification tasks and both single and multi-vector retrieval on different domains (including code). In addition to strong downstream performance, ModernBERT is also the most speed and memory efficient encoder and is designed for inference on common GPUs.
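Below is a minimal usage sketch with the transformers fill-mask pipeline, assuming the released checkpoint name "answerdotai/ModernBERT-base" and a transformers version that includes ModernBERT support.

```python
from transformers import pipeline

# Masked-token prediction with ModernBERT, BERT-style [MASK] token included.
fill_mask = pipeline("fill-mask", model="answerdotai/ModernBERT-base")
for pred in fill_mask("The capital of France is [MASK]."):
    print(f"{pred['token_str']:>12}  {pred['score']:.3f}")
```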
Cultural Evolution of Cooperation among LLM Agents
Abstract:
Large language models (LLMs) provide a compelling foundation for building generally-capable AI agents. These agents may soon be deployed at scale in the real world, representing the interests of individual humans (e.g., AI assistants) or groups of humans (e.g., AI-accelerated corporations). At present, relatively little is known about the dynamics of multiple LLM agents interacting over many generations of iterative deployment. In this paper, we examine whether a "society" of LLM agents can learn mutually beneficial social norms in the face of incentives to defect, a distinctive feature of human sociality that is arguably crucial to the success of civilization. In particular, we study the evolution of indirect reciprocity across generations of LLM agents playing a classic iterated Donor Game in which agents can observe the recent behavior of their peers. We find that the evolution of cooperation differs markedly across base models, with societies of Claude 3.5 Sonnet agents achieving significantly higher average scores than Gemini 1.5 Flash, which, in turn, outperforms GPT-4o. Further, Claude 3.5 Sonnet can make use of an additional mechanism for costly punishment to achieve yet higher scores, while Gemini 1.5 Flash and GPT-4o fail to do so. For each model class, we also observe variation in emergent behavior across random seeds, suggesting an understudied sensitive dependence on initial conditions. We suggest that our evaluation regime could inspire an inexpensive and informative new class of LLM benchmarks, focussed on the implications of LLM agent deployment for the cooperative infrastructure of society.
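To make the setup concrete, here is a minimal sketch of the Donor Game's payoff structure: a donor pays a cost so the recipient receives a larger benefit, and agents condition donations on recipients' recently observed behavior. The parameter values and the simple reciprocity rule are illustrative, not the paper's exact protocol, and plain random agents stand in for LLM agents.

```python
import random

COST, BENEFIT = 1.0, 3.0  # donating costs the donor 1 and gives the recipient 3

def play_round(agents, scores, history):
    donor, recipient = random.sample(agents, 2)
    # Indirect reciprocity: donate if the recipient donated last time (or has no record yet).
    donate = history.get(recipient, True)
    if donate:
        scores[donor] -= COST
        scores[recipient] += BENEFIT
    history[donor] = donate  # the donor's choice becomes its public reputation

agents = [f"agent_{i}" for i in range(8)]
scores = {a: 0.0 for a in agents}
history = {}
for _ in range(1000):
    play_round(agents, scores, history)
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```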
Understanding Transformer Reasoning Capabilities via Graph Algorithms
Abstract:
Which transformer scaling regimes are able to perfectly solve different classes of algorithmic problems? While tremendous empirical advances have been attained by transformer-based neural networks, a theoretical understanding of their algorithmic reasoning capabilities in realistic parameter regimes is lacking. We investigate this question in terms of the network's depth, width, and number of extra tokens for algorithm execution. Our novel representational hierarchy separates 9 algorithmic reasoning problems into classes solvable by transformers in different realistic parameter scaling regimes. We prove that logarithmic depth is necessary and sufficient for tasks like graph connectivity, while single-layer transformers with small embedding dimensions can solve contextual retrieval tasks. We also support our theoretical analysis with ample empirical evidence using the GraphQA benchmark. These results show that transformers excel at many graph reasoning tasks, even outperforming specialized graph neural networks.