Deep Learning Weekly: Issue 395
Manus AI; OmniAI's OCR Benchmark, which uses structured outputs and complex documents; a paper on Visual-RFT: Visual Reinforcement Fine-Tuning; and many more!
This week in deep learning, we bring you Manus AI, OmniAI's OCR Benchmark, and a paper on Visual-RFT: Visual Reinforcement Fine-Tuning.
You may also enjoy OpenAI's new tools for building agents, What Are Agentic Workflows? Patterns, Use Cases, Examples, and More, a paper on Forecasting Frontier Language Model Agent Capabilities, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Manus AI partners with Alibaba's Qwen team in expansion bid
Manus AI announced a strategic partnership with the team behind Alibaba's Qwen, a move that could bolster the startup's roll-out of what it called the world's first general AI agent.
New tools for building agents | OpenAI
OpenAI launched a new set of APIs and tools specifically designed to simplify the development of agentic applications, including the Responses API and integrated observability tools.
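For a taste of the new surface, here is a minimal sketch of calling the Responses API with the official Python SDK; the model name and prompt are illustrative assumptions, not OpenAI's example.

```python
# Minimal sketch of the new Responses API via the official openai SDK;
# the model name is an assumption and may differ for your account.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-4o",
    input="Give me one sentence on why agent observability matters.",
)
print(response.output_text)  # convenience accessor for the generated text
```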
Google co-founder Larry Page reportedly has a new AI startup
Google co-founder Larry Page is building a new company called Dynatomics that’s focused on applying AI to product manufacturing.
Mistral OCR | Mistral AI
The Mistral Team introduced Mistral OCR, an Optical Character Recognition API that sets a new standard in complex document understanding.
AI21 debuts Maestro AI planning and orchestration system
AI21 Labs introduced Maestro, a software system that promises to boost the output quality of large language models significantly.
Lila Sciences raises $200M to accelerate autonomous scientific research
A report on the launch of Lila Sciences, a startup backed by $200 million that aims to accelerate scientific research through autonomous AI agents.
MLOps & LLMOps
What Are Agentic Workflows? Patterns, Use Cases, Examples, and More
A comprehensive article that gives a concise definition of agentic workflows, goes through the key components of AI agents, and highlights real-world examples and use cases.
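As a rough illustration (not code from the article), most agentic workflows reduce to a plan-act-observe loop; in the hypothetical sketch below, `call_llm` and the tool registry are stand-ins for a real model and real tools.

```python
# Illustrative plan-act-observe loop at the heart of most agentic workflows.
# `call_llm` and `search_web` are hypothetical stand-ins, not a real API.
from typing import Callable, Dict, List

def search_web(query: str) -> str:
    return f"stub search results for: {query}"  # placeholder tool

TOOLS: Dict[str, Callable[[str], str]] = {"search_web": search_web}

def call_llm(history: List[str]) -> str:
    # Stand-in for a model call: a real agent would return either a tool
    # invocation ("TOOL <name>: <arg>") or a final answer ("FINAL: ...").
    return "FINAL: done" if len(history) > 2 else "TOOL search_web: agentic workflows"

def run_agent(task: str, max_steps: int = 5) -> str:
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        action = call_llm(history)
        if action.startswith("FINAL:"):
            return action.removeprefix("FINAL:").strip()
        tool_name, arg = action.removeprefix("TOOL ").split(": ", 1)
        history.append(f"Observation: {TOOLS[tool_name](arg)}")
    return "stopped: step budget exhausted"

print(run_agent("What are agentic workflows?"))
```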
AI in software engineering at Google: Progress and the path ahead
An informative blog post from Google AI that discusses the progress and future of AI in Google's internal software engineering tools, highlighting improvements in code completion, review, and other development processes.
Streamline LLM Deployment for Autonomous Vehicle Applications with NVIDIA DriveOS LLM SDK
A post that introduces the NVIDIA DriveOS LLM SDK, a library designed to optimize the inference of state-of-the-art LLMs and VLMs on the DRIVE AGX platform for autonomous vehicles.
Learning
How we evaluated Elicit Reports
The Elicit team describes how they evaluated Elicit Reports: fully-automated research overviews inspired by systematic reviews.
OmniAI OCR Benchmark
A technical article about the OmniAI OCR Benchmark, which evaluates various traditional OCR providers and Vision Language Models using structured outputs and real-world documents.
Current and New Activation Checkpointing Techniques in PyTorch
A blog post that walks through the basics of what activation memory is, as well as the high-level ideas behind existing activation checkpointing techniques.
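As a quick refresher, here is a minimal sketch (not taken from the post) of the standard torch.utils.checkpoint API: activations inside each wrapped block are discarded during the forward pass and recomputed on backward, trading compute for memory.

```python
# Minimal activation-checkpointing sketch; module sizes are arbitrary.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedMLP(nn.Module):
    def __init__(self, dim: int = 1024, depth: int = 8):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(depth)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            # use_reentrant=False is the recommended modern variant
            x = checkpoint(block, x, use_reentrant=False)
        return x

model = CheckpointedMLP()
x = torch.randn(32, 1024, requires_grad=True)
model(x).sum().backward()  # activations inside each block are recomputed here
```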
LeRobot goes to driving school: World’s largest open-source self-driving dataset
A comprehensive article announces the release of the "Learning to Drive" (L2D) dataset, the world's largest open-source multimodal dataset for autonomous driving research.
Libraries & Code
A framework that enables multi-agent systems to continuously evolve by generating data and interacting with environments.
Official repository for Sony Research’s work on micro-budget training of large-scale diffusion models.
Papers & Publications
Visual-RFT: Visual Reinforcement Fine-Tuning
Abstract:
Reinforcement Fine-Tuning (RFT) in Large Reasoning Models like OpenAI o1 learns from feedback on its answers, which is especially useful in applications where fine-tuning data is scarce. Recent open-source work like DeepSeek-R1 demonstrates that reinforcement learning with verifiable reward is one key direction in reproducing o1. While the R1-style model has demonstrated success in language models, its application in multi-modal domains remains under-explored. This work introduces Visual Reinforcement Fine-Tuning (Visual-RFT), which further extends the application areas of RFT to visual tasks. Specifically, Visual-RFT first uses Large Vision-Language Models (LVLMs) to generate multiple responses containing reasoning tokens and final answers for each input, and then uses our proposed visual perception verifiable reward functions to update the model via a policy optimization algorithm such as Group Relative Policy Optimization (GRPO). We design different verifiable reward functions for different perception tasks, such as the Intersection over Union (IoU) reward for object detection. Experimental results on fine-grained image classification, few-shot object detection, reasoning grounding, as well as open-vocabulary object detection benchmarks show the competitive performance and advanced generalization ability of Visual-RFT compared with Supervised Fine-Tuning (SFT). For example, Visual-RFT improves accuracy by 24.3% over the baseline in one-shot fine-grained image classification with around 100 samples. In few-shot object detection, Visual-RFT also exceeds the baseline by 21.9 on COCO's two-shot setting and 15.4 on LVIS. Our Visual-RFT represents a paradigm shift in fine-tuning LVLMs, offering a data-efficient, reward-driven approach that enhances reasoning and adaptability for domain-specific tasks.
Forecasting Frontier Language Model Agent Capabilities
Abstract:
As Language Models (LMs) increasingly operate as autonomous agents, accurately forecasting their capabilities becomes crucial for societal preparedness. We evaluate six forecasting methods that predict downstream capabilities of LM agents. We compare "one-step" approaches, which predict benchmark scores directly from input metrics like compute or model release date, with "two-step" approaches, which first predict an intermediate metric such as the principal component of cross-benchmark performance (PC-1) or human-evaluated competitive Elo ratings. We evaluate our forecasting methods by backtesting them on a dataset of 38 LMs from the OpenLLM 2 leaderboard. We then use the validated two-step approach (Release Date→Elo→Benchmark) to predict LM agent performance for frontier models on three benchmarks: SWE-Bench Verified (software development), Cybench (cybersecurity assessment), and RE-Bench (ML research engineering). Our forecast predicts that by the beginning of 2026, non-specialized LM agents with low capability elicitation will reach a success rate of 54% on SWE-Bench Verified, while state-of-the-art LM agents will reach an 87% success rate. Our approach does not account for recent advances in inference-compute scaling and might thus be too conservative.