Deep Learning Weekly: Issue 391
OpenAI Deep Research, Building Opik: A Scalable Open-Source LLM Observability Platform, a paper on Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling, and many more
This week in deep learning, we bring you OpenAI Deep Research, Building Opik: A Scalable Open-Source LLM Observability Platform, and a paper on Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling.
You may also enjoy Scaling the Tülu 3 post-training recipes to surpass the performance of DeepSeek V3, Choosing the Right AI Agent Framework: LangGraph vs CrewAI vs OpenAI Swarm, a paper on UI-TARS: Pioneering Automated GUI Interaction with Native Agents, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Introducing deep research | OpenAI
OpenAI launched deep research in ChatGPT, a new agentic capability that conducts multi-step research on the internet for complex tasks. It accomplishes in tens of minutes what would take a human many hours.
Scaling the Tülu 3 post-training recipes to surpass the performance of DeepSeek V3
Ai2 announced the launch of Tülu 3 405B—the first application of fully open post-training recipes to the largest open-weight models.
Mistral Small 3 | Mistral AI
The Mistral AI team introduced Mistral Small 3, a latency-optimized 24B-parameter model released under the Apache 2.0 license.
Initiative Aims to Enable Ethical Coding LLMs
Software Heritage is launching a project called CodeCommons, which will provide access to its code archive for those willing to sign up to ethical principles aimed at boosting transparency and accountability in AI training.
Toward video generative models of the molecular world
MIT CSAIL and Department of Mathematics researchers have developed a generative model called MDGen, which can take a frame of a 3D molecule and simulate what will happen next.
MLOps & LLMOps
Building Opik: A Scalable Open-Source LLM Observability Platform
Principal Software Engineer Andrés Cruz discusses the process of building Opik, Comet’s open-source platform for evaluating, testing, and monitoring LLM applications.
A comprehensive article on G-Eval for LLM evaluation, detailing its components and step-by-step implementation.
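For readers who want a feel for the approach, here is a minimal G-Eval-style scorer: the model first drafts evaluation steps (chain of thought) for a criterion, then applies them to produce a numeric score. This is an illustrative sketch assuming the openai Python client; the model name, criterion, and prompts are placeholders, not the article's exact implementation.

```python
# Illustrative G-Eval-style scoring sketch (model, criterion, and prompts are placeholders).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CRITERIA = "Coherence: the response should be well-structured and logically organized."

def g_eval_score(source_text: str, response: str) -> float:
    # Step 1: ask the model to draft evaluation steps for the criterion (chain of thought).
    steps = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Write concise evaluation steps for this criterion:\n{CRITERIA}",
        }],
    ).choices[0].message.content

    # Step 2: apply the steps to the response and return a single 1-5 score.
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                f"Evaluation steps:\n{steps}\n\n"
                f"Source:\n{source_text}\n\nResponse:\n{response}\n\n"
                "Following the steps, output only a single score from 1 to 5."
            ),
        }],
    ).choices[0].message.content
    return float(verdict.strip())
```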
Choosing the Right AI Agent Framework: LangGraph vs CrewAI vs OpenAI Swarm
An article that explores and compares three popular frameworks for building agentic applications: LangGraph, CrewAI, and OpenAI Swarm.
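As a taste of the graph-first style the article attributes to LangGraph, here is a minimal two-node sketch in which nodes read and update a shared typed state; the state fields and node logic are hypothetical stand-ins rather than the article's example.

```python
# Minimal LangGraph sketch: two nodes over shared state (hypothetical node/state names).
from typing import TypedDict
from langgraph.graph import StateGraph, END

class State(TypedDict):
    question: str
    answer: str

def research(state: State) -> dict:
    # Placeholder for a retrieval or tool-calling step.
    return {"answer": f"notes about {state['question']}"}

def summarize(state: State) -> dict:
    # Placeholder for a summarization step.
    return {"answer": state["answer"].upper()}

graph = StateGraph(State)
graph.add_node("research", research)
graph.add_node("summarize", summarize)
graph.set_entry_point("research")
graph.add_edge("research", "summarize")
graph.add_edge("summarize", END)

app = graph.compile()
print(app.invoke({"question": "agent frameworks", "answer": ""}))
```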
Multimodal Semantic Search with Images and Text
A helpful blog post on multimodal semantic search using images and text with Milvus.
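The core idea is that a CLIP-style encoder maps images and text into one embedding space, so a text query can retrieve images from a vector store. A rough sketch assuming pymilvus's MilvusClient (Milvus Lite) and a CLIP model from sentence-transformers; the file names and collection layout are placeholders, not the post's code.

```python
# Multimodal search sketch: CLIP embeddings + Milvus Lite (illustrative, not the post's code).
from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer
from PIL import Image

clip = SentenceTransformer("clip-ViT-B-32")   # encodes both images and text (512-dim)
client = MilvusClient("multimodal_demo.db")   # local Milvus Lite database file

client.create_collection(collection_name="photos", dimension=512)

# Index images: embed each one and store the vector alongside its path.
paths = ["cat.jpg", "beach.jpg"]              # placeholder image files
vectors = clip.encode([Image.open(p) for p in paths])
client.insert(
    collection_name="photos",
    data=[{"id": i, "vector": v.tolist(), "path": p}
          for i, (v, p) in enumerate(zip(vectors, paths))],
)

# Query with text: the shared CLIP space lets a caption retrieve matching images.
query_vec = clip.encode(["a cat sleeping on a sofa"])[0].tolist()
hits = client.search(collection_name="photos", data=[query_vec],
                     limit=3, output_fields=["path"])
print(hits)
```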
Learning
Your Company Needs Small Language Models
An article that discusses how small models can reduce costs, improve accuracy, and maintain control of data.
Productive Struggle: The Future of Human Learning in the Age of AI
A blog post about productive struggle and the future of human learning in the age of AI.
Mini-R1: Reproduce the DeepSeek R1 "aha moment", an RL tutorial
A blog post on reproducing the small "aha moment" of DeepSeek-R1 using Group Relative Policy Optimization (GRPO) and the Countdown Game.
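A key point of GRPO is that it needs no learned value model: advantages come from normalizing rewards within each group of completions sampled for the same prompt. A minimal sketch of that step (illustrative, not the tutorial's code):

```python
# Group-relative advantage computation, the core of GRPO (illustrative sketch).
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """rewards: (num_prompts, group_size) rewards for the completions sampled per prompt.

    Each completion's advantage is its reward normalized against its own group's
    mean and standard deviation, replacing a learned value baseline.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: four completions for one Countdown prompt, only the first solved the task.
rewards = torch.tensor([[1.0, 0.0, 0.0, 0.0]])
print(grpo_advantages(rewards))  # the correct completion gets the only positive advantage
```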
Self-Improving Diffusion Models with Synthetic Data
A research summary about self-improving diffusion models using synthetic data.
Libraries & Code
An AI coding assistant that uses the latest LLMs and codebase context to help you understand, write, and fix code faster.
No-code multi-agent framework to build LLM Agents, workflows and applications with your data.
A list of AI autonomous agents.
Papers & Publications
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
Abstract:
In this work, we introduce Janus-Pro, an advanced version of the previous work Janus. Specifically, Janus-Pro incorporates (1) an optimized training strategy, (2) expanded training data, and (3) scaling to larger model size. With these improvements, Janus-Pro achieves significant advancements in both multimodal understanding and text-to-image instruction-following capabilities, while also enhancing the stability of text-to-image generation. We hope this work will inspire further exploration in the field. Code and models are publicly available.
UI-TARS: Pioneering Automated GUI Interaction with Native Agents
Abstract:
This paper introduces UI-TARS, a native GUI agent model that solely perceives the screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations). Unlike prevailing agent frameworks that depend on heavily wrapped commercial models (e.g., GPT-4o) with expert-crafted prompts and workflows, UI-TARS is an end-to-end model that outperforms these sophisticated frameworks. Experiments demonstrate its superior performance: UI-TARS achieves SOTA performance in 10+ GUI agent benchmarks evaluating perception, grounding, and GUI task execution. Notably, in the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9 respectively). In AndroidWorld, UI-TARS achieves 46.6, surpassing GPT-4o (34.5). UI-TARS incorporates several key innovations: (1) Enhanced Perception: leveraging a large-scale dataset of GUI screenshots for context-aware understanding of UI elements and precise captioning; (2) Unified Action Modeling, which standardizes actions into a unified space across platforms and achieves precise grounding and interaction through large-scale action traces; (3) System-2 Reasoning, which incorporates deliberate reasoning into multi-step decision making, involving multiple reasoning patterns such as task decomposition, reflection thinking, milestone recognition, etc. (4) Iterative Training with Reflective Online Traces, which addresses the data bottleneck by automatically collecting, filtering, and reflectively refining new interaction traces on hundreds of virtual machines. Through iterative training and reflection tuning, UI-TARS continuously learns from its mistakes and adapts to unforeseen situations with minimal human intervention. We also analyze the evolution path of GUI agents to guide the further development of this domain.
Given the continuing interest in R1 (and DeepSeek in general), the following report provides insights into s1 and DeepSeek-R1 that you may find valuable:
From Brute Force to Brain Power: How Stanford's s1 Surpasses DeepSeek-R1
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5130864