Deep Learning Weekly: Issue 391
OpenAI Deep Research, Building Opik: A Scalable Open-Source LLM Observability Platform, a paper on Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling, and many more
This week in deep learning, we bring you OpenAI Deep Research, Building Opik: A Scalable Open-Source LLM Observability Platform, and a paper on Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling.
You may also enjoy Scaling the Tülu 3 post-training recipes to surpass the performance of DeepSeek V3, Choosing the Right AI Agent Framework: LangGraph vs CrewAI vs OpenAI Swarm, a paper on UI-TARS: Pioneering Automated GUI Interaction with Native Agents, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Introducing deep research | OpenAI
OpenAI launched deep research in ChatGPT, a new agentic capability that conducts multi-step research on the internet for complex tasks. It accomplishes in tens of minutes what would take a human many hours.
Scaling the Tülu 3 post-training recipes to surpass the performance of DeepSeek V3
Ai2 announced the launch of Tülu 3 405B—the first application of fully open post-training recipes to the largest open-weight models.
Mistral Small 3 | Mistral AI
The Mistral AI team introduced Mistral Small 3, a latency-optimized 24B-parameter model released under the Apache 2.0 license.
Initiative Aims to Enable Ethical Coding LLMs
Software Heritage is launching a project called CodeCommons, which will provide access to its code archive for those willing to sign up to ethical principles aimed at boosting transparency and accountability in AI training.
Toward video generative models of the molecular world
MIT CSAIL and Department of Mathematics researchers have developed a generative model called MDGen, which can take a frame of a 3D molecule and simulate what will happen next.
MLOps & LLMOps
Building Opik: A Scalable Open-Source LLM Observability Platform
Principal Software Engineer Andrés Cruz discusses the process of building Opik, Comet’s open-source platform for evaluating, testing, and monitoring LLM applications.
A comprehensive article on G-Eval for LLM evaluation, detailing its components and step-by-step implementation.
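For readers who want a feel for the approach, here is a minimal G-Eval-style scorer: the model first drafts evaluation steps (chain of thought) for a criterion, then applies them to produce a numeric score. This is an illustrative sketch assuming the openai Python client; the model name, criterion, and prompts are placeholders, not the article's exact implementation.

```python
# Illustrative G-Eval-style scoring sketch (model, criterion, and prompts are placeholders).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CRITERIA = "Coherence: the response should be well-structured and logically organized."

def g_eval_score(source_text: str, response: str) -> float:
    # Step 1: ask the model to draft evaluation steps for the criterion (chain of thought).
    steps = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Write concise evaluation steps for this criterion:\n{CRITERIA}",
        }],
    ).choices[0].message.content

    # Step 2: apply the steps to the response and return a single 1-5 score.
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                f"Evaluation steps:\n{steps}\n\n"
                f"Source:\n{source_text}\n\nResponse:\n{response}\n\n"
                "Following the steps, output only a single score from 1 to 5."
            ),
        }],
    ).choices[0].message.content
    return float(verdict.strip())
```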
Choosing the Right AI Agent Framework: LangGraph vs CrewAI vs OpenAI Swarm
An article that explores and compares three popular frameworks for building agentic applications: LangGraph, CrewAI, and OpenAI Swarm.
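As a taste of the graph-first style the article attributes to LangGraph, here is a minimal two-node sketch in which nodes read and update a shared typed state; the state fields and node logic are hypothetical stand-ins rather than the article's example.

```python
# Minimal LangGraph sketch: two nodes over shared state (hypothetical node/state names).
from typing import TypedDict
from langgraph.graph import StateGraph, END

class State(TypedDict):
    question: str
    answer: str

def research(state: State) -> dict:
    # Placeholder for a retrieval or tool-calling step.
    return {"answer": f"notes about {state['question']}"}

def summarize(state: State) -> dict:
    # Placeholder for a summarization step.
    return {"answer": state["answer"].upper()}

graph = StateGraph(State)
graph.add_node("research", research)
graph.add_node("summarize", summarize)
graph.set_entry_point("research")
graph.add_edge("research", "summarize")
graph.add_edge("summarize", END)

app = graph.compile()
print(app.invoke({"question": "agent frameworks", "answer": ""}))
```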
Multimodal Semantic Search with Images and Text
A helpful blog post on multimodal semantic search using images and text with Milvus.
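The core idea is that a CLIP-style encoder maps images and text into one embedding space, so a text query can retrieve images from a vector store. A rough sketch assuming pymilvus's MilvusClient (Milvus Lite) and a CLIP model from sentence-transformers; the file names and collection layout are placeholders, not the post's code.

```python
# Multimodal search sketch: CLIP embeddings + Milvus Lite (illustrative, not the post's code).
from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer
from PIL import Image

clip = SentenceTransformer("clip-ViT-B-32")   # encodes both images and text (512-dim)
client = MilvusClient("multimodal_demo.db")   # local Milvus Lite database file

client.create_collection(collection_name="photos", dimension=512)

# Index images: embed each one and store the vector alongside its path.
paths = ["cat.jpg", "beach.jpg"]              # placeholder image files
vectors = clip.encode([Image.open(p) for p in paths])
client.insert(
    collection_name="photos",
    data=[{"id": i, "vector": v.tolist(), "path": p}
          for i, (v, p) in enumerate(zip(vectors, paths))],
)

# Query with text: the shared CLIP space lets a caption retrieve matching images.
query_vec = clip.encode(["a cat sleeping on a sofa"])[0].tolist()
hits = client.search(collection_name="photos", data=[query_vec],
                     limit=3, output_fields=["path"])
print(hits)
```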
Learning
Your Company Needs Small Language Models
An article that discusses how small models can reduce costs, improve accuracy, and maintain control of data.
Productive Struggle: The Future of Human Learning in the Age of AI
A blog post about productive struggle and the future of human learning in the age of AI.
Mini-R1: Reproduce the DeepSeek R1 "aha moment", an RL tutorial
A blog post on reproducing the small "aha moment" of DeepSeek-R1 using Group Relative Policy Optimization (GRPO) and the Countdown Game.
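A key point of GRPO is that it needs no learned value model: advantages come from normalizing rewards within each group of completions sampled for the same prompt. A minimal sketch of that step (illustrative, not the tutorial's code):

```python
# Group-relative advantage computation, the core of GRPO (illustrative sketch).
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """rewards: (num_prompts, group_size) rewards for the completions sampled per prompt.

    Each completion's advantage is its reward normalized against its own group's
    mean and standard deviation, replacing a learned value baseline.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: four completions for one Countdown prompt, only the first solved the task.
rewards = torch.tensor([[1.0, 0.0, 0.0, 0.0]])
print(grpo_advantages(rewards))  # the correct completion gets the only positive advantage
```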
Self-Improving Diffusion Models with Synthetic Data
A research summary about self-improving diffusion models using synthetic data.
Libraries & Code
An AI coding assistant that uses the latest LLMs and codebase context to help you understand, write, and fix code faster.
No-code multi-agent framework to build LLM Agents, workflows and applications with your data.
A list of AI autonomous agents.
Papers & Publications
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
Abstract:
In this work, we introduce Janus-Pro, an advanced version of the previous work Janus. Specifically, Janus-Pro incorporates (1) an optimized training strategy, (2) expanded training data, and (3) scaling to larger model size. With these improvements, Janus-Pro achieves significant advancements in both multimodal understanding and text-to-image instruction-following capabilities, while also enhancing the stability of text-to-image generation. We hope this work will inspire further exploration in the field. Code and models are publicly available.
UI-TARS: Pioneering Automated GUI Interaction with Native Agents
Abstract:
This paper introduces UI-TARS, a native GUI agent model that solely perceives the screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations). Unlike prevailing agent frameworks that depend on heavily wrapped commercial models (e.g., GPT-4o) with expert-crafted prompts and workflows, UI-TARS is an end-to-end model that outperforms these sophisticated frameworks. Experiments demonstrate its superior performance: UI-TARS achieves SOTA performance in 10+ GUI agent benchmarks evaluating perception, grounding, and GUI task execution. Notably, in the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9 respectively). In AndroidWorld, UI-TARS achieves 46.6, surpassing GPT-4o (34.5). UI-TARS incorporates several key innovations: (1) Enhanced Perception: leveraging a large-scale dataset of GUI screenshots for context-aware understanding of UI elements and precise captioning; (2) Unified Action Modeling, which standardizes actions into a unified space across platforms and achieves precise grounding and interaction through large-scale action traces; (3) System-2 Reasoning, which incorporates deliberate reasoning into multi-step decision making, involving multiple reasoning patterns such as task decomposition, reflection thinking, milestone recognition, etc. (4) Iterative Training with Reflective Online Traces, which addresses the data bottleneck by automatically collecting, filtering, and reflectively refining new interaction traces on hundreds of virtual machines. Through iterative training and reflection tuning, UI-TARS continuously learns from its mistakes and adapts to unforeseen situations with minimal human intervention. We also analyze the evolution path of GUI agents to guide the further development of this domain.
Given the continuing interest in R1 (and DeepSeek in general), the following report provides insights into s1 and DeepSeek-R1 that you may find valuable:
From Brute Force to Brain Power: How Stanford's s1 Surpasses DeepSeek-R1
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5130864