Deep Learning Weekly: Issue 393
Claude 3.7 Sonnet and Claude Code, LLM Juries for Evaluation, a paper on MoBA: Mixture of Block Attention for Long-Context LLMs, and many more!
This week in deep learning, we bring you Claude 3.7 Sonnet and Claude Code, LLM Juries for Evaluation, and a paper on MoBA: Mixture of Block Attention for Long-Context LLMs.
You may also enjoy Like human brains, large language models reason about diverse data in a general way, 10 Foot Guns in Fine-Tuning and Few-Shots, a paper on Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Claude 3.7 Sonnet and Claude Code
Anthropic announced Claude 3.7 Sonnet, the first hybrid reasoning model on the market, which can produce either near-instant responses or extended, step-by-step thinking, with users controlling how long the model reasons.
Can deep learning transform heart failure prevention?
Researchers from MIT and Harvard Medical School built an AI model called CHAIS that makes it easier for clinicians to monitor a patient’s heart health.
AI search engine startup Genspark reportedly raises $100M at $530M valuation
Genspark, an AI search startup that boasts e-commerce product search and financial report visualizations as specialized features, has reportedly raised $100 million in funding.
MLOps & LLMOps
An overview of the most notable LLM evaluation tools on the market and an analysis of their key features.
Mechanism design for large language models
Research scientists from Google investigate the design of auction mechanisms for aggregating the output of multiple self-interested LLMs into one joint output.
Open-source DeepResearch – Freeing our search agents
An article about open-sourcing Deep Research agent frameworks, detailing how to reproduce OpenAI's results by using a code agent with tools for web browsing and text inspection.
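For a taste of what such an agent looks like in code, here is a minimal sketch using Hugging Face's smolagents library (the framework behind the post); the exact class names and the example query are assumptions based on the library's API at the time and may differ in newer releases.

```python
# Minimal sketch of a Deep Research-style agent with smolagents.
# Class names (CodeAgent, DuckDuckGoSearchTool, HfApiModel) are assumptions
# based on the library's API at the time of the post and may have changed.
from smolagents import CodeAgent, DuckDuckGoSearchTool, HfApiModel

# The code agent writes and executes Python that calls its tools; a simple
# web search tool stands in here for full browsing and text inspection.
agent = CodeAgent(
    tools=[DuckDuckGoSearchTool()],
    model=HfApiModel(),  # defaults to a hosted model on the HF Inference API
    max_steps=10,        # cap the number of reasoning / tool-use iterations
)

answer = agent.run(
    "How many seconds would it take for a leopard at full speed "
    "to run through Pont des Arts?"
)
print(answer)
```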
Optimizing Qwen2.5-Coder Throughput with NVIDIA TensorRT-LLM Lookahead Decoding
A post that explores the benefits of inference optimizations for Qwen2.5-Coder models supported in NVIDIA TensorRT-LLM.
The Ultra-Scale Playbook - a Hugging Face Space by nanotron
An open-source book that walks you through the knowledge necessary to scale the training of LLMs from one GPU to thousands of GPUs, illustrating theory with practical code examples.
A book that aims to demystify the science of LLMs on TPUs: how TPUs work and how they communicate with each other, how LLMs run on real hardware, and how to parallelize your models during training and inference so they run efficiently at massive scale.
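For a flavor of the ground these resources cover, below is a toy PyTorch data-parallel training loop (not taken from either book): the same step runs in one process per GPU and DDP averages the gradients across processes. The model and data are placeholders.

```python
# Toy data-parallel training loop: launch with
# `torchrun --nproc_per_node=<num_gpus> train.py`. Model and data are stand-ins.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")      # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda()   # stand-in for a real transformer
    model = DDP(model, device_ids=[local_rank])  # gradients all-reduced across ranks
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(100):
        x = torch.randn(8, 1024, device="cuda")  # each rank would see a different shard
        loss = model(x).pow(2).mean()
        loss.backward()                          # DDP overlaps all-reduce with backward
        opt.step()
        opt.zero_grad()
        if dist.get_rank() == 0 and step % 20 == 0:
            print(f"step {step}: loss {loss.item():.4f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```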
Beyond RAG: Implementing Agent Search with LangGraph for Smarter Knowledge Retrieval
An instructive blog post on implementing Agent Search with LangGraph for smarter knowledge retrieval, enhancing enterprise data insights.
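As a rough sketch of the pattern (not the post's actual implementation), the hypothetical LangGraph graph below has an agent node that routes between a retrieval step and answer generation; the node bodies are placeholders.

```python
# Hypothetical agent-search loop with LangGraph: an agent node decides whether
# to retrieve more context or answer. Node bodies are placeholders.
from typing import List, TypedDict
from langgraph.graph import StateGraph, END

class AgentState(TypedDict):
    question: str
    documents: List[str]
    answer: str

def agent(state: AgentState) -> AgentState:
    # In a real system an LLM call would decide the next action here.
    return state

def retrieve(state: AgentState) -> AgentState:
    # Placeholder: query a vector store or search API and append the hits.
    state["documents"].append("retrieved chunk about " + state["question"])
    return state

def generate(state: AgentState) -> AgentState:
    # Placeholder: have an LLM synthesize an answer from the documents.
    state["answer"] = f"Answer based on {len(state['documents'])} documents."
    return state

def should_retrieve(state: AgentState) -> str:
    # Route to retrieval until some context exists, then generate.
    return "retrieve" if not state["documents"] else "generate"

graph = StateGraph(AgentState)
graph.add_node("agent", agent)
graph.add_node("retrieve", retrieve)
graph.add_node("generate", generate)
graph.set_entry_point("agent")
graph.add_conditional_edges("agent", should_retrieve,
                            {"retrieve": "retrieve", "generate": "generate"})
graph.add_edge("retrieve", "agent")  # loop back so the agent can decide again
graph.add_edge("generate", END)

app = graph.compile()
result = app.invoke({"question": "What drove Q3 revenue?", "documents": [], "answer": ""})
print(result["answer"])
```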
Learning
LLM Juries for Evaluation
An article that explores the advantages and limitations of LLM Juries and how to implement one from scratch in Opik.
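For a sense of the mechanics, here is a minimal, library-agnostic jury sketch (the Opik integration itself is covered in the article); `ask_judge`, the judge model list, and the rubric are hypothetical placeholders.

```python
# Library-agnostic LLM jury sketch: several judge models score a candidate
# answer, and the scores are aggregated by mean score plus a majority vote.
# `ask_judge` is a hypothetical wrapper around whatever LLM client(s) you use.
from statistics import mean

JUDGE_MODELS = ["gpt-4o-mini", "claude-3-5-haiku", "gemini-1.5-flash"]  # example jury

RUBRIC = (
    "Rate the answer to the question on a 1-5 scale for factual correctness. "
    "Reply with a single integer.\n\nQuestion: {question}\nAnswer: {answer}"
)

def ask_judge(model: str, prompt: str) -> str:
    """Hypothetical helper that calls the given judge model and returns its reply."""
    raise NotImplementedError

def jury_evaluate(question: str, answer: str, threshold: float = 3.0) -> dict:
    scores = []
    for model in JUDGE_MODELS:
        reply = ask_judge(model, RUBRIC.format(question=question, answer=answer))
        try:
            scores.append(int(reply.strip()))
        except ValueError:
            continue  # skip judges that fail to follow the rubric
    verdicts = [s >= threshold for s in scores]
    return {
        "scores": scores,
        "mean_score": mean(scores) if scores else None,
        "pass": sum(verdicts) > len(verdicts) / 2 if verdicts else None,  # majority vote
    }
```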
10 Foot Guns in Fine-Tuning and Few-Shots
Jason Liu’s blog post where he shares real stories, practical solutions, and lessons learned from some of the biggest pitfalls in AI deployment.
PrivacyLens: Evaluating Privacy Norm Awareness of Language Models in Action
A blog post from Stanford AI Lab about PrivacyLens, a framework for evaluating privacy norm awareness of language models in action and the inference-time privacy risks associated with LM agents.
10x Cheaper PDF Processing: Ingesting and RAG on Millions of Documents with Gemini 2.0 Flash
An article about processing PDFs for RAG (Retrieval-Augmented Generation) using Gemini 2.0 Flash, making ingestion cheaper and more efficient.
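As a rough sketch of the ingestion step, the snippet below uses the google-generativeai SDK to have Gemini 2.0 Flash transcribe a PDF into markdown chunks; the prompt and post-processing are illustrative assumptions, not the article's exact pipeline.

```python
# Rough sketch of PDF-to-markdown ingestion with Gemini 2.0 Flash for RAG.
# The prompt and chunking are illustrative assumptions, not the article's pipeline.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.0-flash")

# Upload the PDF once via the Files API, then ask the model to transcribe it.
pdf_file = genai.upload_file("report.pdf")
response = model.generate_content([
    pdf_file,
    "Transcribe this document into clean markdown, preserving tables and headings. "
    "Split it into self-contained chunks separated by '---'.",
])

chunks = [c.strip() for c in response.text.split("---") if c.strip()]
print(f"Extracted {len(chunks)} chunks ready for embedding and retrieval.")
```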
A new generation of AIs: Claude 3.7 and Grok 3
Ethan Mollick discusses the capabilities of the new AI models, Claude 3.7 and Grok 3, and their implications for the future of AI.
Libraries & Code
A self-adaptation framework that adapts LLMs to unseen tasks in real time.
A simple screen parsing tool for pure vision-based GUI agents.
Reliable LLM Memory for AI Applications and AI Agents.
Transform data and create rich visualizations iteratively with AI.
Papers & Publications
Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction
Abstract:
Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face limitations such as high costs in voice data collection, weakness in dynamic control, and limited intelligence. To address these challenges, this paper introduces Step-Audio, the first production-ready open-source solution. Key contributions include: 1) a 130B-parameter unified speech-text multi-modal model that achieves unified understanding and generation, with the Step-Audio-Chat version open-sourced; 2) a generative speech data engine that establishes an affordable voice cloning framework and produces the open-sourced lightweight Step-Audio-TTS-3B model through distillation; 3) an instruction-driven fine control system enabling dynamic adjustments across dialects, emotions, singing, and RAP; 4) an enhanced cognitive architecture augmented with tool calling and role-playing abilities to manage complex tasks effectively. Based on our new StepEval-Audio-360 evaluation benchmark, Step-Audio achieves state-of-the-art performance in human evaluations, especially in terms of instruction following. On open-source benchmarks like LLaMA Question, it shows a 9.3% average performance improvement, demonstrating our commitment to advancing the development of open-source multi-modal language technologies.
MoBA: Mixture of Block Attention for Long-Context LLMs
Abstract:
Scaling the effective context length is essential for advancing large language models (LLMs) toward artificial general intelligence (AGI). However, the quadratic increase in computational complexity inherent in traditional attention mechanisms presents a prohibitive overhead. Existing approaches either impose strongly biased structures, such as sink or window attention which are task-specific, or radically modify the attention mechanism into linear approximations, whose performance in complex reasoning tasks remains inadequately explored.
In this work, we propose a solution that adheres to the "less structure" principle, allowing the model to determine where to attend autonomously, rather than introducing predefined biases. We introduce Mixture of Block Attention (MoBA), an innovative approach that applies the principles of Mixture of Experts (MoE) to the attention mechanism. This novel architecture demonstrates superior performance on long-context tasks while offering a key advantage: the ability to seamlessly transition between full and sparse attention, enhancing efficiency without the risk of compromising performance. MoBA has already been deployed to support Kimi's long-context requests and demonstrates significant advancements in efficient attention computation for LLMs.
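For intuition, here is a simplified single-head PyTorch sketch of the routing idea (not the paper's implementation): keys and values are split into blocks, a gate scores each block using its mean-pooled key, and each query attends only within its top-k blocks. Causal masking and the paper's handling of the current block are omitted for brevity.

```python
# Simplified single-head sketch of MoBA-style block attention (not the paper's code).
# Tail tokens beyond a full block are ignored in this toy version.
import torch
import torch.nn.functional as F

def moba_attention(q, k, v, block_size: int = 64, top_k: int = 2):
    # q, k, v: [seq_len, dim]
    seq_len, dim = k.shape
    n_blocks = seq_len // block_size
    k_blocks = k[: n_blocks * block_size].view(n_blocks, block_size, dim)
    v_blocks = v[: n_blocks * block_size].view(n_blocks, block_size, dim)

    # Gate: score each block by the query's affinity to the block's mean-pooled key.
    block_keys = k_blocks.mean(dim=1)                     # [n_blocks, dim]
    gate_scores = q @ block_keys.T                        # [seq_len, n_blocks]
    top_blocks = gate_scores.topk(top_k, dim=-1).indices  # [seq_len, top_k]

    out = torch.zeros_like(q)
    for i in range(q.shape[0]):
        sel_k = k_blocks[top_blocks[i]].reshape(-1, dim)  # [top_k * block_size, dim]
        sel_v = v_blocks[top_blocks[i]].reshape(-1, dim)
        attn = F.softmax(q[i] @ sel_k.T / dim ** 0.5, dim=-1)
        out[i] = attn @ sel_v                             # attend only within selected blocks
    return out

q = torch.randn(256, 64)
k = torch.randn(256, 64)
v = torch.randn(256, 64)
print(moba_attention(q, k, v).shape)  # torch.Size([256, 64])
```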