Deep Learning Weekly: Issue 381
Introducing the Model Context Protocol, Which Foundation Model is best for Agent Orchestration, a paper on Agent-as-a-Judge: Evaluate Agents with Agents, and many more!
This week in deep learning, we bring you Introducing the Model Context Protocol, Which Foundation Model is best for Agent Orchestration, and a paper on Agent-as-a-Judge: Evaluate Agents with Agents.
You may also enjoy Fugatto, World’s Most Flexible Sound Machine, Debuts, Faster Text Generation with Self-Speculative Decoding, a paper on Re-Invoke: Tool Invocation Rewriting for Zero-Shot Tool Retrieval, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Introducing the Model Context Protocol
Anthropic open-sourced the Model Context Protocol (MCP), a new standard for connecting AI assistants to the systems where data lives.
Fugatto, World’s Most Flexible Sound Machine, Debuts
Using text and audio as inputs, a new generative AI model from NVIDIA can create any combination of music, voices and sounds.
Salesforce Introduces Agentforce Testing Center
Salesforce announced agentic lifecycle management tools to automate Agentforce testing, prototype agents in Sandbox environments, and manage usage at scale.
Meta Opens Its AI Model for the U.S. Military
Nick Clegg, Meta’s president of global affairs, announced that Meta will allow use of Llama for U.S. national security purposes.
Luma expands Dream Machine AI video model into full creative platform, mobile app
Luma AI is expanding its Dream Machine AI video model with a new interface, mobile app, and new image generation model.
Enveda Biosciences raises $130M to advance AI-driven drug discovery from natural compounds
Enveda Biosciences, a company that leverages AI to develop new medicines, announced that it has raised $130 million to deliver clinical catalysts across multiple programs with strong commercial opportunities.
New AI tool generates realistic satellite images of future flooding
MIT scientists have developed a method that generates realistic satellite imagery depicting how a region would look after a potential flooding event.
MLOps & LLMOps
Which Foundation Model is best for Agent Orchestration
A blog post analyzing five different LLMs for agent orchestration, based on factors such as compliance, working memory, and precision.
A blog post on how to implement perplexity from scratch in Python, and how to add it to your evaluation suite using Opik, an open-source LLM evaluation framework.
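As a minimal illustration of the metric itself (not Opik's API), perplexity can be computed directly from per-token log-probabilities; the token probabilities below are made-up values for the example:

```python
import math

def perplexity(log_probs):
    """Perplexity is the exponential of the negative mean
    per-token natural-log probability."""
    return math.exp(-sum(log_probs) / len(log_probs))

# Toy example: a three-token sequence the model assigned
# probabilities 0.5, 0.25, and 0.125 (hypothetical values)
probs = [0.5, 0.25, 0.125]
ppl = perplexity([math.log(p) for p in probs])  # 4.0, the inverse geometric mean
```

A lower perplexity means the model found the sequence less surprising; a real evaluation would average over a held-out corpus rather than a single toy sequence.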
An article that introduces the core concepts of Swarm (Routines and Handoffs) and implements them step by step using Haystack.
Deploy and serve open models over Google Kubernetes Engine
An informative blog post explaining how Google Cloud Platform users can deploy and serve the large open model, Llama 3.1 405B FP16, on Google Kubernetes Engine.
Constructing a Knowledge Graph with LlamaIndex and Memgraph
An article about how to construct a knowledge graph using LlamaIndex and Memgraph.
Learning
Introducing SPDL: Faster AI model training with thread-based data loading
Meta introduces SPDL, a framework-agnostic data loading solution that utilizes multi-threading.
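The general pattern of thread-based data loading can be sketched in plain Python with a thread pool that prefetches items while preserving order. This is an illustration of the idea, not SPDL's actual API; `threaded_loader` and its parameters are invented for the example:

```python
from concurrent.futures import ThreadPoolExecutor

def threaded_loader(items, load_fn, num_workers=4, prefetch=8):
    """Toy thread-based data loader: loads items concurrently in a
    thread pool, keeping up to `prefetch` loads in flight while
    yielding results in the original order."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        it = iter(items)
        futures = []
        # prime the prefetch window
        for _ in range(prefetch):
            try:
                futures.append(pool.submit(load_fn, next(it)))
            except StopIteration:
                break
        # yield in order, refilling the window as results complete
        while futures:
            yield futures.pop(0).result()
            try:
                futures.append(pool.submit(load_fn, next(it)))
            except StopIteration:
                pass
```

Threads work well here because data loading is I/O- and decode-bound, so workers overlap waiting time without the serialization overhead of process-based loaders.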
How we speed up filtered vector search with ACORN
A blog post discussing how Weaviate implemented ACORN, a new strategy to improve the performance of HNSW graph indexes for filtered vector search.
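To see why filter-aware traversal matters, here is a toy best-first search over a proximity graph in which nodes failing the filter are still expanded (so the search can route through them) but never returned. This loosely mimics the idea behind ACORN and is not Weaviate's implementation; a real HNSW search would also bound the frontier, which is omitted here for brevity:

```python
import heapq

def filtered_graph_search(graph, vectors, query, passes_filter, entry, k=2):
    """Best-first search over a proximity graph. Nodes that fail the
    filter are traversed (their neighbors get expanded) so sparse
    filters do not disconnect the search, but only passing nodes are
    returned."""
    def dist(i):
        return sum((a - b) ** 2 for a, b in zip(vectors[i], query))

    visited = {entry}
    frontier = [(dist(entry), entry)]
    results = []  # max-heap of (-distance, node), capped at k
    while frontier:
        d, node = heapq.heappop(frontier)
        if passes_filter(node):
            heapq.heappush(results, (-d, node))
            if len(results) > k:
                heapq.heappop(results)
        for nb in graph[node]:
            if nb not in visited:
                visited.add(nb)
                heapq.heappush(frontier, (dist(nb), nb))
    return [node for _, node in sorted((-d, n) for d, n in results)]
```

The contrast is with post-filtering, which searches the graph first and drops non-matching results afterward, often returning fewer than `k` hits when the filter is selective.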
Faster Text Generation with Self-Speculative Decoding
A technical blog post that explores self-speculative decoding, a novel technique that speeds up text generation in large language models while also saving memory and reducing computational latency.
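The core draft/verify loop of speculative decoding (greedy variant) can be sketched as below; in the self-speculative setting, the draft model is the same network with some layers skipped rather than a separate model, which is where the memory savings come from. `target` and `draft` here are hypothetical next-token functions, not a real model API:

```python
def speculative_decode(target, draft, prompt, n_draft=4, max_new=8):
    """Greedy speculative decoding sketch: the cheap draft proposes
    n_draft tokens, the expensive target verifies them, and the
    longest agreeing prefix is kept plus one token from the target."""
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        # draft phase: propose n_draft tokens autoregressively
        ctx = list(seq)
        proposal = []
        for _ in range(n_draft):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # verify phase: accept while the target's greedy choice agrees
        ctx = list(seq)
        for t in proposal:
            if target(ctx) != t:
                break
            seq.append(t)
            ctx.append(t)
        # always emit one token from the target (a correction or a bonus)
        seq.append(target(seq))
    return seq[: len(prompt) + max_new]
```

Because the target verifies every emitted token, the output matches plain greedy decoding regardless of draft quality; a good draft only changes how many tokens are accepted per expensive target pass.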
Supercharging Training using float8 and FSDP2
An article on how to achieve a 50% speedup in PyTorch model training throughput through the combination of Float8, FSDP2, and DTensor.
Libraries & Code
The Fastest Deep Reinforcement Learning Library.
Docling parses documents and exports them to the desired format with ease and speed.
An out-of-the-box (OOTB) version of Anthropic Claude Computer Use for Windows and macOS.
Papers & Publications
Agent-as-a-Judge: Evaluate Agents with Agents
Abstract:
Contemporary evaluation techniques are inadequate for agentic systems. These approaches either focus exclusively on final outcomes, ignoring the step-by-step nature of agentic systems, or require excessive manual labour. To address this, we introduce the Agent-as-a-Judge framework, wherein agentic systems are used to evaluate agentic systems. This is an organic extension of the LLM-as-a-Judge framework, incorporating agentic features that enable intermediate feedback for the entire task-solving process. We apply Agent-as-a-Judge to the task of code generation. To overcome issues with existing benchmarks and provide a proof-of-concept testbed for Agent-as-a-Judge, we present DevAI, a new benchmark of 55 realistic automated AI development tasks with rich manual annotations, including a total of 365 hierarchical user requirements. We benchmark three popular agentic systems using Agent-as-a-Judge and find that it dramatically outperforms LLM-as-a-Judge and is as reliable as our human evaluation baseline. Altogether, we believe that Agent-as-a-Judge marks a concrete step forward for modern agentic systems, providing the rich and reliable reward signals necessary for dynamic and scalable self-improvement.
Re-Invoke: Tool Invocation Rewriting for Zero-Shot Tool Retrieval
Abstract:
Recent advances in large language models (LLMs) have enabled autonomous agents with complex reasoning and task-fulfillment capabilities using a wide range of tools. However, effectively identifying the most relevant tools for a given task becomes a key bottleneck as the toolset size grows, hindering reliable tool utilization. To address this, we introduce Re-Invoke, an unsupervised tool retrieval method designed to scale effectively to large toolsets without training. Specifically, we first generate a diverse set of synthetic queries that comprehensively cover different aspects of the query space associated with each tool document during the tool indexing phase. Second, we leverage an LLM's query understanding capabilities to extract key tool-related context and underlying intents from user queries during the inference phase. Finally, we employ a novel multi-view similarity ranking strategy based on intents to pinpoint the most relevant tools for each query. Our evaluation demonstrates that Re-Invoke significantly outperforms state-of-the-art alternatives in both single-tool and multi-tool scenarios, all within a fully unsupervised setting. Notably, on the ToolE datasets, we achieve a 20% relative improvement in nDCG@5 for single-tool retrieval and a 39% improvement for multi-tool retrieval.
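The multi-view idea, scoring each tool by its best-matching synthetic query, can be illustrated with a toy bag-of-words retriever. Re-Invoke itself uses LLM-generated queries, learned embeddings, and query-intent extraction (omitted here); the tool names and queries below are invented for the example:

```python
import math
from collections import Counter

def embed(text):
    # toy bag-of-words "embedding"; a real system uses dense embeddings
    return Counter(text.lower().split())

def cosine(a, b):
    num = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return num / (na * nb) if na and nb else 0.0

# Synthetic queries per tool, generated at indexing time
# (tool names and queries are made-up examples).
tool_queries = {
    "weather_api": ["what is the forecast tomorrow", "will it rain today"],
    "calculator": ["compute two plus two", "evaluate this expression"],
}

def retrieve_tool(user_query, tool_queries):
    q = embed(user_query)
    # multi-view ranking: a tool's score is its best-matching view
    scores = {
        tool: max(cosine(q, embed(sq)) for sq in views)
        for tool, views in tool_queries.items()
    }
    return max(scores, key=scores.get)
```

Indexing synthetic queries instead of raw tool documentation narrows the vocabulary gap between how users phrase requests and how tools are described.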
Moirai-MoE: Empowering Time Series Foundation Models with Sparse Mixture of Experts
Abstract:
Time series foundation models have demonstrated impressive performance as zero-shot forecasters. However, achieving effectively unified training on time series remains an open challenge. Existing approaches introduce some level of model specialization to account for the highly heterogeneous nature of time series data. For instance, Moirai pursues unified training by employing multiple input/output projection layers, each tailored to handle time series at a specific frequency. Similarly, TimesFM maintains a frequency embedding dictionary for this purpose. We identify two major drawbacks to this human-imposed frequency-level model specialization: (1) Frequency is not a reliable indicator of the underlying patterns in time series. For example, time series with different frequencies can display similar patterns, while those with the same frequency may exhibit varied patterns. (2) Non-stationarity is an inherent property of real-world time series, leading to varied distributions even within a short context window of a single time series. Frequency-level specialization is too coarse-grained to capture this level of diversity. To address these limitations, this paper introduces Moirai-MoE, using a single input/output projection layer while delegating the modeling of diverse time series patterns to the sparse mixture of experts (MoE) within Transformers. With these designs, Moirai-MoE reduces reliance on human-defined heuristics and enables automatic token-level specialization. Extensive experiments on 39 datasets demonstrate the superiority of Moirai-MoE over existing foundation models in both in-distribution and zero-shot scenarios. Furthermore, this study conducts comprehensive model analyses to explore the inner workings of time series MoE foundation models and provides valuable insights for future research.
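The token-level expert specialization the abstract describes can be sketched with a minimal sparse-MoE layer: a linear gate scores the experts for each token, and only the top-k experts are evaluated. This is a generic sketch with made-up weights, not Moirai-MoE's architecture:

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def moe_layer(token, gate_w, experts, top_k=1):
    """Toy sparse MoE: the gate scores each expert for this token and
    only the top_k experts run; their outputs are mixed with the
    renormalized gate probabilities."""
    scores = [sum(w * x for w, x in zip(row, token)) for row in gate_w]
    probs = softmax(scores)
    top = sorted(range(len(experts)), key=lambda i: -probs[i])[:top_k]
    norm = sum(probs[i] for i in top)
    out = [0.0] * len(token)
    for i in top:
        y = experts[i](token)  # only selected experts are evaluated
        out = [o + (probs[i] / norm) * yi for o, yi in zip(out, y)]
    return out
```

Because routing happens per token rather than per frequency bucket, different patches of the same series can be handled by different experts, which is the specialization the paper argues frequency-based heuristics are too coarse to capture.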