Deep Learning Weekly: Issue 397
Gemini 2.5, Scaling Supervision or: How We Learned to Stop Worrying and Love Bitbucket Pipelines, a paper on VGGT: Visual Geometry Grounded Transformer, and many more!
This week in deep learning, we bring you Gemini 2.5, Scaling Supervision or: How We Learned to Stop Worrying and Love Bitbucket Pipelines, and a paper on VGGT: Visual Geometry Grounded Transformer.
You may also enjoy Microsoft Security Copilot Agents, The Future of AI Agents is Event-Driven, a paper on Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Gemini 2.5: Our most intelligent AI model
DeepMind introduced Gemini 2.5, their most intelligent AI model, which has a 1 million token context window and improved reasoning capabilities.
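For readers who want to try it, here is a minimal sketch of calling the model through the google-genai Python SDK. The model ID shown is the experimental identifier used at launch and may have changed since, so treat it as an assumption.

```python
# Minimal sketch: calling Gemini 2.5 via the google-genai Python SDK.
# The model ID is the experimental identifier at launch -- an assumption.
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key

response = client.models.generate_content(
    model="gemini-2.5-pro-exp-03-25",  # assumed model ID
    contents="Summarize the trade-offs between RAG and long-context prompting.",
)
print(response.text)
```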
Microsoft unveils Microsoft Security Copilot agents and new protections for AI
The Microsoft team announced the next evolution of Security Copilot, with AI agents designed to assist with critical areas such as phishing, data security, and identity management.
Claude can now search the web
Anthropic’s Claude can now search the web, incorporating up-to-date information into its responses with direct citations to the sources it draws on.
Nexthop AI launches with $110M to build next-gen cloud AI infrastructure
Nexthop AI launched with $110 million in funding led by Lightspeed Ventures to help the world’s largest cloud companies build the next generation of AI infrastructure.
MLOps & LLMOps
The Future of AI Agents is Event-Driven
An article arguing that the future of AI agents depends on event-driven architecture to achieve scalability and interoperability.
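The core pattern is easy to sketch: agents subscribe to topics, react to events, and publish new events rather than calling one another directly. Below is a minimal, framework-agnostic illustration using kafka-python; the topic names and the handle() function are hypothetical.

```python
# Minimal sketch of an event-driven agent loop using kafka-python.
# Topic names and the handle() logic are hypothetical illustrations.
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "agent.tasks",  # hypothetical input topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def handle(task: dict) -> dict:
    """Stand-in for the agent's reasoning step (e.g., an LLM call)."""
    return {"task_id": task["id"], "result": f"processed: {task['goal']}"}

# Each agent reacts to events and emits new ones instead of being
# invoked directly -- the decoupling the article argues for.
for message in consumer:
    result = handle(message.value)
    producer.send("agent.results", result)  # hypothetical output topic
```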
Comparing Open-Source AI Agent Frameworks
An article that explores and compares the leading open-source AI agent frameworks: LangGraph, OpenAI Agents, Smolagents, CrewAI, LlamaIndex agents, and more.
Using Spanner Graph with LangChain for GraphRAG
A demonstrative blog post showing how to build GraphRAG applications using Spanner Graph and LangChain to enhance knowledge retrieval with interconnected data.
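As a rough sketch of the flow the post demonstrates: connect to a property graph in Spanner, then let a LangChain QA chain retrieve interconnected facts and ground the LLM's answer in them. The class names and arguments below follow the langchain-google-spanner package as described in the post, but treat them as assumptions and verify against the current package.

```python
# Sketch of GraphRAG over Spanner Graph with LangChain.
# Class names and constructor arguments are assumptions based on the post --
# verify against the current langchain-google-spanner package.
from langchain_google_spanner import SpannerGraphStore, SpannerGraphQAChain
from langchain_google_vertexai import ChatVertexAI

# Connect to an existing property graph in Spanner (IDs are placeholders).
graph = SpannerGraphStore(
    instance_id="my-instance",
    database_id="my-database",
    graph_name="my_graph",
)

llm = ChatVertexAI(model="gemini-1.5-pro")  # placeholder model choice

# The chain translates the question into a graph query, retrieves the
# relevant subgraph, and grounds the LLM's answer in it.
chain = SpannerGraphQAChain.from_llm(llm=llm, graph=graph)
print(chain.invoke({"query": "Which suppliers are connected to delayed shipments?"}))
```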
Learning
Scaling Supervision or: How We Learned to Stop Worrying and Love Bitbucket Pipelines
A post that dives into how Fetch Technology is rethinking its data infrastructure and lifecycle management to deeply integrate human-in-the-loop workflows, with Bitbucket Pipelines handling the orchestration.
Open R1: How to use OlympicCoder locally for coding
A practical blog post providing a step-by-step guide on how to use the open-source OlympicCoder model locally for coding assistance with LM Studio and VS Code.
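Once the model is loaded, LM Studio serves an OpenAI-compatible endpoint (by default on localhost:1234), so any OpenAI client can talk to it. A minimal sketch follows; the model identifier is an assumption and should match whatever name LM Studio shows for your downloaded copy.

```python
# Minimal sketch: querying OlympicCoder served locally by LM Studio,
# which exposes an OpenAI-compatible API (default: localhost:1234).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="open-r1/OlympicCoder-7B",  # assumed local model name
    messages=[
        {
            "role": "user",
            "content": "Write a Python function that checks whether a string is a palindrome.",
        }
    ],
)
print(response.choices[0].message.content)
```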
Model Context Protocol explained as simply as possible
A blog post that briefly and simply explains Anthropic’s Model Context Protocol (MCP) as a universal interface for integrating LLM tools.
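Under the hood, MCP messages are plain JSON-RPC 2.0. Here is a minimal sketch of the two requests at the heart of tool use, tools/list and tools/call; the tool name and arguments are hypothetical, and the transport (stdio or HTTP) is omitted for brevity.

```python
# Minimal sketch of the JSON-RPC 2.0 messages that MCP is built on.
import json

# Client asks a server which tools it offers.
list_tools = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/list",
}

# Client invokes one of those tools by name with arguments.
call_tool = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
        "name": "get_weather",  # hypothetical tool name
        "arguments": {"city": "Berlin"},
    },
}

print(json.dumps(list_tools))
print(json.dumps(call_tool))
```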
Parsing PDFs with LlamaParse: a how-to guide
A comprehensive how-to guide explaining how to use LlamaParse to simplify the process of extracting information from PDF documents for LLM applications.
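The basic usage boils down to a few lines. Here is a minimal sketch with the llama-parse Python package; the API key and file path are placeholders.

```python
# Minimal sketch of parsing a PDF with LlamaParse (pip install llama-parse).
# The API key and file path are placeholders.
from llama_parse import LlamaParse

parser = LlamaParse(
    api_key="llx-...",       # LlamaCloud API key (placeholder)
    result_type="markdown",  # "markdown" or "text"
)

documents = parser.load_data("example.pdf")  # placeholder path
for doc in documents:
    print(doc.text[:500])
```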
Libraries & Code
A high-performance distributed file system designed to address the challenges of AI training and inference workloads.
A collection of reference implementations for the Model Context Protocol (MCP).
Papers & Publications
VGGT: Visual Geometry Grounded Transformer
Abstract:
We present VGGT, a feed-forward neural network that directly infers all key 3D attributes of a scene, including camera parameters, point maps, depth maps, and 3D point tracks, from one, a few, or hundreds of its views. This approach is a step forward in 3D computer vision, where models have typically been constrained to and specialized for single tasks. It is also simple and efficient, reconstructing images in under one second, and still outperforming alternatives that require post-processing with visual geometry optimization techniques. The network achieves state-of-the-art results in multiple 3D tasks, including camera parameter estimation, multi-view depth estimation, dense point cloud reconstruction, and 3D point tracking. We also show that using pretrained VGGT as a feature backbone significantly enhances downstream tasks, such as non-rigid point tracking and feed-forward novel view synthesis.
Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens
Abstract:
Recent advancements in large language models (LLMs) have driven significant progress in zero-shot text-to-speech (TTS) synthesis. However, existing foundation models rely on multi-stage processing or complex architectures for predicting multiple codebooks, limiting efficiency and integration flexibility. To overcome these challenges, we introduce Spark-TTS, a novel system powered by BiCodec, a single-stream speech codec that decomposes speech into two complementary token types: low-bitrate semantic tokens for linguistic content and fixed-length global tokens for speaker attributes. This disentangled representation, combined with the Qwen2.5 LLM and a chain-of-thought (CoT) generation approach, enables both coarse-grained control (e.g., gender, speaking style) and fine-grained adjustments (e.g., precise pitch values, speaking rate). To facilitate research in controllable TTS, we introduce VoxBox, a meticulously curated 100,000-hour dataset with comprehensive attribute annotations. Extensive experiments demonstrate that Spark-TTS not only achieves state-of-the-art zero-shot voice cloning but also generates highly customizable voices that surpass the limitations of reference-based synthesis.