Deep Learning Weekly: Issue 379
DeepMind open-sources AlphaFold 3, Unintended Impacts of Alignment on Global Representation, a paper on Device-Directed Speech Detection for Follow-up Conversations Using LLMs, and many more!
This week in deep learning, we bring you Google DeepMind open-sources AlphaFold 3, Unintended Impacts of Alignment on Global Representation, and a paper on Device-Directed Speech Detection for Follow-up Conversations Using Large Language Models.
You may also enjoy AI power: Expanding data center capacity to meet growing demand, Generating zero-shot personalized portraits, a paper on Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Google DeepMind open-sources AlphaFold 3
Google DeepMind has made the code and model weights of AlphaFold 3 available for academic use, a significant advance that could accelerate scientific discovery and drug development.
AI power: Expanding data center capacity to meet growing demand
McKinsey’s detailed blog post about the opportunities in the data center market driven by the increasing demand for AI.
Despite its impressive output, generative AI doesn’t have a coherent understanding of the world
Researchers show that even the best-performing large language models don’t form a true model of the world and its rules, and can thus fail unexpectedly on similar tasks.
Mistral AI introduced a new moderation service enabling users to detect undesirable text content along several policy dimensions.
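For a sense of the workflow, here is a minimal sketch using the official `mistralai` Python client. The model name (`mistral-moderation-latest`) and response fields follow the service's launch documentation and are assumptions here; they may differ across client versions.

```python
import os
from mistralai import Mistral

# Minimal moderation call; model name and response field names follow
# the launch documentation and may differ across client versions.
client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])
response = client.classifiers.moderate(
    model="mistral-moderation-latest",
    inputs=["Some user-generated text to screen."],
)
# Each result carries per-policy-dimension scores (e.g. hate, violence).
print(response.results[0].category_scores)
```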
Microsoft-backed startup debuts task-optimized enterprise AI models that run on CPUs
A startup focused on enterprise AI is emerging from stealth, promising 'task-optimized' models that deliver better performance using only general-purpose CPUs.
Generative AI startup Writer raises $200M at a $1.9B valuation
Writer has raised $200 million at a $1.9 billion valuation to expand its enterprise-focused generative AI platform.
MLOps & LLMOps
Matryoshka Embeddings: Detail at Multiple Scales
A post introducing Matryoshka embeddings for efficient similarity search, outlining the process and highlighting its benefits and limitations.
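For intuition, here is a minimal two-stage search sketch, assuming the vectors come from a model trained with a Matryoshka-style loss so that the leading dimensions carry most of the signal; the 64-dimension coarse pass and 100-candidate shortlist are arbitrary choices for illustration.

```python
import numpy as np

def truncate_and_normalize(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` coordinates and re-normalize to unit length."""
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)

# Toy corpus of full-size (e.g. 768-d) Matryoshka embeddings.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(10_000, 768)).astype(np.float32)
query = rng.normal(size=(1, 768)).astype(np.float32)

# Coarse pass: cheap cosine search in the first 64 dimensions only.
coarse_corpus = truncate_and_normalize(corpus, 64)
coarse_query = truncate_and_normalize(query, 64)
candidates = np.argsort(coarse_corpus @ coarse_query.T, axis=0)[::-1][:100].ravel()

# Fine pass: re-rank only the shortlist at full dimensionality.
fine_corpus = truncate_and_normalize(corpus[candidates], 768)
fine_query = truncate_and_normalize(query, 768)
best = candidates[np.argmax(fine_corpus @ fine_query.T)]
print("best match:", best)
```

The trade-off is the one the post highlights: the coarse pass touches far less memory per vector, at the cost of a small recall risk if the true neighbor falls outside the shortlist.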
Deploy and serve open models over Google Kubernetes Engine
A technical blog post explaining how to deploy and serve open generative AI models across multiple hosts on Google Kubernetes Engine.
Building a Multimodal Nutrition Agent
An article detailing how to build a multimodal agent that can interpret nutrition fact labels.
Designing Cognitive Architectures: Agentic Workflow Patterns from Scratch
A deep dive into eight agentic workflow patterns that enhance AI systems' capabilities, with explanations, examples, and GitHub repository links for each pattern.
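As a flavor of the simplest of these, here is a hedged sketch of a generate-critique-revise ("reflection") loop. The `llm` callable is a hypothetical stand-in for any text-in/text-out model call, not an API from the article, and the article's eight patterns are considerably richer than this.

```python
from typing import Callable

def reflect_and_revise(llm: Callable[[str], str], task: str, rounds: int = 2) -> str:
    """Reflection pattern: draft an answer, self-critique it, then revise."""
    draft = llm(f"Complete the task:\n{task}")
    for _ in range(rounds):
        critique = llm(
            f"Task:\n{task}\n\nDraft:\n{draft}\n\n"
            "List the most important concrete flaws in the draft."
        )
        draft = llm(
            f"Task:\n{task}\n\nDraft:\n{draft}\n\nCritique:\n{critique}\n\n"
            "Rewrite the draft, fixing every flaw."
        )
    return draft

# Smoke test with a trivial stand-in "model" that just echoes prompt lengths.
print(reflect_and_revise(lambda p: f"[{len(p)} chars in]", "Summarize MoE routing."))
```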
Learning
Unintended Impacts of Alignment on Global Representation
A technical blog post analyzing the unintended global impacts of aligning LLMs to user preferences, focusing on English dialects, multilingualism, and opinions from and about different countries.
Generating zero-shot personalized portraits
A post that introduces a novel zero-shot image-to-image (I2I) model specifically designed for personalized and stylized selfie generation.
What Makes a True AI Agent? Rethinking the Pursuit of Autonomy
A blog post about the definition of a true AI agent, which offers a framework to assess agentic behavior based on six attributes, and emphasizes the importance of foundations over hype.
A blog post explaining the "evals gap" in evaluating the safety of frontier AI models.
Libraries & Code
anthropics/anthropic-quickstarts
A collection of projects designed to help developers quickly get started with building deployable applications using the Anthropic API.
ComposioHQ/composio
Composio equips your AI agents & LLMs with 100+ high-quality integrations via function calling.
A framework designed for learning generalized tool-use abilities in compact language models with minimal human supervision.
Papers & Publications
Device-Directed Speech Detection for Follow-up Conversations Using Large Language Models
Abstract:
Follow-up conversations with virtual assistants (VAs) enable a user to seamlessly interact with a VA without the need to repeatedly invoke it using a keyword (after the first query). Therefore, accurate Device-Directed Speech Detection (DDSD) from the follow-up queries is critical for enabling a naturalistic user experience. To this end, we explore the use of Large Language Models (LLMs) and model the first query when making inferences about the follow-ups (based on the ASR-decoded text), via prompting of a pretrained LLM, or by adapting a binary classifier on top of the LLM. In doing so, we also exploit the ASR uncertainty when designing the LLM prompts. We show on a real-world dataset of follow-up conversations that this approach yields large gains (20-40% reduction in false alarms at 10% fixed false rejects) due to the joint modeling of the previous speech context and ASR uncertainty, compared to when follow-ups are modeled alone.
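To make the idea concrete, here is an illustrative prompt builder showing how the first query and ASR n-best uncertainty could be surfaced to an LLM for a yes/no device-directedness decision. This is a sketch of the general approach only; the paper's actual prompts and classifier head differ.

```python
def build_ddsd_prompt(first_query: str, nbest: list[tuple[str, float]]) -> str:
    """Assemble a prompt exposing the prior query and ASR uncertainty
    (n-best hypotheses with confidence scores) to a pretrained LLM."""
    hyps = "\n".join(f'- "{text}" (ASR confidence {conf:.2f})' for text, conf in nbest)
    return (
        "A user previously asked a virtual assistant:\n"
        f'"{first_query}"\n\n'
        "The microphone then picked up a follow-up utterance. ASR produced "
        "these candidate transcripts:\n"
        f"{hyps}\n\n"
        "Is the follow-up directed at the assistant (rather than background "
        "speech or a side conversation)? Answer 'yes' or 'no'."
    )

# Example: a plausible follow-up with one confident and one noisy hypothesis.
print(build_ddsd_prompt(
    "What's the weather in Paris?",
    [("and how about tomorrow", 0.91), ("hand out tomorrow", 0.07)],
))
```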
Autoregressive Models in Vision: A Survey
Abstract:
Autoregressive modeling has been a huge success in the field of natural language processing (NLP). Recently, autoregressive models have emerged as a significant area of focus in computer vision, where they excel in producing high-quality visual content. Autoregressive models in NLP typically operate on subword tokens. However, the representation strategy in computer vision can vary across different levels, i.e., pixel-level, token-level, or scale-level, reflecting the diverse and hierarchical nature of visual data compared to the sequential structure of language. This survey comprehensively examines the literature on autoregressive models applied to vision. To improve readability for researchers from diverse research backgrounds, we start with preliminaries on sequence representation and modeling in vision. Next, we divide the fundamental frameworks of visual autoregressive models into three general sub-categories, including pixel-based, token-based, and scale-based models, according to their representation strategy. We then explore the interconnections between autoregressive models and other generative models. Furthermore, we present a multi-faceted categorization of autoregressive models in computer vision, including image generation, video generation, 3D generation, and multi-modal generation. We also elaborate on their applications in diverse domains, including emerging domains such as embodied AI and 3D medical AI, with about 250 related references. Finally, we highlight the current challenges to autoregressive models in vision with suggestions about potential research directions.
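All three sub-categories the survey names share the same chain-rule factorization; they differ only in the granularity of the unit $x_t$:

```latex
p(\mathbf{x}) \;=\; \prod_{t=1}^{T} p\!\left(x_t \mid x_{<t}\right)
```

where $x_t$ is a raw pixel value (pixel-based), a discrete codebook token (token-based), or all tokens of one resolution scale predicted jointly (scale-based).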
Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent
Abstract:
In this paper, we introduce Hunyuan-Large, which is currently the largest open-source Transformer-based mixture-of-experts model, with a total of 389 billion parameters and 52 billion activated parameters, capable of handling up to 256K tokens. We conduct a thorough evaluation of Hunyuan-Large's superior performance across various benchmarks including language understanding and generation, logical reasoning, mathematical problem-solving, coding, long-context, and aggregated tasks, where it outperforms LLama3.1-70B and exhibits comparable performance to the significantly larger LLama3.1-405B model. Key practices of Hunyuan-Large include large-scale synthetic data that is orders of magnitude larger than in previous literature, a mixed expert routing strategy, a key-value cache compression technique, and an expert-specific learning rate strategy. Additionally, we investigate the scaling laws and learning rate schedules of mixture-of-experts models, providing valuable insights and guidance for future model development and optimization. The code and checkpoints of Hunyuan-Large are released to facilitate future innovations and applications.
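As a loose illustration of the "shared expert plus routed experts" idea behind mixed routing, here is a toy NumPy sketch. It is not Hunyuan-Large's actual implementation; the sizes, top-1 routing, and ReLU stand-in experts are arbitrary assumptions for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, num_experts, top_k = 16, 4, 1

def make_expert():
    w = rng.normal(scale=0.1, size=(d_model, d_model))
    return lambda h: np.maximum(h @ w, 0.0)  # tiny ReLU layer as an expert stand-in

shared_expert = make_expert()                 # processes every token
experts = [make_expert() for _ in range(num_experts)]
router = rng.normal(scale=0.1, size=(d_model, num_experts))

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Each token passes through the shared expert plus its top-k routed experts."""
    logits = x @ router                        # (tokens, num_experts)
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    out = shared_expert(x)                     # always-on shared path
    for t in range(x.shape[0]):
        for e in np.argsort(probs[t])[-top_k:]:
            out[t] += probs[t, e] * experts[e](x[t : t + 1])[0]
    return out

tokens = rng.normal(size=(8, d_model))
print(moe_layer(tokens).shape)  # (8, 16)
```

The point of the design is the parameter/compute split the abstract describes: total capacity scales with the number of experts, while per-token compute stays near that of the shared path plus k routed experts.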