Deep Learning Weekly: Issue 364
Meta Segment Anything Model 2, Chip Huyen's Building A Generative AI Platform, Tool Calling Evals with Phoenix, a paper on Efficient Portrait Animation with Stitching and Retargeting Control, & more!
This week in deep learning, we bring you SAM 2, the next generation of the Meta Segment Anything Model for videos and images, Chip Huyen's Building A Generative AI Platform, Tool Calling Evals with Phoenix, and a paper on LivePortrait: Efficient Portrait Animation with Stitching and Retargeting Control.
You may also enjoy Introducing torchchat: Accelerating Local LLM Inference on Laptop, Desktop and Mobile, Building an AI Agent for Supply Chain Optimization with NVIDIA NIM and cuOpt, a paper on Very Large-Scale Multi-Agent Simulation in AgentScope, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Introducing SAM 2: The next generation of Meta Segment Anything Model for videos and images
Meta released SAM 2, a unified model for real-time promptable object segmentation in images and videos that achieves state-of-the-art performance.
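As a quick illustration, here is a minimal sketch based on the image-predictor interface in the sam2 repository's README; the config and checkpoint names are assumptions (use the ones shipped with the repo), and the default loader assumes a CUDA GPU.

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Config/checkpoint paths below are assumptions; substitute the files
# distributed with the sam2 repository.
predictor = SAM2ImagePredictor(
    build_sam2("sam2_hiera_l.yaml", "checkpoints/sam2_hiera_large.pt")
)

image = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in for a real HWC RGB image
with torch.inference_mode():
    predictor.set_image(image)
    # Prompt with a single foreground point (label 1) at pixel (320, 240).
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[320, 240]]),
        point_labels=np.array([1]),
    )
```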
xAI's Memphis Supercluster Goes Live
Elon Musk's AI venture, xAI, has launched the Memphis Supercluster, a groundbreaking AI training facility in Tennessee equipped with 100,000 Nvidia H100 GPUs.
Introducing torchchat: Accelerating Local LLM Inference on Laptop, Desktop and Mobile
The PyTorch team released torchchat, a library showcasing how to seamlessly and performantly run Llama 3, 3.1, and other large language models across laptop, desktop, and mobile.
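For a flavor of the workflow, here is a minimal sketch that drives torchchat's CLI from Python; it assumes a local checkout of the torchchat repo, and the download/generate subcommands follow the examples in the announcement.

```python
import subprocess

# Fetch model weights, then run one-off generation, mirroring the CLI
# examples from the torchchat announcement (run from a repo checkout).
subprocess.run(["python3", "torchchat.py", "download", "llama3.1"], check=True)
subprocess.run(
    ["python3", "torchchat.py", "generate", "llama3.1",
     "--prompt", "Explain KV caching in one sentence."],
    check=True,
)
```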
Mistral Large 2
Mistral AI announced Mistral Large 2, a model that is significantly more capable in code generation, mathematics, and reasoning than its predecessor.
AI model identifies certain breast tumor stages likely to progress to invasive cancer
An interdisciplinary team of researchers from MIT and ETH Zurich developed an AI model that can identify the different stages of ductal carcinoma in situ (DCIS) from a breast tissue image.
MLOps & LLMOps
Building A Generative AI Platform
Chip Huyen describes how to build a generative AI platform by starting from the simplest architecture and progressively adding more components.
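To make that layering concrete, here is a toy Python sketch of the progression: a bare model call wrapped with context construction, guardrails, and a cache. All function names here are hypothetical placeholders, not an API from the article.

```python
# Toy sketch of the layered platform idea: start with a bare model call and
# progressively wrap it with context enrichment, guardrails, and a cache.
cache: dict[str, str] = {}

def retrieve_context(query: str) -> str:
    # Placeholder for the RAG / context-construction component.
    return ""

def passes_guardrails(text: str) -> bool:
    # Placeholder for input/output safety checks.
    return "DROP TABLE" not in text

def call_model(prompt: str) -> str:
    # Placeholder for the actual LLM call (a model gateway in the full platform).
    return f"echo: {prompt}"

def answer(query: str) -> str:
    if not passes_guardrails(query):
        return "Request rejected by input guardrail."
    if query in cache:                       # exact-match cache layer
        return cache[query]
    output = call_model(retrieve_context(query) + query)
    if not passes_guardrails(output):
        return "Response withheld by output guardrail."
    cache[query] = output
    return output

print(answer("What is a model gateway?"))
```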
Building an AI Agent for Supply Chain Optimization with NVIDIA NIM and cuOpt
A post that demonstrates how linear programming and LLMs, paired with the NVIDIA cuOpt microservice for optimization AI, can help you tackle supply chain optimization challenges.
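As a toy stand-in for the optimization side (using SciPy's linprog here rather than the cuOpt microservice from the post), the following solves a two-plant, two-store shipment problem at minimum cost.

```python
# Toy supply-chain LP: ship goods from 2 plants to 2 stores at minimum cost,
# respecting plant capacities and meeting store demands exactly.
from scipy.optimize import linprog

cost = [4, 6, 5, 3]                  # cost per unit: p1->s1, p1->s2, p2->s1, p2->s2
A_ub = [[1, 1, 0, 0], [0, 0, 1, 1]]  # each plant ships at most its capacity
b_ub = [80, 70]                      # plant capacities
A_eq = [[1, 0, 1, 0], [0, 1, 0, 1]]  # each store's demand must be met exactly
b_eq = [60, 50]                      # store demands

res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
print(res.x, res.fun)  # optimal shipment plan and total cost
```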
LLM Knowledge Graph Builder: From Zero to GraphRAG in Five Minutes
An article that explores how GraphRAG and Neo4j empower users to transform unstructured data into dynamic knowledge graphs using LLMs.
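As a minimal sketch of the graph-building step, the snippet below writes LLM-extracted (subject, relation, object) triples into Neo4j with the official Python driver; the connection details and the toy triple are assumptions, and in the real builder the triples come from an LLM extraction pass.

```python
from neo4j import GraphDatabase

# Toy output of an LLM extraction step; in practice these come from the model.
triples = [("Marie Curie", "WON", "Nobel Prize in Physics")]

# Connection details are assumptions; point these at your Neo4j instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    for subj, rel, obj in triples:
        # Relationship types cannot be parameterized in Cypher, hence the
        # concatenation; sanitize `rel` before doing this with untrusted input.
        session.run(
            "MERGE (a:Entity {name: $subj}) "
            "MERGE (b:Entity {name: $obj}) "
            "MERGE (a)-[:" + rel + "]->(b)",
            subj=subj, obj=obj,
        )
driver.close()
```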
How to Use Comet's New Integration with Union & Flyte
An article that discusses how the combination of Union (an optimized version of the open-source solution Flyte) and Comet’s ML platform enhances scalability, declarative infrastructure, and data lineage for AI developers.
Learning
Generative AI in healthcare: Adoption trends and what’s next | McKinsey
An article highlighting the transformative power of generative AI in healthcare through survey data.
On Open-Weights Foundation Models | Federal Trade Commission
An article that discusses the potential benefits — such as driving innovation, reducing costs, and increasing consumer choice — associated with publicly available AI model weights.
A Visual Guide to Quantization
An introduction and visual guide to the field of quantization in the context of language modeling.
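As a worked example of one basic scheme such guides cover, absmax int8 quantization scales values by 127 / max|x|, rounds to integers, and dequantizes by the inverse scale:

```python
import numpy as np

x = np.array([0.3, -1.7, 0.9, 2.4], dtype=np.float32)
scale = 127 / np.max(np.abs(x))           # 127 / 2.4 ≈ 52.9
x_int8 = np.round(x * scale).astype(np.int8)
x_dequant = x_int8 / scale                # approximate reconstruction

print(x_int8)     # [ 16 -90  48 127]
print(x_dequant)  # values close to x, with small rounding error
```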
Architect Scalable and Cost-Effective LLM and RAG Inference Pipelines
A guide to building a scalable inference pipeline for serving LLMs and RAG systems.
Tool Calling Evals with Phoenix
A notebook that demonstrates how to evaluate the multi-step LLM logic involved in tool calling and more.
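The core idea can be sketched without Phoenix: score a model's tool call against a golden reference on both the function chosen and the arguments passed. The snippet below is a hand-rolled stand-in, not Phoenix's evaluators.

```python
import json

def score_tool_call(expected: dict, actual: dict) -> dict:
    # Compare the chosen function by name and the arguments as parsed JSON,
    # so key ordering and whitespace differences don't cause false failures.
    name_ok = expected["name"] == actual["name"]
    args_ok = json.loads(expected["arguments"]) == json.loads(actual["arguments"])
    return {"correct_tool": name_ok, "correct_args": args_ok}

expected = {"name": "get_weather", "arguments": '{"city": "Paris"}'}
actual = {"name": "get_weather", "arguments": '{"city": "paris"}'}
print(score_tool_call(expected, actual))  # {'correct_tool': True, 'correct_args': False}
```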
Libraries & Code
Burr
Build applications that make decisions (chatbots, agents, simulations, etc.). Monitor, persist, and execute on your own infrastructure.
Mem0
Mem0 provides an intelligent, adaptive memory layer for LLMs, enhancing personalized AI experiences.
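A minimal sketch following the README-style usage; the default Memory() configuration assumes an LLM API key is set in the environment, and the exact return shapes should be treated as assumptions.

```python
from mem0 import Memory

m = Memory()  # default config; expects an LLM API key in the environment
m.add("Alice prefers vegetarian restaurants.", user_id="alice")

# Later, retrieve memories relevant to a new query to personalize a response.
print(m.search("Where should Alice eat tonight?", user_id="alice"))
```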
Papers & Publications
LivePortrait: Efficient Portrait Animation with Stitching and Retargeting Control
Abstract:
Portrait Animation aims to synthesize a lifelike video from a single source image, using it as an appearance reference, with motion (i.e., facial expressions and head pose) derived from a driving video, audio, text, or generation. Instead of following mainstream diffusion-based methods, we explore and extend the potential of the implicit-keypoint-based framework, which effectively balances computational efficiency and controllability. Building upon this, we develop a video-driven portrait animation framework named LivePortrait with a focus on better generalization, controllability, and efficiency for practical usage. To enhance the generation quality and generalization ability, we scale up the training data to about 69 million high-quality frames, adopt a mixed image-video training strategy, upgrade the network architecture, and design better motion transformation and optimization objectives. Additionally, we find that compact implicit keypoints can effectively represent a kind of blendshape, and we carefully design a stitching module and two retargeting modules, which use a small MLP with negligible computational overhead, to enhance controllability. Experimental results demonstrate the efficacy of our framework even compared to diffusion-based methods. Generation takes a remarkable 12.8 ms on an RTX 4090 GPU with PyTorch.
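For intuition only, here is a toy PyTorch sketch of the "small MLP" idea behind the retargeting modules: a tiny network maps implicit keypoints plus a control signal (e.g., desired eye openness) to keypoint offsets. The dimensions are assumptions, not the paper's.

```python
import torch
import torch.nn as nn

class RetargetingMLP(nn.Module):
    def __init__(self, num_kp: int = 21, ctrl_dim: int = 1, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_kp * 3 + ctrl_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_kp * 3),  # per-keypoint 3D offsets
        )

    def forward(self, kp: torch.Tensor, ctrl: torch.Tensor) -> torch.Tensor:
        # Condition the offsets on both the keypoints and the control signal.
        delta = self.net(torch.cat([kp.flatten(1), ctrl], dim=1))
        return kp + delta.view_as(kp)        # retargeted keypoints

kp = torch.randn(2, 21, 3)                   # batch of implicit keypoints
ctrl = torch.rand(2, 1)                      # e.g., a target eye-open ratio
print(RetargetingMLP()(kp, ctrl).shape)      # torch.Size([2, 21, 3])
```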
Very Large-Scale Multi-Agent Simulation in AgentScope
Abstract:
Recent advances in large language models (LLMs) have opened new avenues for applying multi-agent systems in very large-scale simulations. However, several challenges remain when conducting multi-agent simulations with existing platforms, such as limited scalability and low efficiency, unsatisfactory agent diversity, and effort-intensive management processes. To address these challenges, we develop several new features and components for AgentScope, a user-friendly multi-agent platform, enhancing its convenience and flexibility for supporting very large-scale multi-agent simulations. Specifically, we propose an actor-based distributed mechanism as the underlying technological infrastructure for high scalability and efficiency, and provide flexible environment support for simulating various real-world scenarios, enabling parallel execution of multiple agents, centralized workflow orchestration, and both inter-agent and agent-environment interactions. Moreover, we integrate an easy-to-use configurable tool and an automatic background generation pipeline in AgentScope, simplifying the process of creating agents with diverse yet detailed background settings. Last but not least, we provide a web-based interface for conveniently monitoring and managing large numbers of agents that may be deployed across multiple devices. We conduct a comprehensive simulation to demonstrate the effectiveness of the proposed enhancements in AgentScope, and provide detailed observations and discussions to highlight the great potential of applying multi-agent systems in large-scale simulations. The source code is released on GitHub to inspire further research and development in large-scale multi-agent simulations.
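For intuition, here is a toy illustration of the actor-style parallelism the paper builds on (not AgentScope's API): each agent runs in its own process and exchanges messages through queues.

```python
from multiprocessing import Process, Queue

def agent(name: str, inbox: Queue, outbox: Queue) -> None:
    msg = inbox.get()                        # block until a message arrives
    outbox.put(f"{name} replies to: {msg}")

if __name__ == "__main__":
    inbox, outbox = Queue(), Queue()
    workers = [Process(target=agent, args=(f"agent-{i}", inbox, outbox))
               for i in range(4)]
    for w in workers:
        w.start()
    for i in range(4):
        inbox.put(f"task {i}")               # dispatch tasks to the pool
    for _ in range(4):
        print(outbox.get())                  # gather replies as they finish
    for w in workers:
        w.join()
```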
FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs
Abstract:
This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs). At its core are two innovative models: SenseVoice, which handles multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice, which facilitates natural speech generation with control over multiple languages, timbre, speaking style, and speaker identity. SenseVoice-Small delivers exceptionally low-latency ASR for 5 languages, and SenseVoice-Large supports high-precision ASR for over 50 languages, while CosyVoice excels in multilingual voice generation, zero-shot in-context learning, cross-lingual voice cloning, and instruction-following capabilities. The models related to SenseVoice and CosyVoice have been open-sourced on ModelScope and Hugging Face, along with the corresponding training, inference, and fine-tuning code released on GitHub. By integrating these models with LLMs, FunAudioLLM enables applications such as speech-to-speech translation, emotional voice chat, interactive podcasts, and expressive audiobook narration, thereby pushing the boundaries of voice interaction technology.
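The speech-to-speech pattern the report describes can be sketched as ASR in, an LLM in the middle, and TTS out; all three helpers below are hypothetical placeholders, not the SenseVoice or CosyVoice APIs.

```python
def transcribe(audio_path: str) -> str:      # stand-in for SenseVoice ASR
    return "Bonjour, comment allez-vous ?"

def translate(text: str) -> str:             # stand-in for the LLM step
    return "Hello, how are you?"

def synthesize(text: str) -> bytes:          # stand-in for CosyVoice TTS
    return text.encode("utf-8")

# Chain the three stages for a toy speech-to-speech translation pass.
audio_out = synthesize(translate(transcribe("input.wav")))
print(len(audio_out), "bytes of synthesized speech")
```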