Deep Learning Weekly: Issue 387
NVIDIA Launches Blueprint for AI Agents That Can Analyze Video, LLM Evaluation Metrics Every Developer Should Know, a paper on Memory Layers at Scale, and many more!
This week in deep learning, we bring you NVIDIA Launches Blueprint for AI Agents That Can Analyze Video, LLM Evaluation Metrics Every Developer Should Know, and a paper on Memory Layers at Scale.
You may also enjoy A new computational model can predict antibody structures more accurately, Visualize and understand GPU memory in PyTorch, a paper on 3D Shape Tokenization, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
NVIDIA Launches Blueprint for AI Agents That Can Analyze Video
To accelerate the creation of agents with visual perception capabilities, NVIDIA announced early access to a new version of the NVIDIA AI Blueprint for video search and summarization.
A new computational model can predict antibody structures more accurately
MIT researchers have developed a computational technique that allows large language models to predict antibody structures more accurately.
Microsoft reveals plan to spend $80B building AI data centers in fiscal 2025
Microsoft has revealed plans to invest more than $80 billion to build data centers for AI workloads during its current fiscal year.
Rembrand raises $23M for AI-powered product placement in videos
Rembrand, which uses AI to place virtual objects in video content for brand marketing, has raised $23 million to expand from social media into connected TV formats.
MLOps & LLMOps
Agents
Chip Huyen's comprehensive article about agents, focusing on how they are built, what tools can enhance them, how planning works, and more.
AI Agent Workflow Design Patterns
A blog post about AI agent workflow design patterns with a focus on the ReAct and Plan-Solve patterns.
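To make the ReAct pattern concrete, here is a minimal, self-contained sketch; the `call_llm` stub and toy tool registry are our placeholders, not code from the post. The model alternates Thought, Action, and Observation steps until it emits a "finish" action.

```python
import json

def call_llm(prompt: str) -> str:
    """Stand-in for a real chat-completion call; returns canned JSON here."""
    return json.dumps({"thought": "I can answer directly.",
                       "action": "finish", "input": "42"})

TOOLS = {
    "search": lambda q: f"(stub) top result for {q!r}",
}

def react_loop(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        # Ask for the next Thought and Action as JSON to keep parsing simple.
        step = json.loads(call_llm(
            transcript + 'Reply as {"thought": ..., "action": ..., "input": ...}; '
                         'use action "finish" when done.'))
        if step["action"] == "finish":
            return step["input"]
        observation = TOOLS[step["action"]](step["input"])  # run the chosen tool
        transcript += (f"Thought: {step['thought']}\nAction: {step['action']}\n"
                       f"Observation: {observation}\n")
    return "step budget exhausted"

print(react_loop("What is 6 * 7?"))  # prints "42" with the canned stub
```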
Accelerate Custom Video Foundation Model Pipelines with New NVIDIA NeMo Framework Capabilities
A detailed post about the new capabilities of the NVIDIA NeMo framework for accelerating custom video foundation model pipelines.
Building Agentic Workflows with Inngest
An article about building agentic workflows with Inngest, using the example of creating a dinner menu generator.
Learning
LLM Evaluation Metrics Every Developer Should Know
A comprehensive article about key LLM evaluation metrics and how to calculate them for various applications such as machine translation, summarization, and chatbots.
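As a taste of the metrics covered, here is a from-scratch sketch of sentence-level BLEU, the classic machine-translation metric: clipped n-gram precisions combined by geometric mean, scaled by a brevity penalty. This is our illustration, not code from the article; production use would add smoothing (as libraries like sacrebleu do).

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Sentence-level BLEU: clipped n-gram precision with a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = Counter(ngrams(cand, n)), Counter(ngrams(ref, n))
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(len(cand) - n + 1, 0)
        if total == 0 or overlap == 0:
            return 0.0  # any zero precision zeroes the geometric mean (no smoothing)
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty discourages overly short candidates.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(log_precisions) / max_n)

print(bleu("the cat sat on the mat", "the cat is on the mat", max_n=2))  # ~0.707
```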
A guide to JAX for PyTorch developers
A blog post about the fundamentals of JAX for PyTorch users, using the example of building a neural network to predict the survivors of the Titanic.
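The core mental shift for PyTorch users is that JAX keeps parameters in explicit pytrees and derives gradients from pure functions rather than from stateful modules. A minimal sketch (our toy binary classifier, not the post's Titanic walkthrough):

```python
import jax
import jax.numpy as jnp

def init_params(key, in_dim=8, hidden=16):
    k1, k2 = jax.random.split(key)
    return {
        "w1": jax.random.normal(k1, (in_dim, hidden)) * 0.1, "b1": jnp.zeros(hidden),
        "w2": jax.random.normal(k2, (hidden, 1)) * 0.1, "b2": jnp.zeros(1),
    }

def forward(params, x):
    h = jnp.tanh(x @ params["w1"] + params["b1"])
    return jax.nn.sigmoid(h @ params["w2"] + params["b2"])

def loss_fn(params, x, y):
    p = forward(params, x).squeeze(-1)
    return -jnp.mean(y * jnp.log(p + 1e-7) + (1 - y) * jnp.log(1 - p + 1e-7))

@jax.jit  # traced once, then compiled; roughly the torch.compile analogue
def train_step(params, x, y, lr=0.1):
    grads = jax.grad(loss_fn)(params, x, y)  # no .backward(), no .zero_grad()
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)

key = jax.random.PRNGKey(0)
params = init_params(key)
x = jax.random.normal(key, (32, 8))
y = (x.sum(axis=1) > 0).astype(jnp.float32)  # toy labels
for _ in range(100):
    params = train_step(params, x, y)
print("final loss:", loss_fn(params, x, y))
```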
Visualize and understand GPU memory in PyTorch
A step-by-step tutorial on how to visualize and understand GPU memory usage in PyTorch during training.
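The workflow centers on PyTorch's built-in memory-history recorder; the resulting snapshot file can be dropped into the viewer at https://pytorch.org/memory_viz. A condensed sketch of the recording pattern (our toy model, requires a CUDA device):

```python
import torch
from torch import nn

torch.cuda.memory._record_memory_history(max_entries=100_000)  # start tracking allocations

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10)).cuda()
optimizer = torch.optim.Adam(model.parameters())

for _ in range(3):  # a few steps so the timeline shows activations, grads, optimizer state
    x = torch.randn(64, 1024, device="cuda")
    loss = model(x).sum()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")  # save the timeline to disk
torch.cuda.memory._record_memory_history(enabled=None)      # stop recording
```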
Explore the business case for responsible AI in new IDC whitepaper
An article about the business case for responsible AI, highlighting key findings from an IDC whitepaper.
Papers & Publications
Memory Layers at Scale
Abstract:
Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, sparsely activated memory layers complement compute-heavy dense feed-forward layers, providing dedicated capacity to store and retrieve information cheaply. This work takes memory layers beyond proof-of-concept, proving their utility at contemporary scale. On downstream tasks, language models augmented with our improved memory layer outperform dense models with more than twice the computation budget, as well as mixture-of-expert models when matched for both compute and parameters. We find gains are especially pronounced for factual tasks. We provide a fully parallelizable memory layer implementation, demonstrating scaling laws with up to 128B memory parameters, pretrained to 1 trillion tokens, comparing to base models with up to 8B parameters.
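To make the mechanism concrete, here is a naive PyTorch sketch of a sparsely activated key-value memory layer. It omits the paper's product-key lookup and parallelized implementation (the full key search below does not achieve the paper's FLOP savings); it only illustrates the idea that each token reads and mixes just `topk` of the memory's value slots.

```python
import torch
from torch import nn
import torch.nn.functional as F

class MemoryLayer(nn.Module):
    def __init__(self, d_model=256, num_keys=4096, topk=8):
        super().__init__()
        self.query_proj = nn.Linear(d_model, d_model)
        self.keys = nn.Parameter(torch.randn(num_keys, d_model) * 0.02)
        self.values = nn.Embedding(num_keys, d_model)  # large table, sparsely read
        self.topk = topk

    def forward(self, x):                      # x: (batch, seq, d_model)
        q = self.query_proj(x)
        scores = q @ self.keys.T               # (batch, seq, num_keys)
        top_scores, idx = scores.topk(self.topk, dim=-1)
        weights = F.softmax(top_scores, dim=-1)  # mix only the k selected slots
        selected = self.values(idx)            # (batch, seq, topk, d_model)
        return x + (weights.unsqueeze(-1) * selected).sum(dim=-2)  # residual update

x = torch.randn(2, 16, 256)
print(MemoryLayer()(x).shape)  # torch.Size([2, 16, 256])
```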
3D Shape Tokenization
Abstract:
We introduce Shape Tokens, a 3D representation that is continuous, compact, and easy to integrate into machine learning models. Shape Tokens serve as conditioning vectors, representing shape information within a 3D flow-matching model. This flow-matching model is trained to approximate probability density functions corresponding to delta functions concentrated on the surfaces of 3D shapes. By incorporating Shape Tokens into various machine learning models, we can generate new shapes, convert images to 3D, align 3D shapes with text and images, and render shapes directly at variable, user-specified resolutions. Additionally, Shape Tokens enable a systematic analysis of geometric properties, including normals, density, and deformation fields. Across tasks and experiments, the use of Shape Tokens demonstrates strong performance compared to existing baselines.
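A toy sketch of the conditional flow-matching objective the abstract alludes to; the network, token dimensionality, and sphere data below are our stand-ins for the paper's. The model learns a velocity field transporting Gaussian noise onto a shape's surface, conditioned on that shape's token.

```python
import torch
from torch import nn

velocity_net = nn.Sequential(nn.Linear(3 + 1 + 32, 128), nn.SiLU(), nn.Linear(128, 3))

def flow_matching_loss(surface_points, shape_token):
    """surface_points: (N, 3) samples from one shape; shape_token: (32,)."""
    x1 = surface_points                       # target: delta density on the surface
    x0 = torch.randn_like(x1)                 # source: standard Gaussian noise
    t = torch.rand(x1.shape[0], 1)            # random interpolation times
    xt = (1 - t) * x0 + t * x1                # linear probability path
    target_v = x1 - x0                        # its constant velocity field
    cond = shape_token.expand(x1.shape[0], -1)
    pred_v = velocity_net(torch.cat([xt, t, cond], dim=-1))
    return ((pred_v - target_v) ** 2).mean()

# Toy shape: points on a unit sphere, with a random stand-in "token".
pts = torch.nn.functional.normalize(torch.randn(512, 3), dim=-1)
token = torch.randn(32)
print(flow_matching_loss(pts, token))
```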
LatentSync: Audio Conditioned Latent Diffusion Models for Lip Sync
Abstract:
We present LatentSync, an end-to-end lip sync framework based on audio conditioned latent diffusion models without any intermediate motion representation, diverging from previous diffusion-based lip sync methods based on pixel space diffusion or two-stage generation. Our framework can leverage the powerful capabilities of Stable Diffusion to directly model complex audio-visual correlations. Additionally, we found that the diffusion-based lip sync methods exhibit inferior temporal consistency due to the inconsistency in the diffusion process across different frames. We propose Temporal REPresentation Alignment (TREPA) to enhance temporal consistency while preserving lip-sync accuracy. TREPA uses temporal representations extracted by large-scale self-supervised video models to align the generated frames with the ground truth frames. Furthermore, we observe the commonly encountered SyncNet convergence issue and conduct comprehensive empirical studies, identifying key factors affecting SyncNet convergence in terms of model architecture, training hyperparameters, and data preprocessing methods. We significantly improve the accuracy of SyncNet from 91% to 94% on the HDTF test set. Since we did not change the overall training framework of SyncNet, our experience can also be applied to other lip sync and audio-driven portrait animation methods that utilize SyncNet. Based on the above innovations, our method outperforms state-of-the-art lip sync methods across various metrics on the HDTF and VoxCeleb2 datasets.
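As a rough illustration of the TREPA objective described above, the sketch below matches temporal features of generated and ground-truth clips under a frozen encoder. The `toy_encoder` is a hypothetical stand-in for the large-scale self-supervised video model the paper uses; only the loss structure is illustrated.

```python
import torch
from torch import nn

class TemporalAlignmentLoss(nn.Module):
    def __init__(self, video_encoder):
        super().__init__()
        self.encoder = video_encoder.eval()
        for p in self.encoder.parameters():
            p.requires_grad_(False)  # the representation model stays frozen

    def forward(self, generated, ground_truth):
        # generated / ground_truth: (batch, frames, channels, H, W) clips
        with torch.no_grad():
            target_feats = self.encoder(ground_truth)
        gen_feats = self.encoder(generated)  # gradients flow to the generator only
        return (gen_feats - target_feats).pow(2).mean()

# Stand-in encoder: flattens frames and projects, just to make the sketch runnable.
toy_encoder = nn.Sequential(nn.Flatten(start_dim=2), nn.Linear(3 * 32 * 32, 64))
loss_fn = TemporalAlignmentLoss(toy_encoder)
fake, real = torch.randn(2, 8, 3, 32, 32), torch.randn(2, 8, 3, 32, 32)
print(loss_fn(fake, real))
```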