Deep Learning Weekly: Issue 383
Genie 2: A large-scale foundation world model, an AI Chatbot Example Project: ClaireBot, a paper on a systematic framework for large video generation models, and many more!
This week in deep learning, we bring you Genie 2: A large-scale foundation world model, an AI Chatbot Example Project: ClaireBot, an AI Personal Stylist, and a paper on HunyuanVideo: A Systematic Framework For Large Video Generative Models.
You may also enjoy Sora Turbo, Population-based Model Merging via Quality Diversity, a paper on Infinity: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Genie 2: A large-scale foundation world model
DeepMind introduced Genie 2, a foundation world model capable of generating a variety of action-controllable, playable 3D environments for embodied agents.
Sora Turbo
OpenAI released Sora Turbo, a significantly faster version of Sora, as a standalone product for ChatGPT Plus and Pro users.
The AI Risk Repository
Neil Thompson has created the AI Risk Repository, a living database of over 700 risks posed by AI, categorized by cause and risk domain.
LeMaterial: an open source initiative to accelerate materials discovery and research
Hugging Face and Entalpic announced LeMaterial, an open-source project that aims to simplify and accelerate materials research.
AI Headphones Create Cones of Silence
A team of researchers from the University of Washington, Microsoft, and AssemblyAI has shown that AI can outdo humans in isolating sound sources to create a zone of silence.
AI data center builder Nscale nabs $155M investment
Nscale, a London startup that builds data centers optimized for AI workloads, has raised $155 million to grow its infrastructure footprint.
MLOps & LLMOps
AI Chatbot Example Project: ClaireBot, an AI Personal Stylist
Learn how to build an end-to-end conversational AI system with image analysis, relevance guardrails, LLM-as-a-judge evaluation, and open source tools.
Designing Multi-Tenancy RAG with Milvus: Best Practices for Scalable Enterprise Knowledge Bases
An article about the benefits of using Milvus for multi-tenancy RAG in enterprise knowledge bases.
Optimize parsing costs with LlamaParse auto mode
A post explaining how to optimize parsing costs with LlamaParse auto mode.
Learning
Population-based Model Merging via Quality Diversity
An article that explains how population-based model merging guided by Quality Diversity can produce LLM agents specialized for specific tasks.
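As background, most Quality Diversity methods follow a MAP-Elites-style loop: keep an archive of niches indexed by a behavior descriptor, and retain the best-scoring solution in each niche. The toy sketch below illustrates that general loop on a 1-D problem; it is not the article's specific merging algorithm, and the objective and descriptor are made up for illustration.

```python
import random

# A generic MAP-Elites-style Quality Diversity loop on a toy 1-D problem.
# Archive: one elite (best-fitness solution) per behavior niche.

def fitness(x):
    return -(x - 0.3) ** 2  # toy objective, maximized at x = 0.3

def behavior_bin(x, bins=10):
    return min(int(x * bins), bins - 1)  # descriptor: which decile x falls in

def map_elites(iterations=2000, seed=0):
    rng = random.Random(seed)
    archive = {}  # behavior bin -> (fitness, solution)
    for _ in range(iterations):
        if archive and rng.random() < 0.5:
            # Mutate a randomly chosen elite (clamped to the search domain).
            _, parent = archive[rng.choice(list(archive))]
            x = min(max(parent + rng.gauss(0, 0.1), 0.0), 0.999)
        else:
            # Otherwise sample a fresh random solution.
            x = rng.random()
        b, f = behavior_bin(x), fitness(x)
        if b not in archive or f > archive[b][0]:
            archive[b] = (f, x)  # new elite for this niche
    return archive
```

The result is not a single best solution but a whole archive of diverse, locally optimal ones — the property that makes QD attractive for producing a population of differently specialized models.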
Looking back at speculative decoding
An informative blog post looking back on speculative decoding and how it has been used to speed up LLM inference.
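For readers new to the technique, the core of speculative decoding is small: a cheap draft model proposes several tokens at once, and the target model verifies them, keeping the longest agreeing prefix plus one token of its own. A minimal greedy sketch, where both "models" are hypothetical deterministic next-token functions over integer token ids rather than any real LLM API:

```python
def draft_next(context):
    # Hypothetical cheap draft model: guesses the next token as last + 1.
    return context[-1] + 1

def target_next(context):
    # Hypothetical strong target model: same rule, except it emits 0 after a
    # nonzero multiple of 4, so it occasionally disagrees with the draft.
    return 0 if context[-1] % 4 == 0 and context[-1] != 0 else context[-1] + 1

def speculative_decode(context, num_tokens, k=4):
    out = list(context)
    while len(out) - len(context) < num_tokens:
        # 1) Draft proposes k tokens autoregressively (cheap).
        proposal, ctx = [], list(out)
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2) Target verifies the proposals (in a real system this is one
        #    batched forward pass); keep the agreeing prefix, then replace
        #    the first mismatch with the target's own token.
        accepted, ctx = [], list(out)
        for t in proposal:
            expected = target_next(ctx)
            if expected == t:
                accepted.append(t)
                ctx.append(t)
            else:
                accepted.append(expected)
                break
        out.extend(accepted)
    return out[len(context):][:num_tokens]
```

With greedy decoding, the output is identical to running the target model alone — the draft only changes how many target evaluations are needed per emitted token.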
PaliGemma 2 – New vision language models by Google
A blog post introducing PaliGemma 2, a new iteration of Google's vision language models.
Scheming reasoning evaluations
An article summarizing research on the scheming capabilities of six frontier AI models.
Libraries & Code
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
Mooncake
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
Papers & Publications
HunyuanVideo: A Systematic Framework For Large Video Generative Models
Abstract:
Recent advancements in video generation have significantly impacted daily life for both individuals and industries. However, the leading video generation models remain closed-source, resulting in a notable performance gap between industry capabilities and those available to the public. In this report, we introduce HunyuanVideo, an innovative open-source video foundation model that demonstrates performance in video generation comparable to, or even surpassing, that of leading closed-source models. HunyuanVideo encompasses a comprehensive framework that integrates several key elements, including data curation, advanced architectural design, progressive model scaling and training, and an efficient infrastructure tailored for large-scale model training and inference. As a result, we successfully trained a video generative model with over 13 billion parameters, making it the largest among all open-source models. We conducted extensive experiments and implemented a series of targeted designs to ensure high visual quality, motion dynamics, text-video alignment, and advanced filming techniques. According to evaluations by professionals, HunyuanVideo outperforms previous state-of-the-art models, including Runway Gen-3, Luma 1.6, and three top-performing Chinese video generative models. By releasing the code for the foundation model and its applications, we aim to bridge the gap between closed-source and open-source communities. This initiative will empower individuals within the community to experiment with their ideas, fostering a more dynamic and vibrant video generation ecosystem.
Infinity: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis
Abstract:
We present Infinity, a Bitwise Visual AutoRegressive model capable of generating high-resolution, photorealistic images following language instructions. Infinity redefines the visual autoregressive model under a bitwise token prediction framework with an infinite-vocabulary tokenizer and classifier and a bitwise self-correction mechanism, remarkably improving generation capacity and detail. By theoretically scaling the tokenizer vocabulary size to infinity and concurrently scaling the transformer size, our method significantly unleashes powerful scaling capabilities compared to vanilla VAR. Infinity sets a new record for autoregressive text-to-image models, outperforming top-tier diffusion models like SD3-Medium and SDXL. Notably, Infinity surpasses SD3-Medium by improving the GenEval benchmark score from 0.62 to 0.73 and the ImageReward benchmark score from 0.87 to 0.96, achieving a win rate of 66%. Without extra optimization, Infinity generates a high-quality 1024x1024 image in 0.8 seconds, making it 2.6x faster than SD3-Medium and establishing it as the fastest text-to-image model. Models and code will be released to promote further exploration of Infinity for visual generation and unified tokenizer modeling.
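The "bitwise token" idea can be pictured as follows. Instead of matching a latent vector to one of 2^d codebook entries (and predicting over a 2^d-way softmax), each latent dimension is quantized to a single bit, so a d-bit string indexes an implicit vocabulary of size 2^d without ever materializing a table, and the model can predict d bits independently. This is an illustrative sketch based only on the abstract — the sign quantization and helper names below are assumptions, not the paper's exact formulation:

```python
def to_bits(latent):
    # Each latent dimension becomes one bit: 1 if non-negative, else 0.
    return [1 if v >= 0 else 0 for v in latent]

def from_bits(bits, scale=1.0):
    # Reconstruct the implicit codeword: +scale for bit 1, -scale for bit 0.
    return [scale if b else -scale for b in bits]

def token_id(bits):
    # The bit-string viewed as an integer index into the implicit 2^d
    # vocabulary; no 2^d embedding table is ever built.
    return sum(b << i for i, b in enumerate(bits))

latent = [0.7, -0.2, 0.1, -0.9]
bits = to_bits(latent)  # [1, 0, 1, 0]
```

Growing d then grows the effective vocabulary exponentially while the prediction head only grows linearly — the scaling property the abstract attributes to the infinite-vocabulary tokenizer and classifier.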
Momentum Approximation in Asynchronous Private Federated Learning
Abstract:
Asynchronous protocols have been shown to improve the scalability of federated learning (FL) with a massive number of clients. Meanwhile, momentum-based methods can achieve the best model quality in synchronous FL. However, naively applying momentum in asynchronous FL algorithms leads to slower convergence and degraded model performance. It is still unclear how to effectively combine these two techniques. In this paper, we find that asynchrony introduces implicit bias to momentum updates. To address this problem, we propose momentum approximation, which minimizes the bias by finding an optimal weighted average of all historical model updates. Momentum approximation is compatible with secure aggregation as well as differential privacy, and can be easily integrated into production FL systems with minor communication and storage cost. We empirically demonstrate that on benchmark FL datasets, momentum approximation can achieve 1.15--4x speedup in convergence compared to existing asynchronous FL optimizers with momentum.
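The key reframing in the abstract is that momentum, normally maintained as a recursion m_t = beta * m_{t-1} + delta_t, can instead be written as an explicit weighted average of historical updates — which the server can then re-weight to correct asynchrony-induced bias. The toy sketch below shows only the unrolled equivalence; the geometric weights beta^(t-k) are the synchronous target coefficients, not the paper's learned optimal weights:

```python
def synchronous_momentum(deltas, beta=0.9):
    # Standard momentum recursion over a sequence of model updates.
    m = 0.0
    for d in deltas:
        m = beta * m + d
    return m

def momentum_from_history(deltas, beta=0.9):
    # Unrolled form of the same recursion: m_t = sum_k beta^(t-k) * delta_k.
    # In asynchronous FL, stale out-of-order deltas break the recursion, but
    # a server holding the history can still weight updates toward these
    # target coefficients.
    t = len(deltas) - 1
    return sum(beta ** (t - k) * d for k, d in enumerate(deltas))
```

Both forms agree exactly in the synchronous case; the paper's contribution is choosing the weights optimally when the update order the recursion assumes no longer holds.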