Deep Learning Weekly: Issue 341
OpenAI's Sora, GenAI Design Patterns, Microsoft's UI-focused dual-agent framework for fulfilling requests on Windows, World Model on Million-Length Video And Language With RingAttention, and more!
This week in deep learning, we bring you OpenAI's Sora, Generative AI Design Patterns: A Comprehensive Guide, Microsoft's UI-focused dual-agent framework for fulfilling requests on Windows, and a paper on World Model on Million-Length Video And Language With RingAttention.
You may also enjoy V-JEPA: The next step toward Yann LeCun’s vision of advanced machine intelligence (AMI), Build an LLM-Powered Data Agent for Data Analysis, Learning the importance of training data under concept drift, a paper on Automated Unit Test Improvement using Large Language Models at Meta, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
OpenAI introduced Sora, a text-to-video model capable of generating videos up to a minute long while maintaining visual quality and adherence to the user’s prompt.
SoftBank’s Masayoshi Son is reportedly seeking $100B to build a new AI chip venture
SoftBank’s Masayoshi Son is reportedly seeking $100 billion to build a new venture called Izanagi to compete with the likes of Nvidia.
V-JEPA: The next step toward Yann LeCun’s vision of advanced machine intelligence (AMI)
Meta publicly released the Video Joint Embedding Predictive Architecture (V-JEPA) model, a crucial step in advancing machine intelligence with a more grounded understanding of the world.
Apple is reportedly working on AI updates to Spotlight and Xcode
Bloomberg reports Apple has ‘ramped up’ development of an AI-powered code completion tool that’s similar to Microsoft’s GitHub Copilot.
AI hardware startup Recogni raises $102M for self-driving solutions
Recogni announced that it has raised $102 million in a funding round co-led by Celesta Capital and GreatPoint Ventures.
Google DeepMind alumni unveil Bioptimus: Aiming to build first universal biology AI model
Bioptimus, with a mission to build the first universal AI foundation model for biology, emerged from stealth following a seed funding round of $35 million.
MLOps & LLMOps
100x Faster — Scaling Your RAG App for Billions of Embeddings
An article that discusses efficient methods for computing cosine similarity between user query embedding vectors and large-scale embedding databases using the Chunkdot library.
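The core idea the article describes can be sketched without any special library: scan the embedding database in chunks, compute cosine similarities against the normalized query, and keep only a running top-k so peak memory stays bounded regardless of database size. This is a minimal NumPy illustration of that pattern, not Chunkdot's actual API; the function name and parameters are hypothetical.

```python
import numpy as np

def topk_cosine(query, embeddings, k=5, chunk_size=10_000):
    """Return indices and similarities of the k rows of `embeddings`
    most cosine-similar to `query`, scanning the database in chunks
    to bound peak memory."""
    q = query / np.linalg.norm(query)
    best_idx = np.empty(0, dtype=np.int64)
    best_sim = np.empty(0)
    for start in range(0, len(embeddings), chunk_size):
        chunk = embeddings[start:start + chunk_size]
        norms = np.clip(np.linalg.norm(chunk, axis=1), 1e-12, None)
        sims = chunk @ q / norms
        best_sim = np.concatenate([best_sim, sims])
        best_idx = np.concatenate([best_idx,
                                   np.arange(start, start + len(chunk))])
        order = np.argsort(-best_sim)[:k]  # keep only the running top-k
        best_sim, best_idx = best_sim[order], best_idx[order]
    return best_idx, best_sim
```

Because only `chunk_size` similarities plus k running candidates are alive at any moment, the same loop scales to databases that do not fit in memory if `embeddings` is a memory-mapped array.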
Generative AI Design Patterns: A Comprehensive Guide
An article that highlights a handful of approaches and patterns for generative AI systems in production.
Build an LLM-Powered Data Agent for Data Analysis
A post that explains the agent types required to build an accurate LLM application that can handle nuanced data analysis tasks when queried.
Learning
Unveiling the Potential of Histogram of Oriented Gradients (HOG) in Computer Vision
An article that explores the fundamental principles and applications of Histogram of Oriented Gradients (HOG), a powerful feature extraction technique for capturing object structure and texture in visual data.
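The essence of HOG can be shown in a few lines: compute image gradients, then build a per-cell histogram of gradient orientations weighted by gradient magnitude. This is a deliberately minimal sketch (no block normalization, which full HOG adds); the function name and parameters are illustrative, not from the article.

```python
import numpy as np

def hog_features(image, cell=8, bins=9):
    """Minimal HOG: per-cell histograms of unsigned gradient
    orientations (0-180 degrees), weighted by gradient magnitude.
    Omits the block-normalization step of the full descriptor."""
    gy, gx = np.gradient(image.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180  # unsigned orientation
    h, w = image.shape
    n_cy, n_cx = h // cell, w // cell
    bin_idx = np.minimum((ang / (180 / bins)).astype(int), bins - 1)
    hist = np.zeros((n_cy, n_cx, bins))
    for cy in range(n_cy):
        for cx in range(n_cx):
            ys = slice(cy * cell, (cy + 1) * cell)
            xs = slice(cx * cell, (cx + 1) * cell)
            hist[cy, cx] = np.bincount(bin_idx[ys, xs].ravel(),
                                       weights=mag[ys, xs].ravel(),
                                       minlength=bins)
    return hist.ravel()
```

A 16x16 image with an 8-pixel cell yields a 2x2 grid of cells, so 2 * 2 * 9 = 36 features; production implementations such as `skimage.feature.hog` add overlapping block normalization on top of these cell histograms.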
Learning the importance of training data under concept drift
Google Research explores the challenges posed by the changing world on model development, emphasizing the need to assign importance scores to training data to maximize model performance on future inputs.
Libraries & Code
DataDreamer is a powerful open-source Python library for prompting, synthetic data generation, and training workflows.
UFO is a UI-focused dual-agent framework that fulfills user requests on Windows OS by seamlessly navigating and operating within individual applications or across multiple applications.
Papers & Publications
World Model on Million-Length Video And Language With RingAttention
Abstract:
Current language models fall short in understanding aspects of the world not easily described in words, and struggle with complex, long-form tasks. Video sequences offer valuable temporal information absent in language and static images, making them attractive for joint modeling with language. Such models could develop an understanding of both human textual knowledge and the physical world, enabling broader AI capabilities for assisting humans. However, learning from millions of tokens of video and language sequences poses challenges due to memory constraints, computational complexity, and limited datasets. To address these challenges, we curate a large dataset of diverse videos and books, utilize the RingAttention technique to scalably train on long sequences, and gradually increase context size from 4K to 1M tokens. This paper makes the following contributions: (a) Largest context size neural network: We train one of the largest context size transformers on long video and language sequences, setting new benchmarks in difficult retrieval tasks and long video understanding. (b) Solutions for overcoming vision-language training challenges, including using masked sequence packing for mixing different sequence lengths, loss weighting to balance language and vision, and model-generated QA dataset for long sequence chat. (c) A highly-optimized implementation with RingAttention, masked sequence packing, and other key features for training on millions-length multimodal sequences. (d) Fully open-sourced a family of 7B parameter models capable of processing long text documents (LWM-Text, LWM-Text-Chat) and videos (LWM, LWM-Chat) of over 1M tokens. This work paves the way for training on massive datasets of long video and language to develop understanding of both human knowledge and the multimodal world, and broader capabilities.
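The "masked sequence packing" the abstract mentions can be illustrated simply: concatenate variable-length sequences into one fixed-length row and record a segment id per token, so the attention mask can forbid tokens from attending across sequence boundaries. This sketch is an illustration of the general technique, not the paper's implementation; all names are hypothetical.

```python
import numpy as np

def pack_sequences(seqs, max_len, pad_id=0):
    """Pack variable-length token sequences into one fixed-length row.
    Returns the packed tokens, a per-token segment id (0 = padding),
    and a boolean attention mask restricting attention to tokens of
    the same segment."""
    tokens = np.full(max_len, pad_id, dtype=np.int64)
    segment = np.zeros(max_len, dtype=np.int64)
    pos = 0
    for seg_id, seq in enumerate(seqs, start=1):
        if pos + len(seq) > max_len:
            break  # sequence does not fit; a real packer would spill over
        tokens[pos:pos + len(seq)] = seq
        segment[pos:pos + len(seq)] = seg_id
        pos += len(seq)
    # token i may attend to token j only within its own (non-pad) segment
    mask = (segment[:, None] == segment[None, :]) & (segment[:, None] > 0)
    return tokens, segment, mask
```

Packing this way keeps GPU batches dense when mixing short captions with long video transcripts, while the mask preserves the independence of each packed example.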
Automated Unit Test Improvement using Large Language Models at Meta
Abstract:
This paper describes Meta's TestGen-LLM tool, which uses LLMs to automatically improve existing human-written tests. TestGen-LLM verifies that its generated test classes successfully clear a set of filters that assure measurable improvement over the original test suite, thereby eliminating problems due to LLM hallucination. We describe the deployment of TestGen-LLM at Meta test-a-thons for the Instagram and Facebook platforms. In an evaluation on Reels and Stories products for Instagram, 75% of TestGen-LLM's test cases built correctly, 57% passed reliably, and 25% increased coverage. During Meta's Instagram and Facebook test-a-thons, it improved 11.5% of all classes to which it was applied, with 73% of its recommendations being accepted for production deployment by Meta software engineers. We believe this is the first report on industrial scale deployment of LLM-generated code backed by such assurances of code improvement.
Transformers Can Achieve Length Generalization But Not Robustly
Abstract:
Length generalization, defined as the ability to extrapolate from shorter training sequences to longer test ones, is a significant challenge for language models. This issue persists even with large-scale Transformers handling relatively straightforward tasks. In this paper, we test the Transformer's capacity for length generalization using the task of addition of two integers. We show that the success of length generalization is intricately linked to the data format and the type of position encoding. Using the right combination of data format and position encodings, we show for the first time that standard Transformers can extrapolate to a sequence length that is 2.5x the input length. Nevertheless, unlike in-distribution generalization, length generalization remains fragile, significantly influenced by factors like random weight initialization and training data order, leading to large variances across different random seeds.
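One data-format choice studied in this line of work is writing the digits of each operand and the answer in reverse (least-significant digit first), so the model can emit the answer in the same order it would carry. The sketch below generates such training strings; the exact format used in the paper may differ, and the function name and parameters are illustrative.

```python
import random

def reversed_addition_example(max_digits=5, rng=random):
    """Generate one addition training string with all numbers written
    least-significant digit first, e.g. 312 + 54 = 213+45=663
    (a digit-reversal format associated with better length
    generalization in integer addition)."""
    a = rng.randrange(10 ** rng.randrange(1, max_digits + 1))
    b = rng.randrange(10 ** rng.randrange(1, max_digits + 1))
    rev = lambda n: str(n)[::-1]
    return f"{rev(a)}+{rev(b)}={rev(a + b)}"
```

Evaluating length generalization then amounts to training on examples up to `max_digits` and testing on strictly longer operands, which is where the abstract reports large variance across random seeds.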