Deep Learning Weekly: Issue 350
Meta Llama 3, Chatting with SQL Databases 3 Ways, Large Scale Transformer model training with Tensor Parallel, a paper on Leave No Context Behind: Efficient Infinite Context Transformers, and more!
This week in deep learning, we bring you Meta Llama 3, Chatting with SQL Databases 3 Ways, Large Scale Transformer model training with Tensor Parallel (TP), and a paper on Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention.
You may also enjoy AI Index Report 2024, Your Language Model Deserves Better Prompting, Vision Language Models Explained, a paper on InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Introducing Meta Llama 3: The most capable openly available LLM to date
Meta releases Llama 3, the next generation of their state-of-the-art open source large language model.
AI Index Report 2024 – Artificial Intelligence Index
Stanford HAI releases the seventh edition of the AI Index report which introduces new estimates on AI training costs, detailed analyses of the responsible AI landscape, and a new chapter dedicated to science and medicine.
Cheaper, Better, Faster, Stronger | Mistral AI
Mistral introduces Mixtral 8x22B, a sparse Mixture-of-Experts (SMoE) model that uses only 39B active parameters out of 141B, offering unparalleled cost efficiency for its size.
To build a better AI helper, start by modeling the irrational behavior of humans
Researchers from MIT and other institutions developed a framework that models the irrational or suboptimal behavior of a human or AI agent based on its computational constraints.
New Standard for Speech Recognition and Translation from the NVIDIA NeMo Canary Model
The NVIDIA NeMo team just released Canary, a multilingual model that transcribes speech in English, Spanish, German, and French with punctuation and capitalization.
AI startup Mistral in talks to raise €500 million at €5 billion valuation
French AI startup Mistral is in talks to raise €500 million in a deal that would more than double its valuation to at least €5 billion.
MLOps & LLMOps
Chatting with SQL Databases 3 Ways
An article covering three methods for interacting with SQL databases using Haystack, an open-source AI framework; a sketch of one approach follows below.
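As a rough illustration of the simplest of these patterns, the sketch below has an LLM translate a natural-language question into SQL and then executes the query. It uses Haystack's generic PromptBuilder and OpenAIGenerator components rather than the article's exact pipeline; the table schema, database file, and prompt are illustrative assumptions.

```python
# A minimal "LLM writes the SQL" sketch with haystack-ai 2.x and sqlite3.
# Assumes OPENAI_API_KEY is set and a local books.db exists (illustrative).
import sqlite3
from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator

template = """Given the table books(title TEXT, author TEXT, year INT),
write a single SQLite query that answers: {{ question }}
Return only the SQL."""

pipe = Pipeline()
pipe.add_component("prompt_builder", PromptBuilder(template=template))
pipe.add_component("llm", OpenAIGenerator(model="gpt-3.5-turbo"))
pipe.connect("prompt_builder", "llm")  # prompt output feeds the generator

result = pipe.run({"prompt_builder": {"question": "How many books were published after 2015?"}})
sql = result["llm"]["replies"][0]
print(sqlite3.connect("books.db").execute(sql).fetchall())
```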
LangChain introduced a new tool_calls attribute on AIMessage that provides a standard interface for interacting with tool invocations.
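A minimal sketch of what that standardized interface looks like, assuming langchain-core and langchain-openai are installed and an OpenAI key is set; the tool and model name are illustrative, not from the announcement.

```python
# Bind a tool to a chat model and read the standardized tool_calls attribute.
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI

@tool
def add(a: int, b: int) -> int:
    """Add two integers."""
    return a + b

llm = ChatOpenAI(model="gpt-3.5-turbo").bind_tools([add])
msg = llm.invoke("What is 2 + 3?")
# tool_calls is a list of dicts with "name", "args", and "id",
# regardless of which provider produced the message.
print(msg.tool_calls)
```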
Your Language Model Deserves Better Prompting
A blog post that discusses the importance of prompt engineering and introduces the DSPy programming model for pipeline optimization.
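For readers new to DSPy, here is a minimal sketch of its programming model: declare a signature, wrap it in a module, and let the framework handle the prompting. The LM client and model name are illustrative assumptions, and the compilation/optimization step the post focuses on is omitted for brevity.

```python
# A minimal DSPy sketch (assumes the dspy-ai package and an OpenAI key).
import dspy

dspy.settings.configure(lm=dspy.OpenAI(model="gpt-3.5-turbo"))

class AnswerQuestion(dspy.Signature):
    """Answer a question concisely."""
    question = dspy.InputField()
    answer = dspy.OutputField(desc="a short answer")

# ChainOfThought adds intermediate reasoning before producing the answer field.
qa = dspy.ChainOfThought(AnswerQuestion)
print(qa(question="What does DSPy optimize?").answer)
```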
Learning
Large Scale Transformer model training with Tensor Parallel (TP)
A tutorial that demonstrates how to train a large Transformer-like model across hundreds to thousands of GPUs using Tensor Parallel and Fully Sharded Data Parallel.
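The core idea is to shard individual layers across GPUs, splitting one linear column-wise and the next row-wise so only a single all-reduce is needed per block. Below is a heavily simplified sketch on a toy MLP using PyTorch's tensor-parallel API; the mesh size is illustrative, the script must be launched with torchrun across that many GPUs, and the FSDP dimension from the tutorial is omitted.

```python
# A minimal Tensor Parallel sketch with PyTorch 2.x distributed APIs.
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    parallelize_module, ColwiseParallel, RowwiseParallel,
)

class MLP(nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.up = nn.Linear(dim, 4 * dim)
        self.down = nn.Linear(4 * dim, dim)

    def forward(self, x):
        return self.down(self.up(x).relu())

# One mesh dimension for tensor parallelism; 8 GPUs is illustrative.
mesh = init_device_mesh("cuda", (8,))
model = MLP().cuda()
# Shard the first projection column-wise and the second row-wise.
model = parallelize_module(
    model, mesh, {"up": ColwiseParallel(), "down": RowwiseParallel()}
)
```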
Efficiently fine-tune Llama 3 with PyTorch FSDP and Q-Lora
A blog post that walks through how to fine-tune Llama 3 using PyTorch FSDP and Q-Lora with the help of Hugging Face’s TRL, Transformers, peft, and datasets.
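To make the Q-LoRA piece concrete, here is a sketch of loading a model in 4-bit and attaching LoRA adapters with Transformers and PEFT; it omits the FSDP launch configuration and TRL trainer that the post covers, and the model ID (gated, requires access) and hyperparameters are illustrative, not taken from the post.

```python
# A minimal Q-LoRA setup sketch (assumes transformers, peft, bitsandbytes).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Meta-Llama-3-8B"  # illustrative, gated on the Hub
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb)

lora = LoraConfig(
    r=16, lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the adapter weights are trainable
```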
Vision Language Models Explained
A post that goes through the main building blocks of vision language models, as well as the considerations for choosing the right model.
Multimodal Large Language Models & Apple’s MM1
A blog post that delves into the architecture and findings behind Apple’s “MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training” paper.
Libraries & Code
A general-purpose Python library for uncertainty quantification with PyTorch.
The portable Python dataframe library.
A port of Andrej Karpathy's llm.c to Mojo.
Papers & Publications
Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention
Abstract:
This work introduces an efficient method to scale Transformer-based Large Language Models (LLMs) to infinitely long inputs with bounded memory and computation. A key component in our proposed approach is a new attention technique dubbed Infini-attention. The Infini-attention incorporates a compressive memory into the vanilla attention mechanism and builds in both masked local attention and long-term linear attention mechanisms in a single Transformer block. We demonstrate the effectiveness of our approach on long-context language modeling benchmarks, 1M sequence length passkey context block retrieval and 500K length book summarization tasks with 1B and 8B LLMs. Our approach introduces minimal bounded memory parameters and enables fast streaming inference for LLMs.
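The sketch below is a schematic reading of how a compressive, linear-attention-style memory can be combined with local dot-product attention in a single block, in the spirit of Infini-attention; the gating, normalization, and memory-update details are simplified relative to the paper and should not be taken as its exact formulation.

```python
# Schematic single-head, single-segment sketch (shapes: q, k, v are (seq, d)).
import torch
import torch.nn.functional as F

def infini_attention_segment(q, k, v, mem, z, beta):
    """mem: (d, d) running compressive memory; z: (d,) normalizer; beta: scalar gate."""
    sigma_q, sigma_k = F.elu(q) + 1, F.elu(k) + 1
    # Retrieve long-term context from the memory accumulated over past segments.
    a_mem = (sigma_q @ mem) / (sigma_q @ z).clamp(min=1e-6).unsqueeze(-1)
    # Standard local dot-product attention (causal mask omitted for brevity).
    a_local = F.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1) @ v
    # A learned scalar gate mixes long-term and local attention outputs.
    gate = torch.sigmoid(beta)
    out = gate * a_mem + (1 - gate) * a_local
    # Fold the current segment into the memory and normalizer.
    mem = mem + sigma_k.T @ v
    z = z + sigma_k.sum(dim=0)
    return out, mem, z
```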
InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models
Abstract:
We present InstantMesh, a feed-forward framework for instant 3D mesh generation from a single image, featuring state-of-the-art generation quality and significant training scalability. By synergizing the strengths of an off-the-shelf multiview diffusion model and a sparse-view reconstruction model based on the LRM architecture, InstantMesh is able to create diverse 3D assets within 10 seconds. To enhance training efficiency and exploit more geometric supervision, e.g., depths and normals, we integrate a differentiable iso-surface extraction module into our framework and directly optimize on the mesh representation. Experimental results on public datasets demonstrate that InstantMesh significantly outperforms other recent image-to-3D baselines, both qualitatively and quantitatively. We release all the code, weights, and demo of InstantMesh, with the intention that it can make substantial contributions to the community of 3D generative AI and empower both researchers and content creators.
Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction
Abstract:
We present Visual AutoRegressive modeling (VAR), a new generation paradigm that redefines autoregressive learning on images as coarse-to-fine "next-scale prediction" or "next-resolution prediction", diverging from the standard raster-scan "next-token prediction". This simple, intuitive methodology allows autoregressive (AR) transformers to learn visual distributions fast and generalize well: VAR, for the first time, makes AR models surpass diffusion transformers in image generation. On the ImageNet 256x256 benchmark, VAR significantly improves the AR baseline, raising Fréchet inception distance (FID) from 18.65 to 1.80 and inception score (IS) from 80.4 to 356.4, with around 20x faster inference. It is also empirically verified that VAR outperforms the Diffusion Transformer (DiT) in multiple dimensions including image quality, inference speed, data efficiency, and scalability. Scaling up VAR models exhibits clear power-law scaling laws similar to those observed in LLMs, with linear correlation coefficients near -0.998 as solid evidence. VAR further showcases zero-shot generalization ability in downstream tasks including image in-painting, out-painting, and editing. These results suggest VAR has initially emulated the two important properties of LLMs: scaling laws and zero-shot task generalization. We have released all models and code to promote the exploration of AR/VAR models for visual generation and unified learning.