Deep Learning Weekly: Issue 348
Beyond Transformers: Symbolica launches with symbolic models, Binary and Scalar Embedding Quantization, A Brief Overview of Gender Bias in AI, a paper on Visual Autoregressive Modeling, and many more!
This week in deep learning, we bring you Beyond Transformers: Symbolica launches with $33M to change the AI industry with symbolic models, Binary and Scalar Embedding Quantization for Significantly Faster & Cheaper Retrieval, A Brief Overview of Gender Bias in AI, and a paper on Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction.
You may also enjoy Gretel releases world's largest open source text-to-SQL dataset, Maximizing training throughput using PyTorch FSDP, A New Coefficient of Correlation, a paper on InstantStyle: Free Lunch towards Style-Preserving in Text-to-Image Generation, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Beyond Transformers: Symbolica launches with $33M to change the AI industry with symbolic models
Symbolica AI launched with a structured mathematics approach to building generative AI models.
Four things you need to know about China’s AI talent pool
Researchers from China make up over one-quarter of the world’s top AI experts, and they are increasingly staying put rather than moving overseas.
Intel Gaudi 3 launches to challenge Nvidia in the enterprise AI chip space
A post that introduces Intel’s Gaudi 3 AI accelerator, designed to streamline AI development in the enterprise, competing with Nvidia and AMD.
Gretel releases world's largest open source text-to-SQL dataset
Gretel announced the release of the world's largest open source text-to-SQL dataset, aimed at accelerating the training of models that translate natural-language questions into SQL queries.
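If you want to poke at the data directly, a minimal sketch with the Hugging Face datasets library is below. The dataset identifier is an assumption about where Gretel appears to have published it, so verify it against the announcement before relying on it.

```python
# Quick look at the dataset via Hugging Face datasets.
# NOTE: the dataset ID below is an assumption, not confirmed by the announcement text.
from datasets import load_dataset

ds = load_dataset("gretelai/synthetic_text_to_sql", split="train")
print(ds)                 # number of rows and column names
example = ds[0]
print(example.keys())     # inspect which fields pair natural-language prompts with SQL
```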
AI Generates 3D City Maps From Single Radar Images
A new machine learning system can create height maps of urban environments from a single synthetic aperture radar (SAR) image, potentially accelerating disaster planning and response.
MLOps & LLMOps
Binary and Scalar Embedding Quantization for Significantly Faster & Cheaper Retrieval
An article that introduces the concept of embedding quantization and showcases its impact on retrieval speed, memory usage, disk space, and cost.
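As a companion to the article, here is a minimal NumPy sketch of the two quantization schemes it discusses; the corpus size, embedding dimension, and top-k value are illustrative, not taken from the article.

```python
# Minimal sketch of binary and int8 (scalar) embedding quantization with NumPy.
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.standard_normal((1000, 384)).astype(np.float32)  # pretend corpus embeddings
query = rng.standard_normal(384).astype(np.float32)

# Binary quantization: keep only the sign of each dimension, packed into bits (32x smaller).
def to_binary(x):
    return np.packbits((x > 0).astype(np.uint8), axis=-1)

corpus_bin = to_binary(embeddings)
query_bin = to_binary(query)

# Retrieval with Hamming distance (popcount of the XOR between bit vectors).
hamming = np.unpackbits(corpus_bin ^ query_bin, axis=-1).sum(axis=-1)
top_k = np.argsort(hamming)[:10]

# Scalar (int8) quantization: map each dimension's observed range onto 256 buckets (4x smaller).
lo, hi = embeddings.min(axis=0), embeddings.max(axis=0)
scale = (hi - lo) / 255.0
corpus_int8 = np.round((embeddings - lo) / scale - 128).astype(np.int8)
```

In practice the binary index is used for a fast first pass and the higher-precision vectors for rescoring the shortlist, which is where most of the speed and memory savings come from.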
Maximizing training throughput using PyTorch FSDP
A blog post that demonstrates the scalability of FSDP with a 7B model trained on 2T tokens, and highlights techniques for achieving high training throughput on 128 A100 GPUs.
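For reference, a minimal sketch of wrapping a model in PyTorch FSDP with bf16 mixed precision is shown below; the toy model and launch setup are illustrative, not the 7B/128-GPU configuration from the post.

```python
# Launch with: torchrun --nproc_per_node=<num_gpus> train_fsdp.py
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision

def main():
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    # Illustrative small transformer; the post trains a 7B-parameter model.
    model = torch.nn.TransformerEncoder(
        torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True),
        num_layers=12,
    ).cuda()

    # Shard parameters, gradients, and optimizer state across ranks; keep compute in bf16.
    model = FSDP(
        model,
        mixed_precision=MixedPrecision(param_dtype=torch.bfloat16, reduce_dtype=torch.bfloat16),
    )

    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
    x = torch.randn(8, 128, 1024, device="cuda")
    loss = model(x).float().pow(2).mean()  # dummy objective just to exercise backward
    loss.backward()
    optimizer.step()

if __name__ == "__main__":
    main()
```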
Accelerate Mixtral 8x7B with Speculative Decoding and Quantization on Amazon SageMaker
A tutorial on how to accelerate Mixtral-8x7B on Amazon SageMaker using Speculative Decoding (Medusa) and Quantization (AWQ).
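The SageMaker deployment itself depends on container configuration covered in the tutorial; as a smaller, self-contained illustration of speculative decoding, the sketch below uses Hugging Face transformers' assisted generation, where a small draft model proposes tokens that the large model verifies. The model IDs are examples, not the tutorial's exact setup, and loading both models requires substantial GPU memory.

```python
# Illustrative sketch of speculative (assisted) decoding with Hugging Face transformers;
# this shows the general technique, not the Medusa + AWQ SageMaker deployment from the tutorial.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"   # large verifier model (example ID)
draft_id = "mistralai/Mistral-7B-Instruct-v0.2"      # smaller draft model sharing the tokenizer

tokenizer = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(target_id, torch_dtype=torch.float16, device_map="auto")
draft = AutoModelForCausalLM.from_pretrained(draft_id, torch_dtype=torch.float16, device_map="auto")

inputs = tokenizer("Write a SQL query that counts orders per customer.", return_tensors="pt").to(target.device)

# The draft model proposes several tokens per step; the target model accepts or rejects them,
# trading a little extra compute for far fewer expensive forward passes of the big model.
outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```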
Learning
A New Coefficient of Correlation
An article that introduces a new correlation measure capable of detecting dependence between variables beyond linear and monotonic relationships.
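Assuming the article refers to Chatterjee's ξ coefficient, which is the measure usually discussed under this name, a minimal rank-based implementation looks like this (valid for samples without ties in y):

```python
# Sketch of Chatterjee's xi coefficient, assuming that is the measure the article covers.
import numpy as np

def xi_correlation(x, y):
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    order = np.argsort(x)                          # sort the pairs by x
    ranks = np.argsort(np.argsort(y[order])) + 1   # ranks of y in that order
    return 1.0 - 3.0 * np.abs(np.diff(ranks)).sum() / (n**2 - 1)

# A noiseless but non-monotonic relationship: Pearson correlation is ~0, xi is close to 1.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 1000)
y = x**2
print(xi_correlation(x, y))      # close to 1
print(np.corrcoef(x, y)[0, 1])   # approximately 0
```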
Deep Dive into Sora’s Diffusion Transformer (DiT) by Hand
A highly visual article that explains Sora's Diffusion Transformer (DiT) setup in depth.
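For readers who prefer code to hand calculations, here is a hedged sketch of the central DiT ingredient the article walks through: a transformer block over image patch tokens whose normalization is modulated by the conditioning signal (adaptive layer norm). The sizes are illustrative, and this is not Sora's unpublished architecture.

```python
# Hedged sketch of a DiT-style block: attention + MLP over patch tokens, modulated by
# scale/shift/gate vectors regressed from the conditioning (e.g. a timestep embedding).
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    def __init__(self, dim=384, heads=6):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # One projection emits 6 modulation vectors: shift/scale/gate for attention and MLP.
        self.adaLN = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))

    def forward(self, x, cond):
        # x: (batch, num_patches, dim); cond: (batch, dim) conditioning vector
        shift1, scale1, gate1, shift2, scale2, gate2 = self.adaLN(cond).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + scale1.unsqueeze(1)) + shift1.unsqueeze(1)
        x = x + gate1.unsqueeze(1) * self.attn(h, h, h)[0]
        h = self.norm2(x) * (1 + scale2.unsqueeze(1)) + shift2.unsqueeze(1)
        return x + gate2.unsqueeze(1) * self.mlp(h)

block = DiTBlock()
patches = torch.randn(2, 256, 384)   # 2 images, 16x16 grid of patch tokens
timestep_emb = torch.randn(2, 384)
out = block(patches, timestep_emb)   # same shape as the input tokens
```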
Optimizing Memory and Retrieval for Graph Neural Networks with WholeGraph
A post that highlights the performance evaluation of WholeGraph, a breakthrough feature within the RAPIDS cuGraph library, designed for large-scale GNN training.
A Brief Overview of Gender Bias in AI
An article that showcases a small selection of important work, past and ongoing, to uncover, evaluate, and measure different aspects of gender bias in AI models.
Libraries & Code
A representation fine-tuning (ReFT) library that supports adapting internal language model representations via trainable interventions.
A novel framework for generating high-quality animation driven by audio and a reference portrait image.
A toolkit for the interpretability and explainability of data and machine learning models.
Papers & Publications
Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction
Abstract:
We present Visual AutoRegressive modeling (VAR), a new generation paradigm that redefines autoregressive learning on images as coarse-to-fine "next-scale prediction" or "next-resolution prediction", diverging from the standard raster-scan "next-token prediction". This simple, intuitive methodology allows autoregressive (AR) transformers to learn visual distributions fast and generalize well: VAR, for the first time, makes AR models surpass diffusion transformers in image generation. On the ImageNet 256x256 benchmark, VAR significantly improves on the AR baseline, taking Frechet inception distance (FID) from 18.65 to 1.80 and inception score (IS) from 80.4 to 356.4, with around 20x faster inference. It is also empirically verified that VAR outperforms the Diffusion Transformer (DiT) in multiple dimensions including image quality, inference speed, data efficiency, and scalability. Scaling up VAR models exhibits clear power-law scaling laws similar to those observed in LLMs, with linear correlation coefficients near -0.998 as solid evidence. VAR further showcases zero-shot generalization ability in downstream tasks including image in-painting, out-painting, and editing. These results suggest VAR has initially emulated two important properties of LLMs: scaling laws and zero-shot task generalization. We have released all models and code to promote the exploration of AR/VAR models for visual generation and unified learning.
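To make "next-scale prediction" concrete, here is a heavily simplified toy sketch of the generation loop the abstract describes: token maps are produced scale by scale, and each whole map is predicted in one step conditioned on all coarser maps. Everything here (dimensions, vocabulary, sampling, the untrained modules) is illustrative; see the released code for the real model.

```python
# Toy sketch of coarse-to-fine "next-scale prediction" (not the actual VAR implementation).
import torch
import torch.nn as nn

vocab, dim = 4096, 256
scales = [1, 2, 4, 8, 16]                      # token-map side lengths, coarse to fine
embed = nn.Embedding(vocab, dim)
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True), num_layers=4
)
head = nn.Linear(dim, vocab)

tokens_so_far = torch.zeros(1, 1, dtype=torch.long)   # start token standing in for the class label
for side in scales:
    ctx = backbone(embed(tokens_so_far))              # attend over every coarser-scale token
    logits = head(ctx[:, -1])                         # (toy) predict a distribution from the last state
    # Sample the whole side x side map for this scale in one step: this parallel per-scale
    # prediction is the key difference from raster-scan next-token prediction.
    next_map = torch.multinomial(logits.softmax(-1), num_samples=side * side, replacement=True)
    tokens_so_far = torch.cat([tokens_so_far, next_map], dim=1)

print(tokens_so_far.shape)   # 1 start token + 1 + 4 + 16 + 64 + 256 scale tokens
```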
InstantStyle: Free Lunch towards Style-Preserving in Text-to-Image Generation
Abstract:
Tuning-free diffusion-based models have demonstrated significant potential in the realm of image personalization and customization. However, despite this notable progress, current models continue to grapple with several complex challenges in producing style-consistent image generation. Firstly, the concept of style is inherently underdetermined, encompassing a multitude of elements such as color, material, atmosphere, design, and structure, among others. Secondly, inversion-based methods are prone to style degradation, often resulting in the loss of fine-grained details. Lastly, adapter-based approaches frequently require meticulous weight tuning for each reference image to achieve a balance between style intensity and text controllability. In this paper, we commence by examining several compelling yet frequently overlooked observations. We then proceed to introduce InstantStyle, a framework designed to address these issues through the implementation of two key strategies: 1) A straightforward mechanism that decouples style and content from reference images within the feature space, predicated on the assumption that features within the same space can be either added to or subtracted from one another. 2) The injection of reference image features exclusively into style-specific blocks, thereby preventing style leaks and eschewing the need for cumbersome weight tuning, which often characterizes more parameter-heavy designs. Our work demonstrates superior visual stylization outcomes, striking an optimal balance between the intensity of style and the controllability of textual elements.
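The first strategy lends itself to a short sketch: since CLIP image and text features live in a shared space, the content described by a prompt can be subtracted from the reference image's embedding to leave a "style" direction. The CLIP checkpoint, file name, and downstream use below are assumptions for illustration, not InstantStyle's exact pipeline.

```python
# Hedged sketch of feature-space style/content decoupling with CLIP embeddings.
import torch
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

reference = Image.open("style_reference.png")   # hypothetical image whose style we want to keep
content_prompt = "a cat"                        # the content we want to remove from it

with torch.no_grad():
    image_feat = model.get_image_features(**processor(images=reference, return_tensors="pt"))
    text_feat = model.get_text_features(**processor(text=[content_prompt], return_tensors="pt"))

image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

# Features in the same space can be subtracted; the residual is treated as a style signal
# that would then be injected only into the style-specific blocks of the diffusion model.
style_feat = image_feat - text_feat
```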
MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training
Abstract:
Contrastive pretraining of image-text foundation models, such as CLIP, demonstrated excellent zero-shot performance and improved robustness on a wide range of downstream tasks. However, these models utilize large transformer-based encoders with significant memory and latency overhead which pose challenges for deployment on mobile devices. In this work, we introduce MobileCLIP -- a new family of efficient image-text models optimized for runtime performance along with a novel and efficient training approach, namely multi-modal reinforced training. The proposed training approach leverages knowledge transfer from an image captioning model and an ensemble of strong CLIP encoders to improve the accuracy of efficient models. Our approach avoids train-time compute overhead by storing the additional knowledge in a reinforced dataset. MobileCLIP sets a new state-of-the-art latency-accuracy tradeoff for zero-shot classification and retrieval tasks on several datasets. Our MobileCLIP-S2 variant is 2.3x faster while more accurate compared to the previous best CLIP model based on ViT-B/16. We further demonstrate the effectiveness of our multi-modal reinforced training by training a CLIP model based on a ViT-B/16 image backbone and achieving a +2.9% average performance improvement on 38 evaluation benchmarks compared to the previous best. Moreover, we show that the proposed approach achieves 10x-1000x improved learning efficiency when compared with non-reinforced CLIP training.
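As context for the benchmarks mentioned above, the sketch below shows the standard CLIP-style zero-shot classification protocol, using a generic OpenAI CLIP checkpoint from transformers as a stand-in; it is not the MobileCLIP release itself, which ships with its own loading code. The image file and label prompts are illustrative.

```python
# Generic CLIP zero-shot classification sketch (stand-in checkpoint, not MobileCLIP).
import torch
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

image = Image.open("photo.jpg")                      # hypothetical test image
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image        # image-text similarity scores
probs = logits.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))          # predicted class probabilities
```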