Deep Learning Weekly: Issue 373
Liquid Foundation Models, A Visual Exploration of Semantic Text Chunking, a paper on Simple and Fast Distillation of Diffusion Models, and many more!
This week in deep learning, we bring you Liquid Foundation Models, A Visual Exploration of Semantic Text Chunking, and a paper on Simple and Fast Distillation of Diffusion Models.
You may also enjoy Deploying Accelerated Llama 3.2 from the Edge to the Cloud, How memory augmentation can improve large language models, a paper on Compress and Compare: Interactively Evaluating Efficiency and Behavior Across ML Model Compression Experiments, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Liquid Foundation Models: Our First Series of Generative AI Models
Liquid AI announced the first series of Liquid Foundation Models (LFMs) – a new generation of generative AI models that achieve state-of-the-art performance while maintaining a smaller memory footprint.
Google Labs adds video and audio input to AI-powered note-taking assistant NotebookLM
Google rolled out new features for its AI-powered research assistant NotebookLM, including the ability to add YouTube videos (via URL) and audio files as sources.
Announcing fine-tuning for customization and support for new models in Azure AI
Microsoft announced new features on Azure AI including fine-tuning for model customization and support for new models like Phi-3.5-vision-instruct and Command R+ from Cohere.
Deploying Accelerated Llama 3.2 from the Edge to the Cloud
NVIDIA is optimizing the Llama 3.2 collection of models to deliver high throughput and low latency across millions of GPUs worldwide.
MLOps & LLMOps
Summarising Daily AI Papers with GitHub and Gemini
A blog post introducing a project that uses Gemini and GitHub Actions to automatically generate and update summaries of AI research papers from Hugging Face.
Managing AI Inference Pipelines on Kubernetes with NVIDIA NIM Operator
A technical blog post explaining how to use NVIDIA NIM Operator to manage and deploy AI inference pipelines at scale on Kubernetes clusters.
Learning
How to Fine-Tune Multimodal Models or VLMs with Hugging Face TRL
A technical blog post outlining how to use Hugging Face TRL to fine-tune multimodal models for specific applications, such as generating product descriptions from images.
Advanced RAG: Query Decomposition & Reasoning
An article exploring how to decompose complex user queries into smaller, answerable sub-questions, improving the accuracy of Retrieval Augmented Generation (RAG) systems.
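As a rough illustration (not taken from the article), the decomposition pipeline can be sketched with a pluggable `llm` callable standing in for any chat-completion API; the names `decompose`, `answer_with_rag`, and `retrieve` are hypothetical:

```python
def decompose(query, llm):
    # Ask the model to split a multi-part question into
    # independent sub-questions, one per line.
    prompt = f"Split into independent sub-questions, one per line:\n{query}"
    return [q.strip() for q in llm(prompt).splitlines() if q.strip()]

def answer_with_rag(query, llm, retrieve):
    # Answer each sub-question against its own retrieved context,
    # then ask the model to synthesize a final answer.
    sub_questions = decompose(query, llm)
    answered = []
    for q in sub_questions:
        context = retrieve(q)
        answered.append((q, llm(f"Context: {context}\nQuestion: {q}")))
    joined = "\n".join(f"{q} -> {a}" for q, a in answered)
    return llm(f"Combine into one answer:\n{joined}")
```

In a real system, `llm` would wrap an API client and `retrieve` a vector store query; the structure of the loop is the part the article's approach shares.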
A Visual Exploration of Semantic Text Chunking
An article that visually explores how semantic text chunking breaks text into meaningful units, enhancing downstream retrieval and analysis.
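A minimal sketch of the core idea, using toy bag-of-words vectors in place of a real sentence-embedding model (the article's actual method may differ): split wherever the similarity between consecutive sentences drops.

```python
import math
import re

def embed(sentence):
    # Toy bag-of-words vector; a real pipeline would use a
    # sentence-embedding model such as sentence-transformers.
    vec = {}
    for tok in re.findall(r"[a-z']+", sentence.lower()):
        vec[tok] = vec.get(tok, 0) + 1
    return vec

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences, threshold=0.1):
    # Start a new chunk wherever similarity between consecutive
    # sentences falls below the threshold (assumes a non-empty list).
    chunks, current = [], [sentences[0]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(cur)) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(cur)
    chunks.append(" ".join(current))
    return chunks
```

The threshold (and whether to compare adjacent sentences or sliding windows) is exactly the kind of choice the article explores visually.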
How memory augmentation can improve large language models
A blog post describing how IBM Research is addressing memory limitations in large language models by augmenting their memory capabilities.
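To make the idea concrete, here is a hypothetical sketch of an external memory store (not IBM's implementation), using word overlap where a real system would use embedding similarity:

```python
class MemoryStore:
    """External memory: past facts are stored outside the model and
    the most relevant ones are prepended to each prompt."""

    def __init__(self):
        self.entries = []

    def add(self, text):
        self.entries.append(text)

    def recall(self, query, k=2):
        # Rank stored entries by word overlap with the query; a real
        # system would rank by embedding similarity instead.
        qwords = set(query.lower().split())
        scored = sorted(
            self.entries,
            key=lambda e: len(qwords & set(e.lower().split())),
            reverse=True,
        )
        return scored[:k]

def augmented_prompt(memory, query):
    # Prepend recalled memory so the model can condition on facts
    # that fell outside its context window.
    context = "\n".join(memory.recall(query))
    return f"Relevant memory:\n{context}\n\nUser: {query}"
```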
Libraries & Code
Convert any PDF into a podcast episode.
An open-source tool that turns any PDF into a conversational podcast episode.
Papers & Publications
Simple and Fast Distillation of Diffusion Models
Abstract:
Diffusion-based generative models have demonstrated their powerful performance across various tasks, but this comes at the cost of slow sampling speed. To achieve both efficient and high-quality synthesis, various distillation-based accelerated sampling methods have been developed recently. However, they generally require time-consuming fine-tuning with elaborate designs to achieve satisfactory performance in a specific number of function evaluations (NFE), making them difficult to employ in practice. To address this issue, we propose Simple and Fast Distillation (SFD) of diffusion models, which simplifies the paradigm used in existing methods and shortens their fine-tuning time by up to 1000×. We begin with a vanilla distillation-based sampling method and boost its performance to the state of the art by identifying and addressing several small yet vital factors affecting synthesis efficiency and quality. Our method can also achieve sampling with variable NFEs using a single distilled model. Extensive experiments demonstrate that SFD strikes a good balance between sample quality and fine-tuning costs in few-step image generation tasks. For example, SFD achieves 4.53 FID (NFE=2) on CIFAR-10 with only 0.64 hours of fine-tuning on a single NVIDIA A100 GPU.
Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution
Abstract:
Visual data comes in various forms, ranging from small icons of just a few pixels to long videos spanning hours. Existing multi-modal LLMs usually standardize these diverse visual inputs to a fixed resolution for visual encoders and yield similar numbers of tokens for LLMs. This approach is non-optimal for multimodal understanding and inefficient for processing inputs with long and short visual contents. To solve the problem, we propose Oryx, a unified multimodal architecture for the spatial-temporal understanding of images, videos, and multi-view 3D scenes. Oryx offers an on-demand solution to seamlessly and efficiently process visual inputs with arbitrary spatial sizes and temporal lengths through two core innovations: 1) a pre-trained OryxViT model that can encode images at any resolution into LLM-friendly visual representations; 2) a dynamic compressor module that supports 1x to 16x compression on visual tokens by request. These design features enable Oryx to accommodate extremely long visual contexts, such as videos, with lower resolution and high compression while maintaining high recognition precision for tasks like document understanding with native resolution and no compression. Beyond the architectural improvements, enhanced data curation and specialized training on long-context retrieval and spatial-aware data help Oryx achieve strong capabilities in image, video, and 3D multimodal understanding simultaneously.
Compress and Compare: Interactively Evaluating Efficiency and Behavior Across ML Model Compression Experiments
Abstract:
To deploy machine learning models on-device, practitioners use compression algorithms to shrink and speed up models while maintaining their high-quality output. A critical aspect of compression in practice is model comparison, including tracking many compression experiments, identifying subtle changes in model behavior, and negotiating complex accuracy-efficiency trade-offs. However, existing compression tools poorly support comparison, leading to tedious and, sometimes, incomplete analyses spread across disjoint tools. To support real-world comparative workflows, we develop an interactive visual system called Compress and Compare. Within a single interface, Compress and Compare surfaces promising compression strategies by visualizing provenance relationships between compressed models and reveals compression-induced behavior changes by comparing models' predictions, weights, and activations. We demonstrate how Compress and Compare supports common compression analysis tasks through two case studies, debugging failed compression on generative language models and identifying compression artifacts in image classification models. We further evaluate Compress and Compare in a user study with eight compression experts, illustrating its potential to provide structure to compression workflows, help practitioners build intuition about compression, and encourage thorough analysis of compression's effect on model behavior. Through these evaluations, we identify compression-specific challenges that future visual analytics tools should consider and Compress and Compare visualizations that may generalize to broader model comparison tasks.