Deep Learning Weekly : Issue #305
The first AI model based on Yann LeCun’s vision, Pipelines with LangChain, Airbyte, and Dagster, a technical guide for Falcon, and a paper on Semi-Supervised and Long-Tailed Object Detection.
This week in deep learning, we bring you I-JEPA: The first AI model based on Yann LeCun’s vision for more human-like AI, Pipelines with LangChain, Airbyte, and Dagster, Falcon - A guide to finetune and inference, and a paper on Semi-Supervised and Long-Tailed Object Detection with CascadeMatch.
You may also enjoy DeepMind’s AlphaDev discovering faster sorting algorithms, ML Observability in a Notebook, Fast Class-Agnostic Salient Object Segmentation, a paper on Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
I-JEPA: The first AI model based on Yann LeCun’s vision for more human-like AI
Meta’s Chief AI Scientist Yann LeCun just proposed a new architecture intended to overcome key limitations of even the most advanced AI systems.
AlphaDev discovers faster sorting algorithms
DeepMind’s AlphaDev – an RL-based system for discovering algorithms – uncovered a faster algorithm for sorting.
Scaling audio-visual learning without labels
A new multimodal technique blends major self-supervised learning methods to learn jointly from audio and video, more like the way humans do.
A step toward safe and reliable autopilots for flying
MIT researchers developed a machine-learning technique that can autonomously drive a car or fly a plane through a very difficult “stabilize-avoid” scenario.
McKinsey launches new product suite to help clients scale AI
McKinsey launched QuantumBlack Horizon, a set of AI development tools from QuantumBlack, AI by McKinsey.
Introducing Snorkel’s Foundation Model Data Platform
Snorkel AI announced its Foundation Model Data Platform, which supports the broader set of data-centric operations involved in developing modern foundation models (FMs).
MLOps
Implement AI data pipelines with Langchain, Airbyte, and Dagster
An article on how to set up a maintainable and scalable pipeline for integrating diverse data sources into large language models using Airbyte, Dagster, and LangChain.
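For flavor, here is a minimal sketch (not the article's exact code) of a Dagster asset that indexes Airbyte-synced records into a vector store with LangChain; the JSONL path and the "text" field are hypothetical stand-ins for whatever your Airbyte destination produces.

```python
# A minimal sketch: a Dagster asset that takes documents already synced by
# Airbyte (assumed to land in a local JSONL file, a hypothetical path) and
# indexes them into a FAISS vector store via LangChain.
import json

from dagster import asset
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS


@asset
def document_index():
    # Assumed location and field name for Airbyte's synced records;
    # adjust to your actual destination.
    with open("airbyte_sync/records.jsonl") as f:
        texts = [json.loads(line)["text"] for line in f]

    # Chunk documents so they fit the embedding model's context window.
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    chunks = [c for t in texts for c in splitter.split_text(t)]

    # Embed and persist the index for downstream retrieval assets.
    index = FAISS.from_texts(chunks, OpenAIEmbeddings())
    index.save_local("faiss_index")
    return "faiss_index"
```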
Monitoring machine learning models in production
The primary goal of model monitoring is to ensure that the model remains effective and reliable in making predictions or decisions, even as the data or environment in which it operates evolves.
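As a concrete example of one such check, here is a minimal NumPy sketch of the Population Stability Index, a common drift metric; the 0.1/0.25 decision thresholds mentioned in the comment are conventional rules of thumb, not from the article.

```python
# Population Stability Index (PSI), a common drift metric for model monitoring.
# Rule of thumb: PSI < 0.1 is stable, 0.1-0.25 warrants a look, > 0.25 is drift.
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    # Bin edges come from the reference (training-time) distribution.
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_counts, _ = np.histogram(expected, bins=edges)
    a_counts, _ = np.histogram(actual, bins=edges)
    # Convert counts to proportions; epsilon avoids division by zero / log(0).
    eps = 1e-6
    e_frac = e_counts / max(e_counts.sum(), 1) + eps
    a_frac = a_counts / max(a_counts.sum(), 1) + eps
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(0)
train_scores = rng.normal(0.0, 1.0, 10_000)   # reference distribution
live_scores = rng.normal(0.3, 1.1, 10_000)    # shifted production distribution
print(f"PSI = {psi(train_scores, live_scores):.3f}")
```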
The Secret Sauce behind 100K context window in LLMs: all tricks in one place
A blog post that collects the techniques used to speed up training and inference of LLMs with context windows of up to 100K input tokens.
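One trick in that family is ALiBi, which drops positional embeddings in favor of per-head linear distance penalties on attention scores. Below is a minimal PyTorch sketch of the idea; dimensions are illustrative.

```python
# ALiBi (Attention with Linear Biases): instead of positional embeddings, each
# head adds a linear distance penalty to its attention scores, which helps
# models extrapolate to longer contexts. A minimal causal-attention sketch.
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    # Geometric head slopes 2^(-8/n), 2^(-16/n), ... as in the ALiBi paper
    # (exact for head counts that are powers of two).
    slopes = torch.tensor([2 ** (-8 * (h + 1) / n_heads) for h in range(n_heads)])
    # distance[i, j] = i - j for keys j at or before query i.
    pos = torch.arange(seq_len)
    distance = pos.view(-1, 1) - pos.view(1, -1)           # (seq, seq)
    bias = -slopes.view(-1, 1, 1) * distance               # (heads, seq, seq)
    # Mask out future positions for causal attention.
    causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    return bias.masked_fill(causal, float("-inf"))

scores = torch.randn(4, 8, 8)          # (heads, queries, keys) attention logits
scores = scores + alibi_bias(n_heads=4, seq_len=8)
attn = torch.softmax(scores, dim=-1)   # recency-biased causal attention weights
```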
Arize-ai/phoenix: ML Observability in a Notebook
Phoenix, an open-source library that delivers ML observability in a notebook for monitoring and fine-tuning generative models, just debuted a new version.
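A minimal notebook launch looks roughly like the sketch below, following Phoenix's quickstart; treat the exact Schema argument names as assumptions and check the current docs, since the API has evolved across versions.

```python
# A minimal sketch of launching Phoenix in a notebook, based on its quickstart
# (px.Schema / px.Dataset / px.launch_app). Argument names are assumptions
# drawn from the quickstart; verify against the current docs.
import pandas as pd
import phoenix as px

df = pd.DataFrame(
    {
        "prediction": ["cat", "dog", "cat"],
        "actual": ["cat", "dog", "dog"],
        "feature_1": [0.1, 0.7, 0.4],
    }
)
schema = px.Schema(
    prediction_label_column_name="prediction",
    actual_label_column_name="actual",
    feature_column_names=["feature_1"],
)
# Launch the observability UI inside the notebook and explore performance/drift.
session = px.launch_app(px.Dataset(df, schema))
```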
A post that evaluates DeepSpeed support in Habana SynapseAI v1.5/v1.6 and how it helps scale LLM training on Habana Gaudi accelerators.
Learning
Falcon - A guide to finetune and inference
A blog post on how to efficiently fine-tune Falcon and run inference on consumer-grade hardware with less than 4.5 GB of GPU memory.
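Low-memory fine-tuning of this kind typically combines 4-bit quantization with LoRA adapters (QLoRA). Below is a hedged sketch of that setup with transformers, peft, and bitsandbytes; the hyperparameters are illustrative, not necessarily the guide's.

```python
# A sketch of QLoRA-style 4-bit fine-tuning setup for Falcon using
# transformers + peft + bitsandbytes; hyperparameters are illustrative.
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "tiiuae/falcon-7b"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,  # Falcon shipped custom modeling code at release
)

# Freeze the 4-bit base weights and train small low-rank adapters instead.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["query_key_value"],  # Falcon's fused attention projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```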
Obsidian-Copilot: A Prototype Assistant for Writing & Thinking
Eugene Yan discusses how he built Obsidian-Copilot, which can help draft a few paragraphs via retrieval-augmented generation and more.
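The underlying retrieval-augmented generation pattern is simple: retrieve the most relevant notes, then prepend them to the drafting prompt. Here is a self-contained sketch with TF-IDF retrieval standing in for the project's actual retrieval stack; `generate` is a placeholder for any LLM call.

```python
# Retrieval-augmented generation in miniature: retrieve relevant notes, then
# assemble them into a drafting prompt. TF-IDF stands in for the project's
# real retriever; `generate` is a placeholder for any LLM call.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

notes = [
    "Retrieval-augmented generation grounds LLM output in your own documents.",
    "Obsidian stores notes as plain markdown files in a local vault.",
    "Pseudo-labeling uses a model's own confident predictions as training data.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    vectorizer = TfidfVectorizer()
    doc_vectors = vectorizer.fit_transform(notes)
    scores = cosine_similarity(vectorizer.transform([query]), doc_vectors)[0]
    return [notes[i] for i in scores.argsort()[::-1][:k]]

query = "draft a paragraph about RAG for note-taking"
context = "\n".join(retrieve(query))
prompt = f"Using these notes:\n{context}\n\nWrite a draft for: {query}"
# response = generate(prompt)  # hand the assembled prompt to your LLM of choice
```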
Fast Class-Agnostic Salient Object Segmentation
Apple Machine Learning Research describes the architecture of the subject-lifting network used in iOS 16, iPadOS 16, and macOS Ventura.
BAIR describes how it addresses common Stable Diffusion failure modes by equipping diffusion models with enhanced spatial and common-sense reasoning in a novel two-stage generation process.
Libraries & Code
A PyTorch library for deep learning research on audio generation.
A universal solution that allows any language model to be deployed natively on a diverse set of hardware backends and native applications.
A library for training transformer language models with Proximal Policy Optimization (PPO).
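The PPO loop for language models looks roughly like the sketch below, shown here with Hugging Face's trl (whose API matches this description); the constant reward is a stand-in for a real reward model or human feedback.

```python
# A sketch of the PPO fine-tuning loop for language models with Hugging Face's
# trl; the constant reward is a placeholder for a reward model's score.
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)

config = PPOConfig(batch_size=1, mini_batch_size=1)
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)

# Sample a response to a query, score it, and take one PPO optimization step.
query = tokenizer("The movie was", return_tensors="pt").input_ids[0]
response = ppo_trainer.generate(query, max_new_tokens=16).squeeze(0)[len(query):]

reward = torch.tensor(1.0)  # stand-in: use a reward model in practice
stats = ppo_trainer.step([query], [response], [reward])
```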
Papers & Publications
Semi-Supervised and Long-Tailed Object Detection with CascadeMatch
Abstract:
This paper focuses on long-tailed object detection in the semi-supervised learning setting, which poses realistic challenges, but has rarely been studied in the literature. We propose a novel pseudo-labeling-based detector called CascadeMatch. Our detector features a cascade network architecture, which has multi-stage detection heads with progressive confidence thresholds. To avoid manually tuning the thresholds, we design a new adaptive pseudo-label mining mechanism to automatically identify suitable values from data. To mitigate confirmation bias, where a model is negatively reinforced by incorrect pseudo-labels produced by itself, each detection head is trained by the ensemble pseudo-labels of all detection heads. Experiments on two long-tailed datasets, i.e., LVIS and COCO-LT, demonstrate that CascadeMatch surpasses existing state-of-the-art semi-supervised approaches—across a wide range of detection architectures—in handling long-tailed object detection. For instance, CascadeMatch outperforms Unbiased Teacher by 1.9 APFix on LVIS when using a ResNet50-based Cascade R-CNN structure, and by 1.7 APFix when using Sparse R-CNN with a Transformer encoder. We also show that CascadeMatch can even handle the challenging sparsely annotated object detection problem.
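To give a feel for the adaptive-threshold idea, here is a toy NumPy sketch of per-class confidence thresholds derived from the data; this is a simplified stand-in for the paper's pseudo-label mining, not its actual mechanism.

```python
# Illustrative per-class adaptive confidence thresholds for pseudo-labeling,
# a simplified stand-in for CascadeMatch's adaptive pseudo-label mining
# (the paper derives thresholds from data rather than a fixed global cutoff).
import numpy as np

rng = np.random.default_rng(0)
n_classes = 5
confidences = rng.uniform(0.2, 1.0, size=1000)    # predicted box scores
classes = rng.integers(0, n_classes, size=1000)   # predicted class per box

# Per-class threshold = mean confidence of that class's predictions, so rare
# (long-tailed) classes with lower typical scores are not filtered out wholesale.
thresholds = np.array([confidences[classes == c].mean() for c in range(n_classes)])

keep = confidences >= thresholds[classes]         # mine pseudo-labels
print(f"kept {keep.sum()} / {len(keep)} boxes as pseudo-labels")
print("per-class thresholds:", np.round(thresholds, 3))
```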
Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting
Abstract:
Text-guided image editing can have a transformative impact in supporting creative applications. A key challenge is to generate edits that are faithful to input text prompts, while consistent with input images. We present Imagen Editor, a cascaded diffusion model built by fine-tuning Imagen on text-guided image inpainting. Imagen Editor's edits are faithful to the text prompts, which is accomplished by using object detectors to propose inpainting masks during training. In addition, Imagen Editor captures fine details in the input image by conditioning the cascaded pipeline on the original high resolution image. To improve qualitative and quantitative evaluation, we introduce EditBench, a systematic benchmark for text-guided image inpainting. EditBench evaluates inpainting edits on natural and generated images exploring objects, attributes, and scenes. Through extensive human evaluation on EditBench, we find that object-masking during training leads to across-the-board improvements in text-image alignment -- such that Imagen Editor is preferred over DALL-E 2 and Stable Diffusion -- and, as a cohort, these models are better at object-rendering than text-rendering, and handle material/color/size attributes better than count/shape attributes.
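The abstract's key training trick is masking detected objects rather than random regions. Here is a minimal NumPy sketch of that mask-construction step, with hypothetical detector boxes:

```python
# Turning (hypothetical) detector boxes into a binary inpainting mask, the
# object-masking idea the abstract credits for better text-image alignment.
import numpy as np

def boxes_to_mask(boxes: list[tuple[int, int, int, int]], h: int, w: int) -> np.ndarray:
    """boxes are (x1, y1, x2, y2) in pixel coordinates; 1 marks masked pixels."""
    mask = np.zeros((h, w), dtype=np.uint8)
    for x1, y1, x2, y2 in boxes:
        mask[y1:y2, x1:x2] = 1
    return mask

# e.g. two detected objects in a 256x256 image
mask = boxes_to_mask([(30, 40, 120, 160), (180, 60, 240, 140)], h=256, w=256)
print(f"masked fraction: {mask.mean():.2%}")  # region the model learns to inpaint
```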
Segment Anything in High Quality
Abstract:
The recent Segment Anything Model (SAM) represents a big leap in scaling up segmentation models, allowing for powerful zero-shot capabilities and flexible prompting. Despite being trained with 1.1 billion masks, SAM's mask prediction quality falls short in many cases, particularly when dealing with objects that have intricate structures. We propose HQ-SAM, equipping SAM with the ability to accurately segment any object, while maintaining SAM's original promptable design, efficiency, and zero-shot generalizability. Our careful design reuses and preserves the pre-trained model weights of SAM, while only introducing minimal additional parameters and computation. We design a learnable High-Quality Output Token, which is injected into SAM's mask decoder and is responsible for predicting the high-quality mask. Instead of only applying it on mask-decoder features, we first fuse them with early and final ViT features for improved mask details. To train our introduced learnable parameters, we compose a dataset of 44K fine-grained masks from several sources. HQ-SAM is only trained on the introduced dataset of 44K masks, which takes only 4 hours on 8 GPUs. We show the efficacy of HQ-SAM in a suite of 9 diverse segmentation datasets across different downstream tasks, 7 of which are evaluated in a zero-shot transfer protocol.
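Here is a toy PyTorch sketch of the two ideas the abstract highlights: a learnable output token appended to the decoder's tokens, and fusion of early and final ViT features. Dimensions and wiring are illustrative, not SAM's actual decoder.

```python
# A toy sketch of HQ-SAM's two key ideas: (1) a learnable High-Quality Output
# Token appended to the mask decoder's tokens, and (2) fusing early and final
# ViT features for detail. Dimensions and wiring are illustrative only.
import torch
import torch.nn as nn

class HQTokenHead(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.hq_token = nn.Parameter(torch.randn(1, 1, dim))  # the new learnable token
        self.fuse = nn.Conv2d(2 * dim, dim, kernel_size=1)    # early+final feature fusion

    def forward(self, decoder_tokens, early_feat, final_feat):
        # decoder_tokens: (B, T, C); *_feat: (B, C, H, W) ViT feature maps.
        b = decoder_tokens.shape[0]
        tokens = torch.cat([decoder_tokens, self.hq_token.expand(b, -1, -1)], dim=1)
        fused = self.fuse(torch.cat([early_feat, final_feat], dim=1))  # (B, C, H, W)
        # Dot the HQ token against fused features to predict the HQ mask logits.
        hq = tokens[:, -1]                                    # (B, C)
        return torch.einsum("bc,bchw->bhw", hq, fused)        # (B, H, W)

head = HQTokenHead()
mask = head(torch.randn(2, 5, 256), torch.randn(2, 256, 64, 64), torch.randn(2, 256, 64, 64))
print(mask.shape)  # torch.Size([2, 64, 64])
```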