Deep Learning Weekly: Issue #277
Meta's CICERO, exploring large-scale multimedia data with Kangas, a review of deep learning approaches to ASR, a paper on a diffusion model for object detection, and many more
This week in deep learning, we bring you Meta's CICERO, exploring large-scale multimedia data with Kangas, a review of deep learning approaches to ASR, and a paper on a diffusion model for object detection.
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Meta AI has developed MultiRay, a new platform for running state-of-the-art AI models at scale, as part of its push to make AI systems more efficient.
Meta AI announces CICERO, the first AI to achieve human-level performance in the popular strategy game Diplomacy.
MLCommons released the latest set of benchmark results, offering a glimpse at the capabilities of new chips and old as they tackled executing lightweight AI on the tiniest systems and more.
Refik Anadol’s AI-generated art made its debut at The Museum of Modern Art in New York City.
TensorFlow 2.11 has just been released. This includes enhancements to DTensor, the completion of the Keras Optimizer migration, the introduction of an experimental StructuredTensor, and more.
A guide that covers how a framework-agnostic pipeline was created using DVC, Rust, and Python.
A code-along blog that shows you how to automatically surface and troubleshoot the reason for performance degradation by analyzing embedding vectors associated with the input images.
An article that guides you through hooking up Neptune to track your machine learning experiments on Google Colab.
A blog on how MLOps can be used, its benefits, and the stages of an MLOps pipeline.
A comprehensive review of the theory and application of different deep learning techniques used for automatic speech recognition.
A blog that presents a case study demonstrating the scaling of FLAVA, a promising model available in TorchMultimodal, to 10B params using techniques from PyTorch Distributed.
A comprehensive post that highlights a taxonomy of use cases within Document AI, the best open-source models for those use cases, and some practical solutions.
An article that covers the approach of using graph transformation to optimize PyTorch’s performance for production.
This blog post demonstrates how to build an end-to-end sentiment analysis application.
Libraries & Code
Kangas is a tool for exploring, analyzing, and visualizing large-scale multimedia data.
GALACTICA is a general-purpose scientific language model. It is trained on a large corpus of scientific text and data.
FlagAI (Fast LArge-scale General AI models) is a fast, easy-to-use, and extensible toolkit for large-scale models.
Papers & Publications
We propose DiffusionDet, a new framework that formulates object detection as a denoising diffusion process from noisy boxes to object boxes. During the training stage, object boxes diffuse from ground-truth boxes to a random distribution, and the model learns to reverse this noising process. In inference, the model refines a set of randomly generated boxes to the output results in a progressive way. The extensive evaluations on the standard benchmarks, including MS-COCO and LVIS, show that DiffusionDet achieves favorable performance compared to previous well-established detectors. Our work brings two important findings in object detection. First, random boxes, although drastically different from pre-defined anchors or learned queries, are also effective object candidates. Second, object detection, one of the representative perception tasks, can be solved in a generative way.
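To make the "boxes diffuse from ground-truth boxes to a random distribution" idea concrete, here is a minimal sketch of a forward noising step on boxes. It assumes the standard Gaussian diffusion form (scale the signal by sqrt(alpha_bar), add noise scaled by sqrt(1 - alpha_bar)); the function name `noise_boxes`, the (cx, cy, w, h) box layout, and the `alpha_bar` parameter are illustrative assumptions, not the paper's code.

```python
import math
import random

def noise_boxes(boxes, alpha_bar, rng):
    """Blend ground-truth boxes with Gaussian noise (hypothetical helper).

    boxes: list of (cx, cy, w, h) tuples, assumed normalized to [0, 1].
    alpha_bar: fraction of signal kept, in [0, 1]; 1.0 keeps the boxes
               unchanged and 0.0 replaces them with pure noise.
    rng: a random.Random instance, passed in for reproducibility.
    """
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [
        tuple(signal_scale * c + noise_scale * rng.gauss(0.0, 1.0) for c in box)
        for box in boxes
    ]

gt_boxes = [(0.5, 0.5, 0.2, 0.3), (0.25, 0.75, 0.1, 0.1)]
slightly_noisy = noise_boxes(gt_boxes, alpha_bar=0.9, rng=random.Random(0))
mostly_noise = noise_boxes(gt_boxes, alpha_bar=0.1, rng=random.Random(0))
```

A detector trained in this style would learn to map the noisy boxes back toward `gt_boxes`; at inference it would start from fully random boxes and refine them over several steps.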
Sparsely-activated Mixture-of-experts (MoE) models allow the number of parameters to greatly increase while keeping the amount of computation for a given token or a given sample unchanged. However, a poor expert routing strategy (e.g. one resulting in load imbalance) can cause certain experts to be under-trained, leading to an expert being under or over-specialized. Prior work allocates a fixed number of experts to each token using a top-k function regardless of the relative importance of different tokens. To address this, we propose a heterogeneous mixture-of-experts employing an expert choice method. Instead of letting tokens select the top-k experts, we have experts selecting the top-k tokens. As a result, each token can be routed to a variable number of experts and each expert can have a fixed bucket size. We systematically study pre-training speedups using the same computational resources of the Switch Transformer top-1 and GShard top-2 gating of prior work and find that our method improves training convergence time by more than 2x. For the same computational cost, our method demonstrates higher performance in fine-tuning 11 selected tasks in the GLUE and SuperGLUE benchmarks. For a smaller activation cost, our method outperforms the T5 dense model in 7 out of the 11 tasks.
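The routing idea in this abstract (experts pick their top-k tokens, rather than tokens picking experts) can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: the names `expert_choice_route` and `experts_per_token`, the `capacity_factor` parameter, and the bucket-size rule k = tokens × capacity_factor / experts are assumptions made for the example.

```python
import math

def softmax(row):
    """Numerically stable softmax over one token's expert logits."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [v / total for v in exps]

def expert_choice_route(logits, capacity_factor=1.0):
    """Let each expert select its top-k tokens (hypothetical helper).

    logits[t][e] is the router score of token t for expert e.
    Returns {expert index: [token indices]}, with k tokens per expert.
    """
    num_tokens = len(logits)
    num_experts = len(logits[0])
    # Fixed bucket size per expert (assumed rule for this sketch).
    k = min(num_tokens, max(1, int(num_tokens * capacity_factor / num_experts)))
    probs = [softmax(row) for row in logits]
    assignment = {}
    for e in range(num_experts):
        ranked = sorted(range(num_tokens), key=lambda t: probs[t][e], reverse=True)
        assignment[e] = ranked[:k]
    return assignment

def experts_per_token(assignment, num_tokens):
    """Count how many experts picked each token."""
    counts = [0] * num_tokens
    for tokens in assignment.values():
        for t in tokens:
            counts[t] += 1
    return counts

router_logits = [
    [2.0, 2.0, 0.0],  # token 0: favored by experts 0 and 1
    [3.0, 0.0, 0.0],  # token 1: strongly prefers expert 0
    [0.0, 3.0, 0.0],  # token 2: strongly prefers expert 1
    [0.0, 0.0, 1.0],  # token 3: prefers expert 2
]
routing = expert_choice_route(router_logits, capacity_factor=1.5)
print(routing)
print(experts_per_token(routing, len(router_logits)))
```

Every expert ends up with the same bucket size, so load is balanced by construction, while the number of experts handling a given token varies with how strongly the experts score it.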
Decoding visual stimuli from brain recordings aims to deepen our understanding of the human visual system and build a solid foundation for bridging human and computer vision through the Brain-Computer Interface. However, reconstructing high-quality images with correct semantics from brain recordings is a challenging problem due to the complex underlying representations of brain signals and the scarcity of data annotations. In this work, we present MinD-Vis: Sparse Masked Brain Modeling with Double-Conditioned Latent Diffusion Model for Human Vision Decoding. Firstly, we learn an effective self-supervised representation of fMRI data using mask modeling in a large latent space inspired by the sparse coding of information in the primary visual cortex. Then by augmenting a latent diffusion model with double-conditioning, we show that MinD-Vis can reconstruct highly plausible images with semantically matching details from brain recordings using very few paired annotations. We benchmarked our model qualitatively and quantitatively; the experimental results indicate that our method outperformed state-of-the-art in both semantic mapping (100-way semantic classification) and generation quality (FID) by 66% and 41% respectively. An exhaustive ablation study was also conducted to analyze our framework.