Deep Learning Weekly: Issue #255
NVIDIA's 3D MoMa inverse rendering pipeline for quickly producing 3D objects, Apple's multi-task neural architecture for on-device scene analysis, unifying generative variational autoencoders, and more.
This week in deep learning, we bring you NVIDIA's 3D MoMA inverse rendering pipeline for quickly producing 3D objects, Apple's multi-task neural architecture for on-device scene analysis, unifying generative variational autoencoders, and a paper on unfolding half-shuffle transformers.
You may also enjoy Azure's new AI features for hybrid cloud environments, pushing and optimizing PyTorch models with OctoML, data pipelining with Ploomber, a paper on building open-ended embodied agents with internet-scale knowledge, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
By formulating the inverse rendering problem as a GPU-accelerated differentiable component, the NVIDIA 3D MoMa rendering pipeline uses modern AI and raw computational horsepower from NVIDIA GPUs to quickly produce 3D objects that creators can import, edit, and extend without limitation.
Microsoft rolled out a set of new capabilities, based on Azure Arc, Azure Machine Learning, and Azure Kubernetes Service, that will make it easier for companies to run artificial intelligence software in hybrid cloud environments.
A consortium in Sweden is developing a state-of-the-art language model with NVIDIA NeMo Megatron and will make it available to any user in the Nordic region.
In an opinion paper published 10 June in the data-science journal Patterns, University of Hull researchers outline the hurdles limiting AI’s impact on renewables, and how to surmount them using established and emerging AI methods.
Prophecy, a company providing a low-code platform for data engineering, has launched a dedicated integration for Databricks, enabling anyone to quickly and easily build data pipelines on the Apache Spark-based data platform.
An article demonstrating how to set up and run a basic MLOps workflow, from data ingestion, to training a model that builds on the previous best one, to deploying it on the Vertex AI platform.
A curated list of references for MLOps.
A technical tutorial on optimizing, packaging, and pushing a PyTorch model using OctoML and other necessary tools.
A comprehensive blog discussing how to create workflows and experiment pipelines, along with several examples of data version control techniques in action.
An article focusing on activity recognition, a general challenge across industries that presents specific opportunities in media production, where audio, video, and subtitles can be combined to build a solution.
It can take days, or even weeks, to train a deep neural network, but transfer learning, optimizers, early stopping, and GPUs can speed up the process significantly.
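Of the techniques mentioned, early stopping is the easiest to sketch in a framework-agnostic way: halt training once the validation loss stops improving, instead of running a fixed number of epochs. The class and parameter names below are illustrative, not from any specific library.

```python
# Minimal early-stopping sketch: stop when the validation loss has not
# improved by at least `min_delta` for `patience` consecutive epochs.
class EarlyStopping:
    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True to stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=2)
losses = [1.0, 0.8, 0.79, 0.79, 0.80]  # loss plateaus after epoch 2
for epoch, loss in enumerate(losses):
    if stopper.step(loss):
        break  # stops at epoch 4, keeping best = 0.79
```

Most frameworks ship a callback with this behavior (e.g. Keras's `EarlyStopping` or PyTorch Lightning's equivalent), usually combined with checkpointing the best weights.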
This document provides solutions to a variety of use cases regarding the saving and loading of PyTorch models.
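The most common case the document covers is saving for inference, where the recommended pattern is to persist the model's `state_dict` rather than pickling the whole module. A minimal sketch with a toy layer:

```python
import torch
import torch.nn as nn

# Save only the state_dict (a dict of parameter tensors), which is the
# pattern PyTorch recommends for inference checkpoints.
model = nn.Linear(4, 2)
torch.save(model.state_dict(), "model.pt")

# Loading: instantiate the same architecture first, then restore weights.
restored = nn.Linear(4, 2)
restored.load_state_dict(torch.load("model.pt"))
restored.eval()  # put dropout/batch-norm layers into inference mode
```

Saving the entire module with `torch.save(model, path)` also works but ties the checkpoint to the exact class and directory layout used at save time, which is why the state_dict approach is preferred.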
Apple Machine Learning Research describes how they developed the Apple Neural Scene Analyzer (ANSA), a unified backbone for building and maintaining scene analysis workflows in production.
An article on building a simple news article recommender system that computes the embeddings of all available articles and recommends the most relevant articles using Cohere API endpoints.
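The core of such a recommender is just nearest-neighbor search over embedding vectors. In the article the vectors come from Cohere's embed endpoint; the sketch below substitutes made-up 3-d vectors so it is self-contained, and ranks articles by cosine similarity to a query embedding.

```python
import numpy as np

# Hypothetical article corpus with toy embeddings standing in for the
# vectors a real embedding API would return.
articles = ["fed raises rates", "new GPU announced", "chip shortage eases"]
embeddings = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.3],
    [0.3, 0.6, 0.5],
])

def recommend(query_vec, embeddings, k=2):
    # Cosine similarity between the query and every article vector.
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query_vec)
    sims = embeddings @ query_vec / norms
    # Indices of the k most similar articles, best first.
    return np.argsort(-sims)[:k]

query = np.array([0.15, 0.75, 0.35])  # e.g. embedding of a just-read GPU story
top = recommend(query, embeddings)    # -> indices [1, 2]
```

At production scale the brute-force dot product is replaced by an approximate nearest-neighbor index, but the ranking logic is the same.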
Libraries & Code
Ploomber is the fastest way to build data pipelines. It can refactor legacy notebooks into modular pipelines with a single command.
Bolt is an algorithm for compressing vectors of real-valued data and running mathematical operations directly on the compressed representations.
A library that implements some of the most common VAE models. In particular, it makes it possible to run benchmark experiments and comparisons by training the models with the same autoencoding neural network architecture.
Papers & Publications
In coded aperture snapshot spectral compressive imaging (CASSI) systems, hyperspectral image (HSI) reconstruction methods are employed to recover the spatial-spectral signal from a compressed measurement. Among these algorithms, deep unfolding methods demonstrate promising performance but suffer from two issues. Firstly, they do not estimate the degradation patterns and ill-posedness degree from the highly related CASSI to guide the iterative learning. Secondly, they are mainly CNN-based, showing limitations in capturing long-range dependencies. In this paper, we propose a principled Degradation-Aware Unfolding Framework (DAUF) that estimates parameters from the compressed image and physical mask, and then uses these parameters to control each iteration. Moreover, we customize a novel Half-Shuffle Transformer (HST) that simultaneously captures local contents and non-local dependencies. By plugging HST into DAUF, we establish the first Transformer-based deep unfolding method, Degradation-Aware Unfolding Half-Shuffle Transformer (DAUHST), for HSI reconstruction. Experiments show that DAUHST significantly surpasses state-of-the-art methods while requiring cheaper computational and memory costs.
Autonomous agents have made great strides in specialist domains like Atari games and Go. However, they typically learn tabula rasa in isolated environments with limited and manually conceived objectives, thus failing to generalize across a wide spectrum of tasks and capabilities. Inspired by how humans continually learn and adapt in the open world, we advocate a trinity of ingredients for building generalist agents: 1) an environment that supports a multitude of tasks and goals, 2) a large-scale database of multimodal knowledge, and 3) a flexible and scalable agent architecture. We introduce MineDojo, a new framework built on the popular Minecraft game that features a simulation suite with thousands of diverse open-ended tasks and an internet-scale knowledge base with Minecraft videos, tutorials, wiki pages, and forum discussions. Using MineDojo's data, we propose a novel agent learning algorithm that leverages large pre-trained video-language models as a learned reward function. Our agent is able to solve a variety of open-ended tasks specified in free-form language without any manually designed dense shaping reward. We open-source the simulation suite and knowledge bases (this https URL) to promote research towards the goal of generally capable embodied agents.