Deep Learning Weekly: Issue 335
AI models help robots execute complex plans more transparently, Solving Reasoning Problems with LLMs, a paper on Foundation Models for Large Scale Orchestration of Robotic Agents, and many more!
This week in deep learning, we bring you Multiple AI models help robots execute complex plans more transparently, Solving Reasoning Problems with LLMs in 2023, Tackling Water Pollution using YOLO-NAS, and a paper on AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents.
You may also enjoy New NIST report sounds the alarm on growing threat of AI attacks, A Cheat Sheet and Some Recipes For Building Advanced RAG, An In-Depth Guide To Help You Start Auditing Your AI Models, a paper on Relational Deep Learning: Graph Representation Learning on Relational Databases, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Multiple AI models help robots execute complex plans more transparently
The HiP framework from MIT CSAIL composes three different foundation models to produce detailed plans for robots, helping them execute household, factory, and construction tasks.
Rabbit unveils r1 AI pocket companion to handle tasks for you
Tech startup Rabbit has unveiled r1, an AI-powered companion device that does digital tasks for you.
Nabla raises $24M for its AI-powered clinical note platform
Nabla Technologies, a startup helping medical professionals create clinical notes faster, announced that it has closed a $24 million funding round.
New NIST report sounds the alarm on growing threat of AI attacks
The National Institute of Standards and Technology (NIST) has released an urgent report to aid in the defense against an escalating threat landscape targeting artificial intelligence (AI) systems.
EU weighing whether Microsoft-OpenAI alliance could be subject to antitrust probe
The European Commission may launch an antitrust probe into Microsoft’s high-profile partnership with OpenAI.
MLOps & LLMOps
Solving Reasoning Problems with LLMs in 2023
A collection of insightful summaries focusing on the progress of LLM research on tool use and reasoning.
A Cheat Sheet and Some Recipes For Building Advanced RAG
A comprehensive RAG Cheat Sheet detailing motivations for RAG as well as techniques and strategies for progressing beyond Basic or Naive RAG builds.
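For readers new to the topic, here is a bare-bones sketch of the retrieval step that Basic/Naive RAG builds on; the `embed` function is a placeholder for a real embedding model (e.g. a sentence-transformers encoder), and the documents are toy data:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder embedding: swap in a real model for actual use.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=384)
    return v / np.linalg.norm(v)

documents = [
    "RAG retrieves relevant context before generation.",
    "Naive RAG embeds fixed-size chunks of documents.",
    "Advanced RAG adds reranking and query rewriting.",
]
doc_vectors = np.stack([embed(d) for d in documents])

def retrieve(query: str, k: int = 2) -> list[str]:
    scores = doc_vectors @ embed(query)  # cosine similarity (unit vectors)
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

question = "What distinguishes advanced RAG?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
# `prompt` would then be sent to the LLM of your choice.
```

The advanced techniques in the cheat sheet (reranking, query rewriting, hierarchical retrieval) all slot into or around this retrieve-then-generate loop.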
Merge Large Language Models with mergekit
A detailed introduction to how SLERP, TIES, DARE, and passthrough work for merging LLMs.
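To give a feel for what SLERP does under the hood, here is a minimal NumPy sketch of spherical linear interpolation between two weight tensors; this is illustrative only, not mergekit's implementation:

```python
import numpy as np

def slerp(t: float, v0: np.ndarray, v1: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Spherically interpolate between two flattened parameter vectors."""
    v0_n = v0 / (np.linalg.norm(v0) + eps)
    v1_n = v1 / (np.linalg.norm(v1) + eps)
    dot = np.clip(np.dot(v0_n, v1_n), -1.0, 1.0)
    theta = np.arccos(dot)            # angle between the two parameter vectors
    if theta < eps:                   # nearly parallel: fall back to plain LERP
        return (1 - t) * v0 + t * v1
    sin_theta = np.sin(theta)
    return (np.sin((1 - t) * theta) / sin_theta) * v0 + (np.sin(t * theta) / sin_theta) * v1

rng = np.random.default_rng(0)
layer_a = rng.normal(size=(4, 4))     # toy stand-ins for two models' layer weights
layer_b = rng.normal(size=(4, 4))
merged = slerp(0.5, layer_a.ravel(), layer_b.ravel()).reshape(layer_a.shape)
```

Interpolating along the sphere rather than along a straight line preserves the magnitude characteristics of the weights, which is the motivation for preferring SLERP over naive averaging when merging two models.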
Generating value from enterprise data: Best practices for Text2SQL and generative AI
An article that explores the use cases, challenges, design patterns, and best practices for Text2SQL.
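A common design pattern in this space is schema-in-prompt generation: put the table definitions and the user's question into the prompt and ask the model to return only SQL. A minimal illustration (the schema, question, and validation step here are hypothetical):

```python
schema = """CREATE TABLE sales (
    id INTEGER PRIMARY KEY,
    region TEXT,
    amount REAL,
    sold_at DATE
);"""

question = "Total sales per region in 2023, highest first."

prompt = (
    "Given this SQLite schema:\n"
    f"{schema}\n"
    f"Write one SQL query answering: {question}\n"
    "Return only the SQL, with no explanation."
)
# In practice, the generated SQL should be validated (e.g. via EXPLAIN or a
# dry run against a read-only replica) before being executed on real data.
```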
Learning
Build and Monitor an Object Detection Model in 5 Steps Using Comet
An article about creating a custom object detection model and monitoring its accuracy metrics and hyperparameters in five simple steps.
Accelerate AI models on GPU using Amazon SageMaker multi-model endpoints with TorchServe
A post demonstrating how to host generative AI models on SageMaker multi-model endpoints, and how to build a language-guided editing solution that can help artists develop artworks faster.
An In-Depth Guide To Help You Start Auditing Your AI Models
A comprehensive guide to AI auditing, covering the definitions, benefits, and challenges.
Libraries & Code
EleutherAI/lm-evaluation-harness
A framework for few-shot evaluation of autoregressive language models.
myshell-ai/OpenVoice
Instant voice cloning by MyShell.
stanford-oval/WikiChat
WikiChat stops the hallucination of large language models by retrieving data from Wikipedia.
Papers & Publications
AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents
Abstract:
Foundation models that incorporate language, vision, and more recently actions have revolutionized the ability to harness internet scale data to reason about useful tasks. However, one of the key challenges of training embodied foundation models is the lack of data grounded in the physical world. In this paper, we propose AutoRT, a system that leverages existing foundation models to scale up the deployment of operational robots in completely unseen scenarios with minimal human supervision. AutoRT leverages vision-language models (VLMs) for scene understanding and grounding, and further uses large language models (LLMs) for proposing diverse and novel instructions to be performed by a fleet of robots. Guiding data collection by tapping into the knowledge of foundation models enables AutoRT to effectively reason about autonomy tradeoffs and safety while significantly scaling up data collection for robot learning. We demonstrate AutoRT proposing instructions to over 20 robots across multiple buildings and collecting 77k real robot episodes via both teleoperation and autonomous robot policies. We experimentally show that such “in-the-wild” data collected by AutoRT is significantly more diverse, and that AutoRT’s use of LLMs allows for instruction following data collection robots that are aligned with human preferences.
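The abstract describes a propose-and-filter loop: a VLM grounds the scene, an LLM proposes candidate tasks, and safety rules prune them before execution. A toy Python sketch of that orchestration pattern, where every class below is an illustrative stand-in rather than AutoRT's actual components:

```python
class VLM:
    """Stand-in for a vision-language model doing scene understanding."""
    def describe(self, image) -> str:
        return "a table with a cup and a sponge"

class LLM:
    """Stand-in for an LLM proposing diverse candidate instructions."""
    def propose_tasks(self, scene_description: str) -> list[str]:
        return ["pick up the cup",
                "wipe the table with the sponge",
                "pour hot coffee into the cup"]

class SafetyCritic:
    """Stand-in for the autonomy/safety filtering step."""
    BANNED = ("hot", "sharp", "human")
    def approve(self, task: str) -> bool:
        return not any(word in task for word in self.BANNED)

def collect_episode(vlm: VLM, llm: LLM, critic: SafetyCritic, image) -> str | None:
    """One orchestration step: describe, propose, filter, select."""
    scene = vlm.describe(image)
    proposals = llm.propose_tasks(scene)
    safe = [t for t in proposals if critic.approve(t)]
    return safe[0] if safe else None  # would be routed to teleop or a policy

print(collect_episode(VLM(), LLM(), SafetyCritic(), image=None))
```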
Relational Deep Learning: Graph Representation Learning on Relational Databases
Abstract:
Much of the world's most valued data is stored in relational databases and data warehouses, where the data is organized into many tables connected by primary-foreign key relations. However, building machine learning models using this data is both challenging and time consuming. The core problem is that no machine learning method is capable of learning on multiple tables interconnected by primary-foreign key relations. Current methods can only learn from a single table, so the data must first be manually joined and aggregated into a single training table, the process known as feature engineering. Feature engineering is slow, error prone and leads to suboptimal models. Here we introduce an end-to-end deep representation learning approach to directly learn on data laid out across multiple tables. We name our approach Relational Deep Learning (RDL). The core idea is to view relational databases as a temporal, heterogeneous graph, with a node for each row in each table, and edges specified by primary-foreign key links. Message Passing Graph Neural Networks can then automatically learn across the graph to extract representations that leverage all input data, without any manual feature engineering. Relational Deep Learning leads to more accurate models that can be built much faster. To facilitate research in this area, we develop RelBench, a set of benchmark datasets and an implementation of Relational Deep Learning. The data covers a wide spectrum, from discussions on Stack Exchange to book reviews on the Amazon Product Catalog. Overall, we define a new research area that generalizes graph machine learning and broadens its applicability to a wide set of AI use cases.
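The core construction — one typed node per row, one edge per primary-foreign key link — can be sketched in a few lines of plain Python (a toy illustration of the idea, not RelBench's API):

```python
# Toy "database": two tables linked by a primary-foreign key relation.
users = [{"user_id": 1, "name": "ada"}, {"user_id": 2, "name": "alan"}]
orders = [{"order_id": 10, "user_id": 1, "amount": 30.0},
          {"order_id": 11, "user_id": 1, "amount": 12.5},
          {"order_id": 12, "user_id": 2, "amount": 99.0}]

# One node per row, typed by table name; one edge per foreign-key reference.
nodes = {("users", r["user_id"]): r for r in users}
nodes.update({("orders", r["order_id"]): r for r in orders})
edges = [(("orders", r["order_id"]), ("users", r["user_id"])) for r in orders]

# A message-passing GNN over this heterogeneous graph would aggregate each
# user's order features automatically, replacing the manual join/aggregate
# step of feature engineering.
print(len(nodes), "nodes,", len(edges), "edges")
```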
From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations
Abstract:
We present a framework for generating full-bodied photorealistic avatars that gesture according to the conversational dynamics of a dyadic interaction. Given speech audio, we output multiple possibilities of gestural motion for an individual, including face, body, and hands. The key behind our method is in combining the benefits of sample diversity from vector quantization with the high-frequency details obtained through diffusion to generate more dynamic, expressive motion. We visualize the generated motion using highly photorealistic avatars that can express crucial nuances in gestures (e.g. sneers and smirks). To facilitate this line of research, we introduce a first-of-its-kind multi-view conversational dataset that allows for photorealistic reconstruction. Experiments show our model generates appropriate and diverse gestures, outperforming both diffusion- and VQ-only methods. Furthermore, our perceptual evaluation highlights the importance of photorealism (vs. meshes) in accurately assessing subtle motion details in conversational gestures. Code and dataset available online.