Deep Learning Weekly: Issue 359
Claude 3.5 Sonnet, Benchmarking Haystack Pipelines for Optimal Performance, Using LLMs to Analyze and Label Satellite Imagery, a paper on Depth Anything V2, and many more!
This week in deep learning, we bring you Introducing Claude 3.5 Sonnet, Benchmarking Haystack Pipelines for Optimal Performance, Using LLMs to Analyze and Label Satellite Imagery in Edge Impulse, and a paper on Depth Anything V2.
You may also enjoy Sharing new research, models, and datasets from Meta FAIR, Deep dive into how Pinterest built its Text-to-SQL solution, a paper on MeshAnything: Artist-Created Mesh Generation with Autoregressive Transformers, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Introducing Claude 3.5 Sonnet
Anthropic released Claude 3.5 Sonnet, a model that raises the industry bar for intelligence while operating at the speed and cost of their mid-tier model.
Sharing new research, models, and datasets from Meta FAIR
Meta’s Fundamental AI Research Team has publicly released new research artifacts, including image-to-text and text-to-music generation models, a multi-token prediction model, and a technique for detecting AI-generated speech.
Generating audio for video - Google DeepMind
Google DeepMind recently introduced a video-to-audio (V2A) system that generates background music, sound effects, and character dialogue to accompany videos.
Substrate lands $8M funding to bring 'Lego blocks' approach to enterprise AI
Substrate, an AI development startup, raised $8 million in a funding round led by Lightspeed Venture Partners to grow its team and expand its product offerings.
MLOps & LLMOps
Deep dive into how Pinterest built its Text-to-SQL solution
An article about how Pinterest’s Engineering Team implemented a Text-to-SQL solution to enable data users to retrieve data without writing SQL.
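For flavor, here is a minimal sketch of the general text-to-SQL pattern the article describes, with table schemas placed directly in the prompt. The tables, prompt wording, and model choice are illustrative assumptions, not Pinterest's implementation:

```python
# Minimal text-to-SQL sketch: put table definitions in the prompt and
# ask an LLM for a query. Schema and prompt are hypothetical examples.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SCHEMA = """
CREATE TABLE pins (pin_id BIGINT, board_id BIGINT, created_at DATE);
CREATE TABLE boards (board_id BIGINT, owner_id BIGINT, name VARCHAR);
"""  # hypothetical tables for illustration

def text_to_sql(question: str) -> str:
    prompt = (
        "Given these table definitions:\n"
        f"{SCHEMA}\n"
        "Write a single SQL query that answers the question. "
        "Return only the SQL.\n\n"
        f"Question: {question}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

print(text_to_sql("How many pins were created per board last week?"))
```

Production systems like Pinterest's add retrieval over table metadata and validation of the generated SQL on top of this basic loop.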
New Chunking Method for RAG-Systems
An article highlighting a chunking approach that leverages SBERT and advanced clustering techniques for accurate topic modeling in large documents.
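A minimal sketch of the embedding-based splitting idea, assuming an SBERT model from sentence-transformers; the article's full method adds clustering for topic modeling on top of this:

```python
# Semantic chunking sketch: embed sentences with SBERT and start a new
# chunk when similarity between adjacent sentences drops. Model choice
# and threshold are assumptions for illustration.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model

def semantic_chunks(sentences: list[str], threshold: float = 0.6) -> list[list[str]]:
    # Normalized embeddings make the dot product a cosine similarity.
    emb = model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if float(np.dot(emb[i - 1], emb[i])) < threshold:
            chunks.append(current)
            current = []
        current.append(sentences[i])
    chunks.append(current)
    return chunks
```

The threshold trades chunk coherence against chunk size and is typically tuned against retrieval quality on a held-out query set.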
Scale LLMs with PyTorch 2.0 FSDP on Amazon EKS
A blog post that discusses how to use the PyTorch FSDP library to achieve linear scaling of deep learning models on AWS seamlessly using Amazon EKS and AWS Deep Learning Containers.
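A minimal single-node FSDP sketch (the post scales the same wrapper across an EKS cluster); it assumes a torchrun launch so the process-group environment variables are set:

```python
# FSDP sketch: shard parameters, gradients, and optimizer state across
# ranks. Launch with `torchrun --nproc_per_node=N script.py`.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Toy stand-in for a transformer block.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
).cuda()

model = FSDP(model, device_id=torch.cuda.current_device())
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 1024, device="cuda")
loss = model(x).square().mean()
loss.backward()
optimizer.step()
```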
Benchmarking Haystack Pipelines for Optimal Performance
An article that shows you how to use Haystack to evaluate the performance of a RAG pipeline.
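As a generic stand-in (not Haystack's evaluation API), here is a small latency harness around a hypothetical rag_pipeline object; the run-payload keys are assumptions, and the article itself goes further by scoring answer quality:

```python
# Crude latency benchmark for a RAG pipeline. `rag_pipeline` and its
# run payload are assumed placeholders, not Haystack API facts.
import statistics
import time

def benchmark(rag_pipeline, queries: list[str]) -> dict:
    latencies = []
    for q in queries:
        start = time.perf_counter()
        rag_pipeline.run({"retriever": {"query": q}, "prompt_builder": {"query": q}})
        latencies.append(time.perf_counter() - start)
    return {
        "mean_s": statistics.mean(latencies),
        # Crude p95 via sorted index; fine for a quick benchmark.
        "p95_s": sorted(latencies)[int(0.95 * (len(latencies) - 1))],
    }
```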
Learning
Using LLMs to Analyze and Label Satellite Imagery in Edge Impulse
An article about auto-labeling satellite imagery using Edge Impulse and GPT-4o, leveraging LLMs for automatic labeling based on simple prompts.
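A minimal sketch of prompt-based auto-labeling with the OpenAI API, assuming a hypothetical label set; Edge Impulse wires this kind of call into its labeling workflow:

```python
# Auto-labeling sketch: send an image to GPT-4o with a fixed label set
# and ask for a single-word label. LABELS is a hypothetical example.
import base64
from openai import OpenAI

client = OpenAI()
LABELS = ["building", "road", "vegetation", "water"]  # hypothetical labels

def label_image(path: str) -> str:
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Label this satellite tile as one of: {', '.join(LABELS)}. "
                         "Reply with the label only."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content.strip()
```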
Reducing Model Checkpointing Times by Over 10x with PyTorch Distributed Asynchronous Checkpointing
An article about PyTorch Distributed's new asynchronous checkpointing feature, which cuts checkpointing times by over 10x for large models by overlapping the save with training.
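A minimal sketch assuming the torch.distributed.checkpoint.async_save API from recent PyTorch releases (2.3+), which stages the save and writes it in the background while training continues:

```python
# Async checkpointing sketch: async_save returns a future; wait on the
# previous save before starting the next. Requires an initialized
# process group (e.g. via torchrun).
import torch
import torch.distributed as dist
import torch.distributed.checkpoint as dcp

dist.init_process_group("nccl")
model = torch.nn.Linear(1024, 1024).cuda()

future = None
for step in range(1000):
    # ... forward/backward/optimizer.step() ...
    if step % 100 == 0:
        if future is not None:
            future.result()  # ensure the previous async save finished
        future = dcp.async_save(
            {"model": model.state_dict()},
            checkpoint_id=f"checkpoints/step-{step}",
        )
```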
LLM Summarization: Getting To Production
An article that dives into LLM summarization – its importance, primary approaches and challenges, and a code-along example of evaluation using Arize Phoenix.
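A simplified LLM-as-judge sketch for summary evaluation; the article itself uses Arize Phoenix, so treat this stand-in prompt as an assumption:

```python
# LLM-as-judge sketch: grade a summary for faithfulness and coverage.
from openai import OpenAI

client = OpenAI()

def judge_summary(document: str, summary: str) -> str:
    prompt = (
        "You are grading a summary. Answer 'good' if the summary is faithful "
        "to the document and covers its main points, otherwise 'bad'.\n\n"
        f"Document:\n{document}\n\nSummary:\n{summary}\n\nGrade:"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower()
```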
Libraries & Code
A sharded data parallelism framework designed to work well with transformer-like neural network architectures.
TextGrad
A powerful framework for building automatic differentiation via text. TextGrad implements backpropagation through text feedback provided by LLMs.
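A condensed sketch adapted from TextGrad's documented quick-start; exact names and signatures may differ across versions:

```python
# TextGrad sketch: optimize an LLM answer using textual "gradients"
# produced by a critic LLM. API names follow the project's quick-start
# and are assumptions about the current version.
import textgrad as tg

tg.set_backward_engine("gpt-4o", override=True)
model = tg.BlackboxLLM("gpt-4o")

question = tg.Variable(
    "If it takes 1 hour to dry 25 shirts under the sun, how long for 30 shirts?",
    role_description="question to the LLM",
    requires_grad=False,
)
answer = model(question)
answer.set_role_description("concise and accurate answer to the question")

# The "loss" is natural-language feedback from a critic instruction.
loss_fn = tg.TextLoss("Evaluate the answer; be logical and very critical.")
optimizer = tg.TGD(parameters=[answer])

loss = loss_fn(answer)
loss.backward()   # backpropagate text feedback
optimizer.step()  # rewrite the answer using that feedback
```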
nanoGPT
The simplest, fastest repository for training/finetuning medium-sized GPTs.
Papers & Publications
Depth Anything V2
Abstract:
This work presents Depth Anything V2. Without pursuing fancy techniques, we aim to reveal crucial findings to pave the way towards building a powerful monocular depth estimation model. Notably, compared with V1, this version produces much finer and more robust depth predictions through three key practices: 1) replacing all labeled real images with synthetic images, 2) scaling up the capacity of our teacher model, and 3) teaching student models via the bridge of large-scale pseudo-labeled real images. Compared with the latest models built on Stable Diffusion, our models are significantly more efficient (more than 10x faster) and more accurate. We offer models of different scales (ranging from 25M to 1.3B params) to support extensive scenarios. Benefiting from their strong generalization capability, we fine-tune them with metric depth labels to obtain our metric depth models. In addition to our models, considering the limited diversity and frequent noise in current test sets, we construct a versatile evaluation benchmark with precise annotations and diverse scenes to facilitate future research.
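A minimal sketch of the pseudo-labeling step the abstract describes, in which a teacher trained on synthetic data labels real images for the student; the models and the loss are hypothetical stand-ins, not the authors' code:

```python
# Teacher-student distillation sketch: the teacher produces pseudo depth
# labels on unlabeled real images; the student trains against them.
import torch

def distill_step(teacher, student, optimizer, real_images: torch.Tensor):
    with torch.no_grad():
        pseudo_depth = teacher(real_images)  # pseudo-labels on real data
    pred = student(real_images)
    # Scale-invariant losses are typical for monocular depth; plain L1
    # keeps the sketch simple.
    loss = (pred - pseudo_depth).abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```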
LLMs achieve adult human performance on higher-order theory of mind tasks
Abstract:
This paper examines the extent to which large language models (LLMs) have developed higher-order theory of mind (ToM); the human ability to reason about multiple mental and emotional states in a recursive manner (e.g. I think that you believe that she knows). This paper builds on prior work by introducing a handwritten test suite -- Multi-Order Theory of Mind Q&A -- and using it to compare the performance of five LLMs to a newly gathered adult human benchmark. We find that GPT-4 and Flan-PaLM reach adult-level and near adult-level performance on ToM tasks overall, and that GPT-4 exceeds adult performance on 6th order inferences. Our results suggest that there is an interplay between model size and finetuning for the realisation of ToM abilities, and that the best-performing LLMs have developed a generalised capacity for ToM. Given the role that higher-order ToM plays in a wide range of cooperative and competitive human behaviours, these findings have significant implications for user-facing LLM applications.
MeshAnything: Artist-Created Mesh Generation with Autoregressive Transformers
Abstract:
Recently, 3D assets created via reconstruction and generation have matched the quality of manually crafted assets, highlighting their potential for replacement. However, this potential is largely unrealized because these assets always need to be converted to meshes for 3D industry applications, and the meshes produced by current mesh extraction methods are significantly inferior to Artist-Created Meshes (AMs), i.e., meshes created by human artists. Specifically, current mesh extraction methods rely on dense faces and ignore geometric features, leading to inefficiencies, complicated post-processing, and lower representation quality. To address these issues, we introduce MeshAnything, a model that treats mesh extraction as a generation problem, producing AMs aligned with specified shapes. By converting 3D assets in any 3D representation into AMs, MeshAnything can be integrated with various 3D asset production methods, thereby enhancing their application across the 3D industry. The architecture of MeshAnything comprises a VQ-VAE and a shape-conditioned decoder-only transformer. We first learn a mesh vocabulary using the VQ-VAE, then train the shape-conditioned decoder-only transformer on this vocabulary for shape-conditioned autoregressive mesh generation. Our extensive experiments show that our method generates AMs with hundreds of times fewer faces, significantly improving storage, rendering, and simulation efficiencies, while achieving precision comparable to previous methods.
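A schematic sketch of the generation loop the abstract implies, with a shape-conditioned transformer emitting mesh tokens that a VQ-VAE decoder turns back into faces; all modules and signatures are hypothetical stand-ins, not the authors' code:

```python
# Autoregressive mesh generation sketch: greedily decode mesh tokens
# conditioned on a shape embedding, then decode tokens into geometry.
import torch

@torch.no_grad()
def generate_mesh(transformer, vqvae_decoder, shape_embedding,
                  max_tokens=4096, bos_id=1, eos_id=0):
    tokens = torch.full((1, 1), bos_id, dtype=torch.long)
    for _ in range(max_tokens):
        # Hypothetical signature: logits over the mesh-token vocabulary.
        logits = transformer(tokens, condition=shape_embedding)  # (1, T, vocab)
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)     # greedy decode
        if next_id.item() == eos_id:
            break
        tokens = torch.cat([tokens, next_id], dim=1)
    return vqvae_decoder(tokens)  # token ids -> mesh vertices/faces
```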