Deep Learning Weekly: Issue 343
Next Generation of Claude, Advanced RAG: From Theory to LlamaIndex Implementation, Predictive Human Preference: From Model Ranking to Model Routing, Foundational Visual Encoder for Video Understanding!
This week in deep learning, we bring you Introducing the next generation of Claude, Advanced Retrieval-Augmented Generation: From Theory to LlamaIndex Implementation, Predictive Human Preference: From Model Ranking to Model Routing, and a paper on VideoPrism: A Foundational Visual Encoder for Video Understanding.
You may also enjoy Foundation Model Development Cheatsheet, Deploying LLMs Into Production Using TensorRT LLM, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Introducing the next generation of Claude
Anthropic announced the Claude 3 model family (Haiku, Sonnet, Opus), which sets new industry benchmarks across a wide range of cognitive tasks.
Alibaba's new AI system 'EMO' creates realistic talking and singing videos from photos
Researchers at Alibaba have developed a new AI system called “EMO” (Emote Portrait Alive) that can animate a single portrait photo and generate videos of the person talking or singing in a remarkably lifelike fashion.
Foundation Model Development Cheatsheet
A cheatsheet for foundation model developers assembled by AI2, EleutherAI, Google, Hugging Face, Masakhane, MIT, MLCommons, Princeton, and more.
Multiverse Computing raises €25M to deliver more efficient LLMs using quantum-inspired algorithms
Quantum computing startup Multiverse Computing raised €25 million (about $27.1 million) in a new early-stage funding round.
Snowflake partners with Mistral AI, taking its open LLMs to the data cloud
Snowflake signed a multi-year agreement with Mistral AI to bring Mistral's open LLMs to the Snowflake Data Cloud.
Ema, a ‘Universal AI employee’, emerges from stealth with $25M
Ema, a ‘Universal AI employee’, has emerged from stealth with $25 million in funding, aiming to revolutionize work using generative AI.
MLOps & LLMOps
Deploying LLMs Into Production Using TensorRT LLM
An article on how TensorRT-LLM works, and how it can be used to serve LLMs to millions of users.
Advanced Retrieval-Augmented Generation: From Theory to LlamaIndex Implementation
An article on how to address limitations of naive RAG pipelines by implementing targeted advanced RAG techniques in Python.
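To make this concrete, here is a minimal, generic sketch of one such technique — re-ranking retrieved chunks with a cross-encoder before they reach the LLM. It uses sentence-transformers rather than the article's LlamaIndex code, and the query and chunks are placeholders.

```python
# Generic re-ranking sketch (not the article's LlamaIndex implementation):
# score each (query, chunk) pair with a cross-encoder and keep the best chunks.
from sentence_transformers import CrossEncoder

def rerank(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Return the top_k chunks by cross-encoder relevance score."""
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(scores, chunks), key=lambda p: p[0], reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]

# Placeholder retrieval results; in a real pipeline these come from a vector store.
context = rerank("How does sentence-window retrieval work?",
                 ["chunk A ...", "chunk B ...", "chunk C ...", "chunk D ..."])
```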
Top Evaluation Metrics for RAG Failures
An article on troubleshooting LLMs and Retrieval Augmented Generation with retrieval and response metrics.
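As a rough illustration of the retrieval side, here is a self-contained sketch of two standard metrics — hit rate and mean reciprocal rank (MRR) — over hypothetical results; the article's exact metric definitions may differ.

```python
def hit_rate(results: list[list[str]], relevant: list[str]) -> float:
    """Fraction of queries whose relevant document appears in the results."""
    hits = sum(rel in res for res, rel in zip(results, relevant))
    return hits / len(relevant)

def mrr(results: list[list[str]], relevant: list[str]) -> float:
    """Mean of 1/rank of the first relevant document (0 if it never appears)."""
    total = 0.0
    for res, rel in zip(results, relevant):
        if rel in res:
            total += 1.0 / (res.index(rel) + 1)
    return total / len(relevant)

retrieved = [["d1", "d7", "d3"], ["d9", "d2", "d4"]]  # top-3 docs per query
gold = ["d3", "d2"]                                   # relevant doc per query
print(hit_rate(retrieved, gold), mrr(retrieved, gold))  # 1.0 0.4166...
```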
Learning
Seamless Integration: Combining Comet and Gradio for Enhanced Machine Learning Experiments
Experimentation is the lifeblood of machine learning. This article explores how two powerful tools, Comet and Gradio, can simplify and enhance your machine learning experiments.
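A minimal sketch of the pairing, assuming a COMET_API_KEY in the environment; the model function is a stand-in and the project name is made up.

```python
# Log what a Gradio demo sees to a Comet experiment.
import gradio as gr
from comet_ml import Experiment

experiment = Experiment(project_name="gradio-demo")  # hypothetical project name

def predict(text: str) -> str:
    prediction = text[::-1]          # stand-in for a real model
    experiment.log_text(text)        # record user inputs in Comet
    return prediction

gr.Interface(fn=predict, inputs="text", outputs="text").launch()
```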
The Math behind Adam Optimizer
An article that dives deep into the mathematical details of the Adam optimization algorithm used in training neural networks.
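For reference, the core update can be written in a few lines of NumPy: exponentially decayed first and second moment estimates with bias correction. The hyperparameters below are the common defaults, not values from the article.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-indexed step count."""
    m = beta1 * m + (1 - beta1) * grad        # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad**2     # second moment (uncentered variance)
    m_hat = m / (1 - beta1**t)                # bias-corrected estimates
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# One step on f(theta) = theta^2, whose gradient is 2 * theta.
theta, m, v = 1.0, 0.0, 0.0
theta, m, v = adam_step(theta, grad=2.0, m=m, v=v, t=1)
```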
Predictive Human Preference: From Model Ranking to Model Routing
Chip Huyen explores predicting user preferences for specific queries, enabling model routing and interpretability, and visualizes preference predictions for different model pairs.
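The routing idea reduces to scoring each candidate model's predicted win rate for a query and dispatching to the top scorer. In this toy sketch, predict_preference is a hypothetical stand-in for the preference predictor the post trains.

```python
def route(query: str, models: list[str], predict_preference) -> str:
    """Send the query to the model with the highest predicted preference."""
    scores = {m: predict_preference(query, m) for m in models}
    return max(scores, key=scores.get)

# Stub scorer for illustration; a real one is learned from human preference data.
best = route("Write a SQL query", ["model-a", "model-b"],
             predict_preference=lambda q, m: 0.7 if m == "model-a" else 0.5)
```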
Image Augmentation: A Fun and Easy Way to Improve Computer Vision Models
This article explores image enhancement techniques, including various methods that can be employed to preprocess and improve images.
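A minimal torchvision pipeline of the kind the article describes — random flips, color jitter, and small rotations applied per sample at load time (the transform values here are illustrative):

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),               # mirror half the images
    transforms.ColorJitter(brightness=0.2, contrast=0.2), # vary lighting
    transforms.RandomRotation(degrees=15),                # small random rotations
    transforms.ToTensor(),                                # PIL image -> float tensor
])

# Applied per sample, e.g.:
# dataset = torchvision.datasets.ImageFolder("train/", transform=train_transform)
```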
Scalable Federated Learning with NVIDIA FLARE for Enhanced LLM Performance
A post that explores how federated learning enabled by NVIDIA FLARE can handle decentralized data with easy and scalable integration.
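At its core, this workflow aggregates locally trained weights. The sketch below is plain federated averaging (FedAvg) in NumPy — not FLARE's actual API, which orchestrates this exchange across sites without moving raw data.

```python
import numpy as np

def fedavg(client_weights: list[dict], client_sizes: list[int]) -> dict:
    """Average client weights, weighted by each client's local dataset size."""
    total = sum(client_sizes)
    return {
        key: sum(w[key] * (n / total) for w, n in zip(client_weights, client_sizes))
        for key in client_weights[0]
    }

# Two hypothetical clients that trained the same layer locally.
clients = [{"w": np.array([1.0, 2.0])}, {"w": np.array([3.0, 4.0])}]
global_weights = fedavg(clients, client_sizes=[100, 300])  # {'w': [2.5, 3.5]}
```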
Libraries & Code
A graph-based framework for building LLM-based agents.
A repository that summarizes existing representative LLM text datasets.
Papers & Publications
VideoPrism: A Foundational Visual Encoder for Video Understanding
Abstract:
We introduce VideoPrism, a general-purpose video encoder that tackles diverse video understanding tasks with a single frozen model. We pretrain VideoPrism on a heterogeneous corpus containing 36M high-quality video-caption pairs and 582M video clips with noisy parallel text (e.g., ASR transcripts). The pretraining approach improves upon masked autoencoding by global-local distillation of semantic video embeddings and a token shuffling scheme, enabling VideoPrism to focus primarily on the video modality while leveraging the invaluable text associated with videos. We extensively test VideoPrism on four broad groups of video understanding tasks, from web video question answering to CV for science, achieving state-of-the-art performance on 30 out of 33 video understanding benchmarks.
Executable Code Actions Elicit Better LLM Agents
Abstract:
Large Language Model (LLM) agents, capable of performing a broad range of actions, such as invoking tools and controlling robots, show great potential in tackling real-world challenges. LLM agents are typically prompted to produce actions by generating JSON or text in a pre-defined format, which is usually limited by constrained action space (e.g., the scope of pre-defined tools) and restricted flexibility (e.g., inability to compose multiple tools). This work proposes to use executable Python code to consolidate LLM agents' actions into a unified action space (CodeAct). Integrated with a Python interpreter, CodeAct can execute code actions and dynamically revise prior actions or emit new actions upon new observations through multi-turn interactions. Our extensive analysis of 17 LLMs on API-Bank and a newly curated benchmark shows that CodeAct outperforms widely used alternatives (up to 20% higher success rate). The encouraging performance of CodeAct motivates us to build an open-source LLM agent that interacts with environments by executing interpretable code and collaborates with users using natural language. To this end, we collect an instruction-tuning dataset CodeActInstruct that consists of 7k multi-turn interactions using CodeAct. We show that it can be used with existing data to improve models in agent-oriented tasks without compromising their general capability. CodeActAgent, finetuned from Llama2 and Mistral, is integrated with Python interpreter and uniquely tailored to perform sophisticated tasks (e.g., model training) using existing libraries and autonomously self-debug.
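The core loop is easy to sketch: the model emits Python as its action, an interpreter executes it, and the output (or traceback) becomes the next observation. Below, generate stands in for any LLM call (hypothetical), and exec is unsafe outside a sandbox.

```python
import contextlib
import io
import traceback

def execute_action(code: str, env: dict) -> str:
    """Run model-generated code; return captured stdout or the traceback."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, env)  # illustration only — sandbox this in practice
        return buf.getvalue() or "(no output)"
    except Exception:
        return traceback.format_exc()

def codeact_loop(generate, task: str, max_turns: int = 5) -> str:
    """Multi-turn loop: code action -> execution -> observation -> next action."""
    env: dict = {}
    observation = task
    for _ in range(max_turns):
        code = generate(observation)              # LLM proposes a code action
        observation = execute_action(code, env)   # result feeds the next turn
    return observation
```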
Do Large Language Models Latently Perform Multi-Hop Reasoning?
Abstract:
We study whether Large Language Models (LLMs) latently perform multi-hop reasoning with complex prompts such as "The mother of the singer of 'Superstition' is". We look for evidence of a latent reasoning pathway where an LLM (1) latently identifies "the singer of 'Superstition'" as Stevie Wonder, the bridge entity, and (2) uses its knowledge of Stevie Wonder's mother to complete the prompt. We analyze these two hops individually and consider their co-occurrence as indicative of latent multi-hop reasoning. For the first hop, we test if changing the prompt to indirectly mention the bridge entity instead of any other entity increases the LLM's internal recall of the bridge entity. For the second hop, we test if increasing this recall causes the LLM to better utilize what it knows about the bridge entity. We find strong evidence of latent multi-hop reasoning for the prompts of certain relation types, with the reasoning pathway used in more than 80% of the prompts. However, the utilization is highly contextual, varying across different types of prompts. Also, on average, the evidence for the second hop and the full multi-hop traversal is rather moderate and only substantial for the first hop. Moreover, we find a clear scaling trend with increasing model size for the first hop of reasoning but not for the second hop. Our experimental findings suggest potential challenges and opportunities for future development and applications of LLMs.