Deep Learning Weekly: Issue 371
OpenAI o1-preview, a microservice-based way to deploy LlamaIndex Workflows, a paper on Retrieval-Augmented Correction of Named Entity Speech Recognition Errors, and many more!
This week in deep learning, we bring you Introducing OpenAI o1-preview, a microservice-based way to deploy LlamaIndex Workflows, Neptune: The long orbit to benchmarking long video understanding, and a paper on Retrieval-Augmented Correction of Named Entity Speech Recognition Errors.
You may also enjoy Meet Opik: Your New Tool to Evaluate, Test, and Monitor LLM Applications, Improving Retrieval with Auto-Merging, a paper on Agent Workflow Memory, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Introducing OpenAI o1-preview
OpenAI has released a preview of its new series of reasoning models, OpenAI o1, which are designed to spend more time thinking before responding and excel at complex tasks in science, coding, and math.
Meet Opik: Your New Tool to Evaluate, Test, and Monitor LLM Applications
Comet released Opik, an open-source platform for evaluating, testing, and monitoring LLM applications.
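For a taste of what instrumentation with Opik looks like, here is a minimal sketch using the SDK's `@track` decorator; configuration details (API keys, project names) are assumptions to check against the Opik docs.

```python
# Minimal sketch of tracing an LLM-backed function with Opik.
# Assumes `pip install opik`; project/key configuration follows the docs.
from opik import track

@track  # records inputs, outputs, and timing of each call as a trace
def answer_question(question: str) -> str:
    # Placeholder for a real LLM call (e.g., via an OpenAI client).
    return "The answer is 42."

print(answer_question("What is the answer to everything?"))
```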
Our latest advances in robot dexterity
Google DeepMind introduced two new AI systems, ALOHA Unleashed and DemoStart, that help robots learn to perform complex tasks that require dexterous movement.
SambaNova challenges OpenAI's o1 model with Llama 3.1-powered demo on HuggingFace
SambaNova Systems has introduced a new demo on Hugging Face that utilizes Meta's Llama 3.1 Instruct model, aiming to provide a faster, open-source alternative to OpenAI's o1 model for enterprise AI infrastructure.
Slack now lets users add AI agents from Asana, Cohere, Adobe, Workday and more
Slack has announced that paying users can now integrate AI agents from Salesforce, third-party partners, and their own custom-built agents directly into the platform.
MLOps & LLMOps
Introducing llama-deploy, a microservice-based way to deploy LlamaIndex Workflows
A blog post about llama-deploy, a new tool from LlamaIndex that simplifies deploying and scaling LlamaIndex workflows as microservices.
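llama-deploy wraps LlamaIndex Workflows, so the unit being deployed looks like the minimal workflow below. This sketch uses LlamaIndex's public Workflow API; the llama-deploy wiring itself (control plane, service configs) is best taken from the announcement and docs.

```python
# A minimal LlamaIndex Workflow of the kind llama-deploy serves as a
# microservice. Only the standard Workflow API is shown here; the
# llama-deploy deployment wiring lives in its own configs.
import asyncio
from llama_index.core.workflow import StartEvent, StopEvent, Workflow, step

class EchoWorkflow(Workflow):
    @step
    async def echo(self, ev: StartEvent) -> StopEvent:
        # Echo the incoming message back as the workflow's result.
        return StopEvent(result=f"echo: {ev.message}")

async def main() -> None:
    print(await EchoWorkflow().run(message="hello"))

asyncio.run(main())
```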
The Data Pipeline is the New Secret Sauce
A blog post about the challenges of AI infrastructure for enterprises, specifically highlighting data pipelines and inference hosting as key areas for development and optimization.
How Much GPU Memory is Needed to Serve a Large Language Model?
An article on estimating how much GPU memory is needed to serve large language models (LLMs) effectively in production.
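As a back-of-the-envelope illustration of the article's theme (the exact overhead factor varies with serving stack, batch size, and context length), a common rule of thumb is weights = parameters × bytes per parameter, plus roughly 20% overhead:

```python
def estimate_serving_memory_gb(params_billion: float,
                               bits_per_param: int = 16,
                               overhead: float = 1.2) -> float:
    """Rule-of-thumb GPU memory to serve an LLM: weight storage plus
    ~20% overhead for KV cache and activations. Real usage depends on
    batch size, context length, and the serving framework."""
    weight_gb = params_billion * bits_per_param / 8  # 1B params at 8 bits ~= 1 GB
    return weight_gb * overhead

print(estimate_serving_memory_gb(70))                    # FP16 70B: ~168 GB
print(estimate_serving_memory_gb(70, bits_per_param=4))  # 4-bit quantized: ~42 GB
```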
Learning
Neptune: The long orbit to benchmarking long video understanding
An article about Neptune, a new open-source video question-answering dataset that includes challenging multiple-choice and open-ended questions for videos up to 15 minutes long.
Enhancing LLM collaboration for smarter, more efficient solutions
An insightful article about a new algorithm from MIT CSAIL called Co-LLM that improves LLM accuracy and efficiency by training them to collaborate with more specialized models.
Improving Retrieval with Auto-Merging
A technical article about a new retrieval technique called Auto-Merging, which improves context retrieval in RAG applications by using a hierarchical document structure in Haystack.
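The core idea: index small leaf chunks that keep pointers to their parents, and when enough of a parent's children are retrieved, return the merged parent instead. Haystack ships its own components for this pattern; the framework-agnostic sketch below is illustrative only.

```python
# Framework-agnostic sketch of auto-merging retrieval; the classes and
# threshold here are illustrative, not Haystack's API.
from collections import Counter
from dataclasses import dataclass

@dataclass(eq=False)  # identity-based hashing so chunks can key a Counter
class Chunk:
    text: str
    parent: "Chunk | None" = None
    n_children: int = 0

def auto_merge(retrieved: list[Chunk], threshold: float = 0.5) -> list[Chunk]:
    """Replace groups of sibling chunks with their parent when more than
    `threshold` of the parent's children were retrieved."""
    hits = Counter(c.parent for c in retrieved if c.parent is not None)
    merged, out = set(), []
    for chunk in retrieved:
        p = chunk.parent
        if p is not None and p.n_children and hits[p] / p.n_children > threshold:
            if p not in merged:  # emit each merged parent only once
                merged.add(p)
                out.append(p)
        else:
            out.append(chunk)
    return out

parent = Chunk("full section text", n_children=3)
kids = [Chunk(f"piece {i}", parent=parent) for i in range(3)]
print([c.text for c in auto_merge(kids[:2])])  # 2/3 > 0.5 -> ["full section text"]
```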
Benchmarking Hallucination Detection Methods in RAG
A technical article that benchmarks different hallucination detection methods used in Retrieval-Augmented Generation (RAG) systems to evaluate their effectiveness in identifying incorrect LLM responses.
Libraries & Code
Windows Agent Arena (WAA) is a scalable OS platform for testing and benchmarking multi-modal AI agents.
Opik is an open-source end-to-end LLM development platform.
gsplat is an open-source library for CUDA-accelerated rasterization of Gaussians with Python bindings.
optillm is an OpenAI API-compatible optimizing inference proxy that implements several state-of-the-art techniques to improve the accuracy and performance of LLMs.
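Because the proxy speaks the OpenAI API, adopting it is largely a matter of pointing an existing client at it. A sketch, assuming a locally running proxy on port 8000 and optillm's convention of selecting a technique via a model-name prefix (e.g., `moa-` for mixture-of-agents); check the README for the exact options.

```python
# Sketch of calling optillm as a drop-in OpenAI-compatible proxy.
# Assumes the proxy is running locally on port 8000; the "moa-" prefix
# on the model name selects the mixture-of-agents technique.
from openai import OpenAI

client = OpenAI(api_key="sk-...", base_url="http://localhost:8000/v1")
response = client.chat.completions.create(
    model="moa-gpt-4o-mini",
    messages=[{"role": "user", "content": "How many r's are in 'strawberry'?"}],
)
print(response.choices[0].message.content)
```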
Papers & Publications
Retrieval-Augmented Correction of Named Entity Speech Recognition Errors
Abstract:
In recent years, end-to-end automatic speech recognition (ASR) systems have proven themselves remarkably accurate and performant, but these systems still have a significant error rate for entity names which appear infrequently in their training data. In parallel to the rise of end-to-end ASR systems, large language models (LLMs) have proven to be a versatile tool for various natural language processing (NLP) tasks. In NLP tasks where a database of relevant knowledge is available, retrieval augmented generation (RAG) has achieved impressive results when used with LLMs. In this work, we propose a RAG-like technique for correcting speech recognition entity name errors. Our approach uses a vector database to index a set of relevant entities. At runtime, database queries are generated from possibly errorful textual ASR hypotheses, and the entities retrieved using these queries are fed, along with the ASR hypotheses, to an LLM which has been adapted to correct ASR errors. Overall, our best system achieves 33%-39% relative word error rate reductions on synthetic test sets focused on voice assistant queries of rare music entities without regressing on the STOP test set, a publicly available voice assistant test set covering many domains.
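A toy sketch of the loop the abstract describes: retrieve candidate entity names using the (possibly errorful) hypothesis, then hand both to a correction-tuned LLM. The fuzzy string match below stands in for the paper's vector-database retrieval, and the prompt and entity list are illustrative, not the paper's.

```python
# Illustrative sketch of RAG-style ASR entity correction. The fuzzy
# match stands in for vector-database retrieval; the resulting prompt
# would go to an LLM adapted to correct ASR errors.
import difflib

ENTITY_DB = ["Olivia Rodrigo", "The Weeknd", "Hozier", "Mitski"]

def build_correction_prompt(hypothesis: str, k: int = 3) -> str:
    # Stand-in retrieval: fuzzy-match hypothesis words against the entity set.
    candidates: list[str] = []
    for word in hypothesis.split():
        candidates += difflib.get_close_matches(word, ENTITY_DB, n=k, cutoff=0.4)
    return (
        "Correct any misrecognized entity names in this ASR hypothesis.\n"
        f"Hypothesis: {hypothesis}\n"
        f"Candidate entities: {', '.join(dict.fromkeys(candidates))}"
    )

print(build_correction_prompt("play hosier on the speaker"))
```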
PuLID: Pure and Lightning ID Customization via Contrastive Alignment
Abstract:
We propose Pure and Lightning ID customization (PuLID), a novel tuning-free ID customization method for text-to-image generation. By incorporating a Lightning T2I branch with a standard diffusion one, PuLID introduces both contrastive alignment loss and accurate ID loss, minimizing disruption to the original model and ensuring high ID fidelity. Experiments show that PuLID achieves superior performance in both ID fidelity and editability. Another attractive property of PuLID is that the image elements (e.g., background, lighting, composition, and style) before and after the ID insertion are kept as consistent as possible.
Agent Workflow Memory
Abstract:
Despite the potential of language model-based agents to solve real-world tasks such as web navigation, current methods still struggle with long-horizon tasks with complex action trajectories. In contrast, humans can flexibly solve complex tasks by learning reusable task workflows from past experiences and using them to guide future actions. To build agents that can similarly benefit from this process, we introduce Agent Workflow Memory (AWM), a method for inducing commonly reused routines, i.e., workflows, and selectively providing workflows to the agent to guide subsequent generations. AWM flexibly applies to both offline and online scenarios, where agents induce workflows from training examples beforehand or from test queries on the fly. We experiment on two major web navigation benchmarks -- Mind2Web and WebArena -- that collectively cover 1000+ tasks from 200+ domains across travel, shopping, and social media, among others. AWM substantially improves the baseline results by 24.6% and 51.1% relative success rate on Mind2Web and WebArena while reducing the number of steps taken to solve WebArena tasks successfully. Furthermore, online AWM robustly generalizes in cross-task, website, and domain evaluations, surpassing baselines from 8.9 to 14.0 absolute points as train-test task distribution gaps widen.
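A toy sketch of AWM's outer loop: induce reusable workflows from successful trajectories, then surface the relevant ones on new tasks. Real AWM induces workflows with an LM; the dictionary store and keyword-overlap retrieval below are stand-ins.

```python
# Toy sketch of Agent Workflow Memory's outer loop. Induction here is a
# trivial stand-in that keeps a successful action trajectory keyed by its
# task description; the paper's method induces workflows with an LM.
from dataclasses import dataclass, field

@dataclass
class WorkflowMemory:
    workflows: dict[str, list[str]] = field(default_factory=dict)

    def induce(self, task: str, actions: list[str], success: bool) -> None:
        if success:  # keep only routines that actually solved the task
            self.workflows[task] = actions

    def relevant(self, task: str) -> list[list[str]]:
        # Stand-in retrieval: naive keyword overlap with stored tasks.
        words = set(task.lower().split())
        return [w for t, w in self.workflows.items()
                if words & set(t.lower().split())]

memory = WorkflowMemory()
memory.induce("book a flight", ["search flights", "select", "checkout"], success=True)
print(memory.relevant("book a hotel"))  # overlaps on "book", reuses the routine
```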