Deep Learning Weekly: Issue 329
Inflection-2, Automatic detection of hallucination with SelfCheckGPT, Steerable Neural Networks, a paper on Video-LLaVA: Learning United Visual Representation by Alignment Before Projection, and more!
This week in deep learning, we bring you Inflection-2, Automatic detection of hallucination with SelfCheckGPT, Steerable Neural Networks, and a paper on Video-LLaVA: Learning United Visual Representation by Alignment Before Projection.
You may also enjoy Human Guided Exploration, Machine Learning Model Evaluation with Giskard: From Validation to CI/CD Integration, a paper on MARRS: Multimodal Reference Resolution System, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Inflection-2: The Next Step Up
Inflection announced that it has completed training Inflection-2, the best model in the world for its compute class and the second most capable LLM in the world today.
Building the data framework for LLMs
LlamaIndex secured $8.5 million in seed funding, led by Greylock, to help propel its efforts to scale.
New method uses crowdsourced feedback to help train robots
Researchers from MIT, Harvard University, and other institutions have developed Human Guided Exploration, a new reinforcement learning approach that leverages crowdsourced feedback.
AWS debuts next-generation Graviton4 and Trainium2 chips for cloud and AI workloads
AWS unveiled Graviton4 and Trainium2, two next-generation chips from its silicon families for general-purpose cloud computing and high-efficiency AI training.
UK to invest £500M more in AI compute capacity, launch five new quantum projects
The U.K. government will invest £500 million, or $626 million, to provide local researchers and organizations with access to compute capacity for artificial intelligence projects.
MLOps & LLMOps
Boost inference performance for LLMs with new Amazon SageMaker containers
A deep dive into the new features in the latest release of Amazon SageMaker Deep Learning Containers (DLCs), along with a discussion of performance benchmarks.
How to Go From POC to LLM in Production
An article containing the highlights of Dataiku’s talk on LLM Mesh, governance, and scaling.
Automatic detection of hallucination with SelfCheckGPT
This notebook demonstrates how hallucination metrics, such as the SelfCheckGPT NLI score, can be used to automatically detect hallucinations.
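To give a flavor of the metric, here is a minimal, illustrative sketch of the SelfCheckGPT-NLI idea using an off-the-shelf NLI model (not the notebook's exact code): each sentence of a response is scored by how strongly the model's own re-sampled responses contradict it.

```python
# Illustrative sketch of the SelfCheckGPT-NLI idea (not the notebook's exact code):
# score each sentence of a response by how strongly an NLI model says it is
# contradicted by other, independently sampled responses to the same prompt.
from transformers import pipeline

# Any general-purpose NLI model works for illustration.
nli = pipeline("text-classification", model="microsoft/deberta-large-mnli")

def selfcheck_nli_score(sentence: str, sampled_responses: list[str]) -> float:
    """Average probability that the sampled responses contradict the sentence."""
    contradiction_probs = []
    for sample in sampled_responses:
        # Premise = a sampled response, hypothesis = the sentence being checked.
        result = nli({"text": sample, "text_pair": sentence}, top_k=None)
        p_contradiction = next(r["score"] for r in result if r["label"] == "CONTRADICTION")
        contradiction_probs.append(p_contradiction)
    return sum(contradiction_probs) / len(contradiction_probs)

# Higher score -> the sentence is less supported by the model's own samples,
# which SelfCheckGPT treats as a signal of hallucination.
score = selfcheck_nli_score(
    "Paris is the capital of France.",
    ["The capital of France is Paris.", "France's capital city is Paris."],
)
print(f"hallucination score: {score:.3f}")
```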
Machine Learning Model Evaluation with Giskard: From Validation to CI/CD Integration
An article on how to integrate vulnerability scanning, model validation, and CI/CD pipeline optimization to ensure the reliability and security of your AI models.
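As a loose sketch of what such a workflow can look like (the model, data, and column names below are invented, and the exact Giskard arguments may differ from the article's setup):

```python
# Hedged sketch of a Giskard-style vulnerability scan; model, data, and column
# names are made up for illustration and are not from the article.
import giskard
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    "age": [25, 40, 33, 58],
    "income": [30_000, 80_000, 55_000, 120_000],
    "default": [1, 0, 0, 0],
})
clf = LogisticRegression().fit(df[["age", "income"]], df["default"])

# Wrap the model and dataset so Giskard can probe them.
model = giskard.Model(
    model=lambda data: clf.predict_proba(data[["age", "income"]]),
    model_type="classification",
    classification_labels=[0, 1],
)
dataset = giskard.Dataset(df, target="default")

# Run the automated scan for robustness, performance, and bias issues;
# the resulting report can be exported or turned into a test suite for CI/CD.
report = giskard.scan(model, dataset)
report.to_html("scan_report.html")
```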
Learning
A gentle introduction to Steerable Neural Networks
A comprehensive article that breaks down the mathematical concepts behind Steerable Neural Networks, and explains how to design these networks.
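For readers who want the one equation that everything in the article revolves around, this is the standard equivariance and kernel constraint from the steerable CNN literature (the notation here is generic and may differ from the article's):

```latex
% A layer \Phi is steerable (equivariant) under a group G if transforming the
% input and then applying the layer equals applying the layer and then
% transforming the output:
\Phi\bigl(\rho_{\mathrm{in}}(g)\,x\bigr) \;=\; \rho_{\mathrm{out}}(g)\,\Phi(x)
\qquad \forall g \in G .

% For convolutional layers this reduces to a linear constraint on the kernel k:
k(g \cdot v) \;=\; \rho_{\mathrm{out}}(g)\,k(v)\,\rho_{\mathrm{in}}(g)^{-1}
\qquad \forall g \in G,\ \forall v \in \mathbb{R}^2 .
```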
Streamline ML Model Development with GitLab’s DevOps Platform and Comet
A comprehensive guide to using Comet with GitLab’s DevOps platform to streamline the workflow for your ML and software engineering teams.
Direct Preference Optimization (DPO): A Simplified Approach to Fine-tuning Large Language Models
An introductory article that walks through the key components of a Direct Preference Optimization (DPO) pipeline.
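To make the core of such a pipeline concrete, here is a small, self-contained PyTorch sketch of the DPO loss (a generic formulation, not code from the article); the tensor values in the example are made up.

```python
# Generic sketch of the Direct Preference Optimization loss, not the article's code.
# Inputs are summed log-probabilities of chosen/rejected responses under the
# trainable policy and a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """-log sigmoid(beta * [(chosen policy-vs-ref log-ratio)
                            - (rejected policy-vs-ref log-ratio)])."""
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_logratios - rejected_logratios)
    # Maximizing the preference margin == minimizing -log sigmoid(logits).
    return -F.logsigmoid(logits).mean()

# Toy example with made-up log-probabilities for a batch of two preference pairs.
loss = dpo_loss(
    policy_chosen_logps=torch.tensor([-12.0, -15.0]),
    policy_rejected_logps=torch.tensor([-14.0, -15.5]),
    ref_chosen_logps=torch.tensor([-13.0, -15.2]),
    ref_rejected_logps=torch.tensor([-13.5, -15.1]),
)
print(loss.item())
```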
Using Self-Critiquing Chains in LangChain
A technical blog post that unpacks how ConstitutionalChain works, its applications, and how it paves the way for more ethical AI systems.
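As a rough illustration of the pattern the post unpacks, here is a minimal sketch built on LangChain's ConstitutionalChain API (the prompt, principle text, and model choice are invented for illustration, and exact imports may vary between LangChain versions):

```python
# Hedged sketch of LangChain's self-critiquing ConstitutionalChain; exact imports
# and arguments may differ between LangChain versions.
from langchain.chains import LLMChain
from langchain.chains.constitutional_ai.base import ConstitutionalChain
from langchain.chains.constitutional_ai.models import ConstitutionalPrinciple
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate

llm = OpenAI(temperature=0)

# The base chain produces an initial answer.
base_chain = LLMChain(
    llm=llm,
    prompt=PromptTemplate(
        input_variables=["question"],
        template="Answer the question as helpfully as possible: {question}",
    ),
)

# A principle tells the chain how to critique and then revise its own output.
ethical_principle = ConstitutionalPrinciple(
    name="harmlessness",
    critique_request="Identify anything harmful, unethical, or misleading in the answer.",
    revision_request="Rewrite the answer to remove harmful or misleading content.",
)

constitutional_chain = ConstitutionalChain.from_llm(
    chain=base_chain,
    constitutional_principles=[ethical_principle],
    llm=llm,
    verbose=True,
)

print(constitutional_chain.run(question="How do I get ahead of my coworkers?"))
```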
Libraries & Code
google-research/project-guideline
A research project that leverages on-device ML to enable people who are blind or have low vision to walk or run for exercise independently.
An innovative library of open-source language models, fine-tuned with C-RLFT – a strategy inspired by offline reinforcement learning.
yhLeeee/Awesome-LLMs-in-Graph-tasks
A curated collection of research papers exploring the utilization of LLMs for graph-related tasks.
Papers & Publications
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Abstract:
The Large Vision-Language Model (LVLM) has enhanced the performance of various downstream tasks in visual-language understanding. Most existing approaches encode images and videos into separate feature spaces, which are then fed as inputs to large language models. However, due to the lack of unified tokenization for images and videos, namely misalignment before projection, it becomes challenging for a Large Language Model (LLM) to learn multi-modal interactions from several poor projection layers. In this work, we unify visual representation into the language feature space to advance the foundational LLM towards a unified LVLM. As a result, we establish a simple but robust LVLM baseline, Video-LLaVA, which learns from a mixed dataset of images and videos, mutually enhancing each other. Video-LLaVA achieves superior performances on a broad range of 9 image benchmarks across 5 image question-answering datasets and 4 image benchmark toolkits. Additionally, our Video-LLaVA also outperforms Video-ChatGPT by 5.8%, 9.9%, 18.6%, and 10.1% on MSRVTT, MSVD, TGIF, and ActivityNet, respectively. Notably, extensive experiments demonstrate that Video-LLaVA mutually benefits images and videos within a unified visual representation, outperforming models designed specifically for images or videos. We aim for this work to provide modest insights into the multi-modal inputs for the LLM.
MARRS: Multimodal Reference Resolution System
Abstract:
Successfully handling context is essential for any dialog-understanding task. This context may be conversational (relying on previous user queries or system responses), visual (relying on what the user sees, for example, on their screen), or background (based on signals such as a ringing alarm or playing music). In this work, we present an overview of MARRS, or Multimodal Reference Resolution System, an on-device framework within a Natural Language Understanding system, responsible for handling conversational, visual, and background context. In particular, we present different machine learning models to enable handling contextual queries; specifically, one to enable reference resolution and one to handle context via query rewriting. We also describe how these models complement each other to form a unified, coherent, lightweight system that can understand context while preserving user privacy.