Deep Learning Weekly: Issue 329
Inflection-2, Automatic detection of hallucination with SelfCheckGPT, Steerable Neural Networks, a paper on Video-LLaVA: Learning United Visual Representation by Alignment Before Projection, and more!
This week in deep learning, we bring you Inflection-2, Automatic detection of hallucination with SelfCheckGPT, Steerable Neural Networks, and a paper on Video-LLaVA: Learning United Visual Representation by Alignment Before Projection.
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Inflection announced that it has completed training Inflection-2, which it claims is the best model in the world for its compute class and the second most capable LLM in the world today.
LlamaIndex secured $8.5 million in seed funding, led by Greylock, to help propel its efforts to scale.
Researchers from MIT, Harvard University, and elsewhere have developed Human Guided Exploration, a new reinforcement learning approach that leverages crowdsourced feedback.
AWS unveiled Graviton4 and Trainium2, two next-generation chips from its silicon families for general-purpose cloud computing and high-efficiency AI training.
The U.K. government will invest £500 million, or $626 million, to provide local researchers and organizations with access to compute capacity for artificial intelligence projects.
MLOps & LLMOps
A deep dive into the new features in the latest release of DLCs, along with a discussion of performance benchmarks.
An article containing the highlights of Dataiku’s talk on LLM Mesh, governance, and scaling.
This notebook demonstrates how hallucination metrics, such as the SelfCheckGPT NLI score, can be used to automatically detect hallucinations.
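As a rough sketch of the idea behind the SelfCheckGPT NLI score (a simplified illustration, not the library's actual API): sample several stochastic responses to the same prompt, then score each sentence of the main response by how strongly an NLI model says the samples contradict it. The `contradiction_prob` callable below is a hypothetical stand-in for a real NLI model such as a DeBERTa-MNLI checkpoint.

```python
from statistics import mean

def selfcheck_nli(sentences, samples, contradiction_prob):
    """Score each sentence of a response for hallucination, SelfCheckGPT-style.

    sentences: list of sentences from the main LLM response.
    samples: list of stochastically re-sampled responses to the same prompt.
    contradiction_prob: callable (premise, hypothesis) -> float in [0, 1],
        intended to be P(contradiction) from an NLI model (hypothetical here).

    Returns one score per sentence: the mean probability that the sampled
    responses contradict it. Higher scores suggest hallucination, since
    unsupported facts tend not to reappear consistently across samples.
    """
    return [mean(contradiction_prob(s, sent) for s in samples)
            for sent in sentences]

# Toy demo with a dummy "NLI model" based on substring matching.
dummy_nli = lambda premise, hyp: 0.0 if hyp in premise else 1.0
samples = ["Paris is the capital of France.", "France's capital is Paris."]
scores = selfcheck_nli(
    ["Paris is the capital of France.", "It has 90 million residents."],
    samples,
    dummy_nli,
)
print(scores)  # [0.5, 1.0] -- the unsupported sentence gets the higher score
```

In the real pipeline, higher-scoring sentences are the ones flagged for review.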
An article on how to integrate vulnerability scanning, model validation, and CI/CD pipeline optimization to ensure the reliability and security of your AI models.
A comprehensive article that breaks down the mathematical concepts behind Steerable Neural Networks, and explains how to design these networks.
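The defining property of a steerable (equivariant) layer is that transforming the input transforms the output correspondingly: f(g·x) = g·f(x). A minimal numerical sketch of that property (my own illustration, not taken from the article): a convolution whose kernel has been symmetrized over the four 90° rotations is exactly equivariant to those rotations.

```python
import numpy as np

def conv2d_same(x, k):
    """Naive 'same' 2D cross-correlation with zero padding."""
    p = k.shape[0] // 2
    xp = np.pad(x, p)
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + k.shape[0], j:j + k.shape[1]] * k)
    return out

rng = np.random.default_rng(0)
k = rng.normal(size=(3, 3))
# Symmetrize the kernel over the C4 group (rotations by 0/90/180/270 degrees).
k_sym = np.mean([np.rot90(k, i) for i in range(4)], axis=0)

x = rng.normal(size=(8, 8))
lhs = conv2d_same(np.rot90(x), k_sym)  # rotate, then convolve
rhs = np.rot90(conv2d_same(x, k_sym))  # convolve, then rotate
assert np.allclose(lhs, rhs)           # equivariance holds exactly
```

Steerable CNNs generalize this trick: rather than forcing kernels to be invariant, they let features transform under richer group representations, which is where the mathematics the article covers comes in.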
A comprehensive guide to using Comet with GitLab’s DevOps platform to streamline the workflow for your ML and software engineering teams.
An introductory article that highlights the parts of a Direct Preference Optimization pipeline.
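At the core of any such pipeline sits the DPO objective itself. A minimal sketch of the standard per-example DPO loss (variable names are mine): it rewards the policy for raising the log-probability of the chosen response relative to the rejected one, measured against a frozen reference model.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-example Direct Preference Optimization loss.

    Each argument is a sequence log-probability: pi_* under the policy being
    trained, ref_* under the frozen reference model. beta controls how far
    the policy is allowed to drift from the reference.
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# When policy and reference agree, the margin is 0 and the loss is log(2).
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 4))  # 0.6931
```

Training minimizes the mean of this loss over a dataset of (prompt, chosen, rejected) triples; no separate reward model or RL loop is needed.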
A technical blog that unpacks ConstitutionalChain’s functions and applications, and how it paves the way for more ethical AI systems.
Libraries & Code
A research project that leverages on-device ML to enable people who are blind or have low vision to walk or run for exercise independently.
An innovative library of open-source language models, fine-tuned with C-RLFT – a strategy inspired by offline reinforcement learning.
A curated collection of research papers exploring the utilization of LLMs for graph-related tasks.
Papers & Publications
The Large Vision-Language Model (LVLM) has enhanced the performance of various downstream tasks in visual-language understanding. Most existing approaches encode images and videos into separate feature spaces, which are then fed as inputs to large language models. However, due to the lack of unified tokenization for images and videos, namely misalignment before projection, it becomes challenging for a Large Language Model (LLM) to learn multi-modal interactions from several poor projection layers. In this work, we unify visual representation into the language feature space to advance the foundational LLM towards a unified LVLM. As a result, we establish a simple but robust LVLM baseline, Video-LLaVA, which learns from a mixed dataset of images and videos, mutually enhancing each other. Video-LLaVA achieves superior performances on a broad range of 9 image benchmarks across 5 image question-answering datasets and 4 image benchmark toolkits. Additionally, our Video-LLaVA also outperforms Video-ChatGPT by 5.8%, 9.9%, 18.6%, and 10.1% on MSRVTT, MSVD, TGIF, and ActivityNet, respectively. Notably, extensive experiments demonstrate that Video-LLaVA mutually benefits images and videos within a unified visual representation, outperforming models designed specifically for images or videos. We aim for this work to provide modest insights into the multi-modal inputs for the LLM.
Successfully handling context is essential for any dialog-understanding task. This context may be conversational (relying on previous user queries or system responses), visual (relying on what the user sees, for example, on their screen), or background (based on signals such as a ringing alarm or playing music). In this work, we present an overview of MARRS, or Multimodal Reference Resolution System, an on-device framework within a Natural Language Understanding system, responsible for handling conversational, visual, and background context. In particular, we present different machine learning models to enable handling contextual queries; specifically, one to enable reference resolution and one to handle context via query rewriting. We also describe how these models complement each other to form a unified, coherent, lightweight system that can understand context while preserving user privacy.
Thanks for reading Deep Learning Weekly! Subscribe for free to receive new posts and support my work.