Deep Learning Weekly: Issue 329
Inflection-2, Automatic detection of hallucination with SelfCheckGPT, Steerable Neural Networks, a paper on Video-LLaVA: Learning United Visual Representation by Alignment Before Projection, and more!
This week in deep learning, we bring you Inflection-2, Automatic detection of hallucination with SelfCheckGPT, Steerable Neural Networks, and a paper on Video-LLaVA: Learning United Visual Representation by Alignment Before Projection.
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Inflection announced that it has completed training Inflection-2, which it claims is the best model in the world for its compute class and the second most capable LLM in the world today.
LlamaIndex secured $8.5 million in seed funding, led by Greylock, to help propel its efforts to scale.
Researchers from MIT, Harvard University, and elsewhere have developed Human Guided Exploration, a new reinforcement learning approach that leverages crowdsourced feedback.
AWS unveiled Graviton4 and Trainium2, two next-generation chips from its silicon families for general-purpose cloud computing and high-efficiency AI training.
The U.K. government will invest £500 million, or $626 million, to provide local researchers and organizations with access to compute capacity for artificial intelligence projects.
MLOps & LLMOps
A deep dive into the new features in the latest release of DLCs, along with a discussion of performance benchmarks.
An article containing the highlights of Dataiku’s talk on LLM Mesh, governance, and scaling.
This notebook demonstrates how hallucination metrics, such as the SelfCheckGPT NLI score, can be used to automatically detect hallucinations.
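As a rough sketch of the idea behind the SelfCheckGPT NLI score (a simplified illustration, not the library's actual API): sample several stochastic responses to the same prompt, then score each sentence of the main response by how strongly an NLI model says the samples contradict it. The `contradiction_prob` callable below is a hypothetical stand-in for a real NLI model such as a DeBERTa-MNLI checkpoint.

```python
from statistics import mean

def selfcheck_nli(sentences, samples, contradiction_prob):
    """Score each sentence of a response for hallucination, SelfCheckGPT-style.

    sentences: list of sentences from the main LLM response.
    samples: list of stochastically re-sampled responses to the same prompt.
    contradiction_prob: callable (premise, hypothesis) -> float in [0, 1],
        intended to be P(contradiction) from an NLI model (hypothetical here).

    Returns one score per sentence: the mean probability that the sampled
    responses contradict it. Higher scores suggest hallucination, since
    unsupported facts tend not to reappear consistently across samples.
    """
    return [mean(contradiction_prob(s, sent) for s in samples)
            for sent in sentences]

# Toy demo with a dummy "NLI model" based on substring matching.
dummy_nli = lambda premise, hyp: 0.0 if hyp in premise else 1.0
samples = ["Paris is the capital of France.", "France's capital is Paris."]
scores = selfcheck_nli(
    ["Paris is the capital of France.", "It has 90 million residents."],
    samples,
    dummy_nli,
)
print(scores)  # [0.5, 1.0] -- the unsupported sentence gets the higher score
```

In the real pipeline, higher-scoring sentences are the ones flagged for review.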
An article on how to integrate vulnerability scanning, model validation, and CI/CD pipeline optimization to ensure the reliability and security of your AI models.
A comprehensive article that breaks down the mathematical concepts behind Steerable Neural Networks, and explains how to design these networks.
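The defining property of a steerable (equivariant) layer is that transforming the input transforms the output correspondingly: f(g·x) = g·f(x). A minimal numerical sketch of that property (my own illustration, not taken from the article): a convolution whose kernel has been symmetrized over the four 90° rotations is exactly equivariant to those rotations.

```python
import numpy as np

def conv2d_same(x, k):
    """Naive 'same' 2D cross-correlation with zero padding."""
    p = k.shape[0] // 2
    xp = np.pad(x, p)
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + k.shape[0], j:j + k.shape[1]] * k)
    return out

rng = np.random.default_rng(0)
k = rng.normal(size=(3, 3))
# Symmetrize the kernel over the C4 group (rotations by 0/90/180/270 degrees).
k_sym = np.mean([np.rot90(k, i) for i in range(4)], axis=0)

x = rng.normal(size=(8, 8))
lhs = conv2d_same(np.rot90(x), k_sym)  # rotate, then convolve
rhs = np.rot90(conv2d_same(x, k_sym))  # convolve, then rotate
assert np.allclose(lhs, rhs)           # equivariance holds exactly
```

Steerable CNNs generalize this trick: rather than forcing kernels to be invariant, they let features transform under richer group representations, which is where the mathematics the article covers comes in.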
A comprehensive guide to using Comet with GitLab’s DevOps platform to streamline the workflow for your ML and software engineering teams.
An introductory article that highlights the parts of a Direct Preference Optimization pipeline.
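At the core of any such pipeline sits the DPO objective itself. A minimal sketch of the standard per-example DPO loss (variable names are mine): it rewards the policy for raising the log-probability of the chosen response relative to the rejected one, measured against a frozen reference model.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-example Direct Preference Optimization loss.

    Each argument is a sequence log-probability: pi_* under the policy being
    trained, ref_* under the frozen reference model. beta controls how far
    the policy is allowed to drift from the reference.
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# When policy and reference agree, the margin is 0 and the loss is log(2).
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 4))  # 0.6931
```

Training minimizes the mean of this loss over a dataset of (prompt, chosen, rejected) triples; no separate reward model or RL loop is needed.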
A technical blog that unpacks ConstitutionalChain’s functions and applications, and how it paves the way for more ethical AI systems.
Libraries & Code
A research project that leverages on-device ML to enable people who are blind or have low vision to walk or run for exercise independently.
An innovative library of open-source language models, fine-tuned with C-RLFT – a strategy inspired by offline reinforcement learning.
A curated collection of research papers exploring the utilization of LLMs for graph-related tasks.
Papers & Publications
The Large Vision-Language Model (LVLM) has enhanced the performance of various downstream tasks in visual-language understanding. Most existing approaches encode images and videos into separate feature spaces, which are then fed as inputs to large language models. However, due to the lack of unified tokenization for images and videos, namely misalignment before projection, it becomes challenging for a Large Language Model (LLM) to learn multi-modal interactions from several poor projection layers. In this work, we unify visual representation into the language feature space to advance the foundational LLM towards a unified LVLM. As a result, we establish a simple but robust LVLM baseline, Video-LLaVA, which learns from a mixed dataset of images and videos, mutually enhancing each other. Video-LLaVA achieves superior performances on a broad range of 9 image benchmarks across 5 image question-answering datasets and 4 image benchmark toolkits. Additionally, our Video-LLaVA also outperforms Video-ChatGPT by 5.8%, 9.9%, 18.6%, and 10.1% on MSRVTT, MSVD, TGIF, and ActivityNet, respectively. Notably, extensive experiments demonstrate that Video-LLaVA mutually benefits images and videos within a unified visual representation, outperforming models designed specifically for images or videos. We aim for this work to provide modest insights into the multi-modal inputs for the LLM.
Successfully handling context is essential for any dialog-understanding task. This context may be conversational (relying on previous user queries or system responses), visual (relying on what the user sees, for example, on their screen), or background (based on signals such as a ringing alarm or playing music). In this work, we present an overview of MARRS, or Multimodal Reference Resolution System, an on-device framework within a Natural Language Understanding system, responsible for handling conversational, visual, and background context. In particular, we present different machine learning models to enable handling contextual queries; specifically, one to enable reference resolution and one to handle context via query rewriting. We also describe how these models complement each other to form a unified, coherent, lightweight system that can understand context while preserving user privacy.
Thanks for reading Deep Learning Weekly! Subscribe for free to receive new posts and support my work.