Deep Learning Weekly: Issue 376
Computer Use with Claude, Using Dictionary Learning Features as Classifiers, a paper on Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance, and many more!
This week in deep learning, we bring you Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku from Anthropic, Using Dictionary Learning Features as Classifiers, and a paper on Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance.
You may also enjoy State of AI Report, VQAScore: Evaluating and Improving Vision-Language Generative Models, a paper on What Are the Odds? Language Models Are Capable of Probabilistic Reasoning, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku
Aside from upgrading Sonnet and Haiku, Anthropic introduced a groundbreaking new capability in public beta: computer use.
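For readers who want to try the beta, below is a minimal sketch of a computer-use request made with the anthropic Python SDK. The tool types, beta flag, and model string are the values announced at launch and may change; treat this as an assumption-laden sketch rather than a definitive integration.

```python
# A minimal sketch of a computer-use request against the public beta, assuming
# the anthropic Python SDK and the tool identifiers announced at launch.
# Your application still has to execute the returned tool calls (screenshots,
# clicks, keystrokes) and send the results back in follow-up messages.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.beta.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=[
        {
            "type": "computer_20241022",   # virtual display tool
            "name": "computer",
            "display_width_px": 1024,
            "display_height_px": 768,
            "display_number": 1,
        },
        {"type": "bash_20241022", "name": "bash"},
        {"type": "text_editor_20241022", "name": "str_replace_editor"},
    ],
    messages=[{"role": "user", "content": "Open a browser and check today's weather."}],
    betas=["computer-use-2024-10-22"],
)

print(response.stop_reason)  # "tool_use" when Claude wants to act on the screen
```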
State of AI Report
A report on the state of AI in 2024, highlighting key trends such as the convergence of frontier lab performance, the rise of planning and reasoning in LLM research, and more.
Genmo introduces Mochi 1, an open-source text-to-video generation model
Genmo, an AI content generation platform, announced the preview release of Mochi 1, an open-source rival to video generation tools such as Runway and Kling.
Sharing new research, models, and datasets from Meta FAIR
Meta FAIR publicly released several new research artifacts in support of their goal of achieving advanced machine intelligence (AMI) while also supporting open science and reproducibility.
Un Ministral, des Ministraux | Mistral AI
Mistral AI introduced two new state-of-the-art models, Ministral 3B and Ministral 8B, for on-device computing and at-the-edge use cases.
PyTorch 2.5 Release Blog
The PyTorch team announced the release of PyTorch 2.5, which focuses on performance enhancements for scaled dot product attention, regional compilation in torch.compile, and more.
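As a rough illustration (not code from the release notes), the sketch below uses F.scaled_dot_product_attention inside a small attention block and compiles that repeated block on its own rather than wrapping the whole model, which is the usage pattern the regional-compilation work targets; dimensions and layer count are arbitrary.

```python
# Illustrative only: a repeated attention block built on the fused SDPA kernel,
# compiled per-block ("regionally") instead of compiling the full model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to [batch, heads, seq, head_dim] as SDPA expects
        q, k, v = (z.view(b, t, self.n_heads, -1).transpose(1, 2) for z in (q, k, v))
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # fused attention
        return self.proj(y.transpose(1, 2).reshape(b, t, d))

layers = nn.ModuleList([SelfAttention(256, 8) for _ in range(4)])
for layer in layers:
    layer.compile()  # compile each repeated region instead of the whole model

x = torch.randn(2, 128, 256)
for layer in layers:
    x = layer(x)
```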
MLOps & LLMOps
An article exploring the concept of memory in AI agents, discussing different types of memory, methods for updating memory, and its importance in enhancing agent capabilities.
Deploy Llama 3.2 Vision on Amazon SageMaker
A comprehensive blog post with a step-by-step guide to deploying Llama 3.2 Vision on Amazon SageMaker, using the Hugging Face LLM DLC for secure and managed deployment.
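As a rough outline of that flow, here is a hedged sketch using the SageMaker Python SDK; the model ID, instance type, and environment variables below are illustrative assumptions rather than the guide's exact values.

```python
# A minimal sketch of deploying a Hugging Face LLM DLC endpoint with the
# SageMaker Python SDK; model ID, instance type, and env values are
# illustrative assumptions, not the exact settings from the linked guide.
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()                     # SageMaker execution role
image_uri = get_huggingface_llm_image_uri("huggingface")  # TGI-based LLM DLC image

model = HuggingFaceModel(
    role=role,
    image_uri=image_uri,
    env={
        "HF_MODEL_ID": "meta-llama/Llama-3.2-11B-Vision-Instruct",  # gated model
        "HF_TOKEN": "<your-hf-token>",                              # needed for gated weights
        "SM_NUM_GPUS": "4",                                         # shard across GPUs
        "MAX_INPUT_TOKENS": "4096",
        "MAX_TOTAL_TOKENS": "8192",
    },
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",  # illustrative multi-GPU instance
)
print(predictor.endpoint_name)
```

Once the endpoint is up it can be queried with predictor.predict(...), and it should be deleted afterwards to avoid idle charges.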
Learning
VQAScore: Evaluating and Improving Vision-Language Generative Models
An article that discusses VQAScore, a new evaluation metric for vision-language models, and GenAI-Bench, a challenging dataset for compositional text-to-visual generation.
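At its core, VQAScore rates an image-text pair by the probability that a VQA model answers "Yes" to an alignment question about the text. The sketch below is a model-agnostic illustration of that scoring rule; the prompt template and the model/processor interface are assumptions, not the authors' released package.

```python
# A conceptual sketch of the VQAScore idea: score = P("Yes") from a generative
# VQA model asked whether the image shows the candidate text. The model and
# processor are assumed to follow a standard Hugging Face-style interface.
import torch

def vqascore(model, processor, image, text: str) -> float:
    question = f'Does this figure show "{text}"? Please answer yes or no.'
    inputs = processor(images=image, text=question, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token logits
    yes_id = processor.tokenizer(" Yes", add_special_tokens=False).input_ids[0]
    no_id = processor.tokenizer(" No", add_special_tokens=False).input_ids[0]
    probs = torch.softmax(logits[[yes_id, no_id]], dim=-1)
    return probs[0].item()  # probability mass on "Yes"
```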
Using Dictionary Learning Features as Classifiers
A technical blog post discussing the use of dictionary learning features as classifiers in large language models.
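As a rough sketch of the idea (not the post's code), one can treat the activation of a single dictionary feature, max-pooled over a text's tokens, as a classifier score; the encoder weights, bias, and feature index below are hypothetical placeholders.

```python
# Illustrative sketch: use one sparse dictionary feature as a binary classifier
# by thresholding its strongest activation across the tokens of a text.
# W_enc, b_enc, and feature_idx are hypothetical placeholders.
import numpy as np

def feature_score(resid_acts: np.ndarray, W_enc: np.ndarray,
                  b_enc: np.ndarray, feature_idx: int) -> float:
    """resid_acts: [n_tokens, d_model] activations for one text."""
    feats = np.maximum(resid_acts @ W_enc + b_enc, 0.0)  # ReLU dictionary features
    return float(feats[:, feature_idx].max())            # max-pool over tokens

def classify(resid_acts, W_enc, b_enc, feature_idx, threshold=1.0) -> bool:
    # Predict the positive class when the feature fires strongly anywhere in the text.
    return feature_score(resid_acts, W_enc, b_enc, feature_idx) > threshold
```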
Florence-2: Advancing Multiple Vision Tasks with a Single VLM Model
A guided exploration of Florence-2's zero-shot capabilities: captioning, object detection, segmentation and OCR.
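For context, those zero-shot tasks are driven by task-prompt tokens; the sketch below follows the Hugging Face model card's usage pattern, with a placeholder image path and simplified generation settings rather than the article's exact setup.

```python
# A minimal sketch of prompting Florence-2 for a zero-shot task via the
# Hugging Face checkpoint; image path and generation settings are placeholders.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-large"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("example.jpg")  # placeholder image
task = "<OD>"                      # object detection; also "<CAPTION>", "<OCR>", ...

inputs = processor(text=task, images=image, return_tensors="pt")
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=512,
)
raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
# Task-aware post-processing turns the raw text into boxes, labels, or captions.
result = processor.post_process_generation(raw, task=task, image_size=image.size)
print(result)
```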
A Sanity Check on ‘Emergent Properties’ in Large Language Models
A critical article examining the concept of "emergent properties" in large language models and arguing for a more rigorous and evidence-based approach to understanding LLM capabilities.
Libraries & Code
huggingface/autotrain-advanced
A no-code solution that allows you to train machine learning models in just a few clicks.
microsoft/BitNet
Official inference framework for 1-bit LLMs.
facebookresearch/spiritlm
Inference code for the paper "Spirit-LM Interleaved Spoken and Written Language Model".
Papers & Publications
Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance
Abstract:
Multimodal large language models (MLLMs) have demonstrated impressive performance in vision-language tasks across a broad spectrum of domains. However, the large model scale and associated high computational costs pose significant challenges for training and deploying MLLMs on consumer-grade GPUs or edge devices, thereby hindering their widespread application. In this work, we introduce Mini-InternVL, a series of MLLMs with parameters ranging from 1B to 4B, which achieves 90% of the performance with only 5% of the parameters. This significant improvement in efficiency and effectiveness makes our models more accessible and applicable in various real-world scenarios. To further promote the adoption of our models, we develop a unified adaptation framework for Mini-InternVL, which enables our models to transfer and outperform specialized models in downstream tasks, including autonomous driving, medical images, and remote sensing. We believe that our study can provide valuable insights and resources to advance the development of efficient and effective MLLMs.
What Are the Odds? Language Models Are Capable of Probabilistic Reasoning
Abstract:
Language models (LMs) are capable of remarkably complex linguistic tasks; however, numerical reasoning is an area in which they frequently struggle. An important but rarely evaluated form of reasoning is understanding probability distributions. In this paper, we focus on evaluating the probabilistic reasoning capabilities of LMs using idealized and real-world statistical distributions. We perform a systematic evaluation of state-of-the-art LMs on three tasks: estimating percentiles, drawing samples, and calculating probabilities. We evaluate three ways to provide context to LMs: 1) anchoring examples from within a distribution or family of distributions, 2) real-world context, and 3) summary statistics on which to base a Normal approximation. Models can make inferences about distributions, and can be further aided by the incorporation of real-world context, example shots and simplified assumptions, even if these assumptions are incorrect or misspecified. To conduct this work, we developed a comprehensive benchmark distribution dataset with associated question-answer pairs that we have released publicly.
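To make the three tasks concrete, here is a small illustrative example, using an assumed Normal distribution rather than anything from the paper's benchmark, of the ground-truth answers an evaluated model would be scored against.

```python
# Illustrative ground truths for the three probed tasks on an idealized
# distribution; the "adult height" Normal below is an assumption for the example.
from scipy import stats

dist = stats.norm(loc=170, scale=10)  # hypothetical heights in cm

# 1) Estimating percentiles: what fraction of the distribution lies below 180 cm?
percentile_of_180 = dist.cdf(180)             # ~0.84

# 2) Drawing samples: an LM's "samples" should roughly match this spread.
samples = dist.rvs(size=5, random_state=0)

# 3) Calculating probabilities: mass within an interval.
p_160_to_180 = dist.cdf(180) - dist.cdf(160)  # ~0.68

print(percentile_of_180, samples, p_160_to_180)
```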
Hallo2: Long-Duration and High-Resolution Audio-Driven Portrait Image Animation
Abstract:
Recent advances in latent diffusion-based generative models for portrait image animation, such as Hallo, have achieved impressive results in short-duration video synthesis. In this paper, we present updates to Hallo, introducing several design enhancements to extend its capabilities. First, we extend the method to produce long-duration videos. To address substantial challenges such as appearance drift and temporal artifacts, we investigate augmentation strategies within the image space of conditional motion frames. Specifically, we introduce a patch-drop technique augmented with Gaussian noise to enhance visual consistency and temporal coherence over long duration. Second, we achieve 4K resolution portrait video generation. To accomplish this, we implement vector quantization of latent codes and apply temporal alignment techniques to maintain coherence across the temporal dimension. By integrating a high-quality decoder, we realize visual synthesis at 4K resolution. Third, we incorporate adjustable semantic textual labels for portrait expressions as conditional inputs. This extends beyond traditional audio cues to improve controllability and increase the diversity of the generated content. To the best of our knowledge, Hallo2, proposed in this paper, is the first method to achieve 4K resolution and generate hour-long, audio-driven portrait image animations enhanced with textual prompts. We have conducted extensive experiments to evaluate our method on publicly available datasets, including HDTF, CelebV, and our introduced "Wild" dataset. The experimental results demonstrate that our approach achieves state-of-the-art performance in long-duration portrait video animation, successfully generating rich and controllable content at 4K resolution for duration extending up to tens of minutes.