Deep Learning Weekly: Issue 377
Meta's quantized Llama models, the Hugging Face Evaluation Guidebook, a paper on Speculative Streaming: Fast LLM Inference Without Auxiliary Models, and many more!
This week in deep learning, we bring you Meta's quantized Llama models, the Hugging Face Evaluation Guidebook, and a paper on Speculative Streaming: Fast LLM Inference Without Auxiliary Models.
You may also enjoy the new analysis tool in Claude.ai, Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback, a paper on InkSight: Offline-to-Online Handwriting Conversion by Learning to Read and Write, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Introducing quantized Llama models with increased speed and a reduced memory footprint
Meta released their first lightweight quantized Llama models that are small and performant enough to run on many popular mobile devices.
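To get a feel for the memory savings, here is a minimal sketch of loading a small Llama model with 4-bit weight quantization via the Hugging Face transformers/bitsandbytes route. This is illustrative only: Meta's release relies on quantization-aware training and SpinQuant rather than this post-hoc scheme, and the model id below is an assumption.

```python
# Minimal sketch: loading a Llama model with 4-bit weight quantization via
# bitsandbytes. Illustrative only -- not Meta's QAT/SpinQuant pipeline.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.2-1B-Instruct"  # assumed model id

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # quantize weights to 4 bits
    bnb_4bit_compute_dtype=torch.bfloat16,   # compute in bf16 for speed
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

inputs = tokenizer("Quantization reduces memory by", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```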
Introducing the analysis tool in Claude.ai
Anthropic introduced the analysis tool, a new built-in feature for Claude.ai that enables Claude to write and run JavaScript code.
Aya Expanse: Connecting Our World
Cohere For AI launched Aya Expanse, a state-of-the-art multilingual family of models to help close the language gap with AI.
LinkedIn launches its first AI agent to take on the role of job recruiters
LinkedIn unveiled Hiring Assistant, a new AI agent designed to take on a wide array of recruitment tasks, from generating job descriptions to sourcing candidates.
Introducing the next-level of AI-powered workflows with Amazon Q Developer inline chat
Amazon Q Developer announced support for inline chat, which empowers developers to tackle complex coding challenges efficiently.
Generative AI, the American worker, and the future of work
A report on the potential impact of generative AI on the American workforce and the need for proactive responses from employers, workers, and policymakers.
MLOps & LLMOps
A post that outlines how developers can scale customer service operations with generative AI using the NVIDIA NIM Agent Blueprint for AI virtual assistants.
Evaluating Model Retraining Strategies
A blog post about various model retraining strategies to mitigate performance degradation caused by data and concept drift.
Slurm vs Kubernetes: Which to choose for your ML workloads
A useful article exploring the strengths and weaknesses of Slurm and Kubernetes in the context of scaling machine learning workloads.
Learning
Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback
A blog post about how a new routing framework can be used to create high-quality and cost-efficient preference data by combining human preferences with synthetic preferences.
Universal Assisted Generation: Faster Decoding with Any Assistant Model
A technical blog post about a new method called Universal Assisted Generation that extends assisted generation to work with small language models from any model family.
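As a rough illustration, here is a minimal sketch of how this looks with the transformers `generate` API: passing both tokenizers lets the library translate draft tokens across mismatched vocabularies, so the draft model can come from a different family than the target. The model ids below are placeholders chosen for illustration.

```python
# Minimal sketch of universal assisted generation in Hugging Face transformers:
# a small draft model from a *different* family speeds up a larger target model.
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "meta-llama/Llama-3.1-8B-Instruct"  # assumed target model
draft_id = "Qwen/Qwen2.5-0.5B-Instruct"         # assumed draft from another family

tokenizer = AutoTokenizer.from_pretrained(target_id)
assistant_tokenizer = AutoTokenizer.from_pretrained(draft_id)

model = AutoModelForCausalLM.from_pretrained(target_id, device_map="auto")
assistant = AutoModelForCausalLM.from_pretrained(draft_id, device_map="auto")

inputs = tokenizer("Assisted generation works by", return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    assistant_model=assistant,                 # drafts candidate tokens
    tokenizer=tokenizer,                       # target tokenizer, needed so the
    assistant_tokenizer=assistant_tokenizer,   # library can re-encode drafts
    max_new_tokens=40,
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```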
AlignEval: Building an App to Make Evals Easy, Fun, and Automated
A detailed blog post about AlignEval, which streamlines the process of building LLM evaluators, while keeping it fun.
Combining next-token prediction and video diffusion in computer vision and robotics
An article introducing a new technique for training sequence models that combines next-token prediction and video diffusion, called Diffusion Forcing.
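The core trick is easiest to see in pseudocode: during training, each position in the sequence is corrupted with its own independently sampled noise level, rather than one shared level for the whole sequence, so the model learns to denoise under partially uncertain futures. The sketch below is a simplified rendering under stated assumptions; the `model` interface, noise schedule, and shapes are placeholders, not the paper's exact formulation.

```python
# Conceptual sketch of a Diffusion Forcing-style training step: each token or
# frame gets its own independent noise level. Interfaces here are assumptions.
import torch
import torch.nn.functional as F

def diffusion_forcing_step(model, x, num_noise_levels=1000):
    """x: clean token/frame embeddings of shape (batch, seq_len, dim)."""
    b, t, d = x.shape
    # Independent noise level per sequence position (the core idea).
    k = torch.randint(0, num_noise_levels, (b, t), device=x.device)
    alpha = 1.0 - k.float() / num_noise_levels           # toy noise schedule
    noise = torch.randn_like(x)
    x_noisy = alpha[..., None].sqrt() * x + (1 - alpha[..., None]).sqrt() * noise
    # A causal sequence model predicts the noise given per-token noise levels.
    pred = model(x_noisy, noise_level=k)                 # assumed interface
    return F.mse_loss(pred, noise)
```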
Libraries & Code
deepseek-ai/Janus
A novel autoregressive framework that unifies multimodal understanding and generation.
huggingface/evaluation-guidebook
A guidebook that covers the different ways you can evaluate a model, guides on designing your own evaluations, and tips and tricks from practical experience.
Papers & Publications
InkSight: Offline-to-Online Handwriting Conversion by Learning to Read and Write
Abstract:
Digital note-taking is gaining popularity, offering a durable, editable, and easily indexable way of storing notes in vectorized form, known as digital ink. However, a substantial gap remains between this way of note-taking and traditional pen-and-paper note-taking, a practice still favored by a vast majority. Our work, InkSight, aims to bridge the gap by empowering physical note-takers to effortlessly convert their work (offline handwriting) to digital ink (online handwriting), a process we refer to as Derendering. Prior research on the topic has focused on the geometric properties of images, resulting in limited generalization beyond their training domains. Our approach combines reading and writing priors, allowing for training a model in the absence of large amounts of paired samples, which are difficult to obtain. To our knowledge, this is the first work that effectively derenders handwritten text in arbitrary photos with diverse visual characteristics and backgrounds. Furthermore, it generalizes beyond its training domain into simple sketches. Our human evaluation reveals that 87% of the samples produced by our model on the challenging HierText dataset are considered a valid tracing of the input image, and 67% look like a pen trajectory traced by a human.
Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities
Abstract:
GPT-4o, an all-encompassing model, represents a milestone in the development of large multi-modal language models. It can understand visual, auditory, and textual modalities, directly output audio, and support flexible duplex interaction. Models from the open-source community often achieve some functionalities of GPT-4o, such as visual understanding and voice chat. Nevertheless, training a unified model that incorporates all modalities is challenging due to the complexities of multi-modal data, intricate model architectures, and training processes. In this paper, we introduce Mini-Omni2, a visual-audio assistant capable of providing real-time, end-to-end voice responses to vision and audio queries. By integrating pretrained visual and auditory encoders, Mini-Omni2 maintains performance in individual modalities. We propose a three-stage training process to align modalities, allowing the language model to handle multi-modal inputs and outputs after training on a limited dataset. For interaction, we introduce a command-based interruption mechanism, enabling more flexible interaction with users. To the best of our knowledge, Mini-Omni2 is one of the closest reproductions of GPT-4o, offering a similar range of functionality, and we hope it can offer valuable insights for subsequent research.
Speculative Streaming: Fast LLM Inference Without Auxiliary Models
Abstract:
Speculative decoding is a prominent technique to speed up the inference of a large target language model based on predictions of an auxiliary draft model. While effective, in application-specific settings it often involves fine-tuning both draft and target models to achieve high acceptance rates. As the number of downstream tasks grows, these draft models add significant complexity to inference systems. We propose Speculative Streaming, a single-model speculative decoding method that fuses drafting into the target model by changing the fine-tuning objective from next-token prediction to future n-gram prediction. Speculative Streaming speeds up decoding by 1.8-3.1X across a diverse set of tasks, such as Summarization, Structured Queries, and Meaning Representation, without sacrificing generation quality. Additionally, Speculative Streaming is parameter-efficient: it achieves on-par or higher speed-ups than Medusa-style architectures while using ~10000X fewer extra parameters, making it well-suited for resource-constrained devices.
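To make the objective change concrete, here is a hedged sketch of what a future n-gram prediction loss can look like: auxiliary heads are trained to predict tokens two or more positions ahead, so the target model drafts its own speculative continuation without a separate draft model. The head wiring and names below are assumptions for illustration, not the paper's exact multi-stream design.

```python
# Hedged sketch of a future n-gram fine-tuning objective: besides the usual
# next-token head, auxiliary "stream" heads predict tokens 2..n positions ahead.
import torch
import torch.nn.functional as F

def ngram_loss(hidden, lm_head, stream_heads, labels):
    """hidden: (batch, seq, dim) final states; labels: (batch, seq) token ids."""
    # Standard next-token loss (predict position t+1 from position t).
    logits = lm_head(hidden[:, :-1])
    loss = F.cross_entropy(logits.transpose(1, 2), labels[:, 1:])
    # Auxiliary heads predict positions t+2, t+3, ... from the same states.
    for j, head in enumerate(stream_heads, start=2):
        logits_j = head(hidden[:, :-j])
        loss = loss + F.cross_entropy(logits_j.transpose(1, 2), labels[:, j:])
    return loss / (1 + len(stream_heads))
```

At inference, the same heads emit candidate future tokens that the model then verifies in a single forward pass, which is what removes the need for an auxiliary draft model.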