Deep Learning Weekly: Issue 366
Sakana AI’s Fully Automated AI Scientist, Crossing Linguistic Horizons: Finetuning and Comprehensive Evaluation of Vietnamese LLMs, Distributed Pipeline Parallelism, a paper on ToolSandbox, and more!
This week in deep learning, we bring you The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery, Crossing Linguistic Horizons: Finetuning and Comprehensive Evaluation of Vietnamese Large Language Models, Distributed Pipeline Parallelism, and a paper on ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities.
You may also enjoy Falcon Mamba 7B, Can Large Language Models Explain Their Internal Mechanisms?, a paper on MiniCPM-V: A GPT-4V Level MLLM on Your Phone, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery
Sakana AI introduces The AI Scientist, the first comprehensive system for fully automatic scientific discovery, enabling LLMs to perform research independently.
OpenAI reportedly leads $60M round for webcam startup Opal
OpenAI is expected to lead a $60 million funding round for Opal, a consumer electronics startup that develops high-end webcams.
Falcon Mamba 7B's new AI architecture rivals transformer models
Technology Innovation Institute (TII) released a new open-source model called Falcon Mamba 7B which rivals state-of-the-art transformer models.
Move over, Devin: Cosine's Genie takes the AI coding crown
Cosine has announced its own AI-powered engineer Genie, which reportedly outperforms Devin, scoring 30% on SWE-Bench compared to Devin’s 13.8%.
Hugging Face acquires XetHub to enhance its AI storage infrastructure
Hugging Face has acquired XetHub, a startup that helps developers manage the files they create as part of AI projects.
MLOps & LLMOps
Introduction to Distributed Pipeline Parallelism
A tutorial that uses a GPT-style transformer model to demonstrate how to implement distributed pipeline parallelism with the torch.distributed.pipelining APIs.
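For a flavor of what the tutorial covers, here is a minimal sketch of the torch.distributed.pipelining flow on a toy model; the model, split point, and microbatch counts are illustrative assumptions rather than the tutorial's exact setup, and it assumes PyTorch 2.4+ with two processes launched via torchrun.

```python
# Minimal sketch: split a toy model into two pipeline stages and run a GPipe
# schedule. Assumes two processes, e.g. torchrun --nproc-per-node=2 pipe_demo.py
import os
import torch
import torch.distributed as dist
from torch.distributed.pipelining import pipeline, SplitPoint, ScheduleGPipe

class ToyModel(torch.nn.Module):
    def __init__(self, dim=512, n_layers=8):
        super().__init__()
        self.layers = torch.nn.ModuleDict(
            {str(i): torch.nn.Linear(dim, dim) for i in range(n_layers)}
        )

    def forward(self, x):
        for layer in self.layers.values():
            x = torch.relu(layer(x))
        return x

rank = int(os.environ["RANK"])
dist.init_process_group(backend="gloo")  # "nccl" on GPUs

# Trace the model and cut it into two stages at layer 4.
example_microbatch = torch.randn(8, 512)
pipe = pipeline(
    ToyModel(),
    mb_args=(example_microbatch,),
    split_spec={"layers.4": SplitPoint.BEGINNING},
)
stage = pipe.build_stage(rank, device=torch.device("cpu"))

# GPipe schedule over 4 microbatches: rank 0 feeds the full batch,
# the last stage receives the pipeline output.
schedule = ScheduleGPipe(stage, n_microbatches=4)
full_batch = torch.randn(32, 512)
if rank == 0:
    schedule.step(full_batch)
else:
    output = schedule.step()

dist.destroy_process_group()
```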
Curating Custom Datasets for LLM Parameter-Efficient Fine-Tuning with NVIDIA NeMo Curator
A post that walks you through creating a custom data curation pipeline using NeMo Curator, focusing specifically on SFT and PEFT use cases.
How to Deploy the Open-Source Milvus Vector Database on Amazon EKS
An article that provides step-by-step guidance on deploying a Milvus cluster using EKS and other AWS services.
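Once a cluster like the one in the article is exposed (for example through a LoadBalancer service), a quick connectivity check from Python might look like the sketch below; the endpoint is a placeholder, and pymilvus is assumed as the client library.

```python
# Hypothetical smoke test against a Milvus cluster deployed on EKS; replace the
# host with your own LoadBalancer DNS name.
from pymilvus import connections, utility

connections.connect(alias="default",
                    host="milvus-lb.example.amazonaws.com", port="19530")
print(utility.get_server_version())  # confirms the cluster is reachable
print(utility.list_collections())    # empty list on a fresh deployment
```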
Learning
Fine-tune Llama 3.1 Ultra-Efficiently with Unsloth
An article that provides a comprehensive overview of supervised fine-tuning and demonstrates how to fine-tune Llama 3.1 8B using Unsloth.
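As a rough sketch of the workflow the article walks through: load a 4-bit checkpoint through Unsloth, attach LoRA adapters, and hand the model to TRL's SFTTrainer. The model name, dataset, and hyperparameters here are illustrative assumptions, and the SFTTrainer keyword arguments follow older TRL releases.

```python
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Load a 4-bit Llama 3.1 8B checkpoint through Unsloth.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B",  # assumed checkpoint name
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters so only a small fraction of the weights are trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Stand-in dataset with a plain "text" column; a real run would use a chat
# dataset formatted with the model's chat template, as the article does.
dataset = load_dataset("imdb", split="train[:1%]")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        max_steps=60,
        learning_rate=2e-4,
        output_dir="outputs",
    ),
)
trainer.train()
```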
Crossing Linguistic Horizons: Finetuning and Comprehensive Evaluation of Vietnamese Large Language Models
A Stanford AI Lab blog post about finetuning and comprehensively evaluating Vietnamese Large Language Models.
Can Large Language Models Explain Their Internal Mechanisms?
An Explorable post that visually introduces a new family of interpretability methods called Patchscopes.
FlexAttention: The Flexibility of PyTorch with the Performance of FlashAttention
The PyTorch team provides a flexible API that allows implementing many attention variants in a few lines of idiomatic PyTorch code.
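For a sense of the API, here is a minimal sketch assuming a recent PyTorch build that ships torch.nn.attention.flex_attention; the bias function is a simplified ALiBi-style example rather than the blog's exact code.

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

B, H, S, D = 2, 8, 1024, 64  # batch, heads, sequence length, head dim
q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
           for _ in range(3))

# score_mod edits each attention score given its batch/head/query/key indices;
# here a simplified ALiBi-style distance penalty with a per-head slope.
def alibi_like(score, b, h, q_idx, kv_idx):
    return score - (q_idx - kv_idx) / (h + 1)

# mask_mod returns True where attention is allowed; here standard causal masking.
def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

block_mask = create_block_mask(causal, B, H, S, S)
out = flex_attention(q, k, v, score_mod=alibi_like, block_mask=block_mask)
```

Wrapping flex_attention in torch.compile fuses the score modification into a single FlashAttention-style kernel instead of materializing the full score matrix.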
How to Evaluate Your RAG Using the RAGAs Framework
A guide that shows you how to build a full RAG evaluation pipeline using RAGAs.
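A minimal evaluation run might look like the sketch below, assuming the ragas 0.1-era API and an LLM judge configured through the usual API-key environment variable; the sample row is made up for illustration.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (faithfulness, answer_relevancy,
                           context_precision, context_recall)

# Each row pairs a question with the retrieved contexts, the generated answer,
# and a reference ground-truth answer.
eval_data = {
    "question": ["What does RAG stand for?"],
    "contexts": [["RAG stands for retrieval-augmented generation."]],
    "answer": ["RAG stands for retrieval-augmented generation."],
    "ground_truth": ["Retrieval-augmented generation."],
}
dataset = Dataset.from_dict(eval_data)

# Several metrics are themselves LLM-judged, so an LLM backend must be
# available (e.g. OPENAI_API_KEY set in the environment).
result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)
```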
Multimodal Report Generation (from a Slide Deck)
A cookbook that shows how to build a multimodal report generator for slide decks using LlamaIndex and LlamaParse.
Libraries & Code
Training Sparse Autoencoders on Language Models.
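The sketch below is not the library's API; it is a minimal PyTorch illustration of the underlying technique: an autoencoder trained on cached LLM activations with an L1 penalty that pushes it toward sparse, more interpretable features.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=768, d_features=768 * 8):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x):
        features = torch.relu(self.encoder(x))   # sparse feature activations
        reconstruction = self.decoder(features)  # reconstruct the input activation
        return reconstruction, features

sae = SparseAutoencoder()
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3  # sparsity/reconstruction trade-off (assumed value)

# `activations` would normally be hidden states cached from an LLM forward pass;
# random data stands in here.
activations = torch.randn(4096, 768)

for batch in activations.split(256):
    reconstruction, features = sae(batch)
    loss = ((reconstruction - batch) ** 2).mean() + l1_coeff * features.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```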
A library designed to improve LLMs' ability to use external information by fine-tuning models on specially created RAG-augmented datasets.
Papers & Publications
ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities
Abstract:
Recent advancements in large language models (LLMs) have sparked growing research interest in tool-assisted LLMs that solve real-world challenges, which calls for comprehensive evaluation of tool-use capabilities. While previous works focused on evaluating over stateless web services (RESTful APIs), based on either a single-turn user prompt or an off-policy dialog trajectory, ToolSandbox includes stateful tool execution, implicit state dependencies between tools, a built-in user simulator supporting on-policy conversational evaluation, and a dynamic evaluation strategy for intermediate and final milestones over an arbitrary trajectory. We show that open-source and proprietary models have a significant performance gap, and that complex tasks such as State Dependency, Canonicalization, and Insufficient Information defined in ToolSandbox challenge even the most capable SOTA LLMs, providing brand-new insights into the tool-use capabilities of LLMs.
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
Abstract:
The recent surge of Multimodal Large Language Models (MLLMs) has fundamentally reshaped the landscape of AI research and industry, shedding light on a promising path toward the next AI milestone. However, significant challenges remain preventing MLLMs from being practical in real-world applications. The most notable challenge comes from the huge cost of running an MLLM with a massive number of parameters and extensive computation. As a result, most MLLMs need to be deployed on high-performing cloud servers, which greatly limits their application scopes such as mobile, offline, energy-sensitive, and privacy-protective scenarios. In this work, we present MiniCPM-V, a series of efficient MLLMs deployable on end-side devices. By integrating the latest MLLM techniques in architecture, pretraining and alignment, the latest MiniCPM-Llama3-V 2.5 has several notable features: (1) Strong performance, outperforming GPT-4V-1106, Gemini Pro and Claude 3 on OpenCompass, a comprehensive evaluation over 11 popular benchmarks, (2) strong OCR capability and 1.8M pixel high-resolution image perception at any aspect ratio, (3) trustworthy behavior with low hallucination rates, (4) multilingual support for 30+ languages, and (5) efficient deployment on mobile phones. More importantly, MiniCPM-V can be viewed as a representative example of a promising trend: The model sizes for achieving usable (e.g., GPT-4V) level performance are rapidly decreasing, along with the fast growth of end-side computation capacity. This jointly shows that GPT-4V level MLLMs deployed on end devices are becoming increasingly possible, unlocking a wider spectrum of real-world AI applications in the near future.
T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge
Abstract:
The deployment of Large Language Models (LLMs) on edge devices is increasingly important for enhancing on-device intelligence. Weight quantization is crucial for reducing the memory footprint of LLMs on devices. However, low-bit LLMs necessitate mixed-precision matrix multiplication (mpGEMM) of low-precision weights and high-precision activations during inference. Existing systems, lacking native support for mpGEMM, resort to dequantizing weights for high-precision computation. Such an indirect approach can lead to significant inference overhead.
In this paper, we introduce T-MAC, an innovative lookup table (LUT)-based method designed for efficient low-bit LLM (i.e., weight-quantized LLM) inference on CPUs. T-MAC directly supports mpGEMM without dequantization, while simultaneously eliminating multiplications and reducing the additions required. Specifically, T-MAC transforms traditional data-type-centric multiplication into bit-wise table lookup, and enables a unified and scalable mpGEMM solution.
Our LUT-based kernels scale linearly with the weight bit-width. Evaluated on low-bit Llama and BitNet models, T-MAC demonstrates up to a 4x increase in throughput and a 70% reduction in energy consumption compared to llama.cpp. For BitNet-b1.58-3B, T-MAC delivers a token-generation throughput of 30 tokens/s with a single core and 71 tokens/s with eight cores on M2 Ultra, and 11 tokens/s on lower-end devices like the Raspberry Pi 5, which significantly exceeds the average adult reading speed. T-MAC, with its LUT-based computing paradigm, paves the way for the practical deployment of low-bit LLMs on resource-constrained edge devices without compromising computational efficiency.