Deep Learning Weekly: Issue #323
Kaggle's AI Report 2023, Intro to Real-Time ML, Multimodality and Large Multimodal Models (LMMs), a paper on Rethinking Calibration for In-Context Learning and Prompt Engineering, and many more!
This week in deep learning, we bring you Kaggle's AI Report 2023, An Intro to Real-Time Machine Learning, Chip Huyen's Multimodality and Large Multimodal Models (LMMs), and a paper on Batch Calibration: Rethinking Calibration for In-Context Learning and Prompt Engineering.
You may also enjoy Baichuan raises $300M from Alibaba, Tencent and Xiaomi, Model CI/CD for Enterprise-Grade Production ML, Step-Back Prompting, a paper on Ferret: Refer and Ground Anything Anywhere at Any Granularity, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Kaggle releases AI Report 2023 which provides insights and trends about the field of artificial intelligence based on data from Kaggle.
A 21-year-old University of Nebraska-Lincoln student has used AI to decipher ancient Greek letters inside a sealed scroll from Herculaneum, Italy.
The Replit AI team makes Replit AI available for everyone. Code completion and code assistance are now enabled by default, and available to 23M developers.
Israel-based LLM leader AI21 Labs has closed $155 million in Series C funding to accelerate the growth of its text-based generative AI services for enterprises.
AMD has acquired Nod.ai, an open source AI software provider, to strengthen its efforts in creating an ecosystem of AI development tools, libraries, and models around its hardware.
Google announced that users who have opted-in for its Search Generative Experience (SGE) will be able to create AI images directly from the standard Search bar.
The Beijing-based AI startup Baichuan announced that it has raised more than $300 million in a Series A1 round led by major Chinese technology giants Alibaba Group Holding, Tencent Holdings, and Xiaomi Corp.
MLOps & LLMOps
A post primarily intended for data scientists and machine learning engineers who want to gain a better understanding of the underlying data pipelines to serve features for real-time prediction.
A comprehensive guide on how to build and run LLM evals — and why you should use precision and recall when benchmarking your LLM prompt template.
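As a reminder of the two metrics the guide recommends, the sketch below scores binary eval labels (e.g., an LLM judge's relevant/irrelevant calls) against ground truth. It is an illustrative helper, not code from the guide:

```python
def precision_recall(predicted, actual):
    """Compute precision and recall for binary eval labels (1 = positive).

    precision: of everything flagged positive, how much was right.
    recall: of everything actually positive, how much was caught.
    """
    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
    fn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Tracking both matters when benchmarking a prompt template: a judge that flags everything as relevant scores perfect recall but poor precision, and vice versa.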
An article that explains how to implement an advanced RAG pipeline using embeddings, cache, hybrid search, and ensemble retriever to improve the quality and relevance of text generation.
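One common way to ensemble a sparse retriever (e.g., BM25) with a dense, embedding-based one is reciprocal rank fusion. The sketch below is a generic illustration of that idea, not code from the article, which may use library-provided retriever classes instead:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked doc-ID lists from multiple retrievers into one ranking.

    Each document earns 1 / (k + rank) from every list it appears in;
    k=60 is the constant from the original RRF paper.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Documents favored by both retrievers rise to the top, which is the intuition behind hybrid search beating either retriever alone.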
An article about Step-Back Prompting, a prompting technique enabling LLMs to perform abstractions and derive high-level concepts.
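The technique takes two round trips: first ask the model for the general principle behind the question, then answer the original question conditioned on that principle. A minimal sketch of the two prompts (illustrative wording, not the article's exact templates):

```python
def build_step_back_prompts(question: str):
    """Return (abstraction_prompt, answer_template) for a step-back round trip.

    The abstraction prompt elicits the high-level concept; the answer
    template is filled with the model's response before the second call.
    """
    abstraction_prompt = (
        "Step back and identify the general concept or principle "
        f"needed to answer this question:\n{question}"
    )
    answer_template = (
        "Principle: {principle}\n"
        "Using this principle, answer the original question:\n"
        f"{question}"
    )
    return abstraction_prompt, answer_template
```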
Chip Huyen introduces multimodality, categorizes multimodal tasks, explains influential architectures, and discusses active research areas for large multimodal models.
A blog post that provides a picture of the current state of quantization on mobile (Android) and the opportunities it opens for bringing inference of complex NN models to the edge.
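As a refresher on the basic mechanism such posts build on, symmetric per-tensor int8 quantization maps floats to 8-bit integers via a single scale factor. The sketch below is an illustrative NumPy example, not code from the post:

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization: w ≈ scale * q."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values."""
    return q.astype(np.float32) * scale
```

The rounding error per weight is at most half the scale, which is why int8 inference usually costs little accuracy while quartering memory traffic versus float32.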
An article that showcases evidence for a linear representation of factual truth in large language models.
This article delves into the world of LLMs, exploring what they are, how they work, their impact on various domains, and finally Comet's LLMOps tooling.
Libraries & Code
A curated list of Multimodal Large Language Models (MLLMs), including datasets, multimodal instruction tuning, and others.
A one-stop data processing system to make data higher-quality, juicier, and more digestible for LLMs.
Papers & Publications
We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of understanding spatial referring of any shape or granularity within an image and accurately grounding open-vocabulary descriptions. To unify referring and grounding in the LLM paradigm, Ferret employs a novel and powerful hybrid region representation that integrates discrete coordinates and continuous features jointly to represent a region in the image. To extract the continuous features of versatile regions, we propose a spatial-aware visual sampler, adept at handling varying sparsity across different shapes. Consequently, Ferret can accept diverse region inputs, such as points, bounding boxes, and free-form shapes. To bolster the desired capability of Ferret, we curate GRIT, a comprehensive refer-and-ground instruction tuning dataset including 1.1M samples that contain rich hierarchical spatial knowledge, with 95K hard negative data to promote model robustness. The resulting model not only achieves superior performance in classical referring and grounding tasks, but also greatly outperforms existing MLLMs in region-based and localization-demanded multimodal chatting. Our evaluations also reveal a significantly improved capability of describing image details and a remarkable alleviation in object hallucination.
Prompting and in-context learning (ICL) have become efficient learning paradigms for large language models (LLMs). However, LLMs suffer from prompt brittleness and various bias factors in the prompt, including but not limited to the formatting, the choice of verbalizers, and the ICL examples. To address this problem that results in unexpected performance degradation, calibration methods have been developed to mitigate the effects of these biases while recovering LLM performance. In this work, we first conduct a systematic analysis of the existing calibration methods, where we both provide a unified view and reveal the failure cases. Inspired by these analyses, we propose Batch Calibration (BC), a simple yet intuitive method that controls the contextual bias from the batched input, unifies various prior approaches, and effectively addresses the aforementioned issues. BC is zero-shot, inference-only, and incurs negligible additional costs. In the few-shot setup, we further extend BC to allow it to learn the contextual bias from labeled data. We validate the effectiveness of BC with PaLM 2-(S, M, L) and CLIP models and demonstrate state-of-the-art performance over previous calibration baselines across more than 10 natural language understanding and image classification tasks.
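The core idea can be sketched in a few lines: estimate the contextual bias as the mean per-class score over the batch and subtract it before taking the argmax. This is a simplified illustration of controlling contextual bias from batched input, not the authors' implementation:

```python
import numpy as np

def batch_calibrate(class_scores):
    """Subtract the batch-mean score per class before prediction.

    class_scores: array of shape (batch, num_classes), e.g. label log-probs.
    Classes the prompt systematically favors get their advantage removed.
    """
    bias = class_scores.mean(axis=0, keepdims=True)
    return class_scores - bias
```

A batch whose raw scores all favor class 0 can flip to more balanced predictions once the shared bias is removed, which matches the method's zero-shot, inference-only framing.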
In long context scenarios, large language models (LLMs) face three main challenges: higher computational and financial cost, longer latency, and inferior performance. Some studies reveal that the performance of LLMs depends on both the density and the position of the key (question-relevant) information in the input prompt. Inspired by these findings, we propose LongLLMLingua, a prompt compression method that improves LLMs' perception of the key information and simultaneously addresses all three challenges. We conduct evaluation on a wide range of long context scenarios including single-/multi-document QA, few-shot learning, summarization, synthetic tasks, and code completion. The experimental results show that prompts compressed by LongLLMLingua yield higher performance at much lower cost, and end-to-end latency is also reduced. For example, on the NaturalQuestions benchmark, LongLLMLingua gains a performance boost of up to 17.1% over the original prompt with ~4x fewer tokens as input to GPT-3.5-Turbo. It can derive cost savings of $28.5 and $27.4 per 1,000 samples on the LongBench and ZeroScrolls benchmarks, respectively. Additionally, when compressing prompts of ~10k tokens at a compression rate of 2x-10x, LongLLMLingua can speed up end-to-end latency by 1.4x-3.8x.