Deep Learning Weekly: Issue 332
DeepMind's FunSearch, Advanced RAG Techniques: An Illustrated Overview, Building a Million-Parameter LLM from Scratch, a paper on CogAgent: A Visual Language Model for GUI Agents, and many more!
This week in deep learning, we bring you DeepMind's FunSearch, Advanced RAG Techniques: An Illustrated Overview, Building a Million-Parameter LLM from Scratch Using Python, and a paper on CogAgent: A Visual Language Model for GUI Agents.
You may also enjoy Intel's new AI chip, A Unified Evaluation Framework for LLMs, a paper on EdgeSAM: Prompt-In-the-Loop Distillation for On-Device Deployment of SAM, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Intel unveils new AI chip to compete with Nvidia and AMD
Intel unveiled new computer chips, including Gaudi3, an AI chip for generative AI use cases.
FunSearch: Making new discoveries in mathematical sciences using Large Language Models
Google DeepMind introduces a method that, for the first time, uses LLMs to make new discoveries on challenging open problems in science and mathematics.
Patronus AI finds 'alarming' safety gaps in leading AI systems
Patronus AI, a startup focused on responsible AI deployment, has released SimpleSafetyTests, a new diagnostic test suite that helps identify critical safety risks in large language models.
Lightmatter raises $155M for photonic computing at $1.2B valuation
Lightmatter, a startup building photonic computing products, has raised $155 million in additional funding, increasing the company’s valuation to $1.2 billion.
TomTom and Microsoft develop in-vehicle AI voice assistant
Digital maps and location tech specialist TomTom has partnered with Microsoft to develop an AI voice assistant for vehicles.
Imagen 2 on Vertex AI is now generally available
Google made a significant upgrade to Google Cloud’s image-generation capabilities with Imagen 2, which is now generally available to allowlisted Vertex AI customers.
MLOps & LLMOps
LlamaIndex: RAG Evaluation Showdown with GPT-4 vs. Open-Source Prometheus Model
A blog post that demonstrates how to use the Prometheus model for evaluation within the LlamaIndex framework, comparing its judgments against GPT-4-based evaluation.
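For a feel of the comparison the post runs, here is a minimal, framework-agnostic sketch of the LLM-as-judge pattern; the `prometheus_judge` and `gpt4_judge` callables are hypothetical stand-ins rather than the post's actual LlamaIndex code.

```python
# Minimal sketch of LLM-as-judge evaluation with two judges (hypothetical judges,
# not the post's exact LlamaIndex integration).
JUDGE_PROMPT = """You are grading an answer to a question against a reference.
Question: {question}
Reference answer: {reference}
Candidate answer: {answer}
Return a single integer score from 1 (poor) to 5 (excellent)."""

def score(judge, question, answer, reference):
    """`judge` is any callable that maps a prompt string to a text completion."""
    reply = judge(JUDGE_PROMPT.format(question=question, reference=reference, answer=answer))
    return int(reply.strip().split()[0])  # assumes the judge leads with the score

def compare_judges(prometheus_judge, gpt4_judge, eval_set):
    """Measure how often the open-source judge agrees with the GPT-4 judge."""
    agreements = 0
    for item in eval_set:  # each item: {"question", "answer", "reference"}
        s_open = score(prometheus_judge, item["question"], item["answer"], item["reference"])
        s_gpt4 = score(gpt4_judge, item["question"], item["answer"], item["reference"])
        agreements += int(s_open == s_gpt4)
    return agreements / len(eval_set)
```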
Advanced RAG Techniques: an Illustrated Overview
A comprehensive study of advanced retrieval-augmented generation (RAG) techniques and algorithms.
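As a taste of one technique such overviews cover, here is a minimal sketch of multi-query retrieval with reciprocal-rank fusion; the `llm`, `embed`, and `vector_index` objects are hypothetical placeholders.

```python
# Sketch of one advanced-RAG pattern: multi-query retrieval with reciprocal-rank
# fusion (RRF). All components here are hypothetical placeholders.
from collections import defaultdict

def multi_query_retrieve(question, llm, embed, vector_index, n_rewrites=3, top_k=5):
    # 1) Ask the LLM for paraphrases of the question to widen recall.
    rewrites = llm(f"Rewrite this question {n_rewrites} ways, one per line:\n{question}").splitlines()
    queries = [question] + [q.strip() for q in rewrites if q.strip()][:n_rewrites]

    # 2) Retrieve per query and fuse the ranked lists with RRF.
    scores = defaultdict(float)
    for q in queries:
        for rank, doc_id in enumerate(vector_index.search(embed(q), top_k)):
            scores[doc_id] += 1.0 / (60 + rank)  # RRF with the usual k=60 constant

    # 3) Return the fused top-k document ids.
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```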
Efficient Vector Similarity Search in Recommender Workflows Using Milvus with NVIDIA Merlin
An introductory blog post that demonstrates how Milvus works with the Merlin Recsys framework at training and inference time.
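For context, the vector-search side of such a workflow might look like the following minimal pymilvus sketch; the collection name, dimensions, and random embeddings are placeholders for vectors exported from a Merlin-trained two-tower model.

```python
# Minimal pymilvus sketch of indexing and searching item embeddings (made-up
# collection name and sizes; real vectors would come from a Merlin item tower).
import numpy as np
from pymilvus import Collection, CollectionSchema, DataType, FieldSchema, connections

connections.connect(host="localhost", port="19530")

schema = CollectionSchema([
    FieldSchema("item_id", DataType.INT64, is_primary=True, auto_id=False),
    FieldSchema("embedding", DataType.FLOAT_VECTOR, dim=64),
])
items = Collection("merlin_item_embeddings", schema)

# Insert item embeddings exported from the trained recommender.
ids = list(range(1000))
vectors = np.random.rand(1000, 64).astype("float32")
items.insert([ids, vectors])

items.create_index("embedding", {"index_type": "IVF_FLAT", "metric_type": "IP", "params": {"nlist": 128}})
items.load()

# At serving time, search with a user-tower embedding to retrieve candidate items.
query = np.random.rand(1, 64).astype("float32")
hits = items.search(query, "embedding", {"metric_type": "IP", "params": {"nprobe": 16}}, limit=10)
print([h.id for h in hits[0]])
```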
Learning
Building a Million-Parameter LLM from Scratch Using Python
A step-by-step guide for replicating the LLaMA 1 architecture from scratch, using a basic dataset and a minimal GPU.
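For a sense of scale, here is a rough PyTorch sketch of a LLaMA-style decoder-only model at that size (RMSNorm, causal self-attention, SwiGLU feed-forward); the hyperparameters are illustrative, not the guide's exact values.

```python
# Tiny LLaMA-style language model sketch (illustrative sizes, not the guide's code).
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class Block(nn.Module):
    def __init__(self, dim=64, n_heads=4, hidden=176):
        super().__init__()
        self.attn_norm = RMSNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.mlp_norm = RMSNorm(dim)
        self.w1 = nn.Linear(dim, hidden, bias=False)
        self.w2 = nn.Linear(hidden, dim, bias=False)
        self.w3 = nn.Linear(dim, hidden, bias=False)

    def forward(self, x):
        h = self.attn_norm(x)
        causal = torch.triu(torch.ones(x.size(1), x.size(1), dtype=torch.bool, device=x.device), diagonal=1)
        x = x + self.attn(h, h, h, attn_mask=causal, need_weights=False)[0]
        h = self.mlp_norm(x)
        return x + self.w2(nn.functional.silu(self.w1(h)) * self.w3(h))  # SwiGLU feed-forward

class TinyLM(nn.Module):
    def __init__(self, vocab=4096, dim=64, n_layers=4, max_len=256):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)
        self.pos = nn.Embedding(max_len, dim)  # learned positions instead of RoPE, for brevity
        self.blocks = nn.ModuleList(Block(dim) for _ in range(n_layers))
        self.norm = RMSNorm(dim)
        self.head = nn.Linear(dim, vocab, bias=False)

    def forward(self, idx):
        x = self.tok(idx) + self.pos(torch.arange(idx.size(1), device=idx.device))
        for blk in self.blocks:
            x = blk(x)
        return self.head(self.norm(x))

model = TinyLM()
print(sum(p.numel() for p in model.parameters()))  # about 0.74M parameters with these sizes
```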
Training Production AI Models with PyTorch 2.0
A blog that demonstrates how PyTorch 2.0 significantly accelerates the training of large and complex production AI models with reasonable compilation time.
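The core opt-in is a single call to `torch.compile`; a minimal training-step sketch with a toy model (not the production models discussed in the post) looks like this.

```python
# Minimal torch.compile training loop (PyTorch 2.x); toy model and random data.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, 10))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

compiled_model = torch.compile(model)  # one-line opt-in; the first step pays the compilation cost

for step in range(10):
    x = torch.randn(32, 512)
    y = torch.randint(0, 10, (32,))
    optimizer.zero_grad(set_to_none=True)
    loss = loss_fn(compiled_model(x), y)
    loss.backward()
    optimizer.step()
```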
Modular Deep Learning
An overview of modular deep learning across four dimensions (computation function, routing function, aggregation function, and training setting).
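As one concrete point in that design space, a bottleneck adapter is a common choice of computation function; the sketch below is illustrative, with made-up dimensions.

```python
# Illustrative bottleneck adapter: a small trainable "computation function"
# added to an otherwise frozen backbone layer.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, hidden):
        # Residual bottleneck transformation; only these weights are trained,
        # while the surrounding backbone stays frozen.
        return hidden + self.up(torch.relu(self.down(hidden)))
```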
Libraries & Code
An array framework for machine learning on Apple silicon, brought to you by Apple machine learning research.
LLM orchestration framework to build customizable, production-ready LLM applications.
A unified evaluation framework for large language models.
Papers & Publications
CogAgent: A Visual Language Model for GUI Agents
Abstract:
People are spending an enormous amount of time on digital devices through graphical user interfaces (GUIs), e.g., computer or smartphone screens. Large language models (LLMs) such as ChatGPT can assist people in tasks like writing emails, but struggle to understand and interact with GUIs, thus limiting their potential to increase automation levels. In this paper, we introduce CogAgent, an 18-billion-parameter visual language model (VLM) specializing in GUI understanding and navigation. By utilizing both low-resolution and high-resolution image encoders, CogAgent supports input at a resolution of 1120×1120, enabling it to recognize tiny page elements and text. As a generalist visual language model, CogAgent achieves the state of the art on five text-rich and four general VQA benchmarks, including VQAv2, OK-VQA, Text-VQA, ST-VQA, ChartQA, infoVQA, DocVQA, MM-Vet, and POPE. CogAgent, using only screenshots as input, outperforms LLM-based methods that consume extracted HTML text on both PC and Android GUI navigation tasks (Mind2Web and AITW), advancing the state of the art.
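As a loose illustration of the dual-resolution idea only (not the paper's actual architecture), low-resolution tokens can attend to high-resolution features via cross-attention; all module names and shapes below are hypothetical.

```python
# Hypothetical sketch of fusing a low-resolution and a high-resolution visual
# stream with cross-attention (illustrative only, not CogAgent's architecture).
import torch
import torch.nn as nn

class DualResolutionFusion(nn.Module):
    def __init__(self, dim=256, n_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, low_res_tokens, high_res_tokens):
        # Low-resolution tokens query the high-resolution features, so small UI
        # elements and text can still influence the fused representation.
        fused, _ = self.cross_attn(low_res_tokens, high_res_tokens, high_res_tokens)
        return low_res_tokens + fused

fusion = DualResolutionFusion()
low = torch.randn(1, 64, 256)     # e.g. tokens from a low-resolution encoder
high = torch.randn(1, 1024, 256)  # e.g. tokens from a high-resolution encoder
print(fusion(low, high).shape)    # torch.Size([1, 64, 256])
```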
EdgeSAM: Prompt-In-the-Loop Distillation for On-Device Deployment of SAM
Abstract:
This paper presents EdgeSAM, an accelerated variant of the Segment Anything Model (SAM), optimized for efficient execution on edge devices with minimal compromise in performance. Our approach involves distilling the original ViT-based SAM image encoder into a purely CNN-based architecture, better suited for edge devices. We carefully benchmark various distillation strategies and demonstrate that task-agnostic encoder distillation fails to capture the full knowledge embodied in SAM. To overcome this bottleneck, we include both the prompt encoder and mask decoder in the distillation process, with box and point prompts in the loop, so that the distilled model can accurately capture the intricate dynamics between user input and mask generation. To mitigate dataset bias issues stemming from point prompt distillation, we incorporate a lightweight module within the encoder. EdgeSAM achieves a 40-fold speed increase compared to the original SAM, and it also outperforms MobileSAM, being 14 times as fast when deployed on edge devices while enhancing the mIoUs on COCO and LVIS by 2.3 and 3.2 respectively. It is also the first SAM variant that can run at over 30 FPS on an iPhone 14.
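A conceptual sketch of what prompt-in-the-loop distillation might look like is shown below; the teacher/student interfaces are hypothetical and do not reflect EdgeSAM's actual code.

```python
# Conceptual prompt-in-the-loop distillation step (hypothetical teacher/student
# interfaces): the student is supervised on masks produced with sampled box/point
# prompts, not only on encoder features.
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, images, sample_prompts, optimizer):
    with torch.no_grad():
        t_feats = teacher.encode(images)            # frozen ViT-based teacher encoder
    s_feats = student.encode(images)                # CNN-based student encoder

    prompts = sample_prompts(images)                # boxes / points kept in the loop
    with torch.no_grad():
        t_masks = teacher.decode(t_feats, prompts)  # teacher masks for those prompts
    s_masks = student.decode(s_feats, prompts)      # student masks for the same prompts

    # Combine a feature-level term with a prompt-conditioned mask term.
    loss = F.mse_loss(s_feats, t_feats) + F.binary_cross_entropy_with_logits(s_masks, t_masks.sigmoid())
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    return loss.item()
```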
VideoPoet: A Large Language Model for Zero-Shot Video Generation
Abstract:
VideoPoet is a simple modeling method that can convert any autoregressive language model or large language model (LLM) into a high-quality video generator. It contains a few simple components:
A pre-trained MAGVIT V2 video tokenizer and a SoundStream audio tokenizer transform images, video, and audio clips with variable lengths into a sequence of discrete codes in a unified vocabulary. These codes are compatible with text-based language models, facilitating an integration with other modalities, such as text.
An autoregressive language model learns across video, image, audio, and text modalities to autoregressively predict the next video or audio token in the sequence.
A mixture of multimodal generative learning objectives is introduced into the LLM training framework, including text-to-video, text-to-image, image-to-video, video frame continuation, video inpainting and outpainting, video stylization, and video-to-audio. Furthermore, such tasks can be composed together for additional zero-shot capabilities (e.g., text-to-audio).
This simple recipe shows that language models can synthesize and edit videos with a high degree of temporal consistency. VideoPoet demonstrates state-of-the-art video generation, in particular in producing a wide range of large, interesting, and high-fidelity motions. The VideoPoet model supports generating videos in square or portrait orientation to tailor generations toward short-form content, and it also supports audio generation from a video input.
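As a rough illustration of the unified-vocabulary idea (with hypothetical tokenizers and offsets, not VideoPoet's actual code), modality tokens can be mapped into disjoint ID ranges and concatenated into one sequence for next-token prediction.

```python
# Hypothetical unified-vocabulary construction: text, video, and audio codes are
# shifted into disjoint ID ranges so one autoregressive LM sees a single sequence.
TEXT_VOCAB, VIDEO_VOCAB, AUDIO_VOCAB = 32_000, 262_144, 4_096
VIDEO_OFFSET = TEXT_VOCAB
AUDIO_OFFSET = TEXT_VOCAB + VIDEO_VOCAB

def build_sequence(text_ids, video_codes, audio_codes):
    """Concatenate modality tokens into one stream for next-token prediction."""
    video_ids = [VIDEO_OFFSET + c for c in video_codes]  # e.g. MAGVIT-v2-style video codes
    audio_ids = [AUDIO_OFFSET + c for c in audio_codes]  # e.g. SoundStream-style audio codes
    return text_ids + video_ids + audio_ids

# Usage: a text-to-video example places the text prefix first and trains the LM to
# predict the video (and optionally audio) tokens that follow.
seq = build_sequence(text_ids=[17, 250, 3], video_codes=[5, 900, 42], audio_codes=[7, 7])
print(len(seq), max(seq) < TEXT_VOCAB + VIDEO_VOCAB + AUDIO_VOCAB)
```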