Deep Learning Weekly: Issue 332
DeepMind's FunSearch, Advanced RAG Techniques: An Illustrated Overview, Building a Million-Parameter LLM from Scratch, a paper on CogAgent: A Visual Language Model for GUI Agents, and many more!
This week in deep learning, we bring you DeepMind's FunSearch, Advanced RAG Techniques: An Illustrated Overview, Building a Million-Parameter LLM from Scratch Using Python, and a paper on CogAgent: A Visual Language Model for GUI Agents.
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Intel unveiled new computer chips including Gaudi3, an AI chip for generative AI use cases.
Google DeepMind introduced FunSearch, a method that marks the first time LLMs have been used to make new discoveries for challenging open problems in science and mathematics.
Patronus AI, a startup focused on responsible AI deployment, has released a new diagnostic test suite called SimpleSafetyTests to help identify critical safety risks in large language models.
Lightmatter, a startup building photonic computing products, has raised $155 million in additional funding, increasing the company’s valuation to $1.2 billion.
Digital maps and location tech specialist TomTom has partnered with Microsoft to develop an AI voice assistant for vehicles.
Google made a significant upgrade to Google Cloud’s image-generation capabilities with Imagen 2, which is now generally available to allowlisted Vertex AI customers.
MLOps & LLMOps
A blog post that demonstrates how to effectively use the Prometheus model for evaluation purposes, integrating it smoothly with the LlamaIndex framework by comparing it with GPT-4 evaluation.
A comprehensive study of the advanced retrieval augmented generation techniques and algorithms.
An introductory blog post that demonstrates how Milvus works with the Merlin Recsys framework at training and inference time.
A step-by-step guide to replicating the LLaMA 1 architecture from scratch, using a basic dataset and a minimal GPU.
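One component that distinguishes the LLaMA architecture from the original Transformer is RMSNorm, which normalizes by the root-mean-square of the activations rather than centering and scaling as LayerNorm does. A minimal NumPy sketch (illustrative only; the guide's own code and dimensions will differ):

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # LLaMA-style RMSNorm: scale by the root-mean-square of the
    # activations (no mean subtraction, unlike LayerNorm).
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * weight

# Toy usage: a single token embedding of dimension 4.
x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.ones(4)      # learned gain, initialized to ones
y = rms_norm(x, w)  # normalized so mean(y**2) is approximately 1
```

Because there is no mean subtraction, RMSNorm is slightly cheaper than LayerNorm while behaving similarly in practice.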
A blog that demonstrates how PyTorch 2.0 significantly accelerates the training of large and complex production AI models with reasonable compilation time.
An overview of modular deep learning across four dimensions (computation function, routing function, aggregation function, and training setting).
Libraries & Code
An array framework for machine learning on Apple silicon, brought to you by Apple machine learning research.
LLM orchestration framework to build customizable, production-ready LLM applications.
A unified evaluation framework for large language models.
Papers & Publications
People are spending an enormous amount of time on digital devices through graphical user interfaces (GUIs), e.g., computer or smartphone screens. Large language models (LLMs) such as ChatGPT can assist people in tasks like writing emails, but struggle to understand and interact with GUIs, thus limiting their potential to increase automation levels. In this paper, we introduce CogAgent, an 18-billion-parameter visual language model (VLM) specializing in GUI understanding and navigation. By utilizing both low-resolution and high-resolution image encoders, CogAgent supports input at a resolution of 1120×1120, enabling it to recognize tiny page elements and text. As a generalist visual language model, CogAgent achieves the state of the art on five text-rich and four general VQA benchmarks, including VQAv2, OK-VQA, Text-VQA, ST-VQA, ChartQA, infoVQA, DocVQA, MM-Vet, and POPE. CogAgent, using only screenshots as input, outperforms LLM-based methods that consume extracted HTML text on both PC and Android GUI navigation tasks -- Mind2Web and AITW, advancing the state of the art.
This paper presents EdgeSAM, an accelerated variant of the Segment Anything Model (SAM), optimized for efficient execution on edge devices with minimal compromise in performance. Our approach involves distilling the original ViT-based SAM image encoder into a purely CNN-based architecture, better suited for edge devices. We carefully benchmark various distillation strategies and demonstrate that task-agnostic encoder distillation fails to capture the full knowledge embodied in SAM. To overcome this bottleneck, we include both the prompt encoder and mask decoder in the distillation process, with box and point prompts in the loop, so that the distilled model can accurately capture the intricate dynamics between user input and mask generation. To mitigate dataset bias issues stemming from point prompt distillation, we incorporate a lightweight module within the encoder. EdgeSAM achieves a 40-fold speed increase compared to the original SAM, and it also outperforms MobileSAM, being 14 times as fast when deployed on edge devices while enhancing the mIoUs on COCO and LVIS by 2.3 and 3.2 respectively. It is also the first SAM variant that can run at over 30 FPS on an iPhone 14.
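The core of EdgeSAM's prompt-in-the-loop distillation is training the student to reproduce the teacher's mask predictions for the same prompts. The toy linear "mask heads" below are illustrative stand-ins for SAM's encoder/decoder, not EdgeSAM's architecture; the sketch only shows the teacher-matching update:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "teacher" (frozen) and "student" mask heads: each maps a
# prompt embedding to a vector of mask logits. Purely illustrative.
teacher_W = rng.normal(size=(8, 16))
student_W = np.zeros((8, 16))

def mask_logits(W, prompt):
    return prompt @ W  # (16,) mask logits for this prompt

lr = 0.05
for step in range(300):
    prompt = rng.normal(size=8)          # a box/point prompt embedding
    t = mask_logits(teacher_W, prompt)   # frozen teacher target
    s = mask_logits(student_W, prompt)   # student prediction
    grad = np.outer(prompt, s - t)       # gradient of 0.5 * ||s - t||^2
    student_W -= lr * grad               # distillation update

# After training, the student should closely approximate the teacher.
err = np.linalg.norm(student_W - teacher_W) / np.linalg.norm(teacher_W)
```

Keeping prompts in the loop like this is what lets the student learn the interaction between user input and mask output, rather than only imitating the encoder.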
VideoPoet is a simple modeling method that can convert any autoregressive language model or large language model (LLM) into a high-quality video generator. It contains a few simple components:
A pre-trained MAGVIT V2 video tokenizer and a SoundStream audio tokenizer transform images, video, and audio clips with variable lengths into a sequence of discrete codes in a unified vocabulary. These codes are compatible with text-based language models, facilitating an integration with other modalities, such as text.
An autoregressive language model learns across video, image, audio, and text modalities to autoregressively predict the next video or audio token in the sequence.
A mixture of multimodal generative learning objectives are introduced into the LLM training framework, including text-to-video, text-to-image, image-to-video, video frame continuation, video inpainting and outpainting, video stylization, and video-to-audio. Furthermore, such tasks can be composed together for additional zero-shot capabilities (e.g., text-to-audio).
This simple recipe shows that language models can synthesize and edit videos with a high degree of temporal consistency. VideoPoet demonstrates state-of-the-art video generation, in particular in producing a wide range of large, interesting, and high-fidelity motions. The VideoPoet model supports generating videos in square or portrait orientation to tailor generations toward short-form content, as well as generating audio from a video input.
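The unified-vocabulary idea above can be sketched as offsetting each modality's discrete codes into disjoint ID ranges, so one autoregressive LM can consume text, video, and audio tokens in a single sequence. The vocabulary sizes below are made-up placeholders, not VideoPoet's actual numbers:

```python
# Hypothetical per-tokenizer vocabulary sizes (not VideoPoet's real values).
TEXT_VOCAB = 32000    # text tokenizer codes
VIDEO_VOCAB = 8192    # MAGVIT-v2-style visual codes
AUDIO_VOCAB = 4096    # SoundStream-style audio codes

# Offsets place each modality in a disjoint slice of one shared ID space.
TEXT_OFFSET = 0
VIDEO_OFFSET = TEXT_OFFSET + TEXT_VOCAB
AUDIO_OFFSET = VIDEO_OFFSET + VIDEO_VOCAB

def to_unified(modality, code):
    offset = {"text": TEXT_OFFSET, "video": VIDEO_OFFSET, "audio": AUDIO_OFFSET}[modality]
    return offset + code

def from_unified(token_id):
    if token_id < VIDEO_OFFSET:
        return "text", token_id - TEXT_OFFSET
    if token_id < AUDIO_OFFSET:
        return "video", token_id - VIDEO_OFFSET
    return "audio", token_id - AUDIO_OFFSET

# A text-to-video prompt becomes text tokens followed by video tokens,
# which the LM predicts left to right in the shared ID space.
sequence = [to_unified("text", 5), to_unified("video", 17), to_unified("audio", 3)]
```

Because every modality lives in one token space, composing tasks (e.g., text-to-audio) reduces to arranging which token ranges appear as context and which are predicted.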
Thanks for reading Deep Learning Weekly! Subscribe for free to receive new posts and support my work.