Deep Learning Weekly: Issue 360
Gemma 2, From bare metal to a 70B model: infrastructure set-up and scripts, Evaluating Open LLMs with MixEval, a paper on Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs, and more!
This week in deep learning, we bring you Gemma 2 is now available to researchers and developers, From bare metal to a 70B model: infrastructure set-up and scripts, Evaluating Open LLMs with MixEval, and a paper on Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs.
You may also enjoy Anthropic's new initiative for developing third-party model evaluations, Step-by-Step Guide to Choosing the Best Embedding Model for Your Application, a paper on WorkBench: a Benchmark Dataset for Agents in a Realistic Workplace Setting, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Gemma 2 is now available to researchers and developers
Google releases Gemma 2, a higher-performing and more efficient version of its open model family with significant safety advancements built in, to researchers and developers globally.
Intel Demonstrates First Fully Integrated Optical I/O Chiplet
Intel demonstrates the first fully integrated optical I/O chiplet, which is expected to revolutionize high-speed data processing for AI infrastructure.
A new initiative for developing third-party model evaluations
Anthropic announces a new initiative to source evaluations for measuring advanced model capabilities and outlines the specific types of evaluations it is prioritizing.
Samsung backs ‘world’s most powerful’ AI chip for edge devices
Eindhoven-based startup Axelera has raised $68 million as it looks to take its AI chip business global. One of the lead investors is Samsung Catalyst, the venture arm of semiconductor giant Samsung Electronics.
Open-source AI platform Sentient raises $85M
San Francisco-based open-source AI development platform Sentient announced that it has raised $85 million in a seed funding round.
MLOps & LLMOps
From bare metal to a 70B model: infrastructure set-up and scripts
The team at Imbue shares an end-to-end guide for setting up the required infrastructure for training a high-performing 70B model from scratch.
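Guides like this lean heavily on automated health checks before any training run begins. As a flavor of that step, here is a minimal sketch, not Imbue's actual scripts, of an interconnect sanity check built on torch.distributed; the file name and launch command are illustrative.

```python
import os

import torch
import torch.distributed as dist

# Minimal interconnect sanity check: every rank contributes a ones-tensor and
# verifies that the all-reduced sum equals the world size.
# Launch (single node, 8 GPUs): torchrun --nproc_per_node=8 allreduce_check.py
def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)
    x = torch.ones(1, device="cuda")
    dist.all_reduce(x)  # defaults to a sum across all ranks
    assert x.item() == dist.get_world_size(), f"rank {dist.get_rank()}: bad all-reduce"
    if dist.get_rank() == 0:
        print(f"all-reduce OK across {dist.get_world_size()} ranks")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```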
Constructing knowledge graphs from text using OpenAI functions
A tutorial that explores how to construct a knowledge graph from unstructured text using OpenAI functions in combination with LangChain.
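To make the pattern concrete, here is a minimal sketch of the general approach rather than the tutorial's exact code; it assumes the LLMGraphTransformer from the langchain-experimental package, an OPENAI_API_KEY in the environment, and the illustrative sentence below.

```python
from langchain_core.documents import Document
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_openai import ChatOpenAI

# An OpenAI chat model with function calling produces structured graph output.
llm = ChatOpenAI(model="gpt-4o", temperature=0)  # assumes OPENAI_API_KEY is set
transformer = LLMGraphTransformer(llm=llm)

text = "Marie Curie, born in Warsaw, won the Nobel Prize in Physics in 1903."
graph_docs = transformer.convert_to_graph_documents([Document(page_content=text)])

# Each GraphDocument holds the extracted nodes and typed relationships.
for node in graph_docs[0].nodes:
    print(node.id, node.type)
for rel in graph_docs[0].relationships:
    print(rel.source.id, rel.type, rel.target.id)
```

Under the hood, function calling forces the model's output into a typed schema of nodes and relationships, which is what makes the extraction structured enough to load into a graph database.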
Step-by-Step Guide to Choosing the Best Embedding Model for Your Application
A step-by-step guide to evaluating candidate embedding models and selecting the one best suited to your application.
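In practice, the deciding step is usually a small retrieval benchmark on your own data. Below is a hedged sketch of that idea; the two sentence-transformers checkpoints and the toy query/passage pairs are illustrative, not taken from the guide.

```python
import torch
from sentence_transformers import SentenceTransformer, util

# Toy evaluation set: the correct passage for queries[i] is passages[i].
queries = ["How do I reset my password?", "What is the refund policy?"]
passages = [
    "To reset your password, open Settings and choose 'Forgot password'.",
    "Refunds are issued within 14 days of purchase.",
]

# Candidate checkpoints to compare; any sentence-transformers models work here.
for name in ["all-MiniLM-L6-v2", "all-mpnet-base-v2"]:
    model = SentenceTransformer(name)
    q = model.encode(queries, normalize_embeddings=True)
    p = model.encode(passages, normalize_embeddings=True)
    top1 = util.cos_sim(q, p).argmax(dim=-1)  # best-scoring passage per query
    accuracy = (top1 == torch.arange(len(queries))).float().mean()
    print(f"{name}: top-1 retrieval accuracy {accuracy:.2f}")
```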
Learning
Secure LLM Tokenizers to Maintain Application Integrity
A blog post that presents a weakness in a tokenizer implementation that would enable sufficiently privileged attackers to compromise application integrity.
Optimizing Sentence Transformers for Entity Resolution
A blog post that discusses how ML developers at Fetch use sentence transformers, transformer-based models that map text to dense vector embeddings, to streamline the process of entity resolution.
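The core idea can be sketched in a few lines; this is an illustration of the technique, not Fetch's production pipeline, and the product strings and 0.8 threshold below are made up.

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical receipt strings that should resolve to the same product entity.
records = [
    "Coca-Cola Classic 12oz can",
    "COKE CLASSIC 12 OZ CAN",
    "Pepsi 12oz can",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # any pretrained checkpoint works
embeddings = model.encode(records, normalize_embeddings=True)

# Pairs above a tuned similarity threshold (0.8 is an arbitrary illustration)
# are treated as candidate matches for the same entity.
scores = util.cos_sim(embeddings, embeddings)
for i in range(len(records)):
    for j in range(i + 1, len(records)):
        if scores[i][j] > 0.8:
            print(f"match ({float(scores[i][j]):.2f}): {records[i]!r} <-> {records[j]!r}")
```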
Beyond MatMul: The New Frontier of LLMs with 10x Efficiency
Devansh breaks down the core ideas and techniques from the paper called “Scalable MatMul-free Language Modeling”.
How to Fine-Tune LLMs on Custom Datasets at Scale using Qwak and Comet
Lesson 7 of 11 in a free course series, LLM Twin: Building Your Production-Ready AI Replica, in which you’ll learn to use LLMs, vector DBs, and LLMOps to design, train, and deploy a production-ready “LLM twin” of yourself.
Evaluating Open LLMs with MixEval: The Closest Benchmark to LMSYS Chatbot Arena
An article that discusses MixEval, a benchmark that bridges the gap between real-world user queries and ground-truth-based benchmarks for evaluating large language models.
5 Open-Source Computer Vision Libraries You Need to Know
A blog post featuring five open-source computer vision libraries that make training a computer vision model easier.
Training MoEs at Scale with PyTorch
A blog post about how to scale to over 3000 GPUs using PyTorch Distributed and MegaBlocks, an efficient open-source MoE implementation in PyTorch.
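For intuition about what such libraries optimize, below is a deliberately naive top-k-routed MoE layer in plain PyTorch; all names are illustrative, not from the post.

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """A deliberately naive top-k-routed mixture-of-experts FFN block."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = x.reshape(-1, x.shape[-1])                 # (n_tokens, d_model)
        probs = self.router(tokens).softmax(dim=-1)         # routing probabilities
        topk_p, topk_idx = probs.topk(self.k, dim=-1)       # keep k experts per token
        topk_p = topk_p / topk_p.sum(dim=-1, keepdim=True)  # renormalize kept weights
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            hit = (topk_idx == e)                           # which tokens chose expert e
            rows, slots = hit.nonzero(as_tuple=True)
            if rows.numel():
                out[rows] += topk_p[rows, slots].unsqueeze(-1) * expert(tokens[rows])
        return out.reshape(x.shape)

# Usage: a drop-in replacement for a dense FFN block.
layer = TopKMoE(d_model=64, d_ff=256)
y = layer(torch.randn(2, 10, 64))  # (batch, seq_len, d_model)
```

A production implementation replaces the per-expert Python loop with grouped or block-sparse matrix multiplies so that experts with uneven token counts still saturate the GPU; that is the gap MegaBlocks closes.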
Libraries & Code
LIDA: Automatic Generation of Visualizations and Infographics using Large Language Models.
RAGFlow is an open-source RAG engine based on deep document understanding.
RAFT: fundamental, widely used algorithms and primitives for machine learning and information retrieval.
Papers & Publications
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Abstract:
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. While stronger language models can enhance multimodal capabilities, the design choices for vision components are often insufficiently explored and disconnected from visual representation learning research. This gap hinders accurate sensory grounding in real-world scenarios. Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations, offering new insights into different models and architectures -- self-supervised, strongly supervised, or combinations thereof -- based on experiments with over 20 vision encoders. We critically examine existing MLLM benchmarks, addressing the difficulties involved in consolidating and interpreting results from various tasks, and introduce a new vision-centric benchmark, CV-Bench. To further improve visual grounding, we propose the Spatial Vision Aggregator (SVA), a dynamic and spatially-aware connector that integrates high-resolution vision features with LLMs while reducing the number of tokens. Additionally, we discuss the curation of high-quality visual instruction-tuning data from publicly available sources, emphasizing the importance of data source balancing and distribution ratio. Collectively, Cambrian-1 not only achieves state-of-the-art performance but also serves as a comprehensive, open cookbook for instruction-tuned MLLMs. We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes. We hope our release will inspire and accelerate advancements in multimodal systems and visual representation learning.
ExVideo: Extending Video Diffusion Models via Parameter-Efficient Post-Tuning
Abstract:
Recently, advancements in video synthesis have attracted significant attention. Video synthesis models such as AnimateDiff and Stable Video Diffusion have demonstrated the practical applicability of diffusion models in creating dynamic visual content. The emergence of SORA has further spotlighted the potential of video generation technologies. Nonetheless, the extension of video lengths has been constrained by the limitations in computational resources. Most existing video synthesis models can only generate short video clips. In this paper, we propose a novel post-tuning methodology for video synthesis models, called ExVideo. This approach is designed to enhance the capability of current video synthesis models, allowing them to produce content over extended temporal durations while incurring lower training expenditures. In particular, we design extension strategies across common temporal model architectures respectively, including 3D convolution, temporal attention, and positional embedding. To evaluate the efficacy of our proposed post-tuning approach, we conduct extension training on the Stable Video Diffusion model. Our approach augments the model's capacity to generate up to 5× its original number of frames, requiring only 1.5k GPU hours of training on a dataset comprising 40k videos. Importantly, the substantial increase in video length doesn't compromise the model's innate generalization capabilities, and the model showcases its advantages in generating videos of diverse styles and resolutions. We will release the source code and the enhanced model publicly.
WorkBench: a Benchmark Dataset for Agents in a Realistic Workplace Setting
Abstract:
We introduce WorkBench: a benchmark dataset for evaluating agents' ability to execute tasks in a workplace setting. WorkBench contains a sandbox environment with five databases, 26 tools, and 690 tasks. These tasks represent common business activities, such as sending emails and scheduling meetings. The tasks in WorkBench are challenging as they require planning, tool selection, and often multiple actions. If a task has been successfully executed, one (or more) of the database values may change. The correct outcome for each task is unique and unambiguous, which allows for robust, automated evaluation. We call this key contribution outcome-centric evaluation. We evaluate five existing ReAct agents on WorkBench, finding they successfully complete as few as 3% of tasks (Llama2-70B), and just 43% for the best-performing (GPT-4). We further find that agents' errors can result in the wrong action being taken, such as an email being sent to the wrong person. WorkBench reveals weaknesses in agents' ability to undertake common business activities, raising questions about their use in high-stakes workplace settings.