Deep Learning Weekly: Issue 369
Alibaba's Qwen2-VL, GenOps: Learnings From Microservices and Traditional DevOps, Building a Low-Cost Local LLM Server to Run 70 Billion Parameter Models, and many more!
This week in deep learning, we bring you Alibaba's Qwen2-VL, GenOps: Learnings From Microservices and Traditional DevOps, Building a Low-Cost Local LLM Server to Run 70 Billion Parameter Models, and a paper on Sapiens: Foundation for Human Vision Models.
You may also enjoy California legislature passes sweeping AI safety bill, How to fine-tune: Focus on effective datasets, a paper on APE: Active Prompt Engineering - Identifying Informative Few-Shot Examples for LLMs, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Alibaba releases new AI model Qwen2-VL that can analyze videos more than 20 minutes long
Alibaba announced the release of Qwen2-VL, its latest advanced vision-language model designed to enhance visual understanding, video comprehension, and multilingual text-image processing.
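For readers who want to try the model directly, here is a minimal, illustrative sketch of loading Qwen2-VL through Hugging Face Transformers and asking it to describe a single image. The checkpoint name, image URL, and prompt are placeholders, and the exact processor call may differ slightly from the official model card.

```python
# Illustrative sketch (not from the announcement): describe one image with Qwen2-VL.
from PIL import Image
import requests
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

model_id = "Qwen/Qwen2-VL-7B-Instruct"  # assumed checkpoint name
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Placeholder image; in practice this could be a frame sampled from a long video.
image = Image.open(requests.get("https://example.com/frame.jpg", stream=True).raw)
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe what is happening in this frame."},
    ]}
]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the prompt.
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
```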
California legislature passes sweeping AI safety bill
The California State Assembly and Senate have passed the Safe and Secure Innovation for Frontier Artificial Intelligence Models Act (SB 1047), one of the first significant regulations of artificial intelligence in the US.
Elon Musk's xAI launches 'Colossus' AI training system with 100,000 chips
Elon Musk’s xAI has completed the assembly of an AI training system that features 100,000 graphics cards.
Amazon hires founders of AI robotics startup Covariant
Amazon is hiring the three founders of Covariant, a well-funded startup that develops AI for warehouse robots.
MLOps & LLMOps
GenOps: Learnings From Microservices and Traditional DevOps
An article about the lessons learned from microservices and traditional DevOps, proposing the concept of GenOps to address the unique operational needs of generative AI applications.
Building a Low-Cost Local LLM Server to Run 70 Billion Parameter Models
A technical blog post about building a cost-effective local LLM server to run 70 billion parameter models, providing a comprehensive guide on hardware selection, software configuration, and deployment.
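As a flavor of what such a box ends up serving, here is an illustrative snippet (not the author's exact setup) that loads a quantized 70B GGUF checkpoint with llama-cpp-python and offloads layers to whatever GPU memory is available; the model path and settings are placeholders.

```python
# Illustrative local 70B inference via llama-cpp-python with a quantized GGUF file.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3-70b-instruct.Q4_K_M.gguf",  # placeholder quantized weights
    n_ctx=8192,        # context window
    n_gpu_layers=-1,   # offload all layers to GPU(s) if VRAM allows
)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Why does quantization make 70B models feasible on local hardware?"}],
    max_tokens=256,
)
print(response["choices"][0]["message"]["content"])
```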
Enriching and Ingesting Data into Weaviate with Aryn
A tutorial on how to use the open-source Sycamore library and Aryn Partitioning Service to load PDFs into Weaviate.
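Independent of the Sycamore and Aryn specifics, the final ingestion step into Weaviate looks roughly like the sketch below (using the v4 Python client); the collection name, schema, and vectorizer are assumptions, and the tutorial's pipeline handles PDF partitioning and enrichment before this point.

```python
# Generic Weaviate v4 ingestion sketch; not the Sycamore/Aryn pipeline itself.
import weaviate
from weaviate.classes.config import Configure, Property, DataType

client = weaviate.connect_to_local()

# Hypothetical collection for extracted PDF passages.
client.collections.create(
    "PdfChunk",
    properties=[
        Property(name="text", data_type=DataType.TEXT),
        Property(name="source", data_type=DataType.TEXT),
    ],
    vectorizer_config=Configure.Vectorizer.text2vec_openai(),
)

chunks = client.collections.get("PdfChunk")
chunks.data.insert({"text": "...extracted passage...", "source": "report.pdf"})
client.close()
```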
Building a serverless RAG application with LlamaIndex and Azure OpenAI
A guide about creating serverless RAG applications using LlamaIndex and Azure OpenAI, detailing the setup, architecture, and deployment on Microsoft Azure.
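A minimal LlamaIndex-plus-Azure-OpenAI RAG sketch, stripped of the serverless deployment details, might look like the following; deployment names, endpoint, and API version are placeholders to adapt to your own Azure resource.

```python
# Illustrative RAG core in the spirit of the guide, not its exact code.
import os
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, Settings
from llama_index.llms.azure_openai import AzureOpenAI
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding

Settings.llm = AzureOpenAI(
    model="gpt-4o",
    engine="gpt-4o-deployment",                # placeholder deployment name
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_version="2024-02-01",                  # placeholder API version
)
Settings.embed_model = AzureOpenAIEmbedding(
    model="text-embedding-3-small",
    deployment_name="embedding-deployment",    # placeholder deployment name
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_version="2024-02-01",
)

documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
print(query_engine.query("What does the onboarding document say about VPN access?"))
```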
Learning
How to fine-tune: Focus on effective datasets
A blog post from Meta that explores rules of thumb for curating an effective fine-tuning dataset.
File-level and Chunk-Level Retrieval with LlamaCloud and Workflows
A notebook that shows you how to perform file-level and chunk-level retrieval with LlamaCloud using a custom router query engine and a custom agent built with Workflows.
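The file-level versus chunk-level idea can be sketched with plain LlamaIndex components (this is not the notebook's LlamaCloud code): a router sends broad, summary-style questions to a document-level engine and factoid questions to a chunk-level vector engine. An LLM must already be configured (for example via Settings or an OPENAI_API_KEY).

```python
# Illustrative router over "file-level" and "chunk-level" retrieval.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, SummaryIndex
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector
from llama_index.core.tools import QueryEngineTool

docs = SimpleDirectoryReader("data").load_data()

# "File-level": a summary index that reads whole documents for broad questions.
file_level = SummaryIndex.from_documents(docs).as_query_engine(response_mode="tree_summarize")
# "Chunk-level": a vector index over small chunks for specific lookups.
chunk_level = VectorStoreIndex.from_documents(docs).as_query_engine(similarity_top_k=3)

router = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=[
        QueryEngineTool.from_defaults(
            file_level, description="Summaries or questions about an entire file"
        ),
        QueryEngineTool.from_defaults(
            chunk_level, description="Specific facts found in individual passages"
        ),
    ],
)
print(router.query("Give me a high-level summary of the quarterly report."))
```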
Unlocking 7B+ language models in your browser: A deep dive with Google AI Edge's MediaPipe
An article on how Google redesigned model-loading code for the web in order to overcome several memory restrictions and enable running larger (7B+) LLMs in the browser using their cross-platform inference framework.
Libraries & Code
Langflow is a low-code app builder for RAG and multi-agent AI applications.
RAGLAB: A Modular and Research-Oriented Unified Framework for Retrieval-Augmented Generation.
Papers & Publications
Sapiens: Foundation for Human Vision Models
Abstract:
We present Sapiens, a family of models for four fundamental human-centric vision tasks -- 2D pose estimation, body-part segmentation, depth estimation, and surface normal prediction. Our models natively support 1K high-resolution inference and are extremely easy to adapt for individual tasks by simply fine-tuning models pretrained on over 300 million in-the-wild human images. We observe that, given the same computational budget, self-supervised pretraining on a curated dataset of human images significantly boosts the performance for a diverse set of human-centric tasks. The resulting models exhibit remarkable generalization to in-the-wild data, even when labeled data is scarce or entirely synthetic. Our simple model design also brings scalability -- model performance across tasks improves as we scale the number of parameters from 0.3 to 2 billion. Sapiens consistently surpasses existing baselines across various human-centric benchmarks. We achieve significant improvements over the prior state-of-the-art on Humans-5K (pose) by 7.6 mAP, Humans-2K (part-seg) by 17.1 mIoU, Hi4D (depth) by 22.4% relative RMSE, and THuman2 (normal) by 53.5% relative angular error.
APE: Active Prompt Engineering - Identifying Informative Few-Shot Examples for LLMs
Abstract:
Prompt engineering is an iterative procedure that often requires extensive manual efforts to formulate suitable instructions for effectively directing large language models (LLMs) in specific tasks. Incorporating few-shot examples is a vital and efficacious approach to provide LLMs with precise and tangible instructions, leading to improved LLM performance. Nonetheless, identifying the most informative demonstrations for LLMs is labor-intensive, frequently entailing sifting through an extensive search space. In this demonstration, we showcase an interactive tool called APE (Active Prompt Tuning) designed for refining prompts through human feedback. Drawing inspiration from active learning, APE iteratively selects the most ambiguous examples for human feedback, which will be transformed into few-shot examples within the prompt.
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
Abstract:
We introduce CogVideoX, a large-scale diffusion transformer model designed for generating videos based on text prompts. To efficiently model video data, we propose to leverage a 3D Variational Autoencoder (VAE) to compress videos along both spatial and temporal dimensions. To improve the text-video alignment, we propose an expert transformer with the expert adaptive LayerNorm to facilitate the deep fusion between the two modalities. By employing a progressive training technique, CogVideoX is adept at producing coherent, long-duration videos characterized by significant motions. In addition, we develop an effective text-video data processing pipeline that includes various data preprocessing strategies and a video captioning method. It significantly helps enhance the performance of CogVideoX, improving both generation quality and semantic alignment. Results show that CogVideoX demonstrates state-of-the-art performance across both multiple machine metrics and human evaluations.