Deep Learning Weekly: Issue 351
Cohere Toolkit, Multimodal Search with Snowflake Embedding and MAX Engine, LLM Security, CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data, and more!
This week in deep learning, we bring you New Cohere Toolkit Accelerates Generative AI Application Development, Modular: Multimodal Search with Snowflake Embedding and MAX Engine, A Primer on LLM Security – Hacking Large Language Models for Beginners, and a paper on CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data.
You may also enjoy Introducing Phi-3: Redefining what's possible with SLMs, Advanced Retriever Techniques to Improve Your RAGs, Meditron: An LLM suite for low-resource medical settings leveraging Meta Llama, a paper on FlowMap: High-Quality Camera Poses, Intrinsics, and Depth via Gradient Descent, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
New Cohere Toolkit Accelerates Generative AI Application Development
Cohere releases Cohere Toolkit, an open-source repository of production-ready applications that can be deployed across cloud providers.
Introducing Phi-3: Redefining what's possible with SLMs
Microsoft introduces Phi-3, a family of capable and cost-effective small language models (SLMs).
AI unicorn Synthesia launches most 'emotionally expressive' avatars on the market
A British startup today unveiled new AI humans that blur the line between the virtual and the real. Synthesia calls the digital beings “Expressive Avatars.”
Amazon Q enterprise AI chatbot is now generally available
Amazon Q, the work-centric generative AI assistant from AWS, has become generally available.
MLOps & LLMOps
Building a Chat Application with LangChain, LLMs, and Streamlit for Complex SQL Database Interaction
An article on how to use a large language model to interact with a complex database via LangChain agents and tools, and how to deploy the resulting chat application with Streamlit.
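The pattern the article describes can be sketched without the LangChain machinery: the agent reads the database schema, asks an LLM to turn the question into SQL, then executes the query. The sketch below uses stdlib sqlite3 and a hard-coded stand-in for the LLM call; all table and function names are illustrative, not from the article.

```python
import sqlite3

def get_schema(conn):
    """Return CREATE statements so the model can see the table structure."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type='table'"
    ).fetchall()
    return "\n".join(r[0] for r in rows)

def fake_llm_to_sql(question, schema):
    """Stand-in for the LLM call that turns a question into SQL.

    A real agent would prompt the model with `schema` and `question`.
    """
    return "SELECT name, total FROM orders ORDER BY total DESC LIMIT 1"

def answer(conn, question):
    sql = fake_llm_to_sql(question, get_schema(conn))
    return conn.execute(sql).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (name TEXT, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("alice", 120.0), ("bob", 75.5)])
print(answer(conn, "Who placed the largest order?"))
# → [('alice', 120.0)]
```

In the article's setup, LangChain's SQL agent plays the role of `fake_llm_to_sql` and `answer` together, and Streamlit wraps the loop in a chat UI.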
Modular: Multimodal Search with Snowflake Embedding and MAX Engine
An article that explores how a multimodal approach can further enhance semantic search, and discusses how MAX Engine can optimize multiple models for inference.
Advanced Retriever Techniques to Improve Your RAGs
An article on cutting-edge techniques for optimizing the selection of relevant documents with LangChain.
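One widely used technique in this family is reciprocal rank fusion (RRF), which merges rankings from several retrievers (say, BM25 and a dense index) without having to calibrate their scores. A minimal stdlib sketch, with the conventional constant k=60 and illustrative document ids:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    so items ranked highly by several retrievers rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25 = ["d1", "d2", "d3"]    # keyword retriever's ranking
dense = ["d3", "d1", "d4"]   # vector retriever's ranking
print(reciprocal_rank_fusion([bm25, dense]))
# → ['d1', 'd3', 'd2', 'd4']
```

Documents found by both retrievers ("d1", "d3") outrank those found by only one, which is the behavior hybrid RAG retrievers are after.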
Common Pitfalls To Avoid When Using Vector Databases
An article that covers the common pitfalls and avoidance strategies for vector databases.
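One pitfall articles in this vein commonly cover: ranking by raw inner product when embeddings are not normalized, so vector magnitude rather than direction dominates the results. A stdlib sketch of the effect (the vectors are illustrative):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Inner product of the normalized vectors, i.e. direction only."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

query = [1.0, 0.0]
docs = {
    "aligned":  [2.0, 0.1],    # points the same way as the query
    "long_off": [10.0, 10.0],  # large magnitude, 45 degrees off
}

# Raw inner product favours the longer vector...
assert dot(query, docs["long_off"]) > dot(query, docs["aligned"])
# ...while cosine similarity favours the direction match.
assert cosine(query, docs["aligned"]) > cosine(query, docs["long_off"])
```

The usual fixes are to L2-normalize embeddings before indexing or to configure the index for cosine distance explicitly.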
Streaming Pipelines for Fine-tuning LLMs and RAG in Real-Time
An article that discusses the design and implementation of a production-ready feature pipeline built on Bytewax and a RabbitMQ queue.
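The shape of such a pipeline can be sketched with the stdlib alone: a queue stands in for RabbitMQ, a worker thread stands in for the Bytewax dataflow, and a list stands in for the feature/vector store. All names and the toy embedding are illustrative, not from the article.

```python
import queue
import threading

raw_events = queue.Queue()   # stand-in for the RabbitMQ queue
feature_store = []           # stand-in for the downstream vector store

def embed(text):
    """Toy stand-in for the real embedding model."""
    return [float(len(text))]

def worker():
    """Consume events, transform them, write features (the Bytewax role)."""
    while True:
        doc = raw_events.get()
        if doc is None:          # sentinel: shut down cleanly
            break
        feature_store.append({"text": doc, "vector": embed(doc)})

t = threading.Thread(target=worker)
t.start()
for doc in ["llm news", "rag update"]:
    raw_events.put(doc)
raw_events.put(None)
t.join()
print(feature_store)
```

In the real pipeline the transform step would chunk, clean, and embed documents, and the same stream feeds both RAG retrieval and fine-tuning datasets.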
Learning
A Visual Guide to Vision Transformers
A visual guide to Vision Transformers (ViTs), a class of deep learning models that have achieved state-of-the-art performance on image classification tasks.
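The first step such guides walk through is patch embedding, and its shape arithmetic is easy to verify: a 224×224 image cut into 16×16 patches yields 14×14 = 196 tokens, each flattening 16×16×3 = 768 raw values in the RGB case. A stdlib sketch (single-channel for brevity):

```python
def patchify(image, patch):
    """Split an H×W single-channel image (nested lists) into flat patches."""
    h, w = len(image), len(image[0])
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            flat = [image[top + i][left + j]
                    for i in range(patch) for j in range(patch)]
            patches.append(flat)
    return patches

# ViT-Base numbers: 224/16 = 14 patches per side → 196 tokens,
# each 16*16 = 256 values here (×3 channels = 768 in RGB).
image = [[0] * 224 for _ in range(224)]   # toy single-channel image
tokens = patchify(image, 16)
print(len(tokens), len(tokens[0]))        # → 196 256
```

In the actual model, each flattened patch is then linearly projected and a position embedding is added before the transformer encoder.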
A Primer on LLM Security – Hacking Large Language Models for Beginners
A blog post that delves into the security aspects of Large Language Models (LLMs) and their applications.
Meditron: An LLM suite for low-resource medical settings leveraging Meta Llama
A blog post that discusses the development of Meditron, an open-source suite of large language models designed to assist with clinical decision-making and diagnosis in low-resource settings.
Libraries & Code
EthicalML/awesome-production-machine-learning
A curated list of awesome open source libraries to deploy, monitor, version and scale your machine learning.
ScrapeGraphAI
A web scraping Python library that uses LLMs and direct graph logic to create scraping pipelines for websites, documents, and XML files.
Papers & Publications
CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data
Abstract:
Contrastive learning has emerged as a transformative method for learning effective visual representations through the alignment of image and text embeddings. However, pairwise similarity computation in contrastive loss between image and text pairs poses computational challenges. This paper presents a novel weakly supervised pre-training of vision models on web-scale image-text data. The proposed method reframes pre-training on image-text data as a classification task. Consequently, it eliminates the need for pairwise similarity computations in contrastive loss, achieving a remarkable 2.7× acceleration in training speed compared to contrastive learning on web-scale data. Through extensive experiments spanning diverse vision tasks, including detection and segmentation, we demonstrate that the proposed method maintains high representation quality.
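The reframing the abstract describes can be illustrated with a toy sketch: instead of contrasting each image against every caption in the batch, caption words are mapped to a fixed vocabulary (CatLIP uses WordNet synsets of caption nouns) and the image model is trained as an ordinary multi-label classifier. The vocabulary and captions below are illustrative, not from the paper.

```python
def captions_to_targets(captions, vocab):
    """Turn captions into multi-label classification targets over `vocab`."""
    index = {word: i for i, word in enumerate(vocab)}
    targets = []
    for caption in captions:
        label = [0] * len(vocab)
        for word in caption.lower().split():
            if word in index:   # CatLIP maps caption nouns to WordNet synsets
                label[index[word]] = 1
        targets.append(label)
    return targets

vocab = ["dog", "cat", "beach"]
caps = ["a dog on the beach", "my cat sleeping"]
print(captions_to_targets(caps, vocab))
# → [[1, 0, 1], [0, 1, 0]]
```

Because the targets are fixed per image, the loss no longer requires pairwise image-text similarities across the batch, which is where the claimed 2.7× speedup comes from.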
FlowMap: High-Quality Camera Poses, Intrinsics, and Depth via Gradient Descent
Abstract:
This paper introduces FlowMap, an end-to-end differentiable method that solves for precise camera poses, camera intrinsics, and per-frame dense depth of a video sequence. Our method performs per-video gradient-descent minimization of a simple least-squares objective that compares the optical flow induced by depth, intrinsics, and poses against correspondences obtained via off-the-shelf optical flow and point tracking. Alongside the use of point tracks to encourage long-term geometric consistency, we introduce differentiable re-parameterizations of depth, intrinsics, and pose that are amenable to first-order optimization. We empirically show that camera parameters and dense depth recovered by our method enable photo-realistic novel view synthesis on 360-degree trajectories using Gaussian Splatting. Our method not only far outperforms prior gradient-descent based bundle adjustment methods, but surprisingly performs on par with COLMAP, the state-of-the-art SfM method, on the downstream task of 360-degree novel view synthesis (even though our method is purely gradient-descent based, fully differentiable, and presents a complete departure from conventional SfM).
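The least-squares objective described above can be written schematically (the notation here is ours, not the paper's): with per-frame depth \(D_t\), shared intrinsics \(K\), and relative poses \(P_t\), the optical flow they induce is compared against precomputed correspondences \(\hat{F}_t\) from an off-the-shelf estimator:

```latex
\min_{D,\, K,\, P} \;\; \sum_t \bigl\| \mathcal{F}\!\left(D_t, K, P_t\right) - \hat{F}_t \bigr\|_2^2
```

Because every term is differentiable in \(D\), \(K\), and \(P\), the whole problem can be solved per video by first-order gradient descent.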
ConsistentID: Portrait Generation with Multimodal Fine-Grained Identity Preserving
Abstract:
Diffusion-based technologies have made significant strides, particularly in personalized and customized facial generation. However, existing methods face challenges in achieving high-fidelity and detailed identity (ID) consistency, primarily due to insufficient fine-grained control over facial areas and the lack of a comprehensive strategy for ID preservation by fully considering intricate facial details and the overall face. To address these limitations, we introduce ConsistentID, an innovative method crafted for diverse identity-preserving portrait generation under fine-grained multimodal facial prompts, utilizing only a single reference image. ConsistentID comprises two key components: a multimodal facial prompt generator that combines facial features, corresponding facial descriptions and the overall facial context to enhance precision in facial details, and an ID-preservation network optimized through the facial attention localization strategy, aimed at preserving ID consistency in facial regions. Together, these components significantly enhance the accuracy of ID preservation by introducing fine-grained multimodal ID information from facial regions. To facilitate training of ConsistentID, we present a fine-grained portrait dataset, FGID, with over 500,000 facial images, offering greater diversity and comprehensiveness than existing public facial datasets. Experimental results substantiate that our ConsistentID achieves exceptional precision and diversity in personalized facial generation, surpassing existing methods on the MyStyle dataset. Furthermore, while ConsistentID introduces more multimodal ID information, it maintains a fast inference speed during generation.