Deep Learning Weekly: Issue 351
Cohere Toolkit, Multimodal Search with Snowflake Embedding and MAX Engine, LLM Security, CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data, and more!
This week in deep learning, we bring you New Cohere Toolkit Accelerates Generative AI Application Development, Modular: Multimodal Search with Snowflake Embedding and MAX Engine, A Primer on LLM Security – Hacking Large Language Models for Beginners, and a paper on CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data.
You may also enjoy Introducing Phi-3: Redefining what's possible with SLMs, Advanced Retriever Techniques to Improve Your RAGs, Meditron: An LLM suite for low-resource medical settings leveraging Meta Llama, a paper on FlowMap: High-Quality Camera Poses, Intrinsics, and Depth via Gradient Descent, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
New Cohere Toolkit Accelerates Generative AI Application Development
Cohere releases Cohere Toolkit, an open-source repository of production-ready applications that can be deployed across cloud providers.
Introducing Phi-3: Redefining what's possible with SLMs
Microsoft introduces Phi-3, a family of capable and cost-effective small language models (SLMs).
AI unicorn Synthesia launches most 'emotionally expressive' avatars on the market
A British startup today unveiled new AI humans that blur the line between the virtual and the real. Synthesia calls the digital beings “Expressive Avatars.”
Amazon Q enterprise AI chatbot is now generally available
Amazon Q, the work-centric generative AI assistant from AWS, has become generally available.
MLOps & LLMOps
Building a Chat Application with LangChain, LLMs, and Streamlit for Complex SQL Database Interaction
An article on how to use a large language model to interact with a complex database via LangChain agents and tools, and how to deploy the resulting chat application with Streamlit.
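The pattern the article describes can be sketched without the LangChain machinery: the agent reads the database schema, asks an LLM to turn the question into SQL, then executes the query. The sketch below uses stdlib sqlite3 and a hard-coded stand-in for the LLM call; all table and function names are illustrative, not from the article.

```python
import sqlite3

def get_schema(conn):
    """Return CREATE statements so the model can see the table structure."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type='table'"
    ).fetchall()
    return "\n".join(r[0] for r in rows)

def fake_llm_to_sql(question, schema):
    """Stand-in for the LLM call that turns a question into SQL.

    A real agent would prompt the model with `schema` and `question`.
    """
    return "SELECT name, total FROM orders ORDER BY total DESC LIMIT 1"

def answer(conn, question):
    sql = fake_llm_to_sql(question, get_schema(conn))
    return conn.execute(sql).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (name TEXT, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("alice", 120.0), ("bob", 75.5)])
print(answer(conn, "Who placed the largest order?"))
# → [('alice', 120.0)]
```

In the article's setup, LangChain's SQL agent plays the role of `fake_llm_to_sql` and `answer` together, and Streamlit wraps the loop in a chat UI.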
Modular: Multimodal Search with Snowflake Embedding and MAX Engine
An article that explores how a multimodal approach can further enhance semantic search, and discusses how MAX Engine can optimize multiple models for inference.
Advanced Retriever Techniques to Improve Your RAGs
An article on cutting-edge techniques for optimizing the selection of relevant documents with LangChain.
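One widely used technique in this family is reciprocal rank fusion (RRF), which merges rankings from several retrievers (say, BM25 and a dense index) without having to calibrate their scores. A minimal stdlib sketch, with the conventional constant k=60 and illustrative document ids:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    so items ranked highly by several retrievers rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25 = ["d1", "d2", "d3"]    # keyword retriever's ranking
dense = ["d3", "d1", "d4"]   # vector retriever's ranking
print(reciprocal_rank_fusion([bm25, dense]))
# → ['d1', 'd3', 'd2', 'd4']
```

Documents found by both retrievers ("d1", "d3") outrank those found by only one, which is the behavior hybrid RAG retrievers are after.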
Common Pitfalls To Avoid When Using Vector Databases
An article that covers the common pitfalls and avoidance strategies for vector databases.
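One pitfall articles in this vein commonly cover: ranking by raw inner product when embeddings are not normalized, so vector magnitude rather than direction dominates the results. A stdlib sketch of the effect (the vectors are illustrative):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Inner product of the normalized vectors, i.e. direction only."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

query = [1.0, 0.0]
docs = {
    "aligned":  [2.0, 0.1],    # points the same way as the query
    "long_off": [10.0, 10.0],  # large magnitude, 45 degrees off
}

# Raw inner product favours the longer vector...
assert dot(query, docs["long_off"]) > dot(query, docs["aligned"])
# ...while cosine similarity favours the direction match.
assert cosine(query, docs["aligned"]) > cosine(query, docs["long_off"])
```

The usual fixes are to L2-normalize embeddings before indexing or to configure the index for cosine distance explicitly.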
Streaming Pipelines for Fine-tuning LLMs and RAG in Real-Time
An article that discusses the design and implementation of a production-ready feature pipeline built on Bytewax and a RabbitMQ queue.
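The shape of such a pipeline can be sketched with the stdlib alone: a queue stands in for RabbitMQ, a worker thread stands in for the Bytewax dataflow, and a list stands in for the feature/vector store. All names and the toy embedding are illustrative, not from the article.

```python
import queue
import threading

raw_events = queue.Queue()   # stand-in for the RabbitMQ queue
feature_store = []           # stand-in for the downstream vector store

def embed(text):
    """Toy stand-in for the real embedding model."""
    return [float(len(text))]

def worker():
    """Consume events, transform them, write features (the Bytewax role)."""
    while True:
        doc = raw_events.get()
        if doc is None:          # sentinel: shut down cleanly
            break
        feature_store.append({"text": doc, "vector": embed(doc)})

t = threading.Thread(target=worker)
t.start()
for doc in ["llm news", "rag update"]:
    raw_events.put(doc)
raw_events.put(None)
t.join()
print(feature_store)
```

In the real pipeline the transform step would chunk, clean, and embed documents, and the same stream feeds both RAG retrieval and fine-tuning datasets.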
Learning
A Visual Guide to Vision Transformers
A visual guide to Vision Transformers (ViTs), a class of deep learning models that have achieved state-of-the-art performance on image classification tasks.
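The first step such guides walk through is patch embedding, and its shape arithmetic is easy to verify: a 224×224 image cut into 16×16 patches yields 14×14 = 196 tokens, each flattening 16×16×3 = 768 raw values in the RGB case. A stdlib sketch (single-channel for brevity):

```python
def patchify(image, patch):
    """Split an H×W single-channel image (nested lists) into flat patches."""
    h, w = len(image), len(image[0])
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            flat = [image[top + i][left + j]
                    for i in range(patch) for j in range(patch)]
            patches.append(flat)
    return patches

# ViT-Base numbers: 224/16 = 14 patches per side → 196 tokens,
# each 16*16 = 256 values here (×3 channels = 768 in RGB).
image = [[0] * 224 for _ in range(224)]   # toy single-channel image
tokens = patchify(image, 16)
print(len(tokens), len(tokens[0]))        # → 196 256
```

In the actual model, each flattened patch is then linearly projected and a position embedding is added before the transformer encoder.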
A Primer on LLM Security – Hacking Large Language Models for Beginners
A blog post that delves into the security aspects of Large Language Models (LLMs) and their applications.
Meditron: An LLM suite for low-resource medical settings leveraging Meta Llama
A blog post that discusses the development of Meditron, an open-source suite of large language models designed to assist with clinical decision-making and diagnosis in low-resource settings.
Libraries & Code
EthicalML/awesome-production-machine-learning
A curated list of awesome open source libraries to deploy, monitor, version and scale your machine learning.
ScrapeGraphAI
A web scraping Python library that uses LLMs and direct graph logic to create scraping pipelines for websites, documents, and XML files.
Papers & Publications
CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data
Abstract:
Contrastive learning has emerged as a transformative method for learning effective visual representations through the alignment of image and text embeddings. However, pairwise similarity computation in contrastive loss between image and text pairs poses computational challenges. This paper presents a novel weakly supervised pre-training of vision models on web-scale image-text data. The proposed method reframes pre-training on image-text data as a classification task. Consequently, it eliminates the need for pairwise similarity computations in contrastive loss, achieving a remarkable 2.7× acceleration in training speed compared to contrastive learning on web-scale data. Through extensive experiments spanning diverse vision tasks, including detection and segmentation, we demonstrate that the proposed method maintains high representation quality.
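The reframing the abstract describes can be illustrated with a toy sketch: instead of contrasting each image against every caption in the batch, caption words are mapped to a fixed vocabulary (CatLIP uses WordNet synsets of caption nouns) and the image model is trained as an ordinary multi-label classifier. The vocabulary and captions below are illustrative, not from the paper.

```python
def captions_to_targets(captions, vocab):
    """Turn captions into multi-label classification targets over `vocab`."""
    index = {word: i for i, word in enumerate(vocab)}
    targets = []
    for caption in captions:
        label = [0] * len(vocab)
        for word in caption.lower().split():
            if word in index:   # CatLIP maps caption nouns to WordNet synsets
                label[index[word]] = 1
        targets.append(label)
    return targets

vocab = ["dog", "cat", "beach"]
caps = ["a dog on the beach", "my cat sleeping"]
print(captions_to_targets(caps, vocab))
# → [[1, 0, 1], [0, 1, 0]]
```

Because the targets are fixed per image, the loss no longer requires pairwise image-text similarities across the batch, which is where the claimed 2.7× speedup comes from.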
FlowMap: High-Quality Camera Poses, Intrinsics, and Depth via Gradient Descent
Abstract:
This paper introduces FlowMap, an end-to-end differentiable method that solves for precise camera poses, camera intrinsics, and per-frame dense depth of a video sequence. Our method performs per-video gradient-descent minimization of a simple least-squares objective that compares the optical flow induced by depth, intrinsics, and poses against correspondences obtained via off-the-shelf optical flow and point tracking. Alongside the use of point tracks to encourage long-term geometric consistency, we introduce differentiable re-parameterizations of depth, intrinsics, and pose that are amenable to first-order optimization. We empirically show that camera parameters and dense depth recovered by our method enable photo-realistic novel view synthesis on 360-degree trajectories using Gaussian Splatting. Our method not only far outperforms prior gradient-descent based bundle adjustment methods, but surprisingly performs on par with COLMAP, the state-of-the-art SfM method, on the downstream task of 360-degree novel view synthesis (even though our method is purely gradient-descent based, fully differentiable, and presents a complete departure from conventional SfM).
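The least-squares objective described above can be written schematically (the notation here is ours, not the paper's): with per-frame depth \(D_t\), shared intrinsics \(K\), and relative poses \(P_t\), the optical flow they induce is compared against precomputed correspondences \(\hat{F}_t\) from an off-the-shelf estimator:

```latex
\min_{D,\, K,\, P} \;\; \sum_t \bigl\| \mathcal{F}\!\left(D_t, K, P_t\right) - \hat{F}_t \bigr\|_2^2
```

Because every term is differentiable in \(D\), \(K\), and \(P\), the whole problem can be solved per video by first-order gradient descent.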
ConsistentID: Portrait Generation with Multimodal Fine-Grained Identity Preserving
Abstract:
Diffusion-based technologies have made significant strides, particularly in personalized and customized facial generation. However, existing methods face challenges in achieving high-fidelity and detailed identity (ID) consistency, primarily due to insufficient fine-grained control over facial areas and the lack of a comprehensive strategy for ID preservation by fully considering intricate facial details and the overall face. To address these limitations, we introduce ConsistentID, an innovative method crafted for diverse identity-preserving portrait generation under fine-grained multimodal facial prompts, utilizing only a single reference image. ConsistentID comprises two key components: a multimodal facial prompt generator that combines facial features, corresponding facial descriptions and the overall facial context to enhance precision in facial details, and an ID-preservation network optimized through the facial attention localization strategy, aimed at preserving ID consistency in facial regions. Together, these components significantly enhance the accuracy of ID preservation by introducing fine-grained multimodal ID information from facial regions. To facilitate training of ConsistentID, we present a fine-grained portrait dataset, FGID, with over 500,000 facial images, offering greater diversity and comprehensiveness than existing public facial datasets. Experimental results substantiate that our ConsistentID achieves exceptional precision and diversity in personalized facial generation, surpassing existing methods on the MyStyle dataset. Furthermore, while ConsistentID introduces more multimodal ID information, it maintains a fast inference speed during generation.