Deep Learning Weekly: Issue 338
FTC Launches Inquiry into Gen AI Investments, Mixed-input Matrix Multiplication Performance Optimizations, Emulating the Attention Mechanism with a Convolutional Network, and many more!
This week in deep learning, we bring you FTC Launches Inquiry into Generative AI Investments and Partnerships, Mixed-input matrix multiplication performance optimizations, Emulating the Attention Mechanism in Transformer Models with a Fully Convolutional Network, and a paper on Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs.
You may also enjoy New hope for early pancreatic cancer intervention via AI-based risk prediction, Enhancing Interaction between Language Models and Graph Databases via a Semantic Layer, Preference Tuning LLMs with Direct Preference Optimization Methods, a paper on Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
FTC Launches Inquiry into Generative AI Investments and Partnerships
The Federal Trade Commission announced that it issued orders to Alphabet, Amazon, Anthropic, Microsoft, and OpenAI to provide information regarding recent investments and partnerships involving generative AI.
New hope for early pancreatic cancer intervention via AI-based risk prediction
MIT CSAIL researchers developed machine learning models that outperform current methods in detecting pancreatic ductal adenocarcinoma.
Korean AI chip startup Rebellions raises $124M at $650M+ valuation
Rebellions, a Seoul-based developer of AI chips, has raised $124 million in funding to support its engineering efforts.
New embedding models and API updates
OpenAI is releasing new models, reducing prices for GPT-3.5 Turbo, and introducing new ways for developers to manage API keys and understand API usage.
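For a quick taste of the update, here is a minimal sketch using the openai Python client (assuming the v1-style client and the text-embedding-3-small model; the dimensions argument for requesting shortened embeddings is part of this release):

```python
# Minimal sketch of the new embeddings API (assumes openai>=1.0 and the
# text-embedding-3-small model from this announcement).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=["Deep Learning Weekly: Issue 338"],
    dimensions=256,  # new: request a shortened embedding
)
vector = resp.data[0].embedding
print(len(vector))  # 256
```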
Kore.ai rakes in $150M to build generative AI tools for global brands
Enterprise generative AI platform developer Kore.ai announced that it has secured $150 million in strategic growth investment funding led by FTV Capital.
MLOps & LLMOps
Enhance Conversational Agents with LangChain Memory
An article that describes the concept of memory in LangChain and explores its importance, implementation, and various strategies for optimizing conversation flow.
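As a minimal sketch of the idea (module paths are an assumption and vary across LangChain versions), buffer memory simply threads prior turns back into each prompt:

```python
# Minimal sketch: a chat model with buffer memory (assumes langchain and
# langchain-openai are installed; import paths differ between versions).
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory
from langchain_openai import ChatOpenAI

chain = ConversationChain(
    llm=ChatOpenAI(model="gpt-3.5-turbo"),
    memory=ConversationBufferMemory(),  # stores the full transcript
)

chain.invoke({"input": "My name is Ada."})
print(chain.invoke({"input": "What is my name?"})["response"])  # recalls "Ada"
```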
Enhancing Interaction between Language Models and Graph Databases via a Semantic Layer
A blog on how to implement a semantic layer that allows an LLM to interact with a knowledge graph that contains information about actors, movies, and their ratings.
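A hypothetical sketch of the pattern, not the blog's exact implementation: expose fixed, parameterized Cypher queries as tools the LLM can call, instead of letting it write raw Cypher. The Neo4j movie-graph schema below (Person, Movie, ACTED_IN, imdbRating) is assumed:

```python
# Hypothetical sketch of a semantic-layer tool over a Neo4j movie graph.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def top_rated_movies_for_actor(name: str, limit: int = 5) -> list[dict]:
    """A fixed, parameterized query that the LLM invokes as a tool."""
    query = """
    MATCH (p:Person {name: $name})-[:ACTED_IN]->(m:Movie)
    RETURN m.title AS title, m.imdbRating AS rating
    ORDER BY rating DESC LIMIT $limit
    """
    with driver.session() as session:
        return [dict(record) for record in session.run(query, name=name, limit=limit)]

print(top_rated_movies_for_actor("Tom Hanks"))
```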
Mixed-input matrix multiplication performance optimizations
A blog that focuses on mapping mixed-input matrix multiplication onto the NVIDIA Ampere architecture.
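Stripped of the kernel-level details the post covers, the operation is a matrix multiply whose operands have different types, e.g. bf16 activations against int8 weights that are upconverted and rescaled on the fly. A minimal PyTorch sketch of the semantics only, not the blog's Tensor Core implementation:

```python
# Mixed-input matmul semantics: bf16 activations x int8 weights.
# The blog is about doing this upconversion efficiently in registers
# on NVIDIA Ampere Tensor Cores; this just shows the arithmetic.
import torch

x = torch.randn(4, 64, dtype=torch.bfloat16)                   # activations
w_int8 = torch.randint(-128, 128, (64, 32), dtype=torch.int8)  # quantized weights
scale = torch.full((32,), 0.02, dtype=torch.bfloat16)          # per-column dequant scale

y = (x @ w_int8.to(torch.bfloat16)) * scale  # upconvert, multiply, rescale
print(y.shape)  # torch.Size([4, 32])
```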
Learning
Preference Tuning LLMs with Direct Preference Optimization Methods
A blog that evaluates three promising methods for aligning language models without reinforcement learning (also known as preference tuning), across a number of models and hyperparameter settings.
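For reference, the DPO objective at the heart of these methods fits in a few lines: it pushes the policy to widen the chosen-vs-rejected margin relative to a frozen reference model. A minimal sketch, assuming per-sequence log-probabilities have already been summed over completion tokens:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Direct Preference Optimization loss on per-sequence log-probs."""
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    # Maximize the log-sigmoid of the scaled reward margin.
    return -F.logsigmoid(beta * (chosen_reward - rejected_reward)).mean()

loss = dpo_loss(torch.tensor([-5.0]), torch.tensor([-7.0]),
                torch.tensor([-5.5]), torch.tensor([-6.5]))
print(loss)  # ~0.64: the policy prefers the chosen response more than ref does
```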
Emulating the Attention Mechanism in Transformer Models with a Fully Convolutional Network
An article that presents a novel method of emulating the attention mechanism in transformer models using a fully convolutional network, which achieves superior performance and efficiency compared to conventional transformers.
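As a rough illustration of the general idea, not the article's exact architecture, an attention block can be swapped for convolutions that mix information along the sequence axis; the kernel size and layout here are illustrative:

```python
# Generic sketch: replace self-attention with a convolutional token mixer.
import torch
import torch.nn as nn

class ConvTokenMixer(nn.Module):
    def __init__(self, dim: int, kernel_size: int = 7):
        super().__init__()
        self.mix = nn.Conv1d(dim, dim, kernel_size,
                             padding=kernel_size // 2, groups=dim)  # depthwise, mixes tokens
        self.proj = nn.Conv1d(dim, dim, 1)                          # pointwise, mixes channels

    def forward(self, x):                # x: (batch, seq_len, dim)
        x = x.transpose(1, 2)            # -> (batch, dim, seq_len)
        x = self.proj(torch.relu(self.mix(x)))
        return x.transpose(1, 2)         # back to (batch, seq_len, dim)

out = ConvTokenMixer(64)(torch.randn(2, 128, 64))
print(out.shape)  # torch.Size([2, 128, 64])
```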
Stock Price Prediction with Quantum Machine Learning in Python
An article that explores and compares the performance of a quantum neural network against a simple multilayer perceptron for a stock pricing task.
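A toy sketch of what the quantum side of such a comparison can look like, using a generic PennyLane variational circuit rather than the article's exact setup:

```python
# Toy variational "quantum neuron" in PennyLane (generic, not the article's model).
import pennylane as qml
from pennylane import numpy as np

n_qubits = 4
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev)
def qnn(inputs, weights):
    qml.AngleEmbedding(inputs, wires=range(n_qubits))        # encode features as rotations
    qml.StronglyEntanglingLayers(weights, wires=range(n_qubits))
    return qml.expval(qml.PauliZ(0))                         # scalar regression output

shape = qml.StronglyEntanglingLayers.shape(n_layers=2, n_wires=n_qubits)
weights = np.random.uniform(0, np.pi, size=shape, requires_grad=True)
print(qnn(np.array([0.1, 0.2, 0.3, 0.4]), weights))
```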
Deep learning for single-cell sequencing: a microscope to see the diversity of cells
An article that explores how deep learning techniques can enhance single-cell sequencing, a method that reveals the diversity and complexity of individual cells in living organisms.
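Much of this line of work starts from autoencoders that compress a cell's noisy expression profile into a low-dimensional embedding; a generic, illustrative sketch (not the article's models):

```python
# Generic autoencoder over gene-expression vectors (placeholder data).
import torch
import torch.nn as nn

n_genes, latent = 2000, 32
encoder = nn.Sequential(nn.Linear(n_genes, 256), nn.ReLU(), nn.Linear(256, latent))
decoder = nn.Sequential(nn.Linear(latent, 256), nn.ReLU(), nn.Linear(256, n_genes))

x = torch.randn(64, n_genes)             # a batch of cells (random stand-in data)
recon = decoder(encoder(x))              # embeddings come from the encoder
loss = nn.functional.mse_loss(recon, x)
loss.backward()
print(loss.item())
```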
Libraries & Code
A lightweight inference library for ONNX files, written in C++.
An end-to-end speech processing toolkit covering end-to-end speech recognition, text-to-speech, speech translation, speech enhancement, etc.
An AI-powered tool for automated pull request analysis, feedback, suggestions, and more.
Papers & Publications
Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs
Abstract:
Diffusion models have exhibited exceptional performance in text-to-image generation and editing. However, existing methods often face challenges when handling complex text prompts that involve multiple objects with multiple attributes and relationships. In this paper, we propose a brand new training-free text-to-image generation/editing framework, namely Recaption, Plan and Generate (RPG), harnessing the powerful chain-of-thought reasoning ability of multimodal LLMs to enhance the compositionality of text-to-image diffusion models. Our approach employs the MLLM as a global planner to decompose the process of generating complex images into multiple simpler generation tasks within subregions. We propose complementary regional diffusion to enable region-wise compositional generation. Furthermore, we integrate text-guided image generation and editing within the proposed RPG in a closed-loop fashion, thereby enhancing generalization ability. Extensive experiments demonstrate that our RPG outperforms state-of-the-art text-to-image diffusion models, including DALL-E 3 and SDXL, particularly in multi-category object composition and text-image semantic alignment. Notably, our RPG framework exhibits wide compatibility with various MLLM architectures (e.g., MiniGPT-4) and diffusion backbones (e.g., ControlNet).
Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data
Abstract:
This work presents Depth Anything, a highly practical solution for robust monocular depth estimation. Without pursuing novel technical modules, we aim to build a simple yet powerful foundation model dealing with any images under any circumstances. To this end, we scale up the dataset by designing a data engine to collect and automatically annotate large-scale unlabeled data (~62M), which significantly enlarges the data coverage and thus is able to reduce the generalization error. We investigate two simple yet effective strategies that make data scaling-up promising. First, a more challenging optimization target is created by leveraging data augmentation tools. It compels the model to actively seek extra visual knowledge and acquire robust representations. Second, an auxiliary supervision is developed to enforce the model to inherit rich semantic priors from pre-trained encoders. We evaluate its zero-shot capabilities extensively, including six public datasets and randomly captured photos. It demonstrates impressive generalization ability. Further, through fine-tuning it with metric depth information from NYUv2 and KITTI, new SOTAs are set. Our better depth model also results in a better depth-conditioned ControlNet.
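For readers who want to try it, the released checkpoints can be run through the Hugging Face depth-estimation pipeline; the small-variant model id below is an assumption, as is the input filename:

```python
# Minimal sketch: zero-shot monocular depth with a released Depth Anything
# checkpoint (model id assumed to be the authors' small HF variant).
from transformers import pipeline
from PIL import Image

depth = pipeline("depth-estimation", model="LiheYoung/depth-anything-small-hf")
result = depth(Image.open("photo.jpg"))          # "photo.jpg" is a placeholder
result["depth"].save("photo_depth.png")          # per-pixel depth as a grayscale image
```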