Deep Learning Weekly: Issue 338
FTC Launches Inquiry into Gen AI Investments, Mixed-input Matrix Multiplication Performance Optimizations, Emulating the Attention Mechanism with a Convolutional Network, and many more!
This week in deep learning, we bring you FTC Launches Inquiry into Generative AI Investments and Partnerships, Mixed-input matrix multiplication performance optimizations, Emulating the Attention Mechanism in Transformer Models with a Fully Convolutional Network, and a paper on Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs.
You may also enjoy New hope for early pancreatic cancer intervention via AI-based risk prediction, Enhancing Interaction between Language Models and Graph Databases via a Semantic Layer, Preference Tuning LLMs with Direct Preference Optimization Methods, a paper on Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
The Federal Trade Commission announced that it issued orders to Alphabet, Amazon, Anthropic, Microsoft, and OpenAI to provide information regarding recent investments and partnerships involving generative AI.
MIT CSAIL researchers developed machine learning models that outperform current methods in detecting pancreatic ductal adenocarcinoma.
Rebellions, a Seoul-based developer of AI chips, has raised $124 million in funding to support its engineering efforts.
OpenAI is releasing new models, reducing prices for GPT-3.5 Turbo, and introducing new ways for developers to manage API keys and understand API usage.
Enterprise generative AI platform developer Kore.ai announced that it has secured $150 million in strategic growth investment funding led by FTV Capital.
MLOps & LLMOps
An article that describes the concept of memory in LangChain and explores its importance, implementation, and various strategies for optimizing conversation flow.
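The buffer-window strategy the article covers can be sketched in a few lines. This is a plain-Python stand-in, not LangChain's actual API; the class and method names below are illustrative assumptions.

```python
# Hypothetical sketch of "buffer window" conversation memory: keep only the
# last k exchanges so the prompt stays bounded as the conversation grows.

class BufferWindowMemory:
    """Keeps only the last `k` user/assistant exchanges."""

    def __init__(self, k=3):
        self.k = k
        self.turns = []  # list of (user_msg, assistant_msg) pairs

    def save(self, user_msg, assistant_msg):
        self.turns.append((user_msg, assistant_msg))

    def load(self):
        # Only the most recent k exchanges are injected into the next prompt.
        recent = self.turns[-self.k:]
        return "\n".join(f"Human: {u}\nAI: {a}" for u, a in recent)

memory = BufferWindowMemory(k=2)
memory.save("Hello", "Hello there!")
memory.save("What is LangChain?", "A framework for LLM apps.")
memory.save("Does it have memory?", "Yes, several strategies.")
print(memory.load())  # the oldest exchange has been dropped
```

Other strategies in the article trade recall for cost differently, e.g. summarizing old turns instead of dropping them.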
A blog on how to implement a semantic layer that allows an LLM to interact with a knowledge graph that contains information about actors, movies, and their ratings.
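The core idea of a semantic layer can be illustrated as follows: rather than letting the LLM emit raw Cypher, it selects from a small set of vetted, parameterized query templates. The template names and Cypher strings below are illustrative assumptions, not the blog's code.

```python
# Hypothetical semantic layer: the LLM picks an intent and fills parameters;
# the layer resolves it to a safe, pre-written Cypher query.

CYPHER_TEMPLATES = {
    "movies_by_actor": (
        "MATCH (a:Actor {name: $name})-[:ACTED_IN]->(m:Movie) "
        "RETURN m.title"
    ),
    "rating_of_movie": (
        "MATCH (m:Movie {title: $title}) RETURN m.rating"
    ),
}

def build_query(intent, **params):
    """Resolve an LLM-chosen intent into a parameterized Cypher query."""
    if intent not in CYPHER_TEMPLATES:
        raise ValueError(f"Unknown intent: {intent}")
    return CYPHER_TEMPLATES[intent], params

query, params = build_query("movies_by_actor", name="Tom Hanks")
```

Constraining the model to known templates avoids malformed or unsafe queries at the cost of some flexibility.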
A blog that focuses on mapping mixed-input matrix multiplication onto the NVIDIA Ampere architecture.
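The essence of a mixed-input matmul is upcasting narrow-precision weights to the activation dtype just before the multiply, rather than storing them wide. A plain NumPy stand-in for the Tensor-Core kernels the blog discusses (shapes, dtypes, and the per-tensor scale here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
activations = rng.standard_normal((4, 8)).astype(np.float32)   # fp inputs
weights_i8 = rng.integers(-127, 127, (8, 16), dtype=np.int8)   # quantized weights
scale = 0.02                                                   # assumed per-tensor scale

# Upcast-then-multiply: the core of a mixed-input GEMM. On Ampere the
# conversion happens in registers, feeding fp16 Tensor Core instructions.
weights_fp = weights_i8.astype(np.float32) * scale
out = activations @ weights_fp
assert out.shape == (4, 16)
```

The blog's actual contribution is doing this conversion efficiently inside the GEMM mainloop, where data layout and instruction selection dominate performance.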
A blog that evaluates three promising methods to align language models without reinforcement learning (or preference tuning) on a number of models and hyperparameter settings.
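One of the methods evaluated, Direct Preference Optimization, reduces to a simple per-example loss: maximize the margin between the policy's and the reference model's log-probabilities on chosen versus rejected responses. A minimal numeric sketch (the log-probabilities below are made-up values for illustration):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss: -log sigmoid(beta * ((logp_c - ref_c) - (logp_r - ref_r)))."""
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# When the policy prefers the chosen response more strongly than the
# reference does, the margin is positive and the loss falls below log(2).
loss = dpo_loss(-10.0, -14.0, -12.0, -13.0)
assert loss < math.log(2.0)
```

The other methods the post compares reshape this objective (e.g. dropping the reference model or changing the link function) while keeping the same reward-free, preference-pair setup.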
An article that presents a novel method of emulating the attention mechanism in transformer models using a fully convolutional network, which achieves superior performance and efficiency compared to conventional transformers.
An article that explores and compares the performance of a quantum neural network against a simple multilayer perceptron for a stock pricing task.
An article that explores how deep learning techniques can enhance single-cell sequencing, a method that reveals the diversity and complexity of individual cells in living organisms.
Libraries & Code
Lightweight inference library for ONNX files, written in C++.
An end-to-end speech processing toolkit covering end-to-end speech recognition, text-to-speech, speech translation, speech enhancement, etc.
An AI-Powered tool for automated pull request analysis, feedback, suggestions and more.
Papers & Publications
Diffusion models have exhibited exceptional performance in text-to-image generation and editing. However, existing methods often face challenges when handling complex text prompts that involve multiple objects with multiple attributes and relationships. In this paper, we propose a brand new training-free text-to-image generation/editing framework, namely Recaption, Plan and Generate (RPG), harnessing the powerful chain-of-thought reasoning ability of multimodal LLMs to enhance the compositionality of text-to-image diffusion models. Our approach employs the MLLM as a global planner to decompose the process of generating complex images into multiple simpler generation tasks within subregions. We propose complementary regional diffusion to enable region-wise compositional generation. Furthermore, we integrate text-guided image generation and editing within the proposed RPG in a closed-loop fashion, thereby enhancing generalization ability. Extensive experiments demonstrate our RPG outperforms state-of-the-art text-to-image diffusion models, including DALL-E 3 and SDXL, particularly in multi-category object composition and text-image semantic alignment. Notably, our RPG framework exhibits wide compatibility with various MLLM architectures (e.g., MiniGPT-4) and diffusion backbones (e.g., ControlNet).
This work presents Depth Anything, a highly practical solution for robust monocular depth estimation. Without pursuing novel technical modules, we aim to build a simple yet powerful foundation model dealing with any images under any circumstances. To this end, we scale up the dataset by designing a data engine to collect and automatically annotate large-scale unlabeled data (~62M), which significantly enlarges the data coverage and thus is able to reduce the generalization error. We investigate two simple yet effective strategies that make data scaling-up promising. First, a more challenging optimization target is created by leveraging data augmentation tools. It compels the model to actively seek extra visual knowledge and acquire robust representations. Second, an auxiliary supervision is developed to enforce the model to inherit rich semantic priors from pre-trained encoders. We evaluate its zero-shot capabilities extensively, including six public datasets and randomly captured photos. It demonstrates impressive generalization ability. Further, through fine-tuning it with metric depth information from NYUv2 and KITTI, new SOTAs are set. Our better depth model also results in a better depth-conditioned ControlNet.
Thanks for reading Deep Learning Weekly! Subscribe for free to receive new posts and support my work.