Deep Learning Weekly: Issue #253
Photonic neural networks that can classify images in less than 570 picoseconds, AI ushering in a new scientific revolution, deploying transformers on the Apple Neural Engine, and more.
This week in deep learning, we bring you photonic neural networks that can classify images in less than 570 picoseconds, AI ushering in a new scientific revolution, deploying transformers on the Apple Neural Engine, and a paper on modern Hopfield networks for tabular data.
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Real estate technology company Doma helps speed the closing of home purchases with machine learning models trained on NVIDIA GPUs.
In a new study, researchers have developed a photonic deep neural network that can directly analyze images without the need for a clock, sensor, or large memory modules. It can classify an image in less than 570 picoseconds, which is comparable to a single clock cycle in state-of-the-art microchips.
Landing AI announces LandingEdge, which customers can use to deploy deep learning-based vision inspection on their production floor.
Vayyar, a company developing radar-imaging sensor technologies, today announced that it raised $108 million in a Series E round led by Koch Disruptive Technologies.
An article demonstrating how to use a tool like dbt to develop a data pipeline that performs feature engineering, trains models, and makes predictions, all without moving data out of the database.
A joint recommendation from Cohere, OpenAI, and AI21 Labs of several key principles to help providers of large language models mitigate the risks of this technology and achieve its full promise of augmenting human capabilities.
In this post, you’ll see how to use SageMaker Serverless Inference to reduce cost when you deploy an ML model as part of the testing phase of your MLOps pipeline.
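As a rough sketch of the serverless piece, the snippet below builds the production-variant entry that a SageMaker `CreateEndpointConfig` call expects when `ServerlessConfig` is used. The variant name, memory size, and concurrency values are illustrative assumptions, not taken from the post:

```python
def serverless_endpoint_config(model_name: str,
                               memory_mb: int = 2048,
                               max_concurrency: int = 5) -> dict:
    """Build one ProductionVariants entry for a serverless SageMaker endpoint.

    Shape follows the CreateEndpointConfig API; sizing values here are
    illustrative (SageMaker allows 1024-6144 MB in 1024 MB steps).
    """
    allowed = (1024, 2048, 3072, 4096, 5120, 6144)
    if memory_mb not in allowed:
        raise ValueError(f"memory_mb must be one of {allowed}")
    return {
        "VariantName": "AllTraffic",
        "ModelName": model_name,
        "ServerlessConfig": {
            "MemorySizeInMB": memory_mb,
            "MaxConcurrency": max_concurrency,
        },
    }
```

The resulting dict would be passed as one element of `ProductionVariants` to the boto3 `sagemaker` client's `create_endpoint_config` call.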
In this article, we will try to understand the different categories of automated testing and how to make ML projects better with each.
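To give a concrete flavor of one such category — unit tests for data-processing code — here is a minimal sketch; the `zscore` helper and its expected behavior are illustrative assumptions, not taken from the article:

```python
import numpy as np

def zscore(x: np.ndarray) -> np.ndarray:
    """Illustrative preprocessing step: scale each column to zero mean, unit variance."""
    std = x.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # guard against constant columns
    return (x - x.mean(axis=0)) / std

def test_zscore_preserves_shape_and_centers():
    x = np.array([[1.0, 2.0], [3.0, 2.0], [5.0, 2.0]])
    z = zscore(x)
    assert z.shape == x.shape
    assert np.allclose(z.mean(axis=0), 0.0)
    assert np.allclose(z[:, 1], 0.0)  # constant column maps to zeros
```

Run with `pytest`; equivalent checks can cover data schemas, model training smoke tests, and end-to-end pipeline behavior.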
A comprehensive blog on how AI is ushering in a new scientific revolution by making remarkable breakthroughs in a number of fields, unlocking new approaches to science, and accelerating the pace of science and innovation.
An article providing generalizable guidance to developers on optimizing their models for Apple Neural Engine execution.
An article briefly describing the high-level intuition behind GANs, along with a technical guide on building a small demo around a pre-trained CryptoPunks GAN.
Libraries & Code
Deepchecks is a Python package for comprehensively validating your machine learning models and data with minimal effort.
Quantus is an easy-to-use yet comprehensive toolbox for quantitative evaluation of neural network explanations — including 25+ different metrics.
A dataset for implicit hate speech detection.
Papers & Publications
Large-scale pretrained transformers have created milestones in text (GPT-3) and text-to-image (DALL-E and CogView) generation. Their application to video generation still faces many challenges: the potentially huge computation cost makes training from scratch unaffordable, and the scarcity and weak relevance of text-video datasets hinder the model from understanding complex movement semantics. In this work, we present CogVideo, a 9B-parameter transformer trained by inheriting a pretrained text-to-image model, CogView2. We also propose a multi-frame-rate hierarchical training strategy to better align text and video clips. As (probably) the first open-source large-scale pretrained text-to-video model, CogVideo outperforms all publicly available models by a large margin in machine and human evaluations.
While Deep Learning excels in structured data as encountered in vision and natural language processing, it has failed to meet expectations on tabular data. For tabular data, Support Vector Machines (SVMs), Random Forests, and Gradient Boosting are the best-performing techniques, with Gradient Boosting in the lead. Recently, we saw a surge of Deep Learning methods that were tailored to tabular data but still underperform compared to Gradient Boosting on small-sized datasets. We suggest "Hopular," a novel Deep Learning architecture for medium- and small-sized datasets, where each layer is equipped with continuous modern Hopfield networks. The modern Hopfield networks use stored data to identify feature-feature, feature-target, and sample-sample dependencies. Hopular's novelty is that every layer can directly access the original input as well as the whole training set via stored data in the Hopfield networks. Therefore, Hopular can step-wise update its current model and the resulting prediction at every layer like standard iterative learning algorithms. In experiments on small-sized tabular datasets with fewer than 1,000 samples, Hopular surpasses Gradient Boosting, Random Forests, SVMs, and in particular several Deep Learning methods. In experiments on medium-sized tabular data with about 10,000 samples, Hopular outperforms XGBoost, CatBoost, LightGBM, and a state-of-the-art Deep Learning method designed for tabular data. Thus, Hopular is a strong alternative to these methods on tabular data.
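The retrieval step those layers rely on can be sketched in a few lines. Below is a hedged NumPy illustration of one update of a continuous modern Hopfield network, following the well-known update rule ξ_new = X softmax(β Xᵀ ξ); the dimensions, β, and noise level are arbitrary choices, and this is not Hopular's actual implementation:

```python
import numpy as np

def hopfield_retrieve(X: np.ndarray, xi: np.ndarray, beta: float = 8.0) -> np.ndarray:
    """One update of a continuous modern Hopfield network.

    Columns of X are stored patterns; xi is the query (state) vector.
    Returns X @ softmax(beta * X.T @ xi).
    """
    scores = beta * (X.T @ xi)
    scores -= scores.max()        # shift for numerical stability
    p = np.exp(scores)
    p /= p.sum()                  # softmax over stored patterns
    return X @ p

# Store three random patterns as columns, then query with a noisy copy of the first.
rng = np.random.default_rng(0)
X = rng.standard_normal((16, 3))
query = X[:, 0] + 0.1 * rng.standard_normal(16)
retrieved = hopfield_retrieve(X, query)
```

With a large enough β, a single update essentially snaps the noisy query back to the nearest stored pattern, which is the mechanism Hopular's layers use to pull in relevant training samples and features.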
Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to address this problem by trading off model quality to reduce the compute complexity, but often do not achieve wall-clock speedup. We argue that a missing principle is making attention algorithms IO-aware -- accounting for reads and writes between levels of GPU memory. We propose FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM. We analyze the IO complexity of FlashAttention, showing that it requires fewer HBM accesses than standard attention, and is optimal for a range of SRAM sizes. We also extend FlashAttention to block-sparse attention, yielding an approximate attention algorithm that is faster than any existing approximate attention method. FlashAttention trains Transformers faster than existing baselines: 15% end-to-end wall-clock speedup on BERT-large (seq. length 512) compared to the MLPerf 1.1 training speed record, 3× speedup on GPT-2 (seq. length 1K), and 2.4× speedup on long-range arena (seq. length 1K-4K). FlashAttention and block-sparse FlashAttention enable longer context in Transformers, yielding higher quality models (0.7 better perplexity on GPT-2 and 6.4 points of lift on long-document classification) and entirely new capabilities: the first Transformers to achieve better-than-chance performance on the Path-X challenge (seq. length 16K, 61.4% accuracy) and Path-256 (seq. length 64K, 63.1% accuracy).
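The tiling idea can be illustrated — in NumPy rather than CUDA, and without FlashAttention's actual kernel machinery — by computing exact attention one key/value block at a time with an online softmax, so the full N × N score matrix is never materialized. Block size and shapes below are arbitrary:

```python
import numpy as np

def standard_attention(Q, K, V):
    """Reference implementation: materializes the full N x N score matrix."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def tiled_attention(Q, K, V, block=32):
    """Exact attention computed block-by-block with an online (streaming)
    softmax, in the spirit of FlashAttention's tiling: only O(N * d) extra
    state is kept, and the N x N matrix never exists in full."""
    N, d = Q.shape
    out = np.zeros_like(Q)
    m = np.full(N, -np.inf)   # running row-wise max of scores
    l = np.zeros(N)           # running softmax normalizer
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = Q @ Kb.T / np.sqrt(d)                 # N x block partial scores
        m_new = np.maximum(m, S.max(axis=-1))
        scale = np.exp(m - m_new)                 # rescale earlier partial sums
        P = np.exp(S - m_new[:, None])
        out = out * scale[:, None] + P @ Vb
        l = l * scale + P.sum(axis=-1)
        m = m_new
    return out / l[:, None]
```

On GPU, the point of this ordering is that each tile lives in fast on-chip SRAM, cutting reads/writes to HBM; the NumPy version only demonstrates that the blockwise computation is exact, not the IO savings.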