Deep Learning Weekly: Issue #257
PyTorch 1.12, which includes a new dataframe library; Google's Minerva model for quantitative reasoning; DALL·E 2 pre-training mitigations; and more
Hey Folks,
This week in deep learning, we bring you PyTorch 1.12 (with a new dataframe library and a deep learning compiler), Google's Minerva model for quantitative reasoning, DALL·E 2 pre-training mitigations, and a paper on teaching BERT to wait in order to detect disfluencies in real time.
You may also enjoy MLflow 2.0 with MLflow Pipelines, a comparison of pipeline orchestration tools, a Disco Diffusion library, a paper on an efficiently amalgamated CNN-Transformer architecture for mobile vision applications, and more.
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Introducing MLflow Pipelines with MLflow 2.0
Databricks announces that MLflow 2.0 is coming soon and will include MLflow Pipelines, making it simple for teams to automate and scale their ML development by building production-grade ML pipelines.
PyTorch 1.12: TorchArrow, Functional API for Modules and nvFuser, are now available
The PyTorch Team announces the release of PyTorch 1.12, which includes TorchArrow, a beta library for machine learning preprocessing over batch data, and nvFuser, a deep learning compiler.
Deploying Real-Time AI Risk Prediction for Kidney Patients
Taipei Veterans General Hospital is analyzing streaming data during dialysis procedures with the NVIDIA Jetson edge AI platform.
Minerva: Solving Quantitative Reasoning Problems with Language Models
Google AI presents Minerva, a language model capable of solving mathematical and scientific questions using step-by-step reasoning.
MLOps
Writing Continuous Applications with Structured Streaming Python APIs in Apache Spark
This notebook shows how one can train a model using Apache Spark and MLlib, then deploy that model using Spark's structured streaming for making fraudulent transaction predictions as a continuous application.
Getting started: Serving PyTorch predictions with a custom container
This Google Cloud tutorial shows you how to use a custom container running TorchServe to deploy a PyTorch model that serves online predictions.
Shipping to Production - The Pragmatic Engineer
An article that covers the extremes of shipping to production, typical processes at different types of companies, principles and tools for shipping to production responsibly, and more.
Kedro vs ZenML vs Metaflow: Which Pipeline Orchestration Tool Should You Choose?
An article comparing different pipeline orchestration tools, particularly Kedro, ZenML, and Metaflow.
Learning
How to Evaluate Clustering Models in Python
This article discusses different clustering algorithms and how to evaluate their results using the Silhouette Score, the Calinski-Harabasz Index, and the Davies-Bouldin Index.
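As a rough illustration of the first of those metrics, here is a minimal, dependency-free sketch of the silhouette score. It is not taken from the linked article: the function name, the toy points, and the two labelings are all assumptions made for the example. For each point it compares the mean distance to its own cluster (a) with the mean distance to the nearest other cluster (b) and averages (b - a) / max(a, b) over all points.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    # Group point indices by cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette score needs at least two clusters")

    def mean_dist(i, members):
        # Mean distance from point i to the given members, excluding i itself.
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes a silhouette of 0
        a = mean_dist(i, own)
        b = min(
            mean_dist(i, members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


# Hypothetical toy data: two well-separated blobs of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_labels = [0, 0, 0, 1, 1, 1]  # matches the blobs
bad_labels = [0, 1, 0, 1, 0, 1]   # mixes the blobs

good = silhouette_score(points, good_labels)
bad = silhouette_score(points, bad_labels)
print(f"good clustering: {good:.3f}, bad clustering: {bad:.3f}")
```

Scores near 1 indicate compact, well-separated clusters, while scores near 0 or below suggest overlapping or misassigned points. In practice, scikit-learn's `sklearn.metrics` module provides `silhouette_score`, `calinski_harabasz_score`, and `davies_bouldin_score`, which are preferable to a hand-rolled version.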
DALL·E 2 Pre-Training Mitigations
A post focusing on DALL·E 2 pre-training mitigations, a subset of the content policy guardrails that directly modify the data DALL·E 2 learns from.
Audio Classification with Deep Learning
A tutorial on audio classification in a Gradient Notebook using TensorFlow.
Accelerate Large Model Training using DeepSpeed
A post on how to use the Accelerate library to train large models, giving users access to the ZeRO features of DeepSpeed.
Libraries & Code
jina-ai/discoart: Create Disco Diffusion artworks in one line
A fully optimized Python library for creating Disco Diffusion artworks in one line.
Pen and paper exercises in machine learning
This is a collection of (mostly) pen-and-paper exercises in machine learning. Each exercise comes with a detailed solution.
Papers & Publications
LViT: Language meets Vision Transformer in Medical Image Segmentation
Abstract:
Deep learning has been widely used in medical image segmentation and other aspects. However, the performance of existing medical image segmentation models has been limited by the challenge of obtaining a sufficient number of high-quality data with the high cost of data annotation. To overcome the limitation, we propose a new vision-language medical image segmentation model LViT (Language meets Vision Transformer). In our model, medical text annotation is introduced to compensate for the quality deficiency in image data. In addition, the text information can guide the generation of pseudo labels to a certain extent and further guarantee the quality of pseudo labels in semi-supervised learning. We also propose the Exponential Pseudo label Iteration mechanism (EPI) to help extend the semi-supervised version of LViT and the Pixel-Level Attention Module (PLAM) to preserve local features of images. In our model, LV (Language-Vision) loss is designed to supervise the training of unlabeled images using text information directly. To validate the performance of LViT, we construct multimodal medical segmentation datasets (image + text) containing pathological images, X-rays, etc. Experimental results show that our proposed LViT has better segmentation performance in both fully and semi-supervised conditions.
EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications
Abstract:
In the pursuit of achieving ever-increasing accuracy, large and complex neural networks are usually developed. Such models demand high computational resources and therefore cannot be deployed on edge devices. It is of great interest to build resource-efficient general purpose networks due to their usefulness in several application areas. In this work, we strive to effectively combine the strengths of both CNN and Transformer models and propose a new efficient hybrid architecture EdgeNeXt. Specifically in EdgeNeXt, we introduce split depth-wise transpose attention (SDTA) encoder that splits input tensors into multiple channel groups and utilizes depth-wise convolution along with self-attention across channel dimensions to implicitly increase the receptive field and encode multi-scale features. Our extensive experiments on classification, detection and segmentation tasks, reveal the merits of the proposed approach, outperforming state-of-the-art methods with comparatively lower compute requirements. Our EdgeNeXt model with 1.3M parameters achieves 71.2% top-1 accuracy on ImageNet-1K, outperforming MobileViT with an absolute gain of 2.2% with 28% reduction in FLOPs. Further, our EdgeNeXt model with 5.6M parameters achieves 79.4% top-1 accuracy on ImageNet-1K.
Teaching BERT to Wait: Balancing Accuracy and Latency for Streaming Disfluency Detection
Abstract:
In modern interactive speech-based systems, speech is consumed and transcribed incrementally prior to having disfluencies removed. This post-processing step is crucial for producing clean transcripts and high performance on downstream tasks (e.g. machine translation). However, most current state-of-the-art NLP models such as the Transformer operate non-incrementally, potentially causing unacceptable delays. We propose a streaming BERT-based sequence tagging model that, combined with a novel training objective, is capable of detecting disfluencies in real-time while balancing accuracy and latency. This is accomplished by training the model to decide whether to immediately output a prediction for the current input or to wait for further context. Essentially, the model learns to dynamically size its lookahead window. Our results demonstrate that our model produces comparably accurate predictions and does so sooner than our baselines, with lower flicker. Furthermore, the model attains state-of-the-art latency and stability scores when compared with recent work on incremental disfluency detection.
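To make the "emit now or wait for more context" idea concrete, here is a toy sketch of a streaming tagging loop. It is not the paper's method: the paper trains the model to make the wait decision, whereas this sketch uses a fixed confidence threshold. The names `stream_tags` and `toy_score`, the threshold, the lookahead cap, and the repeated-word heuristic are all assumptions made for the example.

```python
def toy_score(prefix, i):
    """Hypothetical scorer: tag token i as disfluent ("D") if the next token
    repeats it, otherwise fluent ("F"). Confidence is low until the next
    token has arrived."""
    if i + 1 < len(prefix):
        tag = "D" if prefix[i] == prefix[i + 1] else "F"
        return tag, 1.0
    return "F", 0.5  # no right context yet, so only a guess


def stream_tags(tokens, score_fn, threshold=0.8, max_lookahead=2):
    """Consume tokens one at a time and emit (index, tag, lookahead_used).

    A tag is emitted once the scorer is confident enough, the lookahead
    budget is exhausted, or the stream has ended; otherwise we wait.
    """
    emitted = []
    next_to_tag = 0
    for arrived in range(1, len(tokens) + 1):
        stream_ended = arrived == len(tokens)
        while next_to_tag < arrived:
            lookahead = arrived - 1 - next_to_tag  # tokens seen past this one
            tag, confidence = score_fn(tokens[:arrived], next_to_tag)
            ready = confidence >= threshold or lookahead >= max_lookahead
            if ready or stream_ended:
                emitted.append((next_to_tag, tag, lookahead))
                next_to_tag += 1
            else:
                break  # wait for the next token before deciding
    return emitted


tokens = ["i", "i", "want", "a", "a", "flight"]
for index, tag, lookahead in stream_tags(tokens, toy_score):
    print(tokens[index], tag, f"lookahead={lookahead}")
```

Raising `threshold` or `max_lookahead` trades latency for accuracy: the loop waits longer before committing, which also reduces the need to revise earlier outputs.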