Deep Learning Weekly: Issue 345
DeepMind's Scalable Instructable Multiworld Agent, Implementing generative AI with speed and safety, Can perceptual similarity metrics be used to compare adversarial attacks?, and more!
This week in deep learning, we bring you DeepMind's Scalable Instructable Multiworld Agent (SIMA), Implementing generative AI with speed and safety, Can perceptual similarity metrics be used to compare adversarial attacks?, and a paper on LLM4Decompile: Decompiling Binary Code with Large Language Models.
You may also enjoy Building Meta’s GenAI Infrastructure, Calibration: Why Model Scores Aren’t Probabilities and How to Generate Them?, a paper on DeepSeek-VL: Towards Real-World Vision-Language Understanding, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Building Meta’s GenAI Infrastructure
Meta has announced the construction of two large-scale AI clusters, each with 24,576 GPUs, to support the company’s long-term vision of building open and responsibly created artificial general intelligence.
DeepMind's Scalable Instructable Multiworld Agent (SIMA)
DeepMind presents new research on a Scalable Instructable Multiworld Agent (SIMA) that can follow natural-language instructions to carry out tasks in a variety of video game settings.
Introducing Stable Video 3D: Quality Novel View Synthesis and 3D Generation from Single Images
Stability AI released Stable Video 3D (SV3D), a generative model based on Stable Video Diffusion that delivers improved quality and view consistency for novel view synthesis and 3D generation from single images.
New algorithm unlocks high-resolution insights for computer vision
FeatUp, developed by MIT CSAIL researchers, is a model-agnostic algorithm that boosts the spatial resolution of features from any deep network or visual foundation model used in computer vision systems.
Nvidia unveils Project GR00T AI foundation model for humanoid robots
Nvidia announced Project GR00T, a general-purpose AI foundation model for bipedal humanoid robots, designed to advance its work on embodied AI.
MLOps & LLMOps
Implementing generative AI with speed and safety
McKinsey’s roadmap for mitigating risks in enterprise GenAI implementations.
Navigating Transfer Learning with Comet
This article dives into how you can track, observe, and visualize transfer learning experiments with Comet.
Netflix’s blog details the transition from a rule-based classifier to a machine learning-powered auto-remediation system for their data platform, enhancing efficiency and reliability.
PDF-Based Question Answering with Amazon Bedrock and Haystack
A step-by-step guide for creating a generative question answering application using Amazon Bedrock, Haystack, and OpenSearch.
Learning
Can perceptual similarity metrics be used to compare adversarial attacks?
An article that discusses the use of perceptual similarity metrics for comparing adversarial attacks.
Calibration: Why Model Scores Aren’t Probabilities and How to Generate Them?
An article that explains the importance of calibrating machine learning models to ensure that the output probabilities accurately reflect the true likelihood of events.
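As a concrete illustration of the idea (a minimal sketch using scikit-learn, not code from the linked article), Platt scaling can be applied to an uncalibrated classifier and the result inspected with a reliability curve:

# A minimal sketch (not from the linked article): calibrating a classifier's
# scores with Platt scaling via scikit-learn, then inspecting a reliability curve.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV, calibration_curve

X, y = make_classification(n_samples=5000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

raw = RandomForestClassifier(random_state=0).fit(X_train, y_train)
calibrated = CalibratedClassifierCV(RandomForestClassifier(random_state=0),
                                    method="sigmoid", cv=5).fit(X_train, y_train)

# Fraction of positives vs. mean predicted score per bin; a well-calibrated
# model's points track the diagonal.
for name, model in [("raw", raw), ("calibrated", calibrated)]:
    frac_pos, mean_pred = calibration_curve(
        y_test, model.predict_proba(X_test)[:, 1], n_bins=10)
    print(name, list(zip(mean_pred.round(2), frac_pos.round(2))))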
Fine-Tune & Evaluate LLMs in 2024 with Amazon SageMaker
A blog post on how to fine-tune and evaluate open LLMs from Hugging Face using Amazon SageMaker.
Libraries & Code
Draw a UI and make it real using GPT-4
Open Source LLM toolkit to build trustworthy LLM applications.
jphall663/awesome-machine-learning-interpretability
A curated list of awesome responsible machine learning resources.
Papers & Publications
LLM4Decompile: Decompiling Binary Code with Large Language Models
Abstract:
Decompilation aims to restore compiled code to human-readable source code, but struggles with details like names and structure. Large language models (LLMs) show promise for programming tasks, motivating their application to decompilation. However, there does not exist any open-source LLM for decompilation. Moreover, existing decompilation evaluation systems mainly consider token-level accuracy and largely ignore code executability, which is the most important feature of any program. Therefore, we release the first open-access decompilation LLMs ranging from 1B to 33B pre-trained on 4 billion tokens of C source code and the corresponding assembly code. The open-source LLMs can serve as baselines for further development in the field. To ensure practical program evaluation, we introduce Decompile-Eval, the first dataset that considers re-compilability and re-executability for decompilation. The benchmark emphasizes the importance of evaluating the decompilation model from the perspective of program semantics. Experiments indicate that our LLM4Decompile has demonstrated the capability to accurately decompile 21% of the assembly code, which achieves a 50% improvement over GPT-4.
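As a rough illustration of what re-compilability and re-executability checks involve (an assumed harness, not the actual Decompile-Eval code), a decompiled C function can be concatenated with a unit-test main() and judged by whether it compiles with gcc and runs to completion:

# Illustrative sketch of the re-compilability / re-executability idea behind
# Decompile-Eval; the harness structure and names are assumptions, not the
# benchmark's actual code.
import os
import subprocess
import tempfile

def evaluate_decompilation(decompiled_c: str, test_main_c: str) -> dict:
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "candidate.c")
        binary = os.path.join(tmp, "candidate")
        with open(src, "w") as f:
            f.write(decompiled_c + "\n" + test_main_c)

        # Re-compilability: does the model's output compile at all?
        compile_res = subprocess.run(["gcc", src, "-o", binary],
                                     capture_output=True, timeout=30)
        if compile_res.returncode != 0:
            return {"recompilable": False, "re_executable": False}

        # Re-executability: does the recompiled code pass the tests
        # (main() returns 0 only if its assertions hold)?
        run_res = subprocess.run([binary], capture_output=True, timeout=10)
        return {"recompilable": True, "re_executable": run_res.returncode == 0}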
DeepSeek-VL: Towards Real-World Vision-Language Understanding
Abstract:
We present DeepSeek-VL, an open-source Vision-Language (VL) Model designed for real-world vision and language understanding applications. Our approach is structured around three key dimensions:
We strive to ensure our data is diverse, scalable, and extensively covers real-world scenarios including web screenshots, PDFs, OCR, charts, and knowledge-based content, aiming for a comprehensive representation of practical contexts. Further, we create a use case taxonomy from real user scenarios and construct an instruction tuning dataset accordingly. The fine-tuning with this dataset substantially improves the model's user experience in practical applications.

Considering efficiency and the demands of most real-world scenarios, DeepSeek-VL incorporates a hybrid vision encoder that efficiently processes high-resolution images (1024 x 1024), while maintaining a relatively low computational overhead. This design choice ensures the model's ability to capture critical semantic and detailed information across various visual tasks.

We posit that a proficient Vision-Language Model should, foremost, possess strong language abilities. To ensure the preservation of LLM capabilities during pretraining, we investigate an effective VL pretraining strategy by integrating LLM training from the beginning and carefully managing the competitive dynamics observed between vision and language modalities.
The DeepSeek-VL family (both 1.3B and 7B models) showcases superior user experiences as a vision-language chatbot in real-world applications, achieving state-of-the-art or competitive performance across a wide range of visual-language benchmarks at the same model size while maintaining robust performance on language-centric benchmarks. We have made both 1.3B and 7B models publicly accessible to foster innovations based on this foundation model.
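To make the hybrid-encoder idea concrete, here is a schematic PyTorch sketch (placeholder modules and assumed shapes, not DeepSeek-VL's actual architecture) in which a low-resolution branch supplies global semantics, a high-resolution branch adds fine detail, and the fused features become visual tokens for the language model:

# Schematic sketch of a hybrid vision encoder; module choices and dimensions
# here are assumptions for illustration, not DeepSeek-VL's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridVisionEncoder(nn.Module):
    def __init__(self, sem_dim=1024, det_dim=256, d_model=2048, grid=24):
        super().__init__()
        self.grid = grid
        # Placeholder branches; in practice these would be pretrained encoders.
        self.semantic_branch = nn.Conv2d(3, sem_dim, kernel_size=16, stride=16)  # 384 -> 24x24
        self.detail_branch = nn.Conv2d(3, det_dim, kernel_size=16, stride=16)    # 1024 -> 64x64
        self.proj = nn.Linear(sem_dim + det_dim, d_model)

    def forward(self, image_hi_res: torch.Tensor) -> torch.Tensor:
        # Low-resolution branch sees a downsampled view for global semantics.
        image_lo_res = F.interpolate(image_hi_res, size=(384, 384),
                                     mode="bilinear", align_corners=False)
        sem = self.semantic_branch(image_lo_res)                      # (B, sem_dim, 24, 24)
        det = self.detail_branch(image_hi_res)                        # (B, det_dim, 64, 64)
        # Pool detail features onto the semantic grid before fusion.
        det = F.adaptive_avg_pool2d(det, output_size=(self.grid, self.grid))
        fused = torch.cat([sem, det], dim=1).flatten(2).transpose(1, 2)
        return self.proj(fused)                                       # (B, 576, d_model) visual tokens

tokens = HybridVisionEncoder()(torch.randn(1, 3, 1024, 1024))
print(tokens.shape)  # torch.Size([1, 576, 2048])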
Chronos: Learning the Language of Time Series
Abstract:
We introduce Chronos, a simple yet effective framework for pretrained probabilistic time series models. Chronos tokenizes time series values using scaling and quantization into a fixed vocabulary and trains existing transformer-based language model architectures on these tokenized time series via the cross-entropy loss. We pretrained Chronos models based on the T5 family (ranging from 20M to 710M parameters) on a large collection of publicly available datasets, complemented by a synthetic dataset that we generated via Gaussian processes to improve generalization. In a comprehensive benchmark consisting of 42 datasets, and comprising both classical local models and deep learning methods, we show that Chronos models: (a) significantly outperform other methods on datasets that were part of the training corpus; and (b) have comparable and occasionally superior zero-shot performance on new datasets, relative to methods that were trained specifically on them. Our results demonstrate that Chronos models can leverage time series data from diverse domains to improve zero-shot accuracy on unseen forecasting tasks, positioning pretrained models as a viable tool to greatly simplify forecasting pipelines.
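To illustrate the tokenization step described in the abstract (a minimal sketch, not the Chronos codebase; bin ranges and vocabulary size are assumptions), a series can be mean-scaled and quantized into a fixed vocabulary of bins whose ids a language model is then trained to predict with cross-entropy:

# Illustrative sketch of mean scaling + quantization into a fixed vocabulary;
# parameter values are assumptions, not the Chronos implementation.
import numpy as np

def tokenize_series(values: np.ndarray, n_bins: int = 4094,
                    low: float = -15.0, high: float = 15.0):
    scale = np.abs(values).mean() or 1.0            # mean scaling
    scaled = values / scale
    bin_edges = np.linspace(low, high, n_bins - 1)  # fixed vocabulary of value bins
    tokens = np.digitize(scaled, bin_edges)         # token ids in [0, n_bins - 1]
    return tokens, scale

def detokenize(tokens: np.ndarray, scale: float, n_bins: int = 4094,
               low: float = -15.0, high: float = 15.0):
    centers = np.linspace(low, high, n_bins)        # map ids back to representative values
    return centers[tokens] * scale

series = np.array([10.0, 12.0, 9.5, 11.0, 50.0])
tokens, scale = tokenize_series(series)
print(tokens, detokenize(tokens, scale).round(2))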