Deep Learning Weekly: Issue 382
Introducing Rerank 3.5: Precise AI Search, Structured Generation for LLM-as-a-Judge Evaluations, a paper on Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions, and many more!
This week in deep learning, we bring you Introducing Rerank 3.5: Precise AI Search, Structured Generation for LLM-as-a-Judge Evaluations, and a paper on Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions.
You may also enjoy Liquid AI’s new STAR model architecture outshines Transformer efficiency, Reward Hacking in Reinforcement Learning, a paper on SwitchLight: Co-design of Physics-driven Architecture and Pre-training Framework for Human Portrait Relighting, and more!
As always, happy reading and hacking. If you have something you think should be in next week's issue, find us on Twitter: @dl_weekly.
Until next week!
Industry
Introducing Rerank 3.5: Precise AI Search
The Cohere team introduced Rerank 3.5, their latest AI search foundation model which now includes enhanced reasoning skills, broad data capabilities, and improved multilingual performance.
Liquid AI’s new STAR model architecture outshines Transformer efficiency
Liquid AI has introduced STAR (Synthesis of Tailored Architectures), an innovative framework designed to automate the generation and optimization of AI model architectures.
Introducing Amazon Nova, our new generation of foundation models
Amazon unveiled Amazon Nova, a new generation of foundation models, which includes fast, text-only models all the way up to video generation models.
Photonic processor could enable ultrafast AI computations with extreme energy efficiency
Researchers demonstrated a fully integrated photonic processor that can perform all key computations of a deep neural network optically on the chip.
Introducing Weaviate Embeddings
Weaviate announced that Weaviate Embeddings, a new embedding service in Weaviate Cloud, is now available in preview.
Introducing Veo and Imagen 3 on Vertex AI
Google is now offering customers access to Veo, its video generation model, and Imagen 3, its image generation model, on Vertex AI.
MLOps & LLMOps
Build an Agentic Video Workflow with Video Search and Summarization
A detailed blog post about building a visual AI agent workflow with the NVIDIA AI Blueprint for video search and summarization, the Morpheus SDK, and Riva, enabling a hands-free experience that can answer questions about video and image content.
Create a self-escalating chatbot in Conversational Agents using Webhook and Generator
A blog post that shows you how to create a self-escalating chatbot using Google Cloud's generative AI offerings, such as Vertex AI and Conversational Agents (Dialogflow CX).
Deploy QwQ-32B-Preview the best open Reasoning Model on AWS with Hugging Face
A guide on how to deploy the QwQ-32B-Preview model on Amazon SageMaker using the Hugging Face LLM Deep Learning Container.
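As rough orientation (a sketch under assumed settings, not the guide's exact configuration), a deployment along these lines typically boils down to a few lines of the SageMaker Python SDK; the container version, instance type, and environment values below are assumptions, so defer to the guide for the real ones.

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()  # IAM role with SageMaker permissions

# Hugging Face LLM (TGI) container; the version here is an assumption.
image_uri = get_huggingface_llm_image_uri("huggingface", version="2.3.1")

model = HuggingFaceModel(
    role=role,
    image_uri=image_uri,
    env={
        "HF_MODEL_ID": "Qwen/QwQ-32B-Preview",  # model to load from the Hub
        "SM_NUM_GPUS": "4",                      # shard across GPUs (assumed instance)
        "MAX_INPUT_LENGTH": "4096",
        "MAX_TOTAL_TOKENS": "8192",
    },
)

# Deploy to a multi-GPU instance; ml.g5.12xlarge is an assumption, not the guide's pick.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
    container_startup_health_check_timeout=900,
)

print(predictor.predict({"inputs": "Why is the sky blue?"}))
```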
Learning
Reward Hacking in Reinforcement Learning
A comprehensive blog post exploring the concept of reward hacking in reinforcement learning, particularly in the context of LLMs, including examples and potential mitigations.
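To make the failure mode concrete, here is a deliberately tiny, invented example (not taken from the post): a selector that greedily optimizes a proxy reward drifts away from the true objective the proxy was meant to approximate.

```python
# Toy illustration of reward hacking: the proxy reward (response length as a
# stand-in for "helpfulness") is gamed by padding, while the true objective
# (a concise, correct answer) is not improved. Candidates and rewards are invented.
candidates = [
    "Paris.",
    "The capital of France is Paris.",
    "The capital of France is Paris. " + "Allow me to elaborate at great length... " * 20,
]

def proxy_reward(text: str) -> float:
    return float(len(text))  # longer looks "more helpful" to the proxy

def true_reward(text: str) -> float:
    return 1.0 if "Paris" in text and len(text) < 200 else 0.0  # concise and correct

best = max(candidates, key=proxy_reward)
print("proxy-optimal answer length:", len(best))
print("true reward of that answer:", true_reward(best))  # 0.0: the proxy was hacked
```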
You could have designed state of the art positional encoding
A post that walks you through the step-by-step discovery of state-of-the-art positional encoding in transformer models.
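As a reference point for that walkthrough, here is a minimal NumPy sketch of the classic sinusoidal encoding from the original Transformer paper, the usual starting point before more modern schemes; the sequence length and dimensionality below are arbitrary.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    positions = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)   # (seq_len, d_model/2)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(seq_len=128, d_model=64)
print(pe.shape)  # (128, 64)
```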
Structured Generation for LLM-as-a-Judge Evaluations
An informative blog post about using structured generation with context-free grammars to improve the reliability and complexity of LLM-based evaluations, especially in tasks like hallucination detection and content moderation.
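The post's point is that constraining the judge's output to a fixed grammar or schema makes every evaluation reliably machine-readable. As a loose illustration (the schema and field names below are assumptions, not the post's), the expected output can be expressed as a Pydantic model and anything that fails validation rejected; grammar-constrained decoding libraries enforce the same structure at generation time.

```python
from enum import Enum
from pydantic import BaseModel, Field

# Hypothetical judge schema for a hallucination-detection evaluation.
class Verdict(str, Enum):
    faithful = "faithful"
    hallucinated = "hallucinated"

class JudgeResult(BaseModel):
    reasoning: str = Field(..., description="Brief justification before the verdict")
    verdict: Verdict
    confidence: float = Field(..., ge=0.0, le=1.0)

def parse_judge_output(raw: str) -> JudgeResult:
    """Accept only judge outputs that match the schema; raise otherwise."""
    return JudgeResult.model_validate_json(raw)

raw = '{"reasoning": "The summary adds a date absent from the source.", "verdict": "hallucinated", "confidence": 0.9}'
print(parse_judge_output(raw).verdict)
```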
HadaCore: Tensor Core Accelerated Hadamard Transform Kernel
A technical blog post introducing HadaCore, a Tensor Core accelerated Hadamard Transform kernel developed by IBM and Meta to speed up LLM quantization while maintaining accuracy.
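For background on the math being accelerated (this sketch illustrates the transform only, not the HadaCore kernel), the fast Walsh-Hadamard transform is an O(n log n) butterfly over a power-of-two-length vector:

```python
import numpy as np

def fwht(x: np.ndarray) -> np.ndarray:
    """Unnormalized fast Walsh-Hadamard transform via the iterative butterfly."""
    x = x.astype(np.float64).copy()
    n = x.shape[0]
    assert n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b   # butterfly: sum and difference
        h *= 2
    return x

x = np.array([1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0])
print(fwht(x))
```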
Libraries & Code
LLM-powered multiagent persona simulation for imagination enhancement and business insights.
Start building LLM-empowered multi-agent applications in an easier way.
A lightweight Python library for efficient automation of machine learning and AI operations.
Papers & Publications
Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions
Abstract:
OpenAI's o1 has sparked a surge of interest in the study of large reasoning models (LRMs). Building on this momentum, Marco-o1 not only focuses on disciplines with standard answers, such as mathematics, physics, and coding -- which are well-suited for reinforcement learning (RL) -- but also places greater emphasis on open-ended resolutions. We aim to address the question: "Can the o1 model effectively generalize to broader domains where clear standards are absent and rewards are challenging to quantify?" Marco-o1 is powered by Chain-of-Thought (CoT) fine-tuning, Monte Carlo Tree Search (MCTS), reflection mechanisms, and innovative reasoning strategies -- optimized for complex real-world problem-solving tasks.
SwitchLight: Co-design of Physics-driven Architecture and Pre-training Framework for Human Portrait Relighting
Abstract:
We introduce a co-designed approach for human portrait relighting that combines a physics-guided architecture with a pre-training framework. Drawing on the Cook-Torrance reflectance model, we have meticulously configured the architecture design to precisely simulate light-surface interactions. Furthermore, to overcome the limitation of scarce high-quality lightstage data, we have developed a self-supervised pre-training strategy. This novel combination of accurate physical modeling and expanded training dataset establishes a new benchmark in relighting realism.
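Since the architecture is grounded in the Cook-Torrance reflectance model, a minimal NumPy sketch of the standard Cook-Torrance specular term (GGX normal distribution, Schlick Fresnel, Smith geometry) may help ground the abstract; SwitchLight's exact parameterization is not reproduced here, and all values below are illustrative only.

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def cook_torrance_specular(n, l, v, roughness=0.3, f0=0.04):
    """Standard Cook-Torrance specular BRDF term: D * F * G / (4 (n.l)(n.v))."""
    n, l, v = map(normalize, (n, l, v))
    h = normalize(l + v)                      # half vector
    nl = max(float(np.dot(n, l)), 1e-4)
    nv = max(float(np.dot(n, v)), 1e-4)
    nh = max(float(np.dot(n, h)), 1e-4)
    vh = max(float(np.dot(v, h)), 1e-4)

    a2 = roughness ** 4                                            # alpha = roughness^2
    d = a2 / (np.pi * ((nh * nh) * (a2 - 1.0) + 1.0) ** 2)         # GGX distribution
    f = f0 + (1.0 - f0) * (1.0 - vh) ** 5                          # Schlick Fresnel
    k = (roughness + 1.0) ** 2 / 8.0                               # Smith-Schlick geometry
    g = (nl / (nl * (1 - k) + k)) * (nv / (nv * (1 - k) + k))

    return d * f * g / (4.0 * nl * nv)

print(cook_torrance_specular(n=np.array([0.0, 0.0, 1.0]),
                             l=np.array([0.3, 0.0, 1.0]),
                             v=np.array([-0.2, 0.1, 1.0])))
```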
Star Attention: Efficient LLM Inference over Long Sequences
Abstract:
Inference with Transformer-based Large Language Models (LLMs) on long sequences is both costly and slow due to the quadratic complexity of the self-attention mechanism. We introduce Star Attention, a two-phase block-sparse approximation that improves computational efficiency by sharding attention across multiple hosts while minimizing communication overhead. In the first phase, the context is processed using blockwise-local attention across hosts, in parallel. In the second phase, query and response tokens attend to all prior cached tokens through sequence-global attention. Star Attention integrates seamlessly with most Transformer-based LLMs trained with global attention, reducing memory requirements and inference time by up to 11x while preserving 95-100% of accuracy.
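A toy, single-process sketch of the two phases described above, under invented shapes and block size: phase 1 encodes each context block with attention restricted to the block (standing in for blockwise-local attention across hosts), and phase 2 lets query tokens attend to every cached token. The anchor blocks and distributed softmax aggregation of the actual method are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

d, block = 16, 8
rng = np.random.default_rng(0)
context = rng.normal(size=(4 * block, d))   # long context split into 4 blocks
query = rng.normal(size=(2, d))             # a short query

# Phase 1: blockwise-local attention -- each block only attends to itself.
encoded_blocks = []
for start in range(0, context.shape[0], block):
    blk = context[start:start + block]
    local = softmax(blk @ blk.T / np.sqrt(d)) @ blk
    encoded_blocks.append(local)
cache = np.concatenate(encoded_blocks, axis=0)   # stands in for the KV cache

# Phase 2: query tokens attend to all cached tokens (sequence-global attention).
out = softmax(query @ cache.T / np.sqrt(d)) @ cache
print(out.shape)  # (2, 16)
```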