Apple Machine Learning Research

Introducing the Third Generation of Apple’s Foundation Models

Mon, 08 Jun 2026 00:00:00 GMT

Our next generation of Apple Intelligence is centered around our users, integrated deeply into our operating systems, and powered by a bold new architecture with privacy at its core. At the heart of this architecture is our third generation of Apple Foundation Models (AFM), a family of five foundation models custom-built in collaboration with Google. These span from on-device models to server-based models running on Private Cloud Compute. Apple Foundation Models are built to unlock a wide range of helpful experiences for our users, like an entirely new Siri and intelligent tools that make…

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026

Thu, 28 May 2026 00:00:00 GMT

Apple is presenting new research at the annual IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), which takes place in person in Denver at the Colorado Convention Center from June 3 to June 7.

We are proud to sponsor the conference, which brings together the scientific and industrial research communities in computer vision and pattern recognition. Below is an overview of Apple’s participation at CVPR 2026.

VSAS-Bench: Real-Time Evaluation of Visual Streaming Assistant Models

Fri, 22 May 2026 00:00:00 GMT

Streaming vision-language models (VLMs) continuously generate responses given an instruction prompt and an online stream of input frames. This is a core mechanism for real-time visual assistants. Existing VLM frameworks predominantly assess models in offline settings. In contrast, the performance of a streaming VLM depends on additional metrics beyond pure video understanding, including proactiveness, which reflects the timeliness of the model’s responses, and consistency, which captures the robustness of its responses over time. To address this limitation, we propose VSAS-Bench, a new…

EpiCache: Episodic KV Cache Management for Long-Term Conversation on Resource-Constrained Environments

Tue, 19 May 2026 00:00:00 GMT

Modern large language models (LLMs) extend context lengths to millions of tokens, enabling coherent, personalized responses grounded in long conversational history. However, the Key-Value (KV) cache grows linearly with the extended dialogue history, causing the model’s memory footprint to quickly exceed device limits. While recent KV cache compression methods attempt to reduce memory usage, most apply cache eviction after processing the entire context, incurring unbounded peak memory usage. Additionally, query-dependent eviction narrows the cache semantics to a single query, leading to failure…
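The linear growth described above can be made concrete with a back-of-the-envelope estimate. The sketch below is not taken from the EpiCache work; it uses the standard per-token KV cache size for a transformer decoder, and the model dimensions (32 layers, 8 key-value heads, head dimension 128, 2-byte values) are hypothetical numbers chosen only for illustration.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Estimate the bytes needed to cache keys and values for num_tokens tokens."""
    # Factor 2: each layer stores one key tensor and one value tensor per token.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


# Hypothetical model dimensions, assumed for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

for tokens in (1_000, 10_000, 100_000):
    gib = kv_cache_bytes(tokens, LAYERS, KV_HEADS, HEAD_DIM) / 2**30
    print(f"{tokens:>7} tokens -> {gib:.2f} GiB")
```

Under these assumed dimensions, each token costs a fixed number of bytes, so doubling the conversation length doubles the cache, which is why an uncompressed cache eventually exceeds a fixed device memory budget.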

BalCapRL: A Balanced Framework for RL-Based MLLM Image Captioning

Mon, 11 May 2026 00:00:00 GMT

Image captioning is one of the most fundamental tasks in computer vision. Owing to its open-ended nature, it has received significant attention in the era of multimodal large language models (MLLMs). In pursuit of ever more detailed and accurate captions, recent work has increasingly turned to reinforcement learning (RL). However, existing captioning-RL methods and evaluation metrics often emphasize a narrow notion of caption quality, inducing trade-offs across core dimensions of captioning. For example, utility-oriented objectives can encourage noisy, hallucinated, or overlong captions that…

Large-Scale High-Quality 3D Gaussian Head Reconstruction from Multi-View Captures

Fri, 08 May 2026 00:00:00 GMT

We propose HeadsUp, a scalable feed-forward method for reconstructing high-quality 3D Gaussian heads from large-scale multi-camera setups. Our method employs an efficient encoder-decoder architecture that compresses input views into a compact latent representation. This latent representation is then decoded into a set of UV-parameterized 3D Gaussians anchored to a neutral head template. This UV representation decouples the number of 3D Gaussians from the number and resolution of input images, enabling training with many high-resolution input views. We train and evaluate our model on an…

RVPO: Risk-Sensitive Alignment via Variance Regularization

Fri, 08 May 2026 00:00:00 GMT

Current critic-less RLHF methods aggregate multi-objective rewards via an arithmetic mean, leaving them vulnerable to constraint neglect: high-magnitude success in one objective can numerically offset critical failures in others (e.g., safety or formatting), masking low-performing “bottleneck” rewards vital for reliable multi-objective alignment. We propose Reward-Variance Policy Optimization (RVPO), a risk-sensitive framework that penalizes inter-reward variance during advantage aggregation, shifting the objective from “maximize sum” to “maximize consistency.” We show via Taylor expansion…
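The constraint-neglect problem can be illustrated with a toy comparison. This is a minimal sketch, not the RVPO algorithm itself: it assumes a simple "mean minus weighted variance" score applied directly to reward values, and the function names and the `penalty` weight are hypothetical.

```python
from statistics import fmean, pvariance


def mean_aggregate(rewards):
    """Arithmetic-mean aggregation: a high reward can offset a failing one."""
    return fmean(rewards)


def variance_penalized_aggregate(rewards, penalty=1.0):
    """Mean minus a variance penalty (assumed form; penalty is a hypothetical weight)."""
    return fmean(rewards) - penalty * pvariance(rewards)


# Two hypothetical responses scored on three objectives (e.g., quality, formatting, safety).
balanced = [0.6, 0.6, 0.6]
lopsided = [1.0, 0.8, 0.0]  # strong on two objectives, fails the third

for name, rewards in (("balanced", balanced), ("lopsided", lopsided)):
    print(name, mean_aggregate(rewards), variance_penalized_aggregate(rewards))
```

Both responses have the same arithmetic mean, so mean aggregation cannot tell them apart. The variance penalty lowers the score of the lopsided response, which is the "maximize consistency" intuition stated above.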

Apple Workshop on Privacy-Preserving Machine Learning & AI 2026

Fri, 08 May 2026 00:00:00 GMT

At Apple, we believe privacy is a fundamental human right. As AI capabilities increase and become more integrated into people’s daily lives, advancing research in privacy-preserving techniques is increasingly important to ensure privacy is protected while users enjoy innovative AI experiences. Apple’s fundamental research has consistently pushed the state of the art in this domain, and earlier this year, we hosted the Workshop on Privacy-Preserving Machine Learning & AI. This two-day event brought together Apple researchers and members of the broader research community to discuss the…

Velox: Learning Representations of 4D Geometry and Appearance

Fri, 08 May 2026 00:00:00 GMT

We introduce a framework for learning latent representations of 4D objects which are descriptive, faithfully capturing object geometry and appearance; compressive, aiding in downstream efficiency; and accessible, requiring minimal input, i.e., an unstructured dynamic point cloud, to construct. Specifically, Velox trains an encoder to compress spatiotemporal color point clouds into a set of dynamic shape tokens. These tokens are supervised using two complementary decoders: a 4D surface decoder, which models the time-varying surface distribution capturing the geometry; and a Gaussian decoder…

What Matters in Practical Learned Image Compression

Thu, 07 May 2026 00:00:00 GMT

One of the major differentiators unlocked by learned codecs relative to their hard-coded traditional counterparts is their ability to be optimized directly to appeal to the human visual system. Despite this potential, an image codec that is both perceptual and practical has yet to be proposed. In this work, we aim to close this gap. We conduct a comprehensive study of the key modeling choices that govern the design of a practical learned image codec, jointly optimized for perceptual quality and runtime, and include several novel techniques in the ablations. We then perform performance-aware neural…