
Apple Foundation Model on device

25 June 2025

Since iOS/macOS 26, developers can use Apple's built-in language model in their apps via Swift APIs.
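
To give a sense of the developer experience, here's a minimal sketch using the FoundationModels framework. The names (LanguageModelSession, respond(to:)) follow Apple's documentation, but treat the exact signatures as assumptions:

```swift
import FoundationModels

// Ask the on-device model a question in a few lines.
// Runs in any async context (e.g., a Task or an async function).
let session = LanguageModelSession()
let response = try await session.respond(to: "Suggest three titles for a note about hiking.")
print(response.content)
```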

This is the same model that powers Apple Intelligence. It is local and private.

Here, I explore the model's features, capabilities, and benchmarks to provide a high-level overview. For a deep dive, see the Apple Intelligence Foundation Language Models paper by Apple and the 2025 update from the Apple ML team.

Following best practices, Apple did a terrible job naming its own models. The best nomenclature I could find is AFM-on-device and AFM-server for the local and server Apple models, respectively. There seems to be no versioning scheme.

Note: here, I'm mostly interested in the on-device model. There isn't much public info on the server model, and it's not that interesting overall (there are better models).

Key

The model is proprietary and closed-weight.

Most likely, the same model is used across all Apple devices and is part of the software updates (i.e., newer versions of iOS/macOS will get newer/updated models).
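
Since the model ships with the OS rather than with the app, availability should be checked at runtime. Here's a sketch of that check, assuming the SystemLanguageModel API as documented by Apple:

```swift
import FoundationModels

// Availability depends on device support, OS version, and whether
// Apple Intelligence is enabled, so check before creating a session.
switch SystemLanguageModel.default.availability {
case .available:
    print("On-device model is ready to use.")
case .unavailable(let reason):
    print("On-device model unavailable: \(reason)")
}
```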

Capabilities

No multimodality: Both the text and image models are single-modal.

No reasoning: For advanced reasoning, Apple suggests using large reasoning models.

Model Dimensions

| Param | Value |
| --- | --- |
| Model dimension | 3072 |
| Head dimension | 128 |
| Num query heads | 24 |
| Num key/value heads | 8 |
| Num layers | 26 |
| Num non-embedding params (B) | 2.58 |
| Num embedding params (B) | 0.15 |
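
These numbers are self-consistent: 24 query heads × a head dimension of 128 gives the model dimension of 3072, and 8 key/value heads shared across 24 query heads means grouped-query attention with 3 query heads per KV head. A quick arithmetic check:

```swift
// Sanity-check the published dimensions (values from the table above).
let headDim = 128
let queryHeads = 24
let kvHeads = 8

let modelDim = queryHeads * headDim   // 24 * 128 = 3072, matches the table
let gqaGroup = queryHeads / kvHeads   // 3 query heads share each KV head

// Rough total size: non-embedding + embedding parameters, in billions.
let totalParamsB = 2.58 + 0.15        // ~2.73B, i.e. a ~3B-class model
print(modelDim, gqaGroup, totalParamsB)
```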

Training data

As per the paper, Apple used:

- Data licensed from publishers
- Curated publicly available and open-source datasets
- Public web pages crawled by Applebot

No personal data from Apple users was used.

The model was trained from scratch (i.e., it’s not based on any existing open-source model).

Benchmarks

Pre-training

| Metric | AFM-on-device | AFM-server |
| --- | --- | --- |
| MMLU (5-shot) | 61.4 | 75.4 |

| Metric | AFM-server |
| --- | --- |
| MMLU (5-shot) | 75.3 |
| GSM8K (5-shot) | 72.4 |
| ARC-c (25-shot) | 69.7 |
| HellaSwag (10-shot) | 86.9 |
| Winogrande (5-shot) | 79.2 |

| Metric | AFM-server |
| --- | --- |
| Narrative QA | 77.5 |
| Natural Questions (open) | 73.8 |
| Natural Questions (closed) | 43.1 |
| Openbook QA | 89.6 |
| MMLU | 67.2 |
| MATH-CoT | 55.4 |
| GSM8K | 72.3 |
| LegalBench | 67.9 |
| MedQA | 64.4 |
| WMT 2014 | 18.6 |

Human Evaluation

Side-by-side evaluation of AFM-on-device and AFM-server against comparable models

Instruction Following

Instruction-following capability (measured with IFEval)

Tool Use

Berkeley Function Calling Leaderboard evaluation results on the Function Calling API
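
For context, here's roughly what tool use looks like with the FoundationModels framework. WeatherTool and its argument type are hypothetical, and the Tool protocol details follow Apple's WWDC sessions, so treat the exact shape of the API as an assumption:

```swift
import FoundationModels

// Hypothetical tool; the Tool protocol, @Generable, and @Guide
// follow Apple's documented API, but exact signatures may differ.
struct WeatherTool: Tool {
    let name = "getWeather"
    let description = "Returns the current temperature for a city."

    @Generable
    struct Arguments {
        @Guide(description: "The city to look up")
        var city: String
    }

    func call(arguments: Arguments) async throws -> String {
        // A real implementation would query a weather service here.
        "It is 21°C in \(arguments.city)."
    }
}

// The session invokes the tool automatically when the model requests it.
let session = LanguageModelSession(tools: [WeatherTool()])
let answer = try await session.respond(to: "What's the weather in Lisbon?")
print(answer.content)
```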

Writing

Writing ability on internal summarization and composition benchmarks

Math

Math benchmarks for AFM-on-device and AFM-server alongside relevant comparison models. GSM8K is 8-shot and MATH is 4-shot.

Safety

Fraction of preferred responses in side-by-side evaluation of Apple’s foundation model against comparable models on safety prompts.

Conclusion

Overall, the model seems to be on par with small open-source models.

On iPhone (and likely iPad), it now makes little sense to bundle third-party models (installing and managing them results in a worse UX).

On macOS, I’d expect to have the option to run third-party on-device and hosted models (especially for pro users and complex use cases), while using AFM by default.