Apple’s CoreAI Barely Beats MLX on Real-World 8B Models Despite Tiny-Model Speed Surge

Apple CoreAI Arrives as CoreML’s Successor, but Early LLM Benchmarks Show a Mixed Picture

Apple has officially introduced CoreAI, its next-generation artificial intelligence framework designed to replace CoreML after nearly nine years. The new engine is built for modern on-device AI workloads, with a focus on flexible inference, larger model support, and improved performance across Apple silicon.

CoreAI marks an important shift in Apple’s AI strategy. While CoreML was originally created for smaller and more static machine learning tasks such as image classification, object detection, and decision-tree models, CoreAI is clearly aimed at the era of edge AI and local large language models. In simple terms, Apple wants more AI processing to happen directly on devices like the iPhone, iPad, and Mac instead of relying heavily on cloud servers.

However, early benchmark results suggest that CoreAI’s real-world performance is more complex than a simple generational leap.

In new on-device LLM tests, CoreAI performed strongly with smaller language models. Running Qwen3 0.6B on an M4 Mac, CoreAI reportedly decoded around 2.47 times faster than MLX, Apple’s research-focused machine learning framework.

The results were also impressive on mobile hardware. On an iPhone 17 Pro, CoreAI running on the GPU achieved around 180 tokens per second in warm, pipelined decoding tests with Qwen3 0.6B. By comparison, MLX reached about 115 tokens per second, while CoreAI running on the Apple Neural Engine reached roughly 50 tokens per second. CoreML-based LLM execution on the Apple Neural Engine came in lower, at around 39 tokens per second.

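On these figures, GPU-backed CoreAI is roughly 1.6 times faster than MLX and about 3.6 times faster than CoreAI on the Apple Neural Engine. For smaller models, in other words, CoreAI’s speed advantage comes mainly from GPU acceleration.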
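To put the iPhone 17 Pro figures side by side, the ratios can be worked out directly. The short Python sketch below uses only the approximate tokens-per-second numbers quoted above; the dictionary labels and the `speedup_over` helper are illustrative names, not part of any Apple API.

```python
# Approximate warm, pipelined decode rates for Qwen3 0.6B on an iPhone 17 Pro,
# as quoted in the article (tokens per second). Labels are illustrative.
DECODE_TPS = {
    "CoreAI (GPU)": 180.0,
    "MLX": 115.0,
    "CoreAI (ANE)": 50.0,
    "CoreML (ANE)": 39.0,
}


def speedup_over(baseline: str, rates: dict[str, float]) -> dict[str, float]:
    """Return each runtime's throughput relative to the chosen baseline."""
    base = rates[baseline]
    return {name: tps / base for name, tps in rates.items()}


relative = speedup_over("MLX", DECODE_TPS)
for name, ratio in relative.items():
    print(f"{name:<14} {ratio:.2f}x MLX")
```

Measured against MLX, both Neural Engine configurations land below half of its throughput, which is why the choice of compute unit matters as much as the choice of framework here.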

The story changes when larger models enter the picture. With the more realistic 8-billion-parameter Qwen3 8B on an M4 Max Mac, CoreAI’s advantage over MLX becomes much smaller. CoreAI decoded only about 1.05 times faster than MLX, a gap of roughly 5 percent that leaves the two frameworks nearly tied.
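One way to read those two multipliers is as decode time saved rather than throughput gained. The sketch below is plain arithmetic on the reported 2.47x and 1.05x figures; the function name is made up for illustration, and the two results come from different machines (an M4 Mac and an M4 Max Mac), so they are not a controlled comparison.

```python
def decode_time_saved(speedup: float) -> float:
    """Fraction of decode time saved when throughput improves by `speedup`.

    If a runtime is `speedup` times faster, the same number of tokens takes
    1 / speedup of the original time, so the saving is 1 - 1 / speedup.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


small = decode_time_saved(2.47)  # Qwen3 0.6B, CoreAI vs MLX
large = decode_time_saved(1.05)  # Qwen3 8B, CoreAI vs MLX

print(f"0.6B model: {small:.1%} less decode time")
print(f"8B model:   {large:.1%} less decode time")
```

Seen this way, the small-model result cuts decode time by roughly 60 percent, while the 8B result saves under 5 percent, a difference most users would struggle to notice.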

That result is important because 8B-class models are much closer to what many users and developers would consider practical for local AI assistants, coding tools, summarization, and productivity features. While smaller models are useful for lightweight tasks, larger models tend to deliver better reasoning and more capable responses.

Another key finding involves sustained performance on the iPhone 17 Pro. During longer AI workloads, the GPU appears to throttle relatively quickly. When that happens, the CoreML and Apple Neural Engine combination can retain performance more effectively over time. This setup also uses the least memory, although it remains slower in raw decoding speed compared with GPU-backed CoreAI.

That tradeoff highlights one of the biggest challenges in on-device AI: peak speed is not the same as sustained efficiency. A framework may produce excellent short-burst results, but real-world AI apps often require consistent performance, controlled heat output, and reasonable battery usage.
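The gap between peak and sustained speed can be made concrete with a simple summary over throughput samples taken at regular intervals during a long run. The following is a minimal sketch of one possible way to do that; the `throughput_summary` helper, the choice of averaging the final quarter of the run, and the sample values are all assumptions for illustration. The numbers are synthetic placeholders, not measurements from the benchmarks.

```python
from statistics import mean


def throughput_summary(samples: list[float], tail_fraction: float = 0.25) -> dict[str, float]:
    """Summarize a run as peak speed, sustained speed, and retention.

    `samples` holds tokens-per-second readings in time order. Sustained speed
    is taken as the mean of the final `tail_fraction` of the run, which is an
    arbitrary convention chosen for this sketch.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    tail_len = max(1, int(len(samples) * tail_fraction))
    peak = max(samples)
    sustained = mean(samples[-tail_len:])
    return {"peak": peak, "sustained": sustained, "retention": sustained / peak}


# Synthetic placeholder readings that mimic a runtime slowing under heat.
synthetic_run = [100.0, 100.0, 90.0, 70.0, 60.0, 60.0, 60.0, 60.0]
summary = throughput_summary(synthetic_run)
print(summary)
```

A runtime with a lower peak but a retention close to 1.0 may be the better choice for long sessions, which is the pattern the CoreML and Neural Engine results point to.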

The benchmarks also show that highly optimized engines built for specific AI models can outperform more general-purpose frameworks. In one example, a vendor-optimized runtime paired with its own model ran faster on the iPhone 17 Pro than Apple’s MLX framework while using far less memory. The difference was especially notable in RAM usage, with the optimized setup using hundreds of megabytes compared with several gigabytes for MLX.

This reinforces a broader trend in AI development: the best performance often comes from tight integration between the model, runtime, hardware, and memory system. General frameworks offer flexibility, but specialized engines can deliver major gains in speed and efficiency.

Apple’s own Foundation Models also showed promising energy-efficiency results. According to the tests, they were around two times more energy-efficient per token than GPU-backed runtimes, and around four times more efficient than CoreML running on the Apple Neural Engine.
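Those efficiency ratios translate directly into how many tokens a fixed energy budget can produce. The sketch below normalizes Foundation Models to one unit of energy per token and derives the other two values from the reported "around two times" and "around four times" figures; the units are relative, the names are illustrative, and no absolute joule or battery figures are implied.

```python
# Relative energy per token, normalized so Foundation Models = 1.0.
# Derived from the reported ~2x and ~4x efficiency gaps; not absolute units.
RELATIVE_ENERGY_PER_TOKEN = {
    "Foundation Models": 1.0,
    "GPU-backed runtime": 2.0,
    "CoreML (ANE)": 4.0,
}


def tokens_for_budget(budget_units: float, energy_per_token: float) -> float:
    """Tokens that fit in an energy budget at a given per-token cost."""
    return budget_units / energy_per_token


budget = 1000.0  # arbitrary relative budget, chosen only for illustration
tokens = {
    name: tokens_for_budget(budget, cost)
    for name, cost in RELATIVE_ENERGY_PER_TOKEN.items()
}
for name, count in tokens.items():
    print(f"{name:<20} {count:.0f} tokens per {budget:.0f} energy units")
```

For the same energy spend, the most efficient option in this comparison generates twice as many tokens as a GPU-backed runtime and four times as many as CoreML on the Neural Engine.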

That could matter greatly for Apple’s long-term AI plans. If the company wants AI features to run locally across millions of devices, energy efficiency may be just as important as raw tokens-per-second performance. Faster AI is useful, but efficient AI is what makes features practical for everyday use without draining battery life or overheating devices.

Overall, CoreAI looks like a major upgrade over CoreML for modern on-device AI workloads. It performs especially well with smaller language models and gives developers a more capable framework for local inference. But the early benchmark data also shows that CoreAI does not win on performance in every scenario.

For small models, CoreAI can be dramatically faster. For larger, more practical models, its lead over MLX narrows to near parity. On mobile devices, GPU acceleration can deliver strong short-term results, but sustained workloads may favor more efficient execution through Apple’s dedicated neural hardware.

The takeaway is clear: CoreAI is an important step forward for Apple’s on-device AI ecosystem, but performance will depend heavily on model size, runtime configuration, memory usage, and thermal behavior. As developers begin adopting the framework, users can expect more powerful AI features on iPhone, iPad, and Mac, but the best results will likely come from carefully optimized models built specifically for Apple hardware.
