{"id":33707,"date":"2025-12-16T12:35:11","date_gmt":"2025-12-16T09:35:11","guid":{"rendered":"https:\/\/stage.cactus-now.com\/cactus-nieuws\/llms-in-ios\/"},"modified":"2026-06-01T17:11:49","modified_gmt":"2026-06-01T14:11:49","slug":"llms-in-ios","status":"publish","type":"post","link":"https:\/\/stage.cactus-now.com\/nl\/cactus-nieuws\/llms-in-ios\/","title":{"rendered":"On-device LLMs in iOS: a technical journey from model selection to user experience"},"content":{"rendered":"\t\t<div data-elementor-type=\"wp-post\" data-elementor-id=\"33707\" class=\"elementor elementor-33707 elementor-30604\" data-elementor-post-type=\"post\">\n\t\t\t\t<div class=\"elementor-element elementor-element-1dcd3dd e-flex e-con-boxed e-con e-parent\" data-id=\"1dcd3dd\" data-element_type=\"container\" data-e-type=\"container\">\n\t\t\t\t\t<div class=\"e-con-inner\">\n\t\t\t\t<div class=\"elementor-element elementor-element-7e42275e elementor-widget elementor-widget-text-editor\" data-id=\"7e42275e\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t\t\t\t\t\t<p data-renderer-start-pos=\"452\" data-local-id=\"e06f0e7c-6a8e-4e2f-b1e0-04ca65630629\">Until recently, Large Language Models (LLMs) were too large and expensive to run locally. The only viable option was to integrate LLM capabilities via remote systems, a solution that introduces latency, creates a network dependency, and sends sensitive data off the device.<\/p><p data-renderer-start-pos=\"722\">Thanks to recent hardware improvements, advances in <strong>CoreML<\/strong>, and the introduction of the <strong>Foundation Models<\/strong> framework for developers, a new scenario has opened up. This new framework <strong>provides a high-level API<\/strong> to interact with the generative models integrated into Apple\u2019s operating systems. 
<strong>Modern Apple processors<\/strong> (M-series and A-series) now allow running these generative models <strong>locally on the device<\/strong>, ensuring fast performance and maintaining user privacy.<\/p><p data-renderer-start-pos=\"1149\">In this article, we will take a technical and conceptual look at integrating an LLM into an iOS project, with a focus on real-world production scenarios.<\/p><h4 id=\"Choosing-a-compatible-model\" data-local-id=\"c13e2bec-58c0-4518-8b4b-10106204b1f5\" data-renderer-start-pos=\"1305\"><strong data-renderer-mark=\"true\">Choosing a compatible model<\/strong><\/h4><p data-renderer-start-pos=\"1334\" data-local-id=\"d54d2e08-87d2-499f-809c-bbc105f087f9\">The <strong>first step<\/strong> in integrating an LLM into an iOS app is a <strong>careful search<\/strong> for a model that is compatible with Apple devices. That model should <strong>meet the size, architecture, and hardware requirements.<\/strong> Model size is not only about storage space, but also about runtime memory (RAM) consumption. Models with between 1 and 3 billion parameters are the most realistic to run on iOS, as they offer the most capability for their size.<\/p><p data-renderer-start-pos=\"1777\">The models available in repositories such as Hugging Face are typically trained with <strong>PyTorch<\/strong> and often distributed in exchange formats such as <strong>ONNX<\/strong>. These models must be converted to the CoreML format using tools such as coremltools, which translate their operations into a representation optimized for Apple hardware.<\/p><p data-renderer-start-pos=\"2062\">It\u2019s also important to note that <strong>not all models are directly compatible<\/strong>. Some layers may be unsupported by CoreML, and some operations used in the original architecture have no equivalent in Apple\u2019s framework. For example, optimized attention implementations such as FlashAttention cannot be converted directly because they rely on fused GPU kernels that CoreML does not support. 
The same applies to certain custom normalization layers and to dynamic tensor operations that modify shapes at runtime, which CoreML cannot represent directly. <\/p><p data-renderer-start-pos=\"2579\">For this reason, the <strong>initial model evaluation is a critical phase<\/strong> for identifying a model that can be adapted to the device. Many of these operations come from custom <strong>PyTorch<\/strong> layers, pieces of code written by the model\u2019s authors that CoreML cannot interpret automatically. When this happens, <strong>those layers must be rewritten using supported operations<\/strong>, so the model can run correctly on Apple devices.<\/p><h4 id=\"Translating-the-model-into-Apple\u2019s-format\" data-local-id=\"18ff38d0-a115-4365-8ddf-be90cef0a3ea\" data-renderer-start-pos=\"3007\"><strong data-renderer-mark=\"true\">Translating the model into Apple\u2019s format<\/strong><\/h4><p data-renderer-start-pos=\"3050\">Once a compatible model has been selected, the <strong>technical phase begins<\/strong>. During the conversion process, the original model is translated from frameworks such as PyTorch into the CoreML format, adapting its internal operations so that they can run efficiently on iOS. This step involves <strong>transforming complex layers, reducing the model size,<\/strong> and ensuring that the operations can be executed efficiently on the <strong>Neural Engine<\/strong>.<\/p><p data-renderer-start-pos=\"3458\">Model size optimization is critical. <strong>Using<\/strong> <strong>quantization<\/strong>, the <strong>model\u2019s weights are converted into lighter numerical formats<\/strong>, such as int8 or int4, instead of the high-precision floating-point values used during training. 
This means the model stores numbers using fewer bits. Precision is reduced, but the overall behavior of the model usually remains close to the original for generative tasks.<\/p><p data-renderer-start-pos=\"3837\">The most common format for mobile deployment is int8, which offers a good balance between size and accuracy. In most cases, converting from fp16 to int8 is reliable and preserves the model\u2019s effectiveness for generative tasks. Formats such as int4 are much more aggressive and can significantly degrade the quality of the model\u2019s output.<\/p><p data-renderer-start-pos=\"4164\">This reduction in model size is crucial for successful app distribution, helping to reduce load time and improve compatibility with mid-range devices. Once the model is optimized, <strong>the conversion produces an .mlpackage<\/strong> <strong>file<\/strong> with metadata describing the model\u2019s architecture and capabilities.<\/p><p data-renderer-start-pos=\"4434\"><strong>The final result is a CoreML package<\/strong> ready to be integrated into <strong>Xcode<\/strong>. Before including it in the project, it\u2019s advisable to run it in an environment similar to an iOS device, such as a Mac with an M-series chip, to verify that the model\u2019s performance is acceptable.<\/p><h2 data-local-id=\"18ff38d0-a115-4365-8ddf-be90cef0a3ea\" data-renderer-start-pos=\"1927\"> <\/h2><h4 id=\"Integration-in-Xcode\" data-local-id=\"119ee105-20f2-4734-86a0-ae1111138e3e\" data-renderer-start-pos=\"4707\"><strong data-renderer-mark=\"true\">Integration in Xcode<\/strong><\/h4><p data-renderer-start-pos=\"4729\" data-local-id=\"81f97b01-92be-4b49-8cb4-39d3e6fc7074\">Once the model has been converted, the next step is to integrate it into Xcode, Apple\u2019s development environment where iOS, iPadOS, and macOS apps are built. 
When the model is added to the project, <strong>Xcode automatically generates a Swift interface<\/strong>, making it easy and safe to use the model in your code.<\/p><p data-renderer-start-pos=\"5037\">When <strong>running an LLM<\/strong>, the model does not produce an entire response in a single step. Instead, <strong>it generates text token by token<\/strong>, predicting the next token based on both the user input and the tokens already generated. This requires calling the model repeatedly in a loop, where each newly generated token becomes part of the input for the next prediction. This iterative process continues until the model signals that the response is complete.<\/p><p data-renderer-start-pos=\"5476\">Developers can <strong>specify which compute unit the model should run on<\/strong>: the CPU, the GPU, or the Apple Neural Engine. The Neural Engine is a specialized component in modern Apple processors designed to accelerate machine learning operations, so it is usually the most efficient option, but not every component of a model may be compatible with it. 
In this case, the <strong>CoreML runtime<\/strong> automatically <strong>distributes the execution<\/strong> across the available units, running each operation on the hardware best suited to it.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t<div class=\"elementor-element elementor-element-45da1ce e-con-full e-flex e-con e-parent\" data-id=\"45da1ce\" data-element_type=\"container\" data-e-type=\"container\" data-settings=\"{&quot;background_background&quot;:&quot;classic&quot;}\">\n\t\t\t\t<div class=\"elementor-element elementor-element-372fcdbc elementor-widget elementor-widget-image\" data-id=\"372fcdbc\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"image.default\">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<img fetchpriority=\"high\" decoding=\"async\" width=\"1920\" height=\"1280\" src=\"https:\/\/stage.cactus-now.com\/wp-content\/uploads\/2025\/12\/LLms-iphone.png\" class=\"attachment-full size-full wp-image-30612\" alt=\"LLMs in iOS\" srcset=\"https:\/\/stage.cactus-now.com\/wp-content\/uploads\/2025\/12\/LLms-iphone.png 1920w, https:\/\/stage.cactus-now.com\/wp-content\/uploads\/2025\/12\/LLms-iphone-1024x683.png 1024w, https:\/\/stage.cactus-now.com\/wp-content\/uploads\/2025\/12\/LLms-iphone-768x512.png 768w, https:\/\/stage.cactus-now.com\/wp-content\/uploads\/2025\/12\/LLms-iphone-1536x1024.png 1536w\" sizes=\"(max-width: 1920px) 100vw, 1920px\" \/>\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t<div class=\"elementor-element elementor-element-e21d61a e-flex e-con-boxed e-con e-parent\" data-id=\"e21d61a\" data-element_type=\"container\" data-e-type=\"container\">\n\t\t\t\t\t<div class=\"e-con-inner\">\n\t\t\t\t<div class=\"elementor-element elementor-element-45c76c8 elementor-widget elementor-widget-text-editor\" data-id=\"45c76c8\" data-element_type=\"widget\" data-e-type=\"widget\" data-widget_type=\"text-editor.default\">\n\t\t\t\t\t\t\t\t\t<h4 id=\"Interaction-and-design\" 
data-local-id=\"19bd0f60-3e05-41c4-9e73-64da1a835cf5\" data-renderer-start-pos=\"5955\"><strong data-renderer-mark=\"true\">Interaction and design<\/strong><\/h4><p data-renderer-start-pos=\"5979\" data-local-id=\"a000d06f-9312-45b1-bb24-24cc2f56417c\"><strong>Developers must decide how users will interact with the model<\/strong>. It\u2019s important to keep in mind that LLMs deal with ambiguous instructions, conversational context, and potentially long output, so <strong>the user experience depends on how context is managed.<\/strong><\/p><p data-renderer-start-pos=\"6234\">The user interface should be designed to receive input incrementally, handle interruptions and reformulations, and provide mechanisms for correction. The developer must also determine how the conversation state is managed: whether the information is stored or reset with each new query.<\/p><p data-renderer-start-pos=\"6516\">Including a large model directly in the app bundle may not be viable. Many apps download the model after the initial installation, keeping the app lightweight and allowing updates without needing to release a new version on the App Store.<\/p><h4 id=\"Model-Maintenance\" data-local-id=\"19bd0f60-3e05-41c4-9e73-64da1a835cf5\" data-renderer-start-pos=\"6757\"><strong data-renderer-mark=\"true\">Model Maintenance<\/strong><\/h4><p data-renderer-start-pos=\"6776\"><strong>The model is a living part of the project<\/strong> that evolves rapidly. Its <strong>maintenance is essential<\/strong> to ensure it remains fully functional over time.<\/p><p data-renderer-start-pos=\"6918\">Each <strong>model update may require new conversions, adjustments, or structural modifications<\/strong>. Security plays a key role in this process. Every downloaded <strong>model should be verified<\/strong> using digital signatures from trusted servers. 
This step helps guarantee the integrity of the LLM that forms part of the application.<\/p><h4 id=\"Conclusion\" data-local-id=\"19bd0f60-3e05-41c4-9e73-64da1a835cf5\" data-renderer-start-pos=\"7223\"><strong data-renderer-mark=\"true\">Conclusion<\/strong><\/h4><p data-renderer-start-pos=\"7235\">The development of applications that integrate on-device LLMs marks the beginning of a new generation of apps, capable of reasoning, generating content, and assisting users directly from their own devices. Bringing these models to iOS devices requires a new perspective. Developers must translate the original models built in PyTorch into CoreML, optimize their size through quantization, and design interfaces that support conversational interactions rather than simple predictions.<\/p><p data-renderer-start-pos=\"7724\">The introduction of the Foundation Models framework is not intended to replace CoreML. Instead, it serves as a tool to simplify integration with Apple Intelligence and enables developers to adapt their apps to the system\u2019s capabilities.<\/p><p data-renderer-start-pos=\"7962\">This shift redefines the developer\u2019s role, moving from the integration of static models to working with complex, high-performance generative systems that can adapt in real time. 
It also <strong>offers companies the opportunity to redefine existing products and create new experiences<\/strong> with features that previously required much more complex infrastructure.<\/p><p data-renderer-start-pos=\"8314\">The <strong>future of mobile development<\/strong> points towards smarter, more autonomous, and privacy-focused applications, where the model lives directly on the user\u2019s device, shaped by its own capabilities, and by the way it is converted, optimized, and experienced through interaction design.<\/p>\t\t\t\t\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t","protected":false},"excerpt":{"rendered":"<p>Until recently, Large Language Models (LLMs) were too large and expensive to run locally. The only viable option was to integrate LLM capabilities via remote systems, a solution that introduces latency, creates a network dependency, and sends sensitive data off the device. Thanks to recent hardware improvements, advances in CoreML, and the introduction of the Foundation Models framework for 
[&hellip;]<\/p>\n","protected":false},"author":47,"featured_media":30606,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[32,458,637],"tags":[760],"class_list":["post-33707","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-cactus-nieuws","category-iot-nl","category-blog","tag-iot-nl"],"acf":[],"_links":{"self":[{"href":"https:\/\/stage.cactus-now.com\/nl\/wp-json\/wp\/v2\/posts\/33707","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/stage.cactus-now.com\/nl\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/stage.cactus-now.com\/nl\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/stage.cactus-now.com\/nl\/wp-json\/wp\/v2\/users\/47"}],"replies":[{"embeddable":true,"href":"https:\/\/stage.cactus-now.com\/nl\/wp-json\/wp\/v2\/comments?post=33707"}],"version-history":[{"count":0,"href":"https:\/\/stage.cactus-now.com\/nl\/wp-json\/wp\/v2\/posts\/33707\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/stage.cactus-now.com\/nl\/wp-json\/wp\/v2\/media\/30606"}],"wp:attachment":[{"href":"https:\/\/stage.cactus-now.com\/nl\/wp-json\/wp\/v2\/media?parent=33707"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/stage.cactus-now.com\/nl\/wp-json\/wp\/v2\/categories?post=33707"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/stage.cactus-now.com\/nl\/wp-json\/wp\/v2\/tags?post=33707"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}