
How to Reduce the Size of Local AI Models on iOS


Introduction

More and more iOS apps are integrating on-device AI models to provide features to users, which brings excellent privacy benefits. However, some models are quite large (over 100MB), which slows model initialization and adds extra data movement during inference, hurting performance. I've recently tried a few optimization methods, and in this post I'll walk through them in plain language.

Example

We’ll use the IsNet model, commonly employed in background removal scenarios to extract masks, as an example to demonstrate how to optimize a model step-by-step. We’ll also provide relevant evaluation metrics to assess the model’s quality. To get started, you’ll need two things: the Core ML version of the IsNet model, which you can download here, and a test dataset of images, which we’ll source from this project (including original images and mask data).

How to Compress a Model

Model compression methods are provided by Apple’s official coremltools library. Generally, there are three main approaches: Palettization, Quantization, and Pruning. Below is the code showing how to apply these techniques (assuming the original model is in .mlpackage format):

import coremltools as ct
from coremltools.optimize.coreml import (
    OpPalettizerConfig,
    OpMagnitudePrunerConfig,
    OpLinearQuantizerConfig,
    OptimizationConfig,
    prune_weights,
    palettize_weights,
    linear_quantize_weights
)

mlmodel = ct.models.MLModel("a.mlpackage")

# You can change sparsity to 0.10, 0.25, 0.50, 0.75...
op_config = OpMagnitudePrunerConfig(target_sparsity=0.10)
config = OptimizationConfig(op_config)
compressed_mlmodel = prune_weights(mlmodel, config=config)

# You can change nbits to 8, 6, 4, 2
# op_config = OpPalettizerConfig(nbits=8)
# config = OptimizationConfig(op_config)
# compressed_mlmodel = palettize_weights(mlmodel, config=config)

# op_config = OpLinearQuantizerConfig()
# config = OptimizationConfig(op_config)
# compressed_mlmodel = linear_quantize_weights(mlmodel, config=config)

compressed_mlmodel.save("compress.mlpackage")

However, since the exported IsNet model is in the older .mlmodel (neural network) format rather than .mlpackage, only the quantization_utils.quantize_weights method is supported. For more details, see here.

Note: The larger the nbits, the bigger the model. For now, we’ll test with nbits=8 and nbits=4:

import coremltools as ct
from coremltools.models.neural_network import quantization_utils

# Load model
mlmodel = ct.models.MLModel("ISNet_1024_1024.mlmodel")

# You can change nbits to 8, 6, 4, 2; smaller nbits results in a smaller size
compressed_mlmodel = quantization_utils.quantize_weights(mlmodel, nbits=8)
compressed_mlmodel.save("ISNet_1024_1024_8.mlmodel")

compressed_mlmodel = quantization_utils.quantize_weights(mlmodel, nbits=4)
compressed_mlmodel.save("ISNet_1024_1024_4.mlmodel")

Model Analysis

After compression, the original model was 176MB, while ISNet_1024_1024_8 was 44MB and ISNet_1024_1024_4 was 22MB, a significant reduction in size. But how does this affect accuracy? To answer that, we introduce a loss function that compares the masks each model generates against the ground-truth masks. The idea is as follows:

import coremltools as ct
import os
import numpy as np
import matplotlib.pyplot as plt
import PIL

# Process image and generate masks in output_folder
def processImage(imagePath, model, output_folder):
    input_width = 1024
    input_height = 1024
    img = PIL.Image.open(imagePath)
    ori_size = img.size

    img = img.resize((input_width, input_height), PIL.Image.Resampling.LANCZOS)
    out_dict = model.predict({'x_1': img})
    extension = imagePath.split("/")[-1].split(".")[-1]
    result_full_path = output_folder + "/" + imagePath.split("/")[-1].split(".")[0] + "." + extension
    file_format = "JPEG" if (extension == "jpg" or extension == "jpeg") else "PNG"
    out_dict["activation_out"].resize(ori_size).save(result_full_path, format=file_format)

# Process batch images in folder_path
def processBatchImage(folder_path, model, output_folder):
    image_extensions = ['.jpg', '.jpeg', '.png', '.gif', '.bmp', '.tiff']
    for filename in os.listdir(folder_path):
        file_path = os.path.join(folder_path, filename)
        if os.path.isfile(file_path) and os.path.splitext(filename)[1].lower() in image_extensions:
            print(f'Processing file: {file_path}')
            processImage(file_path, model, output_folder)

# Get all mask numpy arrays
def getImageArray(folder_path):
    result = []
    image_extensions = ['.jpg', '.jpeg', '.png', '.gif', '.bmp', '.tiff']
    for filename in sorted(os.listdir(folder_path)):
        file_path = os.path.join(folder_path, filename)
        if os.path.isfile(file_path) and os.path.splitext(filename)[1].lower() in image_extensions:
            image = PIL.Image.open(file_path).convert('L')
            result.append(np.array(image))
    return result

# Calculate difference between original mask and model-generated mask
def loss_function(folder1, folder2):
    ground_image_array = getImageArray(folder1)
    output_image_array = getImageArray(folder2)

    total_loss = 0
    total_pixels = 0

    for model_output, ground_truth in zip(output_image_array, ground_image_array):
        # Cast to signed ints so the uint8 subtraction does not wrap around
        diff = np.abs(model_output.astype(np.int32) - ground_truth.astype(np.int32))
        image_loss = np.sum(diff)
        pixelSize = ground_truth.size

        total_loss += image_loss
        total_pixels += pixelSize
    # Normalize the loss value by pixel count
    normalized_loss = total_loss / total_pixels
    return normalized_loss

# Load models with different compression levels
isnet_model = ct.models.MLModel("ISNet_1024_1024.mlmodel")
isnet_model_4 = ct.models.MLModel("ISNet_1024_1024_4.mlmodel")
isnet_model_8 = ct.models.MLModel("ISNet_1024_1024_8.mlmodel")

# Generate mask images with different models
processBatchImage("./datasets/original_test", isnet_model, "./datasets/isnet_mask")
processBatchImage("./datasets/original_test", isnet_model_8, "./datasets/isnet_mask_8")
processBatchImage("./datasets/original_test", isnet_model_4, "./datasets/isnet_mask_4")

# Calculate the loss values
# original_model_loss: 18.37, size: 176M
# model_4_loss: 14.86, size: 22M
# model_8_loss: 18.29, size: 44M
original_model_loss = loss_function("./datasets/original_mask", "./datasets/isnet_mask")
model_4_loss = loss_function("./datasets/original_mask", "./datasets/isnet_mask_4")
model_8_loss = loss_function("./datasets/original_mask", "./datasets/isnet_mask_8")

Based on the size and loss values, we chose isnet_model_4 as the final compressed model. I also manually checked the results folder, and the output looked decent.

Running the Model

According to Apple, starting with iOS 16 and iOS 17, Core ML caches compiled models. If the model is preloaded in the background after the app launches (a warm-up), later requests can be served from the cache, making processing extremely fast. In my testing, loading a model of around 170MB takes about 7–8 seconds. However, the cache can be invalidated in cases like phone overheating, app restarts, or device reboots, so we need to design a mechanism that preheats the model in advance.
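Below is a minimal Swift sketch of such a warm-up, assuming the compressed model ships in the app bundle as ISNet_1024_1024_4.mlmodel (Xcode compiles it into a .mlmodelc bundle at build time); the ModelWarmer class and preheat() method are just illustrative names, not an Apple API:

import CoreML

// Hypothetical warm-up helper: load the bundled, compressed model once in the
// background right after launch, so the first real prediction reuses the
// already-loaded (and cached) compiled model.
final class ModelWarmer {
    static let shared = ModelWarmer()
    private let queue = DispatchQueue(label: "model.warmup", qos: .utility)
    private(set) var model: MLModel?

    // Call early, e.g. from application(_:didFinishLaunchingWithOptions:).
    func preheat() {
        queue.async {
            do {
                // Xcode compiles ISNet_1024_1024_4.mlmodel into a .mlmodelc bundle.
                guard let url = Bundle.main.url(forResource: "ISNet_1024_1024_4",
                                                withExtension: "mlmodelc") else { return }
                let config = MLModelConfiguration()
                config.computeUnits = .all
                self.model = try MLModel(contentsOf: url, configuration: config)
            } catch {
                print("Model warm-up failed: \(error)")
            }
        }
    }
}

Calling ModelWarmer.shared.preheat() right after launch keeps the expensive first load off the user's critical path.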

Remote Model Download

iOS also supports remotely downloading model files, compiling them locally, and loading them asynchronously. This keeps the app itself small, and when combined with the compression techniques above, both the app and the downloaded model can remain compact.
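As a sketch of what that flow can look like in Swift (the remote URL is a placeholder, error handling is minimal, and downloadAndLoadModel is just an illustrative name), one common pattern is to download the .mlmodel with URLSession, compile it on device with MLModel.compileModel(at:), move the compiled .mlmodelc to a permanent location, and then load it:

import CoreML

// Download a raw .mlmodel, compile it on device, persist the compiled
// .mlmodelc, and load it. Call this from a background task.
func downloadAndLoadModel(from remoteURL: URL) async throws -> MLModel {
    let fileManager = FileManager.default

    // 1. Download the raw .mlmodel file; URLSession hands back a temporary file.
    let (tempURL, _) = try await URLSession.shared.download(from: remoteURL)

    // Give the temporary file a .mlmodel extension before compiling.
    let rawModelURL = fileManager.temporaryDirectory
        .appendingPathComponent("ISNet_1024_1024_4.mlmodel")
    try? fileManager.removeItem(at: rawModelURL)
    try fileManager.moveItem(at: tempURL, to: rawModelURL)

    // 2. Compile on device; compileModel(at:) returns a temporary .mlmodelc URL.
    let compiledTempURL = try MLModel.compileModel(at: rawModelURL)

    // 3. Persist the compiled model so later launches can skip the download
    //    and recompilation entirely.
    let supportDir = try fileManager.url(for: .applicationSupportDirectory,
                                         in: .userDomainMask,
                                         appropriateFor: nil,
                                         create: true)
    let permanentURL = supportDir.appendingPathComponent(compiledTempURL.lastPathComponent)
    try? fileManager.removeItem(at: permanentURL)
    try fileManager.moveItem(at: compiledTempURL, to: permanentURL)

    // 4. Load the compiled model.
    return try MLModel(contentsOf: permanentURL)
}

Because MLModel.compileModel(at:) writes its output to a temporary directory, persisting the .mlmodelc (and checking for it before downloading again) avoids recompiling the model on every launch.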

