Tutorial: Using ARMv6 SIMD Intrinsic Instructions on a Raspberry Pi Zero W

Raspberry Pi Zero W computers have a single-core ARM1176JZF-S CPU that implements version 6 of the ARM architecture (ARMv6). From the docs:

It supports the ARM and Thumb instruction sets, Jazelle technology to enable direct execution of Java bytecodes, and a range of SIMD DSP instructions that operate on 16-bit or 8-bit data values in 32-bit registers.

The SIMD (Single Instruction Multiple Data) instructions are what we’ll be talking about in this tutorial. These instructions can help you speed up an application that needs to perform the same operation on many pieces of data. If you have two arrays of 8-bit integers and you want to add each pair of corresponding elements and store the results in a third array, you can use SIMD instructions to perform four additions at a time.

In this tutorial, you will use ARMv6 SIMD operations in C code to add four 8-bit unsigned integers in a single operation.

Prerequisites

Raspberry Pi Zero W
Raspberry Pi OS running on the Zero W. I’m using the 32-bit version based on Debian 11 with version 6.1 of the Linux kernel.
The Clang compiler. Install with apt install clang, though GCC should work fine, too.

Prepare main.c

First, connect to your Pi Zero using SSH. You can also use VS Code Remote Development if you’re familiar with that.

$ ssh pi@IP_ADDRESS

Then, create a directory for your code and change into it.

$ mkdir simd
$ cd simd

Now, create a main.c file, open it in your code editor, and input the following code.

#include <stdio.h>
#ifdef __ARM_ACLE
#include <arm_acle.h>
#endif /* __ARM_ACLE */

int main() {
    return 0;
}

The first include enables you to use printf() to write text to the console. The next three lines check whether the __ARM_ACLE macro is defined and, if it is, include the header file that declares the ARM SIMD intrinsics and types used in this tutorial.
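The guard above only protects the include, so it can help to check up front whether the compiler actually offers the intrinsics. The sketch below is one way to do that, assuming the ACLE feature-test macro __ARM_FEATURE_SIMD32 is what your compiler defines for the 32-bit SIMD intrinsics. The function names have_simd32() and report_simd32() are made up for this illustration.

```c
#include <stdio.h>

/* Returns 1 when the compiler advertises the 32-bit SIMD intrinsics
   (such as __uadd8), and 0 otherwise. Assumption: the compiler follows
   ACLE and defines __ARM_FEATURE_SIMD32 when they are available. */
int have_simd32(void) {
#if defined(__ARM_ACLE) && defined(__ARM_FEATURE_SIMD32)
    return 1;
#else
    return 0;
#endif
}

/* Prints a one-line summary; call it from main() before relying on __uadd8(). */
void report_simd32(void) {
    printf("32-bit SIMD intrinsics: %s\n",
           have_simd32() ? "available" : "not available");
}
```

If report_simd32() says the intrinsics are not available, the rest of the tutorial code will not compile with that compiler and target as-is.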

Note: I had a hard time finding usage documentation for ARMv6 SIMD intrinsics on the internet. I stumbled across the arm_acle.h header by searching my Raspberry Pi for “uadd8”.

Compile and test.

$ clang -o main main.c

Clang should compile the file without printing anything to the console. When it’s done, you should see an executable file named main in the current directory.

Use SIMD variables and the uadd8 instruction

This tutorial uses the __uadd8 function to add four unsigned 8-bit integers in a single operation. The function accepts two parameters of the uint8x4_t type, and returns a value of the same type. The uint8x4_t type works with several SIMD instructions and holds four unsigned 8-bit integer values. When you compile, the compiler maps this function to the uadd8 instruction supported by the CPU. In arm_acle.h, uint8x4_t is a typedef for a 32-bit unsigned integer, so these variables live in ordinary 32-bit general-purpose registers rather than in dedicated SIMD registers.

First, define three variables and assign values to the first two.

int main() {
    uint8x4_t simd1 = 1;
    uint8x4_t simd2 = 2;
    uint8x4_t result;

    return 0;
}

Notice that you assigned only one value to each variable. The compiler knows that the simd1 variable holds 32 bits, so it expanded the assigned value 1 to a 32-bit value with a binary representation of 00000000 00000000 00000000 00000001 (spaces added for clarity). Similarly, the simd2 variable holds a value of 00000000 00000000 00000000 00000010.
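To make the byte layout concrete, here is a small portable sketch that treats a packed value as a plain uint32_t, which is how uint8x4_t is assumed to be defined. The helper names lane_of() and splat_u8() are hypothetical and are not part of arm_acle.h. The first reads a single byte out of a packed value, and the second fills all four bytes with the same number, which a plain assignment does not do.

```c
#include <stdint.h>

/* Return lane i (0 = least significant byte, 3 = most significant byte)
   of a packed 32-bit value. */
uint8_t lane_of(uint32_t v, unsigned i) {
    return (uint8_t)(v >> (8 * i));
}

/* Copy one byte into all four lanes, so 2 becomes 0x02020202. */
uint32_t splat_u8(uint8_t x) {
    return (uint32_t)x * 0x01010101u;
}
```

With these helpers, lane_of(1, 0) is 1 while lanes 1 through 3 are 0, which matches the binary representation above.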

Now, use the SIMD addition instruction, and print the results of the addition.

int main() {
    uint8x4_t simd1 = 1;
    uint8x4_t simd2 = 2;
    uint8x4_t added = __uadd8(simd1, simd2);
    printf("%u\n", added);
    return 0;
}

Recall that the __uadd8() function performs additions in parallel. It adds the first byte of simd1 with the first byte of simd2 and stores the result in the first byte of added. Then it follows the same pattern for the second, third, and fourth bytes.

Now, compile and test.

$ clang -o main main.c && ./main
3

The program prints the number 3, as expected.

Our program works, but it doesn’t show whether the first three bytes of the input variables were added since those bytes are all zeros. In the next section, you will use arrays of integers to verify that the addition is working as expected.
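If you want to reason about the result without the hardware, a portable model of the operation can help. The sketch below is an approximation written for this tutorial, not the intrinsic itself: uadd8_model() is a hypothetical name, it works on plain uint32_t values, and it ignores the GE status flags that the real uadd8 instruction also updates.

```c
#include <stdint.h>

/* Portable model of the per-byte addition performed by __uadd8().
   Each 8-bit lane is added on its own and wraps modulo 256, and no
   carry moves into the neighboring lane. */
uint32_t uadd8_model(uint32_t x, uint32_t y) {
    uint32_t result = 0;
    for (unsigned lane = 0; lane < 4; lane++) {
        unsigned shift = 8 * lane;
        uint32_t a = (x >> shift) & 0xFFu;   /* byte from the first operand */
        uint32_t b = (y >> shift) & 0xFFu;   /* byte from the second operand */
        result |= ((a + b) & 0xFFu) << shift;
    }
    return result;
}
```

The difference from ordinary addition shows up when a byte overflows: 0x000001FF + 0x00000001 is 0x00000200 as a normal 32-bit sum, but uadd8_model(0x000001FF, 0x00000001) returns 0x00000100 because the lowest byte wraps to zero without carrying into the next one.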

Add arrays of unsigned integers

The real value of SIMD comes when you have arrays of data to operate on in parallel. Next, you will use type-casting and an assignment to copy four integers from arrays into the SIMD variables before adding.

Modify main() to match the following code.

int main() {
    uint8_t array1[] = {1, 2, 3, 4};
    uint8_t array2[] = {10, 10, 10, 10};
    uint8_t *array3;

    // Cast to expand number of bytes copied into variables during dereference
    uint8x4_t simd1 = *((uint8x4_t*)array1);
    uint8x4_t simd2 = *((uint8x4_t*)array2);
    uint8x4_t added = __uadd8(simd1, simd2);

    // Get the address of the result, then cast to a uint8_t pointer
    // so we can reference individual bytes easily
    array3 = (uint8_t*)&added;
    printf("%u %u %u %u\n", array3[0], array3[1], array3[2], array3[3]);
    return 0;
}

The code declares two arrays of unsigned 8-bit integers, and a pointer. Then, it casts each array – which you can think of as a uint8_t pointer – to a uint8x4_t pointer. Finally, it dereferences to copy the values into the simd1 and simd2 variables. Casting changes the number of bytes referenced by the pointer from 1 to 4. When you dereference the pointer, 4 bytes will be copied into the SIMD variable instead of 1. Try to dereference array1 into simd1 without casting and see what happens.

When the addition has finished, the code takes the address of the added variable which effectively results in a pointer to a uint8x4_t value. Then, it casts the pointer to a narrower uint8_t pointer and stores the address in array3. This lets you treat array3 as if it were an array of 8-bit integers.

Now, compile and test.

$ clang -o main main.c && ./main
11 12 13 14

The program printed 11 12 13 14 as expected.
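The pointer casts work here, but they rely on the byte arrays being suitably aligned, and reading a uint8_t array through a pointer to a wider type is something the C standard leaves undefined. A common alternative is to copy the four bytes with memcpy(). The sketch below also extends the idea to arrays of any length, handling four bytes per step and the leftovers one at a time. The names add4() and add_u8_arrays() are hypothetical, and add4() is a portable stand-in so the code compiles anywhere; on the Pi Zero W you would call __uadd8() in its place.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for __uadd8(): adds the four bytes of x and y lane by lane,
   each wrapping modulo 256. Replace the body with __uadd8(x, y) on the Pi. */
static uint32_t add4(uint32_t x, uint32_t y) {
    uint32_t r = 0;
    for (unsigned shift = 0; shift < 32; shift += 8) {
        r |= (((x >> shift) + (y >> shift)) & 0xFFu) << shift;
    }
    return r;
}

/* dst[i] = a[i] + b[i] (modulo 256) for n bytes, four bytes per step. */
void add_u8_arrays(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        uint32_t x, y, sum;
        memcpy(&x, a + i, sizeof x);       /* load four bytes without a cast */
        memcpy(&y, b + i, sizeof y);
        sum = add4(x, y);
        memcpy(dst + i, &sum, sizeof sum); /* store four result bytes */
    }
    for (; i < n; i++) {                   /* leftover bytes, one at a time */
        dst[i] = (uint8_t)(a[i] + b[i]);
    }
}
```

Because the bytes are copied in and out in the same order, the result does not depend on the byte order of the machine. Compilers typically turn these small fixed-size memcpy() calls into plain loads and stores, though that is worth confirming for your own build.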

Create helper functions for SIMD addition

If you don’t want to add an array of integers, you can perform SIMD addition on four separate integer values, but you must combine them into a single SIMD variable first.

Define the helper function below.

#include <stdio.h>
#include <arm_acle.h>

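uint8x4_t pack(uint8_t a, uint8_t b, uint8_t c, uint8_t d) {
    // Widen each byte to 32 bits before shifting so that a << 24 cannot
    // overflow a signed int, then bitwise OR the bytes into place
    return (
        ((uint32_t)a << 24)
        |
        ((uint32_t)b << 16)
        |
        ((uint32_t)c << 8)
        |
        (uint32_t)d
    );
}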

int main() {
    // ...
}

The pack() function uses bit shifting and bitwise OR to copy bits from four input variables into a single SIMD variable.

Now, define unpack() to extract bits from the SIMD variable into 4 separate variables.

#include <stdio.h>
#include <arm_acle.h>

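uint8x4_t pack(uint8_t a, uint8_t b, uint8_t c, uint8_t d) {
    // Widen each byte to 32 bits before shifting so that a << 24 cannot
    // overflow a signed int, then bitwise OR the bytes into place
    return (
        ((uint32_t)a << 24)
        |
        ((uint32_t)b << 16)
        |
        ((uint32_t)c << 8)
        |
        (uint32_t)d
    );
}

void unpack(uint8x4_t input, uint8_t *a, uint8_t *b, uint8_t *c, uint8_t *d) {
    // Shift each byte down to the lowest position, then mask off the rest
    *a = (input >> 24) & 255;
    *b = (input >> 16) & 255;
    *c = (input >> 8) & 255;
    *d = input & 255;
}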

int main() {
    // ...
}

The unpack() function requires a uint8x4_t value and the addresses of four uint8_t destination variables. It uses bit shifting and bitwise AND to copy bytes from input into the four output variables.
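To check the shifting logic away from the Pi, the same idea can be written against plain uint32_t values. This is a sketch under that assumption, with the hypothetical names pack_u8x4() and unpack_u8x4(). It widens every byte to 32 bits before shifting, because a uint8_t is otherwise promoted to a signed int and shifting a value of 128 or more left by 24 bits would overflow it.

```c
#include <stdint.h>

/* Pack four bytes into one 32-bit value, with a in the most significant byte. */
uint32_t pack_u8x4(uint8_t a, uint8_t b, uint8_t c, uint8_t d) {
    return ((uint32_t)a << 24) | ((uint32_t)b << 16)
         | ((uint32_t)c << 8)  |  (uint32_t)d;
}

/* Split a 32-bit value back into four bytes, most significant byte first. */
void unpack_u8x4(uint32_t v, uint8_t out[4]) {
    out[0] = (uint8_t)(v >> 24);
    out[1] = (uint8_t)(v >> 16);
    out[2] = (uint8_t)(v >> 8);
    out[3] = (uint8_t)v;
}
```

For example, pack_u8x4(1, 2, 3, 4) gives 0x01020304. Note that this places the first argument in the most significant byte, whereas copying an array into a 32-bit variable on the little-endian Pi places the first element in the least significant byte. Either order works for addition as long as both operands and the result are handled the same way.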

Next, modify main() to use pack() and unpack().

int main() {
    uint8x4_t simd1 = pack(1, 2, 3, 4);
    uint8x4_t simd2 = pack(10, 11, 13, 14);
    uint8x4_t added;
    uint8_t byte1, byte2, byte3, byte4;

    added = __uadd8(simd1, simd2);
    unpack(added, &byte1, &byte2, &byte3, &byte4);
    printf("Results: %u %u %u %u\n", byte1, byte2, byte3, byte4);
    return 0;
}

Notice that the code takes the address of byte1, byte2, byte3, and byte4 using the ampersand operator. It passes these addresses to the unpack() function so it can write data into these variables.

Now, compile and test. The program prints Results: 11 13 16 18 as expected.

Conclusion

In this tutorial, you learned how to write C code to use the ARMv6 SIMD operations supported by your Raspberry Pi Zero W.

Other Raspberry Pi computers have different processors and support different SIMD instructions. You may need to look up your exact processor to find out which instruction set it supports.

Thank you for reading!
