Tutorial: Using ARMv6 SIMD Intrinsic Instructions on a Raspberry Pi Zero W
Raspberry Pi Zero W computers have a single-core ARM1176JZF-S CPU, an ARM11-family processor that implements version 6 of the ARM architecture (ARMv6). From the docs:
It supports the ARM and Thumb instruction sets, Jazelle technology to enable direct execution of Java bytecodes, and a range of SIMD DSP instructions that operate on 16-bit or 8-bit data values in 32-bit registers.
The SIMD (Single Instruction, Multiple Data) instructions are what we’ll be talking about in this tutorial. These instructions can help you speed up an application that needs to perform the same operation on many pieces of data. If you have two arrays of 8-bit integers and you want to add each pair of corresponding elements and store the results in a third array, you can use SIMD instructions to perform four additions at a time.
In this tutorial, you will use ARMv6 SIMD operations in C code to add four pairs of 8-bit unsigned integers in a single operation.
Prerequisites
- Raspberry Pi Zero W
- Raspberry Pi OS running on the Zero W. I’m using the 32-bit version based on Debian 11 with version 6.1 of the Linux kernel.
- The Clang compiler. Install it with apt install clang. GCC should work fine, too.
Prepare main.c
First, connect to your Pi Zero using SSH. You can also use VS Code Remote Development if you’re familiar with that.
Then, create a directory for your code and change into it.
Now, create a main.c file, open it in your code editor, and input the following code.
The first include enables you to use printf() to write text to the console. The next three lines use a macro and include the header file that defines the ARM SIMD instructions and types used in this tutorial.
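As a rough sketch of what such a preamble can look like: the guard below assumes the macro in question is `__ARM_FEATURE_SIMD32`, which the ARM C Language Extensions define when the 32-bit SIMD instructions are available. The `HAVE_ARM_SIMD32` macro and the `have_arm_simd32()` helper are hypothetical names added for illustration and are not part of the tutorial's code.

```c
#include <stdio.h>   /* printf() */
#include <stdint.h>  /* uint8_t, uint32_t */

/* Assumption: __ARM_FEATURE_SIMD32 is the macro that signals ARMv6 SIMD support. */
#if defined(__ARM_FEATURE_SIMD32)
#include <arm_acle.h>      /* declares uint8x4_t and __uadd8() */
#define HAVE_ARM_SIMD32 1  /* hypothetical convenience macro */
#else
#define HAVE_ARM_SIMD32 0
#endif

/* Hypothetical helper: reports at run time whether the intrinsics were compiled in. */
int have_arm_simd32(void) {
    return HAVE_ARM_SIMD32;
}
```

On a Pi Zero W this guard should include the header. On other machines it skips the header, so the file still compiles.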
Note: I had a hard time finding usage documentation for ARMv6 SIMD intrinsics on the internet. I stumbled across the arm_acle.h header by searching my Raspberry Pi for “uadd8”.
Compile and test.
Clang should compile the file without printing anything to the console. When it’s done, you should see an executable file named main in the current directory.
Use SIMD variables and the uadd8 instruction
This tutorial uses the __uadd8 function to add four pairs of unsigned 8-bit integers in a single operation. The function accepts two parameters of the uint8x4_t type and returns a value of the same type. The uint8x4_t type works with several SIMD instructions and holds four unsigned 8-bit integer values packed into 32 bits. When you compile, the compiler maps this function to the uadd8 instruction supported by the CPU. ARMv6 has no dedicated SIMD registers, so the uint8x4_t values live in ordinary 32-bit general-purpose registers.
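To make the lane-by-lane behaviour concrete, here is a minimal sketch of a wrapper around `__uadd8`. The name `add_bytes` is hypothetical. The sketch assumes `uint8x4_t` is a typedef for a 32-bit unsigned integer, as it is in Clang's arm_acle.h, so plain `uint32_t` values can be passed. The fallback branch is my own portable emulation for machines without the instruction, not something from the header.

```c
#include <stdint.h>

#if defined(__ARM_FEATURE_SIMD32)
#include <arm_acle.h>
#endif

/* Hypothetical helper: add four packed unsigned bytes, lane by lane. */
uint32_t add_bytes(uint32_t a, uint32_t b) {
#if defined(__ARM_FEATURE_SIMD32)
    return __uadd8(a, b);  /* single uadd8 instruction on ARMv6 */
#else
    /* Portable emulation: each byte is added on its own, modulo 256. */
    uint32_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t x = (a >> (8 * lane)) & 0xFFu;
        uint32_t y = (b >> (8 * lane)) & 0xFFu;
        result |= ((x + y) & 0xFFu) << (8 * lane);
    }
    return result;
#endif
}
```

For example, `add_bytes(0x01020304, 0x0A0A0A0A)` gives `0x0B0C0D0E`, and `add_bytes(0xFF, 1)` gives 0 because the overflow in the lowest byte does not carry into the next one.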
First, define three variables and assign values to the first two.
Notice that you assigned only one value to each variable. The compiler knows that the simd1 variable holds 32 bits, so it expanded the assigned value 1 to a 32-bit value with a binary representation of 00000000 00000000 00000000 00000001 (spaces added for clarity). Similarly, the simd2 variable holds a value of 00000000 00000000 00000000 00000010.
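One way to check which byte a plain assignment ends up in is to pull individual bytes back out of the 32-bit value. The helper below is a hypothetical illustration and is not part of the tutorial's code. It treats lane 0 as the least significant byte.

```c
#include <stdint.h>

/* Hypothetical helper: return byte lane 0..3 of a packed 32-bit value.
   Lane 0 is the least significant byte. */
uint8_t lane_of(uint32_t packed, int lane) {
    return (uint8_t)(packed >> (8 * lane));
}
```

With a value of 1, `lane_of(value, 0)` is 1 and lanes 1 to 3 are all 0, which matches the binary representation shown above.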
Now, use the SIMD addition instruction, and print the results of the addition.
Recall that the __uadd8() function performs its additions in parallel. It adds the first byte of simd1 to the first byte of simd2 and stores the result in the first byte of added. It does the same for the second, third, and fourth bytes. Each byte is added independently, so a sum that overflows 8 bits wraps around without carrying into the neighboring byte.
Now, compile and test.
The program prints the number 3, as expected.
Our program works, but it doesn’t show whether the first three bytes of the input variables were added since those bytes are all zeros. In the next section, you will use arrays of integers to verify that the addition is working as expected.
Add arrays of unsigned integers
The real value of SIMD comes when you have arrays of data to operate on in parallel. Next, you will use type-casting and an assignment to copy four integers from arrays into the SIMD variables before adding.
Modify main() to match the following code.
The code declares two arrays of unsigned 8-bit integers, and a pointer. Then, it casts each array – which you can think of as a uint8_t pointer – to a uint8x4_t pointer. Finally, it dereferences to copy the values into the simd1 and simd2 variables. Casting changes the number of bytes referenced by the pointer from 1 to 4. When you dereference the pointer, 4 bytes will be copied into the SIMD variable instead of 1. Try to dereference array1 into simd1 without casting and see what happens.
When the addition has finished, the code takes the address of the added variable, which gives a pointer to a uint8x4_t value. Then, it casts that pointer to a narrower uint8_t pointer and stores it in array3. This lets you treat array3 as if it were an array of four 8-bit integers.
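Pointer casts like these rely on the byte array being suitably aligned and on the compiler tolerating the type pun. A more conservative alternative, sketched here, copies the four bytes with `memcpy`. The names `load4` and `store4` are hypothetical, and this is not the approach the tutorial's code takes.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy four consecutive bytes into one 32-bit value. */
uint32_t load4(const uint8_t *src) {
    uint32_t value;
    memcpy(&value, src, sizeof value);
    return value;
}

/* Hypothetical helper: copy a 32-bit value back out as four bytes. */
void store4(uint8_t *dst, uint32_t value) {
    memcpy(dst, &value, sizeof value);
}
```

Compilers typically turn a fixed four-byte `memcpy` like this into a single load or store, so the safer form does not have to cost anything extra.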
Now, compile and test.
The program prints 11 12 13 14 as expected.
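The same idea extends to arrays longer than four elements. The sketch below walks two arrays in chunks of four bytes and handles any leftover elements one at a time. The function name `add_arrays` is hypothetical. The code assumes `uint8x4_t` is interchangeable with `uint32_t`, and it falls back to the plain loop on machines without `__uadd8`.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__ARM_FEATURE_SIMD32)
#include <arm_acle.h>
#endif

/* Hypothetical helper: out[i] = a[i] + b[i] (modulo 256) for n elements. */
void add_arrays(const uint8_t *a, const uint8_t *b, uint8_t *out, size_t n) {
    size_t i = 0;
#if defined(__ARM_FEATURE_SIMD32)
    /* Process four elements per uadd8 instruction. */
    for (; i + 4 <= n; i += 4) {
        uint32_t x, y, sum;
        memcpy(&x, a + i, sizeof x);
        memcpy(&y, b + i, sizeof y);
        sum = __uadd8(x, y);
        memcpy(out + i, &sum, sizeof sum);
    }
#endif
    /* Leftover elements (or everything, without SIMD) are added one by one. */
    for (; i < n; i++) {
        out[i] = (uint8_t)(a[i] + b[i]);
    }
}
```

With six elements, the SIMD path handles the first four in one step and the scalar loop finishes the last two.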
Create helper functions for SIMD addition
If you don’t want to add an array of integers, you can perform SIMD addition on four separate integer values, but you must combine them into a single SIMD variable first.
Define the helper function below.
The pack() function uses bit shifting and bitwise OR to copy bits from four input variables into a single SIMD variable.
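A minimal sketch of such a function follows. It assumes the first argument goes into the least significant byte and that `uint8x4_t` can be represented as a `uint32_t`. The parameter names and byte order are my assumptions.

```c
#include <stdint.h>

/* Sketch of pack(): combine four bytes into one 32-bit value.
   Assumption: b0 lands in the lowest byte, b3 in the highest. */
uint32_t pack(uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3) {
    return (uint32_t)b0
         | ((uint32_t)b1 << 8)
         | ((uint32_t)b2 << 16)
         | ((uint32_t)b3 << 24);
}
```

The casts to `uint32_t` widen each byte before shifting, which keeps the shift by 24 well defined.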
Now, define unpack() to extract bits from the SIMD variable into 4 separate variables.
The unpack() function requires a uint8x4_t value and the addresses of four uint8_t destination variables. It uses bit shifting and bitwise AND to copy bytes from input into the four output variables.
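A matching sketch of the reverse step is shown below, under the same assumptions: the lowest byte goes to the first output and `uint8x4_t` is treated as a `uint32_t`. The parameter names are hypothetical.

```c
#include <stdint.h>

/* Sketch of unpack(): split one 32-bit value into four bytes.
   Assumption: the lowest byte is written to *b0, the highest to *b3. */
void unpack(uint32_t input, uint8_t *b0, uint8_t *b1, uint8_t *b2, uint8_t *b3) {
    *b0 = (uint8_t)(input & 0xFFu);
    *b1 = (uint8_t)((input >> 8) & 0xFFu);
    *b2 = (uint8_t)((input >> 16) & 0xFFu);
    *b3 = (uint8_t)((input >> 24) & 0xFFu);
}
```

Each shift moves the wanted byte down to the lowest position, and the mask with 0xFF discards everything above it.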
Next, modify main() to use pack() and unpack().
Notice that the code takes the address of byte1, byte2, byte3, and byte4 using the ampersand operator. It passes these addresses to the unpack() function so it can write data into these variables.
Now, compile and test. The program prints 11 13 16 18 as expected.
Conclusion
In this tutorial, you learned how to write C code to use the ARMv6 SIMD operations supported by your Raspberry Pi Zero W.
Other Raspberry Pi computers have different processors and support different SIMD instructions. You may need to look up your exact processor to find out which instruction set it supports.
Thank you for reading!
Further Reading
- Refer to ARMv6 SIMD intrinsics, summary descriptions for information on the full set of ARMv6 SIMD instructions
- Near the end of writing this tutorial, I stumbled across this document on ARM ACLE which relates to the header file used in this tutorial
- The ARM NEON intrinsics are supported on the Raspberry Pi 3 and 4