Raspberry Pi Zero W computers have a single-core ARM1176JZF-S CPU, a member of the ARM11 family that implements version 6 of the ARM architecture (ARMv6). From the docs:
It supports the ARM and Thumb instruction sets, Jazelle technology to enable direct execution of Java bytecodes, and a range of SIMD DSP instructions that operate on 16-bit or 8-bit data values in 32-bit registers.
The SIMD (Single Instruction Multiple Data) instructions are what we’ll be talking about in this tutorial. These instructions can help you speed up an application that needs to perform the same operation on many pieces of data. If you have two arrays of 8-bit integers and you want to add each pair of corresponding elements and store the results in a third array, you can use SIMD instructions to perform four additions at a time.
In this tutorial, you will use ARMv6 SIMD operations in C code to add four 8-bit unsigned integers in a single operation.
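For comparison, the array addition described above can be sketched in plain C without any SIMD. This is only a baseline under my own naming (add_arrays_scalar is a hypothetical helper, not part of the tutorial's main.c); it performs one 8-bit addition per loop iteration, which is the work the SIMD instruction does four at a time.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar baseline: adds corresponding elements one at a time. */
void add_arrays_scalar(const uint8_t *a, const uint8_t *b,
                       uint8_t *sum, size_t count) {
    for (size_t i = 0; i < count; i++) {
        /* The cast keeps the result in 8 bits, so it wraps modulo 256. */
        sum[i] = (uint8_t)(a[i] + b[i]);
    }
}
```

Calling add_arrays_scalar() with {1, 2, 3, 4} and {10, 10, 10, 10} fills the third array with 11, 12, 13, and 14.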
Prerequisites
- Raspberry Pi Zero W
- Raspberry Pi OS running on the Zero W. I’m using the 32-bit version based on Debian 11 with version 6.1 of the Linux kernel.
- The Clang compiler. Install with apt install clang, though GCC should work fine, too.
Prepare main.c
First, connect to your Pi Zero using SSH. You can also use VS Code Remote Development if you’re familiar with that.
$ ssh pi@IP_ADDRESS
Then, create a directory for your code and change into it.
$ mkdir simd
$ cd simd
Now, create a main.c file, open it in your code editor, and input the following code.
#include <stdio.h>
#ifdef __ARM_ACLE
#include <arm_acle.h>
#endif /* __ARM_ACLE */
int main() {
return 0;
}
The first include enables you to use printf() to write text to the console. The next three lines check the __ARM_ACLE macro and, when the compiler defines it, include the header file that declares the ARM SIMD intrinsics and types used in this tutorial.
Note: I had a hard time finding usage documentation for ARMv6 SIMD intrinsics on the internet. I stumbled across the arm_acle.h header by searching my Raspberry Pi for “uadd8”.
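If you want to confirm what your compiler reports before going further, a small check like the following may help. It is a sketch, not part of the tutorial's main.c: simd32_available() and report_acle() are names I made up, and it relies on __ARM_FEATURE_SIMD32, the ACLE feature macro that signals the 32-bit SIMD intrinsics are available.

```c
#include <stdio.h>

/* Returns 1 when the compiler reports 32-bit SIMD support, 0 otherwise. */
int simd32_available(void) {
#if defined(__ARM_FEATURE_SIMD32)
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper: prints which ACLE macros the compiler defined. */
void report_acle(void) {
#if defined(__ARM_ACLE)
    printf("__ARM_ACLE = %d\n", (int)__ARM_ACLE);
#else
    puts("__ARM_ACLE is not defined");
#endif
    printf("32-bit SIMD intrinsics available: %s\n",
           simd32_available() ? "yes" : "no");
}
```

Call report_acle() from main() and compile as usual. On a Pi Zero W you would expect it to report that the intrinsics are available.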
Compile and test.
$ clang -o main main.c
Clang should compile the file without printing anything to the console. When it’s done, you should see an executable file named main in the current directory.
Use SIMD variables and the uadd8 instruction
This tutorial uses the __uadd8 function to add four unsigned 8-bit integers in a single operation. The function accepts two parameters of the uint8x4_t type and returns a value of the same type. The uint8x4_t type works with several SIMD instructions and holds four unsigned 8-bit integer values packed into 32 bits. When you compile, the compiler maps this function to the uadd8 instruction supported by the CPU and keeps the uint8x4_t variables in ordinary 32-bit registers, which is what the ARMv6 SIMD instructions operate on.
First, define three variables and assign values to the first two.
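The original listing for this step did not survive, so the following is a reconstruction based on the names and values used in the text (simd1, simd2, added, and the values 1 and 2). The fallback typedef is my own assumption so the lines also compile on a non-ARM machine; on the Pi, arm_acle.h supplies the type.

```c
#include <stdint.h>

#if defined(__ARM_FEATURE_SIMD32)
#include <arm_acle.h>            /* declares uint8x4_t on ARMv6 */
#else
typedef uint32_t uint8x4_t;      /* assumption: stand-in type for other CPUs */
#endif

/* In the tutorial's main.c, these three lines go at the top of main(). */
uint8x4_t simd1 = 1;
uint8x4_t simd2 = 2;
uint8x4_t added;                 /* receives the result of the addition */
```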
Notice that you assigned only one value to each variable. The compiler knows that the simd1 variable holds 32 bits, so it expanded the assigned value 1 to a 32-bit value with a binary representation of 00000000 00000000 00000000 00000001 (spaces added for clarity). Similarly, the simd2 variable holds a value of 00000000 00000000 00000000 00000010.
Now, use the SIMD addition instruction, and print the results of the addition.
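The listing for this step is also missing. In main() it presumably amounts to added = __uadd8(simd1, simd2); followed by a printf() of added. The sketch below wraps those two statements in a function so it can be compiled on its own; add_bytes() and add_and_print() are hypothetical names, and the plain C branch is a stand-in I added for machines without the uadd8 instruction.

```c
#include <stdint.h>
#include <stdio.h>

#if defined(__ARM_FEATURE_SIMD32)
#include <arm_acle.h>            /* declares uint8x4_t and __uadd8() */
#else
typedef uint32_t uint8x4_t;      /* assumption: stand-in type for other CPUs */
#endif

/* Hypothetical wrapper: the intrinsic on ARMv6, plain C elsewhere. */
static uint8x4_t add_bytes(uint8x4_t a, uint8x4_t b) {
#if defined(__ARM_FEATURE_SIMD32)
    return __uadd8(a, b);
#else
    uint8x4_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        int shift = 8 * lane;
        /* Keep the low 8 bits so a carry never reaches the next lane. */
        uint8x4_t sum = ((a >> shift) + (b >> shift)) & 0xFFu;
        result |= sum << shift;
    }
    return result;
#endif
}

/* Adds the two packed values and prints the result as one integer. */
uint8x4_t add_and_print(uint8x4_t simd1, uint8x4_t simd2) {
    uint8x4_t added = add_bytes(simd1, simd2);
    printf("%u\n", (unsigned)added);
    return added;
}
```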
Recall that the __uadd8() function performs additions in parallel. It adds the first byte of simd1 to the first byte of simd2 and stores the result in the first byte of added. Then it follows the same pattern for the second, third, and fourth bytes.
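To make the per-byte behavior concrete, here is a small worked example using hexadecimal values, where each pair of hex digits is one byte. It is a sketch under the same assumptions as before: add_bytes() and show_lanes() are hypothetical names, and the plain C branch only imitates uadd8 on machines that lack it.

```c
#include <stdint.h>
#include <stdio.h>

#if defined(__ARM_FEATURE_SIMD32)
#include <arm_acle.h>            /* declares uint8x4_t and __uadd8() */
#else
typedef uint32_t uint8x4_t;      /* assumption: stand-in type for other CPUs */
#endif

/* Hypothetical wrapper: the intrinsic on ARMv6, plain C elsewhere. */
static uint8x4_t add_bytes(uint8x4_t a, uint8x4_t b) {
#if defined(__ARM_FEATURE_SIMD32)
    return __uadd8(a, b);
#else
    uint8x4_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        int shift = 8 * lane;
        uint8x4_t sum = ((a >> shift) + (b >> shift)) & 0xFFu;
        result |= sum << shift;
    }
    return result;
#endif
}

/* Prints two sums in hex so each byte can be read off separately. */
void show_lanes(void) {
    /* 01+0a, 02+0a, 03+0a, 04+0a: prints 0b0c0d0e */
    printf("%08x\n", (unsigned)add_bytes(0x01020304u, 0x0A0A0A0Au));
    /* ff+01 wraps to 00 and does not carry into the next byte: prints 00000000 */
    printf("%08x\n", (unsigned)add_bytes(0x000000FFu, 0x00000001u));
}
```

The second call shows the difference from ordinary 32-bit addition, which would produce 00000100.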
Now, compile and test.
$ clang -o main main.c && ./main
3
The program prints the number 3, as expected.
Our program works, but it doesn’t show whether the other three bytes of the input variables were added, since those bytes are all zeros. In the next section, you will use arrays of integers to verify that the addition is working as expected.
Add arrays of unsigned integers
The real value of SIMD comes when you have arrays of data to operate on in parallel. Next, you will use type-casting and an assignment to copy four integers from arrays into the SIMD variables before adding.
Modify main() to match the following code.
int main() {
uint8_t array1[] = {1, 2, 3, 4};
uint8_t array2[] = {10, 10, 10, 10};
uint8_t *array3;
// Cast to expand number of bytes copied into variables during dereference
uint8x4_t simd1 = *((uint8x4_t*)array1);
uint8x4_t simd2 = *((uint8x4_t*)array2);
uint8x4_t added = __uadd8(simd1, simd2);
// Get the address of result, then cast to uint8_t pointer
// so we can reference individual bytes easily
array3 = (uint8_t*)&added;
printf("%u %u %u %u\n", array3[0], array3[1], array3[2], array3[3]);
return 0;
}
The code declares two arrays of unsigned 8-bit integers and a pointer. Then, it casts each array (which you can think of as a uint8_t pointer) to a uint8x4_t pointer. Finally, it dereferences the cast pointers to copy the values into the simd1 and simd2 variables. Casting changes the number of bytes referenced by the pointer from 1 to 4, so when you dereference the pointer, 4 bytes are copied into the SIMD variable instead of 1. Try to dereference array1 into simd1 without casting and see what happens.
When the addition has finished, the code takes the address of the added variable, which results in a pointer to a uint8x4_t value. Then, it casts the pointer to a narrower uint8_t pointer and stores the address in array3. This lets you treat array3 as if it were an array of 8-bit integers.
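The pointer casts work on the Pi Zero, but reading a uint8_t array through a wider pointer type depends on alignment and is not something the C standard guarantees. If that worries you, one alternative is to copy the bytes with memcpy() instead. The sketch below assumes two helper names of my own, load4() and store4(); compilers typically turn a fixed 4-byte memcpy() into a single load or store.

```c
#include <stdint.h>
#include <string.h>

#if defined(__ARM_FEATURE_SIMD32)
#include <arm_acle.h>            /* declares uint8x4_t on ARMv6 */
#else
typedef uint32_t uint8x4_t;      /* assumption: stand-in type for other CPUs */
#endif

/* Copies four bytes from an array into one packed value. */
uint8x4_t load4(const uint8_t *bytes) {
    uint8x4_t packed;
    memcpy(&packed, bytes, sizeof packed);
    return packed;
}

/* Copies a packed value back out into four bytes. */
void store4(uint8_t *bytes, uint8x4_t packed) {
    memcpy(bytes, &packed, sizeof packed);
}
```

With these helpers, simd1 = load4(array1) would replace the cast and dereference, and store4() would write the result into a real four-byte array instead of aliasing added.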
Now, compile and test.
$ clang -o main main.c && ./main
11 12 13 14
The program printed 11 12 13 14 as expected.
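The same idea extends to arrays longer than four elements. The following is a sketch rather than part of the tutorial: add_arrays() and add_bytes() are hypothetical names, the bytes are copied with memcpy() instead of pointer casts, and leftover elements that don't fill a group of four are added one at a time.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__ARM_FEATURE_SIMD32)
#include <arm_acle.h>            /* declares uint8x4_t and __uadd8() */
#else
typedef uint32_t uint8x4_t;      /* assumption: stand-in type for other CPUs */
#endif

/* Hypothetical wrapper: the intrinsic on ARMv6, plain C elsewhere. */
static uint8x4_t add_bytes(uint8x4_t a, uint8x4_t b) {
#if defined(__ARM_FEATURE_SIMD32)
    return __uadd8(a, b);
#else
    uint8x4_t result = 0;
    for (int lane = 0; lane < 4; lane++) {
        int shift = 8 * lane;
        uint8x4_t sum = ((a >> shift) + (b >> shift)) & 0xFFu;
        result |= sum << shift;
    }
    return result;
#endif
}

/* Adds two byte arrays, four elements per step where possible. */
void add_arrays(const uint8_t *a, const uint8_t *b,
                uint8_t *sum, size_t count) {
    size_t i = 0;
    for (; i + 4 <= count; i += 4) {
        uint8x4_t x, y, r;
        memcpy(&x, a + i, sizeof x);
        memcpy(&y, b + i, sizeof y);
        r = add_bytes(x, y);
        memcpy(sum + i, &r, sizeof r);
    }
    /* Handle the remaining zero to three elements individually. */
    for (; i < count; i++) {
        sum[i] = (uint8_t)(a[i] + b[i]);
    }
}
```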
Create helper functions for SIMD addition
If you don’t want to add an array of integers, you can perform SIMD addition on four separate integer values, but you must combine them into a single SIMD variable first.
Define the helper function below.
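The original pack() listing is missing, so this is a reconstruction from the description and from the way main() calls it. The byte order is my assumption: the first argument goes into the lowest byte, which matches the layout of the arrays in the previous section on the Pi.

```c
#include <stdint.h>

#if defined(__ARM_FEATURE_SIMD32)
#include <arm_acle.h>            /* declares uint8x4_t on ARMv6 */
#else
typedef uint32_t uint8x4_t;      /* assumption: stand-in type for other CPUs */
#endif

/* Combines four bytes into one packed value (b1 ends up in the lowest byte). */
uint8x4_t pack(uint8_t b1, uint8_t b2, uint8_t b3, uint8_t b4) {
    return (uint8x4_t)b1
         | ((uint8x4_t)b2 << 8)
         | ((uint8x4_t)b3 << 16)
         | ((uint8x4_t)b4 << 24);
}
```

With this ordering, pack(1, 2, 3, 4) produces the value 0x04030201.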
The pack() function uses bit shifting and bitwise OR to copy bits from four input variables into a single SIMD variable.
Now, define unpack() to extract bits from the SIMD variable into 4 separate variables.
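The unpack() listing is missing as well. The following reconstruction takes its signature from the call in main() and assumes the same byte order as the pack() sketch, with the first output taken from the lowest byte. The parameter names b1 to b4 are mine.

```c
#include <stdint.h>

#if defined(__ARM_FEATURE_SIMD32)
#include <arm_acle.h>            /* declares uint8x4_t on ARMv6 */
#else
typedef uint32_t uint8x4_t;      /* assumption: stand-in type for other CPUs */
#endif

/* Splits a packed value into four bytes (b1 receives the lowest byte). */
void unpack(uint8x4_t input,
            uint8_t *b1, uint8_t *b2, uint8_t *b3, uint8_t *b4) {
    *b1 = (uint8_t)(input & 0xFFu);
    *b2 = (uint8_t)((input >> 8) & 0xFFu);
    *b3 = (uint8_t)((input >> 16) & 0xFFu);
    *b4 = (uint8_t)((input >> 24) & 0xFFu);
}
```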
The unpack() function requires a uint8x4_t value and the addresses of four uint8_t destination variables. It uses bit shifting and bitwise AND to copy bytes from input into the four output variables.
Next, modify main() to use pack() and unpack().
int main() {
uint8x4_t simd1 = pack(1, 2, 3, 4);
uint8x4_t simd2 = pack(10, 11, 13, 14);
uint8x4_t added;
uint8_t byte1, byte2, byte3, byte4;
added = __uadd8(simd1, simd2);
unpack(added, &byte1, &byte2, &byte3, &byte4);
printf("Results: %u %u %u %u\n", byte1, byte2, byte3, byte4);
return 0;
}
Notice that the code takes the address of byte1, byte2, byte3, and byte4 using the ampersand operator. It passes these addresses to the unpack() function so it can write data into these variables.
Now, compile and test. The program prints 11 13 16 18 as expected.
Conclusion
In this tutorial, you learned how to write C code to use the ARMv6 SIMD operations supported by your Raspberry Pi Zero W.
Other Raspberry Pi computers have different processors and support different SIMD instructions. You may need to look up your exact processor to find out which instruction set it supports.
Thank you for reading!
Further Reading
- Refer to ARMv6 SIMD intrinsics, summary descriptions for information on the full set of ARMv6 SIMD instructions
- Near the end of writing this tutorial, I stumbled across this document on ARM ACLE which relates to the header file used in this tutorial
- The ARM NEON intrinsics are supported on Raspberry Pi 3 and 4 models