I spent some time over the last few weeks optimizing some of our low-level DSP functions using ARM NEON instructions, and thought I’d share my experience here, as there were a few surprises.
First, some background on NEON. Many of the processors used in the latest generation of iOS and Android devices support NEON instructions, which are SIMD instructions which can be very useful for DSP algorithms.
NEON instructions are optional in the armv7 instruction set. All iOS devices with armv7 processors do support NEON, so on iOS, you can just wrap your NEON code inside of an “#ifdef __ARM_NEON__” block.
On Android you’ll also need to use the android_getCpuFeatures() function to check at runtime whether a particular armv7 processor supports NEON. (This entire post, by the way, assumes you are writing native code using the Android NDK. If you’re writing Java code, you can’t make use of NEON directly.)
(For for information, I highly recommend this overview of NEON and this overview of the ARM architecture in general.)
NEON operations let you perform certain operations in parallel. For example, say you have an array of integers, and you want to multiply them all by 2 by shifting left 1 bit. Some C code to do this would be the following:
void multiplyBy2(int* p, int count)
{
for (int i = 0; i < count; ++i)
{
*p++ <<= 1;
}
}
With NEON, you could process 4 integers at a time, for a significant speedup:
#include <arm_neon.h>
void multiplyBy2Neon(int* p, int count)
{
assert(count % 4 == 0); // must be a multiple of 4
int n = n / 4;
for (int i = 0; i < n; ++i)
{
// load 4 ints into a quadword register
int32x4_t in = vld1q_s32(p);
in = vshlq_n_s32(in, 1); // shift each int left by 1
vst1q_s32(p, in); // store register back into memory
p += 4;
}
}
This uses the NEON intrinsics which are supported by both the Apple compiler and by the version of GCC supplied with the Android NDK. The intrinsics are not well documented anywhere that I could find; there is a list of them in the GCC documentation that indicates the corresponding ARM instruction for each intrinsic, and the instructions themselves are documented in the armv7 reference manual.
You’ll find that this code runs significantly faster. It won’t run 4 times faster, probably since you still have to spend a lot of the time loading values from memory, but on an iPod 4g, I found that it ran almost twice as fast.
Note that, for simplicity, we are assuming that the count variable is a multiple of 4. If not, we would need to have extra code to deal with the leftover ints before or after our NEON processing loop.
So, on to a more complicated, real-world example. I have a function, stereoPan(), that takes an interleaved stereo input buffer and applies a pan matrix to it. The samples and the pan values are all 8.24 fixed-point values represented as 32-bit integers. Here is a somewhat simplified version of it:
void stereoPan_default(int32* buf, int frames,
int32 ll, int32 rr, int32 lr, int32 rl)
{
int32* p = buf;
const int32* pEnd = buf + frames * 2;
while (p < pEnd)
{
// read input samples
int64 in_l = *p;
int64 in_r = *(p+1);
// multiply by pan matrix and output
*p++ = (int32) ((in_l * ll + in_r * lr) >> 24);
*p++ = (int32) ((in_l * rl + in_r * rr) >> 24);
}
}
And the version using NEON intrinsics:
void stereoPan_neon(int32* buf, int frames,
int32 ll, int32 rr, int32 lr, int32 rl)
{
int32* p = buf;
const int32* pEnd = buf + frames * 2;
// load constants
int32x2_t llv = vdup_n_s32(ll);
int32x2_t rlv = vdup_n_s32(rl);
int32x2_t rrv = vdup_n_s32(rr);
int32x2_t lrv = vdup_n_s32(lr);
while (p < pEnd)
{
// load 2 samples for each channel, deinterleaving
int32x2x2_t in = vld2_s32(p);
// multiply samples by constants
int64x2_t out_l_a = vmull_s32(in.val[0], llv);
int64x2_t out_l_b = vmull_s32(in.val[1], lrv);
int64x2_t out_r_a = vmull_s32(in.val[0], rlv);
int64x2_t out_r_b = vmull_s32(in.val[1], rrv);
// add
int64x2_t out_l_c = vaddq_s64(out_l_a, out_l_b);
int64x2_t out_r_c = vaddq_s64(out_r_a, out_r_b);
// shift right and narrow to s32
int32x2x2_t out;
out.val[0] = vshrn_n_s64(out_l_c, 24);
out.val[1] = vshrn_n_s64(out_r_c, 24);
vst2_s32(p, out);
p += 4;
}
}
When I run this through my test program, I find that it does indeed run significantly faster on my iPod Touch:
stereo_pan_default : 11.18 ms stereo_pan_neon : 2.62 ms
Great! Let’s try it on Android now; I’m sure it will be no problem at all, right? Here’s the output from my Nexus S:
stereo_pan_default : 11.99 ms stereo_pan_neon : 10.47 ms
Whoa there! The NEON version is barely at all faster than the ordinary C version. What’s going on?
Let’s look at the assembly output for the iOS version. I ran this command:
otool -vt libck.a > libck.s
to disassemble the library, then looked for the stereoPan_neon symbol; here’s the disassembly:
00000ec0 4694 mov ip, r2
00000ec2 004a lsls r2, r1, #1
00000ec4 2a01 cmp r2, #1
00000ec6 db1d blt.n 0xf04
00000ec8 aa01 add r2, sp, #4
00000eca 46e9 mov r9, sp
00000ecc ee813b90 vdup.32 d17, r3
00000ed0 f9e93c8f vld1.32 {d19[]}, [r9]
00000ed4 eb0001c1 add.w r1, r0, r1, lsl #3
00000ed8 f9e20c8f vld1.32 {d16[]}, [r2]
00000edc ee82cb90 vdup.32 d18, ip
00000ee0 f960488f vld2.32 {d20-d21}, [r0]
00000ee4 efe48ca3 vmull.s32 q12, d20, d19
00000ee8 efe46ca2 vmull.s32 q11, d20, d18
00000eec efe588a1 vmlal.s32 q12, d21, d17
00000ef0 efe568a0 vmlal.s32 q11, d21, d16
00000ef4 efe85838 vqshrun.s32 d21, q12, #8
00000ef8 efe84836 vqshrun.s32 d20, q11, #8
00000efc f940488d vst2.32 {d20-d21}, [r0]!
00000f00 4288 cmp r0, r1
00000f02 d3ed bcc.n 0xee0
00000f04 4770 bx lr
00000f06 bf00 nop
(If this look like gobbledygook to you, now would be a good time to have a look at this good introduction to ARM assembly, and to have a look again at the armv7 reference manual.)
A couple things are worth noting here. First, the Apple compiler improved on my algorithm somewhat; it combined the 2 multiplies and 2 adds into 2 multiply-and-accumulates, saving 2 instructions.
Also, I think the disassembler made a mistake here; the VQSHRUN should be a VSHRN. (I tested this by writing the function using the assembly given here, and I got different output unless I switched to VSHRN instead.) This is just a useful reminder that disassembly may not always be perfectly trustworthy!
Anyway, let’s compare to what the Android GCC compiler gives us. In the interest of brevity, I’ve removed everything but the instructions, and am showing only the content of the loop:
.L150:
vld2.32 {d16-d17}, [r5]
str r6, [r7, #4]
vstmia ip, {d16-d17}
ldmia ip, {r0, r1, r2, r3}
stmia sl, {r0, r1, r2, r3}
fldd d17, [sp, #64]
fldd d16, [sp, #72]
vmull.s32 q9, d17, d27
vmull.s32 q11, d16, d24
vmull.s32 q10, d17, d26
vmull.s32 q8, d16, d25
vadd.i64 q9, q9, q11
vadd.i64 q8, q10, q8
vshrn.i64 d18, q9, #24
vshrn.i64 d16, q8, #24
fstd d18, [sp, #48]
fstd d16, [sp, #56]
ldmia fp, {r0, r1, r2, r3}
stmia r8, {r0, r1, r2, r3}
mov r3, r7
mov r7, r4
str r6, [r3], #4
adds r3, r3, #4
str r6, [r3], #4
str r6, [r3, #0]
ldmia r8, {r0, r1, r2, r3}
stmia r4, {r0, r1, r2, r3}
vldmia r4, {d16-d17}
vst2.32 {d16-d17}, [r5]
adds r5, r5, #16
cmp r9, r5
bhi .L150
GCC didn’t do the clever trick of replace the separate add and multiply operations with multiply-and-accumulates. But that’s nothing compared to all the other instructions that are there after the NEON instructions. What the heck are they doing there, other than burning up cycles? I don’t know, and don’t have the energy to puzzle it through; all I know is that I want to get rid of them!
Here’s a new version, written using GCC inline assembly:
void stereoPan_neon(int32* buf, int frames,
int32 ll, int32 rr, int32 lr, int32 rl)
{
int32* p = buf;
const int32* pEnd = buf + frames * 2;
asm volatile
(
"cmp %[p], %[pEnd]\n\t"
"bge 0b\n\t"
"vdup.32 d16, %[ll]\n\t"
"vdup.32 d17, %[lr]\n\t"
"vdup.32 d18, %[rl]\n\t"
"vdup.32 d19, %[rr]\n\t"
"0:\n\t"
"vld2.32 {d20-d21}, [%[p]]\n\t"
"vmull.s32 q11, d20, d16\n\t"
"vmlal.s32 q11, d21, d17\n\t"
"vmull.s32 q12, d20, d18\n\t"
"vmlal.s32 q12, d21, d19\n\t"
"vshrn.i64 d20, q11, #24\n\t"
"vshrn.i64 d21, q12, #24\n\t"
"vst2.32 {d20-d21}, [%[p]]!\n\t"
"cmp %[p], %[pEnd]\n\t"
"blt 0b\n\t"
: [p] "+r" (p)
: [pEnd] "r" (pEnd), [ll] "r" (ll), [rl] "r" (rl), [rr] "r" (rr), [lr] "r" (lr)
: "d16", "d17", "d18", "d19", "d20", "d21", "q11", "q12", "cc", "memory"
);
}
Now, running this on my Nexus S, I get these times:
stereo_pan_default : 9.87 ms stereo_pan_neon : 1.87 ms
Not bad at all!
(In the final version of the function, I sped things up by another 20% or so by loading 4 frames of input instead of 2, which reduces the overall number of loads and stores.)
To conclude: my general approach to this kind of low-level optimization for iOS and Android is:
- Write an initial version using NEON intrinsics;
- Look at the disassembly from the Apple compiler, and see if it’s done any particularly clever optimizations you can use;
- Rewrite using inline assembly;
- Profile, profile, profile!

About

Great post, thanks.
I was wondering if you had any experience with iOS app store review with an app that includes Neon intrinsics? Does this type of code border on “non-public API”?
Hi Ronan,
NEON intrinsics shouldn’t cause any problems with app store submissions. It’s really an instruction set, not an API, and is widely used in iOS apps.
Fantastic thanks Steve