NEON optimizations for iOS and Android

I spent some time over the last few weeks optimizing some of our low-level DSP functions using ARM NEON instructions, and thought I’d share my experience here, as there were a few surprises.

First, some background on NEON.  Many of the processors used in the latest generation of iOS and Android devices support NEON instructions, SIMD (single instruction, multiple data) instructions that can be very useful for DSP algorithms.

NEON instructions are optional in the armv7 instruction set.  All iOS devices with armv7 processors do support NEON, so on iOS, you can just wrap your NEON code inside of an “#ifdef __ARM_NEON__” block.

On Android you’ll also need to use the android_getCpuFeatures() function to check at runtime whether a particular armv7 processor supports NEON.  (This entire post, by the way, assumes you are writing native code using the Android NDK.  If you’re writing Java code, you can’t make use of NEON directly.)
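As a rough sketch of how the compile-time and runtime checks might be combined: the function name canUseNeon() is made up for this illustration, and the Android branch assumes the NDK's cpufeatures module (which provides cpu-features.h) has been added to the build.

```c
#include <stdbool.h>

#if defined(__ANDROID__) && defined(__arm__)
#include <cpu-features.h>  // from the NDK's cpufeatures module (assumed to be linked in)
#endif

// Hypothetical helper: returns true if NEON code paths may be used.
bool canUseNeon(void)
{
#if defined(__ANDROID__) && defined(__arm__)
    // 32-bit ARM on Android: NEON is optional, so ask the device at runtime.
    return android_getCpuFamily() == ANDROID_CPU_FAMILY_ARM &&
           (android_getCpuFeatures() & ANDROID_CPU_ARM_FEATURE_NEON) != 0;
#elif defined(__ARM_NEON__) || defined(__ARM_NEON)
    // The compiler is targeting NEON (e.g. armv7 iOS), so no runtime check is needed.
    return true;
#else
    // Not an ARM/NEON build at all (simulator, desktop host, ...).
    return false;
#endif
}
```

A caller would then pick an implementation once, for example calling the NEON version of a function when canUseNeon() returns true and the plain C version otherwise.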

(For more information, I highly recommend this overview of NEON and this overview of the ARM architecture in general.)

NEON instructions let you perform certain operations in parallel.  For example, say you have an array of integers, and you want to multiply them all by 2 by shifting left 1 bit.  Some C code to do this would be the following:

void multiplyBy2(int* p, int count)
{
    for (int i = 0; i < count; ++i)
    {
        *p++ <<= 1;
    }
}

With NEON, you could process 4 integers at a time, for a significant speedup:

#include <assert.h>
#include <arm_neon.h>

void multiplyBy2Neon(int* p, int count)
{
    assert(count % 4 == 0); // must be a multiple of 4
    int n = count / 4;
    for (int i = 0; i < n; ++i)
    {
        // load 4 ints into a quadword register
        int32x4_t in = vld1q_s32(p);
        in = vshlq_n_s32(in, 1);  // shift each int left by 1
        vst1q_s32(p, in);  // store register back into memory
        p += 4;
    } 
}

This uses the NEON intrinsics which are supported by both the Apple compiler and by the version of GCC supplied with the Android NDK.  The intrinsics are not well documented anywhere that I could find; there is a list of them in the GCC documentation that indicates the corresponding ARM instruction for each intrinsic, and the instructions themselves are documented in the armv7 reference manual.

You’ll find that this code runs significantly faster.  It won’t run 4 times faster, probably since you still have to spend a lot of the time loading values from memory, but on an iPod 4g, I found that it ran almost twice as fast.

Note that, for simplicity, we are assuming that the count variable is a multiple of 4.  If not, we would need to have extra code to deal with the leftover ints before or after our NEON processing loop.
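Here is one way the leftover handling could look; a minimal sketch, not code from our library.  The name multiplyBy2Any is invented for the example, and the NEON part is guarded so the same file still builds (as plain C) on a non-NEON target.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON__) || defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical variant of multiplyBy2Neon that accepts any non-negative count.
void multiplyBy2Any(int* p, int count)
{
    assert(p != NULL && count >= 0);
    int i = 0;

#if defined(__ARM_NEON__) || defined(__ARM_NEON)
    // main loop: process 4 ints at a time while at least 4 remain
    for (; i + 4 <= count; i += 4)
    {
        int32x4_t in = vld1q_s32(p + i);
        in = vshlq_n_s32(in, 1);
        vst1q_s32(p + i, in);
    }
#endif

    // tail loop: handle the 0 to 3 leftover ints
    // (or the whole array when NEON is not available)
    for (; i < count; ++i)
    {
        p[i] <<= 1;
    }
}
```

Putting the scalar loop after the vector loop keeps the NEON loads starting at the beginning of the buffer, which is convenient if the buffer happens to be aligned.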

So, on to a more complicated, real-world example.  I have a function, stereoPan(), that takes an interleaved stereo input buffer and applies a pan matrix to it.  The samples and the pan values are all 8.24 fixed-point values represented as 32-bit integers.  Here is a somewhat simplified version of it:

void stereoPan_default(int32* buf, int frames,
                       int32 ll, int32 rr, int32 lr, int32 rl)
{
    int32* p = buf;
    const int32* pEnd = buf + frames * 2;
    while (p < pEnd)
    {
        // read input samples
        int64 in_l = *p;
        int64 in_r = *(p+1);

        // multiply by pan matrix and output
        *p++ = (int32) ((in_l * ll + in_r * lr) >> 24);
        *p++ = (int32) ((in_l * rl + in_r * rr) >> 24);
    }
}
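If the 8.24 format is unfamiliar: the low 24 bits hold the fraction, so 1.0 is represented as 1 << 24, and a product of two 8.24 values has 48 fractional bits that must be shifted back down by 24.  The helpers below are hypothetical (they are not part of the function above) and only illustrate the arithmetic.

```c
#include <assert.h>
#include <stdint.h>

#define FIXED_FRAC_BITS 24  // 8.24: 8 integer bits, 24 fractional bits

// Hypothetical helper: convert a float in [-128, 128) to 8.24.
int32_t toFixed(float x)
{
    assert(x >= -128.0f && x < 128.0f);
    return (int32_t)(x * (float)(1 << FIXED_FRAC_BITS));
}

// Hypothetical helper: convert 8.24 back to float.
float fromFixed(int32_t x)
{
    return (float)x / (float)(1 << FIXED_FRAC_BITS);
}

// Multiply two 8.24 values: widen to 64 bits so the intermediate product
// cannot overflow, then shift away the extra 24 fractional bits.
// (Like the function above, this relies on >> being an arithmetic shift
// for negative values, which is what ARM compilers do in practice.)
int32_t fixedMul(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) >> FIXED_FRAC_BITS);
}
```

For example, fixedMul(toFixed(0.5f), toFixed(0.5f)) gives the same bit pattern as toFixed(0.25f).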

And the version using NEON intrinsics:

void stereoPan_neon(int32* buf, int frames,
                    int32 ll, int32 rr, int32 lr, int32 rl)
{
    int32* p = buf;
    const int32* pEnd = buf + frames * 2;                 

    // load constants
    int32x2_t llv = vdup_n_s32(ll);
    int32x2_t rlv = vdup_n_s32(rl);
    int32x2_t rrv = vdup_n_s32(rr);
    int32x2_t lrv = vdup_n_s32(lr);                      

    while (p < pEnd)
    {
        // load 2 samples for each channel, deinterleaving
        int32x2x2_t in = vld2_s32(p);                        

        // multiply samples by constants
        int64x2_t out_l_a = vmull_s32(in.val[0], llv);
        int64x2_t out_l_b = vmull_s32(in.val[1], lrv);
        int64x2_t out_r_a = vmull_s32(in.val[0], rlv);
        int64x2_t out_r_b = vmull_s32(in.val[1], rrv);       

        // add
        int64x2_t out_l_c = vaddq_s64(out_l_a, out_l_b);
        int64x2_t out_r_c = vaddq_s64(out_r_a, out_r_b);     

        // shift right and narrow to s32
        int32x2x2_t out;
        out.val[0] = vshrn_n_s64(out_l_c, 24);
        out.val[1] = vshrn_n_s64(out_r_c, 24);               

        vst2_s32(p, out);
        p += 4;
    }
}

When I run this through my test program, I find that it does indeed run significantly faster on my iPod Touch:

stereo_pan_default             :  11.18 ms
stereo_pan_neon                :   2.62 ms

Great!  Let’s try it on Android now; I’m sure it will be no problem at all, right?  Here’s the output from my Nexus S:

stereo_pan_default             :  11.99 ms
stereo_pan_neon                :  10.47 ms

Whoa there!  The NEON version is barely faster than the ordinary C version.  What’s going on?

Let’s look at the assembly output for the iOS version.  I ran this command:

otool -vt libck.a > libck.s

to disassemble the library, then looked for the stereoPan_neon symbol; here’s the disassembly:

00000ec0        4694    mov ip, r2
00000ec2        004a    lsls    r2, r1, #1
00000ec4        2a01    cmp r2, #1
00000ec6        db1d    blt.n   0xf04
00000ec8        aa01    add r2, sp, #4
00000eca        46e9    mov r9, sp
00000ecc    ee813b90    vdup.32 d17, r3
00000ed0    f9e93c8f    vld1.32 {d19[]}, [r9]
00000ed4    eb0001c1    add.w   r1, r0, r1, lsl #3
00000ed8    f9e20c8f    vld1.32 {d16[]}, [r2]
00000edc    ee82cb90    vdup.32 d18, ip
00000ee0    f960488f    vld2.32 {d20-d21}, [r0]
00000ee4    efe48ca3    vmull.s32   q12, d20, d19
00000ee8    efe46ca2    vmull.s32   q11, d20, d18
00000eec    efe588a1    vmlal.s32   q12, d21, d17
00000ef0    efe568a0    vmlal.s32   q11, d21, d16
00000ef4    efe85838    vqshrun.s32 d21, q12, #8
00000ef8    efe84836    vqshrun.s32 d20, q11, #8
00000efc    f940488d    vst2.32 {d20-d21}, [r0]!
00000f00        4288    cmp r0, r1
00000f02        d3ed    bcc.n   0xee0
00000f04        4770    bx  lr
00000f06        bf00    nop

(If this looks like gobbledygook to you, now would be a good time to read this good introduction to ARM assembly, and to take another look at the armv7 reference manual.)

A couple things are worth noting here.  First, the Apple compiler improved on my algorithm somewhat; it combined the 2 multiplies and 2 adds into 2 multiply-and-accumulates, saving 2 instructions.

Also, I think the disassembler made a mistake here; the VQSHRUN should be a VSHRN.  (I tested this by writing the function using the assembly given here, and I got different output unless I switched to VSHRN instead.)  This is just a useful reminder that disassembly may not always be perfectly trustworthy!
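The multiply-and-accumulate fusion can also be requested explicitly with the vmlal_s32 intrinsic instead of being left to the compiler.  The following is a sketch of that idea, not what either compiler generated: the name stereoPan_mla is invented, standard int32_t/int64_t types stand in for the int32/int64 typedefs, frames is assumed to be even, and a plain C path is included so the file still builds without NEON.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ARM_NEON__) || defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical variant of the intrinsics version using multiply-accumulate.
void stereoPan_mla(int32_t* buf, int frames,
                   int32_t ll, int32_t rr, int32_t lr, int32_t rl)
{
    assert(frames % 2 == 0);  // the NEON loop handles 2 frames per iteration
    int32_t* p = buf;
    const int32_t* pEnd = buf + frames * 2;

#if defined(__ARM_NEON__) || defined(__ARM_NEON)
    int32x2_t llv = vdup_n_s32(ll);
    int32x2_t lrv = vdup_n_s32(lr);
    int32x2_t rlv = vdup_n_s32(rl);
    int32x2_t rrv = vdup_n_s32(rr);

    while (p < pEnd)
    {
        // load 2 frames, deinterleaving: val[0] = left, val[1] = right
        int32x2x2_t in = vld2_s32(p);

        // vmull starts each 64-bit sum, vmlal adds the second product to it
        int64x2_t out_l = vmlal_s32(vmull_s32(in.val[0], llv), in.val[1], lrv);
        int64x2_t out_r = vmlal_s32(vmull_s32(in.val[0], rlv), in.val[1], rrv);

        // shift right and narrow to s32
        int32x2x2_t out;
        out.val[0] = vshrn_n_s64(out_l, 24);
        out.val[1] = vshrn_n_s64(out_r, 24);

        vst2_s32(p, out);
        p += 4;
    }
#else
    // plain C fallback, same arithmetic as the default version
    while (p < pEnd)
    {
        int64_t in_l = p[0];
        int64_t in_r = p[1];
        *p++ = (int32_t)((in_l * ll + in_r * lr) >> 24);
        *p++ = (int32_t)((in_l * rl + in_r * rr) >> 24);
    }
#endif
}
```

Whether this is actually faster than separate multiplies and adds depends on the compiler and the core, so it is worth timing both.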

Anyway, let’s compare to what the Android GCC compiler gives us.  In the interest of brevity, I’ve removed everything but the instructions, and am showing only the content of the loop:

.L150:
    vld2.32 {d16-d17}, [r5]
    str r6, [r7, #4]
    vstmia  ip, {d16-d17}
    ldmia   ip, {r0, r1, r2, r3}
    stmia   sl, {r0, r1, r2, r3}
    fldd    d17, [sp, #64]
    fldd    d16, [sp, #72]
    vmull.s32   q9, d17, d27
    vmull.s32   q11, d16, d24
    vmull.s32   q10, d17, d26
    vmull.s32   q8, d16, d25
    vadd.i64    q9, q9, q11
    vadd.i64    q8, q10, q8
    vshrn.i64   d18, q9, #24
    vshrn.i64   d16, q8, #24
    fstd    d18, [sp, #48]
    fstd    d16, [sp, #56]
    ldmia   fp, {r0, r1, r2, r3}
    stmia   r8, {r0, r1, r2, r3}
    mov r3, r7
    mov r7, r4
    str r6, [r3], #4
    adds    r3, r3, #4
    str r6, [r3], #4
    str r6, [r3, #0]
    ldmia   r8, {r0, r1, r2, r3}
    stmia   r4, {r0, r1, r2, r3}
    vldmia  r4, {d16-d17}
    vst2.32 {d16-d17}, [r5]
    adds    r5, r5, #16
    cmp r9, r5
    bhi .L150

GCC didn’t do the clever trick of replacing the separate multiply and add operations with multiply-and-accumulates.  But that’s nothing compared to all the other instructions that are there after the NEON instructions.  What the heck are they doing there, other than burning up cycles?  I don’t know, and don’t have the energy to puzzle it through; all I know is that I want to get rid of them!

[EDIT Aug 6 2013: I’ve done some tests using clang instead of gcc on Android, and while the assembly code is still usually faster than code generated from intrinsics, I’m no longer seeing the intrinsics version perform worse than the plain-C version.]

Here’s a new version, written using GCC inline assembly:

void stereoPan_neon(int32* buf, int frames,
                    int32 ll, int32 rr, int32 lr, int32 rl)
{
    int32* p = buf;
    const int32* pEnd = buf + frames * 2;                                 

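    asm volatile
    (
     // skip the loop entirely if the buffer is empty
     // (unsigned comparison, since p and pEnd are addresses)
     "cmp %[p], %[pEnd]\n\t"
     "bhs 1f\n\t"

     // broadcast each pan coefficient into both lanes of a D register
     "vdup.32 d16, %[ll]\n\t"
     "vdup.32 d17, %[lr]\n\t"
     "vdup.32 d18, %[rl]\n\t"
     "vdup.32 d19, %[rr]\n\t"

     "0:\n\t"

     // load 2 frames, deinterleaving: d20 = left samples, d21 = right samples
     "vld2.32 {d20-d21}, [%[p]]\n\t"
     // q11 = l*ll + r*lr, q12 = l*rl + r*rr (64-bit intermediates)
     "vmull.s32 q11, d20, d16\n\t"
     "vmlal.s32 q11, d21, d17\n\t"
     "vmull.s32 q12, d20, d18\n\t"
     "vmlal.s32 q12, d21, d19\n\t"
     // shift right by 24 and narrow back to 32 bits
     "vshrn.i64 d20, q11, #24\n\t"
     "vshrn.i64 d21, q12, #24\n\t"
     // store 2 frames, reinterleaving, and advance p by 16 bytes
     "vst2.32 {d20-d21}, [%[p]]!\n\t"

     "cmp %[p], %[pEnd]\n\t"
     "blo 0b\n\t"

     "1:\n\t"
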

     : [p] "+r" (p)
     : [pEnd] "r" (pEnd), [ll] "r" (ll), [rl] "r" (rl), [rr] "r" (rr), [lr] "r" (lr)
     : "d16", "d17", "d18", "d19", "d20", "d21", "q11", "q12", "cc", "memory"
    );
}

Now, running this on my Nexus S, I get these times:

stereo_pan_default             :  9.87 ms
stereo_pan_neon                :  1.87 ms

Not bad at all!

(In the final version of the function, I sped things up by another 20% or so by loading 4 frames of input instead of 2, which reduces the overall number of loads and stores.)
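To give an idea of what processing 4 frames per iteration looks like, here is a sketch written with intrinsics rather than assembly.  It is not the final version mentioned above: the name stereoPan_neon4 is invented, frames is assumed to be a multiple of 4, standard int32_t/int64_t types stand in for the int32/int64 typedefs, and a plain C path is included so the file builds without NEON.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ARM_NEON__) || defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical 4-frames-per-iteration variant.
void stereoPan_neon4(int32_t* buf, int frames,
                     int32_t ll, int32_t rr, int32_t lr, int32_t rl)
{
    assert(frames % 4 == 0);
    int32_t* p = buf;
    const int32_t* pEnd = buf + frames * 2;

#if defined(__ARM_NEON__) || defined(__ARM_NEON)
    int32x2_t llv = vdup_n_s32(ll);
    int32x2_t lrv = vdup_n_s32(lr);
    int32x2_t rlv = vdup_n_s32(rl);
    int32x2_t rrv = vdup_n_s32(rr);

    while (p < pEnd)
    {
        // one load brings in 4 frames: val[0] = L0..L3, val[1] = R0..R3
        int32x4x2_t in = vld2q_s32(p);
        int32x2_t l_lo = vget_low_s32(in.val[0]);
        int32x2_t l_hi = vget_high_s32(in.val[0]);
        int32x2_t r_lo = vget_low_s32(in.val[1]);
        int32x2_t r_hi = vget_high_s32(in.val[1]);

        // widening multiplies work on 2 lanes at a time, so do each half
        int64x2_t out_l_lo = vmlal_s32(vmull_s32(l_lo, llv), r_lo, lrv);
        int64x2_t out_l_hi = vmlal_s32(vmull_s32(l_hi, llv), r_hi, lrv);
        int64x2_t out_r_lo = vmlal_s32(vmull_s32(l_lo, rlv), r_lo, rrv);
        int64x2_t out_r_hi = vmlal_s32(vmull_s32(l_hi, rlv), r_hi, rrv);

        // shift right, narrow, and recombine the halves for a single store
        int32x4x2_t out;
        out.val[0] = vcombine_s32(vshrn_n_s64(out_l_lo, 24),
                                  vshrn_n_s64(out_l_hi, 24));
        out.val[1] = vcombine_s32(vshrn_n_s64(out_r_lo, 24),
                                  vshrn_n_s64(out_r_hi, 24));
        vst2q_s32(p, out);
        p += 8;
    }
#else
    // plain C fallback, same arithmetic as the default version
    while (p < pEnd)
    {
        int64_t in_l = p[0];
        int64_t in_r = p[1];
        *p++ = (int32_t)((in_l * ll + in_r * lr) >> 24);
        *p++ = (int32_t)((in_l * rl + in_r * rr) >> 24);
    }
#endif
}
```

The arithmetic per frame is unchanged; the saving comes from issuing one load and one store for every 4 frames instead of every 2.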

To conclude: my general approach to this kind of low-level optimization for iOS and Android is:

  1. Write an initial version using NEON intrinsics;
  2. Look at the disassembly from the Apple compiler, and see if it’s done any particularly clever optimizations you can use;
  3. Rewrite using inline assembly;
  4. Profile, profile, profile!
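
For the profiling step, even a very small timing harness is enough to compare two versions of a function.  The sketch below is not my actual test program; timeMs, report, and countCalls are names invented for the example, and it uses the standard C clock() function, which measures CPU time rather than wall-clock time.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

// The function under test, wrapped so that it takes a single context pointer.
typedef void (*BenchFn)(void* context);

// Hypothetical helper: run fn the given number of times and return
// the total CPU time in milliseconds.
double timeMs(BenchFn fn, void* context, int iterations)
{
    assert(fn != NULL && iterations > 0);
    clock_t start = clock();
    for (int i = 0; i < iterations; ++i)
    {
        fn(context);
    }
    clock_t end = clock();
    return 1000.0 * (double)(end - start) / (double)CLOCKS_PER_SEC;
}

// Hypothetical helper: print one result line.
void report(const char* name, double ms)
{
    printf("%-30s : %6.2f ms\n", name, ms);
}

// Trivial workload used only to exercise the harness: counts its calls.
void countCalls(void* context)
{
    ++*(int*)context;
}
```

Run enough iterations that the total is well above the timer resolution, and time the plain C and NEON versions on the same buffer contents on the actual device.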

8 Responses to NEON optimizations for iOS and Android

  1. Great post, thanks.

    I was wondering if you had any experience with iOS app store review with an app that includes Neon intrinsics? Does this type of code border on “non-public API”?

  2. steve says:

    Hi Ronan,
    NEON intrinsics shouldn’t cause any problems with app store submissions. It’s really an instruction set, not an API, and is widely used in iOS apps.

  3. Fantastic thanks Steve

  4. Bob oGreaty says:

    there’s a mistake in the code (2nd block). this:
    int n = n / 4;
    should be:
    int n = count / 4;

  5. Shubhadeep Chaudhuri says:

    Is the GCC compiler still as bad with NEON intrinsics? I need to make some NEON optimizations in my algorithm, but having to learn NEON assembly seems like overkill.

  6. steve says:

    Last I checked it was; as I mentioned above, clang seems to do much better than gcc. Most important, though, is to profile before and after any optimization to evaluate any improvements!

  7. Junaid says:

    Hi steve,
    I was looking for FFT neon intrinsic code and I got here. But I am targeting c/c++ code for linux that can be cross-compiled for ARM. Can you help me with this?
