A Fast Hi-Precision Fixed Point Divide
If you are a developer targeting embedded devices with ARM processors, you must begrudgingly accept that the v4 architecture common in some low cost chips-sets, doesn’t have a floating point unit. Those of you less familiar with these devices may also be surprised to lean that while they do possess a fast 32 bit multiply, there is no hardware divide. All division must be emulated in software. That’s fine, gcc does it all for you behind the scenes so its no big deal, right? Well not exactly…
A Problem Of Precision
The problem is that without a floating point unit we are back in the realm of fixed point math if we need greater precision. And as with all things performance related there is a trade off to be made between precision and speed. If we are using a 16:16 fixed point we have one of two choices for our divide.
We can use a 32 bit divide
quotient = ((numerator << 6) / (denominator >> 6)) << 4or we can use a 64 bit divide
quotient = (int32_t)((int64_t)numerator << 16) / denominator)The 32bit divide looses precision in over 75 percent of cases (assuming a random distribution), however the 64 bit divide is over 4 times slower. Wouldn’t it be nice if we could figure out a way to get the speed of a 32 bit divide, with the precision of the 64 bit one. Well it turns out we can. In fact, the 16:16 fixed point divide I have developed is only marginally slower than the 32 bit fixed point divide above, and between 3 to 5 times faster than the 64 bit version, depending on the workload.
Instruction Hi Jinks
Even though the ARM is a RISC processor, its instruction set has a depth of features that permit some of the clever instruction hi jinks that the Intel Pentium is renowned for. The following listing shows the 3 instruction inner loop used by the best software emulated 32bit divide on the ARM processor.
@ Entry r0: numerator (lo) must be signed positive
@ r2: deniminator (den) must be non-zero and signed negative
idiv:
lo .req r0; hi .req r1; den .req r2
mov hi, #0 @ hi = 0
adds lo, lo, lo
.rept 32 @ repeat 32 times
adcs hi, den, hi, lsl #1
subcc hi, hi, den
adcs lo, lo, lo
.endr
mov pc, lr @ return
@ Exit r0: quotient (lo)
@ r1: remainder (hi)Legend: For those of you with different tool chains, the gas format should still be pretty familiar. Here is a legend to help explain some of the punctuation and its meaning:
.reqthe register equate directive maps a symbol to a register. Its a bit like using a define statement in C. I use them all the time because it makes the source just that much easier to read.;separates one instruction from the next when they appear on the same line.@starts a comment that runs until the next new line..rept <n>the repeat directive repeats the block of code terminated by.endr,ntimes. This handy directive reduces clutter when unrolling loops. gas has other handy directives that permit incrementing variables as a loop is unrolled.
This inner loop takes 97 cycles and only requires three registers. It is essentially a software implementation of the restoring division algorithm described in Appendix A of Computer Architecture: A Quantitative Approach by John Hennesy. Lets break it down and explain what is actually happening.
adds lo, lo, loThe above instruction effectively left shifts lo by 1. It also sets the carry flag with the overflow.
adcs hi, den, hi, lsl #1This very clever instruction does a whole bunch of stuff at once.
- left shift hi by 1
- add the carry from previous instruction so:
hi:lo <<= 1 - subtract the denominator (remember den is a signed negative)
- set carry flag with the overflow
subcc hi, hi, denThe above instruction is the restoring part of the algorithm. Basically we add the denominator back to hi if the carry flag is clear.
adcs lo, lo, loFinally we start the loop again but cleverly we carry the overflow from the subtraction of denominator two instructions back. By the end of 32 iterations the lo 32 bits have been shifted out and replaced by the quotient that has been shifted in using the carry. The remainder is left in hi.
The only problem with this routine is it’s just an inner loop, it doesn’t check the input arguments – particularly the denominator for zero equality – and it hasn’t been optimized for early out if both the numerator and denominator have leading zeros.
Checking The Input
Checking the input is pretty straight forward.
@ Entry r0: numerator (lo) must be signed positive
@ r2: deniminator (den) must be non-zero and signed negative
idiv:
lo .req r0; hi .req r1; den .req r2
cnt .req r3; sign .req r12
cmp den, #0; beq .div0 @ 1. exceptioin if den == zero
eor sign, lo, den @ 2. sign = num ^ den
rsbmi den, den, #0 @ 3. den = -den if den < 0
subs hi, den, #1 @ hi = den - 1
beq .div1 @ 4. return if den == 1
cmp lo, #0 @ 5. num = -num if num < 0
rsbmi lo, lo, #0
cmp lo, den @ 6. return if num <= den
bls .numLeDen
tst den, hi @ 7. if(den & (den - 1) == 0)
beq .powerOf2 @ den is a power of 2
...
@ Exit r0: quotient (lo)
@ r1: remainder (hi)Here is a list of the things we can check for:
- If the denominator is zero we should throw a divide by zero exception.
- Determin the sign of the quotient.
- Negate the denominator if negative
- If the denominator equals 1 we should return the numerator.
- Negate the numerator if negative
- If the numerator is less than the denominator we should return 0, and if the numerator equals the denominator we should return 1.
- If the denominator is a power of 2 we should use a logical shift right, because its faster.
1. Divide By Zero
cmp den, #0; beq .div0
@ ...
.div0:
b __div0Here I assume that a common function called __div0 has been defined that looks after throwing the divide by zero exception. Alternatively you can simply return zero in the numerator.
2. Sign
eor sign, lo, denDetermining the sign of the quotient is simple. We just assign sign the exclusive or of the numerator and denominator. If the resulting value is negative then we need to negate the quotient before returning it.
3. Negating The Denominator
rsbmi den, den, #0The N (negative) flag has already been set by the divide by zero test, so we simply negate the denominator with a conditional reverse subtract on the minus instruction.
4. Divide By One Test
subs hi, den, #1
beq .div1
@ ...
.div1:
cmp sign, #0
rsbmi num, num, #0
mov pc, lrWe subtract 1 from the denominator and store it in hi which we use later for the power of 2 test. The subtract instruction also sets the Z (zero) flag, which we test with a conditional branch on equal insruction. The div1 jump negates the numerator if sign is minus and returns the numerator.
5. Negating The Numerator
cmp lo, #0
rsbmi lo, lo, #0The N (negative) flag is set by the comparison with zero, so we simply negate the numerator with a conditional reverse subtract on the minus instruction.
6. Numerator <= Denominator Test
Obviously if the numerator is less than or equal to the denominator we return zero or one respectively.
cmp lo, den
bls .numLeDen
@ ...
.numLeDen:
mov num, #0
moveq num, sign, asr #31
orreq num, num, #1
mov pc, lr7. Divide By Power Of 2 Test
At this point hi holds the result of the divide by one test (denominator - 1). A bit of boolean logic comes into play here, as we logically AND value hi with the denominator. If the result is zero, then the denominator is a power of two, and all we need to do is count the bits and perform a bit shift right.
tst den, hi
beq .powerOf2
@ ...
.powerOf2:
mov cnt, #0
cmp den, #(1 << 16)
movhs cnt, #16
movhs den, den, lsr #16
cmp den, #(1 << 8)
addhs cnt, cnt, #8
movhs den, den, lsr #8
cmp den, #(1 << 4)
addhs cnt, cnt, #4
movhs den, den, lsr #4
cmp den, #(1 << 2)
addhi cnt, cnt, #3
addls cnt, cnt, den, lsr #1
mov num, num, lsr cnt
cmp sign, #0
rsbmi num, num, #0 @ negate quotient if sign < 0
mov pc, lrEarly Out
A common technique used in modern processors to speed up hardware division is the early out. Essentially if the numerator and denominator both have leading zero bits then the number of iterations in the loop can be reduced by this amount if we also left shift the numerator by the same number of bits. ARM added the CLZ (Count Leading Zero) instruction the the v5 architecture for just this purpose, but alas we have to waste precious cycles and emulate it in software on v4 hardware.
@ Entry r0: numerator (lo)
@ r2: deniminator (den) must be non-zero
idiv:
lo .req r0; hi .req r1; den .req r2; sign .req r3; cnt .req r4
stmfd sp!, {r4, lr} @ push r4 and lr onto the stack
@ CHECK INPUT
cmp den, #0; beq __div0 @ throw exception if den == zero
@ set sign and ensure numerator and deniminator are positive
mov sign, #1
rsblt sign, sign, #0 @ sign = -sign if den < 0
rsblt den, den, #0 @ den = -den if den < 0
cmp lo, #0
rsblt sign, sign, #0 @ sign = -sign if num < 0
rsblt lo, lo, #0 @ lo = -lo if num < 0
@ skip if numerator <= deniminator
cmp lo, den @ skip if lo <= den
movlt hi, lo; movlt lo, #0
moveq lo, #1
ble .end
@ EARLY OUT
@ count leading bits common to both the numerator and denominator
mov cnt, #28
mov hi, lo, lsr #4
cmp den, hi, lsr #12; suble cnt, cnt, #16 ; movle hi, hi, lsr #16
cmp den, hi, lsr #4 ; suble cnt, cnt, #8 ; movle hi, hi, lsr #8
cmp den, hi ; suble cnt, cnt, #4 ; movle hi, hi, lsr #4
@ shift numerator left by cnt bits
mov lo, lo, lsl cnt @ hi:lo = num << cnt
rsb den, den, #0 @ negate den for the divide loop
adds num, num, num @ first loop instruction relocated here
@ skip cnt iterations in the divide loop
add cnt, cnt, cnt, lsl #1 @ cnt *= 3
add pc, pc, cnt @ jump cnt instructions
nop @ padding for the pipeline
.rept 32 @ repeat 32 times
adcs hi, den, hi, lsl #1
subcc hi, hi, den
adcs lo, lo, lo
.endr
.end:
cmp sign, #0 @ quotient = -quotient if sign < 0
rsblt lo, lo, #0
ldmfd sp!, {r4, pc} @ restore r4 and return
@ Exit r0: quotient (lo)
@ r1: remainder (hi)Here we use the classic divide and conquer approach to count the leading bits starting with 16, then 8, then 4. We don’t count any lower than 4 because the extra cost in instructions leads to diminishing returns in performance. If we had the hardware CLZ instruction we would count to the nearest bit and this would provide a maximum saving of 18 extra cycles, and a minimum of 9 (well worth the effort if your platform supports CLZ).
The above sequence provides some clues as to how to optimize our fixed point divide. If we remember back to our earlier 64bit high precision divide.
quotient = (int32_t)((int64_t)numerator << 16) / denominator)When we left shift our 32 bit numerator by 16 bits, our numerator and denominator are both guaranteed to have at least 16 bits of leading zeros. So we left shift the numerator by another 16 bits and reduce our inner loop by 16 iterations from 64 to 48. But to maintain precision we must also increase our numerator to 96 bits hi:mid:lo because the numerator is 16 bits larger than the denominator. Otherwise the remainder would overflow when the difference between the numerator and denominator was more than 32 bits, which will happen, because the numerator is now 32 bits « 16, i.e. 48 bits. Here is the new inner loop.
@ Entry r2: numerator (mid) must be signed positive
@ r3: deniminator (den) must be non-zero and signed negative
idiv:
lo .req r0; hi .req r1; mid .req r2; den .req r3
mov hi, #0; mov lo, #0 @ hi = lo = 0
adds mid, mid, mid
.rept 47 @ repeat 48 times
adcs hi, den, hi, lsl #1
subcc hi, hi, den
adcs lo, lo, lo
adcs mid, mid, mid
.endr
adcs hi, den, hi, lsl #1
subcc hi, hi, den
adcs lo, lo, lo
mov pc, lr @ return
@ Exit r0: quotient (lo)
@ r1: remainder (hi)This inner loop takes 192 cycles and only requires four registers. From this one would expect this inner loop to be twice as slow as the original fixed point 32bit divide, but the full blown version with input checks and early out is actually quite a bit faster.
Putting It All Together
The listing below shows the full implementation of my fast fixed point divide written for the inline assembler in gcc 3.x. The function returns the quotient in r0.
int32_t fpdiv(register int32_t numerator, register int32_t denominator) {
register int32_t quotient;
asm("num .req %[numerator] @ Map Register Equates\n\t"
"den .req %[denominator]\n\t"
"mod .req r2\n\t"
"cnt .req r3\n\t"
"quo .req r4\n\t"
"sign .req r12\n\t"
/* set sign and ensure numerator and denominator are positive */
"cmp den, #0 @ exceptioin if den == zero\n\t"
"beq .div0\n\t"
"eor sign, num, den @ sign = num ^ den\n\t"
"rsbmi den, den, #0 @ den = -den if den < 0\n\t"
"subs mod, den, #1 @ mod = den - 1\n\t"
"beq .div1 @ return if den == 1\n\t"
"movs cnt, num @ num = -num if num < 0\n\t"
"rsbmi num, num, #0\n\t"
/* skip if deniminator >= numerator */
"movs cnt, num, lsr #16 @ return if den >= num << 16\n\t"
"bne .cont\n\t"
"cmp den, num, lsl #16\n\t"
"bhs .numLeDen\n\t"
"\n.cont:\n\t"
/* test if denominator is a power of two */
"tst den, mod @ if(den & (den - 1) == 0)\n\t"
"beq .powerOf2 @ den is power of 2\n\t
/* count leading zeros */
"stmfd sp!, {r4} @ push r4 (quo) onto the stack\n\t"
"mov cnt, #28 @ count difference in leading zeros\n\t"
"mov mod, num, lsr #4 @ between num and den\n\t"
"cmp den, mod, lsr #12; subls cnt, cnt, #16; movls mod, mod, lsr #16\n\t"
"cmp den, mod, lsr #4 ; subls cnt, cnt, #8 ; movls mod, mod, lsr #8\n\t"
"cmp den, mod ; subls cnt, cnt, #4 ; movls mod, mod, lsr #4\n\t"
/* shift numerator left by cnt bits */
"mov num, num, lsl cnt @ mod:num = num << cnt\n\t"
"mov quo, #0\n\t"
"rsb den, den, #0 @ negate den for divide loop\n\t"
/* skip cnt iterations in the divide loop */
"adds num, num, num @ start: num = mod:num / den\n\t"
"add pc, pc, cnt, lsl #4 @ skip cnt x 4 x 4 iterations\n\t"
"nop @ nop instruction takes care of pipeline\n\t"
/* inner loop unrolled x 48 */
".rept 47 @ inner loop x 48\n\t"
" adcs mod, den, mod, lsl #1\n\t"
" subcc mod, mod, den\n\t"
" adc quo, quo, quo\n\t"
" adds num, num, num\n\t"
".endr\n\t"
"adcs mod, den, mod, lsl #1\n\t"
"subcc mod, mod, den\n\t"
"adc quo, quo, quo\n\t"
/* negate quotient if signed */
"cmp sign, #0 @ negate quotient if sign < 0\n\t"
"mov num, quo\n\t"
"rsbmi num, num, #0\n\t"
"ldmfd sp!, {r4} @ pop r4 (quo) off the stack\n\t"
"mov pc, lr @return\n\t"
/* divide by zero handler */
"\n.div0:\n\t"
"mov num, #0\n\t"
"mov pc, lr @return\n\t"
/* divide by one handler */
"\n.div1:\n\t"
"cmp sign, #0\n\t"
"mov num, num, asl #16\n\t"
"rsbmi num, num, #0\n\t"
"mov pc, lr @return\n\t"
/* numerator less than or equal to denominator handler */
"\n.numLeDen:\n\t"
"mov num, #0 @ quotient = 0 if num < den\n\t"
"moveq num, sign, asr #31 @ negate quotient if sign < 0\n\t"
"orreq num, num, #1 @ quotient = 1 if num == den\n\t"
"mov pc, lr @return\n\t"
/* power of two handler */
"\n.powerOf2:\n\t"
"mov cnt, #0\n\t"
"cmp den, #(1 << 16); movhs cnt, #16 ; movhs den, den, lsr #16\n\t"
"cmp den, #(1 << 8) ; addhs cnt, cnt, #8; movhs den, den, lsr #8\n\t"
"cmp den, #(1 << 4) ; addhs cnt, cnt, #4; movhs den, den, lsr #4\n\t"
"cmp den, #(1 << 2) ; addhi cnt, cnt, #3; addls cnt, cnt, den, lsr #1\n\t"
"rsb mod, cnt, #32\n\t"
"mov den, num, lsr #16 @ den:num = num << 16\n\t"
"mov num, num, lsl #16\n\t"
"mov num, num, lsr cnt @ num = num >> cnt | den << mod\n\t"
"orr num, num, den, lsl mod\n\t"
"cmp sign, #0\n\t"
"rsbmi num, num, #0 @ negate quotient if sign < 0"
/* output registers */
: [quotient] "=r" (quotient)
/* input registers */
: [numerator] "0" (numerator), [denominator] "r" (denominator)
/* clobbered registers */
: "r2" /* mod */, "r3" /* cnt */, "r12" /* sign */);
return quotient;
}This assembly is written for gcc and gas, and if you can’t make sense of the punctuation see the legend above. Here is an explanation of some of the other things.
\n\tthe gcc asm statement requires the assembly code to be a single string. To permit multi line statements you have to manually add new line and tab characters to the end of each line.%[symbol]gcc allows you to name inputs and outputs in your asm statement. I simply map these to register equates with the same name to make it easier to follow."r"symbol gcc allows you to map symbols in c to registers in gas. Above we map the first available register (r0) to the read only output quotient, then we reuse r0 as the numerator input, finally we map the second available register (r1) to the denominator. I won’t go into all the details, because they can be found in info gcc if you are interested.
Performance
At the end of the day it all comes down to performance. I was pleasantly surprised by the results. Using a random distribution of signed and unsigned numbers for the numerator and a linear sequence of prime numbers for the denominator. I ran 300 million iterations of each function through a PalmOne Zire 72 which has 312Mhz Intel PXA270 processor and more importantly a timer with 10 millisecond of resolution. First, I ran the random number generator through the same loop and subtracted its total time from the final results of each function in order to better isolate the actual timing of each divide routine. Here are the results:
| Fixed Point Divide x 300 million iterations | Time in seconds |
|---|---|
quotient = (int32_t)((int64_t)numerator << 16) / denominator) | 317.09 |
quotient = ((numerator << 6) / (denominator >> 6)) << 4 | 42.77 |
quotient = fpdiv(numerator, denominator) | 66.59 |
For a modest cost in extra cycles we get a huge payoff in precision, well worth the effort, and when compared to the cost of the 64 bit fixed point divide its a no-brainer.
What About A Fixed Point Multiply?
Out of curiosity I wondered if some speedup could be gained from an inline 64 bit fixed point multiply.
We can do a 32 bit multiply
c = ((a >> 6) * (b >> 6)) >> 4or we can do a 64 bit multiply
c = (int32_t)(((int64_t)a * b) >> 16)or we can do a 64 bit multiply with inline assembly.
#define fpmul(A, B) ({ \
register int32_t __a = (A), __b = (B), __c; asm( \
"smull r2, r3, %[a], %[b] @; i_fpmul()\n\t" \
"mov %[c], r2, lsr #16\n\t" \
"orr %[c], r2, r3, asl #16" \
: [c] "=r" (__c) : [a] "0" (__a), [b] "r" (__b) : "r2", "r3"); \
__c; })This may seem cryptic, but it works! And the results:
| Fixed Point Multiply x 300 million iterations | Time in seconds |
|---|---|
c = (int32_t)(((int64_t)a * b) >> 16) | 13.40 |
c = ((a >> 6) * (b >> 6)) >> 4 | 4.38 |
c = fpmul(a, b) | 9.33 |
Once again rather surprising, and well worth the exercise.
Conclusion
I’ve heard a computer science professor admonishing his students by saying, “If a program contains too much division, the programmer’s an idiot”. Division is very costly – from the results above – 7 times more expensive that a fixed point multiply, and even more so because modern ARM processors permit parallel execution of instructions in their MAC pipe and their ALU pipe. Software emulated division permits no such optimizations so having the fastest fixed point division available should be of common interest to developers targeting this platform.
Source
Here is the source for the gcc built tool chain:
- arm_version.h
- gcc_arm_fpdiv.c
- gcc_arm_div.c - if you are interested in the straight divide
- evc_arm_fpdiv.asm - embedded visual C, but you have to compile it with the assembler as there is no inline assembly support in this product.
Thats All,
Enjoy.