65816: Coding – To Optimize or Not

As I was working on my 32-bit floating-point package (GitHub) I noticed some opportunities to optimize the code. Take scale10, used while converting strings into floating-point values: it uses the package’s 32-bit routines to multiply and divide by 10. To scale up by 10, for instance, we load 10 onto the floating-point stack and call fmult32.

Multiply floating-point value by 10 using package routines


        ldy #10
        jsr pushu16             ; push 10 to stack
        jsr fmult32             ; r1 * 10
        rts

Simple and quite readable. But can we do better than calling our general routines, with their associated overhead? After all, we’re working with a known constant, and multiplying by 10 (%1010) is simply three shifts and an add (technically a bit more, since we need to save an intermediate value after the first shift for the final add later):

x * 10 = x * (8 + 2) = (x * 8) + (x * 2) = (x << 3) + (x << 1)
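The identity is easy to sanity-check in a higher-level language. A quick Python sketch, mirroring the intermediate save the assembly needs:

```python
def mul10(x: int) -> int:
    """Multiply by 10 with shifts and one add: x*10 = x*8 + x*2."""
    times2 = x << 1       # first shift; saved for the final add
    times8 = times2 << 2  # two more shifts gives x*8
    return times8 + times2

# Spot check across a range of values
assert all(mul10(x) == x * 10 for x in range(1000))
```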

It’s a bit of code to accomplish this on the 65816 though.

Multiply floating-point value by 10 using a hard coded routine

        ldy #0
        clc

        ; multiply TOSm by 10 (%1010)
        ; r1 * 2 => Y=lsw, A=msw, tmp2=overflow and TOSm,TOSm+2,TOSext for use later
        lda TOSm,x
        beq @eqz                ; TOS lsw = 0? yes, go check msw
        asl
        tay                     ; lsw
        sta TOSm,x
        lda TOSm+2,x
@nez:
        rol
        sta TOSm+2,x
        phx                     ; save FPSP
        tax                     ; msw
        lda #0                  ; capture any overflow
        rol
        sta tmp2
        sta TOSext

        ; r1 * 4 = tm * 2
        tya                     ; lsw
        asl
        tay
        txa                     ; msw
        rol
        tax
        rol tmp2

        ; r1 * 8 = tm * 2
        tya                     ; lsw
        asl
        tay
        txa                     ; msw
        rol
        tax
        rol tmp2

        tya                     ; lsw
        txy                     ; save msw
        plx                     ; retrieve FPSP

        ; r1 * 10 = tm + TOSm
        ; carry is clear
        adc TOSm,x
        sta tm
        tya                     ; msw 
        adc TOSm+2,x
        sta tm+2
        lda tmp2
        adc TOSext

        ; save product to TOSm, shifting one byte to the right
        acc8
        sta TOSm+3,x
        lda tm+3
        sta TOSm+2,x
        lda tm
        sta TOSext+1
        acc16
        lda tm+1
        sta TOSm,x

        ; we've shifted the product one byte to the right
        ; increase the exponent to compensate
        clc
        lda TOSe,x
        adc #8
        sta TOSe,x
        jsr fn                  ; normalize result
        clc
        rts

@eqz:
        lda TOSm+2,x
        bne @nez
        jmp setTOS0

Here we’ve eliminated the overhead of pushing/popping 10 to/from the floating-point stack. We’ve also eliminated some of the overhead of working with 32-bit values. You might think we’ve eliminated three bytes of shifts, since 10 is a single-byte value. However, the fmult32 routine is optimized to efficiently handle values whose 32-bit binary representations are largely zeros, as 10’s is (the mantissa internally is $A0000000). There is still some overhead in dealing with the 32-bit value though.

It’s not all in the hard coded routine’s favor though. See the jsr fn towards the end? The hard coded routine is a bit more involved than “three shifts and an add”: the result must be normalized, which adds a number of shifts after the actual computation to transform the result into a normalized floating-point value. Still, the hard coded routine was about two seconds faster than the 32-bit floating-point package routines when inputting and outputting about 40 floating-point values of varying length on my emulator.

But which code is easier to read? Clearly the first version. Since processing value input/output often isn’t time critical, many would opt for the clearer code of the first version, and I think that’s the way I’ll go. I may keep the hard coded version as an option for processing a large quantity of input data. But if the code is there, then why not use it?

How About Division by 10?

Hard coding division by 10 isn’t as straightforward as multiplication. At first I tried flipping it around and multiplying by 1/10. But the mantissa of 1/10 internally is $CCCCCCCD, and hard coding that multiplication for speed is crazy, though the repeating pattern provides some looping opportunities. And when compared to the natural:

Dividing floating-point value by 10 using package routines

        ldy #10
        jsr pushu16
        jsr fdiv32              ; r1 / 10
        rts

it’s hard to even get past the initial sketch.

It’s not hard to find some basic divide-by-10 algorithms online, here and here for example. They’re not too bad in a higher-level language, but much more involved in assembly, and definitely not more readable. I decided to see what I could do with one of my own 32-bit division routines.
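As an illustration of the higher-level-language version, here is one commonly cited shift-and-add divide-by-10 (the approximation found in Hacker’s Delight), sketched in Python:

```python
def divu10(n: int) -> int:
    """Divide a 32-bit unsigned value by 10 using only shifts and adds.

    q is built up as an approximation of n * (8/10), then shifted down
    and corrected via the remainder; no divide instruction is needed.
    """
    q = (n >> 1) + (n >> 2)        # q ~= n * 0.75
    q += q >> 4                    # refine toward n * 0.8
    q += q >> 8
    q += q >> 16
    q >>= 3                        # q ~= n / 10, possibly one low
    r = n - ((q << 3) + (q << 1))  # r = n - q*10
    return q + (1 if r > 9 else 0)
```

Every step maps to shifts and adds the 65816 can do, but on a 32-bit value each line becomes a multi-instruction sequence, which is where the readability suffers.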

I have a division routine that divides a 32-bit unsigned integer by a 16-bit unsigned integer, giving a 32-bit quotient and a 16-bit remainder. Once again, with a constant divisor I could avoid the overhead of loading/dropping the 10 to/from the stack and could hard code the 10 into the calculation, avoiding some memory-access overhead. But what to do about the remainder? Without factoring it in I’d lose some precision. The remainder by itself wasn’t useful; I needed the remainder divided by 10. But how to incorporate it into the quotient?

I then realized that a 32-bit value divided by 10 fits into 29 bits, since the divisor 10 is greater than 2^3 = 8. If I just continued the division routine for three more loops, 35 vs. 32, I could incorporate the fractional remainder divided by 10 into the quotient. With a proper adjustment to the exponent, all was good. Or mostly so: it still needs some work on handling the residual for rounding.
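The idea can be modeled in Python (names here are mine, not the package’s): a bit-at-a-time restoring division run for 35 iterations instead of 32, so the last three quotient bits are fractional, compensated for by subtracting 3 from the exponent:

```python
def div10_extended(dividend: int, iters: int = 35) -> tuple[int, int]:
    """Restoring division of a 32-bit value by 10, extended to 35 bits.

    The three extra iterations append fractional quotient bits, so the
    result is floor(dividend * 8 / 10); the caller compensates by
    subtracting 3 from the binary exponent.
    """
    quotient = 0
    remainder = 0
    for i in range(iters):
        # Feed dividend bits MSB-first; iterations past 32 shift in zeros
        bit = (dividend >> (31 - i)) & 1 if i < 32 else 0
        remainder = (remainder << 1) | bit
        quotient <<= 1
        if remainder >= 10:   # partial remainder >= divisor?
            remainder -= 10
            quotient |= 1     # set the low quotient bit
    return quotient, remainder

# 1000/10 = 100; with 3 fractional bits the quotient comes back as 100*8
assert div10_extended(1000)[0] == 800
```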

Dividing floating-point value by 10 using a hard coded routine

        lda TOSm,x
        ora TOSm+2,x
        beq @eqz                ; mantissa = 0? yes, result is 0

        ldy #35                 ; init loop counter
        lda #0                  ; init partial remainder
        sta tm                  ;   in tm:A (h:l)
@loop:        
        asl TOSm,x              ; dividend in TOSm+2:TOSm (h:l) is gradually replaced
        rol TOSm+2,x            ;   with the quotient
        rol                     ; tm:A is gradually replaced
        rol tm                  ;   with the remainder
        sta TOSext 
        cmp #%1010              ; divisor = 10
        lda tm                  ; partial remainder >= divisor?
        sbc #0
        bcc @cont
        sta tm                  ;   yes: update the partial
        lda TOSext              ;     remainder and set the
        sbc #%1010              ;     low bit in the partial
        inc TOSm,x              ;     quotient
        bra @cont1
@cont:        
        lda TOSext    
@cont1:                
        dey                     ; loop 35 times
        bne @loop
        xba
        asl
        asl
        asl
        asl
        asl
        sta TOSext

        sec
        lda TOSe,x
        sbc #3
        sta TOSe,x

        clc
        jsr fn                  ; normalize result
        rts
@eqz:
        jmp setTOS0

The problem is that this routine isn’t any faster than the shorter version, which makes the choice between them easy. In fact, after the above analysis, I decided to take the shorter route for multiplication as well, and didn’t bother resolving the residual rounding issue in the code above.

In working through these routines, I came across some suggestions that may help me get more done.

Rules of Optimization

In researching multiplication and division efficiencies I came across an interesting Stack Overflow post that laid out one form of the Rules of Optimization (ROO). It was basically a much longer, somewhat humorous form of the common advice: 1. Don’t. 2. Don’t… yet. 3. Profile first.

More practical is the commonly quoted advice from Donald Knuth: “We should forget about small efficiencies, say about 97% of the time; premature optimization is the root of all evil.”

Wikipedia has a nice section on optimization and when to optimize:

“Premature optimization” is a phrase used to describe a situation where a programmer lets performance considerations affect the design of a piece of code. This can result in a design that is not as clean as it could have been or code that is incorrect, because the code is complicated by the optimization and the programmer is distracted by optimizing.

Looking at my Coding Efficiencies post, it’s clear that I could benefit from taking some of this to heart. However, I have to say that working through some of these problems has allowed me to understand the algorithms, and the workings of the 65816, more completely. That’s made it worth it to me, especially since learning is my main goal in all of this. I’m definitely not doing this for a living, where pumping out a lot of code in a certain timeframe may be more important.

But some of the above code goes to extremes, I think, especially since we’re talking about an I/O situation where in most cases the user isn’t going to notice the difference.

Take, for example, the shift of a 40-bit value in the multiply by 10 routine above. The most straightforward approach is:

Easy-to-read 40-bit shift left

        asl tm          ; 7 cycles      2 byte(s)
        rol tm+2        ; 7             2
        rol tmp2        ; 7             2
                        ; 21            6

Slightly faster is:

Slightly faster 40-bit shift left

        tya             ; 2 cycles      1 byte(s)
        asl             ; 2             1
        tay             ; 2             1
        txa             ; 2             1
        rol             ; 2             1
        tax             ; 2             1
        rol tmp2        ; 7             2
                        ; 19            8

where we use the fact that shifting the accumulator on the 65816 is much faster than shifting a memory location, even one located on the direct page. But, you may argue, I haven’t considered the cost of loading the registers in the first place. One thing that makes the above code tempting is that I have to load the values earlier anyway to check for a zero value (zero is a unique value in floating-point that doesn’t play nicely in many routines). There is the initial 2-cycle cost of transferring them to the index registers, but don’t forget that the temporary tm and tm+2 memory locations must also be initialized in the more readable code, at 4 cycles each. For example, consider that:

        lda tm          ; 4 cycles      2 byte(s)
        asl             ; 2             1
        sta tm          ; 4             2
                        ; 10            5

is not preferable to simply shifting the tm memory location left once. But what about twice, as in:

        lda tm          ; 4 cycles      2 byte(s)
        asl             ; 2             1
        asl             ; 2             1
        sta tm          ; 4             2
                        ; 12            6

where now loading tm into the accumulator, shifting twice and then saving it back is faster than shifting the memory location itself two times. If you’re shifting many times, the initialization costs are amortized over the group, making them less material.
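The break-even point is easy to check with the cycle counts quoted above (a quick Python tally; the counts assume a 16-bit accumulator and a direct-page operand):

```python
def mem_shift_cycles(n: int) -> int:
    """n in-place shifts of a direct-page word at 7 cycles each."""
    return 7 * n

def acc_shift_cycles(n: int) -> int:
    """Load (4 cycles) + n accumulator shifts (2 each) + store (4)."""
    return 4 + 2 * n + 4

assert mem_shift_cycles(1) < acc_shift_cycles(1)  # 7 < 10: memory wins
assert mem_shift_cycles(2) > acc_shift_cycles(2)  # 14 > 12: accumulator wins
```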

Of course, we can sometimes avoid the initialization costs by working directly with the data in memory. For example:

Shifting left the top value on the floating-point stack

        asl TOSm,x      ; 8 cycles      2 byte(s)
        rol TOSm+2,x    ; 8             2
        rol tmp2        ; 7             2
                        ; 23            6

This is even slower with the direct page indexed addressing mode, but may still be fine if that’s all we need to do. In the multiply by 10 routine, though, we need intermediate times-2 and times-8 values, so we’d still need to load and save the values somewhere, our temporary tm locations for example. We just can’t get past those initialization costs. So, where speed matters, keeping key parameters in the processor’s registers for manipulation can be important.

But it doesn’t produce more readable code. The slightly faster 40-bit shift isn’t more readable than the shorter version. And the 80-or-so-line hard coded multiply by 10 isn’t easier to read than the 4-line version (to be fair, there is a lot of code in the background there, but those are main package routines whose use should be familiar).

For me, I’ve decided that readability is more important than shaving off cycles in I/O routines. I’ll leave a comment in the multiply by 10 code pointing to this post, informing the user that a hard coded routine can speed up processing a large amount of string data. Otherwise, when entering single values, readability is more important to me. For now at least!