65816: Coding Efficiencies (last update: 3/1/2023)

I’m using this post to gather some of the coding efficiencies I’ve found in moving to the 65816. Most of these relate to porting my 6502-based Forth operating system. I’ve documented many of these findings in my code, but I figured it was better to have them here in one place rather than buried in multiple files. At least I’ll be able to find them here. It’s a long post. I’ve tried to group the topics logically and create a natural flow between them, and I’ve included links to particular items of interest, though they might appear scattered since I’ve built this as a narrative.

Porting my Forth Operating System to the 65816

I’ve spent a bit of time since my last post porting my 6502-based Forth operating system to the 65816. Of course, the system runs on the 65816 in emulation mode without any changes and runs just fine in native 8-bit mode after setting the stack to $01FF. You can even access expanded memory in these modes and use the additional 65816 instructions, but my 8-bit code isn’t as efficient as it could be on the 16-bit processor, especially considering my Forth operating system has a 16-bit cell size.
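
For reference, here’s a minimal sketch of entering native mode and setting the stack; your reset code may differ, and the register widths you settle on afterwards are up to you:

Entering native mode and setting the stack to $01FF (a sketch)

        clc                     ; clear carry and exchange it with the
        xce                     ; emulation bit: now in native mode
        rep #$10                ; 16-bit index registers
        ldx #$01FF              ; set the stack pointer to $01FF
        txs
        sep #$10                ; back to 8-bit index registers if desired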

I thought a post covering some of the coding efficiencies I’ve achieved would be helpful (to me at least for future reference). Switching to the 65816 from the 6502 gains efficiencies in both code size and speed. These come from taking advantage of the 65816’s 16-bit mode, new address modes and new instructions. I’ll cover each separately with respect to the changes I’ve made in porting my 6502-based Forth operating system to the 65816.

Note that a great resource to have available is Programming the 65816 by David Eyes and Ron Lichty. I used it extensively when creating my 65816 emulator and have referred to it often as I’ve ported my code. Various versions are available online. I’ve been using a pdf of the Western Design Center version copyrighted in 2007. I’ve read that there is an updated 2015 version available but haven’t been able to locate a pdf copy of it. Amazon has it in hardcopy and Kindle but there isn’t any indication in the sample provided what’s been updated (or from the copyright page that it’s even been updated at all).

Sometimes coding for efficiency doesn’t matter

Sometimes other factors overwhelm the speed of your code, making any efficiency gains from coding meaningless. I/O is a common case: shaving a few cycles off an I/O loop is pointless when the I/O takes several orders of magnitude longer to complete. I have this exact case in my startup code, discussed in the postscripts to this post. In a speed test, I streamlined my startup code down to the few instructions needed to fetch about 4k of text from an SD card, compared to my original code that had several subroutine calls, multiple register size changes and moved the text between buffers. Even with all of that “bloat” eliminated, the startup time, about 14 seconds at a 1 MHz clock, was essentially the same as with the original code.

In this case I opted to keep the original code as it provides more flexibility to expand my operating system at no cost in performance.

Managing Register Sizes

A key feature of the 65816 is that in native mode the accumulator and index registers can be independently switched between 8-bit and 16-bit lengths by changing the m and x flags in the status register. This means that in addition to these registers being all 8-bit or 16-bit you can also have an 8-bit accumulator with 16-bit index registers and a 16-bit accumulator with 8-bit index registers. Below I’ll refer to 8-bit or 16-bit with the understanding that we might be only switching the accumulator or index registers from one size to another.
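
For reference, the SEP and REP instructions set and clear these flags directly, so each combination is just one instruction away:

Switching register sizes with SEP and REP

        sep #$20        ; 8-bit accumulator, index registers unchanged
        sep #$10        ; 8-bit index registers, accumulator unchanged
        rep #$30        ; 16-bit accumulator and index registers
        sep #$30        ; 8-bit accumulator and index registers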

Currently I’ve designed all code assuming 16-bit registers. Each routine is responsible for switching to 8-bit as needed and then returning to 16-bit afterwards.

I read the other day (can’t find the source right now) a recommendation to “let the caller handle register sizing”. This eliminates the churn of register sizing that could happen if a routine needs to call a bunch of 8-bit routines that each individually handle the switch to 8-bit registers and back. This makes a lot of sense. But does it jibe with my 16-bit design?

I did follow this philosophy when I started porting my Forth to the 65816. I had a math module that required more thought in converting it to 16-bit, so to get on with things, I instructed the assembler to assume it was 8-bit and made each routine that called a math function responsible for changing the appropriate registers to 8-bit and back as appropriate. For example:

Calling a routine that assumes 8-bit registers

        sep #$30
        jsr compare_16   ; 8-bit registers

        jsr UDIV32       ; 8-bit registers
        rep #$30

You can see that we’d have extra register size churn if compare_16 and UDIV32 both changed the registers to 8-bit and back to 16-bit.

I’ve now almost totally eliminated such code from my system. But a few holdouts remain: interrupt service routines and circular buffers.

The switch back to 16-bit registers is easy when you’ve designed the system to always be in 16-bit except when, momentarily, 8-bit is required. But what if you don’t know what you want to return to? An interrupt service routine is a perfect example. An interrupt can occur anytime interrupts are enabled, and you likely can’t predict if they’ll happen when you have 8-bit or 16-bit registers (unless you specifically code for that).

So how does our interrupt service routine handle this? When an interrupt occurs, the 65816 will push the return address and the status register, including the register select bits, to the stack. When you return from the interrupt these are restored to the status register, returning the registers to their original sizes.

Is it that simple? Well, yes, and no. If you don’t need to preserve registers in your interrupt service routine (ISR) then yes, that’s all you need to do. However, if you need to preserve registers, it’s a bit more complicated because you have to think about their size as well, both what size they were when the ISR was called and what size the ISR needs them to be.

For example, if your ISR requires an 8-bit accumulator, you can use the following code:

Preserving registers in an interrupt service routine that uses an 8-bit accumulator

io_irq:
        pha
        phx
        phy

        php                     ; save register status

... change accumulator to 8-bit and do ISR stuff

irq_done:
        plp                     ; recall register status
        ply
        plx
        pla
        rti

With the change to an 8-bit accumulator we avoid changing the B register during the ISR, thus preserving it whether the ISR is entered with 16-bit or 8-bit registers. A reader commented that he always switches to 16-bit registers at the beginning of the ISR. This accomplishes the same thing by allowing the B register to be explicitly saved regardless of whether the ISR is entered with 16-bit or 8-bit registers. This is practical if the ISR uses 16-bit registers but wastes cycles if the ISR requires an 8-bit accumulator. Have an ISR that uses both 16-bit and 8-bit registers? The approach will depend on your specific needs. Just remember that if you change the B register, you’ll need to restore it prior to returning from the ISR.
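
For comparison, the reader’s approach might look something like this sketch (the ISR body is elided):

Preserving registers in an interrupt service routine by switching to 16-bit registers

io_irq:
        php                     ; save register status first
        rep #$30                ; 16-bit registers
        pha                     ; 16-bit push saves B and A
        phx
        phy

... do ISR stuff, changing register sizes as needed

        rep #$30                ; 16-bit registers to restore the full values
        ply
        plx
        pla
        plp                     ; recall register status
        rti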

Note that the same doesn’t apply to the X and Y registers. You can save, modify and restore them with the code above whether the ISR was entered with 16-bit or 8-bit registers. This is because the 65816 zeros out the upper byte of the index registers when switching to 8-bit. If the ISR is entered with 16-bit index registers, then a 16-bit value is saved and a 16-bit value will be restored. If the ISR is entered with 8-bit index registers, then an 8-bit value is saved and an 8-bit value will be restored. It doesn’t matter if the index register size is changed during the ISR as long as the register size is restored prior to restoring the values, as in the code above.

The same methods can be used in other routines that could be called with unknown register sizes. I use the 8-bit code above with input routines that work with circular character buffers, which need 8-bit registers.

Where a routine will be called in a loop, register sizing is likely best done outside the loop, avoiding the cost of changing register sizes on every iteration. However, where other 16-bit efficiencies can be achieved within the loop, it can be better to take advantage of those and live with the register sizing overhead. I have exactly this case: moving the register sizing outside the loop forces the loop to run in 8-bit mode, making it less efficient than the 16-bit version and offsetting any performance gained by eliminating the switch to 8-bit within the loop.
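
As a sketch of the first case, a simple byte-copy loop might hoist the sizing like this (src and dst are hypothetical direct page pointers and COUNT a hypothetical byte count):

Hoisting register sizing outside a loop (a sketch)

        sep #$20        ; switch to an 8-bit accumulator once
        ldy #0
@loop:  lda (src),y     ; copy one byte per iteration
        sta (dst),y
        iny
        cpy #COUNT
        bne @loop
        rep #$20        ; restore the 16-bit accumulator after the loop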

Another advantage of using an overall register size design for your project is that disassembly of your code is easier. Disassembly of 65816 code is very difficult with the “let the caller handle the size change” approach. That’s because the number of bytes used by many instructions depends on the size of the registers. Unless you have a very linear program flow, a disassembler will have no way of determining register size.

At times, it’s also possible to write 65816 code that is indifferent to whether it’s called with 8-bit or 16-bit registers (and I’m not talking about avoiding instructions where it matters). I’m not going to cover it here as it seems useful only in specific instances but this wiki article covers it if you’re interested.

Going from 8-bit to 16-bit

You can get both a code size reduction and speed increase in porting code designed for the 6502 to the 65816. This is especially true if you’re handling 16-bit data on the 6502 as I am in my 6502-based Forth. The 65816 handles 16-bit data naturally in native mode. A 16-bit operation on the 6502 involves working separately with the least and most significant bytes of the 16-bit data, increasing code size and reducing overall speed. The 65816 handles 16-bit data in one step.

Adding Two 16-bit Values

Given that my Forth operating system has a 16-bit cell size, I had numerous opportunities to improve my 6502-based code for the 65816. For example, here is a code snippet to add two 16-bit values located on the Forth stack (represented by the labels TOS and NOS for the values at top and next on stack). I’m using zero-page indexed addressing on the 65C02 which translates to direct-page indexed addressing on the 65816.

Adding two 16-bit values on the zero-page Forth stack with the 65C02

        clc          ; 2 cycles, 1 byte(s)
        lda TOS,x    ; 4         2
        adc NOS,x    ; 4         2
        sta NOS,x    ; 4         2

        lda TOS+1,x  ; 4         2
        adc NOS+1,x  ; 4         2
        sta NOS+1,x  ; 4         2

On the 65C02 it takes a total of 26 clock cycles and 13 bytes of code to add the two 16-bit values on a zero-page stack and return the value to the stack. We can reduce both of these substantially on the 65816.

Adding two 16-bit values on the direct page Forth stack with the 65816 in 16-bit mode

        clc          ; 2 cycles, 1 byte(s)
        lda TOS,x    ; 5         2
        adc NOS,x    ; 5         2
        sta NOS,x    ; 5         2

On the 65816 it takes a total of 17 clock cycles and 7 bytes of code to do the same thing. This is a 35% reduction in clock cycles and a 46% reduction in code size for this snippet. Most primitive Forth words have a similar construct, so the savings add up over the entire operating system.

What About 8-bit Data on the 65816?

It’s not all love though. Operating in 16-bit mode has a disadvantage when you need to work with 8-bit data, say writing to or reading from an 8-bit I/O device register. To do this on the 65816 you need to switch to 8-bit mode prior to the register write or read and back again to 16-bit mode afterwards.

Reading 8-bit I/O device registers

For example, reading from the 65C22 VIA shift register with the 65C02 can be done as follows:

Reading the VIA shift register on the 65C02

        lda VIA_SR    ; 4 cycles, 3 byte(s)

taking 4 clock cycles and 3 bytes. And on the 65816 in 16-bit mode:

Reading the VIA shift register on the 65816 in 16-bit mode

        ; change accumulator/memory to 8-bit
        sep #$20      ; 3 cycles, 2 byte(s)
        lda VIA_SR    ; 4         3
        ; change accumulator/memory to 16-bit
        rep #$20      ; 3         2

taking 10 clock cycles and 7 bytes, 150% longer and 133% larger than on the 65C02.

Technically, it’s a bit more complicated than this. The SEP #$20 instruction switches the accumulator/memory to 8-bit mode but keeps the accumulator’s most significant byte intact, restoring it when the REP #$20 instruction switches the accumulator/memory back to 16-bit mode. Thus, if the accumulator contained $FFFF before the above code sequence and VIA_SR is $65 (an ASCII letter ‘e’) then the accumulator would contain $FF65 after completing the code sequence above. Not exactly what we want. I get around this by dumping the accumulator immediately to a circular buffer before returning to 16-bit mode.
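
In my input routine that looks something like this sketch (ibuf is a hypothetical circular buffer and Y is assumed to hold the buffer index; the real code also handles wrap-around and other housekeeping):

Buffering the character before returning to 16-bit mode (a sketch)

        sep #$20        ; 8-bit accumulator
        lda VIA_SR      ; read the shift register
        sta ibuf,y      ; dump it to the buffer while still in 8-bit mode
        iny             ; bump the buffer index
        rep #$20        ; back to 16-bit; the stale high byte never reached memory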

You’d figure that this just kicks the can down the road: we’ll still have to deal with the 8-bit ASCII character problem when accessing the circular buffer. But maybe not. For example, if you can process the buffered data two bytes at a time you can avoid switching to 8-bit mode and back again, a round trip costing 6 clock cycles and 4 bytes as we saw above.

Other possibilities exist as well. For example, if you can accommodate 16-bit values but just need the least significant byte, you can just AND the 16-bit value from the buffer with $FF at a cost of 3 clock cycles and 3 bytes, less than what’s required by switching to 8-bit mode and back.

This brings up another possibility. Why switch to 8-bit mode at all? Why not read the 65C22 VIA shift register at VIA_SR while in 16-bit mode and then AND this value with $FF similar to the discussion above?

Reading the VIA shift register on the 65816 in 16-bit mode

        lda VIA_SR    ; 5 cycles, 3 byte(s)
        and #$FF      ; 3         3

This takes 8 clock cycles and 6 bytes, again, less than above. But this could have unintended consequences. In the case above you’d also read the Auxiliary Control Register. This isn’t a problem here but could be if you were writing to the shift register. Why would you write to the VIA shift register? It’s one way to start shift operations in some modes. If you did that in 16-bit mode, you’d need to make sure the desired ACR value was in the most significant byte being written to the shift register.
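
For example, writing the shift register in 16-bit mode might look like this sketch, where ACR_VALUE is whatever the Auxiliary Control Register should contain and CHAR is the byte to shift out (both hypothetical labels; expression syntax varies by assembler):

Writing the VIA shift register on the 65816 in 16-bit mode (a sketch)

        lda #ACR_VALUE*256+CHAR ; desired ACR value in the high byte
        sta VIA_SR              ; low byte to SR, high byte to ACR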

The point is writing or reading an I/O device register in 16-bit mode can affect device operation and should only be done with full knowledge of the impacts on the high-byte register. But with careful consideration this could save you a few clock cycles and bytes. The cost? Your code could be less clear and prone to introducing errors if edited in the future.

Working with strings

Sometimes it is more efficient to work with strings two characters at a time rather than working with individual characters. Doing so allows the 65816 to continue with 16-bit registers rather than incurring the overhead to shift to an 8-bit accumulator and back.

I came across this recently when porting the CC65 strlen function to the 65816. Refining the CC65 code and still working one character at a time we get (assuming 16-bit registers upon entry):

65816 strlen code using 8-bit accumulator

strlen:
        sta     ptr2            ; 4 cycles
        ldy     #0              ; 3

        php                     ; 3
        sep     #$20            ; 3
L1:     lda     (ptr2),y        ; 6
        beq     L9              ; 2,3
        iny                     ; 2
        bne     L1              ; 3

L9:     tya                     ; 2
        plp                     ; 4
        rts                     ; 6
                                ; (34 + 13*n)

A zero-length string takes 34 cycles. An additional 13 cycles per character are needed to determine the string length.

At the expense of a few bytes (5) and some readability, we can gain some speed by staying with the 16-bit accumulator and operating on two bytes during each loop.

65816 strlen code using 16-bit accumulator

strlen:
        sta     ptr2            ; 4 cycles
        ldy     #0              ; 3

L1:     lda     (ptr2),y        ; 6
        xba                     ; 3
        beq     @1              ; 2,3
        xba                     ; 3
        beq     L9              ; 2,3
        iny                     ; 2
        iny                     ; 2
        bne     L1              ; 3

        ; high byte is null, check low byte
@1:     xba                     ; 3
        beq     L9              ; 2,3
        iny                     ; 2

L9:     tya                     ; 2
        rts                     ; 6

Here we load two characters at a time into the accumulator, looping half the number of times as before. We test each character using the XBA instruction, which sets the zero flag if the upper byte of the accumulator is zero. Examining the code, we’re pretty much a wash cycle-wise with a zero-length string. Comparing two loops of the previous code to one loop above, though, shows some differences. Above, while we avoid one indirect-indexed accumulator load, we use two XBA instructions, so no advantage so far. We also have two INY instructions in each case, but the previous code has four branch instructions in two loops compared to only three in the code above, saving three cycles every two loops. Additional efficiency is gained depending on whether the string length is odd or even.

An odd length string has the biggest advantage when using the code above as we exit the last loop almost immediately and only have to adjust the string length as the last bit of overhead. In the previous code, we still need more than one complete loop for the ending plus all of the original overhead.

An even length string isn’t as efficient in the code above as the accumulator high byte in the last loop has to be processed even though it’s not used (we’d be better off here if the XBA instruction set the zero flag if the lower byte in the accumulator was zero). We still gain from eliminating a branching instruction though.

Calculating the total cycles for each of these cases:

  • string length odd:
    • 34 + (n-1) / 2 * 23
  • string length even:
    • byte after string terminator is null
      • 33 + n / 2 * 23
    • byte after string terminator is not null
      • 32 + n / 2 * 23
String Length    8-bit    16-bit (null)    16-bit (not null)
            0       34          33                 32
            1       47          34                 34
            2       60          56                 55
            3       73          57                 57
            4       86          79                 78
            5       99          80                 80
            6      112         102                101
            7      125         103                103
            8      138         125                124
            9      151         126                126
           10      164         148                147

strlen Total Cycles

The simplicity of the strlen looping criterion, and being able to determine whether either byte was null with just the XBA instruction without modifying either byte, was key to the efficiency gained in the latter case. I assume this isn’t typical and that normally it will be more efficient to switch to an 8-bit accumulator and operate on each character individually.

Address Mode Efficiency on the 65C02

There are often many different ways to do a task programmatically. The different methods will often present a tradeoff between the size and speed of the required code. Which you choose will depend on your project goals. For example, if speed is a goal, then you may inline a short bit of code that is often repeated rather than factoring it into a subroutine. This way you avoid the overhead of jumping to and returning from the subroutine at the cost of having repeated copies of the subroutine code in your program. Your code is faster but bigger.

The 65xx processor address modes provide similar tradeoffs between size and speed, and usually you have to look at specific examples to know which is best. Let’s take an example from my 6502-based Forth operating system. I often need to modify data indirectly. The 65C02 has 5 indirect address modes.

  • Absolute Indirect, (a)
  • Absolute Indexed Indirect, (a,x)
  • Zero Page Indirect, (zp)
  • Zero Page Indexed Indirect, (zp,x)
  • Zero Page Indirect Indexed, (zp),y

Let’s focus on the Zero Page Indirect options, as code size is one of the goals for my 6502-based Forth and the Absolute options all require an extra byte of code. Given my Forth’s 16-bit cell size, the data I’m modifying indirectly is usually two bytes in length. One way to do this is with the Zero Page Indirect address mode, incrementing the indirect address between copying the two bytes of data. For example:

Store the 16-bit data at NOS to the 16-bit address at TOS using Zero Page Indirect Addressing on 65C02

        ; copy 16-bit address at TOS to zero page indirect location W
        lda TOS,x       ; 4 cycles, 2 byte(s)
        sta W           ; 3         2
        lda TOS+1,x     ; 4         2
        sta W+1         ; 3         2

        ; copy data at NOS indirectly to TOS
        lda NOS,x       ; 4         2
        sta (W)         ; 5         2      lsb
        lda NOS+1,x     ; 4         2
        inc W           ; 5         2
        bne @1          ; 3/2       2
        inc W+1         ;   5       2
@1      sta (W)         ; 5         2      msb

This takes 40 clock cycles (44 if W+1 needs incremented) and 22 bytes of code. The increment of W takes many cycles above. Incrementing an index register would be much faster. The Zero Page Indirect Indexed address mode seems perfect for this. Let’s look at an example.

Store the 16-bit data at NOS to the 16-bit address at TOS using Zero Page Indirect Indexed Addressing on 65C02

        ; copy 16-bit address at TOS to zero page indirect location W
        lda TOS,x       ; 4 cycles, 2 byte(s)
        sta W           ; 3         2
        lda TOS+1,x     ; 4         2
        sta W+1         ; 3         2

        ; copy data at NOS indirectly to TOS
        ldy #0          ; 2         2
        lda NOS,x       ; 4         2
        sta (W),y       ; 6         2      lsb
        lda NOS+1,x     ; 4         2
        iny             ; 2         1
        sta (W),y       ; 6         2      msb

This takes 38 clock cycles and 19 bytes of code. Here we’ve reduced the code size by a few bytes and eliminated the variability in clock cycles, but we’ve gained little if any speed over the normal case where W+1 doesn’t need incremented. It would be nice if we didn’t have to copy the indirect address at the start of the operation.

Copy 16-bit data indirectly on the 65C02

We can do just that, given that the address at TOS is on the zero page, by using the Zero Page Indexed Indirect address mode.

Store the 16-bit data at NOS to the 16-bit address at TOS using Zero Page Indexed Indirect Addressing on 65C02

        ; copy data at NOS indirectly to TOS
        lda NOS,x       ; 4 cycles, 2 byte(s)
        sta (TOS,x)     ; 6         2      lsb

        ; increment TOS address for storing msb
        inc TOS,x       ; 6         2
        bne @1          ; 3/2       2
        inc TOS+1,x     ;   6       2

@1      lda NOS+1,x     ; 4         2
        sta (TOS,x)     ; 6         2      msb

This takes 29 clock cycles (34 if TOS+1 needs incremented) and 14 bytes of code. This is always faster and smaller than the other options and should be used in this case.

However, this doesn’t mean Zero Page Indexed Indirect addressing is the best in all cases. Incrementing the TOS address is costly compared to incrementing a register. If we had to operate indirectly on many bytes of data, then the repeated cost of incrementing the indirect address with either Zero Page Indirect or Zero Page Indexed Indirect addressing could easily outweigh the one-time overhead of copying the indirect address and using the Y register with Zero Page Indirect Indexed addressing. At first glance, it’s hard to know which address mode is going to win. Many times, you have to try different methods to find out.
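
For example, filling many bytes indirectly with Zero Page Indirect Indexed addressing pays the address copy once and then pays only a cheap INY per byte (COUNT is a hypothetical byte count of up to 256):

Operating on many bytes with Zero Page Indirect Indexed addressing on the 65C02 (a sketch)

        ; copy the indirect address at TOS once
        lda TOS,x       ; 4 cycles
        sta W           ; 3
        lda TOS+1,x     ; 4
        sta W+1         ; 3

        lda #0
        ldy #0
@1:     sta (W),y       ; 6 cycles per byte
        iny             ; 2, versus 5 or more to increment W itself
        cpy #COUNT      ; 2
        bne @1          ; 3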

Copy 16-bit data indirectly on the 65816

These considerations don’t change moving to the 65816, but the example above with 16-bit mode becomes trivial. For example:

Store the 16-bit data at NOS to the 16-bit address at TOS using Direct Page Indexed Indirect Addressing

        lda NOS,x       ; 5 cycles, 2 byte(s)
        sta (TOS,x)     ; 7         2

uses just 12 clock cycles and 4 bytes, a substantial improvement over the 65C02.

But the 65816 provides additional address modes which can provide additional efficiencies.

Taking Advantage of New 65816 Address Modes

To branch execution in my Forth compiled code I embed a 16-bit offset that indicates how far to jump if the branch is taken. Similar to the discussion above, I need to access this offset indirectly. In this case though, I already have a zero page instruction pointer, IP, assigned to access my compiled code, so I can avoid copying an address to this location as I had to do with the Zero Page Indirect Indexed mode above. Avoiding this copy makes the Zero Page Indirect Indexed mode substantially faster as we take advantage of incrementing a register rather than a memory location.

Branching to an Offset

Take for example this code snippet from my 6502-based Forth that branches my compiled Forth code to another instruction.

Branch to offset at instruction pointer IP on the 65C02

        ; adjust IP for branching offset (offset is at IP)
        ; (offset is in two's complement)
        ldy #1          ; we'll be changing IP
        lda (IP),y      ; so save offset msb before we do
        sta B3

        ; add offset to IP
        lda (IP)        ; offset lsb
        clc
        adc IP          ; add it to IP 
        sta IP
        lda IP+1
        adc B3          ; offset msb
        sta IP+1

The cycle and byte counts for the above are:

Branch to offset at instruction pointer IP on the 65C02

        ldy #1          ; 2 cycles, 2 byte(s)
        lda (IP),y      ; 5         2
        sta B3          ; 3         2

        ; add offset to IP
        lda (IP)        ; 5         2
        clc             ; 2         1
        adc IP          ; 3         2
        sta IP          ; 3         2
        lda IP+1        ; 3         2
        adc B3          ; 3         2
        sta IP+1        ; 3         2

or 32 clock cycles and 19 bytes to modify the instruction pointer for the 16-bit branching offset. This is more straightforward on the 65816.

Branch to offset at instruction pointer IP on the 65816 in 16-bit mode

        ; add offset to IP
        lda (IP)        ; 6 cycles, 2 byte(s)
        clc             ; 2         1
        adc IP          ; 4         2
        sta IP          ; 4         2

This takes only 16 clock cycles and 7 bytes. This, however, assumes that our Forth code resides in Bank 0. Unfortunately, my Forth code resides in Bank 2 so Direct Page Indirect addressing isn’t going to work for me here. Luckily the 65816 has a new address mode to handle this, Direct Page Indirect Long addressing.

Branch to offset at long instruction pointer IP on the 65816 in 16-bit mode

        ; add offset to IP
        lda [IP]        ; 7 cycles, 2 byte(s)
        clc             ; 2         1
        adc IP          ; 4         2
        sta IP          ; 4         2

This takes only 17 clock cycles and 7 bytes. But being able to place the Forth code in any bank has cost more than just one clock cycle. The instruction pointer, IP, has grown to 24 bits, and we now have to account for changes to that extra byte, the bank byte, in our code. To illustrate this cost, it’s helpful to take an aside to look at this before we continue with new 65816 address modes.

Transitioning the 6502-based NEXT to the 65816

In Forth, the NEXT routine jumps to the code for the next instruction after incrementing the Forth instruction pointer, IP, to point to the following instruction. Seems pretty straightforward. Here is the code for my 6502-based Forth.

NEXT: jump to the next instruction after incrementing the instruction pointer on the 65C02

        lda (IP)                ; 5 cycles, 2 byte(s)
        sta W                   ; 3         2
        ldy #1                  ; 2         2
        lda (IP),y              ; 5         2
        sta W+1                 ; 3         2
        lda IP                  ; 3         2
        clc                     ; 2         1
        adc #2                  ; 2         2
        sta IP                  ; 3         2
        bcc next1               ; 3/2       2
        inc IP+1                ;   5       2
next1:
        jmp (W)                 ; 6         3

This takes 37 clock cycles (or 41 if IP+1 needs incremented) and 24 bytes. Forthers will recognize this as direct threaded code or DTC. The distinction isn’t particularly important for this discussion but will play a role as we move forward. If you’d like to know more, the definitive primer on Forth threading techniques seems to be Brad Rodriguez’s Moving Forth.

As above, we can port this to the 65816.

NEXT: jump to the next instruction after incrementing the instruction pointer on the 65816

        lda (IP)                ; 6 cycles, 2 byte(s)
        sta W                   ; 4         2
        lda IP                  ; 4         2
        clc                     ; 2         1
        adc #2                  ; 3         3
        sta IP                  ; 4         2
        jmp (W)                 ; 5         3

This shows the familiar savings of switching to 16-bit, taking 28 clock cycles and 15 bytes. But again, this assumes that the Forth code is in Bank 0. Things aren’t quite so nice if our Forth code is in another bank. In that case we need to use some form of long addressing. Let’s use the straightforward Direct Page Indirect Long address mode.

NEXT: jump to the next instruction after incrementing the instruction pointer on the 65816, with long addressing

        lda [IP]                ; 7 cycles, 2 byte(s)
        sta W                   ; 4         2

        ; grab/store bank address byte
        ldy #2                  ; 3         3
        lda [IP],y              ; 7         2
        sta W+2                 ; 4         2

        lda IP                  ; 4         2
        clc                     ; 2         1
        adc #3                  ; 3         3
        sta IP                  ; 4         2
next1:
        jmp [W]                 ; 6         3

This takes 44 clock cycles and 22 bytes, more clock cycles than the 6502-based NEXT and almost as many bytes. In addition to the added clock cycles for the long addressing, we’ve also had to deal with updating IP’s bank address byte, which costs 14 clock cycles. Given that NEXT is run after nearly every Forth word, this is a substantial increase. Also not so obvious is that our Forth code now has to encode a 24-bit address rather than the 16-bit one we were using. This means that my Forth code, which is basically a table of 16-bit addresses, has increased in size by 50%. There is other stuff involved so it’s really not quite that dramatic. Luckily I’m not as concerned about code size in my move to the 65816. But I was hoping for a big improvement in the speed of my NEXT routine. Can I do any better?

One of the inefficiencies in the NEXT routines above is having to copy and increment the Forth instruction pointer, IP. This sounds like the behind-the-scenes work the processor does when it jumps to a subroutine, incrementing the program counter and storing it on the stack. What if we use a jump to subroutine as our NEXT? This is what Forth subroutine threaded code, or STC, does, and luckily the new 65816 jump to subroutine long instruction may help out.

Subroutine Threaded Code and the New 65816 JSL Instruction

With subroutine threaded code our compiled Forth code becomes a table of subroutine calls instead of simply a table of the addresses to those subroutines. This can be efficient as the instruction pointer maintenance done by NEXT in other Forth threading techniques is done by the processor, hopefully more efficiently. Let’s check if this works out for the 65816.

To get an apples-to-apples comparison with the NEXT above, we need to take one step back and see how we get to the NEXT routine in the first place. I’ve mentioned that almost every Forth word jumps to NEXT, but with the code above we’ll need a long jump, as our Forth code could be in a different bank than NEXT. So the complete sequence of moving from one Forth word to another in DTC using long addressing would be:

Moving from word to word in Direct Threaded Code on the 65816

; current Forth word code
...
        jml NEXT                ; 4 cycles, 4 bytes

...

NEXT:
        lda [IP]                ; 7         2
        sta W                   ; 4         2


        ; grab/store bank address byte
        ldy #2                  ; 3         3
        lda [IP],y              ; 7         2
        sta W+2                 ; 4         2

        lda IP                  ; 4         2
        clc                     ; 2         1
        adc #3                  ; 3         2
        sta IP                  ; 4         2
next1:
        ; jump to next word
        jmp [W]                 ; 6         3

So, the complete process of moving from one word to another in DTC with long addressing takes 48 clock cycles and 4 bytes (not counting the NEXT bytes as those aren’t repeated).

In subroutine threaded code NEXT is replaced by a jump to subroutine long, JSL, instruction. We return from that word with a return from subroutine long, RTL, instruction. This is simply:

Moving from word to word in Subroutine Threaded Code on the 65816

; current Forth word code
...
        rtl                     ; 6 cycles, 1 byte

...
; compiled Forth code
        jsl next_word           ; 8         4

Moving from one word to another in STC takes 14 clock cycles and 2 additional bytes (not counting the long address bytes that are common with the DTC above). That’s a dramatic reduction from DTC. But it comes at a cost: we’ve added yet another byte to our compiled Forth code, basically doubling the code size from the 6502-based DTC code. Even more dramatically, it’s four times the size of my 6502-based token threaded code. Again, given the other stuff compiled into Forth code, the actual increase isn’t quite so dramatic.
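The cycle arithmetic can be double-checked with a couple of lines of Python (the counts are taken from the listings above):

```python
# Word-to-word dispatch cost, cycle counts from the listings in this post.
jml, next_cycles = 4, 44       # DTC: long jump to NEXT, then NEXT itself
rtl, jsl = 6, 8                # STC: return from current word, call next word

dtc = jml + next_cycles        # moving word-to-word in DTC with long addressing
stc = rtl + jsl                # moving word-to-word in STC
assert (dtc, stc) == (48, 14)  # STC saves 34 cycles per dispatch
```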

Is the tradeoff worth it? This is a personal choice. Given that I’ve unlocked the upper banks of memory for my Forth code, size isn’t an issue anymore. I think the size tradeoff is acceptable to get such a dramatic improvement in moving from word to word in Forth.

Many 65xx Forthers will argue that you don’t need that additional memory for Forth code anyway. Just keep it all in Bank 0 and leave expanded RAM for data. With this you keep your Forth code small (DTC works fine) and fast (and, with a few more restrictions, a lot faster). There is a lot of truth in this, but for me it comes down to flexibility. The everything-in-Bank-0 DTC Forth is slower than the expanded-bank STC Forth unless you make compromises that are unacceptable, for me at least, such as no interrupts (see 65816 direct threaded NEXT). Sure, the 65816 can handle the I/O with polling, and many (most?) of the folks over at 6502.org seem happy with this. Maybe I’ll come around to their view someday. But I’ve always been one to try things out for myself and I’ll go with the multi-bank STC for now.

There are other drawbacks to my multibank STC approach. The 65816’s new address modes help get over these, but it’s still something I probably wouldn’t have to deal with if I stuck with a Bank 0 DTC Forth. We’ll look at these as I move along.

Revisiting the 65816’s New Address Modes

We left off our discussion of new address modes with an example of incrementing the Forth instruction pointer by a 16-bit offset in Forth code that may be located in another bank.

Branch to offset at instruction pointer IP with DTC on the 65816 in 16-bit mode

        ; add offset to IP
        lda [IP]        ; 7 cycles, 2 byte(s)
        clc             ; 2         1
        adc IP          ; 4         2
        sta IP          ; 4         2

This takes 17 clock cycles and 7 bytes. With STC we no longer have an instruction pointer; rather, we use the program counter just written to the stack before starting the current word’s code. A straightforward way to do the above in STC could be:

Branch to offset at program counter on stack with STC on the 65816 in 16-bit mode

        pla             ; 5 cycles, 1 byte(s), get return address
        inc             ; 2         1        , increment it for indirect address to offset
        sta B3          ; 4         2        , prep for indirection

        lda (B3)        ; 6         2        , get offset to branch to
        clc             ; 2         1
        adc B3          ; 4         2        , add it to return address
        dec             ; 2         1        , rtl needs address - 1
        pha             ; 4         1        , push return address for rtl

for a total of 29 clock cycles and 11 bytes, more than the DTC alternative above. We’re going in the wrong direction with STC. The 65816 has a new address mode that can help us though. It’s called Stack Relative Indirect Indexed, Y addressing.

Branch to offset at program counter on stack with STC on the 65816 in 16-bit mode using Stack Relative Indirect Indexed, Y Addressing

        ldy #1          ; 3 cycles, 3 byte(s)
        lda (1,S),y     ; 8         2        , get offset
        clc             ; 2         1
        adc 1,S         ; 5         2        , add it to return address
        sta 1,S         ; 5         2        , back on stack

This takes 23 clock cycles and 10 bytes, much closer to the DTC alternative. Considering the substantial speed improvement from eliminating the DTC NEXT, I think this is another acceptable tradeoff for STC, especially considering it mainly occurs in flow control words.
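Here’s a small Python model of what that sequence computes (the addresses and offset value are made up for illustration): the return address on the stack points one byte before the compiled 2-byte offset (RTL adds one before resuming), and the offset fetched through it is added back into the return address itself.

```python
# Model of the stack-relative branch: the word at 1,S is both the pointer
# used to fetch the offset (via (1,S),y with Y = 1) and the base it's added to.

def take_branch(stack, memory):
    ret = stack[-1]                            # 16-bit word at 1,S
    lo, hi = memory[ret + 1], memory[ret + 2]  # lda (1,S),y with Y = 1
    offset = lo | (hi << 8)
    stack[-1] = (ret + offset) & 0xFFFF        # adc 1,S / sta 1,S
    return stack[-1]

memory = {0x1001: 0x10, 0x1002: 0x00}   # offset $0010 compiled at $1001
stack = [0x1000]                        # return address left by JSL
assert take_branch(stack, memory) == 0x1010
```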

The above code snippet adjusts the program flow when the branching offset is taken. What about when the branch isn’t taken? In this case the program flow must move to the instruction past the branching offset. In my 6502-based DTC Forth this is:

Adjust instruction pointer IP to skip over 2-byte branch offset with DTC on 65C02

        lda IP          ; 3 cycles, 2 byte(s)
        clc             ; 2         1
        adc #2          ; 2         2
        sta IP          ; 3         2
        bcc @2          ; 3/2       2
        inc IP+1        ;   5       2
@2:

This takes 13 clock cycles (or 17 if IP+1 needs incrementing) and 11 bytes. Can we improve this with the 65816 in 16-bit mode?

Adjust instruction pointer IP to skip over 2-byte branch offset with DTC on 65816

        lda IP          ; 4 cycles, 2 byte(s)
        clc             ; 2         1
        adc #2          ; 3         3
        sta IP          ; 4         2

This takes 13 clock cycles and 8 bytes, so only a modest improvement moving to 16-bit DTC: we’ve eliminated the branching portion and saved a few bytes. Do we improve with 16-bit STC?

Adjust address on stack to skip over 2-byte branch offset with STC on 65816

        pla             ; 5 cycles, 1 byte(s)
        clc             ; 2         1
        adc #2          ; 3         3
        pha             ; 4         1

This takes 14 clock cycles and 6 bytes. A bit slower, but also shorter. How about using Stack Relative addressing?

Adjust address on stack to skip over 2-byte branch offset with STC on 65816 using Stack Relative addressing

        lda #2          ; 3 cycles, 3 byte(s)
        clc             ; 2         1
        adc 1,S         ; 5         2
        sta 1,S         ; 5         2

This takes 15 clock cycles and 8 bytes, so the pull/push to stack method above is preferable. We have one more alternative with Stack Relative addressing.

Adjust address on stack to skip over 2-byte branch offset with STC on 65816 using Stack Relative addressing, Method #2

        lda 1,S         ; 5 cycles, 2 byte(s)
        inc             ; 2         1
        inc             ; 2         1
        sta 1,S         ; 5         2

This takes 14 clock cycles and 6 bytes, so no better than the pull/push alternative. Also, this method will only work when the data we’re skipping is two bytes. The other alternatives are more flexible and thus preferred.

Other Stack Effects with STC

Subroutine threaded code can cause additional work for words that store or remove data on the hardware return stack. In such cases the RTL return address and bank byte must be pulled from the stack, saved and then returned to the stack after the word has finished its work. I discuss methods for this below.

more to come

Adding a Long Address to the Stack

Closely related to the above is the question of the most efficient way to get a long address onto the stack in the first place. As shown above, if a long address that you don’t need is already on the stack, an efficient way to replace it is to simply modify it in place. But what if we need to add one to the stack? The trouble with a long address is that it’s 24 bits long: a single bank byte and a two-byte segment address. With our registers in 16-bit mode, that bank byte sticks out like a sore thumb. Of course, we can switch to an 8-bit accumulator and back at the cost of an extra six cycles, but it would be great if we could use the 65816’s single-byte registers, K, B or P, in some way to avoid that.

As I’ve been porting my Forth to the 65816 I’ve often contemplated how to work with bytes without incurring the 6-cycle penalty of changing register size. Sometimes you can just go ahead and use the 16-bit register and ignore or mask the high byte, with AND #$ff for example. This is often faster than changing/restoring register sizes.

Another example: in my DTC Forth’s NEXT above, you might have noticed that I get the bank byte of the address pointed to by IP as a 16-bit value and store it in W+2. Here W is 4 bytes long, with the last byte as padding so the 16-bit write at W+2 doesn’t overwrite anything important. We then long jump indirectly through W, using only its first three bytes. This cost me a byte but saved me 5 cycles (six cycles for the register size change less 1 because a 16-bit write is a cycle longer than an 8-bit write). There are other cases as well where you can avoid switching register sizes to handle byte-level data, usually at the cost of a byte of memory or a specialized routine that handles two bytes at a time. I haven’t come up with anything concrete yet, though, for adding a 24-bit value to the stack. I’ll discuss what I’ve found below.

A trivial case of putting a long address on the stack is where the bank byte, either program or data, isn’t going to change. Then we can simply PHK or PHB to put K or B on the stack. This doesn’t help though when the bank byte will change. We still need a way to write a single byte to the stack with 16-bit registers. The 65816 only has one other instruction that does that, PHP.

If we didn’t care what P was after the operation and we knew beforehand what bank byte we wanted, we could simply:

Pushing a known byte to the stack with P (16-bit registers)

        sep #x          ; 3 cycles, 2 byte(s)
        php             ; 3         1

where x is the bank byte we want on the stack. This isn’t particularly useful though and has an unwanted side effect: if x has the m and/or x flags set (bits 5 and 4), then we’ll need to restore the 16-bit registers afterward with REP #$30 (remembering we said we didn’t care what P was; there could be additional consequences otherwise).

It’s more likely though that we won’t know the bank byte ahead of time, so this trick won’t work. I’ve been using the following to write a long address to the stack, for a total of 16 cycles and 8 bytes:

Pushing a long address to the stack (16-bit registers)

        sep #$20        ; 3 cycles, 2 byte(s)
        lda B4          ; 3         2
        pha             ; 3         1
        rep #$20        ; 3         2
        phy             ; 4         1

where B4 is on the direct page and contains the desired bank byte and Y contains the desired segment address.

In reviewing how some 65816 Forths handle far memory access I came across an interesting case of putting a long address on the stack in Michael Guidero’s OF816. His NEXT routine in essence boils down to:

Slightly simplified version of Michael Guidero’s OF816 NEXT (License)

        ldy   #$0003        ; 3 cycles, 3 byte(s)
        lda   [IP],y        ; 7         2
        xba                 ; 3         1 xxHH  ->  HHxx
        pha                 ; 4         1 stack ... HHxx
        phb                 ; 3         1 stack ... HHxxxx
        dey                 ; 2         1
        dey                 ; 2         1
        lda   [IP],y        ; 7         2 MMLL
        sta   1,s           ; 5         2 stack ... HHMMLL
        lda   IP            ; 4         2
        clc                 ; 2         1
        adc   #$0004        ; 3         2
        sta   IP            ; 4         2
        bcc  :+             ; 2         2
        inc   IP+2          ; 6         2
:       rtl                 ; 6         1
                              63        26         

Michael pushes the bank byte to the stack with it in a 16-bit accumulator, but first uses XBA to swap the bank byte from the low byte to the high byte. The bank byte then is on the stack followed by a dummy byte. He then uses PHB to push another dummy byte to the stack, and then stack relative addressing to “fix up” the segment address in these two bytes.

Using a similar method with the initial conditions I used above (B4 is on the direct page and contains the desired bank byte and Y contains the desired segment address) we can add a long address to the stack in a similar way:

Pushing a long address to the stack (16-bit registers)

        lda B4          ; 4 cycles, 2 byte(s)
        xba             ; 3         1          xxHH  ->  HHxx
        pha             ; 4         1          stack ... HHxx
        phb             ; 3         1          stack ... HHxxxx
        tya             ; 2         1
        sta 1,s         ; 5         2          stack ... HHMMLL

for a total of 21 cycles and 8 bytes. This is the same number of bytes as above, but 5 cycles longer. Switching register sizes is the faster method. Is it the fastest general method? I’ll keep looking.
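As a sanity check on the stack picture, here’s a byte-level Python simulation of that XBA sequence (the bank and address values are arbitrary examples):

```python
# Simulate the XBA push trick: after a 16-bit pha, a dummy phb, and a
# 16-bit sta 1,s, the stack should hold the long address as HH MM LL
# from the bottom up (LL on top).

def push16(stack, value):                # pha: high byte first, low byte on top
    stack.append((value >> 8) & 0xFF)
    stack.append(value & 0xFF)

bank, addr = 0x02, 0xBEEF                # B4 = $02, Y = $BEEF (example values)
stack = []

a = bank                                 # lda B4 (16-bit A; high byte assumed 0)
a = ((a & 0xFF) << 8) | (a >> 8)         # xba:  xxHH -> HHxx
push16(stack, a)                         # pha:  stack ... HH xx
stack.append(0x00)                       # phb:  stack ... HH xx xx (dummy)
stack[-1] = addr & 0xFF                  # sta 1,s writes LL at 1,s ...
stack[-2] = (addr >> 8) & 0xFF           # ... and MM at 2,s

assert stack == [0x02, 0xBE, 0xEF]       # bottom-to-top: HH MM LL
```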

Pulling a Long Address from the Stack

Often paired with adding a long address to the stack is pulling one from it. As discussed, this is needed in an STC Forth when the definition itself will add or remove items from the hardware stack. I use the following to save a long address from the stack:

Pulling a long address from the stack (16-bit registers)

        ply             ; 5 cycles, 1 byte(s)
        sep #$20        ; 3         2
        pla             ; 4         1
        sta B4          ; 3         2
        rep #$20        ; 3         2

which totals 18 cycles and 8 bytes. Thus, it takes my code 34 cycles and 16 bytes to pull a long address from the stack and return it.

As with adding a long address to the stack, the fastest way to “pull” a long address that you don’t care about is to overwrite it with one you want to add. A fast way to pull a long address from below another long address on the stack is to copy the top long address down over it and then drop the top copy. First copy with:

Moving a long address on the stack (16-bit registers)

        lda 2,s         ; 5 cycles, 2 byte(s)
        sta 5,s         ; 5         2
        lda 1,s         ; 5         2
        sta 4,s         ; 5         2                

at 20 cycles and 8 bytes and then drop the top long address with the snippet above.

Unfortunately, our other options for pulling a long address from the stack are more limited than pushing one as there is no instruction to pull the K register. Also, unless we know that the bank byte to be pulled won’t have any undesired consequences, we can’t use PLB or PLP. The former could send our data accesses off into unknown memory and the latter could alter our register sizes. As above, if we didn’t care what P is or what the bank byte on the stack is, I suppose we could essentially discard the bank byte on the stack with:

Pulling a byte from the stack (16-bit registers)

        plp             ; 4 cycles, 1 byte(s)
        rep #$30        ; 3         2

For 7 cycles and 3 bytes, saving 3 cycles and 2 bytes over the similar scenario of pulling a byte between changing the accumulator size to 8-bit and back. But the above has the undesirable side effect of zeroing out the high bytes of the index registers, probably not something you want when working with 16-bit registers. If you don’t care what’s in the index registers, you can save a few cycles and bytes with this. It also assumes you want 16-bit registers afterward. That’s kind of a given for our discussion, but does it hold uniformly for all of the code that might call this snippet?

I’m still on the lookout for fast ways to pull a byte from the stack, both ones we want and ones we don’t. I’ll update this section if I find any.

Revisiting NEXT

While I posted Michael Guidero’s NEXT code above to illustrate a novel way to push a long address to the stack without switching to an 8-bit register, it’s worth examining the code for other efficiency tradeoffs. First, it’s clear from the code that Michael is using a 4-byte instruction pointer (IP), even though only 3 bytes are needed for the long address that it holds. The bank byte resides in the low byte of the second word. The high byte of the second word doesn’t come into play when IP is used for long addressing, but having this dead byte allows us to access the bank byte with 16-bit registers, avoiding a switch to 8-bit. This is similar to some of the register sizing tips I discussed above.

What’s less clear from Michael’s code is that IP apparently is also compiled into his Forth code as a 4-byte address. This is likely because Michael wanted to avoid the complication of writing a single byte during compilation. It also leaves a dummy byte with each address that could be used to pass information to the definition. For example, I often include an offset immediately after a program flow control word that provides the information necessary to branch execution. Only a small subset of words would benefit from this though.

Another idea comes to mind though. We could easily transform the Forth threading model to STC if we used the extra byte to form a long subroutine call. That would achieve the significant speed savings I discussed above without greatly affecting the size of his compiled code.

Unlike my DTC NEXT, Michael pushes the indirect long address to the stack and returns with an RTL instruction, rather than jumping indirectly as I do. This saves 4 bytes on the direct page and would be especially useful if one chose to use the direct page register as the stack index.

The last thing to notice about Michael’s code is that he allows IP to roll into the next bank by incrementing the bank byte as he increments IP. I haven’t gotten this far with my code.

Alternate Forth Return Stack

As shown above, subroutine threaded code has a few negative side effects. An annoying one was the 34 cycles needed to pull and restore the long return address from/to the hardware return stack for words such as >R, DO, R>, and UNLOOP. Here is my code for >R, R@, and R>:

>R, R@, and R> from 65816 STC Forth

xt_to_r: ( x -- ) (R: -- x )
        ; save the return address and bank byte
        ply             ; 5 cycles, 1 byte(s)
        sep #$20        ; 3         2
        pla             ; 4         1
        sta B4          ; 3         2
        rep #$20        ; 3         2

        ; put x on return stack
        lda TOS,x       ; 5         2
        pha             ; 4         1

        ; drop x from stack
        inx             ; 2         1
        inx             ; 2         1

        ; return PBR & return address to the return stack
        sep #$20        ; 3         2
        lda B4          ; 3         2
        pha             ; 3         1
        rep #$20        ; 3         2
        phy             ; 4         1
        rtl             ; 47        21

xt_r_fetch: ( -- x ) (R: x -- x )
        ; make room on stack
        dex             ; 2         1
        dex             ; 2         1

        ; copy x to data stack
        ; x is 4 bytes down on the stack
        lda 4,s         ; 5         2
        sta TOS,x       ; 5         2
        rtl             ; 14        6

xt_r_from: ( -- x ) (R: x -- )
        ; make room on stack
        dex             ; 2         1
        dex             ; 2         1

        ; save the return address and bank byte
        ply             ; 5         1
        sep #$20        ; 3         2
        pla             ; 4         1
        sta B4          ; 3         2
        rep #$20        ; 3         2

        pla             ; 5         1
        sta TOS,x       ; 5         2

        ; return PBR & return address to the return stack
        sep #$20        ; 3         2
        lda B4          ; 3         2
        pha             ; 3         1
        rep #$20        ; 3         2
        phy             ; 4         1
        rtl             ; 48        21

You’ll notice the pulling/pushing of the long return address in R> and >R that we discussed above. Can we get around this by using a separate Forth return stack?

Return Stack with Direct Page Indexed X Address Mode

This is the same address mode we’re using to maintain the Forth data stack, so to switch from it to the Forth return stack we simply need to save the data stack index currently in the X register and load the current return stack index. Switching back is just the reverse: store the return stack index and restore the data stack index to the X register. The fastest way to do this is to utilize the Y register:

Switch to the Forth return stack and back using the Y register

        ; switch to forth return stack
        txy             ; 2 cycles, 1 byte(s)
        ldx RSP         ; 4         2
                        ; 6         3

        ; switch to forth data stack
        stx RSP         ; 4 cycles, 2 byte(s)
        tyx             ; 2         1
                        ; 6         3

The downside of this method is that the Y register is not available for other use, but you’ll notice that I’m not using it in my return stack words. If needed, we could use the slightly slower:

Switch to the Forth return stack and back using the stack

        ; switch to forth return stack
        phy             ; 4 cycles, 1 byte(s)
        ldx RSP         ; 4         2
                        ; 8         3

        ; switch to forth data stack
        stx RSP         ; 4 cycles, 2 byte(s)
        plx             ; 5         1
                        ; 9         3

Let’s rewrite >R, R@, and R> using a Forth return stack and the Y register:

>R, R@, and R> with a separate return stack and the Y register

xt_to_r: ( x -- ) (R: -- x )
        ; get x from data stack
        lda TOS,x       ; 5         2

        ; switch to forth return stack
        txy             ; 2 cycles, 1 byte(s)
        ldx RSP         ; 4         2

        ; make room on return stack
        dex             ; 2         1
        dex             ; 2         1

        ; put x on return stack
        sta TOS,x       ; 5         2

        ; switch to forth data stack
        stx RSP         ; 4         2
        tyx             ; 2         1

        ; drop x from data stack
        inx             ; 2         1
        inx             ; 2         1

        rtl             ; 30        14

xt_r_fetch: ( -- x ) (R: x -- x )
        ; make room on data stack
        dex             ; 2         1
        dex             ; 2         1

        ; switch to forth return stack
        txy             ; 2         1
        ldx RSP         ; 4         2

        ; get x from return stack
        lda TOS,x       ; 5         2

        ; switch to forth data stack
        stx RSP         ; 4         2
        tyx             ; 2         1

        ; copy x to data stack
        sta TOS,x       ; 5         2
        rtl             ; 26        12

xt_r_from: ( -- x ) (R: x -- )
        ; make room on data stack
        dex             ; 2         1
        dex             ; 2         1

        ; switch to forth return stack
        txy             ; 2         1
        ldx RSP         ; 4         2

        lda TOS,x       ; 5         2

        ; drop x from return stack
        inx             ; 2         1
        inx             ; 2         1

        ; switch to forth data stack
        stx RSP         ; 4         2
        tyx             ; 2         1

        sta TOS,x       ; 5         2
        rtl             ; 30        14       

Let’s summarize:

>R, R@, and R> Cycle and Byte Counts with Hardware and Separate Return Stacks

        Hardware Return Stack         Forth Return Stack
        cycles          bytes        cycles          bytes
>R        47             21            30             14
R@        14              6            26             12
R>        48             21            30             14
Totals:                  48                           40
                                                delta
>R R>     95                           60         35

>R R@ R>
1 loop   109                           86         23
2 loops  123                          112         11
3 loops  137                          138         -1
x loops  95+14x                       60+26x      35-12x
Using the separate Forth return stack code above saves 6 bytes (8 bytes of code savings less 2 bytes to store the return stack pointer). Since >R and R> are used in pairs, each use with the separate Forth return stack code saves 35 cycles. However, there is a tradeoff: R@ is longer by 12 cycles. Since R@ is commonly used in a loop, these extra cycles soon outweigh the savings from >R and R>, with the separate Forth return stack at a disadvantage by the third loop. That’s not many, and it highlights again the disadvantage of having to do some sort of transfer operation on a 65xx processor.
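The break-even arithmetic is easy to check in a few lines of Python (the per-word cycle totals are simply treated as inputs here):

```python
# Break-even check for the return-stack tradeoff. Cycle totals per word,
# taken as inputs: hardware stack >R/R@/R> = 47/14/48, separate Forth
# return stack = 30/26/30 (rtl excluded throughout, as in the tallies).
hw_pair, hw_fetch = 47 + 48, 14   # >R + R> pair, plus R@ per loop iteration
fs_pair, fs_fetch = 30 + 30, 26

def delta(loops):
    """Cycles saved by the separate stack for one >R ... R@-loop ... R> use."""
    return (hw_pair + hw_fetch * loops) - (fs_pair + fs_fetch * loops)

assert delta(0) == 35            # no R@ in a loop: separate stack wins
assert delta(2) > 0 > delta(3)   # the advantage is gone by the third loop
```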

So, which is better? As usual, it depends. In my fairly limited Forth code, I only use R@ once, in a little-used word and not in a loop, but I use the >R, R> pair 29 times in many more frequently used words. So, in my code, I’m leaning toward using a separate Forth return stack for >R, R@, and R>.

Using a Forth Return Stack in General

Did you catch my qualification there? Don’t a bunch of other words use the hardware return stack? Why not switch those as well?

Other Forth words using the hardware return stack

DO ?DO UNLOOP

I J LOOP +LOOP

LEAVE

Here, only DO and UNLOOP manipulate the size of the return stack and represent savings to be had with a separate return stack. The second group, I, J, LOOP, and +LOOP, only access the return stack and thus, like R@ above, require additional cycles to switch to a separate Forth return stack. Worse, these words are specific to looping, and as we saw above it doesn’t take many loops for our savings to go away. For one, DO and LOOP come in a pair, so the savings in DO are offset somewhat by the losses in LOOP. But I and J are pure loss, and while not used in every do loop, they are frequently used. I use them a combined 24 times. And often when the two are used, the effect is multiplied because I is the inner loop count, so the losses are multiplied by the outer loop count.

Complicating matters further, LEAVE, LOOP, and +LOOP utilize a return address placed on the return stack by DO to resume execution past the do loop upon completion. This is more complicated when using a separate Forth return stack. Again, most times it’s better when we can take advantage of an inherent 65xx feature from the start rather than switching to an alternate. For the do loop construct it’s better to use the hardware return stack from the start.

Other Ways to Create a Forth Return Stack

But can’t we take advantage of other 65816 address modes to make a separate Forth return stack more attractive? In short, no, but let’s examine some possibilities further.

Creating a Forth Return Stack Using the Hardware Stack Pointer

more to come

16-bit Unsigned Multiplication

The E&L programming manual shows the size efficiencies that can be achieved in multiplying two 16-bit values when moving from the 65C02 to the 65816 (compare Listings 14.1 and 14.2). The algorithm is simple: in a loop, shift one of the multiplicands right. If the bit shifted out is set, add the other multiplicand to the result; if not, do nothing. Finally, shift the other multiplicand left, preparing for the next iteration. You can simply loop 16 times, but an early exit is possible by checking whether the multiplicand being shifted right is zero at the top of each loop.

These routines only produce a 16-bit result, so I modified the algorithm to produce the double-cell result required by the Standard Forth word UM*. For simplicity I used the Forth data stack, with direct page indexed addressing, to hold the intermediate values.
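Before looking at the assembly, the algorithm can be modeled in a few lines of Python (a sketch of the logic only, with none of the stack handling):

```python
# Shift/add multiplication, 16x16 -> 32 bits, as described above.
def um_star(u1, u2):
    result = 0
    while u2:                            # early exit once all set bits are used
        u2, low_bit = u2 >> 1, u2 & 1    # lsr: shift one multiplicand right
        if low_bit:                      # bit shifted out was set:
            result += u1                 #   add the (shifted) other multiplicand
        u1 <<= 1                         # asl/rol: shift it left for next pass
    return result & 0xFFFFFFFF          # double-cell (32-bit) product

assert um_star(0xFFFF, 0xFFFF) == 0xFFFE0001
assert um_star(0x1234, 0) == 0
```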

Unsigned Multiplication (16×16=32), simple shift/add, Forth stack-based

um_star:

        ; cycle count:
        ;       setup:          53
        ;       cleanup:        13
        ;       last loop:       8      (total overhead 74)
        ;       per set bit:    62
        ;       per reset bit:  37 (below highest set bit)
        ;
        ;        136 cycles for $0001 * x
        ;        829 cycles for $5555 * x
        ;        691 cycles for $8000 * x
        ;        866 cycles for $AAAA * x
        ;       1066 cycles for $FFFF * x

        ; add u3 and u4 to the stack and initialize them to zero
        ; ( u1 u2 -- u1 0 u2 )
        dex             ; 2
        dex             ; 2
        stz TOS,x       ; 5
        jsr swap        ; 32

        ; ( u1 0 u2 --  u1 0 u2 0 )
        dex             ; 2
        dex             ; 2
        stz TOS,x       ; 5

        ; ( u1 0 u2 0 -- ud )
        lda #0          ; 3
@loop:
        ldy NOS,x       ; 5
        beq @done       ; 2/3
        lsr NOS,x       ; 8
        bcc @1          ; 2/3
        clc             ; 2
        adc SI4,x       ; 5
        tay             ; 2
        lda SI3,x       ; 5
        adc TOS,x       ; 5
        sta SI3,x       ; 5
        tya             ; 2
@1:
        asl SI4,x       ; 8
        rol TOS,x       ; 8
        bra @loop       ; 3

@done:
        ; save the least significant part of the result
        ; and drop the top two working cells
        sta SI4,x       ; 5
        inx             ; 2
        inx             ; 2
        inx             ; 2
        inx             ; 2

        rts

You can see that I’ve taken advantage of the accumulator to retain the least significant part of the result, with the Y register as temporary storage for it while we work with the most significant part. However, using the Forth data stack to hold the intermediate results is quite costly, both in setup and cleanup and during the loop, as all of the indexed operations add an extra cycle compared to dedicated direct page access. At the cost of 6 bytes of dedicated (or shared, with proper care) direct page memory we can eliminate up to 8 cycles per loop and 16 cycles of overhead.

Unsigned Multiplication (16×16=32), simple shift/add, dedicated direct page memory

um_star:

        ; cycle count:
        ;       setup:          31
        ;       cleanup:        15
        ;       last loop:      12      (total overhead 58)
        ;       per set bit:    45
        ;       per reset bit:  31 (below highest set bit)
        ;
        ;        103 cycles for $0001 * x (24.3% faster)
        ;        635 cycles for $5555 * x (23.4%)
        ;        568 cycles for $8000 * x (17.8%)
        ;        666 cycles for $AAAA * x (23.1%)
        ;        778 cycles for $FFFF * x (27.0%)

        lda TOS,x       ; 5
        sta u2          ; 4
        lda NOS,x       ; 5
        sta u1          ; 4
        stz u3          ; 4

        phx             ; 5
        lda #0          ; 3
        tay             ; 2
@loop:
        lsr u2          ; 7
        bcs @1          ; 2/3
        beq @done       ; 2/3
        bcc @2          ; 2/3
@1:
        clc             ; 2
        adc u1          ; 4
        tax             ; 2
        tya             ; 2
        adc u3          ; 4
        tay             ; 2
        txa             ; 2
@2:
        asl u1          ; 7
        rol u3          ; 7
        bra @loop       ; 3
@done:
        plx             ; 4
        sta NOS,x       ; 5
        sty TOS,x       ; 5

        rts

Here I’ve used both index registers to store the final product. This saves 2 cycles on each related load/store operation. Small savings, but they add up, though there is some additional overhead since the X register is also the Forth data stack index and must be preserved. I’ve also saved a few cycles by using the right shift operation to determine when we’re done. Overall, for the examples shown, the savings can be substantial, from about 18% to 27% depending on the number of bits set.

The three shift operations in each loop are an ideal place to look for more savings. The accumulator can be shifted in 2 cycles, but transferring to and from it incurs at least 4 cycles if we use the index registers as storage. That is a savings of just 2 cycles per loop if we replace the left shift operations, but we add cycles in the add portion of the loop because we then have to save the final product in memory, for a net increase. There may be some savings in moving the right shift and early exit check, but the biggest savings will likely come from a different algorithm altogether, such as Booth’s algorithm. I’ll update this section as I do more research.

Right-shifted Multiplication

Notice that there are two add instructions in the addition section of the code above. That’s because we must add in any carry that occurs when adding the multiplicand into the partial product. It would be nice to avoid this second addition by incorporating it into the shift operation that follows. Unfortunately, this doesn’t work when shifting the multiplicand to the left. In researching binary multiplication algorithms, I came across references to right-shifted binary multiplication, in this 6502.org post and this stackoverflow post for example. The process is straightforward, but I didn’t realize the implications until I saw some code by Andrew Jacobs for his ANS Forth for W65c816SXB.

UM* by Andrew Jacobs excerpted from his ANS Forth for W65c816SXB (license)

; UM* ( u1 u2 -- ud )
;
; Multiply u1 by u2, giving the unsigned double-cell product ud. All values and
; arithmetic are unsigned.

                HEADER  3,"UM*",NORMAL
UM_STAR:
                lda     <1                      ; Fetch multiplier
                pha
                stz     <1                      ; Clear the result
                ldx     #16
UM_STAR_1:      lda     <3                      ; Shift multiplier one bit
                lsr     a
                bcc     UM_STAR_2               ; Not set, no add
                lda     1,s                     ; Fetch multiplicand
                clc
                adc     <1
                sta     <1
UM_STAR_2:      ror     <1                      ; Rotate high word down
                ror     <3
                dex
                bne     UM_STAR_1
                pla
                CONTINUE                        ; Done

This isn’t directly comparable to my code above, but it’s remarkably close in structure. Andrew is using the direct page register as the Forth stack index while I’m using the X register with direct page indexed addressing. This avoids preserving the X register as in my code above, reducing the overhead for this routine since the stack size doesn’t change. Of course, nothing is free. Andrew must modify the direct page register every time he needs to change the stack index, a multi-instruction task. I only need to inc/dec the X register.

Formatting Andrew’s code similar to above for easier comparison and adding cycle counts, we have

Andrew’s reformatted UM* with cycle counts

UM_STAR:

        ; cycle count:
        ;       setup:          16
        ;       cleanup:         4
        ;       last loop:      -1      (total overhead 19)
        ;       per set bit:    42
        ;       per reset bit:  28 (below highest set bit)
        ;
        ;        481 cycles for $0001 * x (367% slower)
        ;        579 cycles for $5555 * x (8.8% faster)
        ;        481 cycles for $8000 * x (15.3% faster)
        ;        579 cycles for $AAAA * x (13.1% faster)
        ;        691 cycles for $FFFF * x (11.2% faster)

        lda 1           ; 4
        pha             ; 5
        stz 1           ; 4
        ldx #16         ; 3
UM_STAR_1:
        lda 3           ; 4
        lsr             ; 2
        bcc UM_STAR_2   ; 2/3
        lda 1,s         ; 5
        clc             ; 2
        adc 1           ; 4
        sta 1           ; 4
UM_STAR_2:
        ror 1           ; 7
        ror 3           ; 7
        dex             ; 2
        bne UM_STAR_1   ; 2/3
        pla             ; 4

        CONTINUE                        ; Done

The first thing that jumped out at me was that the right-shifted multiplication generally resulted in somewhat faster execution than the left-shifted algorithm. This was surprising because the former requires that the full 16 shifts take place, while the latter allows for an early exit once the set bits in the multiplier have been fully factored into the product. This trade-off is very noticeable in the $0001 * x example, where the right-shifted algorithm was 367% slower than the left-shifted one. Which one is better depends on your application: choose a modest improvement for larger numbers at the expense of a dramatic slowdown for smaller numbers, or vice versa.

Another speed increase in the code above comes from shifting the multiplier in the accumulator. The multiplier must be loaded each loop, but this gives the 1 cycle advantage I mentioned above. Less memory is used in the case above as well, as the product is calculated on the stack AND the multiplier is rotated out as the product is shifted in. With this, only two extra bytes are used on the stack, to hold the multiplicand.

An interesting trick I noticed in the code above was that the multiplicand is stored on the return stack and fetched each loop with stack relative addressing. This is not as efficient as using a dedicated direct page address but is a necessary trade-off when the direct page register is used as the Forth data stack pointer.
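To make the right-shift mechanics concrete, here is a Python sketch of the structure above (my paraphrase of the algorithm, not Andrew’s code): the high cell of the product and the multiplier are rotated right together, and the carry out of the add rotates back in, which is why no second add is needed.

```python
def um_star_right(u1: int, u2: int) -> int:
    """Right-shifted 16x16 -> 32 unsigned multiply.

    hi models the high result cell; u2 is the multiplier, whose cell
    is reused for the low result bits as they rotate in. Exactly 16
    iterations are always required.
    """
    hi = 0
    for _ in range(16):
        carry = 0
        if u2 & 1:                       # low bit of the multiplier
            total = hi + u1
            hi = total & 0xFFFF
            carry = total >> 16          # carry out of the single add
        bit0 = hi & 1
        hi = (hi >> 1) | (carry << 15)   # ROR high cell, carry rotates in
        u2 = (u2 >> 1) | (bit0 << 15)    # ROR multiplier, product bit in
    return (hi << 16) | u2
```

After the 16th rotation the multiplier cell holds the low half of the product, so no extra cell is needed beyond the one for the high half.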

So which algorithm is better? It seems that speed is the determining factor for my 65816 build. Looking through my Forth code I see a program that does a lot of multiplication by small numbers, meaning that an early exit algorithm could be superior (assuming the operands are ordered correctly; should I check for this, adding a few cycles of overhead to save more with an early exit?). With this I think I’ll stick with the left-shifted algorithm and early exit.
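The ordering check in question can be sketched like this (hypothetical, in Python): swapping so the smaller operand is the multiplier costs a comparison up front but minimizes loop iterations, since the left-shifted loop runs once per bit up to the multiplier’s highest set bit.

```python
def um_star_ordered(u1: int, u2: int) -> int:
    """Early-exit left-shifted multiply with an up-front operand swap."""
    if u2 > u1:
        u1, u2 = u2, u1                  # smaller value becomes multiplier
    product = 0
    while u2:                            # exits after bit_length(u2) loops
        if u2 & 1:
            product = (product + u1) & 0xFFFFFFFF
        u2 >>= 1
        u1 = (u1 << 1) & 0xFFFFFFFF
    return product
```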

Booth’s Algorithm

Booth’s algorithm is another right-shifted binary multiplication method that can result in fewer addition operations. Another advantage is that it can handle signed multiplication. This isn’t as much of an advantage as Forth seems to commonly keep track of the signs separately and apply the appropriate sign to the unsigned multiplication. The Wikipedia article provides an example of a signed multiplication which I think adds confusion when just starting out. An unsigned example is provided here, though it isn’t exactly clear on the process either. Going through the example of 7×3 with a bit of scratch paper we can see that the main right shifting operation is very similar to that used by Andrew Jacobs above where the most significant part of the result and the multiplier are rolled right in succession, thus requiring only an extra cell, as above, to calculate the product.

A disadvantage to Booth’s algorithm is a 16×16 multiplication still requires 16 shifts, as the right-shifted multiply above. The method also requires retaining the bit shifted out of the multiplier as it will be used, in conjunction with the current multiplier bit 0, to determine whether an addition operation is required. This appears to add quite a bit of code. See for example here, here, here, here, and here. That’s a lot of reading, and frankly more than I want to spend on a 16×16 operation. Browsing through it seems that Booth’s algorithm doesn’t have much advantage for such small values anyway. Perhaps I’ll revisit this if I do double-cell multiplication.
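For reference, here is a minimal Python sketch of radix-2 Booth multiplication for signed 16-bit operands, based on the textbook description rather than any of the linked implementations. Note the extra state (the previously shifted-out bit) it has to carry compared to the plain right-shifted multiply above:

```python
def booth_multiply(m: int, r: int) -> int:
    """Radix-2 Booth signed 16x16 -> 32 multiply.

    a is the high half of the product, q the multiplier (becoming the
    low half); (q bit 0, prev) selects subtract (1,0), add (0,1), or
    nothing, then the a:q pair is arithmetic-shifted right.
    """
    MASK16 = 0xFFFF
    m &= MASK16
    a, q, prev = 0, r & MASK16, 0
    for _ in range(16):
        bit0 = q & 1
        if bit0 == 1 and prev == 0:
            a = (a - m) & MASK16         # 10: subtract multiplicand
        elif bit0 == 0 and prev == 1:
            a = (a + m) & MASK16         # 01: add multiplicand
        prev = bit0                      # retain the bit shifted out
        q = ((q >> 1) | ((a & 1) << 15)) & MASK16
        a = (a >> 1) | (a & 0x8000)      # arithmetic shift keeps the sign
    result = (a << 16) | q
    return result - 0x100000000 if result & 0x80000000 else result
```

As with the right-shifted multiply, all 16 iterations always run; the payoff (fewer adds on long runs of 1s) only matters for larger operands.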

Using the X or DP register as the Forth data stack pointer

I’ve always used the X register as the Forth data stack pointer. This made the most sense on the 65C02, but the 65816 has other registers that warrant revisiting this choice. I didn’t see anything in various discussions (here, here and here for example) that made me rethink my choice in porting my 6502-based Forth over to the 65816. In reviewing Andrew Jacobs’ DP register index Forth multiply code above though I thought it worthwhile to look at this choice again.

Above we saw that the direct page operations save a cycle over direct page indexed operations. However, you lose this advantage if the low byte of the direct page register isn’t zero. Since this will almost always be the case, I never thought using the direct page register as a Forth data stack index was sensible given the extra work required to increment and decrement the register (compared to the straightforward inx/dex instructions).

For example, as a very crude look, in my Forth related code I have about 150 instances of inx and 135 of dex, mostly involved with deleting and adding items on the Forth data stack. This is 285 total, but since we’re working with 16-bit cells that amounts to about 143 index adjustments. Let’s take the worst case of a single adjustment per direct page register transfer (TDC/TCD) at the cost of 4 cycles (the index inc/dec themselves are the same in both cases). So that’s 572 extra cycles for using the direct page register as an index rather than X. Given that we often adjust the stack by more than a single cell, this estimate is high. Let’s say 60% of the time we make two or more adjustments, reducing this to about 340 cycles.

I have about 370 instances of using the direct page indexed X address mode, so 370 extra cycles compared to a simple direct page usage. That’s about a wash compared to the extra cycles used to adjust the direct page register index. However, direct page instructions take an extra cycle when the low byte of the direct page register isn’t zero. This negates most of the benefit of the direct page instruction cycle savings and we end up with just the cost of adjusting the direct page register index.

A better analysis would look at the usage frequency of words, comparing especially those words that do not change the stack size to those that do. If the former comprise a larger share of the usage than the latter, then my crude analysis above would be overturned.

However, in rereading the posts linked above, a few comments in the first thread stood out to me:

This means stack operations incur a small overhead as DP must be moved to/from C for adjustment but gives you use of the full set of zero page addressing modes when accessing the stack.

Andrew Jacobs

In particular I find it compelling that 24-bit addresses can be passed on stack and dereferenced in situ, bypassing unappealing alternatives such as gymnastics involving the Data Bank Register (DBR).

Jeff Laughton

Of course, they’re referring to direct page address modes here. Jeff points out one address mode in particular, Direct Page Indirect Long, that is not available as a stack-based option when using the X register as a Forth data stack index. Being able to use it on the Forth data stack would be nice.

Tantalized by the above comments I tried to find other direct page address modes that would benefit from using the direct page register, D, as the Forth data stack index. Unfortunately, I couldn’t come up with any compared to using the X register as the stack index.

The dp and dp,x address modes are comparable as are the (dp) and (dp,x) modes. The remaining direct page address modes, direct page indexed Y, indirect indexed Y, and long indirect indexed Y are all useful when the X register is used as the stack index (at least I use them in my own code) but don’t seem to have much value when using D as a stack index. To do something similar you’d have to manipulate the Forth stack which I’ve steered away from when possible in my primitive words. Why churn the stack unnecessarily?

Of course, there are other benefits of using D as the Forth stack index: it frees up the X register, and direct page instructions take 1 less cycle than their indexed equivalents. On the flip side though, the cycle savings mostly go away since the low byte of D usually won’t be zero, and incrementing the index is more involved than with the X register. Based on a very simplistic view of all words having equal frequency, I’d call these a wash, at least for my code. I want to think that the benefit swings toward using D as an index when word frequency is considered, but I haven’t attempted such an analysis.

So, is it worth using D as the Forth data stack index just to gain the indirect long mode for the Forth stack? And is it worth it to give up the direct page indexed Y modes? I posted the question over on 6502.org. As expected, the reply was basically, it depends on your goals. Jeff Laughton put it well:

I myself lean in the other direction, because I want an environment that encourages ambitious goals. I don’t mind if some Forth words run a little slower if the payoff is being able to achieve maximum efficiency (and elegance) when reaching beyond 64K.

Jeff Laughton, 6502.org

My discussion with Jeff led to clarifying a fine point in my discussion above though. You don’t lose the direct page indexed Y address modes with D as the Forth data stack index; it’s just that the operands must be on the Forth stack rather than in static memory locations. In fact, the concept of static direct page memory locations doesn’t make much sense with D as the Forth data stack index. Jeff brought up another option that may work in some cases though: using the absolute indexed X or Y modes in place of the direct page indirect indexed Y mode.

Alternatives to (dp),Y mode

As part of my research above, Jeff Laughton, over on 6502.org, suggested using the absolute indexed X or Y modes in place of the direct page indirect indexed Y mode. Instead of saving the indirect address to the direct page you can simply load the address into the X or Y register and use the absolute indexed X or Y address mode with a 0 operand. You save 4 cycles on the direct page save and 1 cycle on any absolute indexed vs direct page indirect indexed instructions. Again, there’s no free lunch though. If you’re doing this in a loop, you’ll need to have the ending address for comparison to end the loop. If you’re using an index as a loop counter, as I often do, then you can structure things to avoid the comparison and just branch on the flags set after incrementing/decrementing the loop index. Since the comparison is probably going to be at least 5 cycles per loop, this is only going to be useful when you can’t avoid the comparison in either mode. Looking through my code I see a few opportunities where I might be able to take advantage of this idea.

Note that special syntax may be needed to indicate to the assembler the desired address mode when the operand is a zero page address. For example, LDA 0,X could be interpreted as either the absolute or direct page indexed X address mode. Different assemblers use different syntax: WDC uses LDA !0,X to force the absolute indexed X mode, while ca65 uses LDA a:0,X. BTW, LDA f:0,X forces far addressing in ca65, just as LDA z:0,X can be used to force zero page addressing.

Forth Coded Words to Primitive on the 65816

In reviewing the various discussions related to the topic above I came across this comment:

Something I found so helpful with the ‘816 is that since it can handle 16 bits at a time, a lot of words that would have taken so much extra space if defined as primitives on the ’02 became easy with the ‘816, resulting not just in faster code but even shorter and easier to write in many cases—the best of all worlds. Because of this, my ‘816 Forth has around 350 primitives. Having so much higher percentage of the kernel in primitives dramatically reduces the number of times NEXT, nest, and unnest get executed to carry out a particular job, further speeding things up.

Garth Wilson, 6502.org

I currently have 131 primitive words:

65816 STC Forth Primitive Words as of 6/26/2022

! # #> #s ' ( * + +! +loop , - . ." /mod 0< 0= 1+ 1- 2drop 2dup 2over 2swap : ; < <# = > >body >in >number >r ?dup @ abort abort" abs accept align aligned allot and base c! c, c@ constant count cr create depth do does> drop dup else emit environment? execute exit find here hold i if immediate invert j key leave literal loop lshift move negate or over postpone quit r> r@ recurse rot rshift s" sign source space state swap then type u. um* um/mod unloop until word xor [ ['] ] again compile, marker parse parse-name ?do refill restore-input save-input source-id unused \ ahead ms key? blk buf bye close-file cold disassemble dodoes dsp0 dump interpret open-file r/w see words

62 Forth coded secondaries:

65816 STC Forth Secondary Words as of 6/26/2022

interpret_loop debug stack global thru load list buffer block flush save-buffers empty-buffers update bstatus cb scr ? .s page at-xy include included value to of erase endof endcase case .r */ */mod sm/rem fm/mod m* u+d .( [char] variable while u< spaces s>d repeat mod min max hex fill evaluate decimal chars char+ char cells bl begin 2@ 2/ 2* 2! cell+ /

This doesn’t include the Forth Standard tester related words:

65816 STC Forth Tester Related Words as of 6/26/2022

}t -> t{ error show-results empty-stack counter+ actual-results actual-depth counter verbose tests

not including the numerous words created during the tests themselves.

The split between primitive and secondary above is largely a holdover from my very first Forth, a token threaded version that was limited to 256 words total. The main goal of my TTC Forth was size, so I allocated 128 tokens to primitive words and 128 to secondary words. I had most of the words above in my TTC Forth (the block and file related words were added after I switched to DTC) so you can see I had already consumed close to 80% of my token allotment. That was the main reason for switching to DTC, as expanding the number of tokens took up more memory than switching to DTC.

Without the need to preserve memory (I now have 64k of ROM to play with vs 16k when I was working on TForth) I can take Garth’s comment to heart and code many (all?) of my secondary words as primitives. In fact, doing so eliminates some of the coding tricks I introduced in TForth. For example, I have a barebones interpret loop that allows me to bootstrap the secondary Forth words at startup. This way, if the secondary words failed to load, I could still do rudimentary testing with the primitive words. I’ll update this section as I convert words to show my progress.

The 65816 COP Instruction

I’ve eyed running a Mandelbrot routine since my 6502 builds but the size and complexity of floating-point routines was too much to justify the effort. I got over that with the 65816 so I recently added this third-party floating-point library to my operating system. It’s over 17k but that’s not a problem with my new system.

One open question was how to call the library routines. The package requires its own direct page, using almost the entirety of it, and thus requires some setup and recovery before and after calling a floating-point routine. This seems like a perfect situation for the COP instruction. It looks like this was the intention for the original package, but I couldn’t find any indication that it was ever implemented for it.

I worked up my own COP service routine to handle the floating-point prep and post work but was surprised that it took over double the cycles and bytes of simply calling separate prep and post routines from the body of the calling routine. Getting the COP signature byte added a good chunk of the extra cycles/bytes.

For example, here is a skeletonized snippet using COP taking 50 cycles and 19 bytes (not including common instructions with the comparison below):

COP-based Floating-point Package Call

; calling routine
        ...
        cop 1        ; 8 cycles, 2 byte(s)      COP vector: fp_csr
        ...

fp_csr:
        ...          ; fp_prep

        ; get COP signature byte
        lda ADDR,s   ; 5         2
        dec          ; 2         1
        sta F2       ; 4         2
        lda BB,s     ; 5         2
        sta F2+2     ; 4         2
        lda [F2]     ; 7         2
        and #$ff     ; 3         3
        cmp #1       ; 2         2
        beq @fp1     ; 3         2
        ...

@fp1:
        jsr fp1
        ...
        ...          ; fp_post

        rti          ; 7         1
                     ; 50        19

Alternatively, here is a snippet using a macro that automates the prep and post calls, taking only 24 cycles and 8 bytes:

Macro-based Floating-point Package Call

.macro fp1_call
        jsr fp_prep  ; 6 cycles, 3 byte(s)
        jsr fp1
        jsr fp_post  ; 6         3
.endmacro

; calling routine
        ...
        fp1_call
        ...

fp_prep:
        ...
        rts          ; 6         1

fp_post:
        ...
        rts          ; 6         1
                     ; 24        8

To be fair, each floating-point call in the latter case takes 9 bytes (including the call to the floating-point routine) versus just 2 bytes for the former (not counting the single call in the COP service routine). Edit 7/21/2022: actually, it’s much more than 9 bytes, as the JSR calls here won’t work for the prep and post work without pulling and replacing the return address on the stack, which we saw above takes 39 cycles and 16 bytes. With that we might as well use the COP instruction. Thus, to maintain speed we need to make the prep/post work macros, dramatically increasing the size of each call to the floating-point library (it’s currently 30 bytes per call in my code; I’ll need to benchmark this against the COP method to see if there’s any difference in an FP intensive program, say Mandelbrot). But speed seems more important for a floating-point package, so the latter case seems better.

Is there a faster way to get the COP signature byte?

I put the question to 6502.org: how were others interfacing with the floating-point package? Several confirmed my conclusion above that speed is important for floating point and that the COP instruction is likely not appropriate as an API call. One member, however, provided some code samples for handling COP instructions that are instructive for examining ways to extract the COP signature byte. Here are a few, each leaving the signature byte in the accumulator and anticipating a future change/restoration of the registers, so no attempt is made to return them to their original values:

Extract COP Signature Byte, method #1

        tsc          ; 2 cycles  1 byte(s)
        tcd          ; 2         1
        ldy ADDR     ; 5         2
        dey          ; 2         1
        sep #$20     ; 3         2
        lda BB       ; 4         2
        pha          ; 3         1
        plb          ; 4         1
        rep #$20     ; 3         2
        lda 0,y      ; 4         2
        and #$ff     ; 3         3
                     ; 35        18

Extract COP Signature Byte, method #2

        tsc          ; 2 cycles  1 byte(s)
        tcd          ; 2         1
        dec ADDR     ; 8         2
        lda [ADDR]   ; 8         2
        and #$ff     ; 3         3
        inc ADDR     ; 8         2
                     ; 31        11

Extract COP Signature Byte, method #3

        lda ADDR,s   ; 5 cycles  2 byte(s)
        dec          ; 2         1
        sta F2       ; 4         2
        lda BB,s     ; 5         2
        sta F2+2     ; 4         2
        lda [F2]     ; 7         2
        and #$ff     ; 3         3
                     ; 30        14

Method #3, which is what I used in my COP service routine above, is the fastest of these by a bit, but also a bit longer than the next fastest, Method #2.
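All of these methods are doing the same address arithmetic, just with different registers. Modeled in Python (a conceptual sketch with hypothetical names, not 65816 code): after COP in native mode the stack holds, from low addresses up, the status register, the return address, and the program bank, and the signature byte sits one byte before the return address.

```python
def cop_signature(memory: dict, stack: dict, sp: int) -> int:
    """Fetch the COP signature byte from the interrupt stack frame.

    Native-mode frame after COP: sp+1 = status register,
    sp+2/sp+3 = return address low/high (the ADDR,s value),
    sp+4 = program bank (the BB,s value).
    """
    ret = stack[sp + 2] | (stack[sp + 3] << 8)
    bank = stack[sp + 4]
    return memory[(bank << 16) | (ret - 1)]   # byte before return address
```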

Now if we just had INC and DEC for the Stack Relative address mode (I’ve often wished for this):

Extract COP Signature Byte, non-working, wished for method

        dec ADDR,s   ; 8 cycles  2 byte(s)
        lda [ADDR]   ; 8         2
        and #$ff     ; 3         3
        inc ADDR,s   ; 8         2
                     ; 27        9

which is basically Method #1 without the switch of the direct page. Unfortunately, along these lines we have to settle for:

Extract COP Signature Byte, method #4

        lda ADDR,s   ; 5 cycles  2 byte(s)
        dec          ; 2         1
        sta ADDR,s   ; 5         2
        lda [ADDR]   ; 8         2
        and #$ff     ; 3         3
        tax          ; 2         1
        lda ADDR,s   ; 5         2
        inc          ; 2         1
        sta ADDR,s   ; 5         2
        txa          ; 2         1
                     ; 39        17

which is the worst of all (though you could save the TAX/TXA if you could wait until the post work to INC the return address back to its original value).

But I don’t want a COP service routine in Bank 0

Assuming you go ahead with using the COP instruction to interface with some routines, you might decide that Bank 0 isn’t the best place for its service routine. This was the case for the floating point package I looked at above. With the COP service routine in Bank 0 and the floating-point package in Bank 1, all calls to it had to use long addressing. No problem, just switch to JSL instead of JSR. At first, I couldn’t figure out why my system no longer worked. Then it dawned on me. Duh, you need an RTL instead of an RTS as well. Here are two snippets comparing these alternatives.

Bank 0 based COP service routine needs JSL/RTL

Bank 0:
fp_csr:
        ...
        jsl fp1      ; 8 cycles, 4 byte(s)
        ...

Bank 1:
; calling routine
        ...
        cop 1        ; 8         2
        ...

fp1:
        ...
        rtl          ; 6         1
                     ; 22        7

This requires modifying the third-party code, which isn’t necessarily straightforward as it’s unclear whether any of the routines are used internally and thus would be broken by an RTS switched to an RTL.

COP service routine based in same bank as third-party library

Bank 0:
fp_csr:
        jml fpcop    ; 4 cycles, 4 byte(s)

Bank 1:
; calling routine
        ...
        cop 1        ; 8         2
        ...

fpcop:
        ...
        jsr fp1      ; 6         3
        ...

fp1:
        ...
        rts          ; 6         1
                     ; 24        8

The Bank 0 jump back to Bank 1 adds 4 cycles, but using a JSR saves two cycles over a JSL. Keeping the body of the COP service routine in the same bank as the third-party code allows simple calls with JSR/RTS and doesn’t require modifying the third-party code. I think this approach is better.

The 65816 MVN/MVP Instructions

The 65816 block move instructions, MVN and MVP, aren’t as useful as they otherwise could be given that the banks they work on are hardcoded as operands to the instructions. I show below how you can get around this with self-modifying code in RAM, but absent that you’re stuck with a set of banks determined at compile time. I recently had a good opportunity to put these instructions to use.

I recently added a third-party floating-point package to my system and, consistent with the Forth Standard, I added a separate floating-point stack as well. Ideally, the floating-point package would work directly with data in the floating-point stack, but modifying it is easier said than done. Absent that, I needed an efficient way to move data from the Forth floating-point stack to the floating-point package accumulator.

I originally designed the floating-point stack to contain IEEE-754 floating-point values and used the third-party FPACK and FUNPACK routines to move data back and forth between the floating-point stack and the package’s accumulator, which uses an internal, unpacked floating-point representation with separate sign, significand, and exponent parts. I soon realized that there was little to be gained from this packing/unpacking: it saves just 4 bytes per value, but at the cost of significant processor time. The floating-point stack values would invariably have to be sent back to the floating-point package to be unpacked before accomplishing anything with them and then repacked to be sent back to the Forth floating-point stack. It would be better to eat the 4 bytes and have the Forth floating-point stack be consistent with the package’s internal floating-point format.
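The packed-vs-unpacked distinction is generic IEEE-754 behavior, not specific to this package. As an illustration in Python, splitting a packed double into the separate fields an unpacked representation keeps (the field layout here is standard IEEE-754 binary64; the third-party package’s actual internal format will differ in detail):

```python
import struct

def unpack_double(x: float):
    """Split an IEEE-754 binary64 value into its sign, biased
    exponent, and fraction fields, the kind of separation an
    unpacked floating-point accumulator keeps internally."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    sign = bits >> 63                    # 1 bit
    exponent = (bits >> 52) & 0x7FF      # 11 bits, biased by 1023
    fraction = bits & ((1 << 52) - 1)    # 52 bits, implicit leading 1
    return sign, exponent, fraction
```

Packing saves only the few bytes the separate fields would occupy, at the cost of shifting and masking on every transition, which is exactly the trade-off described above.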

The package had routines to transfer floating-point values internally that could be modified to move data from my Forth stack, but they weren’t as efficient as a custom function, so I wrote my own.

Move Floating-point Accumulator to Top of Floating-point Stack

fac2fps:
        phx

        lda #.loword(FPACC)    ; 3
        sta F1                 ; 4
        lda #^FPACC            ; 4
        sta F1+2               ; 4
                
        ldx FPSP               ; 4
        ldy #19                ; 3
@loop:                         ; loop 10 times
        dey                    ; 2
        lda [F1],y             ; 7 
        dex                    ; 2
        dex                    ; 2
        sta TOS,x              ; 5
        dey                    ; 2
        bpl @loop              ; 3/2 loop: 23*10-1=229
        stx FPSP               ; 4
                               ; 229+26=255

        plx
        rts

Move Top of Floating-point Stack to Floating-point Accumulator

fps2fac:
        phx

        lda #.loword(FPACC)     ; 3
        sta F1                  ; 4
        lda #^FPACC             ; 4
        sta F1+2                ; 4
                                 
        ldx FPSP                ; 4
        ldy #0                  ; 3
@loop:                          ; loop 10 times
        lda TOS,x               ; 5
        inx                     ; 2
        inx                     ; 2
        sta [F1],y              ; 7
        iny                     ; 2
        iny                     ; 2
        cpy #19                 ; 3
        bcc @loop               ; 3/2 loop: 26*10-1=259
        stx FPSP                ; 4
                                ; 259+26=285
        plx
        rts

These are better than the floating-point package’s own move routines, which gain some efficiency with direct page addressing but are slowed by error checking. But we can do better than either with the MVN and MVP instructions.

Move Floating-point Accumulator to Top of Floating-point Stack

fac2fps:
        phx

        phb                     ; 3
        lda #19                 ; 3
        ldy FPSP                ; 4
        dey                     ; 2
        ldx #.loword(FPACC)+19  ; 3
        mvp #0,#0               ; 7*20
        iny                     ; 2
        sty FPSP                ; 4
        plb                     ; 4
                                ; 165
        plx
        rts

Move Top of Floating-point Stack to Floating-point Accumulator

fps2fac:
        phx

        phb                     ; 3
        lda #19                 ; 3
        ldx FPSP                ; 4
        ldy #.loword(FPACC)     ; 3
        mvn #0,#0               ; 7*20
        stx FPSP                ; 4
        plb                     ; 4
                                ; 161
        plx
        rts

These are 35% and 43% faster than the brute force versions above. Note that we have to preserve the data bank register here because the block move instructions leave it set to the destination bank. You could save a few cycles if this weren't an issue.
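For instance, if the data bank register is already 0, or callers don't care that it ends up as 0, the save/restore can be dropped entirely. A sketch of fps2fac under that assumption, using the same FPSP, FPACC and TOS definitions as above:

```
fps2fac:
        phx
        lda #19                 ; 3  count-1: move 20 bytes
        ldx FPSP                ; 4  source: top of floating-point stack
        ldy #.loword(FPACC)     ; 3  destination: FPACC
        mvn #0,#0               ; 7*20  note: leaves the DBR set to 0
        stx FPSP                ; 4  X ends just past the source
                                ; 154
        plx
        rts
```

That trims the phb/plb pair for 154 cycles versus 161, at the cost of the routine clobbering the data bank register.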

Adding a compare instruction to improve loop efficiency

We could improve the brute force move routines above using the absolute indexed Y address mode instead of the direct page indirect long indexed Y mode. This has the limitation that the data bank register has to be Bank 0.

Move Top of Floating-point Stack to Floating-point Accumulator

fps2fac:
        phx

        ldx FPSP               ; 4
        ldy #.loword(FPACC)    ; 3
@loop:                         ; loop 10 times
        lda TOS,x              ; 5
        inx                    ; 2
        inx                    ; 2
        sta 0,y                ; 6
        iny                    ; 2
        iny                    ; 2
        cpy #.loword(FPACC)+19 ; 3
        bcc @loop              ; 3/2 loop: 25*10-1=249
        stx FPSP               ; 4
                               ; 249+7+4=260

        plx
        rts

Using the absolute indexed Y address mode saves 10 cycles in the loop (sta abs,Y takes 6 cycles in 16-bit mode versus 7 for sta [F1],Y) and 14 cycles by eliminating the F1 setup, for a savings of 24 cycles overall. If the 65816 supported sta with its direct page indexed Y address mode we could shave another 10 cycles off, but alas it doesn't. (I often forget which instructions it supports in that mode and have to either debug or scrutinize a listing to figure out the problem).
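As a reminder (easy to verify in the Eyes and Lichty instruction tables), the only instructions with a direct page indexed Y operand are LDX and STX; everything else indexed by Y has to use absolute or indirect modes. A hypothetical snippet, reusing F1 from above, with 16-bit cycle counts:

```
        ldx F1,y               ; 5  ldx dp,y is supported
        stx F1,y               ; 5  stx dp,y is supported
        sta a:0,y              ; 6  no sta dp,y exists; absolute,Y (a: forces
                               ;    absolute addressing in ca65) is the fallback
```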

But all isn't well with the code above: it assumes that the data bank register is 0. It isn't in my code, so the register has to be changed and restored, with overhead that consumes the meager savings we got above. For example, one way to accommodate this is:

Move Top of Floating-point Stack to Floating-point Accumulator

fps2fac:
        phx

        phb
        lda #0
        pha
        plb
        plb

        ldx FPSP               ; 4
        ldy #.loword(FPACC)    ; 3
@loop:                         ; loop 10 times
        lda TOS,x              ; 5
        inx                    ; 2
        inx                    ; 2
        sta 0,y                ; 6
        iny                    ; 2
        iny                    ; 2
        cpy #.loword(FPACC)+19 ; 3
        bcc @loop              ; 3/2 loop: 25*10-1=249
        stx FPSP               ; 4
                               ; 249+7+4=260
        plb
        plx
        rts

adding 22 cycles, which consumes our savings, so we're better off just sticking with the original in this case (assuming the block move instructions didn't work for us; otherwise we'd stick with them instead).

Using MVN/MVP with self-modifying code

to come

Other New 65816 Instructions

Lots more to come

  • TCS vs TXS
  • TXY TYX register saves for alt address modes
  • XBA – output 2 char
  • temporarily save a byte in 16-bit mode phb ?
  • MVN MVP
  • PEA PEI PER