65816 Emulator: Threading in Python Very Slow

I haven’t talked much about my 65816 emulator, but it’s been essential in porting my 6502 Forth over to my new 65816 build. The slowdown I mentioned in my post “6502: Coding – Just because it works doesn’t mean it’s correct” was becoming annoying, though. Starting up my Forth operating system in the emulator took about two and a half minutes, significantly longer than before I moved ACIA interrupt handling to a separate thread. I’ve lived with it for a while because I could often bypass the startup routine during testing, but now that I’ve largely completed the port, I needed to find a solution.

Now, CPython’s global interpreter lock lets only one thread execute Python code at a time, so threads take turns rather than running in parallel. Unless a thread yields, the others mostly wait for the interpreter to switch threads on its own, and execution can take longer than you’d expect. I used the Python sleep function in my threads to yield at strategic points, but startup time still increased greatly. Come on, I didn’t see why loading a couple of kilobytes into a buffer (and processing it, of course) should take so long.

Threads are best suited to I/O-type tasks, where there are natural places to yield to other threads. I used one successfully for keyboard input (handled by simulating a VIA shift register interrupt), and it seemed like a good solution for handling serial input (handled by simulating an ACIA receiver data register full interrupt).

Turns out that wasn’t the case. When I moved the ACIA handling to a separate thread, I chalked the slowdown up to the inefficiency of running many threads on my rather slow laptop. After all, things did run better on my more powerful desktop. However, the real issue was that the ACIA interrupt handling was more akin to a computational task than I/O.

The ACIA receiver data register full interrupt indicates to the processor that a character is ready. In hardware, these characters are coming in over a serial connection, significantly slower than what the processor can handle. In the emulator, all of the data is available before the interrupts are simulated and thus available to the “processor” as fast as it can handle it. I took advantage of this previously by simulating successive interrupts to transfer a 1 kB block before “pausing” to allow for processing. However, this approach had the problem that I discussed in the post above.

Computational tasks are better handled in a separate process than a separate thread. But a separate process was overkill here. I just needed a way to kick off an interrupt at a frequency that allowed the system to process it. Doing away with the thread and putting in a delay between interrupts was a simpler solution. Tuning the delay for the fastest startup time, I reduced the startup time of my Forth system in the emulator to 12 seconds. An amazing improvement from before.
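As a rough illustration of a delay between interrupts, here is a minimal sketch. It is not the emulator's actual code: the names AciaFeeder, cycles_between_irqs, tick and run are hypothetical, and the sketch assumes the delay is counted in emulated clock cycles. The feeder holds all of the input up front and raises the next receive interrupt only once enough cycles have elapsed since the previous one.

```python
from collections import deque


class AciaFeeder:
    """Hypothetical stand-in for an emulated ACIA fed from preloaded data."""

    def __init__(self, data: bytes, cycles_between_irqs: int):
        self.pending = deque(data)                      # all input is available up front
        self.cycles_between_irqs = cycles_between_irqs  # the tunable delay
        self.cycles_since_irq = 0
        self.received = []                              # bytes the "CPU" has accepted

    def tick(self, cycles: int) -> bool:
        """Advance by the cycles of one instruction; return True if an IRQ fired."""
        self.cycles_since_irq += cycles
        if self.pending and self.cycles_since_irq >= self.cycles_between_irqs:
            self.cycles_since_irq = 0
            # Simulate the receiver-data-register-full interrupt for one byte.
            self.received.append(self.pending.popleft())
            return True
        return False


def run(feeder: AciaFeeder, cycles_per_instruction: int = 4) -> int:
    """Run until all input is delivered; return the total emulated cycles."""
    total = 0
    while feeder.pending:
        feeder.tick(cycles_per_instruction)
        total += cycles_per_instruction
    return total
```

Tuning then amounts to lowering cycles_between_irqs until the emulated system can no longer keep up, and backing off from there. No thread is involved, so nothing has to yield.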

What About the VIA?

With this success, I decided to try it for the VIA thread as well. This thread works fine for keyboard input, which is slow after all. However, like the ACIA thread, it was slow for pasted input, which is essentially available all at once, or as fast as the system can handle it. I use pasted input often for testing purposes and I’m doing a lot more of that now that my port is fairly complete.

Replacing the VIA thread with a delay between interrupts proved problematic, though. Unlike the fixed-size ACIA-based input, the VIA input uses a circular buffer. The delay must be long enough to prevent the buffer from overflowing, but not so long as to slow down the pasted input excessively. In hardware, the 65xx is fast enough to prevent buffer overflow.

The problem with pasted input, though, is that the system has to interpret each line before proceeding to the next, and the time (clock cycles) required to do so depends on the specific input. Since the input is interrupt driven, if the delay isn’t long enough, additional input can be accepted into the circular buffer while a complicated line is still being processed. It doesn’t take much of this before the circular buffer overflows (note that my VIA routine doesn’t support hardware flow control).
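The overflow can be shown with a small simulation. This is only a sketch under simplifying assumptions: RingBuffer and paste are hypothetical names, the buffer size is arbitrary, and a fixed cycles_per_byte stands in for the variable cost of interpreting a line. An interrupt fires every irq_delay cycles whether or not the consumer has caught up, and bytes arriving at a full buffer are simply lost.

```python
class RingBuffer:
    """Hypothetical fixed-size circular buffer with no flow control."""

    def __init__(self, size: int):
        self.buf = [0] * size
        self.size = size
        self.head = 0      # next slot the interrupt handler writes
        self.tail = 0      # next slot the interpreter reads
        self.count = 0     # bytes currently stored
        self.dropped = 0   # bytes lost because the buffer was full

    def put(self, byte: int) -> bool:
        if self.count == self.size:
            self.dropped += 1
            return False
        self.buf[self.head] = byte
        self.head = (self.head + 1) % self.size
        self.count += 1
        return True

    def get(self):
        if self.count == 0:
            return None
        byte = self.buf[self.tail]
        self.tail = (self.tail + 1) % self.size
        self.count -= 1
        return byte


def paste(text: bytes, size: int, irq_delay: int, cycles_per_byte: int) -> int:
    """Simulate pasted input cycle by cycle; return the number of dropped bytes."""
    ring = RingBuffer(size)
    pending = list(text)
    work_left = 0          # cycles remaining on the byte being interpreted
    cycle = 0
    while pending or ring.count or work_left:
        if pending and cycle % irq_delay == 0:
            ring.put(pending.pop(0))   # the interrupt fires regardless of buffer state
        if work_left == 0 and ring.count:
            ring.get()                 # interpreter picks up the next byte
            work_left = cycles_per_byte
        if work_left:
            work_left -= 1
        cycle += 1
    return ring.dropped
```

With a 4-byte buffer, paste(bytes(20), 4, irq_delay=10, cycles_per_byte=5) loses nothing, while shortening the delay to irq_delay=2 with cycles_per_byte=10 drops bytes, which is the failure mode described above.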

However, increasing the delay decreased keyboard and paste responsiveness. Interestingly, it also increased startup time, as the delay affects each interrupt. That is, the VIA delay would affect the ACIA delay, even though keyboard input isn’t expected during startup. I needed a better way to signal that the system could handle another interrupt, or in essence, that it was ready for more input.

The WAI Instruction to the Rescue

Put that way, the solution was clear. In my port to the 65816 I started using the wait for interrupt (WAI) instruction to pause the processor when the system was waiting for keyboard input. The emulator sets a flag on executing this instruction and basically “waits” for an interrupt. I could use this flag to indicate when the system was ready for more data. This worked perfectly. It gives a reasonable pasted input experience and only slightly delays startup.
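A minimal sketch of the idea, assuming a heavily reduced CPU model. The names Cpu, waiting, execute_wai, interrupt and deliver_when_waiting are hypothetical and not taken from the emulator. Input is delivered only while the processor is parked on WAI, so the emulated system itself signals when it is ready for the next byte.

```python
from collections import deque


class Cpu:
    """Hypothetical, heavily reduced CPU model: only the WAI behaviour is shown."""

    def __init__(self):
        self.waiting = False   # set when WAI executes, cleared by an interrupt
        self.handled = []      # bytes passed to the interrupt handler

    def execute_wai(self):
        """WAI: stop executing instructions until an interrupt arrives."""
        self.waiting = True

    def interrupt(self, byte: int):
        """An interrupt wakes the CPU and hands one byte to the handler."""
        self.waiting = False
        self.handled.append(byte)


def deliver_when_waiting(cpu: Cpu, pending: deque) -> bool:
    """Deliver at most one queued byte, and only if the CPU is parked on WAI."""
    if cpu.waiting and pending:
        cpu.interrupt(pending.popleft())
        return True
    return False
```

Calling deliver_when_waiting once per pass through the emulator's main loop replaces the tuned delay: while the system is busy interpreting a line the flag is clear and nothing is delivered, so the circular buffer cannot be overrun by pasted input.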

I decided to do the same for the ACIA, though due to the block nature of the input in my use case, this wasn’t really necessary. It does make for cleaner code, though.

Using the waiting state as a delay isn’t the fastest method for responding to input, as it doesn’t take full advantage of the internal buffers. But it is reasonably efficient. Startup with a tuned cycle count delay was about 12 seconds. It’s about 17 seconds using the waiting flag. The difference is the time spent processing the WAI instructions, which are forced with this method but never needed with a cycle count delay because at startup the input is already buffered (i.e., available without waiting). I can live with the slight increase in startup time. Shoot, it’s nothing after having to wait two and a half minutes before.

PS

With a bit of tweaking of my Python code, I got my startup time down to 13 seconds. I now seem to be constrained by the speed of the Python I/O (and/or the granularity of the Python sleep function, which I use to reduce system churn while waiting for input) because, even with significant refinement of my startup code, I can only shave an additional second off my startup time in the emulator. I’ll have to try this on my hardware build, but as that is also I/O constrained, I might not be able to get much more improvement in startup speed.
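One way to check how coarse the sleep function is on a given machine is to time it directly. This is a generic sketch, not part of the emulator. The helper name measure_sleep is hypothetical, and the requested durations are arbitrary examples.

```python
import time


def measure_sleep(requested: float, samples: int = 20) -> float:
    """Return the average wall-clock time one time.sleep(requested) call takes."""
    start = time.perf_counter()
    for _ in range(samples):
        time.sleep(requested)
    return (time.perf_counter() - start) / samples


if __name__ == "__main__":
    for requested in (0.0001, 0.001, 0.01):
        actual = measure_sleep(requested)
        print(f"requested {requested * 1000:.1f} ms, got {actual * 1000:.3f} ms")
```

If the measured time for the shortest request is much larger than the request itself, each sleep in a polling loop costs at least that much, no matter how small the argument.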

PPS

Testing my old and new startup routines on my 65816 build 4 yielded the results I guessed at above. I saw no improvement in startup speed (or at best a fraction of a second) in a startup loop of a couple of instructions versus one with several subroutine calls, multiple register size changes, and moving the bytes between buffers. Interestingly, the startup time was about 14 seconds with a 1 MHz clock, almost the same as on the emulator. Clearly, startup I/O (fetching about 4 kB of text from an SD card and processing it) is taking the majority of the time, and the time spent changing register sizes (even multiple times per byte fetch) or moving the bytes between buffers was immaterial. I’ll probably stick with the longer code as it provides more expansion possibilities.