How fast can a 6502 transfer memory
By Michael Doornbos
The amazing Gregorio Naçu posted the article title graphic this week to bring attention to the venerable 6502 processor and poke fun at Apple’s M2 chip marketing slides. He’s doing probably the most ambitious single-person Commodore 64 project I know of and has a fantastic blog.
Apple claims the new M2 chip has the following specs.
M2 features (image by Apple via YouTube)
We all know that these numbers are probably a little fluffy. Maybe a lot fluffy, and in practical applications, they are probably pretty far off. Benchmarking in a lab is fine, but the numbers rarely reflect real-world performance.
Tom’s Hardware does an excellent breakdown of this new chip. It does look pretty neato!
How fast is a 6502?
After Gregorio posted this image earlier this week, it sparked a fair amount of discussion on the interwebs about the memory transfer speed of a 6502 processor.
The 6502 on Commodore machines shares the clock with the video chip. Since dual-ported RAM wasn’t financially feasible at the time, they chose a memory access trick that allowed both the video chip and processor to access memory during a single clock cycle, each on its own half of the cycle. I think it’s the same on most Commodores, but on the VIC-20, the processor accesses the memory on the low part of the signal and the VIC chip on the high part. Maybe that’s backward… anyhoo, you get the point.
VIC-20 PAL clock signal from the 6561
Memory at 1MB per second
Going back to the slide, this 1 MB per second memory bandwidth is what folks are questioning.
On every clock cycle, the 6502 accesses memory somewhere… the stack, zero page, the instruction stream, other memory locations, etc. So at 1 MHz, typical for Commodore machines, this 1MB per second bandwidth is probably accurate in a vacuum, where marketing people hang out.
Image by Gregorio Naçu at c64os.com
It’s important to note that Gregorio Naçu’s slide was a parody and not intended to be a hard-numbers-accurate kind of thing. Please remember that because if you don’t, the rest of this discussion will ruffle your feathers.
Testing real-world block transfers
We’ll try some memory transfers to get an idea of what actual transfer speeds might look like using standard Commodore hardware. Other 6502-based platforms might be faster or slower, so I encourage you to try some tests of your own, and please let me know what you find.
We’re going for average user experience, NOT “how fast can this processor perform in a lab.”
Think Suzie, the tech writer, opening a document on her computer. That’s more what we’re going for with these tests.
Again, remember that transferring memory takes more clock cycles than just reading or writing…
The Commodore 64 Version
Let’s give this a go on the most popular 6502-based system of all time, the Commodore 64. Everyone has a heads-up display for their Commodore 64 these days.
The transfer
We’ll take a cue right from the venerable Rodnay Zaks. Incidentally, Robin did a long video fixing this book’s implementation bug. I’ll be using the revised version as I think it’s a well-established example of doing a real-world block transfer. Sure, there may be faster ways, but this is a realistic way, which is what we’re going for. You can read this excellent chapter on how this works, and Robin’s video goes into it in great detail. Here’s what we’re going to do:
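source  = $0800         ; first byte of the block to copy
dest    = $4800         ; where the copy lands
len     = $4000         ; number of bytes to copy (16K)
from    = $fb           ; zero-page pointer to the source
to      = $fd           ; zero-page pointer to the destination
tmpx    = $a6           ; scratch byte, used later as a repeat counter

copyr
        .block
        lda #<source    ; point from at source
        sta from
        lda #>source
        sta from+1
        lda #<dest      ; point to at dest
        sta to
        lda #>dest
        sta to+1
        ldy #0
        ldx #>len       ; x = number of full 256-byte pages
        beq remain      ; less than one page: skip the page loop
next    lda (from),y    ; copy one byte
        sta (to),y
        iny
        bne next        ; until y wraps after 256 bytes
        inc from+1      ; step both pointers to the next page
        inc to+1
        dex
        bne next        ; more full pages to go
remain  ldx #<len       ; x = leftover bytes in the last partial page
        beq done
nextr   lda (from),y
        sta (to),y
        iny
        dex
        bne nextr
done    rts
        .bend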
We can count jiffies on a Commodore to give us an idea of how long this copy takes. Sure, there’s a slight overhead in the setup, but I think it’s marginal enough that we can ignore it for our purposes.
$12 (18) jiffies
Okay, that’s pretty fast. A jiffy is 1/60 of a second, so 16k transferred in 18 jiffies works out to about 54.6k per second.
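As a rough check of that arithmetic, here is a minimal Python sketch. It assumes the jiffy clock ticks 60 times per second and that "k" means 1,000 bytes; the function name `bytes_per_second` is made up for this example.

```python
JIFFIES_PER_SECOND = 60  # assumption: the jiffy clock ticks 60 times a second

def bytes_per_second(total_bytes, jiffies):
    """Average throughput for a copy that took `jiffies` clock ticks."""
    seconds = jiffies / JIFFIES_PER_SECOND
    return total_bytes / seconds

rate = bytes_per_second(0x4000, 0x12)   # 16,384 bytes in $12 (18) jiffies
print(f"{rate / 1000:.1f} k per second")  # prints "54.6 k per second"
```

With only 18 ticks on the clock, a single jiffy either way moves the result by about 5%, which is a good reason to repeat the copy many times.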
Let’s do a bunch of them and see what it comes out as.
We can call this pretty quickly 255 times and do the same math.
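        lda #$00        ; reset the jiffy clock ($a0-$a2, high byte first)
        sta $a2
        sta $a1
        sta $a0
        ldx #255        ; run the copy 255 times
        stx tmpx
lp
        jsr copyr       ; copyr overwrites x, so the count lives in tmpx
        dec tmpx
        ldx tmpx
        bne lp
        lda $a0         ; print the elapsed jiffies, high byte first
        jsr printbyte
        lda $a1
        jsr printbyte
        lda $a2
        jsr printbyte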
$1128 (4392)
So at $1128 (4392) jiffies and 255 transfers of 16,384 bytes, we’re seeing around 57K per second.
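For anyone who wants to redo this with their own readings, the three printed bytes can be turned back into a rate like this. It is only a sketch: the helper name `jiffies_from_bytes` is hypothetical, the top byte is assumed to have been $00 (the screenshot only shows $1128), and 60 jiffies per second is assumed.

```python
def jiffies_from_bytes(hi, mid, lo):
    """Combine the three jiffy clock bytes ($a0, $a1, $a2) into one count."""
    return (hi << 16) | (mid << 8) | lo

jiffies = jiffies_from_bytes(0x00, 0x11, 0x28)  # assumption: top byte was $00
total_bytes = 255 * 16384                       # 255 copies of 16K
seconds = jiffies / 60                          # assumption: 60 jiffies per second
print(jiffies, round(total_bytes / seconds))    # prints "4392 57075"
```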
Grain of salt, yes, but real-world enough.
Yeah, there’s some overhead in the setup and running of the transfer. We could probably make this loop a few percentage points faster. Maybe if we make it tight, we could get 15% better out of it. But the point was real-world uses, and this is a pretty good example of a tight but flexible loop to transfer. Let’s not get TOO pedantic here.
What’s important to note is that transferring memory takes several clock cycles per byte. If we count them, the inner loop comes to 16 cycles per byte (5 for the lda, 6 for the sta, 2 for the iny, and 3 for the taken bne), which tracks roughly with our results.
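A hand count of the copy loop can be written down as a small Python model. This is a sketch under stated assumptions: the cycle counts are the standard NMOS 6502 ones, the source is page-aligned so the indexed load never pays a page-crossing penalty, the branches stay within one page, and the clock is an idealized 1 MHz with nothing else stealing cycles (no badlines, no interrupts). Setup and the final rts are left out.

```python
# Cycle counts from the standard NMOS 6502 instruction tables.
LDA_IND_Y = 5      # lda (from),y, no page crossing because the source is page-aligned
STA_IND_Y = 6      # sta (to),y always takes 6
INY = 2
DEX = 2
INC_ZP = 5         # inc from+1 / inc to+1
BNE_TAKEN = 3      # assumption: the branch target is in the same page
BNE_NOT_TAKEN = 2

def cycles_per_byte():
    """Cost of one trip through the inner loop (lda, sta, iny, bne taken)."""
    return LDA_IND_Y + STA_IND_Y + INY + BNE_TAKEN

def cycles_for_pages(pages):
    """Cycles spent copying `pages` full 256-byte pages, setup and rts excluded."""
    saved = BNE_TAKEN - BNE_NOT_TAKEN            # a branch that falls through is cheaper
    inner = 256 * cycles_per_byte() - saved      # 256 bytes, last inner bne falls through
    outer = 2 * INC_ZP + DEX + BNE_TAKEN         # bump both pointers, count the page
    return pages * (inner + outer) - saved       # final outer bne falls through

cycles = cycles_for_pages(0x4000 // 256)
clock_hz = 1_000_000                             # assumption: idealized 1 MHz clock
print(cycles_per_byte(), cycles, round(0x4000 * clock_hz / cycles))
```

On a real Commodore 64 the VIC-II and the jiffy interrupt take some of those cycles away, so a measured figure somewhat below this estimate is what you would expect.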
KIM-1 version
The KIM-1 is arguably the most simple and pure 6502 platform, so it will be interesting to try and do memory transfers on it.
It IS clocked a little slower than a Commodore 64, so I expect it to transfer slightly slower. But it doesn’t have to compete with VIC-II “badlines” for memory access, so maybe it’ll be pretty close.
Let’s find out.
I don’t own a “real” KIM-1, but I do own what are considered the two best clones. Today, let’s use the Corsham KIM-1 Clone. I’m going to call it a KIM-1 from here forward, mostly because I enjoy getting angry letters about this. You’ve been warned.
Measuring time
The KIM-1 doesn’t have a jiffy clock like the other Commodore machines.
The “Application ports” are easily accessible, so if we set a pin high when we start and set it low again when we finish, we can easily use an oscilloscope to measure the time.
With the expansion bus hooked up on my Corsham KIM board, the Application port A direction is set to output with:
        lda #$ff
        sta $1603
Set all ports to output
And then, we can toggle pin PA0 by setting it high or low. We’ll use $FF and $0 for that for simplicity.
*Side note: this is a non-standard location for this port; your KIM-1 or clone probably has it in the $1700 range. Check your documentation.*
16k in 262 milliseconds is around 62.5k per second. That’s slightly faster than the Commodore 64, even though an NTSC Commodore 64 runs at a slightly higher clock speed (1.023 MHz) than our KIM here.
Let’s do this 255 times in a tight loop, ignoring the overhead of things like JSR, which takes a few clock cycles each loop. We’re going for a ballpark here.
So our loop code then looks something like:
        lda #$ff
        sta $1603       ; port direction: all pins output
        sta $1601       ; technically setting all pins high here
                        ; could just use #$01
        ldx #255
        stx tmpx
lp
        jsr copyr
        dec tmpx
        ldx tmpx
        bne lp
        lda #$00
        sta $1601       ; pins low again: end of the timed window
        brk
Then if we probe it with an oscilloscope, we can measure the 1+ minute square wave.
So 255 transfers of 16,384 bytes take 67 seconds, or about 62k per second.
One more for fun, how about a 2021 6502 processor clocked at 8 MHz?
I happen to have a Cerberus 2080 board. As far as I know, mine is the only green one in the world.
This has dual-ported RAM and can clock the brand new (yes, they still make them) WDC W65C02S processor at a blazing 8 MHz. Let’s see what kind of results we get from it.
Again, we have the no-jiffy-clock problem, so I’m going to skip right to the 4MB transfer, have it show “done” on the screen when it finishes, and time it from the video capture. Unlike the KIM-1, I don’t have a straightforward way to time it with an I/O pin. It’ll give us a good enough idea of where we are.
About 6.29 seconds
16,384 bytes 255 times took 6.29 seconds, so maxed out, a modern 6502 at 8 MHz can do about 664.2k per second. Not too bad!
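One way to put the three measurements on a common footing is to turn each into clock cycles elapsed per byte copied. The sketch below does that in Python. The clock speeds are assumptions, not measurements: 1.023 MHz for the Commodore 64 (the NTSC figure; a PAL machine is slower), a nominal 1 MHz for the KIM-1 clone, and 8 MHz for the Cerberus. The function name is made up for this example.

```python
def implied_cycles_per_byte(clock_hz, total_bytes, seconds):
    """Clock cycles that elapsed per byte copied, given a wall-clock measurement."""
    return clock_hz * seconds / total_bytes

total = 255 * 16384  # every long test copied 16K 255 times

# (assumed clock in Hz, measured seconds) for each platform
runs = {
    "C64 (jiffy clock)": (1_023_000, 4392 / 60),
    "KIM-1 clone (scope)": (1_000_000, 67.0),
    "Cerberus 2080 (video)": (8_000_000, 6.29),
}
for name, (clock_hz, seconds) in runs.items():
    cpb = implied_cycles_per_byte(clock_hz, total, seconds)
    print(f"{name}: {cpb:.1f} cycles per byte")
```

The results are only as good as the timing method and the assumed clock, so treat small differences between platforms with the same grain of salt as the rest of these numbers.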
Thoughts
Sure, this was not a comprehensive set of tests. But in the real world, a 6502 can copy the entire contents of a Commodore 64’s memory from one place to another in about a second. Pretty respectable, and it was pretty fast for the time.
Unrolling
You could certainly use self-modifying code and unroll this copy routine to get better performance, at the price of flexibility and, arguably, readability for the average casual 6502 assembly coder.
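To put a rough number on that trade-off, here is a small cost model in Python. It assumes one common unrolling pattern, which is not something the tests in this article used: absolute,X addressing with the page addresses written into the code, several lda/sta pairs per pass (one per 256-byte page), and a single inx/bne shared by all of them. Cycle counts are the standard NMOS 6502 ones with page-aligned buffers, and the clock is an idealized 1 MHz.

```python
LDA_ABS_X = 4   # lda source+k*256,x with no page crossing
STA_ABS_X = 5   # sta dest+k*256,x always takes 5
INX = 2
BNE_TAKEN = 3   # assumption: branch target in the same page

def unrolled_cycles_per_byte(pairs):
    """Approximate cycles per byte when `pairs` lda/sta pairs share one inx/bne."""
    return LDA_ABS_X + STA_ABS_X + (INX + BNE_TAKEN) / pairs

# Each pair is 6 bytes of code, so a single bne can only reach back over
# roughly 20 pairs; larger unrolls need a jmp and cost a little more.
for pairs in (1, 4, 16):
    cpb = unrolled_cycles_per_byte(pairs)
    print(f"{pairs:2d} pairs: {cpb:.2f} cycles per byte, "
          f"about {1_000_000 / cpb / 1000:.0f} k per second at 1 MHz")
```

The gain flattens out quickly: most of the cost is the load and store themselves, which no amount of unrolling removes.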
Again, this was not a “how fast can we absolutely make it” but an everyday use examination.
This copy can handle from one to 65,535 (2^16 - 1) bytes and every number in between. And as my favorite YouTuber is fond of saying, “I know, I know, but I didn’t do that. Let the angry emails begin.”
REU
If you have an REU on your Commodore, it can theoretically swap out the memory at a byte per clock cycle. A true 1MB per second. I heard that games like Sam’s Journey make use of this feature quite a bit.
Sam’s Journey First Level
I’d love to hear your thoughts on how you’d approach this, pedantic, nit-picky, and otherwise. Bonus points if you demonstrate methods that show dramatically better results.
Whatever you do, be sure to have fun and don’t take marketing slides too seriously.