21: The data store

It's time to talk about the data store again.

What we know so far:

  • 8kB
  • treated as 2D array, with 0x10, 0x20, 0x40, or 0x80 byte stride
  • depending on the stride, the low 4 bits of address are mangled
  • there are three kinds of load/store instructions: 16 bytes horizontal (to $v), 16 bytes vertical (to $v), and 4 bytes horizontal (to $r)
  • each kind of load/store instruction is also available in three addressing modes:
    • register + register, postincrement
    • register + immediate, postincrement
    • register + immediate
  • can do DMA to/from main memory
  • there are 3 more unknown address instructions modifying $v, 2 of them also affecting $a. That would match another access mode for load: two postincrement ones and one non-postincrement. My personal bet would be on a 4x4 square load - perfectly suited for MC.

First, let's think about the vertical access feature for a bit. The horizontal part is rather easy to do in hardware: just make a 128-bit wide RAM and address it by the high 9 bits of the byte address. However, this naïve approach is no good for vertical access: all 16 bytes accessed would sit on the same 8 bits of the bus, requiring 16 cycles to complete.

An alternative is using 16 8-bit wide RAMs (let's call them banks), and carefully assigning the data store addresses to banks such that all 16 bytes of a horizontal or vertical access always cover all banks. That explains why we're seeing low address bit mangling between the strides: presumably they have different bank assignment formulas. So, bits 4-12 of byte address select address inside the bank, and the bank is determined in some way from the whole address and the selected stride.

There is one problem with that explanation: it's quite easy to come up with a single mangling scheme that works for horizontal and vertical accesses at all strides (bank = address bits 0-3 XOR bits 4-7 XOR bits 8-10), so those two access kinds alone don't justify stride-dependent formulas. That's strong evidence for an unknown access mode.
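As a sanity check on that claim, here is a small Python sketch that enumerates accesses under the XOR scheme and counts the banks each one hits. It assumes horizontal accesses are 16-byte aligned and vertical accesses start on a 16-row boundary for the given stride (nothing here says how unaligned columns behave), and all helper names are made up for illustration.

```python
STRIDES = (0x10, 0x20, 0x40, 0x80)
STORE_SIZE = 0x2000  # 8kB data store


def xor_bank(addr):
    """Hypothetical bank assignment: bits 0-3 XOR bits 4-7 XOR bits 8-10."""
    return (addr ^ (addr >> 4) ^ ((addr >> 8) & 7)) & 0xf


def horizontal(base):
    """16 consecutive bytes starting at a 16-byte aligned address."""
    return [base + i for i in range(16)]


def vertical(base, stride):
    """16 bytes, one per row, `stride` bytes apart."""
    return [base + i * stride for i in range(16)]


def covers_all_banks(addrs, bank):
    """True if the 16 addresses land in 16 different banks."""
    return len({bank(a) for a in addrs}) == 16


def xor_scheme_ok():
    """Check every aligned horizontal and vertical access at every stride."""
    for base in range(0, STORE_SIZE, 0x10):
        if not covers_all_banks(horizontal(base), xor_bank):
            return False
    for stride in STRIDES:
        # Assumption: a column starts at the top of a 16-row block.
        for block in range(0, STORE_SIZE, 16 * stride):
            for col in range(stride):
                if not covers_all_banks(vertical(block + col, stride), xor_bank):
                    return False
    return True


if __name__ == "__main__":
    print("XOR scheme spreads all accesses:", xor_scheme_ok())
```

For contrast, the naïve assignment (bank = low 4 bits of the address) puts a whole vertical access into a single bank, which is the 16-cycle case described earlier.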

Let's check it out. First, we'll fill each byte in the data store with low 8 bits of its own address and try reading every single address with the unknown ops. Then, repeat it with the high bits.

#!/usr/bin/env python3

import nvapy

c = nvapy.cards[2]

b0 = c.bar0

stride = 0

# True: fill each byte with the low 8 bits of its address.
# False: fill each byte with the high bits of its address.
fill_low = True

if fill_low:
    for x in range(0, 0x2000, 0x10):
        # 16 bytes of data: low nibble counts 0..f, high nibble is bits 4-7 of x.
        b0.wr32(0xf000, 0x03020100 | ((x & 0xf0) * 0x01010101))
        b0.wr32(0xf080, 0x07060504 | ((x & 0xf0) * 0x01010101))
        b0.wr32(0xf100, 0x0b0a0908 | ((x & 0xf0) * 0x01010101))
        b0.wr32(0xf180, 0x0f0e0d0c | ((x & 0xf0) * 0x01010101))
        # Address plus stride selector, then run opcode 0xdc to store the data.
        b0.wr32(0xf600, x | stride << 30)
        b0.wr32(0xf448, 0xdc000000)
        b0.wr32(0xf458, 1)
else:
    for x in range(0, 0x2000, 0x10):
        # 16 bytes of data, all equal to bits 8-12 of x.
        b0.wr32(0xf000, (x >> 8) * 0x01010101)
        b0.wr32(0xf080, (x >> 8) * 0x01010101)
        b0.wr32(0xf100, (x >> 8) * 0x01010101)
        b0.wr32(0xf180, (x >> 8) * 0x01010101)
        b0.wr32(0xf600, x | stride << 30)
        b0.wr32(0xf448, 0xdc000000)
        b0.wr32(0xf458, 1)

# Run each unknown op at every byte address and dump the resulting vector,
# highest word first.
for op in [0xc8, 0xc9, 0xd7]:
    for x in range(0, 0x2000):
        b0.wr32(0xf600, x | stride << 30)
        b0.wr32(0xf604, 0)
        b0.wr32(0xf608, 0)
        b0.wr32(0xf60c, 0)
        b0.wr32(0xf448, op << 24)
        b0.wr32(0xf458, 1)
        print("{:04x} {:08x} {:08x} {:08x} {:08x}".format(x, b0.rd32(0xf180), b0.rd32(0xf100), b0.rd32(0xf080), b0.rd32(0xf000)))

The results are kind of boring for 0xc8 and 0xc9. It seems we haven't managed to hit their "load" behavior at all. On the other hand, the results for 0xd7 are way too interesting:

0000 fffefdfc fbfaf9f8 f7f6f5f4 f3f2f1f0
0001 fffefdfc fbfaf9f8 f7f6f5f4 f3f2f1f0
0002 fffefdfc fbfaf9f8 f7f6f5f4 f3f2f1f0
0003 fffefdfc fbfaf9f8 f7f6f5f4 f3f2f1f0
0004 fffefdfc fbfaf9f8 f7f6f5f4 f3f2f1f0
0005 fffefdfc fbfaf9f8 f7f6f5f4 f3f2f1f0
0006 fffefdfc fbfaf9f8 f7f6f5f4 f3f2f1f0
0007 fffefdfc fbfaf9f8 f7f6f5f4 f3f2f1f0
[...]
1ffc fffefdfc fbfaf9f8 f7f6f5f4 f3f2f1f0
1ffd fffefdfc fbfaf9f8 f7f6f5f4 f3f2f1f0
1ffe fffefdfc fbfaf9f8 f7f6f5f4 f3f2f1f0
1fff fffefdfc fbfaf9f8 f7f6f5f4 f3f2f1f0
0000 fffefdfc fbfaf9f8 f7f6f5f4 f3f2f1f0
0001 fffefdfc fbfaf9f8 f7f6f5f4 f3f2f1f0
0002 fffefdfc fbfaf9f8 f7f6f5f4 f3f2f1f0
0003 fffefdfc fbfaf9f8 f7f6f5f4 f3f2f1f0
0004 fffefdfc fbfaf9f8 f7f6f5f4 f3f2f1f0
0005 fffefdfc fbfaf9f8 f7f6f5f4 f3f2f1f0
0006 fffefdfc fbfaf9f8 f7f6f5f4 f3f2f1f0
0007 fffefdfc fbfaf9f8 f7f6f5f4 f3f2f1f0
[...]
1ffc fffefdfc fbfaf9f8 f7f6f5f4 f3f2f1f0
1ffd fffefdfc fbfaf9f8 f7f6f5f4 f3f2f1f0
1ffe fffefdfc fbfaf9f8 f7f6f5f4 f3f2f1f0
1fff fffefdfc fbfaf9f8 f7f6f5f4 f3f2f1f0
0000 f8e7d7c6 b6a59584 74635342 32211100
0001 8b7b7a69 68585746 45353423 22121100
0002 bab9a898 87867665 55544333 22211100
0003 aa9a8988 78676656 55443433 22121100
0004 aaa99988 87776665 55444333 22211100
0005 aa9a9988 78776656 55443433 22121100
0006 aaa99988 87776665 55444333 22211100
0007 aa9a9988 78776656 55443433 22121100
0008 aaa99988 87776665 55444333 22211100
0009 aa9a9988 78776656 55443433 22121100
000a aaa99988 87776665 55444333 22211100
000b aa9a9988 78776656 55443433 22121100
[...]
000e aaa99988 87776665 55444333 22211100
000f aa9a9988 78776656 55443433 22121100
0010 bab99998 97777675 55545333 32311110
0011 ba9a9998 78777656 55543433 32121110
[...]
001e bab99998 97777675 55545333 32311110
001f ba9a9998 78777656 55543433 32121110
0020 aaa9b8a7 a6776665 74636233 2221302f
0021 aab9a879 68776675 64352433 223120f9
[...]

Quite chaotic, and it drops into stable orbits which are sometimes disturbed... this very likely means that $v0 is an input to this instruction, and not just the output. Clearing it to 0 we get:

0000 0f0e0d0c 0b0a0908 07060504 03020100
0001 0f0e0d0c 0b0a0908 07060504 03020100
0002 0f0e0d0c 0b0a0908 07060504 03020100
0003 0f0e0d0c 0b0a0908 07060504 03020100
0004 0f0e0d0c 0b0a0908 07060504 03020100
0005 0f0e0d0c 0b0a0908 07060504 03020100
0006 0f0e0d0c 0b0a0908 07060504 03020100
0007 0f0e0d0c 0b0a0908 07060504 03020100
[...]
000f 0f0e0d0c 0b0a0908 07060504 03020100
0010 1f1e1d1c 1b1a1918 17161514 13121110
0011 1f1e1d1c 1b1a1918 17161514 13121110
[...]
0020 2e2d2c2b 2a292827 26252423 2221202f
[...]
0030 3e3d3c3b 3a393837 36353433 3231303f
[...]
0040 4d4c4b4a 49484746 45444342 41404f4e
[...]
0050 5d5c5b5a 59585756 55545352 51505f5e
[...]
0060 6c6b6a69 68676665 64636261 606f6e6d
[...]
0070 7c7b7a79 78777675 74737271 707f7e7d
[...]
0080 8b8a8988 87868584 83828180 8f8e8d8c
[...]
0090 9b9a9998 97969594 93929190 9f9e9d9c
[...]
00a0 aaa9a8a7 a6a5a4a3 a2a1a0af aeadacab
[...]
00b0 bab9b8b7 b6b5b4b3 b2b1b0bf bebdbcbb
[...]
00c0 c9c8c7c6 c5c4c3c2 c1c0cfce cdcccbca
[...]
00d0 d9d8d7d6 d5d4d3d2 d1d0dfde dddcdbda
[...]
00e0 e8e7e6e5 e4e3e2e1 e0efeeed ecebeae9
[...]
00f0 f8f7f6f5 f4f3f2f1 f0fffefd fcfbfaf9
[...]
0100 0f0e0d0c 0b0a0908 07060504 03020100
[...]
0110 1f1e1d1c 1b1a1918 17161514 13121110
[...]
0120 2e2d2c2b 2a292827 26252423 2221202f
[...]
0130 3e3d3c3b 3a393837 36353433 3231303f
[...]
0140 4d4c4b4a 49484746 45444342 41404f4e
[...]

Hm. We're seeing some mangling of the low 4 bits. And experimentation with the $v input reveals that each component is shifted left by 4 bits and ORed with the address used to retrieve that particular component!

It seems we found some kind of raw load instruction: the independent addressing behavior can only be done if each component of the vector is tied to the corresponding bank. This also explains the low 4 bits mangling: we're loading unmangled addresses and filling memory through mangled ones.

And as expected, this strange load instruction is unaffected by the stride bits of the address register. Time to model that in hwtest. From what we see above, it's easy to figure out the mangling performed for stride 0x10: bits 5-7 of the address are added to bits 0-3 to get the bank. Curiously, this does *not* result in a good bank spread for vertical accesses: they'll always cover exactly 8 banks. Huh. Anyway, since we know stride 0x10 raw memory mapping and the load/store instructions, we can add the data store to our hwtest model and start testing memory access instructions.
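To make the raw layout concrete, here is a rough Python model of the fill pass and of the 0xd7 load, using the stride 0x10 mangling just described. The per-bank row formula (row taken from `a | (v[i] << 4)`) is my reading of the "shifted left by 4 bits and ORed" observation, and the function names are invented for this sketch; it is not the hwtest code.

```python
ROWS = 0x200  # 8kB split over 16 byte-wide banks


def bank_stride10(addr):
    """Stride 0x10 bank: bits 0-3 plus bits 5-7, modulo 16."""
    return ((addr & 0xf) + ((addr >> 5) & 7)) & 0xf


def fill_low_bits():
    """Model of the fill pass: every byte holds the low 8 bits of its address."""
    banks = [[0] * ROWS for _ in range(16)]
    for addr in range(0x2000):
        # Bits 4-12 pick the row inside the bank.
        banks[bank_stride10(addr)][addr >> 4] = addr & 0xff
    return banks


def raw_load(banks, a, v):
    """Assumed 0xd7 load: bank i is read at the row of (a | v[i] << 4)."""
    return [banks[i][((a | (v[i] << 4)) >> 4) & (ROWS - 1)] for i in range(16)]


def dump(x, v):
    """Format a vector like the script does: four words, highest first."""
    words = [int.from_bytes(bytes(v[i:i + 4]), "little") for i in (12, 8, 4, 0)]
    return "{:04x} {:08x} {:08x} {:08x} {:08x}".format(x, *words)


if __name__ == "__main__":
    banks = fill_low_bits()
    # $v0 cleared to 0: plain raw rows.
    print(dump(0x20, raw_load(banks, 0x20, [0] * 16)))
    # $v0 left over from the fill (f0..ff), then fed back into itself.
    v = raw_load(banks, 0, list(range(0xf0, 0x100)))
    print(dump(0, v))
    print(dump(1, raw_load(banks, 1, v)))
```

With $v0 zeroed, the model gives the same row at 0020 as the dump above, and feeding the leftover fill pattern back in gives the same first two "chaotic" lines, which suggests the feedback through $v0 is all there is to the chaos.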

Testing 0xd7 gets us a surprise: while the instruction does perform as expected when bit 0 is unset, setting it changes it to a store instruction! Oh well. It's quite easy to figure out: the addressing mode is register+register postincrement (with no $c output but with source 2 mangling), and the stored data is simply the $v selected by the src1 bitfield. And the store also uses raw addressing.

And since we now know both raw load and store instructions, we can model it in hwtest in a cleaner way than by using stride 0x10 and undoing the mangling.

There is one more thing about 0xd7: it appears to mess up the vector registers sometimes when used together with the mov to/from other register file instruction. Apparently we're exceeding the number of register read/write ports available on VP1. We'll just disable this combination.

Time to test the non-raw loads and stores. First, we'll need to figure out mangling for the other strides. It's easy enough to do by making another script like above, reading all addresses with those strides. The results:

  • 0x10: bank = (bits 0-3) + (bits 5-7)
  • 0x20: bank = (bits 0-3) + (bits 5-8)
  • 0x40: bank = (bits 0-3) + (bits 6-9)
  • 0x80: bank = (bits 0-3) + (bits 7-10)

The strides other than 0x10 do result in perfect spreading. I see only one explanation here: perhaps reading two consecutive bytes from a bank is free for some reason and they didn't need to spread based on bit 4 of the address. Perhaps the banks are really 16 bits wide (because that's the kind of RAMs they had readily available). I guess we'll never know.
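The spreading claims are easy to check by brute force. The sketch below encodes the four formulas listed above as a (shift, mask) table and counts distinct banks per access; the table layout, the function names, and the wrap of vertical accesses at 13 bits are my own assumptions for illustration.

```python
# stride -> (shift, mask) of the address field added to bits 0-3
BANK_FIELDS = {
    0x10: (5, 0x7),  # bits 5-7
    0x20: (5, 0xf),  # bits 5-8
    0x40: (6, 0xf),  # bits 6-9
    0x80: (7, 0xf),  # bits 7-10
}


def bank(addr, stride):
    """Bank number for a byte address under the given stride, modulo 16."""
    shift, mask = BANK_FIELDS[stride]
    return ((addr & 0xf) + ((addr >> shift) & mask)) & 0xf


def banks_hit_vertical(base, stride):
    """Distinct banks touched by 16 bytes spaced `stride` apart."""
    # Assumption: addresses wrap around at the end of the 8kB store.
    return len({bank((base + i * stride) & 0x1fff, stride) for i in range(16)})


def banks_hit_horizontal(base, stride):
    """Distinct banks touched by 16 consecutive bytes (base 16-byte aligned)."""
    return len({bank(base + i, stride) for i in range(16)})


if __name__ == "__main__":
    for stride in BANK_FIELDS:
        vert = {banks_hit_vertical(b, stride) for b in range(0x2000)}
        horiz = {banks_hit_horizontal(b, stride) for b in range(0, 0x2000, 0x10)}
        print("stride {:#04x}: vertical {} horizontal {}".format(stride, vert, horiz))
```

Under these assumptions every aligned horizontal access hits 16 banks at all strides, while vertical accesses hit 16 banks for strides 0x20, 0x40 and 0x80 but only 8 for stride 0x10, matching the observation in the text.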

With that knowledge, we can implement and test the load/store instructions. For register-register postincrement and register-immediate postincrement addressing, there are no surprises in the load/store behavior itself. However, there's one weird interaction: when $r load is used together with 0x04 or 0x05 scalar ops, the data is mangled. Seems to be another case of running out of $r read ports (0x04 and 0x05 are exceptional in that they read 3 different $r registers). We'll just disallow that in testing.

Register-immediate non-postincrement addressing has a surprise waiting for us, however: the addresses generated are not what we expected, and are in fact quite chaotic. It doesn't seem to make sense...

Since we've seen an OR operation used in the 0xd7 case in similar circumstances (non-postincrement), let's try that for those ops too... yes, it works. The immediate is treated as an unsigned 11-bit number and ORed into the address. The instructions now work... except for $c output.

This one was annoying to get right. Setting $c according to the post-OR address didn't work. Pre-OR didn't work either. Const 0, const 1, no modification: also not. I tried setting it based on addition result (that is otherwise nowhere to be seen), but the test didn't pass either. Argh.

So, let's experiment with it on small inputs (there are three: low $a bits, high $a bits, and the immediate). And the experiment reveals... the flags *are* set as for addition. But that didn't work in testing...

After some further stumbling in the dark, I figured out that the immediate is to be treated as unsigned for addition too... So the actual memory load/store uses ($a | immediate), but flags are set according to ($a + immediate). Wat.
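A tiny Python illustration of that split, as I understand it. The function names are made up, `a` stands for just the address part of $a, and which $c flags are derived from the sum is not modelled here.

```python
IMM_MASK = 0x7ff  # the immediate is an unsigned 11-bit number


def effective_address(a, imm):
    """Address used by the actual load/store: $a OR immediate."""
    return a | (imm & IMM_MASK)


def flag_source(a, imm):
    """Value the $c flags are computed from: $a PLUS the unsigned immediate."""
    return a + (imm & IMM_MASK)


if __name__ == "__main__":
    a, imm = 0x30, 0x10
    # The two only agree when $a and the immediate share no set bits.
    print(hex(effective_address(a, imm)), hex(flag_source(a, imm)))
```

With a = 0x30 and an immediate of 0x10 the access goes to 0x30 while the flags see 0x40, which is why neither "post-OR" nor "pre-OR" flag guesses could pass.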

Oh well, the test passes perfectly now for all known load and store instructions. That leaves 0xc8 and 0xc9 for the next episode.

Elapsed time: 6h.
