13: hwtest and vector instructions, part 1

It's time to finally take a look at the vector instructions. Some quick testing reveals that 0x81, 0x82, 0x87-0x8e, 0x90-0x92, 0x94, 0x95, 0x97-0x9f, 0xa1, 0xa2, 0xa4, 0xa5, 0xa7-0xaf, 0xb1-0xb3, 0xb6-0xbe are valid opcodes.

First, let's look at 0xad. It's used to clear the $v registers in the "init" microcode piece, so there's a high chance it's a mov from immediate.

hwtest confirms our guess: it takes an 8-bit immediate from bits 3-10 of the instruction and broadcasts it to every byte of the destination register.
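To make the bit positions concrete, here is a minimal sketch of that decode in plain Python. The function name is made up for illustration, and the 16-byte vector width is an assumption based on the 16-byte dumps printed later in this post.

```python
VECTOR_BYTES = 16  # assumption: 16 one-byte lanes, as in the dumps in this post


def mov_imm_result(insn):
    """Model the destination contents after the 0xad immediate mov."""
    imm = (insn >> 3) & 0xff  # 8-bit immediate stored in bits 3-10
    return [imm] * VECTOR_BYTES  # the same byte lands in every lane


# An all-zero immediate field clears the register, matching the "init" use.
print(mov_imm_result(0xad000000))
```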

For further instructions, we'll use another simple Python program:

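#!/usr/bin/env python3

import sys
import nvapy

# The card index is specific to the test machine.
c = nvapy.cards[2]

b0 = c.bar0

# Seed 31 consecutive 32-bit registers at 0xf780 with their own index.
for x in range(31):
    b0.wr32(0xf780 + x * 4, x)

# Sweep the second operand over -4..11, wrapped to a byte.
for x in range(-4, 12):
    x = x & 0xff
    # Poison the register we read back later (four words, 0x80 apart).
    b0.wr32(0xf000, 0xdeadbeef)
    b0.wr32(0xf080, 0xdeadbeef)
    b0.wr32(0xf100, 0xdeadbeef)
    b0.wr32(0xf180, 0xdeadbeef)
    # First operand: bytes fc fd fe ff 00 01 ... 0b.
    b0.wr32(0xf004, 0xfffefdfc)
    b0.wr32(0xf084, 0x03020100)
    b0.wr32(0xf104, 0x07060504)
    b0.wr32(0xf184, 0x0b0a0908)
    # Second operand: x replicated into all 16 bytes.
    b0.wr32(0xf008, x * 0x01010101)
    b0.wr32(0xf088, x * 0x01010101)
    b0.wr32(0xf108, x * 0x01010101)
    b0.wr32(0xf188, x * 0x01010101)
    # Write the instruction word, then poke 0xf458 (presumably the execute trigger).
    b0.wr32(0xf450, 0x01000000 | 1 << 14 | 2 << 9)
    b0.wr32(0xf458, 1)
    # Read the result back as 16 bytes, least significant byte of each word first.
    res = [b0.rd32(0xf000 + (i >> 2) * 0x80) >> (i & 3) * 8 & 0xff for i in range(16)]
    print(' '.join(format(x, "02x") for x in res))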

The program strongly assumes that operations on the vectors are element-wise (i.e. we're not looking at a swizzle instruction), but that should be true often enough.
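The list comprehension that builds `res` is a bit dense, so here is the same byte extraction spelled out on plain integers, with no hardware access. `unpack_vector` is a hypothetical helper name; the byte order is simply what the original expression implies.

```python
def unpack_vector(words):
    """Split four 32-bit words into 16 bytes, least significant byte first."""
    return [(words[i >> 2] >> ((i & 3) * 8)) & 0xff for i in range(16)]


# The four words the test program writes as its first operand.
src = [0xfffefdfc, 0x03020100, 0x07060504, 0x0b0a0908]
print(' '.join(format(b, "02x") for b in unpack_vector(src)))
```

Fed with the first operand, this prints the bytes fc through ff followed by 00 through 0b, which is the lane order to keep in mind when reading the dumps.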

Running it on opcode 0x81:

7c 7c 7d 7d 00 00 00 01 01 02 02 03 03 04 04 05
7c 7d 7d 7e 00 00 00 01 01 02 02 03 03 04 04 05
7d 7d 7e 7e 00 00 00 01 01 02 02 03 03 04 04 05
7d 7e 7e 7f 00 00 00 01 01 02 02 03 03 04 04 05
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
01 01 01 01 00 00 00 00 00 00 00 00 00 00 00 00
01 01 01 01 00 00 00 00 00 00 00 00 00 00 00 00
02 02 02 02 00 00 00 00 00 00 00 00 00 00 00 00
02 02 02 02 00 00 00 00 00 00 00 00 00 00 00 00
03 03 03 03 00 00 00 00 00 00 00 00 00 00 00 00
03 03 03 03 00 00 00 00 00 00 00 00 00 00 00 00
04 04 04 04 00 00 00 00 00 00 00 00 00 00 00 00
04 04 04 04 00 00 00 00 00 00 00 00 00 00 00 00
05 05 05 05 00 00 00 00 00 00 00 00 00 00 00 00

Looks kind of like fractional multiplication, with unsigned inputs and signed output. Huh. Let's try 0x82.

7f 7f 7f 7f 00 00 01 01 02 02 03 03 04 04 05 05
7f 7f 7f 7f 00 01 02 03 04 05 06 07 08 09 0a 0b
7f 7f 7f 7f 00 01 03 04 06 07 09 0a 0c 0d 0f 10
7f 7f 7f 7f 00 02 04 06 08 0a 0c 0e 10 12 14 16
7f 7f 7f 7f 00 02 04 06 08 0a 0c 0e 10 12 14 16
7f 7f 7f 7f 00 02 04 06 08 0a 0c 0e 10 12 14 16
7f 7f 7f 7f 00 02 04 06 08 0a 0c 0e 10 12 14 16
7f 7f 7f 7f 00 02 04 06 08 0a 0c 0e 10 12 14 16
7f 7f 7f 7f 00 02 04 06 08 0a 0c 0e 10 12 14 16
7f 7f 7f 7f 00 02 04 06 08 0a 0c 0e 10 12 14 16
7f 7f 7f 7f 00 02 04 06 08 0a 0c 0e 10 12 14 16
7f 7f 7f 7f 00 02 04 06 08 0a 0c 0e 10 12 14 16
7f 7f 7f 7f 00 02 04 06 08 0a 0c 0e 10 12 14 16
7f 7f 7f 7f 00 02 04 06 08 0a 0c 0e 10 12 14 16
7f 7f 7f 7f 00 02 04 06 08 0a 0c 0e 10 12 15 17
7f 7f 7f 7f 00 02 04 06 08 0a 0c 0e 11 13 15 17

Strange. Revisit later. 0x91?

f8 f9 fa fb 00 00 01 02 03 04 05 06 07 08 09 0a
f9 fa fb fc 00 00 01 02 03 04 05 06 07 08 09 0a
fa fb fc fd 00 00 01 02 03 04 05 06 07 08 09 0a
fb fc fd fe 00 00 01 02 03 04 05 06 07 08 09 0a
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
01 01 01 01 00 00 00 00 00 00 00 00 00 00 00 00
02 02 02 02 00 00 00 00 00 00 00 00 00 00 00 00
03 03 03 03 00 00 00 00 00 00 00 00 00 00 00 00
04 04 04 04 00 00 00 00 00 00 00 00 00 00 00 00
05 05 05 05 00 00 00 00 00 00 00 00 00 00 00 00
06 06 06 06 00 00 00 00 00 00 00 00 00 00 00 00
07 07 07 07 00 00 00 00 00 00 00 00 00 00 00 00
08 08 08 08 00 00 00 00 00 00 00 00 00 00 00 00
09 09 09 09 00 00 00 00 00 00 00 00 00 00 00 00
0a 0a 0a 0a 00 00 00 00 00 00 00 00 00 00 00 00

That sounds like a multiply with unsigned inputs and unsigned outputs! Good.
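A quick way to check both readings is a pure-Python model. The sketch below treats the inputs as unsigned bytes and keeps 8 bits of the product above a fixed shift: 9 for 0x81, 8 for 0x91. The function names and the two shift values are inferences from the dumps in this post only, not confirmed semantics.

```python
# Operands used by the test program: lanes fc..ff, 00..0b against a broadcast byte x.
LANES = [0xfc, 0xfd, 0xfe, 0xff] + list(range(12))


def mul_model(a, b, shift):
    """Hypothetical model: multiply two unsigned bytes, keep 8 bits above `shift`."""
    return ((a * b) >> shift) & 0xff


def dump_row(x, shift):
    """Format one dump row for broadcast operand x (taken modulo 256)."""
    return ' '.join(format(mul_model(a, x & 0xff, shift), "02x") for a in LANES)


print(dump_row(-4, 9))  # compare with the first 0x81 row
print(dump_row(-4, 8))  # compare with the first 0x91 row
```

With these shifts the model matches the first row of each dump. The extra bit of shift in the 0x81 case is what would leave room for a sign bit in the output.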

Time to hwtest 0x81 and 0x91. Copying the implementation from scalar ops 0x01 and 0x11 results in a passing test, as long as we mask off bits 3-7. It seems these instructions have some extra parameters there. Oh, and bits 1, 2, and 8 are confirmed to work like on the scalar unit.

So, let's try out the other bits in Python. Bit 3 seems to change the semantics to always unsigned, on both input and output. Weird. Bit 4 selects the low 8 bits of the result instead of the high 8 (for extended multiply, apparently). Bits 5-7 seem to be a (signed) shift amount for the result. Implementing bits 4-7 is easy enough in hwtest, and it seems to work.

Bit 3, upon closer look, has funnier semantics: it changes the mode from fractional to integer multiplication.
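For reference, the extra fields found so far can be pulled out of an instruction word like this. It is only a decoding sketch: the key names are made up, the shift is assumed to be a 3-bit two's complement value, and nothing here models which direction the shift goes or how the fields interact.

```python
def decode_mul_fields(insn):
    """Extract the extra multiply parameters from bits 3-7 of an instruction word."""
    shift = (insn >> 5) & 7
    if shift & 4:
        shift -= 8  # assumption: two's complement, so the range is -4..3
    return {
        "integer_mode": bool(insn & (1 << 3)),  # bit 3: integer instead of fractional
        "low_byte": bool(insn & (1 << 4)),      # bit 4: low 8 bits of the result
        "shift": shift,                         # bits 5-7: signed shift amount
    }


print(decode_mul_fields(0x18))
print(decode_mul_fields(0xe0))
```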

Time for some other opcodes. Let's skip the crazy 0x82 one for now. 0x[89][89cd] should be min, max, add, sub if it's anything like the scalar opcodes. And hwtest verifies that, no surprises here.

Likewise, 0x[89]a is quickly determined to be abs (with a perfectly useless unsigned version), and 0x8b is neg. 0x9b doesn't pass though, and Python spits out weirdness. Skip.

0x8e and 0x9e are shifts right, with no surprises.

Now, time for immediate versions of those. 0x*d seems to be special, since its encoding is used for immediate mov, and 0x[ab][ab] don't seem to work, but 0x[ab][189ce] are quickly determined to be immediate versions of 0x[89]*.

That leaves 0x82, 0x85, 0x87, 0x90, 0x92, 0x94, 0x95, 0x97, 0x9b, 0x9f, 0xa2, 0xa4, 0xa5, 0xa7, 0xaa, 0xab, 0xaf, 0xb2, 0xb3, 0xb6, 0xb7, 0xba, 0xbb, 0xbd for the next episode.
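The bookkeeping is easy to redo mechanically. The sketch below expands the bracket shorthand used in this post and subtracts the opcodes identified so far from the valid set found by the initial scan. `expand` and `span` are hypothetical helper names.

```python
import itertools
import re


def expand(pattern):
    """Expand shorthand such as '0x[89][89cd]' into a set of opcode values."""
    groups = re.findall(r"\[([0-9a-f]+)\]|([0-9a-f])", pattern[2:])
    choices = [bracket or single for bracket, single in groups]
    return {int(''.join(digits), 16) for digits in itertools.product(*choices)}


def span(lo, hi):
    """Inclusive opcode range."""
    return set(range(lo, hi + 1))


# Valid opcodes, as listed at the top of the post.
valid = ({0x81, 0x82, 0x94, 0x95, 0xa1, 0xa2, 0xa4, 0xa5}
         | span(0x87, 0x8e) | span(0x90, 0x92) | span(0x97, 0x9f)
         | span(0xa7, 0xaf) | span(0xb1, 0xb3) | span(0xb6, 0xbe))

# Opcodes identified in this post.
identified = set()
for p in ["0xad", "0x81", "0x91", "0x[89][89cd]", "0x[89]a",
          "0x8b", "0x8e", "0x9e", "0x[ab][189ce]"]:
    identified |= expand(p)

remaining = sorted(valid - identified)
print(' '.join(format(op, "#04x") for op in remaining))
```

Recomputed this way, the leftover set has 23 entries and matches the list above except for 0x85, which does not appear in the valid-opcode scan at the top of the post.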

There are still two interesting parts of the vector unit we haven't seen at all yet: three-input instructions (implied to exist by the block diagram) and $c setting.

Elapsed time: 2.5h
