12: hwtest and scalar instructions, part 2

First of all, I made one minor error in the previous episode: I reset the VP1 unit on every test, causing the $c registers to be cleared every time (remember, we have no means of setting them arbitrarily). Allowing them to keep the old value between tests reveals that all the 0x60-0x7f opcodes that we thought to be NOPs in fact clear the scalar flags of the destination $c register (if any) to 0.

Let's continue the hwtest approach for further scalar opcodes, the 0x40-0x5f range.

The only opcodes that seem to do anything interesting at all are 0x41, 0x42, 0x45, 0x48-0x4e, 0x51, 0x58-0x5e. The other opcodes, with the exception of 0x4f, have only the "set $c flags to 0" behavior. 0x4f is a full nop.

Let's start with one of the known ones: 0x41 (mul).

Well, things got interesting. The instruction works as expected when the SRC2 selection field is set to 0x38 (ie. $c0 bit 14, which is always 0). In fact, it works as expected when it's set to anything other than $cX bit 4. But selecting bit 4 causes weirdness.

We'll try determining carefully how it works by doing the following:

  1. Set $r0 to 0, $r1 to 1, $r2 to 2, ...
  2. For SRC1 in 4..7:
    1. For SRC2 in 16..19:
      1. For C in 0..3:
        1. Run 0x41 (mul) instruction with SRC1 and SRC2 set as above, bits 3-4 ($c selection) set to C, and bits 5-8 (flag selection) set to 4
        2. Print result.

result src1 src2 c
00000040 4 16 0
00000044 4 16 1
0000004c 4 16 2
00000048 4 16 3
00000044 4 17 0
00000048 4 17 1
00000040 4 17 2
0000004c 4 17 3
00000048 4 18 0
0000004c 4 18 1
00000044 4 18 2
00000040 4 18 3
0000004c 4 19 0
00000040 4 19 1
00000048 4 19 2
00000044 4 19 3
00000050 5 16 0
00000055 5 16 1
0000005f 5 16 2
0000005a 5 16 3
00000055 5 17 0
0000005a 5 17 1
00000050 5 17 2
0000005f 5 17 3
0000005a 5 18 0
0000005f 5 18 1
00000055 5 18 2
00000050 5 18 3
0000005f 5 19 0
00000050 5 19 1
0000005a 5 19 2
00000055 5 19 3
00000060 6 16 0
00000066 6 16 1
00000072 6 16 2
0000006c 6 16 3
00000066 6 17 0
0000006c 6 17 1
00000060 6 17 2
00000072 6 17 3
0000006c 6 18 0
00000072 6 18 1
00000066 6 18 2
00000060 6 18 3
00000072 6 19 0
00000060 6 19 1
0000006c 6 19 2
00000066 6 19 3
00000070 7 16 0
00000077 7 16 1
00000085 7 16 2
0000007e 7 16 3
00000077 7 17 0
0000007e 7 17 1
00000070 7 17 2
00000085 7 17 3
0000007e 7 18 0
00000085 7 18 1
00000077 7 18 2
00000070 7 18 3
00000085 7 19 0
00000070 7 19 1
0000007e 7 19 2
00000077 7 19 3

The result is apparently always divisible by SRC1, so there's no weirdness in SRC1 selection at least. Let's disregard SRC1 now and just print what actual src2 it uses:

actual_reg_used src2 c
16 16 0
17 16 1
19 16 2
18 16 3
17 17 0
18 17 1
16 17 2
19 17 3
18 18 0
19 18 1
17 18 2
16 18 3
19 19 0
16 19 1
18 19 2
17 19 3

And the state of $c during this test (left over from previous hwtest run) is:

0000f680: 00008000 00008090 000080b0 000080a0

It looks like $c bits 4-5 are added to src2 register index bits 0-1, with carry from bit 1 discarded. So you can select one of 4 registers depending on bits 20-21 of some previous results. Straaaange. But hwtest verifies that as true.
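To make the rule concrete, here is a small model of it in Python, run over the same src2/c combinations as the table above. It is only a sketch: `actual_src2` is a made-up helper name, and the $c values are simply the ones from the dump above.

```python
# $c0..$c3 as dumped above (left over from the previous hwtest run).
C_REGS = [0x00008000, 0x00008090, 0x000080b0, 0x000080a0]

def actual_src2(src2, c):
    """Model of the observed src2 selection (hypothetical helper name)."""
    adj = (C_REGS[c] >> 4) & 3      # $c bits 4-5
    low = (src2 + adj) & 3          # add to index bits 0-1, discard the carry
    return (src2 & ~3) | low        # upper index bits are left alone

print("actual_reg_used src2 c")
for src2 in range(16, 20):
    for c in range(4):
        print(actual_src2(src2, c), src2, c)
```

The printed rows match the actual_reg_used table, e.g. src2 = 16 with c = 2 picks register 19.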

Opcodes 0x42 (bitwise ops), 0x48 (min), 0x49 (max), 0x4c (add), 0x4d (sub), 0x4e (sar), 0x5e (shr) hold no surprises. Further, 0x51, 0x58, 0x59, 0x5c, 0x5d seem to be aliases of corresponding ops without bit 4 set.

This leaves 0x45, 0x4a, 0x4b, 0x5a, 0x5b. Given the patterns in encodings, it's quite likely that 0x4a/0x5a work exactly like 0x7a (abs), and 0x4b/0x5b like 0x7b (neg). A simple hwtest indeed confirms that.

So, 0x45. 0x45 is weird. Let's take a look at what it changes:

what        initial    expected   real
R[0x00]   0xbb62c511 0xbb62c511 0xfbb62c51 *
R[0x02]   0xa820609f 0xa820609f 0xfa820609 *
R[0x16]   0x75bad0b0 0x75bad0b0 0x075bad0b *
R[0x1e]   0xd6785ad5 0xd6785ad5 0xfd6785ad *
R[0x1d]   0x6ac8526c 0x6ac8526c 0x06ac8526 *
R[0x0a]   0xff9a4686 0xff9a4686 0xfff9a468 *
R[0x06]   0x4363d465 0x4363d465 0x04363d46 *
R[0x03]   0xade1a830 0xade1a830 0xfade1a83 *
R[0x09]   0xc5d0e851 0xc5d0e851 0xfc5d0e85 *

The thing is, the register that's changed is selected by bits 14-18 of the opcode (ie. normally used for src1). Still, implementing it in hwtest as src1 = (int32_t)src1 >> 4 seems to work... Oh well, maybe it'll make some sense later.
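For reference, the same arithmetic shift can be modelled in Python and checked against a few rows of the table above. This is just a sketch of the observed behavior; `sar4` is a name invented here.

```python
def sar4(value):
    """Model of (int32_t)value >> 4, returned as an unsigned 32-bit word."""
    if value & 0x80000000:
        value -= 1 << 32            # reinterpret as a negative int32
    return (value >> 4) & 0xffffffff

# (initial, real) pairs copied from the table above.
ROWS = [
    (0xbb62c511, 0xfbb62c51),
    (0x75bad0b0, 0x075bad0b),
    (0xff9a4686, 0xfff9a468),
    (0x4363d465, 0x04363d46),
]
for initial, real in ROWS:
    print("{:08x} -> {:08x} (hw: {:08x})".format(initial, sar4(initial), real))
```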

Let's expand the hwtested range to full 0x00-0x7f now. The results are as follows:

  • 0x1f, 0x2f, 0x3f clear the flags to 0
  • 0x00, 0x03-0x07, 0x0f, 0x10, 0x13-0x17, 0x20, 0x23, 0x24, 0x30, 0x33-0x37 do absolutely nothing
  • the remaining instructions modify the $r registers

Starting with 0x01. First, let's use a little python program to try it out:

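#!/usr/bin/env python3
import sys
import nvapy

c = nvapy.cards[2]
b0 = c.bar0

# Preload the $r registers with their own index ($rX = X).
for x in range(31):
    b0.wr32(0xf780 + x * 4, x)

# Sweep every one-hot value through $r0 (dst), $r1 (src1) and $r2 (src2).
for d in [1 << x for x in range(32)]:
    for s1 in [1 << x for x in range(32)]:
        for s2 in [1 << x for x in range(32)]:
            b0.wr32(0xf780, d)
            b0.wr32(0xf784, s1)
            b0.wr32(0xf788, s2)
            # Opcode 0x01 in the top byte, src1 = $r1, src2 = $r2, dst = $r0.
            b0.wr32(0xf44c, 0x01000000 | 1 << 14 | 2 << 9)
            # Execute the instruction and read the result back from $r0.
            b0.wr32(0xf458, 1)
            res = b0.rd32(0xf780)
            sys.stdout.write("{:08x} ".format(res))
        print('')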

The result quite clearly implies it's an SIMD byte multiplication instruction, operating on (unsigned) fractions. That is, res = s1 * s2 >> 9. The >> 9 is rather strange, but oh well.

After coding it in hwtest, it blows up when the low bits are non-0. It passes after ANDing the opcode with ~0x1ff. With some trial and error, we determine it in fact passes after ANDing with ~0x106. So bits 1, 2 and 8 are some flags.

Trying them out in the python program again, we deduce bit 1 to cause source 2 to become signed (that explains >>9: the output is signed too!) and bit 2 to cause source 1 to become signed. Bit 8 seems to enable round-to-nearest instead of round down.

There is one important thing to be mentioned here: in fixed-point arithmetic with only fractional part, there's exactly one case where overflow is possible: multiplying -1 by -1 (1 is not representable after all). Trying it out, we see the correct result of 0x80 clipped to 0x7f. However, hwtest passes even without caring for that edge case!
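Putting the pieces together, a rough Python model of the instruction might look like the sketch below. The flag bits (1, 2, 8), the >> 9 and the 0x7f clip come from the observations above. Scaling signed inputs by one bit (since they have 7 fraction bits rather than 8) and implementing round-to-nearest as "add half before shifting" are assumptions of this sketch, and all the names are made up.

```python
def sext8(v):
    """Interpret an 8-bit value as a signed byte."""
    return v - 0x100 if v & 0x80 else v

def mul_byte(a, b, sign1, sign2, rnd):
    # Assumption: signed inputs are scaled up to 8 fraction bits.
    x = sext8(a) << 1 if sign1 else a
    y = sext8(b) << 1 if sign2 else b
    prod = x * y                        # 16 fraction bits
    if rnd:
        prod += 1 << 8                  # assumed round-to-nearest bias
    res = prod >> 9                     # signed result, 7 fraction bits
    res = max(-0x80, min(0x7f, res))    # only -1 * -1 actually clips
    return res & 0xff

def bmul(s1, s2, flags=0):
    sign2 = bool(flags & 0x002)         # bit 1: source 2 is signed
    sign1 = bool(flags & 0x004)         # bit 2: source 1 is signed
    rnd = bool(flags & 0x100)           # bit 8: round to nearest
    out = 0
    for shift in range(0, 32, 8):
        a = (s1 >> shift) & 0xff
        b = (s2 >> shift) & 0xff
        out |= mul_byte(a, b, sign1, sign2, rnd) << shift
    return out

print("{:08x}".format(bmul(0x80808080, 0x80808080)))         # unsigned 0.5 * 0.5
print("{:08x}".format(bmul(0x00000080, 0x00000080, 0x006)))  # signed -1 * -1
```

The second call is the edge case described above: the exact product would be 0x80, and the model clips it to 0x7f.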

This is the major flaw of randomized tests (or black box tests): they may miss some rare special cases. In hwtest, we try to find cases like that and help luck a bit by bumping the probability of hitting them. After adding a simple piece of code that sometimes randomly replaces bytes of a word with 0 or 0x80, we get a proper hwtest failure, and can fix it up by adding the special case.
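The idea of nudging the generator toward edge cases can be sketched in a few lines of Python. This is not the actual hwtest code (that lives in C); `biased_word` and the 25% per-byte probability are illustrative choices only.

```python
import random

def biased_word(rng, p_special=0.25):
    """Random 32-bit word whose bytes are sometimes forced to 0 or 0x80."""
    word = rng.getrandbits(32)
    for shift in range(0, 32, 8):
        if rng.random() < p_special:
            special = rng.choice([0x00, 0x80])
            word = (word & ~(0xff << shift)) | (special << shift)
    return word

rng = random.Random(0)
samples = [biased_word(rng) for _ in range(1000)]
hits = sum(1 for w in samples
           if any((w >> s) & 0xff == 0x80 for s in range(0, 32, 8)))
print("words with at least one 0x80 byte:", hits, "of", len(samples))
```

With purely uniform words, a given byte pair is 0x80/0x80 only once in 65536 tries; forcing the special values makes the -1 * -1 case come up far more often.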

Onwards. Now, trying opcode 0x02 with the python script, we get the exact same result as with 0x01. hwtest verifies that it's really an alias.

Let's try out opcode 0x11 now. It looks rather similar to 0x01 in the python script, except it's shifted one bit to the left. This heavily implies it should be an unsigned-result version. After a bit of trial and error, this is verified. Like 0x01, the inputs to 0x11 can be negative, but negative results are just clipped to 0. And opcode 0x12 turns out to be an alias for 0x11.

Now, according to encoding patterns, 0x21 should be the immediate version of 0x01. Indeed, after modifying our python script for immediate operand support, it looks exactly as expected... except the immediate seems to be 5 bits only, is at bit positions 9-13, and corresponds to bits 2-6 of the byte input (broadcast to all 4 bytes). Huh. hwtest verifies that, but blows up when bit 0 of the opcode is 1.

Doing another test with the script reveals that bit 0 is, in fact, the missing bit 7 of the immediate. hwtest verifies that and passes completely now.
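The immediate layout for 0x21 as worked out so far can be sketched like this. `imm_0x21` is a made-up name, and leaving byte bits 0-1 at zero is an assumption (nothing above says where they would come from).

```python
def imm_0x21(op):
    """Decode the broadcast immediate of opcode 0x21 (hypothetical helper)."""
    imm = ((op >> 9) & 0x1f) << 2   # opcode bits 9-13 -> byte bits 2-6
    imm |= (op & 1) << 7            # opcode bit 0     -> byte bit 7
    # Assumption: byte bits 0-1 are always 0.
    return imm * 0x01010101         # broadcast to all 4 bytes

for op in (0x1f << 9, 1, (1 << 9) | 1):
    print("{:08x} -> {:08x}".format(op, imm_0x21(op)))
```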

This time, 0x22 turns out not to be an alias for 0x21. Let's look at it in a moment, then. 0x31 should be an immediate version of 0x11. And hwtest again verifies that quickly.

As for 0x22 (and 0x32): python says they're like 0x21 and 0x31, but have the immediate in bits 3-7 (and shorter). But hwtest again gives errors when bits 0-2 are non-0. Huh.

Looking at python output again, it seems the low bits are both part of the immediate *and* control signedness of the inputs. Ugh. So hwtest passes now, but the opcodes seem brain damaged and useless.

Our python script makes quick work of opcodes 0x25-0x27. They're quite obviously AND, OR, and XOR respectively of the input and an 8-bit immediate (bits 3-10 of the opcode) broadcast to all 4 bytes. hwtest verifies that, but has errors on $c output. Seems it's always set to 0. After adding that fix, everything passes.
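A quick Python sketch of these three, with the immediate pulled out of bits 3-10 and broadcast to every byte. The function name and the way the opcode number is passed in separately are just conveniences of this example.

```python
import operator

LOGIC_OPS = {0x25: operator.and_, 0x26: operator.or_, 0x27: operator.xor}

def logic_imm(opcode, insn, src):
    """Model of 0x25/0x26/0x27: src OP broadcast(imm8)."""
    imm = (insn >> 3) & 0xff        # 8-bit immediate at bits 3-10
    bcast = imm * 0x01010101        # same byte in all 4 lanes
    return LOGIC_OPS[opcode](src, bcast) & 0xffffffff

insn = 0x0f << 3                    # immediate = 0x0f
for opcode in (0x25, 0x26, 0x27):
    print("{:02x}: {:08x}".format(opcode, logic_imm(opcode, insn, 0x12345678)))
```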

0x08 seems to be, unsurprisingly enough, SIMD signed min, 0x09 signed max, 0x18 unsigned min, and 0x19 unsigned max. hwtest confirms that. $c is likewise zeroed.
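For completeness, here is what these four look like as a Python model. It is only a sketch with invented names; the one thing that differs between the signed and unsigned forms is how each byte is interpreted before comparing.

```python
def sext8(v):
    """Interpret an 8-bit value as a signed byte."""
    return v - 0x100 if v & 0x80 else v

def simd_minmax(opcode, s1, s2):
    """Model of 0x08/0x09 (signed min/max) and 0x18/0x19 (unsigned min/max)."""
    pick = min if opcode in (0x08, 0x18) else max
    key = sext8 if opcode in (0x08, 0x09) else (lambda v: v)
    out = 0
    for shift in range(0, 32, 8):
        a = (s1 >> shift) & 0xff
        b = (s2 >> shift) & 0xff
        out |= pick(a, b, key=key) << shift
    return out

for opcode in (0x08, 0x09, 0x18, 0x19):
    print("{:02x}: {:08x}".format(opcode, simd_minmax(opcode, 0x80017f00, 0x01800080)))
```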

0x28, 0x29, 0x38, 0x39 turn out to be immediate versions of those, with immediate in bits 3-10.

0x0a turns out to be bytewise absolute value, with 0x80 clipped to 0x7f. 0x1a is perfectly useless unsigned absolute (ie. passthrough). 0x0b is bytewise negation (with 0x80 again clipped), 0x1b is again perfectly useless unsigned negation with clipping (ie. always 0). 0x2a, 0x2b, 0x3a, 0x3b are aliases.
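The clipping is easiest to see in a small model. The sketch below (names invented here) covers the signed forms 0x0a and 0x0b, plus the unsigned negation 0x1b to show why it always yields 0.

```python
def sext8(v):
    """Interpret an 8-bit value as a signed byte."""
    return v - 0x100 if v & 0x80 else v

def abs_s8(b):                      # 0x0a: |x| clipped to 0x7f
    return min(0x7f, abs(sext8(b)))

def neg_s8(b):                      # 0x0b: -x clipped to [-0x80, 0x7f]
    return max(-0x80, min(0x7f, -sext8(b))) & 0xff

def neg_u8(b):                      # 0x1b: -x clipped to [0, 0xff], i.e. 0
    return max(0, min(0xff, -b))

def simd(fn, word):
    """Apply a per-byte function to all 4 bytes of a word."""
    out = 0
    for shift in range(0, 32, 8):
        out |= fn((word >> shift) & 0xff) << shift
    return out

print("{:08x}".format(simd(abs_s8, 0x80ff017f)))
print("{:08x}".format(simd(neg_s8, 0x80ff017f)))
print("{:08x}".format(simd(neg_u8, 0x80ff017f)))
```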

0x[0-3]c and 0x[0-3]d likewise turn out to be add/sub with signed/unsigned clipping.
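A sketch of the clipped add/sub in Python follows. It assumes the inputs are interpreted with the same signedness as the result, which is a guess on my part rather than something spelled out above; the `sub` and `signed` parameters stand in for the opcode selection.

```python
def sext8(v):
    """Interpret an 8-bit value as a signed byte."""
    return v - 0x100 if v & 0x80 else v

def simd_addsub(s1, s2, sub=False, signed=True):
    """Bytewise add/sub with clipping (assumed operand signedness)."""
    lo, hi = (-0x80, 0x7f) if signed else (0x00, 0xff)
    out = 0
    for shift in range(0, 32, 8):
        a = (s1 >> shift) & 0xff
        b = (s2 >> shift) & 0xff
        if signed:
            a, b = sext8(a), sext8(b)
        res = a - b if sub else a + b
        res = max(lo, min(hi, res))     # clip instead of wrapping
        out |= (res & 0xff) << shift
    return out

s1, s2 = 0x7f80ff01, 0x01ff0102
print("{:08x}".format(simd_addsub(s1, s2)))                          # signed add
print("{:08x}".format(simd_addsub(s1, s2, signed=False)))            # unsigned add
print("{:08x}".format(simd_addsub(s1, s2, sub=True)))                # signed sub
print("{:08x}".format(simd_addsub(s1, s2, sub=True, signed=False)))  # unsigned sub
```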

0x[0-3]e is, as expected, a bytewise shift right. The shift amount is always signed and 4-bit, taken from low 4 bits of corresponding bytes. This time there's no output clipping.

This covers all scalar instructions except 0x6a and 0x6b.

Elapsed time: 4h
