Last time I said there are three possibilities for the weird vector instructions. Let's figure them out.
First, we'll look at the microcode samples looking for example usage.
As it turns out, most of the unknown vector instructions are always used in pairs with unknown scalar instructions. A few common combinations are:
0f 85
0f 95
24 95
24 84
24 86
24 97
45 85
04 b4
04 b5
04 b6
04 b7

The unpaired instructions include 0x94 and 0x9b.
That'd explain the weirdly useless 0x45 behavior: we've only seen half of that instruction's semantics.
Let's start by picking a pair and figuring out its input/output registers. I'm picking 0x0fdeadbe, 0x95cafeba as the opcodes.
    #!/usr/bin/env python3

    import sys
    import nvapy
    import random

    c = nvapy.cards
    b0 = c.bar0

    # Random initial state: 31 scalar registers, 32 128-bit vector registers.
    r = [random.getrandbits(32) for x in range(31)]
    v = [random.getrandbits(128) for x in range(32)]

    def puts(r, v):
        # Upload the scalar registers, then each vector register as four 32-bit words.
        for x in range(31):
            b0.wr32(0xf780 + x * 4, r[x])
        for x in range(32):
            for y in range(4):
                b0.wr32(0xf000 + x * 4 + y * 0x80, v[x] >> y * 32 & 0xffffffff)

    puts(r, v)

    op = [0xdf000000, 0x0fdeadbe, 0x95cafeba, 0xef000000]

    def fire(op):
        # Write the four opcode words, then trigger execution.
        for x in range(4):
            b0.wr32(0xf448 + x * 4, op[x])
        b0.wr32(0xf458, 1)

    fire(op)

    def rres(x):
        # Read vector register x back as a 128-bit integer.
        r = 0
        for y in range(4):
            r |= b0.rd32(0xf000 + x * 4 + y * 0x80) << y * 32
        return r

    # Reference result with the unmodified register state.
    a = rres(0x19)

    # Invert each scalar register in turn; a changed result marks it as an input.
    for x in range(31):
        nr = r[:]
        nr[x] ^= 0xffffffff
        puts(nr, v)
        fire(op)
        b = rres(0x19)
        if a != b:
            print('r', x)

    # Same for each vector register.
    for x in range(32):
        nv = v[:]
        nv[x] ^= 0xffffffffffffffffffffffffffffffff
        puts(r, nv)
        fire(op)
        b = rres(0x19)
        if a != b:
            print('v', x)
This Python script says the inputs are $r26, $v11, $v31. These correspond to the scalar source 1 bitfield of the scalar opcode and the vector source 1 and 2 bitfields of the vector opcode.
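As a sanity check, the register numbers can be pulled straight out of the two opcode words. This is a minimal sketch: the field positions (opcode in bits 24-31, destination in bits 19-23, source 1 in bits 14-18, source 2 in bits 9-13) are an assumption chosen because they reproduce the numbers above, and `decode_fields` is a made-up helper name. The source 3 field in bits 4-8 is the one mentioned later in this post.

```python
def decode_fields(op):
    """Split a 32-bit opcode word into its (assumed) 5-bit register fields."""
    return {
        'opcode': op >> 24 & 0xff,
        'dst': op >> 19 & 0x1f,   # assumed position
        'src1': op >> 14 & 0x1f,  # assumed position
        'src2': op >> 9 & 0x1f,   # assumed position
        'src3': op >> 4 & 0x1f,   # only meaningful for three-source ops
    }

vec = decode_fields(0x95cafeba)
sca = decode_fields(0x0fdeadbe)
print('vector: dst $v%d, src1 $v%d, src2 $v%d' % (vec['dst'], vec['src1'], vec['src2']))
print('scalar: src1 $r%d' % sca['src1'])
```

With these positions the vector destination comes out as 0x19, which is the register the script reads back.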
Now let's try to figure out the behavior. I'll run that pair (but with unknown bits set to 0) with all three inputs ranging between -16..32.
The output seems rather chaotic at first. But it seems that for $r == 0 the instruction reduces to DST = SRC2. For other values of $r, it looks like DST = SRC2 + (something small) * SRC1. So it seems to be a mul-add of the form VDST = VSRC1 * RSRC1 + VSRC2, with RSRC1 signed. Testing the low opcode bits reveals bit 8 to be rounding, bits 5-7 to be the shift, bit 4 to be low/high selection, and bits 1-2 to be signed/unsigned selection for SRC1/SRC2, as usual. However, bit 0 does something really strange: the low bits of the scalar source are instead used as a mask of byte slots, and the bytes matched by that mask operate differently from the ones not matched. The instruction also has another weirdness: even when bit 0 is not set, an input of 0xffffffff seems to be treated as -2, 0xfffffffe as -3, 0xfffffffd as -4, and so on.
It seems the scalar input is actually treated as two numbers, which are added together. Or maybe sometimes added, based on a mask that's communicated elsewhere. Scary. Let's deal with it later and first look at the other vector instructions we're still missing.
Running the script on 0x94 reveals its inputs are just the usual vector sources 1 and 2. After a bit of experimentation, it's revealed to be a register-register bitwise operation (with operation selected by bits 3-6 of the opcode as previously). Apparently I've missed it earlier. 0xa5 is also revealed to be bytewise min(abs(src1), abs(src2)). Weird, but oh well.
0x9b, 0x9f, 0xa4 are more interesting. They appear to use a third source operand, in bits 4-8. Let's run them with all three inputs varying. 0xa4 apparently takes the three inputs and computes their median. But 0x9b and 0x9f do some much weirder operation. This likely means that, as opposed to previous vector instructions, these operations don't operate strictly vertically (ie. byte X of the output does not depend on just byte X of each source).
After re-running the test with only byte 0 used, and then gradually introducing more bytes, the instructions are easy to understand: 0x9b is "swizzle" - for each byte of the destination, the corresponding byte of source 3 selects which byte of source 1 or 2 should be taken. Only bits 0-4 of source 3 are used: values 0-0xf select source 1 bytes, and 0x10-0x1f select source 2 bytes. And verifying it in hwtest reveals a flag: bit 3 of the opcode apparently swaps bits 0-3 and 4-7 of source 3.
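The swizzle described above can be modelled in a few lines. This is a sketch rather than a verified emulation: `swizzle` and `swap_nibbles` are made-up names, vectors are represented as 16-byte `bytes` objects with byte 0 first, and the opcode flag is assumed to swap the two nibbles of the selector byte before the low 5 bits are taken.

```python
def swizzle(src1, src2, src3, swap_nibbles=False):
    """Per-byte select: each byte of src3 picks a byte from src1 or src2."""
    out = bytearray(16)
    for i in range(16):
        sel = src3[i]
        if swap_nibbles:
            # Assumed effect of the opcode flag: exchange bits 0-3 and 4-7.
            sel = (sel << 4 | sel >> 4) & 0xff
        sel &= 0x1f  # only 5 selector bits are used
        out[i] = src1[sel] if sel < 0x10 else src2[sel - 0x10]
    return bytes(out)

src1 = bytes(range(0x00, 0x10))
src2 = bytes(range(0x80, 0x90))
src3 = bytes([0x00, 0x0f, 0x10, 0x1f, 0x25] + [0] * 11)
print(swizzle(src1, src2, src3).hex())
```

Note that output byte X can depend on any byte of either source, which is exactly why testing one byte at a time was needed to make sense of it.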
0x9f, on the other hand, does 8-bit + 9-bit addition: sources 2 and 3 are treated as made of 16-bit words, not bytes, and together make one 256-bit operand made of 16 16-bit operands. The instruction takes the low 9 bits of each such word (as a signed value), adds them to the bytes of source 1 (as unsigned values), clips the results to uint8_t range, and writes them to the destination. This exactly matches the operations needed to perform motion compensation.
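Here is a rough model of that 8-bit + 9-bit addition. It is only a sketch under assumptions the hardware tests above don't settle: the 16-bit words are taken to be little-endian, source 2 is assumed to hold words 0-7 and source 3 words 8-15, and `add9_clip` is a made-up name.

```python
def add9_clip(src1, src2, src3):
    """Add a signed 9-bit delta to each unsigned byte of src1, clipping to 0..255."""
    # Assumed layout: src2 supplies words 0-7, src3 words 8-15, little-endian.
    words = []
    for src in (src2, src3):
        for i in range(8):
            words.append(src[2 * i] | src[2 * i + 1] << 8)
    out = bytearray(16)
    for i in range(16):
        delta = words[i] & 0x1ff      # keep the low 9 bits
        if delta & 0x100:             # sign-extend from 9 bits
            delta -= 0x200
        out[i] = min(max(src1[i] + delta, 0), 255)
    return bytes(out)

def pack_words(ws):
    """Pack eight 16-bit words into a 16-byte vector (little-endian)."""
    return b''.join(w.to_bytes(2, 'little') for w in ws)

pred = bytes([0, 10, 250, 255] + [100] * 12)
lo = pack_words([0x1ff, 5, 0xff, 0x100, 0, 0, 0, 0])
hi = pack_words([0xfe01, 0, 0, 0, 0, 0, 0, 0])
print(list(add9_clip(pred, lo, hi)))
```

In motion compensation terms, source 1 would be the predicted pixels and the 9-bit values the decoded residuals, which is why the result is clipped back to the pixel range.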
0x90 is the weirdest one left that uses only vector registers as the source: It uses $v(src1), $v(src2), and $v(src1|1) as inputs (where | is bitwise or). After finding that out, it's easy enough to test. Apparently it does linear interpolation (on 8-bit unsigned values): DST = SRC1 * SRC3 + SRC2 * (1 - SRC3). hwtest reveals it also has selectable shift (like multiplication) and rounding mode.
0x85, 0x95, and 0xb3 depend on scalar input. 0x[89ab]2 and 0xb6 are really scary: when a few are executed in a sequence, even with no other instructions in between and identical register state, they give different results. And the resulting vectors appear to be scaled DCT coefficients. This all means these ops are likely used for video transforms, and definitely use some hidden state as input and output. 0x[89ab]7 appear similar to them, but don't mutate the hidden state (they just read it). And 0xbb seems to read some hidden state, but it does not look like a DCT row.
To be continued.
Elapsed time: 8h.