Now it's time for either the hidden state instructions or the scalar -> vector instructions. Since I already have a failed attempt at the latter, let's try with the hidden state.

The first thing to note is an unknown instruction used in the clear all registers microcode sequence: 0x80000000. There's a high chance it clears our unknown state. We'll use it to fill the state with some deterministic values for testing.

For our first attempt, we'll use opcode 0x82 to read the state. This instruction is known to both read and modify the state, so if the read happens pre-mutate, we can test the behavior of the 0x80 opcode alone, and if it happens post-mutate, the behavior of both opcodes at once.

We'll use a script similar to the one from the s2v test to figure out which inputs affect the output of such an instruction pair. The result says that sources 1 and 2 of both instructions are used. Further testing determines that the instruction pair computes roughly src1_80 * src2_80 + src1_82 * src2_82. And by executing more 0x82 instructions, we can add more terms to the sum.
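
To make that concrete, here's a minimal C sketch of the behavior as observed so far. It's not hwtest code: `va_model`, `va_mul` and `va_mad` are made-up names, and treating the sources as signed 8-bit values in raw integer units is an assumption (at this point the match is only "roughly" right).

```c
#include <assert.h>
#include <stdint.h>

#define VA_LANES 16

/* Hypothetical model of the hidden state: one wide accumulator per lane. */
typedef struct {
    int32_t lane[VA_LANES];
} va_model;

/* 0x80-like behavior (assumed): overwrite the accumulator with src1 * src2. */
static void va_mul(va_model *va, int i, int8_t src1, int8_t src2) {
    assert(i >= 0 && i < VA_LANES);
    va->lane[i] = (int32_t)src1 * src2;
}

/* 0x82-like behavior (assumed): add src1 * src2 to what is already there. */
static void va_mad(va_model *va, int i, int8_t src1, int8_t src2) {
    assert(i >= 0 && i < VA_LANES);
    va->lane[i] += (int32_t)src1 * src2;
}
```

Calling `va_mul` once and then `va_mad` any number of times on the same lane builds up exactly the kind of sum of products seen in the test.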

So, our hidden state is an accumulator. 0x80 appears to work like 0x81, except it writes the result to the accumulator instead of stuffing it into $v, and 0x82 adds to the accumulator instead of overwriting it.

Before we can model that in hwtest, we need to figure out how big it is and how to read it. As for size, the accumulator should have 16 bits of fractional part (since that's what we can get by multiplying two unsigned numbers with 8-bit fractional part), and some unknown number of bits in the integer part. To figure out how many integer bits the accumulator has, we'll set the accumulator to 0 and then add 1 until it overflows. After writing such a program, we see the readout suddenly wrap to negative after 2048 adds. So we have 12 integer bits (sign bit included).
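
The arithmetic behind that conclusion can be checked on the host. The sketch below is a plain software model, not something run on the hardware: `wrap_signed` and `adds_until_negative` are hypothetical helpers, and the accumulator is assumed to be a two's-complement value with 16 fractional bits.

```c
#include <assert.h>
#include <stdint.h>

#define FRAC_BITS 16

/* Wrap v to a signed two's-complement value of the given width. */
static int64_t wrap_signed(int64_t v, int bits) {
    assert(bits >= 2 && bits <= 62);
    uint64_t mask = ((uint64_t)1 << bits) - 1;
    uint64_t u = (uint64_t)v & mask;
    if (u >> (bits - 1))
        return (int64_t)u - ((int64_t)1 << bits);
    return (int64_t)u;
}

/* Start from 0, keep adding 1.0, and count the adds until the value
 * goes negative. */
static int adds_until_negative(int int_bits) {
    int64_t acc = 0;
    int n = 0;
    while (acc >= 0) {
        acc = wrap_signed(acc + ((int64_t)1 << FRAC_BITS),
                          int_bits + FRAC_BITS);
        n++;
    }
    return n;
}
```

With this model, `adds_until_negative(12)` gives 2048, which matches the hardware observation; any other integer width would wrap at a different count.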

Now, how to read it. Attempting a simple read can only ever get the highest 16 significant bits of the accumulator - we can manipulate the shift and lo/hi setting, but if we attempt to read lower bits, the result will be clamped to -0x80 or 0x7f. So it's clear that we'll have to do a destructive read.
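
Here's a loose sketch of why the clamping gets in the way. It assumes a readout that takes an 8-bit signed window starting at some bit position and saturates when the value doesn't fit; `read_signed` and its `lo_bit` parameter are made up for illustration and ignore the real shift, lo/hi and rounding details.

```c
#include <assert.h>
#include <stdint.h>

/* Signed 8-bit readout of the accumulator bits starting at lo_bit,
 * saturating to [-0x80, 0x7f] (assumed model). */
static int8_t read_signed(int32_t acc, int lo_bit) {
    assert(lo_bit >= 0 && lo_bit < 31);
    /* Floor-shift without relying on >> of a negative value. */
    int32_t v = acc >= 0 ? acc >> lo_bit : ~(~acc >> lo_bit);
    if (v > 0x7f)
        return 0x7f;
    if (v < -0x80)
        return -0x80;
    return (int8_t)v;
}
```

In this model a low window only returns real data when everything above it is plain sign extension, which is why the value has to be brought into [0, 1) before the fractional bits can be read.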

The code I ended up using in hwtest follows:

    static void read_va(struct hwtest_ctx *ctx, uint32_t *va) {
        int i;
        /*
         * Read vector accumulator.
         *
         * This is a complex process, performed in seven steps for each component:
         *
         * 1. Prepare consts in vector registers: 0, -1, 0.5
         * 2. As long as the value is non-negative (as determined by signed readout),
         *    decrease it by 1.
         * 3. As long as the value is negative, increase it by 1.
         * 4. The value is now in [0, 1) range. Determine original integer part
         *    from the number of increases and decreases.
         * 5. Read out high 8 bits of the fractional part, using unsigned readout.
         * 6. Read out low 8 bits of the fractional part, using unsigned readout.
         * 7. Increase/decrease by 1 the correct number of times to restore original
         *    integer part.
         */
        for (i = 0; i < 16; i++) {
            /* prepare the consts */
            uint8_t consts[3][16] = { 0 };
            int j, k, l;
            consts[1][i] = 0x80;
            consts[2][i] = 0x40;
            for (j = 0; j < 3; j++) {
                for (k = 0; k < 4; k++) {
                    uint32_t val = 0;
                    for (l = 0; l < 4; l++)
                        val |= consts[j][k * 4 + l] << l * 8;
                    nva_wr32(ctx->cnum, 0xf000 + j * 4 + k * 0x80, val);
                }
            }
            /* init int part counter */
            int ctr = 0;
            /* while non-negative, decrement... */
            while (1) {
                /* read */
                nva_wr32(ctx->cnum, 0xf450, 0x82180000); /* $v3 = $va += 0 * 0, signed */
                nva_wr32(ctx->cnum, 0xf458, 1);
                uint8_t val = nva_rd32(ctx->cnum, 0xf00c + (i >> 2) * 0x80) >> (i & 3) * 8;
                /* if negative, break */
                if (val & 0x80)
                    break;
                nva_wr32(ctx->cnum, 0xf450, 0x82184406); /* $va += -1 * 0.5 */
                nva_wr32(ctx->cnum, 0xf458, 1);
                nva_wr32(ctx->cnum, 0xf450, 0x82184406); /* $va += -1 * 0.5 */
                nva_wr32(ctx->cnum, 0xf458, 1);
                ctr++;
            }
            /* while negative, increment... */
            while (1) {
                /* read */
                nva_wr32(ctx->cnum, 0xf450, 0x82180000); /* $v3 = $va += 0 * 0, signed */
                nva_wr32(ctx->cnum, 0xf458, 1);
                uint8_t val = nva_rd32(ctx->cnum, 0xf00c + (i >> 2) * 0x80) >> (i & 3) * 8;
                /* if non-negative, break */
                if (!(val & 0x80))
                    break;
                nva_wr32(ctx->cnum, 0xf450, 0x82184206); /* $va += -1 * -1 */
                nva_wr32(ctx->cnum, 0xf458, 1);
                ctr--;
            }
            /* read high */
            nva_wr32(ctx->cnum, 0xf450, 0x92180000); /* $v3 = $va += 0 * 0, unsigned */
            nva_wr32(ctx->cnum, 0xf458, 1);
            uint8_t fh = nva_rd32(ctx->cnum, 0xf00c + (i >> 2) * 0x80) >> (i & 3) * 8;
            /* read low */
            nva_wr32(ctx->cnum, 0xf450, 0x92180010); /* $v3 = $va += 0 * 0, unsigned, low */
            nva_wr32(ctx->cnum, 0xf458, 1);
            uint8_t fl = nva_rd32(ctx->cnum, 0xf00c + (i >> 2) * 0x80) >> (i & 3) * 8;
            /* write the result */
            va[i] = ctr << 16 | fh << 8 | fl;
            /* restore */
            while (ctr > 0) {
                nva_wr32(ctx->cnum, 0xf450, 0x82184206); /* $va += -1 * -1 */
                nva_wr32(ctx->cnum, 0xf458, 1);
                ctr--;
            }
            while (ctr < 0) {
                nva_wr32(ctx->cnum, 0xf450, 0x82184406); /* $va += -1 * 0.5 */
                nva_wr32(ctx->cnum, 0xf458, 1);
                nva_wr32(ctx->cnum, 0xf450, 0x82184406); /* $va += -1 * 0.5 */
                nva_wr32(ctx->cnum, 0xf458, 1);
                ctr++;
            }
        }
    }
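
Stripped of all the register pokes, the idea is easier to see on a plain integer. The sketch below is a host-side model only: `read_destructive` is a hypothetical name, the accumulator is assumed to be a signed fixed-point number with 16 fractional bits, and the sign test stands in for the signed readout.

```c
#include <assert.h>
#include <stdint.h>

#define ONE (1 << 16) /* 1.0 with 16 fractional bits */

/* Recover the full value of *acc using only "is it negative?" tests,
 * +/-1.0 steps, and reads of the fraction once it is in [0, 1). */
static int32_t read_destructive(int32_t *acc) {
    assert(acc);
    int ctr = 0;
    /* while non-negative, decrement */
    while (*acc >= 0) {
        *acc -= ONE;
        ctr++;
    }
    /* while negative, increment */
    while (*acc < 0) {
        *acc += ONE;
        ctr--;
    }
    /* now 0 <= *acc < ONE: read both halves of the fraction */
    int32_t fh = (*acc >> 8) & 0xff;
    int32_t fl = *acc & 0xff;
    int32_t result = ctr * ONE + fh * 256 + fl;
    /* restore the original integer part */
    for (; ctr > 0; ctr--)
        *acc += ONE;
    for (; ctr < 0; ctr++)
        *acc -= ONE;
    return result;
}
```

The first loop always overshoots by one step and the second loop undoes it, so `ctr` ends up equal to the floor of the original value, for both positive and negative inputs.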

And after adding that to hwtest, a lot of instructions thought to be nops turn out to actually modify $va state. In fact, all opcodes other than 0xbf are used in the vector opcode space. Oh, and 0x[89ab]1 opcodes write $va in addition to writing the $vX register - it seems all multiplication must go through the accumulator.

So let's determine the exact behavior of that thing, starting with opcode 0x81. We'll have to figure out how much processing happens before the "save to accumulator" stage, and how much happens on the "read accumulator to $vX" path.

So, here are the results:

- Shift and high/low selection is performed on $va -> $vX path
- The rounding correction is, strangely, added to the accumulator, even though it depends on the shift, high/low selection, and signedness of readout
- Using the "integer multiplication" mode (as opposed to fractional multiplication) causes the result to be stuffed into bits 8+ of the accumulator (i.e. it uses the high 8 bits of the fractional part and the low 8 bits of the integer part). The readout uses the same bit range.
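
The last point can be sketched as follows. This only models where the raw product lands in the accumulator; `va_place` is a made-up helper, the rounding correction from the second point is deliberately left out, and putting the fractional-mode product at bit 0 is an assumption based on the 16-bit fractional part.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Position an 8-bit x 8-bit product inside the accumulator.
 * Fractional mode: product occupies the 16 fractional bits (bit 0 up).
 * Integer mode: product is moved up to bit 8 and above. */
static int32_t va_place(int32_t product, bool integer_mode) {
    /* any signed or unsigned 8x8 product fits in this range */
    assert(product >= -0x10000 && product <= 0xffff);
    return integer_mode ? product * 256 : product;
}
```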

After implementing that, 0x[89ab]1 passes again. So does 0x80. 0x[89ab]2 also pass, and are just versions of 0x[89ab]1 that add to the accumulator instead of replacing it. And 0x83 is likewise the add version of 0x80.

There are a few more instructions with similar behavior:

- Opcode 0xa0 is the immediate version of 0x80
- Opcode 0x93 is the unsigned version of 0x83
- Opcode 0xa3 is the immediate version of 0x83
- Opcode 0xb0 is the immediate and unsigned version of 0x80... but it has the immediate in the low 8 bits instead of bits 0 and 9-14. Just like scalar ops 0x22/0x32. Looks like a bug.

All other remaining instructions read stuff from scalar unit, and all but one (0xb3) mess with $va. We'll deal with them later.

And, one last thing: the current hwtest code (91aa21f6e0d93a47561ec5a64caad9fb6d9bd4a0) has a minor bug in the simulation of mul-add instructions, but the test passes just fine. Do you know what it is and what problem with the testing setup makes it invisible? Do you know how to fix the setup to expose the bug?

Elapsed time: 8h.
