18: hwtest improvements

hwtest is a great tool, but its current implementation for VP1 has a problem I mentioned last time: instead of randomizing $c, $vc, and $va state on each run, the state from the previous run is kept, since we have no easy way of writing it. While that's not a problem for $c and $vc, since these are relatively evenly distributed when generated by random instructions, it hides a real bug for $va: while we've determined the accumulator size to be 28 bits, the code treats it as if it had 32 bits, and it passes since the overflow behavior almost never comes up. In fact, to cause overflow, hwtest would have to come up with mul-add instructions a dozen times or so, have them all pull in one direction (up or down), and have no intervening mul instructions (which would reset the accumulator). Time to change that and add write functions for these registers.

Writing $c is tricky and not entirely possible: flags 11, 12, and 14 can never be set at all, and flag 15 must always be set. Flags 8 and 9 cannot both be set at once: they're always set according to the result of an address addition or a logic operation, and such a result cannot be both 0 and negative. Likewise, flag 1 (scalar 0 flag) being set implies flags 0, 2, and 4-7 are not set. And flags 2 and 6 are tied together. However, for $c values satisfying these conditions, it's easy enough to figure out the right opcodes to run and the right operands for them to set $c accordingly. So, in hwtest, we'll just roll a 16-bit number, apply some fixups to make it an acceptable $c value, then run these operations to set $c.
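
As a rough illustration of the fixup step, here is a minimal sketch. The name fixup_c is hypothetical and this is not the actual hwtest code. It also makes two assumptions the text above doesn't settle: when flags 8 and 9 are both rolled, flag 9 is the one dropped (an arbitrary choice), and "tied together" is taken to mean flags 2 and 6 always have the same value.

```c
#include <stdint.h>

/* Hypothetical helper: coerce a random 16-bit value into a $c value
   that satisfies the constraints listed above. */
static uint16_t fixup_c(uint16_t c) {
    unsigned v = c;
    v &= ~0x5800u;              /* flags 11, 12, 14 can never be set */
    v |= 0x8000u;               /* flag 15 is always set */
    if ((v & 0x0300u) == 0x0300u)
        v &= ~0x0200u;          /* flags 8 and 9 exclude each other;
                                   dropping flag 9 is an arbitrary choice */
    if (v & 0x0002u)
        v &= ~0x00f5u;          /* flag 1 excludes flags 0, 2, and 4-7 */
    /* assumption: "tied" means equal, so copy flag 2 into flag 6 */
    v = (v & ~0x0040u) | ((v & 0x0004u) << 4);
    return (uint16_t)v;
}
```

For example, under these assumptions a roll of 0xffff comes out as 0xa50a. Picking the opcodes and operands that actually produce the fixed-up value on hardware is a separate step not shown here.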

$vc, on the other hand, can be set to an arbitrary value by using the median instruction: the zero flag is set if the median is 0, and the sign flag is set if the operands are not in the "canonical" order. Both are quite easy to control.

As for writing $va, it's quite clear this has to be done in several steps, accumulating the final value from partial values. It'll probably be easiest to split the value into several bitfields.

  • To write bits 0-7, just use unsigned fractional multiplication of 0x01 and the value.
  • To write bits 8-15, likewise use unsigned integer multiplication of 0x01 and the value.
  • Writing the highest bits in a single instruction is impossible. Instead, we'll aim at the highest bits possible (5 bits short) and repeat the instruction 32 times, so that the final added value is 32 times the multiplication result and thus ends up shifted to the correct position. In other words, to write bits 20-27, use unsigned integer multiplication of 0x80 and the value, repeated 32 times.
  • To write the remaining bits 16-19, use unsigned integer multiplication of 0x10 and the value shifted left by 4 bits.
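
To sanity-check the arithmetic of this decomposition, here is a small host-side model. It is only a sketch: mul_fract, mul_int, and model_write_va are made-up names, and the model assumes (as the first two bullets suggest) that an unsigned fractional multiply contributes a * b to the accumulator while an unsigned integer multiply contributes the same product shifted left by 8 bits. It ignores everything else the hardware does.

```c
#include <stdint.h>

#define VA_MASK 0x0fffffffu  /* 28-bit accumulator, per the text */

/* Assumed model of the two multiply flavours (not taken from hwtest). */
static uint32_t mul_fract(uint32_t a, uint32_t b) { return a * b; }
static uint32_t mul_int(uint32_t a, uint32_t b) { return (a * b) << 8; }

/* Build a 28-bit value from the four partial products in the list above. */
static uint32_t model_write_va(uint32_t target) {
    uint32_t va;
    int i;
    target &= VA_MASK;
    va = mul_fract(target & 0xff, 0x01);                /* bits 0-7 */
    va += mul_int((target >> 8) & 0xff, 0x01);          /* bits 8-15 */
    va += mul_int(((target >> 16) & 0xf) << 4, 0x10);   /* bits 16-19 */
    for (i = 0; i < 32; i++)                            /* bits 20-27 */
        va += mul_int((target >> 20) & 0xff, 0x80);
    return va & VA_MASK;
}
```

In this model, 0x80 times the value shifted left by 8 lands at bit 15, and 32 repetitions supply the remaining 5 bits of shift, so model_write_va(x) returns x for any 28-bit x.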

The final code looks like this:

static void write_va(struct hwtest_ctx *ctx, uint32_t *va) {
    uint8_t a[16], b[16];
    int i;
    /* bits 0-7 */
    for (i = 0; i < 16; i++) {
        a[i] = va[i] & 0xff;
        b[i] = 1;
    }
    write_v(ctx, 0, a);
    write_v(ctx, 1, b);
    nva_wr32(ctx->cnum, 0xf450, 0x80000200); /* $va = a * b, fract, unsigned */
    nva_wr32(ctx->cnum, 0xf458, 1);
    /* bits 8-15 */
    for (i = 0; i < 16; i++) {
        a[i] = va[i] >> 8 & 0xff;
    }
    write_v(ctx, 0, a);
    nva_wr32(ctx->cnum, 0xf450, 0x83000208); /* $va += a * b, int, unsigned */
    nva_wr32(ctx->cnum, 0xf458, 1);
    /* bits 16-19 */
    for (i = 0; i < 16; i++) {
        a[i] = (va[i] >> 16 & 0xf) << 4;
        b[i] = 0x10;
    }
    write_v(ctx, 0, a);
    write_v(ctx, 1, b);
    nva_wr32(ctx->cnum, 0xf450, 0x83000208); /* $va += a * b, int, unsigned */
    nva_wr32(ctx->cnum, 0xf458, 1);
    /* bits 20-27 */
    for (i = 0; i < 16; i++) {
        a[i] = va[i] >> 20 & 0xff;
        b[i] = 0x80;
    }
    write_v(ctx, 0, a);
    write_v(ctx, 1, b);
    for (i = 0; i < 32; i++) {
        nva_wr32(ctx->cnum, 0xf450, 0x83000208); /* $va += a * b, int */
        nva_wr32(ctx->cnum, 0xf458, 1);
    }
}

It's easy enough to verify it works - our existing hwtest $va readout function would raise a fuss otherwise.

But our current $va readout is problematic: it basically increments/decrements each accumulator component until it reaches 0 and is thus horribly slow. Before $va write support, this took 95% of hwtest execution time and slowed it down to 131 tests per second. With $va write support, the $va values no longer fall mostly near 0, and $va readout correspondingly takes much longer, slowing hwtest to a ridiculously low 25 tests per second. It's clear we need a better way to read $va.

Luckily, constant-time readback is easily possible. It has to be done destructively, however: while we can manipulate the shifts and flags to control which bits of the accumulator we read, we can't aim the readout window lower than 16 bits below the highest set bit (or rather, the highest bit that differs from the sign) of the accumulator without triggering the clipping behavior. So it has to be done like this:

  1. Read bits 20-27 of the accumulator, via signed integer readout with the biggest right shift possible.
  2. Read bits 12-19 likewise, but with the "low part" flag set.
  3. Use the knowledge thus gained to neutralize bits 16-27 of the accumulator by subtracting the known value.
  4. Read remaining bits via unsigned fractional readout with no shift, high/low parts.
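
The bookkeeping behind these steps can be sketched on the host as follows. This is a simplified model under loud assumptions, not a description of the hardware: model_read_va is a made-up name, the accumulator is treated as a plain 28-bit value that wraps, the readout windows are modelled as simple bit extractions, and the clipping behavior that motivates the ordering is not modelled at all.

```c
#include <stdint.h>

#define VA_MASK 0x0fffffffu  /* 28-bit accumulator, per the text */

/* Model of the destructive readback: acc is the accumulator content. */
static uint32_t model_read_va(uint32_t acc) {
    uint32_t va, hi, nib;
    int i;
    acc &= VA_MASK;
    /* steps 1-2: two 8-bit windows, bits 20-27 and bits 12-19 */
    va = ((acc >> 20) & 0xff) << 20 | ((acc >> 12) & 0xff) << 12;
    /* step 3a: cancel bits 20-27 by adding the two's complement of that
       byte, times 0x80, shifted by 8, 32 times (wraps past bit 27) */
    hi = va >> 20;
    for (i = 0; i < 32; i++)
        acc += (((0x100 - hi) & 0xff) * 0x80u) << 8;
    /* step 3b: cancel bits 16-19; modelled as a plain subtraction */
    nib = (va >> 16) & 0xf;
    acc -= ((nib << 4) * 0x10u) << 8;
    acc &= VA_MASK;
    /* step 4: whatever is left fits in the low 16 bits */
    va |= acc & 0xffff;
    return va;
}
```

Note that bits 12-15 are seen twice, once in the second window and once in the final low read. Since step 3 leaves them untouched, OR-ing both copies together is harmless.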

The final code:

static void read_va(struct hwtest_ctx *ctx, uint32_t *va) {
    uint8_t a[16], b[16];
    int i;
    for (i = 0; i < 16; i++) {
        a[i] = 0;
    }
    write_v(ctx, 0, a);
    /* read bits 12-27 */
    nva_wr32(ctx->cnum, 0xf450, 0x82080088);
    nva_wr32(ctx->cnum, 0xf458, 1);
    nva_wr32(ctx->cnum, 0xf450, 0x82100098);
    nva_wr32(ctx->cnum, 0xf458, 1);
    read_v(ctx, 1, a);
    read_v(ctx, 2, b);
    for (i = 0; i < 16; i++) {
        va[i] = a[i] << 20 | b[i] << 12;
    }
    /* neutralize bits 20-27 */
    for (i = 0; i < 16; i++) {
        a[i] = -(va[i] >> 20) & 0xff;
        b[i] = 0x80;
    }
    write_v(ctx, 1, a);
    write_v(ctx, 2, b);
    for (i = 0; i < 32; i++) {
        nva_wr32(ctx->cnum, 0xf450, 0x83004408); /* $va += a * b, int */
        nva_wr32(ctx->cnum, 0xf458, 1);
    }
    /* and bits 16-19 */
    for (i = 0; i < 16; i++) {
        a[i] = (va[i] >> 16 & 0xf) << 4;
        b[i] = 0xf0;
    }
    write_v(ctx, 1, a);
    write_v(ctx, 2, b);
    nva_wr32(ctx->cnum, 0xf450, 0x8300440a); /* $va += a * b, int */
    nva_wr32(ctx->cnum, 0xf458, 1);
    /* now read low 16 bits */
    nva_wr32(ctx->cnum, 0xf450, 0x92080000);
    nva_wr32(ctx->cnum, 0xf458, 1);
    nva_wr32(ctx->cnum, 0xf450, 0x92100010);
    nva_wr32(ctx->cnum, 0xf458, 1);
    read_v(ctx, 1, a);
    read_v(ctx, 2, b);
    for (i = 0; i < 16; i++) {
        va[i] |= a[i] << 8 | b[i];
    }
}

It works just fine (verified via the known-good $va write code above), and bumps hwtest speed to a respectable 3700 tests per second. The bottleneck is now PCIe read latency for context readback, and it cannot be easily improved since we actually need all of this state.

Elapsed time: 3h.
