19: the scalar -> vector path


The only unknowns left in the scalar and vector units involve the scalar to vector bus. Time to deal with it, starting with 0x85 and 0x95, together with the 0x0f scalar opcode.

What we learned before from manual testing is that the instruction looks like dst = src1 * (sc1 + sc2) + src2, with the usual multiplication flags, where sc1 is bits 0-7 of scalar source, and sc2 is bits 16-23. Opcode bit 0 is known to change the instruction in a scary way, so we always unset it. However, we get test failures.

A bit of experimentation reveals that the failures mostly disappear whenever sc2 is 0, or vector src1 is an odd register. This probably means that src1 actually selects an input pair (with |1 behavior again), and the second factor affects the extra input.

After fixing that, there are fewer failures, and the ones left only affect some of the vector components. This must mean one of the inputs is a component mask. There are two obvious guesses to make here:

  1. The mask input comes from $vc by the same process as in 0x8f opcode - after all, the s2v ops are known to select it there.
  2. When the bit mask is set, the factor used comes from bits 8-15 (or 24-31) of the scalar instead.

Both turn out to be true, and there are no failures anymore. Let's deal with opcode bit 0 now. Instead of multiplying the sources by a factor, it seems to either add the full source to the result or not, depending on some condition varying between vector components. And that condition comes from the scalar source treated as a mask: bits 0-15 select components for source 1 addition, bits 16-31 select components for source 2.
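The two readings of the scalar word described so far can be sketched in a few lines of Python. This only models the selection logic, not the multiplication itself. The function names and the `vc_bit` argument (the per-component bit taken from $vc) are made up for illustration, and the factors are treated as plain bytes.

```python
def select_factors(scalar, vc_bit):
    """Pick the two 8-bit factors used for one vector component.

    vc_bit is the per-component mask bit taken from $vc (hypothetical name).
    """
    if vc_bit:
        sc1 = (scalar >> 8) & 0xff    # bits 8-15
        sc2 = (scalar >> 24) & 0xff   # bits 24-31
    else:
        sc1 = scalar & 0xff           # bits 0-7
        sc2 = (scalar >> 16) & 0xff   # bits 16-23
    return sc1, sc2


def select_masks(scalar):
    """With opcode bit 0 set, the same word is read as two 16-bit masks."""
    mask1 = scalar & 0xffff           # components that get source 1 added
    mask2 = (scalar >> 16) & 0xffff   # components that get source 2 added
    return mask1, mask2


word = 0x44332211
print([hex(f) for f in select_factors(word, 0)])
print([hex(f) for f in select_factors(word, 1)])
print([hex(m) for m in select_masks(word)])
```

Pairing bits 8-15 with sc1 and bits 24-31 with sc2 is my reading of the second guess above.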

This is again quickly verified by hwtest. So, the scalar output may be treated as 4 factors or as 2 masks. Interesting. Let's look at the other scalar ops now. To make life easier, we'll make a script running a scalar instruction and dumping factors & masks it emits.

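# b0 is the register-access handle for the card under test; it is assumed
# to be set up elsewhere and is not defined in this snippet.
def get_factors(op):
    res = [0] * 4
    # set $vc0 to 0xXXXXXXXa
    b0.wr32(0xf000, 0x80008000)
    b0.wr32(0xf450, 0x0c000000)
    b0.wr32(0xf458, 1)
    # prepare source 1
    b0.wr32(0xf000, 0x00000101)
    b0.wr32(0xf004, 0x01010000)
    # prepare source 2
    b0.wr32(0xf008, 0x00000000)
    # fire
    b0.wr32(0xf44c, op)
    b0.wr32(0xf450, 0x85180416)
    b0.wr32(0xf458, 1)
    # first pass: byte x of the result word goes into bits 0-7 of res[x]
    for x in range(4):
        res[x] = (b0.rd32(0xf00c) >> (8 * x)) & 0xff
    b0.wr32(0xf450, 0x82188400)
    b0.wr32(0xf458, 1)
    # second pass: byte x of the result word goes into bits 8-15 of res[x]
    for x in range(4):
        res[x] |= ((b0.rd32(0xf00c) >> (8 * x)) & 0xff) << 8
    return res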

The first opcode to look at is 0x24. It goes quickly: it has two 9-bit signed immediate operands sent as factors 0&1 and 2&3, respectively. The masks consist of high 8 bits of the corresponding immediate, repeated twice.
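As a rough model of what 0x24 emits, here is a small Python sketch. The immediates are passed in as raw 9-bit values because their position in the instruction word is not covered here, and the names `sign_extend` and `s2v_0x24` are invented for this example.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def s2v_0x24(imm1, imm2):
    """Model of the factors and masks emitted for two raw 9-bit immediates."""
    f01 = sign_extend(imm1, 9)   # sent as factors 0 and 1
    f23 = sign_extend(imm2, 9)   # sent as factors 2 and 3
    factors = [f01, f01, f23, f23]
    # each mask is the high 8 bits of the 9-bit immediate, repeated twice
    masks = [(((imm & 0x1ff) >> 1) & 0xff) * 0x0101 for imm in (imm1, imm2)]
    return factors, masks


factors, masks = s2v_0x24(0x100, 0x0ff)
print(factors, [hex(m) for m in masks])
```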

Now let's try 0x45, since we already know it a bit: it right-shifts its source by 4, suggesting the s2v output could be determined by the 4 bits shifted out. Sure enough, the first mask it emits consists of each of the 4 bits repeated 4 times (e.g. 0xa -> 0xf0f0). The first two factors are bits 0-7 and 8-15 of the mask, shifted left by 1 bit. Curious - this means factors are at least 10 bits wide for some reason. The second mask and factors 2 and 3 are just 0.
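The 0x45 behaviour is simple enough to model directly. The sketch below follows the description above; `s2v_0x45` is a made-up name, and the function takes the scalar source value before the shift.

```python
def s2v_0x45(src):
    """Model of the s2v output of 0x45 for a given scalar source value."""
    low = src & 0xf                      # the 4 bits shifted out
    mask0 = 0
    for bit in range(4):
        if (low >> bit) & 1:
            mask0 |= 0xf << (4 * bit)    # repeat each bit 4 times
    factors = [(mask0 & 0xff) << 1,      # factor 0: mask bits 0-7, shifted left
               (mask0 >> 8) << 1,        # factor 1: mask bits 8-15, shifted left
               0, 0]
    masks = [mask0, 0]
    return factors, masks


factors, masks = s2v_0x45(0xa)
print([hex(f) for f in factors], [hex(m) for m in masks])
```

With all 4 low bits set, factors 0 and 1 come out as 0x1fe, which does not fit in 9 bits with sign.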

So far, all tested instructions have had tied factors and masks: mask 0 is made of bits 1-8 of factors 0 and 1, and mask 1 likewise for factors 2 and 3. Since this happens even for instructions where it doesn't make sense (0x45), it's likely the hw lines are just shared. We have no upper estimate on factor size yet, but since -0x100 and 0x1fe are both possible, the lower limit is 10 bits with sign, 8 fractional bits.
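The tie between factors and masks, and the width estimate, can be checked with a short Python sketch. The helper names are invented, and putting the lower-numbered factor in the low byte of the mask is an assumption based on the 0x45 output.

```python
def mask_from_factors(f_lo, f_hi):
    """Build a 16-bit mask from bits 1-8 of two factors."""
    lo = (f_lo >> 1) & 0xff
    hi = (f_hi >> 1) & 0xff
    return lo | (hi << 8)


def min_signed_bits(value):
    """Smallest two's complement width that can hold value."""
    if value >= 0:
        return value.bit_length() + 1
    return (~value).bit_length() + 1


print(hex(mask_from_factors(0x1e0, 0x1e0)))     # the 0x45 example for 0xa
print(hex(mask_from_factors(-0x100, -0x100)))   # 0x24 with the most negative immediate
print(max(min_signed_bits(-0x100), min_signed_bits(0x1fe)))
```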

The only unknown s2v instruction used by microcode still left to RE is 0x04. Running our new script on it produces junk: the outputs are too complex to be computed from the input. Somehow, we have to clear $r3 to 0 to make output simple, even though the opcode encoding could only select $r0!

Clearly, something strange is going on again with register index mangling. To be safe, we'll clear up to $r7 and point the source bitfields to $r0 and $r4. Assuming no more than 2 bits are mangled, this should be safe.

My findings from messing with the inputs, in order:

  1. When $r0 (source 1) is 0, the factors & mask are determined from $r5 (source 2 + 1) like in 0x0f.
  2. Setting $r0 bits 0-10 and 19-31 makes no difference.
  3. Bits 11-18 of $r0 are treated as an 8-bit unsigned fraction, multiplied by $r7 treated as a 4-byte vector of signed fractions, and added to $r5. The result is then used for determining factors and, in turn, masks.
  4. The conversion of the full multiplication result to the factor with 8 fractional bits involves round to nearest, with ties rounded up.
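Two pieces of the findings above are easy to pin down in Python: the field taken from source 1, and the rounding mode. This is only a sketch with invented names. It assumes "rounded up" means toward positive infinity, and it leaves the number of dropped bits as a parameter because the exact fixed-point alignment of the product is not spelled out here.

```python
def src1_fraction(r0):
    """Bits 11-18 of source 1, read as an 8-bit unsigned fraction."""
    return (r0 >> 11) & 0xff


def round_ties_up(value, drop_bits):
    """Drop low bits, rounding to nearest with ties toward +infinity.

    Python's >> floors on negative integers, so adding half of the
    dropped range first gives the desired rounding for both signs.
    """
    return (value + (1 << (drop_bits - 1))) >> drop_bits


# bits 0-10 and 19-31 make no difference to the extracted field
assert src1_fraction(0x5a << 11) == src1_fraction((0x5a << 11) | 0xfff807ff)

print(round_ties_up(3, 1), round_ties_up(-3, 1))   # 1.5 and -1.5 with 1 bit dropped
```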

We haven't seen actual source 2 ($r4) used, nor $r6 for that matter. This suggests they may be selected by $c flags, like for some other scalar instructions. And sure enough, unsetting $c0 bit 0 selects $r4/$r6 instead. Looking closer, it seems bits 3-8 of the opcode determine the $c register and the bit used to select the register. Like in normal scalar instructions, flag 4 is special-cased and used together with flag 5 to mangle two bits of register index. However, this scheme differs a bit from the usual one: the flag or flags are ORed into the register index, not XORed/added modulo 4. After implementing all that, hwtest fully passes for opcode 0x04.
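To see why pointing the source bitfields at $r0 and $r4 was a safe choice, here is a Python sketch comparing the three candidate mangling schemes. The function names are invented, and `sel` stands for the 0-3 value built from the selected flag or flags; how exactly the flags map onto its two bits is not modelled.

```python
def mangle_or(index, sel):
    """The scheme found for 0x04: flags are ORed into the index."""
    return index | sel


def mangle_xor(index, sel):
    return index ^ sel


def mangle_add_mod4(index, sel):
    """Add sel to the low 2 bits of the index, wrapping modulo 4."""
    return (index & ~3) | ((index + sel) & 3)


for base in (4, 5):
    for sel in range(4):
        print(base, sel,
              mangle_or(base, sel),
              mangle_xor(base, sel),
              mangle_add_mod4(base, sel))
```

When the low two bits of the base index are 0, all three schemes pick the same register, so they only become distinguishable with other base registers.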

That covers all s2v instructions used by the microcode, but hwtesting opcode 0x8f revealed another instruction that sends the $vc selection signals: 0x05. So it very likely outputs something interesting as factors and masks too.

The instruction looks similar to 0x04, but with one obvious difference: only components 0 and 2 of the 4-byte vectors are used: factor 0 is duplicated as factor 1, and factor 2 as factor 3. However, that's not enough to make hwtest pass.

After a lot of tedious binary searching by masking various bits of the state, we learn of two more differences:

  • the multiplication factor from source 1 is now one bit shorter: it takes bits 11-17 instead of 11-18
  • if the $c bit selected for input is bit 2, and bit *7* of $c is set, the instruction uses components 1 and 3 of the inputs instead of 0 and 2.

Strange. But it makes hwtest pass perfectly.
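The differences between 0x05 and 0x04 can be summed up in a short Python sketch. The names are invented; `vec` stands for the 4-component value that 0x04 would have turned into 4 separate factors, `c_value` is the contents of the selected $c register, and `sel_bit` is the index of the $c bit picked by the opcode.

```python
def src1_fraction_0x05(r0):
    """Bits 11-17 of source 1: one bit shorter than in 0x04."""
    return (r0 >> 11) & 0x7f


def pick_components(vec, c_value, sel_bit):
    """Duplicate two of the four components into all four factor slots."""
    use_odd = sel_bit == 2 and (c_value >> 7) & 1
    a, b = (vec[1], vec[3]) if use_odd else (vec[0], vec[2])
    return [a, a, b, b]


vec = [10, 11, 12, 13]
print(pick_components(vec, 0x00, 2))
print(pick_components(vec, 0x80, 2))
print(pick_components(vec, 0x80, 0))
```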

Elapsed time: 8h.
