3 years, 3 months ago
There are still two obscure cases disabled in hwtest that should be dealt with:
- using scalar op 0x04 or 0x05 (bvecmad and bvecmadsel) together with a store from a $r register - the stored data doesn't match
- s2v data emitted by instructions other than the ones dedicated to that purpose (0x04, 0x05, 0x24, 0x45, 0x0f)
In the first case, we can easily determine that it's the third scalar input that ends up stored instead of the $r register selected by the store instruction. This means it's a read port conflict, like the other conflicts already seen on stores. Unfortunately, we cannot easily add that to hwtest, as the scalar instruction (which computes the third source selection) is simulated after the address instruction. It's not easy to swap the order either - there's also a dependency the other way around. This needs a refactor.
The refactored simulator will work in phases, like a real processor:
- Decode - the instruction words are cut into bitfields, the pipeline configuration is prepared, the register read/write ports are aimed as much as possible.
- Preread - because the configuration of some register read/write ports depends on $c values, they are read separately in this phase.
- Adjust - the final aiming of read/write ports, as calculated from $c values. Port conflicts are taken into account here.
- Read - the register files are read.
- Execute - all major computations are performed here, results are prepared.
- EA - effective address for each data store bank is computed.
- Memory - load/store is performed.
- Write - the results are written to the register files.
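The phase ordering above can be illustrated with a toy model. This is only a sketch, not the actual hwtest code: the function name `simulate_store`, its parameters and the register-file layout are all hypothetical, and the third source index (which really depends on $c) is simply passed in. It shows why aiming the shared $r read port in a separate Adjust phase, after every unit has been decoded, lets the 0x04/0x05 conflict be modelled.

```python
PHASES = ("decode", "preread", "adjust", "read",
          "execute", "ea", "memory", "write")


def simulate_store(scalar_op, scalar_src3, store_reg, regs, addr=0):
    """Toy model of one bundle: returns (memory contents, phases run)."""
    trace, mem = [], {}

    # Decode: aim the shared $r read port at the store's register.
    trace.append("decode")
    port = store_reg

    # Preread: a real simulator would read $c here; this toy model takes
    # the already-computed third source index as a parameter instead.
    trace.append("preread")

    # Adjust: scalar ops 0x04/0x05 claim the same port for their third
    # source and win the conflict (assumed from the observed mismatch).
    trace.append("adjust")
    if scalar_op in (0x04, 0x05):
        port = scalar_src3

    # Read: the register file is read through the port as finally aimed.
    trace.append("read")
    store_data = regs[port]

    # Execute: scalar/vector computations would happen here (omitted).
    trace.append("execute")

    # EA: the effective address is taken as given in this sketch.
    trace.append("ea")
    ea = addr

    # Memory: perform the store.
    trace.append("memory")
    mem[ea] = store_data

    # Write: no register results in this toy model.
    trace.append("write")
    return mem, trace
```

With `regs[3] = 111` and `regs[5] = 222`, a store of register 5 paired with scalar op 0x04 whose third source is register 3 stores 111 in this model, while any other scalar op stores 222.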
The refactor was mostly uneventful, but some things are clearer now:
- there are apparently 3 $r read ports (2 for usual scalar instructions + 1 for store/mov to sr/extra for 0x04/0x05)
- likewise, 4 $v read ports (3 for vector instructions + 1 for store/mov from sr)
- likewise, 3 $a read ports (2 for address instructions + 1 for mov from sr)
- there seem to be 2 $a write ports (1 for address unit, 1 for mov to sr)
- likewise, 2 $r write ports (1 for scalar unit, 1 for load)
- likewise, 3 $v write ports (1 for scalar unit, 1 for load, 1 for vector unit), and $vx can be written independently of them by the load instruction
- there are 6 ways of routing vector opcode bitfields and $c-based adjustment to $v read ports:
  - SRC1, SRC2, SRC3
  - SRC1, SRC2, SRC1|1
  - SRC1, mangled SRC2, SRC1|1
  - SRC1Q, SRC1Q, SRC1Q (
  - SRC1Q, SRC2, SRC1Q
  - SRC1Q, $vx, SRC1Q
- The order in each of the above triples minimises the number of distinct MAD pipe configurations.
- MAD can do either A + B * C or A + B * F1 + D * F2, where F1 and F2 are s2v factors or 0/0x100 from s2v mask
  - A can be: 0, $va, s2, s2/2, s3 (maybe xored with 0x80)
  - B is either s1 or s1-s3
  - C is always s2
  - D is either s3 or s2-s3
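The two MAD forms and the B/D operand choices can be written out as plain arithmetic. This is a minimal sketch under stated assumptions: the function names are hypothetical, and rounding, shifting, saturation, the choice of A and the optional xor with 0x80 are all left out, since the notes above don't specify them.

```python
def mad_simple(a, b, c):
    """First MAD form: A + B * C."""
    return a + b * c


def mad_factors(a, b, d, f1, f2):
    """Second MAD form: A + B * F1 + D * F2.

    f1 and f2 stand for s2v factors, or 0/0x100 taken from the s2v mask.
    """
    return a + b * f1 + d * f2


def select_b(s1, s3, subtract):
    """B is either s1 or s1 - s3."""
    return s1 - s3 if subtract else s1


def select_d(s2, s3, subtract):
    """D is either s3 or s2 - s3."""
    return s2 - s3 if subtract else s3
```

For example, `mad_factors(1, 2, 3, 0x100, 0)` evaluates to 1 + 2 * 0x100 = 513 in this unscaled model; any scaling applied to the products is not modelled here.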
With the refactor, hwtest now passes with the first point enabled. Time for the second point, s2v outputs of all scalar instructions. Without further ado:
- All opcodes >= 0x40 work like 0x45: the factor outputs are generated from masks, which are in turn derived from the low 4 bits of $r source 1. This includes even opcodes that you wouldn't suspect of using $r source 1, like the scalar nop instruction, or the mov from special register instruction, or mov from immediate (where the immediate collides with the source 1 bits).
- Most opcodes < 0x40 output all-zeros, with the exception of 0x[0-3][0-3] (bmul instruction and some unknown opcodes), 0x[1-3]f, 0x14, 0x15, 0x06, 0x07, 0x3[4-7].
- The bmul instructions (0x[0-3]1) output their results into the corresponding factors, treating them as having 8 fractional bits. This gives precision identical to ordinary unsigned output, and 1 bit better than ordinary signed output.
- Opcodes 0x[0-3]0 behave like the bmul instruction for factor output, except they don't have the usual $r output and ignore the rounding flag.
- Opcodes 0x[0-3]2 (evil twins of bmul) work as above, except the factors are treated as having 16 fractional bits. Given that they're only 10 bits long, it's very likely that the high bits will be discarded (and mad input sign-extended from 10 bits). Perhaps it should be considered as integer multiplication instead...
- Opcodes 0x[0-3]3 work like above, except again there's no $r output.
- Opcodes 0x06, 0x07, 0x1[4-7], 0x3[4-7] work like above, but there are no flags in the lower opcode bits - they're all assumed to be 0.
- 0x1f, 0x2f, 0x3f work like above, but 0x1f uses mangled source 2, and they set the selected $c register to 0.
- For all opcodes except 0x04, 0x05, 0x0f, 0x24 and 0x45, the vmul/vmad/vmac/... instructions making use of $vc input use the one selected by the low bits of the vector opcode, like vcmpad and the various vlrp* instructions. This collides with flag bits. Tough luck.
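The opcode classes from the list above can be summarised as a small lookup. This is only a sketch: the function name and the class labels are made up for illustration, and where the list is ambiguous (0x16 and 0x17 are absent from the exception list but covered by the 0x1[4-7] bullet) the code follows the wider range as an assumption.

```python
# Opcodes dedicated to producing s2v data, per the notes above.
DEDICATED = {0x04, 0x05, 0x0f, 0x24, 0x45}


def s2v_class(op):
    """Return a hypothetical label for the s2v behaviour of a scalar opcode."""
    if op in DEDICATED:
        return "dedicated"
    if op >= 0x40:
        # Factors come from masks derived from the low 4 bits of $r source 1.
        return "mask-from-src1"
    hi, lo = op >> 4, op & 0xf
    if lo in (0, 1):
        # bmul (0x[0-3]1) and its no-$r-output twin (0x[0-3]0).
        return "factors-frac8"
    if lo in (2, 3):
        # The "evil twins" of bmul, with and without $r output.
        return "factors-frac16"
    if op in (0x1f, 0x2f, 0x3f):
        # Also set the selected $c register to 0.
        return "factors-clear-c"
    if op in (0x06, 0x07) or (hi in (1, 3) and 4 <= lo <= 7):
        # Assumption: 0x16/0x17 are included, following the 0x1[4-7] bullet.
        return "factors-no-flags"
    return "zero"
```

For instance, `s2v_class(0x40)` gives `"mask-from-src1"` and `s2v_class(0x08)` gives `"zero"`.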
Quite hard to tell which of these behaviors are intentional, and which are accidental.
At least we got hwtest to pass, with the only disallowed instructions being DMA opcodes for address unit and mov to/from special register for scalar unit.
Elapsed time: 19h.