Today my goal is to look into an issue I've briefly mentioned at the beginning, but ignored afterwards: the code fetch and execution process.
The initial guess was that the ISA is VLIW with a bundle size of 16 bytes, containing 4 32-bit opcodes. The known branches certainly can only aim at targets aligned to 16 bytes. However, there's no evidence of the instructions being executed explicitly in parallel. In fact, there are lots of code samples like this:
00000004: 65100001 mov $a2 0x1
00000004: 75100000 sethi $a2 0
where two instructions belong to the same bundle, yet the second has a dependency on the first.
On the other hand, in opcode 0x6a/0x6b testing, a look at 0xf478 MMIO reg access shows that 4 cycles elapse between a write to it from instruction 0 and a read from it in instruction 15. So some amount of parallelism is certainly happening. Also, the exit instruction is known to have a fairly large delay before actually halting execution.
And I think this is pretty good evidence for branch instructions also having delay slots:
[...]
0000034b: e1ffffdb B bra1 0x34a [unknown: 000001db]
0000034c: ef0001ff nope
0000034c: e10011bb bra1 0x354 [unknown: 000001bb]
0000034c: 6bb400af mov $a22 $x48
0000034c: ef0001ff nope
0000034d: 6bb480af mov $a22 $x50
0000034d: e1000fbb bra1 0x354 [unknown: 000001bb]
0000034d: ef0001ff nope
0000034d: e1000fbb bra1 0x354 [unknown: 000001bb]
0000034e: 6bb500af mov $a22 $x52
0000034e: ef0001ff nope
0000034e: 6bb580af mov $a22 $x54
[...]
Today's goal will be to figure out just what's going on in there, and to figure out how the conditional branch instructions work.
Our first attempt will be to read from the cycle count register 31 times in a row, to determine the passage of time between executing instructions:
mov $a0 $sr30
mov $a1 $sr30
mov $a2 $sr30
[...]
mov $a30 $sr30

0000f780: 01613987 01613989 0161398b 0161398d
0000f790: 0161398f 01613991 01613993 01613995
0000f7a0: 01613997 01613999 0161399b 0161399d
0000f7b0: 0161399f 016139a1 016139a3 016139a5
0000f7c0: 016139a7 016139a9 016139ab 016139ad
0000f7d0: 016139af 016139b1 016139b3 016139b5
0000f7e0: 016139b7 016139b9 016139bb 016139bd
0000f7f0: 016139bf 016139c1 016139c3 00000000
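The spacing between consecutive counter reads in the dump above can be checked mechanically. This is only a small sketch: `parse_dump` is a helper name made up for illustration, and it assumes the first 31 words are the values read into $a0..$a30, with the last word left untouched by the test.

```python
# Dump of the 31 back-to-back "mov $aN $sr30" reads, copied from above.
DUMP = """\
0000f780: 01613987 01613989 0161398b 0161398d
0000f790: 0161398f 01613991 01613993 01613995
0000f7a0: 01613997 01613999 0161399b 0161399d
0000f7b0: 0161399f 016139a1 016139a3 016139a5
0000f7c0: 016139a7 016139a9 016139ab 016139ad
0000f7d0: 016139af 016139b1 016139b3 016139b5
0000f7e0: 016139b7 016139b9 016139bb 016139bd
0000f7f0: 016139bf 016139c1 016139c3 00000000
"""


def parse_dump(text):
    """Turn 'addr: w0 w1 w2 w3' lines into a flat list of 32-bit words."""
    words = []
    for line in text.splitlines():
        if not line.strip():
            continue
        _, _, data = line.partition(":")
        words.extend(int(w, 16) for w in data.split())
    return words


words = parse_dump(DUMP)
reads = words[:31]  # assumed: one word per read, in register order
deltas = [b - a for a, b in zip(reads, reads[1:])]
print(sorted(set(deltas)))  # [2]
```

Every one of the 30 gaps is exactly 2 counter ticks.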
Oh, it's two cycles per instruction now? Hmm, what if I replace half of the instructions with the standard mixed nop sequence?
0000f780: 015f51bd 015f51bf 015f51c1 015f51c3
0000f790: 015f51c5 015f51c7 015f51c9 015f51cb
0000f7a0: deadbe08 deadbe09 deadbe0a deadbe0b
0000f7b0: deadbe0c deadbe0d deadbe0e deadbe0f
0000f7c0: deadbe10 deadbe11 deadbe12 deadbe13
0000f7d0: deadbe14 deadbe15 deadbe16 deadbe17
0000f7e0: 015f51d0 015f51d2 015f51d4 015f51d6
0000f7f0: 015f51d8 015f51da 015f51dc 00000000
So nop execution is faster - the 16 nops were executed in about 3 or 4 cycles. Let's vary the type of the nops. What if I use only 0x4f nop?
0000f780: 01603803 01603805 01603807 01603809
0000f790: 0160380b 0160380d 0160380f 01603811
0000f7a0: deadbe08 deadbe09 deadbe0a deadbe0b
0000f7b0: deadbe0c deadbe0d deadbe0e deadbe0f
0000f7c0: deadbe10 deadbe11 deadbe12 deadbe13
0000f7d0: deadbe14 deadbe15 deadbe16 deadbe17
0000f7e0: 01603822 01603824 01603826 01603828
0000f7f0: 0160382a 0160382c 0160382e 00000000
Now it took about 16 cycles. All 0xbf nops?
0000f780: 015d1b0c 015d1b0e 015d1b10 015d1b12
0000f790: 015d1b14 015d1b16 015d1b18 015d1b1a
0000f7a0: deadbe08 deadbe09 deadbe0a deadbe0b
0000f7b0: deadbe0c deadbe0d deadbe0e deadbe0f
0000f7c0: deadbe10 deadbe11 deadbe12 deadbe13
0000f7d0: deadbe14 deadbe15 deadbe16 deadbe17
0000f7e0: 015d1b2b 015d1b2d 015d1b2f 015d1b31
0000f7f0: 015d1b33 015d1b35 015d1b37 00000000
Same. All 0xdf?
0000f780: 015f59d6 015f59d8 015f59da 015f59dc
0000f790: 015f59de 015f59e0 015f59e2 015f59e4
0000f7a0: deadbe08 deadbe09 deadbe0a deadbe0b
0000f7b0: deadbe0c deadbe0d deadbe0e deadbe0f
0000f7c0: deadbe10 deadbe11 deadbe12 deadbe13
0000f7d0: deadbe14 deadbe15 deadbe16 deadbe17
0000f7e0: 015f59f5 015f59f7 015f59f9 015f59fb
0000f7f0: 015f59fd 015f59ff 015f5a01 00000000
And all 0xef?
0000f780: 016146c9 016146cb 016146cd 016146cf
0000f790: 016146d1 016146d3 016146d5 016146d7
0000f7a0: deadbe08 deadbe09 deadbe0a deadbe0b
0000f7b0: deadbe0c deadbe0d deadbe0e deadbe0f
0000f7c0: deadbe10 deadbe11 deadbe12 deadbe13
0000f7d0: deadbe14 deadbe15 deadbe16 deadbe17
0000f7e0: 016146e8 016146ea 016146ec 016146ee
0000f7f0: 016146f0 016146f2 016146f4 00000000
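To put the five nop experiments side by side, here is a small sketch that computes the gap between the last counter read before the 16 nops and the first read after them. The values are copied from the dumps above; the `READS` table and its labels are just illustration. The raw gap still includes the cost of the counter read itself, so the nops alone account for slightly less.

```python
# (last read before the nop block, first read after it), taken from the dumps.
READS = {
    "mixed": (0x015f51cb, 0x015f51d0),
    "0x4f": (0x01603811, 0x01603822),
    "0xbf": (0x015d1b1a, 0x015d1b2b),
    "0xdf": (0x015f59e4, 0x015f59f5),
    "0xef": (0x016146d7, 0x016146e8),
}

# Raw counter gap across the 16-nop block for each nop flavour.
gaps = {name: after - before for name, (before, after) in READS.items()}

for name, gap in gaps.items():
    print(f"{name:>5}: {gap} cycles between reads")
```

The mixed sequence shows a gap of 5, while every single-flavour run shows 17.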
Okay. So the obvious conclusions are:
Let's try something different - we'll have a look at dependency detection:
mov $a0 $sr30
add $a1 $a1 1
add $a1 $a1 1
add $a1 $a1 1
[... repeated 30 times in total ...]
mov $a2 $sr30
as opposed to
mov $a0 $sr30
add $a1 $a1 1
add $a2 $a2 1
[... repeated for 30 regs ...]
mov $a1 $sr30
Both code pieces show a difference of 0x20 between the two clock reads - there's no execution time change based on the operands. The first testcase also results in $a1 really incrementing by 30 - the architecture seems to enforce sequential semantics for instructions (unlike vµc), at least for the add instruction. Good. How about something more complicated, like mul?
Well, turns out mul also executes in one cycle. Nothing to see here.
So, how important are the 16-byte bundle boundaries, anyway? Let's check some testcases.
[...] 6b078047 df000007 df000007 df000007 df000007 bf000007 bf000007 bf000007 bf000007 6b0f8047 [...]: 9 cycles
[...] 6b078047 df000007 df000007 df000007 bf000007 bf000007 bf000007 bf000007 bf000007 6b0f8047 [...]: 8 cycles
[...] 6b078047 df000007 df000007 bf000007 bf000007 bf000007 bf000007 bf000007 bf000007 6b0f8047 [...]: 8 cycles
[...] 6b078047 df000007 df000007 df000007 df000007 df000007 bf000007 bf000007 bf000007 6b0f8047 [...]: 8 cycles
[...] 6b078047 df000007 bf000007 df000007 bf000007 bf000007 bf000007 bf000007 bf000007 6b0f8047 [...]: 7 cycles
So I suppose it works like this: instructions are always issued in sequence, but more than one instruction may be issued per clock if they go to different execution units. If the next instruction to be issued goes to an idle execution unit, it's executed immediately; otherwise the whole issue process stalls until that execution unit is idle (hence the 0xdf 0xdf 0xbf 0xbf sequence takes 3 cycles, not 2). Also, no two instructions from different bundles may be issued on the same cycle.
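As a sanity check on this guess, here is a toy model of the issue rules described above. It is only a sketch under stated assumptions, not a description of the real hardware: each unit accepts one instruction per cycle, issue is strictly in order, and a cycle never mixes bundles. The bundle alignment is my own assumption, since the listings elide it: the first counter read is taken to be the last word of its bundle, so the eight nops fill exactly two bundles and the second read starts a new one. The names `issue_cycles` and `measure`, and the unit label "mov", are made up for illustration.

```python
def issue_cycles(program):
    """program: list of (unit, bundle) in program order.
    Returns the cycle in which each instruction is issued."""
    cycles = []
    cycle = 0
    busy = set()        # units that already took an instruction this cycle
    cur_bundle = None   # bundle being issued in the current cycle
    for unit, bundle in program:
        # Start a new cycle on a bundle change or when the unit is taken.
        if cur_bundle is not None and (bundle != cur_bundle or unit in busy):
            cycle += 1
            busy = set()
        busy.add(unit)
        cur_bundle = bundle
        cycles.append(cycle)
    return cycles


def measure(nops):
    """Cycles between the two counter reads surrounding a run of nops.
    Assumed layout: read | nops in bundles of four | read."""
    program = [("mov", 0)]
    program += [(unit, 1 + i // 4) for i, unit in enumerate(nops)]
    program += [("mov", 1 + (len(nops) + 3) // 4)]
    cycles = issue_cycles(program)
    return cycles[-1] - cycles[0]


# The five testcases, in the order listed above.
TESTCASES = [
    "df df df df bf bf bf bf",
    "df df df bf bf bf bf bf",
    "df df bf bf bf bf bf bf",
    "df df df df df bf bf bf",
    "df bf df bf bf bf bf bf",
]
results = [measure(t.split()) for t in TESTCASES]
print(results)  # [9, 8, 8, 8, 7]
```

Under these assumptions the model reproduces all five measured cycle counts, and 16 nops of a single kind give a gap of 17, matching the earlier dumps. It does not account for any extra latency of the counter-reading mov itself.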
There's also the issue of the mov from $sr opcode we've used - it seems to take 2 cycles sometimes. This may be because only the issue phase of the bundles cannot be interleaved, but the actual execution can.
This also gives us a nice trick to figure out the execution unit corresponding to each opcode - we can just execute it between two nops of various kinds and see how many cycles it took. Let's do this.
Seems we're not quite done with $a instructions... sigh.
Let's get back to the original point now: branches. Looking again at our first code sample (pXX.2):
00000000: 6b0fc0af mov $a1 $x63
00000000: 7e087f80 shr $a1 $c0 $a1 -0x10
00000000: 7e084080 shr $a1 $c0 $a1 0x10
00000000: 4fffffff nop4
00000001: bf000007 nopb
00000001: efffffff unkend
00000001: 65100000 mov $a2 0
00000001: 75100000 sethi $a2 0
00000002: 4df845c3 sub 0 $c3 $a1 $a2
00000002: 4fffffff nop4
00000002: e200043f bra2 0x4 [unknown: 0000003f]
00000002: 4fffffff nop4
00000003: 4fffffff nop4
00000003: eaf80040 abra 0x40
00000003: 4fffffff nop4
00000003: 4fffffff nop4
00000004: bf000007 B nopb
00000004: efffffff B unkend
00000004: 65100001 B mov $a2 0x1
00000004: 75100000 B sethi $a2 0
00000005: 4df845c3 sub 0 $c3 $a1 $a2
00000005: 4fffffff nop4
00000005: e200043f bra2 0x7 [unknown: 0000003f]
00000005: 4fffffff nop4
00000006: 4fffffff nop4
00000006: eaf80080 abra 0x80
00000006: 4fffffff nop4
00000006: 4fffffff nop4
00000007: bf000007 B nopb
Given what we know about $c already, this e2 opcode should branch if $c3 bit 1 (the zero flag) is not set. If the branch conditions are anything like the $a second source selection condition, that'd be a match if e2 is the "branch if condition not true" opcode.
Let's check it. We'll start the testcase with a flag-setting instruction, a branch instruction right after it, pointed at address 0x40, then 31 instructions setting corresponding $a registers to -1 (with $a file initialised to 0xdeadbeXX beforehand). First, let's attempt 0xe2000807 with $c set to 80e5/8000/8000/8000:
0000f780: 00000003 deadbe01 deadbe02 deadbe03
0000f790: deadbe04 deadbe05 deadbe06 deadbe07
0000f7a0: deadbe08 deadbe09 deadbe0a deadbe0b
0000f7b0: deadbe0c deadbe0d 00000003 00000003
0000f7c0: 00000003 00000003 00000003 00000003
0000f7d0: 00000003 00000003 00000003 00000003
0000f7e0: 00000003 00000003 00000003 00000003
0000f7f0: 00000003 00000003 00000003 00000000
Yay, it pretty much confirms we got the branch target right. There is also a delay slot of one instruction. Also, the branch has been done, which is not what we expected.
Maybe if I insert some delay between the flag-setting instruction and the branch...
Yeah, that did it. The branch no longer happens. I suppose since branches and the $a add (which I used to set the flags) execute on different execution units, they're issued in parallel and the branch uses a stale $c value. Some further testing shows that separating the add and branch with 4f (address) and ef (branch) nops results in proper operation, while bf and df don't help. Having the add and branch in separate bundles also does the trick. I suppose that confirms my theory.
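The nop observations fit a simple picture: the branch sees the new flags only if it issues at least one cycle after the add. Below is a sketch of that reasoning. It assumes in-order issue with at most one instruction per unit per cycle and no mixing of bundles within a cycle, and it assumes the add shares a unit with the 4f nop while the branch shares one with the ef nop. The unit labels and function names are invented for illustration.

```python
def issue_cycles(program):
    """program: list of (unit, bundle). Returns the issue cycle of each entry."""
    cycles, cycle, busy, cur_bundle = [], 0, set(), None
    for unit, bundle in program:
        # A busy unit or a bundle change pushes issue to the next cycle.
        if cur_bundle is not None and (bundle != cur_bundle or unit in busy):
            cycle += 1
            busy = set()
        busy.add(unit)
        cur_bundle = bundle
        cycles.append(cycle)
    return cycles


def branch_sees_new_flags(units, bundles=None):
    """units: issue units from the flag-setting add up to the branch, inclusive.
    True if the branch issues in a later cycle than the add (assumed rule)."""
    if bundles is None:
        bundles = [0] * len(units)  # everything in one bundle by default
    cycles = issue_cycles(list(zip(units, bundles)))
    return cycles[-1] > cycles[0]


print(branch_sees_new_flags(["addr", "branch"]))           # add; bra
print(branch_sees_new_flags(["addr", "addr", "branch"]))   # add; 4f nop; bra
print(branch_sees_new_flags(["addr", "branch", "branch"])) # add; ef nop; bra
print(branch_sees_new_flags(["addr", "bf", "branch"]))     # add; bf nop; bra
print(branch_sees_new_flags(["addr", "branch"], [0, 1]))   # separate bundles
```

With these assumptions, the back-to-back case and the bf/df cases leave the branch in the same cycle as the add (stale flags), while the 4f nop, the ef nop, and the separate-bundle case all push the branch one cycle later.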
What about the delay slot? Only one of my add instructions was executed... that could be explained by the instructions being already issued when the branch executes. Some observations of its behavior:
Sigh. Okay, let's at least RE the condition flags.
Well, some simple tests confirm that the branch works as expected - branches if the selected flag of a $c reg is not set, with same bitfields as the $a instructions.
A more interesting bitfield is bits 0-2. They were the output $c register on $a instructions. Could a branch instruction have $c output too? Let's see, I'll ask it to write something to $c2...
0000f680: 000080f5 00008000 0000a000 00008000
Apparently it can. Another interesting thing happens if I ask it to write to $c0:
0000f680: 0000a0f5 00008000 00008000 00008000
It seems that the $c regs are split into bitfields belonging to specific execution units, and the instructions that write $a don't actually overwrite all of it - they just write to the part owned by the execution unit doing the write.
This doesn't explain what the written bit is, though. It doesn't change between taken and not taken branches, at least... well, another issue to leave for later.
Ah well, that'll do for e2. The most common branch unit instruction is e0, however. Let's look at some examples of usage... the 46c (mthd) chunk has some, apparently:
[...]
00000000: 6b528047 mov $a10 $sr10
00000000: 6b5ac047 mov $a11 $sr11
00000000: 65080180 mov $a1 0x180
00000000: 4dfa83c0 sub 0 $c0 $a10 $a1
00000001: 65100194 mov $a2 0x194
00000001: e0003407 bra0 0x1b [unknown: 00000007]
00000001: 4df895c1 sub 0 $c1 $a2 $a10
00000001: ef0001ff nope
00000002: e000320f bra0 0x1b [unknown: 0000000f]
00000002: 7e5affe7 shr $a11 $a11 -0x4
[...]
0000000a: 65080180 mov $a1 0x180
0000000a: 4dfa83c0 sub 0 $c0 $a10 $a1
0000000a: 65080184 mov $a1 0x184
0000000a: e2000427 bra2 0xc [unknown: 00000027]
0000000b: ef0001ff nope
0000000b: 6a8440af mov $x48 $a17
0000000b: e00015e7 bra0 0x15 [unknown: 000001e7]
0000000b: ef0001ff nope
0000000c: 4dfa83c0 B sub 0 $c0 $a10 $a1
0000000c: 65080188 B mov $a1 0x188
0000000c: e2000427 B bra2 0xe [unknown: 00000027]
0000000c: ef0001ff B nope
0000000d: 6a9440af mov $x50 $a17
0000000d: e00011e7 bra0 0x15 [unknown: 000001e7]
0000000d: bf000007 nopb
0000000d: ef0001ff nope
0000000e: 4dfa83c0 B sub 0 $c0 $a10 $a1
0000000e: 6508018c B mov $a1 0x18c
0000000e: e2000427 B bra2 0x10 [unknown: 00000027]
0000000e: ef0001ff B nope
0000000f: 6aa440af mov $x52 $a17
0000000f: e0000de7 bra0 0x15 [unknown: 000001e7]
0000000f: bf000007 nopb
0000000f: ef0001ff nope
[...]
Seems this code looks at $sr10 and compares it with some stuff. Given that it's the code piece that's called on method execution, and that 0x180+ is the DMA method range, $sr10 is almost certainly the method register, and this code is for handling DMA methods. For the control flow to make sense, e0 with 1e7 unknown bits must be an unconditional branch, while the first two branches would have to branch if bit 0 (sign flag) is set. Since e2 doesn't allow an easy branch if a flag is set, e0 is likely to be the same as e2, only without negation.
Sure enough, a few tests confirm it is. No change in effect on $c flags, too.
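For reference, here is a sketch of a decoder for the e0/e2 branch words, pieced together from the listings in this post rather than from any documentation. The target field (a signed bundle offset in bits 9-23, relative to the branch's own bundle) reproduces the targets the disassembler printed. The split of bits 3-8 into a $c register index (bits 3-4) and a flag bit index (bits 5-8) is my own inference from the examples here and should be treated as an assumption, as should the field names.

```python
def decode_branch(word, bundle_addr):
    """Decode a branch-unit word found at the given bundle address."""
    opcode = word >> 24
    low9 = word & 0x1ff
    offset = (word >> 9) & 0x7fff
    if offset & 0x4000:            # sign-extend the 15-bit bundle offset
        offset -= 0x8000
    return {
        "opcode": opcode,
        "target": bundle_addr + offset,    # in bundles, not bytes
        "c_out": low9 & 7,                 # $c register written (bits 0-2)
        "c_src": (low9 >> 3) & 3,          # assumed: $c register tested
        "flag": (low9 >> 5) & 0xf,         # assumed: flag bit tested
        "negated": opcode in (0xe2, 0xe6), # e2/e6 invert the condition
    }


def taken(word, c_regs):
    """Would an e0/e2 branch be taken, given the four $c register values?"""
    d = decode_branch(word, 0)
    flag_set = bool((c_regs[d["c_src"]] >> d["flag"]) & 1)
    return flag_set != d["negated"]


d = decode_branch(0xe200043f, 0x2)   # "bra2 0x4" testing the $c3 zero flag
print(hex(d["target"]), d["c_src"], d["flag"])  # 0x4 3 1
```

Under this reading, the 1e7 variant tests bit 15 of $c0, which reads as set in every $c dump shown here (0x8000), and that would be why it behaves as an unconditional branch.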
As a side note, that code confirms there really are no sane comparison flags for $a registers - it uses the sign flag instead, which is wrong if the operand values are really far apart. Not that it can happen for methods, though.
The next most common branch unit opcode is e4. Analysing the places where it's used in the code reveals that it jumps to small pieces of code always terminated by e8ffffff opcode, often quite far away from the branching code.
Call and ret instructions, of course.
Sure enough, executing the e4 opcode results in the return address being written to MMIO register f500. Some easy testing reveals bits 0-8 to serve the same purpose as in the e0 opcode (i.e. e4 is a predicated call, executed if the selected flag is true). However, the return address is not counted in bundles - it's counted in (32-bit) instruction words and points to the first instruction that hasn't been issued! Seems the branch unit keeps track of that nicely...
Doing further calls stuffs the return address in f504, f508, f50c, and then wraps back to f500. Still no other effect on $c flags. No other MMIO-visible reg changes as a result - it appears that the stack pointer is hidden. It may or may not be visible as a $sr... again, something to look at later.
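The observed call behavior can be summed up in a toy model: four return-address slots at f500..f50c, filled in order and wrapping around, with the return address expressed in 32-bit words. This is a sketch only. The class and method names are made up, the hidden stack pointer is modeled as a plain counter, and ret is left out because its effect on the slots wasn't examined here.

```python
class CallStackModel:
    BASE = 0xf500   # first return-address MMIO register
    SLOTS = 4       # f500, f504, f508, f50c

    def __init__(self):
        self.mmio = {self.BASE + 4 * i: 0 for i in range(self.SLOTS)}
        self.sp = 0  # hidden stack pointer (not visible through MMIO)

    @staticmethod
    def word_address(bundle, slot):
        """Return addresses count 32-bit words: 4 words per 16-byte bundle."""
        return bundle * 4 + slot

    def call(self, bundle, slot):
        """Record a call whose first not-yet-issued instruction is at
        (bundle, slot); returns the MMIO register that was written."""
        reg = self.BASE + 4 * self.sp
        self.mmio[reg] = self.word_address(bundle, slot)
        self.sp = (self.sp + 1) % self.SLOTS
        return reg


stack = CallStackModel()
written = [stack.call(bundle=n, slot=2) for n in range(5)]
print([hex(r) for r in written])
# ['0xf500', '0xf504', '0xf508', '0xf50c', '0xf500']
```

The fifth call overwrites the first slot, so anything nested deeper than four levels loses its outermost return address in this model.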
Now, an obvious thing to check would be if e6 opcode is a negated version of e4.
Yeah, it is.
As for the ret opcode, it likewise sets the $c flag, however it does not appear to accept any predicates. Bits 3-23 appear to be ignored.
As for the other branch instructions, let's take care of them some other time.
Elapsed time: 5h