5: Branches, pt. 1


Today my goal is to look into an issue I've briefly mentioned at the beginning, but ignored afterwards: the code fetch and execution process.

The initial guess was that the ISA is VLIW with a bundle size of 16 bytes, containing four 32-bit opcodes. The known branches certainly can only aim at targets aligned to 16 bytes. However, there's no evidence of the instructions being executed explicitly in parallel. In fact, there are lots of code samples like this:

00000004: 65100001     mov $a2 0x1
00000004: 75100000     sethi $a2 0

where two instructions belong to the same bundle, yet the second has a dependency on the first.

On the other hand, in opcode 0x6a/0x6b testing, a look at 0xf478 MMIO reg access shows that 4 cycles elapse between a write to it from instruction 0 and a read from it in instruction 15. So some amount of parallelism is certainly happening. Also, the exit instruction is known to have a fairly large delay before actually halting execution.

And I think this is pretty good evidence for branch instructions also having delay slots:

0000034b: e1ffffdb   B bra1 0x34a [unknown: 000001db]
0000034c: ef0001ff     nope
0000034c: e10011bb     bra1 0x354 [unknown: 000001bb]
0000034c: 6bb400af     mov $a22 $x48
0000034c: ef0001ff     nope
0000034d: 6bb480af     mov $a22 $x50
0000034d: e1000fbb     bra1 0x354 [unknown: 000001bb]
0000034d: ef0001ff     nope
0000034d: e1000fbb     bra1 0x354 [unknown: 000001bb]
0000034e: 6bb500af     mov $a22 $x52
0000034e: ef0001ff     nope
0000034e: 6bb580af     mov $a22 $x54

Today's goal will be to figure out just what's going on in there, and to figure out how the conditional branch instructions work.

Our first attempt will be to read from the cycle count register 31 times in a row, to determine the passage of time between executing instructions:

mov $a0 $sr30
mov $a1 $sr30
mov $a2 $sr30
[...]
mov $a30 $sr30


0000f780: 01613987 01613989 0161398b 0161398d
0000f790: 0161398f 01613991 01613993 01613995
0000f7a0: 01613997 01613999 0161399b 0161399d
0000f7b0: 0161399f 016139a1 016139a3 016139a5
0000f7c0: 016139a7 016139a9 016139ab 016139ad
0000f7d0: 016139af 016139b1 016139b3 016139b5
0000f7e0: 016139b7 016139b9 016139bb 016139bd
0000f7f0: 016139bf 016139c1 016139c3 00000000

Oh, it's two cycles per instruction now? Hmm, what if I replace half of the instructions with the standard mixed nop sequence?

0000f780: 015f51bd 015f51bf 015f51c1 015f51c3
0000f790: 015f51c5 015f51c7 015f51c9 015f51cb
0000f7a0: deadbe08 deadbe09 deadbe0a deadbe0b
0000f7b0: deadbe0c deadbe0d deadbe0e deadbe0f
0000f7c0: deadbe10 deadbe11 deadbe12 deadbe13
0000f7d0: deadbe14 deadbe15 deadbe16 deadbe17
0000f7e0: 015f51d0 015f51d2 015f51d4 015f51d6
0000f7f0: 015f51d8 015f51da 015f51dc 00000000

So nop execution is faster - the 16 nops were executed in about 3 or 4 cycles. Let's vary the type of the nops. What if I use only 0x4f nop?

0000f780: 01603803 01603805 01603807 01603809
0000f790: 0160380b 0160380d 0160380f 01603811
0000f7a0: deadbe08 deadbe09 deadbe0a deadbe0b
0000f7b0: deadbe0c deadbe0d deadbe0e deadbe0f
0000f7c0: deadbe10 deadbe11 deadbe12 deadbe13
0000f7d0: deadbe14 deadbe15 deadbe16 deadbe17
0000f7e0: 01603822 01603824 01603826 01603828
0000f7f0: 0160382a 0160382c 0160382e 00000000

Now it took about 16 cycles. All 0xbf nops?

0000f780: 015d1b0c 015d1b0e 015d1b10 015d1b12
0000f790: 015d1b14 015d1b16 015d1b18 015d1b1a
0000f7a0: deadbe08 deadbe09 deadbe0a deadbe0b
0000f7b0: deadbe0c deadbe0d deadbe0e deadbe0f
0000f7c0: deadbe10 deadbe11 deadbe12 deadbe13
0000f7d0: deadbe14 deadbe15 deadbe16 deadbe17
0000f7e0: 015d1b2b 015d1b2d 015d1b2f 015d1b31
0000f7f0: 015d1b33 015d1b35 015d1b37 00000000

Same. All 0xdf?

0000f780: 015f59d6 015f59d8 015f59da 015f59dc
0000f790: 015f59de 015f59e0 015f59e2 015f59e4
0000f7a0: deadbe08 deadbe09 deadbe0a deadbe0b
0000f7b0: deadbe0c deadbe0d deadbe0e deadbe0f
0000f7c0: deadbe10 deadbe11 deadbe12 deadbe13
0000f7d0: deadbe14 deadbe15 deadbe16 deadbe17
0000f7e0: 015f59f5 015f59f7 015f59f9 015f59fb
0000f7f0: 015f59fd 015f59ff 015f5a01 00000000

And all 0xef?

0000f780: 016146c9 016146cb 016146cd 016146cf
0000f790: 016146d1 016146d3 016146d5 016146d7
0000f7a0: deadbe08 deadbe09 deadbe0a deadbe0b
0000f7b0: deadbe0c deadbe0d deadbe0e deadbe0f
0000f7c0: deadbe10 deadbe11 deadbe12 deadbe13
0000f7d0: deadbe14 deadbe15 deadbe16 deadbe17
0000f7e0: 016146e8 016146ea 016146ec 016146ee
0000f7f0: 016146f0 016146f2 016146f4 00000000

Okay. So the obvious conclusions are:

  • Execution time is variable between instructions (string of nops vs string of movs from $sr)
  • The instructions are executed in parallel to some degree, and this is controlled in part by the execution unit the instruction refers to - a string of mixed nops will spread over several execution units and finish faster than a string of identical nops. Which explains why the mixed nop sequence is used so much.
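To put numbers on this, the gap across the nop region can be read straight off the dumps: it is the first counter value after the deadbeXX block minus the last counter value before it. A small Python sketch follows; the dictionary layout and names are mine, and the values are copied from the five dumps above.

```python
# (last counter read before the nops, first counter read after the nops),
# copied from the five dumps above
reads = {
    "mixed": (0x015f51cb, 0x015f51d0),
    "4f":    (0x01603811, 0x01603822),
    "bf":    (0x015d1b1a, 0x015d1b2b),
    "df":    (0x015f59e4, 0x015f59f5),
    "ef":    (0x016146d7, 0x016146e8),
}

# Back-to-back counter reads differ by 2 in every dump, so use that as baseline.
BASELINE = 2

gaps = {kind: after - before for kind, (before, after) in reads.items()}
extra = {kind: gap - BASELINE for kind, gap in gaps.items()}

for kind in reads:
    print(f"{kind:>5}: gap {gaps[kind]:2d} cycles, "
          f"{extra[kind]:2d} more than two back-to-back reads")
```

Whether the baseline to subtract is 1 or 2 cycles depends on how the mov from $sr30 itself is timed, which is why the figures in the text are only "about" 3-4 and 16.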

Let's try something different - we'll have a look at dependency detection:

mov $a0 $sr30
add $a1 $a1 1
add $a1 $a1 1
add $a1 $a1 1
[... repeated 30 times in total ...]
mov $a2 $sr30

as opposed to

mov $a0 $sr30
add $a1 $a1 1
add $a2 $a2 1
[... repeated for 30 regs ...]
mov $a1 $sr30

Both code pieces show a difference of 0x20 between the two clock reads - there's no execution time change based on the operands. The first testcase also results in $a1 really incrementing by 30 - the architecture seems to enforce sequential semantics for instructions (unlike vµc), at least for the add instruction. Good. How about something more complicated, like mul?

Well, turns out mul also executes in one cycle. Nothing to see here.

So, how important are the 16-byte bundle boundaries, anyway? Let's check some testcases.

[...]                      6b078047
df000007 df000007 df000007 df000007
bf000007 bf000007 bf000007 bf000007
6b0f8047 [...]: 9 cycles
[...] 6b078047
df000007 df000007 df000007 bf000007
bf000007 bf000007 bf000007 bf000007
6b0f8047 [...]: 8 cycles
[...] 6b078047
df000007 df000007 bf000007 bf000007
bf000007 bf000007 bf000007 bf000007
6b0f8047 [...]: 8 cycles
[...] 6b078047
df000007 df000007 df000007 df000007
df000007 bf000007 bf000007 bf000007
6b0f8047 [...]: 8 cycles
[...] 6b078047
df000007 bf000007 df000007 bf000007
bf000007 bf000007 bf000007 bf000007
6b0f8047 [...]: 7 cycles

So I suppose it works like this: instructions are always issued in sequence, but more than one instruction may be issued per clock if they go to different execution units. If the next instruction to be issued goes to an idle execution unit, it's executed immediately; otherwise the whole issue process stalls until that execution unit is idle (hence the 0xdf 0xdf 0xbf 0xbf sequence takes 3 cycles, not 2). Also, no two instructions from different bundles may be issued on the same cycle.
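This issue model is simple enough to simulate. The sketch below is my own toy model, not anything extracted from the hardware. It assumes every unit is busy for exactly one cycle, treats the mov from $sr30 as a one-cycle instruction, and ignores whatever precedes the first mov in its bundle. The unit labels are just strings.

```python
def issue_cycles(bundles):
    """Return the issue cycle of every instruction, grouped by bundle.

    Each bundle is a list of execution-unit labels in program order.
    """
    cycle = 0
    schedule = []
    for bundle in bundles:
        used = set()              # units already taken in the current cycle
        cycles = []
        for unit in bundle:
            if unit in used:      # unit busy: stall issue until the next cycle
                cycle += 1
                used = set()
            used.add(unit)
            cycles.append(cycle)
        schedule.append(cycles)
        cycle += 1                # bundles never share an issue cycle
    return schedule


def delta_between_reads(nop_bundles):
    """Cycles between the two counter reads that surround the nop bundles."""
    schedule = issue_cycles([["mov"]] + nop_bundles + [["mov"]])
    return schedule[-1][0] - schedule[0][0]


D, B = "df", "bf"
cases = [
    [[D, D, D, D], [B, B, B, B]],
    [[D, D, D, B], [B, B, B, B]],
    [[D, D, B, B], [B, B, B, B]],
    [[D, D, D, D], [D, B, B, B]],
    [[D, B, D, B], [B, B, B, B]],
]
results = [delta_between_reads(case) for case in cases]
print(results)
```

Under these assumptions the model gives 9, 8, 8, 8 and 7 cycles for the five testcases, matching the measurements. It does not attempt to explain the 2-cycle mov from $sr seen earlier.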

There's also the issue of the mov from $sr opcode we've used - it seems to take 2 cycles sometimes. This may be because only the issue phase of the bundles cannot be interleaved, but the actual execution can.

This also gives us a nice trick to figure out the execution unit corresponding to each opcode - we can just execute it between two nops of various kinds and see how many cycles it took. Let's do this.


  • 00-7f: Same execution unit as 4f ($a / address unit)
  • 80-bf: Same execution unit as bf ($v / vector unit)
  • c0-df: Same execution unit as df ($r / scalar unit)
  • e0-ff: Same execution unit as ef (branch unit)
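As a quick reference, the table above can be written as a lookup on the top byte of the instruction word. This is a minimal sketch; the function name and the unit labels are mine.

```python
def execution_unit(word):
    """Map a 32-bit instruction word to its execution unit, per the table above."""
    opcode = (word >> 24) & 0xff   # the opcode is the top byte of the word
    if opcode <= 0x7f:
        return "address"           # $a unit, same as the 4f nop
    if opcode <= 0xbf:
        return "vector"            # $v unit, same as the bf nop
    if opcode <= 0xdf:
        return "scalar"            # $r unit, same as the df nop
    return "branch"                # same as the ef nop


for word in (0x4fffffff, 0xbf000007, 0xdf000007, 0xef0001ff, 0x6b078047):
    print(f"{word:08x}: {execution_unit(word)}")
```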

Seems we're not quite done with $a instructions... sigh.

Let's get back to the original point now: branches. Looking again at our first code sample (pXX.2):

00000000: 6b0fc0af     mov $a1 $x63
00000000: 7e087f80     shr $a1 $c0 $a1 -0x10
00000000: 7e084080     shr $a1 $c0 $a1 0x10
00000000: 4fffffff     nop4
00000001: bf000007     nopb
00000001: efffffff     unkend
00000001: 65100000     mov $a2 0
00000001: 75100000     sethi $a2 0
00000002: 4df845c3     sub 0 $c3 $a1 $a2
00000002: 4fffffff     nop4
00000002: e200043f     bra2 0x4 [unknown: 0000003f]
00000002: 4fffffff     nop4
00000003: 4fffffff     nop4
00000003: eaf80040     abra 0x40
00000003: 4fffffff     nop4
00000003: 4fffffff     nop4
00000004: bf000007   B nopb
00000004: efffffff   B unkend
00000004: 65100001   B mov $a2 0x1
00000004: 75100000   B sethi $a2 0
00000005: 4df845c3     sub 0 $c3 $a1 $a2
00000005: 4fffffff     nop4
00000005: e200043f     bra2 0x7 [unknown: 0000003f]
00000005: 4fffffff     nop4
00000006: 4fffffff     nop4
00000006: eaf80080     abra 0x80
00000006: 4fffffff     nop4
00000006: 4fffffff     nop4
00000007: bf000007   B nopb

Given what we know about $c already, this e2 opcode should branch if $c3 bit 1 (the zero flag) is not set. If the branch conditions are anything like the $a second source selection condition, that'd be a match if e2 is the "branch if condition not true" opcode.

Let's check it. We'll start the testcase with a flag-setting instruction, a branch instruction right after it, pointed at address 0x40, then 31 instructions setting corresponding $a registers to -1 (with $a file initialised to 0xdeadbeXX beforehand). First, let's attempt 0xe2000807 with $c set to 80e5/8000/8000/8000:

0000f780: 00000003 deadbe01 deadbe02 deadbe03
0000f790: deadbe04 deadbe05 deadbe06 deadbe07
0000f7a0: deadbe08 deadbe09 deadbe0a deadbe0b
0000f7b0: deadbe0c deadbe0d 00000003 00000003
0000f7c0: 00000003 00000003 00000003 00000003
0000f7d0: 00000003 00000003 00000003 00000003
0000f7e0: 00000003 00000003 00000003 00000003
0000f7f0: 00000003 00000003 00000003 00000000

Yay, it pretty much confirms we got the branch target right. There is also a delay slot of one instruction. Also, the branch was taken, which is not what we expected.
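For reference, here is how the target can be computed from the instruction word. This is inferred from the disassembly listings in this post rather than from any documentation: the offset appears to sit in bits 9-23, counted in bundles and relative to the bundle holding the branch. The sign extension is an assumption backed by a single example (e1ffffdb at 0x34b going to 0x34a).

```python
def branch_target(bundle, word):
    """Target bundle of a relative branch `word` located in bundle `bundle`.

    Assumes a signed 15-bit bundle offset in bits 9-23 (inferred, not confirmed).
    """
    offset = (word >> 9) & 0x7fff
    if offset & 0x4000:            # assumed sign bit (bit 23 of the word)
        offset -= 0x8000
    return bundle + offset


# Examples taken from the listings earlier in this post
print(hex(branch_target(0x34b, 0xe1ffffdb)))    # listing shows bra1 0x34a
print(hex(branch_target(0x34c, 0xe10011bb)))    # listing shows bra1 0x354
print(hex(branch_target(2, 0xe200043f)))        # listing shows bra2 0x4
print(hex(branch_target(0, 0xe2000807) * 16))   # testcase branch, as a byte address
```

The last line multiplies by the 16-byte bundle size and gives 0x40, the address the testcase branch was pointed at.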

Maybe if I insert some delay between the flag-setting instruction and the branch...

Yeah, that did it. The branch no longer happens. I suppose since branches and the $a add (which I used to set the flags) execute on different execution units, they're issued in parallel and the branch uses a stale $c value. Some further testing shows that separating the add and branch with 4f (address) and ef (branch) nops results in proper operation, while bf and df don't help. Having the add and branch in separate bundles also does the trick. I suppose that confirms my theory.

What about the delay slot? Only one of my add instructions was executed... that could be explained by the instructions being already issued when the branch executes. Some observations of its behavior:

  • If the branch is not the last instruction of the bundle, all following instructions in this bundle are executed until one that would issue to a busy execution unit is hit
  • If the branch is the last instruction of the bundle, all instructions of the following bundle are executed until one that would issue to a busy execution unit is hit

Sigh. Okay, let's at least RE the condition flags.

Well, some simple tests confirm that the branch works as expected - branches if the selected flag of a $c reg is not set, with same bitfields as the $a instructions.

A more interesting bitfield is bits 0-2. They were the output $c register on $a instructions. Could a branch instruction have $c output too? Let's see, I'll ask it to write something to $c2...

0000f680: 000080f5 00008000 0000a000 00008000

Apparently it can. Another interesting thing happens if I ask it to write to $c0:

0000f680: 0000a0f5 00008000 00008000 00008000

It seems that the $c regs are split into bitfields belonging to specific execution units, and the instructions that write a $c register don't actually overwrite all of it - they just write to the part owned by the execution unit doing the write.

This doesn't explain what the written bit is, though. It doesn't change between taken and not taken branches, at least... well, another issue to leave for later.

Ah well, that'll do for e2. The most common branch unit instruction is e0, however. Let's look at some examples of usage... the 46c (mthd) chunk has some, apparently:

00000000: 6b528047     mov $a10 $sr10
00000000: 6b5ac047     mov $a11 $sr11
00000000: 65080180     mov $a1 0x180
00000000: 4dfa83c0     sub 0 $c0 $a10 $a1
00000001: 65100194     mov $a2 0x194
00000001: e0003407     bra0 0x1b [unknown: 00000007]
00000001: 4df895c1     sub 0 $c1 $a2 $a10
00000001: ef0001ff     nope
00000002: e000320f     bra0 0x1b [unknown: 0000000f]
00000002: 7e5affe7     shr $a11 $a11 -0x4
0000000a: 65080180     mov $a1 0x180
0000000a: 4dfa83c0     sub 0 $c0 $a10 $a1
0000000a: 65080184     mov $a1 0x184
0000000a: e2000427     bra2 0xc [unknown: 00000027]
0000000b: ef0001ff     nope
0000000b: 6a8440af     mov $x48 $a17
0000000b: e00015e7     bra0 0x15 [unknown: 000001e7]
0000000b: ef0001ff     nope
0000000c: 4dfa83c0   B sub 0 $c0 $a10 $a1
0000000c: 65080188   B mov $a1 0x188
0000000c: e2000427   B bra2 0xe [unknown: 00000027]
0000000c: ef0001ff   B nope
0000000d: 6a9440af     mov $x50 $a17
0000000d: e00011e7     bra0 0x15 [unknown: 000001e7]
0000000d: bf000007     nopb
0000000d: ef0001ff     nope
0000000e: 4dfa83c0   B sub 0 $c0 $a10 $a1
0000000e: 6508018c   B mov $a1 0x18c
0000000e: e2000427   B bra2 0x10 [unknown: 00000027]
0000000e: ef0001ff   B nope
0000000f: 6aa440af     mov $x52 $a17
0000000f: e0000de7     bra0 0x15 [unknown: 000001e7]
0000000f: bf000007     nopb
0000000f: ef0001ff     nope

Seems this code looks at $sr10 and compares it with some stuff. Given that it's the code piece that's called on method execution, and that 0x180+ is the DMA method range, $sr10 is almost certainly the method register, and this code is for handling DMA methods. For the control flow to make sense, e0 with 1e7 unknown bits must be an unconditional branch, while the first two branches would have to branch if bit 0 (sign flag) is set. Since e2 doesn't allow an easy branch if a flag is set, e0 is likely to be the same as e2, only without negation.

Sure enough, a few tests confirm it is. No change in the effect on $c flags, either.
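Putting e0 and e2 together, the condition check can be sketched as below. The field layout is my own reading of the listings and should be treated as an assumption: bits 0-2 are the $c output register, bits 3-4 select the $c register to test, and bits 5-8 select the flag bit. It matches the examples in this post (0x007 tests $c0 bit 0, 0x00f tests $c1 bit 0, 0x03f tests $c3 bit 1). Bit 15 is set in every $c dump shown here, which would make flag 15 (as in 0x1e7) an always-true condition.

```python
def branch_taken(word, c_regs):
    """Decide whether an e0/e2 branch is taken, given the four $c values.

    Field layout is inferred from the listings, not confirmed:
    bits 3-4 = $c register to test, bits 5-8 = flag bit index.
    """
    opcode = (word >> 24) & 0xff
    c_index = (word >> 3) & 0x3
    flag = (word >> 5) & 0xf
    flag_set = bool((c_regs[c_index] >> flag) & 1)
    if opcode == 0xe0:
        return flag_set            # branch if the selected flag is set
    if opcode == 0xe2:
        return not flag_set        # branch if the selected flag is not set
    raise ValueError(f"not an e0/e2 branch: {word:08x}")


idle = [0x8000, 0x8000, 0x8000, 0x8000]
print(branch_taken(0xe00015e7, idle))                              # 0x1e7 form
print(branch_taken(0xe0003407, [0x8001, 0x8000, 0x8000, 0x8000]))  # sign flag set in $c0
print(branch_taken(0xe200043f, [0x8000, 0x8000, 0x8000, 0x8002]))  # zero flag set in $c3
```

This only models the decision itself. It ignores the stale-$c issue described earlier, where a branch issued in the same cycle as the flag-setting instruction sees the old flags.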

As a side note, that code confirms there really are no sane comparison flags for $a registers - it uses the sign flag instead, which is wrong if the operand values are really far apart. Not that it can happen for methods, though.

The next most common branch unit opcode is e4. Analysing the places where it's used in the code reveals that it jumps to small pieces of code always terminated by e8ffffff opcode, often quite far away from the branching code.

Call and ret instructions, of course.

Sure enough, executing the e4 opcode results in the return address being written to MMIO register f500. Some easy testing reveals bits 0-8 to serve the same purpose as in the e0 opcode (i.e. e4 is a predicated call, executed if the selected flag is true). However, the return address is not counted in bundles - it's counted in (32-bit) instruction words and points to the first instruction that hasn't been issued! Seems the branch unit keeps track of that nicely...

Doing further calls stuffs the return address in f504, f508, f50c, and then wraps back to f500. Still no other effect on $c flags. No other MMIO-visible reg changes as a result - it appears that the stack pointer is hidden. It may or may not be visible as a $sr... again, something to look at later.
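The observed call behaviour can be modelled as a four-entry ring of MMIO-visible slots with a hidden write pointer. The sketch below covers only what was observed for calls. The class and method names are mine, the pointer starting at slot 0 is an assumption, and ret is left out because its effect on the pointer wasn't tested here.

```python
class ReturnStack:
    """Toy model of the call return slots at MMIO f500-f50c."""

    BASE = 0xf500
    SLOTS = 4

    def __init__(self):
        self.mmio = {self.BASE + 4 * i: 0 for i in range(self.SLOTS)}
        self._next = 0             # hidden pointer, assumed to start at slot 0

    def call(self, bundle, slot):
        """Record a call; (bundle, slot) is the first instruction not yet issued.

        Returns the MMIO register that received the return address.
        """
        return_address = bundle * 4 + slot   # counted in 32-bit instruction words
        reg = self.BASE + 4 * self._next
        self.mmio[reg] = return_address
        self._next = (self._next + 1) % self.SLOTS   # wraps back to f500
        return reg


stack = ReturnStack()
written = [stack.call(bundle, 0) for bundle in range(5)]
print([hex(reg) for reg in written])
```

The fifth call lands in f500 again and overwrites the first return address, which is the wrap-around seen in the MMIO dump.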

Now, an obvious thing to check would be if e6 opcode is a negated version of e4.

Yeah, it is.

As for the ret opcode, it likewise sets the $c flag, but it does not appear to accept any predicates. Bits 3-23 appear to be ignored.

As for the other branch instructions, let's take care of them some other time.

Elapsed time: 5h
