2: First code run

Today's first idea is to look at the VP1's MMIO registers to see if we can run some test code.

Before I managed to figure this out, I noticed some nice things in the MMIO scan:

00f448: df000000 ffffffff 00000000 *
00f44c: 4f000000 ffffffff 00000000 *
00f450: bf000000 ffffffff 00000000 *
00f454: ef000000 ffffffff 00000000 *

Looks suspiciously like the nop opcodes. Last executed instruction, perhaps?

00f000: ff377fff ffffffff 00000000 *
00f004: ffeeffef ffffffff 00000000 *
00f008: fcdeffff ffffffff 00000000 *
00f00c: f3bffe1d ffffffff 00000000 *
00f010: ee78e6df ffffffff 00000000 *
00f014: fbebfff3 ffffffff 00000000 *
[...]
00f3ec: 71d82b0b ffffffff 00000000 *
00f3f0: a13ca2b4 ffffffff 00000000 *
00f3f4: 071c4982 ffffffff 00000000 *
00f3f8: b8c2955f ffffffff 00000000 *
00f3fc: 4ab5d687 ffffffff 00000000 *

That's 1kB of memory. Too little to be the data store, but perhaps there's an index register somewhere.

00f600: db2d3fee ffffffff 00000000 *
00f604: db2d3fee ffffffff 00000000 *
00f608: db2d3fee ffffffff 00000000 *
00f60c: db2d3fee ffffffff 00000000 *
00f610: db2d3fee ffffffff 00000000 *
00f614: db2d3fee ffffffff 00000000 *
00f618: db2d3fee ffffffff 00000000 *
00f61c: db2d3fee ffffffff 00000000 *
00f620: db2d3fee ffffffff 00000000 *
00f624: db2d3fee ffffffff 00000000 *
00f628: db2d3fee ffffffff 00000000 *
00f62c: db2d3fee ffffffff 00000000 *
00f630: db2d3fee ffffffff 00000000 *
00f634: db2d3fee ffffffff 00000000 *
00f638: db2d3fee ffffffff 00000000 *
00f63c: db2d3fee ffffffff 00000000 *
00f640: db2d3fee ffffffff 00000000 *
00f644: db2d3fee ffffffff 00000000 *
00f648: db2d3fee ffffffff 00000000 *
00f64c: db2d3fee ffffffff 00000000 *
00f650: db2d3fee ffffffff 00000000 *
00f654: db2d3fee ffffffff 00000000 *
00f658: db2d3fee ffffffff 00000000 *
00f65c: db2d3fee ffffffff 00000000 *
00f660: db2d3fee ffffffff 00000000 *
00f664: db2d3fee ffffffff 00000000 *
00f668: db2d3fee ffffffff 00000000 *
00f66c: db2d3fee ffffffff 00000000 *
00f670: db2d3fee ffffffff 00000000 *
00f674: aa776f27 ffffffff 00000000 *
00f678: db2d3fee ffffffff 00000000 *
00f67c: db2d3fee ffffffff 00000000 *

32 words of memory. Oddly, almost all of them hold the same value, but a quick test says they're independent... hm. Maybe a scan of earlier MMIO addresses caused something to happen to them.

00f780: c015cf5e ffffffff 00000000 *
00f784: c015cf5e ffffffff 00000000 *
00f788: c015cf5e ffffffff 00000000 *
00f78c: c015cf5e ffffffff 00000000 *
00f790: c015cf5e ffffffff 00000000 *
00f794: c015cf5e ffffffff 00000000 *
00f798: c015cf5e ffffffff 00000000 *
00f79c: c015cf5e ffffffff 00000000 *
00f7a0: c015cf5e ffffffff 00000000 *
00f7a4: c015cf5e ffffffff 00000000 *
00f7a8: c015cf5e ffffffff 00000000 *
00f7ac: c015cf5e ffffffff 00000000 *
00f7b0: c015cf5e ffffffff 00000000 *
00f7b4: c015cf5e ffffffff 00000000 *
00f7b8: c015cf5e ffffffff 00000000 *
00f7bc: c015cf5e ffffffff 00000000 *
00f7c0: c015cf5e ffffffff 00000000 *
00f7c4: c015cf5e ffffffff 00000000 *
00f7c8: c015cf5e ffffffff 00000000 *
00f7cc: c015cf5e ffffffff 00000000 *
00f7d0: c015cf5e ffffffff 00000000 *
00f7d4: c015cf5e ffffffff 00000000 *
00f7d8: c015cf5e ffffffff 00000000 *
00f7dc: c015cf5e ffffffff 00000000 *
00f7e0: c015cf5e ffffffff 00000000 *
00f7e4: c015cf5e ffffffff 00000000 *
00f7e8: c015cf5e ffffffff 00000000 *
00f7ec: c015cf5e ffffffff 00000000 *
00f7f0: c015cf5e ffffffff 00000000 *
00f7f4: c015cf5e ffffffff 00000000 *
00f7f8: c015cf5e ffffffff 00000000 *
...

Same here, but only 31 words... ha, that's obvious now, f780+ is probably the $a register file - we've guessed $a31 to be fixed to 0, after all.

So the ISA registers are exposed via MMIO. I like it already. This means we can guess that f600+ is the $r register file. And some 0x200-byte area should be the $v register file.

00fa00: 256645d1 ffffffff 00000000 * ALIASES 00f200
00fa04: 07fd1216 ffffffff 00000000 * ALIASES 00f204
00fa08: 39697853 ffffffff 00000000 * ALIASES 00f208
00fa0c: c60651a2 ffffffff 00000000 * ALIASES 00f20c
00fa10: 53af85dd ffffffff 00000000 * ALIASES 00f210
00fa14: 94e2229d ffffffff 00000000 * ALIASES 00f214
00fa18: 9b4cd5f5 ffffffff 00000000 * ALIASES 00f218
[...]
00fae8: b38b7ae5 ffffffff 00000000 * ALIASES 00f2e8
00faec: a888d04a ffffffff 00000000 * ALIASES 00f2ec
00faf0: 4e151290 ffffffff 00000000 * ALIASES 00f2f0
00faf4: faa3bd87 ffffffff 00000000 * ALIASES 00f2f4
00faf8: cea58525 ffffffff 00000000 * ALIASES 00f2f8
00fafc: 1257dc99 ffffffff 00000000 * ALIASES 00f2fc

0x100 bytes, not quite what we've been looking for. And it aliases f200+ for some reason. Strangely, NV50 removes the f200:f400 area entirely, leaving fa00+ alone. That leaves one good candidate for the $v file: the 0xf000+ area. It also leaves f200/fa00+ and f300+ as two unknown 0x100-byte memory areas.
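To keep the guesses so far in one place, here is a small C sketch of the register-file address map. The macro and function names are hypothetical (they are not from envytools). The $r and $v bases are still unconfirmed guesses at this point, and the 16-bytes-per-$v-register layout is an assumption derived from 0x200 bytes / 32 registers.

```c
#include <stdint.h>

/* Hypothetical names; bases taken from the MMIO scan above. */
#define VP1_A_BASE 0xf780u /* $a file: 32 x 32-bit, $a31 fixed to 0 */
#define VP1_R_BASE 0xf600u /* guessed $r file: 32 x 32-bit */
#define VP1_V_BASE 0xf000u /* candidate $v file: 0x200 bytes */

/* MMIO address of $a<reg>. */
static inline uint32_t vp1_a_addr(unsigned reg)
{
	return VP1_A_BASE + (reg & 31) * 4;
}

/* MMIO address of $r<reg> (guess). */
static inline uint32_t vp1_r_addr(unsigned reg)
{
	return VP1_R_BASE + (reg & 31) * 4;
}

/* MMIO address of 32-bit word <word> (0-3) of $v<reg>.
 * Assumes each $v register occupies 16 contiguous bytes. */
static inline uint32_t vp1_v_addr(unsigned reg, unsigned word)
{
	return VP1_V_BASE + (reg & 31) * 16 + (word & 3) * 4;
}
```

With this map, the odd word out in the f600+ dump (00f674) would be $r29, and $a31 would sit at 00f7fc, just past the 31 words the scan showed.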

Another nice find that's useless now, but may come in useful later:

00f478: 4b4a64db 0000000b 0000000b * [NV44]
00f478: 3722c767 0000003d 0000003e * [NV50]

A good bet here is a writable clock cycle counter register. This is easy enough to verify:

NV50:
mwk@hydra ~ $ nvapoke -c1 f478 deadbeef
mwk@hydra ~ $ nvapeek -c1 f478
0000f478: 7d919eb0
mwk@hydra ~ $ nvapeek -c1 f478
0000f478: 28b0a5f5
mwk@hydra ~ $ nvapeek -c1 f478
0000f478: 7af10f3c

NV44:
mwk@hydra ~ $ nvapoke -c2 f478 deadbeef
mwk@hydra ~ $ nvapeek -c2 f478
0000f478: deadbf05
mwk@hydra ~ $ nvapeek -c2 f478
0000f478: deadbf11
mwk@hydra ~ $ nvapeek -c2 f478
0000f478: deadbf23

A quick look at 1588 register on both cards reveals that clock gating for PVP is enabled [by default] on NV44, disabled on NV50. That also gives us an independent confirmation of 1588 controlling the clock gating.

At this point we could easily extend nvatiming to figure out the PVP source clock. That's not high priority, however.
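For reference, the arithmetic behind reading f478 as a counter is just a wrapping 32-bit difference between successive reads. A minimal sketch, with a hypothetical function name, and not a claim about how nvatiming works:

```c
#include <stdint.h>

/* Ticks elapsed between two reads of a free-running 32-bit counter.
 * Unsigned subtraction wraps modulo 2^32, so a single wraparound
 * between the two reads is handled for free. */
static inline uint32_t counter_delta(uint32_t earlier, uint32_t later)
{
	return later - earlier;
}
```

On the NV44 readings above, the steps from the written deadbeef are 0x16, 0xc and 0x12 ticks, which is what a counter that only advances a little between pokes would look like.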

Back to the idea of running some code. The nice part is that we can easily verify a successful run: we know the suspected "write 0 to $aX" instruction, and $a is exposed via MMIO. The not so nice part is that we don't know exactly what the microcode is for, and thus in what circumstances the hardware will run it. Further, there are [at least] three different microcode base address registers. Some ideas for ways to run the microcode:

- a channel switch
- submitting any method
- submitting some specific "execute" method
- manual poke of some MMIO register - however this, if supported, is not used in the mmiotrace I've seen

Easy enough to try them all.

Let's start by uploading some microcode. First, the microcode base address registers on NV44:

00f498: 00000001 ffffffe3 00000000 *
00f46c: 00000001 ffffffe3 00000000 *
00f464: 00000001 ffffffe3 00000000 *

Okay, it's probably a good bet that all registers of this form are microcode addresses. Looking at the full scan reveals the following regs:

00f464: 00000001 ffffffe3 00000000 * [both]
00f468: 00000001 ffffffe3 00000000 * [both]
00f46c: 00000001 ffffffe3 00000000 * [both]
00f498: 00000001 ffffffe3 00000000 * [NV44]
00f4c8: 00000000 ffffffe3 00000000 * [NV50]

The test samples are going to be:

mov $aX 0x1dX
[lots of nops]
exit 0xdeaX
[lots of nops]

with X being 0-3 for different microcode base address regs. This will let us verify that execution is taking place, and check which piece was used. The first thing we'll try is doing a channel switch. We'll bastardise the existing nvaxtstart program for that purpose. This forces us to use NV50.

Here comes the test code: test1.c

mwk@hydra ~/envytools/nva $ test1 -c 1
mwk@hydra ~/envytools/nva $ nvapeek -c 1 f000 1000
0000f000: deadbe80 deadbea0 deadbec0 deadbee0
0000f010: deadbe81 deadbea1 deadbec1 deadbee1
0000f020: deadbe82 deadbea2 deadbec2 deadbee2
0000f030: deadbe83 deadbea3 deadbec3 deadbee3
0000f040: deadbe84 deadbea4 deadbec4 deadbee4
0000f050: deadbe85 deadbea5 deadbec5 deadbee5
0000f060: deadbe86 deadbea6 deadbec6 deadbee6
0000f070: deadbe87 deadbea7 deadbec7 deadbee7
0000f080: deadbe88 deadbea8 deadbec8 deadbee8
0000f090: deadbe89 deadbea9 deadbec9 deadbee9
0000f0a0: deadbe8a deadbeaa deadbeca deadbeea
0000f0b0: deadbe8b deadbeab deadbecb deadbeeb
0000f0c0: deadbe8c deadbeac deadbecc deadbeec
0000f0d0: deadbe8d deadbead deadbecd deadbeed
0000f0e0: deadbe8e deadbeae deadbece deadbeee
0000f0f0: deadbe8f deadbeaf deadbecf deadbeef
0000f100: deadbe90 deadbeb0 deadbed0 deadbef0
0000f110: deadbe91 deadbeb1 deadbed1 deadbef1
0000f120: deadbe92 deadbeb2 deadbed2 deadbef2
0000f130: deadbe93 deadbeb3 deadbed3 deadbef3
0000f140: deadbe94 deadbeb4 deadbed4 deadbef4
0000f150: deadbe95 deadbeb5 deadbed5 deadbef5
0000f160: deadbe96 deadbeb6 deadbed6 deadbef6
0000f170: deadbe97 deadbeb7 deadbed7 deadbef7
0000f180: deadbe98 deadbeb8 deadbed8 deadbef8
0000f190: deadbe99 deadbeb9 deadbed9 deadbef9
0000f1a0: deadbe9a deadbeba deadbeda deadbefa
0000f1b0: deadbe9b deadbebb deadbedb deadbefb
0000f1c0: deadbe9c deadbebc deadbedc deadbefc
0000f1d0: deadbe9d deadbebd deadbedd deadbefd
0000f1e0: deadbe9e deadbebe deadbede deadbefe
0000f1f0: deadbe9f deadbebf deadbedf deadbeff
...
0000f400: 00000000 00000000 80001000 00000000
0000f410: 00000000 00000000 00000000 ffffffff
0000f420: 00000001 0000dea2 000006a0 02585000
0000f430: 00000044 00001000 00000111 00000000
0000f440: deadbeff c0000000 df000000 4f000000
0000f450: bf000000 ef000000 00000000 00000000
0000f460: 00000000 01200000 01203000 01202000
0000f470: 01202000 00000011 2e022a45 00000000
...
0000f4c0: 00000000 00000000 01202000 00000000
...
0000f510: df000000 4f000000 bf000000 ef000000
0000f520: 00000030 00000008 00000000 00000000
...
0000f600: deadbe20 deadbe21 deadbe22 deadbe23
0000f610: deadbe24 deadbe25 deadbe26 deadbe27
0000f620: deadbe28 deadbe29 deadbe2a deadbe2b
0000f630: deadbe2c deadbe2d deadbe2e deadbe2f
0000f640: deadbe30 deadbe31 deadbe32 deadbe33
0000f650: deadbe34 deadbe35 deadbe36 deadbe37
0000f660: deadbe38 deadbe39 deadbe3a deadbe3b
0000f670: deadbe3c deadbe3d deadbe3e deadbe3f
0000f680: 00008000 00008000 00008000 00008000
...
0000f780: 000001d0 deadbe01 000001d2 deadbe03
0000f790: deadbe04 deadbe05 deadbe06 deadbe07
0000f7a0: deadbe08 deadbe09 deadbe0a deadbe0b
0000f7b0: deadbe0c deadbe0d deadbe0e deadbe0f
0000f7c0: deadbe10 deadbe11 deadbe12 deadbe13
0000f7d0: deadbe14 deadbe15 deadbe16 deadbe17
0000f7e0: deadbe18 deadbe19 deadbe1a deadbe1b
0000f7f0: deadbe1c deadbe1d deadbe1e 00000000
...
0000fa00: deadbe40 deadbe41 deadbe42 deadbe43
0000fa10: deadbe44 deadbe45 deadbe46 deadbe47
0000fa20: deadbe48 deadbe49 deadbe4a deadbe4b
0000fa30: deadbe4c deadbe4d deadbe4e deadbe4f
0000fa40: deadbe50 deadbe51 deadbe52 deadbe53
0000fa50: deadbe54 deadbe55 deadbe56 deadbe57
0000fa60: deadbe58 deadbe59 deadbe5a deadbe5b
0000fa70: deadbe5c deadbe5d deadbe5e deadbe5f
0000fa80: deadbe60 deadbe61 deadbe62 deadbe63
0000fa90: deadbe64 deadbe65 deadbe66 deadbe67
0000faa0: deadbe68 deadbe69 deadbe6a deadbe6b
0000fab0: deadbe6c deadbe6d deadbe6e deadbe6f
0000fac0: deadbe70 deadbe71 deadbe72 deadbe73
0000fad0: deadbe74 deadbe75 deadbe76 deadbe77
0000fae0: deadbe78 deadbe79 deadbe7a deadbe7b
0000faf0: deadbe7c deadbe7d deadbe7e deadbe7f
0000fb00: 00012204 00012204 00012204 00017391
0000fb10: 0000a0bd 0000d72e 0001c59f 0001473f
...

Success! Two samples were executed, the one in 464 and the one in 46c, with the one in 464 executed first. We also confirmed our guesses about $a registers. Further, 424 seems to be the exit code register.
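The way this result is read off the dump can be written down as a small C sketch. The function names are hypothetical. It assumes only what the test samples guarantee: sample X writes 0x1d0 | X to $aX and exits with code 0xdeaX. The order of execution is presumably inferred from the exit code register holding the code of whichever sample exited last.

```c
#include <stdint.h>

/* Bitmask of test samples that ran: bit X is set when $aX holds 0x1d0 | X,
 * the value written by sample X's "mov $aX 0x1dX". a[] is the first four
 * words read back from the $a file at 0xf780. */
static unsigned samples_run(const uint32_t a[4])
{
	unsigned mask = 0;
	for (unsigned x = 0; x < 4; x++)
		if (a[x] == (0x1d0u | x))
			mask |= 1u << x;
	return mask;
}

/* Index of the sample whose exit ran last, or -1 if the value read from
 * 0xf424 is not one of the 0xdea0..0xdea3 codes. */
static int last_sample(uint32_t exit_code)
{
	if ((exit_code & ~0xfu) != 0xdea0u || (exit_code & 0xfu) > 3)
		return -1;
	return (int)(exit_code & 0xfu);
}
```

For the dump above, $a0 = 0x1d0 and $a2 = 0x1d2 give a mask of 0x5 (samples 0 and 2), and 0xdea2 at f424 says sample 2 finished last.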

There are two good guesses now:

- 464 is executed for a context switch, 46c is executed for the method 0
- 464 and 46c are both executed for context switch: 464 to unload old, 46c to load new. That seems very unlikely though - there's no old context to unload.

I'm betting on the first one. We can verify it by clearing the registers again after the first submission, then submitting another method 0. Let's add this code to the testcase:

	/* idle everything */
	usleep(1000);
	/* clear registers */
	for (i = 0; i < 32; i++) {
		nva_wr32(cnum, 0xf780 + i * 4, 0xdeadbe00 | i);
		nva_wr32(cnum, 0xf600 + i * 4, 0xdeadbe20 | i);
		nva_wr32(cnum, 0xfa00 + i * 4, 0xdeadbe40 | i);
		nva_wr32(cnum, 0xfa80 + i * 4, 0xdeadbe60 | i);
		nva_wr32(cnum, 0xf000 + i * 16, 0xdeadbe80 | i);
		nva_wr32(cnum, 0xf004 + i * 16, 0xdeadbea0 | i);
		nva_wr32(cnum, 0xf008 + i * 16, 0xdeadbec0 | i);
		nva_wr32(cnum, 0xf00c + i * 16, 0xdeadbee0 | i);
	}
	/* pushbuffer */
	nva_wr32(cnum, 0x700000+0x50008, 0x00040000);
	nva_wr32(cnum, 0x700000+0x5000c, 0x00000001);
	nva_wr32(cnum, 0x700000+0x40008, 0x01050008);
	nva_wr32(cnum, 0x700000+0x4000c, 0x00008000);
	/* flush */
	nva_wr32(cnum, 0x70000, 1);
	while (nva_rd32(cnum, 0x70000));
	/* start */
	nva_wr32(cnum, 0xc0208c, 2);

The relevant part of the new result is:

0000f780: deadbe00 deadbe01 000001d2 deadbe03
0000f790: deadbe04 deadbe05 deadbe06 deadbe07
0000f7a0: deadbe08 deadbe09 deadbe0a deadbe0b
0000f7b0: deadbe0c deadbe0d deadbe0e deadbe0f
0000f7c0: deadbe10 deadbe11 deadbe12 deadbe13
0000f7d0: deadbe14 deadbe15 deadbe16 deadbe17
0000f7e0: deadbe18 deadbe19 deadbe1a deadbe1b
0000f7f0: deadbe1c deadbe1d deadbe1e 00000000

So, 464 is context switch, 46c is method 0 [and probably lots of other methods].

Another weird part: we've set 468 to code sample 1, but now it points to code sample 3! 4c8 likewise reassigned itself to code sample 2.

A quick test shows that writing 4c8 "writes through" to 468, but writing 468 doesn't modify 4c8. Strange. I'm betting this has something to do with the modified base address thing in p08.0. Another quick test on NV44 shows that the same behavior happens there with 498 instead of 4c8.

There are a few things that can be trivially nailed down at this point:

- immediate size in opcode 0x65 [mov immediate to $a] and 0xff [exit]
- the MMIO address at which current PC is available, if any
- how much of the cargo cult copied from the mmiotrace is required to run

First let's scan the exit code register:

00f424: 00000000 0000ffff 00000000 *

16-bit, it seems. All these bits have been already verified to work. However, a grep for 0xff instruction returns a lot of results starting with 0xfff9. Let's try it.

We'll take care of the first two points in one run. Let's apply the following changes:

- nva_wr32(cnum, 0x700000 + i * 0x1000, 0x650001d0 | i << 19 | i);
- nva_wr32(cnum, 0x70008c + i * 0x1000, 0xfff8dea0 | i);
+ nva_wr32(cnum, 0x700000 + i * 0x1000, 0x650551d0 | i << 19 | i);
+ nva_wr32(cnum, 0x7000a4 + i * 0x1000, 0xfff9dea0 | i);

Results:

0000f400: 00001000 00000000 80001000 00000000
0000f410: 00000000 00000000 00000000 ffffffff
0000f420: 00000001 0000dea2 00000000 00000004
0000f430: 00000044 00001000 00000111 00000000
0000f440: deadbeff c0000000 df000000 4f000000
0000f450: bf000000 ef000000 00000000 00000000
0000f460: 00000000 01200000 01203000 01202000
0000f470: 01202000 00000011 24671962 00000000
...
0000f4c0: 00000000 00000000 01202000 00000000
...
0000f510: df000000 4f000000 bf000000 ef000000
0000f520: 00000034 00000008 00000000 00000000
...

Ah, we got an interrupt. So that's what bit 16 of exit opcode is for.

This is the exact same interrupt as I got in my test trace by submitting an invalid method. In the same trace, 428 got set to the method, while 42c got set to the data. We can observe the same thing happening here - note how 428 and 42c changed.

Another difference is at f520 - a change from 0x30 to 0x34. This is likely the PC... however, it doesn't quite match with the address of the exit instruction in our code. One possibility is that exit takes some time to finish exiting and effectively has delay slots. We'll test that further in a moment.

0000f780: fffd51d0 deadbe01 fffd51d2 deadbe03
0000f790: deadbe04 deadbe05 deadbe06 deadbe07
0000f7a0: deadbe08 deadbe09 deadbe0a deadbe0b
0000f7b0: deadbe0c deadbe0d deadbe0e deadbe0f
0000f7c0: deadbe10 deadbe11 deadbe12 deadbe13
0000f7d0: deadbe14 deadbe15 deadbe16 deadbe17
0000f7e0: deadbe18 deadbe19 deadbe1a deadbe1b
0000f7f0: deadbe1c deadbe1d deadbe1e 00000000

This solves the other issue: the values match perfectly with a 19-bit signed immediate field. So we now know the complete encoding for at least one instruction.
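The encoding as understood so far can be sketched in C. Function names are hypothetical. The field positions are inferred from the test words used above (opcode in the top byte, destination at bit 19, immediate in bits 0-18), and the 5-bit width of the destination field is an assumption, since only $a0-$a3 were actually tried.

```c
#include <stdint.h>

/* Sign-extend the low 19 bits of v to 32 bits. */
static inline uint32_t vp1_sext19(uint32_t v)
{
	v &= 0x7ffffu;
	return (v ^ 0x40000u) - 0x40000u;
}

/* "mov $a<reg>, imm": opcode 0x65 in bits 24-31, destination in
 * bits 19-23 (assumed width), immediate in bits 0-18. */
static inline uint32_t vp1_enc_mov_a_imm(unsigned reg, uint32_t imm)
{
	return 0x65000000u | (uint32_t)(reg & 31) << 19 | (imm & 0x7ffffu);
}

/* Value the destination $a register should hold after the given mov. */
static inline uint32_t vp1_mov_a_imm_value(uint32_t insn)
{
	return vp1_sext19(insn);
}

/* Exit [opcode 0xff]: low 16 bits are the exit code... */
static inline uint32_t vp1_exit_code(uint32_t insn)
{
	return insn & 0xffffu;
}

/* ...and bit 16 is the one that produced the interrupt. */
static inline int vp1_exit_raises_intr(uint32_t insn)
{
	return (insn >> 16) & 1;
}
```

Feeding it the words from the modified test, 0x650551d0 decodes to 0xfffd51d0 and 0x651551d2 to 0xfffd51d2, the two values seen in $a0 and $a2.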

Back to the PC problem. Let's try a bigger increase now.

-	for (j = 0; j != 0x100/4; j++) {
+	for (j = 0; j != 0x1000/4; j++) {
and
-		nva_wr32(cnum, 0x7000a4 + i * 0x1000, 0xfff9dea0 | i);
+		nva_wr32(cnum, 0x7008a4 + i * 0x1000, 0xfff9dea0 | i);

Results:

0000f520: 00000234 0000000e 00000800 00000000

Odd. I can't explain the differences at 524 and 528 yet. However, it seems relatively certain that 520 bits 2-19 are bundle number [high bit determined from bitscan]. This means that execution really halts about 4-5 bundles after the exit instruction.

Delay slots. This can't be a good thing. Stinks like vµc.

Let's gather some more data by trying exit instruction at various addresses.

08c: 00000030 00000008 00000000
0a4: 00000034 00000008 00000000
1a4: 00000074 00000008 00000000
2a4: 000000b4 00000008 00000000
3c0: 000000fc 00000008 00000000
3c8: 000000fc 00000008 00000000
3cc: 00000100 00000008 00000400
3d0: 00000100 00000008 00000400
3e0: 00000100 00000006 00000400
3e8: 00000100 00000006 00000400
3f0: 00000100 00000006 00000400
3f8: 00000100 00000006 00000400
3fc: 00000100 00000006 00000400
400: 0000010c 00000006 00000400
4a4: 00000134 00000006 00000400
4a8: 00000134 00000006 00000400
4ac: 00000138 00000006 00000400
8a4: 00000234 0000000e 00000800

Now let's try to measure how many instructions after the exit get executed - we can just use movs to $a registers. A quick test shows that exactly 10 opcodes after the exit are still executed, regardless of the alignment in the bundle.

Hm. Maybe not so VLIW after all? Maybe the only importance of 16 byte alignment is as instruction fetch unit and branch target alignment?

Another weirdness is the discontinuity near 0x400. The contents of reg 528 suggest that instructions are fetched in 0x400-byte units. Redoing the exit delay test shows that execution of instructions after exit always halts at the 0x400 boundary, even before 10 instructions have been executed.

The conclusion is simple: exit has a delay slot of 10 *cycles*, not instructions. The instruction fetch required at 0x400 triggers a delay long enough for exit to finish. How ridiculously ugly.

Comparing the PC values with exit address + 0x28 (address of last executed instruction) results in a fairly close match. However, it seems that PC points to the bundle containing the instruction 0xc bytes after the last executed mov instruction. So either the PC refers to the fetch address and not the execution address, or the architecture is interlockless like vµc.
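Putting those observations together gives a small model that can be checked against the table of exit addresses. This is only a sketch fitted to the measurements: the function name is hypothetical, only the 520 column is modelled (524 and 528 are left unexplained), and applying the halt to every 0x400 boundary is an assumption, since only the first one was actually probed.

```c
#include <stdint.h>

/* Predicted contents of 0xf520 for an exit instruction at byte offset
 * exit_addr within the microcode. */
static uint32_t predict_pc(uint32_t exit_addr)
{
	/* Up to 10 more instructions (0x28 bytes) run after the exit... */
	uint32_t last = exit_addr + 0x28;
	/* ...but execution never crosses into the next 0x400-byte fetch
	 * unit, so cap at the last instruction slot of the current one. */
	uint32_t limit = (exit_addr | 0x3ffu) - 3;
	if (last > limit)
		last = limit;
	/* The PC names the 16-byte bundle holding the instruction 0xc bytes
	 * past the last executed one; the bundle number starts at bit 2. */
	return ((last + 0xc) >> 4) << 2;
}
```

This reproduces the first column of every row measured above, including the plateau at 0x100 for exits placed between 3cc and 3fc.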

I sure hope it isn't.

Still, that was much more than I hoped for in this session. We can run code and peek register contents. On the second day.

Elapsed time: 5h.
