5. Myth: VU code is hard.
VU code isn’t hard. Fast VU code is hard, but there are now some tools to help you get 80 percent of the way there for a lot less effort.
VCL (Vector Command Line, as opposed to the interactive graphic version) is a tool that preprocesses a single stream of VU code (no paired instructions necessary), analyses it for loop blocks and control flow, pairs and rearranges instructions, unrolls loops and interleaves the result to give pretty efficient code. For example, take this simplest of VU programs, which takes a block of vectors, matrix-multiplies them in place by a fixed matrix, divides by W and converts the result to integers:
; test.vcl
; simplest vcl program ever
.init_vf_all
.init_vi_all
--enter
--endenter
.name start_here
start_here:
ilw.x srce_ptr, 0(vi00)
ilw.x counter, 1(vi00)
iadd counter, counter, srce_ptr
lq v_transf0, 2(vi00)
lq v_transf1, 3(vi00)
lq v_transf2, 4(vi00)
lq v_transf3, 5(vi00)
loop:
--LoopCS 6, 1
lq vec, 0(srce_ptr)
mulax.xyzw ACC, v_transf0, vec
madday.xyzw ACC, v_transf1, vec
maddaz.xyzw ACC, v_transf2, vec
maddw.xyzw vec, v_transf3, vf00
div Q, vf00w, vecw
mulq.xyzw vec, vec, Q
ftoi4.xyzw vec, vec
sq vec, 0(srce_ptr)
iaddiu srce_ptr,srce_ptr,1
ibne srce_ptr, counter, loop
--exit
--endexit
. . .
VCL takes the source code, pairs the instructions and unrolls the loop into this seven-instruction inner loop (entry and exit blocks not shown):
loop__MAIN_LOOP:
; [0,7) size=7 nU=6 nL=7 ic=13 [lin=7 lp=7]
maddw VF09,VF04,VF00w lq.xyz VF08,0(VI01)
nop sq VF07,(0)-(5*(1))(VI01)
ftoi4 VF07,VF06 iaddiu VI01,VI01,1
mulq VF06,VF05,Q move VF05,VF10
mulax ACC,VF01,VF08x div Q,VF00w,VF09w
madday ACC,VF02,VF08y ibne VI01,VI02,loop__MAIN_LOOP
maddaz ACC,VF03,VF08z move VF10,VF09
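To make it clear what the loop computes for each vertex, here is a minimal reference model in Python (not VU code). It is a sketch under stated assumptions: the four v_transf registers are treated as matrix columns, the input W is taken as 1.0 because the loop multiplies v_transf3 by the W field of vf00, and ftoi4 is modelled as a truncating conversion to fixed point with four fractional bits. The name transform_vertex is invented for this illustration.

```python
def transform_vertex(columns, vec):
    """Reference model of one iteration of the VU loop above.

    columns: four 4-element lists standing in for v_transf0..v_transf3.
    vec:     input vertex; only x, y, z are used (w is assumed to be 1.0).
    """
    x, y, z = vec[0], vec[1], vec[2]
    # mulax / madday / maddaz / maddw: accumulate column * component.
    out = [
        columns[0][i] * x + columns[1][i] * y + columns[2][i] * z + columns[3][i]
        for i in range(4)
    ]
    # div Q, vf00w, vecw: Q = 1.0 / w
    q = 1.0 / out[3]
    # mulq: scale all four fields by Q
    out = [c * q for c in out]
    # ftoi4: float to fixed point with 4 fractional bits (assumed to truncate)
    return [int(c * 16.0) for c in out]


# Hypothetical test matrix: identity, except that it produces w = 2.
identity_w2 = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 2.0],
]
result = transform_vertex(identity_w2, [1.0, 2.0, 3.0])
```

A model like this is handy for checking the output of a hand-written or VCL-generated loop against known-good values.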
6. Myth: Synchronization is complicated.
The problem with synchronization is that much of it is built into the hardware and the documentation isn’t clear about what’s happening and when. Synchronization points are described variously as “stall states” or hidden behind descriptions of queues and scattered all over the documentation. Nowhere is there a single list of “How to force a wait for X” techniques.
The first point to make is that complicated as general purpose synchronization is, when we are rendering to screen we are dealing with a more limited problem: you only need to keep things in sync once a frame. All your automatic processes can kick off and be fighting for resources during a frame, but as soon as you reach the end of rendering the frame then everything must be finished. You are only dealing with short bursts of synchronization.
The PS2 has three main systems for synchronization: synchronization within the EE Core; synchronization between the EE Core and external devices; and synchronization between external devices.
This whole area is worthy of a paper in itself, as much of this information is spread around the documentation. Breaking the problem down into these three areas allows you to grok the whole system. Briefly summarizing:
Within the EE Core we have sync.l and sync.e instructions that guarantee that results are finished before continuing with execution.
Between the EE Core and external devices (VIF, GIF, DMAC, etc.) we have a variety of tools. Many events can generate interrupts upon completion. The VIF has a mark instruction that sets the value of a register that the EE Core can read, letting the EE Core know that a certain point has been reached in a DMA stream. There are also memory-mapped registers containing status bits that can be polled.
Between external devices there is a well-defined set of priorities that determines the order of execution. The VIF can also be forced to wait using the flush, flushe and flusha instructions. These are the main ones we’ll be using in this tutorial.
7. Myth: Scratchpad is for speed.
The Scratchpad is the 16KB area of memory that is actually on-chip in the EE Core. Using some MMU shenanigans at boot up time, the EE Core makes Scratchpad RAM (SPR) appear to be part of the normal memory map. The thing to note about SPR is that reads and writes to SPR are uncached and memory accesses don’t go through the memory bus – it’s on-chip and physically sitting next to (actually inside) the CPU.
You could think of scratchpad as a fast area of memory, like the original PSX, but real world timings show that it’s not that much faster than Uncached Accelerated memory for sequential work or in-cache data for random work. The best way to think of SPR is as a place to work while the data bus is busy - something like a playground surrounded by roads with heavy traffic.
Picture this: Your program has just kicked off a huge DMA chain of events that will automatically upload and execute VU programs and move information through the system. The DMAC is moving information from unit to unit over the Memory Bus in 8-qword chunks, checking for interruptions every tick, and the CPU has precedence. The last thing the DMAC needs is to be interrupted every 8 clock cycles by the CPU needing the bus for more data. This is why the designers gave you an area of memory to play with while this happens. Sure, the Instruction and Data caches play their part, but they are primarily there to aid throughput of instructions.
Scratchpad is there to keep you off the data bus – use it to batch up memory writes, then move the data to main memory with burst-mode DMA transfers on the fromSPR DMA channel.
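The batching idea can be illustrated with a toy model in Python. This is only a sketch of the access pattern, not PS2 code: the class ScratchpadBatcher and its fields are invented for illustration, a Python list stands in for Scratchpad RAM, and each flush stands in for one burst transfer on the fromSPR channel.

```python
SPR_BYTES = 16 * 1024          # Scratchpad size given in the text
QWORD_BYTES = 16               # one quadword = 128 bits
SPR_QWORDS = SPR_BYTES // QWORD_BYTES


class ScratchpadBatcher:
    """Toy model: collect qwords locally, then flush them as one burst."""

    def __init__(self, capacity_qwords=SPR_QWORDS):
        self.capacity = capacity_qwords
        self.pending = []       # stands in for Scratchpad RAM
        self.main_memory = []   # stands in for the destination buffer
        self.bursts = 0         # number of simulated fromSPR transfers

    def write(self, qword):
        # Flush first if the local buffer is already full.
        if len(self.pending) == self.capacity:
            self.flush()
        self.pending.append(qword)

    def flush(self):
        if self.pending:
            # One burst moves the whole batch, instead of one bus access per write.
            self.main_memory.extend(self.pending)
            self.pending = []
            self.bursts += 1


batcher = ScratchpadBatcher()
for i in range(2500):
    batcher.write(i)
batcher.flush()
```

In this model 2,500 individual writes reach main memory in three bursts. A real implementation would probably double-buffer, filling one half of the Scratchpad while the other half is being transferred, but that is beyond this sketch.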
8. There is no such thing as “The Pipeline”.
The best way to think about the rendering hardware in PS2 is a series of optimized programs that run over your data and pipe the resulting polygon lists to the GS. Within a frame there may be many different renderers – one for unclipped models, one for procedural models, one for specular models, one for subdivision surfaces, etc.
As each renderer is less than 16KB of VU code they are very cheap to upload compared to the amount of polygon data they will be generating. Program uploads can be embedded inside DMA chains to complete the automation process, e.g.
9. Speed is all about the Bus.
This has been said many times before, but it bears repeating. The theoretical speed limits of the GS are pretty much attainable, but only by paying attention to the bus speed. The GS can kick one triangle every clock tick (using tri-strips) at 150MHz. This gives us a theoretical upper limit of:
150 million verts per second = 2.5 million verts / frame at 60Hz
Given that each of these polygons will be flat shaded the result isn’t very interesting. We will need to factor in a perspective transform, clipping and lighting which are done on the VUs, which run at 300MHz. The PS2 FAQ says these operations can take 15 – 20 cycles per vertex typically, giving us a throughput of:
300 million cycles per second / 60 frames per second
= 5 million cycles per frame
5 million cycles / 20 cycles per vertex
= 250,000 verts per frame
= 15 million verts per second
5 million cycles / 15 cycles per vertex
= 333,000 verts per frame
= 20 million verts per second
Notice the difference here. Just by removing five cycles per vertex we get a third more output. This is the reason we need different renderers for every situation – each renderer can shave off precious cycles-per-vertex by doing only the work necessary.
This is also the reason we have two VUs. VU1 is often described as the “rendering” VU and VU0 as the “everything else” VU, but this is not necessarily so. Both can be transforming vertices but only one can be feeding the GIF, and this explains the Memory FIFO you can set up: one VU is feeding the GS while the other is filling the FIFO. It also explains why we have two rendering contexts in the GS, one for each of the two input streams.
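The throughput figures in this section can be reproduced with a few lines of Python. This is just the arithmetic from the text written out. The function name verts_per_frame is invented for illustration, and the clock rates and cycle counts are the round numbers quoted above.

```python
def verts_per_frame(clock_hz, frame_hz, cycles_per_vertex):
    """Vertices that fit in one frame at a given cost per vertex."""
    cycles_per_frame = clock_hz // frame_hz
    return cycles_per_frame // cycles_per_vertex


VU_CLOCK_HZ = 300_000_000   # VU clock rate quoted in the text
GS_CLOCK_HZ = 150_000_000   # GS clock rate quoted in the text
FRAME_HZ = 60

gs_limit = verts_per_frame(GS_CLOCK_HZ, FRAME_HZ, 1)    # one vertex per tick
slow = verts_per_frame(VU_CLOCK_HZ, FRAME_HZ, 20)       # 20 cycles per vertex
fast = verts_per_frame(VU_CLOCK_HZ, FRAME_HZ, 15)       # 15 cycles per vertex
```

Plugging in the cycle count of your own inner loop gives a quick upper bound on what a renderer can deliver per frame, before bus contention is taken into account.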
10. There are new tools to help you.
Unlike the early days of the PS2 where everything had to be painstakingly pieced together from the manuals and example code, lately there are some new tools to help you program PS2. Most of these are freely available for registered developers from the PS2 support websites and nearly all come with source.
DMA Disassembler. This tool, from SCEE’s James Russell, takes a complete DMA packet, parses it and generates a printout of how the machine will interpret the data block when it is sent. It can report errors in the chain and provides an excellent visual report of your DMA chain.
Packet Libraries. Built by Tyler Daniel, this set of classes allows easy construction of DMA packets, either at fixed locations in memory or in dynamically allocated buffers. The packet classes are styled after insertion-only STL containers and know how to add VIF tags, create all types of DMA packet and will calculate qword counts for you.