can IBM actually deliver a teraflop part? My answer is yes. Based 
on 0.10 micron technology, EE3 will be able to clock at 3 GHz. This means EE3 has to 
deliver about 333 flops per cycle to reach a teraflop. Presuming that the fundamental data size of 
a 128-bit vector does not change (there is no reason it should), each VU with 4 FMACs and 2 
dividers will deliver 10 flops per cycle (8 from the FMACs, counting each as a multiply plus an add, and 2 from the dividers), so you will need 32 VUs plus the CPU FPU to reach 
a teraflop. This is technically feasible since you have 100~200 million transistors to play 
with at the 0.10 micron level. 
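The arithmetic above can be checked with a quick back-of-envelope script. This is only a sketch of the numbers quoted in this post; the flop weights (2 per FMAC, 1 per divider) are my reading of how the 10 flops per VU figure is reached, and all names are made up for illustration.

```python
# Back-of-envelope check of the teraflop estimate.
# All figures are the speculative ones from the post, not confirmed specs.
CLOCK_HZ = 3e9          # assumed 3 GHz clock
TARGET_FLOPS = 1e12     # one teraflop
FMACS_PER_VU = 4
DIVIDERS_PER_VU = 2
NUM_VUS = 32


def flops_per_cycle_needed(target_flops, clock_hz):
    """Flops the whole chip must retire every cycle to hit the target."""
    return target_flops / clock_hz


def flops_per_cycle_per_vu(fmacs, dividers):
    """Assumption: an FMAC counts as 2 flops (multiply + add), a divider as 1."""
    return fmacs * 2 + dividers


needed = flops_per_cycle_needed(TARGET_FLOPS, CLOCK_HZ)
per_vu = flops_per_cycle_per_vu(FMACS_PER_VU, DIVIDERS_PER_VU)
bank_total = per_vu * NUM_VUS
shortfall = needed - bank_total  # what the CPU FPU would have to cover

print(f"needed per cycle: {needed:.1f}")
print(f"per VU:           {per_vu}")
print(f"32-VU bank:       {bank_total}")
print(f"left for CPU FPU: {shortfall:.1f}")
```

Under these assumptions the 32 VUs alone land a little short of the target, which is why the CPU FPU is counted in.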
But how would programmers be able to manage 32 VUs when they were unable to 
cope with only 2 VUs on EE1? Recall that the EE1 programming headache arose because of 
direct VU visibility, and the problem goes away if the multiple units are properly 
shadowed, much like how programmers only see one pixel shader even though four are 
actually present in GF3. Likewise, IBM could arrange the 32 VUs in a bank and keep only 
one input to make the 32 VUs appear as one. Under this programming mode, a programmer 
would simply dump his data packet into the VU bank input port; the VU manager would read the 
packet from the input port and send it to an idle VU along with the tagged script code. The 
results are then sent to the destination indicated by the script, be it CPU, memory, or 
rasterizer. 
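To make the idea concrete, here is a toy software model of the VU bank described above. It is purely a sketch of my guess at the scheme, not anything IBM or Sony has described: the names (Packet, VUBank, submit, run) are hypothetical, the "script" is just a Python function, and packets are processed one at a time instead of in parallel.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Packet:
    data: List[float]                              # payload dumped by the programmer
    script: Callable[[List[float]], List[float]]   # tagged script code
    destination: str                               # "cpu", "memory", or "rasterizer"


class VUBank:
    """Hypothetical bank of VUs hidden behind a single input port."""

    def __init__(self, num_vus=32):
        self.idle = deque(range(num_vus))   # ids of idle VUs
        self.input_port = deque()           # the only thing the programmer sees
        self.outputs = {"cpu": [], "memory": [], "rasterizer": []}
        self.log = []                       # which VU handled each packet

    def submit(self, packet):
        """Programmer side: dump a packet into the input port."""
        self.input_port.append(packet)

    def run(self):
        """Manager side: hand each packet to an idle VU, route the result."""
        while self.input_port:
            packet = self.input_port.popleft()
            vu_id = self.idle.popleft()            # pick an idle VU
            result = packet.script(packet.data)    # the VU runs the script
            self.outputs[packet.destination].append(result)
            self.log.append(vu_id)
            self.idle.append(vu_id)                # VU becomes idle again


def scale_script(v):
    return [x * 2.0 for x in v]


def sum_script(v):
    return [sum(v)]


bank = VUBank(num_vus=32)
bank.submit(Packet([1.0, 2.0, 3.0, 4.0], scale_script, "rasterizer"))
bank.submit(Packet([1.0, 2.0], sum_script, "cpu"))
bank.run()
print(bank.outputs["rasterizer"], bank.outputs["cpu"], bank.log)
```

The point is that the calling code never names a VU; it only touches submit, and the manager decides which unit does the work.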
We will have to wait until the EE3 unveiling to find out the exact details, but I have confidence in IBM and Sony. 
Speaking of EE3, it is quite evident that GS3 will not have T&L since so much 
computational power is focused on EE3. From Sony's ISSCC2001 presentation, the GS3 
will have 32~64 MB of eDRAM and clock at 714 MHz. Sony's continued focus on 
eDRAM makes GS3 T&L unlikely due to low clockspeed and die-size limitation. ***3 will have a very good CPU but the GPU could stay underwhelming.----DM
I think this is too complicated for most of you, so this will be my last post in this thread.