PS2 Linux Programming
Using Inline Assembly And The FPU
The intention of this tutorial is to provide enough knowledge to get started coding MIPS Assembly using the GCC inline assembler. The MIPS EE core has hundreds of different instructions which can be found in the “EE Core Instruction Set Manual”.
MIPS Assembly (EE)
A simple inline ASM routine in GCC is given below:
int iAnswer, iX = 10, iY = 30;
asm __volatile__("\
addu %0, %1, %2 # iAnswer = iX + iY \n\
": "=r"(iAnswer) : "r"(iX), "r"(iY));
This code adds the two integer variables iX and iY together and stores the result in iAnswer. The first thing to notice is that the int variables are declared normally, as if they were going to be used with C/C++. The “__volatile__” keyword tells the compiler not to remove or reorder the ASM block during optimisation, which could otherwise produce erroneous code. The asm keyword can be thought of as a pseudo-function with the following prototype:
void asm(char * strASMCode : <output list> : <input list>);
strASMCode is a pointer to the string containing the ASM instructions. The output and input lists name the variables that are to be used in the ASM code block. If a variable is to be changed, it is put in the output list. If the variable is only to be read from, it is put in the input list.
The format of an item in the input/output list is as follows: “type”(variable). For the moment the only type of concern is “r”. This means that the variable should be held in a general purpose register, which is what is wanted for integer values. Notice that in the output list iAnswer is type “=r”; the equals sign means that a value is being written into the iAnswer variable.
The variables passed in to the ASM block are referred to as %n, where n is the position at which the variable appears in the combined output/input lists, counting from zero. iAnswer is first and is therefore %0, iX is second and is therefore %1, and iY will be %2. This means that the line “addu %0, %1, %2” is the same as saying “addu iAnswer, iX, iY”.
The function of the addu instruction can be found in the EE Core Instruction Set Manual on page 27. The important information is:
Format: “ADDU rd, rs, rt”
Description: “GPR[rd] = GPR[rs] + GPR[rt]”
The format indicates that this instruction takes three parameters. The description indicates what the instruction does. GPR[xx] means the general purpose register selected by the operand xx. It can be seen that ADDU adds together the values of rs and rt and puts the result in rd.
Comments can be inserted into the ASM block with the # prefix. Anything prefixed with “#” is a comment, just like “//” in C++.
Try experimenting with simple functions like this by performing subtract, multiply and divide with two integers (with multiply and divide it may be necessary to look into the MFLO and MFHI instructions). There are EE core specific instructions which may be better than the standard MIPS instructions and easier to use in certain circumstances, e.g. the multiply and add instruction (MADD) is EE core specific and may not require the use of the HI and LO registers when performing some types of multiplication.
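As a starting point for the multiply experiment, a minimal sketch is given below. It uses the standard MIPS MULT instruction, which leaves the product in the HI and LO registers, followed by MFLO to copy the low 32 bits into the output variable. The function name MultiplyInts is hypothetical and is not part of the tutorial's sample code. The sketch assumes that the compiler accepts "hi" and "lo" as clobber names, and it falls back to plain C when not compiled for a MIPS target.

```c
/* MultiplyInts is a hypothetical helper name used only for this sketch. */
int MultiplyInts(int iX, int iY)
{
    int iAnswer;
#if defined(__mips__)
    /* Assumption: the compiler accepts "hi" and "lo" as clobber names. */
    asm __volatile__(
        "mult %1, %2   # HI:LO = iX * iY\n\t"
        "mflo %0       # iAnswer = LO (low 32 bits of the product)\n\t"
        : "=r"(iAnswer) : "r"(iX), "r"(iY) : "hi", "lo");
#else
    /* Portable fallback so the sketch also builds on a non-MIPS machine. */
    iAnswer = iX * iY;
#endif
    return iAnswer;
}
```

Calling MultiplyInts(10, 30) should return 300 on either path. Replacing MFLO with MFHI would instead return the upper 32 bits of the product.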
Notice that in the above example, none of the registers were accessed directly. It is however possible to access registers directly as well as with the %n syntax.
The EE core has 32 GPRs (General Purpose Registers). These can be accessed via the pseudo variables $0 through $31. $0 is hard-coded to 0 and cannot be changed, even if this register is written to. While it is possible to write to any register, it is advisable to use $8 to $15 (called t0 to t7, t for temporary register). When a register is written to, it should be added to what is called the “clobber” list, which is an additional list located after the input list in the ASM block as shown below:
void asm(char * strASMCode : <output list> : <input list> : <clobber list>);
The clobber list informs the compiler that the contents of the listed registers may have been changed by the ASM block, so that the compiler can take appropriate action, such as not keeping live values in those registers across the block.
If the integer add example given at the start of this tutorial had been written in a manner which used the t0, t1, and t2 registers as temporaries, the last line would be:
": "=r"(iAnswer) : "r"(iX), "r"(iY) : "$8", "$9", "$10");
Note that if a register is only being read from (e.g. $0), there is no need to add it to the clobber list.
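To show a clobbered register in a complete block, the sketch below adds three integers using $8 (t0) as a temporary and lists it in the clobber list. The function name AddThree and the third variable iZ are hypothetical and are introduced only for this illustration. A plain C fallback is included so that the sketch also builds when not compiled for a MIPS target.

```c
/* AddThree is a hypothetical helper name used only for this sketch. */
int AddThree(int iX, int iY, int iZ)
{
    int iAnswer;
#if defined(__mips__)
    asm __volatile__(
        "addu $8, %1, %2   # t0 = iX + iY\n\t"
        "addu %0, $8, %3   # iAnswer = t0 + iZ\n\t"
        : "=r"(iAnswer) : "r"(iX), "r"(iY), "r"(iZ) : "$8");
#else
    /* Portable fallback so the sketch also builds on a non-MIPS machine. */
    iAnswer = iX + iY + iZ;
#endif
    return iAnswer;
}
```

Because $8 appears in the clobber list, the compiler will not place iX, iY, iZ or iAnswer in that register, so the temporary cannot overwrite an operand.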
The Floating Point Unit (FPU)
The FPU is connected to the EE Core via a 32-bit coprocessor connection (cop1). The FPU performs high speed floating point calculations.
The FPU has 32 registers ($f0 to $f31). $f0 is used for the function return value, so it is best not to write to it. $f12 and $f14 should also be avoided as they are used for argument passing, as should $f20 upwards since they must be preserved across function calls. All other $fX registers can be clobbered without worry.
An example of using the FPU to divide two floating point numbers is given below:
float fAnswer;
float fX = 1.0f;
float fY = 3.0f;
asm __volatile__("\
div.s %0, %1, %2 # fAnswer = fX / fY \n\
": "=f"(fAnswer) : "f"(fX), "f"(fY));
As can be seen, this example is very similar to the previous one. In the input/output lists “r” has been replaced by “f” as floating point numbers are now being used, so the variables are held in FPU registers.
All the FPU instructions can be found in the COP1 (FPU) Instruction Set section of the EE Core Instruction Set Manual. Try using some of these instructions. It will be noticed that some of the instructions mention the accumulator (ACC). These instructions will be discussed in the next section.
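As one example of trying another FPU instruction, the sketch below uses ABS.S, which takes a destination and a single source operand rather than two. The function name AbsFloat is hypothetical. The sketch assumes a MIPS target with a hardware FPU, and it falls back to plain C elsewhere.

```c
/* AbsFloat is a hypothetical helper name used only for this sketch. */
float AbsFloat(float fX)
{
    float fAnswer;
#if defined(__mips__)
    /* Assumption: the target has a hardware FPU, so the "f" type is usable. */
    asm __volatile__(
        "abs.s %0, %1   # fAnswer = |fX|\n\t"
        : "=f"(fAnswer) : "f"(fX));
#else
    /* Portable fallback so the sketch also builds on a non-MIPS machine. */
    fAnswer = (fX < 0.0f) ? -fX : fX;
#endif
    return fAnswer;
}
```

Calling AbsFloat(-2.5f) should return 2.5f on either path.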
In an ASM program, if the result of an instruction is used by another instruction before the first instruction has finished its execution, the pipeline will stall (instructions don’t necessarily finish before the next one has started). The CPU can’t do any useful work while it is stalling, which is obviously not good for performance. Most FPU arithmetic instructions have a latency of 4 clock cycles. Instruction latency is the number of clock cycles that must pass before the result of the calculation is available to be used. For a full list of latencies see page 58 of the EE Core Users Manual.
This is where the accumulator comes into play. The accumulator always has a latency of one, so there is no need to worry about stalling the EE if two or more accumulator instructions are executed one after the other. Most accumulator instructions end in A.S (e.g. MADDA.S, MULA.S, etc).
In order to write an ASM function to calculate the dot product of two 3D vectors, it might be thought that three multiplies and two adds were required. While this would work, there are at least 5 instructions to be performed and there is also the worry of the performance hit associated with stalls. The accumulator instructions allow the calculation to be done in 3 instructions without the worry of stalls. This is illustrated below.
float fV1X = 1.0f, fV1Y = 2.0f, fV1Z = 3.0f;
float fV2X = 4.0f, fV2Y = 3.0f, fV2Z = 3.5f;
float fAnswer;
asm __volatile__("\
mula.s %1, %4 # ACC = fV1X * fV2X \n\
madda.s %2, %5 # ACC = ACC + (fV1Y * fV2Y) \n\
madd.s %0, %3, %6 # fAnswer = ACC + (fV1Z * fV2Z) \n\
": "=f"(fAnswer) : "f"(fV1X), "f"(fV1Y), "f"(fV1Z), "f"(fV2X), "f"(fV2Y), "f"(fV2Z));
This code should be fairly self-explanatory as the comments show what the instructions are doing.
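For comparison, the five-instruction approach mentioned earlier (three multiplies and two adds) might look like the sketch below. It is not taken from the tutorial's sample code: the function name DotProductNaive is hypothetical, and $f1 to $f3 are chosen as temporaries on the assumption that they are free to clobber. Each add depends on results produced just before it, which is where stalls are likely to occur. A plain C fallback is included for non-MIPS builds.

```c
/* DotProductNaive is a hypothetical helper name used only for this sketch. */
float DotProductNaive(float fV1X, float fV1Y, float fV1Z,
                      float fV2X, float fV2Y, float fV2Z)
{
    float fAnswer;
#if defined(__mips__)
    /* Assumption: $f1, $f2 and $f3 are free to use as temporaries. */
    asm __volatile__(
        "mul.s $f1, %1, %4     # f1 = fV1X * fV2X\n\t"
        "mul.s $f2, %2, %5     # f2 = fV1Y * fV2Y\n\t"
        "mul.s $f3, %3, %6     # f3 = fV1Z * fV2Z\n\t"
        "add.s $f1, $f1, $f2   # f1 = f1 + f2, depends on both products\n\t"
        "add.s %0, $f1, $f3    # fAnswer = f1 + f3, depends on the previous add\n\t"
        : "=f"(fAnswer)
        : "f"(fV1X), "f"(fV1Y), "f"(fV1Z), "f"(fV2X), "f"(fV2Y), "f"(fV2Z)
        : "$f1", "$f2", "$f3");
#else
    /* Portable fallback so the sketch also builds on a non-MIPS machine. */
    fAnswer = (fV1X * fV2X) + (fV1Y * fV2Y) + (fV1Z * fV2Z);
#endif
    return fAnswer;
}
```

With the vectors (1, 2, 3) and (4, 3, 3.5) the result is 20.5. The accumulator version computes the same value in three instructions and needs no temporary registers in the clobber list.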
Unfortunately, a small document like this can never hope to give more than a brief introduction to assembly language. For those interested in learning more here are a few resources:
A good site to learn a bit more about EE ASM:
http://cosmos.raunvis.hi.is/~johannos/mips/mips-howto.html
For more information about stalling and the accumulator see:
http://cosmos.raunvis.hi.is/~johannos/mips/sparky.html
A good place to find some more complex examples of assembly language is in the PS2Maths.cpp file that accompanies the sample code used in this tutorial series. The good thing about these is that at the bottom of each function is the C/C++ equivalent, so it is relatively easy to determine exactly what the overall assembly code routine is doing.
Aside from these, the manuals that accompany the PS2 Linux kit are an invaluable resource for learning assembly language on the PS2.
Dr Henry S Fortuna
University of Abertay Dundee