Any profiler will tell you that tracker spends most of its time emulating
the Amiga hardware in `resample.c' (except on the Amiga, of course).
Basically, the machine has to compute the actual stream of bytes to output.
There are several things to do:
DO_NOTHING/PLAY/REPLAY does. Each pitch value gives
rise to a step value (see `notes.c' for the step value table).
tracker runs down the sample `by hand' using fixed-point arithmetic.
st_read.c') that is used to scale the sample value. Then 
sample values are added according to left side/right side.
Arch/.../audio.c'
files: for a convincing stereo effect, some part of the right side has to go out on
the left side and vice versa. This is usually handled by a generic routine
in `Arch/common.c'.
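The fixed-point walk through a sample can be sketched roughly as follows. This is only an illustration, not the code of `resample.c': the structure, the field names, FRAC_BITS and next_value are all hypothetical, and looping (REPLAY) is left out.

```c
#include <stdint.h>

#define FRAC_BITS 16            /* hypothetical: fraction bits of a position */

/* Hypothetical channel state; tracker's real structure differs. */
struct chan_sketch {
    const int8_t *data;         /* 8-bit signed sample data */
    uint32_t length;            /* sample length, in fixed point */
    uint32_t pos;               /* current position, in fixed point */
    uint32_t step;              /* advance per output value, in fixed point */
};

/* Return the next raw sample value of this channel, or 0 once the
   sample is exhausted (this sketch does not loop). */
int next_value(struct chan_sketch *ch)
{
    int v;

    if (ch->pos >= ch->length)
        return 0;
    v = ch->data[ch->pos >> FRAC_BITS];   /* integer part selects the byte */
    ch->pos += ch->step;                  /* fractional part accumulates */
    return v;
}
```

A larger step runs through the sample faster and so yields a higher pitch; a step of exactly 1 << FRAC_BITS plays the sample at one byte per output value.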
There is also some provision for oversampling: instead of using one sample value for each sample output, we add several sample values distributed in the right area.
oversample. That way, we get oversample
values for each value output.
oversample is a power of two, since it is enough
in this case to augment the output size accordingly.
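One possible reading of the oversampling scheme is sketched below. It assumes that the step is divided by oversample and that oversample values are summed for each value output; the function and its parameters are hypothetical and do not mirror the actual code.

```c
#include <stdint.h>

#define FRAC_BITS 16            /* hypothetical: fraction bits of a position */

/* Sum `oversample` fixed-point lookups into one output value.
   Assumption: the position advances by step/oversample per lookup. */
long oversampled_value(const int8_t *data, uint32_t length_fix,
                       uint32_t *pos, uint32_t step, unsigned oversample)
{
    long sum = 0;
    uint32_t small_step = step / oversample;
    unsigned i;

    for (i = 0; i < oversample; i++) {
        if (*pos < length_fix)            /* past the end: contributes 0 */
            sum += data[*pos >> FRAC_BITS];
        *pos += small_step;
    }
    return sum;                           /* left unscaled on purpose */
}
```

The sum is deliberately not divided back: when oversample is a power of two, the result simply carries a few more significant bits, which the output stage can account for.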
All this can take a lot of time, maybe too much for your CPU. What can be done to alleviate the problem?
oversample, 
the special `empty' sample and its properties (
samp->fix_length == 0 && samp->rp_start == NULL).
As a consequence, new sample styles (like 16 bit samples) should be implemented
as new types of commands like DO_NOTHING, PLAY, REPLAY
(PLAY16 and REPLAY16 come to mind), even though this will
duplicate loads of code.
Recently, the code has changed to allow for more than four channels. This incurs a slight overhead (two more additions), which is negligible for oversampled replay and has been deemed acceptable for simple replay.
The lookup table for converting 0-64 volumes to a linear scale is a cheap way to allow for all sorts of manipulations on the sample volumes at a low cost, and also to use n-bit samples in an almost transparent way, even with n not being an integral multiple of 8.
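As a minimal sketch of the idea, a volume table could look like this. The names volume_table, build_volume_table and scale_sample are hypothetical, as is the plain linear ramp; the real table may well hold something else.

```c
#define MAX_VOLUME 64           /* module volumes run from 0 to 64 */

static int volume_table[MAX_VOLUME + 1];

/* Fill the table once. Here it is a plain linear ramp scaled by `gain`;
   any other volume curve could be stored at no extra run-time cost. */
void build_volume_table(int gain)
{
    int v;

    for (v = 0; v <= MAX_VOLUME; v++)
        volume_table[v] = v * gain;
}

/* Scale one sample value by a module volume in 0..64. */
long scale_sample(int sample, int volume)
{
    return (long)sample * volume_table[volume];
}
```

Changing the gain, or the width of the incoming samples, then only requires rebuilding the 65 table entries, not touching the inner loop.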
If your machine is really slow, and uses ulaw, computing a complete lookup
table (all 16384 values of it) might speed things up somewhat. Removing the
oversampling test altogether might also help. Then you can unroll the
for(i = 0; i < number; i++) loop, initialising value[LEFT_SIDE] and
value[RIGHT_SIDE] with their first real values instead of 0.
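The unrolling might look like the sketch below. It assumes four channels laid out left, right, right, left, and it assumes one already-scaled value per channel; mix_loop and mix_unrolled4 are hypothetical names, not functions of tracker.

```c
enum { LEFT_SIDE = 0, RIGHT_SIDE = 1 };

/* Generic form: clear both sides, then add every channel to its side. */
void mix_loop(const long *chan_value, const int *side, int number,
              long value[2])
{
    int i;

    value[LEFT_SIDE] = value[RIGHT_SIDE] = 0;
    for (i = 0; i < number; i++)
        value[side[i]] += chan_value[i];
}

/* Unrolled form for exactly four channels, assumed to be laid out
   left, right, right, left. Each side starts from a real value,
   so the clearing and the loop test both disappear. */
void mix_unrolled4(const long *chan_value, long value[2])
{
    value[LEFT_SIDE] = chan_value[0] + chan_value[3];
    value[RIGHT_SIDE] = chan_value[1] + chan_value[2];
}
```

The price is that the unrolled version no longer handles an arbitrary number of channels.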
If all that fails, you can still look for a better compiler, check whether your audio bandwidth is too limited, or downgrade the audio output to a lower but acceptable frequency (stuttering, or outputting the same sample several times, is possible). As a last resort, you can still go to assembly language code.
An important optimization may exist if your machine uses dynamic libraries
and dynamic linking: on some Unixes (Sparcs for instance), some table lookup
and dynamic linking occur at runtime, which means that function calls may be
slightly slower. In that case, coercing your linker to use static linking may be
a good idea. If you can, link only `Arch/machine/audio.c' and 
`resample.c' statically, since this is the place where the speed bottleneck
occurs. This way, you will get both the advantages of static linking (speed) and
dynamic linking (size).
Maybe the specific audio code for your architecture can be improved.
Recently, I've added some optimization to `Arch/common.c'.
Checking where tracker was spending time, I discovered that almost all the time
was spent computing divisions/multiplications, like for the stereo mixing.
Instead of computing:
    realLeft  = left*primary + right*secondary
    realRight = right*primary + left*secondary
computing
    sum  = (left+right) * (primary+secondary)/2
    diff = (left-right) * (primary-secondary)/2
    realLeft  = sum+diff
    realRight = sum-diff
saves two multiplications! Just realize that (primary+secondary)/2 and
(primary-secondary)/2 don't change and can be precomputed.
Apart from primitive architectures where multiplication and addition cost
the same, this saves a lot of time. On a Sparc 5, this makes the difference between
being able to use `-over 2 -freq 44' and not!
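The two formulations can be compared side by side in a short sketch. The structure and function names here are hypothetical, and integer coefficients are assumed; this is not the actual code of `Arch/common.c'.

```c
/* Coefficients computed once, whenever primary/secondary change. */
struct mix_coeffs {
    long half_sum;              /* (primary + secondary) / 2 */
    long half_diff;             /* (primary - secondary) / 2 */
};

struct mix_coeffs make_coeffs(long primary, long secondary)
{
    struct mix_coeffs m;

    /* Integer division: the fast form matches the direct one exactly
       only when primary + secondary is even. */
    m.half_sum = (primary + secondary) / 2;
    m.half_diff = (primary - secondary) / 2;
    return m;
}

/* Direct form: four multiplications per output pair. */
void mix_direct(long left, long right, long primary, long secondary,
                long *real_left, long *real_right)
{
    *real_left = left * primary + right * secondary;
    *real_right = right * primary + left * secondary;
}

/* Rearranged form: two multiplications per output pair. */
void mix_fast(long left, long right, const struct mix_coeffs *m,
              long *real_left, long *real_right)
{
    long sum = (left + right) * m->half_sum;
    long diff = (left - right) * m->half_diff;

    *real_left = sum + diff;
    *real_right = sum - diff;
}
```

With primary = 70 and secondary = 30, both forms turn left = 100, right = -40 into 5800 and 200.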
There is a switch in `Arch/common.c' (NEW_OUTPUT_SAMPLES_AWARE)
used for compatibility. In older implementations of tracker, the resample
code called output_samples(left_value, right_value), where left
and right values were 23 bits signed. The newer call is
output_samples(left_value, right_value, width), with width the number
of bits used. Newer ports should define NEW_OUTPUT_SAMPLES_AWARE and
use the code of `Arch/common.c' whenever possible.
This helps a lot when, for instance, oversample is used, since this keeps
the shifting of data left or right to a minimum.
Check whether your implementation uses the new form of output_samples.
If it does not, it is a good idea to convert it.
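To illustrate why passing the width helps, here is a sketch of a hypothetical helper that brings a value of known width to the resolution of the audio device in a single step. fit_width is not part of tracker, and the real output_samples does more than this.

```c
/* Convert a signed value that uses `width` significant bits to one that
   uses `out_bits` bits. Division is used instead of >> because right
   shifts of negative values are implementation-defined in C. */
long fit_width(long value, int width, int out_bits)
{
    if (width > out_bits)
        return value / (1L << (width - out_bits));
    return value * (1L << (out_bits - width));
}
```

Since the caller states how many bits are in use, the resampling code does not have to normalise everything to a fixed width first, only for the audio code to shift it again.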
Also, the audio routine should give its output resolution when needed. Right now, tracker doesn't use it, but when I get around to adding 16 bit samples, tracker will routinely convert them down to 8 bit if the audio output is only 8 bits.
As a rule, the Sparc version is the most complete. Try to refer to it in case of doubt.