This topic is meant for discussing extensions and optimizations to the TIGCCLIB sprite routines (currently Sprite8, Sprite16 and Sprite32).
First, a little bit of history.
There have been at least three rounds of optimizations and extensions to the TIGCCLIB sprite routines:
[ul][li]a minor optimization on the address computation. I sent it to Kevin in May 2002, i.e. roughly when the same optimization was applied to ExtGraph.[/li]
[li]a further optimization of the address computation + a change in the Sprite32 algorithm: as in ExtGraph, each row can be processed as 1 long + 1 short instead of two longs, one of which needs a shift count > 16. I sent the modified routines to Kevin in October 2003, i.e. roughly when the same optimizations were applied to ExtGraph.[/li]
[li]Joey Adams' (MrJoey / joeyadams) work on sprite routines: rewrite in assembly; new SPRT_RPLC (AND+OR the same sprite in a single call) drawing mode; new routines: generic (multi-mode) clipped routines, single-mode non-clipped and clipped routines.[/li][/ul]
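To make the "1 long + 1 short" item concrete, here is a minimal C sketch for the OR mode only. It is not the TIGCCLIB or ExtGraph code (those routines are 68k assembly): the function names, the explicit big-endian accessors and the 30-byte row stride (240-pixel-wide plane) are assumptions made for the illustration.

```c
#include <assert.h>
#include <stdint.h>

#define ROW_BYTES 30u  /* assumed stride: 240-pixel-wide plane, 1 bit per pixel */

/* Big-endian accessors, so the sketch behaves like the 68k on any host. */
static uint32_t load_be32(const uint8_t *p) {
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}
static void store_be32(uint8_t *p, uint32_t v) {
    p[0] = (uint8_t)(v >> 24); p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);  p[3] = (uint8_t)v;
}
static uint16_t load_be16(const uint8_t *p) {
    return (uint16_t)(((unsigned)p[0] << 8) | p[1]);
}
static void store_be16(uint8_t *p, uint16_t v) {
    p[0] = (uint8_t)(v >> 8); p[1] = (uint8_t)v;
}

/* Old scheme: two long accesses; the second needs a shift count of 32 - cnt (> 16). */
static void or_row_two_longs(uint8_t *addr, uint32_t data, unsigned cnt) {
    assert(cnt < 16);
    uint64_t window = (uint64_t)data << (32 - cnt);  /* 64-bit view of both longs */
    store_be32(addr, load_be32(addr) | (uint32_t)(window >> 32));
    store_be32(addr + 4, load_be32(addr + 4) | (uint32_t)window);
}

/* New scheme: one long + one short; the spill-over shift count is 16 - cnt (<= 16). */
static void or_row_long_short(uint8_t *addr, uint32_t data, unsigned cnt) {
    assert(cnt < 16);
    uint16_t spill = (uint16_t)(data << (16 - cnt));  /* low cnt bits, left-aligned */
    store_be32(addr, load_be32(addr) | (data >> cnt));
    store_be16(addr + 4, (uint16_t)(load_be16(addr + 4) | spill));
}

/* OR an h-row, 32-pixel-wide sprite at (x, y); x and y are assumed to be in range. */
static void sprite32_or(unsigned x, unsigned y, unsigned h, const uint32_t *sprite,
                        uint8_t *plane, int use_long_short) {
    uint8_t *addr = plane + y * ROW_BYTES + ((x >> 4) << 1);  /* word-aligned column */
    unsigned cnt = x & 15;
    for (; h != 0; h--, addr += ROW_BYTES) {
        if (use_long_short) or_row_long_short(addr, *sprite++, cnt);
        else                or_row_two_longs(addr, *sprite++, cnt);
    }
}
```

Both variants leave the plane in the same state for every cnt in 0..15, because the low 16 bits of the second long are always zero. The long + short variant also touches two fewer bytes per row.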
A tiny subset (Sprite8, Sprite16, Sprite32 with SPRT_RPLC support) of Joey's extension and optimization work was merged into GCC4TI Beta 10, after further optimization (conversion of branches to explicit short form; address computation; reordering of basic blocks; Sprite32 algorithm change).
And I've just noticed that we can squeeze two more bytes out of each of the three routines, using an optimization I described in the S1P9 tutorial: subq.w #1; beq.s; addq.w #1 instead of cmpi.w #1; beq.s; tst.w.
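Why the two sequences are interchangeable can be checked with a tiny C model. This is only a sketch: the `Cpu` struct, the choice of d0 as the operand and the assumption that whatever follows only looks at the Z flag are mine, not taken from the routines. The byte counts in the comments are the standard 68000 encodings (cmpi.w #imm,Dn is 4 bytes, the other instructions are 2 bytes each), hence 8 bytes versus 6.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: one 16-bit data register and the Z flag only. */
typedef struct { uint16_t d0; bool z; } Cpu;

static void cmpi_w(Cpu *c, uint16_t imm) { c->z = (uint16_t)(c->d0 - imm) == 0; }
static void tst_w(Cpu *c)                { c->z = c->d0 == 0; }
static void subq_w(Cpu *c, uint16_t imm) { c->d0 = (uint16_t)(c->d0 - imm); c->z = c->d0 == 0; }
static void addq_w(Cpu *c, uint16_t imm) { c->d0 = (uint16_t)(c->d0 + imm); c->z = c->d0 == 0; }

enum { TARGET_ZERO = 0, TARGET_ONE = 1, FALL_THROUGH = 2 };

/* cmpi.w #1,d0 ; beq.s one ; tst.w d0      (4 + 2 + 2 = 8 bytes) */
static int dispatch_cmpi(Cpu *c) {
    cmpi_w(c, 1);
    if (c->z) return TARGET_ONE;          /* beq.s taken: d0 was 1 */
    tst_w(c);
    return c->z ? TARGET_ZERO : FALL_THROUGH;
}

/* subq.w #1,d0 ; beq.s one ; addq.w #1,d0  (2 + 2 + 2 = 6 bytes) */
static int dispatch_subq(Cpu *c) {
    subq_w(c, 1);
    if (c->z) return TARGET_ONE;          /* beq.s taken: d0 was 1, now holds 0 */
    addq_w(c, 1);                         /* restores d0 and sets Z like tst.w */
    return c->z ? TARGET_ZERO : FALL_THROUGH;
}
```

The one visible difference is on the taken branch: with subq.w, d0 holds 0 instead of 1 there, so the trick only applies when that path no longer needs the original value.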
Joey's work, containing an extra-strong test procedure (exhaustive comparison against ExtGraph routines, with buffer overflow detection), can be downloaded at http://www.funsitelots.com/pub/Sprite_8_16_32_Stable.zip . Two ExtGraph routines were fixed in October 2005 thanks to Joey's work.
A very slightly accelerated and modularized (#defines to individually enable each of the 12 subsets among the 48 tested routines) version of the test program is available within http://www.funsitelots.com/pub/Sprite_8_16_32-20090626.tar.bz2 .
Where do we want to put the slider in the size optimization vs. speed optimization tradeoff?
Joey's routines tend to use an external routine for the address computation and/or clipping. This decreases code size if more than one routine is used (especially for clipped routines), but increases it if only one of them is used... And, when more than one drawing mode is used, I'm rather unconvinced that it makes real sense to use two or more routines of the same family (say, ClipSprite16{,AND,OR,RPLC,XOR}):
[ul][li]obviously, it hardly makes sense to use both a generic routine and a specialized routine of the same family;[/li]
[li]if size is what matters (that would be the use case for TIGCCLIB, I'm told, though TIGCCLIB is not completely size-optimized), then the generic routine with inlined address computation and/or clipping should be used: the generic routine is smaller than two specialized routines, even if the address computation and/or clipping used by the specialized routines is externalized;[/li]
[li]if speed is what matters (that would be the use case for specialized libraries: ExtGraph and Genlib), specialized routines should obviously be used, and an extra function call works against the goal of making fast routines.[/li][/ul]
Pushing the address computation and/or clipping to an external routine is a tradeoff between size optimization and speed optimization, but I'm not sure I see the point of that particular tradeoff: in at least one common use case (only a generic routine used), it makes the code both larger and slower.
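The size side of that argument can be written down as a back-of-the-envelope model. The sketch below is purely illustrative: the function names are hypothetical, every size is a parameter rather than a measurement of any real routine, and it assumes all routines have the same body size and share one helper.

```c
#include <stddef.h>

/* All sizes are in bytes and are parameters of the model, not measurements. */

/* Each routine carries its own copy of the address computation / clipping code. */
static size_t size_inlined(size_t n_routines, size_t body, size_t helper) {
    return n_routines * (body + helper);
}

/* One shared helper: each routine pays for a call site, the helper pays for a return. */
static size_t size_externalized(size_t n_routines, size_t body, size_t helper,
                                size_t call_site, size_t helper_return) {
    return n_routines * (body + call_site) + helper + helper_return;
}
```

With a single routine, the externalized layout is always larger by call_site + helper_return, and it is slower as well because of the call and return. With n routines, it only wins on size once (n - 1) * helper exceeds n * call_site + helper_return.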
What do YOU think?