1

I am currently optimizing Sebastian Mihai's homebrew Neo Geo game, Neo Thunder (http://sebastianmihai.com/neogeo-neo-thunder.html). My plan is to remove all the slowdown. I know it's not a great game, but I am mainly doing this for the coding practice. It's actually been a really good learning experience so far on how to do things quickly.

I have a couple of questions :

1. I'm getting near the end of the optimization process now and I really want to use 16-bit (short) variables in the Neo Thunder C program, since I figure these will be a little faster: all the MOVE.L instructions etc. can become MOVE.W when compiled. Currently I use "int" for all variables, which denotes a 32-bit value in the dev kit.

However there is a problem because all the functions in the Dev Kit only accept 32 bit variables.

e.g. write_sprite_data(int x, int y, int xz, int yz, int clipping, int nb, const PTILEMAP tilemap);

Should I cast the short variables to ints when I call these functions, OR should I rewrite the "write_sprite" function to use 16-bit values (and if so, is this even a good idea)? Which would be faster (in terms of execution speed)?

I should say the new Neo Thunder code will run perfectly fine (fast enough) without doing this but the idea is to learn techniques I can use again in future projects

My Second Question :

2. The "pitfalls.txt" file that came with the Dev Kit says: "When interfacing assembly routines, d2-d7/a2-a6 must be preserved." My question is: what are d2-d7 and a2-a6 actually used for? And does this mean that my compiled C code is slower than it should be, because it doesn't use *all* these preserved registers when it is compiled? Or am I misunderstanding this?


Thank you for any help. Any answers will be very helpful!

2

Are you sure that it's the case? Have you tried to printf("%d", sizeof(int)) ?

On PSP, a much more modern platform, int is 16 bit for example. TIGCC was the same. It wouldn't surprise me if it's already all 16 bits.
Highway Runners, my Outrun-style racing game, was released on December 14, 2016! Feel free to support me :)

https://itunes.apple.com/us/app/highway-runners/id964932741

3

CosmicR (./1) :
Should I cast the short variables to ints when I call these functions, OR should I rewrite the "write_sprite" function to use 16-bit values (and if so, is this even a good idea)? Which would be faster (in terms of execution speed)?
Casting from short to int manually won't make any difference (it's done automatically anyways). Rewriting the function to use 16-bit values may or may not bring some performance improvement. But before doing that, I'd check how much time is spent in this function in the first place. I don't know what kind of tools are available on the NeoGeo ; the ideal case would be an emulator capable of profiling. If nothing like that exists, you could measure that indirectly ; e.g. measure the CPU time needed per frame (using a timer or the border color trick) for your existing code, and for the same code but with all calls to the function removed. Otherwise, you could end up spending a lot of time "optimizing" things that don't really matter.
Zeroblog

« Every man carries a monkey on his left shoulder and a parrot on his right. » - Jean Cocteau
« I've stopped looking for logic too. That's surely why I like Ataris, they are as logical as I am! » - GT Turbo

4

Thank you for the help guys 👍

Brunni (./2) :
Are you sure that it's the case? Have you tried to printf("%d", sizeof(int)) ?

On PSP, a much more modern platform, int is 16 bit for example. TIGCC was the same. It wouldn't surprise me if it's already all 16 bits.

I tried your function there and it says 4. So they are definitely 32 bit values

Yes I wonder why he (Jeff Kurtz) decided to use 32 bit values for ints, and for the functions? There must be some reason.

Zerosquare (./3) :
Casting from short to int manually won't make any difference (it's done automatically anyways). Rewriting the function to use 16-bit values may or may not bring some performance improvement. But before doing that, I'd check how much time is spent in this function in the first place. I don't know what kind of tools are available on the NeoGeo ; the ideal case would be an emulator capable of profiling. If nothing like that exists, you could measure that indirectly ; e.g. measure the CPU time needed per frame (using a timer or the border color trick) for your existing code, and for the same code but with all calls to the function removed. Otherwise, you could end up spending a lot of time "optimizing" things that don't really matter.

For me, it's more the cost of using ints in my main program. With a shooting game, there is some heavy collision detection going on, so there are a lot of loops, and having .L instructions instead of .W will definitely add to the execution time. But then I also have a loop that draws the sprites (up to 177 sprites on screen), so doing casts from shorts to ints for the sprite function would add to the execution time too. Unless you are saying these casts take no time?

I am currently using background color checks. The slowest part is the player bullet vs enemy collision checks (maximum = 24 x 50). The game currently does run at 60fps (after the optimizations I have already done), but I am just learning some extra speed improvements that I could use in future games.

I have now looked at HPMan's DatLib extension and discovered that he uses unsigned shorts (16 bit values) for his sprite function. So that does seem like the way to go.

--------------------------------------------------------------------------------------------------

Btw I have found out the answer for my 2nd question about the preservation of registers. That's apparently a GCC (the C compiler) convention when compiling 68000 code. d0,d1 etc are used as scratch registers while the others are preserved. I did a disassembly (in MAME) to check and all the registers *are* used when compiling the C code (where possible). So there are no worries with that now.

I also found out that my GCC is a late version 2 (very old). It is on version 13 now, so I am wondering if switching to the new version might speed things up. I suppose it depends on whether much work has been done on 68000 compilation since. I will give it a go!

5

Well then yes, you would benefit from switching to 16 bit shorts. Note that switching to unsigned is also very important, because a lot of simple operations can be made faster. For example dividing by 2 can safely be replaced with a bit shift when the variable is unsigned, not otherwise.
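
As a rough illustration of the division point (a sketch only; the function names are hypothetical and not part of the dev kit): for unsigned values, x / 2 and x >> 1 always give the same result, so the compiler can emit a single shift. For signed values, division truncates toward zero in C99 and in GCC, while an arithmetic shift rounds down, so the compiler has to add fix-up code for negative inputs.

```c
/* Hypothetical helpers for illustration only; not part of the dev kit. */

/* Unsigned: x / 2 is exactly x >> 1, so a single shift is enough. */
unsigned short half_unsigned(unsigned short x)
{
    return x / 2;
}

/* Signed: division truncates toward zero (-3 / 2 == -1), whereas an
   arithmetic shift of -3 would give -2, so the compiler cannot use a
   bare shift and must correct the result for negative inputs. */
short half_signed(short x)
{
    return x / 2;
}
```

Comparing the disassembly of the two functions should show whether the compiler in use really produces shorter code for the unsigned version.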

Depending on what game you are working on, you might also want to implement your own versions of the library functions for your usage. If they really made the choice to take 32 bit integers, then it's probably not a highly optimised library in the first place.

6

@brunni thank you - that's a good idea to use unsigned shorts where possible. I think the x and y of a sprite have to be standard shorts to allow negative (offscreen) values but everything else can probably be ushorts

Yes I've made a few of my own library functions already for things I've needed. I am a beginner at 68000 asm but I can at least modify and copy existing functions. So this should be within my skill level.

7

Yes. Taking unsigneds also makes screen clipping faster: instead of comparing if (x >= 0 && x < 320 && y >= 0 && y < 224) you can do if (x < 320 && y < 224), because a negative number will be massive (0xffff).
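
A minimal sketch of that clipping trick, assuming 16-bit shorts and the 320 x 224 visible area from the example above (the function and macro names are hypothetical):

```c
/* Hypothetical names; 320 x 224 is the visible area used in the example. */
#define SCREEN_W 320
#define SCREEN_H 224

/* Four comparisons with signed coordinates. */
int on_screen_signed(short x, short y)
{
    return x >= 0 && x < SCREEN_W && y >= 0 && y < SCREEN_H;
}

/* Two comparisons: a negative value reinterpreted as unsigned becomes
   very large (-1 becomes 0xFFFF with 16-bit shorts), so it fails the
   "< SCREEN_W" or "< SCREEN_H" test without a separate ">= 0" check. */
int on_screen_unsigned(unsigned short x, unsigned short y)
{
    return x < SCREEN_W && y < SCREEN_H;
}
```

Both functions accept and reject the same coordinates once a signed value is converted to unsigned short, for example `on_screen_unsigned((unsigned short)x, (unsigned short)y)`.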

8

That's a nice idea! I was thinking about how that part of my code could be sped up but was drawing blanks.

BTW I changed all my ints to shorts, which seemed to speed the whole program up by 4 display lines. (I have a little routine that counts the display lines.)

Then I rewrote one of the sprite functions to see if it would accept shorts directly, but it didn't work properly. I looked at the disassembly and it seems that the compiler converts all shorts (words) to ints (long words) automatically before it puts them on the stack ready for the function to use. The overhead for doing this conversion is about 16 CPU cycles per variable sent to the function. So I guess I lose a little time there, but I gained overall. I don't know if there is a way to tell the compiler to do it differently.

Now that I have seen the disassembly of some parts of my program, I think even I could make it more efficient! The compiler seems to make some odd choices on how to do things. It could be that I am misunderstanding though, given I don't know much about assembly.

I don't think I can update the version of my compiler. I tried, but there didn't seem to be any obvious way to do it.

9

Brunni (./7) :
Yes. Taking unsigneds also makes screen clipping faster: instead of comparing if (x >= 0 && x < 320 && y >= 0 && y < 224) you can do if (x < 320 && y < 224), because a negative number will be massive (0xffff).
Good example ; but to me, it makes more sense to use signed types by default, and explicitly cast to unsigned where you want to use that trick. It makes the code easier to understand.

CosmicR (./8) :
I looked at the disassembly and it seems that the compiler converts all shorts (words) to ints (long words) automatically before it puts them on the stack ready for the function to use. The overhead for doing this conversion is about 16 CPU cycles per variable sent to the function. So I guess I lose a little time there, but I gained overall. I don't know if there is a way to tell the compiler to do it differently.
Some info here (they're discussing Atari ST code, but it uses GCC and targets the 68k) :
https://www.atari-forum.com/viewtopic.php?p=251819#p251819

CosmicR (./8) :
I don't think I can update the version of my compiler. I tried, but there didn't seem to be any obvious way to do it.
I don't know how your dev kit works, but for 68K, you don't want a GCC that's too new ; past a certain version, the code generation quality gets worse :
https://gendev.spritesmind.net/forum/viewtopic.php?t=2634

10

Zerosquare (./9) :
Brunni (./7) :
Yes. Taking unsigneds also makes screen clipping faster: instead of comparing if (x >= 0 && x < 320 && y >= 0 && y < 224) you can do if (x < 320 && y < 224), because a negative number will be massive (0xffff).
Good example ; but to me, it makes more sense to use signed types by default, and explicitly cast to unsigned where you want to use that trick. It makes the code easier to understand.
Maybe, but I haven't done it a single time.

Seems like it makes it more likely to get extra unneeded instructions accidentally added by the compiler. Better to keep the variable signed only if it makes sense to be signed, which, honestly is not so often in reality, at least in graphics code. If the sprite supports negative numbers (like on the Neo Geo), you are not going to want to test for x >= 0 anyway, but if it's a drawPixel(x, y, …) then yes, and the x, y args have no reason to be signed.

11

Thank you both for the help again. Much appreciated 👍

The Neo Geo has values 0-512 for the x-axis and 512-0 (going *down* the display from the top) for the y-axis. Only 320 x 224 pixels of this display area are visible, and the top left screen coordinate is (0,496).

So there is some scope to speed up everything by using natural display coordinates in all cases. If it's done at the end of development (so as to not be confusing beforehand), it would be a nice optimisation!

Currently all the sprite functions convert the more standard screen coordinates you use in the dev kit (where (0,0) is the top left of the screen and y goes from 0 to 224 down the screen) to these natural Neo Geo coordinates. So this takes a little time.
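
A small sketch of what storing natural coordinates could look like. It relies only on the fact stated above, that screen (0,0) corresponds to natural (0,496) with y decreasing down the screen; the type and function names are hypothetical and the real dev kit routines may do more than this.

```c
/* Assumption taken from the post above: screen (0,0) is natural (0,496)
   and natural y decreases going down the screen. Names are hypothetical. */
#define NATIVE_TOP_Y 496

/* Convert a dev-kit style y (0 at the top, increasing downwards)
   to the natural y value. Done once, e.g. when an object is spawned. */
short to_native_y(short screen_y)
{
    return (short)(NATIVE_TOP_Y - screen_y);
}

typedef struct {
    short x;        /* same value in both coordinate systems */
    short native_y; /* stored already converted */
} sprite_pos_t;

/* Moving down the screen by dy pixels means subtracting dy from the
   natural y, so no per-frame conversion is needed afterwards. */
void move_down(sprite_pos_t *p, short dy)
{
    p->native_y = (short)(p->native_y - dy);
}
```

The conversion then happens once per object rather than on every sprite update.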


Zerosquare (./9) :
I don't know how your dev kit works, but for 68K, you don't want a GCC that's too new ; past a certain version, the code generation quality gets worse :
https://gendev.spritesmind.net/forum/viewtopic.php?t=2634

There is a new Neo Geo DevKit built around a later version of GCC, but I don't understand it yet. It seems a lot different to the old one I am using and am used to! Neo Thunder is built around the old devkit, which uses GCC 2.95.2. I will probably have to stick to this version, since all the tool chains are built around that version of GCC. I don't really understand it that well beyond being able to program with it.

Zerosquare (./9) :
Some info here (they're discussing Atari ST code, but it uses GCC and targets the 68k) :
https://www.atari-forum.com/viewtopic.php?p=251819#p251819

That was very interesting about using -mshort. I did try it briefly and the game compiled but then crashed - probably because I need to change some of the other type definitions which use int as their base.

I have actually had a better idea since, and that would be to "inline" all the sprite functions. In C, this is when the compiler pastes the code for the function into the program at every place you call it. So the function is no longer jumped to with the "jsr" instruction, but is run directly each time instead. Unfortunately this didn't work, even with the "always inline" attribute (I did some research!). I got a message each time saying the compiler was ignoring the directive.

I think I have figured it out now: if I define the function in the main code file using inline assembly, it automatically inlines it anyway as part of the optimisation process (I am using -O3). I have only tried it on a small function so far and that worked (I checked the disassembly). The best thing about this is that function arguments are no longer copied onto the stack and can instead be passed directly in registers. Plus there is no jumping to and returning from the function, so that time is saved. This could be the best solution if it works for larger functions too. I should say that all the sprite functions are already in assembly, so it is easy to do this.

12

Inline can only be for static functions, i.e. defined in the same .o file (or same .c file).
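
A minimal sketch of a function the compiler is able to inline because its body is visible in the same file as the caller. Everything here is an assumption for illustration: the names are hypothetical, the two "ports" are ordinary variables standing in for the VRAM address and data ports so the example runs on a PC, and the 0x8400 base and 7-bit shift are taken from the disassembly posted later in this thread.

```c
/* Stand-ins for the VRAM address and data ports (hypothetical names).
   On real hardware these would point at the memory-mapped ports. */
static volatile unsigned short fake_vram_addr;
static volatile unsigned short fake_vram_data;
static volatile unsigned short *const vram_addr = &fake_vram_addr;
static volatile unsigned short *const vram_data = &fake_vram_data;

/* "static inline" with the body in the same .c file (or in a header the
   file includes): the compiler can paste the body at each call site
   instead of pushing arguments and emitting a jsr. */
static inline void set_sprite_x(unsigned short spr, short x)
{
    /* select the x entry for this sprite */
    *vram_addr = (unsigned short)(0x8400 + spr);
    /* x is stored in the upper bits of the word */
    *vram_data = (unsigned short)((unsigned short)x << 7);
}

/* Example caller: with optimisation on, the calls can be expanded in place. */
void set_two_sprites(short x)
{
    set_sprite_x(1, x);
    set_sprite_x(2, x);
}
```

Whether a given compiler version actually inlines the call still has to be checked in the disassembly.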

13

ok yes that would be why. There are a couple of inline functions that the Dev Kit already uses in the video.h file (for changing the current sprite). So I could try putting the inline functions in that file. Maybe that will work? But they are fine in the main program really.

I converted one of the sprite functions myself, but I missed out some lines by mistake and it took me a while to fix it. So I used Grok (on X/Twitter) to convert the other functions. Very good + fast for this kind of work! ↙️

mHnLMJY.jpeg

14

Just changing things to inline is not going to help a lot though, it's worth it if you want to save around 30-50 cycles or so, which is negligible compared to the typical process times of the 68k, unless you call the function in a tight loop. And even then, the compiler should inline it for you (albeit on GCC 2.x you can't fully count on it).

Overall, this is last resort optimisation and I would advise that you improve the way you are profiling the program first. There has to be something much more important that takes away a lot of cycles.

15

My compiler wasn't inlining it before I did this (I checked first). That's why I was messing around trying to get it to put shorts on the stack instead of ints. I didn't really think about inlining at that time. Also I am only average at C

It was not a bad little saving overall. Because there's often over 130 sprites on screen (177 would be a theoretical maximum for this game). And every time they are moved (every frame) or animated these functions are called to update the VRAM.

I think I have really exhausted everything obvious now. I am not happy with some of the 68000 code the compiler generates for loops. It's almost like it purposefully chooses an inefficient way to process arrays of structs (which I use for aliens, bullets etc.; I already padded the structs so they can be processed fast). I believe my actual C code is pretty tight for those loops now, but the compiled code is not (or at least that's what I think!). I might try to rewrite one of these loops manually in assembler (just for practice) and then leave it at that.

There is a weird jerk in the program I might need some help with though. But I'll ask about that later if I need to. Think it might be my virus checker in the background.

16

Have you tried playing with GCC's optimization options? For example, if you're currently using -O2, try -Os instead. (There's also -O3, but it can introduce subtle bugs, so it's usually not recommended.)

17

I'm really impressed that Grok is able to understand and provide advice about Neo Geo programming 😳 It even understands what it's doing, describing what registers are used and why…

18

Zerosquare (./16) :
Have you tried playing with GCC's optimization options? For example, if you're currently using -O2, try -Os instead. (There's also -O3, but it can introduce subtle bugs, so it's usually not recommended.)

I was using -O3 because that's what the Dev Kit was already using for this game. But yes that's a good idea to try a lower setting. I will give it a go 👍

Currently with a loop, e.g. when it is processing an array of structs, the compiled 68000 code keeps calculating the address of the next struct in the list from the loop index each time. In my mind it should be storing a base address in an address register and doing something like LEA 16(A0),A0 each loop to add 16 to it (assuming a struct size of 16 bytes). Also it seems to choose slow addressing modes to access the fields of the struct. But I will look into this more if -O2 or -Os doesn't do better. If those don't work, it will probably be down to the older version of GCC, I guess.

Brunni (./17) :
I'm really impressed that Grok is able to understand and provide advice about Neo Geo programming 😳 It even understands what it's doing, describing what registers are used and why…

Yes, Grok is impressive with C and 68000! But it does get many things wrong, especially with the Neo Geo (e.g. it can ignore VRAM access timings, or it thinks the Neo Geo can scale sprites up rather than just shrink them). I found it most useful for talking through general optimisation techniques. If I wasn't sure which technique to use, it would give me a good idea of which one would be faster before I programmed it, because it would roughly know how it would compile to 68000 and could then count the CPU cycles easily. Even then, I could see it making the odd mistake. But it's pretty good overall and it should improve in future versions. We will be obsolete soon :)

19

It turned out that -O3 was the best to use. They all used pretty much the same technique, but -O3 inlined the sprite function whereas the others didn't (although I could possibly force this).

I made a simple example just to show : (this just updates the x coordinate of each player bullet - since player bullets only move horizontally)

C-Code :

typedef struct {
    // where the sprite is placed on the screen
    short x, y;
    // hardware sprite to use
    short spr;
    // padding to make struct 8 bytes in size for speed of access
    short pad1;
} pbullet_t;

for(i=0; i<=blp_nd; i++) // blp_nd = number of player bullets + 1
{
    change_spritex_pos(bulletpsprites[i].spr, bulletpsprites[i].x);
}

Compiles to this (-O3) :


001104: 97CB            suba.l  A3, A3
001106: 3839 0010 000C  move.w  $10000c.l, D4
00110C: B84B            cmp.w   A3, D4
00110E: 6D34            blt     $1144
001110: 41F9 0010 0446  lea     $100446.l, A0   // load base address of spr field in player bullet struct
001116: 45E8 FFFC       lea     (-$4,A0), A2    // load base address of x field in player bullet struct
*LOOP START*
00111A: 300B            move.w  A3, D0          // (A3 = i, the loop index) These 3 lines calculate offset of current element from base address. Struct is 8 bytes in size
00111C: 48C0            ext.l   D0
00111E: E788            lsl.l   #3, D0
001120: 3630 0800       move.w  (A0,D0.l), D3   // Get Sprite number from struct
001124: 3432 0800       move.w  (A2,D0.l), D2   // Get x coord from struct
*CHANGE SPRITE X COORD ROUTINE*
001128: 43F9 003C 0002  lea     $3c0002.l, A1
00112E: 3003            move.w  D3, D0
001130: 0640 8400       addi.w  #-$7c00, D0
001134: 3340 FFFE       move.w  D0, (-$2,A1)    // write correct address for that sprite number to VRAM port
001138: 3202            move.w  D2, D1
00113A: EF49            lsl.w   #7, D1
00113C: 3281            move.w  D1, (A1)        // write new x coordinate of sprite to VRAM
*END OF SPRITE ROUTINE*
00113E: 524B            addq.w  #1, A3          // increment loop index
001140: B84B            cmp.w   A3, D4
001142: 6CD6            bge     $111a
*END LOOP*

Why not just load the base address of the struct into A0 before the loop starts, and then just do:

LEA 8(A0),A0 each loop to increment A0 by 8 bytes

Then to access the fields of the struct, do MOVE.W (A0),D2 (for the x coord) and MOVE.W 4(A0),D3 (for the sprite number).

To me, this just seems the simple way. But it's currently calculating the address from the index every time. It even did this when I used to have structs of size 24 bytes and it had to do more work to get there (with a shift and some adds). Back then the indexing modes used were even slower, with extra unnecessary offsets.
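
One thing that could be tried from the C side is to walk the array with a pointer instead of indexing it with i, which expresses the "base address plus 8 each loop" idea directly. This is only a sketch: change_spritex_pos below is a stand-in that records its arguments so the example runs on a PC, update_bullets and its count parameter are hypothetical, and whether GCC 2.95 actually produces the hoped-for code would have to be confirmed in the disassembly.

```c
typedef struct {
    short x, y;   /* where the sprite is placed on the screen */
    short spr;    /* hardware sprite to use */
    short pad1;   /* padding to make the struct 8 bytes */
} pbullet_t;

/* Stand-in for the dev kit's sprite routine (assumption): it only
   records the last arguments and counts the calls. */
static short last_spr, last_x;
static int call_count;

static void change_spritex_pos(short spr, short x)
{
    last_spr = spr;
    last_x = x;
    call_count++;
}

/* Walk the array with a pointer rather than an index. The address is
   advanced by sizeof(pbullet_t) each pass instead of being recomputed
   from i, and the fields are read at fixed offsets from the pointer. */
void update_bullets(pbullet_t *bullets, short count)
{
    pbullet_t *p = bullets;
    pbullet_t *end = bullets + count;

    while (p < end) {
        change_spritex_pos(p->spr, p->x);
        p++;
    }
}
```

If the compiler still recomputes the address, writing just this loop in assembly is the remaining option.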