Chunky-mode: explanation and example (source) - Page 2

30

Le 05/04/2016 à 23:03

Ohh, this is great news. Glad my suggestion was helpful.

www.universebios.com

31

Le 06/04/2016 à 00:05, édité par Razoola le 06/04/2016 à 00:09

furrtek (./28) :
Noïce

I still have to update the fix page, I think I was confused by the cropping of some TVs. I have to check, but I think it's 40x28 in NTSC and 40x32 (not 30) in PAL. So 16 pixels more top and bottom.

It might be good to actually confirm what is happening first on the fix layer in PAL mode just in case before changing it. It might be that what is on the FIX page is correct and there is an 8 pixel top/bottom border still in PAL mode. That would mean the timings page just needs to be amended to 8 pixels instead of 16. Who knows.

www.universebios.com

32

Le 06/04/2016 à 00:09

From the programming guide:

In the PAL mode, the display area becomes 16 lines larger on the top and bottom; the vertical blanking lengthens only by 16 lines.

33

Le 07/04/2016 à 16:47

I updated 'DIFF' to v1.1
- no more tearing
- uses 2x20 sprites (instead of 4x20)
- screenupdate triggered (saves some frames)

I also did some tests on real hardware (NGCD with latest MVS-DevBios): screenshot

- BACKDROPCOLOR is green
- on VBLANK changes to red
- REG_LSPCMODE (Load counter at the beginning of the hblank of the first vblank line) waits 40 scanlines and changes BACKDROPCOLOR back to green

NTSC: no visible red lines -> 40 scanlines not visible
PAL: 32 red lines on top -> 16px top & 16px bottom / 8 scanlines not visible

34

Le 07/04/2016 à 18:31

Thanks for taking a few minutes to make that PAL test. I totally zoned out and forgot I could have used my CD system to check; I thought I needed an AES for some reason. I'll let furrtek add the information to the FIX page on the wiki. HPMAN, for once the official SNK docs and the hardware workings match.

I have been trying to download v1.1 from your blog but the link appears not to work; I just get a blank screen. I really want to see the improvement on real hardware.

Raz

www.universebios.com

35

Le 07/04/2016 à 19:08

fixed!

36

Le 07/04/2016 à 19:47

Got it, that's a real nice improvement.

www.universebios.com

37

Le 07/04/2016 à 20:04, édité par Razoola le 08/04/2016 à 12:37

As a matter of interest, if you have not already, have you tried movem.l to see if you can copy the palettes quicker? (It should be faster than an unrolled move.l (a0)+,(a1)+.)

For example, have the palette stored in work RAM and use A0 and A1 as pointers. Then something like this (I think I got the A1 additions right):

	lea	$100000,a0		; palette source
	lea	$400000,a1		; palette destination

loop    add.l	#$34,a1
	movem.l	(a0)+,d0-d7/a2-a6		
	movem.l	d0-d7/a2-a6,-(a1)	; a1=0		
	add.l	#$68,A1
	movem.l	(a0)+,d0-d7/a2-a6		
	movem.l	d0-d7/a2-a6,-(a1)	; a1=34		
	add.l	#$68,A1
	movem.l	(a0)+,d0-d7/a2-a6		
	movem.l	d0-d7/a2-a6,-(a1)	; a1=68			
	add.l	#$68,A1
	movem.l	(a0)+,d0-d7/a2-a6		
	movem.l	d0-d7/a2-a6,-(a1)	; a1=9c			
	add.l	#$64,A1
	movem.l	(a0)+,d0-d7/a2-a5		
	movem.l	d0-d7/a2-a5,-(a1)	; a1=d0			

	add.l	#$64,a1
	movem.l	(a0)+,d0-d7/a2-a6		
	movem.l	d0-d7/a2-a6,-(a1)	; a1=0		
	add.l	#$68,A1
	movem.l	(a0)+,d0-d7/a2-a6		
	movem.l	d0-d7/a2-a6,-(a1)	; a1=34		
	add.l	#$68,A1
	movem.l	(a0)+,d0-d7/a2-a6		
	movem.l	d0-d7/a2-a6,-(a1)	; a1=68			
	add.l	#$68,A1
	movem.l	(a0)+,d0-d7/a2-a6		
	movem.l	d0-d7/a2-a6,-(a1)	; a1=9c			
	add.l	#$64,A1
	movem.l	(a0)+,d0-d7/a2-a5		
	movem.l	d0-d7/a2-a5,-(a1)	; a1=d0			

	add.l	#$64,a1
	movem.l	(a0)+,d0-d7/a2-a6		
	movem.l	d0-d7/a2-a6,-(a1)	; a1=0		
	add.l	#$68,A1
	movem.l	(a0)+,d0-d7/a2-a6		
	movem.l	d0-d7/a2-a6,-(a1)	; a1=34		
	add.l	#$68,A1
	movem.l	(a0)+,d0-d7/a2-a6		
	movem.l	d0-d7/a2-a6,-(a1)	; a1=68			
	add.l	#$68,A1
	movem.l	(a0)+,d0-d7/a2-a6		
	movem.l	d0-d7/a2-a6,-(a1)	; a1=9c			
	add.l	#$64,A1
	movem.l	(a0)+,d0-d7/a2-a5		
	movem.l	d0-d7/a2-a5,-(a1)	; a1=d0			

	add.l	#$64,a1
	movem.l	(a0)+,d0-d7/a2-a6		
	movem.l	d0-d7/a2-a6,-(a1)	; a1=0		
	add.l	#$68,A1
	movem.l	(a0)+,d0-d7/a2-a6		
	movem.l	d0-d7/a2-a6,-(a1)	; a1=34		
	add.l	#$68,A1
	movem.l	(a0)+,d0-d7/a2-a6		
	movem.l	d0-d7/a2-a6,-(a1)	; a1=68			
	add.l	#$68,A1
	movem.l	(a0)+,d0-d7/a2-a6		
	movem.l	d0-d7/a2-a6,-(a1)	; a1=9c			
	add.l	#$64,A1
	movem.l	(a0)+,d0-d7/a2-a5		
	movem.l	d0-d7/a2-a5,-(a1)	; a1=d0			
	add.l	#$30,A1

	cmp.l	#$102000,a0             ; end of copy?
	bne	loop
	rts

[edited] I had a small mistake in the ASM, which is now fixed.

If your boundaries are OK, replacing add.l with add.w will save a few more cycles. When I wrote this I was comparing 68k copy speed against the CD DMA system and was moving large blocks. This code is faster than DMA on the top loader at least. You can of course also totally unroll it and use A7 too, for a little extra.
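The A1 additions in the listing above can be checked by replaying them on paper. The following is only a sketch in Python (not part of the original post): it tracks A1 as an offset from the destination base and assumes the register lists are exactly as posted, i.e. 13 longwords for d0-d7/a2-a6 and 12 for d0-d7/a2-a5.

```python
# Sketch only: replays the a1 arithmetic from one pass of the loop above.
# d0-d7/a2-a6 is 13 longwords ($34 bytes); d0-d7/a2-a5 is 12 ($30 bytes).

def replay_loop_pass(a1=0):
    """Return (final a1, list of (start, end) byte ranges written)."""
    adds = [0x34, 0x68, 0x68, 0x68, 0x64] + 3 * [0x64, 0x68, 0x68, 0x68, 0x64]
    regs = 4 * [13, 13, 13, 13, 12]
    written = []
    for add, count in zip(adds, regs):
        a1 += add                  # add.l #imm,a1
        size = 4 * count
        a1 -= size                 # movem.l regs,-(a1) predecrements a1
        written.append((a1, a1 + size))
    a1 += 0x30                     # final add.l #$30,a1 before the cmp/bne
    return a1, written

final_a1, blocks = replay_loop_pass()
# Each store should start exactly where the previous one ended.
contiguous = all(prev[1] == cur[0] for prev, cur in zip(blocks, blocks[1:]))
print(hex(final_a1), contiguous)
```

Under those assumptions each pass fills $400 bytes without gaps or overlaps, so eight passes cover the $2000 bytes between $100000 and $102000 that the cmp.l checks for.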

www.universebios.com

38

Le 07/04/2016 à 22:29

Hi guys,
Are you really hunting every byte and cycle you can on the Neo-Geo?

39

Le 07/04/2016 à 23:13, édité par blastar le 08/04/2016 à 00:37

Razoola,
it's not that easy because you have to skip color#0.
I tried it this way (completely unrolled loop) but it's slower:

          lea       CHUNKY_BUFFER,a0
          lea       PALETTES+16*2*5,a1
          sub.l     #32,a1

          rept      250

          add.l     #64,a1
          sub.l     #2,a0
          movem.l   (a0)+,d0-d7
          movem.l   d0-d7,-(a1)

          endr

this is faster:

          lea       PALETTES+16*2*5+2,a0
          lea       CHUNKY_BUFFER,a1

          rept      250 

          move.l    (a1)+,(a0)+
          move.l    (a1)+,(a0)+
          move.l    (a1)+,(a0)+
          move.l    (a1)+,(a0)+
          move.l    (a1)+,(a0)+
          move.l    (a1)+,(a0)+
          move.l    (a1)+,(a0)+
          move.w    (a1)+,(a0)+
          add.l     #2,a0

          endr

40

Le 08/04/2016 à 00:13

blastar (./33) :
I updated 'DIFF' to v1.1
- no more tearing
- uses 2x20sprites (instead of 4x20)
- screenupdate triggered (saves some frames)

Great optimization, blastar! It looks like you were also able to increase the resolution and the window size.

It's nice to see that great ideas can be driven to even higher levels when the knowledge of the right people is concentrated in one spot.
Too bad there is nothing meaningful to add from my side, but this thread is always a good read whenever I have a quick look into the forum.

41

Le 08/04/2016 à 02:19

./38 > Folco, lured by the smell of 68k assembly...

blastar >
	lea       PALETTES+16*2*5+2,a0
	lea       CHUNKY_BUFFER,a1

	rept      250 

	move.l    (a1)+,(a0)+	; 20 cycles
	move.l    (a1)+,(a0)+	; 20 cycles
	move.l    (a1)+,(a0)+	; 20 cycles
	move.l    (a1)+,(a0)+	; 20 cycles
	move.l    (a1)+,(a0)+	; 20 cycles
	move.l    (a1)+,(a0)+	; 20 cycles
	move.l    (a1)+,(a0)+	; 20 cycles
	move.w    (a1)+,(a0)+	; 12 cycles
	add.l     #2,a0		; 16 cycles

	endr

168 cycles per palette
(hint: you can save 8 cycles by using addq.l #2,a0 instead of add.l #2,a0)

vs

 	lea       CHUNKY_BUFFER,a0
        lea       PALETTES+16*2*5,a1
        sub.l     #32,a1

        rept      250

        add.l     #64,a1	; 16 cycles
        sub.l     #2,a0		; 16 cycles
        movem.l   (a0)+,d0-d7	; 12 + (8 * 8) = 76 cycles
        movem.l   d0-d7,-(a1)	;  8 + (8 * 8) = 72 cycles
				
        endr

180 cycles per palette

-> You're right, movem is actually slower here... surprising!
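The two totals above can be reproduced from the per-instruction timings quoted in the listings. This is a small Python sketch, not code from the thread; it uses only the nominal 68000 cycle counts and ignores wait states.

```python
# Nominal 68000 timings as quoted in the listings above (no wait states).
MOVE_L = 20        # move.l (a1)+,(a0)+
MOVE_W = 12        # move.w (a1)+,(a0)+
ADD_L_IMM = 16     # add.l #imm,An (sub.l #imm,An costs the same)
ADDQ_L = 8         # addq.l #imm,An

def movem_l_load(n):
    """movem.l (An)+,<n registers>"""
    return 12 + 8 * n

def movem_l_store(n):
    """movem.l <n registers>,-(An)"""
    return 8 + 8 * n

# blastar's move.l loop: 7 longwords + 1 word, then skip color #0.
move_loop = 7 * MOVE_L + MOVE_W + ADD_L_IMM
move_loop_addq = 7 * MOVE_L + MOVE_W + ADDQ_L

# movem loop: two pointer fix-ups plus an 8-register load and store.
movem_loop = 2 * ADD_L_IMM + movem_l_load(8) + movem_l_store(8)

print(move_loop, move_loop_addq, movem_loop)
```

The pointer fix-ups are what tip the balance: without the two 16-cycle adjustments the movem pair alone would cost 148 cycles.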

If you've got some time left before VBLANK, here's another strategy:
Preconvert your chunky buffer to a palette buffer (basically, you insert color #0 before each 15-entry palette) before VBLANK. Then, during VBLANK, copy the whole buffer to the palettes registers using movem. To copy 28 colors that way, the movem version (using 14 registers) needs 244 cycles, instead of 280 cycles for the move.l (a1)+,(a0)+. So you can set more palettes during VBLANK, even if it uses more cycles per frame.
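As an illustration of that preconversion step, here is a hedged Python sketch of the buffer layout only (the real thing would be 68k code running before VBLANK). The function name and the placeholder value written for color #0 are assumptions, not something given in the thread.

```python
def to_palette_buffer(chunky, color0=0x0000):
    """Insert a color #0 word before every 15-color group.

    `chunky` is a list of 16-bit color words, 15 per palette.
    `color0` is a placeholder; the thread does not say what value is used.
    """
    if len(chunky) % 15:
        raise ValueError("chunky buffer must hold whole 15-color palettes")
    out = []
    for i in range(0, len(chunky), 15):
        out.append(color0)
        out.extend(chunky[i:i + 15])
    return out

# Cycle comparison for 28 colors (14 longwords), nominal timings only:
movem_28 = (12 + 8 * 14) + (8 + 8 * 14)   # movem.l load + movem.l store
move_28 = 14 * 20                         # 14 x move.l (a1)+,(a0)+

palette_buffer = to_palette_buffer(list(range(1, 31)))
print(len(palette_buffer), movem_28, move_28)
```

The buffer grows from 15 to 16 words per palette, which is the price paid for being able to copy it linearly.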

I've thought about using self-modifying code too, but actually it's not faster than move.l (a1)+,(a0)+.

— Zeroblog —

« Tout homme porte sur l'épaule gauche un singe et, sur l'épaule droite, un perroquet. » — Jean Cocteau
« Moi je cherche plus de logique non plus. C'est surement pour cela que j'apprécie les Ataris, ils sont aussi logiques que moi ! » — GT Turbo

42

Le 08/04/2016 à 09:12

Ahh, the requirement to skip colors is certainly a problem for the movem option, but Zerosquare's suggestion may get you around it (take note of his mention of addq.l also). You really must include the address registers with the movem opcode for it to be of real benefit. If you just use the data registers it may well be slower. It's not just a case of counting opcode cycles, however; there is a little more to it.

When counting cycles in cases like this you need to take wait states into account, and these cannot be counted easily because they vary from hardware to hardware and even between regions within the hardware. For example, a program run inside NeoGeo palette RAM will experience wait states equal to executing from ROM. The same code executed in any other NeoGeo RAM region will hit wait states and be slightly slower; it's not huge, but it can be measured over one frame. I painfully discovered this myself when there were some issues testing PC-2-NEO transfers, comparing RAM against ROM. It took me a while to get to the bottom of why a tightly timed program worked 100% in RAM, but as soon as I burned it to ROM and tried again the odd transfer failure would creep in. It was only after I realised that RAM wait states were the problem, and that using Palette RAM as the test bed would get around them, that I could accurately test a tightly timed program without burning a ROM each time.

What I'm saying is that movem can move a set amount of memory with fewer opcodes (which sit in the CD system's RAM) than move.l (a0)+,(a1)+, and one movem opcode only fills palette RAM (fewer wait state issues than RAM) from data/address registers, so you are going to experience fewer wait states and fewer lost cycles. This is one of those times when simply counting opcode cycle times may not give a true reflection of which is faster, or of the true cycle count. One really needs a visual speed indication, counting the scanlines taken to execute rather than referring to 68k opcode cycle times, especially because NeoGeo RAM and Palette RAM have different characteristics.

Maybe this is going a little deep in the hunt for cycles, but in my book every possible cycle counts when it comes to speed. This is also one area where emulation cannot match testing on real hardware.

www.universebios.com

43

Le 08/04/2016 à 11:45

Razoola, that's a good point. Thus any additional instruction wasting chances of reading or writing data between the wait state cycles (address related adds or subs) should be avoided.

Zerosquare (./41) :
./38 > Folco, lured by the smell of 68k assembly...
[cut blastar's code]

(hint : you can save 8 cycles by using addq.l #2,a0 instead of add.l #2,a0 )

vs

[cut your movem code]

180 cycles per palette

-> You're right, movem is actually slower here... surprising!

If you've got some time left before VBLANK, here's another strategy:
Preconvert your chunky buffer to a palette buffer (basically, you insert color #0 before each 15-entry palette) before VBLANK. Then, during VBLANK, copy the whole buffer to the palettes registers using movem. To copy 28 colors that way, the movem version (using 14 registers) needs 244 cycles, instead of 280 cycles for the move.l (a1)+,(a0)+. So you can set more palettes during VBLANK, even if it uses more cycles per frame.

I've thought about using self-modifying code too, but actually it's not faster than move.l (a1)+,(a0)+.

These are very interesting suggestions!

With your own optimization and the 64 byte offset in an address register (e.g. a2), we can also do

        add.l     a2,a1	        ; 8 cycles
        subq.l    #2,a0		; 8 cycles
        movem.l   (a0)+,d0-d7	; 12 + (8 * 8) = 76 cycles
        movem.l   d0-d7,-(a1)	;  8 + (8 * 8) = 72 cycles

164 cycles. Hey, 4 cycles faster!

I also played with some thoughts regarding the movem variant. And since we could use 15 registers as words (no unnecessary writes) and absolute 32b addresses in this special routine, we might just hardcode the operation:

    MOVEM.w CHUNKY_BUFFER,d0-d7/a0-a6      ; 80 cycles - read 15 words
    MOVEM.w d0-d7/a0-a6,PALETTES+16*2*5+2  ; 76 cycles - write 15 words at +2 offset
    MOVEM.w CHUNKY_BUFFER+30,d0-d7/a0-a6   ; next 15 words
    MOVEM.w d0-d7/a0-a6,PALETTES+16*2*5+34 ; offset 1*32+1*2
    MOVEM.w CHUNKY_BUFFER+60,d0-d7/a0-a6   ; next 15 words
    MOVEM.w d0-d7/a0-a6,PALETTES+16*2*5+66 ; offset 2*32+2*2
    ...

156 cycles and 16B code per palette!

7.7% faster than the original version. Any further optimization (disabling interrupts and using a7, loading with one movem, write with 2, etc.) seems to bring more hassle than use.
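Since this variant is fully unrolled with absolute addresses, the 250 load/store pairs would normally be generated rather than typed. The following Python sketch shows one way to emit them; the generator itself is an assumption for illustration, while the symbols CHUNKY_BUFFER and PALETTES and the offsets follow the listing above.

```python
def unrolled_movem_w(palettes=250):
    """Emit the MOVEM.w pairs for the color #0 skipping copy."""
    lines = []
    for i in range(palettes):
        src = 30 * i                   # 15 packed color words per palette
        dst = 16 * 2 * 5 + 2 + 32 * i  # skip 5 palettes, then color #0 of each
        lines.append(f"    MOVEM.w CHUNKY_BUFFER+{src},d0-d7/a0-a6")
        lines.append(f"    MOVEM.w d0-d7/a0-a6,PALETTES+{dst}")
    return lines

listing = unrolled_movem_w()
cycles_per_palette = (20 + 4 * 15) + (16 + 4 * 15)  # absolute long load + store
code_bytes = 8 * len(listing)                       # opcode + mask + 32-bit address
print(listing[0])
print(listing[1])
print(cycles_per_palette, code_bytes)
```

One detail worth keeping in mind: a word-sized MOVEM load sign-extends into the full 32-bit registers, which is harmless here because only the low words are written back. At 16 bytes per palette the unrolled routine comes to about 4000 bytes of code.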

I think, the chunky buffer preprocessing would be obsolete now. It would also use a lot of potential render time. If those transparent colors are part of the chunky buffer, the renderer could be adapted to jump over them. The whole pixel -> sprite mapping might also be rotated by 90 degrees to turn (no pun intended) memory locations holding the transparent color into lines, which could be more easily avoided while rendering something. For a raycaster or voxel engine (doing vertical stuff) this might be just the other way round.

44

Le 08/04/2016 à 13:29

Personally I would always use longwords with movem where possible, because that gives the biggest potential gain. I believe around 13% can be had during the copy if one uses longwords (not including the wait state gains on top). The downside, which I only see now that blastar has pasted the code, is that it cannot easily be done this way because of the color skipping currently in place during the copy. Working around that might lose more than you gain elsewhere.

If color#0 can always be placed into the palette buffer, and thus moved into Palette RAM during the copy as Zerosquare suggests, then I believe this will be the quickest method to actually copy the entire palette. It's swings and roundabouts though, and at the end of the day it may end up around the same gain either way.

www.universebios.com

45

Le 08/04/2016 à 14:23, édité par Razoola le 08/04/2016 à 16:49

Expanding on your 2nd method Dresdenboy.

You can use A7 this way and get it down to 148 cycles using words; that's almost 12%! I did not think about it this way, and this may actually be the best approach given how easily it can be added to the current code framework. I learned something.

You would need to disable interrupts, but given this copy is run every frame from a set scanline, you can just call your would-be interrupt routine right after the copy routine has completed; it's no loss and is in fact a gain. You save the cycles lost during the 68k interrupt process (about 72 cycles if I remember right, see https://community.freescale.com/thread/29078), plus 260 cycles storing and restoring the registers to and from the stack inside the interrupt routine (if you're storing all of them).

    MOVE.l a7,STACKSTORE
    LEA CHUNKY_BUFFER,a7

    MOVEM.w (a7)+,d0-d7/a0-a6              ; 72 cycles - read 15 words
    MOVEM.w d0-d7/a0-a6,PALETTES+16*2*5+2  ; 76 cycles - write 15 words at +2 offset
    MOVEM.w (a7)+,d0-d7/a0-a6              ; next 15 words
    MOVEM.w d0-d7/a0-a6,PALETTES+16*2*5+34 ; offset 1*32+1*2
    MOVEM.w (a7)+,d0-d7/a0-a6              ; next 15 words
    MOVEM.w d0-d7/a0-a6,PALETTES+16*2*5+66 ; offset 2*32+2*2
    ...

    MOVE.l STACKSTORE,a7
    RTS
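The figures quoted for this variant can be rederived from the nominal MOVEM timings. This is a Python sketch under the assumption that the interrupt routine saves and restores d0-d7/a0-a6 with one movem.l each way; the 72-cycle interrupt entry figure from the post is not recomputed here.

```python
def movem_w_load_postinc(n):
    """MOVEM.w (An)+,<n registers>"""
    return 12 + 4 * n

def movem_w_store_abs_long(n):
    """MOVEM.w <n registers>,xxx.L"""
    return 16 + 4 * n

def movem_l_store_predec(n):
    """MOVEM.l <n registers>,-(An)"""
    return 8 + 8 * n

def movem_l_load_postinc(n):
    """MOVEM.l (An)+,<n registers>"""
    return 12 + 8 * n

per_palette = movem_w_load_postinc(15) + movem_w_store_abs_long(15)
gain_vs_move_loop = (168 - per_palette) / 168     # against the 168-cycle loop

# Saving and restoring 15 registers inside an interrupt handler:
save_restore = movem_l_store_predec(15) + movem_l_load_postinc(15)

print(per_palette, round(100 * gain_vs_move_loop, 1), save_restore)
```

So the (a7)+ load saves 8 cycles per palette over the absolute-address load, and the 260-cycle figure matches a full 15-register save and restore.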

www.universebios.com

46

Le 08/04/2016 à 15:53, édité par Dresdenboy le 09/04/2016 à 11:21

Yep, using a7 helps even further. I tried using longs in movem, which would look like this (also reusing a7):

    LEA     image+64,a7            ; 12 cycles
    MOVEM.l testdata,d0-d7/a0-a6   ; d0-d7: 0123456789abcde0 a0-a6: 123456789abcde read = 20 + (15 * 8) = 140
    MOVEM.l a0-a6,-(a7)            ; a0-a6: 12 34 56 78 9a bc de (0) written = 64 cycles
    MOVE.w  d7,-(a7)               ; d7.w: 0 written at correct position = 8 cycles
    MOVEM.l d0-d7,-(a7)            ; d0-d7: _0123456789abcde (0) written = 72 cycles
    ;296 cycles for 2 palettes -> 148 cycles/palette
    LEA     image+128,a7            ; 12 cycles
    MOVEM.l testdata+60,d0-d7/a0-a6 ; d0-d7: 0123456789abcde0 a0-a6: 123456789abcde read = 20 + (15 * 8) = 140
    MOVEM.l a0-a6,-(a7)            ; a0-a6: 12 34 56 78 9a bc de (0) written = 64 cycles
    MOVE.w  d7,-(a7)               ; d7.w: 0 written at correct position = 8 cycles
    MOVEM.l d0-d7,-(a7)            ; d0-d7: _0123456789abcde (0) written = 72 cycles   
    ;296 cycles for 2 palettes -> 148 cycles/palette

So it's not faster than your solution, and it doesn't look as straightforward (backward, actually).
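Because the stores run backwards, it is easy to lose track of where each word lands. Here is a hedged Python model of the first five instructions of the listing (word-granular, with hypothetical variable names); it assumes MOVEM with predecrement stores the highest register at the highest address, as on the 68000.

```python
def copy_two_palettes(src_words):
    """Model of the LEA / MOVEM.l / MOVE.w / MOVEM.l sequence above.

    src_words: 30 color words (2 palettes x 15 colors, no color #0).
    Returns (image, a7) where image is 32 words (two 16-entry palettes)
    and a7 is the final pointer as a word index into image.
    """
    assert len(src_words) == 30
    longs = [(src_words[2 * i], src_words[2 * i + 1]) for i in range(15)]
    d, a = longs[:8], longs[8:]        # d0-d7 and a0-a6 as (high, low) words
    image = [None] * 32
    a7 = 32                            # LEA image+64,a7 (64 bytes = 32 words)
    for hi, lo in reversed(a):         # MOVEM.l a0-a6,-(a7)
        a7 -= 2
        image[a7], image[a7 + 1] = hi, lo
    a7 -= 1                            # MOVE.w d7,-(a7): low word of d7
    image[a7] = d[7][1]
    for hi, lo in reversed(d):         # MOVEM.l d0-d7,-(a7)
        a7 -= 2
        image[a7], image[a7 + 1] = hi, lo
    return image, a7

src = list(range(100, 130))
image, a7 = copy_two_palettes(src)
print(image[:4], image[15:19], a7)
```

In this model the first palette's color #0 slot is never touched, while the second palette's color #0 slot receives a duplicate of its first color, which matches the "(0)" remarks in the listing's comments.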

I think, many effects can be rendered on a per column or per line basis. In both variants, with some dedicated pixel mapping it should be possible to render into a buffer containing spare words for color 0.

47

Le 08/04/2016 à 16:12

Yes, I think together we might have found the best solution for this situation. Fixed those errors, thanks.

I must admit I don't fully understand how the palettes relate to what happens on screen in the demo, as I have not studied this effect before. Would I be right in saying that if there is a symmetrical full screen image (the top half being a mirror of the bottom half, as in a static Dr Who tunnel effect), one could optimise this further by reading the palette buffer once and then writing it to both relevant positions in physical Palette RAM (for the top and bottom of the screen)? If that's possible you could get something like another 30% gain for those kinds of effects, with a custom copy routine to handle it.
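A rough estimate of that read-once, write-twice idea, as a Python sketch. It assumes the movem.w timings from the A7 variant above (72 cycles to load 15 words, 76 to store them) and that the two mirrored palettes hold identical colors; whether the demo's pixel mapping allows this is not established in the thread.

```python
LOAD_15_WORDS = 12 + 4 * 15    # MOVEM.w (a7)+,d0-d7/a0-a6
STORE_15_WORDS = 16 + 4 * 15   # MOVEM.w d0-d7/a0-a6,xxx.L

# Two palettes copied independently versus one load feeding two stores.
separate = 2 * (LOAD_15_WORDS + STORE_15_WORDS)
mirrored = LOAD_15_WORDS + 2 * STORE_15_WORDS
saving = (separate - mirrored) / separate

print(separate, mirrored, round(100 * saving, 1))
```

On nominal timings alone the saving comes out closer to a quarter than 30%, before any wait state effects are considered.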

By the way, there is an idea, blastar: a Dr Who demo... The tunnel effect with a blue 3D box in the middle spinning about the place, with some cool music playing: dadadada dadadada.

www.universebios.com

48

Le 09/04/2016 à 07:07

the first effect (stardeform) in DIFF is a special tunnel (without depth-shading)... static transformation is very fast, I have to use two WAIT_VBL to slow it down. unthrottled, this effect runs at ~93fps, so no need for optimizations here.
I changed the buffer and added an empty word for color#0 so I can use a MOVEM-loop, it has become even faster - but without A7! I do not like the idea of disabling VBLANK, it is not just about speed.
my next target will be to use the free rastertime to increase the screen resolution using more sprites and switching palbanks.

49

Le 09/04/2016 à 07:32

Ohh that's great, blastar. Pity about A7 not being used, but at least you know there is a gain in your pocket if you really need it.

I take it you're moving longwords? Any chance we can have a look at the new code, and do you have a % increase you can give us?

www.universebios.com

50

Le 09/04/2016 à 15:22

tested on real hardware using REG_LSPCMODE:

old move.l-loop (skip color#0): 55 lines (used in DIFF1.1)
old movem.l-loop (skip color#0): 59 lines
new movem.l-loop (write color#0 without A7): 49 lines
completely unrolled movem.l-loop (write color#0), using all(!) registers: 48 lines

	MOVE.l a7,STACKSTORE

	MOVEM.l CHUNKY_BUFFER+64*0,d0-d7/a0-a7
	MOVEM.l d0-d7/a0-a7,PALETTES+16*2*5+64*0
	MOVEM.l CHUNKY_BUFFER+64*1,d0-d7/a0-a7
	MOVEM.l d0-d7/a0-a7,PALETTES+16*2*5+64*1
	...
	MOVEM.l CHUNKY_BUFFER+64*124,d0-d7/a0-a7
	MOVEM.l d0-d7/a0-a7,PALETTES+16*2*5+64*124

	MOVE.l STACKSTORE,a7
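For comparison, the nominal cycle counts discussed earlier can be converted into scanlines. This Python sketch assumes 768 CPU cycles per scanline (a 12 MHz 68000 and a 64 µs line), that the copy covers 250 palettes (8000 bytes with color #0 included), and that the loops are exactly as posted; setup code and wait states are ignored.

```python
import math

CYCLES_PER_LINE = 768      # assumption: 12 MHz 68000, 64 us per scanline
PALETTES = 250

def lines(cycles):
    """Whole scanlines needed for a given cycle count."""
    return math.ceil(cycles / CYCLES_PER_LINE)

# Skipping color #0: per-palette costs from the earlier posts.
old_move_l = PALETTES * 168
old_movem_l = PALETTES * 180

# Unrolled movem.l with all 16 registers and absolute addresses:
# 125 pairs of (20 + 8*16) load and (16 + 8*16) store cycles.
unrolled_all_regs = 125 * ((20 + 8 * 16) + (16 + 8 * 16))

print(lines(old_move_l), lines(old_movem_l), lines(unrolled_all_regs))
```

Under these assumptions the estimates land on the measured 55, 59 and 48 lines, which suggests plain cycle counting is already a reasonable first approximation for these loops.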

51

Le 09/04/2016 à 15:45

If you're going to use A7, it should be faster to do it like this.
It will mean a few more movem opcodes to complete the copy, but it should be quicker overall because you're saving 24 cycles per movem pair (292 vs 268). It might take you down to 46 lines or fewer.

	MOVE.l a7,STACKSTORE
	LEA CHUNKY_BUFFER,a7

	MOVEM.l (a7)+,d0-d7/a0-a6
	MOVEM.l d0-d7/a0-a6,PALETTES+16*2*5+60*0
	MOVEM.l (a7)+,d0-d7/a0-a6
	MOVEM.l d0-d7/a0-a6,PALETTES+16*2*5+60*1
	...

	MOVE.l STACKSTORE,a7

www.universebios.com

52

Le 09/04/2016 à 15:59

you are right, this way it's a bit faster - 47 lines.

53

Le 09/04/2016 à 17:04, édité par Dresdenboy le 09/04/2016 à 17:17

blastar (./52) :
you are right, this way it's a bit faster - 47 lines.

So for a copy loop including #0 this seems to be the final option.

Did you also test the movem.W loop for skipping color #0? (Posts #43 and #45) This should take ~51 lines without using a7 and ~49 lines with using it.

This means that, depending on the rendering requirements (color #0 rows or columns possible), there are fast copy loops both with and without color #0 skipping.

54

Le 09/04/2016 à 17:06, édité par Razoola le 09/04/2016 à 19:15

I think that is the maximum you can get easily, blastar.

There is one more possible optimisation I can think of, but now you're getting into the realm of altering the buffer layout to maximise the copy speed.

Notice that in the code below the first longword of the buffer is actually copied to Palette RAM + (0x1fe0-0x38). It may not be worth the hassle, as you'll have to change a few things to make it workable, but if you did, it would get you another scanline or two.

	MOVE.l a7,STACKSTORE
	LEA CHUNKY_BUFFER,a7
	LEA PALETTES+0x1FE0,a6

	MOVEM.l (a7)+,d0-d7/a0-a5
	MOVEM.l d0-d7/a0-a5,-(a6)
	MOVEM.l (a7)+,d0-d7/a0-a5
	MOVEM.l d0-d7/a0-a5,-(a6)
	...

	MOVE.l STACKSTORE,a7
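A rough estimate for this predecrement variant, again as a Python sketch on nominal timings. It assumes 8000 bytes to copy, 768 CPU cycles per scanline, and that the final partial block is handled by one shorter movem pair; none of these details are spelled out in the post.

```python
import math

CYCLES_PER_LINE = 768        # assumption: 12 MHz 68000, 64 us per scanline
TOTAL_BYTES = 250 * 32       # 250 palettes including color #0
DEST_END = 0x1FE0            # offset loaded into a6, relative to PALETTES

def pair_cycles(n):
    """MOVEM.l (a7)+,<n regs> followed by MOVEM.l <n regs>,-(a6)."""
    return (12 + 8 * n) + (8 + 8 * n)

block_bytes = 14 * 4                               # d0-d7/a0-a5
full_pairs, leftover = divmod(TOTAL_BYTES, block_bytes)
total_cycles = full_pairs * pair_cycles(14) + pair_cycles(leftover // 4)

# The first block read from the buffer lands at the top of the destination.
first_dest = DEST_END - block_bytes

print(hex(first_dest), full_pairs, leftover, total_cycles)
print(math.ceil(total_cycles / CYCLES_PER_LINE))
```

Note that 8000 bytes is not a whole number of 56-byte blocks, so a real routine would need a shorter final pair, and the buffer has to be stored in reversed block order for the picture to come out right.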

www.universebios.com

55

Le 09/04/2016 à 17:13

Dresdenboy (./53) :
blastar (./52) :
you are right, this way it's a bit faster - 47 lines.
So for a copy loop including #0 this seems to be the final option.

Did you also test the movem.W loop for skipping color #0? (Posts #43 and #45)
This means, that depending on the rendering requirements (color #0 rows or columns possible), there are fast copy loops with and without color #0 skippings.

Those are not going to be faster than the longword method because there have to be twice as many opcodes. They would definitely have been the best way, though, if blastar had never added color #0 to the buffer.

www.universebios.com

56

Le 09/04/2016 à 17:28

movem.w-loop without color#0 and using A7 is not that slow: 49 lines

MOVE.l a7,STACKSTORE

LEA CHUNKY_BUFFER,a7

MOVEM.w (a7)+,d0-d7/a0-a6
MOVEM.w d0-d7/a0-a6,PALETTES+16*2*5+2+32*0
MOVEM.w (a7)+,d0-d7/a0-a6
MOVEM.w d0-d7/a0-a6,PALETTES+16*2*5+2+32*1
... 
MOVEM.w (a7)+,d0-d7/a0-a6
MOVEM.w d0-d7/a0-a6,PALETTES+16*2*5+2+32*249

MOVE.l STACKSTORE,a7

57

Le 09/04/2016 à 17:52

blastar (./56) :
movem.w-loop without color#0 and using A7 is not that slow: 49 lines

MOVE.l a7,STACKSTORE

LEA CHUNKY_BUFFER,a7

MOVEM.w (a7)+,d0-d7/a0-a6
MOVEM.w d0-d7/a0-a6,PALETTES+16*2*5+2+32*0
MOVEM.w (a7)+,d0-d7/a0-a6
MOVEM.w d0-d7/a0-a6,PALETTES+16*2*5+2+32*1
... 
MOVEM.w (a7)+,d0-d7/a0-a6
MOVEM.w d0-d7/a0-a6,PALETTES+16*2*5+2+32*249

MOVE.l STACKSTORE,a7

58

Le 09/04/2016 à 18:08

Razoola (./55) :
Dresdenboy (./53) :
blastar (./52) :
you are right, this way it's a bit faster - 47 lines.
So for a copy loop including #0 this seems to be the final option.

Did you also test the movem.W loop for skipping color #0? (Posts #43 and #45)
This means, that depending on the rendering requirements (color #0 rows or columns possible), there are fast copy loops with and without color #0 skippings.
Those are not going to be faster than the longword method because there have to be twice as many opcodes. They would definitely have been the best way, though, if blastar had never added color #0 to the buffer.

The movem.w variant completely avoids reading or writing values twice, and it avoids writing color #0. It reads 15 words and writes 15 words, and has linear (a7)+ reading with no address adjustments. That's where this method wins. It has 2x movem-initialization cycles per palette, but also only needs 4c/word.

59

Le 09/04/2016 à 18:36, édité par Razoola le 09/04/2016 à 19:12

Yup, it's a fast way and only marginally slower (2 scanlines) than using longwords, and definitely faster than the original method. I did not think it would be so close to longword speed in this instance. The method I mentioned, altering the buffer format and using longwords, is going to give a little more (45 or just into 46 scanlines), but there is the buffer format change to think about, which might cancel that out and more.

From my years of experience I know that using longwords is the fastest way to copy memory around, but I did not realise that using movem with words can sometimes get you almost as close, so I learned something from your initial idea, Dresdenboy. Maybe you learned that with movem it's sometimes good to use an address register as a pointer (indirect with increment) and not as storage. And blastar has been taken out of the world of move.l (a0)+,(a1)+ and introduced to the world of movem. We all gained something and that's good for everyone.

www.universebios.com

60

Le 09/04/2016 à 18:56