I think that is the maximum you can easily get, blaster.
There is one more possible optimisation I can think of, but now you're getting into the realm of altering the buffer layout to maximise the copy speed.
Notice that in the code below, the first longword of the buffer is actually copied to Palette RAM + (0x1fe0-0x38), because each 14-register MOVEM moves 0x38 bytes and the destination is written with predecrement. It may not be worth the hassle, as you'll have to change a few things to make it workable, but if you did, it would get you another scanline or two.
MOVE.l a7,STACKSTORE
LEA CHUNKY_BUFFER,a7
LEA PALETTES+0x1FE0,a6
MOVEM.l (a7)+,d0-d7/a0-a5
MOVEM.l d0-d7/a0-a5,-(a6)
MOVEM.l (a7)+,d0-d7/a0-a5
MOVEM.l d0-d7/a0-a5,-(a6)
...
MOVE.l STACKSTORE,a7
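
To see what "altering the buffer layout" means in practice, here is a rough host-side C model of the copy above. It is only a sketch for working out offsets on a PC, not code for the target. The names CHUNK, DEST_END, dest_offset and movem_copy are made up for illustration, and it assumes the copy always moves whole 14-longword chunks.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names, chosen for this sketch only. */
enum {
    CHUNK    = 14 * 4,  /* d0-d7/a0-a5 = 14 longwords = 0x38 bytes */
    DEST_END = 0x1FE0   /* offset loaded into a6 before the first store */
};

/* Palette offset at which the byte at buffer offset src_off ends up. */
static uint32_t dest_offset(uint32_t src_off)
{
    uint32_t k = src_off / CHUNK;   /* which MOVEM pair moves this byte */
    /* Chunks are laid down from the top; order inside a chunk is kept. */
    return DEST_END - CHUNK * (k + 1) + src_off % CHUNK;
}

/* Model of the repeated MOVEM pair: read ascending, write descending. */
static void movem_copy(const uint8_t *src, uint8_t *dst, unsigned chunks)
{
    uint8_t *a6 = dst + DEST_END;

    assert((size_t)chunks * CHUNK <= DEST_END);
    for (unsigned k = 0; k < chunks; k++) {
        a6 -= CHUNK;             /* -(a6): predecrement by the whole block */
        memcpy(a6, src, CHUNK);  /* d0 lands at the lowest address */
        src += CHUNK;            /* (a7)+: postincrement on the source */
    }
}
```

So the longwords inside each 0x38-byte chunk keep their order, but the chunks themselves land in reverse. If you want the destination to come out linear, the buffer has to be built with its 0x38-byte chunks stored back to front, which is the extra work mentioned above.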