Sizecoding &amp; custom packing

Update 2021-07-14: Ben of The Overlanders commented on my Facebook post that there was yet another optimization possible on the depack-routine. I have edited the source listing below.

This weekend yet another instance of the very popular Atari ST retro computer happening Sommarhack took place. Due to Covid-19, this year as well as last have been online-only events though. I decided to participate in one of the competitions, the 256 byte intro. This is what’s called sizecoding today, the art of writing something worthy of being shown off, yet you have very little actual room available to do so. On the Atari ST, the limit of 256 bytes excludes the operating system header which is normally 32 bytes in size.

The Atari ST, being 68000 based, is of course a 16 bit computer. There are no instructions smaller than a multiple of 16 bit, 2 bytes. A lot of instructions will be 4, 6 or even 8 bytes long in a regular program. We thus know, already from the start, that our program will not consist of more than 128 low level CPU instructions.

I decided I wanted to display text, and additionally I wanted to use a custom font. In a competition like this you would normally use pre-existing primitives available from the operating system, since you would get those “for free” without using any of the space you have available. A custom font would simply look better, and since the point of the intro I wanted to make was to display an http link, looks would be everything.

The graphics artist in my old group, BlueSTar, had already made an 8x8 pixel font back in 2015 that we had used in two previous releases. It was the obvious choice to use here as well, and so I knew what I now had to work with. The link would need 30 characters, of which 21 would be unique. The normal use of a font like this is to keep it in a lookup table, and simply reference the character you want to print to the screen. However, even having shrunk the full character set down to only the 21 I needed, when I added a minimal printing routine I got way too close to the 256 byte limit. The other option would be to just store the 30 characters in a screen friendly format, which would make for a smaller print routine. Also, none of the characters I needed from the font used the 8th line, so in the end I had 30 times 7 bytes of pure graphics data.

210 bytes.

I wrote up, and optimized, the print routine, the setting of a proper palette and some interrupt to get some movement on the otherwise static content. All that came in at 58 bytes. 210+58=268. 12 bytes over the limit. There were no more optimizations available at this stage, so I needed to look into packing the data. This might sound obvious in a world of ubiquitous ZIP and RAR, but it’s not that simple. The depacker will also need to fit, so I needed to find a way to pack data while at the same time gaining more free space from that packing than the depacking code needed.

Luckily, the same reason for why line 8 was not used by any of these characters would come help me again. In the 8 by 8 pixel square available for each character, none of them used the first column - or the eighth bit of each byte. The reason for this is of course that you should be able to write these 8x8 squares to the screen and automatically get space between characters and lines for readability.

My first test consisted of using bit 8 to mean “0x00” (a byte consisting of all zeroes, no pixels set) follows, in addition to the character otherwise using the lower bits. This got me very close, but not enough. After a few iterations of this concept, I decided that it was time to do something a bit more advanced. I switched to writing a custom packer in Java, which borrowed some concepts from the very capable lz77 algorithm. There would of course not be any room for a custom dictionary, and with only 210 bytes of source data it’s not likely to find some large lengths of duplicate bytes.

I developed a packer that would search for duplicates of 2 and 3 byte blocks, in the first 64 bytes of the data. If the highest bit was set, it together with the next highest bit decided if the position pointed to by the low 6 bits should be copied as 2 or 3 bytes. It would of course prioritize three byte blocks before two byte blocks, and all in all it was able to pack the 210 bytes of source data down to 168. The depacker got to be a bit complicated though, since I needed to store the length-bits and loop on those. At this point, I was able to get the intro under the limit, but I had an awful color cycling as the only available “movement”.

Somewhat disappointed, I decided to focus on the size of the depacker. Removing the possibility of using two different lengths brought it down a lot - and also allowed me to use 7 bits for pointing back. The new packer would thus only look for duplications of 2 byte sequences, and it could use the whole first 128 bytes as dictionary. I also gave it a bit more intelligence in how it sorted the priority on what to select for packing. This gave me not only a saving on the depacker - the packer was now able to save 46 bytes in total on the data! Even more than the supposedly more capable version.

Here’s what the depacker looks like, in both 68000 assembler as well as machine code. As you can see, it’s 26 bytes in size (excluding the setup of video memory in A1 and source data in A0).

41FA 004A	lea _text(pc),a0
47D0    	lea (a0),a3
303C 00A3	move.w #_textend-_text-1,d0     ; our packed byte array length
          .l1:
1418    	move.b (a0)+,d2
6A08            bpl.s .notrep                   ; if positive then not a lookup value
12F3 2080       move.b $80(a3,d2.w),(a1)+       ; Offset to ignore high/negative bit
1433 2081       move.b $81(a3,d2.w),d2          ; -"- and saves a bra.s
          .notrep:
12C2    	move.b d2,(a1)+
51C8 FFEC   	dbf	d0,.l1

The total saving is thus 210-(164+26) = 20 bytes, or 10%. Adding back the 58 bytes of other code, plus handling of a temporary depack space, ended up at 260 bytes.

Oh, wait. 260? That’s … 4 bytes too much. Now, I had already taken the time to create some better “movement” in the screen, doing split rasters from a random seed, and I really didn’t want to scale back on that again. So, the final trick used is borrowed from another code magician of the past - Gunstick of ULM. Last year he found a way to make the 32 byte operating system header 4 bytes smaller, and since the whole point of the demo scene is to cheat as much as possible without it being obvious, that’s how come the A Link Between Worlds entry by SYNC at Sommarhack 2021 came in at the expected 288 bytes executable size.

added 2021-07-14: Ben’s optimization removed 4 bytes and thus the entry would not have needed the header-hack. This is a great example of how, when sizecoding, every single trick in the book is used to shrink the code. Ben makes use of the fact that the high bit (bit 8) is also the negative bit in a byte. Instead of clearing the high bit and jumping on whether it was set or not, as my code did previously, it’s possible to jump directly on whether the previous move.b contained a negative or positive byte. Since we don’t clear the high bit, we then need to offset the following moves with 128 and 129 (same as -128 and -127) on our index. The depack code thus now stands at 22 bytes in size.

If there’s interest, I might clean up the packer and release as well. However, unless you have an extremely similar custom use case you’ll probably be better off using lz77 directly. I’ve heard rumours of such a depacker on the 68000 at just double the size of mine.

(Did I win? Of course not - in true SYNC spirit I only cared about the tech challenge. The other entries had COOL MOVING STUFF - and the winner even had sound! You really should go check them out.)