This post was initially written to keep FerrisCraft users apprised of the work toward the launch of FC5, but expanded for the Writing Gaggle, which you should check out.

Also, my personal blog can be found here: https://www.catmonad.xyz/blog.

Consider also reading yesterday’s devlog.

FerrisCraft 5 Backup Tool Devlog (August 28th)

Apparently I’m doing these daily now. (Not really. Don’t expect another one tomorrow.)

While working on the initial design and outline of this tool, I used the fastanvil crate to parse Minecraft region files (the “Anvil format”). It was good enough to let me examine the internal structure of region files, but it does not provide a way to read the already-compressed chunk data; it only provides a method which creates a buffer of decompressed data.

As I discussed yesterday, I am not going to be implementing delta encoding of chunks in the near future. So, decompressing every chunk is a waste of time, especially when I’m just going to compress it again right after, when I save it in the database.

Instead of looking into fastanvil’s source and perhaps forking the crate in order to add a method which provides direct access to the compressed chunk data, I took the opportunity today to nail down my understanding of the Anvil file format by writing my own parser. I’ll be needing to regenerate region files from the database soon anyway, and so this knowledge was necessary from the start.

This parser is very lazy, and largely assumes that the region file given to it is well-formed. If that assumption is violated, we’ll end up with a panic, which is better than other Bad Things I could name, though in practice it shouldn’t come up anyway.

It works by first running a length check on the given byte slice, and then casting the reference. And that’s it. Boom:

    pub fn parse_ref(region: &[u8]) -> Result<&Self, ()> {
        // `get` (rather than indexing) turns a too-short slice into an Err
        // here instead of a panic.
        let Ok(bytes): Result<&[u8; (PAGE_SIZE * 2) as usize], _> =
            region.get(..(PAGE_SIZE * 2) as usize).ok_or(())?.try_into()
        else {
            return Err(());
        };
        // Safety: LocationTable and TimestampTable both have the same layout as a [u8; PAGE_SIZE].
        // We set Meta to repr(C), which results in it having the same layout as a
        // [[u8; PAGE_SIZE]; 2], which has the same layout as a [u8; PAGE_SIZE * 2].
        // These two types are therefore reference compatible.
        Ok(unsafe { core::mem::transmute(bytes) })
    }

…okay, that’s not the whole story.

The Anvil Format

I’ve alluded to it already, but we should take a few minutes to examine the format we’re talking about here. Officially titled the “Anvil format”, it dictates how chunk data is saved to permanent storage.

It is not a complicated format, with two main segments:

  1. A pair of tables with the same layout, which store, in this order:
    1. Offset information for each chunk as stored within the current file.
    2. The modification time for each chunk stored within the current file.
  2. A contiguous array of chunks.

All this to save a total of 32×32, or 1024, Minecraft chunks.
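
Put concretely (PAGE_SIZE is the same constant the parser code above uses; CHUNKS_PER_REGION is my own name for that 1024 figure):

    // Top-level layout of a region file, offsets in bytes:
    //
    //   0x0000 .. 0x1000   location table   (1024 entries x 4 bytes)
    //   0x1000 .. 0x2000   timestamp table  (1024 entries x 4 bytes)
    //   0x2000 .. EOF      chunk data, page-aligned
    const PAGE_SIZE: u32 = 4096;
    const CHUNKS_PER_REGION: usize = 32 * 32; // 1024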

The Tables

Each table is an array of 4-byte entries, 1024 entries in total, matching the number of chunks saved in the file.

    #[repr(transparent)]
    struct TableEntry {
        bytes: [u8; 4],
    }

    #[repr(transparent)]
    struct Table {
        entries: [TableEntry; 1024],
    }

The two tables are laid out directly next to each other, making the front of the file look like this: [Table; 2]. That pair is what the Meta struct from parse_ref’s safety comment bundles up; a rough sketch, with the wrapper shapes and field names guessed:
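
    // Newtype wrappers so the two tables can't be mixed up.
    #[repr(transparent)]
    struct LocationTable(Table);

    #[repr(transparent)]
    struct TimestampTable(Table);

    // repr(C) keeps the fields in declaration order with no padding (both
    // are align-1), giving Meta the same layout as [[u8; PAGE_SIZE]; 2],
    // exactly what the safety comment in parse_ref relies on.
    #[repr(C)]
    struct Meta {
        locations: LocationTable,
        timestamps: TimestampTable,
    }

But the important thing is how each TableEntry is interpreted.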

In the first table, the first three bytes of an entry encode the offset at which a chunk is found in the file, in 4096-byte multiples:

    pub fn offset(&self) -> u32 {
        self.page_offset() * PAGE_SIZE
    }
    pub fn page_offset(&self) -> u32 {
        let [a, b, c, _] = self.entry.bytes;
        u32::from_be_bytes([0, a, b, c])
    }

And the fourth byte of each entry encodes the amount of space allocated in the file for that chunk, also in 4096-byte multiples:

    pub fn size(&self) -> u32 {
        self.page_size() * PAGE_SIZE
    }
    pub fn page_size(&self) -> u32 {
        let [_, _, _, x] = self.entry.bytes;
        u32::from(x)
    }
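
To make that concrete, a quick worked example. This assumes the Location wrapper introduced further down, holding a TableEntry in an entry field, as the accessors above imply:

    // An entry pointing at page 2 (the first page after the two tables)
    // and spanning a single page.
    let loc = Location {
        entry: TableEntry { bytes: [0x00, 0x00, 0x02, 0x01] },
    };
    assert_eq!(loc.page_offset(), 2);
    assert_eq!(loc.offset(), 8192); // 2 * 4096
    assert_eq!(loc.size(), 4096);   // 1 * 4096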

But which chunk is a TableEntry talking about? That is determined by this mapping, which converts in-world chunk coordinate offsets from the origin of a region to indexes into our Tables here:

    fn compute_chunk_table_index(x: u8, z: u8) -> usize {
        assert!(x < 32 && z < 32);
        let (x, z) = (x as usize, z as usize);
        x + (z * 32)
    }
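
And the inverse, should you ever need to go from a table index back to in-region chunk coordinates (my addition; the parser itself doesn’t need it):

    fn chunk_coords_from_table_index(index: usize) -> (u8, u8) {
        assert!(index < 1024);
        ((index % 32) as u8, (index / 32) as u8)
    }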

A TableEntry can indicate that there is no data saved for its corresponding chunk by being set to all zero bytes.
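
In the same bare-method style as the accessors above, that check might look like this (the method name is mine):

    pub fn is_vacant(&self) -> bool {
        self.entry.bytes == [0; 4]
    }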

Both Tables use this same mapping, so the only remaining piece of information is that entries in the second table are 4-byte big-endian integers representing modification times:

    pub fn time(&self) -> u32 {
        let bytes = self.entry.bytes;
        u32::from_be_bytes(bytes)
    }
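
The format itself treats this as an opaque u32. The conventional reading is Unix epoch seconds, so a conversion might look like this, with that caveat in mind:

    use std::time::{Duration, SystemTime, UNIX_EPOCH};

    // Interpret the raw timestamp as Unix epoch seconds. This is the
    // conventional reading, not something the file format spells out.
    pub fn system_time(&self) -> SystemTime {
        UNIX_EPOCH + Duration::from_secs(u64::from(self.time()))
    }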

It occurs to me that all of these accessor methods could live on the TableEntry struct itself, but they are instead defined on Location and Timestamp structs which wrap TableEntry, to ensure I don’t mistakenly interpret a member of the location table as a timestamp or vice versa.

The Chunks

You might have noticed that each table is 4096 bytes, and offsets into an Anvil file are measured in multiples of 4096 bytes. I have dubbed this unit a “page”, because that word is often used to describe large units of data when people speak of binary formats like this. “Sector” is another common one, but “page” is closer to the front of my mind right now because it’s what SQLite’s documentation uses.

So, the data for chunks starts at a page index of 2, or 8192 bytes into the file.

Each chunk starts with a header with this information:

  1. The number of bytes the remaining chunk data actually uses in the file. (Yes, including the next byte of the header.) Saved as a 4-byte big-endian integer.
  2. The type of compression used for the body of the chunk. Saved as 1 byte, where the value 1 represents gzip, 2 represents zlib, and 3 represents uncompressed data. Reportedly, only zlib compression is seen in the wild.

The corresponding accessors:

    /// Get the length of the chunk body.
    pub fn len(&self) -> u32 {
        let bytes = self.bytes[..4].try_into().unwrap();
        // The stored length counts the compression byte, hence the -1.
        u32::from_be_bytes(bytes) - 1
    }
    pub fn compression_scheme(&self) -> CompressionScheme {
        let scheme = self.bytes[4];
        match scheme {
            1 => CompressionScheme::Gzip,
            2 => CompressionScheme::Zlib,
            3 => CompressionScheme::Uncompressed,
            _ => panic!("unknown compression scheme"),
        }
    }
    pub fn data(&self) -> &[u8] {
        &self.bytes[5..]
    }
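
For completeness, here’s the enum those methods hand back, exactly as the match above implies:

    enum CompressionScheme {
        Gzip,         // byte value 1
        Zlib,         // byte value 2
        Uncompressed, // byte value 3
    }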

The Parsing

Armed with that knowledge of the Anvil format, and those accessor methods I wrote above, parsing is trivial: cast the byte slice, then use the table accessors to locate each chunk and create a ChunkRef which gives us access to whichever chunk we’re examining.
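
Sketched end to end, that lookup goes something like this. The location accessor and ChunkRef::new are stand-ins; the real code differs in the details:

    // From region bytes to one chunk's (still compressed) data, following
    // the steps described above.
    fn chunk_at<'a>(region: &'a [u8], x: u8, z: u8) -> Option<ChunkRef<'a>> {
        let meta = Meta::parse_ref(region).ok()?;
        // Hypothetical accessor over the first (location) table.
        let loc = meta.location(x, z);
        if loc.is_vacant() {
            return None; // all-zero entry: nothing saved for this chunk
        }
        let start = loc.offset() as usize;
        let end = start + loc.size() as usize;
        Some(ChunkRef::new(region.get(start..end)?))
    }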

I’ve also defined an iterator for chunks in an Anvil file, which works by walking the first table in the file and handing out chunk references for those chunks which have data associated with them. That will be used when dumping chunks into the content-addressed store I mentioned last time.
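
A minimal version of that walk, built on the hypothetical chunk_at from the previous sketch:

    // Visit every (x, z) slot in the location table, yielding a ChunkRef
    // for each entry that actually has data behind it.
    fn chunks<'a>(region: &'a [u8]) -> impl Iterator<Item = ChunkRef<'a>> + 'a {
        (0..32u8).flat_map(move |z| {
            (0..32u8).filter_map(move |x| chunk_at(region, x, z))
        })
    }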

One step at a time. This is just one part, but we’re that much closer to having a fully usable backup tool. Maybe next devlog I’ll tell you about the object storage format. :)