
Visualising Filesystems

Have you ever wondered where a filesystem stores its data?

I mean, I know that the data is physically written to a hard drive … but how it is actually arranged on the disk surface was always a mystery to me.

Since I am a science guy, I always love a good experiment, so I challenged myself to find a way to visualise filesystem operations on the disk. I recently found out how to write a network block device (NBD) driver on Linux, and I consider myself quite competent at data visualisation, so I didn’t hesitate to get my hands dirty right away!

Visualising Operations

We can easily break the disk surface down into fixed-size blocks. Since the typical virtual memory page size is 4 KiB (and, consequently, the typical filesystem block size is also 4 KiB), the magic number of 4096 bytes seemed like a good starting point.
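With 4 KiB bins, the bookkeeping for a device of this kind stays tiny. A quick sanity check of the numbers (the 512 MiB size matches the memory block the driver allocates later in the post; the constant names are mine):

```go
package main

import "fmt"

func main() {
	// One histogram bin per 4 KiB block of a 512 MiB device.
	const blockSize = 4 * 1024
	const deviceSize = 512 * 1024 * 1024
	fmt.Println(deviceSize / blockSize) // 131072 bins
}
```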

After dividing the disk into blocks, we can treat each block as a histogram bin that counts the frequency of operations. In simpler words, whenever a read or a write operation touches a particular block, we increment its counter by one.

This means that at the end of our tests we can assign a colour to each block, according to the number of operations that affected it. And since we are talking about stress on the disk, we are going to use the flame graph colour range (black = cold, white = hot).
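As a sketch of that mapping, a normalised count in [0, 1] can be turned into a black → red → yellow → white ramp like this (the exact gradient is my own approximation of the flame palette, not necessarily the one the post used):

```go
package main

import "fmt"

// ramp clamps x to [0, 1] and scales it to a byte.
func ramp(x float64) uint8 {
	if x <= 0 {
		return 0
	}
	if x >= 1 {
		return 255
	}
	return uint8(x * 255)
}

// flameColor maps a normalised operation count in [0, 1] onto a
// black → red → yellow → white gradient: cold blocks come out black,
// the hottest blocks come out white.
func flameColor(v float64) (r, g, b uint8) {
	switch {
	case v < 1.0/3:
		return ramp(v * 3), 0, 0 // black → red
	case v < 2.0/3:
		return 255, ramp((v - 1.0/3) * 3), 0 // red → yellow
	default:
		return 255, 255, ramp((v - 2.0/3) * 3) // yellow → white
	}
}

func main() {
	// Print a few sample points along the ramp.
	for _, v := range []float64{0, 0.25, 0.5, 0.75, 1} {
		fmt.Println(flameColor(v))
	}
}
```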

Writing a network block device

Thankfully, about 99% of the code I required was already available in the buse-go project from Sam Alba. The driver example that comes with the library allocates a 512 MiB memory block, which it then exposes as a network block device to the Linux kernel.

The read/write operations are trivial. My only addition was an array of integers in which I count the number of operations per block.

func (d *DeviceExample) ReadAt(p []byte, off uint) error {
	...

	// Convert the request offset and length into block indices
	iOff := off / d.blockSize
	iLen := (uint(len(p)) + d.blockSize - 1) / d.blockSize

	// Increment the counters on the affected blocks
	for i := iOff; i < iOff+iLen; i++ {
		d.histogram[i]++
	}

	...
}

And, of course, I added a small snippet that, when the user stops the device driver with CTRL+C, creates an image file visualising the values of the d.histogram array. Effectively, we have just created a 512 MiB block device that visualises the frequency of I/O operations with block-level accuracy. Awesome!

Putting it together

Now that we have our “visualisation” block device in place, it’s time to test some filesystems. But before we do that, I wanted to create a realistic, yet reproducible, test to run for each filesystem.

I decided to write a small shell script that is going to:

  1. Create 40 files for each size variation of 1, 25, 50, 75, 100, 250, 500, 750 and 1000 blocks (that’s 4 KiB, 100 KiB, 200 KiB, 300 KiB, 400 KiB and roughly 1 MiB, 2 MiB, 3 MiB and 4 MiB)
  2. Delete 100 random files
  3. Repeat (1) and (2) 10 times

This way I am testing various file sizes, creating random fragmentation, and filling up the entire space of the block device. I think it’s time for the test!

Results

In the following sections we present the I/O load across the disk surface for various filesystems. The legend for the images you are going to see below is the following:

Black means “few operations” and white means “many operations”. The colour range is normalised to the maximum number of operations that occurred on any block. This also means that the more even the distribution of the load, the whiter the image will be.

FAT32

We start our presentation with FAT32, one of the oldest filesystems, still used in some cases. By design, FAT keeps the File Allocation Table at the beginning of the disk (two copies, to be precise), with the Data region right after it. As expected, the FAT table is hammered with I/O operations, while the data region is used evenly.

To my surprise, there are also 3 places in the middle of the data region that are heavily loaded. They are 8 KiB in size and, honestly, I have no idea what they could be :)

NTFS

After FAT, and with the introduction of Windows NT, Microsoft introduced a more reliable filesystem for Windows: NTFS. It is a journaled filesystem but, similar to FAT, it is organised in distinct regions: the boot sector, the Master File Table, the system files area and the files area.

ext2

We then continue our historical tour of filesystems with ext2, an equally old filesystem used by the Linux kernel. Similar to FAT, ext2 also has discrete regions on the disk: it starts with the superblock, followed by the block group descriptor table, the block bitmap, the inode bitmap and the inode table, followed by the data blocks.

Multiple block groups can appear on the disk, and each one contains its own copy of the superblock and of all the other structures I mentioned before. We can clearly see this in the image below:

ext3

The next iteration of Linux filesystems was ext3, which introduced journalling. Since ext3 was designed to be compatible with ext2, we see a similar pattern in the layout of the block groups. This time, however, the disk was loaded almost evenly throughout its surface.

That is to be expected, since a journalling filesystem records the intention to perform a change before the change itself, which creates a continuous stream of information.

ext4

Naturally, we continue with ext4, the most widely used filesystem in the Linux world. Here we don’t see many changes from ext3, apart from the fact that the block groups are now much bigger.

XFS

Another famous filesystem on Linux is XFS. It is renowned for its high performance, thanks to I/O parallelisation, which it achieves using multiple allocation groups: equally sized linear regions within the filesystem. Apart from this, it also uses journalling.

I think both the Allocation Groups and the Journalling features are visible in the following image.

MINIX

While searching my system for other filesystems I could create with mkfs, I found minix-fs. Its structure is very similar to ext2 (they both have an inode bitmap and a zone bitmap), as we can see in the image below.

The “yellow noise” that you see is an artefact caused by the mismatch between the MINIX block size (which is 1 KiB) and my decision to divide the disk into 4 KiB blocks. This means that operations on two adjacent MINIX data blocks can be counted in the same 4 KiB bin, resulting in a yellow colour.

Btrfs

Finally, the last filesystem I tried to visualise was Btrfs. This filesystem is based on the copy-on-write (CoW) principle and is implemented as a B-tree. I didn’t dig too much into its technical details, but it seems like a very nice filesystem, designed for scalability, reliability and ease of maintenance.

I was astonished by the image my tests produced. I had to re-run them with various different configurations, but the result was the same … you might think that the following image is blank, but it’s not! There are 2 blocks that were heavily used (the 2 white pixels), while the rest of the drive is used quite evenly.

Increasing the contrast (a lot) we get the following image:

Bonus: JFS

The Journaled File System (JFS) is another journaling filesystem, created by IBM. From a quick look it does not spread the load across the disk as evenly as the other filesystems, but it’s interesting to see that the heavily loaded blocks are scattered around the disk surface.

Conclusion

I have to say that I learned something new about filesystems today. And most importantly, I learned that if I am using an SSD I should probably use ext4, since it looks like the filesystem that wears the disk most evenly.

Btrfs on the other hand looks really promising, but I would need to do some deeper investigation in order to understand the behaviour that I observed.

I hope you enjoyed the read; till next time!