ncabral.ca


Today I learned that UTF-8 is quite awesome. I already knew that it had some backwards compatibility with ASCII, but I did not know the specifics.

UTF-8 encodes each code point in one to four bytes. You take some code point U+uvwxyz, and its bits get spread across those bytes, with each byte carrying fixed so-called header bits in its most significant bits (MSBs).
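
To make the bit layout concrete, here is a rough sketch of an encoder written by hand. The function name `encode_utf8` is made up for illustration; in real code you would just call Python's built-in `str.encode("utf-8")`.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode a single code point by hand (hypothetical helper, illustration only)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp < 0x80:
        # 1 byte:  0xxxxxxx (plain ASCII)
        return bytes([cp])
    if cp < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Compare against the built-in encoder for 1-, 2-, 3- and 4-byte characters.
for ch in ["A", "é", "し", "😀"]:
    assert encode_utf8(ord(ch)) == ch.encode("utf-8")
```

Every byte after the first one always gets the same 10 header, which is what lets a decoder tell continuation bytes apart from the start of a character.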

For example, if a code point needs two bytes (U+0080 through U+07FF), then the headers are 110 for the first byte and 10 for the second, which leaves 5 + 6 = 11 bits for the code point itself.
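
As a worked two-byte case, take "é" (U+00E9). This is a small sketch with the bit twiddling spelled out; the variable names are just for illustration.

```python
cp = 0xE9  # U+00E9, "é"; as 11 payload bits: 00011 101001

# First byte: header 110, then the upper 5 payload bits.
first = 0b11000000 | (cp >> 6)
# Second byte: header 10, then the lower 6 payload bits.
second = 0b10000000 | (cp & 0b00111111)

print(f"{first:08b} {second:08b}")  # 11000011 10101001

# Same result as Python's own encoder.
assert bytes([first, second]) == "é".encode("utf-8")
```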

The great part is that code written to process ASCII will probably still work when processing UTF-8 encoded text. ASCII characters are encoded as themselves, in a single byte whose top bit is 0, while every byte of a multi-byte character starts with a 1 bit (due to the headers). So matching specific ASCII characters in a UTF-8 encoded file still works: you will never get a false positive from something like a comma delimiter showing up inside a multi-byte character that your ASCII processor doesn't understand.
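
Here is a small sketch of that idea: a byte-level splitter that knows nothing about UTF-8 and only looks for the ASCII comma byte (0x2C). The function name `split_fields` and the sample line are assumptions made up for this example.

```python
def split_fields(line: bytes) -> list:
    """Split on the ASCII comma byte, with no knowledge of UTF-8 (illustrative)."""
    fields = []
    current = bytearray()
    for b in line:
        if b == 0x2C:  # ASCII ','
            fields.append(bytes(current))
            current = bytearray()
        else:
            current.append(b)
    fields.append(bytes(current))
    return fields


line = "しいたけ,mushroom,café".encode("utf-8")
fields = split_fields(line)

# The multi-byte characters come through intact.
assert [f.decode("utf-8") for f in fields] == ["しいたけ", "mushroom", "café"]

# Every byte of the multi-byte characters has its top bit set,
# so none of them can ever equal 0x2C.
assert all(b >= 0x80 for b in "しいたけ".encode("utf-8"))
```

This only covers matching ASCII bytes. Anything that counts characters or slices at arbitrary byte offsets still needs to be UTF-8 aware.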

For fun, does this website support utf-8? Let's try ... しいたけ.

Copyright (c) 2022-2025 Noah Cabral.