Zero-width characters are non-printing characters that are not displayed by most applications, which leads to the name “zero-width.” They are Unicode characters, typically used to mark possible line break or join/separate characters in writing systems that use ligatures.
As they are “invisible,” anyone can use them to conceal messages or information within plain text. Don’t believe me? I left a secret message in the first sentence. Read on to find out how it’s possible.
Available zero-width characters
So far I’ve found nine zero-width characters in the Unicode character table.
|Zero-width no-break space||U+FEFF|
There may be more, but nine is more than enough. In theory, just two different zero-width characters are enough to encode any kind of data. Because a binary representation tends to be long, we can use more of the zero-width characters as extra symbols to shorten the encoded data.
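To make the length trade-off concrete, here is a minimal sketch comparing a two-character alphabet with a four-character one. The helper name `encodeWithAlphabet` and the particular four characters (U+200B, U+200C, U+200D, U+FEFF) are assumptions chosen for illustration. Using all nine characters would need a general base conversion, so this sketch sticks to alphabet sizes that divide a byte evenly.

```javascript
// Hypothetical helper: encode a string's UTF-8 bytes with an alphabet of
// zero-width characters. The alphabet size must be 2, 4, or 16 so that each
// character carries a whole number of bits that divides 8 evenly.
function encodeWithAlphabet(text, alphabet) {
  const bitsPerChar = Math.log2(alphabet.length); // 1 bit for 2 chars, 2 bits for 4 chars
  const bytes = new TextEncoder().encode(text);
  let out = "";
  for (const byte of bytes) {
    // Walk the byte from the most significant chunk to the least significant one.
    for (let shift = 8 - bitsPerChar; shift >= 0; shift -= bitsPerChar) {
      out += alphabet[(byte >> shift) & (alphabet.length - 1)];
    }
  }
  return out;
}

// Assumed alphabets; any distinct zero-width characters would do.
const two = ["\u200B", "\u200C"];
const four = ["\u200B", "\u200C", "\u200D", "\uFEFF"];

console.log(encodeWithAlphabet("secret", two).length);  // 48: 8 characters per byte
console.log(encodeWithAlphabet("secret", four).length); // 24: 4 characters per byte
```

Doubling the number of bits each character carries halves the number of invisible characters, which matters because editors still count them.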
Zero-width characters can be used to fingerprint text. For example, someone within your team is leaking confidential information but you don’t know who. Just send each member a classified text with their name encoded in it. Wait for it to be leaked, then extract the name, and do whatever you like with them.
Unlike other steganography techniques (such as using noise in images, video, or audio as the container), zero-width characters usually survive when the text is formatted, copied, or pasted. They are really hard to detect without special tools, as most text editors don’t render them. In addition, there is no hard limit on the amount of data that can be encoded. However, editors do count zero-width characters, so encoding too much data within a short text makes it more suspicious.
To demonstrate the ability to hide secret messages with zero-width characters, I created a tool here.
How does it work?
- Use TextEncoder to convert the secret message from a string to a Uint8Array, which is an array of 8-bit unsigned integers.
- Convert each integer to 8 bits, then convert each bit to zero-width characters:
  - Bit value 0 is encoded as Zero-width space (U+200B)
  - Bit value 1 is encoded as Zero-width non-joiner (U+200C)
- Hide the encoded string in the middle of the carrier message.
In addition, two other zero-width characters are used to mark the beginning and ending of the encoded string:
- Left-To-Right Mark (U+200E) marks the beginning
- Right-To-Left Mark (U+200F) marks the end
This makes it easier to detect the position of the encoded string when decoding it.
Please refer to the source code for more details.
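As a rough illustration of the steps described in this section, and not the tool's actual source, the whole round trip might look like the sketch below. The function names `hide` and `reveal` are hypothetical, and bits are assumed to be written most significant first.

```javascript
// Sketch only: hide() and reveal() are hypothetical names, not the tool's API.
const ZERO = "\u200B";  // zero-width space      -> bit 0
const ONE = "\u200C";   // zero-width non-joiner -> bit 1
const START = "\u200E"; // left-to-right mark    -> beginning of the payload
const END = "\u200F";   // right-to-left mark    -> end of the payload

function hide(carrier, secret) {
  const bytes = new TextEncoder().encode(secret); // string -> Uint8Array
  let payload = "";
  for (const byte of bytes) {
    // Each byte becomes 8 bits (assumed most significant bit first).
    for (const bit of byte.toString(2).padStart(8, "0")) {
      payload += bit === "0" ? ZERO : ONE;
    }
  }
  // Insert in the middle of the carrier. Splitting by UTF-16 code unit keeps
  // the sketch short but could cut a surrogate pair in carriers with emoji.
  const middle = Math.floor(carrier.length / 2);
  return carrier.slice(0, middle) + START + payload + END + carrier.slice(middle);
}

function reveal(text) {
  const start = text.indexOf(START);
  const end = text.indexOf(END, start + 1);
  if (start === -1 || end === -1) return null; // no hidden payload found
  // Map the characters between the markers back to a bit string.
  const bits = [...text.slice(start + 1, end)]
    .map((ch) => (ch === ZERO ? "0" : ch === ONE ? "1" : ""))
    .join("");
  const bytes = new Uint8Array(Math.floor(bits.length / 8));
  for (let i = 0; i < bytes.length; i++) {
    bytes[i] = parseInt(bits.slice(i * 8, i * 8 + 8), 2);
  }
  return new TextDecoder().decode(bytes); // Uint8Array -> string
}

const stego = hide("Hello, world!", "hi");
console.log(stego.length);  // 31: 13 visible + 16 bit characters + 2 markers
console.log(reveal(stego)); // hi
```

Note that this sketch assumes the carrier itself contains no U+200E or U+200F, since `reveal` treats the first occurrence of each as a marker.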
Detect zero-width characters
Use any text editor that supports rendering of zero-width characters.
For a quick test, you can use the Chrome Developer Tools console:
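A check along these lines can be pasted into the console. It is a minimal sketch: the helper name `revealZeroWidth` is hypothetical, and the character class covers only U+200B through U+200F plus U+FEFF, so it may miss other zero-width characters.

```javascript
// Hypothetical helper: replace each zero-width character with a visible
// "<U+XXXX>" tag so it shows up in the console output.
// The character class is an assumption; extend it with other code points as needed.
function revealZeroWidth(text) {
  return text.replace(/[\u200B-\u200F\uFEFF]/g, (ch) =>
    "<U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0") + ">"
  );
}

console.log(revealZeroWidth("a\u200Bb\u200Cc")); // a<U+200B>b<U+200C>c
```

Paste the suspicious text as the argument; if the output equals the input, none of the listed characters are present.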
This Chrome extension will convert any zero-width characters to emojis.