The Unexpected Complexity of Unicode in VBA
If you’ve ever tried to handle Unicode properly in VBA, you know it’s not as straightforward as it sounds. VBA strings are UTF-16 internally — in theory, this should make things easy. In practice, it’s a minefield.
The problem
I needed a simple function: given a VBA string, return the correct number of Unicode code points. Sounds trivial, right? Len() gives you the number of UTF-16 code units, not code points. For characters in the Basic Multilingual Plane, those are the same thing. But the moment you encounter an emoji, a mathematical symbol, or a CJK extension character, you’re dealing with surrogate pairs, and Len() lies to you.
Surrogate pairs in VBA
A surrogate pair is how UTF-16 encodes code points above U+FFFF. The character 𝐀 (Mathematical Bold Capital A, U+1D400) is stored as two 16-bit values: 0xD835 followed by 0xDC00. VBA’s Len() says this string is 2 characters long. It’s not — it’s one character.
Dim s As String
s = ChrW(&HD835) & ChrW(&HDC00)
Debug.Print Len(s) ' Prints 2, but it's 1 code point
This gets worse when you try to do substring operations. Mid(), Left(), Right() — they all operate on code units. Split a surrogate pair, and you get invalid Unicode.
Cross-platform fun
On Windows, you can call MultiByteToWideChar and WideCharToMultiByte through the Windows API. On Mac, none of that exists. You’re on your own. My library VBA-StringTools handles all of this — proper code point counting, safe substring operations, UTF-8 conversion — on both Windows and Mac.
The Mac implementation was particularly challenging. Without Windows API calls, everything has to be done in pure VBA, which means manually walking the string and checking for surrogates. It’s not fast, but it’s correct.
Lessons learned
- Never assume
Len()gives you what you want. It gives you code units, not characters. - Test with emoji. They’re the easiest way to find surrogate pair bugs.
- Cross-platform VBA is a different language. What works on Windows often doesn’t work on Mac, and vice versa.
- Sometimes the “boring” language has the most interesting problems. VBA might not be trendy, but getting Unicode right in a COM-era language is genuinely hard engineering.
If you’re working with international text in VBA, check out VBA-StringTools. It handles the edge cases so you don’t have to.