Diacritics in file names on macOS behave strangely

by Warp   Last Updated October 09, 2019 14:01 PM

Consider the following, run on the terminal on macOS:

> touch ä
> ls
> ls *
> ls * | hexdump
00000000  c3 a4 0a
> ls | hexdump
00000000  61 cc 88 0a

Expanding * on the command line yields the file name in its ordinary UTF-8 form: the single precomposed character ä (c3 a4). However, when ls reads the file names from the current directory itself, rather than receiving them as parameters, the ä has for some unfathomable reason been transmogrified into a plain a followed by a UTF-8 combining diaeresis (61 cc 88). Does anybody have any idea why that's happening?

This is a bit problematic because programs that read file names from a directory and compare them with names obtained elsewhere see that exact same difference.

Answers 1

macOS applies Unicode normalization to file names, so that programs always find the exact same file regardless of whether they use the composed or the decomposed form.

Unusually, macOS with the HFS+ filesystem uses NFD normalization, which always decomposes the characters into base + combining diacritics.

(In the new APFS, the opposite NFC format is used for better compatibility, as non-macOS systems more commonly use the precomposed characters.)
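
For programs that have to cope with both forms, one possible workaround is to normalize names before comparing them. The following is only a sketch in Python using the standard unicodedata module; the helper names same_name and find_entry are made up for illustration and are not part of any macOS API. It first reproduces the two byte sequences from the question, then compares names after converting both to NFC.

```python
import os
import unicodedata

composed = "\u00e4"                                  # ä as a single code point (NFC)
decomposed = unicodedata.normalize("NFD", composed)  # "a" followed by U+0308 (NFD)

print(composed.encode("utf-8").hex())    # c3a4
print(decomposed.encode("utf-8").hex())  # 61cc88
print(composed == decomposed)            # False: different code point sequences


def same_name(a, b):
    """Compare two file names after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def find_entry(directory, wanted):
    """Return the entry in `directory` that matches `wanted` in either form, or None."""
    for entry in os.listdir(directory):
        if same_name(entry, wanted):
            return entry
    return None


print(same_name(composed, decomposed))   # True
```

Normalizing both sides to NFD would work just as well; the only requirement is that both names go through the same form before the comparison.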

October 09, 2019 13:53 PM
