The question rephrased (as I interpret it) is:
Why doesn't readdir return the newly created filename? (Here, it is represented by the constant filename, which is set to Bärlauch.)
(Note: filename is a Perl constant, which is why it has no $ sigil in front.)
Background:
First note: because of the use utf8 statement at the beginning of your program, filename will be upgraded to a Unicode string at compile time, since it contains non-ASCII characters. From the documentation of the utf8 pragma:
Enabling the utf8 pragma has the following effect: Bytes in the source
text that are not in the ASCII character set will be treated as being
part of a literal UTF-8 sequence. This includes most literals such as
identifier names, string constants, and constant regular expression
patterns.
and also, according to the perluniintro section "Perl's Unicode Model":
The general principle is that Perl tries to keep its data as eight-bit
bytes for as long as possible, but as soon as Unicodeness cannot be
avoided, the data is transparently upgraded to Unicode.
...
Internally, Perl currently uses whatever the native eight-bit
character set of the platform (for example Latin-1) is, defaulting to
UTF-8, to encode Unicode strings.
The non-ASCII character in filename is the letter ä. In the ISO 8859-1 (Latin-1) encoding, it is encoded as the byte value 0xE4; see this table at ascii-code.com.
However, if you removed the ä character from filename, it would contain only ASCII characters and would therefore not be upgraded to Unicode internally, even with the utf8 pragma in effect.
So filename is now a Unicode string with the internal UTF-8 flag set (see the utf8 pragma for more information on the UTF-8 flag). Note that the letter ä is encoded in UTF-8 as the two bytes 0xC3 0xA4.
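As a minimal sketch of the point above, the flag and the encoded bytes can be inspected with utf8::is_utf8 and Encode::encode_utf8. The variable names are my own, the ASCII-only variant is a made-up example, and the script is assumed to be saved as UTF-8:

```perl
use strict;
use warnings;
use utf8;                      # the source file itself is assumed to be UTF-8
use Encode qw(encode_utf8);

my $unicode_name = "Bärlauch"; # contains a non-ASCII character
my $ascii_name   = "Barlauch"; # hypothetical ASCII-only variant

# The non-ASCII literal carries the internal UTF-8 flag, the ASCII one does not.
my $unicode_flag = utf8::is_utf8($unicode_name) ? 1 : 0;
my $ascii_flag   = utf8::is_utf8($ascii_name)   ? 1 : 0;

# Character length vs. byte length after encoding: ä becomes 0xC3 0xA4.
my $char_length = length $unicode_name;
my $bytes       = encode_utf8($unicode_name);
my $byte_length = length $bytes;
my $hex         = join ' ', map { sprintf '%02X', ord } split //, $bytes;

print "flags: $unicode_flag/$ascii_flag, chars: $char_length, bytes: $byte_length\n";
print "$hex\n";
```

The string is 8 characters long but 9 bytes once encoded, because ä takes two bytes in UTF-8.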
Writing the file:
When writing the file, what happens with the filename? If filename is a Unicode string, it will be encoded as UTF-8. Note, however, that it is not necessary to encode filename first (encode_utf8( filename )). See Creating filenames with unicode characters for more information. So the filename is written to disk as UTF-8 encoded bytes.
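To illustrate, here is a small sketch that creates the file from the character string and then looks it up again using explicitly encoded bytes. It assumes a Linux-like filesystem that stores names as raw bytes and a script saved as UTF-8; the scratch directory and variable names are my own:

```perl
use strict;
use warnings;
use utf8;                           # source assumed to be saved as UTF-8
use Encode qw(encode_utf8);
use File::Temp qw(tempdir);

my $dir  = tempdir( CLEANUP => 1 ); # scratch directory, removed at exit
my $name = "Bärlauch";              # character string (UTF-8 flag set)

# Create the file using the character string directly, without encoding it.
open( my $fh, '>', "$dir/$name" ) or die "Cannot create file: $!";
close $fh;

# Look the file up again using explicitly UTF-8 encoded bytes.
my $byte_path   = $dir . '/' . encode_utf8($name);
my $found_bytes = -e $byte_path ? 1 : 0;

print "file found via encoded name: $found_bytes\n";
```

On such a system, both spellings refer to the same directory entry, which is why the explicit encode_utf8 call is not required here.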
Reading the filename back:
When trying to read the filename back from disk, readdir does not return Unicode strings (strings with the UTF-8 flag set), even if the filename contains bytes encoded in UTF-8. It returns binary (byte) strings; see perlunitut for a discussion of byte strings vs. character (Unicode) strings.
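The following sketch shows what readdir hands back for such a file. It assumes a Linux-like filesystem that returns names byte for byte (other systems, such as Mac OS, may normalize them), and it uses a scratch directory and variable names of my own choosing:

```perl
use strict;
use warnings;
use utf8;                           # source assumed to be saved as UTF-8
use Encode qw(encode_utf8);
use File::Temp qw(tempdir);

my $dir = tempdir( CLEANUP => 1 );  # scratch directory, removed at exit

# Create a file whose name is the UTF-8 encoding of "Bärlauch".
my $byte_name = encode_utf8("Bärlauch");
open( my $fh, '>', "$dir/$byte_name" ) or die "Cannot create file: $!";
close $fh;

# Read the directory back, skipping the "." and ".." entries.
opendir( my $dh, $dir ) or die "Cannot open directory: $!";
my @entries = grep { $_ ne '.' && $_ ne '..' } readdir $dh;
closedir $dh;

my $entry     = $entries[0];
my $has_flag  = utf8::is_utf8($entry) ? 1 : 0;  # flag state of the returned name
my $entry_len = length $entry;                  # counts bytes, not characters

print "UTF-8 flag: $has_flag, length: $entry_len\n";
```

The returned name has no UTF-8 flag and a length of 9, because length counts the two bytes of ä separately.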
Why doesn't readdir return Unicode strings? First, according to the perlunicode section "When Unicode Does Not Happen":
There are still many places where Unicode (in some encoding or
another) could be given as arguments or received as results, or both
in Perl, but it is not. (...)
The following are such interfaces. For all of these interfaces Perl
currently (as of v5.16.0) simply assumes byte strings both as
arguments and results. (...)
One reason that Perl does not attempt to resolve the role of Unicode
in these situations is that the answers are highly dependent on the
operating system and the file system(s). For example, whether
filenames can be in Unicode and in exactly what kind of encoding, is
not exactly a portable concept. (...)
- chdir, chmod, chown, chroot, exec, link, lstat, mkdir, rename, rmdir, stat, symlink, truncate, unlink, utime, -X
- %ENV
- glob (aka the <*>)
- open, opendir, sysopen
- qx (aka the backtick operator), system
- readdir, readlink
So readdir returns byte strings, since it is in general impossible to know the encoding of a filename a priori. For background information about why this is impossible, see for example:
String comparison:
Now, finally, you try to compare the filename you read back, $filename_read, with the constant filename:
print "found\n" if $filename_read eq filename;
In this case, the only difference between $filename_read and filename is that $filename_read does not have the UTF-8 flag set (it is not what Perl internally recognizes as a "Unicode string").
The interesting thing now is that the result of the eq operator depends on whether the bytes in $filename_read are pure ASCII or not. According to the documentation of the Encode module:
Before the introduction of Unicode support in Perl, The eq operator
just compared the strings represented by two scalars. Beginning with
Perl 5.8, eq compares two strings with simultaneous consideration of
the UTF8 flag.
...
When you decode, the resulting UTF8 flag is on--unless you can unambiguously represent data.
So in your case, eq will take the UTF-8 flag into account, since $filename_read does not contain pure ASCII, and as a result it will consider the two strings not equal. If $filename_read and filename were identical and contained only pure ASCII bytes (with filename still having the UTF-8 flag set and $filename_read not having it), then eq would consider the two strings equal. See the discussion in the documentation for Encode for more information on the background for this behavior.
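Both cases can be reproduced without touching the filesystem. In this sketch the byte string stands in for what readdir would return, the ASCII filename is a made-up example, and utf8::upgrade is used only to force the flag on for the demonstration:

```perl
use strict;
use warnings;
use utf8;                        # source assumed to be saved as UTF-8
use Encode qw(encode_utf8);

my $chars = "Bärlauch";          # character string, like the constant in the question
my $bytes = encode_utf8($chars); # byte string, as readdir would return it

# Non-ASCII case: 9 bytes are compared against 8 characters, so eq fails.
my $non_ascii_equal = $bytes eq $chars ? 1 : 0;

# ASCII case: force the UTF-8 flag on one copy; eq still reports equality.
my $plain   = "readme.txt";      # hypothetical ASCII-only filename
my $flagged = "readme.txt";
utf8::upgrade($flagged);         # sets the flag without changing the characters
my $ascii_equal = $plain eq $flagged ? 1 : 0;

print "non-ASCII equal: $non_ascii_equal, ASCII equal: $ascii_equal\n";
```

In the non-ASCII case, Perl treats each byte of the unflagged string as one character, so the two bytes 0xC3 0xA4 can never match the single character ä.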
Conclusion:
So if you are relatively confident that all your filenames are UTF-8 encoded, you can solve the issue in your question by decoding the byte string returned from readdir into a Unicode string (forcing the UTF-8 flag to be set):
$filename_read = Encode::decode_utf8( $filename_read );
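If you are less sure about the encoding, one possible variation is to decode strictly and treat a failure as "not UTF-8". This is only a sketch: decode_filename is a hypothetical helper of my own, and it uses Encode::decode with the FB_CROAK check instead of the decode_utf8 call shown above:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns the decoded name, or undef if the bytes
# are not well-formed UTF-8.
sub decode_filename {
    my ($raw) = @_;
    my $decoded = eval {
        # LEAVE_SRC keeps $raw intact; FB_CROAK dies on malformed input.
        Encode::decode( 'UTF-8', $raw, Encode::FB_CROAK() | Encode::LEAVE_SRC() );
    };
    return $decoded;
}

my $utf8_name   = decode_filename("B\xC3\xA4rlauch"); # valid UTF-8 bytes
my $latin1_name = decode_filename("B\xE4rlauch");     # Latin-1 bytes, not valid UTF-8

printf "UTF-8 name decoded to %d characters\n", length $utf8_name;
print "Latin-1 name rejected\n" unless defined $latin1_name;
```

A name that fails to decode can then be skipped or handled as raw bytes, rather than silently turning into replacement characters.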
More details
Note: since Unicode allows multiple representations of the same character, there exist two forms of the ä (LATIN SMALL LETTER A WITH DIAERESIS) in Bärlauch. For example,
- U+00E4 is the NFC (Normalization Form Canonical Composition) form,
- U+0061 U+0308 is the NFD (Normalization Form Canonical Decomposition) form.
On my platform (Linux), UTF-8 encoded filenames are normally stored in NFC form, but on Mac OS they use NFD form. See Encode::UTF8Mac for more information. This means that if you work on a Linux machine and, for example, clone a Git repository that was created by a Mac user, you can easily get NFD encoded filenames on your Linux machine. The Linux filesystem does not care what encoding a filename is in; it just treats it as a sequence of bytes. Hence, I could easily write a script that created an ISO-Latin-1 encoded filename, even though my locale is "en_US.UTF-8". The current locale settings are just guidelines for applications; nothing stops an application from ignoring them.
So if you are unsure whether filenames returned from readdir use NFC or NFD, you should always decompose them after you have decoded them:
use Unicode::Normalize;
print "found\n" if NFD( $filename_read ) eq NFD( filename );
See also Perl Unicode Cookbook section "Always Decompose and Recompose".
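The effect of normalization can be checked on already decoded strings. In this sketch the two spellings are written with code point escapes, so it does not depend on how the script file is encoded; the variable names are my own:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $composed   = "B\x{E4}rlauch";    # ä as the single code point U+00E4
my $decomposed = "Ba\x{308}rlauch";  # a followed by U+0308 COMBINING DIAERESIS

# The raw strings differ, but agree once both are normalized the same way.
my $raw_equal = $composed eq $decomposed           ? 1 : 0;
my $nfd_equal = NFD($composed) eq NFD($decomposed) ? 1 : 0;
my $nfc_equal = NFC($composed) eq NFC($decomposed) ? 1 : 0;

printf "raw: %d, NFD: %d, NFC: %d\n", $raw_equal, $nfd_equal, $nfc_equal;
printf "lengths: NFC %d, NFD %d\n",
    length( NFC($decomposed) ), length( NFD($composed) );
```

Either form works for the comparison, as long as both sides are normalized the same way.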
Finally, to understand more about how the Locale works together with Unicode in Perl, you could have a look at: