c# - How do I convert from a possibly Windows 1252 'ANSI' encoded uploaded file to UTF8 in .NET?

asked Oct 24, 2021 in Technique by 深蓝 (71.8m points)

I've got a FileUpload control in an ASP.NET web page which is used to upload a file, the contents of which (in a stream) are processed in the C# code behind and output on the page later, using HtmlEncode.

But some of this output is becoming mangled; specifically, the symbol '£' is output as the Unicode U+FFFD REPLACEMENT CHARACTER. I've tracked this down to the input file, which is Windows 1252 ('ANSI') encoded.

The question is,

1. How do I determine whether the file is encoded as 1252 or UTF8? It could be either, and
2. How do I convert it to UTF8 if it is in Windows 1252, preserving the symbol £ etc?

I've looked online but cannot find a satisfactory answer.

1 Answer

深蓝 · Answer 1 · 2021-10-23T19:56:07+0000

If you know that the file is encoded with Windows 1252, you can open the file with a StreamReader and pass the proper encoding. That is:

StreamReader reader = new StreamReader("filename", Encoding.GetEncoding("Windows-1252"), true);

The "true" tells it to set the encoding based on the byte order marks at the front of the file, if they're there. Otherwise it opens it as Windows-1252.
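To see which encoding the reader actually settled on, you can inspect CurrentEncoding after the first read. The following is only a sketch: the class and method names (BomDemo, DetectedEncodingName) are made up for illustration, and it assumes the uploaded file's bytes are already in memory. On .NET Core / .NET 5+ the Windows-1252 code page has to be registered first; on .NET Framework that call is unnecessary and can be dropped.

```csharp
using System.IO;
using System.Text;

static class BomDemo
{
    // Hypothetical helper: reports which encoding StreamReader picked
    // for the given bytes (BOM if present, otherwise Windows-1252).
    public static string DetectedEncodingName(byte[] fileBytes)
    {
        // Needed on .NET Core / .NET 5+ so that code page 1252 is available.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        using (var stream = new MemoryStream(fileBytes))
        using (var reader = new StreamReader(stream, Encoding.GetEncoding(1252), true))
        {
            // BOM detection only happens once the reader has looked at the data.
            reader.Peek();
            return reader.CurrentEncoding.WebName;
        }
    }
}
```

Without a BOM the reader stays on the encoding you passed in, so this tells you nothing about BOM-less UTF-8 files.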

You can then read the file and, if you want to convert to UTF-8, write to a file that you've opened with that encoding.
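Since the question deals with an uploaded stream rather than a file on disk, here is a rough sketch of the same read-then-write idea using streams. EncodingConverter and ConvertToUtf8 are names invented for this example, and it assumes the input really is Windows-1252 unless a BOM says otherwise.

```csharp
using System.IO;
using System.Text;

static class EncodingConverter
{
    // Hypothetical helper: reads text from input as Windows-1252 (or as the
    // Unicode flavor named by a BOM, if one is present) and writes it to
    // output as UTF-8 without a BOM. Both streams are closed when it returns.
    public static void ConvertToUtf8(Stream input, Stream output)
    {
        // Needed on .NET Core / .NET 5+ so that code page 1252 is available;
        // unnecessary on .NET Framework.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding windows1252 = Encoding.GetEncoding(1252);

        using (var reader = new StreamReader(input, windows1252, true))
        using (var writer = new StreamWriter(output, new UTF8Encoding(false)))
        {
            char[] buffer = new char[4096];
            int read;
            // Copy decoded characters across; the writer re-encodes them as UTF-8.
            while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
            {
                writer.Write(buffer, 0, read);
            }
        }
    }
}
```

With this, the single Windows-1252 byte 0xA3 ('£') comes out as the two UTF-8 bytes 0xC2 0xA3. If the text is only going to be HtmlEncoded and written to the page, reading it into a string with the right encoding is already enough, because .NET strings are Unicode internally.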

The short answer to your first question is that there isn't a 100% satisfactory way to determine the encoding of a file. If there are byte order marks, you can determine what flavor of Unicode it is, but without the BOM, you're stuck with using heuristics to determine the encoding.
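One heuristic that is often good enough when the only two candidates are UTF-8 and Windows-1252: try a strict UTF-8 decode first and fall back to 1252 if the bytes are not valid UTF-8. This is a sketch of that idea, not a general-purpose detector; EncodingSniffer and DecodeUtf8OrWindows1252 are hypothetical names, and it assumes the whole upload fits in a byte array.

```csharp
using System.IO;
using System.Text;

static class EncodingSniffer
{
    // Hypothetical helper: decodes bytes as UTF-8 if they are valid UTF-8,
    // otherwise as Windows-1252.
    public static string DecodeUtf8OrWindows1252(byte[] bytes)
    {
        // The second argument makes the decoder throw on invalid byte
        // sequences instead of silently substituting U+FFFD.
        var strictUtf8 = new UTF8Encoding(false, true);
        try
        {
            string text = strictUtf8.GetString(bytes);
            // GetString keeps a leading BOM as U+FEFF, so strip it if present.
            return text.Length > 0 && text[0] == '\uFEFF' ? text.Substring(1) : text;
        }
        catch (DecoderFallbackException)
        {
            // Needed on .NET Core / .NET 5+ so that code page 1252 is available;
            // unnecessary on .NET Framework.
            Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
            return Encoding.GetEncoding(1252).GetString(bytes);
        }
    }
}
```

This works because most Windows-1252 text containing characters such as '£' (byte 0xA3) is not valid UTF-8. It can still guess wrong: pure ASCII decodes identically either way, and a 1252 byte sequence that happens to form valid UTF-8 will be read as UTF-8, although that is rare in real text.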

I don't have a good reference for the heuristics. You might search for "how does Notepad determine the character set". I recall seeing something about that some time ago.

In practice, I've found the following to work for most of what I do:

StreamReader reader = new StreamReader("filename", Encoding.Default, true);

Most of the files I read are ones that I create with .NET's StreamWriter, and they're in UTF-8 with the BOM. Other files that I get are typically written with some tool that doesn't understand Unicode or code pages, and I just treat them as a stream of bytes, which Encoding.Default does well. (That holds on .NET Framework, where Encoding.Default is the system's ANSI code page; on .NET Core and later it is always UTF-8.)
