Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
334 views
in Technique[技术] by (71.8m points)

Strip Byte Order Mark from string in C#

I've read similar posts on this and they don't answer my question.

In C#, I have a string that I'm obtaining from WebClient.DownloadString. I've tried setting client.Encoding to new UTF8Encoding(false), but that's made no difference - I still end up with a byte order mark for UTF-8 at the beginning of the result string. I need to remove this (to parse the resulting XML with LINQ), and want to do so in memory.

So I have a string that starts with x00EFx00BBx00BF, and I want to remove that if it exists. Right now I'm using

if (xml.StartsWith(ByteOrderMarkUtf8))
{
    xml = xml.Remove(0, ByteOrderMarkUtf8.Length);
}

but that just feels wrong. I've tried all sorts of code with streams, GetBytes, and encodings, and nothing works. Can anyone provide the "right" algorithm to strip a BOM from a string?

Thank you!

Question&Answers:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

I recently had issues with the .net 4 upgrade but until then the simple answer is

String.Trim()

removes the BOM up until .net 3.5 However in .net 4 you need to change it slightly

String.Trim(new char[]{'uFEFF'});

That will also get rid of the Byte order mark, though you may also want to remove the ZERO WIDTH SPACE U+200B

String.Trim(new char[]{'uFEFF','u200B'});

This you could also use to remove other unwanted characters

Some further information from http://msdn.microsoft.com/en-us/library/t97s7bs3.aspx

The .NET Framework 3.5 SP1 and earlier versions maintain an internal list of white-space characters that this method trims. Starting with the .NET Framework 4, the method trims all Unicode white-space characters (that is, characters that produce a true return value when they are passed to the Char.IsWhiteSpace method). Because of this change, the Trim method in the .NET Framework 3.5 SP1 and earlier versions removes two characters, ZERO WIDTH SPACE (U+200B) and ZERO WIDTH NO-BREAK SPACE (U+FEFF), that the Trim method in the .NET Framework 4 and later versions does not remove. In addition, the Trim method in the .NET Framework 3.5 SP1 and earlier versions does not trim three Unicode white-space characters: MONGOLIAN VOWEL SEPARATOR (U+180E), NARROW NO-BREAK SPACE (U+202F), and MEDIUM MATHEMATICAL SPACE (U+205F).


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...