java - String encoding conversion UTF-8 to SHIFT-JIS

Question

asked Oct 24, 2021 in Technique by 深蓝 (71.8m points)

Variables used:

JavaSE-6
No frameworks

Given the string input ピーター・ジョーンズ, which is encoded in UTF-8, I am having trouble converting it to Shift-JIS without writing the data to a file.

Input (UTF-8 encoding): ピーター・ジョーンズ
Output (SHIFT-JIS encoding): ピーター・ジョーンズ (to be encoded as SHIFT-JIS)

I've tried these code snippets for converting UTF-8 strings to SHIFT-JIS:

stringToEncode.getBytes(Charset.forName("SHIFT-JIS"))
new String(unecodedString.getBytes("SHIFT-JIS"), "UTF-8")

Both code snippets return this string output: �s�[�^�[�E�W���[���Y (SHIFT-JIS encoded)

Any ideas on how this can be resolved?

1 Answer

深蓝 · Answer 1 · 2021-10-23T21:34:00+0000

Internally, Java implements a String as an array of UTF-16 code units. This is an implementation detail, though: it would be possible to implement a JVM that uses a different encoding internally.

(Note "encoding", "charset" and Charset are more or less synonyms.)

A String should be treated as a sequence of Unicode code points (even though in Java it's a sequence of UTF-16 code units).
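To make the code point versus code unit distinction concrete, here is a minimal sketch. The class and method names are made up for illustration; only String.length() and String.codePointCount() are real API.

```java
class CodePointsVsCodeUnits {
    // Number of UTF-16 code units (Java chars) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String katakana = "\u30D4";       // one BMP character: 1 code unit, 1 code point
        String emoji = "\uD83D\uDE00";    // U+1F600 as a surrogate pair: 2 code units, 1 code point
        System.out.println(codeUnits(katakana) + " / " + codePoints(katakana)); // 1 / 1
        System.out.println(codeUnits(emoji) + " / " + codePoints(emoji));       // 2 / 1
    }
}
```

Neither number says anything about UTF-8 or Shift-JIS; those only come into play once the String is turned into bytes.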

If you have a String in your Java program, it is incorrect to say that it is a "UTF-8 String" or "String which is encoded in UTF-8". That does not make any sense, unless you're talking about the internal representation, which is irrelevant.

What you can have is a sequence of bytes that decode to a String if you decode it using an encoding, such as UTF-8 or Shift-JIS.

Or you can have a String that encodes to a sequence of bytes if you encode it using an encoding, such as UTF-8 or Shift-JIS.

In short, an encoding (or Charset) is a pair of functions, "encode" and "decode", such that:

// String -> encode -> bytes
byte[] bytes = string.getBytes(encoding);
// or using Charset
ByteBuffer byteBuffer = charset.encode(string);

// bytes -> decode -> String
String decoded = new String(bytes, encoding);
// or using Charset
String decodedViaCharset = charset.decode(byteBuffer).toString();
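The fragments above leave their variables undeclared, so here is a self-contained sketch of the same round trip. The class and helper names are invented for this example, and it uses Charset.forName rather than StandardCharsets so that it should still compile on Java 6.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

class EncodeDecodeRoundTrip {
    // String -> bytes, using String.getBytes(Charset).
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    // bytes -> String, using the String(byte[], Charset) constructor.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    // The same round trip through Charset.encode and Charset.decode.
    static String roundTripWithCharsetApi(String text, Charset charset) {
        ByteBuffer buffer = charset.encode(text);
        return charset.decode(buffer).toString();
    }

    public static void main(String[] args) {
        Charset utf8 = Charset.forName("UTF-8");
        // ピーター written as escapes, so the source file's own encoding does not matter.
        String text = "\u30D4\u30FC\u30BF\u30FC";
        byte[] bytes = encode(text, utf8);
        System.out.println(bytes.length);                                     // 12 (4 characters, 3 bytes each)
        System.out.println(decode(bytes, utf8).equals(text));                 // true
        System.out.println(roundTripWithCharsetApi(text, utf8).equals(text)); // true
    }
}
```

The round trip only returns the original text when the same charset is used in both directions.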

So if you have a byte[] that's encoded using UTF-8:

byte[] utf8Bytes = "ピーター・ジョーズ".getBytes("UTF-8");
// utf8Bytes now contains, in hexadecimal
// e3 83 94  e3 83 bc  e3 82 bf   (ピ ー タ)
// e3 83 bc  e3 83 bb  e3 82 b8   (ー ・ ジ)
// e3 83 a7  e3 83 bc  e3 82 ba   (ョ ー ズ)

You can create a String from those bytes using:

String string = new String(utf8Bytes, "UTF-8");
// string now contains "ピーター・ジョーズ"

Then you can encode that String as Shift-JIS using:

byte[] shiftJisBytes = string.getBytes("Shift-JIS");
// shiftJisBytes now contains, in hexadecimal
// 83 73  81 5b  83 5e   (ピ ー タ)
// 81 5b  81 45  83 57   (ー ・ ジ)
// 83 87  81 5b  83 59   (ョ ー ズ)
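Putting the two steps together, a conversion from UTF-8 bytes to Shift-JIS bytes without touching a file could look like the sketch below. The class and method names are hypothetical, and it assumes your JRE ships the Shift_JIS charset (it is part of the extended charsets, so a stripped-down runtime might not have it).

```java
import java.nio.charset.Charset;

class Utf8ToShiftJis {
    static final Charset UTF_8 = Charset.forName("UTF-8");
    // Assumption: the runtime provides Shift_JIS; Charset.forName throws otherwise.
    static final Charset SHIFT_JIS = Charset.forName("Shift_JIS");

    // Decode the UTF-8 bytes to a String, then encode that String as Shift-JIS.
    static byte[] transcode(byte[] utf8Bytes) {
        String text = new String(utf8Bytes, UTF_8);
        return text.getBytes(SHIFT_JIS);
    }

    // Render bytes as space-separated lowercase hex for inspection.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // ピーター・ジョーズ as escapes, so the source file encoding does not matter.
        String sample = "\u30D4\u30FC\u30BF\u30FC\u30FB\u30B8\u30E7\u30FC\u30BA";
        byte[] utf8Bytes = sample.getBytes(UTF_8);
        // Prints the same 18 bytes as the hex listing above.
        System.out.println(toHex(transcode(utf8Bytes)));
    }
}
```

Inspecting the result as hex rather than printing it as text avoids any confusion caused by the console's own encoding.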

Since those bytes represent a string encoded using Shift-JIS, trying to decode using UTF-8 will produce garbage:

String garbage = new String(shiftJisBytes, "UTF-8");
// garbage now contains "�s�[�^�[�E�W��[�Y"
// � (U+FFFD, the replacement character) is what the decoder emits for each invalid UTF-8 byte
// 83 73 81 5b 83 5e   (� s � [ � ^)
// 81 5b 81 45 83 57   (� [ � E � W)
// 83 87 81 5b 83 59   (� � � [ � Y)
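If you want to check this without relying on what the console shows, you can count the U+FFFD replacement characters directly. This is only a sketch: the class and helper names are made up, and the byte values are copied from the Shift-JIS listing earlier in this answer.

```java
import java.nio.charset.Charset;

class MismatchedDecodeDemo {
    // Shift-JIS bytes for the nine-character sample, written as ints for readability.
    static final int[] SHIFT_JIS_SAMPLE = {
        0x83, 0x73, 0x81, 0x5b, 0x83, 0x5e,
        0x81, 0x5b, 0x81, 0x45, 0x83, 0x57,
        0x83, 0x87, 0x81, 0x5b, 0x83, 0x59
    };

    static byte[] sampleBytes() {
        byte[] out = new byte[SHIFT_JIS_SAMPLE.length];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) SHIFT_JIS_SAMPLE[i];
        }
        return out;
    }

    // Decode with whatever charset the caller names, right or wrong.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    // Count U+FFFD characters, which the decoder substitutes for malformed input.
    static int countReplacementChars(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == '\uFFFD') {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String wrong = decode(sampleBytes(), "UTF-8");
        String right = decode(sampleBytes(), "Shift_JIS"); // assumes Shift_JIS is available
        System.out.println(countReplacementChars(wrong)); // 10
        System.out.println(countReplacementChars(right)); // 0
    }
}
```

A non-zero count after decoding is a strong hint that the bytes were decoded with the wrong charset.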

Further, remember that printing a string to an output such as System.out converts the String to bytes using the default encoding, which is system dependent. It looks like your system default is UTF-8.

System.out.print(string);
// equivalent to:
System.out.write(string.getBytes(Charset.defaultCharset()));

If the output is, for example, the Windows console, the console then converts those bytes back to text using what is very probably a completely different encoding (likely CP437 or CP850) before presenting it to you.

This last part might be tripping you up.
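One way to see how much the printing step matters is to give a PrintStream an explicit encoding and capture what it writes. The sketch below does that with an in-memory stream; the class and method names are hypothetical, while the PrintStream(OutputStream, boolean, String) constructor is standard. The Shift_JIS call again assumes that charset is installed.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class PrintEncodingDemo {
    // Print text through a PrintStream that uses the named encoding and return the raw bytes.
    static byte[] printWithEncoding(String text, String charsetName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            PrintStream out = new PrintStream(sink, true, charsetName);
            out.print(text);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            // Wrapped so callers do not have to handle a checked exception.
            throw new IllegalArgumentException("Unsupported encoding: " + charsetName, e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String text = "\u30D4"; // ピ
        System.out.println(printWithEncoding(text, "UTF-8").length);     // 3
        System.out.println(printWithEncoding(text, "Shift_JIS").length); // 2
    }
}
```

The same String produces different bytes depending on the stream's encoding, and whatever reads those bytes (a console, a file viewer) has to assume the same encoding to show the text correctly.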
