Internally, Java implements a String as an array of UTF-16 code units. But this is an implementation detail; it would be possible to implement a JVM that uses a different encoding internally.
(Note "encoding", "charset" and Charset are more or less synonyms.)
A String should be treated as a sequence of Unicode codepoints (even though in Java it's a sequence of UTF-16 code units).
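To make the code point versus code unit distinction concrete, here is a minimal sketch. The class and method names (CodePointDemo, codeUnits, codePoints) are made up for illustration; the String methods they call are standard.

```java
public class CodePointDemo {
    // Number of UTF-16 code units, which is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the same String.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1F600 lies outside the Basic Multilingual Plane,
        // so UTF-16 needs two code units (a surrogate pair) for it.
        String s = "a" + new String(Character.toChars(0x1F600));

        System.out.println(codeUnits(s));  // prints 3
        System.out.println(codePoints(s)); // prints 2

        // Iterate by code point rather than by char.
        s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```

For text that stays within the Basic Multilingual Plane (which includes the katakana used in this answer), the two counts are the same, which is why the difference is easy to overlook.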
If you have a String in your Java program, it is incorrect to say that it is a "UTF-8 String" or "String which is encoded in UTF-8". That does not make any sense, unless you're talking about the internal representation, which is irrelevant.
What you can have is a sequence of bytes that decodes to a String when you decode it using an encoding such as UTF-8 or Shift-JIS.
Or you can have a String that encodes to a sequence of bytes when you encode it using an encoding such as UTF-8 or Shift-JIS.
In short, an encoding or Charset is a pair of functions, "encode" and "decode", such that:
// String -> encode -> bytes
byte[] bytes = string.getBytes(encoding);
// or using Charset
ByteBuffer byteBuffer = charset.encode(string);
// bytes -> decode -> String
String string = new String(bytes, encoding);
// or using Charset
String string = charset.decode(byteBuffer).toString();
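As a rough illustration of encode and decode acting as a pair, the sketch below round-trips a String through two charsets. RoundTripDemo and roundTrip are hypothetical names; the String is written with \u escapes so the example does not depend on the source file's own encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class RoundTripDemo {
    // String -> encode -> bytes -> decode -> String, using the same charset both ways.
    static String roundTrip(String s, Charset charset) {
        byte[] bytes = s.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Four katakana characters: U+30D4 U+30FC U+30BF U+30FC
        String original = "\u30D4\u30FC\u30BF\u30FC";

        // UTF-8 can represent every code point, so the round trip is lossless.
        System.out.println(roundTrip(original, StandardCharsets.UTF_8).equals(original)); // prints true

        // US-ASCII cannot represent katakana; getBytes substitutes '?' for each character.
        System.out.println(roundTrip(original, StandardCharsets.US_ASCII)); // prints ????
    }
}
```

The second call shows that encoding can silently lose information when the target charset has no mapping for a character.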
So if you have a byte[] that's encoded using UTF-8:
byte[] utf8Bytes = "ピーター・ジョーズ".getBytes("UTF-8");
// utf8Bytes now contains, in hexadecimal
// e3 83 94 e3 83 bc e3 82 bf (ピ ー タ)
// e3 83 bc e3 83 bb e3 82 b8 (ー ・ ジ)
// e3 83 a7 e3 83 bc e3 82 ba (ョ ー ズ)
You can create a String from those bytes using:
String string = new String(utf8Bytes, "UTF-8");
// string now contains "ピーター・ジョーズ"
Then you can encode that String as Shift-JIS using:
byte[] shiftJisBytes = string.getBytes("Shift-JIS");
// shiftJisBytes now contains, in hexadecimal
// 83 73 81 5b 83 5e (ピ ー タ)
// 81 5b 81 45 83 57 (ー ・ ジ)
// 83 87 81 5b 83 59 (ョ ー ズ)
Since those bytes represent a string encoded using Shift-JIS, trying to decode them using UTF-8 will produce garbage:
String garbage = new String(shiftJisBytes, "UTF-8");
// garbage now contains "�s�[�^�[�E�W���[�Y"
// � (U+FFFD, the replacement character) is what the decoder emits for an invalid UTF-8 sequence
// 83 73 81 5b 83 5e (� s � [ � ^)
// 81 5b 81 45 83 57 (� [ � E � W)
// 83 87 81 5b 83 59 (� � � [ � Y)
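If you would rather get an error than silent garbage, one option is a CharsetDecoder configured to report malformed input. This is only a sketch: StrictDecodeDemo, decodeStrict and isValid are hypothetical names, and the byte array is just the first six Shift-JIS bytes from the example above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecodeDemo {
    // Decodes bytes, throwing instead of substituting U+FFFD on bad input.
    static String decodeStrict(byte[] bytes, Charset charset) throws CharacterCodingException {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    // Convenience wrapper: true if the bytes are a valid sequence in the given charset.
    static boolean isValid(byte[] bytes, Charset charset) {
        try {
            decodeStrict(bytes, charset);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // First six Shift-JIS bytes from the example: 83 73 81 5b 83 5e
        byte[] shiftJisBytes = {(byte) 0x83, 0x73, (byte) 0x81, 0x5b, (byte) 0x83, 0x5e};

        // The strict decoder rejects them as UTF-8.
        System.out.println(isValid(shiftJisBytes, StandardCharsets.UTF_8)); // prints false

        // The lenient String constructor replaces each bad byte with U+FFFD instead.
        String garbage = new String(shiftJisBytes, StandardCharsets.UTF_8);
        System.out.println(garbage.equals("\uFFFDs\uFFFD[\uFFFD^")); // prints true
    }
}
```

Note that a strict decoder can only tell you the bytes are not valid in the charset you tried; it cannot tell you which charset they were actually encoded in.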
Further, remember that if you print a String to an output, for example System.out, it is converted to bytes using the system default encoding, which varies from system to system. It looks like your system default is UTF-8.
System.out.print(string);
// roughly equivalent to:
System.out.write(string.getBytes(Charset.defaultCharset()));
Then, if your output is, for example, the Windows console, the console converts those bytes back to text using what is very probably a completely different encoding (likely CP437 or CP850) before presenting it to you.
This last part might be tripping you up.
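One way to take the default encoding out of the picture on the Java side is to wrap the output in a PrintStream with an explicit charset. The sketch below assumes UTF-8 is what you want to emit; ExplicitOutputDemo and utf8Stream are hypothetical names.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class ExplicitOutputDemo {
    // Wraps any OutputStream so that Strings are always encoded as UTF-8,
    // regardless of the platform default. Auto-flush is enabled.
    static PrintStream utf8Stream(OutputStream out) {
        try {
            return new PrintStream(out, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PrintStream out = utf8Stream(System.out);
        // Four katakana characters, written with escapes to avoid
        // depending on the source file's encoding.
        out.println("\u30D4\u30FC\u30BF\u30FC");
    }
}
```

This only controls the bytes Java writes. Whatever reads those bytes (the console, a file viewer) still has to decode them as UTF-8, so on a console set to a different code page the text may still display incorrectly.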