The logic of encoding Unicode in UTF-8 is basically:
- Up to 4 bytes per character can be used. The fewest bytes possible are used.
- Characters up to U+007F are encoded with a single byte.
- For multibyte sequences, the number of leading 1 bits in the first byte gives the number of bytes for the character. The rest of the bits of the first byte can be used to encode bits of the character.
- The continuation bytes begin with the bits 10, and their other 6 bits encode further bits of the character (a worked example follows this list).
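For example, U+00E9 ("é") is 1110 1001 in binary, which doesn't fit in 7 bits, so it takes the two-byte form 110xxxxx 10xxxxxx: the 11 payload bits 000 1110 1001 split into 00011 for the first byte (110 00011 = 0xC3) and 101001 for the continuation byte (10 101001 = 0xA9), so "é" is encoded as the byte sequence 0xC3 0xA9.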
Here's a function I wrote a while back for encoding a JavaScript UTF-16 string in UTF-8:
function toUTF8Array(str) {
    var utf8 = [];
    for (var i = 0; i < str.length; i++) {
        var charcode = str.charCodeAt(i);
        // one byte: 0xxxxxxx
        if (charcode < 0x80) utf8.push(charcode);
        // two bytes: 110xxxxx 10xxxxxx
        else if (charcode < 0x800) {
            utf8.push(0xc0 | (charcode >> 6),
                      0x80 | (charcode & 0x3f));
        }
        // three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        else if (charcode < 0xd800 || charcode >= 0xe000) {
            utf8.push(0xe0 | (charcode >> 12),
                      0x80 | ((charcode >> 6) & 0x3f),
                      0x80 | (charcode & 0x3f));
        }
        // surrogate pair: four bytes 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        else {
            i++;
            // UTF-16 encodes 0x10000-0x10FFFF by
            // subtracting 0x10000 and splitting the
            // 20 bits of 0x0-0xFFFFF into two halves
            // (assumes the input contains well-formed surrogate pairs)
            charcode = 0x10000 + (((charcode & 0x3ff) << 10)
                       | (str.charCodeAt(i) & 0x3ff));
            utf8.push(0xf0 | (charcode >> 18),
                      0x80 | ((charcode >> 12) & 0x3f),
                      0x80 | ((charcode >> 6) & 0x3f),
                      0x80 | (charcode & 0x3f));
        }
    }
    return utf8;
}
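As a quick sanity check (assuming an environment that provides the standard TextEncoder API, such as a modern browser or Node.js), the result can be compared byte for byte with the built-in UTF-8 encoder. The sample string below is just an illustration that exercises the 1-, 2-, 3-, and 4-byte branches:

var sample = "aé€😀";  // 1-, 2-, 3-, and 4-byte (surrogate pair) code points
console.log(toUTF8Array(sample));
// [97, 195, 169, 226, 130, 172, 240, 159, 152, 128]
console.log(Array.from(new TextEncoder().encode(sample)));
// same bytes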