java - Implement a function to check if a string/byte array follows utf-8 format

Question

Welcome To Ask or Share your Answers For Others

java - Implement a function to check if a string/byte array follows utf-8 format

asked Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

java - Implement a function to check if a string/byte array follows utf-8 format

I am trying to solve this interview question.

After given clearly definition of UTF-8 format. ex: 1-byte : 0b0xxxxxxx 2- bytes:.... Asked to write a function to validate whether the input is valid UTF-8. Input will be string/byte array, output should be yes/no.

I have two possible approaches.

First, if the input is a string, since UTF-8 is at most 4 byte, after we remove the first two characters "0b", we can use Integer.parseInt(s) to check if the rest of the string is at the range 0 to 10FFFF. Moreover, it is better to check if the length of the string is a multiple of 8 and if the input string contains all 0s and 1s first. So I will have to go through the string twice and the complexity will be O(n).

Second, if the input is a byte array (we can also use this method if the input is a string), we check if each 1-byte element is in the correct range. If the input is a string, first check the length of the string is a multiple of 8 then check each 8-character substring is in the range.

I know there are couple solutions on how to check a string using Java libraries, but my question is how I should implement the function based on the question.

Thanks a lot.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Answer

深蓝 · Answer 1 · 2021-10-23T19:16:33+0000

Let's first have a look at a visual representation of the UTF-8 design.

enter image description here

Now let's resume what we have to do.

Loop over all character of the string (each character being a byte).
We will need to apply a mask to each byte depending on the codepoint as the x characters represent the actual codepoint. We will use the binary AND operator (&) which copy a bit to the result if it exists in both operands.
The goal of applying a mask is to remove the trailing bits so we compare the actual byte as the first code point. We will do the bitwise operation using 0b1xxxxxxx where 1 will appear "Bytes in sequence" time, and other bits will be 0.
We can then compare with the first byte to verify if it is valid, and also determinate what is the actual byte.
If the character entered in none of the case, it means the byte is invalid and we return "No".
If we can get out of the loop, that means each of the character are valid, hence the string is valid.
Make sure the comparison that returned true correspond to the expected length.

The method would look like this :

public static final boolean isUTF8(final byte[] pText) {

    int expectedLength = 0;

    for (int i = 0; i < pText.length; i++) {
        if ((pText[i] & 0b10000000) == 0b00000000) {
            expectedLength = 1;
        } else if ((pText[i] & 0b11100000) == 0b11000000) {
            expectedLength = 2;
        } else if ((pText[i] & 0b11110000) == 0b11100000) {
            expectedLength = 3;
        } else if ((pText[i] & 0b11111000) == 0b11110000) {
            expectedLength = 4;
        } else if ((pText[i] & 0b11111100) == 0b11111000) {
            expectedLength = 5;
        } else if ((pText[i] & 0b11111110) == 0b11111100) {
            expectedLength = 6;
        } else {
            return false;
        }

        while (--expectedLength > 0) {
            if (++i >= pText.length) {
                return false;
            }
            if ((pText[i] & 0b11000000) != 0b10000000) {
                return false;
            }
        }
    }

    return true;
}

Edit : The actual method is not the original one (almost, but not) and is stolen from here. The original one was not properly working as per @EJP comment.

Categories

java - Implement a function to check if a string/byte array follows utf-8 format

java - Implement a function to check if a string/byte array follows utf-8 format

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags