When I had a problem like this, I used a Perl script to make sure the data was converted to valid UTF-8, using code like this:
use Encode;
binmode(STDOUT, ":utf8");
while (<>) {
    print Encode::decode('UTF-8', $_);
}
This script takes (possibly corrupted) UTF-8 on stdin and re-prints valid UTF-8 to stdout. Invalid byte sequences are replaced with � (U+FFFD, the Unicode replacement character).
If you run this script on valid UTF-8 input, the output should be identical to the input.
If your data lives in a database, it makes sense to use DBI to scan your table(s) and scrub everything with this approach, so that all stored text is valid UTF-8.
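Here is a minimal sketch of that idea, assuming a hypothetical table my_table with an integer primary key id and a text column my_col; the DSN and credentials are placeholders you would replace with your own:

use DBI;
use Encode;

# Hypothetical connection; replace the DSN, user and password with yours.
# Depending on the driver you may also need a UTF-8 option
# (e.g. mysql_enable_utf8 for DBD::mysql).
my $dbh = DBI->connect('dbi:mysql:mydb', 'user', 'password',
                       { RaiseError => 1 });

# Read every row, force the value through the UTF-8 decoder
# (invalid sequences become U+FFFD) and write it back.
my $rows = $dbh->selectall_arrayref('SELECT id, my_col FROM my_table');
my $upd  = $dbh->prepare('UPDATE my_table SET my_col = ? WHERE id = ?');
for my $row (@$rows) {
    my ($id, $val) = @$row;
    $upd->execute(Encode::decode('UTF-8', $val), $id);
}
$dbh->disconnect;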
Here is a one-liner version of the same script:
perl -MEncode -e "binmode STDOUT,':utf8';while(<>){print Encode::decode 'UTF-8',$_}" < bad.txt > good.txt
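To check that the script preserves valid input, you can run the one-liner over a file that is known to be good UTF-8 (good.txt here is a hypothetical file) and compare the result with the original; cmp stays silent when the two are identical:

perl -MEncode -e "binmode STDOUT,':utf8';while(<>){print Encode::decode 'UTF-8',$_}" < good.txt | cmp - good.txt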
EDIT: Added Java-only solution.
Here is an example of how to do this in Java:
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
public class UtfFix {
    public static void main(String[] args) throws CharacterCodingException {
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
        decoder.onMalformedInput(CodingErrorAction.REPLACE);
        decoder.onUnmappableCharacter(CodingErrorAction.REPLACE);

        ByteBuffer bb = ByteBuffer.wrap(new byte[] {
            (byte) 0xD0, (byte) 0x9F, // 'П'
            (byte) 0xD1, (byte) 0x80, // 'р'
            (byte) 0xD0,              // corrupted UTF-8, was 'и'
            (byte) 0xD0, (byte) 0xB2, // 'в'
            (byte) 0xD0, (byte) 0xB5, // 'е'
            (byte) 0xD1, (byte) 0x82  // 'т'
        });

        CharBuffer parsed = decoder.decode(bb);
        System.out.println(parsed); // prints: Пр�вет (the lone 0xD0 became U+FFFD)
    }
}