Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
241 views
in Technique[技术] by (71.8m points)

java - JSoup character encoding issue

I am using JSoup to parse content from http://www.latijnengrieks.com/vertaling.php?id=5368 . this is a third party website and does not specify proper encoding. i am using the following code to load the data:

public class Loader {

    public static void main(String[] args){
        String url = "http://www.latijnengrieks.com/vertaling.php?id=5368";

        Document doc;
        try {

            doc = Jsoup.connect(url).timeout(5000).get();
            Element content = doc.select("div.kader").first();
            Element contenttableElement = content.getElementsByClass("kopje").first().parent().parent();

            String contenttext = content.html();
            String tabletext = contenttableElement.html();

            contenttext = Jsoup.parse(contenttext).text();
            contenttext = contenttext.replace("br2n", "
");
            tabletext = Jsoup.parse(tabletext.replaceAll("(?i)<br[^>]*>", "br2n")).text();
            tabletext = tabletext.replace("br2n", "
");

            String text = contenttext.substring(tabletext.length(), contenttext.length());
            System.out.println(text);


        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }


    }    

}

this gives the following output:

Aeneas dwaalt rond in Troje en zoekt Cre?sa. Cre?sa is echter op de vlucht gestorven Plotseling verschijnt er een schim. Het is de schim van Cre?sa. De schim zegt:'De oorlog woedt!' Troje is ingenomen! Cre?sa is gestorven:'Vlucht!' Aeneas vlucht echter niet. Dan spreekt de schim:'Vlucht! Er staat jou een nieuw vaderland en een nieuw koninkrijk te wachten.' Dan pas gehoorzaamt Aeneas en vlucht.

is there any way the ? marks can be the original (ü) again in the output?

Question&Answers:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

The charset attribute is missing in HTTP response Content-Type header. Jsoup will resort to platform default charset when parsing the HTML. The Document.OutputSettings#charset() won't work as it's used for presentation only (on html() and text()), not for parsing the data (in other words, it's too late already).

You need to read the URL as InputStream and manually specify the charset in Jsoup#parse() method.

String url = "http://www.latijnengrieks.com/vertaling.php?id=5368";
Document document = Jsoup.parse(new URL(url).openStream(), "ISO-8859-1", url);
Element paragraph = document.select("div.kader p").first();

for (Node node : paragraph.childNodes()) {
    if (node instanceof TextNode) {
        System.out.println(((TextNode) node).text().trim());
    }
}

this results here in

Aeneas dwaalt rond in Troje en zoekt Creüsa.
Creüsa is echter op de vlucht gestorven
Plotseling verschijnt er een schim.
Het is de schim van Creüsa.
De schim zegt:'De oorlog woedt!'
Troje is ingenomen!
Creüsa is gestorven:'Vlucht!'
Aeneas vlucht echter niet.
Dan spreekt de schim:'Vlucht! Er staat jou een nieuw vaderland en een nieuw koninkrijk te wachten.'
Dan pas gehoorzaamt Aeneas en vlucht.

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

2.1m questions

2.1m answers

60 comments

57.0k users

...