Use Apache's StringEscapeUtils.escapeHtml(String) or StringEscapeUtils.unescapeHtml(String). Both are found in the Apache Commons Lang library (in Commons Lang 3 the methods are named escapeHtml4 and unescapeHtml4).
If you need to preserve the HTML markup and only remove the ascii encoding (character entities such as &#39;), then you will have to construct a Map of the values you want to replace. It's an exercise in String manipulation, so it may be considered an 'ugly hack', but it will run quickly.
For example, in pseudocode:
1. Create a Map<String, String> and populate it, with each value you want to replace as the key and the text to replace it with as the value.
2. Find each HTML ascii code (character entity) in the document using a regular expression.
3. Look the ascii code up in your Map of replacements.
4. Replace the occurrence of the HTML ascii code with its text equivalent.
I will post some code over the weekend if I get a chance.
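Until then, here is a minimal sketch of the Map-and-regex steps described above, using only the standard library. The class name EntityReplacer, the method replaceEntities, the regular expression, and the four entries in the map are all assumptions made for illustration; populate the map with whatever codes you actually need. Entities that are not in the map (for example &amp;, &lt; and &gt;) are deliberately left alone so the markup survives.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EntityReplacer {
    // Hypothetical replacement table: HTML code -> text equivalent.
    private static final Map<String, String> REPLACEMENTS = new HashMap<>();
    static {
        REPLACEMENTS.put("&#39;", "'");
        REPLACEMENTS.put("&#34;", "\"");
        REPLACEMENTS.put("&quot;", "\"");
        REPLACEMENTS.put("&nbsp;", " ");
    }

    // Assumed pattern: matches numeric (&#39;) and named (&quot;) codes.
    private static final Pattern ENTITY = Pattern.compile("&#?\\w+;");

    static String replaceEntities(String html) {
        Matcher m = ENTITY.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String entity = m.group();
            // Codes missing from the map are copied through unchanged.
            String replacement = REPLACEMENTS.getOrDefault(entity, entity);
            // quoteReplacement stops '$' and '\' being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceEntities("<p>It&#39;s &quot;fine&quot; &amp; done</p>"));
    }
}
```

The document is scanned once, and each match costs a single hash lookup, which is why this approach stays quick even on large inputs.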