java - Replace string in PDF file using Itext but letter X not replace

Question

asked Oct 24, 2021 in Technique by 深蓝 (71.8m points)

I'm trying to replace a piece of text in the content of a PDF, but the letter 'X' is not being replaced.

public static void main(String[] args) {

    String DEST = "/home/diego/Documentos/teste.pdf";

    try {
        PdfReader reader = new PdfReader("termoAdesaoCartao.pdf");
        PdfDictionary dictionary = reader.getPageN(1);
        PdfObject object = dictionary.getDirectObject(PdfName.CONTENTS);
        if (object instanceof PRStream) {
            PRStream stream = (PRStream)object;
            byte[] data = PdfReader.getStreamBytes(stream);
            stream.setData(new String(data).replace("Nome Completo", "A-B-C-D-E-F-G-H-I-J-K-L-M-N-O-P-Q-R-S-T-U-V-W-X-Y-Z").getBytes());
        }
        PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(DEST));
        stamper.close();
        reader.close();
    } catch (IOException | DocumentException e) {
        e.printStackTrace();
    }

}

1 Answer

深蓝 · Answer 1 · 2021-10-23T17:56:00+0000

In general

The OP's approach cannot work in general. His code is built on two major misunderstandings:

He assumes that one can convert a complete content stream from byte[] to String using a single character encoding, with all string operands of text-showing operators coming out legible.

This assumption is wrong: each font may have its own encoding, so if multiple fonts are used on the same page, the same byte value in string operands of different text-showing operators may represent completely different characters. In fact, fonts do not even need to contain a mapping to characters; they merely need to map numeric values to glyph painting instructions.

Cf. section 9.4.3 Text-Showing Operators in ISO 32000-1:

A string operand of a text-showing operator shall be interpreted as a sequence of character codes identifying the glyphs to be painted.

With a simple font, each byte of the string shall be treated as a separate character code. The character code shall then be looked up in the font’s encoding to select the glyph, as described in 9.6.6, "Character Encoding".

With a composite font (PDF 1.2), multiple-byte codes may be used to select glyphs. In this instance, one or more consecutive bytes of the string shall be treated as a single character code. The code lengths and the mappings from codes to glyphs are defined in a data structure called a CMap,

Simple PDF generators often merely use standard encodings (which are ASCII-like and may give rise to assumptions like the OP's), but there are more and more non-simple PDF generators out there...

He assumes he can simply edit the string operands of text-showing operators and the matching glyphs will be shown in the PDF viewer.

This assumption is wrong: fonts usually support only a fairly limited character set, and a text-showing operator uses only a single font, the currently selected one. If one replaces a code in a string operand of such an operator with a different code that has no matching glyph in the font, one will at best see a gap!

While complete fonts usually at least contain glyphs for all characters of a kind (e.g. Latin letters with all Western European variations thereof), PDF allows embedding fonts partially, cf. section 9.6.4 Font Subsets in ISO 32000-1:

PDF documents may include subsets of Type 1 and TrueType fonts.

This option is now often used to embed painting instructions only for the glyphs actually used in the existing text. Thus, one cannot count on an embedded font containing all characters of a kind just because it contains some of them. There may be a glyph for A and C but not for B.
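The first misunderstanding can be illustrated without any PDF library. The following is only a toy model, not iText code: the class FontEncodingDemo and both encodings are hypothetical, and the custom encoding simply numbers the glyphs in order of first use, which is one thing a non-simple generator might do. It shows that the same operand bytes are legible under the font's own encoding only.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Toy model of two simple fonts with different encodings (hypothetical, not iText code). */
class FontEncodingDemo {

    /** Hypothetical ASCII-like encoding: codes 0x20..0x7E map to the same-numbered character. */
    static Map<Integer, Character> asciiLike() {
        Map<Integer, Character> encoding = new HashMap<>();
        for (int code = 0x20; code <= 0x7E; code++) {
            encoding.put(code, (char) code);
        }
        return encoding;
    }

    /** Hypothetical custom encoding: codes 1, 2, 3, ... assigned in order of first use. */
    static Map<Integer, Character> customOrder(String usedText) {
        Map<Integer, Character> encoding = new HashMap<>();
        int next = 1;
        for (char c : usedText.toCharArray()) {
            if (!encoding.containsValue(c)) {
                encoding.put(next++, c);
            }
        }
        return encoding;
    }

    /** Decodes a string operand one byte per character code, as for a simple font. */
    static String decode(byte[] operand, Map<Integer, Character> encoding) {
        StringBuilder text = new StringBuilder();
        for (byte b : operand) {
            Character c = encoding.get(b & 0xFF);
            text.append(c != null ? c : '?'); // '?' marks a code this encoding does not map
        }
        return text.toString();
    }

    public static void main(String[] args) {
        byte[] operand = {1, 2, 3, 4}; // what a Tj operand for "Nome" might contain
        System.out.println(decode(operand, customOrder("Nome"))); // the font's own encoding
        System.out.println(decode(operand, asciiLike()));         // an ASCII-like guess
        // A plain string search over the raw bytes cannot find the text either:
        String raw = new String(operand, StandardCharsets.ISO_8859_1);
        System.out.println(raw.contains("Nome"));
    }
}
```

With such a font, a String.replace over the decoded content stream would simply never match "Nome Completo".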

In the case at hand

Unfortunately the OP has not supplied his sample PDF. The symptoms, though:

his call replace("Nome Completo", "A-B-C-D-E-F-G-H-I-J-K-L-M-N-O-P-Q-R-S-T-U-V-W-X-Y-Z") makes a difference, as can be seen in his screenshot and in his comment to Viacheslav Vedenin's answer:

Before the text was (Nome Completo)Tj and after (A-B-C-D-E-F-G-H-I-J-K-L-M-N-O-P-Q-R-S-T-U-V-W-X-Y-Z)Tj

but some codes do not show as the expected glyphs, as can also be seen in the screenshot

point in the direction that the latter of the two major false assumptions described above makes the OP's code fail: most likely the font in question uses a standard encoding (probably WinAnsiEncoding) but is only partially embedded, in particular without the capital letters K, W, X, and Y.
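A rough way to picture this diagnosis is the sketch below. It is not iText code, and the glyph set is an assumption that mirrors the guess above: a hypothetical subset font holding the capital letters and the hyphen, except K, W, X, and Y. The class SubsetFontDemo and its methods are made up for illustration.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Toy model of a partially embedded font (hypothetical, not iText code). */
class SubsetFontDemo {

    /** Glyph set of an assumed subset font: A-Z and '-', minus the characters in 'absent'. */
    static Set<Character> subsetWithout(String absent) {
        Set<Character> glyphs = new LinkedHashSet<>();
        for (char c = 'A'; c <= 'Z'; c++) {
            if (absent.indexOf(c) < 0) {
                glyphs.add(c);
            }
        }
        glyphs.add('-');
        return glyphs;
    }

    /** Distinct characters of 'text' that have no glyph, in order of first appearance. */
    static String missingGlyphs(String text, Set<Character> glyphs) {
        StringBuilder missing = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (!glyphs.contains(c) && missing.indexOf(String.valueOf(c)) < 0) {
                missing.append(c);
            }
        }
        return missing.toString();
    }

    /** Simulates what a viewer might show: a gap wherever the font has no glyph. */
    static String render(String text, Set<Character> glyphs) {
        StringBuilder shown = new StringBuilder();
        for (char c : text.toCharArray()) {
            shown.append(glyphs.contains(c) ? c : ' ');
        }
        return shown.toString();
    }

    public static void main(String[] args) {
        Set<Character> glyphs = subsetWithout("KWXY"); // assumption taken from the guess above
        String replacement = "A-B-C-D-E-F-G-H-I-J-K-L-M-N-O-P-Q-R-S-T-U-V-W-X-Y-Z";
        System.out.println("No glyph for: " + missingGlyphs(replacement, glyphs));
        System.out.println(render(replacement, glyphs));
    }
}
```

The string operand in the content stream is replaced completely in this model too; only the display has gaps, which matches the reported symptom.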

How to do it correctly

Instead of blindly editing the content stream, the OP (who already is using iText) can use the following iText concepts:

the text extraction classes can also be used to extract the coordinates of text (cf. multiple answers on Stack Overflow), in particular the bounding rectangle of the text he wants to replace;
the iText xtra library class PdfCleanUpProcessor can be used to remove all content existing in that bounding rectangle;
PdfStamper.getOverContent() can then be used to properly add new content at those coordinates.

This may sound complicated, but it takes care of a number of additional minor misconceptions visible in the OP's approach.
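The first of these steps, finding the bounding rectangle, can be sketched independently of iText. The sketch assumes that text extraction has already produced a list of text chunks with their rectangles; the Chunk class, the locate method, and all coordinates are hypothetical. Because a phrase like "Nome Completo" may be drawn in several pieces, the sketch merges the rectangles of consecutive chunks until their combined text equals the phrase.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy sketch of locating a phrase's bounding rectangle (hypothetical, not iText code). */
class BoundingBoxDemo {

    /** Hypothetical extracted chunk: its text plus its rectangle in page coordinates. */
    static final class Chunk {
        final String text;
        final float llx, lly, urx, ury;

        Chunk(String text, float llx, float lly, float urx, float ury) {
            this.text = text;
            this.llx = llx;
            this.lly = lly;
            this.urx = urx;
            this.ury = ury;
        }
    }

    /**
     * Returns {llx, lly, urx, ury} enclosing the consecutive chunks whose joined text
     * equals 'target', or null if there is none. Simplification: the phrase must start
     * and end on chunk boundaries.
     */
    static float[] locate(List<Chunk> chunks, String target) {
        for (int start = 0; start < chunks.size(); start++) {
            StringBuilder seen = new StringBuilder();
            float llx = Float.MAX_VALUE, lly = Float.MAX_VALUE;
            float urx = -Float.MAX_VALUE, ury = -Float.MAX_VALUE;
            for (int i = start; i < chunks.size(); i++) {
                Chunk c = chunks.get(i);
                seen.append(c.text);
                if (!target.startsWith(seen.toString())) {
                    break; // this run of chunks cannot spell the phrase
                }
                llx = Math.min(llx, c.llx);
                lly = Math.min(lly, c.lly);
                urx = Math.max(urx, c.urx);
                ury = Math.max(ury, c.ury);
                if (seen.length() == target.length()) {
                    return new float[] {llx, lly, urx, ury};
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<Chunk> chunks = new ArrayList<>();
        chunks.add(new Chunk("CPF", 72f, 730f, 95f, 742f));       // made-up coordinates
        chunks.add(new Chunk("Nome ", 72f, 700f, 105f, 712f));
        chunks.add(new Chunk("Completo", 105f, 700f, 160f, 712f));
        float[] box = locate(chunks, "Nome Completo");
        System.out.println(box[0] + " " + box[1] + " " + box[2] + " " + box[3]);
    }
}
```

The resulting rectangle is what the clean-up step would clear, and its position is where the new content would be drawn via PdfStamper.getOverContent().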
