In general
Basically, the OP's approach cannot work in general. His code is built upon two major misunderstandings:
He assumes that one can translate a complete content stream from byte[] to String (with all string parameters of text-showing operators being legible) using a single character encoding.
This assumption is wrong: Each font may have its own encoding, so if multiple fonts are used on the same page, the same byte value in string operands of different text showing operators may represent completely different characters. Actually the fonts do not even need to contain a mapping to characters, they merely need to map numeric values to glyph painting instructions.
Cf. section 9.4.3 Text-Showing Operators in ISO 32000-1:
A string operand of a text-showing operator shall be interpreted as a sequence of character codes identifying the glyphs to be painted.
With a simple font, each byte of the string shall be treated as a separate character code. The character code shall then be looked up in the font’s encoding to select the glyph, as described in 9.6.6, "Character Encoding".
With a composite font (PDF 1.2), multiple-byte codes may be used to select glyphs. In this instance, one or more consecutive bytes of the string shall be treated as a single character code. The code lengths and the mappings from codes to glyphs are defined in a data structure called a CMap,
Simple PDF generators often merely use standard encodings (which are ASCII-like and may give rise to assumptions like the OP's), but there are more and more non-simple PDF generators out there...
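To illustrate the point about per-font encodings, here is a minimal sketch in plain Java (no PDF library involved). Both code tables are invented for illustration; `FontEncodingDemo`, `asciiLikeFont`, `customFont`, and `decode` are hypothetical names, not iText API. The same operand bytes yield different text depending on which font's table is applied, which is why a single character encoding cannot be right for a whole content stream.

```java
import java.util.HashMap;
import java.util.Map;

class FontEncodingDemo {

    /** Code table of a hypothetical font F1 whose encoding happens to match ASCII. */
    static Map<Integer, String> asciiLikeFont() {
        Map<Integer, String> encoding = new HashMap<>();
        encoding.put(0x4E, "N");
        encoding.put(0x6F, "o");
        encoding.put(0x6D, "m");
        encoding.put(0x65, "e");
        return encoding;
    }

    /** Code table of a hypothetical font F2 with a custom encoding for the same codes. */
    static Map<Integer, String> customFont() {
        Map<Integer, String> encoding = new HashMap<>();
        encoding.put(0x4E, "T");
        encoding.put(0x6F, "e");
        encoding.put(0x6D, "x");
        encoding.put(0x65, "t");
        return encoding;
    }

    /** Treats each byte as one character code of a simple font and looks it up. */
    static String decode(byte[] codes, Map<Integer, String> encoding) {
        StringBuilder text = new StringBuilder();
        for (byte b : codes) {
            int code = b & 0xFF; // Java bytes are signed; character codes are 0..255
            text.append(encoding.getOrDefault(code, "\uFFFD")); // no mapping: U+FFFD
        }
        return text.toString();
    }

    public static void main(String[] args) {
        byte[] operand = {0x4E, 0x6F, 0x6D, 0x65}; // identical string operand bytes
        System.out.println(decode(operand, asciiLikeFont())); // prints "Nome"
        System.out.println(decode(operand, customFont()));    // prints "Text"
    }
}
```

In a real PDF the tables would come from each font's encoding (or CMap) rather than from hard-coded maps, and a font need not provide any mapping to characters at all.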
He assumes he can simply edit the string operands of text-showing operators and the matching glyphs will be shown in the PDF viewer.
This assumption is wrong: Fonts usually only support a fairly limited character set, and a text-showing operator uses only a single font, the currently selected one. If one replaces a code in a string argument of such an operator with a different one for which the font has no matching glyph, one will at best see a gap!
While complete fonts usually at least contain glyphs for all characters of a kind (e.g. Latin letters with all Western European variations thereof), PDF allows embedding fonts partially, cf. section 9.6.4 Font Subsets in ISO 32000-1:
PDF documents may include subsets of Type 1 and TrueType fonts.
This option is nowadays often used to embed painting instructions only for the glyphs actually used in the existing text. Thus, one cannot count on an embedded font containing all characters of a kind just because it contains some of them. There may be a glyph for A and C but not for B.
In the case at hand
Unfortunately, the OP has not supplied his sample PDF. The symptoms, though, namely that
- his call replace("Nome Completo", "A-B-C-D-E-F-G-H-I-J-K-L-M-N-O-P-Q-R-S-T-U-V-W-X-Y-Z") makes a difference, as can be seen in his screenshot and in his comment to Viacheslav Vedenin's answer ("Before the text was (Nome Completo)Tj and after (A-B-C-D-E-F-G-H-I-J-K-L-M-N-O-P-Q-R-S-T-U-V-W-X-Y-Z)Tj"),
- but some codes do not show as the expected glyphs, as can also be seen in the screenshot above,
point in the direction that the latter of his two major false assumptions described above makes the OP's code fail: most likely the font in question uses a standard encoding (probably WinAnsiEncoding) but is only partially embedded, in particular without the capital letters K, W, X, and Y.
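The suspected situation can be modelled with a small plain-Java sketch. The set of embedded glyphs below is purely an assumption mirroring the guess above (the actual PDF is not available), and `SubsetGlyphCheck`, `assumedEmbeddedGlyphs`, and `missingGlyphs` are hypothetical names, not iText API. It merely shows how one could check a replacement string against the glyphs a subset font is known to contain before writing it into a text-showing operator.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

class SubsetGlyphCheck {

    /** ASSUMPTION: glyphs present in the subset font; invented to match the symptoms. */
    static Set<Character> assumedEmbeddedGlyphs() {
        Set<Character> glyphs = new TreeSet<>();
        for (char c = 'A'; c <= 'Z'; c++) {
            glyphs.add(c);
        }
        for (char c = 'a'; c <= 'z'; c++) {
            glyphs.add(c);
        }
        glyphs.add(' ');
        glyphs.add('-');
        // The letters suspected to be missing from the partially embedded font
        glyphs.removeAll(Arrays.asList('K', 'W', 'X', 'Y'));
        return glyphs;
    }

    /** Returns, in sorted order, the characters of text that have no embedded glyph. */
    static Set<Character> missingGlyphs(String text, Set<Character> embedded) {
        Set<Character> missing = new TreeSet<>();
        for (char c : text.toCharArray()) {
            if (!embedded.contains(c)) {
                missing.add(c);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        String replacement = "A-B-C-D-E-F-G-H-I-J-K-L-M-N-O-P-Q-R-S-T-U-V-W-X-Y-Z";
        // With the assumed glyph set this prints [K, W, X, Y]
        System.out.println(missingGlyphs(replacement, assumedEmbeddedGlyphs()));
    }
}
```

With a real document, the set of available glyphs would have to be derived from the embedded font program itself, which is considerably more involved than this sketch suggests.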
How to do it correctly
Instead of blindly editing the content stream, the OP (who already is using iText) can use the following iText concepts:
- text extraction classes can also be used to extract the coordinates of text, in particular the bounding rectangle of the text he wants to replace (cf. multiple answers on Stack Overflow);
- the iText xtra library class PdfCleanUpProcessor can be used to remove all content existing in that bounding rectangle;
- PdfStamper.getOverContent() can then be used to properly add new content at those coordinates.
This may sound complicated, but it takes care of a number of additional minor misconceptions visible in the OP's approach.