I am the author of the iText text extraction sub-system. What you need to do is develop your own text extraction strategy: if you look at how PdfTextExtractor.getTextFromPage is implemented, you will see that it accepts a pluggable strategy.
How you determine where columns start and stop is entirely up to you. This is a difficult problem: PDF doesn't have any concept of columns (heck, it doesn't even have a concept of words; just putting together the text extraction that the default strategy provides is quite tricky). If you know in advance where the columns are, then you can use a region filter on the text render listener callback (there is code in the iText library for doing this, and the latest version of the iText in Action book gives a detailed example).
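To make the region filter idea concrete, here is a minimal library-free sketch (Java 16+ for records). It is not the iText code itself: `Chunk`, `Region` and `filter` are made-up names that stand in for the text chunks a render listener receives and for the rectangle you would hand to the filter. In iText 5.x the classes to look at are RegionTextRenderFilter and FilteredTextRenderListener; check the javadoc of your version for the exact constructors.

```java
import java.util.ArrayList;
import java.util.List;

class RegionFilterSketch {
    /** Hypothetical stand-in for one chunk of text reported by a render callback. */
    record Chunk(String text, float x, float y) {}

    /** Axis-aligned region in page coordinates (origin at the lower left, as in PDF). */
    record Region(float left, float bottom, float right, float top) {
        boolean contains(Chunk c) {
            return c.x() >= left && c.x() <= right
                && c.y() >= bottom && c.y() <= top;
        }
    }

    /** Keeps only the chunks whose start point falls inside the region. */
    static List<Chunk> filter(List<Chunk> chunks, Region region) {
        List<Chunk> kept = new ArrayList<>();
        for (Chunk c : chunks) {
            if (region.contains(c)) {
                kept.add(c);
            }
        }
        return kept;
    }
}
```

You would run one such filter per known column rectangle and hand the surviving chunks to whatever strategy assembles them into text.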
If you need to obtain columns from arbitrary data, you've got some algorithm work ahead of you (if you get something working, I'd love to take a look). Some ideas on how to approach this:
- Use an algorithm similar to that used in the default text extraction strategy (LocationAware...) to obtain a list of words and X/Y locations (be sure to account for rotation angle as well)
- For each word, draw an imaginary line running the full height of the page. Scan for all other words that start at the same X position.
- While scanning, also look for words that intersect the X position (but do not start on the X position). This will give you potential column start/stop Y positions on the page.
- Once you have column X and Y, you can resort to a region filtered approach
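The scanning steps above could be sketched roughly as follows. This is only one possible reading of the idea, not something that exists in iText: `Word`, `Edge` and `findEdges` are hypothetical names, the words are assumed to be already extracted with their X/Y positions and widths, and the text is assumed to be unrotated. A run of words that start at the same X (within a tolerance) counts as a candidate column edge, and any word that crosses that X ends the run.

```java
import java.util.ArrayList;
import java.util.List;

class ColumnEdgeSketch {
    /** Hypothetical word box: left X, baseline Y, and width in page units. */
    record Word(String text, float x, float y, float width) {}

    /** Candidate column left edge, with the Y range it was observed over. */
    record Edge(float x, float yTop, float yBottom, int wordCount) {}

    static List<Edge> findEdges(List<Word> words, float tol, int minWords) {
        // Scan order: top of the page first (larger Y is higher in PDF space).
        List<Word> sorted = new ArrayList<>(words);
        sorted.sort((a, b) -> Float.compare(b.y(), a.y()));

        List<Edge> edges = new ArrayList<>();
        List<Float> probed = new ArrayList<>();
        for (Word probe : words) {
            float x = probe.x();
            // Skip X positions that were already used as a scan line.
            if (probed.stream().anyMatch(p -> Math.abs(p - x) <= tol)) {
                continue;
            }
            probed.add(x);

            int count = 0;
            float top = 0f;
            float bottom = 0f;
            for (Word w : sorted) {
                boolean starts = Math.abs(w.x() - x) <= tol;
                boolean crosses = w.x() < x - tol && w.x() + w.width() > x + tol;
                if (starts) {
                    if (count == 0) {
                        top = w.y();
                    }
                    bottom = w.y();
                    count++;
                } else if (crosses) {
                    // A word running across the line ends the current run.
                    if (count >= minWords) {
                        edges.add(new Edge(x, top, bottom, count));
                    }
                    count = 0;
                }
            }
            if (count >= minWords) {
                edges.add(new Edge(x, top, bottom, count));
            }
        }
        return edges;
    }
}
```

Each resulting edge gives a left X and a top/bottom Y; pairing neighbouring edges yields rectangles that can be fed to a region filter. The tolerance and the minimum run length are tuning knobs you will have to experiment with on real documents.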
Another approach that may be equally feasible would be to analyze draw operations and look for long horizontal and vertical lines (assuming the columns are demarcated in a table-like format). Right now, the iText content parser doesn't have callbacks for these operations, but it would be possible to add them without major difficulty.
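If such callbacks were available, the line analysis itself could look something like the sketch below. Everything here is an assumption for illustration: `Segment`, `Rulings` and `findRulings` are hypothetical names, and the segments are assumed to arrive already converted to page coordinates. The code simply keeps segments that are nearly axis-aligned and longer than a threshold.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class RulingLineSketch {
    /** Hypothetical straight line segment taken from a draw operation. */
    record Segment(float x1, float y1, float x2, float y2) {}

    /** X positions of long vertical lines and Y positions of long horizontal lines. */
    record Rulings(List<Float> verticalXs, List<Float> horizontalYs) {}

    static Rulings findRulings(List<Segment> segments, float tol, float minLength) {
        List<Float> xs = new ArrayList<>();
        List<Float> ys = new ArrayList<>();
        for (Segment s : segments) {
            float dx = Math.abs(s.x2() - s.x1());
            float dy = Math.abs(s.y2() - s.y1());
            if (dx <= tol && dy >= minLength) {
                xs.add((s.x1() + s.x2()) / 2f);   // long, nearly vertical
            } else if (dy <= tol && dx >= minLength) {
                ys.add((s.y1() + s.y2()) / 2f);   // long, nearly horizontal
            }
            // Short or diagonal segments are ignored.
        }
        Collections.sort(xs);
        Collections.sort(ys);
        return new Rulings(xs, ys);
    }
}
```

The sorted X positions are candidate column boundaries and the Y positions candidate row boundaries. Be aware that some PDF producers draw table rules as thin filled rectangles rather than stroked lines, so a real implementation would likely need to treat those as lines too.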
He who fights with dragons too long becomes a dragon himself; gaze too long into the abyss, and the abyss will gaze back…