I'm not too familiar with PyPDF, but I know Ghostscript will be able to do this for you. Here are links to some other answers on similar questions:
- Convert PDF 2 sides per page to 1 side per page (SuperUser.com)
- Freeware to split a pdf's pages down the middle? (SuperUser.com)
- Cropping a PDF using Ghostscript 9.01 (StackOverflow.com)
The third answer is probably what made you say 'I understand it will not work with every PDF file'. It uses the pdfmark command to try to set the /CropBox in the PDF page objects.
The method of the first two answers will most likely succeed where the third one fails. It uses the PostScript snippet <</PageOffset [NNN MMM]>> setpagedevice to shift and place the PDF pages on a (smaller) media size defined by the -gNNNNxMMMM parameter (which defines device width and height in pixels).
If you understand the concept behind the first two answers, you'll easily be able to adapt the method used there to crop margins on all 4 edges of a PDF page.
Here is an example command to crop a letter-sized PDF (8.5x11in == 612x792pt) by half an inch (== 36pt) on each of the 4 edges (this command is for Windows):
gswin32c.exe ^
-o cropped.pdf ^
-sDEVICE=pdfwrite ^
-g5400x7200 ^
-c "<</PageOffset [-36 -36]>> setpagedevice" ^
-f input.pdf
The resulting page size will be 7.5x10in (== 540x720pt). To do the same on Linux or Mac, use:
gs \
-o cropped.pdf \
-sDEVICE=pdfwrite \
-g5400x7200 \
-c "<</PageOffset [-36 -36]>> setpagedevice" \
-f input.pdf
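The arithmetic behind the -g and PageOffset values can be sketched in a few lines of Python. This is only an illustration: the function name crop_args is made up here, and the factor of 10 device pixels per point is simply read off the example above (540pt becomes 5400), so it would need adjusting if you pass a different resolution to Ghostscript.

```python
def crop_args(width_pt, height_pt, margin_pt, px_per_pt=10):
    """Build the -g and -c arguments for cropping margin_pt off every edge.

    px_per_pt=10 is an assumption taken from the example command
    (540pt wide -> -g5400...); it is not queried from Ghostscript.
    """
    new_w = width_pt - 2 * margin_pt
    new_h = height_pt - 2 * margin_pt
    if new_w <= 0 or new_h <= 0:
        raise ValueError("margin is too large for this page size")
    # -g takes the new media size in device pixels
    geometry = f"-g{round(new_w * px_per_pt)}x{round(new_h * px_per_pt)}"
    # A negative offset shifts the page content left and down by the margin
    offset = f"<</PageOffset [{-margin_pt} {-margin_pt}]>> setpagedevice"
    return geometry, offset


# Letter page (612x792pt), 36pt cropped from each edge
print(crop_args(612, 792, 36))
```

For the letter example this reproduces the two arguments used in the commands above.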
Update: How to determine 'margins' with Ghostscript
A comment asked for 'automatic' determination of the white margins. You can use Ghostscript for this too. Its bbox device can determine the area covered by the (virtual) ink on each page (and hence, indirectly, the whitespace on each edge of the canvas).
Here is the command:
gs \
-q -dBATCH -dNOPAUSE \
-sDEVICE=bbox \
input.pdf
Output (example):
%%BoundingBox: 57 29 562 764
%%HiResBoundingBox: 57.265030 29.347046 560.245045 763.649977
%%BoundingBox: 57 28 562 667
%%HiResBoundingBox: 57.265030 28.347046 560.245045 666.295011
The bbox device renders each PDF page in memory (without writing any output to disk) and then prints the BoundingBox and HiResBoundingBox info to stderr. You can modify the command as follows to make the results easier to parse:
gs \
-q -dBATCH -dNOPAUSE \
-sDEVICE=bbox \
input.pdf \
2>&1 \
| grep -v HiResBoundingBox
Output (example):
%%BoundingBox: 57 29 562 764
%%BoundingBox: 57 28 562 667
This would tell you...
- ...that the lower left corner of the content rectangle of Page 1 is at coordinates [57 29], with the upper right corner at [562 764]
- ...that the lower left corner of the content rectangle of Page 2 is at coordinates [57 28], with the upper right corner at [562 667]
This means:
- Page 1 uses a whitespace of 57pt on the left edge (72pt == 1in == 25.4mm).
- Page 1 uses a whitespace of 29pt on the bottom edge.
- Page 2 uses a whitespace of 57pt on the left edge.
- Page 2 uses a whitespace of 28pt on the bottom edge.
As you can see from this simple example already, the whitespace is not exactly the same for each page. Depending on your needs (you likely want the same size for each page of a multi-page PDF, no?), you have to work out what are the minimum margins for each edge across all pages of the document.
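If you would rather not read these numbers off by hand, here is a minimal Python sketch that parses the bbox output and picks the smallest left and bottom whitespace across all pages. The function names are made up for this illustration, and the sample text is just the example output from above pasted in as a string; in practice you would capture the stderr of the gs command yourself.

```python
import re

# Matches only the integer line; the ^ anchor skips %%HiResBoundingBox
BBOX_RE = re.compile(r"^%%BoundingBox:\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)\s*$")


def parse_bboxes(text):
    """Return one (llx, lly, urx, ury) tuple per %%BoundingBox line, in page order."""
    boxes = []
    for line in text.splitlines():
        m = BBOX_RE.match(line.strip())
        if m:
            boxes.append(tuple(int(g) for g in m.groups()))
    return boxes


def min_left_bottom(boxes):
    """Smallest left and bottom whitespace (in pt) found on any page."""
    return min(b[0] for b in boxes), min(b[1] for b in boxes)


sample = """\
%%BoundingBox: 57 29 562 764
%%HiResBoundingBox: 57.265030 29.347046 560.245045 763.649977
%%BoundingBox: 57 28 562 667
%%HiResBoundingBox: 57.265030 28.347046 560.245045 666.295011
"""

boxes = parse_bboxes(sample)
print(min_left_bottom(boxes))  # (57, 28)
```

Taking the minimum per edge means no page loses any content when the same crop is applied to the whole document.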
Now what about the right and top edge whitespace? To calculate that, you need to know the original page size of each page. The simplest way to determine this is the pdfinfo utility. Example command for a 5-page PDF:
pdfinfo \
-f 1 \
-l 5 \
input.pdf \
| grep "Page "
Output (example):
Page 1 size: 612 x 792 pts (letter)
Page 2 size: 612 x 792 pts (letter)
Page 3 size: 595 x 842 pts (A4)
Page 4 size: 842 x 1191 pts (A3)
Page 5 size: 612 x 792 pts (letter)
This will help you determine the required canvas size and the required (maximum) white margins of the top and right edges of each of your new PDF pages.
These calculations can all be scripted too, of course.
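As a rough idea of what such a script might look like, the Python sketch below combines the page sizes from pdfinfo with the bounding boxes from the bbox device to get the whitespace on all 4 edges, then takes the minimum per edge across pages. The helper names and the hard-coded sample values are assumptions made for this illustration (two letter pages with the bounding boxes from the example output); a real script would feed in the captured output of both commands.

```python
import re

# Matches lines such as "Page    1 size: 612 x 792 pts (letter)"
PAGE_RE = re.compile(r"^Page\s+(\d+)\s+size:\s+([\d.]+)\s+x\s+([\d.]+)\s+pts")


def parse_page_sizes(text):
    """Map page number -> (width_pt, height_pt) from pdfinfo -f/-l output."""
    sizes = {}
    for line in text.splitlines():
        m = PAGE_RE.match(line.strip())
        if m:
            sizes[int(m.group(1))] = (float(m.group(2)), float(m.group(3)))
    return sizes


def page_margins(size, bbox):
    """Return (left, bottom, right, top) whitespace in pt for one page."""
    width, height = size
    llx, lly, urx, ury = bbox
    return llx, lly, width - urx, height - ury


pdfinfo_out = (
    "Page    1 size: 612 x 792 pts (letter)\n"
    "Page    2 size: 612 x 792 pts (letter)\n"
)
sizes = parse_page_sizes(pdfinfo_out)
# Bounding boxes per page, as reported by the bbox device (sample values)
bboxes = {1: (57, 29, 562, 764), 2: (57, 28, 562, 667)}

per_page = [page_margins(sizes[n], bboxes[n]) for n in sorted(sizes)]
# Smallest whitespace per edge, i.e. the most that can be cropped safely
common = tuple(min(edge) for edge in zip(*per_page))
print(common)  # (57, 28, 50.0, 28.0)
```

The resulting left and bottom values would go into PageOffset (negated), and the page size minus left plus right, and bottom plus top, gives the new media size for -g.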
But if your PDFs all have a uniform page size, or if they are 1-page documents, it is all much easier to get done...