Description
(This is an idea for the Google Summer of Code 2015)
Our command line utility PDFDebugger (part of the command line pdfbox-app; get it here, read the description here, see the source code here) needs some improvements:
- hex view
- view of non printable characters
- ✓ saving streams
- binary copy & paste
- ✓ Create a status line that shows where we are in the tree. (Like in the Windows REGEDIT)
- ✓ Copy the current tree string into the clipboard (useful in discussions about details of a PDF)
- ✓ (Optional, not sure if easy) Jump to a specific place in the tree by entering the tree string
- ✓ ability to search in streams (very useful for content streams and metadata)
- ✓ show images that are streams
- ✓ show PDIndexed color lookup table, show the index value, the base and RGB color value sets when the mouse moves
- ✓ show PDSeparation color
- ✓ show PDDeviceN colors
- optional, idea should be developed a bit: show meaningful explanation on some attributes, e.g. "appearance stream" when hovering over /AP
- show font encodings and characters
- ✓ display flag bits (e.g. Annotation flags) in a way that is easy to understand. There are probably others; I assume that the main work needs to be done only once.
- edit attributes (should be possible to enter values as decimal, hex or binary)
- edit streams, while keeping or changing the compression filter
- save altered PDF
- ✓ color marking of certain PDF operators, especially q...Q and text operators (BT...ET). Ideally, it should help the user understand the "bracketing" of these operators, i.e. understand where a sequence starts and where it ends. (See "operator summary" in the PDF Spec.) Other "important" operators I can think of are the matrix, font and color operators. A cool advanced thing would be to show the current color or the font in a popup when hovering above such an operator.
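To make the first two wishes (hex view and view of non printable characters) more concrete, here is a minimal sketch of how stream bytes could be rendered as text rows. It is only an illustration under assumptions: the class name HexViewSketch, the method hexDump and the 16-bytes-per-row layout are hypothetical and not part of PDFBox or of the existing PDFDebugger code.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox: formats raw bytes as a classic hex dump.
final class HexViewSketch {
    // Assumed layout: 16 bytes per row, as in most hex editors.
    private static final int BYTES_PER_ROW = 16;

    // Returns one text line per row: offset, hex bytes, then an ASCII column
    // in which non-printable bytes are shown as '.'.
    static String hexDump(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (int offset = 0; offset < data.length; offset += BYTES_PER_ROW) {
            int end = Math.min(offset + BYTES_PER_ROW, data.length);
            sb.append(String.format("%08X  ", offset));
            StringBuilder ascii = new StringBuilder();
            for (int i = offset; i < offset + BYTES_PER_ROW; i++) {
                if (i < end) {
                    int b = data[i] & 0xFF; // treat the byte as unsigned
                    sb.append(String.format("%02X ", b));
                    ascii.append(b >= 0x20 && b < 0x7F ? (char) b : '.');
                } else {
                    sb.append("   "); // pad a short last row so the ASCII column lines up
                }
            }
            sb.append(' ').append(ascii).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A tiny content stream fragment; the line feed shows up as '.' in the ASCII column.
        byte[] sample = "BT /F1 12 Tf\n(Hi) Tj ET".getBytes(StandardCharsets.US_ASCII);
        System.out.print(hexDump(sample));
    }
}
```

In the GUI, the returned string could be shown in a text component with a monospaced font; a real implementation would probably render rows lazily for large streams instead of building one big string.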
To see a product with a similar purpose that is better than PDFDebugger, watch this video.
I'm not asking to implement a clone of that product (I don't use it; all I know is that video), but we at PDFBox really need something that makes PDF debugging easier. As an example of how the current PDFDebugger prevented me from finding a bug quickly, see PDFBOX-2401 and search for "PDFDebugger".
Prerequisites:
- Java programming, especially the GUI components
- the ability to understand existing source code
Using external software components is possible (they must have the Apache License or a compatible one), but should be decided on a case-by-case basis; we don't want to get too big.
Development strategy: go from the easy to the difficult. The desired features are already sorted this way (mostly).
Get introduced: download the source code with svn and build it with maven. Run PDFDebugger and view some PDFs to see the components of a PDF. Start with the file of PDFBOX-2401. Read up on the structure of PDF on the web or in the PDF Specification.
Mentor: Tilman Hausherr (European timezone, languages: German, English, French). To see the GSoC2014 project I mentored, go to PDFBOX-1915.