Details

    • Type: New Feature
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Parsing
    • Labels: None

      Description

      A conforming parser will start at the end of the file and read backward until it has read the EOF marker, the xref location, and the trailer[1]. Once these are read, it will read in the xref table so it can locate other objects and revisions. This also allows skipping objects which have been rendered obsolete (per the xref table)[2], and it allows only the minimum amount of information to be read when the file is loaded; subsequent information is loaded if and when it is requested. This is all laid out in the official PDF specification, ISO 32000-1:2008.

      Existing code will be re-used where possible, but this will require new classes in order to accommodate the lazy reading, which is a very different paradigm from the existing parser. Using separate classes will also eliminate the possibility of regression bugs making their way into the PDDocument or BaseParser classes. Changes to existing classes will be kept to a minimum in order to prevent regression bugs.

      [1] Section 7.5.5 "Conforming readers should read a PDF file from its end"
      [2] Section 7.5.4 "the entire file need not be read to locate any particular object"
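
      A minimal sketch of that startup sequence (the class name, the 1024-byte tail window, and the helper method are illustrative, not part of the attached patch): seek near the end of the file, confirm the %%EOF marker, and extract the byte offset that follows the startxref keyword.

      import java.io.IOException;
      import java.io.RandomAccessFile;
      import java.nio.charset.StandardCharsets;

      class TrailerLocatorSketch {
          /** Returns the byte offset of the xref section referenced by startxref. */
          static long findStartXref(RandomAccessFile file) throws IOException {
              int chunkSize = (int) Math.min(1024, file.length());
              byte[] tail = new byte[chunkSize];
              file.seek(file.length() - chunkSize);
              file.readFully(tail);
              String s = new String(tail, StandardCharsets.ISO_8859_1);
              int idx = s.lastIndexOf("startxref");
              if (idx < 0 || !s.contains("%%EOF")) {
                  throw new IOException("startxref/%%EOF not found near end of file");
              }
              // the line after "startxref" holds the decimal offset of the xref section
              String offset = s.substring(idx + "startxref".length()).trim().split("\\s+")[0];
              return Long.parseLong(offset);
          }
      }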

      1. conforming-parser.patch
        6 kB
        Adam Nichols
      2. ConformingPDDocument.java
        3 kB
        Adam Nichols
      3. ConformingPDFParser.java
        27 kB
        Adam Nichols
      4. ConformingPDFParserTest.java
        2 kB
        Adam Nichols
      5. COSUnread.java
        2 kB
        Adam Nichols
      6. gdb-refcard.pdf
        75 kB
        Adam Nichols
      7. PDFLexer.java
        46 kB
        Maruan Sahyoun
      8. PDFLexer.java
        45 kB
        Maruan Sahyoun
      9. PDFStreamConstants.java
        4 kB
        Maruan Sahyoun
      10. PDFStreamConstants.java
        5 kB
        Maruan Sahyoun
      11. XrefEntry.java
        1 kB
        Adam Nichols

          Activity

          Maruan Sahyoun added a comment -

          I added a new version of the PDFLexer.
          Changes:
          a) the PDFLexer now uses an InputStream as the PDF source. This makes it possible to use the new IO classes in o.a.pdfbox.io.
          b) refactored the PDFLexer so the only IO operation used is read()
          c) drawback: one needs to call reset() if the position in the stream is changed by a seek operation, in order to clear the internal state (see the sketch after this list)
          d) the StringBuilder is now reused instead of recreated for every new token
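
          As an illustration of points b) and c), here is a minimal, self-contained sketch (not the attached PDFLexer code; the class name is hypothetical) of a reader that buffers ahead of an InputStream and therefore must be reset after any external seek:

          import java.io.IOException;
          import java.io.InputStream;

          class BufferingReader {
              private final InputStream in;
              private final byte[] buf = new byte[4096];
              private int pos = 0;
              private int len = 0;

              BufferingReader(InputStream in) { this.in = in; }

              int read() throws IOException {
                  if (pos >= len) {             // refill when the buffer is exhausted
                      int n = in.read(buf);
                      if (n < 0) return -1;     // underlying stream is exhausted
                      len = n;
                      pos = 0;
                  }
                  return buf[pos++] & 0xFF;
              }

              /** Discard buffered bytes; call this after the stream was repositioned. */
              void reset() { pos = 0; len = 0; }
          }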

          Maruan Sahyoun added a comment -

          New version of the PDFLexer.

          Maruan Sahyoun added a comment -

          Definition of Constants needed by the PDFLexer

          Maruan Sahyoun added a comment -

          New version of the PDFLexer replacing the old version.
          Changes:

          1. corrected license header
          2. bug fixes
          3. performance improvements
          Adam Nichols added a comment - - edited

          I would prefer you put the changes in the ConformingPDFParser class. I'm really glad to see that work on the conforming parser is continuing even though I don't have time to contribute at the moment. The more we can combine efforts (e.g. using code from the NonSequentialPDFParser) the better. I've found that the more code is re-used, the quicker bugs are brought to light (at which point we can fix them), so I'd much rather see code re-use than copying and pasting from one class to another.

          Maruan Sahyoun added a comment -

          I can start doing some more work on the conforming parser. Because of the approach I'm taking (ConformingParser -> SimpleParser -> PDF Lexer -> PDF file) there will be quite a few changes to the current code of ConformingPDFParser, e.g. all the low level reading is handled by the Lexer and building (most of) the base PDF objects is handled by SimpleParser (which I'm developing simultaneously with the conforming parser). Would you prefer to put all changes into ConformingPDFParser or start with a new class?

          Maruan Sahyoun added a comment -
          1. I put in some code for buffering into the current dev version of the PDFLexer (which reduced the lexing time on my machine for the ISO spec from 17s to 5s) but am looking forward to reusing a general class. If possible this should also enable the lexer to use a byte[] or similar as input, e.g. to pass a decoded stream. I think the current o.a.p.io.RandomAccessBuffer already has some of the code but is missing e.g. getFilePointer() from java.io.RandomAccessFile (see the sketch after this list).
          2. stream processing - you are right outlining the issues with the current implementation. I only put it in for completeness; the parser - as it has more information - can handle streams more efficiently.
          3. isNumeric - I put the suggested changes in - thanks for the hint.
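
          Regarding point 1, here is a minimal sketch of the missing capability (the class name is hypothetical; o.a.p.io.RandomAccessBuffer may end up providing this directly): an InputStream wrapper that reports its offset the way java.io.RandomAccessFile.getFilePointer() does.

          import java.io.IOException;
          import java.io.InputStream;

          class PositionTrackingStream extends InputStream {
              private final InputStream in;
              private long position = 0;

              PositionTrackingStream(InputStream in) { this.in = in; }

              @Override
              public int read() throws IOException {
                  int b = in.read();
                  if (b >= 0) position++;   // count only bytes actually consumed
                  return b;
              }

              /** Equivalent of RandomAccessFile.getFilePointer() for this stream. */
              public long getFilePointer() { return position; }
          }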
          Timo Boehme added a comment -

          PDFLexer:

          • I also do like the rich comments
          • buffering of chars to be read: for my NonSequentialPDFParser (PDFBOX-1199) I already implemented random file access with LRU buffering of pages (RandomAccessBufferedFileInputStream.java); maybe with some modifications we should use this for file access
          • stream processing: reading the stream while looking for 'endstream' should only be done if the length attribute is broken (does not exist, or there is no 'endstream' at the specified position); there are 2 reasons: 1) 'endstream' can simply appear as normal content, 2) performance: you can simply read 'length' bytes; or even better: use an object referencing the original file stream with offsets, so no byte copying is needed (see the sketch after this list)
          • isNumeric() optimization:
            return ( c >= '0' && c <= '9' ) ||
                c == '.' || c == '+' || c == '-';
            }
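
          A hedged sketch of the /Length fast path described in the stream processing point above (class and method names are illustrative, not attached code): read exactly Length bytes, verify that 'endstream' follows, and only fall back to scanning when that check fails.

          import java.io.IOException;
          import java.io.RandomAccessFile;
          import java.nio.charset.StandardCharsets;

          class StreamLengthSketch {
              static byte[] readStream(RandomAccessFile file, long dataStart, int length)
                      throws IOException {
                  file.seek(dataStart);
                  byte[] data = new byte[length];
                  file.readFully(data);              // fast path: trust /Length
                  long afterData = file.getFilePointer();
                  int c = file.read();               // skip an optional EOL (CRLF or LF)
                  if (c == '\r') { c = file.read(); }
                  if (c != '\n') { file.seek(afterData); }
                  byte[] kw = new byte[9];
                  file.readFully(kw);
                  if (!"endstream".equals(new String(kw, StandardCharsets.ISO_8859_1))) {
                      // /Length was broken; only now fall back to scanning for "endstream"
                      throw new IOException("endstream not at offset implied by /Length");
                  }
                  return data;
              }
          }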
          Maruan Sahyoun added a comment -

          Thanks for the review and the effort taken.

          1. the while loops I will fix - thanks for the hint.
          2. the structure validation - I'm more in favor of putting that into the parser. The reason is that to be able to check for compliance I need the 'raw' data being read by the lexer instead of the 'parsed' data, e.g. checking that the offset entry in an xref entry is 10 digits. If I do the parsing from a 'raw' number in the lexer and, let's say, return a COSInteger, that information will be gone. In addition, e.g. reading/skipping the stream data can be done more efficiently after parsing the dictionary's length entry; the lexer doesn't know about that. So my current favorite is that the lexer only creates tokens but doesn't ensure validity, create COSObjects etc. - WDYT? (See the validation sketch after this list.)
          3. I fully agree that JUnit test cases will be needed and I'm about to create some basic ones.
          4. I'm very interested in ensuring that parsing is done as quickly as possible without compromising the goal of ensuring/validating conformance to the spec. I don't think that the current implementation will offer the best performance, simply because there will be a lot of unbuffered read() calls. This should be enhanced, I think, by using a small buffer to read more data and then working on that buffer. Because of the random access nature of PDFs it might be that we read too many bytes into the buffer, but the overall performance would still benefit, as I think it's very rare that only single bytes are needed before doing another seek to a completely different location. WDYT?
          5. there will be code which handles PDFs which are not in line with the ISO spec, and I do trust that the new parser will offer better results than the current one, but putting all current workarounds in will take some time as one needs to scan through the sources to identify them. What I'm planning to do is have some exits within the code for parsing individual sections to put the workarounds in. This way they stand out and are separated from the 'clean' parsing. In addition one might also override these.
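
          To make point 2 concrete, here is a minimal sketch of parser-side validation working on the raw entry text as delivered by the lexer (names are illustrative; the pattern encodes the 20-byte entry layout from ISO 32000-1, 7.5.4: 10-digit offset, 5-digit generation number, 'n' or 'f', two-character end-of-line):

          import java.io.IOException;
          import java.util.regex.Pattern;

          class XrefEntryValidator {
              private static final Pattern STRICT =
                      Pattern.compile("\\d{10} \\d{5} [nf](\r\n| \r| \n)");

              static void validate(String rawEntry) throws IOException {
                  if (rawEntry.length() != 20 || !STRICT.matcher(rawEntry).matches()) {
                      throw new IOException("non-conforming xref entry: " + rawEntry);
                  }
              }
          }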
          Adam Nichols added a comment -

          First and foremost, I like the good documentation, comments and references to the PDF spec. Bugs are easy to fix, a lack of documentation is not, so it's good to have this up front.

          Secondly, I like the design. Being able to read just one or two bytes and know what the next object will be is great. I don't really care for "while(true)" loops such as the one seen in processKeyword() (it could simply be "while(!isDelimiter(ch) && !isWhitespace(ch))" with the unread(ch); after the loop). It makes no functional difference, but putting the stop condition in the loop header just makes sense.
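
          For illustration, the suggested loop shape could look like the self-contained toy below (not the actual PDFLexer code; the whitespace and delimiter sets are the ones defined in ISO 32000-1, 7.2.2):

          import java.io.IOException;
          import java.io.PushbackInputStream;

          class KeywordSketch {
              private final PushbackInputStream in;
              private final StringBuilder buffer = new StringBuilder();

              KeywordSketch(PushbackInputStream in) { this.in = in; }

              static boolean isWhitespace(int c) {
                  return c == 0 || c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
              }

              static boolean isDelimiter(int c) {
                  return c == -1 || "()<>[]{}/%".indexOf(c) >= 0;
              }

              String processKeyword() throws IOException {
                  buffer.setLength(0);
                  int ch = in.read();
                  while (!isDelimiter(ch) && !isWhitespace(ch)) {  // stop condition in the header
                      buffer.append((char) ch);
                      ch = in.read();
                  }
                  if (ch != -1) in.unread(ch);  // push back the terminating character
                  return buffer.toString();
              }
          }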

          As for structure validation, it seems like it'd make sense to do that here, since this is what's dealing with the structure of objects. The parser may also validate structure, but it will be looking for different things (for example, an indirect object which refers to an object that doesn't exist; or a series of indirect objects which form a loop, such as where a -> b -> c -> a; or other logical errors). As discussed, this would only be enforced if it was in "strict mode". I can see you've already taken care to deal with non-conforming PDFs (e.g. processStream(boolean keepData), where it checks to see if the end of line marker is there and makes sure that rawData is set properly in either case).

          The PDFLexer looks very good. About the only suggestion I can think of would be to add some JUnit test cases. I remember that the current parser has some strange code for detecting the endstream, but it made a huge performance difference, so I'd suggest testing the Lexer with a file which contains a lot of streams to make sure that everything is okay. Also, I know there are some PDFs in the JUnit tests that are non-conforming (sometimes in very major ways, not just missing newlines, but things like "[" with no matching "]"). As absurd as these may seem, these are things which I've personally seen in the wild and things which Adobe Reader is able to recover from, so it'd be preferable to deal with them at least as well as the current implementation does. If I remember correctly, there's also some code in the current parser about dealing with a missing/malformed end of file marker (i.e. "%%EOF"). I can't recall if there's an example PDF & JUnit test for that one, but if not it's easy to mangle/remove the "%%EOF" at the end (or in the middle of a file in the case of a PDF which has been incrementally updated).

          Maruan Sahyoun added a comment -

          I attached the PDFLexer component for initial review. This is still work in progress and there are various areas where it might be enhanced. The main idea behind the design (somewhat similar to the StAX XML Reader) is that the parser is able to look at individual events/tokens to start parsing the PDF instead of working on the byte level. By design, whitespace, EOL and comments are delivered as individual events, as it's necessary to have that information to check a PDF for full conformance to the specification (e.g. make sure that a (text-based) xref entry is 20 bytes long, that the keyword stream is delimited by a CR/LF or LF, ...).

          WDYT?

          Maruan Sahyoun added a comment -

          PDFLexer as a base component to the ConformingPDFParser.

          Maruan Sahyoun added a comment - - edited

          Continuing the work on the parser - maybe someone more experienced in PDFBOX can help me with mapping the basic PDF objects as documented in ISO 32000 to the COS model classes in PDFBOX:

          Comment [ISO 32000-1:2008: 7.2.3] -> none?
          Boolean [ISO 32000-1:2008: 7.3.2] -> COSBoolean?
          Number [ISO 32000-1:2008: 7.3.3] -> COSReal, COSInteger?
          Literal String [ISO 32000-1:2008: 7.3.4.2] -> COSString?
          Hex String [ISO 32000-1:2008: 7.3.4.3] -> COSString?
          Name Object [ISO 32000-1:2008: 7.3.5] -> COSName?
          Keyword [ISO 32000-1:2008: 7.3] (the spec doesn't have that as a type but as part of some other types) -> none?
          Array Objects [ISO 32000-1:2008: 7.3.6] -> COSArray?
          Dictionary Objects [ISO 32000-1:2008: 7.3.7] -> COSDictionary?
          Stream Objects [ISO 32000-1:2008: 7.3.8] -> COSStream?
          Null Object [ISO 32000-1:2008: 7.3.9] -> COSNull?
          Indirect Objects [ISO 32000-1:2008: 7.3.10] -> ?

          What are the other classes in o.a.pdfbox.cos for?

          If wanted I can also move forward and include some comments from the ISO spec in the o.a.pdfbox.cos classes' documentation.

          Maruan Sahyoun added a comment -

          a) for the two parsing modes: relaxed lessens the requirements for parsing (e.g. an xref entry doesn't have to be 20 bytes long, but there still need to be three distinct parts of information: number, number, usage flag). Workarounds will be part of the relaxed mode, but the user will be informed when they are applied, whereas the default behaviors of relaxed mode will not be reported back. So I think we have the same understanding.
          b) fine. I think that's something we can revisit later. After doing the parser I think I will have a much better understanding of how PDFBOX works.

          Timo Boehme added a comment -

          a) while I think that 2 parsing modes are OK, it is important to distinguish between 1) not strictly conforming, but parseable without loss/change of information (e.g. whitespace that is not allowed) and 2) recovering from/working around an error with a possible information change. Thus we would have two states for relaxed parsing. Case 1 may be hidden, but case 2 needs to be signaled to the user of an application.

          b) putting the logic into the objects sounds like a clean OO approach. Nevertheless I would keep it in the parser, because parsing needs access to environment settings (encryption) and other objects (e.g. object streams), which is more complex if the objects have to know about this. Furthermore, classes of COS objects are easier to maintain if they are not cluttered with parsing code (in my opinion).

          c) absolutely fine with me. Maybe looking at the methods in COSDocument one can find which information is needed, e.g. MediaBox.

          d) A clear separation of workaround code paths with possibility of extension/overwriting is a good idea.

          Maruan Sahyoun added a comment -

          I'm starting the work on the ConformingPDFParser now and there are some questions/ideas I would like to discuss:

          a) as discussed earlier there will be two parsing modes, where strict will be conforming to the ISO spec. For strict I'm planning to check full compliance with the spec for areas I'm touching, e.g. make sure that the (text based) xref table entries are really 20 bytes long... - is that fine?
          b) when constructing COS objects such as COSString, the parser can make sure or complain that the data is according to the spec. The other alternative would be to put that into the COS object, e.g. COSxxx.newInstance(). Both have their benefits. Putting it into the parser means that all parsing is done in a central place. Putting it into the COS object would mean that we have the reading and writing logic in the object itself, so it's fully aware of its lifecycle. I tend to put it into the parser initially but think that it should be put into the COS object at a later stage. WDYT?
          c) I would like to defer the parsing of an object to the point when it is requested. This will apply to most objects except the very basic PDF objects needed to provide some very basic information, e.g. number of pages, metadata, encryption... - is that fine? Which information would need to be available from the start?
          d) I'm thinking about putting code which works around buggy PDFs into some special methods - recoverXXXError (see the sketch after this list). E.g. the current PDFParser has code for the case where the xref table entries have three numbers instead of two (PDFBOX-474). The benefit will be that workarounds are clearly visible and not hidden within the main parsing code, and we are offering a solution which can be extended. WDYT? Initially some exits will be made available - the code will come at a later date.
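
          A minimal sketch of the exits described in d), assuming they are protected methods a subclass can override (all names here are hypothetical):

          import java.io.IOException;

          class RecoverableParser {
              /** The clean path parses strictly and hands deviations to an overridable exit. */
              long parseStartXrefOffset(String line) throws IOException {
                  String trimmed = line.trim();
                  if (trimmed.matches("\\d{1,10}")) {
                      return Long.parseLong(trimmed);   // conforming: a plain decimal offset
                  }
                  return recoverStartXrefError(line);   // workaround path, clearly separated
              }

              /** Override to add recovery for buggy PDFs; the default keeps strict behavior. */
              protected long recoverStartXrefError(String line) throws IOException {
                  throw new IOException("non-conforming startxref offset: " + line);
              }
          }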

          Maruan Sahyoun added a comment -

          I think I didn't do a good job describing what I'm heading for. It's clear that PDFs need random access to get to the portions one is interested in, and it will be up to the parser to make sure that this is done. The lexer is only a helper to the parser when a certain section should be parsed. I think something like hasNext() and next() is helpful there.

          For example, when parsing the xref table the parser will seek to the start and the lexer will start creating events/tokens from there which the parser can inspect - in this case until the parser gets to a token signaling the end of the trailer. Parsing the PDF header will be done in a similar manner: the parser seeks to the start of the file and then inspects the events/tokens delivered by the lexer. For an object, the parser seeks to the start of the object using the information in the xref table and again inspects the events/tokens delivered by the lexer. (A small sketch of this contract follows below.)
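
          A sketch of that contract with illustrative names (this is not the attached code): the lexer exposes only forward token delivery plus a reset, and the raw text is kept so the parser can still validate field widths.

          import java.io.IOException;

          interface PdfTokenSource {
              void reset();                 // forget buffered state after an external seek
              boolean hasNext() throws IOException;
              Token next() throws IOException;

              enum TokenType { NUMBER, NAME, KEYWORD, STRING, DICT_START, DICT_END, EOL, COMMENT }

              final class Token {
                  final TokenType type;
                  final String rawData;     // raw text, so field widths remain checkable
                  Token(TokenType type, String rawData) {
                      this.type = type;
                      this.rawData = rawData;
                  }
              }
          }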

          Removing the dependency on RandomAccessFile was only meant for the lexer. The parser still needs the ability for random access. What I discussed with Timo Boehme was the possibility of using an InputStream as an input to the parser in addition to a file. If I understood him correctly he already implemented something which can be extended. But that's a different topic. For now the parser relies on RandomAccess and it will need a RandomAccess capability in the future.

          I have to admit that writing such a parser is an ambitious project for me and I'm certain that there will be lots of ways to improve the code. But I do hope the general approach is better understood now and seems to be the right approach. That's why I wrote about the status. On the other hand I do know the PDF spec very well, so at least I know what PDF is about.

          Adam Nichols added a comment -

          I'm not sure I understand your approach. I like the points about lazy loading and working on something which will be useful to the conforming parser in the future; however I don't understand why hasNext() and next() would be useful given the random access nature of PDFs. I also do not understand the need to remove the RandomAccessFile dependency. PDFs are files, so it makes sense to use a file object, and dynamic access is good, so RandomAccessFile seems like a logical answer.

          In order to create a base class, I'd say it'd be best to create a class to read an object (of any type) given a position in the RandomAccessFile. This isn't as easy as it sounds, as there are many different types of objects. Then the conforming parser could parse the xref table, and then use the base class to read objects as necessary.

          Maruan Sahyoun added a comment -

          Just before the weekend, another update about my progress.

          Just to let you know about my approach.

          There will be a new (PDF) lexer which works similar to the StAX XML Stream Reader, going through the PDF and producing events. One can walk through them using hasNext() and next(). Events are produced only for very basic PDF objects such as comments, string literals, keywords and numbers. Using getData(), the content of the token belonging to the event can be retrieved in its raw format. The lexer uses lazy loading, so the data building up the token is only constructed when getData() is called; otherwise next() will skip to the next event without keeping the data. Cursor movement is always forward.
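
          A toy illustration of that lazy loading (not the actual lexer code; all names are made up): next() would only record where a token starts and ends, and the data is materialized on the first getData() call.

          import java.nio.charset.StandardCharsets;

          class LazyToken {
              private final byte[] source;   // stand-in for the underlying PDF bytes
              private final int start;
              private final int end;
              private String data;           // built on first request only

              LazyToken(byte[] source, int start, int end) {
                  this.source = source;
                  this.start = start;
                  this.end = end;
              }

              String getData() {
                  if (data == null) {        // lazy construction, as described above
                      data = new String(source, start, end - start, StandardCharsets.ISO_8859_1);
                  }
                  return data;
              }
          }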

          I'm now working on the next component, SimpleParser (maybe it should be called BaseParser later), which will extend the lexer. Taking the same approach as for the lexer, this component is able to handle complex PDF objects such as dictionaries and arrays.

          ConformingParser will then extend SimpleParser to deal with Streams and all other PDF structures such as Xrefs ...

          The lexer is feature complete. There will be some refinements as I'm working on the SimpleParser, esp. removing the dependency on java.io.RandomAccessFile. Timo Boehme offered some help here.
          I'm currently working on the SimpleParser. When this is ready I will submit the code for review.

          Maruan Sahyoun added a comment -

          Thanks for your valuable feedback. I'll try to provide a status from time to time to inform about the progress.

          With the startxref - my mistake, it's the EOF that is required [PDF 1.7 App. H 18]. That was the idea behind the Acrobat parsing mode: to implement the notes in App. H. But I think you are right, two modes, Strict and Relaxed, should be enough.

          For the documentation I'm putting links to the reference into the code wherever I feel that structures are defined which are related to the spec, to describe what is going on or where assumptions are made. Small sample:

          case DelimiterChars.OpeningAngleBracket: // Dictionary or Hex String
          // This could be either the start of a
          // Dictionary [PDF 1.7: 3.2.6] or a
          // Hexadecimal String [PDF 1.7: 3.2.3]
          // so we need to read the next ch to make
          // a decision

          At the moment I'm trying to get to a state where I can submit the code and it's really doing something useful. There will be TODOs I'm documenting within the code. I think at that point in time I'm looking for feedback and help. One of the lacking areas is doing formal unit tests although I'm testing individual functions against some PDFs I have as development moves forward. So I'm glad that you can commit your PDFs for unit testing.

          Adam Nichols added a comment -

          Lexer: I like the idea of keeping the code as small and independent as possible.

          XRef Streams: cool, glad to see that's been added!

          Modes of parsing: The strict mode is good because it can help developers of other products ensure their products conform. It may also help prevent unknown attacks from working, since it will just bail with an error message when it gets a malformed PDF (this doesn't help with flaws which may be in the protocol itself, but then again not much will help there). The relaxed parsing is also a nice option since people expect the software to "just work" even if there are small errors in the file. I'm going to say that I don't like the idea of trying to clone what Adobe Acrobat does. It varies with each version of the PDF spec (at a minimum), is much more complex than is necessary, has been plagued by security problems, and offers no advantage over the strict/relaxed modes. I'd rather do what's right (throw an exception if a PDF is non-conforming) or what's popular (parse anything in the best way we know how), with the choice made by the person who uses the library.

          Please make sure to include references to the spec when relevant. For example, I'm not aware of anything which says "startxref is expected to be within the last 1024 bytes." I'd imagine that'd normally be the case, but if the xref table is very large, I could imagine that would sometimes not be the case.

          My circumstances have drastically changed since I last worked on this (in June), so I can't dedicate nearly as much time as I could before. However, I'm still interested in following the progress and helping out when and where I can. On the brighter side, I should now be able to make sure all the PDFs I use can be committed for JUnit test cases. If there are any small things which need to be done related to the conforming parser, feel free to mention them either here or on the developer mailing list, and I'll know where I can jump in and help if I get some free time.

          Maruan Sahyoun added a comment -

          I'm looking for a suggestion on dealing with different kinds of PDFs. The idea behind the ConformingParser (if I understood it correctly) was to parse files which are in line with the PDF spec. But even Acrobat is relaxed when it comes to certain deviations from the spec, e.g. startxref is expected to be within the last 1024 bytes. In the real world there might also be PDFs which cannot be read by Acrobat.

          The way I think I could deal with it is to introduce (gradually) three parsing modes within the ConformingParser: Strict, Acrobat, Relaxed.

          Strict will fail the parsing as soon as a deviation from the spec is encountered.
          Acrobat will take the Acrobat implementation notes as outlined in the spec into account.
          Relaxed will try to continue processing if there are issues.

          WDYT
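
          A minimal sketch of how these modes might look in code (names are illustrative):

          enum ParsingMode {
              STRICT,    // fail on the first deviation from ISO 32000
              ACROBAT,   // honor the Acrobat implementation notes in the spec's appendices
              RELAXED    // keep parsing and report recoverable issues
          }

          // A parser could then branch on the mode whenever a conformance check fails:
          //   if (mode == ParsingMode.STRICT) { throw new IOException(message); }
          //   else { log(message); /* and continue */ }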

          Maruan Sahyoun added a comment -

          Just to let you know about the (slow) progress I'm making.

          I've made a decision to split the parsing into two parts: a (new) lexer which reads the file and returns individual tokens (Number, Comment, NameObject, DictionaryStart, ArrayStart ...) and their type. This is controlled by the ConformingParser which, when parsing certain parts of the PDF, looks for specific tokens. The reason behind that was to reduce the code within the individual classes and to allow the ConformingParser to deal with higher level objects. The tokens return the raw data, e.g. a hex string is delivered as is. The ConformingParser needs to do the interpretation, as I wanted to keep the semantics within the parser.

          The lexer part is ready with its base functionality and will be extended as work continues on completing the ConformingParser. Currently it can also only use RandomAccessFile, which needs to be changed later on, as I wanted to move forward with the ConformingParser.

          The ConformingParser at the high level is kept as Adam started to develop it, but as individual functions are visited it starts to use the lexer. I've also already changed some of the parameters from int to long, e.g. for the byte offset in the xref table, as this is defined to hold up to 10 digits, in line with PDFBOX-1196.

          The XrefEntry class has been extended to deal with regular Xref entries as well as Xref Stream entries i.e. the different properties are reflected in the class. This can be extended later to be usable when writing a PDF if the need arises.

          Maruan Sahyoun added a comment -

          Thanks for your feedback.

          One final question. The current implementation, e.g. parseTrailerInformation(), is strict when information is preceded or followed by whitespace, although the PDF spec might allow that.

          As an example, the keyword startxref is expected to be the only content in a line, as is the byte offset to the xref, not allowing any whitespace before or after the keyword or byte offset, even though the spec uses the term 'contain'. Just to make sure that we have the same understanding for throwNonConformingException: would you treat whitespace as conforming in this case or not? My interpretation would be that whitespace is acceptable in this case.

          Adam Nichols added a comment -

          There were a few reasons why I wanted to re-write the parser:
          1.) I was tired of tweaking hacks in our parser to deal with non-conforming PDFs. Some of the issues have been resolved, but not all of them (e.g. parsing invalid objects which are never referenced)
          2.) We should comply with the ISO-32000 standard. This makes sure we're handling things in the proper manner; being part of the solution, not part of the problem.
          3.) The ISO way of parsing is more efficient. Its worst-case performance is as good as our best case. It generally uses less memory (which is especially important for mobile devices); it shouldn't need to parse all the objects in every case, so it'll use less CPU; and it doesn't always need to read all the bytes of the file, reducing disk I/O.

          While this doesn't completely solve all of our problems (especially when it comes to non-conforming documents), it is a step in the right direction. Also, I don't have any uncommitted code for the non-conforming parser. Been very busy lately and haven't had a chance to go back and dig into it.

          Maruan Sahyoun added a comment -

          Hi Adam,

          I'm looking into putting some work into the conforming parser. But first let me ask some questions:

          1. what are the main areas you were trying to address? To me the most pressing need was the correct Xref resolution but that has been solved
          2. is there some further work you put into this you would like to post before there are any changes?

          Kind regards

          Maruan

          Andreas Lehmkühler added a comment -

          +1, sounds like a good plan IMO

          Adam Nichols added a comment -

          If there are no objections, I'll commit the changes after the 1.6 release. The changes here should not affect any existing code (not much changed, mostly just adding new classes), but I still don't want to add it on to a tag at the last minute. This will give us more time for regression testing between releases. Waiting longer just means the chances of the patches applying cleanly are reduced.

          Adam Nichols added a comment -

          I haven't had time to go back to this lately, but I'm still following the mailing list. The comments in PDFBOX-1016 do a great job at explaining how the xref tables should be read/parsed. So when I (or anyone else) comes back to the conforming parser, it'd be good to confirm that we're doing it properly and old references are overwritten by new ones (PDFBOX-1042). I think the current code (attached above) is only reading the last xref table and ignoring all the previous ones, which is very wrong. However, it should be easy to put a loop in there to handle this. Linearized documents will be another thing to add support for in the future, but we'll cross that bridge when we come to it.
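
          A hedged sketch of such a loop, with stand-in types instead of the real COS classes: start at the offset given by startxref, follow each trailer's /Prev entry to the older sections, and let entries from newer sections take priority (per PDFBOX-1042).

          import java.io.IOException;
          import java.util.HashMap;
          import java.util.Map;

          class XrefChainSketch {
              /** Maps object number -> byte offset; the newest definition of each object wins. */
              static Map<Long, Long> collectXref(long startXref, XrefSectionReader reader)
                      throws IOException {
                  Map<Long, Long> merged = new HashMap<>();
                  Long offset = startXref;
                  while (offset != null) {
                      XrefSection section = reader.readAt(offset);
                      for (Map.Entry<Long, Long> e : section.entries.entrySet()) {
                          // sections are visited newest-first, so keep the first value seen
                          merged.putIfAbsent(e.getKey(), e.getValue());
                      }
                      offset = section.prev;   // null when the trailer has no /Prev
                  }
                  return merged;
              }

              interface XrefSectionReader {
                  XrefSection readAt(long offset) throws IOException;
              }

              static final class XrefSection {
                  final Map<Long, Long> entries;
                  final Long prev;
                  XrefSection(Map<Long, Long> entries, Long prev) {
                      this.entries = entries;
                      this.prev = prev;
                  }
              }
          }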

          Andreas Lehmkühler added a comment -

          Use the small triangle next to the "+" to get to the "Manage Attachments" menu

          Adam Nichols added a comment -

          I'm not sure how to delete Items from JIRA, but the date should identify the new ConformingPDFParser.java

          Adam Nichols added a comment -

          Updated BaseParser so I could inherit from it (also updated StringBuffers to StringBuilders to make them more efficient). COSDictionary updated to avoid a NullPointerException.

          Adam Nichols added a comment -

I got enough done late last night to get it to a point where it is presentable. It might not be very useful since it just reads the trailer and then reads the Root and Info objects (but it does not follow the weak references); however, it is a reasonable starting point. I'll attach the updated files here for review before committing them.

          Adam Nichols added a comment -

          I updated readWord as described above (ending a "word" on characters like '/', ']', etc.) and was able to remove all the ugly hacks. I confirmed that it worked on my test PDF.

I've started work on the lazy evaluation by creating a COSUnread object, which is just a placeholder to let us know that the object hasn't been read yet. That'll allow reading an indirect reference as a COSObject consisting of an objectNumber, generation, and COSUnread. Later, when we need the data, the COSUnread will be replaced with the actual object. Or at least that's how I imagine it working...

          I'll post the code again once I'm at least able to read the trailer in a lazy way, and am able to retrieve the info by automagically reading the data when a COSUnread is found.
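For illustration, a minimal sketch of what such a placeholder might look like (a guess at the shape, not necessarily the attached COSUnread.java; in PDFBox it would extend COSBase):

    public class COSUnread {
        private final long objectNumber; // from the indirect reference
        private final long generation;   // from the indirect reference

        public COSUnread(long objectNumber, long generation) {
            this.objectNumber = objectNumber;
            this.generation = generation;
        }

        public long getObjectNumber() { return objectNumber; }
        public long getGeneration() { return generation; }

        @Override
        public String toString() {
            return "COSUnread{" + objectNumber + " " + generation + " R}";
        }
    }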

          Thomas Chojecki added a comment -

          Hi,

1) You're right, in some cases it would be easier to separate values with spaces.

The object reader does the following (some abstract pseudocode):

readValue
{
    char c1 = readChar;
    if (c1 == SPACE)
    {
        readValue;  // maybe something like skipSpaces at the beginning would be better,
                    // because someone can kill the heap with whitespace due to the recursion
    }
    else if (c1 == '/')
    {
        // unreadByte not needed, because readCOSName can read until whitespace
        // or '/' and remove the leading '/' if it exists
        readCOSName;
    }
    else if (c1 == INTEGER)
    {
        unreadByte;
        readCOSIntOrRef;
    }
    else if (c1 == '(')
    {
        // unreadByte not needed, see readCOSName
        readCOSString;
    }
    // ... and so on. '<' is tricky: it can start a dictionary or a hex string,
    // so we need to read one more byte to see whether it is a dict or a string.
}

readCOSIntOrRef
{
    buffer b1;
    char c1;
    while ((c1 = readChar) != '/' or '>')
    {
        if (c1 == SPACE)
        {
            new COSRef(buffer, readChar);
            readChar; readChar;  // to skip the " R"
        }
        writeBuffer(c1);
    }
    new COSInteger(writeBuffer);
}

readObj
{
    readCOSName
    readValue
}

Writing a parser for PDF is one of the hardest things: the spec is imprecise and gives the developer room to interpret it in whatever way he considers spec-conforming.

2) One way is to read the last xxx bytes (maybe 100) and search for the "startxref". After finding this, we can jump to the xref table / stream and parse to the end. Or we read the last xxx bytes and try to find the "trailer". I would prefer the first approach.

After parsing the first xref table and the trailer, we should check whether another one is in the document, parse it as well, and skip already-parsed references.
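A minimal sketch of that tail scan, assuming the marker sits within the last kilobyte of the file (the class and method names are mine, not PDFBox API):

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.charset.StandardCharsets;

    final class StartXrefLocator {
        static long findXrefOffset(RandomAccessFile file) throws IOException {
            int tail = (int) Math.min(1024, file.length()); // assumption: marker is near EOF
            byte[] buf = new byte[tail];
            file.seek(file.length() - tail);
            file.readFully(buf);
            String s = new String(buf, StandardCharsets.ISO_8859_1);
            int i = s.lastIndexOf("startxref");
            if (i < 0) {
                throw new IOException("startxref not found in the last " + tail + " bytes");
            }
            // the token following "startxref" is the byte offset of the xref section
            String rest = s.substring(i + "startxref".length()).trim();
            return Long.parseLong(rest.split("\\s+")[0]);
        }
    }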

3) A good parser needs time, and we should keep the old implementation in case users can't parse all the documents they have. The next thing is seek time: I can't imagine that parsing a document lazily is quicker than parsing it completely from the beginning. If the parser needs to jump between objects, that costs a lot of time on a hard disk. The last question is: how much of the document does the user need to parse to get the information he needs to work with it? If he needs to read 50% or 70%, we might as well parse the whole document.

a) That idea is good: we can grab minimal information without parsing the document completely, and the first request for e.g. a page parses the needed information.

My plan is to take a look after work, debug some documents, see which documents fail, and fix them.

          Adam Nichols added a comment -

          I'll upload XrefEntry tonight. I also noticed that I made some slight changes to some other classes, but when I did a diff, it looked like they were unrelated to this task. If it doesn't work as expected, let me know and I'll double check.

          1.) My point was that [1] above is more difficult to parse than this (note the spaces between objects):
          31 0 obj
          << /Length 45 0 R /Length1 568 /Length2 1017 /Length3 0 >>

          It would be much easier if the objects were separated in some way, like with spaces. However, not all software does this and since white space separation is not required per the spec, we can't depend on this.

Another, related, issue I ran into: when I read in "45", is that a COSInteger or an indirect reference? We don't know until we read the next "word". The next word is "0"; we still don't know if it is an int or an indirect reference, but if the next "word" is an "R" then we know it's an indirect reference and we can process it. In the first example the last word was "R/Length1", which requires cleaning up before we can identify it as an "R". It's not something which is unsolvable, but it just makes things more difficult.

Currently reading a "word" is defined (by me) as reading until whitespace is encountered. I suppose we could change this to reading until isWhitespace(c) || '/' == c || ']' == c || '>' == c (or something similar). I didn't test that because I was thinking it would cause problems with entries like "/Name Some string with name/identifier here", but on second thought those won't be a problem, as it'll just take more calls to readWord() to read in all the data for that object.
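A minimal sketch of that readWord() variant (hypothetical wrapper, not the actual BaseParser code): whitespace always ends a word, and '/', ']', '>' end it too unless they are the first character; the delimiter is pushed back so the next call starts on it.

    import java.io.IOException;
    import java.io.PushbackReader;

    class WordReader {
        private final PushbackReader reader;

        WordReader(PushbackReader reader) { this.reader = reader; }

        String readWord() throws IOException {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                if (Character.isWhitespace(c)) {
                    if (sb.length() == 0) continue; // skip leading whitespace
                    break;                          // whitespace ends the word
                }
                // delimiters end a word, but only after the first character
                if (sb.length() > 0 && (c == '/' || c == ']' || c == '>')) {
                    reader.unread(c);
                    break;
                }
                sb.append((char) c);
            }
            return sb.toString();
        }
    }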

2.) Yes, the parser reads/parses all in one step. I suppose we could just read it into a string and then parse it after reading/parsing the xref table. Or just read & ignore until we find the beginning, mark down the offset, and then read/parse it after dealing with the xref table. I think we'll also need a flag to tell us if we want to use recursion to dereference objects or not. Normally we would, but not for the trailer nor root.

          3.) We should be able to get something which is respectable fairly quickly at which point I'll commit it to the official SVN after going over any and all modifications to existing classes to make sure they won't have any unintended side-effects. In the meantime a unified diff/patch should work okay.

          Here's my plan:
          a.) Add a way to enable/disable recursive parsing. Recursion will be on by default, off for parsing the trailer/root, and then turned back on.
          b.) Change readWord() to stop at '/' ']' and '>' (excluding the first character, which can be any non-whitespace).
c.) Clean up the ugly hacks which are properly resolved by updating readWord()
d.) See if the above changes put the code into a reasonable starting point. If so, and if it won't cause any issues with the normal parser, commit to SVN.

          Thomas Chojecki added a comment -

          Hi Adam,

First of all, thanks for publishing the code. I think you forgot one class: "org.apache.pdfbox.pdmodel.common.XrefEntry".

          @ 1)
I took a look at [1] and can't find an error.
The indirect object 31 is a dictionary object with 4 key-value pairs, as follows:

The first entry has the name object "Length" and refers to the indirect object 45. So you need to take a look inside the xref table for object 45 to see the value (e.g. 45 0 obj 500 endobj).
The other three entries, named "Length1", "Length2" and "Length3", have the integer values 568, 1017 and 0.

As for parsing the key-value pairs: each key is a name object beginning with / (0x2F), immediately followed by the name without whitespace. After the key you will find a blank (0x20) and the related value. In case the value is also a name object, the blank may be omitted.

So if you try to read the whole object 31, you also need to refer to object 45.
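As a hedged illustration of that extra lookup (stand-in names, not PDFBox API): the value of /Length is the reference "45 0 R", so the actual number comes from seeking to object 45's offset from the xref table and parsing "45 0 obj 500 endobj".

    import java.util.Map;

    final class LengthResolver {
        static long resolveLength(Map<Long, Long> xrefOffsets) {
            long offset = xrefOffsets.get(45L);  // byte offset of "45 0 obj"
            return parseIntegerObjectAt(offset); // would yield 500 in this example
        }

        // stub: the real parser would seek to the offset and parse the object
        static long parseIntegerObjectAt(long offset) {
            return 500;
        }
    }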

For more information about the objects, see sections 7.3 and 7.3.7 of the spec.

Have you taken a look at the current parser? It splits the engine into small parts like parsing objects and parsing the trailer; each object has rules for parsing it. For example, if you find an indirect object you parse the prefix first (number generation R), then you parse the object (parseObject()); the next byte will be a delimiter like whitespace, a linefeed or maybe a less-than sign... more can be found in section 7.2.2, tables 1 and 2. Then you know you will find the key beginning with a / followed by the name; after the name you need to parse an object again.

It's hard to explain how this works properly. The current parser does a good job and should not be replaced completely; maybe some parts can be copied.

String objects start and end with parentheses. If the text also contains parentheses, they shall be balanced; if not, you need to escape them. See section 7.3.4.2.
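A minimal sketch of reading such a literal string with balanced parentheses (simplified escapes: the spec's \n, \t, octal forms and so on are not expanded here):

    import java.io.IOException;
    import java.io.Reader;

    final class LiteralStringReader {
        // assumes the opening '(' has already been consumed
        static String read(Reader in) throws IOException {
            StringBuilder sb = new StringBuilder();
            int depth = 1;
            int c;
            while ((c = in.read()) != -1) {
                if (c == '\\') {            // escape: keep the next char verbatim
                    int next = in.read();
                    if (next != -1) sb.append((char) next);
                    continue;
                }
                if (c == '(') depth++;      // balanced nested parenthesis
                if (c == ')' && --depth == 0) break; // closing paren of the string
                sb.append((char) c);
            }
            return sb.toString();
        }
    }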

@ 2)
The dictionary is parsed before the xref table? If you want to do it spec-conformingly, the first thing is to find the whole trailer with the startxref; then you know where to find the root dictionary and the xref table, so you can parse the xref table first.

Most information about the document can be extracted from the trailer and the root dictionary. Inside the root dict you can find the page dictionary (I hope this can be parsed lazily); you can also find the AcroForm entry with forms and annotations. I think there is more information, but I haven't studied all of it.

Parsing the page dictionary will give you the page structure as a tree and will refer to most of the objects of the PDF. But I don't know how this exactly works; for creating a lazy parser, someone needs to study this part of the spec.

@ 3)
I will take a look at the classes in the next few days and try to work on it as well. Is there an easier way to contribute changes to it, like an extra repository? I can provide a CVS repository if that helps.

Otherwise I will try to do the RandomAccessFile-like structure for PDFBox.

          Adam Nichols added a comment -

          I am aware that my JUnit test currently fails. When it works it'll be a very small milestone for the parser.

          Adam Nichols added a comment -

          I ran into some interesting problems while working on this tonight.
          1.) I realized that some PDFs will not have spaces delimiting each item in a dictionary. For an example see [1]. I looked it up in the spec and found that there was nothing which required white space, which means that this appears to be a conforming PDF, but it's a nightmare to parse. I hacked together code which is sufficient to parse my test PDF, but I need to find a better way to deal with this. The current "solution" is just coding around this one PDF and isn't actually solving anything. Reading until we hit whitespace will sometimes get us the entire object, but sometimes it gets us multiple objects (one object and parts of the next). Reading until "]" ">>" or "/" would lead to false positives as any of these characters can legitimately be in a string object. I'll have to think more about this one...
2.) I solved the infinite recursion problem by keeping loaded objects in memory and referencing the Map (see the sketch after this list). However, the dictionary is parsed before the xref table is read, so there's no way to read these objects as we go; I'm just iterating through the root element and reading all of the items in the trailer dictionary (e.g. Root, Info, Size) which are weak references, as this is required info for doing anything at all with the PDF. Relatedly, I need to find a way to do lazy evaluation on this. Currently, when the parser reads in the root object, it ends up traversing the entire tree. While this isn't a "problem", it shouldn't be necessary until the user requests information from these objects. This will reduce load time, memory usage, and CPU usage (for loading), and just generally be a good thing.
          3.) This isn't really a problem, but for the record, the new parser doesn't currently have support for streams. The parser will just ignore them for now, which seems like a reasonable solution.
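A hedged sketch of the recursion guard from 2.) above (simplified types; parseAt() is a stub): each object is cached under its "objectNumber generation" key, and a reference already in the map is returned from the cache instead of being parsed again. A full implementation would register a placeholder before parsing so that cycles hit the cache even mid-parse.

    import java.util.HashMap;
    import java.util.Map;

    final class LoadedObjectMap {
        private final Map<String, Object> loadedObjects = new HashMap<>();

        Object dereference(long objectNumber, long generation, long byteOffset) {
            String key = objectNumber + " " + generation;
            Object cached = loadedObjects.get(key);
            if (cached != null) {
                return cached;               // seen before: break the cycle here
            }
            Object parsed = parseAt(byteOffset);
            loadedObjects.put(key, parsed);
            return parsed;
        }

        // stand-in for reading and parsing the object at this byte offset
        private Object parseAt(long byteOffset) {
            return new Object();
        }
    }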

          Since people are interested, I'll upload my classes, but they're far from being ready to commit right now. Patches and suggestions are certainly welcome, especially for the readObject() method of ConformingPDFParser.

          [1] 31 0 obj
          <</Length 45 0 R/Length1 568/Length2 1017/Length3 0>>


People

  • Assignee: Adam Nichols
  • Reporter: Adam Nichols
  • Votes: 8
  • Watchers: 10
