Tika / TIKA-612

Specify PDFBox options via ParseContext

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.9
    • Fix Version/s: 1.1
    • Component/s: parser
    • Labels:
      None

      Description

      See https://issues.apache.org/jira/browse/TIKA-611. The options used by PDFBox are currently hard-coded in the PDFParser code; we will allow them to be specified via the ParseContext object.

      1. TIKA-612-testcase.patch
        2 kB
        Michael McCandless
      2. TIKA-612.patch
        7 kB
        Michael McCandless
      3. Tika-612.patch
        4 kB
        Julien Nioche
      4. testPDFTwoColumns.pdf
        56 kB
        Michael McCandless

          Activity

          Lau Brino added a comment (edited)

          Hi. Because of this serious bug in PDFBox, https://issues.apache.org/jira/browse/PDFBOX-956, I would appreciate it if you could implement this. It would then be possible to turn suppressDuplicateOverlappingText off.

          Julien Nioche added a comment -

          Patch that allows the options to be specified via the Context object. WDYT?

          Michael McCandless added a comment -

          I'm attaching a test case (it passes), showing a PDF w/ 2 columns and verifying the text within a single column is kept contiguous.

          Jukka Zitting added a comment -

          +1 looks good to me.

          A possible design improvement could be to make PDFParseOptions an interface like the following:

          public interface PDFParseOptions {
              void apply(PDFTextStripper stripper);
          }
          

          The proposed bean class would implement that interface like this:

              public void apply(PDFTextStripper stripper) {
                  stripper.setForceParsing(getForceParsing());
                  stripper.setSortByPosition(getSortByPosition());
              }
          

          This would make it easy for client applications to apply also other PDF parsing settings not currently known by Tika.
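
To make the suggestion concrete, here is a minimal, self-contained sketch of the interface and a bean implementing it. It is an illustration only: StubTextStripper is a hypothetical stand-in for PDFBox's PDFTextStripper (so the example compiles without PDFBox), BeanPDFParseOptions is an assumed name for the bean from the patch, and only the two setters mentioned above are modelled.

```java
// Hypothetical stand-in for PDFBox's PDFTextStripper, so the sketch compiles
// on its own. Only the two setters named in the comment above are modelled.
class StubTextStripper {
    boolean forceParsing;
    boolean sortByPosition;

    void setForceParsing(boolean value) { forceParsing = value; }
    void setSortByPosition(boolean value) { sortByPosition = value; }
}

// The proposed interface, written against the stand-in type.
interface PDFParseOptions {
    void apply(StubTextStripper stripper);
}

// Bean-style implementation (assumed name): it stores the settings and
// copies them onto the stripper when apply() is called.
class BeanPDFParseOptions implements PDFParseOptions {
    private boolean forceParsing;
    private boolean sortByPosition;

    public boolean getForceParsing() { return forceParsing; }
    public void setForceParsing(boolean value) { forceParsing = value; }
    public boolean getSortByPosition() { return sortByPosition; }
    public void setSortByPosition(boolean value) { sortByPosition = value; }

    @Override
    public void apply(StubTextStripper stripper) {
        stripper.setForceParsing(getForceParsing());
        stripper.setSortByPosition(getSortByPosition());
    }
}

class PDFParseOptionsDemo {
    public static void main(String[] args) {
        // Options known to the bean.
        BeanPDFParseOptions bean = new BeanPDFParseOptions();
        bean.setSortByPosition(true);

        // A client-defined implementation: the parser would not need to
        // know which settings it touches.
        PDFParseOptions custom = stripper -> stripper.setForceParsing(true);

        StubTextStripper stripper = new StubTextStripper();
        bean.apply(stripper);
        custom.apply(stripper);
        System.out.println("sortByPosition=" + stripper.sortByPosition
                + ", forceParsing=" + stripper.forceParsing);
    }
}
```

The point of the design is that the parser would only ever call apply(), so a client could supply its own implementation for settings the bean does not cover.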

          Michael McCandless added a comment -

          "This would make it easy for client applications to apply also other PDF parsing settings not currently known by Tika."

          +1, this seems like it'd be more general. E.g., we could fold in get/setSuppressDuplicateOverlappingText (and move it off of PDFParser), and maybe also get/setEnableAutoSpace.

          In general, since there are so many options on PDFTextStripper, and the "right" settings seem to vary PDF by PDF, it's important that we expose full control...

          Michael McCandless added a comment -

          I agree, we probably shouldn't just expose PDFTextStripper
          directly; it'd be better (less API surface area) if we pick certain
          options and expose them ourselves. Then if PDFTextStripper changes
          things, or if we somehow switch to a different PDF lib, we won't break
          our users.

          Alternatively, can we just expose options on PDFParser directly? This is
          more intuitive and direct (you just use setters on the parser), and we
          can name/genericize the options and choose which to expose. (This is
          what I've been doing on the last few PDF issues....)

          Michael McCandless added a comment -

          Patch, just adding setSortByPosition to PDFParser. I think this is more straightforward and lets us control what/how we expose...

          Michael McCandless added a comment -

          I committed the last patch; let's open separate issues for other options that need exposing...

          Jan Høydahl added a comment -

          So how do we set a PDFBox option via ParseContext in practice? Say we want to call setEnableAutoSpace(false).
          The test case attached to this issue calls parser.setEnableAutoSpace(false) directly on the parser, not via the ParseContext.

          Nick Burch added a comment -

          The conclusion was to expose the options on the PDFParser directly instead. setEnableAutoSpace is already supported by PDFParser.

          If you know you have a PDF, create a PDFParser, set the options, then parse.

          If you want to use something like AutoDetectParser but with special PDF options, you have two options. One is to fetch the parsers from the AutoDetectParser, possibly recursing, until you find the PDFParser, and set the options on it. The other is to create a new AutoDetectParser on an explicitly created PDFParser, with the DefaultParser as a fallback.
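
The first of those two options (walking the parser tree until the PDF parser turns up) can be sketched roughly as below. All types here are hypothetical stand-ins so that the example runs without Tika on the classpath. In particular, getChildren() is an assumed accessor; the real composite parser in Tika exposes its children differently, so the traversal would need adapting.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical stand-ins for the Tika parser types, for illustration only.
interface StubParser { }

// Models a composite such as AutoDetectParser that wraps other parsers.
class StubCompositeParser implements StubParser {
    private final List<StubParser> children;

    StubCompositeParser(StubParser... parsers) {
        this.children = Arrays.asList(parsers);
    }

    // Assumed accessor; not the real Tika method.
    List<StubParser> getChildren() { return children; }
}

// Models PDFParser with the one setter discussed in this issue.
class StubPDFParser implements StubParser {
    private boolean enableAutoSpace = true;

    void setEnableAutoSpace(boolean value) { enableAutoSpace = value; }
    boolean getEnableAutoSpace() { return enableAutoSpace; }
}

// Any parser that is neither a composite nor the PDF parser.
class StubOtherParser implements StubParser { }

class ParserFinder {
    // Depth-first search: returns the first PDF parser found in the tree.
    static Optional<StubPDFParser> findPdfParser(StubParser parser) {
        if (parser instanceof StubPDFParser) {
            return Optional.of((StubPDFParser) parser);
        }
        if (parser instanceof StubCompositeParser) {
            for (StubParser child : ((StubCompositeParser) parser).getChildren()) {
                Optional<StubPDFParser> found = findPdfParser(child);
                if (found.isPresent()) {
                    return found;
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        StubParser root = new StubCompositeParser(
                new StubOtherParser(),
                new StubCompositeParser(new StubPDFParser()));
        // Set the option on the nested PDF parser, if there is one.
        findPdfParser(root).ifPresent(pdf -> pdf.setEnableAutoSpace(false));
        System.out.println("found=" + findPdfParser(root).isPresent());
    }
}
```

Because the search returns the parser instance held by the composite, setting an option on the result changes the behaviour of later parses through the composite.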

          Jan Høydahl added a comment -

          Hmm, that's kind of awkward to use from e.g. SolrCell. Any chance of considering a PDFParseOptions on the Context as an alternative?
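
For illustration, the alternative asked about here could look something like the sketch below: the parser looks up an options object in the context by type and falls back to its own defaults when none is present. This is not what was committed for this issue. StubParseContext, StubPDFParseOptions and StubContextAwarePDFParser are hypothetical stand-ins, and the default of true for enableAutoSpace is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a parse context: a map keyed by type.
class StubParseContext {
    private final Map<Class<?>, Object> entries = new HashMap<>();

    <T> void set(Class<T> key, T value) {
        entries.put(key, value);
    }

    // Returns the stored object for the key, or the default if absent.
    <T> T get(Class<T> key, T defaultValue) {
        Object value = entries.get(key);
        return value != null ? key.cast(value) : defaultValue;
    }
}

// Hypothetical options holder; the default value is an assumption.
class StubPDFParseOptions {
    boolean enableAutoSpace = true;
}

// Hypothetical parser that prefers per-call options from the context
// over its own defaults.
class StubContextAwarePDFParser {
    private final StubPDFParseOptions defaults = new StubPDFParseOptions();

    boolean effectiveAutoSpace(StubParseContext context) {
        return context.get(StubPDFParseOptions.class, defaults).enableAutoSpace;
    }

    public static void main(String[] args) {
        StubPDFParseOptions options = new StubPDFParseOptions();
        options.enableAutoSpace = false;

        StubParseContext context = new StubParseContext();
        context.set(StubPDFParseOptions.class, options);

        StubContextAwarePDFParser parser = new StubContextAwarePDFParser();
        System.out.println("autoSpace=" + parser.effectiveAutoSpace(context));
    }
}
```

With this shape a caller such as SolrCell would only need to put an options object into the context it already passes along, rather than locating and configuring the PDF parser instance.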


            People

            • Assignee:
              Michael McCandless
              Reporter:
              Julien Nioche
            • Votes:
              2
              Watchers:
              4
