
PDFParser with optional bookmarks text extraction

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.14
    • Fix Version/s: None
    • Component/s: parser
    • Labels:

      Description

      I would like to parse a PDF without extracting its bookmarks and outlines.

      I was thinking of adding a new parameter to PDFParserConfig, with an option such as 'ExtractBookmarks', and checking it in 'AbstractPDF2XHTML'.

      I can implement this, and I would like to submit a patch with the change.

      Thanks in advance.
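
The proposal above can be illustrated with a minimal, self-contained sketch. It does not use Tika itself: `SketchConfig` and `render` are hypothetical stand-ins for PDFParserConfig and the check in AbstractPDF2XHTML, and the default value of `true` is an assumption made so that existing behavior would be preserved. The flag name `extractBookmarksText` is the one given in the pull request description further down this page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BookmarkFlagSketch {

    /** Hypothetical stand-in for PDFParserConfig; only the proposed flag is modeled. */
    static class SketchConfig {
        // Assumption: default to true so callers who never set the flag
        // still get bookmark text, as before the change.
        private boolean extractBookmarksText = true;

        boolean getExtractBookmarksText() {
            return extractBookmarksText;
        }

        void setExtractBookmarksText(boolean extractBookmarksText) {
            this.extractBookmarksText = extractBookmarksText;
        }
    }

    /**
     * Hypothetical stand-in for the output step: page text is always emitted,
     * bookmark text only when the flag is enabled.
     */
    static List<String> render(List<String> pageText, List<String> bookmarks, SketchConfig config) {
        List<String> out = new ArrayList<>(pageText);
        if (config.getExtractBookmarksText()) {
            out.addAll(bookmarks);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> pageText = Arrays.asList("Page 1 body", "Page 2 body");
        List<String> bookmarks = Arrays.asList("Chapter 1");

        SketchConfig config = new SketchConfig();
        System.out.println(render(pageText, bookmarks, config)); // flag on: bookmarks included

        config.setExtractBookmarksText(false);
        System.out.println(render(pageText, bookmarks, config)); // flag off: page text only
    }
}
```

Keeping the check in the rendering step rather than in the caller means the parser's callers only have to set one configuration value.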

          Activity

          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Tika-trunk #1221 (See https://builds.apache.org/job/Tika-trunk/1221/)
          Fix for TIKA-2303 contributed by ppalazon (ppalazon: https://github.com/apache/tika/commit/e9ff4c023ba3f0ad913ac32af04cddde25bb914c)

          • (edit) tika-parsers/src/main/resources/org/apache/tika/parser/pdf/PDFParser.properties
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/pdf/AbstractPDF2XHTML.java
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
          TIKA-2303 – allow users to configure whether or not to extract (tallison: https://github.com/apache/tika/commit/22f6ccfe09c89c59510a710b2b3560c92a53f334)

          • (edit) tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
          • (edit) tika-parsers/src/main/java/org/apache/tika/parser/pdf/AbstractPDF2XHTML.java
          githubbot ASF GitHub Bot added a comment -

          GitHub user tballison closed the pull request at:

          https://github.com/apache/tika/pull/157

          githubbot ASF GitHub Bot added a comment -

          ppalazon closed pull request #157: Fix for TIKA-2303 contributed by ppalazon.
          URL: https://github.com/apache/tika/pull/157

          ----------------------------------------------------------------
          This is an automated message from the Apache Git Service.
          To respond to the message, please log on GitHub and use the
          URL above to go to the specific comment.

          For queries about this service, please contact Infrastructure at:
          users@infra.apache.org

          githubbot ASF GitHub Bot added a comment -

          GitHub user ppalazon opened a pull request:

          https://github.com/apache/tika/pull/157

          Fix for TIKA-2303 contributed by ppalazon.

          Added a new optional parameter to PDFParserConfig that controls
          whether bookmarks are extracted from a PDF. Its name is extractBookmarksText.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/ppalazon/tika TIKA-2303

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/tika/pull/157.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #157


          commit e9ff4c023ba3f0ad913ac32af04cddde25bb914c
          Author: Pablo Palazón <ppalazon@antara.ws>
          Date: 2017-03-16T17:23:34Z

          Fix for TIKA-2303 contributed by ppalazon


          githubbot ASF GitHub Bot added a comment -

          ppalazon opened a new pull request #157: Fix for TIKA-2303 contributed by ppalazon.
          URL: https://github.com/apache/tika/pull/157

          Added a new optional parameter to PDFParserConfig that controls
          whether bookmarks are extracted from a PDF. Its name is extractBookmarksText.

          tallison@mitre.org Tim Allison added a comment -

          +1, committers are standing by.


            People

            • Assignee: Unassigned
            • Reporter: ppalazon Pablo Palazon
            • Votes: 0
            • Watchers: 4

                Time Tracking

                • Original Estimate: 10m
                • Remaining Estimate: 10m
                • Time Spent: Not Specified