PDFBOX-3058: Support TIKA Migration to PDFBox 2.0

Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 2.0.0
    • Component/s: Text extraction
    • Labels: None

    Description

      This issue tracks fixes for problems that came up as part of TIKA-1285 (Upgrade to PDFBox 2.0.0 when available), mainly:

      • new exceptions compared to PDFBox 1.8.x
      • regressions in text extraction
      • lower quality text extraction

      Each task or bug arising from this should be tracked in its own individual issue.

      Attachments

        1. content_diffs-1.8-to-2.0.xlsx
          1.52 MB
          Tilman Hausherr
        2. NZAZKTQYKDD2HSBCSJJN6XSEA4KJEONU_1_8_10.json
          2 kB
          Tim Allison
        3. NZAZKTQYKDD2HSBCSJJN6XSEA4KJEONU_2_0.json
          2 kB
          Tim Allison
        4. textLostFromACausedByNewExceptionsInB.zip
          42 kB
          Tim Allison
        5. content_diffs-4.xlsx
          3.14 MB
          Tilman Hausherr


            People

              lehmi Andreas Lehmkühler
              msahyoun Maruan Sahyoun
              Votes: 0
              Watchers: 4
