SOLR-5124

Solr glues words when parsing PDFs under certain circumstances

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 4.4
    • Fix Version/s: None
    • Component/s: update
    • Environment: Windows 7 (I don't think this is relevant)

    Description

      For some kinds of PDF documents, Solr glues words together at line breaks (e.g. the last word of line 1 and the first word of line 2 are merged into one word).
      Standalone Tika extracts the text correctly. Attached are a sample PDF and screenshots of the Tika output and of the corrupted content indexed by Solr.
      This issue does not occur with all PDF documents. I tried to reproduce it with new Word documents that I converted to PDF in several ways, without success. The attached PDF document has a really weird internal structure, but Tika seems to do its work right even with this weird document.
      Our Solr indices contain a good number of these weird documents, which results in worse suggestions from the Suggester.
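
One way to check other documents for this symptom is to compare the text extracted by standalone Tika with the content stored in Solr and look for indexed words that are the concatenation of two adjacent words in the Tika output. The following is only a minimal sketch in Python, assuming both texts are already available as plain strings. The function name `find_glued_words` and the sample strings are hypothetical and are not taken from the attached PDF.

```python
import re


def find_glued_words(reference_text, indexed_text):
    """Return (first, second, glued) triples where two adjacent words of the
    reference text appear merged into a single word in the indexed text."""
    ref_words = re.findall(r"\w+", reference_text)
    ref_vocab = set(ref_words)
    indexed_words = set(re.findall(r"\w+", indexed_text))

    glued = []
    for first, second in zip(ref_words, ref_words[1:]):
        candidate = first + second
        # Ignore concatenations that are legitimate words in the reference.
        if candidate in indexed_words and candidate not in ref_vocab:
            glued.append((first, second, candidate))
    return glued


# Hypothetical sample: the line break after "budget" is lost in the index.
tika_output = "The council approved the budget\nyesterday after a long debate"
solr_content = "The council approved the budgetyesterday after a long debate"

print(find_glued_words(tika_output, solr_content))
# → [('budget', 'yesterday', 'budgetyesterday')]
```

The check only reports the symptom. It says nothing about where in the extraction or indexing chain the whitespace is lost.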

      Attachments

        1. 01_alz_2009_folge11_2009_05_28.pdf
          518 kB
          Christoph Straßer
        2. 02_PDF.png
          150 kB
          Christoph Straßer
        3. 03_TikaOutput.png
          74 kB
          Christoph Straßer
        4. 03_TikaOutput_GUI_MainContent.png
          81 kB
          Christoph Straßer
        5. 03_TikaOutput_GUI_PlainText.png
          50 kB
          Christoph Straßer
        6. 03_TikaOutput_GUI_StructuredText.png
          57 kB
          Christoph Straßer
        7. 04_Solr.png
          95 kB
          Christoph Straßer

            People

              Assignee: Unassigned
              Reporter: Christoph Straßer (christophs78)
              Votes: 0
              Watchers: 3
