SOLR-5124: Solr glues words when parsing PDFs under certain circumstances

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 4.4
    • Fix Version/s: None
    • Component/s: update
    • Environment:

      Windows 7 (I don't think this is relevant)

      Description

      For some kinds of PDF documents, Solr glues words together at line breaks under some circumstances (e.g. the last word of line 1 and the first word of line 2 are merged into one word).
      Stand-alone Tika extracts the text correctly. Attached are a sample PDF and screenshots of the Tika output and of the corrupted content indexed by Solr.
      This issue does not occur with all PDF documents. I tried to recreate it with new Word documents that I converted to PDF in several ways, without success. The attached PDF document has a really odd internal structure, but Tika seems to handle even this document correctly.
      Our Solr indices contain a good number of these odd documents, which results in worse suggestions from the Suggester.

      1. 01_alz_2009_folge11_2009_05_28.pdf
        518 kB
        Christoph Straßer
      2. 02_PDF.png
        150 kB
        Christoph Straßer
      3. 03_TikaOutput_GUI_MainContent.png
        81 kB
        Christoph Straßer
      4. 03_TikaOutput_GUI_PlainText.png
        50 kB
        Christoph Straßer
      5. 03_TikaOutput_GUI_StructuredText.png
        57 kB
        Christoph Straßer
      6. 03_TikaOutput.png
        74 kB
        Christoph Straßer
      7. 04_Solr.png
        95 kB
        Christoph Straßer

          Activity

          Christoph Straßer added a comment -

          Added sample-PDF, screenshots of TIKA-Output, screenshot of SOLR-Index.

          Uwe Schindler added a comment -

          I have not looked into DIH's code, but I know that Tika adds the extra whitespace as "ignorable whitespace" XML data. It might be "ignored" by the extraction content handler when it consumes the SAX events.

          Christoph Straßer added a comment -

          Maybe it's related in some way to SOLR-4679, but I'm not sure. (We use the ExtractingRequestHandler.)

          Uwe Schindler added a comment -

          Hi, this is a duplicate of 2 other issues. SOLR-4679 is the main issue. I will close this as duplicate.

          Jack Krupansky added a comment -

          Try doing the update with the extractOnly=true parameter and look at the actual byte codes where the two adjacent terms meet - it may be some odd Unicode value that Solr filters ignore rather than treat as white space.
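The check suggested above can be sketched in Java. This is only an illustration: the class name `BoundaryInspector` and the method `describe` are hypothetical, fetching the `extractOnly=true` response is left out, and Java's `Character.isWhitespace` is used as a stand-in for whatever a given Solr analysis chain treats as whitespace (an assumption, not Solr's actual rule).

```java
import java.util.ArrayList;
import java.util.List;

public class BoundaryInspector {

    /** Lists every code point in text[from, to) as "U+XXXX whitespace|non-whitespace". */
    static List<String> describe(String text, int from, int to) {
        List<String> out = new ArrayList<>();
        int i = from;
        while (i < to) {
            int cp = text.codePointAt(i);
            // Stand-in classification: Java's own notion of whitespace.
            String kind = Character.isWhitespace(cp) ? "whitespace" : "non-whitespace";
            out.add(String.format("U+%04X %s", cp, kind));
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical extracted text: two words joined by a no-break space.
        String extracted = "last\u00A0first";
        for (String line : describe(extracted, 3, 6)) {
            System.out.println(line);
        }
    }
}
```

Pasting the text around the glued word into `describe` shows whether the separator is an ordinary space or line feed, or a character such as U+00A0 that Java does not classify as whitespace.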

          Christoph Straßer added a comment -

          @Jack: There is no odd Unicode character involved. (Screenshot of the extractOnly=true response in Fiddler's raw view attached.)
          @Uwe: Many thanks for taking care of this issue!

          ASF subversion and git services added a comment -

          Commit 1512296 from Uwe Schindler in branch 'dev/trunk'
          [ https://svn.apache.org/r1512296 ]

          SOLR-4679, SOLR-4908, SOLR-5124: Text extracted from HTML or PDF files using Solr Cell was missing ignorable whitespace, which is inserted by TIKA for convenience to support plain text extraction without using the HTML elements. This bug resulted in glued words.
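The mechanism named in the commit message can be illustrated with a small SAX sketch. It is not the code from the commit: the handler classes and the `replay` helper are hypothetical, and the events are replayed by hand rather than produced by Tika. It assumes a handler that collects `characters()` events but does nothing on `ignorableWhitespace()`, which is also the default behaviour of `org.xml.sax.helpers.DefaultHandler`.

```java
import org.xml.sax.helpers.DefaultHandler;

public class IgnorableWhitespaceSketch {

    /** Collects character data and drops ignorable whitespace. */
    static class DroppingHandler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void ignorableWhitespace(char[] ch, int start, int length) {
            // Same as DefaultHandler's default: do nothing.
        }
    }

    /** Same collector, but ignorable whitespace is treated like ordinary character data. */
    static class ForwardingHandler extends DroppingHandler {
        @Override
        public void ignorableWhitespace(char[] ch, int start, int length) {
            characters(ch, start, length);
        }
    }

    /** Replays three hand-written events: a word, an ignorable line break, a word. */
    static String replay(DroppingHandler handler) {
        char[] line1 = "last".toCharArray();
        char[] newline = "\n".toCharArray();
        char[] line2 = "first".toCharArray();
        handler.characters(line1, 0, line1.length);
        handler.ignorableWhitespace(newline, 0, newline.length);
        handler.characters(line2, 0, line2.length);
        return handler.text.toString();
    }

    public static void main(String[] args) {
        System.out.println(replay(new DroppingHandler()));   // words are glued
        System.out.println(replay(new ForwardingHandler())); // line break is preserved
    }
}
```

With the dropping handler the two words come out glued; forwarding the event to `characters()` keeps the separator, so a tokenizer can split the words again.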

          ASF subversion and git services added a comment -

          Commit 1512297 from Uwe Schindler in branch 'dev/branches/branch_4x'
          [ https://svn.apache.org/r1512297 ]

          Merged revision(s) 1512296 from lucene/dev/trunk:
          SOLR-4679, SOLR-4908, SOLR-5124: Text extracted from HTML or PDF files using Solr Cell was missing ignorable whitespace, which is inserted by TIKA for convenience to support plain text extraction without using the HTML elements. This bug resulted in glued words.


            People

            • Assignee:
              Unassigned
              Reporter:
              Christoph Straßer
            • Votes:
              0
              Watchers:
              3
