TIKA-2459: Missing text in .doc file (but can be extracted by POI)

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.16
    • Fix Version/s: 1.17
    • Component/s: None
    • Labels:
      None
    • Environment:

      Windows and Linux

      Description

      I have a document whose text can be extracted via org.apache.poi.hwpf.converter.WordToTextConverter, but Tika does not extract all of it. The 'Paragraph one' paragraph is present in POI's output but missing from Tika's output.

      Tika's output:

      Something
      One:
      Else
      Two:
      Here
      Three:
      Four
      
      Paragraph two
      Paragraph three
      Paragraph four
      cc: Somebody
           Somebody else
      Something here too
      

      POI's output:

      Something
      One:    Else
      Two:    Here
      Three:  Four
      
      Paragraph one
      
      Paragraph two
      
      Paragraph three
      
      Paragraph four
      
      
      cc: Somebody
           Somebody else
      
      
      Something here too
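
One way to confirm exactly which text is missing is to compare the two outputs after collapsing whitespace, since Tika and POI lay out the tab-separated lines differently. The following is a minimal sketch using only the Java standard library; the class name `MissingTextSketch` and its methods are hypothetical helpers for illustration, not part of Tika or POI.

```java
import java.util.ArrayList;
import java.util.List;

class MissingTextSketch {
    /** Collapses every run of whitespace to a single space and trims the ends. */
    static String normalize(String s) {
        return s.trim().replaceAll("\\s+", " ");
    }

    /**
     * Returns the non-blank lines of the reference output whose normalized
     * text does not occur anywhere in the normalized candidate output.
     */
    static List<String> missingLines(String reference, String candidate) {
        String haystack = normalize(candidate);
        List<String> missing = new ArrayList<>();
        for (String line : reference.split("\\R")) {
            String needle = normalize(line);
            if (!needle.isEmpty() && !haystack.contains(needle)) {
                missing.add(needle);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Shortened versions of the two outputs quoted in this report.
        String tika = "Something\nOne:\nElse\n\nParagraph two\n";
        String poi = "Something\nOne:    Else\n\nParagraph one\n\nParagraph two\n";
        System.out.println(missingLines(poi, tika)); // prints [Paragraph one]
    }
}
```

Substring matching on normalized text is deliberately loose: it ignores the line-break differences between the two extractors and reports only text that is absent altogether.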
      
      Attachment: foo2.doc (25 kB, Dustin Spicuzza)

        Activity

        tallison@mitre.org Tim Allison added a comment -

        Thank you for opening this and sharing a test file. We hadn't seen \u0014 and \u0015 together in the same character run before. This is now fixed.
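
For context, in the binary Word format \u0013 marks the start of a field, \u0014 separates the field code from its displayed result, and \u0015 ends the field. The sketch below shows, on a plain string, how a stripper can keep field results and ordinary text while dropping field codes and the markers themselves, including the case where \u0014 and \u0015 sit next to each other. It is an illustration under those assumptions only: `FieldMarkerSketch` and `visibleText` are hypothetical names, and this is not the code from the Tika commit.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class FieldMarkerSketch {
    static final char FIELD_BEGIN = '\u0013';
    static final char FIELD_SEPARATOR = '\u0014';
    static final char FIELD_END = '\u0015';

    /** Drops field codes and field markers; keeps field results and ordinary text. */
    static String visibleText(String raw) {
        StringBuilder out = new StringBuilder();
        // One entry per open field: TRUE while we are still inside its code part.
        Deque<Boolean> inCode = new ArrayDeque<>();
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == FIELD_BEGIN) {
                inCode.push(Boolean.TRUE);
            } else if (c == FIELD_SEPARATOR && !inCode.isEmpty()) {
                // Switch the innermost field from its code part to its result part.
                inCode.pop();
                inCode.push(Boolean.FALSE);
            } else if (c == FIELD_END && !inCode.isEmpty()) {
                inCode.pop();
            } else if (c != FIELD_SEPARATOR && c != FIELD_END
                    && !inCode.contains(Boolean.TRUE)) {
                // Ordinary text, or a field result with no enclosing field code.
                out.append(c);
            }
            // Stray separator or end markers outside any field are dropped.
        }
        return out.toString();
    }
}
```

With this sketch, a made-up input such as `"cc: \u0013 MERGEFIELD name \u0014Somebody\u0015 done"` yields `"cc: Somebody done"`, and text that follows an empty field result (`\u0014` immediately followed by `\u0015`) is preserved rather than swallowed.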

        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build Tika-trunk #1361 (See https://builds.apache.org/job/Tika-trunk/1361/)
        TIKA-2459 – fix special character handling (tallison: https://github.com/apache/tika/commit/d1a8bff9faacb828a1039f7cc2c7f9e1f1d5e3fd)

        • (add) tika-parsers/src/test/resources/test-documents/testWORD_specialControlCharacter1415.doc
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java
        • (edit) tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java
        tallison@mitre.org Tim Allison added a comment -

        It looks like Tika's handleSpecialCharacterRuns(...) goes back basically 7 years. Nick Burch or others, any idea why we don't use Range.stripFields() from POI for this?


          People

          • Assignee: Unassigned
          • Reporter: Dustin Spicuzza (virtuald)
          • Votes: 0
          • Watchers: 3
