Tika / TIKA-2459

Missing text in .doc file (but can be extracted by POI)

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.16
    • Fix Version/s: 1.17
    • Component/s: None
    • Labels: None
    • Environment: Windows and Linux

    Description

      I have a document whose text is extracted in full by org.apache.poi.hwpf.converter.WordToTextConverter, but only partially by Tika. The 'Paragraph one' paragraph appears in POI's output but is missing from Tika's output.

      Tika's output:

      Something
      One:
      Else
      Two:
      Here
      Three:
      Four
      
      Paragraph two
      Paragraph three
      Paragraph four
      cc: Somebody
           Somebody else
      Something here too
      

      POI's output:

      Something
      One:    Else
      Two:    Here
      Three:  Four
      
      Paragraph one
      
      Paragraph two
      
      Paragraph three
      
      Paragraph four
      
      
      cc: Somebody
           Somebody else
      
      
      Something here too
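
To make the discrepancy above easy to spot, here is a minimal sketch that lists the lines of one extraction that do not appear in the other. It is not part of the original report: the class name `MissingTextCheck` and its methods are hypothetical, and the two strings are simply the outputs quoted above. Whitespace is collapsed before comparing, on the assumption that differences in line breaks and tab expansion (for example "One:    Else" versus "One:" and "Else" on separate lines) are not the problem being reported.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika or POI.
class MissingTextCheck {

    /** Collapses every run of whitespace into one space and trims the ends. */
    static String normalize(String s) {
        return s.replaceAll("\\s+", " ").trim();
    }

    /**
     * Returns the non-blank lines of reference whose whitespace-normalized
     * text does not occur anywhere in the whitespace-normalized candidate.
     */
    static List<String> missingLines(String reference, String candidate) {
        String haystack = normalize(candidate);
        List<String> missing = new ArrayList<>();
        for (String line : reference.split("\\R")) {
            String needle = normalize(line);
            if (!needle.isEmpty() && !haystack.contains(needle)) {
                missing.add(needle);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Outputs copied from the report above.
        String poi = String.join("\n",
            "Something", "One:    Else", "Two:    Here", "Three:  Four", "",
            "Paragraph one", "", "Paragraph two", "", "Paragraph three", "",
            "Paragraph four", "", "", "cc: Somebody", "     Somebody else",
            "", "", "Something here too");
        String tika = String.join("\n",
            "Something", "One:", "Else", "Two:", "Here", "Three:", "Four", "",
            "Paragraph two", "Paragraph three", "Paragraph four",
            "cc: Somebody", "     Somebody else", "Something here too");

        System.out.println(missingLines(poi, tika)); // prints [Paragraph one]
    }
}
```

The check is a plain substring test, so a short line that also occurs inside a longer one (such as "Something" inside "Something here too") would not be flagged even if it were dropped. For real documents the two strings would come from the POI converter and from Tika's text output; how they are obtained is outside this sketch.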
      

      Attachments

        1. foo2.doc
          25 kB
          Dustin Spicuzza

          People

            Assignee: Unassigned
            Reporter: Dustin Spicuzza (virtuald)