Tika / TIKA-2459

Missing text in .doc file (but can be extracted by POI)

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.16
    • Fix Version/s: 1.17
    • Component/s: None
    • Labels: None
    • Environment: Windows and Linux

      Description

      I have a .doc file whose text is fully extracted by org.apache.poi.hwpf.converter.WordToTextConverter, but only partially extracted by Tika. The 'Paragraph one' paragraph appears in POI's output and is missing from Tika's output.

      Tika's output:

      Something
      One:
      Else
      Two:
      Here
      Three:
      Four
      
      Paragraph two
      Paragraph three
      Paragraph four
      cc: Somebody
           Somebody else
      Something here too
      

      POI's output:

      Something
      One:    Else
      Two:    Here
      Three:  Four
      
      Paragraph one
      
      Paragraph two
      
      Paragraph three
      
      Paragraph four
      
      
      cc: Somebody
           Somebody else
      
      
      Something here too
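
One way to confirm exactly which text is missing is to compare the two outputs mechanically. The following is a minimal sketch, not part of the original report: the class `MissingLines` and its methods are hypothetical names, and the two strings are copied by hand from the outputs above rather than produced by calling Tika or POI. Because Tika and POI lay out whitespace and line breaks differently, the sketch collapses all whitespace before checking whether each POI line occurs in Tika's text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists reference lines that are absent from a candidate text.
public class MissingLines {

    // Collapse every run of whitespace (spaces, tabs, newlines) to one space.
    static String normalize(String text) {
        return text.trim().replaceAll("\\s+", " ");
    }

    // Return the non-blank lines of `reference` whose normalized form does not
    // occur, on word boundaries, anywhere in the normalized `candidate` text.
    static List<String> missingLines(String reference, String candidate) {
        String haystack = " " + normalize(candidate) + " ";
        List<String> missing = new ArrayList<>();
        for (String line : reference.split("\\R")) {
            String needle = normalize(line);
            if (!needle.isEmpty() && !haystack.contains(" " + needle + " ")) {
                missing.add(needle);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Outputs transcribed from this report (assumed to be faithful copies).
        String poi = "Something\nOne:    Else\nTwo:    Here\nThree:  Four\n\n"
                + "Paragraph one\n\nParagraph two\n\nParagraph three\n\n"
                + "Paragraph four\n\n\ncc: Somebody\n     Somebody else\n\n\n"
                + "Something here too\n";
        String tika = "Something\nOne:\nElse\nTwo:\nHere\nThree:\nFour\n\n"
                + "Paragraph two\nParagraph three\nParagraph four\n"
                + "cc: Somebody\n     Somebody else\nSomething here too\n";

        System.out.println(missingLines(poi, tika)); // prints [Paragraph one]
    }
}
```

Matching on normalized text rather than on raw lines means that layout differences such as "One:" and "Else" landing on separate lines in Tika's output are not reported as missing content.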
      

        Attachments

        1. foo2.doc
          25 kB
          Dustin Spicuzza

            People

            • Assignee:
              Unassigned
            • Reporter:
              Dustin Spicuzza (virtuald)
            • Votes:
              0
            • Watchers:
              3
