Tika / TIKA-692

TikaCLI -x or -h on a Word doc sometimes adds newline after </b> tag


Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.10
    • Component/s: parser
    • Labels: None

    Description

      [Note: spinoff from the tika-dev thread "Issue in text extraction in
      Solr / Tika" on Aug 19 2011, by nirnaydewan]

      When parsing a Word doc where some contiguous text is bolded, TikaCLI
      -x or -h produces different output depending on how the user bolded
      the different parts of the text in Word. Sometimes it generates
      output like this:

      <p>F<b>oob</b>a<b>r</b>
      </p>
      

      and other times like this (extra newline & 2 adjacent bold sections):

      <p>F<b>oo</b>
      <b>b</b>a<b>r</b>
      </p>
      

      The extra newline in the second example causes browsers (I tried
      Firefox, Safari, Chrome), JTidy and Tika itself to (incorrectly)
      insert a space when rendering/extracting text, breaking up the word.
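
To make the effect concrete, here is a minimal sketch that approximates how an HTML consumer treats the two outputs: tags are dropped and every run of white space (including a newline) collapses to a single space. The class `BoldRunWhitespaceDemo` and the helper `visibleText` are hypothetical names used only for illustration; they are not part of Tika, and real browsers apply more detailed white space rules than this regex does.

```java
import java.util.regex.Pattern;

public class BoldRunWhitespaceDemo {
    // Matches any tag such as <p>, <b> or </b> (a simplification, not a real HTML parser).
    private static final Pattern TAGS = Pattern.compile("<[^>]+>");
    // Matches runs of white space, including the newline after </b>.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Hypothetical helper: strips tags, then collapses white space runs to one
    // space, roughly as inline HTML content is rendered.
    public static String visibleText(String html) {
        String noTags = TAGS.matcher(html).replaceAll("");
        return WHITESPACE.matcher(noTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // The two outputs quoted in the description above.
        String first = "<p>F<b>oob</b>a<b>r</b>\n</p>";
        String second = "<p>F<b>oo</b>\n<b>b</b>a<b>r</b>\n</p>";

        System.out.println(visibleText(first));   // word stays intact
        System.out.println(visibleText(second));  // newline becomes a space inside the word
    }
}
```

Under these assumptions the first output yields a single word, while the newline between the two bold sections in the second output turns into a space in the middle of the word.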

      While this might be technically correct/OK (ie, XML white space rules
      might say that the non-significant white space after the </b> within
      a <p> should be ignored), I think we should still fix Tika to not
      insert newlines, if we can.
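
As an illustration of the kind of output being asked for, the sketch below emits a paragraph from a list of character runs without writing any newline between inline elements, and keeps one <b> element open across adjacent bold runs. It is not taken from the patches attached to this issue and is not Tika's implementation; the `Run` class, the `BoldRunMerger` class and the `toParagraph` method are hypothetical, and text escaping is omitted for brevity.

```java
import java.util.Arrays;
import java.util.List;

public class BoldRunMerger {
    // Hypothetical model of a Word character run: some text plus a bold flag.
    public static final class Run {
        final String text;
        final boolean bold;

        public Run(String text, boolean bold) {
            this.text = text;
            this.bold = bold;
        }
    }

    // Emits <p>...</p> with no white space between inline elements. A <b> is
    // opened only when bold starts and closed only when bold ends, so adjacent
    // bold runs share one element. Assumes the text needs no XML escaping.
    public static String toParagraph(List<Run> runs) {
        StringBuilder out = new StringBuilder("<p>");
        boolean inBold = false;
        for (Run run : runs) {
            if (run.bold && !inBold) {
                out.append("<b>");
                inBold = true;
            } else if (!run.bold && inBold) {
                out.append("</b>");
                inBold = false;
            }
            out.append(run.text);
        }
        if (inBold) {
            out.append("</b>");
        }
        return out.append("</p>").toString();
    }

    public static void main(String[] args) {
        // "oo" and "b" are separate bold runs, as in the second output above.
        List<Run> runs = Arrays.asList(
                new Run("F", false), new Run("oo", true), new Run("b", true),
                new Run("a", false), new Run("r", true));
        System.out.println(toParagraph(runs));
    }
}
```

With the runs shown in `main`, the two adjacent bold runs end up inside a single <b> element and no newline appears inside the paragraph.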

      Attachments

        1. 0001-TIKA-692-TikaCLI-x-or-h-on-a-Word-doc-sometimes-adds.patch
          1 kB
          Jukka Zitting
        2. 0002-TIKA-692-TikaCLI-x-or-h-on-a-Word-doc-sometimes-adds.patch
          2 kB
          Jukka Zitting
        3. testWORD_bold_character_runs.doc
          22 kB
          Michael McCandless
        4. testWORD_bold_character_runs.doc
          22 kB
          Michael McCandless
        5. testWORD_bold_character_runs2.doc
          22 kB
          Michael McCandless
        6. testWORD_bold_character_runs2.docx
          10 kB
          Michael McCandless
        7. TIKA-692.patch
          16 kB
          Michael McCandless
        8. TIKA-692.patch
          10 kB
          Michael McCandless
        9. TIKA-692.patch
          7 kB
          Michael McCandless
        10. TIKA-692-pretty-print.patch
          5 kB
          Michael McCandless


          People

            Assignee: jukkaz Jukka Zitting
            Reporter: mikemccand Michael McCandless