Tika / TIKA-711

Word parser doesn't extract optional hyphen correctly

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.0
    • Component/s: parser
    • Labels:
      None

      Description

      We seem not to extract the optional hyphen character correctly in
      the Word parser.

      You can create this char in Word by pressing Ctrl and - (hyphen).
      It's hidden, normally; you have to turn on display of formatting
      marks to see it.

      Ideally we'd get U+00AD (Unicode soft hyphen), I think.

      DOC produces the Unicode replacement char (U+FFFD), which is wrong.

      DOCX and PDF drop the char (which seems acceptable). RTF produces
      U+2027 (hyphenation point) which also seems OK (in TIKA-683 it will
      produce U+00AD).

      PPT and PPTX work correctly (U+00AD).

      So DOC is the only bug I think – I haven't dug into what's wrong
      yet...
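
To compare what each format produces, it helps to list the unusual code points in the extracted text. The following is a minimal sketch in plain Java; the class and method names are hypothetical, and it is not part of Tika or of the attached patch. It assumes you already have the extracted text as a String.

```java
/** Hypothetical diagnostic helper, not part of Tika. */
final class CodePointDump {

    /**
     * Returns every control or non-printable-ASCII code point found in the
     * text, formatted as "U+XXXX", in order of appearance.
     */
    static java.util.List<String> unusual(String text) {
        java.util.List<String> found = new java.util.ArrayList<>();
        text.codePoints().forEach(cp -> {
            // Anything outside printable ASCII (0x20..0x7E) is reported.
            if (cp < 0x20 || cp > 0x7E) {
                found.add(String.format("U+%04X", cp));
            }
        });
        return found;
    }
}
```

With a helper like this, text containing a soft hyphen reports U+00AD, while the broken DOC case would report U+FFFD. Note that it also reports tabs and line breaks, since those fall below 0x20.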

      1. testOptionalHyphen.doc
        22 kB
        Michael McCandless
      2. testOptionalHyphen.docx
        10 kB
        Michael McCandless
      3. testOptionalHyphen.pdf
        44 kB
        Michael McCandless
      4. testOptionalHyphen.ppt
        99 kB
        Michael McCandless
      5. testOptionalHyphen.pptx
        32 kB
        Michael McCandless
      6. testOptionalHyphen.rtf
        30 kB
        Michael McCandless
      7. TIKA-711.patch
        3 kB
        Michael McCandless
      8. TIKA-711.patch
        34 kB
        Michael McCandless

        Activity

        Michael McCandless added a comment -

        Patch.

        Michael McCandless added a comment -

        The WordExtractor seems to receive ASCII 31 ("unit separator") from POI for the optional hyphen, which SafeContentHandler then replaces with the Unicode replacement char (U+FFFD).

        I don't think we can assume ASCII 31 will always mean soft hyphen though...

        Not sure how to fix this.
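
Why ASCII 31 ends up as the replacement char: XML 1.0 allows only tab, line feed and carriage return below U+0020, so U+001F is not a legal XML character. The sketch below illustrates that rule in plain Java. It is an assumption-labeled illustration of the kind of check involved, not SafeContentHandler's actual code, and the class and method names are hypothetical.

```java
/** Hypothetical illustration of XML 1.0 character filtering, not Tika code. */
final class XmlCharCheck {

    /** True if the code point matches the XML 1.0 "Char" production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Replaces every code point that is not legal in XML 1.0 with U+FFFD. */
    static String sanitize(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        text.codePoints().forEach(
                cp -> sb.appendCodePoint(isValidXmlChar(cp) ? cp : 0xFFFD));
        return sb.toString();
    }
}
```

Under this rule a real soft hyphen (U+00AD) passes through untouched, while ASCII 31 is replaced, which matches the DOC behavior described above.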

        Michael McCandless added a comment -

        Curiously, if I use POI's WordToTextConverter command-line tool, it produces U+200B (ZERO WIDTH SPACE) for the optional hyphen, which I think is at least better than ASCII 31. I'm still not sure if there's a POI option we can set to get this character out as U+00AD.

        Michael McCandless added a comment -

        OK, after digging I found out that POI's AbstractWordConverter
        converts ASCII 30 to Unicode non-breaking hyphen (U+2011) and ASCII 31
        to Unicode zero-width space (U+200B), but Tika doesn't. This is why I
        see the "right" behavior when running POI's command-line conversion
        but not with Tika.

        So I think the fix is simple here: just do that same mapping in
        WordExtractor.handleCharacterRun; attached patch does that, and
        enables the test case (now passing).
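
The mapping described above can be sketched as follows. This is a standalone illustration in plain Java, not the attached patch: the class and method names are hypothetical, and it works on a String rather than on a POI character run.

```java
/** Hypothetical sketch of the ASCII 30/31 mapping, not the actual patch. */
final class WordSpecialChars {

    /**
     * Maps Word's special hyphen control characters the same way the comment
     * above says POI's AbstractWordConverter does:
     * ASCII 30 becomes U+2011 (non-breaking hyphen),
     * ASCII 31 becomes U+200B (zero-width space).
     */
    static String mapSpecialHyphens(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case 30:
                    sb.append('\u2011');
                    break;
                case 31:
                    sb.append('\u200B');
                    break;
                default:
                    sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

Applying the mapping before the text reaches the content handler means the control characters never get the chance to be replaced with U+FFFD. All other characters pass through unchanged.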


          People

          • Assignee:
            Michael McCandless
            Reporter:
            Michael McCandless
          • Votes:
            0
            Watchers:
            0
