Tika / TIKA-683

RTF Parser issues with non-European characters

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.9
    • Fix Version/s: 0.10
    • Component/s: parser
    • Labels: None

      Description

      As reported on user@ in "non-West European languages support":
      http://mail-archives.apache.org/mod_mbox/tika-user/201107.mbox/%3COF0C0A3275.DA7810E9-ONC22578CC.0051EEDE-C22578CC.0052548B@il.ibm.com%3E

      The RTF Parser seems to be doubling up some non-European characters.

      1. testRTFJapanese.rtf
        24 kB
        Nick Burch
      2. testUnicodeUCNControlWordCharacterDoubling.rtf
        0.6 kB
        Cristian Vat
      3. TIKA-683.patch
        2 kB
        Cristian Vat
      4. TIKA-683-unicode-testcase.patch
        2 kB
        Michael McCandless
      5. TIKA-683.patch
        16 kB
        Michael McCandless
      6. testWORD_bold_character_runs.docx
        10 kB
        Michael McCandless
      7. testWORD_bold_character_runs2.docx
        10 kB
        Michael McCandless
      8. TIKA-683.patch
        126 kB
        Michael McCandless
      9. TIKA-683.patch
        215 kB
        Michael McCandless

        Activity

        Nick Burch created issue -
        Nick Burch added a comment -

        Add test file. Based on Jp_euc-jp_rtf1.rtf from http://mail-archives.apache.org/mod_mbox/tika-user/201106.mbox/%3COF03CF5CF6.40C9789F-ONC22578BC.0035A24F-C22578BC.0036C220@il.ibm.com%3E but with the images removed to keep the size sane.

        Nick Burch made changes -
        Field Original Value New Value
        Attachment testRTFJapanese.rtf [ 12486630 ]
        Hide
        Nick Burch added a comment -

        I couldn't use the test as-is, as it contains raw Japanese characters in an unknown encoding (rather than \uxxxx escape sequences), and the sample file was too large.

        I've re-saved the sample file without the images, and tested with that. That does extract exactly as expected - no doubling up occurs. I've added a unit test for this in r1147200.

        Are you able to get a small RTF file that does show the problem, along with a suitable unit test similar to the testJapaneseText() method in RTFParserTest?

        Cristian Vat added a comment -

        Test file for \ucN control word character doubling

        Cristian Vat made changes -
        Cristian Vat added a comment -

        I managed to take the original file and slim it down to (possibly) the smallest test case. See "testUnicodeUCNControlWordCharacterDoubling.rtf", 566 bytes.

        The test file contains only one character (\u5E74). Checked with the latest Tika SVN, and it is doubled.

        The character is defined both as an RTF Unicode escape (\uXXXX) and as two RTF charset/font-specific byte escapes (\'xx).
        The file is correct, since it does specify a Unicode skip count, but that skip is not taken into account.

        Checked only with RTFEditorKit, and that parses fine.
        This is most likely caused by the changes in TIKA-422, which don't take the \ucN control word into account and thus show both versions of the character.
        I'll try to look over the code and see what can be done.

        Note on the issue name: the current name isn't very accurate. The doubling could also occur with European characters; it all depends on how the RTF generator chooses to encode some characters. A better one would be: "RTFParser doubling characters in some RTF files".
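
        For reference, a minimal hypothetical RTF fragment of the same shape
        as the attached test file (the exact bytes in the attachment may
        differ): the character U+5E74 (decimal 24180) written both ways, with
        \uc2 declaring that two fallback bytes follow each \uN escape:

          {\rtf1\ansi{\fonttbl{\f0\fcharset128 MS Mincho;}}\uc2\f0 \u24180\'94\'4e}

        A conforming reader emits the character once and swallows the two
        \'xx bytes per the \uc2 skip count; a reader that ignores \ucN emits
        both encodings, i.e. the character twice.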

        Cristian Vat added a comment -

        Patch with reduced test file and new test for character doubling in RTFParserTest

        Cristian Vat made changes -
        Attachment TIKA-683.patch [ 12489624 ]
        Michael McCandless added a comment -

        NOTE: I know very little about RTF! So please forgive/correct any
        confusions below:

        It looks like we need a stack to record the \ucN control words we've
        encountered at each depth, and we must then skip N ANSI chars after
        each \uXXXX we see (similarly to how we track the charset with
        charsetQueue now).

        I.e., on seeing \uXXXX (possibly followed by a trailing space, which
        does not count toward the skip count), we parse and keep that Unicode
        character, re-emitting the \uXXXX in our output data, but then we
        remove the following N ANSI chars.
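
        In code, that bookkeeping might look roughly like the sketch below
        (hypothetical names throughout; this is not the actual RTFParser
        code, just the \ucN / \uN / skip interaction described above):

          import java.util.ArrayDeque;
          import java.util.Deque;

          class UnicodeSkipState {
              // One ucN value per open group; the RTF default skip count is 1.
              private final Deque<Integer> ucStack = new ArrayDeque<Integer>();
              private int ucSkip = 1;      // current ucN skip count
              private int pendingSkip = 0; // fallback chars left to swallow

              void groupStart() { ucStack.push(ucSkip); }

              void groupEnd() {
                  ucSkip = ucStack.isEmpty() ? 1 : ucStack.pop();
                  pendingSkip = 0; // a pending skip does not survive its group
              }

              // Called when a ucN control word is seen.
              void onUcControlWord(int n) { ucSkip = n; }

              // Called for a uN unicode escape: emit the UTF-16 code unit
              // and arm the skip counter with the current ucN value.
              char onUnicodeEscape(int n) {
                  pendingSkip = ucSkip;
                  return (char) n; // N is a signed 16-bit decimal in RTF
              }

              // Called for each fallback char (hex escape or plain text)
              // that follows; returns true if the char must be suppressed.
              boolean shouldSkip() {
                  if (pendingSkip > 0) { pendingSkip--; return true; }
                  return false;
              }
          }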

        Some other things I noticed in RTFParser.java; I'm not sure if they
        are really a problem in practice:

        • I'm worried about how we replace \cell with \u0020\cell –
          depending on the last \ucN control word, this could mean we
          incorrectly skip some number of ANSI chars? Changing to
          {\u0020}\cell would be safer, since on group end the pending skip

        • But then I also wonder if all the additional groups we are
          creating (because we surround each \uXXXX with { }) are somehow
          costly, eg if it causes RTFEditorKit to use more RAM / be slower /
          something.
        • When we look for the \ansicpgNNNN control word, I noticed we then
          look up the NNNN in the FONTSET_MAP – is that wrong? E.g., when I
          look at the possible values for NNNN (at
          http://latex2rtf.sourceforge.net/rtfspec_6.html) I see a bunch of
          numbers that aren't in the FONTSET_MAP. We also use FONTSET_MAP
          for \fcharsetNNN, but the values for that control word look
          correct.
        • We don't seem to handle the opening charset in the RTF header (ie,
          \ansi, \mac, \pc, \pca)?
        Michael McCandless added a comment -

        I was curious/nervous about whether the RTFParser (and the RTF format itself) properly handles non-BMP Unicode characters, so with Robert Muir's help I created a basic test case (attached), and indeed, at least for these Gothic characters in particular, non-BMP is handled fine: the test passes.

        It turns out (apparently) each \uN escape is a UTF-16 code unit, not a Unicode code point.
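
        A quick standalone illustration (a hypothetical snippet, not part of
        any attached patch), using GOTHIC LETTER AHSA, U+10330: running it
        prints \u-10240\u-8400, i.e. two RTF-style signed escapes for one
        code point.

          public class SurrogateDemo {
              public static void main(String[] args) {
                  int codePoint = 0x10330;
                  for (char unit : Character.toChars(codePoint)) {
                      // one RTF escape per UTF-16 code unit, written as a
                      // signed 16-bit decimal
                      System.out.printf("\\u%d", (int) (short) unit);
                  }
                  System.out.println();
              }
          }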

        Michael McCandless made changes -
        Attachment TIKA-683-unicode-testcase.patch [ 12490640 ]
        Chris A. Mattmann made changes -
        Assignee Chris A. Mattmann [ chrismattmann ]
        Chris A. Mattmann made changes -
        Status Open [ 1 ] In Progress [ 3 ]
        Chris A. Mattmann added a comment -

        Guys, I see there is a patch from Cristian (looks like the code update) and one from Mike (the test case). Are we seeing that this resolves the issue? If so, I can commit it, with the test case update from Mike (+Robert) and the sample files, but I wanted to check first. I have some free cycles, but I am by no means a UTF expert, nor a non-European character expert. I'm just willing to help get these committed, and then let you experts tell me whether it works or not.

        Michael McCandless added a comment -

        Thanks Chris!

        Actually, both Cristian's patch and mine are test cases.

        Cristian's test case fails (showing this issue); we don't yet have a patch to fix this issue (but we know what's wrong – we have to handle the \ucN control words).

        My test case (TIKA-683-unicode-testcase.patch) passes and can be committed right away – it's testing another aspect of RTF+Unicode which (happily) seems to be working correctly.

        I also attached a new test case, passing, on TIKA-422, so if you could commit that one also that'd be great!

        Chris A. Mattmann added a comment -

        Thanks Mike, I went ahead and committed your patch in TIKA-422 (r1158779) and your unit test patch in TIKA-683 in r1158785.

        Michael McCandless added a comment -

        Super, thanks Chris!

        Cristian Vat added a comment -

        Thanks, Mike, for looking into the issues. I also know very little about RTF.

        Yes, the skipping is basically: skip N ANSI chars.
        Actually, the JDK RTFEditorKit/RTFReader already does this, and does it well as far as I could see.

        There are also other flaws in the current filtering we do. For example, the skipping of binary data sequences is not handled correctly...

        I went through all the classes in/used by RTFEditorKit, and it appears that it handles most things correctly except the \'xx escape, where it uses a default translation table that does not take the current font charset into account.
        Right now I'm trying to figure out whether I can add that behavior by subclassing RTFEditorKit/RTFReader. That, I think, would be the best solution to this issue and other related ones. It would also avoid temporary files and maybe improve performance.

        Just in case it can't be done with subclassing, does anybody know what the licensing restrictions on the JDK classes are (mainly RTFEditorKit, RTFReader)? It may be doable by modifying them a little...

        Jukka Zitting added a comment -

        > Just in case it can't be done with subclassing, anybody know what the licensing
        > restrictions on the JDK classes is? (mainly RTFEditorKit, RTFReader ).

        They should be available under GPLv2 from the OpenJDK project.

        And it actually looks like Apache Harmony added an initial ALv2-licensed RTF parser
        in HARMONY-5903. I haven't tried that code yet.

        Michael McCandless added a comment -

        > Right now I'm trying to figure out if I can add that behavior by subclassing RTFEditorKit/RTFReader.

        Ooh, that sounds interesting! Does it have enough hooks so a subclass
        can "tag along" to know what font is in use and then intercept the
        \'XX hex escapes?

        Poaching either Harmony's parser or maybe OpenOffice's (C, but we
        could port the parts we poach to Java) seems like a good way to go?

        Either that or we make our own simple tokenizer? The RTF spec looks
        [relatively] simple enough, and Tika only needs to get the text out
        (at least for today?), so we need not do heavy parsing of all
        formatting / document structure. A simple tokenizer that just decoded
        the control words we care about (charset, font defaults, tables)
        should work well and be robust to parser bugs / small errors in the
        doc.
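
        To make that concrete, here is a rough sketch of such a shallow
        tokenizer loop (hypothetical code, not the attached patch): it only
        distinguishes groups, control words, hex escapes, and plain text,
        leaving all interpretation to the hook methods.

          import java.io.IOException;
          import java.io.PushbackInputStream;

          class RtfTokenizer {
              private final PushbackInputStream in;

              RtfTokenizer(PushbackInputStream in) { this.in = in; }

              void run() throws IOException {
                  int b;
                  while ((b = in.read()) != -1) {
                      if (b == '{') onGroupStart();
                      else if (b == '}') onGroupEnd();
                      else if (b == '\\') readControl();
                      else onChar((char) b);
                  }
              }

              private void readControl() throws IOException {
                  int b = in.read();
                  if (b == '\'') { // hex escape: two hex digits follow
                      int hi = in.read(), lo = in.read();
                      onHexByte((Character.digit(hi, 16) << 4)
                              | Character.digit(lo, 16));
                  } else if (Character.isLetter(b)) {
                      // control word with an optional numeric parameter
                      StringBuilder word = new StringBuilder();
                      while (Character.isLetter(b)) {
                          word.append((char) b);
                          b = in.read();
                      }
                      StringBuilder num = new StringBuilder();
                      while (b == '-' || Character.isDigit(b)) {
                          num.append((char) b);
                          b = in.read();
                      }
                      // a single space terminates the control word;
                      // anything else belongs to the next token
                      if (b != ' ' && b != -1) in.unread(b);
                      onControlWord(word.toString(), num.length() == 0
                              ? null : Integer.valueOf(num.toString()));
                  } else {
                      onChar((char) b); // control symbol, e.g. \{ \} \\
                  }
              }

              // Hooks for a shallow parser layer to override:
              void onGroupStart() {}
              void onGroupEnd() {}
              void onControlWord(String word, Integer param) {}
              void onHexByte(int value) {}
              void onChar(char c) {}
          }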

        I'm also worried about the test coverage of our RTF parsing... it
        would be nice to find (or somehow randomly generate) some biggish
        collection of RTF + "expected text" test cases. Maybe we can poach
        tests from OpenOffice....

        I noticed some tests allow for / expect extra whitespace to be
        inserted in the returned text, but that makes me nervous... I think
        (ideally) Tika shouldn't insert extra whitespace if we can help it.
        Though, some cases likely need it, eg text from adjacent table cells.

        Michael McCandless added a comment -

        New patch attached, including the last (pretty-print) patch; plus, I noticed that the OOXML Word parser also split up adjacent bold character runs, so I fixed that and added two .docx files for testing.

        Michael McCandless made changes -
        Attachment TIKA-683.patch [ 12491275 ]
        Attachment testWORD_bold_character_runs.docx [ 12491276 ]
        Attachment testWORD_bold_character_runs2.docx [ 12491277 ]
        Michael McCandless added a comment -

        Sorry, wrong issue – that last patch was meant for TIKA-692.

        Michael McCandless added a comment -

        I'm now testing the approach of just making our own simple RTF tokenizer that handles those control words relevant to the text we need... I'll post a patch once I have something sort of working.

        Michael McCandless made changes -
        Assignee Chris A. Mattmann [ chrismattmann ] Michael McCandless [ mikemccand ]
        Michael McCandless added a comment -

        Attached patch, with a first cut at using a simple (shallow) tokenizer
        to interpret the specific RTF control words that determine what text
        is rendered. I built this using the 1.9.1 RTF specification:

        http://www.microsoft.com/download/en/details.aspx?displaylang=en&id=10725

        It's still rough (many nocommits) but I think it's close. All tests
        pass, including a few new RTF test cases I've added.

        I just created a custom tokenizer (the allowed RTF tokens are very
        simple) and shallow parser. I think later we can/should cutover to a
        "real" tokenizer/parser (eg JFlex)...

        The new parser does a better job at extracting some doc structure;
        the current parser just makes a single paragraph, but the new one
        makes a paragraph whenever the doc said there was one. But it doesn't
        give structure for tables or lists (it does extract their text).

        It finds text that the old parser missed, e.g. footnotes, hyperlinks,
        headers/footers, and text inside a picture, and [generally] does not
        add extra whitespace (the old one sometimes breaks a word by
        inserting a space). Finally, the new parser fixes the Unicode
        character doubling (this issue)...

        One thing I still have to fix is that it can output mis-matched tags
        for i/b styles (spookily nothing failed; maybe we should add simple
        validation (under asserts) eg to XHTMLContentHandler?).

        Michael McCandless made changes -
        Attachment TIKA-683.patch [ 12492653 ]
        Uwe Schindler added a comment -

        XML SAX handling does not validate element names, e.g. that opening and closing elements match. The serializers mostly just output the elements they get reported; some of them will go crazy on mismatched events.

        The reason for this is that SAX is in general seldom used to generate XML documents; more often it is XML parsers reporting the elements they found in an XML document, and those parsers do the validating beforehand. So, in theory, your parser must do this itself; for speed reasons there are no checks in the serializers. You can enforce checks by piping the whole stuff through the javax.xml.validation API, but that would also check against a schema, which does not really exist for XHTML.

        Jukka Zitting added a comment -

        +1, I'm eager to see us drop the javax.swing dependency with something we can directly fix and improve.

        The org.apache.tika.sax.SafeContentHandler class already does some sanitization of SAX events, so that might be a good place to also check that tags are correctly nested. Though, as Uwe said, ideally the generator of the SAX events would already take care of producing valid output.
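
        The nesting check itself could be as simple as a SAX decorator that
        keeps a stack of open element names; a minimal sketch (hypothetical
        class, not the existing SafeContentHandler code):

          import java.util.ArrayDeque;
          import java.util.Deque;
          import org.xml.sax.Attributes;
          import org.xml.sax.ContentHandler;
          import org.xml.sax.SAXException;
          import org.xml.sax.helpers.XMLFilterImpl;

          class BalancedTagHandler extends XMLFilterImpl {
              private final Deque<String> open = new ArrayDeque<String>();

              BalancedTagHandler(ContentHandler delegate) {
                  setContentHandler(delegate); // forward all events downstream
              }

              @Override
              public void startElement(String uri, String local, String name,
                      Attributes atts) throws SAXException {
                  open.push(name);
                  super.startElement(uri, local, name, atts);
              }

              @Override
              public void endElement(String uri, String local, String name)
                      throws SAXException {
                  assert !open.isEmpty() && name.equals(open.peek())
                          : "unbalanced </" + name + ">, open: " + open;
                  if (!open.isEmpty()) open.pop();
                  super.endElement(uri, local, name);
              }

              @Override
              public void endDocument() throws SAXException {
                  assert open.isEmpty() : "unclosed elements: " + open;
                  super.endDocument();
              }
          }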

        PS. I'd rather use a separate .java file for the ExtractRTFText class than have it as a static inner class inside RTFParser. We can keep it package-private if we don't want to expose it directly to downstream clients.

        Michael McCandless added a comment -

        Thanks Jukka! That's a good idea to move the ExtractRTFText class out; I'll do that.

        I'll mull over how to assert that the SAX start/end elements are valid...

        Michael McCandless added a comment -

        New patch; I think it's ready! Changes from last patch:

        • Factored out separate source files for the TextExtractor and
          GroupState classes
        • Added a few more RTF test cases
        • Added optional loading of ICU4J's Charset impl, if available; I
          did this in CharsetUtils.forName (see the sketch after this list)
        • Removed dup test cases from TestParsers (they were already
          previously copied to RTFParserTest)
        • Cleaned up confusing interleaved bytes/chars buffering in the
          parser
        • Added balanced tag asserts to SafeContentHandler; this helped me
          fix the RTFParser, however, other parsers seem to trip the assert
          (do not produce balanced start/end elements). I didn't dig into
          this, and commented out the asserts; I'll open a separate issue to
          pursue that.
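
        For the ICU4J item above, the optional loading can be done with
        reflection so that the dependency stays optional. A hedged sketch of
        the approach (only com.ibm.icu.charset.CharsetProviderICU is a real
        ICU4J class; everything else here is made up):

          import java.nio.charset.Charset;
          import java.nio.charset.spi.CharsetProvider;

          public final class CharsetUtilsSketch {
              private static final CharsetProvider ICU = loadIcu();

              private static CharsetProvider loadIcu() {
                  try {
                      return (CharsetProvider) Class
                              .forName("com.ibm.icu.charset.CharsetProviderICU")
                              .newInstance();
                  } catch (Throwable t) {
                      return null; // ICU4J absent: fall back to the JDK
                  }
              }

              public static Charset forName(String name) {
                  if (ICU != null) {
                      // CharsetProvider returns null for unsupported names
                      Charset cs = ICU.charsetForName(name);
                      if (cs != null) return cs;
                  }
                  return Charset.forName(name); // may throw if unsupported
              }
          }
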
        Michael McCandless made changes -
        Attachment TIKA-683.patch [ 12494077 ]
        Chris A. Mattmann added a comment -

        Hey Mike, +1 to commit, go for it!

        Michael McCandless added a comment -

        Thanks Chris, I'll commit today!

        Michael McCandless added a comment -

        I'll open a follow-on issue for the mis-matched XHTML events from some parsers....

        Michael McCandless made changes -
        Status In Progress [ 3 ] Resolved [ 5 ]
        Fix Version/s 1.0 [ 12313535 ]
        Resolution Fixed [ 1 ]
        Michael McCandless added a comment -

        I opened TIKA-715 for the mis-matched XHTML events.

        Jukka Zitting made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee: Michael McCandless
          • Reporter: Nick Burch
          • Votes: 0
          • Watchers: 2
