Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.9
    • Fix Version/s: 1.0
    • Component/s: parser
    • Labels:
      None

      Description

      I spotted this while working on TIKA-631: when an RTF file contains links, each link is skipped over entirely, so neither the link text nor the link href is output.

      In the attached sample file (which is the RTF contents of /test-documents/test-outlook2003.msg), we should see things like:

      <a href="http://r.office.microsoft.com/r/rlidOutlookWelcomeMail1?clid=1033">Streamlined Mail Experience</a> - Outlook

      Instead, all we get is " - Outlook"

      1. test.rtf
        29 kB
        Nick Burch
      2. TIKA-632.patch
        9 kB
        Michael McCandless

        Activity

        Cristian Vat added a comment -

        Tika uses RTFEditorKit from javax.swing.text.rtf for the actual RTF Parsing and that doesn't seem to support links.

        In the example you provided links are actually marked using two methods:

        • \htmlrtf tags which are "Control Words Introduced by Specific/Other Microsoft Products"
        • \field instances of type hyperlink, which seem to be the normal RTF way of adding links

        However the RTF Parser in Swing ignores a lot of "unknown" control words, including \field completely.
        For reference, there is a bug opened in 1999 and closed as "Will Not Fix" to enhance RTF Parsing ( http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4261277 )

        To quote Jukka from another issue: "there's little we can do about this as long as we're stuck with the Swing RTF parser".

        Nick Burch added a comment -

        Now that we have our own RTF parser, it may be possible to add this. For an example, the RTF from /test-documents/test-outlook2003.msg for a part containing a hyperlink is the delightful:

        -----------

        {\*\htmltag84 <I>}

        \htmlrtf {\i \htmlrtf0 If you want to let us know what you think about Outlook 2003, reply to this message. We're always looking for feedback from the people who use Outlook every day! If you would like to keep up with the latest information about Outlook, sign up for a free subscription to the

        {\*\htmltag84 <A HREF="http://r.office.microsoft.com/r/rlidNewsletterSignUp?clid=1033">}

        \htmlrtf {\field{\*\fldinst{HYPERLINK "http://r.office.microsoft.com/r/rlidNewsletterSignUp?clid=1033"}}

        {\fldrslt\cf1\ul \htmlrtf0 Inside Office Newsletter\htmlrtf }

        \htmlrtf0 \htmlrtf }\htmlrtf0

        {\*\htmltag92 </A>}

        . The newsletter will be sent to you by e-mail on a regular basis.
        -----------
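
The \field group in the sample above carries the link target in \fldinst and the visible text in \fldrslt. As a rough illustration of what needs to be pulled out, here is a minimal regex-based sketch. It is not how Tika's RTF parser works (a real parser tracks groups and control words as it tokenizes), and the class and method names (RtfHyperlinkSketch, extract, Link) are hypothetical. It assumes flat, un-nested \fldrslt groups like the one in the sample.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of Tika.
final class RtfHyperlinkSketch {

    /** A link target plus its visible text. */
    static final class Link {
        final String url;
        final String text;

        Link(String url, String text) {
            this.url = url;
            this.text = text;
        }
    }

    // Matches: \fldinst{HYPERLINK "url"}}  {\fldrslt ...visible text...}
    // Group 1 is the URL, group 2 is the raw content of the \fldrslt group.
    private static final Pattern FIELD = Pattern.compile(
            "\\\\fldinst\\s*\\{\\s*HYPERLINK\\s+\"([^\"]*)\"\\s*\\}\\s*\\}"
            + "\\s*\\{\\\\fldrslt([^{}]*)\\}");

    // An RTF control word: backslash, letters, optional numeric parameter,
    // and at most one delimiting whitespace character.
    private static final Pattern CONTROL_WORD =
            Pattern.compile("\\\\[a-zA-Z]+(?:-?\\d+)?\\s?");

    /** Returns every HYPERLINK field found in the given RTF fragment. */
    static List<Link> extract(String rtf) {
        List<Link> links = new ArrayList<>();
        Matcher m = FIELD.matcher(rtf);
        while (m.find()) {
            // Drop formatting control words such as \cf1, \ul and \htmlrtf0.
            String text = CONTROL_WORD.matcher(m.group(2)).replaceAll("").trim();
            links.add(new Link(m.group(1), text));
        }
        return links;
    }

    public static void main(String[] args) {
        String rtf =
                "{\\field{\\*\\fldinst{HYPERLINK "
                + "\"http://r.office.microsoft.com/r/rlidNewsletterSignUp?clid=1033\"}}\n"
                + "{\\fldrslt\\cf1\\ul \\htmlrtf0 Inside Office Newsletter\\htmlrtf }}";
        for (Link link : extract(rtf)) {
            System.out.println(link.text + " -> " + link.url);
        }
    }
}
```

A regex like this breaks as soon as the \fldrslt group contains nested groups or escaped braces, which is why link extraction belongs inside the parser itself.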

        Michael McCandless added a comment -

        Patch, adding hyperlink extraction to the RTF parser, and enabling the OutlookParserTest case (it passes).

        I think it's ready to commit... I'll wait until after 0.10 is out.
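
For readers unfamiliar with how such output surfaces: Tika parsers report content as XHTML SAX events, so an extracted hyperlink would reach callers as an a element with an href attribute, matching the expected output in the description. The sketch below uses only the JDK's org.xml.sax classes to show that event sequence. The names (AnchorEventsSketch, emitLink, MarkupRecorder, render) are hypothetical, and the patch itself may structure this differently.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; not taken from the attached patch.
final class AnchorEventsSketch {

    static final String XHTML = "http://www.w3.org/1999/xhtml";

    /** Emits the SAX events for <a href="url">text</a>. */
    static void emitLink(ContentHandler out, String url, String text)
            throws SAXException {
        AttributesImpl attrs = new AttributesImpl();
        attrs.addAttribute("", "href", "href", "CDATA", url);
        out.startElement(XHTML, "a", "a", attrs);
        out.characters(text.toCharArray(), 0, text.length());
        out.endElement(XHTML, "a", "a");
    }

    /** Records events as markup for inspection (no escaping, demo only). */
    static final class MarkupRecorder extends DefaultHandler {
        final StringBuilder sb = new StringBuilder();

        @Override
        public void startElement(String uri, String local, String qName,
                                 Attributes atts) {
            sb.append('<').append(qName);
            for (int i = 0; i < atts.getLength(); i++) {
                sb.append(' ').append(atts.getQName(i))
                  .append("=\"").append(atts.getValue(i)).append('"');
            }
            sb.append('>');
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            sb.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            sb.append("</").append(qName).append('>');
        }
    }

    /** Convenience wrapper: emits one link and returns the recorded markup. */
    static String render(String url, String text) {
        MarkupRecorder recorder = new MarkupRecorder();
        try {
            emitLink(recorder, url, text);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return recorder.sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(render(
                "http://r.office.microsoft.com/r/rlidOutlookWelcomeMail1?clid=1033",
                "Streamlined Mail Experience"));
    }
}
```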


          People

          • Assignee:
            Michael McCandless
            Reporter:
            Nick Burch
          • Votes:
            0
            Watchers:
            1

            Dates

            • Created:
              Updated:
              Resolved:
