TIKA-648

Parsing HTML anchors with embedded div faulty

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.9
    • Fix Version/s: None
    • Component/s: parser
    • Labels:
      None

      Description

      Using Nutch with Tika 0.9, I cannot extract both outlinks from a given page
      [1]. This is because Tika doesn't return the document with the anchor text
      embedded, and Nutch skips empty anchors when collecting outlinks.

      The raw HTML is:
      <a href="#"><div>bla 1</div></a>
      <a href="#">bla 2</a>

      But the HTML output of tika-app-1.0-SNAPSHOT.jar -h test.html is:
      <a shape="rect" href="#"/>bla 1
      <a shape="rect" href="#">bla 2</a>
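
      To illustrate why the second form loses an outlink, here is a minimal
      sketch of a collector that skips anchors with no text. It is not Nutch's
      actual code: the class OutlinkSketch and the method collectAnchors are
      hypothetical names, and it uses only the JDK's SAX parser on the two
      fragments shown above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; this is not Nutch's actual outlink code.
class OutlinkSketch {

    /** Returns the text of every <a href> element whose text is non-empty. */
    static List<String> collectAnchors(String xhtmlFragment) {
        final List<String> anchors = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder text; // non-null only while inside an <a href>

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if ("a".equals(qName) && atts.getValue("href") != null) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (text != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if ("a".equals(qName) && text != null) {
                    String anchor = text.toString().trim();
                    if (!anchor.isEmpty()) { // anchors without text are skipped
                        anchors.add(anchor);
                    }
                    text = null;
                }
            }
        };
        try {
            // Wrap the fragment in a root element so it is well-formed XML.
            String doc = "<body>" + xhtmlFragment + "</body>";
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(doc)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse fragment", e);
        }
        return anchors;
    }

    public static void main(String[] args) {
        String raw = "<a href=\"#\"><div>bla 1</div></a>\n<a href=\"#\">bla 2</a>";
        String parsed = "<a shape=\"rect\" href=\"#\"/>bla 1\n"
                + "<a shape=\"rect\" href=\"#\">bla 2</a>";
        System.out.println(collectAnchors(raw));    // [bla 1, bla 2]
        System.out.println(collectAnchors(parsed)); // [bla 2]
    }
}
```

      With the raw markup both anchors carry text; with the parsed markup the
      first anchor is empty, so a collector of this kind reports only one link.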

      [1]: http://people.apache.org/~markus/test.html

      Also described on the Tika user list:
      http://search.lucidimagination.com/search/document/e74d7e72fd61543a/parsing_html_anchors_with_embedded_div_faulty

          Activity

          Jukka Zitting added a comment -

          Yep, this probably needs to be addressed in one way or another within TagSoup. But since this issue also affects Tika users, let's leave it open to track progress.

          Ken Krugler added a comment -

          I think this should be closed, and an improvement request made against TagSoup.

          The issue is that TagSoup currently will close the open <a> tag when it hits the <div>. But it could hold onto that markup until it gets something else that indicates it's time to assume a missing closing </a>. Then, when it does see the </a>, it could emit the text while dumping the <div> tags. I know, pretty ugly, but I think that's how browsers handle it.
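
          The two behaviours described above can be sketched on a pre-tokenized
          stream. This is only an assumption-laden illustration, not TagSoup's
          implementation: the class AnchorRepairSketch and both method names are
          hypothetical, tags carry no attributes, and only <a> and <div> are
          handled. The output in the report additionally shows no <div> at all,
          which this sketch does not model.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; TagSoup's real logic is schema-driven and more involved.
class AnchorRepairSketch {

    /** Behaviour described above: a <div> inside an open <a> closes the <a>. */
    static String closeAnchorAtDiv(List<String> tokens) {
        StringBuilder out = new StringBuilder();
        boolean anchorOpen = false;
        for (String t : tokens) {
            if (t.equals("<a>")) {
                anchorOpen = true;
                out.append(t);
            } else if (t.equals("</a>")) {
                if (anchorOpen) { // a stray </a> after a forced close is ignored
                    out.append(t);
                    anchorOpen = false;
                }
            } else {
                if (t.equals("<div>") && anchorOpen) {
                    out.append("</a>"); // force-close the anchor before the block
                    anchorOpen = false;
                }
                out.append(t);
            }
        }
        if (anchorOpen) {
            out.append("</a>");
        }
        return out.toString();
    }

    /** Proposed behaviour: keep the <a> open, drop <div> tags, keep the text. */
    static String keepTextInAnchor(List<String> tokens) {
        StringBuilder out = new StringBuilder();
        boolean anchorOpen = false;
        for (String t : tokens) {
            if (t.equals("<a>")) {
                if (anchorOpen) {
                    out.append("</a>"); // a new anchor implies a missing </a>
                }
                anchorOpen = true;
                out.append(t);
            } else if (t.equals("</a>")) {
                if (anchorOpen) {
                    out.append(t);
                    anchorOpen = false;
                }
            } else if (anchorOpen && (t.equals("<div>") || t.equals("</div>"))) {
                // drop the <div> markup but leave the anchor open
            } else {
                out.append(t);
            }
        }
        if (anchorOpen) {
            out.append("</a>");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("<a>", "<div>", "bla 1", "</div>", "</a>");
        System.out.println(closeAnchorAtDiv(tokens)); // <a></a><div>bla 1</div>
        System.out.println(keepTextInAnchor(tokens)); // <a>bla 1</a>
    }
}
```

          In the first variant the anchor ends up empty, which is what causes
          the lost outlink; in the second the text stays inside the anchor at
          the cost of discarding the <div> markup.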

          Markus Jelsma added a comment -

          Thanks. I assume this is not something that needs to be fixed in Tika then. Should it be closed and marked as won't fix?

          Jukka Zitting added a comment -

          This seems to be a result of TagSoup normalizing the HTML markup. An inline element like <a> is not supposed to contain block elements like <div>. I'm not sure if this should be treated as a bug or a feature in TagSoup.


            People

            • Assignee:
              Unassigned
            • Reporter:
              Markus Jelsma
            • Votes:
              0
            • Watchers:
              1
