TIKA-648

Parsing HTML anchors with embedded div faulty

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 0.9
    • Fix Version/s: None
    • Component/s: parser
    • Labels: None

      Description

      Using Nutch with Tika 0.9, I cannot extract both outlinks from a given page
      [1]. This is because Tika doesn't return the document with the anchor text
      embedded, and Nutch skips empty anchors when collecting outlinks.

      The raw HTML is:
      <a href="#"><div>bla 1</div></a>
      <a href="#">bla 2</a>

      But the parsed output from tika-app-1.0-SNAPSHOT.jar -h test.html is:
      <a shape="rect" href="#"/>bla 1
      <a shape="rect" href="#">bla 2</a>

      [1]: http://people.apache.org/~markus/test.html

      Also described on the Tika user list:
      http://search.lucidimagination.com/search/document/e74d7e72fd61543a/parsing_html_anchors_with_embedded_div_faulty
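
The effect on outlink collection can be illustrated with a small sketch. This is not Nutch's actual code: `OutlinkCollector` and `collect_outlinks` are hypothetical names, and the sketch uses Python's standard `html.parser` (which does no normalization) only to show what a collector that skips empty anchors sees in each of the two snippets above.

```python
from html.parser import HTMLParser


class OutlinkCollector(HTMLParser):
    """Hypothetical collector: gathers (href, anchor text), skipping empty anchors."""

    def __init__(self):
        super().__init__()
        self.outlinks = []
        self._href = None  # href of the currently open <a>, or None
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            if text:  # anchors without text are dropped
                self.outlinks.append((self._href, text))
            self._href = None


def collect_outlinks(markup):
    collector = OutlinkCollector()
    collector.feed(markup)
    collector.close()
    return collector.outlinks


# The two snippets quoted in the description.
RAW = '<a href="#"><div>bla 1</div></a>\n<a href="#">bla 2</a>'
PARSED = '<a shape="rect" href="#"/>bla 1\n<a shape="rect" href="#">bla 2</a>'

print(collect_outlinks(RAW))     # [('#', 'bla 1'), ('#', 'bla 2')]
print(collect_outlinks(PARSED))  # [('#', 'bla 2')]
```

In the parsed output the first anchor is self-closed, so its text "bla 1" falls outside the element and the anchor is skipped.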

          Activity

          Markus Jelsma created issue -
          Markus Jelsma made changes -
          Field Original Value New Value
          Link This issue blocks NUTCH-984 [ NUTCH-984 ]
          Markus Jelsma made changes -
          Fix Version/s 1.0 [ 12313535 ]
          Jukka Zitting added a comment -

          This seems to be a result of TagSoup normalizing the HTML markup. An inline element like <a> is not supposed to contain block elements like <div>. I'm not sure if this should be treated as a bug or a feature in TagSoup.
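
A toy model of this kind of normalization, for illustration only. It is not TagSoup's implementation; the token format, the `BLOCK`/`INLINE` sets, and the function name `close_inline_on_block` are assumptions made for the sketch. It closes any open inline element as soon as a block element starts and drops the end tag that is then left without a partner.

```python
# Illustrative subsets only; real HTML has many more elements of each kind.
BLOCK = {"div", "p", "ul", "table"}
INLINE = {"a", "span", "em", "b"}


def close_inline_on_block(tokens):
    """Close open inline elements when a block element starts (toy model)."""
    out = []
    open_stack = []  # names of currently open elements, innermost last
    for kind, value in tokens:
        if kind == "start":
            if value in BLOCK:
                # A block element may not sit inside an inline one:
                # close the open inline elements first, innermost first.
                while open_stack and open_stack[-1] in INLINE:
                    out.append(("end", open_stack.pop()))
            open_stack.append(value)
            out.append((kind, value))
        elif kind == "end":
            if value in open_stack:
                # Close everything up to and including the matching element.
                while True:
                    name = open_stack.pop()
                    out.append(("end", name))
                    if name == value:
                        break
            # Otherwise the end tag has no open partner and is dropped.
        else:
            out.append((kind, value))
    return out


# <a><div>bla 1</div></a>
tokens = [("start", "a"), ("start", "div"), ("text", "bla 1"),
          ("end", "div"), ("end", "a")]
print(close_inline_on_block(tokens))
```

For the markup from the description, the `<a>` is closed before the `<div>` opens, so the anchor ends up with no text, which matches the shape of the reported output.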

          Markus Jelsma added a comment -

          Thanks. I assume this is not something that needs to be fixed in Tika then. Should it be closed and marked as won't fix?

          Ken Krugler added a comment -

          I think this should be closed, and an improvement request made against TagSoup.

          The issue is that TagSoup currently will close the open <a> tag when it hits the <div>. But it could hold onto that markup until it gets something else that indicates it's time to assume a missing closing </a>. Then, when it does see the </a>, it could emit the text while dumping the <div> tags. I know, pretty ugly, but I think that's how browsers handle it.
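
A rough sketch of the deferred-close idea described above, under stated assumptions. It is not TagSoup code: the token format and the name `defer_anchor_close` are hypothetical, and the comment does not say which signal should mean "the `</a>` is missing", so this sketch assumes either another `<a>` start tag or the end of input.

```python
BLOCK = {"div", "p"}  # illustrative subset of block elements


def defer_anchor_close(tokens):
    """Hold back anchor content until </a>, then emit it without block tags."""
    out = []
    held = None  # tokens held back since the last <a>, or None if no <a> is open

    for kind, value in tokens:
        if held is None:
            out.append((kind, value))
            if (kind, value) == ("start", "a"):
                held = []
        elif (kind, value) == ("end", "a"):
            # The closing tag arrived: keep text and inline markup, drop block tags.
            out.extend(t for t in held
                       if not (t[0] in ("start", "end") and t[1] in BLOCK))
            out.append(("end", "a"))
            held = None
        elif (kind, value) == ("start", "a"):
            # Assumed signal that </a> is missing: close the earlier anchor
            # right away and replay the held tokens unchanged.
            out.append(("end", "a"))
            out.extend(held)
            out.append((kind, value))
            held = []
        else:
            held.append((kind, value))

    if held is not None:
        # End of input without </a>: same fallback as above.
        out.append(("end", "a"))
        out.extend(held)
    return out


# <a><div>bla 1</div></a>
tokens = [("start", "a"), ("start", "div"), ("text", "bla 1"),
          ("end", "div"), ("end", "a")]
print(defer_anchor_close(tokens))
# [('start', 'a'), ('text', 'bla 1'), ('end', 'a')]
```

With this approach the anchor keeps its text, at the cost of buffering everything after each `<a>` until the matching end tag or a fallback signal is seen.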

          Jukka Zitting added a comment -

          Yep, this probably needs to be addressed in one way or another within TagSoup. But since this issue also affects Tika users, let's leave it open to track progress.

          Jukka Zitting made changes -
          Fix Version/s 0.10 [ 12313535 ]
          Gavin made changes -
          Link This issue blocks NUTCH-984 [ NUTCH-984 ]
          Gavin made changes -
          Link This issue is depended upon by NUTCH-984 [ NUTCH-984 ]
          Tyler Palsulich made changes -
          Status Open [ 1 ] Closed [ 6 ]
          Resolution Won't Fix [ 2 ]
          Transition: Open -> Closed
          Time In Source Status: 1405d 6h 16m
          Execution Times: 1
          Last Executer: Tyler Palsulich
          Last Execution Date: 01/Mar/15 22:16

            People

            • Assignee: Unassigned
            • Reporter: Markus Jelsma
            • Votes: 0
            • Watchers: 1
