Nutch / NUTCH-794

Language Identification must check the parse metadata for language values

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.1
    • Component/s: parser
    • Labels:
      None

      Description

      The following HTML document :

      <html lang="fi"><head>document 1 title</head><body>jotain suomeksi</body></html>

      is rendered as the following xhtml by Tika :

      <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml"><head><title/></head><body>document 1 titlejotain suomeksi</body></html>

      with the lang attribute getting lost. The lang is not stored in the metadata either.

      I will open an issue on Tika and modify TestHTMLLanguageParser so that the tests no longer break.
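
      A quick way to confirm the symptom is to parse the XHTML quoted above and read the lang attribute from the root element. This is only a minimal sketch using the JDK's own XML parser, not Tika or Nutch code; the class and method names are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class LangAttributeCheck {
    // The XHTML output quoted in the description above.
    static final String TIKA_XHTML = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
        + "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head><title/></head>"
        + "<body>document 1 titlejotain suomeksi</body></html>";

    // Returns the lang attribute of the root element, or "" when it is absent.
    static String rootLang(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            // Element.getAttribute returns an empty string for a missing attribute.
            return doc.getDocumentElement().getAttribute("lang");
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("lang=[" + rootLang(TIKA_XHTML) + "]");
    }
}
```

      An empty result here means any code that only inspects the DOM has nothing to work with for this document.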

      1. NUTCH-794.patch
        2 kB
        Julien Nioche

          Activity

          Chris A. Mattmann added a comment -

          Hey Julien, yepper, I posted an RC of Tika 0.7, see: http://bit.ly/c7FZRc. If the VOTE passes on that in say the next 72 hours, I will push out a Tika 0.7 release to the mirrors. If everyone is OK with that, we can release Nutch 1.1 after...thoughts?

          Julien Nioche added a comment -

          The issue has not been fixed in Tika. Will refile post 1.1 as you suggested. Can we update to Tika 0.7 before finalising 1.1?

          Chris A. Mattmann added a comment -

          @julien – I think this issue has been fixed in Tika right? If not, feel free to reopen, or better yet, re-file the issue against a post 1.1 Nutch release. Thanks!

          Hudson added a comment -

          Integrated in Nutch-trunk #1071 (see http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1071/): Language Identification must check the parse metadata for language values

          Julien Nioche added a comment -

          Committed patch in revision 910454

          Waiting for issue to be fixed in Tika before closing this issue

          Julien Nioche added a comment -

          Apart from the lang attribute of the html element being lost (see above), there is a second issue: Tika does not put the lang attributes in its XHTML representation but stores them in the metadata instead.
          I will shortly release a patch to address that in the class HTMLLanguageParser.
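
          The idea of checking the parse metadata before falling back to the DOM could look roughly like the sketch below. It is not the committed patch: a plain Map stands in for the parse metadata, and the key names ("dc.language", "Content-Language", "lang") and their order are assumptions made for illustration, as are the class and method names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LanguageLookupSketch {
    // Hypothetical metadata keys, tried in this order (assumed, not taken from the patch).
    static final String[] CANDIDATE_KEYS = {"dc.language", "Content-Language", "lang"};

    // Returns the first non-blank language value found in the metadata, or null.
    static String fromMetadata(Map<String, String> parseMeta) {
        for (String key : CANDIDATE_KEYS) {
            String value = parseMeta.get(key);
            if (value != null && !value.trim().isEmpty()) {
                return value.trim();
            }
        }
        return null;
    }

    // Prefers the metadata value; falls back to the lang attribute read from the DOM.
    static String resolve(Map<String, String> parseMeta, String domLang) {
        String lang = fromMetadata(parseMeta);
        if (lang != null) {
            return lang;
        }
        return (domLang == null || domLang.trim().isEmpty()) ? null : domLang.trim();
    }

    public static void main(String[] args) {
        Map<String, String> meta = new LinkedHashMap<>();
        meta.put("Content-Language", "fi");
        // The DOM carries no lang attribute, so the metadata value is used.
        System.out.println(resolve(meta, ""));
    }
}
```

          A null result would be the point where statistical language identification on the text takes over.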


            People

            • Assignee:
              Julien Nioche
              Reporter:
              Julien Nioche