Nutch / NUTCH-257

Summary#toString always Entity encodes -- problem for OpenSearchServlet#description field

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.8
    • Fix Version/s: 0.8
    • Component/s: None
    • Labels: None

    Description

      Every piece of search result data we display (title, url, etc.) has to be explicitly Entity.encoded when it is output in search.jsp, except Summaries, which are already Entity.encoded. This is fine when outputting HTML, but it gets in the way when outputting anything else, XML for example. I'd suggest we not make any presumption about how search results are used.

      The problem becomes especially acute when the text is in a language other than English.

      Here is an example of what a Czech description field in an OpenSearchServlet hit record looks like:

      <description><span class="ellipsis"> ... </span>V&#283;deck&aacute; knihovna v Olomouci Bezru&#269;ova 2, Olomouc 9, 779 11, &#268;esk&aacute; republika &nbsp; tel. +420-585223441 &nbsp; fax +420-585225774 http://www.&lt;span class="highlight">vkol</span>.cz/ &nbsp;&nbsp; info@<span class="highlight">vkol</span>.cz Otev&#345;eno : &nbsp; po-p&aacute; &nbsp; 8 30 -19 00 &nbsp;&nbsp;&nbsp; so &nbsp; 9 00 -13 00 &nbsp;&nbsp;&nbsp; ne &nbsp; zav&#345;eno V katalogu s &uacute;pln&yacute;m &#269;asov&yacute;m<span class="ellipsis"> ... </span>03 Organizace 20/12 Odkazy 19/04 Hledej 23/03 &nbsp; 23/03 &nbsp; Po&#269;et p&#345;&iacute;stup&#367; od 1.9.1998. Statistiky . [ ] &nbsp; [ Nahoru ] <span class="highlight">VKOL</span></description>

      Here is the same description field with Entity.encoding disabled:

      <description><span class="ellipsis"> ... </span>tisky statistiky knihovny WWW serveru st?edov?ké rukopisy studovny CD-ROM historických fond? hlavní Internet N?mecké knihovny vázaných novin SVKOL viz <span class="highlight">VKOL</span> ?atna T telefonní ?ísla knihovny zam?stnanc? U V vazba v?cný popis vedení knihovny vedoucí odd?lení video <span class="highlight">VKOL</span> volný výb?r výp?j?ka výro?ní zpráva výstavy W webmaster WWW odkazy X Y Z - ? zamluvení knihy zahrani?ní periodika zpracování fondu<span class="highlight">VKOL</span> - hledej Hledej [ <span class="highlight">VKOL</span> ] [ Novinky ] [ Katalog ] [ Slu?by ] [ Aktivity ] [ Pr?vodce ] [ Dokumenty ] [ Regionální fce ] [ Organizace ] [ Odkazy ] [ Hledej ] [ ] [ ] Obsah full-textové vyhledávání, 19/04/2003 rejst?ík vybraných<span class="ellipsis"> ... </span></description>

      Notice how the Czech characters in the first example are all entity encoded, most of them numerically, i.e. as &#NNN;.

      I'd suggest that Summary#toString() become Summary#toEntityEncodedString() and that toString() return the raw aggregation of Fragments. This would likely require adding methods to the HitSummarizer interface, and to NutchBean, so that callers can ask for either raw or entity-encoded text. Or, better, I'd suggest that the Summarizer never return Entity.encoded text and that the encoding happen in search.jsp instead (I can make a patch to do the latter if it's amenable).
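
      The first proposal could look roughly like the sketch below. It is a simplified stand-in, not Nutch's actual code: SummarySketch, Kind, add() and encode() are hypothetical names invented for illustration, and the encoder only handles a handful of characters. Only the toString() / toEntityEncodedString() split comes from the suggestion above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified stand-in for a summary made of fragments (not the real Nutch class). */
class SummarySketch {
    /** Kinds of fragment a summary may hold (assumed for this sketch). */
    enum Kind { TEXT, HIGHLIGHT, ELLIPSIS }

    private static final class Fragment {
        final Kind kind;
        final String text;
        Fragment(Kind kind, String text) { this.kind = kind; this.text = text; }
    }

    private final List<Fragment> fragments = new ArrayList<>();

    /** Appends a fragment and returns this summary so calls can be chained. */
    SummarySketch add(Kind kind, String text) {
        fragments.add(new Fragment(kind, text));
        return this;
    }

    /** Raw aggregation of fragments: no markup and no entity encoding. */
    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : fragments) {
            sb.append(f.text);
        }
        return sb.toString();
    }

    /** HTML rendering: fragment text is entity encoded and wrapped in span markup. */
    String toEntityEncodedString() {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : fragments) {
            switch (f.kind) {
                case HIGHLIGHT:
                    sb.append("<span class=\"highlight\">").append(encode(f.text)).append("</span>");
                    break;
                case ELLIPSIS:
                    sb.append("<span class=\"ellipsis\">").append(encode(f.text)).append("</span>");
                    break;
                default:
                    sb.append(encode(f.text));
            }
        }
        return sb.toString();
    }

    /** Minimal encoder: escapes markup characters and numerically encodes non-ASCII code points. */
    static String encode(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            switch (cp) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default:
                    if (cp > 127) {
                        sb.append("&#").append(cp).append(';');
                    } else {
                        sb.appendCodePoint(cp);
                    }
            }
        }
        return sb.toString();
    }
}
```

      With a split like this, search.jsp could call toEntityEncodedString() for HTML, while OpenSearchServlet could call toString() and leave escaping to whatever writes the XML.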

            People

              Assignee: Jerome Charron (jerome.charron)
              Reporter: Michael Stack (stack)