Details
- Bug
- Status: Closed
- Minor
- Resolution: Fixed
- 0.8
- None
- None
Description
All search result data we display (title, URL, etc.) has to be explicitly Entity.encoded when it is output in search.jsp, except Summaries, which are already Entity.encoded. This is fine when outputting HTML, but it gets in the way when outputting anything else, XML for example. I'd suggest we not make any presumption about how search results are used.
The problem becomes especially acute when the text language is other than English.
Here is an example of what a Czech description field in an OpenSearchServlet hit record looks like:
<description><span class="ellipsis"> ... </span>Vědecká knihovna v Olomouci Bezručova 2, Olomouc 9, 779 11, Česká republika tel. +420-585223441 fax +420-585225774 http://www.<span class="highlight">vkol</span>.cz/ info@<span class="highlight">vkol</span>.cz Otevřeno : po-pá 8 30 -19 00 so 9 00 -13 00 ne zavřeno V katalogu s úplným časovým<span class="ellipsis"> ... </span>03 Organizace 20/12 Odkazy 19/04 Hledej 23/03 23/03 Počet přístupů od 1.9.1998. Statistiky . [ ] [ Nahoru ] <span class="highlight">VKOL</span></description>
Here is the same description field with Entity.encoding disabled:
<description><span class="ellipsis"> ... </span>tisky statistiky knihovny WWW serveru st?edov?ké rukopisy studovny CD-ROM historických fond? hlavní Internet N?mecké knihovny vázaných novin SVKOL viz <span class="highlight">VKOL</span> ?atna T telefonní ?ísla knihovny zam?stnanc? U V vazba v?cný popis vedení knihovny vedoucí odd?lení video <span class="highlight">VKOL</span> volný výb?r výp?j?ka výro?ní zpráva výstavy W webmaster WWW odkazy X Y Z - ? zamluvení knihy zahrani?ní periodika zpracování fondu<span class="highlight">VKOL</span> - hledej Hledej [ <span class="highlight">VKOL</span> ] [ Novinky ] [ Katalog ] [ Slu?by ] [ Aktivity ] [ Pr?vodce ] [ Dokumenty ] [ Regionální fce ] [ Organizace ] [ Odkazy ] [ Hledej ] [ ] [ ] Obsah full-textové vyhledávání, 19/04/2003 rejst?ík vybraných<span class="ellipsis"> ... </span></description>
Notice how the Czech characters in the first example are all numerically encoded, i.e. as &#NNN;.
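To make the difference between the two samples concrete, here is a minimal Java sketch of numeric encoding. The `encode` helper is a hypothetical stand-in, not the actual Nutch `Entity.encode`. The `throughLatin1` helper rests on a guess that the report does not confirm: that the `?` characters in the unencoded sample come from writing the text in a charset such as ISO-8859-1, which cannot represent characters like ř or ě.

```java
import java.nio.charset.StandardCharsets;

public class NumericEntitySketch {

    // Hypothetical stand-in for Entity.encode (not the Nutch code): writes
    // every non-ASCII code point, plus the characters that are significant
    // in markup, as a decimal numeric character reference.
    static String encode(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp > 127 || cp == '<' || cp == '>' || cp == '&' || cp == '"') {
                out.append("&#").append(cp).append(';');
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    // Assumption: simulates raw text passing through a Latin-1 output
    // stream. Characters the charset cannot represent are replaced by '?'.
    static String throughLatin1(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // A word from the second sample, written with Unicode escapes so
        // the source file stays ASCII.
        String raw = "st\u0159edov\u011bk\u00e9";
        System.out.println(encode(raw));        // st&#345;edov&#283;k&#233;
        System.out.println(throughLatin1(raw)); // both escaped-above-Latin-1 letters become '?'
    }
}
```

Under that assumption, the encoded form survives any ASCII-compatible output path but bakes HTML conventions into the data, while the raw form is only safe if the response is written in an encoding that covers the text, such as UTF-8.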
I'd suggest that Summary#toString() become Summary#toEntityEncodedString() and that toString() return the raw aggregation of Fragments. This would likely require adding methods to the HitSummarizer interface so callers can ask for either raw or entity-encoded text, with a matching addition to NutchBean. Or, better, I'd suggest that the Summarizer never return Entity.encoded text and that the encoding happen in search.jsp instead (I can make a patch to do the latter if that's amenable).
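The proposed split could look roughly like the sketch below. Everything in it is illustrative: `SummarySketch`, its `Fragment` class, the `add` method and the `encode` helper are hypothetical names, not the real Nutch `Summary` API. Only the two method names `toString()` and `toEntityEncodedString()` come from the suggestion above.

```java
import java.util.ArrayList;
import java.util.List;

public class SummarySketch {

    // Hypothetical fragment: a piece of text that may be a highlighted hit.
    static class Fragment {
        final String text;
        final boolean highlight;

        Fragment(String text, boolean highlight) {
            this.text = text;
            this.highlight = highlight;
        }
    }

    private final List<Fragment> fragments = new ArrayList<>();

    SummarySketch add(String text, boolean highlight) {
        fragments.add(new Fragment(text, highlight));
        return this;
    }

    // Raw aggregation of fragments: no markup, no entity encoding.
    @Override
    public String toString() {
        StringBuilder out = new StringBuilder();
        for (Fragment f : fragments) {
            out.append(f.text);
        }
        return out.toString();
    }

    // HTML-oriented view: entity-encoded text with highlight spans.
    public String toEntityEncodedString() {
        StringBuilder out = new StringBuilder();
        for (Fragment f : fragments) {
            String encoded = encode(f.text);
            if (f.highlight) {
                out.append("<span class=\"highlight\">").append(encoded).append("</span>");
            } else {
                out.append(encoded);
            }
        }
        return out.toString();
    }

    // Hypothetical encoder (assumption, not Entity.encode): numeric
    // references for non-ASCII and markup-significant characters.
    static String encode(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp > 127 || cp == '<' || cp == '>' || cp == '&' || cp == '"') {
                out.append("&#").append(cp).append(';');
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        SummarySketch s = new SummarySketch()
                .add("info@", false)
                .add("vkol", true)
                .add(".cz", false);
        System.out.println(s);                           // info@vkol.cz
        System.out.println(s.toEntityEncodedString());   // info@<span class="highlight">vkol</span>.cz
    }
}
```

With this shape, an XML producer such as OpenSearchServlet would call `toString()` and apply its own escaping, while search.jsp would call `toEntityEncodedString()`. The second option in the suggestion would drop the encoded method entirely and leave all encoding to the JSP.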
Issue Links
- is part of
- NUTCH-134 Summarizer doesn't select the best snippets
- Closed