
Key: NUTCH-110
Type: Bug
Status: Closed
Resolution: Fixed
Priority: Major
Assignee: Sami Siren
Reporter: stack
Votes: 2
Watchers: 2

Nutch

OpenSearchServlet outputs illegal xml characters

Created: 13/Oct/05 09:13 AM   Updated: 24/Oct/06 04:14 PM
Component/s: searcher
Affects Version/s: 0.8
Fix Version/s: 0.8

Time Tracking:
Not Specified

File Attachments (text files, all licensed for inclusion in ASF works):
  File                             Date                 Author           Size
  fixIllegalXmlChars.patch         2005-10-13 09:19 AM  stack            3 kB
  fixIllegalXmlChars08-v2.patch    2006-06-16 09:13 PM  John VanDyk      2 kB
  fixIllegalXmlChars08-v3.patch    2006-06-16 11:18 PM  stack            7 kB
  fixIllegalXmlChars08-v4.patch    2006-06-20 09:53 AM  stack            3 kB
  fixIllegalXmlChars08-v5.patch    2006-06-20 11:44 PM  stack            3 kB
  fixIllegalXmlChars08.patch       2006-05-25 11:06 PM  Stefan Neufeind  3 kB
  NUTCH-110-version2.patch         2005-10-15 09:43 AM  stack            7 kB
Environment: linux, jdk 1.5

Resolution Date: 21/Jun/06 02:12 AM


Description
OpenSearchServlet does not check the text it outputs for illegal XML characters; depending on the search result, it's possible for OSS to output XML that is not well-formed. For example, if the text contains the form feed (FF) character, i.e. the ASCII character at position (decimal) 12, the produced XML will show the FF character as '&#12;'. The character/entity '&#12;' is not legal in XML according to http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char.

stack added a comment - 13/Oct/05 09:19 AM
The attached patch runs all XML text through a check for bad XML characters. The patch is brutal: it silently drops illegal characters. I made it after hunting through xalan, the JDK, and nutch itself for a method that does this filtering, but was unable to find one – perhaps an oversight on my part?
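
For readers without the attachment, the filtering described above can be sketched roughly as follows. This is an illustration only, not the contents of fixIllegalXmlChars.patch: the class name XmlCharFilter and the methods isLegalXmlChar and stripIllegalXmlChars are hypothetical, and the legal ranges are taken from the Char production of the XML 1.0 recommendation linked in the description.

```java
/** Hypothetical helper; names are illustrative and not taken from the patch. */
final class XmlCharFilter {
    private XmlCharFilter() {}

    /** True if the code point matches the XML 1.0 "Char" production. */
    static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /**
     * Silently drops every character that is not legal in XML 1.0.
     * The input is scanned once; a copy is only made if something is dropped.
     */
    static String stripIllegalXmlChars(String text) {
        if (text == null || text.isEmpty()) {
            return text;
        }
        StringBuilder out = null; // allocated lazily on the first illegal character
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            if (isLegalXmlChar(cp)) {
                if (out != null) {
                    out.appendCodePoint(cp);
                }
            } else if (out == null) {
                // First illegal character: keep everything seen so far, skip this one.
                out = new StringBuilder(text.length());
                out.append(text, 0, i);
            }
            i += Character.charCount(cp);
        }
        return out == null ? text : out.toString();
    }
}
```

With this sketch, a string containing a form feed such as "a\fb" comes back as "ab", while text with only legal characters is returned unchanged. Unpaired surrogates are dropped as well, since they fall outside the legal ranges.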

stack added a comment - 15/Oct/05 09:43 AM
Patch version 2. This patch benefits from discussion held on the nutch dev list. It differs from the first in that it handles ALL illegal XML characters, entity-encoding the 5 'special characters' AND (silently) dropping characters outside the legal XML character range. The previous patch did only the latter, leaving entity escaping to the configured transformer/DOM serializer.

This patch also differs from patch version 1 in that it moves the method that processes the xml out into util.StringUtil: The assumption being that not only OpenSearchServlet needs to make text safe to include in xml.
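
As a rough picture of what "entity-encode the 5 special characters and drop the rest" means, here is a minimal sketch. It is not the carrot2 XMLSerializerHelper code or the body of StringUtil#toValidXmlText; the class XmlTextEscaper and the method escapeAndFilter are hypothetical names used only for illustration.

```java
/** Hypothetical helper; not the toValidXmlText method referred to in this issue. */
final class XmlTextEscaper {
    private XmlTextEscaper() {}

    /**
     * Escapes the five XML special characters and silently drops any
     * code point outside the XML 1.0 legal character range.
     */
    static String escapeAndFilter(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            switch (cp) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:
                    // XML 1.0 Char production: tab, LF, CR and the three legal ranges.
                    boolean legal = cp == 0x9 || cp == 0xA || cp == 0xD
                        || (cp >= 0x20 && cp <= 0xD7FF)
                        || (cp >= 0xE000 && cp <= 0xFFFD)
                        || (cp >= 0x10000 && cp <= 0x10FFFF);
                    if (legal) {
                        out.appendCodePoint(cp);
                    }
            }
        }
        return out.toString();
    }
}
```

Output from a method like this is already escaped, so it is only safe to write directly into a stream; handing it to a serializer that escapes text itself would encode the entities a second time.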

The core method, StringUtil#toValidXmlText, was authored by Dawid Weiss and was taken from carrot2 XMLSerializerHelper. Below is an excerpt from the mail on the nutch dev list where he grants permission to copy toValidXmlText.

Message-ID: <434F5368.6040202@cs.put.poznan.pl>
Date: Fri, 14 Oct 2005 08:42:48 +0200
From: Dawid Weiss <dawid.weiss@cs.put.poznan.pl>
To: nutch-dev@lucene.apache.org
Subject: Re: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal
xml characters

...

> So, will I amend the patch in NUTCH-110 so it uses
> XMLSerializerHelper#toValidXmlText in place of #getLegalXml method?

Copy the method's contents. It doesn't really make sense to copy the
entire class just for this method. Good luck.

D.


stack added a comment - 11/Nov/05 07:32 AM
Scrub NUTCH-110-version2.patch. This patch double-encodes certain entities (first by the new toValidXmlText method, second by the javax.xml.transform.Transformer used by OpenSearchServlet).

Use the original patch, fixIllegalXmlChars.patch, to address the problem described in this issue.
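
The double-encoding effect can be reproduced outside Nutch with a small sketch. The class DoubleEncodingDemo and its serialize method are hypothetical and only assume the standard JAXP DOM and Transformer APIs; this is not OpenSearchServlet code.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Hypothetical demo; shows that the Transformer escapes text nodes by itself. */
final class DoubleEncodingDemo {
    private DoubleEncodingDemo() {}

    /** Wraps the text in a <title> element and serializes it with a Transformer. */
    static String serialize(String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
            Element title = doc.createElement("title");
            title.appendChild(doc.createTextNode(text));
            doc.appendChild(title);

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("serialization failed", e);
        }
    }
}
```

Passing raw text such as "AT&T" yields a single, correct "&amp;" in the output, whereas passing text that was escaped beforehand ("AT&amp;T") yields "&amp;amp;". That is why pre-escaping is the wrong fit when a Transformer does the serialization, and only the dropping of illegal characters has to happen up front.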


Stefan Neufeind added a comment - 25/May/06 11:06 PM
Since original patch didn't cleanly apply for me on 0.8-dev (nightly-2006-05-20) I re-did it for 0.8 ...

With this patch the XML is fine. Without it I had big trouble parsing the RSS feed in another application.


John VanDyk added a comment - 16/Jun/06 09:13 PM
Stefan's patch didn't apply cleanly for me on svn revision 413155 so I re-did it.

This patch fixes the illegal XML characters and prevents opensearch clients from choking on that bad XML previously emitted.


Jerome Charron added a comment - 16/Jun/06 09:30 PM
This patch processes the String twice if it contains some illegal characters!

stack added a comment - 16/Jun/06 11:18 PM
Version of the patch that doesn't "...process the String twice if it contains some illegal characters!". Its name is fixIllegalXmlChars08-v3.patch (be careful, it's not the last patch in the list). It was made against 414852.

At least 3 different people have run into this awkward issue, going by the comments in this issue. I petition that this is sufficient to earn a commit.

Thanks.


stack added a comment - 16/Jun/06 11:19 PM
Was version 0.7. Changed 'Affects Version' to 0.8-dev.

stack added a comment - 20/Jun/06 09:53 AM
v3 mistakenly included debugging code.

Attached cleaned up v4.


Sami Siren added a comment - 20/Jun/06 11:31 PM
in method addAttribute(...)

line:
attribute.setValue(getLegalXml(getLegalXml(value)));

intentional?


stack added a comment - 20/Jun/06 11:44 PM
No, the double call to getLegalXml is not intentional. It's a mistake. Thanks for finding it.

I've attached yet another version (Any prizes for most revisions to a patch?).


Sami Siren added a comment - 21/Jun/06 02:12 AM
I just committed this with small changes (moved the test to a test case). Thanks.

Sami Siren added a comment - 24/Oct/06 04:14 PM
closing issues for released versions