Details
- Type: Improvement
- Status: Open
- Priority: Minor
- Resolution: Unresolved
- Affects Version/s: 5.0
- Fix Version/s: None
Description
On line 283 in SolrContentHandler, the catch-all field gets all source HTML attribute values dumped into it:
270: @Override
271: public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
272:   StringBuilder theBldr = fieldBuilders.get(localName);
273:   if (theBldr != null) {
274:     //we need to switch the currentBuilder
275:     bldrStack.add(theBldr);
276:   }
277:   if (captureAttribs == true) {
278:     for (int i = 0; i < attributes.getLength(); i++) {
279:       addField(localName, attributes.getValue(i), null);
280:     }
281:   } else {
282:     for (int i = 0; i < attributes.getLength(); i++) {
283:       bldrStack.getLast().append(' ').append(attributes.getValue(i));
284:     }
285:   }
286:   bldrStack.getLast().append(' ');
287: }
But this will contain lots of unwanted cruft: class and style attribute values, etc.
It would be much better if only attribute values containing addresses, tooltip text, etc. were dumped into the catch-all field. Here are a couple of places where these kinds of attributes are described:
From Tika's HtmlHandler class:
// List of attributes that need to be resolved.
private static final Set<String> URI_ATTRIBUTES =
    new HashSet<String>(Arrays.asList("src", "href", "longdesc", "cite"));
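As a rough illustration of the proposal, the loop in startElement could consult a whitelist before appending to the catch-all builder. This is only a sketch against the snippet quoted above, not a patch: the CONTENT_ATTRIBUTES set (and its exact membership, here Tika's URI_ATTRIBUTES plus title and alt) is an assumption.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;

// Hypothetical whitelist of attributes whose values are likely to carry useful text
// (URIs, tooltips, alternate text); the name and membership are assumptions.
private static final Set<String> CONTENT_ATTRIBUTES = new HashSet<String>(
    Arrays.asList("src", "href", "longdesc", "cite", "title", "alt"));

@Override
public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
  StringBuilder theBldr = fieldBuilders.get(localName);
  if (theBldr != null) {
    // we need to switch the currentBuilder
    bldrStack.add(theBldr);
  }
  for (int i = 0; i < attributes.getLength(); i++) {
    if (captureAttribs) {
      // capture mode is unchanged: each attribute value still goes to its own field
      addField(localName, attributes.getValue(i), null);
    } else if (CONTENT_ATTRIBUTES.contains(attributes.getLocalName(i))) {
      // only whitelisted attribute values reach the catch-all builder,
      // so class/style cruft no longer gets indexed
      bldrStack.getLast().append(' ').append(attributes.getValue(i));
    }
  }
  bldrStack.getLast().append(' ');
}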
Attachments
Issue Links
- is related to: SOLR-6856 regression in /update/extract ? ref guide examples of fmap & xpath don't seem to be working (Closed)