[SOLR-7027] ExtractingRequestHandler indiscriminately dumps all source HTML attributes into the catch-all field when captureAttr=false, but it should be more selective, something like only href, title, alt, etc. attributes - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 5.0
Fix Version/s: 5.2, 6.0
Component/s: contrib - Solr Cell (Tika extraction)
Labels: None

Description

On line 283 in SolrContentHandler, the catch-all field gets all source HTML attribute values dumped into it:

270:  @Override
271:  public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
272:    StringBuilder theBldr = fieldBuilders.get(localName);
273:    if (theBldr != null) {
274:      //we need to switch the currentBuilder
275:      bldrStack.add(theBldr);
276:    }
277:    if (captureAttribs == true) {
278:      for (int i = 0; i < attributes.getLength(); i++) {
279:        addField(localName, attributes.getValue(i), null);
280:      }
281:    } else {
282:      for (int i = 0; i < attributes.getLength(); i++) {
283:        bldrStack.getLast().append(' ').append(attributes.getValue(i));
284:      }
285:    }
286:    bldrStack.getLast().append(' ');
287:  }

But this will contain lots of unwanted cruft: the values of class and style attributes, etc.

It would be much better if only attribute values containing addresses, tooltip text, and similar content were dumped into the catch-all field. Here are a couple of places where these kinds of attributes are described:

http://jericho.htmlparser.net/docs/javadoc/net/htmlparser/jericho/TextExtractor.html#includeAttribute(net.htmlparser.jericho.StartTag,%20net.htmlparser.jericho.Attribute)

From Tika's HtmlHandler class:

    // List of attributes that need to be resolved.
    private static final Set<String> URI_ATTRIBUTES =
        new HashSet<String>(Arrays.asList("src", "href", "longdesc", "cite"));
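
A minimal sketch of what the more selective behavior could look like, shown as a standalone helper rather than a patch to SolrContentHandler. The class name SelectiveAttributeCapture, the method appendSelectedAttributes, and the exact contents of the allow-list are assumptions made for illustration: the list combines Tika's URI_ATTRIBUTES above with the title and alt attributes mentioned in the issue summary.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;

/** Hypothetical helper, not part of Solr: keeps only allow-listed attribute values. */
public class SelectiveAttributeCapture {

  // Assumed allow-list: Tika's URI attributes plus "title" and "alt".
  public static final Set<String> INDEXABLE_ATTRIBUTES = Collections.unmodifiableSet(
      new HashSet<>(Arrays.asList("src", "href", "longdesc", "cite", "title", "alt")));

  /** Appends the values of allow-listed attributes to target, each preceded by a space. */
  public static void appendSelectedAttributes(Attributes attributes, StringBuilder target) {
    for (int i = 0; i < attributes.getLength(); i++) {
      // The local name may be empty when namespace processing is off; fall back to the qName.
      String name = attributes.getLocalName(i);
      if (name == null || name.isEmpty()) {
        name = attributes.getQName(i);
      }
      if (name != null && INDEXABLE_ATTRIBUTES.contains(name.toLowerCase(Locale.ROOT))) {
        target.append(' ').append(attributes.getValue(i));
      }
    }
  }

  public static void main(String[] args) {
    AttributesImpl attrs = new AttributesImpl();
    attrs.addAttribute("", "href", "href", "CDATA", "http://example.com/");
    attrs.addAttribute("", "class", "class", "CDATA", "nav-link active");
    attrs.addAttribute("", "title", "title", "CDATA", "Example tooltip");

    StringBuilder catchAll = new StringBuilder();
    appendSelectedAttributes(attrs, catchAll);
    // Only the href and title values are appended; the class value is skipped.
    System.out.println(catchAll);
  }
}
```

In SolrContentHandler, a check along these lines would presumably replace the unconditional append in the else branch (line 283), with the allow-list ideally made configurable.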

Issue Links

is related to SOLR-6856 regression in /update/extract ? ref guide examples of fmap & xpath don't seem to be working (Closed)

People

Assignee: Unassigned

Reporter: Steven Rowe

Votes: 1

Watchers: 2

Dates

Created: 24/Jan/15 01:22

Updated: 21/Dec/16 01:08