  1. Solr
  2. SOLR-7027

ExtractingRequestHandler indiscriminately dumps all source HTML attributes into the catch-all field when captureAttr=false, but it should be more selective and include only attributes such as href, title, and alt


    Details

      Description

      On line 283 in SolrContentHandler, the catch-all field gets all source HTML attribute values dumped into it:

      270:  @Override
      271:  public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
      272:    StringBuilder theBldr = fieldBuilders.get(localName);
      273:    if (theBldr != null) {
      274:      //we need to switch the currentBuilder
      275:      bldrStack.add(theBldr);
      276:    }
      277:    if (captureAttribs == true) {
      278:      for (int i = 0; i < attributes.getLength(); i++) {
      279:        addField(localName, attributes.getValue(i), null);
      280:      }
      281:    } else {
      282:      for (int i = 0; i < attributes.getLength(); i++) {
      283:        bldrStack.getLast().append(' ').append(attributes.getValue(i));
      284:      }
      285:    }
      286:    bldrStack.getLast().append(' ');
      287:  }
      

      But this will contain lots of unwanted cruft: values of class and style attributes, etc.

      It would be much better if only attribute values containing addresses, tooltip text, and the like were dumped into the catch-all field. Here are a couple of places where these kinds of attributes are described:

      http://jericho.htmlparser.net/docs/javadoc/net/htmlparser/jericho/TextExtractor.html#includeAttribute(net.htmlparser.jericho.StartTag,%20net.htmlparser.jericho.Attribute)

      From Tika's HtmlHandler class:

          // List of attributes that need to be resolved.
          private static final Set<String> URI_ATTRIBUTES =
              new HashSet<String>(Arrays.asList("src", "href", "longdesc", "cite"));
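
      A minimal sketch of what a more selective version of the captureAttr=false branch might look like. The class name, the INDEXABLE_ATTRIBUTES set, and the appendSelectedAttributes helper are hypothetical and not part of Solr; the whitelist just combines Tika's URI attributes with the title and alt attributes mentioned in the summary, and the right set is open for discussion:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;

public class AttributeWhitelistSketch {

    // Hypothetical whitelist (an assumption, not an existing Solr constant):
    // Tika's URI attributes plus the "title" and "alt" text attributes.
    static final Set<String> INDEXABLE_ATTRIBUTES = new HashSet<>(
        Arrays.asList("src", "href", "longdesc", "cite", "title", "alt"));

    // Appends only the values of whitelisted attributes to the target builder,
    // mirroring the space-separated appends in the existing else branch.
    static void appendSelectedAttributes(StringBuilder target, Attributes attributes) {
        for (int i = 0; i < attributes.getLength(); i++) {
            String name = attributes.getLocalName(i);
            if (name == null || name.isEmpty()) {
                // Fall back to the qualified name when no local name is reported.
                name = attributes.getQName(i);
            }
            if (name != null && INDEXABLE_ATTRIBUTES.contains(name.toLowerCase(Locale.ROOT))) {
                target.append(' ').append(attributes.getValue(i));
            }
        }
    }

    public static void main(String[] args) {
        AttributesImpl attrs = new AttributesImpl();
        attrs.addAttribute("", "href", "href", "CDATA", "http://example.com/");
        attrs.addAttribute("", "class", "class", "CDATA", "nav-link");
        attrs.addAttribute("", "title", "title", "CDATA", "Example site");

        StringBuilder catchAll = new StringBuilder();
        appendSelectedAttributes(catchAll, attrs);
        // The class value is skipped; only the href and title values are appended.
        System.out.println(catchAll);
    }
}
```

      In SolrContentHandler, the loop on lines 282-284 could presumably call something like this helper with bldrStack.getLast() as the target, leaving the captureAttr=true branch unchanged.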
      

              People

              • Assignee:
                Unassigned
                Reporter:
                sarowe Steven Rowe