Solr / SOLR-7027

ExtractingRequestHandler indiscriminately dumps all source HTML attribute values into the catch-all field when captureAttr=false, but it should be more selective, e.g. only the values of href, title, alt, and similar attributes


Details

    Description

      On line 283 in SolrContentHandler, the catch-all field gets all source HTML attribute values dumped into it:

      270:  @Override
      271:  public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
      272:    StringBuilder theBldr = fieldBuilders.get(localName);
      273:    if (theBldr != null) {
      274:      //we need to switch the currentBuilder
      275:      bldrStack.add(theBldr);
      276:    }
      277:    if (captureAttribs == true) {
      278:      for (int i = 0; i < attributes.getLength(); i++) {
      279:        addField(localName, attributes.getValue(i), null);
      280:      }
      281:    } else {
      282:      for (int i = 0; i < attributes.getLength(); i++) {
      283:        bldrStack.getLast().append(' ').append(attributes.getValue(i));
      284:      }
      285:    }
      286:    bldrStack.getLast().append(' ');
      287:  }
      

      But this will contain lots of unwanted cruft: the values of class and style attributes, etc.

      It would be much better if only attribute values containing addresses, tooltip text, etc. were dumped into the catch-all field. Here are a couple of places where attributes of this kind are described:

      http://jericho.htmlparser.net/docs/javadoc/net/htmlparser/jericho/TextExtractor.html#includeAttribute(net.htmlparser.jericho.StartTag,%20net.htmlparser.jericho.Attribute)

      From Tika's HtmlHandler class:

          // List of attributes that need to be resolved.
          private static final Set<String> URI_ATTRIBUTES =
              new HashSet<String>(Arrays.asList("src", "href", "longdesc", "cite"));
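
      One way the selective behavior could look is sketched below. This is only an illustration, not a patch against SolrContentHandler: the class name SelectiveAttributeSketch, the set INDEXABLE_ATTRIBUTES, and the helper appendSelectedAttributes are hypothetical, and the whitelist is an assumption that combines Tika's URI_ATTRIBUTES with the title and alt attributes mentioned in the issue title.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;

public class SelectiveAttributeSketch {

  // Hypothetical whitelist (an assumption, not taken from Solr):
  // Tika's URI attributes plus "title" and "alt" from the issue title.
  static final Set<String> INDEXABLE_ATTRIBUTES = new HashSet<>(
      Arrays.asList("src", "href", "longdesc", "cite", "title", "alt"));

  // Appends only the values of whitelisted attributes to the catch-all
  // builder, instead of appending every attribute value.
  static void appendSelectedAttributes(StringBuilder catchAll, Attributes attributes) {
    for (int i = 0; i < attributes.getLength(); i++) {
      String name = attributes.getLocalName(i);
      if (name == null || name.isEmpty()) {
        // Fall back to the qualified name when no local name is reported.
        name = attributes.getQName(i);
      }
      if (name != null && INDEXABLE_ATTRIBUTES.contains(name.toLowerCase(Locale.ROOT))) {
        catchAll.append(' ').append(attributes.getValue(i));
      }
    }
  }

  public static void main(String[] args) {
    // Simulates <a class="nav-link" href="..." style="..." title="...">.
    AttributesImpl attrs = new AttributesImpl();
    attrs.addAttribute("", "class", "class", "CDATA", "nav-link");
    attrs.addAttribute("", "href", "href", "CDATA", "http://example.com/");
    attrs.addAttribute("", "style", "style", "CDATA", "color:red");
    attrs.addAttribute("", "title", "title", "CDATA", "Example site");

    StringBuilder catchAll = new StringBuilder();
    appendSelectedAttributes(catchAll, attrs);
    // The class and style values are skipped; href and title are kept.
    System.out.println(catchAll);
  }
}
```

      In the handler itself, the else branch of startElement would presumably call something like this helper with bldrStack.getLast() in place of the unconditional loop. Whether the whitelist should be fixed or configurable is left open here.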
      

            People

              Assignee: Unassigned
              Reporter: Steven Rowe (sarowe)