  1. Apache Any23 (Retired)
  2. ANY23-339

Microdata extractor can sometimes merge two different itemscopes into one


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.2
    • Fix Version/s: 2.3
    • Component/s: extractors
    • Labels: None

    Description

      The microdata extractor calculates the subject of a triple as the hashCode() of the itemscope.

      Java's hashCode() method returns a 32-bit integer and is not guaranteed to be collision-free. This is especially true here, since the ItemScope.hashCode() method is not written very well.

      This means that two microdata items can accidentally be merged into one.
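
To make the failure mode concrete, here is a minimal sketch that does not use Any23 at all. The class `HashCodeSubjectDemo` and the method `hashBasedLabel` are hypothetical names, and plain strings stand in for item scopes. It only shows that two unequal objects can share a hash code, so that labels derived from `hashCode()` coincide.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the extractor's subject calculation.
class HashCodeSubjectDemo {

    // Same pattern as the line quoted in this issue: the label is the hash code as text.
    static String hashBasedLabel(Object item) {
        return Integer.toString(item.hashCode());
    }

    public static void main(String[] args) {
        // "Aa" and "BB" are unequal strings that share a hash code under String.hashCode().
        String first = "Aa";
        String second = "BB";

        // Group "properties" by subject label, as a triple store would group by subject.
        Map<String, List<String>> bySubject = new HashMap<>();
        bySubject.computeIfAbsent(hashBasedLabel(first), k -> new ArrayList<>()).add("name of first");
        bySubject.computeIfAbsent(hashBasedLabel(second), k -> new ArrayList<>()).add("name of second");

        // Both entries land under one label, so the two items look like a single subject.
        System.out.println("equal objects?   " + first.equals(second));
        System.out.println("distinct labels: " + bySubject.size());
    }
}
```

The same merging would happen for any pair of item scopes whose hash codes collide, however unlikely a given pair may be.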

      Here's the line that needs to be changed: 

      https://github.com/apache/any23/blob/316b4ec0d6285a204789792084caf012c000b196/core/src/main/java/org/apache/any23/extractor/microdata/MicrodataExtractor.java#L439

      I recommend changing 

      subject = RDFUtils.getBNode(Integer.toString(itemScope.hashCode()));
      

      to

      subject = RDFUtils.bnode();
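
The idea behind the suggested change can be sketched independently of Any23. The sketch below assumes nothing about how `RDFUtils.bnode()` is implemented; `FreshLabelAllocator` and `labelFor` are hypothetical names. It hands out a fresh random label per object and remembers it by object identity, so a hash collision between two items no longer matters.

```java
import java.util.IdentityHashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical helper: one fresh label per object instance, independent of hashCode().
class FreshLabelAllocator {

    // IdentityHashMap compares keys with ==, so colliding or even equal objects stay separate.
    private final Map<Object, String> labels = new IdentityHashMap<>();

    String labelFor(Object item) {
        return labels.computeIfAbsent(
                item, k -> "node" + UUID.randomUUID().toString().replace("-", ""));
    }

    public static void main(String[] args) {
        FreshLabelAllocator allocator = new FreshLabelAllocator();

        // Two unequal strings with the same hash code stand in for two item scopes.
        String first = "Aa";
        String second = "BB";

        System.out.println("same hash code:   " + (first.hashCode() == second.hashCode()));
        System.out.println("same label:       " + allocator.labelFor(first).equals(allocator.labelFor(second)));
        System.out.println("stable per item:  " + allocator.labelFor(first).equals(allocator.labelFor(first)));
    }
}
```

Remembering the label per instance matters only if the same item can be visited more than once; if each item is visited exactly once, generating a fresh label on the spot is enough.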
      

      We could also use itemScope.getItemId() if it's not null, even if it's not a URL. One example of such an id is:

      urn:isbn:0-330-34032-8
      

      Edit: according to the microdata spec, urn:isbn:0-330-34032-8 is an absolute URL. Since the spec's definition of URL seems to correspond more closely to our definition of URI, we should check for absolute URLs with URI.isAbsolute() rather than with URL.getProtocol() != null.
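
The difference between the two checks can be seen with the JDK classes alone. This is a sketch, not the extractor's code: `AbsoluteIdCheck` and its methods are hypothetical names, and it assumes the existing check builds a `java.net.URL` from the id using only the stock JDK protocol handlers.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

// Hypothetical helper comparing URI-based and URL-based checks on an item id.
class AbsoluteIdCheck {

    // A URI is absolute when it has a scheme, whatever that scheme is.
    static boolean isAbsoluteUri(String id) {
        try {
            return new URI(id).isAbsolute();
        } catch (URISyntaxException e) {
            return false;
        }
    }

    // java.net.URL only accepts schemes with a registered protocol handler.
    // The String constructor is deprecated in recent JDKs but still available.
    static boolean parsesAsUrl(String id) {
        try {
            new URL(id);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] ids = {"urn:isbn:0-330-34032-8", "http://example.com/item", "items/42"};
        for (String id : ids) {
            System.out.println(id + " -> absolute URI: " + isAbsoluteUri(id)
                    + ", parses as URL: " + parsesAsUrl(id));
        }
    }
}
```

With the default handlers, the `urn:` id is an absolute URI but is rejected by `java.net.URL`, which is why a URL-based check would discard it.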


            People

              Assignee: Hans Brende (hansbrende)
              Reporter: Hans Brende (hansbrende)
              Votes: 0
              Watchers: 3
