Details

    • Improvement
    • Status: Closed
    • Major
    • Resolution: Fixed
    • 0.7.0
    • 1.0
    • core

    Description

      The RDFa 1.1 Core specification requests namespace prefixes in HTML5 be put in a "prefix" attribute like this: "ns1: http://example.org/ ns2: http://example.com/"

      My sample HTML page has this, but Sindice, which uses Any23, didn't read my namespace correctly. I narrowed the problem down to the XSLT template "tokenize2" in the rdfa.xslt stylesheet and changed it accordingly. The template expected "ns1:http://example.org/ ns2:http://example.com/" (no spaces between prefix and namespace URI) and did not normalize whitespace such as line breaks (although I'm not sure that broke the functionality).

      I use Any23 0.6.1 locally, but http://svn.apache.org/viewvc/incubator/any23/trunk/core/src/main/resources/org/apache/any23/extractor/rdfa/rdfa.xslt?revision=1231556&view=markup shows that the template is the same in the trunk.

      A possible problem may be that the new template will not accept the non-spaced namespace definitions, like you can find in the RDFa produced by Best Buy. A further improvement to my template may be accepting both namespace definitions with spaces and the ones without.
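To make the spaced form concrete, here is a minimal Python sketch of reading a "prefix" attribute written as in the spec example above. It is not Any23 code and not the XSLT from the attachment; the function name `parse_prefix_attribute` is hypothetical. It normalizes whitespace (including line breaks) and expects tokens to alternate between `prefix:` and a URI, so it deliberately rejects the non-spaced form.

```python
def parse_prefix_attribute(value):
    """Parse a spaced prefix attribute such as 'ns1: http://a/ ns2: http://b/'."""
    # str.split() with no argument collapses runs of spaces, tabs and line breaks.
    tokens = value.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected alternating 'prefix:' and URI tokens")
    mappings = {}
    for name, uri in zip(tokens[0::2], tokens[1::2]):
        # In the spaced form the prefix token ends with the colon.
        if not name.endswith(":"):
            raise ValueError("prefix token %r does not end with ':'" % name)
        mappings[name[:-1]] = uri
    return mappings


if __name__ == "__main__":
    print(parse_prefix_attribute("ns1: http://example.org/\n   ns2: http://example.com/"))
```

A value in the non-spaced form ("ns1:http://example.org/ ...") raises `ValueError` here, which mirrors the compatibility concern raised in the last paragraph of the description.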

      Attachments

        1. rdfa.xslt
          37 kB
          Ben Companjen
        2. rdfa-11-curies-a.html
          0.8 kB
          Ben Companjen
        3. stylesheet.patch
          3 kB
          Ben Companjen
        4. stylesheet3.patch
          5 kB
          Ben Companjen
        5. test.patch
          2 kB
          Ben Companjen

        Issue Links

          Activity

            bencomp Ben Companjen added a comment -

            My version of rdfa.xslt, with the tokenize2 template changed to accept the "prefix" definitions as specified by http://www.w3.org/TR/rdfa-core


            lewismc Lewis John McGibbney added a comment -

            Hi Ben, thanks for this. My very first thought is that the file should be ASL v2.0 licensed! This obviously isn't the case, and we really need to review this across the board, so this is already a great issue.
            I don't suppose it's essential, but it would be much easier for us to review your work if you were able to submit a patch for this, e.g.

            svn diff > name-of-patch.patch

            I'll root out the differences and comment; hopefully we can get this into the codebase. Thanks again.

            bencomp Ben Companjen added a comment -

            Hi Lewis,

            I think this is the patch file you want. I found out that the file I uploaded last night contained a little debug code I had put in. I also changed some indentation.

            I guess you meant that I don't have the author{ship|ity} to allow the file for inclusion? I guess you're right about that - I hadn't really thought of that. I notified the original author of my issue + update via Twitter, but haven't seen a reply yet.

            bencomp Ben Companjen added a comment -

            One step further: the build fails with my updated stylesheet because the tests try to extract triples from rdfa-11-curies.html, which has prefix="db:http://database.org/ dc:http://purl.org/dc/01/" (without spaces). Even though this doesn't follow the spec, I'm starting to believe it is important to support namespace definitions both with and without a space between prefix and namespace. Perhaps an extra template "tokenize2a" that can be called when not(contains(@prefix,': ')) is all it takes. (Just thinking out loud here.)
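
The dispatch idea in this comment can be sketched in Python as follows. This is only an analogue of the proposed XSLT change, not the stylesheet itself; the function names are hypothetical, and the `": " in value` test stands in for `contains(@prefix,': ')`.

```python
def parse_spaced(value):
    # Tokens alternate between 'prefix:' and the URI.
    tokens = value.split()
    return {name.rstrip(":"): uri for name, uri in zip(tokens[0::2], tokens[1::2])}


def parse_unspaced(value):
    # Each whitespace-separated token is 'prefix:URI'; split at the first colon only,
    # so the colon in 'http://' stays part of the URI.
    mappings = {}
    for token in value.split():
        name, _, uri = token.partition(":")
        mappings[name] = uri
    return mappings


def parse_prefix_attribute(value):
    # Heuristic from the comment: a colon followed by a space signals the spaced form.
    if ": " in value:
        return parse_spaced(value)
    return parse_unspaced(value)


if __name__ == "__main__":
    print(parse_prefix_attribute("db:http://database.org/ dc:http://purl.org/dc/01/"))
    print(parse_prefix_attribute("db: http://database.org/ dc: http://purl.org/dc/01/"))
```

Both calls yield the same mappings. The heuristic looks at the whole attribute value rather than at individual tokens, which is a simplification worth keeping in mind for unusual inputs.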

            bencomp Ben Companjen added a comment -

            I think I managed to handle both "prefix" attributes with and without spaces between prefix and URI with this newest version.
            Using Saxon, I tested that it works. The build still fails, but because of another issue [testGetClassesInJAR(org.apache.any23.util.DiscoveryUtilsTest)].

            bencomp Ben Companjen added a comment -

            Well, this is kind of embarrassing... I just had another look at the RDFa Core 1.1 doc and saw that the CURIES are defined without spaces. That would mean the examples are wrong and all the work I did was to support the wrong examples :S

            It still works of course. I emailed two of the editors of the doc with my findings, so hopefully it will be updated.


            michele.mostarda Michele Mostarda added a comment -

            Hi Ben,
            don't worry, it happens; the RDFa 1.1 documentation is not rich in examples.

            The rdfa.xslt stylesheet is no longer maintained. We updated it with partial support for CURIEs and then decided to rewrite the RDFa extractor programmatically [1].
            Any23 chooses between the legacy RDFaExtractor (based on rdfa.xslt) and the RDFa11Extractor by reading the property "any23.extraction.rdfa.programmatic" in default-configuration.properties [2]. This setting can also be changed programmatically, see [3].

            If you find any issue and want to provide a fix, I suggest you send us a patch against [1] (complete with a regression test), since rdfa.xslt is gradually being phased out.

            Thanks a lot for your feedback.

            Mic

            [1] org.apache.any23.extractor.rdfa.RDFa11Extractor[Test]
            [2] any23-trunk/core/src/main/resources/default-configuration.properties
            [3] http://incubator.apache.org/any23/configuration.html

            bencomp Ben Companjen added a comment -

            Hi Michele,

            Thanks for the explanation. In the meantime one of the editors of RDFa Core 1.1 pointed out to me that there must be spaces in the prefix attribute: NCName ':' ' '+ xsd:anyURI (defined in section 5). The test HTML file is therefore incorrect.

            I'll have a look at the files you listed and see if they support The Right Way

            Ben


            michele.mostarda Michele Mostarda added a comment -

            Thanks, please report any issues you find here.

            bencomp Ben Companjen added a comment -

            It looks like the org.apache.any23.extractor.rdfa.RDFa11Parser does support both "ns1: http://uri" and "ns1:http://uri". To test it, I copied rdfa-11-curies.html to rdfa-11-curies-a.html, added a space between the prefix and URI in the @prefix and copied the test that checked the HTML file to check my new HTML file. Both tests had correct results.

            So if only Sindice could update their Any23 (installation and) settings, my blog could be indexed correctly


            lewismc Lewis John McGibbney added a comment -

            I've been away for a few days and have only just caught up. Are we stating that our RDFa11Parser is working flawlessly (if ever such a thing could be said)? The justification Michele gave is sound reasoning; however, it raises the question of how long we will ship deprecated legacy code... this is something which I assume will be addressed further down the line.

            bencomp Ben Companjen added a comment -

            Well, last night I found that no, it isn't working flawlessly. I was just looking at the whole thing again to trace what is going wrong.

            I was about to send my XSLT file to the Sindice-dev mailing list, to bridge the period between now and the moment Sindice starts using Any23 0.7.0, when I thought of an edge case to test against. My stylesheet doesn't handle it well, and neither does my build from the SVN code.
            The case: when a prefix attribute contains "rnews:http://www.iptc.org/std/rNews/1.0: foaf:http://xmlns.com/foaf/0.1/ dbpedia:http://dbpedia.org/resource/" (no space between prefix name and prefix URI, but at least one prefix URI ending with a colon), my stylesheet wrongly assumes there are spaces between prefix name and prefix URI, because it tests whether the attribute contains ": ". The RDFa11Parser outputs warnings that it cannot map the foaf prefix when I test it on my 0.7.0 build.
            It is an edge test case because it is invalid content for a prefix attribute, but since I saw that Any23 is accepting / testing against no-space prefix definitions and I think namespace URIs with a colon are valid (e.g. <http://dbpedia.org/resource/Category:>), I figured it made an interesting test case. I sent the file anyway, with this test result too
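
The edge case can be reproduced with a small Python sketch. It is an illustration only, unrelated to the actual stylesheet or RDFa11Parser code; `looks_spaced`, `MAPPING` and `parse_prefix_attribute` are hypothetical names. The first function is the whole-attribute check that misfires. The regular expression is one possible alternative: it reads a prefix name up to the first colon, skips optional whitespace, and then takes everything up to the next whitespace as the URI, so a trailing colon stays with the URI.

```python
import re

EDGE_CASE = ("rnews:http://www.iptc.org/std/rNews/1.0: "
             "foaf:http://xmlns.com/foaf/0.1/ "
             "dbpedia:http://dbpedia.org/resource/")


def looks_spaced(value):
    # Misfires on EDGE_CASE: the URI ending in ':' is followed by a separator space.
    return ": " in value


# Prefix name (no whitespace or colon), a colon, optional whitespace, then the URI.
MAPPING = re.compile(r"([^\s:]+):\s*(\S+)")


def parse_prefix_attribute(value):
    return {name: uri for name, uri in MAPPING.findall(value)}


if __name__ == "__main__":
    print(looks_spaced(EDGE_CASE))
    print(parse_prefix_attribute(EDGE_CASE))
```

Under these assumptions the scan maps rnews, foaf and dbpedia for the value above, and it also accepts the spaced form. It does not validate that the prefix is an NCName or that the URI is well formed.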

            Another issue I have is that the RDFa11Parser doesn't infer the right triples from <link> elements with a @rel in the <head> section. I believe the @rel values "icon", "stylesheet", "bookmark" etc are to be treated specially. Sindice (Any23 0.6.1) produces URIs for these like <http://www.w3.org/1999/xhtml/vocab#icon> from my blog post. When I extract from my blog post locally, I get properties like <http://ben.companjen.name/2011/08/het-gezin-timmer-de-bruijn-in-amsterdam/icon>.
            The parser (also) complains my HTML doesn't declare 'xmlns="http://www.w3.org/1999/xhtml"'. I couldn't find the part of the HTML or RDFa specifications that says it should do so - as far as I can tell it's not necessary in HTML5.
            Looking at the code, I see some references to RDFa 1.0 in the comments in the processDocument method. This method seems to be the source for the complaint. Maybe the problem and complaint (well, warning actually) are linked? And maybe the incorrect handling of the @rel values is also linked to "// TODO: introduce support for RDFa profiles. (http://www.w3.org/TR/rdfa-core/#s_profiles)"? BTW, these profiles don't exist (anymore) in RDFa Core.
            I hope someone can shed more light on this, I'm too confused from reading all the RDFa related docs and drafts right now
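
The difference between the two @rel results described above can be illustrated with a short Python sketch. It is a guess at the two behaviours, not a description of what either Any23 version actually does: `RESERVED_TERMS` is only the small subset of terms named in the comment, and the assumption that the local build resolves the bare term against the document URL is inferred from the observed output.

```python
from urllib.parse import urljoin

XHTML_VOCAB = "http://www.w3.org/1999/xhtml/vocab#"

# Assumption: illustrative subset, taken from the terms mentioned in the comment.
RESERVED_TERMS = {"icon", "stylesheet", "bookmark"}


def resolve_rel_as_term(term):
    # Behaviour matching the Sindice output: reserved terms map into the XHTML vocabulary.
    if term.lower() in RESERVED_TERMS:
        return XHTML_VOCAB + term.lower()
    return None  # not a known term in this sketch


def resolve_rel_against_base(term, base):
    # Behaviour matching the local output: the bare term is treated as a relative reference.
    return urljoin(base, term)


if __name__ == "__main__":
    page = "http://ben.companjen.name/2011/08/het-gezin-timmer-de-bruijn-in-amsterdam/"
    print(resolve_rel_as_term("icon"))
    print(resolve_rel_against_base("icon", page))
```

Running it prints one URI in the XHTML vocabulary namespace and one under the blog post URL, the same two shapes reported in the comment.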


            People

              Assignee: Unassigned
              Reporter: Ben Companjen
              Votes: 0
              Watchers: 0

                Time Tracking

                  Estimated: 3h
                  Remaining: 3h
                  Logged: Not Specified