Apache Jena / JENA-820

Blank Node output under Hadoop can cause identifiers to diverge in multi-stage pipelines

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: Jena 2.13.0
    • Component/s: RDF Tools for Hadoop
    • Labels:
      None

      Description

      While writing up the documentation for the RDF Tools for Hadoop and enumerating the possible issues that blank nodes imply, I discovered an issue that I hadn't previously considered.

      For a single job the input and output formats all ensure that blank nodes are consistently given the same identifiers if they had the same syntactic ID and were in the same file. This holds even when a file is read in multiple chunks by multiple map tasks. However, by its nature each reduce task creates its own output file, so a single job can end up with blank nodes spread over multiple files.

      If a subsequent job then reads these files, blank nodes that were originally the same node are now spread across multiple files, so our allocation policy will cause their identifiers to diverge into distinct blank nodes, which is incorrect behaviour.

      Since there is no clear universal fix for this, what I am considering doing instead is introducing a configuration setting that allows the file path to be ignored for the purpose of blank node identifier allocation within a job. Identifiers would then be allocated purely on the basis of the Job ID, so the same syntactic ID in any file results in the same blank node identifier. Provided the user leaves this setting turned off for the first job, the normal allocation policy of that first job should ensure unique IDs even when inputs to later jobs started out with the same syntactic ID in different files.

      My next step on this is to implement a failing unit test (and then temporarily ignore it) which demonstrates this issue.
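      The proposed policy can be sketched in plain Java. This is an illustration of the idea, not the actual Jena Hadoop code: identifiers derive from a seed that normally combines the Job ID and the file path, and a hypothetical `ignoreFilePath` toggle drops the path so the same syntactic label maps to the same identifier in every file of the job.

      ```java
      import java.nio.charset.StandardCharsets;
      import java.util.UUID;

      // Illustrative sketch of the proposed allocation policy (not Jena's code).
      public class BNodeAllocSketch {
          static UUID allocate(String jobId, String filePath, String label,
                               boolean ignoreFilePath) {
              // Seed is Job ID + file path by default; Job ID only when the
              // path is ignored. The identifier is a name-based UUID over
              // (seed, label), so it is deterministic for a given seed.
              String seed = ignoreFilePath ? jobId : jobId + "/" + filePath;
              return UUID.nameUUIDFromBytes(
                      (seed + "#" + label).getBytes(StandardCharsets.UTF_8));
          }

          public static void main(String[] args) {
              // Default policy: the same label in different files diverges.
              UUID a = allocate("job_1", "part-r-00000", "b0", false);
              UUID b = allocate("job_1", "part-r-00001", "b0", false);
              System.out.println(a.equals(b));  // false

              // Path ignored: the label maps consistently job-wide.
              UUID c = allocate("job_2", "part-r-00000", "b0", true);
              UUID d = allocate("job_2", "part-r-00001", "b0", true);
              System.out.println(c.equals(d));  // true
          }
      }
      ```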

        Activity

        rvesse Rob Vesse added a comment -

        I think what has been done on the read side is sufficient to close this issue. As discussed and filed as JENA-821, there is a potential issue on the write side, though as documented this can also be worked around by using an intermediate format like RDF Thrift which guarantees ID preservation.

        jira-bot ASF subversion and git services added a comment -

        Commit 1642695 from Rob Vesse in branch 'site/trunk'
        [ https://svn.apache.org/r1642695 ]

        Notes on JENA-820 workaround

        jira-bot ASF subversion and git services added a comment -

        Commit 752646c3e0189ccdcecb58fa1489e748c988f9ba in jena's branch refs/heads/hadoop-rdf from Rob Vesse
        [ https://git-wip-us.apache.org/repos/asf?p=jena.git;h=752646c ]

        Improve tests for JENA-820 to embody assumptions

        Adds assumptions to the blank node tests because some formats don't
        respect the ParserProfile and some formats preserve blank node identity
        regardless

        jira-bot ASF subversion and git services added a comment -

        Commit ed71be184374e51b59fa921c7af56150399c6413 in jena's branch refs/heads/hadoop-rdf from Rob Vesse
        [ https://git-wip-us.apache.org/repos/asf?p=jena.git;h=ed71be1 ]

        Improved fix for JENA-820

        This commit ensures that the JENA-820 fix applies over all input formats
        not just line based formats. It also expands the test cases for blank
        node divergence and identity to cover a wider range of formats.

        andy.seaborne Andy Seaborne added a comment - edited

        I'm glad you have found ParserProfile and aren't wasting time on mechanisms that it can already help with.

        RDFDataMgr.createReader then .setParserProfile should work. If it doesn't, please raise a JIRA. There is a risk this is a path less travelled and, while it is supposed to work with all readers, maybe it doesn't. WriterGraphRIOT/WriterDatasetRIOT should have the complementary way to set the label-to-node mapper. That is missing – JENA-821.

        Let's define the required system contracts in a set of test cases.

        What I'd like to do is more formally define the blank node label in Jena as a UUID. That is, a globally unique identifier that you can rely on as being safe and that isn't a URI, for this intra-system use case. The fact that identifiers matching the UUID syntax can be stored compactly as 2 longs is a bonus.
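        The "2 longs" observation follows directly from the structure of a UUID; a minimal JDK-only illustration:

        ```java
        import java.util.UUID;

        // A UUID is exactly 128 bits, so it round-trips through two longs.
        public class UuidTwoLongs {
            public static void main(String[] args) {
                UUID original = UUID.randomUUID();
                // Store as two longs...
                long hi = original.getMostSignificantBits();
                long lo = original.getLeastSignificantBits();
                // ...and reconstruct without loss.
                UUID restored = new UUID(hi, lo);
                System.out.println(original.equals(restored));  // true
            }
        }
        ```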

        At some level, moving a blank node across machines and wanting it to match up with the same blank node that has travelled a different path "inside the graph" is outside the RDF standards, at least the syntaxes. The standard syntaxes only talk about what happens at the boundary of the graph, not issues within the graph. Your example of where you don't want it is, to my way of thinking about it, the boundary. One needs both.

        That is very painful when "the graph" is across different machines or any place across boundaries where references to concrete objects don't work.

        The RDF-WG work on skolemization gives reference across the graph boundary. That's the nearest to portability. The use case is a reference across the web that can come back to the original place (the host name fixes that and defines the scope of where it can be matched).
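        Concretely, RDF 1.1 skolemization replaces a blank node with a fresh IRI under the minting host's /.well-known/genid/ path, which is what lets a reference cross the graph boundary yet still be traced back to its origin. A minimal sketch (the host name and label are placeholders):

        ```java
        // Sketch of RDF 1.1 skolemization: a blank node label is replaced by
        // a globally unique IRI under the host's /.well-known/genid/ path.
        // example.org and b42 are hypothetical values for illustration.
        public class SkolemSketch {
            static String skolemize(String host, String bnodeLabel) {
                return "http://" + host + "/.well-known/genid/" + bnodeLabel;
            }

            public static void main(String[] args) {
                System.out.println(skolemize("example.org", "b42"));
                // → http://example.org/.well-known/genid/b42
            }
        }
        ```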

        <_:label> is a different use case - identity within a distributed system (it also predates the RDF-WG by several years; it's becoming more important as we go multi-machine). The point is to have a blank node syntax that is parseable but an illegal URI: _ is not a valid scheme name, which must start with [a-z].

        rvesse Rob Vesse added a comment -

        Yes, I already use ParserProfile for line based inputs; the current behaviour is to assign a seed based on the combination of Job ID and file path, which yields file scoped IDs that are consistent within a job (though not necessarily reproducible by subsequent jobs). What would be good to know is how to pass a ParserProfile down when using the RDFDataMgr.parse() style operations, since I can't see an obvious way to do this right now.

        I am loath to use <_:label> unless it is the only viable solution since it goes outside standard RDF and makes the output non-portable.

        Using the Thrift output format is a good workaround especially for multi-stage pipelines since it is very efficient to read and write.

        There is also the issue that you don't want this behaviour on by default. If I start with two files (from some external source) that have equivalent blank node labels, those labels should be treated as file scoped identifiers and assigned different identifiers. You only need/want this behaviour on when you know the input is the output of a previous job and may therefore have blank nodes with the same labels spread over multiple files.

        jira-bot ASF subversion and git services added a comment -

        Commit b65a26f445ef8c840c150ce19855e17f3e7ca5a6 in jena's branch refs/heads/hadoop-rdf from Rob Vesse
        [ https://git-wip-us.apache.org/repos/asf?p=jena.git;h=b65a26f ]

        Initial workaround for JENA-820

        Currently only works correctly when the output is line based, need to
        add more test cases and research further into how to implement this for
        block and whole file based modes.

        jira-bot ASF subversion and git services added a comment -

        Commit 22995bb2a7a69f03cccc258249a8b3e67e1d077d in jena's branch refs/heads/hadoop-rdf from Rob Vesse
        [ https://git-wip-us.apache.org/repos/asf?p=jena.git;h=22995bb ]

        Test case that demonstrates JENA-820

        Test case that demonstrates an issue with processing blank nodes in
        multi-stage pipelines being subject to blank node divergence. Test is
        failing but set to ignore until the workaround suggested in JENA-820 is
        implemented.

        andy.seaborne Andy Seaborne added a comment -

        There are mechanisms for this in RIOT.

        1 – Use <_:LABEL> for writing a bNode and then the label is preserved.

        2 – Consistent parsing across streams:

        The ParserProfile dictates the conversion from the syntax read to Nodes; one operation is createBlankNode(String label, ...). ParserProfileBase controls this with a "label to node" policy object, LabelToNode.

        For example, LabelToNode.createScopeByDocumentHash takes a seed (a UUID, so a very large space). The default is to use a new UUID per parser. This scales to arbitrarily large data files because parsing retains no growing state to track labels throughout the run.

        You can change the policy to start from a fixed seed for all files.

        The Thrift format preserves bNode labels.
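        The scoping behaviour described above can be illustrated without Jena. In this JDK-only sketch (not the actual LabelToNode implementation) a node id is a name-based UUID over (seed, label): a fresh random seed per parse run mimics the default policy, while a fixed seed makes equal labels converge across runs.

        ```java
        import java.nio.charset.StandardCharsets;
        import java.util.UUID;

        // Illustration of seed-scoped label allocation (not Jena's code).
        public class LabelScopeSketch {
            static UUID node(UUID seed, String label) {
                // Deterministic name-based UUID over (seed, label).
                return UUID.nameUUIDFromBytes(
                        (seed + "#" + label).getBytes(StandardCharsets.UTF_8));
            }

            public static void main(String[] args) {
                // Default policy: each parse run gets its own random seed, so
                // the same label read twice yields two distinct blank nodes.
                UUID run1 = node(UUID.randomUUID(), "b0");
                UUID run2 = node(UUID.randomUUID(), "b0");
                System.out.println(run1.equals(run2));  // false

                // Fixed seed: the same label converges across runs.
                UUID fixed = UUID.fromString("00000000-0000-0000-0000-000000000001");
                System.out.println(node(fixed, "b0").equals(node(fixed, "b0")));  // true
            }
        }
        ```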


          People

          • Assignee: Rob Vesse
          • Reporter: Rob Vesse
          • Votes: 0
          • Watchers: 3
