Jackrabbit Content Repository
  JCR-3534

Efficient copying of binaries across repositories with the same data store

    Details

      Description

      We have a couple of use cases where we would like to leverage the global data store to avoid sending and copying large binary data around unnecessarily: two separate Jackrabbit instances are configured to use the same DataStore (for the sake of this discussion, assume the problems of concurrent access and garbage collection are under control). When sending content from one instance to the other, we don't want to transfer potentially large binary data (e.g. video files) if it isn't needed.

      The idea is for the sender to just send the content identity from JackrabbitValue.getContentIdentity(). The receiver would then check whether such content already exists and reuse it if so:

      String ci = contentIdentity_from_sender;
      try {
          Value v = session.getValueByContentIdentity(ci);
          Property p = targetNode.setProperty(propName, v);
      } catch (ItemNotFoundException ie) {
          // unknown or invalid content identity
      } catch (RepositoryException re) {
          // some other exception
      }

      Thus the proposed JackrabbitSession.getValueByContentIdentity(String) method would allow round-tripping the value of JackrabbitValue.getContentIdentity(), preventing superfluous copying and moving of binary data.
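As an illustration of why the identity can round-trip at all: in the FileDataStore, a binary's data identifier is the hex-encoded SHA-1 digest of its content, so two repositories sharing the same store arrive at the same identity for the same bytes independently. A pure-JDK sketch of that style of identifier (`ContentIdentitySketch` and `contentIdentity` are hypothetical names for illustration, not Jackrabbit API):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class ContentIdentitySketch {

    // Hypothetical helper: computes a FileDataStore-style identifier,
    // i.e. the hex-encoded SHA-1 digest of the binary content.
    static String contentIdentity(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-1").digest(content);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-1 is always available in the JDK
        }
    }
}
```

Because the identity is derived from the content alone, the receiver only needs the short hex string to decide whether the binary is already present in the shared store.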

      See also the dev@ thread http://jackrabbit.markmail.org/thread/gedk5jsrp6offkhi

      1. JCR-3534.2.patch
        25 kB
        Tommaso Teofili
      2. JCR-3534.26.patch
        27 kB
        Tommaso Teofili
      3. JCR-3534.3.patch
        32 kB
        Tommaso Teofili
      4. JCR-3534.4.patch
        24 kB
        Tommaso Teofili
      5. JCR-3534.6.patch
        2 kB
        Tommaso Teofili
      6. JCR-3534.7.patch
        10 kB
        Tommaso Teofili
      7. JCR-3534.patch
        24 kB
        Tommaso Teofili
      8. JCR-3534.patch
        12 kB
        Felix Meschberger

        Issue Links

          Activity

          Jukka Zitting added a comment -

          Excellent! I'd consider this resolved then. We can track further improvements (e.g. implementing this also for the DbDataStore or using something like Java's KeyStore for the reference key) in followup issues.

          Tommaso Teofili added a comment - - edited

          I've removed the unneeded secret param from the test repository.xml at http://svn.apache.org/r1484727 and backported the current implementation at http://svn.apache.org/r1484726

          Tommaso Teofili added a comment -

          here's a patch which backports the previous commits on trunk to the 2.6 branch (used svn merge to keep history)

          Tommaso Teofili added a comment -

          > Tagging this for the 2.6 branch. I think the solution here is getting stable enough to be backported.

          +1, I think it's stable and still flexible enough as it is now

          Jukka Zitting added a comment -

          Tagging this for the 2.6 branch. I think the solution here is getting stable enough to be backported.

          > I'm not sure about 3 as with URIs potentially pointing to localhost such a mechanism may become unreliable

          Good point. The connection URI might not even be easily available if a data source is used.

          Tommaso Teofili added a comment -

          > Yes. I left that bit undone to keep things simple for now and to highlight that not all implementations need to implement this feature.

          ok it makes sense

          > I'm also not sure whether option 2 or 3 from above would be the best key mechanism for the DbDataStore

          I'm not sure about 3 as with URIs potentially pointing to localhost such a mechanism may become unreliable

          Jukka Zitting added a comment -

          > override the AbstractDataStore#getOrCreateReferenceKey method in DbDataStore

          Yes. I left that bit undone to keep things simple for now and to highlight that not all implementations need to implement this feature. I'm also not sure whether option 2 or 3 from above would be the best key mechanism for the DbDataStore.

          Tommaso Teofili added a comment -

          Thanks Jukka, it looks good to me.
          If I understand things correctly, we need to override the AbstractDataStore#getOrCreateReferenceKey method in DbDataStore too, to allow the reference key to be shared in this scenario when using it.
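The override described here could look roughly as follows. The classes below are minimal stand-ins, not the real Jackrabbit hierarchy: only the getOrCreateReferenceKey hook is modeled, and the database is stubbed with an in-memory field where a real DbDataStore would read and write a dedicated table row.

```java
import java.security.SecureRandom;

// Minimal stand-in for AbstractDataStore; the real class has many more members.
abstract class AbstractDataStoreSketch {
    abstract byte[] getOrCreateReferenceKey();
}

// Hypothetical DbDataStore-style override: return the persisted key if one
// exists, otherwise generate a random key and persist it before returning.
class DbDataStoreSketch extends AbstractDataStoreSketch {

    private byte[] persistedKey; // stand-in for a dedicated database row

    @Override
    byte[] getOrCreateReferenceKey() {
        if (persistedKey == null) {
            byte[] key = new byte[32]; // 256 bits
            new SecureRandom().nextBytes(key);
            persistedKey = key; // a real implementation would INSERT the row here
        }
        return persistedKey;
    }
}
```

The important property is that repeated calls return the same key, so every instance sharing the backend store sees one stable reference key.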

          Jukka Zitting added a comment -

          In http://svn.apache.org/r1484444 I changed the getIdentifierFromReference method to getRecordFromReference to make sure that a reference can only be used if the referenced binary actually exists.

          Jukka Zitting added a comment -

          I committed a somewhat revised version of the patch in http://svn.apache.org/r1484440. Instead of overloading the normal record storage mechanism (and having to deal with interference from things like the garbage collector), I just left it open for each DataStore implementation to figure out how, or if, it wants to store the reference key.

          Angela:
          > we should not have a plain txt referenceKey/secret stored anywhere

          What would you propose as an alternative?

          Tommaso Teofili added a comment -

          here's a new patch which stores the reference key in the data store.
          The AbstractDataStore class no longer gets the configured value; instead it retrieves the key from the repository if available, or stores it first via the abstract addReferenceKeyRecord(DataIdentifier, InputStream) method if not.
          By default the reference key is a randomly generated 256-bit byte[] from a SecureRandom, and the key identifier is just a String.
          Both of these last two points can be overridden by any subclass of AbstractDataStore.
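The default generation described above fits in a few lines of plain JDK code. The class and method names are illustrative; only the 256-bit SecureRandom key and the stream handed to addReferenceKeyRecord(DataIdentifier, InputStream) come from the comment:

```java
import java.io.ByteArrayInputStream;
import java.security.SecureRandom;

class DefaultReferenceKeySketch {

    // Default described in the comment: a 256-bit key from SecureRandom.
    static byte[] generateReferenceKey() {
        byte[] key = new byte[32]; // 32 bytes = 256 bits
        new SecureRandom().nextBytes(key);
        return key;
    }

    // The generated key would be handed to the abstract
    // addReferenceKeyRecord(DataIdentifier, InputStream) method as a stream.
    static ByteArrayInputStream asRecordStream(byte[] key) {
        return new ByteArrayInputStream(key);
    }
}
```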

          Tommaso Teofili added a comment -

          > what i am saying is that we should not have a plain txt referenceKey/secret stored anywhere

          What should we have in your opinion instead? Would a randomly generated key stored in the data store address your concern?
          As far as I can see, it has to be stored somewhere (not in memory) if we want it to be shared (no configuration), or it has to be (manually) configured on both repositories if we don't want to persist it.

          > just changing the name doesn't make it better.

          sure, in fact I didn't change it to make things more secure, but just because it was a better name for it

          > if storing the key is always data-store specific (such as a special file in the FileDataStore, which is by far the most important data store, not sure if anyone is using the slow database data store), you only need a getSecret() method on the AbstractDataStore class

          ok, but it'd probably be better to expose a default way of getting such a secret / referenceKey with just an abstract method which knows how to store the reference key value given its identifier. Also, a subclass of AbstractDataStore may then choose to override the reference key identifier or value.

          Alexander Klimetschek added a comment - - edited

          @Tommaso: if storing the key is always data-store specific (such as a special file in the FileDataStore, which is by far the most important data store, not sure if anyone is using the slow database data store), you only need a getSecret() method on the AbstractDataStore class. See my post from above: https://issues.apache.org/jira/browse/JCR-3534?focusedCommentId=13653621&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13653621

          angela added a comment -

          yes, anyone that has access to the filesystem is admin... that's according to our threat model.
          what i am saying is that we should not have a plain txt referenceKey/secret stored anywhere. just changing the name doesn't make
          it better.

          Tommaso Teofili added a comment -

          here's the trivial patch which reflects the approach from my comment

          Tommaso Teofili added a comment -

          >that doesn't make it secure... the problem is that that value should not be stored as plain text value in a file that is
          readable for everyone that can write "new File....".
          > Putting it in the datastore as outlined above would not necessarily make access more difficult

          I agree with Jukka on this point: if one has access to repository.xml, then he/she is already "root" on the whole repository.

          The problem with storing the reference key in the datastore is that it cannot be done via the DataStore API, which doesn't offer a method to add a DataRecord given a DataIdentifier (that would have to be fixed to let the AbstractDataStore know which identifier is assigned to the reference key record); therefore each DS implementation would have to do it in its own specific way.

          Given that, it seems to me the best option is having a getter and setter in the AbstractDataStore for the referenceKey (a better name than secret); subclasses can then extend the getter and define a DS-implementation-specific way of storing/retrieving the referenceKey if it's not provided in the configuration.

          Alexander Klimetschek added a comment -

          Putting it in the datastore as outlined above would not necessarily make access more difficult (assuming repository.xml and datastore directory have similar ACLs on the file system), but it would reduce user errors:

          • blindly copying repository.xml over from another instance with the same shared secret
          • users using weak secrets (instead of a long random auto generated key)
          • users having to make sure the secret is synced in addition to sharing the data store directory
          Jukka Zitting added a comment -

          > the problem is that that value should not be stored as plain text value in a file that is readable for everyone that can write "new File...."

          Anyone with access to the local file system can read the entire repository.xml, and thus in any case has full access to all content inside the repository. Putting the reference key in some other location and/or encrypting it in some way doesn't make the system any more secure.

          angela added a comment -

          Tommaso Teofili
          > Just mind that the unsecure key is configured only on the test repository.xml

          that doesn't make it secure... the problem is that that value should not be stored as a plain text value in a file that is readable for everyone that can write "new File....".
          i am really not happy with that and i want this to be addressed before this feature is released in a stable jackrabbit version. 2.7 is an unstable branch so i consider this a temp. solution.

          -1 for merging this to the stable 2.6.x branch
          -1 for any 2.8 release that contains this.

          Jukka Zitting added a comment -

          Chatting with Tommaso, we realized that getting the reference into DataIdentifiers is a bit tricky, since some code paths instantiate DataIdentifiers directly from a string without consulting the respective DataStore. Instead of changing those places, a simpler alternative turned out to be moving the getReference() method from DataIdentifier to DataRecord, which we implemented in http://svn.apache.org/r1481964.

          Tommaso Teofili added a comment -

          I agree with your comments about the shared secret/referenceKey configuration.
          Just mind that the unsecure key is configured only in the test repository.xml, and therefore shouldn't impact the default one shipped in src/main/resources.

          Regarding the getIdentifierFromReference() method verifying the existence of the record, I'm not completely sure about it; while chatting with ThomasM we also thought about that, and it seemed to us it was something for the DataStore to handle when resolving the record from the identifier (or at least that seems to be how it's done with "plain" DataIdentifiers).
          However, it may make sense to do as Jukka suggests here.

          Alexander Klimetschek added a comment -

          The longer I think about it, the more I believe we should not have the configuration option at all. It is not required for now, and dropping it would avoid all the possible issues such as accidentally sharing the key.

          Jukka Zitting added a comment -

          Note that an explicitly configured or generated "secret" is not necessary in all cases, as all we need is some non-public and un-guessable key that is different for each distinct backend store. For example the database and S3 data stores could combine the connection parameters, including the secret password or access key, to generate such a reference key without the need for extra configuration or key storage.

          So we have at least three ways in which a data store could get such a reference key:

          1. As an explicit "secret" (perhaps "referenceKey" was a better name) configuration parameter
          2. As a random string that's automatically generated at first startup and stored along with the rest of the data store contents
          3. As a combination of existing configuration parameters

          An implementation could even use a combination of the above. I don't have much preference for any specific implementation, as they all cover the same basic functionality (each with its own minor benefits and downsides compared to the others), and there's no inherent difference in security, as anyone with access to the <DataStore/> configuration entry will in any case have full access to the entire data store.

          In any case I agree with Angela that whatever we do, the default repository.xml file should not contain a pre-defined reference key, as that'll make all default deployments share the same key even if their data stores are distinct.

          Also, another change that occurs to me is that the getIdentifierFromReference() method should also verify the existence of the referenced record before returning the identifier, as otherwise we could end up with dangling references to garbage-collected binaries.
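Option 3 above can be sketched in a few lines: digest the existing (non-public) connection parameters into a fixed-size key, so no extra configuration or key storage is needed. The names and the exact derivation below are illustrative, not what Jackrabbit implements:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class DerivedReferenceKeySketch {

    // Hypothetical option-3 derivation: hash the connection parameters
    // (URL, user, password) into a 256-bit reference key. Two instances
    // configured against the same backend derive the same key; any change
    // to the secret parameters yields a different one.
    static byte[] deriveReferenceKey(String url, String user, String password) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            String material = url + "\n" + user + "\n" + password;
            return md.digest(material.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always available in the JDK
        }
    }
}
```

The key is only as unguessable as the parameters it is derived from, which is why the thread later notes that URIs pointing to localhost would make this mechanism unreliable.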

          Alexander Klimetschek added a comment -

          My proposal above could actually be implemented now without much effort - it has the big advantage of being configuration-less while at the same time ensuring a different secret for different data stores (i.e. avoiding people copying the config files and thus the same secret).

          Tommaso Teofili added a comment -

          > no, you got that right... i don't want to block the cq release if there is no other solution right now. but i would like us to discuss
          already today if there was really no other way and how we are going to address this in OAK before we ship that feature.
          therefore i was somehow waiting for a confirmation that my concern has been noticed.

          sure, I agree and will be happy to discuss a proper way to handle that in Oak

          > if we end up with releasing it with that entry in the repository.xml, i would like to see us having at least a FIXME comment
          associated with the repository.xml as this is not secure at all.... currently it states "a very well kept secret"... that's sort of
          a joke given the fact that the repository.xml is almost accessible to everyone

          right

          angela added a comment -

          > probably I misunderstood but I thought that was ok for the moment as long as we also can find a better solution for Oak,
          > which I think would have to deal with the same topic of shared resources probably on a dedicated issue.

          no, you got that right... i don't want to block the cq release if there is no other solution right now. but i would like us to discuss
          already today if there was really no other way and how we are going to address this in OAK before we ship that feature.
          therefore i was somehow waiting for a confirmation that my concern has been noticed.

          if we end up with releasing it with that entry in the repository.xml, i would like to see us having at least a FIXME comment
          associated with the repository.xml as this is not secure at all.... currently it states "a very well kept secret"... that's sort of
          a joke given the fact that the repository.xml is almost accessible to everyone

          Alexander Klimetschek added a comment -

          The DataStore could generate this shared secret the first time it is started (i.e. no secret present), store it in a file in the root folder in case of the FileDataStore, and then every shared usage will have access to it anyway. Then this secret can be a secure long random string.

          Implementation wise, AbstractDataStore would remove setSecret() and add a getSecret(), which it implements returning null - which means no reference binaries are supported here (i.e. the DataIdentifier would get a null reference). Then DataStore implementations would override the getSecret() method, which would look for the secret and, if not present, create it. AbstractDataStore should provide a generateSecret() method to use. The FileDataStore would then define the name and location of the file. Something like "secret" in the root folder of the storage should work - afaics it cannot conflict with normal entries, since they are always in "hash" subdirectories.

          For the DbDataStore it could be a specially named entry that cannot collide with binaries. For the MultiDataStore I don't know.
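The FileDataStore part of this proposal can be sketched with plain java.nio file operations. The file name "secret" in the store's root folder comes from the comment; the class and method names and the 32-byte size are illustrative, not Jackrabbit's actual implementation:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.SecureRandom;

class FileDataStoreSecretSketch {

    // On first start no secret file exists, so generate a long random secret
    // and store it in the root folder; every instance sharing the same store
    // directory then reads back the identical secret on later starts.
    static byte[] getOrCreateSecret(Path storeRoot) throws IOException {
        Path secretFile = storeRoot.resolve("secret");
        if (Files.exists(secretFile)) {
            return Files.readAllBytes(secretFile);
        }
        byte[] secret = new byte[32];
        new SecureRandom().nextBytes(secret);
        Files.createDirectories(storeRoot);
        Files.write(secretFile, secret);
        return secret;
    }
}
```

Because content entries live in hash-named subdirectories, a top-level "secret" file cannot collide with them, which is the point made above.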

          Tommaso Teofili added a comment -

          > not really.... didn't you see my comment above regarding the modification to repository.xml?

          probably I misunderstood but I thought that was ok for the moment as long as we also can find a better solution for Oak, which I think would have to deal with the same topic of shared resources probably on a dedicated issue.

          An alternative idea would be storing the secret inside the DataStore itself, but then one would need a way of getting that specific DataRecord with a specific DataIdentifier, which would be tricky: you would have to either define an API for this or rely, at the upper level (e.g. in ValueFactoryImpl), on knowledge of how the secret is stored in the DS.

          To me the current approach sounds like a reasonably ok one for now while I agree we can work to improve that for Oak.

          angela added a comment -

          > as I assume everyone else is happy with it I've applied the patch at http://svn.apache.org/r1480574

          not really.... didn't you see my comment above regarding the modification to repository.xml?

          Tommaso Teofili added a comment - - edited

          thanks Alex for pointing that out, I'll do it as soon as possible.

          Also I just noticed we removed the constructor new DataIdentifier(byte[] bytes); I'll put it back in, as DataStore implementors may create DataIdentifiers, so removing it would introduce a backward compatibility issue for them. Since we already have the code for converting to a hex String in AbstractDataStore#encodeHexString, doing so is trivial.

          Alexander Klimetschek added a comment -

          Looks great!

          We should document the new feature and in particular the "secret" parameter for the DataStore when there is time:

          • on the http://wiki.apache.org/jackrabbit/DataStore wiki page (with an example of how to use it and a note that it only applies to File & DbDataStore)
          • in the javadocs of AbstractDataStore.setSecret
          • in the "default" repository.xml (maybe): http://svn.apache.org/repos/asf/jackrabbit/trunk/jackrabbit-core/src/main/resources/org/apache/jackrabbit/core/repository.xml
          • on the configuration page (with version info): http://jackrabbit.apache.org/jackrabbit-configuration.html#JackrabbitConfiguration-Datastoreconfiguration
          Tommaso Teofili added a comment -

          as I assume everyone else is happy with it I've applied the patch at http://svn.apache.org/r1480574

          Jukka Zitting added a comment -

          Looks great, thanks! I'll commit it later today unless anyone spots further areas of improvements or problems to fix.

          Tommaso Teofili added a comment -

          updated version of the patch which incorporates Jukka's changes, plus some minor refactoring: renamed DataIdentifier#getIdentifier(String reference) to DataIdentifier#getIdentifierFromReference(String reference) and added a first test which briefly shows how the new API can be used at the higher level.

          Tommaso Teofili added a comment -

          it looks really nice to me, thanks Jukka

          Jukka Zitting added a comment -

          Here's a quick draft of what this could look like: https://github.com/jukka/jackrabbit/commit/a2ff6bc70bf6d9385ad384547a63b02213acc8d6
          Marcel Reutegger added a comment -

          Summary of an offline discussion with Tommaso, Alex, Jukka, Thomas and Marcel:

          • Introduce new interface to Jackrabbit API: ReferenceBinary extending Binary with getReference method that returns a String
          • Add SimpleReferenceBinary class to jcr-commons implementing the above. The SimpleReferenceBinary methods will throw an exception (IllegalState or Repository) when accessing Binary interface methods
          • SimpleReferenceBinary is created by supplying a String which is identifier+HMAC
          • ReferenceBinary will be created without supplying the HMAC externally. We will provide a mechanism to get a reference from a blob in the store
          • Add a new method getReference to DataIdentifier (jackrabbit-core) which returns a String of the reference
          • Create AbstractDataStore from which all DataStore impls extend
          • Provide a mechanism in the AbstractDataStore that creates an identifier from the reference
          • The implementations of Binary need to also implement the ReferenceBinary interface if the binary supports a reference

          Client code will then use it like this:

          • Create the ReferenceBinary on sender by checking Binary instanceOf ReferenceBinary
          • On the receiver side, create a SimpleReferenceBinary from the reference and create a value by passing it to ValueFactory.createValue(Binary)

          Comments welcome.
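          The interface split agreed above might look roughly like this. This is a self-contained sketch only: the Binary interface below is a minimal stand-in for javax.jcr.Binary so the example compiles without the JCR API jar, and the method bodies are illustrative, not the real jackrabbit-api / jcr-commons code:

```java
import java.io.InputStream;

// Stand-in for javax.jcr.Binary, reduced to a few methods for this sketch.
interface Binary {
    InputStream getStream();
    long getSize();
    void dispose();
}

// Proposed Jackrabbit API addition: a Binary transportable as an opaque string.
interface ReferenceBinary extends Binary {
    String getReference();
}

// Proposed jcr-commons helper: wraps a received reference. Only getReference()
// is usable; plain Binary methods throw, because only a repository that owns
// the shared data store can resolve the reference to actual content.
class SimpleReferenceBinary implements ReferenceBinary {

    private final String reference; // opaque token, e.g. "identifier:HMAC"

    SimpleReferenceBinary(String reference) {
        this.reference = reference;
    }

    @Override
    public String getReference() {
        return reference;
    }

    @Override
    public InputStream getStream() {
        throw new IllegalStateException("reference-only binary");
    }

    @Override
    public long getSize() {
        throw new IllegalStateException("reference-only binary");
    }

    @Override
    public void dispose() {
        // nothing to release: no content is held locally
    }
}
```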

          angela added a comment - - edited

          +1 for marcels comment.

          one more comment from my side:
          the modification to the repository.xml looks a bit suspicious to me:

            - <DataStore class="org.apache.jackrabbit.core.data.FileDataStore"/>
            + <DataStore class="org.apache.jackrabbit.core.data.FileDataStore">
            +   <param name="secret" value="a very well kept secret"/>
            + </DataStore>

          there was something similar present for the trusted-credentials configuration (aka pseudo-sso), which we finally deprecated (no longer supported in oak).
          that's why i am not too happy about this entry...
          maybe we can find a better solution?
          or at least discuss how to properly solve this in oak... then i would just consider this a temporary workaround in the unstable 2.7 cycle.

          Marcel Reutegger added a comment -

          I'd rename the BinaryReferenceMessage to ReferenceBinary. The data inside the ReferenceBinary can be a message, but other than that the class is primarily a Binary and not a message.

          More important however, is to make the API as small as possible. Right now, we have BinaryReferenceMessage with many public methods. I think it would be much better to only have a few methods. I like Jukka's suggestion earlier to create an abstract ReferenceBinary class without a direct dependency to the message or reference internals. The message or reference inside the ReferenceBinary could be an opaque String, which can easily be serialized and deserialized through various protocols. The use of a HMAC then becomes an implementation detail.

          The remaining question is how to get a ReferenceBinary from a Repository. I think this has to go into the API (just like the ReferenceBinary). I'd probably add it to the existing JackrabbitValue interface and add a method getReferenceBinary().
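          The "opaque String with the HMAC as an implementation detail" idea could work along these lines. This is purely a sketch under the assumption of an "identifier:HMAC" format (the class name ReferenceTokenSketch and the exact format are made up; the actual Jackrabbit encoding may differ), keyed with the data store secret so only repositories sharing the store can mint valid references:

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical "identifier:HMAC" reference token, signed with the store secret.
public class ReferenceTokenSketch {

    /** Builds a signed reference for a data identifier. */
    public static String createReference(String identifier, byte[] secret) {
        return identifier + ":" + hmac(identifier, secret);
    }

    /** Returns the identifier if the signature checks out, null otherwise. */
    public static String parseReference(String reference, byte[] secret) {
        int colon = reference.lastIndexOf(':');
        if (colon < 0) {
            return null;
        }
        String identifier = reference.substring(0, colon);
        byte[] expected = hmac(identifier, secret).getBytes(StandardCharsets.UTF_8);
        byte[] actual = reference.substring(colon + 1).getBytes(StandardCharsets.UTF_8);
        // constant-time comparison, so timing does not leak the secret
        return MessageDigest.isEqual(expected, actual) ? identifier : null;
    }

    private static String hmac(String message, byte[] secret) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret, "HmacSHA256"));
            byte[] raw = mac.doFinal(message.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : raw) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

          Because the token is just a String, it serializes trivially over any protocol, and switching the signature scheme later would not change the API.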

          Tommaso Teofili added a comment -

          > - BinaryReferenceMessage just a data object holding a "token/message" string (passed via constructor and read via getter)

          ok, no changes needed on the last patch

          > - other Binary methods empty/no-op

          that doesn't sound too nice to me, but I can live with that

          > - message format creation including signature done inside new API (e.g. JackrabbitValue#getSecureContentIdentity())

          that can be easily done for sure but I wonder if such an API would be accepted / introduce new security concerns.

          > - message format parsing and signature validation all happening inside createValue(Binary) when it sees a BinaryReferenceMessage

          ok, no changes needed on the last patch

          Alexander Klimetschek added a comment -

          The main point is actually that the signature MUST be created inside the repository implementation, I can't see how client code would have access to the data store secret which is required to create the signature.

          This leads to changing JackrabbitValue#getContentIdentity(): I did not find a use of getContentIdentity() in our proprietary code; jackrabbit itself only provides it, but there is no utility etc. making use of it. I wonder what you could actually do with it so far other than checking for equality against a content id fetched from somewhere else. But if the hmac signature is expected to change every time, this would break the api contract afaics, which clearly states "If two values have the same identifier, the content of the value is guaranteed to be the same". This forces us to add a new API that returns this signed message; no way around that afaics.

          That would be my proposal:

          • BinaryReferenceMessage just a data object holding a "token/message" string (passed via constructor and read via getter)
          • other Binary methods empty/no-op
          • message format creation including signature done inside new API (e.g. JackrabbitValue#getSecureContentIdentity())
          • message format parsing and signature validation all happening inside createValue(Binary) when it sees a BinaryReferenceMessage
          Alexander Klimetschek added a comment -

          Because client code must not use jackrabbit-core inner classes, nor should jackrabbit-core export any APIs.

          Tommaso Teofili added a comment -

          could you please detail why that would be the case?

          Alexander Klimetschek added a comment -

          Yeah, but then you can't use it for the intended use case...

          Tommaso Teofili added a comment -

          what about having all the binary reference stuff as it is now committed under jackrabbit-core and improve it from there iteratively?
          I'd very much like to move this forward if there are no big blockers.

          Alexander Klimetschek added a comment -

          I think on the API level we want a single string (token) - the repository implementation generates and also handles it, all with a minimum of API additions.

          > otherwise it seems to me this is not "that" high level to justify its belonging to jackrabbit-api

          IMO there is no gray area - either we say it's an API feature or it's not a feature at all.

          (BTW, to me jackrabbit-jcr-commons is a big chunk of utilities and it's already suboptimal that both repository implementation and jcr client code depend on it).

          > that's basically for compatibility with existing code where you may now receive such an object and you wouldn't expect that to return null.
          There is no way that you can "accidentally" get the BinaryReferenceMessage. The repository would never return it (via getProperty() etc.), the only way it can exist is between an API client instantiating it itself and passing it to ValueFactory.createValue(Binary).

          But maybe this is an indication that this Binary trick is too fuzzy from an API design standpoint...

          Some others should share their opinions.

          Tommaso Teofili added a comment -

          > Any key generation should come from the repository itself. Currently it's mixed - generating is done by the client (needs to know the right format) and later reading & validating is done by the repository in createValue(). It should all be done by the repository, the client code should not have to know the format of the signature. I was thinking that we change the JackrabbitValue#getContentIdentity() [0] to return the final secure/signed token format. Should be possible to detect old and new IDs IMHO. This requires that other code (where?) that uses the content identity can work with the new format as well.

          generating a BinaryReferenceMessage has to be done with BinaryReferenceMessage(String contentIdentity, String key, long referencedBinaryLength) which doesn't require any knowledge of signature / message format.
          I'm -0.5 for changing JackrabbitValue#getContentIdentity(), IMHO that may be really confusing as it wouldn't expose the content identity anymore but a message with other semantics.

          > - BinaryReferenceMessage should go into jackrabbit-api instead of jackrabbit-jcr-commons since it's a specific Jackrabbit API and jcr-commons is more about utilities around both JCR API users and JCR implementors, which makes it hard to find actual jackrabbit specific features

          if you'd say jackrabbit-core then it probably makes more sense; otherwise it seems to me this is not "that" high level to justify its belonging in jackrabbit-api. In the end its main purpose is to allow "efficient copying of binaries across repositories with the same datastore", which seems to me to best fit a core concept related to some specific deployments.

          > BinaryReferenceMessage also implements getStream(), read() and getSize() by providing the message token - IMHO that is misleading if it is used as special interface only. These should probably return null. The repository itself would never return such a Binary itself.

          that's basically for compatibility with existing code where you may now receive such an object and you wouldn't expect that to return null.

          Alexander Klimetschek added a comment -

          Looks good. But some more comments

          Since the BinaryReferenceMessage is API, I would keep it as minimal as possible:

          • Any key generation should come from the repository itself. Currently it's mixed - generating is done by the client (needs to know the right format) and later reading & validating is done by the repository in createValue(). It should all be done by the repository, the client code should not have to know the format of the signature. I was thinking that we change the JackrabbitValue#getContentIdentity() [0] to return the final secure/signed token format. Should be possible to detect old and new IDs IMHO. This requires that other code (where?) that uses the content identity can work with the new format as well.
          • BinaryReferenceMessage should go into jackrabbit-api instead of jackrabbit-jcr-commons since it's a specific Jackrabbit API and jcr-commons is more about utilities around both JCR API users and JCR implementors, which makes it hard to find actual jackrabbit specific features
          • BinaryReferenceMessage also implements getStream(), read() and getSize() by providing the message token - IMHO that is misleading if it is used as special interface only. These should probably return null. The repository itself would never return such a Binary itself.

          [0] http://jackrabbit.apache.org/api/2.4/org/apache/jackrabbit/api/JackrabbitValue.html#getContentIdentity()

          Tommaso Teofili added a comment -

          updated patch which introduces special handling of BinaryReferenceMessage in VF.createValue(Binary).

          If the reference message holds an existing content id and the signature matches then the referenced binary is retrieved and returned as the Value.

          If either the content id doesn't exist or the signature doesn't match, a RepositoryException is thrown (returning null doesn't seem to be allowed by the JCR spec).
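          The receiver-side dispatch described here can be sketched with stand-in types (everything below, including the CreateValueSketch name, the Map standing in for the shared data store, and the unchecked RepositoryException, is hypothetical; the real logic lives in ValueFactoryImpl and validates the HMAC before the lookup):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the special handling of reference binaries in a
// createValue(Binary)-style method. Not the real Jackrabbit classes.
public class CreateValueSketch {

    interface Binary {}

    // A Binary that carries only an (already signature-checked) reference.
    interface ReferenceBinary extends Binary {
        String getReference();
    }

    // Unchecked here for brevity; the JCR RepositoryException is checked.
    static class RepositoryException extends RuntimeException {
        RepositoryException(String msg) { super(msg); }
    }

    // identity -> stored content, standing in for the shared data store
    private final Map<String, byte[]> dataStore = new HashMap<>();

    public void put(String identity, byte[] content) {
        dataStore.put(identity, content);
    }

    /** Resolves a reference binary to its stored content, or fails loudly. */
    public byte[] createValue(Binary binary) {
        if (binary instanceof ReferenceBinary) {
            String identity = ((ReferenceBinary) binary).getReference();
            byte[] content = dataStore.get(identity);
            if (content == null) {
                // unknown id (or, in the real code, a bad signature):
                // throw rather than return null, which the spec disallows
                throw new RepositoryException("no binary for reference " + identity);
            }
            return content;
        }
        throw new RepositoryException("plain binaries not handled in this sketch");
    }
}
```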

          Tommaso Teofili added a comment -

          after a chat with Alexander we realized that it'd be possible to avoid any API change / addition by using createValue(Binary): the implementation in ValueFactoryImpl already handles different Binary implementations accordingly, so it would resolve the final binary by using the store and the reference binary message, or throw an exception if it doesn't exist.

          Alexander Klimetschek added a comment -

          @Tommaso: There are 2 things:

          a) api or not for creating the binary: since the requirement here is an application level protocol, there must be an API, otherwise it could not know if it should resend the full binary or not. The application level protocol currently cannot and should not know if the datastores are shared or not. AFAICS, the discussion so far did not result in avoiding an API, just to make sure it's a secure API.

          In my above comments I missed the proposed trick to use "createValue(java.lang.String value, PropertyType.BINARY)" - but I don't see that in the patch either. This would look something like this IIUC, instead of the getBinaryFromSecureID() in my above snippet:

          Node node = ...; // current node being written
          Binary binary;
          if (messageData.hasBinaryStream()) {
              binary = getBinaryFromStream(messageData, session);
          } else {
              String message = messageData.getBinaryMessage(); // get from custom protocol
              try {
                  Value value = session.getValueFactory().createValue(message, PropertyType.BINARY);
                  binary = value.getBinary();
              } catch (ValueFormatException e) {
                  // not supported / wrong secret / referenced binary not found
                  return ASK_FOR_BINARY_STREAM;
              }
          }
          node.setProperty("jcr:data", binary);

          It is important, however, that createValue() will only return a binary if the message is right; if it instead creates a binary of the string contents (not sure if that is currently the case when you call createValue() this way), then this trick cannot work and we do need another API.

          b) reading the binary: could you show me the application code reading such a binary, just using the JCR API, and how it gets the actual data? This is still completely missing (the test in the patch uses DataStore.getRecord(), but that is Jackrabbit-internal).

          Tommaso Teofili added a comment -

          @Alexander I think we cannot store the real binary in the referenced binary, otherwise what's the point of sending the reference message? It'd still hold the binary data, so there wouldn't be any reason for doing any check; just re-import the binary and that's it.

          Earlier we agreed we don't want a public API for this; now we want a public API again. But this scenario shouldn't apply to generic binary handling, just to the case where the datastore is shared, so I don't think this would trigger any code rewrite.

          In the end: do we want to design this scenario by providing an API? If yes, then Alex's concern probably makes sense. If not, then it's just another Binary implementation which can be created from a "master" repository to avoid sending the actual binary; the "slave" repository then checks whether it has such a binary with the given approach (by accessing the data store). If not (or if it has been changed), the "master" will send the same node again with the actual binary instead of the reference binary, and the "slave" will import it with the proper actual binary.

          Any other opinions?

          Alexander Klimetschek added a comment - - edited

          @Tommaso: In the current patch, the binary stores the HMAC message/reference, but not the real binary. This is not transparent for JCR API users. How do they get the actual binary? They don't have access to the DataStore at all, and they should not have to know about this special mechanism at all (otherwise all code would have to be rewritten).

          What I think we need on the API client side:
          a) creating binary

          Node node = ...; // current node being written
          Binary binary;
          if (messageData.hasBinaryStream()) {
              binary = getBinaryFromStream(messageData, session);
          } else {
              String message = messageData.getBinaryMessage(); // get from custom protocol
              binary = jackrabbitSession.getBinaryBySecureID(message);
              if (binary == null) {
                  // protocol does an extra step and transfers the full binary itself
                  return ASK_FOR_BINARY_STREAM;
              }
          }
          node.setProperty("jcr:data", binary);

          b) code reading the data - plain jcr api

          InputStream is = node.getProperty("jcr:data").getBinary().getInputStream();

          Tommaso Teofili added a comment -

          second patch which also incorporates the final binary length (and a related check in ReferenceBinaryTest)

          Jukka Zitting added a comment -

          Updated issue title to refer to the problem being solved instead of just the originally proposed solution

          Tommaso Teofili added a comment -

          > the public API is still missing (BinaryReferenceMessage is in a jackrabbit internal package); I think a public api Binary getBinaryBySecureID() (or similar) like in the first patch is still required

          I thought we weren't agreeing on this and we wanted no public APIs, but I'm open to creating a public one.

          > instead of resolving the message upon writing to the repository, it is instead written as Binary (with the message as contents)

          how would you do it instead? I understood having it as a special Binary sounded ok.

          > but it's never resolved to the final binary that you want (if the message is valid)

          when the binary reference message is received, the receiving repository has to generate a signature from the received content id using its own secret. If the generated signature matches the signature received along with the content id in the reference message, then it was generated using the same secret and therefore comes from the same repository.
          Regarding the final binary: if it doesn't exist or doesn't come from the same repository, then it has to be "sent" in a subsequent call. If it exists and the signature matches, then the thing that's probably still missing is whether it has changed, so sending something like the actual binary length as part of the binary reference message would be helpful.
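
          The check described here is essentially an HMAC comparison. A minimal, self-contained sketch of that receiver-side verification follows; the class and method names are my own, not from the patch:

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;

// Hypothetical sketch of the receiver-side check described above: recompute
// the HMAC of the received content id with the local secret and compare it
// to the signature that arrived with the reference message. Class and
// method names are illustrative, not part of the actual patch.
public class ReferenceCheck {

    static String hmacSha1Hex(String data, String secret) {
        try {
            Mac mac = Mac.getInstance("HmacSHA1");
            mac.init(new SecretKeySpec(
                    secret.getBytes(StandardCharsets.UTF_8), "HmacSHA1"));
            byte[] raw = mac.doFinal(data.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : raw) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // True only if the signature was produced with the same secret,
    // i.e. the reference comes from a repository sharing the data store.
    static boolean sameRepository(String contentId, String receivedSignature,
            String localSecret) {
        return hmacSha1Hex(contentId, localSecret).equals(receivedSignature);
    }
}
```

          If the check fails, the receiver would fall back to asking for the full binary stream, as discussed above.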

          > and then you'd write the returned Binary right to the repository (no delayed resolution)

          but to obtain this, wouldn't that mean that we basically send the final binary itself along with the message?

          > this is important, since the application level code needs to know if that getBinaryBySecureID() worked or not (returns non-null or not) so it can handle the case by asking for the full binary as stream over its own message protocol
          > btw, some lucene-related java files are accidentally included in the patch with import reorgs

          thanks for pointing that out; probably some overlooked leftovers from other changes

          Alex, thanks for your comments; hopefully I've clarified what my solution looks like. If you'd use a different approach it'd be good to have a broader overview of it (or perhaps I just misunderstood).

          Alexander Klimetschek added a comment -

          HMAC stuff is good afaics (not an expert), but some things are missing:

          • it's on the Binary level and not on the more generic Value level (but ok for me)
          • instead of resolving the message upon writing to the repository, it is instead written as Binary (with the message as contents)
          • but it's never resolved to the final binary that you want (if the message is valid)
          • the public API is still missing (BinaryReferenceMessage is in a jackrabbit internal package); I think a public api Binary getBinaryBySecureID() (or similar) like in the first patch is still required
          • and then you'd write the returned Binary right to the repository (no delayed resolution)
          • this is important, since the application level code needs to know if that getBinaryBySecureID() worked or not (returns non-null or not) so it can handle the case by asking for the full binary as stream over its own message protocol
          • btw, some lucene-related java files are accidentally included in the patch with import reorgs
          Tommaso Teofili added a comment -

          here's a first sketch patch.

          • it adds the datastore secret as a String returned by the DataStore, each current DS impl is thus aligned by just holding that property
          • it defines a BinaryReferenceMessage which is constructed by a message and a key which in turn are used to generate the HMAC (still missing, if we want it, the encryption of the message/content id)
          • an HMAC utility class
          • the commons module has a commons-codec dependency added to allow a pretty hex representation of the HMAC, but maybe that can be worked around in some way
          • a ReferenceBinaryTest is provided in order to show a relaxed scenario of two sessions over the same repository: one creates a binary and its reference message while the other reads it, parses it and checks the signature
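
          On the commons-codec point: the hex representation could also be produced with a few lines of plain JDK code, for example (illustrative only, the class name is invented):

```java
// Plain-JDK hex encoding, as a possible replacement for the commons-codec
// dependency mentioned in the comment above. The class name is illustrative.
public class HexHmac {

    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(Character.forDigit((b >> 4) & 0xf, 16));
            sb.append(Character.forDigit(b & 0xf, 16));
        }
        return sb.toString();
    }
}
```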

          looking forward to your comments

          Thomas Mueller added a comment -

          > we may defer the salt / expiry stuff as possible further improvement

          Yes. Implementing this would be rather easy and can be done later.

          Tommaso Teofili added a comment - - edited

          >> To avoid further delay, we could already start implement what we seem to agree on

          which could be:

          • we have a shared secret (repository wide, not a System property nor DS wide)
          • using HMAC (with SHA-1 for now)

          we may defer the salt / expiry stuff as possible further improvement

          Thomas Mueller added a comment -

          > > Well, this message is an access token.
          > The message data must not be a general access token.

          I didn't say "general" access token, but it's still an access token: it grants access to read a certain binary. Sure, we can argue now what an "access token" exactly is.

          Felix Meschberger added a comment -

          > Well, this message is an access token.

          If that were the case, it would be bad design. The message data must not be a general access token.

          > A key-value list would be nice

          This sounds like YAGNI – I doubt we are going to use it. And if so, we can still adapt. There is just one place in the code involved which knows how to talk to itself....

          Thomas Mueller added a comment -

          > the price of parsing JSON ... Not worth it IMHO

          I'm not worried about the speed of parsing: verifying the signature is the slow part. Sure, it should be simple to implement, but shouldn't prevent us from changing the algorithm in the future. A key-value list would be nice, or something similar to the format used in PasswordUtility.

          > What is the problem with this message leaking

          Well, this message is an access token. It provides access to more than just the leaked data. It's kind of like a password to get more data.

          Felix Meschberger added a comment -

          > To simplify development/support, the message should readable, for example JSON or an URL. Example (shortened)

          This sounds like the old mantra from the XML days: everything had to be XML, bla, bla, bla. Please just keep this simple and don't overexaggerate. Having a string of colon-separated values (if the value is structured in some way) is more than enough. Otherwise you incur the price of parsing JSON... Not worth it IMHO.

          > Having expiry and encrypting the identifier would prevent further damage in case the BinaryReferenceMessage leaks.

          What is the problem with this message leaking, as opposed to the actual data leaking, which would be transported over the same channel completely unencrypted? Again, this sounds like over-engineering to me.

          We should leave data protection to the transport layer (e.g. SSL) and just care to make sure that a data reference cannot be made up by an attacker (in the sense of "try to find out whether a document exists").

          Thomas Mueller added a comment -

          To avoid further delay, we could already start implementing what we seem to agree on (well, let's see). That is, we need a secret that is used as the key to sign the message and verify the signature. One solution is a shared secret, configured in the repository.xml in a new tag. Or it could be configured within the data store, but then each data store where we want to support this feature would need to be changed. Are there any other (simpler) solutions? The simplest to implement would probably be a system property, but that feels wrong.
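
          For illustration only, the repository.xml option mentioned here might look something like the following; the element name and attribute are purely hypothetical and not part of any patch or Jackrabbit release:

```xml
<!-- hypothetical: a repository-wide shared secret for signing binary
     reference messages; not an actual Jackrabbit configuration element -->
<Repository>
  ...
  <DataStoreSecret value="some-long-random-shared-secret"/>
</Repository>
```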

          Thomas Mueller added a comment -

          Jukka, I understand now what your attack scenario is. However, this is not the only scenario. See the comment above from Chetan Mehrotra "So may be we can have some service provided by DataStore which can provide such safe ids which can be passed around and still be secure".

          Having expiry and encrypting the identifier would prevent further damage in case the BinaryReferenceMessage leaks. It basically allows using it within an email, embedding it in a web site, etc., as described in http://stackoverflow.com/questions/5419264/best-practice-amazons3-url-sharing

          Alexander Klimetschek added a comment -

          I agree with Jukka - we are not talking about a protocol here that needs to be protected, but an API access. A replication protocol that would make use of this on top of Jackrabbit would need its own protection (e.g. SSL) to protect the message contents, but that should be clearly separated.

          BTW, yesterday I came across another use case that the solution should include: not only a shared DataStore between sender and receiver of that application-level replication mechanism, but also the case of multiple receivers (horizontally scaled boxes) that share one datastore. In this scenario the sender sends N replication messages to N receivers. Ideally you want to avoid sending large binaries N times, so one could imagine the receivers sharing a common DataStore (but not with the sender, as there is usually a strict firewall separation between sender and receiver and possibly a higher latency, as they might reside in different data centers). In that case I imagine that after the first successful transfer of the binary to receiver 1 and storage in the DataStore, the other replications to the other receivers would see that the binary is already present and would not need to send it over again (although one has to avoid that all replications happen at the same time to benefit from that).

          This means that one should be able to configure a shared secret in both the sender and receiver DataStores, so they could "trust" each other, but wouldn't necessarily have shared content.

          IMHO this case is even more applicable to real-world scenarios and performance benefits, because a shared DataStore between sender and receiver is definitely less likely than a shared DataStore among multiple receivers.

          Jukka Zitting added a comment -

          > Well, we wanted to make it secure, right?

          The main security threat, at least as discussed above, was to prevent someone from accessing the contents or checking for the existence of a binary based on its identifier (which can be predictable). The HMAC already protects against that, as it guarantees that the client already has another way to access the binary. Otherwise it couldn't have acquired a HMAC signed by the underlying data store.

          Is there some other attack vector that I'm missing?

          > Expiry: this is to avoid reply attacks.

          I'm not sure I follow. Who's the attacker and what does an expiry value prevent them from doing?

          > Without it, the message would no longer have the meaning of "you have access to this binary" but it would sometimes mean "this is the data".

          I don't see a problem with that. Currently that's what the client is in any case doing, copying the data from one repository to another. The proposed feature here is just an optimization to that case, so it shouldn't be a problem if in some cases the feature ends up copying the data instead of just a signed identifier. In fact for small binaries the whole HMAC/identifier mechanism could simply be skipped, as the data could just as well be directly copied without notable extra overhead.
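
          The last point, skipping the mechanism for small binaries, amounts to a simple size check on the sender side. A sketch of that decision, where the threshold value and class name are invented purely for illustration:

```java
// Illustrative sender-side decision based on the suggestion above:
// below some size threshold, just copy the binary data directly instead of
// sending a signed reference. The threshold value and the class name are
// assumptions made for this example.
public class BinaryMessageChooser {

    static final long INLINE_THRESHOLD = 16 * 1024; // arbitrary example: 16 KB

    // True if the binary is small enough that the HMAC/identifier
    // mechanism can simply be skipped and the data copied directly.
    static boolean sendInline(long binaryLength) {
        return binaryLength < INLINE_THRESHOLD;
    }
}
```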

          Thomas Mueller added a comment -

          Well, we wanted to make it secure, right?

          Expiry: this is to avoid replay attacks. It is the same mechanism as used for S3, see http://stackoverflow.com/questions/5419264/best-practice-amazons3-url-sharing

          The identifier may be the data itself, so if it's not encrypted then the data would be included in the message. Without it, the message would no longer have the meaning of "you have access to this binary" but it would sometimes mean "this is the data".

          Jukka Zitting added a comment -

          Why would the identifier need to be encrypted, salted and/or expirable? A HMAC should be all we need here.
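
The suggestion can be sketched in a few lines. The class and method names below are illustrative, not part of any patch; the point is that a plain HMAC-SHA-1 over the identifier, keyed with a data-store-wide secret, is enough to prove that the sender obtained the identifier from a repository sharing that secret:

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;

// Sketch: sign a content identifier with HMAC-SHA-1 using a secret shared
// by all repositories attached to the same data store.
public class SignedIdentifier {

    // Returns "<identifier>:<hex hmac>"; the receiver recomputes the HMAC
    // and compares before treating the identifier as trusted.
    public static String sign(String identifier, byte[] secret) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA1");
        mac.init(new SecretKeySpec(secret, "HmacSHA1"));
        byte[] sig = mac.doFinal(identifier.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : sig) {
            hex.append(String.format("%02x", b));
        }
        return identifier + ":" + hex;
    }

    public static boolean verify(String signed, byte[] secret) throws Exception {
        int colon = signed.lastIndexOf(':');
        if (colon < 0) {
            return false;
        }
        return sign(signed.substring(0, colon), secret).equals(signed);
    }
}
```

A receiver that cannot verify the HMAC simply refuses the identifier and falls back to streaming the binary.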

          Thomas Mueller added a comment -

          I was chatting with Tommaso Teofili about the basic data structures, features, and security protocols. There are still a few open questions regarding the API, but here what we have so far:

          DataIdentifier: The (unencrypted and unsigned) identifier of the binary, as already used by the Jackrabbit DataStore. Please note it could be a reference to a file, or, for small binaries, contain the data itself.

          DataStoreSecret: a secret value that needs to be configured to be the same value in all repositories that share the same physical data store. It is used as the basis to encrypt and decrypt the DataIdentifier, and to sign and verify the signature. This could be a configuration parameter of the DataStore element in the repository.xml, but then we would probably need to change each DataStore implementation where we want support for the new feature? To avoid that, should we add a new element to the repository.xml? Not sure which is easiest.

          BinaryReferenceMessage: The encrypted DataIdentifier, the random salt, the expiry time, plus the signature of all of that. The encryption key for the DataIdentifier is the (SHA-1) hash of the random salt combined with the DataStoreSecret (this is to avoid re-using the same encryption key for all BinaryReferenceMessages). The random salt is per message. The expiry time is the maximum system time up to when to accept the BinaryReferenceMessage (same as for time limited S3 URLs), for example the system time the message was generated plus 2 hours or so. The signature is the HMAC of the rest of the message, with the DataStoreSecret as the key. To simplify development/support, the message should be readable, for example JSON or an URL. Example (shortened): "

          {encryptedDataId:0123456789abcd, salt:1234, expiry:3456, signature:4567}

          ". This will also allow changing the algorithms in the future. For now, we could use the following algorithms / formats: 128 bit DataStoreSecret and salt (generated with a SecureRandom); AES-256 encryption / AES-CTR mode; expiry: milliseconds since 1970 UTC; signature: HMAC-SHA-1.
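
A rough sketch of how such a BinaryReferenceMessage could be assembled and checked. All names are hypothetical, and as a simplification the per-message key is the first 128 bits of SHA-1(salt || secret), i.e. AES-128/CTR rather than AES-256, since a SHA-1 hash only yields 160 bits:

```java
import javax.crypto.Cipher;
import javax.crypto.Mac;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.SecureRandom;
import java.util.Arrays;

// Illustrative sketch of the message layout discussed above.
public class BinaryReferenceMessage {

    public static String create(byte[] dataId, byte[] secret, long expiry) throws Exception {
        byte[] salt = new byte[16];
        new SecureRandom().nextBytes(salt);
        byte[] enc = crypt(Cipher.ENCRYPT_MODE, dataId, salt, secret);
        String body = "encryptedDataId:" + hex(enc) + ",salt:" + hex(salt) + ",expiry:" + expiry;
        return "{" + body + ",signature:" + hmac(body, secret) + "}";
    }

    // Returns the decrypted DataIdentifier, or null if the signature is
    // wrong or the message has expired at time 'now'.
    public static byte[] decode(String message, byte[] secret, long now) throws Exception {
        String body = message.substring(1, message.lastIndexOf(",signature:"));
        String signature = message.substring(message.lastIndexOf(":") + 1, message.length() - 1);
        if (!hmac(body, secret).equals(signature)) {
            return null; // tampered, or different DataStoreSecret
        }
        String[] fields = body.split(",");
        byte[] enc = unhex(fields[0].split(":")[1]);
        byte[] salt = unhex(fields[1].split(":")[1]);
        long expiry = Long.parseLong(fields[2].split(":")[1]);
        if (now > expiry) {
            return null; // expired
        }
        return crypt(Cipher.DECRYPT_MODE, enc, salt, secret);
    }

    // Per-message AES key: first 128 bits of SHA-1(salt || secret).
    // A zero IV is tolerable here only because every message uses a fresh
    // random salt and therefore a fresh key.
    private static byte[] crypt(int mode, byte[] in, byte[] salt, byte[] secret) throws Exception {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        sha1.update(salt);
        byte[] key = Arrays.copyOf(sha1.digest(secret), 16);
        Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
        cipher.init(mode, new SecretKeySpec(key, "AES"), new IvParameterSpec(new byte[16]));
        return cipher.doFinal(in);
    }

    private static String hmac(String body, byte[] secret) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA1");
        mac.init(new SecretKeySpec(secret, "HmacSHA1"));
        return hex(mac.doFinal(body.getBytes(StandardCharsets.UTF_8)));
    }

    private static String hex(byte[] b) {
        StringBuilder s = new StringBuilder();
        for (byte x : b) s.append(String.format("%02x", x));
        return s.toString();
    }

    private static byte[] unhex(String s) {
        byte[] b = new byte[s.length() / 2];
        for (int i = 0; i < b.length; i++)
            b[i] = (byte) Integer.parseInt(s.substring(2 * i, 2 * i + 2), 16);
        return b;
    }
}
```
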

          angela added a comment -

          fine with me.

          Jukka Zitting added a comment -

          > public API

          +1 for createValue(Binary), with an abstract ReferenceBinary class (one that holds and provides access to a HMAC code, but doesn't know how to access the underlying value) in something like jackrabbit-jcr-commons where it can be shared by the client and the server. That class can either be used directly by client code or extended by RMI and DAVex to allow seamless integration without the client code having to worry about the HMAC codes.

          > I'm not sure about spec compliance, but createValue(Binary) could return null if the binary isn't available on the target system.

          A better alternative might be to throw a RepositoryException from methods like setProperty(String, Value) and setProperty(String, Binary), when given such a reference binary that actually doesn't match the underlying data store and whose getStream() method for some reason can't be used to fetch the raw byte stream.
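
The proposed ReferenceBinary could look roughly like the following. In Jackrabbit it would implement javax.jcr.Binary; plain Java is used here to keep the sketch self-contained, and apart from the class name everything is illustrative:

```java
// Hypothetical sketch: a value object that carries only a secure reference
// (e.g. an HMAC-signed identifier) and deliberately cannot access the
// underlying bytes itself. The target repository recognizes the type,
// verifies the reference, and resolves it against its own data store.
public class ReferenceBinary {

    private final String reference;

    public ReferenceBinary(String reference) {
        if (reference == null || reference.isEmpty()) {
            throw new IllegalArgumentException("empty reference");
        }
        this.reference = reference;
    }

    /** The signed reference to be resolved by the target repository. */
    public String getReference() {
        return reference;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof ReferenceBinary
                && reference.equals(((ReferenceBinary) o).reference);
    }

    @Override
    public int hashCode() {
        return reference.hashCode();
    }
}
```

RMI and DAVex subclasses could additionally know how to stream the raw bytes, giving the fallback behavior described above.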

          Felix Meschberger added a comment -

          > createValue(Binary)

          If we have a custom Binary implementation, we could also fail early when trying to create that instance: If the content Id cannot be resolved to a valid entry, the Binary object cannot be created and thus there is no need to call createValue(Binary) at all.

          Tommaso Teofili added a comment - edited

          > As mentioned earlier, I'd rather introduce an extension to the Binary interface and use ValueFactory.createValue(Binary) instead.

          at first glance this option seems to better address backward compatibility, as it wouldn't change the ValueFactory#createValue(String, PropertyType.BINARY) behavior but would instead rely on explicitly using a different type of Binary; at the same time that may require special handling in the VF impl (which is usually not nice)

          > I'm not sure about spec compliance, but createValue(Binary) could return null if the binary isn't available on the target system.

          reading http://www.day.com/maven/jsr170/javadocs/jcr-2.0/javax/jcr/ValueFactory.html#createValue(javax.jcr.Binary) it seems returning null is not considered but I may be wrong.

          Marcel Reutegger added a comment -

          > ValueFactory#createValue(String, PropertyType.BINARY)

          As mentioned earlier, I'd rather introduce an extension to the Binary interface and use ValueFactory.createValue(Binary) instead. I'm not sure about spec compliance, but createValue(Binary) could return null if the binary isn't available on the target system.

          Felix Meschberger added a comment -

          Re. ValueFactory#createValue(String, PropertyType.BINARY)

          Some thought on this: The implementation must make sure to (a) properly identify the hashed/signed/whatever value as being an identifier (no false positives or negatives!) and (b) reject identifiers for which there is no data store entry.

          angela added a comment - edited

          in general it seems that we agree on creating some sort of signing mechanism for the content identifier, right?
          if that was true, i would suggest that tommaso (or whoever is working on this at the end) comes up with a patch.

          regarding injecting the value:
          without having a closer look i would expect that ValueFactory#createValue(String, PropertyType.BINARY) should
          be sufficient to create the value from the content id. but we can still decide on that once we have the basics working.

          regarding oak:
          that's the main reason for me being really picky about the security impact... hacking something into jackrabbit-core is
          one thing but since we have to implement that in oak as well it should be well-thought-out.

          Felix Meschberger added a comment -

          Re. signed "message" or HMAC

          I agree with Jukka: The user id cannot be part of this transaction because there is no reason to assume it to be the same. Even if the "name" is the same, the actual entity represented by the name need not be the same. This is essentially the same issue NFS faces with numeric user and group IDs.

          But: we also need a way to inject that value through some public API such as the ValueFactory or some other means. Assuming RMI or Davex is not going to work, because we have two separate systems where the data is extracted through the JackrabbitValue.getContentIdentifier() (or some new API) method, serialized in a custom way and then reinjected through some new API (or the ValueFactory, if that can differentiate between a "message" binary and a real binary!).

          Re. Future Proof

          Angela is quite right: It is essential that whatever mechanism we implement for Jackrabbit 2 should also be available for Oak.

          Jukka Zitting added a comment -

          Note that there is no guarantee that the userIds of the two repositories have no relation with each other, so I don't think it should be included in a mechanism like the one proposed.

          Instead, a HMAC of the content identifier, signed by the underlying data store should be good enough. The target repository can check that the message comes from the same underlying data store, so anyone who has access to such a code already has access to the related binary through one of the repositories attached to that data store. Thus allowing the user to access the binary doesn't reveal anything he or she couldn't already access by other means.

          Ideally I'd see such a HMAC to be passed transparently as a part of Binary instances acquired from a remote RMI or Davex connection to the source repository. The target repository would automatically extract and evaluate the information based on the type of the passed Binary instance, and could always fall back to streaming the data if for example the HMAC doesn't match. The client would use the pattern I outlined above, the only difference being that repositoryA would be a remote RMI or Davex connection instead of a local cluster node.

          angela added a comment -

          And what I discussed with Felix yesterday as well (again just quick summary of a longer discussion and pretty much brainstorming
          ideas that have not been fully thought through):

          the use-case behind the original proposal is basically a shift in paradigm how we treat and look at a jcr content repository.
          in our current 'world' a query "findPropertyWithContentId" would do the trick and be backed by the current permission model.
          however, what the current proposal shadows is the fact that we want to support a setup where multiple JCR repositories
          share resources (such as e.g. the datastore)

          in order to address this new way of looking at a content repository and the interaction between multiple repositories we
          should be very clear about the kind of abstraction and concept we want to use. in addition to marcel's summary this basically
          has two different flavors:

          1) the JCR/Jackrabbit API doesn't reveal that requirement, the content repository still behaves the same as before
          and we treat that shared datastore (or any other inter-repo-connection) as implementation detail of the repository not
          known nor exposed to the JCR API consumer. in this case we should neither expose the new proposed method nor
          define any public API to access or deal with the datastore.
          one possible option to just reflect our requirements with JCR-only API could be to expose the information stored in the
          datastore in a read-only tree underneath /jcr:system such as we do for other repository global information (namespaces,
          node types) and access to the individual items would be governed by regular and possibly fine grained item access.

          2) we treat things like the datastore concept as additional repository that was associated to a given content repository
          without being an internal feature of a given implementation. in that case we may rather want to define a separate API
          for the datastore that comes with its own security concept... depending on how feature rich this API might get, we
          may however end up in a situation where we start re-inventing the JCR API for the data store.... if that was the case
          we may rather think on how we could handle such inter-repository communication or lookup or data-sharing in general
          and treat the datastore case as a special case.

          in oak we currently are having some sort of intermediate state with a BlobFactory interface present on the oak-api
          which has a lot of unsolved questions associated with it. currently it's just a 'create-only' feature but looking at this
          feature request it seems pretty clear to me that there will be pressure to add additional lookup or query functionality.

          whatever we do, we should keep the development in oak in mind in order not to get locked into a new feature
          in jackrabbit core that will cause major troubles in the upcoming jr3 rewrite.

          Thomas Mueller added a comment -

          Sounds good. In addition, I think we should consider having a mechanism to expire the identifier, similar to Amazon S3. I found some information here: http://stackoverflow.com/questions/14414455/amazon-s3-generating-an-expiring-link-using-ruby-1-9-3

          What do you think about the limited lifetime of an identifier? I think it might be overkill, but I'm not sure.

          Marcel Reutegger added a comment -

          Here's what Angela and I briefly discussed yesterday, which has similarities to what Thomas proposed. Please consider this as taking notes of the discussion, rather than a full-fledged proposal.

          Repository instances connected to a shared data store trust each other by means of a shared secret or some other mechanism to verify 'messages' from the other instance. A client of the JCR API can get a content identifier from a binary property stored in the data store via JackrabbitValue.getContentIdentity(). The returned value is an encrypted message, which contains the current userId and the hash of the data store item. This value can then be sent to the other repository and a JCR Value will be created from this message. We were not sure how exactly that would work. One option we discussed was a custom JCR Binary class recognized by the repository implementation. The created binary can then be used to set a property. The implementation will then decrypt and verify the message and extract the userId and the hash. If the userId does not match the current user, then the repository will throw an exception. If the userId matches and an item with the given hash already exists, the implementation will set the property to the given value. Otherwise the call to setProperty() behaves as if it was passed a null value, which is equivalent to removing the value. This allows a client to check whether the binary is already on the target system.

          The benefit of this mechanism is that you cannot generate content identifiers on a system and then use them to attack another one. Rather, the content identifier depends on something like a salt or shared secret, as proposed by Thomas. The system further guarantees a user is only able to see data store items on the target system he had access to on the source system.

          Comments welcome.

          Thomas Mueller added a comment -

          (d) Content ids could expire after some time (for example one minute). One way to do that is to add the number of minutes since 1970 to the hash, then encrypt this using a datastore wide secret key, and use this as the content identifier. The receiver repository would decrypt the identifier and check time.
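
Scheme (d) could be sketched as follows, with hypothetical names; the AES key is assumed to be a 128-bit data-store-wide secret, and a random IV is prepended so that identifiers for the same hash differ:

```java
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.nio.ByteBuffer;
import java.security.SecureRandom;

// Sketch of (d): encrypt (hash || expiry minute) with a data-store-wide key;
// the receiver decrypts the id and checks the time before accepting it.
public class ExpiringContentId {

    public static byte[] create(byte[] hash, long expiryMinute, byte[] key16) throws Exception {
        byte[] plain = ByteBuffer.allocate(hash.length + 8)
                .put(hash).putLong(expiryMinute).array();
        byte[] iv = new byte[16];
        new SecureRandom().nextBytes(iv);
        Cipher c = Cipher.getInstance("AES/CBC/PKCS5Padding");
        c.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key16, "AES"), new IvParameterSpec(iv));
        byte[] enc = c.doFinal(plain);
        return ByteBuffer.allocate(16 + enc.length).put(iv).put(enc).array();
    }

    // Returns the hash, or null if the id has expired at 'nowMinute'.
    public static byte[] open(byte[] id, long nowMinute, byte[] key16) throws Exception {
        Cipher c = Cipher.getInstance("AES/CBC/PKCS5Padding");
        c.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key16, "AES"),
                new IvParameterSpec(java.util.Arrays.copyOf(id, 16)));
        byte[] plain = c.doFinal(id, 16, id.length - 16);
        ByteBuffer buf = ByteBuffer.wrap(plain);
        byte[] hash = new byte[plain.length - 8];
        buf.get(hash);
        return buf.getLong() >= nowMinute ? hash : null;
    }
}
```
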

          Thomas Mueller added a comment -

          Possible security measures:

          (a) The data store could have a repository wide secret "salt" that is used for all binaries. For a shared data store, it would be the same salt for all repositories (otherwise it wouldn't work). With the salt, it would not be possible for an attacker to check whether a certain binary exists because he couldn't easily generate the content hash (unless he knows the secret salt).

          (b) Some kind of whitelist / blacklist of content ids. Only content ids in the whitelist / not in the blacklist would be allowed.

          (c) The content id could contain an additional token (randomly generated). That way, you could have multiple content ids that point to the same binary, which could be whitelisted / blacklisted individually.

          All those features would add complexity of course. It's just possible solutions to address security concerns. I'm not saying we should implement them.
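
Measure (a) could be sketched like this (names are illustrative): deriving the exposed id from the secret salt plus the content hash makes the id unguessable for anyone who does not know the salt, even if they know the plain content hash.

```java
import java.security.MessageDigest;

// Sketch of (a): the exposed content id is SHA-256(secretSalt || contentHash),
// so knowing a binary's plain hash is not enough to probe for its existence.
public class SaltedContentId {

    public static String idFor(byte[] contentHash, byte[] secretSalt) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        md.update(secretSalt);
        byte[] digest = md.digest(contentHash);
        StringBuilder s = new StringBuilder();
        for (byte b : digest) s.append(String.format("%02x", b));
        return s.toString();
    }
}
```

(An HMAC keyed with the salt would be the more robust construction, for the same reason it is preferred over plain keyed hashes elsewhere in this discussion.)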

          angela added a comment -

          > What we want to use this feature for is for a "replication" feature in our app, which is really an infrastructure service and
          > already uses an admin (or similar highly privileged user) session on the target jackrabbit instance to copy over content from
          > another one. Having this special "access-by-datastore-id" permission flag would only be used for that case anyway.

          so, you are justifying something that looks problematic security-wise with the argument that it's just one specific use-case
          and 'admin-only'.

          IMHO we should not hack the built-in permissions for something that looks like a nice feature for us without thinking
          about the consequences. claiming that only our replication user would be allowed to do this is as naive as claiming that
          something was just an 'admin-only' task. that's not how the repository is being used and if it was we could equally
          just hardcode access to the getValueById to a single, dedicated user (which obviously is a bad idea).

          > Now if we ensure the data store IDs are not guessable, then there is no option to browse the repository's binaries.
          > If we avoid using simple hashes of the binary for the (exposed) ID, it will not be possible to check for the existence
          > of certain documents known to an attacker. In fact, an attacker will only be able to get to the ID if he has the access
          > rights to the content in the first place.

          that's correct and i don't see a problem with this part.
          what is problematic IMO is the fact that once you get access to the content ID you may be able to look at the binary
          irrespective of the accessibility of the property (or properties) that hold(s) this value.

          in other words: what we are adding here is an additional dimension to the way access control is used and enforced
          by the repository. we have permissions on nodes and properties, and we are extending this to values irrespective of
          which property the value was attached to.

          if we add the contentId handling to the API, it will be used (i see the service coming that exposes contentIds
          via an admin session, is searchable, and where google's inurl: will be a perfect fit to determine all kinds of
          contentIds all over the world).

          don't get me wrong: i am not opposed to have this in general but i am totally opposed to just hacking that in without
          having a clear picture of what we are doing and careful reevaluation on what that actually means for our threat model
          and for the further development (including oak).

          > If not, we really need to think of bringing a similar performance-optimized replication feature into Jackrabbit / Oak itself.

          again: no objection to this.... but 'thinking' is definitely the key word here

          Alexander Klimetschek added a comment - - edited

          I agree that the security aspects of this seem problematic - but I think they go away if we look at the larger picture.

          What we want to use this feature for is for a "replication" feature in our app, which is really an infrastructure service and already uses an admin (or similar highly privileged user) session on the target jackrabbit instance to copy over content from another one. Having this special "access-by-datastore-id" permission flag would only be used for that case anyway.

          Now if we ensure the data store IDs are not guessable, then there is no option to browse the repository's binaries. If we avoid using simple hashes of the binary for the (exposed) ID, it will not be possible to check for the existence of certain documents known to an attacker. In fact, an attacker will only be able to get to the ID if he has the access rights to the content in the first place.

          IMHO that is all acceptable; except for such a special replication system user, no normal JCR user would ever need to have that permission turned on (and documentation should say so).

          If not, we really need to think of bringing a similar performance-optimized replication feature into Jackrabbit / Oak itself.

          angela added a comment -

          i share jukka's concerns.

          the way the permissions are extended is not acceptable from a security point of view, as it compromises the overall
          security of the system: while a given property value was not accessible due to limited access to a given jcr property,
          it would become visible to a given session if 'access-by-datastore-id' was turned on. this creates the exact same
          problem we are currently having with the version store and removes the ability to enforce access permissions
          on the binary values, while at the same time the binary values are most probably those that contain the most
          sensitive information.

          -1 for the patch.

          we have to think about the security implications and address them in a reasonable manner.
          in particular, accessing a binary by id should always enforce the permissions that were
          in place at the corresponding property.

          in fact we discussed that in the oak project: http://markmail.org/message/5omo54jpue4si3e4?q=blobfactory

          Felix Meschberger added a comment -

          Thanks Chetan for sharing JCR-3448. Yes, I agree, we are trying to solve the same problem. IMHO this proposal (of using content identifier round tripping) is less intrusive and exposes fewer of the implementation details.

          Chetan Mehrotra added a comment -

          +1 for the new feature

          JCR-3448 talks about a similar requirement. At that time the thought was to use a custom Binary implementation which can be serialized and deserialized. The requirement for that issue was more around a DataStore backed by S3. To address the security concern we could have used S3 signed URLs, which are valid for a certain duration [1], as the opaque identifier that is passed around. So maybe we can have some service provided by the DataStore which can provide such safe ids that can be passed around and still be secure.

          [1] http://stackoverflow.com/questions/5419264/best-practice-amazons3-url-sharing

          Alexander Klimetschek added a comment -

          Yes, a shared data store can be a much more viable option than a full cluster (considering Jackrabbit 2 here). The whole point is to optimize for large binaries, to avoid sending them around and to minimize disk space. Switching to a full cluster in the use case we have (as described by Felix above) is not an option, as the JCR content will be different on the instances that should share the data store.

          I agree that we should consider the long term as well - but I don't know if there are actual better mechanisms, and I don't know of any being worked on for Oak already.

          Thomas Mueller added a comment -

          I think the general problem here is: how do you avoid sending the binary if the binary is already there? The repositories don't necessarily need to share the data store.

          Of course there are security questions, but then all operations have security questions (uploading a huge binary can fill the disk, so we would need quotas in theory; a rogue remote client might already access binaries it is not allowed to access).

          Thomas Mueller added a comment -

          > I would prefer the simpler return of "null" for the not-found case instead of the ItemNotFoundException

          I agree.

          > leaky abstraction that may come back to haunt us for example if someone who doesn't realize the security implications

          I think what Felix described is a valid use case. If there are better ways to solve the problem, that would be great, but I also currently don't see other solutions that would work well.

          > adjust the deployment configuration if you want to make those repositories share data more intimately

          Could you provide more details? How could you reference a binary stored in one repository in the other repository, if the repositories are not running in the same process?

          > the implementation may well be something like hash(revision + path) that can't be reversed for use in something like getValueByContentId().

          This is an idea that is new to me, could you tell us more about it? I believe we can and should support getValueByContentId() in Oak in the same way as in Jackrabbit 2.x. I don't see a reason to use hash(revision + path).

          Jukka Zitting added a comment -

          > well, copying lots of 20 MB binaries instead of lots of 50-character strings? Do the math

          I mean the overhead of the extra cluster configuration. Both approaches suggested here achieve the reduced data transfer, one at the cost of extra API and permissions the other at the cost of deployment changes. The question here is which one is the better long term alternative.

          > The abstraction already leaked with the JackrabbitValue.getContentIdentity method

          The case for getContentIdentity() is quite different from this use case, and has neither the security concerns nor any reasonable alternatives that wouldn't require API changes. Unlike here, the kinds of use cases enabled by getContentIdentity() are impossible, not just inconvenient to configure, without the extra API.

          Felix Meschberger added a comment -

          > but is the overhead high enough to justify the leak in the abstraction

          well, copying lots of 20 MB binaries instead of lots of 50-character strings? Do the math

          The abstraction already leaked with the JackrabbitValue.getContentIdentity method (ok, that would be the broken window).

          Alexander Klimetschek added a comment - - edited

          I would prefer the simpler return of "null" for the not-found case instead of the ItemNotFoundException, even if the JCR API does that all the time. Throwing exceptions for expected cases such as not-found is IMHO bad design, especially if those exceptions share the same super class, RepositoryException, which is used for actual unexpected error cases (network down).

          Otherwise +1.

          Jukka Zitting added a comment -

          > So JackrabbitValue.getContentIdentity will not be supported in Oak, either?

          It will, but the implementation may well be something like hash(revision + path) that can't be reversed for use in something like getValueByContentId().

          > How if clustering is not an option ? [...] For various reasons both share the same Data Store.

          Since they already share the data store (which already requires custom configuration), I don't believe setting up one of the repositories (or even a third one reserved for such data migration purposes) as a cluster shared by both boxes would be an insurmountable issue.

          I agree that an approach like getValueByContentId() is probably easier to implement in practice for such a case, but is the overhead high enough to justify the leak in the abstraction?

          Felix Meschberger added a comment -

          > suggested getValueByContentId() method wouldn't work with Oak

          So JackrabbitValue.getContentIdentity will not be supported in Oak, either?

          > I'd adjust the deployment configuration if you want to make those repositories share data more intimately.

          What if clustering is not an option?

          Consider, for example, that I have an author box and a publish box. For various reasons both share the same Data Store. The straightforward approach to sending binaries from the author box to the publish box would be to extract the binary from the Data Store on the author box and push it back into the Data Store on the publish box, just for the Data Store to realize the binary is actually the same. On the reading side we can retrieve an identifier (JackrabbitValue.getContentIdentity). I just need a way to reuse the binary data referred to by that content identifier on the publish side.
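
          The round trip described above could be sketched roughly as follows. This is not Jackrabbit code: the class, the Map standing in for the shared DataStore, and the send/receive helpers are all hypothetical stand-ins for JackrabbitValue.getContentIdentity() and the proposed JackrabbitSession.getValueByContentIdentity(String):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class SharedDataStoreSketch {

    // Stand-in for the shared DataStore: maps content identifiers to binaries.
    // Both the author box and the publish box see the same store.
    static final Map<String, byte[]> sharedDataStore = new HashMap<>();

    // Author box: the binary is already in the shared store; only its
    // identifier travels over the wire (analogous to getContentIdentity()).
    static String send(byte[] binary) {
        String id = "id-" + Integer.toHexString(Arrays.hashCode(binary));
        sharedDataStore.put(id, binary);
        return id;
    }

    // Publish box: resolve the identifier locally instead of transferring
    // the binary (analogous to the proposed getValueByContentIdentity(String)).
    static byte[] receive(String contentIdentity) {
        byte[] binary = sharedDataStore.get(contentIdentity);
        if (binary == null) {
            // unknown or invalid content identity
            throw new IllegalArgumentException("unknown content identity: " + contentIdentity);
        }
        return binary;
    }

    public static void main(String[] args) {
        byte[] video = new byte[] {1, 2, 3};
        String id = send(video);        // author box: send only the id
        byte[] reused = receive(id);    // publish box: no binary transfer
        System.out.println(Arrays.equals(video, reused)); // prints "true"
    }
}
```

          The point of the sketch is that only the short identifier string crosses the network; the potentially large binary is never copied because both sides resolve it against the same store.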

          Jukka Zitting added a comment -

          > I have two separate systems (separate JVMs and no clustering but shared data store) involved.

          I'd adjust the deployment configuration if you want to make those repositories share data more intimately. Otherwise this seems like a leaky abstraction that may come back to haunt us for example if someone who doesn't realize the security implications mistakes getValueByContentId() for a neat way to provide permalinks for binaries.

          Also, unlike the pattern I outlined (which BTW already works) the suggested getValueByContentId() method wouldn't work with Oak as currently designed.

          Felix Meschberger added a comment -

          This does not solve my problems at all: I have two separate systems (separate JVMs and no clustering but shared data store) involved. So a Binary object from one system cannot be used on the other system without serializing and deserializing it.

          My proposal uses the data store identity as the serialization mechanism and the new JackrabbitSession.getValueByContentIdentity as the deserialization mechanism.

          Another option would be to have such a method on a new JackrabbitValueFactory class. But since we have a content identifier and want to get at content, we need some level of access control. So we would need that access control in the JackrabbitValueFactory, making this the only method in the JackrabbitValueFactory that employs access control to create a value....

          Jukka Zitting added a comment -

          I'm not too excited about the idea of using a special READ_VALUE_BY_CONTENT_IDENTITY permission for this. It seems like a hack to get around the inherent security implications of a feature like this.

          Instead of exposing this at the JCR API level, to me it would feel more natural to handle this on a lower level. For example, the following pattern would require no API extensions or extra permissions, but would still allow the lower-level implementation to detect that we're talking to two repositories with the same data store and thus optimize the copying of the data:

          Repository repositoryA = ...;
          Session sessionA = repositoryA.login(...);

          Repository repositoryB = ...;
          Session sessionB = repositoryB.login(...);

          Binary binary = sessionA.getProperty(...).getBinary();
          sessionB.getNode(...).setProperty(..., binary);

          Thomas Mueller added a comment -

          The patch looks good to me.

          Felix Meschberger added a comment -

          Proposed patch adding the new API.

          To access the DataStore item by its ID, I converted one method on InternalValue from package-private to public, since the InternalValue.create(DataStore, String) method cannot be used: the getContentIdentity method does not return the store prefix required by the create(DataStore, String) method. Also, the create method does not validate the existence of a data store entry for the provided identifier.


            People

            • Assignee:
              Tommaso Teofili
            • Reporter:
              Felix Meschberger