Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Blocker Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.4
    • Component/s: jackrabbit-core
    • Labels:
      None

      Description

      We want to have a database backed data store implementation.
      An implementation using files is already available as part of JCR-926.

      1. datastore-concurrent-reads.zip
        5 kB
        Esteban Franqueiro
      2. jr-1.3.1-bea.zip
        85 kB
        Esteban Franqueiro

        Issue Links

          Activity

          Hide
          Jukka Zitting added a comment -

          Esteban, since this issue has already been closed for 1.4, it would be better if you created new Jira issues for any improvements or other changes. That would make it much easier to track which changes are being included in which release.

          Also, please avoid putting multiple independent changes in a single issue or patch. For example the test cases should go to JCR-1187 and the concurrent read patch should go to a new improvement issue.

          Show
          Jukka Zitting added a comment - Esteban, since this issue has already been closed for 1.4, it would be better if you created new Jira issues for any improvements or other changes. That would make it much easier to track which changes are being included in which release. Also, please avoid putting multiple independent changes in a single issue or patch. For example the test cases should go to JCR-1187 and the concurrent read patch should go to a new improvement issue.
          Hide
          Esteban Franqueiro added a comment -

          The attachment contains a test class for the issue I reported in [1] and a helper class it depends on. Also includes a patch that corrects this issue by allowing pooled connections to remain open until the streams they're serving are closed.

          [1] http://mail-archives.apache.org/mod_mbox/jackrabbit-dev/200802.mbox/browser

          Show
          Esteban Franqueiro added a comment - The attachment contains a test class for the issue I reported in [1] and a helper class it depends on. Also includes a patch that corrects this issue by allowing pooled connections to remain open until the streams they're serving are closed. [1] http://mail-archives.apache.org/mod_mbox/jackrabbit-dev/200802.mbox/browser
          Hide
          Thomas Mueller added a comment -

          Javadoc comment fixed in revision 618935

          Show
          Thomas Mueller added a comment - Javadoc comment fixed in revision 618935
          Hide
          Esteban Franqueiro added a comment -

          The following javadoc comment should be removed from the DbDataStore class:

          • A three level directory structure is used to avoid placing too many
          • files in a single directory. The chosen structure is designed to scale
          • up to billions of distinct records.
          Show
          Esteban Franqueiro added a comment - The following javadoc comment should be removed from the DbDataStore class: A three level directory structure is used to avoid placing too many files in a single directory. The chosen structure is designed to scale up to billions of distinct records.
          Hide
          Thomas Mueller added a comment -

          Hi,

          I have tested it again and my results are:

          When storing an object, a temp file is created for most databases to find out how long the binary object is. For databases that support PreparedStatement.setBinaryStream with unknown size (currently only the H2 database), no temp file is created.

          When retrieving an object (prop.getStream()), the call sequence is: 1) DbDataStore.getRecord(..), which calls touch(..) but touch(..) returns quickly. 2) DbDataRecord.getStream(), which calls touch(..) but touch also doesn't do anything. Here a temp file is created, otherwise the connection would be blocked until the application read the blob. By the way, after running the test, I did not have any leftover temp files.

          So if I run the test, everything looks fine to me.

          Show
          Thomas Mueller added a comment - Hi, I have tested it again and my results are: When storing an object, a temp file is created for most databases to find out how long the binary object is. For databases that support PreparedStatement.setBinaryStream with unknown size (currently only the H2 database), no temp file is created. When retrieving an object (prop.getStream()), the call sequence is: 1) DbDataStore.getRecord(..), which calls touch(..) but touch(..) returns quickly. 2) DbDataRecord.getStream(), which calls touch(..) but touch also doesn't do anything. Here a temp file is created, otherwise the connection would be blocked until the application read the blob. By the way, after running the test, I did not have any leftover temp files. So if I run the test, everything looks fine to me.
          Hide
          Esteban Franqueiro added a comment -

          Thomas, with the test I posted in JCR-1250 I see that there's still a copy from the DB to the local FS on the save() call (which I believe is unnecessary). Also, there are many calls to touch() inside the data store, and at least one call to getRecord().

          Show
          Esteban Franqueiro added a comment - Thomas, with the test I posted in JCR-1250 I see that there's still a copy from the DB to the local FS on the save() call (which I believe is unnecessary). Also, there are many calls to touch() inside the data store, and at least one call to getRecord().
          Hide
          Esteban Franqueiro added a comment -

          Really interesting. Have you thought about the use case of the GC running in a background thread? How would you go about it?
          Could you post your code, as a patch maybe, to take a look a it?
          As you know this feature is very important for us.

          Show
          Esteban Franqueiro added a comment - Really interesting. Have you thought about the use case of the GC running in a background thread? How would you go about it? Could you post your code, as a patch maybe, to take a look a it? As you know this feature is very important for us.
          Hide
          Thomas Mueller added a comment -

          Hi,

          The solution I am working on right now is: add a method

          public GarbageCollector SessionImpl.createDataStoreGarbageCollector().

          This method creates a data store garbage collector object and initialized it. If all persistence managers are IterablePersistenceManager (a new interface with one method: getAllNodeIds; currently implemented for all BundlePersistenceManager), then it will use this method to scan the repository.

          If not (if other persistence managers are used), it uses the same method as now to scan the repository (however automatically in all workspaces).

          That means the methods don't need to be made public (SessionsImpl is in the same package as RepositoryImpl).

          Regards,
          Thomas

          Show
          Thomas Mueller added a comment - Hi, The solution I am working on right now is: add a method public GarbageCollector SessionImpl.createDataStoreGarbageCollector(). This method creates a data store garbage collector object and initialized it. If all persistence managers are IterablePersistenceManager (a new interface with one method: getAllNodeIds; currently implemented for all BundlePersistenceManager), then it will use this method to scan the repository. If not (if other persistence managers are used), it uses the same method as now to scan the repository (however automatically in all workspaces). That means the methods don't need to be made public (SessionsImpl is in the same package as RepositoryImpl). Regards, Thomas
          Hide
          Esteban Franqueiro added a comment -

          Stefan,

          > the said method is not public since it returns all workspace names, regardless of user rights. please note that
          > the a session must only see/know about workspaces it is actually allowed to access.

          I'm working on a garbage collector for the data store. The GC doesn't care about user rights, or, putting it another way, it must be able to access all workspaces. Ideally I'd like to use a SystemSession, but you can't get one outside RepositoryImpl or o.a.j.c package (i.e., getSystemSession(String) is package private, WorkspaceInfo is protected).
          I understand that these APIs shouldn't be public, but they're useful from within the core. In my case, for the GC.
          I'm very interested in using another way to walk every workspace, but this other way should guarantee that the repository won't shutdown during a walk operation. I'll check the solution Thomas said he'll be proposing, but in the mean time that is what I'm using.
          Still, I don't see why this APIs aren't public.
          Regards

          Show
          Esteban Franqueiro added a comment - Stefan, > the said method is not public since it returns all workspace names, regardless of user rights. please note that > the a session must only see/know about workspaces it is actually allowed to access. I'm working on a garbage collector for the data store. The GC doesn't care about user rights, or, putting it another way, it must be able to access all workspaces. Ideally I'd like to use a SystemSession, but you can't get one outside RepositoryImpl or o.a.j.c package (i.e., getSystemSession(String) is package private, WorkspaceInfo is protected). I understand that these APIs shouldn't be public, but they're useful from within the core. In my case, for the GC. I'm very interested in using another way to walk every workspace, but this other way should guarantee that the repository won't shutdown during a walk operation. I'll check the solution Thomas said he'll be proposing, but in the mean time that is what I'm using. Still, I don't see why this APIs aren't public. Regards
          Hide
          Esteban Franqueiro added a comment -

          > Do you still need this?

          Yes.

          > FileDataRecord is now protected, but why do you want to extend it? I don't think it's a good idea to re-use the class for something that it's not meant for (it would break encapsulation).

          Like I said in my last message, I don't want to extend it, I want to use it from another package, so I need both the class and constructor to be public.

          Regards

          Show
          Esteban Franqueiro added a comment - > Do you still need this? Yes. > FileDataRecord is now protected, but why do you want to extend it? I don't think it's a good idea to re-use the class for something that it's not meant for (it would break encapsulation). Like I said in my last message, I don't want to extend it, I want to use it from another package, so I need both the class and constructor to be public. Regards
          Hide
          Thomas Mueller added a comment -

          > FileDataRecord class should be public, and the DataRecord interface should have a getLastModified() method.

          Do you still need this?

          > These are for extensibility.

          FileDataRecord is now protected, but why do you want to extend it? I don't think it's a good idea to re-use the class for something that it's not meant for (it would break encapsulation).

          Regards,
          Thomas

          Show
          Thomas Mueller added a comment - > FileDataRecord class should be public, and the DataRecord interface should have a getLastModified() method. Do you still need this? > These are for extensibility. FileDataRecord is now protected, but why do you want to extend it? I don't think it's a good idea to re-use the class for something that it's not meant for (it would break encapsulation). Regards, Thomas
          Hide
          Thomas Mueller added a comment -

          Stefan, I don't think we are talking about the same issue. It was never the idea to add this method to JackrabbitRepository. Anyway, I found a workaround to get the workspace names (if you need it):

          RepositoryImpl r = (RepositoryImpl) rep;
          RepositoryConfig conf = r.getConfig();
          Collection coll = conf.getWorkspaceConfigs();
          for(Iterator it = coll.iterator(); it.hasNext()

          { WorkspaceConfig wsc = (WorkspaceConfig) it.next(); wsc.getName(); }

          There is no session required to call those methods. In any case, the workspace names may not be required for the data store garbage collection. I will propose another solution.

          Show
          Thomas Mueller added a comment - Stefan, I don't think we are talking about the same issue. It was never the idea to add this method to JackrabbitRepository. Anyway, I found a workaround to get the workspace names (if you need it): RepositoryImpl r = (RepositoryImpl) rep; RepositoryConfig conf = r.getConfig(); Collection coll = conf.getWorkspaceConfigs(); for(Iterator it = coll.iterator(); it.hasNext() { WorkspaceConfig wsc = (WorkspaceConfig) it.next(); wsc.getName(); } There is no session required to call those methods. In any case, the workspace names may not be required for the data store garbage collection. I will propose another solution.
          Hide
          Stefan Guggisberg added a comment -

          > Esteban Franqueiro commented on JCR-1154:
          > -----------------------------------------
          >
          > [...]
          >
          > BTW, thanks for commiting the changes I asked for, but I still think that making RepositoryImpl.getWorkspaceNames() public will be useful (I'm using it for the GC).

          the said method is not public since it returns all workspace names, regardless of user rights. please note that
          the a session must only see/know about workspaces it is actually allowed to access.

          see http://www.day.com/maven/jsr170/javadocs/jcr-1.0/javax/jcr/Workspace.html#getAccessibleWorkspaceNames()

          Show
          Stefan Guggisberg added a comment - > Esteban Franqueiro commented on JCR-1154 : > ----------------------------------------- > > [...] > > BTW, thanks for commiting the changes I asked for, but I still think that making RepositoryImpl.getWorkspaceNames() public will be useful (I'm using it for the GC). the said method is not public since it returns all workspace names, regardless of user rights. please note that the a session must only see/know about workspaces it is actually allowed to access. see http://www.day.com/maven/jsr170/javadocs/jcr-1.0/javax/jcr/Workspace.html#getAccessibleWorkspaceNames( )
          Hide
          Esteban Franqueiro added a comment -

          Hi Thomas.
          I noticed that you made the FileDataRecord public, but it's only constructor protected. This way you can't have a class that returns a file data record, because it can't construct one. Maybe I didn't express correctly. I want to use FileDataRecord from another package, so I need both the class and its constructor(s) to be public . For the time being I'm not extending it, but who knows
          I would also like to ask for a getLastModified() method in the DataRecord interface.
          Regards

          Show
          Esteban Franqueiro added a comment - Hi Thomas. I noticed that you made the FileDataRecord public, but it's only constructor protected. This way you can't have a class that returns a file data record, because it can't construct one. Maybe I didn't express correctly. I want to use FileDataRecord from another package, so I need both the class and its constructor(s) to be public . For the time being I'm not extending it, but who knows I would also like to ask for a getLastModified() method in the DataRecord interface. Regards
          Hide
          Esteban Franqueiro added a comment -

          I see you have updated the impl. I'm checkin it out right now. I have a few comments:

          • you have to update the javadoc in DbDataStore (it has a wrong package and class name, an also the className attribute should be "class"
          • you left the @author tag in Pool and TempFileInputStream classes
          • shouldn't be better to use a larger buffer inTempFileInputStream.writeToFileAndClose()? like 64K perhaps?
          • what happens if two clients request the same binary at the same time? I guess the temp file doesn't overwrite the other one, so the file is copied twice, right?

          BTW, thanks for commiting the changes I asked for, but I still think that making RepositoryImpl.getWorkspaceNames() public will be useful (I'm using it for the GC).

          Regards

          Show
          Esteban Franqueiro added a comment - I see you have updated the impl. I'm checkin it out right now. I have a few comments: you have to update the javadoc in DbDataStore (it has a wrong package and class name, an also the className attribute should be "class" you left the @author tag in Pool and TempFileInputStream classes shouldn't be better to use a larger buffer inTempFileInputStream.writeToFileAndClose()? like 64K perhaps? what happens if two clients request the same binary at the same time? I guess the temp file doesn't overwrite the other one, so the file is copied twice, right? BTW, thanks for commiting the changes I asked for, but I still think that making RepositoryImpl.getWorkspaceNames() public will be useful (I'm using it for the GC). Regards
          Hide
          Thomas Mueller added a comment -

          The current solution is still not perfect (blobs need to fit in memory for MySQL and PostgreSQL), but in a usable state I hope.

          Show
          Thomas Mueller added a comment - The current solution is still not perfect (blobs need to fit in memory for MySQL and PostgreSQL), but in a usable state I hope.
          Hide
          Esteban Franqueiro added a comment - - edited

          Also, I think FileDataRecord class should be public, and the DataRecord interface should have a getLastModified() method. These are for extensibility. Since it's not public, it can't be extended from another package. And the method is the easyest way to get the last modified time of a DB record in my case.

          Show
          Esteban Franqueiro added a comment - - edited Also, I think FileDataRecord class should be public, and the DataRecord interface should have a getLastModified() method. These are for extensibility. Since it's not public, it can't be extended from another package. And the method is the easyest way to get the last modified time of a DB record in my case.
          Hide
          Thomas Mueller added a comment -

          The original message may not be very helpful. I think it's better to add as much context information as possible, as in:

          try

          { return new FileInputStream(file); }

          catch (IOException e)

          { throw new DataStoreException("Error opening input stream of " + file.getAbsolutePath(), e); }

          because the exception from FileInputStream may not contain the absolute path. Or:

          } catch (Exception e)

          { throw convert("Can not init data store, driver=" + driver + " url=" + url + " user=" + user, e); }

          The original message will not always have that much context information. But if you need the non-message constructor, it's not a problem, I can add it.

          Show
          Thomas Mueller added a comment - The original message may not be very helpful. I think it's better to add as much context information as possible, as in: try { return new FileInputStream(file); } catch (IOException e) { throw new DataStoreException("Error opening input stream of " + file.getAbsolutePath(), e); } because the exception from FileInputStream may not contain the absolute path. Or: } catch (Exception e) { throw convert("Can not init data store, driver=" + driver + " url=" + url + " user=" + user, e); } The original message will not always have that much context information. But if you need the non-message constructor, it's not a problem, I can add it.
          Hide
          Esteban Franqueiro added a comment -

          Why have to write
          catch (SomeException e)

          { throw new DataStoreException(e.getMessage(), e); }

          when
          catch (SomeException e)

          { throw new DataStoreException(e); }

          does the same?

          Show
          Esteban Franqueiro added a comment - Why have to write catch (SomeException e) { throw new DataStoreException(e.getMessage(), e); } when catch (SomeException e) { throw new DataStoreException(e); } does the same?
          Hide
          Thomas Mueller added a comment -

          There is: "public DataStoreException(String message, Throwable cause)". Is this not enough? If not, why?

          Show
          Thomas Mueller added a comment - There is: "public DataStoreException(String message, Throwable cause)". Is this not enough? If not, why?
          Hide
          Esteban Franqueiro added a comment -

          Shouldn't the DataStoreException (as well as every exception, actually) have the DataStoreException(Exception) constructor?

          Show
          Esteban Franqueiro added a comment - Shouldn't the DataStoreException (as well as every exception, actually) have the DataStoreException(Exception) constructor?
          Hide
          Thomas Mueller added a comment -

          In revision 591286, the DataStore throws DataStoreException instead of IOException. This should simplify adding new data stores, for example the DatabaseDataStore.

          Show
          Thomas Mueller added a comment - In revision 591286, the DataStore throws DataStoreException instead of IOException. This should simplify adding new data stores, for example the DatabaseDataStore.
          Show
          Esteban Franqueiro added a comment - See http://www.nabble.com/Re%3A--jira--Commented%3A-%28JCR-1154%29-Database-Data-Store-p13535861.html for response.
          Hide
          Thomas Mueller added a comment -

          Hi,

          Thanks again for the patch! And sorry for the delay. I have a few remarks and questions:

          • please use spaces, not tabs (Jackrabbit consistently uses spaces)
          • please don't use import ...*
          • byte b[] > byte[] b
          • public is not required in interfaces
          • return does not require (..)

          To fix such problems, I use Checkstyle. If you like I can post the configuration I use.

          Garbage collection: I like make garbage collection implementation independent; is it OK for you if I remove those cases? AbstractGarbageCollector and AbstractDataStore are not required then (anyway they are small).

          It would be great if the database data store would automatically re-connect if the connection was lost (important for MySQL). To achieve this, org.apache.jackrabbit.core.persistence.bundle.util.ConnectionRecoveryManager can be used.

          It's quite tricky that DatabaseRecord extends FilterInputStream... However I'm not sure if it is required, is it not possible to just wrap the database BLOB object?

          I don't think that FileDataStoreConstants is required.

          Those are just my view, and I'm open to discuss them of course.

          If you have time to change it yourself please go ahead. Otherwise I will do it and post the patch here before I commit it - but it will take a few more days.

          Thanks for your help!
          Thomas

          Show
          Thomas Mueller added a comment - Hi, Thanks again for the patch! And sorry for the delay. I have a few remarks and questions: please use spaces, not tabs (Jackrabbit consistently uses spaces) please don't use import ...* byte b[] > byte[] b public is not required in interfaces return does not require (..) To fix such problems, I use Checkstyle. If you like I can post the configuration I use. Garbage collection: I like make garbage collection implementation independent; is it OK for you if I remove those cases? AbstractGarbageCollector and AbstractDataStore are not required then (anyway they are small). It would be great if the database data store would automatically re-connect if the connection was lost (important for MySQL). To achieve this, org.apache.jackrabbit.core.persistence.bundle.util.ConnectionRecoveryManager can be used. It's quite tricky that DatabaseRecord extends FilterInputStream... However I'm not sure if it is required, is it not possible to just wrap the database BLOB object? I don't think that FileDataStoreConstants is required. Those are just my view, and I'm open to discuss them of course. If you have time to change it yourself please go ahead. Otherwise I will do it and post the patch here before I commit it - but it will take a few more days. Thanks for your help! Thomas
          Hide
          Thomas Mueller added a comment -

          Thanks a lot for your help Esteban!

          I'm currently trying to integrate this patch into the current jackrabbit trunk. Hopefully the garbage collection problem can be solved more easily than in your patch. I'm not sure if you saw the WeakHashMap 'inUse' in the file data store, what do you think about it?

          Also I like to make sure that the database data store can re-connect to the database automatically; if possible the auto-reconnect feature of the bundle db persistence manager should be used.

          Anyway, I'm working on it, but it will take some more time.

          Thomas

          Show
          Thomas Mueller added a comment - Thanks a lot for your help Esteban! I'm currently trying to integrate this patch into the current jackrabbit trunk. Hopefully the garbage collection problem can be solved more easily than in your patch. I'm not sure if you saw the WeakHashMap 'inUse' in the file data store, what do you think about it? Also I like to make sure that the database data store can re-connect to the database automatically; if possible the auto-reconnect feature of the bundle db persistence manager should be used. Anyway, I'm working on it, but it will take some more time. Thomas
          Hide
          Esteban Franqueiro added a comment -

          Changed attachment name.

          Show
          Esteban Franqueiro added a comment - Changed attachment name.
          Hide
          Esteban Franqueiro added a comment -

          This is an implementation of a database-based data store. It currently uses SQL Server but this can be changed by subclassing the DatabaseDataStore class and overriding its buildSQLStatements() method as usual. As discussed, this impl uses UUIDs for the record id's, but we'll be working out a solution like the one proposed by Thomas.
          Also included is a garbage collector (based on the initial implementation by Thomas). The changes are mostly cosmetic, to generalize it for the file and database DS.
          In RepositoryImpl's constructor is the code to initialize the garbage collector, according to the configuration in the repository.xml file.
          In the included InternalValue we added methods useDataStore()/setDataStore() so we can avoid relaying on the system property.
          There are a few tests, for different cases (we couldn't cover every corner case, of course).
          Please note that there are a few design decisions in which we had to take into account the application we're building on top of Jackrabbit, but we tried very hard to keep these to a minimum while at the same time trying to simplify each time we have to update the version of Jackrabbit, since our application extends some of it's functionality (specifically, the database data store ). I mention the ones I remember now: the boolean parameter in DatabaseDataStore.getConnection(), and the DatabaseRecord implementation. We should further discuss these issues.
          Finally, note that the attachment is NOT a patch since my working version has many patches applied to it.

          Show
          Esteban Franqueiro added a comment - This is an implementation of a database-based data store. It currently uses SQL Server but this can be changed by subclassing the DatabaseDataStore class and overriding its buildSQLStatements() method as usual. As discussed, this impl uses UUIDs for the record id's, but we'll be working out a solution like the one proposed by Thomas. Also included is a garbage collector (based on the initial implementation by Thomas). The changes are mostly cosmetic, to generalize it for the file and database DS. In RepositoryImpl's constructor is the code to initialize the garbage collector, according to the configuration in the repository.xml file. In the included InternalValue we added methods useDataStore()/setDataStore() so we can avoid relaying on the system property. There are a few tests, for different cases (we couldn't cover every corner case, of course). Please note that there are a few design decisions in which we had to take into account the application we're building on top of Jackrabbit, but we tried very hard to keep these to a minimum while at the same time trying to simplify each time we have to update the version of Jackrabbit, since our application extends some of it's functionality (specifically, the database data store ). I mention the ones I remember now: the boolean parameter in DatabaseDataStore.getConnection(), and the DatabaseRecord implementation. We should further discuss these issues. Finally, note that the attachment is NOT a patch since my working version has many patches applied to it.

            People

            • Assignee:
              Thomas Mueller
              Reporter:
              Thomas Mueller
            • Votes:
              1 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development