Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.3
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: None
    • Labels: None

      Description

      Implement a RAMDirectoryFactory class to make it possible to use RAMDirectory by adding the following configuration to solrconfig.xml:

      <directoryFactory class="org.apache.solr.core.RAMDirectoryFactory"/>
      1. SOLR-1379.patch
        3 kB
        Alex Baranau
      2. SOLR-1379.patch
        0.7 kB
        Alex Baranau

        Activity

        Alex Baranau added a comment -

        Attached patch with the implemented class.

        However, the following constraint applies: RAMDirectoryFactory can be used only when the index is already present on disk. For example, we can first start Solr using StandardDirectoryFactory (the default) so that Solr creates the needed files in dataDir.

        If this constraint is acceptable, please let me know. Avoiding it would require some changes in SolrCore and in the DirectoryFactory API.

        In any case, I'm going to provide an alternative patch later to eliminate the constraint.

        Alex Baranau added a comment -

        Moreover, with the RAMDirectoryFactory provided by this patch, updating documents via Solr doesn't work properly.

        An example scenario in which this RAMDirectoryFactory is usable: the persistent (on-filesystem) index is created/updated by another application (perhaps using a pure Lucene approach), and the Solr core is reloaded whenever updates should affect the search. One possible reason to use this scenario is to speed up search with RAMDirectory when the index changes rarely or doesn't change at all.

        Alex Baranau added a comment -

        Attached a new patch. It eliminates the previously mentioned constraint.

        I managed not to change the existing classes; only new ones were added. But I think we should look more closely at how DirectoryFactory is used for accessing the Directory instance for the index. I believe re-opening the Directory is overused in the current code. It hasn't caused any problems so far, because before RAMDirectoryFactory was introduced only filesystem-based directories were used, but it may still affect performance and lead to unwanted defects similar to the one mentioned in the previous comment.
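        The re-opening concern above can be illustrated with a factory that caches Directory instances in a map, returning the same in-memory instance for a given path instead of re-opening it (a RAM-backed factory has to do something like this, since a freshly re-opened RAMDirectory would be empty). The Directory and DirectoryFactory types below are simplified hypothetical stand-ins, not the real Lucene/Solr API — a minimal sketch only:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-ins for Lucene's Directory and Solr's DirectoryFactory,
// reduced to the bare minimum needed to illustrate the caching idea.
interface Directory {}

class RamDirectory implements Directory {}

abstract class DirectoryFactory {
    abstract Directory open(String path);
}

// Caches one Directory per path so repeated open() calls return the same
// in-memory instance instead of re-opening (and thereby losing) its contents.
class CachingRamDirectoryFactory extends DirectoryFactory {
    private final Map<String, Directory> byPath = new ConcurrentHashMap<>();

    @Override
    Directory open(String path) {
        return byPath.computeIfAbsent(path, p -> new RamDirectory());
    }
}

public class Demo {
    public static void main(String[] args) {
        CachingRamDirectoryFactory factory = new CachingRamDirectoryFactory();
        Directory a = factory.open("/var/solr/data/index");
        Directory b = factory.open("/var/solr/data/index");
        System.out.println(a == b); // prints true: the same cached instance
    }
}
```

        Whether the factory is the right place for such a cache is exactly the design question this comment (and Grant's later one about storing Directories in a Map) raises.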

        Hoss Man added a comment -

        I can imagine situations where having an index that exists purely in RAM would be useful, but I don't really understand the point of a DirectoryFactory that loads an index from disk into a RAMDirectory. My understanding (from other Lucene contributors who spend a lot of time worrying about performance) is that there are very few (and specialized) use cases where this performs better than just using an FSDirectory and letting the OS cache the disk pages.

        Do you have some concrete examples of when your patch makes more sense than just using the default?

        Alex Baranau added a comment -

        The second patch provides the ability to have an index that exists purely in RAM. So that is one example; actually, it is the primary usage scenario.

        The scenario of loading the index into RAM from disk applies when all of the following requirements hold:

        • The index is updated outside of Solr
        • The index is very large (more than 100 million documents, with more than 50 fields)
        • Index updates affect about half of the documents each month
        • Search must be extremely fast

        Since updates touch a lot of documents, users can be affected while the OS caches are renewed (according to our tests, this can result in a lag of 5+ minutes while a commit is happening). We also need to optimize the index after such huge updates, which likewise causes all caches to be recreated, etc. To avoid the lag, we can load the index into RAM and reload it on a scheduled basis using the core reload feature (the new index is used only after it is warmed up, etc.).
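        The reload-on-a-schedule idea above could be sketched as a periodic task that hits Solr's CoreAdmin RELOAD endpoint. The host, port, core name, and interval below are illustrative placeholders, and the actual HTTP call is left as a comment since it needs a running Solr instance:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ScheduledCoreReload {

    // Builds a CoreAdmin RELOAD URL. Host, port, and core name are
    // illustrative placeholders, not values taken from this issue.
    static String reloadUrl(String host, int port, String core) {
        return "http://" + host + ":" + port
                + "/solr/admin/cores?action=RELOAD&core=" + core;
    }

    public static void main(String[] args) throws Exception {
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();

        // Reload the core once an hour so the RAM-resident index picks up
        // the externally updated on-disk index (after warm-up).
        scheduler.scheduleAtFixedRate(() -> {
            String url = reloadUrl("localhost", 8983, "collection1");
            // In a real deployment, issue the HTTP GET here (e.g. with
            // java.net.HttpURLConnection); omitted so this sketch runs alone.
            System.out.println("would GET " + url);
        }, 0, 1, TimeUnit.HOURS);

        // Demo only: let the first run fire, then stop the scheduler.
        Thread.sleep(200);
        scheduler.shutdownNow();
    }
}
```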

        In addition, the test results for RAMDirectory and FSDirectory differ when the user load is significant (e.g. 30+ concurrent requests): RAMDirectory is faster. Even when we used a mounted RAM disk as storage for the index with FSDirectory, it performed 2.5-3 times worse than RAMDirectory.

        Grant Ingersoll added a comment -

        The scenario of loading the index into RAM from disk applies when all of the following requirements hold:

        • The index is updated outside of Solr
        • The index is very large (more than 100 million documents, with more than 50 fields)
        • Index updates affect about half of the documents each month
        • Search must be extremely fast

        Since updates touch a lot of documents, users can be affected while the OS caches are renewed (according to our tests, this can result in a lag of 5+ minutes while a commit is happening). We also need to optimize the index after such huge updates, which likewise causes all caches to be recreated, etc. To avoid the lag, we can load the index into RAM and reload it on a scheduled basis using the core reload feature (the new index is used only after it is warmed up, etc.).

        In addition, the test results for RAMDirectory and FSDirectory differ when the user load is significant (e.g. 30+ concurrent requests): RAMDirectory is faster. Even when we used a mounted RAM disk as storage for the index with FSDirectory, it performed 2.5-3 times worse than RAMDirectory.

        Can you share your performance tests? Also, I'm not sure why a very large index is a use case for a RAMDirectory implementation.

        Finally, storing the Directories in a Map doesn't seem like a good idea.

        I think we should mark this for 1.5, so it can be revisited then.

        Yonik Seeley added a comment -

        Haha - I just wrote almost the exact same RAMDirectory implementation for newtrunk - then remembered that there might already be an issue open.

        Yonik Seeley added a comment -

        Thanks Alex! I've committed this on branches/newtrunk.

        Hoss Man added a comment -

        Correcting Fix Version based on CHANGES.txt, see this thread for more details...

        http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3E

        Grant Ingersoll added a comment -

        Bulk close for 3.1.0 release


          People

          • Assignee: Unassigned
          • Reporter: Alex Baranau
          • Votes: 0
          • Watchers: 7
