SOLR-2593

A new core admin action 'split' for splitting index

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: 4.3
    • Component/s: None
    • Labels: None

      Description

      If an index is too large/hot it would be desirable to split it out to another core.
      This core may eventually be replicated out to another host.

      There could be multiple strategies:

      • random split of x or x%
      • fq="user:johndoe"

      Example:
      action=split&split=20percent&newcore=my_new_index
      or
      action=split&fq=user:johndoe&newcore=john_doe_index


          Activity

          Uwe Schindler added a comment -

          Closed after release.

          Shalin Shekhar Mangar added a comment -

          Committed as part of SOLR-3755 changes.
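
          For reference, the CoreAdmin SPLIT action that shipped via SOLR-3755 takes roughly the following form (the core names here are hypothetical; see the Solr reference guide for the full parameter list):

          action=SPLIT&core=core0&targetCore=core1&targetCore=core2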

          Deepak Kumar added a comment -

          I have a situation that demands merging two cores, re-creating data partitions, and splitting and installing the result into two (or more) cores, and this issue seems to be the closest thing to that area. The case: there are two cores on the same schema, roughly 55G and 35G (and growing), and data keeps getting pushed continuously to the 35G core. We can't allow it to fill up indefinitely, so over a period of time (an offline/maintenance window) we regenerate both cores (by re-indexing to fresh cores) with the desired set of data keyed on some unique key, discard the old oversized cores, and install the fresh ones. Re-indexing is a kind of pain: it eventually recreates the same set of documents, but the older core loses its oldest docs due to the size constraint, and the smaller core shrinks further since its docs keep shifting to the bigger one. This can be considered a sliding-time-window core. The basic steps in demand could be:

          1. Merge N cores into one big core (high cost).
          2. Scan through all the documents of the big core and create N new cores (the number of cores that were merged initially), each filled up to the allowed size.
          3. Hot-swap the main cores with the fresh ones.
          4. Discard the old cores, probably after backing them up.

          Step 1 above may be omitted if we can directly scan through the documents of the N cores and push the new docs over to the target cores.
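
          Steps 1, 3 and 4 map onto existing CoreAdmin actions; a hedged sketch of the sequence, with hypothetical core names:

          action=MERGEINDEXES&core=merged&srcCore=core_a&srcCore=core_b
          action=SWAP&core=core_a&other=fresh_a
          action=UNLOAD&core=old_a

          Step 2, the size-bounded re-partitioning, has no single admin action and would still need custom indexing code.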

          Andrzej Bialecki added a comment -

          Jason, see LUCENE-2632 for a possible way to implement this at the Lucene level. Splitting into arbitrary parts has so far required multiple passes over the input data; with the tee/filter codec approach it's possible to do it in one pass.

          Jason Rutherglen added a comment -

          Is there a patch for this issue available? If not it's fine.

          Terrance A. Snyder added a comment -

          @Noble Paul - do you have more information on this? We have a unique requirement that would greatly benefit from being able to take a 'slice' of the data a user has modified and persist it in such a fashion.

          Noble Paul added a comment -

          The fq-type option is basically going to require making a full copy of the index and then deleting by query...

          Lucene does it better. We can pass a filtered IndexReader to a new writer and it creates a new index with only those docs. I was surprised at the speed at which it split a dummy 1-million-doc index: under 1 second.
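
          A minimal sketch of that filtered-reader trick against a recent Lucene API (the query-collection plumbing and the FilteredSplit class are illustrative, not the actual code referenced above):

          import org.apache.lucene.index.*;
          import org.apache.lucene.search.*;
          import org.apache.lucene.store.Directory;
          import org.apache.lucene.util.Bits;
          import org.apache.lucene.util.FixedBitSet;

          // Copy only the docs matching `keep` into a new index by presenting
          // them to addIndexes() as the "live" docs of each source segment.
          public class FilteredSplit {
            public static void split(Directory src, Directory dst, Query keep) throws Exception {
              try (DirectoryReader reader = DirectoryReader.open(src);
                   IndexWriter writer = new IndexWriter(dst, new IndexWriterConfig())) {
                IndexSearcher searcher = new IndexSearcher(reader);
                Weight w = searcher.createWeight(searcher.rewrite(keep),
                                                 ScoreMode.COMPLETE_NO_SCORES, 1f);
                for (LeafReaderContext ctx : reader.leaves()) {
                  FixedBitSet matching = new FixedBitSet(ctx.reader().maxDoc());
                  Bits live = ctx.reader().getLiveDocs();
                  Scorer s = w.scorer(ctx);
                  if (s != null) {
                    DocIdSetIterator it = s.iterator();
                    for (int d = it.nextDoc(); d != DocIdSetIterator.NO_MORE_DOCS; d = it.nextDoc()) {
                      if (live == null || live.get(d)) matching.set(d); // skip already-deleted docs
                    }
                  }
                  // Present only matching docs as live; addIndexes copies just those.
                  LeafReader filtered = new FilterLeafReader(ctx.reader()) {
                    @Override public Bits getLiveDocs() { return matching; }
                    @Override public int numDocs() { return matching.cardinality(); }
                    @Override public CacheHelper getCoreCacheHelper() { return null; }
                    @Override public CacheHelper getReaderCacheHelper() { return null; }
                  };
                  writer.addIndexes(SlowCodecReaderWrapper.wrap(filtered));
                }
                writer.commit();
              }
            }
          }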

          Hoss Man added a comment -

          One thing to think about when talking about the API is how the implementation will actually work.

          The fq-type option is basically going to require making a full copy of the index and then deleting by query (unless I'm missing something). But for people who don't care how the index is partitioned, a more efficient approach could probably work at the segment level: let the user say "split off a chunk of at least 20% but no more than 50%", and then you can look at individual segments and doc counts and see if it's possible to just move segments around (and maybe only do the "copy+deleteByQuery" logic on a single segment).
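
          To make the segment-level idea concrete, a hedged sketch (assuming Lucene's SegmentInfos API; the 20%/50% budget logic and the class name are illustrative) of picking whole segments whose live-doc total lands in the requested window:

          import java.nio.file.Paths;
          import org.apache.lucene.index.SegmentCommitInfo;
          import org.apache.lucene.index.SegmentInfos;
          import org.apache.lucene.store.Directory;
          import org.apache.lucene.store.FSDirectory;

          // Greedily pick whole segments until the live-doc total reaches at
          // least 20% of the index without exceeding 50%; only those segments
          // would need to be moved to the new core.
          public class SegmentBudget {
            public static void main(String[] args) throws Exception {
              try (Directory dir = FSDirectory.open(Paths.get(args[0]))) {
                SegmentInfos infos = SegmentInfos.readLatestCommit(dir);
                long total = 0;
                for (SegmentCommitInfo si : infos) {
                  total += si.info.maxDoc() - si.getDelCount();
                }
                long lower = total * 20 / 100, upper = total * 50 / 100;
                long picked = 0;
                for (SegmentCommitInfo si : infos) {
                  if (picked >= lower) break;            // budget satisfied
                  long live = si.info.maxDoc() - si.getDelCount();
                  if (picked + live > upper) continue;   // would overshoot the cap
                  picked += live;
                  System.out.println("move segment " + si.info.name + " (" + live + " live docs)");
                }
                System.out.println(picked >= lower ? "picked " + picked + " of " + total + " docs"
                                                   : "no segment combination fits the window");
              }
            }
          }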

          Hoss Man added a comment -

          If it's possible, it would be cool to have config parameters to:

          ...Those seem like they should be discrete actions that can be taken after the split has happened. The simplest thing is to have a "split" action that just creates a new core with the docs selected either via the fq (or random selection), and then use other CoreAdmin actions for the rest: rename, swap, swap+delete (the old one), merge ... merge is really the only one we don't have at a "core" level yet (I think).

          Peter Sturge added a comment -

          This is a really great idea, thanks!
          If it's possible, it would be cool to have config parameters to:

          • create a new core
          • overwrite an existing core
          • rename an existing core, then create (rolling backup)
          • merge with an existing core (ever-growing, but kind of an accessible 'archive' index)

          Koji Sekiguchi added a comment -

          CoreAdminHandler uses action, not command.


            People

            • Assignee: Unassigned
            • Reporter: Noble Paul
            • Votes: 8
            • Watchers: 13
