  1. Solr
  2. SOLR-11299

Time partitioned collections (umbrella issue)

Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: SolrCloud
    • Labels: None

    Description

      Solr ought to be able to manage large-scale time-series data (think logs or sensor data / IoT) itself, without a lot of manual or external work. The simplest approach today is to create a collection with a high numShards and hash routing, but partitioning the underlying indexes by time is better for these reasons:

      • Easy to scale up/down horizontally as data/requirements change. (No need to over-provision, use shard splitting, or re-index with different config)
      • Faster queries:
        • can search fewer shards, reducing overall load
        • realtime search is more tractable (most shards are stable, so their caches remain effective)
        • "recent" shards (that might be queried more) can be allocated to faster hardware
        • aged-out data is simply removed rather than marked as deleted; deleted docs still incur search overhead
      • An outage of a shard results in a degraded but sometimes still useful system (compare with a random subset of documents missing)
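
The "can search fewer shards" point can be illustrated with a minimal sketch. It assumes daily partitions, each covering a half-open interval from its start time to one day later; the name `partitions_for_range` and the daily granularity are hypothetical and not part of Solr.

```python
from datetime import datetime, timedelta

# Assumption for illustration: each partition covers one day, [start, start + 1 day).
PARTITION_SIZE = timedelta(days=1)


def partitions_for_range(partition_starts, query_start, query_end):
    """Return the start times of partitions overlapping [query_start, query_end)."""
    return [
        start
        for start in sorted(partition_starts)
        if start < query_end and start + PARTITION_SIZE > query_start
    ]


# Thirty daily partitions; a query over the final 36 hours touches only two of them.
starts = [datetime(2017, 9, 1) + timedelta(days=i) for i in range(30)]
hit = partitions_for_range(starts, datetime(2017, 9, 29, 12), datetime(2017, 10, 1))
```

With hash routing, the same query would have to fan out to every shard, because any shard may hold documents from any point in time.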

      Ideally you could set this up once and then simply work with a collection (potentially an alias in practice) in the normal way (search or update), letting Solr handle the addition of new partitions, the removal of old ones, and the appropriate routing of requests depending on their nature.
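
One way to picture the behaviour described above is a toy in-memory model: documents are routed by timestamp, a partition is created the first time a document falls into it, and partitions older than a retention window are dropped whole. All names here (`TimePartitionedStore`, `retention_days`, `add`, `expire`) are illustrative assumptions; this is not Solr's API or a proposed design.

```python
from datetime import datetime, timedelta


class TimePartitionedStore:
    """Toy model (hypothetical): one list of documents per daily partition."""

    def __init__(self, retention_days):
        self.retention = timedelta(days=retention_days)
        self.partitions = {}  # partition start (datetime) -> list of docs

    @staticmethod
    def partition_start(timestamp):
        # Truncate to midnight, the start of the daily partition.
        return timestamp.replace(hour=0, minute=0, second=0, microsecond=0)

    def add(self, doc):
        # Route by the document's timestamp, creating the partition on demand.
        key = self.partition_start(doc["timestamp"])
        self.partitions.setdefault(key, []).append(doc)

    def expire(self, now):
        # Drop whole partitions whose data lies entirely before the retention cutoff.
        cutoff = now - self.retention
        for key in [k for k in self.partitions if k + timedelta(days=1) <= cutoff]:
            del self.partitions[key]


store = TimePartitionedStore(retention_days=7)
store.add({"id": "a", "timestamp": datetime(2017, 9, 1, 8, 30)})
store.add({"id": "b", "timestamp": datetime(2017, 9, 9, 23, 59)})
store.expire(now=datetime(2017, 9, 10))
remaining = sorted(store.partitions)
```

The caller only ever calls `add` and `expire` on the store as a whole; which partition a document lands in, and which partitions disappear, is decided internally.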

      This issue is an umbrella issue for the particular tasks that will make it all happen, tracked either as subtasks or through issue links.

            People

              Assignee: David Smiley (dsmiley)
              Reporter: David Smiley (dsmiley)
              Votes: 7
              Watchers: 18

              Dates

                Created:
                Updated:

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 13h 10m