Solr / SOLR-11299

Time partitioned collections (umbrella issue)

Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: SolrCloud
    • Labels: None

    Description

      Solr ought to be able to manage large-scale time-series data (think logs or sensor data / IoT) itself, without a lot of manual or external work. The most naive and painless approach today is to create a collection with a high numShards and hash routing, but this isn't as good as partitioning the underlying indexes by time, which has these advantages:

      • Easy to scale up/down horizontally as data/requirements change. (No need to over-provision, use shard splitting, or re-index with different config)
      • Faster queries:
        • can search fewer shards, reducing overall load
        • realtime search is more tractable (since most shards are stable – good caches)
        • "recent" shards (that might be queried more) can be allocated to faster hardware
        • aged out data is simply removed, not marked as deleted. Deleted docs still have search overhead.
      • An outage of a shard results in a degraded but sometimes still useful system (compare to a random subset of the data missing)

      Ideally you could set this up once and then simply work with a collection (potentially actually an alias) in the normal way (search or update), letting Solr handle the addition of new partitions, the removal of old ones, and the appropriate routing of requests depending on their nature.
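
As a rough illustration of the behaviour described above, here is a minimal in-memory sketch in Python. It is not Solr code and not the design adopted in the sub-tasks: the class `TimePartitionedAlias`, its methods, the fixed `gap`, the `max_age` setting, and the `<alias>_<date>` naming scheme are all assumptions made for the example. It shows the three responsibilities mentioned: routing an update to a partition by timestamp (adding partitions as needed), limiting a search to the partitions that overlap a time range, and dropping whole partitions once they age out.

```python
from datetime import datetime, timedelta, timezone


class TimePartitionedAlias:
    """Hypothetical model of an alias over time-partitioned collections."""

    def __init__(self, name, start, gap, max_age=None):
        self.name = name
        self.gap = gap              # fixed width of each partition
        self.max_age = max_age      # None means partitions are never dropped
        self.partitions = [start]   # sorted partition start times

    def partition_name(self, start):
        # Assumed naming scheme: alias name plus the partition's start date.
        return f"{self.name}_{start:%Y-%m-%d}"

    def route(self, ts):
        """Return the partition for a document timestamp, adding new ones as needed."""
        if ts < self.partitions[0]:
            raise ValueError("document is older than the oldest partition")
        # Append partitions until the newest one covers the timestamp.
        while ts >= self.partitions[-1] + self.gap:
            self.partitions.append(self.partitions[-1] + self.gap)
        start = max(p for p in self.partitions if p <= ts)
        return self.partition_name(start)

    def partitions_for_range(self, lo, hi):
        """Partitions overlapping the half-open range [lo, hi); a search needs only these."""
        return [self.partition_name(p) for p in self.partitions
                if p < hi and p + self.gap > lo]

    def expire(self, now):
        """Drop whole partitions that ended at or before now - max_age; keep the newest."""
        if self.max_age is None:
            return []
        cutoff = now - self.max_age
        dropped = [p for p in self.partitions[:-1] if p + self.gap <= cutoff]
        self.partitions = [p for p in self.partitions if p not in dropped]
        return [self.partition_name(p) for p in dropped]


utc = timezone.utc
alias = TimePartitionedAlias("logs", datetime(2017, 9, 1, tzinfo=utc),
                             gap=timedelta(days=1), max_age=timedelta(days=2))
print(alias.route(datetime(2017, 9, 3, 1, tzinfo=utc)))
print(alias.expire(datetime(2017, 9, 4, tzinfo=utc)))
```

In this sketch, aged-out data disappears by removing an entire partition rather than by marking individual documents as deleted, which is the property the description relies on.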

      This issue is an umbrella issue for the particular tasks that will make it all happen – either subtasks or issue linking.

        Issue Links

        1. Collection Alias metadata for time partitioned collections (Sub-task; Status: Resolved; Assignee: David Smiley)
        2. Add URP to route time partitioned collections (Sub-task; Status: Closed; Assignee: David Smiley)
        3. Expose Alias Properties CRUD in REST API (Sub-task; Status: Closed; Assignee: David Smiley; Time Spent: 2h 20m)
        4. create next time collection based on a fixed time gap (Sub-task; Status: Closed; Assignee: David Smiley)
        5. TimePartitionedUpdateProcessor.lookupShardLeaderOfCollection should route to the ideal shard (Sub-task; Status: Closed; Assignee: David Smiley)
        6. API to create a Time Routed Alias and first collection (Sub-task; Status: Closed; Assignee: David Smiley; Time Spent: 50m)
        7. API command to delete oldest collections in a time routed alias (Sub-task; Status: Open; Unassigned)
        8. Auto delete oldest collections in a time routed alias (Sub-task; Status: Closed; Assignee: David Smiley)
        9. Create Time Routed Alias stress-test (Sub-task; Status: Open; Unassigned)
        10. Exception Class to identify out of range docs vs other errors (Sub-task; Status: Open; Unassigned)
        11. add option for deleting an alias to delete collections first (Sub-task; Status: Open; Unassigned)
        12. Document Time Routed Aliases separate from API (Sub-task; Status: Closed; Assignee: David Smiley)
        13. TRA: Pre-emptively create next collection (Sub-task; Status: Closed; Assignee: David Smiley; Time Spent: 9.5h)
        14. TRA: evaluate autoDeleteAge independently of when collections are created (Sub-task; Status: Open; Unassigned; Time Spent: 20m)
        15. TRA: document re-dating (question, test, docs) (Sub-task; Status: Open; Unassigned)
        16. Optimize Queries when sorting by router.field (Sub-task; Status: Patch Available; Assignee: Gus Heck; Time Spent: 10m)
        17. Optimize Queries when query filtering by TRA router.field (Sub-task; Status: Patch Available; Assignee: Gus Heck)

          People

            Assignee: David Smiley (dsmiley)
            Reporter: David Smiley (dsmiley)

            Dates

              Created:
              Updated:

              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 13h 10m
