HIVE-13479

Relax sorting requirement in ACID tables

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.2.0
    • Fix Version/s: None
    • Component/s: Transactions
    • Labels:
      None

      Description

      Currently, ACID tables require data to be sorted according to the internal primary key. This is so that base and delta files can be efficiently sort-merged to produce the snapshot for the current transaction.

      This prevents the user from sorting the table by any other criteria, which can be useful. One example is dynamic partition insert (which also occurs for update/delete SQL). This may create a large number of writers (buckets * partitions) and tax cluster resources.
      The usual solution is hive.optimize.sort.dynamic.partition=true, which is not honored for ACID tables.
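
To make the writer pressure concrete, here is a toy model (not Hive's writer code) that counts how many writers are open at once. It assumes one writer per (partition, bucket) key, and that a writer can be closed on a key change only when the input is sorted by that key. The function name `peak_open_writers` and the 100 x 8 layout are hypothetical, chosen for illustration.

```python
def peak_open_writers(rows, close_on_key_change):
    """Return the peak number of simultaneously open writers.

    rows: iterable of (partition, bucket) keys, one per incoming row.
    close_on_key_change: True models sorted input, where the previous
    writer can be closed as soon as the key changes. False models
    unsorted input, where every writer stays open until the input ends.
    """
    open_writers = set()
    peak = 0
    previous = None
    for key in rows:
        if close_on_key_change and previous is not None and key != previous:
            open_writers.discard(previous)  # no more rows for this key
        open_writers.add(key)
        peak = max(peak, len(open_writers))
        previous = key
    return peak


# Hypothetical layout: 100 partitions x 8 buckets, rows interleaved
# so that consecutive rows land in different partitions.
rows = [(p, b) for b in range(8) for p in range(100)]

unsorted_peak = peak_open_writers(rows, close_on_key_change=False)
sorted_peak = peak_open_writers(sorted(rows), close_on_key_change=True)
```

In this model the unsorted stream holds buckets * partitions writers open, while the stream sorted by partition and bucket needs only one at a time. That is the benefit ACID tables currently cannot get, because their sort order is fixed by the internal primary key.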

      We could rely on a hash-table-based algorithm to merge delta files, and then not require any particular sort order on ACID tables. One way to do that is to treat each update event as an insert (new internal PK) plus a delete (old PK). Delete events are very small, since they only need to contain PKs, so the hash table would only need to hold delete events and would be reasonably memory efficient.
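
The proposal above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Hive's reader: `Event`, `split_update`, and `snapshot` are hypothetical names, the PK is modeled as a plain integer, and transaction visibility rules are ignored entirely.

```python
from typing import Iterable, Iterator, List, NamedTuple, Optional


class Event(NamedTuple):
    kind: str                      # "insert" or "delete"
    pk: int                        # hypothetical internal primary key
    payload: Optional[str] = None  # row data; delete events carry none


def split_update(old_pk: int, new_pk: int, payload: str) -> List[Event]:
    """Model an update as a delete of the old PK plus an insert under a new PK."""
    return [Event("delete", old_pk), Event("insert", new_pk, payload)]


def snapshot(base: Iterable[Event], deltas: Iterable[Event]) -> Iterator[Event]:
    """Yield the live rows without requiring any sort order on the inputs."""
    deltas = list(deltas)  # two passes: collect deletes, then emit inserts
    # Only delete PKs go into the hash set, so memory grows with the
    # number of deletes rather than with the size of the table.
    deleted = {e.pk for e in deltas if e.kind == "delete"}
    for row in base:
        if row.pk not in deleted:
            yield row
    for e in deltas:
        if e.kind == "insert" and e.pk not in deleted:
            yield e


base = [Event("insert", 1, "a"), Event("insert", 2, "b"), Event("insert", 3, "c")]
deltas = split_update(old_pk=2, new_pk=4, payload="B") + [Event("delete", 3)]
live = sorted((e.pk, e.payload) for e in snapshot(base, deltas))
```

Because membership is checked by hash lookup rather than by advancing sorted cursors, base and delta rows can arrive in any order, which is what would free the table to be sorted on other criteria.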

      This is a significant amount of work but worth doing.

              People

              • Assignee:
                ekoifman Eugene Koifman
                Reporter:
                ekoifman Eugene Koifman
              • Votes:
                0
                Watchers:
                5

                Dates

                • Created:
                  Updated:

                  Time Tracking

                  Estimated:
                  160h
                  Remaining:
                  160h
                  Logged:
                  Not Specified