Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-11683

Hive Streaming may overload the metastore

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 1.0.0
    • None
    • None

    Description

      HiveEndPoint represents a way to write to a specific partition transactionally.
      Each HiveEndPoint creates TransactionBatch(es) and commits transactions.

      Suppose you have 10 instances of Storm Hive bolt using Streaming API.
      Each instance will create HiveEndPoints on demand when it sees an event for particular partition value.

      If events are uniformly distributed wrt partition values and the table has 1000 partitions (for example it's partitioned by CustomerId), each of 10 bolt instances may create 1000 HiveEndPoints and thus > 10,000 (actually 10K * num_txn_per_batch) concurrent transactions.

      This creates huge amount of Metastore traffic.

      HIVE-11672 is investigating how some sort of "shuffle" phase can be added route events for a particular bucket to the same bolt instance.

      The same idea should explored to route events based on partition value.

      cc alangates,sriharsha,rbains

      Attachments

        Issue Links

          Activity

            People

              roshan_naik Roshan Naik
              ekoifman Eugene Koifman
              Votes:
              1 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated: