Flume / FLUME-1734

Create a Hive Sink based on the new Hive Streaming support

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: v1.2.0
    • Fix Version/s: None
    • Component/s: Sinks+Sources
    • Labels:

      Description

      Create a sink that streams data into HCatalog partitions. The primary goal is that once the data is loaded into Hadoop, it should be automatically queryable (using, say, Hive or Pig) without requiring additional post-processing steps on behalf of users. The sink should manage the creation of new partitions and commit them periodically.
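      As an illustration of the intended usage, a hypothetical agent configuration for such a sink might look like the following. The sink type and every property name below are invented for illustration; none of them are taken from an actual patch:

        agent1.sinks = hiveSink
        agent1.sinks.hiveSink.channel = memChannel
        # hypothetical sink class
        agent1.sinks.hiveSink.type = org.example.flume.HCatalogSink
        agent1.sinks.hiveSink.metastore.uri = thrift://metastore-host:9083
        agent1.sinks.hiveSink.database = default
        agent1.sinks.hiveSink.table = weblogs
        # partition computed per event, e.g. from its timestamp
        agent1.sinks.hiveSink.partition = %Y-%m-%d-%H
        # how often (ms) to commit the open partition and roll to a new one
        agent1.sinks.hiveSink.commitInterval = 60000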

      Attachments

      1. FLUME-1734.draft.1.patch
        0.5 kB
        Roshan Naik
      2. FLUME-1734.draft.2.patch
        45 kB
        Roshan Naik
      3. FLUME-1734.v1.patch
        103 kB
        Roshan Naik

          Activity

          Mike Percy added a comment -

          Hey Roshan,
          Sounds interesting. Please pardon my limited knowledge about HCatalog, but I have a few questions about the approach you are proposing.

          1. Would all of the partitions be calculated on the client side? Or would all of that loading logic happen via map/reduce jobs? Or would it be a mix?
          2. If client side, what are the HCatalog API calls that can be used to stream the data onto HDFS?
          3. Would this be able to support a secure Metastore? What about Kerberized HDFS clusters?
          4. How much overlap do you see with the HDFS sink?

          The HCatalog docs that I've found only seem to talk about using HCatalog in the context of Hive, Pig, and other types of MapReduce jobs.

          Roshan Naik added a comment - edited

          Mike,
          1. There will be no MapReduce. This will all be client side (i.e., Flume agents) streaming data in parallel into HCatalog. Clients will compute the specific partition into which the data will be written. Periodically (configurable), they will 'commit' the currently open partition and roll over to a new partition. Until the partition is committed, its data will not be queryable. There is one restriction: once a partition is committed, its data cannot be modified.

          2. org.apache.hcatalog.data.transfer.* (see the sketch after this list)

          3. I have not verified secure-mode HCat operation, but it appears to be supported. Will get back to you.

          4. At the moment, I don't see much code overlap with the HDFS sink for the core data-movement functionality. There may always be room for sharing other smaller tidbits.
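          For reference, here is a minimal sketch of a client-side write through org.apache.hcatalog.data.transfer (the API mentioned in point 2). The database, table, and partition values are made up, and error handling is elided:

            import java.util.HashMap;
            import java.util.Iterator;
            import java.util.Map;

            import org.apache.hcatalog.data.HCatRecord;
            import org.apache.hcatalog.data.transfer.DataTransferFactory;
            import org.apache.hcatalog.data.transfer.HCatWriter;
            import org.apache.hcatalog.data.transfer.WriteEntity;
            import org.apache.hcatalog.data.transfer.WriterContext;

            public class HCatWriteSketch {
              // Writes one batch of records into a single partition of a
              // hypothetical 'weblogs' table.
              public static void writeBatch(Iterator<HCatRecord> records) throws Exception {
                Map<String, String> partition = new HashMap<String, String>();
                partition.put("dt", "2013-01-01");

                WriteEntity entity = new WriteEntity.Builder()
                    .withDatabase("default")
                    .withTable("weblogs")
                    .withPartition(partition)
                    .build();

                // Master writer: sets up the write and produces a context that
                // can be handed to parallel slave writers on other nodes.
                HCatWriter master =
                    DataTransferFactory.getHCatWriter(entity, new HashMap<String, String>());
                WriterContext context = master.prepareWrite();

                // Slave writer (here, in-process) streams the records to HDFS.
                HCatWriter slave = DataTransferFactory.getHCatWriter(context);
                slave.write(records);

                // Until this commit, the new partition is not visible to queries.
                master.commit(context);
              }
            }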

          Mike Percy added a comment -

          Hi Roshan,
          Cool! A couple aspects to consider as you are mulling this over:

          • If a Flume Transaction is committed by the sink, then the data must be persisted. We need to avoid getting into states where any committed Channel.take() could be lost somehow (see the sketch after this list). One way to do that today (requires some setup though) is to write to an external Hive table and then periodically do a LOAD via Oozie or something, which could move the files out of the external table and into the desired partitions.
          • If the HCat APIs don't work with secure metastores or secure HDFS yet, it might be worth considering other APIs at the moment. However, if it can navigate the necessary Hive & Hadoop security features to partition and write the data, it sounds great to me! This is just my opinion; of course you are welcome to take it or leave it.
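          For context on the first point, the durability contract comes from the standard Flume sink transaction loop, sketched below. The HiveWriter abstraction is a hypothetical stand-in for whatever Hive/HCat client the sink ends up using:

            import org.apache.flume.Channel;
            import org.apache.flume.Event;
            import org.apache.flume.EventDeliveryException;
            import org.apache.flume.Transaction;
            import org.apache.flume.sink.AbstractSink;

            public class HiveSinkSketch extends AbstractSink {

              /** Hypothetical stand-in for the real Hive/HCat client. */
              interface HiveWriter {
                void write(byte[] record) throws Exception;
                void commit() throws Exception;
              }

              private HiveWriter writer;

              @Override
              public Status process() throws EventDeliveryException {
                Channel channel = getChannel();
                Transaction txn = channel.getTransaction();
                txn.begin();
                try {
                  Event event = channel.take();
                  if (event == null) {
                    txn.commit();
                    return Status.BACKOFF;
                  }
                  writer.write(event.getBody());
                  // The Hive-side commit must succeed *before* the Flume
                  // transaction commits; otherwise a committed take() could
                  // be lost if the Hive write later fails.
                  writer.commit();
                  txn.commit();
                  return Status.READY;
                } catch (Exception e) {
                  txn.rollback();
                  throw new EventDeliveryException("Failed to write to Hive", e);
                } finally {
                  txn.close();
                }
              }
            }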
          Roshan Naik added a comment -

          FYI: Due to limitations in the current Hive and HCatalog support for streaming clients, a new initiative is being pursued in Hive (and HCatalog) to properly support them.

          https://issues.apache.org/jira/browse/HIVE-4196

          One of the things being baked into it is the ability to support the transactional commits needed by Flume sinks.

          Hari Shreedharan added a comment -

          Thanks for following up on this Roshan.

          Roshan Naik added a comment -

          Update:
          The HCatalog-based streaming support (HIVE-4196) in Hive has been abandoned in favor of a different approach (HIVE-5687) based on Hive's native ACID transaction support.

          So I am planning to rename this JIRA to "Create a Hive sink."
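          For reference, here is a sketch of what writing through the new streaming support looks like, assuming the org.apache.hive.hcatalog.streaming API that the HIVE-5687 work introduces. The metastore URI, database, table, columns, and partition values are all made up:

            import java.util.Arrays;

            import org.apache.hive.hcatalog.streaming.DelimitedInputWriter;
            import org.apache.hive.hcatalog.streaming.HiveEndPoint;
            import org.apache.hive.hcatalog.streaming.StreamingConnection;
            import org.apache.hive.hcatalog.streaming.TransactionBatch;

            public class HiveStreamingSketch {
              public static void main(String[] args) throws Exception {
                // Made-up metastore URI and table; the target table must be
                // bucketed and stored as ORC for Hive ACID/streaming to work.
                HiveEndPoint endPoint = new HiveEndPoint(
                    "thrift://metastore-host:9083", "default", "weblogs",
                    Arrays.asList("2013-01-01"));  // partition values

                StreamingConnection conn =
                    endPoint.newConnection(true /* create partition if needed */);
                DelimitedInputWriter writer =
                    new DelimitedInputWriter(new String[] {"ip", "url"}, ",", endPoint);

                // A batch of transactions is fetched up front; each commit makes
                // the rows written under that transaction visible to queries --
                // exactly the transactional commit a Flume sink needs.
                TransactionBatch txnBatch = conn.fetchTransactionBatch(10, writer);
                txnBatch.beginNextTransaction();
                txnBatch.write("10.0.0.1,/index.html".getBytes());
                txnBatch.commit();
                txnBatch.close();
                conn.close();
              }
            }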

          Roshan Naik added a comment -

          Draft patch for review. No tests currently.

          Roshan Naik added a comment -

          Link to review board

          Roshan Naik added a comment -

          Updating patch... adding serializer support to the sink
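          For illustration, here is the shape such a serializer hook might take; the interface and method names are hypothetical, not necessarily what the patch uses:

            import org.apache.flume.Event;
            import org.apache.hive.hcatalog.streaming.TransactionBatch;

            /**
             * Hypothetical serializer contract: turns a Flume Event into a
             * record written into the current Hive transaction batch.
             * Implementations might handle delimited text, JSON, etc.
             */
            public interface HiveEventSerializer {
              void write(TransactionBatch txnBatch, Event event) throws Exception;
            }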

          Roshan Naik added a comment -

          Uploading fully functional patch.


            People

            • Assignee: Roshan Naik
            • Reporter: Roshan Naik
            • Votes: 0
            • Watchers: 6
