Hive
  1. Hive
  2. HIVE-4943

An explode function that includes the item's position in the array

    Details

    • Type: New Feature New Feature
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.11.0
    • Fix Version/s: 0.13.0
    • Component/s: Query Processor
    • Labels:

      Description

      A function that explodes an array and includes an output column with the position of each item in the original array.

      1. HIVE-4943.1.patch
        5 kB
        Niko Stahl
      2. HIVE-4943.2.patch
        8 kB
        Niko Stahl
      3. HIVE-4943.3.patch
        8 kB
        Edward Capriolo

        Activity

        Hide
        Niko Stahl added a comment -

        Please review.

        Show
        Niko Stahl added a comment - Please review.
        Hide
        Niko Stahl added a comment -

        Please review. Thanks!

        Show
        Niko Stahl added a comment - Please review. Thanks!
        Hide
        Niko Stahl added a comment -

        Please review.

        Show
        Niko Stahl added a comment - Please review.
        Hide
        Ashutosh Chauhan added a comment -

        I cannot see any patch attached to the jira. What is it that needs review?

        Show
        Ashutosh Chauhan added a comment - I cannot see any patch attached to the jira. What is it that needs review?
        Hide
        Niko Stahl added a comment -

        Sorry I've now attached the patch for your review. Thanks!

        Show
        Niko Stahl added a comment - Sorry I've now attached the patch for your review. Thanks!
        Hide
        Hive QA added a comment -

        Overall: -1 at least one tests failed

        Here are the results of testing the latest attachment:
        https://issues.apache.org/jira/secure/attachment/12598792/HIVE-4943.1.patch

        ERROR: -1 due to 2 failed/errored test(s), 2886 tests executed
        Failed tests:

        org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_udtf_posexplode
        org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_infer_bucket_sort_reducers_power_two
        

        Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/477/testReport
        Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/477/console

        Messages:

        Executing org.apache.hive.ptest.execution.PrepPhase
        Executing org.apache.hive.ptest.execution.ExecutionPhase
        Executing org.apache.hive.ptest.execution.ReportingPhase
        Tests failed with: TestsFailedException: 2 tests failed
        

        This message is automatically generated.

        Show
        Hive QA added a comment - Overall : -1 at least one tests failed Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12598792/HIVE-4943.1.patch ERROR: -1 due to 2 failed/errored test(s), 2886 tests executed Failed tests: org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_udtf_posexplode org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_infer_bucket_sort_reducers_power_two Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/477/testReport Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/477/console Messages: Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests failed with: TestsFailedException: 2 tests failed This message is automatically generated.
        Hide
        Navis added a comment -

        Seemed too specific to be included in hive built-in.

        Show
        Navis added a comment - Seemed too specific to be included in hive built-in.
        Hide
        Ashutosh Chauhan added a comment -

        Agree with Navis This one looks too specific to be included in hive built-ins.

        Show
        Ashutosh Chauhan added a comment - Agree with Navis This one looks too specific to be included in hive built-ins.
        Hide
        Niko Stahl added a comment -

        Yes, it's quite specific but very useful when the ordering of the original array matters. I've uploaded a new patch that should pass those failing tests. Thanks for your feedback.

        Show
        Niko Stahl added a comment - Yes, it's quite specific but very useful when the ordering of the original array matters. I've uploaded a new patch that should pass those failing tests. Thanks for your feedback.
        Hide
        Hive QA added a comment -

        Overall: +1 all checks pass

        Here are the results of testing the latest attachment:
        https://issues.apache.org/jira/secure/attachment/12598937/HIVE-4943.2.patch

        SUCCESS: +1 2887 tests passed

        Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/489/testReport
        Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/489/console

        Messages:

        Executing org.apache.hive.ptest.execution.PrepPhase
        Executing org.apache.hive.ptest.execution.ExecutionPhase
        Executing org.apache.hive.ptest.execution.ReportingPhase
        

        This message is automatically generated.

        Show
        Hive QA added a comment - Overall : +1 all checks pass Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12598937/HIVE-4943.2.patch SUCCESS: +1 2887 tests passed Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/489/testReport Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/489/console Messages: Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase This message is automatically generated.
        Hide
        Niko Stahl added a comment -

        What's the verdict? Will there be a vote?

        Show
        Niko Stahl added a comment - What's the verdict? Will there be a vote?
        Hide
        Michael Haeusler added a comment -

        An awesome feature of Hive is the rich type system with excellent support for complex data-structures. To me, this ticket seems like a very useful extension to the hive built-ins. It is especially helpful for those users, that use complex data-structures. Right now, queries are often cumbersome when you access denormalized or nested data.

        E.g., let's consider a table that contains products together with their most popular accessories (cross-sellings). The order of the cross-selling products matter:

        {
          "productId": 42,
          "name": "most awesome mp3 player",
          "manufacturer": "acme corp",
        
          "accessories": [
            { "productId" : 23, "name": "batteries", "manufacturer": "acme corp" },
            { "productId" : 25, "name": "extra load earphones", "manufacturer": "noisemakers inc" }
          ]
        }
        

        Let's assume we want to know the average position in cross-sellings of the manufacturer "noisemakers inc". Surprisingly, this is not possible with hive built-ins. You could try to come up with a custom UDFSequence and a query like this:

        SELECT
          AVG(SEQUENCE(p.productId)) AS wrongAverage
        FROM
          products p
        LATERAL VIEW
          EXPLODE(p.accessories) pa AS accessory
        WHERE
          pa.accessory.manufacturer = 'noisemakers inc';
        

        Unfortunately, the above query will give us wrong results, because Hive executes the predicate in the where clause first. Therefore, any UDF in the select clause has no chance to see and count all values.

        Using the UDTF from this ticket seems to be the best solution:

        SELECT
          AVG(pa.pos) AS correctAverage
        FROM
          products p
        LATERAL VIEW
          POSEXPLODE(p.accessories) pa AS pos, accessory
        WHERE
          pa.accessory.manufacturer = 'noisemakers inc';
        
        Show
        Michael Haeusler added a comment - An awesome feature of Hive is the rich type system with excellent support for complex data-structures. To me, this ticket seems like a very useful extension to the hive built-ins. It is especially helpful for those users, that use complex data-structures. Right now, queries are often cumbersome when you access denormalized or nested data. E.g., let's consider a table that contains products together with their most popular accessories (cross-sellings). The order of the cross-selling products matter: { "productId": 42, "name": "most awesome mp3 player", "manufacturer": "acme corp", "accessories": [ { "productId" : 23, "name": "batteries", "manufacturer": "acme corp" }, { "productId" : 25, "name": "extra load earphones", "manufacturer": "noisemakers inc" } ] } Let's assume we want to know the average position in cross-sellings of the manufacturer "noisemakers inc". Surprisingly, this is not possible with hive built-ins. You could try to come up with a custom UDFSequence and a query like this: SELECT AVG(SEQUENCE(p.productId)) AS wrongAverage FROM products p LATERAL VIEW EXPLODE(p.accessories) pa AS accessory WHERE pa.accessory.manufacturer = 'noisemakers inc'; Unfortunately, the above query will give us wrong results, because Hive executes the predicate in the where clause first. Therefore, any UDF in the select clause has no chance to see and count all values. Using the UDTF from this ticket seems to be the best solution: SELECT AVG(pa.pos) AS correctAverage FROM products p LATERAL VIEW POSEXPLODE(p.accessories) pa AS pos, accessory WHERE pa.accessory.manufacturer = 'noisemakers inc';
        Hide
        Niko Stahl added a comment -

        Hi Yin,

        Sorry I'm new to this process. Is any further action required on my part?

        Thanks,
        Niko

        Show
        Niko Stahl added a comment - Hi Yin, Sorry I'm new to this process. Is any further action required on my part? Thanks, Niko
        Hide
        Edward Capriolo added a comment -

        +1. Let me re-upload the patch after it retests I will commit.

        Show
        Edward Capriolo added a comment - +1. Let me re-upload the patch after it retests I will commit.
        Hide
        Niko Stahl added a comment -

        Sounds good, thanks!

        Show
        Niko Stahl added a comment - Sounds good, thanks!
        Hide
        Hive QA added a comment -

        Overall: +1 all checks pass

        Here are the results of testing the latest attachment:
        https://issues.apache.org/jira/secure/attachment/12608151/HIVE-4943.3.patch

        SUCCESS: +1 4397 tests passed

        Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/1113/testReport
        Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/1113/console

        Messages:

        Executing org.apache.hive.ptest.execution.PrepPhase
        Executing org.apache.hive.ptest.execution.ExecutionPhase
        Executing org.apache.hive.ptest.execution.ReportingPhase
        

        This message is automatically generated.

        Show
        Hive QA added a comment - Overall : +1 all checks pass Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12608151/HIVE-4943.3.patch SUCCESS: +1 4397 tests passed Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/1113/testReport Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/1113/console Messages: Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase This message is automatically generated.
        Hide
        Edward Capriolo added a comment -

        Resolved. Thanks Niko. Next time tag me as a watcher or make more noise if the patch takes so long.

        Show
        Edward Capriolo added a comment - Resolved. Thanks Niko. Next time tag me as a watcher or make more noise if the patch takes so long.
        Hide
        Lefty Leverenz added a comment -

        This will need to be documented in the wiki, with a version note and a link to this JIRA. You can wait until Hive 0.13 is released, but please don't forget.

        Use this version tag for consistency with other docs:

        {info:title=Version}
        To be introduced in Hive 0.13. See <link-to-JIRA>.
        {info}

        Show
        Lefty Leverenz added a comment - This will need to be documented in the wiki, with a version note and a link to this JIRA. You can wait until Hive 0.13 is released, but please don't forget. add to table here: Built-in Table-Generating Functions (UDTF) after the table, add a section similar to this (or change to "explode and posexplode"): explode Use this version tag for consistency with other docs: {info:title=Version} To be introduced in Hive 0.13. See <link-to-JIRA>. {info}
        Hide
        Lefty Leverenz added a comment -

        Lars Francke documented this in the wiki. See Built-in Table-Generating Functions and posexplode.

        Thanks, Lars. (Also thanks for fixing typos on the Home page.)

        Show
        Lefty Leverenz added a comment - Lars Francke documented this in the wiki. See Built-in Table-Generating Functions and posexplode . Thanks, Lars. (Also thanks for fixing typos on the Home page.)
        Hide
        Niko Stahl added a comment -

        Thanks, Lars!

        Show
        Niko Stahl added a comment - Thanks, Lars!

          People

          • Assignee:
            Unassigned
            Reporter:
            Niko Stahl
          • Votes:
            1 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Time Tracking

              Estimated:
              Original Estimate - 8h
              8h
              Remaining:
              Remaining Estimate - 8h
              8h
              Logged:
              Time Spent - Not Specified
              Not Specified

                Development