HIVE-404

Problems in "SELECT * FROM t SORT BY col1 LIMIT 100"

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.3.0
    • Component/s: Query Processor
    • Labels: None
    • Hadoop Flags: Reviewed
    • HIVE-404. Fix ordering in "SELECT * FROM t SORT BY col1 LIMIT 100" when the query is an outermost query. (Namit Jain via zshao)

    Description

      Unless the user specifies "set mapred.reduce.tasks=1;", they will see unexpected results from the query "SELECT * FROM t SORT BY col1 LIMIT 100".

      Basically, in the first map-reduce job, each reducer gets sorted data and keeps only the first 100 rows. In the second map-reduce job, we distribute and sort the data randomly before feeding it into a single reducer that outputs the first 100 rows.

      In short, the query outputs 100 random records out of the N * 100 top records produced by the N reducers of the first map-reduce job.
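
      The behaviour described above can be reproduced with a small Python simulation. This is only a sketch of the dataflow, not Hive's code: the function names, the modulo partitioning of rows across reducers, and the seeded shuffle standing in for the "random" sort of the second job are all assumptions made for illustration.

```python
import random


def first_job(rows, num_reducers, limit):
    """Each reducer sorts its own partition and keeps its first `limit` rows."""
    partitions = [[] for _ in range(num_reducers)]
    for row in rows:
        # Assumption: rows are spread over reducers independently of the global order.
        partitions[row % num_reducers].append(row)
    return [sorted(part)[:limit] for part in partitions]


def second_job_buggy(per_reducer_top, limit, seed=0):
    """A single reducer receives the rows in an order unrelated to col1."""
    merged = [row for part in per_reducer_top for row in part]
    # Stand-in for "distribute and sort the data randomly" (hypothetical).
    random.Random(seed).shuffle(merged)
    return merged[:limit]


rows = list(range(10_000))                      # col1 values
top = first_job(rows, num_reducers=4, limit=100)
candidates = {row for part in top for row in part}
result = second_job_buggy(top, limit=100)
```

      With N = 4 reducers, `candidates` holds the N * 100 = 400 rows that survive the first job, and `result` is an arbitrary 100 of them rather than the 100 smallest values of col1.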

      This contradicts what people expect.

      We should propagate the SORT BY columns to the second map-reduce job.
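
      A minimal sketch of the proposed behaviour, again as an illustrative Python simulation rather than the actual patch: if the second job orders rows by the same SORT BY column, the single reducer sees them in sorted order and the limit keeps the true top rows. The function name and the sample data are hypothetical.

```python
import heapq
from itertools import islice


def second_job_with_sort_key(per_reducer_top, limit):
    """A single reducer receives rows ordered by the propagated SORT BY column."""
    # Every input list is already sorted, so a k-way merge keeps the order.
    merged = heapq.merge(*per_reducer_top)
    return list(islice(merged, limit))


# Hypothetical output of the first job: three reducers, each with its sorted top 3.
per_reducer_top = [[0, 3, 6], [1, 4, 7], [2, 5, 8]]
fixed = second_job_with_sort_key(per_reducer_top, limit=3)
```

      This works because the global top rows are always contained in the union of the per-reducer top rows, so sorting that union by col1 before applying the limit is sufficient.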

      Attachments

        1. hive.404.1.patch
          4 kB
          Namit Jain
        2. hive.404.2.patch
          4 kB
          Namit Jain

          People

            Namit Jain (namit)
            Zheng Shao (zshao)
            Votes: 0
            Watchers: 2
