Hive
  1. Hive
  2. HIVE-887

Allow SELECT <col> without a mapreduce job

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.10.0
    • Component/s: None
    • Labels:
      None
    • Environment:

      All

    • Release Note:
      Set hive-conf hive.fetch.task.conversion to more to make use of this feature. Turned-off by default.

      Description

      I often find myself needing to take a quick look at a particular column of a Hive table.

      I usually do this by doing a
      SELECT * from <table> LIMIT 20;
      from the CLI. Doing this is pretty fast since it doesn't require a mapreduce job. However, it's tough to examine just 1 or 2 columns when the table is very wide.

      So, I might do
      SELECT <col> from <table> LIMIT 20;
      but it's much slower since it requires a map-reduce. It'd be really convenient if a map-reduce wasn't necessary.

      Currently a good work around is to do
      hive -e "select * from table" | cut --key=n
      but it'd be more convenient if it were built in since it alleviates the need for column counting.

        Issue Links

          Activity

          Transition Time In Source Status Execution Times Last Executer Last Execution Date
          Open Open Resolved Resolved
          1149d 3m 1 Ashutosh Chauhan 11/Dec/12 16:38
          Resolved Resolved Closed Closed
          30d 3h 15m 1 Ashutosh Chauhan 10/Jan/13 19:53
          Hide
          Navis added a comment -

          We should defaultize hive.fetch.task.conversion as more, someday.

          Show
          Navis added a comment - We should defaultize hive.fetch.task.conversion as more, someday.
          Hide
          Tim Goodman added a comment -

          Hmm, apparently I have to do:
          SET hive.fetch.task.conversion=more;
          (default was hive.fetch.task.conversion=minimal)

          Show
          Tim Goodman added a comment - Hmm, apparently I have to do: SET hive.fetch.task.conversion=more; (default was hive.fetch.task.conversion=minimal)
          Hide
          Tim Goodman added a comment -

          What is the syntax for this? I'm using hive 0.10.0, and the default behavior still appears to be to trigger a map reduce whenever I specify column names, even when I use LIMIT 1.

          Show
          Tim Goodman added a comment - What is the syntax for this? I'm using hive 0.10.0, and the default behavior still appears to be to trigger a map reduce whenever I specify column names, even when I use LIMIT 1.
          Ashutosh Chauhan made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Hide
          Ashutosh Chauhan added a comment -

          This issue is fixed and released as part of 0.10.0 release. If you find an issue which seems to be related to this one, please create a new jira and link this one with new jira.

          Show
          Ashutosh Chauhan added a comment - This issue is fixed and released as part of 0.10.0 release. If you find an issue which seems to be related to this one, please create a new jira and link this one with new jira.
          Ashutosh Chauhan made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Release Note Set hive-conf hive.fetch.task.conversion to more to make use of this feature. Turned-off by default.
          Fix Version/s 0.10.0 [ 12320745 ]
          Resolution Fixed [ 1 ]
          Hide
          Ashutosh Chauhan added a comment -

          Most of this got implemented in HIVE-2925

          Show
          Ashutosh Chauhan added a comment - Most of this got implemented in HIVE-2925
          Navis made changes -
          Link This issue is duplicated by HIVE-2925 [ HIVE-2925 ]
          Hide
          E. Sammer added a comment -

          Ning:

          That sounds great. The syntax isn't really important. I was just trying to think of a way of providing a hint to the query execution layer so the user can indicate when they're willing to take the performance hit in favor of a faster initial response. I would think that in many cases you still probably want to exec as a MR job even for small results; this fetch behavior should be a user requested special case, in my opinion.

          Show
          E. Sammer added a comment - Ning: That sounds great. The syntax isn't really important. I was just trying to think of a way of providing a hint to the query execution layer so the user can indicate when they're willing to take the performance hit in favor of a faster initial response. I would think that in many cases you still probably want to exec as a MR job even for small results; this fetch behavior should be a user requested special case, in my opinion.
          Hide
          Ning Zhang added a comment -

          Eric, we were discussing the same thing before to allow simple where-clause to the non-MR client side execution. This woud be straightforward after we make fetchOperator extends Operator. Then basically any map-only plan (including map-side joins) could be move to non-MR execution.

          The only thing I'm not sure about is the "fetch n" syntax. Since we don't have stats, we don't know the number of rows in the input file. So as a first step we can rely on the file size.

          Show
          Ning Zhang added a comment - Eric, we were discussing the same thing before to allow simple where-clause to the non-MR client side execution. This woud be straightforward after we make fetchOperator extends Operator. Then basically any map-only plan (including map-side joins) could be move to non-MR execution. The only thing I'm not sure about is the "fetch n" syntax. Since we don't have stats, we don't know the number of rows in the input file. So as a first step we can rely on the file size.
          Hide
          E. Sammer added a comment -

          I would also like this kind of functionality. I would add WHERE clause support to the request though as there are cases where you know a table will be small. What would be really ideal is to be able to define a projected threshold where, if the query execution engine think there may be many rows, it resorts to a MR job, but if under, performs client side fetch and filter. The expectation is that GROUP BY, joins, ORDER / SORT / CLUSTER and related would always cause a MR job.

          Ex:

          SELECT a, b FROM t WHERE c = 'foo' FETCH n;

          where n is an upper limit for which a fetch should be done based on the projected number of rows. If projection is still not yet on the table in Hive (I haven't looked at the internals), maybe FETCH n acts like a fetch + limit operation. Maybe n is simply some global configuration parameter, although that seems too inflexible.

          For me, Hive has been excellent for storing raw parsed log data which can be queried into summary tables of around 1 million rows. These summary tables containing aggregations are then queried by a UI for visualization. This "fetch" functionality would allow for the UI load times to go from minutes to seconds and reduce contention for task slots in a production Hadoop cluster.

          Show
          E. Sammer added a comment - I would also like this kind of functionality. I would add WHERE clause support to the request though as there are cases where you know a table will be small. What would be really ideal is to be able to define a projected threshold where, if the query execution engine think there may be many rows, it resorts to a MR job, but if under, performs client side fetch and filter. The expectation is that GROUP BY, joins, ORDER / SORT / CLUSTER and related would always cause a MR job. Ex: SELECT a, b FROM t WHERE c = 'foo' FETCH n; where n is an upper limit for which a fetch should be done based on the projected number of rows. If projection is still not yet on the table in Hive (I haven't looked at the internals), maybe FETCH n acts like a fetch + limit operation. Maybe n is simply some global configuration parameter, although that seems too inflexible. For me, Hive has been excellent for storing raw parsed log data which can be queried into summary tables of around 1 million rows. These summary tables containing aggregations are then queried by a UI for visualization. This "fetch" functionality would allow for the UI load times to go from minutes to seconds and reduce contention for task slots in a production Hadoop cluster.
          Hide
          Namit Jain added a comment -

          Based on an offline discussion, this seems very important for the users.

          Show
          Namit Jain added a comment - Based on an offline discussion, this seems very important for the users.
          Hide
          Ning Zhang added a comment -

          After looking more carefully at the code, the fetch task spawn a fetch operator at its initialization, and the fetch operator is not subclass of Operator. So we probably need to extend it to Operator class and make the select Operator as a child of fetch, or push select operator's logic to fetch operator's implementation. Will have a discussion first and put it as a lower priority in my task list for now.

          Show
          Ning Zhang added a comment - After looking more carefully at the code, the fetch task spawn a fetch operator at its initialization, and the fetch operator is not subclass of Operator. So we probably need to extend it to Operator class and make the select Operator as a child of fetch, or push select operator's logic to fetch operator's implementation. Will have a discussion first and put it as a lower priority in my task list for now.
          Hide
          Ning Zhang added a comment -

          Currently the query "select * from T [limit n]" will generate a fetch task rather than a mapred task. Since fetch task is also an operator, it seems possible to add a select operator on top of it for query 'select a, b, expr(c, d) from T [limit n]".

          Show
          Ning Zhang added a comment - Currently the query "select * from T [limit n] " will generate a fetch task rather than a mapred task. Since fetch task is also an operator, it seems possible to add a select operator on top of it for query 'select a, b, expr(c, d) from T [limit n] ".
          Ning Zhang made changes -
          Assignee Ning Zhang [ nzhang ]
          Eric Sun made changes -
          Field Original Value New Value
          Description I often find myself needing to take a quick look at a particular column of a Hive table.

          I usually do this by doing a
          SELECT * from <table> LIMIT 20;
          from the CLI. Doing this is pretty fast since it doesn't require a mapreduce job. However, it's tough to examine just 1 or 2 columns when the table is very wide.

          So, I might do
          SELECT <col> from <table> LIMIT 20;
          but it's much slower since it requires a map-reduce. It'd be really convenient if a map-reduce wasn't necessary.

          Currently a good work around is to do
          hive -e "select * from table" | cut --key=n
          but it'd be more convenient if it were built in.
          I often find myself needing to take a quick look at a particular column of a Hive table.

          I usually do this by doing a
          SELECT * from <table> LIMIT 20;
          from the CLI. Doing this is pretty fast since it doesn't require a mapreduce job. However, it's tough to examine just 1 or 2 columns when the table is very wide.

          So, I might do
          SELECT <col> from <table> LIMIT 20;
          but it's much slower since it requires a map-reduce. It'd be really convenient if a map-reduce wasn't necessary.

          Currently a good work around is to do
          hive -e "select * from table" | cut --key=n
          but it'd be more convenient if it were built in since it alleviates the need for column counting.
          Eric Sun created issue -

            People

            • Assignee:
              Ning Zhang
              Reporter:
              Eric Sun
            • Votes:
              2 Vote for this issue
              Watchers:
              14 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development