ORC-128

Add capability to get column statistics during writing

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3.0
    • Component/s: Java
    • Labels:
      None

      Description

      It would be useful if users could get the column statistics as the file is being written.

          Activity

          githubbot ASF GitHub Bot added a comment -

          GitHub user omalley opened a pull request:

          https://github.com/apache/orc/pull/78

          ORC-128. Add getStatistics to Writer API

          Allow user to getStatistics while writing files.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/omalley/orc orc-128

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/orc/pull/78.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #78


          commit 69c33fd7eb94d4786f41df6bdb17c1ba4c259065
          Author: Owen O'Malley <omalley@apache.org>
          Date: 2017-01-06T18:22:11Z

          ORC-128. Add getStatistics to Writer API to allow user to get statistics as the
          file is written.

          Signed-off-by: Owen O'Malley <omalley@apache.org>


          githubbot ASF GitHub Bot added a comment -

          Github user dain commented on the issue:

          https://github.com/apache/orc/pull/78

          Just curious, how do you plan on using this?

          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/orc/pull/78

          owen.omalley Owen O'Malley added a comment -

          I just committed this.

          githubbot ASF GitHub Bot added a comment -

          Github user prasanthj commented on the issue:

          https://github.com/apache/orc/pull/78

          @dain Hive already uses the stats API (on both the reader and writer side) to get basic statistics such as numRows and rawDataSize from the footer, which avoids row-by-row stats gathering. This new API extends the same approach to column statistics (although ORC is missing NDV, the number of distinct values, at this point).

          githubbot ASF GitHub Bot added a comment -

          Github user dain commented on the issue:

          https://github.com/apache/orc/pull/78

          I'm curious how the writer statistics are used since it is the "end of the line". Is it just to display to a user (e.g., logging), or is the engine making decisions based on the information?

          githubbot ASF GitHub Bot added a comment -

          Github user prasanthj commented on the issue:

          https://github.com/apache/orc/pull/78

          Hive auto-gathers statistics during writes (INSERT, CTAS, ...). Just before closing the file, the file sink operator gets the statistics from the Writer and publishes them for aggregation by the client. This just moves the stats collection from processOp (row-by-row or vector batch processing) to closeOp.

          Similarly, the Reader-side interface is used by ANALYZE queries to compute statistics just by reading the footer.
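
The flow described above (statistics accumulate as rows are written and are read once, just before close) can be sketched roughly as follows. This is a minimal illustration only: it does not use the ORC library, and `SketchWriter`, `LongColumnStats`, and their methods are hypothetical names invented for the example, not ORC or Hive classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical running statistics for one long-typed column (not an ORC class). */
final class LongColumnStats {
    private long count = 0;
    private long min = Long.MAX_VALUE;
    private long max = Long.MIN_VALUE;

    /** Fold one value into the running count, minimum, and maximum. */
    void update(long value) {
        count++;
        min = Math.min(min, value);
        max = Math.max(max, value);
    }

    long getCount() { return count; }
    long getMin() { return min; }
    long getMax() { return max; }
}

/** Hypothetical writer that keeps per-column statistics up to date as rows arrive. */
final class SketchWriter {
    private final LongColumnStats[] stats;
    private final List<long[]> buffered = new ArrayList<>();

    SketchWriter(int columnCount) {
        stats = new LongColumnStats[columnCount];
        for (int i = 0; i < columnCount; i++) {
            stats[i] = new LongColumnStats();
        }
    }

    /** Buffer a row and update each column's statistics in the same pass. */
    void addRow(long... row) {
        if (row.length != stats.length) {
            throw new IllegalArgumentException("expected " + stats.length + " columns");
        }
        buffered.add(row.clone());
        for (int c = 0; c < row.length; c++) {
            stats[c].update(row[c]);
        }
    }

    /** Statistics for everything written so far; callable before the writer is closed. */
    LongColumnStats[] getStatistics() {
        return stats;
    }
}
```

A caller in the role of the file sink operator would call `addRow` for each row, then read `getStatistics()` once at close time instead of computing its own statistics per row.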

          githubbot ASF GitHub Bot added a comment -

          Github user prasanthj commented on the issue:

          https://github.com/apache/orc/pull/78

          > is the engine making decisions based on the information?

          Not at this point.

          owen.omalley Owen O'Malley added a comment -

          ORC 1.3.0 was released.


            People

            • Assignee:
              owen.omalley Owen O'Malley
              Reporter:
              owen.omalley Owen O'Malley
            • Votes:
              0
              Watchers:
              2
