Hive / HIVE-5324

Extend record writer and ORC reader/writer interfaces to provide statistics

Details

    Description

      The current implementation computes statistics (number of rows and raw data size) for every row processed. The processOp() method in FileSinkOperator gets the raw data size of each row from the serde and accumulates it in a hash map while counting the rows. These accumulated statistics are then published to the metastore.
      ORC already stores enough statistics internally, and these can be used when publishing stats to the metastore. This avoids duplicating the work done in processOp(). Getting the statistics from ORC is also very cheap, since they can be read directly from the file footer.
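
      The difference between the two paths can be sketched in plain Java. This is only an illustration under stated assumptions: StatsAwareWriter, SelfTrackingWriter, RowStats and rawSize() are hypothetical names made up for this sketch and are not the Hive or ORC API, and the "raw size" of a row is taken to be its character count.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch only: every type and method name here is hypothetical, not Hive's or ORC's API. */
final class StatsSketch {

    /** The two statistics named in the description. */
    static final class RowStats {
        final long rowCount;
        final long rawDataSize;

        RowStats(long rowCount, long rawDataSize) {
            this.rowCount = rowCount;
            this.rawDataSize = rawDataSize;
        }
    }

    /** Hypothetical extension point: a writer that can report its own statistics. */
    interface StatsAwareWriter {
        void write(String row);

        RowStats getStats();
    }

    /** Stand-in for a format that keeps counters internally while it writes. */
    static final class SelfTrackingWriter implements StatsAwareWriter {
        private long rows;
        private long bytes;

        @Override
        public void write(String row) {
            rows++;
            bytes += rawSize(row);
        }

        @Override
        public RowStats getStats() {
            return new RowStats(rows, bytes);
        }
    }

    /** Stand-in for the serde's raw-size call (assumption: character count). */
    static long rawSize(String row) {
        return row.length();
    }

    /** Per-row path: the sink asks for each row's size and accumulates it in a map. */
    static RowStats perRowStats(List<String> rows, StatsAwareWriter writer) {
        Map<String, Long> acc = new HashMap<>();
        for (String row : rows) {
            writer.write(row);
            acc.merge("numRows", 1L, Long::sum);
            acc.merge("rawDataSize", rawSize(row), Long::sum);
        }
        return new RowStats(acc.getOrDefault("numRows", 0L),
                            acc.getOrDefault("rawDataSize", 0L));
    }

    /** Writer-provided path: the sink only writes, then asks the writer once. */
    static RowStats writerProvidedStats(List<String> rows, StatsAwareWriter writer) {
        for (String row : rows) {
            writer.write(row);
        }
        return writer.getStats();
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("a,1", "bb,22", "ccc,333");
        RowStats perRow = perRowStats(rows, new SelfTrackingWriter());
        RowStats fromWriter = writerProvidedStats(rows, new SelfTrackingWriter());
        System.out.println(perRow.rowCount + " rows, " + perRow.rawDataSize + " raw bytes");
        System.out.println(fromWriter.rowCount + " rows, " + fromWriter.rawDataSize + " raw bytes");
    }
}
```

      Both paths yield the same totals. In the per-row path the size bookkeeping is done twice, once by the sink and once inside the writer, which is the duplication the description wants to avoid.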

      Attachments

        1. HIVE-5324.4.patch.txt
          44 kB
          Prasanth Jayachandran
        2. HIVE-5324.3.patch.txt
          44 kB
          Prasanth Jayachandran
        3. HIVE-5324.2.patch.txt
          9 kB
          Prasanth Jayachandran
        4. HIVE-5324.1.patch.txt
          3 kB
          Prasanth Jayachandran

            People

              prasanth_j Prasanth Jayachandran
