HIVE-6449

EXPLAIN has diffs in Statistics in tests generated on Windows vs. tests generated on Linux

    Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Tests
    • Labels:
      None

      Description

      When .q.out files are generated on Windows, the statistics in EXPLAIN differ from those generated on Linux. E.g.:

      Running: diff -a /root/hive/itests/qtest/../../itests/qtest/target/qfile-results/clientpositive/vectorized_parquet.q.out /root/hive/itests/qtest/../../ql/src/test/results/clientpositive/vectorized_parquet.q.out
      72c72
      <             Statistics: Num rows: 12288 Data size: 73728 Basic stats: COMPLETE Column stats: NONE
      ---
      >             Statistics: Num rows: 2072 Data size: 257046 Basic stats: COMPLETE Column stats: NONE
      75c75
      <               Statistics: Num rows: 6144 Data size: 36864 Basic stats: COMPLETE Column stats: NONE
      ---
      >               Statistics: Num rows: 1036 Data size: 128523 Basic stats: COMPLETE Column stats: NONE
      

        Activity

        Remus Rusanu created issue -
        Prasanth Jayachandran added a comment -

        Hi Remus,

        One reason this can happen is that the Parquet SerDe does not implement the SerDeStats interface, or the Parquet record writers do not implement the StatsProvidingRecordWriter interface. Implementing these interfaces is required for gathering raw data size. Statistics in EXPLAIN will try to use the raw data size from the metastore. Raw data size should not depend on the operating system, since it is equivalent to deserialized row size * number of rows. So I believe Parquet does not implement these interfaces and hence does not provide raw data size, in which case the file size is shown as "Data size:". If the file size returned by the metastore, or by the filesystem.getContentSummary() API call, differs, then the reported statistics will differ. My suspicion is that the file sizes for the table differ between Windows and Linux. Can you verify whether the file size on Windows is the same as on Linux?
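        The fallback described above can be sketched as follows. This is a minimal Python illustration of the behavior the comment describes, not actual Hive code; the function name and values are made up for the example (the stat values come from the later describe extended output):

```python
def effective_data_size(raw_data_size: int, total_file_size: int) -> int:
    """Prefer rawDataSize from the metastore; if the SerDe / record
    writer never published it (left at 0), fall back to the on-disk
    file size, which is what then surfaces as "Data size:" in EXPLAIN."""
    if raw_data_size > 0:
        return raw_data_size
    return total_file_size

# Linux metastore entry (from the comment below): rawDataSize is present
assert effective_data_size(2165060, 126087) == 2165060
# Windows entry: rawDataSize=0, so the file size leaks into the stats
assert effective_data_size(0, 126087) == 126087
```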

        Remus Rusanu added a comment -

        Prasanth Jayachandran thanks for the guidance. Since the difference also reproduces with ORC files, I focused on those to eliminate any Parquet-related problem. For my test ORC table, created as

        CREATE TABLE decimal_mapjoin STORED AS ORC AS 
          SELECT cdouble, CAST (((cdouble*22.1)/37) AS DECIMAL(20,10)) AS cdecimal1, 
          CAST (((cdouble*9.3)/13) AS DECIMAL(23,14)) AS cdecimal2,
          cint
          FROM alltypesorc;
        

        I get the following stats in describe extended:

        describe extended decimal_mapjoin;
        ...
        Windows: {numFiles=1, COLUMN_STATS_ACCURATE=true, transient_lastDdlTime=1392727196, numRows=0, totalSize=126087, rawDataSize=0}
        Linux:   {numFiles=1, transient_lastDdlTime=1392722507, COLUMN_STATS_ACCURATE=true, totalSize=126087, numRows=12288, rawDataSize=2165060} ...
        

        So the problem is that neither ROW_COUNT nor RAW_DATA_SIZE is initialized properly. I'm investigating.
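        As a quick sanity check on the Linux numbers above (illustrative arithmetic only, assuming rawDataSize ≈ numRows × average deserialized row width, as described in the earlier comment):

```python
# Linux stats from `describe extended decimal_mapjoin`
num_rows, raw_data_size = 12288, 2165060

avg_row = raw_data_size / num_rows
# ~176 bytes per row -- plausible for a double, two decimals, and an int
assert 170 < avg_row < 180

# Windows reports numRows=0 and rawDataSize=0, so EXPLAIN has nothing
# to work with and the stats diverge from the Linux-generated .q.out
```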

        Remus Rusanu added a comment -

        In the Windows .q tests, FileSinkOperator.publishStats fails silently because of a JDBC error (java.lang.ClassNotFoundException: org.apache.derby.jdbc.EmbeddedDriver). This causes all the subsequent problems, because the stats end up as 0/0.
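        The silent-failure mode can be illustrated roughly like this. This is a hedged Python sketch of the pattern, not the actual FileSinkOperator code; function names and values are invented for the example:

```python
def publish_stats(connect):
    """Sketch of a stats publisher that swallows the connection error:
    the counters keep their 0/0 defaults and the caller never learns
    that publishing failed."""
    stats = {"numRows": 0, "rawDataSize": 0}
    try:
        connect()                 # raises when the JDBC driver is missing
        stats["numRows"] = 12288  # would normally come from task counters
        stats["rawDataSize"] = 2165060
    except Exception:
        pass  # silent failure: this is why the bug was hard to spot
    return stats

def missing_driver():
    raise RuntimeError(
        "ClassNotFoundException: org.apache.derby.jdbc.EmbeddedDriver")

# The Windows symptom: stats silently come back as 0/0
assert publish_stats(missing_driver) == {"numRows": 0, "rawDataSize": 0}
```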

        Remus Rusanu added a comment -

        Fixed by adding the Derby jar to the CLASSPATH, right there in hadoop.cmd. Just one more hack to get the .q tests to run on Windows... I wish the whole pom->surefire->driver->hadoopCLI->task chain worked correctly vis-a-vis execution in a Windows environment, but fixing it is beyond my bandwidth right now. I'm documenting my hack fix for the unfortunate soul who runs into this later... and that includes myself two months from now, when I'll have forgotten what I did.
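        The hack might look roughly like this inside hadoop.cmd (a sketch only: the jar filename, version, and location below are hypothetical and will differ per machine; the comment does not say where the jar lives):

```bat
@rem Hypothetical hadoop.cmd fragment, placed before the java invocation:
@rem append the Derby embedded-driver jar so publishStats can load
@rem org.apache.derby.jdbc.EmbeddedDriver
set CLASSPATH=%CLASSPATH%;%HIVE_HOME%\lib\derby.jar
```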


          People

          • Assignee:
            Remus Rusanu
            Reporter:
            Remus Rusanu
          • Votes:
            0
          • Watchers:
            2
