Uploaded image for project: 'Apache Drill'
  1. Apache Drill
  2. DRILL-7491

Incorrect count() returned for complex types in parquet

Details

    Description

      To reproduce use the attached file for hive_alltypes.parquet (this is parquet file generated by Hive) and try count on columns c13 - c15.  For example, 

      SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet`
      

      Expected result: 3
      Actual result: 0

      Attachments

        1. hive_alltypes.parquet
          2 kB
          Arina Ielchiieva

        Issue Links

          Activity

            johna John Andrunas made changes -
            Issue-Comment-Bulk Replace String [ ~IhorHuzenko ] [ ~ihuzenko ]
            arina Arina Ielchiieva made changes -
            Resolution Fixed [ 1 ]
            Status Reviewable [ 10006 ] Resolved [ 5 ]

            Merged into master with commit id f1b976126e592c67e5de2448573c6172e74ee763.

            arina Arina Ielchiieva added a comment - Merged into master with commit id f1b976126e592c67e5de2448573c6172e74ee763.
            githubbot ASF GitHub Bot added a comment -

            asfgit commented on pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet
            URL: https://github.com/apache/drill/pull/1955

            ----------------------------------------------------------------
            This is an automated message from the Apache Git Service.
            To respond to the message, please log on to GitHub and use the
            URL above to go to the specific comment.

            For queries about this service, please contact Infrastructure at:
            users@infra.apache.org

            githubbot ASF GitHub Bot added a comment - asfgit commented on pull request #1955: DRILL-7491 : Incorrect count() returned for complex types in parquet URL: https://github.com/apache/drill/pull/1955 ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: users@infra.apache.org
            githubbot ASF GitHub Bot made changes -
            Remote Link This issue links to "GitHub Pull Request #1955 (Web Link)" [ 183894 ] This issue links to "GitHub Pull Request #1955 (Web Link)" [ 183894 ]
            githubbot ASF GitHub Bot added a comment -

            ihuzenko commented on pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet
            URL: https://github.com/apache/drill/pull/1955#discussion_r368547053

            ##########
            File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
            ##########
            @@ -167,29 +167,46 @@ public boolean isMatchAllMetadata() {
            */
            @Override
            public long getColumnValueCount(SchemaPath column) {

            • long tableRowCount, colNulls;
            • Long nulls;
              ColumnStatistics<?> columnStats = getTableMetadata().getColumnStatistics(column);
            • ColumnStatistics<?> nonInterestingColStats = null;
            • if (columnStats == null) { - nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column); - }

              + ColumnStatistics<?> nonInterestingColStats = columnStats == null
              + ? getNonInterestingColumnsMetadata().getColumnStatistics(column) : null;

            + long tableRowCount;
            if (columnStats != null)

            { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata()); }

            else if (nonInterestingColStats != null) {
            tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata());

            Review comment:
            Hello @paul-rogers , could you please check the comment on line 182 in this file? It was added after your first review, and you may have accidentally missed it. Also test for given case was added in ```TestParquetComplex.java```.

            ----------------------------------------------------------------
            This is an automated message from the Apache Git Service.
            To respond to the message, please log on to GitHub and use the
            URL above to go to the specific comment.

            For queries about this service, please contact Infrastructure at:
            users@infra.apache.org

            githubbot ASF GitHub Bot added a comment - ihuzenko commented on pull request #1955: DRILL-7491 : Incorrect count() returned for complex types in parquet URL: https://github.com/apache/drill/pull/1955#discussion_r368547053 ########## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java ########## @@ -167,29 +167,46 @@ public boolean isMatchAllMetadata() { */ @Override public long getColumnValueCount(SchemaPath column) { long tableRowCount, colNulls; Long nulls; ColumnStatistics<?> columnStats = getTableMetadata().getColumnStatistics(column); ColumnStatistics<?> nonInterestingColStats = null; if (columnStats == null) { - nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column); - } + ColumnStatistics<?> nonInterestingColStats = columnStats == null + ? getNonInterestingColumnsMetadata().getColumnStatistics(column) : null; + long tableRowCount; if (columnStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata()); } else if (nonInterestingColStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata()); Review comment: Hello @paul-rogers , could you please check the comment on line 182 in this file? It was added after your first review, and you may have accidentally missed it. Also test for given case was added in ```TestParquetComplex.java```. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: users@infra.apache.org
            githubbot ASF GitHub Bot made changes -
            Remote Link This issue links to "GitHub Pull Request #1955 (Web Link)" [ 183894 ] This issue links to "GitHub Pull Request #1955 (Web Link)" [ 183894 ]
            githubbot ASF GitHub Bot added a comment -

            paul-rogers commented on pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet
            URL: https://github.com/apache/drill/pull/1955#discussion_r368323139

            ##########
            File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
            ##########
            @@ -167,29 +167,46 @@ public boolean isMatchAllMetadata() {
            */
            @Override
            public long getColumnValueCount(SchemaPath column) {

            • long tableRowCount, colNulls;
            • Long nulls;
              ColumnStatistics<?> columnStats = getTableMetadata().getColumnStatistics(column);
            • ColumnStatistics<?> nonInterestingColStats = null;
            • if (columnStats == null) { - nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column); - }

              + ColumnStatistics<?> nonInterestingColStats = columnStats == null
              + ? getNonInterestingColumnsMetadata().getColumnStatistics(column) : null;

            + long tableRowCount;
            if (columnStats != null)

            { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata()); }

            else if (nonInterestingColStats != null) {
            tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata());

            Review comment:
            HI @ihuzenko, thanks for the explanation. Very clear. In fact, I'd even suggest copying the details into the code somewhere to give future readers additional context. And, do we have a unit test that demonstrates this case? This seems like the kind of obscure issue that someone (such as me) could easily break at some point because the behavior is not obvious..

            ----------------------------------------------------------------
            This is an automated message from the Apache Git Service.
            To respond to the message, please log on to GitHub and use the
            URL above to go to the specific comment.

            For queries about this service, please contact Infrastructure at:
            users@infra.apache.org

            githubbot ASF GitHub Bot added a comment - paul-rogers commented on pull request #1955: DRILL-7491 : Incorrect count() returned for complex types in parquet URL: https://github.com/apache/drill/pull/1955#discussion_r368323139 ########## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java ########## @@ -167,29 +167,46 @@ public boolean isMatchAllMetadata() { */ @Override public long getColumnValueCount(SchemaPath column) { long tableRowCount, colNulls; Long nulls; ColumnStatistics<?> columnStats = getTableMetadata().getColumnStatistics(column); ColumnStatistics<?> nonInterestingColStats = null; if (columnStats == null) { - nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column); - } + ColumnStatistics<?> nonInterestingColStats = columnStats == null + ? getNonInterestingColumnsMetadata().getColumnStatistics(column) : null; + long tableRowCount; if (columnStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata()); } else if (nonInterestingColStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata()); Review comment: HI @ihuzenko, thanks for the explanation. Very clear. In fact, I'd even suggest copying the details into the code somewhere to give future readers additional context. And, do we have a unit test that demonstrates this case? This seems like the kind of obscure issue that someone (such as me) could easily break at some point because the behavior is not obvious.. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: users@infra.apache.org
            githubbot ASF GitHub Bot added a comment -

            paul-rogers commented on pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet
            URL: https://github.com/apache/drill/pull/1955#discussion_r368321457

            ##########
            File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
            ##########
            @@ -167,29 +167,46 @@ public boolean isMatchAllMetadata() {
            */
            @Override
            public long getColumnValueCount(SchemaPath column) {

            • long tableRowCount, colNulls;
            • Long nulls;
              ColumnStatistics<?> columnStats = getTableMetadata().getColumnStatistics(column);
            • ColumnStatistics<?> nonInterestingColStats = null;
            • if (columnStats == null) { - nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column); - }

              + ColumnStatistics<?> nonInterestingColStats = columnStats == null
              + ? getNonInterestingColumnsMetadata().getColumnStatistics(column) : null;

            + long tableRowCount;
            if (columnStats != null)

            { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata()); }

            else if (nonInterestingColStats != null) {
            tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata());

            Review comment:
            HI @ihuzenko, thanks for the explanation. Very clear. In fact, I'd even suggest copying the details into the code somewhere to give future readers additional context. And, do we have a unit test that demonstrates this case? This seems like the kind of obscure issue that someone (such as me) could easily break at some point because the behavior is not obvious..

            ----------------------------------------------------------------
            This is an automated message from the Apache Git Service.
            To respond to the message, please log on to GitHub and use the
            URL above to go to the specific comment.

            For queries about this service, please contact Infrastructure at:
            users@infra.apache.org

            githubbot ASF GitHub Bot added a comment - paul-rogers commented on pull request #1955: DRILL-7491 : Incorrect count() returned for complex types in parquet URL: https://github.com/apache/drill/pull/1955#discussion_r368321457 ########## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java ########## @@ -167,29 +167,46 @@ public boolean isMatchAllMetadata() { */ @Override public long getColumnValueCount(SchemaPath column) { long tableRowCount, colNulls; Long nulls; ColumnStatistics<?> columnStats = getTableMetadata().getColumnStatistics(column); ColumnStatistics<?> nonInterestingColStats = null; if (columnStats == null) { - nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column); - } + ColumnStatistics<?> nonInterestingColStats = columnStats == null + ? getNonInterestingColumnsMetadata().getColumnStatistics(column) : null; + long tableRowCount; if (columnStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata()); } else if (nonInterestingColStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata()); Review comment: HI @ihuzenko, thanks for the explanation. Very clear. In fact, I'd even suggest copying the details into the code somewhere to give future readers additional context. And, do we have a unit test that demonstrates this case? This seems like the kind of obscure issue that someone (such as me) could easily break at some point because the behavior is not obvious.. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: users@infra.apache.org
            githubbot ASF GitHub Bot made changes -
            Remote Link This issue links to "GitHub Pull Request #1955 (Web Link)" [ 183894 ] This issue links to "GitHub Pull Request #1955 (Web Link)" [ 183894 ]
            githubbot ASF GitHub Bot made changes -
            Remote Link This issue links to "GitHub Pull Request #1955 (Web Link)" [ 183894 ] This issue links to "GitHub Pull Request #1955 (Web Link)" [ 183894 ]
            githubbot ASF GitHub Bot added a comment -

            arina-ielchiieva commented on issue #1955: DRILL-7491: Incorrect count() returned for complex types in parquet
            URL: https://github.com/apache/drill/pull/1955#issuecomment-575580335

            @paul-rogers Igor had addressed code review comments and left one comment explaining the background of the issue (https://github.com/apache/drill/pull/1955#discussion_r366812967). Could you take a look one more time? Thanks.

            ----------------------------------------------------------------
            This is an automated message from the Apache Git Service.
            To respond to the message, please log on to GitHub and use the
            URL above to go to the specific comment.

            For queries about this service, please contact Infrastructure at:
            users@infra.apache.org

            githubbot ASF GitHub Bot added a comment - arina-ielchiieva commented on issue #1955: DRILL-7491 : Incorrect count() returned for complex types in parquet URL: https://github.com/apache/drill/pull/1955#issuecomment-575580335 @paul-rogers Igor had addressed code review comments and left one comment explaining the background of the issue ( https://github.com/apache/drill/pull/1955#discussion_r366812967 ). Could you take a look one more time? Thanks. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: users@infra.apache.org
            githubbot ASF GitHub Bot made changes -
            Remote Link This issue links to "GitHub Pull Request #1955 (Web Link)" [ 183894 ] This issue links to "GitHub Pull Request #1955 (Web Link)" [ 183894 ]
            githubbot ASF GitHub Bot added a comment -

            ihuzenko commented on pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet
            URL: https://github.com/apache/drill/pull/1955#discussion_r366812967

            ##########
            File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
            ##########
            @@ -167,29 +167,46 @@ public boolean isMatchAllMetadata() {
            */
            @Override
            public long getColumnValueCount(SchemaPath column) {

            • long tableRowCount, colNulls;
            • Long nulls;
              ColumnStatistics<?> columnStats = getTableMetadata().getColumnStatistics(column);
            • ColumnStatistics<?> nonInterestingColStats = null;
            • if (columnStats == null) { - nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column); - }

              + ColumnStatistics<?> nonInterestingColStats = columnStats == null
              + ? getNonInterestingColumnsMetadata().getColumnStatistics(column) : null;

            + long tableRowCount;
            if (columnStats != null)

            { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata()); }

            else if (nonInterestingColStats != null) {
            tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata());

            Review comment:
            Hello @paul-rogers ,

            As says Javadoc for this method, here we're looking for a count of non-null row values of the specific column. There is a special physical rule ```ConvertCountToDirectScanPrule``` purpose which is to reduce IO when we're looking for ```count(col)``` values, since the value may be present in the metadata read while planning. The problem which I'm trying to fix ( DRILL-7491(https://issues.apache.org/jira/browse/DRILL-7491)) appeared in the method because the method for existing complex column returned *0* as if the column doesn't exist. So physical plan was converted to something like:
            ```
            ScreenPrel
            ProjectPrel(cr=[$0])
            DirectScanPrel(groupscan=[selectionRoot = classpath:/parquet/tmphive/hive_alltypes.parquet, numFiles = 1, usedMetadataSummaryFile = false, usedMetastore = false, DynamicPojoRecordReader

            {records = [[0]]}

            ])
            ```
            and as a result incorrect count was returned.

            In debug mode, I noticed that parquet didn't provide metadata for top-level complex type column, but some metadata exists for nested fields, so I decided to check the existence of metadata to determine whether such complex column exists in parquet file and that scan is necessary to calculate correct row count value.

            Regarding your comments about the improvement of a metadata structure for complex fields, I think here there is no some special magic in Drill. I suppose that we're just reading & converting metadata from parquet into our model classes. If you have some good ideas on how we can improve our meta, could you please start a discussion at our mailing list to hear ideas of the community? Since the topic is out of scope for this pull request.

            Kind regards,
            Igor

            ----------------------------------------------------------------
            This is an automated message from the Apache Git Service.
            To respond to the message, please log on to GitHub and use the
            URL above to go to the specific comment.

            For queries about this service, please contact Infrastructure at:
            users@infra.apache.org

            githubbot ASF GitHub Bot added a comment - ihuzenko commented on pull request #1955: DRILL-7491 : Incorrect count() returned for complex types in parquet URL: https://github.com/apache/drill/pull/1955#discussion_r366812967 ########## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java ########## @@ -167,29 +167,46 @@ public boolean isMatchAllMetadata() { */ @Override public long getColumnValueCount(SchemaPath column) { long tableRowCount, colNulls; Long nulls; ColumnStatistics<?> columnStats = getTableMetadata().getColumnStatistics(column); ColumnStatistics<?> nonInterestingColStats = null; if (columnStats == null) { - nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column); - } + ColumnStatistics<?> nonInterestingColStats = columnStats == null + ? getNonInterestingColumnsMetadata().getColumnStatistics(column) : null; + long tableRowCount; if (columnStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata()); } else if (nonInterestingColStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata()); Review comment: Hello @paul-rogers , As says Javadoc for this method, here we're looking for a count of non-null row values of the specific column. There is a special physical rule ```ConvertCountToDirectScanPrule``` purpose which is to reduce IO when we're looking for ```count(col)``` values, since the value may be present in the metadata read while planning. The problem which I'm trying to fix ( DRILL-7491 ( https://issues.apache.org/jira/browse/DRILL-7491 )) appeared in the method because the method for existing complex column returned * 0 * as if the column doesn't exist. So physical plan was converted to something like: ``` ScreenPrel ProjectPrel(cr= [$0] ) DirectScanPrel(groupscan=[selectionRoot = classpath:/parquet/tmphive/hive_alltypes.parquet, numFiles = 1, usedMetadataSummaryFile = false, usedMetastore = false, DynamicPojoRecordReader {records = [[0]]} ]) ``` and as a result incorrect count was returned. In debug mode, I noticed that parquet didn't provide metadata for top-level complex type column, but some metadata exists for nested fields, so I decided to check the existence of metadata to determine whether such complex column exists in parquet file and that scan is necessary to calculate correct row count value. Regarding your comments about the improvement of a metadata structure for complex fields, I think here there is no some special magic in Drill. I suppose that we're just reading & converting metadata from parquet into our model classes. If you have some good ideas on how we can improve our meta, could you please start a discussion at our mailing list to hear ideas of the community? Since the topic is out of scope for this pull request. Kind regards, Igor ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: users@infra.apache.org
            githubbot ASF GitHub Bot made changes -
            Remote Link This issue links to "GitHub Pull Request #1955 (Web Link)" [ 183894 ] This issue links to "GitHub Pull Request #1955 (Web Link)" [ 183894 ]
            githubbot ASF GitHub Bot added a comment -

            paul-rogers commented on pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet
            URL: https://github.com/apache/drill/pull/1955#discussion_r366676515

            ##########
            File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
            ##########
            @@ -167,29 +167,43 @@ public boolean isMatchAllMetadata() {
            */
            @Override
            public long getColumnValueCount(SchemaPath column) {

            • long tableRowCount, colNulls;
            • Long nulls;
              ColumnStatistics<?> columnStats = getTableMetadata().getColumnStatistics(column);
            • ColumnStatistics<?> nonInterestingColStats = null;
            • if (columnStats == null) { - nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column); - }

              + ColumnStatistics<?> nonInterestingColStats = (columnStats == null)
              + ? getNonInterestingColumnsMetadata().getColumnStatistics(column) : null;

            + long tableRowCount;
            if (columnStats != null)

            { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata()); }

            else if (nonInterestingColStats != null)

            { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata()); + columnStats = nonInterestingColStats; + }

            else if (existsNestedStatsForColumn(column, getTableMetadata())
            + || existsNestedStatsForColumn(column, getNonInterestingColumnsMetadata()))

            { + return Statistic.NO_COLUMN_STATS; }

            else

            { return 0; // returns 0 if the column doesn't exist in the table. }
            • columnStats = columnStats != null ? columnStats : nonInterestingColStats;
            • nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats);
            • colNulls = nulls != null ? nulls : Statistic.NO_COLUMN_STATS;
              + Long nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats);
              + if (nulls == null || Statistic.NO_COLUMN_STATS == nulls || Statistic.NO_COLUMN_STATS == tableRowCount) { + return Statistic.NO_COLUMN_STATS; + }

              else

              { + return tableRowCount - nulls; + }

              + }

            • return Statistic.NO_COLUMN_STATS == tableRowCount
            • Statistic.NO_COLUMN_STATS == colNulls
            • ? Statistic.NO_COLUMN_STATS : tableRowCount - colNulls;
              + /**
              + * For complex columns, stats may be present only for nested fields. For example, a column path is `a`,
              + * but stats present for `a`.`b`. So before making a decision that column is absent, the case needs
              + * to be tested.
              + *
              + * @param column column path
              + * @param metadata metadata with column statistics
              + * @return whether stats exists for nested fields
              + */
              + private boolean existsNestedStatsForColumn(SchemaPath column, Metadata metadata) {

            Review comment:
            I wonder about the premise of this function. The Java doc suggests we have the path `a.b`. We want to know if `a` is an uninteresting column. Are we suggesting that we have metadata that says that `a.b` is uninteresting, but we have no metadata for `a` itself?

            Why would this be? Do we store columns in a non-structured way? That is, do we have a list of columns like 'a', 'b', 'c.d', 'e.f.g', 'c.h', and so on? Rather than `(a, b, c(d, g), e((g)))`?

            Further we seem to be assuming that we will gather stats for, say, `a.c`, but not `a.b`.

            The only place I can see that such complexity would make sense is to estimate cardinality for Flatten or Lateral Join. In such a case, might it make more sense to treat this recursively as a nested table? That is 'a` is a table and `a.b` and `a.c` are just `b` and `c` columns in the nested table.

            Given this, we must have a metadata entry for `a` (the map) in order to have metadata for any of its nested columns.

            So, again, I wonder if we've defined our metadata semantics as clearly as we might.

            ----------------------------------------------------------------
            This is an automated message from the Apache Git Service.
            To respond to the message, please log on to GitHub and use the
            URL above to go to the specific comment.

            For queries about this service, please contact Infrastructure at:
            users@infra.apache.org

            githubbot ASF GitHub Bot added a comment - paul-rogers commented on pull request #1955: DRILL-7491 : Incorrect count() returned for complex types in parquet URL: https://github.com/apache/drill/pull/1955#discussion_r366676515 ########## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java ########## @@ -167,29 +167,43 @@ public boolean isMatchAllMetadata() { */ @Override public long getColumnValueCount(SchemaPath column) { long tableRowCount, colNulls; Long nulls; ColumnStatistics<?> columnStats = getTableMetadata().getColumnStatistics(column); ColumnStatistics<?> nonInterestingColStats = null; if (columnStats == null) { - nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column); - } + ColumnStatistics<?> nonInterestingColStats = (columnStats == null) + ? getNonInterestingColumnsMetadata().getColumnStatistics(column) : null; + long tableRowCount; if (columnStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata()); } else if (nonInterestingColStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata()); + columnStats = nonInterestingColStats; + } else if (existsNestedStatsForColumn(column, getTableMetadata()) + || existsNestedStatsForColumn(column, getNonInterestingColumnsMetadata())) { + return Statistic.NO_COLUMN_STATS; } else { return 0; // returns 0 if the column doesn't exist in the table. } columnStats = columnStats != null ? columnStats : nonInterestingColStats; nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats); colNulls = nulls != null ? nulls : Statistic.NO_COLUMN_STATS; + Long nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats); + if (nulls == null || Statistic.NO_COLUMN_STATS == nulls || Statistic.NO_COLUMN_STATS == tableRowCount) { + return Statistic.NO_COLUMN_STATS; + } else { + return tableRowCount - nulls; + } + } return Statistic.NO_COLUMN_STATS == tableRowCount Statistic.NO_COLUMN_STATS == colNulls ? Statistic.NO_COLUMN_STATS : tableRowCount - colNulls; + /** + * For complex columns, stats may be present only for nested fields. For example, a column path is `a`, + * but stats present for `a`.`b`. So before making a decision that column is absent, the case needs + * to be tested. + * + * @param column column path + * @param metadata metadata with column statistics + * @return whether stats exists for nested fields + */ + private boolean existsNestedStatsForColumn(SchemaPath column, Metadata metadata) { Review comment: I wonder about the premise of this function. The Java doc suggests we have the path `a.b`. We want to know if `a` is an uninteresting column. Are we suggesting that we have metadata that says that `a.b` is uninteresting, but we have no metadata for `a` itself? Why would this be? Do we store columns in a non-structured way? That is, do we have a list of columns like 'a', 'b', 'c.d', 'e.f.g', 'c.h', and so on? Rather than `(a, b, c(d, g), e((g)))`? Further we seem to be assuming that we will gather stats for, say, `a.c`, but not `a.b`. The only place I can see that such complexity would make sense is to estimate cardinality for Flatten or Lateral Join. In such a case, might it make more sense to treat this recursively as a nested table? That is 'a` is a table and `a.b` and `a.c` are just `b` and `c` columns in the nested table. Given this, we must have a metadata entry for `a` (the map) in order to have metadata for any of its nested columns. So, again, I wonder if we've defined our metadata semantics as clearly as we might. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: users@infra.apache.org
            githubbot ASF GitHub Bot added a comment -

            paul-rogers commented on pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet
            URL: https://github.com/apache/drill/pull/1955#discussion_r366674972

            ##########
            File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
            ##########
            @@ -167,29 +167,46 @@ public boolean isMatchAllMetadata() {
            */
            @Override
            public long getColumnValueCount(SchemaPath column) {

            • long tableRowCount, colNulls;
            • Long nulls;
              ColumnStatistics<?> columnStats = getTableMetadata().getColumnStatistics(column);
            • ColumnStatistics<?> nonInterestingColStats = null;
            • if (columnStats == null) { - nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column); - }

              + ColumnStatistics<?> nonInterestingColStats = columnStats == null
              + ? getNonInterestingColumnsMetadata().getColumnStatistics(column) : null;

            + long tableRowCount;
            if (columnStats != null)

            { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata()); }

            else if (nonInterestingColStats != null)

            { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata()); + columnStats = nonInterestingColStats; + }

            else if (existsNestedStatsForColumn(column, getTableMetadata())
            + || existsNestedStatsForColumn(column, getNonInterestingColumnsMetadata()))

            { + // When statistics for nested field exists, this is complex column which is present in table. + // But its nested fields statistics can't be used to extract tableRowCount for this column. + // So NO_COLUMN_STATS returned here to avoid problems described in DRILL-7491. + return Statistic.NO_COLUMN_STATS; }

            else

            { return 0; // returns 0 if the column doesn't exist in the table. }
            • columnStats = columnStats != null ? columnStats : nonInterestingColStats;
            • nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats);
            • colNulls = nulls != null ? nulls : Statistic.NO_COLUMN_STATS;
              + Long nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats);
              + if (nulls == null || Statistic.NO_COLUMN_STATS == nulls || Statistic.NO_COLUMN_STATS == tableRowCount) { + return Statistic.NO_COLUMN_STATS; + }

              else {
              + return tableRowCount - nulls;

            Review comment:
            The above is true, but non-interesting. The number of nulls is useful when estimating selectivity of things like `IS NULL` or `IS NOT NULL`. If we are interested in scans, then the null count is not helpful.

            Given the math here, is the meaning of this function, "return the number of rows with non-null values for this column"? If so, that is somewhat a backward request. More typical is to return the null count directly, or the percentage of nulls.

            A reason that this value is not as helpful as it could be is if we are using stats to estimate something like `WHERE x = 10 AND y IS NOT NULL`. First, we estimate table row count, which might have been adjusted via partitioning. Then we estimate the selectivity reduction due to `x = 10`. Then we estimate the further reduction due to `y IS NOT NULL`. We cannot do this math if all we have is a number n that tell us the number of non-null rows in the table; we'd have to work out the selectivity number ourselves from the table row count.

            ----------------------------------------------------------------
            This is an automated message from the Apache Git Service.
            To respond to the message, please log on to GitHub and use the
            URL above to go to the specific comment.

            For queries about this service, please contact Infrastructure at:
            users@infra.apache.org

            githubbot ASF GitHub Bot added a comment - paul-rogers commented on pull request #1955: DRILL-7491 : Incorrect count() returned for complex types in parquet URL: https://github.com/apache/drill/pull/1955#discussion_r366674972 ########## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java ########## @@ -167,29 +167,46 @@ public boolean isMatchAllMetadata() { */ @Override public long getColumnValueCount(SchemaPath column) { long tableRowCount, colNulls; Long nulls; ColumnStatistics<?> columnStats = getTableMetadata().getColumnStatistics(column); ColumnStatistics<?> nonInterestingColStats = null; if (columnStats == null) { - nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column); - } + ColumnStatistics<?> nonInterestingColStats = columnStats == null + ? getNonInterestingColumnsMetadata().getColumnStatistics(column) : null; + long tableRowCount; if (columnStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata()); } else if (nonInterestingColStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata()); + columnStats = nonInterestingColStats; + } else if (existsNestedStatsForColumn(column, getTableMetadata()) + || existsNestedStatsForColumn(column, getNonInterestingColumnsMetadata())) { + // When statistics for nested field exists, this is complex column which is present in table. + // But its nested fields statistics can't be used to extract tableRowCount for this column. + // So NO_COLUMN_STATS returned here to avoid problems described in DRILL-7491. + return Statistic.NO_COLUMN_STATS; } else { return 0; // returns 0 if the column doesn't exist in the table. } columnStats = columnStats != null ? columnStats : nonInterestingColStats; nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats); colNulls = nulls != null ? nulls : Statistic.NO_COLUMN_STATS; + Long nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats); + if (nulls == null || Statistic.NO_COLUMN_STATS == nulls || Statistic.NO_COLUMN_STATS == tableRowCount) { + return Statistic.NO_COLUMN_STATS; + } else { + return tableRowCount - nulls; Review comment: The above is true, but non-interesting. The number of nulls is useful when estimating selectivity of things like `IS NULL` or `IS NOT NULL`. If we are interested in scans, then the null count is not helpful. Given the math here, is the meaning of this function, "return the number of rows with non-null values for this column"? If so, that is somewhat a backward request. More typical is to return the null count directly, or the percentage of nulls. A reason that this value is not as helpful as it could be is if we are using stats to estimate something like `WHERE x = 10 AND y IS NOT NULL`. First, we estimate table row count, which might have been adjusted via partitioning. Then we estimate the selectivity reduction due to `x = 10`. Then we estimate the further reduction due to `y IS NOT NULL`. We cannot do this math if all we have is a number n that tell us the number of non-null rows in the table; we'd have to work out the selectivity number ourselves from the table row count. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: users@infra.apache.org
            githubbot ASF GitHub Bot added a comment -

            paul-rogers commented on pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet
            URL: https://github.com/apache/drill/pull/1955#discussion_r366671949

            ##########
            File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
            ##########
            @@ -167,29 +167,46 @@ public boolean isMatchAllMetadata() {
            */
            @Override
            public long getColumnValueCount(SchemaPath column) {

            • long tableRowCount, colNulls;
            • Long nulls;
              ColumnStatistics<?> columnStats = getTableMetadata().getColumnStatistics(column);
            • ColumnStatistics<?> nonInterestingColStats = null;
            • if (columnStats == null) { - nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column); - }

              + ColumnStatistics<?> nonInterestingColStats = columnStats == null
              + ? getNonInterestingColumnsMetadata().getColumnStatistics(column) : null;

            + long tableRowCount;
            if (columnStats != null)

            { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata()); }

            else if (nonInterestingColStats != null) {
            tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata());

            Review comment:
            Having a hard time understanding this. If a column is uninteresing, we get the row count from the non-interesting columns metadata. Seems round-about. Should we get the row count from the table itself? That is, the indirection through non-interesting columns to get table metadata seems awkward.

            Also, if we do have column stats, we get the row count from the table metadata. This raises the question: by column value count, do we mean NDV (number of distinct values)? Otherwise, the column value count for top-level columns is defined as the same as the row count whether the column is interesting or not.

            ----------------------------------------------------------------
            This is an automated message from the Apache Git Service.
            To respond to the message, please log on to GitHub and use the
            URL above to go to the specific comment.

            For queries about this service, please contact Infrastructure at:
            users@infra.apache.org

            githubbot ASF GitHub Bot added a comment - paul-rogers commented on pull request #1955: DRILL-7491 : Incorrect count() returned for complex types in parquet URL: https://github.com/apache/drill/pull/1955#discussion_r366671949 ########## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java ########## @@ -167,29 +167,46 @@ public boolean isMatchAllMetadata() { */ @Override public long getColumnValueCount(SchemaPath column) { long tableRowCount, colNulls; Long nulls; ColumnStatistics<?> columnStats = getTableMetadata().getColumnStatistics(column); ColumnStatistics<?> nonInterestingColStats = null; if (columnStats == null) { - nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column); - } + ColumnStatistics<?> nonInterestingColStats = columnStats == null + ? getNonInterestingColumnsMetadata().getColumnStatistics(column) : null; + long tableRowCount; if (columnStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata()); } else if (nonInterestingColStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata()); Review comment: Having a hard time understanding this. If a column is uninteresing, we get the row count from the non-interesting columns metadata. Seems round-about. Should we get the row count from the table itself? That is, the indirection through non-interesting columns to get table metadata seems awkward. Also, if we do have column stats, we get the row count from the table metadata. This raises the question: by column value count, do we mean NDV (number of distinct values)? Otherwise, the column value count for top-level columns is defined as the same as the row count whether the column is interesting or not. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: users@infra.apache.org
            githubbot ASF GitHub Bot added a comment -

            paul-rogers commented on pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet
            URL: https://github.com/apache/drill/pull/1955#discussion_r366673114

            ##########
            File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
            ##########
            @@ -180,7 +180,7 @@ public long getColumnValueCount(SchemaPath column) {
            } else if (nonInterestingColStats != null)

            { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata()); }

            else {

            • return 0; // returns 0 if the column doesn't exist in the table.
              + return Statistic.NO_COLUMN_STATS;

            Review comment:
            I'm more confused. If this is a structured (complex) column, then it can have nested columns. The nested columns don't add information about this column. (Knowing the number of values in an array of maps does not tell us the cardinality of the map.) Again, if the Map is at the top level, then the value count is row count. If this stat is NDV, then we don't know the NDV if we don't have metadata. I'd even argue that NDV makes no sense for a complex column; it only makes sense for the members of the column.

            Now, back to Arina's point. The info here tells us something about scans. If I ask only for column `x`, and the table does not contain column `x`, then I don't even need to scan at all, I can just return n copies of NULL. (Most query engines would fail the query because the column is undefined. Drill will run the query and return nulls.) However, in practice, the only way to know the correct value of n is to do the scan (stats can be out of date.)

            Still, I don't get why we need column value counts. If we do a scan, we want the table row count, we don't care about the column value count.

            So, I wonder if there is some additional problem here where our use of stats needs some adjusting.

            If we want to estimate the row count after filtering (that is, the row count seen by, say, a join or sort), then we need the NDV which we can estimate only if we have stats, otherwise we should fall back on heuristic selectivity values.

            ----------------------------------------------------------------
            This is an automated message from the Apache Git Service.
            To respond to the message, please log on to GitHub and use the
            URL above to go to the specific comment.

            For queries about this service, please contact Infrastructure at:
            users@infra.apache.org

            githubbot ASF GitHub Bot added a comment - paul-rogers commented on pull request #1955: DRILL-7491 : Incorrect count() returned for complex types in parquet URL: https://github.com/apache/drill/pull/1955#discussion_r366673114 ########## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java ########## @@ -180,7 +180,7 @@ public long getColumnValueCount(SchemaPath column) { } else if (nonInterestingColStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata()); } else { return 0; // returns 0 if the column doesn't exist in the table. + return Statistic.NO_COLUMN_STATS; Review comment: I'm more confused. If this is a structured (complex) column, then it can have nested columns. The nested columns don't add information about this column. (Knowing the number of values in an array of maps does not tell us the cardinality of the map.) Again, if the Map is at the top level, then the value count is row count. If this stat is NDV, then we don't know the NDV if we don't have metadata. I'd even argue that NDV makes no sense for a complex column; it only makes sense for the members of the column. Now, back to Arina's point. The info here tells us something about scans. If I ask only for column `x`, and the table does not contain column `x`, then I don't even need to scan at all, I can just return n copies of NULL. (Most query engines would fail the query because the column is undefined. Drill will run the query and return nulls.) However, in practice, the only way to know the correct value of n is to do the scan (stats can be out of date.) Still, I don't get why we need column value counts. If we do a scan, we want the table row count, we don't care about the column value count. So, I wonder if there is some additional problem here where our use of stats needs some adjusting. If we want to estimate the row count after filtering (that is, the row count seen by, say, a join or sort), then we need the NDV which we can estimate only if we have stats, otherwise we should fall back on heuristic selectivity values. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: users@infra.apache.org
            githubbot ASF GitHub Bot added a comment -

            paul-rogers commented on pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet
            URL: https://github.com/apache/drill/pull/1955#discussion_r366677192

            ##########
            File path: logical/src/main/java/org/apache/drill/common/expression/SchemaPath.java
            ##########
            @@ -334,36 +335,13 @@ public int hashCode() {

            @Override
            public boolean equals(Object obj) {

            • if (this == obj) { - return true; - }
            • if (obj == null) { - return false; - }
              - if (!(obj instanceof SchemaPath)) {- return false;- }

              -

            • SchemaPath other = (SchemaPath) obj;
            • if (rootSegment == null) { - return (other.rootSegment == null); - }
            • return rootSegment.equals(other.rootSegment);
              + return this == obj || obj instanceof SchemaPath
              + && Objects.equals(rootSegment, ((SchemaPath) obj).rootSegment);
              }

            public boolean contains(Object obj) {

            Review comment:
            `contains()` is not a Java-defined method. Must the signature be `Object`? Or, can it be `SchemaPath` since that is the only kind of thing a `SchemaPath` could ever contain? Would save some casting.

            ----------------------------------------------------------------
            This is an automated message from the Apache Git Service.
            To respond to the message, please log on to GitHub and use the
            URL above to go to the specific comment.

            For queries about this service, please contact Infrastructure at:
            users@infra.apache.org

            githubbot ASF GitHub Bot added a comment - paul-rogers commented on pull request #1955: DRILL-7491 : Incorrect count() returned for complex types in parquet URL: https://github.com/apache/drill/pull/1955#discussion_r366677192 ########## File path: logical/src/main/java/org/apache/drill/common/expression/SchemaPath.java ########## @@ -334,36 +335,13 @@ public int hashCode() { @Override public boolean equals(Object obj) { if (this == obj) { - return true; - } if (obj == null) { - return false; - } - if (!(obj instanceof SchemaPath)) {- return false;- } - SchemaPath other = (SchemaPath) obj; if (rootSegment == null) { - return (other.rootSegment == null); - } return rootSegment.equals(other.rootSegment); + return this == obj || obj instanceof SchemaPath + && Objects.equals(rootSegment, ((SchemaPath) obj).rootSegment); } public boolean contains(Object obj) { Review comment: `contains()` is not a Java-defined method. Must the signature be `Object`? Or, can it be `SchemaPath` since that is the only kind of thing a `SchemaPath` could ever contain? Would save some casting. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: users@infra.apache.org
            githubbot ASF GitHub Bot added a comment -

            paul-rogers commented on pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet
            URL: https://github.com/apache/drill/pull/1955#discussion_r366669212

            ##########
            File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
            ##########
            @@ -167,29 +167,46 @@ public boolean isMatchAllMetadata() {
            */
            @Override
            public long getColumnValueCount(SchemaPath column) {

            • long tableRowCount, colNulls;
            • Long nulls;
              ColumnStatistics<?> columnStats = getTableMetadata().getColumnStatistics(column);
            • ColumnStatistics<?> nonInterestingColStats = null;
            • if (columnStats == null) { - nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column); - }

              + ColumnStatistics<?> nonInterestingColStats = columnStats == null
              + ? getNonInterestingColumnsMetadata().getColumnStatistics(column) : null;

            + long tableRowCount;
            if (columnStats != null)

            { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata()); }

            else if (nonInterestingColStats != null)

            { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata()); + columnStats = nonInterestingColStats; + }

            else if (existsNestedStatsForColumn(column, getTableMetadata())
            + || existsNestedStatsForColumn(column, getNonInterestingColumnsMetadata())) {

            Review comment:
            Nit: `hasNestedStatsForColumn`, or `isCompoundColumnWithStats`.

            ----------------------------------------------------------------
            This is an automated message from the Apache Git Service.
            To respond to the message, please log on to GitHub and use the
            URL above to go to the specific comment.

            For queries about this service, please contact Infrastructure at:
            users@infra.apache.org

            githubbot ASF GitHub Bot added a comment - paul-rogers commented on pull request #1955: DRILL-7491 : Incorrect count() returned for complex types in parquet URL: https://github.com/apache/drill/pull/1955#discussion_r366669212 ########## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java ########## @@ -167,29 +167,46 @@ public boolean isMatchAllMetadata() { */ @Override public long getColumnValueCount(SchemaPath column) { long tableRowCount, colNulls; Long nulls; ColumnStatistics<?> columnStats = getTableMetadata().getColumnStatistics(column); ColumnStatistics<?> nonInterestingColStats = null; if (columnStats == null) { - nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column); - } + ColumnStatistics<?> nonInterestingColStats = columnStats == null + ? getNonInterestingColumnsMetadata().getColumnStatistics(column) : null; + long tableRowCount; if (columnStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata()); } else if (nonInterestingColStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata()); + columnStats = nonInterestingColStats; + } else if (existsNestedStatsForColumn(column, getTableMetadata()) + || existsNestedStatsForColumn(column, getNonInterestingColumnsMetadata())) { Review comment: Nit: `hasNestedStatsForColumn`, or `isCompoundColumnWithStats`. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: users@infra.apache.org
            githubbot ASF GitHub Bot made changes -
            Remote Link This issue links to "GitHub Pull Request #1955 (Web Link)" [ 183894 ] This issue links to "GitHub Pull Request #1955 (Web Link)" [ 183894 ]
            githubbot ASF GitHub Bot made changes -
            Remote Link This issue links to "GitHub Pull Request #1955 (Web Link)" [ 183894 ] This issue links to "GitHub Pull Request #1955 (Web Link)" [ 183894 ]
            githubbot ASF GitHub Bot made changes -
            Remote Link This issue links to "GitHub Pull Request #1955 (Web Link)" [ 183894 ] This issue links to "GitHub Pull Request #1955 (Web Link)" [ 183894 ]
            githubbot ASF GitHub Bot made changes -
            Remote Link This issue links to "GitHub Pull Request #1955 (Web Link)" [ 183894 ] This issue links to "GitHub Pull Request #1955 (Web Link)" [ 183894 ]
            githubbot ASF GitHub Bot made changes -
            Remote Link This issue links to "GitHub Pull Request #1955 (Web Link)" [ 183894 ] This issue links to "GitHub Pull Request #1955 (Web Link)" [ 183894 ]
            githubbot ASF GitHub Bot made changes -
            Remote Link This issue links to "GitHub Pull Request #1955 (Web Link)" [ 183894 ] This issue links to "GitHub Pull Request #1955 (Web Link)" [ 183894 ]
            arina Arina Ielchiieva made changes -
            Labels ready-to-commit
            githubbot ASF GitHub Bot added a comment -

            arina-ielchiieva commented on issue #1955: DRILL-7491: Incorrect count() returned for complex types in parquet
            URL: https://github.com/apache/drill/pull/1955#issuecomment-574200097

            +1, LGTM.

            ----------------------------------------------------------------
            This is an automated message from the Apache Git Service.
            To respond to the message, please log on to GitHub and use the
            URL above to go to the specific comment.

            For queries about this service, please contact Infrastructure at:
            users@infra.apache.org

            githubbot ASF GitHub Bot added a comment - arina-ielchiieva commented on issue #1955: DRILL-7491 : Incorrect count() returned for complex types in parquet URL: https://github.com/apache/drill/pull/1955#issuecomment-574200097 +1, LGTM. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: users@infra.apache.org
            githubbot ASF GitHub Bot made changes -
            Remote Link This issue links to "GitHub Pull Request #1955 (Web Link)" [ 183894 ] This issue links to "GitHub Pull Request #1955 (Web Link)" [ 183894 ]
            githubbot ASF GitHub Bot added a comment -

            KazydubB commented on pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet
            URL: https://github.com/apache/drill/pull/1955#discussion_r366339423

            ##########
            File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
            ##########
            @@ -167,29 +167,43 @@ public boolean isMatchAllMetadata() {
            */
            @Override
            public long getColumnValueCount(SchemaPath column) {

            • long tableRowCount, colNulls;
            • Long nulls;
              ColumnStatistics<?> columnStats = getTableMetadata().getColumnStatistics(column);
            • ColumnStatistics<?> nonInterestingColStats = null;
            • if (columnStats == null) { - nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column); - }

              + ColumnStatistics<?> nonInterestingColStats = (columnStats == null)
              + ? getNonInterestingColumnsMetadata().getColumnStatistics(column) : null;

            + long tableRowCount;
            if (columnStats != null)

            { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata()); }

            else if (nonInterestingColStats != null)

            { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata()); + columnStats = nonInterestingColStats; + }

            else if (existsNestedStatsForColumn(column, getTableMetadata())
            + || existsNestedStatsForColumn(column, getNonInterestingColumnsMetadata()))

            { + return Statistic.NO_COLUMN_STATS; }

            else

            { return 0; // returns 0 if the column doesn't exist in the table. }
            • columnStats = columnStats != null ? columnStats : nonInterestingColStats;
            • nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats);
            • colNulls = nulls != null ? nulls : Statistic.NO_COLUMN_STATS;
              + Long nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats);
              + if (nulls == null || Statistic.NO_COLUMN_STATS == nulls || Statistic.NO_COLUMN_STATS == tableRowCount) { + return Statistic.NO_COLUMN_STATS; + }

              else

              { + return tableRowCount - nulls; + }

              + }

            • return Statistic.NO_COLUMN_STATS == tableRowCount
            • Statistic.NO_COLUMN_STATS == colNulls
            • ? Statistic.NO_COLUMN_STATS : tableRowCount - colNulls;
              + /**
              + * For complex columns, stats may be present only for nested fields. For example, a column path is `a`,
              + * but stats present for `a`.`b`. So before making a decision that column is absent, the case needs
              + * to be tested.
              + *
              + * @param column column path
              + * @param metadata metadata with column statistics
              + * @return whether stats exists for nested fields
              + */
              + private boolean existsNestedStatsForColumn(SchemaPath column, Metadata metadata) {

            Review comment:
            It depends from which perspective to think, but I agree the one proposed by me is confusing. Name still looks clumsy, but with the comments it should be fine. Thanks!

            ----------------------------------------------------------------
            This is an automated message from the Apache Git Service.
            To respond to the message, please log on to GitHub and use the
            URL above to go to the specific comment.

            For queries about this service, please contact Infrastructure at:
            users@infra.apache.org

            githubbot ASF GitHub Bot added a comment - KazydubB commented on pull request #1955: DRILL-7491 : Incorrect count() returned for complex types in parquet URL: https://github.com/apache/drill/pull/1955#discussion_r366339423 ########## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java ########## @@ -167,29 +167,43 @@ public boolean isMatchAllMetadata() { */ @Override public long getColumnValueCount(SchemaPath column) { long tableRowCount, colNulls; Long nulls; ColumnStatistics<?> columnStats = getTableMetadata().getColumnStatistics(column); ColumnStatistics<?> nonInterestingColStats = null; if (columnStats == null) { - nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column); - } + ColumnStatistics<?> nonInterestingColStats = (columnStats == null) + ? getNonInterestingColumnsMetadata().getColumnStatistics(column) : null; + long tableRowCount; if (columnStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata()); } else if (nonInterestingColStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata()); + columnStats = nonInterestingColStats; + } else if (existsNestedStatsForColumn(column, getTableMetadata()) + || existsNestedStatsForColumn(column, getNonInterestingColumnsMetadata())) { + return Statistic.NO_COLUMN_STATS; } else { return 0; // returns 0 if the column doesn't exist in the table. } columnStats = columnStats != null ? columnStats : nonInterestingColStats; nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats); colNulls = nulls != null ? nulls : Statistic.NO_COLUMN_STATS; + Long nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats); + if (nulls == null || Statistic.NO_COLUMN_STATS == nulls || Statistic.NO_COLUMN_STATS == tableRowCount) { + return Statistic.NO_COLUMN_STATS; + } else { + return tableRowCount - nulls; + } + } return Statistic.NO_COLUMN_STATS == tableRowCount Statistic.NO_COLUMN_STATS == colNulls ? Statistic.NO_COLUMN_STATS : tableRowCount - colNulls; + /** + * For complex columns, stats may be present only for nested fields. For example, a column path is `a`, + * but stats present for `a`.`b`. So before making a decision that column is absent, the case needs + * to be tested. + * + * @param column column path + * @param metadata metadata with column statistics + * @return whether stats exists for nested fields + */ + private boolean existsNestedStatsForColumn(SchemaPath column, Metadata metadata) { Review comment: It depends from which perspective to think, but I agree the one proposed by me is confusing. Name still looks clumsy, but with the comments it should be fine. Thanks! ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: users@infra.apache.org
            githubbot ASF GitHub Bot made changes -
            Remote Link This issue links to "GitHub Pull Request #1955 (Web Link)" [ 183894 ] This issue links to "GitHub Pull Request #1955 (Web Link)" [ 183894 ]
            githubbot ASF GitHub Bot added a comment -

            arina-ielchiieva commented on pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet
            URL: https://github.com/apache/drill/pull/1955#discussion_r366328108

            ##########
            File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
            ##########
            @@ -167,29 +167,46 @@ public boolean isMatchAllMetadata() {
            */
            @Override
            public long getColumnValueCount(SchemaPath column) {

            • long tableRowCount, colNulls;
            • Long nulls;
              ColumnStatistics<?> columnStats = getTableMetadata().getColumnStatistics(column);
            • ColumnStatistics<?> nonInterestingColStats = null;
            • if (columnStats == null) { - nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column); - }

              + ColumnStatistics<?> nonInterestingColStats = columnStats == null
              + ? getNonInterestingColumnsMetadata().getColumnStatistics(column) : null;

            + long tableRowCount;
            if (columnStats != null)

            { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata()); }

            else if (nonInterestingColStats != null)

            { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata()); + columnStats = nonInterestingColStats; + }

            else if (existsNestedStatsForColumn(column, getTableMetadata())
            + || existsNestedStatsForColumn(column, getNonInterestingColumnsMetadata())) {
            + // When exists statistic for nested field, this is complex column which is present in table.

            Review comment:
            ```suggestion
            // When statistics for nested field exists, this is complex column which is present in table.
            ```

            ----------------------------------------------------------------
            This is an automated message from the Apache Git Service.
            To respond to the message, please log on to GitHub and use the
            URL above to go to the specific comment.

            For queries about this service, please contact Infrastructure at:
            users@infra.apache.org

            githubbot ASF GitHub Bot added a comment - arina-ielchiieva commented on pull request #1955: DRILL-7491 : Incorrect count() returned for complex types in parquet URL: https://github.com/apache/drill/pull/1955#discussion_r366328108 ########## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java ########## @@ -167,29 +167,46 @@ public boolean isMatchAllMetadata() { */ @Override public long getColumnValueCount(SchemaPath column) { long tableRowCount, colNulls; Long nulls; ColumnStatistics<?> columnStats = getTableMetadata().getColumnStatistics(column); ColumnStatistics<?> nonInterestingColStats = null; if (columnStats == null) { - nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column); - } + ColumnStatistics<?> nonInterestingColStats = columnStats == null + ? getNonInterestingColumnsMetadata().getColumnStatistics(column) : null; + long tableRowCount; if (columnStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata()); } else if (nonInterestingColStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata()); + columnStats = nonInterestingColStats; + } else if (existsNestedStatsForColumn(column, getTableMetadata()) + || existsNestedStatsForColumn(column, getNonInterestingColumnsMetadata())) { + // When exists statistic for nested field, this is complex column which is present in table. Review comment: ```suggestion // When statistics for nested field exists, this is complex column which is present in table. ``` ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: users@infra.apache.org
            githubbot ASF GitHub Bot made changes -
            Remote Link This issue links to "GitHub Pull Request #1955 (Web Link)" [ 183894 ] This issue links to "GitHub Pull Request #1955 (Web Link)" [ 183894 ]
            githubbot ASF GitHub Bot added a comment -

            ihuzenko commented on pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet
            URL: https://github.com/apache/drill/pull/1955#discussion_r366327715

            ##########
            File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
            ##########
            @@ -167,29 +167,43 @@ public boolean isMatchAllMetadata() {
            */
            @Override
            public long getColumnValueCount(SchemaPath column) {

            • long tableRowCount, colNulls;
            • Long nulls;
              ColumnStatistics<?> columnStats = getTableMetadata().getColumnStatistics(column);
            • ColumnStatistics<?> nonInterestingColStats = null;
            • if (columnStats == null) { - nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column); - }

              + ColumnStatistics<?> nonInterestingColStats = (columnStats == null)
              + ? getNonInterestingColumnsMetadata().getColumnStatistics(column) : null;

            + long tableRowCount;
            if (columnStats != null)

            { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata()); }

            else if (nonInterestingColStats != null)

            { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata()); + columnStats = nonInterestingColStats; + }

            else if (existsNestedStatsForColumn(column, getTableMetadata())
            + || existsNestedStatsForColumn(column, getNonInterestingColumnsMetadata()))

            { + return Statistic.NO_COLUMN_STATS; }

            else

            { return 0; // returns 0 if the column doesn't exist in the table. }
            • columnStats = columnStats != null ? columnStats : nonInterestingColStats;
            • nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats);
            • colNulls = nulls != null ? nulls : Statistic.NO_COLUMN_STATS;
              + Long nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats);
              + if (nulls == null || Statistic.NO_COLUMN_STATS == nulls || Statistic.NO_COLUMN_STATS == tableRowCount) { + return Statistic.NO_COLUMN_STATS; + }

              else

              { + return tableRowCount - nulls; + }

              + }

            • return Statistic.NO_COLUMN_STATS == tableRowCount
            • Statistic.NO_COLUMN_STATS == colNulls
            • ? Statistic.NO_COLUMN_STATS : tableRowCount - colNulls;
              + /**
              + * For complex columns, stats may be present only for nested fields. For example, a column path is `a`,
              + * but stats present for `a`.`b`. So before making a decision that column is absent, the case needs
              + * to be tested.
              + *
              + * @param column column path
              + * @param metadata metadata with column statistics
              + * @return whether stats exists for nested fields
              + */
              + private boolean existsNestedStatsForColumn(SchemaPath column, Metadata metadata) {

            Review comment:
            sorry @KazydubB , but the proposed new name also makes confusion, since column passed as the first argument isn't a leaf. I've added a comprehensive comment in place of usage, could you confirm that now intentions are clear?

            ----------------------------------------------------------------
            This is an automated message from the Apache Git Service.
            To respond to the message, please log on to GitHub and use the
            URL above to go to the specific comment.

            For queries about this service, please contact Infrastructure at:
            users@infra.apache.org

            githubbot ASF GitHub Bot added a comment - ihuzenko commented on pull request #1955: DRILL-7491 : Incorrect count() returned for complex types in parquet URL: https://github.com/apache/drill/pull/1955#discussion_r366327715 ########## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java ########## @@ -167,29 +167,43 @@ public boolean isMatchAllMetadata() { */ @Override public long getColumnValueCount(SchemaPath column) { long tableRowCount, colNulls; Long nulls; ColumnStatistics<?> columnStats = getTableMetadata().getColumnStatistics(column); ColumnStatistics<?> nonInterestingColStats = null; if (columnStats == null) { - nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column); - } + ColumnStatistics<?> nonInterestingColStats = (columnStats == null) + ? getNonInterestingColumnsMetadata().getColumnStatistics(column) : null; + long tableRowCount; if (columnStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata()); } else if (nonInterestingColStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata()); + columnStats = nonInterestingColStats; + } else if (existsNestedStatsForColumn(column, getTableMetadata()) + || existsNestedStatsForColumn(column, getNonInterestingColumnsMetadata())) { + return Statistic.NO_COLUMN_STATS; } else { return 0; // returns 0 if the column doesn't exist in the table. } columnStats = columnStats != null ? columnStats : nonInterestingColStats; nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats); colNulls = nulls != null ? nulls : Statistic.NO_COLUMN_STATS; + Long nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats); + if (nulls == null || Statistic.NO_COLUMN_STATS == nulls || Statistic.NO_COLUMN_STATS == tableRowCount) { + return Statistic.NO_COLUMN_STATS; + } else { + return tableRowCount - nulls; + } + } return Statistic.NO_COLUMN_STATS == tableRowCount Statistic.NO_COLUMN_STATS == colNulls ? Statistic.NO_COLUMN_STATS : tableRowCount - colNulls; + /** + * For complex columns, stats may be present only for nested fields. For example, a column path is `a`, + * but stats present for `a`.`b`. So before making a decision that column is absent, the case needs + * to be tested. + * + * @param column column path + * @param metadata metadata with column statistics + * @return whether stats exists for nested fields + */ + private boolean existsNestedStatsForColumn(SchemaPath column, Metadata metadata) { Review comment: sorry @KazydubB , but the proposed new name also makes confusion, since column passed as the first argument isn't a leaf. I've added a comprehensive comment in place of usage, could you confirm that now intentions are clear? ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: users@infra.apache.org
            githubbot ASF GitHub Bot made changes -
            Remote Link This issue links to "GitHub Pull Request #1955 (Web Link)" [ 183894 ] This issue links to "GitHub Pull Request #1955 (Web Link)" [ 183894 ]
            githubbot ASF GitHub Bot added a comment -

            KazydubB commented on pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet
            URL: https://github.com/apache/drill/pull/1955#discussion_r366316876

            ##########
            File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
            ##########
            @@ -167,29 +167,43 @@ public boolean isMatchAllMetadata() {
            */
            @Override
            public long getColumnValueCount(SchemaPath column) {

            • long tableRowCount, colNulls;
            • Long nulls;
              ColumnStatistics<?> columnStats = getTableMetadata().getColumnStatistics(column);
            • ColumnStatistics<?> nonInterestingColStats = null;
            • if (columnStats == null) { - nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column); - }

              + ColumnStatistics<?> nonInterestingColStats = (columnStats == null)
              + ? getNonInterestingColumnsMetadata().getColumnStatistics(column) : null;

            + long tableRowCount;
            if (columnStats != null)

            { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata()); }

            else if (nonInterestingColStats != null)

            { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata()); + columnStats = nonInterestingColStats; + }

            else if (existsNestedStatsForColumn(column, getTableMetadata())
            + || existsNestedStatsForColumn(column, getNonInterestingColumnsMetadata()))

            { + return Statistic.NO_COLUMN_STATS; }

            else

            { return 0; // returns 0 if the column doesn't exist in the table. }
            • columnStats = columnStats != null ? columnStats : nonInterestingColStats;
            • nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats);
            • colNulls = nulls != null ? nulls : Statistic.NO_COLUMN_STATS;
              + Long nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats);
              + if (nulls == null || Statistic.NO_COLUMN_STATS == nulls || Statistic.NO_COLUMN_STATS == tableRowCount) { + return Statistic.NO_COLUMN_STATS; + }

              else

              { + return tableRowCount - nulls; + }

              + }

            • return Statistic.NO_COLUMN_STATS == tableRowCount
            • Statistic.NO_COLUMN_STATS == colNulls
            • ? Statistic.NO_COLUMN_STATS : tableRowCount - colNulls;
              + /**
              + * For complex columns, stats may be present only for nested fields. For example, a column path is `a`,
              + * but stats present for `a`.`b`. So before making a decision that column is absent, the case needs
              + * to be tested.
              + *
              + * @param column column path
              + * @param metadata metadata with column statistics
              + * @return whether stats exists for nested fields
              + */
              + private boolean existsNestedStatsForColumn(SchemaPath column, Metadata metadata) {

            Review comment:
            The name is a little bit awkward, indeed. It can be changed to something like `existStatsForLeafColumn` (in the case, 'nested fields' should be changed to 'leaf columns' in the javadoc).

            ----------------------------------------------------------------
            This is an automated message from the Apache Git Service.
            To respond to the message, please log on to GitHub and use the
            URL above to go to the specific comment.

            For queries about this service, please contact Infrastructure at:
            users@infra.apache.org

            githubbot ASF GitHub Bot added a comment - KazydubB commented on pull request #1955: DRILL-7491 : Incorrect count() returned for complex types in parquet URL: https://github.com/apache/drill/pull/1955#discussion_r366316876 ########## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java ########## @@ -167,29 +167,43 @@ public boolean isMatchAllMetadata() { */ @Override public long getColumnValueCount(SchemaPath column) { long tableRowCount, colNulls; Long nulls; ColumnStatistics<?> columnStats = getTableMetadata().getColumnStatistics(column); ColumnStatistics<?> nonInterestingColStats = null; if (columnStats == null) { - nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column); - } + ColumnStatistics<?> nonInterestingColStats = (columnStats == null) + ? getNonInterestingColumnsMetadata().getColumnStatistics(column) : null; + long tableRowCount; if (columnStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata()); } else if (nonInterestingColStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata()); + columnStats = nonInterestingColStats; + } else if (existsNestedStatsForColumn(column, getTableMetadata()) + || existsNestedStatsForColumn(column, getNonInterestingColumnsMetadata())) { + return Statistic.NO_COLUMN_STATS; } else { return 0; // returns 0 if the column doesn't exist in the table. } columnStats = columnStats != null ? columnStats : nonInterestingColStats; nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats); colNulls = nulls != null ? nulls : Statistic.NO_COLUMN_STATS; + Long nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats); + if (nulls == null || Statistic.NO_COLUMN_STATS == nulls || Statistic.NO_COLUMN_STATS == tableRowCount) { + return Statistic.NO_COLUMN_STATS; + } else { + return tableRowCount - nulls; + } + } return Statistic.NO_COLUMN_STATS == tableRowCount Statistic.NO_COLUMN_STATS == colNulls ? Statistic.NO_COLUMN_STATS : tableRowCount - colNulls; + /** + * For complex columns, stats may be present only for nested fields. For example, a column path is `a`, + * but stats present for `a`.`b`. So before making a decision that column is absent, the case needs + * to be tested. + * + * @param column column path + * @param metadata metadata with column statistics + * @return whether stats exists for nested fields + */ + private boolean existsNestedStatsForColumn(SchemaPath column, Metadata metadata) { Review comment: The name is a little bit awkward, indeed. It can be changed to something like `existStatsForLeafColumn` (in the case, 'nested fields' should be changed to 'leaf columns' in the javadoc). ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: users@infra.apache.org
            githubbot ASF GitHub Bot made changes -
            Remote Link This issue links to "GitHub Pull Request #1955 (Web Link)" [ 183894 ] This issue links to "GitHub Pull Request #1955 (Web Link)" [ 183894 ]
            githubbot ASF GitHub Bot added a comment -

            ihuzenko commented on pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet
            URL: https://github.com/apache/drill/pull/1955#discussion_r366313821

            ##########
            File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
            ##########
            @@ -167,29 +167,43 @@ public boolean isMatchAllMetadata() {
            */
            @Override
            public long getColumnValueCount(SchemaPath column) {

            • long tableRowCount, colNulls;
            • Long nulls;
              ColumnStatistics<?> columnStats = getTableMetadata().getColumnStatistics(column);
            • ColumnStatistics<?> nonInterestingColStats = null;
            • if (columnStats == null) { - nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column); - }

              + ColumnStatistics<?> nonInterestingColStats = (columnStats == null)
              + ? getNonInterestingColumnsMetadata().getColumnStatistics(column) : null;

            + long tableRowCount;
            if (columnStats != null)

            { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata()); }

            else if (nonInterestingColStats != null)

            { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata()); + columnStats = nonInterestingColStats; + }

            else if (existsNestedStatsForColumn(column, getTableMetadata())
            + || existsNestedStatsForColumn(column, getNonInterestingColumnsMetadata()))

            { + return Statistic.NO_COLUMN_STATS; }

            else

            { return 0; // returns 0 if the column doesn't exist in the table. }
            • columnStats = columnStats != null ? columnStats : nonInterestingColStats;
            • nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats);
            • colNulls = nulls != null ? nulls : Statistic.NO_COLUMN_STATS;
              + Long nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats);
              + if (nulls == null || Statistic.NO_COLUMN_STATS == nulls || Statistic.NO_COLUMN_STATS == tableRowCount) { + return Statistic.NO_COLUMN_STATS; + }

              else

              { + return tableRowCount - nulls; + }

              + }

            • return Statistic.NO_COLUMN_STATS == tableRowCount
            • Statistic.NO_COLUMN_STATS == colNulls
            • ? Statistic.NO_COLUMN_STATS : tableRowCount - colNulls;
              + /**
              + * For complex columns, stats may be present only for nested fields. For example, a column path is `a`,
              + * but stats present for `a`.`b`. So before making a decision that column is absent, the case needs
              + * to be tested.
              + *
              + * @param column column path
              + * @param metadata metadata with column statistics
              + * @return whether stats exists for nested fields
              + */
              + private boolean existsNestedStatsForColumn(SchemaPath column, Metadata metadata) {

            Review comment:
            No, the nested stats exists but not applicable for the parent column. As javadoc says, the method is used only to determine whether the column is actually absent. The current name expresses exactly what the method does.

            ----------------------------------------------------------------
            This is an automated message from the Apache Git Service.
            To respond to the message, please log on to GitHub and use the
            URL above to go to the specific comment.

            For queries about this service, please contact Infrastructure at:
            users@infra.apache.org

            githubbot ASF GitHub Bot added a comment - ihuzenko commented on pull request #1955: DRILL-7491 : Incorrect count() returned for complex types in parquet URL: https://github.com/apache/drill/pull/1955#discussion_r366313821 ########## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java ########## @@ -167,29 +167,43 @@ public boolean isMatchAllMetadata() { */ @Override public long getColumnValueCount(SchemaPath column) { long tableRowCount, colNulls; Long nulls; ColumnStatistics<?> columnStats = getTableMetadata().getColumnStatistics(column); ColumnStatistics<?> nonInterestingColStats = null; if (columnStats == null) { - nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column); - } + ColumnStatistics<?> nonInterestingColStats = (columnStats == null) + ? getNonInterestingColumnsMetadata().getColumnStatistics(column) : null; + long tableRowCount; if (columnStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata()); } else if (nonInterestingColStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata()); + columnStats = nonInterestingColStats; + } else if (existsNestedStatsForColumn(column, getTableMetadata()) + || existsNestedStatsForColumn(column, getNonInterestingColumnsMetadata())) { + return Statistic.NO_COLUMN_STATS; } else { return 0; // returns 0 if the column doesn't exist in the table. } columnStats = columnStats != null ? columnStats : nonInterestingColStats; nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats); colNulls = nulls != null ? nulls : Statistic.NO_COLUMN_STATS; + Long nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats); + if (nulls == null || Statistic.NO_COLUMN_STATS == nulls || Statistic.NO_COLUMN_STATS == tableRowCount) { + return Statistic.NO_COLUMN_STATS; + } else { + return tableRowCount - nulls; + } + } return Statistic.NO_COLUMN_STATS == tableRowCount Statistic.NO_COLUMN_STATS == colNulls ? Statistic.NO_COLUMN_STATS : tableRowCount - colNulls; + /** + * For complex columns, stats may be present only for nested fields. For example, a column path is `a`, + * but stats present for `a`.`b`. So before making a decision that column is absent, the case needs + * to be tested. + * + * @param column column path + * @param metadata metadata with column statistics + * @return whether stats exists for nested fields + */ + private boolean existsNestedStatsForColumn(SchemaPath column, Metadata metadata) { Review comment: No, the nested stats exists but not applicable for the parent column. As javadoc says, the method is used only to determine whether the column is actually absent. The current name expresses exactly what the method does. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: users@infra.apache.org
            githubbot ASF GitHub Bot made changes -
            Remote Link This issue links to "GitHub Pull Request #1955 (Web Link)" [ 183894 ] This issue links to "GitHub Pull Request #1955 (Web Link)" [ 183894 ]
            githubbot ASF GitHub Bot added a comment -

            arina-ielchiieva commented on pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet
            URL: https://github.com/apache/drill/pull/1955#discussion_r366310138

            ##########
            File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
            ##########
            @@ -167,29 +167,43 @@ public boolean isMatchAllMetadata() {
            */
            @Override
            public long getColumnValueCount(SchemaPath column) {

            • long tableRowCount, colNulls;
            • Long nulls;
              ColumnStatistics<?> columnStats = getTableMetadata().getColumnStatistics(column);
            • ColumnStatistics<?> nonInterestingColStats = null;
            • if (columnStats == null) { - nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column); - }

              + ColumnStatistics<?> nonInterestingColStats = (columnStats == null)
              + ? getNonInterestingColumnsMetadata().getColumnStatistics(column) : null;

            + long tableRowCount;
            if (columnStats != null)

            { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata()); }

            else if (nonInterestingColStats != null)

            { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata()); + columnStats = nonInterestingColStats; + }

            else if (existsNestedStatsForColumn(column, getTableMetadata())
            + || existsNestedStatsForColumn(column, getNonInterestingColumnsMetadata()))

            { + return Statistic.NO_COLUMN_STATS; }

            else

            { return 0; // returns 0 if the column doesn't exist in the table. }
            • columnStats = columnStats != null ? columnStats : nonInterestingColStats;
            • nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats);
            • colNulls = nulls != null ? nulls : Statistic.NO_COLUMN_STATS;
              + Long nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats);
              + if (nulls == null || Statistic.NO_COLUMN_STATS == nulls || Statistic.NO_COLUMN_STATS == tableRowCount) { + return Statistic.NO_COLUMN_STATS; + }

              else

              { + return tableRowCount - nulls; + }

              + }

            • return Statistic.NO_COLUMN_STATS == tableRowCount
            • Statistic.NO_COLUMN_STATS == colNulls
            • ? Statistic.NO_COLUMN_STATS : tableRowCount - colNulls;
              + /**
              + * For complex columns, stats may be present only for nested fields. For example, a column path is `a`,
              + * but stats present for `a`.`b`. So before making a decision that column is absent, the case needs
              + * to be tested.
              + *
              + * @param column column path
              + * @param metadata metadata with column statistics
              + * @return whether stats exists for nested fields
              + */
              + private boolean existsNestedStatsForColumn(SchemaPath column, Metadata metadata) {

            Review comment:
            I am not quite sure I follow this method logic because when you call this method you return that there is no statistics is present but method states that it checks that nested column statistics exists. I guess we might need to rename method to depict exactly what it does, for example checking if column exists.
            By the way, if stats for nested column exists, can we use it for calculation?

            ----------------------------------------------------------------
            This is an automated message from the Apache Git Service.
            To respond to the message, please log on to GitHub and use the
            URL above to go to the specific comment.

            For queries about this service, please contact Infrastructure at:
            users@infra.apache.org

            githubbot ASF GitHub Bot added a comment - arina-ielchiieva commented on pull request #1955: DRILL-7491 : Incorrect count() returned for complex types in parquet URL: https://github.com/apache/drill/pull/1955#discussion_r366310138 ########## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java ########## @@ -167,29 +167,43 @@ public boolean isMatchAllMetadata() { */ @Override public long getColumnValueCount(SchemaPath column) { long tableRowCount, colNulls; Long nulls; ColumnStatistics<?> columnStats = getTableMetadata().getColumnStatistics(column); ColumnStatistics<?> nonInterestingColStats = null; if (columnStats == null) { - nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column); - } + ColumnStatistics<?> nonInterestingColStats = (columnStats == null) + ? getNonInterestingColumnsMetadata().getColumnStatistics(column) : null; + long tableRowCount; if (columnStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getTableMetadata()); } else if (nonInterestingColStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata()); + columnStats = nonInterestingColStats; + } else if (existsNestedStatsForColumn(column, getTableMetadata()) + || existsNestedStatsForColumn(column, getNonInterestingColumnsMetadata())) { + return Statistic.NO_COLUMN_STATS; } else { return 0; // returns 0 if the column doesn't exist in the table. } columnStats = columnStats != null ? columnStats : nonInterestingColStats; nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats); colNulls = nulls != null ? nulls : Statistic.NO_COLUMN_STATS; + Long nulls = ColumnStatisticsKind.NULLS_COUNT.getFrom(columnStats); + if (nulls == null || Statistic.NO_COLUMN_STATS == nulls || Statistic.NO_COLUMN_STATS == tableRowCount) { + return Statistic.NO_COLUMN_STATS; + } else { + return tableRowCount - nulls; + } + } return Statistic.NO_COLUMN_STATS == tableRowCount Statistic.NO_COLUMN_STATS == colNulls ? Statistic.NO_COLUMN_STATS : tableRowCount - colNulls; + /** + * For complex columns, stats may be present only for nested fields. For example, a column path is `a`, + * but stats present for `a`.`b`. So before making a decision that column is absent, the case needs + * to be tested. + * + * @param column column path + * @param metadata metadata with column statistics + * @return whether stats exists for nested fields + */ + private boolean existsNestedStatsForColumn(SchemaPath column, Metadata metadata) { Review comment: I am not quite sure I follow this method logic because when you call this method you return that there is no statistics is present but method states that it checks that nested column statistics exists. I guess we might need to rename method to depict exactly what it does, for example checking if column exists. By the way, if stats for nested column exists, can we use it for calculation? ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: users@infra.apache.org
            githubbot ASF GitHub Bot added a comment -

            arina-ielchiieva commented on pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet
            URL: https://github.com/apache/drill/pull/1955#discussion_r366307446

            ##########
            File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
            ##########
            @@ -167,29 +167,43 @@ public boolean isMatchAllMetadata() {
            */
            @Override
            public long getColumnValueCount(SchemaPath column) {

            • long tableRowCount, colNulls;
            • Long nulls;
              ColumnStatistics<?> columnStats = getTableMetadata().getColumnStatistics(column);
            • ColumnStatistics<?> nonInterestingColStats = null;
            • if (columnStats == null) { - nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column); - }

              + ColumnStatistics<?> nonInterestingColStats = (columnStats == null)

            Review comment:
            ```suggestion
            ColumnStatistics<?> nonInterestingColStats = columnStats == null
            ```

            ----------------------------------------------------------------
            This is an automated message from the Apache Git Service.
            To respond to the message, please log on to GitHub and use the
            URL above to go to the specific comment.

            For queries about this service, please contact Infrastructure at:
            users@infra.apache.org

            githubbot ASF GitHub Bot added a comment - arina-ielchiieva commented on pull request #1955: DRILL-7491 : Incorrect count() returned for complex types in parquet URL: https://github.com/apache/drill/pull/1955#discussion_r366307446 ########## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java ########## @@ -167,29 +167,43 @@ public boolean isMatchAllMetadata() { */ @Override public long getColumnValueCount(SchemaPath column) { long tableRowCount, colNulls; Long nulls; ColumnStatistics<?> columnStats = getTableMetadata().getColumnStatistics(column); ColumnStatistics<?> nonInterestingColStats = null; if (columnStats == null) { - nonInterestingColStats = getNonInterestingColumnsMetadata().getColumnStatistics(column); - } + ColumnStatistics<?> nonInterestingColStats = (columnStats == null) Review comment: ```suggestion ColumnStatistics<?> nonInterestingColStats = columnStats == null ``` ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: users@infra.apache.org
            githubbot ASF GitHub Bot made changes -
            Remote Link This issue links to "GitHub Pull Request #1955 (Web Link)" [ 183894 ] This issue links to "GitHub Pull Request #1955 (Web Link)" [ 183894 ]
            githubbot ASF GitHub Bot made changes -
            Remote Link This issue links to "GitHub Pull Request #1955 (Web Link)" [ 183894 ] This issue links to "GitHub Pull Request #1955 (Web Link)" [ 183894 ]
            githubbot ASF GitHub Bot added a comment -

            ihuzenko commented on pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet
            URL: https://github.com/apache/drill/pull/1955#discussion_r366303787

            ##########
            File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
            ##########
            @@ -180,7 +180,7 @@ public long getColumnValueCount(SchemaPath column) {
            } else if (nonInterestingColStats != null)

            { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata()); }

            else {

            • return 0; // returns 0 if the column doesn't exist in the table.
              + return Statistic.NO_COLUMN_STATS;

            Review comment:
            Hello @arina-ielchiieva , thank you very much for describing the issue. I've updated this PR, please take another look.

            ----------------------------------------------------------------
            This is an automated message from the Apache Git Service.
            To respond to the message, please log on to GitHub and use the
            URL above to go to the specific comment.

            For queries about this service, please contact Infrastructure at:
            users@infra.apache.org

            githubbot ASF GitHub Bot added a comment - ihuzenko commented on pull request #1955: DRILL-7491 : Incorrect count() returned for complex types in parquet URL: https://github.com/apache/drill/pull/1955#discussion_r366303787 ########## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java ########## @@ -180,7 +180,7 @@ public long getColumnValueCount(SchemaPath column) { } else if (nonInterestingColStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata()); } else { return 0; // returns 0 if the column doesn't exist in the table. + return Statistic.NO_COLUMN_STATS; Review comment: Hello @arina-ielchiieva , thank you very much for describing the issue. I've updated this PR, please take another look. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: users@infra.apache.org
            githubbot ASF GitHub Bot made changes -
            Remote Link This issue links to "GitHub Pull Request #1955 (Web Link)" [ 183894 ] This issue links to "GitHub Pull Request #1955 (Web Link)" [ 183894 ]
            arina Arina Ielchiieva made changes -
            Reviewer Arina Ielchiieva [ arina ]
            arina Arina Ielchiieva made changes -
            Fix Version/s 1.18.0 [ 12345459 ]
            githubbot ASF GitHub Bot added a comment -

            arina-ielchiieva commented on pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet
            URL: https://github.com/apache/drill/pull/1955#discussion_r366250930

            ##########
            File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
            ##########
            @@ -180,7 +180,7 @@ public long getColumnValueCount(SchemaPath column) {
            } else if (nonInterestingColStats != null)

            { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata()); }

            else {

            • return 0; // returns 0 if the column doesn't exist in the table.
              + return Statistic.NO_COLUMN_STATS;

            Review comment:
            @ihuzenko I am not sure that change is correct. As you can see from java-doc and comment that you have removed, 0 is returned deliberately to avoid full scan if column does not exist.
            Here is an example of unit test that shows that with your change, you are breaking existing functionality, you can add it to `org.apache.drill.exec.planner.logical.TestConvertCountToDirectScan` class:
            ```
            @Test
            public void textConvertAbsentColumn() throws Exception

            { String sql = "select count(abc) as cnt from cp.`tpch/nation.parquet`"; queryBuilder() .sql(sql) .planMatcher() .include("DynamicPojoRecordReader") .match(); testBuilder() .sqlQuery(sql) .unOrdered() .baselineColumns("cnt") .baselineValues(0L) .go(); }

            ```
            After your changes, this test fill fail. I think you need to find a way to determine if column is absent or complex.

            ----------------------------------------------------------------
            This is an automated message from the Apache Git Service.
            To respond to the message, please log on to GitHub and use the
            URL above to go to the specific comment.

            For queries about this service, please contact Infrastructure at:
            users@infra.apache.org

            githubbot ASF GitHub Bot added a comment - arina-ielchiieva commented on pull request #1955: DRILL-7491 : Incorrect count() returned for complex types in parquet URL: https://github.com/apache/drill/pull/1955#discussion_r366250930 ########## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java ########## @@ -180,7 +180,7 @@ public long getColumnValueCount(SchemaPath column) { } else if (nonInterestingColStats != null) { tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata()); } else { return 0; // returns 0 if the column doesn't exist in the table. + return Statistic.NO_COLUMN_STATS; Review comment: @ihuzenko I am not sure that change is correct. As you can see from java-doc and comment that you have removed, 0 is returned deliberately to avoid full scan if column does not exist. Here is an example of unit test that shows that with your change, you are breaking existing functionality, you can add it to `org.apache.drill.exec.planner.logical.TestConvertCountToDirectScan` class: ``` @Test public void textConvertAbsentColumn() throws Exception { String sql = "select count(abc) as cnt from cp.`tpch/nation.parquet`"; queryBuilder() .sql(sql) .planMatcher() .include("DynamicPojoRecordReader") .match(); testBuilder() .sqlQuery(sql) .unOrdered() .baselineColumns("cnt") .baselineValues(0L) .go(); } ``` After your changes, this test fill fail. I think you need to find a way to determine if column is absent or complex. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: users@infra.apache.org
            githubbot ASF GitHub Bot made changes -
            Remote Link This issue links to "GitHub Pull Request #1955 (Web Link)" [ 183894 ] This issue links to "GitHub Pull Request #1955 (Web Link)" [ 183894 ]
            ihuzenko Igor Guzenko made changes -
            Status In Progress [ 3 ] Reviewable [ 10006 ]
            ihuzenko Igor Guzenko made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            githubbot ASF GitHub Bot added a comment -

            ihuzenko commented on pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet
            URL: https://github.com/apache/drill/pull/1955

            1. DRILL-7491(https://issues.apache.org/jira/browse/DRILL-7491): Incorrect count() returned for complex types in parquet
              1. Description

            Convert ConvertCountToDirectScanPrule was applied to existing complex type columns with absent rowCount metadata. Returning ```Statistic.NO_COLUMN_STATS``` fixes the issue.

            ----------------------------------------------------------------
            This is an automated message from the Apache Git Service.
            To respond to the message, please log on to GitHub and use the
            URL above to go to the specific comment.

            For queries about this service, please contact Infrastructure at:
            users@infra.apache.org

            githubbot ASF GitHub Bot added a comment - ihuzenko commented on pull request #1955: DRILL-7491 : Incorrect count() returned for complex types in parquet URL: https://github.com/apache/drill/pull/1955 DRILL-7491 ( https://issues.apache.org/jira/browse/DRILL-7491): Incorrect count() returned for complex types in parquet Description Convert ConvertCountToDirectScanPrule was applied to existing complex type columns with absent rowCount metadata. Returning ```Statistic.NO_COLUMN_STATS``` fixes the issue. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: users@infra.apache.org
            githubbot ASF GitHub Bot made changes -
            Remote Link This issue links to "GitHub Pull Request #1955 (Web Link)" [ 183894 ]
            ihuzenko Igor Guzenko made changes -
            Summary Incorrect COUNT(column) returned for complex types in parquet Incorrect count() returned for complex types in parquet
            ihuzenko Igor Guzenko made changes -
            Description To reproduce use the attached file for {{hive_alltypes.parquet}} (this is parquet file generated by Hive) and try count on columns *c13 - c15.*  For example, 

            {code:sql}

            SELECT count(c13) FROM dfs.tmp.``

            {code}
            To reproduce use the attached file for {{hive_alltypes.parquet}} (this is parquet file generated by Hive) and try count on columns *c13 - c15.*  For example, 
            {code:sql}
            SELECT count(c13) FROM dfs.tmp.`hive_alltypes.parquet`
            {code}

            *Expected result:* {color:green}3 {color}
            *Actual result:* {color:red}0{color}
            ihuzenko Igor Guzenko made changes -
            Description To reproduce use the attached file for {{hive_alltypes.parquet}} (this is parquet file generated by Hive) and try count on columns *c13 - c15.*  To reproduce use the attached file for {{hive_alltypes.parquet}} (this is parquet file generated by Hive) and try count on columns *c13 - c15.*  For example, 

            {code:sql}

            SELECT count(c13) FROM dfs.tmp.``

            {code}
            arina Arina Ielchiieva made changes -
            Description To reproduce use the attached file for {{hive_alltypes.parquet}} and try count on columns *c13 - c15.*  To reproduce use the attached file for {{hive_alltypes.parquet}} (this is parquet file generated by Hive) and try count on columns *c13 - c15.* 
            arina Arina Ielchiieva made changes -
            Description To reproduce download the attached file for DRILL-7473 and try count on columns *c13 - c15.*  To reproduce use the attached file for {{hive_alltypes.parquet}} and try count on columns *c13 - c15.* 
            arina Arina Ielchiieva made changes -
            Field Original Value New Value
            Attachment hive_alltypes.parquet [ 12989188 ]
            arina Arina Ielchiieva added a comment - - edited

            ihuzenko please provide query example

            arina Arina Ielchiieva added a comment - - edited ihuzenko please provide query example
            volodymyr Vova Vysotskyi added a comment -

            Good catch!

            volodymyr Vova Vysotskyi added a comment - Good catch!
            ihuzenko Igor Guzenko created issue -

            People

              ihuzenko Igor Guzenko
              ihuzenko Igor Guzenko
              Arina Ielchiieva Arina Ielchiieva
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: