Apache Arrow / ARROW-8902

[rust][datafusion] optimize count(*) queries on parquet sources

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Invalid
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Rust, Rust - DataFusion
    • Labels: None

    Description

      Currently, as far as I can tell, when you run `select count(*) from dataset` in DataFusion against a Parquet dataset, it is implemented by scanning column 0 and counting up all of the rows (specifically, I think it counts the number of rows in each batch).

       

      However, for the specific case of just counting everything in a Parquet file, you can read the row count from the footer metadata, so it's O(1) instead of O(n).
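
      To make the difference concrete, here is a minimal sketch of the two strategies. `RowGroupMeta` and `FileMeta` are hypothetical stand-ins for the footer metadata, not the `parquet` crate's types or DataFusion's actual code. In the real crate the footer's row count should be available through the file metadata's `num_rows()` accessor.

```rust
// Hypothetical stand-ins for Parquet footer metadata (illustration only).
struct RowGroupMeta {
    num_rows: i64,
}

struct FileMeta {
    row_groups: Vec<RowGroupMeta>,
}

/// Count rows the way the description says it works today: the caller has
/// already decoded one column into batches, and we add up the batch lengths.
/// The expensive part is producing `batches`, which touches every row.
fn count_by_scan(batches: &[Vec<i32>]) -> i64 {
    batches.iter().map(|batch| batch.len() as i64).sum()
}

/// Count rows from the footer alone. No column data is read; the cost
/// depends on the number of row groups, not on the number of rows.
fn count_from_footer(meta: &FileMeta) -> i64 {
    meta.row_groups.iter().map(|rg| rg.num_rows).sum()
}
```

      Both functions must agree for the same file. The second one never needs the column data, so the optimization only applies when the query has no filter and counts every row.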

          People

            Assignee: Unassigned
            Reporter: Alex Gaynor (alex_gaynor)
            Votes: 1
            Watchers: 5
