Apache Arrow / ARROW-16495

[Python] Scanner.count_rows() doesn't properly handle null expressions


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 7.0.0
    • Fix Version/s: 8.0.0
    • Component/s: Python
    • Labels: None

    Description

      Passing a filter expression built with `is_null()` is not handled correctly when computing row counts. I have reproduced this with both string and integer columns. Here is a reproducer.

      import pandas as pd
      import pyarrow.dataset as ds

      # Write a single nullable integer column with two nulls to Parquet
      df = pd.DataFrame({"C": pd.array([None, None, 1], dtype=pd.Int64Dtype())})
      print(df)
      df.to_parquet("test.pq")

      # Create a dataset
      dataset = ds.dataset("test.pq")
      fragments = list(dataset.get_fragments())
      # There should be just 1 fragment.
      fragment = fragments[0]
      # Get the null row count
      expr = ds.field("C").is_null()
      scanner = fragment.scanner(filter=expr)
      print(scanner.count_rows())

      I expect this to print 2, since there are 2 NULL values.

          People

            Assignee: Unassigned
            Reporter: Nick Riasanovsky (njriasan)