ARROW-9644

[C++][Dataset] Do not check for ignore_prefixes in the base path


    Description

      Somewhat related to ARROW-8427, and from https://github.com/apache/arrow/issues/7857

      I am not sure we should check the ignore_prefixes against the base path provided by the user. If that path contains e.g. a directory whose name starts with an underscore, discovery simply skips everything below it, and the result is an empty dataset:

      import tempfile
      import pathlib

      path = tempfile.mkdtemp()
      tmpdir = pathlib.Path(path)

      # base path with a directory with an underscore
      datadir = tmpdir / "_data" / "dataset"
      datadir.mkdir(parents=True, exist_ok=True)

      # create a parquet file at that location
      import pyarrow as pa
      import pyarrow.parquet as pq

      table = pa.table({'a': [1, 2, 3]})
      pq.write_table(table, datadir / "data.parquet")

      # reading dataset skips everything
      import pyarrow.dataset as ds

      In [26]: ds.dataset(datadir)
      Out[26]: <pyarrow._dataset.FileSystemDataset at 0x7fbfd8779bb0>

      In [27]: ds.dataset(datadir).files
      Out[27]: []
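
To make the behaviour I would expect concrete, here is a minimal pure-Python sketch. It is not the actual C++ discovery code. The names `is_ignored` and `DEFAULT_IGNORE_PREFIXES` are hypothetical, and the default prefixes `.` and `_` are an assumption. The idea is to apply the prefix check only to the path segments below the user-provided base directory:

```python
from pathlib import PurePosixPath

# Assumed defaults for illustration: hidden ('.') and underscore ('_') prefixes.
DEFAULT_IGNORE_PREFIXES = (".", "_")


def is_ignored(file_path, base_dir, ignore_prefixes=DEFAULT_IGNORE_PREFIXES):
    """Return True if a path segment below base_dir starts with an ignored prefix.

    Segments that belong to base_dir itself are never checked.
    Raises ValueError if file_path is not located under base_dir.
    """
    relative = PurePosixPath(file_path).relative_to(PurePosixPath(base_dir))
    prefixes = tuple(ignore_prefixes)
    return any(part.startswith(prefixes) for part in relative.parts)


base = "/tmp/_data/dataset"
print(is_ignored("/tmp/_data/dataset/data.parquet", base))          # False
print(is_ignored("/tmp/_data/dataset/_metadata", base))             # True
print(is_ignored("/tmp/_data/dataset/.hidden/part.parquet", base))  # True
```

With a check like this, the `_data` component of the base path no longer hides `data.parquet`, while files and directories below the base path that start with an ignored prefix are still skipped.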
      

      cc bkietz npr

            People

              bkietz Ben Kietzman
              jorisvandenbossche Joris Van den Bossche
