Apache Arrow / ARROW-16526

[Python] test_partitioned_dataset fails when building with PARQUET but without DATASET

Details

    Description

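      Our current minimal_build examples for Python build with -DARROW_PARQUET=ON but without DATASET. In that configuration pq.write_to_dataset still tries to import pyarrow.dataset, which was not built, so test_partitioned_dataset fails as follows: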

       _________________________________________________________ test_partitioned_dataset[True] _________________________________________________________

      tempdir = PosixPath('/tmp/pytest-of-root/pytest-0/test_partitioned_dataset_True_0'), use_legacy_dataset = True

          @pytest.mark.pandas
          @parametrize_legacy_dataset
          def test_partitioned_dataset(tempdir, use_legacy_dataset):
              # ARROW-3208: Segmentation fault when reading a Parquet partitioned dataset
              # to a Parquet file
              path = tempdir / "ARROW-3208"
              df = pd.DataFrame({
                  'one': [-1, 10, 2.5, 100, 1000, 1, 29.2],
                  'two': [-1, 10, 2, 100, 1000, 1, 11],
                  'three': [0, 0, 0, 0, 0, 0, 0]
              })
              table = pa.Table.from_pandas(df)
      >       pq.write_to_dataset(table, root_path=str(path),
                                  partition_cols=['one', 'two'])

      pyarrow/tests/parquet/test_dataset.py:1544: 
      _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
      pyarrow/parquet/__init__.py:3110: in write_to_dataset
          import pyarrow.dataset as ds
      _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _     """Dataset is currently unstable. APIs subject to change without notice."""
          
          import pyarrow as pa
          from pyarrow.util import _is_iterable, _stringify_path, _is_path_like
          
      >   from pyarrow._dataset import (  # noqa
              CsvFileFormat,
              CsvFragmentScanOptions,
              Dataset,
              DatasetFactory,
              DirectoryPartitioning,
              FilenamePartitioning,
              FileFormat,
              FileFragment,
              FileSystemDataset,
              FileSystemDatasetFactory,
              FileSystemFactoryOptions,
              FileWriteOptions,
              Fragment,
              FragmentScanOptions,
              HivePartitioning,
              IpcFileFormat,
              IpcFileWriteOptions,
              InMemoryDataset,
              Partitioning,
              PartitioningFactory,
              Scanner,
              TaggedRecordBatch,
              UnionDataset,
              UnionDatasetFactory,
              _get_partition_keys,
              _filesystemdataset_write,
          )
      E   ModuleNotFoundError: No module named 'pyarrow._dataset'
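
The failure above comes down to an optional compiled submodule being absent from the build. As a rough illustration only (not the fix applied in Arrow), a build can be probed for the submodule before anything tries to use it. The helper name `has_module` is hypothetical; it relies only on the standard-library `importlib.util.find_spec`.

```python
import importlib.util


def has_module(name):
    """Return True if the module `name` can be located.

    `has_module` is a hypothetical helper for illustration. For a dotted
    name such as "pyarrow._dataset", find_spec imports the parent package
    (pyarrow) but not the submodule itself.
    """
    try:
        return importlib.util.find_spec(name) is not None
    except ImportError:
        # Raised when the parent package itself cannot be imported.
        return False


# On a build with PARQUET but without DATASET this is expected to report
# False, matching the ModuleNotFoundError in the traceback.
print("pyarrow._dataset available:", has_module("pyarrow._dataset"))
```

A check like this makes it possible to tell in advance whether a code path that imports `pyarrow.dataset` can work on a given build.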
      

      This can be reproduced by running the minimal_build examples:

      $ cd arrow/python/examples/minimal_build
      $ docker build -t arrow_ubuntu_minimal -f Dockerfile.ubuntu . 

      or by building Arrow and PyArrow with PARQUET but without DATASET.
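
One way a test that depends on both components could avoid failing on such a build is to skip itself when `pyarrow.dataset` cannot be imported. The sketch below uses the standard-library `unittest` purely for illustration; it is an assumption about how such a guard might look, not the change made for this issue, and the names `_parquet_and_dataset_available` and `PartitionedDatasetTest` are hypothetical.

```python
import importlib
import tempfile
import unittest


def _parquet_and_dataset_available():
    """Hypothetical guard: True only if both optional modules import."""
    try:
        importlib.import_module("pyarrow.parquet")
        importlib.import_module("pyarrow.dataset")
    except ImportError:  # ModuleNotFoundError is a subclass of ImportError
        return False
    return True


@unittest.skipUnless(_parquet_and_dataset_available(),
                     "pyarrow was built without PARQUET or DATASET")
class PartitionedDatasetTest(unittest.TestCase):
    def test_write_partitioned(self):
        import pyarrow as pa
        import pyarrow.parquet as pq

        table = pa.table({"one": [1, 2, 3], "two": [10, 10, 20]})
        with tempfile.TemporaryDirectory() as root:
            # This is the call that imports pyarrow.dataset in the traceback.
            pq.write_to_dataset(table, root_path=root,
                                partition_cols=["two"])


if __name__ == "__main__":
    unittest.main()
```

On a PARQUET-only build the test would be reported as skipped rather than failing with `ModuleNotFoundError`.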

            People

              westonpace Weston Pace
              raulcd Raúl Cumplido

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 40m