[ARROW-16526] [Python] test_partitioned_dataset fails when building with PARQUET but without DATASET - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 8.0.0
Fix Version/s: 9.0.0
Component/s: Python
Labels:
- good-first-issue
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/31887

Description

Our current minimal_build examples for Python build Arrow with -DARROW_PARQUET=ON but without DATASET. As a result, test_partitioned_dataset fails with the following error:

_________________________________________________________ test_partitioned_dataset[True] _________________________________________________________

tempdir = PosixPath('/tmp/pytest-of-root/pytest-0/test_partitioned_dataset_True_0'), use_legacy_dataset = True

    @pytest.mark.pandas
    @parametrize_legacy_dataset
    def test_partitioned_dataset(tempdir, use_legacy_dataset):
        # ARROW-3208: Segmentation fault when reading a Parquet partitioned dataset
        # to a Parquet file
        path = tempdir / "ARROW-3208"
        df = pd.DataFrame({
            'one': [-1, 10, 2.5, 100, 1000, 1, 29.2],
            'two': [-1, 10, 2, 100, 1000, 1, 11],
            'three': [0, 0, 0, 0, 0, 0, 0]
        })
        table = pa.Table.from_pandas(df)
>       pq.write_to_dataset(table, root_path=str(path),
                            partition_cols=['one', 'two'])

pyarrow/tests/parquet/test_dataset.py:1544: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pyarrow/parquet/__init__.py:3110: in write_to_dataset
    import pyarrow.dataset as ds
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

    """Dataset is currently unstable. APIs subject to change without notice."""
    
    import pyarrow as pa
    from pyarrow.util import _is_iterable, _stringify_path, _is_path_like
    
>   from pyarrow._dataset import (  # noqa
        CsvFileFormat,
        CsvFragmentScanOptions,
        Dataset,
        DatasetFactory,
        DirectoryPartitioning,
        FilenamePartitioning,
        FileFormat,
        FileFragment,
        FileSystemDataset,
        FileSystemDatasetFactory,
        FileSystemFactoryOptions,
        FileWriteOptions,
        Fragment,
        FragmentScanOptions,
        HivePartitioning,
        IpcFileFormat,
        IpcFileWriteOptions,
        InMemoryDataset,
        Partitioning,
        PartitioningFactory,
        Scanner,
        TaggedRecordBatch,
        UnionDataset,
        UnionDatasetFactory,
        _get_partition_keys,
        _filesystemdataset_write,
    )
E   ModuleNotFoundError: No module named 'pyarrow._dataset'

This can be reproduced by running the minimal_build examples:

$ cd arrow/python/examples/minimal_build
$ docker build -t arrow_ubuntu_minimal -f Dockerfile.ubuntu .

or by building Arrow and pyarrow with PARQUET but without DATASET.
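
To check quickly whether a given build is affected, the sketch below probes for the `pyarrow._dataset` extension module named in the traceback. It is only an illustration: `has_module` and `DATASET_AVAILABLE` are hypothetical names chosen here, and this is not necessarily how the linked pull request resolves the issue.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if `name` can be located.

    The target module itself is not imported, but its parent packages are.
    (`has_module` is a hypothetical helper, not part of pyarrow.)
    """
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # A missing parent package raises ModuleNotFoundError, which is a
        # subclass of ImportError; treat that as "not available".
        return False


# pyarrow._dataset is the extension module from the traceback above. It is
# missing when pyarrow was built with PARQUET but without DATASET.
DATASET_AVAILABLE = has_module("pyarrow._dataset")
print("pyarrow._dataset available:", DATASET_AVAILABLE)
```

A test suite could, for example, pass such a flag to `pytest.mark.skipif` so that dataset-dependent tests are skipped rather than failing on a build without DATASET.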

Issue Links

relates to

ARROW-16582 [Python] Include DATASET in list of components in PyArrow's dev page

Resolved

links to

GitHub Pull Request #13116

People

Assignee: Weston Pace

Reporter: Raúl Cumplido

Votes: 0

Watchers: 5

Dates

Created: 11/May/22 10:10

Updated: 11/Jan/23 11:44

Resolved: 12/May/22 11:25

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 40m