Details
- Type: Improvement
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
When you have a hive-style partitioned dataset, with our current dataset(..) API, it's relatively easy to mess up the inferred partitioning and get confusing results.
For example, if you specify the partitioning field names with partitioning=[...] (which is not needed for hive style since those are inferred), we actually assume you want directory partitioning. This DirectoryPartitioning will then parse the hive-style file paths and take the full "key=value" as the data values for the field.
And then, doing a filter can result in a confusing empty result (because "value" doesn't match "key=value").
I am wondering whether we can detect this case relatively cheaply and, e.g., give the user an informative warning.
Basically what happens is this:
>>> part = ds.DirectoryPartitioning(pa.schema([("part", "string")]))
>>> part.parse("part=a")
<pyarrow.dataset.Expression (part == "part=a")>
If the parsed value is a string that contains a "=" (and, in this case, also contains the field name), that is, I think, a clear sign that in the large majority of cases the user is doing something wrong.
I am not fully sure where and at what stage the check could be done though. Doing it for every path in the dataset might be too costly.
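To illustrate the idea, here is a minimal sketch of such a heuristic in pure Python (the function name and exact rule are hypothetical, not the actual pyarrow implementation). To keep the cost negligible, it could be applied to only the first discovered path rather than every path in the dataset:

def looks_like_misparsed_hive(field_name, parsed_value):
    """Heuristic: a directory-partitioned value that still contains '='
    and starts with the field name suggests the user passed field names
    for what is actually a hive-style layout."""
    if not isinstance(parsed_value, str):
        return False
    return "=" in parsed_value and parsed_value.startswith(field_name + "=")

print(looks_like_misparsed_hive("part", "part=a"))  # True  -> warn the user
print(looks_like_misparsed_hive("part", "a"))       # False -> looks fine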
Illustrative code example:
import pathlib

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# constructing a small dataset with 1 hive-style partitioning level
basedir = pathlib.Path(".") / "dataset_wrong_partitioning"
basedir.mkdir(exist_ok=True)
(basedir / "part=a").mkdir(exist_ok=True)
(basedir / "part=b").mkdir(exist_ok=True)

table1 = pa.table({'a': [1, 2, 3], 'b': [1, 2, 3]})
pq.write_table(table1, basedir / "part=a" / "data.parquet")
table2 = pa.table({'a': [4, 5, 6], 'b': [1, 2, 3]})
pq.write_table(table2, basedir / "part=b" / "data.parquet")
Reading as-is (not specifying a partitioning, so defaulting to no partitioning) will at least give an error about a missing field:
>>> dataset = ds.dataset(basedir)
>>> dataset.to_table(filter=ds.field("part") == "a")
...
ArrowInvalid: No match for FieldRef.Name(part) in a: int64
But specifying the partitioning field name (which currently gets (silently) interpreted as directory partitioning) gives a confusing empty result:
>>> dataset = ds.dataset(basedir, partitioning=["part"])
>>> dataset.to_table(filter=ds.field("part") == "a")
pyarrow.Table
a: int64
b: int64
part: string
----
a: []
b: []
part: []
This filter doesn't work because the values in the "part" column are not "a" but "part=a":
>>> dataset.to_table().to_pandas()
   a  b    part
0  1  1  part=a
1  2  2  part=a
2  3  3  part=a
3  4  1  part=b
4  5  2  part=b
5  6  3  part=b
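For contrast, a sketch of the intended usage: with partitioning="hive" (a real pyarrow option) the "key=value" directory names are parsed correctly and the filter works. The temporary directory below is illustrative:

import pathlib
import tempfile

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

basedir = pathlib.Path(tempfile.mkdtemp()) / "dataset_hive"
(basedir / "part=a").mkdir(parents=True)
(basedir / "part=b").mkdir(parents=True)
pq.write_table(pa.table({'a': [1, 2, 3], 'b': [1, 2, 3]}),
               basedir / "part=a" / "data.parquet")
pq.write_table(pa.table({'a': [4, 5, 6], 'b': [1, 2, 3]}),
               basedir / "part=b" / "data.parquet")

# With hive partitioning, "part=a" is parsed as part == "a"
dataset = ds.dataset(basedir, partitioning="hive")
result = dataset.to_table(filter=ds.field("part") == "a")
print(result.num_rows)  # 3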
Issue Links
- is related to: ARROW-10485 [R] Accept partitioning in open_dataset when file paths are hive-style (Resolved)