[ARROW-15310] [C++][Python][Dataset] Detect (and warn?) when DirectoryPartitioning is parsing an actually hive-style file path? - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: C++, Python
Labels:
- dataset

External issue URL:
https://github.com/apache/arrow/issues/30799

Description

When you have a hive-style partitioned dataset, it is relatively easy with our current dataset(..) API to get the inferred partitioning wrong and end up with confusing results.

For example, if you specify the partitioning field names with partitioning=[...] (which is not needed for hive style, since the names are inferred from the paths), we actually assume you want directory partitioning. This DirectoryPartitioning then parses the hive-style file paths and takes the full "key=value" string as the data value for the field.
Filtering can then return a confusingly empty result (because "value" doesn't match "key=value").

I am wondering whether we could detect this case relatively cheaply and, e.g., give the user an informative warning about it.

Basically what happens is this:

>>> part = ds.DirectoryPartitioning(pa.schema([("part", "string")]))
>>> part.parse("part=a")
<pyarrow.dataset.Expression (part == "part=a")>

If the parsed value is a string that contains a "=" (and, in this case, also contains the field name), that is, I think, a clear sign that the user is doing something wrong in the large majority of cases.

I am not fully sure where and at what stage the check could be done though. Doing it for every path in the dataset might be too costly.
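To make the idea concrete, here is a minimal sketch of such a check in plain Python. It only looks at a small sample of paths to keep the cost bounded. The names looks_hive_style, warn_if_hive_style and sample_size are hypothetical and are not part of pyarrow; where such a check would actually live (C++ or Python, at discovery or at parse time) is still open.

```python
import itertools
import warnings


def looks_hive_style(paths, field_names, sample_size=10):
    """Return True if a sampled path has a directory segment '<field>=...'.

    Hypothetical helper, not a pyarrow API. Only the first `sample_size`
    paths are inspected, so the cost does not grow with the dataset.
    """
    prefixes = tuple(f"{name}=" for name in field_names)
    for path in itertools.islice(paths, sample_size):
        # Only directory segments can carry partition values; drop the file name.
        segments = path.replace("\\", "/").split("/")[:-1]
        if any(segment.startswith(prefixes) for segment in segments):
            return True
    return False


def warn_if_hive_style(paths, field_names):
    """Warn (and return True) if directory partitioning looks like a mistake."""
    if looks_hive_style(paths, field_names):
        warnings.warn(
            "The file paths look hive-style ('key=value'), but directory "
            "partitioning was requested; consider partitioning='hive'.",
            UserWarning,
        )
        return True
    return False


# Paths relative to the dataset root, as in the example in this issue.
hive_paths = ["part=a/data.parquet", "part=b/data.parquet"]
plain_paths = ["a/data.parquet", "b/data.parquet"]

print(looks_hive_style(hive_paths, ["part"]))
print(looks_hive_style(plain_paths, ["part"]))
```

Matching on "<field name>=" rather than on any "=" keeps false positives low for directory-partitioned data whose values legitimately contain an equals sign.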

Illustrative code example:

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

import pathlib

## constructing a small dataset with 1 hive-style partitioning level

basedir = pathlib.Path(".") / "dataset_wrong_partitioning"
basedir.mkdir(exist_ok=True)

(basedir / "part=a").mkdir(exist_ok=True)
(basedir / "part=b").mkdir(exist_ok=True)

table1 = pa.table({'a': [1, 2, 3], 'b': [1, 2, 3]})
pq.write_table(table1, basedir / "part=a" / "data.parquet")

table2 = pa.table({'a': [4, 5, 6], 'b': [1, 2, 3]})
pq.write_table(table2, basedir / "part=b" / "data.parquet")

Reading it as is (without specifying a partitioning, so defaulting to no partitioning) at least gives an error about a missing field when filtering:

>>> dataset = ds.dataset(basedir)
>>> dataset.to_table(filter=ds.field("part") == "a")
...
ArrowInvalid: No match for FieldRef.Name(part) in a: int64

But specifying the partitioning field name (which is currently silently interpreted as directory partitioning) gives a confusing empty result:

>>> dataset = ds.dataset(basedir, partitioning=["part"])
>>> dataset.to_table(filter=ds.field("part") == "a")
pyarrow.Table
a: int64
b: int64
part: string
----
a: []
b: []
part: []

This filter doesn't work because the values in the "part" column are not "a" but "part=a":

>>> dataset.to_table().to_pandas()
   a  b    part
0  1  1  part=a
1  2  2  part=a
2  3  3  part=a
3  4  1  part=b
4  5  2  part=b
5  6  3  part=b

Issue Links

is related to: ARROW-10485 [R] Accept partitioning in open_dataset when file paths are hive-style (Resolved)

People

Assignee: Unassigned

Reporter: Joris Van den Bossche

Votes: 0

Watchers: 3

Dates

Created: 12/Jan/22 11:21

Updated: 11/Jan/23 11:36