Details
- Bug
- Status: Resolved
- P2
- Resolution: Fixed
- 2.0.0
- None
Description
Validation of file patterns containing a wildcard (`*`) in GCS directories does not always work.
Some patterns raise an error at this line during validation:
https://github.com/apache/beam/blob/v2.0.0/sdks/python/apache_beam/io/filebasedsource.py#L168
I've tried a few different FileSystems.match calls, and the results confuse me a bit.
Full path works:
>>> FileSystems.match(['gs://gcp-public-data-landsat/LC08/PRE/044/034/LC80440342016259LGN00/LC80440342016259LGN00_B1.TIF'], limits=[1])[0].metadata_list
[FileMetadata(gs://gcp-public-data-landsat/LC08/PRE/044/034/LC80440342016259LGN00/LC80440342016259LGN00_B1.TIF, 74721736)]
A glob star in place of the directory does not:
>>> FileSystems.match(['gs://gcp-public-data-landsat/LC08/PRE/044/034/*/LC80440342016259LGN00_B1.TIF'], limits=[1])[0].metadata_list
[]
Adding a star at the file level as well, so that we search for any TIF file, works (although it matches a different file, which is fine):
>>> FileSystems.match(['gs://gcp-public-data-landsat/LC08/PRE/044/034/*/*.TIF'], limits=[1])[0].metadata_list
[FileMetadata(gs://gcp-public-data-landsat/LC08/PRE/044/034/LC80440342013106LGN01/LC80440342013106LGN01_B1.TIF, 65862791)]
Now for the even stranger case.
Looking for the file that the previous pattern found, still with a star in place of the directory, does find it:
>>> FileSystems.match(['gs://gcp-public-data-landsat/LC08/PRE/044/034/*/LC80440342013106LGN01_B1.TIF'], limits=[1])[0].metadata_list
[FileMetadata(gs://gcp-public-data-landsat/LC08/PRE/044/034/LC80440342013106LGN01/LC80440342013106LGN01_B1.TIF, 65862791)]
Going back to the first file, the pattern does match if the star is placed late enough in the directory name to make the directory unique:
>>> FileSystems.match(['gs://gcp-public-data-landsat/LC08/PRE/044/034/LC80440342016259LGN*/LC80440342016259LGN00_B1.TIF'], limits=[1])[0].metadata_list
[FileMetadata(gs://gcp-public-data-landsat/LC08/PRE/044/034/LC80440342016259LGN00/LC80440342016259LGN00_B1.TIF, 74721736)]
but not if the star is placed earlier in the name:
>>> FileSystems.match(['gs://gcp-public-data-landsat/LC08/PRE/044/034/LC8044034201*/LC80440342016259LGN00_B1.TIF'], limits=[1])[0].metadata_list
[]
My guess is that some folders are dropped from the list of matched directories, which is a bit concerning.
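To compare the cases above side by side, here is a minimal sketch that runs every pattern from this report through a matcher and lists the ones that return nothing. The names `KNOWN_OBJECTS`, `PATTERNS`, `reference_match`, `beam_match` and `find_unmatched` are hypothetical helpers, not part of Beam. `reference_match` uses plain `fnmatch` over the two object names quoted above as a stand-in for the expected result; it is not a claim about how Beam implements matching. `beam_match` repeats the `FileSystems.match` call from the report and assumes `apache_beam` with GCP support is installed and the bucket is reachable.

```python
import fnmatch

PREFIX = "gs://gcp-public-data-landsat/LC08/PRE/044/034/"

# The two object names that appear in the outputs quoted in this report.
KNOWN_OBJECTS = [
    PREFIX + "LC80440342013106LGN01/LC80440342013106LGN01_B1.TIF",
    PREFIX + "LC80440342016259LGN00/LC80440342016259LGN00_B1.TIF",
]

# The six patterns tried in this report, in the same order.
PATTERNS = [
    PREFIX + "LC80440342016259LGN00/LC80440342016259LGN00_B1.TIF",
    PREFIX + "*/LC80440342016259LGN00_B1.TIF",
    PREFIX + "*/*.TIF",
    PREFIX + "*/LC80440342013106LGN01_B1.TIF",
    PREFIX + "LC80440342016259LGN*/LC80440342016259LGN00_B1.TIF",
    PREFIX + "LC8044034201*/LC80440342016259LGN00_B1.TIF",
]


def reference_match(pattern, objects=KNOWN_OBJECTS):
    """Hypothetical reference matcher: fnmatch over a fixed listing."""
    return [name for name in objects if fnmatch.fnmatchcase(name, pattern)]


def beam_match(pattern):
    """Same call as in the report; needs apache_beam and GCS access."""
    # Imported lazily so the rest of the sketch runs without Beam installed.
    from apache_beam.io.filesystems import FileSystems
    result = FileSystems.match([pattern], limits=[1])[0]
    return [metadata.path for metadata in result.metadata_list]


def find_unmatched(patterns, match_fn):
    """Return the patterns for which match_fn finds no files."""
    return [pattern for pattern in patterns if not match_fn(pattern)]


if __name__ == "__main__":
    # Against the fixed listing, every pattern is expected to match.
    print(find_unmatched(PATTERNS, reference_match))
```

Calling `find_unmatched(PATTERNS, beam_match)` on an affected version should list the patterns that fail validation, assuming the bucket contents are unchanged.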