Beam / BEAM-2338

GCS filepattern wildcard broken in Python SDK

Details

    • Type: Bug
    • Status: Resolved
    • Priority: P2
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 2.1.0
    • Component/s: beam-model
    • Labels: None

    Description

      Validation of file patterns containing a wildcard (`*`) in GCS directory names does not always work.

      Some kinds of patterns generate an error during validation here:
      https://github.com/apache/beam/blob/v2.0.0/sdks/python/apache_beam/io/filebasedsource.py#L168
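
      For context, the check at that line can be pictured roughly as follows. This is a minimal sketch, not the Beam source: it assumes that validation asks the filesystem for at most one match and raises when nothing comes back. `validate_pattern` and the injected `match` callable are hypothetical names used only for illustration.

```python
def validate_pattern(pattern, match):
    """Raise IOError if `pattern` matches no files.

    `match` is a hypothetical stand-in for a filesystem lookup: it takes
    a pattern and a `limit` and returns a list of matching paths.
    """
    # One hit is enough to show that the pattern is usable.
    matches = match(pattern, limit=1)
    if not matches:
        raise IOError('No files found for file pattern %s' % pattern)
    return matches
```

      Under that assumption, any matcher that wrongly returns an empty list for a pattern that does have matching files makes validation fail, which is the symptom described here.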

      I've tried a few different FileSystems.match calls, and the results confuse me a bit.

      Full path works:

      >>> FileSystems.match(['gs://gcp-public-data-landsat/LC08/PRE/044/034/LC80440342016259LGN00/LC80440342016259LGN00_B1.TIF'], limits=[1])[0].metadata_list
      [FileMetadata(gs://gcp-public-data-landsat/LC08/PRE/044/034/LC80440342016259LGN00/LC80440342016259LGN00_B1.TIF, 74721736)]
      

      A glob star on the directory does not:

      >>> FileSystems.match(['gs://gcp-public-data-landsat/LC08/PRE/044/034/*/LC80440342016259LGN00_B1.TIF'], limits=[1])[0].metadata_list
      []
      

      Adding a star at the file level as well, so that we search for any TIF file, works (although it matches a different file, which is fine):

      >>> FileSystems.match(['gs://gcp-public-data-landsat/LC08/PRE/044/034/*/*.TIF'], limits=[1])[0].metadata_list
      [FileMetadata(gs://gcp-public-data-landsat/LC08/PRE/044/034/LC80440342013106LGN01/LC80440342013106LGN01_B1.TIF, 65862791)]
      

      OK, here comes an even stranger case.
      Looking for the same file that the previous pattern found, but with a star on the directory, we do find it:

      >>> FileSystems.match(['gs://gcp-public-data-landsat/LC08/PRE/044/034/*/LC80440342013106LGN01_B1.TIF'], limits=[1])[0].metadata_list
      [FileMetadata(gs://gcp-public-data-landsat/LC08/PRE/044/034/LC80440342013106LGN01/LC80440342013106LGN01_B1.TIF, 65862791)]
      

      Looking at the first case again, we also get a match if the star is placed late enough in the pattern to make the directory unique:

      >>> FileSystems.match(['gs://gcp-public-data-landsat/LC08/PRE/044/034/LC80440342016259LGN*/LC80440342016259LGN00_B1.TIF'], limits=[1])[0].metadata_list
      [FileMetadata(gs://gcp-public-data-landsat/LC08/PRE/044/034/LC80440342016259LGN00/LC80440342016259LGN00_B1.TIF, 74721736)]
      

      but not if the star is placed earlier in the name:

      >>> FileSystems.match(['gs://gcp-public-data-landsat/LC08/PRE/044/034/LC8044034201*/LC80440342016259LGN00_B1.TIF'], limits=[1])[0].metadata_list
      []
      

      My guess is that some folders are dropped from the list of matched directories, which is a bit concerning.
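
      One hypothesis that is consistent with every result above is that the limit is applied to the raw object listing before the glob filter runs, so matching files that are not listed first get cut off. The sketch below illustrates that hypothesis only. It is not taken from the Beam source, and the abbreviated object list as well as `literal_prefix`, `list_objects`, `match_limit_before_filter`, and `match_filter_before_limit` are made up for the illustration. It assumes a listing sorted by object name and restricted to the literal prefix before the first wildcard.

```python
import fnmatch

# Hypothetical, abbreviated listing of the bucket, sorted by object name.
BASE = 'LC08/PRE/044/034/'
OBJECTS = sorted([
    BASE + 'LC80440342013106LGN01/LC80440342013106LGN01_B1.TIF',
    BASE + 'LC80440342013106LGN01/LC80440342013106LGN01_B2.TIF',
    BASE + 'LC80440342016259LGN00/LC80440342016259LGN00_B1.TIF',
    BASE + 'LC80440342016259LGN00/LC80440342016259LGN00_B2.TIF',
])


def literal_prefix(pattern):
    """Return the part of the pattern before the first '*'."""
    return pattern.split('*', 1)[0]


def list_objects(prefix):
    """Simulate a prefix listing that returns names in sorted order."""
    return [name for name in OBJECTS if name.startswith(prefix)]


def match_limit_before_filter(pattern, limit):
    """Hypothesised faulty order: truncate the listing, then filter."""
    listed = list_objects(literal_prefix(pattern))[:limit]
    return [n for n in listed if fnmatch.fnmatchcase(n, pattern)]


def match_filter_before_limit(pattern, limit):
    """Expected order: filter first, count only real matches."""
    listed = list_objects(literal_prefix(pattern))
    matched = [n for n in listed if fnmatch.fnmatchcase(n, pattern)]
    return matched[:limit]
```

      With this toy listing, `match_limit_before_filter` returns an empty list for `LC08/PRE/044/034/*/LC80440342016259LGN00_B1.TIF` and for the `LC8044034201*` variant, but finds a file for `*/*.TIF`, for `*/LC80440342013106LGN01_B1.TIF`, and for the `LC80440342016259LGN*` variant, mirroring the transcripts above. `match_filter_before_limit` finds the file in all five cases. Whether this is the actual cause would need to be checked against the GCS matching code.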

            People

              Assignee: Sourabh Bajaj (sb2nov)
              Reporter: Vilhelm von Ehrenheim (while)