Hadoop Common / HADOOP-11621

s3a doesn't consider blobs with trailing / and content-length >0 as directories

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.7.0
    • Fix Version/s: None
    • Component/s: fs/s3
    • Labels:
      None

      Description

      When creating a directory using the AWS Management Console, the content-length is set to 0 and s3a works fine.

      When creating a directory using other tools, like S3Browse, the content-length is set to 1 and s3a doesn't work:
      S3AFileSystem: Found file (with /): real file? should not happen: dir1
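The difference between the two cases can be modelled in a few lines. This is a simplified sketch, not the S3AFileSystem source: the class `MarkerCheck`, the enum `Kind`, and both method names are hypothetical and exist only to contrast a strict rule (a trailing "/" counts as a directory only when the object is empty) with the lenient rule requested in this issue (the trailing "/" alone decides).

```java
// Simplified model only; this is NOT the actual S3AFileSystem code.
// All names here are hypothetical and chosen for illustration.
final class MarkerCheck {
    enum Kind { FILE, DIRECTORY, UNEXPECTED }

    // Strict rule: a key ending in "/" is a directory marker only if it holds no data.
    // A non-empty key ending in "/" is reported as UNEXPECTED.
    static Kind classifyStrict(String key, long contentLength) {
        if (!key.endsWith("/")) {
            return Kind.FILE;
        }
        return contentLength == 0 ? Kind.DIRECTORY : Kind.UNEXPECTED;
    }

    // Lenient rule requested in this issue: the trailing "/" alone decides,
    // and the content length is ignored.
    static Kind classifyLenient(String key, long contentLength) {
        return key.endsWith("/") ? Kind.DIRECTORY : Kind.FILE;
    }
}
```

Under the strict rule, a marker such as "dir1/" with content-length 1 falls into the UNEXPECTED case, which corresponds to the situation reported above; under the lenient rule it is treated as a directory.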

          Activity

          stevel@apache.org Steve Loughran added a comment -

          If you look at the specification of FileSystem.getFileStatus(), we say "the size of directories must be 0".

          That's an implicit concept in Unix filesystems: an entry points either to data or to a directory.

          It also leads to a follow-on point: you can't have files under another file.

          In object stores it's not so clear-cut, because an object in the path can hold data and still have objects in paths underneath it. That breaks the illusion of files and directories that s3a:// creates and relies on. When s3a sees a structure that isn't consistent with the illusion, it treats it as a warning sign of a structure that may have other problems, and so rejects it.
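To make the inconsistency concrete, here is a minimal sketch that scans a flat set of keys for objects that hold data and also act as a parent of other keys. It assumes an in-memory map from key to content length; the class `TreeConsistency` and the method `dataBearingParents` are hypothetical names, and nothing here reflects how s3a actually performs its checks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative only: hypothetical helper, not part of Hadoop or s3a.
final class TreeConsistency {
    // Returns every key that holds data (length > 0) while another key
    // sits underneath it, i.e. "a file with files under it".
    static List<String> dataBearingParents(Map<String, Long> objects) {
        List<String> offenders = new ArrayList<>();
        for (Map.Entry<String, Long> entry : objects.entrySet()) {
            if (entry.getValue() <= 0L) {
                continue; // empty objects are plain directory markers
            }
            String key = entry.getKey();
            // Treat "a" and "a/" alike: both are parents of "a/b".
            String asParent = key.endsWith("/") ? key : key + "/";
            for (String other : objects.keySet()) {
                if (!other.equals(key) && other.startsWith(asParent)) {
                    offenders.add(key);
                    break;
                }
            }
        }
        return offenders;
    }
}
```

For a store containing "dir1/" with one byte of content plus "dir1/file1", the marker "dir1/" is reported, because it both carries data and has an entry beneath it.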

          swift:// will behave the same way.

          Is it just S3Browse which is creating objects with "/" with actual content in and then expecting them to be a directory? What is it putting in the content? And why?

          brahmareddy Brahma Reddy Battula added a comment -

          Hello Denis Jannot

          Could you please reply to Steve Loughran's question below? We would like to know what makes the difference compared with the AWS Management Console.

          Is it just S3Browse which is creating objects with "/" with actual content in and then expecting them to be a directory? What is it putting in the content? And why?

          djannot Denis Jannot added a comment -

          Sorry for the delay.

          In fact, the S3 API doesn't have any notion of a directory, so each third-party tool has its own way of implementing a directory structure.
          Most of them create keys like /dir1/ without any value; when you then upload a file into this directory, they create a /dir1/file1 key whose value is the content of the file.

          But in any case, s3a shouldn't use the content-length to decide whether a key is a directory.
          Any key ending with / (%2F) should be considered a directory.

          In fact, you only have to do this if you want to display empty directories, because for any directory containing files, you will always have a key like /dir1/file1 and you can use delimiter and prefix to determine the directory structure to display.

          Does that make sense?
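The prefix-and-delimiter idea can be sketched without any S3 client. The code below only imitates the grouping that a delimiter-based listing performs, over an in-memory list of keys; the class `PrefixListing` and the method `children` are hypothetical, the keys are written without a leading slash as an assumption, and details of a real listing call (pagination, how the marker for the prefix itself is reported) are left out.

```java
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Illustrative only: imitates prefix/delimiter grouping over an in-memory key list.
final class PrefixListing {
    // Returns the immediate children of `prefix`. Entries ending with the
    // delimiter are "directories" (common prefixes); the rest are objects.
    static SortedSet<String> children(List<String> keys, String prefix, String delimiter) {
        SortedSet<String> result = new TreeSet<>();
        for (String key : keys) {
            if (!key.startsWith(prefix) || key.equals(prefix)) {
                continue; // outside the prefix, or the marker for the prefix itself
            }
            String rest = key.substring(prefix.length());
            int cut = rest.indexOf(delimiter);
            if (cut < 0) {
                result.add(key); // object directly under the prefix
            } else {
                // Everything up to and including the first delimiter forms one directory entry.
                result.add(prefix + rest.substring(0, cut + delimiter.length()));
            }
        }
        return result;
    }
}
```

With the keys "dir1/file1", "dir1/sub/file2", "dir2/" and "top.txt", listing the empty prefix with "/" as the delimiter yields "dir1/", "dir2/" and "top.txt". Here "dir1/" appears without any marker object, while the empty directory "dir2/" shows up only because its marker key exists.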

          djannot Denis Jannot added a comment -

          Sorry for the delay. Just replied

          stevel@apache.org Steve Loughran added a comment -

          I can see this makes some sense, even though things shouldn't be storing data in what appears to be a directory: the data will be lost on any FS operation that does a rename, or if your proposed "delete parent paths as files are created" notion is applied.

          stevel@apache.org Steve Loughran added a comment -

          The encryption patch of HADOOP-13887 includes a fix for this, because when you turn encryption on, even 0-byte files can gain some entries. We are still going to delete anything with a trailing / without caring whether or not it's a file, so we may want to consider adding a warning note in the release notes there, maybe even in this one, and marking it as an incompatible change.


            People

            • Assignee:
              Unassigned
              Reporter:
              djannot Denis Jannot
            • Votes:
              0
              Watchers:
              7
