Pig
  1. Pig
  2. PIG-3400

FS commands do not work with S3 paths

    Details

    • Type: Bug Bug
    • Status: Open
    • Priority: Major Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: grunt
    • Labels:
      None

      Description

      To reproduce issues, run the following commands w/ S3 paths:

      pig -e 'ls s3://<path>'
      pig -e 'mkdir s3://<path>'
      pig -e 'cp s3://<path1> s3://<path2>'
      pig -e 'mv s3://<path1> s3://<path2>'
      pig -e 'copyToLocal s3://<path>/<file> .'
      pig -e 'copyFromLocal <file> s3://<path>'
      

      As of now, none of these commands works in Apache Pig, whereas they do in EMR Pig. The problem is that in GruntParser, DataStorage is constructed using the default file system provided by configuration, and s3 paths are not recognized.

      Instead, we should construct DataStorage based on the given URL. For example,

      DataStorage dfs = new HDataStorage(new Path(path).toUri(), mConf);
      
      1. PIG-3400-3.patch
        31 kB
        Cheolsoo Park
      2. PIG-3400-2.patch
        10 kB
        Cheolsoo Park
      3. PIG-3400.patch
        10 kB
        Cheolsoo Park

        Activity

        Hide
        Cheolsoo Park added a comment -

        Attached is a patch that implements the proposed fix.

        Show
        Cheolsoo Park added a comment - Attached is a patch that implements the proposed fix.
        Hide
        Cheolsoo Park added a comment -

        All unit tests pass.

        Show
        Cheolsoo Park added a comment - All unit tests pass.
        Hide
        Cheolsoo Park added a comment -

        Minor improvements in a new patch.

        Show
        Cheolsoo Park added a comment - Minor improvements in a new patch.
        Hide
        Rohini Palaniswamy added a comment -

        processCD else condition should have DataStorage dfs = new HDataStorage(new Path(path).toUri(), mConf);

        Show
        Rohini Palaniswamy added a comment - processCD else condition should have DataStorage dfs = new HDataStorage(new Path(path).toUri(), mConf);
        Hide
        Cheolsoo Park added a comment -

        Hi Rohini, thank you for the comment.

        In fact, that's something that I was debating myself. I intentionally didn't do that for PWD and CD. I wasn't sure whether it makes sense to cd to a s3 path. Given constructing a new instance of DataStorage is not free, I wanted to avoid it when not useful. Does this make sense? I can certainly document it though.

        Show
        Cheolsoo Park added a comment - Hi Rohini, thank you for the comment. In fact, that's something that I was debating myself. I intentionally didn't do that for PWD and CD. I wasn't sure whether it makes sense to cd to a s3 path. Given constructing a new instance of DataStorage is not free, I wanted to avoid it when not useful. Does this make sense? I can certainly document it though.
        Hide
        Rohini Palaniswamy added a comment -

        Ok. Got it. That involves changing the dfs in PigContext and ensuring everywhere the default filesystem is s3. It would be good to document it if you do not want to do it. What is the behaviour on EMR pig?

        Show
        Rohini Palaniswamy added a comment - Ok. Got it. That involves changing the dfs in PigContext and ensuring everywhere the default filesystem is s3. It would be good to document it if you do not want to do it. What is the behaviour on EMR pig?
        Hide
        Cheolsoo Park added a comment -

        EMR supports it. For example, I can do something like this:

        pig -e 'cd s3://<path>; ls .'
        

        This is somewhat nice, but I don't find this very useful because you can always use absolute paths instead. I suggest we should simply document which commands work and which commands don't. But please let me know If anyone thinks otherwise.

        Show
        Cheolsoo Park added a comment - EMR supports it. For example, I can do something like this: pig -e 'cd s3: //<path>; ls .' This is somewhat nice, but I don't find this very useful because you can always use absolute paths instead. I suggest we should simply document which commands work and which commands don't. But please let me know If anyone thinks otherwise.
        Hide
        Rohini Palaniswamy added a comment -

        Agree. We can document for now. If anyone wants total compatibility with EMR, then it can be addressed in a separate jira.

        Show
        Rohini Palaniswamy added a comment - Agree. We can document for now. If anyone wants total compatibility with EMR, then it can be addressed in a separate jira.
        Hide
        Cheolsoo Park added a comment -

        Updated the docs. ReviewBoard:
        https://reviews.apache.org/r/13122/

        Show
        Cheolsoo Park added a comment - Updated the docs. ReviewBoard: https://reviews.apache.org/r/13122/
        Hide
        Cheolsoo Park added a comment -

        Canceling patch until we decide what to do.

        See the discussion on dev mailing list.

        Show
        Cheolsoo Park added a comment - Canceling patch until we decide what to do. See the discussion on dev mailing list .

          People

          • Assignee:
            Cheolsoo Park
            Reporter:
            Cheolsoo Park
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:

              Development