Pig
  1. Pig
  2. PIG-3400

FS commands do not work with S3 paths

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: grunt
    • Labels:
      None

      Description

      To reproduce issues, run the following commands w/ S3 paths:

      pig -e 'ls s3://<path>'
      pig -e 'mkdir s3://<path>'
      pig -e 'cp s3://<path1> s3://<path2>'
      pig -e 'mv s3://<path1> s3://<path2>'
      pig -e 'copyToLocal s3://<path>/<file> .'
      pig -e 'copyFromLocal <file> s3://<path>'
      

      As of now, none of these commands works in Apache Pig, whereas they do in EMR Pig. The problem is that in GruntParser, DataStorage is constructed using the default file system provided by configuration, and s3 paths are not recognized.

      Instead, we should construct DataStorage based on the given URL. For example,

      DataStorage dfs = new HDataStorage(new Path(path).toUri(), mConf);
      
      1. PIG-3400-3.patch
        31 kB
        Cheolsoo Park
      2. PIG-3400-2.patch
        10 kB
        Cheolsoo Park
      3. PIG-3400.patch
        10 kB
        Cheolsoo Park

        Activity

        Cheolsoo Park created issue -
        Hide
        Cheolsoo Park added a comment -

        Attached is a patch that implements the proposed fix.

        Show
        Cheolsoo Park added a comment - Attached is a patch that implements the proposed fix.
        Cheolsoo Park made changes -
        Field Original Value New Value
        Attachment PIG-3400.patch [ 12594480 ]
        Hide
        Cheolsoo Park added a comment -

        All unit tests pass.

        Show
        Cheolsoo Park added a comment - All unit tests pass.
        Cheolsoo Park made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Hide
        Cheolsoo Park added a comment -

        Minor improvements in a new patch.

        Show
        Cheolsoo Park added a comment - Minor improvements in a new patch.
        Cheolsoo Park made changes -
        Attachment PIG-3400-2.patch [ 12594782 ]
        Hide
        Rohini Palaniswamy added a comment -

        processCD else condition should have DataStorage dfs = new HDataStorage(new Path(path).toUri(), mConf);

        Show
        Rohini Palaniswamy added a comment - processCD else condition should have DataStorage dfs = new HDataStorage(new Path(path).toUri(), mConf);
        Hide
        Cheolsoo Park added a comment -

        Hi Rohini, thank you for the comment.

        In fact, that's something that I was debating myself. I intentionally didn't do that for PWD and CD. I wasn't sure whether it makes sense to cd to a s3 path. Given constructing a new instance of DataStorage is not free, I wanted to avoid it when not useful. Does this make sense? I can certainly document it though.

        Show
        Cheolsoo Park added a comment - Hi Rohini, thank you for the comment. In fact, that's something that I was debating myself. I intentionally didn't do that for PWD and CD. I wasn't sure whether it makes sense to cd to a s3 path. Given constructing a new instance of DataStorage is not free, I wanted to avoid it when not useful. Does this make sense? I can certainly document it though.
        Hide
        Rohini Palaniswamy added a comment -

        Ok. Got it. That involves changing the dfs in PigContext and ensuring everywhere the default filesystem is s3. It would be good to document it if you do not want to do it. What is the behaviour on EMR pig?

        Show
        Rohini Palaniswamy added a comment - Ok. Got it. That involves changing the dfs in PigContext and ensuring everywhere the default filesystem is s3. It would be good to document it if you do not want to do it. What is the behaviour on EMR pig?
        Hide
        Cheolsoo Park added a comment -

        EMR supports it. For example, I can do something like this:

        pig -e 'cd s3://<path>; ls .'
        

        This is somewhat nice, but I don't find this very useful because you can always use absolute paths instead. I suggest we should simply document which commands work and which commands don't. But please let me know If anyone thinks otherwise.

        Show
        Cheolsoo Park added a comment - EMR supports it. For example, I can do something like this: pig -e 'cd s3: //<path>; ls .' This is somewhat nice, but I don't find this very useful because you can always use absolute paths instead. I suggest we should simply document which commands work and which commands don't. But please let me know If anyone thinks otherwise.
        Hide
        Rohini Palaniswamy added a comment -

        Agree. We can document for now. If anyone wants total compatibility with EMR, then it can be addressed in a separate jira.

        Show
        Rohini Palaniswamy added a comment - Agree. We can document for now. If anyone wants total compatibility with EMR, then it can be addressed in a separate jira.
        Hide
        Cheolsoo Park added a comment -

        Updated the docs. ReviewBoard:
        https://reviews.apache.org/r/13122/

        Show
        Cheolsoo Park added a comment - Updated the docs. ReviewBoard: https://reviews.apache.org/r/13122/
        Cheolsoo Park made changes -
        Attachment PIG-3400-3.patch [ 12595255 ]
        Hide
        Cheolsoo Park added a comment -

        Canceling patch until we decide what to do.

        See the discussion on dev mailing list.

        Show
        Cheolsoo Park added a comment - Canceling patch until we decide what to do. See the discussion on dev mailing list .
        Cheolsoo Park made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Cheolsoo Park made changes -
        Fix Version/s 0.12 [ 12323380 ]
        Cheolsoo Park made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Won't Fix [ 2 ]
        Transition Time In Source Status Execution Times Last Executer Last Execution Date
        Open Open Patch Available Patch Available
        1d 4h 19m 1 Cheolsoo Park 28/Jul/13 04:53
        Patch Available Patch Available Open Open
        4d 20h 38m 1 Cheolsoo Park 02/Aug/13 01:31
        Open Open Resolved Resolved
        401d 3h 4m 1 Cheolsoo Park 07/Sep/14 04:35

          People

          • Assignee:
            Cheolsoo Park
            Reporter:
            Cheolsoo Park
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development