Uploaded image for project: 'Pig'
  1. Pig
  2. PIG-1576

Difference in Semantics between Load statement in Pig and HDFS client on Command line

VotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Invalid
    • 0.6.0, 0.7.0
    • 0.9.0
    • impl
    • None

    Description

      Here is my directory structure on HDFS which I want to access using Pig.
      This is a sample, but in real use case I have more than 100 of these directories.

      $ hadoop fs -ls /user/viraj/recursive/
      Found 3 items
      drwxr-xr-x   - viraj supergroup          0 2010-08-26 11:25 /user/viraj/recursive/20080615
      drwxr-xr-x   - viraj supergroup          0 2010-08-26 11:25 /user/viraj/recursive/20080616
      drwxr-xr-x   - viraj supergroup          0 2010-08-26 11:25 /user/viraj/recursive/20080617
      

      Using the command line I am access them using variety of options:

      $ hadoop fs -ls /user/viraj/recursive/{200806}{15..17}/
      -rw-r--r--   1 viraj supergroup       5791 2010-08-26 11:25 /user/viraj/recursive/20080615/kv2.txt
      -rw-r--r--   1 viraj supergroup       5791 2010-08-26 11:25 /user/viraj/recursive/20080616/kv2.txt
      -rw-r--r--   1 viraj supergroup       5791 2010-08-26 11:25 /user/viraj/recursive/20080617/kv2.txt
      
      $ hadoop fs -ls /user/viraj/recursive/{20080615..20080617}/
      
      -rw-r--r--   1 viraj supergroup       5791 2010-08-26 11:25 /user/viraj/recursive/20080615/kv2.txt
      
      -rw-r--r--   1 viraj supergroup       5791 2010-08-26 11:25 /user/viraj/recursive/20080616/kv2.txt
      
      -rw-r--r--   1 viraj supergroup       5791 2010-08-26 11:25 /user/viraj/recursive/20080617/kv2.txt
      

      I have written a Pig script, all the below combination of load statements do not work?

      --A = load '/user/viraj/recursive/{200806}{15..17}/' using PigStorage('\u0001') as (k:int, v:chararray);
      A = load '/user/viraj/recursive/{20080615..20080617}/' using PigStorage('\u0001') as (k:int, v:chararray);
      AL = limit A 10;
      dump AL;
      

      I get the following error in Pig 0.8

      2010-08-27 16:34:27,704 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
      2010-08-27 16:34:27,711 [main] INFO  org.apache.pig.tools.pigstats.PigStats - Script Statistics: 
      HadoopVersion   PigVersion      UserId  StartedAt       FinishedAt      Features
      0.20.2  0.8.0-SNAPSHOT  viraj   2010-08-27 16:34:24     2010-08-27 16:34:27     LIMIT
      Failed!
      Failed Jobs:
      JobId   Alias   Feature Message Outputs
      N/A     A,AL            Message: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: /user/viraj/recursive/{20080615..20080617}/
              at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:279)
              at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:885)
              at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
              at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
              at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
              at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
              at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
              at java.lang.Thread.run(Thread.java:619)
      Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input Pattern hdfs://localhost:9000/user/viraj/recursive/{20080615..20080617} matches 0 files
              at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:224)
              at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat.listStatus(PigTextInputFormat.java:36)
              at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:241)
              at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:268)
              ... 7 more
              hdfs://localhost:9000/tmp/temp241388470/tmp987803889,
      

      The following works:

      A = load '/user/viraj/recursive/{200806}{15,16,17}/' using PigStorage('\u0001') as (k:int, v:chararray);
      AL = limit A 10;
      dump AL;
      

      Why is there an inconsistency between HDFS client and Pig?

      Viraj

      Attachments

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            Unassigned Unassigned
            viraj Viraj Bhat
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment