Pig / PIG-1576

Difference in Semantics between Load statement in Pig and HDFS client on Command line


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Invalid
    • Affects Version/s: 0.6.0, 0.7.0
    • Fix Version/s: 0.9.0
    • Component/s: impl
    • Labels: None

    Description

      Here is the directory structure on HDFS that I want to access using Pig.
      This is a sample; in the real use case I have more than 100 of these directories.

      $ hadoop fs -ls /user/viraj/recursive/
      Found 3 items
      drwxr-xr-x   - viraj supergroup          0 2010-08-26 11:25 /user/viraj/recursive/20080615
      drwxr-xr-x   - viraj supergroup          0 2010-08-26 11:25 /user/viraj/recursive/20080616
      drwxr-xr-x   - viraj supergroup          0 2010-08-26 11:25 /user/viraj/recursive/20080617
      

      From the command line I can access them using a variety of patterns:

      $ hadoop fs -ls /user/viraj/recursive/{200806}{15..17}/
      -rw-r--r--   1 viraj supergroup       5791 2010-08-26 11:25 /user/viraj/recursive/20080615/kv2.txt
      -rw-r--r--   1 viraj supergroup       5791 2010-08-26 11:25 /user/viraj/recursive/20080616/kv2.txt
      -rw-r--r--   1 viraj supergroup       5791 2010-08-26 11:25 /user/viraj/recursive/20080617/kv2.txt
      
      $ hadoop fs -ls /user/viraj/recursive/{20080615..20080617}/
      -rw-r--r--   1 viraj supergroup       5791 2010-08-26 11:25 /user/viraj/recursive/20080615/kv2.txt
      -rw-r--r--   1 viraj supergroup       5791 2010-08-26 11:25 /user/viraj/recursive/20080616/kv2.txt
      -rw-r--r--   1 viraj supergroup       5791 2010-08-26 11:25 /user/viraj/recursive/20080617/kv2.txt
      

      I have written a Pig script, but neither of the load statements below works:

      --A = load '/user/viraj/recursive/{200806}{15..17}/' using PigStorage('\u0001') as (k:int, v:chararray);
      A = load '/user/viraj/recursive/{20080615..20080617}/' using PigStorage('\u0001') as (k:int, v:chararray);
      AL = limit A 10;
      dump AL;
      

      I get the following error in Pig 0.8:

      2010-08-27 16:34:27,704 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
      2010-08-27 16:34:27,711 [main] INFO  org.apache.pig.tools.pigstats.PigStats - Script Statistics: 
      HadoopVersion   PigVersion      UserId  StartedAt       FinishedAt      Features
      0.20.2  0.8.0-SNAPSHOT  viraj   2010-08-27 16:34:24     2010-08-27 16:34:27     LIMIT
      Failed!
      Failed Jobs:
      JobId   Alias   Feature Message Outputs
      N/A     A,AL            Message: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: /user/viraj/recursive/{20080615..20080617}/
              at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:279)
              at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:885)
              at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
              at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
              at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
              at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
              at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
              at java.lang.Thread.run(Thread.java:619)
      Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input Pattern hdfs://localhost:9000/user/viraj/recursive/{20080615..20080617} matches 0 files
              at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:224)
              at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat.listStatus(PigTextInputFormat.java:36)
              at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:241)
              at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:268)
              ... 7 more
              hdfs://localhost:9000/tmp/temp241388470/tmp987803889,
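
One way to see why the same pattern behaves differently: on the command line, an unquoted `{20080615..20080617}` is a bash sequence expression, so bash expands it into separate paths before `hadoop` starts, whereas Pig hands the string to Hadoop's glob matcher unchanged. As far as I know, that matcher supports `{a,b}` alternation but has no `..` range syntax, which is consistent with the "matches 0 files" message in the trace. A minimal sketch, assuming bash; `show_args` is a hypothetical helper, not part of Hadoop or Pig:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every argument on its own line, i.e. exactly
# what a command such as `hadoop fs -ls` would receive from the shell.
show_args() {
    printf '%s\n' "$@"
}

# Unquoted: bash expands the {m..n} sequence into three separate paths
# before the command runs.
show_args /user/viraj/recursive/{20080615..20080617}/

# Quoted: the pattern is passed through literally, which is the form
# Pig hands to Hadoop.
show_args '/user/viraj/recursive/{20080615..20080617}/'
```

Running `hadoop fs -ls` with the quoted form should be a quick way to check whether the HDFS client itself understands the range, assuming the glob matcher behaves as described.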
      

      The following works:

      A = load '/user/viraj/recursive/{200806}{15,16,17}/' using PigStorage('\u0001') as (k:int, v:chararray);
      AL = limit A 10;
      dump AL;
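
With more than 100 directories, typing the comma list by hand does not scale. A minimal sketch that generates the `{a,b,c}` alternation from a numeric range, so the result could be passed to Pig as a parameter. It assumes the range stays within one month (plain integer counting would otherwise produce dates that do not exist), and `make_glob`, the `input` parameter name, and `load.pig` are hypothetical:

```shell
# Hypothetical helper: turn an inclusive numeric range into a
# comma-separated brace alternation under a base directory.
make_glob() {
    base=$1
    n=$2
    last=$3
    dirs=$n
    # Append each following number until the end of the range is reached.
    while [ "$n" -lt "$last" ]; do
        n=$((n + 1))
        dirs="$dirs,$n"
    done
    printf '%s/{%s}/\n' "$base" "$dirs"
}

make_glob /user/viraj/recursive 20080615 20080617
# prints /user/viraj/recursive/{20080615,20080616,20080617}/

# Assumed usage, with the script containing:  A = load '$input' ...;
# pig -param input="$(make_glob /user/viraj/recursive 20080615 20080617)" load.pig
```

The generated pattern uses only comma alternation, the same construct as the load statement that works.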
      

      Why is there an inconsistency between the HDFS client and Pig?

      Viraj


          People

            Assignee: Unassigned
            Reporter: Viraj Bhat