Pig / PIG-4003

Error is thrown by JobStats.getOutputSize() when storing to a Hive table

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.14.0
    • Component/s: None
    • Labels:
      None

      Description

      Here is an example of a stack trace printed to the console output. Technically, this is a warning message and does not make the job fail, but it is certainly not user-friendly.

      4/06/09 16:20:28 WARN pigstats.JobStats: unable to find the output file
      java.io.FileNotFoundException: File hdfs://10.61.10.185:9000/user/cheolsoop/prodhive.benchmark.unittest_vhs_bitrate_asn_sum_stg_test2 does not exist.
      	at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:654)
      	at org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:102)
      	at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712)
      	at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:708)
      	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
      	at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:708)
      	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.FileBasedOutputSizeReader.getOutputSize(FileBasedOutputSizeReader.java:65)
      	at org.apache.pig.tools.pigstats.JobStats.getOutputSize(JobStats.java:352)
      

      The issue is that FileBasedOutputSizeReader misinterprets the Hive table name as an HDFS path:

      @Override
      public boolean supports(POStore sto, Configuration conf) {
          return UriUtil.isHDFSFileOrLocalOrS3N(getLocationUri(sto), conf);
      }
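The distinction the fix relies on can be illustrated with a small standalone sketch (this is not the actual Pig code; `looksLikeFilePath` and the scheme list are hypothetical): a bare Hive table name such as `prodhive.db.some_table` parses as a URI with no scheme, whereas a real HDFS location starts with `hdfs://`.

```java
import java.net.URI;

public class SchemeCheck {
    // Hypothetical helper: returns true only for URIs whose scheme marks
    // them as file-system locations. A Hive table name has no scheme, so
    // URI.getScheme() returns null and the check fails.
    static boolean looksLikeFilePath(String location) {
        String scheme = URI.create(location).getScheme();
        return scheme != null
                && (scheme.equals("hdfs") || scheme.equals("file") || scheme.equals("s3n"));
    }

    public static void main(String[] args) {
        System.out.println(looksLikeFilePath("hdfs://namenode:9000/user/x/out")); // true
        System.out.println(looksLikeFilePath("prodhive.db.some_table"));          // false
    }
}
```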
      Attachments

      1. PIG-4003-5.patch
        5 kB
        Cheolsoo Park
      2. PIG-4003-4.patch
        4 kB
        Cheolsoo Park
      3. PIG-4003-3.patch
        3 kB
        Cheolsoo Park
      4. PIG-4003-2.patch
        1 kB
        Cheolsoo Park
      5. PIG-4003-1.patch
        3 kB
        Cheolsoo Park

        Activity

        Cheolsoo Park added a comment -

        The attached patch addresses two issues:

        1. MRJobStats.addOutputStatistics() has redundant code. It handles the case of # of stores == 1 separately, which is not only unnecessary but also confusing since it adds an extra code path.
        2. Make FileBasedOutputSizeReader.supports() return false for Hive table names. If there is no scheme in the URI, assume it is not an HDFS path.
        Cheolsoo Park added a comment -

        I reverted #1 due to a regression: 10 unit test cases fail.

        I also found a better way of doing #2. Now HadoopShims.hasFileSystemImpl() returns false when the scheme is null. Fixed in both the 20 and 23 shims.

        Cheolsoo Park added a comment -

        Discussed with Rohini offline.

        1. We should keep HadoopShims.hasFileSystemImpl() as is, i.e. when no scheme is given, we should assume it's an HDFS file, following Hadoop convention.
        2. Instead, I introduced a new property, pig.stats.output.size.reader.unsupported, via which certain store funcs can be excluded.

        Attaching the patch.
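The exclusion-list idea above can be sketched as follows (a minimal standalone illustration, assuming a comma-separated class list under the property key; the actual Pig implementation may differ):

```java
import java.util.Arrays;
import java.util.Properties;

public class OutputSizeReaderExclusion {
    static final String UNSUPPORTED_KEY = "pig.stats.output.size.reader.unsupported";

    // Hypothetical check: a store func class is excluded from output-size
    // reading if its fully qualified name appears in the comma-separated
    // list configured under UNSUPPORTED_KEY.
    static boolean isUnsupported(Properties conf, String storeFuncClass) {
        String list = conf.getProperty(UNSUPPORTED_KEY, "");
        return Arrays.asList(list.split(",")).contains(storeFuncClass);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(UNSUPPORTED_KEY,
                "org.apache.hcatalog.pig.HCatStorer,org.apache.hive.hcatalog.pig.HCatStorer");
        System.out.println(isUnsupported(conf, "org.apache.hive.hcatalog.pig.HCatStorer")); // true
        System.out.println(isUnsupported(conf, "org.apache.pig.builtin.PigStorage"));       // false
    }
}
```

With this shape, a store func on the list would simply report an output size of -1 instead of triggering the FileNotFoundException warning shown in the description.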

        Rohini Palaniswamy added a comment - edited

        Patch looks good. Can you add the property to pig-default.properties with org.apache.hcatalog.pig.HCatStorer,org.apache.hive.hcatalog.pig.HCatStorer as the default values, since they are well-known ones, and also add a unit test?

        Cheolsoo Park added a comment -

        Incorporated Rohini's comments.

        Rohini Palaniswamy added a comment -

        Can you add a unit test to TestMRJobStats? We can just pass PigStorage in the unsupported list and verify that the output size in jobstats is -1.

        Cheolsoo Park added a comment -

        Of course. Added a test case as suggested.

        Rohini Palaniswamy added a comment -

        Thanks Cheolsoo. +1

        Cheolsoo Park added a comment -

        Committed to trunk. Thank you Rohini for reviewing the patch!


          People

          • Assignee: Cheolsoo Park
          • Reporter: Cheolsoo Park
          • Votes: 0
          • Watchers: 2
