Pig
  1. Pig
  2. PIG-2856

AvroStorage doesn't load files in the directories when a glob pattern matches both files and directories.

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.11
    • Fix Version/s: 0.11
    • Component/s: piggybank
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      This is a regression from PIG-2492.

      When a glob pattern such as '*' matches not only files but also directories, AvroStorage does not load files in the directories. This is a bug in getAllSubDirs() that can be fixed as follows:

      static boolean getAllSubDirs(Path path, Job job, Set<Path> paths)
      ...
      FileStatus[] matchedFiles = fs.globStatus(path, PATH_FILTER);
      ...
      for (FileStatus file : matchedFiles) {
          if (file.isDir()) {
      -        for (FileStatus sub : fs.listStatus(path)) {
      +        for (FileStatus sub : fs.listStatus(file.getPath())) {
                  getAllSubDirs(sub.getPath(), job, paths);
              }
          }
      }
      
      1. PIG-2856-2.patch
        3 kB
        Cheolsoo Park
      2. PIG-2856.patch
        2 kB
        Cheolsoo Park

        Activity

        Hide
        Cheolsoo Park added a comment -

        Attach is a patch that fixes the bug in getAllSubDirs() and updates the unit test testGlob1.

        Regarding the test, "expected_test_dir_1.avro" includes files in test_dir1 but doesn't include ones in its sub-directory test_subdir. On the other hand, "expected_testDir.avro" includes files not only test_dir1 but also its sub-directory test_subdir.

        Since all files in test_dir1 and its sub-directory are supposed to be loaded, "expected_testDir.avro" is used.

        Show
        Cheolsoo Park added a comment - Attach is a patch that fixes the bug in getAllSubDirs() and updates the unit test testGlob1. Regarding the test, "expected_test_dir_1.avro" includes files in test_dir1 but doesn't include ones in its sub-directory test_subdir. On the other hand, "expected_testDir.avro" includes files not only test_dir1 but also its sub-directory test_subdir. Since all files in test_dir1 and its sub-directory are supposed to be loaded, "expected_testDir.avro" is used.
        Hide
        Cheolsoo Park added a comment -
        Show
        Cheolsoo Park added a comment - Review board: https://reviews.apache.org/r/6318/
        Hide
        Cheolsoo Park added a comment -

        Regarding why this problem was not caught by testGlob1, there are actually two reasons:

        1. The expected output was incorrect (as mentioned above).
        2. The job status was not checked at all. So even though the job failed, the test still passed if it generated the expected output. In testGlob1, the job failed after loading 3 files, but since that happened to be the expected output, the test still passed.

        I've updated the patch so that not only is the expected output for testGlob1 updated, but the job status also is checked.

        Thanks!

        Show
        Cheolsoo Park added a comment - Regarding why this problem was not caught by testGlob1, there are actually two reasons: The expected output was incorrect (as mentioned above). The job status was not checked at all. So even though the job failed, the test still passed if it generated the expected output. In testGlob1, the job failed after loading 3 files, but since that happened to be the expected output, the test still passed. I've updated the patch so that not only is the expected output for testGlob1 updated, but the job status also is checked. Thanks!
        Hide
        Santhosh Srinivasan added a comment -

        All unit tests pass except TestDBStorage (Hadoop 20) and TestMultiStorage (Hadoop 20 and Hadoop 23). Patch has been committed. Thanks Cheolsoo!

        Show
        Santhosh Srinivasan added a comment - All unit tests pass except TestDBStorage (Hadoop 20) and TestMultiStorage (Hadoop 20 and Hadoop 23). Patch has been committed. Thanks Cheolsoo!
        Hide
        Santhosh Srinivasan added a comment -

        Forgot to add that TestLookupInFiles in Hadoop 23 is erroring out.

        Show
        Santhosh Srinivasan added a comment - Forgot to add that TestLookupInFiles in Hadoop 23 is erroring out.
        Hide
        Santhosh Srinivasan added a comment -

        Addendum to previous comment - its unrelated to this patch and existed previously.

        Show
        Santhosh Srinivasan added a comment - Addendum to previous comment - its unrelated to this patch and existed previously.

          People

          • Assignee:
            Cheolsoo Park
            Reporter:
            Cheolsoo Park
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development