Pig
  1. Pig
  2. PIG-2924

PigStats should not be assuming all Storage classes to be file-based storage

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.9.2, 0.10.0
    • Fix Version/s: 0.12.0
    • Component/s: tools
    • Labels:
      None

      Description

      Using PigStatsUtil (like Oozie does) to collect JobStats for jobs that use a HBaseStorage blows up when the stats are asked to be accumulated.

      This is because JobStats (which adds stuff up) is assuming all storages are file based and that it can do listStatus/etc. operations on their filespec-provided filename. For HBaseStorage, this is set to the tablename and there's no such file, leading to an exception (FileNotFound or Invalid URI - depending on using 'tablename' or 'hbase://tablename').

      1. PIG-2924-5.patch
        20 kB
        Cheolsoo Park
      2. PIG-2924-4.patch
        19 kB
        Cheolsoo Park
      3. PIG-2924-3.patch
        18 kB
        Cheolsoo Park
      4. PIG-2924-2.patch
        19 kB
        Cheolsoo Park
      5. PIG-2924.patch
        18 kB
        Cheolsoo Park

        Issue Links

          Activity

          Hide
          Harsh J added a comment -

          A straight forward fix may be to intercept the FileNotFound issues. A proper fix may be to have storages report if they are file-based or otherwise?

          Show
          Harsh J added a comment - A straight forward fix may be to intercept the FileNotFound issues. A proper fix may be to have storages report if they are file-based or otherwise?
          Hide
          Bill Graham added a comment -

          We ran into similar issues with HCatalog and reducer estimation (PIG-2573, PIG-2574), since an HDFS path was assumed.

          For this issue we could register different classes that know how to look up (or not support) stats based on the URI prefix of the data location (hdfs, hbase, s3, etc).

          Show
          Bill Graham added a comment - We ran into similar issues with HCatalog and reducer estimation ( PIG-2573 , PIG-2574 ), since an HDFS path was assumed. For this issue we could register different classes that know how to look up (or not support) stats based on the URI prefix of the data location (hdfs, hbase, s3, etc).
          Hide
          Cheolsoo Park added a comment -

          I am attaching a patch that implements Bill's suggestion. To make the computation of the output size plugable, I did the following:

          • Added a new interface called PigStatsOutputSizeComputer.
          • Added a default implementation of this interface called FileBasedOutputSizeComputer.
          • Added a new flag via which a custom output size computer can be registered.
          • Added unit test cases for both file-based and non-file-based systems.

          Basically, I followed the pattern that Bill introduced for the reducer estimator in PIG-2574. Any comments would be appreciated.

          Thanks!

          Show
          Cheolsoo Park added a comment - I am attaching a patch that implements Bill's suggestion. To make the computation of the output size plugable, I did the following: Added a new interface called PigStatsOutputSizeComputer. Added a default implementation of this interface called FileBasedOutputSizeComputer. Added a new flag via which a custom output size computer can be registered. Added unit test cases for both file-based and non-file-based systems. Basically, I followed the pattern that Bill introduced for the reducer estimator in PIG-2574 . Any comments would be appreciated. Thanks!
          Hide
          Cheolsoo Park added a comment -

          Adding the new property (pig.stats.output.size.computer) to the default pig.properties file with comments.

          Show
          Cheolsoo Park added a comment - Adding the new property (pig.stats.output.size.computer) to the default pig.properties file with comments.
          Hide
          Bill Graham added a comment -

          This looks great, thanks for taking this one. I think we need to make a few changes to the pattern used PIG-2574 though, because we could have a case where we have multiple store funcs that each write to a different data source.

          • Instead of registering a single new computer it would be ideal if we could register a list of computers.
          • Each computer could have a boolean supports(POStore poStore) method that returns whether this class supports a given POStore. This can often be done by inspecting the output path. A default URI-based abstract class could help with that part.
          • The computers would then be consulted in order, where the first to support the POStore wins.
          • If a computer can't determine a size for some reason (i.e., it doesn't support it or an exception occurred), it shouldn't return 0. Instead maybe we reserve -1 for this case and document it as such.
          • Having the word Computer in the interface name and configs could cause confusion, due to how it's an overloaded term. I don't have any great suggestions though. PigStatsOutputSizeReader?

          Thoughts?

          Show
          Bill Graham added a comment - This looks great, thanks for taking this one. I think we need to make a few changes to the pattern used PIG-2574 though, because we could have a case where we have multiple store funcs that each write to a different data source. Instead of registering a single new computer it would be ideal if we could register a list of computers. Each computer could have a boolean supports(POStore poStore) method that returns whether this class supports a given POStore. This can often be done by inspecting the output path. A default URI-based abstract class could help with that part. The computers would then be consulted in order, where the first to support the POStore wins. If a computer can't determine a size for some reason (i.e., it doesn't support it or an exception occurred), it shouldn't return 0. Instead maybe we reserve -1 for this case and document it as such. Having the word Computer in the interface name and configs could cause confusion, due to how it's an overloaded term. I don't have any great suggestions though. PigStatsOutputSizeReader ? Thoughts?
          Hide
          Cheolsoo Park added a comment -

          Hi Bill, thank you very much for reviewing my patch!

          I totally agree with most of your comments. In particular, adding a supports() method seems like an elegant way to support multiple computers. I will make that change in a new patch.

          But I am wondering if you would agree to remove POStore from the interface. The reason why I want to remove it is because I don't think that POStore is needed to implement supports() and getOutputSize() for any kinds of computers. All we need is probably the uri string, so it seems to make sense to pass the uri string (or a URI object) instead of the whole POStore. Please let me know if you think otherwise.

          Regarding the name of the interface, I couldn't come up with a better name. Reader sounds good to me. Maybe reporter or calculator?

          Thanks!

          Show
          Cheolsoo Park added a comment - Hi Bill, thank you very much for reviewing my patch! I totally agree with most of your comments. In particular, adding a supports() method seems like an elegant way to support multiple computers. I will make that change in a new patch. But I am wondering if you would agree to remove POStore from the interface. The reason why I want to remove it is because I don't think that POStore is needed to implement supports() and getOutputSize() for any kinds of computers. All we need is probably the uri string, so it seems to make sense to pass the uri string (or a URI object) instead of the whole POStore. Please let me know if you think otherwise. Regarding the name of the interface, I couldn't come up with a better name. Reader sounds good to me. Maybe reporter or calculator? Thanks!
          Hide
          Cheolsoo Park added a comment -

          I updated the patch as follows:

          Having the word Computer in the interface name and configs could cause confusion, due to how it's an overloaded term. I don't have any great suggestions though. PigStatsOutputSizeReader?

          Changed to PigStatsOutputSizeReader.

          Instead of registering a single new computer it would be ideal if we could register a list of computers.

          Fixed.

          Each computer could have a boolean supports(POStore poStore) method that returns whether this class supports a given POStore. This can often be done by inspecting the output path. A default URI-based abstract class could help with that part.

          Each reader implements boolean supports(String uri) method. For FileBasedOutputSizeReader, the output of UriUtil.isHDFSFileOrLocalOrS3N() is returned.

          The computers would then be consulted in order, where the first to support the POStore wins.

          Fixed.

          If a computer can't determine a size for some reason (i.e., it doesn't support it or an exception occurred), it shouldn't return 0. Instead maybe we reserve -1 for this case and document it as such.

          Fixed.

          In addition, I replaced POStore with String. Please let me know what you think.

          Thanks!

          Show
          Cheolsoo Park added a comment - I updated the patch as follows: Having the word Computer in the interface name and configs could cause confusion, due to how it's an overloaded term. I don't have any great suggestions though. PigStatsOutputSizeReader? Changed to PigStatsOutputSizeReader . Instead of registering a single new computer it would be ideal if we could register a list of computers. Fixed. Each computer could have a boolean supports(POStore poStore) method that returns whether this class supports a given POStore. This can often be done by inspecting the output path. A default URI-based abstract class could help with that part. Each reader implements boolean supports(String uri) method. For FileBasedOutputSizeReader , the output of UriUtil.isHDFSFileOrLocalOrS3N() is returned. The computers would then be consulted in order, where the first to support the POStore wins. Fixed. If a computer can't determine a size for some reason (i.e., it doesn't support it or an exception occurred), it shouldn't return 0. Instead maybe we reserve -1 for this case and document it as such. Fixed. In addition, I replaced POStore with String . Please let me know what you think. Thanks!
          Hide
          Cheolsoo Park added a comment -

          Updating the comments in pig.properties to make it clear that the user can register multiple readers, and the 1st reader that supports the given uri will be used.

          Show
          Cheolsoo Park added a comment - Updating the comments in pig.properties to make it clear that the user can register multiple readers, and the 1st reader that supports the given uri will be used.
          Hide
          Bill Graham added a comment -

          Sorry for the delay on the review Chelsoo. Looking good. A few more comments. Let me know what you think.

          • I think we need to pass the POStore instead of the location. PigStorage impls provided by random parties might not all abide by a unique namespacing convention in their location syntax. For example, VerticaStorer uses a syntax like " {[db_schema].[table_name]}

            " (curly brackets included). Another implementor could use the same syntax.

          • JobStats.getOuputSize could be simplified by doing this, which is more commonly done:
            String reporterNames = conf.get(
               PigStatsOutputSizeReader.OUTPUT_SIZE_READER_KEY,
               FileBasedOutputSizeReader.class.getCanonicalName());
            
          • Does PigContext.instantiateFuncFromSpec(className) (without appending "()") not work?
          • It seems like it would be reasonable for PigStatsOutputSizeReader.getOutputSize to throw IOException all the way up to JobStats.
          • Let's make DummyOutputSizeReader an inner class of TestJobStats since that package is already totally bloated.
          • In pig.properties reducers' should not have an apostrophe (no possessive for inanimate objects).
          Show
          Bill Graham added a comment - Sorry for the delay on the review Chelsoo. Looking good. A few more comments. Let me know what you think. I think we need to pass the POStore instead of the location. PigStorage impls provided by random parties might not all abide by a unique namespacing convention in their location syntax. For example, VerticaStorer uses a syntax like " {[db_schema].[table_name]} " (curly brackets included). Another implementor could use the same syntax. JobStats.getOuputSize could be simplified by doing this, which is more commonly done: String reporterNames = conf.get( PigStatsOutputSizeReader.OUTPUT_SIZE_READER_KEY, FileBasedOutputSizeReader.class.getCanonicalName()); Does PigContext.instantiateFuncFromSpec(className) (without appending "()") not work? It seems like it would be reasonable for PigStatsOutputSizeReader.getOutputSize to throw IOException all the way up to JobStats . Let's make DummyOutputSizeReader an inner class of TestJobStats since that package is already totally bloated. In pig.properties reducers' should not have an apostrophe (no possessive for inanimate objects).
          Hide
          Cheolsoo Park added a comment -

          Thank you very much for reviewing my patch, Bill!

          I agree that passing POStore is better. I also agree with the other points and incorporated them in this new patch. Please let me know what you think.

          Show
          Cheolsoo Park added a comment - Thank you very much for reviewing my patch, Bill! I agree that passing POStore is better. I also agree with the other points and incorporated them in this new patch. Please let me know what you think.
          Hide
          Cheolsoo Park added a comment -

          Bill gave +1 in the RB:
          https://reviews.apache.org/r/8122/

          Committed to trunk.

          Show
          Cheolsoo Park added a comment - Bill gave +1 in the RB: https://reviews.apache.org/r/8122/ Committed to trunk.

            People

            • Assignee:
              Cheolsoo Park
              Reporter:
              Harsh J
            • Votes:
              1 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development