PIG-1722

PiggyBank AllLoader - Load multiple file formats in one load statement


    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.9.0
    • Component/s: None
    • Labels:


      This adds the ability to point a single loader at a directory and have files of multiple formats loaded and used in the same query.

      ----- Overview -----

      Let's say we have a directory, /logs/, containing files in several different formats.

      Loading these currently requires multiple loaders and multiple LOAD statements in Pig, after which the query has to perform a UNION on the results.

      With this Loader the query becomes:
      a = LOAD '/logs/' USING org.apache.pig.piggybank.storage.AllLoader();

      The AllLoader uses the file.extension.loaders mapping property in $PIG_HOME/conf/pig.properties, which can be set up as:

      file.extension.loaders=gz:org.apache.pig.builtin.PigStorage(),bz2:org.apache.pig.builtin.PigStorage(),lzo:com.twitter.elephantbird.pig.load.LzoTextLoader(), rc:org.apache.pig.piggybank.storage.HiveColumnarLoader()

      Each comma-separated entry of this property takes one of the following forms:

      -> [file extension]:[loader func spec]
      -> [file extension]:[optional path tag]:[loader func spec]
      -> [file extension]:[optional path tag]:[sequence file key value writer class name]:[loader func spec]
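      To make the three entry forms concrete, here is a minimal Python sketch of how such a property value could be split into its fields. It is not the AllLoader's own parsing code: the function names split_entries and parse_entry are hypothetical, and it assumes that entries are separated by commas outside parentheses and that the loader func spec is always the last field.

```python
def split_entries(value):
    """Split a property value on commas that are not inside parentheses."""
    entries, depth, current = [], 0, []
    for ch in value:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            entries.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    tail = "".join(current).strip()
    if tail:
        entries.append(tail)
    return entries


def parse_entry(entry):
    """Return (extension, path_tag, key_class, loader_spec) for one entry.

    Assumption: only the text before the first "(" is split on ":", so
    colons inside constructor arguments are left untouched.
    """
    paren = entry.find("(")
    head, args = (entry, "") if paren < 0 else (entry[:paren], entry[paren:])
    parts = head.split(":")
    if len(parts) == 2:
        ext, loader = parts
        tag = key_class = None
    elif len(parts) == 3:
        ext, tag, loader = parts
        key_class = None
    elif len(parts) == 4:
        ext, tag, key_class, loader = parts
    else:
        raise ValueError("unrecognised mapping entry: " + entry)
    # Empty optional fields are normalised to None.
    return (ext.strip(), tag or None, key_class or None, loader.strip() + args)


VALUE = (
    "gz:org.apache.pig.builtin.PigStorage(),"
    "bz2:org.apache.pig.builtin.PigStorage(),"
    "lzo:com.twitter.elephantbird.pig.load.LzoTextLoader(), "
    "rc:org.apache.pig.piggybank.storage.HiveColumnarLoader()"
)
PARSED = [parse_entry(e) for e in split_entries(VALUE)]
```

      With the example value from this issue, PARSED holds four tuples, one per extension, each with no path tag and no key class.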

      ----- File path tagging -----

      Loaders can also be chosen based on folder names in the file path:
      file.extension.loaders=gz:type1:Type1Loader(), gz:type2:Type2Loader()

      If you have /logs/type1/mylog and /logs/type2/mylog, then
      a = LOAD '/logs/' USING org.apache.pig.piggybank.storage.AllLoader();
      will use Type1Loader for mylog in /logs/type1 and Type2Loader for mylog in /logs/type2.
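      The tag lookup can be pictured with a small Python sketch. It assumes that a tag matches when it is equal to one of the directory names in the file's path; the issue does not spell out the exact matching rule, so this is an illustration only, and pick_loader_by_tag and TAGGED are hypothetical names.

```python
from pathlib import PurePosixPath


def pick_loader_by_tag(path, tagged_loaders):
    """Return the loader whose tag equals a directory name in path, else None."""
    # Directory components only; the file name itself is not considered.
    folders = PurePosixPath(path).parent.parts
    for tag, loader in tagged_loaders:
        if tag in folders:
            return loader
    return None


# (path tag, loader func spec) pairs taken from the example mapping above.
TAGGED = [("type1", "Type1Loader()"), ("type2", "Type2Loader()")]

CHOSEN = {
    path: pick_loader_by_tag(path, TAGGED)
    for path in ("/logs/type1/mylog", "/logs/type2/mylog")
}
```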

      ----- File content guessing -----

      If a file has no extension, the AllLoader tries to guess its type from the first three bytes of the file. The following byte values (shown as signed bytes) map to these extensions:

      [ -119, 76, 90 ] = lzo
      [ 31, -117, 8 ] = gz
      [ 66, 90, 104 ] = bz2
      [ 83, 69, 81 ] = seq
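      As an illustration of the byte table above, the following Python sketch reads the first three bytes of a stream and looks them up. It is not the AllLoader's implementation; guess_extension and to_signed are hypothetical names, and the conversion step exists only because Python reads bytes as unsigned values while the table lists signed ones.

```python
import bz2
import gzip
import io

# Signed three-byte signatures from the table above, mapped to extensions.
SIGNATURES = {
    (-119, 76, 90): "lzo",
    (31, -117, 8): "gz",
    (66, 90, 104): "bz2",
    (83, 69, 81): "seq",
}


def to_signed(b):
    """Convert an unsigned byte (0..255) to a signed byte (-128..127)."""
    return b - 256 if b > 127 else b


def guess_extension(stream):
    """Return the extension matching the first three bytes, or None."""
    head = stream.read(3)
    if len(head) < 3:
        return None
    return SIGNATURES.get(tuple(to_signed(b) for b in head))


# Usage: compress some bytes in memory and guess the type from the header.
GZ_GUESS = guess_extension(io.BytesIO(gzip.compress(b"hello")))
BZ2_GUESS = guess_extension(io.BytesIO(bz2.compress(b"hello")))
```

      A file whose first three bytes match none of the four signatures yields None here; how the AllLoader treats that case is not stated in this issue.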

      ----- Loader selection based on sequence file writer class -----

      Loaders can also be selected based on the value returned by getKeyClassName for a sequence file.
      For example, the mapping can be configured to use the HiveColumnarLoader for all sequence files that were written with org.apache.hadoop.hive.ql.io.RCFile as the KeyClassName.

      Any $ suffix (for example an inner class name) is removed from the value returned by getKeyClassName.
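      The $ handling can be sketched in a few lines of Python. This is only an illustration under stated assumptions: strip_dollar_suffix and pick_loader_by_key_class are hypothetical names, the lookup table is a plain dictionary, and the input class name RCFile$KeyBuffer is used as an example of a name carrying a $ suffix.

```python
def strip_dollar_suffix(class_name):
    """Drop everything from the first "$" onwards."""
    return class_name.split("$", 1)[0]


def pick_loader_by_key_class(key_class_name, loaders_by_class):
    """Look up a loader by key class name after removing any $ suffix."""
    return loaders_by_class.get(strip_dollar_suffix(key_class_name))


# Hypothetical lookup table built from a mapping entry of the fourth form.
LOADERS_BY_CLASS = {
    "org.apache.hadoop.hive.ql.io.RCFile":
        "org.apache.pig.piggybank.storage.HiveColumnarLoader()",
}

LOADER = pick_loader_by_key_class(
    "org.apache.hadoop.hive.ql.io.RCFile$KeyBuffer", LOADERS_BY_CLASS
)
```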

      ----- Path Partition Handling -----

      Hive-style partitioning is supported in the loader itself, so that if you have /logs/type=1, /logs/type=2 and /logs/type=3,
      the partition column will be recognised as "type" and filtering can be done with expressions such as type<=2.

      In the current implementation, filter expressions should be passed to the AllLoader's constructor, e.g.

      a = LOAD '/logs/' USING org.apache.pig.piggybank.storage.AllLoader('type<=2');

      which will load only the files in /logs/type=1 and /logs/type=2.
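      The effect of such a filter on a directory listing can be sketched in Python. This is a simplified illustration, not the AllLoader's expression handling: it assumes a single comparison of the form column-operator-integer, and partition_values and matches are hypothetical names.

```python
import operator
import re

# Comparison operators supported by this sketch (an assumption, not a
# statement about what the AllLoader accepts).
OPS = {
    "<=": operator.le,
    ">=": operator.ge,
    "!=": operator.ne,
    "==": operator.eq,
    "<": operator.lt,
    ">": operator.gt,
}


def partition_values(path):
    """Collect key=value directory names from a path into a dict."""
    return dict(part.split("=", 1) for part in path.split("/") if "=" in part)


def matches(path, expression):
    """Return True if the path's partition value satisfies the expression."""
    m = re.fullmatch(r"\s*(\w+)\s*(<=|>=|!=|==|<|>)\s*(-?\d+)\s*", expression)
    if m is None:
        raise ValueError("unsupported filter expression: " + expression)
    column, op, literal = m.groups()
    values = partition_values(path)
    if column not in values:
        return False
    # Assumption: partition values are integers, as in the example paths.
    return OPS[op](int(values[column]), int(literal))


PATHS = ["/logs/type=1/a", "/logs/type=2/b", "/logs/type=3/c"]
SELECTED = [p for p in PATHS if matches(p, "type<=2")]
```

      Here SELECTED keeps only the paths under /logs/type=1 and /logs/type=2, mirroring the LOAD statement above.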


        1. PIG-1722.patch
          58 kB
          Gerrit Jansen van Vuuren



            • Assignee:
              gerritjvv Gerrit Jansen van Vuuren


              • Created: