Pig / PIG-4697

Serialize relevant part of the udfcontext per vertex to reduce payload size


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.16.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      HCatLoader and HCatStorer put a large amount of data into the UDFContext. When a Pig script uses several of them, both the payload sent to the Tez AM and the payload the Tez AM sends to each task become very large, causing "RPC limit exceeded" errors and OOM failures, respectively. If Pig serializes only the part of the UDFContext that each vertex requires, the payload shrinks considerably. The HCat developers are also looking at cleaning up what goes into the conf (it currently ends up serializing the whole job conf, not just hive-site.xml) and at moving the common part out so that it is shared by all HCat loaders and storers.

      We are also looking at other options for faster and more compact serialization; separate JIRAs will be created for those. PIG-4653 will be used to clean up all Pig configuration other than the UDFContext.
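
      The per-vertex idea described above can be illustrated with a minimal, standalone sketch. It does not use Pig's actual UDFContext or Tez APIs and is not taken from the attached patches: the class name `UdfContextSlicer`, the method names, and the signature keys (`loaderA`, etc.) are all hypothetical, and plain Java serialization stands in for whatever Pig really uses. It only shows that serializing the entries one vertex needs, rather than the whole map, reduces the payload.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import java.util.Set;

// Hypothetical stand-in for a UDF context: one Properties object per UDF signature.
final class UdfContextSlicer {

    // Keep only the entries whose signature is used by a given vertex.
    static Map<String, Properties> sliceForVertex(Map<String, Properties> full,
                                                  Set<String> signaturesInVertex) {
        Map<String, Properties> slice = new HashMap<>();
        for (Map.Entry<String, Properties> entry : full.entrySet()) {
            if (signaturesInVertex.contains(entry.getKey())) {
                slice.put(entry.getKey(), entry.getValue());
            }
        }
        return slice;
    }

    // Serialize with plain Java serialization (an assumption made for this sketch).
    static byte[] serialize(Map<String, Properties> context) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new HashMap<>(context));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        Map<String, Properties> full = new HashMap<>();
        for (String signature : new String[] {"loaderA", "loaderB", "storerC"}) {
            char[] filler = new char[10_000];
            Arrays.fill(filler, 'x');
            Properties props = new Properties();
            // Simulates a large per-UDF blob such as a serialized job conf.
            props.setProperty("conf.blob", new String(filler));
            full.put(signature, props);
        }

        byte[] wholeContext = serialize(full);
        byte[] oneVertex = serialize(sliceForVertex(full, Collections.singleton("loaderA")));
        System.out.println("whole context: " + wholeContext.length + " bytes");
        System.out.println("one vertex:    " + oneVertex.length + " bytes");
    }
}
```

      In this toy setup a vertex that contains only one loader ships roughly a third of the full context. The real saving depends on how many loaders and storers the script has and how they are distributed across vertices.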

      Attachments

        1. PIG-4697-1.patch
          37 kB
          Rohini Palaniswamy
        2. PIG-4697-2.patch
          45 kB
          Rohini Palaniswamy
        3. PIG-4697-fixunittests.patch
          3 kB
          Rohini Palaniswamy


            People

              Assignee: Rohini Palaniswamy
              Reporter: Rohini Palaniswamy