Pig / PIG-3648

Make the sample size for RandomSampleLoader configurable

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.13.0
    • Component/s: impl
    • Labels: None

      Description

      Pig uses RandomSampleLoader for range partitioning in order-by. But since the sample size is hardcoded as 100, volatility in the variance of the results increases when sorting a large number of rows (e.g. 10M+ per task).

      It would be nice if the sample size could be configurable via Pig properties.
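      For illustration only (a sketch, not the actual implementation; "pig.random.sampler.sample.size" is the property name introduced by the patch attached below, and the exact plumbing may differ), the sample size could then be passed in like any other Pig property, e.g. programmatically via PigServer:

{code:java}
import java.util.Properties;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class ConfigurableSampleSizeExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Hypothetical value; the default of 100 sampled rows per task is unchanged.
        props.setProperty("pig.random.sampler.sample.size", "1000");

        PigServer pig = new PigServer(ExecType.MAPREDUCE, props);
        pig.registerQuery("A = LOAD 'input' AS (key:chararray, value:long);");
        pig.registerQuery("B = ORDER A BY key;");   // order-by triggers the sampling job
        pig.store("B", "output");
    }
}
{code}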

      Attachments

      1. PIG-3648-1.patch (3 kB, Cheolsoo Park)

        Activity

        Aniket Mokashi added a comment -

        Thanks Cheolsoo Park, it would be great if you could share rough numbers on the benefits of this setting. That would give us some guidance on configuring this value.

        Cheolsoo Park added a comment -

        Committed to trunk. Thank you Daniel!

        Daniel Dai added a comment -

        +1, that should be fine.

        Cheolsoo Park added a comment -

        Actually, this turned out to be helpful. Since I am not changing the default behavior (i.e. the number of sampled rows per task is still 100), I think we can commit this to 0.13. Does anyone disagree?

        Cheolsoo Park added a comment -

        I suspect that there is not much benefit in keeping this configurable.

        I am deploying this at work to let my users experiment. I will let you know whether this helps or not. In the meantime, we can leave it as is.

        Rohini Palaniswamy added a comment -

        I don't think it is possible to configure it right unless we store statistics on the total number of records (using something like hraven) and use that to determine the sample size as a proportion dynamically. Otherwise the best option is to let the user specify a sample size as we don't know the number of records until the map completes.

        On a different note, while checking the code to confirm that the samples only contain the order-by columns, I saw that MR does RandomSampleLoader -> Foreach (to project the sort columns) because it is a loader. In Tez, Daniel Dai had fixed it to do POForeach -> POReservoirSample, projecting the columns early.

        Aniket Mokashi added a comment -

        Memory is not a concern. I suspect that there is not much benefit in keeping this configurable. Obviously, there is no harm if we configure it right.

        Rohini Palaniswamy added a comment -

        Is memory really a big concern w.r.t. the number of samples? I haven't checked, but I am assuming we will only have the order-by key in the sample and not the entire tuple, and that would not amount to much in terms of memory. Is that wrong?

        Aniket Mokashi added a comment - edited

        We are using reservoir sampling here, with the assumption that the samples fit in memory. My only question/concern is how much benefit increasing the sample size provides here.
        In your example, 100 samples on 13M rows had 10x skew. Do 200 samples bring it down to 5x skew or less? If so, doing this definitely makes sense.
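        As a reference for the discussion, a minimal sketch of the reservoir-sampling idea (plain Algorithm R under the stated in-memory assumption, not Pig's actual RandomSampleLoader/POReservoirSample code):

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Minimal Algorithm R sketch: keeps a uniform sample of at most k items in memory. */
public class ReservoirSketch<T> {
    private final List<T> reservoir = new ArrayList<T>();
    private final Random rand = new Random();
    private final int k;
    private long seen = 0;

    public ReservoirSketch(int k) {
        this.k = k;
    }

    public void add(T item) {
        seen++;
        if (reservoir.size() < k) {
            reservoir.add(item);                         // fill the reservoir first
        } else {
            long j = (long) (rand.nextDouble() * seen);  // uniform index in [0, seen)
            if (j < k) {
                reservoir.set((int) j, item);            // replace with probability k/seen
            }
        }
    }

    public List<T> sample() {
        return reservoir;
    }
}
{code}

        Memory grows with k (the sample size), not with the number of input rows, which is why the sample size rather than the input size is what needs to be bounded.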

        Rohini Palaniswamy added a comment -

        +1. Agree that it obviously makes sense to have a larger sample size for larger data - http://en.wikipedia.org/wiki/Sample_size_determination#Introduction.

        Matt Bossenbroek added a comment -

        I came across this issue while trying to order data with a large number of reducers. The last reducer ended up with 10x the data of the other reducers and took 10x longer to execute.

        I ran some statistical simulations on the selection algo and found that with a small sample size, the likelihood of a less than uniform sample distribution was higher. In my example, it was selecting only 100 rows out of 13M, which wasn't representative of the data.

        I can provide the sample code I was using to test this if needed.
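        A rough simulation along these lines (a sketch only, not the actual test code; the 13M-row and 100-sample figures are taken from the example above) could look like:

{code:java}
import java.util.Arrays;
import java.util.Random;

/**
 * Sketch: draw k sample keys out of n, use the sample quantiles as reducer
 * boundaries, and report the largest partition relative to the ideal n/r rows.
 */
public class SkewSimulation {
    public static void main(String[] args) {
        int n = 13_000_000, k = 100, reducers = 100;
        Random rand = new Random();

        double[] keys = new double[n];
        for (int i = 0; i < n; i++) keys[i] = rand.nextDouble();

        // Simple random sample of k keys, sorted.
        double[] sample = new double[k];
        for (int i = 0; i < k; i++) sample[i] = keys[rand.nextInt(n)];
        Arrays.sort(sample);

        // Every (k / reducers)-th sample value becomes a partition boundary.
        double[] bounds = new double[reducers - 1];
        for (int r = 1; r < reducers; r++) bounds[r - 1] = sample[r * k / reducers];

        // Count how many keys land in each partition.
        long[] counts = new long[reducers];
        for (double key : keys) {
            int pos = Arrays.binarySearch(bounds, key);
            counts[pos >= 0 ? pos : -pos - 1]++;
        }

        long max = Arrays.stream(counts).max().getAsLong();
        System.out.printf("largest partition = %.1fx the ideal size%n",
                max / (double) (n / reducers));
    }
}
{code}

        Re-running with a larger k should shrink the spread of the partition sizes, which is the behavior a configurable sample size is meant to expose.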

        Aniket Mokashi added a comment - edited

        +1.

        There is a typo in the comments - RandomeSampleLoader; otherwise, the patch looks good.

        volatility in the variance of the results increases when sorting a large number of rows

        Can you give an example of when this happens? The sampling algo looks good to me. Also, we want to keep the number of samples small, so that we can replace this mechanism in the future if needed.

        Cheolsoo Park added a comment -

        Attached is a patch that introduces a new property called "pig.random.sampler.sample.size".


          People

          • Assignee: Cheolsoo Park
          • Reporter: Cheolsoo Park
          • Votes: 0
          • Watchers: 5
