Apache Open Climate Workbench
CLIMATE-248

PERFORMANCE - Rebinning Daily to Monthly datasets takes a really long time

    Details

    • Type: Improvement
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.1-incubating, 0.2-incubating
    • Fix Version/s: 1.3.0
    • Labels:
    • Environment: *nix

      Description

      When I was testing the dataset_processor module I noticed that most tests would complete in less than 1 second. Then I came across the "test_daily_to_monthly_rebin" test and it would take over 2 minutes to complete.

      The test initially used a 1x1 degree grid covering the globe and a daily time step for 2 years (730 days).

      I ran some initial checks and the lag appears to be in the code where the data is rebinned, inside '_rcmes_calc_average_on_new_time_unit_K'.

                      mask = np.zeros_like(data)
                      mask[timeunits!=myunit,:,:] = 1.0
                      # Calculate missing data mask within each time unit...
                      datamask_at_this_timeunit = np.zeros_like(data)
                      datamask_at_this_timeunit[:]= process.create_mask_using_threshold(data[timeunits==myunit,:,:],threshold=0.75)
                      # Store results for masking later
                      datamask_store.append(datamask_at_this_timeunit[0])
                      # Calculate means for each pixel in this time unit, ignoring missing data (using masked array).
                      datam = ma.masked_array(data,np.logical_or(mask,datamask_at_this_timeunit))
                      meanstore[i,:,:] = ma.average(datam,axis=0)
      

      That block is suspect since the rest of the code just does simple string parsing and list appends. I don't have time to do a deep dive into this now, and it technically isn't broken, just really slow.

      Attachments

      1. test_monthly_rebin.py (2 kB) - Alex Goodman
      2. test.py (0.6 kB) - Cameron Goodale
      3. inital_profile.txt (55 kB) - Cameron Goodale

        Activity

        agoodman Alex Goodman added a comment - edited

        Cam,

        Here are a few quick improvements I can think of:

        First, there are many times where the code calls:
        np.unique(timeunits)

        You should make this call only once, store the result in a variable, and reuse that variable wherever it is needed.

        Also, remove this line:
        datamask_at_this_timeunit = np.zeros_like(data)

        And change the following line to:
        datamask_at_this_timeunit = process.create_mask_using_threshold(data[timeunits==myunit,:,:],threshold=0.75)

        A key thing you should know about numpy: Constructing new arrays is very expensive, so if it is not necessary then we should not do it. Since process.create_mask_using_threshold() builds a new array anyway, there is no need to allocate a new array with the same shape in advance. Likewise with np.unique(), a new array is allocated each time it gets called, so doing it only once should also save you some time.
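
        For illustration only, here is a minimal sketch of the caching change, reusing the names from the snippet in the description; the surrounding loop is abridged and this is not the actual project code:

        # Sketch: compute the unique time units once, up front, and reuse the result
        # instead of calling np.unique(timeunits) repeatedly.
        unique_units = np.unique(timeunits)
        meanstore = np.zeros([len(unique_units), data.shape[1], data.shape[2]])

        for i, myunit in enumerate(unique_units):
            # ... per-unit masking and averaging as in the snippet above ...
            mask = np.zeros_like(data)
            mask[timeunits != myunit, :, :] = 1.0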

        I don't expect these changes to give you much improvement in performance, maybe about 30% at the most. I suspect that the bottleneck is a result of the rebinning being done in a loop rather than a single vectorized numpy array operation. While it is obvious that python loops are inherently slow, there are other consequences at play too. One is that the output array has to be pre-allocated (line 316):

        meanstore = np.zeros([len(np.unique(timeunits)),data.shape[1],data.shape[2]])

        I have already explained why that will slow things down. Additionally, there is this line inside the loop:
        datamask_at_this_timeunit = process.create_mask_using_threshold(data[timeunits==myunit,:,:],threshold=0.75)

        Specifically, the part where the boolean index is applied: data[timeunits==myunit,:,:]. This is an example of what numpy calls fancy indexing; the distinction is that such a slice of an array is a copy, while regular slice indexing (e.g. data[3:4]) only returns a view of the original array.
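
        As a standalone illustration of that copy-vs-view distinction (not project code):

        import numpy as np

        a = np.arange(10)
        v = a[3:5]              # basic slicing: a view, no data is copied
        v[0] = 99
        print(a[3])             # 99 -- the original array sees the change

        b = np.arange(10)
        c = b[b > 5]            # boolean (fancy) indexing: a new copy is allocated
        c[0] = -1
        print(b[6])             # still 6 -- the original array is untouched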

        Unfortunately, fixing this is not trivial due to the inherent nature of daily data: each month of the year does not have the same number of days. As a result it is not possible to simply reshape the original data and then use one vectorized numpy operation like I did with monthly data. The best way to get around this in numpy that I could think of would be to use a masked array with shape (number_of_months, 31, ...), then have the months with fewer than 31 days hold masked values in the excess days. You can then get the monthly means really easily with something like new_array.mean(axis=1). This is far from an ideal solution though, because (1) inserting the masked values into the original array in the right spots can be tricky, and (2) this cannot be done on the original array in place, so allocating another new array is necessary.
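
        A rough sketch of that padded layout, assuming data has shape (total_days, nlat, nlon) and a hypothetical list days_per_month giving the day count of each consecutive month in the record:

        import numpy as np
        import numpy.ma as ma

        n_months = len(days_per_month)
        # One slot of 31 days per month; everything starts out masked.
        padded = ma.masked_all((n_months, 31, data.shape[1], data.shape[2]))

        start = 0
        for m, ndays in enumerate(days_per_month):
            # Fill the first ndays slots for this month; the excess days stay
            # masked, so they are ignored by the mean below.
            padded[m, :ndays] = data[start:start + ndays]
            start += ndays

        monthly_means = padded.mean(axis=1)    # shape: (n_months, nlat, nlon)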

        The absolute fastest way to do this, then, is to make your own python extension using C, C++, or Fortran, where loops are much faster, and then call a wrapper to the compiled code from python. As we all know, though, this opens another large can of worms and will probably be a lot of work in and of itself. I'd say too much for just this particular issue, but in the long term it might be necessary if we want to support daily data.

        Alex

        cgoodale Cameron Goodale added a comment -

        Alex,

        Thank you for the information; this gives me several avenues to explore to tune up the code. I know that daily data processing is a long-standing desire of users, so this issue will prove critical. Once things stabilize in ocw, perhaps someone (or I) can tackle this issue.

        agoodman Alex Goodman added a comment -

        Some bad news. I actually went and tested a few of these changes, in the order I mentioned them:

        The first change made almost no difference in the running time of the tests, maybe about 1 second or so. The second change saved about 5 seconds but unfortunately doesn't work correctly. I misunderstood how create_mask_using_threshold works, i.e. the shape doesn't match up. That's why pre-allocating the array was necessary.

        (Keep in mind test_dataset_processor.py takes about 30 sec to run on my laptop without any changes)

        So now we can say that the worst case scenario I mentioned is most likely the source of the bottleneck.

        cgoodale Cameron Goodale added a comment - edited

        I created a small module to run the slow portion of the code. To repeat my testing, cd into ocw/tests and copy test.py (attached to this issue) into that dir. The attached inital_profile.txt is the output of running the command below without changing any of the underlying code.

        Then run:

        python -m cProfile -s cumulative test.py > inital_profile.txt
        

        The -s flag is used to sort the results. Full docs are here: http://docs.python.org/2/library/profile.html

        cgoodale Cameron Goodale added a comment -

        After some more digging into the code and using the profiler, I believe the bottleneck is here:

        385    meanstore[i,:,:] = ma.average(datam,axis=0)
        

        Since this call seems to carry the largest cumulative time outside of the calling function:

        ncalls tottime percall cumtime percall filename:lineno(function)
        24 0.628 0.026 8.749 0.365 extras.py:439(average)

        In this test case the function is called 24 times (once for each time step, 24 months), which means that if this function were called on 20 years of data with a monthly bin size (240 times), we could expect a linear increase in runtime, i.e. roughly a 10x increase.

        agoodman Alex Goodman added a comment -

        Hi Cam,

        Very interesting. I see the problem stemming not only from the need to combine fancy indexing with a loop, but also from the fact that masked array arithmetic is somewhat slower than the standard ndarray equivalent. It might be interesting to determine whether changing that line to:

        385    meanstore[i,:,:] = np.average(datam,axis=0)

        would yield a considerable performance increase, even though the result would obviously be wrong. From my own experience, masked arrays can be up to 3x slower.
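
        For reference, a small standalone sketch of the kind of timing comparison being suggested (numbers vary by machine and array size; this is not project code):

        import timeit
        import numpy as np
        import numpy.ma as ma

        data = np.random.rand(31, 180, 360)
        mdata = ma.masked_array(data, mask=np.zeros(data.shape, dtype=bool))

        t_plain = timeit.timeit(lambda: np.average(data, axis=0), number=20)
        t_masked = timeit.timeit(lambda: ma.average(mdata, axis=0), number=20)
        print("plain: %.3fs   masked: %.3fs" % (t_plain, t_masked))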

        cgoodale Cameron Goodale added a comment -

        Alex,

        I was also toying with that idea, trying to see if we could somehow mask the final output (after the average is applied), but I think we have to work with masked data to avoid polluting the average calculation. If the underlying mask is empty, though, we can use the code you suggested. It might be worth testing whether any values are masked and, if not, using numpy.average instead.
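
        A minimal sketch of that check, reusing the names from the snippet in the description (whether this is safe for all inputs would need testing):

        # If nothing is actually masked, fall back to the cheaper plain average;
        # otherwise keep the masked version so missing data stays excluded.
        if datam.mask is ma.nomask or not datam.mask.any():
            meanstore[i, :, :] = np.average(datam.data, axis=0)
        else:
            meanstore[i, :, :] = ma.average(datam, axis=0)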

        agoodman Alex Goodman added a comment -

        Compare running times for slicing vs fancy indexing

        agoodman Alex Goodman added a comment -

        Hi Cam,

        Yes indeed. I have already tried to come up with an alternative approach to averaging a masked array in python, but I'm afraid I can't think of any faster method without resorting to a C or Fortran extension, which, as I said before, is probably overkill for this issue alone.

        However, I have discovered a way to fix another source for the bottleneck, which is the use of fancy indexing in this line:

        380    datamask_at_this_timeunit[:]= process.create_mask_using_threshold(data[timeunits==myunit,:,:],threshold=0.75) 
        

        My solution relies on standard slice indexing, since this has much less overhead than fancy boolean indexing. The trick is to notice that on the first day of a new month, the day attribute of a Python datetime object is always 1. We find the indices in the time array where this occurs and use them to split the data into chunks corresponding to each month. Taking the monthly average is then simply a matter of averaging each individual chunk.

        I attached the code I used to test this as test_monthly_rebin.py. During my own testing, the average speed up after using this method was about 3x. It could be even faster (5x or 6x) if the arrays are not masked. Try incorporating something like this back into the original function and see if that helps.
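
        The attached test_monthly_rebin.py is the reference implementation; below is only a rough sketch of the idea, assuming a list of datetime objects called times aligned with the time axis of data:

        import numpy as np

        # Indices where a new month begins: the datetime day attribute equals 1.
        # (Skip index 0 so np.split does not produce an empty leading chunk.)
        month_starts = [i for i, t in enumerate(times) if t.day == 1 and i > 0]

        # np.split uses plain slicing under the hood, so each chunk is a cheap view;
        # averaging each chunk along the time axis gives the monthly means.
        monthly_means = np.array([chunk.mean(axis=0)
                                  for chunk in np.split(data, month_starts)])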


          People

          • Assignee: cgoodale Cameron Goodale
          • Reporter: cgoodale Cameron Goodale
          • Votes: 0
          • Watchers: 2
