  1. Community Development
  2. COMDEV-163

mailglomper.py takes ages to run

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Component/s: Reporter Tool
    • None

    Description

      mailglomper.py takes a very long time to run (several hours).

      This is mainly because it has to download the last 7 mailboxes for each mailing list; some of these mailboxes can be quite large.

      Most of this is wasted processing because only the mailbox for the current month is ever updated; once a new month starts, emails are added to the new mailbox only and the earlier mailboxes are not updated further.

      It would be more efficient to cache the counts/times for the previous months and use those instead of re-reading them. If the cache entry is missing, then the file is read.
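
      A minimal sketch of that lookup, not the actual mailglomper.py code. The cache layout (a JSON file keyed by "list/YYYYMM" holding per-day counts) and the names load_cache, save_cache, mailbox_counts and fetch are assumptions made for illustration only:

```python
import json
from pathlib import Path


def load_cache(path):
    """Return the cached counts, or an empty dict if no cache file exists."""
    p = Path(path)
    if not p.exists():
        return {}
    with p.open() as f:
        return json.load(f)


def save_cache(path, cache):
    """Write the cache back to disk as JSON."""
    with Path(path).open("w") as f:
        json.dump(cache, f)


def mailbox_counts(cache, list_name, month, current_month, fetch):
    """Return {ISO date: count} for one monthly mailbox.

    `fetch` is a hypothetical callable that downloads and parses a mailbox.
    Past months are served from the cache when present; the current month
    is always re-read because it is still receiving mail.
    """
    key = f"{list_name}/{month}"
    if month != current_month and key in cache:
        return cache[key]
    counts = fetch(list_name, month)
    cache[key] = counts
    return counts
```

      With this shape, a run only downloads the current mailbox plus any past mailbox that is missing from the cache.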

      How much information needs to be cached for each mailbox?
      For exact compatibility with the current code, it would be necessary to store the counts for each day, but if this results in too much storage, then it would be possible to store just the weekly counts. This would not affect the historic weekly stats.

      However, the running quarterly stats currently allocate each email to the quarterly buckets on a daily rather than weekly basis, so some precision would be lost if only the weekly merged counts were available for past months.
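
      To illustrate the precision loss with made-up numbers: the sketch below assumes weeks start on Monday and that a quarterly bucket is delimited by a single boundary date. Both are assumptions, as are the counts and the helper names; the real script may define its buckets differently.

```python
from collections import Counter
from datetime import date, timedelta


def week_start(d):
    """Return the Monday of the week containing d."""
    return d - timedelta(days=d.weekday())


def split_by_boundary(counts, boundary):
    """Sum date-keyed counts into 'previous' (< boundary) and 'current'."""
    totals = Counter()
    for d, n in counts.items():
        totals["current" if d >= boundary else "previous"] += n
    return totals


# Hypothetical daily counts for a week that straddles the boundary.
daily = {date(2015, 3, 30): 3, date(2015, 3, 31): 2, date(2015, 4, 1): 4}

# The same data merged into weekly counts, keyed by the week's Monday.
weekly = Counter()
for d, n in daily.items():
    weekly[week_start(d)] += n

boundary = date(2015, 4, 1)
by_day = split_by_boundary(daily, boundary)    # previous=5, current=4
by_week = split_by_boundary(weekly, boundary)  # previous=9, current=0
```

      With daily counts the 4 emails sent on the boundary day land in the newer bucket; with weekly counts the whole week is attributed to the older one.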

      The cache itself would need managing to ensure that the oldest entries were dropped, otherwise it would grow very large.
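
      One possible way to keep the cache bounded, sketched under the assumption that entries are keyed by "list/YYYYMM" and that only the months the script still reads need to be kept. The function names are hypothetical:

```python
from datetime import date


def recent_months(today, n):
    """Return the n most recent months, newest first, as 'YYYYMM' strings."""
    year, month = today.year, today.month
    months = []
    for _ in range(n):
        months.append(f"{year:04d}{month:02d}")
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return months


def prune_cache(cache, wanted_months):
    """Delete every cache entry whose month is not in wanted_months."""
    wanted = set(wanted_months)
    for key in list(cache):  # copy the keys so we can delete while iterating
        if key.rsplit("/", 1)[1] not in wanted:
            del cache[key]
    return cache
```

      Calling prune_cache(cache, recent_months(date.today(), 7)) at the end of each run would keep at most seven entries per mailing list.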

      Note: since contributions to the weekly buckets may come from more than one month, it's likely not feasible to use the existing data. This is because the current month is processed multiple times, so its data needs to be replaced each time. If its first week overlaps with the last week of the previous month, replacing that week's bucket would lose the previous month's contribution. This problem might even affect daily accumulations; it depends on exactly when the mailboxes are flipped. Having separate cache entries for each monthly mailbox would also make it easier to manage the cache. The downside is that it would require more storage, but the cost of re-reading the historic mailboxes every day is relatively large.
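
      A rough sketch of why per-mailbox entries avoid the overlap problem: the weekly totals are rebuilt from all entries on every run, so replacing the current month's entry cannot discard what the previous month contributed to a shared week. The data, the "YYYYMM" keys and the Monday week start are illustrative assumptions:

```python
from collections import Counter
from datetime import date, timedelta


def weekly_totals(per_mailbox):
    """Merge {mailbox: {ISO date: count}} into {ISO Monday date: count}."""
    totals = Counter()
    for daily in per_mailbox.values():
        for iso_day, n in daily.items():
            d = date.fromisoformat(iso_day)
            monday = d - timedelta(days=d.weekday())
            totals[monday.isoformat()] += n
    return totals


# Hypothetical entries: the week of 2015-03-30 spans two mailboxes.
per_mailbox = {
    "201503": {"2015-03-30": 3, "2015-03-31": 2},
    "201504": {"2015-04-01": 4},
}
first_run = weekly_totals(per_mailbox)

# A later run re-reads the current month and replaces its entry wholesale.
per_mailbox["201504"] = {"2015-04-01": 4, "2015-04-02": 1}
second_run = weekly_totals(per_mailbox)
```

      The March contribution to the shared week survives the replacement because it lives in its own entry.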

          People

            Assignee: Unassigned
            Reporter: Sebb