Crunch (Retired) / CRUNCH-72

Cartesian#cross loads all data in memory per parallel mapper

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • None
    • 0.4.0
    • None
    • None

    Description

      Cartesian#cross currently uses the PTable#cogroup method to join two sets of data together; as a result, all data from both sides of the join is loaded into memory at once. This is a real problem for Cartesian joins because of the quantity of data being joined.

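      Using PTable#join instead of PTable#cogroup should reduce memory usage by roughly 50% when both inputs are of similar size, since only one side of the join then has to be held in memory. That can be the difference between a Cartesian join completing and failing with an OutOfMemoryError (OOME).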
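
      The difference between the two buffering strategies can be sketched in plain Java. This is a simplified model of the idea in the description, not Crunch's actual implementation: the class CrossBufferingSketch, its methods, and the peakBuffered counter are all hypothetical names introduced for illustration, and it assumes a join only needs to hold one side in memory while the other is streamed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration only; none of these names come from Crunch.
public class CrossBufferingSketch {

    /** Pairs produced by a strategy, plus the peak number of input records it buffered. */
    static final class Result {
        // Output is collected in a list only so the sketch can be checked;
        // a real job would emit each pair downstream instead of storing it.
        final List<String> pairs = new ArrayList<>();
        int peakBuffered;
    }

    /** Cogroup-style: both sides are fully materialized before any pair is emitted. */
    static Result cogroupStyle(Iterator<String> left, Iterator<String> right) {
        Result result = new Result();
        List<String> leftBuffer = new ArrayList<>();
        List<String> rightBuffer = new ArrayList<>();
        while (left.hasNext()) {
            leftBuffer.add(left.next());
        }
        while (right.hasNext()) {
            rightBuffer.add(right.next());
        }
        result.peakBuffered = leftBuffer.size() + rightBuffer.size();
        for (String l : leftBuffer) {
            for (String r : rightBuffer) {
                result.pairs.add(l + "," + r);
            }
        }
        return result;
    }

    /** Join-style (assumed behaviour): buffer the left side only and stream the right side. */
    static Result joinStyle(Iterator<String> left, Iterator<String> right) {
        Result result = new Result();
        List<String> leftBuffer = new ArrayList<>();
        while (left.hasNext()) {
            leftBuffer.add(left.next());
        }
        result.peakBuffered = leftBuffer.size();
        while (right.hasNext()) {
            String r = right.next(); // each right-side record is seen once and never stored
            for (String l : leftBuffer) {
                result.pairs.add(l + "," + r);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> left = Arrays.asList("a", "b", "c");
        List<String> right = Arrays.asList("x", "y", "z");
        Result cogroup = cogroupStyle(left.iterator(), right.iterator());
        Result join = joinStyle(left.iterator(), right.iterator());
        System.out.println("cogroup-style peak buffered records: " + cogroup.peakBuffered);
        System.out.println("join-style peak buffered records:    " + join.peakBuffered);
        System.out.println("pairs produced: " + cogroup.pairs.size() + " vs " + join.pairs.size());
    }
}
```

      Both strategies produce the same set of pairs, in a different order. With inputs of equal size the join-style variant buffers half as many input records, which is where the 50% figure would come from under this model.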

          People

            gabriel.reid Gabriel Reid
            Votes: 0
            Watchers: 2
