  Cassandra / CASSANDRA-6268

Poor performance of Hadoop if any DC is using VNodes

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Normal
    • Resolution: Fixed
    • Fix Version/s: 1.2.13, 2.0.4
    • Component/s: None
    • Labels: None

      Description

      Some customers are complaining about the huge number of Hadoop splits caused by vnodes. Disabling vnodes only in the Hadoop DC does not fix it. Splits are generated from the results of describe_ring, which still returns a huge number of ranges because vnodes in the other DCs subdivide the ring. Split generation does not take into account that many of these consecutive ranges reside on the same nodes, the ones we'd like the M/R job to run on.

      The proposed fix:
      1. allows specifying the DC(s) the Hadoop job should run in (in DSE, this defaults to all Hadoop DCs)
      2. merges consecutive ranges before generating Hadoop splits, so there is no artificial range splitting caused by vnodes in the other DCs

      For non-DSE users this feature is turned off by default and doesn't change the old behaviour.
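
The two steps of the proposed fix can be illustrated with a minimal sketch. This is not the code from the attached patches: `TokenRange`, `restrict_to_dcs`, `merge_consecutive`, and the `dc_of` mapping are hypothetical names, and the sketch assumes that two adjacent ranges can be merged when, after dropping replicas outside the job's DC(s), they are served by the same set of nodes. Ranges are assumed to arrive in ring order, and ring wrap-around is ignored.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Set


@dataclass(frozen=True)
class TokenRange:
    # Hypothetical stand-in for one entry of a describe_ring-style result.
    start: int                 # start token of the range
    end: int                   # end token of the range
    endpoints: FrozenSet[str]  # replica nodes holding this range


def restrict_to_dcs(ranges: List[TokenRange],
                    dc_of: Dict[str, str],
                    job_dcs: Set[str]) -> List[TokenRange]:
    """Step 1: keep only the replicas located in the DC(s) the job runs in."""
    return [
        TokenRange(r.start, r.end,
                   frozenset(e for e in r.endpoints if dc_of[e] in job_dcs))
        for r in ranges
    ]


def merge_consecutive(ranges: List[TokenRange]) -> List[TokenRange]:
    """Step 2: merge adjacent ranges that are served by the same replica set.

    Assumes the input is in ring order; wrap-around is not handled.
    """
    merged: List[TokenRange] = []
    for r in ranges:
        last = merged[-1] if merged else None
        if (last is not None
                and last.end == r.start
                and last.endpoints == r.endpoints):
            # Extend the previous range instead of emitting a new one.
            merged[-1] = TokenRange(last.start, r.end, last.endpoints)
        else:
            merged.append(r)
    return merged


# Illustrative ring: nodes a and b are in the "hadoop" DC, x and y in a
# vnode-enabled "cassandra" DC whose tokens subdivide the ring.
dc_of = {"a": "hadoop", "b": "hadoop", "x": "cassandra", "y": "cassandra"}
ring = [
    TokenRange(0, 10, frozenset({"a", "x"})),
    TokenRange(10, 20, frozenset({"a", "y"})),
    TokenRange(20, 30, frozenset({"b", "y"})),
]
splits = merge_consecutive(restrict_to_dcs(ring, dc_of, {"hadoop"}))
```

In this made-up ring, the first two ranges differ only in their replica from the other DC, so once the endpoints are restricted to the "hadoop" DC they collapse into a single range, and the three input ranges yield two candidate splits.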

        Attachments

        1. 6268-src-2.0.txt
          8 kB
          Piotr Kolaczkowski
        2. 6268-src-1.2.txt
          8 kB
          Piotr Kolaczkowski
        3. 6268-thrift-1.2.txt
          311 kB
          Piotr Kolaczkowski
        4. 6268-thrift-2.0.txt
          149 kB
          Piotr Kolaczkowski

            People

            • Assignee:
              pkolaczk Piotr Kolaczkowski
              Reporter:
              pkolaczk Piotr Kolaczkowski
              Authors:
              Piotr Kolaczkowski
              Reviewers:
              Jonathan Ellis
            • Votes:
              0
              Watchers:
              7
