Beam / BEAM-3424

CassandraIO uses 1 split if can't estimate size

Details

    • Type: Bug
    • Status: Resolved
    • Priority: P2
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.5.0
    • Component/s: io-java-cassandra
    • Labels: None

    Description

      See https://stackoverflow.com/questions/48090668/how-to-increase-dataflow-read-parallelism-from-cassandra?noredirect=1#comment83227824_48090668 . When CassandraIO can't estimate size, it falls back to a single split:

      https://github.com/apache/beam/blob/master/sdks/java/io/cassandra/src/main/java/org/apache/beam/sdk/io/cassandra/CassandraServiceImpl.java#L196

      A single split gives the runner nothing to parallelize, which is very poor for performance. We should fall back to a different value. Not sure what a good value would be; probably the largest value that still doesn't introduce too much per-split overhead? E.g. would there be any downside to just changing that number to 100?

      Alternatively/additionally, like in DatastoreIO, CassandraIO could accept the requested number of splits as a parameter.
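
      A rough sketch of how the two ideas could combine: an explicit split count wins if the user supplies one, an unknown size falls back to a fixed default, and otherwise the count is derived from the estimated size. Everything here is hypothetical and for illustration only: the class SplitCountSketch, the method numSplits, and its parameters do not exist in CassandraIO, and the default of 100 is just the number floated above, not a tested value.

```java
import java.util.OptionalInt;

/** Hypothetical helper for illustration; not part of CassandraIO. */
final class SplitCountSketch {

  /** Assumed fallback when the size can't be estimated (the value floated in this issue). */
  static final int DEFAULT_FALLBACK_SPLITS = 100;

  private SplitCountSketch() {}

  /**
   * Picks a split count.
   *
   * @param estimatedSizeBytes estimated size of the data to read, or <= 0 if unknown
   * @param desiredBundleSizeBytes target number of bytes per split, must be > 0
   * @param requestedSplits explicit user-supplied split count, if any
   */
  public static int numSplits(
      long estimatedSizeBytes, long desiredBundleSizeBytes, OptionalInt requestedSplits) {
    if (desiredBundleSizeBytes <= 0) {
      throw new IllegalArgumentException("desiredBundleSizeBytes must be positive");
    }
    // 1. An explicit request from the user always wins.
    if (requestedSplits.isPresent()) {
      int requested = requestedSplits.getAsInt();
      if (requested < 1) {
        throw new IllegalArgumentException("requestedSplits must be at least 1");
      }
      return requested;
    }
    // 2. Size unknown: fall back to a fixed default instead of a single split.
    if (estimatedSizeBytes <= 0) {
      return DEFAULT_FALLBACK_SPLITS;
    }
    // 3. Size known: ceil(estimatedSize / desiredBundleSize), computed without overflow.
    long quotient = estimatedSizeBytes / desiredBundleSizeBytes;
    long remainder = estimatedSizeBytes % desiredBundleSizeBytes;
    long splits = quotient + (remainder == 0 ? 0 : 1);
    return (int) Math.min(Math.max(splits, 1L), (long) Integer.MAX_VALUE);
  }
}
```

      With a 64 MiB target bundle size, an unknown size would yield the fallback of 100 splits, while numSplits(size, bundle, OptionalInt.of(7)) would return 7 regardless of the estimate. Whether 100 is low enough to keep per-split overhead acceptable is still the open question.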

          People

            Assignee: adejanovski (Alexander Dejanovski)
            Reporter: jkff (Eugene Kirpichov)
            Votes: 1
            Watchers: 4
