Description
CqlInputFormat used the number of rows to define the split size in C* versions < 2.2.
The default split size was 64K rows:
private static final int DEFAULT_SPLIT_SIZE = 64 * 1024;
The doc:
* You can also configure the number of rows per InputSplit with
* ConfigHelper.setInputSplitSize. The default split size is 64k rows.
The new split algorithm assumes the split size is in bytes, so by default (or with old configs) it creates really small Hadoop map tasks.
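A minimal sketch of the impact, using illustrative numbers (a hypothetical 10 GiB table with ~200-byte rows; these figures are assumptions, not from the report). It shows how many splits the same default value produces under rows-per-split versus bytes-per-split semantics:

```java
public class SplitSizeDemo {
    // Default from CqlInputFormat; pre-2.2 semantics counted rows per split
    static final long DEFAULT_SPLIT_SIZE = 64 * 1024;

    public static void main(String[] args) {
        // Hypothetical table: 10 GiB of data, ~200 bytes per row (illustrative)
        long tableBytes = 10L * 1024 * 1024 * 1024;
        long avgRowBytes = 200;
        long totalRows = tableBytes / avgRowBytes;

        // Old semantics: the value counts rows per split
        long splitsAsRows = totalRows / DEFAULT_SPLIT_SIZE;
        // New semantics: the same value interpreted as bytes per split
        long splitsAsBytes = tableBytes / DEFAULT_SPLIT_SIZE;

        System.out.println("splits (rows semantics):  " + splitsAsRows);  // ~819
        System.out.println("splits (bytes semantics): " + splitsAsBytes); // 163840
    }
}
```

With the bytes interpretation, the job fans out into hundreds of thousands of 64 KB map tasks, which is where the "really small" tasks come from.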
There are two ways to fix it:
1. Update the doc and increase the default value to something like 16MB.
2. Make C* compatible with the older versions.
I prefer the second option, as it will not surprise people who upgrade from old versions. I do not expect many new users to pick up the Hadoop integration.
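If option 1 were taken, jobs would need to set the split size explicitly via the `ConfigHelper.setInputSplitSize` call already mentioned in the doc. A hedged configuration fragment, assuming bytes semantics and a standard Hadoop `Job` setup (the 16 MB value mirrors the suggestion above and is illustrative):

```java
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Job job = Job.getInstance();
Configuration conf = job.getConfiguration();
// Under the new bytes-based semantics, request ~16 MB per InputSplit
ConfigHelper.setInputSplitSize(conf, 16 * 1024 * 1024);
```

Under the old rows-based semantics the same call would mean ~16M rows per split, which is exactly the ambiguity this issue describes.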