[CASSANDRA-9074] Hadoop Cassandra CqlInputFormat pagination - not reading all input rows - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Low
Resolution: Duplicate
Fix Version/s: 2.0.15
Component/s: None
Labels: None
Environment: Cassandra 2.0.11, Hadoop 1.0.4, DataStax Java cassandra-driver-core 2.1.4
Severity: Low

Description

I have a 7-node Cassandra (v2.0.11) cluster and a table with 10k rows. I run a Hadoop job (the datanodes reside on the Cassandra nodes, of course) that reads data from that table, and I see that only 7k rows are read into the map phase.

I checked the CqlInputFormat source code and noticed that a CQL query is built to select node-local data and that a LIMIT clause is also added (1k by default). So the 7k rows read can be explained:
7 nodes * 1k limit = 7k rows read in total
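
The arithmetic above can be checked with a small model. This is only a sketch under stated assumptions, not Cassandra or Hadoop code: it assumes exactly one input split per node, an even spread of the 10k rows across the 7 nodes, and a single LIMIT-ed query per split that is never continued. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model, not Cassandra code: each split is read with one
// LIMIT-ed query and no follow-up page is ever requested.
class LimitTruncationSketch {

    // Rows returned by a single query against one split when a LIMIT applies.
    static int rowsReadFromSplit(int rowsInSplit, int limit) {
        return Math.min(rowsInSplit, limit);
    }

    // Total rows that reach the map phase across all splits.
    static int totalRowsRead(List<Integer> rowsPerSplit, int limit) {
        int total = 0;
        for (int rows : rowsPerSplit) {
            total += rowsReadFromSplit(rows, limit);
        }
        return total;
    }

    // Spread totalRows over splitCount splits as evenly as possible
    // (assumption: one split per node).
    static List<Integer> evenSplits(int totalRows, int splitCount) {
        List<Integer> splits = new ArrayList<>();
        int base = totalRows / splitCount;
        int remainder = totalRows % splitCount;
        for (int i = 0; i < splitCount; i++) {
            splits.add(base + (i < remainder ? 1 : 0));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<Integer> splits = evenSplits(10_000, 7);
        // Every split holds more than 1,000 rows, so each one is cut off at the limit.
        System.out.println(totalRowsRead(splits, 1_000)); // prints 7000
    }
}
```

Under the same assumptions, a limit of 2,000 would cover every split (each holds about 1,429 rows) and all 10,000 rows would be read, which is why raising the page row size appears to fix the problem for small tables.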

The limit can be changed using CqlConfigHelper:

CqlConfigHelper.setInputCQLPageRowSize(job.getConfiguration(), "1000");

Please help me with the questions below:
Is this the desired behavior?
Why does CqlInputFormat not page through the rest of the rows?
Is it a bug, or should I just increase the InputCQLPageRowSize value?
What if I want to read all the data in a table and do not know the row count?
What if the number of rows I need to read per Cassandra node is very large? In other words, how do I avoid an OOM when setting InputCQLPageRowSize high enough to cover all the data?
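
For reference, the behavior these questions expect from a page size can be sketched as follows. This is a generic illustration, not the CqlInputFormat implementation: `PageSource`, `PagingIterator`, and `inMemorySource` are hypothetical names, and the numeric offset is only a convenience of the in-memory stand-in (CQL has no OFFSET clause). The point is that a reader which keeps fetching until it receives a short page sees every row while holding at most one page in memory, so the page size bounds memory rather than the row count.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical illustration of paged reading; PageSource stands in for
// whatever executes "fetch the next pageSize rows".
class PagedReadSketch {

    interface PageSource<T> {
        // Returns at most pageSize rows starting at offset; empty when done.
        List<T> fetch(int offset, int pageSize);
    }

    // Iterates over every row while keeping only one page in memory.
    static final class PagingIterator<T> implements Iterator<T> {
        private final PageSource<T> source;
        private final int pageSize;
        private List<T> page = new ArrayList<>();
        private int indexInPage = 0;
        private int nextOffset = 0;
        private boolean exhausted = false;

        PagingIterator(PageSource<T> source, int pageSize) {
            if (pageSize <= 0) {
                throw new IllegalArgumentException("pageSize must be positive");
            }
            this.source = source;
            this.pageSize = pageSize;
        }

        @Override
        public boolean hasNext() {
            if (indexInPage < page.size()) return true;
            if (exhausted) return false;
            // Current page is consumed: replace it with the next one.
            page = source.fetch(nextOffset, pageSize);
            indexInPage = 0;
            nextOffset += page.size();
            // A short (or empty) page means the source has no further rows.
            if (page.size() < pageSize) exhausted = true;
            return !page.isEmpty();
        }

        @Override
        public T next() {
            if (!hasNext()) throw new NoSuchElementException();
            return page.get(indexInPage++);
        }
    }

    // In-memory stand-in for the rows owned by one node.
    static PageSource<Integer> inMemorySource(int rowCount) {
        return (offset, pageSize) -> {
            List<Integer> rows = new ArrayList<>();
            int end = Math.min(rowCount, offset + pageSize);
            for (int i = offset; i < end; i++) rows.add(i);
            return rows;
        };
    }

    // Counts all rows by walking the paging iterator to the end.
    static int countAll(PageSource<Integer> source, int pageSize) {
        int count = 0;
        Iterator<Integer> it = new PagingIterator<>(source, pageSize);
        while (it.hasNext()) {
            it.next();
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // 1,428 rows read in pages of 1,000: two fetches, all rows seen.
        System.out.println(countAll(inMemorySource(1_428), 1_000)); // prints 1428
    }
}
```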

Issue Links

duplicates: CASSANDRA-8166 Not all data is loaded to Pig using CqlNativeStorage (Resolved)

People

Assignee: Alex Liu

Reporter: fuggy_yama

Authors: Alex Liu

Votes: 0

Watchers: 3

Dates

Created: 30/Mar/15 19:32

Updated: 16/Apr/19 09:31

Resolved: 02/Apr/15 17:11