Cassandra / CASSANDRA-9074

Hadoop Cassandra CqlInputFormat pagination - not reading all input rows


Details

    • Bug
    • Status: Resolved
    • Low
    • Resolution: Duplicate
    • 2.0.15
    • None
    • None
    • Cassandra 2.0.11, Hadoop 1.0.4, Datastax java cassandra-driver-core 2.1.4

    • Low

    Description

      I have a 7-node Cassandra (v2.0.11) cluster and a table with 10k rows. I run a Hadoop job (the datanodes reside on the Cassandra nodes, of course) that reads data from that table, and I see that only 7k rows are read in the map phase.

      I checked the CqlInputFormat source code and noticed that a CQL query is built to select node-local data, and that a LIMIT clause is added (1k by default). So the 7k rows read can be explained as follows:
      7 nodes * 1k limit = 7k rows read total
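
      The arithmetic above can be checked with a small standalone sketch. It assumes the 10k rows are spread roughly evenly over the 7 nodes and that each node-local query stops after the limit; both are assumptions about this cluster, not confirmed behavior. The class and method names are hypothetical and nothing here calls Cassandra or Hadoop.

```java
class PageLimitEstimate {
    // Rows a job would read if every node-local query returned at most `limit` rows
    // and was never continued (the behavior suspected in this report).
    static long rowsRead(long[] rowsPerNode, long limit) {
        long total = 0;
        for (long rows : rowsPerNode) {
            total += Math.min(rows, limit);
        }
        return total;
    }

    public static void main(String[] args) {
        // Assumed distribution: 10,000 rows spread almost evenly over 7 nodes.
        long[] rowsPerNode = {1429, 1429, 1429, 1429, 1428, 1428, 1428};
        System.out.println(rowsRead(rowsPerNode, 1000)); // capped by the default limit
        System.out.println(rowsRead(rowsPerNode, 2000)); // limit above per-node row count
    }
}
```

      With these assumed numbers the first call gives 7000 and the second gives 10000, which matches the 7k rows observed with the default limit.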

      The limit can be changed using CqlConfigHelper:

      CqlConfigHelper.setInputCQLPageRowSize(job.getConfiguration(), "1000");

      Please help me with the questions below:
      Is this the desired behavior?
      Why does CqlInputFormat not page through the rest of the rows?
      Is it a bug, or should I just increase the InputCQLPageRowSize value?
      What if I want to read all the data in the table and do not know the row count?
      What if the number of rows I need to read per Cassandra node is very large? In other words, how do I avoid an OOM when setting InputCQLPageRowSize very large to handle all the data?
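
      To make the last two questions concrete, here is a minimal sketch of the paging behavior I expected: rows are fetched in fixed-size pages until a short page signals the end, so only one page is held in memory at a time and the row count does not need to be known up front. This is not CqlInputFormat's code. The in-memory list stands in for one node's rows, the offset-based `fetchPage` is a hypothetical simplification (a real CQL reader would continue from a token or paging state), and all names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class PagedReadSketch {
    // Hypothetical stand-in for one page query: returns up to pageSize rows from offset.
    static List<Integer> fetchPage(List<Integer> source, int offset, int pageSize) {
        int end = Math.min(offset + pageSize, source.size());
        return new ArrayList<>(source.subList(offset, end));
    }

    // Reads every row page by page and returns how many rows were seen.
    // pageSize must be positive; at most one page is referenced at a time.
    static long readAll(List<Integer> source, int pageSize) {
        long count = 0;
        int offset = 0;
        while (true) {
            List<Integer> page = fetchPage(source, offset, pageSize);
            count += page.size();
            offset += page.size();
            if (page.size() < pageSize) {
                break; // a short (or empty) page means there is nothing left
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<Integer> nodeRows = new ArrayList<>();
        for (int i = 0; i < 1429; i++) {
            nodeRows.add(i);
        }
        // All 1429 rows are read even though the page size is 1000.
        System.out.println(readAll(nodeRows, 1000));
    }
}
```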

            People

              alexliu68 Alex Liu
              fuggy_yama fuggy_yama
              Votes: 0
              Watchers: 3
