CASSANDRA-1101

A Hadoop Output Format That Targets Cassandra

    Details

      Description

      Currently, there exists a Hadoop-specific input format (viz., ColumnFamilyInputFormat) that allows one to iterate over the rows in a given Cassandra column family and treat them as the input to a Hadoop map task. By the same token, one may need to feed the output of a Hadoop reduce task into a Cassandra column family, for which no mechanism exists today. This calls for a Hadoop-specific output format that accepts a key together with its columns and writes them out to a given column family.

      Here, we describe an output format known as ColumnFamilyOutputFormat, which allows reduce tasks to persist keys and their associated columns as Cassandra rows in a given column family. By default, it prevents overwriting existing rows by ensuring at initialization time that the column family contains no rows matching the given slice predicate. For the sake of speed, it employs a lazy write-back caching mechanism: its record writer batches the mutations it creates from the pairs it receives from the reduce task (in a task-specific map) but stops short of actually mutating the rows. That responsibility falls on its output committer, which makes the changes official by sending a batch mutate request to Cassandra.
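The write-back lifecycle described above can be illustrated with a minimal, self-contained sketch. It is not the code from the attached patch: InMemoryColumnFamily and WriteBackOutput are hypothetical names, an in-memory map stands in for Cassandra, and the emptiness check is simplified to a row-key range rather than a real slice predicate.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a column family: row key -> (column name -> value).
class InMemoryColumnFamily {
    final TreeMap<String, Map<String, String>> rows = new TreeMap<>();

    // True if any stored row key lies in [startKey, endKey].
    boolean hasRowsBetween(String startKey, String endKey) {
        return !rows.subMap(startKey, true, endKey, true).isEmpty();
    }
}

// Hypothetical output that buffers writes and applies them only on commit.
class WriteBackOutput {
    private final InMemoryColumnFamily target;
    private final Map<String, Map<String, String>> pending = new LinkedHashMap<>();

    // Assumption: refuse to start if the target range already holds rows.
    WriteBackOutput(InMemoryColumnFamily target, String startKey, String endKey) {
        if (target.hasRowsBetween(startKey, endKey)) {
            throw new IllegalStateException("target range already contains rows");
        }
        this.target = target;
    }

    // Record-writer role: remember the mutation, do not touch the store.
    void write(String key, String column, String value) {
        pending.computeIfAbsent(key, k -> new LinkedHashMap<>()).put(column, value);
    }

    // Committer role: apply everything that was buffered, return the column count.
    int commit() {
        int applied = 0;
        for (Map.Entry<String, Map<String, String>> row : pending.entrySet()) {
            target.rows.computeIfAbsent(row.getKey(), k -> new LinkedHashMap<>())
                       .putAll(row.getValue());
            applied += row.getValue().size();
        }
        pending.clear();
        return applied;
    }
}
```

In this sketch nothing reaches the store until commit() runs, which mirrors the split between the record writer and the output committer.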

      The record writer, which is called ColumnFamilyRecordWriter, maps the input <key, value> pairs to a Cassandra column family. In particular, it creates a mutation for each column in the value, which it then associates with the key and, in turn, with the responsible endpoint. Note that, given that round trips to the server are fairly expensive, it merely batches the mutations in memory and leaves it to the output committer to send the batched mutations to the server. Furthermore, the writer groups the mutations by the endpoint responsible for the rows being affected. This allows the output committer to execute the mutations in parallel, on an endpoint-by-endpoint basis.
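As a rough illustration of the grouping step, the sketch below batches column mutations first by endpoint and then by row key. The names EndpointBatcher and ColumnMutation are hypothetical, and the key-to-endpoint lookup is passed in as a plain function because the way the patch resolves endpoints is not shown here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical value type: one column to be written.
class ColumnMutation {
    final String column;
    final String value;

    ColumnMutation(String column, String value) {
        this.column = column;
        this.value = value;
    }
}

// Hypothetical batcher: endpoint -> row key -> mutations for that row.
class EndpointBatcher {
    private final Function<String, String> endpointForKey; // assumed ring lookup
    private final Map<String, Map<String, List<ColumnMutation>>> byEndpoint = new HashMap<>();

    EndpointBatcher(Function<String, String> endpointForKey) {
        this.endpointForKey = endpointForKey;
    }

    // Creates one mutation per column and files it under the owning endpoint.
    void write(String key, Map<String, String> columns) {
        String endpoint = endpointForKey.apply(key);
        List<ColumnMutation> rowMutations = byEndpoint
                .computeIfAbsent(endpoint, e -> new HashMap<>())
                .computeIfAbsent(key, k -> new ArrayList<>());
        for (Map.Entry<String, String> c : columns.entrySet()) {
            rowMutations.add(new ColumnMutation(c.getKey(), c.getValue()));
        }
    }

    // The in-memory batches, ready to be handed to a committer.
    Map<String, Map<String, List<ColumnMutation>>> batches() {
        return byEndpoint;
    }
}
```

Keeping the outer map keyed by endpoint means each entry can later be sent to its node independently of the others.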

      The output committer, which is called ColumnFamilyOutputCommitter, traverses the mutations collected by the record writer and sends them to the endpoints responsible for them. Since the total set of mutations is partitioned by endpoint, and each partition can be applied independently of the others, the mutations can be committed using multiple threads, one per endpoint. This reduces the time it takes to propagate the mutations to the server, considering that (a) the client eliminates one network hop that the server would otherwise have had to make and (b) each endpoint node has to deal with only a subset of the total set of mutations.
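The one-thread-per-endpoint commit can be sketched with a standard ExecutorService. This is an illustration under stated assumptions rather than the committer from the patch: ParallelCommitter is a hypothetical name, and the sender callback stands in for whatever call actually delivers a batch to an endpoint.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.BiConsumer;

// Hypothetical committer: sends each endpoint's batch on its own thread.
class ParallelCommitter {
    static <B> void commit(Map<String, B> batchesByEndpoint, BiConsumer<String, B> sender) {
        if (batchesByEndpoint.isEmpty()) {
            return; // nothing to send, and a zero-sized pool is not allowed
        }
        ExecutorService pool = Executors.newFixedThreadPool(batchesByEndpoint.size());
        try {
            List<Future<?>> inFlight = new ArrayList<>();
            for (Map.Entry<String, B> e : batchesByEndpoint.entrySet()) {
                inFlight.add(pool.submit(() -> sender.accept(e.getKey(), e.getValue())));
            }
            // Wait for every endpoint; surface the first failure to the caller.
            for (Future<?> f : inFlight) {
                f.get();
            }
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("commit interrupted", ex);
        } catch (ExecutionException ex) {
            throw new IllegalStateException("commit failed", ex.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Waiting on every Future before returning means the commit only reports success once all endpoints have accepted their batches.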

      For convenience, we also define a default reduce task, called ColumnFamilyOutputReducer, which collects the columns in the input value and maps them to a data structure expected by Cassandra. By default, it assumes the input value to be in the form of a ColumnWritable, which denotes a name-value pair corresponding to a certain column. This reduce task is in turn used by the attached test case, which maps every <key, value> pair in a sample input sequence file to a <key, column> pair, and then reduces them by aggregating the columns corresponding to the same key. Eventually, the batched <key, columns> pairs are written to the column family associated with the output format.
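The aggregation performed by the reduce step amounts to collecting all columns that share a key. The sketch below shows that idea in plain Java without the Hadoop API; NameValue is a hypothetical stand-in for the name-value pair that ColumnWritable is described as carrying, and ColumnAggregator is likewise an illustrative name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a single column's name-value pair.
class NameValue {
    final String name;
    final String value;

    NameValue(String name, String value) {
        this.name = name;
        this.value = value;
    }
}

// Hypothetical aggregator: turns <key, column> pairs into <key, columns> pairs.
class ColumnAggregator {
    static Map<String, List<NameValue>> aggregate(List<Map.Entry<String, NameValue>> pairs) {
        // LinkedHashMap keeps keys in first-seen order, which makes output predictable.
        Map<String, List<NameValue>> columnsByKey = new LinkedHashMap<>();
        for (Map.Entry<String, NameValue> pair : pairs) {
            columnsByKey.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                        .add(pair.getValue());
        }
        return columnsByKey;
    }
}
```

Each resulting entry corresponds to one row: a key and the full list of columns to be written for it.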

      1. CASSANDRA-1101-V5.patch
        49 kB
        Karthick Sankarachary
      2. CASSANDRA-1101-V4.patch
        52 kB
        Karthick Sankarachary
      3. CASSANDRA-1101-V3.patch
        50 kB
        Karthick Sankarachary
      4. CASSANDRA-1101-V2.patch
        42 kB
        Karthick Sankarachary
      5. CASSANDRA-1101-V1.patch
        43 kB
        Karthick Sankarachary
      6. CASSANDRA-1101.patch
        49 kB
        Karthick Sankarachary
      7. 1101-clock-fix.diff
        3 kB
        Stu Hood

        Issue Links

          Activity

          Karthick Sankarachary created issue -
          Karthick Sankarachary made changes -
          Field Original Value New Value
          Attachment CASSANDRA-1101.patch [ 12444703 ]
          Karthick Sankarachary made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Stu Hood made changes -
          Assignee Stu Hood [ stuhood ]
          Karthick Sankarachary made changes -
          Attachment CASSANDRA-1101.patch [ 12444703 ]
          Karthick Sankarachary made changes -
          Attachment CASSANDRA-1101.patch [ 12445128 ]
          Attachment CASSANDRA-1101-V1.patch [ 12445129 ]
          Karthick Sankarachary made changes -
          Attachment CASSANDRA-1101.patch [ 12445128 ]
          Karthick Sankarachary made changes -
          Attachment CASSANDRA-1101-V2.patch [ 12445240 ]
          Karthick Sankarachary made changes -
          Attachment CASSANDRA-1101.patch [ 12445241 ]
          Karthick Sankarachary made changes -
          Attachment CASSANDRA-1101.patch [ 12445241 ]
          Karthick Sankarachary made changes -
          Attachment CASSANDRA-1101-V3.patch [ 12445496 ]
          Karthick Sankarachary made changes -
          Attachment CASSANDRA-1101.patch [ 12445497 ]
          Stu Hood made changes -
          Attachment 1101-clock-fix.diff [ 12446270 ]
          Stu Hood made changes -
          Fix Version/s 0.7 [ 12314533 ]
          Labels cassandra hadoop output format cassandra hadoop output_format
          Affects Version/s 0.6.1 [ 12314867 ]
          Jonathan Ellis made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Karthick Sankarachary made changes -
          Attachment CASSANDRA-1101-V4.patch [ 12446374 ]
          Karthick Sankarachary made changes -
          Attachment CASSANDRA-1101.patch [ 12445497 ]
          Karthick Sankarachary made changes -
          Attachment CASSANDRA-1101.patch [ 12446375 ]
          Karthick Sankarachary made changes -
          Attachment CASSANDRA-1101.patch [ 12446375 ]
          Karthick Sankarachary made changes -
          Attachment CASSANDRA-1101.patch [ 12446411 ]
          Stu Hood made changes -
          Assignee Stu Hood [ stuhood ]
          Jonathan Ellis made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Jonathan Ellis made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Karthick Sankarachary made changes -
          Attachment CASSANDRA-1101.patch [ 12446411 ]
          Karthick Sankarachary made changes -
          Attachment CASSANDRA-1101-V5.patch [ 12447199 ]
          Attachment CASSANDRA-1101.patch [ 12447200 ]
          Jonathan Ellis made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Assignee Karthick Sankarachary [ karthick ]
          Karthick Sankarachary made changes -
          Link This issue depends on CASSANDRA-1131 [ CASSANDRA-1131 ]
          Jonathan Ellis made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Gavin made changes -
          Workflow no-reopen-closed, patch-avail [ 12511058 ] patch-available, re-open possible [ 12752260 ]
          Gavin made changes -
          Workflow patch-available, re-open possible [ 12752260 ] reopen-resolved, no closed status, patch-avail, testing [ 12758219 ]
          Gavin made changes -
          Link This issue depends on CASSANDRA-1131 [ CASSANDRA-1131 ]
          Gavin made changes -
          Link This issue depends upon CASSANDRA-1131 [ CASSANDRA-1131 ]

            People

            • Assignee:
              Karthick Sankarachary
              Reporter:
              Karthick Sankarachary
            • Votes:
              2
              Watchers:
              5
