Cassandra / CASSANDRA-4238

Pig secondary index usage could be improved

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 1.1.1
    • Component/s: Hadoop
    • Labels:
      None

      Description

      As Dmitriy suggested on CASSANDRA-2246, CassandraStorage could implement LoadMetadata.getPartitionKeys and LoadMetadata.setPartitionFilter to automatically apply secondary indexes.

      1. 4238-v3.txt (10 kB, Brandon Williams)
      2. 4238-v2.txt (10 kB, Brandon Williams)
      3. 4238.txt (7 kB, Brandon Williams)

          Activity

          Brandon Williams added a comment -

          Committed and renamed PIG_PARTITION_FILTER to PIG_USE_SECONDARY to be clearer.

          Pavel Yaskevich added a comment -

          +1

          Brandon Williams added a comment -

          v3 makes one small tweak, and prepends "index_" instead of appending "_index", since pig identifiers need to always begin with an alphanumeric character and this can guarantee that.
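The naming rule described in this comment can be sketched as follows. This is a minimal illustration, not the code from the attached patch: the class and method names (`IndexFieldNames`, `toPigField`, `toColumn`) are hypothetical, and only the "index_" prefix comes from the comment itself.

```java
// Hypothetical helper illustrating the "index_" prefix rule; not taken from the patch.
final class IndexFieldNames {
    // Prefix named in the v3 comment. Prepending it guarantees the resulting
    // Pig field name starts with the letter 'i', whatever the column is called.
    static final String PREFIX = "index_";

    // Maps an indexed Cassandra column name to the extra Pig schema field name.
    static String toPigField(String indexedColumn) {
        return PREFIX + indexedColumn;
    }

    // Returns true if a Pig field name follows the index naming rule.
    static boolean isIndexField(String pigField) {
        return pigField.startsWith(PREFIX);
    }

    // Recovers the Cassandra column name from an index field name.
    static String toColumn(String pigField) {
        if (!isIndexField(pigField)) {
            throw new IllegalArgumentException("not an index field: " + pigField);
        }
        return pigField.substring(PREFIX.length());
    }
}
```

Under this rule, the v2 example for an index on 'name' would presumably read "filter rows by index_name eq 'foo'". A column such as '_ts' shows why the prefix is safer than the suffix: '_ts_index' would start with an underscore, while 'index__ts' starts with a letter.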

          Brandon Williams added a comment -

          v2 implements a workaround. If PIG_PARTITION_FILTER is enabled, then each index (actual index, not plain validation) is appended as a top-level field to the schema after the bag, and the name has '_index' appended. Thus, if there is an index on a column called 'name', you can use it with a statement like "filter rows by name_index eq 'foo'".

          The caveat is that we have to relax the putNext function a bit to ignore these fields, so if you have this enabled and are storing a completely bad schema, it will silently drop your bad fields as well. However, this is a small price to pay for the added functionality.
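The trade-off described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the patched putNext: it assumes the stored tuple has the layout (key, columns, extra fields...), models the tuple as a plain `List<Object>`, and uses a hypothetical `useSecondary` flag to stand in for the setting that enables the index fields.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the relaxed check; the real putNext works on Pig tuples.
final class RelaxedPutNext {
    // Returns the fields that would be written: the row key and the column bag.
    // Assumption: anything after the first two top-level fields is an index field.
    static List<Object> fieldsToStore(List<Object> tuple, boolean useSecondary) {
        if (tuple.size() < 2) {
            throw new IllegalArgumentException("expected at least (key, columns)");
        }
        if (tuple.size() > 2 && !useSecondary) {
            // Strict mode: extra top-level fields are rejected.
            throw new IllegalArgumentException("unexpected trailing fields");
        }
        // Relaxed mode: trailing fields are dropped without any check that they
        // really are index fields, which is the caveat noted in the comment.
        return new ArrayList<>(tuple.subList(0, 2));
    }
}
```

In this model the relaxed mode cannot tell an index field from a field that is simply wrong, so both are discarded without an error.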

          Brandon Williams added a comment -

          Updated patch to fix minor problems. What I've discovered is that pig wants to match on the tuple named 'name', and as far as I can tell there's no way for getPartitionKeys to specify anything deeper, such as name.value. This would almost be ok, at the cost of silly syntax like "filter rows by name eq ('name', 'foo')", except that when setPartitionFilter is called we unsurprisingly get the string literal "(name, foo)" as the value to match against, which of course is not in the index. I'm not sure what, if anything, can be done about this.

          Brandon Williams added a comment -

          Posting what I have here, which seems to be complete, but I'm having a hard time testing it. I would think that, given an index on column 'name', something like 'filter rows by name.value eq "foo"' would work, but while I see getPartitionKeys called, setPartitionFilter never is.

          Brandon Williams added a comment -

          There's an HCatalog implementation at https://svn.apache.org/repos/asf/incubator/hcatalog/trunk/src/java/org/apache/hcatalog/pig/HCatLoader.java

            People

            • Assignee: Brandon Williams
            • Reporter: Brandon Williams
            • Reviewer: Pavel Yaskevich
            • Votes: 0
            • Watchers: 2
