ACCUMULO-241

Visibility labels should allow more characters

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.3.5
    • Fix Version/s: 1.5.0
    • Component/s: None
    • Labels:

      Description

      We currently whitelist our visibility labels to only allow alphanumerics and a few select delimiting characters. Users occasionally ask for characters that are not allowed in the current system. We need a system that lets users use whatever characters they like w/o seeking permission from Accumulo developers.

      1. ACCUMULO-241-quoting.txt
        8 kB
        Keith Turner
      2. ACCUMULO-241-quoting-2.txt
        13 kB
        Keith Turner

        Activity

        David Medinets added a comment -

        It seems like someone ran into a specific label string that would not currently work. If possible, I'd like to know what that label was. Thanks.

        jv added a comment -

        I just hate arbitrary restrictions. I see no point for this whitelist approach when a looser blacklist gives more power to the end user. Currently we only have 3 non-text characters, so if people want to set up different levels of delimiters they are limited.

        David Medinets added a comment -

        Not to be obtuse but what is the purpose of any restrictions? I am new to Accumulo so I don't grok how a label is used. If this information is already documented, I can go look it up.

        David Medinets added a comment -

        Let me provide a potential use case. What if I want to use a MAC address as a label? One example value would be "00:34:C8:3B:32:68". I don't know if this makes any sense. Just trying to start a conversation.

        Billie Rinaldi added a comment -

        Currently, visibilities can contain a-z, A-Z, 0-9, _, -, and :, so your example would work. As John says, there is no reason not to include all printable characters other than &, |, (, and ).
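
        The whitelist described above can be sketched as a small check. This is only an illustration of the character set listed in the comment, not Accumulo's actual validation code; the class and method names are hypothetical, and rejecting an empty label is an assumption.

```java
// Hypothetical sketch of the whitelist described in the thread:
// a-z, A-Z, 0-9, '_', '-', and ':' are allowed in an authorization.
final class LabelWhitelist {

    // True if the character is in the whitelist listed in the comment.
    static boolean isAllowedChar(char c) {
        return (c >= 'a' && c <= 'z')
            || (c >= 'A' && c <= 'Z')
            || (c >= '0' && c <= '9')
            || c == '_' || c == '-' || c == ':';
    }

    // True if every character of the label is whitelisted.
    // Assumption: an empty label is treated as invalid.
    static boolean isValidLabel(String label) {
        if (label.isEmpty()) {
            return false;
        }
        for (int i = 0; i < label.length(); i++) {
            if (!isAllowedChar(label.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValidLabel("00:34:C8:3B:32:68")); // true: the MAC address example
        System.out.println(isValidLabel("foo.bar"));           // false: '.' is not in this list
    }
}
```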

        Adam Fuchs added a comment -

        We have a few goals for the language used in security labels (expressions made up of operators and authorizations): labels should be easy for both humans and computers to read, they should be unambiguous, the Boolean logic operators should be easily distinguished from the atomic authorizations, labels should be backwards compatible forever, and the language should be extensible to anything we might want to do with it in the future. To support backwards compatibility while leaving room for extension, we originally reserved all non-alphanumeric characters and only allowed alphanumeric characters within authorizations. When our users asked for '_', '-', and ':' for use in authorizations, we added those to the white list. Moving to a black list approach is a bit more limiting to extensibility, but I think it can be done while preserving the possibility of adding future capabilities.

        Supporting escaping of reserved characters might be another option, but that might reduce the human readability.

        The big question is what do we want to do with cell-level security in the future? I think we probably want to support "not" at some point, so probably '!' and '~' should be reserved. If we do want to support escaping, we should probably reserve '\' or '#' and ';'. It has been hinted that we might want to support something like regular expressions, so '*', '?', '[', ']', '+', .... How about variable substitution, with '%' or '$'?

        Maybe it would be better to keep a white list for now?

        David Medinets added a comment -

        Doh! For some reason I kept not seeing the word 'visibility'. Now I understand what labels are being discussed. Would it make sense to use syntax like NOT(FOO) instead of !FOO in labels? Using function-based syntax (if that terminology makes sense) reduces the need for fiddly single-character flags and should be more readable. It also leaves the door open for an extensible mechanism whereby keywords can be attached to user-supplied functionality.

        Keith Turner added a comment -

        I like the idea of quoting. We keep the current white list. If the user wants to use something outside of that list, they have to quote it. For example if a user wanted to use the label foo.bar, then they would need to do 'foo.bar'. I think this has the following benefits:

        • backwards compatible with existing data
        • allows users to use whatever characters they like in their labels
        • gives us the flexibility to use additional characters in the language in the future
        • is human readable

        Would just need an escape mechanism for quote, could do the standard two quotes.

        One drawback I can think of is that users can make labels that look like expressions, like 'A&B'. This is the type of thing that a computer has no issues with, but it may mislead a person.
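
        A minimal sketch of the quoting idea, assuming single quotes as the delimiter (as in the 'foo.bar' example) and reading "the standard two quotes" as doubling an embedded quote. The class and method names are hypothetical and this is not the attached patch.

```java
// Hypothetical sketch: quote a label with single quotes and escape an
// embedded quote by doubling it, then reverse the process.
final class LabelQuoting {

    // Wraps a label in single quotes, doubling any quote inside it.
    static String quote(String label) {
        return "'" + label.replace("'", "''") + "'";
    }

    // Reverses quote(). Rejects terms that are not properly quoted.
    static String unquote(String term) {
        int end = term.length() - 1;
        if (term.length() < 2 || term.charAt(0) != '\'' || term.charAt(end) != '\'') {
            throw new IllegalArgumentException("not a quoted term: " + term);
        }
        StringBuilder out = new StringBuilder();
        for (int i = 1; i < end; i++) {
            char c = term.charAt(i);
            if (c == '\'') {
                // A quote inside the term must be followed by a second quote.
                if (i + 1 >= end || term.charAt(i + 1) != '\'') {
                    throw new IllegalArgumentException("unescaped quote in: " + term);
                }
                i++; // skip the second quote of the pair
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(quote("foo.bar"));   // 'foo.bar'
        System.out.println(unquote("'A&B'"));   // A&B
        System.out.println(unquote("'it''s'")); // it's
    }
}
```

        In this sketch 'A&B' unquotes to the single label A&B rather than an expression, which is exactly the case the comment warns may mislead a human reader.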

        David Medinets added a comment -

        If y'all decide to quote, please consider implementing ruby's alternate quote syntax. Using alternative syntax can make strings more readable. See more information at http://en.wikibooks.org/wiki/Ruby_Programming/Alternate_quotes.

        You can alleviate the 'A&B' issue if the syntax is AND(A,B) instead of the single-character &. Granted, this is not what most programmers would want to see, though.

        Jim Klucar added a comment -

        I agree with John's statement "I just hate arbitrary restrictions." I vote for the blacklist approach. The current whitelist limits labels to ASCII characters, which seems to be an arbitrary restriction. Currently, many non-English languages are left out, and whitelisting UTF-8 is probably too big of a space to make it worth it.

        Eric Newton added a comment -

        This whitelist/blacklist discussion started because we have an immediate need to extend the alphabet to include ".". While all of these alternative ideas are great, we need "." added yesterday. A quoting mechanism or alternative syntax is going to require testing and performance analysis, which isn't something we can do very quickly.

        We can always extend the expressions with something that is presently illegal like "%" and provide an alternative syntax:

        %AND("a", OR("b", "c"), NOT("$x"))
        

        But for now... I'm adding the silly "."; we can add extensibility of the column visibility when an actual use-case comes up.

        Keith Turner added a comment -

        I experimented w/ adding quoting, here is a patch for review. It was pretty easy and I think it's complete, but something like this can be tricky. I think we will keep seeing users asking for more characters.

        Keith Turner added a comment -

        Updated patch based on code review comments

        Keith Turner added a comment -

        FYI

        I read up on UTF-8 [1] to see if it would work w/ the quoting changes I made. It seems like UTF-8 within quotes in a visibility expression will work just fine. So theoretically Accumulo visibility labels should support non-ASCII charsets now. I was worried that a multi-byte character may contain a quote byte, however this will not happen w/ UTF-8. The MSB [2] is always set to 1 for each byte in a multi-byte UTF-8 encoded char. Therefore a multi-byte character will not contain a quote byte. When a quote byte occurs in UTF-8 it can only be the ASCII quote char.

        [1]: http://en.wikipedia.org/wiki/UTF-8
        [2]: http://en.wikipedia.org/wiki/Most_significant_bit
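
        The UTF-8 property described above can be checked with a short sketch. The class and method names are hypothetical, and the ASCII single quote (0x27) is assumed as the quote byte, following the 'foo.bar' example earlier in the thread; the same argument holds for any ASCII byte.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: every byte of a multi-byte UTF-8 sequence has its
// most significant bit set, so none of them can equal an ASCII quote byte.
final class Utf8QuoteCheck {

    static final byte QUOTE = 0x27; // ASCII single quote (assumed delimiter)

    // Counts bytes in the UTF-8 encoding that equal the quote byte.
    static int countQuoteBytes(String s) {
        int count = 0;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (b == QUOTE) {
                count++;
            }
        }
        return count;
    }

    // Encodes each code point separately and checks that every byte of a
    // multi-byte encoding has the high bit (0x80) set.
    static boolean multiByteBytesAllHaveHighBit(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            byte[] enc = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            if (enc.length > 1) {
                for (byte b : enc) {
                    if ((b & 0x80) == 0) {
                        return false;
                    }
                }
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        // U+4E94 and U+5341 are three-byte characters in UTF-8.
        String quoted = "'\u4e94\u5341'";
        System.out.println(countQuoteBytes(quoted));              // 2: only the two delimiters
        System.out.println(multiByteBytesAllHaveHighBit(quoted)); // true
    }
}
```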

        Eric Newton added a comment -

        Builds are failing, though it works for me in my environment:

        java.lang.AssertionError: 
        	at org.junit.Assert.fail(Assert.java:74)
        	at org.junit.Assert.assertTrue(Assert.java:37)
        	at org.junit.Assert.assertFalse(Assert.java:56)
        	at org.junit.Assert.assertFalse(Assert.java:65)
        	at org.apache.accumulo.core.security.VisibilityEvaluatorTest.testNonAscii(VisibilityEvaluatorTest.java:117)
        
        John Vines added a comment -

        http://docs.oracle.com/javase/6/docs/api/java/lang/String.html#getBytes%28%29

        It seems getBytes (which I think is what we're using) doesn't always have a guarantee that it will be in UTF-8. We should explicitly use UTF-8 to make sure that's not the problem by using the other getBytes call where the Charset is explicitly set.
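
        A small sketch of the difference, not taken from the Accumulo source. It uses java.nio.charset.StandardCharsets, which requires Java 7 or later; on Java 6, getBytes("UTF-8") or Charset.forName("UTF-8") would serve the same purpose. The class and method names are hypothetical, and US-ASCII stands in for a build server whose default charset is not UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch: String.getBytes() with no argument uses the platform
// default charset, so the resulting bytes can differ between machines.
final class ExplicitCharset {

    // Always encodes as UTF-8, independent of the platform default.
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Simulates a platform whose default charset cannot represent the label:
    // unmappable characters are replaced with '?'.
    static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String label = "\u4e94"; // one non-ASCII character, U+4E94
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("default bytes:   " + Arrays.toString(label.getBytes()));
        System.out.println("UTF-8 bytes:     " + Arrays.toString(utf8(label)));
        System.out.println("US-ASCII bytes:  " + Arrays.toString(ascii(label)));
    }
}
```

        If the default charset on a machine is not UTF-8, the first and second byte arrays printed by main differ, which would be consistent with a test that passes in one environment and fails in another.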

        Keith Turner added a comment -

        I suspected the default encoding on the build server was different, but I was not sure. Does anyone know if there is a way to run something on the build server w/o committing? I will make changes to allow the encoding to be explicitly specified and see if that helps.

        Billie Rinaldi added a comment -

        I get the error on my laptop (not when I run the test in Eclipse, but when I run mvn at the command line). If you want, I can test your changes.


          People

          • Assignee:
            Keith Turner
            Reporter:
            John Vines
          • Votes:
            0
            Watchers:
            4
