Accumulo / ACCUMULO-840

Allow String-based getBytes calls to pick Charset encoding from JVM setting.

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      ACCUMULO-836 changed all String-based getBytes() calls to use the UTF-8 standard. However, there is a JVM setting, the "file.encoding" system property, that should be honored. See http://stackoverflow.com/questions/361975/setting-the-default-java-character-encoding for a discussion of JAVA_TOOL_OPTIONS, which might be relevant to this topic. http://javarevisited.blogspot.com/2012/01/get-set-default-character-encoding.html is also a good page to read, especially the comment on how the character encoding is cached.
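To make the caching remark concrete, here is a minimal sketch using only the JDK. It assumes the commonly observed behavior that the JVM reads the default charset once and keeps it, so changing the "file.encoding" property at runtime does not alter what an argument-less getBytes() uses. The class and method names are hypothetical and are not part of Accumulo.

```java
import java.nio.charset.Charset;

final class DefaultCharsetProbe {

    /** The charset used by String.getBytes() when no charset is passed. */
    static Charset jvmDefault() {
        return Charset.defaultCharset();
    }

    /**
     * Changes the "file.encoding" property at runtime and reports whether the
     * default charset stayed the same. The property is restored afterwards.
     */
    static boolean defaultIsCached() {
        Charset before = jvmDefault();
        String old = System.getProperty("file.encoding");
        String other = "UTF-8".equals(before.name()) ? "ISO-8859-1" : "UTF-8";
        System.setProperty("file.encoding", other);
        try {
            return jvmDefault().equals(before);
        } finally {
            if (old != null) {
                System.setProperty("file.encoding", old);
            } else {
                System.clearProperty("file.encoding");
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("default charset: " + jvmDefault());
        // Expected to report true if the default is cached as assumed above.
        System.out.println("unchanged after setProperty: " + defaultIsCached());
    }
}
```

In practice this means the setting has to be supplied when the JVM starts (for example with -Dfile.encoding=... or through JAVA_TOOL_OPTIONS) rather than changed from inside a running program.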

          Activity

          Christopher Tubbs added a comment -

          There are two issues here. The first is establishing a standard encoding for all Accumulo internal persistent state/metadata, and the second is how to automatically encode API convenience methods that accept String or char[] or CharSequence (from here on, I'll refer to these three collectively as "Strings"). I'll deal with the latter first:

          API: It is important to note that Accumulo deals only with bytes. That's it. We don't guarantee a sort order for Strings with arbitrary (or configurable) encoding, though some have asked for custom comparators to achieve fine-grained control over this. Instead, we only guarantee a sort order for bytes, sorted numerically byte-by-byte, from most significant to least. It is important to realize that we only deal with bytes internally, because all of the API decisions appear to be centered around that idea. This is why you almost always see a Text object, because it holds an arbitrary byte array. It is true that Text has a constructor that accepts a String, and it has a very specific encoding when it does so (UTF8 only, as per its documentation). We have copied this behavior in some of our APIs to add convenience methods that accept Strings, because it's easier than forcing users to write

          new Mutation(new Text("myString".getBytes("UTF8")));

          It is so much easier to do

          new Mutation("myString");

          This does not change the behavior, though. We still expect convenience methods that accept Strings to behave as though we had converted a String to UTF8 and passed in the resulting bytes (in a Text object) to the method.
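The two points above (Strings become UTF-8 bytes, and only the bytes are sorted) can be sketched with the JDK alone. This is an illustration rather than Accumulo's implementation: the class and method names are hypothetical, and the comparison assumes that "sorted numerically" means each byte is treated as an unsigned value (0 to 255).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8Bytes {

    /** Encodes a String as UTF-8, independent of the JVM's default charset. */
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Compares two byte arrays byte-by-byte, treating each byte as unsigned. */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = (a[i] & 0xff) - (b[i] & 0xff);
            if (cmp != 0) {
                return cmp;
            }
        }
        // When one array is a prefix of the other, the shorter one sorts first.
        return a.length - b.length;
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9"; // LATIN SMALL LETTER E WITH ACUTE
        System.out.println(Arrays.toString(utf8(eAcute)));                                  // [-61, -87]
        System.out.println(Arrays.toString(eAcute.getBytes(StandardCharsets.ISO_8859_1))); // [-23]
        System.out.println(compareUnsigned(utf8("z"), utf8(eAcute)) < 0);                   // true
    }
}
```

The same character produces different bytes under different encodings, which is why the encoding behind a String-accepting method has to be fixed and documented: the resulting bytes, not the characters, determine where a key sorts.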

          API (cont.): Now, it may be the case that the API could benefit from convenience wrappers that accept Strings with a specific encoding, or we could change the behavior of those we have to respect the JVM's "file.encoding" property, and simply pre-encode the Strings before we throw their resulting bytes into a Text object. This may be useful and convenient, but this is a VERY LIMITED SCOPE, and it's important to realize that any consideration of changes to the way we encode things should focus on this scope, and not go crazy, changing all instances of "String-based" uses of ".getBytes()" in the code. Regardless of whether we make such changes, though, we should update our Javadocs to ensure that the encoding we use for these convenience methods is described. It is in the case of Mutation... I'm not sure about elsewhere.
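As a rough sketch of what such convenience wrappers might look like, the helper below pre-encodes a String with either an explicit charset or the JVM default before the bytes would be handed to a byte-oriented API. The class and method names are hypothetical; nothing here is an existing Accumulo method.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class StringEncoders {

    private StringEncoders() {}

    /** Encodes the value with a caller-chosen charset. */
    static byte[] encode(CharSequence value, Charset charset) {
        return value.toString().getBytes(charset);
    }

    /** Encodes the value with the JVM default charset (driven by "file.encoding"). */
    static byte[] encodeWithJvmDefault(CharSequence value) {
        return encode(value, Charset.defaultCharset());
    }

    public static void main(String[] args) {
        String s = "caf\u00e9";
        System.out.println(encode(s, StandardCharsets.UTF_8).length);      // 5
        System.out.println(encode(s, StandardCharsets.ISO_8859_1).length); // 4
        System.out.println(encodeWithJvmDefault(s).length);                // depends on the JVM default
    }
}
```

Keeping the charset as an explicit parameter confines the choice to the API boundary, which matches the limited scope described above.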

          INTERNAL: The other scope to consider for encoding has to do with our internal storage (metadata we store in Zookeeper, in the !METADATA table, and other places where Accumulo writes persistent state). It is imperative that we maintain consistency in the way we interpret our persistent state. For this scope, we absolutely should stick to an encoding, but it should be hard-coded (use a Constant or a util method, for convenience), and should not respect any user configurable field. This is important, because a user should be able to change his/her JVM's encoding settings (for the API scope described above) and it should NOT affect our ability to read and understand data that we've previously written to Zookeeper or !METADATA (or elsewhere).

          INTERNAL (cont.): For the internal, persistent state's encoding, I'm comfortable assuming that we're already treating all persistent String storage as UTF-8 encoded (because we move things around in Text objects a lot, and for those things we aren't, we're probably using ASCII, and can safely treat it as UTF-8). If there are any situations where we are storing persistent state ambiguously, based on anything other than the hard-coded UTF-8 encoding, such that it might cause a problem if a user were to change an OS setting, or non-ASCII data can find its way in, we should treat that as a bug.
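The "Constant or util method" idea for internal state could look roughly like the sketch below. The class name and its members are hypothetical, not existing Accumulo code. The point is that the charset is fixed in one place and never read from a user setting.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class InternalEncoding {

    /** Fixed charset for persistent state; deliberately not configurable. */
    static final Charset CHARSET = StandardCharsets.UTF_8;

    private InternalEncoding() {}

    /** Encodes a String for persistent storage. */
    static byte[] toBytes(String s) {
        return s.getBytes(CHARSET);
    }

    /** Decodes bytes previously written with toBytes. */
    static String fromBytes(byte[] bytes) {
        return new String(bytes, CHARSET);
    }

    public static void main(String[] args) {
        String tableName = "t\u00e4ble";
        // Round-trips regardless of the JVM's "file.encoding" setting.
        System.out.println(tableName.equals(fromBytes(toBytes(tableName)))); // true
    }
}
```

Pure ASCII text encodes to the same bytes in UTF-8, which is why existing ASCII-only state can be read back as UTF-8 without conversion.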

          As far as I see it, these are the only two scopes we need to concern ourselves with when considering encoding.

          David Medinets added a comment -

          From the dev mailing list:

          John: Why not just have a configuration in the xml file for setting a global charset? This way we avoid hard-coded settings and also avoid shared-VM issues.

          Drew: +1 for a configuration file property – perhaps this could be worked into the Encoding class
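For illustration only, a configuration-driven charset lookup along the lines John and Drew suggest might be sketched as below. The property key and class name are hypothetical, java.util.Properties stands in for the XML configuration file, and falling back to UTF-8 is an assumption of this sketch, not a decision recorded in this issue.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

final class CharsetConfig {

    /** Hypothetical property key; not an actual Accumulo setting. */
    static final String CHARSET_KEY = "example.api.charset";

    private CharsetConfig() {}

    /** Returns the configured charset, or UTF-8 when unset or unusable. */
    static Charset fromConfig(Properties conf) {
        String name = conf.getProperty(CHARSET_KEY);
        if (name == null || name.trim().isEmpty()) {
            return StandardCharsets.UTF_8;
        }
        try {
            return Charset.forName(name.trim());
        } catch (IllegalArgumentException e) {
            // Covers both illegal charset names and unsupported charsets.
            return StandardCharsets.UTF_8;
        }
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(CHARSET_KEY, "ISO-8859-1");
        System.out.println(fromConfig(conf)); // ISO-8859-1
    }
}
```

A real implementation might prefer to fail fast on an unknown charset name instead of silently falling back.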


            People

            • Assignee: David Medinets
            • Reporter: David Medinets
            • Votes: 0
            • Watchers: 3
