Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-5824

Fix String/byte conversions without explicit encoding

    Details

      Description

      In a couple of places we convert Strings to bytes and bytes back to Strings without explicitly specifying an encoding. This can lead to problems when client and server default encodings differ.

      The task of this JIRA is to go over the whole project and look for conversions where we don't specify an encoding and fix it to specify UTF-8 explicitly.

      For starters, we can grep -R 'getBytes()' ., which already reveals many problematic places.

        Issue Links

          Activity

          Hide
          StephanEwen Stephan Ewen added a comment -

          We should add a checkstyle rule that forbids String.toBytes() and new String(byte[]).

          Show
          StephanEwen Stephan Ewen added a comment - We should add a checkstyle rule that forbids String.toBytes() and new String(byte[]) .
          Hide
          dawidwys Dawid Wysakowicz added a comment -

          I would be glad to help with this issue, if it is ok with you.

          Show
          dawidwys Dawid Wysakowicz added a comment - I would be glad to help with this issue, if it is ok with you.
          Hide
          uce Ufuk Celebi added a comment -

          Dawid Wysakowicz I've just assigned it to you.

          Stephan Ewen Stefan Richter Stephan mentioned that this also affects savepoints. Do we want to do something special there or simply fix it (possibly breaking savepoints)?

          Show
          uce Ufuk Celebi added a comment - Dawid Wysakowicz I've just assigned it to you. Stephan Ewen Stefan Richter Stephan mentioned that this also affects savepoints. Do we want to do something special there or simply fix it (possibly breaking savepoints)?
          Hide
          StephanEwen Stephan Ewen added a comment -

          I think some names in checkpoints are serialized in the default encoding. We have to check whether that would affect users.

          Dawid Wysakowicz Would probably be good to pull out the default encoding we want to use (UTF 8) into some class (like ConfigConstants or so).

          Show
          StephanEwen Stephan Ewen added a comment - I think some names in checkpoints are serialized in the default encoding. We have to check whether that would affect users. Dawid Wysakowicz Would probably be good to pull out the default encoding we want to use (UTF 8) into some class (like ConfigConstants or so).
          Hide
          dawidwys Dawid Wysakowicz added a comment -

          Stephan Ewen Sure, will do it. We don't want it to be configurable from the configuration file at that moment, do we?

          Ufuk Celebi Thanks for assigning!

          Show
          dawidwys Dawid Wysakowicz added a comment - Stephan Ewen Sure, will do it. We don't want it to be configurable from the configuration file at that moment, do we? Ufuk Celebi Thanks for assigning!
          Hide
          StephanEwen Stephan Ewen added a comment -

          I don't think we need it to be configurable at the moment. Since these are internal strings, we should rather be safe that we always use the same encoding.

          Show
          StephanEwen Stephan Ewen added a comment - I don't think we need it to be configurable at the moment. Since these are internal strings, we should rather be safe that we always use the same encoding.
          Hide
          dawidwys Dawid Wysakowicz added a comment -

          Just two more questions.

          First I tried to investigate the possibilities to add such a check to checkstyle, but I didn't find a good way to that. Checkstyle does not come with it and a regex check is not enough in that case. It would be possible with FindBugs but don't know if we want to add it right know. Some ideas?

          Do we want to change also writing to files like e.g:
          GlobalConfiguration::loadYAMLResource:144
          OptimizerPlanEnvironment::getOptimizedPlan:74/76

          Show
          dawidwys Dawid Wysakowicz added a comment - Just two more questions. First I tried to investigate the possibilities to add such a check to checkstyle, but I didn't find a good way to that. Checkstyle does not come with it and a regex check is not enough in that case. It would be possible with FindBugs but don't know if we want to add it right know. Some ideas? Do we want to change also writing to files like e.g: GlobalConfiguration::loadYAMLResource:144 OptimizerPlanEnvironment::getOptimizedPlan:74/76
          Hide
          StephanEwen Stephan Ewen added a comment -

          Will a regex pattern for getBytes() currently match methods on other classes than String as well?

          Show
          StephanEwen Stephan Ewen added a comment - Will a regex pattern for getBytes() currently match methods on other classes than String as well?
          Hide
          dawidwys Dawid Wysakowicz added a comment -

          Right now such a regex will match also for org.apache.flink.api.common.JobID and org.apache.avro.util.Utf8.

          Show
          dawidwys Dawid Wysakowicz added a comment - Right now such a regex will match also for org.apache.flink.api.common.JobID and org.apache.avro.util.Utf8 .
          Hide
          dawidwys Dawid Wysakowicz added a comment -

          I've fixed all direct String <> bytes conversions as listed by Findbugs.

          Still Findbugs shows
          34 reliance on default encoding in mainbase
          144 in mainbase + tests

          Those reliances are while using different kinds of Output/InputStreams mainly with FileOutputStream

          Do we also want to use UTF_8 in those cases?

          Current version one can check at https://github.com/dawidwys/flink/tree/encoding

          Show
          dawidwys Dawid Wysakowicz added a comment - I've fixed all direct String <> bytes conversions as listed by Findbugs. Still Findbugs shows 34 reliance on default encoding in mainbase 144 in mainbase + tests Those reliances are while using different kinds of Output/InputStreams mainly with FileOutputStream Do we also want to use UTF_8 in those cases? Current version one can check at https://github.com/dawidwys/flink/tree/encoding
          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user dawidwys opened a pull request:

          https://github.com/apache/flink/pull/3468

          FLINK-5824 Fix String/byte conversions without explicit encoding

          Thanks for contributing to Apache Flink. Before you open your pull request, please take the following check list into consideration.
          If your changes take all of the items into account, feel free to open your pull request. For more information and/or questions please refer to the [How To Contribute guide](http://flink.apache.org/how-to-contribute.html).
          In addition to going through the list, please provide a meaningful description of your changes.

          • [ ] General
          • The pull request references the related JIRA issue ("[FLINK-XXX] Jira title text")
          • The pull request addresses only one issue
          • Each commit in the PR has a meaningful commit message (including the JIRA id)
          • [ ] Documentation
          • Documentation has been added for new functionality
          • Old documentation affected by the pull request has been updated
          • JavaDoc for public methods has been added
          • [ ] Tests & Build
          • Functionality added by the pull request is covered by tests
          • `mvn clean verify` has been executed successfully locally or a Travis build has passed

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/dawidwys/flink flink5824

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/3468.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #3468


          commit 0aaf7a843ffb4ef5667e9b2517fd537f3b0323b5
          Author: Dawid Wysakowicz <dawid@getindata.com>
          Date: 2017-03-03T12:24:49Z

          FLINK-5824 Fix String/byte conversions without explicit encoding


          Show
          githubbot ASF GitHub Bot added a comment - GitHub user dawidwys opened a pull request: https://github.com/apache/flink/pull/3468 FLINK-5824 Fix String/byte conversions without explicit encoding Thanks for contributing to Apache Flink. Before you open your pull request, please take the following check list into consideration. If your changes take all of the items into account, feel free to open your pull request. For more information and/or questions please refer to the [How To Contribute guide] ( http://flink.apache.org/how-to-contribute.html ). In addition to going through the list, please provide a meaningful description of your changes. [ ] General The pull request references the related JIRA issue (" [FLINK-XXX] Jira title text") The pull request addresses only one issue Each commit in the PR has a meaningful commit message (including the JIRA id) [ ] Documentation Documentation has been added for new functionality Old documentation affected by the pull request has been updated JavaDoc for public methods has been added [ ] Tests & Build Functionality added by the pull request is covered by tests `mvn clean verify` has been executed successfully locally or a Travis build has passed You can merge this pull request into a Git repository by running: $ git pull https://github.com/dawidwys/flink flink5824 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/3468.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3468 commit 0aaf7a843ffb4ef5667e9b2517fd537f3b0323b5 Author: Dawid Wysakowicz <dawid@getindata.com> Date: 2017-03-03T12:24:49Z FLINK-5824 Fix String/byte conversions without explicit encoding
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user shijinkui commented on the issue:

          https://github.com/apache/flink/pull/3468

          Good changes. `ConfigConstants.DEFAULT_CHARSET` change to `ConfigConstants.UTF_8 ` may be more clear.

          Show
          githubbot ASF GitHub Bot added a comment - Github user shijinkui commented on the issue: https://github.com/apache/flink/pull/3468 Good changes. `ConfigConstants.DEFAULT_CHARSET` change to `ConfigConstants.UTF_8 ` may be more clear.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user dawidwys commented on the issue:

          https://github.com/apache/flink/pull/3468

          @shijinkui I think the idea of having this Constant in `ConfigConstant` is it could potentially be set to some other value. If I am mistaken I would instead of using `ConfigConstants.UTF_8` opt for java7's `StandardCharsets.UTF_8` directly.

          Show
          githubbot ASF GitHub Bot added a comment - Github user dawidwys commented on the issue: https://github.com/apache/flink/pull/3468 @shijinkui I think the idea of having this Constant in `ConfigConstant` is it could potentially be set to some other value. If I am mistaken I would instead of using `ConfigConstants.UTF_8` opt for java7's `StandardCharsets.UTF_8` directly.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user shijinkui commented on the issue:

          https://github.com/apache/flink/pull/3468

          As your current replacement with `ConfigConstants.DEFAULT_CHARSET`, it's better for setting other charsets. UTF_8 has clear semantics. If we'll never change the default charset utf_8 to other charset, I prefer `ConfigConstants.UTF_8`

          Show
          githubbot ASF GitHub Bot added a comment - Github user shijinkui commented on the issue: https://github.com/apache/flink/pull/3468 As your current replacement with `ConfigConstants.DEFAULT_CHARSET`, it's better for setting other charsets. UTF_8 has clear semantics. If we'll never change the default charset utf_8 to other charset, I prefer `ConfigConstants.UTF_8`
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user zentol commented on the issue:

          https://github.com/apache/flink/pull/3468

          I'm in favor of using ```ConfigConstants.DEFAULT_CHARSET```. Using ```ConfigConstants.UTF_8``` does not imply in any way that this is the default that anyone should use; it would be equivalent to using ```StandardCharsets.UTF_8```.

          Show
          githubbot ASF GitHub Bot added a comment - Github user zentol commented on the issue: https://github.com/apache/flink/pull/3468 I'm in favor of using ```ConfigConstants.DEFAULT_CHARSET```. Using ```ConfigConstants.UTF_8``` does not imply in any way that this is the default that anyone should use; it would be equivalent to using ```StandardCharsets.UTF_8```.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen commented on the issue:

          https://github.com/apache/flink/pull/3468

          I would prefer to keep the value in `ConfigConstants`, just because it clearly defines in one place what we want to use as the default (rather than passing `StandardCharsets.UTF_8` everywhere).

          As for the name, I think both suggestions are fine in the end. It is unlikely that we ever switch the charset, especially since checkpoints and high-availability storage encodes data in that charset.

          What would be good to have is a test that checks that we never use a byte conversion without a charset. Maybe a checkstyle rule. If that does not work, we can try and do a `Reflections` test (see RpcCompletenessTest for an example of how to reflectively analyze code in tests).

          Show
          githubbot ASF GitHub Bot added a comment - Github user StephanEwen commented on the issue: https://github.com/apache/flink/pull/3468 I would prefer to keep the value in `ConfigConstants`, just because it clearly defines in one place what we want to use as the default (rather than passing `StandardCharsets.UTF_8` everywhere). As for the name, I think both suggestions are fine in the end. It is unlikely that we ever switch the charset, especially since checkpoints and high-availability storage encodes data in that charset. What would be good to have is a test that checks that we never use a byte conversion without a charset. Maybe a checkstyle rule. If that does not work, we can try and do a `Reflections` test (see RpcCompletenessTest for an example of how to reflectively analyze code in tests).
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user dawidwys commented on the issue:

          https://github.com/apache/flink/pull/3468

          So as I mentioned in JIRA I don't see a way to do it in checkstyle rule.

          I managed to implement a test that performs the check using `Reflections` but I don't know where shall I place it. The problem is that only files from packages that are dependencies to the module with this test will be checked. First idea I had was to put it in the `filnk-tests` module, but it does not depend on e.g. any of `flink-connectors`

          Show
          githubbot ASF GitHub Bot added a comment - Github user dawidwys commented on the issue: https://github.com/apache/flink/pull/3468 So as I mentioned in JIRA I don't see a way to do it in checkstyle rule. I managed to implement a test that performs the check using `Reflections` but I don't know where shall I place it. The problem is that only files from packages that are dependencies to the module with this test will be checked. First idea I had was to put it in the `filnk-tests` module, but it does not depend on e.g. any of `flink-connectors`
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen commented on the issue:

          https://github.com/apache/flink/pull/3468

          I think starting with `flink-tests` is good, we can then improve from there.

          Show
          githubbot ASF GitHub Bot added a comment - Github user StephanEwen commented on the issue: https://github.com/apache/flink/pull/3468 I think starting with `flink-tests` is good, we can then improve from there.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen commented on the issue:

          https://github.com/apache/flink/pull/3468

          Good change, thanks!
          Merging this...

          Show
          githubbot ASF GitHub Bot added a comment - Github user StephanEwen commented on the issue: https://github.com/apache/flink/pull/3468 Good change, thanks! Merging this...
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/flink/pull/3468

          Show
          githubbot ASF GitHub Bot added a comment - Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/3468
          Hide
          StephanEwen Stephan Ewen added a comment -

          Fixed via 53fedbd2894c6c7b839d8fdcc0dbf1e6e21e631a

          Show
          StephanEwen Stephan Ewen added a comment - Fixed via 53fedbd2894c6c7b839d8fdcc0dbf1e6e21e631a

            People

            • Assignee:
              dawidwys Dawid Wysakowicz
              Reporter:
              uce Ufuk Celebi
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development