Commons CSV
  1. Commons CSV
  2. CSV-34

CSVFormat describes itself as immutable, but it is not - in particular it is not thread-safe

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.0
    • Component/s: None
    • Labels:
      None

      Description

      CSVFormat describes itself as immutable, but it is not @Immutable - the class fields are all mutable.

      The methods that change the fields do so by creating a clone, and returning the changed clone.
      So in a sense the class is immutable.

      However, the normal expectation is that @Immutable classes are @ThreadSafe.
      CSVFormat is not thread-safe, because the fields are not volatile, and the fields are not written & read using a common lock.

      The comment needs to be clarified or removed.

        Activity

        Sebb created issue -
        Hide
        Emmanuel Bourg added a comment -

        Could you elaborate a bit in this issue Seb? I'm scratching my head on this and I fail to see how the current design defeats the thread safety. I even wrote a test trying the demonstrate the class wasn't thread safe but I didn't succeed.

        Is it sufficient to make the fields volatile and leave the logic as is?

        Show
        Emmanuel Bourg added a comment - Could you elaborate a bit in this issue Seb? I'm scratching my head on this and I fail to see how the current design defeats the thread safety. I even wrote a test trying the demonstrate the class wasn't thread safe but I didn't succeed. Is it sufficient to make the fields volatile and leave the logic as is?
        Hide
        Sebb added a comment -

        The Java Memory Model allows threads to cache copies of variables. This is done to improve performance.

        If thread A writes to a field, that write may be cached.
        Even if the cache is written to main memory, thread B may already have cached the value so may not see the update.

        If the reader and writer threads use the same lock - or the field is volatile - then the JVM guarantees that the data will be published correctly. Otherwise, in general the reader may never see what the writer thread wrote, or may see writes out of order. [There are some further rules about final variables.]

        Unfortunately this is quite difficult to test, as whether and when caching occurs depends on lots of factors.

        So yes, adding volatile would fix this particular problem.
        Or one could use final variables for all fields and use a constructor instead of clone.
        That would be rather better.

        But given that the rest of CSV is not thread-safe, I wonder whether it's necessary to make the format class thread-safe.

        Show
        Sebb added a comment - The Java Memory Model allows threads to cache copies of variables. This is done to improve performance. If thread A writes to a field, that write may be cached. Even if the cache is written to main memory, thread B may already have cached the value so may not see the update. If the reader and writer threads use the same lock - or the field is volatile - then the JVM guarantees that the data will be published correctly. Otherwise, in general the reader may never see what the writer thread wrote, or may see writes out of order. [There are some further rules about final variables.] Unfortunately this is quite difficult to test, as whether and when caching occurs depends on lots of factors. So yes, adding volatile would fix this particular problem. Or one could use final variables for all fields and use a constructor instead of clone. That would be rather better. But given that the rest of CSV is not thread-safe, I wonder whether it's necessary to make the format class thread-safe.
        Hide
        Emmanuel Bourg added a comment -

        CSVFormat doesn't claim to be thread safe, it's just immutable otherwise it would be possible to modify the predefined formats (CSVFormat.DEFAULT, CSVFormat.EXCEL, etc).

        Thread safety is not a goal, but it's still good to have. With the current implementation it's impossible for an instance to be mutated after the reference is shared with another thread. So I think it's impossible for another thread to see a different value.

        The case can be described like this:

        1. Thread A creates a bean and set a non final, non volatile field
        2. Thread A shares this instance with Thread B (through a queue for example)

        Question: Is it possible for Thread B to get the bean with the field not mutated?

        My understanding is that it's not possible, because for sharing the instance with Thread B the instance that may have been cached in Thread A memory must necessarily be copied in it's current state into the main memory, and that's with the field mutated. So Thread B receives the bean with the field properly initialized.

        Is this correct?

        Show
        Emmanuel Bourg added a comment - CSVFormat doesn't claim to be thread safe, it's just immutable otherwise it would be possible to modify the predefined formats (CSVFormat.DEFAULT, CSVFormat.EXCEL, etc). Thread safety is not a goal, but it's still good to have. With the current implementation it's impossible for an instance to be mutated after the reference is shared with another thread. So I think it's impossible for another thread to see a different value. The case can be described like this: Thread A creates a bean and set a non final, non volatile field Thread A shares this instance with Thread B (through a queue for example) Question: Is it possible for Thread B to get the bean with the field not mutated? My understanding is that it's not possible, because for sharing the instance with Thread B the instance that may have been cached in Thread A memory must necessarily be copied in it's current state into the main memory, and that's with the field mutated. So Thread B receives the bean with the field properly initialized. Is this correct?
        Hide
        Sebb added a comment -

        OK, good point about predefined formats.

        The issue is not about making changes to an object after it has been shared (although that is also important), it's about ensuring that the contents of an object are correctly published to the other thread.

        Queues that are designed for passing work items between threads must include sufficient synch. to guarantee safe publication.

        However, if thread A were to store the object directly into a field in thread B, there would be no such guarantee.

        Show
        Sebb added a comment - OK, good point about predefined formats. The issue is not about making changes to an object after it has been shared (although that is also important), it's about ensuring that the contents of an object are correctly published to the other thread. Queues that are designed for passing work items between threads must include sufficient synch. to guarantee safe publication. However, if thread A were to store the object directly into a field in thread B, there would be no such guarantee.
        Hide
        Sebb added a comment -

        CSVFormat.java reworked to use all final variables.

        [Can be provided as diff if required; it's easier to understand a full source file]

        Show
        Sebb added a comment - CSVFormat.java reworked to use all final variables. [Can be provided as diff if required; it's easier to understand a full source file]
        Sebb made changes -
        Field Original Value New Value
        Attachment CSVFormat.java [ 12517482 ]
        Hide
        Emmanuel Bourg added a comment -

        Thank you for showing how this can be implemented Sebb. Considering that thread safety is not a goal I think I still prefer the current implementation.

        Show
        Emmanuel Bourg added a comment - Thank you for showing how this can be implemented Sebb. Considering that thread safety is not a goal I think I still prefer the current implementation.
        Hide
        Sebb added a comment -

        I thought you wanted to be able to pass CSVFormat objects between threads?

        As it stands, the implementation is not completely thread-safe.
        And the Javadoc comment is still misleading.

        Show
        Sebb added a comment - I thought you wanted to be able to pass CSVFormat objects between threads? As it stands, the implementation is not completely thread-safe. And the Javadoc comment is still misleading.
        Sebb made changes -
        Attachment COLLECTIONS-398-2.patch [ 12517491 ]
        Sebb made changes -
        Comment [ Another simplification; no need to check for null element. ]
        Sebb made changes -
        Attachment COLLECTIONS-398-2.patch [ 12517491 ]
        Hide
        Emmanuel Bourg added a comment -

        After checking the JLS and the excellent Java Concurrency In Practice I now understand why my analysis is wrong. CSVFormat is "effectively immutable" but must be safely published to avoid visibility issues.

        I made the fields volatile to ensure the thread safety. I prefer this to the constructor idiom which is less readable with a large number of fields. Also I expect some fields to be added or removed in the near future, the clone construct will be easier to maintain. I don't mind switching to constructors later when the API is stabilized.

        Thank you for pointing to this issue Sebb, that was very instructing.

        Show
        Emmanuel Bourg added a comment - After checking the JLS and the excellent Java Concurrency In Practice I now understand why my analysis is wrong. CSVFormat is "effectively immutable" but must be safely published to avoid visibility issues. I made the fields volatile to ensure the thread safety. I prefer this to the constructor idiom which is less readable with a large number of fields. Also I expect some fields to be added or removed in the near future, the clone construct will be easier to maintain. I don't mind switching to constructors later when the API is stabilized. Thank you for pointing to this issue Sebb, that was very instructing.
        Emmanuel Bourg made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Fix Version/s CSV 1.0 [ 12313514 ]
        Resolution Fixed [ 1 ]
        Emmanuel Bourg made changes -
        Project Commons Sandbox [ 12310491 ] Commons CSV [ 12313222 ]
        Key SANDBOX-408 CSV-34
        Component/s CSV [ 12311182 ]
        Fix Version/s 1.0 [ 12320161 ]
        Fix Version/s CSV 1.0 [ 12313514 ]
        Hide
        Emmanuel Bourg added a comment -

        Eventually I removed the volatile modifier, it was slowing the parser by about 5% in my tests. I replaced the clones by a constructor call as you suggested.

        Show
        Emmanuel Bourg added a comment - Eventually I removed the volatile modifier, it was slowing the parser by about 5% in my tests. I replaced the clones by a constructor call as you suggested.
        Hide
        Sebb added a comment -

        That's quite surprising.
        Obviously volatile will have some overhead, but I'm surprised it is as much as that, but then there are quite a few variables to be checked.

        Show
        Sebb added a comment - That's quite surprising. Obviously volatile will have some overhead, but I'm surprised it is as much as that, but then there are quite a few variables to be checked.
        Hide
        Emmanuel Bourg added a comment -

        volatile prevents the variables from being cached in the registers and creates a memory barrier, so that's not that surprising to see an impact on the performances. But I admit I didn't expect as much as 5%. That's probably because the parser calls the format properties many time on every character read.

        Show
        Emmanuel Bourg added a comment - volatile prevents the variables from being cached in the registers and creates a memory barrier, so that's not that surprising to see an impact on the performances. But I admit I didn't expect as much as 5%. That's probably because the parser calls the format properties many time on every character read.
        Benedikt Ritter made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee:
            Unassigned
            Reporter:
            Sebb
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development