Uploaded image for project: 'Samza'
  1. Samza
  2. SAMZA-256

Provide in-memory data store implementation

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.6.0
    • Fix Version/s: 0.8.0
    • Component/s: kv
    • Labels:
      None

      Description

      The sole current kv store, LevelDbKeyValueStore, works well when the amount of data to be stored is prohibitively large to keep it all in memory. However, in cases where the state is small enough to comfortably fit in whatever memory is available, it would be better to provide an in-memory implementation. This can be backed by either a native Java class, or perhaps a Guava class, if that is found to scale better (or, of course, the backing implementation could be configurable).

      1. samza_256.patch
        133 kB
        Chinmay Soman
      2. samza_256_2.patch
        73 kB
        Chinmay Soman
      3. samza_256_1.patch
        73 kB
        Chinmay Soman

        Activity

        Hide
        criccomini Chris Riccomini added a comment -

        Opened SAMZA-545. There is some subtlety to this, but I think it can be done.

        Show
        criccomini Chris Riccomini added a comment - Opened SAMZA-545 . There is some subtlety to this, but I think it can be done.
        Hide
        jkreps Jay Kreps added a comment -

        Yeah you're right it does mess up the layering. But my experience has been serialization is a pretty big hit so I wonder for in-memory stuff if you would consistently beat an in-memory RocksDB if you don't do that.

        Show
        jkreps Jay Kreps added a comment - Yeah you're right it does mess up the layering. But my experience has been serialization is a pretty big hit so I wonder for in-memory stuff if you would consistently beat an in-memory RocksDB if you don't do that.
        Hide
        cpsoman Chinmay Soman added a comment -

        Jay Kreps You mean extending from BaseKeyValueStorageEngineFactory (which in turn wraps the inner store with serialized and cached stores) ? It was done at the time to maintain consistency (otherwise the code becomes a bit messy).

        Although - you're right. I'll open up a ticket to change the store wrapping logic in case of an in-memory store:

        • Ignore SerializedKeyValueStore
        • Use generic types for the in-memory store

        This will break the nice layering we have today. getKVStore() returns a KeyValueStore[Array[Byte], Array[Byte]]. We'll now have to do something different for in-memory store. But at a high level - I agree - the benefits are worth the effort.

        Show
        cpsoman Chinmay Soman added a comment - Jay Kreps You mean extending from BaseKeyValueStorageEngineFactory (which in turn wraps the inner store with serialized and cached stores) ? It was done at the time to maintain consistency (otherwise the code becomes a bit messy). Although - you're right. I'll open up a ticket to change the store wrapping logic in case of an in-memory store: Ignore SerializedKeyValueStore Use generic types for the in-memory store This will break the nice layering we have today. getKVStore() returns a KeyValueStore[Array [Byte] , Array [Byte] ]. We'll now have to do something different for in-memory store. But at a high level - I agree - the benefits are worth the effort.
        Hide
        jkreps Jay Kreps added a comment -

        Hey Chris Riccomini and Chinmay Soman Any reason this in memory map is doing serialization? The primary advantage of an in-memory store versus a fully cached leveldb/rocksdb would seem to be avoiding the serialization, right?

        Show
        jkreps Jay Kreps added a comment - Hey Chris Riccomini and Chinmay Soman Any reason this in memory map is doing serialization? The primary advantage of an in-memory store versus a fully cached leveldb/rocksdb would seem to be avoiding the serialization, right?
        Hide
        cpsoman Chinmay Soman added a comment -

        Chris beat me to the comment.

        I was about to point that out - this is still the old code, I just moved out the LevelDb part (and added a new in-memory part) into separate module (s).

        Show
        cpsoman Chinmay Soman added a comment - Chris beat me to the comment. I was about to point that out - this is still the old code, I just moved out the LevelDb part (and added a new in-memory part) into separate module (s).
        Show
        criccomini Chris Riccomini added a comment - Also, I think this split already existed in 0.7.0: https://git-wip-us.apache.org/repos/asf?p=incubator-samza.git;a=tree;f=samza-kv/src/main;h=d704801c016f822153ea64152a533350e420402c;hb=0.7.0
        Hide
        criccomini Chris Riccomini added a comment -

        I think this is OK because TaskContext.getStore returns an object. I'd be worried about it if we had just put an API-level interface/class in module other than samza-api, but I think we should be OK since this requires a specific cast.

        Show
        criccomini Chris Riccomini added a comment - I think this is OK because TaskContext.getStore returns an object. I'd be worried about it if we had just put an API-level interface/class in module other than samza-api, but I think we should be OK since this requires a specific cast.
        Hide
        jghoman Jakob Homan added a comment -
        jhoman-mn1:samza-kv jhoman$ tree
        .
        ├── samza-kv_2.10.iml
        └── src
            └── main
                ├── java
                │   └── org
                │       └── apache
                │           └── samza
                │               └── storage
                │                   └── kv
                │                       ├── Entry.java
                │                       ├── KeyValueIterator.java
                │                       └── KeyValueStore.java
                └── scala
                    └── org
                        └── apache
                            └── samza
                                └── storage
                                    └── kv
                                        ├── BaseKeyValueStorageEngineFactory.scala
                                        ├── CachedStore.scala
                                        ├── CachedStoreMetrics.scala
                                        ├── KeyValueStorageEngine.scala
                                        ├── KeyValueStorageEngineMetrics.scala
                                        ├── KeyValueStoreMetrics.scala
                                        ├── LoggedStore.scala
                                        ├── LoggedStoreMetrics.scala
                                        ├── NullSafeKeyValueStore.scala
                                        ├── SerializedKeyValueStore.scala
                                        └── SerializedKeyValueStoreMetrics.scala

        Just noticed this is our first non-samza-test mix of Scala and Java. I'm cool with that, but worth pointing out.

        Show
        jghoman Jakob Homan added a comment - jhoman-mn1:samza-kv jhoman$ tree . ├── samza-kv_2.10.iml └── src └── main ├── java │   └── org │   └── apache │   └── samza │   └── storage │   └── kv │   ├── Entry.java │   ├── KeyValueIterator.java │   └── KeyValueStore.java └── scala └── org └── apache └── samza └── storage └── kv ├── BaseKeyValueStorageEngineFactory.scala ├── CachedStore.scala ├── CachedStoreMetrics.scala ├── KeyValueStorageEngine.scala ├── KeyValueStorageEngineMetrics.scala ├── KeyValueStoreMetrics.scala ├── LoggedStore.scala ├── LoggedStoreMetrics.scala ├── NullSafeKeyValueStore.scala ├── SerializedKeyValueStore.scala └── SerializedKeyValueStoreMetrics.scala Just noticed this is our first non-samza-test mix of Scala and Java. I'm cool with that, but worth pointing out.
        Hide
        criccomini Chris Riccomini added a comment -

        +1 merged and committed.

        Show
        criccomini Chris Riccomini added a comment - +1 merged and committed.
        Hide
        cpsoman Chinmay Soman added a comment -

        samza-256_2.patch has the latest changes.

        Show
        cpsoman Chinmay Soman added a comment - samza-256_2.patch has the latest changes.
        Hide
        criccomini Chris Riccomini added a comment -

        I actually like samza-kv to be a separate module - given that there's plenty more refactoring coming up (getStorageEngine API).

        Cool, works for me.

        Could you post updated patch here? Posting on JIRA is what assigns rights to Apache (as opposed to RB). I know it's annoying to do both, sorry.

        Show
        criccomini Chris Riccomini added a comment - I actually like samza-kv to be a separate module - given that there's plenty more refactoring coming up (getStorageEngine API). Cool, works for me. Could you post updated patch here? Posting on JIRA is what assigns rights to Apache (as opposed to RB). I know it's annoying to do both, sorry.
        Hide
        cpsoman Chinmay Soman added a comment -

        I actually like samza-kv to be a separate module - given that there's plenty more refactoring coming up (getStorageEngine API).

        But I'm fine either ways

        Show
        cpsoman Chinmay Soman added a comment - I actually like samza-kv to be a separate module - given that there's plenty more refactoring coming up (getStorageEngine API). But I'm fine either ways
        Hide
        criccomini Chris Riccomini added a comment -

        Posted comments on RB. One other item for discussion is samza-kv vs. just rolling the remains of that module into samza-core. For: would reduce module counts, and samza-kv dependencies are totally clean. Against: samza-core is actually really samza-container. If we want to keep considering it samza-container, we shouldn't roll KV stuff into it.

        Looking for feedback.

        Show
        criccomini Chris Riccomini added a comment - Posted comments on RB. One other item for discussion is samza-kv vs. just rolling the remains of that module into samza-core. For: would reduce module counts, and samza-kv dependencies are totally clean. Against: samza-core is actually really samza-container. If we want to keep considering it samza-container, we shouldn't roll KV stuff into it. Looking for feedback.
        Hide
        cpsoman Chinmay Soman added a comment -

        Submitted the patch after addressing review comments

        Show
        cpsoman Chinmay Soman added a comment - Submitted the patch after addressing review comments
        Hide
        criccomini Chris Riccomini added a comment -

        Posted comments on RB. Would be good to get Jakob Homan's input as well.

        Show
        criccomini Chris Riccomini added a comment - Posted comments on RB. Would be good to get Jakob Homan 's input as well.
        Show
        cpsoman Chinmay Soman added a comment - https://reviews.apache.org/r/23372/
        Hide
        criccomini Chris Riccomini added a comment -

        Could you post an RB?

        Show
        criccomini Chris Riccomini added a comment - Could you post an RB?
        Hide
        cpsoman Chinmay Soman added a comment -

        Apologies for the gigantic patch (lot of refactoring).

        Here are the changes:

        • Moved most of the key value classes to samza-core
        • Added an in-memory key value store to samza-core
        • Created a trait (BaseKeyValueStorageEngineFactory.scala) to abstract away most of the work done by KeyValueStorageEngineFactory types.
        • Added LevelDb and InMemory factories for KeyValueStorageEngine
        • Renamed samza-kv to samza-leveldb
        Show
        cpsoman Chinmay Soman added a comment - Apologies for the gigantic patch (lot of refactoring). Here are the changes: Moved most of the key value classes to samza-core Added an in-memory key value store to samza-core Created a trait (BaseKeyValueStorageEngineFactory.scala) to abstract away most of the work done by KeyValueStorageEngineFactory types. Added LevelDb and InMemory factories for KeyValueStorageEngine Renamed samza-kv to samza-leveldb
        Hide
        jkreps Jay Kreps added a comment -

        Jakob Homan No I just didn't know any better. We can change it we would just need to give people a heads up.

        Show
        jkreps Jay Kreps added a comment - Jakob Homan No I just didn't know any better. We can change it we would just need to give people a heads up.
        Hide
        jghoman Jakob Homan added a comment -

        Jay Kreps, seems like we should bring this iterator in line with the others. Was this differing implementation intentional?

        Show
        jghoman Jakob Homan added a comment - Jay Kreps , seems like we should bring this iterator in line with the others. Was this differing implementation intentional?
        Hide
        cpsoman Chinmay Soman added a comment -

        Currently, LevelDbKeyValueStore range seems to implement an iterator that returns all elements : from(inclusive) => to(inclusive)

        However, most Java/Guava util classes that implement some form of 'SortedMap' have a subMap that returns elements : from(inclusive) => to(exclusive)

        What is the correct semantic for this API ? (all inclusive) ? or (all inclusive except the last element in that specified range) ?

        Show
        cpsoman Chinmay Soman added a comment - Currently, LevelDbKeyValueStore range seems to implement an iterator that returns all elements : from(inclusive) => to(inclusive) However, most Java/Guava util classes that implement some form of 'SortedMap' have a subMap that returns elements : from(inclusive) => to(exclusive) What is the correct semantic for this API ? (all inclusive) ? or (all inclusive except the last element in that specified range) ?
        Hide
        cpsoman Chinmay Soman added a comment -

        I think moving the abstract storage engine factories to samza-core is a good idea (thanks for suggesting !)

        At this point, I'm still playing with the TreeMap implementation vs the guava equivalent - to check which one is better. Based on that, we can choose to get rid of samza-kv (or not).

        Show
        cpsoman Chinmay Soman added a comment - I think moving the abstract storage engine factories to samza-core is a good idea (thanks for suggesting !) At this point, I'm still playing with the TreeMap implementation vs the guava equivalent - to check which one is better. Based on that, we can choose to get rid of samza-kv (or not).
        Hide
        criccomini Chris Riccomini added a comment -

        This would suggest that it would be better to put the in-memory KV store in samza-core, since it doesn't have any library dependencies (apart from perhaps Guava, if we choose to use that).

        1. If we use Guava, I don't want that dependency floating around in core. Google is notoriously evil with their guava/collections garbage.
        2. If we opt to put the in-memory implementation in core, we'd have to move the new abstract kv storage engine that Chinmay Soman is working on, as well. I looked at the imports, and all the dependencies seem to be in API or core, so this should be OK:
        import java.io.File
        import org.apache.samza.config.Config
        import org.apache.samza.container.SamzaContainerContext
        import org.apache.samza.serializers._
        import org.apache.samza.SamzaException
        import org.apache.samza.metrics.MetricsRegistry
        import org.apache.samza.task.MessageCollector
        import org.apache.samza.system.SystemStreamPartition
        import org.apache.samza.storage.StorageEngineFactory
        import org.apache.samza.storage.StorageEngine
        

        If there's nothing that I've missed, then perhaps this is a good idea. As Martin says, we could eliminate samza-kv all together, and just split it up between samza-kv-* and samza-core.

        By the same logic, the LevelDB storage engine should be in a samza-leveldb module, the RocksDB storage engine in a samza-rocksdb module, etc. Perhaps we can remove the samza-kv module in 0.8.0 in favour of modules that make the dependencies more explicit.

        +1 If we can do this without adding any new dependencies to samza-core, I'm all for this.

        Show
        criccomini Chris Riccomini added a comment - This would suggest that it would be better to put the in-memory KV store in samza-core, since it doesn't have any library dependencies (apart from perhaps Guava, if we choose to use that). If we use Guava, I don't want that dependency floating around in core. Google is notoriously evil with their guava/collections garbage. If we opt to put the in-memory implementation in core, we'd have to move the new abstract kv storage engine that Chinmay Soman is working on, as well. I looked at the imports, and all the dependencies seem to be in API or core, so this should be OK: import java.io.File import org.apache.samza.config.Config import org.apache.samza.container.SamzaContainerContext import org.apache.samza.serializers._ import org.apache.samza.SamzaException import org.apache.samza.metrics.MetricsRegistry import org.apache.samza.task.MessageCollector import org.apache.samza.system.SystemStreamPartition import org.apache.samza.storage.StorageEngineFactory import org.apache.samza.storage.StorageEngine If there's nothing that I've missed, then perhaps this is a good idea. As Martin says, we could eliminate samza-kv all together, and just split it up between samza-kv-* and samza-core. By the same logic, the LevelDB storage engine should be in a samza-leveldb module, the RocksDB storage engine in a samza-rocksdb module, etc. Perhaps we can remove the samza-kv module in 0.8.0 in favour of modules that make the dependencies more explicit. +1 If we can do this without adding any new dependencies to samza-core, I'm all for this.
        Hide
        martinkl Martin Kleppmann added a comment - - edited

        Sounds good!

        The In-memory KV store can reside within samza-kv module

        IMHO the main value of splitting Samza into multiple modules is that any particular job only needs to bring in those transitive dependencies that it needs, and no more. That's why the JSON serde is in a separate module (it depends on Jackson) but the String and Integer serdes are in samza-core, for example.

        This would suggest that it would be better to put the in-memory KV store in samza-core, since it doesn't have any library dependencies (apart from perhaps Guava, if we choose to use that). By the same logic, the LevelDB storage engine should be in a samza-leveldb module, the RocksDB storage engine in a samza-rocksdb module, etc. Perhaps we can remove the samza-kv module in 0.8.0 in favour of modules that make the dependencies more explicit.

        What do you think?

        Modify Samza container (I think) to map something like: org.apache.samza.storage.kv.KeyValueStorageEngineFactory => org.apache.samza.storage.kv.LevelDBKeyValueStorageEngineFactory

        Rather than aliasing this in SamzaContainer, I think it would be ok to just keep the KeyValueStorageEngineFactory class, and have it delegate to LevelDBKeyValueStorageEngineFactory.

        Show
        martinkl Martin Kleppmann added a comment - - edited Sounds good! The In-memory KV store can reside within samza-kv module IMHO the main value of splitting Samza into multiple modules is that any particular job only needs to bring in those transitive dependencies that it needs, and no more. That's why the JSON serde is in a separate module (it depends on Jackson) but the String and Integer serdes are in samza-core, for example. This would suggest that it would be better to put the in-memory KV store in samza-core, since it doesn't have any library dependencies (apart from perhaps Guava, if we choose to use that). By the same logic, the LevelDB storage engine should be in a samza-leveldb module, the RocksDB storage engine in a samza-rocksdb module, etc. Perhaps we can remove the samza-kv module in 0.8.0 in favour of modules that make the dependencies more explicit. What do you think? Modify Samza container (I think) to map something like: org.apache.samza.storage.kv.KeyValueStorageEngineFactory => org.apache.samza.storage.kv.LevelDBKeyValueStorageEngineFactory Rather than aliasing this in SamzaContainer, I think it would be ok to just keep the KeyValueStorageEngineFactory class, and have it delegate to LevelDBKeyValueStorageEngineFactory.
        Hide
        cpsoman Chinmay Soman added a comment -

        The high level plan of action is the following:

        • Create a trait which encapsulates the common functionality while creating a key-value storage engine factory.
        • Create additional modules for LevelDB/RocksDB/... that use this trait and pass the appropriate key value store. The In-memory KV store can reside within samza-kv module
        • Modify Samza container (I think) to map something like:
          org.apache.samza.storage.kv.KeyValueStorageEngineFactory => org.apache.samza.storage.kv.LevelDBKeyValueStorageEngineFactory

        This is to maintain backwards compatibility.

        Show
        cpsoman Chinmay Soman added a comment - The high level plan of action is the following: Create a trait which encapsulates the common functionality while creating a key-value storage engine factory. Create additional modules for LevelDB/RocksDB/... that use this trait and pass the appropriate key value store. The In-memory KV store can reside within samza-kv module Modify Samza container (I think) to map something like: org.apache.samza.storage.kv.KeyValueStorageEngineFactory => org.apache.samza.storage.kv.LevelDBKeyValueStorageEngineFactory This is to maintain backwards compatibility.
        Hide
        cpsoman Chinmay Soman added a comment -

        Agreed, it does seem that explicit factory names are better. I'll use that approach.

        Thanks for all the comments !

        C

        Show
        cpsoman Chinmay Soman added a comment - Agreed, it does seem that explicit factory names are better. I'll use that approach. Thanks for all the comments ! C
        Hide
        martinkl Martin Kleppmann added a comment -

        I would prefer approach (1), a separate factory for each type of storage engine. I fear that a generic key-value interface that abstracts across multiple storage engines would be a leaky abstraction; users would still have to think about which storage engine is being used under the hood. Some of the subtle differences that may arise:

        • LevelDB and RocksDB use a sorted log-structured representation which allows efficient range queries, but a HashMap would not allow range queries.
        • Perhaps the in-memory store should use a TreeMap instead, but then it's limited to keys that implement Comparable.
        • For the in-memory storage engine, serdes may be optional. For on-disk storage, serdes are required.
        • LevelDB has no mechanism for expiry; RocksDB supports pluggable compaction filters which allow expiry to be implemented; Guava collections have lots of cache-replacement and expiry options. We should be able to give the user access to whatever options the underlying storage engine provides.

        I also think we should name the factories after the particular storage engine being used (LevelDBStorageEngineFactory, RocksDBStorageEngineFactory, HashMapStorageEngineFactory, etc) not after their persistence characteristics (PersistentKeyValueStorageEngineFactory, InMemoryKeyValueStorageEngineFactory), because:

        1. It's misleading: an in-memory storage engine can still be durable if changelog replication is enabled, and an on-disk storage engine can still lose data if you don't have changelog replication enabled. The difference between on-disk and in-memory storage determines whether you can store state larger than memory, not whether the state is durable.
        2. Leaky abstraction: RocksDB has different features and different performance characteristics from LevelDB, so I don't think it makes sense to abstract over them.
        3. Explicit is better than implicit: users will need to know what storage engine is being used, so the factory name shouldn't hide it from them.

        For compatibility, making KeyValueStorageEngineFactory an alias for LevelDBStorageEngineFactory sounds good to me.

        Show
        martinkl Martin Kleppmann added a comment - I would prefer approach (1), a separate factory for each type of storage engine. I fear that a generic key-value interface that abstracts across multiple storage engines would be a leaky abstraction; users would still have to think about which storage engine is being used under the hood. Some of the subtle differences that may arise: LevelDB and RocksDB use a sorted log-structured representation which allows efficient range queries, but a HashMap would not allow range queries. Perhaps the in-memory store should use a TreeMap instead, but then it's limited to keys that implement Comparable. For the in-memory storage engine, serdes may be optional. For on-disk storage, serdes are required. LevelDB has no mechanism for expiry; RocksDB supports pluggable compaction filters which allow expiry to be implemented; Guava collections have lots of cache-replacement and expiry options. We should be able to give the user access to whatever options the underlying storage engine provides. I also think we should name the factories after the particular storage engine being used (LevelDBStorageEngineFactory, RocksDBStorageEngineFactory, HashMapStorageEngineFactory, etc) not after their persistence characteristics (PersistentKeyValueStorageEngineFactory, InMemoryKeyValueStorageEngineFactory), because: It's misleading: an in-memory storage engine can still be durable if changelog replication is enabled, and an on-disk storage engine can still lose data if you don't have changelog replication enabled. The difference between on-disk and in-memory storage determines whether you can store state larger than memory, not whether the state is durable. Leaky abstraction: RocksDB has different features and different performance characteristics from LevelDB, so I don't think it makes sense to abstract over them. Explicit is better than implicit: users will need to know what storage engine is being used, so the factory name shouldn't hide it from them. For compatibility, making KeyValueStorageEngineFactory an alias for LevelDBStorageEngineFactory sounds good to me.
        Hide
        criccomini Chris Riccomini added a comment -

        Personally, I'm biased towards option (ii) since the existing jobs don't need to change their configs (and we default to LevelDB). What do you guys think ?

        If we went the (1) approach, we could retain backwards compatibility by keeping the KeyValueStorageEngineFactory around, marking @Deprecated, and having it continue to use the LevelDB impl for some time. Again, not sure if this is better or not.

        Show
        criccomini Chris Riccomini added a comment - Personally, I'm biased towards option (ii) since the existing jobs don't need to change their configs (and we default to LevelDB). What do you guys think ? If we went the (1) approach, we could retain backwards compatibility by keeping the KeyValueStorageEngineFactory around, marking @Deprecated, and having it continue to use the LevelDB impl for some time. Again, not sure if this is better or not.
        Hide
        criccomini Chris Riccomini added a comment -

        Keeping that pattern would allow us to break out RocksDB, LevelDB or other implementations that bring substantial baggage, into their own modules, thus making it easier to keep that baggage sorted and segregated

        Yea, that was my main knock on the second approach. It seems to muddle dependencies between the different implementations.

        Would it be better to have separate a LevelDBKeyValueStorageEngineFactory, RocksDBKeyValueStorageEngineFactory, HashMapKeyValueStorageEngineFactory, etc?

        The main thing that bugs me about the mutliple-factory approach is that they're all going to cargo cult almost exactly the same chunk of code. If you look at KeyValueStorageEngineFactory, about 99% of it is setting up the cache, serialization, changelog, etc. This is stuff that every key value store is going to want to do identically. One way around that would be to have the KeyValueStorageEngineFactory be a base class for all the rest, and just have some abstract method that returns the underlying KV store (i.e. LevelDB, RocksDB, etc).

        On the flip side of this, I could imagine some KV stores wanting to slightly tweak the getStorageEngine method. For example, there's no point in having a cache or serialized key value store when using an in-memory TreeMap implementation. Having different factories lets us control this a bit better, maybe.

        stores.*.factory.persistent=true / false

        This config approach only works if we draw a line in the sand that RocksDB will be the only storage engine supported by the KeyValueStorageEngine implementation. In such a case, true=use the TreeMap implementation, and false=use RocksDB. However, if we ever wanted to support another implementation, you then end up needing a second parameter. Something like:

        stores.*.factory.persistent.store=rocksdb|leveldb|lmdb|etc

        I'm not sure if this is better or worse than approach (1), though.

        Show
        criccomini Chris Riccomini added a comment - Keeping that pattern would allow us to break out RocksDB, LevelDB or other implementations that bring substantial baggage, into their own modules, thus making it easier to keep that baggage sorted and segregated Yea, that was my main knock on the second approach. It seems to muddle dependencies between the different implementations. Would it be better to have separate a LevelDBKeyValueStorageEngineFactory, RocksDBKeyValueStorageEngineFactory, HashMapKeyValueStorageEngineFactory, etc? The main thing that bugs me about the mutliple-factory approach is that they're all going to cargo cult almost exactly the same chunk of code. If you look at KeyValueStorageEngineFactory, about 99% of it is setting up the cache, serialization, changelog, etc. This is stuff that every key value store is going to want to do identically. One way around that would be to have the KeyValueStorageEngineFactory be a base class for all the rest, and just have some abstract method that returns the underlying KV store (i.e. LevelDB, RocksDB, etc). On the flip side of this, I could imagine some KV stores wanting to slightly tweak the getStorageEngine method. For example, there's no point in having a cache or serialized key value store when using an in-memory TreeMap implementation. Having different factories lets us control this a bit better, maybe. stores.*.factory.persistent=true / false This config approach only works if we draw a line in the sand that RocksDB will be the only storage engine supported by the KeyValueStorageEngine implementation. In such a case, true=use the TreeMap implementation, and false=use RocksDB. However, if we ever wanted to support another implementation, you then end up needing a second parameter. Something like: stores.*.factory.persistent.store=rocksdb|leveldb|lmdb|etc I'm not sure if this is better or worse than approach (1), though.
        Hide
        jghoman Jakob Homan added a comment -

        Is the key difference here whether or not the kv store persists to disk or not? I could see implementations that provided such persistence as a configurable option, not as an existential feature.

        Also, Option 2 breaks our current pattern of one factory per implementation and loading the factory via reflection. Keeping that pattern would allow us to break out RocksDB, LevelDB or other implementations that bring substantial baggage, into their own modules, thus making it easier to keep that baggage sorted and segregated. Would it be better to have separate a LevelDBKeyValueStorageEngineFactory, RocksDBKeyValueStorageEngineFactory, HashMapKeyValueStorageEngineFactory, etc?

        Show
        jghoman Jakob Homan added a comment - Is the key difference here whether or not the kv store persists to disk or not? I could see implementations that provided such persistence as a configurable option, not as an existential feature. Also, Option 2 breaks our current pattern of one factory per implementation and loading the factory via reflection. Keeping that pattern would allow us to break out RocksDB, LevelDB or other implementations that bring substantial baggage, into their own modules, thus making it easier to keep that baggage sorted and segregated. Would it be better to have separate a LevelDBKeyValueStorageEngineFactory, RocksDBKeyValueStorageEngineFactory, HashMapKeyValueStorageEngineFactory, etc?
        Hide
        cpsoman Chinmay Soman added a comment -

        Copying over from the internal email thread (sorry for that)

        I like option (2). One thing to consider is the submodule structure in this approach. If we have KeyValueStorageEngineFactory in samza-kv, but we then have samza-kv-leveldb and samza-kv-rocksdb components, then samza-kv must depend on both (i.e. Pull in dependencies from both projects--even if you only use RocksDB, you'd still have LevelDB junk in your classpath). This is a little ugly from a hygiene perspective, but I don't think it should cause any problems. Alternatively, we could just have one samza-kv submodule, and put all implementations in there, but that seems a bit nastier even than separate submodules. Alternatively^2, we could have samza-kv do a Class.forName().newInstance to create the actual StorageEngine, but this seems likely to introduce even more runtime errors due to improper dependencies.
        Other than that, I don't see an immediate problem with approach (2). It seems preferable to approach (1) in your list below. That said, let's move over to JIRA. I'm sure other folks will have feedback as well.
        Cheers,
        Chris

        ----------------

        Currently, it seems the Samza job config has something like this:

        org.apache.samza.storage.kv.KeyValueStorageEngineFactory

        Which today -> defaults to a LevelDbKeyValueStore. To make this more pluggable, I think we can use 2 approaches:

        i) Separate Factory:

        Have something like:
        org.apache.samza.storage.kv.PersistentKeyValueStorageEngineFactory
        org.apache.samza.storage.kv.InMemoryKeyValueStorageEngineFactory

        ii) Additional factory config:

        In this case, we can keep the same factory: org.apache.samza.storage.kv.KeyValueStorageEngineFactory,

        but have additional parameters to determine the type. Example:

        stores.*.factory.persistent=true / false

        To add to that, I think option (ii) might be better since we can abstract all key-value stores (RocksDB / LevelDB / In-memory / blah) with one factory and use the additional config parameter to determine what type ?

        This way, the different storage engines can be categorized by their types in a hierarchy ( KeyValue / BitMap / Document structured / blah ...)

        Personally, I'm biased towards option (ii) since the existing jobs don't need to change their configs (and we default to LevelDB). What do you guys think ?

        C

        Show
        cpsoman Chinmay Soman added a comment - Copying over from the internal email thread (sorry for that) I like option (2). One thing to consider is the submodule structure in this approach. If we have KeyValueStorageEngineFactory in samza-kv, but we then have samza-kv-leveldb and samza-kv-rocksdb components, then samza-kv must depend on both (i.e. Pull in dependencies from both projects--even if you only use RocksDB, you'd still have LevelDB junk in your classpath). This is a little ugly from a hygiene perspective, but I don't think it should cause any problems. Alternatively, we could just have one samza-kv submodule, and put all implementations in there, but that seems a bit nastier even than separate submodules. Alternatively^2, we could have samza-kv do a Class.forName().newInstance to create the actual StorageEngine, but this seems likely to introduce even more runtime errors due to improper dependencies. Other than that, I don't see an immediate problem with approach (2). It seems preferable to approach (1) in your list below. That said, let's move over to JIRA. I'm sure other folks will have feedback as well. Cheers, Chris ---------------- Currently, it seems the Samza job config has something like this: org.apache.samza.storage.kv.KeyValueStorageEngineFactory Which today -> defaults to a LevelDbKeyValueStore. To make this more pluggable, I think we can use 2 approaches: i) Separate Factory: Have something like: org.apache.samza.storage.kv.PersistentKeyValueStorageEngineFactory org.apache.samza.storage.kv.InMemoryKeyValueStorageEngineFactory ii) Additional factory config: In this case, we can keep the same factory: org.apache.samza.storage.kv.KeyValueStorageEngineFactory, but have additional parameters to determine the type. Example: stores.*.factory.persistent=true / false To add to that, I think option (ii) might be better since we can abstract all key-value stores (RocksDB / LevelDB / In-memory / blah) with one factory and use the additional config parameter to determine what type ? This way, the different storage engines can be categorized by their types in a hierarchy ( KeyValue / BitMap / Document structured / blah ...) Personally, I'm biased towards option (ii) since the existing jobs don't need to change their configs (and we default to LevelDB). What do you guys think ? C
        Hide
        jghoman Jakob Homan added a comment -

        Yes, I have a patch I'm finishing up.

        Show
        jghoman Jakob Homan added a comment - Yes, I have a patch I'm finishing up.
        Hide
        davidzchen David Chen added a comment -

        Hi Jakob,

        Have you started working on this? If not, would it be fine if I pick this up?

        Thanks,
        David

        Show
        davidzchen David Chen added a comment - Hi Jakob, Have you started working on this? If not, would it be fine if I pick this up? Thanks, David

          People

          • Assignee:
            cpsoman Chinmay Soman
            Reporter:
            jghoman Jakob Homan
          • Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development