Avro / AVRO-230

Create a shared schema test directory structure

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: c, c++, java, python
    • Labels: None

      Description

      This is an example of my proposed directory structure:

      • invalid_schemas/
        • broken.json
        • wrong.json
      • valid_data/
        • foo_test/
          • schema.json
          • json_data/
            • valid_json_test_data.json
            • more_valid_json_test_data.json
          • binary_data/
            • valid_binary_test_data.bin
            • more_test_data.bin
        • bar_test/
          • schema.json
          • json_data/
            • ...
      • invalid_data/
        • baz_test/
          • schema.json
          • json_data/
            • ...
          • binary_data/
            • ...

      This structure supports positive and negative tests for Avro schemas, JSON data, and binary data.

      • The "invalid_schemas" directory holds a number of invalid schemas that should fail to parse.
      • The "valid_data" directory has a number of self-contained tests in separate directories. Each test directory is required to have a "schema.json" file that contains a valid Avro schema. The "json_data" and "binary_data" directories are optional for each test.
      • The "invalid_data" directory follows the same rules as the "valid_data" directory, but its data files should fail during tests (negative testing).
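
      A minimal sketch of how a test harness might walk this layout, written in Python with a hypothetical parse_schema() standing in for whichever language binding's schema parser is under test (the directory names follow the proposal above; the harness itself is illustrative, not part of the proposal):

        import os

        SHARED_TEST_DIR = "share/test"  # assumed location of the shared tree

        def parse_schema(json_text):
            # Hypothetical stand-in for the binding's schema parser under test,
            # e.g. avro.schema.parse in the Python implementation.
            raise NotImplementedError("supplied by the language binding being tested")

        def test_invalid_schemas():
            # Every file under invalid_schemas/ must fail to parse.
            root = os.path.join(SHARED_TEST_DIR, "invalid_schemas")
            for name in os.listdir(root):
                with open(os.path.join(root, name)) as f:
                    text = f.read()
                try:
                    parse_schema(text)
                except Exception:
                    continue  # expected: the schema is rejected
                raise AssertionError("%s parsed but should not have" % name)

        def test_valid_data_schemas():
            # Every test directory under valid_data/ must contain a parsable
            # schema.json; json_data/ and binary_data/ are optional.
            root = os.path.join(SHARED_TEST_DIR, "valid_data")
            for test_dir in os.listdir(root):
                with open(os.path.join(root, test_dir, "schema.json")) as f:
                    parse_schema(f.read())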

        Activity

        Jeff Hammerbacher added a comment -

        Love it! From working with the schemas you use for the C code: it would be nice to have some information about just what corner case the schema is meant to test. You've stuffed some information in the file name, and some in the JSON text itself, but it would be nice to have a more official way of encoding that information.

        I'd vote for a "valid_schemas" directory too.

        We should probably do the same thing for valid/invalid protocols.

        Doug Cutting added a comment -

        > I'd vote for a "valid_schemas" directory too.

        The valid_data directory has valid schemas. Is that enough?

        In valid_data/binary_data, should each file contain just a single serialized object with no headers, etc., or should it be a data file? My hunch is it should be a single object, and that we should have a separate valid_data_files directory.

        To support validation, we could store a random seed in the data file's metadata. Then a validator can read the file while generating random objects from the seed and schema, and check that the two match.

        > We should probably do the same thing for valid/invalid protocols.

        +1 We might also include valid requests and responses.

        Matt Massie added a comment -

        I'd vote for a "valid_schemas" directory too.

        Jeff: By definition, all schemas in the "valid_data" directory must be valid, and the "json_data" and "binary_data" directories are optional. So it would be easy to test an array of schemas without much hassle.

        e.g.

        valid_data

        • test_foo
          • schema.json
        • test_bar
          • schema.json
        • test_baz
          • schema.json

        would be all you need to test three schemas without any data tests.

        We should probably do the same thing for valid/invalid protocols.

        Jeff: +1 on your and Doug's suggestion to also include valid requests and responses.

        In valid_data/binary_data, should each file contain just a single serialized object with no headers, etc, or should it be a data file?

        Doug: I think you're correct in having a separate directory for single objects and objects inside a container. This would allow people writing new language bindings to test object encoding/decoding without needing to implement any containers.
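
        As a rough illustration of that split, here is a sketch using the Python bindings of the time (assuming the avro.schema.parse, avro.io, and avro.datafile APIs; the single-object paths come from the proposed layout, and the container file path is hypothetical):

          from avro import schema, io as avro_io, datafile

          # Case 1: a bare serialized object with no container header -- the
          # writer's schema has to be supplied out of band (schema.json).
          sch = schema.parse(open("valid_data/foo_test/schema.json").read())
          reader = avro_io.DatumReader(sch)
          raw = open("valid_data/foo_test/binary_data/valid_binary_test_data.bin", "rb")
          datum = reader.read(avro_io.BinaryDecoder(raw))
          raw.close()

          # Case 2: an object container file -- the schema and metadata travel
          # with the data, which is why a separate valid_data_files directory
          # (as suggested above) would make sense.
          container = datafile.DataFileReader(open("valid_data_files/foo_test.avro", "rb"),
                                              avro_io.DatumReader())
          for record in container:
              pass  # each record is a decoded datum
          container.close()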

        To support validation, we could store a random seed in data file's metadata. Then a validator can read the file while generating random objects from the seed and schema and check that the two match.

        Doug: +1 I really like this idea. Couldn't we just (1) seed with a fixed number, (2) encode the schema with random data, and then (3) save away the next random number? The implementation validating the schema would then (1) seed with a fixed number, (2) decode the data, and (3) generate a random number and compare it to the encoder's stored value?

        Or did I just repeat your recommendation using different words?

        Doug Cutting added a comment -

        > generate a random number and compare it to the encoder's stored value?

        What I'm thinking of is that we provide somewhere a spec for a random data generator. This can be based on, e.g., a simple 32-bit Linear Congruential random number generator:

        http://en.wikipedia.org/wiki/Linear_congruential_generator

        This can be implemented in a few simple lines of code as the "raw" generator. Since small values should often be more common, we can define a biased generator by using the high bit (high bits are more random) of the next raw random number as a coin. Toss it until false, counting the tosses, then use that many high-order bits of the next raw random number as the biased value.

        Then we specify for each kind of schema, how random data should be generated.

        • int: take the next biased value. subtract it from 0 if the next raw value's high-bit is one.
        • long: multiply the next two biased values, then compute sign as above.
        • float/double: divide the next two biased values, compute sign as above.
        • bytes: use next biased value as length, up to, e.g., 64k, then fill with high-byte of next raw value.
        • string: use next biased value as length, up to, e.g., 64k, then fill with randomly selected [a-z].
        • union: select branch next_raw % branch_count
        • etc.

        The remaining question is how to seed the LCG generator. My instinct is not to always start with a fixed seed, but rather to select a seed based on, e.g., the system time when a test file is created, and to store that seed in each file's metadata.
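
        A minimal Python sketch of the raw and biased generators described above (the LCG constants are illustrative, since the comment doesn't pin them down, and only a couple of the per-type rules are shown):

          A, C, M = 1664525, 1013904223, 2 ** 32  # illustrative 32-bit LCG constants

          class RandomData(object):
              def __init__(self, seed):
                  # The seed would be stored in each test file's metadata so a
                  # validator can replay the same sequence.
                  self.state = seed & 0xFFFFFFFF

              def next_raw(self):
                  # one 32-bit linear congruential step
                  self.state = (A * self.state + C) % M
                  return self.state

              def coin(self):
                  # high bit of the next raw value (high bits are more random)
                  return (self.next_raw() >> 31) & 1

              def next_biased(self):
                  # Toss the coin until false, counting tosses, then use that many
                  # high-order bits of the next raw value as the biased result.
                  tosses = 0
                  while self.coin():
                      tosses += 1
                  tosses = min(tosses, 32)
                  return self.next_raw() >> (32 - tosses) if tosses else 0

              def next_int(self):
                  # int: next biased value, negated if the next raw high bit is one
                  value = self.next_biased()
                  return -value if self.coin() else value

              def next_long(self):
                  # long: product of the next two biased values, sign as above
                  value = self.next_biased() * self.next_biased()
                  return -value if self.coin() else value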

        Matt Massie added a comment -

        Seeding with a fixed number is an awful idea now that I really think about it. Your approach is better.

        Bruce Mitchener added a comment -

        We need at least a simple version of this sooner rather than later (we aren't currently doing correct interop testing of the actual values of floats, doubles and such).

        We can have things without random data involved and a static directory of known values that we can write tests to work against, no?

        Keeping it simple would mean that someone could start knocking out some basic examples in an hour or two, and then we'd just need a reader in each language to verify.

        Doug Cutting added a comment -

        > We can have things without random data involved and a static directory of known values that we can write tests to work against, no?

        Yes, something would be much better than nothing here.

        Bruce Mitchener added a comment -

        Going to try to get something going here for the 1.4 release.


          People

          • Assignee: Matt Massie
          • Reporter: Matt Massie
          • Votes: 0
          • Watchers: 3
