Apache Avro / AVRO-1922

Fixed dimension for array

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.8.1
    • Fix Version/s: None
    • Component/s: spec
    • Labels:
      None

      Description

      This is a feature request for future versions of the Avro specification.

      We have found one kind of data structure that is hard to express in Avro: tensors. Although we can (and do) build matrices as {"type": "array", "items": {"type": "array", "items": "double"}}, this type does not specify that the grid of numbers is rectangular. We believe that rectangular arrays of numbers (or other nested types) would be a strong addition to Avro, both as a type system and as a serialization format. With the size of every dimension fixed in the schema, those sizes would not need to be repeated in each serialized datum.

      For instance, suppose there was an extension of type "array" to specify dimensions:

      {"type": "array", "dimensions": [3, 3, 3, 3], "items": "double"}

      This 3-by-3-by-3-by-3 tensor (representing, for instance, the Riemann curvature tensor in 3-space) specifies that 81 double-precision numbers (3*3*3*3) are expected for each datum. With nested arrays, the size, "3," would have to be separately encoded 40 times (1 + 3*(1 + 3*(1 + 3))) for each datum, even though they would never change in a dataset of Riemann tensors. With a "dimensions" attribute in the schema, only the content needs to be serialized. Moreover, this extension can clearly be used with any other "items" type, to make dense tables of strings, for instance.
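
      The count above can be checked with a short sketch. It assumes the Avro binary encoding as we read it in the specification: an array is written as a block count (a zigzag varint long), the items, and a zero-count terminator, and a double is 8 little-endian bytes. The helper names (encode_long, encode_nested, make_tensor, count_arrays) are hypothetical and written only for this illustration; they are not part of any Avro library.

```python
import struct


def encode_long(n: int) -> bytes:
    """Encode an integer as an Avro long: zigzag, then base-128 varint."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_nested(value) -> bytes:
    """Encode nested Python lists of floats as nested Avro arrays of double."""
    if isinstance(value, list):
        if not value:
            return encode_long(0)  # empty array: just the terminator
        body = b"".join(encode_nested(v) for v in value)
        # One block: item count, the items, then a zero-count terminator.
        return encode_long(len(value)) + body + encode_long(0)
    return struct.pack("<d", value)  # double: 8 bytes, little-endian


def make_tensor(dims, fill=0.0):
    """Build a rectangular nested list with the given dimensions."""
    if not dims:
        return fill
    return [make_tensor(dims[1:], fill) for _ in range(dims[0])]


def count_arrays(dims) -> int:
    """Number of array objects in the nested representation."""
    total, width = 0, 1
    for d in dims:
        total += width
        width *= d
    return total


dims = [3, 3, 3, 3]
n_arrays = count_arrays(dims)                     # 1 + 3 + 9 + 27
nested_size = len(encode_nested(make_tensor(dims)))
content_size = 8 * 3 * 3 * 3 * 3                  # 81 doubles, no sizes
print(n_arrays, nested_size, content_size)
```

      Under these assumptions, each of the 40 nested arrays also carries a one-byte terminator, so the nested form spends 80 bytes per datum on structure that a "dimensions" attribute would move into the schema.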

      Avro has been extended in a similar way in the past. The "fixed" type is effectively a "bytes" whose length is declared once in the schema instead of being encoded with each datum. Our proposal provides the same kind of packing for structured objects, and the savings can be significant when the number of dimensions is large, as shown above. The advantage to consumers of Avro data is that we can write functions (for operations like tensor contractions and products) that do not need to check all array sizes at runtime.
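
      To illustrate the runtime check that consumers need today, here is a minimal sketch assuming the datum has already been decoded into nested Python lists. The function name has_shape is hypothetical. With dimensions fixed in the schema, a check like this would presumably be unnecessary, since the schema itself would guarantee the shape.

```python
def has_shape(value, dims) -> bool:
    """Return True if nested lists `value` form a rectangular grid of sizes `dims`."""
    if not dims:
        # Innermost level: expect a scalar item, not another list.
        return not isinstance(value, list)
    if not isinstance(value, list) or len(value) != dims[0]:
        return False
    # Every sub-array must match the remaining dimensions.
    return all(has_shape(v, dims[1:]) for v in value)


rectangular = [[1.0, 2.0], [3.0, 4.0]]
ragged = [[1.0, 2.0], [3.0]]
print(has_shape(rectangular, [2, 2]), has_shape(ragged, [2, 2]))
```

      Both inputs are valid under the nested-array schema {"type": "array", "items": {"type": "array", "items": "double"}}, which is why the check cannot be skipped today.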

      We have searched the web and the Avro JIRA site for similar proposals and found none, so we're adding this proposal to JIRA in addition to this e-mail. Please let us know if you have any comments, or if we can provide any more information.

      Thank you!

            People

            • Assignee: Unassigned
            • Reporter: jpivarski Jim Pivarski