Details
Type: Wish
Status: Closed
Priority: Minor
Resolution: Later
Component: C/C++
Description
Narrative text is, by default, notoriously inefficient to store on disk or in memory. In its most basic form it is a long sequence of bytes with no indexing or other optimized layout.
Data structures such as tries, DAFSAs (deterministic acyclic finite state automata), and b-tries support more efficient storage and lookup of phrases.
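To make the storage and lookup claim concrete, here is a minimal sketch of a plain trie in Python. It is not part of Arrow; the names `Trie`, `TrieNode`, `insert`, `with_prefix`, and `node_count` are hypothetical and chosen only for illustration. Phrases that share a prefix share nodes, so the node count can be much lower than the total number of characters stored.

```python
class TrieNode:
    """One node per distinct prefix; children are keyed by the next character."""
    __slots__ = ("children", "terminal")

    def __init__(self):
        self.children = {}
        self.terminal = False  # True if a stored phrase ends at this node


class Trie:
    """Hypothetical minimal trie, for illustration only."""

    def __init__(self):
        self.root = TrieNode()
        self.node_count = 1  # counts the root

    def insert(self, phrase):
        node = self.root
        for ch in phrase:
            nxt = node.children.get(ch)
            if nxt is None:
                nxt = TrieNode()
                node.children[ch] = nxt
                self.node_count += 1
            node = nxt
        node.terminal = True

    def _walk(self, prefix):
        """Return the node reached by following prefix, or None if absent."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def __contains__(self, phrase):
        node = self._walk(phrase)
        return node is not None and node.terminal

    def with_prefix(self, prefix):
        """Return all stored phrases starting with prefix, sorted."""
        start = self._walk(prefix)
        if start is None:
            return []
        found = []
        stack = [(start, prefix)]
        while stack:
            node, text = stack.pop()
            if node.terminal:
                found.append(text)
            for ch, child in node.children.items():
                stack.append((child, text + ch))
        return sorted(found)
```

For the four words "arrow", "array", "arrays", and "trie" (20 characters in total), this trie needs 13 nodes including the root, and a membership test costs one dictionary lookup per character of the query. A DAFSA would additionally merge shared suffixes, which this sketch does not attempt.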
We would like Arrow to be able to serialize to and from these structures, so that it can act as the format/carrier between high-performance text-processing steps that prefer to operate on binary data structures (lookups, spellers, or more advanced NLP routines).
So, it could be something like:
text.to_arrow(infer=true|dafsa|trie|b-trie) : arrow // writes Arrow data using the specified encoding. The argument could be implicit if the encoding were stored in some kind of manifest
arrow.to_text(infer=true|dafsa|trie|b-trie) : string // restores text from the Arrow data using the specified encoding, same as above
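One way to picture the proposed round trip is to flatten a trie into parallel, fixed-type columns, which is the kind of layout Arrow arrays carry. The sketch below uses plain Python lists and no Arrow library; the function names `to_columns` and `to_phrases` and the column names `parent`, `char`, and `terminal` are assumptions made for illustration, not an existing or proposed Arrow API. It also restores only the sorted set of distinct phrases, not the original order or repetitions of the narrative.

```python
def to_columns(phrases):
    """Build a trie over phrases and flatten it into three parallel columns.

    Row 0 is the root. For every other row, parent[i] is the row of the
    parent node, char[i] is the edge character, and terminal[i] marks the
    end of a stored phrase.
    """
    parent = [-1]       # the root has no parent
    char = [""]         # the root carries no character
    terminal = [False]
    children = [{}]     # build-time index only, not part of the output
    for phrase in phrases:
        row = 0
        for ch in phrase:
            nxt = children[row].get(ch)
            if nxt is None:
                nxt = len(parent)
                parent.append(row)
                char.append(ch)
                terminal.append(False)
                children.append({})
                children[row][ch] = nxt
            row = nxt
        terminal[row] = True
    return {"parent": parent, "char": char, "terminal": terminal}


def to_phrases(columns):
    """Rebuild the sorted list of distinct phrases from the flattened columns."""
    parent, char = columns["parent"], columns["char"]
    phrases = []
    for end_row, is_end in enumerate(columns["terminal"]):
        if not is_end:
            continue
        # Walk from the terminal node back to the root, collecting characters.
        chars = []
        cur = end_row
        while cur != 0:
            chars.append(char[cur])
            cur = parent[cur]
        phrases.append("".join(reversed(chars)))
    return sorted(phrases)
```

Calling `to_phrases(to_columns(["arrow", "array", "arrays", "trie"]))` returns the same four words in sorted order. The three lists map naturally onto an integer column, a string column, and a boolean column, so they could in principle be handed to an Arrow implementation as a table; which encoding was used would still have to be recorded somewhere, such as the manifest mentioned above.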
On the dev mailing list we are discussing the creation of a contrib folder where such features could optionally be included in Arrow.