Apache Arrow / ARROW-4753

[C++] Extension types and layouts for text-optimized data structures

    Details

    • Type: Wish
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 2.0.0
    • Component/s: C++, Format
    • Labels:
    • Environment:
      C/C++

      Description

      Narrative text is, by default, inefficient to store on disk or in memory. In its most basic form it is a long sequence of bytes with no indexing or other optimized layout.
       
      Data structures such as tries, DAFSAs (deterministic acyclic finite state automata), and b-tries support more efficient storage and lookup of phrases.
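To make the storage argument concrete, here is a minimal sketch of a byte-wise trie in plain C++. It is not part of Arrow and not a proposed implementation; the names `Trie`, `Insert`, `Contains`, and `SharedPrefixExample` are hypothetical and chosen only for illustration. It shows that phrases sharing a prefix share the nodes for that prefix.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>

// Hypothetical minimal byte-wise trie (illustration only, not an Arrow API).
class Trie {
 public:
  // Inserts a phrase, creating nodes only for bytes not already present.
  void Insert(const std::string& phrase) {
    Node* node = &root_;
    for (unsigned char byte : phrase) {
      std::unique_ptr<Node>& child = node->children[byte];
      if (!child) {
        child = std::make_unique<Node>();
        ++node_count_;
      }
      node = child.get();
    }
    node->terminal = true;
  }

  // Returns true only if this exact phrase was inserted earlier.
  bool Contains(const std::string& phrase) const {
    const Node* node = &root_;
    for (unsigned char byte : phrase) {
      auto it = node->children.find(byte);
      if (it == node->children.end()) return false;
      node = it->second.get();
    }
    return node->terminal;
  }

  // Number of nodes, excluding the root.
  std::size_t node_count() const { return node_count_; }

 private:
  struct Node {
    std::map<unsigned char, std::unique_ptr<Node>> children;
    bool terminal = false;  // true if a phrase ends at this node
  };
  Node root_;
  std::size_t node_count_ = 0;
};

// Usage sketch: three phrases that share the prefix "te".
inline std::size_t SharedPrefixExample() {
  Trie trie;
  trie.Insert("tea");
  trie.Insert("ten");
  trie.Insert("team");
  assert(trie.Contains("team"));
  assert(!trie.Contains("te"));  // a prefix alone is not a stored phrase
  return trie.node_count();      // nodes t, e, a, n, m
}
```

A pointer-based trie like this one is convenient for lookups but is not itself compact in memory; a DAFSA additionally merges shared suffixes, which this sketch does not attempt.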
       
      We would like Arrow to be able to serialize from and to these structures, so that it can serve as the format/carrier between high-performance text processing steps that prefer to operate on binary data structures (lookups, spellers, or more advanced NLP routines).
       
      So, it could be something like:
       
      text.to_arrow(infer=true|dafsa|trie|b-trie) : arrow // writes the text as Arrow data in the specified encoding. The encoding could be implicit if we could store it in some kind of manifest.
       
      arrow.to_text(infer=true|dafsa|trie|b-trie) : string // restores the text from the Arrow format using the specified encoding, same as above.
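As a rough illustration of what such a round trip could look like, the sketch below flattens a trie into three parallel columns and restores the phrases from them, using only the C++ standard library. Everything here is an assumption made for illustration: `TrieColumns`, `ToColumns`, and `ToText` are hypothetical names, the layout is not an Arrow extension type, and no Arrow API is called. The columns are plain vectors that could in principle be carried as Arrow arrays. Note that this restores the set of distinct phrases, not the original order or duplicates of the input text.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical columnar trie layout: row i describes node i.
// Row 0 is the root; its parent is -1 and its label is unused.
struct TrieColumns {
  std::vector<std::int32_t> parent;    // row index of the parent node
  std::vector<std::uint8_t> label;     // byte on the edge from the parent
  std::vector<std::uint8_t> terminal;  // 1 if a phrase ends at this node
};

// Builds the columns from a list of phrases (stand-in for "to_arrow").
TrieColumns ToColumns(const std::vector<std::string>& phrases) {
  TrieColumns cols;
  cols.parent.push_back(-1);
  cols.label.push_back(0);
  cols.terminal.push_back(0);
  for (const std::string& phrase : phrases) {
    std::int32_t node = 0;
    for (unsigned char byte : phrase) {
      // Linear scan for an existing child: fine for a sketch, slow at scale.
      std::int32_t child = -1;
      for (std::size_t i = 1; i < cols.parent.size(); ++i) {
        if (cols.parent[i] == node && cols.label[i] == byte) {
          child = static_cast<std::int32_t>(i);
          break;
        }
      }
      if (child < 0) {
        child = static_cast<std::int32_t>(cols.parent.size());
        cols.parent.push_back(node);
        cols.label.push_back(byte);
        cols.terminal.push_back(0);
      }
      node = child;
    }
    cols.terminal[static_cast<std::size_t>(node)] = 1;
  }
  return cols;
}

// Restores the distinct phrases, in node order (stand-in for "to_text").
std::vector<std::string> ToText(const TrieColumns& cols) {
  assert(cols.parent.size() == cols.label.size());
  assert(cols.parent.size() == cols.terminal.size());
  std::vector<std::string> phrases;
  for (std::size_t i = 0; i < cols.parent.size(); ++i) {
    if (!cols.terminal[i]) continue;
    // Walk parent links up to the root, collecting bytes in reverse.
    std::string phrase;
    for (std::int32_t n = static_cast<std::int32_t>(i); n > 0;
         n = cols.parent[static_cast<std::size_t>(n)]) {
      phrase.push_back(static_cast<char>(cols.label[static_cast<std::size_t>(n)]));
    }
    std::reverse(phrase.begin(), phrase.end());
    phrases.push_back(phrase);
  }
  return phrases;
}
```

The parent-pointer layout was picked because it needs only fixed-width columns of equal length; a child-offset layout would make lookups faster at the cost of a more involved build step.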
       
      On the dev mailing list we are discussing the creation of a contrib folder where such features could be optionally included for Arrow.

            People

            • Assignee: Unassigned
            • Reporter: Edmon Begoli (ebegoli)