Lucene - Core / LUCENE-1799

Unicode compression

    Details

    • Type: New Feature
    • Status: Reopened
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.4.1
    • Fix Version/s: None
    • Component/s: core/store
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      In LUCENE-1793, an off-topic suggestion was made to provide compression of Unicode data. The motivation was a custom encoding in a Russian analyzer, which was originally assumed to produce a more compact index.

      This led to the comment that a different or compressed encoding would be a generally useful feature.

      BOCU-1 was suggested as a possibility. It is an algorithm patented by IBM, with an implementation in ICU. If Lucene provides its own implementation, a freely available, royalty-free license would need to be obtained.

      SCSU is another Unicode compression algorithm that could be used.

      An advantage of these methods is that they work on the whole of Unicode. If that is not needed, an encoding such as ISO-8859-1 (or whatever covers the input) could be used.
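
To make the size trade-off concrete, here is a minimal sketch that compares the encoded length of a short string in UTF-8 and in ISO-8859-1 using only the JDK's standard charsets. The class name `EncodingSizeDemo` and its helper methods are hypothetical, chosen for illustration; this is not Lucene code and says nothing about how Lucene stores terms. It does not implement BOCU-1 or SCSU.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Lucene or any attached patch.
public class EncodingSizeDemo {

    /** Returns the number of bytes the text occupies in the given charset. */
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    /**
     * Returns true if every character of the text is representable in the charset.
     * Worth checking first: String.getBytes silently substitutes a replacement
     * byte for unmappable characters instead of failing.
     */
    static boolean covers(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String text = "d\u00e9j\u00e0 vu"; // "déjà vu": 7 characters, 2 of them non-ASCII

        // Each accented letter takes 2 bytes in UTF-8 but 1 byte in ISO-8859-1.
        System.out.println("UTF-8:      " + encodedLength(text, StandardCharsets.UTF_8));      // 9
        System.out.println("ISO-8859-1: " + encodedLength(text, StandardCharsets.ISO_8859_1)); // 7

        // A Cyrillic string is outside ISO-8859-1, so that charset cannot be used for it.
        String cyrillic = "\u043f\u0440\u0438\u0432\u0435\u0442"; // "привет"
        System.out.println("covers Cyrillic: " + covers(cyrillic, StandardCharsets.ISO_8859_1)); // false
    }
}
```

The saving only holds when the single-byte charset covers the whole input, which is why the coverage check matters before choosing such an encoding over a whole-Unicode scheme.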

        Attachments

        1. LUCENE-1799.patch
          9 kB
          Robert Muir
        2. LUCENE-1799.patch
          9 kB
          Uwe Schindler
        3. LUCENE-1799.patch
          9 kB
          Uwe Schindler
        4. LUCENE-1799.patch
          10 kB
          Uwe Schindler
        5. LUCENE-1799.patch
          11 kB
          Uwe Schindler
        6. LUCENE-1799.patch
          17 kB
          Uwe Schindler
        7. LUCENE-1799_big.patch
          355 kB
          Robert Muir
        8. LUCENE-1799.patch
          9 kB
          Robert Muir
        9. LUCENE-1779.patch
          21 kB
          Michael McCandless
        10. LUCENE-1799.patch
          7 kB
          Michael McCandless
        11. LUCENE-1799.patch
          7 kB
          Michael McCandless
        12. LUCENE-1799.patch
          9 kB
          Michael McCandless
        13. LUCENE-1799.patch
          9 kB
          Robert Muir
        14. LUCENE-1799.patch
          9 kB
          Robert Muir
        15. LUCENE-1799.patch
          9 kB
          Robert Muir
        16. LUCENE-1799.patch
          13 kB
          Robert Muir
        17. Benchmark.java
          1 kB
          Robert Muir
        18. Benchmark.java
          4 kB
          Yonik Seeley
        19. Benchmark.java
          1 kB
          Yonik Seeley

            People

            • Assignee: Unassigned
            • Reporter: DM Smith (dmsmith555)
            • Votes: 2
            • Watchers: 4