MAHOUT-944: LuceneIndexToSequenceFiles (lucene2seq) utility


    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.5
    • Fix Version/s: 0.8
    • Component/s: Integration
    • Labels:


      Here is a lucene2seq tool I used in a project. It creates sequence files from the stored fields of a Lucene index.

      The output of this tool can then be fed into seq2sparse, and from there you can do text clustering.

      Comes with Java bean configuration.
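      To make the idea concrete, here is a minimal, dependency-free sketch of the mapping described above: one document's stored fields become one (key, value) pair of the kind a sequence file holds. This is not the code from the patch. The class name StoredFieldsSketch, the Config bean, its idField and fields properties, and the choice to join field values with a single space are all assumptions made for illustration. A real implementation would read the documents from a Lucene index and write the pairs to a Hadoop SequenceFile; both steps are replaced here by plain Java collections.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; not the class from the attached patch. */
final class StoredFieldsSketch {

  /** Bean-style configuration. The property names are assumptions. */
  static final class Config {
    private String idField = "id";
    private List<String> fields = new ArrayList<>();

    String getIdField() { return idField; }
    void setIdField(String idField) { this.idField = idField; }
    List<String> getFields() { return fields; }
    void setFields(List<String> fields) { this.fields = fields; }
  }

  /**
   * Builds one (key, value) pair from a document's stored fields.
   * The key is the value of the configured id field; the value is the
   * configured text fields joined with a single space (an assumption).
   */
  static Map.Entry<String, String> toPair(Map<String, String> storedFields, Config config) {
    String key = storedFields.get(config.getIdField());
    if (key == null) {
      throw new IllegalArgumentException("Missing id field: " + config.getIdField());
    }
    StringBuilder value = new StringBuilder();
    for (String field : config.getFields()) {
      String text = storedFields.get(field);
      if (text == null) {
        continue; // skip fields that are not stored for this document
      }
      if (value.length() > 0) {
        value.append(' ');
      }
      value.append(text);
    }
    return new AbstractMap.SimpleImmutableEntry<>(key, value.toString());
  }

  public static void main(String[] args) {
    // Stand-in for the stored fields of a single indexed document.
    Map<String, String> doc = new LinkedHashMap<>();
    doc.put("id", "42");
    doc.put("title", "Hello");
    doc.put("body", "world");

    Config config = new Config();
    config.setIdField("id");
    config.setFields(Arrays.asList("title", "body"));

    Map.Entry<String, String> pair = toPair(doc, config);
    System.out.println(pair.getKey() + " -> " + pair.getValue());
  }
}
```

      Keeping the id field as the key means each document can still be traced back to the index after vectorization and clustering, which is the main reason for the key/value split in this sketch.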

      Let me know what you think. Some CLI code can be added later on. I used this for a small-scale project of roughly 100,000 docs. Would a MapReduce (MR) version be useful, or is that overkill?

      See https://github.com/frankscholten/mahout/tree/lucene2seq for commits and review comments from Simon Willnauer (thanks, Simon!), or see the attached patch.


        1. MAHOUT-944.patch (91 kB, Grant Ingersoll)
        2. MAHOUT-944.patch (86 kB, Grant Ingersoll)
        3. MAHOUT-944.patch (81 kB, Grant Ingersoll)
        4. MAHOUT-944.patch (81 kB, Grant Ingersoll)
        5. MAHOUT-944.patch (82 kB, Grant Ingersoll)
        6. MAHOUT-944.patch (85 kB, Grant Ingersoll)
        7. MAHOUT-944.patch (377 kB, Frank Scholten)
        8. MAHOUT-944.patch (86 kB, Frank Scholten)
        9. MAHOUT-944.patch (39 kB, Frank Scholten)
        10. MAHOUT-944.patch (39 kB, Frank Scholten)
        11. MAHOUT-944.patch (53 kB, Frank Scholten)
        12. MAHOUT-944.patch (20 kB, Frank Scholten)
        13. MAHOUT-944-minor.patch (69 kB, Grant Ingersoll)



            • Assignee:
              gsingers Grant Ingersoll
              frankscholten Frank Scholten
            • Votes: 2
            • Watchers: 7


              • Created: