  1. Mahout
  2. MAHOUT-944

LuceneIndexToSequenceFiles (lucene2seq) utility


Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.5
    • Fix Version/s: 0.8
    • Component/s: classic
    • Labels: None

    Description

      Here is a lucene2seq tool I used in a project. It creates sequence files from the stored fields of a Lucene index.

      The output of this tool can then be fed into seq2sparse, and from there you can do text clustering.
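      To illustrate the idea described above, here is a minimal in-memory sketch of turning stored fields into the key/value pairs a sequence file would hold. It is not the code from the patch and uses no Lucene or Hadoop APIs. The class name `Lucene2SeqSketch`, the method `toKeyValuePairs`, the field names, and the choices to join fields with a space and to skip documents without an id are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simplified in-memory model of the lucene2seq idea (hypothetical, not the patch). */
final class Lucene2SeqSketch {

    /**
     * Converts documents, each modelled as a map from stored field name to
     * stored value, into key/value pairs.
     * Key: the value of idField. Value: the selected fields joined by a space.
     * Assumption: documents lacking the id field are skipped, and fields a
     * document does not have are ignored.
     */
    static Map<String, String> toKeyValuePairs(List<Map<String, String>> docs,
                                               String idField,
                                               List<String> fields) {
        Map<String, String> pairs = new LinkedHashMap<>();
        for (Map<String, String> doc : docs) {
            String id = doc.get(idField);
            if (id == null) {
                continue; // no key available for this document
            }
            List<String> parts = new ArrayList<>();
            for (String field : fields) {
                String value = doc.get(field);
                if (value != null) {
                    parts.add(value);
                }
            }
            pairs.put(id, String.join(" ", parts));
        }
        return pairs;
    }

    public static void main(String[] args) {
        // One example document with three stored fields (made-up data).
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("id", "doc-1");
        doc.put("title", "Mahout");
        doc.put("body", "Text clustering example");

        Map<String, String> pairs = toKeyValuePairs(
                Collections.singletonList(doc), "id", Arrays.asList("title", "body"));

        // Each entry corresponds to one record that would be written to a sequence file.
        pairs.forEach((key, value) -> System.out.println(key + "\t" + value));
    }
}
```

      In a real run, each pair would be written as a text key and text value to a Hadoop sequence file, which is the input format seq2sparse expects.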

      Comes with Java bean configuration.

      Let me know what you think. Some CLI code can be added later on. I used this for a small-scale project of roughly 100,000 docs. Would a MapReduce version be useful, or is that overkill?

      See https://github.com/frankscholten/mahout/tree/lucene2seq for the commits and review comments from Simon Willnauer (thanks, Simon!), or see the attached patch.

      Attachments

        1. MAHOUT-944.patch
          20 kB
          Frank Scholten
        2. MAHOUT-944.patch
          53 kB
          Frank Scholten
        3. MAHOUT-944.patch
          39 kB
          Frank Scholten
        4. MAHOUT-944.patch
          39 kB
          Frank Scholten
        5. MAHOUT-944.patch
          86 kB
          Frank Scholten
        6. MAHOUT-944.patch
          377 kB
          Frank Scholten
        7. MAHOUT-944.patch
          85 kB
          Grant Ingersoll
        8. MAHOUT-944.patch
          82 kB
          Grant Ingersoll
        9. MAHOUT-944.patch
          81 kB
          Grant Ingersoll
        10. MAHOUT-944.patch
          81 kB
          Grant Ingersoll
        11. MAHOUT-944.patch
          86 kB
          Grant Ingersoll
        12. MAHOUT-944.patch
          91 kB
          Grant Ingersoll
        13. MAHOUT-944-minor.patch
          69 kB
          Grant Ingersoll


          People

            Assignee: Grant Ingersoll (gsingers)
            Reporter: Frank Scholten (frankscholten)
            Votes: 2
            Watchers: 7
