[MAHOUT-944] LuceneIndexToSequenceFiles (lucene2seq) utility - ASF JIRA

XML

Word

Printable

JSON

Details

Type: New Feature
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: 0.5
Fix Version/s: 0.8
Component/s: classic
Labels:
None

Description

Here is a lucene2seq tool I used in a project. It creates sequence files based on the stored fields of a lucene index.

The output from this tool can be then fed into seq2sparse and from there you can do text clustering.

Comes with Java bean configuration.

Let me know what you think. Some CLI code can be added later on. I used this for a small-scale project +- 100.000 docs. Is a MR version useful or is that overkill?

See https://github.com/frankscholten/mahout/tree/lucene2seq for commits and review comments from Simon Willnauer (Thanks Simon!)

or the attached patch.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

MAHOUT-944.patch
11/Jan/12 09:35
20 kB
Frank Scholten
MAHOUT-944.patch
10/Feb/12 17:28
53 kB
Frank Scholten
MAHOUT-944.patch
10/Feb/12 17:33
39 kB
Frank Scholten
MAHOUT-944.patch
11/Feb/12 11:41
39 kB
Frank Scholten
MAHOUT-944.patch
28/Feb/12 09:38
86 kB
Frank Scholten
MAHOUT-944.patch
05/Mar/12 09:12
377 kB
Frank Scholten
MAHOUT-944.patch
02/Jun/13 12:25
85 kB
Grant Ingersoll
MAHOUT-944.patch
02/Jun/13 14:13
82 kB
Grant Ingersoll
MAHOUT-944.patch
02/Jun/13 15:09
81 kB
Grant Ingersoll
MAHOUT-944.patch
02/Jun/13 15:22
81 kB
Grant Ingersoll
MAHOUT-944.patch
05/Jun/13 21:58
86 kB
Grant Ingersoll
MAHOUT-944.patch
06/Jun/13 14:23
91 kB
Grant Ingersoll
MAHOUT-944-minor.patch
06/Jun/13 17:04
69 kB
Grant Ingersoll

Activity

People

Assignee:: Grant Ingersoll

Reporter:: Frank Scholten

Votes:: 2 Vote for this issue

Watchers:: 7 Start watching this issue

Dates

Created:: 11/Jan/12 09:33

Updated:: 31/Jan/24 22:17

Resolved:: 06/Jun/13 15:51