  Mahout / MAHOUT-590

Add TSV (Tab Separated Value) input file support to SequenceFilesFromDirectory

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.4
    • Fix Version/s: 0.5
    • Component/s: Integration
    • Labels:
      None
    • Environment:

      Mac OS X 10.6.6, java version "1.6.0_22"
      RHL Linux 2.6.18

      Description

      I would like to add TSV (Tab Separated Value) input file type support to SequenceFilesFromDirectory.

      Here is my real use case:

      I have 36M input records, each consisting of an ID, a CONTENT field, and various other attributes, and I want to convert them to sequence files so that I can cluster the records by term vectors of CONTENT. The problem is that I cannot create 36M files under my home directory (my quota allows at most 50k files), so I was not able to convert them to sequence files with the SequenceFilesFromDirectory utility. Meanwhile, the source data format is TSV, where each line of a file contains ID\tCONTENT\t..., which makes it convenient for Pig and most Hadoop streaming programs to process as input and output. NOTE: CONTENT is up to around 2k bytes. Hence I would prefer direct TSV support in SequenceFilesFromDirectory instead of taking two steps: TSV to text files, then text files to sequence files.
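      For illustration, the conversion boils down to something like the following minimal sketch (this is not the attached patch; the class name, file names, and column positions are hypothetical, and Text keys/values are assumed):

      import java.io.BufferedReader;
      import java.io.FileReader;
      import java.io.IOException;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.SequenceFile;
      import org.apache.hadoop.io.Text;

      // Hypothetical sketch: read a local TSV file and append (ID, CONTENT)
      // pairs to a sequence file, assuming ID is column 0 and CONTENT column 1.
      public class TsvToSequenceFileSketch {
        public static void main(String[] args) throws IOException {
          Configuration conf = new Configuration();
          FileSystem fs = FileSystem.get(conf);
          SequenceFile.Writer writer = SequenceFile.createWriter(
              fs, conf, new Path("tsv_output/chunk-0"), Text.class, Text.class);
          BufferedReader in = new BufferedReader(new FileReader("input.tsv"));
          try {
            String line;
            while ((line = in.readLine()) != null) {
              String[] columns = line.split("\t", -1); // -1 keeps trailing empty columns
              writer.append(new Text(columns[0]), new Text(columns[1]));
            }
          } finally {
            in.close();
            writer.close();
          }
        }
      }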

      I'm attaching the patch.

      Hope this makes sense to other folks.

      1. 0001-added-TSV-input-file-support.patch
        13 kB
        Shige Takeda
      2. MAHOUT-590.patch
        26 kB
        Isabel Drost-Fromm
      3. MAHOUT-590.patch
        23 kB
        Isabel Drost-Fromm

        Activity

        smtakeda Shige Takeda added a comment -

        Patch is attached.

        smtakeda Shige Takeda added a comment -

        A test case I ran against the real data worked well:

        $MAHOUT_HOME/bin/mahout seqdirectory --inputType TSV --input tsv_input --output tsv_output --keyColumn 0 --valueColumn 21

        This converts the TSV files under the tsv_input directory into sequence files, taking the key from column 0 and the value from column 21, and stores them under tsv_output.

        I verified the results with this command:
        $MAHOUT_HOME/bin/mahout org.apache.mahout.utils.SequenceFileDumper --seqFile tsv_output/chunk-0
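        To double-check the output programmatically, a minimal reader along these lines should also work (a sketch assuming both key and value are Text, reading the same chunk file as above):

        import java.io.IOException;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Text;

        // Minimal sketch: print every (key, value) pair in the sequence file.
        public class DumpChunkSketch {
          public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            SequenceFile.Reader reader =
                new SequenceFile.Reader(fs, new Path("tsv_output/chunk-0"), conf);
            try {
              Text key = new Text();
              Text value = new Text();
              while (reader.next(key, value)) {
                System.out.println(key + "\t" + value);
              }
            } finally {
              reader.close();
            }
          }
        }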

        srowen Sean Owen added a comment -

        I'm less convinced by this patch. The tool does one thing: it writes files to HDFS as (filename, contents) pairs. This is a fairly different use case being bolted onto the same tool: extract two columns from a tab-separated file and write those to HDFS. The fact that there is now a command-line switch to select between these two separate modes kind of confirms this.

        It's fine code but sounds like it should be a separate tool. And there are an infinite number of possible file input formats one might want to transform and write to HDFS. It could be hard to justify a tool for one such use case.

        smtakeda Shige Takeda added a comment -

        Sean, thanks for the review and comments! For now this is not critical; I can live with my private branch, which is slightly different from trunk.
        Actually, the original version was a separate class, SequenceFilesFromTsv.java, which could extract multiple columns; I simplified it for review.

        Btw, I recall somebody mentioning on the mailing list that the interfaces of these tools could be improved for general cases (DB, HDFS, etc.); hopefully he comes up with a better idea that meets most needs.

        Thank you.

        lancenorskog Lance Norskog added a comment -

        I did something similar for Hadoop: https://issues.apache.org/jira/browse/MAPREDUCE-2208.

        It grabs various combinations of the input columns. Later I realized I want a date parser, and the ability to handle the Wikipedia format (a parsing sketch follows this comment):
        id value value
        id value
        id value value value

        It's an easy problem at first, but then it needs more and more features. The end result would be something modular instead of 'GroupLensDataModel' and 'Jester*' etc.

        Lance
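
        For illustration, a first pass at parsing that variable-width "id value value ..." layout is only a few lines (a hypothetical sketch, not part of MAPREDUCE-2208 or the attached patch); the hard part, as noted above, is everything that comes after:

        import java.util.Arrays;
        import java.util.List;

        final class WikiLineSketch {
          // Hypothetical sketch: split one "id value value ..." line into the id
          // (first token) and a variable-length list of values (remaining tokens).
          static void parse(String line) {
            String[] tokens = line.trim().split("\\s+");
            String id = tokens[0];
            List<String> values = Arrays.asList(tokens).subList(1, tokens.length);
            System.out.println(id + " -> " + values);
          }
        }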

        Show
        lancenorskog Lance Norskog added a comment - I did something similar for Hadoop: https://issues.apache.org/jira/browse/MAPREDUCE-2208 . It grabs various combinations of the input columns. Later I've realized I want a date parser, and the ability to handle Wikipedia format: id value value id value id value value value It's an easy problem at first, but then it needs more and more features. The end result would be something modular instead of 'GroupLensDataModel' and 'Jester*' etc. Lance
        Hide
        isabel Isabel Drost-Fromm added a comment -

        I agree the problem should be solved in a different tool. However, I think there might be a way to reduce code duplication on the user side. Attaching a heavily restructured version of the original patch for review.

        isabel Isabel Drost-Fromm added a comment -

        Please apply with git -p1 ...

        isabel Isabel Drost-Fromm added a comment -

        Please ignore the last comment. What I actually meant: please apply with patch -p1, as the patch was created with git ...
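
        For example, from the top of the source tree (using the attached file name):

        patch -p1 < MAHOUT-590.patch

        The -p1 strips the a/ and b/ path prefixes that git adds when generating the diff.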

        isabel Isabel Drost-Fromm added a comment -

        Updated version.

        isabel Isabel Drost-Fromm added a comment -

        Patch committed.

        hudson Hudson added a comment -

        Integrated in Mahout-Quality #688 (See https://hudson.apache.org/hudson/job/Mahout-Quality/688/)
        MAHOUT-590 - added support for configurable directory-/file layout for
        SequenceFilesFromDirectory job. Includes a demo for tab separated value
        formatted files.

        saku125 sakurai added a comment -

        Hi, I downloaded mahout-distribution-0.7-src.tar.gz, and after unzipping and compiling everything I can run Mahout. I want to use your patch so that I can specify TSV through "mahout seqdirectory --inputType TSV". Can you tell me how to apply the patch? I did not check out the trunk. Thank you.


          People

          • Assignee: srowen Sean Owen
          • Reporter: smtakeda Shige Takeda
          • Votes: 0
          • Watchers: 1
