Apache Gora / GORA-20

Flush datastore regularly


Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.1-incubating
    • Fix Version/s: 0.1-incubating
    • Component/s: storage
    • Labels: None

    Description

      Right now you need to explicitly call the flush method, or close the datastore, to make the I/O operations actually happen.

      The issue is described here: http://techvineyard.blogspot.com/2010/12/build-nutch-20.html#Free_up_the_memory. Click on the image to see it in real size and look at the Heap utilization on the top right chart.

      Not everybody has infinite memory. In a Nutch fetch process, I usually run into trouble after around 20k URLs have been downloaded, because the job takes up all the memory: the Java heap space is set to 1G on a system that "only" has 1G of RAM as well.

      The feature consists of allowing the datastore to be flushed regularly during the Hadoop job's reduce phase, in org.apache.gora.mapreduce.GoraReducer. We would add a maxBuffer parameter, with a default value of 10000 for example, which you can override in org.apache.gora.mapreduce.GoraOutputFormat. It indicates the maximum number of records buffered in memory before the next flush operation occurs to actually write them to the datastore. This counter would be a member of the org.apache.hadoop.mapreduce.RecordWriter subclass returned by the getRecordWriter method.

      An idea of the fix is suggested in the above link.
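      The counting logic described above can be sketched in isolation. The DataStore interface below is a minimal stub standing in for org.apache.gora.store.DataStore, and the wrapper mirrors the RecordWriter that getRecordWriter would return; the class names and the flush counter are illustrative assumptions, not the actual patch:

```java
public class FlushingWriterSketch {

    // Minimal stub standing in for org.apache.gora.store.DataStore.
    interface DataStore<K, V> {
        void put(K key, V value);
        void flush();
    }

    // Sketch of the RecordWriter subclass returned by getRecordWriter,
    // extended with the proposed maxBuffer parameter.
    static class GoraRecordWriter<K, V> {
        private final DataStore<K, V> store;
        private final int maxBuffer;  // e.g. 10000 by default, overridable
        private int buffered = 0;     // records written since the last flush
        private int flushes = 0;      // intermediate flushes, for illustration

        GoraRecordWriter(DataStore<K, V> store, int maxBuffer) {
            this.store = store;
            this.maxBuffer = maxBuffer;
        }

        public void write(K key, V value) {
            store.put(key, value);
            // Once maxBuffer records have accumulated, flush them to the
            // datastore so the reducer's heap usage stays bounded.
            if (++buffered >= maxBuffer) {
                store.flush();
                buffered = 0;
                flushes++;
            }
        }

        public void close() {
            store.flush(); // final flush, as today
        }

        public int getFlushes() { return flushes; }
    }

    public static void main(String[] args) {
        DataStore<String, String> store = new DataStore<String, String>() {
            public void put(String k, String v) { }
            public void flush() { }
        };
        GoraRecordWriter<String, String> writer =
                new GoraRecordWriter<String, String>(store, 3);
        for (int i = 0; i < 10; i++) {
            writer.write("key" + i, "value" + i);
        }
        writer.close();
        // 10 records with maxBuffer = 3 -> flushes after records 3, 6 and 9
        System.out.println(writer.getFlushes()); // prints 3
    }
}
```

      Closing the writer still performs a final flush, so the change only bounds how many records sit in memory at once; it does not alter what ends up in the datastore.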

      Attachments

        1. mapred-site.xml
          0.3 kB
          Alexis
        2. gora.patch
          3 kB
          Alexis


          People

            Assignee: Unassigned
            Reporter: Alexis (alexis779)
            Votes: 0
            Watchers: 1
