Details
Type: Bug
Status: Resolved
Priority: P2
Resolution: Done
Description
When I run WordCount with the Apex runner on a YARN cluster - specifically Dataproc, reading/writing GCS - the word counts are all written to temporary files but they are never moved to their final destination.
Hadoop version 2.7.3
Beam RC 2.0.0
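The symptom above matches a write-to-temporary-then-finalize pattern in which the finalize step never runs. The following is a minimal local-filesystem sketch of that pattern, not Beam's or the Apex runner's actual code. The class name TempThenFinalize, the method writeShards, the ".temp-sketch" directory, and the shard naming are all hypothetical and chosen only for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: not taken from Beam or the Apex runner.
final class TempThenFinalize {

    /**
     * Writes every shard into a temporary directory under outputDir. When
     * finalizeOutput is true, the shards are then moved to their final names
     * and the temporary directory is removed. When it is false, the method
     * returns early and the shards stay in the temporary directory, which
     * mirrors the behaviour reported in this issue.
     */
    static List<Path> writeShards(Path outputDir, String prefix,
                                  List<String> shards, boolean finalizeOutput)
            throws IOException {
        Path tempDir = Files.createDirectories(outputDir.resolve(".temp-sketch"));
        List<Path> tempFiles = new ArrayList<>();
        for (int i = 0; i < shards.size(); i++) {
            Path tmp = tempDir.resolve("shard-" + i);
            Files.write(tmp, shards.get(i).getBytes(StandardCharsets.UTF_8));
            tempFiles.add(tmp);
        }
        if (!finalizeOutput) {
            return tempFiles; // finalize step skipped: output is stranded
        }
        List<Path> finalFiles = new ArrayList<>();
        for (int i = 0; i < tempFiles.size(); i++) {
            // Assumed naming scheme for the final files (illustrative).
            String name = String.format("%s-%05d-of-%05d", prefix, i, tempFiles.size());
            Path dest = outputDir.resolve(name);
            Files.move(tempFiles.get(i), dest, StandardCopyOption.REPLACE_EXISTING);
            finalFiles.add(dest);
        }
        Files.delete(tempDir); // empty once every shard has been moved
        return finalFiles;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("wordcount-sketch");
        List<String> shards = Arrays.asList("the: 3", "winter: 1");
        System.out.println(writeShards(dir, "counts", shards, true));
    }
}
```

Calling writeShards with finalizeOutput set to false leaves the shards under the temporary directory and creates no final files, which is the state a reader of this report would expect to find at the output location.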
Steps to repro:
1. Instantiate archetype (see below)
2. Build the uber jar: mvn --settings ../beamrc-settings.xml clean package -P apex-runner
3. SCP to master (or wherever you'd like to launch from)
4. Run: java -cp word-count-beam-0.1.jar beamrc.WordCount --runner=ApexRunner --embeddedExecution=false --inputFile=gs://apache-beam-samples/shakespeare/winterstale-personae --output=SOMEWHERE
Appendix: steps to instantiate RC archetype:
Build an RC-specific beamrc-settings.xml
<settings>
  <profiles>
    <profile>
      <id>beam-2.0.0</id>
      <repositories>
        <repository>
          <!-- This id _must_ be "archetype" -->
          <id>archetype</id>
          <url>RC_REPO</url>
        </repository>
      </repositories>
    </profile>
  </profiles>
  <activeProfiles>
    <activeProfile>beam-2.0.0</activeProfile>
  </activeProfiles>
</settings>
And then instantiate like so
mvn archetype:generate \
  --settings beamrc-settings.xml \
  -D archetypeCatalog=internal \
  -D archetypeGroupId=org.apache.beam \
  -D archetypeArtifactId=beam-sdks-java-maven-archetypes-examples \
  -D archetypeVersion=2.0.0 \
  -D groupId=beamrc \
  -D artifactId=word-count-beam \
  -D version="0.1" \
  -D package=beamrc \
  -D interactiveMode=false