Crunch (Retired) / CRUNCH-627

Shard API doesn't work well with parquet target

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.13.0
    • Fix Version/s: 0.13.0
    • Component/s: MapReduce Patterns
    • Environment: Linux X86
    • Flags: Important

    Description

      PCollection<User> outTable = oldTable.union(newTable);
      Shard.shard(outTable, 10).write(new AvroParquetFileTarget(tempOut + path), Target.WriteMode.OVERWRITE);

      However, I have another job that reads the output of the target above and uses one of its fields as the key. That job's output looks like this:
      3.0.3.1.2.CH24_RELEASE 2
      3.0.3.1.2.CH24_RELEASEE 1
      3.0.3.1.2.CH24_RELEASEEA 1
      3.0.3.1.2.CH24_RELEASEEAS 1
      3.0.3.1.2.CH24_RELEASEEASE 29
      3.0.3.1.2.CH24_RELEASEEASES 160
      3.0.3.1.2.CH24_RELEASEEASESE 85
      3.0.3.1.2.CH24_RELEASEEASESEE 14
      3.0.3.1.2.CH24_RELEASEEASESEEE 4
      3.0.3.1.2.CH24_RELEASEEASESEEES 1
      An extra suffix is appended to the keys of the PTable. All of them
      should end in RELEASE, not RELEASEEASE and so on.

      If I remove the Shard call and keep everything else the same, the output looks normal:
      3.0.0.1.2.CH.1.4_RELEASE 1
      3.0.1.1.2.CH22_RELEASE 1622
      3.0.1.1.2.CH23_RELEASE 10607
      3.0.14.1.2.CH.1.3_RELEASE 18080
      3.0.19.1.2.TC21_RELEASE 5
      3.0.2.1.2.CH11_RELEASE 3
      3.0.2.1.2.TC21_RELEASE 4
      3.0.20.1.2.TC21_RELEASE 247
      3.0.20.7.2.SX.1.2A_RELEASE 2
      3.0.20.8.2.SX.1.3A_RELEASE 1
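
The suffixes reported above have the shape that a stale, reused byte buffer would produce: for example, writing the 22-byte key 3.0.3.1.2.CH24_RELEASE into a buffer that previously held a 26-byte key and then decoding all 26 bytes yields 3.0.3.1.2.CH24_RELEASEEASE. This is only a hypothesis and is not a confirmed diagnosis of this issue. The sketch below is plain Java with no Crunch, Avro, or Parquet dependency; `ReusableText` and its methods are hypothetical names invented for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ReusedBufferDemo {

    /** Hypothetical stand-in for a mutable string holder that reuses its backing array. */
    static final class ReusableText {
        private byte[] bytes = new byte[0];
        private int length = 0;

        /** Overwrites the value in place; the array only grows, it never shrinks. */
        void set(String value) {
            byte[] src = value.getBytes(StandardCharsets.UTF_8);
            if (src.length > bytes.length) {
                bytes = Arrays.copyOf(src, src.length);
            } else {
                // Bytes past src.length keep whatever the previous value left there.
                System.arraycopy(src, 0, bytes, 0, src.length);
            }
            length = src.length;
        }

        /** Decodes only the logical length: the intended behavior. */
        String decodeCorrectly() {
            return new String(bytes, 0, length, StandardCharsets.UTF_8);
        }

        /** Decodes the whole backing array: leaks the tail of an earlier, longer value. */
        String decodeWholeBuffer() {
            return new String(bytes, StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) {
        ReusableText text = new ReusableText();
        text.set("3.0.20.7.2.SX.1.2A_RELEASE"); // 26 bytes
        text.set("3.0.3.1.2.CH24_RELEASE");     // 22 bytes, same backing array

        System.out.println(text.decodeCorrectly());   // 3.0.3.1.2.CH24_RELEASE
        System.out.println(text.decodeWholeBuffer()); // 3.0.3.1.2.CH24_RELEASEEASE
    }
}
```

If something like this is the cause, copying each key into an immutable String (respecting its logical length) before it is grouped or written would be one thing to try; whether that applies here is an assumption.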

      Attachments

        1. CRUNCH-627.patch
          4 kB
          Josh Wills

          People

            Assignee: Unassigned
            Reporter: Tony Wu (leenuxwu)
            Votes: 0
            Watchers: 3
