I didn't understand why this code was wrong before, so I looked into it in more depth and I agree with Colin's analysis and patch. In the interest of making this easier for others to understand, here are a few references.
http://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html explains the GCC inline assembly syntax, and in particular how the asm("some assembly" : output operands : input operands : clobbers) syntax is parsed, and how the operands map to the %n in the assembly string: operands are numbered from %0, starting with the outputs and continuing through the inputs.
http://asm.sourceforge.net/articles/rmiyagi-inline-asm.txt describes the x86 indexed addressing modes, in particular explaining how (%5,%4,1) is interpreted as "the word of memory at %5 + 1 * %4".
http://softwarecommunity.intel.com/userfiles/en-us/d9156103.pdf describes the SSE4.2 CRC32 instruction in mind-numbing detail, most of which is not relevant to this bug. All we need to know is that crc32 takes a size suffix (b, w, l, or q) and operates on 8, 16, 32, or 64 bits accordingly, and that its first argument is read-only while its second argument is used as an accumulator (read, modify, write).
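For anyone who wants to see the accumulator behavior without SSE4.2 hardware, a byte-sized step can be modeled in portable C. This is only a sketch, assuming the instruction uses the CRC-32C (Castagnoli) polynomial in its bit-reflected form 0x82F63B78; the names crc32c_step_byte and crc32c_buf are made up for illustration and do not appear in bulk_crc32.c.

```c
#include <stddef.h>
#include <stdint.h>

/* Software model of "crc32b src, acc": fold one byte into the accumulator.
 * Hypothetical helper, not taken from bulk_crc32.c. */
static uint32_t crc32c_step_byte(uint32_t acc, uint8_t src)
{
    acc ^= src;
    for (int bit = 0; bit < 8; bit++) {
        /* mask is all ones when the low bit is set, zero otherwise. */
        uint32_t mask = 0u - (acc & 1u);
        acc = (acc >> 1) ^ (0x82F63B78u & mask);
    }
    return acc;
}

/* CRC-32C of a whole buffer. The conventional initial value and the final
 * inversion are applied here by the caller, not by the step itself. */
static uint32_t crc32c_buf(const uint8_t *data, size_t len)
{
    uint32_t acc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++)
        acc = crc32c_step_byte(acc, data[i]);
    return ~acc;
}
```

The source byte is only read, while acc is read, modified, and written back, which mirrors the roles of the two asm operands.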
Finally, the comments in bulk_crc32.c are very helpful. Critically, the pipelined_crc32c routine optimizes by computing the CRC of up to 3 blocks in parallel. The block size is passed in to pipelined_crc32c as block_size. As we can see by looking at one of the other asm blocks in pipelined_crc32c, the core idea is that we maintain bdata as a pointer to the word being CRCed in the first block, and then use indexed addressing to compute the appropriate address for the word being CRCed in the second (and possibly third) blocks.
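The interleaving described in those comments can be sketched in portable C. This is a simplified model, assuming three equal-sized, contiguous blocks and a byte-at-a-time loop; crc_step and crc3_interleaved are hypothetical names, and the real routine uses the hardware instruction rather than a software step.

```c
#include <stddef.h>
#include <stdint.h>

/* Software stand-in for one crc32b step (CRC-32C, reflected polynomial). */
static uint32_t crc_step(uint32_t acc, uint8_t byte)
{
    acc ^= byte;
    for (int bit = 0; bit < 8; bit++)
        acc = (acc >> 1) ^ (0x82F63B78u & (0u - (acc & 1u)));
    return acc;
}

/* Advance three independent accumulators over three adjacent blocks.
 * Only one pointer (bdata) moves. The second and third blocks are reached
 * as bdata + block_size * 1 and bdata + block_size * 2, which is what the
 * (base,index,scale) operands express in the asm. */
static void crc3_interleaved(const uint8_t *bdata, size_t block_size,
                             uint32_t *c1, uint32_t *c2, uint32_t *c3)
{
    for (size_t i = 0; i < block_size; i++, bdata++) {
        *c1 = crc_step(*c1, bdata[0]);
        *c2 = crc_step(*c2, bdata[block_size * 1]);
        *c3 = crc_step(*c3, bdata[block_size * 2]);
    }
}
```

Because the three accumulators do not depend on each other, the CPU is free to overlap the three steps, which is the point of the pipelining.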
With all that under our belt, the bug in this code becomes clear:
"crc32b (%5), %0;\n\t"
"crc32b (%5,%4,1), %1;\n\t"
: "=r"(c1), "=r"(c2)
: "r"(c1), "r"(c2), "r"(c3), "r"(block_size), "r"(bdata)
The first crc32b instruction dereferences %5, which is block_size, but judging by any other instance of this asm block, such as:
"crc32q (%7), %0;\n\t"
"crc32q (%7,%6,1), %1;\n\t"
"crc32q (%7,%6,2), %2;\n\t"
: "=r"(c1), "=r"(c2), "=r"(c3)
: "r"(c1), "r"(c2), "r"(c3), "r"(block_size), "r"(data)
it should be dereferencing bdata. This happens because the input operand list includes c3 even though the output operand list does not, which also differs from every other instance of the asm block. The extra operand shifts every later operand up by one, so %4 is c3, %5 is block_size, and %6 is bdata.
Therefore, Colin's fix, which removes the stray c3 operand, makes %4 and %5 refer to their intended operands, block_size and bdata respectively.
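The renumbering can also be checked mechanically. The following sketch models GCC's rule that operands are numbered from %0 across the output list and then the input list; operand_index is a hypothetical helper written for this comment, and the two arrays simply restate the operand lists of the two-block asm statement before and after the fix.

```c
#include <stddef.h>
#include <string.h>

/* Return the %n index of the first input operand called `name`, given that
 * `n_out` output operands are listed before the inputs.
 * Returns -1 if the name is not in the input list. */
static int operand_index(size_t n_out, const char *const inputs[],
                         size_t n_in, const char *name)
{
    for (size_t i = 0; i < n_in; i++)
        if (strcmp(inputs[i], name) == 0)
            return (int)(n_out + i);
    return -1;
}

/* Operands listed after the two outputs, before and after the fix. */
static const char *const buggy_inputs[] = { "c1", "c2", "c3", "block_size", "bdata" };
static const char *const fixed_inputs[] = { "c1", "c2", "block_size", "bdata" };
```

With two outputs, the buggy list gives block_size index 5 and bdata index 6, while the fixed list gives 4 and 5, which is what the (%5) and (%5,%4,1) references in the template expect.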