Samza is designed with good support for database changelogs, but the current open source release is mostly centered around Kafka. It would be good to have out-of-the-box support for some common databases, such as MySQL, as well.
Databus is LinkedIn's change capture tool, but the current open source release focuses mainly on Oracle. There is an open source release of Databus for MySQL, but it's a proof-of-concept implementation, not the one used by LinkedIn in production. (The one used by LinkedIn requires a patched version of MySQL.) The open source Databus uses Open Replicator to connect to a MySQL server as a replication slave, and parses the binlog (the server's binary log of data changes) to find any inserts, updates or deletes.
I played around a bit with Open Replicator today and got it working: a small Scala program that gets a real-time feed of all changes happening in a MySQL database. However, I have some doubts about the quality of the library (the code is not very good, it has only very cursory tests, the original maintainer hasn't touched it for 18 months, and there are reports of nasty bugs, e.g. blowing up on any negative number). There don't seem to be any better Java binlog parsers out there. But I did skim the source of Open Replicator, and it's not too complicated, so it seems quite feasible to write a MySQL binlog parser ourselves.
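To give a rough sense of what such a parser involves, here is a minimal sketch in Java, not taken from Open Replicator or Databus. It decodes the fixed 19-byte header that starts every event in the v4 binlog format (MySQL 5.0 and later) and classifies row events as inserts, updates or deletes. The class name BinlogEventHeader and its methods are hypothetical. The field layout and event type codes are assumptions based on the MySQL internals documentation and should be checked against the server version in use. Decoding the row data itself, which also needs the preceding table-map event, is not shown.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper: decodes the common header of a v4 binlog event. */
final class BinlogEventHeader {
    static final int HEADER_LENGTH = 19;

    final long timestamp;     // seconds since the Unix epoch
    final int typeCode;       // event type, e.g. 30 = WRITE_ROWS_EVENT (v2)
    final long serverId;      // server_id of the server that wrote the event
    final long eventLength;   // total length of the event (header + body)
    final long nextPosition;  // binlog offset of the following event
    final int flags;

    private BinlogEventHeader(long timestamp, int typeCode, long serverId,
                              long eventLength, long nextPosition, int flags) {
        this.timestamp = timestamp;
        this.typeCode = typeCode;
        this.serverId = serverId;
        this.eventLength = eventLength;
        this.nextPosition = nextPosition;
        this.flags = flags;
    }

    /** Parses the first 19 bytes of an event. All fields are little-endian. */
    static BinlogEventHeader parse(byte[] bytes) {
        if (bytes.length < HEADER_LENGTH) {
            throw new IllegalArgumentException(
                "need at least " + HEADER_LENGTH + " bytes, got " + bytes.length);
        }
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        // Java has no unsigned integer types, so each field is widened and
        // masked to keep values above the signed range from going negative.
        long timestamp = buf.getInt() & 0xFFFFFFFFL;
        int typeCode = buf.get() & 0xFF;
        long serverId = buf.getInt() & 0xFFFFFFFFL;
        long eventLength = buf.getInt() & 0xFFFFFFFFL;
        long nextPosition = buf.getInt() & 0xFFFFFFFFL;
        int flags = buf.getShort() & 0xFFFF;
        return new BinlogEventHeader(
            timestamp, typeCode, serverId, eventLength, nextPosition, flags);
    }

    /** Maps row-event type codes (v1: 23-25, v2: 30-32) to a change kind. */
    String changeKind() {
        switch (typeCode) {
            case 23: case 30: return "INSERT"; // WRITE_ROWS_EVENT
            case 24: case 31: return "UPDATE"; // UPDATE_ROWS_EVENT
            case 25: case 32: return "DELETE"; // DELETE_ROWS_EVENT
            default: return "OTHER";           // query, rotate, table map, ...
        }
    }
}
```

A caller would read 19 bytes from the replication stream, call BinlogEventHeader.parse on them, and then read eventLength minus 19 further bytes as the event body.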
This is still very much at an exploratory stage, but I think it could be really cool to have database changelog support easily available in Samza.