It feels like this should really be a known issue, but I've not been able to find any mailing list or JIRA tickets for exactly this. There are a few closed/resolved tickets about specific types of exceptions, but I couldn't find this exact problem, so apologies if this is a dupe!
Spark SQL doesn't appear to support referencing a SELECT alias in the GROUP BY clause of the same query. This is confusing our analysts, as it works in most other tools they use. Here's an example to reproduce:
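Something along these lines (the table and field names here are made up for illustration, but the shape matches our real queries):

```sql
-- Fails in Spark SQL: the alias event_date is not resolvable in GROUP BY
SELECT from_unixtime(payload.meta.timestamp_ms / 1000, 'yyyy-MM-dd') AS event_date,
       count(*) AS events
FROM events
GROUP BY event_date;

-- Works, but the full expression has to be repeated verbatim
SELECT from_unixtime(payload.meta.timestamp_ms / 1000, 'yyyy-MM-dd') AS event_date,
       count(*) AS events
FROM events
GROUP BY from_unixtime(payload.meta.timestamp_ms / 1000, 'yyyy-MM-dd');
```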
As you can see, it's particularly inconvenient when using UDFs on nested fields, as it means repeating potentially complex expressions. It's very common for us to need a date-type conversion (from epoch milliseconds or similar) on some nested field and then to reference the result in several places in the same query, so this issue makes for quite verbose queries.
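The workaround we've been using, for what it's worth, is to compute the alias in a subquery and group in the outer query, so the expression only appears once (again, names are illustrative):

```sql
-- Workaround: materialise the alias in a subquery, then group on it outside
SELECT event_date, count(*) AS events
FROM (
  SELECT from_unixtime(payload.meta.timestamp_ms / 1000, 'yyyy-MM-dd') AS event_date
  FROM events
) t
GROUP BY event_date;
```

It does the job, but it adds a level of nesting for what feels like it should be a one-liner.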
Might it also mean that we're evaluating these functions over the data twice? I can't quite tell from the explain output whether the duplicate expression has been optimised away or not, but here it is for somebody who understands these plans better than I do: