Evgenii Karimov - Some lessons learned working with Spark/AWS Glue

Random general and specific notes from Notion made after a couple of projects working with Spark/AWS Glue:

Parquet format doesn’t support empty arrays
Schema evolution in Spark 2.4 for ORC format is broken
Looking forward to ZSTD compression for the ORC format.
Always check that the features you rely on actually work as expected. Apparently dynamic partition pruning doesn't kick in on AWS Glue with a broadcast join, so the whole dataset on the left side gets read. Hopefully it works as expected starting with Spark 3.0 (#TODO - test it)
Another #TODO - test Apache Hudi for incremental data processing.
Great article about data repartitioning in Spark
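
For the dynamic partition pruning note above, one way to test it is to look at the physical plan and check whether a dynamic pruning filter shows up on the partitioned side of the join. Below is a minimal sketch, not a definitive check. It assumes an existing SparkSession passed in as `spark`, and it assumes that Spark 3.x prints the pruning filter as `dynamicpruningexpression(...)` in the plan text. The helper names and the tables `facts` and `dims` are hypothetical.

```python
def explain_text(spark, query: str) -> str:
    """Return the plan of `query` as text, using the SQL EXPLAIN statement."""
    # EXPLAIN returns a single-row, single-column DataFrame holding the plan.
    return spark.sql("EXPLAIN " + query).first()[0]


def plan_uses_dpp(plan_text: str) -> bool:
    """True if the plan text mentions a dynamic pruning filter.

    Assumption: Spark 3.x shows it inside PartitionFilters as
    `dynamicpruningexpression(...)`; other versions may print it differently.
    """
    return "dynamicpruning" in plan_text.lower()


# Usage sketch (hypothetical tables: `facts` partitioned by `day`, small `dims`):
#
#   print(spark.conf.get("spark.sql.optimizer.dynamicPartitionPruning.enabled"))
#   plan = explain_text(
#       spark,
#       "SELECT f.* FROM facts f JOIN dims d ON f.day = d.day "
#       "WHERE d.region = 'EU'",
#   )
#   print(plan_uses_dpp(plan))
```

If the check comes back False, the plan text itself is still worth reading, because the scan node lists the partition filters that were actually pushed down.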
