Assorted notes, some general and some specific, collected in Notion after a couple of projects with Spark and AWS Glue:
- Parquet format doesn’t support empty arrays
- Schema evolution in Spark 2.4 for ORC format is broken
- Looking forward to ZSTD compression for the ORC format.
- Always check that the features you rely on actually work as expected. Apparently, dynamic partition pruning doesn’t kick in on AWS Glue with a broadcast join, so the whole dataset on the left side gets read. Hopefully it works as expected starting with Spark 3.0 (#TODO - test it)
- Another #TODO - test Apache Hudi for incremental data processing.
- Great article about data repartitioning in Spark
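
For the dynamic partition pruning #TODO above, a rough way to check it might look like the sketch below. It is only a sketch under assumptions: it needs PySpark 3.x installed locally, the `fact` and `dim` tables, their columns, and both function names are made up for illustration, and looking for the text `dynamicpruning` in the plan assumes that Spark prints pruning filters under that name in its physical plan. `spark.sql.optimizer.dynamicPartitionPruning.enabled` is the Spark 3.0 setting for the feature.

```python
import tempfile


def plan_uses_dpp(plan_text: str) -> bool:
    """Return True if a plan string mentions a dynamic pruning filter.

    Assumption: Spark 3.x prints such filters with "dynamicpruning" in the name.
    """
    return "dynamicpruning" in plan_text.lower()


def explain_join_plan() -> str:
    """Build a tiny partitioned table, join it to a filtered one, return the plan."""
    # Imported lazily so plan_uses_dpp stays usable without PySpark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[1]")
        .appName("dpp-check")
        .config("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
        .getOrCreate()
    )
    try:
        with tempfile.TemporaryDirectory() as tmp:
            # Hypothetical fact table, partitioned by `day` on disk.
            (
                spark.range(1000)
                .selectExpr("id", "CAST(id % 10 AS INT) AS day")
                .write.partitionBy("day")
                .parquet(f"{tmp}/fact")
            )
            spark.read.parquet(f"{tmp}/fact").createOrReplaceTempView("fact")
            # Hypothetical small dimension table; the filter on `flag` is the
            # predicate that pruning could turn into a partition filter on `fact`.
            (
                spark.range(10)
                .selectExpr("CAST(id AS INT) AS day", "id % 2 AS flag")
                .createOrReplaceTempView("dim")
            )
            rows = spark.sql(
                "EXPLAIN SELECT f.id FROM fact f "
                "JOIN dim d ON f.day = d.day WHERE d.flag = 1"
            ).collect()
            # EXPLAIN returns a single row with the plan as one string.
            return rows[0][0]
    finally:
        spark.stop()
```

Running `plan = explain_join_plan()` and then `print(plan, plan_uses_dpp(plan))` shows the plan and whether a pruning filter made it in. Whether it does will depend on the Spark version and settings, so treat the result as a starting point for the real test on Glue rather than a verdict.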