Query Parquet Files in S3
13 Dec 2020
Got some parquet files sitting in S3 as the result of an export
of an RDS snapshot? Here's how to query them using the aws
CLI command.
First, list your parquet files from the S3 bucket where they live:
$ aws --output text s3api list-objects-v2 \ --bucket my-db-export-bucket \ --prefix my-db-export-for-2020-12-11/mydb/ \ --query 'Contents[?ends_with(Key, `parquet`)]|[*].[Key]' my-db-export-for-2020-12-11/mydb/mydb.accounts/part-00000-3c4f9c6c-e173-4410-821a-a416de9d4503-c000.gz.parquet my-db-export-for-2020-12-11/mydb/mydb.users/part-00000-ef5c6dee-380f-473b-a19b-bb7e75bdabab-c000.gz.parquet my-db-export-for-2020-12-11/mydb/mydb.widgets/part-00000-9f9081a7-c16c-4fa5-92b9-285b2de21f60-c000.gz.parquet ...
Next, run a SQL query on a file. Use the table name "s3object" in your SQL query: it is the "table" that the parquet file represents.
$ aws s3api select-object-content \ --bucket my-db-export-bucket \ --key my-db-export-for-2020-12-11/mydb/mydb.accounts/part-00000-3c4f9c6c-e173-4410-821a-a416de9d4503-c000.gz.parquet \ --expression 'select count(*) from s3object' \ --expression-type SQL \ --input-serialization '{"Parquet": {}}' \ --output-serialization '{"CSV": {}}' \ /dev/stdout 722712