Tuesday, April 8, 2014

Hadoop - analyzing a huge number of JSON files on S3 - Stack Overflow


I have a huge number of JSON files, >100 TB in total. Each file is 10 GB bzipped, each line contains a JSON object, and they are stored on S3.



  1. If I want to transform the JSON into CSV (also stored on S3) so I can import it into Redshift directly, is writing custom code using Hadoop the only choice?


  2. Would it be possible to run ad-hoc queries on the JSON files without transforming the data into another format? (Since the source keeps growing, I don't want to convert it every time I need to query.)





The quickest and easiest way would be to launch an EMR cluster loaded with Hive to do the heavy lifting. By using a JsonSerDe, you can easily transform the data into CSV format; this only requires a single INSERT from the JSON-formatted table into a CSV-formatted table, as sketched below.


A good tutorial for handling the JsonSerde can be found here:


http://aws.amazon.com/articles/2855


A good library for the CSV format is:


https://github.com/ogrodnek/csv-serde
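Putting the two SerDes together, a minimal sketch of the Hive side might look like the following. The table names, column names, and S3 paths are placeholders, and the SerDe class names are assumptions based on common JSON/CSV SerDe libraries; use whatever the tutorial and library above actually provide for your schema.

    -- External table over the bzipped JSON-lines files on S3 (paths/columns are placeholders).
    -- Hive decompresses .bz2 input transparently; the JsonSerDe parses each line's JSON object.
    ADD JAR s3://mybucket/jars/json-serde.jar;
    ADD JAR s3://mybucket/jars/csv-serde.jar;

    CREATE EXTERNAL TABLE events_json (
      user_id  STRING,
      event_ts STRING
    )
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    LOCATION 's3://mybucket/events/json/';

    -- CSV-formatted table backed by the csv-serde library linked above.
    CREATE EXTERNAL TABLE events_csv (
      user_id  STRING,
      event_ts STRING
    )
    ROW FORMAT SERDE 'com.bizo.hive.serde.csv.CSVSerde'
    LOCATION 's3://mybucket/events/csv/';

    -- The single INSERT that does the conversion.
    INSERT OVERWRITE TABLE events_csv
    SELECT user_id, event_ts FROM events_json;

Note that once the external JSON table exists, you can also SELECT from it directly, which addresses your second question: ad-hoc queries run against the JSON in place, with no prior conversion needed.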


The EMR cluster can be short-lived, existing only for that one job, and it can also run on low-cost Spot Instances.


Once the data is in CSV format, the Redshift COPY documentation should suffice:


http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html
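For reference, a hedged sketch of that load step; the table name, bucket path, and credential values are placeholders:

    -- Load the Hive-produced CSV files from S3 into a Redshift table (names are placeholders).
    COPY events
    FROM 's3://mybucket/events/csv/'
    CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
    CSV;

COPY reads every file under the given S3 prefix in parallel, so the multiple output files Hive produces can be loaded as-is.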


