Monday, April 21, 2014

Hadoop - Can I point multiple locations to the same external Hive table? - Stack Overflow


I need to process multiple months of data simultaneously. So, is there an option to point multiple folders to an external table? e.g. create external table logdata(col1 string, col2 string ........) location 's3://logdata/april', 's3://logdata/march'




Simple answer: no. The location of a Hive external table has to be unique at creation time; the metastore needs it to know where your table lives.


That being said, you can probably get away with using partitions: you can specify a location for each of your partitions, which seems to be what you ultimately want since you are splitting by month.


So create your table like this:


create external table logdata(col1 string, col2 string) partitioned by (month string) location 's3://logdata';

Then you can add partitions like this:


alter table logdata add partition(month='april') location 's3://logdata/april';

You do this for every month. Now you can query your table specifying whichever partitions you want, and Hive will only read the directories that actually hold the data you asked for (for example, if you're only processing April and June, Hive will not load May).
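For instance, a query that filters on the partition column triggers partition pruning (table and column names taken from the example above; the aggregate is just an illustration):

```sql
-- Because "month" is the partition column, Hive resolves this filter
-- against the metastore and only scans s3://logdata/april and
-- s3://logdata/june -- s3://logdata/may is never touched.
select month, count(*)
from logdata
where month in ('april', 'june')
group by month;
```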




Have a look at SymlinkTextInputFormat (https://issues.apache.org/jira/browse/HIVE-1272). That could solve your problem; you just have to maintain a separate text file listing all the locations!
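A rough sketch of how that looks, assuming your Hive build ships `org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat` (the `s3://symlinks/logdata` path is hypothetical):

```sql
-- The table's location holds no data itself -- only symlink files,
-- plain text files where each line is a path to an actual data directory,
-- e.g. s3://logdata/april and s3://logdata/march.
create external table logdata_multi(col1 string, col2 string)
stored as
  inputformat 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
  outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location 's3://symlinks/logdata';
```

To add or drop a month you then edit the symlink file rather than altering the table.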


Also see https://issues.apache.org/jira/browse/HIVE-951, which isn't resolved yet but would be a solution!




