Sunday, 13 April 2014

Hadoop - AWS EMR Hive partitioning unable to recognize any type of partitions - Stack Overflow


I am trying to process some log files in a bucket on Amazon S3.


I create the table:


CREATE EXTERNAL TABLE apiReleaseData2 (
messageId string, hostName string, timestamp string, macAddress string, apiKey string,
userAccountId string, userAccountEmail string, numFiles string)
ROW FORMAT
serde 'com.amazon.elasticmapreduce.JsonSerde'
with serdeproperties ( 'paths'='messageId, hostName, timestamp, macAddress, apiKey, userAccountId, userAccountEmail, numFiles')
LOCATION 's3://apireleasecandidate1/regression/transferstatistics/2013/12/31/';

Then I run my HiveQL statement and get the desired output in the file without any issues. My directories are set up in the following manner:


s3://apireleasecandidate1/regression/transferstatistics/2013/12/31/ < All the log files for this day >


What I want to do is specify the LOCATION only up to 's3://apireleasecandidate1/regression/transferstatistics/' and then call the


ALTER TABLE <Table Name> ADD PARTITION (<path>) 

statement or the


ALTER TABLE <Table Name> RECOVER PARTITIONS ;

statement to access the files in the subdirectories. But when I do this there is no data in my table.


I tried the following:


CREATE EXTERNAL TABLE apiReleaseDataUsingPartitions (
messageId string, hostName string, timestamp string, macAddress string, apiKey string,
userAccountId string, userAccountEmail string, numFiles string)
PARTITIONED BY (year STRING, month STRING, day STRING)
ROW FORMAT
serde 'com.amazon.elasticmapreduce.JsonSerde'
with serdeproperties ( 'paths'='messageId, hostName, timestamp, macAddress, apiKey, userAccountId, userAccountEmail, numFiles')
LOCATION 's3://apireleasecandidate1/regression/transferstatistics/';

and then I run the following ALTER command:


ALTER TABLE apiReleaseDataUsingPartitions ADD PARTITION (year='2013', month='12', day='31');

But running a SELECT statement on the table returns no results.


Can someone please tell me what I am doing wrong? Am I missing something important?


Cheers, Tanzeel




In HDFS, at least, partitions appear as directories named in key=value format, like this:


hdfs://apireleasecandidate1/regression/transferstatistics/year=2013/month=12/day=31

I can't vouch for S3, but an easy way to check would be to write some data into a dummy partition and see where Hive creates the file.
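
To make the likely mismatch concrete, here is a small Python sketch of the directory a partition would map to when it is added without a LOCATION. It assumes S3 follows the same key=value convention as HDFS, which I have not verified, and the helper name default_partition_path is made up for illustration.

```python
def default_partition_path(table_location, partition_spec):
    """Build the key=value directory assumed for a partition added without LOCATION.

    partition_spec is an ordered list of (column, value) pairs, in the same
    order as the PARTITIONED BY clause.
    """
    base = table_location.rstrip("/")
    parts = [f"{key}={value}" for key, value in partition_spec]
    return "/".join([base] + parts) + "/"


table_location = "s3://apireleasecandidate1/regression/transferstatistics/"
spec = [("year", "2013"), ("month", "12"), ("day", "31")]

# Directory the partition would point at under the key=value assumption.
expected = default_partition_path(table_location, spec)

# Directory where the log files actually live, taken from the question.
actual = "s3://apireleasecandidate1/regression/transferstatistics/2013/12/31/"

print(expected)
print("same directory:", expected == actual)
```

If that assumption holds, the partition points at a year=2013/month=12/day=31 directory that contains no files, which would explain the empty result.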


ADD PARTITION supports an optional LOCATION parameter, so you might be able to deal with this by saying:


ALTER TABLE apiReleaseDataUsingPartitions ADD PARTITION (year='2013', month='12', day='31') LOCATION 's3://apireleasecandidate1/regression/transferstatistics/2013/12/31/';

Again, I've not dealt with S3, but I would be interested to hear if this works for you.
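
If the explicit LOCATION approach does work, you would need one such statement per day. A rough Python sketch for generating them is below. The function name add_partition_statement is hypothetical, and it assumes every day sits under a YYYY/MM/DD directory like the one in the question. Only 2013/12/31 is known to exist; the second date is there purely to show the pattern.

```python
from datetime import date


def add_partition_statement(table, base_location, day):
    """Return an ALTER TABLE ... ADD PARTITION statement with an explicit LOCATION.

    Assumes the data for each day lives under <base_location>/YYYY/MM/DD/.
    """
    location = f"{base_location.rstrip('/')}/{day:%Y/%m/%d}/"
    return (
        f"ALTER TABLE {table} ADD PARTITION "
        f"(year='{day:%Y}', month='{day:%m}', day='{day:%d}') "
        f"LOCATION '{location}';"
    )


base = "s3://apireleasecandidate1/regression/transferstatistics/"

# 2013-12-31 comes from the question; 2014-01-01 is a hypothetical extra day.
days = [date(2013, 12, 31), date(2014, 1, 1)]

for day in days:
    print(add_partition_statement("apiReleaseDataUsingPartitions", base, day))
```

The printed statements can then be pasted into the Hive shell or saved to a script file and run from there.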




