Saturday, 26 April 2014

Unable to download or read Hive output in an Amazon S3 bucket - Stack Overflow


I'm new to AWS and Hive, and I'm trying to use Hive to analyze Google Ngrams data. I tried to save a table as tab-delimited CSV in an S3 bucket, but now I don't know how to view it or download it to see if my job executed correctly.


The query I used to create the table was


CREATE EXTERNAL TABLE test_table2 (
    gram        string,
    year        int,
    occurrences bigint,
    pages       bigint,
    books       bigint
)
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 's3://mybucket/sub-bucket/test-table2.txt';

I then filled the table with data:


INSERT OVERWRITE TABLE test_table2
SELECT
    gram,
    year,
    occurrences,
    pages,
    books
FROM
    eng1m_5grams_normed
WHERE
    gram = 'early bird gets the worm';

The query ran fine, and I think everything worked correctly. However, when I navigate to my bucket in the online S3 Management Console, the text file appears as a folder containing a bunch of files. These files have long hexadecimal names and are 0 bytes in size.


Is this just the text file represented as a directory? Is there a way I can view or download the file to see if my query worked? I tried to make the directory public so I could download it, but the download button in the "Actions" dropdown menu is still greyed out.




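In Hive on S3, think of an S3 directory as a table. The files in that directory hold the table's contents (its rows), which is why the test-table2.txt location shows up as a folder rather than a single file. You see multiple files in the directory because multiple tasks (reducers, or mappers when the query has no reduce stage) each write their own part of the "table".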
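To make the directory-as-table idea concrete, here is a minimal local sketch in Python. It does not talk to Hive or S3: it assumes a temporary local directory stands in for the S3 location, and the part file names and row values are invented for illustration. Each simulated task writes its own file, and the "table" is whatever you get by reading every file in the directory.

```python
import tempfile
from pathlib import Path


def write_part_files(table_dir, rows_per_task):
    """Simulate several tasks, each writing its own file into one table directory."""
    table_dir.mkdir(parents=True, exist_ok=True)
    for task_id, rows in enumerate(rows_per_task):
        part = table_dir / f"part-{task_id:05d}"  # hypothetical file name
        text = "".join("\t".join(str(v) for v in row) + "\n" for row in rows)
        part.write_text(text)


def read_table(table_dir):
    """Read the whole 'table' by concatenating every file in the directory."""
    rows = []
    for part in sorted(table_dir.iterdir()):
        for line in part.read_text().splitlines():
            rows.append(line.split("\t"))
    return rows


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        # A directory, despite the .txt suffix, just like the LOCATION in the question.
        table_dir = Path(tmp) / "test-table2.txt"
        write_part_files(table_dir, [
            [("early bird gets the worm", 1990, 3, 3, 3)],  # made-up row
            [],  # a task with no matching rows leaves a 0-byte file
            [("early bird gets the worm", 2000, 5, 5, 4)],  # made-up row
        ])
        sizes = [p.stat().st_size for p in sorted(table_dir.iterdir())]
        return sizes, read_table(table_dir)


sizes, rows = demo()
```

Here sizes holds one entry per file, with a 0 for the task that had nothing to write, and rows holds the combined contents of all files.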


S3 Browser is a very nice tool for working with S3.




What probably happened is that very few (or no) rows satisfied the predicate in the WHERE clause, so little or nothing was selected and written to the output, hence the zero-sized files. EMR doesn't give a simple way to download the result of a query.
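One way to check this from the file listing alone is to add up the sizes of the files under the table's location. The Python sketch below assumes you have already copied the keys and sizes from the S3 console or another tool; the summarize_output helper and the keys in the listing are hypothetical and only show the shape of the check.

```python
def summarize_output(listing):
    """Summarize (key, size_in_bytes) pairs listed under a table's S3 prefix."""
    non_empty = [(key, size) for key, size in listing if size > 0]
    return {
        "files": len(listing),
        "empty_files": len(listing) - len(non_empty),
        "total_bytes": sum(size for _, size in non_empty),
        "keys_with_data": [key for key, _ in non_empty],
    }


# Hypothetical listing: these keys and sizes are invented for illustration.
listing = [
    ("sub-bucket/test-table2.txt/0a1b2c", 0),
    ("sub-bucket/test-table2.txt/3d4e5f", 0),
]

summary = summarize_output(listing)
if summary["total_bytes"] == 0:
    print("Every output file is empty, so the query wrote no rows.")
```

If every file is empty, running SELECT COUNT(*) in Hive against the source table with the same WHERE clause would confirm whether any rows match at all.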




