Tuesday, May 27, 2014

Hive on the DataStax Cassandra File System - fixed-width text file - integration issue - Stack Overflow


I'm trying to read a fixed-width text file stored in the Cassandra File System (CFS) using Hive. I can query the file from the Hive client, but when I run the same query through the Hadoop Hive JDBC driver, it reports that the table is not available or that the connection is bad. Below are the steps I followed.


Input file (employees.dat):


2736Ambalavanar              Thirugnanam              BNYM-EAG       2005-05-091982-12-18
2737Anand                    Jeyamani                 BNYM-AST       2005-05-091984-07-12
3123Muthukumar               Rajendran                BNYM-EES       2009-08-121988-02-23

Starting the Hive client


bash-3.2# dse hive;
Logging initialized using configuration in file:/etc/dse/hive/hive-log4j.properties
Hive history file=/tmp/root/hive_job_log_root_201209250900_157600446.txt
hive> use HiveDB;
OK
Time taken: 1.149 seconds

Creating a Hive external table pointing to the fixed-width text file


hive> CREATE EXTERNAL TABLE employees (empid STRING, firstname STRING, lastname STRING, dept STRING, dateofjoining STRING, dateofbirth STRING)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
> WITH SERDEPROPERTIES ("input.regex" = "(.{4})(.{25})(.{25})(.{15})(.{10})(.{10}).*" )
> LOCATION 'cfs://hostname:9160/folder/';
OK
Time taken: 0.524 seconds
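
RegexSerDe maps each capture group to a table column in declaration order, so the group widths (4, 25, 25, 15, 10, 10) must line up exactly with the file layout. As a quick sanity check outside Hive, the same pattern can be tried against a sample row with plain java.util.regex (a standalone sketch, not part of the DSE setup):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FixedWidthRegexDemo {
    public static void main(String[] args) {
        // Same pattern as in SERDEPROPERTIES: six groups of widths 4, 25, 25, 15, 10, 10.
        Pattern p = Pattern.compile("(.{4})(.{25})(.{25})(.{15})(.{10})(.{10}).*");
        String row = "2736Ambalavanar              Thirugnanam              "
                   + "BNYM-EAG       2005-05-091982-12-18";
        Matcher m = p.matcher(row);
        if (m.matches()) {
            // RegexSerDe keeps the padding; trim() here is just for readable output.
            for (int g = 1; g <= 6; g++) {
                System.out.println("column " + g + " = '" + m.group(g).trim() + "'");
            }
        }
    }
}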

Doing a select * from the table


hive> select * from employees;
OK
2736 Ambalavanar Thirugnanam BNYM-EAG 2005-05-09 1982-12-18
2737 Anand Jeyamani BNYM-AST 2005-05-09 1984-07-12
3123 Muthukumar Rajendran BNYM-EES 2009-08-12 1988-02-23
Time taken: 0.698 seconds

Doing a select with specific fields from the Hive table throws a permission error (first issue). Note that select * is served by a direct fetch of the file, while projecting specific columns launches a MapReduce job, which is why the staging-directory check only bites here.


hive> select empid, firstname from employees;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
java.io.IOException: The ownership/permissions on the staging directory cfs:/tmp/hadoop-root/mapred/staging/root/.staging is not as expected. It is owned by root and permissions are rwxrwxrwx. The directory must be owned by the submitter root or by root and permissions must be rwx------
at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:108)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:856)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:452)
at org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:136)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:133)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1332)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1123)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:931)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:255)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:212)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:671)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:554)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Job Submission failed with exception 'java.io.IOException(The ownership/permissions on the staging directory cfs:/tmp/hadoop-root/mapred/staging/root/.staging is not as expected. It is owned by root and permissions are rwxrwxrwx. The directory must be owned by the submitter root or by root and permissions must be rwx------)'
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MapRedTask
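
The message spells out the requirement: the staging directory must be owned by the submitter with rwx------ (0700) permissions, but it is currently world-writable. A likely fix is to tighten it from the shell, e.g. dse hadoop fs -chmod 700 /tmp/hadoop-root/mapred/staging/root/.staging, or programmatically through the Hadoop FileSystem API, as in this sketch (the cfs:// URI mirrors the LOCATION above and assumes the DSE/CFS classes are on the classpath):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class FixStagingPerms {
    public static void main(String[] args) throws Exception {
        // Connect to CFS; "hostname" is a placeholder, as in the LOCATION clause above.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("cfs://hostname:9160/"), conf);
        // The job client requires rwx------ (0700) on the submitter's staging dir.
        Path staging = new Path("/tmp/hadoop-root/mapred/staging/root/.staging");
        fs.setPermission(staging, new FsPermission((short) 0700));
    }
}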

The second issue is that when I run the select * query through the Hive JDBC driver (from outside the DSE/Cassandra nodes), it says the table employees is not available. The external table behaves like a temporary table and does not get persisted. When I run 'hive> show tables;', the employees table is not listed. Can anyone help me figure out the problem?
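
For context, this is roughly how the table is queried remotely with the Hive JDBC driver of this era (HiveServer1). A minimal sketch, where the host and the default port 10000 are assumptions about the setup:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcCheck {
    public static void main(String[] args) throws Exception {
        // HiveServer1-style driver and URL; host and port are placeholders.
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection(
                "jdbc:hive://hostname:10000/default", "", "");
        Statement stmt = con.createStatement();
        stmt.execute("use HiveDB");
        // This is where "table not found" surfaces if the metastore has lost the table.
        ResultSet rs = stmt.executeQuery("select * from employees");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getString(2));
        }
        con.close();
    }
}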




I don't have an immediate answer for the first issue, but the second looks like it's due to a known issue.


There is a bug in DSE 2.1 that drops external tables created from CFS files from the metastore when show tables is run. Only the table metadata is removed; the data remains in CFS, so if you recreate the table definition you shouldn't have to reload it. Tables backed by Cassandra column families are not affected by this bug. This has been fixed in the DSE 2.2 release, which is due out imminently.


I'm not familiar with the Hive JDBC driver, but if it issues a show tables command at any point, it could be triggering this bug.


