lundi 19 mai 2014

Hadoop - comte et grouper par dans la ruche - Stack Overflow


I have a table in the hive as follows,


table1

Cola | Colb | Colc | Cold |
---------------------------------
...etc
efo18 691 123 5692
efo18 691 345 5657
...etc
fsx31 950 291 23456
fsx31 950 404 23456
fsx31 950 343 23456
fsx31 950 182 23456
fsx31 950 120 45042
fsx31 950 161 23456
....etc
klz57 490 121 3330
klz57 490 113 3330
klz57 490 308 3330
klz57 490 411 3330
klz57 490 161 3330
klz57 386 108 3330
klz57 490 113 3330
klz57 490 125 3330
klz57 490 165 3330
klz57 490 166 3330
...etc
---------------------------------

I wanted another table from table1 data with those having same value for Cold in a group and within that, those having the same Colb to have a sub-group and within that sub-group, those havin same values of Cola to belong to a group. In other words, each unique combination of Cola,Colb,Cold to be a row.And repeated rows be summed up.


insert into table table2(Col1 string,Col2 string,Col3 string,Count int) select cola,colb,cold,count(*) from table1 group by cold,colb,cola;


I expected this,


Col1   | Col2  |  Col3     |  Count  |
-------------------------------------
efo18 691 5692 1
efo18 691 5657 1
fsx31 950 23456 5 <-----1
fsx31 950 45042 1 <-----1
klz57 490 1234 9 <-----2
klz57 386 1234 1 <-----2
--------------------------------------

I got this,


table2

Col1 | Col2 | Col3 | Count |
-------------------------------------
efo18 691 5692 1
efo18 691 5657 1
fsx31 950 23456 4 <-----1
fsx31 950 25456 1 <-----1
fsx31 950 45042 1 <-----1
klz57 490 1234 8 <-----2
klz57 386 1234 1 <-----2
klz57 490 1234 1 <-----2
--------------------------------------

What I don't understand is I am doing a group by on Cold then on Colb followed by Cola, then why does the Count for the rows marked (<----1),for values from Cola, is in different rows, even though everything falls in the same group? Colc is different for these two rows, but I haven't made use of it in grouping sp how are the two rows different?. Similarly for rows marked (<----2), what is the problem here.


Update:


Binary01, I was trying out the example you gave


hive> select * from xyz;
OK
x y z zz
xxx 111 222 123 NULL NULL NULL
xxx 111 222 123 NULL NULL NULL
xxx 101 222 123 NULL NULL NULL
xux 111 422 123 NULL NULL NULL
xxx 111 522 323 NULL NULL NULL
xyx 111 622 123 NULL NULL NULL
xxx 115 322 123 NULL NULL NULL
xxx 111 122 123 NULL NULL NULL
xxx 111 223 123 NULL NULL NULL
xxy 111 212 143 NULL NULL NULL
xxx 117 222 123 NULL NULL NULL

What are those NULL values doing there? I have copy pasted your example line by line. Even creating the table as ,


create table xyz(x string ,y string, z string , zz string) 
row format delimited fields terminated by ',';

and the final query gives,


hive> select * from xyztemp;
OK
xux 111 422 123 NULL NULL 1
xxx 101 222 123 NULL NULL 1
xxx 111 122 123 NULL NULL 1
xxx 111 222 123 NULL NULL 2
xxx 111 223 123 NULL NULL 1
xxx 111 522 323 NULL NULL 1
xxx 115 322 123 NULL NULL 1
xxx 117 222 123 NULL NULL 1
xxy 111 212 143 NULL NULL 1
xyx 111 622 123 NULL NULL 1



I think you missed something. I tried with the following data similar to your table and check the output its perfectly fine as you expected.

hive>set hive.cli.print.header=true;
hive> load data local inpath '/home/brdev/sudeep/testdata.txt' into table xyz;
hive> create table xyz(x string ,y string, z string , zz string) row format delimited fields terminated by ',';
hive> select * from xyz;
OK
x y z zz
xxx 111 222 123
xxx 111 222 123
xxx 101 222 123
xux 111 422 123
xxx 111 522 323
xyx 111 622 123
xxx 115 322 123
xxx 111 122 123
xxx 111 223 123
xxy 111 212 143
xxx 117 222 123

hive>create table xyztemp ( aa string,bb string,cc string , dd int);
hive>insert into table xyztemp select x,y,zz,count(*) from xyz group by zz,y,x;
hive> select * from xyztemp;
OK
aa bb cc dd
xxx 101 123 1
xux 111 123 1
xxx 111 123 4
xyx 111 123 1
xxx 115 123 1
xxx 117 123 1
xxy 111 143 1
xxx 111 323 1

I guess the above is the expected output you are looking for.



Not sure this is the problem here, but I've seen similar results in cases where the column values had leading or trailing spaces in some rows. Hive will not consider 'fsx31' and 'fsx31 ' (note the trailing space for the latter) to be the same value and will therefore give them individual rows when grouping.


What happens when you use TRIM to remove leading and trailing spaces, like so?


INSERT INTO table table2(Col1 STRING, Col2 STRING, Col3 STRING, Count INT) 
SELECT cola, colb, cold, count(*)
FROM (
SELECT
TRIM(cola) AS cola,
TRIM(colb) AS colb,
TRIM(colb) AS colb
FROM table1
)
GROUP BY cold, colb, cola;

(Using a subquery to trim and alias the columns is not strictly necessary, but I hope it helps to clarify what the intent is.)


Hope that helps.



I have a table in the hive as follows,


table1

Cola | Colb | Colc | Cold |
---------------------------------
...etc
efo18 691 123 5692
efo18 691 345 5657
...etc
fsx31 950 291 23456
fsx31 950 404 23456
fsx31 950 343 23456
fsx31 950 182 23456
fsx31 950 120 45042
fsx31 950 161 23456
....etc
klz57 490 121 3330
klz57 490 113 3330
klz57 490 308 3330
klz57 490 411 3330
klz57 490 161 3330
klz57 386 108 3330
klz57 490 113 3330
klz57 490 125 3330
klz57 490 165 3330
klz57 490 166 3330
...etc
---------------------------------

I wanted another table from table1 data with those having same value for Cold in a group and within that, those having the same Colb to have a sub-group and within that sub-group, those havin same values of Cola to belong to a group. In other words, each unique combination of Cola,Colb,Cold to be a row.And repeated rows be summed up.


insert into table table2(Col1 string,Col2 string,Col3 string,Count int) select cola,colb,cold,count(*) from table1 group by cold,colb,cola;


I expected this,


Col1   | Col2  |  Col3     |  Count  |
-------------------------------------
efo18 691 5692 1
efo18 691 5657 1
fsx31 950 23456 5 <-----1
fsx31 950 45042 1 <-----1
klz57 490 1234 9 <-----2
klz57 386 1234 1 <-----2
--------------------------------------

I got this,


table2

Col1 | Col2 | Col3 | Count |
-------------------------------------
efo18 691 5692 1
efo18 691 5657 1
fsx31 950 23456 4 <-----1
fsx31 950 25456 1 <-----1
fsx31 950 45042 1 <-----1
klz57 490 1234 8 <-----2
klz57 386 1234 1 <-----2
klz57 490 1234 1 <-----2
--------------------------------------

What I don't understand is I am doing a group by on Cold then on Colb followed by Cola, then why does the Count for the rows marked (<----1),for values from Cola, is in different rows, even though everything falls in the same group? Colc is different for these two rows, but I haven't made use of it in grouping sp how are the two rows different?. Similarly for rows marked (<----2), what is the problem here.


Update:


Binary01, I was trying out the example you gave


hive> select * from xyz;
OK
x y z zz
xxx 111 222 123 NULL NULL NULL
xxx 111 222 123 NULL NULL NULL
xxx 101 222 123 NULL NULL NULL
xux 111 422 123 NULL NULL NULL
xxx 111 522 323 NULL NULL NULL
xyx 111 622 123 NULL NULL NULL
xxx 115 322 123 NULL NULL NULL
xxx 111 122 123 NULL NULL NULL
xxx 111 223 123 NULL NULL NULL
xxy 111 212 143 NULL NULL NULL
xxx 117 222 123 NULL NULL NULL

What are those NULL values doing there? I have copy pasted your example line by line. Even creating the table as ,


create table xyz(x string ,y string, z string , zz string) 
row format delimited fields terminated by ',';

and the final query gives,


hive> select * from xyztemp;
OK
xux 111 422 123 NULL NULL 1
xxx 101 222 123 NULL NULL 1
xxx 111 122 123 NULL NULL 1
xxx 111 222 123 NULL NULL 2
xxx 111 223 123 NULL NULL 1
xxx 111 522 323 NULL NULL 1
xxx 115 322 123 NULL NULL 1
xxx 117 222 123 NULL NULL 1
xxy 111 212 143 NULL NULL 1
xyx 111 622 123 NULL NULL 1


I think you missed something. I tried with the following data similar to your table and check the output its perfectly fine as you expected.

hive>set hive.cli.print.header=true;
hive> load data local inpath '/home/brdev/sudeep/testdata.txt' into table xyz;
hive> create table xyz(x string ,y string, z string , zz string) row format delimited fields terminated by ',';
hive> select * from xyz;
OK
x y z zz
xxx 111 222 123
xxx 111 222 123
xxx 101 222 123
xux 111 422 123
xxx 111 522 323
xyx 111 622 123
xxx 115 322 123
xxx 111 122 123
xxx 111 223 123
xxy 111 212 143
xxx 117 222 123

hive>create table xyztemp ( aa string,bb string,cc string , dd int);
hive>insert into table xyztemp select x,y,zz,count(*) from xyz group by zz,y,x;
hive> select * from xyztemp;
OK
aa bb cc dd
xxx 101 123 1
xux 111 123 1
xxx 111 123 4
xyx 111 123 1
xxx 115 123 1
xxx 117 123 1
xxy 111 143 1
xxx 111 323 1

I guess the above is the expected output you are looking for.


Not sure this is the problem here, but I've seen similar results in cases where the column values had leading or trailing spaces in some rows. Hive will not consider 'fsx31' and 'fsx31 ' (note the trailing space for the latter) to be the same value and will therefore give them individual rows when grouping.


What happens when you use TRIM to remove leading and trailing spaces, like so?


INSERT INTO table table2(Col1 STRING, Col2 STRING, Col3 STRING, Count INT) 
SELECT cola, colb, cold, count(*)
FROM (
SELECT
TRIM(cola) AS cola,
TRIM(colb) AS colb,
TRIM(colb) AS colb
FROM table1
)
GROUP BY cold, colb, cola;

(Using a subquery to trim and alias the columns is not strictly necessary, but I hope it helps to clarify what the intent is.)


Hope that helps.


0 commentaires:

Enregistrer un commentaire