Thursday, 24 April 2014

Hadoop - Can't extract values from a map in Apache Pig - Stack Overflow


I have a simple relation, v, in Apache Pig:


dump v;

(151364,[ 'ref'#'R813','highway'#'secondary', 'name:ga'#'Lána Chairdif', 'name'#'Cardiff Lane'],(31015271, 31053762))
(151368,[ 'ref'#'N1', 'oneway'#'yes','designation'#'Buses Only', 'highway'#'trunk', 'motor_vehicle'#'designated', 'name:ga'#'Cearnóg Pharnell Thoir', 'maxspeed'#'30', 'name'#'Parnell Square East'],(389365, 540403072))
(151596,[ 'name:en'#'Liffey', 'boundary'#'administrative', 'name:ga'#'An Life','admin_level'#'8', 'name'#'Liffey', 'waterway'#'river'],(1347749, 1426049020, 1347745, 1426049019, 1347742, 900075612))
(367947,[ 'maxspeed'#'80', 'ref'#'L2223','highway'#'tertiary'],(13259933, 2384217, 335978958))
(367952,['created_by'#'YahooApplet 1.0', 'name'#'Charnwood Avenue', 'highway'#'residential'],(2384386, 25963471, 14949594, 2384385, 6146344, 2384254))
(508603,[ 'ref'#'L3018','highway'#'tertiary', 'maxspeed'#'50', 'name'#'Shelerin Road'],(2854184, 2854168, 335978984, 2853307, 2384254, 335978978, 335978975, 2655735, 2655703, 392675957, 11676198, 920037194, 244531387, 2655952, 11675077))
(727153,[ 'ref'#'N8','highway'#'trunk', 'name'#'Merchants' Quay'],(354153, 453344873))
(727157,['highway'#'unclassified', 'oneway'#'yes', 'maxspeed'#'30', 'name'#'Kyle Street'],(354168, 354167))
(727159,['highway'#'unclassified', 'oneway'#'yes', 'maxspeed'#'30', 'name'#'North Main Street'],(354178, 465226768, 354167, 413995429, 72219131, 685537307, 1232381779, 354164))
(727161,[ 'maxspeed'#'30','highway'#'pedestrian', 'name'#'Maylor Street'],(1486492976, 1515360721, 1515360722, 1515345383, 1515344226, 1515344227, 1515344228, 1515344231))

On @orangeoctopus's advice, I have tried regenerating my data without any ' in the key names, and I now have this data:


(151364,[ ref#'R813', name:ga#'Lána Chairdif', name#'Cardiff Lane',highway#'secondary'],(31015271, 31053762))
(151368,[ motor_vehicle#'designated', name#'Parnell Square East', highway#'trunk', oneway#'yes',designation#'Buses Only', maxspeed#'30', name:ga#'Cearnóg Pharnell Thoir', ref#'N1'],(389365, 540403072))
(151596,[ name:en#'Liffey', boundary#'administrative', waterway#'river', name:ga#'An Life',admin_level#'8', name#'Liffey'],(1347749, 1426049020, 1347745, 1426049019, 1347742, 900075612))
(367947,[highway#'tertiary', maxspeed#'80', ref#'L2223'],(13259933, 2384217, 335978958))
(367952,[ name#'Charnwood Avenue',created_by#'YahooApplet 1.0', highway#'residential'],(2384386, 25963471, 14949594, 2384385, 6146344, 2384254))
(508603,[ maxspeed#'50', ref#'L3018', name#'Shelerin Road',highway#'tertiary'],(2854184, 2854168, 335978984, 2853307, 2384254, 335978978, 335978975, 2655735, 2655703, 392675957, 11676198, 920037194, 244531387, 2655952, 11675077))
(727153,[highway#'trunk', name#'Merchants' Quay', ref#'N8'],(354153, 453344873))
(727157,[ oneway#'yes', maxspeed#'30', name#'Kyle Street',highway#'unclassified'],(354168, 354167))
(727159,[ oneway#'yes', maxspeed#'30', name#'North Main Street',highway#'unclassified'],(354178, 465226768, 354167, 413995429, 72219131, 685537307, 1232381779, 354164))
(727161,[highway#'pedestrian', name#'Maylor Street', maxspeed#'30'],(1486492976, 1515360721, 1515360722, 1515345383, 1515344226, 1515344227, 1515344228, 1515344231))

In both cases v has the same schema/structure:


grunt> describe v;
2012-01-09 22:55:34,271 [main] WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_CHARARRAY 1 time(s).
v: {id: int,tags: map[ ],nodes: (null)}

Then I try to extract just one value from the tags map:


grunt> w = foreach v generate tags#'ref';    
dump w;

But it only gives me empty tuples, even though some rows do have a value for this key.


()
()
()
()
()
()
()
()
()
()

With the old 'quoted' keys I tried the following (as per @orangeoctopus's solution):


w = foreach v generate tags#'\'ref\''; 

That gave me the same empty data, so it didn't work either. (I also tried other combinations of ' and ", like "'ref'"/'"ref"'/etc., but all except '\'ref\'' were invalid Pig Latin syntax.)


What's going on? If I try to filter based on a tag value (e.g. filter v by tags#'highway' != ''), I get nothing, which is consistent with the problem above of not being able to extract data from the map. Am I doing something wrong?




Very tricky!


Your problem is that your literal data includes single quotes. Your string is not ref (3 characters long), it is 'ref' (5 characters long). I noticed this because the dump of a map containing strings does not normally show quotes.


Therefore, the key you look up needs to include those quotes (you have to escape them with \):


grunt> w = foreach v generate tags#'\'ref\'';    
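To make the 3-versus-5-character point concrete, here is a small analogy in Python rather than Pig. It only illustrates how a lookup behaves when the quote characters are part of the stored key; the dictionary contents are copied from the first row of the dump, and nothing here claims to reproduce Pig's own map handling.

```python
# Python analogy (not Pig): the stored keys and values include the quote characters.
tags = {"'ref'": "'R813'", "'highway'": "'secondary'"}

bare_key = "ref"        # 3 characters
quoted_key = "'ref'"    # 5 characters: the quotes are part of the string

bare_lookup = tags.get(bare_key)        # no such key, so this is None
quoted_lookup = tags.get(quoted_key)    # found, and the value keeps its quotes too
```

Note that the value comes back with its own quotes attached, so any later comparison against the value would need to account for them as well.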

Your other option is to change how the data is loaded so that the strings themselves do not contain the single quotes. PigStorage doesn't do this for free, but you could use something like REPLACE or your own UDF to strip them out.
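If the file is generated outside Pig anyway, the quotes could also be removed before loading. The sketch below is one possible Python preprocessing step, not the REPLACE or UDF approach named above. The helper names `unquote` and `strip_map_quotes` are hypothetical, and it assumes every entry has the form 'key'#'value' as in the dump.

```python
import re

# Split only on commas that are followed by a quoted key and '#',
# so a comma inside a value is less likely to break an entry apart.
ENTRY_SEPARATOR = re.compile(r",\s*(?='[^'#]*'#)")


def unquote(text):
    """Remove one pair of surrounding single quotes, if present."""
    text = text.strip()
    if len(text) >= 2 and text.startswith("'") and text.endswith("'"):
        return text[1:-1]
    return text


def strip_map_quotes(map_text):
    """Rewrite "[ 'k'#'v', ...]" as "[k#v,...]" (hypothetical helper)."""
    body = map_text.strip()[1:-1]          # drop the enclosing [ and ]
    if not body.strip():
        return "[]"
    pairs = []
    for entry in ENTRY_SEPARATOR.split(body):
        key, _, value = entry.partition("#")   # split at the first '#'
        pairs.append(unquote(key) + "#" + unquote(value))
    return "[" + ",".join(pairs) + "]"


cleaned = strip_map_quotes("[ 'ref'#'N8','highway'#'trunk', 'name'#'Merchants' Quay']")
# cleaned == "[ref#N8,highway#trunk,name#Merchants' Quay]"
```

Only the outer pair of quotes is removed, so the apostrophe inside Merchants' Quay survives. Whether Pig's map parser accepts such a value unchanged is not something this sketch checks.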




Are you loading the data correctly too? It is odd that there is a space after the [ and before the ] when you dump your map.


Also, it is simpler to drop all the quotes from the keys and values in the input data. For example:


Input file


151364  [ref#R813,highway#secondary]

Pig


a = LOAD 'data.txt' AS (id:INT, m:MAP[]);
DUMP a;
b = FOREACH a GENERATE m#'ref';
DUMP b;

Output


(151364,[highway#secondary,ref#R813])

(R813)
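If the input file is produced by a script, writing it in this unquoted form from the start avoids the problem. A minimal Python sketch, assuming the tags are available as a dictionary and that the id and the map are separated by a tab (PigStorage's default field delimiter); `format_pig_row` is a hypothetical name:

```python
def format_pig_row(way_id, tags):
    """Build one input line: the id, a tab, then [key#value,...] with no quotes."""
    map_text = ",".join(f"{key}#{value}" for key, value in tags.items())
    return f"{way_id}\t[{map_text}]"


row = format_pig_row(151364, {"ref": "R813", "highway": "secondary"})
# row == "151364\t[ref#R813,highway#secondary]"
```

The sketch does no escaping, so keys or values that themselves contain #, a comma, or ] would need separate handling.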


