dimanche 25 mai 2014

Créer la table externe de ruche de fichiers avec différents formats d'enregistrement - Stack Overflow


I would like to create an external HIVE table that utilizes data already in HDFS. The files are in a directory like /hdfs/data/location and sub directories with the year-month format. ex: /hdfs/data/location/2013-december and /hdfs/data/location/2014-january.


Within these directories are multiple files but within the files are different types of data (different fields). Examples of the different types of records are:


Type A


type
state
city
population

Type B


type
zipcode
registeredvoters

Actual data example (tab delimited)


type:A   state:New York  city:New York    population:8336697
type:A state:California city:Los Angeles population:3857799
type:B zipcode:92118 registeredvoters:794051
type:B zipcode:92155 registeredvoters:794053
type:A state:Illinois city:Chicago population:2714856

The data is already in this format and being used by other processes outside of HIVE, so changing it may not be an option. I would also not want to duplicate the data in HDFS.


Is there a way to create a HIVE table for only a given type as defined in the data above?


Here is what I have so far for the create:


create external table population (
type string,
state string,
city string,
population int
)
location '/hdfs/data/location';



I don't think you can have a table, but I think you could make a view, interpreting the line using the str_to_map UDF


create external table raw_population( line string ) location '/hdfs/data/location';

create view population_view as
select
pmap['type'] as type,
pmap['state'] as state,
pmap['city'] as city
pmap['population'] as population
from
( select str_to_map( line, '\t', ':') as pmap from raw_population ) pm;


I would like to create an external HIVE table that utilizes data already in HDFS. The files are in a directory like /hdfs/data/location and sub directories with the year-month format. ex: /hdfs/data/location/2013-december and /hdfs/data/location/2014-january.


Within these directories are multiple files but within the files are different types of data (different fields). Examples of the different types of records are:


Type A


type
state
city
population

Type B


type
zipcode
registeredvoters

Actual data example (tab delimited)


type:A   state:New York  city:New York    population:8336697
type:A state:California city:Los Angeles population:3857799
type:B zipcode:92118 registeredvoters:794051
type:B zipcode:92155 registeredvoters:794053
type:A state:Illinois city:Chicago population:2714856

The data is already in this format and being used by other processes outside of HIVE, so changing it may not be an option. I would also not want to duplicate the data in HDFS.


Is there a way to create a HIVE table for only a given type as defined in the data above?


Here is what I have so far for the create:


create external table population (
type string,
state string,
city string,
population int
)
location '/hdfs/data/location';


I don't think you can have a table, but I think you could make a view, interpreting the line using the str_to_map UDF


create external table raw_population( line string ) location '/hdfs/data/location';

create view population_view as
select
pmap['type'] as type,
pmap['state'] as state,
pmap['city'] as city
pmap['population'] as population
from
( select str_to_map( line, '\t', ':') as pmap from raw_population ) pm;

0 commentaires:

Enregistrer un commentaire