Tuesday, May 27, 2014

Hadoop - creating a nested record in an Avro schema for a simple JSON - Stack Overflow


I'm trying to build an Avro schema for the following JSON (for Hadoop):


{
"name_tag":"Guy",
"known_nested_structure" : {
"fieldA" : ["value1"],
"fieldB" : ["value1","value2"],
"fieldC" : [],
"fieldD" : ["value1"]
},
"another_field" : "hi"
}

My first idea was this Avro schema (including the Hive commands):


CREATE EXTERNAL TABLE IF NOT EXISTS record_table
PARTITIONED BY (YEAR INT, MONTH INT, DAY INT, HOUR INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs://localhost/data/output/records_data/hourly'
TBLPROPERTIES ('avro.schema.literal'='{
"name": "myRecord",
"type": "record",
"fields": [
{"name":"name_tag", "type":"string",c"default": ""},
{
"name": "known_nested_structure",
"type": "record",
"fields": [
{"name":"fieldA", "type":{"type":"array","items":"string"},"default":null},
{"name":"fieldB", "type":{"type":"array","items":"string"},"default":null},
{"name":"fieldC", "type":{"type":"array","items":"string"},"default":null},
{"name":"fieldD", "type":{"type":"array","items":"string"},"default":null}
],
"default":null
},
{"name": "another_field","type":"string","default": ""}
]
}');

The Hive result of the command:


OK
error_error_error_error_error_error_error string from deserializer
cannot_determine_schema string from deserializer
check string from deserializer
schema string from deserializer
url string from deserializer
and string from deserializer
literal string from deserializer
year int
month int
day int
hour int
Time taken: 0.128 seconds


But for some reason, this is the Avro schema that works:


{
"name": "myRecord",
"type": "record",
"fields": [
{"name":"name_tag", "type":"string","default": null},
{
"name": "known_nested_structure",
"type": {
"name": "known_nested_structure",
"type": "record",
"fields": [
{"name":"fieldA", "type":{"type":"array","items":"string"},"default":null},
{"name":"fieldB", "type":{"type":"array","items":"string"},"default":null},
{"name":"fieldC", "type":{"type":"array","items":"string"},"default":null},
{"name":"fieldD", "type":{"type":"array","items":"string"},"default":null}
],
"default":null

}
},
{"name": "another_field","type": "string","default": null}
]
}

Result:


OK
name_tag string from deserializer
known_nested_structure struct<fielda:array<string>,fieldb:array<string>,fieldc:array<string>,fieldd:array<string>> from deserializer
another_field string from deserializer
year int
month int
day int
hour int
Time taken: 0.123 seconds

What is the reason that the first Avro schema doesn't work? Why can't I put a record directly as a field, instead of wrapping it in a nested "type" object that repeats the name (known_nested_structure inside known_nested_structure in my second schema example)?


Thanks,


Guy




As far as I can see, the AvroSerDe uses the Avro API, and to parse the schema it calls the parse() method of org.apache.avro.Schema. If you look into that method, you can see that it makes a recursive call to parse when reading the fields: the value of each field's "type" is itself parsed as a schema. So if a field holds a record, its "type" has to be a complete record schema that follows the same (name, type="record", fields[]) convention as the top-level one. In your first schema, "type": "record" sits directly on the field as a plain string, so there is no nested schema object to recurse into. That is the likely reason why your second schema worked and the first one failed. Looking up org.apache.avro.Schema on grepcode should explain it.
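To make the recursion concrete, here is a rough Python sketch of the rule. It is not the Avro library and not its real parser: check_schema, PRIMITIVES, FIRST and SECOND are names made up for this illustration, the two schemas are cut down to a single nested field, and the stray c from the question is left out so that both parse as JSON. It only mimics the idea that a field's "type" must itself be a schema.

```python
import json

# Avro primitive type names; everything else here is a simplified stand-in
# for what a real schema parser does.
PRIMITIVES = {"null", "boolean", "int", "long", "float", "double", "bytes", "string"}


def check_schema(schema, named=None):
    """Return a list of problems found in an Avro-style schema (simplified)."""
    named = set() if named is None else named
    if isinstance(schema, str):
        # A bare string must name a primitive or an already-defined named type.
        if schema in PRIMITIVES or schema in named:
            return []
        return [f"'{schema}' is not a primitive or a previously defined type"]
    if isinstance(schema, list):
        # A union: every branch is a schema in its own right.
        return [p for branch in schema for p in check_schema(branch, named)]
    if not isinstance(schema, dict):
        return [f"unexpected schema node: {schema!r}"]
    kind = schema.get("type")
    if kind == "record":
        if "name" not in schema or "fields" not in schema:
            return ["a record schema needs both 'name' and 'fields'"]
        named.add(schema["name"])
        problems = []
        for field in schema["fields"]:
            # The recursive step: the field's "type" is parsed as a schema.
            for p in check_schema(field.get("type"), named):
                problems.append(f"field '{field.get('name')}': {p}")
        return problems
    if kind == "array":
        return check_schema(schema.get("items"), named)
    if kind == "map":
        return check_schema(schema.get("values"), named)
    return check_schema(kind, named)


# Shape of the first schema: "type": "record" placed directly on the field.
FIRST = json.loads("""
{"name": "myRecord", "type": "record", "fields": [
  {"name": "name_tag", "type": "string", "default": ""},
  {"name": "known_nested_structure", "type": "record", "fields": [
    {"name": "fieldA", "type": {"type": "array", "items": "string"}}
  ]},
  {"name": "another_field", "type": "string", "default": ""}
]}
""")

# Shape of the second schema: the field's "type" is a full record schema.
SECOND = json.loads("""
{"name": "myRecord", "type": "record", "fields": [
  {"name": "name_tag", "type": "string", "default": ""},
  {"name": "known_nested_structure", "type": {
    "name": "known_nested_structure", "type": "record", "fields": [
      {"name": "fieldA", "type": {"type": "array", "items": "string"}}
    ]
  }},
  {"name": "another_field", "type": "string", "default": ""}
]}
""")

print(check_schema(FIRST))   # one problem, reported on known_nested_structure
print(check_schema(SECOND))  # []
```

In the first shape the checker only ever sees the string "record" for that field, which is neither a primitive nor a name defined earlier, so the sibling "fields" list is never reached. In the second shape the recursion finds a full record schema and carries on into its fields.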




There is one error I can see in your schema (a stray c before "default"):


{"name":"name_tag", "type":"string",c"default": ""},

It should be:


{"name":"name_tag", "type":"string","default": ""},
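A quick way to confirm this kind of typo is to run the literal through any JSON parser before handing it to Hive. The snippet below is just a sketch using Python's standard json module; is_valid_json is a helper name made up for the example.

```python
import json

# The field line as posted in the question, and the corrected version.
broken = '{"name":"name_tag", "type":"string",c"default": ""}'
fixed = '{"name":"name_tag", "type":"string","default": ""}'


def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(is_valid_json(broken))  # False
print(is_valid_json(fixed))   # True
```

Note that this only checks JSON syntax. A schema can be valid JSON and still be rejected as an Avro schema.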

