Hadoop - creating a nested record in an Avro schema for a simple JSON

I'm trying to build an Avro schema for the following JSON (for Hadoop):

{
  "name_tag":"Guy",
  "known_nested_structure" : {
    "fieldA" : ["value1"],
    "fieldB" : ["value1","value2"],
    "fieldC" : [],
    "fieldD" : ["value1"]
  },
  "another_field" : "hi"
}

My first idea was this Avro schema (including the Hive commands):

CREATE EXTERNAL TABLE IF NOT EXISTS record_table
     PARTITIONED BY (YEAR INT, MONTH INT, DAY INT, HOUR INT)
     ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
     STORED AS
     INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
     OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
     LOCATION 'hdfs://localhost/data/output/records_data/hourly'
     TBLPROPERTIES ('avro.schema.literal'='{
  "name": "myRecord",
  "type": "record",
  "fields": [
    {"name":"name_tag", "type":"string",c"default": ""},
    {
      "name": "known_nested_structure",
      "type": "record",
      "fields": [
          {"name":"fieldA", "type":{"type":"array","items":"string"},"default":null},
          {"name":"fieldB", "type":{"type":"array","items":"string"},"default":null},
          {"name":"fieldC", "type":{"type":"array","items":"string"},"default":null},
          {"name":"fieldD", "type":{"type":"array","items":"string"},"default":null}
        ],
        "default":null
    },
    {"name": "another_field","type":"string","default": ""}
  ]
}');

The Hive result of the command:

OK
error_error_error_error_error_error_error    string  from deserializer
cannot_determine_schema    string  from deserializer
check    string  from deserializer
schema    string  from deserializer
url    string  from deserializer
and    string  from deserializer
literal    string  from deserializer
year    int
month   int
day int
hour    int
Time taken: 0.128 seconds

But for some reason this is the Avro schema that works:

{
  "name": "myRecord",
  "type": "record",
  "fields": [
    {"name":"name_tag", "type":"string","default": null},
    {
  "name": "known_nested_structure",
  "type": {
        "name": "known_nested_structure",
        "type": "record",
        "fields": [
                {"name":"fieldA", "type":{"type":"array","items":"string"},"default":null},
                {"name":"fieldB", "type":{"type":"array","items":"string"},"default":null},
                {"name":"fieldC", "type":{"type":"array","items":"string"},"default":null},
                {"name":"fieldD", "type":{"type":"array","items":"string"},"default":null}
              ],
              "default":null

       }
    },
        {"name": "another_field","type": "string","default": null}
  ]
}

Result:

OK
name_tag    string  from deserializer
known_nested_structure          struct<fielda:array<string>,fieldb:array<string>,fieldc:array<string>,fieldd:array<string>>         from deserializer
another_field   string  from deserializer
year    int 
month   int 
day int 
hour    int 
Time taken: 0.123 seconds

What is the reason that the first Avro schema doesn't work? Why can't I put a record directly as a field (in my second schema example, known_nested_structure is nested inside another known_nested_structure)?

Thanks,

Guy

As far as I can see, the AvroSerde uses the Avro API, and it parses the schema with the parse() method of org.apache.avro.Schema. If you look at that method, you can see that it calls itself recursively when it reads the fields: each field's "type" is parsed as a schema in its own right. So if a field is a record, its "type" has to be a complete record schema that follows the same (name, type="record", fields[]) convention, rather than the bare string "record" with the fields listed next to it. That is the likely reason your second schema worked and the first one failed. Looking up org.apache.avro.Schema on grepcode should make this clear.
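
The rule can be illustrated with a small standalone sketch. This is not Avro's actual parser: resolve_type, check and PRIMITIVES are hypothetical names, and the code only models the part of the Avro specification that matters here, namely that a field's "type" must be a primitive name, the name of an already defined type, a union (a JSON array), or a complete nested schema object. The schemas are cut down to one nested field for brevity.

```python
# Primitive type names from the Avro specification.
PRIMITIVES = {"null", "boolean", "int", "long", "float", "double", "bytes", "string"}


def resolve_type(schema, known_names):
    """Recursively resolve a schema, as is done for every field "type".

    Simplified illustration only (hypothetical helper, not the Avro API).
    map, enum and fixed are left out for brevity.
    """
    if isinstance(schema, str):
        # A bare string must name a primitive or a previously defined type.
        if schema in PRIMITIVES or schema in known_names:
            return schema
        raise ValueError(f"undefined type name: {schema!r}")
    if isinstance(schema, list):
        # A JSON array is a union of schemas.
        return "union<" + ",".join(resolve_type(s, known_names) for s in schema) + ">"
    if isinstance(schema, dict):
        kind = schema.get("type")
        if kind == "array":
            return "array<" + resolve_type(schema["items"], known_names) + ">"
        if kind == "record":
            known_names.add(schema["name"])
            for field in schema["fields"]:
                # Recursive step: each field's "type" is itself a schema.
                resolve_type(field["type"], known_names)
            return "record " + schema["name"]
        # Wrapped form such as {"type": "string"}.
        return resolve_type(kind, known_names)
    raise ValueError(f"not a schema: {schema!r}")


ARRAY = {"type": "array", "items": "string"}

# First attempt: "record" used as a bare type name, fields placed beside it.
flat = {"name": "myRecord", "type": "record", "fields": [
    {"name": "known_nested_structure", "type": "record",
     "fields": [{"name": "fieldA", "type": ARRAY}]},
]}

# Working form: the field's "type" is a complete record schema object.
nested = {"name": "myRecord", "type": "record", "fields": [
    {"name": "known_nested_structure", "type": {
        "name": "known_nested_structure", "type": "record",
        "fields": [{"name": "fieldA", "type": ARRAY}]}},
]}


def check(schema):
    """Return "ok", or the error message raised while resolving."""
    try:
        resolve_type(schema, set())
        return "ok"
    except ValueError as err:
        return str(err)


print(check(flat))
print(check(nested))
```

In the flat form the string "record" is looked up as a type name and nothing by that name has been defined, so resolution fails; in the nested form the same lookup never happens because the field's "type" is an object that is parsed recursively.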

There is one error I can see in your schema (a stray c before "default"):

{"name":"name_tag", "type":"string",c"default": ""},

It should be:

{"name":"name_tag", "type":"string","default": ""},

Source

Stackoverflow Blog

Tuesday, 27 May 2014

Hadoop - creating a nested record in an Avro schema for a simple JSON - Stack Overflow
