Can we use a combination of Hive and MapReduce?
Say I have a CSV file. I need to find the mean of a column and replace the null values in that column with the mean.
Can we run a Hive query in the driver (to find the mean) and then write a MapReduce block to replace the nulls with the mean?
Which is the better way:
- writing only MapReduce code, or
- using a combination of Hive and MapReduce?
Here is another answer: this can be solved using Hive alone.
Say your CSV input looks like this:
firstname,secondname,score,group
vijay,kumar,123,cse
satish,babu,,it
kumar,nagendra,200,eie
anil,babu,,it
Then apply a query like this (I ran it and it worked):
hive> from students s join (select avg(score) as avg from students) a
> select s.firstname,
> case
> when s.score = "" or s.score is null then cast(avg as string)
> else s.score
> end as new_score;
Total MapReduce jobs = 2
output:
OK
firstname new_score
vijay 123
satish 161.5
kumar 200
anil 161.5
Time taken: 67.059 seconds, Fetched: 4 row(s)
In my view, it is better to write MapReduce code only: use job 1 to find the mean, then a map-only job 2 to do the replacement, which is easy. Combining Hive with MR will be a bit messy, because you would be writing both in one program and shipping it to the cluster nodes as a jar, and we cannot say where these tasks will run, i.e. where the Hive command would actually be executed.
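To make the two-job idea concrete, here is a rough sketch of the logic in plain Python. It is not Hadoop API code, only an illustration of what each job would compute; the function names (mean_mapper, mean_reducer, fill_mapper, run) are made up for this example, and the sample data is the CSV shown earlier. In a real Hadoop setup the mean from job 1 would probably be handed to job 2 through the job configuration, but that part is an assumption and is only simulated here by passing a value.

```python
import csv
import io

# Sample input, same rows as the CSV shown earlier in this post.
SAMPLE = """firstname,secondname,score,group
vijay,kumar,123,cse
satish,babu,,it
kumar,nagendra,200,eie
anil,babu,,it
"""


def read_rows(text):
    """Parse the CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def mean_mapper(row):
    """Job 1, map side: emit a partial (sum, count) for each non-empty score."""
    score = row["score"].strip()
    if score:
        yield float(score), 1


def mean_reducer(partials):
    """Job 1, reduce side: combine the partials into a single mean."""
    total, count = 0.0, 0
    for partial_sum, partial_count in partials:
        total += partial_sum
        count += partial_count
    return total / count if count else None


def fill_mapper(row, mean):
    """Job 2, map only: replace an empty score with the mean from job 1."""
    filled = dict(row)
    if not filled["score"].strip() and mean is not None:
        filled["score"] = str(mean)
    return filled


def run(text):
    """Run job 1 and then job 2 over the same input (hypothetical driver)."""
    rows = read_rows(text)
    partials = [p for row in rows for p in mean_mapper(row)]
    mean = mean_reducer(partials)
    return mean, [fill_mapper(row, mean) for row in rows]


if __name__ == "__main__":
    mean, filled = run(SAMPLE)
    for row in filled:
        print(row["firstname"], row["score"])
```

Job 1 needs a reducer because the sum and count have to be combined across all mappers, while job 2 needs no reducer at all since each row can be fixed independently once the mean is known.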
Hope this helps. Thanks :)