Thursday, May 8, 2014

Java - Combining MapReduce and Hive - Stack Overflow


Can we use a combination of Hive and MapReduce?


Say I have a CSV file. I need to find the mean of a column and replace the null values in that column with the mean.


Can we run a Hive query in the driver (to find the mean) and then write a MapReduce step to replace the nulls with that mean?


Which is the better way:


  1. writing only MapReduce code, or

  2. using a combination of Hive and MapReduce?




Here is another answer, which solves the problem using only Hive.


Say your CSV input looks like this:


firstname,secondname,score,group
vijay,kumar,123,cse
satish,babu,,it
kumar,nagendra,200,eie
anil,babu,,it


Then run a query like this (I ran it and it worked):


hive> from students s join (select avg(score) as avg from students) a
> select s.firstname,
> case
> when s.score = "" or s.score is null then cast(a.avg as string)
> else s.score
> end as new_score;

Total MapReduce jobs = 2


Output:


OK


firstname       new_score

vijay 123


satish 161.5


kumar 200


anil 161.5


Time taken: 67.059 seconds, Fetched: 4 row(s)




In my view, it is better to write MapReduce code only: use job 1 to find the mean, then a map-only job 2 to do the replacement, which is easy.

Combining Hive with MapReduce will be a bit messy. You would be writing both in one program and shipping it to the cluster nodes as a JAR, and you cannot say where the tasks will run, that is, where the Hive command would actually be executed.
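To make the two-job idea concrete, here is a minimal plain-Java sketch of the same logic without any Hadoop dependency. The class name `MeanImputationSketch` and the methods `columnMean` and `fillMissing` are made up for illustration; in a real MapReduce program the first method would correspond roughly to job 1 (compute the mean) and the second to the map-only job 2 (replace empty values), with the mean passed to the mappers through the job configuration. It assumes the header row has already been dropped and that empty fields mark the missing values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MeanImputationSketch {

    // Pass 1 (stands in for job 1): mean of the non-empty values in one column.
    public static double columnMean(List<String> rows, int col) {
        double sum = 0.0;
        long count = 0;
        for (String row : rows) {
            String[] fields = row.split(",", -1); // -1 keeps empty fields
            String value = fields[col].trim();
            if (!value.isEmpty()) {
                sum += Double.parseDouble(value);
                count++;
            }
        }
        if (count == 0) {
            throw new IllegalArgumentException("no values in column " + col);
        }
        return sum / count;
    }

    // Pass 2 (stands in for the map-only job 2): each row is handled
    // independently, so no reduce step is needed.
    public static List<String> fillMissing(List<String> rows, int col, double mean) {
        List<String> out = new ArrayList<>();
        for (String row : rows) {
            String[] fields = row.split(",", -1);
            if (fields[col].trim().isEmpty()) {
                fields[col] = String.valueOf(mean);
            }
            out.add(String.join(",", fields));
        }
        return out;
    }

    public static void main(String[] args) {
        // Sample rows from the answer above, without the header line.
        List<String> rows = Arrays.asList(
                "vijay,kumar,123,cse",
                "satish,babu,,it",
                "kumar,nagendra,200,eie",
                "anil,babu,,it");
        double mean = columnMean(rows, 2); // score is column index 2
        for (String row : fillMissing(rows, 2, mean)) {
            System.out.println(row);
        }
    }
}
```

The second pass never needs to see more than one row at a time, which is why the replacement can run as a map-only job once the mean is known.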


Hope this helps. Thanks :)
