Problem statement:
I need to compare two tables, Table1 and Table2, which both store the same thing. Table1 is the main table, so Table2 has to be compared against it. After the comparison I need to produce a report of the discrepancies found in Table2. Both tables hold a lot of data, in the terabyte range. Currently I have written HiveQL to do the comparison and get the data back.
So my question is: which is better in terms of performance, writing a custom mapper and reducer for this kind of job, or is the HiveQL that I wrote fine, given that I will be joining these two tables on millions of records? As far as I know, HiveQL internally (behind the scenes) generates optimized map-reduce jobs, submits them for execution, and gets back the results.
The answer to your question is two-fold.
Firstly, if the processing can be expressed in HiveQL syntax, I would argue that Hive's performance is comparable to that of custom map-reduce code. The only catch is when you have extra information about your data that you can exploit in your map-reduce code but not through Hive. For example, if your data is sorted, you can make use of that when processing your file splits in the mapper, whereas Hive cannot use the sort order to its advantage unless it is made aware of it. Often there is a way to specify such extra information (through metadata or config properties), but sometimes there may be no way to pass it to Hive at all.
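To illustrate the sorted-data case: if both tables were available sorted by the same key, custom code could compare them in a single streaming pass (a merge) instead of a general join. The following is only a minimal sketch in Python under stated assumptions, not how Hive or any particular map-reduce job does it: the `(key, value)` record layout, the function name `diff_sorted`, and the discrepancy labels are all hypothetical, and keys are assumed to be unique within each table.

```python
from typing import Iterable, Iterator, Optional, Tuple

Record = Tuple[str, str]  # (key, value): a hypothetical record layout
Diff = Tuple[str, str, Optional[str], Optional[str]]


def diff_sorted(main: Iterable[Record], other: Iterable[Record]) -> Iterator[Diff]:
    """Yield (key, kind, main_value, other_value) for every discrepancy.

    Assumes both inputs are sorted ascending by key and that keys are
    unique within each input. Each input is read exactly once.
    """
    it_a, it_b = iter(main), iter(other)
    a, b = next(it_a, None), next(it_b, None)
    while a is not None or b is not None:
        if b is None or (a is not None and a[0] < b[0]):
            # Key exists only in the main table.
            yield (a[0], "missing_in_other", a[1], None)
            a = next(it_a, None)
        elif a is None or b[0] < a[0]:
            # Key exists only in the table being checked.
            yield (b[0], "extra_in_other", None, b[1])
            b = next(it_b, None)
        else:
            # Same key on both sides: report only if the values differ.
            if a[1] != b[1]:
                yield (a[0], "value_mismatch", a[1], b[1])
            a, b = next(it_a, None), next(it_b, None)


table1 = [("a", "1"), ("b", "2"), ("d", "4")]  # stands in for Table1
table2 = [("a", "1"), ("b", "9"), ("c", "3")]  # stands in for Table2
report = list(diff_sorted(table1, table2))
```

Because the inputs are consumed as iterators, memory use stays constant regardless of table size; that is the kind of advantage a known sort order can give you.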
Secondly, sometimes the processing is convoluted enough that it cannot easily be expressed in a SQL-like statement. These cases typically involve having to keep intermediate state during processing. Hive UDAFs alleviate this problem to some extent. However, if you need something more custom, I have always preferred plugging in a custom mapper and/or reducer using the Hive Transform functionality. It lets you take advantage of map-reduce within the context of a Hive query, so you can mix and match Hive's SQL-like functionality with custom map-reduce scripts, all in the same query.
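As a rough sketch of what such a plugged-in reducer could look like for the table comparison in the question: a Transform script receives rows on stdin as tab-separated columns and writes tab-separated rows to stdout. Everything specific here is an assumption for illustration: the column layout `(key, src, value)`, the source tags `t1`/`t2`, the script name, the `tagged_rows` view, and the discrepancy labels are hypothetical, values are assumed to be non-NULL, and the script relies on Hive delivering all rows for a key together (for example via DISTRIBUTE BY / SORT BY).

```python
import sys
from itertools import groupby
from typing import Iterable, Iterator

# Hypothetical invocation from Hive (all table/column names are assumptions;
# tagged_rows stands for a view that unions both tables and tags each row
# with src = 't1' or 't2'):
#   ADD FILE compare_reducer.py;
#   SELECT TRANSFORM (u.key, u.src, u.value)
#          USING 'python compare_reducer.py'
#          AS (key, kind, main_value, other_value)
#   FROM (SELECT key, src, value FROM tagged_rows
#         DISTRIBUTE BY key SORT BY key, src) u;

NULL = "\\N"  # the literal \N, which Hive's Transform uses for NULL by default


def compare_rows(lines: Iterable[str]) -> Iterator[str]:
    """Turn tab-separated (key, src, value) lines into discrepancy lines.

    Assumes all lines for one key arrive consecutively.
    """
    rows = (line.rstrip("\n").split("\t") for line in lines)
    for key, group in groupby(rows, key=lambda r: r[0]):
        values = {}
        for _, src, value in group:
            values[src] = value
        main, other = values.get("t1"), values.get("t2")
        if main is None:
            kind = "extra_in_other"
        elif other is None:
            kind = "missing_in_other"
        elif main != other:
            kind = "value_mismatch"
        else:
            continue  # rows agree, nothing to report
        yield "\t".join([key, kind,
                         NULL if main is None else main,
                         NULL if other is None else other])


def main() -> None:
    # Entry point when run as a streaming script: stdin in, stdout out.
    for out in compare_rows(sys.stdin):
        print(out)
```

To use it as an actual script, call `main()` under the usual `if __name__ == "__main__":` guard. Hive still handles the scan, shuffle, and sort; only the per-key comparison logic lives in the script.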
Long story short: if your processing is easily expressible as a HiveQL query, I don't see much reason to write map-reduce code to achieve the same thing. One of the main reasons Hive was created was to let people like us write SQL-like queries instead of map-reduce. If we end up writing map-reduce instead of plain Hive queries (for performance reasons or otherwise), one could argue that Hive hasn't done a good job at its primary objective. On the other hand, if you have some information about your data that Hive can't take advantage of, you might be better off writing a custom map-reduce implementation that makes use of that information. But then again, there is no need to write an entire map-reduce program when you can simply plug in the mappers and reducers using the Hive Transform functionality mentioned above.