Below is the data in TestingTable1
BUYER_ID | ITEM_ID | CREATED_TIME
-----------+-------------------+------------------------
1345653 110909316904 2012-07-09 21:29:06
1345653 151851771618 2012-07-09 19:57:33
1345653 221065796761 2012-07-09 19:31:48
1345653 400307563710 2012-07-09 18:57:33
1345653 310411560125 2012-07-09 16:09:49
1345653 120945302103 2012-07-09 13:40:23
1345653 261060982989 2012-07-09 09:02:21
Below is the data in TestingTable2
USER_ID | PRODUCT_ID | LAST_TIME
-----------+-------------------+-------------------
1345653 110909316904 2012-07-09 21:30:06
1345653 152851771618 2012-07-09 19:57:33
1345653 221065796761 2012-07-09 19:31:48
1345653 400307563710 2012-07-09 18:57:33
I need to Compare TestingTable2
with TestingTable1
on BUYER_ID
and USER_ID
. And I need to find all (basically the count) the missing and mismatch entries in TestingTable2
after comparing from TestingTable1
. I created SQL fiddle for this-
http://sqlfiddle.com/#!3/d87b2/1
If you run my query in the SQL Fiddle, you will get output as-
BUYER_ID ERROR
1345653 5
which is right as last three
rows from TestingTable1
is missing in TestingTable2
and rest two
are mismatch after comparison from TestingTable1
on BUYER_ID
and USER_ID
.
Now the complicated thing is starting.
Problem Statement-
In my current output, I am getting ERROR count as 5
. So if you see first row in both the tables ITEM_ID
and PRODUCT_ID
are same but CREATED_TIME
and LAST_TIME
is not same, and difference between those two times is of only 1 minute
. So currently I am reporting that as a mismatch, but what I need is that if the difference between them is within 15 minutes range
, then I don't want to report as an error. So after implementing this feature in my current query, I will be getting error count as 4 because difference is within 15 minute range for the first row.
So how can I implement this feature in my current query? That's my question.
P.S- I am working with Hive and Hive supports sql like syntax. So I think any modification will work in my current query.
your SQL Server SQL Fiddle query could be modified as follows and made to work. However, i am not sure if Hive supports datediff
SELECT TT.BUYER_ID , COUNT(*) FROM
(SELECT testingtable1.buyer_id, testingtable1.item_id, testingtable1.created_time FROM
testingtable2 RIGHT JOIN testingtable1
ON (testingtable1.item_id = testingtable2.product_id
AND testingtable1.BUYER_ID = testingtable2.USER_ID
AND abs(datediff(mi, testingtable1.created_time,testingtable2.last_time)) <= 15)
where testingtable2.product_id IS NULL) TT GROUP BY TT.BUYER_ID;
Below is the data in TestingTable1
BUYER_ID | ITEM_ID | CREATED_TIME
-----------+-------------------+------------------------
1345653 110909316904 2012-07-09 21:29:06
1345653 151851771618 2012-07-09 19:57:33
1345653 221065796761 2012-07-09 19:31:48
1345653 400307563710 2012-07-09 18:57:33
1345653 310411560125 2012-07-09 16:09:49
1345653 120945302103 2012-07-09 13:40:23
1345653 261060982989 2012-07-09 09:02:21
Below is the data in TestingTable2
USER_ID | PRODUCT_ID | LAST_TIME
-----------+-------------------+-------------------
1345653 110909316904 2012-07-09 21:30:06
1345653 152851771618 2012-07-09 19:57:33
1345653 221065796761 2012-07-09 19:31:48
1345653 400307563710 2012-07-09 18:57:33
I need to Compare TestingTable2
with TestingTable1
on BUYER_ID
and USER_ID
. And I need to find all (basically the count) the missing and mismatch entries in TestingTable2
after comparing from TestingTable1
. I created SQL fiddle for this-
http://sqlfiddle.com/#!3/d87b2/1
If you run my query in the SQL Fiddle, you will get output as-
BUYER_ID ERROR
1345653 5
which is right as last three
rows from TestingTable1
is missing in TestingTable2
and rest two
are mismatch after comparison from TestingTable1
on BUYER_ID
and USER_ID
.
Now the complicated thing is starting.
Problem Statement-
In my current output, I am getting ERROR count as 5
. So if you see first row in both the tables ITEM_ID
and PRODUCT_ID
are same but CREATED_TIME
and LAST_TIME
is not same, and difference between those two times is of only 1 minute
. So currently I am reporting that as a mismatch, but what I need is that if the difference between them is within 15 minutes range
, then I don't want to report as an error. So after implementing this feature in my current query, I will be getting error count as 4 because difference is within 15 minute range for the first row.
So how can I implement this feature in my current query? That's my question.
P.S- I am working with Hive and Hive supports sql like syntax. So I think any modification will work in my current query.
your SQL Server SQL Fiddle query could be modified as follows and made to work. However, i am not sure if Hive supports datediff
SELECT TT.BUYER_ID , COUNT(*) FROM
(SELECT testingtable1.buyer_id, testingtable1.item_id, testingtable1.created_time FROM
testingtable2 RIGHT JOIN testingtable1
ON (testingtable1.item_id = testingtable2.product_id
AND testingtable1.BUYER_ID = testingtable2.USER_ID
AND abs(datediff(mi, testingtable1.created_time,testingtable2.last_time)) <= 15)
where testingtable2.product_id IS NULL) TT GROUP BY TT.BUYER_ID;
0 commentaires:
Enregistrer un commentaire