lundi 19 mai 2014

SQL - compter l'incompatibilité et manquants - Stack Overflow


Below is the data in TestingTable1


BUYER_ID   |   ITEM_ID         |    CREATED_TIME
-----------+-------------------+------------------------
1345653 110909316904 2012-07-09 21:29:06
1345653 151851771618 2012-07-09 19:57:33
1345653 221065796761 2012-07-09 19:31:48
1345653 400307563710 2012-07-09 18:57:33
1345653 310411560125 2012-07-09 16:09:49
1345653 120945302103 2012-07-09 13:40:23
1345653 261060982989 2012-07-09 09:02:21

Below is the data in TestingTable2


USER_ID   |   PRODUCT_ID           |    LAST_TIME
-----------+-------------------+-------------------
1345653 110909316904 2012-07-09 21:30:06
1345653 152851771618 2012-07-09 19:57:33
1345653 221065796761 2012-07-09 19:31:48
1345653 400307563710 2012-07-09 18:57:33

I need to Compare TestingTable2 with TestingTable1 on BUYER_ID and USER_ID. And I need to find all (basically the count) the missing and mismatch entries in TestingTable2 after comparing from TestingTable1. I created SQL fiddle for this-


http://sqlfiddle.com/#!3/d87b2/1


If you run my query in the SQL Fiddle, you will get output as-


BUYER_ID    ERROR
1345653 5

which is right as last three rows from TestingTable1 is missing in TestingTable2 and rest two are mismatch after comparison from TestingTable1 on BUYER_ID and USER_ID.


Now the complicated thing is starting.


Problem Statement-


In my current output, I am getting ERROR count as 5. So if you see first row in both the tables ITEM_ID and PRODUCT_ID are same but CREATED_TIME and LAST_TIME is not same, and difference between those two times is of only 1 minute. So currently I am reporting that as a mismatch, but what I need is that if the difference between them is within 15 minutes range, then I don't want to report as an error. So after implementing this feature in my current query, I will be getting error count as 4 because difference is within 15 minute range for the first row.


So how can I implement this feature in my current query? That's my question.


P.S- I am working with Hive and Hive supports sql like syntax. So I think any modification will work in my current query.




your SQL Server SQL Fiddle query could be modified as follows and made to work. However, i am not sure if Hive supports datediff


SELECT TT.BUYER_ID , COUNT(*) FROM
(SELECT testingtable1.buyer_id, testingtable1.item_id, testingtable1.created_time FROM
testingtable2 RIGHT JOIN testingtable1
ON (testingtable1.item_id = testingtable2.product_id
AND testingtable1.BUYER_ID = testingtable2.USER_ID
AND abs(datediff(mi, testingtable1.created_time,testingtable2.last_time)) <= 15)
where testingtable2.product_id IS NULL) TT GROUP BY TT.BUYER_ID;


Below is the data in TestingTable1


BUYER_ID   |   ITEM_ID         |    CREATED_TIME
-----------+-------------------+------------------------
1345653 110909316904 2012-07-09 21:29:06
1345653 151851771618 2012-07-09 19:57:33
1345653 221065796761 2012-07-09 19:31:48
1345653 400307563710 2012-07-09 18:57:33
1345653 310411560125 2012-07-09 16:09:49
1345653 120945302103 2012-07-09 13:40:23
1345653 261060982989 2012-07-09 09:02:21

Below is the data in TestingTable2


USER_ID   |   PRODUCT_ID           |    LAST_TIME
-----------+-------------------+-------------------
1345653 110909316904 2012-07-09 21:30:06
1345653 152851771618 2012-07-09 19:57:33
1345653 221065796761 2012-07-09 19:31:48
1345653 400307563710 2012-07-09 18:57:33

I need to Compare TestingTable2 with TestingTable1 on BUYER_ID and USER_ID. And I need to find all (basically the count) the missing and mismatch entries in TestingTable2 after comparing from TestingTable1. I created SQL fiddle for this-


http://sqlfiddle.com/#!3/d87b2/1


If you run my query in the SQL Fiddle, you will get output as-


BUYER_ID    ERROR
1345653 5

which is right as last three rows from TestingTable1 is missing in TestingTable2 and rest two are mismatch after comparison from TestingTable1 on BUYER_ID and USER_ID.


Now the complicated thing is starting.


Problem Statement-


In my current output, I am getting ERROR count as 5. So if you see first row in both the tables ITEM_ID and PRODUCT_ID are same but CREATED_TIME and LAST_TIME is not same, and difference between those two times is of only 1 minute. So currently I am reporting that as a mismatch, but what I need is that if the difference between them is within 15 minutes range, then I don't want to report as an error. So after implementing this feature in my current query, I will be getting error count as 4 because difference is within 15 minute range for the first row.


So how can I implement this feature in my current query? That's my question.


P.S- I am working with Hive and Hive supports sql like syntax. So I think any modification will work in my current query.



your SQL Server SQL Fiddle query could be modified as follows and made to work. However, i am not sure if Hive supports datediff


SELECT TT.BUYER_ID , COUNT(*) FROM
(SELECT testingtable1.buyer_id, testingtable1.item_id, testingtable1.created_time FROM
testingtable2 RIGHT JOIN testingtable1
ON (testingtable1.item_id = testingtable2.product_id
AND testingtable1.BUYER_ID = testingtable2.USER_ID
AND abs(datediff(mi, testingtable1.created_time,testingtable2.last_time)) <= 15)
where testingtable2.product_id IS NULL) TT GROUP BY TT.BUYER_ID;

0 commentaires:

Enregistrer un commentaire