I have Table A and Table B, which contain the sample demographic data and columns shown below.

Table A:

F_Name  S_Name  DOB         SSN
David   Sam     1/1/1980    123-45-6789
David   Lieser  10/7/1940   987-65-4321
John    Doe     12/31/2001  500-00-0000

Table B:

F_Name  S_Name  DOB         SSN
Dave    Sammy   1/2/1980    223-45-6789
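There is no unique identifier that could be used to join the two tables.

Looking at the sample data above, I would like to return David Sam 1/1/1980 123-45-6789 (from Table A) and Dave Sammy 1/2/1980 223-45-6789 (from Table B) as possibly the same person. The reasoning is that the DOB and SSN are close enough, with only one or a few digits off, which could be the result of a human data-entry error, and the names look or sound similar. How can I achieve this?

F_Name  S_Name  DOB       SSN          F_Name_1  S_Name_1  DOB_1     SSN_1
David   Sam     1/1/1980  123-45-6789  Dave      Sammy     1/2/1980  223-45-6789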
Littlefoot.. 5
Jaro-Winkler similarity might help. Have a look at the following example:
SQL> with
  2    table_a (fname, sname, dob, ssn) as
  3      (select 'David', 'Sam'   , date '1980-01-01', '123-45-6789' from dual union all
  4       select 'David', 'Lieser', date '1940-10-07', '987-65-4321' from dual union all
  5       select 'John' , 'Doe'   , date '2001-12-31', '500-00-0000' from dual
  6      ),
  7    table_b (fname, sname, dob, ssn) as
  8      (select 'Dave', 'Sammy'  , date '1980-01-02', '223-45-6789' from dual
  9      )
 10  select a.fname, a.sname, a.dob, a.ssn,
 11         b.fname, b.sname, b.dob, b.ssn,
 12         utl_match.jaro_winkler_similarity(a.fname, b.fname) jws_fname,
 13         utl_match.jaro_winkler_similarity(a.sname, b.sname) jws_sname,
 14         utl_match.jaro_winkler_similarity(to_char(a.dob, 'yyyymmdd'), to_char(b.dob, 'yyyymmdd')) jws_dob,
 15         utl_match.jaro_winkler_similarity(a.ssn, b.ssn) jws_ssn
 16  from table_a a cross join table_b b
 17  where
 18        utl_match.jaro_winkler_similarity(a.fname, b.fname) >= 80
 19    and utl_match.jaro_winkler_similarity(a.sname, b.sname) >= 80
 20    and utl_match.jaro_winkler_similarity(to_char(a.dob, 'yyyymmdd'), to_char(b.dob, 'yyyymmdd')) >= 80
 21    and utl_match.jaro_winkler_similarity(a.ssn, b.ssn) >= 80;

FNAME SNAME  DOB      SSN         FNAM SNAME DOB      SSN          JWS_FNAME  JWS_SNAME    JWS_DOB    JWS_SSN
----- ------ -------- ----------- ---- ----- -------- ----------- ---------- ---------- ---------- ----------
David Sam    01.01.80 123-45-6789 Dave Sammy 02.01.80 223-45-6789         84         90         95         93

SQL>
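I set the threshold to 80, but you may decide on a different value. Remove the WHERE clause (lines 17-21) and inspect the returned results; that will give you a clearer picture of what is going on.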
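The suggestion to drop the WHERE clause can be sketched roughly as follows. This is only an illustration built on the same sample rows as the example above: the `_b` column aliases and the ORDER BY are additions of mine for readability, not part of the original query, and the scores for your own data will of course differ.

```sql
-- Sketch: same sample data as the example above, but without the WHERE
-- filter, so every Table A / Table B pair is listed with its four scores.
with
  table_a (fname, sname, dob, ssn) as
    (select 'David', 'Sam'   , date '1980-01-01', '123-45-6789' from dual union all
     select 'David', 'Lieser', date '1940-10-07', '987-65-4321' from dual union all
     select 'John' , 'Doe'   , date '2001-12-31', '500-00-0000' from dual
    ),
  table_b (fname, sname, dob, ssn) as
    (select 'Dave', 'Sammy'  , date '1980-01-02', '223-45-6789' from dual
    )
select a.fname, a.sname,
       b.fname as fname_b,   -- alias added here only to tell the two sides apart
       b.sname as sname_b,
       utl_match.jaro_winkler_similarity(a.fname, b.fname) as jws_fname,
       utl_match.jaro_winkler_similarity(a.sname, b.sname) as jws_sname,
       -- dates are compared as fixed-width strings, as in the original query
       utl_match.jaro_winkler_similarity(to_char(a.dob, 'yyyymmdd'),
                                         to_char(b.dob, 'yyyymmdd')) as jws_dob,
       utl_match.jaro_winkler_similarity(a.ssn, b.ssn) as jws_ssn
from table_a a cross join table_b b
-- ordering is an assumption: it simply puts the closest SSN/DOB pairs first
order by jws_ssn desc, jws_dob desc;
```

Comparing the scores of the pairs that should not match with those of the pair that should is one way to judge whether 80 is a sensible cut-off for each column, or whether some columns deserve a stricter threshold than others.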