OCR识别URL地址会引入各种各样的错误:
(1):分割符丢失,如dot符号
(2):字符错误,o->0 如 www.google.com --> www.google.c0m
这些错误纠完全依靠经验知识,用Regular expression,但效果还不是很好。
终于、终于发现FirFox中的 URL Fixer很好用。
Less Talk, More Do
开始吧!
源码在这里:http://www.ohloh.net/p/url-fixer
点击打开链接
浏览后发现了大堆的表达式:
// One letter off/missing one letter {nomatch : "(\\.(cc|mm|co|cm|mc|om|mo)$)", find : "\\.(co[^m]|c[^o]m|[^c]om|[com]{2,3})$", replace : ".com"}, {nomatch : "(\\.(ee|tt|ne|nt|et|tn)$)", find : "\\.(ne[^t]|n[^e]t|[^n]et|[net]{2,3})$", replace : ".net"}, {nomatch : "(\\.(gg|ro|gr)$)", find : "\\.(or[^g]|o[^r]g|[^o]rg|[org]{2,3})$", replace : ".org"}, {nomatch : "(\\.(ee|eu|de)$)", find : "\\.(ed[^u]|e[^d]u|[^e]du|[edu]{2,3})$", replace : ".edu"}, {nomatch : "(\\.(gg|vg)$)", find : "\\.(go[^v]|g[^o]v|[^g]ov|[gov]{2,3})$", replace : ".gov"}, {nomatch : "(\\.(mm|ml|im|il|li)$)", find : "\\.(mi[^l]|m[^i]l|[^m]il|[mil]{2,3})$", replace : ".mil"}, {find : "\\.(aer[^o]|ae[^r]o|a[^e]ro|[^a]ero|[aero]{3,4})$", replace : ".aero"}, {nomatch : "(\\.(bb|bi|bz)$)", find : "\\.(bi[^z]|b[^i]z|[^b]iz|[biz]{2,3})$", replace : ".biz"}, {find : "\\.(coo[^p]|co[^o]p|c[^o]op|[^c]oop|[cop]{3,4})$", replace : ".coop"}, {find : "\\.(inf[^o]|in[^f]o|i[^n]fo|[^i]nfo|[info]{3,4})$", replace : ".info"}, {find : "\\.(museu[^m]|muse[^u]m|mus[^e]um|mu[^s]eum|m[^u]seum|[^m]useum|[museum]{5,6})$", replace : ".museum"}, {find : "\\.(nam[^e]|na[^m]e|n[^a]me|[^n]ame|[name]{3,4})$", replace : ".name"}, {nomatch : "(\\.(pr|ro)$)", find : "\\.(pr[^o]|p[^r]o|[^p]ro|[pro]{2,3})$", replace : ".pro"}, {nomatch : "(\\.(cc|ca|ac|at|tc|tt)$)", find : "\\.(ca[^t]|c[^a]t|[^c]at|[cat]{2,3})$", replace : ".cat"}, {find : "\\.(job[^s]|jo[^b]s|j[^o]bs|[^j]obs|[jobs]{3,4})$", replace : ".jobs"}, {find : "\\.(mob[^i]|mo[^b]i|m[^o]bi|[^m]obi|[mobi]{3,4})$", replace : ".mobi"}, {find : "\\.(trave[^l]|trav[^e]l|tra[^v]el|tr[^a]vel|t[^r]avel|[^t]ravel|[travel]{5,6})$", replace : ".travel"},
翻译成object-c后、、、