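Last time we looked at how to use the Tesseract OCR engine. It is a veteran OCR engine that is now open source, the latest version (3.0) added Chinese OCR, and with Google updating and maintaining it, it has a lot of potential and is worth watching. The results of that test also showed that Tesseract's output is not yet very satisfactory: its recognition rate is limited, especially on text that mixes Chinese and English. This time we turn to OneNote in Office 2010 and call its API to test its OCR capability.
PS: My manager at work keeps recommending MyBase for recording problems encountered on the job, work logs, and so on, but I have stuck with OneNote :)
Test code download
Tested with Visual Studio 2010 Ultimate + OneNote 2010 x64
Please credit the source when reposting: http://www.cnblogs.com/brooks-dotnet/archive/2010/10/07/1845313.html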
1. New features in OneNote 2010:
New features in 2010:
Gather, organize, and search
Sharing and universal access
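2.7. Build the OneNote XML for the page after the image is inserted. First, get the page hierarchy and the ID of an existing page: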
var onenoteApp = new Microsoft.Office.Interop.OneNote.Application();
string notebookXml;
// Get the hierarchy of all notebooks down to page level
onenoteApp.GetHierarchy(null, Microsoft.Office.Interop.OneNote.HierarchyScope.hsPages, out notebookXml);
var doc = XDocument.Parse(notebookXml);
var ns = doc.Root.Name.Namespace;
// Use the first page found; pageNode is null if OneNote has no page at all
var pageNode = doc.Descendants(ns + "Page").FirstOrDefault();
var existingPageId = pageNode.Attribute("ID").Value;
2.8. One small detail here: the image format in OneNote XML only supports the following values: auto|png|emf|jpg, so the image format has to be normalized:
// Lower-case the extension and strip its leading dot, e.g. ".PNG" becomes "png"
string ImgExtension = file.Extension.ToLower().TrimStart('.');
switch (ImgExtension)
{
    case "jpg":
    case "png":
    case "emf":
        break; // supported by OneNote as-is
    default:
        ImgExtension = "auto"; // let OneNote detect anything else
        break;
}
2.9. The following is the key code. It uses Linq to XML to construct the OneNote XML of the page with the image inserted:
/* XML format of an image in OneNote 2010
<one:Image format="" originalPageNumber="0" lastModifiedTime="" objectID="">
    <one:Position x="" y="" z="" />
    <one:Size width="" height="" />
    <one:Data>Base64</one:Data>
    // The tags below are generated automatically by OneNote 2010. Do not handle them in the program; the goal is to read the content of OCRText.
    <one:OCRData lang="en-US">
    <one:OCRText><![CDATA[text produced by OCR]]></one:OCRText>
    <one:OCRToken startPos="0" region="0" line="0" x="4.251968383789062" y="3.685039281845092" width="31.18110275268555" height="7.370078563690185" />
    <one:OCRToken startPos="7" region="0" line="0" x="39.40157318115234" y="3.685039281845092" width="13.32283401489258" height="8.78740119934082" />
    <one:OCRToken startPos="12" region="0" line="1" x="4.251968383789062" y="17.85826683044434" width="23.52755928039551" height="6.803150177001953" />
    <one:OCRToken startPos="18" region="0" line="1" x="32.031494140625" y="17.85826683044434" width="41.10236358642578" height="6.803150177001953" />
    <one:OCRToken startPos="28" region="0" line="1" x="77.66928863525391" y="17.85826683044434" width="31.46456718444824" height="6.803150177001953" />
    ................
</one:Image>
*/
/* ObjectID format
The representation of an object to be used for identification of objects on a page. Not unique through OneNote, but unique on the page and the hierarchy.
<xsd:simpleType name="ObjectID">
    <xsd:restriction base="xsd:string">
        <xsd:pattern value="\{[a-fA-F0-9]{8}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{12}\}\{[0-9]+\}\{[A-Z][0-9]+\}" />
    </xsd:restriction>
</xsd:simpleType>
*/
var page = new XDocument(
    new XElement(ns + "Page",
        new XElement(ns + "Outline",
            new XElement(ns + "OEChildren",
                new XElement(ns + "OE",
                    new XElement(ns + "Image",
                        new XAttribute("format", ImgExtension),
                        new XAttribute("originalPageNumber", "0"),
                        new XElement(ns + "Position",
                            new XAttribute("x", "0"),
                            new XAttribute("y", "0"),
                            new XAttribute("z", "0")),
                        new XElement(ns + "Size",
                            new XAttribute("width", bp.Width.ToString()),
                            new XAttribute("height", bp.Height.ToString())),
                        new XElement(ns + "Data", _Base64)))))));
page.Root.SetAttributeValue("ID", existingPageId);
onenoteApp.UpdatePageContent(page.ToString(), DateTime.MinValue);
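The `_Base64` value placed in `one:Data` comes from an earlier step that is not shown in this part of the post. As a minimal sketch, assuming the image is a local file, such a string could be produced as follows; `ImageEncoder` and `ToBase64` are hypothetical names used only for illustration, not taken from the original sample:

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of the original sample.
public static class ImageEncoder
{
    // Reads the whole image file and returns its bytes as a Base64 string,
    // which is the form the one:Data element carries.
    public static string ToBase64(string imagePath)
    {
        byte[] bytes = File.ReadAllBytes(imagePath);
        return Convert.ToBase64String(bytes);
    }
}
```

The result of `ImageEncoder.ToBase64(path)` would then be passed where `_Base64` appears in the Linq to XML code.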
2.10. Put the thread to sleep for a few seconds and wait for OCR to finish. OneNote OCR takes some time, depending on the size of the image:
// Sleep time in milliseconds; for a large image, increase it so that OneNote OCR is sure to have finished
System.Threading.Thread.Sleep(Int32.Parse(System.Configuration.ConfigurationManager.AppSettings["WaitTIme"]));
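A fixed sleep is either too long for small images or too short for large ones. One possible alternative is to poll the page until OCR output shows up. The sketch below is only an illustration: `OcrWaiter` and `WaitForOcr` are hypothetical names, and it assumes that the page XML contains an `OCRText` element only once OneNote has finished OCR, which I have not verified.

```csharp
using System;
using System.Threading;

// Hypothetical helper, not part of the original sample.
public static class OcrWaiter
{
    // Calls fetchPageXml repeatedly until the returned XML mentions OCRText
    // or the timeout elapses. Returns the XML that contains OCRText,
    // or null if none appeared in time.
    public static string WaitForOcr(Func<string> fetchPageXml, int timeoutMs, int pollMs)
    {
        DateTime deadline = DateTime.UtcNow.AddMilliseconds(timeoutMs);
        while (true)
        {
            string xml = fetchPageXml();
            if (xml != null && xml.Contains("OCRText"))
                return xml;
            if (DateTime.UtcNow >= deadline)
                return null;
            Thread.Sleep(pollMs);
        }
    }
}
```

With OneNote, `fetchPageXml` could be a lambda that wraps `onenoteApp.GetPageContent(existingPageId, out pageXml, ...)` and returns `pageXml`. Taking a delegate keeps the waiting logic independent of the interop assembly, so it can be exercised without OneNote installed.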
2.11. To make the OCR result easier to extract, write the OneNote XML of the page to a temporary XML file:
string pageXml;
onenoteApp.GetPageContent(existingPageId, out pageXml, Microsoft.Office.Interop.OneNote.PageInfo.piAll); // get the content after OCR
FileStream tmpXml = new FileStream(System.Configuration.ConfigurationManager.AppSettings["tmpPath"] + @"\tmp.xml", FileMode.Create, FileAccess.ReadWrite);
StreamWriter sw = new StreamWriter(tmpXml);
sw.Write(pageXml);
sw.Flush();
sw.Close();
tmpXml.Close();
2.12. Use Linq to XML and an XPath expression to extract the OCR result:
FileStream tmpOnenote = new FileStream(System.Configuration.ConfigurationManager.AppSettings["tmpPath"] + @"\tmp.xml", FileMode.Open, FileAccess.ReadWrite);
XmlReader reader = XmlReader.Create(tmpOnenote);
XElement rdlc = XElement.Load(reader);
// Register the "one" prefix so that it can be used in XPath expressions
XmlNameTable nameTable = reader.NameTable;
XmlNamespaceManager mgr = new XmlNamespaceManager(nameTable);
mgr.AddNamespace("one", ns.ToString());
StringReader sr = new StringReader(pageXml);
XElement onenote = XElement.Load(sr);
var xml = from o in onenote.XPathSelectElements("//one:Image", mgr)
          select o.XPathSelectElement("//one:OCRText", mgr).Value;
// Show the first OCR result in the text box
this.txtOCRed.Text = xml.First().ToString();
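Note that an XPath expression starting with `//` is evaluated from the root of the tree, not from the current image, so the query above yields the first `OCRText` on the page for every image. As a hedged alternative, the sketch below reads the OCR text of each image separately using plain Linq to XML; `OcrTextReader` and `ReadOcrTexts` are hypothetical names, and the code assumes the page XML has the shape shown in the comment in step 2.9.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of the original sample.
public static class OcrTextReader
{
    // Returns the OCRText value of every Image element that has one,
    // in document order. Images without OCR output are skipped.
    public static List<string> ReadOcrTexts(string pageXml)
    {
        XElement root = XElement.Parse(pageXml);
        XNamespace one = root.Name.Namespace; // namespace taken from the page itself
        return root.Descendants(one + "Image")
                   .Select(img => img.Descendants(one + "OCRText").FirstOrDefault())
                   .Where(text => text != null)
                   .Select(text => text.Value)
                   .ToList();
    }
}
```

Since it works directly on the `pageXml` string, this variant needs neither the temporary file nor the `XmlNamespaceManager`.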
2.13. Release the resources in use:
sr.Close();
reader.Close();
tmpOnenote.Close();
2.14. Finally, write the OCR result to the output file:
FileStream fs = new FileStream(this.__OutputFileName, FileMode.Create, FileAccess.ReadWrite);
StreamWriter sw = new StreamWriter(fs);
sw.Write(this.txtOCRed.Text);
sw.Flush();
sw.Close();
fs.Close();
this.labMsg.Content = "OCR succeeded.";
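Because I installed the English edition of OneNote 2010 x64 and could not find a Chinese language pack, I tested English OCR first.
2.15. Test result for a local image: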
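2.16. Test result for a web image:
A web image is first downloaded to the local disk; the remaining steps are the same as for a local image.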
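Summary
The advantages of this approach are that it is efficient and easy to extend: a lot of additional work can be done just by changing the configuration file and the Linq to XML code.
The disadvantages are that the client must have OneNote installed with at least one page open, and that during OCR there is no way to tell which image is currently being processed, so consecutive operations produce muddled results.
In addition, I have not found a way to create a OneNote document programmatically, and I do not yet know the OneNote XML schema well enough to generate some elements in code, such as ObjectID.
In short, OneNote 2010's OCR is very good: compared with Tesseract, both accuracy and speed are more than a notch higher. However, the OneNote 2010 API is quite rudimentary and far less convenient to work with than those of Word and Excel, and the official documentation on the OneNote 2010 XML schema is not very detailed and lacks examples. Hopefully Office 15 and OneNote 2014 will improve on this. That concludes this introduction to OCR; anyone interested is welcome to continue the discussion.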