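Download the Tesseract-OCR installer from:
https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w32-setup-v4.0.0-beta.1.20180608.exe
Reference: https://github.com/tesseract-ocr/tesseract
Double-click the installer to run it, and in the language data section select "math" and "Chinese (Simplified)".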
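The language packs chosen in the installer are later passed to Tesseract as language codes joined with '+'. The sketch below builds such a string. joinLanguages is a hypothetical helper, not part of Tesseract, and the codes chi_sim and equ for the two packs are assumed from the Init call used in this post.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: join Tesseract language codes with '+',
// the separator used in the language argument of Init().
std::string joinLanguages(const std::vector<std::string>& codes) {
    std::string joined;
    for (std::size_t i = 0; i < codes.size(); ++i) {
        if (i > 0) {
            joined += '+';  // separator between consecutive codes
        }
        joined += codes[i];
    }
    return joined;
}
```

Calling joinLanguages({"chi_sim", "eng", "equ"}) yields the string "chi_sim+eng+equ", which can then be passed as the language argument.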
// Create Tesseract object
tesseract::TessBaseAPI *ocr = new tesseract::TessBaseAPI();
/*
Initialize the OCR engine with Simplified Chinese (chi_sim), English (eng)
and math (equ) language data, using the default engine mode.
There are four OCR Engine Modes (oem) available:
OEM_TESSERACT_ONLY Legacy engine only.
OEM_LSTM_ONLY Neural nets LSTM engine only.
OEM_TESSERACT_LSTM_COMBINED Legacy + LSTM engines.
OEM_DEFAULT Default, based on what is available.
*/
ocr->Init(NULL, "chi_sim+eng+equ", tesseract::OEM_DEFAULT);
// Set Page segmentation mode to PSM_AUTO (3)
// Other important psm modes will be discussed in a future post.
ocr->SetPageSegMode(tesseract::PSM_AUTO);
// Open input image using OpenCV
Mat im = cv::imread(imPath, IMREAD_COLOR);
// Set image data
ocr->SetImage(im.data, im.cols, im.rows, 3, im.step);
// Run Tesseract OCR on image
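char* utf8Text = ocr->GetUTF8Text();
outText = string(utf8Text);
// GetUTF8Text() returns a buffer that the caller must free with delete[]
delete[] utf8Text;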
// print recognized text
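cout << outText << endl;
// Destroy used object and release memory
ocr->End();
return EXIT_SUCCESS;
}
This project uses OpenCV 2.4, Tesseract 4.0, and Leptonica 1.76, so the corresponding include directories and library directories must be added to the project settings.
When the program is built and run, the Chinese text in the output is completely garbled. GetUTF8Text() returns UTF-8, while the Windows console typically expects text in the system ANSI code page.
A fix commonly suggested online is to convert the text in two steps: UTF-8 -> Unicode -> ANSI.
Add the following two functions to test.cpp:
// Convert UTF-8 to Unicode (UTF-16)
wchar_t* Utf_8ToUnicode(const char* szU8)
{
    // UTF-8 to Unicode
    // Chinese text pasted directly can become garbled and the compiler sometimes reports errors, so hexadecimal form is used
    // First call: get the required buffer size
    int wcsLen = ::MultiByteToWideChar(CP_UTF8, 0, szU8, (int)strlen(szU8), NULL, 0);
    // Allocate one extra element for '\0'; MultiByteToWideChar does not add it when an explicit length is passed
    wchar_t* wszString = new wchar_t[wcsLen + 1];
    // Convert
    ::MultiByteToWideChar(CP_UTF8, 0, szU8, (int)strlen(szU8), wszString, wcsLen);
    // Append the terminating '\0'
    wszString[wcsLen] = L'\0';
    return wszString;
}
// Convert a wide-character wchar_t* string to a char* string in the ANSI code page
char* UnicodeToAnsi(const wchar_t* szStr)
{
    // With a length of -1 the returned size includes the terminating '\0'
    int nLen = WideCharToMultiByte(CP_ACP, 0, szStr, -1, NULL, 0, NULL, NULL);
    if (nLen == 0)
    {
        return NULL;
    }
    char* pResult = new char[nLen];
    WideCharToMultiByte(CP_ACP, 0, szStr, -1, pResult, nLen, NULL, NULL);
    return pResult;
}
Then modify the main function:
char* test1 = ocr->GetUTF8Text();
wchar_t* tempchar = Utf_8ToUnicode(test1);
char* resulttemp = UnicodeToAnsi(tempchar);
// outText = string(ocr->GetUTF8Text());
// print recognized text
cout << resulttemp << endl;
// Release the buffers allocated above
delete[] test1;
delete[] tempchar;
delete[] resulttemp;
This resolves the garbled Chinese, and the recognized text is displayed correctly.
Reference: https://blog.csdn.net/liulina603/article/details/45668307
2.2 Project configuration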
2.4 Garbled Chinese text
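The first step of the UTF-8 -> Unicode -> ANSI conversion used in this post can be illustrated without the Windows API. The sketch below decodes UTF-8 bytes into UTF-16 code units, roughly what MultiByteToWideChar does with CP_UTF8. utf8ToUtf16 is a hypothetical helper written for illustration, and its validation is minimal (it does not reject overlong encodings). The second step, UTF-16 to ANSI, depends on the system code page table and is left to WideCharToMultiByte.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode a UTF-8 byte string into UTF-16 code units.
std::u16string utf8ToUtf16(const std::string& utf8) {
    std::u16string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::uint32_t cp = 0;   // decoded code point
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (i + extra >= utf8.size()) {
            throw std::runtime_error("truncated UTF-8 sequence");
        }
        // Each continuation byte has the form 10xxxxxx and adds 6 bits
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            if ((c & 0xC0) != 0x80) {
                throw std::runtime_error("invalid UTF-8 continuation byte");
            }
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        if (cp < 0x10000) {
            // Code point fits in a single UTF-16 code unit
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above U+FFFF become a surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

For example, the six UTF-8 bytes E4 B8 AD E6 96 87 (the word "中文") decode to the two code units U+4E2D and U+6587. Printing the raw bytes to a console that expects a different code page is what produces the garbled output.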