Tesseract-ocr实现图像文本识别

作者：饱和深潜者_463 | 来源：互联网 | 2023-09-16 14:34

Tesseract是一个开源的OCR（OpticalCharacterRecognition，光学字符识别）引擎，可以识别多种格式的图像文件并将其转换成文本，目前已支持60多种语言（包括中文）。

Tesseract是一个开源的OCR（Optical Character Recognition，光学字符识别）引擎，可以识别多种格式的图像文件并将其转换成文本，目前已支持60多种语言（包括中文）。 Tesseract最初由HP公司开发，后来由Google接盘填坑。

git网址https://github.com/tesseract-ocr/tesseract

其源码调用栈：

main[api/tesseractmain.cpp] -> 
TessBaseAPI::ProcessPages[api/baseapi.cpp] -> 
TessBaseAPI::ProcessPage[api/baseapi.cpp] -> 
TessBaseAPI::Recognize[api/baseapi.cpp] -> 
TessBaseAPI::FindLines[api/baseapi.cpp] -> 
TessBaseAPI::Threshold[api/baseapi.cpp] -> 
ImageThresholder::ThresholdToPix[ccmain/thresholder.cpp] -> 
ImageThresholder::OtsuThresholdRectToPix [ccmain/thresholder.cpp]

/**
 * Recognizes all the pages in the named file, as a multi-page tiff or
 * list of filenames, or single image, and gets the appropriate kind of text
 * according to parameters: tessedit_create_boxfile,
 * tessedit_make_boxes_from_boxes, tessedit_write_unlv, tessedit_create_hocr.
 * Calls ProcessPage on each page in the input file, which may be a
 * multi-page tiff, single-page other file format, or a plain text list of
 * images to read. If tessedit_page_number is non-negative, processing begins
 * at that page of a multi-page tiff file, or filelist.
 * The text is returned in text_out. Returns false on error.
 * If non-zero timeout_millisec terminates processing after the timeout on
 * a single page.
 * If non-NULL and non-empty, and some page fails for some reason,
 * the page is reprocessed with the retry_config config file. Useful
 * for interactively debugging a bad page.
 */
bool TessBaseAPI::ProcessPages(const char* filename,
                               const char* retry_config, int timeout_millisec,
                               STRING* text_out) {
  int page = tesseract_->tessedit_page_number;
  if (page <0)
    page = 0;
  FILE* fp = fopen(filename, "rb");
  if (fp == NULL) {
    tprintf(_("Image file %s cannot be opened!\n"), filename);
    return false;
  }
  // Find the number of pages if a tiff file, or zero otherwise.
  int npages = CountTiffPages(fp);
  fclose(fp);

  if (tesseract_->tessedit_create_hocr) {
    *text_out =
        "\n"
        "        "    \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n"
        "        "lang=\"en\">\n \n  \n"
        "  		"charset=utf-8\" />\n"
        "  \n"
        "          " ocr_line ocrx_word'/>\n"
        " \n \n";
  } else {
    *text_out = "";
  }

  bool success = true;
  Pix *pix;
  if (npages > 0) {
    for (; page          ++page) {
      if ((page >= 0) && (npages > 1))
        tprintf(_("Page %d of %d\n"), page + 1, npages);
      char page_str[kMaxIntSize];
      snprintf(page_str, kMaxIntSize - 1, "%d", page);
      SetVariable("applybox_page", page_str);
      success &= ProcessPage(pix, page, filename, retry_config,
                             timeout_millisec, text_out);
      pixDestroy(&pix);
      if (tesseract_->tessedit_page_number >= 0 || npages == 1) {
        break;
      }
    }
  } else {
    // The file is not a tiff file, so use the general pixRead function.
    pix = pixRead(filename);
    if (pix != NULL) {
      success &= ProcessPage(pix, 0, filename, retry_config,
                             timeout_millisec, text_out);
      pixDestroy(&pix);
    } else {
      // The file is not an image file, so try it as a list of filenames.
      FILE* fimg = fopen(filename, "rb");
      if (fimg == NULL) {
        tprintf(_("File %s cannot be opened!\n"), filename);
        return false;
      }
      tprintf(_("Reading %s as a list of filenames...\n"), filename);
      char pagename[MAX_PATH];
      // Skip to the requested page number.
      for (int i = 0; i            fgets(pagename, sizeof(pagename), fimg) != NULL;
           ++i);
      while (fgets(pagename, sizeof(pagename), fimg) != NULL) {
        chomp_string(pagename);
        pix = pixRead(pagename);
        if (pix == NULL) {
          tprintf(_("Image file %s cannot be read!\n"), pagename);
          fclose(fimg);
          return false;
        }
        tprintf(_("Page %d : %s\n"), page, pagename);
        success &= ProcessPage(pix, page, pagename, retry_config,
                               timeout_millisec, text_out);
        pixDestroy(&pix);
        ++page;
      }
      fclose(fimg);
    }
  }
  if (tesseract_->tessedit_create_hocr)
    *text_out += " \n\n";
  return success;
}

/**
 * Recognizes a single page for ProcessPages, appending the text to text_out.
 * The pix is the image processed - filename and page_index are metadata
 * used by side-effect processes, such as reading a box file or formatting
 * as hOCR.
 * If non-zero timeout_millisec terminates processing after the timeout.
 * If non-NULL and non-empty, and some page fails for some reason,
 * the page is reprocessed with the retry_config config file. Useful
 * for interactively debugging a bad page.
 * The text is returned in text_out. Returns false on error.
 */
bool TessBaseAPI::ProcessPage(Pix* pix, int page_index, const char* filename,
                              const char* retry_config, int timeout_millisec,
                              STRING* text_out) {
  SetInputName(filename);
  SetImage(pix);
  bool failed = false;
  if (timeout_millisec > 0) {
    // Running with a timeout.
    ETEXT_DESC monitor;
    monitor.cancel = NULL;
    monitor.cancel_this = NULL;
    monitor.set_deadline_msecs(timeout_millisec);
    // Now run the main recognition.
    failed = Recognize(&monitor) <0;
  } else if (tesseract_->tessedit_pageseg_mode == PSM_OSD_ONLY ||
             tesseract_->tessedit_pageseg_mode == PSM_AUTO_ONLY) {
    // Disabled character recognition.
    PageIterator* it = AnalyseLayout();
    if (it == NULL) {
      failed = true;
    } else {
      delete it;
      return true;
    }
  } else {
    // Normal layout and character recognition with no timeout.
    failed = Recognize(NULL) <0;
  }
  if (tesseract_->tessedit_write_images) {
    Pix* page_pix = GetThresholdedImage();
    pixWrite("tessinput.tif", page_pix, IFF_TIFF_G4);
  }
  if (failed && retry_config != NULL && retry_config[0] != '\0') {
    // Save current config variables before switching modes.
    FILE* fp = fopen(kOldVarsFile, "wb");
    PrintVariables(fp);
    fclose(fp);
    // Switch to alternate mode for retry.
    ReadConfigFile(retry_config);
    SetImage(pix);
    Recognize(NULL);
    // Restore saved config variables.
    ReadConfigFile(kOldVarsFile);
  }
  // Get text only if successful.
  if (!failed) {
    char* text;
    if (tesseract_->tessedit_create_boxfile ||
        tesseract_->tessedit_make_boxes_from_boxes) {
      text = GetBoxText(page_index);
    } else if (tesseract_->tessedit_write_unlv) {
      text = GetUNLVText();
    } else if (tesseract_->tessedit_create_hocr) {
      text = GetHOCRText(page_index);
    } else {
      text = GetUTF8Text();
    }
    *text_out += text;
    delete [] text;
    return true;
  }
  return false;
}

/**
 * Recognize the tesseract global image and return the result as Tesseract
 * internal structures.
 */
int TessBaseAPI::Recognize(ETEXT_DESC* monitor) {
  if (tesseract_ == NULL)
    return -1;
  if (FindLines() != 0)
    return -1;
  if (page_res_ != NULL)
    delete page_res_;
  if (block_list_->empty()) {
    page_res_ = new PAGE_RES(block_list_, &tesseract_->prev_word_best_choice_);
    return 0; // Empty page.
  }

  tesseract_->SetBlackAndWhitelist();
  recognition_done_ = true;
  if (tesseract_->tessedit_resegment_from_line_boxes)
    page_res_ = tesseract_->ApplyBoxes(*input_file_, true, block_list_);
  else if (tesseract_->tessedit_resegment_from_boxes)
    page_res_ = tesseract_->ApplyBoxes(*input_file_, false, block_list_);
  else
    page_res_ = new PAGE_RES(block_list_, &tesseract_->prev_word_best_choice_);
  if (tesseract_->tessedit_make_boxes_from_boxes) {
    tesseract_->CorrectClassifyWords(page_res_);
    return 0;
  }

  if (truth_cb_ != NULL) {
    tesseract_->wordrec_run_blamer.set_value(true);
    truth_cb_->Run(tesseract_->getDict().getUnicharset(),
                   image_height_, page_res_);
  }

  int result = 0;
  if (tesseract_->interactive_display_mode) {
    #ifndef GRAPHICS_DISABLED
    tesseract_->pgeditor_main(rect_width_, rect_height_, page_res_);
    #endif  // GRAPHICS_DISABLED
    // The page_res is invalid after an interactive session, so cleanup
    // in a way that lets us continue to the next page without crashing.
    delete page_res_;
    page_res_ = NULL;
    return -1;
  } else if (tesseract_->tessedit_train_from_boxes) {
    tesseract_->ApplyBoxTraining(*output_file_, page_res_);
  } else if (tesseract_->tessedit_ambigs_training) {
    FILE *training_output_file = tesseract_->init_recog_training(*input_file_);
    // OCR the page segmented into words by tesseract.
    tesseract_->recog_training_segmented(
        *input_file_, page_res_, monitor, training_output_file);
    fclose(training_output_file);
  } else {
    // Now run the main recognition.
    if (tesseract_->recog_all_words(page_res_, monitor, NULL, NULL, 0)) {
      DetectParagraphs(true);
    } else {
      result = -1;
    }
  }
  return result;
}

/** Find lines from the image making the BLOCK_LIST. */
int TessBaseAPI::FindLines() {
  if (thresholder_ == NULL || thresholder_->IsEmpty()) {
    tprintf("Please call SetImage before attempting recognition.");
    return -1;
  }
  if (recognition_done_)
    ClearResults();
  if (!block_list_->empty()) {
    return 0;
  }
  if (tesseract_ == NULL) {
    tesseract_ = new Tesseract;
    tesseract_->InitAdaptiveClassifier(false);
  }
  if (tesseract_->pix_binary() == NULL)
    Threshold(tesseract_->mutable_pix_binary());
  if (tesseract_->ImageWidth() > MAX_INT16 ||
      tesseract_->ImageHeight() > MAX_INT16) {
    tprintf("Image too large: (%d, %d)\n",
            tesseract_->ImageWidth(), tesseract_->ImageHeight());
    return -1;
  }

  tesseract_->PrepareForPageseg();

  if (tesseract_->textord_equation_detect) {
    if (equ_detect_ == NULL && datapath_ != NULL) {
      equ_detect_ = new EquationDetect(datapath_->string(), NULL);
    }
    tesseract_->SetEquationDetect(equ_detect_);
  }

  Tesseract* osd_tess = osd_tesseract_;
  OSResults osr;
  if (PSM_OSD_ENABLED(tesseract_->tessedit_pageseg_mode) && osd_tess == NULL) {
    if (strcmp(language_->string(), "osd") == 0) {
      osd_tess = tesseract_;
    } else {
      osd_tesseract_ = new Tesseract;
      if (osd_tesseract_->init_tesseract(
          datapath_->string(), NULL, "osd", OEM_TESSERACT_ONLY,
          NULL, 0, NULL, NULL, false) == 0) {
        osd_tess = osd_tesseract_;
        osd_tesseract_->set_source_resolution(
            thresholder_->GetSourceYResolution());
      } else {
        tprintf("Warning: Auto orientation and script detection requested,"
                " but osd language failed to load\n");
        delete osd_tesseract_;
        osd_tesseract_ = NULL;
      }
    }
  }

  if (tesseract_->SegmentPage(input_file_, block_list_, osd_tess, &osr) <0)
    return -1;
  // If Devanagari is being recognized, we use different images for page seg
  // and for OCR.
  tesseract_->PrepareForTessOCR(block_list_, osd_tess, &osr);
  return 0;
}

/**
 * Run the thresholder to make the thresholded image, returned in pix,
 * which must not be NULL. *pix must be initialized to NULL, or point
 * to an existing pixDestroyable Pix.
 * The usual argument to Threshold is Tesseract::mutable_pix_binary().
 */
void TessBaseAPI::Threshold(Pix** pix) {
  ASSERT_HOST(pix != NULL);
  if (!thresholder_->IsBinary()) {
    tesseract_->set_pix_grey(thresholder_->GetPixRectGrey());
  }
  if (*pix != NULL)
    pixDestroy(pix);
  // Zero resolution messes up the algorithms, so make sure it is credible.
  int y_res = thresholder_->GetScaledYResolution();
  if (y_res  kMaxCredibleResolution) {
    // Use the minimum default resolution, as it is safer to under-estimate
    // than over-estimate resolution.
    thresholder_->SetSourceYResolution(kMinCredibleResolution);
  }
  thresholder_->ThresholdToPix(pix);
  thresholder_->GetImageSizes(&rect_left_, &rect_top_,
                              &rect_width_, &rect_height_,
                              &image_width_, &image_height_);
  // Set the internal resolution that is used for layout parameters from the
  // estimated resolution, rather than the image resolution, which may be
  // fabricated, but we will use the image resolution, if there is one, to
  // report output point sizes.
  int estimated_res = ClipToRange(thresholder_->GetScaledEstimatedResolution(),
                                  kMinCredibleResolution,
                                  kMaxCredibleResolution);
  if (estimated_res != thresholder_->GetScaledEstimatedResolution()) {
    tprintf("Estimated resolution %d out of range! Corrected to %d\n",
            thresholder_->GetScaledEstimatedResolution(), estimated_res);
  }
  tesseract_->set_source_resolution(estimated_res);
}

// Threshold the source image as efficiently as possible to the output Pix.
// Creates a Pix and sets pix to point to the resulting pointer.
// Caller must use pixDestroy to free the created Pix.
void ImageThresholder::ThresholdToPix(Pix** pix) {
  if (pix_ != NULL) {
    if (image_bytespp_ == 0) {
      // We have a binary image, so it just has to be cloned.
      *pix = GetPixRect();
    } else {
      if (image_bytespp_ == 4) {
        // Color data can just be passed direct.
        const uinT32* data = pixGetData(pix_);
        OtsuThresholdRectToPix(reinterpret_cast(data),
                               image_bytespp_, image_bytespl_, pix);
      } else {
        // Convert 8-bit to IMAGE and then pass its
        // buffer to the raw interface to complete the conversion.
        IMAGE temp_image;
        temp_image.FromPix(pix_);
        OtsuThresholdRectToPix(temp_image.get_buffer(),
                               image_bytespp_,
                               COMPUTE_IMAGE_XDIM(temp_image.get_xsize(),
                                                  temp_image.get_bpp()),
                               pix);
      }
    }
    return;
  }
  if (image_bytespp_ > 0) {
    // Threshold grey or color.
    OtsuThresholdRectToPix(image_data_, image_bytespp_, image_bytespl_, pix);
  } else {
    RawRectToPix(pix);
  }
}

// Otsu threshold the rectangle, taking everything except the image buffer
// pointer from the class, to the output Pix.
void ImageThresholder::OtsuThresholdRectToPix(const unsigned char* imagedata,
                                              int bytes_per_pixel,
                                              int bytes_per_line,
                                              Pix** pix) const {
  int* thresholds;
  int* hi_values;
  OtsuThreshold(imagedata, bytes_per_pixel, bytes_per_line,
                rect_left_, rect_top_, rect_width_, rect_height_,
                &thresholds, &hi_values);

  // Threshold the image to the given IMAGE.
  ThresholdRectToPix(imagedata, bytes_per_pixel, bytes_per_line,
                     thresholds, hi_values, pix);
  delete [] thresholds;
  delete [] hi_values;
}

同时在输入图像处理时是以坐标参数为基准了，所以支持处理指定区域。

即:api.SetRectangle(0, 0, x, y);

最后说一下安装版的cmd调用方法：

tesseract.exe input result -l chi_sim

input:输入图片

result;输出文本

-l 字体库资源

chi_sim 中文大陆

字体库资源在tessdata文件夹下：

输入图片：

输出图片：

一般都是用jTessBoxEditor进行矫正错误并训练，ps：需要java环境，还是比较麻烦的。

最终生成的也是字体库资源，即在tessdata文件夹下的*.traineddata。如果普通业务需求的话建议从网上或者其他途径获取比较适合自己使用的库资源。

当然如果你的图像文字存在一些规律性或者对输出要把控的很严谨的话还是建议自己训练吧。

推荐阅读

get
Spring源码解密之默认标签的解析方式分析

本文分析了Spring源码解密中默认标签的解析方式。通过对命名空间的判断，区分默认命名空间和自定义命名空间，并采用不同的解析方式。其中，bean标签的解析最为复杂和重要。 ... [详细]

蜡笔小新 2023-12-14 17:24:50
include
C++省略号类型和参数个数不确定函数参数范例

本文介绍了C++中省略号类型和参数个数不确定函数参数的使用方法，并提供了一个范例。通过宏定义的方式，可以方便地处理不定参数的情况。文章中给出了具体的代码实现，并对代码进行了解释和说明。这对于需要处理不定参数的情况的程序员来说，是一个很有用的参考资料。 ... [详细]

蜡笔小新 2023-12-14 12:36:28
foreach
PHP实现断点续传乱序合并文件的方法和源码

本文介绍了使用PHP实现断点续传乱序合并文件的方法和源码。由于网络原因，文件需要分割成多个部分发送，因此无法按顺序接收。文章中提供了merge2.php的源码，通过使用shuffle函数打乱文件读取顺序，实现了乱序合并文件的功能。同时，还介绍了filesize、glob、unlink、fopen等相关函数的使用。阅读本文可以了解如何使用PHP实现断点续传乱序合并文件的具体步骤。 ... [详细]

蜡笔小新 2023-12-14 04:33:19
get
关于cuowu类的错误提示和使用AdjustmentListener的问题

本文讨论了一个关于cuowu类的问题，作者在使用cuowu类时遇到了错误提示和使用AdjustmentListener的问题。文章提供了16个解决方案，并给出了两个可能导致错误的原因。 ... [详细]

蜡笔小新 2023-12-13 22:09:56
text
PDO MySQL

PDOMySQL如果文章有成千上万篇，该怎样保存？数据保存有多种方式，比如单机文件、单机数据库（SQLite）、网络数据库（MySQL、MariaDB）等等。根据项目来选择，做We ... [详细]

蜡笔小新 2023-12-12 10:25:39
get
在类中定义数组时出错 - Error on defining arrays in class

Iamtryingtomakeaclassthatwillreadatextfileofnamesintoanarray,thenreturnthatarra ... [详细]

蜡笔小新 2023-12-14 17:38:12
text
如何限制php数据库链接数和连接超时时间？

本文介绍了如何使用php限制数据库插入的条数并显示每次插入数据库之间的数据数目，以及避免重复提交的方法。同时还介绍了如何限制某一个数据库用户的并发连接数，以及设置数据库的连接数和连接超时时间的方法。最后提供了一些关于浏览器在线用户数和数据库连接数量比例的参考值。 ... [详细]

蜡笔小新 2023-12-14 14:06:10
text
qt学习(六)数据库注册用户的实现方法

本文介绍了在qt学习中实现数据库注册用户的方法，包括登录按钮按下后出现注册页面、账号可用性判断、密码格式判断、邮箱格式判断等步骤。具体实现过程包括UI设计、数据库的创建和各个模块调用数据内容。 ... [详细]

蜡笔小新 2023-12-14 13:29:32
text
Spring 3.1：数据源未自动连接到@Configuration类的错误原因及解决方法

本文讨论了在Spring 3.1中，数据源未能自动连接到@Configuration类的错误原因，并提供了解决方法。作者发现了错误的原因，并在代码中手动定义了PersistenceAnnotationBeanPostProcessor。作者删除了该定义后，问题得到解决。此外，作者还指出了默认的PersistenceAnnotationBeanPostProcessor的注册方式，并提供了自定义该bean定义的方法。 ... [详细]

蜡笔小新 2023-12-14 03:54:26
int
Linux进程控制块PCBtask_struct结构体结构及作用详解

本文详细介绍了Linux中进程控制块PCBtask_struct结构体的结构和作用，包括进程状态、进程号、待处理信号、进程地址空间、调度标志、锁深度、基本时间片、调度策略以及内存管理信息等方面的内容。阅读本文可以更加深入地了解Linux进程管理的原理和机制。 ... [详细]

蜡笔小新 2023-12-13 21:31:18
int
PhysioNet生理信号处理（三）WFDB Toolbox for Matlab的安装和使用方法

本文介绍了PhysioNet网站提供的生理信号处理工具箱WFDB Toolbox for Matlab的安装和使用方法。通过下载并添加到Matlab路径中或直接在Matlab中输入相关内容，即可完成安装。该工具箱提供了一系列函数，可以方便地处理生理信号数据。详细的安装和使用方法可以参考本文内容。 ... [详细]

蜡笔小新 2023-12-13 20:46:48
char
clone的fork与pthread_create创建线程有何不同

本文讨论了clone的fork与pthread_create创建线程的不同之处。进程是一个指令执行流及其执行环境，其执行环境是一个系统资源的集合。在调用系统调用fork创建一个进程时，子进程只是完全复制父进程的资源，这样得到的子进程独立于父进程，具有良好的并发性。但是二者之间的通讯需要通过专门的通讯机制，另外通过fork创建子进程系统开销很大。因此，在某些情况下，使用clone或pthread_create创建线程可能更加高效。 ... [详细]

蜡笔小新 2023-12-12 20:00:06
text
浏览器中的异常检测算法及其在深度学习中的应用

本文介绍了在浏览器中进行异常检测的算法，包括统计学方法和机器学习方法，并探讨了异常检测在深度学习中的应用。异常检测在金融领域的信用卡欺诈、企业安全领域的非法入侵、IT运维中的设备维护时间点预测等方面具有广泛的应用。通过使用TensorFlow.js进行异常检测，可以实现对单变量和多变量异常的检测。统计学方法通过估计数据的分布概率来计算数据点的异常概率，而机器学习方法则通过训练数据来建立异常检测模型。 ... [详细]

蜡笔小新 2023-12-12 16:22:39
get
Express App如何提供不需要的静态文件？

本文介绍了如何使用Express App提供静态文件，同时提到了一些不需要使用的文件，如package.json和/.ssh/known_hosts，并解释了为什么app.get('*')无法捕获所有请求以及为什么app.use(express.static(__dirname))可能会提供不需要的文件。 ... [详细]

蜡笔小新 2023-12-12 14:38:07
get
Week04面向对象设计与继承学习总结及作业要求

本文总结了Week04面向对象设计与继承的重要知识点，包括对象、类、封装性、静态属性、静态方法、重载、继承和多态等。同时，还介绍了私有构造函数在类外部无法被调用、static不能访问非静态属性以及该类实例可以共享类里的static属性等内容。此外，还提到了作业要求，包括讲述一个在网上商城购物或在班级博客进行学习的故事，并使用Markdown的加粗标记和语句块标记标注关键名词和动词。最后，还提到了参考资料中关于UML类图如何绘制的范例。 ... [详细]

蜡笔小新 2023-12-11 16:50:17

饱和深潜者_463

这个家伙很懒，什么也没留下！

Tags | 热门标签

RankList | 热门文章