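Tsinghua University
Comprehensive Thesis Training

Title: Research and Application of Post-processing Methods for OCR Address Recognition
Department: Computer Science and Technology
Major: Computer Science and Technology
Name: Long Chong (龙翀)
Advisor: Professor Zhu Xiaoyan (朱小燕)
June 23, 2005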
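Abstract (Chinese version)
OCR (Optical Character Recognition), a convenient and effective character recognition technology, plays an increasingly important role in office automation, information retrieval, digital libraries, and related areas. During OCR, the complex and variable structure of text and images limits the recognition rate of individual characters. To improve the recognition rate, additional information must be used to post-process the OCR results. Language models are widely applied in OCR post-processing, especially for Chinese text recognition. This thesis analyzes in detail the language model and related algorithms used in the earlier work, discusses character-based and word-based language models, and examines the strengths and weaknesses of each. Based on this analysis, the word-based language model is adopted in place of the character-based one, and a multi-information-based word segmentation method is then proposed. For the graph search, the N-best search algorithm replaces the Viterbi algorithm. The test data fall into two classes: the first class contains no segmentation errors (one test set) and totals 15,000 handwritten Chinese addresses; the second class contains segmentation errors (three test sets) and totals 58,269 handwritten Chinese addresses. With these improvements, the whole-address recognition rate on the test set without segmentation errors rises from 83.73% to 96.84%, an 80.58% reduction in the error rate; on the test sets with segmentation errors, it rises from 28.56% to 74.15%, a 63.82% reduction in the error rate. The performance of the system is thus greatly improved.
Key words: OCR, post-processing, language model, word-based language model, word segmentation algorithm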
Abstract
OCR (Optical Character Recognition) is a convenient and efficient tool for office automation and information retrieval, and it is becoming increasingly important in today's office and library environments. During OCR processing, the recognition rate for isolated characters is limited by the complex structure of character images. To improve the recognition rate, information beyond the image itself is required to post-process the OCR results. Language models are widely used in OCR post-processing, especially for Chinese. This thesis analyzes the language model and related algorithms used in the former system, discusses both character-based and word-based language models, and presents the advantages and disadvantages of each. Based on this analysis, a word-based language model is adopted in place of the character-based one, and a multi-information-based word segmentation approach is then proposed. Finally, N-best search is used instead of the Viterbi algorithm as the search algorithm. Two kinds of experiments were conducted: one on a test set without segmentation errors, containing 15,000 handwritten Chinese addresses; the other on three test sets with segmentation errors, containing 58,269 handwritten Chinese addresses in all. With these improvements, the whole-address recognition rate on the data set without segmentation errors increases from 83.73% to 96.84%, an 80.58% reduction in errors. On the data sets with segmentation errors, it increases from 28.56% to 74.15%, a 63.82% reduction in errors. This greatly improves the performance of the OCR system.
Key words: OCR, post-processing, language model, word-based language model, word segmentation algorithm
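The error reduction figures quoted above follow from the recognition rates if "error reduction" is read as the relative drop in error rate, where the error rate is 100% minus the whole-address recognition rate. That reading is an assumption, though it reproduces both reported figures. A minimal sketch of the arithmetic, with a hypothetical helper name chosen here for illustration:

```python
def relative_error_reduction(rate_before: float, rate_after: float) -> float:
    """Relative drop in error rate, in percent.

    Both arguments are recognition rates in percent. The error rate is
    assumed to be 100 minus the recognition rate (an interpretation,
    not a definition taken from the thesis).
    """
    error_before = 100.0 - rate_before
    error_after = 100.0 - rate_after
    return 100.0 * (error_before - error_after) / error_before


# Recognition rates reported in the abstract.
no_seg_errors = relative_error_reduction(83.73, 96.84)
with_seg_errors = relative_error_reduction(28.56, 74.15)

print(f"{no_seg_errors:.2f}%")    # 80.58%
print(f"{with_seg_errors:.2f}%")  # 63.82%
```

For the set without segmentation errors, the error rate falls from 16.27% to 3.16%, and (16.27 - 3.16) / 16.27 is about 0.8058. For the sets with segmentation errors, it falls from 71.44% to 25.85%, and (71.44 - 25.85) / 71.44 is about 0.6382.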
