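    Tsinghua University
    Comprehensive Thesis Training
    Title: Research and Application of Post-Processing Methods for OCR Address Recognition
    Department: Computer Science and Technology
    Major: Computer Science and Technology
    Name: Long Chong (龙翀)
    Advisor: Prof. Zhu Xiaoyan (朱小燕)
    June 23, 2005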
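    Abstract (translated from the Chinese)

    OCR (Optical Character Recognition), a convenient and effective character recognition technology, plays an increasingly important role in office automation, information retrieval, and digital libraries. During OCR, the complex and variable structure of characters and images limits the recognition rate of individual characters to some extent. To improve the recognition rate, information beyond the image itself must be used to post-process the OCR results.

    Language models are widely used in OCR post-processing, especially for Chinese text recognition. This thesis analyzes in detail the language model and related algorithms used in earlier work, discusses character-based and word-based language models, and examines the strengths and weaknesses of each. Following this analysis, a word-based language model is adopted in place of the character-based one, and a multi-information-based word segmentation method is then proposed. For the graph search, the N-best search algorithm replaces the Viterbi algorithm.

    The test data fall into two categories. The first, without segmentation errors (one test set), contains 15,000 handwritten Chinese addresses. The second, with segmentation errors (three test sets), contains 58,269 handwritten Chinese addresses in total. After the improvements, the whole-address recognition rate on the test set without segmentation errors rises from 83.73% to 96.84%, an 80.58% reduction in error rate. On the test sets with segmentation errors it rises from 28.56% to 74.15%, a 63.82% reduction in error rate, which greatly improves system performance.

    Keywords: OCR, post-processing, language model, word-based language model, word segmentation algorithm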
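The abstract states that N-best search replaces Viterbi in the graph search but gives no details. The following is a minimal sketch of the general idea, not the thesis's implementation. It assumes a toy candidate lattice, a character bigram model rather than the word-based model the thesis adopts, and invented scores. The names `n_best_paths`, `toy_bigram_logp`, `TOY_BIGRAMS`, and `lattice` are all hypothetical.

```python
import heapq
import math


def n_best_paths(lattice, bigram_logp, n=3):
    """Return the n highest-scoring (score, sequence) pairs through the lattice.

    lattice: list of positions; each position is a list of (char, log_score)
             pairs from the recognizer, with distinct chars per position.
    bigram_logp: function (prev_char, char) -> log-probability.
    With n=1 this reduces to ordinary Viterbi decoding.
    """
    # hyps[char] holds up to n best (score, sequence) pairs ending in char.
    hyps = {"<s>": [(0.0, "")]}
    for position in lattice:
        new_hyps = {}
        for char, rec_score in position:
            extended = [
                (score + rec_score + bigram_logp(prev, char), seq + char)
                for prev, partial in hyps.items()
                for score, seq in partial
            ]
            # Viterbi would keep only the single best entry here.
            new_hyps[char] = heapq.nlargest(n, extended)
        hyps = new_hyps
    finals = [h for partial in hyps.values() for h in partial]
    return heapq.nlargest(n, finals)


# Toy language model (assumed values): unseen bigrams get a flat penalty.
TOY_BIGRAMS = {("<s>", "北"): -0.5, ("北", "京"): -0.2, ("京", "市"): -0.3}


def toy_bigram_logp(prev, char):
    return TOY_BIGRAMS.get((prev, char), -5.0)


# Toy recognizer output (assumed values): the top candidate is wrong
# at the first and third positions.
lattice = [
    [("比", math.log(0.6)), ("北", math.log(0.4))],
    [("京", math.log(0.7)), ("凉", math.log(0.3))],
    [("币", math.log(0.55)), ("市", math.log(0.45))],
]

if __name__ == "__main__":
    for score, seq in n_best_paths(lattice, toy_bigram_logp, n=3):
        print(f"{seq}\t{score:.3f}")
```

Keeping several hypotheses per state, instead of one, lets a later stage such as a dictionary lookup or a word-level rescoring pass choose among complete candidates rather than being bound to a single path.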
    Abstract
    OCR (Optical Character Recognition) is a convenient and efficient tool for office automation and information retrieval, and is becoming increasingly important in today's office and library environments. During OCR processing, the recognition rate of isolated characters is limited by the complex structure of character images. To improve the recognition rate, information other than the image is required for post-processing the OCR results.

    Language models are widely used in OCR post-processing, especially for Chinese. In this thesis, the language model and related algorithms used in the former system are analyzed. Character-based and word-based language models are both discussed, and their advantages and disadvantages are presented. Based on this analysis, a word-based language model is adopted in place of the character-based language model, and a multi-information-based segmentation approach is then proposed. Finally, N-best search is used instead of Viterbi as the search algorithm.

    Two kinds of experiments are conducted. The first uses one test set without segmentation errors, containing 15,000 handwritten Chinese addresses. The second uses three test sets containing segmentation errors, with 58,269 handwritten Chinese addresses in all. After the improvements, on the data set without segmentation errors the whole-address recognition rate increases from 83.73% to 96.84%, an 80.58% reduction in errors. On the data sets with segmentation errors the whole-address recognition rate increases from 28.56% to 74.15%, a 63.82% reduction in errors. This greatly improves the performance of the OCR system.

    Key words: OCR, post-processing, language model, word-based language model, word segmentation algorithm
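The error reduction figures in the abstract are relative reductions in error rate, not differences in recognition rate. A short sketch of that arithmetic, using only the recognition rates quoted above (the function name `relative_error_reduction` is a hypothetical helper, not from the thesis):

```python
def relative_error_reduction(rate_before, rate_after):
    """Relative reduction in error rate, in percent.

    Both arguments are recognition rates in percent; the error rate is
    taken as 100 minus the recognition rate.
    """
    err_before = 100.0 - rate_before
    err_after = 100.0 - rate_after
    return 100.0 * (err_before - err_after) / err_before


if __name__ == "__main__":
    # Test set without segmentation errors: 83.73% -> 96.84%
    print(round(relative_error_reduction(83.73, 96.84), 2))
    # Test sets with segmentation errors: 28.56% -> 74.15%
    print(round(relative_error_reduction(28.56, 74.15), 2))
```

For the first case the error rate falls from 16.27% to 3.16%, and (16.27 - 3.16) / 16.27 gives the reported 80.58%. For the second it falls from 71.44% to 25.85%, giving the reported 63.82%.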
