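Mastering Natural Language Processing with Python (精通Python自然语言处理)

978-7-115-45968-8

Authors: Deepti Chopra, Nisheeth Joshi, Iti Mathur (India)

Translator: Wang Wei (王威)

Editor: Chen Jikang (陈冀康)

Categories: Natural Language Processing, Python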


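This book teaches readers how to use Python's powerful features for natural language processing and how to design and build NLP programs. Readers will also learn to develop applications such as string matching, statistical modeling, search engines, and discourse analysis systems, and will steadily grow into skilled NLP practitioners.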

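Book excerpt

Copyright information

Title: Mastering Natural Language Processing with Python (精通Python自然语言处理)

ISBN: 978-7-115-45968-8

This e-book from Posts & Telecom Press (人民邮电出版社) is licensed for your personal use only. Its content may not be copied or distributed in any form without authorization.

We trust that readers will join us in protecting intellectual property.

If a purchaser infringes these rights, we may take protective measures against that user, including but not limited to closing the account, and may pursue legal liability.

• Authors: Deepti Chopra, Nisheeth Joshi, Iti Mathur (India)

  Translator: Wang Wei (王威)

  Executive editor: Chen Jikang (陈冀康)

• Published and distributed by Posts & Telecom Press, 11 Chengshousi Road, Fengtai District, Beijing

  Postal code: 100164   Email: 315@ptpress.com.cn

  Website: http://www.ptpress.com.cn

• Reader service hotline: (010)81055410

  Anti-piracy hotline: (010)81055315

Copyright notice

This book is published by Posts & Telecom Press under license from Packt Publishing (UK). No part of this book may be reproduced or transmitted in any form or by any means without the publisher's written permission.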

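Summary

Natural language processing (NLP) is the area of computational linguistics and artificial intelligence concerned with human-computer interaction.

This book is a comprehensive guide to NLP. It shows how to implement a range of NLP tasks in Python so that readers can build projects based on real-life applications. Its 10 chapters cover string operations, statistical language modeling, morphology, part-of-speech tagging, parsing, semantic analysis, sentiment analysis, information retrieval, discourse analysis, and the evaluation of NLP systems.

The book is intended for readers who know Python and have some background and interest in NLP development.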

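About the authors

Deepti Chopra is an assistant professor at Banasthali University. Her main research areas are computational linguistics, natural language processing, and artificial intelligence. She has also worked on machine translation engines from English into Indian languages. She has published articles in various journals and conferences and serves on the program committees of several journals and conferences.

Nisheeth Joshi is an associate professor at Banasthali University. His areas of interest include computational linguistics, natural language processing, and artificial intelligence. He is also actively involved in developing machine translation engines from English into Indian languages. He is one of the experts selected by the TDIL programme of the Department of Electronics and Information Technology, Government of India; TDIL is the main organization responsible for funding and research on Indian language technology. He has published articles in various journals and conferences and serves on the program committees and editorial boards of several journals and conferences.

Iti Mathur is an assistant professor at Banasthali University. Her areas of interest are computational semantics and ontology engineering. She is also actively involved in developing machine translation engines from English into Indian languages. She is one of the experts selected by the TDIL programme of the Department of Electronics and Information Technology, Government of India; TDIL is the main organization responsible for funding and research on Indian language technology. She has published articles in journals and conferences and serves on the program committees and editorial boards of several journals and conferences.

We sincerely thank all our family and friends, whose blessings helped us reach our goal of publishing this book on natural language processing.

About the reviewer

Arturo Argueta is currently a PhD student focusing on high-performance computing and natural language processing. He has done research on clustering algorithms, machine learning algorithms for NLP, and machine translation. He is fluent in English, German, and Spanish.

About the translator

Wang Wei (王威) is a senior R&D engineer who previously worked at internet companies including Ctrip and East Money. Current interests are distributed internet architecture design, big data and machine learning, and algorithm design, with expertise in C#, Python, Java, and C++. Also a writer of jokes, an armchair entrepreneur, an amateur guitarist, and a heavy reader.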

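Preface

In this book, we learn how to implement various NLP tasks in Python and look at some current and emerging research topics in NLP. The book is a comprehensive intermediate-to-advanced guide intended to help students and researchers create their own projects based on real-life applications.

What this book covers

Chapter 1, String Operations, explains how to perform text preprocessing tasks such as tokenization and normalization, and introduces various string matching methods.

Chapter 2, Statistical Language Modeling, covers how to calculate word frequencies and how to apply various language modeling techniques.

Chapter 3, Morphology: Learning by Practice, discusses how to develop a stemmer, a morphological analyzer, and a morphological generator.

Chapter 4, Part-of-Speech Tagging: Identifying Words, explains part-of-speech tagging and statistical modeling with the n-gram approach.

Chapter 5, Parsing: Analyzing Training Data, provides information on treebank construction, CFG construction, the CYK algorithm, the chart parsing algorithm, and transliteration.

Chapter 6, Semantic Analysis: Meaning Matters, introduces the concept and application of shallow semantic analysis (that is, NER) and word sense disambiguation (WSD) using WordNet.

Chapter 7, Sentiment Analysis: I Am Happy, provides information to help you understand and apply the concepts of sentiment analysis.

Chapter 8, Information Retrieval: Accessing Information, helps you understand and apply the concepts of information retrieval and text summarization.

Chapter 9, Discourse Analysis: Understanding Is Believing, discusses discourse analysis systems and systems based on anaphora resolution.

Chapter 10, Evaluation of NLP Systems: Analyzing Performance, discusses how to understand and apply the concepts related to evaluating NLP systems.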

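What you need for this book

All code examples in this book were written for Python 2.7 or Python 3.2 and later. NLTK (Natural Language Toolkit) 3.0 must be installed, on either a 32-bit or a 64-bit machine. The operating system must be Windows, Mac, or UNIX.

Who this book is for

This book is aimed at intermediate NLP developers who have some working knowledge of Python.

Conventions

This book uses different text styles to distinguish different kinds of information. Examples of these styles and their meanings follow.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows:

"For tokenizing French text, we will use the french.pickle file."

Code blocks are set as follows:

>>> import nltk
>>> text=" Welcome readers. I hope you find it interesting. Please do reply."
>>> from nltk.tokenize import sent_tokenize
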

(Warning icon) Marks warnings or content that needs special attention.

(Tip icon) Marks tips and tricks.

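Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Your feedback is important to us and helps us develop titles that readers really get the most out of.

To send general feedback, simply email feedback@packtpub.com and mention the book's title in the subject of your message.

If you have expertise in a topic and are interested in writing or contributing to a book, see our author guide at www.packtpub.com/authors.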

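Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps.

1．Log in or register on our website using your email address and password.

2．Hover the mouse pointer over the SUPPORT tab at the top.

3．Click Code Downloads & Errata.

4．Enter the name of the book in the search box.

5．Select the book for which you want to download the code files.

6．Choose from the drop-down menu where you purchased this book.

7．Click Code Download.

You can also download the code files by clicking the "Code Files" button on the book's web page on the Packt Publishing website. You can reach this page by entering the book's name in the search box. Note that you need to be logged in to your Packt account.

Once the files are downloaded, make sure that you extract the folder using the latest version of:

WinRAR / 7-Zip for Windows.
Zipeg / iZip / UnRarX for Mac.
7-Zip / PeaZip for Linux.

The code bundle for this book is also hosted on GitHub at https://github.com/PacktPublishing/Mastering-Natural-Language-Processing-with-Python. We also have other code bundles from our rich catalog of books and videos at https://github.com/PacktPublishing/. Check them out!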

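Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books, whether in the text or in the code, we would be grateful if you reported it to us. Doing so saves other readers from confusion and helps us improve subsequent versions. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking the "Errata Submission Form" link, and entering the details. Once your errata are verified, your submission will be accepted and uploaded to our website or added to the existing list of errata.

To view previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search box. The required information appears under the "Errata" section.

Piracy

Piracy of copyrighted material on the internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works on the internet, please provide us with the address and name of the infringing website immediately so that we can pursue a remedy.

Please contact us at copyright@packtpub.com with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have any questions about this book, you can contact us at questions@packtpub.com, and we will do our best to address them.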

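Chapter 1　String Operations

Natural Language Processing (NLP) is concerned with the interaction between natural language and computers. It is one of the major branches of artificial intelligence (AI) and computational linguistics. It enables seamless interaction between computers and humans and, with the help of machine learning, lets computers understand human language. In programming languages such as C, C++, Java, and Python, the basic data type used to represent the contents of a file or document is the string. In this chapter, we explore the operations that can be performed on strings and that are useful for various NLP tasks.

This chapter covers the following topics:

Tokenization of text.
Normalization of text.
Substituting and correcting tokens.
Applying Zipf's law to text.
Measuring similarity with the edit distance algorithm.
Measuring similarity with the Jaccard coefficient.
Measuring similarity with the Smith Waterman algorithm.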

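1.1　Tokenization

Tokenization is the process of splitting text into smaller units called tokens. It is considered an important step in NLP.

Once NLTK is installed and Python's interactive environment (IDLE) is running, we can split text or paragraphs into individual sentences. To do so, we import the sentence tokenization function, whose argument is the text to be split. The sent_tokenize function uses an instance of NLTK's PunktSentenceTokenizer class. This instance has already been trained to tokenize several European languages, based on the letters and punctuation that mark the beginning and end of sentences.

1.1.1　Tokenizing text into sentences

Now, let's see how a given text is split into individual sentences:

>>> import nltk
>>> text=" Welcome readers. I hope you find it interesting. Please do reply."
>>> from nltk.tokenize import sent_tokenize
>>> sent_tokenize(text)
[' Welcome readers.', 'I hope you find it interesting.', 'Please do
reply.']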

The given text is now split into individual sentences, which we can process further.

To tokenize a large batch of sentences, we can load PunktSentenceTokenizer and call its tokenize() function, as the following code shows:

>>> import nltk
>>> tokenizer=nltk.data.load('tokenizers/punkt/english.pickle')
>>> text=" Hello everyone. Hope all are fine and doing well. Hope you find the book interesting"
>>> tokenizer.tokenize(text)
[' Hello everyone.', 'Hope all are fine and doing well.', 'Hope you
find the book interesting']

1.1.2　Tokenizing text in other languages

To tokenize languages other than English, we can load the corresponding pickle file (found in tokenizers/punkt) and pass the text in that language to the tokenize() function. For French text, we use the french.pickle file as follows:

>>> import nltk
>>> french_tokenizer=nltk.data.load('tokenizers/punkt/french.pickle')
>>> # Double quotes are used because the text itself contains apostrophes.
>>> french_text=("Deux agressions en quelques jours, voilà ce qui a motivé "
...     "hier matin le débrayage collège franco-britannique de "
...     "Levallois-Perret. Deux agressions en quelques jours, voilà ce qui "
...     "a motivé hier matin le débrayage Levallois. L'équipe pédagogique "
...     "de ce collège de 750 élèves avait déjà été choquée par "
...     "l'agression, janvier , d'un professeur d'histoire. L'équipe "
...     "pédagogique de ce collège de 750 élèves avait déjà été choquée "
...     "par l'agression, mercredi , d'un professeur d'histoire")
>>> french_tokenizer.tokenize(french_text)
['Deux agressions en quelques jours, voilà ce qui a motivé hier
matin le débrayage collège franco-britannique de Levallois-Perret.',
'Deux agressions en quelques jours, voilà ce qui a motivé hier matin
le débrayage Levallois.', "L'équipe pédagogique de ce collège de
750 élèves avait déjà été choquée par l'agression, janvier , d'un
professeur d'histoire.", "L'équipe pédagogique de ce collège de
750 élèves avait déjà été choquée par l'agression, mercredi , d'un
professeur d'histoire"]

1.1.3　Tokenizing sentences into words

We now process individual sentences by splitting them into words, using the word_tokenize() function. word_tokenize uses an instance of NLTK's TreebankWordTokenizer class to perform word tokenization.

The following code tokenizes English text with word_tokenize:

>>> import nltk
>>> text=nltk.word_tokenize("PierreVinken , 59 years old , will join as a nonexecutive director on Nov. 29 .")
>>> print(text)
['PierreVinken', ',', '59', 'years', 'old', ',', 'will', 'join',
'as', 'a', 'nonexecutive', 'director', 'on', 'Nov.', '29', '.']

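Word tokenization can also be done by loading TreebankWordTokenizer and calling its tokenize() function, whose argument is the sentence to be split into words. This tokenizer splits a sentence into words using rules based on whitespace and punctuation.

The following code reads input from the user, tokenizes it, and counts the resulting tokens: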

>>> import nltk
>>> from nltk import word_tokenize
>>> r=input("Please write a text")
Please write a textToday is a pleasant day
>>> print("The length of text is",len(word_tokenize(r)),"words")
The length of text is 5 words

1.1.4　Tokenization using TreebankWordTokenizer

Let's look at code that tokenizes with TreebankWordTokenizer:

>>> import nltk
>>> from nltk.tokenize import TreebankWordTokenizer
>>> tokenizer = TreebankWordTokenizer()
>>> tokenizer.tokenize("Have a nice day. I hope you find the book interesting")
['Have', 'a', 'nice', 'day.', 'I', 'hope', 'you', 'find', 'the',
'book', 'interesting']

TreebankWordTokenizer follows the conventions of the Penn Treebank corpus and tokenizes by separating contractions, as shown here:

>>> import nltk
>>> text=nltk.word_tokenize(" Don't hesitate to ask questions")
>>> print(text)
['Do', "n't", 'hesitate', 'to', 'ask', 'questions']

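Another word tokenizer is PunktWordTokenizer, which splits on punctuation but keeps it attached to the word rather than creating a separate token. A further tokenizer, WordPunctTokenizer, turns punctuation into separate tokens, which is usually the form of tokenization we want: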

>>> from nltk.tokenize import WordPunctTokenizer
>>> tokenizer=WordPunctTokenizer()
>>> tokenizer.tokenize(" Don't hesitate to ask questions")
['Don', "'", 't', 'hesitate', 'to', 'ask', 'questions']

Figure 1-1 shows the inheritance tree of the tokenizers.

Figure 1-1

1.1.5　Tokenization using regular expressions

Words can be tokenized with regular expressions in two ways:

By matching words.
By matching spaces or gaps.

We can import RegexpTokenizer from NLTK and build a regular expression that matches the tokens in the text:

>>> import nltk
>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer=RegexpTokenizer(r"[\w']+")
>>> tokenizer.tokenize("Don't hesitate to ask questions")
["Don't", 'hesitate', 'to', 'ask', 'questions']

An alternative that avoids instantiating the class uses the following function:

>>> import nltk
>>> from nltk.tokenize import regexp_tokenize
>>> sent="Don't hesitate to ask questions"
>>> print(regexp_tokenize(sent, pattern='\w+|\$[\d\.]+|\S+'))
['Don', "'t", 'hesitate', 'to', 'ask', 'questions']

RegexpTokenizer tokenizes by matching tokens when it uses re.findall(), and by matching gaps or spaces when it uses re.split().
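
The two modes can be illustrated with the standard re module alone. This is a minimal sketch of the idea, not NLTK's internal code; the variable names are chosen here for illustration.

```python
import re

sentence = "Don't hesitate to ask questions"

# Matching tokens: every run of word characters or apostrophes is a token.
tokens_by_match = re.findall(r"[\w']+", sentence)

# Matching gaps: the pattern describes the separators, and the text
# between two separators becomes a token.
tokens_by_gap = re.split(r"\s+", sentence)

print(tokens_by_match)
print(tokens_by_gap)
```

For this sentence both approaches yield the same tokens. They differ once the text contains punctuation that the token pattern excludes but the gap pattern does not treat as a separator.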

Let's look at an example that tokenizes on whitespace:

>>> import nltk
>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer=RegexpTokenizer('\s+',gaps=True)
>>> tokenizer.tokenize("Don't hesitate to ask questions")
["Don't", 'hesitate', 'to', 'ask', 'questions']

To select the words that start with a capital letter, we can use the following code:

>>> import nltk
>>> from nltk.tokenize import RegexpTokenizer
>>> sent=" She secured 90.56 % in class X . She is a meritorious student"
>>> capt = RegexpTokenizer(r'[A-Z]\w+')
>>> capt.tokenize(sent)
['She', 'She']

The following code shows how a subclass of RegexpTokenizer uses a predefined regular expression:

>>> import nltk
>>> sent=" She secured 90.56 % in class X \n. She is a meritorious student\n"
>>> from nltk.tokenize import BlanklineTokenizer
>>> BlanklineTokenizer().tokenize(sent)
[' She secured 90.56 % in class X \n. She is a meritorious student\n']

Strings can be tokenized on spaces, gaps, newlines, and so on:

>>> import nltk
>>> sent=" She secured 90.56 % in class X . She is a meritorious student"
>>> from nltk.tokenize import WhitespaceTokenizer
>>> WhitespaceTokenizer().tokenize(sent)
['She', 'secured', '90.56', '%', 'in', 'class', 'X', '.', 'She', 'is',
'a', 'meritorious', 'student']

WordPunctTokenizer uses the regular expression \w+|[^\w\s]+ to split text into alphabetic and non-alphabetic characters.
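
To see what that pattern does, it can be applied directly with re.findall. The helper name wordpunct_sketch below is hypothetical and only mimics the pattern, not the NLTK class itself.

```python
import re

# Either a run of word characters, or a run of characters that are
# neither word characters nor whitespace (that is, punctuation).
WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]+")

def wordpunct_sketch(text):
    """Split text into alphabetic/numeric runs and punctuation runs."""
    return WORD_OR_PUNCT.findall(text)

print(wordpunct_sketch("Don't hesitate to ask questions"))
print(wordpunct_sketch("She secured 90.56 %"))
```

Note that the decimal number is broken at the period, which is one reason this style of tokenization is not always appropriate for numeric text.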

Tokenization with the split() method looks like this:

>>> import nltk
>>> sent=" She secured 90.56 % in class X . She is a meritorious student"
>>> sent.split()
['She', 'secured', '90.56', '%', 'in', 'class', 'X', '.', 'She', 'is',
'a', 'meritorious', 'student']
>>> sent.split(' ')
['', 'She', 'secured', '90.56', '%', 'in', 'class', 'X', '.', 'She',
'is', 'a', 'meritorious', 'student']
>>> sent=" She secured 90.56 % in class X \n. She is a meritorious student\n"
>>> sent.split('\n')
[' She secured 90.56 % in class X ', '. She is a meritorious student',
'']

Like sent.split('\n'), LineTokenizer tokenizes by splitting text into lines:

>>> import nltk
>>> from nltk.tokenize import BlanklineTokenizer
>>> sent=" She secured 90.56 % in class X \n. She is a meritorious student\n"
>>> BlanklineTokenizer().tokenize(sent)
[' She secured 90.56 % in class X \n. She is a meritorious student\n']
>>> from nltk.tokenize import LineTokenizer
>>> LineTokenizer(blanklines='keep').tokenize(sent)
[' She secured 90.56 % in class X ', '. She is a meritorious student']
>>> LineTokenizer(blanklines='discard').tokenize(sent)
[' She secured 90.56 % in class X ', '. She is a meritorious student']

SpaceTokenizer works similarly to sent.split(' '):

>>> import nltk
>>> sent=" She secured 90.56 % in class X \n. She is a meritorious student\n"
>>> from nltk.tokenize import SpaceTokenizer
>>> SpaceTokenizer().tokenize(sent)
['', 'She', 'secured', '90.56', '%', 'in', 'class', 'X', '\n.', 'She',
'is', 'a', 'meritorious', 'student\n']

Tokenizers can also return a sequence of tuples giving the start and end offsets of each token in the sentence; the nltk.tokenize.util module provides helpers for working with such spans:

>>> import nltk
>>> from nltk.tokenize import WhitespaceTokenizer
>>> sent=" She secured 90.56 % in class X \n. She is a meritorious student\n"
>>> list(WhitespaceTokenizer().span_tokenize(sent))
[(1, 4), (5, 12), (13, 18), (19, 20), (21, 23), (24, 29), (30, 31),
(33, 34), (35, 38), (39, 41), (42, 43), (44, 55), (56, 63)]

Given a sequence of spans, we can obtain the corresponding sequence of relative spans:

>>> import nltk
>>> from nltk.tokenize import WhitespaceTokenizer
>>> from nltk.tokenize.util import spans_to_relative
>>> sent=" She secured 90.56 % in class X \n. She is a meritorious student\n"
>>> list(spans_to_relative(WhitespaceTokenizer().span_tokenize(sent)))
[(1, 3), (1, 7), (1, 5), (1, 1), (1, 2), (1, 5), (1, 1), (2, 1), (1,
3), (1, 2), (1, 1), (1, 11), (1, 7)]

nltk.tokenize.util.string_span_tokenize(sent, separator) returns the offsets of the tokens in sent by splitting at each occurrence of the separator:

>>> import nltk
>>> from nltk.tokenize.util import string_span_tokenize
>>> sent=" She secured 90.56 % in class X \n. She is a meritorious student\n"
>>> list(string_span_tokenize(sent, " "))
[(1, 4), (5, 12), (13, 18), (19, 20), (21, 23), (24, 29), (30, 31),
(32, 34), (35, 38), (39, 41), (42, 43), (44, 55), (56, 64)]

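1.2　Normalization

To process natural language text, we need to normalize it. Normalization mainly involves removing punctuation, converting the whole text to lowercase or uppercase, converting numbers into words, expanding abbreviations, canonicalizing the text, and so on.

1.2.1　Eliminating punctuation

Sometimes, while tokenizing, we want to remove punctuation. Removing punctuation is considered one of the main tasks when normalizing text with NLTK.

Consider the following example:

>>> text=[" It is a pleasant evening.","Guests, who came from US arrived at the venue","Food was tasty."]
>>> from nltk.tokenize import word_tokenize
>>> tokenized_docs=[word_tokenize(doc) for doc in text]
>>> print(tokenized_docs)
[['It', 'is', 'a', 'pleasant', 'evening', '.'], ['Guests', ',', 'who',
'came', 'from', 'US', 'arrived', 'at', 'the', 'venue'], ['Food',
'was', 'tasty', '.']]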

The preceding code produces the tokenized text. The following code removes the punctuation from it:

>>> import re
>>> import string
>>> text=[" It is a pleasant evening.","Guests, who came from US arrived at the venue","Food was tasty."]
>>> from nltk.tokenize import word_tokenize
>>> tokenized_docs=[word_tokenize(doc) for doc in text]
>>> x=re.compile('[%s]' % re.escape(string.punctuation))
>>> tokenized_docs_no_punctuation = []
>>> for review in tokenized_docs:
    new_review = []
    for token in review:
        # Strip punctuation characters and keep only non-empty tokens.
        new_token = x.sub(u'', token)
        if not new_token == u'':
            new_review.append(new_token)
    tokenized_docs_no_punctuation.append(new_review)
>>> print(tokenized_docs_no_punctuation)
[['It', 'is', 'a', 'pleasant', 'evening'], ['Guests', 'who', 'came',
'from', 'US', 'arrived', 'at', 'the', 'venue'], ['Food', 'was',
'tasty']]

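1.2.2　Converting text to lowercase and uppercase

The lower() and upper() functions convert a given text entirely to lowercase or uppercase. Case conversion is also part of text normalization.

Consider the following example of case conversion: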

>>> text='HARdWork IS KEy to SUCCESS'
>>> print(text.lower())
hardwork is key to success
>>> print(text.upper())
HARDWORK IS KEY TO SUCCESS

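1.2.3　Dealing with stop words

Stop words are words that are filtered out when performing information retrieval or other natural language tasks, because they contribute little to the overall meaning of a sentence. Many search engines remove stop words to narrow the search space. Removing stop words is considered one of the essential normalization tasks in NLP.

NLTK provides stop word lists for several languages. To access them from nltk_data/corpora/stopwords, we need to unzip the data file: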

>>> import nltk
>>> from nltk.corpus import stopwords
>>> stops=set(stopwords.words('english'))
>>> words=["Don't", 'hesitate','to','ask','questions']
>>> [word for word in words if word not in stops]
["Don't", 'hesitate', 'ask', 'questions']

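An instance of the nltk.corpus.reader.WordListCorpusReader class serves as the stopwords corpus. It has a words() function whose argument is a fileid. Here the argument is 'english', which refers to all the stop words in the English file. If words() is called without an argument, it returns the stop words for all languages.

The fileids() function lists the languages for which stop word removal can be performed, that is, the languages that have a stop word file in NLTK: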

>>> stopwords.fileids()
['danish', 'dutch', 'english', 'finnish', 'french', 'german',
'hungarian', 'italian', 'norwegian', 'portuguese', 'russian',
'spanish', 'swedish', 'turkish']

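Any of the languages listed above can be passed to words() to get the stop words for that language.

1.2.4　Calculating stop words in English

Let's look at an example that lists the English stop words and computes the fraction of words in a text that are not stop words: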

>>> import nltk
>>> from nltk.corpus import stopwords
>>> stopwords.words('english')
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his',
'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself',
'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which',
'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are',
'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having',
'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if',
'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for',
'with', 'about', 'against', 'between', 'into', 'through', 'during',
'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in',
'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then',
'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any',
'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no',
'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's',
't', 'can', 'will', 'just', 'don', 'should', 'now']

>>> def para_fraction(text):
    # Fraction of the words in text that are not English stop words.
    stopwords = nltk.corpus.stopwords.words('english')
    para = [w for w in text if w.lower() not in stopwords]
    return len(para) / len(text)

>>> para_fraction(nltk.corpus.reuters.words())
0.7364374824583169
>>> para_fraction(nltk.corpus.inaugural.words())
0.5229560503653893

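Normalization can also involve converting numbers into words (for example, 1 can be replaced by one) and expanding abbreviations (for example, can't can be replaced by cannot). Both can be done by expressing them as replacement patterns, which we discuss in the next section.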
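
As a small illustration of the number-to-word case, the following sketch replaces standalone single digits with their names. The DIGIT_WORDS table and the digits_to_words function are assumptions made for this example; they are not part of NLTK, and multi-digit numbers are deliberately left untouched.

```python
import re

# Hypothetical lookup table for single digits (illustration only).
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

def digits_to_words(text):
    """Replace each standalone single digit with its English name."""
    # \b\d\b matches a digit that has no word character on either side,
    # so the digits inside "42" are not affected.
    return re.sub(r"\b\d\b", lambda m: DIGIT_WORDS[m.group()], text)

print(digits_to_words("I have 1 book and 2 pens"))
```

A fuller normalizer would also need rules for multi-digit numbers, decimals, and ordinals, which this sketch does not attempt.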

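1.3　Substituting and correcting tokens

In this section, we discuss replacing tokens with other tokens. We also discuss how to correct the spelling of tokens by replacing misspelled tokens with correctly spelled ones.

1.3.1　Replacing words using regular expressions

Words need to be replaced in order to remove errors or normalize text. One way to replace text is with regular expressions. Earlier, we ran into a problem when tokenizing contractions. With text replacement, we can replace a contraction with its expanded form; for example, doesn't can be replaced by does not.

We start by writing the following code, naming the program replacers.py, and saving it in the nltkdata folder:

import re
replacement_patterns = [
    (r'won\'t', 'will not'),
    (r'can\'t', 'cannot'),
    (r'i\'m', 'i am'),
    (r'ain\'t', 'is not'),
    (r'(\w+)\'ll', r'\g<1> will'),
    (r'(\w+)n\'t', r'\g<1> not'),
    (r'(\w+)\'ve', r'\g<1> have'),
    (r'(\w+)\'s', r'\g<1> is'),
    (r'(\w+)\'re', r'\g<1> are'),
    (r'(\w+)\'d', r'\g<1> would')
]
class RegexpReplacer(object):
    def __init__(self, patterns=replacement_patterns):
        # Compile each (regex, replacement) pair once.
        self.patterns = [(re.compile(regex), repl)
                         for (regex, repl) in patterns]
    def replace(self, text):
        s = text
        # Apply every pattern in order to the running result.
        for (pattern, repl) in self.patterns:
            (s, count) = re.subn(pattern, repl, s)
        return s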

Here we define replacement patterns: the first item of each pair is the pattern to match and the second is its replacement. The RegexpReplacer class compiles the pattern pairs and provides a replace() method that substitutes each pattern with its replacement.

1.3.2　Example of replacing text with other text

Let's look at an example of replacing text with other text:

>>> import nltk
>>> from replacers import RegexpReplacer
>>> replacer= RegexpReplacer()
>>> replacer.replace("Don't hesitate to ask questions")
'Do not hesitate to ask questions'
>>> replacer.replace("She must've gone to the market but she didn't go")
'She must have gone to the market but she did not go'

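RegexpReplacer.replace() replaces every occurrence of a pattern with its corresponding replacement. Here, must've is replaced by must have and didn't by did not, because replacers.py defines the replacement patterns as tuple pairs, namely (r'(\w+)\'ve', '\g<1> have') and (r'(\w+)n\'t', '\g<1> not').

We can replace not only contractions but any token with any other token.

1.3.3　Performing substitution before tokenization

Tokens can be substituted before tokenization to avoid the problems that arise when tokenizing contractions: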

>>> import nltk
>>> from nltk.tokenize import word_tokenize
>>> from replacers import RegexpReplacer
>>> replacer=RegexpReplacer()
>>> word_tokenize("Don't hesitate to ask questions")
['Do', "n't", 'hesitate', 'to', 'ask', 'questions']
>>> word_tokenize(replacer.replace("Don't hesitate to ask questions"))
['Do', 'not', 'hesitate', 'to', 'ask', 'questions']

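1.3.4　Dealing with repeating characters

People sometimes write with repeated characters that produce misspellings. Consider the sentence I like it a lotttttt, where lotttttt means lot. We remove such repeated characters using a backreference, in which a character refers back to the preceding character captured in a regular expression group. Removing repeated characters is also considered a normalization task.

First, append the following code to the replacers.py file created earlier: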

class RepeatReplacer(object):
    def __init__(self):
        self.repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
        self.repl = r'\1\2\3'
    def replace(self, word):
        repl_word = self.repeat_regexp.sub(self.repl, word)
        if repl_word != word:
            return self.replace(repl_word)
        else:
            return repl_word

1.3.5　Example of deleting repeating characters

Let's look at an example of removing repeated characters from a token:

>>> import nltk
>>> from replacers import RepeatReplacer
>>> replacer=RepeatReplacer()
>>> replacer.replace('lotttt')
'lot'
>>> replacer.replace('ohhhhh')
'oh'
>>> replacer.replace('ooohhhhh')
'oh'

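In replacers.py, the RepeatReplacer class compiles a regular expression and a replacement string. The regular expression, repeat_regexp, is defined using a backreference. It matches zero or more starting characters (\w*), followed by a single character (\w) that is immediately repeated (\2), followed by zero or more ending characters (\w*).

For example, lotttt is split into a prefix, a character, its repeat, and a suffix. Because the first group is greedy, the actual split is (lott)(t)t(). One t is dropped and the string becomes lottt. The process repeats until the resulting string is lot.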
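
The step-by-step reduction can be traced with a short standalone script. The reduction_steps helper below is hypothetical; it reuses the same pattern and replacement string as RepeatReplacer but records every intermediate string instead of recursing.

```python
import re

# Same pattern and replacement string as in RepeatReplacer.
repeat_regexp = re.compile(r"(\w*)(\w)\2(\w*)")
repl = r"\1\2\3"

def reduction_steps(word):
    """Return every intermediate string produced while removing repeats."""
    steps = [word]
    while True:
        reduced = repeat_regexp.sub(repl, steps[-1])
        if reduced == steps[-1]:
            # No doubled character is left, so the reduction is finished.
            return steps
        steps.append(reduced)

print(reduction_steps("lotttt"))
print(reduction_steps("happy"))
```

Each pass removes exactly one character from one doubled pair, which is why several passes are needed for a long run of repeated letters.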

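The problem with RepeatReplacer is that it converts happy to hapy, which is wrong. To avoid this, we can use WordNet together with it.

In the replacers.py program created earlier, add the following lines to include WordNet: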

import re
from nltk.corpus import wordnet
class RepeatReplacer(object):
    def __init__(self):
        self.repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
        self.repl = r'\1\2\3'
    def replace(self, word):
        if wordnet.synsets(word):
            return word
        repl_word = self.repeat_regexp.sub(self.repl, word)
        if repl_word != word:
            return self.replace(repl_word)
        else:
            return repl_word

Now let's see how this solves the problem mentioned above:

>>> import nltk
>>> from replacers import RepeatReplacer
>>> replacer=RepeatReplacer()
>>> replacer.replace('happy')
'happy'

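1.3.6　Replacing a word with its synonym

Now we will see how to replace a given word with its synonym. To the existing replacers.py file we can add a class named WordReplacer, which provides a mapping between a word and its synonym: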

class WordReplacer(object):
    def __init__(self, word_map):
        self.word_map = word_map
    def replace(self, word):
        return self.word_map.get(word, word)

1.3.7　Example of replacing a word with its synonym

Let's look at an example of replacing a word with its synonym:

>>> import nltk
>>> from replacers import WordReplacer
>>> replacer=WordReplacer({'congrats':'congratulations'})
>>> replacer.replace('congrats')
'congratulations'
>>> replacer.replace('maths')
'maths'

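In this code, the replace() function looks up the word's synonym in word_map. If a synonym exists for the given word, the word is replaced by it; otherwise no replacement is made and the word itself is returned.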

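1.4　Applying Zipf's law to text

Zipf's law states that the frequency of a token in a text is inversely proportional to its rank or position in the frequency-sorted list. The law describes how tokens are distributed in a language: some tokens occur very frequently, some occur occasionally, and some hardly occur at all.

Let's look at code that uses NLTK to obtain a log-log plot based on Zipf's law:

>>> import nltk
>>> from nltk.corpus import gutenberg
>>> from nltk.probability import FreqDist
>>> import matplotlib
>>> matplotlib.use('TkAgg')  # select the backend before importing pyplot
>>> import matplotlib.pyplot as plt
>>> fd = FreqDist()
>>> for text in gutenberg.fileids():
...     for word in gutenberg.words(text):
...         fd[word] += 1  # NLTK 3 uses Counter-style updates instead of fd.inc(word)
...
>>> ranks = []
>>> freqs = []
>>> # most_common() lists the words from most to least frequent.
>>> for rank, (word, freq) in enumerate(fd.most_common(), start=1):
...     ranks.append(rank)
...     freqs.append(freq)
...
>>> plt.loglog(ranks, freqs)
>>> plt.xlabel('rank(r)', fontsize=14, fontweight='bold')
>>> plt.ylabel('frequency(f)', fontsize=14, fontweight='bold')
>>> plt.grid(True)
>>> plt.show()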

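The preceding code produces a log-log plot of each word's rank against its frequency in the documents. We can therefore check whether Zipf's law holds for these documents by examining the relationship between the rank of words and their frequency.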
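
Without a plot, the same relationship can be inspected numerically: under an ideal Zipf distribution, the product of rank and frequency stays roughly constant. The following sketch uses only the standard library; rank_frequency_table is a hypothetical helper name and the token list is a toy example, not corpus data.

```python
from collections import Counter

def rank_frequency_table(tokens):
    """Return (rank, token, frequency, rank * frequency) rows,
    ordered from the most frequent token to the least frequent."""
    counts = Counter(tokens)
    return [
        (rank, token, freq, rank * freq)
        for rank, (token, freq) in enumerate(counts.most_common(), start=1)
    ]

toy_tokens = "a a a a b b c".split()
for row in rank_frequency_table(toy_tokens):
    print(row)
```

On a real corpus, the last column would be expected to vary within a limited range for the mid-frequency words if the law holds approximately.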

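1.5　Similarity measures

Many similarity measures can be used for NLP tasks. The nltk.metrics package in NLTK provides various evaluation and similarity measures that are useful for a wide range of NLP tasks.

To test the performance of taggers, chunkers, and so on in NLP, we can use the standard scores borrowed from information retrieval.

Let's see how to analyze the output of a named entity recognizer using these standard scores, with labels obtained from a training file as the reference: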

>>> from __future__ import print_function
>>> from nltk.metrics import *
>>> training='PERSON OTHER PERSON OTHER OTHER ORGANIZATION'.split()
>>> testing='PERSON OTHER OTHER OTHER OTHER OTHER'.split()
>>> print(accuracy(training,testing))
0.6666666666666666
>>> trainset=set(training)
>>> testset=set(testing)
>>> precision(trainset,testset)
1.0
>>> print(recall(trainset,testset))
0.6666666666666666
>>> print(f_measure(trainset,testset))
0.8
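
The arithmetic behind these scores is simple set counting. The sketch below recomputes them with plain Python sets under the usual definitions; the *_sketch function names are hypothetical, and the code assumes non-empty sets with non-zero precision and recall.

```python
def precision_sketch(reference, test):
    """Fraction of the predicted labels that also occur in the reference."""
    return len(reference & test) / len(test)

def recall_sketch(reference, test):
    """Fraction of the reference labels that were predicted."""
    return len(reference & test) / len(reference)

def f_measure_sketch(reference, test, alpha=0.5):
    """Weighted harmonic mean of precision and recall."""
    p = precision_sketch(reference, test)
    r = recall_sketch(reference, test)
    return 1.0 / (alpha / p + (1 - alpha) / r)

reference = {"PERSON", "OTHER", "ORGANIZATION"}
predicted = {"PERSON", "OTHER"}
print(precision_sketch(reference, predicted))
print(recall_sketch(reference, predicted))
print(f_measure_sketch(reference, predicted))
```

Because sets discard duplicates, these scores only say which label types were found, not how many individual tokens were labeled correctly; accuracy over the token sequences captures the latter.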

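1.5.1　Similarity measures using the edit distance algorithm

The edit distance, or Levenshtein edit distance, between two strings is the number of characters that must be inserted, substituted, or deleted to make the two strings equal.

The edit distance algorithm performs the following operations:

Copy a letter from the first string to the second (cost 0), or substitute a letter with another (cost 1): D(i−1,j−1) + d(si,tj).
Delete a letter from the first string (cost 1): D(i,j−1) + 1.
Insert a letter into the second string (cost 1): D(i−1,j) + 1.

The distance D(i,j) is the minimum over these three operations:

\[
D(i,j) = \min
\begin{cases}
D(i-1,j-1) + d(s_i,t_j) & \text{(substitution/copy)} \\
D(i,j-1) + 1 & \text{(deletion)} \\
D(i-1,j) + 1 & \text{(insertion)}
\end{cases}
\]

The Python code for edit distance in the nltk.metrics package is as follows: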

from __future__ import print_function
def _edit_dist_init(len1, len2):
    lev = []
    for i in range(len1):
        lev.append([0] * len2)      # initialize 2D array to zero
    for i in range(len1):
        lev[i][0] = i               # column 0: 0,1,2,3,4,...
    for j in range(len2):
        lev[0][j] = j               # row 0: 0,1,2,3,4,...
    return lev

def _edit_dist_step(lev, i, j, s1, s2, transpositions=False):
    c1 = s1[i - 1]
    c2 = s2[j - 1]

    # skipping a character in s1
    a = lev[i - 1][j] + 1
    # skipping a character in s2
    b = lev[i][j - 1] + 1
    # substitution
    c = lev[i - 1][j - 1] + (c1 != c2)
    # transposition
    d = c + 1  # never picked by default
    if transpositions and i > 1 and j > 1:
        if s1[i - 2] == c2 and s2[j - 2] == c1:
            d = lev[i - 2][j - 2] + 1
    # pick the cheapest
    lev[i][j] = min(a, b, c, d)

def edit_distance(s1, s2, transpositions=False):
    # set up a 2-D array
    len1 = len(s1)
    len2 = len(s2)
    lev = _edit_dist_init(len1 + 1, len2 + 1)

    # iterate over the array
    for i in range(len1):
        for j in range(len2):
            _edit_dist_step(lev, i + 1, j + 1, s1, s2,
                            transpositions=transpositions)
    return lev[len1][len2]

Let's look at code that computes the edit distance using the nltk.metrics package:

>>> import nltk
>>> from nltk.metrics import *
>>> edit_distance("relate","relation")
3
>>> edit_distance("suggestion","calculation")
7

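Here, computing the edit distance between relate and relation takes three operations (one substitution and two insertions). Computing the edit distance between suggestion and calculation takes seven operations (six substitutions and one insertion).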
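
For readers who want to check such results without NLTK, the same dynamic program can be written compactly with two rows of the table. This is a minimal sketch of plain Levenshtein distance (no transpositions); the function name levenshtein is chosen here for illustration.

```python
def levenshtein(s1, s2):
    """Levenshtein distance computed with two rows of the DP table."""
    # previous[j] holds the distance between the processed prefix of s1
    # and the first j characters of s2.
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # turning i characters of s1 into "" costs i deletions
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,               # skip a character of s1
                current[j - 1] + 1,            # skip a character of s2
                previous[j - 1] + (c1 != c2),  # copy (0) or substitute (1)
            ))
        previous = current
    return previous[-1]

print(levenshtein("relate", "relation"))
```

Keeping only two rows reduces memory from the full table to a single row's length, at the cost of no longer being able to trace back the individual edit operations.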

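1.5.2　Similarity measures using the Jaccard coefficient

The Jaccard coefficient, or Tanimoto coefficient, measures how similar two sets X and Y are in terms of their overlap.

It is defined as follows:

Jaccard(X,Y) = |X∩Y| / |X∪Y|.
Jaccard(X,X) = 1.
Jaccard(X,Y) = 0 if X∩Y = ∅.

The code for Jaccard similarity is as follows:

def jacc_similarity(query, document):
    first = set(query).intersection(set(document))
    second = set(query).union(set(document))
    # float() keeps the division exact under Python 2 as well
    return len(first) / float(len(second))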

Let's look at NLTK's implementation of the Jaccard distance, which equals 1 minus the Jaccard coefficient:

>>> import nltk
>>> from nltk.metrics import *
>>> X=set([10,20,30,40])
>>> Y=set([20,30,60])
>>> print(jaccard_distance(X,Y))
0.6
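
The relationship between the coefficient and the distance can be verified with plain sets. This is a sketch under the definition given above; both function names are hypothetical, and treating two empty sets as identical is an assumption made here.

```python
def jaccard_similarity(x, y):
    """|X ∩ Y| / |X ∪ Y| for two sets."""
    union = x | y
    if not union:
        return 1.0  # assumption: two empty sets count as identical
    return len(x & y) / len(union)

def jaccard_distance_sketch(x, y):
    """Distance is the complement of the similarity coefficient."""
    return 1 - jaccard_similarity(x, y)

X = {10, 20, 30, 40}
Y = {20, 30, 60}
print(jaccard_similarity(X, Y))
print(jaccard_distance_sketch(X, Y))
```

Here the intersection has 2 elements and the union has 5, so the similarity is 2/5 and the distance is 3/5, in line with the NLTK result above.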

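1.5.3　Similarity measures using the Smith Waterman distance

The Smith Waterman distance is similar to edit distance. It was developed to detect optimal local alignments between related protein sequences and DNA. It consists of costs to be assigned and functions that map pairs of letters to cost values (substitution); a cost G is also assigned to the gap penalty (insertion or deletion):

\[
D(i,j) = \max
\begin{cases}
0 & \text{(start over)} \\
D(i-1,j-1) - d(s_i,t_j) & \text{(substitution/copy)} \\
D(i-1,j) - G & \text{(insertion)} \\
D(i,j-1) - G & \text{(deletion)}
\end{cases}
\]

The distance is the maximum of D(i,j) over all i, j in the table. Example values are G = 1 for the gap cost, d(c,c) = −2 for the context-dependent substitution cost of matching letters, and d(c,d) = +1 for the context-dependent substitution cost of differing letters.

As with edit distance, Python code for Smith Waterman can be added to the nltk.metrics package so that string similarity can be measured with the Smith Waterman algorithm in NLTK.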
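
Since no such code is shown in the text, the following is one possible sketch of the recurrence using the example values G = 1, d(c,c) = −2, and d(c,d) = +1. The function name smith_waterman and its parameter names are assumptions made for illustration; this is not NLTK code.

```python
def smith_waterman(s, t, gap=1, match_cost=-2, mismatch_cost=1):
    """Best local alignment score of s and t under the recurrence
    D(i,j) = max(0, D(i-1,j-1) - d, D(i-1,j) - G, D(i,j-1) - G)."""
    rows, cols = len(s) + 1, len(t) + 1
    table = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            cost = match_cost if s[i - 1] == t[j - 1] else mismatch_cost
            table[i][j] = max(
                0,                            # start over
                table[i - 1][j - 1] - cost,   # substitution/copy
                table[i - 1][j] - gap,        # insertion
                table[i][j - 1] - gap,        # deletion
            )
            best = max(best, table[i][j])
    # The result is the maximum over the whole table, not the last cell.
    return best

print(smith_waterman("xxabcxx", "yyabcyy"))
```

Because a negative substitution cost is subtracted, every matching letter adds 2 to the running score, while mismatches and gaps reduce it. Resetting to 0 lets the best-matching region be found anywhere in the two strings.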

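1.5.4　Other string similarity metrics

Binary distance is a string similarity metric. It returns 0.0 if two labels are identical and 1.0 otherwise.

The Python code for the binary distance metric is: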

def binary_distance(label1, label2):
 return 0.0 if label1 == label2 else 1.0

Let's see how the binary distance metric is used in NLTK:

>>> import nltk
>>> from nltk.metrics import *
>>> X = set([10,20,30,40])
>>> Y= set([30,50,70])
>>> binary_distance(X, Y)
1.0

Masi distance is based on partial agreement when multiple labels are present.

The Python code for the masi distance in the nltk.metrics package is as follows:

def masi_distance(label1, label2):
    len_intersection = len(label1.intersection(label2))
    len_union = len(label1.union(label2))
    len_label1 = len(label1)
    len_label2 = len(label2)
    if len_label1 == len_label2 and len_label1 == len_intersection:
        m = 1
    elif len_intersection == min(len_label1, len_label2):
        m = 0.67
    elif len_intersection > 0:
        m = 0.33
    else:
        m = 0

    return 1 - (len_intersection / float(len_union)) * m

Let's look at the masi distance in NLTK:

>>> import nltk
>>> from __future__ import print_function
>>> from nltk.metrics import *
>>> X = set([10,20,30,40])
>>> Y= set([30,50,70])
>>> print(masi_distance(X,Y))
0.945

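1.6　Summary

In this chapter, you learned various operations that can be performed on text (a collection of strings). You now understand tokenization, substitution, and normalization, and how to apply various similarity measures to strings using NLTK. We also discussed Zipf's law, which may hold for some existing documents.

In the next chapter, we discuss various language modeling techniques and different NLP tasks.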

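Chapter 2　Statistical Language Modeling

Computational linguistics is an emerging field that is widely used in analytics, software applications, and contexts where people communicate with machines. It can be regarded as a subfield of artificial intelligence. Its applications include machine translation, speech recognition, intelligent web search, information retrieval, and intelligent spelling checkers. It is important to understand the preprocessing tasks or computations that can be performed on natural language text. In the following sections, we discuss ways to calculate word frequencies, the maximum likelihood estimation (MLE) model, interpolation on data, and so on. First, here are the topics covered in this chapter:

Calculating word frequencies (1-gram, 2-gram, 3-gram).
Developing MLE for a given text.
Applying smoothing to the MLE model.
Developing a back-off mechanism for MLE.
Applying interpolation to data to get a mix and match.
Evaluating a language model through perplexity.
Applying the Metropolis-Hastings algorithm in language modeling.
Applying Gibbs sampling in language processing.

2.1　Understanding word frequency

A collocation can be defined as a set of two or more tokens that tend to occur together, for example the United States, the United Kingdom, and Union of Soviet Socialist Republics.

A unigram is a single token. The following code generates unigrams for the Alpino corpus:

>>> import nltk
>>> from nltk.util import ngrams
>>> from nltk.corpus import alpino
>>> alpino.words()
['De', 'verzekeringsmaatschappijen', 'verhelen', ...]
>>> unigrams=ngrams(alpino.words(),1)
>>> for i in unigrams:
    print(i)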

Consider another example that generates quadgrams, or fourgrams, from the alpino corpus:

>>> import nltk
>>> from nltk.util import ngrams
>>> from nltk.corpus import alpino
>>> alpino.words()
['De', 'verzekeringsmaatschappijen', 'verhelen', ...]
>>> quadgrams=ngrams(alpino.words(),4)
>>> for i in quadgrams:
    print(i)

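A bigram is a pair of tokens. To find bigrams in a text, we first lowercase the words to build a list of lowercase tokens and then create a BigramCollocationFinder instance. BigramAssocMeasures, found in the nltk.metrics package, can be used to score and find bigrams in the text: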

>>> import nltk
>>> from nltk.collocations import BigramCollocationFinder
>>> from nltk.corpus import webtext
>>> from nltk.metrics import BigramAssocMeasures
>>> tokens=[t.lower() for t in webtext.words('grail.txt')]
>>> words=BigramCollocationFinder.from_words(tokens)
>>> words.nbest(BigramAssocMeasures.likelihood_ratio, 10)
[("'", 's'), ('arthur', ':'), ('#', '1'), ("'", 't'), ('villager',
'#'), ('#', '2'), (']', '['), ('1', ':'), ('oh', ','), ('black',
'knight')]

We can add a word filter to the preceding code to eliminate stop words and punctuation:

>>> from nltk.corpus import stopwords
>>> from nltk.corpus import webtext
>>> from nltk.collocations import BigramCollocationFinder
>>> from nltk.metrics import BigramAssocMeasures
>>> stop_set = set(stopwords.words('english'))  # avoid shadowing the built-in set
>>> stops_filter = lambda w: len(w) < 3 or w in stop_set
>>> tokens=[t.lower() for t in webtext.words('grail.txt')]
>>> words=BigramCollocationFinder.from_words(tokens)
>>> words.apply_word_filter(stops_filter)
>>> words.nbest(BigramAssocMeasures.likelihood_ratio, 10)
[('black', 'knight'), ('clop', 'clop'), ('head', 'knight'), ('mumble',
'mumble'), ('squeak', 'squeak'), ('saw', 'saw'), ('holy', 'grail'),
('run', 'away'), ('french', 'guard'), ('cartoon', 'character')]

Here, we can change the number of bigrams returned from 10 to any other number.

Another way to generate bigrams from a text is to use a collocation finder, as in the following code:

>>> import nltk
>>> from nltk.collocations import *
>>> text1="Hardwork is the key to success. Never give up!"
>>> word = nltk.wordpunct_tokenize(text1)
>>> finder = BigramCollocationFinder.from_words(word)
>>> bigram_measures = nltk.collocations.BigramAssocMeasures()
>>> value = finder.score_ngrams(bigram_measures.raw_freq)
>>> sorted(bigram for bigram, score in value)
[('.', 'Never'), ('Hardwork', 'is'), ('Never', 'give'), ('give',
'up'), ('is', 'the'), ('key', 'to'), ('success', '.'), ('the', 'key'),
('to', 'success'), ('up', '!')]

Now let's look at another piece of code that generates bigrams from the alpino corpus:

>>> import nltk
>>> from nltk.util import ngrams
>>> from nltk.corpus import alpino
>>> alpino.words()
['De', 'verzekeringsmaatschappijen', 'verhelen', ...]
>>> bigrams_tokens=ngrams(alpino.words(),2)
>>> for i in bigrams_tokens:
    print(i)

This code generates bigrams from the alpino corpus.

Now let's look at code for generating trigrams:

>>> import nltk
>>> from nltk.util import ngrams
>>> from nltk.corpus import alpino
>>> alpino.words()
['De', 'verzekeringsmaatschappijen', 'verhelen', ...]
>>> trigrams_tokens=ngrams(alpino.words(),3)
>>> for i in trigrams_tokens:
    print(i)

To generate fourgrams and their frequencies, we can use the following code:

>>> import nltk
>>> from nltk.collocations import *
>>> text="Hello how are you doing ? I hope you find the book interesting"
>>> tokens=nltk.wordpunct_tokenize(text)
>>> fourgrams=nltk.collocations.QuadgramCollocationFinder.from_words(tokens)
>>> for fourgram, freq in fourgrams.ngram_fd.items():
    print(fourgram,freq)

('hope', 'you', 'find', 'the') 1
('Hello', 'how', 'are', 'you') 1
('you', 'doing', '?', 'I') 1
('are', 'you', 'doing', '?') 1
('how', 'are', 'you', 'doing') 1
('?', 'I', 'hope', 'you') 1
('doing', '?', 'I', 'hope') 1
('find', 'the', 'book', 'interesting') 1
('you', 'find', 'the', 'book') 1
('I', 'hope', 'you', 'find') 1

Now let's look at code that generates n-grams for a given sentence:

>>> import nltk
>>> from nltk.util import ngrams
>>> sent=" Hello , please read the book thoroughly . If you have any queries , then don't hesitate to ask . There is no shortcut to success ."
>>> n=5
>>> fivegrams=ngrams(sent.split(),n)
>>> for grams in fivegrams:
    print(grams)


('Hello', ',', 'please', 'read', 'the')
(',', 'please', 'read', 'the', 'book')
('please', 'read', 'the', 'book', 'thoroughly')
('read', 'the', 'book', 'thoroughly', '.')
('the', 'book', 'thoroughly', '.', 'If')
('book', 'thoroughly', '.', 'If', 'you')
('thoroughly', '.', 'If', 'you', 'have')
('.', 'If', 'you', 'have', 'any')
('If', 'you', 'have', 'any', 'queries')
('you', 'have', 'any', 'queries', ',')
('have', 'any', 'queries', ',', 'then')
('any', 'queries', ',', 'then', "don't")
('queries', ',', 'then', "don't", 'hesitate')
(',', 'then', "don't", 'hesitate', 'to')
('then', "don't", 'hesitate', 'to', 'ask')
("don't", 'hesitate', 'to', 'ask', '.')
('hesitate', 'to', 'ask', '.', 'There')
('to', 'ask', '.', 'There', 'is')
('ask', '.', 'There', 'is', 'no')
('.', 'There', 'is', 'no', 'shortcut')
('There', 'is', 'no', 'shortcut', 'to')
('is', 'no', 'shortcut', 'to', 'success')
('no', 'shortcut', 'to', 'success', '.')
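
The same sliding-window idea can be sketched without NLTK. The following is a minimal illustration, not NLTK's implementation: make_ngrams is a hypothetical helper name, and the sentence is the one whose bigrams were listed earlier in this section.

```python
from collections import Counter

def make_ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token sequence."""
    # zip over n staggered views of the token list; zip stops at the
    # shortest view, so incomplete windows at the end are dropped
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "Hardwork is the key to success . Never give up !".split()
bigrams = make_ngrams(tokens, 2)
bigram_freq = Counter(bigrams)      # frequency of each bigram

print(bigrams[:3])
print(len(bigrams), len(make_ngrams(tokens, 5)))
```

A sequence of k tokens yields k - n + 1 n-grams, so the 11 tokens above give 10 bigrams and 7 fivegrams.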

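2.1.1 Developing MLE for a given text

Maximum Likelihood Estimation (MLE) is a fundamental task in NLP: it estimates the probability of each token directly from its relative frequency in the training text. A related model is the maximum entropy model, which is also known as multinomial logistic regression or a conditional exponential classifier and was introduced by Berger and Della Pietra in 1996. In NLTK, the maximum entropy model is defined in the nltk.classify.maxent module, which considers all the probability distributions that are consistent with the training data. The model works with two kinds of features: input features and joint features. An input feature can be thought of as a feature of an unlabeled word, and a joint feature as a feature of a labeled word. MLE is applied to a freqdist, the frequency distribution of the tokens in a text, to produce a probability distribution over those tokens. The freqdist parameter is the frequency distribution on which the probability distribution is based.

Let's look at the NLTK code related to the maximum entropy model: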

from __future__ import print_function,unicode_literals
__docformat__='epytext en'

try:
    import numpy
except ImportError:
    pass
import tempfile
import os
from collections import defaultdict
from nltk import compat
from nltk.data import gzip_open_unicode
from nltk.util import OrderedDict
from nltk.probability import DictionaryProbDist
from nltk.classify.api import ClassifierI
from nltk.classify.util import CutoffChecker,accuracy,log_likelihood
from nltk.classify.megam import (call_megam,
                                 write_megam_file,parse_megam_weights)
from nltk.classify.tadm import call_tadm,write_tadm_file,parse_tadm_weights

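In the preceding code, nltk.probability is the module that contains the FreqDist class, which can be used to determine how often each token occurs in a text.

ProbDistI is the interface for a probability distribution over the tokens in a text. There are basically two kinds of probability distributions: derived and analytic. A derived probability distribution is obtained from a frequency distribution, whereas an analytic probability distribution is obtained from parameters, such as variance.

To derive a probability distribution from a frequency distribution, we can use the maximum likelihood estimate. It computes the probability of each token as its relative frequency in the frequency distribution: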

class MLEProbDist(ProbDistI):

    def __init__(self, freqdist, bins=None):
        self._freqdist = freqdist

    def freqdist(self):
        """
        Return the frequency distribution on which this probability
        distribution is based.
        """
        return self._freqdist

    def prob(self, sample):
        return self._freqdist.freq(sample)

    def max(self):
        return self._freqdist.max()

    def samples(self):
        return self._freqdist.keys()

    def __repr__(self):
        """
        Return the string representation of this ProbDist.
        """
        return '<MLEProbDist based on %d samples>' % self._freqdist.N()


class LidstoneProbDist(ProbDistI):
    """
    The Lidstone estimate of the probability distribution behind a
    frequency distribution. It is parameterized by a real number gamma,
    which typically ranges from 0 to 1. Given a sample with count c, from
    an experiment with N outcomes and B bins (the number of sample values
    that the probability distribution can generate), the Lidstone estimate
    of the sample's probability is (c+Gamma)/(N+B*Gamma).

    This is equivalent to adding Gamma to the count of every possible
    sample outcome and then taking the MLE of the resulting frequency
    distribution.
    """
    SUM_TO_ONE = False

    def __init__(self, freqdist, gamma, bins=None):
        """
        Use the Lidstone estimate to create a probability distribution
        from freqdist.

        The freqdist parameter is the frequency distribution on which the
        probability estimates are based.

        The bins parameter is the number of sample values that the
        probability distribution can generate; it is needed so that the
        probabilities sum to 1.
        """
        if (bins == 0) or (bins is None and freqdist.N() == 0):
            name = self.__class__.__name__[:-8]
            raise ValueError('A %s probability distribution ' % name +
                             'must have at least one bin.')
        if (bins is not None) and (bins < freqdist.B()):
            name = self.__class__.__name__[:-8]
            raise ValueError('\nThe number of bins in a %s distribution ' % name +
                             '(%d) must be greater than or equal to\n' % bins +
                             'the number of bins in the FreqDist used ' +
                             'to create it (%d).' % freqdist.B())

        self._freqdist = freqdist
        self._gamma = float(gamma)
        self._N = self._freqdist.N()

        if bins is None:
            bins = freqdist.B()
        self._bins = bins

        self._divisor = self._N + bins * gamma
        if self._divisor == 0.0:
            # In extreme cases we force the probability to be 0,
            # which it will be, since the count will be 0:
            self._gamma = 0
            self._divisor = 1

    def freqdist(self):
        """
        Return the frequency distribution on which this probability
        distribution is based.
        """
        return self._freqdist

    def prob(self, sample):
        c = self._freqdist[sample]
        return (c + self._gamma) / self._divisor

    def max(self):
        # To obtain the most probable sample, choose the one
        # that occurs most frequently.
        return self._freqdist.max()

    def samples(self):
        return self._freqdist.keys()

    def discount(self):
        gb = self._gamma * self._bins
        return gb / (self._N + gb)

    def __repr__(self):
        """
        Return the string representation of this ProbDist.
        """
        return '<LidstoneProbDist based on %d samples>' % self._freqdist.N()


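class LaplaceProbDist(LidstoneProbDist):
    """
    The Laplace estimate of the probability distribution behind a
    frequency distribution. Given a sample with count c, from an
    experiment with N outcomes and B bins (the number of sample values
    that can be generated), the probability of the sample is:

    (c+1)/(N+B)

    This is equivalent to adding 1 to the count of every possible sample
    outcome and then taking the maximum likelihood estimate of the
    resulting frequency distribution.
    """
    def __init__(self, freqdist, bins=None):
        """
        Use the Laplace estimate to create a probability distribution
        from freqdist.

        The freqdist parameter is the frequency distribution on which the
        probability estimates are based.

        The bins parameter is the number of sample values that can be
        generated; it is needed so that the probabilities sum to 1.
        """
        LidstoneProbDist.__init__(self, freqdist, 1, bins)

    def __repr__(self):
        """
        Return the string representation of this ProbDist.
        """
        return '<LaplaceProbDist based on %d samples>' % self._freqdist.N()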

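class ELEProbDist(LidstoneProbDist):
    """
    The expected likelihood estimate of the probability distribution
    behind a frequency distribution. Given a sample with count c, from an
    experiment with N outcomes and B bins (the number of sample values
    that can be generated), the probability of the sample is:

    (c+0.5)/(N+B/2)

    This is equivalent to adding 0.5 to the count of every possible
    sample outcome and then taking the maximum likelihood estimate of the
    resulting frequency distribution.
    """
    def __init__(self, freqdist, bins=None):
        """
        Use the expected likelihood estimate to create a probability
        distribution from freqdist. The freqdist parameter is the
        frequency distribution on which the probability estimates are
        based.

        The bins parameter is the number of sample values that can be
        generated; it is needed so that the probabilities sum to 1.
        """
        LidstoneProbDist.__init__(self, freqdist, 0.5, bins)

    def __repr__(self):
        """
        Return the string representation of this ProbDist.
        """
        return '<ELEProbDist based on %d samples>' % self._freqdist.N()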



class WittenBellProbDist(ProbDistI):
    """
    The Witten-Bell estimate of a probability distribution. Based on the
    frequencies of the samples seen so far, it allocates uniform
    probability mass to samples that have not been seen yet. The
    probability mass reserved for unseen samples is:

    T / (N + T)

    Here, T is the number of observed sample types and N is the total
    number of observed events. This mass equals the maximum likelihood
    estimate of a new type of sample occurring. All the probabilities
    sum to 1:

        p = T / (Z * (N + T)), if count == 0
        p = c / (N + T), otherwise
    """
    def __init__(self, freqdist, bins=None):
        """
        Create the Witten-Bell probability distribution described in the
        class docstring.

        Z is the normalizing factor computed from these values and the
        bins value (in the code below, Z = bins - freqdist.B()).

        The freqdist parameter holds the frequency counts from which the
        probability distribution is estimated.

        The bins parameter is the number of possible sample types.
        """
        assert bins is None or bins >= freqdist.B(),\
               'bins parameter must not be less than %d=freqdist.B()' % freqdist.B()
        if bins is None:
            bins = freqdist.B()
        self._freqdist = freqdist
        self._T = self._freqdist.B()
        self._Z = bins - self._freqdist.B()
        self._N = self._freqdist.N()
        # self._P0 is P(0), precalculated for efficiency:
        if self._N==0:
            # if freqdist is empty, we approximate P(0) by a UniformProbDist:
            self._P0 = 1.0 / self._Z
        else:
            self._P0 = self._T / float(self._Z * (self._N + self._T))

    def prob(self, sample):
        # inherit docs from ProbDistI
        c = self._freqdist[sample]
        return (c / float(self._N + self._T) if c != 0 else self._P0)

    def max(self):
        return self._freqdist.max()

    def samples(self):
        return self._freqdist.keys()

    def freqdist(self):
        return self._freqdist

    def discount(self):
        raise NotImplementedError()

    def __repr__(self):
        """
        Return the string representation of this ProbDist.
        """
        return '<WittenBellProbDist based on %d samples>' % self._freqdist.N()
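
The MLE, Laplace, and ELE estimates above are all special cases of the Lidstone formula (c+Gamma)/(N+B*Gamma). The following pure-Python sketch illustrates this on a toy sentence. It is not the NLTK implementation: lidstone_prob is a hypothetical helper, and the sentence and the choice of bins=6 are assumptions made for the example.

```python
from collections import Counter

def lidstone_prob(counts, sample, gamma, bins=None):
    """Lidstone estimate (c + gamma) / (N + B * gamma) for one sample."""
    n = sum(counts.values())                    # N: total number of outcomes
    b = len(counts) if bins is None else bins   # B: number of possible sample values
    return (counts[sample] + gamma) / (n + b * gamma)

counts = Counter("the cat sat on the mat".split())

# gamma = 0 gives the MLE, gamma = 1 the Laplace estimate,
# and gamma = 0.5 the expected likelihood estimate (ELE)
for name, gamma in [("MLE", 0.0), ("Laplace", 1.0), ("ELE", 0.5)]:
    seen = lidstone_prob(counts, "the", gamma, bins=6)
    unseen = lidstone_prob(counts, "dog", gamma, bins=6)
    print(name, seen, unseen)
```

With Gamma equal to 0, the unseen word "dog" gets probability 0. Any positive Gamma moves some probability mass from the seen words to the unseen one.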

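We can run a test using maximum likelihood estimation. Consider the following NLTK code for MLE, which relies on the train_and_test function defined in section 2.1.2:

>>> import nltk
>>> from nltk.probability import *
>>> mle = lambda fd, bins: MLEProbDist(fd)
>>> train_and_test(mle)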
28.76%
>>> train_and_test(LaplaceProbDist)
69.16%
>>> train_and_test(ELEProbDist)
76.38%
>>> def lidstone(gamma):
    return lambda fd, bins: LidstoneProbDist(fd, gamma, bins)

>>> train_and_test(lidstone(0.1))
86.17%
>>> train_and_test(lidstone(0.5))
76.38%
>>> train_and_test(lidstone(1.0))
69.16%

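2.1.2 Hidden Markov Model estimation

A Hidden Markov Model (HMM) consists of observed states and hidden states that help to determine the observed states. Figure 2-1 illustrates an HMM, where x denotes the hidden states and y denotes the observed states.

Figure 2-1

We can run a test using HMM estimation. Consider the following code, which uses the Brown corpus: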

>>> import nltk
>>> corpus = nltk.corpus.brown.tagged_sents(categories='adventure')[:700]
>>> print(len(corpus))
700
>>> from nltk.util import unique_list
>>> tag_set = unique_list(tag for sent in corpus for (word,tag) in sent)
>>> print(len(tag_set))
104
>>> symbols = unique_list(word for sent in corpus for (word,tag) in sent)
>>> print(len(symbols))
1908
>>> trainer = nltk.tag.HiddenMarkovModelTrainer(tag_set, symbols)
>>> train_corpus = []
>>> test_corpus = []
>>> for i in range(len(corpus)):
    if i % 10:
        train_corpus += [corpus[i]]
    else:
        test_corpus += [corpus[i]]


>>> print(len(train_corpus))
630
>>> print(len(test_corpus))
70
>>> def train_and_test(est):
    hmm = trainer.train_supervised(train_corpus, estimator=est)
    print('%.2f%%' % (100 * hmm.evaluate(test_corpus)))

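In the preceding code, we split the corpus so that 90% of the sentences are used for training and 10% for testing, and we defined train_and_test, which trains an HMM with a given estimator and prints its accuracy on the test sentences.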
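
The 90/10 split depends only on the test i % 10, which is 0 (false) for every tenth index. The following sketch isolates that step so that the 630/70 counts can be checked without downloading the corpus; split_every_tenth is a hypothetical helper, and integers stand in for the tagged sentences.

```python
def split_every_tenth(items):
    """Send items at indices 0, 10, 20, ... to the test set, the rest to training."""
    train, test = [], []
    for i, item in enumerate(items):
        # i % 10 is 0 (falsy) for every tenth index and non-zero otherwise
        (train if i % 10 else test).append(item)
    return train, test

# Integers stand in for the 700 tagged sentences
train, test = split_every_tenth(list(range(700)))
print(len(train), len(test))
```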

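2.2 Applying smoothing on the MLE model

Smoothing is used to handle words that did not occur in the training data. Under MLE, the probability of such an unknown word is 0. Smoothing solves this problem by reserving some probability mass for unseen words.

2.2.1 Additive smoothing

In the 18th century, Laplace invented add-one smoothing, in which 1 is added to the count of each word. Instead of 1, any other positive value can be added to the counts, so that unknown words are handled and their probability is no longer 0. A pseudocount is the value (1 or any other nonzero value) that is added to the counts so that unknown words get a nonzero probability.

Consider the following NLTK code for additive smoothing: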

>>> import nltk
>>> corpus=u"<s> hello how are you doing ? Hope you find the book interesting. </s>".split()
>>> sentence=u"<s>how are you doing</s>".split()
>>> vocabulary=set(corpus)
>>> len(vocabulary)
13
>>> cfd = nltk.ConditionalFreqDist(nltk.bigrams(corpus))
>>> # The corpus counts of each bigram in the sentence:
>>> [cfd[a][b] for (a,b) in nltk.bigrams(sentence)]
[0, 1, 0]
>>> # The counts for each word in the sentence:
>>> [cfd[a].N() for (a,b) in nltk.bigrams(sentence)]
[0, 1, 2]
>>> # There is already a FreqDist method for MLE probability:
>>> [cfd[a].freq(b) for (a,b) in nltk.bigrams(sentence)]
[0, 1.0, 0.0]
>>> # Laplace smoothing of each bigram count:
>>> [1 + cfd[a][b] for (a,b) in nltk.bigrams(sentence)]
[1, 2, 1]
>>> # We need to normalise the counts for each word:
>>> [len(vocabulary) + cfd[a].N() for (a,b) in nltk.bigrams(sentence)]
[13, 14, 15]
>>> # The smoothed Laplace probability for each bigram:
>>> [1.0 * (1+cfd[a][b]) / (len(vocabulary)+cfd[a].N()) for (a,b) in nltk.bigrams(sentence)]
[0.07692307692307693, 0.14285714285714285, 0.06666666666666667]

Consider another way to perform additive smoothing, that is, to generate a Laplace probability distribution:

>>> # MLEProbDist is the unsmoothed probability distribution:
>>> cpd_mle = nltk.ConditionalProbDist(cfd, nltk.MLEProbDist, bins=len(vocabulary))
>>> # Now we can get the MLE probabilities by using the .prob method:
>>> [cpd_mle[a].prob(b) for (a,b) in nltk.bigrams(sentence)]
[0, 1.0, 0.0]
>>> # LaplaceProbDist is the add-one smoothed ProbDist:
>>> cpd_laplace = nltk.ConditionalProbDist(cfd, nltk.LaplaceProbDist, bins=len(vocabulary))
>>> # Getting the Laplace probabilities is the same as for MLE:
>>> [cpd_laplace[a].prob(b) for (a,b) in nltk.bigrams(sentence)]
[0.07692307692307693, 0.14285714285714285, 0.06666666666666667]
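
The same add-one computation can be sketched with plain dictionaries, which makes the formula explicit. The corpus and sentence strings are the ones used in the NLTK session above; laplace_bigram_prob and the pseudocount parameter k are hypothetical names introduced for this illustration.

```python
from collections import Counter

corpus = "<s> hello how are you doing ? Hope you find the book interesting. </s>".split()
sentence = "<s>how are you doing</s>".split()
vocabulary = set(corpus)

bigram_counts = Counter(zip(corpus, corpus[1:]))
context_counts = Counter(corpus[:-1])    # how often each word starts a bigram

def laplace_bigram_prob(prev, word, k=1.0):
    """Add-k estimate: (count(prev, word) + k) / (count(prev) + k * |V|)."""
    return (bigram_counts[(prev, word)] + k) / (context_counts[prev] + k * len(vocabulary))

probs = [laplace_bigram_prob(a, b) for a, b in zip(sentence, sentence[1:])]
print(probs)
```

With k=1 this should give the same three probabilities as the NLTK session, namely 1/13, 2/14, and 1/15. A smaller k moves less probability mass to the unseen bigrams.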

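2.2.2 Good Turing smoothing

Good Turing smoothing was proposed by Alan Turing and his statistical assistant I.J. Good. It is an effective smoothing method that improves the performance of the statistical techniques used for linguistic tasks such as word sense disambiguation (WSD), named entity recognition (NER), spelling correction, and machine translation. The method helps to predict the probability of unseen objects. It assumes that the objects of interest follow a binomial distribution. Given a large sample, the method can be used to compute the probability mass of samples that occur zero times or only a few times. Simple Good Turing smooths the frequency-of-frequency values by fitting a straight line to them in log space using linear regression. If c* is the adjusted count, it is computed as follows:

c* = (c + 1) N(c + 1) / N(c)    for c >= 1

For c == 0, that is, for samples with zero frequency in the training data, the total probability mass is N(1) / N.

Here, c is the original count, N(i) is the number of event types observed with count i, and N is the total number of observed events.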
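
The adjusted-count formula can be tried on a toy sample. The sketch below applies the plain Good Turing formula only, without the log-space regression that Simple Good Turing adds; the function names and the toy token string are assumptions made for this example.

```python
from collections import Counter

def good_turing_adjusted_counts(counts):
    """Map each observed count c to c* = (c + 1) * N(c + 1) / N(c)."""
    freq_of_freq = Counter(counts.values())   # N(c): number of types seen c times
    adjusted = {}
    for c in sorted(freq_of_freq):
        if freq_of_freq[c + 1] > 0:
            adjusted[c] = (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c]
        else:
            # N(c + 1) is zero, so the plain formula gives 0 here; this is
            # the gap that Simple Good Turing closes by smoothing N(c)
            adjusted[c] = None
    return adjusted

def unseen_mass(counts):
    """Total probability mass reserved for unseen types: N(1) / N."""
    freq_of_freq = Counter(counts.values())
    return freq_of_freq[1] / sum(counts.values())

counts = Counter("a a a b b c c d e f".split())
print(good_turing_adjusted_counts(counts))
print(unseen_mass(counts))
```

In this sample, three types occur once, two occur twice, and one occurs three times, so a count of 1 is adjusted to 2 * 2 / 3 and a count of 2 to 3 * 1 / 2. The highest count has no N(c + 1) to draw on, which illustrates why the raw formula needs smoothing for large counts.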

Bill Gale and Geoffrey Sampson presented Simple Good Turing smoothing:

class SimpleGoodTuringProbDist(ProbDistI):
    """
    Given a pair (pi, qi), where pi refers to the frequency and
    qi refers to the frequency of frequency, our aim is to minimize
    the square variation. E(p) and E(q) are the means of pi and qi.

    - slope: b = sigma((pi-E(p))(qi-E(q))) / sigma((pi-E(p))(pi-E(p)))
    - intercept: a = E(q) - b.E(p)
    """
    SUM_TO_ONE = False
    def __init__(self, freqdist, bins=None):
        """
        The freqdist parameter refers to the frequency counts from which
        the probability distribution is estimated.
        The bins parameter is the possible number of sample types.
        """
        assert bins is None or bins > freqdist.B(),\
               'bins parameter must not be less than %d=freqdist.B()+1' % (freqdist.B()+1)
        if bins is None:
            bins = freqdist.B() + 1
        self._freqdist = freqdist
        self._bins = bins
        r, nr = self._r_Nr()
        self.find_best_fit(r, nr)
        self._switch(r, nr)
        self._renormalize(r, nr)

    def _r_Nr_non_zero(self):
        r_Nr = self._freqdist.r_Nr()
        del r_Nr[0]
        return r_Nr

    def _r_Nr(self):
        """
        Split the frequency distribution into two lists (r, Nr), where Nr(r) > 0.
        """
        nonzero = self._r_Nr_non_zero()

        if not nonzero:
            return [], []
        return zip(*sorted(nonzero.items()))

    def find_best_fit(self, r, nr):
        """
        Use simple linear regression to tune the parameters self._slope
        and self._intercept in log-log space, based on count and
        Nr(count). (Work in log space to avoid floating point underflow.)
        """
        # For higher sample frequencies the data points become horizontal
        # along the line Nr=1. To create a more evident linear model in
        # log-log space, we average positive Nr values with the
        # surrounding zero values. (Church and Gale, 1991)

        if not r or not nr:
            # Empty r or nr?
            return

        zr = []
        for j in range(len(r)):
            i = (r[j-1] if j > 0 else 0)
            k = (2 * r[j] - i if j == len(r) - 1 else r[j+1])
            zr_ = 2.0 * nr[j] / (k - i)
            zr.append(zr_)

        log_r = [math.log(i) for i in r]
        log_zr = [math.log(i) for i in zr]

        xy_cov = x_var = 0.0
        x_mean = 1.0 * sum(log_r) / len(log_r)
        y_mean = 1.0 * sum(log_zr) / len(log_zr)
        for (x, y) in zip(log_r, log_zr):
            xy_cov += (x - x_mean) * (y - y_mean)
            x_var += (x - x_mean)**2
        self._slope = (xy_cov / x_var if x_var != 0 else 0.0)
        if self._slope >= -1:
            warnings.warn('SimpleGoodTuring did not find a proper best fit '
                          'line for smoothing probabilities of occurrences. '
                          'The probability estimates are likely to be '
                          'unreliable.')
        self._intercept = y_mean - self._slope * x_mean

    def _switch(self, r, nr):
        """
        Calculate the r frontier where we must switch from Nr to Sr
        when estimating E[Nr].
        """
        for i, r_ in enumerate(r):
            if len(r) == i + 1 or r[i+1] != r_ + 1:
                # We are at the end of r, or there is a gap in r
                self._switch_at = r_
                break

            Sr = self.smoothedNr
            smooth_r_star = (r_ + 1) * Sr(r_+1) / Sr(r_)
            unsmooth_r_star = 1.0 * (r_ + 1) * nr[i+1] / nr[i]

            std = math.sqrt(self._variance(r_, nr[i], nr[i+1]))
            if abs(unsmooth_r_star-smooth_r_star) <= 1.96 * std:
                self._switch_at = r_
                break

    def _variance(self, r, nr, nr_1):
        r = float(r)
        nr = float(nr)
        nr_1 = float(nr_1)
        return (r + 1.0)**2 * (nr_1 / nr**2) * (1.0 + nr_1 / nr)

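    def _renormalize(self, r, nr):
        """
        Renormalization is essential to obtain a proper probability
        distribution. The probability of unseen samples is estimated as
        N(1)/N, and the probabilities of all the previously seen samples
        are then renormalized.
        """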
        prob_cov = 0.0
        for r_, nr_ in zip(r, nr):
            prob_cov += nr_ * self._prob_measure(r_)
        if prob_cov:
            self._renormal = (1 - self._prob_measure(0)) / prob_cov

    def smoothedNr(self, r):
        """
        Return the smoothed number of samples with count r.
        """

        # Nr = a*r^b (with b < -1 to give the appropriate hyperbolic
        # relationship)
        # Estimate a and b by simple linear regression technique on
        # the logarithmic form of the equation: log Nr = a + b*log(r)

        return math.exp(self._intercept + self._slope * math.log(r))

    def prob(self, sample):
        """
        Return the sample's probability.
        """
        count = self._freqdist[sample]
        p = self._prob_measure(count)
        if count == 0:
            if self._bins == self._freqdist.B():
                p = 0.0
            else:
                p = p / (1.0 * self._bins - self._freqdist.B())
        else:
            p = p * self._renormal
        return p

    def _prob_measure(self, count):
        if count == 0 and self._freqdist.N() == 0 :
            return 1.0
        elif count == 0 and self._freqdist.N() != 0:
            return 1.0 * self._freqdist.Nr(1) / self._freqdist.N()
        if self._switch_at > count:
            Er_1 = 1.0 * self._freqdist.Nr(count+1)
            Er = 1.0 * self._freqdist.Nr(count)
        else:
            Er_1 = self.smoothedNr(count+1)
            Er = self.smoothedNr(count)

        r_star = (count + 1) * Er_1 / Er
        return r_star / self._freqdist.N()

    def check(self):
        prob_sum = 0.0
        for i in range(0, len(self._Nr)):
            prob_sum += self._Nr[i] * self._prob_measure(i) / self._renormal
        print("Probability Sum:", prob_sum)
        #assert prob_sum != 1.0, "probability sum should be one!"

    def discount(self):
        """
        Return the total probability mass transferred from the
        seen events to the unseen events.
        """
        return 1.0 * self.smoothedNr(1) / self._freqdist.N()

    def max(self):
        return self._freqdist.max()

    def samples(self):
        return self._freqdist.keys()

    def freqdist(self):
        return self._freqdist

    def __repr__(self):
        """
        Return the string representation of this ProbDist.
        """
        return '<SimpleGoodTuringProbDist based on %d samples>'\
                % self._freqdist.N()

Let's look at the NLTK code that tests Simple Good Turing:

>>> gt = lambda fd, bins: SimpleGoodTuringProbDist(fd, bins=1e5)
>>> train_and_test(gt)
5.17%

2.2.3 Kneser Ney smoothing

Kneser Ney smoothing is used with trigrams. Let's look at the following NLTK code for Kneser Ney smoothing:

>>> import nltk
>>> corpus = [[((x[0],y[0],z[0]),(x[1],y[1],z[1]))
    for x, y, z in nltk.trigrams(sent)]
    for sent in corpus[:100]]
>>> tag_set = unique_list(tag for sent in corpus for (word,tag) in sent)
>>> len(tag_set)
906
>>> symbols = unique_list(word for sent in corpus for (word,tag) in sent)
>>> len(symbols)
1341
>>> trainer = nltk.tag.HiddenMarkovModelTrainer(tag_set, symbols)
>>> train_corpus = []
>>> test_corpus = []
>>> for i in range(len(corpus)):
    if i % 10:
        train_corpus += [corpus[i]]
    else:
        test_corpus += [corpus[i]]

>>> len(train_corpus)
90
>>> len(test_corpus)
10
>>> kn = lambda fd, bins: KneserNeyProbDist(fd)
>>> train_and_test(kn)
0.86%

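2.2.4 Witten Bell smoothing

Witten Bell is a smoothing algorithm designed to handle unknown words that would otherwise have zero probability. Consider the following NLTK code for Witten Bell smoothing: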

>>> train_and_test(WittenBellProbDist)
6.90%
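
The two branches of the Witten Bell formula from section 2.1.1 can be checked on a toy sample. This is a minimal sketch rather than the NLTK class: witten_bell_prob is a hypothetical helper, the sentence and bins=8 are assumptions, and bins is assumed to be larger than the number of observed types so that Z is positive.

```python
from collections import Counter

def witten_bell_prob(counts, sample, bins):
    """Witten-Bell estimate: c / (N + T) if seen, T / (Z * (N + T)) otherwise."""
    n = sum(counts.values())    # N: total number of observed events
    t = len(counts)             # T: number of observed sample types
    z = bins - t                # Z: number of unseen sample types (assumed > 0)
    c = counts[sample]
    if c:
        return c / (n + t)
    return t / (z * (n + t))

counts = Counter("the cat sat on the mat".split())
print(witten_bell_prob(counts, "the", bins=8))   # seen word
print(witten_bell_prob(counts, "dog", bins=8))   # unseen word
```

Here the seen words share N / (N + T) of the probability mass, and the remaining T / (N + T) is divided evenly among the Z unseen types, so the probabilities over all 8 bins sum to 1.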

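2.3 Developing a back-off mechanism for MLE

The Katz back-off model is a generative n-gram language model that computes the conditional probability of a token given its history in the n-gram. According to this model, if an n-gram has occurred more than k times in the training data (where k is a count threshold), the conditional probability of the token given its history is proportional to the MLE of that n-gram. Otherwise, the conditional probability is equal to the back-off conditional probability of the (n-1)-gram.

The following is the NLTK code for the Katz back-off model: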

def prob(self, word, context):
    """
    Evaluate the probability of this word in this context using Katz
    Backoff.

    :param word: the word to get the probability of
    :type word: str
    :param context: the context the word is in
    :type context: list(str)
    """
    context = tuple(context)
    if (context + (word,) in self._ngrams) or (self._n == 1):
        return self[context].prob(word)
    else:
        return self._alpha(context) * self._backoff.prob(word, context[1:])
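
The control flow of backing off can be illustrated with a much simpler scheme. The sketch below is not Katz back-off: it uses the so-called stupid back-off heuristic, in which a fixed penalty alpha (0.4 here, an assumed value) replaces the context-dependent weight, and no discounting is applied. The results are therefore scores rather than normalized probabilities. The toy text and the name backoff_score are also assumptions.

```python
from collections import Counter

tokens = "the cat sat on the mat and the cat slept".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def backoff_score(prev, word, alpha=0.4):
    """Relative bigram frequency if the bigram was seen, else a penalized unigram frequency."""
    if bigrams[(prev, word)] > 0:
        return bigrams[(prev, word)] / unigrams[prev]
    # The bigram is unseen: back off to the lower-order (unigram) model
    return alpha * unigrams[word] / len(tokens)

print(backoff_score("the", "cat"))   # seen bigram
print(backoff_score("cat", "the"))   # unseen bigram, backs off to the unigram
```

Katz back-off follows the same if/else structure, but it discounts the higher-order estimate and chooses the back-off weight for each context so that the probabilities sum to 1.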

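2.4 Applying interpolation on data to get a mix and match

The limitation of an additive smoothed bigram model is that it backs off to a state of ignorance when it deals with rare text. For example, suppose the word captivating occurs five times in the training data: three times followed by "by" and twice followed by "the". With additive smoothing, "a" and "new" are assigned the same probability of following captivating. Both continuations are plausible, but the former is more probable than the latter. This problem can be fixed by using unigram probabilities: we can develop an interpolation model that combines the unigram and bigram probabilities.

In the language model training toolkit SRILM, we perform interpolation by first training a unigram model with -order 1 and then training a bigram model with -order 2: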

ngram-count -text /home/linux/ieng6/ln165w/public/data/engandhintrain.txt \
    -vocab /home/linux/ieng6/ln165w/public/data/engandhinlexicon.txt \
    -order 1 -addsmooth 0.0001 -lm wsj1.lm
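
The idea of mixing the two models can be sketched in a few lines of Python. This is a toy illustration of linear interpolation, not what SRILM does internally: the token string is constructed to match the captivating example, and the weight lam=0.7 and the function names are assumptions.

```python
from collections import Counter

# Toy data: "captivating" is followed by "by" three times and by "the" twice;
# "a" is a frequent word and "new" a rare one
tokens = ("captivating by captivating by captivating by "
          "captivating the captivating the a a a a new").split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
contexts = Counter(tokens[:-1])

def interpolated_prob(prev, word, lam=0.7):
    """lam * P_bigram(word | prev) + (1 - lam) * P_unigram(word)."""
    p_bigram = bigrams[(prev, word)] / contexts[prev] if contexts[prev] else 0.0
    p_unigram = unigrams[word] / len(tokens)
    return lam * p_bigram + (1 - lam) * p_unigram

print(interpolated_prob("captivating", "a"))
print(interpolated_prob("captivating", "new"))
```

Neither "a" nor "new" follows captivating in the toy data, but the unigram term now gives the more frequent word "a" the higher probability, which is the behavior the paragraph above asks for.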

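2.5 Evaluating a language model through perplexity

The nltk.model.ngram module in NLTK provides a method, perplexity(text), which evaluates the perplexity of a given text. Perplexity is defined as 2 ** cross-entropy of the text. It measures how well a probability model or probability distribution predicts the text: the lower the perplexity, the better the prediction.

The code for evaluating the perplexity of a text, as it appears in the nltk.model.ngram module, is as follows: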

def perplexity(self, text):
    """
    Calculates the perplexity of the given text.
    This is simply 2 ** cross-entropy for the text.

    :param text: words to calculate perplexity of
    :type text: list(str)
    """

    return pow(2.0, self.entropy(text))
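
To make the definition concrete, the following sketch computes the cross-entropy and perplexity of a text under a simple unigram model. It assumes that the model is given as a function from a token to its probability; the function names and the two toy models are hypothetical.

```python
import math

def cross_entropy(prob, tokens):
    """Average negative log2 probability per token."""
    return -sum(math.log2(prob(tok)) for tok in tokens) / len(tokens)

def perplexity(prob, tokens):
    """Perplexity is 2 ** cross-entropy."""
    return 2.0 ** cross_entropy(prob, tokens)

# A uniform model over a four-word vocabulary
uniform = lambda tok: 0.25
print(perplexity(uniform, ["a", "b", "c", "d", "a"]))

# A skewed model that matches the text better on frequent words
skewed = {"a": 0.5, "b": 0.25, "c": 0.25}.get
print(perplexity(skewed, ["a", "a", "b", "c"]))
```

A uniform model over four words has a perplexity of 4: it is as uncertain as a fair four-way choice at every token. The skewed model assigns more probability to the tokens that actually occur, so its perplexity on the second text is lower.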

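2.6 Applying Metropolis-Hastings in modeling languages

In Markov Chain Monte Carlo (MCMC), there are several ways of sampling from a posterior probability. One way is to use the Metropolis-Hastings sampler. To implement the Metropolis-Hastings algorithm, we need a standard uniform distribution, a proposal distribution, and a target distribution that is proportional to the posterior probability. The next section discusses an example of the Metropolis-Hastings algorithm.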
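
The three ingredients can be seen in a minimal one-parameter sampler that needs only the standard library. This is a sketch under stated assumptions, not the code of the next section: the target is the posterior of a single coin with a Beta(2, 3) prior, 11 heads, and 14 tosses (the values of the first coin in section 2.7), the proposal is a normal random walk, and the step size, iteration counts, and seed are arbitrary choices.

```python
import math
import random

def log_target(theta, z=11, n=14, a=2, b=3):
    """Unnormalized log posterior: Beta(a, b) prior times Bernoulli likelihood."""
    if not 0.0 < theta < 1.0:
        return float("-inf")      # zero density outside (0, 1)
    return (a + z - 1) * math.log(theta) + (b + n - z - 1) * math.log(1.0 - theta)

def metropolis_hastings(n_iters=20000, burn_in=1000, step=0.2, seed=0):
    rng = random.Random(seed)
    theta = 0.5
    samples = []
    for i in range(n_iters):
        # Proposal distribution: a symmetric normal random walk
        proposal = rng.gauss(theta, step)
        log_ratio = log_target(proposal) - log_target(theta)
        # Standard uniform draw: accept with probability min(1, ratio)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            theta = proposal
        if i >= burn_in:
            samples.append(theta)
    return samples

samples = metropolis_hastings()
print(sum(samples) / len(samples))
```

Because the prior is conjugate, the exact posterior is Beta(13, 6) with mean 13/19, so the sample mean gives a quick sanity check. Working with log densities avoids underflow when the number of tosses is large.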

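2.7 Applying Gibbs sampling in language processing

With Gibbs sampling, a Markov chain is built by sampling from the conditional probabilities. One cycle of Gibbs sampling is complete when the iteration over all the parameters is done. When it is not possible to sample from a conditional distribution, the Metropolis-Hastings algorithm can be used instead, which is referred to as Metropolis within Gibbs. Gibbs sampling can be regarded as Metropolis-Hastings sampling with a special proposal distribution. In each iteration, we draw a proposal for a new value of each particular parameter.

Consider an example of tossing two coins, each characterized by its number of heads and its number of tosses: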

import numpy as np
import matplotlib.pyplot as plt
from functools import partial
from scipy import stats

def bern(theta,z,N):
    """Bernoulli likelihood with N trials and z successes."""
    return np.clip(theta**z*(1-theta)**(N-z),0,1)

def bern2(theta1,theta2,z1,z2,N1,N2):
    """Joint Bernoulli likelihood of two independent coins."""
    return bern(theta1,z1,N1)*bern(theta2,z2,N2)

def make_thetas(xmin,xmax,n):
    # Midpoints of n-1 equally wide intervals between xmin and xmax
    xs=np.linspace(xmin,xmax,n)
    widths=(xs[1:]-xs[:-1])/2.0
    thetas=xs[:-1]+widths
    return thetas

def make_plots(X,Y,prior,likelihood,posterior,projection=None):
    fig,ax=plt.subplots(1,3,subplot_kw=dict(projection=projection,aspect='equal'),figsize=(12,3))
    if projection=='3d':
        ax[0].plot_surface(X,Y,prior,alpha=0.3,cmap=plt.cm.jet)
        ax[1].plot_surface(X,Y,likelihood,alpha=0.3,cmap=plt.cm.jet)
        ax[2].plot_surface(X,Y,posterior,alpha=0.3,cmap=plt.cm.jet)
    else:
        ax[0].contour(X,Y,prior)
        ax[1].contour(X,Y,likelihood)
        ax[2].contour(X,Y,posterior)
    ax[0].set_title('Prior')
    ax[1].set_title('Likelihood')
    ax[2].set_title('Posterior')
    plt.tight_layout()

thetas1=make_thetas(0,1,101)
thetas2=make_thetas(0,1,101)
X,Y=np.meshgrid(thetas1,thetas2)

For the Metropolis algorithm, consider the following values:

a=2
b=3

z1=11
N1=14
z2=7
N2=14

prior=lambda theta1,theta2:stats.beta(a,b).pdf(theta1)*stats.beta(a,b).pdf(theta2)
lik=partial(bern2,z1=z1,z2=z2,N1=N1,N2=N2)
target=lambda theta1,theta2:prior(theta1,theta2)*lik(theta1,theta2)

theta=np.array([0.5,0.5])
niters=10000
burnin=500
sigma=np.diag([0.2,0.2])

thetas=np.zeros((niters-burnin,2),float)
for i in range(niters):
    # Propose a new point from a normal distribution centred on theta
    new_theta=stats.multivariate_normal(theta,sigma).rvs()
    # Acceptance probability for a symmetric proposal
    p=min(target(*new_theta)/target(*theta),1)
    if np.random.rand()<p:
        theta=new_theta
    if i>=burnin:
        thetas[i-burnin]=theta
kde=stats.gaussian_kde(thetas.T)
XY=np.vstack([X.ravel(),Y.ravel()])
posterior_metropolis=kde(XY).reshape(X.shape)
make_plots(X,Y,prior(X,Y),lik(X,Y),posterior_metropolis)
make_plots(X,Y,prior(X,Y),lik(X,Y),posterior_metropolis,projection='3d')

For Gibbs sampling, consider the following values:

a=2
b=3

z1=11
N1=14
z2=7
N2=14

prior=lambda theta1,theta2:stats.beta(a,b).pdf(theta1)*stats.beta(a,b).pdf(theta2)
lik=partial(bern2,z1=z1,z2=z2,N1=N1,N2=N2)
target=lambda theta1,theta2:prior(theta1,theta2)*lik(theta1,theta2)

theta=np.array([0.5,0.5])
niters=10000
burnin=500
sigma=np.diag([0.2,0.2])

thetas=np.zeros((niters-burnin,2),float)
for i in range(niters):
    # Update each parameter in turn from its conditional distribution
    theta=[stats.beta(a+z1,b+N1-z1).rvs(),theta[1]]
    theta=[theta[0],stats.beta(a+z2,b+N2-z2).rvs()]

    if i>=burnin:
        thetas[i-burnin]=theta
kde=stats.gaussian_kde(thetas.T)
XY=np.vstack([X.ravel(),Y.ravel()])
posterior_gibbs=kde(XY).reshape(X.shape)
make_plots(X,Y,prior(X,Y),lik(X,Y),posterior_gibbs)
make_plots(X,Y,prior(X,Y),lik(X,Y),posterior_gibbs,projection='3d')

The preceding Metropolis and Gibbs code produces 2D and 3D plots of the prior, the likelihood, and the posterior.
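
The Gibbs update can also be sketched without NumPy, SciPy, or plotting. The version below uses the same prior and coin counts as the example above and only the standard library. The function name, iteration counts, and seed are assumptions. In this model the two coins are independent, so each conditional distribution is a Beta distribution that does not depend on the other parameter.

```python
import random

def gibbs_two_coins(z1=11, n1=14, z2=7, n2=14, a=2, b=3,
                    n_iters=5000, burn_in=500, seed=0):
    """Gibbs sampler for two coins with Beta(a, b) priors."""
    rng = random.Random(seed)
    theta1, theta2 = 0.5, 0.5
    samples = []
    for i in range(n_iters):
        # One Gibbs cycle: update each parameter from its conditional distribution
        theta1 = rng.betavariate(a + z1, b + n1 - z1)
        theta2 = rng.betavariate(a + z2, b + n2 - z2)
        if i >= burn_in:
            samples.append((theta1, theta2))
    return samples

samples = gibbs_two_coins()
mean1 = sum(s[0] for s in samples) / len(samples)
mean2 = sum(s[1] for s in samples) / len(samples)
print(mean1, mean2)
```

The sample means should be close to the exact posterior means, 13/19 for the first coin and 9/19 for the second. In models where the parameters depend on each other, each conditional would take the current value of the other parameter as input.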

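2.8 Summary

In this chapter, we discussed word frequencies (unigrams, bigrams, and trigrams). You learned about maximum likelihood estimation and its implementation in NLTK. We also discussed interpolation, back-off, Gibbs sampling, and the Metropolis-Hastings algorithm, as well as how to evaluate a language model through perplexity.

In the next chapter, we will discuss stemmers and lemmatizers, and the creation of a morphological generator using machine learning tools.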
