[Natural Language Processing with Python] 7.3 Developing and Evaluating Chunkers

Added: 2013-5-31

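Reading IOB Format and the CoNLL 2000 Chunking Corpus

The CoNLL 2000 corpus is text that has already been part-of-speech tagged and chunked using IOB notation.

The corpus provides three chunk types: NP, VP, and PP.

For example: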

he PRP B-NP
accepted VBD B-VP
the DT B-NP
position NN I-NP
...

The function nltk.chunk.conllstr2tree() builds a tree representation from a string in this format.

For example:

>>> import nltk
>>> text = '''
... he PRP B-NP
... accepted VBD B-VP
... the DT B-NP
... position NN I-NP
... of IN B-PP
... vice NN B-NP
... chairman NN I-NP
... of IN B-PP
... Carlyle NNP B-NP
... Group NNP I-NP
... , , O
... a DT B-NP
... merchant NN I-NP
... banking NN I-NP
... concern NN I-NP
... . . O
... '''
>>> nltk.chunk.conllstr2tree(text, chunk_types=['NP']).draw()

Running this draws the tree shown in the figure:
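To make the IOB encoding concrete, here is a minimal pure-Python sketch of how IOB-tagged lines can be grouped into chunks. It is not NLTK's implementation of conllstr2tree: the helper name iob_to_chunks is hypothetical, and it simplifies by assuming every I- tag continues the chunk opened by the preceding B- tag.

```python
def iob_to_chunks(conll_text, chunk_types=("NP",)):
    """Group 'word POS IOB-tag' lines into (label, [(word, pos), ...]) items.

    Tokens outside any selected chunk are returned as plain (word, pos) pairs.
    Hypothetical helper for illustration; not part of NLTK.
    """
    result = []
    current = None  # tokens of the chunk currently being built, if any
    for line in conll_text.strip().splitlines():
        word, pos, iob = line.split()
        state, _, label = iob.partition("-")  # "B-NP" -> ("B", "-", "NP")
        if label not in chunk_types:
            state = "O"  # treat unselected chunk types as outside
        if state == "B" or (state == "I" and current is None):
            current = [(word, pos)]
            result.append((label, current))
        elif state == "I":
            current.append((word, pos))
        else:  # "O": the token stands alone
            current = None
            result.append((word, pos))
    return result


sample = """
he PRP B-NP
accepted VBD B-VP
the DT B-NP
position NN I-NP
"""
chunks = iob_to_chunks(sample)
print(chunks)
```

With chunk_types restricted to NP, the VP token is kept as an unchunked (word, pos) pair, which mirrors what the chunk_types argument does in the NLTK call above.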

We can work with the CoNLL 2000 chunking corpus as follows:

# Access the chunked corpus files

>>> from nltk.corpus import conll2000
>>> print(conll2000.chunked_sents('train.txt')[99])
(S
  (PP Over/IN)
  (NP a/DT cup/NN)
  (PP of/IN)
  (NP coffee/NN)
  ,/,
  (NP Mr./NNP Stone/NNP)
  (VP told/VBD)
  (NP his/PRP$ story/NN)
  ./.)

# If you are only interested in NP chunks, you can write:

>>> print(conll2000.chunked_sents('train.txt', chunk_types=['NP'])[99])
(S
  Over/IN
  (NP a/DT cup/NN)
  of/IN
  (NP coffee/NN)
  ,/,
  (NP Mr./NNP Stone/NNP)
  told/VBD
  (NP his/PRP$ story/NN)
  ./.)

Simple Evaluation and Baselines

>>> test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
>>> grammar = r"NP: {<[CDJNP].*>+}"
>>> cp = nltk.RegexpParser(grammar)
>>> print(cp.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  87.7%
    Precision:     70.6%
    Recall:        67.8%
    F-Measure:     69.2%
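The precision, recall, and F-measure figures compare the set of chunks the parser proposes with the set of chunks in the gold standard. The sketch below illustrates the arithmetic only and is not NLTK's scoring code: chunks are represented as (start, end) token spans, and the function name chunk_scores and the sample spans are made up for the example.

```python
def chunk_scores(guessed, gold):
    """Return (precision, recall, f_measure) for two sets of chunk spans."""
    correct = len(guessed & gold)  # chunks whose spans match exactly
    precision = correct / len(guessed) if guessed else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f_measure


# Hypothetical spans: the chunker proposed 4 chunks, 3 of which are among the 5 gold chunks.
guessed = {(0, 1), (2, 4), (5, 6), (7, 9)}
gold = {(0, 1), (2, 4), (5, 7), (7, 9), (10, 12)}
p, r, f = chunk_scores(guessed, gold)
print(f"precision={p:.2f} recall={r:.2f} f={f:.3f}")
```

Applying the same harmonic-mean formula to the precision and recall reported above (70.6% and 67.8%) gives about 69.2%, which matches the reported F-measure.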

We can use a unigram tagger to build a chunker.

# We define a chunker class with a constructor and a parse method, which is used to chunk new sentences

Example 7-4. Noun phrase chunking with a unigram tagger.

class UnigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        # Train on (POS tag, chunk tag) pairs extracted from the chunk trees
        train_data = [[(t, c) for w, t, c in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.UnigramTagger(train_data)

    def parse(self, sentence):
        pos_tags = [pos for (word, pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word, pos), chunktag)
                     in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)

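Note the parse method. It works as follows:

1. It takes a POS-tagged sentence as input.

2. It begins by extracting the part-of-speech tags from that sentence.

3. It uses the tagger self.tagger, which was trained in the constructor, to label the part-of-speech tags with IOB chunk tags.

4. It extracts the chunk tags and combines them with the original sentence.

5. It converts the result into a chunk tree.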
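The steps above can be sketched without NLTK by using a toy unigram model that maps each part-of-speech tag to its most frequent IOB chunk tag. This is a simplified stand-in for nltk.UnigramTagger, not its implementation; the names train_unigram and chunk_tags, the tiny training set, and the choice to tag unseen POS tags as "O" are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train_unigram(train_data):
    """Map each POS tag to the IOB chunk tag it most often receives."""
    counts = defaultdict(Counter)
    for sent in train_data:
        for pos, chunktag in sent:
            counts[pos][chunktag] += 1
    return {pos: c.most_common(1)[0][0] for pos, c in counts.items()}


def chunk_tags(model, sentence):
    """Steps 1-4: turn a POS-tagged sentence into (word, pos, chunktag) triples."""
    pos_tags = [pos for (word, pos) in sentence]           # step 2
    chunktags = [model.get(pos, "O") for pos in pos_tags]  # step 3 (assumption: unseen -> "O")
    return [(word, pos, c)                                 # step 4
            for (word, pos), c in zip(sentence, chunktags)]


# Toy training data: each sentence is a list of (POS tag, chunk tag) pairs.
train_data = [
    [("DT", "B-NP"), ("NN", "I-NP"), ("VBD", "O"), ("DT", "B-NP"), ("NN", "I-NP")],
    [("NN", "B-NP"), ("VBD", "O"), ("PRP", "B-NP")],
]
model = train_unigram(train_data)
triples = chunk_tags(model, [("the", "DT"), ("dog", "NN"), ("barked", "VBD")])
print(triples)
```

Step 5 would pass these triples to nltk.chunk.conlltags2tree to build the chunk tree.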

Once the chunker is defined, we train it on the chunking corpus.

>>> test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
>>> train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])
>>> unigram_chunker = UnigramChunker(train_sents)
>>> print(unigram_chunker.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  92.9%
    Precision:     79.9%
    Recall:        86.8%
    F-Measure:     83.2%

# The following code shows what the chunker has learned

>>> postags = sorted(set(pos for sent in train_sents
...                      for (word, pos) in sent.leaves()))
>>> print(unigram_chunker.tagger.tag(postags))
[('#', 'B-NP'), ('$', 'B-NP'), ("''", 'O'), ('(', 'O'), (')', 'O'),
 (',', 'O'), ('.', 'O'), (':', 'O'), ('CC', 'O'), ('CD', 'I-NP'),
 ('DT', 'B-NP'), ('EX', 'B-NP'), ('FW', 'I-NP'), ('IN', 'O'),
 ('JJ', 'I-NP'), ('JJR', 'B-NP'), ('JJS', 'I-NP'), ('MD', 'O'),
 ('NN', 'I-NP'), ('NNP', 'I-NP'), ('NNPS', 'I-NP'), ('NNS', 'I-NP'),
 ('PDT', 'B-NP'), ('POS', 'B-NP'), ('PRP', 'B-NP'), ('PRP$', 'B-NP'),
 ('RB', 'O'), ('RBR', 'O'), ('RBS', 'B-NP'), ('RP', 'O'), ('SYM', 'O'),
 ('TO', 'O'), ('UH', 'O'), ('VB', 'O'), ('VBD', 'O'), ('VBG', 'O'),
 ('VBN', 'O'), ('VBP', 'O'), ('VBZ', 'O'), ('WDT', 'B-NP'),
 ('WP', 'B-NP'), ('WP$', 'B-NP'), ('WRB', 'O'), ('``', 'O')]

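Similarly, we can build a BigramChunker, which is the same class with nltk.BigramTagger used in place of nltk.UnigramTagger.

>>> bigram_chunker = BigramChunker(train_sents)
>>> print(bigram_chunker.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  93.3%
    Precision:     82.3%
    Recall:        86.8%
    F-Measure:     84.5%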
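One way to see why bigram context can help is a toy model that conditions on the previous chunk tag as well as the current part-of-speech tag. The sketch below is a simplified stand-in written in plain Python, not the NLTK BigramTagger; the names train_bigram and tag_bigram, the "<START>" sentinel, and the training data are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train_bigram(train_data):
    """Map (previous chunk tag, current POS tag) to the most frequent chunk tag."""
    counts = defaultdict(Counter)
    for sent in train_data:
        prev = "<START>"
        for pos, chunktag in sent:
            counts[(prev, pos)][chunktag] += 1
            prev = chunktag
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}


def tag_bigram(model, pos_tags):
    """Tag left to right; a context never seen in training yields None."""
    tags = []
    prev = "<START>"
    for pos in pos_tags:
        tag = model.get((prev, pos))
        tags.append(tag)
        prev = tag
    return tags


# Toy training data: each sentence is a list of (POS tag, chunk tag) pairs.
train_data = [
    [("DT", "B-NP"), ("NN", "I-NP"), ("VBD", "O"), ("NN", "B-NP")],
    [("NN", "B-NP"), ("VBD", "O")],
]
model = train_bigram(train_data)
result = tag_bigram(model, ["DT", "NN", "VBD", "NN"])
print(result)
```

In this toy data the tag NN is labelled I-NP when it follows a B-NP tag and B-NP when it follows O, a distinction that a model looking at the POS tag alone cannot make.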

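Training Classifier-Based Chunkers

The chunkers discussed so far, the regular-expression chunker and the n-gram chunkers, decide which chunks to create entirely on the basis of part-of-speech tags. However, part-of-speech tags are sometimes insufficient to determine how a sentence should be chunked.

For example:

(3) a. Joey/NN sold/VBD the/DT farmer/NN rice/NN ./.

b. Nick/NN broke/VBD my/DT computer/NN monitor/NN ./.

Although the tags are identical, the two sentences clearly should be chunked differently.

So we need to use information about the content of the words in addition to their part-of-speech tags.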
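A small check makes the point concrete. In the sketch below, the function names pos_features and word_pos_features are hypothetical; the sentences and tags are the two from example (3). With part-of-speech features alone the two sentences produce identical feature sequences, so no classifier could chunk them differently, whereas adding the word distinguishes them.

```python
def pos_features(sentence, i):
    """Feature set that looks only at the part-of-speech tag."""
    word, pos = sentence[i]
    return {"pos": pos}


def word_pos_features(sentence, i):
    """Feature set that also includes the word itself."""
    word, pos = sentence[i]
    return {"pos": pos, "word": word}


sent_a = [("Joey", "NN"), ("sold", "VBD"), ("the", "DT"),
          ("farmer", "NN"), ("rice", "NN"), (".", ".")]
sent_b = [("Nick", "NN"), ("broke", "VBD"), ("my", "DT"),
          ("computer", "NN"), ("monitor", "NN"), (".", ".")]

pos_only_a = [pos_features(sent_a, i) for i in range(len(sent_a))]
pos_only_b = [pos_features(sent_b, i) for i in range(len(sent_b))]
with_words_a = [word_pos_features(sent_a, i) for i in range(len(sent_a))]
with_words_b = [word_pos_features(sent_b, i) for i in range(len(sent_b))]

print(pos_only_a == pos_only_b)      # True: the sentences are indistinguishable
print(with_words_a == with_words_b)  # False: word features tell them apart
```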

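One way to incorporate information about the content of words is to use a classifier-based tagger to chunk the sentence.

The basic code for the classifier-based NP chunker is shown below: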


# The second class is basically a wrapper around the tagger class that turns it into a chunker. During training, this second class maps the chunk trees in the training corpus into tag sequences.

# In the parse method, it converts the tag sequence provided by the tagger back into a chunk tree.

class ConsecutiveNPChunkTagger(nltk.TaggerI):
    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = npchunk_features(untagged_sent, i, history)
                train_set.append((featureset, tag))
                history.append(tag)
        # algorithm='megam' requires the external megam binary to be installed
        self.classifier = nltk.MaxentClassifier.train(
            train_set, algorithm='megam', trace=0)

    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = npchunk_features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        return zip(sentence, history)

class ConsecutiveNPChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        tagged_sents = [[((w, t), c) for (w, t, c) in
                         nltk.chunk.tree2conlltags(sent)]
                        for sent in train_sents]
        self.tagger = ConsecutiveNPChunkTagger(tagged_sents)

    def parse(self, sentence):
        tagged_sents = self.tagger.tag(sentence)
        conlltags = [(w, t, c) for ((w, t), c) in tagged_sents]
        return nltk.chunk.conlltags2tree(conlltags)
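The tag method classifies tokens one at a time, left to right, and records each decision in history so that later feature extraction can consult it. The sketch below isolates that loop in plain Python. It is an illustration under stated assumptions, not the NLTK classes: greedy_tag, features, and rule_classifier are hypothetical names, and the hand-written rule stands in for a trained classifier.

```python
def greedy_tag(sentence, extract_features, classify):
    """Tag tokens left to right, exposing earlier decisions via `history`."""
    history = []
    for i in range(len(sentence)):
        featureset = extract_features(sentence, i, history)
        history.append(classify(featureset))
    return list(zip(sentence, history))


def features(sentence, i, history):
    """Hypothetical features: current POS tag plus the previous chunk tag."""
    word, pos = sentence[i]
    prev_chunk = history[-1] if history else "<START>"
    return {"pos": pos, "prev_chunk": prev_chunk}


def rule_classifier(featureset):
    """Hand-written stand-in for a trained classifier."""
    if featureset["pos"] in ("DT", "JJ", "NN"):
        inside = featureset["prev_chunk"] in ("B-NP", "I-NP")
        return "I-NP" if inside else "B-NP"
    return "O"


sentence = [("the", "DT"), ("old", "JJ"), ("dog", "NN"), ("barked", "VBD")]
tagged = greedy_tag(sentence, features, rule_classifier)
print(tagged)
```

Because each decision is final once made, this is a greedy procedure: an early mistake is carried forward in history rather than revised.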

Next, we define a feature extractor function:

>>> def npchunk_features(sentence, i, history):
...     word, pos = sentence[i]
...     return {"pos": pos}
>>> chunker = ConsecutiveNPChunker(train_sents)
>>> print(chunker.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  92.9%
    Precision:     79.9%
    Recall:        86.7%
    F-Measure:     83.2%

We can improve this classifier-based tagger by adding a feature for the previous part-of-speech tag.

>>> def npchunk_features(sentence, i, history):
...     word, pos = sentence[i]
...     if i == 0:
...         prevword, prevpos = "<START>", "<START>"
...     else:
...         prevword, prevpos = sentence[i-1]
...     return {"pos": pos, "prevpos": prevpos}
>>> chunker = ConsecutiveNPChunker(train_sents)
>>> print(chunker.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  93.6%
    Precision:     81.9%
    Recall:        87.1%
    F-Measure:     84.4%

Besides the two part-of-speech tags, we can also add the content of the current word as a feature.

>>> def npchunk_features(sentence, i, history):
...     word, pos = sentence[i]
...     if i == 0:
...         prevword, prevpos = "<START>", "<START>"
...     else:
...         prevword, prevpos = sentence[i-1]
...     return {"pos": pos, "word": word, "prevpos": prevpos}
>>> chunker = ConsecutiveNPChunker(train_sents)
>>> print(chunker.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  94.2%
    Precision:     83.4%
    Recall:        88.6%
    F-Measure:     85.9%

We can try adding several more kinds of features to improve the chunker's performance. For example, the code below adds a lookahead feature, paired features, and a complex contextual feature. This last feature, tags-since-dt, creates a string describing the set of all part-of-speech tags encountered since the most recent determiner.

>>> def npchunk_features(sentence, i, history):
...     word, pos = sentence[i]
...     if i == 0:
...         prevword, prevpos = "<START>", "<START>"
...     else:
...         prevword, prevpos = sentence[i-1]
...     if i == len(sentence)-1:
...         nextword, nextpos = "<END>", "<END>"
...     else:
...         nextword, nextpos = sentence[i+1]
...     return {"pos": pos,
...             "word": word,
...             "prevpos": prevpos,
...             "nextpos": nextpos,
...             "prevpos+pos": "%s+%s" % (prevpos, pos),
...             "pos+nextpos": "%s+%s" % (pos, nextpos),
...             "tags-since-dt": tags_since_dt(sentence, i)}
>>> def tags_since_dt(sentence, i):
...     tags = set()
...     for word, pos in sentence[:i]:
...         if pos == 'DT':
...             tags = set()
...         else:
...             tags.add(pos)
...     return '+'.join(sorted(tags))
>>> chunker = ConsecutiveNPChunker(train_sents)
>>> print(chunker.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  95.9%
    Precision:     88.3%
    Recall:        90.7%
    F-Measure:     89.5%
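To see what the tags-since-dt feature produces, the function can be run on its own over a short sentence. The function body below follows the definition above; the example sentence is hypothetical and chosen only for illustration.

```python
def tags_since_dt(sentence, i):
    """Join the POS tags seen since the most recent determiner before position i."""
    tags = set()
    for word, pos in sentence[:i]:
        if pos == "DT":
            tags = set()  # a determiner resets the collection
        else:
            tags.add(pos)
    return "+".join(sorted(tags))


# Hypothetical example sentence, chosen only for illustration.
sentence = [("she", "PRP"), ("saw", "VBD"), ("the", "DT"),
            ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN")]
values = [tags_since_dt(sentence, i) for i in range(len(sentence))]
for (word, pos), value in zip(sentence, values):
    print(word, repr(value))
```

The value for "the" is "PRP+VBD", and it resets to the empty string for "little", the first word after the determiner. Because the tags are collected in a set, the two adjectives before "dog" contribute a single "JJ".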
