[Natural Language Processing with Python] 7.3 Developing and Evaluating Chunkers

    Added: 2013-5-31

    Reading IOB Format and the CoNLL-2000 Chunking Corpus

    The CoNLL-2000 corpus is text that has already been part-of-speech tagged and chunked, with the chunks marked in IOB notation.

    The corpus provides three chunk types: NP, VP, and PP.

    For example:

    he PRP B-NP
    accepted VBD B-VP
    the DT B-NP
    position NN I-NP
    ...
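
    To make the format concrete, here is a minimal pure-Python sketch (no NLTK required) that reads lines like the ones above and groups the words into chunks. The function name iob_to_chunks is made up for illustration and is not part of NLTK; words tagged O are simply dropped in this simplified version.

```python
def iob_to_chunks(text):
    """Group IOB-tagged lines ("word POS chunktag") into (type, words) chunks."""
    chunks = []
    current_type, current_words = None, []
    for line in text.strip().splitlines():
        word, pos, tag = line.split()
        if tag.startswith("I-") and current_type == tag[2:]:
            # I- continues the chunk that is currently open.
            current_words.append(word)
            continue
        # Any other tag closes the open chunk first.
        if current_type is not None:
            chunks.append((current_type, current_words))
            current_type, current_words = None, []
        if tag != "O":
            # B- (or a stray I-) opens a new chunk of the given type.
            current_type, current_words = tag[2:], [word]
    if current_type is not None:
        chunks.append((current_type, current_words))
    return chunks


sample = """
he PRP B-NP
accepted VBD B-VP
the DT B-NP
position NN I-NP
"""
print(iob_to_chunks(sample))
```

    B- opens a chunk, I- extends the chunk that is open, and O marks a token outside any chunk.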



    The function chunk.conllstr2tree() builds a tree representation from a string in this format.



    For example:




    >>> import nltk
    >>> text = '''
    ... he PRP B-NP
    ... accepted VBD B-VP
    ... the DT B-NP
    ... position NN I-NP
    ... of IN B-PP
    ... vice NN B-NP
    ... chairman NN I-NP
    ... of IN B-PP
    ... Carlyle NNP B-NP
    ... Group NNP I-NP
    ... , , O
    ... a DT B-NP
    ... merchant NN I-NP
    ... banking NN I-NP
    ... concern NN I-NP
    ... . . O
    ... '''
    >>> # Keep only the NP chunks and draw the resulting tree in a window.
    >>> nltk.chunk.conllstr2tree(text, chunk_types=['NP']).draw()



    Running this draws the chunk tree in a window.





    We can work with the CoNLL-2000 chunking corpus as follows:




    Accessing the chunked corpus files:

    >>> from nltk.corpus import conll2000
    >>> print(conll2000.chunked_sents('train.txt')[99])
    (S
      (PP Over/IN)
      (NP a/DT cup/NN)
      (PP of/IN)
      (NP coffee/NN)
      ,/,
      (NP Mr./NNP Stone/NNP)
      (VP told/VBD)
      (NP his/PRP$ story/NN)
      ./.)






    If we are only interested in NP chunks, we can write:

    >>> print(conll2000.chunked_sents('train.txt', chunk_types=['NP'])[99])
    (S
      Over/IN
      (NP a/DT cup/NN)
      of/IN
      (NP coffee/NN)
      ,/,
      (NP Mr./NNP Stone/NNP)
      told/VBD
      (NP his/PRP$ story/NN)
      ./.)



    Simple Evaluation and Baselines




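    >>> test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
    >>> # Chunk any run of tags starting with C, D, J, N, or P (CD, DT, JJ, NN, PRP, ...).
    >>> grammar = r"NP: {<[CDJNP].*>+}"
    >>> cp = nltk.RegexpParser(grammar)
    >>> print(cp.evaluate(test_sents))
    ChunkParse score:
        IOB Accuracy:  87.7%
        Precision:     70.6%
        Recall:        67.8%
        F-Measure:     69.2%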
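
    The report combines token-level IOB accuracy with chunk-level precision, recall, and F-measure. As a rough illustration of how the chunk-level figures relate to each other, the sketch below scores a guessed set of chunk spans against a gold set. The function chunk_scores and the span data are invented for the example; this is not NLTK's implementation.

```python
def chunk_scores(gold, guessed):
    """Return (precision, recall, f_measure) for two sets of chunk spans."""
    correct = len(gold & guessed)  # a chunk counts only if its span matches exactly
    precision = correct / len(guessed) if guessed else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical NP spans as (start, end) token offsets.
gold = {(0, 1), (2, 4), (5, 8), (9, 11)}
guessed = {(0, 1), (2, 4), (5, 7), (9, 11), (12, 13)}
precision, recall, f_measure = chunk_scores(gold, guessed)
print(precision, recall, round(f_measure, 3))

# The same harmonic mean reproduces the reported F-measure from the
# reported precision and recall (values in percent).
reported_f = 2 * 70.6 * 67.8 / (70.6 + 67.8)
print(round(reported_f, 1))
```

    Precision is the share of guessed chunks that are correct, recall is the share of gold chunks that were found, and the F-measure is their harmonic mean.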



    We can use a unigram tagger to build a chunker.




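    We define a chunker class with a constructor and a parse method, which is used to chunk new sentences.

    Example 7-4. Noun phrase chunking with a unigram tagger.
    import nltk

    class UnigramChunker(nltk.ChunkParserI):
        def __init__(self, train_sents):
            # Convert each chunk tree into (POS tag, chunk tag) training pairs.
            train_data = [[(t, c) for w, t, c in nltk.chunk.tree2conlltags(sent)]
                          for sent in train_sents]
            self.tagger = nltk.UnigramTagger(train_data)

        def parse(self, sentence):
            pos_tags = [pos for (word, pos) in sentence]
            # Tag the POS sequence with IOB chunk tags.
            tagged_pos_tags = self.tagger.tag(pos_tags)
            chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
            # Recombine words, POS tags, and chunk tags, then build the tree.
            conlltags = [(word, pos, chunktag) for ((word, pos), chunktag)
                         in zip(sentence, chunktags)]
            return nltk.chunk.conlltags2tree(conlltags)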



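    Note how the parse method works:

    1. It takes a part-of-speech-tagged sentence as input.
    2. It starts by extracting the part-of-speech tags from that sentence.
    3. It uses the tagger self.tagger, trained in the constructor, to assign IOB chunk tags to the part-of-speech tags.
    4. It extracts the chunk tags and combines them with the original sentence.
    5. It assembles the result into a chunk tree.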
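
    The steps above can be sketched without NLTK. The toy class below learns, for each part-of-speech tag, the chunk tag it most often receives, then applies that table in parse. The class name ToyUnigramChunker, the fallback to O for unseen tags, and the two training sentences are assumptions made for illustration; it stops at IOB triples rather than building a tree (step 5).

```python
from collections import Counter, defaultdict


class ToyUnigramChunker:
    def __init__(self, train_sents):
        # train_sents: lists of (word, pos, chunktag) triples.
        counts = defaultdict(Counter)
        for sent in train_sents:
            for _word, pos, chunktag in sent:
                counts[pos][chunktag] += 1
        # Keep the most frequent chunk tag for every POS tag.
        self.table = {pos: c.most_common(1)[0][0] for pos, c in counts.items()}

    def parse(self, sentence):
        # Steps 1-2: take a POS-tagged sentence and pull out the POS tags.
        pos_tags = [pos for (_word, pos) in sentence]
        # Step 3: look up a chunk tag per POS tag (unseen tags fall back to "O").
        chunktags = [self.table.get(pos, "O") for pos in pos_tags]
        # Step 4: recombine the chunk tags with the original words.
        return [(w, p, c) for (w, p), c in zip(sentence, chunktags)]


train = [
    [("the", "DT", "B-NP"), ("dog", "NN", "I-NP"), ("barked", "VBD", "O")],
    [("a", "DT", "B-NP"), ("cat", "NN", "I-NP"), ("slept", "VBD", "O")],
]
chunker = ToyUnigramChunker(train)
result = chunker.parse([("my", "DT"), ("bird", "NN"), ("sang", "VBD")])
print(result)
```

    Because the table is keyed on the part-of-speech tag alone, words never seen in training are still chunked, as long as their tag was seen.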



    Once the chunker is defined, we train it on the chunking corpus.




    >>> test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
    >>> train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])
    >>> unigram_chunker = UnigramChunker(train_sents)
    >>> print(unigram_chunker.evaluate(test_sents))
    ChunkParse score:
        IOB Accuracy:  92.9%
        Precision:     79.9%
        Recall:        86.8%
        F-Measure:     83.2%




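    We can use the following code to see what the chunker has learned:

    >>> postags = sorted(set(pos for sent in train_sents
    ...                      for (word, pos) in sent.leaves()))
    >>> print(unigram_chunker.tagger.tag(postags))
    [('#', 'B-NP'), ('$', 'B-NP'), ("''", 'O'), ('(', 'O'), (')', 'O'),
     (',', 'O'), ('.', 'O'), (':', 'O'), ('CC', 'O'), ('CD', 'I-NP'),
     ('DT', 'B-NP'), ('EX', 'B-NP'), ('FW', 'I-NP'), ('IN', 'O'),
     ('JJ', 'I-NP'), ('JJR', 'B-NP'), ('JJS', 'I-NP'), ('MD', 'O'),
     ('NN', 'I-NP'), ('NNP', 'I-NP'), ('NNPS', 'I-NP'), ('NNS', 'I-NP'),
     ('PDT', 'B-NP'), ('POS', 'B-NP'), ('PRP', 'B-NP'), ('PRP$', 'B-NP'),
     ('RB', 'O'), ('RBR', 'O'), ('RBS', 'B-NP'), ('RP', 'O'), ('SYM', 'O'),
     ('TO', 'O'), ('UH', 'O'), ('VB', 'O'), ('VBD', 'O'), ('VBG', 'O'),
     ('VBN', 'O'), ('VBP', 'O'), ('VBZ', 'O'), ('WDT', 'B-NP'),
     ('WP', 'B-NP'), ('WP$', 'B-NP'), ('WRB', 'O'), ('``', 'O')]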



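    Similarly, we can build a bigram chunker: define BigramChunker just like UnigramChunker, but construct an nltk.BigramTagger in place of the nltk.UnigramTagger.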




    >>> bigram_chunker = BigramChunker(train_sents)
    >>> print(bigram_chunker.evaluate(test_sents))
    ChunkParse score:
        IOB Accuracy:  93.3%
        Precision:     82.3%
        Recall:        86.8%
        F-Measure:     84.5%
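
    A bigram tagger conditions each chunk tag on the previous chunk tag as well as on the current part-of-speech tag. The sketch below illustrates that idea in plain Python; ToyBigramChunkTagger, its backoff to a unigram table, and the training data are assumptions made for illustration and do not mirror the internals of nltk.BigramTagger.

```python
from collections import Counter, defaultdict


class ToyBigramChunkTagger:
    def __init__(self, train_sents):
        # train_sents: lists of (pos, chunktag) pairs.
        bigram = defaultdict(Counter)
        unigram = defaultdict(Counter)
        for sent in train_sents:
            prev = "<START>"
            for pos, chunktag in sent:
                bigram[(prev, pos)][chunktag] += 1
                unigram[pos][chunktag] += 1
                prev = chunktag
        self.bigram = {k: c.most_common(1)[0][0] for k, c in bigram.items()}
        self.unigram = {k: c.most_common(1)[0][0] for k, c in unigram.items()}

    def tag(self, pos_tags):
        history = []
        for pos in pos_tags:
            prev = history[-1] if history else "<START>"
            # Prefer the bigram context; back off to the unigram table, then "O".
            guess = self.bigram.get((prev, pos), self.unigram.get(pos, "O"))
            history.append(guess)
        return list(zip(pos_tags, history))


train = [
    [("DT", "B-NP"), ("NN", "I-NP"), ("VBD", "O"), ("NN", "B-NP")],
    [("NN", "B-NP"), ("NN", "I-NP"), ("VBD", "O"), ("DT", "B-NP"), ("NN", "I-NP")],
]
tagger = ToyBigramChunkTagger(train)
tagged = tagger.tag(["NN", "VBD", "NN", "NN"])
print(tagged)
```

    On this toy data the unigram table maps every NN to I-NP, while the bigram context lets a noun at the start of a sentence or after an O token open a new chunk with B-NP.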



    Training Classifier-Based Chunkers



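    The chunkers discussed so far, the regular-expression chunker and the n-gram chunkers, decide which chunks to create entirely on the basis of part-of-speech tags. Sometimes, however, part-of-speech tags are not enough to determine how a sentence should be chunked.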



    For example:




    (3) a. Joey/NN sold/VBD the/DT farmer/NN rice/NN ./.
        b. Nick/NN broke/VBD my/DT computer/NN monitor/NN ./.



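    Although the tags are identical, the chunking is clearly different: in (a), "the farmer" and "rice" are separate noun phrases, whereas in (b), "my computer monitor" is a single noun phrase.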



    So we need to use information about the content of the words as a supplement to their part-of-speech tags.



    One way to make use of word content is to chunk the sentence with a classifier-based tagger.



    The basic code for a classifier-based NP chunker is shown below:







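    The second class is essentially a wrapper around the tagger that turns it into a chunker. During training, this second class maps the chunk trees in the training corpus to tag sequences; in the parse method, it converts the tag sequence produced by the tagger back into a chunk tree.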
    class ConsecutiveNPChunkTagger(nltk.TaggerI):
        def __init__(self, train_sents):
            train_set = []
            for tagged_sent in train_sents:
                untagged_sent = nltk.tag.untag(tagged_sent)
                history = []
                for i, (word, tag) in enumerate(tagged_sent):
                    featureset = npchunk_features(untagged_sent, i, history)
                    train_set.append((featureset, tag))
                    history.append(tag)
            # The 'megam' algorithm relies on the external MEGAM package.
            self.classifier = nltk.MaxentClassifier.train(
                train_set, algorithm='megam', trace=0)

        def tag(self, sentence):
            history = []
            for i, word in enumerate(sentence):
                featureset = npchunk_features(sentence, i, history)
                tag = self.classifier.classify(featureset)
                history.append(tag)
            return list(zip(sentence, history))

    class ConsecutiveNPChunker(nltk.ChunkParserI):
        def __init__(self, train_sents):
            # Each training token becomes ((word, POS tag), chunk tag).
            tagged_sents = [[((w, t), c) for (w, t, c) in
                             nltk.chunk.tree2conlltags(sent)]
                            for sent in train_sents]
            self.tagger = ConsecutiveNPChunkTagger(tagged_sents)

        def parse(self, sentence):
            tagged_sents = self.tagger.tag(sentence)
            conlltags = [(w, t, c) for ((w, t), c) in tagged_sents]
            return nltk.chunk.conlltags2tree(conlltags)
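
    The tagger above works greedily: it classifies tokens from left to right and records each decision in history, so that later feature sets can refer to earlier tags. The sketch below shows that control flow in isolation. The names greedy_tag, toy_features, and toy_classify are hypothetical, and the hand-written rules merely stand in for a trained classifier.

```python
def greedy_tag(sentence, features, classify):
    """Tag tokens left to right, exposing earlier decisions via `history`."""
    history = []
    for i in range(len(sentence)):
        history.append(classify(features(sentence, i, history)))
    return list(zip(sentence, history))


def toy_features(sentence, i, history):
    word, pos = sentence[i]
    return {"pos": pos, "prevtag": history[-1] if history else "<START>"}


def toy_classify(featureset):
    # Hand-written stand-in for a trained classifier.
    if featureset["pos"] in ("DT", "PRP$"):
        return "B-NP"
    if featureset["pos"] == "NN":
        return "I-NP" if featureset["prevtag"] in ("B-NP", "I-NP") else "B-NP"
    return "O"


sentence = [("Nick", "NN"), ("broke", "VBD"), ("my", "DT"),
            ("computer", "NN"), ("monitor", "NN"), (".", ".")]
tagged = greedy_tag(sentence, toy_features, toy_classify)
print([tag for _token, tag in tagged])
```

    Unlike the feature extractors in this section, toy_features reads the previous chunk tag from history; that is one way the history argument can be put to use.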



    Next, we define a feature extractor function:







    >>> def npchunk_features(sentence, i, history):
    ...     word, pos = sentence[i]
    ...     return {"pos": pos}
    >>> chunker = ConsecutiveNPChunker(train_sents)
    >>> print(chunker.evaluate(test_sents))
    ChunkParse score:
        IOB Accuracy:  92.9%
        Precision:     79.9%
        Recall:        86.7%
        F-Measure:     83.2%



    We can improve this classifier-based tagger by adding a feature for the previous part-of-speech tag.







    >>> def npchunk_features(sentence, i, history):
    ...     word, pos = sentence[i]
    ...     if i == 0:
    ...         prevword, prevpos = "<START>", "<START>"
    ...     else:
    ...         prevword, prevpos = sentence[i-1]
    ...     return {"pos": pos, "prevpos": prevpos}
    >>> chunker = ConsecutiveNPChunker(train_sents)
    >>> print(chunker.evaluate(test_sents))
    ChunkParse score:
        IOB Accuracy:  93.6%
        Precision:     81.9%
        Recall:        87.1%
        F-Measure:     84.4%



    Beyond the two part-of-speech tags, we can also add the current word itself as a feature.




    >>> def npchunk_features(sentence, i, history):
    ...     word, pos = sentence[i]
    ...     if i == 0:
    ...         prevword, prevpos = "<START>", "<START>"
    ...     else:
    ...         prevword, prevpos = sentence[i-1]
    ...     return {"pos": pos, "word": word, "prevpos": prevpos}
    >>> chunker = ConsecutiveNPChunker(train_sents)
    >>> print(chunker.evaluate(test_sents))
    ChunkParse score:
        IOB Accuracy:  94.2%
        Precision:     83.4%
        Recall:        88.6%
        F-Measure:     85.9%



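    We can try adding several more features to improve the chunker's performance. The code below, for example, adds a lookahead feature, paired features, and a more complex contextual feature. The last feature, tags-since-dt, builds a string describing the set of all part-of-speech tags encountered since the most recent determiner.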




    >>> def npchunk_features(sentence, i, history):
    ...     word, pos = sentence[i]
    ...     if i == 0:
    ...         prevword, prevpos = "<START>", "<START>"
    ...     else:
    ...         prevword, prevpos = sentence[i-1]
    ...     if i == len(sentence)-1:
    ...         nextword, nextpos = "<END>", "<END>"
    ...     else:
    ...         nextword, nextpos = sentence[i+1]
    ...     return {"pos": pos,
    ...             "word": word,
    ...             "prevpos": prevpos,
    ...             "nextpos": nextpos,
    ...             "prevpos+pos": "%s+%s" % (prevpos, pos),
    ...             "pos+nextpos": "%s+%s" % (pos, nextpos),
    ...             "tags-since-dt": tags_since_dt(sentence, i)}
    >>> def tags_since_dt(sentence, i):
    ...     tags = set()
    ...     for word, pos in sentence[:i]:
    ...         if pos == 'DT':
    ...             tags = set()
    ...         else:
    ...             tags.add(pos)
    ...     return '+'.join(sorted(tags))
    >>> chunker = ConsecutiveNPChunker(train_sents)
    >>> print(chunker.evaluate(test_sents))
    ChunkParse score:
        IOB Accuracy:  95.9%
        Precision:     88.3%
        Recall:        90.7%
        F-Measure:     89.5%
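
    To see what the tags-since-dt feature looks like in practice, the standalone snippet below repeats the helper and prints its value at each position of a short sentence. The sentence is invented for illustration.

```python
def tags_since_dt(sentence, i):
    """Join the distinct POS tags seen since the most recent determiner."""
    tags = set()
    for word, pos in sentence[:i]:
        if pos == "DT":
            tags = set()  # a determiner resets the context
        else:
            tags.add(pos)
    return "+".join(sorted(tags))


sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
values = [tags_since_dt(sentence, i) for i in range(len(sentence))]
for (word, pos), value in zip(sentence, values):
    print(word, pos, repr(value))
```

    The value is empty immediately after a determiner and grows as more tags are seen, which gives the classifier a rough sense of how far into a potential noun phrase the current word is.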
