【Natural Language Processing with Python】5.5 N-gram Tagging

    Added: 2013-5-26

    Unigram Tagging

    Unigram tagging is based on a simple statistical algorithm: for each token, assign the tag that is most likely for that particular token.

    >>> import nltk
    >>> from nltk.corpus import brown
    >>> brown_tagged_sents = brown.tagged_sents(categories='news')
    >>> brown_sents = brown.sents(categories='news')
    >>> unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
    >>> unigram_tagger.tag(brown_sents[2007])
    [('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'),
    ('are', 'BER'), ('of', 'IN'), ('the', 'AT'), ('terrace', 'NN'), ('type', 'NN'),
    (',', ','), ('being', 'BEG'), ('on', 'IN'), ('the', 'AT'), ('ground', 'NN'),
    ('floor', 'NN'), ('so', 'QL'), ('that', 'CS'), ('entrance', 'NN'), ('is', 'BEZ'),
    ('direct', 'JJ'), ('.', '.')]
    >>> unigram_tagger.evaluate(brown_tagged_sents)
    0.9349006503968017
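    In effect, a unigram tagger is just a lookup table from each word to its most frequent tag in the training data. A minimal sketch of that idea in plain Python, using a made-up toy corpus in place of the Brown data (not the NLTK implementation):

```python
from collections import Counter, defaultdict

# Toy tagged corpus standing in for brown.tagged_sents(categories='news').
tagged_sents = [
    [('the', 'AT'), ('dog', 'NN'), ('runs', 'VBZ')],
    [('the', 'AT'), ('cat', 'NN'), ('runs', 'VBZ')],
    [('the', 'AT'), ('runs', 'NNS'), ('continue', 'VB')],
]

def train_unigram(tagged_sents):
    """Count (word, tag) pairs and keep the most frequent tag per word."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def unigram_tag(words, model):
    # Unknown words get None, like an NLTK tagger with no backoff.
    return [(w, model.get(w)) for w in words]

model = train_unigram(tagged_sents)
print(unigram_tag(['the', 'runs', 'quickly'], model))
# 'runs' -> 'VBZ' (seen twice as VBZ, once as NNS); 'quickly' -> None
```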



    Separating Training and Test Data



    For a more objective evaluation, a tagger should be tested on data it was not trained on, so we split the corpus into separate training and test sets.




    >>> size = int(len(brown_tagged_sents) * 0.9)
    >>> size
    4160
    >>> train_sents = brown_tagged_sents[:size]
    >>> test_sents = brown_tagged_sents[size:]
    >>> unigram_tagger = nltk.UnigramTagger(train_sents)
    >>> unigram_tagger.evaluate(test_sents)
    0.81202033290142528



    General N-gram Tagging



    An n-gram tagger uses the tags of the preceding n-1 tokens as context when choosing a tag for the current word.




    An example of a bigram tagger:

    >>> bigram_tagger = nltk.BigramTagger(train_sents)
    >>> bigram_tagger.tag(brown_sents[2007])
    [('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'),
    ('are', 'BER'), ('of', 'IN'), ('the', 'AT'), ('terrace', 'NN'),
    ('type', 'NN'), (',', ','), ('being', 'BEG'), ('on', 'IN'), ('the', 'AT'),
    ('ground', 'NN'), ('floor', 'NN'), ('so', 'CS'), ('that', 'CS'),
    ('entrance', 'NN'), ('is', 'BEZ'), ('direct', 'JJ'), ('.', '.')]
    >>> unseen_sent = brown_sents[4203]
    >>> bigram_tagger.tag(unseen_sent)
    [('The', 'AT'), ('population', 'NN'), ('of', 'IN'), ('the', 'AT'), ('Congo', 'NP'),
    ('is', 'BEZ'), ('13.5', None), ('million', None), (',', None), ('divided', None),
    ('into', None), ('at', None), ('least', None), ('seven', None), ('major', None),
    ('``', None), ('culture', None), ('clusters', None), ("''", None), ('and', None),
    ('innumerable', None), ('tribes', None), ('speaking', None), ('400', None),
    ('separate', None), ('dialects', None), ('.', None)]

    The bigram tagger does well on sentences it saw during training, but very badly on unseen ones.

    >>> bigram_tagger.evaluate(test_sents)
    0.10276088906608193
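    Why the score collapses: a bigram tagger's context is the pair (previous tag, current word). The moment one word gets None, every later context contains None as well, so the failure cascades to the end of the sentence. A toy sketch of this mechanism (not the NLTK implementation):

```python
from collections import Counter, defaultdict

def train_bigram(tagged_sents):
    # Context is (tag of the previous word, current word).
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev = None  # sentence-initial context
        for word, tag in sent:
            counts[(prev, word)][tag] += 1
            prev = tag
    return {ctx: tags.most_common(1)[0][0] for ctx, tags in counts.items()}

def bigram_tag(words, model):
    prev, out = None, []
    for w in words:
        t = model.get((prev, w))  # unseen context -> None
        out.append((w, t))
        prev = t
    return out

model = train_bigram([[('the', 'AT'), ('dog', 'NN'), ('runs', 'VBZ')]])
print(bigram_tag(['the', 'dog', 'barks', 'runs'], model))
# 'barks' is unseen, and then 'runs' after a None tag is also an
# unseen context, so both get None
```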



    Combining Taggers



    One way to address the trade-off between accuracy and coverage is to use the more accurate algorithms when we can, and to fall back on algorithms with wider coverage when necessary.



    The example below combines taggers in this way and improves the evaluation score:



    1. Try tagging the token with the bigram tagger.

    2. If the bigram tagger is unable to find a tag, try the unigram tagger.

    3. If the unigram tagger is also unable to find a tag, use the default tagger.




    >>> t0 = nltk.DefaultTagger('NN')
    >>> t1 = nltk.UnigramTagger(train_sents, backoff=t0)
    >>> t2 = nltk.BigramTagger(train_sents, backoff=t1)
    >>> t2.evaluate(test_sents)
    0.84491179108940495
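    The backoff chain at tagging time can be sketched as a series of functions, each answering only when it knows the context and deferring otherwise. This is a simplification (NLTK also uses the backoff during training, so the bigram model only stores contexts where it improves on its backoff), and the lookup tables here are made up:

```python
# Each level of the chain answers only when its table knows the context,
# otherwise it defers to its backoff. The tables are hypothetical.
def make_default(tag):
    return lambda prev, word: tag

def make_unigram(table, backoff):
    return lambda prev, word: table.get(word) or backoff(prev, word)

def make_bigram(table, backoff):
    return lambda prev, word: table.get((prev, word)) or backoff(prev, word)

t0 = make_default('NN')                              # 3. last resort
t1 = make_unigram({'the': 'AT', 'runs': 'VBZ'}, t0)  # 2. per-word fallback
t2 = make_bigram({('AT', 'runs'): 'NNS'}, t1)        # 1. tried first

print(t2(None, 'the'))    # bigram misses, unigram answers: AT
print(t2('AT', 'runs'))   # bigram context known: NNS
print(t2(None, 'blorf'))  # everything misses -> default: NN
```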



    We can also specify how many instances of a context a tagger must see before it retains that context.




    nltk.BigramTagger(sents, cutoff=2, backoff=t1)
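    With cutoff=2, contexts observed only once or twice are discarded as unreliable evidence. A sketch of the filtering, with hypothetical context counts:

```python
from collections import Counter

# Hypothetical counts of (previous tag, word) contexts; cutoff=2
# drops anything seen two times or fewer.
context_counts = Counter({('AT', 'run'): 5, ('NN', 'run'): 2, ('VB', 'run'): 1})
cutoff = 2
kept = {ctx for ctx, n in context_counts.items() if n > cutoff}
print(kept)  # only ('AT', 'run') survives
```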



    Tagging Unknown Words



    Different taggers can produce different results when they encounter a word that was not seen in training.
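    For example, a default tagger answers 'NN' for every unknown word, while a shape-based heuristic in the spirit of nltk.RegexpTagger can guess a more specific tag from the word's suffix, and a bigram tagger without backoff would simply return None. A toy comparison (the suffix rules here are made up):

```python
def default_tag(word):
    # A default tagger gives the same answer for every unknown word.
    return 'NN'

def suffix_tag(word):
    # A crude shape-based guess, like a regexp tagger might make.
    if word.endswith('ing'):
        return 'VBG'
    if word.endswith('ed'):
        return 'VBD'
    if word.endswith('s'):
        return 'NNS'
    return 'NN'

for w in ['blogging', 'podcasts']:
    print(w, default_tag(w), suffix_tag(w))
# blogging: NN vs VBG; podcasts: NN vs NNS
```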



    Storing Taggers



    Training a tagger can be slow, so it is worth saving a trained tagger to disk and re-using it later.




    >>> from cPickle import dump
    >>> output = open('t2.pkl', 'wb')
    >>> dump(t2, output, -1)
    >>> output.close()

    >>> from cPickle import load
    >>> input = open('t2.pkl', 'rb')
    >>> tagger = load(input)
    >>> input.close()

    (cPickle is Python 2; on Python 3, use the pickle module instead.)



    Tagging Across Sentence Boundaries (Sentence-Level N-gram Tagging)




    An n-gram tagger should not use the last tag of one sentence as context for the first word of the next. Training and tagging on lists of sentences, as below, ensures the context is reset at each sentence boundary.

    >>> brown_tagged_sents = brown.tagged_sents(categories='news')
    >>> brown_sents = brown.sents(categories='news')
    >>> size = int(len(brown_tagged_sents) * 0.9)
    >>> train_sents = brown_tagged_sents[:size]
    >>> test_sents = brown_tagged_sents[size:]
    >>> t0 = nltk.DefaultTagger('NN')
    >>> t1 = nltk.UnigramTagger(train_sents, backoff=t0)
    >>> t2 = nltk.BigramTagger(train_sents, backoff=t1)
    >>> t2.evaluate(test_sents)
    0.84491179108940495
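    The effect of working with lists of sentences can be sketched like this: the bigram context is reset at each sentence start, so the last tag of one sentence never leaks into the next (the lookup table below is hypothetical):

```python
# A hypothetical bigram lookup table; (None, word) is the
# sentence-initial context.
model = {(None, 'the'): 'AT', ('AT', 'dog'): 'NN'}

def tag_corpus(sents, model):
    out = []
    for sent in sents:
        prev = None  # reset at every sentence boundary
        tagged = []
        for w in sent:
            t = model.get((prev, w))
            tagged.append((w, t))
            prev = t
        out.append(tagged)
    return out

result = tag_corpus([['the', 'dog'], ['the', 'dog']], model)
print(result)
# both sentences come out identically, because the second one does not
# start with the previous sentence's final 'NN' tag as context
```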
