[Natural Language Processing with Python] 6.1 Supervised Classification


    Pattern recognition is a core part of natural language processing.

    6.1 Supervised Classification

    Classification is the task of choosing the correct class label for a given input.

    Supervised classification: classification is supervised if it is built on a training corpus containing the correct label for each input.

    The framework for supervised classification is shown in the book's figure (not reproduced here): during training, a feature extractor converts each labeled input into a feature set, which a machine learning algorithm uses to build a classifier model; during prediction, the same feature extractor processes new inputs and the model assigns labels.

    Gender Identification

    Let's use gender identification as a simple example to walk through the workflow above.

    Background: male and female names have characteristic patterns (for example, names ending in a are usually female), and that is what we exploit here. We take each name's last letter as the feature for judging gender.

    1. Define the feature extractor

    def gender_features(word):
        return {'last_letter': word[-1]}





    The dictionary returned by this function is called a feature set.



    2. Prepare a list of examples and their corresponding class labels




    >>> import nltk
    >>> import random
    >>> from nltk.corpus import names
    >>> names = ([(name, 'male') for name in names.words('male.txt')] +
    ...          [(name, 'female') for name in names.words('female.txt')])
    >>> random.shuffle(names)



    3. Extract the feature sets




    >>> featuresets = [(gender_features(n), g) for (n, g) in names]





    4. Split the feature sets into a training set and a test set




    >>> train_set, test_set = featuresets[500:], featuresets[:500]
    >>> classifier = nltk.NaiveBayesClassifier.train(train_set)



    5. Use the classifier




    >>> classifier.classify(gender_features('Neo'))
    'male'
    >>> classifier.classify(gender_features('Trinity'))
    'female'



    At this point we know how to use the classifier.



    We can also evaluate the classifier:




    >>> print nltk.classify.accuracy(classifier, test_set)
    0.758



    We can also examine which features the classifier found most effective for distinguishing the gender of names:




    >>> classifier.show_most_informative_features(5)
    Most Informative Features
        last_letter = 'a'    female : male   =   38.3 : 1.0
        last_letter = 'k'      male : female =   31.4 : 1.0
        last_letter = 'f'      male : female =   15.3 : 1.0
        last_letter = 'p'      male : female =   10.6 : 1.0
        last_letter = 'w'      male : female =   10.6 : 1.0



    These ratios are known as likelihood ratios.
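    As a rough illustration (this calculation is not in the original post), the 38.3 : 1 figure for last_letter = 'a' can be approximated straight from the names corpus by comparing how often female and male names end in 'a':

    from nltk.corpus import names as name_lists

    # Estimate P(last_letter='a' | female) and P(last_letter='a' | male).
    female_names = name_lists.words('female.txt')
    male_names = name_lists.words('male.txt')

    p_female = sum(1 for n in female_names if n[-1].lower() == 'a') / float(len(female_names))
    p_male = sum(1 for n in male_names if n[-1].lower() == 'a') / float(len(male_names))

    # The raw corpus ratio should land in the neighborhood of the
    # classifier's 38.3 : 1 (smoothing makes the exact value differ).
    print p_female / p_male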



    Note that when working with large corpora, building the training and test sets as plain lists wastes memory. To save memory, use nltk's apply_features, which returns an object that acts like a list but does not store all the feature sets in memory at once:




    >>> from nltk.classify import apply_features
    >>> train_set = apply_features(gender_features, names[500:])
    >>> test_set = apply_features(gender_features, names[:500])



    Choosing the Right Features



    Choosing the right features has a major impact on a model's performance.



    When constructing features we can include a great many of them. For example, staying with the gender identification task:




    def gender_features2(name):
        features = {}
        features["firstletter"] = name[0].lower()
        features["lastletter"] = name[-1].lower()
        for letter in 'abcdefghijklmnopqrstuvwxyz':
            features["count(%s)" % letter] = name.lower().count(letter)
            features["has(%s)" % letter] = (letter in name.lower())
        return features

    >>> gender_features2('John')
    {'count(j)': 1, 'has(d)': False, 'count(b)': 0, ...}



    Note that if you select too many features, the algorithm relies heavily on their idiosyncrasies; when it runs on a small training set, problems appear and accuracy drops. This problem is called overfitting.




    >>> featuresets = [(gender_features2(n), g) for (n, g) in names]
    >>> train_set, test_set = featuresets[500:], featuresets[:500]
    >>> classifier = nltk.NaiveBayesClassifier.train(train_set)
    >>> print nltk.classify.accuracy(classifier, test_set)
    0.748



    In fact, once an initial feature set is chosen, error analysis is needed to refine it.



    To do this, we can split the corpus into a training set, a dev-test set, and a final test set reserved for evaluation.




    >>> train_names = names[1500:]
    >>> devtest_names = names[500:1500]
    >>> test_names = names[:500]
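    The book then uses the dev-test split for exactly this kind of error analysis: train on train_names, collect the names the classifier gets wrong on devtest_names, and look for patterns in the mistakes. A sketch of that loop, following the book's approach:

    >>> train_set = [(gender_features(n), g) for (n, g) in train_names]
    >>> devtest_set = [(gender_features(n), g) for (n, g) in devtest_names]
    >>> classifier = nltk.NaiveBayesClassifier.train(train_set)

    >>> errors = []
    >>> for (name, tag) in devtest_names:
    ...     guess = classifier.classify(gender_features(name))
    ...     if guess != tag:
    ...         errors.append((tag, guess, name))

    >>> for (tag, guess, name) in sorted(errors):
    ...     print 'correct=%-8s guess=%-8s name=%-30s' % (tag, guess, name)

    Inspecting the misclassified names suggests new features (for instance, two-letter suffixes), after which the cycle of training and error analysis repeats.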



    Document Classification




    Load the documents:

    >>> from nltk.corpus import movie_reviews
    >>> documents = [(list(movie_reviews.words(fileid)), category)
    ...              for category in movie_reviews.categories()
    ...              for fileid in movie_reviews.fileids(category)]
    >>> random.shuffle(documents)

    Define the feature extractor; the feature for each word is whether a given document contains it:

    all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
    word_features = all_words.keys()[:2000]

    def document_features(document):
        document_words = set(document)
        features = {}
        for word in word_features:
            features['contains(%s)' % word] = (word in document_words)
        return features

    >>> print document_features(movie_reviews.words('pos/cv957_8737.txt'))
    {'contains(waste)': False, 'contains(lot)': False, ...}

    Train a classifier on the feature sets:

    featuresets = [(document_features(d), c) for (d, c) in documents]
    train_set, test_set = featuresets[100:], featuresets[:100]
    classifier = nltk.NaiveBayesClassifier.train(train_set)

    >>> print nltk.classify.accuracy(classifier, test_set)
    0.81
    >>> classifier.show_most_informative_features(5)
    Most Informative Features
       contains(outstanding) = True    pos : neg =   11.1 : 1.0
            contains(seagal) = True    neg : pos =    7.7 : 1.0
       contains(wonderfully) = True    pos : neg =    6.8 : 1.0
             contains(damon) = True    pos : neg =    5.9 : 1.0
            contains(wasted) = True    neg : pos =    5.8 : 1.0
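    To label a new review, run the same feature extractor over its word list and hand the result to the classifier. A minimal sketch (the word list below is made up for illustration):

    >>> review_words = ['an', 'outstanding', 'film', ',', 'wonderfully', 'acted']
    >>> classifier.classify(document_features(review_words))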



    Exploiting Context



    For example, if we encounter the word fly and the word before it is a, we can conclude that it is a noun rather than a verb.



    Now let's design a part-of-speech classifier that examines the context a word appears in, in order to decide which part-of-speech tag to assign.




    def pos_features(sentence, i):
        features = {"suffix(1)": sentence[i][-1:],
                    "suffix(2)": sentence[i][-2:],
                    "suffix(3)": sentence[i][-3:]}
        if i == 0:
            features["prev-word"] = "<START>"
        else:
            features["prev-word"] = sentence[i-1]
        return features



    Next, generate the feature sets and split them into a training set and a test set.




    >>> from nltk.corpus import brown
    >>> tagged_sents = brown.tagged_sents(categories='news')
    >>> featuresets = []
    >>> for tagged_sent in tagged_sents:
    ...     untagged_sent = nltk.tag.untag(tagged_sent)
    ...     for i, (word, tag) in enumerate(tagged_sent):
    ...         featuresets.append((pos_features(untagged_sent, i), tag))
    >>> size = int(len(featuresets) * 0.1)
    >>> train_set, test_set = featuresets[size:], featuresets[:size]
    >>> classifier = nltk.NaiveBayesClassifier.train(train_set)
    >>> nltk.classify.accuracy(classifier, test_set)



    A tagger built this way, with contextual features, improves accuracy.



    Sequence Classification



    To capture the dependencies between related classification tasks, we can use joint classifier models, which gather the relevant inputs and choose an appropriate labeling for them together.



    In the part-of-speech tagging example, various sequence classifier models can be used to jointly choose part-of-speech tags for all the words in a given sentence.



    One sequence classification strategy, known as consecutive classification or greedy sequence classification, is to find the most likely class label for the first input, then use that answer to help find the label for the next input. The process repeats until every input has been labeled.



    Below is an example of consecutive classification applied to part-of-speech tagging:







    def pos_features(sentence, i, history):
        features = {"suffix(1)": sentence[i][-1:],
                    "suffix(2)": sentence[i][-2:],
                    "suffix(3)": sentence[i][-3:]}
        if i == 0:
            features["prev-word"] = "<START>"
            features["prev-tag"] = "<START>"
        else:
            features["prev-word"] = sentence[i-1]
            features["prev-tag"] = history[i-1]
        return features




    class ConsecutivePosTagger(nltk.TaggerI):
        def __init__(self, train_sents):
            train_set = []
            for tagged_sent in train_sents:
                untagged_sent = nltk.tag.untag(tagged_sent)
                history = []
                for i, (word, tag) in enumerate(tagged_sent):
                    featureset = pos_features(untagged_sent, i, history)
                    train_set.append((featureset, tag))
                    history.append(tag)
            self.classifier = nltk.NaiveBayesClassifier.train(train_set)

        def tag(self, sentence):
            history = []
            for i, word in enumerate(sentence):
                featureset = pos_features(sentence, i, history)
                tag = self.classifier.classify(featureset)
                history.append(tag)
            return zip(sentence, history)

    >>> tagged_sents = brown.tagged_sents(categories='news')
    >>> size = int(len(tagged_sents) * 0.1)
    >>> train_sents, test_sents = tagged_sents[size:], tagged_sents[:size]
    >>> tagger = ConsecutivePosTagger(train_sents)
    >>> print tagger.evaluate(test_sents)
    0.79796012981



    Other Methods for Sequence Classification



    Hidden Markov models are similar to consecutive classifiers in that they look at both the input and the history of predicted tags. Rather than committing to a single tag at each step, however, they assign a score to every possible tag sequence and choose the sequence with the highest overall score.
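    As a toy sketch of that scoring idea (all probabilities below are invented for illustration, not drawn from the book): score each candidate tag sequence as a product of transition and emission probabilities, then pick the best one. A real HMM tagger uses dynamic programming (the Viterbi algorithm) rather than brute-force enumeration.

    from itertools import product

    # Invented toy probabilities, for illustration only.
    trans = {('<START>', 'DT'): 0.6, ('<START>', 'NN'): 0.2, ('<START>', 'VB'): 0.2,
             ('DT', 'NN'): 0.7, ('DT', 'VB'): 0.1, ('DT', 'DT'): 0.2,
             ('NN', 'VB'): 0.5, ('NN', 'NN'): 0.3, ('NN', 'DT'): 0.2,
             ('VB', 'DT'): 0.4, ('VB', 'NN'): 0.3, ('VB', 'VB'): 0.3}
    emit = {('a', 'DT'): 0.9, ('fly', 'NN'): 0.4, ('fly', 'VB'): 0.4}

    def sequence_score(words, tags):
        # Product of transition and emission probabilities along the sequence,
        # with a small default for unseen pairs.
        score, prev = 1.0, '<START>'
        for word, tag in zip(words, tags):
            score *= trans.get((prev, tag), 0.01) * emit.get((word, tag), 0.01)
            prev = tag
        return score

    words = ['a', 'fly']
    best = max(product(['DT', 'NN', 'VB'], repeat=len(words)),
               key=lambda tags: sequence_score(words, tags))
    print best  # ('DT', 'NN'): "a fly" is best explained as determiner + noun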
