[Natural Language Processing with Python] 11.4 Working with XML \ 11.5 Working with Toolbox Data
Added: 2013-6-6
11.4 Working with XML
Using XML for Linguistic Structures
(2) <entry>
<headword>whale</headword>
<pos>noun</pos>
<gloss>any of the larger cetacean mammals having a streamlined
body and breathing through a blowhole on the head</gloss>
</entry>
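As a rough illustration of how an entry like this can be read programmatically, the sketch below parses it with Python 3's standard-library xml.etree.ElementTree. That module choice is an assumption (the transcripts in this post use the copy that shipped with older NLTK releases), and the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# The lexical entry from example (2), held in a string for illustration.
entry_xml = """
<entry>
  <headword>whale</headword>
  <pos>noun</pos>
  <gloss>any of the larger cetacean mammals having a streamlined
  body and breathing through a blowhole on the head</gloss>
</entry>
"""

entry = ET.fromstring(entry_xml)

# findtext() returns the text of the first matching child element.
headword = entry.findtext("headword")
pos = entry.findtext("pos")

# The gloss spans two lines, so collapse its whitespace into single spaces.
gloss = " ".join(entry.findtext("gloss").split())

print(headword, pos)
print(gloss)
```

Because each field is addressed by its tag name, the code does not depend on the order in which the fields appear inside the entry.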
The Role of XML
(For more background on XML, please consult the relevant references yourself.)
The ElementTree Interface
>>> import nltk
>>> from nltk.etree.ElementTree import ElementTree
>>> merchant_file = nltk.data.find('corpora/shakespeare/merchant.xml')
>>> merchant = ElementTree().parse(merchant_file)
>>> merchant
<Element PLAY at 22fa800>
>>> merchant[0]
<Element TITLE at 22fa828>
>>> merchant[0].text
'The Merchant of Venice'
>>> merchant.getchildren()
[<Element TITLE at 22fa828>, <Element PERSONAE at 22fa7b0>, <Element SCNDESCR at 2300170>,
<Element PLAYSUBT at 2300198>, <Element ACT at 23001e8>, <Element ACT at 234ec88>,
<Element ACT at 23c87d8>, <Element ACT at 2439198>, <Element ACT at 24923c8>]
We can use further methods to work with the XML. For example, we can search every line of the play for the word "music":
>>> for i, act in enumerate(merchant.findall('ACT')):
...     for j, scene in enumerate(act.findall('SCENE')):
...         for k, speech in enumerate(scene.findall('SPEECH')):
...             for line in speech.findall('LINE'):
...                 if 'music' in str(line.text):
...                     print "Act %d Scene %d Speech %d: %s" % (i+1, j+1, k+1, line.text)
Act 3 Scene 2 Speech 9: Let music sound while he doth make his choice;
Act 3 Scene 2 Speech 9: Fading in music: that the comparison
Act 3 Scene 2 Speech 9: And what is music then? Then music is
Act 5 Scene 1 Speech 23: And bring your music forth into the air.
Act 5 Scene 1 Speech 23: Here will we sit and let the sounds of music
Act 5 Scene 1 Speech 23: And draw her home with music.
Act 5 Scene 1 Speech 24: I am never merry when I hear sweet music.
Act 5 Scene 1 Speech 25: Or any air of music touch their ears,
Act 5 Scene 1 Speech 25: By the sweet power of music: therefore the poet
Act 5 Scene 1 Speech 25: But music for the time doth change his nature.
Act 5 Scene 1 Speech 25: The man that hath no music in himself,
Act 5 Scene 1 Speech 25: Let no such man be trusted. Mark the music.
Act 5 Scene 1 Speech 29: It is your music, madam, of the house.
Act 5 Scene 1 Speech 32: No better a musician than the wren.
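The nested search can be tried without the Shakespeare corpus. The sketch below is a Python 3 version run against a made-up miniature play that only borrows the ACT/SCENE/SPEECH/LINE element names; the find_lines helper and the sample text are hypothetical, not part of NLTK.

```python
import xml.etree.ElementTree as ET

# A made-up miniature play using the same element names as the corpus file.
play_xml = """
<PLAY>
  <TITLE>Toy Play</TITLE>
  <ACT>
    <SCENE>
      <SPEECH><SPEAKER>A</SPEAKER><LINE>No song here.</LINE></SPEECH>
      <SPEECH><SPEAKER>B</SPEAKER><LINE>Let music sound.</LINE><LINE/></SPEECH>
    </SCENE>
  </ACT>
  <ACT>
    <SCENE>
      <SPEECH><SPEAKER>A</SPEAKER><LINE>Mark the music.</LINE></SPEECH>
    </SCENE>
  </ACT>
</PLAY>
"""


def find_lines(root, word):
    """Return (act, scene, speech, text) for every LINE containing word."""
    hits = []
    for i, act in enumerate(root.findall("ACT"), start=1):
        for j, scene in enumerate(act.findall("SCENE"), start=1):
            for k, speech in enumerate(scene.findall("SPEECH"), start=1):
                for line in speech.findall("LINE"):
                    # line.text is None for an empty <LINE/>, so guard it.
                    if word in (line.text or ""):
                        hits.append((i, j, k, line.text))
    return hits


play = ET.fromstring(play_xml)
for act, scene, speech, text in find_lines(play, "music"):
    print("Act %d Scene %d Speech %d: %s" % (act, scene, speech, text))
```

The original transcript wraps line.text in str() for the same reason as the guard here: an empty LINE element has no text, and testing membership in None would raise an error.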
We can also look at the sequence of speakers. A frequency distribution shows who has the most to say:
>>> speaker_seq = [s.text for s in merchant.findall('ACT/SCENE/SPEECH/SPEAKER')]
>>> speaker_freq = nltk.FreqDist(speaker_seq)
>>> top5 = speaker_freq.keys()[:5]
>>> top5
['PORTIA', 'SHYLOCK', 'BASSANIO', 'GRATIANO', 'ANTONIO']
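The transcript above depends on keys() returning speakers in order of decreasing frequency, which is how the older FreqDist behaved. In Python 3 that ordering cannot be assumed, so the sketch below uses collections.Counter and its most_common() method instead. The toy play and its placeholder lines are invented for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented toy data; only the element names follow the corpus file.
play = ET.fromstring("""
<PLAY>
  <ACT>
    <SCENE>
      <SPEECH><SPEAKER>PORTIA</SPEAKER><LINE>first line</LINE></SPEECH>
      <SPEECH><SPEAKER>SHYLOCK</SPEAKER><LINE>second line</LINE></SPEECH>
      <SPEECH><SPEAKER>PORTIA</SPEAKER><LINE>third line</LINE></SPEECH>
    </SCENE>
  </ACT>
  <ACT>
    <SCENE>
      <SPEECH><SPEAKER>ANTONIO</SPEAKER><LINE>fourth line</LINE></SPEECH>
      <SPEECH><SPEAKER>PORTIA</SPEAKER><LINE>fifth line</LINE></SPEECH>
      <SPEECH><SPEAKER>SHYLOCK</SPEAKER><LINE>sixth line</LINE></SPEECH>
    </SCENE>
  </ACT>
</PLAY>
""")

# A slash-separated path reaches nested elements in document order.
speaker_seq = [s.text for s in play.findall("ACT/SCENE/SPEECH/SPEAKER")]

# Count the speeches per speaker and keep the two most frequent names.
speaker_freq = Counter(speaker_seq)
top2 = [name for name, count in speaker_freq.most_common(2)]
print(top2)
```

In NLTK 3, FreqDist is a Counter subclass, so most_common() should be available there as well.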
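We can also look for patterns in who follows whom in the dialogue. Speakers outside the top five are mapped to 'OTH', and the rest are abbreviated to four letters:
>>> mapping = nltk.defaultdict(lambda: 'OTH')
>>> for s in top5:
...     mapping[s] = s[:4]
...
>>> speaker_seq2 = [mapping[s] for s in speaker_seq]
>>> cfd = nltk.ConditionalFreqDist(nltk.ibigrams(speaker_seq2))
>>> cfd.tabulate()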
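The "who follows whom" count is a count over adjacent pairs in the speaker sequence. The sketch below reproduces the idea with the standard library only, assuming Python 3. The short speaker list and the names top and follows are made up for illustration, and zip() over the sequence and its tail stands in for nltk.ibigrams.

```python
from collections import Counter, defaultdict

# A made-up speaker sequence; in the post this comes from the play's XML.
speaker_seq = ["PORTIA", "NERISSA", "PORTIA", "SHYLOCK", "PORTIA", "NERISSA"]
top = ["PORTIA", "SHYLOCK"]

# Frequent speakers get a four-letter code; everyone else becomes 'OTH'.
mapping = defaultdict(lambda: "OTH")
for s in top:
    mapping[s] = s[:4]

abbreviated = [mapping[s] for s in speaker_seq]

# Pair each speaker with the next one and count, per speaker, who follows.
follows = defaultdict(Counter)
for prev, nxt in zip(abbreviated, abbreviated[1:]):
    follows[prev][nxt] += 1

for prev in sorted(follows):
    print(prev, dict(follows[prev]))
```

Each outer key plays the role of a condition in a ConditionalFreqDist, and the inner Counter holds the counts for the speakers who come next.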
Using ElementTree for Accessing Toolbox Data
We can use toolbox.xml() to load a Toolbox file as an ElementTree structure.
>>> from nltk.corpus import toolbox
>>> lexicon = toolbox.xml('rotokas.dic')
One way to access the contents is by index:
>>> lexicon[3][0]
<Element lx at 77bd28>
>>> lexicon[3][0].tag
'lx'
>>> lexicon[3][0].text
'kaa'
We can also access the XML content using paths:
>>> [lexeme.text.lower() for lexeme in lexicon.findall('record/lx')]
['kaa', 'kaa', 'kaa', 'kaakaaro', 'kaakaaviko', 'kaakaavo', 'kaakaoko',
'kaakasi', 'kaakau', 'kaakauko', 'kaakito', 'kaakuupato', ..., 'kuvuto']
>>> import sys
>>> from nltk.etree.ElementTree import ElementTree
>>> tree = ElementTree(lexicon[3])
>>> tree.write(sys.stdout)
<record>
<lx>kaa</lx>
<ps>N</ps>
<pt>MASC</pt>
<cl>isi</cl>
<ge>cooking banana</ge>
<tkp>banana bilong kukim</tkp>
<pt>itoo</pt>
<sf>FLORA</sf>
<dt>12/Aug/2005</dt>
<ex>Taeavi iria kaa isi kovopaueva kaparapasia.</ex>
<xp>Taeavi i bin planim gaden banana bilong kukim tasol long paia.</xp>
<xe>Taeavi planted banana in order to cook it.</xe>
</record>
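Index access, path access and serialization can all be tried on a small in-memory lexicon. The sketch below assumes Python 3 and the standard-library ElementTree. The two records are cut down from entries shown in this post, and the toolbox_data root element name is an assumption about what toolbox.xml() produces.

```python
import xml.etree.ElementTree as ET

# A cut-down stand-in for the lexicon; the root element name is assumed.
lexicon = ET.fromstring(
    "<toolbox_data>"
    "<record><lx>kaa</lx><ps>N</ps><ge>cooking banana</ge></record>"
    "<record><lx>kakarau</lx><ps>N</ps><ge>frog</ge></record>"
    "</toolbox_data>"
)

# Index access: first record, then its first field.
first_field = lexicon[0][0]
print(first_field.tag, first_field.text)

# Path access: every lx field that sits directly under a record.
headwords = [lx.text for lx in lexicon.findall("record/lx")]
print(headwords)

# Serialize a single record back to XML text.
record_text = ET.tostring(lexicon[0], encoding="unicode")
print(record_text)
```

ET.tostring() with encoding="unicode" returns a str, which is convenient for printing; ElementTree(...).write() as used in the transcript writes to a file object instead.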
Formatting Entries
We can generate output in whatever format we need. Here, ten entries are rendered as an HTML table:
>>> html = "<table>\n"
>>> for entry in lexicon[70:80]:
...     lx = entry.findtext('lx')
...     ps = entry.findtext('ps')
...     ge = entry.findtext('ge')
...     html += "  <tr><td>%s</td><td>%s</td><td>%s</td></tr>\n" % (lx, ps, ge)
>>> html += "</table>"
>>> print html
<table>
<tr><td>kakae</td><td>???</td><td>small</td></tr>
<tr><td>kakae</td><td>CLASS</td><td>child</td></tr>
<tr><td>kakaevira</td><td>ADV</td><td>small-like</td></tr>
<tr><td>kakapikoa</td><td>???</td><td>small</td></tr>
<tr><td>kakapikoto</td><td>N</td><td>newborn baby</td></tr>
<tr><td>kakapu</td><td>V</td><td>place in sling for purpose of carrying</td></tr>
<tr><td>kakapua</td><td>N</td><td>sling for lifting</td></tr>
<tr><td>kakara</td><td>N</td><td>armband</td></tr>
<tr><td>Kakarapaia</td><td>N</td><td>village name</td></tr>
<tr><td>kakarau</td><td>N</td><td>frog</td></tr>
</table>
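When field text is pasted straight into HTML, characters such as & or < can break the markup, and a missing field shows up as None. The sketch below is one possible Python 3 variant that escapes each cell and substitutes an empty string for absent fields. The entry_row helper is hypothetical, and the third record is invented purely to exercise both cases.

```python
import xml.etree.ElementTree as ET
from html import escape

# Two records taken from the table above, plus one invented record that
# lacks a ps field and contains an ampersand.
lexicon = ET.fromstring(
    "<toolbox_data>"
    "<record><lx>kakara</lx><ps>N</ps><ge>armband</ge></record>"
    "<record><lx>kakarau</lx><ps>N</ps><ge>frog</ge></record>"
    "<record><lx>xx</lx><ge>cut &amp; carry</ge></record>"
    "</toolbox_data>"
)


def entry_row(entry, tags=("lx", "ps", "ge")):
    """Build one table row, escaping text and blanking missing fields."""
    cells = [escape(entry.findtext(tag, default="")) for tag in tags]
    return "  <tr>" + "".join("<td>%s</td>" % c for c in cells) + "</tr>"


rows = [entry_row(e) for e in lexicon.findall("record")]
html_table = "\n".join(["<table>"] + rows + ["</table>"])
print(html_table)
```

Joining a list of rows once at the end also avoids the repeated string concatenation used in the transcript, although for ten entries the difference is negligible.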
11.5 Working with Toolbox Data
Adding a Field to Each Entry
Example 11-2. Adding a new cv field to a lexical entry
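import re
from nltk.etree.ElementTree import SubElement

def cv(s):
    s = s.lower()
    s = re.sub(r'[^a-z]', r'_', s)    # anything that is not a letter becomes _
    s = re.sub(r'[aeiou]', r'V', s)   # vowels become V
    s = re.sub(r'[^V_]', r'C', s)     # every remaining letter is a consonant
    return (s)

def add_cv_field(entry):
    for field in entry:
        if field.tag == 'lx':
            # append a cv field holding the consonant-vowel pattern of the headword
            cv_field = SubElement(entry, 'cv')
            cv_field.text = cv(field.text)

>>> lexicon = toolbox.xml('rotokas.dic')
>>> add_cv_field(lexicon[53])
>>> print nltk.to_sfm_string(lexicon[53])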
\lx kaeviro
\ps V
\pt A
\ge lift off
\ge take off
\tkp go antap
\sc MOTION
\vx 1
\nt used to describe action of plane
\dt 03/Jun/2005
\ex Pita kaeviroroe kepa kekesia oa vuripierevo kiuvu.
\xp Pita i go antap na lukim haus win i bagarapim.
\xe Peter went to look at the house that the wind destroyed.
\cv CVVCVCV
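The cv() mapping can be checked on its own, without the Rotokas lexicon. The sketch below is a Python 3 rendering of Example 11-2 that uses the standard-library ElementTree (an assumption, in place of nltk.etree) and a hand-built record trimmed from the entry above. It collects the lx fields with findall() before appending, which is a small departure from the original loop.

```python
import re
import xml.etree.ElementTree as ET


def cv(s):
    """Map a word to its consonant-vowel pattern, e.g. 'kaa' -> 'CVV'."""
    s = s.lower()
    s = re.sub(r"[^a-z]", "_", s)   # non-letters become underscores
    s = re.sub(r"[aeiou]", "V", s)  # vowels become V
    s = re.sub(r"[^V_]", "C", s)    # everything left is a consonant
    return s


def add_cv_field(entry):
    """Append a cv field for each lx field found in the entry."""
    # findall() returns a list, so appending here does not disturb iteration.
    for field in entry.findall("lx"):
        ET.SubElement(entry, "cv").text = cv(field.text)


# A trimmed-down copy of the kaeviro record shown above.
entry = ET.fromstring(
    "<record><lx>kaeviro</lx><ps>V</ps><ge>lift off</ge></record>"
)
add_cv_field(entry)
fields = [(f.tag, f.text) for f in entry]
print(fields)
```

The order of the three substitutions matters: the vowels must be replaced before the final step, which treats every character other than V and _ as a consonant.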
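Validating a Toolbox Lexicon
Many lexicons in Toolbox format do not conform to any particular schema. Some entries may include extra fields, or may order existing fields in a new way.
For example, with the help of a FreqDist we can easily find field sequences with unusual frequencies: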
>>> fd = nltk.FreqDist(':'.join(field.tag for field in entry) for entry in lexicon)
>>> fd.items()
[('lx:ps:pt:ge:tkp:dt:ex:xp:xe', 41), ('lx:rt:ps:pt:ge:tkp:dt:ex:xp:xe', 37),
('lx:rt:ps:pt:ge:tkp:dt:ex:xp:xe:ex:xp:xe', 27), ('lx:ps:pt:ge:tkp:nt:dt:ex:xp:xe', 20),
..., ('lx:alt:rt:ps:pt:ge:eng:eng:eng:tkp:tkp:dt:ex:xp:xe:ex:xp:xe:ex:xp:xe', 1)]
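The same check can be sketched with collections.Counter: join each record's field tags into a signature, count the signatures, and treat the rare ones as candidates for review. The three-record lexicon below is invented, and reading "occurs once" as "suspicious" is an illustrative threshold, not a rule from the source.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# An invented lexicon: two records share a field order, one deviates.
lexicon = ET.fromstring(
    "<toolbox_data>"
    "<record><lx>a</lx><ps>N</ps><ge>x</ge></record>"
    "<record><lx>b</lx><ps>V</ps><ge>y</ge></record>"
    "<record><lx>c</lx><ge>z</ge><ps>N</ps></record>"
    "</toolbox_data>"
)

# Build one 'tag:tag:tag' signature per record and count each signature.
seq_counts = Counter(
    ":".join(field.tag for field in record)
    for record in lexicon.findall("record")
)

# Signatures seen only once are candidates for manual inspection.
rare = [seq for seq, n in seq_counts.items() if n == 1]

for seq, n in seq_counts.most_common():
    print(n, seq)
```

On a real lexicon the cut-off would need tuning, since a legitimate but uncommon field order also appears only a few times.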