Tuesday, May 13, 2014

python - Concordance of Unicode characters in a Unicode corpus in nltk - Stack Overflow


I have a Unicode phrase that I want to search for in my Unicode corpus with nltk, but the problem is that I need to convert the encoding for nltk, otherwise the concordance result will be empty, and I don't know how to do that. This is my simple code:


import nltk
f = open('word-freq-utf8-new.txt', 'rU')   # UTF-8 encoded corpus file
text = f.read()
text1 = text.split()                       # whitespace-tokenized corpus
abst = nltk.Text(text1)
abst.concordance('سلام')                   # finds no matches
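For context, a minimal sketch (assuming Python 2 and a UTF-8 encoded file): reading the file this way yields byte strings, while decoding produces Unicode strings, and that mismatch is what the answer below works around.

f = open('word-freq-utf8-new.txt', 'rU')
text = f.read()
print(type(text))                   # <type 'str'>     -- encoded bytes in Python 2
print(type(text.decode('utf-8')))   # <type 'unicode'> -- decoded text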



nltk does not yet work really well with Unicode, although the developers are working on it. As a quick fix, you can create a subclass of ConcordanceIndex and override the print_concordance method to make sure you are encoding/decoding at the right times for processing and display purposes. Here is a quick example, assuming you have already imported nltk (I am using part of a Unicode Greek text as an example):


>>> import re
>>> # `t` is the equivalent of your `text`. Decode it first to be sure you are working
>>> # with a decoded text; skip this step if you are working with an encoded text.
>>> tokens = re.findall(ur'\w+', t.decode('utf-8'), flags=re.U)

>>> class ConcordanceIndex2(nltk.ConcordanceIndex):
...     'Extends the ConcordanceIndex class.'
...     def print_concordance(self, word, width=75, lines=25):
...         half_width = (width - len(word) - 2) // 2
...         context = width // 4    # approx number of words of context
...
...         offsets = self.offsets(word)
...         if offsets:
...             lines = min(lines, len(offsets))
...             print("Displaying %s of %s matches:" % (lines, len(offsets)))
...             for i in offsets:
...                 if lines <= 0:
...                     break
...                 left = (' ' * half_width +
...                         ' '.join([x.decode('utf-8') for x in self._tokens[i-context:i]]))    # decoded here for display purposes
...                 right = ' '.join([x.decode('utf-8') for x in self._tokens[i+1:i+context]])   # decoded here for display purposes
...                 left = left[-half_width:]
...                 right = right[:half_width]
...                 print(' '.join([left, self._tokens[i].decode('utf-8'), right]))              # decoded here for display purposes
...                 lines -= 1
...         else:
...             print("No matches")

If you are working with a decoded text, you will need to encode the tokens like so:


>>> concordance_index = ConcordanceIndex2([x.encode('utf-8') for x in tokens], key=lambda s: s.lower())    # encoded here to match an encoded text
>>> concordance_index.print_concordance(u'\u039a\u0391\u0399\u03a3\u0391\u03a1\u0395\u0399\u0391\u03a3'.encode('utf-8'))
Displaying 1 of 1 matches:
ΚΑΙΣΑΡΕΙΑΣ ΕΚΚΛΗΣΙΑΣΤΙΚΗ ΙΣΤΟΡΙΑ Euse

Otherwise, you can simply do this:


>>> concordance_index = ConcordanceIndex2(tokens, key=lambda s: s.lower())
>>> concordance_index.print_concordance('\xce\x9a\xce\x91\xce\x99\xce\xa3\xce\x91\xce\xa1\xce\x95\xce\x99\xce\x91\xce\xa3')
Displaying 1 of 1 matches:
ΚΑΙΣΑΡΕΙΑΣ ΕΚΚΛΗΣΙΑΣΤΙΚΗ ΙΣΤΟΡΙΑ Euse
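Applied back to the Persian corpus from the question, the same approach might look roughly like this (just a sketch, assuming Python 2, a UTF-8 encoded word-freq-utf8-new.txt, and the ConcordanceIndex2 class defined above):

import codecs
import re

# Read the corpus as decoded Unicode and tokenize it.
text = codecs.open('word-freq-utf8-new.txt', 'r', encoding='utf-8').read()
tokens = re.findall(ur'\w+', text, flags=re.U)

# Build the index over encoded tokens, as in the first example above.
concordance_index = ConcordanceIndex2([x.encode('utf-8') for x in tokens],
                                      key=lambda s: s.lower())

# Search with the UTF-8 encoded form of the Unicode term u'سلام'.
concordance_index.print_concordance(u'\u0633\u0644\u0627\u0645'.encode('utf-8'))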

