Sunday, 4 May 2014

Classifying list entries in Python - Stack Overflow


In Python I am trying to create a list (myClassifier) that appends a classification ('bad'/'good') for each text file (txtEntry) stored in a list (txtList), based on whether or not it contains a bad word stored in a list of bad words (badWord).


txtList = ['mywords.txt', 'apple.txt', 'banana.txt', ... , 'something.txt']
badWord = ['pie', 'vegetable', 'fatigue', ... , 'something']

txtEntry is merely a placeholder; really, I just want to iterate through every entry in txtList.


I've produced the following code in response:


for txtEntry in txtList:
    if badWord in txtEntry:
        myClassifier += 'bad'
    else:
        myClassifier += 'good'

However, I'm receiving TypeError: 'in <string>' requires string as left operand, not list as a result.


I'm guessing that badWord needs to be a string as opposed to a list, though I'm not sure how I can get this to work otherwise.


How could I otherwise accomplish this?




To find which files have bad words in them, you could:


import re
from pprint import pprint

filenames = ['mywords.txt', 'apple.txt', 'banana.txt', 'something.txt']
bad_words = ['pie', 'vegetable', 'fatigue', 'something']

classified_files = {}  # filename -> good/bad
has_bad_words = re.compile(r'\b(?:%s)\b' % '|'.join(map(re.escape, bad_words)),
                           re.I).search
for filename in filenames:
    with open(filename) as file:
        for line in file:
            if has_bad_words(line):
                classified_files[filename] = 'bad'
                break  # go to the next file
        else:  # no bad words
            classified_files[filename] = 'good'

pprint(classified_files)
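
The same pattern can be tried on in-memory strings, without touching the file system. This is only a minimal sketch: the sample texts and the names samples and results are made up for illustration and are not part of the question.

```python
import re

bad_words = ['pie', 'vegetable', 'fatigue', 'something']

# Same pattern as above: whole-word, case-insensitive match of any bad word.
has_bad_words = re.compile(r'\b(?:%s)\b' % '|'.join(map(re.escape, bad_words)),
                           re.I).search

# Hypothetical sample texts, not taken from the question.
samples = {
    'whole word': 'I baked a Pie today.',
    'substring only': 'The magpie flew away.',
    'inflected form': 'Two pies, please.',
}

results = {name: 'bad' if has_bad_words(text) else 'good'
           for name, text in samples.items()}
print(results)
```

Only the first sample is classified as 'bad': the \b anchors reject pie inside magpie, and they also reject the plural pies, which is why inflected forms need separate handling.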

If you also want to mark as 'bad' the different inflected forms of a word, e.g., if cactus is in bad_words and you also want to catch cacti (its plural), then you might need a stemmer or, more generally, a lemmatizer, e.g.,


from nltk.stem.porter import PorterStemmer # $ pip install nltk

stemmer = PorterStemmer()
print(stemmer.stem("pies"))
# -> pie

Or


from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('cacti'))
# -> cactus

Note: you might need to run import nltk; nltk.download() to download the wordnet data.


It might be simpler just to add all possible forms, such as pies and cacti, to the bad_words list directly.
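
A rough sketch of that simpler route, assuming the extra forms are written out by hand. The names extra_forms, base_only and pattern are hypothetical, and the word lists are shortened for illustration.

```python
import re

bad_words = ['pie', 'cactus']
# Hand-written inflected forms; hypothetical, extend as needed.
extra_forms = ['pies', 'cacti']

# Pattern built from the base words only.
base_only = re.compile(
    r'\b(?:%s)\b' % '|'.join(map(re.escape, bad_words)), re.I)

# Pattern built from the base words plus the hand-written forms.
pattern = re.compile(
    r'\b(?:%s)\b' % '|'.join(map(re.escape, bad_words + extra_forms)), re.I)

text = 'Three cacti in a pot'
print(bool(base_only.search(text)))  # False: only 'cactus' is listed
print(bool(pattern.search(text)))    # True: 'cacti' is listed explicitly
```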




This


if badWord in txtEntry:

tests whether badWord is a substring of txtEntry. Since badWord is a list, not a string, that test cannot work, and Python raises the TypeError instead. What you need to do instead is check each string in badWord separately. The easiest way to do this is with the built-in function any. You do need to normalise txtEntry, though, because (as mentioned in the comments) you care about matching whole words, not just substrings (which is what string in string tests for), and you (probably) want the search to be case-insensitive:


import re

myClassifier = []
for txtEntry in txtList:
    # Ensure that `word in contents` doesn't give
    # false positives for substrings - avoid eg, 'ass' in 'class'
    contents = [w.lower() for w in re.split(r'\W+', txtEntry)]

    if any(word in contents for word in badWord):
        myClassifier.append('bad')
    else:
        myClassifier.append('good')

Note that, like other answers, I've used the list.append method instead of += to add the string to the list. If you use +=, your list would end up looking like this: ['g', 'o', 'o', 'd', 'b', 'a', 'd'] instead of ['good', 'bad'].
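
The difference is easy to check in isolation. A small sketch, with list names chosen purely for illustration:

```python
# += extends the list with each element of the right-hand iterable;
# a str is iterable, so every character becomes its own item.
with_plus_equals = []
with_plus_equals += 'good'
with_plus_equals += 'bad'

# append() adds its argument as a single item.
with_append = []
with_append.append('good')
with_append.append('bad')

# += with a one-element list behaves like append().
with_list_literal = []
with_list_literal += ['good']
with_list_literal += ['bad']

print(with_plus_equals)   # ['g', 'o', 'o', 'd', 'b', 'a', 'd']
print(with_append)        # ['good', 'bad']
print(with_list_literal)  # ['good', 'bad']
```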


Per the comments on the question, if you want this to check the file's content when you're only storing its name, you need to adjust this slightly - you need a call to open, and you need to then test against the contents - but the test and the normalisation stay the same:


import re

myClassifier = []
for txtEntry in txtList:
    with open(txtEntry) as f:
        # Ensure that `word in contents` doesn't give
        # false positives for substrings - avoid eg, 'ass' in 'class'
        contents = [w.lower() for w in re.split(r'\W+', f.read())]
    if any(word in contents for word in badWord):
        myClassifier.append('bad')
    else:
        myClassifier.append('good')

These loops both assume that, as in your sample data, all of the strings in badWord are in lower case.
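
If that assumption does not hold, one option is to lower-case the bad words as well before comparing. A minimal sketch, where classify_text is a hypothetical helper name and the sample inputs are invented:

```python
import re


def classify_text(text, bad_words):
    """Return 'bad' if any bad word occurs as a whole word in text, else 'good'."""
    # Normalise both sides so the comparison is case-insensitive.
    lowered_bad = {word.lower() for word in bad_words}
    # Split on runs of non-word characters and drop empty strings.
    tokens = {w.lower() for w in re.split(r'\W+', text) if w}
    return 'bad' if lowered_bad & tokens else 'good'


print(classify_text('Apple PIE for dessert', ['Pie', 'Vegetable']))  # bad
print(classify_text('A first-class meal', ['ass']))                  # good
```

Using sets here is a design choice rather than a requirement: the intersection replaces the any(...) loop and keeps the whole-word semantics of the loops above.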




You should be looping over badWord items too, and for each item you should check if it exists in txtEntry.


for txtEntry in txtList:
    if any(word in txtEntry for word in badWord):
        # append() gives the right output; += 'bad' would add every
        # letter of "bad" as a separate list item.
        # Alternatively, use myClassifier += ['bad'].
        myClassifier.append("bad")
    else:
        myClassifier.append("good")

Thanks to @lvc for the comment.
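
To see why the original test fails and why looping over the words avoids the error, here is a short sketch. The values of badWord and txtEntry are invented for illustration, and error and found are hypothetical names.

```python
badWord = ['pie', 'vegetable']
txtEntry = 'apple pie recipe'

# A list on the left of `in` with a str on the right raises TypeError.
try:
    badWord in txtEntry
    error = None
except TypeError as exc:
    error = str(exc)

# Testing each word separately is a valid str-in-str substring check.
found = any(word in txtEntry for word in badWord)

print(error)
print(found)  # True
```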




Try this code:


    myClassifier.append('bad') 


