In Python I am trying to create a list (myClassifier) that appends a classification ('bad'/'good') for each text file (txtEntry) stored in a list (txtList), based on whether or not it contains a bad word stored in a list of bad words (badWord).
txtList = ['mywords.txt', 'apple.txt', 'banana.txt', ... , 'something.txt']
badWord = ['pie', 'vegetable', 'fatigue', ... , 'something']
txtEntry is merely a placeholder; really I just want to iterate through every entry in txtList.
I've produced the following code in response:
for txtEntry in txtList:
    if badWord in txtEntry:
        myClassifier += 'bad'
    else:
        myClassifier += 'good'
However, I'm receiving TypeError: 'in <string>' requires string as left operand, not list as a result.
I'm guessing that badWord needs to be a string as opposed to a list, though I'm not sure how I can get this to work otherwise.
How could I otherwise accomplish this?
To find which files have bad words in them, you could:
import re
from pprint import pprint

filenames = ['mywords.txt', 'apple.txt', 'banana.txt', 'something.txt']
bad_words = ['pie', 'vegetable', 'fatigue', 'something']
classified_files = {}  # filename -> good/bad
has_bad_words = re.compile(r'\b(?:%s)\b' % '|'.join(map(re.escape, bad_words)),
                           re.I).search
for filename in filenames:
    with open(filename) as file:
        for line in file:
            if has_bad_words(line):
                classified_files[filename] = 'bad'
                break  # go to the next file
        else:  # no bad words
            classified_files[filename] = 'good'
pprint(classified_files)
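To see what the compiled pattern does without touching the filesystem, here is a small sketch run on in-memory strings. The sample sentences are made up for illustration; the pattern is built the same way as in the code above.

```python
import re

bad_words = ['pie', 'vegetable', 'fatigue', 'something']

# Same construction as in the answer: escape each word, join with '|',
# and wrap in \b...\b so only whole words match, ignoring case.
has_bad_words = re.compile(
    r'\b(?:%s)\b' % '|'.join(map(re.escape, bad_words)), re.I).search

# Made-up sample lines, purely for illustration.
samples = ['I baked an apple PIE.', 'The seat is occupied.']
flags = [bool(has_bad_words(line)) for line in samples]
print(flags)  # [True, False]
```

A plain substring test such as 'pie' in 'occupied' would be True, which is why the word boundaries matter here.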
If you want to mark as 'bad' the different inflected forms of a word, e.g., if cactus is in bad_words and you also want to exclude files containing cacti (a plural), then you might need stemmers or, more generally, lemmatizers, e.g.,
from nltk.stem.porter import PorterStemmer # $ pip install nltk
stemmer = PorterStemmer()
print(stemmer.stem("pies"))
# -> pie
Or
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('cacti'))
# -> cactus
Note: you might need import nltk; nltk.download() to download the wordnet data.
It might be simpler just to add all possible forms, such as pies and cacti, to the bad_words list directly.
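As a rough sketch of that simpler route, the extra forms can be listed by hand and merged into the pattern. The extra_forms mapping and the sample sentences are hypothetical, made up for illustration; only the way the pattern is built follows the answer.

```python
import re

bad_words = ['pie', 'cactus']

# Hypothetical hand-written table of inflected forms (an assumption for
# this sketch, not something a library generates).
extra_forms = {'pie': ['pies'], 'cactus': ['cacti']}

# Flatten the base words and their listed forms into one list.
all_forms = list(bad_words)
for word in bad_words:
    all_forms.extend(extra_forms.get(word, []))

has_bad_words = re.compile(
    r'\b(?:%s)\b' % '|'.join(map(re.escape, all_forms)), re.I).search

print(all_forms)                                 # ['pie', 'cactus', 'pies', 'cacti']
print(bool(has_bad_words('Two cacti, please')))  # True
print(bool(has_bad_words('An empty plate')))     # False
```

This avoids the nltk dependency at the cost of maintaining the list of forms yourself.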
This

    if badWord in txtEntry:

tests whether badWord equals any substring in txtEntry. Since it is a list, it doesn't and can't. What you need to do instead is check each string in badWord separately. The easiest way to do this is with the function any. You do need to normalise txtEntry, though, because (as mentioned in the comments) you care about matching exact words, not just substrings (which string in string tests for), and you (probably) want the search to be case insensitive:
import re

for txtEntry in txtList:
    # Ensure that `word in contents` doesn't give
    # false positives for substrings - avoid e.g. 'ass' in 'class'
    contents = [w.lower() for w in re.split(r'\W+', txtEntry)]
    if any(word in contents for word in badWord):
        myClassifier.append('bad')
    else:
        myClassifier.append('good')
Note that, like other answers, I've used the list.append method instead of += to add the string to the list. If you use +=, your list would end up looking like this: ['g', 'o', 'o', 'd', 'b', 'a', 'd'] instead of ['good', 'bad'].
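The difference is easy to check in isolation. A minimal sketch, with list names invented for the demonstration:

```python
# append adds the whole string as a single item.
good_way = []
good_way.append('good')
good_way.append('bad')

# += on a list extends it with each element of the right-hand iterable,
# and a string is an iterable of characters.
char_way = []
char_way += 'good'
char_way += 'bad'

# Wrapping the string in a list makes += behave like append.
list_way = []
list_way += ['good']
list_way += ['bad']

print(good_way)  # ['good', 'bad']
print(char_way)  # ['g', 'o', 'o', 'd', 'b', 'a', 'd']
print(list_way)  # ['good', 'bad']
```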
Per the comments on the question, if you want this to check the file's content when you're only storing its name, you need to adjust this slightly: you need a call to open, and you then need to test against the contents. The test and the normalisation stay the same:
import re

for txtEntry in txtList:
    with open(txtEntry) as f:
        # Ensure that `word in contents` doesn't give
        # false positives for substrings - avoid e.g. 'ass' in 'class'
        contents = [w.lower() for w in re.split(r'\W+', f.read())]
    if any(word in contents for word in badWord):
        myClassifier.append('bad')
    else:
        myClassifier.append('good')
These loops both assume that, as in your sample data, all of the strings in badWord are in lower case.
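If that lower-case assumption is a concern, one possible variation is to lower-case the bad words as well. The sketch below wraps the loop in a hypothetical helper, classify_files, and swaps the any(...) test for a set intersection; the temporary files and their contents are made up purely to exercise it.

```python
import os
import re
import tempfile


def classify_files(paths, bad_words):
    """Return 'bad' or 'good' for each path, in order (hypothetical helper)."""
    bad = {w.lower() for w in bad_words}  # lower-case once, so mixed case is fine
    result = []
    for path in paths:
        with open(path) as f:
            # Split on runs of non-word characters and lower-case each piece.
            words = {w.lower() for w in re.split(r'\W+', f.read())}
        result.append('bad' if words & bad else 'good')
    return result


# Throwaway files with made-up contents, only to exercise the function.
with tempfile.TemporaryDirectory() as tmp:
    samples = {'a.txt': 'I love Pie!', 'b.txt': 'The seat is occupied.'}
    paths = []
    for name, text in samples.items():
        path = os.path.join(tmp, name)
        with open(path, 'w') as f:
            f.write(text)
        paths.append(path)
    labels = classify_files(paths, ['pie', 'Vegetable'])

print(labels)  # ['bad', 'good']
```

Using sets makes each membership test cheap, which may matter for long files or long word lists.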
You should be looping over badWord items too, and for each item you should check if it exists in txtEntry.
for txtEntry in txtList:
    if any(word in txtEntry for word in badWord):
        # append() is better and will give you the right output, as +=
        # will add every letter in "bad" as a separate list item.
        # Alternatively, use myClassifier += ['bad'].
        myClassifier.append("bad")
    else:
        myClassifier.append("good")
Thanks to @lvc's comment.
Try this code:
myClassifier.append('bad')