I am working on a text mining project. I am trying to open all the files, grab the organization and abstract information, split the abstracts into words, and then find out how many files each word appears in. My question is about the last step: how many files does a given word appear in? To answer this, I am building a dictionary, wordFrequency, to keep count. The logic I want is: if a word is not in the dictionary yet, store the word with the file number attached to it; if a word is in the dictionary but the file number differs from those already stored, append the file number to its list; if both the word and its file number are already in the dictionary, ignore it. Below is my code.
import re

# `matches` is the list of file paths built earlier (not shown here).
capturedfiles = []
capturedabstracts = []
wordFrequency = {}
wordlist=open('test.txt','w')
worddict=open('test3.txt','w')
for filepath in matches[0:5]:
    with open (filepath,'rt') as mytext:
        mytext=mytext.read()
        #print mytext
        # code to capture file organizations.
        grabFile=re.findall(r'File\s+\:\s+(\w\d{7})',mytext)
        if len(grabFile) == 0:
            matchFile= "N/A"
        else:
            matchFile = grabFile[0]
        capturedfiles.append(matchFile)
        # code to capture file abstracts
        grabAbs=re.findall(r'Abstract\s\:\s\d{7}\s(\w.+)',mytext)
        if len(grabAbs) == 0:
            matchAbs= "N/A"
        else:
            matchAbs = grabAbs
        capturedabstracts.append(matchAbs)
        # arrange words in format.
        lineCount = 0
        wordCount = 0
        lines = matchAbs[0].split('. ')
        for line in lines:
            lineCount +=1
            for word in line.split(' '):
                wordCount +=1
                wordlist.write(matchFile + '|' + str(lineCount) + '|' + str(wordCount) + '|' + word + '\n')
                if word not in wordFrequency:
                    wordFrequency[word]=[matchFile]
                else:
                    if matchFile not in wordFrequency[word]:
                        wordFrequency[word].append(matchFile)
                worddict.write(word + '|' + str(matchFile) + '\n')
wordlist.close()
worddict.close()
What I get now is every word printed with its matching file number. If a word shows up twice in the whole text, it is printed twice, on separate lines. Here is an example of the output:
variation|a9500006
are|a9500006
are|a9500007
I want it to look like this:
variation|a9500006
are|a9500006, a9500007
Instead of writing to worddict every time in a loop, write the whole wordFrequency dictionary after building it. Like so:
#assuming wordFrequency is a correctly built dictionary
for key, value in wordFrequency.items():
    #key is a word, value is a list of file numbers
    worddict.write(key + '|')
    for fileNumber in value:
        #write each file number in value
        worddict.write(fileNumber)
        #if it's not the last file number, write a comma
        if fileNumber != value[-1]:
            worddict.write(', ')
    #no more file numbers, end line
    worddict.write('\n')
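The same idea, building the dictionary first and writing each word exactly once afterwards, can also be sketched with str.join, which removes the need for the "is this the last item" check. This is only a minimal sketch under stated assumptions: the helper names build_word_index and write_word_index, the (file number, abstract) pair format, and the sample data are hypothetical and not part of the original code. It is written for Python 3.

```python
import io


def build_word_index(abstracts):
    """Map each word to the list of file numbers it appears in.

    `abstracts` is a hypothetical list of (file_number, abstract_text)
    pairs standing in for the values captured by the regexes.
    """
    word_index = {}
    for file_number, text in abstracts:
        for word in text.split():
            # setdefault creates the list the first time a word is seen.
            files = word_index.setdefault(word, [])
            # Record each file number at most once per word.
            if file_number not in files:
                files.append(file_number)
    return word_index


def write_word_index(word_index, out):
    """Write one 'word|file1, file2' line per word to a file-like object."""
    for word, files in word_index.items():
        out.write(word + '|' + ', '.join(files) + '\n')


# Hypothetical sample data mirroring the example in the question.
sample = [
    ('a9500006', 'variation are'),
    ('a9500007', 'are'),
]
index = build_word_index(sample)

# An in-memory buffer stands in for the worddict file.
buffer = io.StringIO()
write_word_index(index, buffer)
print(buffer.getvalue())
```

In the real script, the buffer would be replaced by the worddict file object, and write_word_index would be called once, after the loop over all files has finished.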
PS: Never, ever, EVER mix tabs and spaces! Especially in Python!