Associating multiple values with one key in a Python dictionary

I am working on a text mining project. I open each file, grab its organization and abstract, split the abstract into words, and then want to find out how many files each word appears in. My question is about that last step: how many files does a given word appear in? To answer it, I am building a dictionary, wordFrequency. The logic I want is: if a word is not yet in the dictionary, add it together with the number of the file it came from; if the word is already in the dictionary but the file number differs from those already stored, append the file number to the word's list; if both the word and the file number are already there, ignore it. Below is my code.

import re

# matches is assumed to be a list of file paths built earlier (not shown)
capturedfiles = []
capturedabstracts = []
wordFrequency = {}
wordlist=open('test.txt','w')
worddict=open('test3.txt','w')
for filepath in matches[0:5]:
    with open(filepath, 'rt') as mytext:
        mytext = mytext.read()
    #print mytext

    # code to capture file organizations.
    grabFile=re.findall(r'File\s+\:\s+(\w\d{7})',mytext)
    if len(grabFile) == 0:
        matchFile= "N/A"
    else:
        matchFile = grabFile[0]
    capturedfiles.append(matchFile)

    # code to capture file abstracts
    grabAbs=re.findall(r'Abstract\s\:\s\d{7}\s(\w.+)',mytext)
    if len(grabAbs) == 0:
        matchAbs= "N/A"
    else:
        matchAbs = grabAbs
    capturedabstracts.append(matchAbs)

    # arrange words in format.
    lineCount = 0
    wordCount = 0
    lines = matchAbs[0].split('. ')
    for line in lines:
        lineCount +=1
        for word in line.split(' '):
            wordCount +=1
            wordlist.write(matchFile + '|' + str(lineCount) + '|' + str(wordCount) + '|' + word + '\n')

            if word not in wordFrequency:
                wordFrequency[word]=[matchFile]
            else:
                if matchFile not in wordFrequency[word]:
                    wordFrequency[word].append(matchFile)
                worddict.write(word + '|' + str(matchFile) + '\n')


wordlist.close()
worddict.close()

What I get now is every word printed with its matching file number. If a word shows up twice in the whole text, it is printed twice, on separate lines. Here is an example of the output:

variation|a9500006
are|a9500006
are|a9500007

I want it to look like:

variation|a9500006
are|a9500006, a9500007

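Instead of writing to worddict on every pass through the loop, finish building the wordFrequency dictionary first and write it out in one go afterwards. That means removing the worddict.write call inside the loop and adding something like this after the loop, before worddict.close():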

# assuming wordFrequency is a correctly built dictionary
for key, value in wordFrequency.items():
    # key is a word, value is the list of file numbers it appears in
    # join() puts ', ' between the file numbers only, so there is no trailing comma
    worddict.write(key + '|' + ', '.join(value) + '\n')
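If it helps, the "one word, many file numbers" idea can also be sketched with a defaultdict of sets, so the "is this file number already stored?" check goes away. This is only a minimal sketch: the function names, the abstracts dict and its sample contents are made up for illustration and are not part of the code in the question.

```python
from collections import defaultdict

def build_word_index(abstracts):
    """Map each word to the set of file numbers whose abstract contains it.

    `abstracts` is a hypothetical dict of {file_number: abstract_text}.
    """
    index = defaultdict(set)
    for file_number, text in abstracts.items():
        for word in text.split():
            # adding to a set is a no-op if the file number is already there
            index[word].add(file_number)
    return index

def format_index(index):
    """Return one 'word|file1, file2' line per word, sorted for stable output."""
    return [word + '|' + ', '.join(sorted(files))
            for word, files in sorted(index.items())]

# Made-up sample data, not taken from the real files
abstracts = {
    'a9500006': 'variation are',
    'a9500007': 'are are',
}

index = build_word_index(abstracts)
lines = format_index(index)
print('\n'.join(lines))

# The number of files a word appears in is just the size of its set
file_counts = {word: len(files) for word, files in index.items()}
```

Sets do not remember insertion order, which is why the sketch sorts before joining. If the order in which files were read matters, the list-based approach from the question is the one to keep.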

PS: Never, ever, EVER mix tabs and spaces! Especially in Python, where indentation decides which block a line belongs to!
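To illustrate the point about tabs and spaces: Python 3 refuses to compile a block whose indentation cannot be read the same way under different tab widths. A small sketch, where the source string is made up for the demonstration (one line indented with a tab, the next with eight spaces):

```python
# One block: first line indented with a tab, second with eight spaces
source = "if True:\n\tx = 1\n        y = 2\n"

try:
    compile(source, "<example>", "exec")
    result = "compiled"
except TabError:
    # TabError is a subclass of IndentationError
    result = "TabError"

print(result)
```

Python 2 was more lenient here, so the same file could run but nest lines differently from how they look in the editor. Setting the editor to insert spaces only avoids the problem.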


Source: Associating multiple values with one key in a Python dictionary - Stack Overflow

Posted on Stackoverflow Blog, Thursday, 10 April 2014
