Thursday, April 24, 2014

python - cannot convert a site's HTML to text correctly - Stack Overflow


EDIT: I cannot believe that BeautifulSoup actually cannot parse HTML properly. Maybe I am doing something wrong, but if I am not, this is a really amateurish module.


I am trying to get text from the web, but I am unable to do so: I always get strange characters in most of the sentences. I never get a sentence that contains a word such as "isn't" correctly.


import urllib2
from bs4 import BeautifulSoup

useragent = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11'}
request = urllib2.Request('SomeURL', None, useragent)
myreq = urllib2.urlopen(request, timeout=5)
html = myreq.read()

# get paragraphs
soup = BeautifulSoup(html)
textList = soup.find_all('p')
mytext = ""
for par in textList:
    if len(str(par)) < 2000:
        print par
        mytext += " " + str(par)

print "the text is ", mytext

The result contains some strange characters:


The plural of “comedo� is comedomes�.</p>
Surprisingly, the visible black head isn’t caused by dirt

Obviously I want to get isn't instead of isn’t. What should I do?
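As an aside, the "isn’t" garbage is the classic signature of UTF-8 bytes mis-decoded as cp1252. A minimal reconstruction (the byte values are derived from the Unicode tables, not captured from the actual page):

```python
# -*- coding: utf-8 -*-
# "isn't" with a curly apostrophe (U+2019) encodes to three UTF-8 bytes;
# reading those bytes back as cp1252 produces the mojibake from the question.
good = u'isn\u2019t'
raw = good.encode('utf-8')              # b'isn\xe2\x80\x99t'
bad = raw.decode('cp1252')              # u'isn\xe2\u20ac\u2122t' -> isn’t
assert bad == u'isn\u00e2\u20ac\u2122t'
```

The fix is therefore always on the decoding side: the bytes themselves are fine.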




I believe the problem is with your system's output encoding, which cannot display the character properly because it is outside the range your terminal can show.


BeautifulSoup4 is meant to fully support HTML entities.
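A quick stdlib way to check what that entity handling does (this is a sketch of the same unescaping, not BeautifulSoup's own code; the helper lives in `html` on Python 3 and `HTMLParser` on Python 2):

```python
# Stdlib sketch of HTML entity decoding, version-portable.
try:
    from html import unescape               # Python 3
except ImportError:                          # Python 2 fallback
    from HTMLParser import HTMLParser
    unescape = HTMLParser().unescape

# &rsquo; becomes a real curly apostrophe (U+2019), not garbage:
assert unescape('isn&rsquo;t') == u'isn\u2019t'
```

So the entities are decoded correctly; what goes wrong afterwards is how the resulting unicode is printed.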


Notice the strange behaviour of these commands:


>python temp.py
...
ed a blackhead. The plural of ÔÇ£comedoÔÇØ is comedomesÔÇØ.</p>
...

>python temp.py > temp.txt

>cat temp.txt
....
ed a blackhead. The plural of "comedo" is comedomes".</p> <p> </p> <p>Blackheads is an open and wide
....

I suggest writing your output to a text file, or perhaps using a different terminal/changing your terminal settings to support a wider range of characters.
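For what it's worth, the ÔÇ£...ÔÇØ garbage above is exactly what UTF-8 bytes look like when rendered under cp850, the DOS codepage used by the Windows console. A small reconstruction (the byte values are derived, not captured from the actual session):

```python
# -*- coding: utf-8 -*-
# Curly quotes encoded as UTF-8, then mis-decoded as the cp850 console codepage.
text = u'\u201ccomedo\u201d'            # “comedo” with curly quotes
raw = text.encode('utf-8')              # b'\xe2\x80\x9ccomedo\xe2\x80\x9d'
# A cp850 terminal renders those bytes as the garbage from the transcript:
assert raw.decode('cp850') == u'\u00d4\u00c7\u00a3comedo\u00d4\u00c7\u00d8'
```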




You are trying to convert a unicode object to a string object. I think you should use codecs instead.


import codecs
f = codecs.open('myFile.txt',mode='w',encoding='utf-8')
...
f.write(par.text)



Since this is Python 2, the urllib2.urlopen().read() call returns a string of bytes, most likely encoded in UTF-8. You can look at the HTTP Content-Type header to see the encoding if it is specifically included; I assumed UTF-8.


You fail to decode this external representation before you start handling the content, and this is only going to lead to tears. General rule: decode inputs immediately, encode only on output.
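That rule can be sketched in a few lines (assuming, as above, that the page really is UTF-8):

```python
# -*- coding: utf-8 -*-
# "Decode inputs immediately, encode only on output" in miniature.
raw = b'isn\xe2\x80\x99t'           # bytes, as read() returns them
text = raw.decode('utf-8')          # decode at the boundary: unicode inside
assert text == u'isn\u2019t'        # a real curly apostrophe, not mojibake
out = text.encode('utf-8')          # encode again only when writing out
assert out == raw
```

All string handling in between then works on unicode objects, and no implicit ASCII conversion can sneak in.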


Here's your code in working form with only two modifications:


import urllib2
from BeautifulSoup import BeautifulSoup

useragent = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11'}
request = urllib2.Request('SomeURL', None, useragent)
myreq = urllib2.urlopen(request, timeout=5)
html = unicode(myreq.read(), "UTF-8")

# get paragraphs
soup = BeautifulSoup(html)
textList = soup.findAll('p')
mytext = ""
for par in textList:
    if len(str(par)) < 2000:
        print par
        mytext += " " + str(par)

print "the text is ", mytext

All I have done is add unicode decoding of the html and use soup.findAll() (the BeautifulSoup 3 name) rather than soup.find_all().




This is a solution based on people's answers here and my own research.


import html2text
import urllib2
import re

useragent = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11'}
request = urllib2.Request('SomeURL', None, useragent)
myreq = urllib2.urlopen(request, timeout=5)
html = myreq.read()
html = html.decode("utf-8")

# get paragraph contents
textList = re.findall(r'(?<=<p>).*?(?=</p>)', html, re.MULTILINE | re.DOTALL)
mytext = ""
for par in textList:
    if len(par) < 2000:
        par = re.sub('<[^<]+?>', '', par)
        mytext += " " + html2text.html2text(par)

print "the text is ", mytext


