Sunday, April 13, 2014

python - beautifulsoup scraping nytimes - Stack Overflow


I'm trying to scrape articles from the NY Times and keep getting a very long list of errors. I was wondering if someone could help point me in the right direction. Below is the URL of the article in question, my code, and the output from the console. Any help would really be tremendous.


article: http://www.nytimes.com/2014/03/10/world/asia/malaysia-airlines-flight.html?ref=world&_r=0


import urllib2
from bs4 import BeautifulSoup
import re

# URL of the article to scrape
url = "http://www.nytimes.com/2014/03/10/world/asia/malaysia-airlines-flight.html?ref=world&_r=0"

# Open txt document for output
txt = open('ctp_output.txt', 'w')

# Parse HTML of article, aka making soup
soup = BeautifulSoup(urllib2.urlopen(url).read(), "html.parser")

# Write the article title to the file
title = soup.find("h1")
txt.write('\n' + "Title: " + title.get_text().encode('utf-8') + '\n' + '\n')

# Write the article date to the file
try:
    date = soup.find("span", {'class':'dateline'}).text
    txt.write("Date: " + date.encode('utf-8') + '\n' + '\n')
except AttributeError:  # soup.find returned None
    print "Could not find the date!"

# Write the article author to the file
try:
    byline = soup.find("p", {'class':'byline-author'}).text
    txt.write("Author: " + byline.encode('utf-8') + '\n' + '\n')
except AttributeError:  # soup.find returned None
    print "Could not find the author!"

# Write the article location to the file
regex = '<span class="location">(.+?)</span>'
pattern = re.compile(regex)
location = re.findall(pattern, str(soup))
txt.write("Location: " + str(location) + '\n' + '\n')

# Write the text of every paragraph tag. txt is already open, so the file is
# not reopened here (a second open(..., 'w') would truncate it).
for tag in soup.find_all('p'):
    txt.write(tag.text.encode('utf-8') + '\n' + '\n')

# Close txt file with new content added
txt.close()

Sample output from console:
andrews-mbp-3:CTP Andrew$ python idle_test.py
Please enter a valid URL: http://www.nytimes.com/2014/03/10/world/asia/malaysia-airlines- flight.html?ref=world&_r=0
Traceback (most recent call last):
  File "idle_test.py", line 20, in <module>
    soup = BeautifulSoup(urllib2.urlopen(url).read())
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 410, in open
    response = meth(req, response)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 523, in http_response
    'http', request, response, code, msg, hdrs)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 442, in error
    result = self._call_chain(*args)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)



As you can see from the list of errors (also called the traceback), the (first) error happens on line 20, where you call urllib2.urlopen. So check what you're passing into that function. Your variable url, which urllib2 expects to be a string, has no quotes around it, which makes me wonder how the code didn't throw an error earlier.
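The traceback in the question is cut off before its last line, which is the line that names the actual exception. Here is a minimal sketch of one way to surface that message: catch the error around the urlopen call and print a summary. The helper names describe_error and fetch are hypothetical, and the import fallback is only there so the sketch runs on both Python 2 (urllib2, as in the question) and Python 3.

```python
try:
    # Python 2, as used in the question
    from urllib2 import urlopen, HTTPError, URLError
except ImportError:
    # Python 3 equivalents
    from urllib.request import urlopen
    from urllib.error import HTTPError, URLError


def describe_error(exc):
    """Return a one-line summary of a urlopen failure (hypothetical helper)."""
    # HTTPError is a subclass of URLError, so it has to be checked first.
    if isinstance(exc, HTTPError):
        return "HTTP error %s: %s" % (exc.code, exc.msg)
    if isinstance(exc, URLError):
        return "Could not reach the server: %s" % (exc.reason,)
    return "Unexpected error: %r" % (exc,)


def fetch(url):
    """Return the page body, or None after printing why the request failed."""
    try:
        return urlopen(url).read()
    except (URLError, ValueError) as exc:
        # urlopen raises ValueError for a URL with no recognizable scheme.
        print(describe_error(exc))
        return None
```

Calling fetch(url) in place of urllib2.urlopen(url).read() would print either the HTTP status the server returned or the reason the request never got that far, which narrows down whether the problem is the URL itself or the server's response.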


I said "first error" above because code, when you first write it, will have many errors in it (this is true for most programmers and always true for new programmers). Learning to program is in many ways learning how to interpret the errors (the traceback) that the computer gives you.


Update


You just changed the definition of url to a raw_input call. Please don't do this: urllib2 is having an issue with the variable url, and obscuring its value makes the code harder to read and debug. From experience, I'd guess that including (or omitting) http:// or some similar detail could be what's tripping you up, but I can only guess if I can't see url.
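One way to make the value of url visible is to print its repr() and run a couple of basic checks before handing it to urlopen. This is a rough sketch, not a full validator: check_url is a hypothetical helper, and the checks it performs (stray whitespace, a missing http or https scheme, a missing host) are only assumptions about what commonly goes wrong.

```python
try:
    from urlparse import urlparse  # Python 2, as used in the question
except ImportError:
    from urllib.parse import urlparse  # Python 3


def check_url(url):
    """Return a list of likely problems with url (hypothetical helper)."""
    problems = []
    if any(ch.isspace() for ch in url):
        problems.append("contains whitespace")
    parts = urlparse(url.strip())
    if parts.scheme not in ("http", "https"):
        problems.append("missing http:// or https:// scheme")
    elif not parts.netloc:
        problems.append("missing host name")
    return problems


url = "http://www.nytimes.com/2014/03/10/world/asia/malaysia-airlines-flight.html?ref=world&_r=0"
print(repr(url))        # shows the exact string, including any stray spaces
print(check_url(url))   # an empty list means none of the checks fired
```

The URL echoed in the console output of the question appears to contain a space after "malaysia-airlines-". If that space was really part of the input (and is not just a copy-paste artifact), a check like this would flag it.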



