Python 2.7 - bypassing body onload="window.print" while scraping the page

I'm trying to scrape the page that loads after the print popup is dismissed (canceled).

Every XPath I have tried for the product name and ID (shown in the screenshot) returns an empty result, and I suspect the JavaScript print popup is the reason.

Any tips on how to bypass the print popup would be appreciated.

Thanks :)

Here is a screenshot of the DOM:

[screenshot not preserved]

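Scrapy does not execute JavaScript, so the onload="window.print" handler never runs and the print popup is unlikely to be the cause of the empty results. Here's an example spider that gets the text you've highlighted in the screenshot: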

from scrapy.item import Item, Field
from scrapy.selector import Selector
from scrapy.spider import BaseSpider


class MarketItem(Item):
    # The only field scraped here: the product name.
    name = Field()


class MarketSpider(BaseSpider):
    name = "market"
    allowed_domains = ["mymarket.ge"]
    # Printable version of the product page.
    start_urls = ["http://www.mymarket.ge/classified_details_print.php?product_id=5827165"]

    def parse(self, response):
        contacts = Selector(response)

        item = MarketItem()
        # Text of the <b> inside the product details cell; [0] raises IndexError if nothing matches.
        item['name'] = contacts.xpath('//td[@class="product_info_details_text"]/b/text()').extract()[0].strip()
        return item

This returns the following item:

{'name': u'Nokia asha 515 dual sim'}
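As a rough offline illustration of why the print handler should not matter to a parser that never runs JavaScript, the sketch below applies a similar XPath to a simplified copy of the page using only the standard library. SAMPLE_PAGE and extract_name are names made up for this example, and the markup is an assumption rather than the site's real HTML. It also returns None instead of raising when the cell is missing.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the print page (an assumption,
# not the real markup). Note the onload handler on <body>.
SAMPLE_PAGE = """<html>
  <body onload="window.print()">
    <table>
      <tr>
        <td class="product_info_details_text"><b> Nokia asha 515 dual sim </b></td>
      </tr>
    </table>
  </body>
</html>"""


def extract_name(markup):
    """Return the stripped product name, or None if the cell is missing."""
    root = ET.fromstring(markup)
    # ElementTree supports only a subset of XPath, so select the <b>
    # element and read its text instead of using text() in the path.
    node = root.find(".//td[@class='product_info_details_text']/b")
    if node is None or node.text is None:
        return None
    return node.text.strip()


print(extract_name(SAMPLE_PAGE))
```

The onload attribute is parsed as an ordinary attribute and has no effect on the lookup. ElementTree needs well-formed markup, so this is only a demonstration and not a replacement for Scrapy's Selector on real pages.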

Hope that helps.


Source: Python 2.7 - bypassing body onload="window.print" while scraping the page - Stack Overflow

Wednesday, 16 April 2014
