I am working with Scrapy 0.20 on Python 2.7.
Question
What is the best Python scheduler?
My need
I need to run my spider, which is a Python script, every 3 hours.
What I have thought
- I tried the Task Scheduler that comes with Windows 7, and it works well. I can run a Python script every 3 hours, but I may deploy my script on a Linux server, so I may not be able to use this option.
- I created a Java application using Quartz Scheduler. It works well, but it is a third-party library, which my manager may refuse.
- I created a Windows service and made it fire the script every three hours. It works, but I may deploy my script on a Linux server, so I may not be able to use this option.
I am asking about the best practice for firing a Python script on a schedule.
I tried the Task Scheduler that comes with Windows 7, and it works well.
So that already works for you. Good, then there is no need to change your script to do the scheduling itself.
but I may deploy my script on a Linux server, so I may not be able to use this option.
On Linux, you can use cron jobs to achieve this.
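For example, a crontab entry (added with crontab -e) that fires at minute 0 of every third hour could look like the line below; the interpreter and script paths are placeholders for wherever your spider actually lives:

# run the spider at 0:00, 3:00, 6:00, and so on
0 */3 * * * /usr/bin/python /home/user/run_spider.py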
The other way would be to simply keep your script running the whole time, but have it sleep for the three hours in which it has nothing to do. That way you don't need to set anything up on the target machine; you just run the script in the background, and it keeps running and doing its job.
This is exactly how job schedulers work, by the way. They are launched early when the operating system starts, then they keep running forever, and every short interval (a minute or so) they check whether any job on their list needs to run now. If so, they spawn a new process and run the job.
So if you wanted to write such a scheduler in Python, you would just keep it running forever and start your job once every interval (in your case every 3 hours, since you only have a single job anyway). The job can run in a separate process, in a separate thread, or indirectly in a separate thread using asynchronous functions.
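For illustration, here is a minimal sketch of the keep-it-running approach, assuming the spider can be launched from the command line as scrapy crawl myspider (the spider name is a placeholder; adjust the command to however you normally start your crawl):

import subprocess
import time

INTERVAL = 3 * 60 * 60  # three hours, in seconds

while True:
    # Run the job in a separate process, so a crash inside the
    # spider does not take the scheduler loop down with it.
    subprocess.call(["scrapy", "crawl", "myspider"])
    # Sleep until the next run. This waits after the job finishes,
    # so runs are spaced three hours apart plus the job's runtime.
    time.sleep(INTERVAL)

Started in the background (e.g. with nohup on Linux), this keeps firing the spider without any system-level setup.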
The best way to deploy/schedule your Scrapy project is to use a scrapyd server.
You should install scrapyd.
sudo apt-get install scrapyd
You change your project config file to something like this:
[deploy:somename]
url = http://localhost:6800/  ## this is the default
project = scrapy_project
Then you deploy your project to the scrapyd server:
scrapy deploy somename
You change your poll interval in /etc/scrapyd/conf.d/default-000 to 3 hours (the default is 5 seconds):
poll_interval = 10800
You schedule your spider with something like:
curl http://localhost:6800/schedule.json -d project=scrapy_project -d spider=myspider
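If the call succeeds, scrapyd answers with a small JSON document containing the job id; the jobid value below is just an illustrative example:

{"status": "ok", "jobid": "6487ec79947edab326d6db28a2d86511e8247444"}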
You can use the web service to monitor your jobs:
http://localhost:6800/
PS: I only tested this under Ubuntu, so I am not sure whether a Windows version exists. If not, you can install a VM with Ubuntu to launch the spiders.
Well, there's always the charming sched module (docs), which provides a generic scheduling interface. Give it a time function and a sleep function, and it'll give you back a pretty nice and extensible scheduler.
It's not system-level, but if you can run it as a service, it should suffice.
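As a minimal sketch, a repeating three-hour job built on sched could look like this, with the actual crawl left as a placeholder (runs on both Python 2.7 and 3):

import sched
import time

INTERVAL = 3 * 60 * 60  # three hours, in seconds

# Build a scheduler on top of the real clock and time.sleep.
scheduler = sched.scheduler(time.time, time.sleep)

def run_spider():
    # Placeholder: launch the spider here, e.g. via subprocess
    # as in the loop sketch above.
    print("running spider")
    # Re-enter the event so the job repeats every INTERVAL seconds.
    scheduler.enter(INTERVAL, 1, run_spider, ())

scheduler.enter(INTERVAL, 1, run_spider, ())
scheduler.run()  # blocks and processes events as they come due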