Constructing a batch scheduler on AWS
Before leaving AiRS at NAVER, I used Airflow to schedule batch jobs.
Airflow provides many good features: a visual maintenance UI, SLA and task-failure callbacks, and scale-out options like the Celery executor.
Moreover, unlike cron, it provides useful template variables such as ds and yesterday_ds that tell a job which date it is scheduled for. Without such variables, the pipeline would turn into spaghetti whenever I re-run a lapsed task. I call this an idempotent running environment.
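For example, a minimal Airflow DAG sketch can pass {{ ds }} and {{ yesterday_ds }} to the job, so a backfilled run processes exactly the date it was scheduled for (the process_daily_logs.py script here is hypothetical):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="daily_batch",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
)

# {{ ds }} / {{ yesterday_ds }} render to the logical run date,
# not "now", so re-running a lapsed task gives the same result.
process_logs = BashOperator(
    task_id="process_logs",
    bash_command=(
        "python process_daily_logs.py "
        "--date {{ ds }} --prev {{ yesterday_ds }}"
    ),
    dag=dag,
)
```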
But Airflow also has drawbacks: logs that grow without bound, a scheduler that gets slower and slower over time, and above all it requires a server instance to be maintained for as long as it is in use.
At Vlogr, we rely less on reserved instances and more on SaaS and IaaS, so I decided to look for a serverless scheduler solution. It must have at least:
- Scheduling jobs every N minutes (or hours, days, …)
- Idempotent running environment (see the sketch after these lists)
Some features are nice to have but not required:
- visual tools
- SLA / task-failure callback
- scale out feature
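To make "idempotent running environment" concrete in a serverless setting: a scheduled CloudWatch Events / EventBridge rule invokes a function with the trigger time in the event's time field, so the handler can derive ds itself. This is only a sketch under that assumption; run_batch_for is a hypothetical entry point.

```python
from datetime import datetime, timedelta


def handler(event, context):
    # Scheduled CloudWatch Events / EventBridge invocations carry the
    # trigger time, e.g. "2021-05-01T00:00:00Z", in event["time"].
    scheduled_at = datetime.strptime(event["time"], "%Y-%m-%dT%H:%M:%SZ")
    ds = scheduled_at.strftime("%Y-%m-%d")
    yesterday_ds = (scheduled_at - timedelta(days=1)).strftime("%Y-%m-%d")

    # Work only on the partition for `ds`, so replaying the same event
    # (a re-run of a lapsed schedule) produces the same result.
    run_batch_for(ds, yesterday_ds)  # hypothetical batch entry point
```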
Candidates
Currently I am running Airflow on EC2, which means I have to look after the instance on a short cycle (less than a week). I'd prefer not to keep such a solution. We have some alternatives:
- AWS Batch
- AWS CloudWatch
- AWS SageMaker
| function | AWS Batch | AWS CloudWatch | AWS SageMaker |
|---|---|---|---|
| scheduling function | yes | yes | yes |
| idempotent running environment | no | ? | no |
| visual tools | no | no | no |
| SLA / task fail callback | | | |
| scale out feature | | | |
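For reference, the "scheduling function" row works through CloudWatch Events rate()/cron() expressions. A rough boto3 sketch (the rule name and Lambda ARN below are placeholders):

```python
import boto3

events = boto3.client("events")

# Fire every 10 minutes; rate()/cron() expressions cover the
# "every N minutes/hours/days" requirement.
events.put_rule(
    Name="vlogr-batch-every-10min",          # placeholder rule name
    ScheduleExpression="rate(10 minutes)",
    State="ENABLED",
)

# Point the rule at the function (or job) that runs the batch.
events.put_targets(
    Rule="vlogr-batch-every-10min",
    Targets=[{
        "Id": "batch-target",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:run-batch",  # placeholder ARN
    }],
)
```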