Explore job scheduling and orchestration using Hopsworks and Airflow. Hopsworks offers simple job scheduling through its UI, while Airflow enables complex job orchestration for technical users. Hopsworks integrates Airflow for DAG creation and execution. We also discuss sharing Airflow DAGs in Hopsworks, addressing issues of user permissions and code sharing for smoother workflow management.
Job scheduling involves specifying when a particular job or task should run. Job orchestration goes a step further by automating and managing the execution of multiple interconnected tasks or jobs in a data pipeline. Job orchestration tools schedule, coordinate, and monitor the execution of these jobs, ensuring they run in the right order and at the designated times. Job scheduling and orchestration are crucial for large-scale data pipelines that involve different steps like data ingestion, backfilling, feature engineering, model training, and inference.
This article covers the different aspects of job scheduling in Hopsworks, including how simple jobs can be scheduled through the Hopsworks UI by non-technical users, how more complex job orchestration can be achieved using Airflow, some of the limitations of both approaches from a user sharing/permissions perspective, and how these limitations were addressed by Hopsworks in the 3.6 release.
Given that MLOps platforms are increasingly being used by both technical and non-technical professionals, the Hopsworks platform caters for both these groups by providing support for simple job scheduling through its UI and for more complex job orchestration using Apache Airflow.
As shown in the figure below, the Hopsworks UI presents a number of options when setting up a job to run. In the job scheduling dialog, the user specifies whether the job should run on Spark or as a standalone Python job, selects the code to be executed as part of the job, and optionally passes any additional arguments.
The UI also allows a job to be scheduled to start at a particular date and time in the future and to run repeatedly until an end date and time. More technical users can instead define the schedule using a cron expression, as shown.
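The same kind of schedule can also be attached to a job programmatically. The snippet below is a minimal sketch using the Hopsworks Python client; the job name, the Quartz-style cron expression, and the exact arguments accepted by schedule() are assumptions based on the 3.x client rather than a definitive reference.

```python
# Minimal sketch: scheduling an existing Hopsworks job from Python instead of the UI.
# Assumes the hopsworks 3.x client; "feature_engineering" is a placeholder job name
# and the Quartz-style cron expression ("daily at 06:00") is an assumed format.
from datetime import datetime, timezone

import hopsworks

project = hopsworks.login()        # connect to the Hopsworks project
jobs_api = project.get_jobs_api()  # jobs API for this project

job = jobs_api.get_job("feature_engineering")  # job previously created in the UI
job.schedule(
    cron_expression="0 0 6 * * ?",             # assumed Quartz syntax: every day at 06:00
    start_time=datetime.now(tz=timezone.utc),  # begin scheduling from now
)
```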
Job scheduling through the UI in the Hopsworks platform is intentionally simple so that a typical non-technical user is not overloaded with information about dependency relationships between jobs. For such complex behavior, Hopsworks provides support for Apache Airflow, which is described in the next section.
Apache Airflow is an open source project that allows technical users (i.e., Python programmers) to create, schedule, and monitor their workflows. These workflows can consist of multiple jobs where the dependencies between jobs are modeled as a Directed Acyclic Graph (DAG).
Using the Hopsworks DAG builder tool, you can create a DAG to orchestrate jobs in Hopsworks using Airflow. To create a DAG you first need to create the job(s). Hopsworks provides two abstractions to define job dependencies and execution order in Airflow: the Launch Operator and the Job Sensor. The Launch Operator triggers a job execution. As shown in the image below, in the Launch Operator configuration you select the job you want to execute and optionally provide the job arguments. The wait option specifies whether Airflow should wait for this job to finish before starting other task(s), i.e., it defines parallelism. You can also specify which other Airflow tasks this job depends on; if you specify a dependency, the task will be executed only after the upstream tasks have completed successfully. The Job Sensor is used to wait for tasks to be completed; you can set up a Job Sensor to wait for multiple upstream tasks to complete before starting downstream tasks. A very simple Airflow DAG is shown in the diagram below, where one job executes only after the job it depends on has finished.
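To make the two abstractions concrete, here is a hedged sketch of the kind of DAG file such a workflow might correspond to: a feature engineering job is launched, a sensor waits for it to succeed, and only then is a training job launched. The import paths, class names, and parameter names are assumptions modelled on the Hopsworks Airflow plugin, not the exact output of the DAG builder.

```python
# Hedged sketch of a two-job Hopsworks DAG: "train_model" runs only after
# "feature_engineering" has finished successfully. Class names, import paths,
# and parameters (project_name, job_name, wait_for_completion) are assumptions.
from datetime import datetime

from airflow import DAG
from hopsworks_plugin.operators.hopsworks_operator import HopsworksLaunchOperator
from hopsworks_plugin.sensors.hopsworks_sensor import HopsworksJobSuccessSensor

with DAG(
    dag_id="feature_and_training_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 6 * * *",  # cron expression: run the DAG daily at 06:00
    catchup=False,
) as dag:
    # Launch Operator: trigger the feature engineering job
    launch_fe = HopsworksLaunchOperator(
        task_id="launch_feature_engineering",
        project_name="my_project",          # placeholder project name
        job_name="feature_engineering",
        wait_for_completion=False,          # the "wait" option described above
    )

    # Job Sensor: wait until the feature engineering job has succeeded
    wait_fe = HopsworksJobSuccessSensor(
        task_id="wait_feature_engineering",
        project_name="my_project",
        job_name="feature_engineering",
    )

    # Launch Operator: trigger model training once the sensor succeeds
    launch_train = HopsworksLaunchOperator(
        task_id="launch_training",
        project_name="my_project",
        job_name="train_model",
    )

    launch_fe >> wait_fe >> launch_train
```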
Airflow has a modular architecture that uses a message queue to coordinate the different workers executing the workflows/pipelines. All workflows are written in Python and can be generated dynamically, giving the more technical user significant flexibility in how they orchestrate the jobs they need to execute. Airflow also provides a UI that allows a user to monitor the execution of their jobs.
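As a small illustration of what "generated dynamically" means in practice, the sketch below builds a chain of tasks from a plain Python list using a stock Airflow operator; the source names and commands are placeholders.

```python
# Illustration of dynamic DAG generation with stock Airflow: tasks are created
# in a loop from a Python list, so adding a new ingestion source only means
# extending the list. Source names and the echo commands are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

INGESTION_SOURCES = ["sales", "marketing", "finance"]  # placeholder sources

with DAG(
    dag_id="dynamic_ingestion",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    previous = None
    for source in INGESTION_SOURCES:
        task = BashOperator(
            task_id=f"ingest_{source}",
            bash_command=f"echo ingesting {source}",
        )
        # run the sources one after another rather than in parallel
        if previous is not None:
            previous >> task
        previous = task
```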
In a larger company with many different teams, Airflow is typically set up with a version control system (e.g., git) so that best practices are followed when developing and deploying changes to jobs. Such a setup has a number of limitations when it comes to sharing the code for certain jobs while restricting access to the code of other jobs according to a particular user's privileges/permissions.
Hopsworks jobs can be orchestrated using Apache Airflow. You can define an Airflow DAG containing the dependencies between Hopsworks jobs, and then schedule the DAG to run at specific times using a cron expression.
In Hopsworks, Airflow DAGs are simply Python files within which different operators can be used to trigger different actions (e.g., waiting for a job to finish). To orchestrate jobs, a new Airflow DAG is created as a workflow using the Hopsworks DAG builder tool. A Hopsworks DAG can include different Airflow operators and sensors to orchestrate the execution of jobs, and either the DAG builder tool or the git version control system can be used to update a particular DAG Python file.
When writing a DAG file, you should also add the access_control parameter to the DAG configuration. The access_control parameter specifies which Hopsworks projects have access to the DAG and which actions the project members can perform. If you do not specify the access_control option, project members will not be able to see the DAG in the Airflow UI.
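The sketch below shows one way the access_control parameter might be set on a DAG so that members of a Hopsworks project can view and edit it; the project/role name and the permission strings are assumptions and depend on the Airflow version in use.

```python
# Hedged sketch: granting a Hopsworks project's members access to a DAG via
# Airflow's access_control parameter. Hopsworks maps projects to Airflow roles;
# the role name "my_project" and the permission strings are assumptions.
from datetime import datetime

from airflow import DAG

with DAG(
    dag_id="shared_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
    access_control={
        "my_project": {"can_read", "can_edit"},  # assumed role name and permissions
    },
) as dag:
    ...  # operators and sensors for the project's jobs go here
```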
Inherent to the setup of Airflow is the assumption that the members of the team (or teams) developing and running the workflows/pipelines are Python programmers who understand and are able to use a version control system (i.e., git). This places a high barrier to entry for running jobs. Hopsworks has improved on this approach by allowing Airflow DAGs to be shared between users within a project on the Hopsworks platform. This sharing can be controlled on a user-by-user basis, allowing certain non-technical users to run the jobs specific to their roles (e.g., finance-related jobs vs. human-resources-related jobs).
This blog delves into the synergy between job orchestration and scheduling using Hopsworks and Apache Airflow. It defines job scheduling as the timing of tasks and orchestration as the automation of interconnected tasks within data pipelines. Hopsworks simplifies scheduling through its UI, accommodating both technical and non-technical users by allowing tasks to be scheduled easily at specific times and intervals.
The blog also explains Apache Airflow's role in orchestrating complex workflows through Directed Acyclic Graphs (DAGs), leveraging Python's flexibility for task management. Airflow's modular architecture, together with the Hopsworks-provided Launch Operator and Job Sensor abstractions, enables efficient management of task dependencies. The integration of Airflow with Hopsworks streamlines DAG creation and scheduling within the platform, emphasizing version and access control for secure workflow management. Additionally, it addresses challenges related to DAG sharing within teams, showcasing Hopsworks' advancements in facilitating controlled sharing based on user roles, thereby enhancing accessibility for non-technical users to run relevant jobs.