Getting Started
To get started with PyDolphinScheduler, make sure Python and pip are installed on your machine. If you are already set up, skip straight to Installing PyDolphinScheduler; otherwise, continue with Installing Python.
Installing Python
How you install Python and pip depends on your operating system. The Python wiki provides up-to-date instructions for all platforms here. After you open the website and choose your operating system, you are offered a choice of Python versions. PyDolphinScheduler recommends Python 3.6 or later, and we highly recommend installing a stable release rather than a pre-release.
After you have downloaded and installed Python, open your terminal and run python --version to check whether the installation is correct. If everything is fine, the console shows the version without any error (here is an example after Python 3.8.7 is installed):
python --version
# Prints the Python version, such as Python 3.8.7
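The same check can be done from inside Python, which is handy in setup scripts. The sketch below is only an illustration: the helper name meets_minimum and the constant MIN_VERSION are hypothetical, and the 3.6 threshold is an assumption taken from the recommendation above.

```python
import sys

# Assumed minimum, taken from the recommendation above (Python 3.6).
MIN_VERSION = (3, 6)


def meets_minimum(version_info=None, minimum=MIN_VERSION):
    """Return True if (major, minor) of `version_info` is at least `minimum`."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    if meets_minimum():
        print("Python {} is recent enough".format(current))
    else:
        print("Python {} is older than {}.{}".format(current, *MIN_VERSION))
```

Comparing tuples avoids parsing the version string by hand; sys.version_info already exposes the version as integers.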
Installing PyDolphinScheduler
Once Python is installed on your machine as described in the section Installing Python, it is easy to install PyDolphinScheduler using pip.
python -m pip install apache-dolphinscheduler
Running the above command in your terminal installs the latest released version of PyDolphinScheduler. You can then start the Python Gateway Service to finish the preparation, and go to the Tutorial to get your hands dirty. If you want to install the unreleased version of PyDolphinScheduler instead, see the section Installing PyDolphinScheduler In DEV Branch for more details.
Note
Currently, we have released multiple pre-release packages on PyPI. You can see all released packages, including pre-releases, in release history.
To install a pre-release package, pin the package version. For example, run python -m pip install apache-dolphinscheduler==3.0.0b2 to install the 3.0.0-beta-2 package.
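To confirm which version pip actually installed (for example after pinning a pre-release), you can query the package metadata. This is a minimal sketch, assuming Python 3.8 or later, where importlib.metadata is part of the standard library; the helper name installed_version is hypothetical.

```python
from importlib import metadata  # standard library since Python 3.8


def installed_version(distribution="apache-dolphinscheduler"):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("apache-dolphinscheduler is not installed in this environment")
    else:
        print("apache-dolphinscheduler {} is installed".format(version))
```

Run it with the same interpreter you used for python -m pip install, so that both look at the same environment.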
Installing PyDolphinScheduler In DEV Branch
The project is under active development, and some features are not released yet. If you want to try something unreleased, you can install from the source code, which we host on GitHub:
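# Clone the Apache DolphinScheduler Python SDK repository
git clone git@github.com:apache/dolphinscheduler-sdk-python.git
# Enter the cloned repository so that "." points at the package source
cd dolphinscheduler-sdk-python
# Install PyDolphinScheduler in develop mode
python -m pip install -e .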
After you have installed PyDolphinScheduler, remember to start the Python Gateway Service, which is required for PyDolphinScheduler’s workflow definition.
The above command clones the whole source code to your local machine. If you only want to install the latest pydolphinscheduler package directly and do not care about the other code (including the Python gateway service code), you can run the command:
# Keep the URL quoted so that the shell does not interpret any special characters in it
pip install -e "git+https://github.com/apache/dolphinscheduler-sdk-python.git#egg=apache-dolphinscheduler"
Start Python Gateway Service
PyDolphinScheduler is the Python API for Apache DolphinScheduler. It can define workflows and task structures, but it cannot run them unless you install Apache DolphinScheduler and start its API server, which includes the Python gateway service. We only list the key steps here; see Install Apache DolphinScheduler for more details:
# Export the environment variable to enable the Python gateway service
export API_PYTHON_GATEWAY_ENABLED="true"
# Start the DolphinScheduler api-server, which includes the Python gateway service
./bin/dolphinscheduler-daemon.sh start api-server
To check whether the server is alive, you can run jps. The server is healthy if the keyword ApiApplicationServer appears in the console output.
jps
# ....
# 201472 ApiApplicationServer
# ....
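jps only shows that the process exists. A complementary check is whether the gateway port accepts TCP connections. The sketch below is an illustration, not part of PyDolphinScheduler: the helper port_is_open is hypothetical, and the host 127.0.0.1 and port 25333 (the Py4J default) are assumptions that do not come from this page, so adjust them to your deployment.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Assumed values: a local api-server and Py4J's default port 25333.
    host, port = "127.0.0.1", 25333
    state = "reachable" if port_is_open(host, port) else "not reachable"
    print("Python gateway at {}:{} is {}".format(host, port, state))
```

A successful connection only proves that something is listening on that port; it does not validate the authentication token.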
Note
Please make sure you have started the Python gateway service along with api-server. You can enable it via
Environment: export API_PYTHON_GATEWAY_ENABLED="true"
Configuration File: Set python-gateway.enabled : true in api-server/conf/application.yaml
Please change the token in your production environment and update it periodically, because it protects read and write access to your data.
Environment: export API_PYTHON_GATEWAY_AUTH_TOKEN="GsAurNxU7A@Xc"
Configuration File: Set python-gateway.auth-token : GsAurNxU7A@Xc in api-server/conf/application.yaml
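Rather than reusing the sample value shown in the note, you can generate a random token. This is a minimal sketch using Python's standard secrets module; the helper name new_auth_token and the 32-byte length are assumptions, not project requirements.

```python
import secrets


def new_auth_token(num_bytes=32):
    """Return a URL-safe random token built from `num_bytes` random bytes."""
    return secrets.token_urlsafe(num_bytes)


if __name__ == "__main__":
    token = new_auth_token()
    # Print a line that can be pasted into the shell before starting api-server.
    print('export API_PYTHON_GATEWAY_AUTH_TOKEN="{}"'.format(token))
```

The token contains only letters, digits, "-" and "_", so it needs no extra escaping in the shell or in application.yaml. Remember that the pydolphinscheduler side must be configured with the same value.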
Run an Example
Before you run an example for pydolphinscheduler, you need to get the example code from its source code. You can fetch it with a single command:
wget https://raw.githubusercontent.com/apache/dolphinscheduler-sdk-python/main/src/pydolphinscheduler/examples/tutorial.py
Alternatively, you can copy and paste the content from the tutorial source code. Then run the example in your terminal:
python tutorial.py
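On systems without wget, the example file can also be fetched with Python's standard library. The sketch below reuses the URL from the wget command above; the helper name download is hypothetical.

```python
import shutil
import urllib.request

# URL copied from the wget command above.
TUTORIAL_URL = (
    "https://raw.githubusercontent.com/apache/dolphinscheduler-sdk-python/"
    "main/src/pydolphinscheduler/examples/tutorial.py"
)


def download(url, destination):
    """Stream the resource at `url` into the local file `destination`."""
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out)
    return destination


if __name__ == "__main__":
    print("Saved", download(TUTORIAL_URL, "tutorial.py"))
```

After the download finishes, run python tutorial.py as before.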
If you want to submit your workflow to a remote API server, meaning that your workflow script runs on a different host from the API server, you should first change the pydolphinscheduler configuration and then submit the workflow script:
pydolphinscheduler config --init
pydolphinscheduler config --set java_gateway.address <YOUR-API-SERVER-IP-OR-HOSTNAME>
python tutorial.py
Note
See Configuration for more information about all the configuration options pydolphinscheduler supports.
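If you submit to several API servers, the three remote-submission commands above can be scripted. This is only a sketch that shells out to the same commands; the helper names remote_submit_commands and run_all and the script name submit_remote.py are hypothetical, and it assumes the pydolphinscheduler command is on your PATH.

```python
import subprocess
import sys


def remote_submit_commands(api_server, script="tutorial.py"):
    """Return the three commands shown above as argument lists."""
    return [
        ["pydolphinscheduler", "config", "--init"],
        ["pydolphinscheduler", "config", "--set", "java_gateway.address", api_server],
        [sys.executable, script],
    ]


def run_all(commands):
    """Run each command in order and stop at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical usage: python submit_remote.py <YOUR-API-SERVER-IP-OR-HOSTNAME>
    run_all(remote_submit_commands(sys.argv[1]))
```

Passing arguments as lists avoids shell quoting problems, and check=True makes a failed configuration step abort before the workflow is submitted.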
After that, open your DolphinScheduler web UI to find the new workflow created by pydolphinscheduler. The path in the web UI is Project -> Workflow -> Workflow Definition. There you can see that a workflow and a workflow instance have been created, and that the DAG is automatically laid out by the web UI.
Note
A default authentication token is used when you first launch dolphinscheduler and pydolphinscheduler. Please change the parameter auth_token when you deploy in a production environment or test dolphinscheduler on a public network. See authentication token for more details.
What’s More
If you are not familiar with PyDolphinScheduler, go to the Tutorial to see how it works. If you already know the basic usage and concepts of PyDolphinScheduler, you can explore all the Tasks PyDolphinScheduler supports, or see our HOWTOs for useful cases.