Project: mozilla/python_mozetl
Repository: https://github.com/mozilla/python_mozetl
Language: Python 98.3%

# Firefox Telemetry Python ETL

This repository is a collection of ETL jobs for Firefox Telemetry.

## Benefits

Jobs committed to python_mozetl can be scheduled via Airflow or ATMO. We provide a testing suite and code review, which make your job more maintainable. Centralizing our jobs in one repository allows for code reuse and easier collaboration. There are a host of benefits to moving your analysis out of a Jupyter notebook and into a Python package. For more on this, see the writeup at cookiecutter-python-etl.

## Tests

### Dependencies

First install the necessary runtime dependencies -- snappy and the Java runtime environment:

```bash
$ sudo apt-get install libsnappy-dev openjdk-8-jre-headless
```

### Calling the test runner

Run tests by calling the test runner in the root of the repository. Arguments can be passed through to the underlying test framework.
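To make the testing-suite requirement concrete, here is a sketch of the kind of unit test the suite runs against a job. The transform, dataset shape, and field names below are hypothetical stand-ins, not an actual mozetl job:

```python
def aggregate_crash_pings(pings):
    """Count crash pings per channel (hypothetical ETL transform)."""
    counts = {}
    for ping in pings:
        channel = ping.get("channel", "unknown")
        counts[channel] = counts.get(channel, 0) + 1
    return counts


def test_aggregate_crash_pings():
    """A pytest-style test: small fixture in, exact expectation out."""
    pings = [
        {"channel": "release"},
        {"channel": "release"},
        {"channel": "beta"},
        {},  # a ping missing the channel field
    ]
    assert aggregate_crash_pings(pings) == {
        "release": 2,
        "beta": 1,
        "unknown": 1,
    }
```

Keeping transforms as plain functions over plain data like this is what makes jobs in the repository testable without a Spark cluster.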
Tests are configured in `tox.ini`.

## Manual Execution

### ATMO

The first method of manual execution is the `mozetl-submit.sh` script. In an SSH session with an ATMO cluster, grab a copy of the script. Push your code to your own fork where your job has been registered, then submit it:

```bash
$ ./mozetl-submit.sh \
    -p https://github.com/<USERNAME>/python_mozetl.git \
    -b <BRANCHNAME> \
    <COMMAND> \
    --first-argument foo \
    --second-argument bar
```

See comments in the script for more details.

## Databricks

Jobs may also be executed on Databricks.
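In both submission paths, the `<COMMAND>` placeholder names a job entry point defined inside the package, and the trailing flags are forwarded to it. A minimal sketch of what such an entry point could look like (the job name and parsing code are hypothetical; real jobs register their commands with the package's CLI machinery):

```python
import argparse


def example_job(first_argument, second_argument):
    """Run the (hypothetical) ETL job with the two parsed arguments."""
    return {"first": first_argument, "second": second_argument}


def main(argv=None):
    # Mirrors the placeholder flags shown in the submission commands.
    parser = argparse.ArgumentParser(description="example mozetl-style job")
    parser.add_argument("--first-argument", required=True)
    parser.add_argument("--second-argument", required=True)
    args = parser.parse_args(argv)
    return example_job(args.first_argument, args.second_argument)


if __name__ == "__main__":
    main()
```

Invoking `main(["--first-argument", "foo", "--second-argument", "bar"])` dispatches to the job exactly as the submission scripts would from the command line.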
They can be submitted with the `bin/mozetl-databricks.py` script, which runs on your local machine and submits the job to a remote Spark executor. First, generate an API token in the User Settings page in Databricks, then run the script:

```bash
python bin/mozetl-databricks.py \
    --git-path https://github.com/<USERNAME>/python_mozetl.git \
    --git-branch <BRANCHNAME> \
    --token <TOKEN> \
    <COMMAND> \
    --first-argument foo \
    --second-argument bar
```

It is also possible to use this script for external mozetl-compatible modules by setting the appropriate command-line options.

## Scheduling

You can schedule your job on either ATMO or Airflow. Scheduling a job on ATMO is easy and does not require review, but is less maintainable. Use ATMO to schedule jobs you are still prototyping or jobs that have a limited lifespan. Jobs scheduled on Airflow will be more robust.
### ATMO

To schedule a job on ATMO, take a look at the load_and_run notebook. This notebook clones and installs the python_mozetl package. You can then run your job from the notebook.

### Airflow

To schedule a job on Airflow, you'll need to add a new Operator to the DAGs and provide a shell script for running your job. Take a look at this example shell script and this example Operator for templates.

## Early Stage ETL Jobs

We usually require tests before accepting new ETL jobs. If you're still prototyping your job but would like to move your code out of a Jupyter notebook, take a look at cookiecutter-python-etl. This tool will initialize a new repository with all of the necessary boilerplate for testing and packaging. In fact, this project was created with cookiecutter-python-etl.
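The split that cookiecutter-python-etl encourages can be sketched as follows: notebook cells become small extract/transform/load functions that the test suite can exercise independently. All names and the record shape below are hypothetical, illustrating the layout rather than any real job:

```python
def extract(raw_records):
    """Keep only well-formed records (stand-in for reading from storage)."""
    return [r for r in raw_records if "value" in r]


def transform(records):
    """Compute a summary statistic over the extracted records."""
    total = sum(r["value"] for r in records)
    return {"count": len(records), "total": total}


def load(summary, sink):
    """Write the summary to a destination (here, an in-memory list)."""
    sink.append(summary)


def run_job(raw_records, sink):
    """Compose the stages, as a scheduled entry point would."""
    load(transform(extract(raw_records)), sink)


sink = []
run_job([{"value": 1}, {"value": 2}, {"bad": 0}], sink)
# sink == [{"count": 2, "total": 3}]
```

Each stage can then be unit-tested on its own, which is what makes a packaged job easier to review and maintain than the equivalent notebook.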