Welcome to CUBI-TK’s documentation!¶
- Installation & Getting Started
Instructions for installing the package, plus some examples to get you started.
- Manual
This section contains manuals for specific commands.
- Use cases
Use cases for common processing tasks.
- Project Info
More information on the project, including the changelog, list of contributing authors, and contribution instructions.
Installation¶
Prerequisites when using conda:
$ conda create -n cubi-tk python=3.10
$ conda activate cubi-tk
Clone CUBI-TK and install:
$ git clone git@cubi-gitlab.bihealth.org:CUBI/Pipelines/cubi-tk.git
$ cd cubi-tk
$ pip install -e .
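To verify the installation, display the top-level help:
$ cubi-tk --help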
Building the manual and running the tests require additional packages:
$ pip install -r requirements/develop.txt
Run tests¶
$ make test
Build manual¶
$ cd docs_manual
$ make clean html
Command Line Interface¶
usage: cubi-tk [-h] [--verbose] [--version] [--config CONFIG]
[--sodar-server-url SODAR_SERVER_URL]
[--sodar-api-token SODAR_API_TOKEN]
{isa-tpl,isa-tab,snappy,sodar,irods,org-raw,sea-snap,archive}
...
Positional Arguments¶
cmd | Possible choices: isa-tpl, isa-tab, snappy, sodar, irods, org-raw, sea-snap, archive |
Named Arguments¶
--verbose | Increase verbosity. Default: False |
--version | show program’s version number and exit |
Basic Configuration¶
--config | Path to configuration file. |
--sodar-server-url | SODAR server URL key to use, defaults to env SODAR_SERVER_URL. |
--sodar-api-token | SODAR API token to use, defaults to env SODAR_API_TOKEN. |
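As noted above, the server URL and API token can also be provided via the SODAR_SERVER_URL and SODAR_API_TOKEN environment variables instead of being passed on every call. A minimal sketch (the token and project UUID are placeholders):
$ export SODAR_API_TOKEN=<your-api-token>
$ cubi-tk sodar landing-zone-list <project-uuid>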
Sub-commands¶
isa-tpl¶
Create ISA-tab directories from predefined templates.
cubi-tk isa-tpl [-h]
{single_cell_rnaseq,bulk_rnaseq,tumor_normal_dna,tumor_normal_triplets,germline,generic,microarray,ms_meta_biocrates,stem_cell_core_bulk,stem_cell_core_sc}
...
Positional Arguments¶
tpl | Possible choices: single_cell_rnaseq, bulk_rnaseq, tumor_normal_dna, tumor_normal_triplets, germline, generic, microarray, ms_meta_biocrates, stem_cell_core_bulk, stem_cell_core_sc |
Sub-commands¶
single_cell_rnaseq¶
When specifying the --var-* arguments, you can use JSON syntax. Values that cannot be parsed as JSON are kept as strings.
cubi-tk isa-tpl single_cell_rnaseq [-h]
[--var-investigation-title VAR_INVESTIGATION_TITLE]
[--var-sample-names VAR_SAMPLE_NAMES]
[--var-a-measurement-type VAR_A_MEASUREMENT_TYPE]
[--var-lib-kit VAR_LIB_KIT]
[--var-source-meta VAR_SOURCE_META]
[--var-sample-meta VAR_SAMPLE_META]
[--var-dissociation-meta VAR_DISSOCIATION_META]
[--var-library-construction-meta VAR_LIBRARY_CONSTRUCTION_META]
[--var-sequencing-meta VAR_SEQUENCING_META]
[--var--library-types VAR__LIBRARY_TYPES]
[--var-library-type VAR_LIBRARY_TYPE]
[--var-sample-multiplexing VAR_SAMPLE_MULTIPLEXING]
[--var-genotype-multiplexing VAR_GENOTYPE_MULTIPLEXING]
[--var-study-title VAR_STUDY_TITLE]
[--var-s-file-name VAR_S_FILE_NAME]
[--var-assay-prefix VAR_ASSAY_PREFIX]
[--var-a-technology-type VAR_A_TECHNOLOGY_TYPE]
[--var-assay-name VAR_ASSAY_NAME]
[--var---output-dir VAR___OUTPUT_DIR]
output_dir
output_dir | Path to output directory |
--var-investigation-title | template variable 'investigation_title' |
--var-sample-names | template variable 'sample_names' |
--var-a-measurement-type | template variable 'a_measurement_type' |
--var-lib-kit | template variable 'lib_kit' |
--var-source-meta | template variable 'source_meta' |
--var-sample-meta | template variable 'sample_meta' |
--var-dissociation-meta | template variable 'dissociation_meta' |
--var-library-construction-meta | template variable 'library_construction_meta' |
--var-sequencing-meta | template variable 'sequencing_meta' |
--var--library-types | template variable '_library_types' |
--var-library-type | template variable 'library_type' |
--var-sample-multiplexing | template variable 'sample_multiplexing' |
--var-genotype-multiplexing | template variable 'genotype_multiplexing' |
--var-study-title | template variable 'study_title' |
--var-s-file-name | template variable 's_file_name' |
--var-assay-prefix | template variable 'assay_prefix' |
--var-a-technology-type | template variable 'a_technology_type' |
--var-assay-name | template variable 'assay_name' |
--var---output-dir | template variable '__output_dir' |
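As an illustration of the JSON note above, list-valued template variables can be passed as JSON arrays; a hypothetical invocation (all values and the output directory are placeholders):
$ cubi-tk isa-tpl single_cell_rnaseq \
      --var-investigation-title "My scRNA-seq study" \
      --var-sample-names '["alpha", "beta"]' \
      isatab-sc-rnaseq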
bulk_rnaseq¶
When specifying the --var-* arguments, you can use JSON syntax. Values that cannot be parsed as JSON are kept as strings.
cubi-tk isa-tpl bulk_rnaseq [-h]
[--var-investigation-title VAR_INVESTIGATION_TITLE]
[--var-sample-names VAR_SAMPLE_NAMES]
[--var-a-measurement-type VAR_A_MEASUREMENT_TYPE]
[--var-a-measurement-types VAR_A_MEASUREMENT_TYPES]
[--var-a-technology-type VAR_A_TECHNOLOGY_TYPE]
[--var-a-technology-types VAR_A_TECHNOLOGY_TYPES]
[--var-lib-kit VAR_LIB_KIT]
[--var-organism VAR_ORGANISM]
[--var-batch VAR_BATCH]
[--var-lib-kits VAR_LIB_KITS]
[--var-organisms VAR_ORGANISMS]
[--var-instrument VAR_INSTRUMENT]
[--var-center-name VAR_CENTER_NAME]
[--var-center-contact VAR_CENTER_CONTACT]
[--var-study-title VAR_STUDY_TITLE]
[--var-s-file-name VAR_S_FILE_NAME]
[--var-assay-prefix VAR_ASSAY_PREFIX]
[--var-a-measurement-abbreviation VAR_A_MEASUREMENT_ABBREVIATION]
[--var-assay-name VAR_ASSAY_NAME]
[--var-sample-type VAR_SAMPLE_TYPE]
[--var-lib-strategy VAR_LIB_STRATEGY]
[--var-lib-selection VAR_LIB_SELECTION]
[--var-lib-layout VAR_LIB_LAYOUT]
[--var---output-dir VAR___OUTPUT_DIR]
output_dir
output_dir | Path to output directory |
--var-investigation-title | template variable 'investigation_title' |
--var-sample-names | template variable 'sample_names' |
--var-a-measurement-type | template variable 'a_measurement_type' |
--var-a-measurement-types | template variable 'a_measurement_types' |
--var-a-technology-type | template variable 'a_technology_type' |
--var-a-technology-types | template variable 'a_technology_types' |
--var-lib-kit | template variable 'lib_kit' |
--var-organism | template variable 'organism' |
--var-batch | template variable 'batch' |
--var-lib-kits | template variable 'lib_kits' |
--var-organisms | template variable 'organisms' |
--var-instrument | template variable 'instrument' |
--var-center-name | template variable 'center_name' |
--var-center-contact | template variable 'center_contact' |
--var-study-title | template variable 'study_title' |
--var-s-file-name | template variable 's_file_name' |
--var-assay-prefix | template variable 'assay_prefix' |
--var-a-measurement-abbreviation | template variable 'a_measurement_abbreviation' |
--var-assay-name | template variable 'assay_name' |
--var-sample-type | template variable 'sample_type' |
--var-lib-strategy | template variable 'lib_strategy' |
--var-lib-selection | template variable 'lib_selection' |
--var-lib-layout | template variable 'lib_layout' |
--var---output-dir | template variable '__output_dir' |
tumor_normal_dna¶
When specifying the --var-* arguments, you can use JSON syntax. Values that cannot be parsed as JSON are kept as strings.
cubi-tk isa-tpl tumor_normal_dna [-h]
[--var-investigation-title VAR_INVESTIGATION_TITLE]
[--var-sample-names VAR_SAMPLE_NAMES]
[--var-a-measurement-type VAR_A_MEASUREMENT_TYPE]
[--var-lib-kit VAR_LIB_KIT]
[--var-lib-kits VAR_LIB_KITS]
[--var-instrument VAR_INSTRUMENT]
[--var-center-name VAR_CENTER_NAME]
[--var-center-contact VAR_CENTER_CONTACT]
[--var-study-title VAR_STUDY_TITLE]
[--var-is-triplet VAR_IS_TRIPLET]
[--var-s-file-name VAR_S_FILE_NAME]
[--var-assay-prefix VAR_ASSAY_PREFIX]
[--var-a-technology-type VAR_A_TECHNOLOGY_TYPE]
[--var-a-measurement-abbreviation VAR_A_MEASUREMENT_ABBREVIATION]
[--var-assay-name VAR_ASSAY_NAME]
[--var-sample-type VAR_SAMPLE_TYPE]
[--var-lib-strategy VAR_LIB_STRATEGY]
[--var-lib-selection VAR_LIB_SELECTION]
[--var-lib-layout VAR_LIB_LAYOUT]
[--var---output-dir VAR___OUTPUT_DIR]
output_dir
output_dir | Path to output directory |
--var-investigation-title | template variable 'investigation_title' |
--var-sample-names | template variable 'sample_names' |
--var-a-measurement-type | template variable 'a_measurement_type' |
--var-lib-kit | template variable 'lib_kit' |
--var-lib-kits | template variable 'lib_kits' |
--var-instrument | template variable 'instrument' |
--var-center-name | template variable 'center_name' |
--var-center-contact | template variable 'center_contact' |
--var-study-title | template variable 'study_title' |
--var-is-triplet | template variable 'is_triplet' |
--var-s-file-name | template variable 's_file_name' |
--var-assay-prefix | template variable 'assay_prefix' |
--var-a-technology-type | template variable 'a_technology_type' |
--var-a-measurement-abbreviation | template variable 'a_measurement_abbreviation' |
--var-assay-name | template variable 'assay_name' |
--var-sample-type | template variable 'sample_type' |
--var-lib-strategy | template variable 'lib_strategy' |
--var-lib-selection | template variable 'lib_selection' |
--var-lib-layout | template variable 'lib_layout' |
--var---output-dir | template variable '__output_dir' |
tumor_normal_triplets¶
When specifying the --var-* arguments, you can use JSON syntax. Values that cannot be parsed as JSON are kept as strings.
cubi-tk isa-tpl tumor_normal_triplets [-h]
[--var-investigation-title VAR_INVESTIGATION_TITLE]
[--var-sample-names VAR_SAMPLE_NAMES]
[--var-a-measurement-type VAR_A_MEASUREMENT_TYPE]
[--var-lib-kit VAR_LIB_KIT]
[--var-lib-kits VAR_LIB_KITS]
[--var-instrument VAR_INSTRUMENT]
[--var-center-name VAR_CENTER_NAME]
[--var-center-contact VAR_CENTER_CONTACT]
[--var-study-title VAR_STUDY_TITLE]
[--var-is-triplet VAR_IS_TRIPLET]
[--var-s-file-name VAR_S_FILE_NAME]
[--var-assay-prefix VAR_ASSAY_PREFIX]
[--var-a-technology-type VAR_A_TECHNOLOGY_TYPE]
[--var-a-measurement-abbreviation VAR_A_MEASUREMENT_ABBREVIATION]
[--var-assay-name VAR_ASSAY_NAME]
[--var-sample-type VAR_SAMPLE_TYPE]
[--var-lib-strategy VAR_LIB_STRATEGY]
[--var-lib-selection VAR_LIB_SELECTION]
[--var-lib-layout VAR_LIB_LAYOUT]
[--var---output-dir VAR___OUTPUT_DIR]
output_dir
output_dir | Path to output directory |
--var-investigation-title | template variable 'investigation_title' |
--var-sample-names | template variable 'sample_names' |
--var-a-measurement-type | template variable 'a_measurement_type' |
--var-lib-kit | template variable 'lib_kit' |
--var-lib-kits | template variable 'lib_kits' |
--var-instrument | template variable 'instrument' |
--var-center-name | template variable 'center_name' |
--var-center-contact | template variable 'center_contact' |
--var-study-title | template variable 'study_title' |
--var-is-triplet | template variable 'is_triplet' |
--var-s-file-name | template variable 's_file_name' |
--var-assay-prefix | template variable 'assay_prefix' |
--var-a-technology-type | template variable 'a_technology_type' |
--var-a-measurement-abbreviation | template variable 'a_measurement_abbreviation' |
--var-assay-name | template variable 'assay_name' |
--var-sample-type | template variable 'sample_type' |
--var-lib-strategy | template variable 'lib_strategy' |
--var-lib-selection | template variable 'lib_selection' |
--var-lib-layout | template variable 'lib_layout' |
--var---output-dir | template variable '__output_dir' |
germline¶
When specifying the --var-* arguments, you can use JSON syntax. Values that cannot be parsed as JSON are kept as strings.
cubi-tk isa-tpl germline [-h]
[--var-investigation-title VAR_INVESTIGATION_TITLE]
[--var-sample-names VAR_SAMPLE_NAMES]
[--var-a-measurement-type VAR_A_MEASUREMENT_TYPE]
[--var-lib-kit VAR_LIB_KIT] [--var-batch VAR_BATCH]
[--var-lib-kits VAR_LIB_KITS]
[--var-instrument VAR_INSTRUMENT]
[--var-center-name VAR_CENTER_NAME]
[--var-center-contact VAR_CENTER_CONTACT]
[--var-study-title VAR_STUDY_TITLE]
[--var-s-file-name VAR_S_FILE_NAME]
[--var-assay-prefix VAR_ASSAY_PREFIX]
[--var-a-technology-type VAR_A_TECHNOLOGY_TYPE]
[--var-a-measurement-abbreviation VAR_A_MEASUREMENT_ABBREVIATION]
[--var-assay-name VAR_ASSAY_NAME]
[--var-sample-type VAR_SAMPLE_TYPE]
[--var-lib-strategy VAR_LIB_STRATEGY]
[--var-lib-selection VAR_LIB_SELECTION]
[--var-lib-layout VAR_LIB_LAYOUT]
[--var---output-dir VAR___OUTPUT_DIR]
output_dir
output_dir | Path to output directory |
--var-investigation-title | template variable 'investigation_title' |
--var-sample-names | template variable 'sample_names' |
--var-a-measurement-type | template variable 'a_measurement_type' |
--var-lib-kit | template variable 'lib_kit' |
--var-batch | template variable 'batch' |
--var-lib-kits | template variable 'lib_kits' |
--var-instrument | template variable 'instrument' |
--var-center-name | template variable 'center_name' |
--var-center-contact | template variable 'center_contact' |
--var-study-title | template variable 'study_title' |
--var-s-file-name | template variable 's_file_name' |
--var-assay-prefix | template variable 'assay_prefix' |
--var-a-technology-type | template variable 'a_technology_type' |
--var-a-measurement-abbreviation | template variable 'a_measurement_abbreviation' |
--var-assay-name | template variable 'assay_name' |
--var-sample-type | template variable 'sample_type' |
--var-lib-strategy | template variable 'lib_strategy' |
--var-lib-selection | template variable 'lib_selection' |
--var-lib-layout | template variable 'lib_layout' |
--var---output-dir | template variable '__output_dir' |
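A hypothetical germline invocation (all values and the output directory are placeholders); values that parse as JSON, such as the number 1, are taken as such, everything else is kept as a string:
$ cubi-tk isa-tpl germline \
      --var-investigation-title "Rare disease cohort" \
      --var-batch 1 \
      isatab-germline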
generic¶
When specifying the --var-* arguments, you can use JSON syntax. Values that cannot be parsed as JSON are kept as strings.
cubi-tk isa-tpl generic [-h]
[--var-investigation-title VAR_INVESTIGATION_TITLE]
[--var-sample-names VAR_SAMPLE_NAMES]
[--var-a-measurement-type VAR_A_MEASUREMENT_TYPE]
[--var-a-measurement-types VAR_A_MEASUREMENT_TYPES]
[--var-a-technology-type VAR_A_TECHNOLOGY_TYPE]
[--var-a-technology-types VAR_A_TECHNOLOGY_TYPES]
[--var-lib-kit VAR_LIB_KIT]
[--var-organism VAR_ORGANISM] [--var-batch VAR_BATCH]
[--var-lib-kits VAR_LIB_KITS]
[--var-organisms VAR_ORGANISMS]
[--var-instrument VAR_INSTRUMENT]
[--var-center-name VAR_CENTER_NAME]
[--var-center-contact VAR_CENTER_CONTACT]
[--var-study-title VAR_STUDY_TITLE]
[--var-s-file-name VAR_S_FILE_NAME]
[--var-assay-prefix VAR_ASSAY_PREFIX]
[--var-a-measurement-abbreviation VAR_A_MEASUREMENT_ABBREVIATION]
[--var-assay-name VAR_ASSAY_NAME]
[--var-sample-type VAR_SAMPLE_TYPE]
[--var-lib-strategy VAR_LIB_STRATEGY]
[--var-lib-selection VAR_LIB_SELECTION]
[--var-lib-layout VAR_LIB_LAYOUT]
[--var---output-dir VAR___OUTPUT_DIR]
output_dir
output_dir | Path to output directory |
--var-investigation-title | template variable 'investigation_title' |
--var-sample-names | template variable 'sample_names' |
--var-a-measurement-type | template variable 'a_measurement_type' |
--var-a-measurement-types | template variable 'a_measurement_types' |
--var-a-technology-type | template variable 'a_technology_type' |
--var-a-technology-types | template variable 'a_technology_types' |
--var-lib-kit | template variable 'lib_kit' |
--var-organism | template variable 'organism' |
--var-batch | template variable 'batch' |
--var-lib-kits | template variable 'lib_kits' |
--var-organisms | template variable 'organisms' |
--var-instrument | template variable 'instrument' |
--var-center-name | template variable 'center_name' |
--var-center-contact | template variable 'center_contact' |
--var-study-title | template variable 'study_title' |
--var-s-file-name | template variable 's_file_name' |
--var-assay-prefix | template variable 'assay_prefix' |
--var-a-measurement-abbreviation | template variable 'a_measurement_abbreviation' |
--var-assay-name | template variable 'assay_name' |
--var-sample-type | template variable 'sample_type' |
--var-lib-strategy | template variable 'lib_strategy' |
--var-lib-selection | template variable 'lib_selection' |
--var-lib-layout | template variable 'lib_layout' |
--var---output-dir | template variable '__output_dir' |
microarray¶
When specifying the --var-* arguments, you can use JSON syntax. Values that cannot be parsed as JSON are kept as strings.
cubi-tk isa-tpl microarray [-h]
[--var-investigation-title VAR_INVESTIGATION_TITLE]
[--var-sample-names VAR_SAMPLE_NAMES]
[--var-a-measurement-type VAR_A_MEASUREMENT_TYPE]
[--var-organism VAR_ORGANISM]
[--var-organisms VAR_ORGANISMS]
[--var-technology-platform VAR_TECHNOLOGY_PLATFORM]
[--var-array-design-ref VAR_ARRAY_DESIGN_REF]
[--var-study-title VAR_STUDY_TITLE]
[--var-s-file-name VAR_S_FILE_NAME]
[--var-assay-prefix VAR_ASSAY_PREFIX]
[--var-a-technology-type VAR_A_TECHNOLOGY_TYPE]
[--var-assay-name VAR_ASSAY_NAME]
[--var-terms VAR_TERMS]
[--var---output-dir VAR___OUTPUT_DIR]
output_dir
output_dir | Path to output directory |
--var-investigation-title | template variable 'investigation_title' |
--var-sample-names | template variable 'sample_names' |
--var-a-measurement-type | template variable 'a_measurement_type' |
--var-organism | template variable 'organism' |
--var-organisms | template variable 'organisms' |
--var-technology-platform | template variable 'technology_platform' |
--var-array-design-ref | template variable 'array_design_ref' |
--var-study-title | template variable 'study_title' |
--var-s-file-name | template variable 's_file_name' |
--var-assay-prefix | template variable 'assay_prefix' |
--var-a-technology-type | template variable 'a_technology_type' |
--var-assay-name | template variable 'assay_name' |
--var-terms | template variable 'terms' |
--var---output-dir | template variable '__output_dir' |
ms_meta_biocrates¶
When specifying the --var-* arguments, you can use JSON syntax. Values that cannot be parsed as JSON are kept as strings.
cubi-tk isa-tpl ms_meta_biocrates [-h]
[--var-investigation-title VAR_INVESTIGATION_TITLE]
[--var-study-title VAR_STUDY_TITLE]
[--var-study-id VAR_STUDY_ID]
[--var-study-file-name VAR_STUDY_FILE_NAME]
[--var-sample-names VAR_SAMPLE_NAMES]
[--var-organism VAR_ORGANISM]
[--var-organisms VAR_ORGANISMS]
[--var-assay-measurement-type VAR_ASSAY_MEASUREMENT_TYPE]
[--var-assay-technology-type VAR_ASSAY_TECHNOLOGY_TYPE]
[--var-assay-technology-types VAR_ASSAY_TECHNOLOGY_TYPES]
[--var-biocrates-kit VAR_BIOCRATES_KIT]
[--var-assay-prefix VAR_ASSAY_PREFIX]
[--var-assay-name VAR_ASSAY_NAME]
[--var-assay-measurement-abbreviation-LC VAR_ASSAY_MEASUREMENT_ABBREVIATION_LC]
[--var-assay-measurement-abbreviation-FIA VAR_ASSAY_MEASUREMENT_ABBREVIATION_FIA]
[--var-biocrates-metidq-version VAR_BIOCRATES_METIDQ_VERSION]
[--var-metaquac-version VAR_METAQUAC_VERSION]
[--var-instrument VAR_INSTRUMENT]
[--var-instruments VAR_INSTRUMENTS]
[--var-chromatography-instrument VAR_CHROMATOGRAPHY_INSTRUMENT]
[--var---output-dir VAR___OUTPUT_DIR]
output_dir
output_dir | Path to output directory |
--var-investigation-title | template variable 'investigation_title' |
--var-study-title | template variable 'study_title' |
--var-study-id | template variable 'study_id' |
--var-study-file-name | template variable 'study_file_name' |
--var-sample-names | template variable 'sample_names' |
--var-organism | template variable 'organism' |
--var-organisms | template variable 'organisms' |
--var-assay-measurement-type | template variable 'assay_measurement_type' |
--var-assay-technology-type | template variable 'assay_technology_type' |
--var-assay-technology-types | template variable 'assay_technology_types' |
--var-biocrates-kit | template variable 'biocrates_kit' |
--var-assay-prefix | template variable 'assay_prefix' |
--var-assay-name | template variable 'assay_name' |
--var-assay-measurement-abbreviation-LC | template variable 'assay_measurement_abbreviation_LC' |
--var-assay-measurement-abbreviation-FIA | template variable 'assay_measurement_abbreviation_FIA' |
--var-biocrates-metidq-version | template variable 'biocrates_metidq_version' |
--var-metaquac-version | template variable 'metaquac_version' |
--var-instrument | template variable 'instrument' |
--var-instruments | template variable 'instruments' |
--var-chromatography-instrument | template variable 'chromatography_instrument' |
--var---output-dir | template variable '__output_dir' |
stem_cell_core_bulk¶
When specifying the --var-* arguments, you can use JSON syntax. Values that cannot be parsed as JSON are kept as strings.
cubi-tk isa-tpl stem_cell_core_bulk [-h] [--var-group VAR_GROUP]
[--var-study-title VAR_STUDY_TITLE]
[--var-sample-numbers VAR_SAMPLE_NUMBERS]
[--var-investigation-title VAR_INVESTIGATION_TITLE]
[--var-source-type VAR_SOURCE_TYPE]
[--var-cellline VAR_CELLLINE]
[--var-cellculture-meta VAR_CELLCULTURE_META]
[--var-model-type VAR_MODEL_TYPE]
[--var-sample-meta VAR_SAMPLE_META]
[--var-a-measurement-type VAR_A_MEASUREMENT_TYPE]
[--var-a-technology-type VAR_A_TECHNOLOGY_TYPE]
[--var-library-kit VAR_LIBRARY_KIT]
[--var-library-construction-meta VAR_LIBRARY_CONSTRUCTION_META]
[--var-sequencing-meta VAR_SEQUENCING_META]
[--var-s-file-name VAR_S_FILE_NAME]
[--var-assay-prefix VAR_ASSAY_PREFIX]
[--var-assay-name VAR_ASSAY_NAME]
[--var---output-dir VAR___OUTPUT_DIR]
output_dir
output_dir | Path to output directory |
--var-group | template variable 'group' |
--var-study-title | template variable 'study_title' |
--var-sample-numbers | template variable 'sample_numbers' |
--var-investigation-title | template variable 'investigation_title' |
--var-source-type | template variable 'source_type' |
--var-cellline | template variable 'cellline' |
--var-cellculture-meta | template variable 'cellculture_meta' |
--var-model-type | template variable 'model_type' |
--var-sample-meta | template variable 'sample_meta' |
--var-a-measurement-type | template variable 'a_measurement_type' |
--var-a-technology-type | template variable 'a_technology_type' |
--var-library-kit | template variable 'library_kit' |
--var-library-construction-meta | template variable 'library_construction_meta' |
--var-sequencing-meta | template variable 'sequencing_meta' |
--var-s-file-name | template variable 's_file_name' |
--var-assay-prefix | template variable 'assay_prefix' |
--var-assay-name | template variable 'assay_name' |
--var---output-dir | template variable '__output_dir' |
stem_cell_core_sc¶
When specifying the --var-* arguments, you can use JSON syntax. Values that cannot be parsed as JSON are kept as strings.
cubi-tk isa-tpl stem_cell_core_sc [-h] [--var-group VAR_GROUP]
[--var-study-title VAR_STUDY_TITLE]
[--var-sample-names VAR_SAMPLE_NAMES]
[--var-investigation-title VAR_INVESTIGATION_TITLE]
[--var-source-type VAR_SOURCE_TYPE]
[--var-cellline VAR_CELLLINE]
[--var-cellculture-meta VAR_CELLCULTURE_META]
[--var-model-type VAR_MODEL_TYPE]
[--var-sample-meta VAR_SAMPLE_META]
[--var-a-measurement-type VAR_A_MEASUREMENT_TYPE]
[--var-a-technology-type VAR_A_TECHNOLOGY_TYPE]
[--var-library-kit VAR_LIBRARY_KIT]
[--var-dissociation-meta VAR_DISSOCIATION_META]
[--var-library-construction-meta VAR_LIBRARY_CONSTRUCTION_META]
[--var-sequencing-meta VAR_SEQUENCING_META]
[--var--library-types VAR__LIBRARY_TYPES]
[--var-library-type VAR_LIBRARY_TYPE]
[--var-sample-multiplexing VAR_SAMPLE_MULTIPLEXING]
[--var-genotype-multiplexing VAR_GENOTYPE_MULTIPLEXING]
[--var-s-file-name VAR_S_FILE_NAME]
[--var-assay-prefix VAR_ASSAY_PREFIX]
[--var-assay-name VAR_ASSAY_NAME]
[--var---output-dir VAR___OUTPUT_DIR]
output_dir
output_dir | Path to output directory |
--var-group | template variable 'group' |
--var-study-title | template variable 'study_title' |
--var-sample-names | template variable 'sample_names' |
--var-investigation-title | template variable 'investigation_title' |
--var-source-type | template variable 'source_type' |
--var-cellline | template variable 'cellline' |
--var-cellculture-meta | template variable 'cellculture_meta' |
--var-model-type | template variable 'model_type' |
--var-sample-meta | template variable 'sample_meta' |
--var-a-measurement-type | template variable 'a_measurement_type' |
--var-a-technology-type | template variable 'a_technology_type' |
--var-library-kit | template variable 'library_kit' |
--var-dissociation-meta | template variable 'dissociation_meta' |
--var-library-construction-meta | template variable 'library_construction_meta' |
--var-sequencing-meta | template variable 'sequencing_meta' |
--var--library-types | template variable '_library_types' |
--var-library-type | template variable 'library_type' |
--var-sample-multiplexing | template variable 'sample_multiplexing' |
--var-genotype-multiplexing | template variable 'genotype_multiplexing' |
--var-s-file-name | template variable 's_file_name' |
--var-assay-prefix | template variable 'assay_prefix' |
--var-assay-name | template variable 'assay_name' |
--var---output-dir | template variable '__output_dir' |
isa-tab¶
ISA-tab tools besides templating.
cubi-tk isa-tab [-h] {add-ped,resolve-hpo,annotate,validate} ...
Positional Arguments¶
isa_tab_cmd | Possible choices: add-ped, resolve-hpo, annotate, validate |
Sub-commands¶
add-ped¶
Add records from PED file to ISA-tab
cubi-tk isa-tab add-ped [-h] [--sample-name-normalization {snappy,none}]
[--yes] [--dry-run] [--no-show-diff]
[--show-diff-side-by-side] [--batch-no BATCH_NO]
[--library-type {WES,WGS,Panel_seq}]
[--library-layout {SINGLE,PAIRED}]
[--library-kit LIBRARY_KIT]
[--library-kit-catalogue-id LIBRARY_KIT_CATALOGUE_ID]
[--platform PLATFORM]
[--instrument-model INSTRUMENT_MODEL]
investigation.tsv pedigree.ped
investigation.tsv | Path to ISA-tab investigation file. |
pedigree.ped | Path to PLINK PED file with records to add. |
--sample-name-normalization | Possible choices: snappy, none Normalize sample names. Default: "snappy" |
--yes | Assume all answers are yes. Default: False |
--dry-run, -n | Perform a dry run, i.e., don't change anything, only display the changes; implies '--show-diff'. Default: False |
--no-show-diff, -D | Don't show changes when creating/updating sample sheets. Default: True |
--show-diff-side-by-side | Show diff side by side instead of unified. Default: False |
--batch-no | Value to set as the batch number. Default: "." |
--library-type | Possible choices: WES, WGS, Panel_seq The library type. Default: "WES" |
--library-layout | Possible choices: SINGLE, PAIRED The library layout. Default: "PAIRED" |
--library-kit | The library kit used. Default: "" |
--library-kit-catalogue-id | The library kit catalogue ID. Default: "" |
--platform | The string to use for the platform. Default: "ILLUMINA" |
--instrument-model | The string to use for the instrument model. Default: "" |
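A sketch of a typical call, using the file names from the usage line above (the library kit value is a placeholder):
$ cubi-tk isa-tab add-ped --library-type WGS --library-kit "<kit-name>" investigation.tsv pedigree.ped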
resolve-hpo¶
Resolve HPO term lists to ISA-tab fragments
cubi-tk isa-tab resolve-hpo [-h] [--hpo-obo-url HPO_OBO_URL] [term_file]
term_file | Path to file with HPO terms to resolve. Default: standard input |
--hpo-obo-url | URL to OBO file. Default: "http://purl.obolibrary.org/obo/hp.obo" |
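Because term_file defaults to standard input, terms can be piped in directly; a sketch using one HPO term:
$ echo "HP:0004322" | cubi-tk isa-tab resolve-hpo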
annotate¶
Add annotation from CSV file to ISA-tab
cubi-tk isa-tab annotate [-h] [--yes] [--dry-run] [--no-show-diff]
[--show-diff-side-by-side] [--force-update]
[--target-study s_study.tsv]
[--target-assay a_assay.tsv]
investigation.tsv annotation.tsv
investigation.tsv | Path to ISA-tab investigation file. |
annotation.tsv | Path to annotation (TSV) file with information to add. |
--yes | Assume all answers are yes. Default: False |
--dry-run, -n | Perform a dry run, i.e., don't change anything, only display the changes; implies '--show-diff'. Default: False |
--no-show-diff, -D | Don't show changes when creating/updating sample sheets. Default: True |
--show-diff-side-by-side | Show diff side by side instead of unified. Default: False |
--force-update | Overwrite non-empty ISA-tab entries. Default: False |
--target-study, -s | File name of the study to annotate. If not provided, the first study in the investigation is used. |
--target-assay, -a | File name of the assay to annotate. If not provided, the first assay in the investigation is used. |
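A sketch, directing the annotation at a specific assay file (file names as in the usage line above):
$ cubi-tk isa-tab annotate --target-assay a_assay.tsv investigation.tsv annotation.tsv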
validate¶
Validate ISA-tab
cubi-tk isa-tab validate [-h] [--show-duplicate-warnings] investigation.tsv
investigation.tsv | Path to ISA-tab investigation file. |
--show-duplicate-warnings | Show duplicated warnings, i.e., warnings with the same message and same category (False by default). Default: False |
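For example:
$ cubi-tk isa-tab validate investigation.tsv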
snappy¶
Tools for supporting the SNAPPY pipeline.
cubi-tk snappy [-h]
{check-local,check-remote,itransfer-raw-data,itransfer-ngs-mapping,itransfer-variant-calling,itransfer-step,pull-sheets,pull-all-data,pull-processed-data,pull-raw-data,varfish-upload,kickoff}
...
Positional Arguments¶
snappy_cmd | Possible choices: check-local, check-remote, itransfer-raw-data, itransfer-ngs-mapping, itransfer-variant-calling, itransfer-step, pull-sheets, pull-all-data, pull-processed-data, pull-raw-data, varfish-upload, kickoff |
Sub-commands¶
check-local¶
Check consistency within local sample sheet and between local sheets and files
cubi-tk snappy check-local [-h] [--tsv-shortcut {germline,cancer}]
[--base-path BASE_PATH]
[project_uuids [project_uuids ...]]
project_uuids | UUID(s) from project(s) to check. Use all if not given. |
--tsv-shortcut | Possible choices: germline, cancer The shortcut TSV schema to use. Default: “germline” |
--base-path | Base path of project (contains 'ngs_mapping/' etc.), spiders up from biomedsheet_tsv and falls back to current working directory by default. Default: current working directory |
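A sketch, checking a single project (the UUID and path are placeholders):
$ cubi-tk snappy check-local --base-path /path/to/snappy-project <project-uuid>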
check-remote¶
Check consistency within remote sample sheet and files
cubi-tk snappy check-remote [-h] [--sodar-url SODAR_URL]
[--sodar-api-token SODAR_API_TOKEN]
[--tsv-shortcut {cancer,generic,germline}]
[--base-path BASE_PATH] [--md5]
[--assay-uuid ASSAY_UUID]
project_uuid
project_uuid | UUID from Project to check. |
--sodar-url | URL to SODAR, defaults to SODAR_URL environment variable or fallback to https://sodar.bihealth.org/ Default: “https://sodar.bihealth.org/” |
--sodar-api-token | Authentication token when talking to SODAR. Defaults to SODAR_API_TOKEN environment variable. |
--tsv-shortcut | Possible choices: cancer, generic, germline The shortcut TSV schema to use. Default: "germline" |
--base-path | Base path of project (contains 'ngs_mapping/' etc.), spiders up from biomedsheet_tsv and falls back to current working directory by default. Default: current working directory |
--md5 | Flag to indicate if local and remote MD5 files should be compared. Default: False |
--assay-uuid | UUID from Assay to check. Used to specify target while dealing with multi-assay projects. |
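A sketch that also compares MD5 files and pins down one assay of a multi-assay project (UUIDs are placeholders):
$ cubi-tk snappy check-remote --md5 --assay-uuid <assay-uuid> <project-uuid>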
itransfer-raw-data¶
Transfer FASTQs into iRODS landing zone
cubi-tk snappy itransfer-raw-data [-h] [--sodar-url SODAR_URL]
[--sodar-api-token SODAR_API_TOKEN]
[--num-parallel-transfers NUM_PARALLEL_TRANSFERS]
[--tsv-shortcut {germline,cancer}]
[--first-batch FIRST_BATCH]
[--last-batch LAST_BATCH]
[--base-path BASE_PATH]
[--remote-dir-date REMOTE_DIR_DATE]
[--remote-dir-pattern REMOTE_DIR_PATTERN]
[--yes] [--validate-and-move]
[--assay ASSAY]
destination
destination | UUID from Landing Zone or Project - where files will be moved to. |
--num-parallel-transfers | Number of parallel transfers, defaults to 8. Default: 8 |
--tsv-shortcut | Possible choices: germline, cancer The shortcut TSV schema to use. Default: "germline" |
--first-batch | First batch to be transferred. Default: 0 |
--last-batch | Last batch to be transferred. |
--base-path | Base path of project (contains 'ngs_mapping/' etc.), defaults to current path. Default: current working directory |
--remote-dir-date | Date to use in remote directory, defaults to YYYY-MM-DD of today. |
--remote-dir-pattern | Pattern to use for constructing the remote path. Default: "{library_name}/{step}/{date}" |
--yes | Assume all answers are yes, e.g., will create or use existing available landing zones without asking. Default: False |
--validate-and-move | After files are transferred to SODAR, proceed with validation and move. Default: False |
--assay | UUID of assay to download data for. |
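A sketch, transferring all batches from 2 onwards into a landing zone (the UUID is a placeholder):
$ cubi-tk snappy itransfer-raw-data --first-batch 2 <landing-zone-uuid>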
itransfer-ngs-mapping¶
Transfer ngs_mapping results into iRODS landing zone
cubi-tk snappy itransfer-ngs-mapping [-h] [--sodar-url SODAR_URL]
[--sodar-api-token SODAR_API_TOKEN]
[--num-parallel-transfers NUM_PARALLEL_TRANSFERS]
[--tsv-shortcut {germline,cancer}]
[--first-batch FIRST_BATCH]
[--last-batch LAST_BATCH]
[--base-path BASE_PATH]
[--remote-dir-date REMOTE_DIR_DATE]
[--remote-dir-pattern REMOTE_DIR_PATTERN]
[--yes] [--validate-and-move]
[--assay ASSAY] [--mapper MAPPER]
destination
destination | UUID from Landing Zone or Project - where files will be moved to. |
--num-parallel-transfers | Number of parallel transfers, defaults to 8. Default: 8 |
--tsv-shortcut | Possible choices: germline, cancer The shortcut TSV schema to use. Default: "germline" |
--first-batch | First batch to be transferred. Default: 0 |
--last-batch | Last batch to be transferred. |
--base-path | Base path of project (contains 'ngs_mapping/' etc.), defaults to current path. Default: current working directory |
--remote-dir-date | Date to use in remote directory, defaults to YYYY-MM-DD of today. |
--remote-dir-pattern | Pattern to use for constructing the remote path. Default: "{library_name}/{step}/{date}" |
--yes | Assume all answers are yes, e.g., will create or use existing available landing zones without asking. Default: False |
--validate-and-move | After files are transferred to SODAR, proceed with validation and move. Default: False |
--assay | UUID of assay to download data for. |
--mapper | Name of the mapper to transfer for, defaults to bwa. Default: "bwa" |
itransfer-variant-calling¶
Transfer variant_calling results into iRODS landing zone
cubi-tk snappy itransfer-variant-calling [-h] [--sodar-url SODAR_URL]
[--sodar-api-token SODAR_API_TOKEN]
[--num-parallel-transfers NUM_PARALLEL_TRANSFERS]
[--tsv-shortcut {germline,cancer}]
[--first-batch FIRST_BATCH]
[--last-batch LAST_BATCH]
[--base-path BASE_PATH]
[--remote-dir-date REMOTE_DIR_DATE]
[--remote-dir-pattern REMOTE_DIR_PATTERN]
[--yes] [--validate-and-move]
[--assay ASSAY] [--mapper MAPPER]
[--caller CALLER]
destination
destination | UUID from Landing Zone or Project - where files will be moved to. |
--num-parallel-transfers | Number of parallel transfers, defaults to 8. Default: 8 |
--tsv-shortcut | Possible choices: germline, cancer The shortcut TSV schema to use. Default: "germline" |
--first-batch | First batch to be transferred. Default: 0 |
--last-batch | Last batch to be transferred. |
--base-path | Base path of project (contains 'ngs_mapping/' etc.), defaults to current path. Default: current working directory |
--remote-dir-date | Date to use in remote directory, defaults to YYYY-MM-DD of today. |
--remote-dir-pattern | Pattern to use for constructing the remote path. Default: "{library_name}/{step}/{date}" |
--yes | Assume all answers are yes, e.g., will create or use existing available landing zones without asking. Default: False |
--validate-and-move | After files are transferred to SODAR, proceed with validation and move. Default: False |
--assay | UUID of assay to download data for. |
--mapper | Name of the mapper to transfer for, defaults to bwa. Default: "bwa" |
--caller | Name of the variant caller to transfer for, defaults to gatk_hc. Default: "gatk_hc" |
itransfer-step¶
Transfer snappy step results into iRODS landing zone
cubi-tk snappy itransfer-step [-h] [--sodar-url SODAR_URL]
[--sodar-api-token SODAR_API_TOKEN]
[--num-parallel-transfers NUM_PARALLEL_TRANSFERS]
[--tsv-shortcut {germline,cancer}]
[--first-batch FIRST_BATCH]
[--last-batch LAST_BATCH]
[--base-path BASE_PATH]
[--remote-dir-date REMOTE_DIR_DATE]
[--remote-dir-pattern REMOTE_DIR_PATTERN]
[--yes] [--validate-and-move] [--assay ASSAY]
[--step STEP] [--tool [TOOL [TOOL ...]]]
destination
destination | UUID from Landing Zone or Project - where files will be moved to. |
--num-parallel-transfers | Number of parallel transfers, defaults to 8. Default: 8 |
--tsv-shortcut | Possible choices: germline, cancer The shortcut TSV schema to use. Default: "germline" |
--first-batch | First batch to be transferred. Default: 0 |
--last-batch | Last batch to be transferred. |
--base-path | Base path of project (contains 'ngs_mapping/' etc.), defaults to current path. Default: current working directory |
--remote-dir-date | Date to use in remote directory, defaults to YYYY-MM-DD of today. |
--remote-dir-pattern | Pattern to use for constructing the remote path. Default: "{library_name}/{step}/{date}" |
--yes | Assume all answers are yes, e.g., will create or use existing available landing zones without asking. Default: False |
--validate-and-move | After files are transferred to SODAR, proceed with validation and move. Default: False |
--assay | UUID of assay to download data for. |
--step | Name of the snappy pipeline step (step name must be identical to the step directory). Step names are available from the snappy command 'snappy-start-step --help'. |
--tool | Name of the tool, for example bwa. Tool order is important: it must match the order used to generate the filename prefix. For example, the variant annotation step requires the mapper, caller and annotator software. In that case, the snappy file prefix is <mapper>.<caller>.<annotator>, so the command would be: --tool <mapper> <caller> <annotator>. Some steps add more information to their prefix, for example 'jannovar_somatic_vcf'. |
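A sketch illustrating the tool ordering described for --tool (step name, tool names, and UUID are placeholders that must match your pipeline configuration):
$ cubi-tk snappy itransfer-step --step <step_name> --tool <mapper> <caller> <annotator> <landing-zone-uuid>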
pull-sheets¶
Pull SODAR sample sheets into biomedsheet
cubi-tk snappy pull-sheets [-h] [--base-path BASE_PATH] [--yes] [--dry-run]
[--no-show-diff] [--show-diff-side-by-side]
[--library-types LIBRARY_TYPES]
[--first-batch FIRST_BATCH]
[--last-batch LAST_BATCH]
[--tsv-shortcut {cancer,generic,germline}]
--base-path | Base path of project (contains '.snappy_pipeline/' etc.), spiders up from current work directory and falls back to current working directory by default. Default: current working directory |
--yes | Assume all answers are yes. Default: False |
--dry-run, -n | Perform a dry run, i.e., don't change anything, only display the changes; implies '--show-diff'. Default: False |
--no-show-diff, -D | Don't show changes when creating/updating sample sheets. Default: True |
--show-diff-side-by-side | Show diff side by side instead of unified. Default: False |
--library-types | Library type(s) to use, comma-separated, default is to use all. |
--first-batch | First batch to be included in local sample sheet. Default: 0 |
--last-batch | Last batch to be included in local sample sheet. Not used by default. |
--tsv-shortcut | Possible choices: cancer, generic, germline The shortcut TSV schema to use. Default: "germline" |
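A sketch, pulling a cancer project sheet starting at batch 1:
$ cubi-tk snappy pull-sheets --tsv-shortcut cancer --first-batch 1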
pull-all-data¶
Pull all data from SODAR to specified output directory
cubi-tk snappy pull-all-data [-h] [--base-path BASE_PATH]
[--sodar-url SODAR_URL]
[--sodar-api-token SODAR_API_TOKEN]
--output-directory OUTPUT_DIRECTORY [--overwrite]
[--first-batch FIRST_BATCH] [--samples SAMPLES]
[--allow-missing] [--yes] [--dry-run]
[--irsync-threads IRSYNC_THREADS]
[--assay ASSAY_UUID]
project_uuid
project_uuid | UUID of project to download data for. |
--base-path | Base path of project (contains '.snappy_pipeline/' etc.), spiders up from current work directory and falls back to current working directory by default. Default: current working directory |
--output-directory | Output directory, where downloaded files will be stored. |
--overwrite | Allow overwriting of files. Default: False |
--first-batch | First batch number to pull. Default: 0 |
--samples | Optional list of samples to pull. |
--allow-missing | Allow missing data in assay. Default: False |
--yes | Assume all answers are yes. Default: False |
--dry-run, -n | Perform a dry run, i.e., don't change anything, only display the changes. Default: False |
--irsync-threads | Parameter -N to pass to irsync. |
--assay | UUID of assay to download data for. |
pull-processed-data¶
Pull processed data from SODAR to specified output directory
cubi-tk snappy pull-processed-data [-h] [--sodar-url SODAR_URL]
[--sodar-api-token SODAR_API_TOKEN]
[--tsv-shortcut {cancer,generic,germline}]
[--base-path BASE_PATH]
[--selected-samples SELECTED_SAMPLES]
[--first-batch FIRST_BATCH]
[--last-batch LAST_BATCH]
--output-directory OUTPUT_DIRECTORY
[--sample-id] --file-type
{bam,vcf,txt,csv,log}
[--download-all-versions] [--overwrite]
[--assay-uuid ASSAY_UUID]
project_uuid
project_uuid | UUID from Project to check. |
--sodar-url | URL to SODAR, defaults to SODAR_URL environment variable or fallback to https://sodar.bihealth.org/ Default: “https://sodar.bihealth.org/” |
--sodar-api-token | Authentication token when talking to SODAR. Defaults to SODAR_API_TOKEN environment variable. |
--tsv-shortcut | Possible choices: cancer, generic, germline The shortcut TSV schema to use. Default: "germline" |
--base-path | Base path of project (contains 'ngs_mapping/' etc.), spiders up from biomedsheet_tsv and falls back to current working directory by default. Default: current working directory |
--selected-samples | Limits the request to the listed sample names. Don't include the full library name, just the sample name (e.g., 'P001' instead of 'P001-N1-DNA1-WES1'). Separate samples with commas, e.g., 'P001,P002,P003'. Note: this argument overrides the batch-related arguments. |
--first-batch | First batch to be transferred. Default: 0 |
--last-batch | Last batch to be transferred. |
--output-directory | Output directory, where downloaded files will be stored. |
--sample-id | Flag to indicate if the search should be based on the sample identifier (e.g., 'P001') instead of the library name (e.g., 'P001-N1-DNA1-WGS1'). Default: False |
--file-type | Possible choices: bam, vcf, txt, csv, log File extensions to be retrieved. |
--download-all-versions | By default, only the latest version of a file is downloaded. For instance, if a file was uploaded twice, in '2022-01-31' and '2022-02-28', only the latest is downloaded. If this flag is present, both versions will be downloaded. Default: False |
--overwrite | Allow overwriting of local files. Default: False |
--assay-uuid | UUID from Assay to check. Used to specify target while dealing with multi-assay projects. |
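A sketch, downloading only VCF files for two samples (the UUID is a placeholder):
$ cubi-tk snappy pull-processed-data --file-type vcf --selected-samples P001,P002 --output-directory out/ <project-uuid>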
pull-raw-data¶
Pull raw data from SODAR to SNAPPY dataset raw data directory
cubi-tk snappy pull-raw-data [-h] [--base-path BASE_PATH]
[--sodar-url SODAR_URL]
[--sodar-api-token SODAR_API_TOKEN] [--dry-run]
[--overwrite]
[--tsv-shortcut {cancer,generic,germline}]
[--first-batch FIRST_BATCH]
[--last-batch LAST_BATCH] [--samples SAMPLES]
[--use-library-name] [--assay-uuid ASSAY_UUID]
project_uuid
project_uuid | UUID of project to download data for. |
--base-path | Base path of project (contains '.snappy_pipeline/' etc.), spiders up from current work directory and falls back to current working directory by default. Default: current working directory |
--dry-run, -n | Perform a dry run, i.e., just display the files that would be downloaded. Default: False |
--overwrite | Allow overwriting of files. Default: False |
--tsv-shortcut | Possible choices: cancer, generic, germline The shortcut TSV schema to use. Default: "germline" |
--first-batch | First batch to be transferred. Default: 0 |
--last-batch | Last batch to be transferred. |
--samples | Optional list of samples to pull. |
--use-library-name | Flag to indicate that the search in SODAR directories should be based on the library name (e.g., 'P001-N1-DNA1-WGS1') instead of the sample identifier (e.g., 'P001') in the file name. Default: False |
--assay-uuid | UUID of assay to download data for. |
varfish-upload¶
Upload variant analysis results into VarFish
cubi-tk snappy varfish-upload [-h] [--varfish-config VARFISH_CONFIG]
[--varfish-server-url VARFISH_SERVER_URL]
[--varfish-api-token VARFISH_API_TOKEN]
[--base-path BASE_PATH] [--steps STEPS]
[--external-data] [--min-batch MIN_BATCH]
[--yes] [--samples SAMPLES]
project [project ...]
project | The UUID(s) of the SODAR project to submit. |
--base-path | Base path of project (contains '.snappy_pipeline/' etc.), spiders up from current work directory and falls back to current working directory by default. Default: current working directory |
--steps | Pipeline steps to consider for the export. Defaults to include all of the following; specify this with +name/-name to add/remove, and either give multiple arguments or use a comma-separated list. {ngs_mapping, targeted_seq_cnv_export, varfish_export, variant_export, variant_export_external, wgs_cnv_export, wgs_cnv_export_external, wgs_sv_export, wgs_sv_export_external} Default: [] |
--external-data | Flag to indicate that data was externally generated. The search for files will not filter based on common internal tool combinations, e.g., 'bwa.delly2' or 'bwa.gatk_hc'. Default: False |
--min-batch | Smallest batch to transfer, keep empty to transfer all. |
--yes, -y | Assume yes to all answers. Default: False |
--samples | Limits the submission to the listed sample names. Don't include the full library name, just the sample name (e.g., 'P001' instead of 'P001-N1-DNA1-WES1'). Separate samples with commas, e.g., 'P001,P002,P003'. Default: "" |
--varfish-config | Path to configuration file. |
--varfish-server-url | VarFish server URL to use, defaults to env VARFISH_SERVER_URL. |
--varfish-api-token | VarFish API token to use, defaults to env VARFISH_API_TOKEN. |
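A sketch, uploading all samples from batch 3 onwards without prompting (the UUID is a placeholder):
$ cubi-tk snappy varfish-upload --min-batch 3 --yes <project-uuid>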
kickoff¶
Kick-off SNAPPY pipeline steps.
cubi-tk snappy kickoff [-h] [--dry-run] [--timeout TIMEOUT] [path]
path | Path into SNAPPY directory (below a directory containing .snappy_pipeline). |
--dry-run, -n | Perform dry-run, do not do anything. Default: False |
--timeout | Number of seconds to wait for commands. Default: 10 |
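A sketch, first checking what would be kicked off (the path is a placeholder):
$ cubi-tk snappy kickoff --dry-run /path/to/snappy-project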
sodar¶
SODAR command line interface.
cubi-tk sodar [-h]
{add-ped,download-sheet,upload-sheet,pull-raw-data,landing-zone-create,landing-zone-list,landing-zone-move,ingest-fastq,check-remote}
...
Positional Arguments¶
sodar_cmd | Possible choices: add-ped, download-sheet, upload-sheet, pull-raw-data, landing-zone-create, landing-zone-list, landing-zone-move, ingest-fastq, check-remote |
Sub-commands¶
add-ped¶
Augment sample sheet from PED file
cubi-tk sodar add-ped [-h] [--sodar-url SODAR_URL]
[--sodar-api-token SODAR_API_TOKEN] [--dry-run]
[--show-diff] [--show-diff-side-by-side]
[--sample-name-normalization {snappy,none}] [--yes]
[--batch-no BATCH_NO]
[--library-type {WES,WGS,Panel_seq}]
[--library-layout {SINGLE,PAIRED}]
[--library-kit LIBRARY_KIT]
[--library-kit-catalogue-id LIBRARY_KIT_CATALOGUE_ID]
[--platform PLATFORM]
[--instrument-model INSTRUMENT_MODEL]
project_uuid pedigree.ped
project_uuid | UUID of project to download the ISA-tab for. |
pedigree.ped | Path to PLINK PED file with records to add. |
--dry-run, -n | Perform a dry run, i.e., don't change anything, only display the changes; implies '--show-diff'. Default: False |
--show-diff, -D | Show changes when creating/updating sample sheets. Default: False |
--show-diff-side-by-side | Show diff side by side instead of unified. Default: False |
--sample-name-normalization | Possible choices: snappy, none Normalize sample names. Default: "snappy" |
--yes | Assume all answers are yes. Default: False |
--batch-no | Value to set as the batch number. Default: "." |
--library-type | Possible choices: WES, WGS, Panel_seq The library type. Default: "WES" |
--library-layout | Possible choices: SINGLE, PAIRED The library layout. Default: "PAIRED" |
--library-kit | The library kit used. Default: "" |
--library-kit-catalogue-id | The library kit catalogue ID. Default: "" |
--platform | The string to use for the platform. Default: "ILLUMINA" |
--instrument-model | The string to use for the instrument model. Default: "" |
download-sheet¶
Download ISA-tab
cubi-tk sodar download-sheet [-h] [--sodar-url SODAR_URL]
[--sodar-api-token SODAR_API_TOKEN]
[--no-makedirs] [--overwrite] [--yes] [--dry-run]
[--show-diff] [--show-diff-side-by-side]
project_uuid output_dir
project_uuid | UUID of project to download the ISA-tab for. |
output_dir | Path to output directory to write the sheet to. |
--no-makedirs | Do not create output directories (created by default). Default: True |
--overwrite | Allow overwriting of files. Default: False |
--yes | Assume all answers are yes. Default: False |
--dry-run, -n | Perform a dry run, i.e., don't change anything, only display the changes; implies '--show-diff'. Default: False |
--show-diff, -D | Show changes when creating/updating sample sheets. Default: False |
--show-diff-side-by-side | Show diff side by side instead of unified. Default: False |
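A sketch (the UUID is a placeholder):
$ cubi-tk sodar download-sheet <project-uuid> isatab/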
upload-sheet¶
Upload and replace ISA-tab
cubi-tk sodar upload-sheet [-h] [--sodar-url SODAR_URL]
[--sodar-api-token SODAR_API_TOKEN]
project_uuid input_investigation_file
project_uuid | UUID of project to upload the ISA-tab for. |
input_investigation_file | Path to input investigation file. |
pull-raw-data¶
Download raw data from iRODS
cubi-tk sodar pull-raw-data [-h] [--sodar-url SODAR_URL]
[--sodar-api-token SODAR_API_TOKEN] [--overwrite]
[--min-batch MIN_BATCH] [--allow-missing] [--yes]
[--dry-run] [--irsync-threads IRSYNC_THREADS]
[--assay ASSAY]
project_uuid output_dir
project_uuid | UUID of project to download data for. |
output_dir | Path to output directory to write the raw data to. |
--overwrite | Allow overwriting of files. Default: False |
--min-batch | Minimal batch number to pull. Default: 0 |
--allow-missing | Allow missing data in assay. Default: False |
--yes | Assume all answers are yes. Default: False |
--dry-run, -n | Perform a dry run, i.e., don't change anything, only display the changes. Default: False |
--irsync-threads | Parameter -N to pass to irsync. |
--assay | UUID of assay to download data for. |
landing-zone-create¶
Create a landing zone
cubi-tk sodar landing-zone-create [-h] [--sodar-url SODAR_URL]
[--sodar-api-token SODAR_API_TOKEN]
[--unless-exists] [--dry-run]
[--assay ASSAY] [--format FORMAT_STRING]
project_uuid
project_uuid | UUID of project to create the landing zone in. |
--unless-exists | If there already is a landing zone in the current project, use that one. Default: False |
--dry-run, -n | Perform a dry run, i.e., don't change anything, only display the changes. Default: False |
--assay | UUID of assay to create landing zone for. |
--format | Format string for printing, e.g., %(uuid)s |
landing-zone-list¶
List landing zones
cubi-tk sodar landing-zone-list [-h] [--sodar-url SODAR_URL]
[--sodar-api-token SODAR_API_TOKEN]
[--unless-exists] [--dry-run]
[--format FORMAT_STRING]
project_uuid
project_uuid | UUID of project to list the landing zones of. |
--unless-exists | If there already is a landing zone in the current project, use that one. Default: False |
--dry-run, -n | Perform a dry run, i.e., don't change anything, only display the changes. Default: False |
--format | Format string for printing, e.g., %(uuid)s |
landing-zone-move¶
Submit landing zone for moving
cubi-tk sodar landing-zone-move [-h] [--sodar-url SODAR_URL]
[--sodar-api-token SODAR_API_TOKEN]
[--dry-run] [--format FORMAT_STRING]
landing_zone_uuid
landing_zone_uuid | UUID of landing zone to move. |
--dry-run, -n | Perform a dry run, i.e., don't change anything, only display the changes. Default: False |
--format | Format string for printing, e.g., %(uuid)s |
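A sketch of the typical landing-zone life cycle (UUIDs are placeholders); --format "%(uuid)s" prints only the UUID of the new zone so it can be reused in the move step:
$ cubi-tk sodar landing-zone-create --format "%(uuid)s" <project-uuid>
$ cubi-tk sodar landing-zone-move <landing-zone-uuid>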
ingest-fastq¶
Upload external files to SODAR (defaults for fastq)
cubi-tk sodar ingest-fastq [-h] [--sodar-url SODAR_URL]
[--sodar-api-token SODAR_API_TOKEN]
[--num-parallel-transfers NUM_PARALLEL_TRANSFERS]
[--yes] [--base-path BASE_PATH]
[--remote-dir-date REMOTE_DIR_DATE]
[--src-regex SRC_REGEX]
[--remote-dir-pattern REMOTE_DIR_PATTERN]
[--add-suffix ADD_SUFFIX] [-m MATCH REPL]
[--tmp TMP]
sources [sources ...] destination
sources | Paths to FASTQ folders. |
destination | UUID or iRODS path of landing zone to move to. |
--num-parallel-transfers | Number of parallel transfers, defaults to 8. Default: 8 |
--yes | Assume the answer to all prompts is 'yes'. Default: False |
--base-path | Base path of project (contains 'ngs_mapping/' etc.), defaults to current path. Default: current working directory |
--remote-dir-date | Date to use in remote directory, defaults to YYYY-MM-DD of today. |
--src-regex | Regular expression to use for matching input fastq files. Default: "(.*/)?(?P<sample>.+?)(?:_(?P<lane>L[0-9]+?))?(?:_(?P<mate>R[0-9]+?))?(?:_(?P<batch>[0-9]+?))?.f(?:ast)?q.gz" |
--remote-dir-pattern | Pattern to use for constructing the remote path. Default: "{sample}/{date}/{filename}" |
--add-suffix | Suffix to add to all file names (e.g., '-N1-DNA1-WES1'). Default: "" |
-m, --remote-dir-mapping | Substitutions applied to the filled remote dir paths. Can, for example, be used to modify sample names. Uses Python's 're.sub' regex syntax. This argument can be used multiple times (i.e., '-m <regex1> <repl1> -m <regex2> <repl2>' ...). Default: [] |
--tmp | Folder to temporarily save files from WebDAV, if set as source. Default: "temp/" |
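A sketch, uploading a folder of FASTQ files and appending a library suffix to each sample name (path, suffix, and UUID are placeholders):
$ cubi-tk sodar ingest-fastq --add-suffix '-N1-DNA1-WES1' ./fastq_run1 <landing-zone-uuid>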
check-remote¶
Compare local files with md5 sum against SODAR/iRODS
cubi-tk sodar check-remote [-h] [--sodar-url SODAR_URL]
[--sodar-api-token SODAR_API_TOKEN] [-p BASE_PATH]
[--filename-only] [--recheck-md5]
[--report-md5sums] [--assay-uuid ASSAY_UUID]
project_uuid
project_uuid | UUID from Project to check. |
--sodar-url | URL to SODAR, defaults to SODAR_URL environment variable or fallback to https://sodar.bihealth.org/ Default: “https://sodar.bihealth.org/” |
--sodar-api-token | Authentication token when talking to SODAR. Defaults to SODAR_API_TOKEN environment variable. |
-p, --base-path | Base path in which local files with MD5 sums should be identified. Default: current working directory |
--filename-only | Flag to indicate whether file comparison between local and remote files should only use file names and ignore MD5 values. Default: False |
--recheck-md5 | Flag to double-check that MD5 sums stored in local files actually match their corresponding files. Default: False |
--report-md5sums | Flag to indicate if MD5 sums should be included in the file report. Default: False |
--assay-uuid | UUID from Assay to check. Used to specify target while dealing with multi-assay projects. |
irods¶
iRODS command line interface.
cubi-tk irods [-h] {check} ...
Positional Arguments¶
irods_cmd | Possible choices: check |
Sub-commands¶
check¶
Check target iRODS collection (all MD5 files? metadata MD5 consistent? enough replicas?).
cubi-tk irods check [-h] [-r REQ_NUM_REPS] [-p NUM_PARALLEL_TESTS]
[-d NUM_DISPLAY_FILES] [-s HASH_SCHEME]
irods_path
irods_path | Path to an iRODS collection. |
-r, --num-replicas | Minimum number of replicas, defaults to 2. Default: 2 |
-p, --num-parallel-tests | Number of parallel tests, defaults to 4. Default: 4 |
-d, --num-display-files | Number of files listed when checking, defaults to 20. Default: 20 |
-s, --hash-scheme | Hash scheme used to verify checksums, defaults to MD5. Default: "MD5" |
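A sketch, requiring at least three replicas (the collection path is a placeholder):
$ cubi-tk irods check -r 3 <irods-collection-path>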
org-raw¶
org_raw command line interface.
cubi-tk org-raw [-h] {check,organize} ...
Positional Arguments¶
org_raw_cmd | Possible choices: check, organize |
Sub-commands¶
check¶
Check consistency of raw data
cubi-tk org-raw check [-h] [--num-threads NUM_THREADS] [--no-gz-check]
[--no-md5-check] [--no-compute-md5]
[--missing-md5-error] [--create-md5-fail-no-error]
FILE.fastq.gz [FILE.fastq.gz ...]
FILE.fastq.gz | Path(s) to .fastq.gz files to perform the check for |
--num-threads | Number of parallel threads Default: 0 |
--no-gz-check | Deactivate check for gzip consistency (default is to perform check). Default: True |
--no-md5-check | Deactivate comparison of MD5 sum if .md5 file exists (default is to perform check). Default: True |
--no-compute-md5 | |
Deactivate computation of MD5 sum if missing (default is to compute MD5 sum). Default: True | |
--missing-md5-error | |
Make missing .md5 files constitute an error. Default is to issue a log message only. Default: False | |
--create-md5-fail-no-error | |
Make failure to create .md5 file not an error. Default is to make it an error. Default: True |
organize¶
Organize raw data files
cubi-tk org-raw organize [-h] [--dry-run] [--yes] [--move] [--no-check]
[--src-regex SRC_REGEX] [--dest-pattern DEST_PATTERN]
[--num-threads NUM_THREADS] [--no-gz-check]
[--no-md5-check] [--no-compute-md5]
[--missing-md5-error] [--create-md5-fail-no-error]
out_path path.fastq.gz [path.fastq.gz ...]
out_path | Path to output directory. |
path.fastq.gz | Path to input files. |
--dry-run | Dry-run, do not actually do anything Default: False |
--yes | Assume the answer to all prompts is ‘yes’ Default: False |
--move | Move file(s) instead of copying, default is to copy. Default: False |
--no-check | Do not run 'org-raw check' on output (default is to run). Default: True |
--src-regex | Regular expression for parsing file paths. Default: (.*/)?(?P<sample>.+)(?:-.+?)?.f(?:ast)?q.gz Default: “(.*/)?(?P<sample>.+)(?:-.+?)?.f(?:ast)?q.gz” |
--dest-pattern | Format expression for destination path generation. Default: {sample_name}/{file_name} Default: “{sample_name}/{file_name}” |
--num-threads | Number of parallel threads Default: 0 |
--no-gz-check | Deactivate check for gzip consistency (default is to perform check). Default: True |
--no-md5-check | Deactivate comparison of MD5 sum if .md5 file exists (default is to perform check). Default: True |
--no-compute-md5 | |
Deactivate computation of MD5 sum if missing (default is to compute MD5 sum). Default: True | |
--missing-md5-error | |
Make missing .md5 files constitute an error. Default is to issue a log message only. Default: False | |
--create-md5-fail-no-error | |
Make failure to create .md5 file not an error. Default is to make it an error. Default: True |
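As an illustration, the following (hypothetical) invocation copies FASTQ files into per-sample folders using the default destination pattern, answering all prompts with yes:
$ cubi-tk org-raw organize --yes \
    --dest-pattern '{sample_name}/{file_name}' \
    output_dir/ incoming/*.fastq.gz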
sea-snap¶
Tools for supporting the RNA-SeA-SnaP pipeline.
cubi-tk sea-snap [-h]
{itransfer-raw-data,itransfer-results,working-dir,write-sample-info,check-irods}
...
Positional Arguments¶
sea_snap_cmd | Possible choices: itransfer-raw-data, itransfer-results, working-dir, write-sample-info, check-irods |
Sub-commands¶
itransfer-raw-data¶
Transfer FASTQs into iRODS landing zone
cubi-tk sea-snap itransfer-raw-data [-h] [--sodar-url SODAR_URL]
[--sodar-api-token SODAR_API_TOKEN]
[--num-parallel-transfers NUM_PARALLEL_TRANSFERS]
[--tsv-shortcut {germline,cancer}]
[--first-batch FIRST_BATCH]
[--last-batch LAST_BATCH]
[--base-path BASE_PATH]
[--remote-dir-date REMOTE_DIR_DATE]
[--remote-dir-pattern REMOTE_DIR_PATTERN]
[--yes] [--validate-and-move]
[--assay ASSAY]
destination
destination | UUID from Landing Zone or Project - where files will be moved to. |
--num-parallel-transfers | |
Number of parallel transfers, defaults to 8 Default: 8 | |
--tsv-shortcut | Possible choices: germline, cancer The shortcut TSV schema to use. Default: “germline” |
--first-batch | First batch to be transferred. Defaults: 0. Default: 0 |
--last-batch | Last batch to be transferred. |
--base-path | Base path of project (contains ‘ngs_mapping/’ etc.), defaults to current path. Default: “/home/docs/checkouts/readthedocs.org/user_builds/cubi-tk/checkouts/latest/docs_manual” |
--remote-dir-date | |
Date to use in remote directory, defaults to YYYY-MM-DD of today. Default: “2023-07-04” | |
--remote-dir-pattern | |
Pattern to use for constructing remote pattern Default: “{library_name}/{step}/{date}” | |
--yes | Assume all answers are yes, e.g., will create or use existing available landing zones without asking. Default: False |
--validate-and-move | |
After files are transferred to SODAR, it will proceed with validation and move. Default: False | |
--assay | UUID of assay to download data for. |
itransfer-results¶
Transfer mapping results into iRODS landing zone
cubi-tk sea-snap itransfer-results [-h] [--sodar-url SODAR_URL]
[--sodar-api-token SODAR_API_TOKEN]
[--num-parallel-transfers NUM_PARALLEL_TRANSFERS]
transfer_blueprint destination
transfer_blueprint | |
Path to blueprint file to load. This file contains commands to sync files with iRODS. Blocks of commands separated by an empty line will be executed together in one thread. | |
destination | UUID or iRODS path of landing zone to move to. |
--num-parallel-transfers | |
Number of parallel transfers, defaults to 8 Default: 8 |
working-dir¶
Create working directory
cubi-tk sea-snap working-dir [-h] [--dry-run] [--dirname DIRNAME]
[--configs {mapping,DE} [{mapping,DE} ...]]
[sea_snap_path]
sea_snap_path | Path into RNA-SeA-SnaP directory (below a directory containing ‘mapping_pipeline.snake’). Default: “/home/docs/checkouts/readthedocs.org/user_builds/cubi-tk/checkouts/latest/docs_manual” |
--dry-run, -n | Perform dry-run, do not do anything. Default: False |
--dirname, -d | Name of the working directory to create (default: ‘results_YEAR_MONTH_DAY/’). Default: “results_%Y_%m_%d/” |
--configs, -c | Possible choices: mapping, DE Configs to be imported (default: all). Default: [‘mapping’, ‘DE’] |
write-sample-info¶
Generate sample info file
cubi-tk sea-snap write-sample-info [-h] [--allow-overwrite] [--dry-run]
[--show-diff] [--show-diff-side-by-side]
[--from-file FROM_FILE]
[--isa-assay ISA_ASSAY]
[--project_uuid PROJECT_UUID]
[--output_folder OUTPUT_FOLDER]
[--overwrite-isa] [--sodar-url SODAR_URL]
[--sodar-auth-token SODAR_AUTH_TOKEN]
in_path_pattern [output_file]
in_path_pattern | |
Path pattern to use for extracting input file information. See https://cubi-gitlab.bihealth.org/CUBI/Pipelines/sea-snap/blob/master/documentation/prepare_input.md#fastq-files-folder-structure. | |
output_file | Filename ending with ‘.yaml’ or ‘.tsv’. default: sample_info.yaml. Default: sample_info.yaml |
--allow-overwrite | |
Allow to overwrite output file, default is not to allow overwriting output file. Default: False | |
--dry-run | Perform a dry run, i.e., don’t change anything only display change, implies ‘–show-diff’. Default: False |
--show-diff | Show change when creating/updating sample sheets. Default: False |
--show-diff-side-by-side | |
Show diff side by side instead of unified. Default: False | |
--from-file | Path to yaml file to convert to tsv or tsv to yaml. Not used, if not specified. |
--isa-assay | Path to ISA assay file. Not used, if not specified. |
--project_uuid | If set pull ISA files from SODAR. UUID of project to pull from. Default: False |
--output_folder | |
Output folder path to store ISA files. Default: “ISA_files/” | |
--overwrite-isa | |
Allow to overwrite output file, default is not to allow overwriting output file. Default: False |
check-irods¶
Check consistency of sample info, blueprint and files on SODAR
cubi-tk sea-snap check-irods [-h] [--num-replicas NUM_REPLICAS]
[--num-parallel-tests NUM_PARALLEL_TESTS] [--yes]
[--transfer-blueprint TRANSFER_BLUEPRINT]
results_folder irods_path
results_folder | Path to a Sea-snap results folder. |
irods_path | Path to an iRODS collection. |
--num-replicas | Minimum number of replicas, defaults to 2 Default: 2 |
--num-parallel-tests | |
Number of parallel tests, defaults to 8 Default: 8 | |
--yes | Assume the answer to all prompts is ‘yes’ Default: False |
--transfer-blueprint | |
Filename of blueprint file for export to SODAR (created e.g. with ‘./sea-snap sc l export’). Assumed to be in the results folder. Default: ‘SODAR_export_blueprint.txt’ Default: “SODAR_export_blueprint.txt” |
archive¶
Helper for archiving projects.
cubi-tk archive [-h] {copy,prepare,readme,summary} ...
Positional Arguments¶
archive_cmd | Possible choices: copy, prepare, readme, summary |
Sub-commands¶
copy¶
Perform archival (copy and audit)
cubi-tk archive copy [-h] [--num-threads NUM_THREADS]
[--skip [SKIP [SKIP ...]]] [--keep-workdir-hashdeep]
[--read-only]
project destination
project | Path of project directory |
destination | Final destination directory for archive, must not exist |
--num-threads | Number of parallel threads Default: 4 |
--skip | Step to skip (hashdeep, rsync, audit) |
--keep-workdir-hashdeep | |
Save hashdeep report & audit of the temporary destination Default: False | |
--read-only | Change destination files to read-only Default: False |
prepare¶
Prepare the project directory for archival
cubi-tk archive prepare [-h] [--num-threads NUM_THREADS] [--rules RULES]
[--readme README] [--ignore-tar-errors]
project destination
project | Path of project directory |
destination | Destination directory (for symlinks and later archival) |
--num-threads | Number of parallel threads Default: 4 |
--rules, -r | Default: “/home/docs/checkouts/readthedocs.org/user_builds/cubi-tk/checkouts/latest/cubi_tk/archive/default_rules.yaml” |
--readme | Path to README.md created with cubi-tk |
--ignore-tar-errors | |
Ignore errors due to access permissions when compressing folders Default: False |
readme¶
Prepare a valid README.md
cubi-tk archive readme [-h] [--skip-collect] [--is-valid]
[--var-directory VAR_DIRECTORY]
[--var-PI-name VAR_PI_NAME]
[--var-PI-email VAR_PI_EMAIL]
[--var-archiver-name VAR_ARCHIVER_NAME]
[--var-archiver-email VAR_ARCHIVER_EMAIL]
[--var-CUBI-name VAR_CUBI_NAME]
[--var-client-name VAR_CLIENT_NAME]
[--var-client-email VAR_CLIENT_EMAIL]
[--var-SODAR-UUID VAR_SODAR_UUID]
[--var-SODAR-URL VAR_SODAR_URL]
[--var-Gitlab-URL VAR_GITLAB_URL]
[--var-project-name VAR_PROJECT_NAME]
[--var-start-date VAR_START_DATE]
[--var-current-status VAR_CURRENT_STATUS]
[--var-size VAR_SIZE] [--var-inodes VAR_INODES]
[--var-size-follow VAR_SIZE_FOLLOW]
[--var-inodes-follow VAR_INODES_FOLLOW]
[--var-snakemake-nb VAR_SNAKEMAKE_NB]
project filename
project | Path of project directory |
filename | README.md path & filename |
--skip-collect, -s | |
Skip the collection of file size & inodes Default: False | |
--is-valid, -t | Test validity of existing README file Default: False |
--var-directory | |
template variable ‘directory’ | |
--var-PI-name | template variable ‘PI_name’ |
--var-PI-email | template variable ‘PI_email’ |
--var-archiver-name | |
template variable ‘archiver_name’ | |
--var-archiver-email | |
template variable ‘archiver_email’ | |
--var-CUBI-name | |
template variable ‘CUBI_name’ | |
--var-client-name | |
template variable ‘client_name’ | |
--var-client-email | |
template variable ‘client_email’ | |
--var-SODAR-UUID | |
template variable ‘SODAR_UUID’ | |
--var-SODAR-URL | |
template variable ‘SODAR_URL’ | |
--var-Gitlab-URL | |
template variable ‘Gitlab_URL’ | |
--var-project-name | |
template variable ‘project_name’ | |
--var-start-date | |
template variable ‘start_date’ | |
--var-current-status | |
template variable ‘current_status’ | |
--var-size | template variable ‘size’ |
--var-inodes | template variable ‘inodes’ |
--var-size-follow | |
template variable ‘size_follow’ | |
--var-inodes-follow | |
template variable ‘inodes_follow’ | |
--var-snakemake-nb | |
template variable ‘snakemake_nb’ |
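For example, contact details can be supplied up front instead of being prompted for interactively; all values below are placeholders:
$ cubi-tk archive readme \
    --var-PI-name "<PI name>" --var-PI-email "<pi@example.com>" \
    --var-archiver-name "<archiver name>" --var-archiver-email "<archiver@example.com>" \
    PROJECT_DIRECTORY README_FILE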
summary¶
Collects a summary of files in the project directory. The summary can be saved to a file for further inspection
cubi-tk archive summary [-h] [--classes CLASSES] [--dont-follow-links]
project table
project | Path of project directory |
table | Location of the summary output table |
--classes | Location of the file describing files of interest Default: “/home/docs/checkouts/readthedocs.org/user_builds/cubi-tk/checkouts/latest/cubi_tk/archive/classes.yaml” |
--dont-follow-links | |
Do not follow symlinks to directories. Required when the project contains circular symlinks Default: False |
Manual for isa-tpl¶
cubi-tk isa-tpl: create ISA-tab directories using Cookiecutter.
You can use this command to quickly bootstrap an ISA-tab investigation. The functionality is built on Cookiecutter.
To create a directory with ISA-tab files, run:
$ cubi-tk isa-tpl <template name> <output directory>
This will prompt a number of questions interactively on the command line to collect information about the files that are going to be created.
The requested information will depend on the chosen ISA-tab template.
It is also possible to pass this information non-interactively together with other command line arguments (see cubi-tk isa-tpl <template name> --help).
The completed information will then be used to create a directory with ISA-tab files. It will be necessary to edit and extend the automatically generated files, e.g. to add additional rows to the assays.
Available Templates¶
The Cookiecutter directories are located in this module’s directory. Currently available templates are:
isatab-generic
isatab-germline
isatab-microarray
isatab-ms_meta_biocrates
isatab-single_cell_rnaseq
isatab-bulk_rnaseq
isatab-tumor_normal_dna
isatab-tumor_normal_triplets
isatab-stem_cell_core_bulk
isatab-stem_cell_core_sc
Adding Templates¶
Adding templates consists of the following steps:
- Add a new template directory below cubi_tk/isa_tpl.
- Register it by appending an IsaTabTemplate object to _TEMPLATES in cubi_tk.isa_tpl.
- Add it to the list above in the docstring.
The easiest way to start out is to copy an existing cookiecutter template and registration.
More Information¶
Also see the cubi-tk isa-tpl CLI documentation and cubi-tk isa-tpl --help for more information.
Manual for isa-tab¶
cubi-tk isa-tab: ISA-tab tooling.
Sub Commands¶
- validate: Validate ISA-tab files for correctness and perform sanity checks.
- resolve-hpo: Resolve lists of HPO terms to TSV suitable for copy-and-paste into ISA-tab.
- add-ped: Given a germline DNA sequencing ISA-tab file and a PED file, add new lines to the ISA-tab file and update existing ones, e.g., for newly added parents.
- annotate: Add annotation to an ISA-tab file, given a TSV file.
Annotate¶
cubi-tk isa-tab annotate updates material and file nodes in ISA-tab studies and assays with annotations provided as a tab-separated text file.
In the annotation file header, target node types need to be indicated in ISA-tab style (i.e. “Source Name”, etc.) while annotations are just named normally. Annotations for materials are automatically recorded as Characteristics, while annotations for files are recorded as Comments. Different node types can be annotated using only one annotation file, as demonstrated in the example below.
By default, if Characteristics or Comments with the same name already exist for a node type, only empty values are updated. Overwriting existing values requires confirmation (--force-update).
Annotations are only applied to one study and assay, since material names are not necessarily unique between the same material types of different studies or different assays (and thus, annotations couldn't be assigned unambiguously). By default, the first study and assay listed in the investigation file are considered for annotation. A specific study and assay may be selected by file name (not path, just as listed in the investigation file) via --target-study or --target-assay, respectively.
Example execution:
$ cubi-tk isa-tab annotate investigation.tsv annotation.tsv \
    --target-study s_study.tsv --target-assay a_assay.tsv
Note: investigation.tsv and annotation.tsv have to be indicated via absolute or relative paths. However, s_study.tsv and a_assay.tsv have to be indicated by name only, just as they are referenced in their corresponding investigation file.
Source Name | Age | Sex | Sample Name | Volume |
---|---|---|---|---|
alpha | 18 | FEMALE | alpha-N1 | 1000 |
beta | 27 | MALE | beta-N1 | 1000 |
gamma | 69 | FEMALE | gamma-N1 | 800 |
More Information¶
Also see the cubi-tk isa-tab CLI documentation and cubi-tk isa-tab --help for more information.
Manual for ingest-fastq¶
The cubi-tk sodar ingest-fastq command lets you upload raw data files to SODAR.
It is configured for uploading FASTQ files by default, but the parameters can be adjusted to upload any files.
The basic usage is:
$ cubi-tk sodar ingest-fastq SOURCE [SOURCE ...] DESTINATION
where each SOURCE is a path to a folder containing relevant files and DESTINATION is either an iRODS path to a landing zone in SODAR or the UUID of that landing zone.
Other file types¶
By default, the parameters --src-regex and --remote-dir-pattern are configured for FASTQ files, but they may be changed to upload other files as well.
The two parameters have the following functions:
- --src-regex: a regular expression to recognize paths to raw data files to upload (the paths starting from the SOURCE directories).
- --remote-dir-pattern: a pattern specifying into which folder structure the raw data files should be uploaded. This is a file path with wildcards that are replaced by the captured content of named groups in the regular expression passed via --src-regex.
For example, the default --src-regex is:
(.*/)?(?P<sample>.+?)(?:_(?P<lane>L[0-9]+?))?(?:_(?P<mate>R[0-9]+?))?(?:_(?P<batch>[0-9]+?))?\.f(?:ast)?q\.gz
It can capture a variety of different FASTQ file names and has the named groups sample, lane, mate and batch.
The default --remote-dir-pattern is:
{sample}/{date}/{filename}
It contains the wildcard {sample}, which will be filled with the captured content of group (?P<sample>...). In addition, the wildcards {date} and {filename} can always be used and will be filled with the current date and the full filename (the basename of a matched file), respectively.
Mapping of file names¶
In some cases, additional mapping of file names is required (for example, when samples should be renamed).
This can be done via the parameter --remote-dir-mapping, or -m for short.
It can be supplied several times, once for each mapping.
With each -m MATCH REPL, a pair of a regular expression and a replacement string is specified.
Internally, python's re.sub function is applied to the --remote-dir-pattern after the wildcards have been filled.
Therefore, you can refer to the documentation of the re package for syntax questions.
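For example, assuming samples are named like pat_1 locally but should appear as patient-1 remotely, a mapping along these lines could be used (values are illustrative):
$ cubi-tk sodar ingest-fastq \
    -m 'pat_(?P<num>[0-9]+)' 'patient-\g<num>' \
    SOURCE DESTINATION
Since re.sub replaces every match in the filled remote path, pat_1/2023-07-04/pat_1_R1.fastq.gz would become patient-1/2023-07-04/patient-1_R1.fastq.gz.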
Source files on WebDAV¶
If a SOURCE is a WebDAV URL, the files will be temporarily downloaded into a directory called "./temp/". This can be adjusted with the --tmp option.
SODAR authentication¶
To use this command, which internally executes iRODS icommands, you need to authenticate with iRODS by running:
$ iinit
To be able to access the SODAR API (which is only required if you specify a landing zone UUID instead of an iRODS path), you also need an API token. For token management in SODAR, the following docs can be used:
- https://sodar.bihealth.org/manual/ui_user_menu.html
- https://sodar.bihealth.org/manual/ui_api_tokens.html
There are three options for supplying the token; only one is needed:
- Configure ~/.cubitkrc.toml:
[global]
sodar_server_url = "https://sodar.bihealth.org/"
sodar_api_token = "<your API token here>"
- Pass it on the command line:
$ cubi-tk sodar ingest-fastq --sodar-url "https://sodar.bihealth.org/" --sodar-api-token "<your API token here>"
- Set it as an environment variable:
$ export SODAR_API_TOKEN="<your API token here>"
More Information¶
Also see the cubi-tk sodar ingest-fastq CLI documentation and cubi-tk sodar ingest-fastq --help for more information.
Manual for sea-snap itransfer-results¶
The cubi-tk sea-snap itransfer-results command lets you upload results of the Sea-snap pipeline to SODAR.
It relies on first running the export function of Sea-snap.
This export function allows selecting which result files of the pipeline shall be uploaded into which folder structure; this can be configured via the Sea-snap config file.
It outputs a blueprint file with file paths and commands to use for the upload.
For more information, see the Sea-snap documentation.
The itransfer-results function parallelizes the upload of these files.
The basic usage is:
- create the blueprint:
$ ./sea-snap mapping l export
- upload to SODAR:
$ cubi-tk sea-snap itransfer-results BLUEPRINT DESTINATION
where BLUEPRINT is the blueprint file mentioned above (probably "SODAR_export_blueprint.txt") and DESTINATION is either an iRODS path to a landing zone in SODAR or the UUID of that landing zone.
SODAR authentication¶
To use this command, which internally executes iRODS icommands, you need to authenticate with iRODS by running:
$ iinit
To be able to access the SODAR API (which is only required if you specify a landing zone UUID instead of an iRODS path), you also need an API token. For token management in SODAR, the following docs can be used:
- https://sodar.bihealth.org/manual/ui_user_menu.html
- https://sodar.bihealth.org/manual/ui_api_tokens.html
There are three options for supplying the token; only one is needed:
- Configure ~/.cubitkrc.toml:
[global]
sodar_server_url = "https://sodar.bihealth.org/"
sodar_api_token = "<your API token here>"
- Pass it on the command line:
$ cubi-tk sea-snap itransfer-results --sodar-url "https://sodar.bihealth.org/" --sodar-api-token "<your API token here>"
- Set it as an environment variable:
$ export SODAR_API_TOKEN="<your API token here>"
More Information¶
Also see the cubi-tk sea-snap itransfer-results CLI documentation and cubi-tk sea-snap itransfer-results --help for more information.
Manual for sea-snap write-sample-info¶
The cubi-tk sea-snap write-sample-info command can be used to collect information by parsing the folder structure of raw data files (FASTQ) and meta-information (ISA-tab).
It collects this information in a YAML file that will be loaded by the Sea-snap pipeline.
The basic usage is:
$ cubi-tk sea-snap write-sample-info IN_PATH_PATTERN
where IN_PATH_PATTERN is a file path with wildcards specifying the location of the FASTQ files. The wildcards are also used to extract information from the parsed paths.
By default, a file called sample_info.yaml will be generated in the current working directory.
If this file is in the project working directory, Sea-snap will load it automatically.
However, you can specify another file name after IN_PATH_PATTERN.
This file can then be used in Sea-snap, e.g. like so:
$ ./sea-snap mapping l --config file_name='sample_info_alt.yaml'
Note: check and edit the auto-generated sample_info.yaml file before running the pipeline.
Path pattern and wildcards¶
For example, if the FASTQ files are stored in a folder structure like this:
input
├── sample1
│ ├── sample1_R1.fastq.gz
│ └── sample1_R2.fastq.gz
└── sample2
├── sample2_R1.fq
└── sample2_R2.fq
Then the path pattern can look like the following:
$ cubi-tk sea-snap write-sample-info "input/{sample}/*_{mate,R1|R2}"
Keywords in braces (e.g. {sample}) are wildcards.
It is possible to add a regular expression, separated by a comma, after the keyword.
This is useful to restrict what part of the file path the wildcard can match (e.g. {mate,R1|R2} means that mate can only be R1 or R2).
In addition, * and ** can be used to match anything that does not need to be captured with a wildcard.
Setting the IN_PATH_PATTERN as shown above will allow the write-sample-info command to extract the information that samples sample1 and sample2 exist and that there are paired reads for both of them.
The file extension (e.g. fastq.gz, fastq or fq) should be omitted; it will be detected automatically.
Available wildcards are: {sample}, {mate}, {flowcell}, {lane}, {batch} and {library}.
However, only {sample} is obligatory.
Note: wildcards do not match '/' and '.'. For further information, also see the Sea-snap documentation.
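Putting this together for the folder structure above, and writing to an explicitly named output file, the call would be:
$ cubi-tk sea-snap write-sample-info "input/{sample}/*_{mate,R1|R2}" sample_info_alt.yaml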
Meta information¶
When working with SODAR, additional meta-information should be included in the sample info file. In SODAR this meta-information is stored in the form of ISA-tab files.
There are two ways to add the information from an ISA-tab assay file to the generated sample info file:
- Load from a local ISA-tab assay file:
$ cubi-tk sea-snap write-sample-info --isa-assay PATH/TO/a_FILE_NAME.txt IN_PATH_PATTERN
- Download from SODAR:
$ cubi-tk sea-snap write-sample-info --project_uuid UUID IN_PATH_PATTERN
Here, UUID is the UUID of the respective project on SODAR.
SODAR authentication¶
To be able to access the SODAR API (which is only required if you download meta-data from SODAR), you also need an API token. For token management in SODAR, the following docs can be used:
- https://sodar.bihealth.org/manual/ui_user_menu.html
- https://sodar.bihealth.org/manual/ui_api_tokens.html
There are three options for supplying the token; only one is needed:
- Configure ~/.cubitkrc.toml:
[global]
sodar_server_url = "https://sodar.bihealth.org/"
sodar_api_token = "<your API token here>"
- Pass it on the command line:
$ cubi-tk sea-snap write-sample-info --sodar-url "https://sodar.bihealth.org/" --sodar-auth-token "<your API token here>"
- Set it as an environment variable:
$ export SODAR_API_TOKEN="<your API token here>"
Table format¶
Although this is not strictly necessary to run the workflow, it is possible to convert the YAML file to a table / sample sheet:
$ cubi-tk sea-snap write-sample-info --from-file sample_info.yaml XXX sample_info.tsv
And back:
$ cubi-tk sea-snap write-sample-info --from-file sample_info.tsv XXX sample_info.yaml
Note that XXX here merely stands in for the required IN_PATH_PATTERN argument.
More Information¶
Also see the cubi-tk sea-snap write-sample-info CLI documentation and cubi-tk sea-snap write-sample-info --help for more information.
Manual for archive¶
The cubi-tk archive module is designed to facilitate the archival of older projects away from the cluster's fast file system.
This document provides an overview of these commands, and how they can be adapted to meet specific needs.
Glossary¶
Hot storage: Fast and expensive, therefore usually size-restricted. For example:
- GPFS by DDN (currently at /fast)
- Ceph with SSDs
Warm storage: Slower, but with more space and possibly mirroring. For example:
- SODAR with iRODS
- Ceph with HDDs (/data/cephfs-2/)
Cold storage: For data that needs to be accessed only rarely. For example:
- Tape archive
Background: the archiving process¶
CUBI archive resources are three-fold:
- SODAR and the associated iRODS storage should contain the raw data generated for the project. SODAR also contains important results (mapping, variants, differential expression, …).
- Gitlab contains the small files required to generate the results, typically scripts, configuration files, READMEs, meeting notes, …, but also knock-in gene sequences, lists of papers, gene lists, etc.
- The rest should be stored in CEPH (warm storage).
For older projects or intermediate results produced by older pipelines, the effort of uploading the data to SODAR & Gitlab may not be warranted. In this case, the bulk of the archive might be stored on the CEPH file system.
The module aims to facilitate this last step, i.e. the archival of old projects to move them away from the hot storage.
Archiving process requirements¶
Archived projects should contain all important files, but not data that is already stored elsewhere. In particular, the following files should not be archived:
- raw data (*.fastq.gz files) saved in SODAR or in the STORE,
- data from public repositories (SRA, GDC portal, …) that can easily be downloaded again,
- static data such as genome sequences & annotations, variant databases from gnomAD, … that can also be easily retrieved,
- index files for mapping that can be re-generated.
Importantly, a README file should be present in the archive, briefly describing the project, listing contacts to the client & within CUBI and providing links to SODAR & Gitlab when appropriate.
The purpose of the module is:
- to provide a summary of files that require special attention, for example symlinks whose targets lie outside of the project, or large files (*.fastq.gz or *.bam especially),
- to create a temporary directory that mimics the archived files with symlinks,
- to use this temporary directory as a template to copy files onto the CEPH filesystem, and
- to compute checksums on the originals and copies, to ensure the accuracy of the copy process.
Basic usage¶
Summary of files in project¶
$ cubi-tk archive summary PROJECT_DIRECTORY DESTINATION
Unlike other cubi-tk commands, here DESTINATION is not a landing zone, but a local filename for the summary of files that require attention.
By default, the summary reports:
- dangling symlinks (including those dangling because of permissions),
- symlinks pointing outside of the project directory,
- large (greater than 256MB) *.fastq.gz, *.fq.gz & *.bam files,
- large static data files with extensions *.gtf, *.gff, *.fasta & *.fa (possibly gzipped) that can potentially be publicly available,
- large files from SRA with prefix SRR.
The summary file is a table with the following columns:
- Class: the name(s) of the pattern(s) that match the file. When the file matches several patterns, all are listed, separated by |.
- Filename: the relative path of the file (from the project's root).
- Target: the symlink's target (when applicable).
- ResolvedName: the resolved (absolute, symlinks removed) path of the target. When the target doesn't exist or is inaccessible because of permissions, the likely path of the target.
- Size: the file size (the target's size for symlinks). When the file doesn't exist, it is set to 0.
- Dangling: True when the file cannot be read (missing or inaccessible), False otherwise.
- Outside: True when the target path is outside of the project directory, False otherwise. It is always False for real files (i.e. not symlinks).
The summary step also reports an overview of the results, with the total number of files, the total size of the project, and the number of links to files. The numbers of dangling links and of links inaccessible because of permission issues are listed separately. Likewise, it reports the number of files outside of the project that are linked to from within the project by symlinks. Finally, for each of the "important files" classes, the number of files, the number of files outside of the project directory and the number of files lost because of symlink failures are reported.
Archive preparation: README.md file creation¶
$ cubi-tk archive readme PROJECT_DIRECTORY README_FILE
README_FILE is the path to the README file that will be created. It must not exist.
The README file will be created by filling in contact information interactively. Command-line options are also available, but interactive confirmation is needed.
It is possible to test if a generated README file is valid for project archival, using
$ cubi-tk archive readme --is-valid PROJECT_DIRECTORY README_FILE
The module will highlight mandatory records that could not be found in the current file. These mandatory records are lines following the patterns below:
- P.I.: [Name of the PI, any string](mailto:<valid email address in lowercase>)
- Client contact: [Name of our contact in the PI's group](mailto:<valid email address in lowercase>)
- CUBI project leader: [Name of the CUBI member leading the project]
- CUBI contact: [Name of the archiver](mailto:<valid email address in lowercase>)
- Project name: <any string>
- Start date: YYYY-MM-DD
- Current status: <One of Active, Inactive, Finished, Archived>
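For instance, a valid P.I. record following the pattern above could look like this (name and address are placeholders):
- P.I.: [Jane Doe](mailto:jane.doe@example.com)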
Archive preparation: temporary copy¶
$ cubi-tk archive prepare --readme README PROJECT_DIRECTORY TEMPORARY_DESTINATION
TEMPORARY_DESTINATION is the path to the temporary directory that will be created. It must not exist.
For each file that must be archived, the module creates a symlink to that file's absolute path. The module also reproduces the project's directory hierarchy, so that each symlink sits in the same relative position in the temporary directory as in the original project.
The module deals with symlinks in the project differently depending on whether their target is inside the project or not. For symlinks pointing outside of the project, a symlink to the target's absolute path is created. For symlinks pointing inside the project, a relative symlink is created. This makes it possible to store all files (even those outside of the project) without duplicating symlinks inside the project.
Additional transformations of the original files are carried out during the preparation step:
- The contents of the .snakemake, sge_log, cubi-wrappers & snappy-pipeline directories are processed differently: the directories are tarred & compressed in the temporary destination, to reduce the number of inodes in the archive.
- Core dump files are not copied to the temporary destination, and therefore won't be copied to the final archive.
- The README.md file created by the readme subcommand is placed at the top level of the temporary destination. If the original project already contains a README.md file, it is appended to the generated one, since the generated file is the one containing all mandatory information.
Copy to archive & verification¶
$ cubi-tk archive copy TEMPORARY_DESTINATION FINAL_DESTINATION
FINAL_DESTINATION is the path to the final destination of the archive, on the warm storage. It must not exist.
Configuration¶
The files reported in the summary are under user control, through the --classes option, which must point to a YAML file describing the regular expression pattern & minimum size for each class. For example, raw data files can be identified as follows:
fastq:
min_size: 268435456
pattern: "^(.*/)?[^/]+(\\.f(ast)?q(\\.gz)?)$"
Files larger than 256MB with extension *.fastq, *.fq, *.fastq.gz or *.fq.gz will be reported with the class fastq.
Any number of file classes can be defined. The default class configuration is in cubi_tk/archive/classes.yaml.
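For example, a class reporting large VCF files could be added to a custom classes file as follows; the class name and pattern are illustrative:
vcf:
    min_size: 268435456
    pattern: "^(.*/)?[^/]+\\.vcf(\\.gz)?$"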
The behaviour of the archive preparation can also be changed using the --rules option. The rules are likewise described by regular expression patterns in a YAML file.
Three different archiving options are implemented:
- ignore: files or directories matching the pattern are simply omitted from the temporary destination. This is useful to ignore remaining temporary files, core dumps or directories containing lists of input symlinks, for example.
- compress: files or directories matching the pattern are replaced in the temporary destination by a compressed (gzipped) tar file. This is how the .snakemake or sge_log directories are treated by default, but patterns for other directories may be added, for example for Slurm log directories.
- squash: files matching the pattern are replaced by zero-length placeholders in the temporary destination. An md5 checksum file is added next to the original file, to enable verification.
When the user doesn't specify her own set using the --rules option, the following rules apply: core dumps are ignored; the .snakemake, sge_log, .git, snappy-pipeline and cubi_wrappers directories are compressed; and nothing is squashed. The exact definitions are:
ignore: # Patterns for files or directories to skip
- "^(.*/)?core\\.[0-9]+$"
- "^(.*/)?\\.venv$"
compress: # Patterns for files or directories to tar-gzip
- "^(.*/)?\\.snakemake$"
- "^(.*/)?sge_log$"
- "^(.*/)?\\.git$"
- "^(.*/)?snappy-pipeline$"
- "^(.*/)?cubi_wrappers$"
squash: [] # Patterns for files to squash (compute MD5 checksum, and replace by zero-length placeholder)
Examples¶
Consider an example project. It contains:
- raw data in a
raw_data
directory, some of which is stored outside of the project’s directory, - processing results in the
pipeline
directory, - additional data files & scripts in
extra_data
, - a
.snakemake
directory that can potentially contain many files in conda environments, for example, and - a bunch on temporary & obsolete files that shouldn’t be archived, conveniently grouped into the
ignored_dir
directory.
The architecture of this toy project is displayed below:
project/
├── extra_data
│ ├── dangling_symlink -> ../../outside/inexistent_data
│ ├── file.public
│ ├── to_ignored_dir -> ../ignored_dir
│ └── to_ignored_file -> ../ignored_dir/ignored_file
├── ignored_dir
│ └── ignored_file
├── pipeline
│ ├── output
│ │ ├── sample1
│ │ │ └── results -> ../../work/sample1/results
│ │ └── sample2 -> ../work/sample2
│ └── work
│ ├── sample1
│ │ └── results
│ └── sample2
│ └── results
├── raw_data
│ ├── batch1 -> ../../outside/batch1
│ ├── batch2
│ │ ├── sample2.fastq.gz -> ../../../outside/batch2/sample2.fastq.gz
│ │ └── sample2.fastq.gz.md5 -> ../../../outside/batch2/sample2.fastq.gz.md5
│ └── batch3
│ ├── sample3.fastq.gz
│ └── sample3.fastq.gz.md5
└── .snakemake
└── snakemake
Prepare the copy on the temporary destination¶
Imagine now that the raw data is already safely archived in SODAR. We don't want to save these files in duplicate, so we decide to squash the raw data files: their size is set to 0, and their md5 checksums are added. We do the same for the publicly downloadable file file.public. We also want to ignore the junk in ignored_dir, and to compress the .snakemake directory. So we have the following rules:
After running the preparation command cubi-tk archive prepare --rules my_rules.yaml project temp_dest, the temporary destination contains the following files:
temp_dest
├── <today's date>_hashdeep_report.txt
├── extra_data
│ ├── file.public
│ ├── file.public.md5
│ ├── to_ignored_dir -> ../ignored_dir
│ └── to_ignored_file -> ../ignored_dir/ignored_file
├── pipeline
│ ├── output
│ │ ├── sample1
│ │ │ └── results -> ../../work/sample1/results
│ │ └── sample2 -> ../work/sample2
│ └── work
│ ├── sample1
│ │ └── results -> /absolute_path/project/pipeline/work/sample1/results
│ └── sample2
│ └── results -> /absolute_path/project/pipeline/work/sample2/results
├── raw_data
│ ├── batch1
│ │ ├── sample1.fastq.gz
│ │ └── sample1.fastq.gz.md5 -> /absolute_path/outside/batch1/sample1.fastq.gz.md5
│ ├── batch2
│ │ ├── sample2.fastq.gz
│ │ └── sample2.fastq.gz.md5 -> /absolute_path/outside/batch2/sample2.fastq.gz.md5
│ └── batch3
│ ├── sample3.fastq.gz
│ └── sample3.fastq.gz.md5 -> /absolute_path/project/raw_data/batch3/sample3.fastq.gz.md5
├── README.md
└── .snakemake.tar.gz
The inaccessible file project/extra_data/dangling_symlink & the contents of project/ignored_dir are not present in the temporary destination, either because they are not accessible or because they have been deliberately ignored by the preparation step.
The .snakemake directory is replaced by the gzipped tar file .snakemake.tar.gz in the temporary destination.
file.public & the 3 *.fastq.gz files have been replaced by placeholder files of size 0. For file.public, the md5 checksum has been computed by the preparation step; for the *.fastq.gz files, the existing checksums are used.
All other files are kept for archiving: symlinks to real files point to their target's absolute path; symlinks are absolute for paths outside of the project, and relative for paths inside the project.
Finally, the hashdeep report of the original project directory is written to the temporary destination, and a README.md file is created. At this point, we edit the README.md file to add a meaningful description of the project. If a README.md file was already present in the original project directory, its content will be added to the newly created file.
Note that the symlinks temp_dest/extra_data/to_ignored_dir & temp_dest/extra_data/to_ignored_file are dangling, because the links themselves were not omitted, but their targets were. This is expected, though perhaps unwanted, behaviour: symlinks pointing to files or directories within compressed or ignored directories will be dangling in the temporary destination, as the original file exists but is not part of the temporary destination.
Copy to the final destination¶
When the README.md editing is complete, the copy to the final destination on the warm file system can be done. It is a matter of running cubi-tk archive copy temp_dest final_dest.
The copy step writes to the final destination the hashdeep audit of the copy against the original project. This audit is expected to fail, because files & directories are ignored, compressed or squashed. With the option --keep-workdir-hashdeep, the programme also outputs the hashdeep report of the temporary destination and the audit of the final copy against the temporary destination; both the report and the audit are also stored in the final copy directory. The audit of the copy against the temporary destination should be successful, as the copy doesn't re-process files, it only follows symlinks.
If all steps have been completed successfully (including checking the README.md
for validity), then a marker file named archive_copy_complete
is created. The final step is to remove write permissions if the --read-only
option was selected.
Additional notes and caveats¶
- Generally, the module doesn't like circular symlinks. It is wise to fix them before any operation, or to use the rules facility to ignore them during preparation. The --dont-follow-links option of the summary step protects against such problems, at the expense of missing some files in the report.
- The module is untested for symlink corner cases (for example, where a symlink points to a symlink outside of the project, which in turn points to another file in the project).
- In the archive, relative symlinks within the project are resolved. For example, in the original project one might have variants.vcf -> ../work/variants.vcf -> variants.somatic.vcf. In the archive, the link will be variants.vcf -> ../work/variants.somatic.vcf.
More Information¶
Also see cubi-tk archive --help, cubi-tk archive summary --help, cubi-tk archive prepare --help & cubi-tk archive copy --help for more information.
Use Case: Exomes¶
This section describes the cubi-tk use case for exomes that are sequenced at Labor Berlin and processed by CUBI. It provides an outline of how cubi-tk helps in connecting
- SODAR (the CUBI system for meta and mass data storage and management),
- SNAPPY (the CUBI pipeline for the processing of DNA sequencing, including exomes),
- and VarFish (the CUBI web app for interactive analysis and annotation of variant calling results).
Overview¶
The overall data flow for the Translate-NAMSE use case is depicted below.

- A Labor Berlin (LB) bioinformatician uses "cubi-tk sodar add-ped" to augment the sample sheet of a SODAR project with new family members or new families altogether. They also transfer the FASTQ read data to the iRODS system that backs SODAR for file storage.
- At this stage, a Charité geneticist can review and refine the sample sheet. This mostly relates to information that is secondary for the subsequent analysis. It is assumed that the family relations updated by the bioinformatician are correct (the two parents of a sample are the two parents; if father and mother are flipped, this does not matter for the analysis by SNAPPY).
- A CUBI bioinformatician can now update the sample sheet for the SNAPPY pipeline using "cubi-tk snappy pull-sheets" and update the copy of the raw sequence data with "cubi-tk snappy pull-raw-data", fetching the files earlier transferred by LB.
- Once the data has been pulled from SODAR and iRODS, the CUBI bioinformatician launches the SNAPPY pipeline, which processes the data on the BIH HPC. The command cubi-tk snappy kickoff launches the pipeline steps with their dependencies. Inspection of results is based on manual inspection of log files for now.
- Once this is complete, the CUBI bioinformatician uses cubi-tk snappy varfish-upload and cubi-tk snappy itransfer-{variant-calling,ngs-mapping} to transfer the resulting BAM and VCF files into VarFish via its REST API and into iRODS via landing zones (cubi-tk sodar lz-{create,move}).
To summarise more concisely:
- LB copies data and meta data to SODAR/iRODS.
- CUBI pulls mass data and meta data from SODAR/iRODS and starts the pipeline.
- CUBI submits the resulting mass data back into SODAR and the annotated/exported variant calls into VarFish.
- The clinician can review the sample sheet independently of the CUBI bioinformaticians.
Human interaction is required if
- the sample sheet does not sufficiently reflect reality (sample swaps),
- files are broken and/or swapped, or
- tools terminate too early and data is not copied.
Overall, this is not a fully automated system, but rather a system with heavy tool support and semi-automation.
Future improvements are:
- Ask clinicians sending in samples for the sex of the child.
- Properly track parents as father/mother.
More Notes¶
- Data is processed in batches.
- Many tooling steps rely on "start processing at batch NUMBER"; that is, everything from NUMBER onwards will be processed.
- This requires manual tracking of the batch to start at (easy to see in SODAR).
Setup¶
For token management for both VarFish and SODAR, the following docs can be used:
- https://sodar.bihealth.org/manual/ui_user_menu.html
- https://sodar.bihealth.org/manual/ui_api_tokens.html
Obtain a VarFish API token from the VarFish system and configure ~/.varfishrc.toml:
[global]
varfish_server_url = "https://varfish.bihealth.org/"
varfish_api_token = "<your API token here>"
Obtain a SODAR API token and configure ~/.cubitkrc.toml:
[global]
sodar_server_url = "https://sodar.bihealth.org/"
sodar_api_token = "<your API token here>"
Create a new Miniconda installation if necessary.
host:~$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
host:~$ bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda3
host:~$ source $HOME/miniconda3/bin/activate
(conda) host:~$
Checkout and install VarFish CLI:
(conda) host:~$ git clone https://github.com/bihealth/varfish-cli.git
(conda) host:~$ cd varfish-cli
(conda) host:varfish-cli$ pip install -r requirements/base.txt
(conda) host:varfish-cli$ pip install -e .
Checkout and install CUBI-TK
(conda) host:~$ git clone git@cubi-gitlab.bihealth.org:CUBI/Pipelines/cubi-tk.git
(conda) host:~$ cd cubi-tk
(conda) host:cubi-tk$ pip install -r requirements/base.txt
(conda) host:cubi-tk$ pip install -e .
SNAPPY Configuration¶
You have to adjust the configuration of the SNAPPY data set as follows:
- You have to provide the sodar_uuid attribute. Set it to the SODAR project's UUID.
- Data will be downloaded to the last entry of search_paths.
  - If you are starting a new project, just use one entry with an appropriate value.
  - If you are moving a project to cubi-tk, add a new entry indicating where to download the data to.
# ...
data_sets:
  "<the dataset name here>":
    sodar_uuid: "<dataset uuid here>"
    sodar_title: "<optional title here>"
    file: "<biomedsheets file path here>.tsv"
    type: germline_variants
    naming_scheme: only_secondary_id
    search_patterns:
      - {left: '**/*_R1.fastq.gz', right: '**/*_R2.fastq.gz'}
      - {left: '**/*_R1_*.fastq.gz', right: '**/*_R2_*.fastq.gz'}
    search_paths:
      - "<path to search data for here>"
Note that you will need the **/* in the pattern.
Processing Commands¶
The setup up to this point only has to be done once per project/dataset. The following steps will (a) fetch the meta data and raw data from SODAR/iRODS, (b) start the processing with SNAPPY, and (c) submit the results back to SODAR once SNAPPY is done.
First, you pull the meta data from SODAR with the command:
$ cubi-tk snappy pull-sheets
This will show the changes that are to be applied in unified patch format, and you have to confirm them file by file.
You can also add --yes --dry-run
to see all pending changes at once without actually applying them or --yes
to apply all changes.
The next step is to fetch the raw data from SODAR/iRODS.
You first have to authenticate with iRODS using iinit.
You then fetch the raw data, optionally only the data starting at batch number $BATCH
.
You also have to provide the project UUID $PROJECT
.
Internally, cubi-tk will use the iRODS icommands and you will be shown the commands it is about to execute.
$ iinit
$ cubi-tk snappy pull-raw-data --min-batch $BATCH $PROJECT
Now you could start the processing. However, it is advisable to first ensure that the input FASTQ files can be linked in the ngs_mapping step.
$ cd ngs_mapping
$ snappy-snake -p $(snappy-snake -S | grep -v 'no update' | grep input_links | cut -f 1)
If this fails, a good starting point is removing ngs_mapping/.snappy_path_cache.
You can kick off the current pipeline using
$ cubi-tk snappy kickoff
After the pipeline has finished, you can create a new landing zone with the following command.
This will print the landing zone properties as JSON.
You will need both the landing zone UUID ($ZONE) and the iRODS path ($IRODS_PATH) for now (in the future this will be simplified).
$ cubi-tk sodar landing-zone-create $PROJECT
You can then transfer the data using the following commands.
You will have to specify the path to the SNAPPY sample sheet TSV as $TSV and the landing zone iRODS path as $IRODS_PATH.
$ cubi-tk snappy itransfer-ngs-mapping --start-batch $BATCH $TSV $IRODS_PATH
$ cubi-tk snappy itransfer-variant-calling --start-batch $BATCH $TSV $IRODS_PATH
Finally, you can validate and move the landing zone to get the data into SODAR:
$ cubi-tk sodar landing-zone-move $ZONE
And last but not least, here is how to transfer the data into VarFish (starting at $BATCH).
$ cubi-tk snappy varfish-upload --min-batch $BATCH $PROJECT
Use Case: Single Cell¶
This section describes the cubi-tk use case for the analysis of single cell data. It provides an outline of how cubi-tk helps in connecting
- Sea-Snap (the CUBI pipeline for the processing of RNA sequencing, including scRNA-seq),
- SODAR (the CUBI system for meta and mass data storage and management).
Overview¶

1. FASTQ and ISA-tab files are uploaded to SODAR.
   - ISA-tab files can be created with the help of cubi-tk isa-tpl isatab-single_cell.
   - FASTQ files can be uploaded with the help of cubi-tk sodar ingest-fastq.
2. FASTQ and ISA-tab files are pulled from SODAR.
   - FASTQ files can be downloaded using cubi-tk sodar pull-raw-data or the iRODS icommands.
   - ISA-tab files can be downloaded using cubi-tk sea-snap pull-isa.
3. A results folder is created on the HPC cluster and the config files are edited. A sample info file is created.
   - A results folder can be created with cubi-tk sea-snap working-dir.
   - The sample_info.yaml file can be created with cubi-tk sea-snap write-sample-info. This combines information from the parsed FASTQ folder structure and ISA-tab meta information.
4. The Sea-snap pipeline is run.
   - This is done as usual via ./sea-snap sc --slurm c.
5. The results are uploaded to SODAR.
   - Create a landing zone on SODAR with cubi-tk sodar lz-create.
   - Create a blueprint of which files to upload with ./sea-snap sc l export.
   - Upload the results using the blueprint and cubi-tk sea-snap itransfer-results.
6. Check whether all files have been uploaded to SODAR correctly.
   - This can be done via cubi-tk sea-snap check-irods.
Setup¶
For token management for SODAR, the following docs can be used:
- https://sodar.bihealth.org/manual/ui_user_menu.html
- https://sodar.bihealth.org/manual/ui_api_tokens.html
Obtain a SODAR API token and configure ~/.cubitkrc.toml:
[global]
sodar_server_url = "https://sodar.bihealth.org/"
sodar_api_token = "<your API token here>"
Create a new Miniconda installation if necessary.
host:~$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
host:~$ bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda3
host:~$ source $HOME/miniconda3/bin/activate
(conda) host:~$
Checkout and install CUBI-TK
(conda) host:~$ git clone git@cubi-gitlab.bihealth.org:CUBI/Pipelines/cubi-tk.git
(conda) host:~$ cd cubi-tk
(conda) host:cubi-tk$ pip install -r requirements/base.txt
(conda) host:cubi-tk$ pip install -e .
Processing Commands¶
Hint: also see the Sea-snap single cell pipeline documentation.
First, you can pull the meta data from SODAR with the command:
$ cubi-tk sea-snap pull-isa <project_uuid>
This will create a folder with ISA-tab files. Alternatively, you can omit this step and automatically pull the files later.
The next step is to fetch the raw data from SODAR/iRODS.
You first have to authenticate with iRODS using iinit.
Internally, cubi-tk will use the iRODS icommands and you will be shown the commands it is about to execute.
$ iinit
$ cubi-tk sodar pull-raw-data <project_uuid>
Create a working directory for the project results:
$ cubi-tk sea-snap working-dir <path_to_seasnap_pipeline>
This will also copy relevant files and a config template into the new directory. Edit the config files to adjust the pipeline execution to your needs.
Create a sample info file. This is equivalent to a sample sheet and summarizes information about the samples in YAML format. A path pattern to the downloaded FASTQ files is needed; see the Sea-snap documentation: https://cubi-gitlab.bihealth.org/CUBI/Pipelines/sea-snap/blob/master/documentation/prepare_input.md#fastq-files-folder-structure
$ cubi-tk sea-snap write-sample-info --isa-assay <path_to_assay_file> <path_pattern_to_fastq>
This combines information from both the FASTQ folder structure (given via path pattern) and the ISA-tab meta data (given via ISA-assay file).
If the ISA-tab files have not been downloaded yet, you can use the option --project_uuid <project_uuid> instead of --isa-assay to download them on-the-fly.
Now you can start the processing. Run the Sea-snap pipeline as usual:
$ ./sea-snap sc --slurm c <any snakemake options>
$ ./sea-snap sc --slurm c export
After the pipeline has finished, you can create a new landing zone with the following command.
This will print the landing zone properties as JSON.
You will need the landing zone UUID (ZONE) in the next step.
$ cubi-tk sodar landing-zone-create <project_uuid>
You can then transfer the data using the following commands. You will have to specify the blueprint file generated by the export rule of sea-snap.
$ cubi-tk sea-snap itransfer-results <blueprint_file> <landing_zone_uuid>
Finally, you can validate and move the landing zone to get the data into SODAR:
$ cubi-tk sodar landing-zone-move <landing_zone_uuid>
You may check whether everything was uploaded correctly using the following command:
$ cubi-tk sea-snap check-irods <path_to_local_results_folder> <irods_path_to_results_on_sodar>
Use Case: Archiving a project¶
This section describes the process of archiving a project using cubi-tk, and provides examples of how cubi-tk can be used in different situations.
Overview¶
The general process to archive projects is:
- Get acquainted with the contents of the project directory. The command cubi-tk archive summary provides a basic facility to identify several important aspects for the archival process. It does not, however, check whether files are already stored on SODAR; this must be done independently.
- Archives must be accompanied by a README.md file, which provides important contact information about the project's scientific P.I., e-mail addresses of the post-doc in charge, the individuals in CUBI that processed the data, and the person in charge of the archive. URLs for SODAR & Gitlab are also important. The command cubi-tk archive readme creates a valid README file that contains this information.
- In many cases, not all files should be archived: there is no need to duplicate large sequencing files (fastq or bam) if they are already safely stored on SODAR. Likewise, whole genome sequences, annotations, indices, etc. should not be archived in most cases. The command cubi-tk archive prepare identifies the files that must be copied and those that shouldn't (it can do a bit more, see below).
- Once these preparation steps have been carried out, the command cubi-tk archive copy performs the copy of the project to its final archive destination. This command creates checksums for all files in the project and in the archive copy, and provides an audit of the comparison between these two sets of checksums, to ensure that the archival was successful.
Each of the steps described above is discussed below, with practical examples and suggestions for good practice.
Summary¶
The summarisation step aims to report several cases of files that may require attention for archiving. In particular, symbolic links to destinations outside of the project’s directory should be reported. Dangling symbolic links (either because the target is missing, or because of permissions) are also listed.
The module also lists specific files of interest. By default, large bam or fastq files (larger than 256MB) are reported, as well as large fasta files, annotations (with .gtf or .gff extensions), and short-read-archive sequencing data.
It is possible for the user to change the reporting criteria, using a yaml
file & the --classes
option.
For example, to use your own reporting selection:
$ cubi-tk archive summary \
    --classes reporting_classes.yaml \
    <project_directory> \
    <summary_file>
The default summary classes can be found in <cubi-tk installation>/cubi_tk/archive/classes.yaml. Its content reads:
fastq:
    min_size: 268435456
    pattern: "^(.*/)?[^/]+(\\.f(ast)?q(\\.gz)?)$"
bam:
    min_size: 268435456
    pattern: "^(.*/)?[^/]+(\\.bam(\\.bai)?)$"
public:
    min_size: 268435456
    pattern: "^(.*/)?(SRR[0-9]+[^/]*|[^/]+\\.(fa(sta)?|gtf|gff[23]?)(\\.gz)?)$"
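As an illustration, a custom classes file could add a class reporting large VCF files; the class name, threshold and pattern below are hypothetical and should be adapted to your project:
$ cat > reporting_classes.yaml <<'EOF'
vcf:
    min_size: 16777216    # 16MB, hypothetical threshold
    pattern: "^(.*/)?[^/]+\\.vcf(\\.gz)?$"
EOF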
The output of the summarization is a table with the following columns: the reason why the file is reported, the file name, the symlink target (if the file is a symlink), the file's normalised path, its size and, for symlinks, whether the target is accessible and whether it lies inside the project.
Readme file creation¶
The module creates README files that must contain contact information for:
- The project's scientific P.I. (name & e-mail address),
- The person in charge of the project, very often a post-doc in the P.I.'s group (name & e-mail address),
- The person who is archiving the project (name & e-mail address). This person will be the project's contact in CUBI.
- The person who actually did the data processing & analysis in CUBI. This is generally the same person who is archiving the project, unless he or she has left CUBI.
The SODAR & Gitlab URLs should also be present in the README file, when applicable. Unlike the contact information, however, they are not mandatory.
Important notes
The creation of the README file is a frequent source of errors and frustration. To minimize the inconvenience, please heed these wise words.
- E-mail addresses must be present, valid & must not contain uppercase letters (don't ask why…).
- Generally, the module is quite fussy about the format: spaces, justification, … may be important.
- Upon README creation, the project directory is quickly scanned to generate an overview of the project's size and number of inodes. For large projects, this behaviour can be disabled with the --skip-collect option.
- Because of these problems, the module offers the possibility to check the validity of a README file with cubi-tk archive readme --is-valid (see the example after this list).
- If a README file is already present in the project, it will be appended at the bottom of the README file generated by the module.
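For example, to check an existing README file against a project directory:
$ cubi-tk archive readme --is-valid <project_dir> <readme_file>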
Most importantly, please edit your README file after it has been generated: the module adds no description of the project's aims & results, even though it is very useful and important to have one.
Preparation of the copy¶
During preparation, the user can select the files that will be archived, those that will be discarded, and those that must be processed differently.
The file selection is achieved by creating a temporary copy of the project’s directory structure, using symbolic links. The location of this temporary copy is called temporary destination.
When copying a file to this temporary destination, its fate is decided based on its filename & path, using regular expression pattern matching. There are 4 types of operations:
- Files are selected for copying. This is the default behaviour.
- Files can be omitted (or ignored) from the copy.
- Directories containing many (smallish) files can be tarred & compressed to reduce the total number of inodes (which is very file-system friendly).
- Finally, files can be squashed: such a file has its md5 checksum computed and saved in a companion file next to it, and is then replaced by a placeholder with the same name, but with a size of 0 (see the illustration after this list). This is useful for large files that can easily be downloaded again from the internet; public sequencing datasets, genome sequences & annotations are typical examples.
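As an illustration, after squashing a public dataset the directory might look as follows (a sketch; the naming of the checksum companion file is an assumption):
$ ls -l SRR1234567.fastq.gz*
-rw-r--r-- 1 user group  0 Jan  1 00:00 SRR1234567.fastq.gz
-rw-r--r-- 1 user group 53 Jan  1 00:00 SRR1234567.fastq.gz.md5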
The user can impose their own rules, based on the content of the project. The selection rules are defined in a yaml file accessed through the module's --rules option.
The default rules file is in <cubi-tk installation>/cubi_tk/archive/default_rules.yaml, and its content reads:
ignore:                           # Patterns for files or directories to skip
    - "^(.*/)?core\\.[0-9]+$"     # Ignore core dumps
    - "^(.*/)?\\.venv$"           # Ignore virtual environment .venv directories
compress:                         # Patterns for files or directories to tar-gzip
    - "^(.*/)?\\.snakemake$"      # Created by snakemake process
    - "^(.*/)?sge_log$"           # Snappy SGE log directories
    - "^(.*/)?\\.git$"            # Git internals
    - "^(.*/)?snappy-pipeline$"   # Copy of snappy
    - "^(.*/)?cubi_wrappers$"     # Copy of snappy's ancestor
squash: []                        # Patterns for files to squash (compute MD5 checksum, and replace by zero-length placeholder)
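For example, a project-specific rules file that squashes public SRA downloads could be created like this (a sketch; the patterns are hypothetical and should be adapted to the project's contents):
$ cat > my_rules.yaml <<'EOF'
ignore: []
compress:
    - "^(.*/)?\\.snakemake$"
squash:
    - "^(.*/)?SRR[0-9]+[^/]*\\.f(ast)?q(\\.gz)?$"   # Public SRA fastq files, re-downloadable
EOF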
Important notes
- The temporary destination is typically chosen as /fast/scratch/users/<user>/Archive/<project_name>.
- The README file generated in the previous step is copied to the temporary destination using the module's --readme option.
- When the temporary destination is complete, the module creates a complete list of all files accessible from the original project directory, and computes md5 & sha256 checksums using hashdeep. This is done for all files accessible from the project's directory, including all symbolic links.
- The computation of checksums can be extremely time-consuming. Multiple threads can be used with the --num-threads option. Nevertheless, in most cases it is advisable to submit the preparation as a slurm job rather than running it interactively.
Example of usage:
# --rules:             project-specific rules
# --readme:            README.md file generated in the previous step
# --ignore-tar-errors: useful only in case of inaccessible files to compress
$ cubi-tk archive prepare \
    --rules <my_rules> \
    --readme <my_readme> \
    --ignore-tar-errors \
    <project_dir> \
    <temporary_destination>
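To run the preparation through slurm, the command can for example be wrapped in a batch submission (a sketch; partition and resources are assumptions that should be adapted):
$ sbatch --partition=medium --time=24:00:00 --mem=4G --cpus-per-task=4 \
    --wrap "cubi-tk archive prepare --num-threads 4 --readme <my_readme> <project_dir> <temporary_destination>"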
Copy to final destination¶
The last step consists of copying all files in the temporary destination to the archiving location.
This is done internally using rsync, having previously removed all symbolic links connecting files within the project directory.
These local symbolic links are restored after the copy is complete, in both the temporary & final destinations.
After the copy is complete, the archiving directory can be protected against writing with the --read-only option.
A verification based on md5 checksums is automatically done between the original project directory and the final copy.
In most cases, differences between the directories are expected, because of the files ignored, compressed and squashed.
However, it is good practice to examine the audit file to make sure that all files missing from the copy are missing for the right reasons.
The report of checksums of all files in the original project and the audit result are both present in the final destination, as files called <date>_hashdeep_report.txt and <date>_hashdeep_audit.txt respectively.
For additional verification, it is also possible to request (using the --keep-workdir-hashdeep option) a hashdeep report of the temporary destination, and the corresponding audit of the final copy. The contents of these two directories are expected to be identical, and any discrepancy should be examined carefully.
The report & audit files relative to the temporary destination are called <date>_workdir_report.txt & <date>_workdir_audit.txt.
Finally, the copy and hashdeep steps are quite time-consuming, and it is good practice to submit the copy as a slurm job rather than running it interactively, even when multiple threads are used (through the --num-threads option).
An example of a copy script that can be submitted to slurm is:
#!/bin/bash
#SBATCH --job-name=copy
#SBATCH --output=slurm_log/copy.%j.out
#SBATCH --error=slurm_log/copy.%j.err
#SBATCH --partition=medium
#SBATCH --mem=4000
#SBATCH --time=72:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
# ------------------ Command-line options -----------------------------
# Taken from https://stackoverflow.com/questions/402377/using-getopts-to-process-long-and-short-command-line-options
TEMP=$(getopt -o ts:d: --long dryrun,source:,destination: -- "$@")
if [ $? != 0 ] ; then echo "Terminating..." >&2 ; exit 1 ; fi
# Note the quotes around '$TEMP': they are essential!
eval set -- "$TEMP"
dryrun=0
src=""
dest=""
while true; do
case "$1" in
-t | --dryrun ) dryrun=1; shift ;;
-s | --source ) src="$2"; shift 2 ;;
-d | --destination ) dest="$2"; shift 2 ;;
-- ) shift; break ;;
* ) break ;;
esac
done
if [[ "X$src" == "X" ]] ; then echo "No project directory defined" >&2 ; exit 1 ; fi
if [[ ! -d "$src" ]] ; then echo "Can't find project directory $src" >&2 ; exit 1 ; fi
if [[ "X$dest" == "X" ]] ; then echo "No temporary directory defined" >&2 ; exit 1 ; fi
if [[ -e "$dest" ]] ; then echo "Temporary directory $dest already exists" >&2 ; exit 1 ; fi
if [[ dryrun -eq 1 ]] ; then
echo "cubi-tk archive copy "
echo "--read-only --keep-workdir-hashdeep --num-threads 8 "
echo "\"$src\" \"$dest\""
exit 0
fi
# ---------------------- Submit to slurm ------------------------------
export LC_ALL=en_US
unset DRMAA_LIBRARY_PATH
test -z "${SLURM_JOB_ID}" && SLURM_JOB_ID=$(date +%Y-%m-%d_%H-%M)
mkdir -p slurm_log/${SLURM_JOB_ID}
CONDA_PATH=$HOME/work/miniconda3
set +euo pipefail
conda deactivate &>/dev/null || true # deactivate any currently active environment
source $CONDA_PATH/etc/profile.d/conda.sh
conda activate cubi_tk # activate the environment containing cubi-tk
set -euo pipefail
cubi-tk archive copy \
--read-only --keep-workdir-hashdeep --num-threads 8 \
"$src" "$dest"
Credits¶
- Eudes Barbosa
- Johannes Helmuth
- Manuel Holtgrewe
- Patrick Pett
License¶
You can find the license of CUBI-TK below.
MIT License
Copyright (c) 2020-2021, Berlin Institute of Health
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.