Manual for `sea-snap write-sample-info`

The cubi-tk sea-snap write-sample-info command can be used to collect information by parsing the folder structure of raw data files (FASTQ) and meta-information (ISA-tab). It collects this information in a YAML file that will be loaded by the Seasnap pipeline.

The basic usage is:

$ cubi-tk sea-snap write-sample-info IN_PATH_PATTERN

where IN_PATH_PATTERN is a file path with wildcards specifying the location of the FASTQ files. The wildcards are also used to extract information from the parsed paths.

By default, a file called sample_info.yaml will be generated in the current working directory. If this file is in the project working directory, Seasnap will load it automatically. To write to a different file, specify its name after IN_PATH_PATTERN. That file can then be passed to Seasnap, for example:

$ ./sea-snap mapping l --config file_name='sample_info_alt.yaml'

Note: check and edit the auto-generated sample_info.yaml file before running the pipeline.

Path pattern and wildcards

For example, if the FASTQ files are stored in a folder structure like this:

input
├── sample1
│   ├── sample1_R1.fastq.gz
│   └── sample1_R2.fastq.gz
└── sample2
    ├── sample2_R1.fq
    └── sample2_R2.fq

Then the path pattern can look like the following:

$ cubi-tk sea-snap write-sample-info "input/{sample}/*_{mate,R1|R2}"

Keywords in braces (e.g. {sample}) are wildcards. A regular expression can be added after the keyword, separated by a comma. This restricts which part of the file path the wildcard can match (e.g. {mate,R1|R2} means that mate can only be R1 or R2). In addition, * and ** can be used to match anything that does not need to be captured with a wildcard.

Setting the IN_PATH_PATTERN as shown above will allow the write-sample-info command to extract the information that samples sample1 and sample2 exist and that there are paired reads for both of them. The extension (e.g. fastq.gz, fastq or fq) should be omitted and will be detected automatically.

Available wildcards are: {sample}, {mate}, {flowcell}, {lane}, {batch} and {library}. However, only {sample} is obligatory.

Note: wildcards do not match "/" and ".". For further information, see also the Seasnap documentation.
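To make the matching rules above more concrete, here is a minimal Python sketch of how such a path pattern could be translated into a regular expression and applied to the example folder structure. It is not the cubi-tk or Seasnap implementation; the function names (pattern_to_regex, collect_samples), the default wildcard expression and the handling of * and ** are assumptions chosen for illustration, and wildcard regexes that themselves contain braces are not supported by this sketch.

```python
import re

# Assumption: a wildcard without an explicit regex matches any run of
# characters other than "/" and ".", mirroring the note above.
DEFAULT_WILDCARD = r"[^/.]+"

# Recognizes {name}, {name,regex}, "**" and "*" inside a path pattern.
TOKEN = re.compile(r"\{(\w+)(?:,([^{}]+))?\}|\*\*|\*")


def pattern_to_regex(path_pattern):
    """Translate a path pattern into a compiled regex (illustrative only)."""
    parts = []
    pos = 0
    for m in TOKEN.finditer(path_pattern):
        # Literal text between tokens is matched verbatim.
        parts.append(re.escape(path_pattern[pos:m.start()]))
        token = m.group(0)
        if token == "**":
            parts.append(r".*")        # assumption: may span directories
        elif token == "*":
            parts.append(r"[^/]*")     # assumption: stays within one path component
        else:
            name = m.group(1)
            regex = m.group(2) or DEFAULT_WILDCARD
            parts.append(f"(?P<{name}>{regex})")
        pos = m.end()
    parts.append(re.escape(path_pattern[pos:]))
    # The extension is omitted from the pattern, so accept common FASTQ endings.
    return re.compile("".join(parts) + r"\.(?:fastq|fq)(?:\.gz)?$")


def collect_samples(paths, path_pattern):
    """Map each sample name to the set of mates found for it."""
    regex = pattern_to_regex(path_pattern)
    samples = {}
    for path in paths:
        m = regex.match(path)
        if m:
            info = m.groupdict()
            samples.setdefault(info["sample"], set()).add(info.get("mate"))
    return samples


paths = [
    "input/sample1/sample1_R1.fastq.gz",
    "input/sample1/sample1_R2.fastq.gz",
    "input/sample2/sample2_R1.fq",
    "input/sample2/sample2_R2.fq",
]
samples = collect_samples(paths, "input/{sample}/*_{mate,R1|R2}")
for name, mates in sorted(samples.items()):
    print(name, "paired" if {"R1", "R2"} <= mates else "single")
# prints:
# sample1 paired
# sample2 paired
```

Because {sample} uses the default expression, it cannot run across a directory separator or into the file extension, which is why each directory name is captured cleanly as the sample name.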

Meta information

When working with SODAR, additional meta-information should be included in the sample info file. In SODAR this meta-information is stored in the form of ISA-tab files.

There are two ways to add the information from an ISA-tab assay file to the generated sample info file:

Load from a local ISA-tab assay file

$ cubi-tk sea-snap write-sample-info --isa-assay PATH/TO/a_FILE_NAME.txt IN_PATH_PATTERN

Download from SODAR

$ cubi-tk sea-snap write-sample-info --project_uuid UUID IN_PATH_PATTERN

Here, UUID is the UUID of the respective project on SODAR.

Table format

Although it is not required for running the workflow, the YAML file can be converted to a table (sample sheet):

$ cubi-tk sea-snap write-sample-info --from-file sample_info.yaml XXX sample_info.tsv

And back:

$ cubi-tk sea-snap write-sample-info --from-file sample_info.tsv XXX sample_info.yaml
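As a rough illustration of what such a round trip involves, the following Python sketch flattens a per-sample mapping into tab-separated rows and reads it back. It does not reproduce the converter in cubi-tk: the helper names (to_tsv, from_tsv) and the record keys (read_extension, paired) are hypothetical, and the real layout of the sample info file is defined by Seasnap.

```python
import csv
import io

# Hypothetical per-sample records; the actual keys are defined by Seasnap.
sample_info = {
    "sample1": {"read_extension": "fastq.gz", "paired": "yes"},
    "sample2": {"read_extension": "fq", "paired": "yes"},
}


def to_tsv(info):
    """Write one row per sample, one column per key found in any record."""
    columns = sorted({key for record in info.values() for key in record})
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["sample"] + columns)
    for sample, record in info.items():
        writer.writerow([sample] + [record.get(col, "") for col in columns])
    return buf.getvalue()


def from_tsv(text):
    """Rebuild the per-sample mapping from the table; all values are strings."""
    info = {}
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        sample = row.pop("sample")
        info[sample] = dict(row)
    return info


table = to_tsv(sample_info)
print(table, end="")
restored = from_tsv(table)
```

A table only holds flat string values, so in this sketch any typed or nested values would need explicit handling to survive the round trip.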

More Information

For more information, see the cubi-tk sea-snap write-sample-info CLI documentation and the output of cubi-tk sea-snap write-sample-info --help.
