Manual for archive¶
The cubi-tk archive module is designed to facilitate the archival of older projects away from the cluster’s fast file system.
This document provides an overview of its commands, and of how they can be adapted to meet specific needs.
Glossary¶
Hot storage: Fast and expensive, therefore usually size restricted. For example:
- GPFS by DDN (currently at /fast)
- Ceph with SSDs
Warm storage: Slower, but with more space and possibly mirroring. For example:
- SODAR with irods
- Ceph with HDDs (/data/cephfs-2/)
Cold storage: For data that needs to be accessed only rarely. For example:
- Tape archive
Background: the archiving process¶
CUBI archive resources are three-fold:
- SODAR and associated irods storage should contain raw data generated for the project. SODAR also contains important results (mapping, variants, differential expression, …).
- Gitlab contains small files required to generate the results, typically scripts, configuration files, READMEs, meeting notes, …, but also knock-in gene sequence, list of papers, gene lists, etc.
- The rest should be stored in CEPH (warm storage).
For older projects or intermediate results produced by older pipelines the effort of uploading the data to SODAR & gitlab may not be warranted. In this case, the bulk of the archive might be stored in the CEPH file system.
The module aims to facilitate this last step, i.e. the archival of old projects to move them away from the hot storage.
Archiving process requirements¶
Archived projects should contain all important files, but not data already stored elsewhere. In particular, the following files should not be archived:
- raw data (*.fastq.gz files) saved in SODAR or in the STORE,
- data from public repositories (SRA, GDC portal, …) that can easily be downloaded again,
- static data such as genome sequence & annotations, variant databases from gnomAD, … that can also be easily retrieved,
- index files for mapping that can be re-generated.
Importantly, a README file should be present in the archive, briefly describing the project, listing contacts to the client & within CUBI and providing links to SODAR & Gitlab when appropriate.
The purpose of the module is:
- to provide a summary of files that require special attention, for example symlinks whose targets lie outside of the project, or large files (*.fastq.gz or *.bam especially),
- to create a temporary directory that mimics the archived files with symlinks,
- to use this temporary directory as template to copy files on the CEPH filesystem, and
- to compute checksums on the originals and copies, to ensure accuracy of the copy process.
Basic usage¶
Summary of files in project¶
$ cubi-tk archive summary PROJECT_DIRECTORY DESTINATION
Unlike other cubi-tk commands, here DESTINATION is not a landing zone, but a local filename for the summary of files that require attention.
By default, the summary reports:
- dangling symlinks (including links whose target cannot be read because of permissions),
- symlinks pointing outside of the project directory,
- large (greater than 256MB) *.fastq.gz, *.fq.gz & *.bam files,
- large static data files with extension *.gtf, *.gff, *.fasta & *.fa (possibly gzipped), which may be publicly available,
- large files from SRA with prefix SRR.
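The first two items of this report can be illustrated with a minimal sketch. It is not the module’s implementation: the `describe` and `demo` helpers are hypothetical, and the sketch assumes that "dangling" means "the resolved target cannot be read" and that "outside" only applies to symlinks.

```python
import os
import tempfile
from pathlib import Path


def describe(path: Path, project_root: Path) -> dict:
    """Return hypothetical Dangling/Outside flags for one path of the project."""
    root = Path(os.path.realpath(project_root))
    # realpath gives a best-effort path even when the target does not exist.
    resolved = Path(os.path.realpath(path))
    # Assumption: a missing or unreadable target counts as dangling.
    dangling = not os.access(resolved, os.R_OK)
    # Assumption: only symlinks can be "outside"; real files never are.
    outside = path.is_symlink() and resolved != root and root not in resolved.parents
    return {"Dangling": dangling, "Outside": outside}


def demo() -> dict:
    """Build a tiny project with one real file, one outside link, one broken link."""
    with tempfile.TemporaryDirectory() as tmp:
        project = Path(tmp, "project")
        outside = Path(tmp, "outside")
        project.mkdir()
        outside.mkdir()
        (outside / "batch1.txt").write_text("data")
        (project / "real.txt").write_text("data")
        (project / "to_outside").symlink_to(outside / "batch1.txt")
        (project / "broken").symlink_to(outside / "inexistent_data")
        return {p.name: describe(p, project) for p in sorted(project.iterdir())}


if __name__ == "__main__":
    for name, flags in demo().items():
        print(name, flags)
```

The size and pattern checks for the remaining items depend on the classes configuration described in the Configuration section.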
The summary file is a table with the following columns:
- Class: the name(s) of the pattern(s) that match the file. When the file matches several patterns, all are listed, separated by |.
- Filename: the relative path of the file (from the project’s root).
- Target: the symlink’s target (when applicable)
- ResolvedName: the resolved (absolute, symlinks removed) path of the target. When the target doesn’t exist or is inaccessible because of permissions, the likely path of the target.
- Size: file size (target file size for symlinks). When the file doesn’t exist, it is set to 0.
- Dangling: True when the file cannot be read (missing or inaccessible), False otherwise.
- Outside: True when the target path is outside of the project directory, False otherwise. It is always False for real files (i.e. not symlinks).
The summary step also reports an overview of the results: the total number of files, the total size of the project, and the number of links to files. Dangling links and links that are inaccessible because of permission issues are counted separately. The number of files outside of the project that are linked to from within it by symlinks is also reported. Finally, for each of the “important files” classes, the overview gives the number of files, the number of files outside of the project directory, and the number of files lost because of symlink failures.
Archive preparation: README.md file creation¶
$ cubi-tk archive readme PROJECT_DIRECTORY README_FILE
README_FILE is the path to the README file that will be created. It must not exist.
The README file is created by filling in contact information interactively. Command-line options are also available, but interactive confirmation is still required.
It is possible to test whether a generated README file is valid for project archival, using:
$ cubi-tk archive readme --is-valid PROJECT_DIRECTORY README_FILE
The module will highlight mandatory records that could not be found in the current file. These mandatory records are lines following the patterns below:
- P.I.: [Name of the PI, any string](mailto:<valid email address in lowercase>)
- Client contact: [Name of our contact in the PI's group](mailto:<valid email address in lowercase>)
- CUBI project leader: [Name of the CUBI member leading the project]
- CUBI contact: [Name of the archiver](mailto:<valid email address in lowercase>)
- Project name: <any string>
- Start date: YYYY-MM-DD
- Current status: <One of Active, Inactive, Finished, Archived>
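To make the validity check more concrete, here is a minimal sketch of how such mandatory records could be detected with regular expressions. The expressions, the simplified lowercase e-mail pattern and the `missing_records` helper are assumptions made for illustration; they are not the patterns used by the module.

```python
import re

# Assumption: a simplified, lowercase-only e-mail pattern.
EMAIL = r"[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}"
NAME_AND_MAIL = r"\[[^\]]+\]\(mailto:" + EMAIL + r"\)"

# Hypothetical value patterns for the mandatory records listed above.
MANDATORY = {
    "P.I.": NAME_AND_MAIL,
    "Client contact": NAME_AND_MAIL,
    "CUBI project leader": r".+",
    "CUBI contact": NAME_AND_MAIL,
    "Project name": r".+",
    "Start date": r"\d{4}-\d{2}-\d{2}",
    "Current status": r"(Active|Inactive|Finished|Archived)",
}


def missing_records(text: str) -> list:
    """Return the mandatory record names with no valid line in the README text."""
    missing = []
    for key, value in MANDATORY.items():
        pattern = re.compile(r"^- " + re.escape(key) + r": " + value + r"\s*$", re.MULTILINE)
        if not pattern.search(text):
            missing.append(key)
    return missing


README = """# Example project
- P.I.: [Jane Doe](mailto:jane.doe@example.org)
- Client contact: [John Roe](mailto:john.roe@example.org)
- CUBI project leader: Alex Example
- CUBI contact: [Alex Example](mailto:alex.example@example.org)
- Project name: Example project
- Start date: 2020-01-31
- Current status: Finished
"""

if __name__ == "__main__":
    print(missing_records(README))
```

In this sketch, a status outside the four allowed values or an e-mail address containing uppercase letters makes the corresponding record count as missing.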
Archive preparation: temporary copy¶
$ cubi-tk archive prepare --readme README PROJECT_DIRECTORY TEMPORARY_DESTINATION
TEMPORARY_DESTINATION is the path to the temporary directory that will be created. It must not exist.
For each file that must be archived, the module creates a symlink to that file’s absolute path. The module also reproduces the project’s directory hierarchy, so that the symlink sits in the same relative position in the temporary directory as in the original project.
The module deals with symlinks in the project differently depending on whether their target is inside the project or not. For symlinks pointing outside of the project, a symlink to the target’s absolute path is created. For symlinks pointing inside the project, a relative path symlink is created. This makes it possible to store all files (even those outside of the project), without duplicating the files that symlinks inside the project point to.
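The distinction between the two kinds of symlinks can be sketched as follows. This is only an illustration under simplifying assumptions (the `link_target` and `demo` helpers are hypothetical, and corner cases such as links reached through other links are ignored), not the module’s code.

```python
import os
import tempfile
from pathlib import Path


def link_target(link: Path, project_root: Path) -> str:
    """Target to write in the temporary destination for a symlink of the project."""
    root = Path(os.path.realpath(project_root))
    resolved = Path(os.path.realpath(link))
    if root in resolved.parents:
        # Inside the project: path relative to the directory holding the link.
        link_dir = Path(os.path.realpath(link.parent))
        return os.path.relpath(resolved, start=link_dir)
    # Outside the project: absolute path of the target.
    return str(resolved)


def demo() -> dict:
    """Reproduce two links of the kind discussed above in a scratch directory."""
    with tempfile.TemporaryDirectory() as tmp:
        project = Path(tmp, "project")
        (project / "pipeline/work/sample1").mkdir(parents=True)
        (project / "pipeline/output/sample1").mkdir(parents=True)
        (project / "raw_data").mkdir()
        Path(tmp, "outside/batch1").mkdir(parents=True)
        (project / "pipeline/work/sample1/results").write_text("results")
        inside = project / "pipeline/output/sample1/results"
        inside.symlink_to("../../work/sample1/results")
        outside = project / "raw_data/batch1"
        outside.symlink_to("../../outside/batch1")
        return {
            "inside": link_target(inside, project),
            "outside": link_target(outside, project),
        }


if __name__ == "__main__":
    print(demo())
```

Because the target is fully resolved before the relative path is computed, a chain of links inside the project collapses into a single link in this sketch.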
Additional transformations of the original files are carried out during the preparation step:
- The contents of the .snakemake, sge_log, cubi_wrappers & snappy-pipeline directories are processed differently: the directories are tarred & compressed in the temporary destination, to reduce the number of inodes in the archive.
- The core dump files are not copied to the temporary destination, and therefore won’t be copied to the final archive.
- The README.md file created by the readme subcommand must also be supplied, so that it can be placed at the top level of the temporary destination. If the original project already contains a README.md file, it will be appended to the generated one, as the latter is valid (it contains all mandatory information).
Copy to archive & verification¶
$ cubi-tk archive copy TEMPORARY_DESTINATION FINAL_DESTINATION
FINAL_DESTINATION is the path to the final destination of the archive, on the warm storage. It must not exist.
Configuration¶
The files reported in the summary are under user control, through the --classes option, which must point to a yaml file describing the regular expression pattern & minimum size for each class. For example, raw data files can be identified as follows:
fastq:
    min_size: 268435456
    pattern: "^(.*/)?[^/]+(\\.f(ast)?q(\\.gz)?)$"
The files larger than 256MB, with extension *.fastq, *.fq, *.fastq.gz or *.fq.gz will be reported with the class fastq.
Any number of file classes can be defined. The default classes configuration is in cubi_tk/archive/classes.yaml.
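The effect of this class definition can be checked with a short sketch. The `matches_class` helper is hypothetical, and the sketch assumes that a file matches when its size is at least `min_size` (268435456 bytes, i.e. 256 × 1024 × 1024) and its path matches the pattern; the module may apply the threshold differently.

```python
import re

# The class definition above; the YAML "\\." becomes "\." in the regular expression.
FASTQ = {
    "min_size": 268435456,
    "pattern": r"^(.*/)?[^/]+(\.f(ast)?q(\.gz)?)$",
}


def matches_class(path: str, size: int, spec: dict) -> bool:
    """True when the file is large enough and its path matches the class pattern."""
    # Assumption: the size threshold is inclusive.
    return size >= spec["min_size"] and re.match(spec["pattern"], path) is not None


if __name__ == "__main__":
    big = 300 * 1024 * 1024
    for name in ("raw_data/batch3/sample3.fastq.gz", "reads.fq", "sample3.fastq.gz.md5"):
        print(name, matches_class(name, big, FASTQ))
```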
The behaviour of the archive preparation can also be changed using the --rules option. The rules are also described in a yaml file by regular expression patterns.
Three different archiving options are implemented:
- ignore: the files or directories matching the pattern are simply omitted from the temporary destination. This is useful to ignore remaining temporary files, core dumps or directories containing lists of input symlinks, for example.
- compress: the files or directories matching the pattern will be replaced in the temporary destination by a compressed (gzipped) tar file. This is how .snakemake or sge_log directories are treated by default, but patterns for other directories may be added, for example for the Slurm log directories.
- squash: the files matching the pattern will be replaced by zero-length placeholders in the temporary destination. An md5 checksum file will be added next to the placeholder, to enable verification.
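As an illustration of the squash option, the sketch below computes the MD5 checksum of a file and writes a zero-length placeholder next to a checksum file. The `squash` and `demo` helpers, the `.md5` suffix and the md5sum-style line format are assumptions made for this example, not a description of the module’s code.

```python
import hashlib
import tempfile
from pathlib import Path


def squash(source: Path, destination: Path) -> str:
    """Write an empty placeholder and a checksum file; return the MD5 digest."""
    digest = hashlib.md5()
    with source.open("rb") as handle:
        # Read in 1 MiB chunks so that large files are never fully loaded in memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    destination.parent.mkdir(parents=True, exist_ok=True)
    destination.touch()  # zero-length placeholder
    checksum_file = destination.with_name(destination.name + ".md5")
    checksum_file.write_text(f"{digest.hexdigest()}  {source.name}\n")
    return digest.hexdigest()


def demo() -> tuple:
    """Squash a small file and return the placeholder size and the checksum line."""
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp, "project", "file.public")
        source.parent.mkdir()
        source.write_bytes(b"hello")
        placeholder = Path(tmp, "temp_dest", "file.public")
        squash(source, placeholder)
        checksum_line = placeholder.with_name("file.public.md5").read_text()
        return placeholder.stat().st_size, checksum_line


if __name__ == "__main__":
    print(demo())
```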
When the user doesn’t specify their own set using the --rules option, the rules applied are the following: core dumps and .venv directories are ignored, .snakemake, sge_log, .git, snappy-pipeline and cubi_wrappers directories are compressed, and nothing is squashed. The exact definitions are:
ignore: # Patterns for files or directories to skip
- "^(.*/)?core\\.[0-9]+$"
- "^(.*/)?\\.venv$"
compress: # Patterns for files or directories to tar-gzip
- "^(.*/)?\\.snakemake$"
- "^(.*/)?sge_log$"
- "^(.*/)?\\.git$"
- "^(.*/)?snappy-pipeline$"
- "^(.*/)?cubi_wrappers$"
squash: [] # Patterns for files to squash (compute MD5 checksum, and replace by zero-length placeholder)
Examples¶
Consider an example project. It contains:
- raw data in a raw_data directory, some of which is stored outside of the project’s directory,
- processing results in the pipeline directory,
- additional data files & scripts in extra_data,
- a .snakemake directory that can potentially contain many files in conda environments, for example, and
- a bunch of temporary & obsolete files that shouldn’t be archived, conveniently grouped into the ignored_dir directory.
The architecture of this toy project is displayed below:
project/
├── extra_data
│   ├── dangling_symlink -> ../../outside/inexistent_data
│   ├── file.public
│   ├── to_ignored_dir -> ../ignored_dir
│   └── to_ignored_file -> ../ignored_dir/ignored_file
├── ignored_dir
│   └── ignored_file
├── pipeline
│   ├── output
│   │   ├── sample1
│   │   │   └── results -> ../../work/sample1/results
│   │   └── sample2 -> ../work/sample2
│   └── work
│       ├── sample1
│       │   └── results
│       └── sample2
│           └── results
├── raw_data
│   ├── batch1 -> ../../outside/batch1
│   ├── batch2
│   │   ├── sample2.fastq.gz -> ../../../outside/batch2/sample2.fastq.gz
│   │   └── sample2.fastq.gz.md5 -> ../../../outside/batch2/sample2.fastq.gz.md5
│   └── batch3
│       ├── sample3.fastq.gz
│       └── sample3.fastq.gz.md5
└── .snakemake
    └── snakemake
Prepare the copy on the temporary destination¶
Imagine now that the raw data is already safely archived in SODAR. We don’t want to save these files in duplicate, so we decide to squash the raw data files so that their size is set to 0, and their md5 checksum is added. We also do the same for the publicly downloadable file file.public. We also want to ignore the junk in ignored_dir, and to compress the .snakemake directory. We therefore need a rules file, called my_rules.yaml in the command that follows, that expresses these three choices.
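The content of the rules file is not reproduced in this section. As a sketch only, rules consistent with the description could look like the patterns below, which in the yaml file would be listed under the ignore, compress and squash keys. The patterns, the `action_for` helper, the use of paths relative to the project root and the precedence order (ignore, then compress, then squash) are all assumptions made for illustration.

```python
import re

# Hypothetical rules matching the choices described above (not the original file).
RULES = {
    "ignore": [r"^(.*/)?ignored_dir$"],
    "compress": [r"^(.*/)?\.snakemake$"],
    "squash": [
        r"^(.*/)?file\.public$",
        r"^(.*/)?raw_data/(.*/)?[^/]+\.fastq\.gz$",
    ],
}


def action_for(path: str, rules: dict) -> str:
    """Return the first action whose patterns match the path, or "keep"."""
    # Assumption: actions are tried in this order; the module may differ.
    for action in ("ignore", "compress", "squash"):
        if any(re.match(pattern, path) for pattern in rules[action]):
            return action
    return "keep"


if __name__ == "__main__":
    for path in (
        "ignored_dir",
        ".snakemake",
        "extra_data/file.public",
        "raw_data/batch3/sample3.fastq.gz",
        "raw_data/batch3/sample3.fastq.gz.md5",
        "pipeline/work/sample1/results",
    ):
        print(path, "->", action_for(path, RULES))
```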
After running the preparation command cubi-tk archive prepare --rules my_rules.yaml project temp_dest, the temporary destination contains the following files:
temp_dest
├── <today's date>_hashdeep_report.txt
├── extra_data
│   ├── file.public
│   ├── file.public.md5
│   ├── to_ignored_dir -> ../ignored_dir
│   └── to_ignored_file -> ../ignored_dir/ignored_file
├── pipeline
│   ├── output
│   │   ├── sample1
│   │   │   └── results -> ../../work/sample1/results
│   │   └── sample2 -> ../work/sample2
│   └── work
│       ├── sample1
│       │   └── results -> /absolute_path/project/pipeline/work/sample1/results
│       └── sample2
│           └── results -> /absolute_path/project/pipeline/work/sample2/results
├── raw_data
│   ├── batch1
│   │   ├── sample1.fastq.gz
│   │   └── sample1.fastq.gz.md5 -> /absolute_path/outside/batch1/sample1.fastq.gz.md5
│   ├── batch2
│   │   ├── sample2.fastq.gz
│   │   └── sample2.fastq.gz.md5 -> /absolute_path/outside/batch2/sample2.fastq.gz.md5
│   └── batch3
│       ├── sample3.fastq.gz
│       └── sample3.fastq.gz.md5 -> /absolute_path/project/raw_data/batch3/sample3.fastq.gz.md5
├── README.md
└── .snakemake.tar.gz
The inaccessible file project/extra_data/dangling_symlink & the contents of project/ignored_dir are not present in the temporary destination, either because they are not accessible, or because they have been deliberately ignored by the preparation step.
The .snakemake directory is replaced by the gzipped tar file .snakemake.tar.gz in the temporary destination.
The file.public & the 3 *.fastq.gz files have been replaced by placeholder files of size 0. For file.public, the md5 checksum has been computed by the preparation step, but for the *.fastq.gz files, the existing checksums are used.
All other files are kept for archiving: real files are represented by symlinks to their absolute path, while symlinks from the original project are absolute when they point outside of the project, and relative when they point inside it.
Finally, the hashdeep report of the original project directory is written to the temporary destination, and a README.md file is created. At this point, we edit the README.md file to add a meaningful description of the project. If a README.md file was already present in the original project directory, its content will be added to the newly created file.
Note that the symlinks temp_dest/extra_data/to_ignored_dir & temp_dest/extra_data/to_ignored_file are dangling, because the links themselves were not omitted, but their targets were. This is the expected, but perhaps unwanted behaviour: symlinks pointing to files or directories within compressed or ignored directories will be dangling in the temporary destination, as the original file exists, but is not part of the temporary destination.
Copy to the final destination¶
When the README.md editing is complete, the copy to the final destination on the warm file system can be done. It is a matter of running cubi-tk archive copy temp_dest final_dest.
The copy step writes in the final destination the hashdeep audit of the copy against the original project. This audit is expected to fail, because files & directories are ignored, compressed or squashed. With the option --keep-workdir--hashdeep, the programme also outputs the hashdeep report of the temporary destination, and the audit of the final copy against the temporary destination. Both the report and the audit are also stored in the final copy directory. The audit of the copy against the temporary destination should be successful, as the copy doesn’t re-process files, it only follows symlinks.
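The idea behind the second audit can be sketched without hashdeep: compute a checksum for every file reachable from the temporary destination (following symlinks), do the same for the final copy, and compare. The `md5_table`, `audit` and `demo` helpers below are hypothetical stand-ins written for illustration; they do not reproduce hashdeep’s report format or the module’s code.

```python
import hashlib
import os
import shutil
import tempfile
from pathlib import Path


def md5_table(root: Path) -> dict:
    """Map each readable file's relative path to its MD5, following symlinks."""
    table = {}
    for directory, _subdirs, files in os.walk(root, followlinks=True):
        for name in files:
            path = Path(directory, name)
            if not path.is_file():
                continue  # dangling symlink: nothing to hash
            # read_bytes loads the whole file, which is acceptable for a sketch.
            table[str(path.relative_to(root))] = hashlib.md5(path.read_bytes()).hexdigest()
    return table


def audit(reference: Path, copy: Path) -> list:
    """Return the relative paths whose presence or checksum differs."""
    expected, found = md5_table(reference), md5_table(copy)
    return sorted(k for k in expected.keys() | found.keys() if expected.get(k) != found.get(k))


def demo() -> tuple:
    """Audit a faithful copy, then the same copy after corrupting one file."""
    with tempfile.TemporaryDirectory() as tmp:
        project, temp_dest, final = (Path(tmp, n) for n in ("project", "temp_dest", "final"))
        for directory in (project, temp_dest, final):
            directory.mkdir()
        (project / "data.txt").write_text("results")
        (temp_dest / "data.txt").symlink_to(project / "data.txt")
        shutil.copy(temp_dest / "data.txt", final / "data.txt")  # follows the symlink
        clean = audit(temp_dest, final)
        (final / "data.txt").write_text("corrupted")
        return clean, audit(temp_dest, final)


if __name__ == "__main__":
    print(demo())
```

An empty list corresponds to a successful audit; any listed path would need to be investigated before the original project is removed.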
If all steps have been completed successfully (including checking the README.md for validity), then a marker file named archive_copy_complete is created. The final step is to remove write permissions if the --read-only option was selected.
Additional notes and caveats¶
- Generally, the module does not handle circular symlinks well. It is wise to fix them before any operation, or use the rules facility to ignore them during preparation. The --dont-follow-links option in the summary step guards against such problems, at the expense of missing some files in the report.
- The module is untested for symlink corner cases (for example, where a symlink points to a symlink outside of the project, which in turn points to another file in the project).
- In the archive, relative symlinks within the project are resolved. For example, in the original project one might have variants.vcf -> ../work/variants.vcf -> variants.somatic.vcf. In the archive, the link will be variants.vcf -> ../work/variants.somatic.vcf.
More Information¶
Also see cubi-tk archive --help, cubi-tk archive summary --help, cubi-tk archive prepare --help & cubi-tk archive copy --help for more information.