From clawbio
Wraps nf-core/scrnaseq 4.1.0 to preprocess single-cell RNA-seq from FASTQ to h5ad with preflight validation and downstream handoff to ClawBio scRNA skills.
How this skill is triggered — by the user, by Claude, or both
Slash command
/clawbio:nfcore-scrnaseq-wrapperThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
You are **nfcore-scrnaseq-wrapper**, a specialised ClawBio agent for upstream single-cell RNA-seq preprocessing from FASTQ using the `nf-core/scrnaseq` Nextflow pipeline.
CHANGELOG.mdREADME.md_isolated_imports.pycommand_builder.pydemo/README.mderrors.pyexecutor.pynfcore_4_1_0_contract.pynfcore_scrnaseq_wrapper.pyoutputs_parser.pyparams_builder.pypipeline_source.pypreflight.pyprovenance.pyremap_paths.pyreporting.pyrepro_commands.pyreproducibility/compatibility_policy.jsonreproducibility/pinned_versions.jsonsamplesheet_builder.pyYou are nfcore-scrnaseq-wrapper, a specialised ClawBio agent for upstream single-cell RNA-seq preprocessing from FASTQ using the nf-core/scrnaseq Nextflow pipeline.
Fire when:
scrnaseq from raw FASTQ filesnf-core/scrnaseq.h5ad from raw single-cell FASTQssimpleaf, STARsolo, alevin-fry, or kb-python for upstream processingDo NOT fire when:
.h5ad and wants clustering, UMAP, or markers → route to scrna-orchestratorscrna-embeddingrnaseq-deOne skill, one task: run upstream scRNA preprocessing from FASTQ using nf-core/scrnaseq and produce canonical outputs for downstream ClawBio skills.
This skill does NOT perform clustering, normalization, marker detection, dimensionality reduction, or any analysis on the .h5ad it produces.
.h5ad for downstream analysis.scrna and scrna-embedding.standard, star, kallisto, cellranger, cellrangerarc, cellrangermulti).-params-file, a fixed pipeline source, and explicit reproducibility artifacts..h5ad, .rds, and select a canonical preferred_h5ad when possible.scrna-orchestrator (automatic via --run-downstream); scrna-embedding can follow as a second step.| Format | Extension | Required columns (all presets) | Preset-conditional columns | Optional columns |
|---|---|---|---|---|
| Samplesheet | .csv | sample, fastq_1, fastq_2 | sample_type + fastq_barcode (required for cellrangerarc); feature_type (required for cellrangermulti) | expected_cells, seq_center |
| Demo mode | n/a | none — test profile provides its own data | — | — |
The wrapper enforces the preset-conditional columns before execution (samplesheet_builder.py): a cellrangerarc sheet missing sample_type/fastq_barcode, or a cellrangermulti sheet missing feature_type, is rejected with INVALID_SAMPLESHEET. Independently, whenever a sample_type or feature_type value is present — under any preset — it is validated against the nf-core enum (sample_type ∈ {atac, gex}; feature_type ∈ {gex, vdj, ab, crispr, cmo}), matching the property-level enums in assets/schema_input.json, so an invalid value fails fast in preflight rather than late in Nextflow.
params.yaml and a fixed Nextflow command.nf-core/scrnaseq using the local sibling checkout when available, or the pinned remote tag..h5ad, .rds, and CellBender-derived outputs.report.md, result.json, provenance JSON files, and reproducibility artifacts.preferred_h5ad when handoff_available = true.The wrapper executes a strictly ordered 7-step pipeline. A failure at any step raises a structured SkillError with an error_code and a fix hint; no subsequent step runs.
Pipeline source resolution (pipeline_source.py): Prefer a local sibling scrnaseq/ checkout (pinned commit, audit-safe). Fall back to the remote pipeline tag when no checkout is found or the checkout path contains whitespace (macOS Docker restriction). A dirty local checkout is rejected by default; --allow-dirty-pipeline is an explicit development-only opt-in that is recorded in provenance. Use --require-local-pipeline when fallback to the remote pipeline would be unacceptable.
Samplesheet validation (samplesheet_builder.py): Parse the CSV, resolve FASTQ paths relative to the CSV parent directory, normalize sample-name whitespace to underscores, verify readability and FASTQ extensions, reject FASTQ basenames with whitespace, enforce consistent expected_cells (≥1) and seq_center for repeated sample rows, reject exact duplicate FASTQ rows, and write a normalized copy with absolute POSIX paths to reproducibility/samplesheet.valid.csv.
Preflight (preflight.py): Verify Java (≥17) and Nextflow (≥25.04.0). Compare version tuples after zero-padding to 3 elements (avoids false negatives such as (24, 4) < (24, 4, 0)). For docker, run docker info and gate on exit code. For conda/mamba, locate the binary. Cell Ranger presets are rejected with conda/mamba unless --allow-conda-cellranger is supplied with a trusted site config. For singularity/apptainer, accept either binary interchangeably. For wave and gpu, no binary check is needed (Nextflow-native features). Safe institutional profile components are accepted for HPC/site profiles, every -c/--config file must exist before execution, and configs are treated as trusted Groovy code. All preflight subprocess calls have a 60-second timeout (_SUBPROCESS_TIMEOUT in preflight.py); the git probes in pipeline_source.py use a 10-second timeout.
Params construction (params_builder.py): Translate the preset + CLI flags into a params.yaml consumed by Nextflow via -params-file. All file paths use .as_posix() for forward-slash consistency across platforms. igenomes_ignore is automatically set to true whenever an explicit genome reference (fasta, gtf, transcript_fasta, txp2gene, or any prebuilt index) is provided — auxiliary files (barcode whitelist, CMO/probe/feature sets, primers, multi-barcode samplesheets) never trigger it, so they remain compatible with --genome (suppresses nf-schema DNS validation of the default iGenomes S3 URL). Skip flags are only written when true, keeping params.yaml minimal. In --demo mode no reference/protocol params are written at all (the test profile owns them).
Command build + execution (command_builder.py, executor.py): Construct the nextflow run command with -params-file, validated -c/--config files, and a work directory that defaults to <output>/upstream/work but may be overridden by --work-dir (including object-store URIs for cloud executors), then launch via subprocess.Popen with stdout and stderr piped to log files on disk — never buffered in RAM. On TimeoutExpired, the process is killed and EXECUTION_FAILED is raised. On KeyboardInterrupt, the child process tree is terminated before the interrupt is re-raised.
Output parsing (outputs_parser.py): Scan the upstream results tree for MultiQC HTML, pipeline_info/, aligner output directories, .h5ad (CellBender/filtered preferred over generic combined/raw), .rds, CellBender-derived files, and an official_outputs manifest for documented nf-core output families. Required outputs are validated before success artifacts are written; handoff_available is set to true only when a preferred_h5ad is confirmed on disk.
Provenance + reporting (provenance.py, reporting.py): Write JSON provenance bundles, a SHA-256 checksum manifest (files only — never directories), environment.yml, a portable commands.sh, report.md, and result.json.
| Preset | Aligner | Use case |
|---|---|---|
standard | simpleaf (alevin-fry) | Default for 10x GEX; fast, memory-efficient |
star | STARsolo | Best FASTQ QC metrics; supports RNA velocity (--star-feature "Gene Velocyto") |
kallisto | kb-python / BUStools | Pseudo-alignment; fastest; lamanno/nac RNA velocity via --kb-workflow |
cellranger | CellRanger | CellRanger v2/v3 compatibility; CellRanger is provided by the nf-core container under docker/singularity (no host binary needed). Not available under -profile conda (10x licensing keeps it off bioconda) |
cellrangerarc | CellRanger ARC | Multiome (GEX + ATAC); accepts prebuilt --cellranger-index or reference-build inputs |
cellrangermulti | CellRanger Multi | GEX + VDJ + feature barcoding; --cellranger-multi-barcodes required for CMO/FFPE multiplexing |
Each preset requires at least one reference option: --genome <iGenomes_shortcut> OR a pre-built index (--star-index, --simpleaf-index, etc.) OR --fasta + --gtf. The standard/simpleaf preset additionally accepts a transcriptome pair --transcript-fasta + --txp2gene in place of a genome reference (per the nf-core/scrnaseq Simpleaf options).
This wrapper targets nf-core/scrnaseq 4.1.0. It is not a free-form passthrough. Parameters are grouped as:
--preset, --check, --run-downstream, --skip-downstream, --expected-cells, --timeout-hours, --work-dir, --allow-remote-inputs, --allow-dirty-pipeline, --require-local-pipeline, --allow-pipeline-version-override, --trust-config-params, --allow-conda-cellranger, and -c/--config/--nextflow-config; these are ClawBio conveniences and are not nf-core parameters.skip_emptydrops is accepted only as --skip-emptydrops and translated to skip_cellbender: true; the deprecated upstream parameter is never written.custom_config_version, custom_config_base, config_profile_name, config_profile_description, config_profile_contact, config_profile_url, version, plaintext_email, max_multiqc_email_size, hook_url, validate_params, pipelines_testdata_base_path, help, help_full, show_hidden.Unsupported parameters are either hidden/institutional metadata, interactive help/version flags, or options that would weaken the wrapper's fixed validation/reproducibility policy.
Local-first by default. Samplesheet FASTQs and reference/index inputs must be local paths unless you explicitly opt in. Remote URIs (s3://, gs://, https://, ftp://, …) are rejected at preflight with REMOTE_INPUT_NOT_ALLOWED, so genetic data and references stay on the local machine and no accidental cloud fetch happens. This guarantee is enforced by the code, not just advertised (preflight._check_remote_inputs).
Opt-in for remote inputs. Pass --allow-remote-inputs to permit remote samplesheet inputs and reference paths (parity with nfcore-sarek-wrapper / nfcore-rnaseq-wrapper, which share the same flag). When enabled, remote URIs are passed through verbatim (Nextflow resolves and stages them; only the FASTQ/FASTA basename is validated) and preflight emits a runtime WARNING listing every path that will be fetched over the network, so cloud access is always visible. The object-store --work-dir is a separate setting and is not gated.
Local paths are still validated eagerly at preflight so they fail fast with a clear error instead of a late Nextflow error:
--fasta, --gtf, --star-index, …) that does not exist raises MISSING_REFERENCE (preflight.py).MISSING_FASTQ (samplesheet_builder.py).Readability is never pre-checked: Nextflow reads inputs in the true execution context (often a root container under the default Docker profile), so a launcher-side os.access(R_OK) probe would false-block valid runs (errors.py).
# Standard real-data usage (explicit protocol and reference are required)
python skills/nfcore-scrnaseq-wrapper/nfcore_scrnaseq_wrapper.py \
--input samplesheet.csv --output ./scrnaseq_run \
--preset star --protocol 10XV3 --genome GRCh38
# Preflight check only (no Nextflow execution)
python skills/nfcore-scrnaseq-wrapper/nfcore_scrnaseq_wrapper.py \
--input samplesheet.csv --output ./scrnaseq_run --check \
--preset star --protocol 10XV3 --genome GRCh38
# Demo mode (runs the upstream nf-core test profile; forces star preset; uses the
# selected backend — default --profile docker, which must be running)
python skills/nfcore-scrnaseq-wrapper/nfcore_scrnaseq_wrapper.py \
--demo --output ./scrnaseq_demo
# Via ClawBio runner
python clawbio.py run scrnaseq-pipeline --input samplesheet.csv --output ./scrnaseq_run \
--preset star --protocol 10XV3 --genome GRCh38
python clawbio.py run scrnaseq-pipeline --demo --output ./scrnaseq_demo
# STARsolo with local FASTA+GTF (STAR index built by the pipeline)
python skills/nfcore-scrnaseq-wrapper/nfcore_scrnaseq_wrapper.py \
--input samplesheet.csv --output ./run --preset star --protocol 10XV3 \
--fasta /refs/hg38.fa --gtf /refs/hg38.gtf
# STARsolo with prebuilt STAR index
python skills/nfcore-scrnaseq-wrapper/nfcore_scrnaseq_wrapper.py \
--input samplesheet.csv --output ./run --preset star --protocol 10XV3 \
--star-index /refs/star_index
# STARsolo RNA velocity (star requires an explicit --protocol like every star/standard/kallisto run)
python skills/nfcore-scrnaseq-wrapper/nfcore_scrnaseq_wrapper.py \
--input samplesheet.csv --output ./run --preset star --protocol 10XV3 \
--star-feature "Gene Velocyto" --star-ignore-sjdbgtf \
--fasta /refs/hg38.fa --gtf /refs/hg38.gtf
# Simpleaf (standard) with UMI resolution override
python skills/nfcore-scrnaseq-wrapper/nfcore_scrnaseq_wrapper.py \
--input samplesheet.csv --output ./run --preset standard --protocol 10XV3 \
--simpleaf-umi-resolution cr-like-em --genome GRCh38
# Kallisto RNA velocity (NAC workflow)
python skills/nfcore-scrnaseq-wrapper/nfcore_scrnaseq_wrapper.py \
--input samplesheet.csv --output ./run --preset kallisto --protocol 10XV3 \
--kb-workflow nac --fasta /refs/hg38.fa --gtf /refs/hg38.gtf
# Air-gapped cluster: local iGenomes mirror
python skills/nfcore-scrnaseq-wrapper/nfcore_scrnaseq_wrapper.py \
--input samplesheet.csv --output ./run --preset star --protocol 10XV3 \
--genome GRCh38 --igenomes-base /mnt/local_igenomes
# CellRanger Multi (CMO multiplexing)
python skills/nfcore-scrnaseq-wrapper/nfcore_scrnaseq_wrapper.py \
--input samplesheet.csv --output ./run --preset cellrangermulti \
--cellranger-index /refs/refdata-gex-GRCh38 \
--gex-cmo-set /refs/cmo_set.csv \
--cellranger-multi-barcodes /refs/multi_barcodes.csv
| Flag | Type | Default | Description |
|---|---|---|---|
--input | path | — | Samplesheet CSV (sample,fastq_1,fastq_2). Required unless --demo |
--output | path | — | Output directory for results and the reproducibility bundle (required) |
--demo | flag | — | Run the upstream nf-core test profile; forces --preset star and skip_cellbender |
--check | flag | — | Run preflight only and exit (no Nextflow execution) |
--preset | string | standard | Aligner preset |
--aligner | string | — | nf-core/scrnaseq aligner alias for --preset: simpleaf maps to standard; the other values map to same-named presets |
--profile | string | docker | Execution backend or comma-separated nf-core profile list: docker, conda, mamba, singularity, apptainer, podman, shifter, charliecloud, wave, gpu, debug, arm64, emulate_amd64, test, test_full, test_cellrangermulti, test_multiome |
--pipeline-version | string | 4.1.0 | Remote nf-core/scrnaseq tag or commit (used when no local sibling checkout is found) |
--allow-dirty-pipeline | flag | — | Development-only opt-in to run a modified local sibling scrnaseq/ checkout; rejected by default for production reproducibility |
--require-local-pipeline | flag | — | Require a verifiable local sibling scrnaseq/ checkout; fail instead of falling back to the remote pipeline |
--allow-pipeline-version-override | flag | — | Allow a --pipeline-version other than the pinned 4.1.0 contract (warned, recorded in provenance; validations stay 4.1.0) |
--trust-config-params | flag | — | Allow -c/--config files that set params.* (otherwise blocked); detected overrides are recorded in provenance |
-c / --config | path | — | Additional Nextflow config file. May be repeated; files are validated, applied to the live run, copied into the reproducibility bundle, and replayed by commands.sh |
--protocol | string | None | Chemistry/protocol forwarded to the aligner. Required for standard, star, and kallisto; omitted only preserves CellRanger auto-detection for cellranger, cellrangerarc, and cellrangermulti. Explicit auto is invalid for standard, star, and kallisto. smartseq is valid for star and kallisto only. Other protocol strings are mapped when known or passed through by nf-core |
--genome | string | — | iGenomes shortcut (GRCh38, mm10, etc.) — mutually exclusive with --fasta/--gtf and all index flags |
--igenomes-base | string | — | Base URL or local path for iGenomes (default s3://ngi-igenomes/igenomes/). Use for local mirrors or air-gapped clusters. A local base path is existence-checked in preflight when --genome is set (remote s3:///https:// bases are deferred to Nextflow) |
--igenomes-ignore | flag | — | Do not load the iGenomes reference config. Set automatically whenever an explicit genome reference is supplied; only needed manually in unusual setups |
--fasta | path | — | Genome FASTA (.fa, .fna, .fasta, .gz variants; no whitespace in path) |
--gtf | path | — | Gene annotation GTF |
--star-index | path | — | Prebuilt STAR genome index directory |
--simpleaf-index | path | — | Prebuilt simpleaf/alevin-fry index |
--kallisto-index | path | — | Prebuilt kallisto index |
--cellranger-index | path | — | Prebuilt CellRanger or CellRanger ARC reference |
--transcript-fasta | path | — | Transcriptome FASTA for simpleaf |
--txp2gene | path | — | Transcript-to-gene mapping for simpleaf |
--barcode-whitelist | path | — | Custom barcode whitelist (per-aligner format) |
--star-feature | enum | — | STARsolo feature type: Gene, GeneFull, Gene Velocyto |
--star-ignore-sjdbgtf | flag | — | Do not use GTF for SJDB construction (required for Gene Velocyto) |
--seq-center | string | — | Sequencing center name for BAM read group tag |
--simpleaf-umi-resolution | enum | — | UMI resolution strategy for alevin-fry: cr-like, cr-like-em, parsimony, parsimony-em, parsimony-gene, parsimony-gene-em |
--kb-workflow | enum | — | Kallisto workflow: standard, lamanno, nac |
--kb-t1c | path | — | cDNA transcripts-to-capture file for RNA velocity (lamanno/nac). Required only with a prebuilt --kallisto-index; auto-generated from --fasta/--gtf |
--kb-t2c | path | — | Intron transcripts-to-capture file for RNA velocity (lamanno/nac). Required only with a prebuilt --kallisto-index; auto-generated from --fasta/--gtf |
--skip-cellbender | flag | — | Disable the CellBender ambient RNA removal subworkflow |
--skip-emptydrops | flag | — | Deprecated compatibility alias for --skip-cellbender; the wrapper writes skip_cellbender: true and never writes deprecated upstream skip_emptydrops |
--skip-fastqc | flag | — | Skip FastQC quality control |
--skip-multiqc | flag | — | Skip MultiQC report generation |
--skip-cellranger-renaming | flag | — | Skip automatic sample renaming in CellRanger modules |
--skip-cellrangermulti-vdjref | flag | — | Skip mkvdjref in cellrangermulti (when VDJ data is absent or a prebuilt --cellranger-vdj-index is supplied) |
--save-reference | flag | — | Save the built reference index for future reuse |
--save-align-intermeds / --no-save-align-intermeds | flag | — | Forward save_align_intermeds: true/false; when neither is given the upstream nf-core/scrnaseq 4.1.0 default (true) is preserved, so intermediate BAMs are published by default — pass --no-save-align-intermeds on large runs to save disk |
--expected-cells | int | — | Override expected cell count for a single-sample samplesheet; multi-sample runs must set expected_cells per row. The wrapper enforces ≥1 (stricter than the upstream integer schema, which has no minimum) since a non-positive count is meaningless |
--timeout-hours | float | 12 | Wall-clock cap for the Nextflow run, in hours. Use 0 to disable the cap for long HPC/cloud runs whose walltime is enforced by the scheduler. Via the ClawBio runner the runner's own timeout also applies |
--work-dir | string | <output>/upstream/work | Nextflow work directory. Local paths are resolved before execution; object-store URIs such as s3://... or gs://... are passed through for cloud executors |
--allow-conda-cellranger | flag | — | Allow Cell Ranger presets with conda/mamba only when a trusted site config supplies Cell Ranger |
--email | string | — | Email address for pipeline completion notification |
--email-on-fail | string | — | Email address for pipeline failure notification |
--multiqc-title | string | — | Custom title for the MultiQC report |
--multiqc-config | path | — | Custom MultiQC config YAML |
--multiqc-logo | path | — | Custom MultiQC logo image |
--multiqc-methods-description | path | — | Custom MultiQC methods-description YAML |
--publish-dir-mode | enum | — | Forwarded nf-core publish mode: symlink, rellink, link, copy, copyNoFollow, or move |
--trace-report-suffix | string | — | Suffix for Nextflow trace/report/timeline filenames |
--monochrome-logs | flag | — | Disable ANSI colors in nf-core logs |
--resume | flag | — | Nextflow resume (checksum-verified against prior manifest; preset/profile/source/work-dir must match) |
--run-downstream | flag | — | Opt in to scrna_orchestrator handoff after pipeline completion |
--skip-downstream | flag | — | Force-skip the downstream handoff even if --run-downstream is given (handoff is off by default) |
--cellrangerarc-config | path | — | Config JSON for CellRanger ARC index construction |
--cellrangerarc-reference | string | — | Reference genome name used inside the CellRanger ARC config |
--motifs | path | — | Motif file (e.g. JASPAR) for CellRanger ARC |
--cellranger-vdj-index | path | — | Prebuilt CellRanger VDJ reference |
--gex-frna-probe-set | path | — | Probe set CSV for FFPE fixed RNA profiling (cellrangermulti) |
--gex-target-panel | path | — | Target panel CSV for targeted GEX (cellrangermulti) |
--gex-cmo-set | path | — | CMO reference CSV for multiplexed samples (cellrangermulti) |
--gex-barcode-sample-assignment | path | — | Barcode-to-sample assignment override CSV (cellrangermulti). Not an OCM selector — OCM mode is encoded via the ocm_ids column of --cellranger-multi-barcodes |
--fb-reference | path | — | Feature-barcode reference CSV for antibody capture (cellrangermulti) |
--vdj-inner-enrichment-primers | path | — | V(D)J cDNA enrichment primer sequences (cellrangermulti) |
--cellranger-multi-barcodes | path | — | Multiplexed sample samplesheet for CMO/FFPE demultiplexing (cellrangermulti) |
output_directory/
├── report.md # Wrapper run summary
├── result.json # Structured result payload
├── check_result.json # Written only with --check (preflight-only mode); no upstream/ is produced
├── logs/
│ ├── stdout.txt # Nextflow stdout
│ └── stderr.txt # Nextflow stderr
├── upstream/
│ └── results/ # nf-core/scrnaseq output tree
│ ├── fastqc/ # Per-read FastQC reports
│ ├── multiqc/ # MultiQC HTML and data
│ │ └── multiqc_report.html
│ ├── pipeline_info/ # Execution report, timeline, trace, DAG
│ └── <aligner>/ # Aligner-specific outputs
│ └── mtx_conversions/ # AnnData (.h5ad), SCE (.rds), Seurat (.rds)
│ │ # Per nf-core/scrnaseq 4.1.0 conf/modules.config, the concatenated
│ │ # matrices sit at the TOP of mtx_conversions/ while each per-sample
│ │ # matrix is nested one level deeper under <sample>/ (MTX_TO_H5AD/
│ │ # CONCAT_H5AD/ANNDATA_BARCODES rewrite non-combined files to
│ │ # `${meta.id}/${filename}`). The wrapper scans BOTH depths and does
│ │ # not hard-code filenames: it ranks whatever .h5ad it finds to pick
│ │ # one preferred_h5ad (combined > per-sample; within a group
│ │ # cellbender_filter > filtered > plain > raw, matched on the filename
│ │ # suffix). combined_* filter-suffixed variants are version-dependent.
│ ├── combined_matrix.h5ad # documented concatenated matrix (top level)
│ ├── combined_filtered_matrix.h5ad # version-dependent: filtering ran
│ ├── combined_cellbender_filter_matrix.h5ad # version-dependent: CellBender ran → top preference
│ └── <sample>/ # per-sample matrices nested one level deeper
│ ├── <sample>_raw_matrix.h5ad
│ └── <sample>_filtered_matrix.h5ad # conditional: filtering ran
├── reproducibility/ # Reproducibility + provenance bundle (single directory)
│ ├── samplesheet.valid.csv # Normalized samplesheet (absolute POSIX paths); named samplesheet.demo.csv with --demo
│ ├── params.yaml # Effective Nextflow parameters
│ ├── nextflow_configs/ # Written only when -c/--config is used: copies of the supplied config files, replayed by commands.sh
│ ├── commands.sh # Portable replay script
│ ├── environment.yml # Conda environment spec (for reference)
│ ├── checksums.sha256 # SHA-256 for in-bundle artifacts only; `sha256sum -c` passes from output dir
│ ├── manifest.json # Run metadata: preset, profile, versions, checksums
│ ├── macos_docker.config # macOS+Docker workarounds (VirtioFS, ARM64, STAR FIFOs)
│ ├── remap_paths.py # Helper for replaying on a different machine
│ ├── compatibility_policy.json # Copied policy snapshot (resume/update rules)
│ ├── pinned_versions.json # Copied pinned-versions snapshot
│ ├── inputs.json # Samplesheet + fastq/reference paths and digests (reference_checksums)
│ ├── invocation.json # Timestamp, preset, profile, pipeline version
│ ├── preflight.json # Java/Nextflow/backend info
│ ├── upstream.json # Pipeline source resolution details
│ ├── outputs.json # Detected artifacts
│ ├── runtime.json # Execution timing
│ └── skill.json # Skill name and version
├── provenance/ # Written only with --run-downstream
│ └── handoff.json # Downstream orchestrator path + checksum + outcome
└── scrna_analysis/ # Written only with --run-downstream: scrna_orchestrator output
result.json (abbreviated):
{
"skill": "scrnaseq-pipeline",
"version": "0.1.0",
"summary": {
"preset": "star",
"aligner_effective": "star",
"pipeline_source_kind": "remote_repo",
"pipeline_version_or_commit": "4.1.0",
"profile": "docker",
"preferred_h5ad": "<output>/upstream/results/star/mtx_conversions/combined_filtered_matrix.h5ad",
"handoff_available": true,
"samples_detected": 2,
"cellbender_used": false
}
}
report.md closes with:
## Next Steps
- python clawbio.py run scrna --input <preferred_h5ad> --output <dir>
- python clawbio.py run scrna-embedding --input <preferred_h5ad> --output <dir>
error_code and a fix hint. Nextflow ≥25.04.0 is required.conda/mamba binary exists — it deliberately does not probe network connectivity (a flaky check that would also reject valid offline caches). With -profile conda, Nextflow resolves environments from bioconda/conda-forge at runtime, so an offline or proxied host can make a task fail mid-run with a confusing resolver error. Cell Ranger presets are rejected with conda/mamba by default because Cell Ranger is not distributed via bioconda; use --allow-conda-cellranger only when a trusted site config provides Cell Ranger. For fully offline/reproducible execution prefer docker/singularity (pinned containers), or pre-warm the conda package cache before running.--genome conflicts with any explicit genome-reference flag. Providing --genome alongside --fasta, --gtf, --transcript-fasta, --txp2gene, or any prebuilt index raises CONFLICTING_REFERENCES in preflight. Use either --genome <shortcut> or explicit genome flags — never both. Auxiliary files (barcode whitelist, CMO/probe/feature sets, primers, multi-barcode samplesheets) are not genome references and are compatible with --genome.igenomes_ignore is set automatically. Whenever an explicit genome reference (fasta, gtf, transcript_fasta, txp2gene, or any prebuilt index) is provided, the wrapper writes igenomes_ignore: true to suppress nf-schema DNS validation of the default iGenomes S3 URL. Auxiliary files never trigger it. You do not need to set this manually. Use --igenomes-base only for local iGenomes mirrors.standard, star, and kallisto require an explicit --protocol, because nf-core/scrnaseq 4.1.0 documents auto as CellRanger-only. Explicit auto is rejected for those presets. smartseq denotes Smart-seq3 and is accepted for star and kallisto only; smartseq2 is rejected for every preset because nf-core/scrnaseq 4.1.0 does not support Smart-seq2. cellranger accepts only auto and 10XV1-10XV4; cellrangerarc accepts only auto; cellrangermulti is samplesheet-driven. Unknown custom protocol strings are passed through only for standard, star, and kallisto, where nf-core documents custom values.save_align_intermeds defaults to ON in 4.1.0. Upstream nf-core/scrnaseq 4.1.0 sets save_align_intermeds: true by default, so aligner intermediate BAMs are published into the results tree unless you pass --no-save-align-intermeds. On real (non---demo) datasets these intermediates can be tens to hundreds of GB; disable them when you only need the count matrices.skip_emptydrops is deprecated in nf-core/scrnaseq 4.1.0. The wrapper accepts --skip-emptydrops only as a compatibility alias and emits skip_cellbender: true.--demo forces preset=star and skip_cellbender=true. The nf-core upstream test profile ships STAR-compatible data and explicitly disables CellBender (which does not work on small test datasets). If a different preset is requested with --demo, the wrapper warns and overrides it.--demo is fully hermetic — every pipeline flag is ignored. The test profile owns the entire pipeline configuration: input, references, protocol, and all QC/skip/tuning/save/reporting knobs. Because a -params-file value overrides profile config in Nextflow, the wrapper writes only the four forced essentials into params.yaml in demo mode — outdir, aligner (star), igenomes_ignore, and skip_cellbender — and nothing else. Any other flag you pass with --demo (e.g. --genome, --fasta, an index, --igenomes-base, --protocol, --star-feature, --skip-fastqc, --save-reference, --seq-center, --publish-dir-mode, …) is ignored and listed in a WARNING. Demo output validation likewise requires FastQC and MultiQC regardless of any --skip-* you pass, since those flags are not written. Drop --demo to run on your own inputs.--timeout-hours <n> or pass --timeout-hours 0 to disable the cap entirely (recommended on HPC/cloud where the scheduler enforces walltime). When the run targets an object-store work directory (--work-dir s3://…/gs://…) or an institutional/site profile while the cap is still active, preflight prints a WARNING reminding you to pass --timeout-hours 0, since the scheduler — not the wrapper — should bound such runs. When neither the cap fires nor the job is killed, behaviour is unchanged.pipeline_info/, the effective aligner directory, MultiQC unless --skip-multiqc, FastQC HTML/ZIP reports unless --skip-fastqc, and at least one .h5ad matrix. FastQC is a hard requirement for every aligner — including the Cell Ranger family (cellranger/cellrangerarc/cellrangermulti) — because nf-core/scrnaseq 4.1.0 runs FASTQC on the shared input-read channel (ch_fastq) before any aligner-specific branching and publishes fastqc/ for all of them (the output docs state "FastQC is applied to all aligners' input reads"). A missing fastqc/ tree on a non-skipped run is therefore a genuine failure, not a tolerated gap. All six presets run the pipeline's MTX_TO_H5AD conversion (CellRanger ARC/Multi reuse the CellRanger template), so a completed run with zero .h5ad is treated as a failure (EXPECTED_OUTPUTS_NOT_FOUND) rather than a silent success. A present-but-ambiguous selection (e.g. several per-sample matrices, no combined) does not fail the run; it is signalled by handoff_available = false, and --run-downstream prints a warning instead of silently returning.preferred_h5ad may be absent. If no combined matrix is produced and there are multiple per-sample files, handoff_available is false. Always check result.json before chaining to scrna-orchestrator or scrna-embedding.params.yaml; no custom --param flags can be injected. Validated -c/--config files are allowed for infrastructure/HPC configuration and are copied into the reproducibility bundle. Note this is a trust boundary, not a sandbox: a Nextflow config is executable Groovy (it can set process.beforeScript, process.shell, etc.), so a -c file is as trusted as code you run yourself. The wrapper validates that each config exists and lints it for params.* overrides (see the next gotcha), but otherwise does not sandbox its contents — only pass configs you authored or trust.<output>/upstream/work. This preserves local resume and bundle portability for workstation/HPC runs. Managed cloud-batch executors that need an object-store work directory can pass --work-dir s3://... or --work-dir gs://...; the value is recorded in provenance and replayed by commands.sh. Local custom paths are resolved before execution and are less portable than the default.outdir) is intentionally the local <output>/upstream/results. This is a wrapper policy, not an nf-core limitation: the wrapper parses the results tree to detect the preferred_h5ad, MultiQC, and pipeline_info/, and to hand off downstream, so results must land on the local filesystem. This is still compatible with cloud-batch executors — the work directory may be remote (--work-dir s3://...) while Nextflow copies the published results back to the local outdir. There is no --outdir override to an object store, because the wrapper could not then parse outputs or chain downstream; for a remote-published run, invoke the upstream pipeline directly.--pipeline-version is pinned to the 4.1.0 contract. The wrapper's parameter set, protocol matrix and output validation are written for nf-core/scrnaseq 4.1.0, so a different --pipeline-version is rejected unless you pass --allow-pipeline-version-override (a warned, provenance-recorded opt-in; validations stay 4.1.0).-c/--config files may not set params.*. nf-core advises against setting parameters via -c. The wrapper blocks any config that assigns parameters outside the audited params.yaml — dotted (params.aligner = …), bracket (params['aligner'] = …), whole-map (params = […]), or a params { … } block (including the { on the next line). Pass --trust-config-params to allow it (detected overrides are recorded in provenance). Note the config is still trusted Groovy and is not sandboxed; this lint only guarantees params.yaml stays the single parameter source.--resume enforces strict compatibility. The wrapper checks that the stored manifest matches the current preset, profile, pipeline source, effective params.yaml checksum, and Nextflow work directory. Mismatches raise INVALID_RESUME_STATE.--star-feature "Gene Velocyto" requires --star-ignore-sjdbgtf. For Kallisto --kb-workflow lamanno/nac: when the index is built from --fasta/--gtf, kb ref generates the capture files, so --kb-t1c/--kb-t2c are not required (matching the documented fasta+gtf example); they are required only alongside a prebuilt --kallisto-index, and supplying exactly one of the two is always rejected. Invalid combinations raise INVALID_PRESET_CONFIGURATION before Nextflow starts.cellrangerarc config and reference are paired. Providing --cellrangerarc-config without --cellrangerarc-reference (or vice versa) raises INVALID_PRESET_CONFIGURATION. --motifs is optional and independent.cellrangermulti validates documented preflight constraints. feature_type=ab or feature_type=crispr requires --fb-reference (in nf-core/scrnaseq 4.1.0 both Antibody Capture and CRISPR Guide Capture share the single fb_reference feature reference — there is no separate CRISPR reference param, so a crispr run without it has no feature reference and fails inside Cell Ranger); feature_type=vdj with --skip-cellrangermulti-vdjref requires --cellranger-vdj-index; --gex-frna-probe-set (FFPE) requires --cellranger-multi-barcodes; feature_type=cmo (CMO multiplexing) also requires --cellranger-multi-barcodes. The three multiplexing modes nf-core documents as mutually exclusive are encoded per multiplexed sample in the --cellranger-multi-barcodes samplesheet via the probe_barcode_ids (FFPE), cmo_ids (CMO) and ocm_ids (OCM) columns. The wrapper parses that samplesheet and rejects any physical sample that populates more than one of those columns (INVALID_PRESET_CONFIGURATION, with conflicting_modes). It additionally rejects the run-level FFPE↔CMO flag combination (--gex-frna-probe-set with feature_type=cmo/--gex-cmo-set). Note --gex-barcode-sample-assignment is a cell/tag-calling override, not an OCM selector, so it is never treated as a mode. cellrangermulti FASTQ filenames are not validated by the wrapper (unlike cellranger/cellrangerarc): Cell Ranger Multi maps libraries through its own multi samplesheet and [libraries] config, so the per-file 10x naming is left to Cell Ranger to validate.^\S+\.fn?a(sta)?(\.gz)?$ (the nf-core/scrnaseq 4.1.0 schema). Paths with whitespace or non-standard extensions are rejected in preflight.../scrnaseq relative to the ClawBio repo root. If the checkout path contains whitespace (common on macOS), the wrapper warns and falls back to the remote pipeline. Ensure ClawBio/ and scrnaseq/ share the same parent folder.--pipeline-version or its commit equals --pipeline-version. Git-less or unverifiable local directories fall back to the remote pinned pipeline.macos_docker.config is written to the reproducibility bundle and passed to Nextflow. It sets stageInMode = "copy" (avoids VirtioFS EDEADLK), --platform linux/amd64 (Rosetta emulation), and routes STAR _STARtmp to the container's /tmp (avoids VirtioFS FIFO limitation). These three workarounds apply to every macOS+Docker run. A resourceLimits block (cpus 4, memory 15.GB) is written only for --demo runs, with time raised to 4.h (the upstream conf/test.config caps tasks at 1.h, too short for STAR genome generation under Apple-Silicon emulation). These ceilings are test-machine sized and are never applied to real datasets (a human STAR index needs far more than 15 GB), so production runs use the host's real resources. One consequence: a --demo run is capped at 4.h on macOS+Docker but at the upstream 1.h on Linux. Output directories under /tmp emit a WARNING — use a path under HOME.REMOTE_INPUT_NOT_ALLOWED) unless --allow-remote-inputs is explicitly passed, which also logs a runtime warning naming every path fetched over the network.The bundle guarantees configuration- and provenance-level reproducibility, not bit-for-bit identical scientific results. Be explicit about the boundary:
Guaranteed (captured in the bundle):
-r <version> (or a git-verified local checkout commit).export NXF_VER=<version> in commands.sh.params.yaml + SHA-256; --resume rejects a mismatched params checksum.checksums.sha256 (relative labels — sha256sum -c passes after the bundle is copied to any folder), reference-file digests in inputs.json, and a portable self-anchoring commands.sh + remap_paths.py.NOT guaranteed (outside the wrapper's control):
.h5ad.conda/mamba is not offline-reproducible — environments resolve from channels at run time (prefer docker/singularity).--genome pulls from iGenomes S3; local refs are checksummed for provenance but not copied into the bundle.To approach bit-level reproduction: use docker/singularity (not conda), pin container digests, fix tool seeds where the pipeline allows, and archive the exact reference files alongside the bundle.
The agent dispatches and explains; this skill executes.
Agent: Interpret the user's preprocessing intent, choose the preset, and verify that handoff_available is true in result.json before routing to downstream skills.
Skill: Validate environment and inputs, run the pipeline with controlled parameters, write all provenance and reproducibility artifacts, and report the detected preferred_h5ad.
| Skill | When to chain |
|---|---|
scrna-orchestrator | After a successful run, pass preferred_h5ad for clustering, QC, and markers |
scrna-embedding | Pass preferred_h5ad for scVI/scANVI batch integration and latent embeddings |
multiqc-reporter | Re-aggregate QC across multiple wrapper runs |
Review cadence: After each nf-core/scrnaseq major release. Check NEXTFLOW_MIN_VERSION (schemas.py), SUPPORTED_PRESETS, SUPPORTED_PROFILES, and this SKILL.md for accuracy.
Staleness signals:
NEXTFLOW_MIN_VERSION in schemas.py and reproducibility/pinned_versions.json.PRESET_ALIGNERS → add to schemas.py and update tests.GENOME_REFERENCE_FIELDS or AUXILIARY_PATH_FIELDS in schemas.py (the single source both params_builder and preflight consume). Never re-introduce per-module copies.STAR_ALIGN_BASE_EXT_ARGS in nfcore_4_1_0_contract.py drifts from conf/modules.config → test_pinned_star_args_match_sibling_checkout_if_present fails when a sibling checkout is present; update the constant to match.stageInMode = "copy") is only necessary while Apple Silicon runs Docker via QEMU. Remove build_macos_docker_config / write_macos_docker_config when a native arm64 Docker runtime eliminates VirtioFS deadlocks.Deprecation criteria: Deprecate if nf-core/scrnaseq releases a Python SDK with equivalent preflight, params, and provenance APIs.
npx claudepluginhub clawbio/clawbio --plugin clawbioRuns nf-core/rnaseq bulk RNA-seq preprocessing from FASTQ or BAM inputs with preflight checks, reproducibility outputs, and handoff to downstream differential expression skills.
Accesses ENCODE uniform analysis pipelines for ChIP-seq/ATAC-seq/etc., generates custom Nextflow/WDL workflows, manages CPU/GPU/memory resources, supports local/HPC/cloud (AWS/GCP) execution.
Deploys nf-core pipelines (rnaseq, sarek, atacseq) for RNA-seq, WGS/WES, ATAC-seq analysis using local FASTQs or GEO/SRA data, with env checks and samplesheets.