# metagWGS: Documentation

## Introduction

**metagWGS** is a [Nextflow](https://www.nextflow.io/docs/latest/index.html#) bioinformatics analysis pipeline for **metag**enomic **W**hole **G**enome **S**hotgun sequencing data (Illumina HiSeq3000 or NovaSeq, paired-end, 2\*150 bp; PacBio HiFi reads, single-end).

### Pipeline graphical representation
The workflow processes raw data from `.fastq/.fastq.gz` input and/or assemblies (contigs) `.fa/.fasta` and uses the modules represented in this figure:

![](docs/source/images/metagwgs_metro_map.png)

### metagWGS steps

metagWGS is split into different steps that correspond to different parts of the bioinformatics analysis.
Many of these steps are optional and their necessity depends on the desired analysis.

* `S01_CLEAN_QC`
   * trims adapter sequences and removes low-quality reads ([Cutadapt](https://cutadapt.readthedocs.io/en/stable/#), [Sickle](https://github.com/najoshi/sickle))
   * removes host contaminants ([BWA-MEM2](https://github.com/bwa-mem2/bwa-mem2) or [Minimap2](https://github.com/lh3/minimap2) + [Samtools](http://www.htslib.org/))
   * controls the quality of raw and cleaned data ([FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
   * makes a taxonomic classification of cleaned reads ([Kaiju MEM](https://github.com/bioinformatics-centre/kaiju) + [kronaTools](https://github.com/marbl/Krona/wiki/KronaTools) + [plot_kaiju_stat.py](bin/plot_kaiju_stat.py) + [merge_kaiju_results.py](bin/merge_kaiju_results.py))
* `S02_ASSEMBLY` 
   * assembles reads ([metaSPAdes](https://github.com/ablab/spades) or [Megahit](https://github.com/voutcn/megahit) or [Hifiasm_meta](https://github.com/lh3/hifiasm-meta), [metaFlye](https://github.com/fenderglass/Flye))
   * assesses the quality of assembly ([metaQUAST](http://quast.sourceforge.net/metaquast))
   * deduplicates short reads and aligns them against contigs ([BWA-MEM2](https://github.com/bwa-mem2/bwa-mem2) + [Samtools](http://www.htslib.org/))
   * aligns HiFi reads against contigs ([Minimap2](https://github.com/lh3/minimap2) + [Samtools](http://www.htslib.org/))
* `S03_FILTERING` 
   * filters contigs with low CPM value ([filter_contig_per_cpm.py](bin/filter_contig_per_cpm.py) + [metaQUAST](http://quast.sourceforge.net/metaquast))
* `S04_STRUCTURAL_ANNOT` 
   * makes a structural annotation of genes ([Prodigal](https://github.com/hyattpd/Prodigal) + [Barrnap](https://github.com/tseemann/barrnap) + [tRNAscan-SE](https://github.com/UCSC-LoweLab/tRNAscan-SE) + [merge_annotations.py](bin/merge_annotations.py))
* `S05_PROTEIN_ALIGNMENT`
   * aligns the protein sequence of genes against a protein database ([DIAMOND](https://github.com/bbuchfink/diamond))
* `S06_FUNC_ANNOT` 
   * makes a sample and global clustering of proteins ([cd-hit](https://www.bioinformatics.org/cd-hit/) + [cd_hit_produce_table_clstr.py](bin/cd_hit_produce_table_clstr.py))
   * quantifies reads that align with the genes ([featureCounts](http://subread.sourceforge.net/) + [quantification_clusters.py](bin/quantification_clusters.py))
   * makes a functional annotation of genes and a quantification of reads by function ([eggNOG-mapper](http://eggnog-mapper.embl.de/) + [merge_abundance_and_functional_annotations.py](bin/merge_abundance_and_functional_annotations.py) + [quantification_by_functional_annotation.py](bin/quantification_by_functional_annotation.py))
* `S07_TAXO_AFFI` 
   * taxonomically affiliates the genes ([Samtools](http://www.htslib.org/) + [aln_to_tax_affi.py](bin/aln_to_tax_affi.py))
   * taxonomically affiliates the contigs ([Samtools](http://www.htslib.org/) + [aln_to_tax_affi.py](bin/aln_to_tax_affi.py))
   * counts the number of reads and contigs, for each taxonomic affiliation, per taxonomic level ([Samtools](http://www.htslib.org/) + [merge_contig_quantif_perlineage.py](bin/merge_contig_quantif_perlineage.py) + [quantification_by_contig_lineage.py](bin/quantification_by_contig_lineage.py))
* `S08_BINNING` 
![](docs/source/images/08_binning.png)
   * aligns sample reads against assemblies (according to the strategy used) ([BWA-MEM2](https://github.com/bwa-mem2/bwa-mem2) or [Minimap2](https://github.com/lh3/minimap2))
   * performs metagenome binning ([METABAT2](https://bitbucket.org/berkeleylab/metabat/src/master/) + [MAXBIN2](https://sourceforge.net/projects/maxbin/) + [CONCOCT](https://github.com/BinPro/CONCOCT))
   * refines bin sets ([BINETTE](https://github.com/genotoul-bioinfo/Binette)); with HiFi reads, circular contigs, if any, are used as a bin set
   * dereplicates bins between samples ([DREP](https://github.com/MrOlm/drep))
   * taxonomically affiliates the bins ([GTDBTK](https://github.com/Ecogenomics/GTDBTk))
   * calculates bin abundances across samples ([BWA-MEM2](https://github.com/bwa-mem2/bwa-mem2) or [Minimap2](https://github.com/lh3/minimap2) + [SAMTOOLS](http://www.htslib.org/))
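
To make the CPM filter of `S03_FILTERING` more concrete, here is a minimal Python sketch of the usual counts-per-million idea: each contig's mapped-read count is scaled by the total number of mapped reads, and contigs below a threshold are dropped. This is an illustration only, not the code of `filter_contig_per_cpm.py`; the function names, the input dictionary, and the threshold value are all assumptions made for this example.

```python
def contig_cpm(read_counts):
    """Return counts per million (CPM) for each contig.

    `read_counts` maps a contig name to its number of mapped reads
    (hypothetical input format, chosen for this sketch).
    """
    total = sum(read_counts.values())
    if total == 0:
        # No mapped reads at all: every contig gets a CPM of 0.
        return {name: 0.0 for name in read_counts}
    return {name: count * 1_000_000 / total for name, count in read_counts.items()}


def filter_contigs(read_counts, min_cpm):
    """Keep the names of contigs whose CPM is at least `min_cpm`."""
    cpm = contig_cpm(read_counts)
    return [name for name, value in cpm.items() if value >= min_cpm]


# Toy data: 1000 mapped reads spread over three contigs.
counts = {"contig_1": 990, "contig_2": 9, "contig_3": 1}
print(contig_cpm(counts))
# {'contig_1': 990000.0, 'contig_2': 9000.0, 'contig_3': 1000.0}
print(filter_contigs(counts, min_cpm=5000))
# ['contig_1', 'contig_2']
```

The threshold actually applied by the pipeline, and how it is set, are described in the usage documentation.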

By default, all steps are launched one after another. Use the `--stop_at_[STEP]` and `--skip_[STEP]` parameters to stop the workflow after a given step or to skip optional steps.
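
As a rough illustration of the `--stop_at_[STEP]` and `--skip_[STEP]` naming pattern, the following Python sketch assembles a command line from a list of step labels. The helper names, the step labels `STEP_A` and `STEP_B`, and the base command `nextflow run main.nf` are placeholders assumed for this example; the flags that actually exist are listed in the usage documentation.

```python
def step_flag(action, step):
    """Build a flag following the `--stop_at_[STEP]` / `--skip_[STEP]` pattern.

    `step` is inserted as given; the real step labels accepted by the
    pipeline are not defined here (placeholders only).
    """
    if action not in ("stop_at", "skip"):
        raise ValueError(f"unknown action: {action!r}")
    return f"--{action}_{step}"


def build_command(base, flags):
    """Join a base command (list of words) and a list of flags into one string."""
    return " ".join([*base, *flags])


# Placeholder step labels and base command, for illustration only.
flags = [step_flag("stop_at", "STEP_A"), step_flag("skip", "STEP_B")]
print(build_command(["nextflow", "run", "main.nf"], flags))
# nextflow run main.nf --stop_at_STEP_A --skip_STEP_B
```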

An HTML report is generated at the end of the workflow with [MultiQC](https://multiqc.info/).

The pipeline is built using [Nextflow](https://www.nextflow.io/docs/latest/index.html#), a bioinformatics workflow tool to run tasks across multiple compute infrastructures in a very portable manner.

Two [Singularity](https://sylabs.io/docs/) containers are available, which makes installation straightforward and results highly reproducible.

## Documentation

The metagWGS documentation can be found in the following pages:

   * [Installation](/docs/source/installation.md)
      * The pipeline installation procedure. You can also see this documentation [here](https://genotoul-bioinfo.pages.mia.inra.fr/metagwgs/master/installation.html).
   * [Usage](/docs/source/usage.md)
      * An overview of how the pipeline works, how to run it and a description of all of the different command-line flags. You can also see this documentation [here](https://genotoul-bioinfo.pages.mia.inra.fr/metagwgs/master/usage.html).
   * [Output](/docs/source/output.md)
      * An overview of the different output files and directories produced by the pipeline. You can also see this documentation [here](https://genotoul-bioinfo.pages.mia.inra.fr/metagwgs/master/output.html).
   * [Use case](/docs/source/use_case.md) (WARNING: currently out of date)
      * A tutorial to learn how to launch the pipeline on a test dataset on [genobioinfo cluster](http://bioinfo.genotoul.fr/).
   * [Functional tests](/docs/source/functionnal_tests.md)
      * (for developers) A tool to launch a new version of the pipeline on curated input data and compare its results with known output.

The complete metagWGS documentation is available at https://genotoul-bioinfo.pages.mia.inra.fr/metagwgs/master.

## Contact us

If you have any questions or suggestions for improvement, please contact us at claire.hoede[@]inrae.fr.

## Cite us

For now, if you use metagWGS in your research, please cite:
Joanna Fourquet, Jean Mainguy, Maïna Vienne, Céline Noirot, Pierre Martin, et al. metagWGS: a workflow to analyse short and long HiFi metagenomic reads Taxonomic profile HiFi vs Short reads assembly. JOBIM 2022, Jul 2022, Rennes, France. ⟨10.15454/1.5572369328961167E12⟩. ⟨hal-03771202⟩