**metagWGS** is a [Nextflow](https://www.nextflow.io/docs/latest/index.html#) bioinformatics analysis pipeline for **metag**enomic **W**hole **G**enome **S**hotgun sequencing data (Illumina HiSeq3000 or NovaSeq, paired-end, 2\*150 bp; PacBio HiFi reads, single-end).
The workflow processes raw data from `.fastq/.fastq.gz` input and/or assemblies (contigs) `.fa/.fasta` and uses the modules represented in this figure:

metagWGS is split into different steps that correspond to different parts of the bioinformatics analysis.
Many of these steps are optional and their necessity depends on the desired analysis.
* trims adapter sequences and deletes low quality reads ([Cutadapt](https://cutadapt.readthedocs.io/en/stable/#), [Sickle](https://github.com/najoshi/sickle))
* suppresses host contaminants ([BWA-MEM2](https://github.com/bwa-mem2/bwa-mem2) or [Minimap2](https://github.com/lh3/minimap2) + [Samtools](http://www.htslib.org/))
* controls the quality of raw and cleaned data ([FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
* makes a taxonomic classification of cleaned reads ([Kaiju MEM](https://github.com/bioinformatics-centre/kaiju) + [kronaTools](https://github.com/marbl/Krona/wiki/KronaTools) + [plot_kaiju_stat.py](bin/plot_kaiju_stat.py) + [merge_kaiju_results.py](bin/merge_kaiju_results.py))
* assembles reads ([metaSPAdes](https://github.com/ablab/spades) or [Megahit](https://github.com/voutcn/megahit) or [Hifiasm_meta](https://github.com/lh3/hifiasm-meta), [metaFlye](https://github.com/fenderglass/Flye))
* assesses the quality of assembly ([metaQUAST](http://quast.sourceforge.net/metaquast))
* deduplicates short reads and aligns them against contigs ([BWA-MEM2](https://github.com/bwa-mem2/bwa-mem2) + [Samtools](http://www.htslib.org/))
* aligns HiFi reads against contigs ([Minimap2](https://github.com/lh3/minimap2) + [Samtools](http://www.htslib.org/))
* filters contigs with low CPM value ([filter_contig_per_cpm.py](bin/filter_contig_per_cpm.py) + [metaQUAST](http://quast.sourceforge.net/metaquast))
* makes a structural annotation of genes ([Prodigal](https://github.com/hyattpd/Prodigal) + [Barrnap](https://github.com/tseemann/barrnap) + [tRNAscan-SE](https://github.com/UCSC-LoweLab/tRNAscan-SE) + [merge_annotations.py](bin/merge_annotations.py))
* `S05_PROTEIN_ALIGNMENT`
* aligns the protein sequence of genes against a protein database ([DIAMOND](https://github.com/bbuchfink/diamond))
* makes a sample and global clustering of proteins ([cd-hit](https://www.bioinformatics.org/cd-hit/) + [cd_hit_produce_table_clstr.py](bin/cd_hit_produce_table_clstr.py))
* quantifies reads that align with the genes ([featureCounts](http://subread.sourceforge.net/) + [quantification_clusters.py](bin/quantification_clusters.py))
* makes a functional annotation of genes and a quantification of reads by function ([eggNOG-mapper](http://eggnog-mapper.embl.de/) + [merge_abundance_and_functional_annotations.py](bin/merge_abundance_and_functional_annotations.py) + [quantification_by_functional_annotation.py](bin/quantification_by_functional_annotation.py))
* `S07_TAXO_AFFI`
* taxonomically affiliates the genes ([Samtools](http://www.htslib.org/) + [aln_to_tax_affi.py](bin/aln_to_tax_affi.py))
* taxonomically affiliates the contigs ([Samtools](http://www.htslib.org/) + [aln_to_tax_affi.py](bin/aln_to_tax_affi.py))
* counts the number of reads and contigs, for each taxonomic affiliation, per taxonomic level ([Samtools](http://www.htslib.org/) + [merge_contig_quantif_perlineage.py](bin/merge_contig_quantif_perlineage.py) + [quantification_by_contig_lineage.py](bin/quantification_by_contig_lineage.py))
* aligns sample reads against assemblies (according to the strategy used) ([BWA-MEM2](https://github.com/bwa-mem2/bwa-mem2) or [Minimap2](https://github.com/lh3/minimap2))
* performs metagenome binning ([METABAT2](https://bitbucket.org/berkeleylab/metabat/src/master/) + [MAXBIN2](https://sourceforge.net/projects/maxbin/) + [CONCOCT](https://github.com/BinPro/CONCOCT))
* refines bin sets ([BINETTE](https://github.com/genotoul-bioinfo/Binette)). With HiFi reads, any circular contigs are used as an additional bin set
* dereplicates bins between samples ([DREP](https://github.com/MrOlm/drep))
* taxonomically affiliates the bins ([GTDBTK](https://github.com/Ecogenomics/GTDBTk))
* calculates bin abundances across samples ([BWA-MEM2](https://github.com/bwa-mem2/bwa-mem2) or [Minimap2](https://github.com/lh3/minimap2) + [SAMTOOLS](http://www.htslib.org/))
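
The CPM-based contig filtering mentioned in the list above can be illustrated with a short Python sketch. It assumes that a contig's CPM (counts per million) is the number of reads mapped to that contig, scaled to one million mapped reads in the sample. The function names, the `min_cpm` threshold and the read counts are hypothetical and are not taken from `filter_contig_per_cpm.py`, whose exact behaviour may differ.

```python
def counts_per_million(read_counts):
    """Scale per-contig read counts to counts per million mapped reads."""
    total = sum(read_counts.values())
    if total == 0:
        # No mapped reads at all: every contig gets a CPM of zero.
        return {contig: 0.0 for contig in read_counts}
    return {
        contig: count * 1_000_000 / total
        for contig, count in read_counts.items()
    }


def filter_contigs_by_cpm(read_counts, min_cpm):
    """Return the contigs whose CPM is at least min_cpm, in input order."""
    cpm = counts_per_million(read_counts)
    return [contig for contig in read_counts if cpm[contig] >= min_cpm]


# Hypothetical read counts for three contigs of one sample.
counts = {"contig_1": 900, "contig_2": 99, "contig_3": 1}
kept = filter_contigs_by_cpm(counts, min_cpm=10_000)
print(kept)
```

In this toy sample, `contig_3` carries only 1 of 1000 mapped reads (1000 CPM), so it falls below the hypothetical threshold and is dropped.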
By default, all steps run one after another. Use the `--stop_at_[STEP]` and `--skip_[STEP]` parameters to stop the pipeline after a given step or to skip individual steps.
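
As a rough illustration of the `--stop_at_[STEP]` and `--skip_[STEP]` naming pattern, the following Python sketch assembles such flags from step names. The helper `build_step_flags` and the step names `STEP_A` and `STEP_B` are placeholders invented for this example; the step names the pipeline actually accepts are listed in the usage documentation.

```python
def build_step_flags(stop_at=None, skip=()):
    """Build command-line flags following the --stop_at_[STEP] / --skip_[STEP] pattern."""
    flags = []
    if stop_at is not None:
        flags.append(f"--stop_at_{stop_at}")
    # One --skip_ flag per step to skip, in the order given.
    flags.extend(f"--skip_{step}" for step in skip)
    return flags


# STEP_A and STEP_B are placeholders, not real metagWGS step names.
flags = build_step_flags(stop_at="STEP_A", skip=["STEP_B"])
print(" ".join(flags))
```

The resulting flags would be appended to the pipeline's `nextflow run` command line.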
An HTML report is generated at the end of the workflow with [MultiQC](https://multiqc.info/).
The pipeline is built using [Nextflow](https://www.nextflow.io/docs/latest/index.html#), a bioinformatics workflow tool to run tasks across multiple compute infrastructures in a very portable manner.
Two [Singularity](https://sylabs.io/docs/) containers are available making installation trivial and results highly reproducible.
The metagWGS documentation can be found in the following pages:
* [Installation](/docs/source/installation.md)
  * The pipeline installation procedure. You can also see this documentation [here](https://genotoul-bioinfo.pages.mia.inra.fr/metagwgs/master/installation.html).
* [Usage](/docs/source/usage.md)
  * An overview of how the pipeline works, how to run it, and a description of all the command-line flags. You can also see this documentation [here](https://genotoul-bioinfo.pages.mia.inra.fr/metagwgs/master/usage.html).
* [Output](/docs/source/output.md)
  * An overview of the different output files and directories produced by the pipeline. You can also see this documentation [here](https://genotoul-bioinfo.pages.mia.inra.fr/metagwgs/master/output.html).
* [Use case](/docs/source/use_case.md) (WARNING: not up to date, needs to be updated)
  * A tutorial on launching the pipeline on a test dataset on the [genobioinfo cluster](http://bioinfo.genotoul.fr/).
* [Functional tests](/docs/source/functionnal_tests.md)
  * (for developers) A tool to launch a new version of the pipeline on curated input data and compare its results with known output.
Comprehensive documentation for metagWGS is available here: https://genotoul-bioinfo.pages.mia.inra.fr/metagwgs/master .
## Contact us
If you have any questions or suggestions for improvement, please contact us at claire.hoede[@]inrae.fr.
## Cite us
For now, if you use metagWGS in your research, please cite:
Joanna Fourquet, Jean Mainguy, Maïna Vienne, Céline Noirot, Pierre Martin, et al.. metagWGS: a workflow to analyse short and long HiFi metagenomic reads Taxonomic profile HiFi vs Short reads assembly. JOBIM 2022, Jul 2022, Rennes, France. ⟨10.15454/1.5572369328961167E12⟩. ⟨hal-03771202⟩