Newer
Older
**metagWGS** is a [Nextflow](https://www.nextflow.io/docs/latest/index.html#) bioinformatics analysis pipeline used for **metag**enomic **W**hole **G**enome **S**hotgun sequencing data (Illumina HiSeq3000 or NovaSeq, paired, 2\*150bp ; PacBio HiFi reads, single-end).
The workflow processes raw data from `.fastq/.fastq.gz` input and/or assemblies (contigs) `.fa/.fasta` and uses the modules represented in this figure:
metagWGS is split into different steps that correspond to different parts of the bioinformatics analysis:
* trims adapters sequences and deletes low quality reads ([Cutadapt](https://cutadapt.readthedocs.io/en/stable/#), [Sickle](https://github.com/najoshi/sickle))
* suppresses host contaminants ([BWA-MEM2](https://github.com/bwa-mem2/bwa-mem2) or [Minimap2](https://github.com/lh3/minimap2) + [Samtools](http://www.htslib.org/))
* controls the quality of raw and cleaned data ([FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
* makes a taxonomic classification of cleaned reads ([Kaiju MEM](https://github.com/bioinformatics-centre/kaiju) + [kronaTools](https://github.com/marbl/Krona/wiki/KronaTools) + [plot_kaiju_stat.py](bin/plot_kaiju_stat.py) + [merge_kaiju_results.py](bin/merge_kaiju_results.py))
* `S02_ASSEMBLY`
* assembles cleaned reads (combined with `S01_CLEAN_QC` step) or raw reads (combined with `--skip_clean` parameter) ([metaSPAdes](https://github.com/ablab/spades) or [Megahit](https://github.com/voutcn/megahit) or [Hifiasm_meta](https://github.com/lh3/hifiasm-meta) or [metaFlye](https://github.com/fenderglass/Flye))
* assesses the quality of assembly ([metaQUAST](http://quast.sourceforge.net/metaquast))
* deduplicates cleaned reads (combined with `S01_CLEAN_QC` step) or raw reads (combined with `--skip_clean` parameter) ([BWA-MEM2](https://github.com/bwa-mem2/bwa-mem2) or [Minimap2](https://github.com/lh3/minimap2) + [Samtools](http://www.htslib.org/))
* `S03_FILTERING`
* filters contigs with low CPM value ([Filter_contig_per_cpm.py](bin/Filter_contig_per_cpm.py) + [metaQUAST](http://quast.sourceforge.net/metaquast))
* makes a structural annotation of genes ([Prokka](https://github.com/tseemann/prokka) + [Rename_contigs_and_genes.py](bin/Rename_contigs_and_genes.py))
* aligns reads to the contigs ([BWA-MEM2](https://github.com/bwa-mem2/bwa-mem2) or [Minimap2](https://github.com/lh3/minimap2) + [Samtools](http://www.htslib.org/))
* aligns the protein sequence of genes against a protein database ([DIAMOND](https://github.com/bbuchfink/diamond))
* makes a sample and global clustering of genes ([cd-hit-est](http://weizhongli-lab.org/cd-hit/) + [cd_hit_produce_table_clstr.py](bin/cd_hit_produce_table_clstr.py))
* quantifies reads that align with the genes ([featureCounts](http://subread.sourceforge.net/) + [Quantification_clusters.py](bin/Quantification_clusters.py))
* makes a functional annotation of genes and a quantification of reads by function ([eggNOG-mapper](http://eggnog-mapper.embl.de/) + [merge_abundance_and_functional_annotations.py](bin/merge_abundance_and_functional_annotations.py) + [quantification_by_functional_annotation.py](bin/quantification_by_functional_annotation.py))
* `S07_TAXO_AFFI`
* taxonomically affiliates the genes ([Samtools](http://www.htslib.org/) + [aln2taxaffi.py](bin/aln2taxaffi.py))
* taxonomically affiliates the contigs ([Samtools](http://www.htslib.org/) + [aln2taxaffi.py](bin/aln2taxaffi.py))
* counts the number of reads and contigs, for each taxonomic affiliation, per taxonomic level ([Samtools](http://www.htslib.org/) + [merge_contig_quantif_perlineage.py](bin/merge_contig_quantif_perlineage.py) + [quantification_by_contig_lineage.py](bin/quantification_by_contig_lineage.py))

* aligns reads samples against assemblies (according to the strategy used) ([BWA-MEM2](https://github.com/bwa-mem2/bwa-mem2) or [Minimap2](https://github.com/lh3/minimap2))
* performs metagenome binning ([METABAT2](https://bitbucket.org/berkeleylab/metabat/src/master/) + [MAXBIN2](https://sourceforge.net/projects/maxbin/) + [CONCOCT](https://github.com/BinPro/CONCOCT))
* refines bin sets ([bin_refinement.sh](bin/bin_refinement.sh) adapt from [METAWRAP](https://github.com/bxlab/metaWRAP) bin_refinement)
* dereplicates bins between samples ([DREP](https://github.com/MrOlm/drep))
* taxonomically affiliates the bins ([GTDBTK](https://github.com/Ecogenomics/GTDBTk))
* calculates bins abundances between samples ([BWA-MEM2](https://github.com/bwa-mem2/bwa-mem2) or [Minimap2](https://github.com/lh3/minimap2) + [SAMTOOLS](http://www.htslib.org/))
All steps are launched one after another by default. Use `--stop_at_[STEP]` and `--skip_[STEP]` parameters to tweak execution to your will.
A report html file is generated at the end of the workflow with [MultiQC](https://multiqc.info/).
The pipeline is built using [Nextflow,](https://www.nextflow.io/docs/latest/index.html#) a bioinformatics workflow tool to run tasks across multiple compute infrastructures in a very portable manner.
Three [Singularity](https://sylabs.io/docs/) containers are available making installation trivial and results highly reproducible.
The metagWGS documentation can be found in the following pages:
* The pipeline installation procedure.
* An overview of how the pipeline works, how to run it and a description of all of the different command-line flags.
* An overview of the different output files and directories produced by the pipeline.
* A tutorial to learn how to launch the pipeline on a test dataset on [genologin cluster](http://bioinfo.genotoul.fr/).
* [Functional tests](functional_tests/README.md)
* (for developers) A tool to launch a new version of the pipeline on curated input data and compare its results with known output.
## Contact us
If you have any questions or suggestions for improvement, please contact us to claire.hoede[@]inrae.fr.
## Cite us
For the moment if you use metagWGS for your research, plese cite :
Joanna Fourquet, Céline Noirot, Christophe Klopp, Philippe Pinton, Sylvie Combes, et al.. Whole metagenome analysis with metagWGS. JOBIM2020, Jun 2020, Montpellier, France. ⟨hal-03176836⟩