
# metagWGS: Documentation

## Introduction

**metagWGS** is a [Nextflow](https://www.nextflow.io/docs/latest/index.html#) bioinformatics analysis pipeline for **metag**enomic **W**hole **G**enome **S**hotgun sequencing data (Illumina HiSeq3000 or NovaSeq, paired-end, 2\*150 bp; PacBio HiFi reads, single-end).

### Pipeline graphical representation
The workflow processes raw data from `.fastq/.fastq.gz` input and/or assemblies (contigs) `.fa/.fasta` and uses the modules represented in this figure:
![](docs/Pipeline.png)

### metagWGS steps

metagWGS is split into different steps that correspond to different parts of the bioinformatics analysis:

* `S01_CLEAN_QC`
   * trims adapter sequences and removes low-quality reads ([Cutadapt](https://cutadapt.readthedocs.io/en/stable/#), [Sickle](https://github.com/najoshi/sickle))
   * removes host contaminants ([BWA-MEM2](https://github.com/bwa-mem2/bwa-mem2) or [Minimap2](https://github.com/lh3/minimap2) + [Samtools](http://www.htslib.org/))
   * controls the quality of raw and cleaned data ([FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
   * makes a taxonomic classification of cleaned reads ([Kaiju MEM](https://github.com/bioinformatics-centre/kaiju) + [kronaTools](https://github.com/marbl/Krona/wiki/KronaTools) + [plot_kaiju_stat.py](bin/plot_kaiju_stat.py) + [merge_kaiju_results.py](bin/merge_kaiju_results.py))
* `S02_ASSEMBLY` 
   * assembles cleaned reads (combined with `S01_CLEAN_QC` step) or raw reads (combined with `--skip_clean` parameter) ([metaSPAdes](https://github.com/ablab/spades) or [Megahit](https://github.com/voutcn/megahit) or [Hifiasm_meta](https://github.com/lh3/hifiasm-meta) or [metaFlye](https://github.com/fenderglass/Flye))
   * assesses the quality of assembly ([metaQUAST](http://quast.sourceforge.net/metaquast))
   * deduplicates cleaned reads (combined with `S01_CLEAN_QC` step) or raw reads (combined with `--skip_clean` parameter) ([BWA-MEM2](https://github.com/bwa-mem2/bwa-mem2) or [Minimap2](https://github.com/lh3/minimap2)  + [Samtools](http://www.htslib.org/))
* `S03_FILTERING` 
   * filters contigs with low CPM value ([Filter_contig_per_cpm.py](bin/Filter_contig_per_cpm.py) + [metaQUAST](http://quast.sourceforge.net/metaquast))
* `S04_STRUCTURAL_ANNOT` 
   * makes a structural annotation of genes ([Prokka](https://github.com/tseemann/prokka) + [Rename_contigs_and_genes.py](bin/Rename_contigs_and_genes.py))
* `S05_ALIGNMENT`
   * aligns reads to the contigs ([BWA-MEM2](https://github.com/bwa-mem2/bwa-mem2) or [Minimap2](https://github.com/lh3/minimap2)  + [Samtools](http://www.htslib.org/))
   * aligns the protein sequence of genes against a protein database ([DIAMOND](https://github.com/bbuchfink/diamond))
* `S06_FUNC_ANNOT` 
   * makes a sample and global clustering of genes ([cd-hit-est](http://weizhongli-lab.org/cd-hit/) + [cd_hit_produce_table_clstr.py](bin/cd_hit_produce_table_clstr.py))
   * quantifies reads that align with the genes ([featureCounts](http://subread.sourceforge.net/) + [Quantification_clusters.py](bin/Quantification_clusters.py))
   * makes a functional annotation of genes and a quantification of reads by function ([eggNOG-mapper](http://eggnog-mapper.embl.de/) + [merge_abundance_and_functional_annotations.py](bin/merge_abundance_and_functional_annotations.py) + [quantification_by_functional_annotation.py](bin/quantification_by_functional_annotation.py))
* `S07_TAXO_AFFI` 
   * taxonomically affiliates the genes ([Samtools](http://www.htslib.org/) + [aln2taxaffi.py](bin/aln2taxaffi.py))
   * taxonomically affiliates the contigs ([Samtools](http://www.htslib.org/) + [aln2taxaffi.py](bin/aln2taxaffi.py))
   * counts the number of reads and contigs, for each taxonomic affiliation, per taxonomic level ([Samtools](http://www.htslib.org/) + [merge_contig_quantif_perlineage.py](bin/merge_contig_quantif_perlineage.py) + [quantification_by_contig_lineage.py](bin/quantification_by_contig_lineage.py))
* `S08_BINNING` 
![](docs/08_binning.png)
   * aligns sample reads against assemblies (according to the strategy used) ([BWA-MEM2](https://github.com/bwa-mem2/bwa-mem2) or [Minimap2](https://github.com/lh3/minimap2))
   * performs metagenome binning ([METABAT2](https://bitbucket.org/berkeleylab/metabat/src/master/) + [MAXBIN2](https://sourceforge.net/projects/maxbin/) + [CONCOCT](https://github.com/BinPro/CONCOCT))
   * refines bin sets ([bin_refinement.sh](bin/bin_refinement.sh), adapted from the [METAWRAP](https://github.com/bxlab/metaWRAP) bin_refinement module)
   * dereplicates bins between samples ([DREP](https://github.com/MrOlm/drep))
   * taxonomically affiliates the bins ([GTDBTK](https://github.com/Ecogenomics/GTDBTk))
   * calculates bin abundances across samples ([BWA-MEM2](https://github.com/bwa-mem2/bwa-mem2) or [Minimap2](https://github.com/lh3/minimap2) + [SAMTOOLS](http://www.htslib.org/))
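
The CPM filter in `S03_FILTERING` can be illustrated with a minimal Python sketch. It assumes that CPM (counts per million) is the number of reads mapped to a contig, divided by the total number of mapped reads, times one million. The function names, the example counts, the threshold value and the choice of `>=` for the comparison are all hypothetical; see [Filter_contig_per_cpm.py](bin/Filter_contig_per_cpm.py) for what the pipeline actually does.

```python
def contig_cpm(read_counts):
    """Return counts per million (CPM) for each contig.

    read_counts maps a contig name to its number of mapped reads.
    """
    total = sum(read_counts.values())
    if total == 0:
        # No mapped reads at all: every contig gets a CPM of zero.
        return {contig: 0.0 for contig in read_counts}
    return {
        contig: count * 1_000_000 / total
        for contig, count in read_counts.items()
    }


def filter_by_cpm(read_counts, min_cpm):
    """Keep contigs whose CPM is at least min_cpm (hypothetical threshold)."""
    cpm = contig_cpm(read_counts)
    return [contig for contig, value in cpm.items() if value >= min_cpm]


# Made-up counts, for illustration only.
counts = {"contig_1": 900, "contig_2": 99, "contig_3": 1}
kept = filter_by_cpm(counts, min_cpm=2000)
print(kept)
```

With these made-up counts, `contig_3` has a CPM of 1000 and falls below the threshold, so only the first two contigs are kept.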

By default, all steps are launched one after another. Use the `--stop_at_[STEP]` and `--skip_[STEP]` parameters to control which steps are executed.
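
As a rough mental model of how these two families of parameters combine, the Python sketch below lists the steps that would run for a given selection. It is an illustration, not the pipeline's own logic: the `planned_steps` function is hypothetical, it assumes that a `--stop_at_[STEP]` parameter includes the named step, and it ignores dependencies between steps and the question of which steps can actually be skipped. The exact parameter names are described in [Usage](docs/usage.md).

```python
# Step names as listed in this README, in execution order.
STEPS = [
    "S01_CLEAN_QC",
    "S02_ASSEMBLY",
    "S03_FILTERING",
    "S04_STRUCTURAL_ANNOT",
    "S05_ALIGNMENT",
    "S06_FUNC_ANNOT",
    "S07_TAXO_AFFI",
    "S08_BINNING",
]


def planned_steps(stop_at=None, skip=()):
    """Return the steps that would run, in order.

    stop_at: last step to run (assumed inclusive), or None to run to the end.
    skip: steps to leave out.
    """
    requested = list(skip) + ([stop_at] if stop_at else [])
    unknown = [step for step in requested if step not in STEPS]
    if unknown:
        raise ValueError(f"Unknown step(s): {unknown}")
    last = STEPS.index(stop_at) if stop_at else len(STEPS) - 1
    return [step for step in STEPS[: last + 1] if step not in skip]


# Stop after filtering and skip cleaning (cf. `--skip_clean`).
print(planned_steps(stop_at="S03_FILTERING", skip=["S01_CLEAN_QC"]))
```

In this model, stopping at `S03_FILTERING` while skipping `S01_CLEAN_QC` leaves only the assembly and filtering steps.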

An HTML report is generated at the end of the workflow with [MultiQC](https://multiqc.info/).

The pipeline is built using [Nextflow](https://www.nextflow.io/docs/latest/index.html#), a bioinformatics workflow tool that runs tasks across multiple compute infrastructures in a very portable manner.

Three [Singularity](https://sylabs.io/docs/) containers are available, making installation straightforward and results highly reproducible.

## Documentation

The metagWGS documentation can be found on the following pages:

   * [Installation](docs/installation.md)
      * The pipeline installation procedure.
   * [Usage](docs/usage.md)
      * An overview of how the pipeline works, how to run it and a description of all of the different command-line flags.
   * [Output](docs/output.md)
      * An overview of the different output files and directories produced by the pipeline.
   * [Use case](docs/use_case.md)
      * A tutorial on how to launch the pipeline on a test dataset on the [genologin cluster](http://bioinfo.genotoul.fr/).
   * [Functional tests](functional_tests/README.md)
      * (for developers) A tool to launch a new version of the pipeline on curated input data and compare its results with known output.

## Contact us

If you have any questions or suggestions for improvement, please contact us at claire.hoede[@]inrae.fr.

## Cite us

For the moment, if you use metagWGS for your research, please cite:
Joanna Fourquet, Céline Noirot, Christophe Klopp, Philippe Pinton, Sylvie Combes, et al. Whole metagenome analysis with metagWGS. JOBIM2020, Jun 2020, Montpellier, France. ⟨hal-03176836⟩