Skip to content
This repository has been archived by the owner on Feb 14, 2022. It is now read-only.
/ gwas Public archive

R-based pipeline/script to implement a Genome-Wide Association Study (GWAS) analysis based on the RAINBOWR package

License

Notifications You must be signed in to change notification settings

SilkeAllmannLab/gwas

Repository files navigation

From SNPs to phenotype

A pipeline to relate Single Nucleotide Polymorphisms (SNPs) to a continuous phenotype using either a Random Forest approach from the MUVR package or a canonical GWAS pipeline from RAINBOWR.

Both methods will perform a Genome Wide Analysis (GWAS) using genetic variant (Variant Call Format) and phenotype information.

PLINK

mydata is the prefix of the .ped and .map files.

plink --file mydata --pheno pheno.txt --pheno-name Phenotype --perm --threads 20 --assoc 

1. Inputs and outputs

Both MUVR Random Forest and RAINBOWR methods use the same input files (VCF and phenotype). They also have similar outputs (table of significant SNPs) with some disctint graphs and tables.

Inputs

VCF file

A Variant Call Format (VCF) file that contains the nucleotidic variations is needed. The VCF file has to be of version 4.0 and higher. It will be converted to a genotype matrix using vcfR where genotypes are represented by:

  • "0" for genome positions homozygous for the reference allele.
  • "2" for genome positions homozygous for the alternative allele.
  • "1" for genome positions heterozygous (one reference allele, one alternative allele).

Some useful tools to work with VCF files:

VCFtools can be used to select split the VCF file into chromosomes: vcftools --gzvcf <input_file.vcf.gz> --chr 1

Phenotype file

A tabulated separated file containing two columns:

  1. The identifier for each individual. Column name should be id.
  2. A phenotype column that contains the phenotypic values. Column name should be phenotype

Common outputs

Table of most important SNPs: a table of the most important SNPs related to the phenotype along with their p-values.

Distinct outputs

Random Forest

  • Plot of the model Q2: a plot of the model Q2 metric compared to a distribution of random Q2 values obtained using N permutations (e.g. N = 100).

GWAS RAINBOWR

  • Manhattan plot: a plot of the SNP chromosome position (x-axis) against the $$-log_{10}$$ of the computed p-value of each SNP.

2. Installation

Install RStudio

If not already done, install R (version >= 3.5) and RStudio for your Operating System.

Clone the repository

In the Shell, type git clone https://github.com/SilkeAllmannLab/gwas.git

πŸŽ‰ 🎊 That's it! πŸŽ‰ 🎊

3. Test

Example datasets

All input data are to be found in data/.

VCF dataset

The VCF data from a study from Arouisse et al. 2019.
The VCF file itself can be downloaded directly from FigShare (file is called "Arabidopsis_2029_Maf001_Filter80").

Number of SNPs per chromosome:

chrom # of SNPs
chr01 734,401
chr02 513,750
chr03 616,986
chr04 496,272
chr05 661,335

Several files were then generated from this initial big VCF file:

  • Chromosome VCF files: for instance, for chromosome 1,chr01.Arabidopsis_2029_Maf001_Filter80.vcf.gz
  • Randomly subsampled chromosome VCF files: 10% or more of the initial SNPs were subsampled and extracted on a per-chromosome basis. For instance, for chromosome 1,chr01.subsampled_10_percent.Arabidopsis_2029_Maf001_Filter80.vcf.gz
  • Test file: a small subset available for testing purposes. It can be found at data/Arabidopsis_2029_Maf001_Filter80.1000lines.vcf.

Examples of code used:

To subsample 10% of the original VCF file restricted to chromosome 4 (from ~490,000 variants to 49,000):

gzip -c -d VCF_chr04.recode.vcf.gz |tail -n +16|shuf -n 49000 > chr04.49k_SNPs.no_header.vcf \
  && cat VCF.header.txt chr04.49k_SNPs.no_header.vcf > chr04.49k_SNPs.vcf \
  && gzip chr04.49k_SNPs.vcf \
  && rm  chr04.49k_SNPs.no_header.vcf 

To create one VCF file for each Arabidopsis chromosome:

for i in {1..5}; 
  do echo "working on chromosome $i";
  vcftools  --gzvcf Arabidopsis_2029_Maf001_Filter80.vcf.gz  --chr $i  --recode --recode-INFO-all --out  VCF_chr0$i; 
done

Phenotypes

A root_data_fid_and_names.tsv file contains the genotype line identifier and the phenotypic values. Arabidopsis ecotypes were treated with 2-E-hexanal and their main root length measured. A response ratio was then calculated for each of the ecotypes by comparing the 2-E-hexanal treatment with a mock (methanol).

Test runs

Random Forest

  • Open a new Shell window.
  • Execute the random forest script: Rscript random_forest/gwas.R --help to see the full list of arguments.

RAINBOWR GWAS

  • Open RStudio
  • Run the rainbowr/rainbowr.R script. Make sure you specify a valid VCF file path and phenotype file.

4. References

✍️ Authors

  • Marc Galland (@mgalland)
  • Martha van Os (@MvanOs)
  • Machiel Clige (@MCligge)

vcfR

The R package vcfR is heavily described here.

RAINBOWR

The RAINBOWR package for R is described here.

5. Troubleshooting

RAINBOWR package

Sometimes, you might have issues with the installation of the Rccp package which is a dependency of RAINBOWR.

In the Shell, try: apt-get install r-base-dev

6. Useful links

There are 4 important parameters to set when building an RF model:

  • nTree: Number of trees in the forest

  • mTry: Number of variants evaluated at each node of a tree

  • maxD: Maximum depth of a tree to grow

  • minNS: Minimum number of samples in a node to be processed

About

R-based pipeline/script to implement a Genome-Wide Association Study (GWAS) analysis based on the RAINBOWR package

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published