Merge branch 'master' of github.com:cumc/bioworkflows

cumc · Oct 11, 2023 · 5e167a7 · 5e167a7
2 parents 591e076 + 12af9c8
commit 5e167a7
Show file tree

Hide file tree

Showing 26 changed files with 2,911 additions and 1,060 deletions.
diff --git a/GWAS/LMM.ipynb b/GWAS/LMM.ipynb
diff --git a/GWAS/MTAG.ipynb b/GWAS/MTAG.ipynb
@@ -0,0 +1,387 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "463a39e9-76aa-4097-bbb2-d975f030d30d",
+   "metadata": {
+    "kernel": "SoS",
+    "tags": []
+   },
+   "source": [
+    "# Multi-Trait Aanalysis of GWAS (MTAG)\n",
+    "\n",
+    "## Aim\n",
+    "\n",
+    "Joint analysis of summary statistics from GWAS of different traits from possibly overlapping samples\n",
+    "\n",
+    "## Method\n",
+    "\n",
+    "[MTAG](https://www.nature.com/articles/s41588-017-0009-4) uses bivariate linkage disequilibrium (LD) score regression to account for (possibly unknown) sample overlap between the GWAS results for different traits\n",
+    "\n",
+    "The MTAG estimator is a generalization of inverse-variance-weighted meta-analysis that takes summary statistics from single-trait GWAS and outputs trait-specific association statistics. The resulting P values can be used like P values from a single-trait GWAS, for example, to prioritize SNPs for subsequent analyses such as biological annotation or to construct polygenic scores.\n",
+    "\n",
+    "## Requirements\n",
+    "\n",
+    "MTAG needs the use of python=2.7. to install follow intructions [here](https://github.com/JonJala/mtag)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b872d69c-9ad4-486a-997e-1a5aac10d6c3",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "## Format for Summary Statistics\n",
+    "\n",
+    "```\n",
+    "snpid    chr    bpos    a1    a2    freq    z    pval    n\n",
+    "```\n",
+    "\n",
+    "snpid: the ld reference panel for mtag uses hg19 and rsid identifiers. If you are going to use this one for you analyses you have to make sure you snpid column is in the correct format. \n",
+    "\n",
+    "chr: chromosome number\n",
+    "\n",
+    "bpos: base position\n",
+    "\n",
+    "a1 should be the effect allele\n",
+    "\n",
+    "a1: the non-effect allele\n",
+    "\n",
+    "freq: allele frequency for a1\n",
+    "\n",
+    "z: The Z-scores associated with the SNP effect sizes for the GWAS\n",
+    "\n",
+    "pval: p-value\n",
+    "\n",
+    "n: the SNP sample sizes\n",
+    "\n",
+    "\n",
+    "## LD reference panel FIXME\n",
+    "\n",
+    "One issue that I encountered is that in our summary stats, we use the snpid as `chr:pos:ref:alt` format and it is in hg38. Therefore, I've added the liftover module to deal with this and get the summary stats in hg19. \n",
+    "\n",
+    "However, there's still the problem between matching the rsid from the LD reference panel and the snpid column in the GWAS summary statistics. \n",
+    "\n",
+    "To try to solve this problem I've downloaded the LDSC-compatible flat files from [Pan-UKBiobank](https://pan.ukbb.broadinstitute.org/downloads) open source (also in hg19). However, these need to be split by chromosome which I did using this file `/mnt/mfs/statgen/data_public/UKBB.ALL.ldscore.hg19/UKBB.ALL.ldscore.hg19/UKBB.EUR.l2.ldscore.gz`. Then I encountered the problem that `mtag` needs that each of the `{1..22}.l2.ldscore.gz` has two accompanying files `{1..22}.l2.M` and `{1..22}.l2.M_5_50` with the total number of SNPs and the number of SNP's with MAF>5% respectively. The PAN-UK provides these counts for the whole dataset (chrom 1 to 22) and does not provide a MAF which makes is very difficult to create the `l2.M5_50` files.\n",
+    "\n",
+    "\n",
+    "## Errors encountered\n",
+    "\n",
+    "```\n",
+    "raise ValueError(msg.format(F=name, M=expected_median, V=round(m, 2)))\n",
+    "ValueError: WARNING: median value of SIGNED_SUMSTAT is -0.38 (should be close to 0.0). This column may be mislabeled\n",
+    "```\n",
+    "\n",
+    "This issue is discussed [here](https://github.com/JonJala/mtag/issues/86)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "00edeb70-d4e7-4821-ac3b-55b8c6988719",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "## MTAG defaults\n",
+    "\n",
+    "1. Read in the input GWAS summary statistics and filter the SNPs by minor allele frequency (MAF) >= 0.01 and sample size N >= (2/3) * 90th percentile\n",
+    "\n",
+    "2. Merge the filtered GWAS summary statistics results together, taking the intersection of available SNPs.\n",
+    "\n",
+    "3. Estimate the residual covariance matrix via LD Score regression\n",
+    "\n",
+    "4. Estimate the genetic covariance matrix Omega\n",
+    "\n",
+    "5. Perform MTAG and output results"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9940a4a0-e55a-43bb-874c-dd95896aa96b",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "## MTAG special options\n",
+    "\n",
+    "1. Assumes no overlap between any of the cohorts in any pair of GWAS studies fed into mtag\n",
+    "\n",
+    "`--no-overlap`\n",
+    "\n",
+    "\n",
+    "2. Assumes the T summary statistics used in MTAG are GWAS estimates for traits that are perfectly correlated with one another, i.e., each GWAS is on a different measure of the same \"trait\".\n",
+    "\n",
+    "`--perfect_gencov`\n",
+    "\n",
+    "3. Performing meta-analysis with mtag. Assumes: Variation between \"traits\" is only due to non-genetic factors. All summary statistics files have in MTAG have the same heritability as they considered to be results on the same measure of a single trait\n",
+    "\n",
+    "`-equal_h2`\n",
+    "\n",
+    "Use mtag to implement a type of inverse-variance meta-analysis that can handle sample overlap in the GWAS results. "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "08667d6d-4be0-45c3-96a3-bf5baf9270d7",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "## Output\n",
+    "\n",
+    "* .log: timestamps the different steps taken by mtag.py\n",
+    "\n",
+    "* sigma_hat.txt: stores the estimated residual covariance matrix\n",
+    "\n",
+    "* omega_hat.txt: stores the estimated genetic covariance matrix\n",
+    "\n",
+    "* trait_{n}.txt: tab-delimited results files corresponding to the MTAG-adjusted effect sizes and standard errors for n imputed traits\n",
+    "\n",
+    "The first 8 columns are copied from the sumstats and then the results obtained by MTAG are given:\n",
+    "\n",
+    "* mtag_beta: unstandardized weights \n",
+    "\n",
+    "* mtag_se: unstandardized standard errors\n",
+    "\n",
+    "* mtag_z: z-scores\n",
+    "\n",
+    "* mtag_pval: p-values"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d10b9906-c821-4902-b15e-cff9ecfce037",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "source": [
+    "## MWE\n",
+    "```\n",
+    "sumstatsFiles=`echo ~/output/*hg19.snp_stats_original_columns.gz`\n",
+    "sos run ~/project/UKBB_GWAS_dev/workflow/MTAG.ipynb mtag \\\n",
+    "--cwd ~/output \\\n",
+    "--sumstatsFiles $sumstatsFiles \\\n",
+    "--formatFile ~/project/bioworkflows/GWAS/data/mtag_template.yml \\\n",
+    "--ld_ref_panel ~/output/ldscores_ukbb_per_chrom/ \\\n",
+    "--job_name 'f2247_f2257_combined'\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7d76a7ca-c942-4fc8-b54b-412e35071402",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "outputs": [],
+   "source": [
+    "[global]\n",
+    "# the output directory for generated files\n",
+    "parameter: cwd = path\n",
+    "# Path to summary stats file\n",
+    "parameter: sumstatsFiles = paths('.')\n",
+    "# snpid column\n",
+    "parameter: snp_name = 'snpid'\n",
+    "# The effect allele\n",
+    "parameter: a1_name = 'a1'\n",
+    "# The non-effect allele\n",
+    "parameter: a2_name = 'a2'\n",
+    "# Frequency of the effect allele\n",
+    "parameter: freq = 'freq'\n",
+    "# Sample size for every SNP\n",
+    "parameter: n_name = 'n'\n",
+    "# Z-scores column\n",
+    "parameter: z_name = 'z'\n",
+    "# chrosomose column\n",
+    "parameter: chr_name = 'chr'\n",
+    "# base pair position column\n",
+    "parameter: bpos_name = 'bpos'\n",
+    "# Specific number of threads to use\n",
+    "parameter: numThreads = 2\n",
+    "# For cluster jobs, number commands to run per job\n",
+    "parameter: job_size = 1\n",
+    "# The container with the lmm software. Can be either a dockerhub image or a singularity `sif` file.\n",
+    "parameter: container_lmm = 'statisticalgenetics/lmm:3.0'\n",
+    "# Summary statistics format file path used for unifying input column names. Will not unify names if empty\n",
+    "parameter: formatFile = path('.')\n",
+    "# If the sumstatsfile has logP instead of P-val\n",
+    "parameter: reverse_log_p = True\n",
+    "# If the zscore needs to be calculated\n",
+    "parameter: z_score = True\n",
+    "# If there's no overlap between samples\n",
+    "parameter: no_overlap = False\n",
+    "# If the traits are perfectly correlated\n",
+    "parameter: perfect_gencov = True\n",
+    "# Assume equal heritability of traits\n",
+    "parameter: h2_equal = False\n",
+    "# Reference Ld used by ldsc.py needs to be splitted by chromosome\n",
+    "parameter: ld_ref_panel = str"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4e25366e-597e-44ca-8ef6-59f1905dc2ae",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "outputs": [],
+   "source": [
+    "[liftover]\n",
+    "parameter: lifover_pipeline = path\n",
+    "parameter: formatFile_liftover = path('.')\n",
+    "parameter: fr = 'hg38'\n",
+    "parameter: to = 'hg19'\n",
+    "parameter: container = ''\n",
+    "input: susmtastsFiles, group_by=1\n",
+    "output: f'{cwd}/{_input:bnn}.{_hg}.snp_stats.gz'\n",
+    "depends: formatFile_regenie\n",
+    "task: trunk_workers = 1, trunk_size = job_size, walltime = '2h', mem = '10G', cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'\n",
+    "bash: expand = \"${ }\",  stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'\n",
+    "    sos run ${liftover_pipeline} \\\n",
+    "    --cwd ${cwd} \\\n",
+    "    --input_file ${_input} \\\n",
+    "    --output_file ${_input}.${to}.mtag \\\n",
+    "    --fr ${fr} --to ${to} \\\n",
+    "    --yaml_file ${formatFile_liftover} \\\n",
+    "    --no-rename \\\n",
+    "    --container ${container}"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "369e095a-4737-4703-a428-3adc07cb6498",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "outputs": [],
+   "source": [
+    "[mtag_1]\n",
+    "input: sumstatsFiles, group_by=1\n",
+    "output: f'{cwd}/{_input:bnn}.mtag.snp_stats'\n",
+    "depends: formatFile\n",
+    "task: trunk_workers = 1, trunk_size = job_size, walltime = '2h', mem = '10G', cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'\n",
+    "python: expand = \"${ }\",  stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'\n",
+    "    import gzip\n",
+    "    import pandas as pd\n",
+    "    \n",
+    "    # unify output format\n",
+    "    if ${formatFile.is_file()} or ${reverse_log_p} or ${z_score}:\n",
+    "        sumstats = pd.read_csv(${_input:r}, compression='gzip', header=0, sep='\\t', quotechar='\"')  \n",
+    "        if ${formatFile.is_file()}:\n",
+    "            import yaml\n",
+    "            config = yaml.safe_load(open(${formatFile:r}, 'r'))\n",
+    "        try:\n",
+    "            sumstats = sumstats.loc[:,list(config.values())]\n",
+    "        except:\n",
+    "            raise ValueError(f'According to ${formatFile}, input summary statistics should have the following columns: {list(config.values())}.')\n",
+    "        sumstats.columns = list(config.keys())\n",
+    "        if ${reverse_log_p}:\n",
+    "            sumstats['pval'] = sumstats['pval'].apply(lambda row: 10**-row)\n",
+    "        if ${z_score}:\n",
+    "            sumstats['z'] = sumstats['beta']/sumstats['se']\n",
+    "\n",
+    "        sumstats[[\"chr\", \"bpos\"]] = sumstats[[\"chr\", \"bpos\"]].astype(str)\n",
+    "        sumstats[\"snpid\"] = sumstats.chr.str.cat(others=[sumstats.bpos, sumstats.a2, sumstats.a1], sep=':')\n",
+    "        sumstats[[\"chr\", \"bpos\"]] = sumstats[[\"chr\", \"bpos\"]].astype(int)\n",
+    "        sumstats.to_csv(${_output:r}, sep='\\t', header = True, index = False, na_rep='.')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "54a6c8f3-d722-4a6e-a739-f6c225dca848",
+   "metadata": {
+    "kernel": "SoS"
+   },
+   "outputs": [],
+   "source": [
+    "[mtag_2]\n",
+    "parameter: job_name=''\n",
+    "input: group_by='all'\n",
+    "output: f'{cwd}/{job_name}_sigma_hat.txt',\n",
+    "        f'{cwd}/{job_name}_omega_hat.txt',\n",
+    "        [f'{cwd}/{job_name}.trait_{x}.txt' for x in range(len(sumstatsFiles))]\n",
+    "task: trunk_workers = 1, trunk_size = job_size, walltime = '10h', mem = '30G', cores = numThreads, tags = f'{step_name}_{_output[1]:bn}'\n",
+    "bash: expand = \"${ }\", stderr = f'{_output[1]:n}.stderr', stdout = f'{_output[1]:n}.log'\n",
+    "\n",
+    "    ~/conda/my-envs/py27/bin/python2.7 ~/project/mtag/mtag.py  \\\n",
+    "    --sumstats ${','.join(['%s' % x for x in _input if x is not None])} \\\n",
+    "    --snp_name ${snp_name} \\\n",
+    "    --a1_name ${a1_name} \\\n",
+    "    --a2_name ${a2_name} \\\n",
+    "    --eaf_name ${freq} \\\n",
+    "    --n_name ${n_name} \\\n",
+    "    --z_name ${z_name} \\\n",
+    "    --chr_name ${chr_name} \\\n",
+    "    --bpos_name ${bpos_name} \\\n",
+    "    --out ${cwd}/${job_name} \\\n",
+    "    --n_min 0.0 \\\n",
+    "    ${('--ld_ref_panel ' + ld_ref_panel + '/') } \\\n",
+    "    ${('--no_overlap') if no_overlap  else ''} \\\n",
+    "    ${('--perfect_gencov') if perfect_gencov  else ''} \\\n",
+    "    ${('--h2_equal') if h2_equal else ''} \\\n",
+    "    --force"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c4d7839b-b2f6-4465-b029-b9d63bfff9be",
+   "metadata": {
+    "kernel": "Bash"
+   },
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "SoS",
+   "language": "sos",
+   "name": "sos"
+  },
+  "language_info": {
+   "codemirror_mode": "sos",
+   "file_extension": ".sos",
+   "mimetype": "text/x-sos",
+   "name": "sos",
+   "nbconvert_exporter": "sos_notebook.converter.SoS_Exporter",
+   "pygments_lexer": "sos"
+  },
+  "sos": {
+   "kernels": [
+    [
+     "Bash",
+     "calysto_bash",
+     "Bash",
+     "#E6EEFF",
+     "shell"
+    ],
+    [
+     "Python3",
+     "python3",
+     "Python3",
+     "#FFD91A",
+     {
+      "name": "ipython",
+      "version": 3
+     }
+    ],
+    [
+     "SoS",
+     "sos",
+     "",
+     "",
+     "sos"
+    ]
+   ],
+   "version": "0.22.6"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}