Skip to content

Oxford Nanopore Sequencing Data Processing

Chee-Hong Wong edited this page Dec 31, 2017 · 2 revisions

We use the following procedure to process data from Oxford Nanopore sequencing.

Step 1 : Fastq File Generation

Reads from ONT .fast5 are converted to .FASTQ format using poretools. We choose to generate the "pass" and "fail" reads into separate individual .fastq file.

export RUN=2016-08-24-R9-WTD-R009
poretools fastq -q pass > "${RUN}.pass.fastq"
poretools fastq -q fail > "${RUN}.fail.fastq"

Step 2 : Human Friendly Read ID instead of Read UUID

To facilitate visualization of read information in IGV browser, we re-assigned Read UUID with a human friendly read id and group the reads accordingly into four different files, namely .2D.fastq, .2Dsupport.fastq, .1D.fastq, and .strange.fastq.

The four separate files allow one to focus the analysis on 2D reads, the template and complement reads generated the 2D consensus reads (2Dsupport), all the 1D reads and finally any remaining reads that do not fall into the earlier groups. The last category (i.e. strange) file is generally empty these days, although there used to be a handful of reads in the earlier days.

export EXPERIMENT=WTD09
export RUN=2016-08-24-R9-WTD-R009
./picky.pl hashFq --pfile ${RUN}.pass.fastq --ffile ${RUN}.fail.fastq --oprefix ${EXPERIMENT}

# alternatively, to analyze pass read downstream only
./picky.pl hashFq --pfile ${RUN}.pass.fastq --oprefix ${EXPERIMENT}.Pass

# or to analyze only fail read downstream
./picky.pl hashFq --ffile ${RUN}.fail.fastq --oprefix ${EXPERIMENT}.Pass

Step 3 : Building the Lastal database

LAST is used to generate possible read alignemnts against a reference genome. It is thus necessary to build the reference genome database to enable the alignment.

last-755/src/lastdb -v -P 1 hg19.lastdb hg19.fa 2>&1 \
  | tee lastdb.v755.hg19.log

NOTE: In our experience, it takes 50 min to build hg19 on a single core with 16GB RAM.

NOTE: Lastal database is build once. You only need to re-build it if the sequence has changed.

Step 4 : Read alignments

LAST is used to generate the possible read alignments against a reference genome. Picky selectRep then sieves through the read alignments (.maf stream) to pick the best representative alignment.

export RUNTYPE=WTD09.2D.subset
last-755/src/lastal -r1 -q1 -a0 -b2 -Q1 -P4 \
  hg19.lastdb\${RUNTYPE}.fastq 2>${RUNTYPE}.lastal.log \
  | ./picky.pl selectRep --thread 4 --preload 10 \
    1>${RUNTYPE}.align 2>${RUNTYPE}.selectRep.log

The direct streaming of last output to picky remove the need to store the gigantic intermediate .maf file. However, if you are troubleshooting or experimenting with other lastal alignment parameters and have sufficient storage, you may keep the .maf file and use the following commands instead.

last-755/src/lastal -r1 -q1 -a0 -b2 -Q1 -P4 hg19.lastdb ${RUNTYPE}.fastq \
  1>${RUNTYPE}.maf 2>${RUNTYPE}.lastal.log
cat ${RUNTYPE}.maf \
  | ./picky.pl selectRep --thread 4 --preload 6 \
    1>${RUNTYPE}.align 2>${RUNTYPE}.selectRep.log

NOTE: For .align format, see output documentation

Step 5 : SVs Calling

Finally, the selected alignments for read are processed to identify harbored structural variants.

cat ${RUNTYPE}.align \
  | ./picky.pl callSV \
    --oprefix ${RUNTYPE}.sv \
    --fastq ${RUNTYPE}.fastq \
    --exclude=chrY --exclude=chrM \
    --sam 2>${RUNTYPE}.callSV.log

Each structural variants are classified into the respectively files; namely deletion (.DEL.xls), insertion (.INS.ls), indel (.INDEL.xls), inversion (.INV.xls), translocation (.TTLC.xls), tandem duplication junction (.TDSR.xls), and tandem duplication complete read (.TDC.xls). Read not harboring any structural variant is recorded in .NONE.xls. Each aligned read fragment is recorded in .xls file.

NOTE: See output documentation for details.

.xls Output

File Description
<oprefix>.profile.DEL.xls tab-delimited file for deletions
<oprefix>.profile.INS.xls tab-delimited file for insertions
<oprefix>.profile.INDEL.xls tab-delimited file for possible co-insertion-and-deletion
<oprefix>.profile.INV.xls tab-delimited file for inversions
<oprefix>.profile.TTLC.xls tab-delimited file for translocations
<oprefix>.profile.TDSR.xls tab-delimited file for tandem duplications where read span the junction
<oprefix>.profile.TDC.xls tab-delimited file for tandem duplications where read completely cover the duplications
<oprefix>.profile.xls tab-delimited file for all read alignment segments

Deletions in Homopolymer Regions

Earlier ONT base-caller under-call bases in homopolymer region. This manifests as deletions and can be excluded by specifying the option "--removehomopolymerdeletion". It is necessary to provide the genome sequences (.FASTA file; option "--genome") used to construct the lastdb.

ONT base-caller has addressed this issue and you should NOT need this option (off by default) if your base-calling is done with the improved version of the base-caller.

Not interested in alignments

If you are only interested in getting the list of structural variants and NOT the sam file, you may omit the "--sam" option.

Excluding SV on certain chromosome

Sometime, we may not be interested in SV identified on specific chromosomes. You may use the "--exclude" option. For instance, to exclude structural variant identified on mitochrondria, use the option "--exclude=chrM".

Step 6 : SVs in VCF (experimental)

It is possible to convert the tab-delimited SVs .xls files to VCF format.

./picky.pl xls2vcf \
  --xls ${RUNTYPE}.sv.profile.DEL.xls \
  > ${RUNTYPE}.DEL.vcf

By default, only SVs with at least 2 read evidence support are reported. For low coverage data, one may wish to report even SVs that have single read support only.

./picky.pl xls2vcf \
  --xls ${RUNTYPE}.sv.profile.DEL.xls \
  --re 1 > ${RUNTYPE}.include.Singleton.DEL.vcf

It is also possible to have all the different types of SVs in a single vcf file.

export OPREFIX=${RUNTYPE}.sv
./picky.pl xls2vcf \
  --xls ${OPREFIX}.profile.DEL.xls \
  --xls ${OPREFIX}.profile.INS.xls \
  --xls ${OPREFIX}.profile.INDEL.xls \
  --xls ${OPREFIX}.profile.INV.xls \
  --xls ${OPREFIX}.profile.TTLC.xls \
  --xls ${OPREFIX}.profile.TDSR.xls \
  --xls ${OPREFIX}.profile.TDC.xls \
  > ${RUNTYPE}.allsv.vcf