Skip to content

PacBio Sequencing Data Processing

Chee-Hong Wong edited this page Dec 31, 2017 · 2 revisions

Although one might starts with the circular consensus (CCS) reads, we prefer to work with the more "raw" reads, i.e. subreads. We use the following procedure to process PacBio sequencing data.

Step 1 : Fastq File Quality Tweaking

Fastq can be generated from SMRTLink portal. The base quality is default to zero and thus making the read unusable. For now, we will default the base quality artificially to the letter "H". For instance, a .bam generated by SMRTLink Portal for Sequel data will be tweaked as follow:

export RUNTYPE=m54186_170711_202450.subreads
samtools view ${RUNTYPE}.bam \
  | perl -ne 'chomp();@b=split(/\t/);$rl=length($b[9]); 
    printf "@%s\n%s\n+\n%s\n",$b[0], $b[9], ("H" x $rl) if ($rl>=500);' \
    > ${RUNTYPE}.fastq

Step 2 : Building the Lastal database

LAST is used to generate possible read alignemnts against a reference genome. It is thus necessary to build the reference genome database to enable the alignment.

last-755/src/lastdb -v -P 1 hg19.nomask.st.lastdb hg19.fa 2>&1 \
  | tee lastdb.v755.hg19.log

NOTE: In our experience, it takes 50 min to build hg19 on a single core with 16GB RAM.

NOTE: Lastal database is build once. You only need to re-build it if the sequence has changed.

Step 3 : Read alignments

LAST is used to generate the possible read alignments against a reference genome. Picky selectRep then sieves through the read alignments (.maf file) to pick the best representative alignment.

export RUNTYPE=m54186_170711_202450.subreads
last-755/src/lastal -r1 -q1 -a0 -b2 -Q1 -P4 \
  hg19.lastdb\${RUNTYPE}.fastq 2>${RUNTYPE}.lastal.log \
  | ./picky.pl selectRep --thread 4 --preload 10 \
    1>${RUNTYPE}.align 2>${RUNTYPE}.selectRep.log

The direct streaming of last output to picky remove the need to store the gigantic intermediate .maf file. However, if you are troubleshooting or experimenting with other lastal alignment parameters and have sufficient storage, you may keep the .maf file and use the following commands instead.

last-755/src/lastal -r1 -q1 -a0 -b2 -Q1 -P4 hg19.lastdb ${RUNTYPE}.fastq \
  1>${RUNTYPE}.maf 2>${RUNTYPE}.lastal.log
cat ${RUNTYPE}.maf \
  | ./picky.pl selectRep --thread 4 --preload 6 \
    1>${RUNTYPE}.align 2>${RUNTYPE}.selectRep.log

NOTE: For .align format, see output documentation

Step 4 : SVs Calling

Finally, the selected alignments for read are processed to identify harbored structural variants.

cat ${RUNTYPE}.align \
  | ./picky.pl callSV \
    --oprefix ${RUNTYPE}.sv \
    --fastq ${RUNTYPE}.fastq \
    --genome hg19.fa --removehomopolymerdeletion \
    --exclude=chrY --exclude=chrM \
    --sam 2>${RUNTYPE}.callSV.log

Each structural variants are classified into the respectively files; namely deletion (.DEL.xls), insertion (.INS.ls), indel (.INDEL.xls), inversion (.INV.xls), translocation (.TTLC.xls), tandem duplication junction (.TDSR.xls), and tandem duplication complete read (.TDC.xls). Read not harboring any structural variant is recorded in .NONE.xls. Each aligned read fragment is recorded in .xls file.

NOTE: See output documentation for details.

.xls Output

File Description
<oprefix>.profile.DEL.xls tab-delimited file for deletions
<oprefix>.profile.INS.xls tab-delimited file for insertions
<oprefix>.profile.INDEL.xls tab-delimited file for possible co-insertion-and-deletion
<oprefix>.profile.INV.xls tab-delimited file for inversions
<oprefix>.profile.TTLC.xls tab-delimited file for translocations
<oprefix>.profile.TDSR.xls tab-delimited file for tandem duplications where read span the junction
<oprefix>.profile.TDC.xls tab-delimited file for tandem duplications where read completely cover the duplications
<oprefix>.profile.xls tab-delimited file for all read alignment segments

Not interested in alignments

If you are only interested in getting the list of structural variants and NOT the sam file, you may omit the "--sam" option.

Excluding SV on certain chromosome

Sometime, we may not be interested in SV identified on specific chromosomes. You may use the "--exclude" option. For instance, to exclude structural variant identified on mitochrondria, use the option "--exclude=chrM".

Step 5 : SVs in VCF (experimental)

It is possible to convert the tab-delimited SVs .xls files to VCF format.

./picky.pl xls2vcf \
  --xls ${RUNTYPE}.sv.profile.DEL.xls \
  > ${RUNTYPE}.DEL.vcf

By default, only SVs with at least 2 read evidence support are reported. For low coverage data, one may wish to report even SVs that have single read support only.

./picky.pl xls2vcf \
  --xls ${RUNTYPE}.sv.profile.DEL.xls \
  --re 1 > ${RUNTYPE}.include.Singleton.DEL.vcf

It is also possible to have all the different types of SVs in a single vcf file.

export OPREFIX=${RUNTYPE}.sv
./picky.pl xls2vcf \
  --xls ${OPREFIX}.profile.DEL.xls \
  --xls ${OPREFIX}.profile.INS.xls \
  --xls ${OPREFIX}.profile.INDEL.xls \
  --xls ${OPREFIX}.profile.INV.xls \
  --xls ${OPREFIX}.profile.TTLC.xls \
  --xls ${OPREFIX}.profile.TDSR.xls \
  --xls ${OPREFIX}.profile.TDC.xls \
  > ${RUNTYPE}.allsv.vcf