Skip to content
Chee-Hong Wong edited this page Mar 9, 2018 · 10 revisions

Welcome to the Picky wiki!

What is Picky?

Picky is a structural variant pipeline for long reads developed by Genome Technologies, The Jackson Laboratory. Picky was initially designed for Oxford Nanopore long reads, but will also work for PacBio reads. Picky uses LAST to generate all possible High-scoring Segment Pairs (HSPs) for a read, and 'pick'-and-stitch the segments into the representative alignments with a greedy algorithm instead of using last-split. Picky's picking does not assume colinearity nor mandate that each read base must only be aligned to at most a single genomic location.

  1. Why build another structural variant pipeline?

  2. How do I process Oxford nanopore sequencing data for structural variants?

  3. How do I process PacBio sequencing data for structural variants?

  4. What are the picky commands?

  5. What do I make of the output?

  6. How do I scale up with my cluster?

  7. Tell me more about how Picky works!

  8. How do I use my favorite aligner instead?

Download

Picky release

Latest LAST release

Installation

We have run Picky and LAST successfully on Linux (CentOS, Ubuntu) and Mac (El Capitan v10.11.6) system .

Picky is a Perl script with a collection of Perl modules. Just having a copy of them downloaded from GitHub to a machine with Perl interpreter pre-installed would be sufficient. (~few minutes)

PERL

Please visit the Perl website should you need Perl for your computer. Picky has been run with Perl v5.10.1 to v5.18.2 successfully.

LAST

Please visit LAST website for LAST installation and/or build instructions. (~10 minutes if you need to build LAST.)

Quick start

Let's assume that LAST is installed, and you have your human sample sequencer output in "LongRead.fastq".

Demo data

You could try out the public ONT dataset Scrappie chr20 FASTA as Picky's demonstration data.

# download faToFastq
curl -O http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/faToFastq

# make user executable
chmod u+x faToFastq

# download Scrappie based-called chr20 reads
curl -O http://s3.amazonaws.com/nanopore-human-wgs/na12878.chr20ScrappieFiltered.fasta

# convert fasta to fastq with default base quality 'H'
./faToFastq -qual=H na12878.chr20ScrappieFiltered.fasta LongRead.fastq

You may download the demo data results for comparison or for better understanding the output without pipeline execution. (.sam excluded by file size limit.)

Human lastdb construction

last-755/src/lastdb -v -P 1 hg19.lastdb hg19.fa

NOTE: The LAST reference database needs to be build ONCE ONLY. The hg19 primary assembly took 50 min to build on a single core with 16GB RAM.

samtools dict -H hg19.fa > hg19.seq.dict

NOTE: The sequence dictionary needs to be build ONCE ONLY. It is used by picky callSV to generate .sam file header.

Picky processes

We use at least 4 threads for the compute intensive read alignment process. For faster turn-around, use more threads if your machine has more compute power.

# generate the bash script for processing
./picky.pl script --fastq LongRead.fastq --thread 4 > LongRead.sh

# change script to be executable
chmod u+x LongRead.sh

# start the script
./LongRead.sh

At the end of the script execution, you will have the SV records in the vcf file "LongRead.allsv.vcf", and the alignments in "LongRead.profile.sam".

See Oxford nanopore and PacBio sequence data processing for more details. See cluster support on scaling up on a PBS cluster.

NOTE: The public ONT dataset Scrappie chr20 FASTA with 277,054 reads took 2.5hr to complete using 16 cores and 20GB memory.

The Picky Script

The generated content LongRead.sh from the previous step is as shown below and you may customize it before execution or re-execution. Depending on your installation, you have to adapt the first 3 blocks (of exports).

# general installation
export LASTAL=last-755/src/lastal
export PICKY=./picky.pl

# general configuration
export LASTALDB=hg19.lastdb
export LASTALDBFASTA=hg19.fa
export NTHREADS=4

# FASTQ file
export RUN=LongRead

# reads alignments
time (${LASTAL} -C2 -K2 -r1 -q3 -a2 -b1 -v -v -P${NTHREADS} -Q1 ${LASTALDB} ${RUN}.fastq 2>${RUN}.lastal.log | ${PICKY} selectRep --thread ${NTHREADS} --preload 6 1>${RUN}.align 2>${RUN}.selectRep.log)

# call SVs
time (cat ${RUN}.align | ${PICKY} callSV --oprefix ${RUN} --fastq ${RUN}.fastq --genome ${LASTALDBFASTA} --exclude=chrM --sam)

# generate VCF format
${PICKY} xls2vcf --xls ${RUN}.profile.DEL.xls --xls ${RUN}.profile.INS.xls --xls ${RUN}.profile.INDEL.xls --xls ${RUN}.profile.INV.xls --xls ${RUN}.profile.TTLC.xls --xls ${RUN}.profile.TDSR.xls --xls ${RUN}.profile.TDC.xls > ${RUN}.allsv.vcf