|« Testing for Equivalence||Linux Firewall »|
SNP calling refers to the process of identifying posititions where the genome of a sequenced sample differs to that of the reference genome. This might lead to finding disease-causing genomic alterations.
In the following I wanted to re-align short NGS reads against a specific reference (in this case the Mitochondrial genome sequence). A simple way is to use samtools.
1. make a reference genome index
bwa index -a is NCBI_chrM.fa
2. filter reads
samtools view -F4 –hb A1_S1.bam chrM > A1_S1_chrM.bam samtools view -f4 -hb A1_S1.bam > A1_S1_unmapped.bam samtools merge A1_S1_chrM_and_Un.bam A1_S1_chrM.bam A1_S1_unmapped.bam
3. create fastq
bamToFastq -i A1_S1_chrM_and_Un.bam -fq A1_S1_chrM_and_Un.1.fq \ -fq2 A1_S1_chrM_and_Un.2.fq 2> bwa.err &
4. align to new reference
bwa aln NCBI_chrM.fa A1_S1_chrM_and_Un.1.fq > A1_S1_chrM_and_Un.1.sa bwa aln NCBI_chrM.fa A1_S1_chrM_and_Un.2.fq > A1_S1_chrM_and_Un.2.sai bwa sampe NCBI_chrM.fa A1_S1_chrM_and_Un.1.sai A1_S1_chrM_and_Un.2.sai \ A1_S1_chrM_and_Un.1.fq A1_S1_chrM_and_Un.2.fq > A1_S1_chrM_realigned.sam samtools view -F4 -Sbh A1_S1_chrM_realigned.sam \ | samtools sort -o - sorted > A1_S1_chrM_realigned.bam
5. call SNPs
samtools mpileup -uD -f NCBI_chrM.fa A1_S1_chrM_realigned.bam \ | bcftools view -cg - > A1_S1_chrM_realigned.sam.vcf
From the Samtools help pages:
One should consider to apply the following parameters to mpileup in different scenarios:
- Apply -C50 to reduce the effect of reads with excessive mismatches. This aims to fix overestimated mapping quality and appears to be preferred for BWA-short.
- Given multiple technologies, apply -P to specify which technologies to use for collecting initial INDEL candidates. It is recommended to find INDEL candidates from technologies with low INDEL error rate, such as Illumina. When this option is in use, the value(s) following the option must appear in the PL tag in the @RG header lines.
- Apply -D and -S to keep per-sample read depth and strand bias. This is preferred if there are more than one samples at high coverage.
- Adjust -m and -F to control when to initiate indel realignment (requiring r877+). Samtools only finds INDELs where there are sufficient reads containing the INDEL at the same position. It does this to avoid excessive realignment that is computationally demanding. The default works well for many low-coverage samples but not for, say, 500 exomes. In the latter case, using -m 3 -F 0.0002 (3 supporting reads at minimum 0.02% frequency) is necessary to find singletons.
- Use `-BQ0 -d10000000 -f ref.fa' if the purpose is to get the precise depth of coverage rather than call SNPs. Under this setting, mpileup will count low-quality bases, process all reads (by default the depth is capped at 8000), and skip the time-demanding BAQ calculation.
- Apply -A to use anomalous read pairs in mpileup, which are not used by default (requring r874+).
The VCF format
The Variant Call Format (VCF) is the emerging standard for storing variant data. Originally designed for SNPs and short INDELs, it also works for structural variations.
VCF consists of a header and a data section. The header must contain a line starting with one '#', showing the name of each field, and then the sample names starting at the 10th column. The data section is TAB delimited with each line consisting of at least 8 mandatory fields (the first 8 fields in the table below).
Col Field Description 1 CHROM Chromosome name 2 POS 1-based position. For an indel, this is the position preceding the indel. 3 ID Variant identifier. Usually the dbSNP rsID. 4 REF Reference sequence at POS involved in the variant. For a SNP, it is a single base. 5 ALT Comma delimited list of alternative seuqence(s). 6 QUAL Phred-scaled probability of all samples being homozygous reference. 7 FILTER Semicolon delimited list of filters that the variant fails to pass. 8 INFO Semicolon delimited list of variant information. 9 FORMAT Colon delimited list of the format of individual genotypes in the following fields. 10+ Sample(s) Individual genotype information defined by FORMAT.
The following table gives the INFO tags used by samtools and bcftools.
Tag Description AC Allele count in genotypes AC1 Max-likelihood estimate of the first ALT allele count (no HWE assumption) AF1 Max-likelihood estimate of the first ALT allele frequency (assuming HWE) AN Total number of alleles in called genotypes CGT The most probable constrained genotype configuration in the trio CLR Log ratio of genotype likelihoods with and without the constraint DP Raw read depth (sum for all samples) DP4 Number of high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases FQ Phred probability of all samples being the same G3 ML estimate of genotype frequencies HWE Hardy-Weinberg equilibrium test (PMID:15789306) ICF Inbreeding coefficient F INDEL Indicates that the variant is an INDEL. IS Maximum number of reads supporting an indel and fraction of indel reads MDV Maximum number of high-quality nonRef reads in samples MQ Root-mean-square mapping quality of covering reads PC2 Phred probability of the nonRef allele frequency in group1 samples being larger (, smaller) than in group2. PCHI2 Posterior weighted chi2 P-value for testing the association between group1 and group2 samples. PR Number of permutations yielding a smaller PCHI2. PV4 P-values for strand bias, baseQ bias, mapQ bias and tail distance bias QBD Quality by Depth: QUAL/#reads QCHI2 Phred scaled PCHI2. RP # permutations yielding a smaller PCHI2 RPB Read Position Bias SF Source File (index to sourceFiles, f when filtered) TYPE Variant type UGT The most probable unconstrained genotype configuration in the trio VDB Variant Distance Bias (v2) for filtering splice-site artefacts in RNA-seq data.