Bioinformatics Pipeline for Selecting Anonymous Genetic Markers
Illumina technology is currently the most successful and widely adopted next-generation sequencing technology.1 Illumina instruments can sequence single-end or paired-end reads. Paired-end reads come from the two ends of the same DNA molecule: one end is sequenced, the molecule is flipped around, and the other end is sequenced on the opposite strand. In this pipeline, paired-end reads are bioinformatically manipulated in order to select anonymous loci for probe design. The resulting probes capture select portions of DNA for sequencing in shallow-scale studies of non-model organisms (e.g., between species of anoles). Anonymous loci are selected on the basis of hybridization affinity, copy number, locus length, and similarity to other read sequences (read-seq similarity). Hybridization affinity is the affinity of a probe for its target site. This affinity increases with the percent conservation of the locus. There is a trade-off here, however, since the more diversity a probe has, the more diversity can be captured using it. Hybridization affinity also increases with GC content, the percentage of nitrogenous bases in the molecule that are guanine or cytosine. A GC pair is held together by three hydrogen bonds, whereas an adenine-thymine (AT) pair has only two. The number of hydrogen bonds, together with base-stacking interactions, increases the stability of the DNA.2 More variation in GC content leads to more variable loci. The next two selectors are copy number and length. Gene duplication and gene loss over time limit the number of single-copy genes available to choose from, so genes with a low copy number are selected instead. Genes with a low copy number and a long length in base pairs (bp) are of particular value because tree resolution improves with locus length. However, there is a trade-off between the number of loci that can be selected and the length of each locus (e.g., 400 loci of 2,000 bp each would be equivalent to 2,000 loci of 400 bp each, since both total 800,000 bp).
Finally, the proportion of loci with read-seq similarity is examined. Too high a proportion could result in targeting uninformative loci for sequencing.
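
Two of the quantities described above are simple to compute, and a minimal Python sketch may make them concrete. It is an illustration only, not the pipeline's actual code: the function names (`gc_content`, `total_target_bp`, `max_loci_for_budget`) are hypothetical, and the idea that the loci-versus-length trade-off comes from a fixed total number of targeted base pairs is an assumption inferred from the 400 x 2,000 bp example.

```python
def gc_content(seq: str) -> float:
    """Return the fraction of unambiguous bases (A, C, G, T) that are G or C.

    Ambiguous symbols such as N are ignored; an empty or fully ambiguous
    sequence returns 0.0. (Hypothetical helper, not part of the pipeline.)
    """
    counted = [base for base in seq.upper() if base in "ACGT"]
    if not counted:
        return 0.0
    gc = sum(1 for base in counted if base in "GC")
    return gc / len(counted)


def total_target_bp(n_loci: int, locus_length_bp: int) -> int:
    """Total number of base pairs targeted by n_loci loci of equal length."""
    return n_loci * locus_length_bp


def max_loci_for_budget(budget_bp: int, locus_length_bp: int) -> int:
    """Number of equal-length loci that fit in a fixed target budget.

    Assumption: the trade-off in the text reflects a fixed total target size.
    """
    return budget_bp // locus_length_bp


if __name__ == "__main__":
    print(gc_content("GGCCAATT"))
    print(total_target_bp(400, 2000), total_target_bp(2000, 400))
    print(max_loci_for_budget(800_000, 400))
```

Under that assumption, a budget of 800,000 bp can be spent on many short loci or on fewer long ones, which is why the choice of locus length directly limits how many loci can be selected.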