By Peter Unmack
It is wise to come up with a logical, easily understood naming scheme for your samples based on some combination of the site and species name. Otherwise, every time you show someone your tree you will have no clue which samples are which, and you'll be relabeling all the time. Site names or field codes of some sort also help if you are going to run your data through Arlequin or group samples in MEGA; without them both tasks are more confusing and difficult.
There should be no spaces or unusual characters in the sequence names. Avoid hyphens and any special characters (e.g., ! @ $ ^ & * ~ ? | / [ ] < > \ ` " ; # ( ) ). Only use letters, numbers, periods and underscores. The length of the sequence name shouldn't matter unless you get ridiculous, as long as you don’t need strict PHYLIP format for any data analysis, since that format limits names to 10 characters.
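The naming rule above can be checked automatically before sequences go into an alignment. The following is only a minimal sketch in Python; the function name check_name and the example sample names are hypothetical, not part of any package mentioned here.

```python
import re

# Allowed characters per the rule above: letters, numbers, periods, underscores.
VALID_NAME = re.compile(r"[A-Za-z0-9._]+")
STRICT_PHYLIP_MAX = 10  # strict PHYLIP limits names to 10 characters


def check_name(name, strict_phylip=False):
    """Return a list of problems found in a sequence name (empty if none)."""
    problems = []
    if not name:
        problems.append("name is empty")
    elif not VALID_NAME.fullmatch(name):
        # Collect each offending character once, in sorted order.
        bad = sorted(set(re.findall(r"[^A-Za-z0-9._]", name)))
        problems.append("disallowed characters: " + " ".join(repr(c) for c in bad))
    if strict_phylip and len(name) > STRICT_PHYLIP_MAX:
        problems.append(f"longer than {STRICT_PHYLIP_MAX} characters")
    return problems


# Hypothetical sample names, for illustration only.
for sample in ["MurrayR_Mfluv01", "Murray R-01"]:
    print(sample, check_name(sample))
```

Pass strict_phylip=True only if a downstream program really needs the 10-character limit.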
You will find the sequencing machine consistently makes the same mistakes, often in the same places on different chromatograms. The first few bases are always somewhat difficult to read; usually we just delete this “junk.”
Also, within the first 50 bp the sequencing machine nearly always fails to call both of two consecutive identical bases. That is, the machine will call “A” when really it is “AA.” There will usually be two or sometimes three instances where this needs to be corrected. Sometimes two different bases will sit very close to each other and the machine will miss one of them too (often in “CA”).
Figure. Left image shows unedited chromatogram near the start of the chromatogram. Right image shows the correct base calls (ATG are the first bases of this gene, thus I’ve deleted everything prior to that).
There will also often be a large “pulse” of dye (a dye blob) that overrides the underlying sequence; again, this is usually in the first 100 bp or so. You have to interpret through this junk, which is usually easy to do. If certain bases are unclear then call them “n.”
Figure. Left image shows unedited chromatogram with a dye blob. Right image shows how I would call those bases (in lower case). Ideally, I’d repeat this sequencing reaction.
Most sequencing machines call ambiguous positions as N; these should be corrected when the correct base is clearly diagnosable from the chromatogram.
Note that not all of the peaks are the same height. Height is determined by the individual base, as well as by the base that precedes it. For instance, the highest peaks will tend to be G after a T, but when a G occurs after an A it tends to be very low. In some cases a C before a G also tends to lower the G peak. An A after a G tends to be higher than normal. These are all things you need to take into consideration when calling bases correctly. See the following link for some more extreme examples of base pair miscalls.
There is a set of codes (the IUPAC ambiguity codes) for the different base combinations that represent heterozygous positions. These occur in nuclear DNA; they do not usually occur in mtDNA, but there are exceptions. Here are the codes for two-base heterozygosity. Note there are also codes for three-base combinations.
M=(A, C), K=(G, T), R=(A, G), Y=(T, C), W=(A, T), S=(G, C)
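When recording a heterozygous call, the two-base codes listed above can be looked up rather than memorized. This is a small sketch in Python; het_code and the two dictionaries are hypothetical helper names, and only the six two-base codes given above are covered.

```python
# Two-base ambiguity codes, exactly as listed above.
HET_CODES = {
    frozenset("AC"): "M",
    frozenset("GT"): "K",
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("AT"): "W",
    frozenset("CG"): "S",
}

# Reverse lookup: code -> the two bases it stands for, in alphabetical order.
HET_BASES = {code: "".join(sorted(bases)) for bases, code in HET_CODES.items()}


def het_code(base1, base2):
    """Return the ambiguity code for two bases seen at one position."""
    b1, b2 = base1.upper(), base2.upper()
    if b1 not in "ACGT" or b2 not in "ACGT" or len(b1) != 1 or len(b2) != 1:
        raise ValueError("expected two single bases from A, C, G, T")
    if b1 == b2:
        return b1  # same base twice is a homozygote, so no code is needed
    return HET_CODES[frozenset((b1, b2))]


print(het_code("A", "G"))  # R
print(HET_BASES["Y"])      # CT
```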
It can be difficult at times to judge whether a heterozygote is really a heterozygote, and not just an error. Machines also vary in their sensitivity, as do individual runs. You also need to consider the level of the background peaks. And remember that both peaks should be about half the height they would normally be, but that will also depend on the preceding base too! You have to be very careful when checking your chromatogram, especially for nuclear sequences.
Spacing of the letters across the top tends to slowly change along the sequence; check for obvious discrepancies in this spacing, as they usually indicate an added or missing base.
Check the peak height for obvious heterozygotes to see that they have been called correctly.
Once the chromatogram is corrected, save it and move on to the next one.
Once I have my sequence editing completed, I put all the individuals from the same population together into a temporary BioEdit file and see what differences exist between individuals. When I only have a single sequence complement (i.e., only the forward or reverse), I then go back to the original chromatograms to double check they are correct. If both the forward and reverse reads show the same change then it should be good without further checking. The degree of checking you do depends on the project. For pop gen / phylogeography projects it is a good idea to do this by population. For deeper phylogenetic projects this is more difficult if you only have single sequences per species, as there will be tens to hundreds of differences. In this case you should check your sequences more carefully when you first edit them, or do two individuals per species so that they are easier to check. When I do this type of checking I often discover that, especially early in my data collection efforts, I occasionally retained data from chromatograms that really should have been repeated. Don’t forget to also check your data in MEGA (or some other program) via the amino acid coding for stop codons, missing data and runs of different amino acids in single individuals, all of which are indicative of errors in your original data.
Figure. Left image shows runs of different amino acids that are clearly errors. Right image shows an error due to a stop codon.
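The stop codon check can also be scripted as a quick first pass before opening MEGA. The sketch below is in Python and rests on several assumptions: the sequence is ungapped and already in reading frame, the function name internal_stops is hypothetical, and the stop codon set must match your gene (the standard code is the default here; vertebrate mtDNA uses a different set, included as an option).

```python
# Stop codons under the standard genetic code and the vertebrate
# mitochondrial code. Pick the one that matches your marker.
STANDARD_STOPS = frozenset({"TAA", "TAG", "TGA"})
VERT_MITO_STOPS = frozenset({"TAA", "TAG", "AGA", "AGG"})


def internal_stops(seq, stops=STANDARD_STOPS, frame=0):
    """Return 1-based start positions of in-frame stop codons.

    Assumes an ungapped sequence. A stop codon that ends exactly at the
    end of the sequence is not reported, since that is where a complete
    gene is expected to stop.
    """
    seq = seq.upper()
    hits = []
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in stops and i + 3 < len(seq):
            hits.append(i + 1)
    return hits


# Made-up sequences, for illustration only.
print(internal_stops("ATGAAATAAGGG"))              # [7]
print(internal_stops("ATGTGAAAA"))                 # [4]
print(internal_stops("ATGTGAAAA", VERT_MITO_STOPS))  # []
```

Any position reported is a place to go back to the chromatogram; it does not replace looking at the amino acid alignment for runs of odd amino acids.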
I also strongly suggest that any sample within a phylogeographic dataset that is unique be double checked against the original chromatograms. It may also be worth checking at least some unique pairs of sequences, as errors in chromatograms often occur at the same place within a sequence read. Also, if you see any samples that stick out beyond the rest then this is often due to errors as well. I make a neighbor-joining tree to identify which samples to check.
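A list of samples whose sequence occurs only once can be pulled out with a few lines of code, as a complement to the tree. This is a rough sketch in Python, assuming the sequences are aligned to the same length and held in a dictionary of name to sequence; singleton_samples and the sample names are hypothetical. Note that it compares sequences as plain text, so a sequence that differs only by an “n” or an ambiguity code will also be reported as unique.

```python
from collections import defaultdict


def singleton_samples(seqs):
    """Return sorted names of samples whose sequence matches no other sample.

    seqs maps sample name -> aligned sequence. Comparison is exact text
    matching after converting to upper case.
    """
    groups = defaultdict(list)
    for name, seq in seqs.items():
        groups[seq.upper()].append(name)
    return sorted(names[0] for names in groups.values() if len(names) == 1)


# Made-up data, for illustration only.
example = {
    "SiteA_01": "ATGCCGTA",
    "SiteA_02": "ATGCCGTA",
    "SiteA_03": "ATGCCGTT",
    "SiteB_01": "atgccata",
}
print(singleton_samples(example))  # ['SiteA_03', 'SiteB_01']
```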
Once your data are checked and you are convinced that everything looks good, it is time to generate additional file formats of your data.