Guide to editing sequences with Chromas and BioEdit

By Peter Unmack

Editing chromatograms with Chromas

Chromas has the advantage that you can save all of your chromatograms, which can subsequently be used in any other program, unlike Sequencher, which saves everything in a project file that cannot be opened by anything else. If I lose my sequence alignment, at least all my chromatograms with the correct edits are still there to rebuild it from. BioEdit can also edit chromatograms, but I find Chromas to be nicer. MEGA also has an alignment editor, but I've not really used it very much.

Double click on the chromatogram file (it usually has the extension ab1). This opens the file in Chromas (see the installation notes below if some other program opens it instead of Chromas). The chromatograms come off the machine with all bases in upper case. I usually make all of my edits as lower case bases, as it makes it easier to identify where I have made edits. When I am done I save the chromatogram and export the data to a line file (which is saved with a .seq extension). Alternatively, you can go Edit, Copy Sequence, FASTA Format and paste that into BioEdit. One trick I find useful is to always edit your sequences from the same starting base (unless the starts are all messy), as it makes sequence alignment much easier later.

I've always used the free Chromas version, Chromas Lite, but there are two other versions with more features that are fairly cheap. Each of the commercial versions has a free 60-day trial should you wish to try them.

Aligning sequences with BioEdit

I use BioEdit to align sequences as it is free and has some handy features. The most annoying aspect is that you have to manually align each sequence and manually create a consensus sequence (which commercial programs like Sequencher and Geneious are very good at). Aside from that limitation (which isn't as bad as it might sound once you learn a few tricks), I really like its features. It is the only program I know of that allows you to edit, search and replace, and paste over the sequence title names independent of your sequences. I use this feature on nearly every dataset I create. As far as I can tell there is no difference between saving your file as a BioEdit formatted file versus as a fasta file. I would recommend saving everything in fasta format, since that is the format I use when I convert the data to another format or send it to another person (who probably doesn't have a copy of BioEdit).

One quirk of BioEdit is that if you double click a data file it will open in a new copy of BioEdit, not in an existing one. The regular copy and paste features work between copies of the program, but copying and pasting sequences does not. If you need to copy and paste sequences between copies of the program, select the sequences and go Edit, Copy Sequences to clipboard (FASTA Format). In the other copy of BioEdit I usually go File, New from Clipboard. I then select those sequences (control-shift-a), cut (control-shift-c) or copy them (control-a) and paste them (control-s) into the desired BioEdit file. The reason I paste them into a new file first is that importing from the clipboard (File, Import from Clipboard) places them at the bottom of your file, which is usually not where I want them to be.

Once I have edited all of my chromatograms I copy the .seq files into an empty directory. Open BioEdit from the start menu. Note that I have changed or set many menu shortcuts (see BioEdit stuff to change after installation below) to make things quicker, so these instructions are based on those changes. Create a new BioEdit file. To import the .seq files exported from Chromas go File, Import, Sequence alignment file, browse to the correct directory, change file type to all, and select the .seq files (in the open file box it often helps to change the view type to details, then click on type to group them all together). If you wish to keep them in the same order as they are in your directory, click on the bottom sequence file first, then click on the top one while holding the shift key. Make sure your mode is set to edit and insert. It helps if you edit the sequences to start from the same base prior to importing them; that way, if you import multiple sequences they are already mostly aligned. And save frequently! There is no auto save function.
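If you have many .seq files, this import step can also be scripted. The following is a minimal Python sketch, not something Chromas or BioEdit provides. It assumes each .seq file exported from Chromas contains nothing but the bases of one sequence, and the function name seq_files_to_fasta is made up for illustration. It builds fasta text using each file name as the sequence title, keeping lower case edits as they are.

```python
from pathlib import Path


def seq_files_to_fasta(directory):
    """Return FASTA text with one record per .seq file in `directory`.

    Assumption: each .seq file holds only the bases of one sequence
    (possibly wrapped over several lines). The file name, minus the
    .seq extension, becomes the sequence title.
    """
    records = []
    # Sort by name so the order matches an alphabetical directory listing.
    for path in sorted(Path(directory).glob("*.seq")):
        # Join wrapped lines and drop any whitespace inside the sequence.
        bases = "".join(path.read_text().split())
        records.append(f">{path.stem}\n{bases}\n")
    return "".join(records)
```

Saving the returned text to a file with a .fas extension should give something BioEdit can open directly, although the sequences will still need to be aligned by hand as described below.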

I usually import all the forwards and reverses into a new BioEdit file. I first group all the forwards together, then all the reverses. I manually align them and check for obvious missing bases, and either correct them or add a gap to preserve the alignment.

Before trying to merge the forwards and reverses together, reverse complement the first reverse sequence (Sequence, Nucleic Acid, Reverse Complement or control-shift-r) and align it to your forward sequence (usually I have to delete a few bases). Once that is aligned, reverse it back to its original orientation and trim or add to the ends of all the other reverse sequences so that they are the same length as the first one (you can draw a box to select the bases at the end, then hit delete). Then reverse complement all of them and they should be perfectly aligned relative to the forwards (otherwise they will all need to be realigned after you reverse complement them). Note that this works best with coding sequences without indels, as every sequence is an identical length; it is all a bit trickier with sequences of different lengths. In that case I try to get them close, but each individual one may require adjustment.

Once I am happy with that I am ready to create what will become the consensus sequences. I copy all the forwards to a new BioEdit file: select the sequence titles (Edit, Select All Sequences, control-shift-a), copy them to the clipboard (Edit, Copy Sequences, control-a), make the new BioEdit file active and paste them in (Edit, Paste Sequences, control-s). I then copy the sequence titles to the clipboard (Edit, Copy sequence titles), paste them into Microsoft Word and use search and replace to get rid of extra details. My sequence names look like this: PU26226.NVCann.1.Glu31. I trim off the sequence number (search for PU^#^#^#^#^#. and replace with nothing) and change the primer name (search for .Glu31 and replace with .cons), which gives me NVCann.1.cons as the sequence name.
In Word, select all the edited names (control-a), copy them to the clipboard (control-c), then go back to BioEdit and paste these names over the existing ones (Edit, Paste Over Titles). Now one BioEdit file has all the forwards and reverses, and the .cons sequences are in another file. Now comes the painful part, as you have to drag and/or cut and paste them all together such that you have the forward, then reverse, then consensus for each individual next to one another. It helps to also have additional individuals from the same population next to one another.

To correct the consensus sequences I copy and paste the sequences from a population (or individual, group, etc.) to a new BioEdit file. Change the view type on the lower toolbar (3rd) of the alignment window: select the third colored button from the left (it says Shade identities and similarities when you hold the mouse over it). This highlights any columns that have different bases. Depending on how well your reverse sequences overlap with your forwards, scroll right until they overlap with good sequence. Select all the reverse sequences and cut them. This allows you to see any base pairs that are different in the clean forwards. I check any unique differences by opening the chromatogram. Undo the cut of the reverses (Edit, Undo or control-z). Note that this only works if you haven't made any other edits; otherwise you have to paste them at the bottom and drag them back up to the correct place. Now scroll right again and look for any bases that need checking.

Eventually the forwards will start to be a poor match to the reverses. At that point I finish my consensus sequence. I select a point in the reverse, then select the sequence to the end (Edit, Select to End, control-e) and copy it (control-c). Now place the cursor in the same place in the consensus sequence. Hit control-e to select to the end, hit delete, move right one base, then paste (control-v). Repeat for each consensus.
Just be sure to select to end from a different location each time, to reduce the chances of pasting the wrong reverse into your consensus. Now I select all the forward sequences, cut them, and scroll right to check for any base changes that need to be checked. Then I undo the cut, select all the sequences (Edit, Select All Sequences, control-shift-a) and copy them (control-a; note that copying and pasting sequences is different to any other copy and paste action). Go back to your BioEdit file with all your sequences (which should still have the original sequences highlighted), paste the sequences (control-s), then delete the selected sequences (control-d), thus replacing the originals with the newly edited ones. Hit save (control-shift-s) and repeat for each group of sequences. At the end of this phase you have done two data checks: one when you edited your original chromatogram, and a second when you checked any unique base pair changes.
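Three of the steps above are mechanical enough to sketch in code, which may help make clear what each one does. This is a rough Python illustration, not how BioEdit or Word do it internally. The function names are made up, the title pattern assumes names shaped like PU26226.NVCann.1.Glu31, and the reverse complement only handles A, C, G and T (n, gaps and other ambiguity codes are left unchanged, which is an assumption of this sketch).

```python
import re

# Complement table; lower case edits stay lower case.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement a sequence; n, gaps and other symbols are kept as is."""
    return seq.translate(_COMPLEMENT)[::-1]


def consensus_title(title, primer):
    """Turn e.g. 'PU26226.NVCann.1.Glu31' into 'NVCann.1.cons'."""
    # Drop a leading 'PU' + five digits + '.', like the Word search PU^#^#^#^#^#.
    title = re.sub(r"^PU\d{5}\.", "", title)
    # Swap the trailing primer name for 'cons'.
    return re.sub(rf"\.{re.escape(primer)}$", ".cons", title)


def differing_columns(sequences):
    """Return the 1-based alignment columns where the sequences disagree.

    Case is ignored, so a lower case edit matches the same upper case base.
    Columns beyond the end of the shortest sequence are not compared.
    """
    columns = []
    for number, column in enumerate(zip(*sequences), start=1):
        if len({base.upper() for base in column}) > 1:
            columns.append(number)
    return columns
```

The list returned by differing_columns corresponds to the columns that the Shade identities and similarities view highlights, which are the positions worth checking against the chromatograms.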

For each gene within a dataset I usually have this file with the forward, reverse and consensus sequences. I then create a second file which has only the .cons sequences. The .cons sequences can then be trimmed to the target length, and then they are ready to convert to the appropriate data file format for analysis. I always keep the BioEdit file with all forwards, reverses and consensus sequences so that if I need to double check something later it is easier to find the relevant chromatograms (I can tell which sequence is from where by the sequence name). I usually add more forwards and reverses to my existing BioEdit files since they are already set up and aligned correctly; otherwise you'll end up with many different but similar versions of your files, and it will be difficult to know which is the correct, most complete version.
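Pulling the .cons sequences out into their own file and trimming them can also be done outside BioEdit. Here is a small Python sketch under the assumption that the full file was saved in fasta format and that consensus titles end in .cons, as in my naming scheme; the function names and the 1-based start and end positions are choices made for this example.

```python
def parse_fasta(text):
    """Return a list of (title, sequence) pairs from FASTA text."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            # Start a new record; the sequence is filled in below.
            records.append([line[1:], ""])
        elif line and records:
            # Sequence lines may be wrapped, so append to the current record.
            records[-1][1] += line
    return [(title, seq) for title, seq in records]


def consensus_only(records, start=None, end=None):
    """Keep records whose title ends in '.cons', trimmed to start..end.

    `start` and `end` are 1-based and inclusive, like the base numbers
    shown in an alignment editor; None means no trimming on that side.
    """
    first = 0 if start is None else start - 1
    return [(title, seq[first:end]) for title, seq in records
            if title.endswith(".cons")]
```

For example, consensus_only(parse_fasta(text), 2, 5) keeps bases 2 through 5 of every .cons sequence and drops the forwards and reverses.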

All of that probably sounds very confusing, but once you have carefully worked through it a couple of times it becomes very easy.

Importing data for phylogenetic analysis

In BioEdit, clean up all the ends and trim things to the base pairs you want to analyze. It can be helpful to make sure any missing bases are labeled with an n and to use a - only for indels, so that you can easily distinguish which is which.
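A quick way to see whether the n versus - convention has been followed is to count the symbols in each sequence. The Python sketch below is only an illustration; the function name and the layout of the returned dictionary are made up for this example. Anything that is not A, C, G, T, n or - is reported under "other" so that stray characters stand out.

```python
from collections import Counter


def summarize_symbols(seq):
    """Count bases, missing data (n) and indel gaps (-) in one aligned sequence."""
    counts = Counter(seq.upper())
    missing = counts.pop("N", 0)
    gaps = counts.pop("-", 0)
    bases = sum(counts.pop(base, 0) for base in "ACGT")
    # Whatever is left is something else, e.g. an ambiguity code or a '?'.
    return {"bases": bases, "missing_n": missing, "gaps": gaps,
            "other": dict(counts)}
```

A sequence with a long run of - at either end probably has missing data coded as gaps, which is the mix-up the convention is meant to avoid.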

To create a MEGA file I select all sequences (control-shift-a) and go Edit, Copy Sequences to clipboard (FASTA Format). Open an existing MEGA file in Word. Remove the existing sequences (from the first sequence hit control-shift-end, then hit delete), then paste in the ones you just copied. Do a search for > and replace them with # (MEGA files require that each sequence name start with #). Note how many replacements it makes; this is the number of samples. Enter that number in the header of the MEGA file. Figure out how many base pairs are present (in BioEdit, go to the last base, select it and look at the number) and enter that in the header of the MEGA file too. Save the file as text only and make sure it has the correct file extension (.meg). If the program sticks .txt on the end, manually change it in File Explorer.
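The same conversion can be sketched in Python for anyone who would rather not do it by hand in Word. This is a hedged illustration only: the function name is made up, and the header layout it writes (the #mega line, !Title, and a !Format line with DataType, NSeqs, NSites and indel) is an assumption about what MEGA expects, so compare it with the header of a MEGA file that you know opens correctly and adjust it if it differs.

```python
def fasta_to_mega(fasta_text, title="Untitled"):
    """Convert aligned FASTA text to MEGA-style text.

    Returns (mega_text, n_sequences, n_sites).
    """
    names, seqs = [], []
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            names.append(line[1:])
            seqs.append("")
        elif line and seqs:
            # Sequence lines may be wrapped, so append to the current one.
            seqs[-1] += line
    lengths = {len(seq) for seq in seqs}
    if len(lengths) != 1:
        raise ValueError(f"sequences are not all the same length: {sorted(lengths)}")
    n_sites = lengths.pop()
    # ASSUMPTION: this header layout is a guess at a typical MEGA header;
    # check it against an existing .meg file before relying on it.
    header = (
        "#mega\n"
        f"!Title {title};\n"
        f"!Format DataType=DNA NSeqs={len(seqs)} NSites={n_sites} indel=-;\n\n"
    )
    # Each '>' title becomes a '#' title, as in the search and replace step.
    body = "".join(f"#{name}\n{seq}\n" for name, seq in zip(names, seqs))
    return header + body, len(seqs), n_sites
```

The two numbers returned alongside the text are the same ones you would otherwise count by hand: the number of samples and the number of base pairs. The length check also catches an alignment whose ends have not been cleaned up.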

Double click the MEGA file and MEGA will open it, check it, and report any errors in the data file. These are usually easily fixed either in MEGA's editor or in Word/BioEdit (make sure you correct them in the original dataset too, otherwise you'll get the same error next time you export your data). Then I run a NJ analysis to see what is going on with the dataset. I usually set Gaps / Missing data to pairwise deletion, otherwise it excludes all positions that have any ambiguous bases.
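To show why the deletion setting matters, here is a small Python sketch that computes a simple proportion of differing sites (p-distance) between two aligned sequences in both ways. It is an illustration of the general idea only, not MEGA's implementation; the function names are made up, and treating everything other than A, C, G and T as missing is an assumption of this sketch.

```python
VALID = set("ACGT")


def p_distance(a, b, skip=frozenset()):
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence is not A, C, G or T are ignored, as are
    any 0-based column indexes listed in `skip`.
    """
    compared = differences = 0
    for i, (x, y) in enumerate(zip(a.upper(), b.upper())):
        if i in skip or x not in VALID or y not in VALID:
            continue
        compared += 1
        differences += x != y
    return differences / compared if compared else float("nan")


def incomplete_columns(sequences):
    """0-based columns where any sequence has a gap, n or ambiguity code."""
    return {
        i
        for i, column in enumerate(zip(*sequences))
        if any(base.upper() not in VALID for base in column)
    }


seqs = ["ACGTACGT", "ACGTNCGA", "ANNTACGT"]
# Pairwise deletion: only sites missing in this particular pair are dropped.
pairwise = p_distance(seqs[0], seqs[1])
# Complete deletion: any column with missing data in ANY sequence is dropped.
complete = p_distance(seqs[0], seqs[1], skip=incomplete_columns(seqs))
```

With pairwise deletion the first two sequences are compared over seven sites, but with complete deletion only five sites remain because the third sequence has two n's. In a real dataset with many samples, complete deletion can throw away a large share of the alignment.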

Chromas and BioEdit installation notes

When you first install BioEdit and Chromas, the default will be that BioEdit opens the chromatogram files. To fix this, right click on a chromatogram and select properties. It should say opens with BioEdit. Hit change, browse to the Chromas executable, select it, choose always open with this program, and hit ok. Now when you double click on a chromatogram it will open in Chromas.

BioEdit stuff to change after installation

BioEdit lets you modify just about anything that it does relative to menus and keyboard shortcuts, as well as the default settings for displaying data. Once you set your preferences on one machine you can copy the bioedit.ini file to any other machine to transfer them. You can download my bioedit.ini file here and save it to your BioEdit directory (rename your existing copy to something else in case you run into any problems). These are my preferences; you can use them or change them to whatever you prefer. I hate menus, so anything that I can use the keyboard for I tend to change. Much editing in BioEdit requires extensive repetitive actions, so using the menus will be rather slow. To change settings, first create a new alignment (File, New Alignment) or open an existing file. Next go View, Customize Menu Shortcuts. Select the value you wish to change, hit the key combination on the keyboard and that will reset it.

These are the changes I make.

Under File

Save, change to Control+Shift+s

Under Edit

Cut sequences, change to Control+Shift+c

Copy sequences, change to Control+a

Paste sequences, change to Control+s

Delete sequences, change to Control+d

Select all sequences, change to Control+Shift+a

Select to end sequences, change to Control+e

Select to beginning sequences, change to Control+b

Hit close

Go to Options, Preferences

Under include (far left), select N, move it to don't include.

Go to Options, Color Table

I change all the ambiguous bases to yellow as that makes it much easier to see them.

On the lower toolbar (3rd) of the alignment window, select the first solidly colored button. This changes the way the sequences are displayed.

On the middle toolbar (2nd) in the alignment window change mode to edit, change box next to it to insert.

Go View, save options as default. If you don't hit this option then all of the changes are lost. Close BioEdit, reopen your files and the settings should all be saved.
