Sequence Matrix: Gene concatenation made easy

Gaurav Vaidya (1), David Lohman (2), Rudolf Meier (2). 1: NeatCo Asia, Singapore. 2: Department of Biological Sciences, National University of Singapore, Singapore.

description

Creating large datasets by concatenating genes can be challenging. Sequence Matrix aims to make that process much easier. For more information, see http://code.google.com/p/sequencematrix/ or http://www3.interscience.wiley.com/journal/123577052/abstract

Transcript of Sequence Matrix: Gene concatenation made easy

Page 1: Sequence Matrix: Gene concatenation made easy

Sequence Matrix

Gene concatenation made easy

Gaurav Vaidya (1), David Lohman (2), Rudolf Meier (2)

1: NeatCo Asia, Singapore.

2: Department of Biological Sciences, National University of Singapore, Singapore.

Page 2: Sequence Matrix: Gene concatenation made easy

Our goals

✤ Many powerful tools exist for concatenating sequences.

✤ But adding new sequences to an existing dataset is tedious and time-consuming.

✤ Our initial goal: simple, user-friendly program for concatenating sequences.

✤ We also added a few tools to help you look for lab contamination in your dataset.

Page 3: Sequence Matrix: Gene concatenation made easy

Sequence Matrix

✤ Written in Java.

✤ Uses Java's graphical user interface libraries.

✤ Works on different operating systems.

✤ Easy to install: download and run the batch file.

Page 4: Sequence Matrix: Gene concatenation made easy

Importing sequences

✤ You can use the sequence names as entered in the input file.

✤ Or you can ask Sequence Matrix to try to identify the species names.

Page 5: Sequence Matrix: Gene concatenation made easy

Importing sequences

✤ Sequence names mode:

✤ gi|237510679|gb|AY556753.2|Daubentonia madagascariensis voucher WE94001 5.8S ribosomal RNA gene, partial sequence; internal transcribed spacer 2, complete sequence; and 28S ribosomal RNA gene, partial sequence

✤ gi|237510678|gb|AY556735.2|Macaca sylvanus voucher OK96022 5.8S ribosomal RNA gene, partial sequence; internal transcribed spacer 2, complete sequence; and 28S ribosomal RNA gene, partial sequence

✤ Species names mode:

✤ Daubentonia madagascariensis

✤ Macaca sylvanus
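
The species name identification shown above can be sketched roughly as follows. This is only an illustration, not Sequence Matrix's actual code: the function name `guess_species_name` and the "capitalised genus plus lowercase epithet" pattern are assumptions, and real headers will include cases this simple pattern misses.

```python
import re
from typing import Optional

# Assumed pattern: a capitalised genus followed by a lowercase species epithet.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+) ([a-z]{2,})\b")


def guess_species_name(header: str) -> Optional[str]:
    """Return the first 'Genus species' pair found in a sequence header, or None."""
    # In GenBank-style headers the free-text description follows the last '|'.
    description = header.split("|")[-1]
    match = BINOMIAL.search(description)
    if match is None:
        return None
    return f"{match.group(1)} {match.group(2)}"


header = (
    "gi|237510679|gb|AY556753.2|Daubentonia madagascariensis "
    "voucher WE94001 5.8S ribosomal RNA gene, partial sequence"
)
print(guess_species_name(header))
```

When no binomial is found, the sketch returns `None`, so the caller can fall back to the full sequence name.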

Page 6: Sequence Matrix: Gene concatenation made easy

Importing sequences

✤ A common source of error is forgetting to recode leading and trailing gaps as missing information.

✤ Sequence Matrix can automatically replace such gaps with question marks.
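
A minimal sketch of that recoding step, assuming gaps are written as `-` and missing data as `?`. The function name `recode_terminal_gaps` is hypothetical. Internal gaps are left alone, because they may represent real insertions or deletions.

```python
def recode_terminal_gaps(sequence: str, gap: str = "-", missing: str = "?") -> str:
    """Replace leading and trailing gap characters with the missing-data symbol."""
    core = sequence.strip(gap)
    if not core:
        # The sequence consists only of gaps (or is empty): all of it is missing.
        return missing * len(sequence)
    leading = len(sequence) - len(sequence.lstrip(gap))
    trailing = len(sequence) - len(sequence.rstrip(gap))
    return missing * leading + core + missing * trailing


print(recode_terminal_gaps("--ACG-T---"))  # ??ACG-T???
```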

Page 7: Sequence Matrix: Gene concatenation made easy

Importing sequences: Naming

✤ Sequences from one dataset are matched up to another dataset by sequence name.

✤ Errors in sequence naming need to be fixed.

✤ We recommend naming your files by gene name: ‘coi’, ‘cytb’, ‘28S’ and so on.
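
To show why consistent names matter, here is a rough sketch of name-based concatenation. It assumes each gene has already been read into a dictionary of aligned sequences keyed by sequence name. The function name `concatenate` is hypothetical, and the code is not taken from Sequence Matrix.

```python
from typing import Dict


def concatenate(genes: Dict[str, Dict[str, str]], missing: str = "?") -> Dict[str, str]:
    """Join per-gene alignments by sequence name, padding absent genes with '?'."""
    gene_lengths = {}
    for gene, sequences in genes.items():
        lengths = {len(s) for s in sequences.values()}
        if len(lengths) != 1:
            raise ValueError(f"sequences in {gene!r} are not aligned to a single length")
        gene_lengths[gene] = lengths.pop()

    taxa = sorted({name for sequences in genes.values() for name in sequences})
    return {
        taxon: "".join(
            genes[gene].get(taxon, missing * gene_lengths[gene]) for gene in genes
        )
        for taxon in taxa
    }


genes = {
    "coi": {"Homo sapiens": "ACGT", "Gorilla gorilla": "ACGA"},
    "cytb": {"Homo sapiens": "TTG"},
}
print(concatenate(genes))
```

A misspelt name such as "Gorila gorilla" would simply appear as an extra taxon padded with question marks, which is why naming errors need to be fixed before concatenating.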

Page 8: Sequence Matrix: Gene concatenation made easy

Export: Taxonsets

✤ By default, we generate taxonsets on the basis of:

✤ Combined length.

✤ Number of character sets.

✤ Information for a particular gene.
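
The three criteria above could be computed along these lines. This is a sketch under assumptions: the taxonset names, the `min_length` and `min_charsets` thresholds, and the function names are all invented for illustration and do not reflect what Sequence Matrix actually writes to its export files.

```python
from typing import Dict, List

MISSING = set("?-")  # characters that carry no sequence information


def informative_length(sequence: str) -> int:
    """Count characters that are neither gaps nor missing data."""
    return sum(1 for c in sequence if c not in MISSING)


def build_taxonsets(
    genes: Dict[str, Dict[str, str]], min_length: int, min_charsets: int
) -> Dict[str, List[str]]:
    taxa = sorted({t for seqs in genes.values() for t in seqs})
    # Informative length per taxon per gene (0 when the taxon lacks the gene).
    info = {t: {g: informative_length(genes[g].get(t, "")) for g in genes} for t in taxa}

    taxonsets = {
        # Hypothetical set names, chosen only for this example.
        f"combined_length_ge_{min_length}": [
            t for t in taxa if sum(info[t].values()) >= min_length
        ],
        f"charsets_ge_{min_charsets}": [
            t for t in taxa if sum(1 for n in info[t].values() if n > 0) >= min_charsets
        ],
    }
    for g in genes:
        taxonsets[f"has_{g}"] = [t for t in taxa if info[t][g] > 0]
    return taxonsets


genes = {"coi": {"A": "ACGT", "B": "AC--"}, "cytb": {"A": "TTG", "C": "???"}}
print(build_taxonsets(genes, min_length=4, min_charsets=2))
```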

Page 9: Sequence Matrix: Gene concatenation made easy

Gene trees

✤ Two ways to prepare data for them:

✤ Use the taxonset of taxa having information for a particular gene to exclude other taxa.

✤ Export the entire dataset with one file per column.

Page 10: Sequence Matrix: Gene concatenation made easy

Export features

✤ You can also export the Sequence Matrix table as an Excel-readable text file.

✤ Supervisory mode.

✤ Keep track of a project as it grows.
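
An "Excel-readable text file" is typically a tab-delimited table. The sketch below assumes the table has one row per taxon and one column per gene, with each cell holding the number of informative characters. That layout and the function name `table_as_tsv` are assumptions made for illustration.

```python
import csv
import io
from typing import Dict


def table_as_tsv(genes: Dict[str, Dict[str, str]]) -> str:
    """Render a taxon-by-gene table of informative sequence lengths as tab-delimited text."""
    gene_names = list(genes)
    taxa = sorted({t for seqs in genes.values() for t in seqs})

    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["taxon"] + gene_names)
    for taxon in taxa:
        row = [taxon]
        for gene in gene_names:
            sequence = genes[gene].get(taxon, "")
            # Count only characters that are not gaps or missing data.
            row.append(sum(1 for c in sequence if c not in "?-"))
        writer.writerow(row)
    return buffer.getvalue()


print(table_as_tsv({"coi": {"A": "ACGT", "B": "AC--"}, "cytb": {"A": "TTG"}}))
```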

Page 11: Sequence Matrix: Gene concatenation made easy

Character sets

✤ We can read character sets defined in Nexus CHARSET and TNT xgroup commands.

✤ These can be “split” into individual columns, or imported as a single column representing the entire file.
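
As an illustration of "splitting" by character set, the sketch below reads simple Nexus `CHARSET name = start-end;` definitions and slices a concatenated sequence accordingly. It only handles single contiguous ranges, which is an assumption. Real Nexus files also allow multiple ranges and step notation, and TNT xgroup commands are not covered here.

```python
import re
from typing import Dict, Tuple

# Matches lines such as "CHARSET coi = 1-658;" (case-insensitive).
CHARSET = re.compile(
    r"^\s*charset\s+(\S+)\s*=\s*(\d+)\s*-\s*(\d+)\s*;",
    re.IGNORECASE | re.MULTILINE,
)


def parse_charsets(nexus_text: str) -> Dict[str, Tuple[int, int]]:
    """Return {charset name: (start, end)} with 1-based, inclusive positions."""
    return {
        name: (int(start), int(end))
        for name, start, end in CHARSET.findall(nexus_text)
    }


def split_by_charset(sequence: str, charsets: Dict[str, Tuple[int, int]]) -> Dict[str, str]:
    """Cut one concatenated sequence into one piece per character set."""
    # Nexus positions are 1-based and inclusive; Python slices are 0-based and exclusive.
    return {name: sequence[start - 1:end] for name, (start, end) in charsets.items()}


sets_block = "BEGIN SETS;\n  CHARSET coi = 1-4;\n  charset cytb = 5 - 7;\nEND;\n"
charsets = parse_charsets(sets_block)
print(split_by_charset("ACGTTTG", charsets))
```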

Page 12: Sequence Matrix: Gene concatenation made easy

Excision

✤ Individual sequences can be excised from the dataset.

✤ Excised sequences will not be exported.

✤ Sequence Matrix will warn you about that.

Page 13: Sequence Matrix: Gene concatenation made easy

Contamination

✤ You thought you were sequencing Gorilla gorilla

✤ but you were really sequencing Homo sapiens.

✤ We have two tools you can use:

✤ If Homo sapiens is in your dataset.

✤ If Homo sapiens is not in your dataset (experimental!).

Page 14: Sequence Matrix: Gene concatenation made easy

H. sapiens in dataset

✤ Looks for pairs of sequences whose pairwise distance is very low.

✤ Expected difference depends on gene:

✤ 28S doesn’t change very much, but

✤ COI changes very quickly.

✤ Some interpretation is required.
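
A rough sketch of the search for very similar pairs, using an uncorrected p-distance (the proportion of differing sites among sites where both sequences have data). The distance measure, the `threshold` parameter and the function names are assumptions for illustration. The slides do not say how Sequence Matrix computes distances, and a sensible threshold depends on the gene.

```python
from itertools import combinations
from typing import Dict, List, Optional, Tuple

AMBIGUOUS = set("?-N")  # sites skipped when comparing two sequences


def p_distance(a: str, b: str) -> Optional[float]:
    """Proportion of differing sites, or None if no site can be compared."""
    compared = differing = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in AMBIGUOUS or y in AMBIGUOUS:
            continue
        compared += 1
        differing += x != y
    return differing / compared if compared else None


def suspicious_pairs(
    sequences: Dict[str, str], threshold: float
) -> List[Tuple[str, str, float]]:
    """List pairs of aligned sequences whose p-distance is at or below the threshold."""
    hits = []
    for (name1, seq1), (name2, seq2) in combinations(sorted(sequences.items()), 2):
        distance = p_distance(seq1, seq2)
        if distance is not None and distance <= threshold:
            hits.append((name1, name2, distance))
    return hits


coi = {
    "Gorilla gorilla": "ACGTACGT",
    "Homo sapiens": "ACGTACGT",
    "Macaca sylvanus": "ACCTTCGA",
}
print(suspicious_pairs(coi, threshold=0.01))
```

A pair flagged this way is only a candidate. For a slowly evolving gene such as 28S, identical sequences between close relatives may be perfectly normal.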

Page 15: Sequence Matrix: Gene concatenation made easy

H. sapiens not present

✤ Use “Pairwise Distance Mode” to look for unusual pairwise distances.

✤ Ignore one charset, then sort taxa based on their pairwise distance to a “reference taxon”.

✤ Colour sequences by their individual pairwise distances to the reference taxon.

Page 16: Sequence Matrix: Gene concatenation made easy

H. sapiens not present

✤ Colour pairwise distances on the gene in question by their pairwise distance to the reference taxon.

✤ Look for colour variation which is unusual or out of place.

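✤ We would expect pairwise distances to be correlated across genes: a taxon that is far from the reference taxon on the other genes should not be unusually close to it on the gene in question.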
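
The idea behind Pairwise Distance Mode can be sketched as follows: rank taxa by their distance to a reference taxon over every gene except the one being checked, then list the distance on that gene alongside. The function names and the use of uncorrected p-distances are assumptions for illustration, not a description of Sequence Matrix's internals.

```python
from typing import Dict, List, Optional, Tuple

AMBIGUOUS = set("?-N")  # sites skipped when comparing two sequences


def count_differences(a: str, b: str) -> Tuple[int, int]:
    """Return (differing sites, compared sites) for two aligned sequences."""
    differing = compared = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in AMBIGUOUS or y in AMBIGUOUS:
            continue
        compared += 1
        differing += x != y
    return differing, compared


def distance(genes, gene_names, taxon, reference) -> Optional[float]:
    """Uncorrected p-distance pooled over the listed genes, or None if nothing is comparable."""
    differing = compared = 0
    for gene in gene_names:
        seqs = genes[gene]
        if taxon in seqs and reference in seqs:
            d, c = count_differences(seqs[taxon], seqs[reference])
            differing += d
            compared += c
    return differing / compared if compared else None


def rank_against_reference(
    genes: Dict[str, Dict[str, str]], focus_gene: str, reference: str
) -> List[Tuple[str, Optional[float], Optional[float]]]:
    """Rows of (taxon, distance on other genes, distance on focus gene), nearest first."""
    others = [g for g in genes if g != focus_gene]
    taxa = sorted({t for seqs in genes.values() for t in seqs} - {reference})
    rows = [
        (t, distance(genes, others, t, reference), distance(genes, [focus_gene], t, reference))
        for t in taxa
    ]
    # Taxa with no comparable data on the other genes sort last.
    rows.sort(key=lambda r: (r[1] is None, r[1] if r[1] is not None else 0.0))
    return rows


genes = {
    "coi": {"Ref": "ACGTACGT", "Near": "ACGTACGA", "Far": "ACGTACGT"},
    "cytb": {"Ref": "TTGGCCAA", "Near": "TTGGCCAT", "Far": "TAGCCGAT"},
}
for row in rank_against_reference(genes, focus_gene="coi", reference="Ref"):
    print(row)
```

In this made-up example, "Far" is distant from the reference on cytb but identical to it on coi. That is the kind of out-of-place value the colouring is meant to draw attention to.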

Page 17: Sequence Matrix: Gene concatenation made easy

Pairwise distance mode

✤ You need to vary:

✤ The gene you are studying.

✤ The reference taxon being compared against.

✤ Possibly helpful as an alert mechanism.

Page 18: Sequence Matrix: Gene concatenation made easy

Summary

✤ Sequence Matrix allows you to assemble and examine multigene, multitaxon datasets.

✤ Taxonsets allow you to analyse subsets of your data in downstream programs.

✤ Excising sequences gives you greater control over which sequences to analyse.

✤ You can look for contamination in two ways:

✤ Looking for very low pairwise distances across your entire dataset.

✤ Looking for unusual pairwise distances in Pairwise Distance Mode.

Page 19: Sequence Matrix: Gene concatenation made easy

Acknowledgements

✤ Rudolf Meier

✤ Zhang Guanyang

✤ Farhan Ali

✤ David Lohman

✤ Everybody at the NUS DBS Evolutionary Biology lab.

Page 20: Sequence Matrix: Gene concatenation made easy

Question time!