Anat Caspi, homepage
  • Prior to 2002, I was primarily interested in computer vision and pattern recognition. My main research interests were in motion flow fields and inferring motion boundaries from motion sequences. As a sub-problem, I was concerned with image registration and tracking. Upon returning to school in 2002, I switched my research focus from computer vision to biosequence analysis and comparative genomics. I have worked extensively with Markov Random Fields, hidden Markov models, and Support Vector Machines.
  • In the past year and a half, I have been involved in several separate projects. The first was an application that discriminatively trains a classifier to recognize a particular class of proteins. The protein superfamily we chose to classify was the Receptor-Like Kinases. My collaborator (a plant biologist) and I put extensive work into the feature design portion of the project, incorporating both sequence-based and structure-based features. We then trained the classifier on a subset of A. thaliana proteins. We predicted new proteins of this class in A. thaliana and were able to extend the prediction to other genomes, such as O. sativa. Currently, a selection of our new predictions in A. thaliana is being experimentally validated.
  • A second project, also related to genome annotation, is part of the Berkeley Drosophila Genome Project (BDGP), a collaboration among several Berkeley laboratories to assemble, align and annotate twelve Drosophila genomes. The Pachter lab has developed whole-genome alignment applications, which are being used to align the new genomic scaffold sequences as they are sequenced. The portions of the genomes that are hardest to assemble and align are the middle-repetitive and highly repetitive regions. I have developed new comparative methods to identify and annotate transposable duplicating elements in the genomes. This is the first comparative technique to identify repetitive elements.
  • As part of the Rat Genome Sequencing Consortium, I was responsible for showing that genome alignment strategies are useful in determining gene evolution, including information about gene expansion and duplication events. An analysis of glyceraldehyde-3-phosphate dehydrogenase (GAPDH)-related families appears in the rat paper (Rat Genome Sequencing Consortium, Nature, 2004). Using phylogenetic models, I was able to infer three things: that the GAPDS (testis-specific) gene arose from a duplication of the GAPDH gene; that biogenesis of the GAPDH pseudogenes has occurred steadily over time, both before and after the rodent-human and mouse-rat divergences; and that the GAPDS gene has undergone little retrotransposition in all three genomes compared with its relative, the GAPDH gene (consistent with their respective gene-expression levels in the germ line).
  • Lastly, I have proposed a novel method for biological sequence comparison which expands the model of mutation to include operators for chromosomal duplication, inversion and micro-rearrangements. The method uses a graphical model (specifically, a Markov network, also called a Markov random field) with an underlying statistical model to express the relationships among components of the compared sequences. The approach is therefore probabilistic (like hidden Markov models or profile approaches, as in HMMER, SAM, etc.). Utilizing the unique underlying structure of the sequence comparison problem, I find the maximum a posteriori assignment of homologs. The model prefers assignments without mismatches, gaps, duplications or inversions, but allows them. In this sense, it is also a combinatorial optimization scheme (like MSA or SAGA). This approach allows me to compare the significance of one homology assignment against another. Notably, the method is not restricted to pairwise sequence comparisons (whereas traditional scoring matrices and gap costs are not directly applicable in multiple sequence alignment) and can be directly extended to multiple sequence homology assignments. The extension to multiple sequences is also guaranteed to infer the optimal-scoring homology assignment, and it does not proceed iteratively or progressively.
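
To make the idea of a maximum a posteriori (MAP) homology assignment concrete, here is a minimal toy sketch in Python. It is not the method described above: it assumes a much simpler chain-structured Markov random field over two sequences, and every name and penalty value in it (`unary`, `pairwise`, `map_assignment`, the scores -1.0, -1.5, -2.0, -3.0) is hypothetical and chosen only for illustration. Each position of sequence `a` is assigned either a position in `b` or a gap; collinear assignments are free, while mismatches, gaps, duplications, inversion-like steps and other rearrangements are penalized but allowed. Because the toy model is a chain, max-product dynamic programming recovers the exact MAP assignment.

```python
from typing import List, Optional, Tuple

GAP = None  # label meaning "this position of a has no homolog in b"


def unary(a_char: str, b: str, label: Optional[int]) -> float:
    """Log-potential for assigning one character of a to a label (illustrative values)."""
    if label is GAP:
        return -2.0
    return 0.0 if a_char == b[label] else -1.0


def pairwise(prev: Optional[int], cur: Optional[int]) -> float:
    """Log-potential linking the labels of two adjacent positions of a (illustrative values)."""
    if prev is GAP or cur is GAP:
        return 0.0
    step = cur - prev
    if step == 1:
        return 0.0   # collinear: the preferred case
    if step == 0:
        return -1.0  # two positions of a map to the same position of b (duplication-like)
    if step == -1:
        return -1.5  # order reversed (inversion-like)
    return -3.0      # any other jump (rearrangement-like)


def map_assignment(a: str, b: str) -> Tuple[float, List[Optional[int]]]:
    """Return the best log-score and the MAP label for each position of a."""
    if not a or not b:
        raise ValueError("both sequences must be non-empty")
    labels: List[Optional[int]] = list(range(len(b))) + [GAP]
    # best[k]: best log-score of the prefix of a seen so far, ending in labels[k]
    best = [unary(a[0], b, lab) for lab in labels]
    back: List[List[int]] = []
    for ch in a[1:]:
        new_best, ptr = [], []
        for cur in labels:
            scores = [best[k] + pairwise(prev, cur) for k, prev in enumerate(labels)]
            k_star = max(range(len(labels)), key=scores.__getitem__)
            new_best.append(scores[k_star] + unary(ch, b, cur))
            ptr.append(k_star)
        best = new_best
        back.append(ptr)
    # Trace back from the best final label to recover the full assignment.
    k = max(range(len(labels)), key=best.__getitem__)
    score = best[k]
    path = [labels[k]]
    for ptr in reversed(back):
        k = ptr[k]
        path.append(labels[k])
    path.reverse()
    return score, path


if __name__ == "__main__":
    print(map_assignment("ACGT", "ACGT"))
    print(map_assignment("AXC", "AC"))
```

Since the log-score of any two candidate assignments can be computed under the same potentials, the sketch also shows, in a very reduced form, how one homology assignment can be compared against another. A general Markov random field over several sequences would not have this chain structure and would need a different inference procedure.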