 |
-
Prior to 2002, I was primarily interested in computer vision and
pattern recognition. My main research interests were in motion flow
fields and inferring motion boundaries from motion sequences. As a
sub-problem, I was concerned with image registration and tracking.
Upon returning to school in 2002, I switched my research focus
from computer vision to biosequence analysis and comparative
genomics. I have worked extensively with Markov random fields,
hidden Markov models, and support vector machines.
- In the past year and a half, I have been involved in several
separate projects. The first was an application to discriminatively
train a classifier to recognize a particular class of proteins,
the Receptor-Like Kinase superfamily. My
collaborator (a plant biologist) and I did extensive research on
the feature design portion of the project, to incorporate both
sequence- and structure-based features. We then trained the
classifier on a subset of A. thaliana proteins. We predicted new
proteins of this class in A. thaliana, and were able to extend
the prediction to other genomes, such as O. sativa. Currently, a
selection of our new predictions in A. thaliana is being
experimentally validated.
- A second project, also related to genome annotation, is
part of the Berkeley Drosophila Genome Project (BDGP), a collaboration
among several Berkeley laboratories to assemble, align and
annotate twelve Drosophila genomes. The Pachter lab has developed
whole-genome alignment applications which are being used to align
the new genomic scaffolds as they are sequenced. The portions of the
genomes hardest to assemble and align are the moderately and highly
repetitive regions. I have developed new comparative methods by
which to identify and annotate transposable elements
in the genomes. To my knowledge, this is the first comparative
technique for identifying repetitive elements.
- As part of the Rat Genome Sequencing Consortium, I was responsible
for showing that genome alignment strategies are useful in
determining gene evolution, including information about gene
expansion and duplication events. An analysis of
glyceraldehyde-3-phosphate dehydrogenase (GAPDH)-related families
appears in the rat paper (Rat Genome Sequencing Consortium,
Nature, 2004). Using phylogenetic models, I was able to infer
three things: that the GAPDS (testis-specific) gene
arose from a duplication of the GAPDH gene; that biogenesis of the
GAPDH pseudogenes has been occurring steadily over time, both
before and after the rodent-human and mouse-rat divergences; and that
the GAPDS gene has undergone little retrotransposition in all three
genomes compared with its relative, the GAPDH gene (consistent
with their respective gene-expression levels in the germ line).
- Lastly, I have proposed a novel method for
biological sequence comparison which expands the model of
mutation to include operators for chromosomal duplication,
inversion and micro-rearrangements. The method uses a graphical
model (specifically, a Markov network, or Markov random field)
with an underlying
statistical model to express the relationships among components of
the compared sequences. The approach is therefore probabilistic (like
hidden Markov models or profile approaches, as in HMMER, SAM, etc.).
Utilizing the unique underlying structure of the
sequence comparison problem, I find the maximum a posteriori
assignment of homologs. The model prefers assignments with no
mismatches, gaps, duplications or inversions, but allows them. In
this sense, it is a combinatorial optimization scheme (like MSA or
SAGA). The approach also allows me to compare the significance
of one homology assignment against
another. Notably, the method is not restricted to pairwise
sequence comparisons: whereas traditional scoring matrices and gap
costs are not directly applicable to multiple sequence alignment,
this method extends directly to multiple-sequence homology
assignment. The extension is also guaranteed to infer the
optimal-scoring homology assignment, and does not proceed
iteratively or progressively.
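
A toy sketch in Python may make the idea of a maximum a posteriori
homology assignment concrete. It is not the method described above:
the chain-structured model, the function names (unary, pairwise,
score, map_assignment) and every numeric cost are assumptions chosen
for illustration, and the search is a brute-force enumeration that
is only feasible for very short sequences.

```python
from itertools import product

GAP = None  # label meaning "this position has no homolog in b"


def unary(a_char, b, label):
    """Log-potential for assigning one position of a to a position of b."""
    if label is GAP:
        return -2.0  # hypothetical gap cost
    return 1.0 if a_char == b[label] else -1.0  # match vs. mismatch


def pairwise(prev, cur):
    """Log-potential linking the labels of two adjacent positions of a."""
    if prev is GAP or cur is GAP:
        return 0.0
    step = cur - prev
    if step == 1:
        return 0.0   # collinear neighbours: no penalty
    if step == 0:
        return -1.5  # both map to the same position: duplication
    if step == -1:
        return -1.0  # order reversed: inversion
    return -3.0      # any other jump: rearrangement


def score(a, b, labels):
    """Total log-score of one complete homology assignment."""
    total = sum(unary(ch, b, lab) for ch, lab in zip(a, labels))
    total += sum(pairwise(p, c) for p, c in zip(labels, labels[1:]))
    return total


def map_assignment(a, b):
    """Exhaustively search all labelings and return the best-scoring one."""
    states = [GAP] + list(range(len(b)))
    best = max(product(states, repeat=len(a)),
               key=lambda labels: score(a, b, labels))
    return list(best), score(a, b, best)


if __name__ == "__main__":
    labels, best_score = map_assignment("ACGT", "ACGT")
    print(labels, best_score)
    # Two candidate assignments can be compared through their scores.
    print(score("ACGT", "ACGT", (0, 1, 2, 3)),
          score("ACGT", "ACGT", (0, 1, 3, 2)))
```

Because every assignment receives a score under the same model,
two assignments can be ranked directly, which is the sense in which
one homology assignment can be compared against another. The
non-collinear cases are penalized but never forbidden, mirroring the
preference for assignments without duplications or inversions.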
|
|