Abstract:
Defining relationships between species is a fundamental problem in bioinformatics.
One way to define these relationships is to detect gene clusters, a task that can be
formulated as a combinatorial problem called the Approximate Gene Cluster Discovery
Problem (AGCDP). Graph concepts have been applied to several genomic studies.
AGCDP can be reduced to an optimization problem on graphs, specifically the Minimum
Weight t-Partite Clique Problem (MWtCP). The goal of MWtCP is to create
a t-partite graph and to find a t-star with minimum weight, which is used to approximate
a t-clique. Clustar is a tool that applies an algorithm which solves the
MWtCP for detecting gene clusters. It allows the user to detect gene clusters using
three methods: approximate gene clustering, exact gene clustering (using GPU), and
exact gene clustering (without using GPU). Clustar is able to produce candidate gene
clusters and their alignment among the genomes, as well as the graph representation
and the adjacency matrix produced from the generated graph. To verify the validity
of the results produced by Clustar, a dataset containing homologous genes from 30
$\gamma$-proteobacterial genomes was processed using Clustar and other algorithms such as
the Row's Subset of Symmetric Matrix (RSSM) and hierarchical clustering. Several
gene clusters were common to all three algorithms across different gene
cluster sizes. A second dataset, containing genes from E. coli and B. subtilis for
which several gene groups are already established, was also processed. Clustar was able
to produce candidate gene clusters that matched these gene groups across different gene
cluster sizes.
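
To make the t-star idea concrete, the following is a minimal Python sketch of a brute-force search for a minimum-weight t-star in a t-partite graph. It is an illustration under stated assumptions, not Clustar's actual implementation: the function name min_weight_t_star, the representation of each vertex as a set of gene labels, the symmetric-difference edge weight, and the toy data are all hypothetical choices made for this example.

```python
def min_weight_t_star(parts, weight):
    """Return (total_weight, center, leaves) of a minimum-weight t-star.

    parts  : list of t lists of vertices, one list per genome (assumed layout)
    weight : symmetric function weight(u, v) -> number (assumed edge weight)

    A t-star consists of one center vertex plus one leaf from every other
    part; its weight is the sum of the center-to-leaf edge weights.
    """
    best = None
    for i, part in enumerate(parts):
        for center in part:
            leaves = []
            total = 0.0
            for j, other in enumerate(parts):
                if j == i:
                    continue  # the center's own part contributes no leaf
                # For a fixed center, the cheapest leaf in each part is independent
                leaf = min(other, key=lambda v: weight(center, v))
                leaves.append(leaf)
                total += weight(center, leaf)
            if best is None or total < best[0]:
                best = (total, center, leaves)
    return best


def symmetric_difference_weight(u, v):
    """Hypothetical weight: number of genes present in exactly one of the two sets."""
    return len(u ^ v)


# Toy input: three genomes, each offering two candidate gene sets (made-up labels).
toy_parts = [
    [frozenset("abc"), frozenset("xyz")],
    [frozenset("abd"), frozenset("pq")],
    [frozenset("abce"), frozenset("xy")],
]

result = min_weight_t_star(toy_parts, symmetric_difference_weight)
print(result)
```

On this toy input the cheapest star is centered on the gene set {a, b, c} from the first genome, with total weight 3. Because the best leaf in each part can be chosen independently once the center is fixed, the search needs only one pass over all vertex pairs, whereas a minimum-weight t-clique would have to account for every leaf-to-leaf edge as well.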