KEGG Syntax - Genome Alignment

Genome Alignment

Genome alignment is usually done by aligning nucleotide sequences of two genomes. Here the genome is considered as a sequence of genes identified by KOs (K numbers) and the genome alignment is done by aligning sequences of matching K numbers. Thus, this approach significantly simplifies the problem of gene order alignment.

We have developed a new tool for finding all instances of locally similar gene orders in two genomes above a given threshold using the Goad-Kanehisa algorithm (see below).

More details about this implementation

Gene orders are available for KEGG organisms with the NCBI assembly level of "Complete Genome" or "Chromosome" (see br08611) and all viruses (see br08621).
Genes for CDS, tRNA and rRNA are considered.
Since genes are labeled with K numbers, the gene order is converted to a sequence of K numbers.
The algorithm applies to comparsion of two such gene order sequences with the scoring of
match: 1, mismatch: -1, gap: -1, neutral: 0
where neutral means the alignment of genes without K numbers.
Locally similar gene orders with the score of 3 or more are reported.
When genes with the same K numbers are repeated, they are combined into a single unit with the number of repeats in parentheses in the output, enabling the alignment of varying numbers of repeats in two sequences.
When genes on the complementary strand are matched, they are marked with "<" in the output.
Comparison of gene order sequences is made twice in two directions: forward-forward and forward-reverse directions.
The reverse direction is marked with "(r)" in the output.

Output example

In the graphical diagram, red lines indicate matched regions in the forward-forward alignment, and green lines in the forward-reverse alignment.
In the detailed view, locally similar regions are listed in tabular forms, where gene IDs from two genomes with matching K numbers in the middle are shown. "KO list" link shows the names of KOs in the table, and "Genome browser" link shows actual gene locations in the chromosomes.

About Goad-Kanehisa Algorithm

In the early 1980s during the pre-GenBank project of Los Alamos Sequence Library, an algorithm for finding locally similar regions of two sequences was developed by Goad and Kanehisa and reported in Nucleic Acids Res 10:247-263 (1982) [doi]. The essence of this algorithm is to perform pruning of paths by taking a logical product of forward and reverse path matrices, in addition to the pruning associated with the weighting scheme of not allowing negative score values, which is similar to the Smith-Waterman algorithm [doi] as mentioned in their Note added in proof.

For protein and nucleic sequence alignments, the approach taken by Smith and Waterman for finding the best local similarity is sufficient. However, for the gene order alignment of two genomes, in which many gene positions are likely to be split and changed, the Goad-Kanehisa algorithm is better suited for finding a comprehensive set of local similarities.