Protein sequence alignment has become a widely used method in the study of
newly sequenced proteins. Most sequence alignment methods use an affine gap
penalty to assign scores to insertions and deletions. Although affine gap
penalties represent the relative ease of extending a gap compared with
initializing a gap, they are still an obvious oversimplification of the real
processes that occur during sequence evolution. To improve the efficiency of
sequence alignment methods and to obtain a better understanding of the process
of sequence evolution, we wanted to find a more accurate model of insertions
and deletions in homologous proteins. In this work, we extract the probability
of a gap occurrence and the resulting gap length distribution in distantly
related proteins (sequence identity < 25%) using alignments based on
their common structures. We observe a distribution of gaps that can be fitted
with a multiexponential with four distinct components. The results suggest
new approaches to modeling insertions and deletions in sequence alignments.
(C) 2001 Wiley-Liss, Inc.
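
As a rough illustration of the contrast drawn above, the sketch below compares an affine gap penalty with a length-dependent penalty taken as the negative logarithm of a four-component multiexponential. It is a minimal sketch under stated assumptions, not the method used in this work: the function names, the default gap-open and gap-extend costs, and the amplitudes and decay lengths in EXAMPLE_COMPONENTS are all hypothetical placeholders, not fitted values.

```python
import math


def affine_gap_penalty(length, gap_open=10.0, gap_extend=1.0):
    """Affine cost: one opening charge plus a fixed charge per extra residue.

    The default costs are illustrative assumptions only.
    """
    if length < 1:
        raise ValueError("gap length must be >= 1")
    return gap_open + gap_extend * (length - 1)


def multiexponential(length, components):
    """Sum of exponentials: sum_i a_i * exp(-length / tau_i)."""
    return sum(a * math.exp(-length / tau) for a, tau in components)


def log_odds_gap_penalty(length, components):
    """Penalty defined as -ln of the modelled relative gap frequency."""
    return -math.log(multiexponential(length, components))


# Hypothetical (amplitude, decay length) pairs chosen for illustration.
# These are NOT the fitted parameters reported in the paper.
EXAMPLE_COMPONENTS = [(0.5, 1.0), (0.3, 3.0), (0.15, 10.0), (0.05, 40.0)]

if __name__ == "__main__":
    for n in (1, 2, 5, 10, 20, 50):
        affine = affine_gap_penalty(n)
        multi = log_odds_gap_penalty(n, EXAMPLE_COMPONENTS)
        print(f"length={n:3d}  affine={affine:6.2f}  multiexp={multi:6.3f}")
```

With a single exponential component the log-based penalty grows linearly with gap length, which is the affine case. With several components of different decay lengths, each additional residue costs less as the gap grows, so long gaps are penalized less steeply than under an affine scheme.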