Biological Sequence Kernels with Guaranteed Flexibility

Alan N. Amin, Debora S. Marks, Eli N. Weinstein; 26(216):1−63, 2025.

Abstract

Applying machine learning to biological sequences (DNA, RNA and protein) has enormous potential to advance human health and environmental sustainability. To support such high-stakes applications, it is important to develop models and evaluations that not only capture underlying biology, but also have theoretical guarantees of reliability and performance. In this article, we analyze kernel methods for biological sequences, including both hand-crafted kernels and deep neural network-based kernels. We show that popular biological kernels can severely fail at learning functions or distinguishing distributions. We then develop modified kernels that (1) are universal, characteristic, and metrize the space of distributions, and (2) preserve the underlying biological inductive biases and domain knowledge embedded in the original kernel. Our results rest on novel proof techniques for kernels that handle both the structure of biological sequence space (discrete, variable-length sequences) and biological notions of sequence similarity. We illustrate our theoretical results in simulation and on real biological data sets.
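
As a rough illustration of what "failing to distinguish distributions" can mean, the sketch below uses a plain k-mer spectrum kernel. This is an assumption made for illustration: the abstract does not say which kernels the article analyzes, and the function names here are hypothetical. Two different sequences with identical k-mer counts map to the same feature vector, so the squared maximum mean discrepancy (MMD) between point masses at those sequences is zero and the kernel cannot be characteristic.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every length-k substring (k-mer) of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(x, y, k=2):
    """Inner product of the k-mer count vectors of x and y."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    # Counter returns 0 for k-mers missing from cy.
    return sum(cx[m] * cy[m] for m in cx)


def squared_mmd_point_masses(x, y, k=2):
    """Squared MMD between point masses at x and y: k(x,x) + k(y,y) - 2 k(x,y)."""
    return (spectrum_kernel(x, x, k)
            + spectrum_kernel(y, y, k)
            - 2 * spectrum_kernel(x, y, k))


# Illustrative sequences chosen by hand; both contain the 2-mers AA, AC, CA once.
x, y = "AACA", "ACAA"
print(x != y, squared_mmd_point_masses(x, y))
```

Here the two sequences differ, yet the kernel assigns them zero distance. A characteristic kernel, as defined in the abstract's terms, would give a strictly positive value for any two distinct distributions.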

© JMLR 2025.