Evo: A New Approach to Genomic Sequencing
Researchers have developed a system called Evo that uses large language models (LLMs) to understand and generate genomic sequences. The core principle behind Evo is its ability to “link nucleotide-level patterns to kilobase-scale genomic context,” meaning it can interpret fragments of genomic DNA much as an LLM interprets text, and produce relevant outputs. This allows Evo to predict and even create functional genetic material.
Initial tests focused on Evo’s ability to reconstruct known proteins. When provided with incomplete gene sequences, as little as 30% of a gene, Evo successfully predicted a meaningful portion of the missing information, achieving 85% completion. With 80% of a gene sequence provided, Evo accurately reconstructed the entire sequence. Furthermore, it demonstrated the ability to identify and reinstate missing genes within established functional clusters.
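The prompt-and-complete test described above can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' evaluation code: the article does not say how the completion figures were computed, so the position-wise identity score used here is an assumption, and `split_prompt`, `recovery`, `evaluate`, and the `complete` callable are hypothetical names standing in for a real model call.

```python
from typing import Callable, Tuple


def split_prompt(gene: str, fraction: float) -> Tuple[str, str]:
    """Split a gene into a prompt (its leading fraction) and the held-out remainder."""
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")
    cut = int(len(gene) * fraction)
    return gene[:cut], gene[cut:]


def recovery(predicted: str, truth: str) -> float:
    """Fraction of held-out positions where the prediction matches the truth.

    Assumption: a simple position-wise comparison with no alignment step.
    Positions the prediction does not cover count as mismatches.
    """
    if not truth:
        return 1.0
    matches = sum(p == t for p, t in zip(predicted, truth))
    return matches / len(truth)


def evaluate(gene: str, fraction: float,
             complete: Callable[[str, int], str]) -> float:
    """Prompt a model with part of a gene and score its completion.

    `complete(prompt, n)` is a hypothetical stand-in for a model call that
    returns n characters continuing the prompt.
    """
    prompt, held_out = split_prompt(gene, fraction)
    predicted = complete(prompt, len(held_out))
    return recovery(predicted, held_out)
```

Passing the model in as a plain callable keeps the scoring logic independent of any particular model interface, so the same harness could compare different prompt fractions, such as the 30% and 80% cases mentioned above.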
This accuracy stems from Evo’s extensive training on a vast dataset of bacterial genomes. This training enabled the system to recognise critical regions within proteins and, when making alterations to sequences, to confine those changes to areas where genetic variation is naturally tolerated. Essentially, Evo has learned the evolutionary constraints governing gene structure and function.
To explore Evo’s potential for innovation, the researchers challenged it to generate entirely new protein sequences. They focused on bacterial toxins, which are often paired with antitoxins that protect the producing cell. The team designed a novel toxin, distantly related to known toxins and lacking a corresponding antitoxin, and used its sequence as a prompt for Evo. To ensure novelty, any generated sequences resembling known antitoxins were excluded from the results. This experiment aimed to determine whether Evo could produce genuinely new genetic information, rather than simply replicating existing sequences.
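The novelty filter described above can be illustrated with a short Python sketch. It is only an approximation under stated assumptions: the article does not say how resemblance to known antitoxins was measured, so this example uses the standard library's `difflib.SequenceMatcher` ratio as a crude stand-in for a proper alignment-based homology search, and the `max_similarity` threshold of 0.7 is an arbitrary, hypothetical value.

```python
from difflib import SequenceMatcher
from typing import Iterable, List


def similarity(a: str, b: str) -> float:
    """Crude similarity score in [0, 1] between two sequences.

    Assumption: difflib's ratio stands in for a real alignment-based
    comparison. autojunk is disabled so that frequent residues in long
    sequences are not ignored.
    """
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def filter_novel(candidates: Iterable[str], known: Iterable[str],
                 max_similarity: float = 0.7) -> List[str]:
    """Keep only candidates that do not resemble any known sequence.

    max_similarity is a hypothetical cutoff chosen for illustration.
    """
    known_list = list(known)
    return [
        candidate for candidate in candidates
        if all(similarity(candidate, k) <= max_similarity for k in known_list)
    ]
```

A candidate identical to a known antitoxin scores 1.0 and is dropped, while one sharing no residues scores 0.0 and is kept. Real pipelines would normally compare against a large sequence database rather than a short in-memory list.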