Applying a phylogenomic framework to analyse pathogenic genomes

Mobility of antimicrobial resistance and virulence genes

Horizontal gene transfer between bacterial lineages is widespread and plays a key role in the evolution of antimicrobial resistance and virulence. Despite its clinical importance, however, we have only a limited understanding of the general trends and impacts of gene exchange between pathogens and multidrug-resistant commensal bacteria. We – together with the groups of Balázs Papp and Bálint Kintses (HUN-REN Biological Research Centre, Institute of Biochemistry, Szeged, Hungary) – address these issues by analysing the gene exchange networks of human microbiota, multidrug-resistant and pathogenic bacteria alike.

Large-scale genomic surveillance

Antibiotic-resistant bacterial infections pose a significant public health threat. The success of novel precision therapies – which selectively target specific pathogen types – depends on the accurate identification of the causative agents. To address this challenge, we employ a new approach: large-scale genomic surveillance. This method enables the characterisation of the spatial and temporal distribution of bacterial variants, as well as their biological traits. In doing so, it supports the identification of patients who require the same targeted precision treatment.

The project is funded by the Supported Research Groups Programme 2025–2028: Genomic surveillance for precision therapies against antibiotic-resistant bacteria; PI: Balázs Papp; grant no. TKCS-2024/66.

Investigating the genomic epidemiology of the Hungarian SARS-CoV-2 genomes

We compare the virus genomes of Hungarian samples to genomes from other countries and infer a time-scaled phylogenetic tree. Based on this tree, we can ascertain the relatives – and potential origins – of the Hungarian clusters, the time of their emergence, and the extent of each clade. We published our results in the journal Virus Evolution.

Developing databases for biologists

TFLink: a transcription factor - target gene interaction database

We created, maintain, and update the TFLink database, which uniquely provides comprehensive and highly accurate information on transcription factor - target gene interactions, nucleotide sequences, and genomic locations of transcription factor binding sites for humans and six model organisms. We integrated the results of small- and large-scale approaches from ten different databases. We are working to make the database organ-, tissue-, and cell-specific using data obtained by various high-throughput methods. TFLink is already popular, with more than 16,000 active users worldwide. The related publication in the journal Database has become highly cited.

BacCurate: standardizing sequencing metadata for 1.4 million samples of ESKAPEE pathogens

We are working to standardize sequencing metadata for 1.4 million samples of ESKAPEE pathogens, making large public genome collections more useful for antimicrobial resistance research and surveillance. Using large language models, we transform unstructured metadata such as host, isolation source, collection date, and geographic location into consistent, machine-readable formats. This work helps turn fragmented public repository records into a reusable resource for epidemiology, resistance tracking, and One Health analyses across human, animal, and environmental reservoirs.
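To illustrate what "consistent, machine-readable" can mean for such fields, here is a minimal rule-based sketch in Python. It is a stand-in for illustration only, not the large language model pipeline described above. The vocabulary, the accepted date formats, and the function names (`normalise_date`, `normalise_host`) are all assumptions made for this example.

```python
from datetime import datetime

# Hypothetical controlled vocabulary; a real one would be far larger.
HOST_SYNONYMS = {
    "human": "Homo sapiens",
    "homo sapiens": "Homo sapiens",
    "patient": "Homo sapiens",
    "chicken": "Gallus gallus",
    "gallus gallus": "Gallus gallus",
}

# Free-text placeholders that should be treated as missing values.
MISSING = {"", "na", "n/a", "missing", "unknown", "not collected"}

# (input format, output format) pairs; the output keeps only the
# precision that the input actually had (day, month, or year).
# Assumption: slash-separated dates are day/month/year.
DATE_FORMATS = [
    ("%Y-%m-%d", "%Y-%m-%d"),
    ("%d/%m/%Y", "%Y-%m-%d"),
    ("%Y-%m", "%Y-%m"),
    ("%b %Y", "%Y-%m"),
    ("%Y", "%Y"),
]


def normalise_date(raw):
    """Return an ISO 8601 date string, or None if it cannot be parsed."""
    text = raw.strip()
    if text.lower() in MISSING:
        return None
    for fmt_in, fmt_out in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt_in).strftime(fmt_out)
        except ValueError:
            continue
    return None


def normalise_host(raw):
    """Map a free-text host label to a controlled term, or None."""
    return HOST_SYNONYMS.get(raw.strip().lower())
```

Returning `None` for anything unrecognised, rather than guessing, keeps unresolved records easy to find and route to a more capable step such as a language model or manual review.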

Developing R and Python packages for bioinformatics analyses

mulea

We developed the mulea (multi enrichment analysis) and muleaData R packages. Together they form an extensive analytical tool that draws on diverse databases (e.g., Gene Ontology, pathways, miRNAs, transcription factors, or protein domains) and provides statistical models and p-value correction procedures that help interpret the results of various high-throughput analyses. mulea uniquely provides a permutation-based, empirical false discovery rate correction of the p-values, making gene set overrepresentation analyses more reliable.
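The general idea behind a permutation-based empirical false discovery rate can be sketched in a few lines of Python. This is a generic illustration under simple assumptions (a one-sided hypergeometric test, random target sets of the same size as the real one), not the mulea implementation or its API. All function names here are hypothetical.

```python
import random
from math import comb


def hypergeom_pvalue(k, K, n, N):
    """P(X >= k) when n genes are drawn from N, of which K are in the set."""
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def ora(target, gene_sets, background):
    """Overrepresentation p-value of the target list for every gene set."""
    universe = set(background)
    target = set(target) & universe
    pvals = {}
    for name, members in gene_sets.items():
        members = set(members) & universe
        overlap = len(members & target)
        pvals[name] = hypergeom_pvalue(
            overlap, len(members), len(target), len(universe)
        )
    return pvals


def empirical_fdr(target, gene_sets, background, n_perm=200, seed=0):
    """Estimate an empirical FDR for each gene set from random target lists."""
    rng = random.Random(seed)
    universe = sorted(set(background))
    observed = ora(target, gene_sets, universe)
    n_target = len(set(target) & set(universe))

    # Null distribution: p-values obtained with random targets of equal size.
    null_pvals = []
    for _ in range(n_perm):
        fake_target = rng.sample(universe, n_target)
        null_pvals.extend(ora(fake_target, gene_sets, universe).values())

    efdr = {}
    for name, p in observed.items():
        # Average number of null p-values at least as small, per permutation.
        expected_false = sum(q <= p for q in null_pvals) / n_perm
        # Number of observed p-values at least as small (always >= 1).
        n_called = sum(q <= p for q in observed.values())
        efdr[name] = min(1.0, expected_false / n_called)
    return observed, efdr
```

Because the null p-values are computed on the same gene sets as the observed ones, the estimate reflects the overlap between gene sets, which corrections that assume independent tests do not account for.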

treetune

DNA or protein sequence data used for reconstructing phylogenetic trees can contain various errors due to contamination, low-quality genome assembly, or misclassification of taxa. While these errors are generally identified at the sequence level, undetected errors often result in leaves that appear as unusually long branches on the inferred phylogenetic tree. Therefore, pruning is key to detecting and removing erroneous tips from the phylogenetic trees. We implemented treetune in R and Python with three novel pruning algorithms along with flexible combined workflows, allowing users to remove extremely long branches under customizable retention thresholds. By optimising tree radius and improving root‑to‑tip regression, treetune enhances phylogenetic accuracy while preserving most of the input taxa.
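As a rough illustration of pruning under a retention threshold, the Python sketch below flags tips whose root-to-tip distance is an outlier, while never removing more than an allowed fraction of taxa. The outlier rule used here (median plus `k` times the median absolute deviation) is a generic stand-in chosen for this example and is not one of the three treetune algorithms. The tree representation and all names are hypothetical.

```python
from math import ceil
from statistics import median


def root_to_tip(parent, length):
    """Root-to-tip distance for every leaf of a rooted tree.

    parent maps each non-root node to its parent node;
    length maps each non-root node to the length of the branch above it.
    """
    internal = set(parent.values())
    leaves = [node for node in parent if node not in internal]
    dist = {}
    for leaf in leaves:
        node, total = leaf, 0.0
        while node in parent:  # walk up until the root is reached
            total += length[node]
            node = parent[node]
        dist[leaf] = total
    return dist


def flag_long_tips(dist, k=5.0, min_retained=0.9):
    """Return tips to prune, longest first, keeping >= min_retained of taxa."""
    med = median(dist.values())
    mad = median(abs(d - med) for d in dist.values())
    cutoff = med + k * mad
    candidates = sorted(
        (tip for tip, d in dist.items() if d > cutoff),
        key=dist.get,
        reverse=True,
    )
    # The small epsilon guards against floating-point noise in the product.
    min_keep = ceil(len(dist) * min_retained - 1e-9)
    max_removed = len(dist) - min_keep
    return candidates[:max_removed]
```

For a small tree in which one tip sits on a branch of length 5.0 while its neighbours have branches of about 0.1, `flag_long_tips(root_to_tip(parent, length), min_retained=0.8)` would return only that tip. Setting `min_retained=1.0` returns an empty list, since no taxon may then be removed.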