Course projects

During the second semester, each student works on an individual course project, which may be further developed into a Master's thesis.

The projects are supervised by leading experts from Russian and foreign scientific centers working in bioinformatics.

Projects of different years are listed below.
Projects, spring 2022
Comparative evaluation of gene expression deconvolution algorithms
Student: Alexey Serdyukov
Supervisor: Maxim Artyomov (Washington University in St. Louis)

Many methods have been proposed to infer cell type proportions from bulk gene expression data. In this work I present an approach for comparing such methods on both real and simulated data. The main focus was to implement a reproducible benchmark that can be used to evaluate a novel deconvolution method currently being developed in my supervisor's laboratory.
MCMC-based active module search in functional networks combined with ML-based gene prioritization significantly improves disease gene identification
Student: Ivan Molotkov
Supervisor: Mykyta Artomov (Massachusetts General Hospital / Broad Institute)

The identification of disease genes remains an important task in complex trait genetics. The main challenge is the mostly non-coding nature of the associated variants identified in GWAS. Therefore, the identification of susceptibility genes should be conducted in combination with functional information, such as gene expression (eQTL), chromatin modifications, etc. We describe a novel computational approach that combines gene prioritization based on positive-unlabeled learning with MCMC-based protein-protein network analysis for the active module search. The superiority of the resulting model over existing methods was evaluated on three phenotypes: inflammatory bowel disease, coronary artery disease and schizophrenia.
Clusters stability estimation in scRNA-seq data analysis
Student: Roman Smirnov
Supervisor: Maria Firulyova (Almazov Medical Research Center / ITMO)

The generation and analysis of single-cell data has become a very common way to explore tissue heterogeneity. There are now many algorithms for clustering these data to identify new cell types. However, many clustering pipelines rely on user-tuned parameters, which are usually chosen intuitively. Because a systematic approach is lacking, scientists often produce different results. Moreover, researchers often encounter underclustering or overclustering, which leads to misleading conclusions. To overcome this problem, a set of tools for evaluating cluster stability has been implemented. In this work, the two most promising tools, chooseR and scclusteval, were benchmarked. Based on our analysis, we conclude that chooseR can produce almost the same results as professional biologists do. However, this tool is still slow and should be optimized. scclusteval, in turn, has several advantages over chooseR: it produces two metrics instead of one and allows many datasets to be processed with a Snakemake workflow.
Analysis of next-generation sequencing cohort of patients with rare diseases
Student: Violetta Konygina
Supervisor: Natalia Petukhova (Pavlov Medical University)

Approximately 3.5–5.9% of the worldwide population is affected by rare diseases. In the Russian population there are about 36 thousand patients, 50% of whom are children. There is no doubt that NGS technologies have constituted a turning point in rare disease research, diagnosis, and treatment. In our study, we present in detail a DNA-seq analysis, based on GATK best practices, of patients diagnosed with rare diseases.
Exploration of VDR secondary structure elements involved into the ligand-binding mechanism using structural descriptors
Student: Igor Glukhov
Supervisor: Karina Pats

Structural descriptors are widely used in structural biology. They are typically applied to assess the extent of similarity between proteins. Another application is determining whether a protein belongs to a particular family. One current line of research is verifying the ability of such descriptors to distinguish patterns in the binding modes of nuclear receptors. We suppose that structural descriptors may serve as a tool for predicting allosteric communication in proteins such as nuclear receptors. For this purpose, a user-friendly Python package, which allows flexible calculation of the selected descriptors and their storage in a convenient format, was developed in the previous term. This semester the focus is on using those structural descriptors to explore the secondary structure elements of vitamin D receptors (VDRs) that are involved in the ligand-binding mechanism.
Whole-genome analysis of rare variants in Parkinson's disease and their impact on expression
Students: Svetlana Antonets, Alevtina Bogoliubova-Kuznetsova, Anna Rusnak
Supervisor: Konstantin Senkevich (McGill University)

Common variants in Parkinson's disease (PD) explain only up to 36% of heritability. Rare variants could account for part of the missing heritability. One aim of this work was to study the association of rare variants with PD by detecting aberrant splicing with FRASER and expression outliers with RIVER. As a result, a pipeline for detecting splicing outliers was developed and tested on GTEx data from the substantia nigra of healthy donors; it can be further applied to data from individuals with PD. Use of the RIVER package was temporarily postponed due to a lack of documentation and available data. [...] Additionally, we attempted to evaluate ANEVA-DOT, a statistical tool developed for the identification of AE outliers, and to apply it to identify rare PD-causing variants. We first performed a control study of known PD-causing genes on GTEx data. As expected, ANEVA-DOT detected statistically significant outliers, i.e. variants driving allelic imbalance. We also faced the expected limitations of the tool: sparsity of AE data and variance estimates. We then aimed to test it on data from dopaminergic neurons of patients with PD.
ICA-based clustering of gene expression data
Student: Yulia Galatonova
Supervisor: Alexey Sergushichev

Module detection is a key step in the analysis of gene expression datasets. Numerous gene clustering methods have been developed for this purpose. Among them, ICA-based clustering has the advantages of estimating the number of clusters and filtering out genes that do not participate in any biological process. However, overclustering (detection of small, meaningless pathways) and underclustering (when some meaningful clusters are not identified by the method) can bias the biological interpretation of experimental results. To explore underclustering and overclustering in the ICA-based clustering method, a pipeline for systematic benchmarking based on the results of the hypergeometric test was developed. 100 datasets were analyzed with the pipeline. This systematic analysis revealed numerous cases of under- and overclustering. Based on these results, a new approach to ICA clustering was implemented. A comparative analysis of the initial and corrected methods showed that the new approach reduces the occurrence of over- and underclustering, which benefits the interpretation of the biological processes in the analyzed datasets.
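The hypergeometric test mentioned above can be illustrated with a minimal sketch. It assumes the test asks whether a gene cluster overlaps an annotated pathway more than expected by chance; how the project actually applied the test is not described in the abstract, and the function name and arguments are hypothetical.

```python
from math import comb


def hypergeom_pvalue(universe_size, pathway_size, cluster_size, overlap):
    """Upper-tail hypergeometric p-value P(X >= overlap).

    X is the number of pathway genes seen when cluster_size genes are drawn
    without replacement from a universe of universe_size genes, of which
    pathway_size belong to the pathway. All names here are illustrative.
    """
    max_overlap = min(pathway_size, cluster_size)
    total = comb(universe_size, cluster_size)
    tail = sum(
        comb(pathway_size, k) * comb(universe_size - pathway_size, cluster_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Toy numbers: 10 genes in total, 4 in the pathway, a cluster of 3 genes
# that all belong to the pathway.
p_value = hypergeom_pvalue(10, 4, 3, 3)
```

A small p-value for a cluster suggests it captures a known pathway; under this assumption, many clusters with no significant pathway would hint at overclustering.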
Identification of short tandem repeats from whole-genome sequences
Student: Anna Shchetsova
Supervisor: Anna Zhuk

Eukaryotic DNA is rich in repeats of different types. In particular, tandem repeats have been associated with many cellular processes and are also involved in genetic disorders. It is well known that telomere repeats play a key role in chromosome stability, preventing end-to-end fusions and precluding recurrent DNA loss during replication. Studies of the distribution of telomere repeats in the human genome are important for understanding the mechanisms of aging and cancer. The existing tools for identifying short tandem repeats have limitations in terms of accuracy and comprehensiveness, or of speed, flexibility, memory usage and ease of use. We have developed a tool that searches for tandem repeats of a given motif. The search runs in time linear in the length of the genome. A large number of configurable parameters makes the search more convenient. The tool is written in Python and is cross-platform. Testing on various motifs showed that the tool with basic settings is not inferior to existing tools, and that changing the parameters increases the number of tandem repeats found.
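As an illustration of the idea of searching for tandem repeats of a given motif, here is a minimal sketch. It is not the tool described above: the function name, the `min_copies` parameter and the exact-match rule are assumptions made for the example. For a fixed motif length, the scan is roughly linear in the sequence length.

```python
def find_tandem_repeats(sequence, motif, min_copies=2):
    """Return (start, end, copies) for each run of exact consecutive motif copies.

    Hypothetical helper: exact matching only, no mismatches or indels.
    """
    if not motif:
        raise ValueError("motif must be a non-empty string")
    m = len(motif)
    hits = []
    i = 0
    while i <= len(sequence) - m:
        j = i
        # Extend the run while the next window is another copy of the motif.
        while sequence.startswith(motif, j):
            j += m
        copies = (j - i) // m
        if copies >= min_copies:
            hits.append((i, j, copies))
            i = j  # continue after the reported run
        else:
            i += 1  # no long enough run starts here; shift by one base
    return hits


# Toy sequence with three copies of the human telomeric motif TTAGGG.
toy = "GG" + "TTAGGG" * 3 + "CC"
repeats = find_tandem_repeats(toy, "TTAGGG")
```

Raising `min_copies` makes the search stricter, while lowering it reports more candidate repeats, which mirrors the trade-off between settings mentioned in the abstract.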
Assessment of unresolved PIK3CA gene variants associated with cancer by comparative genomics
Student: Svetlana Milrud
Supervisor: Natalia Petukhova (Pavlov Medical University)

Phosphatidylinositol 3-kinases (PI3Ks) are important regulators of cell growth, transformation, apoptosis, and survival. PIK3CA is a gene located on the third chromosome and encoding the phosphatidylinositol 3-kinase catalytic subunit (p110a). Over the past two decades, missense mutations in the PIK3CA gene have been reported in many types of human cancer, including breast, colon, brain and lung cancers. Numerous studies have revealed the prognostic and therapeutic implications of these mutations. [...] Computational tools, such as PolyPhen-2, SIFT and PROVEAN, are frequently used to evaluate variants of unresolved significance. However, these tools have not yet reached the desired level of accuracy. In our project, we established the precise evolutionary history of the PIK3CA gene and implemented a computational approach to assess unresolved PIK3CA missense variants from the ClinVar, COSMIC and LOVD databases. We have demonstrated improved accuracy in categorizing pathogenic and benign single amino acid substitutions in PIK3CA compared to automated tools.
System for optimized clinical data collection to improve genetic studies reliability
Student: Valerii Kvan
Supervisor: Mykyta Artomov (Massachusetts General Hospital / Broad Institute)

Accurate genetic analysis requires clinical data to be accumulated through a laborious and time-consuming multi-step process. Moreover, such procedures can require a large number of specialized staff and tools. Phenotypic data collected from participants vary in nature, from questionnaires on a computer or on paper to blood samples. The hardest part is the survey, in which the participant answers questions about lifestyle and social status: it takes a nurse or doctor a lot of time to collect, and the answers need to be checked for mismatches, gaps, and outliers. Furthermore, whether the answers are independent of a possible outcome cannot be verified by personnel alone, but it can be validated with genetics. In some participants, polygenic risk may be connected to behavior in which a person answers the questionnaire depending on a certain outcome. In this research we tested the hypothesis that the polygenic risk score for obesity correlates with mismatches and mistypes in answers across 3145 individuals with collected data on food habits. In addition, a REST API server was developed to optimize data collection from patients.
Pseudoautosomal region (PAR) in mammalian genomes
Student: Denis Fedorov
Supervisor: Sergei Kliver (Institute of Molecular and Cellular Biology SB RAS)

Sex chromosomes of mammals have a short region of true similarity at the ends of the chromosomes. It has many unique properties but has not yet been studied in many mammals. I collected chromosome-level assemblies and male raw read data for several species of eutherian mammals to identify the PAR in them. The results showed that most analyzed species had variations of the same ancestral PAR gene content, with highly variable gene and nucleotide lengths. The last PAR gene was usually SHROOM2. Like some murids, rabbits appear to have no PAR at all, but the presence of a normal PAR in squirrels and naked mole-rats indicates independent loss in murids and rabbits.
Genetic map construction for the muscadine grape hybrid populations using RADseq data
Student: Evgenii Raines
Supervisor: Elizaveta Grigoreva (Skoltech / Gregor Mendel Institute)

Vitis vinifera L. is an important agricultural plant that has been used for making wine for centuries. However, it is not resistant to some pathogens, for example Phylloxera larvae, mildew and Oidium, which leads to crop losses. Its relative, Muscadinia rotundifolia Michx., does have this resistance, but it is not used for making wine because of its characteristics. Thus, the task of breeders is to create hybrids that combine the characteristics of Vitis vinifera L. with the resistance to Phylloxera larvae of Muscadinia rotundifolia Michx. An important tool for breeders is the use of genetic markers obtained by building a genetic map. The aim of this project is to construct a genetic map for the muscadine grape using RADseq data in order to search for introgressed muscadine grape (Muscadinia rotundifolia Michx.) genes related to pathogen resistance, by calculating genetic linkage maps, performing Composite Interval Mapping (CIM) analysis and annotating significant markers.
Assessment of unresolved NOTCH family gene variants by comparative genomics
Student: Artem Amosov
Supervisor: Natalia Petukhova (Pavlov Medical University)

The Notch family is represented by four paralogs in the human genome (NOTCH1-4). These genes encode transmembrane receptor proteins that take part in juxtacrine signaling, controlling many processes at different developmental stages. The diversity of processes in which Notch genes are involved makes this family a very interesting target for study. Moreover, mutations in all four paralogs are known to be involved in different hereditary diseases, such as CADASIL and Adams-Oliver syndrome, associated with the Notch3 and Notch1 genes respectively. Although these genes are important for human health, they have many variants of unclear significance, and their variants are predicted less accurately by automated tools. We therefore decided to use a comparative genomics approach to assess the unresolved variants in these genes. For this purpose, we performed a phylogenetic analysis of these genes to build paralog-specific multiple sequence alignments (MSAs), which were used to make predictions for variants in these genes taken from the ClinVar database. For the predictions we used three strategies: automated tools (PolyPhen2, Provean), a straightforward algorithm and the SAVER algorithm. Our results show that the variability of the target genes makes it difficult for any of the three approaches to make 100% accurate predictions, and that different approaches are more effective for different paralogs. For the straightforward and SAVER algorithms, our results show that MSA quality is very important for performance; however, some positions cannot be resolved with the current alignments.
Fine mapping of regions GBA-SYT11 and PARK16
Student: Oksana Prosniakova
Supervisor: Konstantin Senkevich (McGill University)

The cause of Parkinson's disease (PD) is still unclear. Many SNPs and genes have been found to be associated with the disease, but it is hard to say which of them play a causal role. The most recent PD genome-wide association study (GWAS) nominated 90 independent SNPs in 78 loci. The problem is that GWAS cannot pinpoint specific causative SNPs, as they can be in high linkage disequilibrium (LD) with each other and can also be located in different genetic regions. A possible solution to this problem is fine mapping, which yields a set of variables that are associated with the response and together explain 95% of the association of the locus with PD risk. In this project, several SNPs were found through fine mapping and were also identified in colocalization analysis. It was shown that the lead SNP from fine mapping of the PARK16 locus (rs3747973) is also the lead SNP in the LocusCompareR results and is in high LD with the results from COLOC. Based on the fine-mapping and COLOC results, RAB7L1 is a gene associated with Parkinson's disease. The protein encoded by this gene plays a role in the phosphorylation of LRRK2. Variants in LRRK2 are the most common risk factors for Parkinson's disease. Therefore, existing functional data support our results.
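The "set that together explains 95%" idea can be sketched as a credible set. This is a simplified illustration, assuming that each SNP already has a posterior probability of being causal and that the probabilities within a locus sum to about 1; the abstract does not name the fine-mapping method used, and the SNP identifiers and probabilities below are made up.

```python
def credible_set(posterior_probs, coverage=0.95):
    """Return the smallest list of SNPs whose posterior probabilities
    sum to at least `coverage`, taking SNPs in decreasing order of probability.

    `posterior_probs` maps a SNP identifier to its (assumed precomputed)
    posterior probability of being the causal variant.
    """
    ranked = sorted(posterior_probs.items(), key=lambda item: item[1], reverse=True)
    chosen = []
    total = 0.0
    for snp, prob in ranked:
        chosen.append(snp)
        total += prob
        if total >= coverage:
            break
    return chosen


# Hypothetical locus with four SNPs (identifiers and values are illustrative).
example = {"snp_a": 0.60, "snp_b": 0.30, "snp_c": 0.06, "snp_d": 0.04}
selected = credible_set(example)
```

A short credible set points to a few candidate variants for follow-up, while a long one usually reflects strong LD between the SNPs in the locus.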
Genetic diversity of modern Yakutian wolverines
Student: Guillaume Donnet
Supervisor: Sergei Kliver (Institute of Molecular and Cellular Biology SB RAS)

In this study, we aimed to assess the heterozygosity level of Yakutian wolverines. Our material consists of resequenced samples from 8 individuals collected recently in the Yakutian region. We resumed the work starting from filtered data: initial filtration and quality checks had already been done by another team member, with low-quality reads filtered out and adapters trimmed. Since no good-quality assembly of the Gulo gulo genome was available, we built a new reference genome for our work, using a fragmented Gulo gulo genome and a Martes zibellina chromosome-length assembly. After checking the quality of this new reference genome, we used it to align our resequencing data. The rest of the typical pipeline will consist of the diversity study and the visualization of the results.
Search of domestication genes in cats
Student: Daria Yakimova
Supervisors: Anton Zamyatin, Alexander Tkachenko

Domestication is the permanent genetic modification of a bred lineage that leads to, among other things, a heritable predisposition toward human association. Unlike that of other domesticated animals, the domestication of cats was not a result of artificial selection for agricultural purposes, but rather a result of natural selection. Cats that were behaviorally more predisposed to live among humans were eventually domesticated. Thus, by comparing the genomes of wildcats, the predecessors of domestic cats, with the genomes of domestic cats, we can determine the genes behind the domestic phenotype.

Projects, spring 2021
Uncovering possible cells-of-origin for medulloblastoma tumors from comparison to normal brain scRNA-seq data
Student: Ekaterina Petrova
Supervisor: Konstantin Okonechnikov (German Cancer Research Center)

The report presents the results of research devoted to the analysis of publicly available single-cell RNA-seq data derived from human medulloblastoma tumors, combined with normal human fetal cerebellum data. The goal of the project was to discover similarities between the transcriptomes of malignant and normal cells in order to find the origins of different medulloblastoma subgroups. Integration of cerebellar and tumor data, preprocessing, dimensionality reduction, and clustering with 2D UMAP visualization were performed using the Seurat R package. Cell coordinates extracted from UMAP were used to determine the nearest normal cerebellar cell type for each tumor cell. In parallel, we applied the correlation-based SingleR method to the medulloblastoma single-cell data, using the normal fetal cerebellar dataset as a reference, to verify the UMAP-derived results. The analysis revealed the expected correspondence between the SHH subgroup of medulloblastoma and granule neurons. Medulloblastoma Group 4 was associated with unipolar brush cells. For Group 3, we were not able to detect any significant similarity. Both the UMAP-based clustering approach and the SingleR method yielded comparable results, so this strategy might be beneficial for future comparisons between tumors and single-cell RNA datasets of normal human tissues.
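The step of assigning each tumor cell to its nearest normal cell type can be sketched as follows. This is only an illustration: the abstract does not say how "nearest" was defined, so a single nearest neighbor under Euclidean distance in the 2D embedding is an assumption, and the coordinates below are made up.

```python
from math import dist


def nearest_reference_label(tumor_xy, reference_xy, reference_labels):
    """For each tumor cell, return the label of the closest reference cell.

    tumor_xy and reference_xy are lists of (x, y) embedding coordinates;
    reference_labels[i] is the cell type of reference_xy[i].
    """
    labels = []
    for point in tumor_xy:
        closest = min(
            range(len(reference_xy)),
            key=lambda i: dist(point, reference_xy[i]),
        )
        labels.append(reference_labels[closest])
    return labels


# Made-up coordinates for two reference cells and two tumor cells.
reference_xy = [(0.0, 0.0), (10.0, 10.0)]
reference_labels = ["granule neuron", "unipolar brush cell"]
assigned = nearest_reference_label([(1.0, 1.0), (9.0, 8.0)], reference_xy, reference_labels)
```

Because distances in a UMAP embedding are only approximately meaningful, cross-checking such assignments with an independent method, as the project did with SingleR, is a sensible design choice.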

Comparison of the clinical risk scales for multiple psychiatric traits in their ability to detect genetic susceptibility
Student: Darya Pinakhina
Supervisor: Mykyta Artomov (Broad Institute / Massachusetts General Hospital)

In this project we wanted to assess the power of qualitative and quantitative DSM and HADS scales for anxiety, depression and bipolar disorder to detect genetic susceptibility to these mental health conditions. For this purpose, the analysis of the results of 7 GWAS studies using these scales has been carried out. Replication rates for the corresponding GWAS results at variant and gene levels, character of causal probability distributions for the most probable associated genes and the results of gene enrichment analysis using large-scale reference GWAS studies for the corresponding conditions indicate that DSM scales tend to perform better both for anxiety and depression, and qualitative scales have shown higher power for anxiety in our case. This is the first assessment of HADS and DSM scales' performance in GWAS for the Russian cohort, and the results, after further verification, could be used to guide further GWAS studies, as well as for testing in clinical practice until more advanced approaches to diagnose mental health conditions are developed.
Clustering of structural descriptors data for vitamin D receptor
Student: Elizaveta Vinogradova
Supervisor: Karina Pats
Clustering of protein structures is an important task in structural bioinformatics. Clustering can be performed, for example, (1) using structural descriptors calculated from different protein structures, (2) using atomic coordinates, (3) based on amino acid sequence comparison (pairwise alignment) or (4) by comparing structural domains. In this project, structural descriptors were used to cluster the protein structures of the vitamin D receptor (VDR) from three species (Homo sapiens, Rattus norvegicus, Danio rerio). Clustering revealed an outlier in independently normalized VDR data that was not detected by other approaches. The result was confirmed by structure source information. Clustering by structural descriptors was compared with clustering using a matrix of pairwise alignment scores and with the CD-HIT clustering tool.

Phenotype-driven gene prioritization for rare diseases
Student: Valentina Yakushina
Supervisor: Dmitrii Smirnov (Technical University of Munich)

Phenotype-driven gene prioritization uses known associations between clinically relevant genes and clinical phenotypes to return an ordered list of genes in which the causal gene is ideally ranked first. A benchmark of major phenotype-driven gene prioritization tools (AMELIE, PCAN, PubCaseFinder, Phen2gene, GADO) was performed. PubCaseFinder returns the smallest number of putative genes, with the best rank for causal genes and the highest rate of missed causal genes. AMELIE, Phen2gene and GADO return the largest number of putative genes, with the worst rank for causal genes and the lowest number of missed causal genes. PCAN takes an intermediate position in the number of returned genes, with causal gene ranks close to those of PubCaseFinder and a low number of missed genes.
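The three quantities compared above (number of returned genes, rank of the causal gene, rate of missed causal genes) can be computed from ranked lists as in the sketch below. The function names, the choice of the median as the rank summary and the gene identifiers are assumptions made for illustration; they are not taken from the project.

```python
from statistics import median


def causal_gene_rank(ranked_genes, causal_gene):
    """Return the 1-based rank of the causal gene, or None if it was missed."""
    try:
        return ranked_genes.index(causal_gene) + 1
    except ValueError:
        return None


def summarize_tool(cases):
    """Summarize one tool over a list of (ranked_genes, causal_gene) cases."""
    ranks = [causal_gene_rank(genes, causal) for genes, causal in cases]
    found = [rank for rank in ranks if rank is not None]
    return {
        "median_rank": median(found) if found else None,
        "missed_rate": (len(ranks) - len(found)) / len(ranks),
        "mean_list_length": sum(len(genes) for genes, _ in cases) / len(cases),
    }


# Three made-up cases for a single hypothetical tool.
cases = [
    (["G1", "G2", "G3"], "G2"),  # causal gene at rank 2
    (["G4", "G5"], "G9"),        # causal gene missed
    (["G6"], "G6"),              # causal gene at rank 1
]
summary = summarize_tool(cases)
```

Reporting the missed rate separately from the rank matters here: a tool that returns short lists can look good on rank alone while missing many causal genes.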
Genomic landscape in multiple myeloma
Student: Ekaterina Kazantseva
Supervisor: Anna Zhuk
Multiple myeloma (MM) is the second most common blood cancer. Approximately 25% of myeloma patients die within the first 3 years of their disease, and approximately 10% die within the first year[1]. One promising way of finding new therapies is to identify genomic driver events and use specific drugs to target these aberrations. WES data analysis of MM patients enables us to find the variants that caused the disease and to improve treatment. We analysed samples from 2 patients and found germline and somatic mutations associated with myeloma.
Establishing connection between chromosome-length assemblies and karyotypes
Student: Aliya Yakupova
Supervisor: Sergei Kliver (Institute of Molecular and Cellular Biology SB RAS)

Reference genome assemblies are very important for conservation biology. They help to understand genetic diversity and to locate and visualize regions of low heterozygosity in threatened species. The best approach is to use chromosome-level assemblies, which provide better estimates and are easier to work with than fragmented assemblies. However, even chromosome-level assemblies differ in many ways from real chromosomes. For example, heterochromatin regions of chromosomes usually remain unassembled. This work aimed to establish a connection between chromosome-length assemblies and karyotypes of Ailurus fulgens (red panda), Bassariscus astutus (ringtail), and Procyon lotor (common raccoon), using Felis catus as the reference genome. In this project, a fully automated, scalable pipeline for whole-genome alignment was developed, which can be deployed on multiple platforms. Assemblies and karyotypes were compared using the resulting alignments and Zoo-FISH data. Moreover, significant chromosomal rearrangements such as inversions and translocations were detected in specific chromosomes.

Reconstruction of FaRLiP photosynthetic gene cluster from a novel cyanobacteria
Student: Diana Lupova
Supervisor: Anton Korobeynikov (Saint Petersburg State University)

The report presents the results of a project on the detection of the FaRLiP cluster in novel cyanobacteria. The analysis was performed on two samples of sequencing data from unicellular organisms, shown by 16S rRNA and transcribed spacer analysis to be two strains belonging to a novel cyanobacteria genus. The goal of this project was to detect and reconstruct the FaRLiP gene cluster in the sample that demonstrated the ability for far-red light photoacclimation and to confirm the absence of such a cluster in the sample with no photoacclimation activity.
Search for fusion genes in follicular lymphoma samples
Student: Maria Pospelova
Supervisor: Igor Evsyukov

Follicular lymphoma is a hematological cancer. Although the mechanism of neoplasm formation is not fully understood, impaired B-cell apoptosis is known to be responsible for it. Treatment strategies based on individual genomes can be developed only when the target fusion genes are known, for instance the translocation t(14;18)(q32;q21), which is known as the fusion responsible for follicular lymphoma. This research focused on fusion gene detection with four tools: Arriba, STAR-Fusion, FusionCatcher and Kallisto/Pizzly. Fusion genes detected by two or more tools were chosen for further analysis. Some of the detected fusions are connected to certain types of cancer. Others appeared to be new, so annotation analysis was performed. PCR primers were prepared for further validation by Sanger sequencing.
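The "two or more tools" selection rule can be sketched as a simple consensus filter. The code assumes that each tool's output has already been parsed into a list of fusion names in a common format; the fusion names and calls below are placeholders, not real results.

```python
from collections import Counter


def consensus_fusions(calls_by_tool, min_tools=2):
    """Return fusions reported by at least `min_tools` different tools.

    calls_by_tool maps a tool name to the fusions it reported. Duplicates
    within one tool are counted once.
    """
    support = Counter()
    for fusions in calls_by_tool.values():
        support.update(set(fusions))
    return sorted(fusion for fusion, count in support.items() if count >= min_tools)


# Placeholder calls; GENE_X names are illustrative only.
calls = {
    "arriba": ["GENE_A--GENE_B", "GENE_C--GENE_D"],
    "star_fusion": ["GENE_A--GENE_B"],
    "fusioncatcher": ["GENE_E--GENE_F", "GENE_C--GENE_D", "GENE_C--GENE_D"],
    "pizzly": [],
}
shared = consensus_fusions(calls)
```

In practice the tools name partner genes and breakpoints differently, so a normalization step before counting would be needed; it is omitted here.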
PathSeq parameters optimization for improved pathogens search in human tissue NGS data
Student: Aleksandr Cherdintsev
Supervisor: Olga Kudryashova (Boston Gene)
Numerous tools exist for detecting microbiota from sequencing data, yet they produce inaccurate results if not tuned for the specific case. Fortunately, articles benchmarking tools and their parameters for various cases are being published. However, some cases remain uncovered. For example, PathSeq, one of the most frequently used tools for biota detection, has not yet been tested on sequencing data from human biopsy samples. Thus, the current article aimed to determine the optimal PathSeq parameters for processing this type of data.
Identifying metabolic modules in TCGA datasets
Student: Evgenia Chikina
Supervisor: Anastasiia Gainullina

The GAM-clustering algorithm is a method based on joint clustering in network and correlation spaces. It produces metabolic modules both as a graph representation and as patterns (the average gene expression in a particular module). After the modules are identified, the annotation can be explored and compared with the patterns. Here we present a modified GAM-clustering pipeline in which the solver is called through the specialized R library mwcsr and which includes a new parameter, the maximum module size, that makes it possible to estimate the base (a parameter of the algorithm) automatically and dynamically along with module identification. The results show a correspondence between the two approaches to setting the base parameter and the ability of the algorithm to find biologically relevant modules. Results are shown for the TCGA LUSC dataset and reveal metabolic differences according to the presence or absence of mutations in the NFE2L2 and KEAP1 genes.
Hi-C maps visualization and by-hand scaffolding software development
Student: Konstantin Danilov
Supervisor: Anton Zamyatin

In this project we implement a backend model for manual scaffolding using a Hi-C map, written in Python 3. It works with a Hi-C map in cool format and allows the user to move and reverse contigs. The resulting sequences and Hi-C map representations can easily be saved as a new FASTA file, based on the original one, or as PNG images. We use a state model to store the current states of contigs and recalculate the region of interest from the original Hi-C map on request. The operations are quite fast, running in O(c) for moving a contig and O(1) for reversing a contig, but the visualization part is much slower. There are two options for users: a command-line interface or a Jupyter Notebook, both freely available from GitHub.
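A state model of this kind can be sketched as follows. This is a minimal illustration, not the project's code: it assumes that c is the number of contigs, tracks only the order and orientation of contigs, and does not touch the Hi-C matrix. All class and method names are hypothetical.

```python
from dataclasses import dataclass

# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


@dataclass
class ContigState:
    name: str
    is_reversed: bool = False


class Scaffold:
    """Keeps the current order and orientation of contigs."""

    def __init__(self, contig_names):
        self.order = [ContigState(name) for name in contig_names]

    def move(self, src, dst):
        """Move the contig at position src to position dst; O(c) list update."""
        self.order.insert(dst, self.order.pop(src))

    def reverse(self, pos):
        """Flip the orientation flag of the contig at position pos; O(1)."""
        self.order[pos].is_reversed = not self.order[pos].is_reversed

    def render(self, sequences):
        """Build the scaffold sequence lazily from the original contig sequences."""
        parts = []
        for state in self.order:
            seq = sequences[state.name]
            if state.is_reversed:
                seq = seq.translate(_COMPLEMENT)[::-1]  # reverse complement
            parts.append(seq)
        return "".join(parts)


scaffold = Scaffold(["c1", "c2", "c3"])
scaffold.move(2, 0)   # order becomes c3, c1, c2
scaffold.reverse(1)   # c1 is now reverse-complemented
result = scaffold.render({"c1": "AAC", "c2": "GG", "c3": "T"})
```

Storing only flags and positions keeps edits cheap; the expensive work (rebuilding sequences or map regions) is deferred until output is requested, which matches the on-request recalculation described above.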
Virtual high-throughput screening of bile acids
Student: Elizaveta Zhaivoron
Supervisor: Karina Pats

Nuclear receptors (NRs) are transcription factors able to regulate various cellular processes. Bile acids (BAs) can bind nuclear receptors in transcriptional complexes. It is known that some bile acids can promote cancer, while others suppress it. There is a lack of structural data for BA-NR complexes, so in this research we performed virtual high-throughput screening of a bile acid library and determined their specific binding profiles with the vitamin D receptor, pregnane X receptor, and farnesoid X receptor.
Comprehensive analysis of differentially expressed genes between advanced-stage follicular lymphoma and diffuse large B-cell lymphoma samples
Student: Maria Shumilova
Supervisor: Igor Evsyukov

Follicular lymphoma (FL) and diffuse large B-cell lymphoma (DLBCL) are the most common forms of non-Hodgkin lymphoma among adults. FL is slow-growing, while DLBCL is clinically aggressive. Transformation of FL into DLBCL is associated with rapid progression, treatment resistance and poor prognosis. The aim of this project was to search for potential therapeutic targets and biological markers of FL transformation. We performed differential gene expression analysis between FL grade 3 and DLBCL using RNA-Seq data. After significant genes were identified, pathway enrichment analysis was performed. The results suggest an important role of signaling pathways associated with metal ions in the transformation of follicular lymphoma. Some of the genes found could potentially be used as therapeutic targets or as biomarkers of the malignant transformation of follicular lymphoma.
Evolutionary analysis of different prokaryotic species based on synteny blocks
Student: Asya Marshak
Supervisor: Alexey Zabelkin

Bacterial genomes have very high plasticity, provided by rearrangements such as inversions, deletions, insertions, and duplications. When such rearrangements occur independently in different strains, this indicates parallel adaptation that can be responsible for multi-virulence, antibiotic resistance, and antigenic variation. Such processes can be the cause of increased pathogenicity in some strains. Studying the mechanisms of evolution of various species can help in identifying pathogenic strains. We took four human pathogens that exist in different conditions and studied their evolutionary mechanisms with PaReBrick and pan-genomic analysis.
Mining clinical knowledge on Phenotype - Mendelian disease gene associations
Student: Gleb Korelsky
Supervisor: Dmitrii Smirnov (Technical University of Munich)
Researchers have developed phenotype-driven differential diagnosis systems to improve the speed and accuracy of the differential diagnosis of Mendelian disorders. These systems usually rely on databases of gene-disease associations (GDAs) and gene-phenotype associations (GPAs). Therefore, their performance depends on the quality and quantity of data stored in the GDA/GPA databases. In this project, a pipeline was developed to retrieve GDAs and GPAs from medical case reports deposited in PubMed.
Projects, spring 2020
Pilot analysis of the inherited predisposition to lung cancer recurrence
Student: Valeria Rezapova
Supervisor: Mykyta Artomov (Broad Institute / Massachusetts General Hospital)

Lung cancer is one of the most important medical and socioeconomic problems in most developed countries of the world due to its leading position in the structure of cancer incidence and mortality. Recurrence occurs in 30% of lung cancer patients after radical therapy, and about 50% of patients with recurrence die within 2 years. Recurrence could be predicted using molecular and immunohistochemical methods and clinical observations of the epithelial changes. Here, we analyzed pilot exome sequencing data from 3 groups of lung cancer patients: 10 samples with distant recurrence, 10 samples with locoregional recurrence and 10 controls with no recurrence. We describe optimal quality filtering and association study strategies and report two candidate genes with prior evidence of involvement in cancer progression: PRSS21 and KRT6A.
Melanoma somatic variant and copy number alterations data analysis
Student: Dmitrii Usoltsev
Supervisor: Mykyta Artomov (Broad Institute / Massachusetts General Hospital)

Melanoma is the deadliest skin cancer. In 2020, it is estimated that there will be 100,350 new cases of melanoma and 6,850 deaths in the United States. Unlike other cancers, melanoma has a clear pattern of onset mediated by carcinogen exposure: UV light. Here we present an analysis of lab-generated melanoma tumors obtained by UV irradiation of a healthy melanocyte cell line and further grown in immunocompromised mice. We show that artificial melanoma obtained in this way differs from the majority of clinically obtained melanomas in patients. Specifically, the ultraviolet mutational signature discriminates between the melanoma origins, although natural and lab-grown tumors are very similar according to the overall gene signature. Using somatic variant and copy number variation calling approaches, three candidate genes were found: LCE3C, IFIH1 and MEIS2. Deletion of LCE3C is crucial for skin barrier function. IFIH1 is involved in tumor cell apoptosis. MEIS2 is important for cancer pathogenesis. All these genes significantly influence the survival of patients with melanoma.
Analysis of scRNA-seq data of mitral valve disease in K/BxN mouse model
Student: Maria Firulyova
Supervisor: Konstantin Zaitsev
The heart valves are structures composed of different cell populations, and the mitral and tricuspid valves each have specific properties at the morphological, physiological and other levels, including cell-type composition. Due to these differences, the valves do not contribute equally to the progression of some diseases, in particular mitral valve disease. However, the specific factors that predispose to the manifestation of mitral valve disease are still unclear. The project is focused on secondary analysis of single-cell RNA sequencing of both valves under control and inflamed conditions. The analysis covered multiple topics, including annotation of the cell populations present in the data, changes in cell-type composition during disease progression in the mitral valve, and the role of endothelial-mesenchymal transition in the development of mitral valve disease.
Microarray analysis of gene expression induced by vitamin D3 and its analogs in a human THP-1 monocytic cell line
Student: Marina Terekhova
Supervisors: Ferdinand Molnar (Nazarbayev University); Alexey Sergushichev
It is well established that, besides regulating bone and calcium homeostasis, vitamin D is involved in a multitude of fundamental cellular processes, including the cell cycle, apoptosis and differentiation, and affects carcinogenesis, immune function, autoimmune diseases and cardiovascular disorders. Since the supraphysiological doses of 1,25(OH)2D3 that are necessary to obtain these non-classical effects result in hypercalcemia, a huge variety of analogs have been developed to minimize the calcemic side effects while preserving or augmenting the beneficial effects of 1,25(OH)2D3. The development of DNA microarray technologies has created the opportunity to investigate the effects of 1,25-dihydroxyvitamin D3 on the gene expression profile in various cell types. In this study, we aimed to examine changes in gene expression associated with stimulation of THP-1 cells by vitamin D3 and its analogs. THP-1 human monocytic cells were treated for 4 hours with 100 nM 1,25-dihydroxyvitamin D3, 100 nM Gemini or 100 nM TX527, and gene expression was examined using the Illumina HumanHT-12 V4.0 expression BeadChip. Limma-based gene prioritization identified a set of 224 genes with adjusted p-values < 0.05 for 1,25-dihydroxyvitamin D3 versus ethanol (113 up- and 111 downregulated genes), a set of 742 genes for Gemini versus ethanol (268 up- and 474 downregulated genes) and a set of 1166 genes for TX527 versus ethanol (429 up- and 737 downregulated genes). Differences in gene expression between stimulations with vitamin D preparations were generally implicit, and by further analysis of overlapping genes, using published data for THP-1 cells, 15 up- and 3 downregulated genes were selected as vitamin D key target genes. Most of these vitamin D-induced genes are related to immune system regulation. Overall, gene set enrichment analysis showed a very low number of significant results; however, overlap with vitamin D pathways and ChIP-Seq datasets was clearly evident.
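To illustrate what an "adjusted p-value < 0.05" cutoff means, here is a minimal sketch of the Benjamini-Hochberg procedure. It assumes that this correction was used (it is the default in limma's topTable, but the abstract does not state the method), and the p-values below are toy numbers, not data from the project.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so each adjusted value is the
    # minimum of p * m / rank over all ranks at or above the current one.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy p-values (hypothetical, for illustration only).
toy_p = [0.001, 0.008, 0.039, 0.041, 0.6]
adj = benjamini_hochberg(toy_p)

# Genes passing the adjusted p-value < 0.05 cutoff.
significant = [i for i, q in enumerate(adj) if q < 0.05]
print(significant)
```

In this toy case the third and fourth raw p-values are below 0.05 but do not pass after adjustment, which is why gene counts based on adjusted p-values are smaller than counts based on raw p-values.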
RNA-seq analysis of early and delayed guar (Cyamopsis tetragonoloba) varieties
Student: Elizaveta Grigorieva
Supervisor: Alexander Tkachenko
Guar (Cyamopsis tetragonoloba (L.) Taub.) is an annual legume crop widely cultivated in India and Pakistan. Guar beans are a source of guar gum, which is used in many industries: food, cosmetics, and oil. However, the lack of information about the guar genome makes it difficult to breed this crop across the world and especially in Russia. The main challenge in guar breeding is day length, because guar is a short-day plant and most guar varieties cannot start to flower under Russian weather conditions. In this study, RNA-Seq technology and metabolome investigation were used with the goal of identifying genes responsible for photoperiodic sensitivity through an integrative omics approach for 6 early and 9 delayed guar varieties.
SNP calling in transcriptomic and genomic data from guar, Cyamopsis tetragonoloba
Student: Aleksandar Beatovich
Supervisor: Alexander Tkachenko

The guar plant, Cyamopsis tetragonoloba, is a leguminous annual herbaceous plant whose seed contains galactomannan, a polysaccharide with multiple industrial applications. For this reason, identifying SNPs would provide useful information to plant breeders and allow an enhanced exploitation of this plant. In this article, SNPs were called from guar RNA-seq data and a novel reference genome was assembled. The 20800 SNPs reported by the KisSplice pipeline are consistent with previously reported numbers of SNPs from guar RNA-seq data. The high number (102778) of SNPs reported by the GATK-rna pipelines is at odds with the previously reported number of SNPs in RNA-seq data, indicating possible false positives. The novel guar genome assembly, although an improvement on the previous draft assembly, needs to have its level of fragmentation further reduced in order to be used as a reference in the GATK cohort genotyping pipeline, due to a runtime issue in one of the steps of the pipeline.
Re-annotation of the Intoshia linei genome (Orthonectida)
Student: Elizaveta Skalon
Supervisors: George Slyusarev, Natalya Bondarenko (St. Petersburg State University)
Intoshia linei is an invertebrate organism belonging to the phylum Orthonectida. Orthonectids are parasites of various marine animals. For a long time, orthonectids were considered to be a basal group due to their extremely simplified nervous and muscle systems. However, recent molecular studies place them among highly specialized annelids, the segmented worms. Because of the many exciting aspects of orthonectid biology, they are well placed to address many questions in integrative organismal biology. Studying orthonectid genomes is an essential step towards revealing the global mechanism behind secondary reduction in parasitic organisms. Our main goal was to improve the accuracy of the existing genome annotation of Intoshia linei in light of new RNA-sequencing data. We sequenced and assembled an additional transcriptome of adult Intoshia linei and re-annotated the genome with the MAKER pipeline. The annotation assessment results demonstrate that the updated annotation contains more structural elements and is more consistent with our RNA-seq data.
The role of the microbiome in Parkinson's Disease
Student: Mary Futey
Supervisor: Dmitry Rodionov (Sanford-Burnham-Prebys Medical Discovery Institute)
Several studies analyzing the role of the microbiome in Parkinson's disease (PD) indicate that gut microbiome dysbiosis co-occurs with PD. However, there is still a lack of consensus in certain areas, such as the prevalence of potentially protective or detrimental bacterial families and the change in alpha diversity in PD. This report compares the microbial community composition among and between the gut microbiomes of 89 patients with confirmed PD and 66 healthy controls, utilizing 16S rRNA marker gene sequencing. PD samples had higher alpha diversity and decreased levels of butyrate-producing families. Future studies should account for potential confounders, such as diet, medications and demographics, and include functional prediction of the microbiomes.
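As an illustration of what "alpha diversity" measures, the sketch below computes the Shannon index for a single sample. This is an assumption made for illustration: the abstract does not say which alpha diversity metric was used, and the counts are toy values rather than project data.

```python
import math


def shannon_index(counts):
    """Shannon diversity H = -sum(p_i * ln p_i) over taxa with nonzero counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


# Toy read counts per bacterial family in two hypothetical samples.
even_sample = [25, 25, 25, 25]    # four equally abundant families
skewed_sample = [97, 1, 1, 1]     # one dominant family

print(shannon_index(even_sample))
print(shannon_index(skewed_sample))
```

A sample whose families are evenly represented scores higher than one dominated by a single family, so a group-level comparison such as PD versus controls amounts to comparing the distributions of such per-sample values.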
Lipid network analysis
Student: Mariia Emelianova
Supervisor: Alexey Sergushichev
This work is dedicated to metabolic network analysis, whose great advantage is the possibility of identifying new metabolic pathways. For now, however, it is still not possible to fully integrate lipidomic data with metabolic networks. This project is dedicated to creating a lipid network for lipidomic data and testing the concept of lipid network analysis.
Identification of novel miRNAs in Flax Stem
Student: Anzhelika Dun
Supervisor: Alexander Tkachenko

Linum usitatissimum, also known as flax, is one of the most widespread cultivated plants. Linen, the most popular fabric after cotton, is made from flax stem fibres, and its production represents a big part of the textile industry. The unique properties of the flax stem are related to intrusive elongation, the mechanisms of which are poorly studied. Currently, 124 miRNAs of L. usitatissimum are known, forming 23 families, and almost half of them participate in intrusive elongation. Therefore, miRNAs are key players in flax development and can be used to improve fibre properties. We identified 150 novel miRNAs in 10 RNA-seq samples using the highly accurate tool miRDeep2. Subsequent filtering of duplicates and low-quality candidates left the 17 best-predicted miRNAs. After validation, our results could expand the set of known L. usitatissimum miRNAs and help clarify the mechanisms of intrusive elongation.
Spatial reconstruction of Arabidopsis gene expression using NovoSpaRc
Student: Natalia Baymacheva
Supervisor: Alexander Tkachenko

Spatial information about gene expression is important for our understanding of how organisms function. scRNA-seq, a popular method, allows analysis of gene expression at the single-cell level, but most technologies lose spatial information during sample preparation. NovoSpaRc, a recent tool for gene cartography from single-cell expression data, is able to regain the spatial information from expression data and some prior knowledge. This tool has been tested on several animal models with great results, but the same methodology may not apply to plant single-cell data because of how the root stem cell niche functions.
Analysis of transcription start sites in Salmonella enterica in Amoeba symbiosis using Cappable-seq
Student: Liuaza Etezova
Supervisor: Alexander Tkachenko

Salmonella enterica is a dangerous pathogen which can cause life-threatening bacteremic illness. Symbiosis with an amoeba enhances its virulence and antibiotic resistance. We analysed Cappable-seq transcription data of S. typhimurium serving as an Acanthamoeba castellanii symbiont and found novel TSSs which are highly expressed. We identified differentially expressed genes which indicate oxidative stress and a response to zinc concentrations. We found that S. typhimurium in symbiosis uses TSSs which are not used in the control group. We also analysed antisense transcription using Cappable-seq.
Comparison of genome assemblies for Nanger dama
Student: Azat Mingaleev
Supervisor: Pavel Dobrynin
In this work, we analyzed three different genomic assemblies of the Nanger dama gazelle produced using technologies such as Hi-C, 10X, and Bionano. The analysis identified problems with the Bionano assembly, and further work was carried out with the other two assemblies. In the end, it was found that for Hi-C and 10X the proportion of unaligned fragments was slightly less than 6%. Because of technical problems with haltools and smash++, work on finding breakpoints and rearrangements is planned to continue in the future.
Detection of novel regions in the newly sequenced bacterial isolate
Student: Mikhail Lebedev
Supervisor: Sergey Nurk (National Human Genome Research Institute, NIH)

During this project we developed an easy-to-use pipeline for extracting novel regions of a sequenced bacterial genome, a task for which results could not previously be obtained this easily. Moreover, it allows students to look at the pan-genome analysis of a sequenced bacterial strain and comprehensively visualizes problems with possible uncorrected insertion and deletion (indel) errors in a given bacterial genome sequenced with Oxford Nanopore. The returned novel regions of bacterial genomes can be analyzed to find new genes. The resulting pipeline is available on GitHub.
Projects, spring 2019
Chromothripsis: bioinformatic analysis of complex genomic rearrangements
Student: Natalia Petukhova
Supervisor: Sergey Aganezov (Johns Hopkins University)

Initially described in cancer genomes, and subsequently observed in constitutional disorders, chromothripsis constitutes a new class of massive genomic alterations characterized by the simultaneous shattering of chromosome segments followed by random reassembly of the DNA fragments during a single cellular event, to form complex derivative chromosomes.
The purpose of this study is to analyze the genome of a patient with acute leukemia who presumably has chromothripsis underlying complex genomic rearrangements. The objective of this work is a bioinformatic analysis of the whole-genome DNA sequence using different bioinformatic tools to search for genomic aberrations, and a comparison and confirmation of the obtained results against the cytogenetic study, which still remains the "gold standard" of diagnostic expertise in medical practice.
Producing a chromosome-scale mosquito genome assembly from Oxford Nanopore reads
Student: Anton Zamyatin
Supervisors: Pavel Avdeyev (George Washington University); Nikita Alexeev

The main task of this project is to produce chromosome-level genome assemblies for An. coluzzii and An. arabiensis using different sequencing technologies. Long reads from Oxford Nanopore sequencing are used for the draft assembly, then short reads from Illumina are used for polishing, and data from Hi-C sequencing are used for mapping scaffolds. The second task is to develop a working pipeline for eukaryotic genome assembly using these types of input data and the state-of-the-art computational tools that are available for these purposes. At the end of the project, there should be two chromosome-level genome assemblies of the two mosquito species, An. coluzzii and An. arabiensis, that can be used in further population genomics studies and other chromosome-level analyses.
Estimating gene priorities in complex traits based on GWAS summary statistics
Student: Nikita Kolosov
Supervisor: Mykyta Artomov (Massachusetts General Hospital)

A Genome-Wide Association Study (GWAS) is a powerful tool for investigating the genetic origin of complex traits. Such traits are called complex because their development is affected by multiple genes and external factors. In this review, we will predominantly focus on complex traits such as polygenic diseases. Such diseases are caused by DNA alterations in several genes at once. As a result, it is hard to identify their biological mechanisms using family-based studies. GWAS allows one to get closer to solving this problem by looking at the prevalence of disease alleles in cohorts of cases and controls. In this article, we present a new approach that estimates the probability of each gene being causal based on GWAS summary statistics using machine learning (ML) classification algorithms.
Signaling responses to personalized exercise therapy in skeletal muscle in heart failure patients
Student: Oksana Ivanova
Supervisors: Renata Dmitrieva (Almazov Centre); Alexey Sergushichev
Heart failure (HF) is one of the most widespread disorders in the world. Signs and symptoms of heart failure commonly include shortness of breath, excessive tiredness, and leg swelling, as well as a limited ability to exercise. Moreover, HF often induces skeletal muscle myopathies like cachexia and myodystrophy. However, there is still no common treatment for muscle disorders in HF patients that would improve their quality of life and increase their ability to exercise. This study aims to describe the effect of physical exercise therapy in heart failure patients in order to define possible targets for pharmacological treatment of muscle wasting.
16S rRNA analysis of the gut microbiota in obesity and healthy state
Student: Ksenia Maksimova
Supervisor: Yulia Kondratenko (St. Petersburg State University)

Over the last decade, studying the human gut microbiome has been gaining popularity. There is much research on the impact of microorganisms on the host, the role of gut microbiota in diseases, and the prospect of treating different diseases by restoring the gut microbiome. Recent studies have already revealed the main species living in our body, and the main interest in human microbiota research so far is to link changes within this community with particular human phenotypes. One of the highest-profile demonstrations of the microbiota's influence on human health is the microbiome in obesity. In the present study, the gut microbiota community of obese phenotypes will be characterized and compared to that of healthy individuals.
Multi chain effect analysis from paired antibody repertoire data
Student: Sedreh Nassirnia
Supervisor: Maria Chernigovskaya
B lymphocytes (B-cells) are one of the most important parts of the adaptive immune system. They are involved in the immune response by secreting antibodies and producing B-cell receptors (BCRs) through multiple mechanisms for binding to a specific antigen. There are checkpoints to ensure that only antibodies with two identical heavy chains (IGH) and two identical light chains (IGL) participate in the immune response against antigens. But sometimes antibodies are made up of dual light chains, which can show auto-reactivity in the immune system and lead to the development of autoimmune disease in human and mouse samples. In this work, four human and three mouse 10x Genomics VDJ paired datasets were analyzed to investigate cells with a dual light chain. The IgBlast tool was used to filter out suspicious contigs from the datasets before further analyses.
Exome analysis of samples using GATK4 pipeline and database development
Student: Mrinal Vashisth
Supervisor: Yury Barbitoff (Bioinformatics Institute)

Expanding GeneQuery transcriptional database into RNA-Seq space
Student: Boris Shpak
Supervisors: Alexander Predeus (University of Liverpool / Bioinformatics Institute); Maxim Artyomov (Washington University in St. Louis)
InCHIANTI dataset analysis
Student: Maria Romanova
Supervisor: Maxim Artyomov (Washington University in St. Louis)
A comparative analysis of viral outbreak networks reconstruction methods
Student: Daria Nemirich
Supervisor: Nikita Alexeev

FSHD diagnosis through Nanopore sequencing
Student: Ekaterina Gibitova
Supervisor: Pavel Avdeyev (George Washington University)