Master's thesis projects

The fourth semester is mostly dedicated to the students' work on their Master's theses. The theses are supervised by leading experts from Russian and foreign scientific centers working in the field of bioinformatics.

Projects of different years are listed below.
Master's theses, spring 2022
Phasing of Partially Resolved Metagenomic Assemblies
Student: Ekaterina Kazantseva
Supervisor: Mikhail Kolmogorov (National Cancer Institute, NIH)
Long-read metagenomic sequencing has recently been used to recover complete bacterial genomes from various complex metagenomic communities. Metagenome assembly algorithms, however, still face challenges in the deconvolution of closely related species and strains. De novo assemblies of highly heterogeneous bacterial species typically result in tangled assembly graphs, where some sequences may be strain-specific, while others represent a species-level consensus. Such a partially collapsed representation of bacterial strains does not take full advantage of the ability of long reads to phase small variants. In this work we present an algorithm called MetaPhase that extends metagenomic phasing approaches to assembly graphs. Our algorithm operates on graph paths rather than single contigs, and iteratively simplifies assembly graphs with newly reconstructed strain contigs. We benchmark our algorithm using mock communities and show that it produces accurate and complete strain-level reconstructions and substantially improves over the initial partially collapsed assemblies.

Presentation_Kazantseva (slides)
Genome Assembly Annotation and Genetic Diversity of Baikal Seal (Pusa sibirica)
Student: Aliya Yakupova
Supervisor: Sergei Kliver (Institute of Molecular and Cellular Biology SB RAS)
In this work we focused on the conservation genomics of the Baikal seal (Pusa sibirica). The origin of this pinniped species, isolated in Lake Baikal, remains a mystery. Even today the scientific community has many assumptions and disputes about how and when seals appeared in the lake, which is now separated from the ocean by thousands of kilometres. In the first part of this work we assessed the quality of the chromosome-length assembly of the Baikal seal generated earlier by our team. Its quality was slightly lower than, but comparable to, that of the available assemblies of other pinnipeds. The second part of the project focused on annotation of the analysed genome assembly. We detected the coordinates of the pseudoautosomal region (PAR) on the X chromosome and predicted tRNA, rRNA and protein-coding genes. As the final stage of annotation we established a connection between the chromosome-length assemblies and karyotypes of the Baikal seal (Pusa sibirica) and five other pinniped species. The final part of the work included heterozygosity and demographic history analysis for the Baikal seal, grey seal and spotted seal. All three pinniped species showed low heterozygosity levels. This could be explained by inbreeding in a closed ecosystem (Pusa sibirica), a decrease in population size (Halichoerus grypus) and vulnerable status (Phoca largha). Reconstruction of demographic history showed a severe bottleneck that occurred ~300 ka for the spotted and grey seals. Surprisingly, for the Baikal seal the bottleneck is shifted by ~150 ka towards the present. This could be explained by the different environmental conditions that the species faced after speciation. For Pusa sibirica, the demographic history supports the hypothesis that seals have been isolated in Baikal for at least 80-150 ka, depending on the mutation rate.

Presentation_Yakupova (slides)
Uncovering Possible Cells-of-Origin for Medulloblastoma Tumors from Comparison to Normal Brain Single-Cell RNA Sequencing Data
Student: Ekaterina Petrova
Supervisor: Konstantin Zaitsev
Medulloblastoma is one of the most prevalent and highly malignant childhood brain cancers. Finding the most effective treatment strategy for this type of cancer is particularly challenging due to its remarkable molecular heterogeneity. [...] In this work, we show how single-cell RNA sequencing data analysis can be applied to uncover the origins of human medulloblastoma tumors through mapping to healthy developing human brain cells. We applied existing methods for correlation-based comparison of gene expression between different cell types and developed a new approach for joint analysis of cerebellar and tumor single-cell RNA sequencing data based on non-negative matrix factorization. The results demonstrated that medulloblastomas from the SHH, group 3 and group 4 subgroups are transcriptionally similar to cells from the granule neuron/unipolar brush cell cerebellar lineage, while identifying specific cells-of-origin for each group requires additional investigation.

Presentation_Petrova (slides)
Deep Learning Approaches for Drug-Target Affinity Prediction
Student: Elizaveta Vinogradova
Supervisor: Karina Pats
Assessment of the strength of binding between a drug and its target (drug-target affinity) is an important aspect of the drug discovery process. Obtaining this data experimentally is both time-consuming and costly. Therefore, computational methods for predicting binding strength are being widely used and developed. However, the experimental data used to train these prediction models are highly inconsistent and unevenly represented for different drug-target pairs. This leads to biased algorithms that do not generalize well to new data. Several approaches have been developed to address this problem. One of them is multi-task learning, but its application in this field remains largely unexplored. The approach proposed in this work aims to fill this gap. This work explores the application of methods that have not been thoroughly explored before: multi-task learning with auxiliary tasks, positional embedding for small molecules, loss functions with the ability to fill in missing values, and residual connections. The proposed method was found to achieve an accuracy of over 90%.

Presentation_Vinogradova (slides)
Investigating the Genetic Underpinnings of Glia-Neuron Interactions Promoting Synapse Formation
Student: Darya Pinakhina
Supervisor: Mykyta Artomov (Broad Institute / Massachusetts General Hospital)
Analysis of the data on expression changes associated with astrocyte-neuron interaction during synapse formation, obtained using hiPSC technology, highlights the importance of TGFβ signaling in the neuronal response to astrocytic activation and reveals connections between astrocyte-promoted synaptogenesis, schizophrenia and sleep deprivation. Signaling by PDGF appears to be a common significantly enriched pathway in these contexts, while also being closely connected to TGFβ signaling. Ligand-receptor interactions potentially involved in astrocyte-neuron crosstalk are prioritized. Risk gene ranking for schizophrenia using the pipeline developed in the project emphasizes the involvement of dopaminergic synapse, retrograde endocannabinoid signaling, MAPK signaling and nicotine addiction molecular mechanisms in disease pathogenesis, while also being consistent with the involvement of genes related to astrocyte-neuron crosstalk.

Presentation_Pinakhina (slides)
Molecular Profiling of Human Hepatocellular Carcinoma
Student: Konstantin Danilov
Supervisor: Ivan Valiev (Boston Gene LLC)
Despite all efforts, cancer is still a major problem worldwide. Liver cancer is one of the most common cancers worldwide and is projected to rank third by cancer deaths in the USA by 2030. The standard of care for liver cancer currently includes immune checkpoint blockade, with a clear need for biomarker guidance of treatment. Previous works established subtypes of liver cancer, but none of them efficiently described the microenvironment. In this research work we collected a meta-cohort from 11 publicly available datasets of hepatocellular carcinoma expression data, with a total of 1432 samples after quality control. Using previously described knowledge-based functional gene expression signatures and an unsupervised clustering approach, we identified 5 clinically and biologically meaningful clusters with different properties, activated processes and survival based on microenvironment and tumor features. We validated the survival findings on holdout data with 103 samples. These clusters could potentially be used to predict prognosis and preferable treatment options for single samples of tumor tissue, but further validation is needed.

Presentation_Danilov (slides)
Compendium and Comparative Analysis of Publicly Available Single-Cell RNA Sequencing Datasets of Lymphoid Endothelial Cells
Student: Diana Lupova
Supervisor: Konstantin Zaitsev
In recent years, the idea that lymphatic endothelial cells are an important part of the lymphatic system and are responsible for more than just transporting lymph to the lymph nodes has become widespread. A number of studies have demonstrated that lymphatic endothelial cells are a highly heterogeneous cell population that can carry out a vast range of functions depending on their anatomical location. However, despite the growing interest in this cell type, it is still poorly studied, especially in terms of the application of single-cell RNA sequencing. To fill this gap, we created a compendium of lymphatic endothelial cells by extracting them from publicly available single-cell RNA-seq datasets from studies dedicated to other cell types. Almost all tissues require lymph drainage and thus contain small numbers of lymphatic endothelial cells that we can use in our analysis. We extracted these cells using the existing markers for lymphatic endothelial cells. We integrated the obtained lymphatic endothelial cells to remove dataset-specific effects and further analyzed them in order to gain insight into tissue variations, canonical transcription factors, and other key genes. We found new possible markers for the lymphatic endothelial cell population and its subpopulations and extended the list of known markers for lymphatic endothelial cells. Using the updated list of markers as a gene signature makes the discovery of more unique lymphatic endothelial cell subsets possible, leading to a more comprehensive analysis.

Presentation_Lupova (slides)
Metabolic Modules Identification in Single Cell Data
Student: Evgenia Chikina
Supervisor: Anastasiia Gainullina (Institute of Developmental Biology RAS)
The GAM-clustering algorithm is a method based on joint clustering in network and correlation spaces that produces metabolic modules. Previously the GAM-clustering method was adapted to bulk RNA-Seq data. Its application to single-cell data required user-supervised data preprocessing. Here we present an automated pipeline for preprocessing single-cell RNA sequencing data prior to GAM-clustering, including two rounds of clustering with ChooseR and kmeans. We compare the results obtained using the automated pipeline and a simple kmeans clustering approach, revealing the advantages of the implemented automated pipeline. Metabolic networks built on the KEGG and Rhea databases, as well as metabolite- and atom-based networks, were used in the analysis. GAM-clustering results obtained using different metabolic networks appeared to be similar, with certain differences depending on the network type used in the analysis. Benchmarking of the method was performed against TiCoNE and Compass. Both methods were able to reveal biochemical properties also found by GAM-clustering. However, GAM-clustering was able to identify unique metabolic pathways that were not found by the other methods.

Presentation_Chikina (slides)
Machine Learning Based Approach for Artifact Variant Filtration from High Throughput Sequencing Data
Student: Valentina Yakushina
Supervisor: Maxim Ivanov (Atlas Oncology Diagnostics)
The clinical significance of somatic mutations, their expected low allelic frequency and the error load of high-throughput sequencing data create a demand for approaches that allow accurate automatic filtration of artifact variants. This project aimed to develop a machine learning based approach for artifact variant filtration from high-throughput sequencing data. The following goals were achieved: a pipeline for automatic variant calling and quality metrics collection was developed; the load of false variants called by standard variant calling tools in high-throughput sequencing data was evaluated; a set of true positive and true negative variants for machine learning was prepared; the Random Forest algorithm was applied for variant classification based on quality metrics from standard variant callers. Variant classification with the developed algorithm achieved an AUC of 0.9996, a True Positive Rate of 0.9978 and a True Negative Rate of 0.9333 for SNVs, and 0.9987, 0.9867 and 0.9380, respectively, for INDELs.
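
For readers unfamiliar with the reported metrics, the following is a minimal sketch of how the True Positive Rate, True Negative Rate and AUC can be computed for a binary variant classifier. It is not the thesis pipeline: the function names, the toy labels and scores, the 0.5 threshold, and the convention that 1 marks a true variant and 0 an artifact are all assumptions made for illustration.

```python
def confusion_rates(y_true, y_pred):
    """Return (TPR, TNR) for binary labels (assumed: 1 = true variant, 0 = artifact)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: classifier scores for four true variants and four artifacts.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
preds = [1 if s >= 0.5 else 0 for s in scores]  # assumed decision threshold

tpr, tnr = confusion_rates(labels, preds)
auc = roc_auc(labels, scores)
print(tpr, tnr, auc)
```

The pairwise AUC definition is quadratic in the number of variants, which is fine for a toy example; a rank-based implementation would be preferable on real call sets.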

Presentation_Yakushina (slides)
Analyzing Patterns of Tyrosine Sulfation in Human Immunoglobulins
Student: Maria Pospelova
Supervisor: Yana Safonova (Johns Hopkins University)
The adaptive immune system controls the interaction with pathogens through antigen-antibody binding. Sulfated tyrosine (sY) is suggested to play a crucial role in binding, since it is found on the binding sites of some co-receptors, such as CCR5. CCR5 is a protein used by HIV-1 to infect human cells, while antibodies that mimic CCR5 can successfully inhibit the virus. Despite the great therapeutic potential of antibodies with sulfated tyrosines, we still do not know how they are generated. The aim of this research is to investigate whether sulfated tyrosines can help antibodies mimic such receptors and to find their common features. The target motif of coupled aspartic and/or glutamic acids neighboring a sulfated tyrosine ([E/D]+sY), which can help antibodies mimic the CCR5 receptor, was found in about 8% of the reads containing sulfated tyrosine. These reads were aligned to the D genes, and the possible recombination scenarios that lead to the formation of the target motif were revealed. A common feature of the reads with aligned D genes was identified: the genes IGHD4-17, IGHD3-22 and IGHD3-16 have the highest counts.
Master's theses, spring 2021
Data-driven approach to identify differentiation trajectories of myeloid cells in atherosclerotic plaques
Student: Maria Firulyova
Supervisors: Konstantin Zaitsev; Jesse Williams (University of Minnesota)
Atherosclerotic cardiovascular disease is an inflammatory disease of the arteries. During atherosclerosis progression, a special structure with a complex cellular composition, called the atherosclerotic plaque, is formed. The differentiation relationships within the intimal myeloid cell population associated with the atherosclerotic plaque remain unclear: the differentiation process that leads to the formation of foam macrophages in the plaque is still unknown. New computational approaches for trajectory analysis designed for single-cell RNA sequencing data provide an opportunity to reconstruct trajectories for cells of interest. An important feature of trajectory inference is the possibility of identifying genes and regulons that are statistically significantly associated with the identified lineages. The project is focused on secondary analysis of public single-cell RNA sequencing studies dedicated to atherosclerosis. The results cover multiple topics, including processing, integration and annotation of scRNA-seq datasets and trajectory inference for the myeloid cells identified in all prepared scRNA-seq atherosclerosis datasets.

Presentation_Firulyova M. (slides)
Analysis of somatic mutability in cutaneous melanoma in response to UV-irradiation
Student: Dmitrii Usoltsev
Supervisor: Mykyta Artomov (Broad Institute / Massachusetts General Hospital)
Skin cancers, such as cutaneous melanoma, harbor the highest mutation burden among all malignancies. While the vast majority of these changes are consistent with UV-induced mutations, the biological effects of UV carcinogenesis have yet to be fully elucidated. We performed in silico analysis of the TCGA cutaneous melanoma cohort to find genes that selectively accumulate mutations in high UV-burden tumors. Subsequently, our findings were replicated in in vivo tumors derived from human melanocytes with controlled UV exposure to confirm the UV-induced nature of the identified mutations. TCGA melanoma tumors were separated into 3 groups by their UV-signature burden, and the analysis of per-gene mutational burden was adjusted for relevant clinical features, overall tumor mutational burden and gene length. In vivo tumors were generated by UV irradiation of human melanocytes followed by injection into mice. Somatic variant calling was performed on lab-generated tumors using the original melanoma cell line and a tumor resulting from non-irradiated cells as comparisons.

Presentation will be available after publication of results
Comparison of T-cell signaling programs associated with response to checkpoint immunotherapy in different cancer types
Student: Marina Terekhova
Supervisor: Vadim Zhernovkov (University College Dublin)
It is well known that the immune system is critical in cancer development and progression. The immune surveillance theory suggests that the immune system permanently monitors the cells and tissues of the body and is responsible for recognizing and killing cancer cells. However, this immune surveillance causes selection of cancer cells that are poorly immunogenic or have extensive mechanisms allowing escape from immune detection. As a result, malignant cells arise with the capability to evade immune destruction, proliferate and manifest clinically as cancer. T-cells are an important component of cell-mediated immunity against cancer, and they are controlled by a number of costimulatory and inhibitory signals that serve as checkpoints. Checkpoint regulators guarantee that T-cell responses maintain self-tolerance and effectively protect the organism from pathogens and malignancies. Immune checkpoint inhibitors represent a new class of immunotherapy and have demonstrated a rapid increase in the overall survival rate of patients with different types of advanced cancer. [...] The aim of this work is to reveal transcription factors that can serve as predictors for treatment with immune checkpoint inhibitors in different types of cancer.

Presentation_Terekhova M. (slides)
Identification and functional annotation of hypothetical proteins from orthonectids' parasitic plasmodium (Bilateria: Orthonectida)
Student: Elizaveta Skalon
Supervisors: George Slyusarev, Natalya Bondarenko (St. Petersburg State University)
Orthonectida Giard, 1877 is a small phylum of poorly known marine invertebrates. [...] The orthonectids' dramatic loss of complexity and unique life cycle are the only such case within the large Annelida group, but the origin of orthonectid parasitism is still unknown. The main adaptation to parasitism, the plasmodium, remains underexplored, and many questions related to the biology of the orthonectids' parasitic stage have not yet been resolved. Discovering genes expressed specifically in the plasmodium is an essential step towards revealing the mechanisms behind the development and functioning of the orthonectids' parasitic stage. It will help to explore the orthonectids' adaptations to a parasitic lifestyle. Here, we present the identification and annotation of orthonectids' plasmodium-specific hypothetical proteins, with the aim of understanding the processes behind plasmodium functioning and the orthonectids' adaptations to parasitism.

Presentation_Skalon E. (slides)
A genome‐wide association study for flowering time in guar (Cyamopsis tetragonoloba (L.) Taub.)
Student: Aleksandar Beatovich
Supervisor: Alexander Tkachenko
The guar plant (Cyamopsis tetragonoloba (L.) Taub.) is a short-day annual herbaceous flowering plant native to India and Pakistan. It is industrially important as the main source of guar gum, which is widely used in the oil and gas industry. Cultivation of this plant in countries of northern latitudes is limited by longer day lengths during the vegetative season compared to its native habitat. The ability to efficiently identify early-flowering guar varieties would greatly accelerate breeding efforts. Genomic resources for guar are limited, despite its economic significance. This study presents a new highly contiguous guar genome assembly and 10736 variant sites derived from RADseq data of a cohort of 192 guar plants of different varieties. A pilot genome-wide association study found a number of SNP markers in proximity to previously established genes regulating flowering time; these could be used as markers in marker-assisted breeding of this economically important crop.

Presentation_Beatovich A. (slides)
Investigation of common DNA variants contribution to polygenic disease risks in Russian population
Student: Valeria Rezapova
Supervisor: Mykyta Artomov (Broad Institute / Massachusetts General Hospital)
Human traits and diseases result from individual or combined genetic and environmental factors. Over the last decade, genome-wide association studies (GWAS) have discovered a substantial number of associated variants for many complex traits. The success of GWAS in finding and replicating thousands of associations for thousands of phenotypes has demonstrated the usefulness of previous approaches and ushered in a new era of human genetics. However, even within European-centered GWAS data, there are local subpopulations that are significantly under-represented in these studies. For example, Russians, being one of the largest ethnic groups among Europeans, remained significantly under-represented in GWAS for years. The aim of the present work was to test whether UK Biobank GWAS results could be successfully applied to the estimation of polygenic risk scores in samples of Russian descent.
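
As background on the quantity being estimated, a polygenic risk score is commonly computed as a sum of an individual's effect-allele dosages weighted by per-variant GWAS effect sizes. The sketch below illustrates only that generic formula; the variant identifiers, effect sizes and genotype are hypothetical, and the sketch does not reflect the variant selection or adjustment steps used in the thesis.

```python
def polygenic_risk_score(dosages, effect_sizes):
    """Weighted sum of effect-allele dosages (0, 1 or 2 copies per variant).

    Variants that are genotyped but absent from the GWAS summary
    statistics are skipped rather than treated as an error.
    """
    return sum(
        effect_sizes[variant] * dosage
        for variant, dosage in dosages.items()
        if variant in effect_sizes
    )


# Hypothetical per-variant effect sizes taken from GWAS summary statistics.
effect_sizes = {"rs_a": 0.5, "rs_b": -0.25, "rs_c": 0.125}

# Hypothetical genotype of one individual; "rs_d" has no GWAS weight.
individual = {"rs_a": 2, "rs_b": 1, "rs_c": 0, "rs_d": 1}

score = polygenic_risk_score(individual, effect_sizes)
print(score)
```

Scores computed this way are only comparable between individuals scored on the same set of variants, which is one reason transferring weights between populations needs explicit testing.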

Presentation_Rezapova V. (slides)
Integrating lipidomics data with reaction networks
Student: Mariia Emelianova
Supervisor: Alexey Sergushichev
Lipids are an important class of biomolecules that are involved in many vital cellular processes. Due to their hydrophobic nature, lipids are the major constituents of biological membranes and are thus the physical basis of all living organisms because they provide the ability to separate living entities from their natural surroundings. Another task that lipids fulfil is the storage of surplus energy for later consumption. Finally, lipids are also involved in extra- and intracellular signaling processes, where they transduce signals and amplify regulatory cascades. Since lipids play a crucial role in many biological processes, any imbalance in their homeostasis can lead to serious conditions in living organisms, such as chronic inflammation, cardiovascular diseases, diabetes, and neurodegenerative diseases. Therefore, the importance of lipids in biomedical research should not be underestimated. [...] For now, lipidomics data cannot be easily integrated into current pipelines, and the particular roles of lipids in metabolism, their exact functions and their impact on various biological processes remain unclear. The aim of this study was to extend the applicability of metabolic network analysis to lipidomics data. To do this, it was necessary to build comprehensive metabolic and lipid-specific graphs, update the existing network analysis pipeline to suit lipid-specific analysis, and test the pipeline on real datasets.

Presentation_Emelianova M. (slides)
Application of metabolome-transcriptome integration approach for detection of loci controlling flowering time of guar (Cyamopsis tetragonoloba (L.) Taub.)
Student: Elizaveta Grigorieva
Supervisor: Alexander Tkachenko
Guar (Cyamopsis tetragonoloba (L.) Taub.) is an annual legume crop native to India and Pakistan. Seeds of the plant serve as a source of galactomannan polysaccharide (guar gum) used in the food industry as a stabilizer (E412) and as a gelling agent in oil and gas fracturing fluids. There have been several attempts to introduce this crop to countries of more northern latitudes. However, guar is a short-photoperiod plant, so its introduction to Russia is complicated by the long day length during the growing season. Breeding of new guar varieties insensitive to photoperiod is slowed down by the lack of information on functional molecular markers, which, in turn, requires information on the guar genome. In this work we present an attempt to use an integrative transcriptome-metabolome approach to understand the genetic determination of flowering time variation among guar plants that differ in their photoperiod sensitivity. This study was performed on nine early and six delayed flowering guar plants with the goal of finding a connection between biomarkers and differentially expressed transcripts. Metabolome-transcriptome integration was done by two different approaches: WGCNA and Shiny GAM.

Presentation_Grigorieva E. (slides)
Human exome variant database construction
Student: Mary Futey
Supervisors: Alexander Tkachenko; Yury Barbitoff (Bioinformatics Institute)
Next generation sequencing has greatly increased the amount of data available for both research and clinical uses. However, in order to utilize this data there is a need for both accurate tools and standardized analytic pipelines, as well as resources such as variant databases that capture variation across all populations. There are several large databases that are worldwide in scope, however they often under-represent certain populations, leading to initiatives that focus on these underserved groups. One such group is the various ethnic populations within Russia. We developed a pipeline to conduct variant calling on WES data from 1739 individuals from the Russian Federation. Lastly we conducted an over-representation analysis to assess the frequency of disease causing alleles in the Russian population compared with a reference population.

Presentation_Futey M. (slides)
Analysis of microRNA expression profiles in mechanical tissues of cultivated and wild varieties of flax (Linum usitatissimum)
Student: Angelica Dun
Supervisor: Alexander Tkachenko
MiRNAs have been suggested to be key players during flax stem development. Numerous studies have shown that this type of small RNA indeed has a great impact on the regulation of both intrusive elongation and cell wall thickening, which are the most important stages of flax development. Samples of phloem fibers at the late stages of development from three poorly studied flax varieties (fiber, linseed and wild) were used in our work. Novel miRNAs and their targets were computationally predicted for all samples. Differentially expressed miRNAs specific to the fiber cultivar were identified, and their mRNA targets among differentially expressed genes were predicted.

Presentation_Dun A. (slides)
From the distribution of synapses to neural function in a circuit that mediates attention shifts
Student: Natalia Baymacheva
Supervisor: Karl Farrow (KU Leuven, IMEC)
Neurons receive inputs through thousands of synapses distributed throughout their dendrites. These inputs are transformed into electrical signals, which undergo specific integration while travelling along the dendrite to the soma. Scientists have been studying the computational properties of dendrites for decades, but the exact mechanisms remain to be unravelled. One of the most debated aspects of this question is the input location. Does it have any significant influence on the output at the in vivo network scale? To address this puzzle, we studied pathways in the superior colliculus (SC) responsible for evoking innate defensive behavior in response to visual stimuli. Wide-field (WF) neurons play the leading role in these pathways, receiving their inputs directly from retinal ganglion cells (RGC) and inhibitory interneurons (Gad2). We used in vivo recordings from RGC, Gad2 and WF neurons in the mouse brain to build linear-nonlinear models of a WF neuron. Simple linear model fits showed a correlation between RGC subtypes and the layer of WF neurons to which they contribute. Applying the same model to the inhibitory Gad2 recordings demonstrated strong inhibition in the deeper layers of WF neurons, proximal to the soma. When the activation function model was applied to the data, no notable improvements were observed, leading to the conclusion that dendrites process local signals linearly. To fully tackle the problem of input locations, we still need to test the built models on a different set of stimuli and compare them with other known models.

Presentation_Baymacheva N. (slides)
Development of 5'-end RNA sequencing data analysis method
Student: Liuaza Etezova
Supervisor: Alexander Tkachenko
RNA sequencing (RNA-seq) is a powerful tool for studying gene regulation and functioning at the transcriptional level that has been successfully applied to various scientific questions in a plethora of organisms. Sequencing of RNA 5'-ends is an important approach for studying gene regulation with a particular focus on transcription initiation. Many software packages for analyzing 5'-end sequencing data are at the disposal of researchers. The majority of them, however, fail to address special issues arising in the context of the transcription initiation and regulation processes characteristic of different domains of life, which makes it necessary to develop a specialized approach that takes these differences into account. The aim of this study was to develop a bacterial 5'-end RNA sequencing data analysis method with the prospect of applying it to Cappable-seq, a specialized 5'-end sequencing method used for the analysis of bacterial transcription start sites. In this work, we analyzed a dataset of matched samples sequenced with RNA-seq and Cappable-seq and implemented several functionalities on top of the existing ecosystem for 5'-end data analysis. The implemented utilities allow assaying gene expression at the operon level as well as subtracting the non-enriched libraries used in Cappable-seq.

Presentation_Etezova L. (slides)
Master's theses, spring 2020
Transcriptome analysis of myoblasts C2C12 with mutations in LMNA gene
Student: Oksana Ivanova
Supervisors: Renata Dmitrieva (Almazov Center); Alexey Sergushichev
The nuclear lamina is a polymer located on the inner surface of the nuclear membrane. The lamina supports the structure of the nucleus and participates in the organization of chromatin, the regulation of gene expression and the processes of cell division. The major components of the nuclear lamina, the proteins lamin A and C, are encoded by a single gene called LMNA. Mutations in LMNA cause diseases that are grouped together as laminopathies. These disorders include cardiomyopathy, neuromuscular diseases, myo- and lipodystrophy, and metabolic syndrome. Laminopathies caused by the missense mutations p.G232E and p.R482L in LMNA affect skeletal muscle tissue. To date, treatment of laminopathy is symptomatic and there are no effective medications against the disease. Despite the large number of fundamental studies of LMNA mutations, the exact molecular mechanisms of disorder development and muscle specificity remain unknown. In this work, we investigate the gene expression and molecular pathways of muscle tissue altered by the mutations G232E and R482L in the lamin A/C gene, using a C2C12 myoblast cell model and transcriptome analysis.

Presentation_O. Ivanova (slides)
Chromothripsis in a view of spatial organization of the genome
Student: Natalia Petukhova
Supervisors: Nikita Alexeev; Sergey Aganezov (Johns Hopkins University)
Chromothripsis is a mutational phenomenon representing a unique type of highly complex structural variation. Initially described in cancerous genomes, as well as in other disorders, chromothripsis produces massive genomic alterations during a single cellular event characterized by the simultaneous shattering of chromosomes followed by random reassembly of the DNA fragments and subsequent ligation of the ends of broken segments, ultimately resulting in newly formed, mosaic derivative chromosomes. The identification of such an unforeseeable catastrophic event has profoundly changed the understanding of the genesis and etiology of complex genomic rearrangements and has provided new insights into the cellular and molecular mechanisms of genomic instability and the role of genome maintenance pathways. Several nonexclusive mechanistic models have been proposed to explain the cause and high complexity of chromothripsis events, but the molecular mechanism of this cellular catastrophe remains poorly understood, especially with regard to its prediction. The present work aims to analyze chromothripsis in light of spatial genome organization and to answer the following questions: do the breakpoints of chromothripsis rearrangements that appear in cancer have a spatial predisposition in the genome organization of normal tissue; how does the spatial location of chromothripsis breakpoints compare with that of other structural variations (SV) of non-chromothripsis origin; does the whole chromothripsis cluster have greater spatial proximity within its region compared to other genome loci without chromothriptic events.

Presentation_N. Petukhova (slides)
Estimating gene priorities in complex traits based on GWAS summary statistics
Student: Nikita Kolosov
Supervisor: Mykyta Artomov (Broad Institute / Massachusetts General Hospital)
The vast majority of human phenotypes, including diseases, are complex traits. The involvement of multiple genes and biological pathways in such phenotypes, among other factors, results in a relatively small contribution of each associated genetic marker. Genotyping array technology provides an affordable tool for investigating the genetic basis of disease. Nevertheless, a major complication in understanding disease biology from GWAS alone often arises from the inability to directly identify a complete set of causal genes. <...> We developed a novel gene prioritization method based on Positive-Unlabeled (PU) learning, Gene Prioritizer (GPrior), intended for prioritizing disease-relevant genes given a matrix of gene-level features and sets of reliably causal genes. It is an ensemble of five PU bagging classifiers that finds the optimal combination of the predictions of the individual PU algorithms. We tested our approach on both simulated and experimental data and estimated gene priorities for several traits (schizophrenia, educational attainment, IBD and coronary artery disease). GPrior delivers significantly better prediction quality than individual PU-learning algorithms, conventional ML approaches, and other gene-prioritization tools used in the field. GPrior is not yet another fine-mapping approach; rather, it is a gene-level prioritization tool that uses hidden patterns of functional relatedness among the disease-relevant genes. At the same time, GPrior is complementary to any fine-mapping approach and to GWAS results post-processing. Altogether, GPrior fills an important and currently underdeveloped niche of methods for GWAS data post-processing, significantly improving the ability to pinpoint disease genes compared to existing solutions.
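
The general idea of PU bagging can be illustrated with a minimal sketch. It is not the GPrior implementation: the base learner (a nearest-centroid score), the function name pu_bagging_scores, the sampling without replacement, and the toy gene names and features are all assumptions made for illustration. In each round a random subset of unlabeled genes is treated as negatives, a scorer is fitted against the known positives, and the unlabeled genes left out of the subset are scored; the scores are then averaged over rounds.

```python
import random
from statistics import mean


def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [mean(col) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pu_bagging_scores(features, positives, n_rounds=200, seed=0):
    """Average out-of-bag score for every unlabeled gene.

    features  : dict mapping gene -> list of gene-level features
    positives : set of genes assumed to be reliably causal
    Higher scores mean "more similar to the positives".
    The nearest-centroid scorer is a hypothetical stand-in for a real classifier.
    """
    rng = random.Random(seed)
    unlabeled = sorted(g for g in features if g not in positives)
    pos_centroid = centroid([features[g] for g in sorted(positives)])
    # Bag size: as many pseudo-negatives as positives, but always
    # leave at least one unlabeled gene out of the bag.
    k = max(1, min(len(positives), len(unlabeled) - 1))
    totals = {g: 0.0 for g in unlabeled}
    counts = {g: 0 for g in unlabeled}
    for _ in range(n_rounds):
        bag = set(rng.sample(unlabeled, k))  # treated as negatives this round
        neg_centroid = centroid([features[g] for g in bag])
        for g in unlabeled:
            if g in bag:
                continue  # only out-of-bag genes are scored
            x = features[g]
            totals[g] += sq_dist(x, neg_centroid) - sq_dist(x, pos_centroid)
            counts[g] += 1
    return {g: totals[g] / counts[g] for g in unlabeled if counts[g]}


# Toy data (invented): A and B are known positives, C resembles them,
# D, E and F do not.
features = {
    "A": [1.0, 0.9], "B": [0.9, 1.0], "C": [0.95, 0.95],
    "D": [0.1, 0.0], "E": [0.0, 0.1], "F": [0.05, 0.05],
}
scores = pu_bagging_scores(features, positives={"A", "B"})
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy example the unlabeled gene C, whose features resemble the positives, ends up at the top of the ranking. An ensemble such as the one described above would combine the outputs of several such PU classifiers rather than rely on a single one.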

Presentation_N. Kolosov (slides)
Integration of RNA-sequencing data into phenotype search system GeneQuery
Student: Boris Shpak
Supervisors: Alexander Predeus (University of Liverpool); Maxim Artyomov (Washington University in St. Louis)
GeneQuery is a novel geneset-based phenotype search engine that can be applied across all publicly available microarray experiments independent of their curation status. It uses the Weighted Gene Co-expression Network Analysis (WGCNA) unsupervised clustering algorithm to identify groups of genes that are co-regulated across the samples of each study. Although it was the first search engine spanning virtually all published microarray studies for human, mouse, and rat, an obvious limitation of GeneQuery was its inability to search RNA-seq data, which has become the method of choice for gene expression profiling over the last 10 years. This work therefore presents an update of GeneQuery that allows searching most of the published RNA-seq data. We also discuss experimental validation of some targets discovered using GeneQuery. In our earlier studies, GeneQuery revealed an unexpected connection between the transcriptional signatures of TREM2-deficient microglia and a portion of the aging-associated expression signature consisting of genes responsive to α/γ-tocopherol treatment of the mouse brain. In this work we find additional evidence of a specific transcriptional signature of TREM2-dependent microglia inflammation that is upregulated in the aging murine brain and can be reversed by α/γ-tocopherol treatment. These results led us to rethink the previous design of the validation experiments. The expression signature analysis presented in this thesis prompted experiments to assess the efficacy of administering α/γ-tocopherol to a TREM2(–/–) microglia cell culture (a model of Alzheimer's disease exacerbated by TREM2 deficiency) for mitigating pyroptosis induced by damage-associated molecules.

Presentation_B. Shpak (slides)
Chromosome-scale genome assembly from long noisy reads using Hi-C data
Student: Anton Zamyatin
Supervisors: Pavel Avdeyev (George Washington University); Nikita Alexeev
New studies of genome rearrangements cannot be carried out without chromosome-level assemblies. Contiguous genome scaffolds allow a better understanding of how chromatin is organized inside the cell nucleus. The ability to sequence long repeat regions provides insights into the organization of heterochromatin and of large centromere and telomere regions. However, long-read sequencing alone will probably not achieve this level of genome contiguity, and some regions may not be readable by the sequencer at all. In that case, good scaffolding is needed. With a reference genome this poses no problem, but without one the task is more complicated and an additional source of information is required. In the past, the best choice was mate-pair reads. Now Hi-C provides an excellent source of information about proximities in the genome. The Hi-C method is well suited for scaffolding but has some issues with low-signal regions and ambiguity in haplotype regions. Once assembly and scaffolding are finished, genome assemblies must be validated to avoid misassemblies and misjoins. The present thesis covers all of these stages of chromosome-scale genome assembly as carried out in two genome assembly projects, the Mosquitos and Barnacles projects.

Presentation_A. Zamyatin (slides)
Construction of the GATK4-based pipeline for Russian Exome Project
Student: Mrinal Vashisth
Supervisor: Yury Barbitoff (Bioinformatics Institute)
The lack of a Russian variant compendium represents a major gap on the genetic map of the world. Such a compendium could greatly enrich our understanding of variation in global populations. The Genome Russia Project is unlikely to be completed soon. For the time being, efforts are directed towards releasing a draft variant database based on a few hundred Russian exomes. A draft of the database has already been built using the Genome Analysis Toolkit (GATK3), but uniform reanalysis of the samples with newer tools (i.e., GATK4) is necessary. During this project, a variant analysis pipeline based on GATK4 Best Practices was developed. The pipeline is deployable on an HPC cluster within a containerized environment. The pipeline was used to re-analyze 1276 exome samples. The resulting variant dataset was used to compute allele frequencies, which were compared with other data sources such as the Genome Aggregation Database (gnomAD). Furthermore, the prevalence of monogenic diseases in the Russian population was analyzed statistically on the basis of known pathogenic variants. Finally, we established a variant browser to make the data publicly available. This is the first step towards developing a database similar to gnomAD comprising germline exome variants for the Russian population.

Presentation_M. Vashisth (slides)
Using RNA-sequencing data for diagnosing rare Mendelian diseases
Student: Maria Romanova
Supervisor: Alexey Sergushichev
Mutations in Mendelian diseases are located within a single genetic locus; they have low frequency but large effect size. One way to find such mutations is RNA-sequencing analysis. It enables comparison of expression between an individual sample and control samples, and can thus reveal expression outliers and imbalances in allele expression. Transcript-level information in RNA-sequencing data can help in the discovery of novel splicing events. Coding changes that affect RNA expression and splicing are usually validated with RNA-sequencing analysis, among many other functional tests. Variant calling from RNA-sequencing data is also possible. RNA sequencing can therefore serve both as a complementary method to confirm a diagnosis and as an independent method with a number of advantages. The main goal of this work was to create an automated, reproducible pipeline of the tools most suitable for analyzing RNA-sequencing data, in order to obtain a prioritized list of candidate or even causative genes to aid the diagnosis of rare Mendelian diseases.
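
The notion of an expression outlier can be illustrated with a minimal sketch. It is not the pipeline built in the thesis: it assumes a simple per-gene z-score against control samples, and the function name expression_outliers, the cutoff of 3 standard deviations, and the toy expression values are hypothetical.

```python
from statistics import mean, stdev


def expression_outliers(sample, controls, z_cutoff=3.0):
    """Return {gene: z-score} for genes whose expression in `sample`
    lies more than `z_cutoff` control standard deviations from the control mean.

    sample   : dict mapping gene -> expression value in the patient sample
    controls : dict mapping gene -> list of expression values in control samples
    """
    outliers = {}
    for gene, value in sample.items():
        reference = controls[gene]
        sd = stdev(reference)
        if sd == 0:
            continue  # no variation among controls, z-score undefined
        z = (value - mean(reference)) / sd
        if abs(z) > z_cutoff:
            outliers[gene] = z
    return outliers


# Toy values (invented): GENE2 is strongly under-expressed in the sample.
controls = {
    "GENE1": [10.0, 10.5, 9.5, 10.2, 9.8],
    "GENE2": [5.0, 5.2, 4.9, 5.1, 4.8],
}
sample = {"GENE1": 10.1, "GENE2": 1.0}
hits = expression_outliers(sample, controls)
```

Here only GENE2 is reported, with a strongly negative z-score. Dedicated tools for real RNA-sequencing data typically model count distributions and confounders rather than using a plain z-score.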

Presentation_M. Romanova (slides)
Investigation of mutations associated with autism in a cohort of children according to exome sequencing
Student: Ekaterina Gibitova
Supervisor: Pavel Dobrynin
Autism spectrum disorder (ASD) comprises a group of neurodevelopmental disorders characterized by social deficits and stereotyped behavior. In most cases the etiology of ASD is unclear, but it is generally believed that ASD has a strong genetic component. There is currently no consensus on which genes have sufficient evidence to support a relationship with ASD. Estimates of the number of ASD-related genes vary widely between research teams and clinical sequencing teams, ranging from a few to a few hundred. The purpose of this project is to discover unique mutations associated with ASD in a cohort of 194 subjects.

Presentation_E. Gibitova (slides)
Evolution of CRISPR-Cas systems and their distribution across geographic locations
Student: Sedreh Nassirnia
Supervisors: Mikhail Rayko (St. Petersburg State University); Alexander Tkachenko
CRISPR-Cas systems are adaptive immune systems present in the majority of archaea (about 90 percent) and in almost half of bacteria. CRISPR-Cas captures fragments (spacers) that originate from invasive DNA sequences, such as bacteriophages or plasmids, and builds a sequence-based array used to cleave viral mobile elements as well as other foreign DNA acquired through transformation, natural acquisition or transduction; spacers can also target the cell's own chromosome or plasmids present inside the cell. Characterizing CRISPR-Cas systems and studying their evolution not only provides a better understanding of defense mechanisms in prokaryotes but is also necessary knowledge for genome editing.
CRISPR-Cas systems evolve rapidly, and because of additional horizontal gene transfer (HGT) events, different combinations of Cas proteins give rise to multiple types of CRISPR-Cas systems. It is therefore quite challenging to study all this diversity from an evolutionary point of view. The aim of this project is to explore the diversity and distribution of the different varieties of CRISPR-Cas systems, classified by their effector complex (Cas proteins), across the phylogenetic tree.
We were able to identify different functional clusters of the Cas-related proteins. We showed that multiple clusters are present in major phyla, implying a high degree of HGT, and at the same time we found phyla associated with single clusters that may have evolved in isolation from bacteriophages.

Presentation_S. Nassirnia (slides)
Reconstruction and analysis of viral phylogenetic networks
Student: Daria Nemirich
Supervisor: Nikita Alexeev
Viral epidemics represent a significant threat to public health. In the last decade, at least seven viral outbreaks (COVID-19, Ebola, MERS-CoV, H1N1, H7N9 and others) have occurred, resulting in numerous human deaths. To prevent the spread of a disease, monitoring its current state is essential. In recent years, with the introduction of next-generation sequencing, it has become much easier to obtain comprehensive data for pathogen samples. As a result, it is now possible to establish detailed and accurate information on the outbreak source, transmission chains and viral population composition. However, despite the abundance of software created for these purposes, there are still unresolved problems, such as the absence of an adequate system for detecting recombination events and the use of overly simplified simulations of viral populations. This work aimed to address these challenges by creating a simulation pipeline that covers all aspects of viral evolution within a single host, such as mutation, recombination, and changes in haplotype fitness values and population size. In addition, a probabilistic model that governs recombination events was developed.

Presentation_D. Nemirich (slides)
Differential selection in the rhizosphere microbial communities of wheat and rye
Student: Ksenia Maximova
Supervisor: Ilia Korvigo (All-Russia Research Institute for Agricultural Microbiology)
An understanding of how microbial communities interact with plants under various environmental conditions might yield insights into macroecological processes. Since next-generation sequencing became available, many statistical methods have been adapted for ecological research to help identify microbial signatures (groups of taxa) that are associated with particular ecological patterns. Interactions between plants and microorganisms are readily apparent around plant roots, and evidence of long-range plant-specific responses in the bulk soil is growing. However, this field is covered by an insufficient number of studies, mainly because of the diversity and complexity of plant-specific responses in soil communities. Multiple studies have underscored the necessity of evaluating host-microbiome interactions for effective crop rotation and the prevention of soil deterioration. In this regard, proper modelling of plant-microbe interactions is a crucial step toward the rational exploitation of the microbiota for agricultural management.

Presentation_K. Maximova (slides)