Master's Theses
The fourth semester is mostly dedicated to the students' work on their Master's theses.
The theses are supervised by leading experts from Russian and foreign scientific centers
working in the field of bioinformatics.
Master's theses, 2020
Transcriptome analysis of myoblasts C2C12 with mutations in LMNA gene
Student: Oksana Ivanova
Supervisors: Renata Dmitrieva (Almazov Centre), Alexey Sergushichev
The nuclear lamina is a polymer located on the inner surface of the nuclear membrane. The lamina supports the structure of the nucleus and participates in chromatin organization, regulation of gene expression, and cell division. The major components of the nuclear lamina, the proteins lamin A and lamin C, are encoded by a single gene, LMNA. Mutations in LMNA cause a group of diseases known as laminopathies. These disorders include cardiomyopathy, neuromuscular diseases, myo- and lipodystrophy, and metabolic syndrome. Laminopathies caused by the missense mutations p.G232E and p.R482L in LMNA affect skeletal muscle tissue. To date, treatment of laminopathies is symptomatic, and there are no effective medications against the disease. Despite the large number of fundamental studies of LMNA mutations, the exact molecular mechanisms of disease development and muscle specificity remain unknown. In this work, we investigate the gene expression and molecular pathways of muscle tissue altered by the G232E and R482L mutations in the lamin A/C gene, using a C2C12 myoblast cell model and transcriptome analysis.

Presentation_O. Ivanova (slides)
Chromothripsis in a view of spatial organization of the genome
Student: Natalia Petukhova
Supervisors: Nikita Alexeev, Sergey Aganezov (Johns Hopkins University)
Chromothripsis is a mutational phenomenon representing a unique type of highly complex structural variation. Initially described in cancer genomes and later in other disorders, chromothripsis produces massive genomic alterations during a single cellular event: chromosomes shatter simultaneously, the DNA fragments are reassembled at random, and the broken ends are ligated, ultimately yielding newly formed, mosaic derivative chromosomes. The identification of such an unforeseeable catastrophic event has profoundly changed the understanding of the genesis and etiology of complex genomic rearrangements and has provided new insights into the cellular and molecular mechanisms of genomic instability and the role of genome maintenance pathways. Several non-exclusive mechanistic models have been proposed to explain the cause and high complexity of chromothripsis, but the molecular mechanism of this cellular catastrophe remains poorly understood, especially with respect to its prediction. The aim of the present work is to analyze chromothripsis in the light of spatial genome organization and to answer the following questions: do the breakpoints of chromothripsis rearrangements observed in cancer show a spatial predisposition in the genome organization of normal tissue; how does the spatial location of chromothripsis breakpoints compare with that of other structural variations (SVs) of non-chromothripsis origin; and does a whole chromothripsis cluster show greater spatial proximity within its region than genomic loci without chromothriptic events?

Presentation_N. Petukhova (slides)
Estimating gene priorities in complex traits based on GWAS summary statistics
Student: Nikita Kolosov
Supervisor: Mykyta Artomov (Massachusetts General Hospital)
The vast majority of human phenotypes, including diseases, are complex traits. The involvement of multiple genes and biological pathways in such phenotypes, among other factors, results in a relatively small contribution of each associated genetic marker. Genotyping array technology provides an affordable tool for investigating the genetic nature of disease. Nevertheless, a major complication in understanding disease biology from GWAS alone often arises from the inability to directly identify a complete set of causal genes. <...> We developed a novel Positive-Unlabeled (PU) learning-based gene prioritization method, Gene Prioritizer (GPrior), intended for prioritizing disease-relevant genes given a matrix of gene-level features and sets of reliably causal genes. It is an ensemble of five PU bagging classifiers that finds the optimal combination of the predictions of the individual PU algorithms. We tested our approach on both simulated and experimental data and estimated gene priorities for several traits (schizophrenia, educational attainment, IBD, and coronary artery disease). GPrior delivers significantly better prediction quality than individual PU-learning algorithms, conventional ML approaches, and other gene-prioritization tools used in the field. GPrior is not yet another fine-mapping approach; rather, it is a gene-level prioritization tool that uses hidden patterns of functional relatedness among disease-relevant genes. At the same time, GPrior is complementary to any fine-mapping approach and to GWAS results post-processing. Altogether, GPrior fills an important and currently underdeveloped niche of methods for GWAS data post-processing, significantly improving the ability to pinpoint disease genes compared to existing solutions.
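The idea of PU bagging mentioned in this abstract can be illustrated with a minimal sketch. This is not GPrior's implementation: the base learner here is a toy nearest-centroid scorer chosen only to keep the example self-contained, and the function names, gene IDs, and feature values are all hypothetical. In each round, a bootstrap sample of unlabeled genes is treated as provisional negatives, and only the genes left out of that sample are scored; the scores are then averaged over rounds.

```python
import random
from math import dist


def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def pu_bagging_scores(positives, unlabeled, n_rounds=200, seed=0):
    """Average out-of-bag score per unlabeled item (higher = more positive-like)."""
    rng = random.Random(seed)
    pos_center = centroid(positives)
    totals = [0.0] * len(unlabeled)
    counts = [0] * len(unlabeled)
    for _ in range(n_rounds):
        # Bootstrap a bag of unlabeled items and treat it as provisional negatives.
        bag = rng.choices(range(len(unlabeled)), k=len(positives))
        neg_center = centroid([unlabeled[i] for i in bag])
        in_bag = set(bag)
        for i, x in enumerate(unlabeled):
            if i in in_bag:
                continue  # score only items left out of this round's bag
            # Toy base classifier (an assumption, not GPrior's learner):
            # being closer to the positive centroid gives a higher score.
            totals[i] += dist(x, neg_center) - dist(x, pos_center)
            counts[i] += 1
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]


# Hypothetical two-feature data: three known causal genes and eight unlabeled ones.
genes = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
known_causal = [[5.0, 5.0], [5.2, 4.8], [4.9, 5.1]]
candidates = [[5.1, 5.0], [4.8, 5.2], [0.0, 0.0], [0.2, -0.1],
              [-0.1, 0.3], [0.1, 0.1], [0.0, 0.2], [0.3, 0.0]]
scores = pu_bagging_scores(known_causal, candidates)
ranking = sorted(zip(genes, scores), key=lambda pair: pair[1], reverse=True)
```

In this toy data the two unlabeled genes that resemble the known causal genes end up at the top of `ranking`. A real application would replace the centroid scorer with a proper classifier and combine several such models, as the abstract describes.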

Presentation_N. Kolosov (slides)
Integration of RNA-sequencing data into phenotype search system GeneQuery
Student: Boris Shpak
Supervisors: Alexander Predeus (University of Liverpool), Maxim Artyomov (Washington University in St. Louis)
GeneQuery is a novel gene-set-based phenotype search engine that can be applied across all publicly available microarray experiments regardless of curation status. It uses the unsupervised clustering algorithm Weighted Gene Co-expression Network Analysis (WGCNA) to identify groups of genes that are co-regulated across the samples of each study. Although GeneQuery was the first search engine spanning virtually all published microarray studies for human, mouse, and rat, an obvious limitation was its inability to search RNA-seq data, which has become the method of choice for gene expression profiling over the last 10 years. This work therefore features an update of GeneQuery that would allow us to search most of the published RNA-seq data. We also discuss experimental validation of some targets discovered using GeneQuery. In our earlier studies, GeneQuery revealed an unexpected connection between the transcriptional signatures of TREM2-deficient microglia and a portion of the aging-associated expression signature consisting of genes responsive to α/γ-tocopherol treatment of the mouse brain. In this work we find additional evidence of a specific transcriptional signature of TREM2-dependent microglia inflammation that is upregulated in the aging murine brain and can be reversed by α/γ-tocopherol treatment. These results allowed us to rethink the previous design of the validation experiments. The expression signature analysis presented in this thesis prompted experiments to assess the efficacy of administering α/γ-tocopherol to a TREM2(–/–) microglia cell culture (a model of Alzheimer's disease exacerbated by TREM2 deficiency) in mitigating pyroptosis induced by damage-associated molecules.

Presentation_B. Shpak (slides)
Chromosome-scale genome assembly from long noisy reads using Hi-C data
Student: Anton Zamyatin
Supervisors: Pavel Avdeyev (George Washington University), Nikita Alexeev
New studies of genome rearrangements cannot be carried out without chromosome-level assemblies. The contiguity of genome scaffolds allows a better understanding of the organization of chromatin inside the cell nucleus. The ability to sequence long repeat regions provides insights into the organization of heterochromatin, large centromeres, and telomere regions. However, long-read sequencing alone will probably not achieve this level of genome contiguity, and a sequencer may be unable to read some regions at all. In that case, good scaffolding is needed. With a reference genome this is straightforward, but without one an additional source of information is required. In the past, the best choice was mate-pair reads. Now there is an excellent source of information about proximities in the genome: Hi-C. The Hi-C method is well suited for scaffolding but has issues with low-signal regions and ambiguity in haplotype regions. After assembly and scaffolding are finished, genome assemblies must be validated to avoid misassemblies and misjoins. The present thesis covers all of these stages of chromosome-scale genome assembly as carried out in two genome assembly projects, the Mosquitoes and Barnacles projects.

Presentation_A. Zamyatin (slides)
Construction of the GATK4-based pipeline for Russian Exome Project
Student: Mrinal Vashisth
Supervisor: Yury Barbitoff (Bioinformatics Institute)
The lack of a Russian variant compendium represents a major gap on the genetic map of the world. Such a compendium can greatly enrich our understanding of variation in global populations. The Genome Russia Project is unlikely to be completed soon. For the time being, efforts are directed towards releasing a draft variant database based on a few hundred Russian exomes. A draft of the database has already been produced with data analysis based on the Genome Analysis Toolkit (GATK3), but uniform reanalysis of the samples with newer tools (i.e., GATK4) is necessary. During this project, a variant analysis pipeline based on GATK4 Best Practices was developed. The pipeline is deployable on an HPC cluster within a containerized environment. The constructed pipeline was used to reanalyze 1276 exome samples. The resulting variant dataset was used to compute allele frequencies, which were compared with other data sources such as the Genome Aggregation Database (gnomAD). Furthermore, statistical analyses of monogenic disease prevalence in the Russian population were performed based on known pathogenic variants. Finally, we established a variant browser to make the data publicly available. This is the first step towards developing a database similar to gnomAD comprising exome germline variants for the Russian population.
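As a small illustration of the allele-frequency step mentioned above, the sketch below computes the alternate allele frequency at one site from genotype strings in the VCF `GT` style. It is not part of the described pipeline: the function name is hypothetical, and the example assumes a biallelic site where `0` is the reference allele, `1` the alternate allele, and `.` a missing call.

```python
def allele_frequency(genotypes):
    """Alternate allele frequency at a biallelic site.

    genotypes: iterable of calls such as "0/1", "1|1" or "./.".
    Missing alleles (".") are excluded from both numerator and denominator.
    Returns None when no allele was called.
    """
    alt_count = 0
    called = 0
    for gt in genotypes:
        # Treat phased ("|") and unphased ("/") calls the same way.
        for allele in gt.replace("|", "/").split("/"):
            if allele == ".":
                continue
            called += 1
            if allele == "1":
                alt_count += 1
    return alt_count / called if called else None


# Four samples: one hom-ref, one het, one hom-alt, one missing call.
cohort_af = allele_frequency(["0/0", "0/1", "1|1", "./."])
```

Here three of the six called alleles are alternate, so `cohort_af` is 0.5. A frequency computed this way can then be set against the corresponding population frequency from a reference resource.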

Presentation_M. Vashisth (slides)
Using RNA-sequencing data for diagnosing rare Mendelian diseases
Student: Maria Romanova
Supervisor: Alexey Sergushichev
Mutations underlying Mendelian diseases are located within a single genetic locus; they have a low frequency but a large effect size. RNA-sequencing analysis is one method for finding such mutations. It enables comparison of expression between an individual sample and control samples and can thus reveal expression outliers and imbalances in allele expression. Transcript-level information in RNA-sequencing data can help in the discovery of novel splicing events. Coding changes that affect RNA expression and splicing are usually validated by RNA-sequencing analysis, among many other functional tests. Variant calling is also possible. Thus, RNA sequencing can serve as a complementary method to confirm a diagnosis, as well as an independent method with a number of advantages. The main goal of this work was therefore to create an automated, reproducible pipeline of the tools most suitable for analyzing RNA-sequencing data, in order to obtain a list of prioritized candidate or even causative genes to aid the diagnosis of rare Mendelian diseases.
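The comparison of one sample against controls can be sketched with a simple per-gene z-score. This is only an illustration of the outlier idea, not the pipeline built in the thesis: the function name, the cutoff of 3, and the gene names and values are assumptions, and the input is assumed to be normalized, log-scale expression.

```python
from statistics import mean, stdev


def expression_outliers(patient, controls, z_cutoff=3.0):
    """Return {gene: z} for genes whose patient value deviates from controls.

    patient:  {gene: expression value}
    controls: {gene: list of expression values in control samples}
    """
    outliers = {}
    for gene, value in patient.items():
        reference = controls.get(gene, [])
        if len(reference) < 2:
            continue  # a standard deviation needs at least two controls
        sd = stdev(reference)
        if sd == 0:
            continue  # no variation among controls, z-score undefined
        z = (value - mean(reference)) / sd
        if abs(z) >= z_cutoff:
            outliers[gene] = z
    return outliers


# Hypothetical data: GENE_A is strongly under-expressed in the patient.
controls = {"GENE_A": [10, 11, 9, 10, 10], "GENE_B": [5, 6, 5, 6, 5]}
patient = {"GENE_A": 2, "GENE_B": 5.5}
hits = expression_outliers(patient, controls)
```

With these values only `GENE_A` is reported, with a negative z-score. Dedicated outlier-detection tools model count data and confounders more carefully than this plain z-score does.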

Presentation_M. Romanova (slides)
Investigation of mutations associated with autism in a cohort of children according to exome sequencing
Student: Ekaterina Gibitova
Supervisor: Pavel Dobrynin
Autism spectrum disorder (ASD) comprises a group of neurodevelopmental disorders characterized by social deficits and stereotyped behavior. In most cases the etiology of ASD is unclear, but it is generally believed that ASD has a strong genetic component. There is currently no consensus on which genes have sufficient evidence to support a relationship with ASD. Estimates of the number of ASD-related genes differ widely between research and clinical sequencing teams, ranging from a few to a few hundred. The purpose of this project is to discover unique mutations associated with ASD in a cohort of 194 subjects.

Presentation_E. Gibitova (slides)
Evolution of CRISPR-Cas systems and their distribution across geographic locations
Student: Sedreh Nassirnia
Supervisors: Mikhail Rayko (SPbU), Alexander Tkachenko
CRISPR-Cas systems provide adaptive immunity and are present in the majority of archaea (about 90 percent) and in almost half of bacteria. They capture fragments of invasive DNA (spacers) originating from viruses (bacteriophages, in the case of bacteria) or plasmids and build a sequence-based array used to cleave viral and mobile genetic elements. They can also target ancillary DNA acquired by transformation, natural acquisition, or transduction, as well as the cell's own chromosome or plasmids present inside the cell. Characterizing CRISPR-Cas systems and studying their evolution not only provides a better understanding of defense mechanisms in prokaryotes but also yields knowledge necessary for genome editing.
CRISPR-Cas systems evolve rapidly, and because of additional horizontal gene transfer events, different combinations of Cas proteins give rise to multiple types of CRISPR-Cas systems. It is therefore quite challenging to study all this diversity from an evolutionary point of view. The aim of this project is to characterize the diversity and distribution of different varieties of CRISPR-Cas systems, based on their effector complexes (Cas proteins), across the phylogenetic tree.
We identified different functional clusters of Cas-related proteins. We showed that multiple clusters are present in major phyla, implying a high degree of horizontal gene transfer (HGT), and at the same time we found phyla associated with single clusters that may have evolved in isolation from bacteriophages.

Presentation_S. Nassirnia (slides)
Reconstruction and analysis of viral phylogenetic networks
Student: Daria Nemirich
Supervisor: Nikita Alexeev
Viral epidemics remain a significant threat to public health. In the last decade, at least seven viral outbreaks (COVID-19, Ebola, MERS-CoV, H1N1, H7N9, and others) have occurred, resulting in numerous human deaths. To prevent the spread of a disease, monitoring its current state is essential. In recent years, with the introduction of next-generation sequencing, it has become much easier to obtain comprehensive data for pathogen samples. As a result, it is now possible to establish detailed and accurate information on the outbreak source, transmission chains, and viral population composition. However, despite the abundance of software created for these purposes, unresolved problems remain, such as the absence of an adequate system for detecting recombination events and the use of oversimplified simulations of viral populations. This work aimed to address these challenges by creating a simulation pipeline that includes all aspects of viral evolution within a single host, such as mutations, recombination, and changes in haplotype fitness values and population size. In addition, a probabilistic model that governs recombination events was developed.
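The ingredients listed in this abstract (mutation, recombination, haplotype fitness, population size) can be combined in a very small Wright-Fisher-style sketch. This is an illustrative assumption, not the thesis pipeline or its recombination model: all function names and parameters are hypothetical, haplotypes are plain strings of equal length (at least two bases), and recombination is a single random crossover between two sampled parents.

```python
import random

ALPHABET = "ACGT"


def mutate(seq, rate, rng):
    """Replace each base with a different one with probability `rate`."""
    return "".join(
        rng.choice([b for b in ALPHABET if b != base]) if rng.random() < rate else base
        for base in seq
    )


def recombine(a, b, rng):
    """Single crossover: prefix of `a` joined to the suffix of `b`."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def next_generation(population, fitness, size, mu, rho, rng):
    """One generation: fitness-weighted sampling, then recombination and mutation.

    fitness: function mapping a haplotype to a positive weight
    size:    number of offspring (may differ from len(population))
    mu:      per-base mutation probability
    rho:     per-offspring recombination probability
    """
    weights = [fitness(h) for h in population]
    offspring = []
    for _ in range(size):
        child = rng.choices(population, weights=weights)[0]
        if rng.random() < rho:
            other = rng.choices(population, weights=weights)[0]
            child = recombine(child, other, rng)
        offspring.append(mutate(child, mu, rng))
    return offspring


rng = random.Random(1)
founders = ["AAAAAAAA"] * 5 + ["CCCCCCCC"] * 5
# Neutral fitness, no mutation, recombination in every offspring.
generation = next_generation(founders, lambda h: 1.0, 20, 0.0, 1.0, rng)
```

Changing `size` between generations models a varying population size, and a non-constant `fitness` function models selection among haplotypes.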

Presentation_D. Nemirich (slides)
Differential selection in the rhizosphere microbial communities of wheat and rye
Student: Ksenia Maximova
Supervisor: Ilia Korvigo (Ksivalue; All-Russia Research Institute for Agricultural Microbiology)
An understanding of how microbial communities interact with plants under various environmental conditions might yield insights into macroecological processes. Since next-generation sequencing became available, many statistical methods have been adapted for ecological research to help identify microbial signatures (groups of taxa) associated with particular ecological patterns. Interactions between plants and microorganisms are readily apparent around plant roots, and evidence of long-range plant-specific responses in the bulk soil is growing. However, this field is covered by an insufficient number of studies, mainly because of the diversity and complexity of specific plant responses in soil communities. Multiple studies have underscored the need to evaluate host-microbiome interactions for effective crop rotation and the prevention of soil deterioration. In this regard, proper modelling of plant-microbe interactions is a crucial step toward the rational exploitation of the microbiota for agricultural management.

Presentation_K. Maximova (slides)