During the second semester, each student is involved in a Course project, which may be further developed into a Master's thesis. The projects are supervised by the leading experts from Russian and foreign scientific centers working in the field of bioinformatics.
Systems Biology projects of different years:

Spring 2021
Spring 2020
Spring 2019
Projects, 2021
Uncovering possible cells-of-origin for medulloblastoma tumors from comparison to normal brain scRNA-seq data
Student: Ekaterina Petrova
Supervisor: Konstantin Okonechnikov (German Cancer Research Center)

The report presents the results of research devoted to the analysis of publicly available single-cell RNA-seq data derived from human medulloblastoma tumors combined with normal human fetal cerebellum data. The goal of the project was to discover similarities between the transcriptomes of malignant and normal cells in order to find the origins of different medulloblastoma subgroups. Integration of cerebellar and tumor data, preprocessing, dimensionality reduction, and clustering with 2D UMAP visualization were performed using the Seurat R package. Cell coordinates extracted from UMAP were used to determine the nearest normal cerebellar cell type for each tumor cell. In parallel, we applied the correlation-based SingleR method to the medulloblastoma single-cell data, using the normal fetal cerebellar dataset as a reference, to verify the UMAP-derived results. The analysis revealed the expected correspondence between the SHH subgroup of medulloblastoma and granule neurons. Medulloblastoma Group 4 was associated with unipolar brush cells. For Group 3, we were not able to detect any significant similarity. Both the UMAP-based clustering approach and the SingleR method yielded comparable results, so this strategy might be beneficial for future comparisons between tumors and single-cell RNA datasets of normal human tissues.
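The nearest-cell-type step described above can be illustrated with a minimal Python sketch. The project itself used Seurat in R; the function names, coordinates, and labels below are hypothetical toy values, and plain Euclidean distance in the 2D embedding is an assumption about how "nearest" was defined.

```python
import math
from collections import Counter


def nearest_reference_label(tumor_xy, reference):
    """Return the label of the reference cell closest to tumor_xy.

    reference is a list of ((x, y), label) pairs taken from a 2D embedding.
    """
    best_label = None
    best_dist = math.inf
    for (x, y), label in reference:
        dist = math.hypot(tumor_xy[0] - x, tumor_xy[1] - y)
        if dist < best_dist:
            best_dist = dist
            best_label = label
    return best_label


def assign_tumor_cells(tumor_cells, reference):
    """Count how many tumor cells fall nearest to each reference cell type."""
    return Counter(nearest_reference_label(xy, reference) for xy in tumor_cells)


# Toy embedding coordinates (hypothetical, for illustration only).
reference = [
    ((0.0, 0.0), "granule neuron"),
    ((5.0, 5.0), "unipolar brush cell"),
]
tumor_cells = [(0.5, -0.2), (4.8, 5.1), (0.1, 0.3)]
counts = assign_tumor_cells(tumor_cells, reference)
```

In this toy example, `counts` summarizes which normal cell type each tumor cell lies closest to; on real data the same tally would be made per tumor subgroup.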

Comparison of the clinical risk scales for multiple psychiatric traits in their ability to detect genetic susceptibility
Student: Darya Pinakhina
Supervisor: Mykyta Artomov (Broad Institute/ Massachusetts General Hospital)

In this project we wanted to assess the power of qualitative and quantitative DSM and HADS scales for anxiety, depression and bipolar disorder to detect genetic susceptibility to these mental health conditions. For this purpose, the analysis of the results of 7 GWAS studies using these scales has been carried out. Replication rates for the corresponding GWAS results at variant and gene levels, character of causal probability distributions for the most probable associated genes and the results of gene enrichment analysis using large-scale reference GWAS studies for the corresponding conditions indicate that DSM scales tend to perform better both for anxiety and depression, and qualitative scales have shown higher power for anxiety in our case. This is the first assessment of HADS and DSM scales' performance in GWAS for the Russian cohort, and the results, after further verification, could be used to guide further GWAS studies, as well as for testing in clinical practice until more advanced approaches to diagnose mental health conditions are developed.
Clustering of structural descriptors data for vitamin D receptor
Student: Elizaveta Vinogradova
Supervisor: Karina Pats
Clustering of protein structures is an important task of structural bioinformatics. Clustering can be performed, for example, (1) using structural descriptors calculated from different protein structures, (2) using atomic coordinates, (3) based on amino acid sequence comparison (pairwise alignment) or (4) by comparing structural domains. In this project, structural descriptors were used to cluster the protein structures of the vitamin D receptor (VDR) of three species (Homo sapiens, Rattus norvegicus, Danio rerio). Clustering revealed an outlier on independently normalized VDR data that was not detected by other approaches. The result was confirmed by structure source information. Clustering by structural descriptors was compared with clustering using a matrix of pairwise alignment scores and with the CD-HIT clustering tool.

Phenotype-driven gene prioritization for rare diseases
Student: Valentina Yakushina
Supervisor: Dmitrii Smirnov (Technical University of Munich)

Phenotype-driven gene prioritization utilizes known associations between clinically relevant genes and clinical phenotypes to return an ordered list of genes in which the causal gene is ideally ranked first. A benchmark of major phenotype-driven gene prioritization tools (AMELIE, PCAN, PubCaseFinder, Phen2gene, GADO) was performed. PubCaseFinder returns the smallest number of putative genes, with the best rank for causal genes and the highest rate of missed causal genes. AMELIE, Phen2gene, and GADO return the highest number of putative genes, with the worst rank of causal genes and the lowest number of missed causal genes. PCAN takes an intermediate position in the number of returned genes, with causal gene ranks close to those of PubCaseFinder and a low number of missed genes.
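The three quantities compared in the benchmark (number of returned genes, rank of the causal gene, and missed causal genes) can be computed as in the following sketch. The function names and the placeholder gene lists are hypothetical; the actual benchmark data and tool outputs are not reproduced here.

```python
def causal_gene_rank(ranked_genes, causal_gene):
    """Return the 1-based rank of the causal gene, or None if it is missed."""
    try:
        return ranked_genes.index(causal_gene) + 1
    except ValueError:
        return None


def summarize(results):
    """Summarize one tool's output over a set of cases.

    results is a list of (ranked_genes, causal_gene) pairs, one per case.
    """
    ranks = [causal_gene_rank(genes, causal) for genes, causal in results]
    found = [rank for rank in ranks if rank is not None]
    return {
        "missed": len(ranks) - len(found),
        "top1": sum(1 for rank in found if rank == 1),
        "mean_list_length": sum(len(genes) for genes, _ in results) / len(results),
    }


# Placeholder cases (hypothetical gene names, for illustration only).
cases = [
    (["GENE_A", "GENE_B", "GENE_C"], "GENE_A"),  # causal gene ranked first
    (["GENE_D", "GENE_E"], "GENE_E"),            # causal gene ranked second
    (["GENE_F"], "GENE_G"),                      # causal gene missed
]
summary = summarize(cases)
```

A tool that returns short lists will tend to have a low `mean_list_length` but a higher `missed` count, which is the trade-off the abstract describes.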
Genomic landscape in multiple myeloma
Student: Ekaterina Kazantseva
Supervisor: Anna Zhuk
Multiple myeloma is the second most common blood cancer. Approximately 25% of myeloma patients die within the first 3 years of their disease, and approximately 10% die within the first year[1]. One promising way of finding new therapies is to identify genomic driver events and use specific drugs to target these aberrations. WES data analysis of MM patients enables us to find disease-causing variants and improve treatment. We analysed samples from 2 patients and found germline and somatic mutations associated with myeloma.
Establishing connection between chromosome-length assemblies and karyotypes
Student: Aliya Yakupova
Supervisor: Sergei Kliver (Institute of Molecular and Cellular Biology SB RAS)

Reference genome assemblies are very important for conservation biology. They help to understand genetic diversity and to locate and visualize low-heterozygosity regions in threatened species. The best approach for that is to use chromosome-level assemblies, which provide better estimates and are easier to work with than fragmented assemblies. However, even chromosome-level assemblies differ in many ways from real chromosomes. For example, heterochromatin regions of chromosomes usually remain unassembled. This work aimed to establish a connection between chromosome-length assemblies and karyotypes of Ailurus fulgens (red panda), Bassariscus astutus (ringtail), and Procyon lotor (common raccoon), using Felis catus as a reference genome. In this project, a fully automated, scalable pipeline for whole-genome alignment was developed, which can be deployed on multiple platforms. Assemblies and karyotypes were compared using the obtained alignments and Zoo-FISH data. Moreover, significant chromosomal rearrangements such as inversions and translocations were detected in defined chromosomes.

Reconstruction of FaRLiP photosynthetic gene cluster from a novel cyanobacteria
Student: Diana Lupova
Supervisor: Anton Korobeynikov (Saint Petersburg State University)

The report presents the results of a project on detection of the FaRLiP cluster in novel cyanobacteria. Analysis was performed on two samples of sequencing data from unicellular organisms, shown by 16S rRNA and transcribed spacer analysis to be two strains belonging to a novel cyanobacterial genus. The goal of this project was to detect and reconstruct the FaRLiP gene cluster in the sample that demonstrated the ability for far-red light photoacclimation and to confirm the absence of such a cluster in the sample with no photoacclimation activity.
Search for fusion genes in follicular lymphoma samples
Student: Maria Pospelova
Supervisor: Igor Evsyukov

Follicular lymphoma is a hematological cancer. Although the mechanism of neoplasm formation is not fully investigated, it is known that impaired B-cell apoptosis is responsible for it. Treatment strategies based on individual genomes can be formed only when the target fusion genes are known, for instance the translocation t(14;18)(q32;q21), which is known as the fusion responsible for follicular lymphoma. This research focused on fusion gene detection with four tools: (1) Arriba; (2) STAR-Fusion; (3) FusionCatcher; (4) Kallisto/Pizzly. After detection, the fusion genes that were found by two or more tools were chosen for further analysis. Some of the found fusions had known connections to certain types of cancer. Others appeared to be new, so annotation analysis was performed. PCR primers were prepared for further validation by Sanger sequencing.
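The "two or more tools" selection rule can be sketched in Python as follows. The function name is hypothetical, the gene names are placeholders rather than real calls, and matching fusions by an exact gene pair is an assumption; real tool outputs would first need to be parsed and normalized.

```python
from collections import defaultdict


def consensus_fusions(calls_by_tool, min_tools=2):
    """Keep fusions reported by at least min_tools different tools.

    calls_by_tool maps a tool name to a set of (gene_5prime, gene_3prime) pairs.
    Returns a dict mapping each kept fusion to the sorted list of supporting tools.
    """
    support = defaultdict(set)
    for tool, fusions in calls_by_tool.items():
        for fusion in fusions:
            support[fusion].add(tool)
    return {
        fusion: sorted(tools)
        for fusion, tools in support.items()
        if len(tools) >= min_tools
    }


# Placeholder calls (hypothetical gene names, for illustration only).
calls_by_tool = {
    "Arriba": {("GENE_A", "GENE_B"), ("GENE_C", "GENE_D")},
    "STAR-Fusion": {("GENE_A", "GENE_B")},
    "FusionCatcher": {("GENE_E", "GENE_F")},
    "Kallisto/Pizzly": {("GENE_C", "GENE_D")},
}
kept = consensus_fusions(calls_by_tool)
```

Here the fusion reported only by one tool is dropped, while the two fusions supported by two tools each are kept together with the names of the tools that reported them.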
PathSeq parameters optimization for improved pathogens search in human tissue NGS data
Student: Aleksandr Cherdintsev
Supervisor: Olga Kudryashova (BostonGene)
Numerous tools exist for detecting microbiota by sequencing data analysis, yet they produce inaccurate results if not tuned for the specific case. Fortunately, articles benchmarking tools and their parameters for various cases are being published. However, some cases remain uncovered. For example, PathSeq, one of the most frequently used tools for biota detection, has not yet been tested on human biopsy sample sequencing data. Thus, the current work aimed to reveal the optimal parameters of PathSeq for processing this type of data.
Identifying metabolic modules in TCGA datasets
Student: Evgenia Chikina
Supervisor: Anastasiia Gainullina

The GAM-clustering algorithm is a method based on joint clustering in network and correlation spaces that produces metabolic modules both as a graph representation and as patterns (the average gene expression in a particular module). After the modules are identified, their annotation can be explored and compared with the patterns. Here we present a modified GAM-clustering pipeline in which solver interactions go through the specialized R library mwcsr, and which includes a new parameter, maximum module size, that allows automatic dynamic estimation of the base (a parameter of the algorithm) along with module identification. The results show correspondence between the two approaches to setting the base parameter and demonstrate the ability of the algorithm to find biologically relevant modules. Results are shown for the TCGA LUSC dataset and reveal metabolic differences depending on the presence or absence of mutations in the NFE2L2 and KEAP1 genes.
Hi-C maps visualization and by-hand scaffolding software development
Student: Konstantin Danilov
Supervisor: Anton Zamyatin

In this project we implemented a backend model for manual scaffolding using a Hi-C map in the Python 3 programming language. It works with Hi-C maps in the cool format and allows the user to move and reverse contigs. The resulting sequences can be saved to a new FASTA file based on the original one, and the Hi-C map representations can be saved as PNG images. We used a state model to store the current states of contigs and recalculate the region of interest on request from the original Hi-C map. The operations are fast: moving a contig takes O(c) and reversing a contig takes O(1), but the visualization part is much slower. There are two options for users, a command-line interface or a Jupyter Notebook, both freely available on GitHub.
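A state model of this kind might look like the sketch below. It is not the project's actual code: the class and method names are hypothetical, c is assumed to denote the number of contigs, and reading the cool file or recalculating the map is left out. The sketch only shows why reversing can be O(1) (flipping a flag) while moving is O(c) (shifting entries in the ordered list).

```python
class ContigLayout:
    """Hypothetical state model: contig order plus one orientation flag each."""

    def __init__(self, names):
        self.order = list(names)                      # current left-to-right order
        self.reversed = {name: False for name in names}

    def reverse(self, name):
        """Flip the orientation flag of one contig: O(1)."""
        self.reversed[name] = not self.reversed[name]

    def move(self, name, new_index):
        """Move a contig to a new position: O(c) for c contigs."""
        self.order.remove(name)
        self.order.insert(new_index, name)

    def describe(self):
        """Return (name, strand) pairs in the current order."""
        return [(name, "-" if self.reversed[name] else "+") for name in self.order]


layout = ContigLayout(["contig_1", "contig_2", "contig_3"])
layout.reverse("contig_2")
layout.move("contig_3", 0)
state = layout.describe()
```

Because only the order and orientation flags are stored, the original Hi-C matrix never has to be rewritten; a viewer can map the requested region back to the original coordinates using `state`.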
Virtual high-throughput screening of bile acids
Student: Elizaveta Zhaivoron
Supervisor: Karina Pats

Nuclear receptors (NRs) are transcription factors able to regulate various cellular processes. Bile acids (BAs) can bind nuclear receptors in transcriptional complexes. It is known that some bile acids can promote cancer, while others suppress it. There is a lack of structural data for BA-NR complexes, so in this research we performed virtual high-throughput screening of a bile acid library and determined the specific binding profiles of bile acids with the vitamin D receptor, pregnane X receptor, and farnesoid X receptor.
Comprehensive analysis of differentially expressed genes between advanced-stage follicular lymphoma and diffuse large B-cell lymphoma samples
Student: Maria Shumilova
Supervisor: Igor Evsyukov

Follicular lymphoma (FL) and diffuse large B-cell lymphoma (DLBCL) are the most common forms of non-Hodgkin lymphoma among adults. FL is slow-growing, while DLBCL is clinically aggressive. Transformation of FL into DLBCL is associated with rapid progression, treatment resistance and poor prognosis. The aim of this project was to search for potential therapeutic targets and biological markers of FL transformation. We performed differential gene expression analysis between FL grade 3 and DLBCL using RNA-Seq data. After identifying significant genes, we performed pathway enrichment analysis. The results suggest an important role of signaling pathways associated with metal ions in the transformation of follicular lymphoma. Some of the genes found could potentially be used as therapeutic targets or as biomarkers of follicular lymphoma malignization.
Evolutionary analysis of different prokaryotic species based on synteny blocks
Student: Asya Marshak
Supervisor: Alexey Zabelkin

Bacterial genomes have very high plasticity, provided by rearrangements such as inversions, deletions, insertions, and duplications. When such rearrangements occur independently in different strains, this indicates parallel adaptation that can be responsible for multi-virulence, antibiotic resistance, and antigenic variation. Such processes can be the cause of increased pathogenicity in some strains. Studying the mechanisms of evolution of various species can help in identifying pathogenic strains. We took four human pathogens that exist in different conditions and studied their evolutionary mechanisms with PaReBrick and pan-genomic analysis.
Mining clinical knowledge on Phenotype - Mendelian disease gene associations
Student: Gleb Korelsky
Supervisor: Dmitrii Smirnov (Technical University of Munich)
Researchers have developed phenotype-driven differential-diagnosis systems to improve the speed and accuracy of differential diagnostics of Mendelian disorders. These systems usually rely on databases of gene-disease associations (GDA) and gene-phenotype associations (GPA). Therefore, their performance depends on the quality and quantity of data stored in the GDA/GPA databases. In this project, a pipeline was developed to retrieve GDAs and GPAs from medical case reports deposited in PubMed.
Projects, 2020
Pilot analysis of the inherited predisposition to lung cancer recurrence
Student: Valeria Rezapova
Supervisor: Mykyta Artomov (Massachusetts General Hospital)

Lung cancer is one of the most important medical and socio-economic problems in most developed countries of the world due to its leading position in the structure of cancer incidence and mortality. Recurrence occurs in 30% of lung cancer patients after radical therapy, and about 50% of patients with recurrence die within 2 years. Recurrence could be predicted using molecular and immunohistochemical methods and clinical observations of epithelial changes. Here, we analyzed pilot exome sequencing data from 3 groups of lung cancer patients: 10 samples with distant recurrence, 10 samples with locoregional recurrence and 10 controls with no recurrence. We describe optimal quality filtering and association study strategies and two gene candidates with prior evidence of involvement in cancer progression, PRSS21 and KRT6A.
Melanoma somatic variant and copy number alterations data analysis
Student: Dmitrii Usoltsev
Supervisor: Mykyta Artomov (Massachusetts General Hospital)

Melanoma is the deadliest skin cancer. In 2020, it is estimated that there will be 100,350 new cases of melanoma and 6,850 deaths in the United States. Unlike other cancers, melanoma has a clear pattern of onset mediated by exposure to a carcinogen, UV light. Here we present an analysis of lab-generated melanoma tumors obtained by UV irradiation of a healthy melanocyte cell line and further grown in immunocompromised mice. We show that artificial melanoma obtained in this way differs from the majority of clinically obtained melanomas in patients. Specifically, the ultraviolet mutational signature discriminates between the melanoma origins, although natural and lab-grown tumors are very similar according to the overall gene signature. Using somatic variant and copy number variation calling approaches, three gene candidates were found: LCE3C, IFIH1 and MEIS2. Deletion of LCE3C is crucial for skin barrier function. IFIH1 is involved in tumor cell apoptosis. MEIS2 is important for cancer pathogenesis. All these genes significantly influence the survival of patients with melanoma.
Analysis of scRNA-seq data of mitral valve disease in K/BxN mouse model
Student: Maria Firulyova
Supervisor: Konstantin Zaitsev
The heart valves are structures composed of different cell populations, and the mitral and tricuspid valves each have specific properties at the morphological, physiological and other levels, including cell-type composition. Due to these differences, the valves do not contribute equally to the progression of some diseases, in particular mitral valve disease. However, the factors that predispose to the manifestation of mitral valve disease are still unclear. The project is focused on secondary analysis of single-cell RNA sequencing of both valves under control and inflamed conditions. The analysis covered multiple topics, including annotation of the cell populations present in the data, changes in cell-type composition during disease progression in the mitral valve, and the role of endothelial-mesenchymal transition in the development of mitral valve disease.
Microarray analysis of gene expression induced by vitamin D3 and its analogs in a human THP-1 monocytic cell line
Student: Marina Terekhova
Supervisors: Ferdinand Molnar (Nazarbayev University); Alexey Sergushichev
It is well established that, besides regulating bone and calcium homeostasis, vitamin D is involved in a multitude of fundamental cellular processes, including the cell cycle, apoptosis and differentiation, as well as effects on carcinogenesis, immune function, autoimmune diseases and cardiovascular disorders. Since the supraphysiological doses of 1,25(OH)2D3 that are necessary to obtain these non-classical effects result in hypercalcemia, a huge variety of analogs were developed to minimize the calcemic side effects while preserving or augmenting the beneficial effects of 1,25(OH)2D3. The development of DNA microarray technologies has created the opportunity to investigate the effects of 1,25-dihydroxyvitamin D3 on the gene expression profile in various cell types. In this study, we aimed to examine changes in gene expression associated with stimulation by vitamin D3 and its analogs in THP-1 cells. THP-1 monocytic human cells were treated for 4 hours with 100 nM 1,25-dihydroxyvitamin D3, 100 nM Gemini or 100 nM TX527, and gene expression was examined using the Illumina HumanHT-12 V4.0 expression BeadChip. Limma-based gene prioritization identified a set of 224 genes with adjusted p-values < 0.05 for 1,25-dihydroxyvitamin D3 versus ethanol (113 up- and 111 downregulated genes), a set of 742 genes for Gemini versus ethanol (268 up- and 474 downregulated genes) and a set of 1166 genes for TX527 versus ethanol (429 up- and 737 downregulated genes). Differences in gene expression between stimulations with vitamin D preparations were generally implicit, and by further analysis of overlapping genes, using published data for THP-1 cells, 15 up- and 3 downregulated genes were selected as vitamin D key target genes. Most of these vitamin D-induced genes are related to immune system regulation. The overall picture of gene set enrichment analysis showed a very low number of significant results; however, overlap with vitamin D pathways and ChIP-Seq datasets was clearly evident.
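The selection of up- and downregulated genes by adjusted p-value can be illustrated with a small Python sketch. The study used limma in R; the sketch assumes the Benjamini-Hochberg procedure for the adjustment (the abstract does not state which method was used), and the function names, fold changes, and p-values are hypothetical toy values.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def count_significant(log_fold_changes, pvalues, alpha=0.05):
    """Count up- and downregulated genes with adjusted p-value below alpha."""
    adjusted = benjamini_hochberg(pvalues)
    up = sum(1 for lfc, q in zip(log_fold_changes, adjusted) if q < alpha and lfc > 0)
    down = sum(1 for lfc, q in zip(log_fold_changes, adjusted) if q < alpha and lfc < 0)
    return up, down


# Toy values (hypothetical): log fold changes and raw p-values for five genes.
toy_lfc = [1.2, -0.8, 0.3, -2.0, 0.1]
toy_p = [0.001, 0.02, 0.01, 0.0005, 0.5]
up, down = count_significant(toy_lfc, toy_p)
```

Applied to each treatment-versus-ethanol comparison, a tally of this form yields the up/down counts reported per comparison.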
RNA-seq analysis of early and delayed guar (Cyamopsis tetragonoloba) varieties
Student: Elizaveta Grigorieva
Supervisor: Alexander Tkachenko
Guar (Cyamopsis tetragonoloba (L.) Taub.) is an annual legume crop widely cultivated in India and Pakistan. Guar beans are a source of guar gum, which is used in many industries, including food, cosmetics, and oil. However, the lack of information about the guar genome makes it difficult to breed this crop across the world and especially in Russia. The main challenge in guar breeding is day length, because guar is a short-day plant and most guar varieties cannot start to flower under Russian conditions. In this study, RNA-Seq technology and metabolome investigation were used with the goal of identifying genes responsible for photoperiodic sensitivity by an integrative omics approach for 6 early and 9 delayed guar varieties.
SNP calling in transcriptomic and genomic data from guar, Cyamopsis tetragonoloba
Student: Aleksandar Beatovich
Supervisor: Alexander Tkachenko

The guar plant, Cyamopsis tetragonoloba, is a leguminous annual herbaceous plant whose seed contains galactomannan, a polysaccharide with multiple industrial applications. For this reason, identifying SNPs would provide useful information to plant breeders and allow enhanced exploitation of this plant. In this article, SNPs were called from guar RNA-seq data and a novel reference genome was assembled. The 20800 SNPs reported by the KisSplice pipeline are consistent with previously reported numbers of SNPs from guar RNA-seq data. The high number (102778) of SNPs reported by the GATK-rna pipelines is at odds with the previously reported number of SNPs in RNA-seq data, indicating possible false positives. The novel guar genome assembly, although an improvement on the previous draft assembly, needs to have its level of fragmentation further reduced in order to be used as a reference in the GATK cohort genotyping pipeline, due to a runtime issue in one of the steps of the pipeline.
Re-annotation of the Intoshia linei genome (Orthonectida)
Student: Elizaveta Skalon
Supervisors: George Slyusarev, Natalya Bondarenko (St. Petersburg State University)
Intoshia linei is an invertebrate organism belonging to the phylum Orthonectida. Orthonectids are parasites of various marine animals. For a long time, orthonectids were considered to be a basal group due to their extremely simplified nervous and muscle systems. However, recent molecular studies place them among highly specialized annelids, the segmented worms. Because of the many exciting aspects of orthonectid biology, they are well suited to address many questions in integrative organismal biology. Exploring orthonectid genomes is an essential step towards revealing the global mechanism behind secondary reduction in parasitic organisms. Our main goal was to improve the accuracy of the existing genome annotation of Intoshia linei in light of new RNA-sequencing data. We sequenced and assembled an additional transcriptome of Intoshia linei adults and re-annotated the genome with the MAKER pipeline. The annotation assessment results demonstrate that the updated annotation contains more structural elements and is more consistent with our RNA-seq data.
The role of the microbiome in Parkinson's Disease
Student: Mary Futey
Supervisor: Dmitry Rodionov (Sanford-Burnham-Prebys Medical Discovery Institute)
Several studies analyzing the role of the microbiome in Parkinson's Disease (PD) indicate that gut microbiome dysbiosis co-occurs with PD. However, there is still a lack of consensus in certain areas, such as the prevalence of potentially protective or detrimental bacterial families and the change in alpha diversity in PD. This report compares the microbial community composition among and between the gut microbiomes of 89 patients with confirmed PD and 66 healthy controls, utilizing 16S rRNA marker gene sequencing. PD samples had higher alpha diversity and decreased levels of butyrate-producing families. Future studies should account for potential confounders, such as diet, medications and demographics, and include functional prediction of the microbiomes.
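Alpha diversity measures the diversity within a single sample. The report does not state which index was used; the sketch below assumes the Shannon index, one common choice, and uses hypothetical toy counts rather than data from the study.

```python
import math


def shannon_index(counts):
    """Shannon diversity H = -sum(p_i * ln(p_i)) over taxa with nonzero counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


# Toy taxon counts (hypothetical): an even community and a skewed one.
even = shannon_index([25, 25, 25, 25])
skewed = shannon_index([97, 1, 1, 1])
```

The even community reaches the maximum for four taxa, ln(4), while the community dominated by one taxon scores lower; comparing such per-sample values between groups is what a difference in alpha diversity refers to.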
Lipid network analysis
Student: Mariia Emelianova
Supervisor: Alexey Sergushichev
This work is dedicated to metabolic network analysis, whose great advantage is the possibility of identifying new metabolic pathways. For now, however, it is still not possible to fully integrate lipidomic data with a metabolic network. This project is dedicated to creating a lipid network for lipid data and testing the concept of lipid network analysis.
Identification of novel miRNAs in Flax Stem
Student: Anzhelika Dun
Supervisor: Alexander Tkachenko

Linum usitatissimum, also known as flax, is one of the most widespread cultivated plants. Linen, the most popular fabric after cotton, is made from flax stem fibres, and its production represents a big part of the textile industry. The unique properties of the flax stem are related to intrusive elongation, the mechanisms of which are poorly studied. Currently, 124 miRNAs of L. usitatissimum are known; they form 23 families, and almost half of them participate in intrusive elongation. Therefore, miRNAs are key players in flax development and can be used to improve fibre properties. We identified 150 novel miRNAs in 10 RNA-seq samples using the highly accurate tool miRDeep2. Subsequent filtering of duplicates and low-quality candidates left the 17 best predicted miRNAs. After validation, our results could expand the set of known L. usitatissimum miRNAs and help clarify the mechanisms of intrusive elongation.
Spatial reconstruction of Arabidopsis gene expression using NovoSpaRc
Student: Natalia Baymacheva
Supervisor: Alexander Tkachenko

Spatial information about gene expression is important for our understanding of how organisms function. The popular scRNA-seq method allows analysis of gene expression at the single-cell level, but most technologies lose spatial information during sample preparation. A recent tool for gene cartography from single-cell expression data is able to regain the spatial information from expression data and some prior knowledge. This tool has been tested on several animal models with very good results, but the same methodology may not apply to plant single-cell data because of how the root stem cell niche functions.
Analysis of transcription start sites in Salmonella enterica in Amoeba symbiosis using Cappable-seq
Student: Liuaza Etezova
Supervisor: Alexander Tkachenko

Salmonella enterica is a dangerous pathogen which can cause life-threatening bacteremic illness. Symbiosis with an amoeba enhances its virulence and antibiotic resistance. We analysed Cappable-seq transcription data of S. typhimurium serving as an Acanthamoeba castellanii symbiont and found novel TSSs which are highly expressed. We identified differentially expressed genes, which indicate oxidative stress and a response to zinc concentrations. We found that S. typhimurium in symbiosis uses TSSs which are not used in the control group. We also analysed antisense transcription using Cappable-seq.
Comparison of genome assemblies for Nanger Dama
Student: Azat Mingaleev
Supervisor: Pavel Dobrynin
In this work, we analyzed three different genome assemblies of the Nanger dama gazelle produced using techniques such as Hi-C, 10X, and Bionano. The analysis identified problems with the Bionano assembly, and further work was carried out with the other two assemblies. In the end, it was found that for Hi-C and 10X the number of unaligned fragments was slightly less than 6%. Because of technical problems with halTools and Smash++, the search for breakpoints and rearrangements is planned to be continued in the future.
Detection of novel regions in the newly sequenced bacterial isolate
Student: Mikhail Lebedev
Supervisor: Sergey Nurk (National Human Genome Research Institute, NIH)

During this project we developed an easy-to-use pipeline for extracting novel parts of a sequenced bacterial genome, a task for which results were previously not this easy to obtain. Moreover, it allows students to look at the pan-genome analysis of a sequenced bacterial strain and comprehensively visualizes problems with possible uncorrected insertion and deletion (indel) errors in a given bacterial genome sequenced with Oxford Nanopore. The returned novel parts of bacterial genomes can be analyzed to find new genes. The resulting pipeline is available on GitHub.
Projects, 2019
Chromothripsis: bioinformatic analysis of complex genomic rearrangements
Student: Natalia Petukhova
Supervisor: Sergey Aganezov (Johns Hopkins University)

Initially described in cancer genomes, and subsequently observed in constitutional disorders, chromothripsis constitutes a new class of massive genomic alterations characterized by the simultaneous shattering of chromosome segments followed by random reassembly of the DNA fragments during a single cellular event, to form complex derivative chromosomes.
The purpose of this study is to analyze the genome of a patient with acute leukemia who presumably has chromothripsis underlying complex genomic rearrangements. The objective of this work is a bioinformatic analysis of the whole-genome DNA sequence using different bioinformatic tools to search for genomic aberrations, and a comparison and confirmation of the obtained results against the cytogenetic study, which still remains the "gold standard" of diagnostic expertise in medical practice.
Producing a chromosome-scale mosquitos genome assembly from Oxford Nanopore reads
Student: Anton Zamyatin
Supervisors: Pavel Avdeyev (George Washington University); Nikita Alexeev

The main task of this project is producing chromosome-level genome assemblies for An. coluzzii and An. arabiensis using different sequencing technologies. Long reads from Oxford Nanopore sequencing are used for the draft assembly, short reads from Illumina are then used for polishing, and Hi-C sequencing data is used for mapping scaffolds. The second task is developing a working pipeline for eukaryotic genome assembly using these types of input data and the state-of-the-art computational tools available for these purposes. At the end of the project, there should be two chromosome-level genome assemblies of the two mosquito species, An. coluzzii and An. arabiensis, that can be used in further population genomics studies and other chromosome-level analyses.
Estimating gene priorities in complex traits based on GWAS summary statistics
Student: Nikita Kolosov
Supervisor: Mykyta Artomov (Massachusetts General Hospital)

A Genome-Wide Association Study, or GWAS, is a powerful tool for investigating the genetic origin of complex traits. These traits are called complex because their development is affected by multiple genes and external factors. In this review, we will predominantly focus on complex traits such as polygenic diseases. Such diseases are caused by DNA alterations in several genes at once. As a result, it is hard to identify their biological mechanisms using family-based studies. GWAS allows one to get closer to solving this problem by looking at the prevalence of disease alleles in cohorts of cases and controls. In this article, we present a new approach that estimates the probability of each gene being causal based on GWAS summary statistics, using machine learning (ML) classification algorithms.
Signaling responses to personalized exercise therapy in skeletal muscle in heart failure patients
Student: Oksana Ivanova
Supervisors: Renata Dmitrieva (Almazov Centre); Alexey Sergushichev
Heart failure (HF) is one of the most widespread disorders in the world. Signs and symptoms of heart failure commonly include shortness of breath, excessive tiredness, and leg swelling, as well as a limited ability to exercise. Moreover, HF often induces skeletal muscle myopathies such as cachexia and myodystrophy. However, there is still no common treatment of muscle disorders in HF patients that would improve their quality of life and increase their ability to exercise. This study aims to describe the effect of physical exercise therapy in heart failure patients in order to define possible targets for pharmacological treatment of muscle wasting.
16S rRNA analysis of the gut microbiota in obesity and healthy state
Student: Ksenia Maksimova
Supervisor: Yulia Kondratenko (St. Petersburg State University)

In the last decade, the study of the human gut microbiome has been gaining popularity. There is much research on the impact of microorganisms on the host, the role of gut microbiota in diseases, and the prospect of treating different diseases by repairing the gut microbiome. Recent studies have already revealed the main species living in our body, and the main interest in human microbiota research so far is to link changes within this community with a particular human phenotype. One of the highest-profile demonstrations of the microbiota's influence on human health is the microbiome in obesity. In the present study, the gut microbiota community of obese phenotypes will be characterized in comparison with the healthy state.
Multi chain effect analysis from paired antibody repertoire data
Student: Sedreh Nassirnia
Supervisor: Maria Chernigovskaya
B lymphocytes (B-cells) are one of the most important parts of the adaptive immune system; they are involved in the immune response by secreting antibodies and producing B-cell receptors (BCRs) through multiple mechanisms for binding to a specific antigen. There are checkpoints to ensure that only antibodies with two identical heavy chains (IGH) and two identical light chains (IGL) participate in the immune response against antigens. But sometimes antibodies are made up of dual light chains, which can show auto-reactivity in the immune system and lead to the generation of autoimmune disease in human and mouse samples. In this work, four human and three mouse 10x Genomics VDJ paired datasets were analyzed to investigate cells with a dual light chain. The IgBlast tool was used to filter out suspicious contigs from the datasets and for further analyses.
Exome analysis of samples using GATK4 pipeline and database development
Student: Mrinal Vashisth
Supervisor: Yury Barbitoff (Bioinformatics Institute)

Expanding GeneQuery transcriptional database into RNA-Seq space
Student: Boris Shpak
Supervisors: Alexander Predeus (University of Liverpool; Bioinformatics Institute); Maxim Artyomov (Washington University in St. Louis)
InCHIANTI dataset analysis
Student: Maria Romanova
Supervisor: Maxim Artyomov (Washington University in St. Louis)
A comparative analysis of viral outbreak networks reconstruction methods
Student: Daria Nemirich
Supervisor: Nikita Alexeev

FSHD diagnosis through Nanopore sequencing
Student: Ekaterina Gibitova
Supervisor: Pavel Avdeyev (George Washington University)