Machine Learning and Neural Networks

 

Deep Learning for Regulatory Genomics - Regulator binding, Transcription Factors TFs - Lecture 08 (Spring 2021)

The video discusses the use of deep learning for regulatory genomics, focusing on how DNA sequence reveals the different motifs present in enhancer and promoter regions and how these regions loop together in 3D. It explains how chromosome conformation capture (3C) technology can probe chromosomal organization, and how Hi-C technology can identify topologically associated domains (TADs), whose regions interact preferentially with one another, as well as the compartment pattern of the genome. Convolutional filters are applied at every position of the DNA sequence to detect different features or motifs, and the deep learning framework can learn common properties, filters, and motifs of the DNA sequence that enable various prediction tasks. The video also explains why multitask learning is beneficial and how additional layers in the network can recognize and combine multiple building-block representations of transcription factor motifs, allowing more efficient recognition of complex motifs.

The speaker in this video discusses using deep learning for regulatory genomics, with a focus on transcription factor binding and gene expression prediction. They explore the use of convolutional architectures and dilated convolutions to take in large regions of DNA and make predictions in a multi-task framework for chromatin data and gene expression. The speaker also covers the use of residual connections to train deep neural nets and explains how the model can predict 3D contacts using Hi-C data and models. Overall, deep learning can be a powerful tool for analyzing genomic data and making predictions from DNA sequence, given enough data and the right transformations.

  • 00:00:00 In this section, the speaker discusses the use of DNA sequence and deep learning to predict features of the gene regulatory genome, focusing on distinguishing different motifs that make up enhancer and promoter regions and their 3D looping. The speaker describes the use of position weight matrices (PWMs) to determine the specificity of binding of each transcription factor, which is then used to predict gene regulatory function. Chromatin immunoprecipitation is also mentioned as a technology used to profile regulatory regions in the genome.

  • 00:05:00 In this section, the speaker explains how understanding the three-dimensional chromatin structure can reveal where different transcription factors are bound. The nucleus contains all of the DNA in a cell and is organized spatially, with active regions pushed away from the nuclear lamina and closer to the center of the nucleus. Chromosome conformation capture (3C) is a technique used to probe chromosomal organization by randomly cutting strands of DNA and then gluing them back together to see where different sections of DNA might be in contact with each other. This technique can reveal how chromosomes are actually looping on each other.

  • 00:10:00 In this section, the speaker explains how cutting and ligating different DNA fragments can be used to create chimeric molecules that reveal where portions of DNA bind and map in the genome. By sequencing and analyzing these chimeric regions, researchers can gain insight into the three-dimensional packaging of the genome and how different regions interact with each other. The speaker discusses various techniques such as 3C, 4C, 5C, and ChIA-PET that allow for analysis of interactions between genomic regions and the use of antibody-based methods to selectively study regions bound by specific regulators.

  • 00:15:00 In this section, the speaker explains how Hi-C technology works and how it provides insights into the way the genome is organized. Hi-C technology involves adding biotinylation marks to genome regions and then pulling down those marks to sequence them, which allows scientists to determine how two regions of the genome interact with each other. The resulting pictures show looping information and reveal that regions close to the diagonal interact the most. Hi-C technology also identifies topologically associated domains (TADs), which interact more with each other than with the outside of the domain, and hierarchical patterns of interaction within them. Additionally, the technology shows a checkerboard pattern where regions tend to interact more with regions of the same type, which allows scientists to visualize the compaction and organization of the genome.

  • 00:20:00 In this section, the speaker discusses the territoriality of different chromosomes within the nucleus and the A versus B compartment pattern in the genome, which suggests that one part of the genome is inactive and closer to the periphery while the active part is closer to the center. The speaker also mentions topologically associated domains, which are groups of regions that interact strongly within them, but not across them. The prevailing model for the corner peaks in these domains is that they are created by a process of loop extrusion, which involves binding sites for the regulator CTCF and cohesin pushing a loop of DNA through.

  • 00:25:00 In this section, the video explains the loop extrusion model of higher-order chromatin organization and three-dimensional folding, in which binding sites are brought closer together as the DNA is pushed through, effectively growing a loop. The video then goes on to discuss the computational analysis of regulatory motifs, using traditional approaches that predate deep learning, and how the same deep learning methodology used for image analysis can be applied to regulatory genomics with a one-hot encoding of DNA. The traditional methodology involves refining a motif logo by iterating between recognizing a common sequence pattern and discovering instances of that motif.

  • 00:30:00 In this section, the speaker explains how convolutional filters are used for representation learning in deep learning architectures. The DNA sequence is transformed into a one-hot encoding representation with four different input channels, one for each letter. Convolutional filters are applied at every position of the DNA sequence to detect different features or motifs. These motifs are then learned and can be applied to carry out a specific task, such as determining whether a transcription factor is bound or not. The speaker highlights that the deep learning framework can learn all of these convolutional filters and vary the number of layers, prediction tasks, and input-output relationships, among other choices. Ultimately, the architecture can extract common properties, filters, and motifs of the DNA sequence and use these to learn a representation of the sequence, enabling various prediction tasks to be carried out.
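
As a concrete illustration of the scanning step described above, the sketch below one-hot encodes a short DNA sequence and slides a single convolutional filter across it. The sequence, filter width, and weights are hypothetical placeholders, and the code is only a minimal numpy sketch of the operation, not the model from the lecture.

    import numpy as np

    # Hypothetical example: one-hot encode a DNA sequence into 4 input channels (A, C, G, T).
    def one_hot(seq):
        mapping = {"A": 0, "C": 1, "G": 2, "T": 3}
        x = np.zeros((4, len(seq)))
        for i, base in enumerate(seq):
            x[mapping[base], i] = 1.0
        return x

    seq = "ACGTTTGCACGTAACG"          # toy sequence
    x = one_hot(seq)                   # shape (4, 16)

    # A convolutional filter of width 8 acts like a position weight matrix: it is
    # slid across every position and produces one motif-match score per offset.
    rng = np.random.default_rng(0)
    filt = rng.normal(size=(4, 8))     # hypothetical learned filter

    scores = np.array([np.sum(filt * x[:, i:i + 8]) for i in range(x.shape[1] - 8 + 1)])
    print(scores.shape)                # (9,) -> one score per position of the scan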

  • 00:35:00 In this section, the speaker gives an introduction to the use of machine learning on nucleic acid sequences in biology. He discusses the shortcomings of earlier successful uses of machine learning, such as string kernels, and how they were unable to take into account spatial positioning of k-mers or any relationships between them. The speaker then suggests that deep learning methods could potentially overcome these limitations and allow for better representations of DNA sequences for machine learning.

  • 00:40:00 In this section of the video, the speaker explains the process of using convolution filters in deep learning for regulatory genomics, which is similar to the process used in image analysis. The first layer of convolution filters recognizes position weight matrices that are scanned across the sequence, creating a numeric representation; a batch normalization operation is then applied, followed by a non-linear function (a ReLU) that sets negative values to zero. Next, a max pooling operation takes the maximum value of adjacent positions for each filter channel. Convolutional layers are then applied multiple times, with pooling operations in between, to learn relationships between transcription factors and binding sites.
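
The layer pattern just described, convolution, batch normalization, ReLU, then max pooling, repeated, can be written compactly in PyTorch. The filter count, kernel width, and pooling size below are hypothetical and only meant to show the order of operations, not the exact architecture from the lecture.

    import torch
    import torch.nn as nn

    # Minimal sketch of one convolutional block: PWM-like filters scanned across the
    # one-hot sequence, then batch norm, ReLU, and max pooling over adjacent positions.
    conv_block = nn.Sequential(
        nn.Conv1d(in_channels=4, out_channels=300, kernel_size=19, padding=9),  # hypothetical sizes
        nn.BatchNorm1d(300),
        nn.ReLU(),
        nn.MaxPool1d(kernel_size=3),   # keep the max of adjacent positions per filter channel
    )

    x = torch.zeros(1, 4, 1000)        # one one-hot encoded 1,000 bp sequence
    h = conv_block(x)                  # shape (1, 300, 333)
    print(h.shape)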

  • 00:45:00 In this section, the speaker discusses making predictions in deep learning for regulatory genomics. They collapse the object across the length axis and run a fully connected layer to make predictions. The speaker then provides an example of DNase hypersensitivity and how there are many sites that are accessible across cell types but also many cell-type-specific peaks that must be learned. The training, validation, and test sets consist of two million sites, which are broken down into 164 binary calls for whether there is a significant signal from the DNase hypersensitivity assay. The speaker discusses the benefits of multitask learning, where all the convolutions and fully connected layers are shared between all tasks except for the final linear transformation. They explain that this joint representation gives better results than training separate models for each task.
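
A minimal sketch of the multi-task setup described above, assuming a shared convolutional trunk and a single final linear layer that emits one binary call per task (164 here, as in the lecture); all layer sizes are hypothetical.

    import torch
    import torch.nn as nn

    class SharedMultiTaskNet(nn.Module):
        # All convolutional and fully connected layers are shared across tasks;
        # only the final linear transformation is task-specific (one output per task).
        def __init__(self, n_tasks=164):
            super().__init__()
            self.trunk = nn.Sequential(                     # shared representation (hypothetical sizes)
                nn.Conv1d(4, 300, kernel_size=19, padding=9),
                nn.BatchNorm1d(300), nn.ReLU(), nn.MaxPool1d(3),
                nn.Flatten(),
                nn.LazyLinear(1000), nn.ReLU(),
            )
            self.head = nn.Linear(1000, n_tasks)            # task-specific final layer

        def forward(self, x):
            return self.head(self.trunk(x))                 # one logit per task

    model = SharedMultiTaskNet()
    logits = model(torch.zeros(8, 4, 600))                       # batch of 8 one-hot sequences
    loss = nn.BCEWithLogitsLoss()(logits, torch.zeros(8, 164))   # binary call per task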

  • 00:50:00 In this section, the presenters discuss the tasks involved in their deep learning model for regulatory genomics, which include different cell types and assays such as transcription factor ChIP-seq and histone modification ChIP-seq. They explain that the model uses convolutional neural networks, which are more flexible than k-mer SVMs and can represent more things. To understand what the model is doing, they analyze position weight matrices obtained from the convolution filters and compare them to the CIS-BP database of transcription factor binding motifs. They find that the filters largely recognize sequences similar to the database motifs and note that the use of multiple filters for important transcription factors such as CTCF is crucial for predicting accessibility.

  • 00:55:00 In this section, the speaker discusses the potential of using additional layers in a deep learning network to recognize and combine multiple building block representations of transcription factor motifs, like CTCF. This could allow for a more efficient recognition of complex motifs, although it could also make it challenging to pinpoint the exact location and contribution of each individual filter. The speaker also mentions several analyses they performed to gain insights into the information content and influence of different filters in the model, which could aid in better interpreting the results of a deep learning approach to regulatory genomics.

  • 01:00:00 In this section of the video, the speaker discusses using a known motif to make predictions and studying transcription factor binding sites by mutating every single nucleotide across the sequence. The speaker then moves on to discuss a new problem of predicting transcription and gene expression by computing a function of all the elements in a long region of DNA. The solution involves using convolution structures and bringing in a large region of sequence, about 100,000 nucleotides for the model, and then doing max pooling to get the sequence to about 128 base-pair resolution. The challenge is how to share information across the genome, and different modules can be used for this. Recurrent neural networks were hypothesized to be the best tool for the job.

  • 01:05:00 In this section, the speaker talks about a tool called dilated convolution that they used instead of a recurrent neural network to avoid the problem of slow training on long sequences. Dilated convolution involves inserting gaps into the convolution and expanding it, which allows the receptive field to grow exponentially, leading to a very parameter-efficient method of covering an image. The speaker then discusses how they used dilated convolutions to make predictions in a multi-task framework for chromatin data and gene expression. They also mention an additional technique called residual connections or skip connections that can be helpful for training deep neural nets.
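
To make the receptive-field claim concrete, the small calculation below shows how stacking convolutions whose dilation rate doubles at each layer grows the receptive field roughly exponentially with depth; the kernel size and layer count are hypothetical.

    # Receptive field of a stack of 1D convolutions with kernel size k: each layer
    # adds (k - 1) * dilation positions, so doubling the dilation each layer makes
    # the receptive field grow exponentially with the number of layers.
    def receptive_field(kernel_size, dilations):
        return 1 + sum((kernel_size - 1) * d for d in dilations)

    dilations = [2 ** i for i in range(8)]                          # 1, 2, 4, ..., 128 (hypothetical depth)
    print(receptive_field(kernel_size=3, dilations=dilations))      # 511 positions covered by 8 layers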

  • 01:10:00 In this section, the speaker discusses using residual networks to make it easier for each layer to learn new information without having to relearn everything before it. This is especially useful for dilated convolutions, which look at different positions further away. By directly passing on what has already been learned with the residual connection, they can add new information to each position's vector and normalize it or throw a convolution on top of it. The number of residual connections depends on the length of the sequence being worked with, as they should be able to look far enough without hitting outside the sequence bounds.
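
A minimal sketch of the residual idea described above, assuming a dilated 1D convolution whose output is simply added back to its input so each layer only has to learn what is new; the channel count, kernel size, and dilation are hypothetical.

    import torch
    import torch.nn as nn

    class DilatedResidualBlock(nn.Module):
        # Dilated convolution whose output is added to its input (a skip connection).
        def __init__(self, channels=256, dilation=4):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv1d(channels, channels, kernel_size=3, dilation=dilation, padding=dilation),
                nn.BatchNorm1d(channels),
                nn.ReLU(),
            )

        def forward(self, x):
            # Pass what has already been learned straight through, and let the dilated
            # convolution add new, longer-range information on top of it.
            return x + self.body(x)

    block = DilatedResidualBlock()
    out = block(torch.zeros(1, 256, 1024))    # same shape in and out: (1, 256, 1024)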

  • 01:15:00 In this section of the video, the speaker discusses the use of 5 to 10 dilated convolution layers for an input sequence of 100,000 nucleotides, but notes that this can change depending on the scale of the sequence or the bin size. The input in this case is the continuous signal from various datasets, and the speaker notes that, like gene expression, it cannot easily be binarized. The speaker indicates that a Poisson loss function works better for this kind of data, and notes that the quality of the model is affected by the quality of the data, which can vary considerably. The speaker briefly mentions using the model to make predictions for mutations in disease-associated SNPs and the importance of connecting computational biology research to disease associations. Finally, the speaker briefly covers the prediction of 3D contacts using Hi-C data and models.

  • 01:20:00 In this section, the speaker explains how they use the Hi-C data to make predictions. The data are two-dimensional, with nucleotides along the x-axis and y-axis, representing the contact frequency between one bin of the genome and another. Using mean squared error and multitask learning, the model can predict the data. However, with a million nucleotides coming in, GPU memory limitations become an issue. The solution is to average the representations at position i and position j, resulting in a 2D matrix that deep learning tools can analyze. Using 2D convolutions, dilated convolutions, and re-symmetrizing the matrix after every layer, the model can make predictions, with CTCF being the main motif the model learns.
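
The 1D-to-2D step described above can be sketched as follows: the feature vector at every bin i is averaged with the vector at every bin j to form a 2D map of pair features, which is kept symmetric after each layer. The shapes and the simple averaging rule are assumptions for illustration rather than the exact model.

    import torch

    # x holds one feature vector per genomic bin: shape (batch, bins, channels).
    x = torch.randn(1, 512, 64)

    # Pairwise averaging: entry (i, j) is the mean of the vectors at bins i and j,
    # giving a 2D map that 2D (dilated) convolutions can then operate on.
    pair = 0.5 * (x.unsqueeze(2) + x.unsqueeze(1))      # shape (1, 512, 512, 64)

    # Re-symmetrize (as would be done after every layer), since contact maps are symmetric.
    pair = 0.5 * (pair + pair.transpose(1, 2))
    print(pair.shape)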

  • 01:25:00 In this section, David Kelley discusses how deep learning can be used in regulatory genomics to analyze basic inputs like DNA sequence and predict transcription factor binding, using CTCF as an example. With enough data and the right transformations, neural network architectures can successfully learn and make predictions based on genomics data. While synthetic data is currently the main focus, this presentation offers an overview of the ways deep learning can be applied in biology and genomics.
 

Gene Expression Prediction - Lecture 09 - Deep Learning in Life Sciences (Spring 2021)

The video discusses the use of deep learning for gene expression prediction and the challenges involved in analyzing biological data sets, including high dimensionality and noise. The lecture covers methodologies such as cluster analysis, low-rank approximations of matrices, and compressive sensing. The speaker also talks about the use of deep learning for predicting gene expression and chromatin state, as well as weakly supervised learning for predicting enhancer activity sites. The lecture discusses several tools developed primarily with deep learning methodology, including DanQ, D-GEX, FactorNet, and scFAN. The presenter also talks about the use of generative models for studying genomics data sets and introduces the idea of approximate inference, particularly the popular approach called variational inference.

In the second part of the lecture, the speaker discusses the application of deep learning in the life sciences, specifically in gene expression prediction and genomic interpretation. The first topic focuses on the application of variational autoencoder models to RNA expression analysis for asthma datasets. The speaker proposes a framework to remove experimental artifacts using a conditional generative model. The second topic discusses Illumina's investment in deep learning networks to identify sequence-to-function models for genomic interpretation, particularly for splicing. The company has developed SpliceAI, a deep convolutional neural network that predicts whether a nucleotide is a splice donor, acceptor, or neither. The third topic is about the speaker's research on predicting whether certain mutations will have cryptic splice function, which can lead to frameshifts and disease. The speaker also invites questions and applications for research positions, internships, and postdocs.

  • 00:00:00 In this section of the lecture, the speakers introduce gene expression analysis and the two methods used to measure RNA expression: hybridization and genome sequencing. The latter has become more popular because of the drastic drop in the cost of genome sequencing in the past 20 years. The result is a matrix that shows which gene is expressed at what level in hundreds of conditions. This matrix can be seen vertically or horizontally, giving a 20,000 long vector for every gene in the genome across an experimental condition of interest, or for a particular cell type that has been sorted.

  • 00:05:00 In this section, the instructor discusses how deep learning can be used in gene expression prediction. The basic input matrices involve profiling every cell to make comparisons across multiple dimensions such as similarity of expression vectors for a given gene across different conditions, tissues, cell types, experiments, age, and gender. Cluster analysis can be used to find similar conditions to each other or genes that are similar to each other across columns or rows. The guilt by association approach can also be used to complete the annotation of unannotated genes based on the similarity of expression. Additionally, the instructor suggests using deep learning approaches like self-supervised learning, prediction using non-linearities and higher-order features, and multi-task learning to predict the different classes of interest jointly, and finally, the instructor emphasizes that deep learning is not the only approach, and there exist a set of tools that can be used to ask biological questions and learn representations of these systems.

  • 00:10:00 In this section, the lecturer discusses dimensionality reduction techniques that can be used to analyze gene expression patterns. One such technique is principal component analysis (PCA), which can be used to identify the major dimensions of variation in gene expression patterns. Low-rank approximations of matrices can also be used to effectively obtain an optimal lower-rank approximation of the data. Other techniques like t-SNE and autoencoders can also be applied. Additionally, the lecturer mentions the use of compressive sensing to build composite measurements using combinations of probes that capture linear combinations of gene expression. Finally, the lecturer discusses the potential of using chromatin information to predict gene expression levels, which will be discussed in the first guest lecture.

  • 00:15:00 In this section, the speaker discusses the use of deep learning to predict gene expression and chromatin from various features, combining them systematically using attention mechanisms, similar to what was previously discussed for the transformer model and recurrent neural networks. The use of reporter constructs and high-throughput testing is explained, along with the ability to predict whether certain fragments will drive expression using a machine learning or deep learning approach. The speaker also introduces the concept of predicting splicing directly from sequence using a neural network and specific features in the sequence, and highlights the work his team has done on using deep learning to predict enhancers in the human genome using a weakly supervised framework.

  • 00:20:00 In this section of the video, the speaker discusses a method of gene expression prediction using a reporter experiment and a set of chromatin features. The input matrix, which consists of the different marks across thousands of locations in the genome, is constructed for every gene, and the nearby chromatin features are tested against the STARR-seq result to predict expression. The output is a binary classifier, and the model's intermediate representations are used to predict the specific location in the genome sequence. This higher resolution allows for a more efficient use of the data, achieved by fitting particular curves to the continuous signal to obtain a more refined representation.

  • 00:25:00 In this section, the speaker explains the idea of weakly supervised learning for predicting enhancer activity sites using a method similar to object detection. By passing the original image through a convolutional filter, activation maps are generated and used to create a heat map. The model only required a coarse annotation of the enhancer's existence and predicted the precise location using the same heat-map approach. Results of the cross-cell-line and cross-chromosome validation show that the model can accurately predict STARR-seq enhancers. The refined set, obtained by shaving off irrelevant regions while making predictions, has a higher proportion of transcriptional start sites and is more conserved across a hundred different species. The speaker benchmarked the model against the previous state-of-the-art model and performed a case study in neural progenitor cells, discovering enhancers specific to these cells.

  • 00:30:00 In this section of the YouTube video "Gene Expression Prediction", the speaker discusses the challenges in interpreting biological data sets and the importance of developing methodology that takes multiple factors, such as high dimensionality and noise, into account. The speaker's research in his lab focuses on combining different types of genomic techniques, including single-cell genomics, to develop methods for studying genomics. The speaker also discusses his interest in applying deep learning to gene expression analysis and using it to extract signals from noisy data sets.

  • 00:35:00 In this section, the speaker discusses the development of methodology that combines multi-modality datasets to allow for the examination of the underlying biology. They highlight recent proposals in the field of machine learning that combine visual signals with natural language processing to better understand systems. The speaker then proceeds to list a few tools that their lab has developed using primarily deep learning methodology, including DanQ, which quantifies the function of DNA sequences, and D-GEX, which predicts gene expression. The speaker also briefly discusses two other tools, FactorNet and scFAN, which predict transcription factor binding, with scFAN being an extension of FactorNet for single-cell prediction.

  • 00:40:00 In this section of the video, the presenter discusses several methodologies related to the use of deep learning models in the life sciences. Specifically, the discussion covers the UFold methodology for RNA secondary structure prediction, the D-GEX model that utilizes deep neural nets for predicting expression, and the SAILER methodology for utilizing deep generative models to study single-cell ATAC-seq datasets, with a focus on the idea of invariant representation learning. The discussion also covers the use of VAE models to study genomics and RNA expression data, an extension of deep generative models for ATAC-seq analysis, and the combination of multi-modality datasets with a model that learns shared representations. The presenter notes that all the tools developed are open source and available on GitHub.

  • 00:45:00 In this section, the lecturer discusses a method for gene expression prediction using a subset of genes. By profiling a small number of landmark genes (about 1,000) using the Luminex technology platform, researchers can generate profiles for millions of samples, leading to a cost-effective method for understanding biological processes and drug discovery. The remaining roughly 20,000 genes can be inferred using computational techniques such as deep neural nets. By inputting 978-dimensional vectors into a multi-layer perceptron feed-forward neural net, researchers can predict the 20,000 targets jointly in a multi-task fashion and train the model through backpropagation, achieving better accuracy than linear regression. A GEO data set containing expression profiles with the entire collection of genes is used to train the model.
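
A minimal sketch of the kind of feed-forward network described above: 978 landmark-gene values in, predictions for the remaining target genes out, trained jointly in a multi-task fashion with backpropagation. The hidden-layer sizes are hypothetical simplifications rather than the published architecture.

    import torch
    import torch.nn as nn

    # Feed-forward net: 978 landmark genes in, all remaining target genes out (multi-task regression).
    model = nn.Sequential(
        nn.Linear(978, 3000), nn.ReLU(),    # hypothetical hidden sizes
        nn.Linear(3000, 3000), nn.ReLU(),
        nn.Linear(3000, 20000),             # one regression output per target gene
    )

    landmarks = torch.randn(64, 978)        # a batch of 64 landmark-gene expression profiles
    targets = torch.randn(64, 20000)        # corresponding full profiles (placeholder values)
    loss = nn.MSELoss()(model(landmarks), targets)
    loss.backward()                         # trained end to end with backpropagation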

  • 00:50:00 In this section of the lecture, the instructor discusses the use of generative models for studying genomics data sets. Since most genomics data sets lack labels, unsupervised learning is often more relevant. The goal is to map high-dimensional data sets into a low-dimensional embedding, which can be more helpful in identifying underlying patterns. The traditional method for this purpose is the autoencoder, which can be trained by matching the input to the output, but has issues like susceptibility to overfitting and inability to generate samples. As a solution, the instructor proposes deep generative models, which model data through a probabilistic framework with latent variables. By assigning priors to the distribution of the latent variables, the model can marginalize over them to obtain the marginal distributions of the input.

  • 00:55:00 In this section, the professor discusses the difficulty of learning such a generative framework directly and introduces approximate inference methodology, particularly the popular approach called variational inference, which proposes an auxiliary distribution over z given x. A lower bound on the log-likelihood involving this auxiliary distribution is then maximized, balancing a data reconstruction term against the KL distance between distributions, thereby ensuring that the approximate posterior stays close enough to the prior while having enough power to model the observed data. This led to the development of the variational autoencoder, which models both p_theta(x|z) and the auxiliary distribution q(z|x) with neural nets, training them by minimizing the negative of this bound. However, there are issues with calculating the required expectations, which can be addressed using the reparameterization trick, particularly for Gaussian distributions.
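
For reference, the bound described above (the evidence lower bound) and the Gaussian reparameterization take the standard textbook form below; this is generic VAE notation rather than the lecturer's exact formulation.

    \log p_\theta(x) \;\ge\; \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] \;-\; \mathrm{KL}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right),
    \qquad
    z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I).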

  • 01:00:00 In this section, the speaker discusses the application of variational autoencoder models to RNA expression analysis, specifically for asthma datasets. Due to the discrete and quantitative nature of RNA-seq datasets, researchers use zero-inflated negative binomial distributions to model the read counts. This leads to the idea of combining an autoencoder with this noise model to create a deep generative model. However, the learned latent representations may reflect experimental artifacts, such as batch effects and read coverage. To remove these effects, the speaker proposes a framework using a conditional generative model that minimizes the mutual information between the learned representations and the underlying confounding factors.

  • 01:05:00 In this section, the principal investigator at the AI Lab at Illumina talks about the company's goal to understand every possible variant in the human genome and make genome sequencing useful for everyone. The focus is on interpreting non-coding genetic variations, which most clinical sequencing currently skips. This is why Illumina is investing heavily in deep learning networks to identify the sequence-to-function models for genomic interpretation, specifically for splicing. They have developed SpliceAI, a deep convolutional neural network that predicts whether a nucleotide is a splice donor, acceptor, or neither, purely from the sequence, and can reconstruct the intron-exon pattern of a gene from a sequence of sequences.

  • 01:10:00 In this section, the presenter discusses the difficulties of predicting exon splice junctions and how their deep learning network was able to predict all 30 exons of the large CFTR gene with nucleotide-level precision. They found that long-range sequence determinants are key to splice regulation, and the network was able to derive these determinants automatically from sequence data, including nucleosome positioning and the clustering of exons. The network used a variety of features, including the branch point, the polypyrimidine tract, the AG and GT splice-site dinucleotides, and intronic and exonic splice enhancers, and compensated for the redundancy of local motifs with long-range context. The presenter also showed how the accuracy of the network increased with larger context sizes and that it worked on non-protein-coding sequences as well.

  • 01:15:00 In this section of the video, the speaker discusses the application of SpliceAI to rare disease patients, specifically a patient with early-onset heart failure caused by a single nucleotide mutation that extended the exon and frameshifted the protein. The model was also validated on RNA-seq from GTEx, and the validation rate depended on the SpliceAI score. The speaker highlights the complexity of interpreting lower-scoring splice variants, as they may preserve normal splicing, and notes that there is a graded interpretation of human variation that needs to be addressed. The impact of natural selection on variants with cryptic splice function was also examined, showing that cryptic splice mutations predicted by SpliceAI are under selection essentially equivalent to that on frameshift or nonsense protein-coding mutations. Finally, the model was applied to large clinical data sets of patients with autism spectrum disorder and intellectual disability.

  • 01:20:00 In this section of the lecture, the speaker talks about their research on predicting whether or not certain mutations will have cryptic splice function. They used RNA sequencing to confirm the predicted aberrant splice junction and demonstrated examples of how these variants cause splicing to occur in the wrong location, leading to frameshifts and disease. The speaker makes their tools open source and invites questions, as well as applications for research positions, internships, and postdocs. The lecture concludes with thanks to the speaker and a reminder to stay tuned for the final project.
 

Single Cell Genomics - Lecture 10 - Deep Learning in Life Sciences (Spring 2021)

In this lecture on single-cell genomics, the speaker discusses various methods and technologies used for profiling individual cells, including cell sorting and microfluidics. The focus is on three specific single-cell sequencing technologies - Smart-seq, drop-seq, and pooled approaches. The speaker also covers the process of analyzing single-cell transcriptomes, including preprocessing, visualization, clustering, and annotation, and the use of autoencoder architecture in community clustering. Deep learning methods are applied for domain adaptation and to reconstruct cell types in a stimulated fashion. The lecture also discusses the challenges involved in analyzing single-cell genomics data and proposes the use of a generative model to address these issues in a scalable and consistent way.

The second part of the video covers various topics related to single-cell genomics and deep learning. Topics discussed include variational inference, a generative process for single-cell RNA sequencing data, the scVI model for mixing cell type datasets, scANVI for propagating labels, and the implementation of various deep learning algorithms in a single code base called scvi-tools. The speakers also address challenges in using posterior probabilities to calculate measures of gene expression and present methods for accurately calculating posterior expectations and controlling false discovery rates.

  • 00:00:00 In this section of the transcript from "Single Cell Genomics - Lecture 10 - Deep Learning in Life Sciences (Spring 2021)", the speaker explains why single cell profiling is necessary. Individual cells within the body are extremely different from each other and can vary because of environmental stimuli, interactions, cell cycle phase, and transcriptional bursts. Single cell profiling also captures individual differences in cell types, signaling, and genotype, which are often not captured with bulk data. The speaker outlines several technologies that have preceded the current explosion in single cell data analysis, but emphasizes the foundational technology of amplifying individual RNAs to capture transcriptional diversity.

  • 00:05:00 In this section, the speaker discusses the different technologies and methods used for profiling individual cells, which include cell sorting, microfluidics, and pipetting. By looking at individual cells at different time points and genes across cells, researchers can see how individual genes are turning on and off and how there is heterogeneity even within particular time points. Single-cell analysis poses a challenge in distinguishing technical and biological zero values, but the data obtained through these techniques are able to recapitulate what is seen in biology. The talk also covers Smart-seq, which sorts cells into wells, Drop-seq and 10x, which both use droplets, and SPLiT-seq, which is a method for barcoding individual cells without separating them.

  • 00:10:00 In this section, the speaker discusses the different methods used in single cell genomics, including microfluidics and blood collection, and describes the basic pipeline used in the process. The focus is on three specific technologies - Smart-seq, drop-seq, and pooled approaches. Smart-seq uses cell sorting and captures up to 10,000 genes per cell, but requires a separate sequencing reaction for every well, making it expensive. Drop-seq replaces wells with droplets, capturing individual cells with barcodes in beads, and is more cost-effective. Finally, the pooled approach involves capturing all individual RNA molecules in a single tube labeled with corresponding cell identity.

  • 00:15:00 In this section, the speaker explains three different types of single-cell RNA sequencing technologies. The first is well-based sequencing, where each single cell is sorted into a well and labeled with a unique barcode to distinguish cells from each other. The second is 10x Genomics, which uses droplets to barcode each cell's RNA and then combines all the labeled RNA from different cells into a single sequencing reaction. The third technology is SPLiT-seq, where cells are shuffled among different wells with different barcodes added at each iteration, resulting in a unique combination of barcodes for each cell's RNA. This allows for a million unique addresses for every RNA molecule, indicating which cell it came from.

  • 00:20:00 In this section, the lecturer discusses single-cell sequencing technologies, including cells in wells, droplets, and combinatorial indexing. Various types of assays can be used, such as single-cell DNA methylation profiling, single-cell genome sequencing, and single-cell DNA accessibility. Another widely used assay is single-cell ATAC-seq, which looks at the accessibility of chromatin in individual cells. However, the data from individual cells can be sparse, and aggregating data across multiple locations is necessary to talk about transcription factors. The lecturer also mentions the increasing emergence of single-cell multi-omics methods, but cautions about the computational challenges in dealing with noise and artifacts. The section ends with an introduction to two guest lectures from Europe and the West Coast, respectively, who will discuss deep representation learning in single-cell genomics.

  • 00:25:00 In this section of the lecture on single cell genomics, the speaker discussed the process of analyzing single-cell transcriptomes, which involves various steps of preprocessing, visualization, clustering, and annotation. The process is unsupervised, as information is only available on cell ensembles, not individual cells. The speaker's lab has contributed tools and frameworks to aid in this process, including the widely used Scanpy single-cell analysis toolkit in Python, which provides a library of tools and modules to perform these steps. Visualization and downstream analysis involve latent space learning, with the most commonly used construction being a k-nearest-neighbor (kNN) graph. The speaker's lab has also invested in studying time series information in single-cell transcriptomes to understand cellular differentiation processes.

  • 00:30:00 In this section, the speaker discusses the use of an autoencoder architecture for community clustering with deep neural networks. This approach is used to deal with the increasing size of datasets and the noise in gene-by-cell matrices. The autoencoder's bottleneck layer is found to be informative and can learn about biological processes. The speaker's team has leveraged this to develop a deep count autoencoder, which adapts to the noise characteristics of the data by replacing the mean squared error with a negative binomial likelihood. A two-dimensional plot of this approach on a PBMC dataset shows that the bottleneck layer recognizes cell type groups without any prior knowledge, which could aid in leveraging biological knowledge. The scaling behavior of this neural network method is also identified as a significant advantage compared to kNN-graph-based approaches.
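
The loss swap described above, replacing mean squared error with a count-aware likelihood, can be sketched as below. The negative binomial is parameterized by a mean and a dispersion; the code is only an illustrative stand-in, not the actual deep count autoencoder implementation.

    import torch

    def nb_negative_log_likelihood(x, mu, theta, eps=1e-8):
        # Negative log-likelihood of counts x under a negative binomial with mean mu
        # and dispersion theta; used in place of mean squared error for count data.
        log_theta_mu = torch.log(theta + mu + eps)
        return -(
            torch.lgamma(x + theta) - torch.lgamma(theta) - torch.lgamma(x + 1)
            + theta * (torch.log(theta + eps) - log_theta_mu)
            + x * (torch.log(mu + eps) - log_theta_mu)
        )

    x = torch.tensor([0.0, 3.0, 10.0])       # observed counts for three genes
    mu = torch.tensor([0.5, 2.0, 12.0])      # decoder output: predicted means
    theta = torch.tensor([1.0, 1.0, 1.0])    # per-gene dispersion
    print(nb_negative_log_likelihood(x, mu, theta).sum())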

  • 00:35:00 In this section, the speaker discusses the potential of deep learning on genomics and single-cell data to develop the next generation of models. He mentions a project focused on domain adaptation that aims to transfer certain settings to a new one, such as perturbations and drug stimuli in cells. They call this project scGen, which models the perturbation effects on cells and seeks to predict how a new cell type would behave. By encoding all data sets, they hope to achieve a linearized latent space in which they can do arithmetic and out-of-sample prediction. They have also been extending this model for more complex decompositions.
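
The latent-space arithmetic mentioned above can be sketched as follows: encode control and stimulated cells, take the difference of their latent means as a perturbation vector, add it to the latent codes of a held-out cell type, and decode. The encoder and decoder below are hypothetical placeholders (single linear layers), not scGen itself.

    import torch
    import torch.nn as nn

    latent_dim, n_genes = 10, 2000
    encoder = nn.Linear(n_genes, latent_dim)         # placeholder encoder (hypothetical)
    decoder = nn.Linear(latent_dim, n_genes)         # placeholder decoder (hypothetical)

    z_ctrl = encoder(torch.randn(100, n_genes))      # control cells
    z_stim = encoder(torch.randn(100, n_genes))      # stimulated cells (held-out type excluded)

    delta = z_stim.mean(0) - z_ctrl.mean(0)          # perturbation vector in latent space

    z_heldout_ctrl = encoder(torch.randn(50, n_genes))    # held-out cell type, control only
    predicted_stim = decoder(z_heldout_ctrl + delta)      # out-of-sample prediction by arithmetic
    print(predicted_stim.shape)                           # (50, 2000)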

  • 00:40:00 In this section, the speaker discusses the ability to reconstruct a cell type using deep learning in single-cell genomics. The goal is to reconstruct a cell type, such as CD4-positive T cells, in a stimulated state after leaving them out, essentially making an out-of-sample prediction. The prediction is not just based on the mean but also on the distribution of the variance. This reconstruction is done not only for CD4-positive T cells but also for all the different cell types, and the cell-specific response is learned, making it a potent tool for genomics. The speaker also talks about scGen, a simple generative model that has been extended with latent space learning. It can be used to do style transfer by packing all the information about the big sample into the model. Finally, the speaker discusses transfer learning, which is essential for dealing with distributed data and making these reference maps easy to access.

  • 00:45:00 In this section, the speaker discusses the application of Bayesian modeling and variational autoencoders (VAEs) to single-cell data, which aims to understand the distinct functions of cells in a tissue. The process involves dissociating a tissue into single cells and running a single RNA sequencing pipeline, resulting in a matrix that shows the number of times a transcript aligns with a gene for each cell. The speaker emphasizes the importance of collaboration in their work with graduate and master's students and professors, and presents several topics they will cover throughout the presentation, from the significance of applying VAEs to single-cell data to a discussion of extensions and failure modes of VAEs.

  • 00:50:00 In this section, the speaker discusses the various tasks and challenges involved in single-cell genomics, including the analysis of cell and gene level queries. Some of the tasks involve cell stratification, trajectory analysis, data set harmonization, annotation transfer, normalization, and differential expression testing. The analysis can be complex due to technical noise such as variable sequencing depth and batch effects, as well as the high-dimensional, non-Gaussian nature of the data. To address these issues, the speaker proposes using latent variable models and scalable methods to analyze the millions of samples involved.

  • 00:55:00 In this section, the speaker discusses the limitations of applying separate algorithms to single-cell genomic data and the need for a unifying modeling assumption for the whole process. They present the idea of a generative model, building on Bayesian modeling techniques, that can be used to analyze single-cell data in a scalable and consistent way. The speaker explains how to read a graphical model and how the different nodes and edges can be used to encode probabilistic properties, such as independent replication and conditionality. The goal is to calculate the posterior distribution, which can be obtained using Bayes' rule, but the marginal likelihood is often intractable, except in special cases such as probabilistic PCA.

  • 01:00:00 In this section, the speaker discusses the concept of variational inference, which is used in scVI to approximate the posterior probability distribution of the latent variables given the observations. The method involves positing a family of distributions and finding the member q that minimizes the KL divergence to the posterior, which is essentially an optimization problem. Using the definition of a conditional density, the optimization problem becomes tractable, and variational inference becomes an attractive method. The speaker also presents an extension of probabilistic PCA, where a neural network can be used to specify the mean and variance of the Gaussian distribution. However, using variational inference in VAEs requires learning the model parameters by maximizing the evidence, which can be achieved by tying together all the parameters of the variational posterior using two neural networks. Finally, the speaker discusses scVI, which incorporates technical effects into a graphical model to generate gene expression counts for a given cell and gene.
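
A minimal sketch of the amortization described above: a single neural network outputs the mean and log-variance of the Gaussian variational posterior for any cell, and a reparameterized sample is drawn from it. Layer sizes are hypothetical, and this is not the actual scVI implementation.

    import torch
    import torch.nn as nn

    n_genes, latent_dim = 2000, 10

    # One shared network ties together the variational parameters for all cells:
    # it maps a cell's expression vector to the mean and log-variance of q(z | x).
    encoder = nn.Sequential(nn.Linear(n_genes, 128), nn.ReLU(), nn.Linear(128, 2 * latent_dim))

    x = torch.randn(32, n_genes)                 # a batch of cells
    mu, logvar = encoder(x).chunk(2, dim=-1)

    # Reparameterization trick: z = mu + sigma * eps keeps the sample differentiable in mu and sigma.
    eps = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * logvar) * eps
    print(z.shape)                               # (32, 10)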

  • 01:05:00 In this section, the speaker gives a detailed explanation of the generative process for single-cell RNA sequencing data using a conditional variational autoencoder (CVAE) and further explains how this model can be used for various tasks like stratification, harmonization, normalization, imputation, and differential expression. The speaker emphasizes how this approach can handle batch effects and improves scalability. The speaker also demonstrates the usefulness of the model by showing that it can recover hierarchical clusters and developmental gradients in the embeddings and can handle cases with severe batch effects and many batches.

  • 01:10:00 In this section, the presenter discusses the challenge of mixing cell type datasets while still being able to distinguish cell types. They present the scVI model, which can mix datasets without losing the ability to see cell types. The presenter also talks about the use of the normalized-expression latent variable (rho) for differential expression analysis. The team compared the ranking of genes from scVI and other methods against microarray technology and found that scVI performed similarly or even slightly better. Lastly, the presenter introduces scANVI, an extension of scVI used for annotation purposes, allowing for the transfer of labels from one dataset to another. scANVI is based on a mixture model, changes the prior on z, and uses a neural net for cell type assignment.

  • 01:15:00 In this section, the speaker discusses the use of the scANVI framework in a use case where there is a subset of T cells whose sub-cell types cannot be identified based on marker genes that are lowly expressed. By using scANVI to propagate the labels, the approach becomes a semi-supervised learning method, which works better than clustering or classification alone because it utilizes knowledge about all the cells. Additionally, the speaker presents the problem of factoring out continuous information or covariates from the latent space, which is difficult to handle with the neural nets used to parameterize the variational distribution. They introduce HSIC-constrained VAEs, a method that enforces independence statements in the aggregated posterior, resulting in looser lower bounds with more suitable properties. Lastly, they discuss differential expression and how it can be thought of as a Bayesian model selection problem, where likelihood ratios (Bayes factors) can be used as a threshold for determining differential expression within this framework.

  • 01:20:00 In this section, the speaker discusses the challenges and limitations associated with using posterior probabilities to calculate measures of gene expression. The approach can be biased if the posterior is incorrect, and many people prefer controlling the false discovery rate over using Bayes factors. To solve this problem, the speaker proposes a method for calculating posterior expectations accurately using samples from the variational distribution. They introduce different upper bounds that overestimate the variance, which is more useful for importance sampling than underestimating it. Additionally, the speaker presents a procedure for combining multiple proposals together to control the false discovery rate with scVI. The paper associated with this work also includes theoretical analyses that quantify the error of importance sampling using concentration bounds.

  • 01:25:00 In this section, the speaker discusses the implementation of various deep learning algorithms in a single code base called scvi-tools, which contains tools for analyzing single-cell omics data and an interface to probabilistic programming languages. The code base contains implementations of around 10 to 13 generative models, and users can easily change a conditional variational autoencoder in one line of code or create a new one. The speaker also mentions a review paper that discusses the impact of variational autoencoders and generative adversarial networks in molecular biology.
 

Dimensionality Reduction - Lecture 11 - Deep Learning in Life Sciences (Spring 2021)

The video lectures on deep learning in life sciences explore dimensionality reduction techniques for clustering and classification in single-cell data analysis. The lectures distinguish between supervised and unsupervised learning and explore the use of statistical hypothesis testing frameworks for evaluating differential expression of genes. The lecture introduces the concept of manifold learning, covering principal component analysis, eigendecomposition, and singular value decomposition for linear dimensionality reduction, and discusses t-distributed stochastic neighbor embedding (t-SNE) for non-linear embeddings that preserve local structure. The speaker also discusses the application of non-negative matrix factorization to genomic data and the integration of single-cell and multi-omic data sets. The ultimate goal of these techniques is to redefine cell types and identity in an unbiased and quantitative way.

The second part discusses several topics related to dimensionality reduction, specifically its application in the life sciences. Integrative non-negative matrix factorization (iNMF) is used to link transcriptomic and epigenomic profiles to better understand cellular identity across various contexts. The lecture also discusses the benefits of using a mini-batch approach in deep learning, particularly for larger datasets, and how online algorithms can be leveraged to improve dimensionality reduction methods for analyzing large datasets. Additionally, an extension of the algorithm is introduced to integrate different types of data, such as RNA-seq and ATAC-seq data. Finally, the speaker expresses willingness to serve as a mentor for students interested in the field.

  • 00:00:00 In this section, the video lectures continue the discussion on single-cell data analysis and focus on dimensionality reduction techniques for clustering and classification. The gene expression matrices that measure thousands of genes across thousands of experiments can be used for clustering genes or cells or for the classification of cell types based on their gene expressions. The lectures distinguish between supervised and unsupervised learning and explore the use of statistical hypothesis testing frameworks for evaluating the likelihood of differential expressions of genes. The video also mentions the need to consider the underlying distribution of the data and find the most appropriate fit for the observed distribution in the data set.

  • 00:05:00 In this section, the lecturer discusses the various reasons for dimensional reduction in both supervised and unsupervised learning applications. These include data visualization, data reduction, data classification, and reducing noise in data sets. The lecturer explains that dimensionality reduction can help understand factors that drive variation, distinguish between different classes, and identify interesting subsets of data. Additionally, the lecturer describes how dimensionality reduction involves mapping high-dimensional data onto a lower-dimensional manifold.

  • 00:10:00 In this section of the lecture, the concept of manifold learning is introduced as a way to understand the true dimensionality of high-dimensional data, which allows for a lower-dimensional representation. Manifold learning involves taking high-dimensional data and understanding its true dimensionality, which is often much lower than the number of measured dimensions. Linear dimensionality reduction using principal component analysis (PCA) is discussed as one of the most common ways of learning these manifolds. PCA involves projecting the data onto a set of linear coordinates, which is a transformation of the original space. The eigenvectors of the data are used in PCA to find the directions that are preserved, up to scaling, under this transformation.

  • 00:15:00 In this section of the lecture on deep learning in life sciences, the concept of eigen decomposition is introduced as a way to decompose a large matrix of data into its principal vectors of variation. For symmetric matrices, eigenvectors are orthogonal, and for real symmetric matrices, eigenvectors are both orthogonal and real. Eigen decomposition captures the most natural linear dimensionality reduction of a dataset, and the diagonal matrix represents the effects of the independent principal components. For non-symmetric matrices, singular value decomposition is used to find the eigenvectors of the genes and conditions and their combinations that best explain the data.

  • 00:20:00 In this section, the lecturer discusses the concept of singular value decomposition (SVD) and how it can be used for linear dimensionality reduction. SVD is a way of decomposing a matrix into a series of operations, including two rotations and a scaling, in order to find the most important dimensions of variation in the data. The resulting matrix can be used to compute an optimal low-rank approximation of the original data, allowing for the representation of the data in a lower dimensional space. This is useful for linear dimensionality reduction, which is limited in its capabilities, but non-linear dimensionality reduction can eliminate some of these constraints. Principal component analysis is one method of linear dimensionality reduction that captures the major linear dimensions of variation in the data.
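
As a small illustration of the optimal low-rank approximation described above, the numpy sketch below keeps only the top singular values and vectors; the matrix contents and the chosen rank are arbitrary placeholders.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 50))              # e.g. genes x conditions (placeholder data)

    # SVD decomposes X into two rotations and a scaling: X = U @ diag(s) @ Vt.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)

    k = 5                                       # keep only the top k dimensions of variation
    X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]    # optimal rank-k approximation of X

    print(np.linalg.norm(X - X_k))              # error contributed by the discarded components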

  • 00:25:00 In this section, the method of t-distributed stochastic neighbor embedding (t-SNE) is discussed as a technique for dimensionality reduction that preserves distances at varying scales. Instead of relying on PCA, which treats all distances equally, t-SNE maps a high-dimensional space onto a lower dimension while preserving the proximity of similar data points in the new space. By applying a specific bandwidth, individual cells with similar expression patterns in the high-dimensional space can be made proximal to each other in the lower-dimensional space, minimizing the KL divergence between the two spaces. Gradient-based methods are used to find an embedding that minimizes this KL divergence cost function.
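
For reference, the cost being minimized is the KL divergence between the pairwise-similarity distribution P in the high-dimensional space and the distribution Q induced by the low-dimensional coordinates y_i; this is the standard t-SNE formulation rather than the lecturer's exact notation.

    C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}},
    \qquad
    q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}.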

  • 00:30:00 In this section, the speaker discusses how t-distributed stochastic neighbor embedding (t-SNE) preserves the local similarity structure of the data by using gradient descent to optimize the coordinates of the lower-dimensional space. The approach is a non-linear embedding that preserves local distances rather than global distances and penalizes embeddings in which points that are nearby in the original space end up spread apart. This method is commonly used for visualizations of single-cell data sets, and the number of neighbors considered and the size of the original clusters can affect the quality of the embedding.

  • 00:35:00 In this section, the speaker discusses the concept of a lower-dimensional projection of data with a focus on learning specific clusters of cell types for single-cell data analysis. They talk about a method that allows for joint projection of multiple types of omics data into a lower-dimensional space within which they can be matched to each other. The speaker presents several approaches he has developed, including the LIGER approach, which uses integrative non-negative matrix factorization (iNMF), and a method for scaling up the iNMF algorithm using online learning. The talk concludes by discussing ongoing projects for integrating data sets with partially overlapping features and combining variational autoencoders and generative adversarial networks to generate single-cell RNA profiles.

  • 00:40:00 In this section, the speaker discusses the various types of measurements that can be performed in single cells, including gene expression, histone modification, transcription factor binding, chromatin accessibility, DNA methylation, and chromatin conformation. They also highlight the significance of knowing spatial coordinates and mapping molecular information back into tissue context. The speaker mentions the challenge of moving toward a quantitative definition of cellular identity, where molecular and other types of information with single-cell resolution are used to redefine cell types in an unbiased fashion. To address these challenges, the speaker developed a tool called LIGER, based on integrative non-negative matrix factorization, to perform integrative single-cell analysis across data sets of different measurements. They also discuss the benefits of non-negative matrix factorization's "parts-based decomposition" approach.

  • 00:45:00 In this section, the transcript discusses the application of non-negative matrix factorization (NMF) to genomic data, allowing for the interpretation of NMF factors as metagenes that group co-expressed or co-regulated genes. These factors can represent biological pathways or cell type-specific genes, as well as capture technical factors. By grouping genes into metagenes and summarizing cell expression using these metagenes, NMF allows for a quantitative definition of cell identity and the identification of cell types and states across multiple data sets. The interpretability of metagenes also allows for the identification of technical signals and their deconvolution from biological signals in the data sets.
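
A minimal sketch of the metagene idea described above: non-negative matrix factorization splits a cell-by-gene matrix into non-negative cell loadings and non-negative metagene weights over genes. The data here are random placeholders, and scikit-learn's generic NMF is used as a stand-in for the integrative factorization discussed in the lecture.

    import numpy as np
    from sklearn.decomposition import NMF

    rng = np.random.default_rng(0)
    X = rng.poisson(1.0, size=(500, 2000)).astype(float)   # cells x genes (placeholder counts)

    # Factorize X ~ W @ H with everything non-negative:
    #   H (factors x genes) holds the "metagenes" grouping co-expressed genes,
    #   W (cells x factors) summarizes each cell's expression in terms of those metagenes.
    model = NMF(n_components=10, init="nndsvda", max_iter=300, random_state=0)
    W = model.fit_transform(X)
    H = model.components_
    print(W.shape, H.shape)    # (500, 10) (10, 2000)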

  • 00:50:00 In this section, the speaker discusses how they solved the iNMF optimization problem mathematically and derived a novel algorithm based on block coordinate descent, which has some significant advantages and provides a convergence guarantee. They use an efficient algorithm to solve the non-negative least squares subproblems and perform downstream steps to increase the overall robustness of the analysis. The speaker then gives an example of how they integrated single-cell RNA-seq data across human donors to cluster the cells by cell type rather than by donor, identifying the main cell types of the substantia nigra and gaining insights into how the cells are similar and different across human donors.

  • 00:55:00 In this section, the speaker discusses different applications of single-cell data integration. One example is the integration of spatial and single-cell data sets, which can help identify the spatial locations of cell types within a tissue and provide insights into tissue architecture. The speaker gives an example using a data set from mouse brain to identify two subtypes of astrocytes with different spatial locations, which provides insight into how neural circuits work together. Another important application is integrating multi-omic data sets from single cells, which is challenging because the data sets share neither instances nor features. The speaker explains a strategy for linking these data sets by transforming the epigenome data into gene-level features and correlating them with gene expression.

  • 01:00:00 In this section, the speaker discusses how integrative non-negative matrix factorization (iNMF) can be used to link transcriptomic and epigenomic profiles in order to better understand cellular identity across different contexts. By using data from mouse cortex and human bone marrow, the speaker demonstrates how linking gene expression and methylation data can provide a clearer understanding of cell types and even identify cell types with ambiguous labels. Additionally, the speaker explains how an online learning algorithm can be utilized to solve the iNMF problem in larger and larger datasets by incrementally updating calculations as new data arrives in a streaming fashion.

  • 01:05:00 In this section, the lecturer discusses the benefits of a mini-batch approach, as used in deep learning, particularly for large datasets. This approach allows for an iterative update of the weights and avoids having to store the entire dataset in memory, resulting in faster convergence. The lecturer outlines three scenarios where mini-batch processing is particularly useful, with the key advantage being the ability to incorporate new data sets as they arrive without having to re-analyze any previous data sets. The lecturer also discusses the computer science behind this approach, leveraging existing theory from a paper on online dictionary learning to optimize a surrogate function that converges asymptotically to the same solution in terms of parameters (a toy version of such an update is sketched below). Ultimately, this approach works well in practice and converges much more rapidly because each additional cell in a larger dataset is largely redundant.
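A hedged sketch of what such an online update can look like, in the spirit of online dictionary learning (the mini-batch size, update rule, and data here are illustrative assumptions, not the exact algorithm from the talk): running sufficient statistics are accumulated as batches stream in, so earlier cells never need to be revisited.

```python
# Online/mini-batch NMF-style update: accumulate sufficient statistics A and B
# over streaming batches, then refresh the shared factor W from them.
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
k, n_genes = 10, 100
W = rng.random((k, n_genes))
A = np.zeros((k, k))            # running sum of H_b^T H_b
B = np.zeros((k, n_genes))      # running sum of H_b^T X_b

for _ in range(50):                                       # stream of mini-batches
    X_b = rng.random((32, n_genes))                       # hypothetical new cells
    H_b = np.vstack([nnls(W.T, x)[0] for x in X_b])       # codes for this batch
    A += H_b.T @ H_b
    B += H_b.T @ X_b
    # Multiplicative non-negative update of W from the running statistics
    W *= B / np.maximum(A @ W, 1e-12)

print(W.shape)
```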

  • 01:10:00 In this section, the speaker discusses the advantages of using online algorithms in dimensionality reduction methods for analyzing large datasets. The speaker presents a benchmark of their approach against other widely used methods, showing that it has significantly lower memory usage and is more time-efficient. They demonstrate the method's iterative refinement capability using data generated by the BRAIN Initiative Cell Census Network, where they incorporate new data sets into the factorization using the online algorithm. They also show how the iNMF algorithm can be extended to the case where features partially overlap, allowing them to leverage both shared and non-shared features across data sets, which is a more satisfying approach than previous methods that force features to align.

  • 01:15:00 In this section, the speaker explains how an algorithm can be used to leverage all the features present in a data set, even if some features are only present in one of the data sources. The algorithm can be used to integrate different types of data, such as RNA-seq and ATAC-seq data, to give a more complete picture of gene expression, which can enhance the ability to resolve clusters or cell profiles. The speaker also introduces a new approach, called MichiGAN, that combines the strengths of variational autoencoders (VAEs) and generative adversarial networks (GANs) to generate realistic cell profiles from single-cell expression data. The algorithm uses the disentanglement performance of the VAE and the generation performance of the GAN to create a powerful approach for manipulating and predicting changes in cell identity.

  • 01:20:00 In this section, the speaker expresses his willingness to serve as a mentor for students interested in the field and thanks the audience for attending the lecture. The moderator conducts a quick poll to check if the listeners have learned something, and the audience responds positively. Overall, the lecture was well-received and informative.
Dimensionality Reduction - Lecture 11 - Deep Learning in Life Sciences (Spring 2021)
  • 2021.03.31
  • www.youtube.com
MIT 6.874/6.802/20.390/20.490/HST.506 Spring 2021. Prof. Manolis Kellis. Guest lecture: Joshua Welch. Deep Learning in the Life Sciences / Computational Systems Biology.
 

Disease Circuitry Dissection GWAS - Lecture 12


Disease Circuitry Dissection GWAS - Lecture 12 - Deep Learning in Life Sciences (Spring 2021)

This video on disease circuitry dissection GWAS covers the foundations of human genetics, the computational challenges for interpretation, and the various types of genetic variations examined in genome-wide association studies (GWAS). The video also explores methodologies such as Mendelian mapping, linkage analysis, and the identification of single nucleotide polymorphisms (SNPs) associated with diseases. Additionally, the speaker discusses the use of chi-square statistics, Manhattan plots, and QQ plots to visualize genomic regions significantly associated with disease phenotypes. The video also includes a case study on the FTO gene and how it was comprehensively dissected for its mechanistic implications in obesity. The challenges of understanding the genetic association with obesity and the steps to approach this issue are also discussed.

The lecture discusses the challenge of studying the impact of genomic variations on human health, and the importance of understanding how mutations affect different cell types. The speaker outlines their deep learning approach to predicting the effect of genomic sequence and variations, particularly in relation to predicting the binding of transcription factors and the organization of chromatin. They also describe their evaluation of these predictions using deeply sequenced genomic datasets to predict DNase sensitivity and histone mark QTLs, as well as their use of deep learning to predict the effect of mutations on gene expression and human diseases such as autism. Finally, they discuss their unbiased analysis of previously known gene sets and the use of a deep learning sequence model library.

  • 00:00:00 In this section of the video, the speaker discusses the foundations of human genetics and the computational challenges in interpretation. They explain how genetic variations are identified through genome-wide association studies (GWAS) and how individual genetic variants that contribute to diseases are found. The lecture also covers gene hunting and the use of linkage and GWAS to recognize locations associated with diseases. The challenges of fine mapping, case studies, and machine learning tools for variant interpretation, including DeepVariant and DeepSEA, are also discussed. The history of human genetics and inheritance patterns is briefly covered, starting from ancient Greece and continuing to the development of the concepts of transmutation and natural selection by Darwin.

  • 00:05:00 In this section, the speaker discusses the reconciliation between the discrete inheritance of Mendel and the observed continuous variation in phenotypic traits. The concept of particulate inheritance introduced by Mendel showed that there were discrete units of inheritance named genes that were dominant or recessive. However, the continuous variation observed by the biometricians in humans could not initially be explained by Mendelian inheritance. This changed with the work of statisticians in the early 1900s, who showed that continuous variation could be explained by multiple Mendelian loci. This became the basis for Mendelian trait mapping, which eventually led to the understanding that chromosomes and DNA carry the genetic material. Additionally, the speaker discusses how deviation from the rule of independent assortment became the workhorse of human genetics, since traits that are physically close on the chromosome tend to be co-inherited.

  • 00:10:00 In this section, the speaker discusses the traditional approach of genetic mapping known as Mendelian mapping, which uses linkage and the segregation frequency of different traits to trace down the regions of the human genome where different traits are encoded. However, this approach is only effective for traits with a strong effect. The speaker then talks about the revolution in the 2000s that made it possible to map weak-effect variation, which was previously impervious to analysis using traditional linkage methods. This was accomplished through genome-wide association studies (GWAS), which look at every single SNP across the genome and how it varies with different diseases. The speaker goes on to explain the types of variation examined in GWAS, including SNPs, indels, STRs, structural variants, and copy number variants, and how these variations can impact the functionality of the genome.

  • 00:15:00 In this section, the speaker introduces the workhorse of Genome-Wide Association Studies (GWAS), namely Single Nucleotide Polymorphisms (SNPs), which are the most common type of genetic variation. SNPs have two alleles, and every variant has been clustered and catalogued in a database called dbSNP. The speaker also discusses other types of variation, such as short tandem repeats, insertions and deletions, and more. Additionally, the difference between common and rare variants is explained, as rare variants allow for the examination of strong-effect variation. The challenge of finding disease genes is highlighted, given that humans have two copies of their genome, which consists of 23 chromosomes, 20,000 genes, 3 billion letters of DNA, and millions of polymorphic sites.

  • 00:20:00 In this section, the lecturer explains the difference between common and rare variants in genetics and their relationship with genome-wide association studies and Mendelian analysis. Rare variants have a big effect and are mostly found in Mendelian analysis, while common variants have a small effect and can be captured by genome-wide association studies. Additionally, linkage analysis can help pinpoint the location of a gene that causes a disorder by studying markers across the chromosomes and seeing which ones co-inherit with the phenotype in a population.

  • 00:25:00 In this section, the speaker introduces genome-wide association studies, which gather thousands of individuals, roughly 50% cases and 50% controls, to study conditions such as schizophrenia, obesity, or diabetes. These studies typically over-represent cases to gain power, and genotyping technology is used due to its low cost compared to sequencing. The speaker emphasizes the importance of quality control for both samples and SNPs to ensure the accuracy of the results. In addition, the speaker explains the concept of population stratification and the need to eliminate relatedness between individuals in the study.

  • 00:30:00 In this section, the speaker explains how to use a chi-square statistic and p-value distribution to detect actual disease signals in a genome-wide association study (GWAS). Using a contingency table that shows how many cases and controls carry the allele of each SNP, the speaker looks for deviations in the frequency of alleles between cases and controls. The chi-square statistic measures the magnitude of the deviation and the p-value is used to reject the hypothesis that the allele has no effect on the phenotype. The speaker then explains how to plot the p-values in a Manhattan plot to visualize the genomic regions that are significantly associated with the disease phenotype.
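As a toy illustration of the contingency-table test described above (all counts invented):

```python
# Toy allelic chi-square test: 2x2 contingency table of allele counts
# in cases vs. controls at a single SNP (numbers are made up).
from scipy.stats import chi2_contingency

#                risk allele   other allele
table = [[1200,  800],    # cases
         [1000, 1000]]    # controls

chi2, pval, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {pval:.2e}")   # small p suggests association
```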

  • 00:35:00 In this section, the speaker discusses the use of the Manhattan plot, which displays the -log10 p-value of each SNP's association with the disease, as well as the QQ plot, which compares the observed p-values of the millions of SNPs tested against their expected distribution. These are followed by functional analyses to examine the role of the SNPs in other ways. The genome-wide significance level is set at 5 × 10^-8, which was established from a back-of-the-envelope calculation 20 years ago (written out below). However, fine mapping can be challenging due to the limited genetic variation in the human population, which has not had enough time for all SNPs to segregate independently.
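The back-of-the-envelope reasoning behind that threshold is essentially a Bonferroni-style correction of a 0.05 significance level over roughly one million independent common variants:

```latex
\alpha_{\text{genome-wide}} \;\approx\; \frac{0.05}{10^{6}} \;=\; 5 \times 10^{-8}
```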

  • 00:40:00 In this section, the lecturer discusses how variants are inherited in blocks rather than isolated, meaning that if one variant in a block has a certain allele, then every variant in that block has that same allele. After finding an association in a region, the next step is to identify which single nucleotide polymorphism (SNP) is responsible for the association. A Crohn's disease study found a region that was detected by both linkage analysis and genome-wide association studies, while another region was only found by the latter. The lecturer explains the frequency and effect size of each region's risk allele.

  • 00:45:00 In this section, the speaker discusses the rarity of protective alleles and the difficulties in discovering them through case-control and cohort studies. They explain that rarer alleles that decrease risk are less likely to be found in studies that greatly enrich for cases, and the family pedigrees required for such studies are not feasible. The speaker also explains the difference between the common variants that GWAS captures and the rare, strong-effect alleles that linkage analysis captures. The section concludes with a brief overview of haplotypes and recombination hotspots, including their variation across populations and the importance of PRDM9 in guiding recombination events. Finally, the speaker introduces a study on the FTO gene, which was the strongest GWAS hit for obesity and body mass index and was comprehensively dissected for its mechanistic implications.

  • 00:50:00 In this section of the lecture, the speaker discusses the challenges of understanding the genetic association with obesity and outlines the steps to approach this issue. The first step is to identify the relevant tissue and cell type, which is accomplished by examining epigenomic annotations of various tissues. The second step is to find the downstream target gene, which is complicated by long-range linking and looping. The speaker explains that measuring the expression of different genes in homozygous risk and non-risk individuals reveals that the FTO gene itself shows no change in expression, but rather the IRX3 and IRX5 genes, located far away from FTO, are likely the target genes.

  • 00:55:00 In this section, the speaker describes how they identified target genes for non-coding loci related to obesity and pinpointed the causal SNP using regulatory motif analysis and evolutionary conservation. By disrupting the upstream regulator and the SNP, they were able to show the epistasis between the two and how it affects repression and de-repression. The speaker explains that disrupting the motif decreases repression and the enhancer becomes overactivated, leading to over-activation of IRX3 and IRX5 at the gene expression level and causing a shift from energy dissipation to storage. By building a model and using genome editing, they were able to go from a region of association about which they knew nothing to understanding the biological process and target genes, and then intervening to change the circuitry.

  • 01:00:00 In this section of the lecture, the speaker discusses the challenge of studying the impact of the numerous genome variations that exist in individuals, and the importance of gaining a better understanding of how genomic sequence and mutations affect different cell types and human health. The speaker explains that they take a machine learning approach to utilize genomic sequence and large amounts of functional genomic data to build models that can predict the effect of genomic sequence and variations. Specifically, the speaker discusses their work on predicting the binding of individual transcription factors and the organization of chromatin based on genomic sequences. They aim to develop a systematic method for predicting the impact of 120,000 genome variations at a time using deep learning techniques.

  • 01:05:00 In this section, the speaker discusses their decision to use a deep convolutional network model to build a regulatory sequence model that satisfies their three requirements: the ability to use large sequences and long sequence context, the ability to model the nonlinear interactions across different regions of the sequence, and the ability to share sequence features learned across all the different tasks. The speaker explains that the model learns different levels of sequence features at the lower levels and learns higher order sequence patterns at the higher levels. They also emphasize the importance of preserving the spatial information when making position-specific predictions. The model can be used to predict the effect of any genomic variant by giving the model two sequences that differ by only one variant and comparing the predictions for each allele.
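A hedged sketch of that ref-versus-alt comparison; the model.predict interface, the dummy model, and the helper names below are illustrative stand-ins, not the actual network or API from the lecture:

```python
# Score a variant as the difference between predictions for the two alleles.
import numpy as np

BASES = "ACGT"

def one_hot(seq: str) -> np.ndarray:
    """Encode a DNA string as a (length x 4) one-hot matrix."""
    out = np.zeros((len(seq), 4))
    for i, b in enumerate(seq):
        out[i, BASES.index(b)] = 1.0
    return out

def variant_effect(model, seq: str, pos: int, alt: str) -> np.ndarray:
    """Predicted per-task effect of substituting `alt` at `pos` (hypothetical API)."""
    ref_pred = model.predict(one_hot(seq)[None, ...])
    alt_seq = seq[:pos] + alt + seq[pos + 1:]
    alt_pred = model.predict(one_hot(alt_seq)[None, ...])
    return alt_pred - ref_pred

class DummyModel:
    """Stand-in for a trained network: sums one-hot content per channel."""
    def predict(self, x):
        return x.sum(axis=1)

print(variant_effect(DummyModel(), "ACGTACGTAC", pos=3, alt="A"))
```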

  • 01:10:00 In this section, the speaker describes how they evaluated the accuracy of their predictions for variants that affect DNase sensitivity at the chromatin level. They analyzed deeply sequenced genomic datasets and looked for heterozygous variants where one allele was significantly more represented than the other, indicating potential DNase sensitivity differences. They trained a model to predict the DNase sensitivity for both the reference and alternative alleles and compared the predictions with experimental results. They found that the model had higher accuracy for variants with stronger differences between the reference and alternative alleles and for more confidently predicted variants. The evaluation was robust to false positives, allowing them to filter for true positives. They also applied this approach to histone mark QTLs and found they could predict which allele is linked to higher histone marks.

  • 01:15:00 In this section, the speaker discusses how deep learning can be used to predict the molecular-level effects of variants on gene expression. They face challenges such as needing to consider larger regulatory sequences and having fewer training samples available. They address these by looking at a broad region of 40 kilobases and applying a pre-trained model to make predictions at different positions. They then learn a smooth spatial pattern of contributions from the chromatin predictions at each position to gene expression using a regularized linear model (sketched below). Through this approach, they can predict the effect of different mutations and how they might cause the same disease through a similar mechanism. While the problem of predicting gene expression is far from solved, this is a first attempt to address it.
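A hedged sketch of that idea, with invented sizes and data and a simple exponential decay standing in for the learned spatial weighting (it mirrors the spirit of the approach, not its actual implementation):

```python
# Collapse chromatin predictions tiled across a wide window with smooth
# distance-based weights, then fit a regularized linear model to expression.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_genes, n_bins, n_marks = 500, 200, 20            # hypothetical sizes
preds = rng.random((n_genes, n_bins, n_marks))     # model outputs per bin

# Smooth exponential-decay weights centered on the TSS (assumed mid-window)
positions = np.arange(n_bins) - n_bins // 2
weights = np.exp(-np.abs(positions) / 20.0)
features = (preds * weights[None, :, None]).sum(axis=1)   # (n_genes, n_marks)

expression = rng.random(n_genes)                   # stand-in expression values
model = Lasso(alpha=0.01).fit(features, expression)
print(model.coef_[:5])
```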

  • 01:20:00 In this section, the speaker discusses the use of deep learning to predict the effect of genomic variants on human diseases such as autism. They explain that non-coding mutations have been difficult to detect and attribute to disease. However, by using models to predict the impact of mutations on chromatin profiles and protein binding profiles, they were able to compare the mutations of individuals with autism to their unaffected siblings. The researchers found that there was a stronger effect on genes associated with autism in individuals with autism compared to their unaffected siblings, thus confirming the contribution of non-coding mutations to the disease.

  • 01:25:00 In this section, the speaker discusses an unbiased analysis using previously known gene sets to determine the contribution of non-coding mutations. They use a network neighborhood-based analysis to look for stronger effects of proband mutations compared to sibling mutations within a gene network. This analysis shows a convergence of the mechanisms indicated by coding and non-coding mutations, with genes clustered into synapse-related and chromatin regulation-related groups that had previously been implicated by the coding mutations discovered in autism individuals. The speaker also briefly mentions a deep learning sequence model library that can be used to train and evaluate sequence models.
 

GWAS mechanism - Lecture 13



GWAS mechanism - Lecture 13 - Deep Learning in Life Sciences (Spring 2021)

The lecture on GWAS mechanism in the Deep Learning in Life Sciences series looks at various methods to understand the function of non-coding genetic variants involved in complex traits. The lecture discusses the use of epigenomic annotations and deep learning models to identify global properties across genetically associated regions for a particular disease. It also covers enrichments across different tissues and enhancers and explains how these can be turned into empirical priors to predict the causal SNP within a locus. The lecture also discusses the use of intermediate molecular phenotypes like gene expression and methylation to study causality in genome-wide association studies and how to combine genotype and expression personal components to explain the phenotypic variable of expression. Lastly, the lecture examines the use of causal inference methods to determine the effect of changing a variable on outcome variables to identify causal versus anti-causal pathways.

The lecturer in this video discusses various techniques for inferring causal effects in genomics research. They cover the concept of d-separation and using natural randomization in genetics as a way to establish causal relationships. The lecturer also discusses Mendelian randomization and Rubin's Quasi-Inference Model, along with the potential outcome method for causal inference. They touch on the challenges of imputation and adjusting for biases in observational studies. The speaker also stresses the importance of using multiple orthogonal evidence to develop a robust causal algorithm. Additionally, they explain the use of genetics to perturb gene expressions and learn networks, and introduce the invariance condition as a way to identify causal structures in data. The lecture provides a comprehensive overview of various techniques and tools used in genomics research for causal inference.

  • 00:00:00 In this section, the lecture focuses on expanding the discussion from the previous session to global variables such as epigenomic enrichments, eQTLs, and the study of mediation and causality, with guest lecturer Professor Yong Jin Park from the University of British Columbia. The lecture plans to briefly review fine mapping and locus mechanistic dissection, followed by different methods for global enrichment analysis using epigenomics to infer tissues of action, regulators, cell types, and target genes. Furthermore, the lecture will look at linear mixed models and polygenic risk scores used in genome-wide association studies to predict phenotypes, and at heritability, to transition to the remaining topics in Thursday's lecture. The ultimate goal is to understand the functional drivers and mechanistic bases behind every peak in the Manhattan plots simultaneously across thousands of genetic loci.

  • 00:05:00 In this section of the lecture, the instructor discusses the challenge of using genetics to understand disease mechanisms for complex traits, which are primarily governed by non-coding variants. To address this challenge, the instructor proposes using epigenomic annotations of cell circuitry and deep learning models to identify global properties across all genetically associated regions for a particular trait. By comparing the differences in enrichments across different traits, such as height and type 1 diabetes, the instructor suggests that they can learn properties that cut across all regions and use them to infer properties of individual loci. This approach can provide an unbiased view of disease and help with predicting target genes, therapeutics, and personalized medicine.

  • 00:10:00 In this section, the speaker explains the process of evaluating the overlap between genetic variants and tissue-specific enhancers to look for significant enrichment using a hypergeometric or binomial statistical test. They found that genetic variants associated with different traits show tissue-specific enrichment across enhancers active in those tissues. For example, genetic variants associated with height were enriched in embryonic stem cell enhancers, while genetic variants associated with blood pressure were enriched in enhancers acting in the left ventricle. They also discovered that Alzheimer's disease was not globally enriched for enhancers active in the brain but instead enriched for enhancers active in immune cells of the brain, specifically CD14+ cells. This led them to postulate that genetic variants associated with Alzheimer's act primarily in immune cells of the brain. They can now use this information in a Bayesian framework to determine which genetic variants associated with disease are more likely to be functional.

  • 00:15:00 In this section of the lecture, the speaker discusses how to turn the observed enrichments into empirical priors that can be used in GWAS. Using the example of Crohn's Disease and Alzheimer's, the speaker explains that genetic variants associated with a disease being enriched in certain regions can be used as a prior to predict the causal SNP within a given locus. They then explain how this prior can be combined with the evidence from GWAS summary statistics in order to build a posterior probability for each variant. The efficacy of this method, called RIVIERA, is demonstrated by the fact that the SNPs it prioritizes are more likely to be evolutionarily conserved and found in eQTLs and digital genome footprints.
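Schematically, the per-variant posterior being described combines the enrichment-derived prior with the GWAS evidence in the usual Bayesian way (the notation here is illustrative rather than the exact model):

```latex
P(\text{variant } j \text{ is causal} \mid \text{GWAS data}) \;\propto\; P(\text{GWAS data} \mid \text{variant } j \text{ is causal}) \;\; \pi_j
```

where the prior pi_j is derived from the epigenomic enrichment of the annotations overlapping variant j.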

  • 00:20:00 In this section of the lecture, the speaker discusses using enriched enhancers to make highly specific associations between genetic variants and traits. By mapping these traits to the enhancers they overlap with, the speaker discusses partitioning genetic loci into specific tissues to better understand the biological functions associated with these loci. The speaker highlights how this can be used to partition complex traits into simpler components and prioritize loci based on their proximity to enhancers in specific tissues. The speaker also provides several examples of loci associated with coronary artery disease that overlap with different tissues and target genes. Additionally, the speaker discusses how new loci that do not reach genome-wide significance can also be studied and mapped to specific tissues.

  • 00:25:00 In this section, the lecturer explains how they use a machine learning approach to prioritize sub-threshold loci, which are less significant than genome-wide significance, and discover novel loci by learning features in the genome-wide significant ones. They discovered many loci associated with heart repolarization and used their features as predictors to prioritize sub-threshold variants with additional lines of evidence from experimental testing. They found that the genes prioritized using this approach were strongly enriched for related genome association studies and linked to target genes that make sense, with a strong correlation to cardiac conduction and contractility phenotypes. They also discussed how they use expression quantitative trait loci to bridge the gap between genetic variation and disease by looking at intermediate molecular phenotypes.

  • 00:30:00 In this section, the speaker discusses the use of intermediate molecular traits, specifically the level of expression of a gene or the level of methylation of a specific site, as a way to study causality in genome-wide association studies. The goal is to focus on specific tissues, genomic mechanisms, gene expression changes, and endophenotypes to identify which traits are a consequence of genetics versus those that are a consequence of the disease. The basis of methylation quantitative trait loci and expression quantitative trait loci is to treat these molecular measurements as quantitative traits, just like height, and correlate the number of alternate alleles with the level of methylation or the expression of a nearby gene (a toy regression is sketched below). This approach has led to the discovery of tens of thousands of methylation QTLs, and imputing these intermediate molecular phenotypes can help predict methylation and correlate it with disease.
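A toy version of the regression being described, with simulated genotype dosages and expression values:

```python
# Toy eQTL-style test: regress a molecular trait (expression of one gene) on
# genotype dosage (0, 1, or 2 copies of the alternate allele). Data invented.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 300
dosage = rng.integers(0, 3, size=n)                        # genotypes at one SNP
expression = 1.0 + 0.4 * dosage + rng.normal(scale=1.0, size=n)

res = stats.linregress(dosage, expression)
print(f"beta = {res.slope:.2f}, p = {res.pvalue:.1e}")
```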

  • 00:35:00 In this section, the video discusses how imputed methylation can be used in larger cohorts to discover correlations between genotype-driven methylation and phenotypes like Alzheimer's disease. Imputed methylation is the genetic component of methylation, and by imputing it, researchers can use fewer measured individuals, look specifically at genotype-driven methylation, and increase power while focusing on the genetic component. The video also shows examples of how, in certain cases, using multiple SNPs together made many SNPs that were not individually genome-wide significant become significant, allowing researchers to combine their effects to predict methylation.

  • 00:40:00 In this section of the lecture on deep learning in life sciences, the speaker discusses a methodology for identifying mediating factors of disease phenotypes through genetics, methylation, transcription, and confounder studies. They explain the process of using linear regression models to predict the relationship between these various factors and gene expression, correcting for variables such as population effects and batch effects, and ultimately identifying genetic drivers of intermediate molecular phenotypes like methylation and expression. The methodology involves a Q-Q plot to assess the calibration of the statistics and the use of covariates such as age, gender, and principal components of the genotypes and expression to interpret the results.

  • 00:45:00 In this section of the lecture, the focus is on combining genotype and expression principal components to determine whether a model that includes the additional covariates and the genotype explains the phenotypic variable of expression better than the baseline model alone. This is the basis of an expression quantitative trait locus (eQTL) study, which can be complemented with allelic analysis. Allelic analysis partitions the reads of a heterozygous individual into those carrying one allele (say an A) and those carrying the other allele (a C) from the same cells of the same person. Comparing the expression of the two alleles, for example when the A allele appears to be expressed more highly than the C allele, reveals the allele-specific effect of the region being tested for a particular SNP. The lecture also covers response QTLs and their role in determining QTLs that appear only in response to a particular environmental condition.

  • 00:50:00 In this section, the lecturer discusses the concept of expression quantitative trait loci (eQTLs), which are genomic loci that regulate gene expression levels. The lecturer explains that eQTLs can either be present all the time or only appear in response to a particular stimulus. The transcript then transitions to the topic of causal inference, which the lecturer explains is a way to determine which loci play a causal role in a disease versus which ones are simply correlated with the disease phenotype. The lecturer explains that the causal inference field is divided into two categories, causal effect and causal discovery, and that the lecture will mainly focus on causal effect inference.

  • 00:55:00 In this section, the speaker discusses the use of causal inference methods in genetic analysis. Causal inference involves experimental interventions to determine the effect of changing a variable x on an outcome variable y. The goal is to ensure that the conditional probability is almost equivalent to the interventional probability. The speaker also explains the concepts of reachability, conditioning, adjustment, and d-separation. By using causal graphical language, researchers can ask causal questions and identify causal versus anti-causal pathways. The presence of a backdoor path can affect the interpretation of the conditional probability and create the misconception that correlation equals causation.

  • 01:00:00 In this section, the lecturer discusses the concept of blocking the backdoor path between vector variables to identify the causal effect in genomics research. They introduce the idea of d-separation and creating collider patterns by conditioning on certain variables. The lecturer explains that if a variable is simple enough, researchers can make interventions and randomly assign variables to break the dependency between confounders and the variable of interest. The lecturer emphasizes that genetics is an important variable in genomics research as it is not affected by environmental factors, and setting it to a certain value is like a natural randomized control trial.

  • 01:05:00 In this section, the lecturer discusses the concept of Mendelian randomization and how it can be used to understand the relationship between genotypes, intermediate phenotypes, and disease phenotypes. Because genotypes are naturally randomized, it becomes easier to estimate the true causal effect. Although this method relies heavily on assumptions, it has been successfully applied in eQTL and gene-environment interaction studies. Additionally, the lecturer explains that the causal regression parameter and mediation effects can be estimated by combining the regression of y on g with the regression of x on g (the ratio estimate is written out below). Ultimately, Mendelian randomization offers a unique opportunity to understand complex relationships between variables that are difficult to manipulate in real life.
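In its simplest form (the ratio, or Wald, estimator), the causal effect of exposure x on outcome y is obtained from the two regressions on the genetic instrument g:

```latex
\hat{\beta}_{x \rightarrow y} \;=\; \frac{\hat{\beta}_{y \sim g}}{\hat{\beta}_{x \sim g}}
```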

  • 01:10:00 In this section, the lecturer discusses two approaches to inferring causal effects in genomics research: Mendelian Randomization (MR) and Rubin's Quasi-Inference Model. MR is a randomized control trial that makes use of genotypes to randomly perturb intermediate variables for a randomized controlled trial on a disease outcome. However, MR can be difficult when it comes to unknown confounders or if there are alternate paths. Rubin's Quasi-Inference Model is a counterfactual reasoning approach that measures causal effects when the assignment is a discrete variable. This approach creates an imputation problem as the potential outcome for a unit is missing if it was not observed.

  • 01:15:00 In this section of the lecture on deep learning in life sciences, the speaker discusses the potential outcome method for causal inference in genetic studies. Assumptions such as independence, strong ignorability, and overlap are necessary to estimate individual causal effects accurately. The speaker also provides a toy example involving an Alzheimer's disease drug and discusses how fitting a propensity function and using propensity scores can help adjust for biases and produce fair comparisons between treatment and control groups. The potential outcome method allows researchers to ask interesting questions about the effects of different treatments and interventions.

  • 01:20:00 In this section, the speaker discusses causal inference through the potential outcome framework and state-of-the-art counterfactual inference techniques. They explain how weighting the treated groups can account for the difference in outcomes (a minimal propensity-weighting sketch follows) and how imputation can be used to estimate potential outcomes. They also discuss a recent paper that proposes using a SNP matrix to capture multiple confounders and using population principal components to adjust for these confounding effects, as well as a strategy to impute missing data using Bayesian regression trees. Through this, individual causal effects can be measured to determine the effectiveness of treatments.
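A minimal inverse-propensity-weighting sketch of the weighting idea, with simulated data and a simple logistic propensity model (illustrative only, not the paper's method):

```python
# Fit a propensity model for treatment given covariates, then reweight
# outcomes to estimate the average treatment effect (ATE).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))                      # confounders / covariates
p_treat = 1 / (1 + np.exp(-(X[:, 0] - 0.5)))     # treatment depends on X
t = rng.binomial(1, p_treat)                     # treatment assignment
y = 2.0 * t + X[:, 0] + rng.normal(size=n)       # true treatment effect = 2

e = LogisticRegression().fit(X, t).predict_proba(X)[:, 1]    # propensity scores
ate = np.mean(t * y / e) - np.mean((1 - t) * y / (1 - e))
print(f"IPW estimate of the treatment effect: {ate:.2f}")     # close to 2
```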

  • 01:25:00 In this section, the speaker discusses the causal discovery aspect of deep learning in life sciences. They explain that learning the causal graph structure from high-dimensional data matrices is a complex and challenging problem. However, they note that the breakthrough in this area came from the use of genetics in perturbing genes and measuring the gene expressions to learn networks. They explain that instead of using a score-based likelihood, researchers are now relying on the invariance condition which assumes a single causal model that generates the data, and using this assumption to identify the causal structure of the data. The speaker also provides a toy example that demonstrates this approach.

  • 01:30:00 In this section of the lecture, the speaker discusses the idea of invariance condition and its application in determining whether a model can consistently explain experimental data. The speaker uses the example of gene knockout experiments and shows how the inclusion of a wrong predictor can lead to rejection of the experimental results. The idea of causal triangulation is also mentioned as a way to improve the reproducibility of scientific experiments. The speaker concludes by emphasizing the importance of multiple orthogonal evidence to develop a causal algorithm.
GWAS mechanism - Lecture 13 - Deep Learning in Life Sciences (Spring 2021)
  • 2021.04.08
  • www.youtube.com
MIT 6.874/6.802/20.390/20.490/HST.506 Spring 2021. Prof. Manolis Kellis. Deep Learning in the Life Sciences / Computational Systems Biology.
 

Systems Genetics - Lecture 14



Systems Genetics - Lecture 14 - Deep Learning in Life Sciences (Spring 2021)

In this lecture on systems genetics and deep learning, the speaker covers several topics, including SNP heritability, partitioning heritability, stratified LD score regression, and deep learning in molecular phenotyping. They also explore the use of electronic health records, genomic association studies, and genomics to analyze a UK Biobank dataset of around 500,000 individuals with thousands of phenotypes. The lecturer discusses how deep learning models can be used for sequence function prediction to understand the circuitry of disease loci and the use of linear mixed models for GWAS and eQTL calling. They also touch on the biases and violations of model assumptions in deep learning and highlight the importance of cell type-specific regulatory annotations in inferring disease-critical cell types. Lastly, the lecturer discusses the complexity of findings related to negative selection and causal effect sizes and introduces Professor Manuel Rivas from Stanford University to discuss the decomposition of genetic associations.

The lecture delves into the application of genetic data in various areas, including quantifying the composition and contribution components of traits, identifying genetic variants that contribute to adipogenesis or lipolysis, identifying mutations with strong effects on gene function and lower disease risk, and the development of risk prediction models using multivariate analysis. Additionally, the lecture discusses the application of polygenic risk score models to various biomarkers and stresses the need for data sharing across different populations to improve predictive accuracy, particularly in the case of non-European populations. The lecture concludes by expressing a willingness to supervise students interested in research projects related to UK Biobank polygenic scores and pleiotropic effects.

  • 00:00:00 In this section, the speaker introduces the topic of systems genetics and electronic health records. They briefly review the concepts covered in the previous lectures, including common and rare variants, polygenic risk scores, linkage disequilibrium, and fine mapping of variants. The speaker discusses the challenges in interpreting genome-wide association studies, given that the vast majority of associations are non-coding and involve multiple SNPs per locus. They then introduce the use of genomic, RNA, and variation information, as well as deep learning models for sequence function, to predict driver genes, regions, and cell types and understand the circuitry underlying disease loci. The speaker also introduces the use of linear mixed models for both GWAS and eQTL calling, which predict the fixed and random effects on phenotypes of interest using genotypes and covariates.

  • 00:05:00 In this section, the lecturer explains the basic foundation for predicting a person's phenotype based on their genetic variants and the effect size of each alternate allele across all SNPs in the genome and all individuals in the cohort. The noise is distributed across individuals with mean zero and a sigma-squared covariance matrix. Additionally, random effects are accounted for using a kinship matrix that measures the genetic sharing between individuals. A Bayesian approach is used to integrate over the unknowns and determine the probability of phenotypic effects driven by the covariance matrix. Linear mixed models are built to estimate the total heritability of a particular trait, which rests on the infinitesimal assumption and is estimated using restricted maximum likelihood (REML); the model is written out below. This random-effects model captures the transformations of the data and works despite the lack of knowledge about the actual causal variants.
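In standard notation (a generic formulation, not necessarily the exact one on the slides), the linear mixed model and the SNP heritability it yields are:

```latex
y = X\beta + g + \varepsilon, \qquad g \sim \mathcal{N}(0, \sigma_g^2 K), \qquad \varepsilon \sim \mathcal{N}(0, \sigma_e^2 I), \qquad h^2_{\text{SNP}} = \frac{\sigma_g^2}{\sigma_g^2 + \sigma_e^2}
```

where the term X beta collects the fixed effects (covariates), K is the kinship (genetic relatedness) matrix, and the variance components are estimated by restricted maximum likelihood.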

  • 00:10:00 In this section, the speaker discusses the use of deep learning in capturing additional variation through predicting the effect of intermediate molecular phenotypes and the linear relationship between SNPs and expression. The speaker explains that this can be done using prior distributions that match the potential noise surrounding the estimate, which allows for inferring the most preferred outcome. They also mention the influence of population differences, where the strongest effects driving the genetic relatedness matrices stem directly from population differences. Finally, the speaker explains the concept of heritability and how partitioning genetic relatedness into subsets of the genome can be a powerful approach for computing heritability, showing that longer chromosomes explain a larger share of the variance of many complex traits because they carry more variants.

  • 00:15:00 In this section, Alkes Price from the Harvard School of Public Health explains the concept of SNP heritability, which is a parameter defined as the maximum value attainable in the entire population regarding the relationship between phenotype and genotype. He discusses the idea of partitioning heritability across different functional categories of SNPs, such as coding versus non-coding, and how this could lead to conclusions on which SNPs are enriched for heritability in specific diseases and tissues. Price also introduces the concept of stratified LD score regression as a tool for studying disease-critical cell types and cellular processes across the human body.

  • 00:20:00 In this section, the speaker introduces the idea of analyzing summary association statistics from large data sets in statistical genetics. This approach is useful for diseases such as schizophrenia, rheumatoid arthritis, and Crohn's disease, where large sample sizes are available, because it uses summary-statistic data rather than individual-level genotypes and phenotypes. The speaker explains the method of stratified LD score regression, which regresses the chi-squared association statistics from a disease GWAS on the LD scores of SNPs with respect to different functional categories. The method is based on the idea that an average chi-squared greater than one does not imply confounding, and it relies on the average LD score across SNPs.

  • 00:25:00 In this section, the speaker explains the distinction between tagging signal and biologically causal signal for SNPs and their LD (linkage disequilibrium) scores. They discuss how stratified LD score regression can separate confounding from true polygenic signal, with the regression intercept, rather than the average chi-squared alone, reflecting confounding. They also touch on how genomic LD varies with population and SNP frequency. The speaker then presents real data in the form of a schizophrenia data set to further illustrate the method.

  • 00:30:00 In this section of the lecture, a regression equation is introduced to estimate SNP heritability using LD scores (written out below). The intercept of the regression reflects confounding, while the slope reflects the relationship between the chi-squared statistic and the LD score. This slope can be used to estimate SNP heritability, and the respective slopes of a multiple linear regression tell us about the causal SNP heritability of different functional categories. The enrichment quantity measures the percentage of SNP heritability explained by a specific functional category versus the percentage of SNPs that are part of that category. The functional interpretation of the slope depends on whether the functional categories are overlapping or not.
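Written out, the standard LD score regression and its stratified extension take the form (with N the sample size, M the number of SNPs, l_j the LD score of SNP j, a a confounding term, and tau_C the per-SNP heritability contributed by category C):

```latex
\mathbb{E}\left[\chi_j^2\right] = 1 + N a + \frac{N h_g^2}{M}\,\ell_j,
\qquad
\mathbb{E}\left[\chi_j^2\right] = 1 + N a + N \sum_{C} \tau_C \, \ell(j, C)
```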

  • 00:35:00 In this section, the speaker discusses stratified LD score regression, which is used to evaluate enrichment in various functional annotations. The method is applied to coding SNPs, enhancers, histone markers, and more. The speaker notes that the method produces unbiased estimates if the causal categories are included in the model, but becomes biased if the causal categories are not in the model. However, even if a few categories are missing, the model can still provide enough richness to produce close-to-unbiased estimates for the remaining categories. The speaker emphasizes that individual level data methods are not currently designed to run on a large number of overlapping or continuous-valued functional categories.

  • 00:40:00 In this section, the speaker explains that the model assumptions can be violated if one is not careful, citing an example with top eQTLs in gene expression data that do not satisfy the fundamental model assumption. The speaker then moves on to discuss applications of the method to real chromatin and gene expression data. Using publicly available summary statistics for 17 traits, the speaker found that coding SNPs are enriched for diseases and complex traits, especially for autoimmune diseases and height, while SNPs conserved across 29 mammals were also found to have a substantial impact on disease. Additionally, FANTOM5 enhancers were found to have a significant enrichment for autoimmune diseases. The discussion then turns to interpreting these results in relation to how certain traits may have a higher or lower coupling with reproductive fitness.

  • 00:45:00 In this section, the lecturer explains that certain functional categories are enriched for heritability not because of larger causal effect sizes: common SNPs have a soft upper bound on effect sizes due to negative selection, so enrichment is more about the number of SNPs in the functional category that do something, each having small to medium causal effect sizes. The lecturer also discusses the importance of cell type-specific regulatory annotations in inferring disease-critical cell types. Brain regulatory annotations are most enriched for schizophrenia, bone and connective tissue regulatory annotations are most enriched for height, and immune cell types are most enriched for rheumatoid arthritis. A genome-wide polygenic approach can yield greater biological insights for highly polygenic traits than traditional approaches that focus on genome-wide significant SNPs, which may be very few in number for these traits.

  • 00:50:00 In this section of the lecture, the speaker discusses using gene expression data to study specific genes related to certain diseases, including schizophrenia and rheumatoid arthritis. They also mention the concept of LD-dependent architectures, where the size of causal effects depends on the level of LD, with SNPs at lower levels of LD having larger causal effect sizes across 56 different traits. The speaker mentions the complexity of these findings, which relate to negative selection, but runs out of time to discuss single-cell RNA sequencing data and disease-critical cell types. They then introduce Professor Manuel Rivas from Stanford University, who discusses the process of combining electronic health records, genomic association studies, and genomics to analyze a population-based UK Biobank dataset of around 500,000 individuals with thousands of phenotypes.

  • 00:55:00 In this section, the speaker discusses an approach called the decomposition of genetic associations, which involves disentangling many-to-many mappings into fewer components to represent genetic association studies. The speaker used a truncated singular value decomposition to represent a matrix of summary-level data for thousands of traits and genetic variants, resulting in a low-rank representation of about 100 components, each of which is a product of orthogonal elements in three matrices (a sketch follows). The first two components were characterized by anthropometric phenotypes, and the speaker projected how each variant loads onto the two components to see how they affect different phenotypes.
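A hedged sketch of the decomposition step, using scikit-learn's TruncatedSVD on a random stand-in for the traits-by-variants summary-statistic matrix (sizes and contents are invented):

```python
# Truncated SVD of a (traits x variants) matrix into ~100 components.
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
Z = rng.normal(size=(2000, 5000))      # hypothetical traits x variants matrix

svd = TruncatedSVD(n_components=100, random_state=0)
trait_loadings = svd.fit_transform(Z)        # how each trait loads on components
variant_loadings = svd.components_           # how each variant loads on components
print(trait_loadings.shape, variant_loadings.shape)   # (2000, 100), (100, 5000)
```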

  • 01:00:00 In this section, the speaker explains how the composition and contribution components for a given trait can be quantified, such as Body Mass Index (BMI), which is made up of a fat component and a fat-free mass component; the genetic risk of BMI is likewise contributed to by a fat component, among other components. The speaker explains that they are interested in identifying genetic variants that may contribute to adipogenesis or lipolysis effects, rather than only having a fat-free-mass effect on body mass index, by studying specific protein-truncating variants (PTVs) and identifying strong effect sizes. Through this process, the speaker identifies the gene PDE3B, with a high cholesterol and fat-free mass contribution to BMI, and GPR151, which has functional consequences for adipogenesis. The genetic associations for 2,000 phenotypes are available online through the Biobank Engine portal (biobankengine.stanford.edu), with the idea that it becomes a search portal for anybody to search their favorite gene, variant, or phenotype and browse the set of associations available across different popular biobanks.

  • 01:05:00 In this section, the speaker discusses the identification of mutations that have strong effects on gene function and lower the risk for disease, which can lead to new therapeutic hypotheses and guide the selection of targets for drug discovery. They explain the process of identifying specific genetic variants with strong effects on gene function and phenotype by combining summary level data from multiple biobanks. By estimating genetic parameters such as heritability of polygenicity and the correlation of genetic effects, they aim to visualize the relationship between genetics and traits/diseases to improve inference and guide therapeutic development. Examples of strong effect mutations and their effects on protection against diseases such as asthma and type 1 diabetes are also provided.

  • 01:10:00 In this section, the presenter discusses the application of genetic data in risk prediction models. Humans have a large number of genetic variants linked to hundreds of phenotypes, so one approach to exploring these links is fitting millions of univariate models. However, this approach has weak predictive properties because of the correlation among genetic variants, which makes it hard to distinguish the relevant variant from the others. Therefore, a multivariate model is developed by fitting large regression models with millions of variables, using a software package the group developed for this purpose. The model uses the Lasso, a penalized regression framework that performs variable selection to improve predictive performance (a toy fit is sketched below).
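A toy sketch of the penalized-regression idea, with simulated genotype dosages and a sparse set of true effects (illustrative only, not the group's actual software):

```python
# Fit a Lasso across many correlated variants at once so that only a sparse
# subset receives non-zero weights; those weights define a polygenic score.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_people, n_snps = 1000, 5000
G = rng.binomial(2, 0.3, size=(n_people, n_snps)).astype(float)   # dosages 0/1/2
true_beta = np.zeros(n_snps)
true_beta[rng.choice(n_snps, 50, replace=False)] = rng.normal(size=50) * 0.3
y = G @ true_beta + rng.normal(size=n_people)

prs_model = Lasso(alpha=0.05).fit(G, y)
print("non-zero weights:", np.sum(prs_model.coef_ != 0))
```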

  • 01:15:00 In this section, the speaker discusses the application of polygenic risk score models to 35 biomarkers, including cardiovascular, renal, and liver biomarkers. The study used a 70/10/20 train/validation/test split to evaluate model performance. Performance was measured in different populations, and the results showed limitations in transferring these predictive models from the population used to fit them to other populations. The study demonstrated that the correlation structure among genetic variants varies across populations, which impacts predictive performance. Moreover, different sets of genetic variants may explain the heritability of the phenotype in different populations, so a model transferred from one population may not work as well in another once the correlation structure among variants breaks down. This calls for data sharing across different populations to improve predictive accuracy.

  • 01:20:00 In this section, the speaker explains that when studying genetic variants in different populations, the absence of certain variants in non-European populations can contribute to heterogeneity in effect sizes. However, when a variant is present across multiple populations, the effect sizes tend to be more homogeneous. The example of lipoprotein(a) is given, with the explanation that genetic variants contributing to its variance in the European population do not exist in the African population, leading to poor predictive performance in African populations. The speaker also expresses willingness to supervise students interested in research projects related to UK Biobank polygenic scores and pleiotropic effects.
Systems Genetics - Lecture 14 - Deep Learning in Life Sciences (Spring 2021)
  • 2021.04.08
  • www.youtube.com
MIT 6.874/6.802/20.390/20.490/HST.506 Spring 2021. Prof. Manolis Kellis. Deep Learning in the Life Sciences / Computational Systems Biology.
 

Graph Neural Networks - Lecture 15



Graph Neural Networks - Lecture 15 - Deep Learning in Life Sciences (Spring 2021)

In this YouTube lecture on Graph Neural Networks, the speaker covers a wide range of topics, including the basics of graph networks, spectral representations, semi-supervised classification, and multi-relational data modeling. There is also a focus on the intersection of graph networks and natural language processing and how to generate graphs for drug discovery. The lecturer explains various methods to propagate information across graphs to obtain useful node embeddings that can be used for prediction tasks. The lecture also highlights the importance of contrastive learning for GNNs, the potential benefits of combining patch-based representations and attention-based methods, and the use of the transformer approach in NLP. The latter half of the lecture focuses on discussing papers that showcase the practical uses of GNNs in drug discovery and how to encode and decode the structure of molecules using a junction tree.

This video discusses multiple applications of graph neural networks (GNNs) in life sciences, including drug discovery and latent graph inference. The speaker highlights the issues and potential avenues in GNNs, such as the lack of spatial locality and fixed ordering, and the setup considered involves predicting the type of a given node, predicting a link between two nodes, measuring similarity between two nodes or two networks, and clustering nodes by performing community detection in the network. The lecturer also explains how GNNs can efficiently train and embed graphs, transform and aggregate information, and deal with polypharmacy side effects. Additionally, the lecture covers two methods for automatically learning representations in life sciences, with meta-learning models like MARS being leveraged to generalize to novel cell types. Lastly, the lecture discusses how GNNs can learn latent cell representations across multiple datasets to capture cell type heterogeneity.

  • 00:00:00 In this section, the speaker introduces the fourth module on graphs and proteins and the upcoming lectures on graph neural networks, protein structure, and drug design. The speaker emphasizes the importance of reviewing the material through homework, recitations, and papers to prepare for an upcoming in-class quiz. The goal is not to trick or surprise students, but to help them embrace the field and gain a deep understanding of it. The speaker also informs students of an upcoming lecture by the AlphaFold team on protein folding, which is a revolutionary advancement in the field.

  • 00:05:00 In this section, the lecturer introduces the concept of networks and how they are pervasive in various aspects of society, including biological networks. The biological networks include regulatory networks, signaling networks, and metabolic networks operating at different levels of the cell. There is a need for network analysis methods to understand the properties of these networks that interact with each other. Also, there is a mention of probabilistic networks that use nodes and edges to represent probabilistic objects. The matrix representations of these networks allow for decomposing them, learning communities, and identifying modules through linear algebra approaches.

  • 00:10:00 In this section of the lecture, the speaker provides an overview of the extensive body of work on network analysis and its spectral representations. The methods discussed include identifying separability of components using maximal cuts through networks based on the first and second eigenvalues of the Laplacian matrix, as well as the use of diffusion kernels to understand the flow of information between different edges. The speaker emphasizes the importance of not forgetting about this established literature as it can be used in combination with deep learning methods such as graph neural networks that will be discussed in the lecture. The speaker then introduces the guest lecturer, Neil Band, who will provide a refresher on graph neural networks and discuss problem domains such as semi-supervised learning, multi-relational data, and natural language processing.

  • 00:15:00 In this section, we learn how to effectively propagate information over graphs to compute node features across one or many graphs and perform downstream operations using graph convolutional networks. The network aggregates feature information from a node's neighbors and uses it to update that node's representation. The end goal of GNNs is to produce one embedding vector which can be used to predict an entire graph's property or to predict the type of each individual node. The update rule is based on propagating information from the node's hidden representation together with updates received from the immediate neighborhood. Additionally, to reduce the number of the model's parameters, the same shared weight matrices are applied to all of the neighbors instead of applying different ones.
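As a concrete illustration of this update rule, here is a minimal NumPy sketch of one graph-convolution layer (the function names and the toy graph are illustrative, not the lecture's code): neighbor features are averaged, a single shared weight matrix is applied, and a graph-level vector is obtained by pooling the node embeddings.

    import numpy as np

    def gcn_layer(H, A, W, b):
        """One graph-convolution update: each node averages its neighbors'
        hidden vectors (plus its own, via self-loops), applies a single
        shared weight matrix W, then a nonlinearity."""
        N = A.shape[0]
        A_hat = A + np.eye(N)                       # add self-loops
        deg = A_hat.sum(axis=1, keepdims=True)      # node degrees
        H_agg = (A_hat @ H) / deg                   # mean over the neighborhood
        return np.maximum(0.0, H_agg @ W + b)       # shared weights + ReLU

    # Toy graph: 4 nodes with 3-dimensional input features.
    rng = np.random.default_rng(0)
    A = np.array([[0, 1, 1, 0],
                  [1, 0, 0, 1],
                  [1, 0, 0, 1],
                  [0, 1, 1, 0]], dtype=float)
    H0 = rng.normal(size=(4, 3))
    W1, b1 = rng.normal(size=(3, 8)), np.zeros(8)
    H1 = gcn_layer(H0, A, W1, b1)                   # new 8-d embedding per node
    graph_embedding = H1.mean(axis=0)               # pooled vector for whole-graph tasks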

  • 00:20:00 In this section, the lecturer describes the process of using graph neural networks to perform a classification task on citation networks with papers as nodes and citation links as edges. The two-layer graph convolutional network is applied, which involves updating each node in the graph to absorb information from its immediate neighborhood and then obtaining the outputs. The lecturer mentions the potential drawback of over-smoothing with deep networks and suggests using gated recurrent units to preserve memory of the initial state. Additionally, the lecturer discusses the possibility of combining attention-based methods and patch-based representations to learn higher order representations in graph neural networks.

  • 00:25:00 In this section, the lecturer discusses different paradigms in graph neural networks, including graph convolutional networks, attentional updates, and message passing techniques. They highlight the potential memory issues that arise when graphs become too dense in message passing, but emphasize that these paradigms are useful for different types of learning tasks. They then dive into semi-supervised classification on graphs, in which the transductive setting can allow models to learn quickly, even without explicit node features. Lastly, the lecturer touches on relational graph convolutional networks, which can be used for modeling multi-relational data, such as in natural language processing.

  • 00:30:00 In this section, the lecturer discusses the connection between graph networks and natural language processing, particularly the use of the transformer model in NLP. The transformer model is commonly used for tasks like language translation and learning a general conceptual understanding of words. The transformer approach starts from a fully connected graph, unlike biological networks where many edges are missing, and uses self-attention to update node embeddings before outputting an updated version. While the transformer approach may not necessarily benefit biological networks, there is potential for cross-pollination of strategies and optimization between the two fields.

  • 00:35:00 In this section, we learn how a word embedding update is performed for a two-word sentence, with each word attending to all other words via an attention lookup. Graph attention networks use this same mechanism, except that attention is restricted to a node's neighborhood in the graph, whereas transformers attend over everything and rely on positional embeddings. The speaker explains how to incorporate graph connectivity information into the architecture and how to mask out portions of the graph so that attention only uses words that have previously been mentioned. There are many opportunities to cross-apply these methods.
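The masking idea can be sketched directly: below is a small, illustrative NumPy example (not the lecture's code) of a single attention update in which scores for non-edges are set to a large negative value before the softmax, so each node only mixes information from its graph neighbors and itself.

    import numpy as np

    def masked_attention_update(H, A, Wq, Wk, Wv):
        """Self-attention over node embeddings, with attention restricted
        to graph neighbors (plus self) by masking non-edges."""
        Q, K, V = H @ Wq, H @ Wk, H @ Wv
        scores = (Q @ K.T) / np.sqrt(K.shape[1])      # pairwise attention logits
        mask = (A + np.eye(A.shape[0])) > 0           # allowed: neighbors and self
        scores = np.where(mask, scores, -1e9)         # mask out missing edges
        weights = np.exp(scores - scores.max(axis=1, keepdims=True))
        weights /= weights.sum(axis=1, keepdims=True) # row-wise softmax
        return weights @ V                            # attention-weighted neighbor mix

    rng = np.random.default_rng(1)
    A = np.array([[0, 1, 0],
                  [1, 0, 1],
                  [0, 1, 0]], dtype=float)
    H = rng.normal(size=(3, 4))
    Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
    H_new = masked_attention_update(H, A, Wq, Wk, Wv)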

  • 00:40:00 In this section, the lecturer discusses the unsupervised learning setting of learning node embeddings for downstream tasks, such as node classification or graph classification. To help the networks learn well-specified representations, the lecturer explains the concept of data augmentation and describes how it is used in contrastive learning approaches. The lecture also covers design parameters, such as sampling strategies, different types of node representations, and different types of scoring functions. One approach is to use the scoring function to maximize the mutual information between the local and global representations of a particular class. This encourages the network to pull out class-related information from different subsets of information from the graph, leading to more robust node embeddings and better downstream performance.
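One way to make the local-global mutual-information idea concrete is a Deep-Graph-Infomax-style contrastive objective; the sketch below is an illustrative NumPy stand-in under simplified assumptions (a bilinear discriminator and a row-shuffling corruption), not the lecture's exact method: real node embeddings are scored against a pooled graph summary as positives, embeddings from a corrupted graph as negatives.

    import numpy as np

    def bilinear_score(h, s, W):
        """Discriminator score for a (node embedding, graph summary) pair."""
        return h @ W @ s

    def contrastive_loss(H_pos, H_neg, W):
        """Pull real node embeddings toward the global summary vector and
        push embeddings from a corrupted graph away (binary cross-entropy)."""
        s = np.tanh(H_pos.mean(axis=0))               # readout: global graph summary
        pos = np.array([bilinear_score(h, s, W) for h in H_pos])
        neg = np.array([bilinear_score(h, s, W) for h in H_neg])
        def softplus(x):                              # numerically stable log(1 + e^x)
            return np.log1p(np.exp(-np.abs(x))) + np.maximum(x, 0.0)
        return (softplus(-pos).mean() + softplus(neg).mean()) / 2.0

    rng = np.random.default_rng(2)
    H_real = rng.normal(size=(5, 8))                  # embeddings from the true graph
    H_corrupt = rng.permutation(H_real)               # simplified corruption: shuffled rows
    W = rng.normal(size=(8, 8))
    loss = contrastive_loss(H_real, H_corrupt, W)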

  • 00:45:00 In this section, the speaker discusses the dimensionality of node embeddings in graph neural networks (GNNs) and the use of contrastive learning for GNNs. The speaker explains that in practice, the properties of nodes in GNNs could live in a high-dimensional space, such as 256 or 512 dimensions for a single node in a large graph. The speaker also notes that contrastive learning, which involves using positive and negative examples to encode the graph structure, could be used instead of classification to improve the encoding of graph structure. Finally, the speaker summarizes the takeaways of design decisions in GNNs, highlighting the effectiveness of neighbor-based scoring for link prediction and node classification and the importance of considering both the features of nodes and the structure of the graph when choosing the type of node representation.

  • 00:50:00 In this section, the speaker discusses two ways to generate a graph, the first of which is predicting new links between known entities using a standard graph neural network or graph convolutional network as an encoder and a function of the embeddings as a decoder. The probability of any given edge's existence is based on the nodes incident to it and is independent of all other edges. The second way generates a graph from a single embedding vector for the entire graph, using one particular state, which is decoded using a Graph RNN that makes a set of predictions when adding each node. This method attempts to introduce as few inductive biases as possible about how to generate a graph. The latter approach is used for drug discovery, specifically in the paper on the Junction Tree Variational Autoencoder, to generate de novo molecules with high potency, regardless of whether they have been synthesized or characterized previously.
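For the first, link-prediction route, the simplest decoder is an inner product of the two incident node embeddings passed through a sigmoid; the NumPy sketch below (illustrative names, not the lecture's code) shows how every candidate edge gets an independent probability from the encoder's embeddings.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def edge_probabilities(Z):
        """Inner-product decoder: the probability of an edge (i, j) depends
        only on the embeddings of its two incident nodes, independently of
        all other edges."""
        return sigmoid(Z @ Z.T)

    rng = np.random.default_rng(3)
    Z = rng.normal(size=(6, 16))          # node embeddings from a GNN encoder
    P = edge_probabilities(Z)             # P[i, j] = predicted probability of edge i-j
    candidate = P[0, 5]                   # score one candidate link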

  • 00:55:00 In this section, the paper's approach to encoding and decoding the structure of molecules using graph neural networks is described. The approach utilizes a fine-grained molecular graph to encode a state and a tree decomposition to decode the higher-level structure of the graph. By using a junction tree to remove cycles in the graph, the authors are able to simplify the decoding process and predict only a node's label and whether or not to add a child node, resulting in a valid higher-level structure of the molecule. The authors use a gated recurrent unit to involve all of the state of the subtree that has been built thus far and achieve a high percentage of reconstruction in terms of molecular validity. Bayesian optimization is used to evaluate the navigability of the latent space for generating novel drugs.

  • 01:00:00 In this section, the speaker discusses two applications of graph neural networks (GNN) in the life sciences. The first application is in the field of drug discovery, where the GNN is used to infer the latent variable of a molecule and predict its chemical property. The model is trained using an encoder-decoder framework and optimized using Bayesian optimization. The second application is latent graph inference, where GNNs are used to model hidden structures in a problem by encoding the set of dynamics that occur over time. The model can be used to predict future outcomes and can be applied to causal discovery. The speaker presents toy data as well as real-world motion capture data to show the effectiveness of GNNs in these applications.

  • 01:05:00 In this section, the speaker discusses the issues and potential avenues in graph neural networks. A few problems were mentioned, including the bounded expressive power of message passing and neighborhood aggregation and its theoretical relation to graph isomorphism tests, the difficulty that tree-structured computation graphs have in detecting cycles, and the issue of over-smoothing. However, the speaker also sees promise in scaling these networks, learning on large data sets, and trying out multimodal and cross-modal learning between sequences and graphs. Following this, a postdoc from Stanford University discusses deep learning in biological networks, and how, for data represented as a graph, more broadly applicable deep neural network frameworks are needed. While deep learning has transformed the machine learning life cycle, it remains unclear how to apply it to complex data represented as a graph.

  • 01:10:00 In this section, the complexities of learning on graph data are discussed, including the lack of spatial locality and fixed ordering, the absence of reference points, and the dynamic nature of graphs. The goal of representation learning on graphs is to learn a mapping function that takes the graph as input and maps the nodes to a low-dimensional embedding space. Efficient, task-independent feature learning is a crucial goal of this process for machine learning on networks. The setup considered assumes a graph with an adjacency matrix and node features associated with each node, from which the goals are to predict the type of a given node, predict a link between two nodes, measure similarity between two nodes or two networks, and cluster nodes by performing community detection in the network. The most naive approach of applying deep neural networks to graphs is presented, but its limitations are highlighted, including a parameter count that grows with the number of nodes, unstable training, and an increased likelihood of overfitting.
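To see why the naive approach breaks down, here is a small illustrative NumPy sketch (an assumption for exposition, not the lecture's code) that feeds adjacency-matrix rows into an ordinary MLP: the first weight matrix is tied to the number of nodes, the model cannot transfer to graphs of another size, and relabeling the nodes changes the predictions.

    import numpy as np

    # Naive baseline: treat each node's adjacency row as a fixed-length input
    # to a standard MLP. The first weight matrix has shape (N, hidden), so the
    # parameter count grows with the number of nodes, the model cannot be
    # applied to a graph of a different size, and permuting the node order
    # changes the input vectors (no permutation invariance).
    rng = np.random.default_rng(4)
    N, hidden = 5, 16
    A = (rng.random((N, N)) < 0.4).astype(float)
    A = np.triu(A, 1); A = A + A.T                    # symmetric, no self-loops

    W1 = rng.normal(size=(N, hidden))                 # tied to N: does not transfer
    W2 = rng.normal(size=(hidden, 1))
    scores = np.maximum(0.0, A @ W1) @ W2             # one score per node

    perm = rng.permutation(N)                         # relabel the same graph
    scores_perm = np.maximum(0.0, A[perm][:, perm] @ W1) @ W2
    # scores_perm[i] generally differs from scores[perm][i]: predictions depend on node order.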

  • 01:15:00 In this section, the speaker explains how graph neural networks can efficiently train and embed graphs using ideas borrowed from convolutional neural networks. The neighborhood of a node defines the structure of the neural network, and the key idea is to generate node embeddings based on the local network neighborhood. The speaker illustrates this concept by showing how to aggregate and transform information to produce message transformation and aggregation operators, which are permutation invariant. These operators can be learned to transform node information and predict the property of interest.

  • 01:20:00 In this section, the speaker explains the transformation and aggregation process of graph neural networks. The basic approach is to average information from the nodes and apply neural networks for linear transformations followed by nonlinearity. The speaker presents the example of the GraphSAGE algorithm, where a generalized aggregator function is introduced to combine the features of a node's local neighborhood. Differentiable aggregation functions, such as mean, pooling or LSTM cells, can be used to aggregate information across the neighbors. The speaker also discusses the use of graph neural networks in biology and how they can be used to predict certain behaviors or outcomes.
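A minimal NumPy sketch of the GraphSAGE-style mean aggregator described here (the weight shapes and toy neighbor lists are illustrative assumptions): neighbor features are averaged, combined with the node's own features through separate learned transformations, and the result is normalized.

    import numpy as np

    def graphsage_mean_layer(H, neighbors, W_self, W_neigh):
        """GraphSAGE-style update with a mean aggregator: average the
        neighbors' features, transform them and the node's own features
        with separate weight matrices, then apply a nonlinearity."""
        H_new = []
        for v, nbrs in enumerate(neighbors):
            agg = H[nbrs].mean(axis=0) if nbrs else np.zeros(H.shape[1])
            h = H[v] @ W_self + agg @ W_neigh
            H_new.append(np.maximum(0.0, h))          # ReLU
        H_new = np.array(H_new)
        # L2-normalize each embedding, as in the original GraphSAGE recipe.
        return H_new / (np.linalg.norm(H_new, axis=1, keepdims=True) + 1e-8)

    rng = np.random.default_rng(5)
    H = rng.normal(size=(4, 3))                       # input node features
    neighbors = [[1, 2], [0, 3], [0, 3], [1, 2]]      # adjacency as neighbor lists
    W_self, W_neigh = rng.normal(size=(3, 8)), rng.normal(size=(3, 8))
    H1 = graphsage_mean_layer(H, neighbors, W_self, W_neigh)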

  • 01:25:00 In this section, the lecturer discusses the concept of polypharmacy side effects, which are side effects resulting from the combination of drugs. The lecturer explains that the goal is to estimate the likelihood of side effects from the combination of two drugs by modeling them as nodes in a heterogeneous network. The lecturer shows an example of how drugs and proteins can be modeled in a network to capture the mechanisms of action of drugs and the underlying biological mechanisms. The lecturer then explains how Graph Neural Networks (GNNs) can be extended to embed heterogeneous networks, where the neighborhood needs to be separated by an edge type, and how to transform and propagate information across the graph defined by nodes' network neighborhood in each edge type.
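The per-edge-type aggregation can be sketched as follows (an illustrative NumPy stand-in under simplified assumptions, not the actual Decagon implementation): each edge type gets its own adjacency matrix and weight matrix, neighborhood messages are computed separately per type, and the contributions are summed before the nonlinearity.

    import numpy as np

    def relational_gnn_layer(H, adj_by_type, W_by_type, W_self):
        """Relational update: aggregate neighbors separately for each edge
        type (e.g. drug-drug, drug-protein) with a type-specific weight
        matrix, then sum the contributions with a self-transformation."""
        out = H @ W_self
        for etype, A in adj_by_type.items():
            deg = np.maximum(A.sum(axis=1, keepdims=True), 1.0)
            out = out + (A @ H / deg) @ W_by_type[etype]
        return np.maximum(0.0, out)

    rng = np.random.default_rng(6)
    n, d = 5, 4
    H = rng.normal(size=(n, d))
    adj_by_type = {                                    # toy heterogeneous graph
        "drug-drug":    (rng.random((n, n)) < 0.3).astype(float),
        "drug-protein": (rng.random((n, n)) < 0.3).astype(float),
    }
    W_by_type = {k: rng.normal(size=(d, d)) for k in adj_by_type}
    W_self = rng.normal(size=(d, d))
    H1 = relational_gnn_layer(H, adj_by_type, W_by_type, W_self)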

  • 01:30:00 In this section, the lecturer discusses two methods for automatically learning representations in life sciences. The first method is based on relational graph neural networks, which can be used to predict whether two drugs will result in side effects by learning d-dimensional vector embeddings for each node in the graph. The second method is a meta learning model called MARS, which leverages prior knowledge from previously annotated data to generalize to novel, never-before-seen cell types. By optimizing over the unannotated experiment and the metadata set, MARS can automatically annotate cells to cell types and avoid the tedious manual effort of annotating cells based on their gene expression profiles.

  • 01:35:00 In this section of the lecture, the speaker discusses using graph neural networks to learn latent cell representations across multiple datasets in order to capture the heterogeneity of cell types. The approach involves joint projection of cells from annotated and unannotated experiments in a low dimensional embedding space, where similar cell types are embedded close and different cell types are embedded far away. To achieve this, the method learns cell type landmarks as cell type representatives and a nonlinear mapping function using deep neural networks. The approach is validated on a large-scale mouse cell atlas data with over 100,000 cells from more than 20 tissues, and it achieves 45% better performance than existing methods in terms of Adjusted Rand Index.
Graph Neural Networks - Lecture 15 - Deep Learning in Life Sciences (Spring 2021)
  • 2021.04.19
  • www.youtube.com
MIT 6.874/6.802/20.390/20.490/HST.506 Spring 2021. Prof. Manolis Kellis. Guest lecturers: Neil Band, Maria Brbic / Jure Leskovec. Deep Learning in the Life Scienc...
 

AI for Drug Design - Lecture 16


AI for Drug Design - Lecture 16 - Deep Learning in the Life Sciences (Spring 2021)

This lecture discusses the use of deep learning for drug design. It explains how deep learning can be used to find novel antibiotic compounds that are effective against resistant bacteria. It also discusses how the deep learning models can be improved by incorporating biological knowledge.

This second part of the lecture provides an overview of how deep learning can be used in drug design, specifically for predicting the antiviral activity of drug combinations. The model's predictions were tested experimentally using cell-based assays, and two novel synergistic drug combinations were identified.

  • 00:00:00 The speaker will introduce deep learning for drug design and its challenges. He will discuss the functional space and chemical space, and explain how deep learning can be used to find drugs automatically.

  • 00:05:00 The three approaches to drug design are based on first principles, simulation, and virtual screening. The first two are good for finding compounds with specific properties, but the last is more ambitious and tries to find the right compound by looking at several properties that are largely independent of each other. Simulation is often too slow, and virtual screening is expensive. De novo drug design is the most ambitious approach and tries to solve the inverse problem of finding a compound that satisfies a set of criteria.

  • 00:10:00 In this lecture, the speaker discusses two methods for drug discovery, virtual screening and de novo drug design. Both methods have their own advantages and disadvantages, with virtual screening being faster and cheaper but having less coverage than traditional methods, while de novo drug design is slower but can find more novel compounds. Genetic algorithms are an effective way to explore the chemical space, but there is still room for improvement in the algorithms for this task.

  • 00:15:00 In this lecture, the professor explains how deep learning is being used in drug design, and how it can be more efficient than traditional techniques. He also mentions the "DALL-E" paper, which shows how deep learning can be used to generate realistic images of objects.

  • 00:20:00 In this lecture, the professor discusses the deep learning techniques used in drug discovery, and gives examples of how these techniques have helped researchers to find new antibiotics.

  • 00:25:00 Graph neural networks are a type of artificial intelligence that are used to search for new compounds that can kill bacteria. The goal of using this type of AI is to find compounds that are not discovered by traditional methods, as these methods can miss unknown antibacterial patterns.

  • 00:30:00 This lecture discusses how deep learning can be used to identify patterns in data related to antibiotic resistance. The model is able to predict whether a molecule will be effective against bacteria, with an AUC of around 0.9.

  • 00:35:00 The video discusses how existing antibiotics are no longer effective against some bacterial strains, and how a new compound, called "halicin," is both novel and effective against these strains. It also discusses how the compound is effective against infections in mice.

  • 00:40:00 The video discusses the success of deep learning models over traditional methods in discovering new antibiotic compounds. The video also shows how a traditional, hand-designed approach is not able to discover certain antibiotic compounds. The deep learning models are able to capture different parts of chemical space, and the new compounds are highly ranked by the models.

  • 00:45:00 The speaker describes deep learning models used for drug design and explains how the models can be improved by incorporating biological knowledge. He presents a case study of a drug combination that was found to be more effective than a single drug.

  • 00:50:00 The video discusses AI for drug design, with particular focus on the use of deep learning to identify synergistic compounds. The goal is to find drugs that are synergistic and less toxic, and to incorporate knowledge of the viral replication cycle into the model.

  • 00:55:00 The lecture discusses deep learning methods for drug design, focusing on how they can be used to predict the antiviral activity of a drug against a variety of targets. The first step is to predict drug-target interactions, using data sets from ChEMBL and from the National Institutes of Health. Then, a neural network is used to learn a representation of the molecule's structure, which is needed for the second step of the drug design process: predicting the antiviral activity of the drug against a variety of targets. By using a combination of deep learning and matrix completion, the potential for improving drug design is highlighted.
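Matrix completion itself can be illustrated with a simple low-rank factorization fitted to the observed drug-target entries; the NumPy sketch below is a generic stand-in under illustrative assumptions (random toy data, plain gradient descent), not the model used in the lecture.

    import numpy as np

    def complete_matrix(M, observed, rank=4, lr=0.01, steps=2000, seed=0):
        """Low-rank matrix completion: factor the drug x target interaction
        matrix as U @ V.T and fit only the observed entries by gradient
        descent; unobserved entries are then read off the reconstruction."""
        rng = np.random.default_rng(seed)
        n_drugs, n_targets = M.shape
        U = 0.1 * rng.normal(size=(n_drugs, rank))
        V = 0.1 * rng.normal(size=(n_targets, rank))
        for _ in range(steps):
            R = (U @ V.T - M) * observed              # error on observed entries only
            U -= lr * (R @ V)
            V -= lr * (R.T @ U)
        return U @ V.T                                # filled-in interaction scores

    # Toy example: 6 drugs x 5 targets with roughly 60% of entries measured.
    rng = np.random.default_rng(7)
    true = rng.random((6, 2)) @ rng.random((2, 5))    # hidden low-rank ground truth
    observed = (rng.random(true.shape) < 0.6).astype(float)
    M = true * observed
    pred = complete_matrix(M, observed)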

  • 01:00:00 This lecture discusses how deep learning can be used in drug design, specifically for predicting the antiviral activity of drug combinations. The model's predictions were tested experimentally using cell-based assays, and two novel synergistic drug combinations were identified.

  • 01:05:00 This lecture focuses on Deep Learning in the Life Sciences and its importance for drug design. The lecture covers two prior approaches to drug design, one using sequences and the other using recurrent neural networks. The lecture notes that the SMILES string representation of molecules is quite brittle and that these techniques have poor performance when applied to drug discovery. The lecture notes that a better way to represent molecules is with graphs, which can be generated efficiently with recurrent neural networks.

  • 01:10:00 The lecture discusses deep learning in the life sciences, specifically as it pertains to drug design. The lecture notes that deep learning can be used to generate molecules, but that it has problems with sparse molecules and low-treewidth motifs. A recurrent neural network was proposed as a solution, and it was found to be more successful with molecules that have low-treewidth motifs.

  • 01:15:00 This lecture discusses deep learning in the life sciences, focusing on a deep learning autoencoder that can encode molecules into a low dimensional vector. This reduces the number of motifs that can be generated, as well as the time complexity of the process.

  • 01:20:00 In this lecture, the professor explains how deep learning can be used to improve the accuracy of molecule reconstruction in drug design. Motif-based generation models are advantageous because they allow for the capture of large cycles in molecules. The success rate of generation using a node-by-node approach is low due to the wrong representation of the sequence space. However, using a motif-by-motif approach improves the success rate significantly. This is because the model is able to learn to modify existing molecules to improve their drug-likeness.

  • 01:25:00 The speaker provides a brief overview of deep learning in the life sciences, highlighting the challenges and opportunities of each area. She finishes with a discussion of chemistry and drug design.

  • 01:30:00 In this lecture, the guest lecturer provides advice for students interested in pursuing projects in the field of artificial intelligence for drug design. They state that students can receive mentorship from them if desired.
 

Deep Learning for Protein Folding - Lecture 17



Deep Learning for Protein Folding - Lecture 17 - MIT Deep Learning in Life Sciences (Spring 2021)

This video discusses the use of deep learning in the field of protein folding, and specifically how geometric deep learning can be used to study protein structures and predict things such as ligand-binding sites and protein-protein interactions. The video also covers template-based vs. template-free modeling methods, various approaches for contact prediction in protein folding, and the use of residual neural networks for image modeling in protein structure prediction. Overall, the speaker emphasizes the promise of deep learning in advancing our understanding of protein structures and their functions, and provides detailed examples and results to back up this claim.

The video discusses various approaches to deep learning for protein folding, including the use of co-evolution predictions and templates for accurate modeling, the importance of finding better homologs, and the potential for deep learning to achieve comparable results without relying on traditional physics-based methods. The speakers also delve into the use of differentiable outputs and the importance of global accuracy, as well as the evolution of the algorithm space and the potential for deep learning to predict protein conformations based on factors such as genetic variation or small molecules. Overall, the video highlights the exciting potential for deep learning to revolutionize protein structure prediction and its many applications.

  • 00:00:00 In this section of the video, Bruno Correia introduces the concept of geometric deep learning and how it applies to the study of protein structures. He explains how deep learning has been successful in image classification, but that the data sets in biology are generally much richer and higher dimensional, with time and other dimensions, making geometric deep learning a valuable approach. Correia discusses the importance of protein structures for their functions, from mechanical and chemical functions to binding and recognition, and presents examples such as antibodies, ion pumps, and communication and rigidity proteins. He also addresses the question of whether the work of studying protein surfaces has been addressed by AlphaFold, explaining that AlphaFold has solved protein structure prediction but not specifically the study of protein surfaces.

  • 00:05:00 In this section, the speaker discusses the challenges of predicting protein function from its structure, which is important for understanding how proteins interact with each other and other metabolites in cells. The speaker presents various ways to represent protein structures, with a focus on surface representations that may have similar functions despite having dissimilar sequences and architectures. By analogy to studying people's faces, the speaker argues that studying patterns in protein surfaces can reveal important information about their functions. The speaker then introduces a deep learning approach for predicting protein ligand-binding sites using 3D molecular surface representations.

  • 00:10:00 In this section of the video, the speaker discusses the use of geometric deep learning for the problem of protein folding. They explain that the prototypical objects for geometric deep learning are graphs or surfaces, and their team used mesh representations of proteins to study them. They then explain the use of "patches," which are subsets of the mesh with several vector features at each node, and how local weights are assigned to them. The speaker describes the different types of features that were encoded at each node, including shape index, distance-dependent curvature, hydrophobicity, and electrostatic features. This information was then combined into a vector for further analysis.
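A heavily simplified NumPy sketch of patch-level pooling in this spirit (illustrative only; the MaSIF-style approach actually learns geodesic convolution kernels rather than the fixed Gaussian weighting assumed here):

    import numpy as np

    def patch_descriptor(features, dists, radius=9.0, sigma=3.0):
        """Toy patch pooling: weight each mesh-vertex feature vector by a
        Gaussian of its distance from the patch center, keep only vertices
        inside the patch radius, and average into one descriptor."""
        w = np.exp(-(dists ** 2) / (2.0 * sigma ** 2))
        w = np.where(dists <= radius, w, 0.0)          # restrict to the patch
        w = w / (w.sum() + 1e-8)
        return w @ features                            # weighted average of vertex features

    rng = np.random.default_rng(10)
    n_vertices = 200
    # Per-vertex features: e.g. shape index, curvature, hydrophobicity, electrostatics.
    features = rng.normal(size=(n_vertices, 5))
    dists = rng.uniform(0.0, 12.0, size=n_vertices)    # distance of each vertex to the center
    desc = patch_descriptor(features, dists)           # 5-d descriptor for this patch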

  • 00:15:00 In this section, the speaker discusses how the geometric deep learning approach can encode the surface of a molecule regardless of its sequence, allowing for the study of patterns of atoms and chemical properties. The speaker notes the potential applications of this approach, such as classifying protein pockets based on the features of particular ligands and predicting the docking configurations of two proteins using surface fingerprints. Ablation studies were conducted to understand which factors contribute more to predicting specificity, with chemistry and geometry both found to be important. Overall, the approach shows promise in advancing understanding of protein structures and their functions.

  • 00:20:00 In this section, the speaker describes a network called MaSIF-site that can predict which sites on a given protein surface are more likely to interact with other proteins. They also discuss a fingerprint scanning technique used for docking and the success rates of this approach compared to other docking programs. The speaker introduces the next generation of MaSIF, called dMaSIF, which uses a fully differentiable network to create a point cloud that describes the protein surface and to compute geometric and chemical features, including electrostatic properties. Finally, the speaker briefly mentions the exciting design aspect of the project and discusses an important target for controlling the activity of T cells in cancer treatment.

  • 00:25:00 In this section, the speaker discusses how they used deep learning to design molecules that target proteins. They used MaSIF to predict the site most prone to be targeted by designed molecules and extracted the target surface fingerprint. They then docked motifs into this site and predicted interactions with the protein of interest. The result was a new binding motif that was not previously known in nature and that matched experimental structures with a root mean square deviation of around one angstrom, indicating a high-affinity binder. The speaker offers to potentially advise students interested in exploring this area of research.

  • 00:30:00 In this section of the lecture, the speaker discusses the two main categories of protein structure prediction methods: template-based modeling and template-free modeling. While template-based modeling relies on using existing protein structures in the PDB database as templates to predict new structures, template-free modeling is a more recent approach that involves homology searching and machine learning to predict structures without relying on templates. The speaker focuses on the latter method and describes a newer approach that uses sequence homology searching, sequence profiling, and machine learning to predict protein structures without relying on templates, which has shown better accuracy for many proteins than template-based methods. The speaker also discusses the fragment assembly method, a popular approach used in the past.

  • 00:35:00 In this section of the lecture, the speaker discusses the pipeline used for template-free modeling in protein folding. The predicted information on the distance between any two atoms or residues in the protein is fed into an optimization engine to build the structure. The speaker also discusses different strategies for building multiple sequence alignments, including using a cutoff on the number of covered residues. The crucial component of this modeling is predicting the inter-residue interaction matrix, modeled using contact or distance measures. The speaker presents some effective ideas for contact prediction, which have made prediction much easier and more effective in recent years.

  • 00:40:00 In this section, the speaker discusses three different approaches for contact prediction in protein folding. The first approach is a global statistical method for co-evolution analysis, but it requires a large number of sequence homologs to be effective. The second approach uses deep convolutional residual neural networks to predict contacts and distances, and the third is a network for contact prediction that takes into account both sequence and structural information from the Protein Data Bank. The speaker also explains the challenges faced by previous supervised learning methods for contact prediction and how they can be improved by using more advanced machine learning models.

  • 00:45:00 In this section, the speaker discusses the limitations of previous contact prediction methods for protein folding, which only considered two residues at a time and therefore ignored larger relationships within the whole protein. To address these issues, the speaker proposes a new method which uses deep learning to predict all contacts in a protein simultaneously. This method is based on treating each residue pair as a pixel in an image, which allows the problem to be formulated as an image segmentation task. By using a fully convolutional residual neural network, the speaker shows that their method can significantly improve contact prediction precision and enable the folding of larger and harder proteins. Furthermore, the method works well for both single-chain and membrane proteins, and can be used for complex contact prediction without changing the model.
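The image-segmentation framing can be sketched as follows: pairwise residue features form a C x L x L tensor, residual blocks of 3x3 convolutions refine it, and a final convolution plus sigmoid gives an L x L map of contact probabilities. The NumPy code below is an illustrative toy version under simplified assumptions (random features, a single residual block), not the actual architecture from the lecture.

    import numpy as np

    def conv2d(x, W):
        """'Same'-padded 3x3 convolution. x: (C_in, L, L); W: (C_out, C_in, 3, 3)."""
        c_in, L, _ = x.shape
        c_out = W.shape[0]
        xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
        out = np.zeros((c_out, L, L))
        for o in range(c_out):
            for c in range(c_in):
                for di in range(3):
                    for dj in range(3):
                        out[o] += W[o, c, di, dj] * xp[c, di:di + L, dj:dj + L]
        return out

    def residual_block(x, W1, W2):
        """Two 3x3 convolutions with a skip connection, stacked many times
        in residual networks for contact-map prediction."""
        h = np.maximum(0.0, conv2d(x, W1))
        return np.maximum(0.0, conv2d(h, W2) + x)     # skip connection eases depth

    rng = np.random.default_rng(8)
    L, C = 40, 8                                      # toy protein length and channel count
    pairwise = rng.normal(size=(C, L, L))             # pixel (i, j) = features of residue pair
    W1 = 0.1 * rng.normal(size=(C, C, 3, 3))
    W2 = 0.1 * rng.normal(size=(C, C, 3, 3))
    h = residual_block(pairwise, W1, W2)
    W_out = 0.1 * rng.normal(size=(1, C, 3, 3))
    contact_prob = 1.0 / (1.0 + np.exp(-conv2d(h, W_out)[0]))   # L x L contact probabilities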

  • 00:50:00 In this section, the speaker discusses the use of residual neural networks to predict protein structure through image-style modeling with convolutional neural networks. They explain that using residual connections allows for much deeper networks, which leads to better precision without overfitting. The speaker shows results on the ranking and accuracy of their method compared to other methods, demonstrating the success of the deep learning approach. Precision has improved steadily over the past eight years and can now reach up to 80 percent.

  • 00:55:00 In this section, the speaker discusses the progress on contact prediction and distance prediction using deep learning models for protein folding. Contact precision has improved significantly, with a current precision of 80%, which is much more useful than before. The speaker explains the process of using a deep residual network for distance prediction and how it can significantly improve template-based modeling. The speaker also discusses the importance of co-evolution information and shows that even for certain proteins with few homologs, a good prediction can still be achieved without relying heavily on it. The results suggest that deep learning can generate new structures and that only a small number of sequence homologs are needed for accurate predictions.

  • 01:00:00 In this section, the speakers discuss the use of sequence and structure information to improve protein modeling. They explore the idea of using existing predictions as feedback into a training set to enhance co-evolution predictions and lead to better sequence-based predictors. They also discuss using template information and the importance of finding good templates for accurate modeling. Additionally, they question the role of physics in protein modeling and suggest that, while physics-based methods can help refine models, deep learning can also achieve comparable results without the use of physics.

  • 01:05:00 In this section, the video discusses how to model really large proteins without using templates. The example protein has over 13,000 residues, making it difficult to model accurately through traditional means. However, by combining different ensembling methods and utilizing the workflow of AlphaFold2, the protein is modeled with high accuracy. The video also notes that using a transformer requires a great deal of GPU power and memory, making it difficult for most people to use. However, the machine learning model is still feasible with a smaller set of training data. Additionally, finding better homologs to base the model on is a potential bottleneck that can be improved through further research. Finally, a progress chart is shown for 3D modeling of challenging targets, with higher scores indicating better quality of predicted models.

  • 01:10:00 In this section, Mohammed AlQuraishi talks about the evolution of the algorithm space for protein structure prediction over the last two decades. He discusses how earlier methods were focused on using a physics-based model and energy function to get at the lowest energy state of a protein, while more recent methods have utilized co-evolution to extract information using various probabilistic inference techniques. AlQuraishi notes that the accuracy of these methods remains limited without additional sequence information and discusses how deep learning has become a game-changer for protein structure prediction, particularly for membrane and transmembrane proteins.

  • 01:15:00 In this section, the speaker discusses the evolution of deep learning approaches for protein folding, beginning with the use of unsupervised methods in the early 2010s, the introduction of deep neural network-based approaches such as Jinbo Xu's work with RaptorX in 2016, and the adoption of residual network architectures around 2018. The speaker describes the development of the first set of end-to-end differentiable approaches in 2018, which were not necessarily competitive with existing methods but were able to generate predictions much faster. The latest development, AlphaFold 2, treats the multiple sequence alignment (MSA) as a raw object to potentially capture higher-order correlations and global aspects of sequence and phylogeny. Finally, the speaker describes the holy grail of protein folding, the ability to work as well as AlphaFold 2 from individual protein sequences, which their latest work aims to achieve.

  • 01:20:00 In this section, the speakers discuss the ability of proteins to fold in vitro and the extent to which chaperones inside the cell guide this process. They also explore how much information is present in the primary sequence of proteins and whether it is enough to predict the impact of a protein-altering mutation. They discuss the AlphaFold2 predictions, which suggest that it may still be possible to predict from individual sequences without requiring all physical aspects to be modeled explicitly. Finally, the algorithm space is introduced, which involves the input, a neural network torso, and the output, usually a proxy object related to the structure that is then sent through a post-processing pipeline to generate the final three-dimensional structure.

  • 01:25:00 In this section, the speaker discusses the importance of differentiability for the output generated from a deep learning model. If the output is distant from the actual goal, then there is a loss of potential optimization. The speaker also discusses the use of post-processing, which can lead to self-inconsistent predictions, and how their implementation of a deep learning model predicts the final tertiary structure without the need for proxy quantities. In their approach, they parameterize the local geometry using a discrete alphabet of torsion angles and predict a probability distribution over that alphabet. By doing so, they maintain end-to-end differentiability, which allows for efficient optimization of the final structure.
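One way to picture this parameterization is the toy NumPy sketch below: a softmax over a discrete alphabet of (phi, psi, omega) triples gives a per-residue distribution, and mixing the alphabet entries on the unit circle yields expected torsion angles while keeping everything differentiable. The alphabet, shapes, and mixing scheme here are illustrative assumptions, not the published implementation.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def expected_torsions(logits, alphabet):
        """Predict per-residue torsion angles as a probability-weighted mix
        over a discrete alphabet of (phi, psi, omega) triples. Mixing via
        sin/cos keeps the angles well-defined and the mapping differentiable,
        so a structure-level loss can be backpropagated through it."""
        p = softmax(logits)                            # (L, K) distribution over alphabet
        sin_mix = p @ np.sin(alphabet)                 # (L, 3)
        cos_mix = p @ np.cos(alphabet)                 # (L, 3)
        return np.arctan2(sin_mix, cos_mix)            # expected (phi, psi, omega) per residue

    rng = np.random.default_rng(9)
    L, K = 30, 50                                      # residues, alphabet size
    alphabet = rng.uniform(-np.pi, np.pi, size=(K, 3)) # learned in practice; random here
    logits = rng.normal(size=(L, K))                   # would come from the network torso
    angles = expected_torsions(logits, alphabet)       # fed into the geometric construction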

  • 01:30:00 In this section, the speaker explains their approach to constructing the structure of a protein using three torsion angles at each residue and an iterative process. The loss function is defined in terms of global accuracy, not just local accuracy, to account for the interactions between residues in shaping the overall structure. The speaker acknowledges that while their approach is limited, they believe that there is an implicit homogenization of the structure happening internally in the neural network, leading to better predictions over time. The speaker also discusses how the model takes position-specific scoring matrices (PSSMs) as input to a recurrent architecture. Finally, the speaker presents some of their predictions made using this approach and notes that while some aspects of the structure were well predicted, others were not.

  • 01:35:00 In this section, the speaker discusses how they have evolved the idea of torsion parameterization using the Frenet-Serret construction, which simplifies the math and the formulation. They now focus only on C-alpha atoms and parameterize using rotation matrices, which solves the issue of pathological secondary structures. The key change is that they have gone back to the idea of a single sequence, which they feed through a language model. They use transformers to embed each residue in a latent space and use that as input to make predictions, with the added challenge of adapting fragments and splicing two different proteins to improve training performance. The speaker shows results comparing RGN1 and RGN2 in predicting a target CASP sequence, with RGN2 achieving significantly better results due to a post-processing refinement step. It is important to note that this is based on a single-sequence input that went through a language model.

  • 01:40:00 In this section of the video, the speaker discusses the accuracy of their method for predicting protein structures. They show examples compared against AlphaFold2, and while the accuracy is not quite as good as the state of the art, they are using far less information to make the prediction. They also show examples of singleton proteins, which are essentially in the twilight zone of sequence space and have no sequence homologs, where their approach makes a significant difference compared to the state-of-the-art publicly available systems. Additionally, the speaker discusses the de novo and designed proteins that they do well on systematically, which makes sense since these types of sequence-based approaches would be useful in protein design. Finally, the speaker explains that the significant speedup of their method could be useful for a variety of applications.

  • 01:45:00 In this section, the speakers discuss the potential of using deep learning to predict different protein conformations based on different factors, such as genetic variation or small molecules. While having a single-sequence method might work better in theory, there is no way to know until they can actually compare different versions head-to-head, such as when AlphaFold2 is released. Refinement problems are also mentioned, such as predicting the general fold using an MSA and then refining it into the actual structure using another stage. Rapidly evolving viruses are mentioned as another area where deep learning could be useful. Ultimately, the speakers express their excitement over potential future collaboration opportunities and the privilege of being able to connect with people from different parts of the world.
Deep Learning for Protein Folding - Lecture 17 - MIT Deep Learning in Life Sciences (Spring 2021)
  • 2021.04.26
  • www.youtube.com
MIT 6.874/6.802/20.390/20.490/HST.506 Spring 2021. Prof. Manolis Kellis. Guest lecturers: Bruno Correia, Jinbo Xu, Mohammed AlQuraishi. Deep Learning in the Life ...