Sample records for gene function classification

  1. Use of DAVID algorithms for gene functional classification in a non-model organism, rainbow trout

    USDA-ARS's Scientific Manuscript database

    Gene functional clustering is essential in transcriptome data analysis but software programs are not always suitable for use with non-model species. The DAVID Gene Functional Classification Tool has been widely used for soft clustering in model species, but requires adaptations for use in non-model ...

  2. SoFoCles: feature filtering for microarray classification based on gene ontology.

    PubMed

    Papachristoudis, Georgios; Diplaris, Sotiris; Mitkas, Pericles A

    2010-02-01

    Marker gene selection has been an important research topic in the classification analysis of gene expression data. Current methods try to reduce the "curse of dimensionality" by using statistical intra-feature set calculations, or classifiers that are based on the given dataset. In this paper, we present SoFoCles, an interactive tool that enables semantic feature filtering in microarray classification problems with the use of external, well-defined knowledge retrieved from the Gene Ontology. The notion of semantic similarity is used to derive genes that are involved in the same biological path during the microarray experiment, by enriching a feature set that has been initially produced with legacy methods. Among its other functionalities, SoFoCles offers a large repository of semantic similarity methods that are used in order to derive feature sets and marker genes. The structure and functionality of the tool are discussed in detail, as well as its ability to improve classification accuracy. Through experimental evaluation, SoFoCles is shown to outperform other classification schemes in terms of classification accuracy in two real datasets using different semantic similarity computation approaches.

  3. Di-codon Usage for Gene Classification

    NASA Astrophysics Data System (ADS)

    Nguyen, Minh N.; Ma, Jianmin; Fogel, Gary B.; Rajapakse, Jagath C.

    Classification of genes into biologically related groups facilitates inference of their functions. Codon usage bias has been described previously as a potential feature for gene classification. In this paper, we demonstrate that di-codon usage can further improve classification of genes. By using both codon and di-codon features, we achieve near perfect accuracies for the classification of HLA molecules into major classes and sub-classes. The method is illustrated on 1,841 HLA sequences which are classified into two major classes, HLA-I and HLA-II. Major classes are further classified into sub-groups. A binary SVM using di-codon usage patterns achieved 99.95% accuracy in the classification of HLA genes into major HLA classes; and multi-class SVM achieved accuracy rates of 99.82% and 99.03% for sub-class classification of HLA-I and HLA-II genes, respectively. Furthermore, by combining codon and di-codon usages, the prediction accuracies reached 100%, 99.82%, and 99.84% for HLA major class classification, and for sub-class classification of HLA-I and HLA-II genes, respectively.
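
    The codon and di-codon usage features described above can be sketched as relative-frequency vectors. This is a minimal illustration, not the authors' implementation: it assumes the sequence is already in frame, that di-codons are overlapping pairs of adjacent codons, and that codons containing ambiguous bases are skipped. The function names are hypothetical. The resulting vectors could be passed to any SVM implementation.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
CODONS = ["".join(p) for p in product(BASES, repeat=3)]   # 64 codons
DICODONS = [a + b for a in CODONS for b in CODONS]        # 4096 codon pairs


def split_codons(seq):
    """Split an in-frame sequence into codons, dropping a trailing partial codon."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def usage_vector(tokens, vocabulary):
    """Relative frequency of each vocabulary item; tokens outside the vocabulary are ignored."""
    counts = Counter(tokens)
    total = sum(counts[v] for v in vocabulary)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[v] / total for v in vocabulary]


def codon_features(seq):
    """64-dimensional codon usage vector."""
    return usage_vector(split_codons(seq), CODONS)


def dicodon_features(seq):
    """4096-dimensional usage vector over adjacent codon pairs (assumed overlapping)."""
    codons = split_codons(seq)
    pairs = [codons[i] + codons[i + 1] for i in range(len(codons) - 1)]
    return usage_vector(pairs, DICODONS)
```

    Concatenating `codon_features(seq)` and `dicodon_features(seq)` gives one possible combined feature vector of length 4160.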

  4. BeeSpace Navigator: exploratory analysis of gene function using semantic indexing of biological literature.

    PubMed

    Sen Sarma, Moushumi; Arcoleo, David; Khetani, Radhika S; Chee, Brant; Ling, Xu; He, Xin; Jiang, Jing; Mei, Qiaozhu; Zhai, ChengXiang; Schatz, Bruce

    2011-07-01

    With the rapid decrease in cost of genome sequencing, the classification of gene function is becoming a primary problem. Such classification has been performed by human curators who read biological literature to extract evidence. BeeSpace Navigator is a prototype software for exploratory analysis of gene function using biological literature. The software supports an automatic analogue of the curator process to extract functions, with a simple interface intended for all biologists. Since extraction is done on selected collections that are semantically indexed into conceptual spaces, the curation can be task specific. Biological literature containing references to gene lists from expression experiments can be analyzed to extract concepts that are computational equivalents of a classification such as Gene Ontology, yielding discriminating concepts that differentiate gene mentions from other mentions. The functions of individual genes can be summarized from sentences in biological literature, to produce results resembling a model organism database entry that is automatically computed. Statistical frequency analysis based on literature phrase extraction generates offline semantic indexes to support these gene function services. The website with BeeSpace Navigator is free and open to all; there is no login requirement at www.beespace.illinois.edu for version 4. Materials from the 2010 BeeSpace Software Training Workshop are available at www.beespace.illinois.edu/bstwmaterials.php.

  5. A machine-learned computational functional genomics-based approach to drug classification.

    PubMed

    Lötsch, Jörn; Ultsch, Alfred

    2016-12-01

    The public accessibility of "big data" about the molecular targets of drugs and the biological functions of genes allows novel data science-based approaches to pharmacology that link drugs directly with their effects on pathophysiologic processes. This provides a phenotypic path to drug discovery and repurposing. This paper compares the performance of a functional genomics-based criterion to the traditional drug target-based classification. Knowledge discovery in the DrugBank and Gene Ontology databases allowed the construction of a "drug target versus biological process" matrix as a combination of "drug versus genes" and "genes versus biological processes" matrices. As a canonical example, such matrices were constructed for classical analgesic drugs. These matrices were projected onto a toroid grid of 50 × 82 artificial neurons using a self-organizing map (SOM). The distance and cluster structure of the high-dimensional feature space of the matrices was visualized on top of this SOM using a U-matrix. The cluster structure emerging on the U-matrix provided a correct classification of the analgesics into two main classes of opioid and non-opioid analgesics. The classification was flawless with both the functional genomics and the traditional target-based criterion. The functional genomics approach inherently included the drugs' modulatory effects on biological processes. The main pharmacological actions known from pharmacological science were captured, e.g., actions on lipid signaling for non-opioid analgesics that comprised many NSAIDs and actions on neuronal signal transmission for opioid analgesics. Using machine-learned techniques for computational drug classification in a comparative assessment, a functional genomics-based criterion was found to be similarly suitable for drug classification as the traditional target-based criterion. This supports the utility of functional genomics-based approaches to computational system pharmacology for drug discovery and repurposing.
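
    The combination of a "drug versus genes" matrix with a "genes versus biological processes" matrix can be illustrated as a Boolean matrix product. This is only a sketch: the abstract does not say whether the combination was Boolean or count-based, so the Boolean form is an assumption, and `boolean_product` is a hypothetical name.

```python
def boolean_product(drug_gene, gene_process):
    """Combine two 0/1 matrices (lists of rows).

    result[d][p] is 1 when drug d targets at least one gene
    that is annotated to process p, otherwise 0.
    """
    n_genes = len(gene_process)
    n_processes = len(gene_process[0]) if gene_process else 0
    result = []
    for row in drug_gene:
        if len(row) != n_genes:
            raise ValueError("each drug row needs one entry per gene")
        result.append([
            int(any(row[g] and gene_process[g][p] for g in range(n_genes)))
            for p in range(n_processes)
        ])
    return result
```

    With toy data of two drugs, three genes and two processes, `boolean_product([[1, 0, 0], [0, 1, 1]], [[1, 0], [0, 1], [0, 1]])` gives one row per drug and one column per process. Rows of such a matrix are the kind of feature vectors that could then be projected onto a SOM.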

  6. Genome-Wide Comparative Gene Family Classification

    PubMed Central

    Frech, Christian; Chen, Nansheng

    2010-01-01

    Correct classification of genes into gene families is important for understanding gene function and evolution. Although gene families of many species have been resolved both computationally and experimentally with high accuracy, gene family classification in most newly sequenced genomes has not been done with the same high standard. This project has been designed to develop a strategy to effectively and accurately classify gene families across genomes. We first examine and compare the performance of computer programs developed for automated gene family classification. We demonstrate that some programs, including the hierarchical average-linkage clustering algorithm MC-UPGMA and the popular Markov clustering algorithm TRIBE-MCL, can reconstruct manual curation of gene families accurately. However, their performance is highly sensitive to parameter setting, i.e. different gene families require different program parameters for correct resolution. To circumvent the problem of parameterization, we have developed a comparative strategy for gene family classification. This strategy takes advantage of existing curated gene families of reference species to find suitable parameters for classifying genes in related genomes. To demonstrate the effectiveness of this novel strategy, we use TRIBE-MCL to classify chemosensory and ABC transporter gene families in C. elegans and its four sister species. We conclude that fully automated programs can establish biologically accurate gene families if parameterized accordingly. Comparative gene family classification finds optimal parameters automatically, thus allowing rapid insights into gene families of newly sequenced species. PMID:20976221
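
    The comparative strategy, choosing clustering parameters by how well they reproduce curated families in a reference species, can be sketched as a simple parameter search. The scoring rule (fraction of curated families recovered exactly) and the names `families_recovered`, `best_parameter` and `cluster_fn` are assumptions for illustration; `cluster_fn` stands in for a run of a clustering program such as TRIBE-MCL at a given parameter value.

```python
def families_recovered(clusters, curated_families):
    """Fraction of curated families that appear as an exact cluster."""
    found = {frozenset(c) for c in clusters}
    hits = sum(1 for fam in curated_families if frozenset(fam) in found)
    return hits / len(curated_families)


def best_parameter(candidates, cluster_fn, curated_families):
    """Return the candidate whose clustering best matches the curated families.

    cluster_fn(parameter) must return an iterable of clusters (iterables of
    gene identifiers) for the reference species. Ties go to the first candidate.
    """
    return max(
        candidates,
        key=lambda p: families_recovered(cluster_fn(p), curated_families),
    )
```

    The selected value would then be reused when clustering genes of the related, newly sequenced genomes.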

  7. Large-scale gene function analysis with the PANTHER classification system.

    PubMed

    Mi, Huaiyu; Muruganujan, Anushya; Casagrande, John T; Thomas, Paul D

    2013-08-01

    The PANTHER (protein annotation through evolutionary relationship) classification system (http://www.pantherdb.org/) is a comprehensive system that combines gene function, ontology, pathways and statistical analysis tools that enable biologists to analyze large-scale, genome-wide data from sequencing, proteomics or gene expression experiments. The system is built with 82 complete genomes organized into gene families and subfamilies, and their evolutionary relationships are captured in phylogenetic trees, multiple sequence alignments and statistical models (hidden Markov models or HMMs). Genes are classified according to their function in several different ways: families and subfamilies are annotated with ontology terms (Gene Ontology (GO) and PANTHER protein class), and sequences are assigned to PANTHER pathways. The PANTHER website includes a suite of tools that enable users to browse and query gene functions, and to analyze large-scale experimental data with a number of statistical tests. It is widely used by bench scientists, bioinformaticians, computer scientists and systems biologists. In the 2013 release of PANTHER (v.8.0), in addition to an update of the data content, we redesigned the website interface to improve both user experience and the system's analytical capability. This protocol provides a detailed description of how to analyze genome-wide experimental data with the PANTHER classification system.

  8. Using PPI network autocorrelation in hierarchical multi-label classification trees for gene function prediction.

    PubMed

    Stojanova, Daniela; Ceci, Michelangelo; Malerba, Donato; Dzeroski, Saso

    2013-09-26

    Ontologies and catalogs of gene functions, such as the Gene Ontology (GO) and MIPS-FUN, assume that functional classes are organized hierarchically, that is, general functions include more specific ones. This has recently motivated the development of several machine learning algorithms for gene function prediction that leverage this hierarchical organization where instances may belong to multiple classes. In addition, it is possible to exploit relationships among examples, since it is plausible that related genes tend to share functional annotations. Although these relationships have been identified and extensively studied in the area of protein-protein interaction (PPI) networks, they have not received much attention in hierarchical and multi-class gene function prediction. Relations between genes introduce autocorrelation in functional annotations and violate the assumption that instances are independently and identically distributed (i.i.d.), which underlies most machine learning algorithms. Although the explicit consideration of these relations brings additional complexity to the learning process, we expect substantial benefits in predictive accuracy of learned classifiers. This article demonstrates the benefits (in terms of predictive accuracy) of considering autocorrelation in multi-class gene function prediction. We develop a tree-based algorithm for considering network autocorrelation in the setting of Hierarchical Multi-label Classification (HMC). We empirically evaluate the proposed algorithm, called NHMC (Network Hierarchical Multi-label Classification), on 12 yeast datasets using each of the MIPS-FUN and GO annotation schemes and exploiting 2 different PPI networks. The results clearly show that taking autocorrelation into account improves the predictive performance of the learned models for predicting gene function. Our newly developed method for HMC takes into account network information in the learning phase: When used for gene function prediction in the context of PPI networks, the explicit consideration of network autocorrelation increases the predictive performance of the learned models. Overall, we found that this holds for different gene features/descriptions, functional annotation schemes, and PPI networks: Best results are achieved when the PPI network is dense and contains a large proportion of function-relevant interactions.

  9. Random forests-based differential analysis of gene sets for gene expression data.

    PubMed

    Hsueh, Huey-Miin; Zhou, Da-Wei; Tsai, Chen-An

    2013-04-10

    In DNA microarray studies, gene-set analysis (GSA) has become the focus of gene expression data analysis. GSA utilizes the gene expression profiles of functionally related gene sets in Gene Ontology (GO) categories or a priori defined biological classes to assess the significance of gene sets associated with clinical outcomes or phenotypes. Many statistical approaches have been proposed to determine whether such functionally related gene sets express differentially (enrichment and/or deletion) in variations of phenotypes. However, little attention has been given to the discriminatory power of gene sets and classification of patients. In this study, we propose a method of gene set analysis, in which gene sets are used to develop classifications of patients based on the Random Forest (RF) algorithm. The corresponding empirical p-value of an observed out-of-bag (OOB) error rate of the classifier is introduced to identify differentially expressed gene sets using an adequate resampling method. In addition, we discuss the impacts and correlations of genes within each gene set based on the measures of variable importance in the RF algorithm. Significant classifications are reported and visualized together with the underlying gene sets and their contribution to the phenotypes of interest. Numerical studies using both synthesized data and a series of publicly available gene expression data sets are conducted to evaluate the performance of the proposed methods. Compared with other hypothesis testing approaches, our proposed methods are reliable and successful in identifying enriched gene sets and in discovering the contributions of genes within a gene set. The classification results of identified gene sets can provide a valuable alternative to gene set testing to reveal the unknown, biologically relevant classes of samples or patients. In summary, our proposed method allows one to simultaneously assess the discriminatory ability of gene sets and the importance of genes for interpretation of data in complex biological systems. The classifications of biologically defined gene sets can reveal the underlying interactions of gene sets associated with the phenotypes, and provide an insightful complement to conventional gene set analyses. Copyright © 2012 Elsevier B.V. All rights reserved.
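
    An empirical p-value for an observed error rate can be sketched with a label-permutation scheme. The abstract only mentions "an adequate resampling method", so the use of label permutation, the add-one correction, and the names below are assumptions. `error_fn` is a hypothetical stand-in for fitting a Random Forest on one gene set and returning its OOB error rate for the given labels.

```python
import random


def empirical_p_value(observed_error, null_errors):
    """Share of null error rates at least as low as the observed one.

    The +1 in numerator and denominator keeps the estimate above zero.
    """
    hits = sum(1 for e in null_errors if e <= observed_error)
    return (hits + 1) / (len(null_errors) + 1)


def permutation_test(labels, error_fn, n_permutations=199, seed=0):
    """Compare the error rate under the true labels with label-shuffled runs.

    Returns (observed_error, empirical_p_value).
    """
    rng = random.Random(seed)
    observed = error_fn(labels)
    null_errors = []
    for _ in range(n_permutations):
        shuffled = list(labels)
        rng.shuffle(shuffled)          # break the link between samples and phenotype
        null_errors.append(error_fn(shuffled))
    return observed, empirical_p_value(observed, null_errors)
```

    A small p-value indicates that the gene set separates the phenotypes better than would be expected if the labels carried no information.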

  10. Protein classification using probabilistic chain graphs and the Gene Ontology structure.

    PubMed

    Carroll, Steven; Pavlovic, Vladimir

    2006-08-01

    Probabilistic graphical models have been developed in the past for the task of protein classification. In many cases, classifications obtained from the Gene Ontology have been used to validate these models. In this work we directly incorporate the structure of the Gene Ontology into the graphical representation for protein classification. We present a method in which each protein is represented by a replicate of the Gene Ontology structure, effectively modeling each protein in its own 'annotation space'. Proteins are also connected to one another according to different measures of functional similarity, after which belief propagation is run to make predictions at all ontology terms. The proposed method was evaluated on a set of 4879 proteins from the Saccharomyces Genome Database whose interactions were also recorded in the GRID project. Results indicate that direct utilization of the Gene Ontology improves predictive ability, outperforming traditional models that do not take advantage of dependencies among functional terms. Average increase in accuracy (precision) of positive and negative term predictions of 27.8% (2.0%) over three different similarity measures and three subontologies was observed. C/C++/Perl implementation is available from authors upon request.

  11. A new approach to enhance the performance of decision tree for classifying gene expression data.

    PubMed

    Hassan, Md; Kotagiri, Ramamohanarao

    2013-12-20

    Gene expression data classification is a challenging task due to the large dimensionality and very small number of samples. Decision tree is one of the popular machine learning approaches to address such classification problems. However, the existing decision tree algorithms use a single gene feature at each node to split the data into its child nodes and hence might suffer from poor performance, especially when classifying gene expression datasets. By using a new decision tree algorithm where each node of the tree consists of more than one gene, we enhance the classification performance of traditional decision tree classifiers. Our method selects suitable genes that are combined using a linear function to form a derived composite feature. To determine the structure of the tree we use the area under the Receiver Operating Characteristics curve (AUC). Experimental analysis demonstrates higher classification accuracy using the new decision tree compared to the other existing decision trees in literature. We experimentally compare the effect of our scheme against other well known decision tree techniques. Experiments show that our algorithm can substantially boost the classification performance of the decision tree.
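
    The idea of scoring a linear combination of genes by AUC can be illustrated as follows. This sketch only shows how a composite feature and its AUC might be computed; how the paper selects genes and weights is not stated in the abstract, so the weights here are supplied by the caller, and the function names are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    Equivalent to the Mann-Whitney statistic; tied pairs count one half.
    Labels must be 0 or 1, with at least one of each.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def composite_feature(samples, weights):
    """Linear combination of gene expression values, one score per sample."""
    return [sum(w * x for w, x in zip(weights, sample)) for sample in samples]
```

    For example, with toy two-gene samples `[(3, 1), (5, 3), (2, 0), (1, 1), (4, 4), (3, 3)]` and labels `[1, 1, 1, 0, 0, 0]`, the first gene alone does not separate the classes, whereas the composite with weights `(1, -1)` does.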

  12. Improved Classification of Lung Cancer Using Radial Basis Function Neural Network with Affine Transforms of Voss Representation.

    PubMed

    Adetiba, Emmanuel; Olugbara, Oludayo O

    2015-01-01

    Lung cancer is one of the diseases responsible for a large number of cancer related death cases worldwide. The recommended standard for screening and early detection of lung cancer is the low dose computed tomography. However, many patients diagnosed die within one year, which makes it essential to find alternative approaches for screening and early detection of lung cancer. We present computational methods that can be implemented in a functional multi-genomic system for classification, screening and early detection of lung cancer victims. Samples of top ten biomarker genes previously reported to have the highest frequency of lung cancer mutations and sequences of normal biomarker genes were respectively collected from the COSMIC and NCBI databases to validate the computational methods. Experiments were performed based on the combinations of Z-curve and tetrahedron affine transforms, Histogram of Oriented Gradient (HOG), Multilayer perceptron and Gaussian Radial Basis Function (RBF) neural networks to obtain an appropriate combination of computational methods to achieve improved classification of lung cancer biomarker genes. Results show that a combination of affine transforms of Voss representation, HOG genomic features and Gaussian RBF neural network perceptibly improves classification accuracy, specificity and sensitivity of lung cancer biomarker genes as well as achieving low mean square error.
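
    The Voss representation maps a nucleotide sequence to four binary indicator sequences, one per base. The sketch below also derives cumulative Z-curve coordinates from those indicators, using the common purine/pyrimidine, amino/keto and weak/strong definitions. The function names are hypothetical, and the paper's exact affine transforms, HOG features and network setup are not reproduced here.

```python
def voss_representation(seq):
    """Four 0/1 indicator sequences marking where A, C, G and T occur."""
    seq = seq.upper()
    return {base: [1 if ch == base else 0 for ch in seq] for base in "ACGT"}


def z_curve(seq):
    """Cumulative Z-curve points (x, y, z) computed from the Voss indicators.

    x: purines (A, G) minus pyrimidines (C, T)
    y: amino (A, C) minus keto (G, T)
    z: weak (A, T) minus strong (G, C)
    """
    voss = voss_representation(seq)
    counts = {base: 0 for base in "ACGT"}
    points = []
    for i in range(len(seq)):
        for base in "ACGT":
            counts[base] += voss[base][i]
        a, c, g, t = (counts[base] for base in "ACGT")
        points.append((a + g - c - t, a + c - g - t, a + t - g - c))
    return points
```

    Characters other than A, C, G and T (for example N) contribute zeros to all four indicator sequences.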

  13. PANTHER version 11: expanded annotation data from Gene Ontology and Reactome pathways, and data analysis tool enhancements.

    PubMed

    Mi, Huaiyu; Huang, Xiaosong; Muruganujan, Anushya; Tang, Haiming; Mills, Caitlin; Kang, Diane; Thomas, Paul D

    2017-01-04

    The PANTHER database (Protein ANalysis THrough Evolutionary Relationships, http://pantherdb.org) contains comprehensive information on the evolution and function of protein-coding genes from 104 completely sequenced genomes. PANTHER software tools allow users to classify new protein sequences, and to analyze gene lists obtained from large-scale genomics experiments. In the past year, major improvements include a large expansion of classification information available in PANTHER, as well as significant enhancements to the analysis tools. Protein subfamily functional classifications have more than doubled due to progress of the Gene Ontology Phylogenetic Annotation Project. For human genes (as well as a few other organisms), PANTHER now also supports enrichment analysis using pathway classifications from the Reactome resource. The gene list enrichment tools include a new 'hierarchical view' of results, enabling users to leverage the structure of the classifications/ontologies; the tools also allow users to upload genetic variant data directly, rather than requiring prior conversion to a gene list. The updated coding single-nucleotide polymorphisms (SNP) scoring tool uses an improved algorithm. The hidden Markov model (HMM) search tools now use HMMER3, dramatically reducing search times and improving accuracy of E-value statistics. Finally, the PANTHER Tree-Attribute Viewer has been implemented in JavaScript, with new views for exploring protein sequence evolution. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.

  14. PANTHER: a browsable database of gene products organized by biological function, using curated protein family and subfamily classification.

    PubMed

    Thomas, Paul D; Kejariwal, Anish; Campbell, Michael J; Mi, Huaiyu; Diemer, Karen; Guo, Nan; Ladunga, Istvan; Ulitsky-Lazareva, Betty; Muruganujan, Anushya; Rabkin, Steven; Vandergriff, Jody A; Doremieux, Olivier

    2003-01-01

    The PANTHER database was designed for high-throughput analysis of protein sequences. One of the key features is a simplified ontology of protein function, which allows browsing of the database by biological functions. Biologist curators have associated the ontology terms with groups of protein sequences rather than individual sequences. Statistical models (Hidden Markov Models, or HMMs) are built from each of these groups. The advantage of this approach is that new sequences can be automatically classified as they become available. To ensure accurate functional classification, HMMs are constructed not only for families, but also for functionally distinct subfamilies. Multiple sequence alignments and phylogenetic trees, including curator-assigned information, are available for each family. The current version of the PANTHER database includes training sequences from all organisms in the GenBank non-redundant protein database, and the HMMs have been used to classify gene products across the entire genomes of human and Drosophila melanogaster. The ontology terms and protein families and subfamilies, as well as Drosophila gene classifications, can be browsed and searched for free. Due to outstanding contractual obligations, access to human gene classifications and to protein family trees and multiple sequence alignments will temporarily require a nominal registration fee. PANTHER is publicly available on the web at http://panther.celera.com.

  15. Fine-grained parallelization of fitness functions in bioinformatics optimization problems: gene selection for cancer classification and biclustering of gene expression data.

    PubMed

    Gomez-Pulido, Juan A; Cerrada-Barrios, Jose L; Trinidad-Amado, Sebastian; Lanza-Gutierrez, Jose M; Fernandez-Diaz, Ramon A; Crawford, Broderick; Soto, Ricardo

    2016-08-31

    Metaheuristics are widely used to solve large combinatorial optimization problems in bioinformatics because of the huge set of possible solutions. Two representative problems are gene selection for cancer classification and biclustering of gene expression data. In most cases, these metaheuristics, as well as other non-linear techniques, apply a fitness function to each possible solution with a size-limited population, and that step involves higher latencies than other parts of the algorithms, which is why the execution time of the applications depends mainly on the execution time of the fitness function. In addition, it is usual to find floating-point arithmetic formulations for the fitness functions. Thus, a careful parallelization of these functions using reconfigurable hardware technology will accelerate the computation, especially if they are applied in parallel to several solutions of the population. A fine-grained parallelization of two floating-point fitness functions of different complexities and features involved in biclustering of gene expression data and gene selection for cancer classification allowed for obtaining higher speedups and power-reduced computation with regard to usual microprocessors. The results show better performances using reconfigurable hardware technology instead of usual microprocessors, in computing time and power consumption terms, not only because of the parallelization of the arithmetic operations, but also thanks to the concurrent fitness evaluation for several individuals of the population in the metaheuristic. This is a good basis for building accelerated and low-energy solutions for intensive computing scenarios.

  16. Functional classification of rice flanking sequence tagged genes using MapMan terms and global understanding on metabolic and regulatory pathways affected by dxr mutant having defects in light response.

    PubMed

    Chandran, Anil Kumar Nalini; Lee, Gang-Seob; Yoo, Yo-Han; Yoon, Ung-Han; Ahn, Byung-Ohg; Yun, Doh-Won; Kim, Jin-Hyun; Choi, Hong-Kyu; An, GynHeung; Kim, Tae-Ho; Jung, Ki-Hong

    2016-12-01

    Rice is one of the most important food crops for humans. To improve the agronomical traits of rice, the functions of more than 1,000 rice genes have been recently characterized and summarized. The completed, map-based sequence of the rice genome has significantly accelerated the functional characterization of rice genes, but progress remains limited in assigning functions to all predicted non-transposable element (non-TE) genes, estimated to number 37,000-41,000. The International Rice Functional Genomics Consortium (IRFGC) has generated a huge number of gene-indexed mutants by using mutagens such as T-DNA, Tos17 and Ds/dSpm. These mutants have been identified by 246,566 flanking sequence tags (FSTs) and cover 65 % (25,275 of 38,869) of the non-TE genes in rice, while the mutation ratio of TE genes is 25.7 %. In addition, almost 80 % of highly expressed non-TE genes have insertion mutations, indicating that highly expressed genes in rice chromosomes are more likely to have mutations by mutagens such as T-DNA, Ds, dSpm and Tos17. The functions of around 2.5 % of rice genes have been characterized, and studies have mainly focused on transcriptional and post-transcriptional regulation. Slow progress in characterizing the function of rice genes is mainly due to a lack of clues to guide functional studies or functional redundancy. These limitations can be partially solved by a well-categorized functional classification of FST genes. To create this classification, we used the diverse overviews installed in the MapMan toolkit. Gene Ontology (GO) assignment to FST genes supplemented the limitation of MapMan overviews. The functions of 863 of 1,022 known genes can be evaluated by current FST lines, indicating that FST genes are useful resources for functional genomic studies. We assigned 16,169 out of 29,624 FST genes to 34 MapMan classes, including three major categories such as DNA, RNA and protein. To demonstrate the MapMan application on FST genes, transcriptome analysis was done from a rice mutant of 1-deoxy-D-xylulose 5-phosphate reductoisomerase (DXR) gene with FST. Mapping of 756 down-regulated genes in dxr mutants and their annotation in terms of various MapMan overviews revealed candidate genes downstream of DXR-mediating light signaling pathway in diverse functional classes such as the methyl-D-erythritol 4-phosphate (MEP) pathway overview, photosynthesis, secondary metabolism and regulatory overview. This report provides a useful guide for systematic phenomics and further applications to enhance the key agronomic traits of rice.

  17. Extending bicluster analysis to annotate unclassified ORFs and predict novel functional modules using expression data

    PubMed Central

    Bryan, Kenneth; Cunningham, Pádraig

    2008-01-01

    Background: Microarrays have the capacity to measure the expressions of thousands of genes in parallel over many experimental samples. The unsupervised classification technique of bicluster analysis has been employed previously to uncover gene expression correlations over subsets of samples with the aim of providing a more accurate model of the natural gene functional classes. This approach also has the potential to aid functional annotation of unclassified open reading frames (ORFs). Until now this aspect of biclustering has been under-explored. In this work we illustrate how bicluster analysis may be extended into a 'semi-supervised' ORF annotation approach referred to as BALBOA. Results: The efficacy of the BALBOA ORF classification technique is first assessed via cross validation and compared to a multi-class k-Nearest Neighbour (kNN) benchmark across three independent gene expression datasets. BALBOA is then used to assign putative functional annotations to unclassified yeast ORFs. These predictions are evaluated using existing experimental and protein sequence information. Lastly, we employ a related semi-supervised method to predict the presence of novel functional modules within yeast. Conclusion: In this paper we demonstrate how unsupervised classification methods, such as bicluster analysis, may be extended using available annotations to form semi-supervised approaches within the gene expression analysis domain. We show that such methods have the potential to improve upon supervised approaches and shed new light on the functions of unclassified ORFs and their co-regulation. PMID:18831786

  18. Soybean kinome: functional classification and gene expression patterns

    PubMed Central

    Liu, Jinyi; Chen, Nana; Grant, Joshua N.; Cheng, Zong-Ming (Max); Stewart, C. Neal; Hewezi, Tarek

    2015-01-01

    The protein kinase (PK) gene family is one of the largest and most highly conserved gene families in plants and plays a role in nearly all biological functions. While a large number of genes have been predicted to encode PKs in soybean, a comprehensive functional classification and global analysis of expression patterns of this large gene family is lacking. In this study, we identified the entire soybean PK repertoire or kinome, which comprised 2166 putative PK genes, representing 4.67% of all soybean protein-coding genes. The soybean kinome was classified into 19 groups, 81 families, and 122 subfamilies. The receptor-like kinase (RLK) group was remarkably large, containing 1418 genes. Collinearity analysis indicated that whole-genome segmental duplication events may have played a key role in the expansion of the soybean kinome, whereas tandem duplications might have contributed to the expansion of specific subfamilies. Gene structure, subcellular localization prediction, and gene expression patterns indicated extensive functional divergence of PK subfamilies. Global gene expression analysis of soybean PK subfamilies revealed tissue- and stress-specific expression patterns, implying regulatory functions over a wide range of developmental and physiological processes. In addition, tissue and stress co-expression network analysis uncovered specific subfamilies with narrow or wide interconnected relationships, indicative of their association with particular or broad signalling pathways, respectively. Taken together, our analyses provide a foundation for further functional studies to reveal the biological and molecular functions of PKs in soybean. PMID:25614662

  19. An enhancement of binary particle swarm optimization for gene selection in classifying cancer classes

    PubMed Central

    2013-01-01

    Background Gene expression data could likely be a momentous help in the progress of proficient cancer diagnoses and classification platforms. Lately, many researchers analyze gene expression data using diverse computational intelligence methods, for selecting a small subset of informative genes from the data for cancer classification. Many computational methods face difficulties in selecting small subsets due to the small number of samples compared to the huge number of genes (high-dimension), irrelevant genes, and noisy genes. Methods We propose an enhanced binary particle swarm optimization to perform the selection of small subsets of informative genes which is significant for cancer classification. Particle speed, rule, and modified sigmoid function are introduced in this proposed method to increase the probability of the bits in a particle’s position to be zero. The method was empirically applied to a suite of ten well-known benchmark gene expression data sets. Results The performance of the proposed method proved to be superior to other previous related works, including the conventional version of binary particle swarm optimization (BPSO) in terms of classification accuracy and the number of selected genes. The proposed method also requires lower computational time compared to BPSO. PMID:23617960
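
    For orientation, the following sketch shows one update step of conventional binary particle swarm optimization (BPSO), the baseline this abstract compares against. The enhanced method's particle speed, rule and modified sigmoid are not specified in the abstract and are not reproduced here; the inertia, acceleration and velocity-clamp values are illustrative assumptions.

```python
import math
import random

def sigmoid(v):
    """Map a velocity to the probability that the bit is set to 1."""
    return 1.0 / (1.0 + math.exp(-v))

def bpso_step(position, velocity, personal_best, global_best, rng,
              inertia=0.7, c1=1.5, c2=1.5, v_max=4.0):
    """One conventional BPSO update for a single particle.

    Each bit marks whether a gene is selected (1) or left out (0).
    """
    new_position, new_velocity = [], []
    for x, v, p, g in zip(position, velocity, personal_best, global_best):
        # Pull the velocity toward the personal and global best bits
        v = inertia * v + c1 * rng.random() * (p - x) + c2 * rng.random() * (g - x)
        v = max(-v_max, min(v_max, v))  # clamp the velocity
        bit = 1 if rng.random() < sigmoid(v) else 0
        new_velocity.append(v)
        new_position.append(bit)
    return new_position, new_velocity

rng = random.Random(0)
pos, vel = bpso_step([1, 0, 1, 0], [0.0] * 4, [1, 0, 0, 0], [0, 0, 0, 1], rng)
```

    Because a zero velocity gives a selection probability of 0.5, conventional BPSO tends to keep about half of the genes; biasing bits toward zero, as the abstract describes, is one way to obtain smaller subsets.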

  20. Non-Gaussian Distributions Affect Identification of Expression Patterns, Functional Annotation, and Prospective Classification in Human Cancer Genomes

    PubMed Central

    Marko, Nicholas F.; Weil, Robert J.

    2012-01-01

    Introduction Gene expression data is often assumed to be normally-distributed, but this assumption has not been tested rigorously. We investigate the distribution of expression data in human cancer genomes and study the implications of deviations from the normal distribution for translational molecular oncology research. Methods We conducted a central moments analysis of five cancer genomes and performed empiric distribution fitting to examine the true distribution of expression data both on the complete-experiment and on the individual-gene levels. We used a variety of parametric and nonparametric methods to test the effects of deviations from normality on gene calling, functional annotation, and prospective molecular classification using a sixth cancer genome. Results Central moments analyses reveal statistically-significant deviations from normality in all of the analyzed cancer genomes. We observe as much as 37% variability in gene calling, 39% variability in functional annotation, and 30% variability in prospective, molecular tumor subclassification associated with this effect. Conclusions Cancer gene expression profiles are not normally-distributed, either on the complete-experiment or on the individual-gene level. Instead, they exhibit complex, heavy-tailed distributions characterized by statistically-significant skewness and kurtosis. The non-Gaussian distribution of this data affects identification of differentially-expressed genes, functional annotation, and prospective molecular classification. These effects may be reduced in some circumstances, although not completely eliminated, by using nonparametric analytics. This analysis highlights two unreliable assumptions of translational cancer gene expression analysis: that “small” departures from normality in the expression data distributions are analytically-insignificant and that “robust” gene-calling algorithms can fully compensate for these effects. PMID:23118863
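
    The central moments analysis described here rests on skewness and kurtosis, which can be sketched as follows. This is a generic illustration using population (biased) moment estimators, not the authors' pipeline, and the sample values are hypothetical.

```python
def central_moment(values, order):
    """Population central moment of the given order."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)

def skewness(values):
    """Third standardized moment; 0 for a symmetric distribution."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Fourth standardized moment minus 3; 0 for a normal distribution."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0

# Hypothetical expression values
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
heavy_tailed = [1.0] * 9 + [20.0]

symmetric_skew = skewness(symmetric)
tail_skew = skewness(heavy_tailed)
tail_kurtosis = excess_kurtosis(heavy_tailed)
```

    A single extreme value is enough to produce strongly positive skewness and excess kurtosis, the kind of heavy-tailed departure from normality the abstract reports.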

  1. Comparative Oncogenomics for Peripheral Nerve Sheath Cancer Gene Discovery

    DTIC Science & Technology

    2015-06-01

    neurofibromas and MPNSTs, establish gene signatures defining distinct tumor subtypes and functionally test the role of selected driver mutations ...allografted tumor cells, and a variety of in vitro functional assays. We will validate the relevance of these mutated mouse genes in human neurofibromas...and MPNSTs by determining whether these same genes are mutated in human tumors.

  2. Rough set soft computing cancer classification and network: one stone, two birds.

    PubMed

    Zhang, Yue

    2010-07-15

    Gene expression profiling provides tremendous information to help unravel the complexity of cancer. The selection of the most informative genes from huge noise for cancer classification has taken centre stage, along with predicting the function of such identified genes and the construction of direct gene regulatory networks at different system levels with a tuneable parameter. A new study by Wang and Gotoh described a novel Variable Precision Rough Sets-rooted robust soft computing method to successfully address these problems and has yielded some new insights. The significance of this progress and its perspectives will be discussed in this article.

  3. Function Clustering Self-Organization Maps (FCSOMs) for mining differentially expressed genes in Drosophila and its correlation with the growth medium.

    PubMed

    Liu, L L; Liu, M J; Ma, M

    2015-09-28

    The central task of this study was to mine the gene-to-medium relationship. Adequate knowledge of this relationship could potentially improve the accuracy of differentially expressed gene mining. One of the approaches to differentially expressed gene mining uses conventional clustering algorithms to identify the gene-to-medium relationship. Compared to conventional clustering algorithms, self-organization maps (SOMs) identify the nonlinear aspects of the gene-to-medium relationships by mapping the input space into another higher dimensional feature space. However, SOMs are not suitable for huge datasets consisting of millions of samples. Therefore, a new computational model, the Function Clustering Self-Organization Maps (FCSOMs), was developed. FCSOMs take advantage of the theory of granular computing as well as advanced statistical learning methodologies, and are built specifically for each information granule (a function cluster of genes), which are intelligently partitioned by the clustering algorithm provided by the DAVID_6.7 software platform. However, only the gene functions, and not their expression values, are considered in the fuzzy clustering algorithm of DAVID. Compared to the clustering algorithm of DAVID, these experimental results show a marked improvement in the accuracy of classification with the application of FCSOMs. FCSOMs can handle huge datasets and their complex classification problems, as each FCSOM (modeled for each function cluster) can be easily parallelized.

  4. Gene function prediction based on the Gene Ontology hierarchical structure.

    PubMed

    Cheng, Liangxi; Lin, Hongfei; Hu, Yuncui; Wang, Jian; Yang, Zhihao

    2014-01-01

    Gene Ontology annotation helps to explain life science phenomena and provides strong support for biomedical research. The use of the Gene Ontology is gradually affecting the way people store and understand bioinformatic data. To facilitate the prediction of gene functions with the aid of text mining methods and existing resources, we transform it into a multi-label top-down classification problem and develop a method that uses the hierarchical relationships in the Gene Ontology structure to relieve the quantitative imbalance of positive and negative training samples. Meanwhile, the method enhances the discriminating ability of classifiers by retaining and highlighting the key training samples. Additionally, the top-down classifier based on a tree structure takes the relationship of target classes into consideration and thus solves the incompatibility between the classification results and the Gene Ontology structure. Our experiment on the Gene Ontology annotation corpus achieves an F-value performance of 50.7% (precision: 52.7%; recall: 48.9%). The experimental results demonstrate that when the training set is small, it can be expanded via topological propagation of associated documents between the parent and child nodes in the tree structure. The top-down classification model applies to sets of texts in an ontology structure or with a hierarchical relationship.
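
    The reported F-value can be checked against the stated precision and recall. The sketch below assumes the F-value is the balanced F1 score (harmonic mean of precision and recall), which the abstract does not state explicitly but which is consistent with its numbers.

```python
def f_measure(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Precision and recall as reported in the abstract
f1 = f_measure(0.527, 0.489)
f1_percent = round(100 * f1, 1)
```

    With precision 52.7% and recall 48.9%, the balanced F-measure rounds to 50.7%, matching the figure quoted above.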

  5. [Transcriptome analysis of Dunaliella viridis].

    PubMed

    Zhu, Shuai-qi; Gong, Yi-fu; Hang, Yu-qing; Liu, Hao; Wang, He-yu

    2015-08-01

    In order to understand the gene information, function, haloduric pathway (glycerolipid metabolism) and related key genes for Dunaliella viridis, we used Illumina HiSeq™ 2000 high-throughput sequencing technology to sequence its transcriptome. Trinity software was used to assemble the data into transcripts. Based on the Clusters of Orthologous Groups (COG), Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases, we carried out functional annotation and classification, pathway annotation, and open reading frame (ORF) sequence prediction of transcripts. The key genes in glycerolipid metabolism were analyzed. The results suggested that 81,593 transcripts were found, and 77,117 ORF sequences were predicted, accounting for 94.50% of all transcripts. COG classification results showed that 16,569 transcripts were assigned to 24 categories. GO classification annotated 76,436 transcripts. The number of transcripts for biological processes was 30,678, accounting for 40.14% of the GO-annotated transcripts. KEGG pathway analysis showed that 26,428 transcripts were annotated to 317 pathways, and 131 pathways were related to metabolism, accounting for 41.32% of all annotated pathways. Only one transcript was annotated as coding the key enzyme dihydroxyacetone kinase involved in the glycerolipid pathway. This enzyme could be related to glycerol biosynthesis under salt stress. This study further improved the gene information and laid the foundation of metabolic pathway research for Dunaliella viridis.

  6. Application of a 5-tiered scheme for standardized classification of 2,360 unique mismatch repair gene variants in the InSiGHT locus-specific database.

    PubMed

    Thompson, Bryony A; Spurdle, Amanda B; Plazzer, John-Paul; Greenblatt, Marc S; Akagi, Kiwamu; Al-Mulla, Fahd; Bapat, Bharati; Bernstein, Inge; Capellá, Gabriel; den Dunnen, Johan T; du Sart, Desiree; Fabre, Aurelie; Farrell, Michael P; Farrington, Susan M; Frayling, Ian M; Frebourg, Thierry; Goldgar, David E; Heinen, Christopher D; Holinski-Feder, Elke; Kohonen-Corish, Maija; Robinson, Kristina Lagerstedt; Leung, Suet Yi; Martins, Alexandra; Moller, Pal; Morak, Monika; Nystrom, Minna; Peltomaki, Paivi; Pineda, Marta; Qi, Ming; Ramesar, Rajkumar; Rasmussen, Lene Juel; Royer-Pokora, Brigitte; Scott, Rodney J; Sijmons, Rolf; Tavtigian, Sean V; Tops, Carli M; Weber, Thomas; Wijnen, Juul; Woods, Michael O; Macrae, Finlay; Genuardi, Maurizio

    2014-02-01

    The clinical classification of hereditary sequence variants identified in disease-related genes directly affects clinical management of patients and their relatives. The International Society for Gastrointestinal Hereditary Tumours (InSiGHT) undertook a collaborative effort to develop, test and apply a standardized classification scheme to constitutional variants in the Lynch syndrome-associated genes MLH1, MSH2, MSH6 and PMS2. Unpublished data submission was encouraged to assist in variant classification and was recognized through microattribution. The scheme was refined by multidisciplinary expert committee review of the clinical and functional data available for variants, applied to 2,360 sequence alterations, and disseminated online. Assessment using validated criteria altered classifications for 66% of 12,006 database entries. Clinical recommendations based on transparent evaluation are now possible for 1,370 variants that were not obviously protein truncating from nomenclature. This large-scale endeavor will facilitate the consistent management of families suspected to have Lynch syndrome and demonstrates the value of multidisciplinary collaboration in the curation and classification of variants in public locus-specific databases.

  7. Application of a five-tiered scheme for standardized classification of 2,360 unique mismatch repair gene variants lodged on the InSiGHT locus-specific database

    PubMed Central

    Plazzer, John-Paul; Greenblatt, Marc S.; Akagi, Kiwamu; Al-Mulla, Fahd; Bapat, Bharati; Bernstein, Inge; Capellá, Gabriel; den Dunnen, Johan T.; du Sart, Desiree; Fabre, Aurelie; Farrell, Michael P.; Farrington, Susan M.; Frayling, Ian M.; Frebourg, Thierry; Goldgar, David E.; Heinen, Christopher D.; Holinski-Feder, Elke; Kohonen-Corish, Maija; Robinson, Kristina Lagerstedt; Leung, Suet Yi; Martins, Alexandra; Moller, Pal; Morak, Monika; Nystrom, Minna; Peltomaki, Paivi; Pineda, Marta; Qi, Ming; Ramesar, Rajkumar; Rasmussen, Lene Juel; Royer-Pokora, Brigitte; Scott, Rodney J.; Sijmons, Rolf; Tavtigian, Sean V.; Tops, Carli M.; Weber, Thomas; Wijnen, Juul; Woods, Michael O.; Macrae, Finlay; Genuardi, Maurizio

    2015-01-01

    Clinical classification of sequence variants identified in hereditary disease genes directly affects clinical management of patients and their relatives. The International Society for Gastrointestinal Hereditary Tumours (InSiGHT) undertook a collaborative effort to develop, test and apply a standardized classification scheme to constitutional variants in the Lynch Syndrome genes MLH1, MSH2, MSH6 and PMS2. Unpublished data submission was encouraged to assist variant classification, and recognized by microattribution. The scheme was refined by multidisciplinary expert committee review of clinical and functional data available for variants, applied to 2,360 sequence alterations, and disseminated online. Assessment using validated criteria altered classifications for 66% of 12,006 database entries. Clinical recommendations based on transparent evaluation are now possible for 1,370 variants not obviously protein-truncating from nomenclature. This large-scale endeavor will facilitate consistent management of suspected Lynch Syndrome families, and demonstrates the value of multidisciplinary collaboration for curation and classification of variants in public locus-specific databases. PMID:24362816

  8. Establishing glucose- and ABA-regulated transcription networks in Arabidopsis by microarray analysis and promoter classification using a Relevance Vector Machine.

    PubMed

    Li, Yunhai; Lee, Kee Khoon; Walsh, Sean; Smith, Caroline; Hadingham, Sophie; Sorefan, Karim; Cawley, Gavin; Bevan, Michael W

    2006-03-01

    Establishing transcriptional regulatory networks by analysis of gene expression data and promoter sequences shows great promise. We developed a novel promoter classification method using a Relevance Vector Machine (RVM) and Bayesian statistical principles to identify discriminatory features in the promoter sequences of genes that can correctly classify transcriptional responses. The method was applied to microarray data obtained from Arabidopsis seedlings treated with glucose or abscisic acid (ABA). Of those genes showing >2.5-fold changes in expression level, approximately 70% were correctly predicted as being up- or down-regulated (under 10-fold cross-validation), based on the presence or absence of a small set of discriminative promoter motifs. Many of these motifs have known regulatory functions in sugar- and ABA-mediated gene expression. One promoter motif that was not known to be involved in glucose-responsive gene expression was identified as the strongest classifier of glucose-up-regulated gene expression. We show it confers glucose-responsive gene expression in conjunction with another promoter motif, thus validating the classification method. We were able to establish a detailed model of glucose and ABA transcriptional regulatory networks and their interactions, which will help us to understand the mechanisms linking metabolism with growth in Arabidopsis. This study shows that machine learning strategies coupled to Bayesian statistical methods hold significant promise for identifying functionally significant promoter sequences.
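
    The classifier in this abstract works from the presence or absence of promoter motifs. The sketch below shows only that feature-encoding step, under stated assumptions: the motif strings and promoter sequences are hypothetical placeholders, and the Relevance Vector Machine itself is not implemented.

```python
def motif_features(promoter, motifs):
    """Encode a promoter as a 0/1 vector: 1 where the motif occurs."""
    sequence = promoter.upper()
    return [1 if motif in sequence else 0 for motif in motifs]

# Hypothetical motif list and promoter fragments, for illustration only
motifs = ["ACGTG", "TATCCA"]
promoters = {
    "geneA": "ttgacgtgtc",
    "geneB": "ggtatccaaa",
    "geneC": "ccccgggg",
}
features = {gene: motif_features(seq, motifs) for gene, seq in promoters.items()}
```

    Vectors of this form, paired with up- or down-regulation labels, are the kind of input a sparse classifier could use to pick out a small set of discriminative motifs.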

  9. Rough Set Soft Computing Cancer Classification and Network: One Stone, Two Birds

    PubMed Central

    Zhang, Yue

    2010-01-01

    Gene expression profiling provides tremendous information to help unravel the complexity of cancer. The selection of the most informative genes from huge noise for cancer classification has taken centre stage, along with predicting the function of such identified genes and the construction of direct gene regulatory networks at different system levels with a tuneable parameter. A new study by Wang and Gotoh described a novel Variable Precision Rough Sets-rooted robust soft computing method to successfully address these problems and has yielded some new insights. The significance of this progress and its perspectives will be discussed in this article. PMID:20706619

  10. MERRF Classification: Implications for Diagnosis and Clinical Trials.

    PubMed

    Finsterer, Josef; Zarrouk-Mahjoub, Sinda; Shoffner, John M

    2018-03-01

    Given the etiologic heterogeneity of disease classification using clinical phenomenology, we employed contemporary criteria to classify variants associated with myoclonic epilepsy with ragged-red fibers (MERRF) syndrome and to assess the strength of evidence of gene-disease associations. Standardized approaches are used to clarify the definition of MERRF, which is essential for patient diagnosis, patient classification, and clinical trial design. Systematic literature and database search with application of standardized assessment of gene-disease relationships using modified Smith criteria and of variants reported to be associated with MERRF using modified Yarham criteria. Review of available evidence supports a gene-disease association for two MT-tRNAs and for POLG. Using modified Smith criteria, definitive evidence of a MERRF gene-disease association is identified for MT-TK. Strong gene-disease evidence is present for MT-TL1 and POLG. Functional assays that directly associate variants with oxidative phosphorylation impairment were critical to mtDNA variant classification. In silico analysis was of limited utility to the assessment of individual MT-tRNA variants. With the use of contemporary classification criteria, several mtDNA variants previously reported as pathogenic or possibly pathogenic are reclassified as neutral variants. MERRF is primarily an MT-TK disease, with pathogenic variants in this gene accounting for ~90% of MERRF patients. Although MERRF is phenotypically and genotypically heterogeneous, myoclonic epilepsy is the clinical feature that distinguishes MERRF from other categories of mitochondrial disorders. Given its low frequency in mitochondrial disorders, myoclonic epilepsy is not explained simply by an impairment of cellular energetics. Although MERRF phenocopies can occur in other genes, additional data are needed to establish a MERRF disease-gene association. This approach to MERRF emphasizes standardized classification rather than clinical phenomenology, thus improving patient diagnosis and clinical trial design. Copyright © 2017 Elsevier Inc. All rights reserved.

  11. An integrative machine learning strategy for improved prediction of essential genes in Escherichia coli metabolism using flux-coupled features.

    PubMed

    Nandi, Sutanu; Subramanian, Abhishek; Sarkar, Ram Rup

    2017-07-25

    Prediction of essential genes helps to identify a minimal set of genes that are absolutely required for the appropriate functioning and survival of a cell. The available machine learning techniques for essential gene prediction have inherent problems, like imbalanced provision of training datasets, biased choice of the best model for a given balanced dataset, choice of a complex machine learning algorithm, and data-based automated selection of biologically relevant features for classification. Here, we propose a simple support vector machine-based learning strategy for the prediction of essential genes in Escherichia coli K-12 MG1655 metabolism that integrates a non-conventional combination of an appropriate sample balanced training set, a unique organism-specific genotype, phenotype attributes that characterize essential genes, and optimal parameters of the learning algorithm to generate the best machine learning model (the model with the highest accuracy among all the models trained for different sample training sets). For the first time, we also introduce flux-coupled metabolic subnetwork-based features for enhancing the classification performance. Our strategy proves to be superior as compared to previous SVM-based strategies in obtaining a biologically relevant classification of genes with high sensitivity and specificity. This methodology was also trained with datasets of other recent supervised classification techniques for essential gene classification and tested using reported test datasets. The testing accuracy was always high as compared to the known techniques, proving that our method outperforms known methods. Observations from our study indicate that essential genes are conserved among homologous bacterial species, demonstrate high codon usage bias, GC content and gene expression, and predominantly possess a tendency to form physiological flux modules in metabolism.

  12. The phenotypic manifestations of rare genic CNVs in autism spectrum disorder

    PubMed Central

    Merikangas, A K; Segurado, R; Heron, E A; Anney, R J L; Paterson, A D; Cook, E H; Pinto, D; Scherer, S W; Szatmari, P; Gill, M; Corvin, A P; Gallagher, L

    2015-01-01

    Significant evidence exists for the association between copy number variants (CNVs) and Autism Spectrum Disorder (ASD); however, most of this work has focused solely on the diagnosis of ASD. There is limited understanding of the impact of CNVs on the ‘sub-phenotypes' of ASD. The objective of this paper is to evaluate associations between CNVs in differentially brain expressed (DBE) genes or genes previously implicated in ASD/intellectual disability (ASD/ID) and specific sub-phenotypes of ASD. The sample consisted of 1590 cases of European ancestry from the Autism Genome Project (AGP) with a diagnosis of an ASD and at least one rare CNV impacting any gene and a core set of phenotypic measures, including symptom severity, language impairments, seizures, gait disturbances, intelligence quotient (IQ) and adaptive function, as well as paternal and maternal age. Classification analyses using a non-parametric recursive partitioning method (random forests) were employed to define sets of phenotypic characteristics that best classify the CNV-defined groups. There was substantial variation in the classification accuracy of the two sets of genes. The best variables for classification were verbal IQ for the ASD/ID genes, paternal age at birth for the DBE genes and adaptive function for de novo CNVs. CNVs in the ASD/ID list were primarily associated with communication and language domains, whereas CNVs in DBE genes were related to broader manifestations of adaptive function. To our knowledge, this is the first study to examine the associations between sub-phenotypes and CNVs genome-wide in ASD. This work highlights the importance of examining the diverse sub-phenotypic manifestations of CNVs in ASD, including the specific features, comorbid conditions and clinical correlates of ASD that comprise underlying characteristics of the disorder. PMID:25421404

  13. The phenotypic manifestations of rare genic CNVs in autism spectrum disorder.

    PubMed

    Merikangas, A K; Segurado, R; Heron, E A; Anney, R J L; Paterson, A D; Cook, E H; Pinto, D; Scherer, S W; Szatmari, P; Gill, M; Corvin, A P; Gallagher, L

    2015-11-01

    Significant evidence exists for the association between copy number variants (CNVs) and Autism Spectrum Disorder (ASD); however, most of this work has focused solely on the diagnosis of ASD. There is limited understanding of the impact of CNVs on the 'sub-phenotypes' of ASD. The objective of this paper is to evaluate associations between CNVs in differentially brain expressed (DBE) genes or genes previously implicated in ASD/intellectual disability (ASD/ID) and specific sub-phenotypes of ASD. The sample consisted of 1590 cases of European ancestry from the Autism Genome Project (AGP) with a diagnosis of an ASD and at least one rare CNV impacting any gene and a core set of phenotypic measures, including symptom severity, language impairments, seizures, gait disturbances, intelligence quotient (IQ) and adaptive function, as well as paternal and maternal age. Classification analyses using a non-parametric recursive partitioning method (random forests) were employed to define sets of phenotypic characteristics that best classify the CNV-defined groups. There was substantial variation in the classification accuracy of the two sets of genes. The best variables for classification were verbal IQ for the ASD/ID genes, paternal age at birth for the DBE genes and adaptive function for de novo CNVs. CNVs in the ASD/ID list were primarily associated with communication and language domains, whereas CNVs in DBE genes were related to broader manifestations of adaptive function. To our knowledge, this is the first study to examine the associations between sub-phenotypes and CNVs genome-wide in ASD. This work highlights the importance of examining the diverse sub-phenotypic manifestations of CNVs in ASD, including the specific features, comorbid conditions and clinical correlates of ASD that comprise underlying characteristics of the disorder.

  14. Divergence and adaptive evolution of the gibberellin oxidase genes in plants.

    PubMed

    Huang, Yuan; Wang, Xi; Ge, Song; Rao, Guang-Yuan

    2015-09-29

    The important phytohormones gibberellins (GAs) play key roles in various developmental processes. GA oxidases (GAoxs) are critical enzymes in the GA synthesis pathway, but their classification, evolutionary history and the forces driving the evolution of plant GAox genes remain poorly understood. This study provides the first large-scale evolutionary analysis of GAox genes in plants by using an extensive whole-genome dataset of 41 species, representing green algae, bryophytes, pteridophytes, and seed plants. We defined eight subfamilies under the GAox family, namely C19-GA2ox, C20-GA2ox, GA20ox, GA3ox, GAox-A, GAox-B, GAox-C and GAox-D. Of these, subfamilies GAox-A, GAox-B, GAox-C and GAox-D are described for the first time. On the basis of phylogenetic analyses and characteristic motifs of GAox genes, we demonstrated a rapid expansion and functional divergence of the GAox genes during the diversification of land plants. We also detected the subfamily-specific motifs and potential sites of some GAox genes, which might have evolved under positive selection. GAox genes originated very early, before the divergence of bryophytes and the vascular plants, and the diversification of GAox genes is associated with functional divergence and could be driven by positive selection. Our study not only provides information on the classification of GAox genes, but also facilitates the further functional characterization and analysis of GA oxidases.

  15. geneCommittee: a web-based tool for extensively testing the discriminatory power of biologically relevant gene sets in microarray data classification.

    PubMed

    Reboiro-Jato, Miguel; Arrais, Joel P; Oliveira, José Luis; Fdez-Riverola, Florentino

    2014-01-30

    The diagnosis and prognosis of several diseases can be shortened through the use of different large-scale genome experiments. In this context, microarrays can generate expression data for a huge set of genes. However, to obtain solid statistical evidence from the resulting data, it is necessary to train and to validate many classification techniques in order to find the best discriminative method. This is a time-consuming process that normally depends on intricate statistical tools. geneCommittee is a web-based interactive tool for routinely evaluating the discriminative classification power of custom hypotheses in the form of biologically relevant gene sets. While the user can work with different gene set collections and several microarray data files to configure specific classification experiments, the tool is able to run several tests in parallel. Provided with a straightforward and intuitive interface, geneCommittee is able to render valuable information for diagnostic analyses and clinical management decisions based on systematically evaluating custom hypotheses over different data sets using complementary classifiers, a key aspect in clinical research. geneCommittee allows the enrichment of raw microarray data with gene functional annotations, producing integrated datasets that simplify the construction of better discriminative hypotheses, and allows the creation of a set of complementary classifiers. The trained committees can then be used for clinical research and diagnosis. Full documentation including common use cases and guided analysis workflows is freely available at http://sing.ei.uvigo.es/GC/.

  16. Cloud-scale genomic signals processing classification analysis for gene expression microarray data.

    PubMed

    Harvey, Benjamin; Ji, Soo-Yeon

    2014-01-01

    As the microarray data available to scientists continue to increase in size and complexity, it has become overwhelmingly important to find multiple ways to draw inferences through analysis of DNA/mRNA sequence data that are useful to scientists. Though there have been many attempts to elucidate the issue of bringing forth biological inference by means of wavelet preprocessing and classification, there has not been a research effort that focuses on a cloud-scale classification analysis of microarray data using wavelet thresholding in a cloud environment to identify significantly expressed features. This paper proposes a novel methodology that uses wavelet-based denoising to initialize a threshold for determination of significantly expressed genes for classification. Additionally, this research was implemented within a cloud-based distributed processing environment. Cloud computing and wavelet thresholding were used for the classification of 14 tumor classes from the Global Cancer Map (GCM). The results proved to be more accurate than using a predefined p-value for differential expression classification. This novel methodology analyzed wavelet-based threshold features of gene expression in a cloud environment and classified the expression of samples by analyzing gene patterns, which inform us of biological processes. It also enables researchers to face the present and forthcoming challenges that may arise in the analysis of large microarray datasets in functional genomics.

  17. Gene-expression signatures can distinguish gastric cancer grades and stages.

    PubMed

    Cui, Juan; Li, Fan; Wang, Guoqing; Fang, Xuedong; Puett, J David; Xu, Ying

    2011-03-18

    Microarray gene-expression data of 54 paired gastric cancer and adjacent noncancerous gastric tissues were analyzed, with the aim to establish gene signatures for cancer grades (well-, moderately-, poorly- or un-differentiated) and stages (I, II, III and IV), which have been determined by pathologists. Our statistical analysis led to the identification of a number of gene combinations whose expression patterns serve well as signatures of different grades and different stages of gastric cancer. A 19-gene signature was found to have discerning power between high- and low-grade gastric cancers in general, with overall classification accuracy at 79.6%. An expanded 198-gene panel allows the stratification of cancers into four grades and control, giving rise to an overall classification agreement of 74.2% between each grade designated by the pathologists and our prediction. Two signatures for cancer staging, consisting of 10 genes and 9 genes, respectively, provide high classification accuracies at 90.0% and 84.0%, among early-, advanced-stage cancer and control. Functional and pathway analyses on these signature genes reveal the significant relevance of the derived signatures to cancer grades and progression. To the best of our knowledge, this represents the first study on identification of genes whose expression patterns can serve as markers for cancer grades and stages.

  18. Characteristics of genomic signatures derived using univariate methods and mechanistically anchored functional descriptors for predicting drug- and xenobiotic-induced nephrotoxicity.

    PubMed

    Shi, Weiwei; Bugrim, Andrej; Nikolsky, Yuri; Nikolskya, Tatiana; Brennan, Richard J

    2008-01-01

    The ideal toxicity biomarker is composed of the properties of prediction (is detected prior to traditional pathological signs of injury), accuracy (high sensitivity and specificity), and mechanistic relationships to the endpoint measured (biological relevance). Gene expression-based toxicity biomarkers ("signatures") have shown good predictive power and accuracy, but are difficult to interpret biologically. We have compared different statistical methods of feature selection with knowledge-based approaches, using GeneGo's database of canonical pathway maps, to generate gene sets for the classification of renal tubule toxicity. The gene set selection algorithms include four univariate analyses: t-statistics, fold-change, B-statistics, and RankProd, and their combination and overlap for the identification of differentially expressed probes. Enrichment analysis following the results of the four univariate analyses, Hotelling T-square test, and, finally, out-of-bag selection, a variant of cross-validation, were used to identify canonical pathway maps (sets of genes coordinately involved in key biological processes) with classification power. Differentially expressed genes identified by the different statistical univariate analyses all generated reasonably performing classifiers of tubule toxicity. Maps identified by enrichment analysis or Hotelling T-square had lower classification power, but highlighted perturbed lipid homeostasis as a common discriminator of nephrotoxic treatments. The out-of-bag method yielded the best functionally integrated classifier. The map "ephrins signaling" performed comparably to a classifier derived using sparse linear programming, a machine learning algorithm, and represents a signaling network specifically involved in renal tubule development and integrity. Such functional descriptors of toxicity promise to better integrate predictive toxicogenomics with mechanistic analysis, facilitating the interpretation and risk assessment of predictive genomic investigations.

  19. HoloVir: A Workflow for Investigating the Diversity and Function of Viruses in Invertebrate Holobionts

    PubMed Central

    Laffy, Patrick W.; Wood-Charlson, Elisha M.; Turaev, Dmitrij; Weynberg, Karen D.; Botté, Emmanuelle S.; van Oppen, Madeleine J. H.; Webster, Nicole S.; Rattei, Thomas

    2016-01-01

    Abundant bioinformatics resources are available for the study of complex microbial metagenomes, however their utility in viral metagenomics is limited. HoloVir is a robust and flexible data analysis pipeline that provides an optimized and validated workflow for taxonomic and functional characterization of viral metagenomes derived from invertebrate holobionts. Simulated viral metagenomes comprising varying levels of viral diversity and abundance were used to determine the optimal assembly and gene prediction strategy, and multiple sequence assembly methods and gene prediction tools were tested in order to optimize our analysis workflow. HoloVir performs pairwise comparisons of single read and predicted gene datasets against the viral RefSeq database to assign taxonomy and additional comparison to phage-specific and cellular markers is undertaken to support the taxonomic assignments and identify potential cellular contamination. Broad functional classification of the predicted genes is provided by assignment of COG microbial functional category classifications using EggNOG and higher resolution functional analysis is achieved by searching for enrichment of specific Swiss-Prot keywords within the viral metagenome. Application of HoloVir to viral metagenomes from the coral Pocillopora damicornis and the sponge Rhopaloeides odorabile demonstrated that HoloVir provides a valuable tool to characterize holobiont viral communities across species, environments, or experiments. PMID:27375564

  20. Effective Feature Selection for Classification of Promoter Sequences.

    PubMed

    K, Kouser; P G, Lavanya; Rangarajan, Lalitha; K, Acharya Kshitish

    2016-01-01

    Exploring novel computational methods in making sense of biological data has not only been a necessity, but also productive. A part of this trend is the search for more efficient in silico methods/tools for analysis of promoters, which are parts of DNA sequences that are involved in regulation of expression of genes into other functional molecules. Promoter regions vary greatly in their function based on the sequence of nucleotides and the arrangement of protein-binding short-regions called motifs. In fact, the regulatory nature of the promoters seems to be largely driven by the selective presence and/or the arrangement of these motifs. Here, we explore computational classification of promoter sequences based on the pattern of motif distributions, as such classification can pave a new way of functional analysis of promoters and to discover the functionally crucial motifs. We make use of Position Specific Motif Matrix (PSMM) features for exploring the possibility of accurately classifying promoter sequences using some of the popular classification techniques. The classification results on the complete feature set are low, perhaps due to the huge number of features. We propose two ways of reducing features. Our test results show improvement in the classification output after the reduction of features. The results also show that decision trees outperform SVM (Support Vector Machine), KNN (K Nearest Neighbor) and ensemble classifier LibD3C, particularly with reduced features. The proposed feature selection methods outperform some of the popular feature transformation methods such as PCA and SVD. Also, the methods proposed are as accurate as MRMR (feature selection method) but much faster than MRMR. Such methods could be useful to categorize new promoters and explore regulatory mechanisms of gene expressions in complex eukaryotic species.

  1. Similarity-balanced discriminant neighbor embedding and its application to cancer classification based on gene expression data.

    PubMed

    Zhang, Li; Qian, Liqiang; Ding, Chuntao; Zhou, Weida; Li, Fanzhang

    2015-09-01

    The family of discriminant neighborhood embedding (DNE) methods comprises typical graph-based methods for dimension reduction and has been successfully applied to face recognition. This paper proposes a new variant of DNE, called similarity-balanced discriminant neighborhood embedding (SBDNE), and applies it to cancer classification using gene expression data. By introducing a novel similarity function, SBDNE treats two data points in the same class differently from two data points in different classes. The homogeneous and heterogeneous neighbors are selected according to the new similarity function instead of the Euclidean distance. SBDNE constructs two adjacent graphs, namely a between-class adjacent graph and a within-class adjacent graph, using the new similarity function. According to these two adjacent graphs, we can generate the local between-class scatter and the local within-class scatter, respectively. Thus, SBDNE can maximize the between-class scatter and simultaneously minimize the within-class scatter to find the optimal projection matrix. Experimental results on six microarray datasets show that SBDNE is a promising method for cancer classification. Copyright © 2015 Elsevier Ltd. All rights reserved.

  2. Variations in the Intragene Methylation Profiles Hallmark Induced Pluripotency

    PubMed Central

    Druzhkov, Pavel; Zolotykh, Nikolay; Meyerov, Iosif; Alsaedi, Ahmed; Shutova, Maria; Ivanchenko, Mikhail; Zaikin, Alexey

    2015-01-01

    We demonstrate the potential of differentiating embryonic and induced pluripotent stem cells by the regularized linear and decision tree machine learning classification algorithms, based on a number of intragene methylation measures. The resulting average classification accuracy is shown to be above 95%, which exceeds earlier results. We propose a constructive and transparent method of feature selection based on classifier accuracy. Enrichment analysis reveals a statistically meaningful presence of stemness group and cancer discriminating genes among the selected best classifying features. These findings stimulate further research on the functional consequences of these differences in methylation patterns. The presented approach can be broadly used to discriminate cells of different phenotype or in different states by their methylation profiles, identify groups of genes constituting multifeature classifiers, and assess enrichment of these groups by the sets of genes with a functionality of interest. PMID:26618180

  3. MO-DE-207B-03: Improved Cancer Classification Using Patient-Specific Biological Pathway Information Via Gene Expression Data

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Young, M; Craft, D

    Purpose: To develop an efficient, pathway-based classification system using network biology statistics to assist in patient-specific response predictions to radiation and drug therapies across multiple cancer types. Methods: We developed PICS (Pathway Informed Classification System), a novel two-step cancer classification algorithm. In PICS, a matrix m of mRNA expression values for a patient cohort is collapsed into a matrix p of biological pathways. The entries of p, which we term pathway scores, are obtained from either principal component analysis (PCA), normal tissue centroid (NTC), or gene expression deviation (GED). The pathway score matrix is clustered using both k-means and hierarchical clustering, and a clustering is judged by how well it groups patients into distinct survival classes. The most effective pathway scoring/clustering combination, per clustering p-value, thus generates various ‘signatures’ for conventional and functional cancer classification. Results: PICS successfully regularized large dimension gene data, separated normal and cancerous tissues, and clustered a large patient cohort spanning six cancer types. Furthermore, PICS clustered patient cohorts into distinct, statistically-significant survival groups. For a suboptimally-debulked ovarian cancer set, the pathway-classified Kaplan-Meier survival curve (p = .00127) showed significant improvement over that of a prior gene expression-classified study (p = .0179). For a pancreatic cancer set, the pathway-classified Kaplan-Meier survival curve (p = .00141) showed significant improvement over that of a prior gene expression-classified study (p = .04). Pathway-based classification confirmed biomarkers for the pyrimidine, WNT-signaling, glycerophosphoglycerol, beta-alanine, and pantothenic acid pathways for ovarian cancer. Despite its robust nature, PICS requires significantly less run time than current pathway scoring methods. Conclusion: This work validates the PICS method to improve cancer classification using biological pathways. Patients are classified with greater specificity and physiological relevance as compared to current gene-specific approaches. Focus now moves to utilizing PICS for pan-cancer patient-specific treatment response prediction.
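
    The collapse of the expression matrix m into the pathway matrix p can be illustrated with a minimal sketch. It covers only a PCA-style score and is not the PICS implementation: the function name `pathway_scores_pca`, the genes-by-patients layout of m, and the `pathways` mapping from pathway name to gene row indices are all assumptions made for illustration.

```python
import numpy as np

def pathway_scores_pca(m, pathways):
    """Collapse a genes x patients matrix m into a pathways x patients matrix p.

    pathways (hypothetical input) maps a pathway name to the row indices
    of its member genes in m. Each pathway score is the projection of the
    gene-centered block onto its first principal component.
    """
    names = sorted(pathways)
    p = np.zeros((len(names), m.shape[1]))
    for i, name in enumerate(names):
        block = m[pathways[name], :]
        # Center each gene across patients before extracting components.
        centered = block - block.mean(axis=1, keepdims=True)
        # Rows of vt are patient-space components; s[0] * vt[0] is the
        # score of every patient on the first principal component.
        _, s, vt = np.linalg.svd(centered, full_matrices=False)
        p[i, :] = s[0] * vt[0, :]
    return names, p

# Toy cohort: 6 genes, 5 patients, two made-up pathways.
rng = np.random.default_rng(0)
m = rng.normal(size=(6, 5))
names, p = pathway_scores_pca(m, {"pathway_a": [0, 1, 2], "pathway_b": [3, 4, 5]})
print(names, p.shape)
```

    The sign of a principal component is arbitrary, so scores are defined only up to a sign flip per pathway. The resulting p could then be passed to any k-means or hierarchical clustering routine.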

  4. Simple and Flexible Classification of Gene Expression Microarrays Via Swirls and Ripples | Division of Cancer Prevention

    Cancer.gov

    By Stuart G. Baker. The program requires Mathematica 7.01.0. The key function is Classify[datalist, options], where datalist = {data, genename, dataname}; data = {matrix for class 0, matrix for class 1}, where each matrix is gene expression by specimen; genename is a list of names of genes; and dataname = {name of data set, name of class0, name of class1}.

  5. A label distance maximum-based classifier for multi-label learning.

    PubMed

    Liu, Xiaoli; Bao, Hang; Zhao, Dazhe; Cao, Peng

    2015-01-01

    Multi-label classification is useful in many bioinformatics tasks such as gene function prediction and protein site localization. This paper presents an improved neural network algorithm, Max Label Distance Back Propagation Algorithm for Multi-Label Classification. The method was formulated by modifying the total error function of the standard BP by adding a penalty term, which was realized by maximizing the distance between the positive and negative labels. Extensive experiments were conducted to compare this method against state-of-the-art multi-label methods on three popular bioinformatic benchmark datasets. The results illustrated that this proposed method is more effective for bioinformatic multi-label classification compared to commonly used techniques.

  6. Biological classification with RNA-Seq data: Can alternatively spliced transcript expression enhance machine learning classifier?

    PubMed

    Johnson, Nathan T; Dhroso, Andi; Hughes, Katelyn J; Korkin, Dmitry

    2018-06-25

    The extent to which the genes are expressed in the cell can be simplistically defined as a function of one or more factors of the environment, lifestyle, and genetics. RNA sequencing (RNA-Seq) is becoming a prevalent approach to quantify gene expression, and is expected to provide better insights into a number of biological and biomedical questions, compared to the DNA microarrays. Most importantly, RNA-Seq allows quantification of expression at the gene and alternative splicing isoform levels. However, leveraging the RNA-Seq data requires development of new data mining and analytics methods. Supervised machine learning methods are commonly used approaches for biological data analysis, and have recently gained attention for their applications to the RNA-Seq data. In this work, we assess the utility of supervised learning methods trained on RNA-Seq data for a diverse range of biological classification tasks. We hypothesize that the isoform-level expression data is more informative for biological classification tasks than the gene-level expression data. Our large-scale assessment is done through utilizing multiple datasets, organisms, lab groups, and RNA-Seq analysis pipelines. Overall, we performed and assessed 61 biological classification problems that leverage three independent RNA-Seq datasets and include over 2,000 samples that come from multiple organisms, lab groups, and RNA-Seq analyses. These 61 problems include predictions of the tissue type, sex, or age of the sample, healthy or cancerous phenotypes, and the pathological tumor stage for the samples from the cancerous tissue. For each classification problem, the performance of three normalization techniques and six machine learning classifiers was explored. We find that for every single classification problem, the isoform-based classifiers outperform or are comparable with gene expression based methods. The top-performing supervised learning techniques reached a near perfect classification accuracy, demonstrating the utility of supervised learning for RNA-Seq based data analysis. Published by Cold Spring Harbor Laboratory Press for the RNA Society.

  7. Finding minimum gene subsets with heuristic breadth-first search algorithm for robust tumor classification

    PubMed Central

    2012-01-01

    Background Previous studies on tumor classification based on gene expression profiles suggest that gene selection plays a key role in improving the classification performance. Moreover, finding important tumor-related genes with the highest accuracy is a very important task because these genes might serve as tumor biomarkers, which is of great benefit to not only tumor molecular diagnosis but also drug development. Results This paper proposes a novel gene selection method with rich biomedical meaning based on Heuristic Breadth-first Search Algorithm (HBSA) to find as many optimal gene subsets as possible. Due to the curse of dimensionality, this type of method could suffer from over-fitting and selection bias problems. To address these potential problems, a HBSA-based ensemble classifier is constructed using majority voting strategy from individual classifiers constructed by the selected gene subsets, and a novel HBSA-based gene ranking method is designed to find important tumor-related genes by measuring the significance of genes using their occurrence frequencies in the selected gene subsets. The experimental results on nine tumor datasets including three pairs of cross-platform datasets indicate that the proposed method can not only obtain better generalization performance but also find many important tumor-related genes. Conclusions It is found that the frequencies of the selected genes follow a power-law distribution, indicating that only a few top-ranked genes can be used as potential diagnosis biomarkers. Moreover, the top-ranked genes leading to very high prediction accuracy are closely related to specific tumor subtype and even hub genes. Compared with other related methods, the proposed method can achieve higher prediction accuracy with fewer genes. Moreover, they are further justified by analyzing the top-ranked genes in the context of individual gene function, biological pathway, and protein-protein interaction network. PMID:22830977
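
    The two aggregation steps described above, ranking genes by how often they occur in the selected subsets and combining individual classifiers by majority vote, can be sketched as follows. This is a minimal illustration and not the HBSA code: the function names and the toy gene subsets are hypothetical, and the search that produces the subsets is not shown.

```python
from collections import Counter

def rank_genes_by_frequency(gene_subsets):
    """Rank genes by the number of selected subsets in which they occur."""
    counts = Counter(g for subset in gene_subsets for g in set(subset))
    # Most frequent first; ties are broken alphabetically for determinism.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

def majority_vote(predictions):
    """Combine per-classifier label lists (equal length) by majority vote."""
    voted = []
    for labels in zip(*predictions):
        # On a tie, Counter returns the label encountered first.
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted

# Hypothetical subsets returned by a gene subset search.
subsets = [["GENE_A", "GENE_B"], ["GENE_B", "GENE_C"], ["GENE_B", "GENE_A"]]
ranking = rank_genes_by_frequency(subsets)

# Predictions of three classifiers (one per subset) on three samples.
ensemble = majority_vote([[0, 1, 1], [0, 0, 1], [1, 0, 1]])
print(ranking, ensemble)
```

    With an odd number of classifiers and two classes, ties cannot occur, which is one reason to keep the ensemble size odd in a binary setting.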

  8. Functional Assessment of Genetic Variants with Outcomes Adapted to Clinical Decision-Making

    PubMed Central

    Thouvenot, Pierre; Ben Yamin, Barbara; Fourrière, Lou; Lescure, Aurianne; Boudier, Thomas; Del Nery, Elaine; Chauchereau, Anne; Goldgar, David E.; Stoppa-Lyonnet, Dominique; Nicolas, Alain; Millot, Gaël A.

    2016-01-01

    Understanding the medical effect of an ever-growing number of human variants detected is a long term challenge in genetic counseling. Functional assays, based on in vitro or in vivo evaluations of the variant effects, provide essential information, but they require robust statistical validation, as well as adapted outputs, to be implemented in the clinical decision-making process. Here, we assessed 25 pathogenic and 15 neutral missense variants of the BRCA1 breast/ovarian cancer susceptibility gene in four BRCA1 functional assays. Next, we developed a novel approach that refines the variant ranking in these functional assays. Lastly, we developed a computational system that provides a probabilistic classification of variants, adapted to clinical interpretation. Using this system, the best functional assay exhibits a variant classification accuracy estimated at 93%. Additional theoretical simulations highlight the benefit of this ready-to-use system in the classification of variants after functional assessment, which should facilitate the consideration of functional evidences in the decision-making process after genetic testing. Finally, we demonstrate the versatility of the system with the classification of siRNAs tested for human cell growth inhibition in high throughput screening. PMID:27272900

  9. Derivation of an artificial gene to improve classification accuracy upon gene selection.

    PubMed

    Seo, Minseok; Oh, Sejong

    2012-02-01

    Classification analysis has been developed continuously since 1936. This research field has advanced as a result of development of classifiers such as KNN, ANN, and SVM, as well as through data preprocessing areas. Feature (gene) selection is required for very high dimensional data such as microarray before classification work. The goal of feature selection is to choose a subset of informative features that reduces processing time and provides higher classification accuracy. In this study, we devised a method of artificial gene making (AGM) for microarray data to improve classification accuracy. Our artificial gene was derived from a whole microarray dataset, and combined with a result of gene selection for classification analysis. We experimentally confirmed a clear improvement of classification accuracy after inserting artificial gene. Our artificial gene worked well for popular feature (gene) selection algorithms and classifiers. The proposed approach can be applied to any type of high dimensional dataset. Copyright © 2011 Elsevier Ltd. All rights reserved.

  10. SVM Classifier - a comprehensive java interface for support vector machine classification of microarray data.

    PubMed

    Pirooznia, Mehdi; Deng, Youping

    2006-12-12

    Graphical user interface (GUI) software promotes novelty by allowing users to extend the functionality. SVM Classifier is a cross-platform graphical application that handles very large datasets well. The purpose of this study is to create a GUI application that allows SVM users to perform SVM training, classification and prediction. The GUI provides user-friendly access to state-of-the-art SVM methods embodied in the LIBSVM implementation of Support Vector Machine. We implemented the java interface using standard swing libraries. We used a sample data from a breast cancer study for testing classification accuracy. We achieved 100% accuracy in classification among the BRCA1-BRCA2 samples with RBF kernel of SVM. We have developed a java GUI application that allows SVM users to perform SVM training, classification and prediction. We have demonstrated that support vector machines can accurately classify genes into functional categories based upon expression data from DNA microarray hybridization experiments. Among the different kernel functions that we examined, the SVM that uses a radial basis kernel function provides the best performance. The SVM Classifier is available at http://mfgn.usm.edu/ebl/svm/.

  11. Development of a two-stage gene selection method that incorporates a novel hybrid approach using the cuckoo optimization algorithm and harmony search for cancer classification.

    PubMed

    Elyasigomari, V; Lee, D A; Screen, H R C; Shaheed, M H

    2017-03-01

    For each cancer type, only a few genes are informative. Due to the so-called 'curse of dimensionality' problem, the gene selection task remains a challenge. To overcome this problem, we propose a two-stage gene selection method called MRMR-COA-HS. In the first stage, the minimum redundancy and maximum relevance (MRMR) feature selection is used to select a subset of relevant genes. The selected genes are then fed into a wrapper setup that combines a new algorithm, COA-HS, using the support vector machine as a classifier. The method was applied to four microarray datasets, and the performance was assessed by the leave one out cross-validation method. Comparative performance assessment of the proposed method with other evolutionary algorithms suggested that the proposed algorithm significantly outperforms other methods in selecting a fewer number of genes while maintaining the highest classification accuracy. The functions of the selected genes were further investigated, and it was confirmed that the selected genes are biologically relevant to each cancer type. Copyright © 2017. Published by Elsevier Inc.

  12. Positive selection and functional divergence of farnesyl pyrophosphate synthase genes in plants.

    PubMed

    Qian, Jieying; Liu, Yong; Chao, Naixia; Ma, Chengtong; Chen, Qicong; Sun, Jian; Wu, Yaosheng

    2017-02-04

    Farnesyl pyrophosphate synthase (FPS) belongs to the short-chain prenyltransferase family, and it performs a conserved and essential role in the terpenoid biosynthesis pathway. However, its classification, evolutionary history, and the forces driving the evolution of FPS genes in plants remain poorly understood. Phylogeny and positive selection analysis was used to identify the evolutionary forces that led to the functional divergence of FPS in plants, and recombinant detection was undertaken using the Genetic Algorithm for Recombination Detection (GARD) method. The dataset included 68 FPS variation pattern sequences (2 gymnosperms, 10 monocotyledons, 54 dicotyledons, and 2 outgroups). This study revealed that the FPS gene was under positive selection in plants. No recombinant within the FPS gene was found. Therefore, it was inferred that the positive selection of FPS had not been influenced by a recombinant episode. The positively selected sites were mainly located in the catalytic center and functional areas, which indicated that the 98S and 234D were important positively selected sites for plant FPS in the terpenoid biosynthesis pathway. They were located in the FPS conserved domain of the catalytic site. We inferred that the diversification of FPS genes was associated with functional divergence and could be driven by positive selection. It was clear that protein sequence evolution via positive selection was able to drive adaptive diversification in plant FPS proteins. This study provides information on the classification and positive selection of plant FPS genes, and the results could be useful for further research on the regulation of triterpenoid biosynthesis.

  13. Classification of Phylogenetic Profiles for Protein Function Prediction: An SVM Approach

    NASA Astrophysics Data System (ADS)

    Kotaru, Appala Raju; Joshi, Ramesh C.

    Predicting the function of an uncharacterized protein is a major challenge in the post-genomic era due to the complexity and scale of the problem. Knowledge of protein function is a crucial link in the development of new drugs, better crops, and even biochemicals such as biofuels. Recently, numerous high-throughput experimental procedures have been invented to investigate the mechanisms leading to the accomplishment of a protein’s function, and the phylogenetic profile is one of them. A phylogenetic profile is a way of representing a protein that encodes its evolutionary history. In this paper we propose a method for classification of phylogenetic profiles using a supervised machine learning method, support vector machine classification with a radial basis function as kernel, for identifying functionally linked proteins. We experimentally evaluated the performance of the classifier with the linear kernel and the polynomial kernel and compared the results with the existing tree kernel. In our study we used proteins of the budding yeast Saccharomyces cerevisiae genome. We generated the phylogenetic profiles of 2465 yeast genes and used the functional annotations that are available in the MIPS database. Our experiments show that the radial basis kernel performs similarly to the polynomial kernel in some functional classes, that both are better than the linear and tree kernels, and that overall the radial basis kernel outperformed the polynomial, linear and tree kernels. In analyzing these results we show that it is feasible to use an SVM classifier with a radial basis function as kernel to predict gene functionality using phylogenetic profiles.
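
    For reference, the radial basis kernel compared above has the standard form K(x, y) = exp(-gamma * ||x - y||^2). The sketch below evaluates it on toy binary phylogenetic profiles; the profiles, the value of gamma, and the function name `rbf_kernel` are illustrative assumptions, not values from the study.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=0.1):
    """Return the matrix K[i, j] = exp(-gamma * ||X[i] - Y[j]||^2)."""
    sq_dist = (
        (X ** 2).sum(axis=1)[:, None]
        + (Y ** 2).sum(axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))

# Each row is a made-up profile: 1 if a homolog is present in a genome.
profiles = np.array([
    [1, 0, 1, 1],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
], dtype=float)
K = rbf_kernel(profiles, profiles)
print(K)
```

    For binary profiles the squared distance equals the number of genomes in which two profiles disagree, so identical profiles get kernel value 1 and the value decays with each mismatch.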

  14. Alteration of gene expression by zinc oxide nanoparticles or zinc sulfate in vivo and comparison with in vitro data: A harmonious case.

    PubMed

    Zhang, Wei-Dong; Zhao, Yong; Zhang, Hong-Fu; Wang, Shu-Kun; Hao, Zhi-Hui; Liu, Jing; Yuan, Yu-Qing; Zhang, Peng-Fei; Yang, Hong-Di; Shen, Wei; Li, Lan

    2016-08-01

    Granulosa cells (GCs) are those somatic cells closest to the female germ cell. GCs play a vital role in oocyte growth and development, and the oocyte is necessary for multiplication of a species. Zinc oxide (ZnO) nanoparticles (NPs) readily cross biologic barriers to be absorbed into biologic systems, which makes them promising candidates as food additives. The objective of the present investigation was to explore the impact of intact NPs on gene expression and the functional classification of altered genes in hen GCs in vivo, to compare the data from in vivo and in vitro studies, and finally to point out the adverse effects of ZnO NPs on the reproductive system. After a 24-week treatment, hen GCs were isolated and gene expression was quantified. Intact NPs were found in the ovary and other organs. Zn levels were similar in ZnO-NP-100 mg/kg- and ZnSO4-100 mg/kg-treated hen ovaries. ZnO-NP-100 mg/kg and ZnSO4-100 mg/kg regulated the expression of the same sets of genes, and they also altered the expression of different sets of genes individually. The number of genes altered by the ZnO-NP-100 mg/kg and ZnSO4-100 mg/kg treatments was different. Gene Ontology (GO) functional analysis reported different results for the two treatments and, in Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment, 12 pathways (out of the top 20 pathways) in each treatment were different. These results suggested that intact NPs and Zn(2+) had different effects on gene expression in GCs in vivo. In our recent publication, we noted that intact NPs and Zn(2+) differentially altered gene expression in GCs in vitro. However, GO functional classification and KEGG pathway enrichment analyses revealed close similarities for the changed genes in vivo and in vitro after ZnO NP treatment. Furthermore, close similarities were observed for the changed genes after ZnSO4 treatments in vivo and in vitro by GO functional classification and KEGG pathway enrichment analyses. Therefore, the effects of ZnO NPs on gene expression in vitro might represent their effects on gene expression in vivo. The results from this study and our earlier studies support previous findings indicating ZnO NPs promote adverse effects on organisms. Therefore, precautions should be taken when ZnO NPs are used as diet additives for hens because they might cause reproductive issues. Copyright © 2016 Elsevier Inc. All rights reserved.

  15. Classification of lymphoid neoplasms: the microscope as a tool for disease discovery

    PubMed Central

    Harris, Nancy Lee; Stein, Harald; Isaacson, Peter G.

    2008-01-01

    In the past 50 years, we have witnessed explosive growth in the understanding of normal and neoplastic lymphoid cells. B-cell, T-cell, and natural killer (NK)–cell neoplasms in many respects recapitulate normal stages of lymphoid cell differentiation and function, so that they can be to some extent classified according to the corresponding normal stage. Likewise, the molecular mechanisms involved in the pathogenesis of lymphomas and lymphoid leukemias are often based on the physiology of the lymphoid cells, capitalizing on deregulated normal physiology by harnessing the promoters of genes essential for lymphocyte function. The clinical manifestations of lymphomas likewise reflect the normal function of lymphoid cells in vivo. The multiparameter approach to classification adopted by the World Health Organization (WHO) classification has been validated in international studies as being highly reproducible, and enhancing the interpretation of clinical and translational studies. In addition, accurate and precise classification of disease entities facilitates the discovery of the molecular basis of lymphoid neoplasms in the basic science laboratory. PMID:19029456

  16. ACLAME: a CLAssification of Mobile genetic Elements, update 2010.

    PubMed

    Leplae, Raphaël; Lima-Mendez, Gipsi; Toussaint, Ariane

    2010-01-01

    The ACLAME database is dedicated to the collection, analysis and classification of sequenced mobile genetic elements (MGEs, in particular phages and plasmids). In addition to providing information on the MGEs content, classifications are available at various levels of organization. At the gene/protein level, families group similar sequences that are expected to share the same function. Families of four or more proteins are manually assigned with a functional annotation using the GeneOntology and the locally developed ontology MeGO dedicated to MGEs. At the genome level, evolutionary cohesive modules group sets of protein families shared among MGEs. At the population level, networks display the reticulate evolutionary relationships among MGEs. To increase the coverage of the phage sequence space, ACLAME version 0.4 incorporates 760 high-quality predicted prophages selected from the Prophinder database. Most of the data can be downloaded from the freely accessible ACLAME web site (http://aclame.ulb.ac.be). The BLAST interface for querying the database has been extended and numerous tools for in-depth analysis of the results have been added.

  17. Using binary classification to prioritize and curate articles for the Comparative Toxicogenomics Database.

    PubMed

    Vishnyakova, Dina; Pasche, Emilie; Ruch, Patrick

    2012-01-01

    We report on the original integration of an automatic text categorization pipeline, so-called ToxiCat (Toxicogenomic Categorizer), that we developed to perform biomedical documents classification and prioritization in order to speed up the curation of the Comparative Toxicogenomics Database (CTD). The task can be basically described as a binary classification task, where a scoring function is used to rank a selected set of articles. Then components of a question-answering system are used to extract CTD-specific annotations from the ranked list of articles. The ranking function is generated using a Support Vector Machine, which combines three main modules: an information retrieval engine for MEDLINE (EAGLi), a gene normalization service (NormaGene) developed for a previous BioCreative campaign and finally, a set of answering components and entity recognizer for diseases and chemicals. The main components of the pipeline are publicly available both as web application and web services. The specific integration performed for the BioCreative competition is available via a web user interface at http://pingu.unige.ch:8080/Toxicat.

  18. Inference of combinatorial Boolean rules of synergistic gene sets from cancer microarray datasets.

    PubMed

    Park, Inho; Lee, Kwang H; Lee, Doheon

    2010-06-15

    Gene set analysis has become an important tool for the functional interpretation of high-throughput gene expression datasets. Moreover, pattern analyses based on inferred gene set activities of individual samples have shown the ability to identify more robust disease signatures than individual gene-based pattern analyses. Although a number of approaches have been proposed for gene set-based pattern analysis, the combinatorial influence of deregulated gene sets on disease phenotype classification has not been studied sufficiently. We propose a new approach for inferring combinatorial Boolean rules of gene sets for a better understanding of cancer transcriptome and cancer classification. To reduce the search space of the possible Boolean rules, we identify small groups of gene sets that synergistically contribute to the classification of samples into their corresponding phenotypic groups (such as normal and cancer). We then measure the significance of the candidate Boolean rules derived from each group of gene sets; the level of significance is based on the class entropy of the samples selected in accordance with the rules. By applying the present approach to publicly available prostate cancer datasets, we identified 72 significant Boolean rules. Finally, we discuss several identified Boolean rules, such as the rule of glutathione metabolism (down) and prostaglandin synthesis regulation (down), which are consistent with known prostate cancer biology. Scripts written in Python and R are available at http://biosoft.kaist.ac.kr/~ihpark/. The refined gene sets and the full list of the identified Boolean rules are provided in the Supplementary Material. Supplementary data are available at Bioinformatics online.
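    The entropy-based scoring step described in this abstract can be illustrated with a minimal sketch. It assumes each sample carries a phenotype label and discretized gene-set activity calls; the sample records, the field names and the `both_down` rule are hypothetical and are not taken from the authors' scripts. A rule whose selected samples all share one phenotype has class entropy 0, which is the kind of rule the abstract treats as significant.

```python
import math
from collections import Counter


def class_entropy(labels):
    """Shannon entropy (in bits) of a list of class labels; 0.0 if empty or pure."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def rule_entropy(samples, rule):
    """Class entropy of the samples selected by a Boolean rule, plus how many were selected."""
    selected = [s["phenotype"] for s in samples if rule(s["activity"])]
    return class_entropy(selected), len(selected)


# Hypothetical gene-set activity calls per sample (illustrative only, not real data).
samples = [
    {"phenotype": "cancer", "activity": {"glutathione": "down", "prostaglandin": "down"}},
    {"phenotype": "cancer", "activity": {"glutathione": "down", "prostaglandin": "down"}},
    {"phenotype": "normal", "activity": {"glutathione": "up", "prostaglandin": "down"}},
    {"phenotype": "normal", "activity": {"glutathione": "up", "prostaglandin": "up"}},
]


def both_down(activity):
    """Hypothetical rule: glutathione metabolism (down) AND prostaglandin synthesis (down)."""
    return activity["glutathione"] == "down" and activity["prostaglandin"] == "down"


entropy, n_selected = rule_entropy(samples, both_down)
```

    In practice the entropy would be weighed against the number of selected samples, since a rule that selects a single sample is trivially pure; how the authors handle this is not stated in the abstract.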

  19. Multi-label literature classification based on the Gene Ontology graph.

    PubMed

    Jin, Bo; Muller, Brian; Zhai, Chengxiang; Lu, Xinghua

    2008-12-08

    The Gene Ontology is a controlled vocabulary for representing knowledge related to genes and proteins in a computable form. The current effort of manually annotating proteins with the Gene Ontology is outpaced by the rate of accumulation of biomedical knowledge in literature, which urges the development of text mining approaches to facilitate the process by automatically extracting the Gene Ontology annotation from literature. The task is usually cast as a text classification problem, and contemporary methods are confronted with unbalanced training data and the difficulties associated with multi-label classification. In this research, we investigated the methods of enhancing automatic multi-label classification of biomedical literature by utilizing the structure of the Gene Ontology graph. We have studied three graph-based multi-label classification algorithms, including a novel stochastic algorithm and two top-down hierarchical classification methods for multi-label literature classification. We systematically evaluated and compared these graph-based classification algorithms to a conventional flat multi-label algorithm. The results indicate that, through utilizing the information from the structure of the Gene Ontology graph, the graph-based multi-label classification methods can significantly improve predictions of the Gene Ontology terms implied by the analyzed text. Furthermore, the graph-based multi-label classifiers are capable of suggesting Gene Ontology annotations (to curators) that are closely related to the true annotations even if they fail to predict the true ones directly. A software package implementing the studied algorithms is available for the research community. Through utilizing the information from the structure of the Gene Ontology graph, the graph-based multi-label classification methods have better potential than the conventional flat multi-label classification approach to facilitate protein annotation based on the literature.

  20. Exceptions to the rule: case studies in the prediction of pathogenicity for genetic variants in hereditary cancer genes.

    PubMed

    Rosenthal, E T; Bowles, K R; Pruss, D; van Kan, A; Vail, P J; McElroy, H; Wenstrup, R J

    2015-12-01

    Based on current consensus guidelines and standard practice, many genetic variants detected in clinical testing are classified as disease causing based on their predicted impact on the normal expression or function of the gene in the absence of additional data. However, our laboratory has identified a subset of such variants in hereditary cancer genes for which compelling contradictory evidence emerged after the initial evaluation following the first observation of the variant. Three representative examples of variants in BRCA1, BRCA2 and MSH2 that are predicted to disrupt splicing, prematurely truncate the protein, or remove the start codon were evaluated for pathogenicity by analyzing clinical data with multiple classification algorithms. Available clinical data for all three variants contradicts the expected pathogenic classification. These variants illustrate potential pitfalls associated with standard approaches to variant classification as well as the challenges associated with monitoring data, updating classifications, and reporting potentially contradictory interpretations to the clinicians responsible for translating test outcomes to appropriate clinical action. It is important to address these challenges now as the model for clinical testing moves toward the use of large multi-gene panels and whole exome/genome analysis, which will dramatically increase the number of genetic variants identified. © 2015 The Authors. Clinical Genetics published by John Wiley & Sons A/S. Published by John Wiley & Sons Ltd.

  1. Identification of an Efficient Gene Expression Panel for Glioblastoma Classification

    PubMed Central

    Zelaya, Ivette; Laks, Dan R.; Zhao, Yining; Kawaguchi, Riki; Gao, Fuying; Kornblum, Harley I.; Coppola, Giovanni

    2016-01-01

    We present here a novel genetic algorithm-based random forest (GARF) modeling technique that enables a reduction in the complexity of large gene disease signatures to highly accurate, greatly simplified gene panels. When applied to 803 glioblastoma multiforme samples, this method allowed the 840-gene Verhaak et al. gene panel (the standard in the field) to be reduced to a 48-gene classifier, while retaining 90.91% classification accuracy, and outperforming the best available alternative methods. Additionally, using this approach we produced a 32-gene panel which allows for better consistency between RNA-seq and microarray-based classifications, improving cross-platform classification retention from 69.67% to 86.07%. A webpage producing these classifications is available at http://simplegbm.semel.ucla.edu. PMID:27855170

  2. Annotation and Classification of CRISPR-Cas Systems

    PubMed Central

    Makarova, Kira S.; Koonin, Eugene V.

    2018-01-01

    The clustered regularly interspaced short palindromic repeats (CRISPR)-Cas (CRISPR-associated proteins) is a prokaryotic adaptive immune system that is represented in most archaea and many bacteria. Among the currently known prokaryotic defense systems, the CRISPR-Cas genomic loci show unprecedented complexity and diversity. Classification of CRISPR-Cas variants that would capture their evolutionary relationships to the maximum possible extent is essential for comparative genomic and functional characterization of this theoretically and practically important system of adaptive immunity. To this end, a multipronged approach has been developed that combines phylogenetic analysis of the conserved Cas proteins with comparison of gene repertoires and arrangements in CRISPR-Cas loci. This approach led to the current classification of CRISPR-Cas systems into three distinct types and ten subtypes for each of which signature genes have been identified. Comparative genomic analysis of the CRISPR-Cas systems in new archaeal and bacterial genomes performed over the 3 years elapsed since the development of this classification makes it clear that new types and subtypes of CRISPR-Cas need to be introduced. Moreover, this classification system captures only part of the complexity of CRISPR-Cas organization and evolution, due to the intrinsic modularity and evolutionary mobility of these immunity systems, resulting in numerous recombinant variants. Moreover, most of the cas genes evolve rapidly, complicating the family assignment for many Cas proteins and the use of family profiles for the recognition of CRISPR-Cas subtype signatures. Further progress in the comparative analysis of CRISPR-Cas systems requires integration of the most sensitive sequence comparison tools, protein structure comparison, and refined approaches for comparison of gene neighborhoods. PMID:25981466

  3. Annotation and Classification of CRISPR-Cas Systems.

    PubMed

    Makarova, Kira S; Koonin, Eugene V

    2015-01-01

    The clustered regularly interspaced short palindromic repeats (CRISPR)-Cas (CRISPR-associated proteins) is a prokaryotic adaptive immune system that is represented in most archaea and many bacteria. Among the currently known prokaryotic defense systems, the CRISPR-Cas genomic loci show unprecedented complexity and diversity. Classification of CRISPR-Cas variants that would capture their evolutionary relationships to the maximum possible extent is essential for comparative genomic and functional characterization of this theoretically and practically important system of adaptive immunity. To this end, a multipronged approach has been developed that combines phylogenetic analysis of the conserved Cas proteins with comparison of gene repertoires and arrangements in CRISPR-Cas loci. This approach led to the current classification of CRISPR-Cas systems into three distinct types and ten subtypes for each of which signature genes have been identified. Comparative genomic analysis of the CRISPR-Cas systems in new archaeal and bacterial genomes performed over the 3 years elapsed since the development of this classification makes it clear that new types and subtypes of CRISPR-Cas need to be introduced. Moreover, this classification system captures only part of the complexity of CRISPR-Cas organization and evolution, due to the intrinsic modularity and evolutionary mobility of these immunity systems, resulting in numerous recombinant variants. Moreover, most of the cas genes evolve rapidly, complicating the family assignment for many Cas proteins and the use of family profiles for the recognition of CRISPR-Cas subtype signatures. Further progress in the comparative analysis of CRISPR-Cas systems requires integration of the most sensitive sequence comparison tools, protein structure comparison, and refined approaches for comparison of gene neighborhoods.

  4. Metagenomics of an Alkaline Hot Spring in Galicia (Spain): Microbial Diversity Analysis and Screening for Novel Lipolytic Enzymes.

    PubMed

    López-López, Olalla; Knapik, Kamila; Cerdán, Maria-Esperanza; González-Siso, María-Isabel

    2015-01-01

    A fosmid library was constructed with the metagenomic DNA from the water of the Lobios hot spring (76°C, pH = 8.2) located in Ourense (Spain). Metagenomic sequencing of the fosmid library allowed the assembly of 9722 contigs ranging in size from 500 to 56,677 bp and spanning ~18 Mbp. 23,207 ORFs (Open Reading Frames) were predicted from the assembly. Biodiversity was explored by taxonomic classification and it revealed that bacteria were predominant, while the archaea were less abundant. The six most abundant bacterial phyla were Deinococcus-Thermus, Proteobacteria, Firmicutes, Acidobacteria, Aquificae, and Chloroflexi. Within the archaeal superkingdom, the phylum Thaumarchaeota was predominant with the dominant species "Candidatus Caldiarchaeum subterraneum." Functional classification revealed the genes associated with one-carbon metabolism as the most abundant. Both taxonomic and functional classifications showed a mixture of different microbial metabolic patterns: aerobic and anaerobic, chemoorganotrophic and chemolithotrophic, autotrophic and heterotrophic. Remarkably, the presence of genes encoding enzymes with potential biotechnological interest, such as xylanases, galactosidases, proteases, and lipases, was also revealed in the metagenomic library. Functional screening of this library was subsequently done looking for genes encoding lipolytic enzymes. Six genes conferring lipolytic activity were identified and one was cloned and characterized. This gene was named LOB4Est and it was expressed in a yeast mesophilic host. LOB4Est codes for a novel esterase of family VIII, with sequence similarity to β-lactamases, but with unusually wide substrate specificity. When the enzyme was purified from the mesophilic host it showed a half-life of 1 h and 43 min at 50°C, and maximal activity at 40°C and pH 7.5 with p-nitrophenyl-laurate as substrate. Interestingly, the enzyme retained more than 80% of maximal activity in a broad range of pH from 6.5 to 8.

  5. Phylogeny-dominant classification of J-proteins in Arabidopsis thaliana and Brassica oleracea.

    PubMed

    Zhang, Bin; Qiu, Han-Lin; Qu, Dong-Hai; Ruan, Ying; Chen, Dong-Hong

    2018-04-05

    Hsp40s or DnaJ/J-proteins are evolutionarily conserved in all organisms as co-chaperones of molecular chaperone HSP70s that mainly participate in maintaining cellular protein homeostasis, such as protein folding, assembly, stabilization, and translocation under normal conditions as well as refolding and degradation under environmental stresses. It has been reported that Arabidopsis J-proteins are classified into four classes (types A-D) according to domain organization, but their phylogenetic relationships are unknown. Here, we identified 129 J-proteins in the world-wide popular vegetable Brassica oleracea, a close relative of the model plant Arabidopsis, and also revised the information of Arabidopsis J-proteins based on the latest online bioresources. According to phylogenetic analysis with domain organization and gene structure as references, the J-proteins from Arabidopsis and B. oleracea were classified into 15 main clades (I-XV) separated by a number of undefined small branches with remote relationship. Based on the number of members, they respectively belong to multigene clades, oligo-gene clades, and mono-gene clades. The J-protein genes from different clades may function together or separately to constitute a complicated regulatory network. This study provides a constructive viewpoint for J-protein classification and an informative platform for further functional dissection and resistant genes discovery related to genetic improvement of crop plants.

  6. Mouse Vk gene classification by nucleic acid sequence similarity.

    PubMed

    Strohal, R; Helmberg, A; Kroemer, G; Kofler, R

    1989-01-01

    Analyses of immunoglobulin (Ig) variable (V) region gene usage in the immune response, estimates of V gene germline complexity, and other nucleic acid hybridization-based studies depend on the extent to which such genes are related (i.e., sequence similarity) and their organization in gene families. While mouse Igh heavy chain V region (VH) gene families are relatively well-established, a corresponding systematic classification of Igk light chain V region (Vk) genes has not been reported. The present analysis, in the course of which we reviewed the known extent of the Vk germline gene repertoire and Vk gene usage in a variety of responses to foreign and self antigens, provides a classification of mouse Vk genes in gene families composed of members with greater than 80% overall nucleic acid sequence similarity. This classification differed in several aspects from that of VH genes: only some Vk gene families were as clearly separated (by greater than 25% sequence dissimilarity) as typical VH gene families; most Vk gene families were closely related and, in several instances, members from different families were very similar (greater than 80%) over large sequence portions; frequently, classification by nucleic acid sequence similarity diverged from existing classifications based on amino-terminal protein sequence similarity. Our data have implications for Vk gene analyses by nucleic acid hybridization and describe potentially important differences in sequence organization between VH and Vk genes.
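    The idea of grouping genes into families whose members share more than 80% nucleic acid sequence similarity can be sketched as follows. This is only an illustration under stated assumptions: similarity is taken as the fraction of identical positions between pre-aligned, equal-length sequences, and genes are linked by single linkage; the abstract does not say which alignment or linkage criterion the authors used. The sequences and gene names are toy values.

```python
def identity(a, b):
    """Fraction of matching positions between two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def group_families(seqs, threshold=0.80):
    """Single-linkage grouping: two genes are linked when identity exceeds the threshold."""
    names = list(seqs)
    parent = {n: n for n in names}

    def find(n):
        # Union-find root lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if identity(seqs[a], seqs[b]) > threshold:
                parent[find(a)] = find(b)

    families = {}
    for n in names:
        families.setdefault(find(n), []).append(n)
    return sorted(sorted(f) for f in families.values())


# Toy sequences (hypothetical, not real Vk genes).
toy = {
    "Vk_a": "ACGTACGTAC",
    "Vk_b": "ACGTACGTAA",  # 9 of 10 positions identical to Vk_a
    "Vk_c": "TTGGCCAATT",
}
families = group_families(toy)
```

    Single linkage can chain loosely related genes into one family, which is consistent with the abstract's remark that members of different Vk families were sometimes very similar over large sequence portions; a stricter criterion such as complete linkage would split such cases.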

  7. Identifying gnostic predictors of the vaccine response

    PubMed Central

    Haining, W. Nicholas; Pulendran, Bali

    2012-01-01

    Molecular predictors of the response to vaccination could transform vaccine development. They would allow larger numbers of vaccine candidates to be rapidly screened, shortening the development time for new vaccines. Gene-expression based predictors of vaccine response have shown early promise. However, a limitation of gene-expression based predictors is that they often fail to reveal the mechanistic basis for their ability to classify response. Linking predictive signatures to the function of their component genes would advance basic understanding of vaccine immunity and also improve the robustness of outcome classification. New analytic tools now allow more biological meaning to be extracted from predictive signatures. Functional genomic approaches to perturb gene expression in mammalian cells permit the function of predictive genes to be surveyed in highly parallel experiments. The challenge for vaccinologists is therefore to use these tools to embed mechanistic insights into predictors of vaccine response. PMID:22633886

  8. Genome-wide classification, evolutionary analysis and gene expression patterns of the kinome in Gossypium

    PubMed Central

    Yan, Jun; Li, Guilin; Guo, Xingqi; Li, Yang; Cao, Xuecheng

    2018-01-01

    The protein kinase (PK, kinome) family is one of the largest families in plants and regulates almost all aspects of plant processes, including plant development and stress responses. Despite their important functions, a comprehensive functional classification, evolutionary analysis and expression profiling of the cotton PK gene family have yet to be performed. In this study, we identified the cotton kinomes in the Gossypium raimondii, Gossypium arboreum, Gossypium hirsutum and Gossypium barbadense genomes and classified them into 7 groups and 122–24 subfamilies using software HMMER v3.0 scanning and neighbor-joining (NJ) phylogenetic analysis. Some conserved exon-intron structures were identified not only in cotton species but also in primitive plants, ferns and moss, suggesting the significant function and ancient origination of these PK genes. Collinearity analysis revealed that 16.6 million years ago (Mya) cotton-specific whole genome duplication (WGD) events may have played a partial role in the expansion of the cotton kinomes, whereas tandem duplication (TD) events mainly contributed to the expansion of the cotton RLK group. Synteny analysis revealed that tetraploidization of G. hirsutum and G. barbadense contributed to the expansion of G. hirsutum and G. barbadense PKs. Global expression analysis of cotton PKs revealed stress-specific and fiber development-related expression patterns, suggesting that many cotton PKs might be involved in the regulation of the stress response and fiber development processes. This study provides foundational information for further studies on the evolution and molecular function of cotton PKs. PMID:29768506

  9. An improved method for functional similarity analysis of genes based on Gene Ontology.

    PubMed

    Tian, Zhen; Wang, Chunyu; Guo, Maozu; Liu, Xiaoyan; Teng, Zhixia

    2016-12-23

    Measures of gene functional similarity are essential tools for gene clustering, gene function prediction, evaluation of protein-protein interaction, disease gene prioritization and other applications. In recent years, many gene functional similarity methods have been proposed based on the semantic similarity of GO terms. However, these leading approaches may make error-prone judgments, especially when they measure the specificity of GO terms as well as the IC of a term set. Therefore, how to estimate the gene functional similarity reliably is still a challenging problem. We propose WIS, an effective method to measure the gene functional similarity. First of all, WIS computes the IC of a term by employing its depth, the number of its ancestors as well as the topology of its descendants in the GO graph. Secondly, WIS calculates the IC of a term set by means of considering the weighted inherited semantics of terms. Finally, WIS estimates the gene functional similarity based on the IC overlap ratio of term sets. WIS is superior to some other representative measures on the experiments of functional classification of genes in a biological pathway, collaborative evaluation of GO-based semantic similarity measures, protein-protein interaction prediction and correlation with gene expression. Further analysis suggests that WIS takes fully into account the specificity of terms and the weighted inherited semantics of terms between GO terms. The proposed WIS method is an effective and reliable way to compare gene function. The web service of WIS is freely available at http://nclab.hit.edu.cn/WIS/ .

  10. Gene selection for tumor classification using neighborhood rough sets and entropy measures.

    PubMed

    Chen, Yumin; Zhang, Zunjun; Zheng, Jianzhong; Ma, Ying; Xue, Yu

    2017-03-01

    With the development of bioinformatics, tumor classification from gene expression data has become an important technology for cancer diagnosis. Since a gene expression dataset often contains thousands of genes and a small number of samples, gene selection from gene expression data becomes a key step for tumor classification. Attribute reduction of rough sets has been successfully applied to the gene selection field, as it is data-driven and requires no additional information. However, the traditional rough set method deals with discrete data only. Gene expression data containing real-valued or noisy data are usually subjected to discretization preprocessing, which may result in poor classification accuracy. In this paper, we propose a novel gene selection method based on the neighborhood rough set model, which has the ability of dealing with real-valued data whilst maintaining the original gene classification information. Moreover, this paper addresses an entropy measure under the frame of neighborhood rough sets for tackling the uncertainty and noise of gene expression data. The utilization of this measure can bring about a discovery of compact gene subsets. Finally, a gene selection algorithm is designed based on neighborhood granules and the entropy measure. Some experiments on two gene expression datasets show that the proposed gene selection is an effective method for improving the accuracy of tumor classification. Copyright © 2017 Elsevier Inc. All rights reserved.

  11. Grouped gene selection and multi-classification of acute leukemia via new regularized multinomial regression.

    PubMed

    Li, Juntao; Wang, Yanyan; Jiang, Tao; Xiao, Huimin; Song, Xuekun

    2018-05-09

    Diagnosing acute leukemia is a necessary prerequisite to treating it. Multi-classification of acute leukemia gene expression data, which covers B-cell acute lymphoblastic leukemia (BALL), T-cell acute lymphoblastic leukemia (TALL) and acute myeloid leukemia (AML), is helpful for diagnosis. However, selecting cancer-causing genes is a challenging problem in performing multi-classification. In this paper, weighted gene co-expression networks are employed to divide the genes into groups. Based on these groups, a new regularized multinomial regression with overlapping group lasso penalty (MROGL) is presented to simultaneously perform multi-classification and select gene groups. By implementing this method on three-class acute leukemia data, the grouped genes which work synergistically are identified, and the overlapped genes shared by different groups are also highlighted. Moreover, MROGL outperforms five other methods on multi-classification accuracy. Copyright © 2017. Published by Elsevier B.V.

  12. Practical application of self-organizing maps to interrelate biodiversity and functional data in NGS-based metagenomics.

    PubMed

    Weber, Marc; Teeling, Hanno; Huang, Sixing; Waldmann, Jost; Kassabgy, Mariette; Fuchs, Bernhard M; Klindworth, Anna; Klockow, Christine; Wichels, Antje; Gerdts, Gunnar; Amann, Rudolf; Glöckner, Frank Oliver

    2011-05-01

    Next-generation sequencing (NGS) technologies have enabled the application of broad-scale sequencing in microbial biodiversity and metagenome studies. Biodiversity is usually targeted by classifying 16S ribosomal RNA genes, while metagenomic approaches target metabolic genes. However, both approaches remain isolated, as long as the taxonomic and functional information cannot be interrelated. Techniques like self-organizing maps (SOMs) have been applied to cluster metagenomes into taxon-specific bins in order to link biodiversity with functions, but have not been applied to broad-scale NGS-based metagenomics yet. Here, we provide a novel implementation, demonstrate its potential and practicability, and provide a web-based service for public usage. Evaluation with published data sets mimicking varyingly complex habitats resulted in classification specificities and sensitivities of close to 100% to above 90% from phylum to genus level for assemblies exceeding 8 kb for low and medium complexity data. When applied to five real-world metagenomes of medium complexity from direct pyrosequencing of marine subsurface waters, classifications of assemblies above 2.5 kb were in good agreement with fluorescence in situ hybridizations, indicating that biodiversity was mostly retained within the metagenomes, and confirming high classification specificities. This was validated by two protein-based classification (PBC) methods. SOMs were able to retrieve the relevant taxa down to the genus level, while surpassing PBCs in resolution. In order to make the approach accessible to a broad audience, we implemented a feature-rich web-based SOM application named TaxSOM, which is freely available at http://www.megx.net/toolbox/taxsom. TaxSOM can classify reads or assemblies exceeding 2.5 kb with high accuracy and thus assists in linking biodiversity and functions in metagenome studies, which is a precondition to study microbial ecology in a holistic fashion.

  13. Discriminant analysis for fast multiclass data classification through regularized kernel function approximation.

    PubMed

    Ghorai, Santanu; Mukherjee, Anirban; Dutta, Pranab K

    2010-06-01

    In this brief we have proposed the multiclass data classification by computationally inexpensive discriminant analysis through vector-valued regularized kernel function approximation (VVRKFA). VVRKFA being an extension of fast regularized kernel function approximation (FRKFA), provides the vector-valued response at single step. The VVRKFA finds a linear operator and a bias vector by using a reduced kernel that maps a pattern from feature space into the low dimensional label space. The classification of patterns is carried out in this low dimensional label subspace. A test pattern is classified depending on its proximity to class centroids. The effectiveness of the proposed method is experimentally verified and compared with multiclass support vector machine (SVM) on several benchmark data sets as well as on gene microarray data for multi-category cancer classification. The results indicate the significant improvement in both training and testing time compared to that of multiclass SVM with comparable testing accuracy principally in large data sets. Experiments in this brief also serve as comparison of performance of VVRKFA with stratified random sampling and sub-sampling.
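    The final decision step mentioned in this abstract, assigning a test pattern to the class whose centroid is nearest in the low dimensional label space, can be sketched as below. This is not the VVRKFA method itself: the kernel mapping into label space is omitted, and the points are assumed to be already projected. All coordinates and class names are hypothetical.

```python
import math


def class_centroids(points, labels):
    """Mean vector of the projected training points for each class."""
    sums, counts = {}, {}
    for p, y in zip(points, labels):
        acc = sums.setdefault(y, [0.0] * len(p))
        for i, v in enumerate(p):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def classify(point, centroids):
    """Assign the class whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda y: math.dist(point, centroids[y]))


# Hypothetical training patterns, assumed already mapped into a 2-D label space.
train_points = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
train_labels = ["A", "A", "B", "B"]

centroids = class_centroids(train_points, train_labels)
predicted = classify([0.7, 0.3], centroids)
```

    Because only one distance per class is computed at test time, this step is cheap regardless of training set size, which fits the abstract's emphasis on reduced testing time.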

  14. A methodology to migrate the gene ontology to a description logic environment using DAML+OIL.

    PubMed

    Wroe, C J; Stevens, R; Goble, C A; Ashburner, M

    2003-01-01

    The Gene Ontology Next Generation Project (GONG) is developing a staged methodology to evolve the current representation of the Gene Ontology into DAML+OIL in order to take advantage of the richer formal expressiveness and the reasoning capabilities of the underlying description logic. Each stage provides a step level increase in formal explicit semantic content with a view to supporting validation, extension and multiple classification of the Gene Ontology. The paper introduces DAML+OIL and demonstrates the activity within each stage of the methodology and the functionality gained.

  15. Feature genes predicting the FLT3/ITD mutation in acute myeloid leukemia

    PubMed Central

    LI, CHENGLONG; ZHU, BIAO; CHEN, JIAO; HUANG, XIAOBING

    2016-01-01

    In the present study, gene expression profiles of acute myeloid leukemia (AML) samples were analyzed to identify feature genes with the capacity to predict the mutation status of FLT3/ITD. Two machine learning models, namely the support vector machine (SVM) and random forest (RF) methods, were used for classification. Four datasets were downloaded from the European Bioinformatics Institute, two of which (containing 371 samples, including 281 FLT3/ITD mutation-negative and 90 mutation-positive samples) were randomly defined as the training group, while the other two datasets (containing 488 samples, including 350 FLT3/ITD mutation-negative and 138 mutation-positive samples) were defined as the test group. Differentially expressed genes (DEGs) were identified by significance analysis of the microarray data by using the training samples. The classification efficiency of the SVM and RF methods was evaluated using the following parameters: Sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV) and the area under the receiver operating characteristic curve. Functional enrichment analysis was performed for the feature genes with DAVID. A total of 585 DEGs were identified in the training group, of which 580 were upregulated and five were downregulated. The classification accuracy rates of the two methods for the training group, the test group and the combined group using the 585 feature genes were >90%. For the SVM and RF methods, the rates of correct determination, specificity and PPV were >90%, while the sensitivity and NPV were >80%. The SVM method produced a slightly better classification effect than the RF method. A total of 13 biological pathways were overrepresented by the feature genes, mainly involving energy metabolism, chromatin organization and translation. The feature genes identified in the present study may be used to predict the mutation status of FLT3/ITD in patients with AML. PMID:27177049
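    The evaluation parameters named in this abstract follow directly from a binary confusion matrix. The sketch below shows the standard definitions, treating mutation-positive as the positive class; the counts passed in are hypothetical and are not results from the study. The area under the ROC curve is left out because it requires per-sample scores rather than counts.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Hypothetical counts for illustration only.
metrics = diagnostic_metrics(tp=40, fp=10, tn=90, fn=10)
```

    With imbalanced classes, as in the mutation-negative majority described above, accuracy and specificity can remain high while sensitivity lags, which is why the abstract reports the measures separately.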

  16. Feature genes predicting the FLT3/ITD mutation in acute myeloid leukemia.

    PubMed

    Li, Chenglong; Zhu, Biao; Chen, Jiao; Huang, Xiaobing

    2016-07-01

    In the present study, gene expression profiles of acute myeloid leukemia (AML) samples were analyzed to identify feature genes with the capacity to predict the mutation status of FLT3/ITD. Two machine learning models, namely the support vector machine (SVM) and random forest (RF) methods, were used for classification. Four datasets were downloaded from the European Bioinformatics Institute, two of which (containing 371 samples, including 281 FLT3/ITD mutation-negative and 90 mutation-positive samples) were randomly defined as the training group, while the other two datasets (containing 488 samples, including 350 FLT3/ITD mutation-negative and 138 mutation-positive samples) were defined as the test group. Differentially expressed genes (DEGs) were identified by significance analysis of the microarray data by using the training samples. The classification efficiency of the SVM and RF methods was evaluated using the following parameters: Sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV) and the area under the receiver operating characteristic curve. Functional enrichment analysis was performed for the feature genes with DAVID. A total of 585 DEGs were identified in the training group, of which 580 were upregulated and five were downregulated. The classification accuracy rates of the two methods for the training group, the test group and the combined group using the 585 feature genes were >90%. For the SVM and RF methods, the rates of correct determination, specificity and PPV were >90%, while the sensitivity and NPV were >80%. The SVM method produced a slightly better classification effect than the RF method. A total of 13 biological pathways were overrepresented by the feature genes, mainly involving energy metabolism, chromatin organization and translation. The feature genes identified in the present study may be used to predict the mutation status of FLT3/ITD in patients with AML.

  17. MorphDB: Prioritizing Genes for Specialized Metabolism Pathways and Gene Ontology Categories in Plants.

    PubMed

    Zwaenepoel, Arthur; Diels, Tim; Amar, David; Van Parys, Thomas; Shamir, Ron; Van de Peer, Yves; Tzfadia, Oren

    2018-01-01

    Recent times have seen an enormous growth of "omics" data, of which high-throughput gene expression data are arguably the most important from a functional perspective. Despite huge improvements in computational techniques for the functional classification of gene sequences, common similarity-based methods often fall short of providing full and reliable functional information. Recently, the combination of comparative genomics with approaches in functional genomics has received considerable interest for gene function analysis, leveraging both gene expression based guilt-by-association methods and annotation efforts in closely related model organisms. Besides the identification of missing genes in pathways, these methods also typically enable the discovery of biological regulators (i.e., transcription factors or signaling genes). A previously built guilt-by-association method is MORPH, which was proven to be an efficient algorithm that performs particularly well in identifying and prioritizing missing genes in plant metabolic pathways. Here, we present MorphDB, a resource where MORPH-based candidate genes for large-scale functional annotations (Gene Ontology, MapMan bins) are integrated across multiple plant species. Besides a gene centric query utility, we present a comparative network approach that enables researchers to efficiently browse MORPH predictions across functional gene sets and species, facilitating efficient gene discovery and candidate gene prioritization. MorphDB is available at http://bioinformatics.psb.ugent.be/webtools/morphdb/morphDB/index/. We also provide a toolkit, named "MORPH bulk" (https://github.com/arzwa/morph-bulk), for running MORPH in bulk mode on novel data sets, enabling researchers to apply MORPH to their own species of interest.

  18. Genome-wide identification, characterization and classification of ionotropic glutamate receptor genes (iGluRs) in the malaria vector Anopheles sinensis (Diptera: Culicidae).

    PubMed

    Wang, Ting-Ting; Si, Feng-Ling; He, Zheng-Bo; Chen, Bin

    2018-01-15

    Ionotropic glutamate receptors (iGluRs) are conserved ligand-gated ion channel receptors, and ionotropic receptors (IRs) were revealed as a new family of iGluRs. Their subdivision was unsettled, and their characteristics are little known. Anopheles sinensis is a major malaria vector in eastern Asia, and its genome was recently well sequenced and annotated. We identified iGluR genes in the An. sinensis genome, analyzed their characteristics including gene structure, genome distribution, domains and specific sites by bioinformatic methods, and deduced phylogenetic relationships of all iGluRs in An. sinensis, Anopheles gambiae and Drosophila melanogaster. Based on the characteristics and phylogenetics, we generated the classification of iGluRs, and comparatively analyzed the intron number and selective pressure of three iGluRs subdivisions, iGluR group, Antenna IR and Divergent IR subfamily. A total of 56 iGluR genes were identified and named in the whole-genome of An. sinensis. These genes were located on 18 scaffolds, and 31 of them (29 being IRs) are distributed into 10 clusters that are suggested to form mainly from recent gene duplication. These iGluRs can be divided into four groups: NMDA, non-NMDA, Antenna IR and Divergent IR based on feature comparison and phylogenetic analysis. IR8a and IR25a were suggested to be monophyletic, named as Putative in the study, and moved from the Antenna subfamily in the IR family to the non-NMDA group as a sister of traditional non-NMDA. The generated iGluRs of genes (including NMDA and regenerated non-NMDA) are relatively conserved, and have a more complicated gene structure, smaller ω values and some specific functional sites. The iGluR genes in An. sinensis, An. gambiae and D. melanogaster have amino-terminal domain (ATD), ligand binding domain (LBD) and Lig_Chan domains, except for IR8a that only has the LBD and Lig_Chan domains. 
However, the new concept IR family of genes (including regenerated Antenna IR, and Divergent IR), especially for Divergent IR are more variable, have a simpler gene structure (intron loss phenomenon) and larger ω values, and lack specific functional sites. These IR genes have no other domains except for Antenna IRs that only have the Lig_Chan domain. This study provides a comprehensive information framework for iGluR genes in An. sinensis, and generated the classification of iGluRs by feature and bioinformatics analyses. The work lays the foundation for further functional study of these genes.

  19. Synaptic genes are extensively downregulated across multiple brain regions in normal human aging and Alzheimer’s disease

    PubMed Central

    Berchtold, Nicole C.; Coleman, Paul D.; Cribbs, David H.; Rogers, Joseph; Gillen, Daniel L.; Cotman, Carl W.

    2014-01-01

    Synapses are essential for transmitting, processing, and storing information, all of which decline in aging and Alzheimer’s disease (AD). Because synapse loss only partially accounts for the cognitive declines seen in aging and AD, we hypothesized that existing synapses might undergo molecular changes that reduce their functional capacity. Microarrays were used to evaluate expression profiles of 340 synaptic genes in aging (20–99 years) and AD across 4 brain regions from 81 cases. The analysis revealed an unexpectedly large number of significant expression changes in synapse-related genes in aging, with many undergoing progressive downregulation across aging and AD. Functional classification of the genes showing altered expression revealed that multiple aspects of synaptic function are affected, notably synaptic vesicle trafficking and release, neurotransmitter receptors and receptor trafficking, postsynaptic density scaffolding, cell adhesion regulating synaptic stability, and neuromodulatory systems. The widespread declines in synaptic gene expression in normal aging suggests that function of existing synapses might be impaired, and that a common set of synaptic genes are vulnerable to change in aging and AD. PMID:23273601

  20. Genic insights from integrated human proteomics in GeneCards.

    PubMed

    Fishilevich, Simon; Zimmerman, Shahar; Kohn, Asher; Iny Stein, Tsippi; Olender, Tsviya; Kolker, Eugene; Safran, Marilyn; Lancet, Doron

    2016-01-01

    GeneCards is a one-stop shop for searchable human gene annotations (http://www.genecards.org/). Data are automatically mined from ∼120 sources and presented in an integrated web card for every human gene. We report the application of recent advances in proteomics to enhance gene annotation and classification in GeneCards. First, we constructed the Human Integrated Protein Expression Database (HIPED), a unified database of protein abundance in human tissues, based on the publically available mass spectrometry (MS)-based proteomics sources ProteomicsDB, Multi-Omics Profiling Expression Database, Protein Abundance Across Organisms and The MaxQuant DataBase. The integrated database, residing within GeneCards, compares favourably with its individual sources, covering nearly 90% of human protein-coding genes. For gene annotation and comparisons, we first defined a protein expression vector for each gene, based on normalized abundances in 69 normal human tissues. This vector is portrayed in the GeneCards expression section as a bar graph, allowing visual inspection and comparison. These data are juxtaposed with transcriptome bar graphs. Using the protein expression vectors, we further defined a pairwise metric that helps assess expression-based pairwise proximity. This new metric for finding functional partners complements eight others, including sharing of pathways, gene ontology (GO) terms and domains, implemented in the GeneCards Suite. In parallel, we calculated proteome-based differential expression, highlighting a subset of tissues that overexpress a gene and subserving gene classification. This textual annotation allows users of VarElect, the suite's next-generation phenotyper, to more effectively discover causative disease variants. Finally, we define the protein-RNA expression ratio and correlation as yet another attribute of every gene in each tissue, adding further annotative information. 
The results constitute a significant enhancement of several GeneCards sections and help promote and organize the genome-wide structural and functional knowledge of the human proteome. Database URL:http://www.genecards.org/. © The Author(s) 2016. Published by Oxford University Press.

  1. Ectodermal dysplasias: a new clinical-genetic classification

    PubMed Central

    Priolo, M.; Lagana, C.

    2001-01-01

    The ectodermal dysplasias (EDs) are a large and complex nosological group of diseases, first described by Thurnam in 1848. In the last 10 years more than 170 different pathological clinical conditions have been recognised and defined as EDs, all sharing in common anomalies of the hair, teeth, nails, and sweat glands. Many are associated with anomalies in other organs and systems and, in some conditions, with mental retardation.
The anomalies affecting the epidermis and epidermal appendages are extremely variable and clinical overlap is present among the majority of EDs. Most EDs are defined by particular clinical signs (for example, eyelid adhesion in AEC syndrome, ectrodactyly in EEC). To date, few causative genes have been identified for these diseases.
We recently reviewed genes known to be responsible for EDs in light of their molecular and biological function and proposed a new approach to EDs, integrating both molecular-genetic data and corresponding clinical findings. Based on our previous report, we now propose a clinical-genetic classification of EDs, expand it to other entities in which no causative genes have been identified based on the phenotype, and speculate on possible candidate genes suggested by associated "non-ectodermal" features.


Keywords: ectodermal dysplasia; clinical-functional correlation; epithelial-mesenchymal interaction; ectodermal structural proteins PMID:11546825

  2. Classification of Microarray Data Using Kernel Fuzzy Inference System

    PubMed Central

    Kumar Rath, Santanu

    2014-01-01

    The DNA microarray classification technique has gained more popularity in both research and practice. In real data analysis, such as microarray data, the dataset contains a huge number of insignificant and irrelevant features that tend to lose useful information. Classes with high relevance and feature sets with high significance are generally referred for the selected features, which determine the classification of samples into their respective classes. In this paper, the kernel fuzzy inference system (K-FIS) algorithm is applied to classify the microarray data (leukemia) using the t-test as a feature selection method. Kernel functions are used to map original data points into a higher-dimensional (possibly infinite-dimensional) feature space defined by a (usually nonlinear) function ϕ through a mathematical process called the kernel trick. This paper also presents a comparative study for classification using K-FIS along with the support vector machine (SVM) for different sets of features (genes). Performance parameters available in the literature such as precision, recall, specificity, F-measure, ROC curve, and accuracy are considered to analyze the efficiency of the classification model. From the proposed approach, it is apparent that the K-FIS model obtains similar results when compared with the SVM model. This is an indication that the proposed approach relies on the kernel function. PMID:27433543
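
    The kernel trick mentioned in this abstract can be illustrated with a small numerical check: a kernel evaluated in the original space gives the same value as an ordinary dot product taken after the explicit map ϕ. The sketch below uses the homogeneous degree-2 polynomial kernel k(x, y) = (x·y)^2 purely as an assumed example; the abstract does not say which kernels K-FIS uses, and the function names `phi`, `dot` and `poly2_kernel` are hypothetical.

```python
import math
from itertools import combinations


def phi(x):
    """Explicit feature map for k(x, y) = (x . y)^2.

    Maps d inputs to d squared terms plus d*(d-1)/2 scaled cross terms.
    """
    squares = [v * v for v in x]
    crosses = [math.sqrt(2.0) * x[i] * x[j]
               for i, j in combinations(range(len(x)), 2)]
    return squares + crosses


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(p * q for p, q in zip(a, b))


def poly2_kernel(x, y):
    """Kernel evaluated in the original space, without building phi."""
    return dot(x, y) ** 2


x = [1.0, 2.0, 3.0]
y = [0.5, -1.0, 2.0]

explicit = dot(phi(x), phi(y))  # works in the 6-dimensional feature space
implicit = poly2_kernel(x, y)   # works in the original 3-dimensional space
print(explicit, implicit)
```

    The two values agree up to floating-point rounding, while the implicit form never constructs the higher-dimensional vectors. That saving is what makes very high-dimensional (or infinite-dimensional) feature spaces usable in practice.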

  3. Growth condition dependency is the major cause of non-responsiveness upon genetic perturbation

    PubMed Central

    Amini, Saman; Holstege, Frank C. P.

    2017-01-01

    Investigating the role and interplay between individual proteins in biological processes is often performed by assessing the functional consequences of gene inactivation or removal. Depending on the sensitivity of the assay used for determining phenotype, between 66% (growth) and 53% (gene expression) of Saccharomyces cerevisiae gene deletion strains show no defect when analyzed under a single condition. Although it is well known that this non-responsive behavior is caused by different types of redundancy mechanisms or by growth condition/cell type dependency, it is not known what the relative contribution of these different causes is. Understanding the underlying causes of and their relative contribution to non-responsive behavior upon genetic perturbation is extremely important for designing efficient strategies aimed at elucidating gene function and unraveling complex cellular systems. Here, we provide a systematic classification of the underlying causes of and their relative contribution to non-responsive behavior upon gene deletion. The overall contribution of redundancy to non-responsive behavior is estimated at 29%, of which approximately 17% is due to homology-based redundancy and 12% is due to pathway-based redundancy. The major determinant of non-responsiveness is condition dependency (71%). For approximately 14% of protein complexes, just-in-time assembly can be put forward as a potential mechanistic explanation for how proteins can be regulated in a condition dependent manner. Taken together, the results underscore the large contribution of growth condition requirement to non-responsive behavior, which needs to be taken into account for strategies aimed at determining gene function. The classification provided here can also be further harnessed in systematic analyses of complex cellular systems. PMID:28257504

  4. Multiclass classification of microarray data samples with a reduced number of genes

    PubMed Central

    2011-01-01

    Background Multiclass classification of microarray data samples with a reduced number of genes is a rich and challenging problem in Bioinformatics research. The problem gets harder as the number of classes is increased. In addition, the performance of most classifiers is tightly linked to the effectiveness of mandatory gene selection methods. Critical to gene selection is the availability of estimates about the maximum number of genes that can be handled by any classification algorithm. Lack of such estimates may lead to either computationally demanding explorations of a search space with thousands of dimensions or classification models based on gene sets of unrestricted size. In the former case, unbiased but possibly overfitted classification models may arise. In the latter case, biased classification models unable to support statistically significant findings may be obtained. Results A novel bound on the maximum number of genes that can be handled by binary classifiers in binary mediated multiclass classification algorithms of microarray data samples is presented. The bound suggests that high-dimensional binary output domains might favor the existence of accurate and sparse binary mediated multiclass classifiers for microarray data samples. Conclusions A comprehensive experimental work shows that the bound is indeed useful to induce accurate and sparse multiclass classifiers for microarray data samples. PMID:21342522

  5. Smart, Injury-Triggered Therapy for Ocular Trauma

    DTIC Science & Technology

    2016-10-01

    prognosis due to retinal cell death, scar formation, and lack of functional regeneration. Proliferative vitreoretinopathy (PVR), a form of intraocular...Proteases, Metalloproteinases, Cell death, Gene Therapy...vision has a poor prognosis due to retinal cell death, scar formation, and lack of functional regeneration. Proliferative vitreoretinopathy (PVR), a

  6. Ontological function annotation of long non-coding RNAs through hierarchical multi-label classification.

    PubMed

    Zhang, Jingpu; Zhang, Zuping; Wang, Zixiang; Liu, Yuting; Deng, Lei

    2018-05-15

    Long non-coding RNAs (lncRNAs) are an enormous collection of functional non-coding RNAs. Over the past decades, a large number of novel lncRNA genes have been identified. However, most of the lncRNAs remain functionally uncharacterized at present. Computational approaches provide a new insight to understand the potential functional implications of lncRNAs. Considering that each lncRNA may have multiple functions and a function may be further specialized into sub-functions, here we describe NeuraNetL2GO, a computational ontological function prediction approach for lncRNAs using a hierarchical multi-label classification strategy based on multiple neural networks. The neural networks are incrementally trained level by level, each performing the prediction of gene ontology (GO) terms belonging to a given level. In NeuraNetL2GO, we use topological features of the lncRNA similarity network as the input of the neural networks and employ the output results to annotate the lncRNAs. We show that NeuraNetL2GO achieves the best performance and the overall advantage in maximum F-measure and coverage on the manually annotated lncRNA2GO-55 dataset compared to other state-of-the-art methods. The source code and data are available at http://denglab.org/NeuraNetL2GO/. Contact: leideng@csu.edu.cn. Supplementary data are available at Bioinformatics online.

  7. Genome-wide analysis of TCP family in tobacco.

    PubMed

    Chen, L; Chen, Y Q; Ding, A M; Chen, H; Xia, F; Wang, W F; Sun, Y H

    2016-05-23

    The TCP family is a transcription factor family, members of which are extensively involved in plant growth and development as well as in signal transduction in the response against many physiological and biochemical stimuli. In the present study, 61 TCP genes were identified in the tobacco (Nicotiana tabacum) genome. Bioinformatic methods were employed for predicting and analyzing the gene structure, gene expression, phylogenetic analysis, and conserved domains of TCP proteins in tobacco. The 61 NtTCP genes were divided into three diverse groups, based on the division of TCP genes in tomato and Arabidopsis, and the results of the conserved domain and sequence analyses further confirmed the classification of the NtTCP genes. The expression pattern of NtTCP also demonstrated that the majority of these genes play important roles in all the tissues, while some special genes exercise their functions only in specific tissues. In brief, the comprehensive and thorough study of the TCP family in other plants provides sufficient resources for studying the structure and functions of TCPs in tobacco.

  8. The functional therapeutic chemical classification system.

    PubMed

    Croset, Samuel; Overington, John P; Rebholz-Schuhmann, Dietrich

    2014-03-15

    Drug repositioning is the discovery of new indications for compounds that have already been approved and used in a clinical setting. Recently, some computational approaches have been suggested to unveil new opportunities in a systematic fashion, by taking into consideration gene expression signatures or chemical features for instance. We present here a novel method based on knowledge integration using semantic technologies, to capture the functional role of approved chemical compounds. In order to computationally generate repositioning hypotheses, we used the Web Ontology Language to formally define the semantics of over 20 000 terms with axioms to correctly denote various modes of action (MoA). Based on an integration of public data, we have automatically assigned over a thousand approved drugs into these MoA categories. The resulting new resource is called the Functional Therapeutic Chemical Classification System and was further evaluated against the content of the traditional Anatomical Therapeutic Chemical Classification System. We illustrate how the new classification can be used to generate drug repurposing hypotheses, using Alzheimer's disease as a use-case. Availability: https://www.ebi.ac.uk/chembl/ftc; https://github.com/loopasam/ftc. Contact: croset@ebi.ac.uk. Supplementary data are available at Bioinformatics online.

  9. Mining disease fingerprints from within genetic pathways.

    PubMed

    Nabhan, Ahmed Ragab; Sarkar, Indra Neil

    2012-01-01

    Mining biological networks can be an effective means to uncover system level knowledge out of micro level associations, such as encapsulated in genetic pathways. Analysis of human disease genetic pathways can lead to the identification of major mechanisms that may underlie disorders at an abstract functional level. The focus of this study was to develop an approach for structural pattern analysis and classification of genetic pathways of diseases. A probabilistic model was developed to capture characteristic components ('fingerprints') of functionally annotated pathways. A probability estimation procedure of this model searched for fingerprints in each disease pathway while improving probability estimates of model parameters. The approach was evaluated on data from the Kyoto Encyclopedia of Genes and Genomes (consisting of 56 pathways across seven disease categories). Based on the achieved average classification accuracy of up to ~77%, the findings suggest that these fingerprints may be used for classification and discovery of genetic pathways.

  10. Mining Disease Fingerprints From Within Genetic Pathways

    PubMed Central

    Nabhan, Ahmed Ragab; Sarkar, Indra Neil

    2012-01-01

    Mining biological networks can be an effective means to uncover system level knowledge out of micro level associations, such as encapsulated in genetic pathways. Analysis of human disease genetic pathways can lead to the identification of major mechanisms that may underlie disorders at an abstract functional level. The focus of this study was to develop an approach for structural pattern analysis and classification of genetic pathways of diseases. A probabilistic model was developed to capture characteristic components (‘fingerprints’) of functionally annotated pathways. A probability estimation procedure of this model searched for fingerprints in each disease pathway while improving probability estimates of model parameters. The approach was evaluated on data from the Kyoto Encyclopedia of Genes and Genomes (consisting of 56 pathways across seven disease categories). Based on the achieved average classification accuracy of up to ∼77%, the findings suggest that these fingerprints may be used for classification and discovery of genetic pathways. PMID:23304411

  11. Optimal number of features as a function of sample size for various classification rules.

    PubMed

    Hua, Jianping; Xiong, Zixiang; Lowey, James; Suh, Edward; Dougherty, Edward R

    2005-04-15

    Given the joint feature-label distribution, increasing the number of features always results in decreased classification error; however, this is not the case when a classifier is designed via a classification rule from sample data. Typically (but not always), for fixed sample size, the error of a designed classifier decreases and then increases as the number of features grows. The potential downside of using too many features is most critical for small samples, which are commonplace for gene-expression-based classifiers for phenotype discrimination. For fixed sample size and feature-label distribution, the issue is to find an optimal number of features. Since only in rare cases is there a known distribution of the error as a function of the number of features and sample size, this study employs simulation for various feature-label distributions and classification rules, and across a wide range of sample and feature-set sizes. To achieve the desired end, finding the optimal number of features as a function of sample size, it employs massively parallel computation. Seven classifiers are treated: 3-nearest-neighbor, Gaussian kernel, linear support vector machine, polynomial support vector machine, perceptron, regular histogram and linear discriminant analysis. Three Gaussian-based models are considered: linear, nonlinear and bimodal. In addition, real patient data from a large breast-cancer study is considered. To mitigate the combinatorial search for finding optimal feature sets, and to model the situation in which subsets of genes are co-regulated and correlation is internal to these subsets, we assume that the covariance matrix of the features is blocked, with each block corresponding to a group of correlated features. Altogether there are a large number of error surfaces for the many cases. These are provided in full on a companion website, which is meant to serve as resource for those working with small-sample classification. 
For the companion website, please visit http://public.tgen.org/tamu/ofs/. Contact: e-dougherty@ee.tamu.edu.
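
    The behaviour described in this abstract, where the error of a designed classifier first falls and then rises as features are added at fixed sample size, can be explored with a small simulation. The sketch below is a toy setup under stated assumptions and is not the study's simulation: it uses a nearest-centroid rule (not one of the seven classifiers listed), independent unit-variance Gaussian features, and class means of ±0.5/(k+1) for feature k so that later features carry less signal. All function names and parameter values are hypothetical.

```python
import random


def sample(rng, label, d):
    """Draw one d-feature point; feature k has class mean +/- 0.5 / (k + 1)."""
    sign = 1.0 if label == 1 else -1.0
    return [rng.gauss(sign * 0.5 / (k + 1), 1.0) for k in range(d)]


def centroid(points):
    """Per-feature mean of a list of equal-length points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def holdout_error(rng, d, n_train, n_test):
    """Design a nearest-centroid classifier on n_train points per class,
    then estimate its error on n_test fresh points."""
    c0 = centroid([sample(rng, 0, d) for _ in range(n_train)])
    c1 = centroid([sample(rng, 1, d) for _ in range(n_train)])
    wrong = 0
    for _ in range(n_test):
        label = rng.randrange(2)
        x = sample(rng, label, d)
        pred = 1 if sq_dist(x, c1) < sq_dist(x, c0) else 0
        wrong += pred != label
    return wrong / n_test


def error_curve(feature_counts, n_train=10, n_test=400, repeats=10, seed=0):
    """Average held-out error for each feature-set size, at fixed sample size."""
    rng = random.Random(seed)
    curve = {}
    for d in feature_counts:
        errs = [holdout_error(rng, d, n_train, n_test) for _ in range(repeats)]
        curve[d] = sum(errs) / repeats
    return curve


curve = error_curve([1, 2, 5, 10, 20, 50])
for d, err in curve.items():
    print(d, round(err, 3))
```

    With only ten training points per class, the weak later features mostly add estimation noise to the centroids, so the averaged error would typically stop improving and drift upward as d grows. The exact shape depends on the seed and on the assumed signal decay, so the printed curve should be read as an illustration and not as a reproduction of the error surfaces on the companion website.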

  12. Evaluating Functional Annotations of Enzymes Using the Gene Ontology.

    PubMed

    Holliday, Gemma L; Davidson, Rebecca; Akiva, Eyal; Babbitt, Patricia C

    2017-01-01

    The Gene Ontology (GO) (Ashburner et al., Nat Genet 25(1):25-29, 2000) is a powerful tool in the informatics arsenal of methods for evaluating annotations in a protein dataset. From identifying the nearest well annotated homologue of a protein of interest to predicting where misannotation has occurred to knowing how confident you can be in the annotations assigned to those proteins is critical. In this chapter we explore what makes an enzyme unique and how we can use GO to infer aspects of protein function based on sequence similarity. These can range from identification of misannotation or other errors in a predicted function to accurate function prediction for an enzyme of entirely unknown function. Although GO annotation applies to any gene products, we focus here on describing our approach for hierarchical classification of enzymes in the Structure-Function Linkage Database (SFLD) (Akiva et al., Nucleic Acids Res 42(Database issue):D521-530, 2014) as a guide for informed utilisation of annotation transfer based on GO terms.

  13. Genome-wide identification, phylogeny and expression analyses of SCARECROW-LIKE(SCL) genes in millet (Setaria italica).

    PubMed

    Liu, Hongyun; Qin, Jiajia; Fan, Hui; Cheng, Jinjin; Li, Lin; Liu, Zheng

    2017-07-01

    As a member of the GRAS gene family, SCARECROW-LIKE (SCL) genes encode transcriptional regulators that are involved in plant information transmission and signal transduction. In this study, 44 SCL genes including two SCARECROW genes in millet were identified to be distributed on eight chromosomes, except chromosome 6. All the millet genes contain motifs 6-8, indicating that these motifs are conserved during the evolution. SCL genes of millet were divided into eight groups based on the phylogenetic relationship and classification of Arabidopsis SCL genes. Several putative millet orthologous genes in Arabidopsis, maize and rice were identified. High throughput RNA sequencing revealed that the expressions of millet SCL genes in root, stem, leaf, spica, and along leaf gradient varied greatly. Analyses combining the gene expression patterns, gene structures, motif compositions, promoter cis-elements identification, alternative splicing of transcripts and phylogenetic relationship of SCL genes indicate that these genes may play diverse functions. Functionally characterized SCL genes in maize, rice and Arabidopsis would provide us some clues for future characterization of their homologues in millet. To the best of our knowledge, this is the first study of millet SCL genes at the genome-wide level. Our work provides a useful platform for functional analysis of SCL genes in millet, a model crop for C4 photosynthesis and bioenergy studies.

  14. Genome-wide identification and expression analysis of the ClTCP transcription factors in Citrullus lanatus.

    PubMed

    Shi, Pibiao; Guy, Kateta Malangisha; Wu, Weifang; Fang, Bingsheng; Yang, Jinghua; Zhang, Mingfang; Hu, Zhongyuan

    2016-04-12

    The plant-specific TCP transcription factor family, which is involved in the regulation of cell growth and proliferation, performs diverse functions in multiple aspects of plant growth and development. However, no comprehensive analysis of the TCP family in watermelon (Citrullus lanatus) has been undertaken previously. A total of 27 watermelon TCP encoding genes distributed on nine chromosomes were identified. Phylogenetic analysis clustered the genes into 11 distinct subgroups. Furthermore, phylogenetic and structural analyses distinguished two homology classes within the ClTCP family, designated Class I and Class II. The Class II genes were differentiated into two subclasses, the CIN subclass and the CYC/TB1 subclass. The expression patterns of all members were determined by semi-quantitative PCR. The functions of two ClTCP genes, ClTCP14a and ClTCP15, in regulating plant height were confirmed by ectopic expression in Arabidopsis wild-type and ortholog mutants. This study represents the first genome-wide analysis of the watermelon TCP gene family, which provides valuable information for understanding the classification and functions of the TCP genes in watermelon.

  15. Binary Classification using Decision Tree based Genetic Programming and Its Application to Analysis of Bio-mass Data

    NASA Astrophysics Data System (ADS)

    To, Cuong; Pham, Tuan D.

    2010-01-01

    In machine learning, pattern recognition may be the most popular task. Identification of "similar" patterns is also very important in biology because, first, it is useful for prediction of patterns associated with disease, for example cancer tissue (normal or tumor); second, similarity or dissimilarity of the kinetic patterns is used to identify coordinately controlled genes or proteins involved in the same regulatory process; third, similar genes (proteins) share similar functions. In this paper, we present an algorithm which uses genetic programming to create decision trees for binary classification problems. The algorithm was applied to five real biological databases. Based on the results of comparisons with well-known methods, we see that the algorithm is outstanding in most cases.

  16. Mycobacteriophage genome database.

    PubMed

    Joseph, Jerrine; Rajendran, Vasanthi; Hassan, Sameer; Kumar, Vanaja

    2011-01-01

    Mycobacteriophage genome database (MGDB) is an exclusive repository of the 64 completely sequenced mycobacteriophages with annotated information. It is a comprehensive compilation of the various gene parameters captured from several databases pooled together to empower mycobacteriophage researchers. The MGDB (Version No.1.0) comprises 6086 genes from 64 mycobacteriophages classified into 72 families based on the ACLAME database. Manual curation was aided by information available from public databases which was enriched further by analysis. Its web interface allows browsing as well as querying the classification. The main objective is to collect and organize the complexity inherent to mycobacteriophage protein classification in a rational way. The other objective is to browse the existing and new genomes and describe their functional annotation. The database is available for free at http://mpgdb.ibioinformatics.org/mpgdb.php.

  17. Defining functional distance using manifold embeddings of gene ontology annotations

    PubMed Central

    Lerman, Gilad; Shakhnovich, Boris E.

    2007-01-01

    Although rigorous measures of similarity for sequence and structure are now well established, the problem of defining functional relationships has been particularly daunting. Here, we present several manifold embedding techniques to compute distances between Gene Ontology (GO) functional annotations and consequently estimate functional distances between protein domains. To evaluate accuracy, we correlate the functional distance to the well established measures of sequence, structural, and phylogenetic similarities. Finally, we show that manual classification of structures into folds and superfamilies is mirrored by proximity in the newly defined function space. We show how functional distances place structure–function relationships in biological context resulting in insight into divergent and convergent evolution. The methods and results in this paper can be readily generalized and applied to a wide array of biologically relevant investigations, such as accuracy of annotation transference, the relationship between sequence, structure, and function, or coherence of expression modules. PMID:17595300

  18. Characterization and classification of zebrafish brain morphology mutants

    PubMed Central

    Lowery, Laura Anne; De Rienzo, Gianluca; Gutzman, Jennifer H.; Sive, Hazel

    2010-01-01

    The mechanisms by which the vertebrate brain achieves its three-dimensional structure are clearly complex, requiring the functions of many genes. Using the zebrafish as a model, we have begun to define genes required for brain morphogenesis, including brain ventricle formation, by studying 16 mutants previously identified as having embryonic brain morphology defects. We report the phenotypic characterization of these mutants at several time-points, using brain ventricle dye injection, imaging, and immunohistochemistry with neuronal markers. Most of these mutants display early phenotypes, affecting initial brain shaping, while others show later phenotypes, affecting brain ventricle expansion. In the early phenotype group, we further define four phenotypic classes and corresponding functions required for brain morphogenesis. Although we did not use known genotypes for this classification, basing it solely on phenotypes, many mutants with defects in functionally related genes clustered in a single class. In particular, class 1 mutants show midline separation defects, corresponding to epithelial junction defects; class 2 mutants show reduced brain ventricle size; class 3 mutants show midbrain-hindbrain abnormalities, corresponding to basement membrane defects; and class 4 mutants show absence of ventricle lumen inflation, corresponding to defective ion pumping. Later brain ventricle expansion requires the extracellular matrix, cardiovascular circulation, and transcription/splicing-dependent events. We suggest that these mutants define processes likely to be used during brain morphogenesis throughout the vertebrates. PMID:19051268

  19. Evolutionary divergence and functions of the human interleukin (IL) gene family

    PubMed Central

    2010-01-01

    Cytokines play a very important role in nearly all aspects of inflammation and immunity. The term 'interleukin' (IL) has been used to describe a group of cytokines with complex immunomodulatory functions -- including cell proliferation, maturation, migration and adhesion. These cytokines also play an important role in immune cell differentiation and activation. Determining the exact function of a particular cytokine is complicated by the influence of the producing cell type, the responding cell type and the phase of the immune response. ILs can also have pro- and anti-inflammatory effects, further complicating their characterisation. These molecules are under constant pressure to evolve due to continual competition between the host's immune system and infecting organisms; as such, ILs have undergone significant evolution. This has resulted in little amino acid conservation between orthologous proteins, which further complicates the gene family organisation. Within the literature there are a number of overlapping nomenclature and classification systems derived from biological function, receptor-binding properties and originating cell type. Determining evolutionary relationships between ILs therefore can be confusing. More recently, crystallographic data and the identification of common structural motifs have led to a more accurate classification system. To date, the known ILs can be divided into four major groups based on distinguishing structural features. These groups include the genes encoding the IL1-like cytokines, the class I helical cytokines (IL4-like, γ-chain and IL6/12-like), the class II helical cytokines (IL10-like and IL28-like) and the IL17-like cytokines. In addition, there are a number of ILs that do not fit into any of the above groups, due either to their unique structural features or lack of structural information. This suggests that the gene family organisation may be subject to further change in the near future. PMID:21106488

  20. A Bayesian taxonomic classification method for 16S rRNA gene sequences with improved species-level accuracy.

    PubMed

    Gao, Xiang; Lin, Huaiying; Revanna, Kashi; Dong, Qunfeng

    2017-05-10

    Species-level classification for 16S rRNA gene sequences remains a serious challenge for microbiome researchers, because existing taxonomic classification tools for 16S rRNA gene sequences either do not provide species-level classification, or their classification results are unreliable. The unreliable results are due to the limitations in the existing methods which either lack solid probabilistic-based criteria to evaluate the confidence of their taxonomic assignments, or use nucleotide k-mer frequency as the proxy for sequence similarity measurement. We have developed a method that shows significantly improved species-level classification results over existing methods. Our method calculates true sequence similarity between query sequences and database hits using pairwise sequence alignment. Taxonomic classifications are assigned from the species to the phylum levels based on the lowest common ancestors of multiple database hits for each query sequence, and further classification reliabilities are evaluated by bootstrap confidence scores. The novelty of our method is that the contribution of each database hit to the taxonomic assignment of the query sequence is weighted by a Bayesian posterior probability based upon the degree of sequence similarity of the database hit to the query sequence. Our method does not need any training datasets specific for different taxonomic groups. Instead only a reference database is required for aligning to the query sequences, making our method easily applicable for different regions of the 16S rRNA gene or other phylogenetic marker genes. Reliable species-level classification for 16S rRNA or other phylogenetic marker genes is critical for microbiome research. Our software shows significantly higher classification accuracy than the existing tools and we provide probabilistic-based confidence scores to evaluate the reliability of our taxonomic classification assignments based on multiple database matches to query sequences. Despite its higher computational costs, our method is still suitable for analyzing large-scale microbiome datasets for practical purposes. Furthermore, our method can be applied for taxonomic classification of any phylogenetic marker gene sequences. Our software, called BLCA, is freely available at https://github.com/qunfengdong/BLCA .
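
    The idea of letting each database hit contribute to a taxonomic assignment in proportion to a weight can be illustrated with a small sketch. This is not the BLCA implementation: the function name weighted_assignment, the min_support threshold, and the use of plain non-negative weights in place of Bayesian posterior probabilities are all assumptions made for illustration, and the bootstrap step described in the abstract is omitted.

```python
from collections import defaultdict

# Ranks from broadest to most specific (illustrative subset).
RANKS = ["phylum", "class", "order", "family", "genus", "species"]


def weighted_assignment(hits, min_support=0.8):
    """Assign taxonomy rank by rank from weighted database hits.

    hits: list of (lineage, weight) pairs, where lineage maps each rank
          in RANKS to a taxon name and weight is a non-negative number
          (a stand-in for a similarity-derived posterior; hypothetical).
    Returns a dict rank -> (taxon, support), stopping at the first rank
    whose best-supported taxon falls below min_support.
    """
    total = sum(weight for _, weight in hits)
    if total <= 0:
        return {}
    assignment = {}
    for rank in RANKS:
        votes = defaultdict(float)
        for lineage, weight in hits:
            votes[lineage[rank]] += weight
        taxon, best = max(votes.items(), key=lambda kv: kv[1])
        support = best / total
        if support < min_support:
            break  # deeper ranks are left unassigned
        assignment[rank] = (taxon, support)
    return assignment
```

    With two hits that agree down to genus but differ at species, the sketch would return an assignment that stops at genus, which mirrors the lowest-common-ancestor behaviour in a simplified form.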

  1. A three-way approach for protein function classification

    PubMed Central

    2017-01-01

    The knowledge of protein functions plays an essential role in understanding biological cells and has a significant impact on human life in areas such as personalized medicine, better crops and improved therapeutic interventions. Due to the expense and inherent difficulty of biological experiments, intelligent methods are generally relied upon for automatic assignment of functions to proteins. The technological advancements in the field of biology are improving our understanding of biological processes and are regularly resulting in new features and characteristics that better describe the role of proteins. It is inevitable to neglect and overlook these anticipated features in designing more effective classification techniques. A key issue in this context, that is not being sufficiently addressed, is how to build effective classification models and approaches for protein function prediction by incorporating and taking advantage of the ever-evolving biological information. In this article, we propose a three-way decision making approach which provides provisions for seeking and incorporating future information. We considered probabilistic rough sets based models such as Game-Theoretic Rough Sets (GTRS) and Information-Theoretic Rough Sets (ITRS) for inducing three-way decisions. An architecture of protein functions classification with probabilistic rough sets based three-way decisions is proposed and explained. Experiments are carried out on a Saccharomyces cerevisiae dataset obtained from the UniProt database with the corresponding functional classes extracted from the Gene Ontology (GO) database. The results indicate that as the level of biological information increases, the number of deferred cases is reduced while maintaining a similar level of accuracy. PMID:28234929

  2. A three-way approach for protein function classification.

    PubMed

    Ur Rehman, Hafeez; Azam, Nouman; Yao, JingTao; Benso, Alfredo

    2017-01-01

    The knowledge of protein functions plays an essential role in understanding biological cells and has a significant impact on human life in areas such as personalized medicine, better crops and improved therapeutic interventions. Due to the expense and inherent difficulty of biological experiments, intelligent methods are generally relied upon for automatic assignment of functions to proteins. The technological advancements in the field of biology are improving our understanding of biological processes and are regularly resulting in new features and characteristics that better describe the role of proteins. It is inevitable to neglect and overlook these anticipated features in designing more effective classification techniques. A key issue in this context, that is not being sufficiently addressed, is how to build effective classification models and approaches for protein function prediction by incorporating and taking advantage of the ever-evolving biological information. In this article, we propose a three-way decision making approach which provides provisions for seeking and incorporating future information. We considered probabilistic rough sets based models such as Game-Theoretic Rough Sets (GTRS) and Information-Theoretic Rough Sets (ITRS) for inducing three-way decisions. An architecture of protein functions classification with probabilistic rough sets based three-way decisions is proposed and explained. Experiments are carried out on a Saccharomyces cerevisiae dataset obtained from the UniProt database with the corresponding functional classes extracted from the Gene Ontology (GO) database. The results indicate that as the level of biological information increases, the number of deferred cases is reduced while maintaining a similar level of accuracy.

  3. An efficient ensemble learning method for gene microarray classification.

    PubMed

    Osareh, Alireza; Shadgar, Bita

    2013-01-01

    Gene microarray analysis and classification have proven to be an effective approach to the diagnosis of diseases and cancers. However, it has also been shown that the basic classification techniques have intrinsic drawbacks in achieving accurate gene classification and cancer diagnosis. On the other hand, classifier ensembles have received increasing attention in various applications. Here, we address the gene classification issue using the RotBoost ensemble methodology. This method is a combination of the Rotation Forest and AdaBoost techniques, which in turn preserves both desirable features of an ensemble architecture, that is, accuracy and diversity. To select a concise subset of informative genes, 5 different feature selection algorithms are considered. To assess the efficiency of RotBoost, other nonensemble/ensemble techniques including Decision Trees, Support Vector Machines, Rotation Forest, AdaBoost, and Bagging are also deployed. Experimental results have revealed that the combination of the fast correlation-based feature selection method with the ICA-based RotBoost ensemble is highly effective for gene classification. In fact, the proposed method can create ensemble classifiers which outperform not only the classifiers produced by conventional machine learning but also the classifiers generated by two widely used conventional ensemble learning methods, that is, Bagging and AdaBoost.

  4. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Wu, Hong; Zeng, Hong; Lam, Robert

    Mismatch repair prevents the accumulation of erroneous insertions/deletions and non-Watson–Crick base pairs in the genome. Pathogenic mutations in the MLH1 gene are associated with a predisposition to Lynch and Turcot's syndromes. Although genetic testing for these mutations is available, robust classification of variants requires strong clinical and functional support. Here, the first structure of the N-terminus of human MLH1, determined by X-ray crystallography, is described. Lastly, the structure shares a high degree of similarity with previously determined prokaryotic MLH1 homologs; however, this structure affords a more accurate platform for the classification of MLH1 variants.

  5. Practical application of self-organizing maps to interrelate biodiversity and functional data in NGS-based metagenomics

    PubMed Central

    Weber, Marc; Teeling, Hanno; Huang, Sixing; Waldmann, Jost; Kassabgy, Mariette; Fuchs, Bernhard M; Klindworth, Anna; Klockow, Christine; Wichels, Antje; Gerdts, Gunnar; Amann, Rudolf; Glöckner, Frank Oliver

    2011-01-01

    Next-generation sequencing (NGS) technologies have enabled the application of broad-scale sequencing in microbial biodiversity and metagenome studies. Biodiversity is usually targeted by classifying 16S ribosomal RNA genes, while metagenomic approaches target metabolic genes. However, both approaches remain isolated, as long as the taxonomic and functional information cannot be interrelated. Techniques like self-organizing maps (SOMs) have been applied to cluster metagenomes into taxon-specific bins in order to link biodiversity with functions, but have not been applied to broad-scale NGS-based metagenomics yet. Here, we provide a novel implementation, demonstrate its potential and practicability, and provide a web-based service for public usage. Evaluation with published data sets mimicking varyingly complex habitats resulted in classification specificities and sensitivities of close to 100% to above 90% from phylum to genus level for assemblies exceeding 8 kb for low and medium complexity data. When applied to five real-world metagenomes of medium complexity from direct pyrosequencing of marine subsurface waters, classifications of assemblies above 2.5 kb were in good agreement with fluorescence in situ hybridizations, indicating that biodiversity was mostly retained within the metagenomes, and confirming high classification specificities. This was validated by two protein-based classification (PBC) methods. SOMs were able to retrieve the relevant taxa down to the genus level, while surpassing PBCs in resolution. In order to make the approach accessible to a broad audience, we implemented a feature-rich web-based SOM application named TaxSOM, which is freely available at http://www.megx.net/toolbox/taxsom. TaxSOM can classify reads or assemblies exceeding 2.5 kb with high accuracy and thus assists in linking biodiversity and functions in metagenome studies, which is a precondition to study microbial ecology in a holistic fashion. PMID:21160538

  6. FlyBase: genes and gene models

    PubMed Central

    Drysdale, Rachel A.; Crosby, Madeline A.

    2005-01-01

    FlyBase (http://flybase.org) is the primary repository of genetic and molecular data of the insect family Drosophilidae. For the most extensively studied species, Drosophila melanogaster, a wide range of data are presented in integrated formats. Data types include mutant phenotypes, molecular characterization of mutant alleles and aberrations, cytological maps, wild-type expression patterns, anatomical images, transgenic constructs and insertions, sequence-level gene models and molecular classification of gene product functions. There is a growing body of data for other Drosophila species; this is expected to increase dramatically over the next year, with the completion of draft-quality genomic sequences of an additional 11 Drosophila species. PMID:15608223

  7. Gene duplications in prokaryotes can be associated with environmental adaptation

    PubMed Central

    2010-01-01

    Background: Gene duplication is a normal evolutionary process. If there is no selective advantage in keeping the duplicated gene, it is usually reduced to a pseudogene and disappears from the genome. However, some paralogs are retained. These gene products are likely to be beneficial to the organism, e.g. in adaptation to new environmental conditions. The aim of our analysis is to investigate the properties of paralog-forming genes in prokaryotes, and to analyse the role of these retained paralogs by relating gene properties to life style of the corresponding prokaryotes. Results: Paralogs were identified in a number of prokaryotes, and these paralogs were compared to singletons of persistent orthologs based on functional classification. This showed that the paralogs were associated with for example energy production, cell motility, ion transport, and defence mechanisms. A statistical overrepresentation analysis of gene and protein annotations was based on paralogs of the 200 prokaryotes with the highest fraction of paralog-forming genes. Biclustering of overrepresented gene ontology terms versus species was used to identify clusters of properties associated with clusters of species. The clusters were classified using similarity scores on properties and species to identify interesting clusters, and a subset of clusters were analysed by comparison to literature data. This analysis showed that paralogs often are associated with properties that are important for survival and proliferation of the specific organisms. This includes processes like ion transport, locomotion, chemotaxis and photosynthesis. However, the analysis also showed that the gene ontology terms sometimes were too general, imprecise or even misleading for automatic analysis. Conclusions: Properties described by gene ontology terms identified in the overrepresentation analysis are often consistent with individual prokaryote lifestyles and are likely to give a competitive advantage to the organism. Paralogs and singletons dominate different categories of functional classification, where paralogs in particular seem to be associated with processes involving interaction with the environment. PMID:20961426

  8. Gene duplications in prokaryotes can be associated with environmental adaptation.

    PubMed

    Bratlie, Marit S; Johansen, Jostein; Sherman, Brad T; Huang, Da Wei; Lempicki, Richard A; Drabløs, Finn

    2010-10-20

    Gene duplication is a normal evolutionary process. If there is no selective advantage in keeping the duplicated gene, it is usually reduced to a pseudogene and disappears from the genome. However, some paralogs are retained. These gene products are likely to be beneficial to the organism, e.g. in adaptation to new environmental conditions. The aim of our analysis is to investigate the properties of paralog-forming genes in prokaryotes, and to analyse the role of these retained paralogs by relating gene properties to life style of the corresponding prokaryotes. Paralogs were identified in a number of prokaryotes, and these paralogs were compared to singletons of persistent orthologs based on functional classification. This showed that the paralogs were associated with for example energy production, cell motility, ion transport, and defence mechanisms. A statistical overrepresentation analysis of gene and protein annotations was based on paralogs of the 200 prokaryotes with the highest fraction of paralog-forming genes. Biclustering of overrepresented gene ontology terms versus species was used to identify clusters of properties associated with clusters of species. The clusters were classified using similarity scores on properties and species to identify interesting clusters, and a subset of clusters were analysed by comparison to literature data. This analysis showed that paralogs often are associated with properties that are important for survival and proliferation of the specific organisms. This includes processes like ion transport, locomotion, chemotaxis and photosynthesis. However, the analysis also showed that the gene ontology terms sometimes were too general, imprecise or even misleading for automatic analysis. Properties described by gene ontology terms identified in the overrepresentation analysis are often consistent with individual prokaryote lifestyles and are likely to give a competitive advantage to the organism. Paralogs and singletons dominate different categories of functional classification, where paralogs in particular seem to be associated with processes involving interaction with the environment.

  9. Grouping patients for masseter muscle genotype-phenotype studies.

    PubMed

    Moawad, Hadwah Abdelmatloub; Sinanan, Andrea C M; Lewis, Mark P; Hunt, Nigel P

    2012-03-01

    To use various facial classifications, including either/both vertical and horizontal facial criteria, to assess their effects on the interpretation of masseter muscle (MM) gene expression. Fresh MM biopsies were obtained from 29 patients (age, 16-36 years) with various facial phenotypes. Based on clinical and cephalometric analysis, patients were grouped using three different classifications: (1) basic vertical, (2) basic horizontal, and (3) combined vertical and horizontal. Gene expression levels of the myosin heavy chain genes MYH1, MYH2, MYH3, MYH6, MYH7, and MYH8 were recorded using quantitative reverse transcriptase polymerase chain reaction (RT-PCR) and were related to the various classifications. The significance level for statistical analysis was set at P ≤ .05. Using classification 1, none of the MYH genes were found to be significantly different between long face (LF) patients and the average vertical group. Using classification 2, MYH3, MYH6, and MYH7 genes were found to be significantly upregulated in retrognathic patients compared with prognathic and average horizontal groups. Using classification 3, only the MYH7 gene was found to be significantly upregulated in retrognathic LF compared with prognathic LF, prognathic average vertical faces, and average vertical and horizontal groups. The use of basic vertical or basic horizontal facial classifications may not be sufficient for genetics-based studies of facial phenotypes. Prognathic and retrognathic facial phenotypes have different MM gene expressions; therefore, it is not recommended to combine them into one single group, even though they may have a similar vertical facial phenotype.

  10. Revisiting the structure/function relationships of H/ACA(-like) RNAs: a unified model for Euryarchaea and Crenarchaea

    PubMed Central

    Toffano-Nioche, Claire; Gautheret, Daniel; Leclerc, Fabrice

    2015-01-01

    A structural and functional classification of H/ACA and H/ACA-like motifs is obtained from the analysis of the H/ACA guide RNAs which have been identified previously in the genomes of Euryarchaea (Pyrococcus) and Crenarchaea (Pyrobaculum). A unified structure/function model is proposed based on the common structural determinants shared by H/ACA and H/ACA-like motifs in both Euryarchaea and Crenarchaea. Using a computational approach, structural and energetic rules for the guide:target RNA-RNA interactions are derived from structural and functional data on the H/ACA RNP particles. H/ACA(-like) motifs found in Pyrococcus are evaluated through the classification and their biological relevance is discussed. Extra-ribosomal targets found in both Pyrococcus and Pyrobaculum might support the hypothesis of a gene regulation mediated by H/ACA(-like) guide RNAs in archaea. PMID:26240384

  11. SorghumFDB: sorghum functional genomics database with multidimensional network analysis.

    PubMed

    Tian, Tian; You, Qi; Zhang, Liwei; Yi, Xin; Yan, Hengyu; Xu, Wenying; Su, Zhen

    2016-01-01

    Sorghum (Sorghum bicolor [L.] Moench) has excellent agronomic traits and biological properties, such as heat and drought-tolerance. It is a C4 grass and potential bioenergy-producing plant, which makes it an important crop worldwide. With the sorghum genome sequence released, it is essential to establish a sorghum functional genomics data mining platform. We collected genomic data and some functional annotations to construct a sorghum functional genomics database (SorghumFDB). SorghumFDB integrated knowledge of sorghum gene family classifications (transcription regulators/factors, carbohydrate-active enzymes, protein kinases, ubiquitins, cytochrome P450, monolignol biosynthesis related enzymes, R-genes and organelle-genes), detailed gene annotations, miRNA and target gene information, orthologous pairs in the model plants Arabidopsis, rice and maize, gene loci conversions and a genome browser. We further constructed a dynamic network of multidimensional biological relationships, comprised of the co-expression data, protein-protein interactions and miRNA-target pairs. We took effective measures to combine the network, gene set enrichment and motif analyses to determine the key regulators that participate in related metabolic pathways, such as the lignin pathway, which is a major biological process in bioenergy-producing plants. Database URL: http://structuralbiology.cau.edu.cn/sorghum/index.html. © The Author(s) 2016. Published by Oxford University Press.

  12. Novel gene sets improve set-level classification of prokaryotic gene expression data.

    PubMed

    Holec, Matěj; Kuželka, Ondřej; Železný, Filip

    2015-10-28

    Set-level classification of gene expression data has received significant attention recently. In this setting, high-dimensional vectors of features corresponding to genes are converted into lower-dimensional vectors of features corresponding to biologically interpretable gene sets. The dimensionality reduction brings the promise of a decreased risk of overfitting, potentially resulting in improved accuracy of the learned classifiers. However, recent empirical research has not confirmed this expectation. Here we hypothesize that the reported unfavorable classification results in the set-level framework were due to the adoption of unsuitable gene sets defined typically on the basis of the Gene Ontology and the KEGG database of metabolic networks. We explore an alternative approach to defining gene sets, based on regulatory interactions, which we expect to collect genes with more correlated expression. We hypothesize that such more correlated gene sets will make it possible to learn more accurate classifiers. We define two families of gene sets using information on regulatory interactions, and evaluate them on phenotype-classification tasks using public prokaryotic gene expression data sets. From each of the two gene-set families, we first select the best-performing subtype. The two selected subtypes are then evaluated on independent (testing) data sets against state-of-the-art gene sets and against the conventional gene-level approach. The novel gene sets are indeed more correlated than the conventional ones, and lead to significantly more accurate classifiers. Novel gene sets defined on the basis of regulatory interactions improve set-level classification of gene expression data. The experimental scripts and other material needed to reproduce the experiments are available at http://ida.felk.cvut.cz/novelgenesets.tar.gz.
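
    The conversion from gene-level to set-level features described in this record can be sketched in a few lines. The aggregation rule is not stated in the abstract, so taking the mean over member genes is an assumption here, and the function name set_level_features and the example gene and set names are hypothetical.

```python
def set_level_features(expression, gene_sets):
    """Collapse one sample's gene-level values into gene-set features.

    expression: dict mapping gene identifier -> expression value.
    gene_sets:  dict mapping set name -> iterable of gene identifiers.
    Each set-level feature is the mean over the member genes that were
    actually measured (an assumed aggregation rule); sets with no
    measured members are skipped.
    """
    features = {}
    for name, genes in gene_sets.items():
        values = [expression[g] for g in genes if g in expression]
        if values:
            features[name] = sum(values) / len(values)
    return features


# Usage with made-up identifiers: three genes reduce to two features.
sample = {"geneA": 1.0, "geneB": 3.0, "geneC": 10.0}
sets = {"regulon1": ["geneA", "geneB"], "regulon2": ["geneC", "geneD"]}
reduced = set_level_features(sample, sets)
```

    A classifier would then be trained on the reduced vectors instead of the per-gene values; how well that works depends on how the gene sets are defined, which is the question the record addresses.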

  13. Importance of correlation between gene expression levels: application to the type I interferon signature in rheumatoid arthritis.

    PubMed

    Reynier, Frédéric; Petit, Fabien; Paye, Malick; Turrel-Davin, Fanny; Imbert, Pierre-Emmanuel; Hot, Arnaud; Mougin, Bruno; Miossec, Pierre

    2011-01-01

    The analysis of gene expression data shows that many genes display similarity in their expression profiles suggesting some co-regulation. Here, we investigated the co-expression patterns in gene expression data and proposed a correlation-based research method to stratify individuals. Using blood from rheumatoid arthritis (RA) patients, we investigated the gene expression profiles from whole blood using Affymetrix microarray technology. Co-expressed genes were analyzed by a biclustering method, followed by gene ontology analysis of the relevant biclusters. Taking the type I interferon (IFN) pathway as an example, a classification algorithm was developed from the 102 RA patients and extended to 10 systemic lupus erythematosus (SLE) patients and 100 healthy volunteers to further characterize individuals. We developed a correlation-based algorithm referred to as Classification Algorithm Based on a Biological Signature (CABS), an alternative to other approaches focused specifically on the expression levels. This algorithm applied to the expression of 35 IFN-related genes showed that the IFN signature presented a heterogeneous expression between RA, SLE and healthy controls which could reflect the level of global IFN signature activation. Moreover, the monitoring of the IFN-related genes during the anti-TNF treatment identified changes in type I IFN gene activity induced in RA patients. In conclusion, we have proposed an original method to analyze genes sharing an expression pattern and a biological function showing that the activation levels of a biological signature could be characterized by its overall state of correlation.

  14. A comprehensive and quantitative exploration of thousands of viral genomes

    PubMed Central

    Mahmoudabadi, Gita

    2018-01-01

    The complete assembly of viral genomes from metagenomic datasets (short genomic sequences gathered from environmental samples) has proven to be challenging, so there are significant blind spots when we view viral genomes through the lens of metagenomics. One approach to overcoming this problem is to leverage the thousands of complete viral genomes that are publicly available. Here we describe our efforts to assemble a comprehensive resource that provides a quantitative snapshot of viral genomic trends – such as gene density, noncoding percentage, and abundances of functional gene categories – across thousands of viral genomes. We have also developed a coarse-grained method for visualizing viral genome organization for hundreds of genomes at once, and have explored the extent of the overlap between bacterial and bacteriophage gene pools. Existing viral classification systems were developed prior to the sequencing era, so we present our analysis in a way that allows us to assess the utility of the different classification systems for capturing genomic trends. PMID:29624169

  15. A comprehensive and quantitative exploration of thousands of viral genomes.

    PubMed

    Mahmoudabadi, Gita; Phillips, Rob

    2018-04-19

    The complete assembly of viral genomes from metagenomic datasets (short genomic sequences gathered from environmental samples) has proven to be challenging, so there are significant blind spots when we view viral genomes through the lens of metagenomics. One approach to overcoming this problem is to leverage the thousands of complete viral genomes that are publicly available. Here we describe our efforts to assemble a comprehensive resource that provides a quantitative snapshot of viral genomic trends - such as gene density, noncoding percentage, and abundances of functional gene categories - across thousands of viral genomes. We have also developed a coarse-grained method for visualizing viral genome organization for hundreds of genomes at once, and have explored the extent of the overlap between bacterial and bacteriophage gene pools. Existing viral classification systems were developed prior to the sequencing era, so we present our analysis in a way that allows us to assess the utility of the different classification systems for capturing genomic trends. © 2018, Mahmoudabadi et al.

  16. APOE polymorphism as a potential determinant of functional fitness in the elderly regardless of nutritional status.

    PubMed

    Snejdrlova, Michaela; Kalvach, Zdenek; Topinkova, Eva; Vrablik, Michal; Prochazkova, Renata; Kvasilova, Marie; Lanska, Vera; Zlatohlavek, Lukas; Prusikova, Martina; Ceska, Richard

    2011-01-01

    Life expectancy is determined by a combination of genetic predisposition (~25%) and environmental influences (~75%). Nevertheless, a stronger genetic influence is anticipated in long-living individuals. The apolipoprotein E (APOE) gene is among the most studied candidate genes of longevity. We evaluated the relation of APOE polymorphism and fitness status in the elderly. We examined a total of 128 subjects over 80 years of age. Their fitness status was assessed using a battery of functional tests, and the subjects were stratified into 5 functional categories according to Spirduso's classification. Biochemistry analysis was performed by enzymatic method using automated analyzers. APOE gene polymorphism was analysed using PCR-RFLP. APOE4 allele carriers had significantly worse fitness status compared to non-carriers (p=0.025). Multiple logistic regression analysis showed that APOE4 carriers had a higher risk (p=0.05) of functional unfitness compared to APOE2/E3 individuals. APOE gene polymorphism seems to be an important genetic contributor to frailty development in the elderly. While APOE2 carriers tend to remain functionally fit until a higher age, the functional status of APOE4 carriers deteriorates more rapidly. © 2011 Neuroendocrinology Letters

  17. Global Transcriptional Response of Human Liver Cells to Ethanol Stress of Different Strength Reveals Hormetic Behavior.

    PubMed

    Schmidt-Heck, Wolfgang; Wönne, Eva C; Hiller, Thomas; Menzel, Uwe; Koczan, Dirk; Damm, Georg; Seehofer, Daniel; Knöspel, Fanny; Freyer, Nora; Guthke, Reinhard; Dooley, Steven; Zeilinger, Katrin

    2017-05-01

    The liver is the major site for alcohol metabolism in the body and therefore the primary target organ for ethanol (EtOH)-induced toxicity. In this study, we investigated the in vitro response of human liver cells to different EtOH concentrations in a perfused bioartificial liver device that mimics the complex architecture of the natural organ. Primary human liver cells were cultured in the bioartificial liver device and treated for 24 hours with medium containing 150 mM (low), 300 mM (medium), or 600 mM (high) EtOH, while a control culture was kept untreated. Gene expression patterns for each EtOH concentration were monitored using Affymetrix Human Gene 1.0 ST Gene chips. Scaled expression profiles of differentially expressed genes (DEGs) were clustered using the fuzzy c-means algorithm. In addition, functional classification methods, KEGG pathway mapping and a machine learning approach (Random Forest) were utilized. A total of 966 (150 mM EtOH), 1,334 (300 mM EtOH), or 4,132 (600 mM EtOH) genes were found to be differentially expressed. Dose-response relationships of the identified clusters of co-expressed genes showed a monotonic, threshold, or nonmonotonic (hormetic) behavior. Functional classification of DEGs revealed that low or medium EtOH concentrations induce adaptation processes, while alterations observed for the high EtOH concentration reflect the response to cellular damage. The genes displaying a hormetic response were functionally characterized by overrepresented "cellular ketone metabolism" and "carboxylic acid metabolism." Altered expression of the genes BAHD1 and H3F3B was identified as sufficient to classify the samples according to the applied EtOH doses. Different pathways of metabolic and epigenetic regulation are affected by EtOH exposure and partly undergo hormetic regulation in the bioartificial liver device. Gene expression changes observed at high EtOH concentrations reflect in some aspects the situation of alcoholic hepatitis in humans. Copyright © 2017 by the Research Society on Alcoholism.
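
    The fuzzy c-means step mentioned in the abstract above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the fuzzifier m = 2 and the toy data are assumptions made only for illustration. Each profile receives a graded membership in every cluster, and cluster centers are then recomputed as membership-weighted means.

```python
import math

def memberships(points, centers, m=2.0):
    """Return u[k][i], the degree to which point k belongs to cluster i.

    Standard fuzzy c-means update: u_ki = 1 / sum_j (d_ki / d_kj)^(2/(m-1)).
    Each row sums to 1. The fuzzifier m=2.0 is an assumed default.
    """
    exponent = 2.0 / (m - 1.0)
    u = []
    for x in points:
        d = [math.dist(x, c) for c in centers]
        if 0.0 in d:
            # The point coincides with one or more centers: share it among them.
            hits = [1.0 if di == 0.0 else 0.0 for di in d]
            u.append([h / sum(hits) for h in hits])
            continue
        u.append([1.0 / sum((di / dj) ** exponent for dj in d) for di in d])
    return u

def update_centers(points, u, m=2.0):
    """Recompute each center as the mean of all points weighted by u^m."""
    n_clusters, dim = len(u[0]), len(points[0])
    centers = []
    for i in range(n_clusters):
        w = [row[i] ** m for row in u]
        total = sum(w)
        centers.append(tuple(
            sum(wk * x[j] for wk, x in zip(w, points)) / total
            for j in range(dim)
        ))
    return centers

# Toy one-dimensional "profiles" (hypothetical values, not study data).
profiles = [(0.0,), (1.0,), (9.0,), (10.0,)]
centers = [(0.5,), (9.5,)]
u = memberships(profiles, centers)
centers = update_centers(profiles, u)
```

    In practice the two functions are alternated until the centers stop moving; unlike hard clustering, a gene with an intermediate profile keeps a non-trivial membership in more than one cluster.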

  18. Evaluation of gene expression classification studies: factors associated with classification performance.

    PubMed

    Novianti, Putri W; Roes, Kit C B; Eijkemans, Marinus J C

    2014-01-01

    Classification methods used in microarray studies for gene expression are diverse in the way they deal with the underlying complexity of the data, as well as in the technique used to build the classification model. The MAQC II study on cancer classification problems has found that performance was affected by factors such as the classification algorithm, cross validation method, number of genes, and gene selection method. In this paper, we study the hypothesis that the disease under study significantly determines which method is optimal, and that additionally sample size, class imbalance, type of medical question (diagnostic, prognostic or treatment response), and microarray platform are potentially influential. A systematic literature review was used to extract the information from 48 published articles on non-cancer microarray classification studies. The impact of the various factors on the reported classification accuracy was analyzed through random-intercept logistic regression. The type of medical question and method of cross validation dominated the explained variation in accuracy among studies, followed by disease category and microarray platform. In total, 42% of the between study variation was explained by all the study specific and problem specific factors that we studied together.

  19. AUCTSP: an improved biomarker gene pair class predictor.

    PubMed

    Kagaris, Dimitri; Khamesipour, Alireza; Yiannoutsos, Constantin T

    2018-06-26

    The Top Scoring Pair (TSP) classifier, based on the concept of relative ranking reversals in the expressions of pairs of genes, has been proposed as a simple, accurate, and easily interpretable decision rule for classification and class prediction of gene expression profiles. The idea that differences in gene expression ranking are associated with presence or absence of disease is compelling and has strong biological plausibility. Nevertheless, the TSP formulation ignores significant available information which can improve classification accuracy and is vulnerable to selecting genes which do not have differential expression in the two conditions ("pivot" genes). We introduce the AUCTSP classifier as an alternative rank-based estimator of the magnitude of the ranking reversals involved in the original TSP. The proposed estimator is based on the Area Under the Receiver Operating Characteristic (ROC) Curve (AUC) and as such, takes into account the separation of the entire distribution of gene expression levels in gene pairs under the conditions considered, as opposed to comparing gene rankings within individual subjects as in the original TSP formulation. Through extensive simulations and case studies involving classification in ovarian, leukemia, colon, breast and prostate cancers and diffuse large B-cell lymphoma, we show the superiority of the proposed approach in terms of improving classification accuracy, avoiding overfitting and being less prone to selecting non-informative (pivot) genes. The proposed AUCTSP is a simple yet reliable and robust rank-based classifier for gene expression classification. While the AUCTSP works by the same principle as TSP, its ability to determine the top scoring gene pair based on the relative rankings of two marker genes across all subjects as opposed to each individual subject results in significant performance gains in classification accuracy. In addition, the proposed method tends to avoid selection of non-informative (pivot) genes as members of the top-scoring pair.
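
    The contrast drawn in the abstract above, between comparing two genes within each subject and comparing their distributions across all subjects, can be sketched as follows. The first function is the usual TSP score. The second is only one plausible reading of an AUC-based pair score, not the published AUCTSP estimator; all function names and the toy data are hypothetical.

```python
from itertools import product

def tsp_score(xi, xj, labels):
    """Original TSP score: |P(Xi < Xj | class 0) - P(Xi < Xj | class 1)|,
    where gene i and gene j are compared within each individual subject."""
    def reversal_rate(cls):
        pairs = [(a, b) for a, b, y in zip(xi, xj, labels) if y == cls]
        return sum(a < b for a, b in pairs) / len(pairs)
    return abs(reversal_rate(0) - reversal_rate(1))

def auc(u, v):
    """Mann-Whitney estimate of P(U < V); ties count as one half."""
    wins = sum(1.0 if a < b else 0.5 if a == b else 0.0
               for a, b in product(u, v))
    return wins / (len(u) * len(v))

def auc_pair_score(xi, xj, labels):
    """Assumed AUC-style variant: compare the whole distribution of gene i
    against the whole distribution of gene j inside each class, then take
    the absolute difference between the two classes."""
    def class_auc(cls):
        u = [a for a, y in zip(xi, labels) if y == cls]
        v = [b for b, y in zip(xj, labels) if y == cls]
        return auc(u, v)
    return abs(class_auc(0) - class_auc(1))

# Toy data (hypothetical): gene i lies below gene j in class 0, above in class 1.
gene_i = [1, 2, 3, 7, 8, 9]
gene_j = [4, 5, 6, 4, 5, 6]
classes = [0, 0, 0, 1, 1, 1]
print(tsp_score(gene_i, gene_j, classes), auc_pair_score(gene_i, gene_j, classes))
```

    Both scores range from 0 to 1; a pair search would evaluate every gene pair and keep the one with the highest score.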

  20. 15 years of research on Oral-Facial-Digital syndromes: from 1 to 16 causal genes

    PubMed Central

    Bruel, Ange-Line; Franco, Brunella; Duffourd, Yannis; Thevenon, Julien; Jego, Laurence; Lopez, Estelle; Deleuze, Jean-François; Doummar, Diane; Giles, Rachel H.; Johnson, Colin A.; Huynen, Martijn A.; Chevrier, Véronique; Burglen, Lydie; Morleo, Manuela; Desguerres, Isabelle; Pierquin, Geneviève; Doray, Bérénice; Gilbert-Dussardier, Brigitte; Reversade, Bruno; Steichen-Gersdorf, Elisabeth; Baumann, Clarisse; Panigrahi, Inusha; Fargeot-Espaliat, Anne; Dieux, Anne; David, Albert; Goldenberg, Alice; Bongers, Ernie; Gaillard, Dominique; Argente, Jesús; Aral, Bernard; Gigot, Nadège; St-Onge, Judith; Birnbaum, Daniel; Phadke, Shubha R.; Cormier-Daire, Valérie; Eguether, Thibaut; Pazour, Gregory J.; Herranz-Pérez, Vicente; Lee, Jaclyn S.; Pasquier, Laurent; Loget, Philippe; Saunier, Sophie; Mégarbané, André; Rosnet, Olivier; Leroux, Michel R.; Wallingford, John B.; Blacque, Oliver E.; Nachury, Maxence V.; Attie-Bitach, Tania; Rivière, Jean-Baptiste; Faivre, Laurence; Thauvin-Robinet, Christel

    2017-01-01

    Oral-facial-digital syndromes (OFDS) gather rare genetic disorders characterized by facial, oral and digital abnormalities associated with a wide range of additional features (polycystic kidney disease, cerebral malformations and several others) to delineate a growing list of OFD subtypes. The most frequent, OFD type I, is caused by a heterozygous mutation in the OFD1 gene encoding a centrosomal protein. The wide clinical heterogeneity of OFDS suggests the involvement of other ciliary genes. For 15 years, we have aimed to identify the molecular bases of OFDS. This effort has been greatly helped by the recent development of whole exome sequencing (WES). Here, we present all our published and unpublished results for WES in 24 OFDS cases. We identified causal variants in five new genes (C2CD3, TMEM107, INTU, KIAA0753, IFT57) and related the clinical spectrum of four genes in other ciliopathies (C5orf42, TMEM138, TMEM231, WDPCP) to OFDS. Mutations were also detected in two genes previously implicated in OFDS. Functional studies revealed the involvement of centriole elongation, transition zone and intraflagellar transport defects in OFDS, thus characterizing three ciliary protein modules: the complex KIAA0753-FOPNL-OFD1, a regulator of centriole elongation; the MKS module, a major component of the transition zone; and the CPLANE complex necessary for IFT-A assembly. OFDS now appear to be a distinct subgroup of ciliopathies with wide heterogeneity, which makes the initial classification obsolete. A clinical classification restricted to the three frequent/well-delineated subtypes could be proposed, and for patients who do not fit one of these 3 main subtypes, a further classification could be based on the genotype. PMID:28289185

  1. Improving Classification of Cancer and Mining Biomarkers from Gene Expression Profiles Using Hybrid Optimization Algorithms and Fuzzy Support Vector Machine

    PubMed Central

    Moteghaed, Niloofar Yousefi; Maghooli, Keivan; Garshasbi, Masoud

    2018-01-01

    Background: Gene expression data are characteristically high dimensional with a small sample size in contrast to the feature size and variability inherent in biological processes that contribute to difficulties in analysis. Selection of highly discriminative features decreases the computational cost and complexity of the classifier and improves its reliability for prediction of a new class of samples. Methods: The present study used hybrid particle swarm optimization and genetic algorithms for gene selection and a fuzzy support vector machine (SVM) as the classifier. Fuzzy logic is used to infer the importance of each sample in the training phase and decrease the outlier sensitivity of the system to increase the ability to generalize the classifier. A decision-tree algorithm was applied to the most frequent genes to develop a set of rules for each type of cancer. This improved the abilities of the algorithm by finding the best parameters for the classifier during the training phase without the need for trial-and-error by the user. The proposed approach was tested on four benchmark gene expression profiles. Results: Good results have been demonstrated for the proposed algorithm. The classification accuracy for leukemia data is 100%, for colon cancer is 96.67% and for breast cancer is 98%. The results show that the best kernel used in training the SVM classifier is the radial basis function. Conclusions: The experimental results show that the proposed algorithm can decrease the dimensionality of the dataset, determine the most informative gene subset, and improve classification accuracy using the optimal parameters of the classifier with no user interface. PMID:29535919

  2. Evolution and Classification of Myosins, a Paneukaryotic Whole-Genome Approach

    PubMed Central

    Sebé-Pedrós, Arnau; Grau-Bové, Xavier; Richards, Thomas A.; Ruiz-Trillo, Iñaki

    2014-01-01

    Myosins are key components of the eukaryotic cytoskeleton, providing motility for a broad diversity of cargoes. Therefore, understanding the origin and evolutionary history of myosin classes is crucial to address the evolution of eukaryote cell biology. Here, we revise the classification of myosins using an updated taxon sampling that includes newly or recently sequenced genomes and transcriptomes from key taxa. We performed a survey of eukaryotic genomes and phylogenetic analyses of the myosin gene family, reconstructing the myosin toolkit at different key nodes in the eukaryotic tree of life. We also identified the phylogenetic distribution of myosin diversity in terms of number of genes, associated protein domains and number of classes in each taxa. Our analyses show that new classes (i.e., paralogs) and domain architectures were continuously generated throughout eukaryote evolution, with a significant expansion of myosin abundance and domain architectural diversity at the stem of Holozoa, predating the origin of animal multicellularity. Indeed, single-celled holozoans have the most complex myosin complement among eukaryotes, with paralogs of most myosins previously considered animal specific. We recover a dynamic evolutionary history, with several lineage-specific expansions (e.g., the myosin III-like gene family diversification in choanoflagellates), convergence in protein domain architectures (e.g., fungal and animal chitin synthase myosins), and important secondary losses. Overall, our evolutionary scheme demonstrates that the ancestral eukaryote likely had a complex myosin repertoire that included six genes with different protein domain architectures. Finally, we provide an integrative and robust classification, useful for future genomic and functional studies on this crucial eukaryotic gene family. PMID:24443438

  3. Functional Assessment of the Role of BORIS in Ovarian Cancer Using a Novel in Vivo Model System

    DTIC Science & Technology

    2015-12-01

    iv) we obtained founder BORIS-Tg mice and crossed into the FVB/N strain to fully characterize the transgenic gene configuration, v) we conducted...models, transgenic mice...genes, and wildtype p53 is a negative regulator of BORIS expression. To test these hypotheses, we will develop and utilize a murine transgenic model

  4. Functional dissection of drought-responsive gene expression patterns in Cynodon dactylon L.

    PubMed

    Kim, Changsoo; Lemke, Cornelia; Paterson, Andrew H

    2009-05-01

    Water deficit is one of the main abiotic factors that affect plant productivity in subtropical regions. To identify genes induced during the water stress response in Bermudagrass (Cynodon dactylon), cDNA macroarrays were used. The macroarray analysis identified 189 drought-responsive candidate genes from C. dactylon, of which 120 were up-regulated and 69 were down-regulated. The candidate genes were classified into seven groups by cluster analysis of expression levels across two intensities and three durations of imposed stress. Annotation using BLASTX suggested that up-regulated genes may be involved in proline biosynthesis, signal transduction pathways, protein repair systems, and removal of toxins, while down-regulated genes were mostly related to basic plant metabolism such as photosynthesis and glycolysis. The functional classification of gene ontology (GO) was consistent with the BLASTX results, also suggesting some crosstalk between abiotic and biotic stress. Comparative analysis of cis-regulatory elements from the candidate genes implicated specific elements in drought response in Bermudagrass. Although only a subset of genes was studied, Bermudagrass shared many drought-responsive genes and cis-regulatory elements with other botanical models, supporting a strategy of cross-taxon application of drought-responsive genes, regulatory cues, and physiological-genetic information.

  5. Patterns of population differentiation of candidate genes for cardiovascular disease.

    PubMed

    Kullo, Iftikhar J; Ding, Keyue

    2007-07-12

    The basis for ethnic differences in cardiovascular disease (CVD) susceptibility is not fully understood. We investigated patterns of population differentiation (FST) of a set of genes in etiologic pathways of CVD among 3 ethnic groups: Yoruba in Nigeria (YRI), Utah residents with European ancestry (CEU), and Han Chinese (CHB) + Japanese (JPT). We identified 37 pathways implicated in CVD based on the PANTHER classification and 416 genes in these pathways were further studied; these genes belonged to 6 biological processes (apoptosis, blood circulation and gas exchange, blood clotting, homeostasis, immune response, and lipoprotein metabolism). Genotype data were obtained from the HapMap database. We calculated FST for 15,559 common SNPs (minor allele frequency ≥ 0.10 in at least one population) in genes that co-segregated among the populations, as well as an average-weighted FST for each gene. SNPs were classified as putatively functional (non-synonymous and untranslated regions) or non-functional (intronic and synonymous sites). Mean FST values for common putatively functional variants were significantly higher than FST values for nonfunctional variants. A significant variation in FST was also seen based on biological processes; the processes of 'apoptosis' and 'lipoprotein metabolism' showed an excess of genes with high FST. Thus, putative functional SNPs in genes in etiologic pathways for CVD show greater population differentiation than non-functional SNPs and a significant variance of FST values was noted among pairwise population comparisons for different biological processes. These results suggest a possible basis for varying susceptibility to CVD among ethnic groups.
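
    As a worked illustration of the FST statistic used in the abstract above, the sketch below computes Wright's heterozygosity-based FST for a single biallelic SNP. The abstract does not say which estimator the authors used, so the formula, the equal weighting of populations, and the function name are assumptions; the allele frequencies are made up.

```python
def fst_biallelic(freqs):
    """Wright's FST = (H_T - H_S) / H_T for one biallelic SNP.

    freqs holds the frequency of one allele in each population.
    Populations are weighted equally (an assumption; real estimators
    usually also correct for sample size).
    """
    p_bar = sum(freqs) / len(freqs)
    h_total = 2.0 * p_bar * (1.0 - p_bar)           # expected heterozygosity, pooled
    if h_total == 0.0:
        return 0.0                                   # allele fixed everywhere: no differentiation
    h_sub = sum(2.0 * p * (1.0 - p) for p in freqs) / len(freqs)  # mean within-population
    return (h_total - h_sub) / h_total

# Hypothetical allele frequencies in three populations.
print(fst_biallelic([0.5, 0.5, 0.5]))   # identical frequencies
print(fst_biallelic([0.0, 1.0]))        # alternative alleles fixed
```

    Values near 0 indicate similar allele frequencies across populations, and values near 1 indicate strong differentiation.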

  6. Global identification and expression analysis of stress-responsive genes of the Argonaute family in apple.

    PubMed

    Xu, Ruirui; Liu, Caiyun; Li, Ning; Zhang, Shizhong

    2016-12-01

    Argonaute (AGO) proteins, which are found in yeast, animals, and plants, are the core molecules of the RNA-induced silencing complex. These proteins play important roles in plant growth, development, and responses to biotic stresses. The complete analysis and classification of the AGO gene family have been recently reported in different plants. Nevertheless, systematic analysis and expression profiling of these genes have not been performed in apple (Malus domestica). Approximately 15 AGO genes were identified in the apple genome. The phylogenetic tree, chromosome location, conserved protein motifs, gene structure, and expression of the AGO gene family in apple were analyzed for gene prediction. All AGO genes were phylogenetically clustered into four groups (i.e., AGO1, AGO4, MEL1/AGO5, and ZIPPY/AGO7) with the AGO genes of Arabidopsis. These groups of the AGO gene family were statistically analyzed and compared among 31 plant species. The predicted apple AGO genes are distributed across nine chromosomes at different densities and include three segment duplications. Expression studies indicated that 15 AGO genes exhibit different expression patterns in at least one of the tissues tested. Additionally, analysis of gene expression levels indicated that the genes are mostly involved in responses to NaCl, PEG, heat, and low-temperature stresses. Hence, several candidate AGO genes are involved in different aspects of physiological and developmental processes and may play an important role in abiotic stress responses in apple. To the best of our knowledge, this study is the first to report a comprehensive analysis of the apple AGO gene family. Our results provide useful information to understand the classification and putative functions of these proteins, especially for gene members that may play important roles in abiotic stress responses in M. hupehensis.

  7. A fuzzy neural network for intelligent data processing

    NASA Astrophysics Data System (ADS)

    Xie, Wei; Chu, Feng; Wang, Lipo; Lim, Eng Thiam

    2005-03-01

    In this paper, we describe an incrementally generated fuzzy neural network (FNN) for intelligent data processing. This FNN combines the features of initial fuzzy model self-generation, fast input selection, partition validation, parameter optimization and rule-base simplification. A small FNN is created from scratch -- there is no need to specify the initial network architecture, initial membership functions, or initial weights. Fuzzy IF-THEN rules are constantly combined and pruned to minimize the size of the network while maintaining accuracy; irrelevant inputs are detected and deleted, and membership functions and network weights are trained with a gradient descent algorithm, i.e., error backpropagation. Experimental studies on synthesized data sets demonstrate that the proposed Fuzzy Neural Network is able to achieve accuracy comparable to or higher than both a feedforward crisp neural network, i.e., NeuroRule, and a decision tree, i.e., C4.5, with more compact rule bases for most of the data sets used in our experiments. The FNN has achieved outstanding results for cancer classification based on microarray data. The excellent classification result for the Small Round Blue Cell Tumors (SRBCTs) data set is shown. Compared with other published methods, we used far fewer genes to achieve perfect classification, which will help researchers focus directly on specific genes and may lead to the discovery of the underlying causes of cancer development and of new drugs.

  8. Hybrid Binary Imperialist Competition Algorithm and Tabu Search Approach for Feature Selection Using Gene Expression Data.

    PubMed

    Wang, Shuaiqun; Aorigele; Kong, Wei; Zeng, Weiming; Hong, Xiaomin

    2016-01-01

    Gene expression data composed of thousands of genes play an important role in classification platforms and disease diagnosis. Hence, it is vital to select a small subset of salient features from a large number of gene expression data. Lately, many researchers have devoted themselves to feature selection using diverse computational intelligence methods. However, in the process of selecting informative genes, many computational methods face difficulties in selecting small subsets for cancer classification due to the huge number of genes (high dimension) compared to the small number of samples, noisy genes, and irrelevant genes. In this paper, we propose a new hybrid algorithm, HICATS, which combines the imperialist competition algorithm (ICA), which performs global search, with tabu search (TS), which conducts fine-tuned search. In order to verify the performance of the proposed algorithm HICATS, we have tested it on 10 well-known benchmark gene expression classification datasets with dimensions varying from 2308 to 12600. The performance of our proposed method proved to be superior to other related works, including the conventional version of the binary optimization algorithm, in terms of classification accuracy and the number of selected genes.

  9. Hybrid Binary Imperialist Competition Algorithm and Tabu Search Approach for Feature Selection Using Gene Expression Data

    PubMed Central

    Aorigele; Zeng, Weiming; Hong, Xiaomin

    2016-01-01

    Gene expression data composed of thousands of genes play an important role in classification platforms and disease diagnosis. Hence, it is vital to select a small subset of salient features from a large number of gene expression data. Lately, many researchers have devoted themselves to feature selection using diverse computational intelligence methods. However, in the process of selecting informative genes, many computational methods face difficulties in selecting small subsets for cancer classification due to the huge number of genes (high dimension) compared to the small number of samples, noisy genes, and irrelevant genes. In this paper, we propose a new hybrid algorithm, HICATS, which combines the imperialist competition algorithm (ICA), which performs global search, with tabu search (TS), which conducts fine-tuned search. In order to verify the performance of the proposed algorithm HICATS, we have tested it on 10 well-known benchmark gene expression classification datasets with dimensions varying from 2308 to 12600. The performance of our proposed method proved to be superior to other related works, including the conventional version of the binary optimization algorithm, in terms of classification accuracy and the number of selected genes. PMID:27579323

  10. Unification of [FeFe]-hydrogenases into three structural and functional groups.

    PubMed

    Poudel, Saroj; Tokmina-Lukaszewska, Monika; Colman, Daniel R; Refai, Mohammed; Schut, Gerrit J; King, Paul W; Maness, Pin-Ching; Adams, Michael W W; Peters, John W; Bothner, Brian; Boyd, Eric S

    2016-09-01

    [FeFe]-hydrogenases (Hyd) are structurally diverse enzymes that catalyze the reversible oxidation of hydrogen (H2). Recent biochemical data demonstrate new functional roles for these enzymes, including those that function in electron bifurcation where an exergonic reaction is coupled with an endergonic reaction to drive the reversible oxidation/production of H2. To identify the structural determinants that underpin differences in enzyme functionality, a total of 714 homologous sequences of the catalytic subunit, HydA, were compiled. Bioinformatics approaches informed by biochemical data were then used to characterize differences in inferred quaternary structure, HydA active site protein environment, accessory iron-sulfur clusters in HydA, and regulatory proteins encoded in HydA gene neighborhoods. HydA homologs were clustered into one of three classification groups, Group 1 (G1), Group 2 (G2), and Group 3 (G3). G1 enzymes were predicted to be monomeric while those in G2 and G3 were predicted to be multimeric and include HydB, HydC (G2/G3) and HydD (G3) subunits. The HydA active site and accessory iron-sulfur clusters did not vary by group type. Group-specific regulatory genes were identified in the gene neighborhoods of both G2 and G3 Hyd. Analyses of purified G2 and G3 enzymes by mass spectrometry strongly suggest that they are post-translationally modified by phosphorylation. These results suggest that bifurcation capability is dictated primarily by the presence of both HydB and HydC in Hyd complexes, rather than by variation in HydA. This classification scheme provides a framework for future biochemical and mutagenesis studies to elucidate the functional role of Hyd enzymes. Copyright © 2016 Elsevier B.V. All rights reserved.

  11. Systematic computation with functional gene-sets among leukemic and hematopoietic stem cells reveals a favorable prognostic signature for acute myeloid leukemia.

    PubMed

    Yang, Xinan Holly; Li, Meiyi; Wang, Bin; Zhu, Wanqi; Desgardin, Aurelie; Onel, Kenan; de Jong, Jill; Chen, Jianjun; Chen, Luonan; Cunningham, John M

    2015-03-24

    Genes that regulate stem cell function are suspected to exert adverse effects on prognosis in malignancy. However, diverse cancer stem cell signatures are difficult for physicians to interpret and apply clinically. To connect the transcriptome and stem cell biology, with potential clinical applications, we propose a novel computational "gene-to-function, snapshot-to-dynamics, and biology-to-clinic" framework to uncover core functional gene-set signatures. This framework incorporates three function-centric gene-set analysis strategies: a meta-analysis of both microarray and RNA-seq data, novel dynamic network mechanism (DNM) identification, and a personalized prognostic indicator analysis. This work uses the complex disease acute myeloid leukemia (AML) as a research platform. We introduced an adjustable "soft threshold" to a functional gene-set algorithm and found that two different analysis methods identified distinct gene-set signatures from the same samples. We identified a 30-gene cluster that characterizes leukemic stem cell (LSC)-depleted cells and a 25-gene cluster that characterizes LSC-enriched cells in parallel; both mark favorable prognosis in AML. Genes within each signature significantly share common biological processes and/or molecular functions (empirical p = 6e-5 and 0.03 respectively). The 25-gene signature reflects the abnormal development of stem cells in AML, such as AURKA over-expression. We subsequently determined that the clinical relevance of both signatures is independent of known clinical risk classifications in 214 patients with cytogenetically normal AML. We successfully validated the prognosis of both signatures in two independent cohorts of 91 and 242 patients respectively (log-rank p < 0.0015 and 0.05; empirical p < 0.015 and 0.08). The proposed algorithms and computational framework will harness systems biology research because they efficiently translate gene-sets (rather than single genes) into biological discoveries about AML and other complex diseases.

  12. Gene masking - a technique to improve accuracy for cancer classification with high dimensionality in microarray data.

    PubMed

    Saini, Harsh; Lal, Sunil Pranit; Naidu, Vimal Vikash; Pickering, Vincel Wince; Singh, Gurmeet; Tsunoda, Tatsuhiko; Sharma, Alok

    2016-12-05

    High dimensional feature space generally degrades classification in several applications. In this paper, we propose a strategy called gene masking, in which non-contributing dimensions are heuristically removed from the data to improve classification accuracy. Gene masking is implemented via a binary encoded genetic algorithm that can be integrated seamlessly with classifiers during the training phase of classification to perform feature selection. It can also be used to discriminate between features that contribute most to the classification, thereby, allowing researchers to isolate features that may have special significance. This technique was applied on publicly available datasets whereby it substantially reduced the number of features used for classification while maintaining high accuracies. The proposed technique can be extremely useful in feature selection as it heuristically removes non-contributing features to improve the performance of classifiers.
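    The gene-masking idea above can be illustrated with a small sketch. It is not the authors' implementation: the leave-one-out nearest-centroid fitness, the truncation selection, the one-point crossover, and every name and default (`loo_accuracy`, `gene_mask_ga`, `pop_size`, `mutation_rate`) are assumptions chosen only to show a binary-encoded genetic algorithm searching over feature masks.

```python
import random

def loo_accuracy(X, y, mask):
    """Leave-one-out accuracy of a nearest-centroid classifier on the unmasked genes."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return 0.0  # an all-zero mask keeps no genes
    correct = 0
    for i in range(len(X)):
        sums, counts = {}, {}
        for k in range(len(X)):
            if k == i:
                continue  # hold sample i out
            s = sums.setdefault(y[k], [0.0] * len(cols))
            for c, j in enumerate(cols):
                s[c] += X[k][j]
            counts[y[k]] = counts.get(y[k], 0) + 1

        def dist(label):
            # squared Euclidean distance from sample i to the class centroid
            return sum((X[i][j] - sums[label][c] / counts[label]) ** 2
                       for c, j in enumerate(cols))

        correct += min(sums, key=dist) == y[i]
    return correct / len(X)

def gene_mask_ga(X, y, pop_size=20, generations=30, mutation_rate=0.1, seed=0):
    """Search for a 0/1 gene mask (hypothetical settings, illustrative only)."""
    rng = random.Random(seed)
    n = len(X[0])

    def fitness(mask):
        # accuracy first; fewer retained genes breaks ties
        return (loo_accuracy(X, y, mask), -sum(mask))

    # start from the unmasked data plus random masks
    pop = [[1] * n] + [[rng.randint(0, 1) for _ in range(n)]
                       for _ in range(pop_size - 1)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]          # elitist truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)           # one-point crossover
            child = a[:cut] + b[cut:]
            children.append([bit ^ 1 if rng.random() < mutation_rate else bit
                             for bit in child])  # bit-flip mutation
        pop = parents + children
    return max(pop, key=fitness)
```

    Because the parents are carried over unchanged and the initial population contains the all-ones mask, the returned mask is never less accurate (under this fitness) than using every gene.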

  13. CAMUR: Knowledge extraction from RNA-seq cancer data through equivalent classification rules.

    PubMed

    Cestarelli, Valerio; Fiscon, Giulia; Felici, Giovanni; Bertolazzi, Paola; Weitschek, Emanuel

    2016-03-01

    Nowadays, knowledge extraction methods from Next Generation Sequencing data are highly requested. In this work, we focus on RNA-seq gene expression analysis and specifically on case-control studies with rule-based supervised classification algorithms that build a model able to discriminate cases from controls. State of the art algorithms compute a single classification model that contains few features (genes). On the contrary, our goal is to elicit a higher amount of knowledge by computing many classification models, and therefore to identify most of the genes related to the predicted class. We propose CAMUR, a new method that extracts multiple and equivalent classification models. CAMUR iteratively computes a rule-based classification model, calculates the power set of the genes present in the rules, iteratively eliminates those combinations from the data set, and repeats the classification procedure until a stopping criterion is met. CAMUR includes an ad-hoc knowledge repository (database) and a querying tool. We analyze three different types of RNA-seq data sets (Breast, Head and Neck, and Stomach Cancer) from The Cancer Genome Atlas (TCGA) and we validate CAMUR and its models also on non-TCGA data. Our experimental results show the efficacy of CAMUR: we obtain several reliable equivalent classification models, from which the most frequent genes, their relationships, and the relation with a particular cancer are deduced. Availability: dmb.iasi.cnr.it/camur.php. Contact: emanuel@iasi.cnr.it. Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press.

  14. Twenty-four signature genes predict the prognosis of oral squamous cell carcinoma with high accuracy and repeatability

    PubMed Central

    Gao, Jianyong; Tian, Gang; Han, Xu; Zhu, Qiang

    2018-01-01

    Oral squamous cell carcinoma (OSCC) is the sixth most common type of cancer worldwide, with poor prognosis. The present study aimed to identify gene signatures that could classify OSCC and predict prognosis in different stages. A training data set (GSE41613) and two validation data sets (GSE42743 and GSE26549) were acquired from the online Gene Expression Omnibus database. In the training data set, patients were classified based on the tumor-node-metastasis staging system, and subsequently grouped into low stage (L) or high stage (H). Signature genes between L and H stages were selected by disparity index analysis, and classification was performed by the expression of these signature genes. The established classification was compared with the L and H classification, and fivefold cross validation was used to evaluate the stability. Enrichment analysis for the signature genes was implemented by the Database for Annotation, Visualization and Integrated Discovery. Two validation data sets were used to determine the precision of the classification. Survival analysis was conducted following each classification using the package ‘survival’ in R software. A set of 24 signature genes was identified based on the classification model with the Fi value of 0.47, which was used to distinguish OSCC samples in two different stages. Overall survival of patients in the H stage was higher than those in the L stage. Signature genes were primarily enriched in ‘ether lipid metabolism’ pathway and biological processes such as ‘positive regulation of adaptive immune response’ and ‘apoptotic cell clearance’. The results provided a novel 24-gene set that may be used as biomarkers to predict OSCC prognosis with high accuracy, which may be used to determine an appropriate treatment program for patients with OSCC in addition to the traditional evaluation index. PMID:29257303

  15. Optimizing taxonomic classification of marker-gene amplicon sequences with QIIME 2's q2-feature-classifier plugin.

    PubMed

    Bokulich, Nicholas A; Kaehler, Benjamin D; Rideout, Jai Ram; Dillon, Matthew; Bolyen, Evan; Knight, Rob; Huttley, Gavin A; Gregory Caporaso, J

    2018-05-17

    Taxonomic classification of marker-gene sequences is an important step in microbiome analysis. We present q2-feature-classifier ( https://github.com/qiime2/q2-feature-classifier ), a QIIME 2 plugin containing several novel machine-learning and alignment-based methods for taxonomy classification. We evaluated and optimized several commonly used classification methods implemented in QIIME 1 (RDP, BLAST, UCLUST, and SortMeRNA) and several new methods implemented in QIIME 2 (a scikit-learn naive Bayes machine-learning classifier, and alignment-based taxonomy consensus methods based on VSEARCH, and BLAST+) for classification of bacterial 16S rRNA and fungal ITS marker-gene amplicon sequence data. The naive-Bayes, BLAST+-based, and VSEARCH-based classifiers implemented in QIIME 2 meet or exceed the species-level accuracy of other commonly used methods designed for classification of marker gene sequences that were evaluated in this work. These evaluations, based on 19 mock communities and error-free sequence simulations, including classification of simulated "novel" marker-gene sequences, are available in our extensible benchmarking framework, tax-credit ( https://github.com/caporaso-lab/tax-credit-data ). Our results illustrate the importance of parameter tuning for optimizing classifier performance, and we make recommendations regarding parameter choices for these classifiers under a range of standard operating conditions. q2-feature-classifier and tax-credit are both free, open-source, BSD-licensed packages available on GitHub.
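    To make the naive Bayes approach mentioned above concrete, here is a minimal pure-Python sketch of a multinomial naive Bayes classifier over k-mer counts. It does not use or reproduce the q2-feature-classifier or scikit-learn APIs; the function names (`kmers`, `train`, `classify`), the uniform class prior, the Laplace smoothing, and the default `k=4` are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def train(labelled_seqs, k=4):
    """Count k-mers per taxon from (sequence, taxon) reference pairs."""
    counts = defaultdict(Counter)
    for seq, taxon in labelled_seqs:
        counts[taxon].update(kmers(seq, k))
    vocab = {km for c in counts.values() for km in c}
    return counts, vocab

def classify(seq, model, k=4):
    """Return the taxon with the highest smoothed log-likelihood (uniform prior)."""
    counts, vocab = model

    def log_likelihood(taxon):
        total = sum(counts[taxon].values()) + len(vocab)  # Laplace smoothing
        return sum(math.log((counts[taxon][km] + 1) / total)
                   for km in kmers(seq, k))

    return max(counts, key=log_likelihood)
```

    In a sketch like this, the k-mer length is the kind of parameter whose tuning the abstract reports as important; a production classifier would also report a confidence and decline to assign a species when it is low.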

  16. Gene selection for microarray data classification via subspace learning and manifold regularization.

    PubMed

    Tang, Chang; Cao, Lijuan; Zheng, Xiao; Wang, Minhui

    2017-12-19

    With the rapid development of DNA microarray technology, a large amount of genomic data has been generated. Classification of these microarray data is a challenging task since gene expression data often contain thousands of genes but only a small number of samples. In this paper, an effective gene selection method is proposed to select the best subset of genes for microarray data with the irrelevant and redundant genes removed. Compared with original data, the selected gene subset can benefit the classification task. We formulate the gene selection task as a manifold regularized subspace learning problem. In detail, a projection matrix is used to project the original high dimensional microarray data into a lower dimensional subspace, with the constraint that the original genes can be well represented by the selected genes. Meanwhile, the local manifold structure of original data is preserved by a Laplacian graph regularization term on the low-dimensional data space. The projection matrix can serve as an importance indicator of different genes. An iterative update algorithm is developed for solving the problem. Experimental results on six publicly available microarray datasets and one clinical dataset demonstrate that the proposed method performs better when compared with other state-of-the-art methods in terms of microarray data classification.

  17. Circular RNA and gene expression profiles in gastric cancer based on microarray chip technology.

    PubMed

    Sui, Weiguo; Shi, Zhoufang; Xue, Wen; Ou, Minglin; Zhu, Ying; Chen, Jiejing; Lin, Hua; Liu, Fuhua; Dai, Yong

    2017-03-01

    The aim of the present study was to screen gastric cancer (GC) tissue and adjacent tissue for differences in mRNA and circular RNA (circRNA) expression, to analyze the differences in circRNA and mRNA expression, and to investigate the circRNA expression in gastric carcinoma and its mechanism. circRNA and mRNA differential expression profiles generated using Agilent microarray technology were analyzed in the GC tissues and adjacent tissues. qRT-PCR was used to verify the differential expression of circRNAs and mRNAs according to the interactions between circRNAs and miRNAs as well as the possible existence of miRNA and mRNA interactions. We found that: i) the circRNA expression profile revealed 1,285 significant differences in circRNA expression, with circRNA expression downregulated in 594 samples and upregulated in 691 samples via interactions with miRNAs. The qRT-PCR validation experiments showed that hsa_circRNA_400071, hsa_circRNA_000543 and hsa_circRNA_001959 expression was consistent with the microarray analysis results. ii) 29,112 genes were found in the GC tissues and adjacent tissues, including 5,460 differentially expressed genes. Among them, 2,390 differentially expressed genes were upregulated and 3,070 genes were downregulated. Gene Ontology (GO) analysis of the differentially expressed genes revealed that these genes were involved in biological process classification, cellular component classification and molecular function classification. Pathway analysis of the differentially expressed genes identified 83 significantly enriched genes, including 28 upregulated genes and 55 downregulated genes. iii) 69 differentially expressed circRNAs were found that might adsorb specific miRNAs to regulate the expression of their target gene mRNAs. The conclusions are: i) differentially expressed circRNAs had corresponding miRNA binding sites. These circRNAs regulated the expression of target genes through interactions with miRNAs and might become new molecular biomarkers for GC in the future. ii) Differentially expressed genes may be involved in the occurrence of GC via a variety of mechanisms. iii) CD44, CXXC5, MYH9, MALAT1 and other genes may have important implications for the occurrence and development of GC through the regulation, interaction, and mutual influence of circRNA-miRNA-mRNA via different mechanisms.

  18. The Cross-Entropy Based Multi-Filter Ensemble Method for Gene Selection.

    PubMed

    Sun, Yingqiang; Lu, Chengbo; Li, Xiaobo

    2018-05-17

    The gene expression profile has the characteristics of a high dimension, low sample, and continuous type, and it is a great challenge to use gene expression profile data for the classification of tumor samples. This paper proposes a cross-entropy based multi-filter ensemble (CEMFE) method for microarray data classification. Firstly, multiple filters are used to select the microarray data in order to obtain a plurality of the pre-selected feature subsets with a different classification ability. The top N genes with the highest rank of each subset are integrated so as to form a new data set. Secondly, the cross-entropy algorithm is used to remove the redundant data in the data set. Finally, the wrapper method, which is based on forward feature selection, is used to select the best feature subset. The experimental results show that the proposed method is more efficient than other gene selection methods and that it can achieve a higher classification accuracy under fewer characteristic genes.
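    The first and last stages described above (merging the top N genes of several filters, then a forward-selection wrapper) can be sketched as follows. This is an illustration under stated assumptions rather than the CEMFE method itself: the cross-entropy redundancy-removal step is omitted, filter scores are assumed to be precomputed with higher meaning better, and `top_n_union`, `forward_select`, and the `evaluate` callback are hypothetical names.

```python
def top_n_union(filter_scores, n):
    """Merge the n best-scoring gene indices from each filter (higher = better)."""
    selected = set()
    for scores in filter_scores.values():
        ranked = sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)
        selected.update(ranked[:n])
    return sorted(selected)

def forward_select(candidates, evaluate):
    """Greedy wrapper: keep adding the gene that most improves evaluate(subset)."""
    chosen, best = [], float("-inf")
    remaining = list(candidates)
    while remaining:
        # score every one-gene extension of the current subset
        score, gene = max((evaluate(chosen + [g]), g) for g in remaining)
        if score <= best:
            break  # no extension improves the score
        chosen.append(gene)
        remaining.remove(gene)
        best = score
    return chosen
```

    In practice `evaluate` would be a cross-validated classifier accuracy on the candidate subset; here it is left as a caller-supplied function so the sketch stays independent of any particular classifier.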

  19. Gene Selection and Cancer Classification: A Rough Sets Based Approach

    NASA Astrophysics Data System (ADS)

    Sun, Lijun; Miao, Duoqian; Zhang, Hongyun

    Identification of informative gene subsets responsible for discerning between available samples of gene expression data is an important task in bioinformatics. Reducts, from rough sets theory, correspond to minimal sets of essential genes for discerning samples and are an efficient tool for gene selection. Due to the computational complexity of the existing reduct algorithms, feature ranking is usually used to narrow down the gene space as a first step, and the top-ranked genes are selected. In this paper, we define a novel criterion for scoring genes, based on the expression level difference between classes and the gene's contribution to classification, and present an algorithm for generating all possible reducts from informative genes. The algorithm takes the whole attribute set into account and finds short reducts with a significant reduction in computational complexity. An exploration of this approach on benchmark gene expression data sets demonstrates that it is successful in selecting highly discriminative genes and that the classification accuracy is impressive.

  20. Dexamethasone Stimulated Gene Expression in Peripheral Blood is a Sensitive Marker for Glucocorticoid Receptor Resistance in Depressed Patients

    PubMed Central

    Menke, Andreas; Arloth, Janine; Pütz, Benno; Weber, Peter; Klengel, Torsten; Mehta, Divya; Gonik, Mariya; Rex-Haffner, Monika; Rubel, Jennifer; Uhr, Manfred; Lucae, Susanne; Deussing, Jan M; Müller-Myhsok, Bertram; Holsboer, Florian; Binder, Elisabeth B

    2012-01-01

    Although gene expression profiles in peripheral blood in major depression are not likely to identify genes directly involved in the pathomechanism of affective disorders, they may serve as biomarkers for this disorder. As previous studies using baseline gene expression profiles have provided mixed results, our approach was to use an in vivo dexamethasone challenge test and to compare glucocorticoid receptor (GR)-mediated changes in gene expression between depressed patients and healthy controls. Whole genome gene expression data (baseline and following GR-stimulation with 1.5 mg dexamethasone p.o.) from two independent cohorts were analyzed to identify gene expression pattern that would predict case and control status using a training (N=18 cases/18 controls) and a test cohort (N=11/13). Dexamethasone led to reproducible regulation of 2670 genes in controls and 1151 transcripts in cases. Several genes, including FKBP5 and DUSP1, previously associated with the pathophysiology of major depression, were found to be reliable markers of GR-activation. Using random forest analyses for classification, GR-stimulated gene expression outperformed baseline gene expression as a classifier for case and control status with a correct classification of 79.1 vs 41.6% in the test cohort. GR-stimulated gene expression performed best in dexamethasone non-suppressor patients (88.7% correctly classified with 100% sensitivity), but also correctly classified 77.3% of the suppressor patients (76.7% sensitivity), when using a refined set of 19 genes. Our study suggests that in vivo stimulated gene expression in peripheral blood cells could be a promising molecular marker of altered GR-functioning, an important component of the underlying pathology, in patients suffering from depressive episodes. PMID:22237309

  1. Identification and classification of genes required for tolerance to freeze-thaw stress revealed by genome-wide screening of Saccharomyces cerevisiae deletion strains.

    PubMed

    Ando, Akira; Nakamura, Toshihide; Murata, Yoshinori; Takagi, Hiroshi; Shima, Jun

    2007-03-01

    Yeasts used in bread making are exposed to freeze-thaw stress during frozen-dough baking. To clarify the genes required for freeze-thaw tolerance, genome-wide screening was performed using the complete deletion strain collection of diploid Saccharomyces cerevisiae. The screening identified 58 gene deletions that conferred freeze-thaw sensitivity. These genes were then classified based on their cellular function and on the localization of their products. The results showed that the genes required for freeze-thaw tolerance were frequently involved in vacuole functions and cell wall biogenesis. The highest numbers of gene products were components of vacuolar H(+)-ATPase. Next, the cross-sensitivity of the freeze-thaw-sensitive mutants to oxidative stress and to cell wall stress was studied; both of these are environmental stresses closely related to freeze-thaw stress. The results showed that defects in the functions of vacuolar H(+)-ATPase conferred sensitivity to oxidative stress and to cell wall stress. In contrast, defects in gene products involved in cell wall assembly conferred sensitivity to cell wall stress but not to oxidative stress. Our results suggest the presence of at least two different mechanisms of freeze-thaw injury: oxidative stress generated during the freeze-thaw process, and defects in cell wall assembly.

  2. A comparative analysis of swarm intelligence techniques for feature selection in cancer classification.

    PubMed

    Gunavathi, Chellamuthu; Premalatha, Kandasamy

    2014-01-01

    Feature selection in cancer classification is a central area of research in the field of bioinformatics and used to select the informative genes from thousands of genes of the microarray. The genes are ranked based on T-statistics, signal-to-noise ratio (SNR), and F-test values. The swarm intelligence (SI) technique finds the informative genes from the top-m ranked genes. These selected genes are used for classification. In this paper the shuffled frog leaping with Lévy flight (SFLLF) is proposed for feature selection. In SFLLF, the Lévy flight is included to avoid premature convergence of shuffled frog leaping (SFL) algorithm. The SI techniques such as particle swarm optimization (PSO), cuckoo search (CS), SFL, and SFLLF are used for feature selection which identifies informative genes for classification. The k-nearest neighbour (k-NN) technique is used to classify the samples. The proposed work is applied on 10 different benchmark datasets and examined with SI techniques. The experimental results show that the results obtained from k-NN classifier through SFLLF feature selection method outperform PSO, CS, and SFL.
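    The ranking and classification steps that surround the swarm search above can be sketched briefly. The signal-to-noise ratio is taken here in its common two-class form, |mean1 - mean2| / (sd1 + sd2); the use of the population standard deviation, the zero score for genes with no spread, and the names `snr_scores`, `top_m`, and `knn_predict` are assumptions. The swarm-intelligence search itself (PSO, CS, SFL, SFLLF) is not shown.

```python
from collections import Counter
from statistics import mean, pstdev

def snr_scores(X, y, pos, neg):
    """Two-class signal-to-noise ratio for every gene (column) of X."""
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row, lab in zip(X, y) if lab == pos]
        b = [row[j] for row, lab in zip(X, y) if lab == neg]
        spread = pstdev(a) + pstdev(b)
        # simplification: a gene with zero spread in both classes scores 0
        scores.append(abs(mean(a) - mean(b)) / spread if spread else 0.0)
    return scores

def top_m(scores, m):
    """Indices of the m highest-scoring genes."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:m]

def knn_predict(X, y, query, genes, k=3):
    """Majority vote among the k nearest training samples, using selected genes only."""
    nearest = sorted((sum((row[j] - query[j]) ** 2 for j in genes), lab)
                     for row, lab in zip(X, y))
    votes = Counter(lab for _, lab in nearest[:k])
    return votes.most_common(1)[0][0]
```

    A swarm method would then search within the `top_m` candidates for the subset that maximizes k-NN accuracy, rather than simply taking the highest-ranked genes.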

  3. Transcriptomic Profiling and Functional Characterization of Fusion Genes in Recurrent Ovarian Cancer

    DTIC Science & Technology

    2017-09-01

    the enhanced malignancy observed in recurrent disease. In the first year of this proposal we have assembled a cohort of 18 patient matched pairs of...significance and biologic function of prioritized RNA fusion events. ...cellularity. 19 cases were identified (Table 1) but one was removed for quality control issues thus leaving a total of 18 cases. Table 1 shows the clinical

  4. Predictive models for subtypes of autism spectrum disorder based on single-nucleotide polymorphisms and magnetic resonance imaging.

    PubMed

    Jiao, Y; Chen, R; Ke, X; Cheng, L; Chu, K; Lu, Z; Herskovits, E H

    2011-01-01

    Autism spectrum disorder (ASD) is a neurodevelopmental disorder, of which Asperger syndrome and high-functioning autism are subtypes. Our goal is: 1) to determine whether a diagnostic model based on single-nucleotide polymorphisms (SNPs), brain regional thickness measurements, or brain regional volume measurements can distinguish Asperger syndrome from high-functioning autism; and 2) to compare the SNP, thickness, and volume-based diagnostic models. Our study included 18 children with ASD: 13 subjects with high-functioning autism and 5 subjects with Asperger syndrome. For each child, we obtained 25 SNPs for 8 ASD-related genes; we also computed regional cortical thicknesses and volumes for 66 brain structures, based on structural magnetic resonance (MR) examination. To generate diagnostic models, we employed five machine-learning techniques: decision stump, alternating decision trees, multi-class alternating decision trees, logistic model trees, and support vector machines. For SNP-based classification, three decision-tree-based models performed better than the other two machine-learning models. The performance metrics for three decision-tree-based models were similar: decision stump was modestly better than the other two methods, with accuracy = 90%, sensitivity = 0.95 and specificity = 0.75. All thickness and volume-based diagnostic models performed poorly. The SNP-based diagnostic models were superior to those based on thickness and volume. For SNP-based classification, rs878960 in GABRB3 (gamma-aminobutyric acid A receptor, beta 3) was selected by all tree-based models. Our analysis demonstrated that SNP-based classification was more accurate than morphometry-based classification in ASD subtype classification. Also, we found that one SNP--rs878960 in GABRB3--distinguishes Asperger syndrome from high-functioning autism.

  5. Validation of the Lung Subtyping Panel in Multiple Fresh-Frozen and Formalin-Fixed, Paraffin-Embedded Lung Tumor Gene Expression Data Sets.

    PubMed

    Faruki, Hawazin; Mayhew, Gregory M; Fan, Cheng; Wilkerson, Matthew D; Parker, Scott; Kam-Morgan, Lauren; Eisenberg, Marcia; Horten, Bruce; Hayes, D Neil; Perou, Charles M; Lai-Goldman, Myla

    2016-06-01

    Context: A histologic classification of lung cancer subtypes is essential in guiding therapeutic management. Objective: To complement morphology-based classification of lung tumors, a previously developed lung subtyping panel (LSP) of 57 genes was tested using multiple public fresh-frozen gene-expression data sets and a prospectively collected set of formalin-fixed, paraffin-embedded lung tumor samples. Design: The LSP gene-expression signature was evaluated in multiple lung cancer gene-expression data sets totaling 2177 patients collected from 4 platforms: Illumina RNAseq (San Diego, California), Agilent (Santa Clara, California) and Affymetrix (Santa Clara) microarrays, and quantitative reverse transcription-polymerase chain reaction. Gene centroids were calculated for each of 3 genomic-defined subtypes: adenocarcinoma, squamous cell carcinoma, and neuroendocrine, the latter of which encompassed both small cell carcinoma and carcinoid. Classification by LSP into 3 subtypes was evaluated in both fresh-frozen and formalin-fixed, paraffin-embedded tumor samples, and agreement with the original morphology-based diagnosis was determined. Results: The LSP-based classifications demonstrated overall agreement with the original clinical diagnosis ranging from 78% (251 of 322) to 91% (492 of 538 and 869 of 951) in the fresh-frozen public data sets and 84% (65 of 77) in the formalin-fixed, paraffin-embedded data set. The LSP performance was independent of tissue-preservation method and gene-expression platform. Secondary, blinded pathology review of formalin-fixed, paraffin-embedded samples demonstrated concordance of 82% (63 of 77) with the original morphology diagnosis. Conclusions: The LSP gene-expression signature is a reproducible and objective method for classifying lung tumors and demonstrates good concordance with morphology-based classification across multiple data sets. The LSP panel can supplement morphologic assessment of lung cancers, particularly when classification by standard methods is challenging.

  6. Patterns of population differentiation of candidate genes for cardiovascular disease

    PubMed Central

    Kullo, Iftikhar J; Ding, Keyue

    2007-01-01

    Background: The basis for ethnic differences in cardiovascular disease (CVD) susceptibility is not fully understood. We investigated patterns of population differentiation (FST) of a set of genes in etiologic pathways of CVD among 3 ethnic groups: Yoruba in Nigeria (YRI), Utah residents with European ancestry (CEU), and Han Chinese (CHB) + Japanese (JPT). We identified 37 pathways implicated in CVD based on the PANTHER classification and 416 genes in these pathways were further studied; these genes belonged to 6 biological processes (apoptosis, blood circulation and gas exchange, blood clotting, homeostasis, immune response, and lipoprotein metabolism). Genotype data were obtained from the HapMap database. Results: We calculated FST for 15,559 common SNPs (minor allele frequency ≥ 0.10 in at least one population) in genes that co-segregated among the populations, as well as an average-weighted FST for each gene. SNPs were classified as putatively functional (non-synonymous and untranslated regions) or non-functional (intronic and synonymous sites). Mean FST values for common putatively functional variants were significantly higher than FST values for non-functional variants. A significant variation in FST was also seen based on biological processes; the processes of 'apoptosis' and 'lipoprotein metabolism' showed an excess of genes with high FST. Thus, putative functional SNPs in genes in etiologic pathways for CVD show greater population differentiation than non-functional SNPs and a significant variance of FST values was noted among pairwise population comparisons for different biological processes. Conclusion: These results suggest a possible basis for varying susceptibility to CVD among ethnic groups. PMID:17626638
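    For readers unfamiliar with FST, the sketch below computes it for a single biallelic SNP as (H_T - H_S) / H_T, where H_T is the expected heterozygosity of the pooled populations and H_S the mean within-population heterozygosity. The abstract does not state which estimator the authors used, so this textbook form with equal population weights, along with the names `fst_biallelic` and `is_common`, is an assumption. The second function mirrors the stated inclusion rule (minor allele frequency of at least 0.10 in at least one population).

```python
def fst_biallelic(freqs):
    """F_ST for one biallelic SNP from per-population allele frequencies."""
    p_bar = sum(freqs) / len(freqs)
    h_t = 2 * p_bar * (1 - p_bar)                            # pooled heterozygosity
    h_s = sum(2 * p * (1 - p) for p in freqs) / len(freqs)   # mean within-population
    return 0.0 if h_t == 0 else (h_t - h_s) / h_t

def is_common(freqs, threshold=0.10):
    """True if the minor allele frequency reaches the threshold in any population."""
    return any(min(p, 1 - p) >= threshold for p in freqs)
```

    For example, allele frequencies of 0.2 and 0.8 in two populations give H_T = 0.5 and H_S = 0.32, so FST = 0.36; identical frequencies give 0 and fixed differences give 1.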

  7. A postprocessing method in the HMC framework for predicting gene function based on biological instrumental data

    NASA Astrophysics Data System (ADS)

    Feng, Shou; Fu, Ping; Zheng, Wenbin

    2018-03-01

    Predicting gene function based on biological instrumental data is a complicated and challenging hierarchical multi-label classification (HMC) problem. When using local approach methods to solve this problem, a preliminary results processing method is usually needed. This paper proposed a novel preliminary results processing method called the nodes interaction method. The nodes interaction method revises the preliminary results and guarantees that the predictions are consistent with the hierarchy constraint. This method exploits the label dependency and considers the hierarchical interaction between nodes when making decisions based on the Bayesian network in its first phase. In the second phase, this method further adjusts the results according to the hierarchy constraint. Implementing the nodes interaction method in the HMC framework also enhances the HMC performance for solving the gene function prediction problem based on the Gene Ontology (GO), the hierarchy of which is a directed acyclic graph that is more difficult to tackle. The experimental results validate the promising performance of the proposed method compared to state-of-the-art methods on eight benchmark yeast data sets annotated by the GO.
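    The hierarchy constraint mentioned above means that a gene predicted to have a GO term must also be predicted to have all of that term's ancestors. The sketch below shows one generic way to enforce this on per-term scores, by capping each term's score at the lowest score among its ancestors. It is not the paper's nodes interaction method (which uses a Bayesian network in its first phase); `enforce_hierarchy` and the `parents` mapping are hypothetical, and the graph is assumed to be acyclic.

```python
def enforce_hierarchy(scores, parents):
    """Cap every node's score by its ancestors' scores in a DAG.

    scores:  dict mapping node -> preliminary prediction score
    parents: dict mapping node -> list of parent nodes (roots may be absent)
    """
    fixed = {}

    def resolve(node):
        if node not in fixed:
            # a node can score no higher than any of its (corrected) parents
            fixed[node] = min([scores[node]] +
                              [resolve(p) for p in parents.get(node, [])])
        return fixed[node]

    for node in scores:
        resolve(node)
    return fixed
```

    After this correction, thresholding the scores at any cutoff yields a label set that is closed under the ancestor relation, including for terms with several parents.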

  8. Advances in metaheuristics for gene selection and classification of microarray data.

    PubMed

    Duval, Béatrice; Hao, Jin-Kao

    2010-01-01

    Gene selection aims at identifying a (small) subset of informative genes from the initial data in order to obtain high predictive accuracy for classification. Gene selection can be considered as a combinatorial search problem and thus be conveniently handled with optimization methods. In this article, we summarize some recent developments of using metaheuristic-based methods within an embedded approach for gene selection. In particular, we put forward the importance and usefulness of integrating problem-specific knowledge into the search operators of such a method. To illustrate the point, we explain how ranking coefficients of a linear classifier such as support vector machine (SVM) can be profitably used to reinforce the search efficiency of Local Search and Evolutionary Search metaheuristic algorithms for gene selection and classification.

  9. Gene Set−Based Integrative Analysis Revealing Two Distinct Functional Regulation Patterns in Four Common Subtypes of Epithelial Ovarian Cancer

    PubMed Central

    Chang, Chia-Ming; Chuang, Chi-Mu; Wang, Mong-Lien; Yang, Yi-Ping; Chuang, Jen-Hua; Yang, Ming-Jie; Yen, Ming-Shyen; Chiou, Shih-Hwa; Chang, Cheng-Chang

    2016-01-01

    Clear cell (CCC), endometrioid (EC), mucinous (MC) and high-grade serous carcinoma (SC) are the four most common subtypes of epithelial ovarian carcinoma (EOC). The widely accepted dualistic model of ovarian carcinogenesis divided EOCs into type I and II categories based on the molecular features. However, this hypothesis has not been experimentally demonstrated. We carried out a gene set-based analysis by integrating the microarray gene expression profiles downloaded from the publicly available databases. These quantified biological functions of EOCs were defined by 1454 Gene Ontology (GO) term and 674 Reactome pathway gene sets. The pathogenesis of the four EOC subtypes was investigated by hierarchical clustering and exploratory factor analysis. The patterns of functional regulation among the four subtypes containing 1316 cases could be accurately classified by machine learning. The results revealed that the ERBB and PI3K-related pathways played important roles in the carcinogenesis of CCC, EC and MC; while deregulation of cell cycle was more predominant in SC. The study revealed that two different functional regulation patterns exist among the four EOC subtypes, which were compatible with the type I and II classifications proposed by the dualistic model of ovarian carcinogenesis. PMID:27527159

  10. Lignin, mitochondrial family, and photorespiratory transporter classification as case studies in using co-expression, co-response, and protein locations to aid in identifying transport functions

    PubMed Central

    Tohge, Takayuki; Fernie, Alisdair R.

    2014-01-01

    Whole genome sequencing and the relative ease of transcript profiling have facilitated the collection and data warehousing of immense quantities of expression data. However, a substantial proportion of genes are not yet functionally annotated, a problem which is particularly acute for transport proteins. In Arabidopsis, for example, only a minor fraction of the estimated 700 intracellular transporters have been identified at the molecular genetic level. Furthermore, it is only within the last couple of years that critical genes such as those encoding the final transport step required for the long distance transport of sucrose and the first transporter of the core photorespiratory pathway have been identified. Here we will describe how transcriptional coordination between genes of known function and non-annotated genes allows the identification of putative transporters on the premise that such co-expressed genes tend to be functionally related. We will additionally extend this approach to include phenotypic information from other levels of cellular organization, such as proteomic and metabolomic data, and provide case studies in which it has successfully been used to fill knowledge gaps in important metabolic pathways and physiological processes. PMID:24672529

  11. Classification of Time Series Gene Expression in Clinical Studies via Integration of Biological Network

    PubMed Central

    Qian, Liwei; Zheng, Haoran; Zhou, Hong; Qin, Ruibin; Li, Jinlong

    2013-01-01

    The increasing availability of time series expression datasets, although promising, raises a number of new computational challenges. Accordingly, the development of suitable classification methods to make reliable and sound predictions is becoming a pressing issue. We propose, here, a new method to classify time series gene expression via integration of biological networks. We evaluated our approach on 2 different datasets and showed that the use of a hidden Markov model/Gaussian mixture models hybrid explores the time-dependence of the expression data, thereby leading to better prediction results. We demonstrated that the biclustering procedure identifies function-related genes as a whole, giving rise to high accordance in prognosis prediction across independent time series datasets. In addition, we showed that integration of biological networks into our method significantly improves prediction performance. Moreover, we compared our approach with several state-of-the-art algorithms and found that our method outperformed previous approaches with regard to various criteria. Finally, our approach achieved better prediction results on early-stage data, implying the potential of our method for practical prediction. PMID:23516469

  12. Hierarchical Gene Selection and Genetic Fuzzy System for Cancer Microarray Data Classification

    PubMed Central

    Nguyen, Thanh; Khosravi, Abbas; Creighton, Douglas; Nahavandi, Saeid

    2015-01-01

    This paper introduces a novel approach to gene selection based on a substantial modification of analytic hierarchy process (AHP). The modified AHP systematically integrates outcomes of individual filter methods to select the most informative genes for microarray classification. Five individual ranking methods including t-test, entropy, receiver operating characteristic (ROC) curve, Wilcoxon and signal to noise ratio are employed to rank genes. These ranked genes are then considered as inputs for the modified AHP. Additionally, a method that uses fuzzy standard additive model (FSAM) for cancer classification based on genes selected by AHP is also proposed in this paper. Traditional FSAM learning is a hybrid process comprising unsupervised structure learning and supervised parameter tuning. Genetic algorithm (GA) is incorporated in-between unsupervised and supervised training to optimize the number of fuzzy rules. The integration of GA enables FSAM to deal with the high-dimensional-low-sample nature of microarray data and thus enhance the efficiency of the classification. Experiments are carried out on numerous microarray datasets. Results demonstrate the performance dominance of the AHP-based gene selection against the single ranking methods. Furthermore, the combination of AHP-FSAM shows a great accuracy in microarray data classification compared to various competing classifiers. The proposed approach therefore is useful for medical practitioners and clinicians as a decision support system that can be implemented in the real medical practice. PMID:25823003

  13. Hierarchical gene selection and genetic fuzzy system for cancer microarray data classification.

    PubMed

    Nguyen, Thanh; Khosravi, Abbas; Creighton, Douglas; Nahavandi, Saeid

    2015-01-01

    This paper introduces a novel approach to gene selection based on a substantial modification of analytic hierarchy process (AHP). The modified AHP systematically integrates outcomes of individual filter methods to select the most informative genes for microarray classification. Five individual ranking methods including t-test, entropy, receiver operating characteristic (ROC) curve, Wilcoxon and signal to noise ratio are employed to rank genes. These ranked genes are then considered as inputs for the modified AHP. Additionally, a method that uses fuzzy standard additive model (FSAM) for cancer classification based on genes selected by AHP is also proposed in this paper. Traditional FSAM learning is a hybrid process comprising unsupervised structure learning and supervised parameter tuning. Genetic algorithm (GA) is incorporated in-between unsupervised and supervised training to optimize the number of fuzzy rules. The integration of GA enables FSAM to deal with the high-dimensional-low-sample nature of microarray data and thus enhance the efficiency of the classification. Experiments are carried out on numerous microarray datasets. Results demonstrate the performance dominance of the AHP-based gene selection against the single ranking methods. Furthermore, the combination of AHP-FSAM shows a great accuracy in microarray data classification compared to various competing classifiers. The proposed approach therefore is useful for medical practitioners and clinicians as a decision support system that can be implemented in the real medical practice.

  14. Classification of a large microarray data set: Algorithm comparison and analysis of drug signatures

    PubMed Central

    Natsoulis, Georges; El Ghaoui, Laurent; Lanckriet, Gert R.G.; Tolley, Alexander M.; Leroy, Fabrice; Dunlea, Shane; Eynon, Barrett P.; Pearson, Cecelia I.; Tugendreich, Stuart; Jarnagin, Kurt

    2005-01-01

    A large gene expression database has been produced that characterizes the gene expression and physiological effects of hundreds of approved and withdrawn drugs, toxicants, and biochemical standards in various organs of live rats. In order to derive useful biological knowledge from this large database, a variety of supervised classification algorithms were compared using a 597-microarray subset of the data. Our studies show that several types of linear classifiers based on Support Vector Machines (SVMs) and Logistic Regression can be used to derive readily interpretable drug signatures with high classification performance. Both methods can be tuned to produce classifiers of drug treatments in the form of short, weighted gene lists which upon analysis reveal that some of the signature genes have a positive contribution (act as “rewards” for the class-of-interest) while others have a negative contribution (act as “penalties”) to the classification decision. The combination of reward and penalty genes enhances performance by keeping the number of false positive treatments low. The results of these algorithms are combined with feature selection techniques that further reduce the length of the drug signatures, an important step towards the development of useful diagnostic biomarkers and low-cost assays. Multiple signatures with no genes in common can be generated for the same classification end-point. Comparison of these gene lists identifies biological processes characteristic of a given class. PMID:15867433
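
    The reward/penalty reading of a linear signature described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the gene names, weights, bias, and the helper names split_signature and decision_score are all hypothetical, and the weights are assumed to come from some already-trained linear classifier (for example a linear SVM or logistic regression).

```python
def split_signature(weights):
    """Split a weighted gene list into reward (positive) and penalty (negative) genes.

    `weights` maps a gene name to its coefficient in a trained linear classifier.
    Genes with a zero weight contribute nothing and are left out of both groups.
    """
    rewards = {gene: w for gene, w in weights.items() if w > 0}
    penalties = {gene: w for gene, w in weights.items() if w < 0}
    return rewards, penalties


def decision_score(weights, bias, expression):
    """Linear decision value: bias plus the weighted sum of expression values.

    Genes missing from `expression` are treated as 0.0 (an assumption made
    for this sketch). A positive score votes for the class of interest.
    """
    return bias + sum(w * expression.get(gene, 0.0) for gene, w in weights.items())


# Hypothetical three-gene signature and one hypothetical sample.
toy_weights = {"gene_a": 1.5, "gene_b": -2.0, "gene_c": 0.0}
toy_sample = {"gene_a": 2.0, "gene_b": 0.5}
toy_rewards, toy_penalties = split_signature(toy_weights)
toy_score = decision_score(toy_weights, -0.5, toy_sample)
```

    In this toy case the reward gene pushes the score up and the penalty gene pulls it down, which is how a penalty gene can veto a sample that would otherwise be a false positive.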

  15. Association between expression of random gene sets and survival is evident in multiple cancer types and may be explained by sub-classification.

    PubMed

    Shimoni, Yishai

    2018-02-01

    One of the goals of cancer research is to identify a set of genes that cause or control disease progression. However, although multiple such gene sets were published, these are usually in very poor agreement with each other, and very few of the genes proved to be functional therapeutic targets. Furthermore, recent findings from a breast cancer gene-expression cohort showed that sets of genes selected randomly can be used to predict survival with a much higher probability than expected. These results imply that many of the genes identified in breast cancer gene expression analysis may not be causal of cancer progression, even though they can still be highly predictive of prognosis. We performed a similar analysis on all the cancer types available in the cancer genome atlas (TCGA), namely, estimating the predictive power of random gene sets for survival. Our work shows that most cancer types exhibit the property that random selections of genes are more predictive of survival than expected. In contrast to previous work, this property is not removed by using a proliferation signature, which implies that proliferation may not always be the confounder that drives this property. We suggest one possible solution in the form of data-driven sub-classification to reduce this property significantly. Our results suggest that the predictive power of random gene sets may be used to identify the existence of sub-classes in the data, and thus may allow better understanding of patient stratification. Furthermore, by reducing the observed bias this may allow more direct identification of biologically relevant, and potentially causal, genes.

  16. Association between expression of random gene sets and survival is evident in multiple cancer types and may be explained by sub-classification

    PubMed Central

    2018-01-01

    One of the goals of cancer research is to identify a set of genes that cause or control disease progression. However, although multiple such gene sets were published, these are usually in very poor agreement with each other, and very few of the genes proved to be functional therapeutic targets. Furthermore, recent findings from a breast cancer gene-expression cohort showed that sets of genes selected randomly can be used to predict survival with a much higher probability than expected. These results imply that many of the genes identified in breast cancer gene expression analysis may not be causal of cancer progression, even though they can still be highly predictive of prognosis. We performed a similar analysis on all the cancer types available in the cancer genome atlas (TCGA), namely, estimating the predictive power of random gene sets for survival. Our work shows that most cancer types exhibit the property that random selections of genes are more predictive of survival than expected. In contrast to previous work, this property is not removed by using a proliferation signature, which implies that proliferation may not always be the confounder that drives this property. We suggest one possible solution in the form of data-driven sub-classification to reduce this property significantly. Our results suggest that the predictive power of random gene sets may be used to identify the existence of sub-classes in the data, and thus may allow better understanding of patient stratification. Furthermore, by reducing the observed bias this may allow more direct identification of biologically relevant, and potentially causal, genes. PMID:29470520
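
    The comparison against random gene sets discussed above can be illustrated with an empirical null distribution. The sketch below is a generic permutation-style check and not the procedure used in the paper: score_fn stands in for whatever survival-association statistic is used, and the function name, parameters, and toy scoring rule are assumptions made for illustration only.

```python
import random


def empirical_p_value(score_fn, signature, all_genes, n_random=1000, seed=0):
    """Fraction of random gene sets that score at least as well as `signature`.

    `score_fn` takes a list of genes and returns a number where larger means
    more predictive. Random sets have the same size as the signature and are
    drawn without replacement from `all_genes`.
    """
    rng = random.Random(seed)
    observed = score_fn(signature)
    size = len(signature)
    hits = 0
    for _ in range(n_random):
        if score_fn(rng.sample(all_genes, size)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_random + 1)


# Toy setup: gene i has "predictive weight" i, and a set scores the sum of its weights.
toy_genes = list(range(100))
toy_score = sum
p_strong = empirical_p_value(toy_score, [95, 96, 97, 98, 99], toy_genes)
p_weak = empirical_p_value(toy_score, [0, 1, 2, 3, 4], toy_genes)
```

    If many random sets reach the observed score, the signature is no more informative than chance selection, which is the situation the abstract reports for most cancer types.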

  17. Inferring gene dependency network specific to phenotypic alteration based on gene expression data and clinical information of breast cancer.

    PubMed

    Zhou, Xionghui; Liu, Juan

    2014-01-01

    Although many methods have been proposed to reconstruct gene regulatory networks, most of them, when applied to sample-based data, cannot reveal the gene regulatory relations underlying the phenotypic change (e.g. normal versus cancer). In this paper, we adopt phenotype as a variable when constructing the gene regulatory network, while earlier studies either neglected it or only used it to select the differentially expressed genes as the inputs to construct the gene regulatory network. To be specific, we integrate phenotype information with gene expression data to identify the gene dependency pairs by using the method of conditional mutual information. A gene dependency pair (A,B) means that the influence of gene A on the phenotype depends on gene B. All identified gene dependency pairs constitute a directed network underlying the phenotype, namely the gene dependency network. In this way, we have constructed the gene dependency network of breast cancer from gene expression data along with two different phenotype states (metastasis and non-metastasis). Moreover, we have found the network to be scale-free, indicating that its hub genes with high out-degrees may play critical roles in the network. After functional investigation, these hub genes are found to be biologically significant and specifically related to breast cancer, which suggests that our gene dependency network is meaningful. The validity has also been justified by literature investigation. From the network, we have selected 43 discriminative hubs as a signature to build the classification model for distinguishing the distant metastasis risks of breast cancer patients, and the result outperforms those classification models with published signatures.
    In conclusion, we have proposed a promising way to construct the gene regulatory network by using sample-based data, which has been shown to be effective and accurate in uncovering the hidden mechanism of the biological process and identifying the gene signature for phenotypic change.
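
    To make the conditional mutual information step above concrete, the following sketch estimates I(A; P | B) in bits from discretized values of gene A, phenotype P, and gene B. It is only an illustration under stated assumptions: the binary discretization, the plug-in count-based estimator, and the function name are choices made here, not details taken from the paper.

```python
from collections import Counter
from math import log2


def conditional_mutual_information(a, p, b):
    """Plug-in estimate of I(A; P | B) in bits from three discrete sequences.

    Uses I(A;P|B) = sum over (a,p,b) of
        p(a,p,b) * log2( p(b) * p(a,p,b) / (p(a,b) * p(p,b)) ),
    with every probability replaced by its empirical frequency.
    """
    n = len(a)
    if n == 0 or len(p) != n or len(b) != n:
        raise ValueError("a, p and b must be non-empty and of equal length")
    n_apb = Counter(zip(a, p, b))
    n_ab = Counter(zip(a, b))
    n_pb = Counter(zip(p, b))
    n_b = Counter(b)
    cmi = 0.0
    for (ai, pi, bi), count in n_apb.items():
        ratio = count * n_b[bi] / (n_ab[(ai, bi)] * n_pb[(pi, bi)])
        cmi += (count / n) * log2(ratio)
    return cmi


# Toy example: the phenotype is the XOR of two binarized genes, so gene A
# tells us nothing about the phenotype alone but everything once B is known.
gene_a = [0, 0, 1, 1]
gene_b = [0, 1, 0, 1]
phenotype = [0, 1, 1, 0]
toy_cmi = conditional_mutual_information(gene_a, phenotype, gene_b)
```

    A pair (A,B) with high I(A; P | B) but low unconditional dependence between A and P is the kind of relationship the abstract calls a gene dependency pair; any threshold for calling a pair significant would have to be chosen separately.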

  18. New workflow for classification of genetic variants' pathogenicity applied to hereditary recurrent fevers by the International Study Group for Systemic Autoinflammatory Diseases (INSAID).

    PubMed

    Van Gijn, Marielle E; Ceccherini, Isabella; Shinar, Yael; Carbo, Ellen C; Slofstra, Mariska; Arostegui, Juan I; Sarrabay, Guillaume; Rowczenio, Dorota; Omoyımnı, Ebun; Balci-Peynircioglu, Banu; Hoffman, Hal M; Milhavet, Florian; Swertz, Morris A; Touitou, Isabelle

    2018-03-29

    Hereditary recurrent fevers (HRFs) are rare inflammatory diseases sharing similar clinical symptoms and effectively treated with anti-inflammatory biological drugs. Accurate diagnosis of HRF relies heavily on genetic testing. This study aimed to obtain an experts' consensus on the clinical significance of gene variants in four well-known HRF genes: MEFV, TNFRSF1A, NLRP3 and MVK. We configured a MOLGENIS web platform to share and analyse pathogenicity classifications of the variants and to manage a consensus-based classification process. Four experts in HRF genetics submitted independent classifications of 858 variants. Classifications were driven to consensus by recruiting four more expert opinions and by targeting discordant classifications in five iterative rounds. Consensus classification was reached for 804/858 variants (94%). None of the unsolved variants (6%) remained with opposite classifications (eg, pathogenic vs benign). New mutational hotspots were found in all genes. We noted a lower pathogenic variant load and a higher fraction of variants with unknown or unsolved clinical significance in the MEFV gene. Applying a consensus-driven process on the pathogenicity assessment of experts yielded rapid classification of almost all variants of four HRF genes. The high-throughput database will profoundly assist clinicians and geneticists in the diagnosis of HRFs. The configured MOLGENIS platform and consensus evolution protocol are usable for assembly of other variant pathogenicity databases. The MOLGENIS software is available for reuse at http://github.com/molgenis/molgenis; the specific HRF configuration is available at http://molgenis.org/said/. The HRF pathogenicity classifications will be published on the INFEVERS database at https://fmf.igh.cnrs.fr/ISSAID/infevers/.

  19. Sequence variant classification and reporting: recommendations for improving the interpretation of cancer susceptibility genetic test results

    PubMed Central

    Plon, Sharon E.; Eccles, Diana M.; Easton, Douglas; Foulkes, William D.; Genuardi, Maurizio; Greenblatt, Marc S.; Hogervorst, Frans B.L.; Hoogerbrugge, Nicoline; Spurdle, Amanda B.; Tavtigian, Sean

    2011-01-01

    Genetic testing of cancer susceptibility genes is now widely applied in clinical practice to predict risk of developing cancer. In general, sequence-based testing of germline DNA is used to determine whether an individual carries a change that is clearly likely to disrupt normal gene function. Genetic testing may detect changes that are clearly pathogenic, clearly neutral or variants of unclear clinical significance. Such variants present a considerable challenge to the diagnostic laboratory and the receiving clinician in terms of interpretation and clear presentation of the implications of the result to the patient. There does not appear to be a consistent approach to interpreting and reporting the clinical significance of variants either among genes or among laboratories. The potential for confusion among clinicians and patients is considerable and misinterpretation may lead to inappropriate clinical consequences. In this article we review the current state of sequence-based genetic testing, describe other standardized reporting systems used in oncology and propose a standardized classification system for application to sequence based results for cancer predisposition genes. We suggest a system of five classes of variants based on the degree of likelihood of pathogenicity. Each class is associated with specific recommendations for clinical management of at-risk relatives that will depend on the syndrome. We propose that panels of experts on each cancer predisposition syndrome facilitate the classification scheme and designate appropriate surveillance and cancer management guidelines. The international adoption of a standardized reporting system should improve the clinical utility of sequence-based genetic tests to predict cancer risk. PMID:18951446

  20. Genome-Wide Analysis of NBS-LRR Genes in Sorghum Genome Revealed Several Events Contributing to NBS-LRR Gene Evolution in Grass Species

    PubMed Central

    Yang, Xiping; Wang, Jianping

    2016-01-01

    The nucleotide-binding site (NBS)–leucine-rich repeat (LRR) gene family is crucially important for offering resistance to pathogens. To explore evolutionary conservation and variability of NBS-LRR genes across grass species, we identified 88, 107, 24, and 44 full-length NBS-LRR genes in sorghum, rice, maize, and Brachypodium, respectively. A comprehensive analysis was performed on classification, genome organization, evolution, expression, and regulation of these NBS-LRR genes using sorghum as a representative of grass species. In general, the full-length NBS-LRR genes are highly clustered and duplicated in sorghum genome mainly due to local duplications. NBS-LRR genes have basal expression levels and are highly potentially targeted by miRNA. The number of NBS-LRR genes in the four grass species is positively correlated with the gene clustering rate. The results provided a valuable genomic resource and insights for functional and evolutionary studies of NBS-LRR genes in grass species. PMID:26792976

  1. Blood-Based Gene Expression Profiles Models for Classification of Subsyndromal Symptomatic Depression and Major Depressive Disorder

    PubMed Central

    Yu, Shunying; Yuan, Chengmei; Hong, Wu; Wang, Zuowei; Cui, Jian; Shi, Tieliu; Fang, Yiru

    2012-01-01

    Subsyndromal symptomatic depression (SSD) is a subtype of subthreshold depression and, like major depressive disorder (MDD), leads to significant psychosocial functional impairment. Several studies have suggested that SSD is a transitory phenomenon in the depression spectrum and is thus considered a subtype of depression. However, the pathophysiology of depression remains largely obscure and studies on SSD are limited. The present study compared leukocyte expression profiles and performed classification using whole-genome cRNA microarrays among drug-free first-episode subjects with SSD, MDD, and matched controls (8 subjects in each group). Support vector machines (SVMs) were used for training and testing on candidate signature expression profiles from the signature selection step. First, we identified 63 differentially expressed SSD signatures in contrast to control (P ≤ 5.0E-4) and 30 differentially expressed MDD signatures in contrast to control. Then, 123 gene signatures were identified with significantly different expression levels between SSD and MDD. Second, in order to prioritize biomarkers for SSD and MDD together, we selected top gene signatures from each group of pair-wise comparison results and merged the signatures to generate better profiles for clearly classifying the SSD and MDD sets at the same time. In detail, we tried different combinations of signatures from the three pair-wise comparison results and finally determined 48 gene expression signatures with 100% accuracy. Our findings suggest that SSD and MDD do not exhibit the same expressed genome signature in peripheral blood leukocytes, and blood cell–derived RNA of these 48 gene models may have significant value for performing diagnostic functions and classifying SSD, MDD, and healthy controls. PMID:22348066

  2. Structure of the human MLH1 N-terminus: implications for predisposition to Lynch syndrome

    DOE PAGES

    Wu, Hong; Zeng, Hong; Lam, Robert; ...

    2015-08-01

    Mismatch repair prevents the accumulation of erroneous insertions/deletions and non-Watson–Crick base pairs in the genome. Pathogenic mutations in the MLH1 gene are associated with a predisposition to Lynch and Turcot's syndromes. Although genetic testing for these mutations is available, robust classification of variants requires strong clinical and functional support. Here, the first structure of the N-terminus of human MLH1, determined by X-ray crystallography, is described. Lastly, the structure shares a high degree of similarity with previously determined prokaryotic MLH1 homologs; however, this structure affords a more accurate platform for the classification of MLH1 variants.

  3. Genetic investigation of 100 heart genes in sudden unexplained death victims in a forensic setting

    PubMed Central

    Christiansen, Sofie Lindgren; Hertz, Christin Løth; Ferrero-Miliani, Laura; Dahl, Morten; Weeke, Peter Ejvin; LuCamp; Ottesen, Gyda Lolk; Frank-Hansen, Rune; Bundgaard, Henning; Morling, Niels

    2016-01-01

    In forensic medicine, one-third of the sudden deaths remain unexplained after medico-legal autopsy. A major proportion of these sudden unexplained deaths (SUD) are considered to be caused by inherited cardiac diseases. Sudden cardiac death (SCD) may be the first manifestation of these diseases. The purpose of this study was to explore the yield of next-generation sequencing of genes associated with SCD in a cohort of SUD victims. We investigated 100 genes associated with cardiac diseases in 61 young (1–50 years) SUD cases. DNA was captured with the Haloplex target enrichment system and sequenced using an Illumina MiSeq. The identified genetic variants were evaluated and classified as likely, unknown or unlikely to have a functional effect. The criteria for this classification were based on the literature, databases, conservation and prediction of the effect of the variant. We found that 21 (34%) individuals carried variants with a likely functional effect. Ten (40%) of these variants were located in genes associated with cardiomyopathies and 15 (60%) of the variants in genes associated with cardiac channelopathies. Nineteen individuals carried variants with unknown functional effect. Our findings indicate that broad genetic investigation of SUD victims increases the diagnostic outcome, and the investigation should comprise genes involved in both cardiomyopathies and cardiac channelopathies. PMID:27650965

  4. Genetic investigation of 100 heart genes in sudden unexplained death victims in a forensic setting.

    PubMed

    Christiansen, Sofie Lindgren; Hertz, Christin Løth; Ferrero-Miliani, Laura; Dahl, Morten; Weeke, Peter Ejvin; LuCamp; Ottesen, Gyda Lolk; Frank-Hansen, Rune; Bundgaard, Henning; Morling, Niels

    2016-12-01

    In forensic medicine, one-third of the sudden deaths remain unexplained after medico-legal autopsy. A major proportion of these sudden unexplained deaths (SUD) are considered to be caused by inherited cardiac diseases. Sudden cardiac death (SCD) may be the first manifestation of these diseases. The purpose of this study was to explore the yield of next-generation sequencing of genes associated with SCD in a cohort of SUD victims. We investigated 100 genes associated with cardiac diseases in 61 young (1-50 years) SUD cases. DNA was captured with the Haloplex target enrichment system and sequenced using an Illumina MiSeq. The identified genetic variants were evaluated and classified as likely, unknown or unlikely to have a functional effect. The criteria for this classification were based on the literature, databases, conservation and prediction of the effect of the variant. We found that 21 (34%) individuals carried variants with a likely functional effect. Ten (40%) of these variants were located in genes associated with cardiomyopathies and 15 (60%) of the variants in genes associated with cardiac channelopathies. Nineteen individuals carried variants with unknown functional effect. Our findings indicate that broad genetic investigation of SUD victims increases the diagnostic outcome, and the investigation should comprise genes involved in both cardiomyopathies and cardiac channelopathies.

  5. Hybrid genetic algorithm-neural network: feature extraction for unpreprocessed microarray data.

    PubMed

    Tong, Dong Ling; Schierz, Amanda C

    2011-09-01

    Suitable techniques for microarray analysis have been widely researched, particularly for the study of marker genes expressed to a specific type of cancer. Most of the machine learning methods that have been applied to significant gene selection focus on the classification ability rather than the selection ability of the method. These methods also require the microarray data to be preprocessed before analysis takes place. The objective of this study is to develop a hybrid genetic algorithm-neural network (GANN) model that emphasises feature selection and can operate on unpreprocessed microarray data. The GANN is a hybrid model where the fitness value of the genetic algorithm (GA) is based upon the number of samples correctly labelled by a standard feedforward artificial neural network (ANN). The model is evaluated by using two benchmark microarray datasets with different array platforms and differing number of classes (a 2-class oligonucleotide microarray data for acute leukaemia and a 4-class complementary DNA (cDNA) microarray dataset for SRBCTs (small round blue cell tumours)). The underlying concept of the GANN algorithm is to select highly informative genes by co-evolving both the GA fitness function and the ANN weights at the same time. The novel GANN selected approximately 50% of the same genes as the original studies. This may indicate that these common genes are more biologically significant than other genes in the datasets. The remaining 50% of the significant genes identified were used to build predictive models and for both datasets, the models based on the set of genes extracted by the GANN method produced more accurate results. The results also suggest that the GANN method not only can detect genes that are exclusively associated with a single cancer type but can also explore the genes that are differentially expressed in multiple cancer types. 
    The results show that the GANN model has successfully extracted statistically significant genes from the unpreprocessed microarray data as well as extracting known biologically significant genes. We also show that assessing the biological significance of genes based on classification accuracy may be misleading and though the GANN's set of extra genes prove to be more statistically significant than those selected by other methods, a biological assessment of these genes is highly recommended to confirm their functionality.
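
    The core idea above, a genetic algorithm whose fitness is the number of samples a classifier labels correctly, can be sketched as follows. This is a simplified illustration and not the GANN method itself: a nearest-centroid rule is swapped in for the feedforward neural network, nothing is co-evolved, fitness is measured on the training samples, and all function names, parameters, and toy data are hypothetical.

```python
import random


def nearest_centroid_correct(samples, labels, genes):
    """Count samples labelled correctly by a nearest-centroid rule on `genes`."""
    if not genes:
        return 0
    classes = sorted(set(labels))
    centroids = {}
    for c in classes:
        rows = [s for s, y in zip(samples, labels) if y == c]
        centroids[c] = [sum(r[g] for r in rows) / len(rows) for g in genes]
    correct = 0
    for s, y in zip(samples, labels):
        dist = {c: sum((s[g] - m) ** 2 for g, m in zip(genes, centroids[c]))
                for c in classes}
        # Ties go to the first class in sorted order.
        if min(classes, key=dist.get) == y:
            correct += 1
    return correct


def ga_select(samples, labels, n_genes, subset_size=2,
              pop_size=20, generations=30, mutation_rate=0.2, seed=0):
    """Evolve fixed-size gene subsets; fitness = number of correctly labelled samples."""
    rng = random.Random(seed)

    def fitness(genes):
        return nearest_centroid_correct(samples, labels, genes)

    population = [rng.sample(range(n_genes), subset_size) for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the fitter half unchanged (elitism) and refill with children.
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            mum, dad = rng.sample(parents, 2)
            pool = sorted(set(mum) | set(dad))        # crossover: mix parental genes
            child = rng.sample(pool, subset_size)
            if rng.random() < mutation_rate:          # mutation: swap in a new gene
                unused = [g for g in range(n_genes) if g not in child]
                child[rng.randrange(subset_size)] = rng.choice(unused)
            children.append(child)
        population = parents + children
    return max(population, key=fitness)


# Toy data: gene 0 separates the two classes, genes 1 to 5 are constant.
toy_samples = [[0.0, 1, 1, 1, 1, 1]] * 4 + [[10.0, 1, 1, 1, 1, 1]] * 4
toy_labels = [0] * 4 + [1] * 4
toy_best = ga_select(toy_samples, toy_labels, n_genes=6)
```

    Because fitness here is computed on the same samples used to build the centroids, it is optimistic; a held-out or cross-validated count would be the more cautious choice, in line with the abstract's warning that classification accuracy alone can mislead.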

  6. Kinase Pathway Database: An Integrated Protein-Kinase and NLP-Based Protein-Interaction Resource

    PubMed Central

    Koike, Asako; Kobayashi, Yoshiyuki; Takagi, Toshihisa

    2003-01-01

    Protein kinases play a crucial role in the regulation of cellular functions. Various kinds of information about these molecules are important for understanding signaling pathways and organism characteristics. We have developed the Kinase Pathway Database, an integrated database involving major completely sequenced eukaryotes. It contains the classification of protein kinases and their functional conservation, ortholog tables among species, protein–protein, protein–gene, and protein–compound interaction data, domain information, and structural information. It also provides an automatic pathway graphic image interface. The protein, gene, and compound interactions are automatically extracted from abstracts for all genes and proteins by natural-language processing (NLP). The method of automatic extraction uses phrase patterns and the GENA protein, gene, and compound name dictionary, which was developed by our group. With this database, pathways are easily compared among species using data with more than 47,000 protein interactions and protein kinase ortholog tables. The database is available for querying and browsing at http://kinasedb.ontology.ims.u-tokyo.ac.jp/. PMID:12799355

  7. The future of transposable element annotation and their classification in the light of functional genomics - what we can learn from the fables of Jean de la Fontaine?

    PubMed

    Arensburger, Peter; Piégu, Benoît; Bigot, Yves

    2016-01-01

    Transposable element (TE) science has been significantly influenced by the pioneering ideas of David Finnegan near the end of the last century, as well as by the classification systems that were subsequently developed. Today, whole genome TE annotation is mostly done using tools that were developed to aid gene annotation rather than to specifically study TEs. We argue that further progress in the TE field is impeded both by current TE classification schemes and by a failure to recognize that TE biology is fundamentally different from that of multicellular organisms. Novel genome wide TE annotation methods are helping to redefine our understanding of TE sequence origins and evolution. We briefly discuss some of these new methods as well as ideas for possible alternative classification schemes. Our hope is to encourage the formation of a society to organize a larger debate on these questions and to promote the adoption of standards for annotation and an improved TE classification.

  8. Evolutionary conservation of sequence and secondary structures in CRISPR repeats

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kunin, Victor; Sorek, Rotem; Hugenholtz, Philip

    Clustered Regularly Interspaced Palindromic Repeats (CRISPRs) are a novel class of direct repeats, separated by unique spacer sequences of similar length, that are present in ~40% of bacterial and all archaeal genomes analyzed to date. More than 40 gene families, called CRISPR-associated sequences (CAS), appear in conjunction with these repeats and are thought to be involved in the propagation and functioning of CRISPRs. It has been proposed that the CRISPR/CAS system samples, maintains a record of, and inactivates invasive DNA that the cell has encountered, and therefore constitutes a prokaryotic analog of an immune system. Here we analyze CRISPR repeats identified in 195 microbial genomes and show that they can be organized into multiple clusters based on sequence similarity. All individual repeats in any given cluster were inferred to form characteristic RNA secondary structure, ranging from non-existent to pronounced. Stable secondary structures included G:U base pairs and exhibited multiple compensatory base changes in the stem region, indicating evolutionary conservation and functional importance. We also show that the repeat-based classification corresponds to, and expands upon, a previously reported CAS gene-based classification including specific relationships between CRISPR and CAS subtypes.

  9. A phylogenomic approach to bacterial subspecies classification: proof of concept in Mycobacterium abscessus.

    PubMed

    Tan, Joon Liang; Khang, Tsung Fei; Ngeow, Yun Fong; Choo, Siew Woh

    2013-12-13

    Mycobacterium abscessus is a rapidly growing mycobacterium that is often associated with human infections. The taxonomy of this species has undergone several revisions and is still being debated. In this study, we sequenced the genomes of 12 M. abscessus strains and used phylogenomic analysis to perform subspecies classification. A data mining approach was used to rank and select informative genes based on the relative entropy metric for the construction of a phylogenetic tree. The resulting tree topology was similar to that generated using the concatenation of five classical housekeeping genes: rpoB, hsp65, secA, recA and sodA. Additional support for the reliability of the subspecies classification came from the analysis of erm41 and ITS gene sequences, single nucleotide polymorphisms (SNPs)-based classification and strain clustering demonstrated by a variable number tandem repeat (VNTR) assay and a multilocus sequence analysis (MLSA). We subsequently found that the concatenation of a minimal set of three median-ranked genes: DNA polymerase III subunit alpha (polC), 4-hydroxy-2-ketovalerate aldolase (Hoa) and cell division protein FtsZ (ftsZ), is sufficient to recover the same tree topology. PCR assays designed specifically for these genes showed that all three genes could be amplified in the reference strain of M. abscessus ATCC 19977T. This study provides proof of concept that whole-genome sequence-based data mining approach can provide confirmatory evidence of the phylogenetic informativeness of existing markers, as well as lead to the discovery of a more economical and informative set of markers that produces similar subspecies classification in M. abscessus. The systematic procedure used in this study to choose the informative minimal set of gene markers can potentially be applied to species or subspecies classification of other bacteria.

  10. Functional Proteomic Analysis of Human Nucleolus

    PubMed Central

    Scherl, Alexander; Couté, Yohann; Déon, Catherine; Callé, Aleth; Kindbeiter, Karine; Sanchez, Jean-Charles; Greco, Anna; Hochstrasser, Denis; Diaz, Jean-Jacques

    2002-01-01

    The notion of a “plurifunctional” nucleolus is now well established. However, molecular mechanisms underlying the biological processes occurring within this nuclear domain remain only partially understood. As a first step in elucidating these mechanisms we have carried out a proteomic analysis to draw up a list of proteins present within nucleoli of HeLa cells. This analysis allowed the identification of 213 different nucleolar proteins. This catalog complements that of the 271 proteins obtained recently by others, giving a total of ∼350 different nucleolar proteins. Functional classification of these proteins allowed outlining several biological processes taking place within nucleoli. Bioinformatic analyses permitted the assignment of hypothetical functions for 43 proteins for which no functional information is available. Notably, a role in ribosome biogenesis was proposed for 31 proteins. More generally, this functional classification reinforces the plurifunctional nature of nucleoli and provides convincing evidence that nucleoli may play a central role in the control of gene expression. Finally, this analysis supports the recent demonstration of a coupling of transcription and translation in higher eukaryotes. PMID:12429849

  11. Entropy-based gene ranking without selection bias for the predictive classification of microarray data.

    PubMed

    Furlanello, Cesare; Serafini, Maria; Merler, Stefano; Jurman, Giuseppe

    2003-11-06

    We describe the E-RFE method for gene ranking, which is useful for the identification of markers in the predictive classification of array data. The method supports a practical modeling scheme designed to avoid the construction of classification rules based on the selection of too small gene subsets (an effect known as the selection bias, in which the estimated predictive errors are too optimistic due to testing on samples already considered in the feature selection process). With E-RFE, we speed up the recursive feature elimination (RFE) with SVM classifiers by eliminating chunks of uninteresting genes using an entropy measure of the SVM weights distribution. An optimal subset of genes is selected according to a two-strata model evaluation procedure: modeling is replicated by an external stratified-partition resampling scheme, and, within each run, an internal K-fold cross-validation is used for E-RFE ranking. Also, the optimal number of genes can be estimated according to the saturation of Zipf's law profiles. Without a decrease of classification accuracy, E-RFE allows a speed-up factor of 100 with respect to standard RFE, while improving on alternative parametric RFE reduction strategies. Thus, a process for gene selection and error estimation is made practical, ensuring control of the selection bias, and providing additional diagnostic indicators of gene importance.
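    The idea of using the entropy of the SVM weight distribution to decide how many genes to discard per step can be illustrated with a minimal sketch. The bin count, the entropy threshold, and the "drop the lowest bin" rule below are illustrative assumptions, not the published E-RFE settings; the weights are assumed to be non-negative importance scores (for example, squared weights of a linear SVM).

```python
import math

def weight_entropy(weights, n_bins=10):
    """Shannon entropy (in bits) of the histogram of min-max-scaled weights."""
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return 0.0  # all weights identical: a single occupied bin
    counts = [0] * n_bins
    for w in weights:
        idx = min(int((w - lo) / (hi - lo) * n_bins), n_bins - 1)
        counts[idx] += 1
    total = len(weights)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def genes_to_drop(weights, n_bins=10, entropy_threshold=None):
    """Indices of genes to eliminate in one step (hypothetical rule).

    Low entropy means the weights pile up in few bins, so the whole lowest
    bin is discarded at once; otherwise only the single weakest gene goes.
    """
    if entropy_threshold is None:
        entropy_threshold = 0.5 * math.log2(n_bins)  # assumed default
    lo, hi = min(weights), max(weights)
    if hi > lo and weight_entropy(weights, n_bins) < entropy_threshold:
        width = (hi - lo) / n_bins
        return [i for i, w in enumerate(weights) if w < lo + width]
    return [min(range(len(weights)), key=lambda i: weights[i])]
```

    With many near-zero weights and a few large ones, the sketch removes the whole low-weight chunk in one step; with evenly spread weights it falls back to one gene per step, as in standard RFE.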

  12. Actions of plant Argonautes: predictable or unpredictable?

    PubMed

    Ma, Zeyang; Zhang, Xiuren

    2018-05-29

    Argonaute (AGO) proteins are the key effectors of the RNA-induced silencing complex (RISC). Land plants typically encode numerous AGO proteins, which can be divided into two major functional groups based on the species of their housed small RNAs (sRNAs). One group of AGOs, guided by 24-nucleotide (nt) sRNAs, canonically functions in nuclei to implement transcriptional gene silencing (TGS), whereas the other group of AGOs, guided by 21-nt sRNAs, acts in the cytoplasm to fulfill posttranscriptional gene silencing (PTGS). Many new discoveries have recently been made on functions and mechanisms of AGO proteins in plants, and some of the findings change our views on the conventional classification and roles of AGO proteins. In this review, we summarize our current knowledge of AGO proteins in plants.

  13. AnnoTALE: bioinformatics tools for identification, annotation, and nomenclature of TALEs from Xanthomonas genomic sequences

    PubMed Central

    Grau, Jan; Reschke, Maik; Erkes, Annett; Streubel, Jana; Morgan, Richard D.; Wilson, Geoffrey G.; Koebnik, Ralf; Boch, Jens

    2016-01-01

    Transcription activator-like effectors (TALEs) are virulence factors, produced by the bacterial plant-pathogen Xanthomonas, that function as gene activators inside plant cells. Although the contribution of individual TALEs to infectivity has been shown, the specific roles of most TALEs, and the overall TALE diversity in Xanthomonas spp. is not known. TALEs possess a highly repetitive DNA-binding domain, which is notoriously difficult to sequence. Here, we describe an improved method for characterizing TALE genes by the use of PacBio sequencing. We present ‘AnnoTALE’, a suite of applications for the analysis and annotation of TALE genes from Xanthomonas genomes, and for grouping similar TALEs into classes. Based on these classes, we propose a unified nomenclature for Xanthomonas TALEs that reveals similarities pointing to related functionalities. This new classification enables us to compare related TALEs and to identify base substitutions responsible for the evolution of TALE specificities. PMID:26876161

  14. Applying Cost-Sensitive Extreme Learning Machine and Dissimilarity Integration to Gene Expression Data Classification.

    PubMed

    Liu, Yanqiu; Lu, Huijuan; Yan, Ke; Xia, Haixia; An, Chunlin

    2016-01-01

    Embedding cost-sensitive factors into the classifiers increases the classification stability and reduces the classification costs for classifying high-scale, redundant, and imbalanced datasets, such as the gene expression data. In this study, we extend our previous work, that is, Dissimilar ELM (D-ELM), by introducing misclassification costs into the classifier. We name the proposed algorithm as the cost-sensitive D-ELM (CS-D-ELM). Furthermore, we embed rejection cost into the CS-D-ELM to increase the classification stability of the proposed algorithm. Experimental results show that the rejection cost embedded CS-D-ELM algorithm effectively reduces the average and overall cost of the classification process, while the classification accuracy still remains competitive. The proposed method can be extended to classification problems of other redundant and imbalanced data.

  15. Fifteen years of research on oral-facial-digital syndromes: from 1 to 16 causal genes.

    PubMed

    Bruel, Ange-Line; Franco, Brunella; Duffourd, Yannis; Thevenon, Julien; Jego, Laurence; Lopez, Estelle; Deleuze, Jean-François; Doummar, Diane; Giles, Rachel H; Johnson, Colin A; Huynen, Martijn A; Chevrier, Véronique; Burglen, Lydie; Morleo, Manuela; Desguerres, Isabelle; Pierquin, Geneviève; Doray, Bérénice; Gilbert-Dussardier, Brigitte; Reversade, Bruno; Steichen-Gersdorf, Elisabeth; Baumann, Clarisse; Panigrahi, Inusha; Fargeot-Espaliat, Anne; Dieux, Anne; David, Albert; Goldenberg, Alice; Bongers, Ernie; Gaillard, Dominique; Argente, Jesús; Aral, Bernard; Gigot, Nadège; St-Onge, Judith; Birnbaum, Daniel; Phadke, Shubha R; Cormier-Daire, Valérie; Eguether, Thibaut; Pazour, Gregory J; Herranz-Pérez, Vicente; Goldstein, Jaclyn S; Pasquier, Laurent; Loget, Philippe; Saunier, Sophie; Mégarbané, André; Rosnet, Olivier; Leroux, Michel R; Wallingford, John B; Blacque, Oliver E; Nachury, Maxence V; Attie-Bitach, Tania; Rivière, Jean-Baptiste; Faivre, Laurence; Thauvin-Robinet, Christel

    2017-06-01

    Oral-facial-digital syndromes (OFDS) gather rare genetic disorders characterised by facial, oral and digital abnormalities associated with a wide range of additional features (polycystic kidney disease, cerebral malformations and several others) to delineate a growing list of OFDS subtypes. The most frequent, OFD type I, is caused by a heterozygous mutation in the OFD1 gene encoding a centrosomal protein. The wide clinical heterogeneity of OFDS suggests the involvement of other ciliary genes. For 15 years, we have aimed to identify the molecular bases of OFDS. This effort has been greatly helped by the recent development of whole-exome sequencing (WES). Here, we present all our published and unpublished results for WES in 24 cases with OFDS. We identified causal variants in five new genes (C2CD3, TMEM107, INTU, KIAA0753 and IFT57) and related the clinical spectrum of four genes in other ciliopathies (C5orf42, TMEM138, TMEM231 and WDPCP) to OFDS. Mutations were also detected in two genes previously implicated in OFDS. Functional studies revealed the involvement of centriole elongation, transition zone and intraflagellar transport defects in OFDS, thus characterising three ciliary protein modules: the complex KIAA0753-FOPNL-OFD1, a regulator of centriole elongation; the Meckel-Gruber syndrome module, a major component of the transition zone; and the CPLANE complex necessary for IFT-A assembly. OFDS now appear to be a distinct subgroup of ciliopathies with wide heterogeneity, which makes the initial classification obsolete. A clinical classification restricted to the three frequent/well-delineated subtypes could be proposed, and for patients who do not fit one of these three main subtypes, a further classification could be based on the genotype.

  16. Probabilistic classifiers with high-dimensional data

    PubMed Central

    Kim, Kyung In; Simon, Richard

    2011-01-01

    For medical classification problems, it is often desirable to have a probability associated with each class. Probabilistic classifiers have received relatively little attention for small n large p classification problems despite their importance in medical decision making. In this paper, we introduce 2 criteria for assessment of probabilistic classifiers: well-calibratedness and refinement and develop corresponding evaluation measures. We evaluated several published high-dimensional probabilistic classifiers and developed 2 extensions of the Bayesian compound covariate classifier. Based on simulation studies and analysis of gene expression microarray data, we found that proper probabilistic classification is more difficult than deterministic classification. It is important to ensure that a probabilistic classifier is well calibrated or at least not “anticonservative” using the methods developed here. We provide this evaluation for several probabilistic classifiers and also evaluate their refinement as a function of sample size under weak and strong signal conditions. We also present a cross-validation method for evaluating the calibration and refinement of any probabilistic classifier on any data set. PMID:21087946
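    What "well calibrated" means in practice can be shown with a generic reliability table and the Brier score. These are standard diagnostics swapped in for illustration; they are not the evaluation measures developed in the paper, and the function names and bin count are assumptions.

```python
def calibration_table(probs, labels, n_bins=5):
    """Bin predicted probabilities into equal-width bins.

    Returns one (mean predicted probability, observed event frequency, count)
    tuple per non-empty bin; a well-calibrated classifier has the first two
    entries close to each other in every bin.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    table = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            freq = sum(y for _, y in members) / len(members)
            table.append((mean_p, freq, len(members)))
    return table

def brier_score(probs, labels):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)
```

    In a small n, large p setting the predicted probabilities should come from cross-validation folds rather than from the training samples, in line with the cross-validation method the abstract mentions.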

  17. Origin and diversification of leucine-rich repeat receptor-like protein kinase (LRR-RLK) genes in plants.

    PubMed

    Liu, Ping-Li; Du, Liang; Huang, Yuan; Gao, Shu-Min; Yu, Meng

    2017-02-07

    Leucine-rich repeat receptor-like protein kinases (LRR-RLKs) are the largest group of receptor-like kinases in plants and play crucial roles in development and stress responses. The evolutionary relationships among LRR-RLK genes have been investigated in flowering plants; however, no comprehensive studies have been performed for these genes in more ancestral groups. The subfamily classification of LRR-RLK genes in plants, the evolutionary history and driving force for the evolution of each LRR-RLK subfamily remain to be understood. We identified 119 LRR-RLK genes in the Physcomitrella patens moss genome, 67 LRR-RLK genes in the Selaginella moellendorffii lycophyte genome, and no LRR-RLK genes in five green algae genomes. Furthermore, these LRR-RLK sequences, along with previously reported LRR-RLK sequences from Arabidopsis thaliana and Oryza sativa, were subjected to evolutionary analyses. Phylogenetic analyses revealed that plant LRR-RLKs belong to 19 subfamilies, eighteen of which were established in early land plants, and one of which evolved in flowering plants. More importantly, we found that the basic structures of LRR-RLK genes for most subfamilies are established in early land plants and conserved within subfamilies and across different plant lineages, but divergent among subfamilies. In addition, most members of the same subfamily had common protein motif compositions, whereas members of different subfamilies showed variations in protein motif compositions. The unique gene structure and protein motif compositions of each subfamily differentiate the subfamily classifications and, more importantly, provide evidence for functional divergence among LRR-RLK subfamilies. Maximum likelihood analyses showed that some sites within four subfamilies were under positive selection. Much of the diversity of plant LRR-RLK genes was established in early land plants. Positive selection contributed to the evolution of a few LRR-RLK subfamilies.

  18. Factors affecting the concordance between orthologous gene trees and species tree in bacteria.

    PubMed

    Castillo-Ramírez, Santiago; González, Víctor

    2008-10-30

    As originally defined, orthologous genes implied a reflection of the history of the species. In recent years, many studies have examined the concordance between orthologous gene trees and species trees in bacteria. These studies have produced contradictory results that may have been influenced by orthologous gene misidentification and artefactual phylogenetic reconstructions. Here, using a method that allows the detection and exclusion of false positives during identification of orthologous genes, we address the question of whether putative orthologous genes within bacteria really reflect the history of the species. We identified a set of 370 orthologous genes from the bacterial order Rhizobiales. Although manifesting strong vertical signal, almost every orthologous gene had a distinct phylogeny, and the most common topology among the orthologous gene trees did not correspond with the best estimate of the species tree. However, each orthologous gene tree shared an average of 70% of its bipartitions with the best estimate of the species tree. Stochastic error related to gene size affected the concordance between the best estimate of the species tree and the orthologous gene trees, although this effect was weak and distributed unevenly among the functional categories. The nodes showing the greatest discordance were those defined by the shortest internal branches in the best estimate of the species tree. Moreover, a clear bias was evident with respect to the function of the orthologous genes, and the degree of divergence among the orthologous genes appeared to be related to their functional classification. Orthologous genes do not reflect the history of the species when taken as individual markers, but they do when taken as a whole. Stochastic error affected the concordance of orthologous genes with the species tree, albeit weakly. We conclude that two important biological causes of discordance among orthologous genes are incomplete lineage sorting and functional restriction.

  19. Differential gene expression profiles of peripheral blood mononuclear cells in childhood asthma.

    PubMed

    Kong, Qian; Li, Wen-Jing; Huang, Hua-Rong; Zhong, Ying-Qiang; Fang, Jian-Pei

    2015-05-01

    Asthma is a common childhood disease with strong genetic components. This study compared whole-genome expression differences between asthmatic young children and healthy controls to identify gene signatures of childhood asthma. Total RNA extracted from peripheral blood mononuclear cells (PBMC) was subjected to microarray analysis. QRT-PCR was performed to verify the microarray results. Classification and functional characterization of differential genes were illustrated by hierarchical clustering and gene ontology analysis. Multiple logistic regression (MLR) analysis, receiver operating characteristic (ROC) curve analysis, and discriminating power were used to scan asthma-specific diagnostic markers. For fold-change > 2 and p < 0.05, there were 758 named differential genes. The results of QRT-PCR successfully confirmed the array data. Hierarchical clustering divided 29 highly possible genes into seven categories and the genes in the same cluster were likely to possess similar expression patterns or functions. Gene ontology analysis showed that differential genes were primarily enriched in immune response, response to stress or stimulus, and regulation of apoptosis in biological process. MLR and ROC curve analysis revealed that the combination of ADAM33, Smad7, and LIGHT possessed excellent discriminating power. The combination of ADAM33, Smad7, and LIGHT would be a reliable and useful childhood asthma model for prediction and diagnosis.
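    The selection rule quoted in this abstract (fold-change > 2 and p < 0.05) can be sketched as a simple filter. The record layout, the function name, and the choice to treat up- and down-regulation symmetrically are assumptions made for illustration; the abstract does not specify them.

```python
def differential_genes(records, min_fold=2.0, max_p=0.05):
    """Select genes passing a fold-change and p-value cut-off.

    records: iterable of (gene, mean_case, mean_control, p_value), with
    expression means on a linear (not log) scale.
    """
    selected = []
    for gene, case, control, p_value in records:
        if case <= 0 or control <= 0:
            continue  # fold change is undefined for non-positive means
        # Symmetric fold change: counts both up- and down-regulation.
        fold = max(case / control, control / case)
        if fold > min_fold and p_value < max_p:
            selected.append(gene)
    return selected
```

    In a real microarray workflow the p-values would normally be corrected for multiple testing before applying the threshold; the abstract does not say whether that was done here.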

  20. Genome-wide identification, classification, and expression analysis of the arabinogalactan protein gene family in rice (Oryza sativa L.)

    PubMed Central

    Zhao, Jie

    2010-01-01

    Arabinogalactan proteins (AGPs) comprise a family of hydroxyproline-rich glycoproteins that are implicated in plant growth and development. In this study, 69 AGPs are identified from the rice genome, including 13 classical AGPs, 15 arabinogalactan (AG) peptides, three non-classical AGPs, three early nodulin-like AGPs (eNod-like AGPs), eight non-specific lipid transfer protein-like AGPs (nsLTP-like AGPs), and 27 fasciclin-like AGPs (FLAs). The results from expressed sequence tags, microarrays, and massively parallel signature sequencing tags are used to analyse the expression of AGP-encoding genes, which is confirmed by real-time PCR. The results reveal that several rice AGP-encoding genes are predominantly expressed in anthers and display differential expression patterns in response to abscisic acid, gibberellic acid, and abiotic stresses. Based on the results obtained from this analysis, an attempt has been made to link the protein structures and expression patterns of rice AGP-encoding genes to their functions. Taken together, the genome-wide identification and expression analysis of the rice AGP gene family might facilitate further functional studies of rice AGPs. PMID:20423940

  1. Distinguishing between biochemical and cellular function: Are there peptide signatures for cellular function of proteins?

    PubMed

    Jain, Shruti; Bhattacharyya, Kausik; Bakshi, Rachit; Narang, Ankita; Brahmachari, Vani

    2017-04-01

    The genome annotation and identification of gene function depends on conserved biochemical activity. However, in the cell, proteins with the same biochemical function can participate in different cellular pathways and cannot complement one another. Similarly, two proteins of very different biochemical functions are put in the same class of cellular function; for example, the classification of a gene as an oncogene or a tumour suppressor gene is not related to its biochemical function, but is related to its cellular function. We have taken an approach to identify peptide signatures for cellular function in proteins with known biochemical function. Using ATPases as a test case, we classified ATPases (2360 proteins) and kinases (517 proteins) from the human genome into different cellular function categories such as transcriptional, replicative, and chromatin remodelling proteins. Using a publicly available tool, MEME, we identify peptide signatures shared among the members of a given category but not between cellular functional categories; for example, no motif sharing is seen between chromatin remodelling and transporter ATPases, similarly between receptor Serine/Threonine Kinase and Receptor Tyrosine Kinase. There are motifs shared within each category with significant E value and high occurrence. This concept of signature for cellular function was applied to developmental regulators, the polycomb and trithorax proteins which led to the prediction of the role of INO80, a chromatin remodelling protein, in development. This has been experimentally validated earlier for its role in homeotic gene regulation and its interaction with regulatory complexes like the Polycomb and Trithorax complex. Proteins 2017; 85:682-693.

  2. A New Decision Tree to Solve the Puzzle of Alzheimer's Disease Pathogenesis Through Standard Diagnosis Scoring System.

    PubMed

    Kumar, Ashwani; Singh, Tiratha Raj

    2017-03-01

    Alzheimer's disease (AD) is a progressive, incurable and terminal neurodegenerative disorder of the brain and is associated with mutations in amyloid precursor protein, presenilin 1, presenilin 2 or apolipoprotein E, but its underlying mechanisms are still not fully understood. The healthcare sector is generating a large amount of information corresponding to diagnosis, disease identification and treatment of an individual. Mining knowledge and providing scientific decision-making for the diagnosis and treatment of disease from the clinical dataset are therefore increasingly becoming necessary. The current study deals with the construction of classifiers that can be human readable as well as robust in performance for gene dataset of AD using a decision tree. Models of classification for different AD genes were generated according to Mini-Mental State Examination scores and all other vital parameters to achieve the identification of the expression level of different proteins of disorder that may possibly determine the involvement of genes in various AD pathogenesis pathways. The effectiveness of decision tree in AD diagnosis is determined by information gain with confidence value (0.96), specificity (92%), sensitivity (98%) and accuracy (77%). Besides this functional gene classification using different parameters and enrichment analysis, our finding indicates that the measures of all the genes assessed in single cohorts are sufficient to diagnose AD and will help in the prediction of important parameters for other relevant assessments.
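    Information gain, the criterion named in this abstract for evaluating the decision tree, can be sketched for a single categorical split as follows. The feature values and class labels in the usage note are hypothetical; nothing here reproduces the study's data or its tree.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(c / total * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy reduction obtained by splitting the labels on a categorical feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy remaining after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder
```

    A feature that separates the classes perfectly (for example, hypothetical "high"/"low" expression levels matching "AD"/"control" exactly) yields a gain equal to the entropy of the labels, whereas a feature unrelated to the class yields a gain of zero.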

  3. Genome-wide analysis of the Solanum tuberosum (potato) trehalose-6-phosphate synthase (TPS) gene family: evolution and differential expression during development and stress.

    PubMed

    Xu, Yingchun; Wang, Yanjie; Mattson, Neil; Yang, Liu; Jin, Qijiang

    2017-12-01

    Trehalose-6-phosphate synthase (TPS) serves important functions in plant desiccation tolerance and response to environmental stimuli. At present, a comprehensive analysis, i.e. functional classification, molecular evolution, and expression patterns of this gene family are still lacking in Solanum tuberosum (potato). In this study, a comprehensive analysis of the TPS gene family was conducted in potato. A total of eight putative potato TPS genes (StTPSs) were identified by searching the latest potato genome sequence. The amino acid identity among eight StTPSs varied from 59.91 to 89.54%. Analysis of dN/dS ratios suggested that regions in the TPP (trehalose-6-phosphate phosphatase) domains evolved faster than the TPS domains. Although the sequence of the eight StTPSs showed high similarity (2571-2796 bp), their gene length is highly differentiated (3189-8406 bp). Many of the regulatory elements possibly related to phytohormones, abiotic stress and development were identified in different TPS genes. Based on the phylogenetic tree constructed using TPS genes of potato, and four other Solanaceae plants, TPS genes could be categorized into 6 distinct groups. Analysis revealed that purifying selection most likely played a major role during the evolution of this family. Amino acid changes detected in specific branches of the phylogenetic tree suggest that relaxed constraints might have contributed to functional divergence among groups. Moreover, StTPSs were found to exhibit tissue and treatment specific expression patterns upon analysis of transcriptome data, and performing qRT-PCR. This study provides a reference for genome-wide identification of the potato TPS gene family and sets a framework for further functional studies of this important gene family in development and stress response.

  4. iPcc: a novel feature extraction method for accurate disease class discovery and prediction

    PubMed Central

    Ren, Xianwen; Wang, Yong; Zhang, Xiang-Sun; Jin, Qi

    2013-01-01

    Gene expression profiling has gradually become a routine procedure for disease diagnosis and classification. In the past decade, many computational methods have been proposed, resulting in great improvements on various levels, including feature selection and algorithms for classification and clustering. In this study, we present iPcc, a novel method from the feature extraction perspective to further propel gene expression profiling technologies from bench to bedside. We define ‘correlation feature space’ for samples based on the gene expression profiles by iterative employment of Pearson’s correlation coefficient. Numerical experiments on both simulated and real gene expression data sets demonstrate that iPcc can greatly highlight the latent patterns underlying noisy gene expression data and thus greatly improve the robustness and accuracy of the algorithms currently available for disease diagnosis and classification based on gene expression profiles. PMID:23761440
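    The phrase "iterative employment of Pearson's correlation coefficient" can be read as follows: compute the sample-by-sample correlation matrix, treat each sample's row of correlations as its new feature vector, and repeat. The sketch below implements that simplified reading only; the function names and iteration count are assumptions, and the published iPcc algorithm may differ in its details.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # convention for constant vectors
    return cov / (sd_x * sd_y)

def correlation_features(samples):
    """Map each sample to its vector of correlations with every sample."""
    return [[pearson(a, b) for b in samples] for a in samples]

def iterate_correlation(samples, n_iter=2):
    """Apply the correlation mapping n_iter times (simplified iPcc-style step)."""
    features = samples
    for _ in range(n_iter):
        features = correlation_features(features)
    return features
```

    On noisy profiles from two groups, repeated application tends to push within-group correlations toward 1 and between-group correlations toward -1, which is the kind of pattern sharpening the abstract describes.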

  5. The COG database: a tool for genome-scale analysis of protein functions and evolution

    PubMed Central

    Tatusov, Roman L.; Galperin, Michael Y.; Natale, Darren A.; Koonin, Eugene V.

    2000-01-01

    Rational classification of proteins encoded in sequenced genomes is critical for making the genome sequences maximally useful for functional and evolutionary studies. The database of Clusters of Orthologous Groups of proteins (COGs) is an attempt at a phylogenetic classification of the proteins encoded in 21 complete genomes of bacteria, archaea and eukaryotes (http://www.ncbi.nlm.nih.gov/COG). The COGs were constructed by applying the criterion of consistency of genome-specific best hits to the results of an exhaustive comparison of all protein sequences from these genomes. The database comprises 2091 COGs that include 56–83% of the gene products from each of the complete bacterial and archaeal genomes and ~35% of those from the yeast Saccharomyces cerevisiae genome. The COG database is accompanied by the COGNITOR program that is used to fit new proteins into the COGs and can be applied to functional and phylogenetic annotation of newly sequenced genomes. PMID:10592175
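    The building block behind "consistency of genome-specific best hits" is a best hit that holds in both directions between two genomes. The sketch below shows only that pairwise step, under the assumption that best hits have already been computed (for example, from an all-against-all sequence comparison); the protein identifiers are hypothetical, and the actual COG construction goes further by combining consistent best hits across three or more genomes.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Pairs (a, b) where b is a's best hit in genome B and a is b's best hit in genome A.

    best_a_to_b: dict mapping each protein of genome A to its best hit in B.
    best_b_to_a: dict mapping each protein of genome B to its best hit in A.
    """
    return sorted(
        (a, b)
        for a, b in best_a_to_b.items()
        if best_b_to_a.get(b) == a
    )
```

    For example, if hypothetical protein "a2" has best hit "b2" but "b2" points back to a different protein, the pair is inconsistent and is left out.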

  6. Genome-wide analysis of the R2R3-MYB transcription factor gene family in sweet orange (Citrus sinensis).

    PubMed

    Liu, Chaoyang; Wang, Xia; Xu, Yuantao; Deng, Xiuxin; Xu, Qiang

    2014-10-01

    MYB transcription factor represents one of the largest gene families in plant genomes. Sweet orange (Citrus sinensis) is one of the most important fruit crops worldwide, and recently the genome has been sequenced. This provides an opportunity to investigate the organization and evolutionary characteristics of sweet orange MYB genes from whole genome view. In the present study, we identified 100 R2R3-MYB genes in the sweet orange genome. A comprehensive analysis of this gene family was performed, including the phylogeny, gene structure, chromosomal localization and expression pattern analyses. The 100 genes were divided into 29 subfamilies based on the sequence similarity and phylogeny, and the classification was also well supported by the highly conserved exon/intron structures and motif composition. The phylogenomic comparison of MYB gene family among sweet orange and related plant species, Arabidopsis, cacao and papaya suggested the existence of functional divergence during evolution. Expression profiling indicated that sweet orange R2R3-MYB genes exhibited distinct temporal and spatial expression patterns. Our analysis suggested that the sweet orange MYB genes may play important roles in different plant biological processes, some of which may be potentially involved in citrus fruit quality. These results will be useful for future functional analysis of the MYB gene family in sweet orange.

  7. De novo sequencing and analysis of the transcriptome of Panax ginseng in the leaf-expansion period.

    PubMed

    Liu, Shichao; Wang, Siming; Liu, Meichen; Yang, Fei; Zhang, Hui; Liu, Shiyang; Wang, Qun; Zhao, Yu

    2016-08-01

    Panax ginseng, a traditional Chinese medicine, is used worldwide for its variety of health benefits and its treatment efficacy. However, it is difficult to cultivate due to its vulnerability to environmental stresses. The present study provided the first report, to the best of our knowledge, of transcriptome analysis of ginseng at the leaf‑expansion stage. Using the Illumina sequencing platform, >40,000,000 high‑quality paired‑end reads were obtained and assembled into 100,533 unique sequences. When the sequences were searched against the publicly available National Center for Biotechnology Information protein database using The Basic Local Alignment Search Tool, 61,599 sequences exhibited similarity to known proteins. Functional annotation and classification, including use of the Gene Ontology, Clusters of Orthologous Groups, and Kyoto Encyclopedia of Genes and Genomes databases, revealed that the activated genes in ginseng were predominantly ribonuclease‑like storage genes, environmental stress genes, pathogenesis-related genes and other antioxidant genes. A number of candidate genes in environmental stress‑associated pathways were also identified. These novel data provide useful information on the growth and development stages of ginseng, and serve as an important public information platform for further understanding of the molecular mechanisms and functional genomics of ginseng.

  8. Identification of differentially expressed genes through RNA sequencing in goats (Capra hircus) at different postnatal stages

    PubMed Central

    Li, Qian; Lin, Sen

    2017-01-01

    Intramuscular fat (IMF) content and fatty acid composition of longissimus dorsi muscle (LM) change with growth, which partially determines the flavor and nutritional value of goat (Capra hircus) meat. However, unlike cattle, little information is available on the transcriptome-wide changes during different postnatal stages in small ruminants, especially goats. In this study, the sequencing reads of goat LM tissues collected from kid, youth, and adult period were mapped to the goat genome. Results showed that out of total 24 689 Unigenes, 20 435 Unigenes were annotated. Based on expected number of fragments per kilobase of transcript sequence per million base pairs sequenced (FPKM), 111 annotated differentially expressed genes (DEGs) were identified among different postnatal stages, which were subsequently assigned to 16 possible expression patterns by series-cluster analysis. Functional classification by Gene Ontology (GO) analysis was used for selecting the genes showing highest expression related to lipid metabolism. Finally, we identified the node genes for lipid metabolism regulation using co-expression analysis. In conclusion, these data may uncover candidate genes having functional roles in regulation of goat muscle development and lipid metabolism during the various growth stages in goats. PMID:28800357

  9. Identification of differentially expressed genes through RNA sequencing in goats (Capra hircus) at different postnatal stages.

    PubMed

    Lin, Yaqiu; Zhu, Jiangjiang; Wang, Yong; Li, Qian; Lin, Sen

    2017-01-01

    Intramuscular fat (IMF) content and fatty acid composition of longissimus dorsi muscle (LM) change with growth, which partially determines the flavor and nutritional value of goat (Capra hircus) meat. However, unlike cattle, little information is available on the transcriptome-wide changes during different postnatal stages in small ruminants, especially goats. In this study, the sequencing reads of goat LM tissues collected from kid, youth, and adult period were mapped to the goat genome. Results showed that out of total 24 689 Unigenes, 20 435 Unigenes were annotated. Based on expected number of fragments per kilobase of transcript sequence per million base pairs sequenced (FPKM), 111 annotated differentially expressed genes (DEGs) were identified among different postnatal stages, which were subsequently assigned to 16 possible expression patterns by series-cluster analysis. Functional classification by Gene Ontology (GO) analysis was used for selecting the genes showing highest expression related to lipid metabolism. Finally, we identified the node genes for lipid metabolism regulation using co-expression analysis. In conclusion, these data may uncover candidate genes having functional roles in regulation of goat muscle development and lipid metabolism during the various growth stages in goats.
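    The FPKM normalization named in this abstract can be illustrated with a short sketch. It uses the commonly cited definition (fragment count scaled by transcript length in kilobases and by millions of mapped fragments). The function name and the example numbers are hypothetical and are not taken from the study.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and total fragment count must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (fragments per million)
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Hypothetical values: 500 fragments on a 2,000 bp transcript,
# 20 million mapped fragments in the library.
print(fpkm(500, 2000, 20_000_000))
```

    Because both scaling factors are applied, values are comparable across transcripts of different lengths and across libraries of different sequencing depth, which is what makes a between-stage comparison possible.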

  10. Mechanisms of Mitochondrial Defects in Gulf War Syndrome

    DTIC Science & Technology

    2012-08-01

    oxidized; POR: porin; TCA: Tricarboxylic acid cycle ( Kreb cycle ). Page 2 Body: YEAR 1 of research (10/13/2009-7/14/2010) (9 months): Human... mitochondria , fatigue, myalgias 16. SECURITY CLASSIFICATION OF: 17. LIMITATION OF ABSTRACT 18. NUMBER OF PAGES 19a. NAME OF RESPONSIBLE PERSON...abnormalities in genes that are related to mitochondrial function. Hence, investigation of mitochondrial dysfunction in GWS is a priority. Mitochondria

  11. Complete Genome Sequence of a Putative New Bacterial Strain, I507, Isolated from the Indian Ocean

    PubMed Central

    Wang, Shu-yan; Wei, Jia-qiang

    2018-01-01

    ABSTRACT Bacterial strain I507 was isolated from the central Indian Ocean and may be a potential novel species, according to the 16S rRNA gene sequence. Here, we present its complete genome sequence and expect that it will provide researchers with valuable information to further understand its classification and function in the future. PMID:29674539

  12. Driver gene classification reveals a substantial overrepresentation of tumor suppressors among very large chromatin-regulating proteins.

    PubMed

    Waks, Zeev; Weissbrod, Omer; Carmeli, Boaz; Norel, Raquel; Utro, Filippo; Goldschmidt, Yaara

    2016-12-23

    Compiling a comprehensive list of cancer driver genes is imperative for oncology diagnostics and drug development. While driver genes are typically discovered by analysis of tumor genomes, infrequently mutated driver genes often evade detection due to limited sample sizes. Here, we address sample size limitations by integrating tumor genomics data with a wide spectrum of gene-specific properties to search for rare drivers, functionally classify them, and detect features characteristic of driver genes. We show that our approach, CAnceR geNe similarity-based Annotator and Finder (CARNAF), enables detection of potentially novel drivers that eluded over a dozen pan-cancer/multi-tumor type studies. In particular, feature analysis reveals a highly concentrated pool of known and putative tumor suppressors among the <1% of genes that encode very large, chromatin-regulating proteins. Thus, our study highlights the need for deeper characterization of very large, epigenetic regulators in the context of cancer causality.

  13. Gene Expression Profiles of Chicken Embryo Fibroblasts in Response to Salmonella Enteritidis Infection

    PubMed Central

    Szmolka, Ama; Wiener, Zoltán; Matulova, Marta Elsheimer; Varmuzova, Karolina; Rychlik, Ivan

    2015-01-01

    The response of chicken to non-typhoidal Salmonella infection is becoming well characterised but the role of particular cell types in this response is still far from being understood. Therefore, in this study we characterised the response of chicken embryo fibroblasts (CEFs) to infection with two different S. Enteritidis strains by microarray analysis. The expression of chicken genes identified as significantly up- or down-regulated (≥3-fold) by microarray analysis was verified by real-time PCR followed by functional classification of the genes and prediction of interactions between the proteins using Gene Ontology and STRING Database. Finally the expression of the newly identified genes was tested in HD11 macrophages and in vivo in chickens. Altogether 19 genes were induced in CEFs after S. Enteritidis infection. Twelve of them were also induced in HD11 macrophages and thirteen in the caecum of orally infected chickens. The majority of these genes were assigned different functions in the immune response, however five of them (LOC101750351, K123, BU460569, MOBKL2C and G0S2) have not been associated with the response of chicken to Salmonella infection so far. K123 and G0S2 were the only ’non-immune’ genes inducible by S. Enteritidis in fibroblasts, HD11 macrophages and in the caecum after oral infection. The function of K123 is unknown but G0S2 is involved in lipid metabolism and in β-oxidation of fatty acids in mitochondria. PMID:26046914

  14. Gene Expression Profiles of Chicken Embryo Fibroblasts in Response to Salmonella Enteritidis Infection.

    PubMed

    Szmolka, Ama; Wiener, Zoltán; Matulova, Marta Elsheimer; Varmuzova, Karolina; Rychlik, Ivan

    2015-01-01

    The response of chicken to non-typhoidal Salmonella infection is becoming well characterised but the role of particular cell types in this response is still far from being understood. Therefore, in this study we characterised the response of chicken embryo fibroblasts (CEFs) to infection with two different S. Enteritidis strains by microarray analysis. The expression of chicken genes identified as significantly up- or down-regulated (≥3-fold) by microarray analysis was verified by real-time PCR followed by functional classification of the genes and prediction of interactions between the proteins using Gene Ontology and STRING Database. Finally the expression of the newly identified genes was tested in HD11 macrophages and in vivo in chickens. Altogether 19 genes were induced in CEFs after S. Enteritidis infection. Twelve of them were also induced in HD11 macrophages and thirteen in the caecum of orally infected chickens. The majority of these genes were assigned different functions in the immune response, however five of them (LOC101750351, K123, BU460569, MOBKL2C and G0S2) have not been associated with the response of chicken to Salmonella infection so far. K123 and G0S2 were the only 'non-immune' genes inducible by S. Enteritidis in fibroblasts, HD11 macrophages and in the caecum after oral infection. The function of K123 is unknown but G0S2 is involved in lipid metabolism and in β-oxidation of fatty acids in mitochondria.

  15. Feature Genes Selection Using Supervised Locally Linear Embedding and Correlation Coefficient for Microarray Classification

    PubMed Central

    Wang, Yun; Huang, Fangzhou

    2018-01-01

    The selection of feature genes with high recognition ability from the gene expression profiles has gained great significance in biology. However, most of the existing methods have a high time complexity and poor classification performance. Motivated by this, an effective feature selection method, called supervised locally linear embedding and Spearman's rank correlation coefficient (SLLE-SC2), is proposed which is based on the concept of locally linear embedding and correlation coefficient algorithms. Supervised locally linear embedding takes into account class label information and improves the classification performance. Furthermore, Spearman's rank correlation coefficient is used to remove the coexpression genes. The experiment results obtained on four public tumor microarray datasets illustrate that our method is valid and feasible. PMID:29666661

  16. Feature Genes Selection Using Supervised Locally Linear Embedding and Correlation Coefficient for Microarray Classification.

    PubMed

    Xu, Jiucheng; Mu, Huiyu; Wang, Yun; Huang, Fangzhou

    2018-01-01

    The selection of feature genes with high recognition ability from the gene expression profiles has gained great significance in biology. However, most of the existing methods have a high time complexity and poor classification performance. Motivated by this, an effective feature selection method, called supervised locally linear embedding and Spearman's rank correlation coefficient (SLLE-SC2), is proposed which is based on the concept of locally linear embedding and correlation coefficient algorithms. Supervised locally linear embedding takes into account class label information and improves the classification performance. Furthermore, Spearman's rank correlation coefficient is used to remove the coexpression genes. The experiment results obtained on four public tumor microarray datasets illustrate that our method is valid and feasible.
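    The idea of using Spearman's rank correlation to discard co-expressed genes can be sketched as follows. This is only an illustration of a rank-correlation redundancy filter, not the authors' SLLE-SC2 procedure: the greedy strategy, the 0.9 threshold, the function names, and the toy profiles are all assumptions.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        return 0.0  # constant profile: treated as uncorrelated (assumption)
    return cov / (sx * sy)


def drop_coexpressed(profiles, threshold=0.9):
    """Greedily keep genes whose |rho| with every kept gene is below threshold.

    `profiles` maps gene name -> expression values across the same samples,
    iterated in insertion order (assumed to be most relevant gene first).
    """
    kept = []
    for gene, values in profiles.items():
        if all(abs(spearman(values, profiles[k])) < threshold for k in kept):
            kept.append(gene)
    return kept


# Toy data (hypothetical): geneB rises monotonically with geneA.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.1, 9.8],
    "geneC": [5.0, 1.0, 4.0, 2.0, 3.0],
}
print(drop_coexpressed(profiles))
```

    Working on ranks rather than raw values means that any monotone relationship between two genes, not only a linear one, is flagged as redundant.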

  17. From learning taxonomies to phylogenetic learning: integration of 16S rRNA gene data into FAME-based bacterial classification.

    PubMed

    Slabbinck, Bram; Waegeman, Willem; Dawyndt, Peter; De Vos, Paul; De Baets, Bernard

    2010-01-30

    Machine learning techniques have shown to improve bacterial species classification based on fatty acid methyl ester (FAME) data. Nonetheless, FAME analysis has a limited resolution for discrimination of bacteria at the species level. In this paper, we approach the species classification problem from a taxonomic point of view. Such a taxonomy or tree is typically obtained by applying clustering algorithms on FAME data or on 16S rRNA gene data. The knowledge gained from the tree can then be used to evaluate FAME-based classifiers, resulting in a novel framework for bacterial species classification. In view of learning in a taxonomic framework, we consider two types of trees. First, a FAME tree is constructed with a supervised divisive clustering algorithm. Subsequently, based on 16S rRNA gene sequence analysis, phylogenetic trees are inferred by the NJ and UPGMA methods. In this second approach, the species classification problem is based on the combination of two different types of data. Herein, 16S rRNA gene sequence data is used for phylogenetic tree inference and the corresponding binary tree splits are learned based on FAME data. We call this learning approach 'phylogenetic learning'. Supervised Random Forest models are developed to train the classification tasks in a stratified cross-validation setting. In this way, better classification results are obtained for species that are typically hard to distinguish by a single or flat multi-class classification model. FAME-based bacterial species classification is successfully evaluated in a taxonomic framework. Although the proposed approach does not improve the overall accuracy compared to flat multi-class classification, it has some distinct advantages. First, it has better capabilities for distinguishing species on which flat multi-class classification fails. Secondly, the hierarchical classification structure allows to easily evaluate and visualize the resolution of FAME data for the discrimination of bacterial species. Summarized, by phylogenetic learning we are able to situate and evaluate FAME-based bacterial species classification in a more informative context.

  18. From learning taxonomies to phylogenetic learning: Integration of 16S rRNA gene data into FAME-based bacterial classification

    PubMed Central

    2010-01-01

    Background: Machine learning techniques have shown to improve bacterial species classification based on fatty acid methyl ester (FAME) data. Nonetheless, FAME analysis has a limited resolution for discrimination of bacteria at the species level. In this paper, we approach the species classification problem from a taxonomic point of view. Such a taxonomy or tree is typically obtained by applying clustering algorithms on FAME data or on 16S rRNA gene data. The knowledge gained from the tree can then be used to evaluate FAME-based classifiers, resulting in a novel framework for bacterial species classification. Results: In view of learning in a taxonomic framework, we consider two types of trees. First, a FAME tree is constructed with a supervised divisive clustering algorithm. Subsequently, based on 16S rRNA gene sequence analysis, phylogenetic trees are inferred by the NJ and UPGMA methods. In this second approach, the species classification problem is based on the combination of two different types of data. Herein, 16S rRNA gene sequence data is used for phylogenetic tree inference and the corresponding binary tree splits are learned based on FAME data. We call this learning approach 'phylogenetic learning'. Supervised Random Forest models are developed to train the classification tasks in a stratified cross-validation setting. In this way, better classification results are obtained for species that are typically hard to distinguish by a single or flat multi-class classification model. Conclusions: FAME-based bacterial species classification is successfully evaluated in a taxonomic framework. Although the proposed approach does not improve the overall accuracy compared to flat multi-class classification, it has some distinct advantages. First, it has better capabilities for distinguishing species on which flat multi-class classification fails. Secondly, the hierarchical classification structure allows to easily evaluate and visualize the resolution of FAME data for the discrimination of bacterial species. Summarized, by phylogenetic learning we are able to situate and evaluate FAME-based bacterial species classification in a more informative context. PMID:20113515

  19. FARME DB: a functional antibiotic resistance element database

    PubMed Central

    Wallace, James C.; Port, Jesse A.; Smith, Marissa N.; Faustman, Elaine M.

    2017-01-01

    Antibiotic resistance (AR) is a major global public health threat but few resources exist that catalog AR genes outside of a clinical context. Current AR sequence databases are assembled almost exclusively from genomic sequences derived from clinical bacterial isolates and thus do not include many microbial sequences derived from environmental samples that confer resistance in functional metagenomic studies. These environmental metagenomic sequences often show little or no similarity to AR sequences from clinical isolates using standard classification criteria. In addition, existing AR databases provide no information about flanking sequences containing regulatory or mobile genetic elements. To help address this issue, we created an annotated database of DNA and protein sequences derived exclusively from environmental metagenomic sequences showing AR in laboratory experiments. Our Functional Antibiotic Resistant Metagenomic Element (FARME) database is a compilation of publicly available DNA sequences and predicted protein sequences conferring AR as well as regulatory elements, mobile genetic elements and predicted proteins flanking antibiotic resistant genes. FARME is the first database to focus on functional metagenomic AR gene elements and provides a resource to better understand AR in the 99% of bacteria which cannot be cultured and the relationship between environmental AR sequences and antibiotic resistant genes derived from cultured isolates. Database URL: http://staff.washington.edu/jwallace/farme PMID:28077567

  20. mRMR-ABC: A Hybrid Gene Selection Algorithm for Cancer Classification Using Microarray Gene Expression Profiling

    PubMed Central

    Alshamlan, Hala; Badr, Ghada; Alohali, Yousef

    2015-01-01

    An artificial bee colony (ABC) is a relatively recent swarm intelligence optimization approach. In this paper, we propose the first attempt at applying ABC algorithm in analyzing a microarray gene expression profile. In addition, we propose an innovative feature selection algorithm, minimum redundancy maximum relevance (mRMR), and combine it with an ABC algorithm, mRMR-ABC, to select informative genes from microarray profile. The new approach is based on a support vector machine (SVM) algorithm to measure the classification accuracy for selected genes. We evaluate the performance of the proposed mRMR-ABC algorithm by conducting extensive experiments on six binary and multiclass gene expression microarray datasets. Furthermore, we compare our proposed mRMR-ABC algorithm with previously known techniques. We reimplemented two of these techniques for the sake of a fair comparison using the same parameters. These two techniques are mRMR when combined with a genetic algorithm (mRMR-GA) and mRMR when combined with a particle swarm optimization algorithm (mRMR-PSO). The experimental results prove that the proposed mRMR-ABC algorithm achieves accurate classification performance using small number of predictive genes when tested using both datasets and compared to previously suggested methods. This shows that mRMR-ABC is a promising approach for solving gene selection and cancer classification problems. PMID:25961028

  1. mRMR-ABC: A Hybrid Gene Selection Algorithm for Cancer Classification Using Microarray Gene Expression Profiling.

    PubMed

    Alshamlan, Hala; Badr, Ghada; Alohali, Yousef

    2015-01-01

    An artificial bee colony (ABC) is a relatively recent swarm intelligence optimization approach. In this paper, we propose the first attempt at applying ABC algorithm in analyzing a microarray gene expression profile. In addition, we propose an innovative feature selection algorithm, minimum redundancy maximum relevance (mRMR), and combine it with an ABC algorithm, mRMR-ABC, to select informative genes from microarray profile. The new approach is based on a support vector machine (SVM) algorithm to measure the classification accuracy for selected genes. We evaluate the performance of the proposed mRMR-ABC algorithm by conducting extensive experiments on six binary and multiclass gene expression microarray datasets. Furthermore, we compare our proposed mRMR-ABC algorithm with previously known techniques. We reimplemented two of these techniques for the sake of a fair comparison using the same parameters. These two techniques are mRMR when combined with a genetic algorithm (mRMR-GA) and mRMR when combined with a particle swarm optimization algorithm (mRMR-PSO). The experimental results prove that the proposed mRMR-ABC algorithm achieves accurate classification performance using small number of predictive genes when tested using both datasets and compared to previously suggested methods. This shows that mRMR-ABC is a promising approach for solving gene selection and cancer classification problems.

  2. Identification of suitable genes contributes to lung adenocarcinoma clustering by multiple meta-analysis methods.

    PubMed

    Yang, Ze-Hui; Zheng, Rui; Gao, Yuan; Zhang, Qiang

    2016-09-01

    With the widespread application of high-throughput technology, numerous meta-analysis methods have been proposed for differential expression profiling across multiple studies. We identified the suitable differentially expressed (DE) genes that contributed to lung adenocarcinoma (ADC) clustering based on seven popular multiple meta-analysis methods. Seven microarray expression profiles of ADC and normal controls were extracted from the ArrayExpress database. Bioconductor was used for preliminary data preprocessing. Then, DE genes across multiple studies were identified. Hierarchical clustering was applied to compare the classification performance for microarray data samples. The classification efficiency was compared based on accuracy, sensitivity and specificity. Across seven datasets, 573 ADC cases and 222 normal controls were collected. After filtering out unexpressed and noninformative genes, 3688 genes remained for further analysis. The classification efficiency analysis showed that DE genes identified by the sum of ranks method separated ADC from normal controls with the best accuracy, sensitivity and specificity of 0.953, 0.969 and 0.932, respectively. The gene set with the highest classification accuracy mainly participated in the regulation of response to external stimulus (P = 7.97E-04), cyclic nucleotide-mediated signaling (P = 0.01), regulation of cell morphogenesis (P = 0.01) and regulation of cell proliferation (P = 0.01). Evaluation of DE genes identified by different meta-analysis methods in classification efficiency provided a new perspective on the choice of a suitable method in a given application. Different meta-analysis methods perform differently, so several factors should be weighed together when choosing a meta-analysis method for a particular study. © 2015 John Wiley & Sons Ltd.
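    The three measures used to compare classification efficiency follow directly from a confusion matrix. The sketch below uses the standard definitions; the function name and the counts are hypothetical and are not a reconstruction of the study's results.

```python
def classification_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity and specificity from confusion-matrix counts.

    tp/fn: cases correctly / incorrectly classified,
    tn/fp: controls correctly / incorrectly classified.
    """
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,      # all correct calls
        "sensitivity": tp / (tp + fn),      # fraction of cases recovered
        "specificity": tn / (tn + fp),      # fraction of controls recovered
    }


# Hypothetical counts: 100 cases and 50 controls.
print(classification_metrics(tp=90, fn=10, tn=40, fp=10))
```

    Reporting all three matters when classes are unbalanced, as with many more cases than controls, because accuracy alone can be dominated by the larger class.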

  3. Solexa-Sequencing Based Transcriptome Study of Plaice Skin Phenotype in Rex Rabbits (Oryctolagus cuniculus)

    PubMed Central

    Pan, Lei; Liu, Yan; Wei, Qiang; Xiao, Chenwen; Ji, Quanan; Bao, Guolian; Wu, Xinsheng

    2015-01-01

    Background: Fur is an important genetically-determined characteristic of domestic rabbits; rabbit furs are of great economic value. We used the Solexa sequencing technology to assess gene expression in skin tissues from full-sib Rex rabbits of different phenotypes in order to explore the molecular mechanisms associated with fur determination. Methodology/Principal Findings: Transcriptome analysis included de novo assembly, gene function identification, and gene function classification and enrichment. We obtained 74,032,912 and 71,126,891 short reads of 100 nt, which were assembled into 377,618 unique sequences by Trinity strategy (N50=680 nt). Based on BLAST results with known proteins, 50,228 sequences were identified at a cut-off E-value ≥ 10^-5. Using Blast to Gene Ontology (GO), Clusters of Orthologous Groups (KOG) and Kyoto Encyclopedia of Genes and Genomes (KEGG), we obtained several genes with important protein functions. A total of 308 differentially expressed genes were obtained by transcriptome analysis of plaice and un-plaice phenotype animals; 209 additional differentially expressed genes were not found in any database. These genes included 49 that were only expressed in plaice skin rabbits. The novel genes may play important roles during skin growth and development. In addition, 99 known differentially expressed genes were assigned to PI3K-Akt signaling, focal adhesion, and ECM-receptor interaction, among others. Growth factors play a role in skin growth and development by regulating these signaling pathways. We confirmed the altered expression levels of seven target genes by qRT-PCR. We also chose a key gene for SNP analysis to identify differences between plaice and un-plaice phenotype rabbits. Conclusions/Significance: The rabbit transcriptome profiling data provide new insights in understanding the molecular mechanisms underlying rabbit skin growth and development. PMID:25955442
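    The N50 statistic quoted for the assembly can be computed as in the following sketch, which uses the usual definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembled bases. The function name and the toy lengths are hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("at least one sequence length is required")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest sequence down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Toy assembly: total 32 bases, so the two 8-base contigs reach the halfway mark.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

    A low N50 relative to typical transcript lengths, as is common in de novo transcriptome assemblies, indicates that many assembled sequences are short fragments.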

  4. Is Mitochondrial Donation Germ-Line Gene Therapy? Classifications and Ethical Implications.

    PubMed

    Newson, Ainsley J; Wrigley, Anthony

    2017-01-01

    The classification of techniques used in mitochondrial donation, including their role as purported germ-line gene therapies, is far from clear. These techniques exhibit characteristics typical of a variety of classifications that have been used in both scientific and bioethics scholarship. This raises two connected questions, which we address in this paper: (i) how should we classify mitochondrial donation techniques?; and (ii) what ethical implications surround such a classification? First, we outline how methods of genetic intervention, such as germ-line gene therapy, are typically defined or classified. We then consider whether techniques of mitochondrial donation fit into these, whether they might do so with some refinement of these categories, or whether they require some other approach to classification. To answer the second question, we discuss the relationship between classification and several key ethical issues arising from mitochondrial donation. We conclude that the properties characteristic of mitochondrial inheritance mean that most mitochondrial donation techniques belong to a new sub-class of genetic modification, which we call 'conditionally inheritable genomic modification' (CIGM). © 2017 John Wiley & Sons Ltd.

  5. Regulation of IAP (Inhibitor of Apoptosis) Gene Expression by the p53 Tumor Suppressor Protein

    DTIC Science & Technology

    2005-05-01

    adenovirus, gene therapy, polymorphism, 31 16. PRICE CODE 17. SECURITY CLASSIFICATION 18. SECURITY CLASSIFICATION 19. SECURITY CLASSIFICATION 20...averaged results of three inde- pendent experiments, with standard error. Right panel: Level of p53 in infected cells using the antibody Ab-6 (Calbiochem...with highly purified mitochondria as described in (2). The arrow marks oligomerized BAK. The right _ -. panel depicts the purity of BMH CrosIinked Mito

  6. Alport syndrome: a unified classification of genetic disorders of collagen IV α345: a position paper of the Alport Syndrome Classification Working Group.

    PubMed

    Kashtan, Clifford E; Ding, Jie; Garosi, Guido; Heidet, Laurence; Massella, Laura; Nakanishi, Koichi; Nozu, Kandai; Renieri, Alessandra; Rheault, Michelle; Wang, Fang; Gross, Oliver

    2018-05-01

    Mutations in the genes COL4A3, COL4A4, and COL4A5 affect the synthesis, assembly, deposition, or function of the collagen IV α345 molecule, the major collagenous constituent of the mature mammalian glomerular basement membrane. These mutations are associated with a spectrum of nephropathy, from microscopic hematuria to progressive renal disease leading to ESRD, and with extrarenal manifestations such as sensorineural deafness and ocular anomalies. The existing nomenclature for these conditions is confusing and can delay institution of appropriate nephroprotective therapy. Herein we propose a new classification of genetic disorders of the collagen IV α345 molecule with the goal of improving renal outcomes through regular monitoring and early treatment. Copyright © 2018 International Society of Nephrology. Published by Elsevier Inc. All rights reserved.

  7. Systems-based biological concordance and predictive reproducibility of gene set discovery methods in cardiovascular disease.

    PubMed

    Azuaje, Francisco; Zheng, Huiru; Camargo, Anyela; Wang, Haiying

    2011-08-01

    The discovery of novel disease biomarkers is a crucial challenge for translational bioinformatics. Demonstration of both their classification power and reproducibility across independent datasets are essential requirements to assess their potential clinical relevance. Small datasets and multiplicity of putative biomarker sets may explain lack of predictive reproducibility. Studies based on pathway-driven discovery approaches have suggested that, despite such discrepancies, the resulting putative biomarkers tend to be implicated in common biological processes. Investigations of this problem have been mainly focused on datasets derived from cancer research. We investigated the predictive and functional concordance of five methods for discovering putative biomarkers in four independently-generated datasets from the cardiovascular disease domain. A diversity of biosignatures was identified by the different methods. However, we found strong biological process concordance between them, especially in the case of methods based on gene set analysis. With a few exceptions, we observed lack of classification reproducibility using independent datasets. Partial overlaps between our putative sets of biomarkers and the primary studies exist. Despite the observed limitations, pathway-driven or gene set analysis can predict potentially novel biomarkers and can jointly point to biomedically-relevant underlying molecular mechanisms. Copyright © 2011 Elsevier Inc. All rights reserved.

  8. Tabu search and binary particle swarm optimization for feature selection using microarray data.

    PubMed

    Chuang, Li-Yeh; Yang, Cheng-Huei; Yang, Cheng-Hong

    2009-12-01

    Gene expression profiles have great potential as a medical diagnosis tool because they represent the state of a cell at the molecular level. In the classification of cancer type research, available training datasets generally have a fairly small sample size compared to the number of genes involved. This fact poses an unprecedented challenge to some classification methodologies due to training data limitations. Therefore, a good selection method for genes relevant for sample classification is needed to improve the predictive accuracy, and to avoid incomprehensibility due to the large number of genes investigated. In this article, we propose to combine tabu search (TS) and binary particle swarm optimization (BPSO) for feature selection. BPSO acts as a local optimizer each time the TS has been run for a single generation. The K-nearest neighbor method with leave-one-out cross-validation and support vector machine with one-versus-rest serve as evaluators of the TS and BPSO. The proposed method is applied and compared to the 11 classification problems taken from the literature. Experimental results show that our method simplifies features effectively and either obtains higher classification accuracy or uses fewer features compared to other feature selection methods.
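    The abstract uses K-nearest neighbor with leave-one-out cross-validation to score a candidate feature subset. A minimal sketch of such an evaluator is shown below. It covers only the scoring step, not the tabu search or BPSO components, and the function name, the squared Euclidean distance, and the toy data are assumptions.

```python
from collections import Counter


def loocv_knn_accuracy(samples, labels, features, k=1):
    """Leave-one-out accuracy of k-NN using only the given feature indices."""
    correct = 0
    for i, query in enumerate(samples):
        # Distances from the held-out sample to every other sample.
        neighbours = []
        for j, other in enumerate(samples):
            if j == i:
                continue
            dist = sum((query[f] - other[f]) ** 2 for f in features)
            neighbours.append((dist, labels[j]))
        neighbours.sort(key=lambda pair: pair[0])
        votes = Counter(label for _, label in neighbours[:k])
        if votes.most_common(1)[0][0] == labels[i]:
            correct += 1
    return correct / len(samples)


# Toy data (hypothetical): feature 0 separates the classes, feature 1 does not.
samples = [
    [0.10, 5.0], [0.20, 1.0], [0.15, 9.0],
    [0.90, 5.1], [1.00, 1.1], [0.95, 9.1],
]
labels = [0, 0, 0, 1, 1, 1]
print(loocv_knn_accuracy(samples, labels, features=[0]))
print(loocv_knn_accuracy(samples, labels, features=[1]))
```

    A search procedure would call an evaluator like this once per candidate subset, which is why keeping the subset small also keeps the evaluation cheap when samples are few and genes are many.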

  9. Genetic basis of interindividual susceptibility to cancer cachexia: selection of potential candidate gene polymorphisms for association studies.

    PubMed

    Johns, N; Tan, B H; MacMillan, M; Solheim, T S; Ross, J A; Baracos, V E; Damaraju, S; Fearon, K C H

    2014-12-01

    Cancer cachexia is a complex and multifactorial disease. Evolving definitions highlight the fact that a diverse range of biological processes contribute to cancer cachexia. Part of the variation in who will and who will not develop cancer cachexia may be genetically determined. As new definitions, classifications and biological targets continue to evolve, there is a need for reappraisal of the literature for future candidate association studies. This review summarizes genes identified or implicated as well as putative candidate genes contributing to cachexia, identified through diverse technology platforms and model systems to further guide association studies. A systematic search covering 1986-2012 was performed for potential candidate genes / genetic polymorphisms relating to cancer cachexia. All candidate genes were reviewed for functional polymorphisms or clinically significant polymorphisms associated with cachexia using the OMIM and GeneRIF databases. Pathway analysis software was used to reveal possible network associations between genes. Functionality of SNPs/genes was explored based on published literature, algorithms for detecting putative deleterious SNPs and interrogating the database for expression of quantitative trait loci (eQTLs). A total of 154 genes associated with cancer cachexia were identified and explored for functional polymorphisms. Of these 154 genes, 119 had a combined total of 281 polymorphisms with functional and/or clinical significance in terms of cachexia associated with them. Of these, 80 polymorphisms (in 51 genes) were replicated in more than one study with 24 polymorphisms found to influence two or more hallmarks of cachexia (i.e., inflammation, loss of fat mass and/or lean mass and reduced survival). Selection of candidate genes and polymorphisms is a key element of multigene study design. The present study provides a contemporary basis to select genes and/or polymorphisms for further association studies in cancer cachexia, and to develop their potential as susceptibility biomarkers of cachexia.

  10. Phylogenetic and expression analysis of the NPR1-like gene family from Persea americana (Mill.).

    PubMed

    Backer, Robert; Mahomed, Waheed; Reeksting, Bianca J; Engelbrecht, Juanita; Ibarra-Laclette, Enrique; van den Berg, Noëlani

    2015-01-01

    The NONEXPRESSOR OF PATHOGENESIS-RELATED GENES1 (NPR1) forms an integral part of the salicylic acid (SA) pathway in plants and is involved in cross-talk between the SA and jasmonic acid/ethylene (JA/ET) pathways. Therefore, NPR1 is essential to the effective response of plants to pathogens. Avocado (Persea americana) is a commercially important crop worldwide. Significant losses in production result from Phytophthora root rot, caused by the hemibiotroph, Phytophthora cinnamomi. This oomycete infects the feeder roots of avocado trees leading to an overall decline in health and eventual death. The interaction between avocado and P. cinnamomi is poorly understood and as such limited control strategies exist. Thus uncovering the role of NPR1 in avocado could provide novel insights into the avocado - P. cinnamomi interaction. A total of five NPR1-like sequences were identified. These sequences were annotated using FGENESH and a maximum-likelihood tree was constructed using 34 NPR1-like protein sequences from other plant species. The conserved protein domains and functional motifs of these sequences were predicted. Reverse transcription quantitative PCR was used to analyze the expression of the five NPR1-like sequences in the roots of avocado after treatment with salicylic and jasmonic acid, P. cinnamomi infection, across different tissues and in P. cinnamomi infected tolerant and susceptible rootstocks. Of the five NPR1-like sequences three have strong support for a defensive role while two are most likely involved in development. Significant differences in the expression profiles of these five NPR1-like genes were observed, assisting in functional classification. Understanding the interaction of avocado and P. cinnamomi is essential to developing new control strategies. This work enables further classification of these genes by means of functional annotation and is a crucial step in understanding the role of NPR1 during P. cinnamomi infection.

  11. Phylogenetic and expression analysis of the NPR1-like gene family from Persea americana (Mill.)

    PubMed Central

    Backer, Robert; Mahomed, Waheed; Reeksting, Bianca J.; Engelbrecht, Juanita; Ibarra-Laclette, Enrique; van den Berg, Noëlani

    2015-01-01

    The NONEXPRESSOR OF PATHOGENESIS-RELATED GENES1 (NPR1) forms an integral part of the salicylic acid (SA) pathway in plants and is involved in cross-talk between the SA and jasmonic acid/ethylene (JA/ET) pathways. Therefore, NPR1 is essential to the effective response of plants to pathogens. Avocado (Persea americana) is a commercially important crop worldwide. Significant losses in production result from Phytophthora root rot, caused by the hemibiotroph, Phytophthora cinnamomi. This oomycete infects the feeder roots of avocado trees leading to an overall decline in health and eventual death. The interaction between avocado and P. cinnamomi is poorly understood and as such limited control strategies exist. Thus uncovering the role of NPR1 in avocado could provide novel insights into the avocado – P. cinnamomi interaction. A total of five NPR1-like sequences were identified. These sequences were annotated using FGENESH and a maximum-likelihood tree was constructed using 34 NPR1-like protein sequences from other plant species. The conserved protein domains and functional motifs of these sequences were predicted. Reverse transcription quantitative PCR was used to analyze the expression of the five NPR1-like sequences in the roots of avocado after treatment with salicylic and jasmonic acid, P. cinnamomi infection, across different tissues and in P. cinnamomi infected tolerant and susceptible rootstocks. Of the five NPR1-like sequences three have strong support for a defensive role while two are most likely involved in development. Significant differences in the expression profiles of these five NPR1-like genes were observed, assisting in functional classification. Understanding the interaction of avocado and P. cinnamomi is essential to developing new control strategies. This work enables further classification of these genes by means of functional annotation and is a crucial step in understanding the role of NPR1 during P. cinnamomi infection. PMID:25972890

  12. Clusters of orthologous genes for 41 archaeal genomes and implications for evolutionary genomics of archaea.

    PubMed

    Makarova, Kira S; Sorokin, Alexander V; Novichkov, Pavel S; Wolf, Yuri I; Koonin, Eugene V

    2007-11-27

    An evolutionary classification of genes from sequenced genomes that distinguishes between orthologs and paralogs is indispensable for genome annotation and evolutionary reconstruction. Shortly after multiple genome sequences of bacteria, archaea, and unicellular eukaryotes became available, an attempt at such a classification was implemented in Clusters of Orthologous Groups of proteins (COGs). Rapid accumulation of genome sequences creates opportunities for refining COGs but also represents a challenge because of error amplification. One of the practical strategies involves construction of refined COGs for phylogenetically compact subsets of genomes. New Archaeal Clusters of Orthologous Genes (arCOGs) were constructed for 41 archaeal genomes (13 Crenarchaeota, 27 Euryarchaeota and one Nanoarchaeon) using an improved procedure that employs a similarity tree between smaller, group-specific clusters, semi-automatically partitions orthology domains in multidomain proteins, and uses profile searches for identification of remote orthologs. The annotation of arCOGs is a consensus between three assignments based on the COGs, the CDD database, and the annotations of homologs in the NR database. The 7538 arCOGs, on average, cover approximately 88% of the genes in a genome compared to approximately 76% coverage in COGs. The finer granularity of ortholog identification in the arCOGs is apparent from the fact that 4538 arCOGs correspond to 2362 COGs; approximately 40% of the arCOGs are new. The archaeal gene core (protein-coding genes found in all 41 genomes) consists of 166 arCOGs. The arCOGs were used to reconstruct gene loss and gene gain events during archaeal evolution and gene sets of ancestral forms. The Last Archaeal Common Ancestor (LACA) is conservatively estimated to possess 996 genes compared to 1245 and 1335 genes for the last common ancestors of Crenarchaeota and Euryarchaeota, respectively. 
It is inferred that LACA was a chemoautotrophic hyperthermophile that, in addition to the core archaeal functions, encoded more idiosyncratic systems, e.g., the CASS systems of antivirus defense and some toxin-antitoxin systems. The arCOGs provide a convenient, flexible framework for functional annotation of archaeal genomes, comparative genomics and evolutionary reconstructions. Genomic reconstructions suggest that the last common ancestor of archaea might have been (nearly) as advanced as the modern archaeal hyperthermophiles. ArCOGs and related information are available at: ftp://ftp.ncbi.nih.gov/pub/koonin/arCOGs/.

  13. Transcriptome Sequencing of Lima Bean (Phaseolus lunatus) to Identify Putative Positive Selection in Phaseolus and Legumes

    PubMed Central

    Li, Fengqi; Cao, Depan; Liu, Yang; Yang, Ting; Wang, Guirong

    2015-01-01

    The identification of genes under positive selection is a central goal of evolutionary biology. Many legume species, including Phaseolus vulgaris (common bean) and Phaseolus lunatus (lima bean), have important ecological and economic value. In this study, we sequenced and assembled the transcriptome of one Phaseolus species, lima bean. A comparison with the genomes of six other legume species, including the common bean, Medicago, lotus, soybean, chickpea, and pigeonpea, revealed 15 and 4 orthologous groups with signatures of positive selection among the two Phaseolus species and among the seven legume species, respectively. Characterization of these positively selected genes using Non redundant (nr) annotation, gene ontology (GO) classification, GO term enrichment and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analyses revealed that these genes are mostly involved in thylakoids, photosynthesis and metabolism. This study identified genes that may be related to the divergence of the Phaseolus and legume species. These detected genes are particularly good candidates for subsequent functional studies. PMID:26151849

  14. Gene expression during different periods of the handling-stress response in Pampus argenteus

    NASA Astrophysics Data System (ADS)

    Sun, Peng; Tang, Baojun; Yin, Fei

    2017-11-01

    Common aquaculture practices subject fish to a variety of acute and chronic stressors. Such stressors are inherent in aquaculture production but can adversely affect survival, growth, immune response, reproductive capacity, and behavior. Understanding the biological mechanisms underlying stress responses helps with methods to alleviate the negative effects through better aquaculture practices, resulting in improved animal welfare and production efficiency. In the present study, transcriptome sequencing of liver and kidney was performed in silver pomfret (Pampus argenteus) subjected to handling stress versus controls. A total of 162.19 million clean reads were assembled to 30 339 unigenes. The quality of the assembly was high, with an N50 length of 2 472 bases. For function classification and pathway assignment, the unigenes were categorized into three GO (gene ontology) categories, twenty-six clusters of eggNOG (evolutionary genealogy of genes: non-supervised orthologous groups) function categories, and thirty-eight KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways. Stress affected different functional groups of genes in the tissues studied. Differentially expressed genes were mainly involved in metabolic pathways (carbohydrate metabolism, lipid metabolism, amino-acid metabolism, uptake of cofactors and vitamins, and biosynthesis of other secondary metabolites), environmental information processing (signaling molecules and their interactions), organismal systems (endocrine system, digestive system), and disease (immune, neurodegenerative, endocrine and metabolic diseases). This is the first reported analysis of genome-wide transcriptome in P. argenteus, and the findings expand our understanding of the silver pomfret genome and gene expression in association with stress. The results will be useful to future analyses of functional genes and studies of healthy artificial breeding in P. argenteus and other related fish species.

  15. Molecular and comparative genetics of mental retardation.

    PubMed Central

    Inlow, Jennifer K; Restifo, Linda L

    2004-01-01

    Affecting 1-3% of the population, mental retardation (MR) poses significant challenges for clinicians and scientists. Understanding the biology of MR is complicated by the extraordinary heterogeneity of genetic MR disorders. Detailed analyses of >1000 Online Mendelian Inheritance in Man (OMIM) database entries and literature searches through September 2003 revealed 282 molecularly identified MR genes. We estimate that hundreds more MR genes remain to be identified. A novel test, in which we distributed unmapped MR disorders proportionately across the autosomes, failed to eliminate the well-known X-chromosome overrepresentation of MR genes and candidate genes. This evidence argues against ascertainment bias as the main cause of the skewed distribution. On the basis of a synthesis of clinical and laboratory data, we developed a biological functions classification scheme for MR genes. Metabolic pathways, signaling pathways, and transcription are the most common functions, but numerous other aspects of neuronal and glial biology are controlled by MR genes as well. Using protein sequence and domain-organization comparisons, we found a striking conservation of MR genes and genetic pathways across the approximately 700 million years that separate Homo sapiens and Drosophila melanogaster. Eighty-seven percent have one or more fruit fly homologs and 76% have at least one candidate functional ortholog. We propose that D. melanogaster can be used in a systematic manner to study MR and possibly to develop bioassays for therapeutic drug discovery. We selected 42 Drosophila orthologs as most likely to reveal molecular and cellular mechanisms of nervous system development or plasticity relevant to MR. PMID:15020472

  16. Arabidopsis intragenomic conserved noncoding sequence

    PubMed Central

    Thomas, Brian C.; Rapaka, Lakshmi; Lyons, Eric; Pedersen, Brent; Freeling, Michael

    2007-01-01

    After the most recent tetraploidy in the Arabidopsis lineage, most gene pairs lost one, but not both, of their duplicates. We manually inspected the 3,179 retained gene pairs and their surrounding gene space still present in the genome using a custom-made viewer application. The display of these pairs allowed us to define intragenic conserved noncoding sequences (CNSs), identify exon annotation errors, and discover potentially new genes. Using a strict algorithm to sort high-scoring pair sequences from the bl2seq data, we created a database of 14,944 intragenomic Arabidopsis CNSs. The mean CNS length is 31 bp, ranging from 15 to 285 bp. There are ≈1.7 CNSs associated with a typical gene, and Arabidopsis CNSs are found in all areas around exons, most frequently in the 5′ upstream region. Gene ontology classifications related to transcription, regulation, or “response to …” external or endogenous stimuli, especially hormones, tend to be significantly overrepresented among genes containing a large number of CNSs, whereas protein localization, transport, and metabolism are common among genes with no CNSs. There is a 1.5% overlap between these CNSs and the 218,982 putative RNAs in the Arabidopsis Small RNA Project database, allowing for two mismatches. These CNSs provide a unique set of noncoding sequences enriched for function. CNS function is implied by evolutionary conservation and independently supported because CNS-richness predicts regulatory gene ontology categories. PMID:17301222

  17. Microbial genome analysis: the COG approach.

    PubMed

    Galperin, Michael Y; Kristensen, David M; Makarova, Kira S; Wolf, Yuri I; Koonin, Eugene V

    2017-09-14

    For the past 20 years, the Clusters of Orthologous Genes (COG) database has been a popular tool for microbial genome annotation and comparative genomics. Initially created for the purpose of evolutionary classification of protein families, the COGs have been used, apart from straightforward functional annotation of sequenced genomes, for such tasks as (i) unification of genome annotation in groups of related organisms; (ii) identification of missing and/or undetected genes in complete microbial genomes; (iii) analysis of genomic neighborhoods, in many cases allowing prediction of novel functional systems; (iv) analysis of metabolic pathways and prediction of alternative forms of enzymes; (v) comparison of organisms by COG functional categories; and (vi) prioritization of targets for structural and functional characterization. Here we review the principles of the COG approach and discuss its key advantages and drawbacks in microbial genome analysis. Published by Oxford University Press 2017. This work is written by US Government employees and is in the public domain in the US.

  18. Transcriptome analysis of phosphorus stress responsiveness in the seedlings of Dongxiang wild rice (Oryza rufipogon Griff.).

    PubMed

    Deng, Qian-Wen; Luo, Xiang-Dong; Chen, Ya-Ling; Zhou, Yi; Zhang, Fan-Tao; Hu, Biao-Lin; Xie, Jian-Kun

    2018-03-15

    Low phosphorus availability is a major factor restricting rice growth. Dongxiang wild rice (Oryza rufipogon Griff.) has many useful genes lacking in cultivated rice, including stress resistance to phosphorus deficiency, cold, salt and drought, which is considered to be a precious germplasm resource for rice breeding. However, the molecular mechanism of regulation of phosphorus deficiency tolerance is not clear. In this study, cDNA libraries were constructed from the leaf and root tissues of phosphorus stressed and untreated Dongxiang wild rice seedlings, and transcriptome sequencing was performed with the goal of elucidating the molecular mechanisms involved in phosphorus stress response. The results indicated that 1184 transcripts were differentially expressed in the leaves (323 up-regulated and 861 down-regulated) and 986 transcripts were differentially expressed in the roots (756 up-regulated and 230 down-regulated). 43 genes were up-regulated both in leaves and roots, 38 genes were up-regulated in roots but down-regulated in leaves, and only 2 genes were down-regulated in roots but up-regulated in leaves. Among these differentially expressed genes, the detection of many transcription factors and functional genes demonstrated that multiple regulatory pathways were involved in phosphorus deficiency tolerance. Meanwhile, the differentially expressed genes were also annotated with gene ontology terms and key pathways via functional classification and Kyoto Encyclopedia of Gene and Genomes pathway mapping, respectively. A set of the most important candidate genes was then identified by combining the differentially expressed genes found in the present study with previously identified phosphorus deficiency tolerance quantitative trait loci. 
The present work provides abundant genomic information for functional dissection of the phosphorus deficiency resistance of Dongxiang wild rice, which will help to understand the biological regulatory mechanisms of phosphorus deficiency tolerance in Dongxiang wild rice.

  19. Recursive feature selection with significant variables of support vectors.

    PubMed

    Tsai, Chen-An; Huang, Chien-Hsun; Chang, Ching-Wei; Chen, Chun-Houh

    2012-01-01

    The development of DNA microarrays enables researchers to screen thousands of genes simultaneously and helps identify genes with high and low expression levels in normal and diseased tissues. Selecting relevant genes for cancer classification is an important issue. Most gene selection methods use univariate ranking criteria and an arbitrarily chosen threshold to select genes. However, the parameter setting may not be compatible with the selected classification algorithm. In this paper, we propose a new gene selection method (SVM-t) based on t-statistics embedded in a support vector machine. We compared its performance with that of two similar SVM-based methods: SVM recursive feature elimination (SVMRFE) and recursive support vector machine (RSVM). The three methods were compared based on extensive simulation experiments and analyses of two published microarray datasets. In the simulation experiments, we found that the proposed method is more robust in selecting informative genes than SVMRFE and RSVM and is capable of attaining good classification performance when the variations of informative and noninformative genes are different. In the analysis of two microarray datasets, the proposed method yields better performance in identifying fewer genes with good prediction accuracy, compared to SVMRFE and RSVM.

  20. DeepGene: an advanced cancer type classifier based on deep learning and somatic point mutations.

    PubMed

    Yuan, Yuchen; Shi, Yi; Li, Changyang; Kim, Jinman; Cai, Weidong; Han, Zeguang; Feng, David Dagan

    2016-12-23

    With the development of DNA sequencing technology, large amounts of sequencing data have become available in recent years and provide unprecedented opportunities for advanced association studies between somatic point mutations and cancer types/subtypes, which may contribute to more accurate somatic point mutation based cancer classification (SMCC). However, in existing SMCC methods, issues like high data sparsity, small sample size, and the application of simple linear classifiers are major obstacles to improving the classification performance. To address the obstacles in existing SMCC studies, we propose DeepGene, an advanced deep neural network (DNN) based classifier that consists of three steps: first, the clustered gene filtering (CGF) concentrates the gene data by mutation occurrence frequency, filtering out the majority of irrelevant genes; second, the indexed sparsity reduction (ISR) converts the gene data into indexes of its non-zero elements, thereby significantly suppressing the impact of data sparsity; finally, the data after CGF and ISR are fed into a DNN classifier, which extracts high-level features for accurate classification. Experimental results on our curated TCGA-DeepGene dataset, which is a reformulated subset of the TCGA dataset containing 12 selected types of cancer, show that CGF, ISR and DNN all contribute to improving the overall classification performance. We further compare DeepGene with three widely adopted classifiers and demonstrate that DeepGene has at least 24% performance improvement in terms of testing accuracy. Based on deep learning and somatic point mutation data, we devise DeepGene, an advanced cancer type classifier, which addresses the obstacles in existing SMCC studies. 
Experiments indicate that DeepGene outperforms three widely adopted existing classifiers, which is mainly attributed to its deep learning module that is able to extract the high level features between combinatorial somatic point mutations and cancer types.
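
    The indexed sparsity reduction step described in this abstract (converting gene data into the indexes of its non-zero elements) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the 1-based indexing, the fixed output length `max_len`, and the zero padding are all assumptions made for illustration only.

```python
from typing import List


def indexed_sparsity_reduction(mutations: List[int], max_len: int, pad: int = 0) -> List[int]:
    """Replace a sparse 0/1 mutation vector with the indexes of its non-zero entries.

    Assumptions (not from the abstract): indexes are 1-based so that 0 can
    serve as padding, and the output is truncated or padded to max_len.
    """
    # Collect the 1-based positions of all mutated (non-zero) genes.
    indexes = [i + 1 for i, value in enumerate(mutations) if value != 0]
    # Truncate to the fixed length, then pad with the padding value.
    indexes = indexes[:max_len]
    return indexes + [pad] * (max_len - len(indexes))


# Hypothetical sample: 10 genes, of which genes 3, 5 and 9 carry a mutation.
sample = [0, 0, 1, 0, 1, 0, 0, 0, 1, 0]
print(indexed_sparsity_reduction(sample, max_len=4))  # prints [3, 5, 9, 0]
```

    The dense representation that results has a length set by the number of mutated genes kept per sample, not by the total number of genes, which is the sense in which the sparsity is reduced in this sketch.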

  1. Classification of rare missense substitutions, using risk surfaces, with genetic- and molecular-epidemiology applications.

    PubMed

    Tavtigian, Sean V; Byrnes, Graham B; Goldgar, David E; Thomas, Alun

    2008-11-01

    Many individually rare missense substitutions are encountered during deep resequencing of candidate susceptibility genes and clinical mutation screening of known susceptibility genes. BRCA1 and BRCA2 are among the most resequenced of all genes, and clinical mutation screening of these genes provides an extensive data set for analysis of rare missense substitutions. Align-GVGD is a mathematically simple missense substitution analysis algorithm, based on the Grantham difference, which has already contributed to classification of missense substitutions in BRCA1, BRCA2, and CHEK2. However, the distribution of genetic risk as a function of Align-GVGD's output variables Grantham variation (GV) and Grantham deviation (GD) has not been well characterized. Here, we used data from the Myriad Genetic Laboratories database of nearly 70,000 full-sequence tests plus two risk estimates, one approximating the odds ratio and the other reflecting strength of selection, to display the distribution of risk in the GV-GD plane as a series of surfaces. We abstracted contours from the surfaces and used the contours to define a sequence of missense substitution grades ordered from greatest risk to least risk. The grades were validated internally using a third, personal and family history-based, measure of risk. The Align-GVGD grades defined here are applicable to both the genetic epidemiology problem of classifying rare missense substitutions observed in known susceptibility genes and the molecular epidemiology problem of analyzing rare missense substitutions observed during case-control mutation screening studies of candidate susceptibility genes. (c) 2008 Wiley-Liss, Inc.

  2. Bioinformatics analyses of Shigella CRISPR structure and spacer classification.

    PubMed

    Wang, Pengfei; Zhang, Bing; Duan, Guangcai; Wang, Yingfang; Hong, Lijuan; Wang, Linlin; Guo, Xiangjiao; Xi, Yuanlin; Yang, Haiyan

    2016-03-01

    Clustered regularly interspaced short palindromic repeats (CRISPR) are inheritable genetic elements of a variety of archaea and bacteria and are indicative of bacterial ecological adaptation, conferring acquired immunity against invading foreign nucleic acids. Shigella is an important pathogen for anthroponosis. This study aimed to analyze the features of Shigella CRISPR structure and classify the spacers through a bioinformatics approach. Among 107 Shigella strains, 434 CRISPR structure loci were identified with two to seven loci in different strains. CRISPR-Q1, CRISPR-Q4 and CRISPR-Q5 were widely distributed in Shigella strains. Comparison of the first and last repeats of CRISPR1, CRISPR2 and CRISPR3 revealed several base variants and different stem-loop structures. A total of 259 cas genes were found among these 107 Shigella strains. Cas gene deletions were discovered in 88 strains. However, one strain did not contain any cas gene. Intact clusters of cas genes were found in 19 strains. From comprehensive analysis of sequence signature and BLAST and CRISPRTarget score, the 708 spacers were classified into three subtypes: Type I, Type II and Type III. Of these, Type I spacers were those linked with one gene segment, Type II spacers were linked with two or more different gene segments, and Type III spacers were undefined. This study examined the diversity of the CRISPR/cas system in Shigella strains and described the main features of CRISPR structure and spacer classification, providing critical information for elucidation of the mechanisms of spacer formation and exploration of the role the spacers play in the function of the CRISPR/cas system.
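
    The three-way spacer rule stated in this abstract can be written as a short sketch. The function name and input format are hypothetical, and reading "undefined" as "no linked gene segment" is an assumption; the abstract does not give the BLAST or CRISPRTarget score thresholds that decide whether a spacer counts as linked to a segment.

```python
from typing import Iterable


def classify_spacer(linked_segments: Iterable[str]) -> str:
    """Assign a spacer subtype from the gene segments it is linked with.

    Assumption: a Type III ("undefined") spacer is one with no linked segment.
    """
    # Repeated hits to the same segment count once.
    distinct = set(linked_segments)
    if len(distinct) == 1:
        return "Type I"    # linked with exactly one gene segment
    if len(distinct) >= 2:
        return "Type II"   # linked with two or more different gene segments
    return "Type III"      # no linked segment: undefined


# Hypothetical segment names, used only to exercise the rule.
print(classify_spacer(["segA"]))          # prints Type I
print(classify_spacer(["segA", "segB"]))  # prints Type II
print(classify_spacer([]))                # prints Type III
```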

  3. Literature classification for semi-automated updating of biological knowledgebases

    PubMed Central

    2013-01-01

    Background: As the output of biological assays increases in resolution and volume, the body of specialized biological data, such as functional annotations of gene and protein sequences, enables extraction of higher-level knowledge needed for practical application in bioinformatics. Whereas common types of biological data, such as sequence data, are extensively stored in biological databases, functional annotations, such as immunological epitopes, are found primarily in semi-structured formats or free text embedded in primary scientific literature. Results: We defined and applied a machine learning approach for literature classification to support updating of TANTIGEN, a knowledgebase of tumor T-cell antigens. Abstracts from PubMed were downloaded and classified as either "relevant" or "irrelevant" for database update. Training and five-fold cross-validation of a k-NN classifier on 310 abstracts yielded classification accuracy of 0.95, thus showing significant value in support of data extraction from the literature. Conclusion: We propose a conceptual framework for semi-automated extraction of epitope data embedded in scientific literature using principles from text mining and machine learning. The addition of such data will aid in the transition of biological databases to knowledgebases. PMID:24564403

  4. [Evaluation of traditional pathological classification at molecular classification era for gastric cancer].

    PubMed

    Yu, Yingyan

    2014-01-01

    Histopathological classification plays a pivotal role in both basic research and clinical diagnosis and treatment of gastric cancer. Currently, there are different classification systems in basic science and clinical application. In the medical literature, different classifications are used, including the Lauren and WHO systems, which has confused many researchers. The Lauren classification was proposed half a century ago but is still used worldwide. Its advantages include simplicity, ease of use, and prognostic significance. The WHO classification scheme is better than the Lauren classification in that it is continuously being revised according to progress in gastric cancer research, and is used in clinical and pathological diagnosis in common scenarios. With progress in genomics, transcriptomics, proteomics, and metabolomics research, molecular classification of gastric cancer has become a current hot topic. The traditional therapeutic approach based on phenotypic characteristics of gastric cancer will most likely be replaced by an approach based on gene variation. Gene-targeted therapy against the same molecular variation seems more reasonable than traditional chemical treatment based on the same morphological change.

  5. GTA: a game theoretic approach to identifying cancer subnetwork markers.

    PubMed

    Farahmand, S; Goliaei, S; Ansari-Pour, N; Razaghi-Moghadam, Z

    2016-03-01

    The identification of genetic markers (e.g. genes, pathways and subnetworks) for cancer has been one of the most challenging research areas in recent years. A subset of these studies attempts to analyze genome-wide expression profiles to identify markers with high reliability and reusability across independent whole-transcriptome microarray datasets. Therefore, the functional relationships of genes are integrated with their expression data. However, for a more accurate representation of the functional relationships among genes, utilization of the protein-protein interaction network (PPIN) seems to be necessary. Herein, a novel game theoretic approach (GTA) is proposed for the identification of cancer subnetwork markers by integrating genome-wide expression profiles and PPIN. The GTA method was applied to three distinct whole-transcriptome breast cancer datasets to identify the subnetwork markers associated with metastasis. To evaluate the performance of our approach, the identified subnetwork markers were compared with gene-based, pathway-based and network-based markers. We show that GTA is not only capable of identifying robust metastatic markers but also provides a higher classification performance. In addition, based on these GTA-based subnetworks, we identified a new bona fide candidate gene for breast cancer susceptibility.

  6. Genetic Bee Colony (GBC) algorithm: A new gene selection method for microarray cancer classification.

    PubMed

    Alshamlan, Hala M; Badr, Ghada H; Alohali, Yousef A

    2015-06-01

    Nature-inspired evolutionary algorithms have proven effective for solving feature selection and classification problems. Artificial Bee Colony (ABC) is a relatively new swarm intelligence method. In this paper, we propose a new hybrid gene selection method, namely the Genetic Bee Colony (GBC) algorithm. The proposed algorithm combines the use of a Genetic Algorithm (GA) with the Artificial Bee Colony (ABC) algorithm. The goal is to integrate the advantages of both algorithms. The proposed algorithm is applied to a microarray gene expression profile in order to select the most predictive and informative genes for cancer classification. To test the accuracy of the proposed algorithm, extensive experiments were conducted. Three binary microarray datasets were used: colon, leukemia, and lung. In addition, three multi-class microarray datasets were used: SRBCT, lymphoma, and leukemia. Results of the GBC algorithm are compared with our recently proposed technique: mRMR when combined with the Artificial Bee Colony algorithm (mRMR-ABC). We also compared the combination of mRMR with GA (mRMR-GA) and Particle Swarm Optimization (mRMR-PSO) algorithms. In addition, we compared the GBC algorithm with other related algorithms that have been recently published in the literature, using all benchmark datasets. The GBC algorithm shows superior performance as it achieved the highest classification accuracy along with the lowest average number of selected genes. This proves that the GBC algorithm is a promising approach for solving the gene selection problem in both binary and multi-class cancer classification. Copyright © 2015 Elsevier Ltd. All rights reserved.

  7. Improved Sparse Multi-Class SVM and Its Application for Gene Selection in Cancer Classification

    PubMed Central

    Huang, Lingkang; Zhang, Hao Helen; Zeng, Zhao-Bang; Bushel, Pierre R.

    2013-01-01

    Background: Microarray techniques provide promising tools for cancer diagnosis using gene expression profiles. However, molecular diagnosis based on high-throughput platforms presents great challenges due to the overwhelming number of variables versus the small sample size and the complex nature of multi-type tumors. Support vector machines (SVMs) have shown superior performance in cancer classification due to their ability to handle high dimensional low sample size data. The multi-class SVM algorithm of Crammer and Singer provides a natural framework for multi-class learning. Despite its effective performance, the procedure utilizes all variables without selection. In this paper, we propose to improve the procedure by imposing shrinkage penalties in learning to enforce solution sparsity. Results: The original multi-class SVM of Crammer and Singer is effective for multi-class classification but does not conduct variable selection. We improved the method by introducing soft-thresholding type penalties to incorporate variable selection into multi-class classification for high dimensional data. The new methods were applied to simulated data and two cancer gene expression data sets. The results demonstrate that the new methods can select a small number of genes for building accurate multi-class classification rules. Furthermore, the important genes selected by the methods overlap significantly, suggesting general agreement among different variable selection schemes. Conclusions: High accuracy and sparsity make the new methods attractive for cancer diagnostics with gene expression data and defining targets of therapeutic intervention. Availability: The MATLAB source code is available from http://math.arizona.edu/~hzhang/software.html. PMID:23966761
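
    The abstract above does not give the exact form of its penalties, so the following is only a minimal sketch of the generic soft-thresholding operator that such shrinkage penalties are usually built on. It illustrates why shrinkage produces sparse solutions: weights whose magnitude falls below the threshold become exactly zero. The names soft_threshold, shrink and lam are hypothetical and are not taken from the authors' MATLAB code.

```python
def soft_threshold(w, lam):
    """Shrink a single weight toward zero by lam; small weights become exactly 0."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0


def shrink(weights, lam):
    """Apply soft-thresholding to every weight (one weight per gene in this sketch)."""
    return [soft_threshold(w, lam) for w in weights]


# Illustrative weights only; they are not data from the paper.
weights = [1.0, -0.25, 0.125, -1.5]
shrunk = shrink(weights, lam=0.5)
selected = [i for i, w in enumerate(shrunk) if w != 0.0]
print(shrunk, selected)
```

    In this toy example only the first and last weights survive, so only those two "genes" would remain in the model.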

  8. Gene expression profiling of choline-deprived neural precursor cells isolated from mouse brain.

    PubMed

    Niculescu, Mihai D; Craciunescu, Corneliu N; Zeisel, Steven H

    2005-04-04

    Choline is an essential nutrient and an important methyl donor. Choline deficiency alters fetal development of the hippocampus in rodents and these changes are associated with decreased memory function lasting throughout life. Also, choline deficiency alters global and gene-specific DNA methylation in several models. This gene expression profiling study describes changes in cortical neural precursor cells from embryonic day 14 mice, after 48 h of exposure to a choline-deficient medium. Using Significance Analysis of Microarrays, we found the expression of 1003 genes to be significantly changed (out of 16,000 genes spotted on the array), with a false discovery rate below 5%. A total of 846 genes were overexpressed while 157 were underexpressed. Classification by gene ontology revealed that 331 of these genes modulate cell proliferation, apoptosis, neuronal and glial differentiation, methyl metabolism, and calcium-binding protein classes. Twenty-seven genes that had changed expression have previously been reported to be regulated by promoter or intron methylation. These findings support our previous work suggesting that choline deficiency decreases the proliferation of neural precursors and possibly increases premature neuronal differentiation and apoptosis.

  9. Transcriptional Regulation in Saccharomyces cerevisiae: Transcription Factor Regulation and Function, Mechanisms of Initiation, and Roles of Activators and Coactivators

    PubMed Central

    Hahn, Steven; Young, Elton T.

    2011-01-01

    Here we review recent advances in understanding the regulation of mRNA synthesis in Saccharomyces cerevisiae. Many fundamental gene regulatory mechanisms have been conserved in all eukaryotes, and budding yeast has been at the forefront in the discovery and dissection of these conserved mechanisms. Topics covered include upstream activation sequence and promoter structure, transcription factor classification, and examples of regulated transcription factor activity. We also examine advances in understanding the RNA polymerase II transcription machinery, conserved coactivator complexes, transcription activation domains, and the cooperation of these factors in gene regulatory mechanisms. PMID:22084422

  10. De Novo Transcriptome Assembly and Characterization of Lithospermum officinale to Discover Putative Genes Involved in Specialized Metabolites Biosynthesis.

    PubMed

    Rai, Amit; Nakaya, Taiki; Shimizu, Yohei; Rai, Megha; Nakamura, Michimi; Suzuki, Hideyuki; Saito, Kazuki; Yamazaki, Mami

    2018-05-29

    Lithospermum officinale is a valuable source of bioactive metabolites with medicinal and industrial values. However, little is known about genes involved in the biosynthesis of these metabolites, primarily due to the lack of genome or transcriptome resources. This study presents the first effort to establish and characterize a de novo transcriptome assembly resource for L. officinale and expression analysis for three of its tissues, namely leaf, stem, and root. Using over 4Gbps of RNA-sequencing datasets, we obtained a de novo transcriptome assembly of L. officinale, consisting of 77,047 unigenes with an assembly N50 value of 1524 bp. Based on transcriptome annotation and functional classification, 52,766 unigenes were assigned putative gene functions, gene ontology terms, and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways. KEGG pathway and gene ontology enrichment analysis using highly expressed unigenes across three tissues and targeted metabolome analysis showed active secondary metabolic processes enriched specifically in the root of L. officinale. Using co-expression analysis, we also identified 20 and 48 unigenes representing different enzymes of lithospermic/chlorogenic acid and shikonin biosynthesis pathways, respectively. We further identified 15 candidate unigenes annotated as cytochrome P450 with the highest expression in the root of L. officinale as novel genes with a role in key biochemical reactions toward shikonin biosynthesis. Thus, through this study, we not only generated a high-quality genomic resource for L. officinale but also propose candidate genes involved in shikonin biosynthesis pathways for further functional characterization. Georg Thieme Verlag KG Stuttgart · New York.
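
    The N50 value quoted in the abstract above is a standard assembly statistic: the length L such that sequences of length L or longer together contain at least half of all assembled bases. A minimal sketch of that calculation follows; the function name n50 and the toy lengths are illustrative assumptions, not data from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are visited from longest to shortest; the N50 is the length
    of the sequence at which the running total first reaches half of the
    total assembled length.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no sequences given")


# Toy unigene lengths: total is 32, so the two 8-base sequences reach the halfway mark.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```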

  11. Molecular Diagnostics of Gliomas Using Next Generation Sequencing of a Glioma-Tailored Gene Panel.

    PubMed

    Zacher, Angela; Kaulich, Kerstin; Stepanow, Stefanie; Wolter, Marietta; Köhrer, Karl; Felsberg, Jörg; Malzkorn, Bastian; Reifenberger, Guido

    2017-03-01

    Current classification of gliomas is based on histological criteria according to the World Health Organization (WHO) classification of tumors of the central nervous system. Over the past years, characteristic genetic profiles have been identified in various glioma types. These can refine tumor diagnostics and provide important prognostic and predictive information. We report on the establishment and validation of gene panel next generation sequencing (NGS) for the molecular diagnostics of gliomas. We designed a glioma-tailored gene panel covering 660 amplicons derived from 20 genes frequently aberrant in different glioma types. Sensitivity and specificity of glioma gene panel NGS for detection of DNA sequence variants and copy number changes were validated by single gene analyses. NGS-based mutation detection was optimized for application on formalin-fixed paraffin-embedded tissue specimens including small stereotactic biopsy samples. NGS data obtained in a retrospective analysis of 121 gliomas allowed for their molecular classification into distinct biological groups, including (i) isocitrate dehydrogenase gene (IDH) 1 or 2 mutant astrocytic gliomas with frequent α-thalassemia/mental retardation syndrome X-linked (ATRX) and tumor protein p53 (TP53) gene mutations, (ii) IDH mutant oligodendroglial tumors with 1p/19q codeletion, telomerase reverse transcriptase (TERT) promoter mutation and frequent Drosophila homolog of capicua (CIC) gene mutation, as well as (iii) IDH wildtype glioblastomas with frequent TERT promoter mutation, phosphatase and tensin homolog (PTEN) mutation and/or epidermal growth factor receptor (EGFR) amplification. Oligoastrocytic gliomas were genetically assigned to either of these groups. Our findings implicate gene panel NGS as a promising diagnostic technique that may facilitate integrated histological and molecular glioma classification. © 2016 International Society of Neuropathology.

  12. PCR and RFLP analyses based on the ribosomal protein operon

    USDA-ARS's Scientific Manuscript database

    Differentiation and classification of phytoplasmas have been primarily based on the highly conserved 16S rRNA gene. RFLP analysis of 16S rRNA gene sequences has identified 31 16S rRNA (16Sr) groups and more than 100 16Sr subgroups. Classification of phytoplasma strains can however, become more refin...

  13. A cDNA microarray gene expression data classifier for clinical diagnostics based on graph theory.

    PubMed

    Benso, Alfredo; Di Carlo, Stefano; Politano, Gianfranco

    2011-01-01

    Despite great advances in discovering cancer molecular profiles, the proper application of microarray technology to routine clinical diagnostics is still a challenge. Current practices in the classification of microarray data show two main limitations: the reliability of the training data sets used to build the classifiers, and the classifiers' performance, especially when the sample to be classified does not belong to any of the available classes. In this case, state-of-the-art algorithms usually produce a high rate of false positives that, in real diagnostic applications, are unacceptable. To address this problem, this paper presents a new cDNA microarray data classification algorithm that is based on graph theory and is able to overcome most of the limitations of known classification methodologies. The classifier works by analyzing gene expression data organized in an innovative data structure based on graphs, where vertices correspond to genes and edges to gene expression relationships. To demonstrate the novelty of the proposed approach, the authors present an experimental performance comparison between the proposed classifier and several state-of-the-art classification algorithms.

  14. Multiclass cancer diagnosis using tumor gene expression signatures

    DOE PAGES

    Ramaswamy, S.; Tamayo, P.; Rifkin, R.; ...

    2001-12-11

    The optimal treatment of patients with cancer depends on establishing accurate diagnoses by using a complex combination of clinical and histopathological data. In some instances, this task is difficult or impossible because of atypical clinical presentation or histopathology. To determine whether the diagnosis of multiple common adult malignancies could be achieved purely by molecular classification, we subjected 218 tumor samples, spanning 14 common tumor types, and 90 normal tissue samples to oligonucleotide microarray gene expression analysis. The expression levels of 16,063 genes and expressed sequence tags were used to evaluate the accuracy of a multiclass classifier based on a support vector machine algorithm. Overall classification accuracy was 78%, far exceeding the accuracy of random classification (9%). Poorly differentiated cancers resulted in low-confidence predictions and could not be accurately classified according to their tissue of origin, indicating that they are molecularly distinct entities with dramatically different gene expression patterns compared with their well differentiated counterparts. Taken together, these results demonstrate the feasibility of accurate, multiclass molecular cancer classification and suggest a strategy for future clinical implementation of molecular cancer diagnostics.

  15. Fourier-based classification of protein secondary structures.

    PubMed

    Shu, Jian-Jun; Yong, Kian Yan

    2017-04-15

    The correct prediction of protein secondary structures is one of the key issues in predicting the correct protein folded shape, which is used for determining gene function. Existing methods make use of amino acid properties as indices to classify protein secondary structures, but are faced with a significant number of misclassifications. The paper presents a technique for the classification of protein secondary structures based on protein "signal-plotting" and the use of the Fourier technique for digital signal processing. New indices are proposed to classify protein secondary structures by analyzing hydrophobicity profiles. The approach is simple and straightforward. Results show that more types of protein secondary structures can be classified by means of these newly proposed indices. Copyright © 2017 Elsevier Inc. All rights reserved.
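
    The abstract above does not define its new indices, so the following is only a hedged sketch of the general idea of Fourier analysis on a hydrophobicity profile, not the authors' method. It assumes the Kyte-Doolittle hydropathy scale and the textbook periodicities of about 3.6 residues for an alpha-helix and about 2 residues for a beta-strand; the function names and the test peptide are hypothetical.

```python
import cmath
import math

# Kyte-Doolittle hydropathy values (an assumed choice of scale for this sketch).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence):
    """Turn an amino acid sequence into a numeric hydrophobicity signal."""
    return [KYTE_DOOLITTLE[residue] for residue in sequence.upper()]


def power_at_period(profile, period):
    """Fourier power of the mean-centred profile at a given period (in residues)."""
    mean = sum(profile) / len(profile)
    total = sum(
        (value - mean) * cmath.exp(-2j * math.pi * position / period)
        for position, value in enumerate(profile)
    )
    return abs(total) ** 2 / len(profile)


# A strictly alternating hydrophobic/hydrophilic peptide has period-2 character.
profile = hydropathy_profile("VKVKVKVKVKVK")
strand_power = power_at_period(profile, 2.0)
helix_power = power_at_period(profile, 3.6)
print("strand-like" if strand_power > helix_power else "helix-like")
```

    Comparing the power at the two periods is one simple way to turn the spectrum into a classification index; other scales or thresholds could equally be used.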

  16. Effects of gross motor function and manual function levels on performance-based ADL motor skills of children with spastic cerebral palsy.

    PubMed

    Park, Myoung-Ok

    2017-02-01

    [Purpose] The purpose of this study was to determine effects of Gross Motor Function Classification System and Manual Ability Classification System levels on performance-based motor skills of children with spastic cerebral palsy. [Subjects and Methods] Twenty-three children with cerebral palsy were included. The Assessment of Motor and Process Skills was used to evaluate performance-based motor skills in daily life. Gross motor function was assessed using Gross Motor Function Classification Systems, and manual function was measured using the Manual Ability Classification System. [Results] Motor skills in daily activities were significantly different on Gross Motor Function Classification System level and Manual Ability Classification System level. According to the results of multiple regression analysis, children categorized as Gross Motor Function Classification System level III scored lower in terms of performance based motor skills than Gross Motor Function Classification System level I children. Also, when analyzed with respect to Manual Ability Classification System level, level II was lower than level I, and level III was lower than level II in terms of performance based motor skills. [Conclusion] The results of this study indicate that performance-based motor skills differ among children categorized based on Gross Motor Function Classification System and Manual Ability Classification System levels of cerebral palsy.

  17. Transcriptomic analysis of Ruditapes philippinarum hemocytes reveals cytoskeleton disruption after in vitro Vibrio tapetis challenge.

    PubMed

    Brulle, Franck; Jeffroy, Fanny; Madec, Stéphanie; Nicolas, Jean-Louis; Paillard, Christine

    2012-10-01

    The Manila clam, Ruditapes philippinarum, is an economically-important, commercial shellfish; harvests are diminished in some European waters by a pathogenic bacterium, Vibrio tapetis, that causes Brown Ring disease. To identify molecular characteristics associated with susceptibility or resistance to Brown Ring disease, Suppression Subtractive Hybridization (SSH) analyses were performed to construct cDNA libraries enriched in up- or down-regulated transcripts from clam immune cells, hemocytes, after a 3-h in vitro challenge with cultured V. tapetis. Nine hundred and ninety eight sequences from the two libraries were sequenced, and an in silico analysis identified 235 unique genes. BLAST and "Gene ontology" classification analyses revealed that 60.4% of the Expressed Sequence Tags (ESTs) have high similarities with genes involved in various physiological functions, such as immunity, apoptosis and cytoskeleton organization; whereas, 39.6% remain unidentified. From the 235 unique genes, we selected 22 candidates based upon physiological function and redundancy in the libraries. Then, Real-Time PCR analysis identified 3 genes related to cytoskeleton organization showing significant variation in expression attributable to V. tapetis exposure. Disruption in regulation of these genes is consistent with the etiologic agent of Brown Ring disease in Manila clams. Copyright © 2012 Elsevier Ltd. All rights reserved.

  18. Application of machine learning on brain cancer multiclass classification

    NASA Astrophysics Data System (ADS)

    Panca, V.; Rustam, Z.

    2017-07-01

    Classification of brain cancer is a problem of multiclass classification. One approach to solve this problem is by first transforming it into several binary problems. The microarray gene expression dataset has the two main characteristics of medical data: extremely many features (genes) and only a small number of samples. The application of machine learning on microarray gene expression datasets mainly consists of two steps: feature selection and classification. In this paper, the features are selected using a method based on the support vector machine recursive feature elimination (SVM-RFE) principle, which is improved to solve multiclass classification and called multiple multiclass SVM-RFE. Instead of using only the selected features on a single classifier, this method combines the results of multiple classifiers. The features are divided into subsets and SVM-RFE is used on each subset. Then, the selected features of each subset are put on separate classifiers. This method enhances the feature selection ability of each single SVM-RFE. Twin support vector machine (TWSVM) is used as the classifier to reduce computational complexity. While ordinary SVM finds a single optimum hyperplane, the main objective of Twin SVM is to find two non-parallel optimum hyperplanes. The experiment on the brain cancer microarray gene expression dataset shows this method could classify 71.4% of the overall test data correctly, using 100 and 1000 genes selected from the multiple multiclass SVM-RFE feature selection method. Furthermore, the per class results show that this method could classify data of normal and MD class with 100% accuracy.
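
    The abstract above describes dividing the features into subsets, training a separate classifier per subset, and combining the results, but it does not state how the features are partitioned or how the outputs are combined. The sketch below covers only those two bookkeeping steps under stated assumptions: a round-robin partition and a simple majority vote. It does not implement SVM-RFE or Twin SVM, and the names split_features and majority_vote are hypothetical.

```python
from collections import Counter


def split_features(feature_ids, n_subsets):
    """Deal feature indices round-robin into n_subsets disjoint subsets."""
    return [feature_ids[start::n_subsets] for start in range(n_subsets)]


def majority_vote(labels):
    """Combine one predicted label per classifier into a single label.

    Ties go to the label that was seen first.
    """
    return Counter(labels).most_common(1)[0][0]


# Seven hypothetical gene indices split across three per-subset classifiers.
subsets = split_features(list(range(7)), 3)
print(subsets)

# Hypothetical per-classifier predictions for one test sample.
print(majority_vote(["MD", "normal", "MD"]))
```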

  19. JDINAC: joint density-based non-parametric differential interaction network analysis and classification using high-dimensional sparse omics data.

    PubMed

    Ji, Jiadong; He, Di; Feng, Yang; He, Yong; Xue, Fuzhong; Xie, Lei

    2017-10-01

    A complex disease is usually driven by a number of genes interwoven into networks, rather than a single gene product. Network comparison or differential network analysis has become an important means of revealing the underlying mechanism of pathogenesis and identifying clinical biomarkers for disease classification. Most studies, however, are limited to network correlations that mainly capture the linear relationship among genes, or rely on the assumption of a parametric probability distribution of gene measurements. They are restrictive in real application. We propose a new Joint density based non-parametric Differential Interaction Network Analysis and Classification (JDINAC) method to identify differential interaction patterns of network activation between two groups. At the same time, JDINAC uses the network biomarkers to build a classification model. The novelty of JDINAC lies in its potential to capture non-linear relations between molecular interactions using high-dimensional sparse data as well as to adjust confounding factors, without the need of the assumption of a parametric probability distribution of gene measurements. Simulation studies demonstrate that JDINAC provides more accurate differential network estimation and lower classification error than that achieved by other state-of-the-art methods. We apply JDINAC to a Breast Invasive Carcinoma dataset, which includes 114 patients who have both tumor and matched normal samples. The hub genes and differential interaction patterns identified were consistent with existing experimental studies. Furthermore, JDINAC discriminated the tumor and normal sample with high accuracy by virtue of the identified biomarkers. JDINAC provides a general framework for feature selection and classification using high-dimensional sparse omics data. R scripts are available at https://github.com/jijiadong/JDINAC. Contact: lxie@iscb.org. Supplementary data are available at Bioinformatics online. © The Author (2017). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com

  20. Graph-based semi-supervised learning with genomic data integration using condition-responsive genes applied to phenotype classification.

    PubMed

    Doostparast Torshizi, Abolfazl; Petzold, Linda R

    2018-01-01

    Data integration methods that combine data from different molecular levels such as genome, epigenome, transcriptome, etc., have received a great deal of interest in the past few years. It has been demonstrated that the synergistic effects of different biological data types can boost learning capabilities and lead to a better understanding of the underlying interactions among molecular levels. In this paper we present a graph-based semi-supervised classification algorithm that incorporates latent biological knowledge in the form of biological pathways with gene expression and DNA methylation data. The process of graph construction from biological pathways is based on detecting condition-responsive genes, where 3 sets of genes are finally extracted: all condition responsive genes, high-frequency condition-responsive genes, and P-value-filtered genes. The proposed approach is applied to ovarian cancer data downloaded from the Human Genome Atlas. Extensive numerical experiments demonstrate superior performance of the proposed approach compared to other state-of-the-art algorithms, including the latest graph-based classification techniques. Simulation results demonstrate that integrating various data types enhances classification performance and leads to a better understanding of interrelations between diverse omics data types. The proposed approach outperforms many of the state-of-the-art data integration algorithms. © The Author 2017. Published by Oxford University Press on behalf of the American Medical Informatics Association. All rights reserved. For Permissions, please email: journals.permissions@oup.com

  1. An integrated method for cancer classification and rule extraction from microarray data

    PubMed Central

    Huang, Liang-Tsung

    2009-01-01

    Different microarray techniques recently have been successfully used to investigate useful information for cancer diagnosis at the gene expression level due to their ability to measure thousands of gene expression levels in a massively parallel way. One important issue is to improve classification performance of microarray data. However, it would be ideal that influential genes and even interpretable rules can be explored at the same time to offer biological insight. Introducing the concepts of system design in software engineering, this paper has presented an integrated and effective method (named X-AI) for accurate cancer classification and the acquisition of knowledge from DNA microarray data. This method included a feature selector to systematically extract the relative important genes so as to reduce the dimension and retain as much as possible of the class discriminatory information. Next, diagonal quadratic discriminant analysis (DQDA) was combined to classify tumors, and generalized rule induction (GRI) was integrated to establish association rules which can give an understanding of the relationships between cancer classes and related genes. Two non-redundant datasets of acute leukemia were used to validate the proposed X-AI, showing significantly high accuracy for discriminating different classes. On the other hand, I have presented the abilities of X-AI to extract relevant genes, as well as to develop interpretable rules. Further, a web server has been established for cancer classification and it is freely available at . PMID:19272192

  2. Effects of drought stress on global gene expression profile in leaf and root samples of Dongxiang wild rice (Oryza rufipogon).

    PubMed

    Zhang, Fantao; Zhou, Yi; Zhang, Meng; Luo, Xiangdong; Xie, Jiankun

    2017-06-30

    Drought is a serious constraint to rice production throughout the world, and although Dongxiang wild rice (Oryza rufipogon, DXWR) possesses a high degree of drought resistance, the underlying mechanisms of this trait remain unclear. In the present study, cDNA libraries were constructed from the leaf and root tissues of drought-stressed and untreated DXWR seedlings, and transcriptome sequencing was performed with the goal of elucidating the molecular mechanisms involved in drought-stress response. The results indicated that 11231 transcripts were differentially expressed in the leaves (4040 up-regulated and 7191 down-regulated) and 7025 transcripts were differentially expressed in the roots (3097 up-regulated and 3928 down-regulated). Among these differentially expressed genes (DEGs), the detection of many transcriptional factors and functional genes demonstrated that multiple regulatory pathways were involved in drought resistance. Meanwhile, the DEGs were also annotated with gene ontology (GO) terms and key pathways via functional classification and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway mapping, respectively. A set of the most interesting candidate genes was then identified by combining the DEGs with previously identified drought-resistant quantitative trait loci (QTL). The present work provides abundant genomic information for functional dissection of the drought resistance of DXWR, and findings will further help the current understanding of the biological regulatory mechanisms of drought resistance in plants and facilitate the breeding of new drought-resistant rice cultivars. © 2017 The Author(s).

  3. Domain organization, genomic structure, evolution, and regulation of expression of the aggrecan gene family.

    PubMed

    Schwartz, N B; Pirok, E W; Mensch, J R; Domowicz, M S

    1999-01-01

    Proteoglycans are complex macromolecules, consisting of a polypeptide backbone to which are covalently attached one or more glycosaminoglycan chains. Molecular cloning has allowed identification of the genes encoding the core proteins of various proteoglycans, leading to a better understanding of the diversity of proteoglycan structure and function, as well as to the evolution of a classification of proteoglycans on the basis of emerging gene families that encode the different core proteins. One such family includes several proteoglycans that have been grouped with aggrecan, the large aggregating chondroitin sulfate proteoglycan of cartilage, based on a high number of sequence similarities within the N- and C-terminal domains. Thus far these proteoglycans include versican, neurocan, and brevican. It is now apparent that these proteins, as a group, are truly a gene family with shared structural motifs on the protein and nucleotide (mRNA) levels, and with nearly identical genomic organizations. Clearly a common ancestral origin is indicated for the members of the aggrecan family of proteoglycans. However, differing patterns of amplification and divergence have also occurred within certain exons across species and family members, leading to the class-characteristic protein motifs in the central carbohydrate-rich region exclusively. Thus the overall domain organization strongly suggests that sequence conservation in the terminal globular domains underlies common functions, whereas differences in the central portions of the genes account for functional specialization among the members of this gene family.

  4. Bacterial reference genes for gene expression studies by RT-qPCR: survey and analysis.

    PubMed

    Rocha, Danilo J P; Santos, Carolina S; Pacheco, Luis G C

    2015-09-01

    The appropriate choice of reference genes is essential for accurate normalization of gene expression data obtained by the method of reverse transcription quantitative real-time PCR (RT-qPCR). In 2009, a guideline called the Minimum Information for Publication of Quantitative Real-Time PCR Experiments (MIQE) highlighted the importance of the selection and validation of more than one suitable reference gene for obtaining reliable RT-qPCR results. Herein, we searched the recent literature in order to identify the bacterial reference genes that have been most commonly validated in gene expression studies by RT-qPCR (in the first 5 years following publication of the MIQE guidelines). Through a combination of different search parameters with the text mining tool MedlineRanker, we identified 145 unique bacterial genes that were recently tested as candidate reference genes. Of these, 45 genes were experimentally validated and, in most of the cases, their expression stabilities were verified using the software tools geNorm and NormFinder. It is noteworthy that only 10 of these reference genes had been validated in two or more of the studies evaluated. An enrichment analysis using Gene Ontology classifications demonstrated that genes belonging to the functional categories of DNA Replication (GO: 0006260) and Transcription (GO: 0006351) rendered a proportionally higher number of validated reference genes. Three genes in the former functional class were also among the top five most stable genes identified through an analysis of gene expression data obtained from the Pathosystems Resource Integration Center. These results may provide a guideline for the initial selection of candidate reference genes for RT-qPCR studies in several different bacterial species.

  5. Quantification of the Spatial Organization of the Nuclear Lamina as a Tool for Cell Classification

    PubMed Central

    Righolt, Christiaan H.; Zatreanu, Diana A.; Raz, Vered

    2013-01-01

    The nuclear lamina is the structural scaffold of the nuclear envelope that plays multiple regulatory roles in chromatin organization and gene expression as well as a structural role in nuclear stability. The lamina proteins, also referred to as lamins, determine nuclear lamina organization and define the nuclear shape and the structural integrity of the cell nucleus. In addition, lamins are connected with both nuclear and cytoplasmic structures forming a dynamic cellular structure whose shape changes upon external and internal signals. When bound to the nuclear lamina, the lamins are mobile, have an impact on the nuclear envelope structure, and may induce changes in their regulatory functions. Changes in the nuclear lamina shape cause changes in cellular functions. A quantitative description of these structural changes could provide an unbiased description of changes in cellular function. In this review, we describe how changes in the nuclear lamina can be measured from three-dimensional images of lamins at the nuclear envelope, and we discuss how structural changes of the nuclear lamina can be used for cell classification. PMID:27335676

  6. Quantification of the Spatial Organization of the Nuclear Lamina as a Tool for Cell Classification.

    PubMed

    Righolt, Christiaan H; Zatreanu, Diana A; Raz, Vered

    2013-01-01

    The nuclear lamina is the structural scaffold of the nuclear envelope that plays multiple regulatory roles in chromatin organization and gene expression as well as a structural role in nuclear stability. The lamina proteins, also referred to as lamins, determine nuclear lamina organization and define the nuclear shape and the structural integrity of the cell nucleus. In addition, lamins are connected with both nuclear and cytoplasmic structures forming a dynamic cellular structure whose shape changes upon external and internal signals. When bound to the nuclear lamina, the lamins are mobile, have an impact on the nuclear envelope structure, and may induce changes in their regulatory functions. Changes in the nuclear lamina shape cause changes in cellular functions. A quantitative description of these structural changes could provide an unbiased description of changes in cellular function. In this review, we describe how changes in the nuclear lamina can be measured from three-dimensional images of lamins at the nuclear envelope, and we discuss how structural changes of the nuclear lamina can be used for cell classification.

  7. Dystonia.

    PubMed

    Morgante, Francesca; Klein, Christine

    2013-10-01

    The purpose of this review is to provide an update on the classification, phenomenology, pathophysiology, and treatment of dystonia. A revised definition based on the main phenomenologic features of dystonia has recently been developed in an expert consensus approach. Classification is based on two main axes: clinical features and etiology. Currently, genes have been reported for 14 types of monogenic isolated and combined dystonia. Isolated dystonia (with dystonic tremor) can be caused by mutations in TOR1A (DYT1), TUBB4 (DYT4), THAP1 (DYT6), PRKRA (DYT16), CIZ1 (DYT23), ANO3 (DYT24), and GNAL (DYT25). Combined dystonias (with parkinsonism or myoclonus) are further subdivided into persistent (GCH1 [DYT5], SGCE [DYT11], and ATP1A3 [DYT12], with TAF1 most likely but not yet proven to be linked to DYT3) and paroxysmal (PNKD [DYT8], PRRT2 [DYT10], and SLC2A1 [DYT18]). Recent insights from neurophysiologic studies identified functional abnormalities in two networks in dystonia: the basal ganglia-sensorimotor network and, more recently, the cerebellothalamocortical pathway. Besides the well-known lack of inhibition at different CNS levels, dystonia is specifically characterized by maladaptive plasticity in the sensorimotor cortex and loss of cortical surround inhibition. The exact role (modulatory or compensatory) of the cerebellar-cortical pathways still has to be further elucidated. In addition to botulinum toxin for focal forms, deep brain stimulation of the globus pallidus internus is increasingly recognized as an effective treatment for generalized and segmental dystonia. The revised classification and identification of new genes for different forms of dystonia, including adult-onset segmental dystonia, enable an improved diagnostic approach. Recent pathophysiologic insights have fundamentally contributed to a better understanding of the disease mechanisms and impact on treatment, such as functional neurosurgery and nonpharmacologic treatment options.

  8. CnidBase: The Cnidarian Evolutionary Genomics Database

    PubMed Central

    Ryan, Joseph F.; Finnerty, John R.

    2003-01-01

    CnidBase, the Cnidarian Evolutionary Genomics Database, is a tool for investigating the evolutionary, developmental and ecological factors that affect gene expression and gene function in cnidarians. In turn, CnidBase will help to illuminate the role of specific genes in shaping cnidarian biodiversity in the present day and in the distant past. CnidBase highlights evolutionary changes between species within the phylum Cnidaria and structures genomic and expression data to facilitate comparisons to non-cnidarian metazoans. CnidBase aims to further the progress that has already been made in the realm of cnidarian evolutionary genomics by creating a central community resource which will help drive future research and facilitate more accurate classification and comparison of new experimental data with existing data. CnidBase is available at http://cnidbase.bu.edu/. PMID:12519972

  9. Parameters selection in gene selection using Gaussian kernel support vector machines by genetic algorithm.

    PubMed

    Mao, Yong; Zhou, Xiao-Bo; Pi, Dao-Ying; Sun, You-Xian; Wong, Stephen T C

    2005-10-01

    In microarray-based cancer classification, gene selection is an important issue owing to the large number of variables and small number of samples as well as its non-linearity. It is difficult to get satisfying results by using conventional linear statistical methods. Recursive feature elimination based on support vector machine (SVM RFE) is an effective algorithm for gene selection and cancer classification, which are integrated into a consistent framework. In this paper, we propose a new method for selecting the parameters of this algorithm when it is implemented with Gaussian kernel SVMs: as a better alternative to the common practice of picking the apparently best parameters, a genetic algorithm is used to search for an optimal pair of parameters. Fast implementation issues for this method are also discussed for pragmatic reasons. The proposed method was tested on two representative hereditary breast cancer and acute leukaemia datasets. The experimental results indicate that the proposed method performs well in selecting genes and achieves high classification accuracies with these genes.

  10. Phylogenetic classification and the universal tree.

    PubMed

    Doolittle, W F

    1999-06-25

    From comparative analyses of the nucleotide sequences of genes encoding ribosomal RNAs and several proteins, molecular phylogeneticists have constructed a "universal tree of life," taking it as the basis for a "natural" hierarchical classification of all living things. Although confidence in some of the tree's early branches has recently been shaken, new approaches could still resolve many methodological uncertainties. More challenging is evidence that most archaeal and bacterial genomes (and the inferred ancestral eukaryotic nuclear genome) contain genes from multiple sources. If "chimerism" or "lateral gene transfer" cannot be dismissed as trivial in extent or limited to special categories of genes, then no hierarchical universal classification can be taken as natural. Molecular phylogeneticists will have failed to find the "true tree," not because their methods are inadequate or because they have chosen the wrong genes, but because the history of life cannot properly be represented as a tree. However, taxonomies based on molecular sequences will remain indispensable, and understanding of the evolutionary process will ultimately be enriched, not impoverished.

  11. Functional genomic responses to cystic fibrosis transmembrane conductance regulator (CFTR) and CFTR(delta508) in the lung.

    PubMed

    Xu, Yan; Liu, Cong; Clark, Jean C; Whitsett, Jeffrey A

    2006-04-21

    Cystic fibrosis (CF), a common lethal pulmonary disorder in Caucasians, is caused by mutations in the cystic fibrosis transmembrane conductance regulator gene (CFTR) that disturbs fluid homeostasis and host defense in target organs. The effects of CFTR and delta508-CFTR were assessed in transgenic mice that 1) lack CFTR expression (Cftr-/-); 2) express the human delta508 CFTR (CFTR(delta508)); 3) overexpress the normal human CFTR (CFTR(tg)) in respiratory epithelial cells. Genes were selected from Affymetrix Murine Gene-Chips analysis and subjected to functional classification, k-means clustering, promoter cis-elements/modules searching, literature mining, and pathway exploring. Genomic responses to Cftr-/- were not corrected by expression of CFTR(delta508). Genes regulating host defense, inflammation, fluid and electrolyte transport were similarly altered in Cftr-/- and CFTR(delta508) mice. CFTR(delta508) induced a primary disturbance in expression of genes regulating redox and antioxidant systems. Genomic responses to CFTR(tg) were modest and were not associated with lung pathology. CFTR(tg) and CFTR(delta508) induced genes encoding heat shock proteins and other chaperones but did not activate the endoplasmic reticulum-associated degradation pathway. RNAs encoding proteins that directly interact with CFTR were identified in each of the CFTR mouse models, supporting the hypothesis that CFTR functions within a multiprotein complex whose members interact at the level of protein-protein interactions and gene expression. Promoters of genes influenced by CFTR shared common regulatory elements, suggesting that their co-expression may be mediated by shared regulatory mechanisms. Genes and pathways involved in the response to CFTR may be of interest as modifiers of CF.

  12. De novo Transcriptome Assembly of Chinese Kale and Global Expression Analysis of Genes Involved in Glucosinolate Metabolism in Multiple Tissues

    PubMed Central

    Wu, Shuanghua; Lei, Jianjun; Chen, Guoju; Chen, Hancai; Cao, Bihao; Chen, Changming

    2017-01-01

    Chinese kale, a vegetable of the cruciferous family, is a popular crop in southern China and Southeast Asia due to its high glucosinolate content and nutritional qualities. However, there is little research on the molecular genetics and genes involved in glucosinolate metabolism and its regulation in Chinese kale. In this study, we sequenced and characterized the transcriptomes and expression profiles of genes expressed in 11 tissues of Chinese kale. A total of 216 million 150-bp clean reads were generated using RNA-sequencing technology. From the sequences, 98,180 unigenes were assembled for the whole plant, and 49,582~98,423 unigenes were assembled for each tissue. Blast analysis indicated that a total of 80,688 (82.18%) unigenes exhibited similarity to known proteins. The functional annotation and classification tools used in this study suggested that the genes principally expressed in Chinese kale were mostly involved in fundamental processes, such as cellular and molecular functions, signal transduction, and biosynthesis of secondary metabolites. The expression levels of all unigenes were analyzed in various tissues of Chinese kale. A large number of candidate genes involved in glucosinolate metabolism and its regulation were identified, and the expression patterns of these genes were analyzed. We found that most of the genes involved in glucosinolate biosynthesis were highly expressed in the root, petiole, and in senescent leaves. The expression patterns of ten glucosinolate biosynthetic genes from RNA-seq were validated by quantitative RT-PCR in different tissues. These results provided an initial and global overview of Chinese kale gene functions and expression activities in different tissues. PMID:28228764

  13. Analysis of Aspergillus nidulans metabolism at the genome-scale

    PubMed Central

    David, Helga; Özçelik, İlknur Ş; Hofmann, Gerald; Nielsen, Jens

    2008-01-01

    Background: Aspergillus nidulans is a member of a diverse group of filamentous fungi, sharing many of the properties of its close relatives with significance in the fields of medicine, agriculture and industry. Furthermore, A. nidulans has been a classical model organism for studies of development biology and gene regulation, and thus it has become one of the best-characterized filamentous fungi. It was the first Aspergillus species to have its genome sequenced, and automated gene prediction tools predicted 9,451 open reading frames (ORFs) in the genome, of which less than 10% were assigned a function. Results: In this work, we have manually assigned functions to 472 orphan genes in the metabolism of A. nidulans, by using a pathway-driven approach and by employing comparative genomics tools based on sequence similarity. The central metabolism of A. nidulans, as well as biosynthetic pathways of relevant secondary metabolites, was reconstructed based on detailed metabolic reconstructions available for A. niger and Saccharomyces cerevisiae, and information on the genetics, biochemistry and physiology of A. nidulans. Thereby, it was possible to identify metabolic functions without a gene associated, and to look for candidate ORFs in the genome of A. nidulans by comparing its sequence to sequences of well-characterized genes in other species encoding the function of interest. A classification system, based on defined criteria, was developed for evaluating and selecting the ORFs among the candidates, in an objective and systematic manner. The functional assignments served as a basis to develop a mathematical model, linking 666 genes (both previously and newly annotated) to metabolic roles. The model was used to simulate metabolic behavior and additionally to integrate, analyze and interpret large-scale gene expression data concerning a study on glucose repression, thereby providing a means of upgrading the information content of experimental data and getting further insight into this phenomenon in A. nidulans. Conclusion: We demonstrate how pathway modeling of A. nidulans can be used as an approach to improve the functional annotation of the genome of this organism. Furthermore we show how the metabolic model establishes functional links between genes, enabling the upgrade of the information content of transcriptome data. PMID:18405346

  14. MicroRNA-integrated and network-embedded gene selection with diffusion distance.

    PubMed

    Huang, Di; Zhou, Xiaobo; Lyon, Christopher J; Hsueh, Willa A; Wong, Stephen T C

    2010-10-29

    Gene network information has been used to improve gene selection in microarray-based studies by selecting marker genes based both on their expression and the coordinate expression of genes within their gene network under a given condition. Here we propose a new network-embedded gene selection model. In this model, we first address the limitations of microarray data. Microarray data, although widely used for gene selection, measures only mRNA abundance, which does not always reflect the ultimate gene phenotype, since it does not account for post-transcriptional effects. To overcome this limitation, which is important (critical in certain cases) but ignored in almost all existing studies, we design a new strategy to integrate microarray data with information on microRNA, the major post-transcriptional regulatory factor. We also address the challenges posed by gene collaboration mechanisms. To incorporate the biological facts that genes without direct interactions may work closely together due to signal transduction and that two genes may be functionally connected through multiple paths, we adopt the concept of diffusion distance. This concept permits us to simulate biological signal propagation and therefore to estimate the collaboration probability for all gene pairs, directly or indirectly connected, according to the multiple paths connecting them. We demonstrate, using type 2 diabetes (DM2) as an example, that the proposed strategies can enhance the identification of functional gene partners, which is the key issue in a network-embedded gene selection model. More importantly, we show that our gene selection model outperforms related ones. Genes selected by our model 1) have improved classification capability; 2) agree with biological evidence of DM2-association; and 3) are involved in many well-known DM2-associated pathways.
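
    The diffusion-distance idea in the abstract above can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not give a formula, so the code assumes the commonly used random-walk definition (squared differences between t-step transition probabilities, weighted by the inverse stationary distribution), and the five-gene network, the gene labels and the step count t=3 are all made up for illustration.

```python
# Illustrative sketch only: diffusion distance on a toy, unweighted gene network.
# The network, gene labels and t=3 are hypothetical, and the formula is the
# standard random-walk definition, assumed here rather than taken from the paper.

def transition_matrix(adj):
    """Row-normalise an adjacency matrix into random-walk step probabilities."""
    return [[w / sum(row) for w in row] for row in adj]

def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def diffusion_distance(adj, i, j, t=3):
    """Distance between nodes i and j after t random-walk steps.

    Two nodes are close when walks started from them spread over the
    network in a similar way, so every connecting path contributes.
    """
    p = transition_matrix(adj)
    pt = p
    for _ in range(t - 1):
        pt = mat_mul(pt, p)
    total_degree = sum(sum(row) for row in adj)
    stationary = [sum(row) / total_degree for row in adj]
    return sum((pt[i][k] - pt[j][k]) ** 2 / stationary[k]
               for k in range(len(adj))) ** 0.5

genes = ["A", "B", "C", "D", "E"]
# Edges: A-B, A-C, B-C (a triangle), then a chain C-D, D-E.
adj = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
]

d_ab = diffusion_distance(adj, 0, 1)  # A and B share a neighbourhood
d_ae = diffusion_distance(adj, 0, 4)  # A and E sit at opposite ends
print(d_ab < d_ae)
```

    In this toy network, genes A and B come out closer than A and E, because walks started from A and B spread over nearly the same neighbourhood. How such distances would be turned into collaboration probabilities is specific to the paper and is not attempted here.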

  15. Genome-wide identification, phylogenetic classification, and exon-intron structure characterisation of the tubulin and actin genes in flax (Linum usitatissimum).

    PubMed

    Pydiura, Nikolay; Pirko, Yaroslav; Galinousky, Dmitry; Postovoitova, Anastasiia; Yemets, Alla; Kilchevsky, Aleksandr; Blume, Yaroslav

    2018-06-08

    Flax (Linum usitatissimum L.) is a valuable food and fiber crop cultivated for its quality fiber and seed oil. α-, β-, γ-tubulins and actins are the main structural proteins of the cytoskeleton. α- and γ-tubulin and actin genes have not been characterized yet in the flax genome. In this study, we have identified 6 α-tubulin genes, 13 β-tubulin genes, 2 γ-tubulin genes, and 15 actin genes in the flax genome and analysed the phylogenetic relationships between flax and A. thaliana tubulin and actin genes. Six α-tubulin genes are represented by 3 paralogous pairs; among 13 β-tubulin genes, 7 different isotypes can be distinguished, 6 of which are encoded by two paralogous genes each. γ-tubulin is represented by a paralogous pair of genes, one of which may not be functional. Fifteen actin genes represent 7 paralogous pairs - 7 actin isotypes and a sequentially duplicated copy of one of the genes of one of the isotypes. Exon-intron structure analysis has shown intron length polymorphism within the β-tubulin genes and intron number variation among the α-tubulin genes: 3 or 4 introns are found in two or four genes, respectively. Intron positioning occurs at conservative sites, as observed in numerous other plant species. Flax actin genes show both intron length polymorphisms and variation in the number of introns, which may be 2 or 3. These data will be useful to support further studies on the specificity, functioning, regulation and evolution of the flax cytoskeleton proteins.

  16. Nucleic and Amino Acid Sequences Support Structure-Based Viral Classification.

    PubMed

    Sinclair, Robert M; Ravantti, Janne J; Bamford, Dennis H

    2017-04-15

    Viral capsids ensure viral genome integrity by protecting the enclosed nucleic acids. Interactions between the genome and capsid and between individual capsid proteins (i.e., capsid architecture) are intimate and are expected to be characterized by strong evolutionary conservation. For this reason, a capsid structure-based viral classification has been proposed as a way to bring order to the viral universe. The seeming lack of sufficient sequence similarity to reproduce this classification has made it difficult to reject structural convergence as the basis for the classification. We reinvestigate whether the structure-based classification for viral coat proteins making icosahedral virus capsids is in fact supported by previously undetected sequence similarity. Since codon choices can influence nascent protein folding cotranslationally, we searched for both amino acid and nucleotide sequence similarity. To demonstrate the sensitivity of the approach, we identify a candidate gene for the pandoravirus capsid protein. We show that the structure-based classification is strongly supported by amino acid and also nucleotide sequence similarities, suggesting that the similarities are due to common descent. The correspondence between structure-based and sequence-based analyses of the same proteins shown here allows them to be used in future analyses of the relationship between linear sequence information and macromolecular function, as well as between linear sequence and protein folds. IMPORTANCE Viral capsids protect nucleic acid genomes, which in turn encode capsid proteins. This tight coupling of protein shell and nucleic acids, together with strong functional constraints on capsid protein folding and architecture, leads to the hypothesis that capsid protein-coding nucleotide sequences may retain signatures of ancient viral evolution. We have been able to show that this is indeed the case, using the major capsid proteins of viruses forming icosahedral capsids. Importantly, we detected similarity at the nucleotide level between capsid protein-coding regions from viruses infecting cells belonging to all three domains of life, reproducing a previously established structure-based classification of icosahedral viral capsids. Copyright © 2017 Sinclair et al.

  17. Nucleic and Amino Acid Sequences Support Structure-Based Viral Classification

    PubMed Central

    Sinclair, Robert M.; Ravantti, Janne J.

    2017-01-01

    Viral capsids ensure viral genome integrity by protecting the enclosed nucleic acids. Interactions between the genome and capsid and between individual capsid proteins (i.e., capsid architecture) are intimate and are expected to be characterized by strong evolutionary conservation. For this reason, a capsid structure-based viral classification has been proposed as a way to bring order to the viral universe. The seeming lack of sufficient sequence similarity to reproduce this classification has made it difficult to reject structural convergence as the basis for the classification. We reinvestigate whether the structure-based classification for viral coat proteins making icosahedral virus capsids is in fact supported by previously undetected sequence similarity. Since codon choices can influence nascent protein folding cotranslationally, we searched for both amino acid and nucleotide sequence similarity. To demonstrate the sensitivity of the approach, we identify a candidate gene for the pandoravirus capsid protein. We show that the structure-based classification is strongly supported by amino acid and also nucleotide sequence similarities, suggesting that the similarities are due to common descent. The correspondence between structure-based and sequence-based analyses of the same proteins shown here allows them to be used in future analyses of the relationship between linear sequence information and macromolecular function, as well as between linear sequence and protein folds. IMPORTANCE Viral capsids protect nucleic acid genomes, which in turn encode capsid proteins. This tight coupling of protein shell and nucleic acids, together with strong functional constraints on capsid protein folding and architecture, leads to the hypothesis that capsid protein-coding nucleotide sequences may retain signatures of ancient viral evolution. We have been able to show that this is indeed the case, using the major capsid proteins of viruses forming icosahedral capsids. Importantly, we detected similarity at the nucleotide level between capsid protein-coding regions from viruses infecting cells belonging to all three domains of life, reproducing a previously established structure-based classification of icosahedral viral capsids. PMID:28122979

  18. Differential gene expression in patients with subsyndromal symptomatic depression and major depressive disorder.

    PubMed

    Yang, Chengqing; Hu, Guoqin; Li, Zezhi; Wang, Qingzhong; Wang, Xuemei; Yuan, Chengmei; Wang, Zuowei; Hong, Wu; Lu, Weihong; Cao, Lan; Chen, Jun; Wang, Yong; Yu, Shunying; Zhou, Yimin; Yi, Zhenghui; Fang, Yiru

    2017-01-01

    Subsyndromal symptomatic depression (SSD) is a subtype of subthreshold depression and can lead to significant psychosocial functional impairment. Although the pathogenesis of major depressive disorder (MDD) and SSD remains poorly understood, a number of studies have found that many of the same genetic factors play important roles in the etiology of these two disorders. However, the differential gene expression between MDD and SSD is still unknown. In our previous study, we compared leukocyte expression profiles and performed classification using whole-genome cRNA microarrays among drug-free first-episode subjects with SSD, MDD and matched healthy controls (8 subjects in each group), and finally determined 48 gene expression signatures. Based on these findings, we further examined whether the mRNA of these genes was differentially expressed in peripheral blood in patients with SSD, patients with MDD and healthy controls (60 subjects in each group). Using quantitative real-time reverse transcription-polymerase chain reaction (RT-qPCR), we obtained relative gene expression levels for the three groups. We found that three of the forty-eight co-regulated genes, CD84, STRN and CTNS, were differentially expressed in peripheral blood among the three groups (F = 3.528, p = 0.034; F = 3.382, p = 0.039; F = 3.801, p = 0.026, respectively), while there were no significant differences for the other genes. CD84, STRN and CTNS may have significant value for diagnosis and for distinguishing among SSD, MDD and healthy controls.

  19. A multi-Poisson dynamic mixture model to cluster developmental patterns of gene expression by RNA-seq.

    PubMed

    Ye, Meixia; Wang, Zhong; Wang, Yaqun; Wu, Rongling

    2015-03-01

    Dynamic changes of gene expression reflect an intrinsic mechanism of how an organism responds to developmental and environmental signals. With the increasing availability of expression data across a time-space scale by RNA-seq, the classification of genes as per their biological function using RNA-seq data has become one of the most significant challenges in contemporary biology. Here we develop a clustering mixture model to discover distinct groups of genes expressed during a period of organ development. By integrating the density function of the multivariate Poisson distribution, the model accommodates the discrete property of read counts characteristic of RNA-seq data. The temporal dependence of gene expression is modeled by a first-order autoregressive process. The model is implemented with the Expectation-Maximization algorithm and model selection to determine the optimal number of gene clusters and obtain the estimates of Poisson parameters that describe the pattern of time-dependent expression of genes from each cluster. The model has been demonstrated by analyzing a real dataset from an experiment aimed at linking the pattern of gene expression to catkin development in white poplar. The usefulness of the model has been validated through computer simulation. The model provides a valuable tool for clustering RNA-seq data, facilitating our global view of expression dynamics and understanding of gene regulation mechanisms. © The Author 2014. Published by Oxford University Press. For Permissions, please email: journals.permissions@oup.com.
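
    As a rough illustration of the clustering approach described above, the following sketch fits a mixture of Poisson distributions to per-gene count profiles with the Expectation-Maximization algorithm. It is a deliberately simplified stand-in, not the published model: time points are treated as independent within a cluster (the first-order autoregressive dependence is omitted), the number of clusters is fixed instead of chosen by model selection, and the count data and starting rates are invented for the example.

```python
import math

# Simplified sketch: EM for a mixture of independent Poisson profiles.
# Assumptions (not from the paper): no autoregressive dependence between
# time points, a fixed number of clusters, and made-up toy counts.

def log_poisson(k, lam):
    """Log-probability of observing count k under a Poisson rate lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def em_poisson_mixture(counts, init_rates, n_iter=50):
    """Cluster count profiles; returns (weights, rates, hard labels)."""
    rates = [list(r) for r in init_rates]
    n_clusters = len(rates)
    n_times = len(counts[0])
    weights = [1.0 / n_clusters] * n_clusters
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each cluster for each gene,
        # computed in log space to avoid underflow.
        resp = []
        for gene in counts:
            logs = [math.log(weights[c])
                    + sum(log_poisson(k, lam) for k, lam in zip(gene, rates[c]))
                    for c in range(n_clusters)]
            top = max(logs)
            probs = [math.exp(v - top) for v in logs]
            total = sum(probs)
            resp.append([p / total for p in probs])
        # M-step: re-estimate mixing weights and per-time-point rates.
        for c in range(n_clusters):
            mass = sum(r[c] for r in resp) + 1e-12  # guard against empty clusters
            weights[c] = mass / len(counts)
            rates[c] = [max(sum(r[c] * g[t] for r, g in zip(resp, counts)) / mass,
                            1e-9)
                        for t in range(n_times)]
    labels = [max(range(n_clusters), key=lambda c: r[c]) for r in resp]
    return weights, rates, labels

# Toy read counts at three time points: three rising and three falling genes.
counts = [
    [2, 5, 9], [1, 6, 11], [3, 4, 10],
    [10, 5, 1], [12, 4, 2], [9, 6, 3],
]
# Hypothetical starting rates, chosen only to break the symmetry.
init_rates = [[3.0, 5.0, 7.0], [7.0, 5.0, 3.0]]

weights, rates, labels = em_poisson_mixture(counts, init_rates)
print(labels)
```

    On this toy input the rising and falling profiles separate into two clusters. A fuller treatment along the lines of the abstract would add the temporal dependence and compare models with different numbers of clusters.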

  20. Transporter taxonomy - a comparison of different transport protein classification schemes.

    PubMed

    Viereck, Michael; Gaulton, Anna; Digles, Daniela; Ecker, Gerhard F

    2014-06-01

    Currently, there are more than 800 well characterized human membrane transport proteins (including channels and transporters) and there are estimates that about 10% (approx. 2000) of all human genes are related to transport. Membrane transport proteins are of interest as potential drug targets, for drug delivery, and as a cause of side effects and drug–drug interactions. In light of the development of Open PHACTS, which provides an open pharmacological space, we analyzed selected membrane transport protein classification schemes (Transporter Classification Database, ChEMBL, IUPHAR/BPS Guide to Pharmacology, and Gene Ontology) for their ability to serve as a basis for pharmacology driven protein classification. A comparison of these membrane transport protein classification schemes by using a set of clinically relevant transporters as use-case reveals the strengths and weaknesses of the different taxonomy approaches.

  1. Bacterial community composition in the gut content of Lampetra japonica revealed by 16S rRNA gene pyrosequencing.

    PubMed

    Zuo, Yu; Xie, Wenfang; Pang, Yue; Li, Tiesong; Li, Qingwei; Li, Yingying

    2017-01-01

    The composition of the bacterial communities in the hindgut contents of Lampetra japonica was surveyed by Illumina MiSeq sequencing of the 16S rRNA gene. An average of 32385 optimized reads was obtained from three samples. The rarefaction curve based on the operational taxonomic units tended to approach the asymptote. The rank abundance curve representing the species richness and evenness was calculated. The microbial composition at six classification levels was also analyzed. The top 20 genera were displayed as a classification tree. The abundance of microorganisms in different individuals was displayed as pie charts at the branch nodes of the classification tree. The differences in abundance of the top 50 genera between individual lampreys are displayed as a heatmap. The pairwise comparison of bacterial taxa abundance revealed that there are no significant differences in gut microbiota among the three individual lampreys at a given rarefied depth. Also, the gut microbiota derived from L. japonica displays little similarity to that of other aquatic vertebrates after UPGMA analysis. The metabolic function of the bacterial communities was predicted through KEGG analysis. This study represents the first analysis of the bacterial community composition in the gut content of L. japonica. The investigation of the gut microbiota associated with L. japonica will broaden our understanding of this unique organism.

  2. Impact of genomics on the understanding of microbial evolution and classification: the importance of Darwin's views on classification.

    PubMed

    Gupta, Radhey S

    2016-07-01

    Analyses of genome sequences, by some approaches, suggest that the widespread occurrence of horizontal gene transfers (HGTs) in prokaryotes disguises their evolutionary relationships and have led to questioning of the Darwinian model of evolution for prokaryotes. These inferences are critically examined in the light of comparative genome analysis, characteristic synapomorphies, phylogenetic trees and Darwin's views on examining evolutionary relationships. Genome sequences are enabling discovery of numerous molecular markers (synapomorphies) such as conserved signature indels (CSIs) and conserved signature proteins (CSPs), which are distinctive characteristics of different prokaryotic taxa. Based on these molecular markers, exhibiting high degree of specificity and predictive ability, numerous prokaryotic taxa of different ranks, currently identified based on the 16S rRNA gene trees, can now be reliably demarcated in molecular terms. Within all studied groups, multiple CSIs and CSPs have been identified for successive nested clades providing reliable information regarding their hierarchical relationships and these inferences are not affected by HGTs. These results strongly support Darwin's views on evolution and classification and supplement the current phylogenetic framework based on 16S rRNA in important respects. The identified molecular markers provide important means for developing novel diagnostics, therapeutics and for functional studies providing important insights regarding prokaryotic taxa. © FEMS 2016. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.

  3. A comprehensive simulation study on classification of RNA-Seq data.

    PubMed

    Zararsız, Gökmen; Goksuluk, Dincer; Korkmaz, Selcuk; Eldem, Vahap; Zararsiz, Gozde Erturk; Duru, Izzet Parug; Ozturk, Ahmet

    2017-01-01

    RNA sequencing (RNA-Seq) is a powerful technique for the gene-expression profiling of organisms that uses the capabilities of next-generation sequencing technologies. Developing gene-expression-based classification algorithms is an emerging powerful method for diagnosis, disease classification and monitoring at molecular level, as well as providing potential markers of diseases. Most of the statistical methods proposed for the classification of gene-expression data are either based on a continuous scale (e.g., microarray data) or require a normal distribution assumption. Hence, these methods cannot be directly applied to RNA-Seq data since they violate both data structure and distributional assumptions. However, it is possible to apply these algorithms with appropriate modifications to RNA-Seq data. One way is to develop count-based classifiers, such as Poisson linear discriminant analysis (PLDA) and negative binomial linear discriminant analysis (NBLDA). Another way is to bring the data closer to microarrays and apply microarray-based classifiers. In this study, we compared several classifiers including PLDA with and without power transformation, NBLDA, single SVM, bagging SVM (bagSVM), classification and regression trees (CART), and random forests (RF). We also examined the effect of several parameters such as overdispersion, sample size, number of genes, number of classes, differential-expression rate, and the transformation method on model performances. A comprehensive simulation study is conducted and the results are compared with the results of two miRNA and two mRNA experimental datasets. The results revealed that increasing the sample size and differential-expression rate, and decreasing the dispersion parameter and number of groups, lead to an increase in classification accuracy. As in differential-expression studies, the classification of RNA-Seq data requires careful attention when handling data overdispersion. We conclude that, as a count-based classifier, the power transformed PLDA and, as a microarray-based classifier, vst or rlog transformed RF and SVM classifiers may be a good choice for classification. An R/BIOCONDUCTOR package, MLSeq, is freely available at https://www.bioconductor.org/packages/release/bioc/html/MLSeq.html.

  4. CARSVM: a class association rule-based classification framework and its application to gene expression data.

    PubMed

    Kianmehr, Keivan; Alhajj, Reda

    2008-09-01

    In this study, we aim at building a classification framework, namely the CARSVM model, which integrates association rule mining and support vector machine (SVM). The goal is to benefit from the advantages of both, namely the discriminative knowledge represented by class association rules and the classification power of the SVM algorithm, to construct an efficient and accurate classifier model that improves the interpretability problem of SVM as a traditional machine learning technique and overcomes the efficiency issues of associative classification algorithms. In our proposed framework, instead of using the original training set, a set of rule-based feature vectors, which are generated based on the discriminative ability of class association rules over the training samples, are presented to the learning component of the SVM algorithm. We show that rule-based feature vectors present a high-quality source of discrimination knowledge that can substantially impact the prediction power of SVM and associative classification techniques. They also provide users with more convenience in terms of understandability and interpretability. We have used four datasets from the UCI ML repository to evaluate the performance of the developed system in comparison with five well-known existing classification methods. Because of the importance and popularity of gene expression analysis as a real-world application of the classification model, we present an extension of CARSVM combined with feature selection to be applied to gene expression data. Then, we describe how this combination will provide biologists with an efficient and understandable classifier model. The reported test results and their biological interpretation demonstrate the applicability, efficiency and effectiveness of the proposed model. From the results, it can be concluded that a considerable increase in classification accuracy can be obtained when the rule-based feature vectors are integrated in the learning process of the SVM algorithm. In the context of applicability, according to the results obtained from gene expression analysis, we can conclude that the CARSVM system can be utilized in a variety of real-world applications with some adjustments.

  5. Impact of missing data imputation methods on gene expression clustering and classification.

    PubMed

    de Souto, Marcilio C P; Jaskowiak, Pablo A; Costa, Ivan G

    2015-02-26

    Several missing value imputation methods for gene expression data have been proposed in the literature. In the past few years, researchers have been putting a great deal of effort into presenting systematic evaluations of the different imputation algorithms. Initially, most algorithms were assessed with an emphasis on the accuracy of the imputation, using metrics such as the root mean squared error. However, it has become clear that the success of the estimation of the expression value should be evaluated in more practical terms as well. One can consider, for example, the ability of the method to preserve the significant genes in the dataset, or its discriminative/predictive power for classification/clustering purposes. We performed a broad analysis of the impact of five well-known missing value imputation methods on three clustering and four classification methods, in the context of 12 cancer gene expression datasets. We employed a statistical framework, for the first time in this field, to assess whether different imputation methods improve the performance of the clustering/classification methods. Our results suggest that the imputation methods evaluated have a minor impact on the classification and downstream clustering analyses. Simple methods such as replacing the missing values by mean or the median values performed as well as more complex strategies. The datasets analyzed in this study are available at http://costalab.org/Imputation/ .
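
    The simple baselines named in this abstract (replacing missing values by the mean or the median) can be sketched as follows. This is only an illustration under stated assumptions: missing entries are encoded as `None`, each row is taken to be one gene, and the function name `impute` is hypothetical rather than anything used by the authors.

```python
from statistics import mean, median

def impute(matrix, strategy="mean"):
    """Fill missing entries (None) in each row with that row's mean or median.

    matrix:   list of rows; each row holds the expression values of one gene.
    strategy: "mean" or "median".
    """
    if strategy not in ("mean", "median"):
        raise ValueError("strategy must be 'mean' or 'median'")
    fill = mean if strategy == "mean" else median
    completed = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        if not observed:
            raise ValueError("cannot impute a row with no observed values")
        value = fill(observed)
        completed.append([value if v is None else v for v in row])
    return completed

# Toy matrix: two genes measured in four samples, with gaps.
expression = [
    [1.0, None, 3.0, 2.0],
    [1.0, 2.0, 9.0, None],
]
filled_mean = impute(expression, "mean")
filled_median = impute(expression, "median")
```

    The median variant is less sensitive to the outlying value 9.0 in the second row, which is one reason both baselines are usually reported side by side.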

  6. Toxicogenomics in the 3T3-L1 cell line, a new approach for screening of obesogenic compounds.

    PubMed

    Pereira-Fernandes, Anna; Vanparys, Caroline; Vergauwen, Lucia; Knapen, Dries; Jorens, Philippe Germaines; Blust, Ronny

    2014-08-01

    The obesogen hypothesis states that together with an energy imbalance between calories consumed and calories expended, exposure to environmental compounds early in life or throughout lifetime might have an influence on obesity development. In this work, we propose a new approach for obesogen screening, i.e., the use of transcriptomics in the 3T3-L1 pre-adipocyte cell line. Based on the data from a previous study of our group using a lipid accumulation based adipocyte differentiation assay, several human-relevant obesogenic compounds were selected: reference obesogens (Rosiglitazone, Tributyltin), test obesogens (Butylbenzyl phthalate, butylparaben, propylparaben, Bisphenol A), and non-obesogens (Ethylene Brassylate, Bis (2-ethylhexyl)phthalate). The high stability and reproducibility of the 3T3-L1 gene transcription patterns over different experiments and cell batches is demonstrated by this study. Obesogens and non-obesogen gene transcription profiles were clearly distinguished using hierarchical clustering. Furthermore, a gradual distinction corresponding to differences in induction of lipid accumulation could be made between test and reference obesogens based on transcription patterns, indicating the potential use of this strategy for classification of obesogens. Marker genes that are able to distinguish between non, test, and reference obesogens were identified. Well-known genes involved in adipocyte differentiation as well as genes with unknown functions were selected, implying a potential adipocyte-related function of the latter. Cell-physiological lipid accumulation was well estimated based on transcription levels of the marker genes, indicating the biological relevance of omics data. In conclusion, this study shows the high relevance and reproducibility of this 3T3-L1 based in vitro toxicogenomics tool for classification of obesogens and biomarker discovery. 
Although the results presented here are promising, further confirmation of the predictive value of the set of candidate biomarkers identified as well as the validation of their clinical role will be needed. © The Author 2014. Published by Oxford University Press on behalf of the Society of Toxicology.

  7. Differential prioritization between relevance and redundancy in correlation-based feature selection techniques for multiclass gene expression data.

    PubMed

    Ooi, Chia Huey; Chetty, Madhu; Teng, Shyh Wei

    2006-06-23

    Due to the large number of genes in a typical microarray dataset, feature selection looks set to play an important role in reducing noise and computational cost in gene expression-based tissue classification while improving accuracy at the same time. Surprisingly, this does not appear to be the case for all multiclass microarray datasets. The reason is that many feature selection techniques applied on microarray datasets are either rank-based and hence do not take into account correlations between genes, or are wrapper-based, which require high computational cost, and often yield difficult-to-reproduce results. In studies where correlations between genes are considered, attempts to establish the merit of the proposed techniques are hampered by evaluation procedures which are less than meticulous, resulting in overly optimistic estimates of accuracy. We present two realistically evaluated correlation-based feature selection techniques which incorporate, in addition to the two existing criteria involved in forming a predictor set (relevance and redundancy), a third criterion called the degree of differential prioritization (DDP). DDP functions as a parameter to strike the balance between relevance and redundancy, providing our techniques with the novel ability to differentially prioritize the optimization of relevance against redundancy (and vice versa). This ability proves useful in producing optimal classification accuracy while using reasonably small predictor set sizes for nine well-known multiclass microarray datasets. For multiclass microarray datasets, especially the GCM and NCI60 datasets, DDP enables our filter-based techniques to produce accuracies better than those reported in previous studies which employed similarly realistic evaluation procedures.

  8. Classification of rice (Oryza sativa L. Japonica nipponbare) immunophilins (FKBPs, CYPs) and expression patterns under water stress.

    PubMed

    Ahn, Jun Cheul; Kim, Dae-Won; You, Young Nim; Seok, Min Sook; Park, Jeong Mee; Hwang, Hyunsik; Kim, Beom-Gi; Luan, Sheng; Park, Hong-Seog; Cho, Hye Sun

    2010-11-18

    FK506 binding proteins (FKBPs) and cyclophilins (CYPs) are abundant and ubiquitous proteins belonging to the peptidyl-prolyl cis/trans isomerase (PPIase) superfamily, which regulate much of metabolism through a chaperone or an isomerization of proline residues during protein folding. They are collectively referred to as immunophilins (IMMs), being present in almost all cellular organs. In particular, a number of IMMs relate to environmental stresses. FKBP and CYP proteins in rice (Oryza sativa cv. Japonica) were identified and classified, and each IMM was given an appropriate name, considering the ortholog relation with Arabidopsis and Chlamydomonas or the molecular weight of the proteins. 29 FKBP and 27 CYP genes can putatively be identified in rice; among them, a number of genes can be putatively classified as orthologs of Arabidopsis IMMs. However, some genes were novel and did not match those of Arabidopsis and Chlamydomonas, and several genes were paralogs arising from gene duplication. Among the 56 IMMs in rice, a significant number are regulated by salt and/or desiccation stress. In addition, their expression levels in response to water stress have been analyzed in different tissues, and the subcellular locations of some IMMs were determined by tagging with GFP protein. Like other green photosynthetic organisms such as Arabidopsis (23 FKBPs and 29 CYPs) and Chlamydomonas (23 FKBs and 26 CYNs), rice has the highest number of IMM genes among organisms reported so far, suggesting that the numbers relate closely to photosynthesis. Classification of the putative FKBPs and CYPs in rice provides information about their evolutionary/functional significance when comparisons are drawn with the relatively well studied genera, Arabidopsis and Chlamydomonas. In addition, many of the genes upregulated by water stress offer the possibility of manipulating the stress responses in rice.

  9. Identification of Disease Critical Genes Using Collective Meta-heuristic Approaches: An Application to Preeclampsia.

    PubMed

    Biswas, Surama; Dutta, Subarna; Acharyya, Sriyankar

    2017-12-01

    Identifying a small subset of disease critical genes out of a large volume of microarray gene expression data is a challenge in computational life sciences. This paper has applied four meta-heuristic algorithms, namely, honey bee mating optimization (HBMO), harmony search (HS), differential evolution (DE) and genetic algorithm (basic version GA) to find disease critical genes of preeclampsia, which affects women during gestation. Two hybrid algorithms, namely, HBMO-kNN and HS-kNN, have been newly proposed here, where kNN (k nearest neighbor classifier) is used for sample classification. Performances of these new approaches have been compared with two other hybrid algorithms, namely, DE-kNN and SGA-kNN. Three datasets of different sizes have been used. In a dataset, the set of genes found common in the output of each algorithm is considered here as disease critical genes. In different datasets, the percentage of classification or classification accuracy of meta-heuristic algorithms varied between 92.46 and 100%. HBMO-kNN has the best performance (99.64-100%) in almost all data sets. DE-kNN secures the second position (99.42-100%). Disease critical genes obtained here match with clinically revealed preeclampsia genes to a large extent.
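
    In hybrid schemes of this kind, the meta-heuristic proposes a candidate gene subset and a kNN classifier scores it. The sketch below shows only that scoring step, assuming leave-one-out accuracy as the fitness measure and Euclidean distance; both are assumptions, since the abstract does not state them, and the names `knn_predict` and `subset_fitness` are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training samples closest to the query."""
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def subset_fitness(samples, labels, gene_idx, k=3):
    """Leave-one-out kNN accuracy using only the genes listed in gene_idx."""
    reduced = [[row[j] for j in gene_idx] for row in samples]
    hits = 0
    for i, query in enumerate(reduced):
        # Hold out sample i and classify it from the remaining samples.
        train_x = reduced[:i] + reduced[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if knn_predict(train_x, train_y, query, k) == labels[i]:
            hits += 1
    return hits / len(reduced)

# Toy data: gene 0 separates the two groups, gene 1 is uninformative.
samples = [[1.0, 5.0], [1.2, 1.0], [0.9, 9.0],
           [5.0, 5.2], [5.1, 0.8], [4.8, 9.1]]
labels = ["ctl", "ctl", "ctl", "pe", "pe", "pe"]
informative = subset_fitness(samples, labels, [0], k=1)
noisy = subset_fitness(samples, labels, [1], k=1)
```

    A search procedure such as HBMO or HS would call `subset_fitness` repeatedly on candidate index lists and keep the subsets with the highest score.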

  10. tRNAscan-SE On-line: integrating search and context for analysis of transfer RNA genes.

    PubMed

    Lowe, Todd M; Chan, Patricia P

    2016-07-08

    High-throughput genome sequencing continues to grow the need for rapid, accurate genome annotation and tRNA genes constitute the largest family of essential, ever-present non-coding RNA genes. Newly developed tRNAscan-SE 2.0 has advanced the state-of-the-art methodology in tRNA gene detection and functional prediction, captured by rich new content of the companion Genomic tRNA Database. Previously, web-server tRNA detection was isolated from knowledge of existing tRNAs and their annotation. In this update of the tRNAscan-SE On-line resource, we tie together improvements in tRNA classification with greatly enhanced biological context via dynamically generated links between web server search results, the most relevant genes in the GtRNAdb and interactive, rich genome context provided by UCSC genome browsers. The tRNAscan-SE On-line web server can be accessed at http://trna.ucsc.edu/tRNAscan-SE/. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.

  11. Inter-Relationships of Functional Status in Cerebral Palsy: Analyzing Gross Motor Function, Manual Ability, and Communication Function Classification Systems in Children

    ERIC Educational Resources Information Center

    Hidecker, Mary Jo Cooley; Ho, Nhan Thi; Dodge, Nancy; Hurvitz, Edward A.; Slaughter, Jaime; Workinger, Marilyn Seif; Kent, Ray D.; Rosenbaum, Peter; Lenski, Madeleine; Messaros, Bridget M.; Vanderbeek, Suzette B.; Deroos, Steven; Paneth, Nigel

    2012-01-01

    Aim: To investigate the relationships among the Gross Motor Function Classification System (GMFCS), Manual Ability Classification System (MACS), and Communication Function Classification System (CFCS) in children with cerebral palsy (CP). Method: Using questionnaires describing each scale, mothers reported GMFCS, MACS, and CFCS levels in 222…

  12. Regulatory Role of Circular RNAs and Neurological Disorders.

    PubMed

    Floris, Gabriele; Zhang, Longbin; Follesa, Paolo; Sun, Tao

    2017-09-01

    Circular RNAs (circRNAs) are a class of long noncoding RNAs that are characterized by the presence of covalently linked ends and have been found in all life kingdoms. Exciting studies in regulatory roles of circRNAs are emerging. Here, we summarize classification, characteristics, biogenesis, and regulatory functions of circRNAs. CircRNAs are found to be preferentially expressed along neural genes and in neural tissues. We thus highlight the association of circRNA dysregulation with neurodegenerative diseases such as Alzheimer's disease. Investigation of regulatory role of circRNAs will shed novel light in gene expression mechanisms during development and under disease conditions and may identify circRNAs as new biomarkers for aging and neurodegenerative disorders.

  13. Near-isogenic cotton germplasm lines that differ in fiber-bundle strength have temporal differences in fiber gene expression patterns as revealed by comparative high-throughput profiling.

    PubMed

    Hinchliffe, Doug J; Meredith, William R; Yeater, Kathleen M; Kim, Hee Jin; Woodward, Andrew W; Chen, Z Jeffrey; Triplett, Barbara A

    2010-05-01

    Gene expression profiles of developing cotton (Gossypium hirsutum L.) fibers from two near-isogenic lines (NILs) that differ in fiber-bundle strength, short-fiber content, and in fewer than two genetic loci were compared using an oligonucleotide microarray. Fiber gene expression was compared at five time points spanning fiber elongation and secondary cell wall (SCW) biosynthesis. Fiber samples were collected from field plots in a randomized, complete block design, with three spatially distinct biological replications for each NIL at each time point. Microarray hybridizations were performed in a loop experimental design that allowed comparisons of fiber gene expression profiles as a function of time between the two NILs. Overall, developmental expression patterns revealed by the microarray experiment agreed with previously reported cotton fiber gene expression patterns for specific genes. Additionally, genes expressed coordinately with the onset of SCW biosynthesis in cotton fiber correlated with gene expression patterns of other SCW-producing plant tissues. Functional classification and enrichment analysis of differentially expressed genes between the two NILs revealed that genes associated with SCW biosynthesis were significantly up-regulated in fibers of the high-fiber quality line at the transition stage of cotton fiber development. For independent corroboration of the microarray results, 15 genes were selected for quantitative reverse transcription PCR analysis of fiber gene expression. These analyses, conducted over multiple field years, confirmed the temporal difference in fiber gene expression between the two NILs. We hypothesize that the loci conferring temporal differences in fiber gene expression between the NILs are important regulatory sequences that offer the potential for more targeted manipulation of cotton fiber quality.

  14. MiT family translocation renal cell carcinoma.

    PubMed

    Argani, Pedram

    2015-03-01

    The MiT subfamily of transcription factors includes TFE3, TFEB, TFC, and MiTF. Gene fusions involving two of these transcription factors have been identified in renal cell carcinoma (RCC). The Xp11 translocation RCCs were first officially recognized in the 2004 WHO renal tumor classification, and harbor gene fusions involving TFE3. The t(6;11) RCCs harbor a specific Alpha-TFEB gene fusion and were first officially recognized in the 2013 International Society of Urologic Pathology (ISUP) Vancouver classification of renal neoplasia. These two subtypes of translocation RCC have many similarities. Both were initially described in and disproportionately involve young patients, though adult translocation RCC may overall outnumber pediatric cases. Both often have unusual and distinctive morphologies; the Xp11 translocation RCCs frequently have clear cells with papillary architecture and abundant psammomatous bodies, while the t(6;11) RCCs frequently have a biphasic appearance with both large and small epithelioid cells and nodules of basement membrane material. However, the morphology of these two neoplasms can overlap, with one mimicking the other. Both of these RCCs underexpress epithelial immunohistochemical markers like cytokeratin and epithelial membrane antigen (EMA) relative to most other RCCs. Unlike other RCCs, both frequently express the cysteine protease cathepsin k and often express melanocytic markers like HMB45 and Melan A. Finally, TFE3 and TFEB have overlapping functional activity as these two transcription factors frequently heterodimerize and bind to the same targets. Therefore, on the basis of clinical, morphologic, immunohistochemical, and genetic similarities, the 2013 ISUP Vancouver classification of renal neoplasia grouped these two neoplasms together under the heading of "MiT family translocation RCC." This review summarizes our current knowledge of these recently described RCCs. Copyright © 2015 Elsevier Inc. All rights reserved.

  15. Early vertebrate origin and diversification of small transmembrane regulators of cellular ion transport.

    PubMed

    Pirkmajer, Sergej; Kirchner, Henriette; Lundell, Leonidas S; Zelenin, Pavel V; Zierath, Juleen R; Makarova, Kira S; Wolf, Yuri I; Chibalin, Alexander V

    2017-07-15

    Small transmembrane proteins such as FXYDs, which interact with Na+,K+-ATPase, and the micropeptides that interact with sarco/endoplasmic reticulum Ca2+-ATPase play fundamental roles in regulation of ion transport in vertebrates. Uncertain evolutionary origins and phylogenetic relationships among these regulators of ion transport have led to inconsistencies in their classification across vertebrate species, thus hampering comparative studies of their functions. We discovered the first FXYD homologue in sea lamprey, a basal jawless vertebrate, which suggests small transmembrane regulators of ion transport emerged early in the vertebrate lineage. We also identified 13 gene subfamilies of FXYDs and propose a revised, phylogeny-based FXYD classification that is consistent across vertebrate species. These findings provide an improved framework for investigating physiological and pathophysiological functions of small transmembrane regulators of ion transport. Small transmembrane proteins are important for regulation of cellular ion transport. The most prominent among these are members of the FXYD family (FXYD1-12), which regulate Na+,K+-ATPase, and phospholamban, sarcolipin, myoregulin and DWORF, which regulate the sarco/endoplasmic reticulum Ca2+-ATPase (SERCA). FXYDs and regulators of SERCA are present in fishes, as well as terrestrial vertebrates; however, their evolutionary origins and phylogenetic relationships are obscure, thus hampering comparative physiological studies. Here we discovered that sea lamprey (Petromyzon marinus), a representative of extant jawless vertebrates (Cyclostomata), expresses an FXYD homologue, which strongly suggests that FXYDs predate the emergence of fishes and other jawed vertebrates (Gnathostomata). Using a combination of sequence-based phylogenetic analysis and conservation of local chromosome context, we determined that FXYDs markedly diversified in the lineages leading to cartilaginous fishes (Chondrichthyes) and bony vertebrates (Euteleostomi). Diversification of SERCA regulators was much less extensive, indicating they operate under different evolutionary constraints. Finally, we found that FXYDs in extant vertebrates can be classified into 13 gene subfamilies, which do not always correspond to the established FXYD classification. We therefore propose a revised classification that is based on evolutionary history of FXYDs and that is consistent across vertebrate species. Collectively, our findings provide an improved framework for investigating the function of ion transport in health and disease. © 2017 The Authors. The Journal of Physiology © 2017 The Physiological Society.

  16. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Wu, Hong; Zeng, Hong; Lam, Robert

    The crystal structure of the human MLH1 N-terminus is reported at 2.30 Å resolution. The overall structure is described along with an analysis of two clinically important mutations. Mismatch repair prevents the accumulation of erroneous insertions/deletions and non-Watson–Crick base pairs in the genome. Pathogenic mutations in the MLH1 gene are associated with a predisposition to Lynch and Turcot’s syndromes. Although genetic testing for these mutations is available, robust classification of variants requires strong clinical and functional support. Here, the first structure of the N-terminus of human MLH1, determined by X-ray crystallography, is described. The structure shares a high degree of similarity with previously determined prokaryotic MLH1 homologs; however, this structure affords a more accurate platform for the classification of MLH1 variants.

  17. Advances in the understanding of headache.

    PubMed

    Goadsby, Peter J

    2005-01-01

    Primary headache disorders account for a substantial part of the morbidity seen in medical practice and so advances in their understanding and management are of general importance. The classification of headache disorders has recently been revised, and the importance of frequent migraine, chronic (transformed) migraine and some important, albeit rarer, conditions that were previously not included has been recognized. Identification of the first genes for a migraine syndrome, namely familial hemiplegic migraine, and their classification as channelopathies opens up new understanding of these disorders and their possible pathophysiology. Functional brain imaging of migraine and cluster headache has placed the pathophysiology of these disorders firmly and clearly in the brain. As our understanding of migraine and related syndromes has increased, new therapies have been developed which reduce the significant disability associated with these important neurological disorders.

  18. [Diversity and antimicrobial activities of cultivable bacteria isolated from Jiaozhou Bay].

    PubMed

    Wang, Yiting; Zhang, Chuanbo; Qi, Lin; Jia, Xiaoqiang; Lu, Wenyu

    2016-12-04

    Marine microorganisms have great potential for producing biologically active secondary metabolites. To study their diversity and antimicrobial activity, we examined 9 sediment samples from different observation sites in Jiaozhou Bay. We used YPD and Z2216E culture media to isolate bacteria from the sediments; the 16S rRNA gene was sequenced for classification and identification of the isolates. We then used the Oxford cup method to detect antimicrobial activities of the isolated bacteria against 7 test strains. Lastly, we selected 16 representative strains and screened them for the secondary-metabolite biosynthesis genes PKSI, NRPS, CYP, PhzE and dTGD by specific PCR amplification. A total of 76 bacterial strains were isolated from Jiaozhou Bay. According to 16S rRNA gene sequence analysis, these strains could be sorted into 11 genera belonging to 8 different families: Aneurinibacillus, Brevibacillus, Microbacterium, Oceanisphae, Bacillus, Marinomonas, Staphylococcus, Kocuria, Arthrobacter, Micrococcus and Pseudoalteromonas. Of these, 34 strains showed antimicrobial activity against at least one of the test strains. All 16 representative strains had at least one of the functional genes, and 5 strains possessed more than three. The Jiaozhou Bay area is rich in microbial resources with potential to provide useful secondary metabolites.

  19. Generation and Analysis of Expressed Sequence Tags (ESTs) from Halophyte Atriplex canescens to Explore Salt-Responsive Related Genes

    PubMed Central

    Li, Jingtao; Sun, Xinhua; Yu, Gang; Jia, Chengguo; Liu, Jinliang; Pan, Hongyu

    2014-01-01

    Little information is available on gene expression profiling of halophyte A. canescens. To elucidate the molecular mechanism for stress tolerance in A. canescens, a full-length complementary DNA library was generated from A. canescens exposed to 400 mM NaCl, and provided 343 high-quality ESTs. In an evaluation of 343 valid EST sequences in the cDNA library, 197 unigenes were assembled, among which 190 unigenes (83.1% ESTs) were identified according to their significant similarities with proteins of known functions. All the 343 EST sequences have been deposited in the dbEST GenBank under accession numbers JZ535802 to JZ536144. According to Arabidopsis MIPS functional category and GO classifications, we identified 193 unigenes of the 311 annotations EST, representing 72 non-redundant unigenes sharing similarities with genes related to the defense response. The sets of ESTs obtained provide a rich genetic resource and 17 up-regulated genes related to salt stress resistance were identified by qRT-PCR. Six of these genes may contribute crucially to earlier and later stage salt stress resistance. Additionally, among the 343 unigenes sequences, 22 simple sequence repeats (SSRs) were also identified contributing to the study of A. canescens resources. PMID:24960361

  20. PlantTribes: a gene and gene family resource for comparative genomics in plants

    PubMed Central

    Wall, P. Kerr; Leebens-Mack, Jim; Müller, Kai F.; Field, Dawn; Altman, Naomi S.; dePamphilis, Claude W.

    2008-01-01

    The PlantTribes database (http://fgp.huck.psu.edu/tribe.html) is a plant gene family database based on the inferred proteomes of five sequenced plant species: Arabidopsis thaliana, Carica papaya, Medicago truncatula, Oryza sativa and Populus trichocarpa. We used the graph-based clustering algorithm MCL [Van Dongen (Technical Report INS-R0010 2000) and Enright et al. (Nucleic Acids Res. 2002; 30: 1575–1584)] to classify all of these species’ protein-coding genes into putative gene families, called tribes, using three clustering stringencies (low, medium and high). For all tribes, we have generated protein and DNA alignments and maximum-likelihood phylogenetic trees. A parallel database of microarray experimental results is linked to the genes, which lets researchers identify groups of related genes and their expression patterns. Unified nomenclatures were developed, and tribes can be related to traditional gene families and conserved domain identifiers. SuperTribes, constructed through a second iteration of MCL clustering, connect distant, but potentially related gene clusters. The global classification of nearly 200 000 plant proteins was used as a scaffold for sorting ∼4 million additional cDNA sequences from over 200 plant species. All data and analyses are accessible through a flexible interface allowing users to explore the classification, to place query sequences within the classification, and to download results for further study. PMID:18073194
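
    The graph-based clustering step described here (MCL) alternates expansion and inflation of a column-stochastic matrix. The sketch below is a small pure-Python illustration of that loop, not the MCL program used by PlantTribes. The toy adjacency matrix, the inflation value and the threshold used to read clusters off the converged matrix are all assumptions; in practice the edge weights would come from sequence similarity scores.

```python
def normalize_columns(matrix):
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] for j in range(n)] for i in range(n)]

def mcl(adjacency, inflation=2.0, max_iter=100, tol=1e-9):
    """Markov clustering on a symmetric, non-negative adjacency matrix."""
    n = len(adjacency)
    # Self-loops keep every column sum positive and stabilise the iteration.
    m = normalize_columns(
        [[float(adjacency[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    )
    for _ in range(max_iter):
        # Expansion: one squaring step spreads flow along longer paths.
        expanded = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
                    for i in range(n)]
        # Inflation: element-wise power strengthens strong flows, then rescale.
        inflated = normalize_columns(
            [[v ** inflation for v in row] for row in expanded]
        )
        delta = max(abs(inflated[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:
            break
    # Each row with remaining flow lists the members of one cluster.
    clusters = {frozenset(j for j in range(n) if m[i][j] > 1e-6)
                for i in range(n)}
    clusters.discard(frozenset())
    return sorted(sorted(c) for c in clusters)

# Toy graph: two separate triangles of mutually similar proteins.
toy_graph = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
tribes = mcl(toy_graph)
```

    Raising the `inflation` parameter generally yields finer clusters, which is one way to think about the low, medium and high stringencies mentioned in the abstract.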

  1. Genomewide analysis of TCP transcription factor gene family in Malus domestica.

    PubMed

    Xu, Ruirui; Sun, Peng; Jia, Fengjuan; Lu, Longtao; Li, Yuanyuan; Zhang, Shizhong; Huang, Jinguang

    2014-12-01

    Teosinte branched 1/cycloidea/proliferating cell factor 1 (TCP) proteins are a large family of transcriptional regulators in angiosperms. They are involved in various biological processes, including development and plant metabolism pathways. In this study, a total of 52 TCP genes were identified in the apple (Malus domestica) genome. Bioinformatic methods were employed to predict and analyse their relevant gene classification, gene structure, chromosome location, sequence alignment and conserved domains of MdTCP proteins. Expression analysis from microarray data showed that the expression levels of 28 and 51 MdTCP genes changed during the ripening and rootstock-scion interaction processes, respectively. The expression patterns of 12 selected MdTCP genes were analysed in different tissues and in response to abiotic stresses. All of the selected genes were detected in at least one of the tissues tested, and most of them were modulated by adverse treatments, indicating that the MdTCPs were involved in various developmental and physiological processes. To the best of our knowledge, this is the first genomewide analysis of the apple TCP gene family. These results provide valuable information for studies on functions of the TCP transcription factor genes in apple.

  2. Expression profiling in canine osteosarcoma: identification of biomarkers and pathways associated with outcome

    PubMed Central

    2010-01-01

    Background Osteosarcoma (OSA) spontaneously arises in the appendicular skeleton of large breed dogs and shares many physiological and molecular biological characteristics with human OSA. The standard treatment for OSA in both species is amputation or limb-sparing surgery, followed by chemotherapy. Unfortunately, OSA is an aggressive cancer with a high metastatic rate. Characterization of OSA with regard to its metastatic potential and chemotherapeutic resistance will improve both prognostic capabilities and treatment modalities. Methods We analyzed archived primary OSA tissue from dogs treated with limb amputation followed by doxorubicin or platinum-based drug chemotherapy. Samples were selected from two groups: dogs with disease free intervals (DFI) of less than 100 days (n = 8) and greater than 300 days (n = 7). Gene expression was assessed with Affymetrix Canine 2.0 microarrays and analyzed with a two-tailed t-test. A subset of genes was confirmed using qRT-PCR and used in classification analysis to predict prognosis. Systems-based gene ontology analysis was conducted on genes selected using a standard J5 metric. The genes identified using this approach were converted to their human homologues and assigned to functional pathways using the GeneGo MetaCore platform. Results Potential biomarkers were identified using gene expression microarray analysis and 11 differentially expressed (p < 0.05) genes were validated with qRT-PCR (n = 10/group). Statistical classification models using the qRT-PCR profiles predicted patient outcomes with 100% accuracy in the training set and up to 90% accuracy upon stratified cross validation. Pathway analysis revealed alterations in pathways associated with oxidative phosphorylation, hedgehog and parathyroid hormone signaling, cAMP/Protein Kinase A (PKA) signaling, immune responses, cytoskeletal remodeling and focal adhesion. 
Conclusions This profiling study has identified potential new biomarkers to predict patient outcome in OSA and new pathways that may be targeted for therapeutic intervention. PMID:20860831

  3. Structural and Functional Insights from the Metagenome of an Acidic Hot Spring Microbial Planktonic Community in the Colombian Andes

    PubMed Central

    Jiménez, Diego Javier; Andreote, Fernando Dini; Chaves, Diego; Montaña, José Salvador; Osorio-Forero, Cesar; Junca, Howard; Zambrano, María Mercedes; Baena, Sandra

    2012-01-01

    A taxonomic and annotated functional description of microbial life was deduced from 53 Mb of metagenomic sequence retrieved from a planktonic fraction of the Neotropical high Andean (3,973 meters above sea level) acidic hot spring El Coquito (EC). A classification of unassembled metagenomic reads using different databases showed a high proportion of Gammaproteobacteria and Alphaproteobacteria (in total read affiliation), and through taxonomic affiliation of 16S rRNA gene fragments we observed the presence of Proteobacteria, micro-algae chloroplast and Firmicutes. Reads mapped against the genomes Acidiphilium cryptum JF-5, Legionella pneumophila str. Corby and Acidithiobacillus caldus revealed the presence of transposase-like sequences, potentially involved in horizontal gene transfer. Functional annotation and hierarchical comparison with different datasets obtained by pyrosequencing in different ecosystems showed that the microbial community also contained extensive DNA repair systems, possibly to cope with ultraviolet radiation at such high altitudes. Analysis of genes involved in the nitrogen cycle indicated the presence of dissimilatory nitrate reduction to N2 (narGHI, nirS, norBCDQ and nosZ), associated with Proteobacteria-like sequences. Genes involved in the sulfur cycle (cysDN, cysNC and aprA) indicated adenylsulfate and sulfite production that were affiliated to several bacterial species. In summary, metagenomic sequence data provided insight regarding the structure and possible functions of this hot spring microbial community, describing some groups potentially involved in the nitrogen and sulfur cycling in this environment. PMID:23251687

  4. Insights into rubber biosynthesis from transcriptome analysis of Hevea brasiliensis latex.

    PubMed

    Chow, Keng-See; Wan, Kiew-Lian; Isa, Mohd Noor Mat; Bahari, Azlina; Tan, Siang-Hee; Harikrishna, K; Yeang, Hoong-Yeet

    2007-01-01

    Hevea brasiliensis is the most widely cultivated species for commercial production of natural rubber (cis-polyisoprene). In this study, 10,040 expressed sequence tags (ESTs) were generated from the latex of the rubber tree, which represents the cytoplasmic content of a single cell type, in order to analyse the latex transcription profile with emphasis on rubber biosynthesis-related genes. A total of 3,441 unique transcripts (UTs) were obtained after quality editing and assembly of EST sequences. Functional classification of UTs according to the Gene Ontology convention showed that 73.8% were related to genes of unknown function. Among highly expressed ESTs, a significant proportion encoded proteins related to rubber biosynthesis and stress or defence responses. Sequences encoding rubber particle membrane proteins (RPMPs) belonging to three protein families accounted for 12% of the ESTs. Characterization of these ESTs revealed nine RPMP variants (7.9-27 kDa) including the 14 kDa REF (rubber elongation factor) and 22 kDa SRPP (small rubber particle protein). The expression of multiple RPMP isoforms in latex was shown using antibodies against REF and SRPP. Both EST and quantitative reverse transcription-PCR (QRT-PCR) analyses demonstrated REF and SRPP to be the most abundant transcripts in latex. Besides rubber biosynthesis, comparative sequence analysis showed that the RPMPs are highly similar to sequences in the plant kingdom having stress-related functions. Implications of the RPMP function in cis-polyisoprene biosynthesis in the context of transcript abundance and differential gene expression are discussed.

  5. Minimising Immunohistochemical False Negative ER Classification Using a Complementary 23 Gene Expression Signature of ER Status

    PubMed Central

    Li, Qiyuan; Eklund, Aron C.; Juul, Nicolai; Haibe-Kains, Benjamin; Workman, Christopher T.; Richardson, Andrea L.; Szallasi, Zoltan; Swanton, Charles

    2010-01-01

    Background Expression of the oestrogen receptor (ER) in breast cancer predicts benefit from endocrine therapy. Minimising the frequency of false negative ER status classification is essential to identify all patients with ER positive breast cancers who should be offered endocrine therapies in order to improve clinical outcome. In routine oncological practice ER status is determined by semi-quantitative methods such as immunohistochemistry (IHC) or other immunoassays in which the ER expression level is compared to an empirical threshold[1], [2]. The clinical relevance of gene expression-based ER subtypes as compared to IHC-based determination has not been systematically evaluated. Here we attempt to reduce the frequency of false negative ER status classification using two gene expression approaches and compare these methods to IHC based ER status in terms of predictive and prognostic concordance with clinical outcome. Methodology/Principal Findings Firstly, ER status was discriminated by fitting the bimodal expression of ESR1 to a mixed Gaussian model. The discriminative power of ESR1 suggested bimodal expression as an efficient way to stratify breast cancer; therefore we identified a set of genes whose expression was both strongly bimodal, mimicking ESR expression status, and highly expressed in breast epithelial cell lines, to derive a 23-gene ER expression signature-based classifier. We assessed our classifiers in seven published breast cancer cohorts by comparing the gene expression-based ER status to IHC-based ER status as a predictor of clinical outcome in both untreated and tamoxifen treated cohorts. In untreated breast cancer cohorts, the 23 gene signature-based ER status provided significantly improved prognostic power compared to IHC-based ER status (P = 0.006). In tamoxifen-treated cohorts, the 23 gene ER expression signature predicted clinical outcome (HR = 2.20, P = 0.00035). These complementary ER signature-based strategies estimated that between 15.1% and 21.8% of patients with IHC-based negative ER status would be classified as having ER positive breast cancer. Conclusion/Significance Expression-based ER status classification may complement IHC to minimise false negative ER status classification and optimise patient stratification for endocrine therapies. PMID:21152022
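
The mixed Gaussian step described in this record can be illustrated with a minimal sketch, assuming a two-component, one-dimensional mixture fitted by plain expectation-maximisation. The function names and the expression values below are hypothetical and are not taken from the study, which does not state how its model was fitted.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def fit_two_gaussians(values, iterations=100):
    """Fit a two-component 1-D Gaussian mixture with plain EM (illustrative only)."""
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    # Crude initialisation: lower and upper halves of the sorted values.
    mean = [sum(xs[:half]) / half, sum(xs[half:]) / (n - half)]
    var = [1.0, 1.0]
    weight = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of the high-expression component for each value.
        resp = []
        for x in xs:
            p_low = weight[0] * normal_pdf(x, mean[0], var[0])
            p_high = weight[1] * normal_pdf(x, mean[1], var[1])
            total = p_low + p_high
            resp.append(p_high / total if total > 0 else 0.5)
        # M-step: re-estimate means, variances and mixing weights.
        r_high = sum(resp)
        r_low = n - r_high
        mean = [
            sum((1 - r) * x for r, x in zip(resp, xs)) / r_low,
            sum(r * x for r, x in zip(resp, xs)) / r_high,
        ]
        var = [
            max(sum((1 - r) * (x - mean[0]) ** 2 for r, x in zip(resp, xs)) / r_low, 1e-6),
            max(sum(r * (x - mean[1]) ** 2 for r, x in zip(resp, xs)) / r_high, 1e-6),
        ]
        weight = [r_low / n, r_high / n]
    return mean, var, weight

def probability_high(x, mean, var, weight):
    """Posterior probability that x belongs to the high-expression component."""
    p_low = weight[0] * normal_pdf(x, mean[0], var[0])
    p_high = weight[1] * normal_pdf(x, mean[1], var[1])
    return p_high / (p_low + p_high)

# Made-up log-scale expression values with two clearly separated modes.
expression = [4.0, 4.5, 5.0, 5.5, 6.0, 10.0, 10.5, 11.0, 11.5, 12.0]
mean, var, weight = fit_two_gaussians(expression)
```

A sample would then be called expression-positive when `probability_high` exceeds a chosen cut-off. The cut-off is a design choice that the abstract does not specify.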

  6. Identification of Differentially Expressed Genes Associated with Apple Fruit Ripening and Softening by Suppression Subtractive Hybridization

    PubMed Central

    Zhang, Zongying; Jiang, Shenghui; Wang, Nan; Li, Min; Ji, Xiaohao; Sun, Shasha; Liu, Jingxuan; Wang, Deyun; Xu, Haifeng; Qi, Sumin; Wu, Shujing; Fei, Zhangjun; Feng, Shouqian; Chen, Xuesen

    2015-01-01

    Apple is one of the most economically important horticultural fruit crops worldwide. It is critical to gain insights into fruit ripening and softening to improve apple fruit quality and extend shelf life. In this study, forward and reverse suppression subtractive hybridization libraries were generated from ‘Taishanzaoxia’ apple fruits sampled around the ethylene climacteric to isolate ripening- and softening-related genes. A set of 648 unigenes were derived from sequence alignment and cluster assembly of 918 expressed sequence tags. According to gene ontology functional classification, 390 out of 443 unigenes (88%) were assigned to the biological process category, 356 unigenes (80%) were classified in the molecular function category, and 381 unigenes (86%) were allocated to the cellular component category. A total of 26 unigenes differentially expressed during fruit development period were analyzed by quantitative RT-PCR. These genes were involved in cell wall modification, anthocyanin biosynthesis, aroma production, stress response, metabolism, transcription, or were non-annotated. Some genes associated with cell wall modification, anthocyanin biosynthesis and aroma production were up-regulated and significantly correlated with ethylene production, suggesting that fruit texture, coloration and aroma may be regulated by ethylene in ‘Taishanzaoxia’. Some of the identified unigenes associated with fruit ripening and softening have not been characterized in public databases. The results contribute to an improved characterization of changes in gene expression during apple fruit ripening and softening. PMID:26719904

  7. Robust diagnosis of non-Hodgkin lymphoma phenotypes validated on gene expression data from different laboratories.

    PubMed

    Bhanot, Gyan; Alexe, Gabriela; Levine, Arnold J; Stolovitzky, Gustavo

    2005-01-01

    A major challenge in cancer diagnosis from microarray data is the need for robust, accurate classification models which are independent of the analysis techniques used and can combine data from different laboratories. We propose such a classification scheme originally developed for phenotype identification from mass spectrometry data. The method uses a robust multivariate gene selection procedure and combines the results of several machine learning tools trained on raw and pattern data to produce an accurate meta-classifier. We illustrate and validate our method by applying it to gene expression datasets: the oligonucleotide HuGeneFL microarray dataset of Shipp et al. (www.genome.wi.mit.du/MPR/lymphoma) and the Hu95Av2 Affymetrix dataset (DallaFavera's laboratory, Columbia University). Our pattern-based meta-classification technique achieves higher predictive accuracies than each of the individual classifiers, is robust against data perturbations and provides subsets of related predictive genes. Our techniques predict that combinations of some genes in the p53 pathway are highly predictive of phenotype. In particular, we find that in 80% of DLBCL cases the mRNA level of at least one of the three genes p53, PLK1 and CDK2 is elevated, while in 80% of FL cases, the mRNA level of at most one of them is elevated.

  8. 14 CFR Section 9 - Functional Classification-Operating Revenues

    Code of Federal Regulations, 2010 CFR

    2010-01-01

    ... 14 Aeronautics and Space 4 2010-01-01 2010-01-01 false Functional Classification-Operating... AIR CARRIERS Profit and Loss Classification Section 9 Functional Classification-Operating Revenues 3900 Transport Revenues. This classification is prescribed for all air carrier groups and shall include all...

  9. Comparison between SLC3A1 and SLC7A9 cystinuria patients and carriers: a need for a new classification.

    PubMed

    Dello Strologo, Luca; Pras, Elon; Pontesilli, Claudia; Beccia, Ercole; Ricci-Barbini, Vittorino; de Sanctis, Luisa; Ponzone, Alberto; Gallucci, Michele; Bisceglia, Luigi; Zelante, Leopoldo; Jimenez-Vidal, Maite; Font, Mariona; Zorzano, Antonio; Rousaud, Ferran; Nunes, Virginia; Gasparini, Paolo; Palacín, Manuel; Rizzoni, Gianfranco

    2002-10-01

    Recent developments in the genetics and physiology of cystinuria do not support the traditional classification, which is based on the excretion of cystine and dibasic amino acids in obligate heterozygotes. Mutations of only two genes (SLC3A1 and SLC7A9), identified by the International Cystinuria Consortium (ICC), have been found to be responsible for all three types of the disease. The ICC set up a multinational database and collected genetic and clinical data from 224 patients affected by cystinuria, 125 with full genotype definition. Amino acid urinary excretion patterns of 189 heterozygotes with genetic definition and of 83 healthy controls were also included. All SLC3A1 carriers and 14% of SLC7A9 carriers showed a normal amino acid urinary pattern (i.e., type I phenotype). The rest of the SLC7A9 carriers showed phenotype non-I (type III, 80.5%; type II, 5.5%). This makes the traditional classification imprecise. A new classification is needed: type A, due to two mutations of SLC3A1 (rBAT) on chromosome 2 (45.2% in our database); type B, due to two mutations of SLC7A9 on chromosome 19 (53.2% in this series); and a possible third type, AB (1.6%), with one mutation on each of the above-mentioned genes. Clinical data show that cystinuria is more severe in males than in females. The two types of cystinuria (A and B) had a similar outcome in this retrospective study, but the effect of the treatment could not be analyzed. Stone events do not correlate with amino acid urinary excretion. Renal function was clearly impaired in 17% of the patients.
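
The genotype-based scheme proposed in this record (type A, type B and a possible type AB) can be written as a small lookup, shown here as a sketch. The function name and the use of per-gene mutation counts as inputs are assumptions made for illustration. Genotypes outside the three described combinations are returned as unclassified, because the abstract does not say how they should be labelled.

```python
def cystinuria_type(slc3a1_mutations, slc7a9_mutations):
    """Map per-gene mutation counts to the proposed type A / B / AB labels."""
    if slc3a1_mutations == 2 and slc7a9_mutations == 0:
        return "A"   # two mutations of SLC3A1
    if slc3a1_mutations == 0 and slc7a9_mutations == 2:
        return "B"   # two mutations of SLC7A9
    if slc3a1_mutations == 1 and slc7a9_mutations == 1:
        return "AB"  # one mutation on each gene
    return "unclassified"  # not covered by the scheme as summarised here

# Hypothetical genotypes, one tuple per patient: (SLC3A1 count, SLC7A9 count).
labels = [cystinuria_type(a, b) for a, b in [(2, 0), (0, 2), (1, 1), (1, 0)]]
```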

  10. Negative Example Selection for Protein Function Prediction: The NoGO Database

    PubMed Central

    Youngs, Noah; Penfold-Brown, Duncan; Bonneau, Richard; Shasha, Dennis

    2014-01-01

    Negative examples – genes that are known not to carry out a given protein function – are rarely recorded in genome and proteome annotation databases, such as the Gene Ontology database. Negative examples are required, however, for several of the most powerful machine learning methods for integrative protein function prediction. Most protein function prediction efforts have relied on a variety of heuristics for the choice of negative examples. Determining the accuracy of methods for negative example prediction is itself a non-trivial task, given that the Open World Assumption as applied to gene annotations rules out many traditional validation metrics. We present a rigorous comparison of these heuristics, utilizing a temporal holdout, and a novel evaluation strategy for negative examples. We add to this comparison several algorithms adapted from Positive-Unlabeled learning scenarios in text-classification, which are the current state of the art methods for generating negative examples in low-density annotation contexts. Lastly, we present two novel algorithms of our own construction, one based on empirical conditional probability, and the other using topic modeling applied to genes and annotations. We demonstrate that our algorithms achieve significantly fewer incorrect negative example predictions than the current state of the art, using multiple benchmarks covering multiple organisms. Our methods may be applied to generate negative examples for any type of method that deals with protein function, and to this end we provide a database of negative examples in several well-studied organisms, for general use (The NoGO database, available at: bonneaulab.bio.nyu.edu/nogo.html). PMID:24922051

  11. Data Mining Algorithms for Classification of Complex Biomedical Data

    ERIC Educational Resources Information Center

    Lan, Liang

    2012-01-01

    In my dissertation, I will present my research which contributes to solve the following three open problems from biomedical informatics: (1) Multi-task approaches for microarray classification; (2) Multi-label classification of gene and protein prediction from multi-source biological data; (3) Spatial scan for movement data. In microarray…

  12. Impact of training sets on classification of high-throughput bacterial 16s rRNA gene surveys

    PubMed Central

    Werner, Jeffrey J; Koren, Omry; Hugenholtz, Philip; DeSantis, Todd Z; Walters, William A; Caporaso, J Gregory; Angenent, Largus T; Knight, Rob; Ley, Ruth E

    2012-01-01

    Taxonomic classification of the thousands–millions of 16S rRNA gene sequences generated in microbiome studies is often achieved using a naïve Bayesian classifier (for example, the Ribosomal Database Project II (RDP) classifier), due to favorable trade-offs among automation, speed and accuracy. The resulting classification depends on the reference sequences and taxonomic hierarchy used to train the model; although the influence of primer sets and classification algorithms have been explored in detail, the influence of training set has not been characterized. We compared classification results obtained using three different publicly available databases as training sets, applied to five different bacterial 16S rRNA gene pyrosequencing data sets generated (from human body, mouse gut, python gut, soil and anaerobic digester samples). We observed numerous advantages to using the largest, most diverse training set available, that we constructed from the Greengenes (GG) bacterial/archaeal 16S rRNA gene sequence database and the latest GG taxonomy. Phylogenetic clusters of previously unclassified experimental sequences were identified with notable improvements (for example, 50% reduction in reads unclassified at the phylum level in mouse gut, soil and anaerobic digester samples), especially for phylotypes belonging to specific phyla (Tenericutes, Chloroflexi, Synergistetes and Candidate phyla TM6, TM7). Trimming the reference sequences to the primer region resulted in systematic improvements in classification depth, and greatest gains at higher confidence thresholds. Phylotypes unclassified at the genus level represented a greater proportion of the total community variation than classified operational taxonomic units in mouse gut and anaerobic digester samples, underscoring the need for greater diversity in existing reference databases. PMID:21716311
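
The dependence of the result on the training set can be seen in a toy naive Bayes classifier over k-mers, sketched below. This is a simplified illustration and not the RDP classifier: the word length, smoothing constants, function names and reference sequences are all assumptions, and no bootstrap confidence estimate is computed.

```python
import math
from collections import Counter, defaultdict

def kmers(seq, k=4):
    """Set of distinct k-length words in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def train(references, k=4):
    """references: list of (taxon, sequence) pairs forming the training set."""
    word_counts = defaultdict(Counter)  # taxon -> word -> number of sequences containing it
    totals = Counter()                  # taxon -> number of training sequences
    for taxon, seq in references:
        totals[taxon] += 1
        word_counts[taxon].update(kmers(seq, k))
    return word_counts, totals

def classify(seq, model, k=4):
    """Return the taxon with the highest summed log-probability of the query words."""
    word_counts, totals = model
    best_taxon, best_score = None, -math.inf
    for taxon, n in totals.items():
        # Smoothed estimate of how often each query word occurs in this taxon.
        score = sum(
            math.log((word_counts[taxon][w] + 0.5) / (n + 1.0))
            for w in kmers(seq, k)
        )
        if score > best_score:
            best_taxon, best_score = taxon, score
    return best_taxon

# Toy training set; the sequences are invented and carry no biological meaning.
model = train([
    ("Firmicutes", "ACGTACGTACGTACGT"),
    ("Proteobacteria", "GGCCTTAAGGCCTTAA"),
])
assignment = classify("ACGTACGA", model)
```

Because `classify` can only return labels present in `totals`, a query from a lineage absent from the training set is forced into the closest available taxon, which is one way a small reference set limits classification.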

  13. Novel PAX3 mutations causing Waardenburg syndrome type 1 in Tunisian patients.

    PubMed

    Trabelsi, Mediha; Nouira, Malek; Maazoul, Faouzi; Kraoua, Lilia; Meddeb, Rim; Ouertani, Ines; Chelly, Imen; Benoit, Valérie; Besbes, Ghazi; Mrad, Ridha

    2017-12-01

    Waardenburg syndrome (WS) is an auditory-pigmentary disease characterized by clinical and genetic variability. WS is classified into four types depending on the presence or absence of additional symptoms: WS1, WS2, WS3 and WS4. Types 1 and 3 are mostly caused by PAX3 mutations, while types 2 and 4 are genetically heterogeneous. The aims of this study were to confirm the diagnosis of WS1 by sequencing the PAX3 gene and to evaluate the genotype-phenotype correlation. A clinical classification, as proposed by the Waardenburg Consortium, was established for 14 WS patients and showed a predominance of types 1 and 2, with 6 WS1 patients, 7 WS2 patients and 1 WS3 patient. Significant inter- and intra-familial clinical heterogeneity was also observed. Sequencing of the PAX3 gene in the 6 WS1 patients confirmed the diagnosis in 4 of them by revealing three novel mutations that modify two functional domains of the protein: c.942delC, c.933_936dupTTAC and c.164delTCCGCCACA. These three variations are most likely responsible for the phenotype; however, their pathogenic effects need to be confirmed by functional studies. MLPA analysis of the 2 patients who were sequence-negative for PAX3 revealed, in one of them, a heterozygous deletion of exons 5 to 9, confirming the WS1 diagnosis. Both clinical and molecular approaches led to the conclusion that there is a lack of genotype-phenotype correlation in WS1, an element that must be taken into account in genetic counseling. The absence of a PAX3 mutation in one WS1 patient highlights the fact that clinical classification is sometimes insufficient to distinguish WS1 from other WS types, hence the interest of sequencing the other WS genes in this patient. Copyright © 2017 Elsevier B.V. All rights reserved.

  14. Divergence between motoneurons: gene expression profiling provides a molecular characterization of functionally discrete somatic and autonomic motoneurons

    PubMed Central

    Cui, Dapeng; Dougherty, Kimberly J.; Machacek, David W.; Sawchuk, Michael; Hochman, Shawn; Baro, Deborah J.

    2009-01-01

    Studies in the developing spinal cord suggest that different motoneuron (MN) cell types express very different genetic programs, but the degree to which adult programs differ is unknown. To compare genetic programs between adult MN columnar cell types, we used laser capture micro-dissection (LCM) and Affymetrix microarrays to create expression profiles for three columnar cell types: lateral and medial MNs from lumbar segments and sympathetic preganglionic motoneurons located in the thoracic intermediolateral nucleus. A comparison of the three expression profiles indicated that ~7% (813/11,552) of the genes showed significant differences in their expression levels. The largest differences were observed between sympathetic preganglionic MNs and the lateral motor column, with 6% (706/11,552) of the genes being differentially expressed. Significant differences in expression were observed for 1.8% (207/11,552) of the genes when comparing sympathetic preganglionic MNs with the medial motor column. Lateral and medial MNs showed the least divergence, with 1.3% (150/11,552) of the genes being differentially expressed. These data indicate that the amount of divergence in expression profiles between identified columnar MNs does not strictly correlate with divergence of function as defined by innervation patterns (somatic/muscle vs. autonomic/viscera). Classification of the differentially expressed genes with regard to function showed that they underpin all fundamental cell systems and processes, although most differentially expressed genes encode proteins involved in signal transduction. Mining the expression profiles to examine transcription factors essential for MN development suggested that many of the same transcription factors participate in combinatorial codes in embryonic and adult neurons, but patterns of expression change significantly. PMID:16317082
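
The percentages quoted in this record follow directly from the reported counts over 11,552 genes. The short check below recomputes them; the dictionary keys are shorthand labels introduced here, not terms from the paper.

```python
TOTAL_GENES = 11552

# Differentially expressed gene counts as reported in the abstract.
counts = {
    "any of the three profiles": 813,
    "preganglionic vs lateral": 706,
    "preganglionic vs medial": 207,
    "lateral vs medial": 150,
}

# Percentage of all assayed genes, rounded to one decimal place.
percentages = {
    label: round(100.0 * n / TOTAL_GENES, 1) for label, n in counts.items()
}

for label, pct in percentages.items():
    print(f"{label}: {pct}%")
```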

  15. Noncoding sequence classification based on wavelet transform analysis: part I

    NASA Astrophysics Data System (ADS)

    Paredes, O.; Strojnik, M.; Romo-Vázquez, R.; Vélez Pérez, H.; Ranta, R.; Garcia-Torales, G.; Scholl, M. K.; Morales, J. A.

    2017-09-01

    DNA sequences in the human genome can be divided into coding and noncoding ones. Coding sequences are those that are read during transcription. The identification of coding sequences has been widely reported in the literature due to its much-studied periodicity. Noncoding sequences represent the majority of the human genome. They play an important role in gene regulation and differentiation among the cells. However, noncoding sequences do not exhibit periodicities that correlate to their functions. The ENCODE (Encyclopedia of DNA Elements) and Epigenomic Roadmap projects have cataloged the human noncoding sequences into specific functions. We study characteristics of noncoding sequences with wavelet analysis of genomic signals.
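
As a rough illustration of what wavelet analysis of a genomic signal involves, the sketch below maps a DNA string to a numeric signal and applies one level of the Haar wavelet transform. Both choices are assumptions: the abstract names neither the numeric mapping nor the wavelet family, and the function names are hypothetical.

```python
import math

def gc_signal(seq):
    """Indicator signal: 1.0 for G or C, 0.0 for any other base (assumed mapping)."""
    return [1.0 if base in "GC" else 0.0 for base in seq.upper()]

def haar_step(signal):
    """One level of the Haar transform; a trailing unpaired sample is dropped."""
    scale = 1.0 / math.sqrt(2.0)
    approx, detail = [], []
    for a, b in zip(signal[0::2], signal[1::2]):
        approx.append((a + b) * scale)  # local average (low-pass)
        detail.append((a - b) * scale)  # local difference (high-pass)
    return approx, detail

signal = gc_signal("GGATGCAT")
approx, detail = haar_step(signal)
```

Applying `haar_step` again to `approx` gives the next, coarser scale. Features for classification could then be derived from the detail coefficients at each scale, although the abstract does not describe which features were used.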

  16. A 16-Gene Signature Distinguishes Anaplastic Astrocytoma from Glioblastoma

    PubMed Central

    Rao, Soumya Alige Mahabala; Srinivasan, Sujaya; Patric, Irene Rosita Pia; Hegde, Alangar Sathyaranjandas; Chandramouli, Bangalore Ashwathnarayanara; Arimappamagan, Arivazhagan; Santosh, Vani; Kondaiah, Paturu; Rao, Manchanahalli R. Sathyanarayana; Somasundaram, Kumaravel

    2014-01-01

    Anaplastic astrocytoma (AA; Grade III) and glioblastoma (GBM; Grade IV) are diffusely infiltrating tumors and are called malignant astrocytomas. The treatment regimen and prognosis are distinctly different between anaplastic astrocytoma and glioblastoma patients. Although the current histopathology-based grading system is well accepted and largely reproducible, intratumoral histologic variations often lead to difficulties in classification of malignant astrocytoma samples. In order to obtain a more robust molecular classifier, we analysed RT-qPCR expression data of 175 differentially regulated genes across astrocytoma using Prediction Analysis of Microarrays (PAM) and found the most discriminatory 16-gene expression signature for the classification of anaplastic astrocytoma and glioblastoma. The 16-gene signature obtained in the training set was validated in the test set with diagnostic accuracy of 89%. Additionally, validation of the 16-gene signature in multiple independent cohorts revealed that the signature predicted anaplastic astrocytoma and glioblastoma samples with accuracy rates of 99%, 88%, and 92% in TCGA, GSE1993 and GSE4422 datasets, respectively. The protein-protein interaction network and pathway analysis suggested that the 16 genes of the signature identified the epithelial-mesenchymal transition (EMT) pathway as the most differentially regulated pathway in glioblastoma compared to anaplastic astrocytoma. In addition to identifying a 16-gene classification signature, we also demonstrated that genes involved in epithelial-mesenchymal transition may play an important role in distinguishing glioblastoma from anaplastic astrocytoma. PMID:24475040

  17. Clinical relevance of rare germline sequence variants in cancer genes: evolution and application of classification models.

    PubMed

    Spurdle, Amanda B

    2010-06-01

    Multifactorial models developed for BRCA1/2 variant classification have proved very useful for delineating BRCA1/2 variants associated with very high risk of cancer, or with little clinical significance. Recent linkage of this quantitative assessment of risk to clinical management guidelines has provided a basis to standardize variant reporting, variant classification and management of families with such variants, and can theoretically be applied to any disease gene. As proof of principle, the multifactorial approach already shows great promise for application to the evaluation of mismatch repair gene variants identified in families with suspected Lynch syndrome. However there is need to be cautious of the noted limitations and caveats of the current model, some of which may be exacerbated by differences in ascertainment and biological pathways to disease for different cancer syndromes.

  18. Gene features selection for three-class disease classification via multiple orthogonal partial least square discriminant analysis and S-plot using microarray data.

    PubMed

    Yang, Mingxing; Li, Xiumin; Li, Zhibin; Ou, Zhimin; Liu, Ming; Liu, Suhuan; Li, Xuejun; Yang, Shuyu

    2013-01-01

    DNA microarray analysis is characterized by obtaining a large number of gene variables from a small number of observations. Cluster analysis is widely used to analyze DNA microarray data to make classification and diagnosis of disease. Because there are so many irrelevant and insignificant genes in a dataset, a feature selection approach must be employed in data analysis. The performance of cluster analysis of this high-throughput data depends on whether the feature selection approach chooses the most relevant genes associated with disease classes. Here we proposed a new method using multiple Orthogonal Partial Least Squares-Discriminant Analysis (mOPLS-DA) models and S-plots to select the most relevant genes to conduct three-class disease classification and prediction. We tested our method using Golub's leukemia microarray data. For three classes with subtypes, we proposed hierarchical orthogonal partial least squares-discriminant analysis (OPLS-DA) models and S-plots to select features for two main classes and their subtypes. For three classes in parallel, we employed three OPLS-DA models and S-plots to choose marker genes for each class. The power of feature selection to classify and predict three-class disease was evaluated using cluster analysis. Further, the general performance of our method was tested using four public datasets and compared with those of four other feature selection methods. The results revealed that our method effectively selected the most relevant features for disease classification and prediction, and its performance was better than that of the other methods.
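
An S-plot places each variable by its covariance and its correlation with the predictive score vector of an OPLS-DA model. The sketch below computes those two coordinates per gene, assuming the score vector `t` has already been obtained from a fitted model (the OPLS-DA fit itself is not implemented here). The function name and the example data are hypothetical.

```python
import math

def s_plot_coordinates(X, t):
    """Return (covariance, correlation) with score vector t for each column of X.

    X is a list of samples, each a list of gene values; t holds one score per sample.
    """
    n = len(t)
    t_mean = sum(t) / n
    t_centered = [v - t_mean for v in t]
    t_sd = math.sqrt(sum(v * v for v in t_centered) / (n - 1))
    coords = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        x_mean = sum(column) / n
        x_centered = [v - x_mean for v in column]
        cov = sum(a * b for a, b in zip(t_centered, x_centered)) / (n - 1)
        x_sd = math.sqrt(sum(v * v for v in x_centered) / (n - 1))
        # Constant columns have no defined correlation; report 0.0 for them.
        corr = cov / (t_sd * x_sd) if t_sd > 0 and x_sd > 0 else 0.0
        coords.append((cov, corr))
    return coords

# Four samples by three genes (invented values) and an assumed score vector.
X = [[0.0, 5.0, 3.0], [0.0, 5.0, 1.0], [2.0, 5.0, 1.0], [2.0, 5.0, 3.0]]
t = [-1.0, -1.0, 1.0, 1.0]
coords = s_plot_coordinates(X, t)
```

Genes that lie far from the origin on both axes are the usual candidates for selection, since they combine a large contribution with a reliable one. The thresholds applied in the study are not given in the abstract.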

  19. Supervised DNA Barcodes species classification: analysis, comparisons and results

    PubMed Central

    2014-01-01

    Background Specific fragments, coming from short portions of DNA (e.g., mitochondrial, nuclear, and plastid sequences), have been defined as DNA Barcode and can be used as markers for organisms of the main life kingdoms. Species classification with DNA Barcode sequences has been proven effective on different organisms. Indeed, specific gene regions have been identified as Barcode: COI in animals, rbcL and matK in plants, and ITS in fungi. The classification problem assigns an unknown specimen to a known species by analyzing its Barcode. This task has to be supported with reliable methods and algorithms. Methods In this work the efficacy of supervised machine learning methods to classify species with DNA Barcode sequences is shown. The Weka software suite, which includes a collection of supervised classification methods, is adopted to address the task of DNA Barcode analysis. Classifier families are tested on synthetic and empirical datasets belonging to the animal, fungus, and plant kingdoms. In particular, the function-based method Support Vector Machines (SVM), the rule-based RIPPER, the decision tree C4.5, and the Naïve Bayes method are considered. Additionally, the classification results are compared with respect to ad-hoc and well-established DNA Barcode classification methods. Results Software that converts the DNA Barcode FASTA sequences to the Weka format is released, to adapt different input formats and to allow the execution of the classification procedure. The analysis of results on synthetic and real datasets shows that SVM and Naïve Bayes outperform on average the other considered classifiers, although they do not provide a human interpretable classification model. Rule-based methods have slightly inferior classification performances, but deliver the species specific positions and nucleotide assignments. On synthetic data the supervised machine learning methods obtain superior classification performances with respect to the traditional DNA Barcode classification methods. On empirical data their classification performances are at a comparable level to the other methods. Conclusions The classification analysis shows that supervised machine learning methods are promising candidates for successfully handling the DNA Barcoding species classification problem, obtaining excellent performances. To conclude, a powerful tool to perform species identification is now available to the DNA Barcoding community. PMID:24721333
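
The FASTA-to-Weka conversion mentioned in this record can be illustrated with a minimal sketch that writes Weka's ARFF text format. This is not the released software. It assumes aligned sequences of equal length, FASTA headers that consist only of a species label without spaces, and one nominal attribute per alignment position. The function names are hypothetical.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def fasta_to_arff(text, relation="barcodes"):
    """Render aligned barcode sequences as ARFF, one attribute per position."""
    records = parse_fasta(text)
    length = len(records[0][1])
    if any(len(seq) != length for _, seq in records):
        raise ValueError("sequences must be aligned to the same length")
    species = sorted({header for header, _ in records})
    lines = [f"@relation {relation}"]
    lines += [f"@attribute pos{i + 1} {{A,C,G,T}}" for i in range(length)]
    lines.append("@attribute species {" + ",".join(species) + "}")
    lines.append("@data")
    for header, seq in records:
        # Ambiguous or gap characters become ARFF missing values.
        cells = [base if base in "ACGT" else "?" for base in seq]
        lines.append(",".join(cells + [header]))
    return "\n".join(lines)

arff = fasta_to_arff(">sp_a\nACG\n>sp_b\nANT\n")
```

With the class attribute placed last, the resulting file follows the layout that Weka classifiers conventionally expect, so the same data can be passed to any of the classifier families named in the abstract.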

  20. Profiling Hyporheic Microbial Community Nitrogen Cycle and Carbohydrate Active Enzyme Gene Abundances across Seasons

    NASA Astrophysics Data System (ADS)

    Nelson, W. C.; Graham, E.; Stegen, J.

    2016-12-01

    The hyporheic zone (HZ) is the permanently inundated sediment layer between a surface channel and adjacent groundwater-saturated sediments. It has been hypothesized to play a major role in macronutrient (C, N, P) cycling in rivers. The correlation between community taxonomic composition dynamics and functional gene representation is poorly understood for hyporheic communities. To explore how microbial communities respond to temporal changes in environmental conditions, metagenomes were derived from communities captured in sterile sandpacks deployed within the HZ of the Columbia River. HMM databases were used to enumerate protein families present. Functional classification of reads allowed a general assessment of community function over time, while targeted assembly of specific genes enabled investigation of the diversity of organisms encoding these functions. Preliminary analysis of nitrogen cycle pathways shows most gene families examined to have quite steady representation across seasons, with most observed changes being less than an order of magnitude. Analysis of ammonia oxidation genes showed bacterial ammonia oxidizers (AOB) to be stably present across the year, while the archaeal amoA gene increased in late summer, peaking sharply in November, mirroring results from 16S rRNA amplicon analysis which showed an increase in Thaumarchaeal OTUs during that same period. Most glycosyl hydrolase (GH) families had low representation. Highly abundant classes of GH included the GH94 (beta-glucosidase), GH95 (1-2-alpha-L-fucosidase) and GH103 (lytic transglycosylase) families, suggesting activity on plant, fungus and insect polysaccharides and peptidoglycans. Further work is investigating the taxonomy of the sequences identified, to determine how changes in the community composition contribute to the stable gene family profiles observed. These results are intended to work towards a greater understanding of the role of species diversity and functional redundancy in the dynamics of community composition in response to changes in environmental conditions and stochastic processes. In addition, they will serve as a foundation enabling modeling of generalized microbial function in the hyporheic zone, improving our ability to predict fluxes of carbon and nitrogen through riverine systems.

  1. Genome-Wide Identification, Phylogenetic and Expression Analyses of the Ubiquitin-Conjugating Enzyme Gene Family in Maize.

    PubMed

    Jue, Dengwei; Sang, Xuelian; Lu, Shengqiao; Dong, Chen; Zhao, Qiufang; Chen, Hongliang; Jia, Liqiang

    2015-01-01

    Ubiquitination is a post-translational modification where ubiquitin is attached to a substrate. Ubiquitin-conjugating enzymes (E2s) play a major role in the ubiquitin transfer pathway, as well as a variety of functions in plant biological processes. To date, no genome-wide characterization of this gene family has been conducted in maize (Zea mays). In the present study, a total of 75 putative ZmUBC genes have been identified and located in the maize genome. Phylogenetic analysis revealed that ZmUBC proteins could be divided into 15 subfamilies, which include 13 ubiquitin-conjugating enzymes (ZmE2s) and two independent ubiquitin-conjugating enzyme variant (UEV) groups. The predicted ZmUBC genes were distributed across 10 chromosomes at different densities. In addition, analysis of exon-intron junctions and sequence motifs in each candidate gene has revealed high levels of conservation within and between phylogenetic groups. Tissue expression analysis indicated that most ZmUBC genes were expressed in at least one of the tissues, indicating that these are involved in various physiological and developmental processes in maize. Moreover, expression profile analyses of ZmUBC genes under different stress treatments (4°C, 20% PEG6000, and 200 mM NaCl) and various expression patterns indicated that these may play crucial roles in the response of plants to stress. Genome-wide identification, chromosome organization, gene structure, evolutionary and expression analyses of ZmUBC genes have facilitated the characterization of this gene family, as well as determined its potential involvement in growth, development, and stress responses. This study provides valuable information for better understanding the classification and putative functions of the UBC-encoding genes of maize.

  2. Genome-Wide Identification, Phylogenetic and Expression Analyses of the Ubiquitin-Conjugating Enzyme Gene Family in Maize

    PubMed Central

    Jue, Dengwei; Sang, Xuelian; Lu, Shengqiao; Dong, Chen; Zhao, Qiufang; Chen, Hongliang; Jia, Liqiang

    2015-01-01

    Background: Ubiquitination is a post-translational modification where ubiquitin is attached to a substrate. Ubiquitin-conjugating enzymes (E2s) play a major role in the ubiquitin transfer pathway, as well as a variety of functions in plant biological processes. To date, no genome-wide characterization of this gene family has been conducted in maize (Zea mays). Methodology/Principal Findings: In the present study, a total of 75 putative ZmUBC genes have been identified and located in the maize genome. Phylogenetic analysis revealed that ZmUBC proteins could be divided into 15 subfamilies, which include 13 ubiquitin-conjugating enzymes (ZmE2s) and two independent ubiquitin-conjugating enzyme variant (UEV) groups. The predicted ZmUBC genes were distributed across 10 chromosomes at different densities. In addition, analysis of exon-intron junctions and sequence motifs in each candidate gene has revealed high levels of conservation within and between phylogenetic groups. Tissue expression analysis indicated that most ZmUBC genes were expressed in at least one of the tissues, indicating that these are involved in various physiological and developmental processes in maize. Moreover, expression profile analyses of ZmUBC genes under different stress treatments (4°C, 20% PEG6000, and 200 mM NaCl) and various expression patterns indicated that these may play crucial roles in the response of plants to stress. Conclusions: Genome-wide identification, chromosome organization, gene structure, evolutionary and expression analyses of ZmUBC genes have facilitated the characterization of this gene family, as well as determined its potential involvement in growth, development, and stress responses. This study provides valuable information for better understanding the classification and putative functions of the UBC-encoding genes of maize. PMID:26606743

  3. A Partial Least Squares Based Procedure for Upstream Sequence Classification in Prokaryotes.

    PubMed

    Mehmood, Tahir; Bohlin, Jon; Snipen, Lars

    2015-01-01

    The upstream region of coding genes is important for several reasons, for instance for locating transcription factor binding sites and initiation start sites in genomic DNA. Motivated by a recently conducted study, where a multivariate approach was successfully applied to coding sequence modeling, we have introduced a partial least squares (PLS) based procedure for the classification of true upstream prokaryotic sequences from background upstream sequences. The upstream sequences of conserved coding genes over genomes were considered in the analysis, where conserved coding genes were found by using the pan-genomics concept for each considered prokaryotic species. PLS uses a position specific scoring matrix (PSSM) to study the characteristics of the upstream region. Results obtained by the PLS-based method were compared with the Gini importance of random forest (RF) and with support vector machine (SVM), which are widely used methods for sequence classification. The upstream sequence classification performance was evaluated by using cross validation, and the suggested approach identifies prokaryotic upstream regions significantly better than RF (p-value < 0.01) and SVM (p-value < 0.01). Further, the proposed method also produced results that concurred with known biological characteristics of the upstream region.
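
    The record above names the position specific scoring matrix (PSSM) as the representation of the upstream region. As a rough illustration only, and not the authors' PLS procedure, the sketch below builds a log-odds PSSM from a few aligned toy sequences and scores two candidates. The sequences, the uniform background frequency of 0.25, the pseudocount of 1 and all function names are assumptions made for this example.

```python
import math

ALPHABET = "ACGT"

def build_pssm(aligned_seqs, pseudocount=1.0, background=0.25):
    """Return one dict per position mapping each base to a log2-odds score."""
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("sequences must be aligned to equal length")
    pssm = []
    for pos in range(length):
        column = [s[pos] for s in aligned_seqs]
        # Pseudocounts keep unseen bases from producing log(0).
        total = len(column) + pseudocount * len(ALPHABET)
        scores = {}
        for base in ALPHABET:
            freq = (column.count(base) + pseudocount) / total
            scores[base] = math.log2(freq / background)
        pssm.append(scores)
    return pssm

def score_sequence(pssm, seq):
    """Sum the per-position scores of a sequence of the same length."""
    return sum(column[base] for column, base in zip(pssm, seq))

# Hypothetical aligned upstream fragments (toy data, not from the paper).
toy_upstream = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
pssm = build_pssm(toy_upstream)
motif_score = score_sequence(pssm, "TATAAT")
background_score = score_sequence(pssm, "GCGCGC")
```

    A sequence resembling the training fragments receives a higher score than an unrelated one; in a classification setting such per-position scores would be the features handed to the classifier.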

  4. Congenital neutropenia in the era of genomics: classification, diagnosis, and natural history.

    PubMed

    Donadieu, Jean; Beaupain, Blandine; Fenneteau, Odile; Bellanné-Chantelot, Christine

    2017-11-01

    This review focuses on the classification, diagnosis and natural history of congenital neutropenia (CN). CN encompasses a number of genetic disorders with chronic neutropenia and, for some, affecting other organ systems, such as the pancreas, central nervous system, heart, bone and skin. To date, 24 distinct genes have been associated with CN. The number of genes involved makes gene screening difficult. This can be solved by next-generation sequencing (NGS) of targeted gene panels. One of the major complications of CN is spontaneous leukaemia, which is preceded by clonal somatic evolution, and can be screened by a targeted NGS panel focused on somatic events. © 2017 John Wiley & Sons Ltd.

  5. GSNFS: Gene subnetwork biomarker identification of lung cancer expression data.

    PubMed

    Doungpan, Narumol; Engchuan, Worrawat; Chan, Jonathan H; Meechai, Asawin

    2016-12-05

    Gene expression has been used to identify disease gene biomarkers, but there are ongoing challenges. Single gene or gene-set biomarkers are inadequate to provide sufficient understanding of complex disease mechanisms and the relationship among those genes. Network-based methods have thus been considered for inferring the interaction within a group of genes to further study the disease mechanism. Recently, the Gene-Network-based Feature Set (GNFS), which is capable of handling case-control and multiclass expression for gene biomarker identification, has been proposed, partly taking network topology into account. However, its performance relies on a greedy search for building subnetworks and thus requires further improvement. In this work, we establish a new approach named Gene Sub-Network-based Feature Selection (GSNFS) by implementing the GNFS framework with two proposed searching and scoring algorithms, namely gene-set-based (GS) search and parent-node-based (PN) search, to identify subnetworks. An additional dataset is used to validate the results. The two proposed searching algorithms of the GSNFS method for subnetwork expansion are concerned with the degree of connectivity and the scoring scheme for building subnetworks and their topology. At each iteration of expansion, the neighbour genes of the current subnetwork whose expression data improve the overall subnetwork score are recruited. While the GS search calculates the subnetwork score using an activity score of the current subnetwork and the gene expression values of its neighbours, the PN search uses the expression value of the corresponding parent of each neighbour gene. Four lung cancer expression datasets were used for subnetwork identification. In addition, the use of pathway data and protein-protein interactions as network data, in order to consider the interactions among significant genes, is discussed. Classification was performed to compare the performance of the identified gene subnetworks with three subnetwork identification algorithms. The two searching algorithms resulted in better classification and gene/gene-set agreement compared to the original greedy search of the GNFS method. The identified lung cancer subnetwork using the proposed searching algorithm resulted in an improvement of the cross-dataset validation and an increase in the consistency of findings between two independent datasets. The homogeneity measurement of the datasets was conducted to assess dataset compatibility in cross-dataset validation. The lung cancer dataset with higher homogeneity showed a better result when using the GS search while the dataset with low homogeneity showed a better result when using the PN search. The 10-fold cross-dataset validation on the independent lung cancer datasets showed higher classification performance of the proposed algorithms when compared with the greedy search in the original GNFS method. The proposed searching algorithms provide a higher number of genes in the subnetwork expansion step than the greedy algorithm. As a result, the performance of the subnetworks identified from the GSNFS method was improved in terms of classification performance and gene/gene-set level agreement depending on the homogeneity of the datasets used in the analysis. Some common genes obtained from the four datasets using different searching algorithms are genes known to play a role in lung cancer. The improvement of classification performance and the gene/gene-set level agreement, and the biological relevance indicated the effectiveness of the GSNFS method for gene subnetwork identification using expression data.
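
    The expansion step described in this record (recruit neighbour genes of the current subnetwork when they improve the subnetwork score) can be pictured with a generic sketch. This is not the GS or PN search of the paper: the scoring function below, a sum of per-gene weights divided by the square root of the subnetwork size, is a stand-in chosen only so the example runs, and the toy network, weights and names are hypothetical.

```python
import math

def expand_subnetwork(seed, neighbours, score):
    """Grow a subnetwork from a seed gene, keeping neighbours that raise the score."""
    subnetwork = {seed}
    best = score(subnetwork)
    improved = True
    while improved:
        improved = False
        # Collect genes adjacent to the current subnetwork but not yet in it.
        frontier = set()
        for gene in subnetwork:
            frontier |= set(neighbours.get(gene, ()))
        frontier -= subnetwork
        for candidate in sorted(frontier):
            trial = score(subnetwork | {candidate})
            if trial > best:
                subnetwork.add(candidate)
                best = trial
                improved = True
    return subnetwork, best

# Hypothetical toy network and per-gene discriminative weights.
toy_neighbours = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A"], "D": ["B"]}
toy_weights = {"A": 1.0, "B": 2.0, "C": -1.0, "D": 0.5}

def toy_score(genes):
    """Stand-in score: summed weights normalised by sqrt of subnetwork size."""
    return sum(toy_weights[g] for g in genes) / math.sqrt(len(genes))

selected, selected_score = expand_subnetwork("A", toy_neighbours, toy_score)
```

    With these toy values the search keeps gene B and rejects C and D, because adding either of them lowers the normalised score.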

  6. Mapping Gene Associations in Human Mitochondria using Clinical Disease Phenotypes

    PubMed Central

    Scharfe, Curt; Lu, Henry Horng-Shing; Neuenburg, Jutta K.; Allen, Edward A.; Li, Guan-Cheng; Klopstock, Thomas; Cowan, Tina M.; Enns, Gregory M.; Davis, Ronald W.

    2009-01-01

    Nuclear genes encode most mitochondrial proteins, and their mutations cause diverse and debilitating clinical disorders. To date, 1,200 of these mitochondrial genes have been recorded, while no standardized catalog exists of the associated clinical phenotypes. Such a catalog would be useful to develop methods to analyze human phenotypic data, to determine genotype-phenotype relations among many genes and diseases, and to support the clinical diagnosis of mitochondrial disorders. Here we establish a clinical phenotype catalog of 174 mitochondrial disease genes and study associations of diseases and genes. Phenotypic features such as clinical signs and symptoms were manually annotated from full-text medical articles and classified based on the hierarchical MeSH ontology. This classification of phenotypic features of each gene allowed for the comparison of diseases between different genes. In turn, we were then able to measure the phenotypic associations of disease genes for which we calculated a quantitative value that is based on their shared phenotypic features. The results showed that genes sharing more similar phenotypes have a stronger tendency for functional interactions, proving the usefulness of phenotype similarity values in disease gene network analysis. We then constructed a functional network of mitochondrial genes and discovered a higher connectivity for non-disease than for disease genes, and a tendency of disease genes to interact with each other. Utilizing these differences, we propose 168 candidate genes that resemble the characteristic interaction patterns of mitochondrial disease genes. Through their network associations, the candidates are further prioritized for the study of specific disorders such as optic neuropathies and Parkinson disease. Most mitochondrial disease phenotypes involve several clinical categories including neurologic, metabolic, and gastrointestinal disorders, which might indicate the effects of gene defects within the mitochondrial system. The accompanying knowledgebase (http://www.mitophenome.org/) supports the study of clinical diseases and associated genes. PMID:19390613
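
    This record refers to a quantitative value based on the phenotypic features two genes share, without stating the formula. One common way to express such overlap is the Jaccard index; the sketch below uses it purely as an assumed stand-in, and the gene labels and phenotype terms are invented for illustration.

```python
def jaccard_similarity(features_a, features_b):
    """Shared features divided by all features seen for either gene."""
    a, b = set(features_a), set(features_b)
    union = a | b
    if not union:
        return 0.0  # no annotations at all: treat as no similarity
    return len(a & b) / len(union)

# Hypothetical phenotype annotations for two placeholder genes.
gene_x_phenotypes = {"Ataxia", "Optic Atrophy", "Acidosis, Lactic"}
gene_y_phenotypes = {"Optic Atrophy", "Acidosis, Lactic", "Deafness"}
similarity = jaccard_similarity(gene_x_phenotypes, gene_y_phenotypes)
```

    Two of the four distinct terms are shared here, so the value is 0.5; pairs with higher values would be the ones expected to show more functional interactions under the finding reported above.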

  7. Draft genome sequence of marine-derived Streptomyces sp. TP-A0598, a producer of anti-MRSA antibiotic lydicamycins.

    PubMed

    Komaki, Hisayuki; Ichikawa, Natsuko; Hosoyama, Akira; Fujita, Nobuyuki; Igarashi, Yasuhiro

    2015-01-01

    Streptomyces sp. TP-A0598, isolated from seawater, produces lydicamycin, a structurally unique type I polyketide bearing two nitrogen-containing five-membered rings, and four congeners TPU-0037-A, -B, -C, and -D. We herein report the 8 Mb draft genome sequence of this strain, together with classification and features of the organism and generation, annotation and analysis of the genome sequence. The genome encodes 7,240 putative ORFs, of which 4,450 ORFs were assigned COG categories. Also, 66 tRNA genes and one rRNA operon were identified. The genome contains eight gene clusters involved in the production of polyketides and nonribosomal peptides. Among them, a PKS/NRPS gene cluster was assigned to be responsible for lydicamycin biosynthesis and a plausible biosynthetic pathway was proposed on the basis of gene function prediction. These genome sequence data will facilitate probing the potential of secondary metabolism in marine-derived Streptomyces.

  8. Expressed MHC class II genes in sea otters (Enhydra lutris) from geographically disparate populations

    USGS Publications Warehouse

    Bowen, Lizabeth; Aldridge, B.M.; Miles, A. Keith; Stott, J.L.

    2006-01-01

    The major histocompatibility complex (MHC) is central to maintaining the immunologic vigor of individuals and populations. Classical MHC class II genes were targeted for partial sequencing in sea otters (Enhydra lutris) from populations in California, Washington, and Alaska. Sequences derived from sea otter peripheral blood leukocyte mRNAs were similar to those classified as DQA, DQB, DRA, and DRB in other species. Comparisons of the derived amino acid compositions supported the classification of these as functional molecules from at least one DQA, DQB, and DRA locus and at least two DRB loci. While limited in scope, phylogenetic analysis of the DRB peptide‐binding region suggested the possible existence of distinct clades demarcated by geographic region. These preliminary findings support the need for additional MHC gene sequencing and expansion to a comprehensive study targeting additional otters.

  9. Sorting Five Human Tumor Types Reveals Specific Biomarkers and Background Classification Genes.

    PubMed

    Roche, Kimberly E; Weinstein, Marvin; Dunwoodie, Leland J; Poehlman, William L; Feltus, Frank A

    2018-05-25

    We applied two state-of-the-art, knowledge independent data-mining methods - Dynamic Quantum Clustering (DQC) and t-Distributed Stochastic Neighbor Embedding (t-SNE) - to data from The Cancer Genome Atlas (TCGA). We showed that the RNA expression patterns for a mixture of 2,016 samples from five tumor types can sort the tumors into groups enriched for relevant annotations including tumor type, gender, tumor stage, and ethnicity. DQC feature selection analysis discovered 48 core biomarker transcripts that clustered tumors by tumor type. When these transcripts were removed, the geometry of tumor relationships changed, but it was still possible to classify the tumors using the RNA expression profiles of the remaining transcripts. We continued to remove the top biomarkers for several iterations and performed cluster analysis. Even though the most informative transcripts were removed from the cluster analysis, the sorting ability of remaining transcripts remained strong after each iteration. Further, in some iterations we detected a repeating pattern of biological function that wasn't detectable with the core biomarker transcripts present. This suggests the existence of a "background classification" potential in which the pattern of gene expression after continued removal of "biomarker" transcripts could still classify tumors in agreement with the tumor type.

  10. Network-constrained group lasso for high-dimensional multinomial classification with application to cancer subtype prediction.

    PubMed

    Tian, Xinyu; Wang, Xuefeng; Chen, Jun

    2014-01-01

    The classic multinomial logit model, commonly used in multiclass regression problems, is restricted to few predictors and does not take into account the relationship among variables. It has limited use for genomic data, where the number of genomic features far exceeds the sample size. Genomic features such as gene expressions are usually related by an underlying biological network. Efficient use of the network information is important to improve classification performance as well as the biological interpretability. We proposed a multinomial logit model that is capable of addressing both the high dimensionality of predictors and the underlying network information. Group lasso was used to induce model sparsity, and a network constraint was imposed to induce the smoothness of the coefficients with respect to the underlying network structure. To deal with the non-smoothness of the objective function in optimization, we developed a proximal gradient algorithm for efficient computation. The proposed model was compared to models with no prior structure information in both simulations and a problem of cancer subtype prediction with real TCGA (The Cancer Genome Atlas) gene expression data. The network-constrained model outperformed the traditional ones in both cases.
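
    The record above combines a group lasso penalty with a proximal gradient algorithm. For a plain group lasso penalty, the proximal step is a group-wise soft-thresholding operator, sketched below. Only that operator is shown: the network constraint and the multinomial likelihood of the paper are not reproduced, and the function name and example numbers are illustrative assumptions.

```python
import math

def group_soft_threshold(group, threshold):
    """Proximal operator of threshold * (Euclidean norm of the group)."""
    norm = math.sqrt(sum(v * v for v in group))
    if norm <= threshold:
        # The whole group is shrunk to zero, which is what yields group sparsity.
        return [0.0] * len(group)
    scale = 1.0 - threshold / norm
    return [scale * v for v in group]

shrunk = group_soft_threshold([3.0, 4.0], 1.0)   # norm 5, so the scale factor is 0.8
zeroed = group_soft_threshold([0.3, 0.4], 1.0)   # norm 0.5 <= 1, so all zeros
```

    In a proximal gradient loop this operator would be applied to each coefficient group after a gradient step on the smooth part of the objective.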

  11. Classification of ductal carcinoma in situ by gene expression profiling.

    PubMed

    Hannemann, Juliane; Velds, Arno; Halfwerk, Johannes B G; Kreike, Bas; Peterse, Johannes L; van de Vijver, Marc J

    2006-01-01

    Ductal carcinoma in situ (DCIS) is characterised by the intraductal proliferation of malignant epithelial cells. Several histological classification systems have been developed, but assessing the histological type/grade of DCIS lesions is still challenging, making treatment decisions based on these features difficult. To obtain insight in the molecular basis of the development of different types of DCIS and its progression to invasive breast cancer, we have studied differences in gene expression between different types of DCIS and between DCIS and invasive breast carcinomas. Gene expression profiling using microarray analysis has been performed on 40 in situ and 40 invasive breast cancer cases. DCIS cases were classified as well- (n = 6), intermediately (n = 18), and poorly (n = 14) differentiated type. Of the 40 invasive breast cancer samples, five samples were grade I, 11 samples were grade II, and 24 samples were grade III. Using two-dimensional hierarchical clustering, the basal-like type, ERB-B2 type, and the luminal-type tumours originally described for invasive breast cancer could also be identified in DCIS. Using supervised classification, we identified a gene expression classifier of 35 genes, which differed between DCIS and invasive breast cancer; a classifier of 43 genes could be identified separating between well- and poorly differentiated DCIS samples.

  12. Classification of ductal carcinoma in situ by gene expression profiling

    PubMed Central

    Hannemann, Juliane; Velds, Arno; Halfwerk, Johannes BG; Kreike, Bas; Peterse, Johannes L; van de Vijver, Marc J

    2006-01-01

    Introduction: Ductal carcinoma in situ (DCIS) is characterised by the intraductal proliferation of malignant epithelial cells. Several histological classification systems have been developed, but assessing the histological type/grade of DCIS lesions is still challenging, making treatment decisions based on these features difficult. To obtain insight in the molecular basis of the development of different types of DCIS and its progression to invasive breast cancer, we have studied differences in gene expression between different types of DCIS and between DCIS and invasive breast carcinomas. Methods: Gene expression profiling using microarray analysis has been performed on 40 in situ and 40 invasive breast cancer cases. Results: DCIS cases were classified as well- (n = 6), intermediately (n = 18), and poorly (n = 14) differentiated type. Of the 40 invasive breast cancer samples, five samples were grade I, 11 samples were grade II, and 24 samples were grade III. Using two-dimensional hierarchical clustering, the basal-like type, ERB-B2 type, and the luminal-type tumours originally described for invasive breast cancer could also be identified in DCIS. Conclusion: Using supervised classification, we identified a gene expression classifier of 35 genes, which differed between DCIS and invasive breast cancer; a classifier of 43 genes could be identified separating between well- and poorly differentiated DCIS samples. PMID:17069663

  13. Fuzzy support vector machine: an efficient rule-based classification technique for microarrays.

    PubMed

    Hajiloo, Mohsen; Rabiee, Hamid R; Anooshahpour, Mahdi

    2013-01-01

    The abundance of gene expression microarray data has led to the development of machine learning algorithms applicable for tackling disease diagnosis, disease prognosis, and treatment selection problems. However, these algorithms often produce classifiers with weaknesses in terms of accuracy, robustness, and interpretability. This paper introduces the fuzzy support vector machine, a learning algorithm based on a combination of fuzzy classifiers and kernel machines for microarray classification. Experimental results on public leukemia, prostate, and colon cancer datasets show that the fuzzy support vector machine applied in combination with filter or wrapper feature selection methods develops a robust model with higher accuracy than conventional microarray classification models such as support vector machine, artificial neural network, decision trees, k nearest neighbors, and diagonal linear discriminant analysis. Furthermore, the interpretable rule base inferred from the fuzzy support vector machine helps extract biological knowledge from microarray data. The fuzzy support vector machine, as a new classification model with high generalization power, robustness, and good interpretability, seems to be a promising tool for gene expression microarray classification.

  14. Transcriptional profiles of Arabidopsis stomataless mutants reveal developmental and physiological features of life in the absence of stomata

    PubMed Central

    de Marcos, Alberto; Triviño, Magdalena; Pérez-Bueno, María Luisa; Ballesteros, Isabel; Barón, Matilde; Mena, Montaña; Fenoll, Carmen

    2015-01-01

    Loss of function of the positive stomata development regulators SPCH or MUTE in Arabidopsis thaliana renders stomataless plants; spch-3 and mute-3 mutants are extreme dwarfs, but produce cotyledons and tiny leaves, providing a system to interrogate plant life in the absence of stomata. To this end, we compared their cotyledon transcriptomes with that of wild-type plants. K-means clustering of differentially expressed genes generated four clusters: clusters 1 and 2 grouped genes commonly regulated in the mutants, while clusters 3 and 4 contained genes distinctively regulated in mute-3. Classification in functional categories and metabolic pathways of genes in clusters 1 and 2 suggested that both mutants had depressed secondary, nitrogen and sulfur metabolisms, while only a few photosynthesis-related genes were down-regulated. In situ quenching analysis of chlorophyll fluorescence revealed limited inhibition of photosynthesis. This and other fluorescence measurements matched the mutant transcriptomic features. Differential transcriptomes of both mutants were enriched in growth-related genes, including known stomata development regulators, which paralleled their epidermal phenotypes. Analysis of cluster 3 was not informative for developmental aspects of mute-3. Cluster 4 comprised genes differentially up-regulated in mute-3, 35% of which were direct targets for SPCH and may relate to the unique cell types of mute-3. A screen of T-DNA insertion lines in genes differentially expressed in the mutants identified a gene putatively involved in stomata development. A collection of lines for conditional overexpression of transcription factors differentially expressed in the mutants rendered distinct epidermal phenotypes, suggesting that these proteins may be novel stomatal development regulators. Thus, our transcriptome analysis represents a useful source of new genes for the study of stomata development and for characterizing physiology and growth in the absence of stomata. PMID:26157447

  15. Network-Induced Classification Kernels for Gene Expression Profile Analysis

    PubMed Central

    Dror, Gideon; Shamir, Ron

    2012-01-01

    Computational classification of gene expression profiles into distinct disease phenotypes has been highly successful to date. Still, robustness, accuracy, and biological interpretation of the results have been limited, and it was suggested that use of protein interaction information jointly with the expression profiles can improve the results. Here, we study three aspects of this problem. First, we show that interactions are indeed relevant by showing that co-expressed genes tend to be closer in the network of interactions. Second, we show that the improved performance of one extant method utilizing expression and interactions is not really due to the biological information in the network, while in another method this is not the case. Finally, we develop a new kernel method, called NICK, that integrates network and expression data for SVM classification, and demonstrate that overall it achieves better results than extant methods while running two orders of magnitude faster. PMID:22697242

  16. Transcriptomic markers meet the real world: finding diagnostic signatures of corticosteroid treatment in commercial beef samples

    PubMed Central

    2012-01-01

    Background: The use of growth-promoters in beef cattle, despite the EU ban, remains a frequent practice. The use of transcriptomic markers has already been proposed to identify indirect evidence of anabolic hormone treatment. So far, such an approach has been tested in experimentally treated animals. Here, for the first time commercial samples were analyzed. Results: Quantitative determination of Dexamethasone (DEX) residues in the urine collected at the slaughterhouse was performed by Liquid Chromatography-Mass Spectrometry (LC-MS). DNA-microarray technology was used to obtain transcriptomic profiles of skeletal muscle in commercial samples and negative controls. LC-MS confirmed the presence of low levels of DEX residues in the urine of the commercial samples suspect for histological classification. Principal Component Analysis (PCA) on microarray data identified two clusters of samples. One cluster included negative controls and a subset of commercial samples, while a second cluster included part of the specimens collected at the slaughterhouse together with positives for corticosteroid treatment based on thymus histology and LC-MS. Functional analysis of the differentially expressed genes (3961) between the two groups provided further evidence that animals clustering with positive samples might have been treated with corticosteroids. These suspect samples could be reliably classified with a specific classification tool (Prediction Analysis of Microarray) using just two genes. Conclusions: Despite broad variation observed in gene expression profiles, the present study showed that DNA-microarrays can be used to find transcriptomic signatures of putative anabolic treatments and that gene expression markers could represent a useful screening tool. PMID:23110699

  17. Correlation of Biomarker Expression in Colonic Mucosa with Disease Phenotype in Crohn's Disease and Ulcerative Colitis.

    PubMed

    Bruno, Maria E C; Rogier, Eric W; Arsenescu, Razvan I; Flomenhoft, Deborah R; Kurkjian, Cathryn J; Ellis, Gavin I; Kaetzel, Charlotte S

    2015-10-01

    Inflammatory bowel diseases (IBD), including Crohn's disease (CD) and ulcerative colitis (UC), are characterized by chronic intestinal inflammation due to immunological, microbial, and environmental factors in genetically predisposed individuals. Advances in the diagnosis, prognosis, and treatment of IBD require the identification of robust biomarkers that can be used for molecular classification of diverse disease presentations. We previously identified five genes, RELA, TNFAIP3 (A20), PIGR, TNF, and IL8, whose mRNA levels in colonic mucosal biopsies could be used in a multivariate analysis to classify patients with CD based on disease behavior and responses to therapy. We compared expression of these five biomarkers in IBD patients classified as having CD or UC, and in healthy controls. Patients with CD were characterized as having decreased median expression of TNFAIP3, PIGR, and TNF in non-inflamed colonic mucosa as compared to healthy controls. By contrast, UC patients exhibited decreased expression of PIGR and elevated expression of IL8 in colonic mucosa compared to healthy controls. A multivariate analysis combining mRNA levels for all five genes resulted in segregation of individuals based on disease presentation (CD vs. UC) as well as severity, i.e., patients in remission versus those with acute colitis at the time of biopsy. We propose that this approach could be used as a model for molecular classification of IBD patients, which could further be enhanced by the inclusion of additional genes that are identified by functional studies, global gene expression analyses, and genome-wide association studies.

  18. Pathway activity inference for multiclass disease classification through a mathematical programming optimisation framework.

    PubMed

    Yang, Lingjian; Ainali, Chrysanthi; Tsoka, Sophia; Papageorgiou, Lazaros G

    2014-12-05

    Applying machine learning methods to microarray gene expression profiles for disease classification problems is a popular way to derive biomarkers, i.e. sets of genes that can predict disease state or outcome. Traditional approaches in which the expression of each gene was treated independently suffer from low prediction accuracy and difficulty of biological interpretation. Current research efforts focus on integrating information on protein interactions through biochemical pathway datasets with expression profiles to propose pathway-based classifiers that can enhance disease diagnosis and prognosis. As most of the pathway activity inference methods in the literature are either unsupervised or applied to two-class datasets, there is good scope to address such limitations by proposing novel methodologies. A supervised multiclass pathway activity inference method using optimisation techniques is reported. For each pathway expression dataset, patterns of its constituent genes are summarised into one composite feature, termed pathway activity, and a novel mathematical programming model is proposed to infer this feature as a weighted linear summation of the expression of its constituent genes. Gene weights are determined by the optimisation model in such a way that the resulting pathway activity has the optimal discriminative power with regard to disease phenotypes. Classification is then performed on the resulting low-dimensional pathway activity profile. The model was evaluated on a variety of published gene expression profiles that cover different types of disease. We show that not only does it improve classification accuracy, but it can also perform well on multiclass disease datasets, a limitation of other approaches from the literature. Desirable features of the model include the ability to control the maximum number of genes that may participate in determining pathway activity, which may be pre-specified by the user. Overall, this work highlights the potential of building pathway-based multi-phenotype classifiers for accurate disease diagnosis and prognosis problems.
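
    The "weighted linear summation" in this abstract can be illustrated with a short sketch. It only shows how a pathway activity value per sample would be computed once gene weights are known; the weights, gene names and expression values below are hypothetical, and the optimisation model that the authors use to choose the weights is not reproduced here.

```python
def pathway_activity(expression, weights):
    """Collapse one pathway into a single feature per sample.

    expression: dict mapping gene name -> list of per-sample expression values.
    weights:    dict mapping gene name -> weight; genes absent from this dict
                get weight 0, i.e. they do not participate in the activity.
    """
    n_samples = len(next(iter(expression.values())))
    return [
        sum(weights.get(gene, 0.0) * values[s] for gene, values in expression.items())
        for s in range(n_samples)
    ]


# Hypothetical pathway with three genes measured in two samples.
expression = {"G1": [1.0, 2.0], "G2": [0.5, 0.5], "G3": [3.0, 1.0]}
# Hypothetical weights; G2 is left out, mimicking a cap on participating genes.
weights = {"G1": 0.5, "G3": -1.0}

activity = pathway_activity(expression, weights)
print(activity)
```

    A classifier would then be trained on one such activity value per pathway instead of on every gene, which is what makes the resulting profile low-dimensional.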

  19. Disease gene classification with metagraph representations.

    PubMed

    Kircali Ata, Sezin; Fang, Yuan; Wu, Min; Li, Xiao-Li; Xiao, Xiaokui

    2017-12-01

    Protein-protein interaction (PPI) networks play an important role in studying the functional roles of proteins, including their association with diseases. However, protein interaction networks are not sufficient without the support of additional biological knowledge for proteins such as their molecular functions and biological processes. To complement and enrich PPI networks, we propose to exploit biological properties of individual proteins. More specifically, we integrate keywords describing protein properties into the PPI network, and construct a novel PPI-Keywords (PPIK) network consisting of both proteins and keywords as two different types of nodes. As disease proteins tend to have similar topological characteristics on the PPIK network, we further propose to represent proteins with metagraphs. Unlike a traditional network motif or subgraph, a metagraph can capture a particular topological arrangement involving the interactions/associations between both proteins and keywords. Based on the novel metagraph representations for proteins, we further build classifiers for disease protein classification through supervised learning. Our experiments on three different PPI databases demonstrate that the proposed method consistently improves disease protein prediction across various classifiers, by 15.3% in AUC on average. It outperforms the baselines including the diffusion-based methods (e.g., RWR) and the module-based methods by 13.8-32.9% for overall disease protein prediction. For predicting breast cancer genes, it outperforms RWR, PRINCE and the module-based baselines by 6.6-14.2%. Finally, our predictions also turn out to have better correlations with literature findings from PubMed. Copyright © 2017 Elsevier Inc. All rights reserved.

  20. Promising personalized therapeutic options for diffuse large B-cell Lymphoma Subtypes with oncogene addictions.

    PubMed

    Steinhardt, James J; Gartenhaus, Ronald B

    2012-09-01

    Currently, two major classification systems segregate diffuse large B-cell lymphoma (DLBCL) into subtypes based on gene expression profiles and provide great insights about the oncogenic mechanisms that may be crucial for lymphomagenesis as well as prognostic information regarding response to current therapies. However, these current classification systems primarily look at expression and not dependency and are thus limited to inductive or probabilistic reasoning when evaluating alternative therapeutic options. The development of a deductive classification system that identifies subtypes in which all patients with a given phenotype require the same oncogenic drivers, and would therefore have a similar response to a rational therapy targeting the essential drivers, would significantly advance the treatment of DLBCL. This review highlights the putative drivers identified as well as the work done to identify potentially dependent populations. These studies integrated genomic analysis and functional screens to provide a rationale for targeted therapies within defined populations. Personalizing treatments by identifying patients with oncogenic dependencies via genotyping and specifically targeting the responsible drivers may constitute a novel approach for the treatment of DLBCL. ©2012 AACR.

  1. Genome-wide analyses of late pollen-preferred genes conserved in various rice cultivars and functional identification of a gene involved in the key processes of late pollen development.

    PubMed

    Moon, Sunok; Oo, Moe Moe; Kim, Backki; Koh, Hee-Jong; Oh, Sung Aeong; Yi, Gihwan; An, Gynheung; Park, Soon Ki; Jung, Ki-Hong

    2018-04-23

    Understanding late pollen development, including the maturation and pollination process, is a key component in maintaining crop yields. Transcriptome data obtained through microarray or RNA-seq technologies can provide useful insight into those developmental processes. Six series of microarray data from a public transcriptome database, the Gene Expression Omnibus of the National Center for Biotechnology Information, are related to anther and pollen development. We performed a systematic and functional study across the rice genome of genes that are preferentially expressed in the late stages of pollen development, including maturation and germination. By comparing the transcriptomes of sporophytes and male gametes over time, we identified 627 late pollen-preferred genes that are conserved among japonica and indica rice cultivars. Functional classification analysis with a MapMan tool kit revealed a significant association between cell wall organization/metabolism and mature pollen grains. Comparative analysis of rice and Arabidopsis demonstrated that genes involved in cell wall modifications and the metabolism of major carbohydrates are unique to rice. We used the GUS reporter system to monitor the expression of eight of those genes. In addition, we evaluated the significance of our candidate genes, using a T-DNA insertional mutant population and the CRISPR/Cas9 system. Mutants from T-DNA insertion and CRISPR/Cas9 systems of a rice gene encoding glycerophosphoryl diester phosphodiesterase are defective in their male gamete transfer. Through the global analyses of the late pollen-preferred genes from rice, we found several biological features of these genes. First, biological processes related to cell wall organization and modification are over-represented in these genes to support rapid tube growth. Second, comparative analysis of late pollen-preferred genes between rice and Arabidopsis provides significant insight into the evolutionary divergence in cell wall biogenesis and storage reserves of pollen. In addition, these candidates might be useful targets for future examinations of late pollen development, and will be a valuable resource for accelerating the understanding of molecular mechanisms for pollen maturation and germination processes in rice.

  2. Non-Compact Cardiomyopathy or Ventricular Non-Compact Syndrome?

    PubMed Central

    2014-01-01

    Ventricular myocardial non-compaction has been recognized and defined as a genetic cardiomyopathy by the American Heart Association since 2006. The debate over the nomenclature and pathogenesis of this kind of ventricular myocardial non-compaction continues. It is characterized by regional ventricular wall thickening and deep trabecular recesses, is often complicated by chronic heart failure, arrhythmia and thromboembolism, and usually overlaps in genetics and phenotype with other kinds of genetic or mixed cardiomyopathy. Proper classification and correct nomenclature of the non-compact ventricles will contribute to a precise and complete understanding of the etiology and its related pathophysiological mechanisms, allowing better risk stratification and more personalized therapy. The genetic heterogeneity, the phenotypic overlap, and the variety in histopathological, electromechanical and clinical presentation together indicate that some of the cardiomyopathies might simply be different consequences of variations in myocardial development, related to mutation and phenotype of one gene or a group of genes and induced by interacting, disturbed processes of gene modulation at different stages of gene function and expression, as well as by other etiologies. This review aims to establish a new concept of "ventricular non-compaction syndrome" based on the demonstration of the current findings of etiology, epidemiology, histopathology and echocardiography related to the disorder of ventricular myocardial compaction and myocardial electromechanical function development. PMID:25580189

  3. Identification of the chitinase genes from the diamondback moth, Plutella xylostella.

    PubMed

    Liao, Z H; Kuo, T C; Kao, C H; Chou, T M; Kao, Y H; Huang, R N

    2016-12-01

    Chitinases have an indispensable function in chitin metabolism and are well characterized in numerous insect species. Although the diamondback moth (DBM) Plutella xylostella, which has a high reproductive potential, short generation time, and characteristic adaptation to adverse environments, has become one of the most serious pests of cruciferous plants worldwide, the information on the chitinases of the moth is presently limited. In the present study, using degenerated polymerase chain reaction (PCR) and rapid amplification of cDNA ends-PCR strategies, four chitinase genes of P. xylostella were cloned, and an exhaustive search was conducted for chitinase-like sequences from the P. xylostella genome and transcriptomic database. Based on the domain analysis of the deduced amino acid sequences and the phylogenetic analysis of the catalytic domain sequences, we identified 15 chitinase genes from P. xylostella. Two of the gut-specific chitinases did not cluster with any of the known phylogenetic groups of chitinases and might be in a new group of the chitinase family. Moreover, in our study, group VIII chitinase was not identified. The structures, classifications and expression patterns of the chitinases of P. xylostella were further delineated, and with this information, further investigations on the functions of chitinase genes in DBM could be facilitated.

  4. Transcriptional Regulation of Fruit Ripening by Tomato FRUITFULL Homologs and Associated MADS Box Proteins

    PubMed Central

    Fujisawa, Masaki; Shima, Yoko; Nakagawa, Hiroyuki; Kitagawa, Mamiko; Kimbara, Junji; Nakano, Toshitsugu; Kasumi, Takafumi; Ito, Yasuhiro

    2014-01-01

    The tomato (Solanum lycopersicum) MADS box FRUITFULL homologs FUL1 and FUL2 act as key ripening regulators and interact with the master regulator MADS box protein RIPENING INHIBITOR (RIN). Here, we report the large-scale identification of direct targets of FUL1 and FUL2 by transcriptome analysis of FUL1/FUL2 suppressed fruits and chromatin immunoprecipitation coupled with microarray analysis (ChIP-chip) targeting tomato gene promoters. The ChIP-chip and transcriptome analysis identified FUL1/FUL2 target genes that contain at least one genomic region bound by FUL1 or FUL2 (regions that occur mainly in their promoters) and exhibit FUL1/FUL2-dependent expression during ripening. These analyses identified 860 direct FUL1 targets and 878 direct FUL2 targets; this set of genes includes both direct targets of RIN and nontargets of RIN. Functional classification of the FUL1/FUL2 targets revealed that these FUL homologs function in many biological processes via the regulation of ripening-related gene expression, both in cooperation with and independent of RIN. Our in vitro assay showed that the FUL homologs, RIN, and tomato AGAMOUS-LIKE1 form DNA binding complexes, suggesting that tetramer complexes of these MADS box proteins are mainly responsible for the regulation of ripening. PMID:24415769

  5. Muscular Dystrophy with Ribitol-Phosphate Deficiency: A Novel Post-Translational Mechanism in Dystroglycanopathy

    PubMed Central

    Kanagawa, Motoi; Toda, Tatsushi

    2017-01-01

    Muscular dystrophy is a group of genetic disorders characterized by progressive muscle weakness. In the early 2000s, a new classification of muscular dystrophy, dystroglycanopathy, was established. Dystroglycanopathy often associates with abnormalities in the central nervous system. Currently, at least eighteen genes have been identified that are responsible for dystroglycanopathy, and despite its genetic heterogeneity, its common biochemical feature is abnormal glycosylation of alpha-dystroglycan. Abnormal glycosylation of alpha-dystroglycan reduces its binding activities to ligand proteins, including laminins. In just the last few years, remarkable progress has been made in determining the sugar chain structures and gene functions associated with dystroglycanopathy. The normal sugar chain contains tandem structures of ribitol-phosphate, a pentose alcohol that was previously unknown in humans. The dystroglycanopathy genes fukutin, fukutin-related protein (FKRP), and isoprenoid synthase domain-containing protein (ISPD) encode essential enzymes for the synthesis of this structure: fukutin and FKRP transfer ribitol-phosphate onto sugar chains of alpha-dystroglycan, and ISPD synthesizes CDP-ribitol, a donor substrate for fukutin and FKRP. These findings resolved long-standing questions and established a disease subgroup that is ribitol-phosphate deficient, which describes a large population of dystroglycanopathy patients. Here, we review the history of dystroglycanopathy, the properties of the sugar chain structure of alpha-dystroglycan, dystroglycanopathy gene functions, and therapeutic strategies. PMID:29081423

  6. Ensemble Feature Learning of Genomic Data Using Support Vector Machine

    PubMed Central

    Anaissi, Ali; Goyal, Madhu; Catchpoole, Daniel R.; Braytee, Ali; Kennedy, Paul J.

    2016-01-01

    The identification of a subset of genes having the ability to capture the necessary information to distinguish classes of patients is crucial in bioinformatics applications. Ensemble and bagging methods have been shown to work effectively in the process of gene selection and classification. Testament to that is random forest, which combines random decision trees with bagging to improve overall feature selection and classification accuracy. Surprisingly, the adoption of these methods in support vector machines has only recently received attention, and mostly for classification rather than gene selection. This paper introduces an ensemble SVM-Recursive Feature Elimination (ESVM-RFE) for gene selection that follows the concepts of ensemble and bagging used in random forest but adopts the backward elimination strategy which is the rationale of the RFE algorithm. The rationale is that building ensemble SVM models from randomly drawn bootstrap samples of the training set produces different feature rankings, which are subsequently aggregated into one feature ranking. As a result, the decision to eliminate features is based upon the ranking of multiple SVM models instead of one particular model. Moreover, this approach addresses the problem of imbalanced datasets by constructing a nearly balanced bootstrap sample. Our experiments show that ESVM-RFE for gene selection substantially increased the classification performance on five microarray datasets compared to state-of-the-art methods. Experiments on the childhood leukaemia dataset show that ESVM-RFE achieves on average 9% better accuracy than SVM-RFE, and 5% better than a random forest based approach. The genes selected by the ESVM-RFE algorithm were further explored with Singular Value Decomposition (SVD), which revealed significant clusters within the selected data. PMID:27304923
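
    The control flow described in this abstract (balanced bootstrap samples, one feature ranking per model, aggregation of the rankings, backward elimination of the worst feature) can be sketched as follows. This is not the authors' ESVM-RFE: to stay dependency-free, the per-model SVM weight is replaced by a simple stand-in score (absolute difference of class means), and the function names, the number of models and the toy data are all assumptions made for illustration. It handles two classes only.

```python
import random


def balanced_bootstrap(X, y, rng):
    """Draw, with replacement, the same number of samples from each class."""
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)
    n = min(len(indices) for indices in by_class.values())
    idx = [rng.choice(indices) for indices in by_class.values() for _ in range(n)]
    return [X[i] for i in idx], [y[i] for i in idx]


def feature_scores(X, y, features):
    """Stand-in for |SVM weight|: absolute difference of the two class means."""
    a, b = sorted(set(y))  # binary problems only
    scores = {}
    for f in features:
        va = [row[f] for row, lab in zip(X, y) if lab == a]
        vb = [row[f] for row, lab in zip(X, y) if lab == b]
        scores[f] = abs(sum(va) / len(va) - sum(vb) / len(vb))
    return scores


def ensemble_rfe(X, y, n_keep, n_models=25, seed=0):
    """Backward elimination driven by rankings aggregated over bootstrap models."""
    rng = random.Random(seed)
    features = list(range(len(X[0])))
    while len(features) > n_keep:
        rank_sum = {f: 0 for f in features}
        for _ in range(n_models):
            Xb, yb = balanced_bootstrap(X, y, rng)
            scores = feature_scores(Xb, yb, features)
            ordered = sorted(features, key=lambda f: scores[f], reverse=True)
            for rank, f in enumerate(ordered):  # rank 0 = most informative
                rank_sum[f] += rank
        # Eliminate the feature with the worst aggregated rank.
        features.remove(max(features, key=lambda f: rank_sum[f]))
    return features


# Toy, imbalanced data: only feature 0 separates the classes.
X = [[5.0, 0.1, 1.0], [5.2, 0.3, 0.9], [4.9, 0.2, 1.1], [5.1, 0.1, 1.0],
     [1.0, 0.2, 1.0], [1.1, 0.3, 0.9]]
y = [1, 1, 1, 1, 0, 0]
selected = ensemble_rfe(X, y, n_keep=1)
print(selected)
```

    Swapping `feature_scores` for the squared weights of a linear SVM fitted on each bootstrap sample would bring the sketch closer to the method the abstract names.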

  7. Dystonia: an update on phenomenology, classification, pathogenesis and treatment.

    PubMed

    Balint, Bettina; Bhatia, Kailash P

    2014-08-01

    This article will highlight recent advances in dystonia with focus on clinical aspects such as the new classification, syndromic approach, new gene discoveries and genotype-phenotype correlations. Broadening of phenotype of some of the previously described hereditary dystonias and environmental risk factors and trends in treatment will be covered. Based on phenomenology, a new consensus update on the definition, phenomenology and classification of dystonia and a syndromic approach to guide diagnosis have been proposed. Terminology has changed and 'isolated dystonia' is used wherein dystonia is the only motor feature apart from tremor, and the previously called heredodegenerative dystonias and dystonia plus syndromes are now subsumed under 'combined dystonia'. The recently discovered genes ANO3, GNAL and CIZ1 appear not to be a common cause of adult-onset cervical dystonia. Clinical and genetic heterogeneity underlie myoclonus-dystonia, dopa-responsive dystonia and deafness-dystonia syndrome. ALS2 gene mutations are a newly recognized cause for combined dystonia. The phenotypic and genotypic spectra of ATP1A3 mutations have considerably broadened. Two new genome-wide association studies identified new candidate genes. A retrospective analysis suggested complicated vaginal delivery as a modifying risk factor in DYT1. Recent studies confirm lasting therapeutic effects of deep brain stimulation in isolated dystonia, good treatment response in myoclonus-dystonia, and suggest that early treatment correlates with a better outcome. Phenotypic classification continues to be important to recognize particular forms of dystonia and this includes syndromic associations. There are a number of genes underlying isolated or combined dystonia and there will be further new discoveries with the advances in genetic technologies such as exome and whole-genome sequencing. The identification of new genes will facilitate better elucidation of pathogenetic mechanisms and possible corrective therapies.

  8. HPMCD: the database of human microbial communities from metagenomic datasets and microbial reference genomes.

    PubMed

    Forster, Samuel C; Browne, Hilary P; Kumar, Nitin; Hunt, Martin; Denise, Hubert; Mitchell, Alex; Finn, Robert D; Lawley, Trevor D

    2016-01-04

    The Human Pan-Microbe Communities (HPMC) database (http://www.hpmcd.org/) provides a manually curated, searchable, metagenomic resource to facilitate investigation of human gastrointestinal microbiota. Over the past decade, the application of metagenome sequencing to elucidate the microbial composition and functional capacity present in the human microbiome has revolutionized many concepts in our basic biology. When sufficient high quality reference genomes are available, whole genome metagenomic sequencing can provide direct biological insights and high-resolution classification. The HPMC database provides species level, standardized phylogenetic classification of over 1800 human gastrointestinal metagenomic samples. This is achieved by combining a manually curated list of bacterial genomes from human faecal samples with over 21000 additional reference genomes representing bacteria, viruses, archaea and fungi with manually curated species classification and enhanced sample metadata annotation. A user-friendly, web-based interface provides the ability to search for (i) microbial groups associated with health or disease state, (ii) health or disease states and community structure associated with a microbial group, (iii) the enrichment of a microbial gene or sequence and (iv) enrichment of a functional annotation. The HPMC database enables detailed analysis of human microbial communities and supports research from basic microbiology and immunology to therapeutic development in human health and disease. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.

  9. Genome-Wide Analyses and Functional Classification of Proline Repeat-Rich Proteins: Potential Role of eIF5A in Eukaryotic Evolution

    PubMed Central

    Mandal, Ajeet; Mandal, Swati; Park, Myung Hee

    2014-01-01

    The eukaryotic translation factor, eIF5A has been recently reported as a sequence-specific elongation factor that facilitates peptide bond formation at consecutive prolines in Saccharomyces cerevisiae, as its ortholog elongation factor P (EF-P) does in bacteria. We have searched the genome databases of 35 representative organisms from six kingdoms of life for PPP (Pro-Pro-Pro) and/or PPG (Pro-Pro-Gly)-encoding genes whose expression is expected to depend on eIF5A. We have made detailed analyses of proteome data of 5 selected species, Escherichia coli, Saccharomyces cerevisiae, Drosophila melanogaster, Mus musculus and Homo sapiens. The PPP and PPG motifs are low in the prokaryotic proteomes. However, their frequencies markedly increase with the biological complexity of eukaryotic organisms, and are higher in newly derived proteins than in those orthologous proteins commonly shared in all species. Ontology classifications of S. cerevisiae and human genes encoding the highest level of polyprolines reveal their strong association with several specific biological processes, including actin/cytoskeletal associated functions, RNA splicing/turnover, DNA binding/transcription and cell signaling. Previously reported phenotypic defects in actin polarity and mRNA decay of eIF5A mutant strains are consistent with the proposed role for eIF5A in the translation of the polyproline-containing proteins. Of all the amino acid tandem repeats (≥3 amino acids), only the proline repeat frequency correlates with functional complexity of the five organisms examined. Taken together, these findings suggest the importance of proline repeat-rich proteins and a potential role for eIF5A and its hypusine modification pathway in the course of eukaryotic evolution. PMID:25364902
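
    The motif search described in this abstract amounts to scanning protein sequences for PPP or PPG. A minimal sketch of that counting step is shown below; the function name and the toy sequences are hypothetical, and the sketch reports only the fraction of sequences containing at least one motif, which is one of several ways such a frequency could be defined.

```python
def fraction_with_motif(sequences, motifs=("PPP", "PPG")):
    """Fraction of protein sequences that contain at least one of the motifs."""
    hits = sum(1 for seq in sequences if any(m in seq for m in motifs))
    return hits / len(sequences)


# Hypothetical one-letter amino acid sequences, not real proteins.
toy_proteome = ["MKPPPLA", "MAGGS", "MTPPGQ", "MLPPAK"]
frac = fraction_with_motif(toy_proteome)
print(frac)
```

    Running the same function over the proteomes of several organisms would give the per-species frequencies that the abstract compares.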

  10. Identification of novel and known oocyte-specific genes using complementary DNA subtraction and microarray analysis in three different species.

    PubMed

    Vallée, Maud; Gravel, Catherine; Palin, Marie-France; Reghenas, Hélène; Stothard, Paul; Wishart, David S; Sirard, Marc-André

    2005-07-01

    The main objective of the present study was to identify novel oocyte-specific genes in three different species: bovine, mouse, and Xenopus laevis. To achieve this goal, two powerful technologies were combined: a polymerase chain reaction (PCR)-based cDNA subtraction, and cDNA microarrays. Three subtractive libraries consisting of 3456 clones were established and enriched for oocyte-specific transcripts. Sequencing analysis of the positive insert-containing clones resulted in the following classification: 53% of the clones corresponded to known cDNAs, 26% were classified as uncharacterized cDNAs, and a final 9% were classified as novel sequences. All these clones were used for cDNA microarray preparation. Results from these microarray analyses revealed that in addition to already known oocyte-specific genes, such as GDF9, BMP15, and ZP, known genes with unknown function in the oocyte were identified, such as a MLF1-interacting protein (MLF1IP), B-cell translocation gene 4 (BTG4), and phosphotyrosine-binding protein (xPTB). Furthermore, 15 novel oocyte-specific genes were validated by reverse transcription-PCR to confirm their preferential expression in the oocyte compared to somatic tissues. The results obtained in the present study confirmed that microarray analysis is a robust technique to identify true positives from the suppressive subtractive hybridization experiment. Furthermore, obtaining oocyte-specific genes from three species simultaneously allowed us to look at important genes that are conserved across species. Further characterization of these novel oocyte-specific genes will lead to a better understanding of the molecular mechanisms related to the unique functions found in the oocyte.

  11. Gene expression profiles in whole blood and associations with metabolic dysregulation in obesity.

    PubMed

    Cox, Amanda J; Zhang, Ping; Evans, Tiffany J; Scott, Rodney J; Cripps, Allan W; West, Nicholas P

    Gene expression data provides one tool to gain further insight into the complex biological interactions linking obesity and metabolic disease. This study examined associations between blood gene expression profiles and metabolic disease in obesity. Whole blood gene expression profiles, performed using the Illumina HT-12v4 Human Expression Beadchip, were compared between (i) individuals with obesity (O) or lean (L) individuals (n=21 each), (ii) individuals with (M) or without (H) Metabolic Syndrome (n=11 each) matched on age and gender. Enrichment of differentially expressed genes (DEG) into biological pathways was assessed using Ingenuity Pathway Analysis. Association between sets of genes from biological pathways considered functionally relevant and Metabolic Syndrome were further assessed using an area under the curve (AUC) and cross-validated classification rate (CR). For OvL, only 50 genes were significantly differentially expressed based on the selected differential expression threshold (1.2-fold, p<0.05). For MvH, 582 genes were significantly differentially expressed (1.2-fold, p<0.05) and pathway analysis revealed enrichment of DEG into a diverse set of pathways including immune/inflammatory control, insulin signalling and mitochondrial function pathways. Gene sets from the mTOR signalling pathways demonstrated the strongest association with Metabolic Syndrome (p=8.1×10⁻⁸; AUC: 0.909, CR: 72.7%). These results support the use of expression profiling in whole blood in the absence of more specific tissue types for investigations of metabolic disease. Using a pathway analysis approach it was possible to identify an enrichment of DEG into biological pathways that could be targeted for in vitro follow-up. Copyright © 2017 Asia Oceania Association for the Study of Obesity. Published by Elsevier Ltd. All rights reserved.

  12. Discovering semantic features in the literature: a foundation for building functional associations

    PubMed Central

    Chagoyen, Monica; Carmona-Saez, Pedro; Shatkay, Hagit; Carazo, Jose M; Pascual-Montano, Alberto

    2006-01-01

    Background: Experimental techniques such as DNA microarray, serial analysis of gene expression (SAGE) and mass spectrometry proteomics, among others, are generating large amounts of data related to genes and proteins at different levels. As in any other experimental approach, it is necessary to analyze these data in the context of previously known information about the biological entities under study. The literature is a particularly valuable source of information for experiment validation and interpretation. Therefore, the development of automated text mining tools to assist in such interpretation is one of the main challenges in current bioinformatics research. Results: We present a method to create literature profiles for large sets of genes or proteins based on common semantic features extracted from a corpus of relevant documents. These profiles can be used to establish pair-wise similarities among genes, utilized in gene/protein classification or can be even combined with experimental measurements. Semantic features can be used by researchers to facilitate the understanding of the commonalities indicated by experimental results. Our approach is based on non-negative matrix factorization (NMF), a machine-learning algorithm for data analysis, capable of identifying local patterns that characterize a subset of the data. The literature is thus used to establish putative relationships among subsets of genes or proteins and to provide coherent justification for this clustering into subsets. We demonstrate the utility of the method by applying it to two independent and vastly different sets of genes. Conclusion: The presented method can create literature profiles from documents relevant to sets of genes. The representation of genes as additive linear combinations of semantic features allows for the exploration of functional associations as well as for clustering, suggesting a valuable methodology for the validation and interpretation of high-throughput experimental data. PMID:16438716
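
    To make "additive linear combinations of semantic features" concrete, here is a minimal NMF sketch using the standard multiplicative update rules for the squared-error objective. The abstract does not say which NMF variant or settings the authors used, so the update rule, the rank `k`, the iteration count and the toy gene-by-term matrix are all assumptions. `V` is factored as `W` (genes by features) times `H` (features by terms), with every entry kept non-negative.

```python
import random


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frob_error(V, W, H):
    """Squared Frobenius norm of V - W*H."""
    WH = matmul(W, H)
    return sum((v - x) ** 2 for rv, rx in zip(V, WH) for v, x in zip(rv, rx))


def nmf(V, k, iters=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix V into W (n x k) and H (k x m)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * a / (d + eps) for h, a, d in zip(rh, rn, rd)]
             for rh, rn, rd in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (d + eps) for w, a, d in zip(rw, rn, rd)]
             for rw, rn, rd in zip(W, num, den)]
    return W, H


# Hypothetical gene-by-term count matrix with two loose blocks of co-occurring terms.
V = [[3, 2, 0, 0],
     [2, 3, 0, 0],
     [0, 0, 4, 1],
     [0, 0, 1, 4]]
W, H = nmf(V, k=2)
print(frob_error(V, W, H))
```

    Each row of `H` plays the role of one semantic feature (a weighting over terms), and each row of `W` says how strongly a gene's literature profile draws on each feature.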

  13. School Refusal Behavior: Classification, Assessment, and Treatment Issues.

    ERIC Educational Resources Information Center

    Lee, Marcella I.; Miltenberger, Raymond G.

    1996-01-01

    Discusses diagnostic and functional classification, assessment, and treatment approaches for school refusal behavior. Diagnostic classification focuses on separation anxiety disorder, specific phobia, social phobia, depression, and truancy. Functional classification focuses on the maintaining consequences of the behavior, such as avoidance of…

  14. Family-specific scaling laws in bacterial genomes.

    PubMed

    De Lazzari, Eleonora; Grilli, Jacopo; Maslov, Sergei; Cosentino Lagomarsino, Marco

    2017-07-27

    Among several quantitative invariants found in evolutionary genomics, one of the most striking is the scaling of the overall abundance of proteins, or protein domains, sharing a specific functional annotation across genomes of given size. The sizes of these functional categories change, on average, as power-laws in the total number of protein-coding genes. Here, we show that such regularities are not restricted to the overall behavior of high-level functional categories, but also exist systematically at the level of single evolutionary families of protein domains. Specifically, the number of proteins within each family follows family-specific scaling laws with genome size. Functionally similar sets of families tend to follow similar scaling laws, but this is not always the case. To understand this systematically, we provide a comprehensive classification of families based on their scaling properties. Additionally, we develop a quantitative score for the heterogeneity of the scaling of families belonging to a given category or predefined group. Under the common reasonable assumption that selection is driven solely or mainly by biological function, these findings point to fine-tuned and interdependent functional roles of specific protein domains, beyond our current functional annotations. This analysis provides a deeper view on the links between evolutionary expansion of protein families and the functional constraints shaping the gene repertoire of bacterial genomes. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.
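
    A power-law scaling of the form count = a × (genome size)^b becomes a straight line on log-log axes, so the exponent b can be estimated by ordinary least squares on the logarithms. The sketch below shows that estimate on synthetic data; it is a generic illustration of fitting a scaling exponent, not the fitting procedure used in the paper, and the numbers are invented.

```python
import math


def fit_power_law(genome_sizes, family_counts):
    """Least-squares fit of log(count) = log(a) + b * log(size); returns (a, b)."""
    xs = [math.log(g) for g in genome_sizes]
    ys = [math.log(c) for c in family_counts]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


# Synthetic family whose size grows quadratically with the number of genes.
sizes = [1000, 2000, 4000, 8000]
counts = [0.001 * s ** 2 for s in sizes]
prefactor, exponent = fit_power_law(sizes, counts)
print(prefactor, exponent)
```

    Fitting each family separately and comparing the exponents is one simple way to see whether families within a functional category scale alike.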

  15. Genome-wide analysis of WRKY gene family in the sesame genome and identification of the WRKY genes involved in responses to abiotic stresses.

    PubMed

    Li, Donghua; Liu, Pan; Yu, Jingyin; Wang, Linhai; Dossa, Komivi; Zhang, Yanxin; Zhou, Rong; Wei, Xin; Zhang, Xiurong

    2017-09-11

    Sesame (Sesamum indicum L.) is one of the world's most important oil crops. However, it is susceptible to abiotic stresses in general, and to waterlogging and drought stresses in particular. The molecular mechanisms of abiotic stress tolerance in sesame have not yet been elucidated. The WRKY domain transcription factors play significant roles in plant growth, development, and responses to stresses. However, little is known about the number, location, structure, molecular phylogenetics, and expression of the WRKY genes in sesame. We performed a comprehensive study of the WRKY gene family in sesame and identified 71 SiWRKYs. In total, 65 of these genes were mapped to 15 linkage groups within the sesame genome. A phylogenetic analysis was performed using a related species (Arabidopsis thaliana) to investigate the evolution of the sesame WRKY genes. Tissue expression profiles of the WRKY genes demonstrated that six SiWRKY genes were highly expressed in all organs, suggesting that these genes may be important for plant growth and organ development in sesame. Analysis of the SiWRKY gene expression patterns revealed that 33 and 26 SiWRKYs respond strongly to waterlogging and drought stresses, respectively. Changes in the expression of 12 SiWRKY genes were observed at different times after the waterlogging and drought treatments had begun, demonstrating that sesame gene expression patterns vary in response to abiotic stresses. In this study, we analyzed the WRKY family of transcription factors encoded by the sesame genome. Insight was gained into the classification, evolution, and function of the SiWRKY genes, revealing their putative roles in a variety of tissues. Responses to abiotic stresses in different sesame cultivars were also investigated. The results of our study provide a better understanding of the structures and functions of sesame WRKY genes and suggest that manipulating these WRKYs could enhance resistance to waterlogging and drought.

  16. Classifying proteins into functional groups based on all-versus-all BLAST of 10 million proteins.

    PubMed

    Kolker, Natali; Higdon, Roger; Broomall, William; Stanberry, Larissa; Welch, Dean; Lu, Wei; Haynes, Winston; Barga, Roger; Kolker, Eugene

    2011-01-01

    To address the monumental challenge of assigning function to millions of sequenced proteins, we completed the first-of-a-kind all-versus-all sequence alignments using BLAST for 9.9 million proteins in the UniRef100 database. Microsoft Windows Azure produced over 3 billion filtered records in 6 days using 475 eight-core virtual machines. Protein classification into functional groups was then performed using Hive and custom jars implemented on top of Apache Hadoop utilizing the MapReduce paradigm. First, using the Clusters of Orthologous Genes (COG) database, a length normalized bit score (LNBS) was determined to be the best similarity measure for classification of proteins. LNBS achieved sensitivity and specificity of 98% each. Second, out of 5.1 million bacterial proteins, about two-thirds were assigned to significantly extended COG groups, encompassing 30 times more assigned proteins. Third, the remaining proteins were classified into protein functional groups using an innovative implementation of a single-linkage algorithm on an in-house Hadoop compute cluster. This implementation significantly reduces the run time for nonindexed queries and optimizes efficient clustering on a large scale. The performance was also verified on Amazon Elastic MapReduce. This clustering assigned nearly 2 million proteins to approximately half a million different functional groups. A similar approach was applied to classify 2.8 million eukaryotic sequences, resulting in over 1 million proteins being assigned to existing KOG groups and the remainder clustered into 100,000 functional groups.
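
    The single-linkage step described in this abstract can be illustrated with a small in-memory sketch. The abstract does not give the LNBS formula or the cutoff, so the normalization used here (bit score divided by the shorter sequence length), the threshold, and all protein names and scores are assumptions for illustration only; the original work ran on Hadoop, not on a union-find structure like this one.

```python
from collections import defaultdict


def lnbs(bit_score, len_a, len_b):
    """Assumed normalization: bit score divided by the shorter sequence length."""
    return bit_score / min(len_a, len_b)


def single_linkage(hits, lengths, threshold):
    """Group proteins into connected components of the graph whose edges
    are BLAST hits with LNBS >= threshold (single-linkage clustering)."""
    parent = {protein: protein for protein in lengths}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, bits in hits:
        if lnbs(bits, lengths[a], lengths[b]) >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    groups = defaultdict(set)
    for protein in lengths:
        groups[find(protein)].add(protein)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical proteins, lengths, and bit scores.
lengths = {"P1": 100, "P2": 120, "P3": 90, "P4": 300}
hits = [("P1", "P2", 150.0), ("P2", "P3", 140.0), ("P3", "P4", 30.0)]
clusters = single_linkage(hits, lengths, threshold=1.0)
print(clusters)
```

    In this toy input P1 and P3 share no direct hit but land in the same group through P2, which is the defining behavior of single linkage.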

  17. Identification of candidate chemosensory genes in the antennal transcriptome of Tenebrio molitor (Coleoptera: Tenebrionidae).

    PubMed

    Liu, Su; Rao, Xiang-Jun; Li, Mao-Ye; Feng, Ming-Feng; He, Meng-Zhu; Li, Shi-Guang

    2015-03-01

    We present the first antennal transcriptome sequencing information for the yellow mealworm beetle, Tenebrio molitor (Coleoptera: Tenebrionidae). Analysis of the transcriptome dataset obtained 52,216,616 clean reads, from which 35,363 unigenes were assembled. Of these, 18,820 unigenes showed significant similarity (E-value < 10^-5) to known proteins in the NCBI non-redundant protein database. Gene ontology (GO) and Cluster of Orthologous Groups (COG) analyses were used for functional classification of these unigenes. We identified 19 putative odorant-binding protein (OBP) genes, 12 chemosensory protein (CSP) genes, 20 olfactory receptor (OR) genes, 6 ionotropic receptor (IR) genes and 2 sensory neuron membrane protein (SNMP) genes. BLASTX best hit results indicated that these chemosensory genes were most similar to their respective orthologs from Tribolium castaneum. Phylogenetic analyses also revealed that the T. molitor OBPs and CSPs are closely related to those of T. castaneum. Real-time quantitative PCR assays showed that eight TmolOBP genes were antennae-specific. Of these, TmolOBP5, TmolOBP7 and TmolOBP16 were found to be predominantly expressed in male antennae, while TmolOBP17 was expressed mainly in the legs of males. Several other genes were identified that were neither tissue-specific nor sex-specific. These results establish a firm foundation for future studies of the chemosensory genes in T. molitor. Copyright © 2015 Elsevier Inc. All rights reserved.

  18. Serrated colorectal cancer: Molecular classification, prognosis, and response to chemotherapy

    PubMed Central

    Murcia, Oscar; Juárez, Miriam; Hernández-Illán, Eva; Egoavil, Cecilia; Giner-Calabuig, Mar; Rodríguez-Soler, María; Jover, Rodrigo

    2016-01-01

    Molecular advances support the existence of an alternative pathway of colorectal carcinogenesis that is based on the hypermethylation of specific DNA regions that silences tumor suppressor genes. This alternative pathway has been called the serrated pathway due to the serrated appearance of tumors in histological analysis. New classifications for colorectal cancer (CRC) were proposed recently based on genetic profiles that show four types of molecular alterations: BRAF gene mutations, KRAS gene mutations, microsatellite instability, and hypermethylation of CpG islands. This review summarizes what is known about the serrated pathway of CRC, including CRC molecular and clinical features, prognosis, and response to chemotherapy. PMID:27053844

  19. Accurate, Rapid Taxonomic Classification of Fungal Large-Subunit rRNA Genes

    PubMed Central

    Liu, Kuan-Liang; Porras-Alfaro, Andrea; Eichorst, Stephanie A.

    2012-01-01

    Taxonomic and phylogenetic fingerprinting based on sequence analysis of gene fragments from the large-subunit rRNA (LSU) gene or the internal transcribed spacer (ITS) region is becoming an integral part of fungal classification. The lack of an accurate and robust classification tool trained by a validated sequence database for taxonomic placement of fungal LSU genes is a severe limitation in taxonomic analysis of fungal isolates or large data sets obtained from environmental surveys. Using a hand-curated set of 8,506 fungal LSU gene fragments, we determined the performance characteristics of a naïve Bayesian classifier across multiple taxonomic levels and compared the classifier performance to that of a sequence similarity-based (BLASTN) approach. The naïve Bayesian classifier was computationally more rapid (>460-fold with our system) than the BLASTN approach, and it provided equal or superior classification accuracy. Classifier accuracies were compared using sequence fragments of 100 bp and 400 bp and two different PCR primer anchor points to mimic sequence read lengths commonly obtained using current high-throughput sequencing technologies. Accuracy was higher with 400-bp sequence reads than with 100-bp reads. It was also significantly affected by sequence location across the 1,400-bp test region. The highest accuracy was obtained across either the D1 or D2 variable region. The naïve Bayesian classifier provides an effective and rapid means to classify fungal LSU sequences from large environmental surveys. The training set and tool are publicly available through the Ribosomal Database Project (http://rdp.cme.msu.edu/classifier/classifier.jsp). PMID:22194300
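
    As a rough illustration of how a naïve Bayesian sequence classifier of this kind can work, the sketch below scores a query by the overlapping k-mer "words" it contains. It is a simplified stand-in, not the published tool: the word size, the add-half smoothing, the class name `KmerNaiveBayes`, and the toy sequences are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def kmers(seq, k):
    """Return the set of overlapping words of length k found in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


class KmerNaiveBayes:
    """Toy naive Bayes classifier over k-mer presence (hypothetical helper)."""

    def __init__(self, k=4):
        self.k = k
        # taxon -> word -> number of training sequences containing the word
        self.word_counts = defaultdict(Counter)
        self.n_seqs = Counter()

    def fit(self, sequences, labels):
        for seq, label in zip(sequences, labels):
            self.n_seqs[label] += 1
            self.word_counts[label].update(kmers(seq, self.k))
        return self

    def log_score(self, seq, label):
        # Sum of log P(word | taxon), smoothed so unseen words are not zero.
        n = self.n_seqs[label]
        counts = self.word_counts[label]
        return sum(math.log((counts[w] + 0.5) / (n + 1))
                   for w in kmers(seq, self.k))

    def predict(self, seq):
        return max(sorted(self.n_seqs), key=lambda t: self.log_score(seq, t))


# Made-up training fragments; real training data would be curated LSU sequences.
train_seqs = ["ACGTACGTAC", "ACGTACGTTT", "GGCCGGCCGG", "GGCCGGCCAA"]
train_taxa = ["Ascomycota", "Ascomycota", "Basidiomycota", "Basidiomycota"]
model = KmerNaiveBayes(k=4).fit(train_seqs, train_taxa)
print(model.predict("ACGTACGA"), model.predict("GGCCGGAA"))
```

    Because scoring only needs word lookups rather than alignments, an approach like this avoids the per-query search cost of BLASTN, which is consistent with the speed advantage the abstract reports.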

  20. Identification and classification of genes required for tolerance to high-sucrose stress revealed by genome-wide screening of Saccharomyces cerevisiae.

    PubMed

    Ando, Akira; Tanaka, Fumiko; Murata, Yoshinori; Takagi, Hiroshi; Shima, Jun

    2006-03-01

    Yeasts used in bread making are exposed to high concentrations of sucrose during sweet dough fermentation. Despite its importance, tolerance to high-sucrose stress is poorly understood at the gene level. To clarify the genes required for tolerance to high-sucrose stress, genome-wide screening was undertaken using the complete deletion strain collection of diploid Saccharomyces cerevisiae. The screening identified 273 deletions that yielded high sucrose sensitivity, approximately 20 of which were previously uncharacterized. These 273 deleted genes were classified based on their cellular function and localization of their gene products. Cross-sensitivity of the high-sucrose-sensitive mutants to high concentrations of NaCl and sorbitol was studied. Among the 273 sucrose-sensitive deletion mutants, 269 showed cross-sensitivities to sorbitol or NaCl, and four (i.e. ade5,7, ade6, ade8, and pde2) were specifically sensitive to high sucrose. The general stress response pathways via high-osmolarity glycerol and stress response element pathways and the function of the invertase in the ade mutants were similar to those in the wild-type strain. In the presence of high-sucrose stress, intracellular contents of ATP in ade mutants were at least twofold lower than that of the wild-type cells, suggesting that depletion of ATP is a factor in sensitivity to high-sucrose stress. The genes identified in this study might be important for tolerance to high-sucrose stress, and therefore should be target genes in future research into molecular modification for breeding of yeast tolerant to high-sucrose stress.

  1. Sequencing analysis of 20,000 full-length cDNA clones from cassava reveals lineage specific expansions in gene families related to stress response

    PubMed Central

    Sakurai, Tetsuya; Plata, Germán; Rodríguez-Zapata, Fausto; Seki, Motoaki; Salcedo, Andrés; Toyoda, Atsushi; Ishiwata, Atsushi; Tohme, Joe; Sakaki, Yoshiyuki; Shinozaki, Kazuo; Ishitani, Manabu

    2007-01-01

    Background Cassava, an allotetraploid known for its remarkable tolerance to abiotic stresses is an important source of energy for humans and animals and a raw material for many industrial processes. A full-length cDNA library of cassava plants under normal, heat, drought, aluminum and post harvest physiological deterioration conditions was built; 19968 clones were sequence-characterized using expressed sequence tags (ESTs). Results The ESTs were assembled into 6355 contigs and 9026 singletons that were further grouped into 10577 scaffolds; we found 4621 new cassava sequences and 1521 sequences with no significant similarity to plant protein databases. Transcripts of 7796 distinct genes were captured and we were able to assign a functional classification to 78% of them while finding more than half of the enzymes annotated in metabolic pathways in Arabidopsis. The annotation of sequences that were not paired to transcripts of other species included many stress-related functional categories showing that our library is enriched with stress-induced genes. Finally, we detected 230 putative gene duplications that include key enzymes in reactive oxygen species signaling pathways and could play a role in cassava stress response features. Conclusion The cassava full-length cDNA library here presented contains transcripts of genes involved in stress response as well as genes important for different areas of cassava research. This library will be an important resource for gene discovery, characterization and cloning; in the near future it will aid the annotation of the cassava genome. PMID:18096061

  2. Classification of Pelteobagrus fish in Poyang Lake based on mitochondrial COI gene sequence.

    PubMed

    Zhong, Bin; Chen, Ting-Ting; Gong, Rui-Yue; Zhao, Zhe-Xia; Wang, Binhua; Fang, Chunlin; Mao, Hui-Ling

    2016-11-01

    We used DNA molecular marker technology to compensate for the shortcomings of traditional morphological taxonomy. In total, 770 Pelteobagrus fish were collected from Poyang Lake. After preliminary morphological classification, eight samples of each species were randomly selected for DNA extraction. The mitochondrial COI gene sequence was cloned with universal primers and sequenced. The results showed that four species of Pelteobagrus live in Poyang Lake. The average intraspecific genetic distance was 0.003, while the average interspecific genetic distance was 0.128, far greater than the intraspecific distance. In addition, phylogenetic tree analysis showed that the molecular systematics agreed with the morphological classification, indicating that the COI gene is an effective DNA molecular marker for Pelteobagrus classification. Surprisingly, the genetic distance of some individuals (P. e6, P. n6, P. e5, and P. v4) from their originally assigned species exceeded the species threshold (2%), and these individuals should be reclassified as Pelteobagrus fulvidraco. Another individual, P. v3, was very different: its genetic distance from its originally assigned species, Pelteobagrus vachelli, exceeded 8.4%. Its taxonomic status remains to be studied further.
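
    The comparison of intraspecific and interspecific distances can be sketched as follows. The abstract does not state which distance model was used, so this example assumes the simplest one, the uncorrected p-distance (proportion of differing sites between aligned sequences); the sequences and species labels are invented for illustration.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences,
    ignoring alignment columns where either sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    sites = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in sites) / len(sites)


def mean_distances(samples):
    """samples: list of (species, aligned_sequence) pairs.
    Returns (mean intraspecific distance, mean interspecific distance)."""
    intra, inter = [], []
    for (sp1, s1), (sp2, s2) in combinations(samples, 2):
        (intra if sp1 == sp2 else inter).append(p_distance(s1, s2))
    return sum(intra) / len(intra), sum(inter) / len(inter)


# Invented 10-bp "COI" fragments for two hypothetical species.
samples = [
    ("species_A", "ACGTACGTAC"),
    ("species_A", "ACGTACGTAC"),
    ("species_B", "ACGTTCGTAA"),
    ("species_B", "ACGTTCGTAA"),
]
intra, inter = mean_distances(samples)
print(intra, inter)
```

    A marker is useful for species delimitation when, as in the abstract, the interspecific mean is much larger than the intraspecific mean; an individual whose distance to its labeled species exceeds the chosen threshold would be a candidate for reclassification.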

  3. [Progress on molecular biology of Isaria farinosa, pathogen of host of Ophiocordyceps sinensis during the artificial culture].

    PubMed

    Liu, Fei; Wu, Xiao-Li; Liu, Ying; Chen, Da-Xia; Zhang, De-Li; Yang, Da-Jian

    2016-02-01

    Isaria farinosa is the pathogen of the host of Ophiocordyceps sinensis. This study reviewed progress on its molecular biology using bibliometrics and the I. farinosa sequences (including gene sequences) deposited in NCBI. The results indicated that countries differed in the number of papers published and in the types and numbers of sequences (including gene sequences) deposited. China had published the most papers and deposited the most sequences (including gene sequences), while America had deposited the most functional genes. Research on the pathogen focused mainly on biological control, and molecular studies concentrated on phylogenetic classification. In recent years, some protease genes and chitinase genes have been studied. As the effect of I. farinosa on the health of O. sinensis increases, and as its whole sequence and more of its pharmacological activities become publicly known, research on the molecular biology of I. farinosa is expected to become deeper and broader. Copyright© by the Chinese Pharmaceutical Association.

  4. Comparative genomic analysis by microbial COGs self-attraction rate.

    PubMed

    Santoni, Daniele; Romano-Spica, Vincenzo

    2009-06-21

    Whole genome analysis provides new perspectives to determine phylogenetic relationships among microorganisms. The availability of whole nucleotide sequences allows different levels of comparison among genomes by several approaches. In this work, self-attraction rates were considered for each cluster of orthologous groups of proteins (COGs) class in order to analyse gene aggregation levels in physical maps. Phylogenetic relationships among microorganisms were obtained by comparing self-attraction coefficients. Eighteen-dimensional vectors were computed for a set of 168 completely sequenced microbial genomes (19 archaea, 149 bacteria). The components of the vector represent the aggregation rate of the genes belonging to each of 18 COGs classes. Genes involved in nonessential functions or related to environmental conditions showed the highest aggregation rates. In contrast, genes involved in basic cellular tasks showed a more uniform distribution along the genome, except for translation genes. The self-attraction clustering approach allowed classification of Proteobacteria, Bacilli and other species belonging to Firmicutes. Rearrangement and Lateral Gene Transfer events may influence divergences from classical taxonomy. Each set of COG classes' aggregation values represents an intrinsic property of the microbial genome. This novel approach provides a new point of view for whole genome analysis and bacterial characterization.

  5. Mutation-profile-based methods for understanding selection forces in cancer somatic mutations: a comparative analysis.

    PubMed

    Zhou, Zhan; Zou, Yangyun; Liu, Gangbiao; Zhou, Jingqi; Wu, Jingcheng; Zhao, Shimin; Su, Zhixi; Gu, Xun

    2017-08-29

    Human genes exhibit different effects on fitness in cancer and normal cells. Here, we present an evolutionary approach to measure the selection pressure on human genes, using the well-known ratio of the nonsynonymous to synonymous substitution rate in both cancer genomes (CN/CS) and normal populations (pN/pS). A new mutation-profile-based method that adopts sample-specific mutation rate profiles instead of conventional substitution models was developed. We found that cancer-specific selection pressure is quite different from the selection pressure at the species and population levels. Both the relaxation of purifying selection on passenger mutations and the positive selection of driver mutations may contribute to the increased CN/CS values of human genes in cancer genomes compared with the pN/pS values in human populations. The CN/CS values also contribute to the improved classification of cancer genes and a better understanding of the onco-functionalization of cancer genes during oncogenesis. The use of our computational pipeline to identify cancer-specific positively and negatively selected genes may provide useful information for understanding the evolution of cancers and identifying possible targets for therapeutic intervention.

  6. A genome-wide 20 K citrus microarray for gene expression analysis

    PubMed Central

    Martinez-Godoy, M Angeles; Mauri, Nuria; Juarez, Jose; Marques, M Carmen; Santiago, Julia; Forment, Javier; Gadea, Jose

    2008-01-01

    Background Understanding of genetic elements that contribute to key aspects of citrus biology will impact future improvements in this economically important crop. Global gene expression analysis demands microarray platforms with a high genome coverage. In recent years, genome-wide EST collections have been generated in citrus, opening the possibility to create new tools for functional genomics in this crop plant. Results We have designed and constructed a publicly available genome-wide cDNA microarray that includes 21,081 putative unigenes of citrus. As a functional companion to the microarray, a web-browsable database was created and populated with information about the unigenes represented in the microarray, including cDNA libraries, isolated clones, raw and processed nucleotide and protein sequences, and results of all the structural and functional annotation of the unigenes, such as general description, BLAST hits, putative Arabidopsis orthologs, microsatellites, putative SNPs, GO classification and PFAM domains. We have performed a Gene Ontology comparison with the full set of Arabidopsis proteins to estimate the genome coverage of the microarray. We have also performed microarray hybridizations to check its usability. Conclusion This new cDNA microarray replaces the first 7K microarray generated two years ago and allows gene expression analysis at a more global scale. We have followed a rational design to minimize cross-hybridization while maintaining its utility for different citrus species. Furthermore, we also provide access to a website with full structural and functional annotation of the unigenes represented in the microarray, along with the ability to use this site to directly perform gene expression analysis using standard tools at different publicly available servers. Furthermore, we show how this microarray offers a good representation of the citrus genome and present the usefulness of this genomic tool for global studies in citrus by using it to catalogue genes expressed in citrus globular embryos. PMID:18598343

  7. Annotation of the Transcriptome from Taenia pisiformis and Its Comparative Analysis with Three Taeniidae Species

    PubMed Central

    Yang, Deying; Fu, Yan; Wu, Xuhang; Xie, Yue; Nie, Huaming; Chen, Lin; Nong, Xiang; Gu, Xiaobin; Wang, Shuxian; Peng, Xuerong; Yan, Ning; Zhang, Runhui; Zheng, Wanpeng; Yang, Guangyou

    2012-01-01

    Background Taenia pisiformis is one of the most common intestinal tapeworms and can cause infections in canines. Adult T. pisiformis (canines as definitive hosts) and Cysticercus pisiformis (rabbits as intermediate hosts) cause significant health problems to the host and considerable socio-economic losses as a consequence. No complete genomic data regarding T. pisiformis are currently available in public databases. RNA-seq provides an effective approach to analyze the eukaryotic transcriptome to generate large functional gene datasets that can be used for further studies. Methodology/Principal Findings In this study, 2.67 million sequencing clean reads and 72,957 unigenes were generated using the RNA-seq technique. Based on a sequence similarity search with known proteins, a total of 26,012 unigenes (no redundancy) were identified after quality control procedures via the alignment of four databases. Overall, 15,920 unigenes were mapped to 203 Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways. Through analyzing the glycolysis/gluconeogenesis and axonal guidance pathways, we achieved an in-depth understanding of the biochemistry of T. pisiformis. Here, we selected four unigenes at random and obtained their full-length cDNA clones using RACE PCR. Functional distribution characteristics were gained through comparing four cestode species (72,957 unigenes of T. pisiformis, 30,700 ESTs of T. solium, 1,058 ESTs of Eg+Em [conserved ESTs between Echinococcus granulosus and Echinococcus multilocularis]), with the cluster of orthologous groups (COG) and gene ontology (GO) functional classification systems. Furthermore, the conserved common genes in these four cestode species were obtained and aligned by the KEGG database. Conclusion This study provides an extensive transcriptome dataset obtained from the deep sequencing of T. pisiformis in a non-model whole genome. 
The identification of conserved genes may provide novel approaches for potential drug targets and vaccinations against cestode infections. Research can now accelerate into the functional genomics, immunity and gene expression profiles of cestode species. PMID:22514598

  8. Challenges in projecting clustering results across gene expression-profiling datasets.

    PubMed

    Lusa, Lara; McShane, Lisa M; Reid, James F; De Cecco, Loris; Ambrogi, Federico; Biganzoli, Elia; Gariboldi, Manuela; Pierotti, Marco A

    2007-11-21

    Gene expression microarray studies for several types of cancer have been reported to identify previously unknown subtypes of tumors. For breast cancer, a molecular classification consisting of five subtypes based on gene expression microarray data has been proposed. These subtypes have been reported to exist across several breast cancer microarray studies, and they have demonstrated some association with clinical outcome. A classification rule based on the method of centroids has been proposed for identifying the subtypes in new collections of breast cancer samples; the method is based on the similarity of the new profiles to the mean expression profile of the previously identified subtypes. Previously identified centroids of five breast cancer subtypes were used to assign 99 breast cancer samples, including a subset of 65 estrogen receptor-positive (ER+) samples, to five breast cancer subtypes based on microarray data for the samples. The effect of mean centering the genes (i.e., transforming the expression of each gene so that its mean expression is equal to 0) on subtype assignment by method of centroids was assessed. Further studies of the effect of mean centering and of class prevalence in the test set on the accuracy of method of centroids classifications of ER status were carried out using training and test sets for which ER status had been independently determined by ligand-binding assay and for which the proportion of ER+ and ER- samples were systematically varied. When all 99 samples were considered, mean centering before application of the method of centroids appeared to be helpful for correctly assigning samples to subtypes, as evidenced by the expression of genes that had previously been used as markers to identify the subtypes. However, when only the 65 ER+ samples were considered for classification, many samples appeared to be misclassified, as evidenced by an unexpected distribution of ER+ samples among the resultant subtypes. 
When genes were mean centered before classification of samples for ER status, the accuracy of the ER subgroup assignments was highly dependent on the proportion of ER+ samples in the test set; this effect of subtype prevalence was not seen when gene expression data were not mean centered. Simple corrections such as mean centering of genes aimed at microarray platform or batch effect correction can have undesirable consequences because patient population effects can easily be confused with these assay-related effects. Careful thought should be given to the comparability of the patient populations before attempting to force data comparability for purposes of assigning subtypes to independent subjects.
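
    The interaction between mean centering and class prevalence that this abstract describes can be reproduced on toy data. The sketch below assumes Pearson correlation as the similarity measure between a sample and each centroid (the abstract only says "similarity"), and the centroids and expression values are invented. The test set deliberately contains only samples that resemble the ER+ centroid.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length profiles."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den


def mean_center_genes(samples):
    """Shift each gene (column) so its mean across the sample set is 0."""
    gene_means = [mean(column) for column in zip(*samples)]
    return [[v - m for v, m in zip(sample, gene_means)] for sample in samples]


def assign(samples, centroids):
    """Assign each sample to the centroid with the highest correlation."""
    return [max(sorted(centroids), key=lambda name: pearson(s, centroids[name]))
            for s in samples]


# Invented centroids over four genes and an all-ER+ test set.
centroids = {"ER+": [1.0, 0.5, -1.0, -0.5], "ER-": [-1.0, -0.5, 1.0, 0.5]}
er_positive_only = [[9.0, 8.0, 5.0, 6.0],
                    [10.0, 8.5, 5.5, 6.0],
                    [8.0, 8.0, 5.0, 6.5]]

raw_calls = assign(er_positive_only, centroids)
centered_calls = assign(mean_center_genes(er_positive_only), centroids)
print(raw_calls)
print(centered_calls)
```

    On the uncentered values all three samples are called ER+, whereas after centering within this homogeneous set one of them flips to ER-. Centering removes the signal shared by every sample in the set, so the remaining within-set variation is what gets matched against the centroids.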

  9. Familial or Sporadic Idiopathic Scoliosis – classification based on artificial neural network and GAPDH and ACTB transcription profile

    PubMed Central

    2013-01-01

    Background Importance of hereditary factors in the etiology of Idiopathic Scoliosis is widely accepted. In clinical practice some of the IS patients present with positive familial history of the deformity and some do not. Traditionally about 90% of patients have been considered as sporadic cases without familial recurrence. However the exact proportion of Familial and Sporadic Idiopathic Scoliosis is still unknown. Housekeeping genes encode proteins that are usually essential for the maintenance of basic cellular functions. ACTB and GAPDH are two housekeeping genes encoding respectively a cytoskeletal protein β-actin, and glyceraldehyde-3-phosphate dehydrogenase, an enzyme of glycolysis. Although their expression levels can fluctuate between different tissues and persons, human housekeeping genes seem to exhibit a preserved tissue-wide expression ranking order. It was hypothesized that expression ranking order of two representative housekeeping genes ACTB and GAPDH might be disturbed in the tissues of patients with Familial Idiopathic Scoliosis (with positive family history of idiopathic scoliosis) opposed to the patients with no family members affected (Sporadic Idiopathic Scoliosis). An artificial neural network (ANN) was developed that could serve to differentiate between familial and sporadic cases of idiopathic scoliosis based on the expression levels of ACTB and GAPDH in different tissues of scoliotic patients. The aim of the study was to investigate whether the expression levels of ACTB and GAPDH in different tissues of idiopathic scoliosis patients could be used as a source of data for specially developed artificial neural network in order to predict the positive family history of index patient. Results The comparison of developed models showed, that the most satisfactory classification accuracy was achieved for ANN model with 18 nodes in the first hidden layer and 16 nodes in the second hidden layer. 
The classification accuracy for predicting a positive family history of Idiopathic Scoliosis from the expression measurements of ACTB and GAPDH alone, using an ANN with a 6-18-16-1 architecture, was 8 of 9 (88%). In only one case was the prediction ambiguous. Conclusions A specially designed artificial neural network model indicated a possible association between the expression levels of ACTB and GAPDH and a positive familial history of Idiopathic Scoliosis. PMID:23289769
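
    To make the "6-18-16-1" notation concrete, the sketch below builds a fully connected network with six inputs, hidden layers of 18 and 16 nodes, and a single output, and runs one forward pass. Only the layer sizes come from the abstract; the activation functions (tanh in the hidden layers, a sigmoid at the output), the random untrained weights, and the input values are assumptions, so the output is not a meaningful prediction.

```python
import math
import random

LAYERS = [6, 18, 16, 1]  # inputs, first hidden, second hidden, output


def init_network(layers, seed=0):
    """Create (weights, biases) per layer with small random weights."""
    rng = random.Random(seed)
    return [
        ([[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)],
         [0.0] * n_out)
        for n_in, n_out in zip(layers, layers[1:])
    ]


def forward(network, inputs):
    """Propagate one input vector through the network."""
    activations = inputs
    for index, (weights, biases) in enumerate(network):
        z = [sum(w * a for w, a in zip(row, activations)) + b
             for row, b in zip(weights, biases)]
        is_output = index == len(network) - 1
        # Assumed activations: sigmoid on the output layer, tanh elsewhere.
        activations = [1 / (1 + math.exp(-v)) if is_output else math.tanh(v)
                       for v in z]
    return activations[0]


def count_parameters(network):
    """Total number of weights and biases."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in network)


network = init_network(LAYERS)
n_params = count_parameters(network)
# Six made-up normalized expression values for one patient.
score = forward(network, [0.2, -0.1, 0.4, 0.0, 0.3, -0.2])
print(n_params, score)
```

    A network of this shape has 447 trainable parameters (126 + 304 + 17), which is large relative to a test set of nine patients and is worth keeping in mind when reading the reported accuracy.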

  10. Genomic landscape of gastric cancer: molecular classification and potential targets.

    PubMed

    Guo, Jiawei; Yu, Weiwei; Su, Hui; Pang, Xiufeng

    2017-02-01

    Gastric cancer imposes a considerable health burden worldwide, and its mortality ranks as the second highest for all types of cancers. The limited knowledge of the molecular mechanisms underlying gastric cancer tumorigenesis hinders the development of therapeutic strategies. However, ongoing collaborative sequencing efforts facilitate molecular classification and unveil the genomic landscape of gastric cancer. Several new drivers and tumorigenic pathways in gastric cancer, including chromatin remodeling genes, RhoA-related pathways, TP53 dysregulation, activation of receptor tyrosine kinases, stem cell pathways and abnormal DNA methylation, have been revealed. These newly identified genomic alterations await translation into clinical diagnosis and targeted therapies. Considering that loss-of-function mutations are intractable, synthetic lethality could be employed when discussing feasible therapeutic strategies. Although many challenges remain to be tackled, we are optimistic regarding improvements in the prognosis and treatment of gastric cancer in the near future.

  11. The grapevine kinome: annotation, classification and expression patterns in developmental processes and stress responses.

    PubMed

    Zhu, Kaikai; Wang, Xiaolong; Liu, Jinyi; Tang, Jun; Cheng, Qunkang; Chen, Jin-Gui; Cheng, Zong-Ming Max

    2018-01-01

    Protein kinases (PKs) have evolved as the largest family of molecular switches that regulate protein activities associated with almost all essential cellular functions. Only a fraction of plant PKs, however, have been functionally characterized even in model plant species. In the present study, the entire grapevine kinome was identified and annotated using the most recent version of the grapevine genome. A total of 1168 PK-encoding genes were identified and classified into 20 groups and 121 families, with the RLK-Pelle group being the largest, with 872 members. The 1168 kinase genes were unevenly distributed over all 19 chromosomes, and both tandem and segmental duplications contributed to the expansion of the grapevine kinome, especially of the RLK-Pelle group. Ka/Ks values indicated that most of the tandem and segmental duplication events were under purifying selection. The grapevine kinome families exhibited different expression patterns during plant development and in response to various stress treatments, with many being coexpressed. The comprehensive annotation of grapevine kinase genes, their patterns of expression and coexpression, and the related information facilitate a more complete understanding of the roles of various grapevine kinases in growth and development, responses to abiotic stress, and evolutionary history.

  12. Prediction and functional analysis of the sweet orange protein-protein interaction network.

    PubMed

    Ding, Yu-Duan; Chang, Ji-Wei; Guo, Jing; Chen, Dijun; Li, Sen; Xu, Qiang; Deng, Xiu-Xin; Cheng, Yun-Jiang; Chen, Ling-Ling

    2014-08-05

    Sweet orange (Citrus sinensis) is one of the most important fruits world-wide. Because it is a woody plant with a long growth cycle, genetic studies of sweet orange are lagging behind those of other species. In this analysis, we employed ortholog identification and domain combination methods to predict the protein-protein interaction (PPI) network for sweet orange. The K-nearest neighbors (KNN) classification method was used to verify and filter the network. The final predicted PPI network, CitrusNet, contained 8,195 proteins with 124,491 interactions. The quality of CitrusNet was evaluated using gene ontology (GO) and Mapman annotations, which confirmed the reliability of the network. In addition, we calculated the expression difference of interacting genes (EDI) in CitrusNet using RNA-seq data from four sweet orange tissues, and also analyzed the EDI distribution and variation in different sub-networks. Gene expression in CitrusNet has significant modular features. Target of rapamycin (TOR) protein served as the central node of the hormone-signaling sub-network. All evidence supported the idea that TOR can integrate various hormone signals and affect plant growth. CitrusNet provides valuable resources for the study of biological functions in sweet orange.

  13. Phylogenetic relationships among Lactuca (Asteraceae) species and related genera based on ITS-1 DNA sequences.

    PubMed

    Koopman, W J; Guetta, E; van de Wiel, C C; Vosman, B; van den Berg, R G

    1998-11-01

    Internal transcribed spacer (ITS-1) sequences from 97 accessions representing 23 species of Lactuca and related genera were determined and used to evaluate species relationships of Lactuca sensu lato (s.l.). The ITS-1 phylogenies, calculated using PAUP and PHYLIP, correspond better to the classification of Feráková than to other classifications evaluated, although the inclusion of sect. Lactuca subsect. Cyanicae is not supported. Therefore, exclusion of subsect. Cyanicae from Lactuca sensu Feráková is proposed. The amended genus contains the entire gene pool (sensu Harlan and De Wet) of cultivated lettuce (Lactuca sativa). The position of the species in the amended classification corresponds to their position in the lettuce gene pool. In the ITS-1 phylogenies, a clade with L. sativa, L. serriola, L. dregeana, L. altaica, and L. aculeata represents the primary gene pool. L. virosa and L. saligna, branching off closest to this clade, encompass the secondary gene pool. L. virosa is possibly of hybrid origin. The primary and secondary gene pool species are classified in sect. Lactuca subsect. Lactuca. The species L. quercina, L. viminea, L. sibirica, and L. tatarica, branching off next, represent the tertiary gene pool. They are classified in Lactuca sect. Lactucopsis, sect. Phaenixopus, and sect. Mulgedium, respectively. L. perennis and L. tenerrima, classified in sect. Lactuca subsect. Cyanicae, form clades with species from related genera and are not part of the lettuce gene pool.

  14. nRC: non-coding RNA Classifier based on structural features.

    PubMed

    Fiannaca, Antonino; La Rosa, Massimo; La Paglia, Laura; Rizzo, Riccardo; Urso, Alfonso

    2017-01-01

    Non-coding RNAs (ncRNAs) are small non-coding sequences involved in gene expression regulation of many biological processes and diseases. The recent discovery of a large set of different ncRNAs with biologically relevant roles has opened the way to develop methods able to discriminate between the different ncRNA classes. Moreover, the lack of knowledge about the complete mechanisms in regulative processes, together with the development of high-throughput technologies, has required the help of bioinformatics tools in providing biologists and clinicians with a deeper comprehension of the functional roles of ncRNAs. In this work, we introduce a new ncRNA classification tool, nRC (non-coding RNA Classifier). Our approach is based on feature extraction from the ncRNA secondary structure together with a supervised classification algorithm implementing a deep learning architecture based on convolutional neural networks. We tested our approach for the classification of 13 different ncRNA classes. We obtained classification scores, using the most common statistical measures. In particular, we reach an accuracy and sensitivity score of about 74%. The proposed method outperforms other similar classification methods based on secondary structure features and machine learning algorithms, including the RNAcon tool that, to date, is the reference classifier. The nRC tool is freely available as a Docker image at https://hub.docker.com/r/tblab/nrc/. The source code of the nRC tool is also available at https://github.com/IcarPA-TBlab/nrc.

  15. Transcriptomics and molecular evolutionary rate analysis of the bladderwort (Utricularia), a carnivorous plant with a minimal genome

    PubMed Central

    2011-01-01

    Background: The carnivorous plant Utricularia gibba (bladderwort) is remarkable in having a minute genome, which at ca. 80 megabases is approximately half that of Arabidopsis. Bladderworts show an incredible diversity of forms surrounding a defined theme: tiny, bladder-like suction traps on terrestrial, epiphytic, or aquatic plants with a diversity of unusual vegetative forms. Utricularia plants, which are rootless, are also anomalous in physiological features (respiration and carbon distribution), and highly enhanced molecular evolutionary rates in chloroplast, mitochondrial and nuclear ribosomal sequences. Despite great interest in the genus, no genomic resources exist for Utricularia, and the substitution rate increase has received limited study. Results: Here we describe the sequencing and analysis of the Utricularia gibba transcriptome. Three different organs were surveyed, the traps, the vegetative shoot bodies, and the inflorescence stems. We also examined the bladderwort transcriptome under diverse stress conditions. We detail aspects of functional classification, tissue similarity, nitrogen and phosphorus metabolism, respiration, DNA repair, and detoxification of reactive oxygen species (ROS). Long contigs of plastid and mitochondrial genomes, as well as sequences for 100 individual nuclear genes, were compared with those of other plants to better establish information on molecular evolutionary rates. Conclusion: The Utricularia transcriptome provides a detailed genomic window into processes occurring in a carnivorous plant. It contains a deep representation of the complex metabolic pathways that characterize a putative minimal plant genome, permitting its use as a source of genomic information to explore the structural, functional, and evolutionary diversity of the genus. Vegetative shoots and traps are the most similar organs by functional classification of their transcriptome, the traps expressing hydrolytic enzymes for prey digestion that were previously thought to be encoded by bacteria. Supporting physiological data, global gene expression analysis shows that traps significantly over-express genes involved in respiration and that phosphate uptake might occur mainly in traps, whereas nitrogen uptake could in part take place in vegetative parts. Expression of DNA repair and ROS detoxification enzymes may be indicative of a response to increased respiration. Finally, evidence from the bladderwort transcriptome, direct measurement of ROS in situ, and cross-species comparisons of organellar genomes and multiple nuclear genes supports the hypothesis that increased nucleotide substitution rates throughout the plant may be due to the mutagenic action of amplified ROS production. PMID:21639913

  16. Opioid receptor subtypes: fact or artifact?

    PubMed

    Dietis, N; Rowbotham, D J; Lambert, D G

    2011-07-01

    There is a vast amount of pharmacological evidence favouring the existence of multiple subtypes of opioid receptors. In addition to the primary classification of µ (mu: MOP), δ (delta: DOP), κ (kappa: KOP) receptors, and the nociceptin/orphanin FQ peptide receptor (NOP), various groups have further classified the pharmacological µ into µ(1-3), the δ into δ(1-2)/δ(complexed/non-complexed), and the κ into κ(1-3). From an anaesthetic perspective, the suggestions that µ(1) produced analgesia and µ(2) produced respiratory depression are particularly important. However, subsequent to the formal identification of the primary opioid receptors (MOP/DOP/KOP/NOP) by cloning and the use of this information to produce knockout animals, evidence for these additional subtypes is lacking. Indeed, knockout of a single gene (and hence receptor) results in a loss of all function associated with that receptor. In the case of MOP knockout, analgesia and respiratory depression are lost. This suggests that further sub-classification of the primary types is unwise. So how can the wealth of pharmacological data be reconciled with new molecular information? In addition to some simple misclassification (κ(3) is probably NOP), there are several possibilities which include: (i) alternate splicing of a common gene product, (ii) receptor dimerization, (iii) interaction of a common gene product with other receptors/signalling molecules, or (iv) a combination of (i)-(iii). Assigning variations in ligand activity (pharmacological subtypes) to one or more of these molecular suggestions represents an interesting challenge for future opioid research.

  17. Hybrid feature selection algorithm using symmetrical uncertainty and a harmony search algorithm

    NASA Astrophysics Data System (ADS)

    Salameh Shreem, Salam; Abdullah, Salwani; Nazri, Mohd Zakree Ahmad

    2016-04-01

    Microarray technology can be used as an efficient diagnostic system to recognise diseases such as tumours or to discriminate between different types of cancers in normal tissues. This technology has received increasing attention from the bioinformatics community because of its potential in designing powerful decision-making tools for cancer diagnosis. However, the presence of thousands or tens of thousands of genes affects the predictive accuracy of this technology from the perspective of classification. Thus, a key issue in microarray data is identifying or selecting the smallest possible set of genes from the input data that can achieve good predictive accuracy for classification. In this work, we propose a two-stage selection algorithm for gene selection problems in microarray data-sets called the symmetrical uncertainty filter and harmony search algorithm wrapper (SU-HSA). Experimental results show that the SU-HSA is better than HSA in isolation for all data-sets in terms of the accuracy and achieves a lower number of genes on 6 out of 10 instances. Furthermore, the comparison with state-of-the-art methods shows that our proposed approach is able to obtain 5 (out of 10) new best results in terms of the number of selected genes and competitive results in terms of the classification accuracy.
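
The filter stage named in this abstract ranks genes by symmetrical uncertainty, which is conventionally defined as SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)). The sketch below illustrates that standard definition only. It assumes expression values have already been discretized, and the helper name `su_filter` is hypothetical; neither detail is stated in the abstract, and the harmony search wrapper stage is not shown.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), a value in [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        # Both variables are constant, so there is nothing to share.
        return 0.0
    # Mutual information from the joint entropy: I = H(X) + H(Y) - H(X, Y).
    mi = hx + hy - entropy(list(zip(x, y)))
    return 2.0 * mi / (hx + hy)


def su_filter(features, labels, top_n):
    """Hypothetical helper: keep the top_n features ranked by SU with the labels.

    `features` maps a gene name to its list of discretized values.
    """
    scores = {name: symmetrical_uncertainty(col, labels)
              for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

A gene whose discretized values determine the class label scores 1, and a gene that is statistically independent of the label scores 0, so the ranking favours genes that carry class information.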

  18. Lung tumor diagnosis and subtype discovery by gene expression profiling.

    PubMed

    Wang, Lu-yong; Tu, Zhuowen

    2006-01-01

    The optimal treatment of patients with complex diseases, such as cancers, depends on accurate diagnosis using a combination of clinical and histopathological data. In many scenarios, this becomes tremendously difficult because of the limitations in clinical presentation and histopathology. To accurately diagnose complex diseases, molecular classification based on gene or protein expression profiles is indispensable for modern medicine. Moreover, many heterogeneous diseases consist of various potential subtypes on a molecular basis that differ remarkably in their response to therapies. It is critical to accurately predict subgroups from disease gene expression profiles. More fundamental knowledge of the molecular basis and classification of disease could aid in the prediction of patient outcome, the informed selection of therapies, and the identification of novel molecular targets for therapy. In this paper, we propose a new disease diagnostic method, the probabilistic boosting tree (PB tree) method, applied to gene expression profiles of lung tumors. It enables accurate disease classification and subtype discovery. It automatically constructs a tree in which each node combines a number of weak classifiers into a strong classifier. Also, subtype discovery is naturally embedded in the learning process. Our algorithm achieves excellent diagnostic performance, and it is also capable of detecting the disease subtype based on the gene expression profile.

  19. 23 CFR 470.105 - Urban area boundaries and highway functional classification.

    Code of Federal Regulations, 2010 CFR

    2010-04-01

    ... classification. 470.105 Section 470.105 Highways FEDERAL HIGHWAY ADMINISTRATION, DEPARTMENT OF TRANSPORTATION... criteria and procedures are provided in the FHWA publication “Highway Functional Classification—Concepts... functional classification shall be mapped and submitted to the Federal Highway Administration (FHWA) for...

  20. Statistical approach for selection of biologically informative genes.

    PubMed

    Das, Samarendra; Rai, Anil; Mishra, D C; Rai, Shesh N

    2018-05-20

    Selection of informative genes from high dimensional gene expression data has emerged as an important research area in genomics. Many gene selection techniques proposed so far are based on either a relevancy or a redundancy measure. Further, the performance of these techniques has been adjudged through post-selection classification accuracy computed through a classifier using the selected genes. This performance metric may be statistically sound but may not be biologically relevant. A statistical approach, i.e. Boot-MRMR, was proposed based on a composite measure of maximum relevance and minimum redundancy, which is both statistically sound and biologically relevant for informative gene selection. For comparative evaluation of the proposed approach, we developed two biologically sufficient criteria, i.e. Gene Set Enrichment with QTL (GSEQ) and a biological similarity score based on Gene Ontology (GO). Further, a systematic and rigorous evaluation of the proposed technique with 12 existing gene selection techniques was carried out using five gene expression datasets. This evaluation was based on a broad spectrum of statistically sound (e.g. subject classification) and biologically relevant (based on QTL and GO) criteria under a multiple criteria decision-making framework. The performance analysis showed that the proposed technique selects informative genes which are more biologically relevant. The proposed technique is also found to be quite competitive with the existing techniques with respect to subject classification and computational time. Our results also showed that under the multiple criteria decision-making setup, the proposed technique is best for informative gene selection over the available alternatives. Based on the proposed approach, an R package, i.e. BootMRMR, has been developed and is available at https://cran.r-project.org/web/packages/BootMRMR. This study will provide a practical guide to select statistical techniques for selecting informative genes from high dimensional expression data for breeding and system biology studies. Published by Elsevier B.V.

  1. The Early Innate Response of Chickens to Salmonella enterica Is Dependent on the Presence of O-Antigen but Not on Serovar Classification

    PubMed Central

    Varmuzova, Karolina; Matulova, Marta Elsheimer; Sebkova, Alena; Sekelova, Zuzana; Havlickova, Hana; Sisak, Frantisek; Babak, Vladimir; Rychlik, Ivan

    2014-01-01

    Salmonella vaccines used in poultry in the EU are based on attenuated strains of either Salmonella serovar Enteritidis or Typhimurium, which results in a decrease in S. Enteritidis and S. Typhimurium but may allow other Salmonella serovars to fill an empty ecological niche. In this study we were therefore interested in the early interactions of the chicken immune system with S. Infantis compared to S. Enteritidis and S. Typhimurium, and the role of O-antigen in these interactions. To this end, we orally infected newly hatched chickens with 7 wild type strains of Salmonella serovars Enteritidis, Typhimurium and Infantis as well as with their rfaL mutants and characterized the early Salmonella-chicken interactions. Inflammation was characterized in the cecum 4 days post-infection by measuring expression of 43 different genes. All wild type strains stimulated a greater inflammatory response than any of the rfaL mutants. However, there were large differences in chicken responses to different wild type strains not reflecting their serovar classification. The initial interaction between newly hatched chickens and Salmonella was found to be dependent on the presence of O-antigen but not on its structure, i.e. not on serovar classification. In addition, we observed that the expression of calbindin or aquaporin 8 in the cecum did not change if inflammatory gene expression remained within a 10-fold fluctuation, indicating the buffering capacity of the cecum, preserving normal gut functions even in the presence of minor inflammatory stimuli. PMID:24763249

  2. The Communication Function Classification System: cultural adaptation, validity, and reliability of the Farsi version for patients with cerebral palsy.

    PubMed

    Soleymani, Zahra; Joveini, Ghodsiye; Baghestani, Ahmad Reza

    2015-03-01

    This study developed a Farsi language Communication Function Classification System and then tested its reliability and validity. Communication Function Classification System is designed to classify the communication functions of individuals with cerebral palsy. Up until now, there has been no instrument for assessment of this communication function in Iran. The English Communication Function Classification System was translated into Farsi and cross-culturally modified by a panel of experts. Professionals and parents then assessed the content validity of the modified version. A backtranslation of the Farsi version was confirmed by the developer of the English Communication Function Classification System. Face validity was assessed by therapists and parents of 10 patients. The Farsi Communication Function Classification System was administered to 152 individuals with cerebral palsy (age, 2 to 18 years; median age, 10 years; mean age, 9.9 years; standard deviation, 4.3 years). Inter-rater reliability was analyzed between parents, occupational therapists, and speech and language pathologists. The test-retest reliability was assessed for 75 patients with a 14 day interval between tests. The inter-rater reliability of the Communication Function Classification System was 0.81 between speech and language pathologists and occupational therapists, 0.74 between parents and occupational therapists, and 0.88 between parents and speech and language pathologists. The test-retest reliability was 0.96 for occupational therapists, 0.98 for speech and language pathologists, and 0.94 for parents. The findings suggest that the Farsi version of Communication Function Classification System is a reliable and valid measure that can be used in clinical settings to assess communication function in patients with cerebral palsy. Copyright © 2015 Elsevier Inc. All rights reserved.

  3. A Bayesian Approach to Genome/Linguistic Relationships in Native South Americans

    PubMed Central

    Amorim, Carlos Eduardo Guerra; Bisso-Machado, Rafael; Ramallo, Virginia; Bortolini, Maria Cátira; Bonatto, Sandro Luis; Salzano, Francisco Mauro; Hünemeier, Tábita

    2013-01-01

    The relationship between the evolution of genes and languages has been studied for over three decades. These studies rely on the assumption that languages, as many other cultural traits, evolve in a gene-like manner, accumulating heritable diversity through time and being subjected to evolutionary mechanisms of change. In the present work we used genetic data to evaluate South American linguistic classifications. We compared discordant models of language classifications to the current Native American genome-wide variation using realistic demographic models analyzed under an Approximate Bayesian Computation (ABC) framework. Data on 381 STRs spread along the autosomes were gathered from the literature for populations representing the five main South Amerindian linguistic groups: Andean, Arawakan, Chibchan-Paezan, Macro-Jê, and Tupí. The results indicated a higher posterior probability for the classification proposed by J.H. Greenberg in 1987, although L. Campbell's 1997 classification cannot be ruled out. Based on Greenberg's classification, it was possible to date the time of Tupí-Arawakan divergence (2.8 kya), and the time of emergence of the structure between present day major language groups in South America (3.1 kya). PMID:23696865

  4. A bayesian approach to genome/linguistic relationships in native South Americans.

    PubMed

    Amorim, Carlos Eduardo Guerra; Bisso-Machado, Rafael; Ramallo, Virginia; Bortolini, Maria Cátira; Bonatto, Sandro Luis; Salzano, Francisco Mauro; Hünemeier, Tábita

    2013-01-01

    The relationship between the evolution of genes and languages has been studied for over three decades. These studies rely on the assumption that languages, as many other cultural traits, evolve in a gene-like manner, accumulating heritable diversity through time and being subjected to evolutionary mechanisms of change. In the present work we used genetic data to evaluate South American linguistic classifications. We compared discordant models of language classifications to the current Native American genome-wide variation using realistic demographic models analyzed under an Approximate Bayesian Computation (ABC) framework. Data on 381 STRs spread along the autosomes were gathered from the literature for populations representing the five main South Amerindian linguistic groups: Andean, Arawakan, Chibchan-Paezan, Macro-Jê, and Tupí. The results indicated a higher posterior probability for the classification proposed by J.H. Greenberg in 1987, although L. Campbell's 1997 classification cannot be ruled out. Based on Greenberg's classification, it was possible to date the time of Tupí-Arawakan divergence (2.8 kya), and the time of emergence of the structure between present day major language groups in South America (3.1 kya).

  5. Gene Expression in Accumbens GABA Neurons from Inbred Rats with Different Drug-Taking Behavior

    PubMed Central

    Sharp, B.M.; Chen, H.; Gong, S.; Wu, X.; Liu, Z.; Hiler, K.; Taylor, W.L.; Matta, S.G.

    2011-01-01

    Inbred Lewis and Fisher 344 rat strains differ greatly in drug self-administration; Lewis rats operantly self-administer drugs of abuse including nicotine, whereas Fisher self-administer poorly. As shown herein, operant food self-administration is similar. Based on their pivotal role in drug reward, we hypothesized that differences in basal gene expression in GABAergic neurons projecting from nucleus accumbens (NAcc) to ventral pallidum (VP) play a role in vulnerability to drug taking behavior. The transcriptomes of NAcc shell-VP GABAergic neurons from these two strains were analyzed in adolescents, using a multidisciplinary approach that combined stereotaxic iontophoretic brain microinjections, laser-capture microdissection (LCM) and microarray measurement of transcripts. LCM enriched the gene transcripts detected in GABA neurons compared to the residual NAcc tissue: a ratio of neuron/residual > 1 and false discovery rate (FDR) <5% yielded 6,623 transcripts, whereas a ratio of >3 yielded 3,514. Strain-dependent differences in gene expression within GABA neurons were identified; 322 vs. 60 transcripts showed 1.5-fold vs. 2-fold differences in expression (FDR<5%). Classification by gene ontology showed these 322 transcripts were widely distributed, without categorical enrichment. This is most consistent with a global change in GABA neuron function. Literature-mining by Chilibot found 38 genes related to synaptic plasticity, signaling and gene transcription, all of which determine drug-abuse; 33 genes have no known association with addiction or nicotine. In Lewis rats, upregulation of Mint-1, Cask, CamkIIδ, Ncam1, Vsnl1, Hpcal1 and Car8 indicates these transcripts likely contribute to altered signaling and synaptic function in NAcc GABA projection neurons to VP. PMID:21745336

  6. Metacoder: An R package for visualization and manipulation of community taxonomic diversity data.

    PubMed

    Foster, Zachary S L; Sharpton, Thomas J; Grünwald, Niklaus J

    2017-02-01

    Community-level data, the type generated by an increasing number of metabarcoding studies, is often graphed as stacked bar charts or pie graphs that use color to represent taxa. These graph types do not convey the hierarchical structure of taxonomic classifications and are limited by the use of color for categories. As an alternative, we developed metacoder, an R package for easily parsing, manipulating, and graphing publication-ready plots of hierarchical data. Metacoder includes a dynamic and flexible function that can parse most text-based formats that contain taxonomic classifications, taxon names, taxon identifiers, or sequence identifiers. Metacoder can then subset, sample, and order this parsed data using a set of intuitive functions that take into account the hierarchical nature of the data. Finally, an extremely flexible plotting function enables quantitative representation of up to 4 arbitrary statistics simultaneously in a tree format by mapping statistics to the color and size of tree nodes and edges. Metacoder also allows exploration of barcode primer bias by integrating functions to run digital PCR. Although it has been designed for data from metabarcoding research, metacoder can easily be applied to any data that has a hierarchical component such as gene ontology or geographic location data. Our package complements currently available tools for community analysis and is provided open source with an extensive online user manual.

  7. Metacoder: An R package for visualization and manipulation of community taxonomic diversity data

    PubMed Central

    Foster, Zachary S. L.; Sharpton, Thomas J.

    2017-01-01

    Community-level data, the type generated by an increasing number of metabarcoding studies, is often graphed as stacked bar charts or pie graphs that use color to represent taxa. These graph types do not convey the hierarchical structure of taxonomic classifications and are limited by the use of color for categories. As an alternative, we developed metacoder, an R package for easily parsing, manipulating, and graphing publication-ready plots of hierarchical data. Metacoder includes a dynamic and flexible function that can parse most text-based formats that contain taxonomic classifications, taxon names, taxon identifiers, or sequence identifiers. Metacoder can then subset, sample, and order this parsed data using a set of intuitive functions that take into account the hierarchical nature of the data. Finally, an extremely flexible plotting function enables quantitative representation of up to 4 arbitrary statistics simultaneously in a tree format by mapping statistics to the color and size of tree nodes and edges. Metacoder also allows exploration of barcode primer bias by integrating functions to run digital PCR. Although it has been designed for data from metabarcoding research, metacoder can easily be applied to any data that has a hierarchical component such as gene ontology or geographic location data. Our package complements currently available tools for community analysis and is provided open source with an extensive online user manual. PMID:28222096

  8. The distance function effect on k-nearest neighbor classification for medical datasets.

    PubMed

    Hu, Li-Yu; Huang, Min-Wei; Ke, Shih-Wen; Tsai, Chih-Fong

    2016-01-01

    K-nearest neighbor (k-NN) classification is a conventional non-parametric classifier, which has been used as the baseline classifier in many pattern classification problems. It is based on measuring the distances between the test data and each of the training data to decide the final classification output. Because the Euclidean distance function is the most widely used distance metric in k-NN, no study has examined the classification performance of k-NN with different distance functions, especially for various medical domain problems. Therefore, the aim of this paper is to investigate whether the distance function can affect the k-NN performance over different medical datasets. Our experiments are based on three different types of medical datasets containing categorical, numerical, and mixed types of data, and four different distance functions (Euclidean, cosine, Chi square, and Minkowski) are used individually during k-NN classification. The experimental results show that using the Chi square distance function is the best choice for the three different types of datasets. However, the cosine and Euclidean (and Minkowski) distance functions perform the worst over the mixed type of datasets. In this paper, we demonstrate that the chosen distance function can affect the classification accuracy of the k-NN classifier. For the medical domain datasets including the categorical, numerical, and mixed types of data, k-NN based on the Chi square distance function performs the best.
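
To make the comparison in this abstract concrete, the following is a minimal sketch of a k-NN classifier whose distance function is a swappable parameter. It is not the authors' code. The chi-square form used here, sum((x - y)^2 / (x + y)), is one common variant and assumes non-negative features; the Minkowski order p = 3 and the function names are likewise illustrative assumptions.

```python
import math
from collections import Counter


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def minkowski(a, b, p=3):
    # p = 3 is an arbitrary illustrative choice; p = 2 reduces to Euclidean.
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # convention chosen here for zero vectors
    return 1.0 - dot / (norm_a * norm_b)


def chi_square(a, b):
    # Assumed variant; terms with x + y == 0 are skipped to avoid division by zero.
    return sum((x - y) ** 2 / (x + y) for x, y in zip(a, b) if x + y > 0)


def knn_predict(train_X, train_y, query, k=3, distance=euclidean):
    """Majority vote among the k training points closest to `query`."""
    ranked = sorted(zip(train_X, train_y),
                    key=lambda pair: distance(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Passing `distance=chi_square` or `distance=cosine_distance` to `knn_predict` changes only the neighbour ranking, which is the single variable the study manipulates.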

  9. Network selection, Information filtering and Scalable computation

    NASA Astrophysics Data System (ADS)

    Ye, Changqing

    This dissertation explores two application scenarios of the sparsity pursuit method on large-scale data sets. The first scenario is classification and regression in analyzing high dimensional structured data, where predictors correspond to nodes of a given directed graph. This arises in, for instance, identification of disease genes for Parkinson's disease from a network of candidate genes. In such a situation, the directed graph describes dependencies among the genes, where the directions of edges represent certain causal effects. Key to high-dimensional structured classification and regression is how to utilize dependencies among predictors as specified by directions of the graph. In this dissertation, we develop a novel method that fully takes into account such dependencies formulated through certain nonlinear constraints. We apply the proposed method to two applications, feature selection in large margin binary classification and in linear regression. We implement the proposed method through difference convex programming for the cost function and constraints. Finally, theoretical and numerical analyses suggest that the proposed method achieves the desired objectives. An application to disease gene identification is presented. The second application scenario is personalized information filtering which extracts the information specifically relevant to a user, predicting his/her preference over a large number of items, based on the opinions of users who think alike or its content. This problem is cast into the framework of regression and classification, where we introduce novel partial latent models to integrate additional user-specific and content-specific predictors, for higher predictive accuracy. In particular, we factorize a user-over-item preference matrix into a product of two matrices, each representing a user's preference and an item preference by users. Then we propose a likelihood method to seek a sparsest latent factorization, from a class of over-complete factorizations, possibly with a high percentage of missing values. This promotes additional sparsity beyond rank reduction. Computationally, we design methods based on a "decomposition and combination" strategy, to break large-scale optimization into many small subproblems to solve in a recursive and parallel manner. On this basis, we implement the proposed methods through multi-platform shared-memory parallel programming, and through Mahout, a library for scalable machine learning and data mining, for MapReduce computation. For example, our methods are scalable to a dataset consisting of three billion observations on a single machine with sufficient memory, with good timings. Both theoretical and numerical investigations show that the proposed methods exhibit significant improvement in accuracy over state-of-the-art scalable methods.

  10. Automatic annotation of protein motif function with Gene Ontology terms.

    PubMed

    Lu, Xinghua; Zhai, Chengxiang; Gopalakrishnan, Vanathi; Buchanan, Bruce G

    2004-09-02

    Conserved protein sequence motifs are short stretches of amino acid sequence patterns that potentially encode the function of proteins. Several sequence pattern searching algorithms and programs exist for identifying candidate protein motifs at the whole genome level. However, a much needed and important task is to determine the functions of the newly identified protein motifs. The Gene Ontology (GO) project is an endeavor to annotate the function of genes or protein sequences with terms from a dynamic, controlled vocabulary and these annotations serve well as a knowledge base. This paper presents methods to mine the GO knowledge base and use the association between the GO terms assigned to a sequence and the motifs matched by the same sequence as evidence for predicting the functions of novel protein motifs automatically. The task of assigning GO terms to protein motifs is viewed as both a binary classification and information retrieval problem, where PROSITE motifs are used as samples for model training and functional prediction. The mutual information of a motif and a GO term association is found to be a very useful feature. We take advantage of the known motifs to train a logistic regression classifier, which allows us to combine mutual information with other frequency-based features and obtain a probability of correct association. The trained logistic regression model has intuitively meaningful and logically plausible parameter values, and performs very well empirically according to our evaluation criteria. In this research, different methods for automatic annotation of protein motifs have been investigated. Empirical results demonstrate that the methods have a great potential for detecting and augmenting information about the functions of newly discovered candidate protein motifs.
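
The mutual-information feature mentioned in this abstract can be illustrated with the textbook definition over two binary indicators: whether a sequence matches the motif, and whether it is annotated with the GO term. The sketch below computes it from a 2x2 table of sequence counts. The abstract does not spell out the exact estimator used in the paper, so the function and its count arguments are assumptions made for illustration.

```python
import math


def mutual_information(n11, n10, n01, n00):
    """Mutual information (bits) between motif match M and GO annotation G.

    n11: sequences matching the motif and annotated with the term
    n10: sequences matching the motif without the term
    n01: sequences with the term that do not match the motif
    n00: sequences with neither
    """
    total = n11 + n10 + n01 + n00
    if total == 0:
        raise ValueError("at least one sequence is required")
    joint = {
        (1, 1): n11 / total, (1, 0): n10 / total,
        (0, 1): n01 / total, (0, 0): n00 / total,
    }
    p_m = {1: (n11 + n10) / total, 0: (n01 + n00) / total}
    p_g = {1: (n11 + n01) / total, 0: (n10 + n00) / total}
    mi = 0.0
    for (m, g), p in joint.items():
        if p > 0:  # empty cells contribute nothing to the sum
            mi += p * math.log2(p / (p_m[m] * p_g[g]))
    return mi
```

A value near zero means that the motif and the term co-occur about as often as chance predicts. In a setup like the one described, such a score would be one input feature among several frequency-based features of the classifier.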

  11. Cloning and Characterization of a Cell Senescence Gene for Breast Cancer Cells

    DTIC Science & Technology

    2004-07-01

    have already established the inducible expression system in a retroviral vector for these studies. F. References 1. Hayflick, L. (1965). The limited ...CLASSIFICATION 18. SECURITY CLASSIFICATION 19. SECURITY CLASSIFICATION 20. LIMITATION OF ABSTRACT OF REPORT OF THIS PAGE OF ABSTRACT Unclassified...13-14 Annual report A. Introduction Normal diploid mammalian cells display a limited proliferative life span in culture (1-3

  12. Genome-wide identification and analysis of the MADS-box gene family in apple.

    PubMed

    Tian, Yi; Dong, Qinglong; Ji, Zhirui; Chi, Fumei; Cong, Peihua; Zhou, Zongshan

    2015-01-25

    The MADS-box gene family is one of the most widely studied families in plants and has diverse developmental roles in flower pattern formation, gametophyte cell division and fruit differentiation. Although the genome-wide analysis of this family has been performed in some species, little is known regarding MADS-box genes in apple (Malus domestica). In this study, 146 MADS-box genes were identified in the apple genome and were phylogenetically clustered into six subgroups (MIKC(c), MIKC*, Mα, Mβ, Mγ and Mδ) with the MADS-box genes from Arabidopsis and rice. The predicted apple MADS-box genes were distributed across all 17 chromosomes at different densities. Additionally, the MADS-box domain, exon length, gene structure and motif compositions of the apple MADS-box genes were analysed. Moreover, the expression of all of the apple MADS-box genes was analysed in the root, stem, leaf, flower tissues and five stages of fruit development. All of the apple MADS-box genes, with the exception of some genes in each group, were expressed in at least one of the tissues tested, which indicates that the MADS-box genes are involved in various aspects of the physiological and developmental processes of the apple. To the best of our knowledge, this report describes the first genome-wide analysis of the apple MADS-box gene family, and the results should provide valuable information for understanding the classification, cloning and putative functions of this family. Copyright © 2014 Elsevier B.V. All rights reserved.

  13. 21 CFR 864.7290 - Factor deficiency test.

    Code of Federal Regulations, 2010 CFR

    2010-04-01

    ... state (a person carrying both a recessive gene for a coagulation factor deficiency such as hemophilia and the corresponding normal gene). (b) Classification. Class II (performance standards). [45 FR 60613...

  14. AucPR: an AUC-based approach using penalized regression for disease prediction with high-dimensional omics data.

    PubMed

    Yu, Wenbao; Park, Taesung

    2014-01-01

    It is common to get an optimal combination of markers for disease classification and prediction when multiple markers are available. Many approaches based on the area under the receiver operating characteristic curve (AUC) have been proposed. Existing works based on AUC in a high-dimensional context depend mainly on a non-parametric, smooth approximation of AUC, with no work using a parametric AUC-based approach, for high-dimensional data. We propose an AUC-based approach using penalized regression (AucPR), which is a parametric method used for obtaining a linear combination for maximizing the AUC. To obtain the AUC maximizer in a high-dimensional context, we transform a classical parametric AUC maximizer, which is used in a low-dimensional context, into a regression framework and thus, apply the penalization regression approach directly. Two kinds of penalization, lasso and elastic net, are considered. The parametric approach can avoid some of the difficulties of a conventional non-parametric AUC-based approach, such as the lack of an appropriate concave objective function and a prudent choice of the smoothing parameter. We apply the proposed AucPR for gene selection and classification using four real microarray and synthetic data. Through numerical studies, AucPR is shown to perform better than the penalized logistic regression and the nonparametric AUC-based method, in the sense of AUC and sensitivity for a given specificity, particularly when there are many correlated genes. We propose a powerful parametric and easily-implementable linear classifier AucPR, for gene selection and disease prediction for high-dimensional data. AucPR is recommended for its good prediction performance. Beside gene expression microarray data, AucPR can be applied to other types of high-dimensional omics data, such as miRNA and protein data.
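
    Illustrative note: the quantity that AucPR is said to maximise, the AUC of a linear combination of markers, can be made concrete with a short sketch. The code below only evaluates the empirical AUC of a given weight vector (the Mann-Whitney form, with ties counted as one half); it does not implement the penalized regression step described in the abstract. The function names and the toy data are assumptions for illustration.

```python
def linear_score(weights, features):
    """Linear combination of marker values for one sample."""
    return sum(w * x for w, x in zip(weights, features))


def empirical_auc(scores_pos, scores_neg):
    """Fraction of (case, control) pairs in which the case scores higher.

    Ties contribute 1/2, which makes this the Mann-Whitney estimate of the AUC.
    """
    if not scores_pos or not scores_neg:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Toy data (hypothetical): two markers per sample.
cases = [(2.0, 1.0), (3.0, 0.5)]
controls = [(1.0, 1.0), (0.5, 2.0)]
weights = (1.0, 0.0)  # use only the first marker

auc = empirical_auc(
    [linear_score(weights, x) for x in cases],
    [linear_score(weights, x) for x in controls],
)
print(auc)
```

    The pairwise loop is quadratic in the number of samples, which is acceptable for a sketch; a rank-based computation would be the usual choice for larger data sets.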

  15. An ADAM33 polymorphism associates with progression of preschool wheeze into childhood asthma: a prospective case-control study with replication in a birth cohort study.

    PubMed

    Klaassen, Ester M M; Penders, John; Jöbsis, Quirijn; van de Kant, Kim D G; Thijs, Carel; Mommers, Monique; van Schayck, Constant P; van Eys, Guillaume; Koppelman, Gerard H; Dompeling, Edward

    2015-01-01

    The influence of asthma candidate genes on the development from wheeze to asthma in young children still needs to be defined. To link genetic variants in asthma candidate genes to progression of wheeze to persistent wheeze into childhood asthma. In a prospective study, children with recurrent wheeze from the ADEM (Asthma DEtection and Monitoring) study were followed until the age of six. At that age a classification (transient wheeze or asthma) was based on symptoms, lung function and medication use. In 198 children the relationship between this classification and 30 polymorphisms in 16 asthma candidate genes was assessed by logistic regression. In case of an association based on a p<0.10, replication analysis was performed in an independent birth cohort study (KOALA study, n = 248 included for the present analysis). In the ADEM study, the minor alleles of ADAM33 rs511898 and rs528557 and the ORMDL3/GSDMB rs7216389 polymorphisms were negatively associated, whereas the minor alleles of IL4 rs2243250 and rs2070874 polymorphisms were positively associated with childhood asthma. When replicated in the KOALA study, ADAM33 rs528557 showed a negative association of the CG/GG-genotype with progression of recurrent wheeze into childhood asthma (0.50 (0.26-0.97) p = 0.04) and no association with preschool wheeze. Polymorphisms in ADAM33, ORMDL3/GSDMB and IL4 were associated with childhood asthma in a group of children with recurrent wheeze. The replication of the negative association of the CG/GG-genotype of rs528557 ADAM33 with childhood asthma in an independent birth cohort study confirms that a compromised ADAM33 gene may be implicated in the progression of wheeze into childhood asthma.

  16. Classification of Genes and Putative Biomarker Identification Using Distribution Metrics on Expression Profiles

    PubMed Central

    Huang, Hung-Chung; Jupiter, Daniel; VanBuren, Vincent

    2010-01-01

    Background Identification of genes with switch-like properties will facilitate discovery of regulatory mechanisms that underlie these properties, and will provide knowledge for the appropriate application of Boolean networks in gene regulatory models. As switch-like behavior is likely associated with tissue-specific expression, these gene products are expected to be plausible candidates as tissue-specific biomarkers. Methodology/Principal Findings In a systematic classification of genes and search for biomarkers, gene expression profiles (GEPs) of more than 16,000 genes from 2,145 mouse array samples were analyzed. Four distribution metrics (mean, standard deviation, kurtosis and skewness) were used to classify GEPs into four categories: predominantly-off, predominantly-on, graded (rheostatic), and switch-like genes. The arrays under study were also grouped and examined by tissue type. For example, arrays were categorized as ‘brain group’ and ‘non-brain group’; the Kolmogorov-Smirnov distance and Pearson correlation coefficient were then used to compare GEPs between brain and non-brain for each gene. We were thus able to identify tissue-specific biomarker candidate genes. Conclusions/Significance The methodology employed here may be used to facilitate disease-specific biomarker discovery. PMID:20140228
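
    Illustrative note: the four distribution metrics and the Kolmogorov-Smirnov distance named in the abstract above can be sketched in a few lines. The abstract does not give the exact estimators or the thresholds used to assign the four categories, so this sketch assumes population (divide-by-n) moments and excess kurtosis, and it stops at computing the metrics. All function names are hypothetical.

```python
import math
from bisect import bisect_right


def distribution_metrics(values):
    """Return (mean, standard deviation, skewness, excess kurtosis).

    Assumes population moments (divide by n), chosen here for simplicity.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0:
        raise ValueError("constant profile: skewness and kurtosis undefined")
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return mean, math.sqrt(m2), m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


def ks_distance(a, b):
    """Two-sample Kolmogorov-Smirnov distance: largest gap between empirical CDFs."""
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    a_sorted, b_sorted = sorted(a), sorted(b)
    distance = 0.0
    for x in sorted(set(a_sorted) | set(b_sorted)):
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        distance = max(distance, abs(cdf_a - cdf_b))
    return distance


# Hypothetical expression values for one gene in two groups of arrays.
brain = [8.1, 8.4, 7.9, 8.6]
non_brain = [2.0, 2.3, 1.8, 2.1]
print(distribution_metrics(brain + non_brain))
print(ks_distance(brain, non_brain))
```

    A gene whose expression separates the two groups completely, as in the toy values here, has a Kolmogorov-Smirnov distance of 1, which is the kind of signal the abstract associates with tissue-specific biomarker candidates.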

  17. Constrained clusters of gene expression profiles with pathological features.

    PubMed

    Sese, Jun; Kurokawa, Yukinori; Monden, Morito; Kato, Kikuya; Morishita, Shinichi

    2004-11-22

    Gene expression profiles should be useful in distinguishing variations in disease, since they reflect accurately the status of cells. The primary clustering of gene expression reveals the genotypes that are responsible for the proximity of members within each cluster, while further clustering elucidates the pathological features of the individual members of each cluster. However, since the first clustering process and the second classification step, in which the features are associated with clusters, are performed independently, the initial set of clusters may omit genes that are associated with pathologically meaningful features. Therefore, it is important to devise a way of identifying gene expression clusters that are associated with pathological features. We present the novel technique of 'itemset constrained clustering' (IC-Clustering), which computes the optimal cluster that maximizes the interclass variance of gene expression between groups, which are divided according to the restriction that only divisions that can be expressed using common features are allowed. This constraint automatically labels each cluster with a set of pathological features which characterize that cluster. When applied to liver cancer datasets, IC-Clustering revealed informative gene expression clusters, which could be annotated with various pathological features, such as 'tumor' and 'man', or 'except tumor' and 'normal liver function'. In contrast, the k-means method overlooked these clusters.

  18. Genomewide identification and expression analysis of the ARF gene family in apple.

    PubMed

    Luo, Xiao-Cui; Sun, Mei-Hong; Xu, Rui-Rui; Shu, Huai-Rui; Wang, Jia-Wei; Zhang, Shi-Zhong

    2014-12-01

    Auxin response factors (ARF) are transcription factors that regulate auxin responses in plants. Although the genomewide analysis of this family has been performed in some species, little is known regarding ARF genes in apple (Malus domestica). In this study, 31 putative apple ARF genes have been identified and located within the apple genome. The phylogenetic analysis revealed that MdARFs could be divided into three subfamilies (groups I, II and III). The predicted MdARFs were distributed across 15 of 17 chromosomes with different densities. In addition, the analysis of exon-intron junctions and of the intron phase inside the predicted coding region of each candidate gene has revealed high levels of conservation within and between phylogenetic groups. Expression profile analyses of MdARF genes were performed in different tissues (root, stem, leaf, flower and fruit), and all the selected genes were expressed in at least one of the tissues that were tested, which indicated that MdARFs are involved in various aspects of physiological and developmental processes of apple. To our knowledge, this report is the first to provide a genomewide analysis of the apple ARF gene family. This study provides valuable information for understanding the classification and putative functions of the ARF signal in apple.

  19. DIF Trees: Using Classification Trees to Detect Differential Item Functioning

    ERIC Educational Resources Information Center

    Vaughn, Brandon K.; Wang, Qiu

    2010-01-01

    A nonparametric tree classification procedure is used to detect differential item functioning for items that are dichotomously scored. Classification trees are shown to be an alternative procedure to detect differential item functioning other than the use of traditional Mantel-Haenszel and logistic regression analysis. A nonparametric…

  20. Relationship between Functional Classification Levels and Anaerobic Performance of Wheelchair Basketball Athletes

    ERIC Educational Resources Information Center

    Molik, Bartosz; Laskin, James J.; Kosmol, Andrzej; Skucas, Kestas; Bida, Urszula

    2010-01-01

    Wheelchair basketball athletes are classified using the International Wheelchair Basketball Federation (IWBF) functional classification system. The purpose of this study was to evaluate the relationship between upper extremity anaerobic performance (AnP) and all functional classification levels in wheelchair basketball. Ninety-seven male athletes…

  1. Calibration of Multiple In Silico Tools for Predicting Pathogenicity of Mismatch Repair Gene Missense Substitutions

    PubMed Central

    Thompson, Bryony A.; Greenblatt, Marc S.; Vallee, Maxime P.; Herkert, Johanna C.; Tessereau, Chloe; Young, Erin L.; Adzhubey, Ivan A.; Li, Biao; Bell, Russell; Feng, Bingjian; Mooney, Sean D.; Radivojac, Predrag; Sunyaev, Shamil R.; Frebourg, Thierry; Hofstra, Robert M.W.; Sijmons, Rolf H.; Boucher, Ken; Thomas, Alun; Goldgar, David E.; Spurdle, Amanda B.; Tavtigian, Sean V.

    2015-01-01

    Classification of rare missense substitutions observed during genetic testing for patient management is a considerable problem in clinical genetics. The Bayesian integrated evaluation of unclassified variants is a solution originally developed for BRCA1/2. Here, we take a step toward an analogous system for the mismatch repair (MMR) genes (MLH1, MSH2, MSH6, and PMS2) that confer colon cancer susceptibility in Lynch syndrome by calibrating in silico tools to estimate prior probabilities of pathogenicity for MMR gene missense substitutions. A qualitative five-class classification system was developed and applied to 143 MMR missense variants. This identified 74 missense substitutions suitable for calibration. These substitutions were scored using six different in silico tools (Align-Grantham Variation Grantham Deviation, multivariate analysis of protein polymorphisms [MAPP], Mut-Pred, PolyPhen-2.1, Sorting Intolerant From Tolerant, and Xvar), using curated MMR multiple sequence alignments where possible. The output from each tool was calibrated by regression against the classifications of the 74 missense substitutions; these calibrated outputs are interpretable as prior probabilities of pathogenicity. MAPP was the most accurate tool and MAPP + PolyPhen-2.1 provided the best-combined model (R² = 0.62 and area under receiver operating characteristic = 0.93). The MAPP + PolyPhen-2.1 output is sufficiently predictive to feed as a continuous variable into the quantitative Bayesian integrated evaluation for clinical classification of MMR gene missense substitutions. PMID:22949387

  2. Systematic review of autosomal recessive ataxias and proposal for a classification.

    PubMed

    Beaudin, Marie; Klein, Christopher J; Rouleau, Guy A; Dupré, Nicolas

    2017-01-01

    The classification of autosomal recessive ataxias represents a significant challenge because of high genetic heterogeneity and complex phenotypes. We conducted a comprehensive systematic review of the literature to examine all recessive ataxias in order to propose a new classification and properly circumscribe this field as new technologies are emerging for comprehensive targeted gene testing. We searched Pubmed and Embase to identify original articles on recessive forms of ataxia in humans for which a causative gene had been identified. Reference lists and public databases, including OMIM and GeneReviews, were also reviewed. We evaluated the clinical descriptions to determine if ataxia was a core feature of the phenotype and assessed the available evidence on the genotype-phenotype association. Included disorders were classified as primary recessive ataxias, as other complex movement or multisystem disorders with prominent ataxia, or as disorders that may occasionally present with ataxia. After removal of duplicates, 2354 references were reviewed and assessed for inclusion. A total of 130 articles were completely reviewed and included in this qualitative analysis. The proposed new list of autosomal recessive ataxias includes 45 gene-defined disorders for which ataxia is a core presenting feature. We propose a clinical algorithm based on the associated symptoms. We present a new classification for autosomal recessive ataxias that brings awareness to their complex phenotypes while providing a unified categorization of this group of disorders. This review should assist in the development of a consensus nomenclature useful in both clinical and research applications.

  3. Medical devices; hematology and pathology devices; classification of early growth response 1 gene fluorescence in-situ hybridization test system for specimen characterization. Final order.

    PubMed

    2014-09-03

    The Food and Drug Administration (FDA) is classifying early growth response 1 (EGR1) gene fluorescence in-situ hybridization (FISH) test system for specimen characterization into class II (special controls). The special controls that will apply to this device are identified in this order and will be part of the codified language for the early growth response 1 (EGR1) gene fluorescence in-situ hybridization (FISH) test system for specimen characterization classification. The Agency is classifying the device into class II (special controls) in order to provide a reasonable assurance of safety and effectiveness of the device.

  4. Heterogeneous data fusion for brain tumor classification.

    PubMed

    Metsis, Vangelis; Huang, Heng; Andronesi, Ovidiu C; Makedon, Fillia; Tzika, Aria

    2012-10-01

    Current research in biomedical informatics involves analysis of multiple heterogeneous data sets. This includes patient demographics, clinical and pathology data, treatment history, patient outcomes as well as gene expression, DNA sequences and other information sources such as gene ontology. Analysis of these data sets could lead to better disease diagnosis, prognosis, treatment and drug discovery. In this report, we present a novel machine learning framework for brain tumor classification based on heterogeneous data fusion of metabolic and molecular datasets, including state-of-the-art high-resolution magic angle spinning (HRMAS) proton (1H) magnetic resonance spectroscopy and gene transcriptome profiling, obtained from intact brain tumor biopsies. Our experimental results show that our novel framework outperforms any analysis using individual dataset.

  5. Filtered selection coupled with support vector machines generate a functionally relevant prediction model for colorectal cancer

    PubMed Central

    Gabere, Musa Nur; Hussein, Mohamed Aly; Aziz, Mohammad Azhar

    2016-01-01

    Purpose There has been considerable interest in using whole-genome expression profiles for the classification of colorectal cancer (CRC). The selection of important features is a crucial step before training a classifier. Methods In this study, we built a model that uses support vector machine (SVM) to classify cancer and normal samples using Affymetrix exon microarray data obtained from 90 samples of 48 patients diagnosed with CRC. From the 22,011 genes, we selected the 20, 30, 50, 100, 200, 300, and 500 genes most relevant to CRC using the minimum-redundancy–maximum-relevance (mRMR) technique. With these gene sets, an SVM model was designed using four different kernel types (linear, polynomial, radial basis function [RBF], and sigmoid). Results The best model, which used 30 genes and RBF kernel, outperformed other combinations; it had an accuracy of 84% for both tenfold and leave-one-out cross validations in discriminating the cancer samples from the normal samples. With this 30-gene set from mRMR, six classifiers were trained using random forest (RF), Bayes net (BN), multilayer perceptron (MLP), naïve Bayes (NB), reduced error pruning tree (REPT), and SVM. Two hybrids, mRMR + SVM and mRMR + BN, were the best models when tested on other datasets, and they achieved a prediction accuracy of 95.27% and 91.99%, respectively, compared to other mRMR hybrid models (mRMR + RF, mRMR + NB, mRMR + REPT, and mRMR + MLP). Ingenuity pathway analysis was used to analyze the functions of the 30 genes selected for this model and their potential association with CRC: CDH3, CEACAM7, CLDN1, IL8, IL6R, MMP1, MMP7, and TGFB1 were predicted to be CRC biomarkers. Conclusion This model could be used to further develop a diagnostic tool for predicting CRC based on gene expression data from patient samples. PMID:27330311

  6. Dynamic changes of yak (Bos grunniens) gut microbiota during growth revealed by polymerase chain reaction-denaturing gradient gel electrophoresis and metagenomics

    PubMed Central

    Nie, Yuanyang; Zhou, Zhiwei; Guan, Jiuqiang; Xia, Baixue; Luo, Xiaolin; Yang, Yang; Fu, Yu; Sun, Qun

    2017-01-01

    Objective To understand the dynamic structure, function, and influence on nutrient metabolism in hosts, it was crucial to assess the genetic potential of gut microbial community in yaks of different ages. Methods The denaturing gradient gel electrophoresis (DGGE) profiles and Illumina-based metagenomic sequencing on colon contents of 15 semi-domestic yaks were investigated. Unweighted pairwise grouping method with mathematical averages (UPGMA) clustering and principal component analysis (PCA) were used to analyze the DGGE fingerprint. The Illumina sequences were assembled, predicted to genes and functionally annotated, and then classified by querying protein sequences of the genes against the Kyoto encyclopedia of genes and genomes (KEGG) database. Results Metagenomic sequencing showed that more than 85% of ribosomal RNA (rRNA) gene sequences belonged to the phylum Firmicutes and Bacteroidetes, indicating that the family Ruminococcaceae (46.5%), Rikenellaceae (11.3%), Lachnospiraceae (10.0%), and Bacteroidaceae (6.3%) were dominant gut microbes. Over 50% of non-rRNA gene sequences represented the metabolic pathways of amino acids (14.4%), proteins (12.3%), sugars (11.9%), nucleotides (6.8%), lipids (1.7%), xenobiotics (1.4%), coenzymes, and vitamins (3.6%). Gene functional classification showed that most enzyme-coding genes were related to cellulose digestion and amino acids metabolic pathways. Conclusion Yaks’ age had a substantial effect on gut microbial composition. Comparative metagenomics of gut microbiota in 0.5-, 1.5-, and 2.5-year-old yaks revealed that the abundance of the class Clostridia, Bacteroidia, and Lentisphaeria, as well as the phylum Firmicutes, Bacteroidetes, Lentisphaerae, Tenericutes, and Cyanobacteria, varied more greatly during yaks’ growth, especially in young animals (0.5 and 1.5 years old). Gut microbes, including Bacteroides, Clostridium, and Lentisphaeria, contribute to the energy metabolism and synthesis of amino acids, which are essential to the normal growth of yaks. PMID:28183172

  7. Integrating Colon Cancer Microarray Data: Associating Locus-Specific Methylation Groups to Gene Expression-Based Classifications.

    PubMed

    Barat, Ana; Ruskin, Heather J; Byrne, Annette T; Prehn, Jochen H M

    2015-11-23

    Recently, considerable attention has been paid to gene expression-based classifications of colorectal cancers (CRC) and their association with patient prognosis. In addition to changes in gene expression, abnormal DNA-methylation is known to play an important role in cancer onset and development, and colon cancer is no exception to this rule. Large-scale technologies, such as methylation microarray assays and specific sequencing of methylated DNA, have been used to determine whole genome profiles of CpG island methylation in tissue samples. In this article, publicly available microarray-based gene expression and methylation data sets are used to characterize expression subtypes with respect to locus-specific methylation. A major objective was to determine whether integration of these data types improves previously characterized subtypes, or provides evidence for additional subtypes. We used unsupervised clustering techniques to determine methylation-based subgroups, which are subsequently annotated with three published expression-based classifications, comprising from three to six subtypes. Our results showed that, while methylation profiles provide a further basis for segregation of certain (Inflammatory and Goblet-like) finer-grained expression-based subtypes, they also suggest that other finer-grained subtypes are not distinctive and can be considered as a single subtype.

  8. Integrating Colon Cancer Microarray Data: Associating Locus-Specific Methylation Groups to Gene Expression-Based Classifications

    PubMed Central

    Barat, Ana; Ruskin, Heather J.; Byrne, Annette T.; Prehn, Jochen H. M.

    2015-01-01

    Recently, considerable attention has been paid to gene expression-based classifications of colorectal cancers (CRC) and their association with patient prognosis. In addition to changes in gene expression, abnormal DNA-methylation is known to play an important role in cancer onset and development, and colon cancer is no exception to this rule. Large-scale technologies, such as methylation microarray assays and specific sequencing of methylated DNA, have been used to determine whole genome profiles of CpG island methylation in tissue samples. In this article, publicly available microarray-based gene expression and methylation data sets are used to characterize expression subtypes with respect to locus-specific methylation. A major objective was to determine whether integration of these data types improves previously characterized subtypes, or provides evidence for additional subtypes. We used unsupervised clustering techniques to determine methylation-based subgroups, which are subsequently annotated with three published expression-based classifications, comprising from three to six subtypes. Our results showed that, while methylation profiles provide a further basis for segregation of certain (Inflammatory and Goblet-like) finer-grained expression-based subtypes, they also suggest that other finer-grained subtypes are not distinctive and can be considered as a single subtype. PMID:27600244

  9. Gene selection and cancer type classification of diffuse large-B-cell lymphoma using a bivariate mixture model for two-species data.

    PubMed

    Su, Yuhua; Nielsen, Dahlia; Zhu, Lei; Richards, Kristy; Suter, Steven; Breen, Matthew; Motsinger-Reif, Alison; Osborne, Jason

    2013-01-05

    A bivariate mixture model utilizing information across two species was proposed to solve the fundamental problem of identifying differentially expressed genes in microarray experiments. The model utility was illustrated using a dog and human lymphoma data set prepared by a group of scientists in the College of Veterinary Medicine at North Carolina State University. A small number of genes were identified as being differentially expressed in both species and the human genes in this cluster serve as a good predictor for classifying diffuse large-B-cell lymphoma (DLBCL) patients into two subgroups, the germinal center B-cell-like diffuse large B-cell lymphoma and the activated B-cell-like diffuse large B-cell lymphoma. The number of human genes that were observed to be significantly differentially expressed (21) from the two-species analysis was very small compared to the number of human genes (190) identified with only one-species analysis (human data). The genes may be clinically relevant/important, as this small set achieved low misclassification rates of DLBCL subtypes. Additionally, the two subgroups defined by this cluster of human genes had significantly different survival functions, indicating that the stratification based on gene-expression profiling using the proposed mixture model provided improved insight into the clinical differences between the two cancer subtypes.

  10. Prediction of gene expression in embryonic structures of Drosophila melanogaster.

    PubMed

    Samsonova, Anastasia A; Niranjan, Mahesan; Russell, Steven; Brazma, Alvis

    2007-07-01

    Understanding how sets of genes are coordinately regulated in space and time to generate the diversity of cell types that characterise complex metazoans is a major challenge in modern biology. The use of high-throughput approaches, such as large-scale in situ hybridisation and genome-wide expression profiling via DNA microarrays, is beginning to provide insights into the complexities of development. However, in many organisms the collection and annotation of comprehensive in situ localisation data is a difficult and time-consuming task. Here, we present a widely applicable computational approach, integrating developmental time-course microarray data with annotated in situ hybridisation studies, that facilitates the de novo prediction of tissue-specific expression for genes that have no in vivo gene expression localisation data available. Using a classification approach, trained with data from microarray and in situ hybridisation studies of gene expression during Drosophila embryonic development, we made a set of predictions on the tissue-specific expression of Drosophila genes that have not been systematically characterised by in situ hybridisation experiments. The reliability of our predictions is confirmed by literature-derived annotations in FlyBase, by overrepresentation of Gene Ontology biological process annotations, and, in a selected set, by detailed gene-specific studies from the literature. Our novel organism-independent method will be of considerable utility in enriching the annotation of gene function and expression in complex multicellular organisms.

  11. Prediction of Gene Expression in Embryonic Structures of Drosophila melanogaster

    PubMed Central

    Samsonova, Anastasia A; Niranjan, Mahesan; Russell, Steven; Brazma, Alvis

    2007-01-01

    Understanding how sets of genes are coordinately regulated in space and time to generate the diversity of cell types that characterise complex metazoans is a major challenge in modern biology. The use of high-throughput approaches, such as large-scale in situ hybridisation and genome-wide expression profiling via DNA microarrays, is beginning to provide insights into the complexities of development. However, in many organisms the collection and annotation of comprehensive in situ localisation data is a difficult and time-consuming task. Here, we present a widely applicable computational approach, integrating developmental time-course microarray data with annotated in situ hybridisation studies, that facilitates the de novo prediction of tissue-specific expression for genes that have no in vivo gene expression localisation data available. Using a classification approach, trained with data from microarray and in situ hybridisation studies of gene expression during Drosophila embryonic development, we made a set of predictions on the tissue-specific expression of Drosophila genes that have not been systematically characterised by in situ hybridisation experiments. The reliability of our predictions is confirmed by literature-derived annotations in FlyBase, by overrepresentation of Gene Ontology biological process annotations, and, in a selected set, by detailed gene-specific studies from the literature. Our novel organism-independent method will be of considerable utility in enriching the annotation of gene function and expression in complex multicellular organisms. PMID:17658945

  12. [Landscape classification: research progress and development trend].

    PubMed

    Liang, Fa-Chao; Liu, Li-Ming

    2011-06-01

    Landscape classification is the basis of research on landscape structure, process, and function, and also the prerequisite for landscape evaluation, planning, protection, and management, directly affecting the precision and practicability of landscape research. This paper reviewed the research progress on the landscape classification system, theory, and methodology, and summarized the key problems and deficiencies of current research. Some major landscape classification systems, e.g., LANMAP and MUFIC, were introduced and discussed. It was suggested that a qualitative and quantitative comprehensive classification based on the ideology of functional structure shape and on the integral consideration of landscape classification utility, landscape function, landscape structure, physiogeographical factors, and human disturbance intensity should be the major research direction in the future. The integration of mapping, 3S technology, quantitative mathematics modeling, computer artificial intelligence, and professional knowledge to enhance the precision of landscape classification would be the key issue and the development trend in landscape classification research.

  13. What is new in genetics and osteogenesis imperfecta classification?

    PubMed

    Valadares, Eugênia R; Carneiro, Túlio B; Santos, Paula M; Oliveira, Ana Cristina; Zabel, Bernhard

    2014-01-01

    Literature review of new genes related to osteogenesis imperfecta (OI) and update of its classification. Literature review in the PubMed and OMIM databases, followed by selection of relevant references. In 1979, Sillence et al. developed a classification of OI subtypes based on clinical features and disease severity: OI type I, mild, common, with blue sclera; OI type II, perinatal lethal form; OI type III, severe and progressively deforming, with normal sclera; and OI type IV, moderate severity with normal sclera. Approximately 90% of individuals with OI are heterozygous for mutations in the COL1A1 and COL1A2 genes, with dominant pattern of inheritance or sporadic mutations. After 2006, mutations were identified in the CRTAP, FKBP10, LEPRE1, PLOD2, PPIB, SERPINF1, SERPINH1, SP7, WNT1, BMP1, and TMEM38B genes, associated with recessive OI and mutation in the IFITM5 gene associated with dominant OI. Mutations in PLS3 were recently identified in families with osteoporosis and fractures, with X-linked inheritance pattern. In addition to the genetic complexity of the molecular basis of OI, extensive phenotypic variability resulting from individual loci has also been documented. Considering the discovery of new genes and limited genotype-phenotype correlation, the use of next-generation sequencing tools has become useful in molecular studies of OI cases. The recommendation of the Nosology Group of the International Society of Skeletal Dysplasias is to maintain the classification of Sillence as the prototypical form, universally accepted to classify the degree of severity in OI, while maintaining it free from direct molecular reference. Copyright © 2014 Sociedade Brasileira de Pediatria. Published by Elsevier Editora Ltda. All rights reserved.

  14. Defining the human deubiquitinating enzyme interaction landscape.

    PubMed

    Sowa, Mathew E; Bennett, Eric J; Gygi, Steven P; Harper, J Wade

    2009-07-23

    Deubiquitinating enzymes (Dubs) function to remove covalently attached ubiquitin from proteins, thereby controlling substrate activity and/or abundance. For most Dubs, their functions, targets, and regulation are poorly understood. To systematically investigate Dub function, we initiated a global proteomic analysis of Dubs and their associated protein complexes. This was accomplished through the development of a software platform called CompPASS, which uses unbiased metrics to assign confidence measurements to interactions from parallel nonreciprocal proteomic data sets. We identified 774 candidate interacting proteins associated with 75 Dubs. Using Gene Ontology, interactome topology classification, subcellular localization, and functional studies, we link Dubs to diverse processes, including protein turnover, transcription, RNA processing, DNA damage, and endoplasmic reticulum-associated degradation. This work provides the first glimpse into the Dub interaction landscape, places previously unstudied Dubs within putative biological pathways, and identifies previously unknown interactions and protein complexes involved in this increasingly important arm of the ubiquitin-proteasome pathway.

  15. Defining the Human Deubiquitinating Enzyme Interaction Landscape

    PubMed Central

    Sowa, Mathew E.; Bennett, Eric J.; Gygi, Steven P.; Harper, J. Wade

    2009-01-01

    Summary Deubiquitinating enzymes (Dubs) function to remove covalently attached ubiquitin from proteins, thereby controlling substrate activity and/or abundance. For most Dubs, their functions, targets, and regulation are poorly understood. To systematically investigate Dub function, we initiated a global proteomic analysis of Dubs and their associated protein complexes. This was accomplished through the development of a software platform, called CompPASS, which uses unbiased metrics to assign confidence measurements to interactions from parallel non-reciprocal proteomic datasets. We identified 774 candidate interacting proteins associated with 75 Dubs. Using Gene Ontology, interactome topology classification, sub-cellular localization and functional studies, we link Dubs to diverse processes, including protein turnover, transcription, RNA processing, DNA damage, and endoplasmic reticulum-associated degradation. This work provides the first glimpse into the Dub interaction landscape, places previously unstudied Dubs within putative biological pathways, and identifies previously unknown interactions and protein complexes involved in this increasingly important arm of the ubiquitin-proteasome pathway. PMID:19615732

  16. Functional annotation from the genome sequence of the giant panda.

    PubMed

    Huo, Tong; Zhang, Yinjie; Lin, Jianping

    2012-08-01

    The giant panda is one of the most critically endangered species due to the fragmentation and loss of its habitat. Studying the functions of proteins in this animal, especially specific trait-related proteins, is therefore necessary to protect the species. In this work, the functions of these proteins were investigated using the genome sequence of the giant panda. Data on 21,001 proteins and their functions were stored in the Giant Panda Protein Database, in which the proteins were divided into two groups: 20,179 proteins whose functions can be predicted by GeneScan formed the known-function group, whereas 822 proteins whose functions cannot be predicted by GeneScan comprised the unknown-function group. For the known-function group, we further classified the proteins by molecular function, biological process, cellular component, and tissue specificity. For the unknown-function group, we developed a strategy in which the proteins were filtered by cross-Blast to identify panda-specific proteins under the assumption that proteins related to the panda-specific traits in the unknown-function group exist. After this filtering procedure, we identified 32 proteins (2 of which are membrane proteins) specific to the giant panda genome as compared against the dog and horse genomes. Based on their amino acid sequences, these 32 proteins were further analyzed by functional classification using SVM-Prot, motif prediction using MyHits, and interacting protein prediction using the Database of Interacting Proteins. Nineteen proteins were predicted to be zinc-binding proteins, thus affecting the activities of nucleic acids. The 32 panda-specific proteins will be further investigated by structural and functional analysis.

  17. Naïve Bayes classification in R.

    PubMed

    Zhang, Zhongheng

    2016-06-01

    Naïve Bayes classification is a simple probabilistic classification method based on Bayes' theorem with the assumption of independence between features. The model is trained on a training dataset and makes predictions with the predict() function. This article introduces two functions, naiveBayes() and train(), for performing Naïve Bayes classification.
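
    As an illustration of the idea in this abstract, the following is a minimal sketch of a Gaussian naïve Bayes classifier. It is written in Python rather than R and does not reproduce the naiveBayes() or train() functions the article describes; the function names fit_gaussian_nb and predict_gaussian_nb, the Gaussian likelihood, and the toy data are all assumptions made for the example.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate a prior and per-feature mean/variance for each class.

    Hypothetical helper: X is a list of numeric feature rows, y the labels.
    """
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)

    model = {}
    n_samples = len(X)
    for label, rows in by_class.items():
        columns = list(zip(*rows))
        means = [sum(col) / len(col) for col in columns]
        # A small constant keeps every variance strictly positive.
        variances = [
            sum((v - m) ** 2 for v in col) / len(col) + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (len(rows) / n_samples, means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest log posterior for one sample."""

    def log_posterior(prior, means, variances):
        # Independence assumption: per-feature log likelihoods are summed.
        total = math.log(prior)
        for x, m, v in zip(row, means, variances):
            total += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        return total

    return max(model, key=lambda label: log_posterior(*model[label]))


# Toy data, invented for illustration only.
X = [[1.0, 2.0], [1.2, 1.9], [0.9, 2.1], [5.0, 6.0], [5.2, 5.8], [4.9, 6.1]]
y = ["a", "a", "a", "b", "b", "b"]
model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(model, [1.1, 2.0]))
```

    Working in log space avoids numerical underflow when many per-feature likelihoods are multiplied together.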

  18. Population Level Purifying Selection and Gene Expression Shape Subgenome Evolution in Maize.

    PubMed

    Pophaly, Saurabh D; Tellier, Aurélien

    2015-12-01

    The maize ancestor experienced a recent whole-genome duplication (WGD) followed by gene erosion which generated two subgenomes, the dominant subgenome (maize1) experiencing fewer deletions than maize2. We take advantage of available extensive polymorphism and gene expression data in maize to study purifying selection and gene expression divergence between WGD retained paralog pairs. We first report a strong correlation in nucleotide diversity between duplicate pairs, except for upstream regions. We then show that maize1 genes are under stronger purifying selection than maize2. WGD retained genes have higher gene dosage and biased Gene Ontologies consistent with previous studies. The relative gene expression of paralogs across tissues demonstrates that 98% of duplicate pairs have either subfunctionalized in a tissuewise manner or have diverged consistently in their expression thereby preventing functional complementation. Tissuewise subfunctionalization seems to be a hallmark of transcription factors, whereas consistent repression occurs for macromolecular complexes. We show that dominant gene expression is a strong determinant of the strength of purifying selection, explaining the inferred stronger negative selection on maize1 genes. We propose a novel expression-based classification of duplicates which is more robust to explain observed polymorphism patterns than the subgenome location. Finally, upstream regions of repressed genes exhibit an enrichment in transposable elements which indicates a possible mechanism for expression divergence. © The Author 2015. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.

  19. Clinical application of modified bag-of-features coupled with hybrid neural-based classifier in dengue fever classification using gene expression data.

    PubMed

    Chatterjee, Sankhadeep; Dey, Nilanjan; Shi, Fuqian; Ashour, Amira S; Fong, Simon James; Sen, Soumya

    2018-04-01

    Dengue fever detection and classification have a vital role due to the recent outbreaks of different kinds of dengue fever. Recently, the advancement in the microarray technology can be employed for such classification process. Several studies have established that the gene selection phase takes a significant role in the classifier performance. Subsequently, the current study focused on detecting two different variations, namely, dengue fever (DF) and dengue hemorrhagic fever (DHF). A modified bag-of-features method has been proposed to select the most promising genes in the classification process. Afterward, a modified cuckoo search optimization algorithm has been engaged to support the artificial neural network (ANN-MCS) to classify the unknown subjects into three different classes namely, DF, DHF, and another class containing convalescent and normal cases. The proposed method has been compared with three other well-known classifiers, namely, multilayer perceptron feed-forward network (MLP-FFN), artificial neural network (ANN) trained with cuckoo search (ANN-CS), and ANN trained with PSO (ANN-PSO). Experiments have been carried out with different numbers of clusters for the initial bag-of-features-based feature selection phase. After obtaining the reduced dataset, the hybrid ANN-MCS model has been employed for the classification process. The results have been compared in terms of the confusion matrix-based performance measuring metrics. The experimental results indicated a highly statistically significant improvement with the proposed classifier over the traditional ANN-CS model.

  20. [The establishment, development and application of classification approach of freshwater phytoplankton based on the functional group: a review].

    PubMed

    Yang, Wen; Zhu, Jin-Yong; Lu, Kai-Hong; Wan, Li; Mao, Xiao-Hua

    2014-06-01

    Appropriate schemes for classification of freshwater phytoplankton are prerequisites and important tools for revealing phytoplanktonic succession and studying freshwater ecosystems. An alternative approach, functional group of freshwater phytoplankton, has been proposed and developed due to the deficiencies of Linnaean and molecular identification in ecological applications. The functional group of phytoplankton is a classification scheme based on autoecology. In this study, the theoretical basis and classification criterion of functional group (FG), morpho-functional group (MFG) and morphology-based functional group (MBFG) were summarized, as well as their merits and demerits. FG was considered as the optimal classification approach for the aquatic ecology research and aquatic environment evaluation. The application status of FG was introduced, with the evaluation standards and problems of two approaches to assess water quality on the basis of FG, index methods of Q and QR, being briefly discussed.

  1. Identification and Analysis of Mitogen-Activated Protein Kinase (MAPK) Cascades in Fragaria vesca.

    PubMed

    Zhou, Heying; Ren, Suyue; Han, Yuanfang; Zhang, Qing; Qin, Ling; Xing, Yu

    2017-08-13

    Mitogen-activated protein kinase (MAPK) cascades are highly conserved signaling modules in eukaryotes, including yeasts, plants and animals. MAPK cascades are responsible for protein phosphorylation during signal transduction events, and typically consist of three protein kinases: MAPK, MAPK kinase, and MAPK kinase kinase. In this current study, we identified a total of 12 FvMAPK, 7 FvMAPKK, 73 FvMAPKKK, and one FvMAPKKKK genes in the recently published Fragaria vesca genome sequence. This work reports the classification, annotation and phylogenetic evaluation of these genes, along with an assessment of conserved motifs and expression profiling of members of the gene family. The expression profiles of the MAPK and MAPKK genes in different organs and fruit developmental stages were further investigated using quantitative real-time reverse transcription PCR (qRT-PCR). Finally, the MAPK and MAPKK expression patterns in response to hormone and abiotic stresses (salt, drought, and high and low temperature) were investigated in fruit and leaves of F. vesca. The results provide a platform for further characterization of the physiological and biochemical functions of MAPK cascades in strawberry.

  2. De Novo RNA Sequencing and Transcriptome Analysis of Monascus purpureus and Analysis of Key Genes Involved in Monacolin K Biosynthesis

    PubMed Central

    Zhang, Chan; Liang, Jian; Yang, Le; Sun, Baoguo; Wang, Chengtao

    2017-01-01

    Monascus purpureus is an important medicinal and edible microbial resource. To facilitate biological, biochemical, and molecular research on medicinal components of M. purpureus, we investigated the M. purpureus transcriptome by RNA sequencing (RNA-seq). An RNA-seq library was created using RNA extracted from a mixed sample of M. purpureus expressing different levels of monacolin K output. In total 29,713 unigenes were assembled from more than 60 million high-quality short reads. A BLAST search revealed hits for 21,331 unigenes in at least one of the protein or nucleotide databases used in this study. The 22,365 unigenes were categorized into 48 functional groups based on Gene Ontology classification. Owing to the economic and medicinal importance of M. purpureus, most studies on this organism have focused on the pharmacological activity of chemical components and the molecular function of genes involved in their biogenesis. In this study, we performed quantitative real-time PCR to detect the expression of genes related to monacolin K (mokA-mokI) at different phases (2, 5, 8, and 12 days) of M. purpureus M1 and M1-36. Our study found that mokF modulates monacolin K biogenesis in M. purpureus. Nine genes were suggested to be associated with the monacolin K biosynthesis. Studies on these genes could provide useful information on secondary metabolic processes in M. purpureus. These results indicate a detailed resource through genetic engineering of monacolin K biosynthesis in M. purpureus and related species. PMID:28114365

  3. Microbial-type terpene synthase genes occur widely in nonseed land plants, but not in seed plants

    DOE PAGES

    Jia, Qidong; Li, Guanglin; Köllner, Tobias G.; ...

    2016-10-10

    Here, the vast abundance of terpene natural products in nature is due to enzymes known as terpene synthases (TPSs) that convert acyclic prenyl diphosphate precursors into a multitude of cyclic and acyclic carbon skeletons. Yet the evolution of TPSs is not well understood at higher levels of classification. Microbial TPSs from bacteria and fungi are only distantly related to typical plant TPSs, whereas genes similar to microbial TPS genes have been recently identified in the lycophyte Selaginella moellendorffii. The goal of this study was to investigate the distribution, evolution, and biochemical functions of microbial terpene synthase-like (MTPSL) genes in other plants. By analyzing the transcriptomes of 1,103 plant species ranging from green algae to flowering plants, putative MTPSL genes were identified predominantly from nonseed plants, including liverworts, mosses, hornworts, lycophytes, and monilophytes. Directed searching for MTPSL genes in the sequenced genomes of a wide range of seed plants confirmed their general absence in this group. Among themselves, MTPSL proteins from nonseed plants form four major groups, with two of these more closely related to bacterial TPSs and the other two to fungal TPSs. Two of the four groups contain a canonical aspartate-rich “DDxxD” motif. The third group has a “DDxxxD” motif, and the fourth group has only the first two “DD” conserved in this motif. Upon heterologous expression, representative members from each of the four groups displayed diverse catalytic functions as monoterpene and sesquiterpene synthases, suggesting these are important for terpene formation in nonseed plants.
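
    The aspartate-rich motifs named in this abstract can be located in a protein sequence with a simple pattern search. The sketch below is only an illustration and is not the authors' pipeline: the function name find_motifs, the reading of "x" as any single residue letter, and the example sequences are all assumptions.

```python
import re

# "x" is read here as any single residue letter (an assumption).
# Lookahead groups let overlapping occurrences be reported as well.
MOTIFS = {
    "DDxxD": re.compile(r"(?=(DD[A-Z]{2}D))"),
    "DDxxxD": re.compile(r"(?=(DD[A-Z]{3}D))"),
}


def find_motifs(sequence):
    """Return (motif name, 0-based start, matched text) tuples, by position."""
    sequence = sequence.upper()
    hits = []
    for name, pattern in MOTIFS.items():
        for match in pattern.finditer(sequence):
            hits.append((name, match.start(), match.group(1)))
    return sorted(hits, key=lambda hit: hit[1])


# Invented peptide fragments, for illustration only.
print(find_motifs("MKLDDAVDGG"))   # contains DDxxD
print(find_motifs("MKLDDAVLDGG"))  # contains DDxxxD
```

    A sequence can match both patterns at the same position (for example DDAADD), so a real classification would need a rule for resolving such cases.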

  4. Comparative transcriptomics reveals genes involved in metabolic and immune pathways in the digestive gland of scallop Chlamys farreri following cadmium exposure

    NASA Astrophysics Data System (ADS)

    Zhang, Hui; Zhai, Yuxiu; Yao, Lin; Jiang, Yanhua; Li, Fengling

    2017-05-01

    Chlamys farreri is an economically important mollusk that can accumulate excessive amounts of cadmium (Cd). Studying the molecular mechanism of Cd accumulation in bivalves is difficult because of the lack of genome background. Transcriptomic analysis based on high-throughput RNA sequencing has been shown to be an efficient and powerful method for the discovery of relevant genes in non-model and genome reference-free organisms. Here, we constructed two cDNA libraries (control and Cd exposure groups) from the digestive gland of C. farreri and compared the transcriptomic data between them. A total of 227 673 transcripts were assembled into 105 071 unigenes, most of which shared high similarity with sequences in the NCBI non-redundant protein database. For functional classification, 24 493 unigenes were assigned to Gene Ontology terms. Additionally, EuKaryotic Ortholog Groups and Kyoto Encyclopedia of Genes and Genomes analyses assigned 12 028 unigenes to 26 categories and 7 849 unigenes to five pathways, respectively. Comparative transcriptomics analysis identified 3 800 unigenes that were differentially expressed in the Cd-treated group compared with the control group. Among them, genes associated with heavy metal accumulation were screened, including metallothionein, divalent metal transporter, and metal tolerance protein. The functional genes and predicted pathways identified in our study will contribute to a better understanding of the metabolic and immune system in the digestive gland of C. farreri. In addition, the transcriptomic data will provide a comprehensive resource that may contribute to the understanding of molecular mechanisms that respond to marine pollutants in bivalves.

  5. Microbial-type terpene synthase genes occur widely in nonseed land plants, but not in seed plants

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Jia, Qidong; Li, Guanglin; Köllner, Tobias G.

    Here, the vast abundance of terpene natural products in nature is due to enzymes known as terpene synthases (TPSs) that convert acyclic prenyl diphosphate precursors into a multitude of cyclic and acyclic carbon skeletons. Yet the evolution of TPSs is not well understood at higher levels of classification. Microbial TPSs from bacteria and fungi are only distantly related to typical plant TPSs, whereas genes similar to microbial TPS genes have been recently identified in the lycophyte Selaginella moellendorffii. The goal of this study was to investigate the distribution, evolution, and biochemical functions of microbial terpene synthase-like (MTPSL) genes in other plants. By analyzing the transcriptomes of 1,103 plant species ranging from green algae to flowering plants, putative MTPSL genes were identified predominantly from nonseed plants, including liverworts, mosses, hornworts, lycophytes, and monilophytes. Directed searching for MTPSL genes in the sequenced genomes of a wide range of seed plants confirmed their general absence in this group. Among themselves, MTPSL proteins from nonseed plants form four major groups, with two of these more closely related to bacterial TPSs and the other two to fungal TPSs. Two of the four groups contain a canonical aspartate-rich “DDxxD” motif. The third group has a “DDxxxD” motif, and the fourth group has only the first two “DD” conserved in this motif. Upon heterologous expression, representative members from each of the four groups displayed diverse catalytic functions as monoterpene and sesquiterpene synthases, suggesting these are important for terpene formation in nonseed plants.

  6. Translocations and mutations involving the nucleophosmin (NPM1) gene in lymphomas and leukemias.

    PubMed

    Falini, Brunangelo; Nicoletti, Ildo; Bolli, Niccolò; Martelli, Maria Paola; Liso, Arcangelo; Gorello, Paolo; Mandelli, Franco; Mecucci, Cristina; Martelli, Massimo Fabrizio

    2007-04-01

    Nucleophosmin (NPM) is a ubiquitously expressed nucleolar phosphoprotein which shuttles continuously between the nucleus and cytoplasm. Many findings have revealed a complex scenario of NPM functions and interactions, pointing to proliferative and growth-suppressive roles of this molecule. The gene NPM1 that encodes for nucleophosmin (NPM1) is translocated or mutated in various lymphomas and leukemias, forming fusion proteins (NPM-ALK, NPM-RARalpha, NPM-MLF1) or NPM mutant products. Here, we review the structure and functions of NPM, as well as the biological, clinical and pathological features of human hematologic malignancies with NPM1 gene alterations. NPM-ALK identifies a new category of T/Null lymphomas with distinctive molecular and clinico-pathological features, that is going to be included as a novel disease entity (ALK+ anaplastic large cell lymphoma) in the new WHO classification of lymphoid neoplasms. NPM1 mutations occur specifically in about 30% of adult de novo AML and cause aberrant cytoplasmic expression of NPM (hence the term NPMc+ AML). NPMc+ AML associates with normal karyotype, and shows wide morphological spectrum, multilineage involvement, a unique gene expression signature, a high frequency of FLT3-internal tandem duplications, and distinctive clinical and prognostic features. The availability of specific antibodies and molecular techniques for the detection of NPM1 gene alterations has an enormous impact on the biological study, diagnosis, prognostic stratification, and monitoring of minimal residual disease of various lymphomas and leukemias. The discovery of NPM1 gene alterations also represents the rational basis for development of molecular targeted drugs.

  7. DNA microarray‐based analysis of voluntary resistance wheel running reveals novel transcriptome leading robust hippocampal plasticity

    PubMed Central

    Lee, Min Chul; Rakwal, Randeep; Shibato, Junko; Inoue, Koshiro; Chang, Hyukki; Soya, Hideaki

    2014-01-01

    Abstract In two separate experiments, voluntary resistance wheel running with 30% of body weight (RWR), rather than wheel running (WR), led to greater enhancements, including adult hippocampal neurogenesis and cognitive functions, in conjunction with hippocampal brain‐derived neurotrophic factor (BDNF) signaling (Lee et al., J Appl Physiol, 2012; Neurosci Lett., 2013). Here we aimed to unravel novel molecular factors and gain insight into underlying molecular mechanisms for RWR‐enhanced hippocampal functions; a high‐throughput whole‐genome DNA microarray approach was applied to rats performing voluntary running for 4 weeks. RWR rats showed a significant decrease in average running distances although average work levels increased immensely, by about 11‐fold compared to WR, resulting in muscular adaptation for the fast‐twitch plantaris muscle. Global transcriptome profiling analysis identified 128 (sedentary × WR) and 169 (sedentary × RWR) up‐regulated (>1.5‐fold change), and 97 (sedentary × WR) and 468 (sedentary × RWR) down‐regulated (<0.75‐fold change) genes. Functional categorization using both pathway‐ or specific‐disease‐state‐focused gene classifications and Ingenuity Pathway Analysis (IPA) revealed expression pattern changes in the major categories of disease and disorders, molecular functions, and physiological system development and function. Genes specifically regulated with RWR include the newly identified factors of NFATc1, AVPR1A, and FGFR4, as well as previously known factors, BDNF and CREB mRNA. Interestingly, RWR down‐regulated multiple inflammatory cytokines (IL1B, IL2RA, and TNF) and chemokines (CXCL1, CXCL10, CCL2, and CCR4) with the SYCP3, PRL genes, which are potentially involved in regulating hippocampal neuroplastic changes. These results provide understanding of the voluntary‐RWR‐related hippocampal transcriptome, which will open a window to the underlying mechanisms of the positive effects of exercise, with therapeutic value for enhancing hippocampal functions. PMID:25413326
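
    The fold-change cut-offs quoted in this abstract (up-regulated above 1.5-fold, down-regulated below 0.75-fold) amount to a simple thresholding rule. The sketch below illustrates that rule only; the function names, the gene labels, and the intensity values are invented, and the study's actual normalization and statistics are not reproduced.

```python
def fold_changes(treated, control):
    """Ratio of treated to control signal for genes present in both dicts.

    Hypothetical helper: genes with a non-positive control value are skipped.
    """
    return {
        gene: treated[gene] / control[gene]
        for gene in treated
        if gene in control and control[gene] > 0
    }


def classify_fold_change(fold_change, up=1.5, down=0.75):
    """Label a ratio using the abstract's thresholds (strict inequalities)."""
    if fold_change > up:
        return "up"
    if fold_change < down:
        return "down"
    return "unchanged"


# Invented signal intensities, for illustration only.
control = {"geneA": 10.0, "geneB": 10.0, "geneC": 10.0}
treated = {"geneA": 30.0, "geneB": 5.0, "geneC": 11.0}

ratios = fold_changes(treated, control)
labels = {gene: classify_fold_change(r) for gene, r in ratios.items()}
print(labels)
```

    The two thresholds are not symmetric on a log scale, so a gene needs a larger relative change to be called up-regulated than down-regulated under this rule.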

  8. Lissencephaly: expanded imaging and clinical classification

    PubMed Central

    Di Donato, Nataliya; Chiari, Sara; Mirzaa, Ghayda M.; Aldinger, Kimberly; Parrini, Elena; Olds, Carissa; Barkovich, A. James; Guerrini, Renzo; Dobyns, William B.

    2017-01-01

    Lissencephaly (“smooth brain”, LIS) is a malformation of cortical development associated with deficient neuronal migration and abnormal formation of cerebral convolutions or gyri. The LIS spectrum includes agyria, pachygyria, and subcortical band heterotopia. Our first classification of LIS and subcortical band heterotopia (SBH) was developed to distinguish between the first two genetic causes of LIS – LIS1 (PAFAH1B1) and DCX. However, progress in molecular genetics has led to identification of 19 LIS-associated genes, leaving the existing classification system insufficient to distinguish the increasingly diverse patterns of LIS. To address this challenge, we reviewed clinical, imaging and molecular data on 188 patients with LIS-SBH ascertained during the last five years, and reviewed selected archival data on another ~1,400 patients. Using these data plus published reports, we constructed a new imaging based classification system with 21 recognizable patterns that reliably predict the most likely causative genes. These patterns do not correlate consistently with the clinical outcome, leading us to also develop a new scale useful for predicting clinical severity and outcome. Taken together, our work provides new tools that should prove useful for clinical management and genetic counselling of patients with LIS-SBH (imaging and severity based classifications), and guidance for prioritizing and interpreting genetic testing results (imaging based classification). PMID:28440899

  9. Cell of origin associated classification of B-cell malignancies by gene signatures of the normal B-cell hierarchy.

    PubMed

    Johnsen, Hans Erik; Bergkvist, Kim Steve; Schmitz, Alexander; Kjeldsen, Malene Krag; Hansen, Steen Møller; Gaihede, Michael; Nørgaard, Martin Agge; Bæch, John; Grønholdt, Marie-Louise; Jensen, Frank Svendsen; Johansen, Preben; Bødker, Julie Støve; Bøgsted, Martin; Dybkær, Karen

    2014-06-01

    Recent findings have suggested biological classification of B-cell malignancies as exemplified by the "activated B-cell-like" (ABC), the "germinal-center B-cell-like" (GCB) and primary mediastinal B-cell lymphoma (PMBL) subtypes of diffuse large B-cell lymphoma and "recurrent translocation and cyclin D" (TC) classification of multiple myeloma. Biological classification of B-cell derived cancers may be refined by a direct and systematic strategy where identification and characterization of normal B-cell differentiation subsets are used to define the cancer cell of origin phenotype. Here we propose a strategy combining multiparametric flow cytometry, global gene expression profiling and biostatistical modeling to generate B-cell subset specific gene signatures from sorted normal human immature, naive, germinal centrocytes and centroblasts, post-germinal memory B-cells, plasmablasts and plasma cells from available lymphoid tissues including lymph nodes, tonsils, thymus, peripheral blood and bone marrow. This strategy will provide an accurate image of the stage of differentiation, which prospectively can be used to classify any B-cell malignancy and eventually purify tumor cells. This report briefly describes the current models of the normal B-cell subset differentiation in multiple tissues and the pathogenesis of malignancies originating from the normal germinal B-cell hierarchy.

  10. Insights into animal and plant lectins with antimicrobial activities.

    PubMed

    Dias, Renata de Oliveira; Machado, Leandro Dos Santos; Migliolo, Ludovico; Franco, Octavio Luiz

    2015-01-05

    Lectins are multivalent proteins with the ability to recognize and bind diverse carbohydrate structures. The glyco-binding and diverse molecular structures observed in these protein classes make them a large and heterogeneous group with a wide range of biological activities in microorganisms, animals and plants. Lectins from plants and animals are commonly used in direct defense against pathogens and in immune regulation. This review focuses on sources of animal and plant lectins, describing their functional classification and tridimensional structures, relating these properties with biotechnological purposes, including antimicrobial activities. In summary, this work focuses on structural-functional elucidation of diverse lectin groups, shedding some light on host-pathogen interactions; it also examines their emergence as biotechnological tools through gene manipulation and development of new drugs.

  11. Heterogeneous activation of the TGFβ pathway in glioblastomas identified by gene expression-based classification using TGFβ-responsive genes

    PubMed Central

    Xu, Xie L; Kapoun, Ann M

    2009-01-01

    Background TGFβ has emerged as an attractive target for the therapeutic intervention of glioblastomas. Aberrant TGFβ overproduction in glioblastoma and other high-grade gliomas has been reported; however, to date, none of these reports has systematically examined the components of TGFβ signaling to gain a comprehensive view of TGFβ activation in large cohorts of human glioma patients. Methods TGFβ activation in mammalian cells leads to a transcriptional program that typically affects 5–10% of the genes in the genome. To systematically examine the status of TGFβ activation in high-grade glial tumors, we compiled a gene set of transcriptional response to TGFβ stimulation from tissue culture and in vivo animal studies. These genes were used to examine the status of TGFβ activation in high-grade gliomas including a large cohort of glioblastomas. Unsupervised and supervised classification analysis was performed in two independent, publicly available glioma microarray datasets. Results Unsupervised and supervised classification using the TGFβ-responsive gene list in two independent glial tumor gene expression data sets revealed various levels of TGFβ activation in these tumors. Among glioblastomas, one of the most devastating human cancers, two subgroups were identified that showed distinct TGFβ activation patterns as measured from transcriptional responses. Approximately 62% of glioblastoma samples analyzed showed strong TGFβ activation, while the rest showed a weak TGFβ transcriptional response. Conclusion Our findings suggest heterogeneous TGFβ activation in glioblastomas, which may cause potential differences in responses to anti-TGFβ therapies in these two distinct subgroups of glioblastoma patients. PMID:19192267

  12. Lex-SVM: exploring the potential of exon expression profiling for disease classification.

    PubMed

    Yuan, Xiongying; Zhao, Yi; Liu, Changning; Bu, Dongbo

    2011-04-01

    Exon expression profiling technologies, including exon arrays and RNA-Seq, measure the abundance of every exon in a gene. Compared with gene expression profiling technologies like 3' array, exon expression profiling technologies could detect alterations in both transcription and alternative splicing, therefore they are expected to be more sensitive in diagnosis. However, exon expression profiling also brings higher dimension, more redundancy, and significant correlation among features. Ignoring the correlation structure among exons of a gene, a popular classification method like L1-SVM selects exons individually from each gene and thus is vulnerable to noise. To overcome this limitation, we present in this paper a new variant of SVM named Lex-SVM to incorporate correlation structure among exons and known splicing patterns to promote classification performance. Specifically, we construct a new norm, ex-norm, including our prior knowledge on exon correlation structure to regularize the coefficients of a linear SVM. Lex-SVM can be solved efficiently using standard linear programming techniques. The advantage of Lex-SVM is that it can select features group-wise, force features in a subgroup to take equal weights and exclude the features that contradict the majority in the subgroup. Experimental results suggest that on exon expression profile, Lex-SVM is more accurate than existing methods. Lex-SVM also generates a more compact model and selects genes more consistently in cross-validation. Unlike L1-SVM selecting only one exon in a gene, Lex-SVM assigns equal weights to as many exons in a gene as possible, lending itself more easily to further interpretation.
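
    For orientation, the L1-SVM baseline mentioned in this abstract is commonly written as the optimization problem below. This is the standard textbook soft-margin form, not a formula taken from the paper; the symbols (weights w, bias b, slack variables ξ_i, penalty C, n samples, p features) are assumptions introduced here. The abstract does not define the ex-norm, so only the plain L1 penalty is shown.

```latex
\begin{aligned}
\min_{w,\,b,\,\xi}\quad & \lVert w \rVert_1 + C \sum_{i=1}^{n} \xi_i \\
\text{s.t.}\quad & y_i \left( w^{\top} x_i + b \right) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1, \dots, n .
\end{aligned}
% Linear-programming form: write w = u - v with u, v \ge 0, so that
% \lVert w \rVert_1 is replaced by \sum_{j=1}^{p} (u_j + v_j) and both the
% objective and the constraints become linear in (u, v, b, \xi).
```

    The split of w into two non-negative parts is the usual reason an L1-regularized linear SVM can be handed to a linear programming solver, which is consistent with the abstract's remark that Lex-SVM is solved with standard linear programming techniques.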

  13. Computational intelligence techniques for biological data mining: An overview

    NASA Astrophysics Data System (ADS)

    Faye, Ibrahima; Iqbal, Muhammad Javed; Said, Abas Md; Samir, Brahim Belhaouari

    2014-10-01

    Computational techniques have been successfully utilized for a highly accurate analysis and modeling of multifaceted and raw biological data gathered from various genome sequencing projects. These techniques are proving much more effective to overcome the limitations of the traditional in-vitro experiments on the constantly increasing sequence data. However, the most critical problems that have caught the attention of researchers include, but are not limited to: accurate structure and function prediction of unknown proteins, protein subcellular localization prediction, finding protein-protein interactions, protein fold recognition, analysis of microarray gene expression data, etc. To solve these problems, various classification and clustering techniques using machine learning have been extensively used in the published literature. These techniques include neural network algorithms, genetic algorithms, fuzzy ARTMAP, K-Means, K-NN, SVM, Rough set classifiers, decision tree and HMM based algorithms. Major difficulties in applying the above algorithms include the limitations found in the previous feature encoding and selection methods while extracting the best features, increasing classification accuracy and decreasing the running time overheads of the learning algorithms. The application of this research would be potentially useful in drug design and in the diagnosis of some diseases. This paper presents a concise overview of the well-known protein classification techniques.

  14. Functional Communication Profiles in Children with Cerebral Palsy in Relation to Gross Motor Function and Manual and Intellectual Ability.

    PubMed

    Choi, Ja Young; Park, Jieun; Choi, Yoon Seong; Goh, Yu Ra; Park, Eun Sook

    2018-07-01

    The aim of the present study was to investigate communication function using classification systems and its association with other functional profiles, including gross motor function, manual ability, intellectual functioning, and brain magnetic resonance imaging (MRI) characteristics in children with cerebral palsy (CP). This study recruited 117 individuals with CP aged from 4 to 16 years. The Communication Function Classification System (CFCS), Viking Speech Scale (VSS), Speech Language Profile Groups (SLPG), Gross Motor Function Classification System (GMFCS), Manual Ability Classification System (MACS), and intellectual functioning were assessed in the children along with brain MRI categorization. Very strong relationships were noted among the VSS, CFCS, and SLPG, although these three communication systems provide complementary information, especially for children with mid-range communication impairment. These three communication classification systems were strongly related with the MACS, but moderately related with the GMFCS. Multiple logistic regression analysis indicated that manual ability and intellectual functioning were significantly related with VSS and CFCS function, whereas only intellectual functioning was significantly related with SLPG functioning in children with CP. Communication function in children with a periventricular white matter lesion (PVWL) varied widely. In the cases with a PVWL, poor functioning was more common on the SLPG, compared to the VSS and CFCS. Very strong relationships were noted among three communication classification systems that are closely related with intellectual ability. Compared to gross motor function, manual ability seemed more closely related with communication function in these children. © Copyright: Yonsei University College of Medicine 2018.

  15. Structural organization and classification of cytochrome P450 genes in flax (Linum usitatissimum L.).

    PubMed

    Babu, Peram Ravindra; Rao, Khareedu Venkateswara; Reddy, Vudem Dashavantha

    2013-01-15

    Flax CYPome analysis resulted in the identification of 334 putative cytochrome P450 (CYP450) genes in the cultivated flax genome. Classification of flax CYP450 genes based on the sequence similarity with Arabidopsis orthologs and CYP450 nomenclature, revealed 10 clans representing 44 families and 98 subfamilies. CYP80, CYP83, CYP92, CYP702, CYP705, CYP708, CYP728, CYP729, CYP733 and CYP736 families are absent in the flax genome. The subfamily members exhibited conserved sequences, length of exons and phasing of introns. Similarity search of the genomic resources of wild flax species Linum bienne with CYP450 coding sequences of the cultivated flax, revealed the presence of 127 CYP450 gene orthologs, indicating amplification of novel CYP450 genes in the cultivated flax. Seven families CYP73, 74, 75, 76, 77, 84 and 709, coding for enzymes associated with phenylpropanoid/fatty acid metabolism, showed extensive gene amplification in the flax. About 59% of the flax CYP450 genes were present in the EST libraries. Copyright © 2012 Elsevier B.V. All rights reserved.

  16. Impact of recent molecular phylogenetic studies on classification of ascomycete yeasts

    USDA-ARS's Scientific Manuscript database

    Analyses of concatenated gene sequences as well as whole genome sequences are resolving relationships among the ascomycete yeasts (Saccharomycotina), thus allowing classification of members of this subphylum to be based on phylogeny. In addition, changes implemented in the new Botanical Code [Intern...

  17. On the integrity of functional brain networks in schizophrenia, Parkinson's disease, and advanced age: Evidence from connectivity-based single-subject classification.

    PubMed

    Pläschke, Rachel N; Cieslik, Edna C; Müller, Veronika I; Hoffstaedter, Felix; Plachti, Anna; Varikuti, Deepthi P; Goosses, Mareike; Latz, Anne; Caspers, Svenja; Jockwitz, Christiane; Moebus, Susanne; Gruber, Oliver; Eickhoff, Claudia R; Reetz, Kathrin; Heller, Julia; Südmeyer, Martin; Mathys, Christian; Caspers, Julian; Grefkes, Christian; Kalenscher, Tobias; Langner, Robert; Eickhoff, Simon B

    2017-12-01

    Previous whole-brain functional connectivity studies achieved successful classifications of patients and healthy controls but only offered limited specificity as to affected brain systems. Here, we examined whether the connectivity patterns of functional systems affected in schizophrenia (SCZ), Parkinson's disease (PD), or normal aging equally translate into high classification accuracies for these conditions. We compared classification performance between pre-defined networks for each group and, for any given network, between groups. Separate support vector machine classifications of 86 SCZ patients, 80 PD patients, and 95 older adults relative to their matched healthy/young controls, respectively, were performed on functional connectivity in 12 task-based, meta-analytically defined networks using 25 replications of a nested 10-fold cross-validation scheme. Classification performance of the various networks clearly differed between conditions, as those networks that best classified one disease were usually non-informative for the other. For SCZ, but not PD, emotion-processing, empathy, and cognitive action control networks distinguished patients most accurately from controls. For PD, but not SCZ, networks subserving autobiographical or semantic memory, motor execution, and theory-of-mind cognition yielded the best classifications. In contrast, young-old classification was excellent based on all networks and outperformed both clinical classifications. Our pattern-classification approach captured associations between clinical and developmental conditions and functional network integrity with a higher level of specificity than did previous whole-brain analyses. Taken together, our results support resting-state connectivity as a marker of functional dysregulation in specific networks known to be affected by SCZ and PD, while suggesting that aging affects network integrity in a more global way. Hum Brain Mapp 38:5845-5858, 2017. © 2017 Wiley Periodicals, Inc. 
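
    The evaluation scheme named in this abstract (25 replications of a nested 10-fold cross-validation) can be sketched as an index-splitting routine. The sketch below is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the sample count of 100, and the seed are hypothetical, and no classifier is fitted. It only shows how inner folds are drawn from the outer training set so that outer test samples never take part in tuning.

```python
import random


def k_fold_indices(n_samples, k, rng):
    """Shuffle sample positions and deal them into k near-equal folds."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]


def nested_cv_splits(n_samples, outer_k=10, inner_k=10, replications=25, seed=0):
    """Yield (replication, outer_fold, train, test, inner_splits).

    inner_splits is a list of (fit, validation) index lists that partition
    only the outer training set; the outer test fold is kept out of it.
    """
    rng = random.Random(seed)
    for rep in range(replications):
        outer_folds = k_fold_indices(n_samples, outer_k, rng)
        for fold_id, test in enumerate(outer_folds):
            train = [i for fold in outer_folds if fold is not test for i in fold]
            inner_splits = []
            # Inner folds are positions into `train`, mapped back to sample ids.
            for val_positions in k_fold_indices(len(train), inner_k, rng):
                val = [train[p] for p in val_positions]
                val_set = set(val)
                fit = [i for i in train if i not in val_set]
                inner_splits.append((fit, val))
            yield rep, fold_id, train, test, inner_splits


# Hypothetical sample count; the abstract does not give the combined
# patient-plus-control totals per comparison.
splits = list(nested_cv_splits(n_samples=100))
print(len(splits), "outer train/test splits in total")
```

    In a full analysis, hyperparameters would be chosen on the inner splits and the chosen model scored once on each outer test fold; averaging over replications reduces the dependence on any single random partition.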

  18. Railroad Classification Yard Technology : An Introductory Analysis of Functions and Operations

    DOT National Transportation Integrated Search

    1975-05-01

    A review of the basic operating characteristics and functions of railroad classification yards is presented. Introductory descriptions of terms, concepts, and problems of railroad operations involving classification yards are included in an attempt t...

  19. Pathogenic Germline Variants in 10,389 Adult Cancers.

    PubMed

    Huang, Kuan-Lin; Mashl, R Jay; Wu, Yige; Ritter, Deborah I; Wang, Jiayin; Oh, Clara; Paczkowska, Marta; Reynolds, Sheila; Wyczalkowski, Matthew A; Oak, Ninad; Scott, Adam D; Krassowski, Michal; Cherniack, Andrew D; Houlahan, Kathleen E; Jayasinghe, Reyka; Wang, Liang-Bo; Zhou, Daniel Cui; Liu, Di; Cao, Song; Kim, Young Won; Koire, Amanda; McMichael, Joshua F; Hucthagowder, Vishwanathan; Kim, Tae-Beom; Hahn, Abigail; Wang, Chen; McLellan, Michael D; Al-Mulla, Fahd; Johnson, Kimberly J; Lichtarge, Olivier; Boutros, Paul C; Raphael, Benjamin; Lazar, Alexander J; Zhang, Wei; Wendl, Michael C; Govindan, Ramaswamy; Jain, Sanjay; Wheeler, David; Kulkarni, Shashikant; Dipersio, John F; Reimand, Jüri; Meric-Bernstam, Funda; Chen, Ken; Shmulevich, Ilya; Plon, Sharon E; Chen, Feng; Ding, Li

    2018-04-05

    We conducted the largest investigation of predisposition variants in cancer to date, discovering 853 pathogenic or likely pathogenic variants in 8% of 10,389 cases from 33 cancer types. Twenty-one genes showed single or cross-cancer associations, including novel associations of SDHA in melanoma and PALB2 in stomach adenocarcinoma. The 659 predisposition variants and 18 additional large deletions in tumor suppressors, including ATM, BRCA1, and NF1, showed low gene expression and frequent (43%) loss of heterozygosity or biallelic two-hit events. We also discovered 33 such variants in oncogenes, including missenses in MET, RET, and PTPN11 associated with high gene expression. We nominated 47 additional predisposition variants from prioritized VUSs supported by multiple evidences involving case-control frequency, loss of heterozygosity, expression effect, and co-localization with mutations and modified residues. Our integrative approach links rare predisposition variants to functional consequences, informing future guidelines of variant classification and germline genetic testing in cancer. Copyright © 2018 The Authors. Published by Elsevier Inc. All rights reserved.

  20. Gene selection for cancer classification with the help of bees.

    PubMed

    Moosa, Johra Muhammad; Shakur, Rameen; Kaykobad, Mohammad; Rahman, Mohammad Sohel

    2016-08-10

    Development of biologically relevant models from gene expression data, notably microarray data, has become a topic of great interest in the field of bioinformatics and clinical genetics and oncology. Only a small number of gene expression data compared to the total number of genes explored possess a significant correlation with a certain phenotype. Gene selection enables researchers to obtain substantial insight into the genetic nature of the disease and the mechanisms responsible for it. Besides improvement of the performance of cancer classification, it can also cut down the time and cost of medical diagnoses. This study presents a modified Artificial Bee Colony Algorithm (ABC) to select the minimum number of genes that are deemed to be significant for cancer along with improvement of predictive accuracy. The search equation of ABC is believed to be good at exploration but poor at exploitation. To overcome this limitation we have modified the ABC algorithm by incorporating the concept of pheromones, which is one of the major components of the Ant Colony Optimization (ACO) algorithm, and a new operation in which successive bees communicate to share their findings. The proposed algorithm is evaluated using a suite of ten publicly available datasets after the parameters are tuned scientifically with one of the datasets. Obtained results are compared to other works that used the same datasets. The performance of the proposed method is proved to be superior. The method presented in this paper can provide a subset of genes leading to more accurate classification results while the number of selected genes is smaller. Additionally, the proposed modified Artificial Bee Colony Algorithm could conceivably be applied to problems in other areas as well.
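
    The "search equation of ABC" referred to in this abstract is, in the standard formulation of the Artificial Bee Colony algorithm, the neighbour-search update shown below. This is the commonly cited baseline form only; the pheromone-based modification described by the authors is not specified in the abstract and is not reproduced here. The index names are assumptions used for illustration.

```latex
v_{ij} = x_{ij} + \phi_{ij} \left( x_{ij} - x_{kj} \right),
\qquad \phi_{ij} \sim \mathcal{U}(-1, 1), \quad k \ne i ,
% x_i : current food source (candidate solution) of bee i
% x_k : a randomly chosen different food source
% j   : a randomly chosen dimension of the solution vector
% v_i : the new candidate, kept only if its fitness is better than that of x_i
```

    Because the step is driven by a randomly chosen partner x_k rather than by the best solution found so far, the update tends to explore broadly, which matches the abstract's remark that it is good at exploration but poor at exploitation.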

  1. GENE-07. MOLECULAR NEUROPATHOLOGY 2.0 - INCREASING DIAGNOSTIC ACCURACY IN PEDIATRIC NEUROONCOLOGY

    PubMed Central

    Sturm, Dominik; Jones, David T.W.; Capper, David; Sahm, Felix; von Deimling, Andreas; Rutkowski, Stefan; Warmuth-Metz, Monika; Bison, Brigitte; Gessi, Marco; Pietsch, Torsten; Pfister, Stefan M.

    2017-01-01

    Abstract The classification of central nervous system (CNS) tumors into clinically and biologically distinct entities and subgroups is challenging. Children and adolescents can be affected by >100 histological variants with very variable outcomes, some of which are exceedingly rare. The current WHO classification has introduced a number of novel molecular markers to aid routine neuropathological diagnostics, and DNA methylation profiling is emerging as a powerful tool to distinguish CNS tumor classes. The Molecular Neuropathology 2.0 study aims to integrate genome wide (epi-)genetic diagnostics with reference neuropathological assessment for all newly-diagnosed pediatric brain tumors in Germany. To date, >350 patients have been enrolled. A molecular diagnosis is established by epigenetic tumor classification through DNA methylation profiling and targeted panel sequencing of >130 genes to detect diagnostically and/or therapeutically useful DNA mutations, structural alterations, and fusion events. Results are aligned with the reference neuropathological diagnosis, and discrepant findings are discussed in a multi-disciplinary tumor board including reference neuroradiological evaluation. Ten FFPE sections as input material are sufficient to establish a molecular diagnosis in >95% of tumors. Alignment with reference pathology results in four broad categories: a) concordant classification (~77%), b) discrepant classification resolvable by tumor board discussion and/or additional data (~5%), c) discrepant classification without currently available options to resolve (~8%), and d) cases currently unclassifiable by molecular diagnostics (~10%). Discrepancies are enriched in certain histopathological entities, such as histological high grade gliomas with a molecularly low grade profile. Gene panel sequencing reveals predisposing germline events in ~10% of patients. 
    Genome wide (epi-)genetic analyses add a valuable layer of information to routine neuropathological diagnostics. Our study provides insight into CNS tumors with divergent histopathological and molecular classification, opening new avenues for research discoveries and facilitating optimization of clinical management for affected patients in the future.

  2. Use of mutation profiles to refine the classification of endometrial carcinomas

    PubMed Central

    Cheang, Maggie CU; Wiegand, Kimberly; Senz, Janine; Tone, Alicia; Yang, Winnie; Prentice, Leah; Tse, Kane; Zeng, Thomas; McDonald, Helen; Schmidt, Amy P.; Mutch, David G.; McAlpine, Jessica N; Hirst, Martin; Shah, Sohrab P; Lee, Cheng-Han; Goodfellow, Paul J; Gilks, C. Blake; Huntsman, David G

    2014-01-01

    The classification of endometrial carcinomas is based on pathological assessment of tumour cell type; the different cell types (endometrioid, serous, carcinosarcoma, mixed, and clear cell) are associated with distinct molecular alterations. This current classification system for high-grade subtypes, in particular the distinction between high-grade endometrioid (EEC-3) and serous carcinomas (ESC), is limited in its reproducibility and prognostic abilities. Therefore, a search for specific molecular classifiers to improve endometrial carcinoma subclassification is warranted. We performed target enrichment sequencing on 393 endometrial carcinomas from two large cohorts, sequencing exons from the following 9 genes; ARID1A, PPP2R1A, PTEN, PIK3CA, KRAS, CTNNB1, TP53, BRAF and PPP2R5C. Based on this gene panel each endometrial carcinoma subtype shows a distinct mutation profile. EEC-3s have significantly different frequencies of PTEN and TP53 mutations when compared to low-grade endometrioid carcinomas. ESCs and EEC-3s are distinct subtypes with significantly different frequencies of mutations in PTEN, ARID1A, PPP2R1A, TP53, and CTNNB1. From the mutation profiles we were able to identify subtype outliers, i.e. cases diagnosed morphologically as one subtype but with a mutation profile suggestive of a different subtype. Careful review of these diagnostically challenging cases suggested that the original morphological classification was incorrect in most instances. The molecular profile of carcinosarcomas suggests two distinct mutation profiles for these tumours; endometrioid-type (PTEN, PIK3CA, ARID1A, KRAS mutations), and serous-type (TP53 and PPP2R1A mutations). While this nine gene panel does not allow for a purely molecularly based classification of endometrial carcinoma, it may prove useful as an adjunct to morphological classification and serve as an aid in the classification of problematic cases. 
    If used in practice, it may lead to improved diagnostic reproducibility and may also serve to stratify patients for targeted therapeutics. PMID:22653804

  3. Reduced Set of Virulence Genes Allows High Accuracy Prediction of Bacterial Pathogenicity in Humans

    PubMed Central

    Iraola, Gregorio; Vazquez, Gustavo; Spangenberg, Lucía; Naya, Hugo

    2012-01-01

    Although there have been great advances in understanding bacterial pathogenesis, there is still a lack of integrative information about what makes a bacterium a human pathogen. The advent of high-throughput sequencing technologies has dramatically increased the amount of completed bacterial genomes, for both known human pathogenic and non-pathogenic strains; this information is now available to investigate genetic features that determine pathogenic phenotypes in bacteria. In this work we determined presence/absence patterns of different virulence-related genes among more than finished bacterial genomes from both human pathogenic and non-pathogenic strains, belonging to different taxonomic groups (i.e: Actinobacteria, Gammaproteobacteria, Firmicutes, etc.). An accuracy of 95% using a cross-fold validation scheme with in-fold feature selection is obtained when classifying human pathogens and non-pathogens. A reduced subset of highly informative genes () is presented and applied to an external validation set. The statistical model was implemented in the BacFier v1.0 software (freely available at ), that displays not only the prediction (pathogen/non-pathogen) and an associated probability for pathogenicity, but also the presence/absence vector for the analyzed genes, so it is possible to decipher the subset of virulence genes responsible for the classification on the analyzed genome. Furthermore, we discuss the biological relevance for bacterial pathogenesis of the core set of genes, corresponding to eight functional categories, all with evident and documented association with the phenotypes of interest. Also, we analyze which functional categories of virulence genes were more distinctive for pathogenicity in each taxonomic group, which seems to be a completely new kind of information and could lead to important evolutionary conclusions. PMID:22916122
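
    The phrase "cross-fold validation scheme with in-fold feature selection" in this abstract describes a safeguard that is easy to get wrong: the informative genes must be chosen from the training part of each fold only. The sketch below illustrates that idea on synthetic presence/absence vectors. It is not the BacFier model; the scoring rule, the nearest-centroid classifier, the toy data, and all function names are assumptions made for illustration.

```python
import random


def select_features(X, y, n_keep):
    """Rank genes by the gap in presence frequency between the two classes."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]

    def score(j):
        return abs(sum(r[j] for r in pos) / len(pos) - sum(r[j] for r in neg) / len(neg))

    return sorted(range(len(X[0])), key=score, reverse=True)[:n_keep]


def centroid(rows, features):
    """Mean presence value of each selected gene over the given rows."""
    return [sum(r[j] for r in rows) / len(rows) for j in features]


def predict(row, features, c_pos, c_neg):
    """Assign the class whose centroid is closer on the selected genes."""
    d_pos = sum((row[j] - c) ** 2 for j, c in zip(features, c_pos))
    d_neg = sum((row[j] - c) ** 2 for j, c in zip(features, c_neg))
    return 1 if d_pos < d_neg else 0


def cross_validate(X, y, k=5, n_keep=3, seed=0):
    """k-fold accuracy with feature selection repeated inside every fold."""
    order = list(range(len(X)))
    random.Random(seed).shuffle(order)
    correct = 0
    for test_idx in (order[i::k] for i in range(k)):
        held_out = set(test_idx)
        X_tr = [X[i] for i in order if i not in held_out]
        y_tr = [y[i] for i in order if i not in held_out]
        features = select_features(X_tr, y_tr, n_keep)  # training fold only
        c_pos = centroid([r for r, l in zip(X_tr, y_tr) if l == 1], features)
        c_neg = centroid([r for r, l in zip(X_tr, y_tr) if l == 0], features)
        correct += sum(predict(X[i], features, c_pos, c_neg) == y[i] for i in test_idx)
    return correct / len(X)


def toy_data(n_per_class=20, n_genes=10, seed=1):
    """Synthetic genomes: genes 0 and 1 track the label, the rest are noise."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (1, 0):
        for _ in range(n_per_class):
            X.append([label, label] + [rng.randint(0, 1) for _ in range(n_genes - 2)])
            y.append(label)
    return X, y


X, y = toy_data()
print("cross-validated accuracy:", cross_validate(X, y))
```

    Selecting genes once on the full dataset and then cross-validating would let information from the held-out genomes leak into the model and inflate the reported accuracy; repeating the selection inside each fold avoids that.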

  4. Genome-Wide Analysis of the RAV Family in Soybean and Functional Identification of GmRAV-03 Involvement in Salt and Drought Stresses and Exogenous ABA Treatment

    PubMed Central

    Zhao, Shu-Ping; Xu, Zhao-Shi; Zheng, Wei-Jun; Zhao, Wan; Wang, Yan-Xia; Yu, Tai-Fei; Chen, Ming; Zhou, Yong-Bin; Min, Dong-Hong; Ma, You-Zhi; Chai, Shou-Cheng; Zhang, Xiao-Hong

    2017-01-01

    Transcription factors play vital roles in plant growth and in plant responses to abiotic stresses. The RAV transcription factors contain a B3 DNA binding domain and/or an APETALA2 (AP2) DNA binding domain. Although genome-wide analyses of RAV family genes have been performed in several species, little is known about the family in soybean (Glycine max L.). In this study, a total of 13 RAV genes, named as GmRAVs, were identified in the soybean genome. We predicted and analyzed the amino acid compositions, phylogenetic relationships, and folding states of conserved domain sequences of soybean RAV transcription factors. These soybean RAV transcription factors were phylogenetically clustered into three classes based on their amino acid sequences. Subcellular localization analysis revealed that the soybean RAV proteins were located in the nucleus. The expression patterns of 13 RAV genes were analyzed by quantitative real-time PCR. Under drought stresses, the RAV genes expressed diversely, up- or down-regulated. Following NaCl treatments, all RAV genes were down-regulated excepting GmRAV-03 which was up-regulated. Under abscisic acid (ABA) treatment, the expression of all of the soybean RAV genes increased dramatically. These results suggested that the soybean RAV genes may be involved in diverse signaling pathways and may be responsive to abiotic stresses and exogenous ABA. Further analysis indicated that GmRAV-03 could increase the transgenic lines resistance to high salt and drought and result in the transgenic plants insensitive to exogenous ABA. This present study provides valuable information for understanding the classification and putative functions of the RAV transcription factors in soybean. PMID:28634481

  5. Significance of Inactivated Genes in Leukemia: Pathogenesis and Prognosis

    PubMed Central

    Heidari, Nazanin; Abroun, Saeid; Bertacchini, Jessika; Vosoughi, Tina; Rahim, Fakher; Saki, Najmaldin

    2017-01-01

    Epigenetic and genetic alterations are two mechanisms participating in leukemia, which can inactivate genes involved in leukemia pathogenesis or progression. The purpose of this review was to introduce various inactivated genes and evaluate their possible role in leukemia pathogenesis and prognosis. By searching the MeSH terms “Gene, Silencing AND Leukemia” on the PubMed website, relevant English articles dealing with human subjects as of 2000 were included in this study. Gene inactivation in leukemia is largely mediated by promoter hypermethylation of genes involved in cellular functions such as cell cycle, apoptosis, and gene transcription. Inactivated genes, such as ASPP1, TP53, IKZF1 and P15, may correlate with poor prognosis in acute lymphoid leukemia (ALL), chronic lymphoid leukemia (CLL), chronic myelogenous leukemia (CML) and acute myeloid leukemia (AML), respectively. Gene inactivation may play a considerable role in leukemia pathogenesis and prognosis, which can be considered as complementary diagnostic tests to differentiate different leukemia types, determine leukemia prognosis, and also detect response to therapy. In general, this review showed some genes inactivated only in leukemia (with differences between B-ALL, T-ALL, CLL, AML and CML). These differences could be of interest as an additional tool to better categorize leukemia types. Furthermore, based on inactivated genes, a diverse classification of leukemias could represent a powerful method to address a targeted therapy of the patients, in order to minimize side effects of conventional therapies and to enhance new drug strategies. PMID:28580304

  6. Exploring Triacylglycerol Biosynthetic Pathway in Developing Seeds of Chia (Salvia hispanica L.): A Transcriptomic Approach

    PubMed Central

    Rupwate, Sunny D.; Rajasekharan, Ram; Srinivasan, Malathi

    2015-01-01

    Chia (Salvia hispanica L.), a member of the mint family (Lamiaceae), is a rediscovered crop with great importance in health and nutrition and is also the highest known terrestrial plant source of heart-healthy omega-3 fatty acid, alpha linolenic acid (ALA). At present, there is no public genomic information or database available for this crop, hindering research on its genetic improvement through genomics-assisted breeding programs. The first comprehensive analysis of the global transcriptome profile of developing Salvia hispanica L. seeds, with special reference to lipid biosynthesis is presented in this study. RNA from five different stages of seed development was extracted and sequenced separately using the Illumina GAIIx platform. De novo assembly of processed reads in the pooled transcriptome using Trinity yielded 76,014 transcripts. The total transcript length was 66,944,462 bases (66.9 Mb), with an average length of approximately 880 bases. In the molecular functions category of Gene Ontology (GO) terms, ATP binding and nucleotide binding were found to be the most abundant and in the biological processes category, the metabolic process and the regulation of transcription-DNA-dependent and oxidation-reduction process were abundant. From the EuKaryotic Orthologous Groups of proteins (KOG) classification, the major category was “Metabolism” (31.97%), of which the most prominent class was ‘carbohydrate metabolism and transport’ (5.81% of total KOG classifications) followed by ‘secondary metabolite biosynthesis transport and catabolism’ (5.34%) and ‘lipid metabolism’ (4.57%). A majority of the candidate genes involved in lipid biosynthesis and oil accumulation were identified. Furthermore, 5596 simple sequence repeats (SSRs) were identified. The transcriptome data was further validated through confirmative PCR and qRT-PCR for select lipid genes. 
    Our study provides insight into the complex transcriptome and will contribute to further genome-wide research and understanding of chia. The identified novel UniGenes will facilitate gene discovery and creation of genomic resource for this crop. PMID:25875809
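
    The assembly figures quoted in this abstract can be cross-checked with simple arithmetic. The short calculation below uses only the two numbers reported in the abstract (76,014 transcripts and 66,944,462 bases); the variable names are chosen here for illustration.

```python
total_bases = 66_944_462   # total transcript length reported in the abstract
n_transcripts = 76_014     # Trinity transcripts reported in the abstract

mean_length = total_bases / n_transcripts   # average transcript length in bases
total_mb = total_bases / 1_000_000          # total length in megabases

print(f"mean transcript length: {mean_length:.1f} bases")
print(f"total length: {total_mb:.1f} Mb")
```

    The mean falls between 880 and 881 bases, in line with the "approximately 880 bases" stated in the abstract, and the total rounds to the reported 66.9 Mb.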

  7. Mapping the rehabilitation interventions of a community stroke team to the extended International Classification of Functioning, Disability and Health Core Set for Stroke.

    PubMed

    Evans, Melissa; Hocking, Clare; Kersten, Paula

    2017-12-01

    The aim of this study was to evaluate whether the Extended International Classification of Functioning, Disability and Health Core Set for Stroke captured the interventions of a community stroke rehabilitation team situated in a large city in New Zealand. It was proposed that the results would identify the contribution of each discipline, and the gaps and differences in service provision to Māori and non-Māori. Applying the Extended International Classification of Functioning, Disability and Health Core Set for Stroke in this way would also inform whether this core set should be adopted in New Zealand. Interventions were retrospectively extracted from 18 medical records and linked to the International Classification of Functioning, Disability and Health and the Extended International Classification of Functioning, Disability and Health Core Set for Stroke. The frequencies of linked interventions and the health discipline providing the intervention were calculated. Analysis revealed that 98.8% of interventions provided by the rehabilitation team could be linked to the Extended International Classification of Functioning, Disability and Health Core Set for Stroke, with more interventions for body function and structure than for activities and participation; no interventions for emotional concerns; and limited interventions for community, social and civic life. Results support previous recommendations for additions to the EICSS. The results support the use of the Extended International Classification of Functioning, Disability and Health Core Set for Stroke in New Zealand and demonstrate its use as a quality assurance tool that can evaluate the scope and practice of a rehabilitation service. Implications for Rehabilitation The Extended International Classification of Functioning Disability and Health Core Set for Stroke appears to represent the stroke interventions of a community stroke rehabilitation team in New Zealand.
    As a result, researchers and clinicians may have increased confidence to use this core set in research and clinical practice. The Extended International Classification of Functioning Disability and Health Core Set for Stroke can be used as a quality assurance tool to establish whether a community stroke rehabilitation team is meeting the functional needs of its stroke population.

  8. The Importance of Motor Functional Levels from the Activity Limitation Perspective of ICF in Children with Cerebral Palsy

    ERIC Educational Resources Information Center

    Mutlu, Akmer

    2010-01-01

    Our purpose in this study was to evaluate performance and capacity as defined by Gross Motor Function Classification System (GMFCS) and Manual Ability Classification System (MACS) from the "activity limitation" perspective of International Classification of Functioning, Disability, and Health (ICF) and to investigate the relationship between the…

  9. Genome-Wide Comparative In Silico Analysis of the RNA Helicase Gene Family in Zea mays and Glycine max: A Comparison with Arabidopsis and Oryza sativa

    PubMed Central

    Huang, Jinguang; Zheng, Chengchao

    2013-01-01

    RNA helicases are enzymes that are thought to unwind double-stranded RNA molecules in an energy-dependent fashion through the hydrolysis of NTP. RNA helicases are associated with all processes involving RNA molecules, including nuclear transcription, editing, splicing, ribosome biogenesis, RNA export, and organelle gene expression. The involvement of RNA helicase in response to stress and in plant growth and development has been reported previously. While their importance in Arabidopsis and Oryza sativa has been partially studied, the function of RNA helicase proteins is poorly understood in Zea mays and Glycine max. In this study, we identified a total of RNA helicase genes in Arabidopsis and other crop species genome by genome-wide comparative in silico analysis. We classified the RNA helicase genes into three subfamilies according to the structural features of the motif II region, such as DEAD-box, DEAH-box and DExD/H-box, and different species showed different patterns of alternative splicing. Secondly, chromosome location analysis showed that the RNA helicase protein genes were distributed across all chromosomes with different densities in the four species. Thirdly, phylogenetic tree analyses identified the relevant homologs of DEAD-box, DEAH-box and DExD/H-box RNA helicase proteins in each of the four species. Fourthly, microarray expression data showed that many of these predicted RNA helicase genes were expressed in different developmental stages and different tissues under normal growth conditions. Finally, real-time quantitative PCR analysis showed that the expression levels of 10 genes in Arabidopsis and 13 genes in Zea mays were in close agreement with the microarray expression data. To our knowledge, this is the first report of a comparative genome-wide analysis of the RNA helicase gene family in Arabidopsis, Oryza sativa, Zea mays and Glycine max. This study provides valuable information for understanding the classification and putative functions of the RNA helicase gene family in crop growth and development. PMID:24265739

  10. High time for a roll call: gene duplication and phylogenetic relationships of TCP-like genes in monocots

    PubMed Central

    Mondragón-Palomino, Mariana; Trontin, Charlotte

    2011-01-01

    Background and Aims The TCP family is an ancient group of plant developmental transcription factors that regulate cell division in vegetative and reproductive structures and are essential in the establishment of flower zygomorphy. In-depth research on eudicot TCPs has documented their evolutionary and developmental role. This has not happened to the same extent in monocots, although zygomorphy has been critical for the diversification of Orchidaceae and Poaceae, the largest families of this group. Investigating the evolution and function of TCP-like genes in a wider group of monocots requires a detailed phylogenetic analysis of all available sequence information and a system that facilitates comparing genetic and functional information. Methods The phylogenetic relationships of TCP-like genes in monocots were investigated by analysing sequences from the genomes of Zea mays, Brachypodium distachyon, Oryza sativa and Sorghum bicolor, as well as EST data from several other monocot species. Key Results All available monocot TCP-like sequences are associated in 20 major groups with an average identity ≥64 % and most correspond to well-supported clades of the phylogeny. Their sequence motifs and relationships of orthology were documented and it was found that 67 % of the TCP-like genes of Sorghum, Oryza, Zea and Brachypodium are in microsyntenic regions. This analysis suggests that two rounds of whole genome duplication drove the expansion of TCP-like genes in these species. Conclusions A system of classification is proposed where putative or recognized monocot TCP-like genes are assigned to a specific clade of PCF-, CIN- or CYC/tb1-like genes. Specific biases in sequence data of this family that must be tackled when studying its molecular evolution and phylogeny are documented. Finally, the significant retention of duplicated TCP genes from Zea mays is considered in the context of balanced gene drive. PMID:21444336

  11. Genome-wide comparative in silico analysis of the RNA helicase gene family in Zea mays and Glycine max: a comparison with Arabidopsis and Oryza sativa.

    PubMed

    Xu, Ruirui; Zhang, Shizhong; Huang, Jinguang; Zheng, Chengchao

    2013-01-01

    RNA helicases are enzymes that are thought to unwind double-stranded RNA molecules in an energy-dependent fashion through the hydrolysis of NTP. RNA helicases are associated with all processes involving RNA molecules, including nuclear transcription, editing, splicing, ribosome biogenesis, RNA export, and organelle gene expression. The involvement of RNA helicase in response to stress and in plant growth and development has been reported previously. While their importance in Arabidopsis and Oryza sativa has been partially studied, the function of RNA helicase proteins is poorly understood in Zea mays and Glycine max. In this study, we identified a total of RNA helicase genes in Arabidopsis and other crop species genome by genome-wide comparative in silico analysis. We classified the RNA helicase genes into three subfamilies according to the structural features of the motif II region, such as DEAD-box, DEAH-box and DExD/H-box, and different species showed different patterns of alternative splicing. Secondly, chromosome location analysis showed that the RNA helicase protein genes were distributed across all chromosomes with different densities in the four species. Thirdly, phylogenetic tree analyses identified the relevant homologs of DEAD-box, DEAH-box and DExD/H-box RNA helicase proteins in each of the four species. Fourthly, microarray expression data showed that many of these predicted RNA helicase genes were expressed in different developmental stages and different tissues under normal growth conditions. Finally, real-time quantitative PCR analysis showed that the expression levels of 10 genes in Arabidopsis and 13 genes in Zea mays were in close agreement with the microarray expression data. To our knowledge, this is the first report of a comparative genome-wide analysis of the RNA helicase gene family in Arabidopsis, Oryza sativa, Zea mays and Glycine max. This study provides valuable information for understanding the classification and putative functions of the RNA helicase gene family in crop growth and development.

  12. De Novo Transcriptomic Analysis of Peripheral Blood Lymphocytes from the Chinese Goose: Gene Discovery and Immune System Pathway Description

    PubMed Central

    Tariq, Mansoor; Chen, Rong; Yuan, Hongyu; Liu, Yanjie; Wu, Yanan; Wang, Junya; Xia, Chun

    2015-01-01

    Background The Chinese goose is one of the most economically important poultry birds and is a natural reservoir for many avian viruses. However, the nature and regulation of the innate and adaptive immune systems of this waterfowl species are not completely understood due to limited information on the goose genome. Recently, transcriptome sequencing technology has been applied in genomic studies focused on novel gene discovery. Thus, this study described the transcriptome of the goose peripheral blood lymphocytes to identify immunity-relevant genes. Principal Findings The transcriptome of the goose peripheral blood lymphocytes was sequenced by Illumina-Solexa technology and assembled de novo. In total, 211,198 unigenes were assembled from the 69.36 million cleaned reads. The average length, N50 size and the maximum length of the assembled unigenes were 687 bp, 1,298 bp and 18,992 bp, respectively. A total of 36,854 unigenes showed similarity by BLAST search against the NCBI non-redundant (Nr) protein database. For functional classification, 163,161 unigenes were classified into three Gene Ontology (GO) categories and 67 subcategories. A total of 15,334 unigenes were annotated into 25 eukaryotic orthologous groups (KOGs) categories. The Kyoto Encyclopedia of Genes and Genomes (KEGG) database annotated 39,585 unigenes into six biological functional groups and 308 pathways. Among the 2,757 unigenes that participated in the 15 immune system KEGG pathways, 125 of the most important immune-relevant genes were summarized and analyzed by STRING analysis to identify gene interactions and relationships. Moreover, 10 genes were confirmed by PCR and analyzed. Of these 125 unigenes, 109 unigenes, approximately 87%, were not previously identified in the goose. Conclusion This de novo transcriptome analysis could provide important Chinese goose sequence information and highlights the value of new gene discovery, pathways investigation and immune system gene identification, and comparison with other avian species as useful tools to understand the goose immune system. PMID:25816068
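
    The abstract above reports an N50 size for the assembled unigenes. As a reminder of what that statistic measures, here is a minimal sketch under the usual definition (the length L such that sequences of length L or longer hold at least half of the total assembled bases). The function name and the toy lengths are illustrative assumptions, not data or code from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Usual definition (assumed here): sort lengths from longest to shortest
    and return the length at which the running total first reaches half of
    the summed length of all sequences.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical unigene lengths in bp (not taken from the goose assembly).
toy_lengths = [200, 300, 400, 500, 600]
print(n50(toy_lengths))  # 600 + 500 = 1100 >= 1000, so this prints 500
```

    Because N50 weights long sequences more heavily, it is normally larger than the mean length, which matches the ordering of the two figures reported in the abstract.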

  13. Salt-Responsive Transcriptome Profiling of Suaeda glauca via RNA Sequencing

    PubMed Central

    Jin, Hangxia; Dong, Dekun; Yang, Qinghua; Zhu, Danhua

    2016-01-01

    Background Suaeda glauca, a succulent halophyte of the Chenopodiaceae family, is widely distributed in coastal areas of China. Suaeda glauca is highly resistant to salt and alkali stresses. In the present study, the salt-responsive transcriptome of Suaeda glauca was analyzed to identify genes involved in salt tolerance and study halophilic mechanisms in this halophyte. Results Illumina HiSeq 2500 was used to sequence cDNA libraries from salt-treated and control samples with three replicates each treatment. De novo assembly of the six transcriptomes identified 75,445 unigenes. A total of 23,901 (31.68%) unigenes were annotated. Compared with transcriptomes from the three salt-treated and three salt-free samples, 231 differentially expressed genes (DEGs) were detected (including 130 up-regulated genes and 101 down-regulated genes), and 195 unigenes were functionally annotated. Based on the Gene Ontology (GO), Clusters of Orthologous Groups (COG) and Kyoto Encyclopedia of Genes and Genomes (KEGG) classifications of the DEGs, more attention should be paid to transcripts associated with signal transduction, transporters, the cell wall and growth, defense metabolism and transcription factors involved in salt tolerance. Conclusions This report provides a genome-wide transcriptional analysis of a halophyte, Suaeda glauca, under salt stress. Further studies of the genetic basis of salt tolerance in halophytes are warranted. PMID:26930632

  14. Genomic overview of mRNA 5′-leader trans-splicing in the ascidian Ciona intestinalis

    PubMed Central

    Satou, Yutaka; Hamaguchi, Makoto; Takeuchi, Keisuke; Hastings, Kenneth E. M.; Satoh, Nori

    2006-01-01

    Although spliced leader (SL) trans-splicing in the chordates was discovered in the tunicate Ciona intestinalis there has been no genomic overview analysis of the extent of trans-splicing or the make-up of the trans-spliced and non-trans-spliced gene populations of this model organism. Here we report such an analysis for Ciona based on the oligo-capping full-length cDNA approach. We randomly sampled 2078 5′-full-length ESTs representing 668 genes, or 4.2% of the entire genome. Our results indicate that Ciona contains a single major SL, which is efficiently trans-spliced to mRNAs transcribed from a specific set of genes representing ∼50% of the total number of expressed genes, and that individual trans-spliced mRNA species are, on average, 2–3-fold less abundant than non-trans-spliced mRNA species. Our results also identify a relationship between trans-splicing status and gene functional classification; ribosomal protein genes fall predominantly into the non-trans-spliced category. In addition, our data provide the first evidence for the occurrence of polycistronic transcription in Ciona. An interesting feature of the Ciona polycistronic transcription units is that the great majority entirely lack intercistronic sequences. PMID:16822859

  15. A computational neural approach to support the discovery of gene function and classes of cancer.

    PubMed

    Azuaje, F

    2001-03-01

    Advances in molecular classification of tumours may play a central role in cancer treatment. Here, a novel approach to genome expression pattern interpretation is described and applied to the recognition of B-cell malignancies as a test set. Using cDNA microarrays data generated by a previous study, a neural network model known as simplified fuzzy ARTMAP is able to identify normal and diffuse large B-cell lymphoma (DLBCL) patients. Furthermore, it discovers the distinction between patients with molecularly distinct forms of DLBCL without previous knowledge of those subtypes.

  16. Integrated genome-wide Alu methylation and transcriptome profiling analyses reveal novel epigenetic regulatory networks associated with autism spectrum disorder.

    PubMed

    Saeliw, Thanit; Tangsuwansri, Chayanin; Thongkorn, Surangrat; Chonchaiya, Weerasak; Suphapeetiporn, Kanya; Mutirangura, Apiwat; Tencomnao, Tewin; Hu, Valerie W; Sarachana, Tewarit

    2018-01-01

    Alu elements are a group of repetitive elements that can influence gene expression through CpG residues and transcription factor binding. Altered gene expression and methylation profiles have been reported in various tissues and cell lines from individuals with autism spectrum disorder (ASD). However, the role of Alu elements in ASD remains unclear. We thus investigated whether Alu elements are associated with altered gene expression profiles in ASD. We obtained five blood-based gene expression profiles from the Gene Expression Omnibus database and human Alu-inserted gene lists from the TranspoGene database. Differentially expressed genes (DEGs) in ASD were identified from each study and overlapped with the human Alu-inserted genes. The biological functions and networks of Alu-inserted DEGs were then predicted by Ingenuity Pathway Analysis (IPA). A combined bisulfite restriction analysis of lymphoblastoid cell lines (LCLs) derived from 36 ASD and 20 sex- and age-matched unaffected individuals was performed to assess the global DNA methylation levels within Alu elements, and the Alu expression levels were determined by quantitative RT-PCR. In ASD blood or blood-derived cells, 320 Alu-inserted genes were reproducibly differentially expressed. Biological function and pathway analysis showed that these genes were significantly associated with neurodevelopmental disorders and neurological functions involved in ASD etiology. Interestingly, estrogen receptor and androgen signaling pathways implicated in the sex bias of ASD, as well as IL-6 signaling and neuroinflammation signaling pathways, were also highlighted. Alu methylation was not significantly different between the ASD and sex- and age-matched control groups. However, significantly altered Alu methylation patterns were observed in ASD cases sub-grouped based on Autism Diagnostic Interview-Revised scores compared with matched controls. Quantitative RT-PCR analysis of Alu expression also showed significant differences between ASD subgroups. Interestingly, Alu expression was correlated with methylation status in one phenotypic ASD subgroup. Alu methylation and expression were altered in LCLs from ASD subgroups. Our findings highlight the association of Alu elements with gene dysregulation in ASD blood samples and warrant further investigation. Moreover, the classification of ASD individuals into subgroups based on phenotypes may be beneficial and could provide insights into the still unknown etiology and the underlying mechanisms of ASD.

  17. On the statistical assessment of classifiers using DNA microarray data

    PubMed Central

    Ancona, N; Maglietta, R; Piepoli, A; D'Addabbo, A; Cotugno, R; Savino, M; Liuni, S; Carella, M; Pesole, G; Perri, F

    2006-01-01

    Background In this paper we present a method for the statistical assessment of cancer predictors which make use of gene expression profiles. The methodology is applied to a new data set of microarray gene expression data collected in Casa Sollievo della Sofferenza Hospital, Foggia, Italy. The data set is made up of normal (22) and tumor (25) specimens extracted from 25 patients affected by colon cancer. We propose to give answers to some questions which are relevant for the automatic diagnosis of cancer such as: Is the size of the available data set sufficient to build accurate classifiers? What is the statistical significance of the associated error rates? In what ways can accuracy be considered dependent on the adopted classification scheme? How many genes are correlated with the pathology and how many are sufficient for an accurate colon cancer classification? The method we propose answers these questions whilst avoiding the potential pitfalls hidden in the analysis and interpretation of microarray data. Results We estimate the generalization error, evaluated through the Leave-K-Out Cross Validation error, for three different classification schemes by varying the number of training examples and the number of the genes used. The statistical significance of the error rate is measured by using a permutation test. We provide a statistical analysis in terms of the frequencies of the genes involved in the classification. Using the whole set of genes, we found that the Weighted Voting Algorithm (WVA) classifier learns the distinction between normal and tumor specimens with 25 training examples, providing e = 21% (p = 0.045) as an error rate. This remains constant even when the number of examples increases. Moreover, Regularized Least Squares (RLS) and Support Vector Machines (SVM) classifiers can learn with only 15 training examples, with an error rate of e = 19% (p = 0.035) and e = 18% (p = 0.037) respectively. Moreover, the error rate decreases as the training set size increases, reaching its best performance with 35 training examples. In this case, RLS and SVM have error rates of e = 14% (p = 0.027) and e = 11% (p = 0.019). Concerning the number of genes, we found about 6000 genes (p < 0.05) correlated with the pathology, resulting from the signal-to-noise statistic. Moreover, the performances of RLS and SVM classifiers do not change when 74% of the genes are used. They progressively decline, reaching e = 16% (p < 0.05) when only 2 genes are employed. The biological relevance of a set of genes determined by our statistical analysis and the major roles they play in colorectal tumorigenesis is discussed. Conclusions The method proposed provides statistically significant answers to precise questions relevant for the diagnosis and prognosis of cancer. We found that, with as few as 15 examples, it is possible to train statistically significant classifiers for colon cancer diagnosis. As for the definition of the number of genes sufficient for a reliable classification of colon cancer, our results suggest that it depends on the accuracy required. PMID:16919171
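
    The abstract above measures the statistical significance of a cross-validated error rate with a permutation test. The sketch below illustrates the general idea only: it substitutes leave-one-out cross-validation for the paper's Leave-K-Out scheme and a nearest-centroid classifier for WVA, RLS and SVM, and it runs on made-up two-gene expression values. All function names, parameters and data are assumptions chosen for illustration.

```python
import random


def loo_error(samples, labels):
    """Leave-one-out error of a nearest-centroid classifier (stand-in model)."""
    errors = 0
    for held_out in range(len(samples)):
        # Build one centroid per class from every sample except the held-out one.
        sums, counts = {}, {}
        for j, (x, y) in enumerate(zip(samples, labels)):
            if j == held_out:
                continue
            acc = sums.setdefault(y, [0.0] * len(x))
            for k, value in enumerate(x):
                acc[k] += value
            counts[y] = counts.get(y, 0) + 1
        centroids = {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

        # Predict the class whose centroid is closest (squared Euclidean distance).
        def sq_dist(centroid):
            return sum((a - b) ** 2 for a, b in zip(samples[held_out], centroid))

        predicted = min(centroids, key=lambda y: sq_dist(centroids[y]))
        errors += predicted != labels[held_out]
    return errors / len(samples)


def permutation_p_value(samples, labels, n_permutations=200, seed=0):
    """Estimate how often shuffled labels give an error as low as the real ones."""
    rng = random.Random(seed)
    observed = loo_error(samples, labels)
    shuffled = list(labels)
    as_good = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if loo_error(samples, shuffled) <= observed:
            as_good += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (as_good + 1) / (n_permutations + 1)


# Hypothetical expression values for two genes in eight specimens.
toy_samples = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2],
               [2.0, 2.1], [2.2, 1.9], [1.9, 2.3], [2.1, 2.0]]
toy_labels = ["normal"] * 4 + ["tumor"] * 4

error, p_value = permutation_p_value(toy_samples, toy_labels)
print(f"error = {error:.2f}, permutation p = {p_value:.3f}")
```

    A small p-value indicates that an error rate this low is unlikely to arise when the labels carry no information about the expression profiles, which is the sense in which the abstract attaches a p-value to each reported error rate.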

  18. Integrated computational biology analysis to evaluate target genes for chronic myelogenous leukemia.

    PubMed

    Zheng, Yu; Wang, Yu-Ping; Cao, Hongbao; Chen, Qiusheng; Zhang, Xi

    2018-06-05

    Although hundreds of genes have been linked to chronic myelogenous leukemia (CML), many of the results lack reproducibility. In the present study, data across multiple modalities were integrated to evaluate 579 CML candidate genes, including literature‑based CML‑gene relation data, Gene Expression Omnibus RNA expression data and pathway‑based gene‑gene interaction data. The expression data included samples from 76 patients with CML and 73 healthy controls. For each target gene, four metrics were proposed and tested with case/control classification. The effectiveness of the four metrics presented was demonstrated by the high classification accuracy (94.63%; P<2x10‑4). Cross metric analysis suggested nine top candidate genes for CML: Epidermal growth factor receptor, tumor protein p53, catenin β 1, janus kinase 2, tumor necrosis factor, abelson murine leukemia viral oncogene homolog 1, vascular endothelial growth factor A, B‑cell lymphoma 2 and proto‑oncogene tyrosine‑protein kinase. In addition, 145 CML candidate pathways enriched with 485 out of 579 genes were identified (P<8.2x10‑11; q=0.005). In conclusion, weighted genetic networks generated using computational biology may be complementary to biological experiments for the evaluation of known or novel CML target genes.

  19. CrossLink: a novel method for cross-condition classification of cancer subtypes.

    PubMed

    Ma, Chifeng; Sastry, Konduru S; Flore, Mario; Gehani, Salah; Al-Bozom, Issam; Feng, Yusheng; Serpedin, Erchin; Chouchane, Lotfi; Chen, Yidong; Huang, Yufei

    2016-08-22

    We considered the prediction of cancer classes (e.g. subtypes) using patient gene expression profiles that contain both systematic and condition-specific biases when compared with the training reference dataset. The conventional normalization-based approaches cannot guarantee that the gene signatures in the reference and prediction datasets always have the same distribution for all different conditions as the class-specific gene signatures change with the condition. Therefore, the trained classifier would work well under one condition but not under another. To address the problem of current normalization approaches, we propose a novel algorithm called CrossLink (CL). CL recognizes that there is no universal, condition-independent normalization mapping of signatures. In contrast, it exploits the fact that the signature is unique to its associated class under any condition and thus employs an unsupervised clustering algorithm to discover this unique signature. We assessed the performance of CL for cross-condition predictions of PAM50 subtypes of breast cancer by using a simulated dataset modeled after TCGA BRCA tumor samples with a cross-validation scheme, and datasets with known and unknown PAM50 classification. CL achieved prediction accuracy >73%, the highest among the methods we evaluated. We also applied the algorithm to a set of breast cancer tumors derived from an Arabic population to assign a PAM50 classification to each tumor based on its gene expression profile. A novel algorithm, CrossLink, for cross-condition prediction of cancer classes was proposed. In all test datasets, CL showed robust and consistent improvement in prediction performance over other state-of-the-art normalization and classification algorithms.

  20. Genomics of Mature and Immature Olfactory Sensory Neurons

    PubMed Central

    Nickell, Melissa D.; Breheny, Patrick; Stromberg, Arnold J.; McClintock, Timothy S.

    2014-01-01

    The continuous replacement of neurons in the olfactory epithelium provides an advantageous model for investigating neuronal differentiation and maturation. By calculating the relative enrichment of every mRNA detected in samples of mature mouse olfactory sensory neurons (OSNs), immature OSNs, and the residual population of neighboring cell types, and then comparing these ratios against the known expression patterns of >300 genes, enrichment criteria that accurately predicted the OSN expression patterns of nearly all genes were determined. We identified 847 immature OSN-specific and 691 mature OSN-specific genes. The control of gene expression by chromatin modification and transcription factors, and neurite growth, protein transport, RNA processing, cholesterol biosynthesis, and apoptosis via death domain receptors, were overrepresented biological processes in immature OSNs. Ion transport (ion channels), presynaptic functions, and cilia-specific processes were overrepresented in mature OSNs. Processes overrepresented among the genes expressed by all OSNs were protein and ion transport, ER overload response, protein catabolism, and the electron transport chain. To more accurately represent gradations in mRNA abundance and identify all genes expressed in each cell type, classification methods were used to produce probabilities of expression in each cell type for every gene. These probabilities, which identified 9,300 genes expressed in OSNs, were 96% accurate at identifying genes expressed in OSNs and 86% accurate at discriminating genes specific to mature and immature OSNs. This OSN gene database not only predicts the genes responsible for the major biological processes active in OSNs, but also identifies thousands of never before studied genes that support OSN phenotypes. PMID:22252456

  1. Stratification of co-evolving genomic groups using ranked phylogenetic profiles

    PubMed Central

    Freilich, Shiri; Goldovsky, Leon; Gottlieb, Assaf; Blanc, Eric; Tsoka, Sophia; Ouzounis, Christos A

    2009-01-01

    Background Previous methods of detecting the taxonomic origins of arbitrary sequence collections, with a significant impact to genome analysis and in particular metagenomics, have primarily focused on compositional features of genomes. The evolutionary patterns of phylogenetic distribution of genes or proteins, represented by phylogenetic profiles, provide an alternative approach for the detection of taxonomic origins, but typically suffer from low accuracy. Herein, we present rank-BLAST, a novel approach for the assignment of protein sequences into genomic groups of the same taxonomic origin, based on the ranking order of phylogenetic profiles of target genes or proteins across the reference database. Results The rank-BLAST approach is validated by computing the phylogenetic profiles of all sequences for five distinct microbial species of varying degrees of phylogenetic proximity, against a reference database of 243 fully sequenced genomes. The approach - a combination of sequence searches, statistical estimation and clustering - analyses the degree of sequence divergence between sets of protein sequences and allows the classification of protein sequences according to the species of origin with high accuracy, allowing taxonomic classification of 64% of the proteins studied. In most cases, a main cluster is detected, representing the corresponding species. Secondary, functionally distinct and species-specific clusters exhibit different patterns of phylogenetic distribution, thus flagging gene groups of interest. Detailed analyses of such cases are provided as examples. Conclusion Our results indicate that the rank-BLAST approach can capture the taxonomic origins of sequence collections in an accurate and efficient manner. The approach can be useful both for the analysis of genome evolution and the detection of species groups in metagenomics samples. PMID:19860884

  2. Comparative genomic and transcriptomic analysis of selected fatty acid biosynthesis genes and CNL disease resistance genes in oil palm.

    PubMed

    Rosli, Rozana; Amiruddin, Nadzirah; Ab Halim, Mohd Amin; Chan, Pek-Lan; Chan, Kuang-Lim; Azizi, Norazah; Morris, Priscilla E; Leslie Low, Eng-Ti; Ong-Abdullah, Meilina; Sambanthamurthi, Ravigadevi; Singh, Rajinder; Murphy, Denis J

    2018-01-01

    Comparative genomics and transcriptomic analyses were performed on two agronomically important groups of genes from oil palm versus other major crop species and the model organism, Arabidopsis thaliana. The first analysis was of two gene families with key roles in regulation of oil quality and in particular the accumulation of oleic acid, namely stearoyl ACP desaturases (SAD) and acyl-acyl carrier protein (ACP) thioesterases (FAT). In both cases, these were found to be large gene families with complex expression profiles across a wide range of tissue types and developmental stages. The detailed classification of the oil palm SAD and FAT genes has enabled the updating of the latest version of the oil palm gene model. The second analysis focused on disease resistance (R) genes in order to elucidate possible candidates for breeding of pathogen tolerance/resistance. Ortholog analysis showed that 141 out of the 210 putative oil palm R genes had homologs in banana and rice. These genes formed 37 clusters with 634 orthologous genes. Classification of the 141 oil palm R genes showed that the genes belong to the Kinase (7), CNL (95), MLO-like (8), RLK (3) and Others (28) categories. The CNL R genes formed eight clusters. Expression data for selected R genes also identified potential candidates for breeding of disease resistance traits. Furthermore, these findings can provide information about the species evolution as well as the identification of agronomically important genes in oil palm and other major crops.

  3. Comparative genomic and transcriptomic analysis of selected fatty acid biosynthesis genes and CNL disease resistance genes in oil palm

    PubMed Central

    Rosli, Rozana; Amiruddin, Nadzirah; Ab Halim, Mohd Amin; Chan, Pek-Lan; Chan, Kuang-Lim; Azizi, Norazah; Morris, Priscilla E.; Leslie Low, Eng-Ti; Ong-Abdullah, Meilina; Sambanthamurthi, Ravigadevi; Singh, Rajinder

    2018-01-01

    Comparative genomics and transcriptomic analyses were performed on two agronomically important groups of genes from oil palm versus other major crop species and the model organism, Arabidopsis thaliana. The first analysis was of two gene families with key roles in regulation of oil quality and in particular the accumulation of oleic acid, namely stearoyl ACP desaturases (SAD) and acyl-acyl carrier protein (ACP) thioesterases (FAT). In both cases, these were found to be large gene families with complex expression profiles across a wide range of tissue types and developmental stages. The detailed classification of the oil palm SAD and FAT genes has enabled the updating of the latest version of the oil palm gene model. The second analysis focused on disease resistance (R) genes in order to elucidate possible candidates for breeding of pathogen tolerance/resistance. Ortholog analysis showed that 141 out of the 210 putative oil palm R genes had homologs in banana and rice. These genes formed 37 clusters with 634 orthologous genes. Classification of the 141 oil palm R genes showed that the genes belong to the Kinase (7), CNL (95), MLO-like (8), RLK (3) and Others (28) categories. The CNL R genes formed eight clusters. Expression data for selected R genes also identified potential candidates for breeding of disease resistance traits. Furthermore, these findings can provide information about the species evolution as well as the identification of agronomically important genes in oil palm and other major crops. PMID:29672525

  4. A revised family-level classification of the Polyporales (Basidiomycota)

    Treesearch

    Alfredo Justo; Otto Miettinen; Dimitrios Floudas; Beatriz Ortiz-Santana; Elisabet Sjökvist; Daniel Lindner; Karen Nakasone; Tuomo Niemelä; Karl-Henrik Larsson; Leif Ryvarden; David S. Hibbett

    2017-01-01

    Polyporales is strongly supported as a clade of Agaricomycetes, but the lack of a consensus higher-level classification within the group is a barrier to further taxonomic revision. We amplified nrLSU, nrITS, and rpb1 genes across the Polyporales, with a special focus on the latter. We...

  5. A Coupled k-Nearest Neighbor Algorithm for Multi-Label Classification

    DTIC Science & Technology

    2015-05-22

    classification, an image may contain several concepts simultaneously, such as beach, sunset and kangaroo. Such tasks are usually denoted as multi-label...informatics, a gene can belong to both metabolism and transcription classes; and in music categorization, a song may be labeled as Mozart and sad. In the

  6. Comprehensive identification and clustering of CLV3/ESR-related (CLE) genes in plants finds groups with potentially shared function.

    PubMed

    Goad, David M; Zhu, Chuanmei; Kellogg, Elizabeth A

    2017-10-01

    CLV3/ESR (CLE) proteins are important signaling peptides in plants. The short CLE peptide (12-13 amino acids) is cleaved from a larger pre-propeptide and functions as an extracellular ligand. The CLE family is large and has resisted attempts at classification because the CLE domain is too short for reliable phylogenetic analysis and the pre-propeptide is too variable. We used a model-based search for CLE domains from 57 plant genomes and used the entire pre-propeptide for comprehensive clustering analysis. In total, 1628 CLE genes were identified in land plants, with none recognizable from green algae. These CLEs form 12 groups within which CLE domains are largely conserved and pre-propeptides can be aligned. Most clusters contain sequences from monocots, eudicots and Amborella trichopoda, with sequences from Picea abies, Selaginella moellendorffii and Physcomitrella patens scattered in some clusters. We easily identified previously known clusters involved in vascular differentiation and nodulation. In addition, we found a number of discrete groups whose function remains poorly characterized. Available data indicate that CLE proteins within a cluster are likely to share function, whereas those from different clusters play at least partially different roles. Our analysis provides a foundation for future evolutionary and functional studies. © 2016 The Authors. New Phytologist © 2016 New Phytologist Trust.

  7. Extracytoplasmic function σ factors of the widely distributed group ECF41 contain a fused regulatory domain

    PubMed Central

    Wecke, Tina; Halang, Petra; Staroń, Anna; Dufour, Yann S; Donohue, Timothy J; Mascher, Thorsten

    2012-01-01

    Bacteria need signal transducing systems to respond to environmental changes. Next to one- and two-component systems, alternative σ factors of the extra-cytoplasmic function (ECF) protein family represent the third fundamental mechanism of bacterial signal transduction. A comprehensive classification of these proteins identified more than 40 phylogenetically distinct groups, most of which are not experimentally investigated. Here, we present the characterization of such a group with unique features, termed ECF41. Among analyzed bacterial genomes, ECF41 σ factors are widely distributed with about 400 proteins from 10 different phyla. They lack obvious anti-σ factors that typically control activity of other ECF σ factors, but their structural genes are often predicted to be cotranscribed with carboxymuconolactone decarboxylases, oxidoreductases, or epimerases based on genomic context conservation. We demonstrate for Bacillus licheniformis and Rhodobacter sphaeroides that the corresponding genes are preceded by a highly conserved promoter motif and are the only detectable targets of ECF41-dependent gene regulation. In contrast to other ECF σ factors, proteins of group ECF41 contain a large C-terminal extension, which is crucial for σ factor activity. Our data demonstrate that ECF41 σ factors are regulated by a novel mechanism based on the presence of a fused regulatory domain. PMID:22950025

  8. Actinobacteria phylogenomics, selective isolation from an iron oligotrophic environment and siderophore functional characterization, unveil new desferrioxamine traits.

    PubMed

    Cruz-Morales, Pablo; Ramos-Aboites, Hilda E; Licona-Cassani, Cuauhtémoc; Selem-Mójica, Nelly; Mejía-Ponce, Paulina M; Souza-Saldívar, Valeria; Barona-Gómez, Francisco

    2017-09-01

    Desferrioxamines are hydroxamate siderophores widely conserved in both aquatic and soil-dwelling Actinobacteria. While the genetic and enzymatic bases of siderophore biosynthesis and their transport in model families of this phylum are well understood, evolutionary studies are lacking. Here, we perform a comprehensive desferrioxamine-centric (des genes) phylogenomic analysis, which includes the genomes of six novel strains isolated from an iron- and phosphorus-depleted oasis in the Chihuahuan desert of Mexico. Our analyses reveal previously unnoticed desferrioxamine evolutionary patterns, involving both biosynthetic and transport genes, likely to be related to desferrioxamine chemical diversity. The identified patterns were used to postulate experimentally testable hypotheses after phenotypic characterization, including profiling of siderophore production and growth stimulation of co-cultures under iron deficiency. Based on our results, we propose a novel des gene, which we term desG, as responsible for incorporation of phenylacetyl moieties during biosynthesis of previously reported arylated desferrioxamines. Moreover, a genomic-based classification of the siderophore-binding proteins responsible for specific and generalist siderophore assimilation is postulated. This report provides a much-needed evolutionary framework, with specific insights supported by experimental data, to direct the future ecological and functional analysis of desferrioxamines in the environment. © FEMS 2017.

  9. Rapid assessment of urban wetlands: Do hydrogeomorphic classification and reference criteria work?

    EPA Science Inventory

    The Hydrogeomorphic (HGM) functional assessment method is predicated on the ability of a wetland classification method based on hydrology (HGM classification) and a visual assessment of disturbance and alteration to provide reference standards against which functions in individua...

  10. Ancestry and evolution of a secretory pathway serpin

    PubMed Central

    2008-01-01

    Background: The serpin (serine protease inhibitor) superfamily constitutes a class of functionally highly diverse proteins usually encompassing several dozens of paralogs in mammals. Though phylogenetic classification of vertebrate serpins into six groups based on gene organisation is well established, the evolutionary roots beyond the fish/tetrapod split are unresolved. The aim of this study was to elucidate the phylogenetic relationships of serpins involved in surveying the secretory pathway routes against uncontrolled proteolytic activity. Results: Here, rare genomic characters are used to show that orthologs of neuroserpin, a prominent representative of vertebrate group 3 serpin genes, exist in early diverging deuterostomes and probably also in cnidarians, indicating that the origin of a mammalian serpin can be traced back far in the history of eumetazoans. A C-terminal address code assigning association with secretory pathway organelles is present in all neuroserpin orthologs, suggesting that supervision of cellular export/import routes by antiproteolytic serpins is an ancient trait, though subtle functional and compartmental specialisations have developed during their evolution. The results also suggest that massive changes in the exon-intron organisation of serpin genes have occurred along the lineage leading to vertebrate neuroserpin, in contrast with the immediately adjacent PDCD10 gene that is linked to its neighbour at least since divergence of echinoderms. The intron distribution pattern of closely adjacent and co-regulated genes thus may experience quite different fates during evolution of metazoans. Conclusion: This study demonstrates that the analysis of microsynteny and other rare characters can provide insight into the intricate family history of metazoan serpins. 
Serpins with the capacity to defend the main cellular export/import routes against uncontrolled endogenous and/or foreign proteolytic activity represent an ancient trait in eukaryotes that has been maintained continuously in metazoans though subtle changes affecting function and subcellular location have evolved. It is shown that the intron distribution pattern of neuroserpin gene orthologs has undergone substantial rearrangements during metazoan evolution. PMID:18793432

  11. Ancestry and evolution of a secretory pathway serpin.

    PubMed

    Kumar, Abhishek; Ragg, Hermann

    2008-09-15

    The serpin (serine protease inhibitor) superfamily constitutes a class of functionally highly diverse proteins usually encompassing several dozens of paralogs in mammals. Though phylogenetic classification of vertebrate serpins into six groups based on gene organisation is well established, the evolutionary roots beyond the fish/tetrapod split are unresolved. The aim of this study was to elucidate the phylogenetic relationships of serpins involved in surveying the secretory pathway routes against uncontrolled proteolytic activity. Here, rare genomic characters are used to show that orthologs of neuroserpin, a prominent representative of vertebrate group 3 serpin genes, exist in early diverging deuterostomes and probably also in cnidarians, indicating that the origin of a mammalian serpin can be traced back far in the history of eumetazoans. A C-terminal address code assigning association with secretory pathway organelles is present in all neuroserpin orthologs, suggesting that supervision of cellular export/import routes by antiproteolytic serpins is an ancient trait, though subtle functional and compartmental specialisations have developed during their evolution. The results also suggest that massive changes in the exon-intron organisation of serpin genes have occurred along the lineage leading to vertebrate neuroserpin, in contrast with the immediately adjacent PDCD10 gene that is linked to its neighbour at least since divergence of echinoderms. The intron distribution pattern of closely adjacent and co-regulated genes thus may experience quite different fates during evolution of metazoans. This study demonstrates that the analysis of microsynteny and other rare characters can provide insight into the intricate family history of metazoan serpins. 
Serpins with the capacity to defend the main cellular export/import routes against uncontrolled endogenous and/or foreign proteolytic activity represent an ancient trait in eukaryotes that has been maintained continuously in metazoans though subtle changes affecting function and subcellular location have evolved. It is shown that the intron distribution pattern of neuroserpin gene orthologs has undergone substantial rearrangements during metazoan evolution.

  12. Deep sequencing analysis of the transcriptomes of peanut aerial and subterranean young pods identifies candidate genes related to early embryo abortion.

    PubMed

    Chen, Xiaoping; Zhu, Wei; Azam, Sarwar; Li, Heying; Zhu, Fanghe; Li, Haifen; Hong, Yanbin; Liu, Haiyan; Zhang, Erhua; Wu, Hong; Yu, Shanlin; Zhou, Guiyuan; Li, Shaoxiong; Zhong, Ni; Wen, Shijie; Li, Xingyu; Knapp, Steve J; Ozias-Akins, Peggy; Varshney, Rajeev K; Liang, Xuanqiang

    2013-01-01

    The failure of peg penetration into the soil leads to seed abortion in peanut. Knowledge of genes involved in these processes is comparatively deficient. Here, we used RNA-seq to gain insights into transcriptomes of aerial and subterranean pods. More than 2 million transcript reads with an average length of 396 bp were generated from one aerial (AP) and two subterranean (SP1 and SP2) pod libraries using pyrosequencing technology. After assembly, sets of 49 632, 49 952 and 50 494 from a total of 74 974 transcript assembly contigs (TACs) were identified in AP, SP1 and SP2, respectively. A clear linear relationship in the gene expression level was observed between these data sets. In brief, 2194 differentially expressed TACs with a 99.0% true-positive rate were identified, among which 859 and 1068 TACs were up-regulated in aerial and subterranean pods, respectively. Functional analysis showed that putative function based on similarity with proteins catalogued in UniProt and gene ontology term classification could be determined for 59 342 (79.2%) and 42 955 (57.3%) TACs, respectively. A total of 2968 TACs were mapped to 174 KEGG pathways, of which 168 were shared by aerial and subterranean transcriptomes. TACs involved in photosynthesis were significantly up-regulated and enriched in the aerial pod. In addition, two senescence-associated genes were identified as significantly up-regulated in the aerial pod, which potentially contribute to embryo abortion in aerial pods, and in turn, to cessation of swelling. The data set generated in this study provides evidence for some functional genes as robust candidates underlying aerial and subterranean pod development and contributes to an elucidation of the evolutionary implications resulting from fruit development under light and dark conditions. © 2012 The Authors Plant Biotechnology Journal © 2012 Society for Experimental Biology, Association of Applied Biologists and Blackwell Publishing Ltd.

  13. Inherited Congenital Cataract: A Guide to Suspect the Genetic Etiology in the Cataract Genesis

    PubMed Central

    Messina-Baas, Olga; Cuevas-Covarrubias, Sergio A.

    2017-01-01

    Cataracts are the principal cause of treatable blindness worldwide. Inherited congenital cataract (CC) shows all types of inheritance patterns in a syndromic and nonsyndromic form. There are more than 100 genes associated with cataract with a predominance of autosomal dominant inheritance. A cataract is defined as an opacity of the lens producing a variation of the refractive index of the lens. This variation derives from modifications in the lens structure resulting in light scattering, frequently a consequence of a significant concentration of high-molecular-weight protein aggregates. The aim of this review is to introduce a guide to identify the gene involved in inherited CC. Due to the manifold clinical and genetic heterogeneity, we discarded the cataract phenotype as a cardinal sign; a 4-group classification with the genes implicated in inherited CC is proposed. We consider that this classification will assist in identifying the probable gene involved in inherited CC. PMID:28611546

  14. Angiotensinogen gene polymorphism predicts hypertension, and iridological constitutional classification enhances the risk for hypertension in Koreans.

    PubMed

    Cho, Joo-Jang; Hwang, Woo-Jun; Hong, Seung-Heon; Jeong, Hyun-Ja; Lee, Hye-Jung; Kim, Hyung-Min; Um, Jae-Young

    2008-05-01

    This study investigated the relationship between iridological constitution and angiotensinogen (AGN) gene polymorphism in hypertensives. Along with the angiotensin-converting enzyme gene, the AGN genotype is one of the most well-studied genetic markers of hypertension. Iridology, a form of complementary and alternative medicine, is the diagnosis of medical conditions by noting irregularities of pigmentation in the iris. Iridological constitution has a strong familial aggregation and is implicated in heredity. Therefore, the study classified 87 hypertensive patients with a familial history of cerebral infarction and controls (n = 88) according to iris constitution, and determined AGN genotype. As a result, the AGN/TT genotype was associated with hypertension (χ² = 13.413, p < .05). The frequency of the T allele was 0.92 in patients and 0.76 in controls (χ² = 13.159, p < .05). In addition, iridological constitutional classification increased the relative risk for hypertension in the subjects with the AGN/T allele. These results suggest that AGN polymorphism predicts hypertension, and iridological constitutional classification enhances the risk for hypertension associated with AGN/T in a Korean population.
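
    The chi-square statistics in this abstract compare genotype or allele counts between patients and controls. As a rough illustration of that kind of test, the sketch below computes a Pearson chi-square for a 2x2 table of allele counts. The counts are hypothetical and are not taken from the study, the function names are illustrative, and the abstract does not say whether a continuity correction was applied, so none is used here.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("every row and column needs a non-zero total")
    return n * (a * d - b * c) ** 2 / denom

def p_value_df1(chi2):
    """Upper-tail p-value for a chi-square statistic with 1 degree of freedom."""
    # For 1 df, the statistic is a squared standard normal variable.
    return math.erfc(math.sqrt(chi2 / 2))

# Hypothetical allele counts, NOT taken from the study:
# rows = patients / controls, columns = T allele / other allele.
chi2 = chi_square_2x2(30, 10, 20, 20)
p = p_value_df1(chi2)
print(f"chi2 = {chi2:.3f}, p = {p:.4f}")
```

    With real data, each individual contributes two alleles, so the allele table has twice as many observations as the genotype table.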

  15. Community of protein complexes impacts disease association

    PubMed Central

    Wang, Qianghu; Liu, Weisha; Ning, Shangwei; Ye, Jingrun; Huang, Teng; Li, Yan; Wang, Peng; Shi, Hongbo; Li, Xia

    2012-01-01

    One important challenge in the post-genomic era is uncovering the relationships among distinct pathophenotypes by using molecular signatures. Given the complex functional interdependencies between cellular components, a disease is seldom the consequence of a defect in a single gene product, instead reflecting the perturbations of a group of closely related gene products that carry out specific functions together. Therefore, it is meaningful to explore how the community of protein complexes impacts disease associations. Here, by integrating a large amount of information from protein complexes and the cellular basis of diseases, we built a human disease network in which two diseases are linked if they share common disease-related protein complex. A systemic analysis revealed that linked disease pairs exhibit higher comorbidity than those that have no links, and that the stronger association two diseases have based on protein complexes, the higher comorbidity they are prone to display. Moreover, more connected diseases tend to be malignant, which have high prevalence. We provide novel disease associations that cannot be identified through previous analysis. These findings will potentially provide biologists and clinicians new insights into the etiology, classification and treatment of diseases. PMID:22549411
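
    The linking rule described in this abstract (two diseases are connected if they share a disease-related protein complex) can be sketched in a few lines. The disease and complex names below are placeholders, not data from the study, and using the number of shared complexes as an edge weight is an assumption made for illustration; the abstract does not define how association strength was measured.

```python
from itertools import combinations

# Hypothetical disease -> protein-complex annotations (illustrative only).
disease_complexes = {
    "disease_A": {"complex_1", "complex_2"},
    "disease_B": {"complex_2", "complex_3"},
    "disease_C": {"complex_4"},
    "disease_D": {"complex_1", "complex_2", "complex_3"},
}

def build_disease_network(annotations):
    """Return {(d1, d2): number of shared complexes} for every linked pair."""
    edges = {}
    for d1, d2 in combinations(sorted(annotations), 2):
        shared = annotations[d1] & annotations[d2]
        if shared:  # link only diseases that share at least one complex
            edges[(d1, d2)] = len(shared)
    return edges

network = build_disease_network(disease_complexes)
for (d1, d2), weight in sorted(network.items()):
    print(d1, d2, weight)
```

    In this toy input, disease_C shares no complex with any other disease and therefore stays unlinked.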

  16. Inducible CRISPR genome-editing tool: classifications and future trends.

    PubMed

    Dai, Xiaofeng; Chen, Xiao; Fang, Qiuwu; Li, Jia; Bai, Zhonghu

    2018-06-01

    The discovery of the CRISPR-Cas9/dCas9 system has reinforced our ability and revolutionized our history in genome engineering. While Cas9 and dCas9 are programmed to modulate gene expression by introducing DNA breaks, blocking transcription factor recruitment or dragging functional groups towards the targeted sites, sgRNAs determine the genomic loci where the modulation occurs. The off-target problem, due to limited sgRNA specificity and genome complexity of many species, has posed concerns for the wide application of this revolutionary technique. To solve this problem and, more importantly, gain power over gene functionality and cell fate control, inducible strategies have been continuously evolved to offer tailored solutions to address specific biological questions. By reviewing recent advances in inducible CRISPR system design and critical elements potentially adding value to such systems, we classify current approaches in this domain into four mechanically distinct categories, namely, "split system", "allosteric system", "combinatorial system", and "transient delivery system", discuss the pros and cons of each system, and point out the under-explored areas and future directions, with the aim of enriching our toolbox of delicate life engineering.

  17. Genome-wide analysis and identification of stress-responsive genes of the NAM-ATAF1,2-CUC2 transcription factor family in apple.

    PubMed

    Su, Hongyan; Zhang, Shizhong; Yuan, Xiaowei; Chen, Changtian; Wang, Xiao-Fei; Hao, Yu-Jin

    2013-10-01

    NAC (NAM, ATAF1,2, and CUC2) proteins constitute one of the largest families of plant-specific transcription factors. To date, little is known about the NAC genes in the apple (Malus domestica). In this study, a total of 180 NAC genes were identified in the apple genome and were phylogenetically clustered into six groups (I-VI) with the NAC genes from Arabidopsis and rice. The predicted apple NAC genes were distributed across all 17 chromosomes at various densities. Additionally, the gene structure and motif compositions of the apple NAC genes were analyzed. Moreover, the expression of 29 selected apple NAC genes was analyzed in different tissues and under different abiotic stress conditions. All of the selected genes, with the exception of four genes, were expressed in at least one of the tissues tested, which indicates that the NAC genes are involved in various aspects of the physiological and developmental processes of the apple. Encouragingly, 17 of the selected genes were found to respond to one or more of the abiotic stress treatments, and these 17 genes included not only the expected 7 genes that were clustered with the well-known stress-related marker genes in group IV but also 10 genes located in other subgroups, none of which contains members that have been reported to be stress-related. To the best of our knowledge, this report describes the first genome-wide analysis of the apple NAC gene family, and the results should provide valuable information for understanding the classification and putative functions of this family. Copyright © 2013 Elsevier Masson SAS. All rights reserved.

  18. Improved, ACMG-Compliant, in silico prediction of pathogenicity for missense substitutions encoded by TP53 variants.

    PubMed

    Fortuno, Cristina; James, Paul A; Young, Erin L; Feng, Bing; Olivier, Magali; Pesaran, Tina; Tavtigian, Sean V; Spurdle, Amanda B

    2018-05-18

    Clinical interpretation of germline missense variants represents a major challenge, including those in the TP53 Li-Fraumeni syndrome gene. Bioinformatic prediction is a key part of variant classification strategies. We aimed to optimize the performance of the Align-GVGD tool used for p53 missense variant prediction, and compare its performance to other bioinformatic tools (SIFT, PolyPhen-2) and ensemble methods (REVEL, BayesDel). Reference sets of assumed pathogenic and assumed benign variants were defined using functional and/or clinical data. Area under the curve and Matthews correlation coefficient (MCC) values were used as objective functions to select an optimized protein multi-sequence alignment with best performance for Align-GVGD. MCC comparison of tools using binary categories showed optimized Align-GVGD (C15 cut-off) combined with BayesDel (0.16 cut-off), or with REVEL (0.5 cut-off), to have the best overall performance. Further, a semi-quantitative approach using multiple tiers of bioinformatic prediction, validated using an independent set of non-functional and functional variants, supported use of Align-GVGD and BayesDel prediction for different strength of evidence levels in ACMG/AMP rules. We provide rationale for bioinformatic tool selection for TP53 variant classification, and have also computed relevant bioinformatic predictions for every possible p53 missense variant to facilitate their use by the scientific and medical community. This article is protected by copyright. All rights reserved.
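
    The Matthews correlation coefficient used in this abstract to compare tools at a binary cut-off can be computed directly from the confusion matrix. The sketch below is a minimal illustration: the scores, labels and the 0.5 cut-off are made-up example values rather than REVEL or BayesDel output, and the helper names are hypothetical.

```python
import math

def confusion_counts(scores, labels, cutoff):
    """Count TP, TN, FP, FN when scores >= cutoff are called pathogenic (1)."""
    tp = tn = fp = fn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= cutoff else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        elif predicted == 1 and label == 0:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn

def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when a row or column of the table is empty.
    return (tp * tn - fp * fn) / denominator if denominator else 0.0

# Made-up scores and reference labels (1 = assumed pathogenic, 0 = assumed benign).
scores = [0.9, 0.7, 0.4, 0.2, 0.6, 0.1]
labels = [1, 1, 1, 0, 0, 0]
mcc = matthews_corrcoef(*confusion_counts(scores, labels, cutoff=0.5))
print(round(mcc, 3))
```

    MCC ranges from -1 to 1 and uses all four cells of the confusion matrix, which makes it less sensitive than raw accuracy to unequal numbers of pathogenic and benign reference variants.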

  19. Differential expression of genes in fetal brain as a consequence of maternal protein deficiency and nematode infection.

    PubMed

    Haque, Manjurul; Starr, Lisa M; Koski, Kristine G; Scott, Marilyn E

    2018-01-01

    Maternal dietary protein deficiency and gastrointestinal nematode infection during early pregnancy have negative impacts on both maternal placental gene expression and fetal growth in the mouse. Here we used next-generation RNA sequencing to test our hypothesis that maternal protein deficiency and/or nematode infection also alter the expression of genes in the developing fetal brain. Outbred pregnant CD1 mice were used in a 2×2 design with two levels of dietary protein (24% versus 6%) and two levels of infection (repeated sham versus Heligmosomoides bakeri beginning at gestation day 5). Pregnant dams were euthanized on gestation day 18 to harvest the whole fetal brain. Four fetal brains from each treatment group were analyzed using RNA Hi-Seq sequencing and the differential expression of genes was determined by the edgeR package using NetworkAnalyst. In response to maternal H. bakeri infection, 96 genes (88 up-regulated and eight down-regulated) were differentially expressed in the fetal brain. Differentially expressed genes were involved in metabolic processes, developmental processes and the immune system according to the PANTHER classification system. Among the important biological functions identified, several up-regulated genes have known neurological functions including neuro-development (Gdf15, Ing4), neural differentiation (miRNA let-7), synaptic plasticity (via suppression of NF-κB), neuro-inflammation (S100A8, S100A9) and glucose metabolism (Tnnt1, Atf3). However, in response to maternal protein deficiency, brain-specific serine protease (Prss22) was the only up-regulated gene and only one gene (Dynlt1a) responded to the interaction of maternal nematode infection and protein deficiency. In conclusion, maternal exposure to GI nematode infection from day 5 to 18 of pregnancy may influence developmental programming of the fetal brain. Copyright © 2017 The Author(s). Published by Elsevier Ltd. All rights reserved.

  20. Transcriptome Characterization of Cymbidium sinense 'Dharma' Using 454 Pyrosequencing and Its Application in the Identification of Genes Associated with Leaf Color Variation.

    PubMed

    Zhu, Genfa; Yang, Fengxi; Shi, Shanshan; Li, Dongmei; Wang, Zhen; Liu, Hailin; Huang, Dan; Wang, Caiyun

    2015-01-01

    The highly variable leaf color of Cymbidium sinense significantly improves its horticultural and economic value, and makes it highly desirable in the flower markets in China and Southeast Asia. However, little is understood about the molecular mechanism underlying leaf-color variations. In this study, we found the content of photosynthetic pigments, especially chlorophyll degradation metabolite in the leaf-color mutants is distinguished significantly from that in the wild type of Cymbidium sinense 'Dharma'. To further determine the candidate genes controlling leaf-color variations, we first sequenced the global transcriptome using 454 pyrosequencing. More than 0.7 million expressed sequence tags (ESTs) with an average read length of 445.9 bp were generated and assembled into 103,295 isotigs representing 68,460 genes. Of these isotigs, 43,433 were significantly aligned to known proteins in the public database, of which 29,299 could be categorized into 42 functional groups in the gene ontology system, 10,079 classified into 23 functional classifications in the clusters of orthologous groups system, and 23,092 assigned to 139 clusters of specific metabolic pathways in the Kyoto Encyclopedia of Genes and Genomes. Among these annotations, 95 isotigs were designated as involved in chlorophyll metabolism. On this basis, we identified 16 key enzyme-encoding genes in the chlorophyll metabolism pathway, the full length cDNAs and expressions of which were further confirmed. Expression pattern indicated that the key enzyme-encoding genes for chlorophyll degradation were more highly expressed in the leaf color mutants, as was consistent with their lower chlorophyll contents. 
This study is the first to supply an informative 454 EST dataset for Cymbidium sinense 'Dharma' and to identify original leaf color-associated genes, which provide important resources to facilitate gene discovery for molecular breeding, marketable trait discovery, and investigation of various biological processes in this species.

  1. Transcriptome Characterization of Cymbidium sinense 'Dharma' Using 454 Pyrosequencing and Its Application in the Identification of Genes Associated with Leaf Color Variation

    PubMed Central

    Shi, Shanshan; Li, Dongmei; Wang, Zhen; Liu, Hailin; Huang, Dan; Wang, Caiyun

    2015-01-01

    The highly variable leaf color of Cymbidium sinense significantly improves its horticultural and economic value, and makes it highly desirable in the flower markets in China and Southeast Asia. However, little is understood about the molecular mechanism underlying leaf-color variations. In this study, we found the content of photosynthetic pigments, especially chlorophyll degradation metabolite in the leaf-color mutants is distinguished significantly from that in the wild type of Cymbidium sinense 'Dharma'. To further determine the candidate genes controlling leaf-color variations, we first sequenced the global transcriptome using 454 pyrosequencing. More than 0.7 million expressed sequence tags (ESTs) with an average read length of 445.9 bp were generated and assembled into 103,295 isotigs representing 68,460 genes. Of these isotigs, 43,433 were significantly aligned to known proteins in the public database, of which 29,299 could be categorized into 42 functional groups in the gene ontology system, 10,079 classified into 23 functional classifications in the clusters of orthologous groups system, and 23,092 assigned to 139 clusters of specific metabolic pathways in the Kyoto Encyclopedia of Genes and Genomes. Among these annotations, 95 isotigs were designated as involved in chlorophyll metabolism. On this basis, we identified 16 key enzyme-encoding genes in the chlorophyll metabolism pathway, the full length cDNAs and expressions of which were further confirmed. Expression pattern indicated that the key enzyme-encoding genes for chlorophyll degradation were more highly expressed in the leaf color mutants, as was consistent with their lower chlorophyll contents. 
This study is the first to supply an informative 454 EST dataset for Cymbidium sinense 'Dharma' and to identify original leaf color-associated genes, which provide important resources to facilitate gene discovery for molecular breeding, marketable trait discovery, and investigation of various biological processes in this species. PMID:26042676

  2. The Molecular Pathology of Myelodysplastic Syndrome.

    PubMed

    Haferlach, Torsten

    2018-05-23

    The diagnosis and classification of myelodysplastic syndromes (MDS) are based on cytomorphology and cytogenetics (WHO classification). Prognosis is best defined by the Revised International Prognostic Scoring System (IPSS-R). In recent years, an increasing number of molecular aberrations have been discovered. They are already included in the classification (e.g., SF3B1) and, more importantly, have emerged as valuable markers for better classification, particularly for defining risk groups. Mutations in genes such as SF3B1 and IDH1/2 have already had an impact on targeted treatment approaches in MDS. © 2018 S. Karger AG, Basel.

  3. ATM-Mediated Transcriptional and Developmental Responses to γ-rays in Arabidopsis

    PubMed Central

    Renou, Jean-Pierre; Pichon, Olivier; Fochesato, Sylvain; Ortet, Philippe; Montané, Marie-Hélène

    2007-01-01

    ATM (Ataxia Telangiectasia Mutated) is an essential checkpoint kinase that signals DNA double-strand breaks in eukaryotes. Its depletion causes meiotic and somatic defects in Arabidopsis and progressive motor impairment accompanied by several cell deficiencies in patients with ataxia telangiectasia (AT). To obtain a comprehensive view of the ATM pathway in plants, we performed a time-course analysis of seedling responses by combining confocal laser scanning microscopy studies of root development and genome-wide expression profiling of wild-type (WT) and homozygous ATM-deficient mutants challenged with a dose of γ-rays (IR) that is sublethal for WT plants. Early morphologic defects in meristematic stem cells indicated that AtATM, an Arabidopsis homolog of the human ATM gene, is essential for maintaining the quiescent center and controlling the differentiation of initial cells after exposure to IR. Results of several microarray experiments performed with whole seedlings and roots up to 5 h post-IR were compiled in a single table, which was used to import gene information and extract gene sets. Sequence and function homology searches; import of spatio-temporal, cell cycling, and mutant-constitutive expression characteristics; and a simplified functional classification system were used to identify novel genes in all functional classes. The hundreds of radiomodulated genes identified were not a random collection, but belonged to functional pathways such as those of the cell cycle; cell death and repair; DNA replication, repair, and recombination; and transcription; translation; and signaling, indicating the strong cell reprogramming and double-strand break abrogation functions of ATM checkpoints. Accordingly, genes in all functional classes were either down or up-regulated concomitantly with downregulation of chromatin deacetylases or upregulation of acetylases and methylases, respectively. 
Determining the early transcriptional indicators of prolonged S-G2 phases that coincided with cell proliferation delay, or an anticipated subsequent auxin increase, accelerated cell differentiation or death, was used to link IR-regulated hallmark functions and tissue phenotypes after IR. The transcription burst was almost exclusively AtATM-dependent or weakly AtATR-dependent, and followed two major trends of expression in atm: (i)-loss or severe attenuation and delay, and (ii)-inverse and/or stochastic, as well as specific, enabling one to distinguish IR/ATM pathway constituents. Our data provide a large resource for studies on the interaction between plant checkpoints of the cell cycle, development, hormone response, and DNA repair functions, because IR-induced transcriptional changes partially overlap with the response to environmental stress. Putative connections of ATM to stem cell maintenance pathways after IR are also discussed. PMID:17487278

  4. ATM-mediated transcriptional and developmental responses to gamma-rays in Arabidopsis.

    PubMed

    Ricaud, Lilian; Proux, Caroline; Renou, Jean-Pierre; Pichon, Olivier; Fochesato, Sylvain; Ortet, Philippe; Montané, Marie-Hélène

    2007-05-09

    ATM (Ataxia Telangiectasia Mutated) is an essential checkpoint kinase that signals DNA double-strand breaks in eukaryotes. Its depletion causes meiotic and somatic defects in Arabidopsis and progressive motor impairment accompanied by several cell deficiencies in patients with ataxia telangiectasia (AT). To obtain a comprehensive view of the ATM pathway in plants, we performed a time-course analysis of seedling responses by combining confocal laser scanning microscopy studies of root development and genome-wide expression profiling of wild-type (WT) and homozygous ATM-deficient mutants challenged with a dose of gamma-rays (IR) that is sublethal for WT plants. Early morphologic defects in meristematic stem cells indicated that AtATM, an Arabidopsis homolog of the human ATM gene, is essential for maintaining the quiescent center and controlling the differentiation of initial cells after exposure to IR. Results of several microarray experiments performed with whole seedlings and roots up to 5 h post-IR were compiled in a single table, which was used to import gene information and extract gene sets. Sequence and function homology searches; import of spatio-temporal, cell cycling, and mutant-constitutive expression characteristics; and a simplified functional classification system were used to identify novel genes in all functional classes. The hundreds of radiomodulated genes identified were not a random collection, but belonged to functional pathways such as those of the cell cycle; cell death and repair; DNA replication, repair, and recombination; and transcription; translation; and signaling, indicating the strong cell reprogramming and double-strand break abrogation functions of ATM checkpoints. Accordingly, genes in all functional classes were either down or up-regulated concomitantly with downregulation of chromatin deacetylases or upregulation of acetylases and methylases, respectively. 
Determining the early transcriptional indicators of prolonged S-G2 phases that coincided with cell proliferation delay, or an anticipated subsequent auxin increase, accelerated cell differentiation or death, was used to link IR-regulated hallmark functions and tissue phenotypes after IR. The transcription burst was almost exclusively AtATM-dependent or weakly AtATR-dependent, and followed two major trends of expression in atm: (i)-loss or severe attenuation and delay, and (ii)-inverse and/or stochastic, as well as specific, enabling one to distinguish IR/ATM pathway constituents. Our data provide a large resource for studies on the interaction between plant checkpoints of the cell cycle, development, hormone response, and DNA repair functions, because IR-induced transcriptional changes partially overlap with the response to environmental stress. Putative connections of ATM to stem cell maintenance pathways after IR are also discussed.

  5. A Developmental Approach to Characterizing the Tissue-Invasion Gene Program in Breast Cancer

    DTIC Science & Technology

    2001-09-01

    ...induced host response. Am J. Pathol. 149:273-282, 1996. 4. Wolf, C., Rouyer, N., Lutz, Y., Adida, C., Loriot, M., Bellocq, J.P., Chambon, P., and Basset...following a 5 d incubation period (upper left and right panels). In contrast, MT1-MMP-transfected cells perforated the BM in representative TEM and

  6. Gene expression-based molecular diagnostic system for malignant gliomas is superior to histological diagnosis.

    PubMed

    Shirahata, Mitsuaki; Iwao-Koizumi, Kyoko; Saito, Sakae; Ueno, Noriko; Oda, Masashi; Hashimoto, Nobuo; Takahashi, Jun A; Kato, Kikuya

    2007-12-15

    Current morphology-based glioma classification methods do not adequately reflect the complex biology of gliomas, thus limiting their prognostic ability. In this study, we focused on anaplastic oligodendroglioma and glioblastoma, which typically follow distinct clinical courses. Our goal was to construct a clinically useful molecular diagnostic system based on gene expression profiling. The expression of 3,456 genes in 32 patients, 12 and 20 of whom had prognostically distinct anaplastic oligodendroglioma and glioblastoma, respectively, was measured by PCR array. Next to unsupervised methods, we did supervised analysis using a weighted voting algorithm to construct a diagnostic system discriminating anaplastic oligodendroglioma from glioblastoma. The diagnostic accuracy of this system was evaluated by leave-one-out cross-validation. The clinical utility was tested on a microarray-based data set of 50 malignant gliomas from a previous study. Unsupervised analysis showed divergent global gene expression patterns between the two tumor classes. A supervised binary classification model showed 100% (95% confidence interval, 89.4-100%) diagnostic accuracy by leave-one-out cross-validation using 168 diagnostic genes. Applied to a gene expression data set from a previous study, our model correlated better with outcome than histologic diagnosis, and also displayed 96.6% (28 of 29) consistency with the molecular classification scheme used for these histologically controversial gliomas in the original article. Furthermore, we observed that histologically diagnosed glioblastoma samples that shared anaplastic oligodendroglioma molecular characteristics tended to be associated with longer survival. Our molecular diagnostic system showed reproducible clinical utility and prognostic ability superior to traditional histopathologic diagnosis for malignant glioma.
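    The pairing of a weighted voting classifier with leave-one-out cross-validation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say which weighted voting variant was used, so the sketch assumes the commonly described signal-to-noise form, in which each gene votes with weight (mean_A - mean_B) / (sd_A + sd_B) times the distance of its expression from the midpoint of the two class means. The function names, the toy expression values, and the labels "AO" and "GBM" are hypothetical.

```python
from statistics import mean, stdev


def train_weighted_voting(samples, labels, n_genes, class_a="AO", class_b="GBM"):
    """Return (gene_index, weight, boundary) for the n_genes with the largest |weight|."""
    rows_a = [s for s, y in zip(samples, labels) if y == class_a]
    rows_b = [s for s, y in zip(samples, labels) if y == class_b]
    scored = []
    for g in range(len(samples[0])):
        a = [row[g] for row in rows_a]
        b = [row[g] for row in rows_b]
        spread = stdev(a) + stdev(b)
        if spread == 0:
            continue  # a constant gene carries no usable signal-to-noise weight
        weight = (mean(a) - mean(b)) / spread   # signal-to-noise statistic (assumed form)
        boundary = (mean(a) + mean(b)) / 2      # midpoint between the class means
        scored.append((g, weight, boundary))
    scored.sort(key=lambda item: abs(item[1]), reverse=True)
    return scored[:n_genes]


def predict(model, sample, class_a="AO", class_b="GBM"):
    """Sum the signed votes of all selected genes; a positive total means class_a."""
    total = sum(weight * (sample[g] - boundary) for g, weight, boundary in model)
    return class_a if total > 0 else class_b


def loocv_accuracy(samples, labels, n_genes):
    """Hold out each sample once, retrain (including gene selection) on the rest, and score."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train_weighted_voting(train_x, train_y, n_genes)
        correct += predict(model, samples[i]) == labels[i]
    return correct / len(samples)


# Hypothetical toy data: 8 samples x 3 genes (gene 2 is uninformative noise).
samples = [
    [5.0, 1.0, 3.0], [5.2, 1.1, 2.9], [4.8, 0.9, 3.1], [5.1, 1.2, 3.0],
    [1.0, 5.0, 3.0], [1.2, 5.1, 3.1], [0.9, 4.8, 2.9], [1.1, 5.2, 3.0],
]
labels = ["AO"] * 4 + ["GBM"] * 4

accuracy = loocv_accuracy(samples, labels, n_genes=2)
print(f"LOOCV accuracy: {accuracy:.2f}")
```

    Gene selection is repeated inside every fold, so the held-out sample never influences which genes vote; selecting genes once on the full data set would tend to inflate the cross-validated accuracy.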

  7. Effective Classification and Gene Expression Profiling for the Facioscapulohumeral Muscular Dystrophy

    PubMed Central

    González-Navarro, Félix F.; Belanche-Muñoz, Lluís A.; Silva-Colón, Karen A.

    2013-01-01

    Facioscapulohumeral Muscular Dystrophy (FSHD) is an autosomal dominant neuromuscular disorder whose incidence is estimated at about one in 400,000 to one in 20,000. No effective therapeutic strategies are known to halt progression or reverse muscle weakness and atrophy. It is known that FSHD is caused by modifications located within a D4Z4 repeat array in chromosome 4q, while recent advances have linked these modifications to the DUX4 gene. Unfortunately, the complete mechanisms responsible for the molecular pathogenesis and progressive muscle weakness still remain unknown. Although there are many studies addressing cancer databases from a machine learning perspective, there is no such precedent in the analysis of FSHD. This study aims to fill this gap by analyzing two specific FSHD databases. A feature selection algorithm is used as the main engine to select genes promoting the highest possible classification capacity. The combination of feature selection and classification aims at obtaining simple models (in terms of very low numbers of genes) capable of good generalization, that may be associated with the disease. We show that the reported method is highly efficient in finding genes to discern between healthy cases (not affected by FSHD) and FSHD cases, allowing the discovery of very parsimonious models that yield negligible repeated cross-validation error. These models in turn give rise to very simple decision procedures in the form of a decision tree. Current biological evidence regarding these genes shows that they are linked to skeletal muscle processes concerning specific human conditions. PMID:24349187

  8. Functional outcomes in children and young people with dyskinetic cerebral palsy.

    PubMed

    Monbaliu, Elegast; De La Peña, Mary-Grace; Ortibus, Els; Molenaers, Guy; Deklerck, Jan; Feys, Hilde

    2017-06-01

    This cross-sectional study aimed to map the functional profile of individuals with dyskinetic cerebral palsy (CP), to determine interrelationships between the functional classification systems, and to investigate the relationship of functional abilities with dystonia and choreoathetosis severity. Fifty-five children (<15y) and young people (15-22y) (30 males, 25 females; mean age 14y 6mo, standard deviation 4y 1mo) with dyskinetic CP were assessed using the Gross Motor Function Classification System (GMFCS), Manual Ability Classification System (MACS), Communication Function Classification System (CFCS), Eating and Drinking Ability Classification System (EDACS), and Viking Speech Scale (VSS), as well as the Dyskinesia Impairment Scale. Over 50 per cent of the participants exhibited the highest limitation levels in GMFCS, MACS, and VSS. Better functional abilities were seen in EDACS and CFCS. Moderate to excellent interrelationship was found among the classification scales. All scales had significant correlation (r_s = 0.65-0.81) with dystonia severity except for CFCS in the young people group. Finally, only MACS (r_s = 0.40) and EDACS (r_s = 0.55) in the young people group demonstrated significant correlation with choreoathetosis severity. The need for inclusion of speech, eating, and drinking in the functional assessment of dyskinetic CP is highlighted. The study further supports the strategy of managing dystonia in particular at a younger age followed by choreoathetosis in a later stage. © 2017 Mac Keith Press.

  9. Supervised group Lasso with applications to microarray data analysis

    PubMed Central

    Ma, Shuangge; Song, Xiao; Huang, Jian

    2007-01-01

    Background: A tremendous amount of effort has been devoted to identifying genes for diagnosis and prognosis of diseases using microarray gene expression data. It has been demonstrated that gene expression data have cluster structure, where the clusters consist of co-regulated genes which tend to have coordinated functions. However, most available statistical methods for gene selection do not take into consideration the cluster structure. Results: We propose a supervised group Lasso approach that takes into account the cluster structure in gene expression data for gene selection and predictive model building. For gene expression data without biological cluster information, we first divide genes into clusters using the K-means approach and determine the optimal number of clusters using the Gap method. The supervised group Lasso consists of two steps. In the first step, we identify important genes within each cluster using the Lasso method. In the second step, we select important clusters using the group Lasso. Tuning parameters are determined using V-fold cross validation at both steps to allow for further flexibility. Prediction performance is evaluated using leave-one-out cross validation. We apply the proposed method to disease classification and survival analysis with microarray data. Conclusion: We analyze four microarray data sets using the proposed approach: two cancer data sets with binary cancer occurrence as outcomes and two lymphoma data sets with survival outcomes. The results show that the proposed approach is capable of identifying a small number of influential gene clusters and important genes within those clusters, and has better prediction performance than existing methods. PMID:17316436
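    The two steps described above can be written out as penalized estimation problems. The notation below is an assumption made for illustration (the abstract gives no formulas): \ell is a loss such as the negative log-likelihood, \beta^{(k)} is the coefficient vector of the genes in cluster k, p_k is the number of genes retained in that cluster, and the \sqrt{p_k} weights follow the standard group Lasso formulation, which may differ from the weighting used in the paper.

```latex
% Step 1 (within cluster k): Lasso selects individual genes.
\hat{\beta}^{(k)} = \arg\min_{\beta^{(k)}}
    \; \ell\bigl(\beta^{(k)}\bigr) + \lambda_k \sum_{j} \bigl|\beta^{(k)}_j\bigr|,
    \qquad k = 1, \dots, K

% Step 2 (across clusters): group Lasso selects whole clusters,
% using only the genes with nonzero coefficients from Step 1.
\hat{\beta} = \arg\min_{\beta}
    \; \ell(\beta) + \lambda \sum_{k=1}^{K} \sqrt{p_k}\, \bigl\|\beta^{(k)}\bigr\|_2
```

    Because the Euclidean norm is not squared, the Step 2 penalty can shrink an entire cluster's coefficient vector exactly to zero, which is what makes the selection operate at the cluster level. The tuning parameters \lambda_k and \lambda correspond to the quantities the abstract says are chosen by V-fold cross validation.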

  10. Genome-Wide Classification and Evolutionary and Expression Analyses of Citrus MYB Transcription Factor Families in Sweet Orange

    PubMed Central

    Hou, Xiao-Jin; Li, Si-Bei; Liu, Sheng-Rui; Hu, Chun-Gen; Zhang, Jin-Zhi

    2014-01-01

    MYB family genes are widely distributed in plants and comprise one of the largest transcription factor families involved in various developmental processes and defense responses of plants. To date, few MYB genes and little expression profiling have been reported for citrus. Here, we describe and classify 177 members of the sweet orange MYB gene (CsMYB) family in terms of their genomic gene structures and similarity to their putative Arabidopsis orthologs. According to these analyses, these CsMYBs were categorized into four groups (4R-MYB, 3R-MYB, 2R-MYB and 1R-MYB). Gene structure analysis revealed that 1R-MYB genes possess relatively more introns as compared with 2R-MYB genes. Investigation of their chromosomal localizations revealed that these CsMYBs are distributed across nine chromosomes. Sweet orange includes a relatively small number of MYB genes compared with the 198 members in Arabidopsis, presumably due to a paralog reduction related to repetitive sequence insertion into promoter and non-coding transcribed regions of the genes. Comparative studies of CsMYBs and Arabidopsis showed that CsMYBs had fewer gene duplication events. Expression analysis revealed that the MYB gene family has a wide expression profile in sweet orange development and plays important roles in development and stress responses. In addition, 337 new putative microsatellites with flanking sequences sufficient for primer design were also identified from the 177 CsMYBs. These results provide a useful reference for the selection of candidate MYB genes for cloning and further functional analysis for citrus. PMID:25375352

  11. Evolving Concepts and Translational Relevance of Enteroendocrine Cell Biology.

    PubMed

    Drucker, Daniel J

    2016-03-01

    Classical enteroendocrine cell (EEC) biology evolved historically from identification of scattered hormone-producing endocrine cells within the epithelial mucosa of the stomach, small and large intestine. Purification of functional EEC hormones from intestinal extracts, coupled with molecular cloning of cDNAs and genes expressed within EECs, has greatly expanded the complexity of EEC endocrinology, with implications for understanding the contribution of EECs to disease pathophysiology. PubMed searches identified manuscripts highlighting new concepts illuminating the molecular biology, classification and functional role(s) of EECs and their hormonal products. Molecular interrogation of EECs has been transformed over the past decade, raising multiple new questions that challenge historical concepts of EEC biology. Evidence for evolution of the EEC from a unihormonal cell type with classical endocrine actions, to a complex plurihormonal dynamic cell with pleiotropic interactive functional networks within the gastrointestinal mucosa is critically assessed. We discuss gaps in understanding how EECs sense and respond to nutrients, cytokines, toxins, pathogens, the microbiota, and the microbial metabolome, and highlight the expanding translational relevance of EECs in the pathophysiology and therapy of metabolic and inflammatory disorders. The EEC system represents the largest specialized endocrine network in human physiology, integrating environmental and nutrient cues, enabling neural and hormonal control of metabolic homeostasis. Updating EEC classification systems will enable more accurate comparative analyses of EEC subpopulations and endocrine networks in multiple regions of the gastrointestinal tract.

  12. Molecular Evolution and Functional Diversification of Replication Protein A1 in Plants

    PubMed Central

    Aklilu, Behailu B.; Culligan, Kevin M.

    2016-01-01

    Replication protein A (RPA) is a heterotrimeric, single-stranded DNA binding complex required for eukaryotic DNA replication, repair, and recombination. RPA is composed of three subunits, RPA1, RPA2, and RPA3. In contrast to single RPA subunit genes generally found in animals and yeast, plants encode multiple paralogs of RPA subunits, suggesting subfunctionalization. Genetic analysis demonstrates that five Arabidopsis thaliana RPA1 paralogs (RPA1A to RPA1E) have unique and overlapping functions in DNA replication, repair, and meiosis. We hypothesize here that RPA1 subfunctionalities will be reflected in major structural and sequence differences among the paralogs. To address this, we analyzed amino acid and nucleotide sequences of RPA1 paralogs from 25 complete genomes representing a wide spectrum of plants and unicellular green algae. We find here that the plant RPA1 gene family is divided into three general groups termed RPA1A, RPA1B, and RPA1C, which likely arose from two progenitor groups in unicellular green algae. In the family Brassicaceae the RPA1B and RPA1C groups have further expanded to include two unique sub-functional paralogs RPA1D and RPA1E, respectively. In addition, RPA1 groups have unique domains, motifs, cis-elements, gene expression profiles, and pattern of conservation that are consistent with proposed functions in monocot and dicot species, including a novel C-terminal zinc-finger domain found only in plant RPA1C-like sequences. These results allow for improved prediction of RPA1 subunit functions in newly sequenced plant genomes, and potentially provide a unique molecular tool to improve classification of Brassicaceae species. PMID:26858742

  13. Should the Gross Motor Function Classification System be used for children who do not have cerebral palsy?

    PubMed

    Towns, Megan; Rosenbaum, Peter; Palisano, Robert; Wright, F Virginia

    2018-02-01

    This literature review addressed four questions. (1) In which populations other than cerebral palsy (CP) has the Gross Motor Function Classification System (GMFCS) been applied? (2) In what types of study, and why was it used? (3) How was it modified to facilitate these applications? (4) What justifications and evidence of psychometric adequacy were used to support its application? A search of PubMed, MEDLINE, and Embase databases (January 1997 to April 2017) using the terms: 'GMFCS' OR 'Gross Motor Function Classification System' yielded 2499 articles. 118 met inclusion criteria and reported children/adults with 133 health conditions/clinical descriptions other than CP. Three broad GMFCS applications were observed: as a categorization tool, independent variable, or outcome measure. While the GMFCS is widely used for children with health conditions/clinical description other than CP, researchers rarely provided adequate justification for these uses. We offer recommendations for development/validation of other condition-specific classification systems and discuss the potential need for a generic gross motor function classification system. The Gross Motor Function Classification System should not be used outside cerebral palsy or as an outcome measure. The authors provide recommendations for development and validation of condition-specific or generic classification systems. © 2017 Mac Keith Press.

  14. Snf2 family gene distribution in higher plant genomes reveals DRD1 expansion and diversification in the tomato genome.

    PubMed

    Bargsten, Joachim W; Folta, Adam; Mlynárová, Ludmila; Nap, Jan-Peter

    2013-01-01

    As part of large protein complexes, Snf2 family ATPases are responsible for energy supply during chromatin remodeling, but the precise mechanism of action of many of these proteins is largely unknown. They influence many processes in plants, such as the response to environmental stress. This analysis is the first comprehensive study of Snf2 family ATPases in plants. We here present a comparative analysis of 1159 candidate plant Snf2 genes in 33 complete and annotated plant genomes, including two green algae. The number of Snf2 ATPases shows considerable variation across plant genomes (17-63 genes). The DRD1, Rad5/16 and Snf2 subfamily members occur most often. Detailed analysis of the plant-specific DRD1 subfamily in related plant genomes shows the occurrence of a complex series of evolutionary events. Notably tomato carries unexpected gene expansions of DRD1 gene members. Most of these genes are expressed in tomato, although at low levels and with distinct tissue or organ specificity. In contrast, the Snf2 subfamily genes tend to be expressed constitutively in tomato. The results underpin and extend the Snf2 subfamily classification, which could help to determine the various functional roles of Snf2 ATPases and to target environmental stress tolerance and yield in future breeding.

  15. Inhabitancy of active Nitrosopumilus-like ammonia-oxidizing archaea and Nitrospira nitrite-oxidizing bacteria in the sponge Theonella swinhoei

    PubMed Central

    Feng, Guofang; Sun, Wei; Zhang, Fengli; Karthik, Loganathan; Li, Zhiyong

    2016-01-01

    Nitrification directly contributes to ammonia removal in sponges, and it plays an indispensable role in the sponge-mediated nitrogen cycle. Previous studies have demonstrated genomic evidence of nitrifying lineages in the sponge Theonella swinhoei. However, little is known about the transcriptional activity of the nitrifying community in this sponge. In this study, combined DNA- and transcript-based analyses were performed to reveal the composition and transcriptional activity of the nitrifiers in T. swinhoei from the South China Sea. Transcriptional activity of ammonia-oxidizing archaea (AOA) and nitrite-oxidizing bacteria (NOB) in this sponge was confirmed by targeting their nitrifying genes, 16S rRNA genes and their transcripts. Phylogenetic analysis coupled with RDP rRNA classification indicated that archaeal 16S rRNA genes, amoA (the subunit of ammonia monooxygenase) genes and their transcripts were closely related to Nitrosopumilus-like AOA; whereas nitrifying bacterial 16S rRNA genes, nxrB (the subunit of nitrite oxidoreductase) genes and their transcripts were closely related to Nitrospira NOB. Quantitative assessment demonstrated relatively higher abundances of nitrifying genes and transcripts of Nitrosopumilus-like AOA than those of Nitrospira NOB in this sponge. This study illustrated the transcriptional potentials of Nitrosopumilus-like archaea and Nitrospira bacteria that would predominantly contribute to the nitrification functionality in the South China Sea T. swinhoei. PMID:27113140

  16. De Novo Assembly and Comparative Transcriptome Analyses of Red and Green Morphs of Sweet Basil Grown in Full Sunlight.

    PubMed

    Torre, Sara; Tattini, Massimiliano; Brunetti, Cecilia; Guidi, Lucia; Gori, Antonella; Marzano, Cristina; Landi, Marco; Sebastiani, Federico

    2016-01-01

    Sweet basil (Ocimum basilicum), one of the most popular cultivated herbs worldwide, displays a number of varieties differing in several characteristics, such as the color of the leaves. The development of a reference transcriptome for sweet basil, and the analysis of differentially expressed genes in acyanic and cyanic cultivars exposed to natural sunlight irradiance, is of interest from horticultural and biological points of view. There is still great uncertainty about the significance of anthocyanins in photoprotection, and how green and red morphs may perform when exposed to photo-inhibitory light, a condition plants face on a daily and seasonal basis. We sequenced the leaf transcriptome of the green-leaved Tigullio (TIG) and the purple-leaved Red Rubin (RR) exposed to full sunlight over a four-week experimental period. We assembled and annotated 111,007 transcripts. A total of 5,468 and 5,969 potential SSRs were identified in TIG and RR, respectively, out of which 66 were polymorphic in silico. Comparative analysis of the two transcriptomes showed 2,372 differentially expressed genes (DEGs) clustered in 222 enriched Gene Ontology terms. Green and red basil mostly differed in transcript abundance of genes involved in secondary metabolism. While the biosynthesis of waxes was up-regulated in red basil, the biosynthesis of flavonols and carotenoids was up-regulated in green basil. Data from our study provide a comprehensive transcriptome survey, gene sequence resources and microsatellites that can be used for further investigations in sweet basil. The analysis of DEGs and their functional classification also offers new insights on the functional role of anthocyanins in photoprotection.

  17. De novo RNA sequencing transcriptome of Rhododendron obtusum identified the early heat response genes involved in the transcriptional regulation of photosynthesis

    PubMed Central

    Tong, Jun; Dong, Yanfang; Xu, Dongyun; Mao, Jing; Zhou, Yuan

    2017-01-01

    Rhododendron spp. are important ornamental species that are widely cultivated for landscaping worldwide. Heat stress is a major obstacle for their cultivation in south China. Previous studies on rhododendron principally focused on its physiological and biochemical processes, which are involved in tolerance to a range of stresses. However, molecular or genetic properties of rhododendron’s response to heat stress are still poorly understood. The phenotype and chlorophyll fluorescence kinetics parameters of four rhododendron cultivars were compared under normal or heat stress conditions, and the cultivar with the highest heat tolerance, “Yanzhimi” (R. obtusum), was selected for transcriptome sequencing. A total of 325,429,240 high quality reads were obtained and assembled into 395,561 transcripts and 92,463 unigenes. Functional annotation showed that 38,724 unigenes had sequence similarity to known genes in at least one of the protein or nucleotide databases used in this study. These 38,724 unigenes were categorized into 51 functional groups based on Gene Ontology classification and were blasted to 24 known clusters of orthologous groups. A total of 973 identified unigenes belonged to 57 transcription factor families, including the stress-related HSF, DREB, ZNF, and NAC genes. Photosynthesis was significantly enriched in the Kyoto Encyclopedia of Genes and Genomes pathway, and the changed expression pattern was illustrated. The key pathways and signaling components that contribute to heat tolerance in rhododendron were revealed. These results provide a potentially valuable resource that can be used for heat-tolerance breeding. PMID:29059200

  18. Transcriptome Analysis and Discovery of Genes Involved in Immune Pathways from Hepatopancreas of Microbial Challenged Mitten Crab Eriocheir sinensis

    PubMed Central

    Li, Xihong; Cui, Zhaoxia; Liu, Yuan; Song, Chengwen; Shi, Guohui

    2013-01-01

    Background: The Chinese mitten crab Eriocheir sinensis is an economically important crustacean that has been seriously affected by various diseases, which calls for more information on immune-relevant genes at the genome level. Recently, high-throughput RNA sequencing (RNA-seq) technology has provided a powerful and efficient method for transcript analysis and immune gene discovery. Methods/Principal Findings: A cDNA library from hepatopancreas of E. sinensis challenged by a mixture of three pathogen strains (Gram-positive bacteria Micrococcus luteus, Gram-negative bacteria Vibrio alginolyticus and fungi Pichia pastoris; 10^8 cfu·mL^-1) was constructed and randomly sequenced using the Illumina technique. In total, 39.76 million clean reads were assembled into 70,300 unigenes. After ruling out short-length and low-quality sequences, 52,074 non-redundant unigenes were compared to public databases for homology searching and 17,617 of them showed high similarity to sequences in the NCBI non-redundant protein (Nr) database. For function classification and pathway assignment, 18,734 (36.00%) unigenes were categorized to three Gene Ontology (GO) categories, 12,243 (23.51%) were classified to 25 Clusters of Orthologous Groups (COG), and 8,983 (17.25%) were assigned to six Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways. Potentially, 24, 14, 47 and 132 unigenes were characterized to be involved in Toll, IMD, JAK-STAT and MAPK pathways, respectively. Conclusions/Significance: This is the first systematic transcriptome analysis of components relating to innate immune pathways in E. sinensis. Functional genes and putative pathways identified here will contribute to a better understanding of the immune system and to the prevention of various diseases in crab. PMID:23874555

  19. The genetics of mental illness: implications for practice.

    PubMed Central

    Hyman, S. E.

    2000-01-01

    Many of the comfortable and relatively simple models of the nature of mental disorders, their causes and their neural substrates now appear quite frayed. Gone is the idea that symptom clusters, course of illness, family history and treatment response would coalesce in a simple way to yield valid diagnoses. Also too simple was the concept, born of early pharmacological successes, that abnormal levels of one or more neurotransmitters would satisfactorily explain the pathogenesis of depression or schizophrenia. Gone is the notion that there is a single gene that causes any mental disorder or determines any behavioural variant. The concept of the causative gene has been replaced by that of genetic complexity, in which multiple genes act in concert with non-genetic factors to produce a risk of mental disorder. Discoveries in genetics and neuroscience can be expected to lead to better models that provide improved representation of the complexity of the brain and behaviour and the development of both. There are likely to be profound implications for clinical practice. The complex genetics of risk should reinvigorate research on the epidemiology and classification of mental disorders and explain the complex patterns of disease transmission within families. Knowledge of the timing of the expression of risk genes during brain development and of their function should not only contribute to an understanding of gene action and the pathophysiology of disease but should also help to direct the search for modifiable environmental risk factors that convert risk into illness. The function of risk genes can only become comprehensible in the context of advances at the molecular, cellular and systems levels in neuroscience and the behavioural sciences. Genetics should yield new therapies aimed not just at symptoms but also at pathogenic processes, thus permitting the targeting of specific therapies to individual patients. PMID:10885164

  20. Co-clustering phenome–genome for phenotype classification and disease gene discovery

    PubMed Central

    Hwang, TaeHyun; Atluri, Gowtham; Xie, MaoQiang; Dey, Sanjoy; Hong, Changjin; Kumar, Vipin; Kuang, Rui

    2012-01-01

    Understanding the categorization of human diseases is critical for reliably identifying disease causal genes. Recently, genome-wide studies of abnormal chromosomal locations related to diseases have mapped >2000 phenotype–gene relations, which provide valuable information for classifying diseases and identifying candidate genes as drug targets. In this article, a regularized non-negative matrix tri-factorization (R-NMTF) algorithm is introduced to co-cluster phenotypes and genes, and simultaneously detect associations between the detected phenotype clusters and gene clusters. The R-NMTF algorithm factorizes the phenotype–gene association matrix under the prior knowledge from phenotype similarity network and protein–protein interaction network, supervised by the label information from known disease classes and biological pathways. In the experiments on disease phenotype–gene associations in OMIM and KEGG disease pathways, R-NMTF significantly improved the classification of disease phenotypes and disease pathway genes compared with support vector machines and Label Propagation in cross-validation on the annotated phenotypes and genes. The newly predicted phenotypes in each disease class are highly consistent with human phenotype ontology annotations. The roles of the new member genes in the disease pathways are examined and validated in the protein–protein interaction subnetworks. Extensive literature review also confirmed many new members of the disease classes and pathways as well as the predicted associations between disease phenotype classes and pathways. PMID:22735708
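    The co-clustering idea above can be summarized by a generic non-negative matrix tri-factorization objective. The abstract gives no formula, so everything below is an illustrative assumption rather than the R-NMTF objective itself: X is the phenotype-gene association matrix, F and G are non-negative cluster-membership matrices for phenotypes and genes, S links phenotype clusters to gene clusters, and the two trace terms are a common way of encoding network priors through the graph Laplacians L_P (phenotype similarity network) and L_G (protein-protein interaction network). The label-supervision terms mentioned in the abstract are omitted.

```latex
\min_{F \ge 0,\; S \ge 0,\; G \ge 0} \;
    \bigl\| X - F S G^{\top} \bigr\|_F^{2}
    \;+\; \alpha \, \operatorname{tr}\!\bigl( F^{\top} L_P F \bigr)
    \;+\; \beta \, \operatorname{tr}\!\bigl( G^{\top} L_G G \bigr)
```

    In this form, a large entry S_{ab} indicates an association between phenotype cluster a and gene cluster b, while the Laplacian terms encourage phenotypes (or genes) that are neighbors in their network to receive similar cluster memberships. The weights \alpha and \beta are hypothetical tuning parameters.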

  1. European validation of The Comprehensive International Classification of Functioning, Disability and Health Core Set for Osteoarthritis from the perspective of patients with osteoarthritis of the knee or hip.

    PubMed

    Weigl, Martin; Wild, Heike

    2017-09-15

    To validate the International Classification of Functioning, Disability and Health Comprehensive Core Set for Osteoarthritis from the patient perspective in Europe. This multicenter cross-sectional study involved 375 patients with knee or hip osteoarthritis. Trained health professionals completed the Comprehensive Core Set, and patients completed the Short-Form 36 questionnaire. Content validity was evaluated by calculating prevalences of impairments in body function and structures, limitations in activities and participation and environmental factors, which were either barriers or facilitators. Convergent construct validity was evaluated by correlating the International Classification of Functioning, Disability and Health categories with the Short-Form 36 Physical Component Score and the SF-36 Mental Component Score in a subgroup of 259 patients. The prevalences of all body function, body structure and activities and participation categories were >40%, >32% and >20%, respectively, and all environmental factors were relevant for >16% of patients. Few categories showed relevant differences between knee and hip osteoarthritis. All body function categories and all but two activities and participation categories showed significant correlations with the Physical Component Score. Body functions from the ICF chapter Mental Functions showed higher correlations with the Mental Component Score than with the Physical Component Score. This study supports the validity of the International Classification of Functioning, Disability and Health Comprehensive Core Set for Osteoarthritis. Implications for Rehabilitation Comprehensive International Classification of Functioning, Disability and Health Core Sets were developed as practical tools for application in multidisciplinary assessments. 
The validity of the Comprehensive International Classification of Functioning, Disability and Health Core Set for Osteoarthritis in this study supports its application in European patients with osteoarthritis. The differences in results between this European validation study and a previous Singaporean validation study underscore the need to validate the International Classification of Functioning, Disability and Health Core Sets in different regions of the world.

  2. Comparisons of severity classification systems for oropharyngeal dysfunction in children with cerebral palsy: Relations with other functional profiles.

    PubMed

    Goh, Yu-Ra; Choi, Ja Young; Kim, Seon Ah; Park, Jieun; Park, Eun Sook

    2018-01-01

    This study aimed to investigate the relationships between various classification systems assessing the severity of oropharyngeal dysphagia and communication function and other functional profiles in children with cerebral palsy (CP). This is a prospective, cross-sectional study in a university-affiliated, tertiary-care hospital. We recruited 151 children with CP (mean age 6.11 years, SD 3.42, range 3-18yr). The Eating and Drinking Ability Classification System (EDACS) and the dysphagia scales of Functional Oral Intake Scale (FOIS), Swallow Function Scales (SFS), and Food Intake Level Scale (FILS) were used. The Communication Function Classification System (CFCS) and Viking Speech Scale (VSS) were employed to classify communication function and speech intelligibility, respectively. The Pediatric Evaluation of Disability Inventory (PEDI) with the Gross Motor Function Classification System (GMFCS) and the Manual Ability Classification System (MACS) level were also assessed. Spearman correlation analysis was used to investigate the associations between measures, and univariate and multivariate logistic regression models were used to identify significant factors. Median GMFCS level of participants was III (interquartile range II-IV). Significant dysphagia based on EDACS level III-V was noted in 23 children (15.2%). There were strong to very strong relationships between the EDACS level and the dysphagia scales. The EDACS presented strong associations with MACS, CFCS, and VSS, a moderate association with GMFCS level, and a moderate to strong association with each domain of the PEDI. In multivariate analysis, poor functioning in EDACS was associated with poor functioning in gross motor and communication functions. Copyright © 2017. Published by Elsevier Ltd.
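    The Spearman correlation analysis mentioned above suits ordinal scales such as EDACS and GMFCS, because it depends only on the rank order of the levels. A minimal sketch follows, assuming levels I-V are coded as integers 1-5 and that tied levels receive their average rank; the function names and the eight pairs of levels are hypothetical illustration data, not values from the study.

```python
from statistics import mean


def average_ranks(values):
    """Assign 1-based ranks, giving tied values the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical classification levels (I-V coded 1-5) for eight children.
edacs_levels = [1, 1, 2, 3, 3, 4, 5, 5]
gmfcs_levels = [1, 2, 2, 3, 4, 3, 5, 4]

print(f"Spearman r_s = {spearman(edacs_levels, gmfcs_levels):.2f}")
```

    Computing the coefficient as a Pearson correlation of average ranks, rather than with the shortcut formula based on squared rank differences, keeps the result valid when many children share the same level, which is the usual situation with five-level scales.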

  3. Comparative analysis of sugarcane bagasse metagenome reveals unique and conserved biomass-degrading enzymes among lignocellulolytic microbial communities.

    PubMed

    Mhuantong, Wuttichai; Charoensawan, Varodom; Kanokratana, Pattanop; Tangphatsornruang, Sithichoke; Champreda, Verawat

    2015-01-01

    As one of the most abundant agricultural wastes, sugarcane bagasse is largely under-exploited, but it possesses a great potential for the biofuel, fermentation, and cellulosic biorefinery industries. It also provides a unique ecological niche, as the microbes in this lignocellulose-rich environment thrive in relatively high temperatures (50°C) with varying microenvironments of aerobic surface to anoxic interior. The microbial community in bagasse thus presents a good resource for the discovery and characterization of new biomass-degrading enzymes; however, it remains largely unexplored. We have constructed a fosmid library of sugarcane bagasse and obtained the largest bagasse metagenome to date. A taxonomic classification of the bagasse metagenome reveals the predominance of Proteobacteria, which are also found in high abundance in other aerobic environments. Based on the functional characterization of biomass-degrading enzymes, we have demonstrated that the bagasse microbial community benefits from a large repertoire of lignocellulolytic enzymes, which allows them to digest different components of lignocelluloses into single molecule sugars. Comparative genomic analyses with other lignocellulolytic and non-lignocellulolytic metagenomes show that microbial communities are taxonomically separable by their aerobic "open" or anoxic "closed" environments. Importantly, a functional analysis of lignocellulose-active genes (based on the CAZy classifications) reveals core enzymes highly conserved within the lignocellulolytic group, regardless of their taxonomic compositions. Cellulases, in particular, are markedly more pronounced compared to the non-lignocellulolytic group. In addition to the core enzymes, the bagasse fosmid library also contains some uniquely enriched glycoside hydrolases, as well as a large repertoire of the newly defined auxiliary activity proteins. Our study demonstrates a conservation and diversification of carbohydrate-active genes among diverse microbial species in different biomass-degrading niches, and signifies the importance of taking a global approach to functionally investigate a microbial community as a whole, as compared to focusing on individual organisms.

  4. Characterization of new DsbB-like thiol-oxidoreductases of Campylobacter jejuni and Helicobacter pylori and classification of the DsbB family based on phylogenomic, structural and functional criteria.

    PubMed

    Raczko, Anna M; Bujnicki, Janusz M; Pawlowski, Marcin; Godlewska, Renata; Lewandowska, Magdalena; Jagusztyn-Krynicka, Elzbieta K

    2005-01-01

    In Gram-negative bacterial cells, disulfide bond formation occurs in the oxidative environment of the periplasm and is catalysed by Dsb (disulfide bond) proteins found in the periplasm and in the inner membrane. In this report the identification of a new subfamily of disulfide oxidoreductases encoded by a gene denoted dsbI, and functional characterization of DsbI proteins from Campylobacter jejuni and Helicobacter pylori, as well as DsbB from C. jejuni, are described. The N-terminal domain of DsbI is related to DsbB proteins and comprises five predicted transmembrane segments, while the C-terminal domain is predicted to locate to the periplasm and to fold into a beta-propeller structure. The dsbI gene is co-transcribed with a small ORF designated dba (dsbI-accessory). Based on a series of deletion and complementation experiments it is proposed that DsbB can complement the lack of DsbI but not the converse. In the presence of DsbB, the activity of DsbI was undetectable, hence it probably acts only on a subset of possible substrates of DsbB. To reconstruct the principal events in the evolution of DsbB and DsbI proteins, sequences of all their homologues identifiable in databases were analysed. In the course of this study, previously undetected variations on the common thiol-oxidoreductase theme were identified, such as development of an additional transmembrane helix and loss or migration of the second pair of Cys residues between two distinct periplasmic loops. In conjunction with the experimental characterization of two members of the DsbI lineage, this analysis has resulted in the first comprehensive classification of the DsbB/DsbI family based on structural, functional and evolutionary criteria.

  5. New technologies accelerate the exploration of non-coding RNAs in horticultural plants

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Liu, Degao; Mewalal, Ritesh; Hu, Rongbin

    Non-coding RNAs (ncRNAs), that is, RNAs not translated into proteins, are crucial regulators of a variety of biological processes in plants. While protein-encoding genes have been relatively well-annotated in sequenced genomes, accounting for a small portion of the genome space in plants, the universe of plant ncRNAs is rapidly expanding. Recent advances in experimental and computational technologies have generated a great momentum for discovery and functional characterization of ncRNAs. Here we summarize the classification and known biological functions of plant ncRNAs, review the application of next-generation sequencing (NGS) technology and ribosome profiling technology to ncRNA discovery in horticultural plants and discuss the application of new technologies, especially the new genome-editing tool clustered regularly interspaced short palindromic repeat (CRISPR)/CRISPR-associated protein 9 (Cas9) systems, to functional characterization of plant ncRNAs.

  6. New technologies accelerate the exploration of non-coding RNAs in horticultural plants

    PubMed Central

    Liu, Degao; Mewalal, Ritesh; Hu, Rongbin; Tuskan, Gerald A; Yang, Xiaohan

    2017-01-01

    Non-coding RNAs (ncRNAs), that is, RNAs not translated into proteins, are crucial regulators of a variety of biological processes in plants. While protein-encoding genes have been relatively well-annotated in sequenced genomes, accounting for a small portion of the genome space in plants, the universe of plant ncRNAs is rapidly expanding. Recent advances in experimental and computational technologies have generated a great momentum for discovery and functional characterization of ncRNAs. Here we summarize the classification and known biological functions of plant ncRNAs, review the application of next-generation sequencing (NGS) technology and ribosome profiling technology to ncRNA discovery in horticultural plants and discuss the application of new technologies, especially the new genome-editing tool clustered regularly interspaced short palindromic repeat (CRISPR)/CRISPR-associated protein 9 (Cas9) systems, to functional characterization of plant ncRNAs. PMID:28698797

  7. A Genome-Wide Identification and Analysis of the Basic Helix-Loop-Helix Transcription Factors in Brown Planthopper, Nilaparvata lugens

    PubMed Central

    Wan, Pin-Jun; Yuan, San-Yue; Wang, Wei-Xia; Chen, Xu; Lai, Feng-Xiang; Fu, Qiang

    2016-01-01

    The basic helix-loop-helix (bHLH) transcription factors in insects play essential roles in multiple developmental processes including neurogenesis, sterol metabolism, circadian rhythms, organogenesis and formation of olfactory sensory neurons. The identification and function analysis of bHLH family members of the most destructive insect pest of rice, Nilaparvata lugens, may provide novel tools for pest management. Here, a genome-wide survey for bHLH sequences identified 60 bHLH sequences (NlbHLHs) encoded in the draft genome of N. lugens. Phylogenetic analysis of the bHLH domains successfully classified these genes into 40 bHLH families in group A (25), B (14), C (10), D (1), E (8) and F (2). The number of NlbHLHs with introns is higher than in many other insect species, and the average intron length is shorter than that of Acyrthosiphon pisum. A high number of ortholog families of NlbHLHs was found, suggesting functional conservation for these proteins. Compared to other insect species studied, N. lugens has the highest number of bHLH members. Furthermore, gene duplication events of SREBP, Kn(col), Tap, Delilah, Sim, Ato and Crp were found in N. lugens. In addition, a putative full set of NlbHLH genes is defined and compared with another insect species. Thus, our classification of these NlbHLH members provides a platform for further investigations of bHLH protein functions in the regulation of N. lugens, and of insects in general. PMID:27869716

  8. Portraying the Expression Landscapes of B-Cell Lymphoma-Intuitive Detection of Outlier Samples and of Molecular Subtypes

    PubMed Central

    Hopp, Lydia; Lembcke, Kathrin; Binder, Hans; Wirth, Henry

    2013-01-01

    We present an analytic framework based on Self-Organizing Map (SOM) machine learning to study large scale patient data sets. The potency of the approach is demonstrated in a case study using gene expression data of more than 200 mature aggressive B-cell lymphoma patients. The method portrays each sample with individual resolution, characterizes the subtypes, disentangles the expression patterns into distinct modules, extracts their functional context using enrichment techniques and enables investigation of the similarity relations between the samples. The method also allows outliers caused by contamination to be detected and corrected. Based on our analysis, we propose a refined classification of B-cell lymphoma into four molecular subtypes which are characterized by differential functional and clinical characteristics. PMID:24833231

  9. Identification, Classification, and Expression Analysis of GRAS Gene Family in Malus domestica

    PubMed Central

    Fan, Sheng; Zhang, Dong; Gao, Cai; Zhao, Ming; Wu, Haiqin; Li, Youmei; Shen, Yawen; Han, Mingyu

    2017-01-01

    GRAS genes encode plant-specific transcription factors that play important roles in plant growth and development. However, little is known about the GRAS gene family in apple. In this study, 127 GRAS genes were identified in the apple (Malus domestica Borkh.) genome and named MdGRAS1 to MdGRAS127 according to their chromosomal locations. The chemical characteristics, gene structures and evolutionary relationships of the MdGRAS genes were investigated. The 127 MdGRAS genes could be grouped into eight subfamilies based on their structural features and phylogenetic relationships. Further analysis of gene structures, segmental and tandem duplication, gene phylogeny and tissue-specific expression with ArrayExpress database indicated their diversification in quantity, structure and function. We further examined the expression pattern of MdGRAS genes during apple flower induction with transcriptome sequencing. Eight higher MdGRAS (MdGRAS6, 26, 28, 44, 53, 64, 107, and 122) genes were surfaced. Further quantitative reverse transcription PCR indicated that the candidate eight genes showed distinct expression patterns among different tissues (leaves, stems, flowers, buds, and fruits). The transcription levels of eight genes were also investigated with various flowering related treatments (GA3, 6-BA, and sucrose) and different flowering varieties (Yanfu No. 6 and Nagafu No. 2). They all were affected by flowering-related circumstance and showed different expression level. Changes in response to these hormone or sugar related treatments indicated their potential involvement during apple flower induction. Taken together, our results provide rich resources for studying GRAS genes and their potential clues in genetic improvement of apple flowering, which enriches biological theories of GRAS genes in apple and their involvement in flower induction of fruit trees. PMID:28503152

  10. Identification, Classification, and Expression Analysis of GRAS Gene Family in Malus domestica.

    PubMed

    Fan, Sheng; Zhang, Dong; Gao, Cai; Zhao, Ming; Wu, Haiqin; Li, Youmei; Shen, Yawen; Han, Mingyu

    2017-01-01

    GRAS genes encode plant-specific transcription factors that play important roles in plant growth and development. However, little is known about the GRAS gene family in apple. In this study, 127 GRAS genes were identified in the apple (Malus domestica Borkh.) genome and named MdGRAS1 to MdGRAS127 according to their chromosomal locations. The chemical characteristics, gene structures and evolutionary relationships of the MdGRAS genes were investigated. The 127 MdGRAS genes could be grouped into eight subfamilies based on their structural features and phylogenetic relationships. Further analysis of gene structures, segmental and tandem duplication, gene phylogeny and tissue-specific expression with ArrayExpress database indicated their diversification in quantity, structure and function. We further examined the expression pattern of MdGRAS genes during apple flower induction with transcriptome sequencing. Eight higher MdGRAS (MdGRAS6, 26, 28, 44, 53, 64, 107, and 122) genes were surfaced. Further quantitative reverse transcription PCR indicated that the candidate eight genes showed distinct expression patterns among different tissues (leaves, stems, flowers, buds, and fruits). The transcription levels of eight genes were also investigated with various flowering related treatments (GA3, 6-BA, and sucrose) and different flowering varieties (Yanfu No. 6 and Nagafu No. 2). They all were affected by flowering-related circumstance and showed different expression level. Changes in response to these hormone or sugar related treatments indicated their potential involvement during apple flower induction. Taken together, our results provide rich resources for studying GRAS genes and their potential clues in genetic improvement of apple flowering, which enriches biological theories of GRAS genes in apple and their involvement in flower induction of fruit trees.

  11. Differential gene expression detection and sample classification using penalized linear regression models.

    PubMed

    Wu, Baolin

    2006-02-15

    Differential gene expression detection and sample classification using microarray data have received much research interest recently. Owing to the large number of genes p and small number of samples n (p > n), microarray data analysis poses big challenges for statistical analysis. An obvious problem owing to the 'large p small n' is over-fitting. Just by chance, we are likely to find some non-differentially expressed genes that can classify the samples very well. The idea of shrinkage is to regularize the model parameters to reduce the effects of noise and produce reliable inferences. Shrinkage has been successfully applied in the microarray data analysis. The SAM statistics proposed by Tusher et al. and the 'nearest shrunken centroid' proposed by Tibshirani et al. are ad hoc shrinkage methods. Both methods are simple, intuitive and prove to be useful in empirical studies. Recently Wu proposed the penalized t/F-statistics with shrinkage by formally using the L1 penalized linear regression models for two-class microarray data, showing good performance. In this paper we systematically discussed the use of penalized regression models for analyzing microarray data. We generalize the two-class penalized t/F-statistics proposed by Wu to multi-class microarray data. We formally derive the ad hoc shrunken centroid used by Tibshirani et al. using the L1 penalized regression models. And we show that the penalized linear regression models provide a rigorous and unified statistical framework for sample classification and differential gene expression detection.
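
    The shrinkage idea in this abstract can be illustrated with soft-thresholding, which is the closed-form solution of an L1-penalized least-squares fit when the design is orthonormal. The sketch below is only an illustration under that assumption, not the paper's penalized t/F-statistics; the statistic values and the penalty of 1.0 are hypothetical.

```python
def soft_threshold(value: float, penalty: float) -> float:
    """Shrink a statistic toward zero; anything within +/- penalty becomes exactly 0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def shrink_statistics(stats, penalty):
    """Apply soft-thresholding to every per-gene statistic."""
    return [soft_threshold(s, penalty) for s in stats]


# Hypothetical per-gene statistics (illustrative values, not from the paper).
gene_stats = [4.2, -0.8, 1.1, -3.5, 0.3]
shrunken = shrink_statistics(gene_stats, penalty=1.0)

# Genes whose shrunken statistic is non-zero survive the penalty.
selected = [index for index, s in enumerate(shrunken) if s != 0.0]
print(selected)
```

    Genes with weak signal are set exactly to zero, which is how an L1 penalty performs selection and noise reduction at the same time.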

  12. First Generation Gene Expression Signature for Early Prediction of Late Occurring Hematological Acute Radiation Syndrome in Baboons.

    PubMed

    Port, M; Herodin, F; Valente, M; Drouet, M; Lamkowski, A; Majewski, M; Abend, M

    2016-07-01

    We implemented a two-stage study to predict late occurring hematologic acute radiation syndrome (HARS) in a baboon model based on gene expression changes measured in peripheral blood within the first two days after irradiation. Eighteen baboons were irradiated to simulate different patterns of partial-body and total-body exposure, which corresponded to an equivalent dose of 2.5 or 5 Gy. According to changes in blood cell counts the surviving baboons (n = 17) exhibited mild (H1-2, n = 4) or more severe (H2-3, n = 13) HARS. Blood samples taken before irradiation served as unexposed control (H0, n = 17). For stage I of this study, a whole genome screen (mRNA microarrays) was performed using a portion of the samples (H0, n = 5; H1-2, n = 4; H2-3, n = 5). For stage II, using the remaining samples and the more sensitive methodology, qRT-PCR, validation was performed on candidate genes that were differentially up- or down-regulated during the first two days after irradiation. Differential gene expression was defined as significant (P < 0.05) and greater than or equal to a twofold difference above a H0 classification. From approximately 20,000 genes, on average 46% appeared to be expressed. On day 1 postirradiation for H2-3, approximately 2-3 times more genes appeared up-regulated (1,418 vs. 550) or down-regulated (1,603 vs. 735) compared to H1-2. This pattern became more pronounced at day 2 while the number of differentially expressed genes decreased. The specific genes showed an enrichment of biological processes coding for immune system processes, natural killer cell activation and immune response (P = 1 × E-06 up to 9 × E-14). Based on the P values, magnitude and sustained differential gene expression over time, we selected 89 candidate genes for validation using qRT-PCR. Ultimately, 22 genes were confirmed for identification of H1-3 classifications and seven genes for identification of H2-3 classifications using qRT-PCR. 
For H1-3 classifications, most genes were constantly three to fivefold down-regulated relative to H0 over both days, but some genes appeared 10.3-fold (VSIG4) or even 30.7-fold up-regulated (CD177) over H0. For H2-3, some genes appeared four to sevenfold up-regulated relative to H0 (RNASE3, DAGLA, ARG2), but other genes showed a strong 14- to 33-fold down-regulation relative to H0 (WNT3, POU2AF1, CCR7). All of these genes allowed an almost completely identifiable separation among each of the HARS categories. In summary, clinically relevant HARS can be independently predicted with all 29 irradiated genes examined in the peripheral blood of baboons within the first two days postirradiation. While further studies are needed to confirm these findings, this model shows potential relevance in the prediction of clinical outcomes in exposed humans and as an aid in the prioritizing of medical treatment.
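
    The differential-expression criterion stated in this abstract (P < 0.05 and at least a twofold difference relative to H0) can be sketched as a simple filter. This is a minimal illustration, not the authors' pipeline: the `GeneResult` type and `classify` function are hypothetical names, the symmetric treatment of down-regulation (ratio of 0.5 or less) is an assumption, and the P values in the usage lines are made up. Only the fold changes for CD177 and WNT3 echo figures quoted in the abstract.

```python
from dataclasses import dataclass


@dataclass
class GeneResult:
    name: str
    fold_change: float  # expression ratio relative to H0 (unexposed control)
    p_value: float


def classify(result: GeneResult, alpha: float = 0.05, min_fold: float = 2.0) -> str:
    """Label a gene 'up', 'down' or 'unchanged' using P < alpha and >= min_fold change."""
    if result.p_value >= alpha:
        return "unchanged"
    if result.fold_change >= min_fold:
        return "up"
    # Assumption: a twofold decrease is expressed as a ratio of 1 / min_fold or less.
    if result.fold_change <= 1.0 / min_fold:
        return "down"
    return "unchanged"


# Fold changes follow the abstract; the P values are hypothetical placeholders.
examples = [
    GeneResult("CD177", 30.7, 0.001),
    GeneResult("WNT3", 1.0 / 14.0, 0.001),
    GeneResult("HYPOTHETICAL_GENE", 1.4, 0.20),
]
labels = {gene.name: classify(gene) for gene in examples}
print(labels)
```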

  13. Splice-mediated Variants of Proteins (SpliVaP) - data and characterization of changes in signatures among protein isoforms due to alternative splicing.

    PubMed

    Floris, Matteo; Orsini, Massimiliano; Thanaraj, Thangavel Alphonse

    2008-10-02

    It is often the case that mammalian genes are alternatively spliced; the resulting alternate transcripts often encode protein isoforms that differ in amino acid sequences. Changes among the protein isoforms can alter the cellular properties of proteins. The effect can range from a subtle modulation to a complete loss of function. (i) We examined human splice-mediated protein isoforms (as extracted from a manually curated data set, and from a computationally predicted data set) for differences in the annotation for protein signatures (Pfam domains and PRINTS fingerprints) and we characterized the differences & their effects on protein functionalities. An important question addressed relates to the extent of protein isoforms that may lack any known function in the cell. (ii) We present a database that reports differences in protein signatures among human splice-mediated protein isoform sequences. (i) Characterization: The work points to distinct sets of alternatively spliced genes with varying degrees of annotation for the splice-mediated protein isoforms. Protein molecular functions seen to be often affected are those that relate to: binding, catalytic, transcription regulation, structural molecule, transporter, motor, and antioxidant; and the processes that are often affected are nucleic acid binding, signal transduction, and protein-protein interactions. Signatures are often included/excluded and truncated in length among protein isoforms; truncation is seen as the predominant type of change. Analysis points to the following novel aspects: (a) Analysis using data from the manually curated Vega indicates that one in 8.9 genes can lead to a protein isoform of no "known" function; and one in 18 expressed protein isoforms can be such an "orphan" isoform; the corresponding numbers as seen with computationally predicted ASD data set are: one in 4.9 genes and one in 9.8 isoforms. 
(b) When swapping of signatures occurs, it is often between those of same functional classifications. (c) Pfam domains can occur in varying lengths, and PRINTS fingerprints can occur with varying number of constituent motifs among isoforms - since such a variation is seen in large number of genes, it could be a general mechanism to modulate protein function. (ii) The reported resource (at http://www.bioinformatica.crs4.org/tools/dbs/splivap/) provides the community ability to access data on splice-mediated protein isoforms (with value-added annotation such as association with diseases) through changes in protein signatures.

  14. [GST genes expression as prognostic factor in papillary thyroid cancer].

    PubMed

    Gonçalves, Antonio Jose; Monte, Osmar; Morari, Eliane Cristina; Ward, Laura Sterian; Nakasako, Diana Shimoda; Nieto, Juliana; Nakai, Marianne Yumi

    2009-01-01

    To analyze the relationship between the AMES classification and molecular factors from the glutathione S-transferase (GST) system, specifically GSTT1 and GSTM1, in patients with well differentiated thyroid cancer. Samples of thyroid tissue of 66 patients with papillary thyroid carcinoma were obtained (53 women and 13 men). Patients were divided into two groups (high and low risk) according to the AMES classification. In each group, the presence of the null genotype of both GST enzymes was studied. These results were compared with the AMES classification. Samples were obtained in the operating room immediately after thyroidectomy, placed in cryotubes, immersed in liquid nitrogen and stored in a freezer at -80 °C. DNA was extracted by the phenol-chloroform method. There were 17 high risk patients and 49 low risk patients. The frequency of the null genotype was 5.8% in the high risk group and 6.1% in the low risk group. There was no relationship between absence of the genes GSTT1 and GSTM1 and prognosis of papillary thyroid carcinoma when compared to the AMES classification.

  15. Semi-Supervised Projective Non-Negative Matrix Factorization for Cancer Classification.

    PubMed

    Zhang, Xiang; Guan, Naiyang; Jia, Zhilong; Qiu, Xiaogang; Luo, Zhigang

    2015-01-01

    Advances in DNA microarray technologies have made gene expression profiles a significant candidate in identifying different types of cancers. Traditional learning-based cancer identification methods utilize labeled samples to train a classifier, but they are inconvenient for practical application because labels are quite expensive in the clinical cancer research community. This paper proposes a semi-supervised projective non-negative matrix factorization method (Semi-PNMF) to learn an effective classifier from both labeled and unlabeled samples, thus boosting subsequent cancer classification performance. In particular, Semi-PNMF jointly learns a non-negative subspace from concatenated labeled and unlabeled samples and indicates classes by the positions of the maximum entries of their coefficients. Because Semi-PNMF incorporates statistical information from the large volume of unlabeled samples in the learned subspace, it can learn more representative subspaces and boost classification performance. We developed a multiplicative update rule (MUR) to optimize Semi-PNMF and proved its convergence. The experimental results of cancer classification for two multiclass cancer gene expression profile datasets show that Semi-PNMF outperforms the representative methods.
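
    The class read-out step described in this abstract, where each sample's class is indicated by the position of the maximum entry of its coefficient vector, can be sketched as an argmax over rows. This shows only that final step and assumes the coefficients have already been learned; it is not the Semi-PNMF optimization or its multiplicative update rule. The function name and the coefficient values are hypothetical.

```python
def assign_classes(coefficients):
    """Return, for each sample, the index of its largest coefficient."""
    return [max(range(len(row)), key=row.__getitem__) for row in coefficients]


# Hypothetical non-negative coefficients: one row per sample, one column per class.
coefficients = [
    [0.70, 0.10, 0.05],
    [0.02, 0.15, 0.90],
    [0.20, 0.55, 0.30],
]
predicted = assign_classes(coefficients)
print(predicted)
```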

  16. Molecular systematics of the barklouse family Psocidae (Insecta: Psocodea: 'Psocoptera') and implications for morphological and behavioral evolution.

    PubMed

    Yoshizawa, Kazunori; Johnson, Kevin P

    2008-02-01

    We evaluated the higher level classification within the family Psocidae (Insecta: Psocodea: 'Psocoptera') based on combined analyses of nuclear 18S, Histone 3, wingless and mitochondrial 12S, 16S and COI gene sequences. Various analyses (inclusion/exclusion of incomplete taxa and/or rapidly evolving genes, data partitioning, and analytical method selection) all provided similar results, which were generally concordant with relationships inferred using morphological observations. Based on the phylogenetic trees estimated for Psocidae, we propose a revised higher level classification of this family, although uncertainty still exists regarding some aspects of this classification. This classification includes a basal division into two subfamilies, 'Amphigerontiinae' (possibly paraphyletic) and Psocinae. The Amphigerontiinae is divided into the tribes Kaindipsocini (new tribe), Blastini, Amphigerontini, and Stylatopsocini. Psocinae is divided into the tribes 'Ptyctini' (probably paraphyletic), Psocini, Atrichadenotecnini (new tribe), Sigmatoneurini, Metylophorini, and Thyrsophorini (the latter includes the taxon previously recognized as Cerastipsocini). We examined the evolution of symmetric/asymmetric male genitalia over this tree and found this character to be quite homoplasious.

  17. Combining anatomical, diffusion, and resting state functional magnetic resonance imaging for individual classification of mild and moderate Alzheimer's disease.

    PubMed

    Schouten, Tijn M; Koini, Marisa; de Vos, Frank; Seiler, Stephan; van der Grond, Jeroen; Lechner, Anita; Hafkemeijer, Anne; Möller, Christiane; Schmidt, Reinhold; de Rooij, Mark; Rombouts, Serge A R B

    2016-01-01

    Magnetic resonance imaging (MRI) is sensitive to structural and functional changes in the brain caused by Alzheimer's disease (AD), and can therefore be used to help in diagnosing the disease. Improving classification of AD patients based on MRI scans might help to identify AD earlier in the disease's progress, which may be key in developing treatments for AD. In this study we used an elastic net classifier based on several measures derived from the MRI scans of mild to moderate AD patients (N = 77) from the prospective registry on dementia study and controls (N = 173) from the Austrian Stroke Prevention Family Study. We based our classification on measures from anatomical MRI, diffusion weighted MRI and resting state functional MRI. Our unimodal classification performance ranged from an area under the curve (AUC) of 0.760 (full correlations between functional networks) to 0.909 (grey matter density). When combining measures from multiple modalities in a stepwise manner, the classification performance improved to an AUC of 0.952. This optimal combination consisted of grey matter density, white matter density, fractional anisotropy, mean diffusivity, and sparse partial correlations between functional networks. Classification performance for mild AD as well as moderate AD also improved when using this multimodal combination. We conclude that different MRI modalities provide complementary information for classifying AD. Moreover, combining multiple modalities can substantially improve classification performance over unimodal classification.
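
    The AUC values reported in this abstract have a direct probabilistic reading: the chance that a randomly chosen patient receives a higher classifier score than a randomly chosen control, with ties counted as one half. The sketch below computes that quantity by pairwise comparison. The scores are hypothetical and are not data from the study.

```python
def area_under_curve(positive_scores, negative_scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical classifier scores for patients and controls.
patient_scores = [0.9, 0.8, 0.4]
control_scores = [0.7, 0.3, 0.2, 0.1]
auc = area_under_curve(patient_scores, control_scores)
print(round(auc, 3))
```

    The pairwise form is quadratic in the number of samples, which is fine for a sketch; rank-based formulas give the same value more efficiently.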

  18. Hierarchical Ensemble Methods for Protein Function Prediction

    PubMed Central

    2014-01-01

    Protein function prediction is a complex multiclass multilabel classification problem, characterized by multiple issues such as the incompleteness of the available annotations, the integration of multiple sources of high dimensional biomolecular data, the unbalance of several functional classes, and the difficulty of univocally determining negative examples. Moreover, the hierarchical relationships between functional classes that characterize both the Gene Ontology and FunCat taxonomies motivate the development of hierarchy-aware prediction methods that showed significantly better performances than hierarchical-unaware “flat” prediction methods. In this paper, we provide a comprehensive review of hierarchical methods for protein function prediction based on ensembles of learning machines. According to this general approach, a separate learning machine is trained to learn a specific functional term and then the resulting predictions are assembled in a “consensus” ensemble decision, taking into account the hierarchical relationships between classes. The main hierarchical ensemble methods proposed in the literature are discussed in the context of existing computational methods for protein function prediction, highlighting their characteristics, advantages, and limitations. Open problems of this exciting research area of computational biology are finally considered, outlining novel perspectives for future research. PMID:25937954
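
    The consensus step described in this abstract, where per-term predictions are combined while respecting the class hierarchy, can be sketched with one simple consistency rule: a term's score may not exceed the score of any of its ancestors. This is an illustrative assumption, not necessarily any specific method covered by the review. The term names, scores and the `enforce_hierarchy` helper are hypothetical, and the hierarchy is assumed to be acyclic.

```python
def enforce_hierarchy(scores, parents):
    """Cap each term's score by the (already capped) scores of its parent terms."""
    corrected = {}

    def resolve(term):
        if term in corrected:
            return corrected[term]
        value = scores[term]
        for parent in parents.get(term, []):
            value = min(value, resolve(parent))
        corrected[term] = value
        return value

    for term in scores:
        resolve(term)
    return corrected


# Hypothetical per-term scores from independent "flat" classifiers.
flat_scores = {"root": 0.9, "binding": 0.4, "dna_binding": 0.7, "catalysis": 0.8}
# Hypothetical hierarchy: child term -> list of parent terms.
term_parents = {"binding": ["root"], "dna_binding": ["binding"], "catalysis": ["root"]}

consistent = enforce_hierarchy(flat_scores, term_parents)
print(consistent)
```

    Here the child term can no longer outscore its parent, so any threshold applied afterwards yields annotations that are consistent with the hierarchy.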

  19. Classification of BRCA1 missense variants of unknown clinical significance

    PubMed Central

    Phelan, C; Dapic, V; Tice, B; Favis, R; Kwan, E; Barany, F; Manoukian, S; Radice, P; van der Luijt, R B; van Nesselrooij, B P M; Chenevix-Trench, G; kConFab; Caldes, T; de La Hoya, M; Lindquist, S; Tavtigian, S; Goldgar, D; Borg, A; Narod, S; Monteiro, A

    2005-01-01

    Background: BRCA1 is a tumour suppressor with pleiotropic actions. Germline mutations in BRCA1 are responsible for a large proportion of breast–ovarian cancer families. Several missense variants have been identified throughout the gene but because of lack of information about their impact on the function of BRCA1, predictive testing is not always informative. Classification of missense variants into deleterious/high risk or neutral/low clinical significance is essential to identify individuals at risk. Objective: To investigate a panel of missense variants. Methods and results: The panel was investigated in a comprehensive framework that included (1) a functional assay based on transcription activation; (2) segregation analysis and a method of using incomplete pedigree data to calculate the odds of causality; (3) a method based on interspecific sequence variation. It was shown that the transcriptional activation assay could be used as a test to characterise mutations in the carboxy-terminus region of BRCA1 encompassing residues 1396–1863. Thirteen missense variants (H1402Y, L1407P, H1421Y, S1512I, M1628T, M1628V, T1685I, G1706A, T1720A, A1752P, G1788V, V1809F, and W1837R) were specifically investigated. Conclusions: While individual classification schemes for BRCA1 alleles still present limitations, a combination of several methods provides a more powerful way of identifying variants that are causally linked to a high risk of breast and ovarian cancer. The framework presented here brings these variants nearer to clinical applicability. PMID:15689452

  20. Analysis and functional classification of transcripts from the nematode Meloidogyne incognita

    PubMed Central

    McCarter, James P; Dautova Mitreva, Makedonka; Martin, John; Dante, Mike; Wylie, Todd; Rao, Uma; Pape, Deana; Bowers, Yvette; Theising, Brenda; Murphy, Claire V; Kloek, Andrew P; Chiapelli, Brandi J; Clifton, Sandra W; Bird, David Mck; Waterston, Robert H

    2003-01-01

    Background: Plant parasitic nematodes are major pathogens of most crops. Molecular characterization of these species as well as the development of new techniques for control can benefit from genomic approaches. As an entrée to characterizing plant parasitic nematode genomes, we analyzed 5,700 expressed sequence tags (ESTs) from second-stage larvae (L2) of the root-knot nematode Meloidogyne incognita. Results: From these, 1,625 EST clusters were formed and classified by function using the Gene Ontology (GO) hierarchy and the Kyoto KEGG database. L2 larvae, which represent the infective stage of the life cycle before plant invasion, express a diverse array of ligand-binding proteins and abundant cytoskeletal proteins. L2 are structurally similar to Caenorhabditis elegans dauer larva and the presence of transcripts encoding glyoxylate pathway enzymes in the M. incognita clusters suggests that root-knot nematode larvae metabolize lipid stores while in search of a host. Homology to other species was observed in 79% of translated cluster sequences, with the C. elegans genome providing more information than any other source. In addition to identifying putative nematode-specific and Tylenchida-specific genes, sequencing revealed previously uncharacterized horizontal gene transfer candidates in Meloidogyne with high identity to rhizobacterial genes including homologs of nodL acetyltransferase and novel cellulases. Conclusions: With sequencing from plant parasitic nematodes accelerating, the approaches to transcript characterization described here can be applied to more extensive datasets and also provide a foundation for more complex genome analyses. PMID:12702207
