Automated design evolution of stereochemically randomized protein foldamers
NASA Astrophysics Data System (ADS)
Ranbhor, Ranjit; Kumar, Anil; Patel, Kirti; Ramakrishnan, Vibin; Durani, Susheel
2018-05-01
Diversification of chain stereochemistry opens up the possibility of an ‘in principle’ increase in the design space of proteins. This huge increase in sequence and consequent structural variation is aimed at the generation of smart materials. To diversify protein structure stereochemically, we introduced L- and D-α-amino acids as the design alphabet. With a sequence design algorithm, we explored the usage of specific variables such as chirality and the sequence of this alphabet in independent steps. With molecular dynamics, we folded stereochemically diverse homopolypeptides and evaluated their ‘fitness’ for possible design as protein-like foldamers. We propose a fitness function to select the optimal fold among 1000 structures simulated with an automated repetitive simulated annealing molecular dynamics (AR-SAMD) approach. The high-scoring poly-leucine folds, with sequence lengths of 24 and 30 amino acids, were later sequence-optimized using a Dead End Elimination cum Monte Carlo based optimization tool. This paper demonstrates a novel approach for the de novo design of protein-like foldamers.
A Coarse-Grained Protein Model in a Water-like Solvent
NASA Astrophysics Data System (ADS)
Sharma, Sumit; Kumar, Sanat K.; Buldyrev, Sergey V.; Debenedetti, Pablo G.; Rossky, Peter J.; Stanley, H. Eugene
2013-05-01
Simulations employing an explicit atom description of proteins in solvent can be computationally expensive. On the other hand, coarse-grained protein models in implicit solvent miss essential features of the hydrophobic effect, especially its temperature dependence, and have limited ability to capture the kinetics of protein folding. We propose a free space two-letter protein (``H-P'') model in a simple, but qualitatively accurate description for water, the Jagla model, which coarse-grains water into an isotropically interacting sphere. Using Monte Carlo simulations, we design protein-like sequences that can undergo a collapse, exposing the ``Jagla-philic'' monomers to the solvent, while maintaining a ``hydrophobic'' core. This protein-like model manifests heat and cold denaturation in a manner that is reminiscent of proteins. While this protein-like model lacks the details that would introduce secondary structure formation, we believe that these ideas represent a first step in developing a useful, but computationally expedient, means of modeling proteins.
Wang, Luan; He, Hao; Wang, Shuangchao; Chen, Xiaoguang; Qiu, Dewen; Kondo, Hideki; Guo, Lihua
2018-05-01
Here we describe a novel (-)ssRNA mycovirus, Fusarium graminearum negative-stranded RNA virus 1 (FgNSRV-1), isolated from Fusarium graminearum strain HN1. The genome of FgNSRV-1 is 9072 nucleotides in length, with five discontinuous but linear ORFs (ORF I-V). Phylogenetic analysis based on entire L polymerase sequences indicated that FgNSRV-1 is related to the (-)ssRNA mycovirus Sclerotinia sclerotiorum negative-stranded RNA virus 1 (SsNSRV-1), and other mycoviruses. Our data suggest that FgNSRV-1 can be classified into the family Mymonaviridae, order Mononegavirales. Putative enveloped virion-like structures with filamentous morphology similar to SsNSRV-1 were observed in virion preparation samples. The L proteins of FgNSRV-1, and other fungal mononegaviruses, were found to be related to L protein-like sequences in some fungal genomes, supporting the hypothesis that coevolution is occurring between mycoviruses and fungi. In addition, clearing the virus from the infected host fungus resulted in no discernible phenotypic change. Copyright © 2018 Elsevier Inc. All rights reserved.
Ose, Naoko; Kawai, Teruka; Ishida, Daisuke; Kobori, Yuko; Takeuchi, Yukiyasu; Senba, Hidetoshi
2016-11-01
Pulmonary lymphoepithelioma-like carcinoma (PLELC) is a rare tumour that resembles lymphoepithelioma, a subtype of nasopharyngeal carcinoma, and is commonly associated with Epstein-Barr virus infection. It is classified in the group of "other and unclassified carcinoma" in the latest (2015) World Health Organization (WHO) classification. Some reports of lymphoepithelioma-like carcinoma (LELC) have noted an epidermal growth factor receptor (EGFR) mutation, whereas none have noted a mutation of the echinoderm microtubule-associated protein-like 4-anaplastic lymphoma kinase (EML4-ALK) fusion gene. This is the first reported case of PLELC with ALK rearrangement. A 76-year-old woman underwent a right lower lobectomy and complicated partial resection of the upper lobe with lymph node dissection using a completely thoracoscopic approach. A histopathological diagnosis of PLELC was made and the stage was classified as T1aN1(#12l) M0, pl0, G2, Ly1, V1. The results of both ALK immunohistochemistry and fluorescence in situ hybridization (FISH) for the EML4-ALK fusion gene were positive; however, EGFR mutational analysis showed wild-type EGFR.
Wang, Zichao; Gao, Mengchun; Xin, Yanjun; Ma, Dong; She, Zonglian; Wang, Zhe; Sun, Changqing; Ren, Yun
2014-01-01
The effect of C/N ratio on extracellular polymeric substances (EPS) of activated sludge was investigated in an anoxic-aerobic sequencing batch reactor (SBR) treating saline wastewater. The protein (PN) and protein/polysaccharide (PN/PS) ratio in the loosely bound EPS (LB-EPS) increased with the decrease of C/N ratio, whereas the PS in the LB-EPS decreased. The PS, PN and PN/PS ratio in the tightly bound EPS (TB-EPS) were independent of C/N ratio. Two fluorescence peaks in the LB-EPS and TB-EPS were identified at excitation/emission (Ex/Em) wavelengths of 275-280/335-340 nm and 220-225/330-340 nm by three-dimensional excitation-emission matrix (3D-EEM) fluorescence spectroscopy, respectively. These peaks in LB-EPS and TB-EPS were, respectively, associated with tryptophan protein-like substances and aromatic protein-like substances. The tryptophan protein-like fluorescence peaks in LB-EPS showed blue shift along the Ex axis and red shift along the Em axis with the decrease of C/N ratio. Fourier transform infrared spectra suggested that the variation of C/N ratio had more distinct effect on the functional groups of protein in the LB-EPS than those in the TB-EPS. The sludge volume index value decreased with the increase of LB-EPS, but there was no correlation between SVI and TB-EPS.
``Sequence space soup'' of proteins and copolymers
NASA Astrophysics Data System (ADS)
Chan, Hue Sun; Dill, Ken A.
1991-09-01
To study the protein folding problem, we use exhaustive computer enumeration to explore ``sequence space soup,'' an imaginary solution containing the ``native'' conformations (i.e., of lowest free energy) under folding conditions, of every possible copolymer sequence. The model is of short self-avoiding chains of hydrophobic (H) and polar (P) monomers configured on the two-dimensional square lattice. By exhaustive enumeration, we identify all native structures for every possible sequence. We find that random sequences of H/P copolymers will bear striking resemblance to known proteins: Most sequences under folding conditions will be approximately as compact as known proteins, will have considerable amounts of secondary structure, and it is most probable that an arbitrary sequence will fold to a number of lowest free energy conformations that is of order one. In these respects, this simple model shows that proteinlike behavior should arise simply in copolymers in which one monomer type is highly solvent averse. It suggests that the structures and uniquenesses of native proteins are not consequences of having 20 different monomer types, or of unique properties of amino acid monomers with regard to special packing or interactions, and thus that simple copolymers might be designable to collapse to proteinlike structures and properties. A good strategy for designing a sequence to have a minimum possible number of native states is to strategically insert many P monomers. Thus known proteins may be marginally stable due to a balance: More H residues stabilize the desired native state, but more P residues prevent simultaneous stabilization of undesired native states.
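The exhaustive enumeration described in this abstract can be illustrated on a very short chain. The following is a minimal sketch, not the authors' code: it assumes that each non-bonded H-H lattice contact lowers the energy by one unit, and the function names (`canonical_walks`, `hh_contacts`, `native_states`) are hypothetical. Conformations that differ only by a rotation or reflection of the lattice are counted once.

```python
MOVES = ((1, 0), (-1, 0), (0, 1), (0, -1))


def canonical_walks(n):
    """Yield self-avoiding walks of n >= 2 monomers on the square lattice,
    one representative per rotation/reflection class."""

    def is_canonical(path):
        # First step is fixed to +x; the first step off the x-axis must go up.
        for _, y in path:
            if y != 0:
                return y > 0
        return True  # the fully straight walk

    def extend(path, occupied):
        if len(path) == n:
            if is_canonical(path):
                yield tuple(path)
            return
        x, y = path[-1]
        for dx, dy in MOVES:
            site = (x + dx, y + dy)
            if site not in occupied:
                path.append(site)
                occupied.add(site)
                yield from extend(path, occupied)
                path.pop()
                occupied.remove(site)

    yield from extend([(0, 0), (1, 0)], {(0, 0), (1, 0)})


def hh_contacts(seq, path):
    """Count H-H pairs that are lattice neighbours but not chain neighbours."""
    index = {site: i for i, site in enumerate(path)}
    count = 0
    for i, (x, y) in enumerate(path):
        if seq[i] != "H":
            continue
        for dx, dy in ((1, 0), (0, 1)):  # visit each lattice bond once
            j = index.get((x + dx, y + dy))
            if j is not None and seq[j] == "H" and abs(i - j) > 1:
                count += 1
    return count


def native_states(seq):
    """Return (lowest energy, list of conformations reaching it).
    Assumed energy model: -1 per non-bonded H-H contact."""
    best, states = -1, []
    for path in canonical_walks(len(seq)):
        contacts = hh_contacts(seq, path)
        if contacts > best:
            best, states = contacts, [path]
        elif contacts == best:
            states.append(path)
    return -best, states


energy, states = native_states("HPPH")
print(energy, len(states))  # -1 1
```

For the four-monomer sequence HPPH only the U-shaped conformation brings the two H monomers into contact, so the native state is unique, whereas the all-P sequence has no preferred conformation. The number of walks grows exponentially with chain length, so this brute-force approach is only practical for short chains.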
Wei, Dong; Zhang, Keyi; Ngo, Huu Hao; Guo, Wenshan; Wang, Siyu; Li, Jibin; Han, Fei; Du, Bin; Wei, Qin
2017-04-01
In the present study, the feasibility of achieving a partial nitrification (PN) process and its greenhouse gas emissions were evaluated in a sequencing batch biofilm reactor (SBBR). After 90 days of operation, the average effluent NH4+-N removal efficiency and nitrite accumulation rate of the PN-SBBR were as high as 98.2% and 87.6%, respectively. Both polysaccharide and protein contents were reduced in loosely bound extracellular polymeric substances (LB-EPS) and tightly bound EPS (TB-EPS) during the achievement of PN-biofilm. Excitation-emission matrix spectra implied that aromatic protein-like, tryptophan protein-like and humic acid-like substances were the main components of both kinds of EPS in seed sludge and PN-biofilm. Over a typical cycle, the emission rate of CO2 was much higher than that of N2O, and their total amounts per cycle were 67.7 and 16.5 mg, respectively. Free ammonia (FA) played a significant role in inhibiting the activity of nitrite-oxidizing bacteria and in the occurrence of nitrite accumulation. Copyright © 2017 Elsevier Ltd. All rights reserved.
DNA-Catalyzed Amide Hydrolysis.
Zhou, Cong; Avins, Joshua L; Klauser, Paul C; Brandsen, Benjamin M; Lee, Yujeong; Silverman, Scott K
2016-02-24
DNA catalysts (deoxyribozymes) for a variety of reactions have been identified by in vitro selection. However, for certain reactions this identification has not been achieved. One important example is DNA-catalyzed amide hydrolysis, for which a previous selection experiment instead led to DNA-catalyzed DNA phosphodiester hydrolysis. Subsequent efforts in which the selection strategy deliberately avoided phosphodiester hydrolysis led to DNA-catalyzed ester and aromatic amide hydrolysis, but aliphatic amide hydrolysis has been elusive. In the present study, we show that including modified nucleotides that bear protein-like functional groups (any one of primary amino, carboxyl, or primary hydroxyl) enables identification of amide-hydrolyzing deoxyribozymes. In one case, the same deoxyribozyme sequence without the modifications still retains substantial catalytic activity. Overall, these findings establish the utility of introducing protein-like functional groups into deoxyribozymes for identifying new catalytic function. The results also suggest the longer-term feasibility of deoxyribozymes as artificial proteases.
Jin, Pengkang; Jin, Xin; Bjerkelund, Viggo A; Østerhus, Stein W; Wang, Xiaochang C; Yang, Lei
2016-01-01
The reactivity of dissolved effluent organic matter (EfOM) in the process of ozonation was examined. Under different ozone dosages (0.42 ± 0.09, 0.98 ± 0.11 and 2.24 ± 0.17 mgO3/mg DOC), the EfOM before and after ozonation could be classified into four fractions according to their hydrophobicities. By ozonation, the hydrophobic fractions, especially hydrophobic acid (HOA) and hydrophobic neutral (HON), were found to undergo a process of transformation into hydrophilic fractions (HI), of which the HOA were first transformed into HON, and then the majority of the HON fraction was later converted to HI by further ozonation. It was noticeable that after ozonation, the fluorescence intensity in the humic-like and protein-like regions decreased as indicated by the excitation and emission matrix (EEM) spectra for the hydrophobic fractions. By coupling the EEM spectra with the molecular size analysis using high performance size exclusion chromatography (HPSEC), the difference between the characteristic distributions of the humic-like and protein-like fluorophores were further revealed. It could thus be extrapolated that ozone might have preferentially reacted with the protein-like hydrophobic fraction with molecular weight (MW) less than 100 kDa. Moreover, by X-ray photoelectron spectroscopy (XPS) analysis, it was identified that with increasing ozone dosage (from 0 to 2.24 ± 0.17 mgO3/mg DOC), the aromaticity of HON decreased dramatically, while aliphatics and ketones increased especially at the low ozone dose (0.42 ± 0.09 mgO3/mg DOC). Of the EfOM fractions, the HON fraction would have a higher content of electron enriched aromatics which could preferentially react with ozone rather than the HOA fraction. Copyright © 2015 Elsevier Ltd. All rights reserved.
NASA Astrophysics Data System (ADS)
Mendoza, Wilson G.; Weiss, Elliot L.; Schieber, Brian; Greg Mitchell, B.
2017-07-01
In this study we used fluorescence excitation and emission matrix spectroscopy, hydrographic data, and a self-organizing map (SOM) analysis to assess the spatial distribution of labile and refractory fluorescent dissolved organic matter (FDOM) for the Chukchi and Beaufort Seas at the time of a massive under-ice phytoplankton bloom during early summer 2011. Biogeochemical properties were assessed through decomposition of water property classes and sample classification that employed a SOM neural network-based analysis which classified 10 clusters from 269 samples and 17 variables. The terrestrial, humic-like component FDOM (ArC1, 4.98 ± 1.54 Quinine Sulfate Units (QSU)) and protein-like component FDOM (ArC3, 1.63 ± 0.88 QSU) were found to have elevated fluorescence in the Lower Polar Mixed Layer (LPML) (salinity ~29.56 ± 0.76). In the LPML water mass, the observed contribution of meteoric water fraction was 17%, relative to a 12% contribution from the sea ice melt fraction. The labile ArC3-protein-like component (2.01 ± 1.92 QSU) was also observed to be elevated in the Pacific Winter Waters mass, where the under-ice algal bloom was observed (~40-50 m). We interpreted these relationships to indicate that the accumulation and variable distribution of the protein-like component on the shelf could be influenced directly by sea ice melt, transport, and mixing processes and indirectly by the in situ algal bloom and microbial activity. ArC5, corresponding to what is commonly considered marine humic FDOM, indicated a bimodal distribution with high values in both the freshest and saltiest waters. The association of ArC5 with deep, dense salty water is consistent with this component as refractory humic-like FDOM, whereas our evidence of a terrestrial origin challenges this classic paradigm for this component.
He, Xiao-Song; Xi, Bei-Dou; Gao, Ru-Tai; Wang, Lei; Ma, Yan; Cui, Dong-Yu; Tan, Wen-Bing
2015-06-01
Groundwater was collected in 2011 and 2012, and fluorescence spectroscopy coupled with chemometric analysis was employed to investigate the composition, origin, and dynamics of dissolved organic matter (DOM) in the groundwater. The results showed that the groundwater DOM comprised protein-, fulvic-, and humic-like substances, and the protein-like component originated predominantly from microbial production. The groundwater pollution by landfill leachate enhanced microbial activity and thereby increased microbial by-product-like material such as protein-like component in the groundwater. Excitation-emission matrix fluorescence spectra combined with parallel factor analysis showed that the protein-like matter content increased from 2011 to 2012 in the groundwater, whereas the fulvic- and humic-like matter concentration exhibited no significant changes. In addition, synchronous-scan fluorescence spectra coupled with two-dimensional correlation analysis showed that the change of the fulvic- and humic-like matter was faster than that of the protein-like substances, as the groundwater flowed from upstream to downstream in 2011, but slower than that of the protein-like substance in 2012 due to the enhancement of microbial activity. Fluorescence spectroscopy combined with chemometric analysis can investigate groundwater pollution characteristics and monitor DOM dynamics in groundwater.
Reinert, Zachary E; Horne, W Seth
2014-11-28
A variety of non-biological structural motifs have been incorporated into the backbone of natural protein sequences. In parallel work, diverse unnatural oligomers of de novo design (termed "foldamers") have been developed that fold in defined ways. In this Perspective article, we survey foundational studies on protein backbone engineering, with a focus on alterations made in the context of complex tertiary folds. We go on to summarize recent work illustrating the potential promise of these methods to provide a general framework for the construction of foldamer mimics of protein tertiary structures.
Liu, Qinghua; Zhou, Zhichun; Wei, Yongcheng; Shen, Danyu; Feng, Zhongping; Hong, Shanping
2015-01-01
Masson pine is an important source of timber and oleoresin in South China. Increasing the yield of oleoresin in stems can raise economic benefits and enhance resistance to bark beetles. However, the genetic mechanisms regulating the yield of oleoresin were still unknown. Here, high-throughput sequencing technology was used to investigate the transcriptome and compare the gene expression profiles of high and low oleoresin-yielding genotypes. A total of 40,690,540 reads were obtained and assembled into 137,499 transcripts from the secondary xylem tissues. We identified 84,842 candidate unigenes based on sequence annotation using various databases, and 96 unigenes were candidates for terpenoid backbone biosynthesis in pine. By comparing the expression profiles of high and low oleoresin-yielding genotypes, 649 differentially expressed genes (DEGs) were identified. GO enrichment analysis of DEGs revealed that multiple pathways were related to high yield of oleoresin. Nine candidate genes were validated by qPCR analysis. Among them, the candidate genes encoding geranylgeranyl diphosphate synthase (GGPS) and (-)-alpha/beta-pinene synthase were up-regulated in the high oleoresin-yielding genotype, while tricyclene synthase showed a lower expression level, which was in good agreement with the GC/MS result. In addition, DEGs encoding ABC transporters, pathogenesis-related proteins (PR5 and PR9), phosphomethylpyrimidine synthase, non-specific lipid-transfer protein-like protein and ethylene responsive transcription factors (ERFs) were also confirmed to be critical for the biosynthesis of oleoresin. The next-generation sequencing strategy used in this study has proven to be a powerful means for analyzing transcriptome variation related to the yield of oleoresin in masson pine. The candidate genes encoding GGPS, (-)-alpha/beta-pinene synthase, tricyclene synthase, ABC transporters, non-specific lipid-transfer protein-like protein, phosphomethylpyrimidine synthase, ERFs and pathogen responses may play important roles in regulating the yield of oleoresin. These DEGs are worthy of special attention in future studies. PMID:26167875
Free energy landscape of protein-like chains with discontinuous potentials
NASA Astrophysics Data System (ADS)
Movahed, Hanif Bayat; van Zon, Ramses; Schofield, Jeremy
2012-06-01
In this article the configurational space of two simple protein models consisting of polymers composed of a periodic sequence of four different kinds of monomers is studied as a function of temperature. In the protein models, hydrogen bond interactions, electrostatic repulsion, and covalent bond vibrations are modeled by discontinuous step, shoulder, and square-well potentials, respectively. The protein-like chains exhibit a secondary alpha helix structure in their folded states at low temperatures, and allow a natural definition of a configuration by considering which beads are bonded. Free energies and entropies of configurations are computed using the parallel tempering method in combination with hybrid Monte Carlo sampling of the canonical ensemble of the discontinuous potential system. The probability of observing the most common configuration is used to analyze the nature of the free energy landscape, and it is found that the model with the smallest number of possible bonds exhibits a funnel-like free energy landscape at low enough temperature for chains with fewer than 30 beads. For longer proteins, the free energy landscape consists of several minima, where the configuration with the lowest free energy changes significantly on lowering the temperature, and the probability of observing the most common configuration never approaches one due to the degeneracy of the lowest accessible potential energy.
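The parallel tempering method mentioned in this abstract exchanges configurations between replicas held at different temperatures. As a minimal sketch of the standard replica-exchange acceptance rule (the abstract does not give the authors' implementation, and the function name and arguments here are hypothetical), a swap between replicas at inverse temperatures beta_i and beta_j with energies E_i and E_j is accepted with probability min(1, exp[(beta_i - beta_j)(E_i - E_j)]):

```python
import math


def swap_probability(beta_i, beta_j, energy_i, energy_j):
    """Metropolis acceptance probability for exchanging the configurations
    of two replicas at inverse temperatures beta_i and beta_j."""
    delta = (beta_i - beta_j) * (energy_i - energy_j)
    # A non-negative exponent means the swap is always accepted.
    return 1.0 if delta >= 0 else math.exp(delta)


# The colder replica (larger beta) currently holds the higher energy,
# so handing it the lower-energy configuration is always accepted.
print(swap_probability(2.0, 1.0, -1.0, -3.0))  # 1.0
```

Swaps that move a higher-energy configuration to the colder replica are accepted only with an exponentially small probability, which is what keeps each replica sampling its own canonical ensemble.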
Pulmonary inflammatory myofibroblastic tumor harboring EML4-ALK fusion gene.
Sokai, Akihiko; Enaka, Makiko; Sokai, Risa; Mori, Shoichi; Mori, Shunsuke; Gunji, Masaharu; Fujino, Masahiko; Ito, Masafumi
2014-01-01
Inflammatory myofibroblastic tumor is a rare tumor deriving from mesenchymal tissue. Approximately 50% of inflammatory myofibroblastic tumors harbor an anaplastic lymphoma kinase fusion gene. Pulmonary inflammatory myofibroblastic tumors harboring tropomyosin3-anaplastic lymphoma kinase or protein tyrosine phosphatase receptor-type F polypeptide-interacting protein-binding protein 1-anaplastic lymphoma kinase have been reported previously. However, it has not been reported that inflammatory myofibroblastic tumors harbor echinoderm microtubule-associated protein-like 4-anaplastic lymphoma kinase fusion gene which is considered to be very specific to lung cancers. A few tumors harboring echinoderm microtubule-associated protein-like 4-anaplastic lymphoma kinase fusion gene other than lung cancers have been reported and the tumors were all carcinomas. A 67-year-old man had been followed up for a benign tumor for approximately 3 years before the tumor demonstrated malignant transformation. Lobectomy and autopsy revealed that an inflammatory myofibroblastic tumor harboring echinoderm microtubule-associated protein-like 4-anaplastic lymphoma kinase fusion gene had transformed into an undifferentiated sarcoma. This case suggests that echinoderm microtubule-associated protein-like 4-anaplastic lymphoma kinase fusion is an oncogenic event in not only carcinomas but also sarcomas originating from stromal cells.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Riback, Joshua A.; Bowman, Micayla A.; Zmyslowski, Adam M.
A substantial fraction of the proteome is intrinsically disordered, and even well-folded proteins adopt non-native geometries during synthesis, folding, transport, and turnover. Characterization of intrinsically disordered proteins (IDPs) is challenging, in part because of a lack of accurate physical models and the difficulty of interpreting experimental results. We have developed a general method to extract the dimensions and solvent quality (self-interactions) of IDPs from a single small-angle x-ray scattering measurement. We applied this procedure to a variety of IDPs and found that even IDPs with low net charge and high hydrophobicity remain highly expanded in water, contrary to the general expectation that protein-like sequences collapse in water. Our results suggest that the unfolded state of most foldable sequences is expanded; we conjecture that this property was selected by evolution to minimize misfolding and aggregation.
Peldszus, Sigrid; Hallé, Cynthia; Peiris, Ramila H; Hamouda, Mohamed; Jin, Xiaohui; Legge, Raymond L; Budman, Hector; Moresoli, Christine; Huck, Peter M
2011-10-15
With the increased use of membranes in drinking water treatment, fouling--particularly the hydraulically irreversible type--remains the main operating issue that hinders performance and increases operational costs. The main challenge in assessing fouling potential of feed water is to accurately detect and quantify feed water constituents responsible for membrane fouling. Utilizing fluorescence excitation-emission matrices (EEM), protein-like substances, humic and fulvic acids, and particulate/colloidal matter can be detected with high sensitivity in surface waters. The application of principal component analysis to fluorescence EEMs allowed estimation of the impact of surface water constituents on reversible and irreversible membrane fouling. This technique was applied to experimental data from a two year bench-scale study that included thirteen experiments investigating the fouling potential of Grand River water (Ontario, Canada) and the effect of biofiltration pre-treatment on the level of foulants during ultrafiltration (UF). Results showed that, although the content of protein-like substances in this membrane feed water (=biofiltered natural water) was much lower than commonly found in wastewater applications, the content of protein-like substances was still highly correlated with irreversible fouling of the UF membrane. In addition, there is evidence that protein-like substances and particulate/colloidal matter formed a combined fouling layer, which contributed to both reversible and irreversible fouling. It is suggested that fouling transitions from a reversible to an irreversible regime depending on feed composition and operating time. Direct biofiltration without prior coagulant addition reduced the protein-like content of the membrane feed water which in turn reduced the irreversible fouling potential for UF membranes. Biofilters also decreased reversible fouling, and for both types of fouling higher biofilter contact times were beneficial. 
Copyright © 2011 Elsevier Ltd. All rights reserved.
Brown, Michael P; Hissaria, Pravin; Hsieh, Amy Hc; Kneebone, Christopher; Vallat, Wilson
2017-04-15
Immune checkpoint inhibitors such as Pembrolizumab are used to restore antitumour immune response. It is important to be vigilant of immune mediated adverse events related to such therapy. We report a case of autoimmune limbic encephalitis with Contactin-Associated Protein-like 2 (CASPR2) antibody secondary to Pembrolizumab therapy for metastatic melanoma. Copyright © 2017 Elsevier B.V. All rights reserved.
Pattern Recognition of Adsorbing HP Lattice Proteins
NASA Astrophysics Data System (ADS)
Wilson, Matthew S.; Shi, Guangjie; Wüst, Thomas; Landau, David P.; Schmid, Friederike
2015-03-01
Protein adsorption is relevant in fields ranging from medicine to industry, and the qualitative behavior exhibited by coarse-grained models could offer insight for further research in such fields. Our study on the selective adsorption of lattice proteins utilizes the Wang-Landau algorithm to simulate the Hydrophobic-Polar (H-P) model with an efficient set of Monte Carlo moves. Each substrate is modeled as a square pattern of 9 lattice sites which attract either H or P monomers, and are located on an otherwise neutral surface. The fully enumerated set of 102 unique surfaces is simulated with each protein sequence. A collection of 27-monomer sequences is used, each of which is non-degenerate and protein-like. Thermodynamic quantities such as the specific heat and free energy are calculated from the density of states, and are used to investigate the adsorption of lattice proteins on patterned substrates. Research supported by NSF.
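The figure of 102 unique surfaces can be reproduced by direct enumeration. The sketch below rests on an assumption the abstract does not state explicitly: two 3 x 3 patterns of H-attracting and P-attracting sites are treated as the same surface when one can be rotated or reflected onto the other. The function names are hypothetical.

```python
from itertools import product


def rotate(grid):
    """Rotate a square grid (tuple of row tuples) by 90 degrees clockwise."""
    return tuple(zip(*grid[::-1]))


def reflect(grid):
    """Mirror a grid left to right."""
    return tuple(row[::-1] for row in grid)


def canonical(grid):
    """Smallest of the 8 rotated/reflected images, used as a class label."""
    images = []
    current = grid
    for _ in range(4):
        images.append(current)
        images.append(reflect(current))
        current = rotate(current)
    return min(images)


def unique_patterns(size=3, letters="HP"):
    """Set of size x size patterns that are distinct under the symmetries
    of the square (assumed equivalence criterion)."""
    seen = set()
    for cells in product(letters, repeat=size * size):
        grid = tuple(tuple(cells[r * size:(r + 1) * size]) for r in range(size))
        seen.add(canonical(grid))
    return seen


print(len(unique_patterns()))  # 102
```

Of the 2^9 = 512 raw patterns, 102 remain after identifying symmetric copies, which agrees with the count given in the abstract and with a Burnside-lemma count, (512 + 8 + 8 + 32 + 4 x 64) / 8 = 102.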
Fu, Qing-Long; He, Jian-Zhou; Blaney, Lee; Zhou, Dong-Mei
2016-07-01
The fate and transport of roxarsone (ROX), a widely used organoarsenic feed additive, in soil is significantly influenced by the ubiquitous presence of soil-derived dissolved organic matter (DOM). In this study, fluorescence quenching titration and two-dimensional correlation spectroscopy (2D-COS) were employed to study ROX binding to DOM. Binding mechanisms were revealed by fluorescence lifetime measurement and Fourier transform infrared spectroscopy (FTIR). Humic- and protein-like fluorophores were identified in the excitation-emission matrix and synchronous fluorescence spectra of DOM. The conditional stability constant (log KC) for ROX binding to DOM was found to be 5.06, indicating that ROX was strongly bound to DOM. The binding order of ROX to DOM fluorophores revealed by 2D-COS followed the sequence of protein-like fluorophore ≈ the longer wavelength excited humic-like (L-humic-like) fluorophore > the shorter wavelength excited humic-like (S-humic-like) fluorophore. 2D-COS resolved issues with peak overlapping and allowed further exploration of the interaction between ROX and DOM. Results of fluorescence lifetime and FTIR spectra demonstrated that ROX interacted with DOM through the hydroxyl, amide II, carboxyl, aliphatic CH, and NO2 groups, yielding stable DOM-ROX complexes. The strong interaction between ROX and DOM implies that DOM plays an important role in the environmental fate of ROX in soil. Copyright © 2016 Elsevier Ltd. All rights reserved.
Protein-like fully reversible tetramerisation and super-association of an aminocellulose
NASA Astrophysics Data System (ADS)
Nikolajski, Melanie; Adams, Gary G.; Gillis, Richard B.; Besong, David Tabot; Rowe, Arthur J.; Heinze, Thomas; Harding, Stephen E.
2014-01-01
Unusual protein-like, partially reversible associative behaviour has recently been observed in solutions of the water soluble carbohydrates known as 6-deoxy-6-(ω-aminoalkyl)aminocelluloses, which produce controllable self-assembling films for enzyme immobilisation and other biotechnological applications. Now, for the first time, we have found a fully reversible self-association (tetramerisation) within this family of polysaccharides. Remarkably these carbohydrate tetramers are then seen to associate further in a regular way into supra-molecular complexes. Fully reversible oligomerisation has been hitherto completely unknown for carbohydrates and instead resembles in some respects the assembly of polypeptides and proteins like haemoglobin and its sickle cell mutation. Our traditional perceptions as to what might be considered ``protein-like'' and what might be considered as ``carbohydrate-like'' behaviour may need to be rendered more flexible, at least as far as interaction phenomena are concerned.
Weininger, Arthur; Weininger, Susan
2015-01-01
The ability to identify the functional correlates of structural and sequence variation in proteins is a critical capability. We related structures of influenza A N10 and N11 proteins that have no established function to structures of proteins with known function by identifying spatially conserved atoms. We identified atoms with common distributed spatial occupancy in PDB structures of N10 protein, N11 protein, an influenza A neuraminidase, an influenza B neuraminidase, and a bacterial neuraminidase. By superposing these spatially conserved atoms, we aligned the structures and associated molecules. We report spatially and sequence invariant residues in the aligned structures. Spatially invariant residues in the N6 and influenza B neuraminidase active sites were found in previously unidentified spatially equivalent sites in the N10 and N11 proteins. We found the corresponding secondary and tertiary structures of the aligned proteins to be largely identical despite significant sequence divergence. We found structural precedent in known non-neuraminidase structures for residues exhibiting structural and sequence divergence in the aligned structures. In N10 protein, we identified staphylococcal enterotoxin I-like domains. In N11 protein, we identified hepatitis E E2S-like domains, SARS spike protein-like domains, and toxin components shared by alpha-bungarotoxin, staphylococcal enterotoxin I, anthrax lethal factor, clostridium botulinum neurotoxin, and clostridium tetanus toxin. The presence of active site components common to the N6, influenza B, and S. pneumoniae neuraminidases in the N10 and N11 proteins, combined with the absence of apparent neuraminidase function, suggests that the role of neuraminidases in H17N10 and H18N11 emerging influenza A viruses may have changed. 
The presentation of E2S-like, SARS spike protein-like, or toxin-like domains by the N10 and N11 proteins in these emerging viruses may indicate that H17N10 and H18N11 sialidase-facilitated cell entry has been supplemented or replaced by sialidase-independent receptor binding to an expanded cell population that may include neurons and T-cells. PMID:25706124
Shi, Yong Xiang; Mangal, Vaughn; Guéguen, Céline
2016-07-01
Diffusive gradients in thin films (DGT) devices were used to investigate the temporal and spatial changes in vanadium (V) speciation in the Churchill estuary system (Manitoba). Thirty-six DGT sets and 95 discrete water samples were collected at 8 river and 3 estuary sites during spring freshet and summer base flow. The dissolved V concentration in the Churchill River at summer base flow was approximately 5 times higher than that during the spring high flow (27.3 ± 18.9 nM vs 4.8 ± 3.5 nM). DGT-labile V showed the opposite trend, with greater values found during the spring high flow (2.6 ± 1.8 nM vs 1.4 ± 0.3 nM). Parallel factor analysis (PARAFAC) conducted on 95 excitation-emission matrix spectra validated four humic-like (C1-C4) and one protein-like (C5) fluorescent components. A significant positive relationship was found between protein-like DOM and DGT-labile V (r = 0.53, p < 0.05), indicating that protein-like DOM possibly affected the DGT-labile V concentration in the Churchill River. Sediment leachates were enriched in DGT-labile V and protein-like DOM, which can be readily released when river sediment begins to thaw during spring freshet. Copyright © 2016 Elsevier Ltd. All rights reserved.
Peiris, R H; Jaklewicz, M; Budman, H; Legge, R L; Moresoli, C
2013-06-15
A fluorescence excitation-emission matrix (EEM) approach together with principal component analysis (PCA) was used for assessing hydraulically irreversible fouling of three pilot-scale ultrafiltration (UF) systems containing full-scale and bench-scale hollow fiber membrane modules in drinking water treatment. These systems were operated for at least three months with extensive cycles of permeation, combination of back-pulsing and scouring and chemical cleaning. The principal component (PC) scores generated from the PCA of the fluorescence EEMs were found to be related to humic substances (HS), protein-like and colloidal/particulate matter content. PC scores of HS- and protein-like matter of the UF feed water, when considered separately, showed reasonably good correlations with the rate of hydraulically irreversible fouling for long-term UF operations. In contrast, comparatively weaker correlations between PC scores of colloidal/particulate matter and the rate of hydraulically irreversible fouling were obtained for all UF systems. Since individual correlations could not fully explain the evolution of the rate of irreversible fouling, multi-linear regression models were developed to relate the combined effect of HS-like, protein-like and colloidal/particulate matter PC scores to the rate of hydraulically irreversible fouling for each specific UF system. These multi-linear regression models revealed significant individual and combined contributions of HS- and protein-like matter to the rate of hydraulically irreversible fouling, with protein-like matter generally showing the greatest contribution. The contribution of colloidal/particulate matter to the rate of hydraulically irreversible fouling was not as significant. The addition of polyaluminum chloride, as coagulant, to UF feed appeared to have a positive impact in reducing hydraulically irreversible fouling by these constituents.
The proposed approach has applications in quantifying the individual and synergistic contribution of major natural water constituents to the rate of hydraulically irreversible membrane fouling and shows potential for controlling UF irreversible fouling in the production of drinking water. Copyright © 2013 Elsevier Ltd. All rights reserved.
Liu, Ting; Chen, Zhong-lin; Yu, Wen-zheng; You, Shi-jie
2011-02-01
This study focuses on organic membrane foulants in a submerged membrane bioreactor (MBR) process with pre-ozonation compared to an individual MBR using three-dimensional excitation-emission matrix (EEM) fluorescence spectroscopy. While the influent was continuously ozonated at a normal dosage, preferable organic matter removal was achieved in subsequent MBR, and trans-membrane pressure increased at a much lower rate than that of the individual MBR. EEM fluorescence spectroscopy was employed to characterize the dissolved organic matter (DOM) samples, extracellular polymeric substance (EPS) samples and membrane foulants. Four main peaks could be identified from the EEM fluorescence spectra of the DOM samples in both MBRs. Two peaks were associated with the protein-like fluorophores, and the other ones were related to the humic-like fluorophores. The results indicated that pre-ozonation decreased fluorescence intensities of all peaks in the EEM spectra of influent DOM especially for protein-like substances and caused red shifts of all fluorescence peaks to different extents. The peak intensities of the protein-like substances represented by Peak T(1) and T(2) in EPS spectra were obviously decreased as a result of pre-ozonation. Both external and internal fouling could be effectively mitigated by the pre-ozonation. The most primary component of external foulants was humic acid-like substance (Peak C) in the MBR with pre-ozonation and protein-like substance (Peak T(1)) in the individual MBR, respectively. The content decrease of protein-like substances and structural change of humic-like substances were observed in external foulants from EEM fluorescence spectra due to pre-ozonation. However, it could be seen that ozonation resulted in significant reduction of intensities but little location shift of all peaks in EEM fluorescence spectra of internal foulants. Copyright © 2010 Elsevier Ltd. All rights reserved.
Identification of 24h Ixodes scapularis immunogenic tick saliva proteins.
Lewis, Lauren A; Radulović, Željko M; Kim, Tae K; Porter, Lindsay M; Mulenga, Albert
2015-04-01
Ixodes scapularis is arguably the most medically important tick species in the United States. This tick transmits 5 of the 14 human tick-borne disease (TBD) agents in the USA: Borrelia burgdorferi, Anaplasma phagocytophilum, B. miyamotoi, Babesia microti, and Powassan virus disease. Except for the Powassan virus disease, I. scapularis-vectored TBD agents require more than 24h post attachment to be transmitted. This study describes identification of 24h immunogenic I. scapularis tick saliva proteins, which could provide opportunities to develop strategies to stop tick feeding before transmission of the majority of pathogens. A 24h fed female I. scapularis phage display cDNA expression library was biopanned using rabbit antibodies to 24h fed I. scapularis female tick saliva proteins, subjected to next generation sequencing, de novo assembly, and bioinformatic analyses. A total of 182 contigs were assembled, of which ∼19% (35/182) are novel and did not show identity to any known proteins in GenBank. The remaining ∼81% (147/182) of contigs were provisionally identified based on matches in GenBank including ∼18% (27/147) that matched protein sequences previously annotated as hypothetical and putative tick saliva proteins. Others include proteases and protease inhibitors (∼3%, 5/147), transporters and/or ligand binding proteins (∼6%, 9/147), immunogenic tick saliva housekeeping enzyme-like (17%, 25/147), ribosomal protein-like (∼31%, 46/147), and those classified as miscellaneous (∼24%, 35/147). Notable among the miscellaneous class are antimicrobial peptides (microplusin and ricinusin), myosin-like proteins that have been previously found in tick saliva, and heat shock tick saliva protein. The data in this study provide the foundation for in-depth analysis of I. scapularis feeding during the first 24h, before the majority of TBD agents can be transmitted. Copyright © 2015 Elsevier GmbH. All rights reserved.
Pan, Xiangliang; Liu, Jing; Zhang, Daoyong; Chen, Xi; Song, Wenjuan; Wu, Fengchang
2010-05-15
Binding of dicamba to soluble EPS (SEPS) and bound EPS (BEPS) from aerobic activated sludge was investigated using fluorescence spectroscopy. Two protein-like fluorescence peaks (peak A with Ex/Em=225 nm/342-344 nm and peak B with Ex/Em=275/340-344 nm) were identified in SEPS and BEPS. Humic-like fluorescence peak C (Ex/Em=270-275 nm/450-460 nm) was only found in BEPS. Fluorescence of peaks A and B for SEPS and peak A for BEPS was markedly quenched by dicamba at all temperatures, whereas fluorescence of peaks B and C for BEPS was quenched only at 298 K. A dynamic process dominated the fluorescence quenching of peak A of both SEPS and BEPS. Fluorescence quenching of peaks B and C was governed by a static process. The effective quenching constants (logK(a)) were 4.725-5.293 for protein-like fluorophores of SEPS and 4.23-5.190 for protein-like fluorophores of BEPS, respectively. LogK(a) for humic-like substances was 3.85. Generally, SEPS had greater binding capacity for dicamba than BEPS, and protein-like substances bound dicamba more strongly than humic-like substances. Binding of dicamba to SEPS and BEPS was spontaneous and exothermic. Electrostatic and hydrophobic interaction forces play a crucial role in binding of dicamba to EPS. Copyright © 2010 Elsevier Inc. All rights reserved.
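As an illustration of how a quenching constant of the kind reported above can be estimated, the following minimal sketch fits the classic Stern-Volmer relation F0/F = 1 + K_SV[Q] by least squares. The abstract does not state which quenching equation the authors used, so the choice of the classic form is an assumption; the function names and the titration data are hypothetical and synthetic.

```python
import math


def linear_fit(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def stern_volmer_constant(quencher_conc, f0, intensities):
    """Estimate K_SV from F0/F = 1 + K_SV * [Q] (classic form, assumed here)."""
    ratios = [f0 / f for f in intensities]
    intercept, slope = linear_fit(quencher_conc, ratios)
    return slope, intercept


# Synthetic titration (NOT data from the study): quencher in mol/L.
K_TRUE = 1.0e5
conc = [0.0, 2e-6, 4e-6, 6e-6, 8e-6, 1e-5]
F0 = 100.0
intensities = [F0 / (1.0 + K_TRUE * q) for q in conc]

k_sv, intercept = stern_volmer_constant(conc, F0, intensities)
log_k = math.log10(k_sv)
```

With noise-free synthetic input the fit recovers the constant used to generate the data; with real spectra, the linearity of the plot and the fitted intercept would need to be checked before a constant is reported.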
Chen, Fei; Peldszus, Sigrid; Elhadidy, Ahmed M; Legge, Raymond L; Van Dyke, Michele I; Huck, Peter M
2016-11-01
To better understand biofiltration, concentration profiles of various natural organic matter (NOM) components throughout a pilot-scale drinking water biofilter were investigated using liquid chromatography - organic carbon detection (LC-OCD) and fluorescence excitation and emission matrices (FEEM). Over a 2 month period, water samples were collected from six ports at different biofilter media depths. Results showed substantial removal of biopolymers (i.e. high molecular weight (MW) NOM components as characterized by LC-OCD) and FEEM protein-like materials, but low removal of humic substances, building blocks, low MW neutrals and low MW acids. For the first time, the relative biodegradability of different NOM components characterized by the LC-OCD and FEEM approaches was investigated across the entire MW range and for different fluorophore compositions, in addition to establishing the biodegradation kinetics. The removal kinetics for FEEM protein-like materials differed from those for the LC-OCD-based biopolymers, illustrating the complementary nature of the LC-OCD and FEEM approaches. LC-OCD biopolymers (both organic carbon and organic nitrogen) and FEEM protein-like materials were shown to follow either first or second order biodegradation kinetics. Due to the low percent removal and small number of data points, the performance of three kinetic models was not distinguishable for humic substances. Pre-filtration of samples for FEEM analyses affected the removal behaviours and/or kinetics, especially of protein-like materials, which was attributed to the removal of the colloidal/particulate materials. Copyright © 2016 Elsevier Ltd. All rights reserved.
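The comparison of first- and second-order kinetics mentioned above can be sketched with the standard linearized rate laws: ln C = ln C0 - kt for first order and 1/C = 1/C0 + kt for second order. This is only a minimal illustration; the abstract does not describe the fitting procedure, the function names are hypothetical, and the concentration profile below is synthetic (generated from a first-order law), with t standing in for contact time in the filter.

```python
import math


def fit_line(xs, ys):
    """Least-squares line; returns (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot


def first_order_fit(times, conc):
    """Linearized first-order law: ln C = ln C0 - k * t."""
    slope, intercept, r2 = fit_line(times, [math.log(c) for c in conc])
    return -slope, math.exp(intercept), r2  # k, C0, r^2


def second_order_fit(times, conc):
    """Linearized second-order law: 1/C = 1/C0 + k * t."""
    slope, intercept, r2 = fit_line(times, [1.0 / c for c in conc])
    return slope, 1.0 / intercept, r2  # k, C0, r^2


# Synthetic profile (NOT data from the study), generated from a first-order law.
K_TRUE, C0_TRUE = 0.2, 5.0
times = [0, 2, 4, 6, 8, 10]
conc = [C0_TRUE * math.exp(-K_TRUE * t) for t in times]

k1, c0_first, r2_first = first_order_fit(times, conc)
k2, c0_second, r2_second = second_order_fit(times, conc)
```

Here the first-order linearization fits better because the data were generated that way. With few, noisy points and low percent removal, the two fits can become indistinguishable, which is consistent with the abstract's remark about humic substances.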
Designing artificial enzymes from scratch: Experimental study and mesoscale simulation
NASA Astrophysics Data System (ADS)
Komarov, Pavel V.; Zaborina, Olga E.; Klimova, Tamara P.; Lozinsky, Vladimir I.; Khalatur, Pavel G.; Khokhlov, Alexey R.
2016-09-01
We present a new concept for designing biomimetic analogs of enzymatic proteins; these analogs are based on the synthetic protein-like copolymers. α-Chymotrypsin is used as a prototype of the artificial catalyst. Our experimental study shows that in the course of free radical copolymerization of hydrophobic and hydrophilic monomers the target globular nanostructures of a "core-shell" morphology appear in a selective solvent. Using a mesoscale computer simulation, we show that the protein-like globules can have a large number of catalytic centers located at the hydrophobic core/hydrophilic shell interface.
Accurate, Rapid Taxonomic Classification of Fungal Large-Subunit rRNA Genes
Liu, Kuan-Liang; Porras-Alfaro, Andrea; Eichorst, Stephanie A.
2012-01-01
Taxonomic and phylogenetic fingerprinting based on sequence analysis of gene fragments from the large-subunit rRNA (LSU) gene or the internal transcribed spacer (ITS) region is becoming an integral part of fungal classification. The lack of an accurate and robust classification tool trained by a validated sequence database for taxonomic placement of fungal LSU genes is a severe limitation in taxonomic analysis of fungal isolates or large data sets obtained from environmental surveys. Using a hand-curated set of 8,506 fungal LSU gene fragments, we determined the performance characteristics of a naïve Bayesian classifier across multiple taxonomic levels and compared the classifier performance to that of a sequence similarity-based (BLASTN) approach. The naïve Bayesian classifier was computationally more rapid (>460-fold with our system) than the BLASTN approach, and it provided equal or superior classification accuracy. Classifier accuracies were compared using sequence fragments of 100 bp and 400 bp and two different PCR primer anchor points to mimic sequence read lengths commonly obtained using current high-throughput sequencing technologies. Accuracy was higher with 400-bp sequence reads than with 100-bp reads. It was also significantly affected by sequence location across the 1,400-bp test region. The highest accuracy was obtained across either the D1 or D2 variable region. The naïve Bayesian classifier provides an effective and rapid means to classify fungal LSU sequences from large environmental surveys. The training set and tool are publicly available through the Ribosomal Database Project (http://rdp.cme.msu.edu/classifier/classifier.jsp). PMID:22194300
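The idea behind a naïve Bayesian sequence classifier can be sketched as follows: each sequence is reduced to its set of k-mer "words", per-taxon word frequencies are learned from labelled training sequences, and a query is assigned to the taxon with the highest summed log-probability. This toy version is not the RDP implementation; the add-half smoothing, equal priors, k = 4, taxon names and sequences are all simplifying assumptions made for illustration.

```python
import math
from collections import defaultdict


def kmers(seq, k):
    """Return the set of distinct k-mer words in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def train(examples, k):
    """examples: list of (sequence, taxon) pairs.

    Returns (sequences per taxon, per-taxon count of sequences containing each word).
    """
    seq_counts = defaultdict(int)
    word_counts = defaultdict(lambda: defaultdict(int))
    for seq, taxon in examples:
        seq_counts[taxon] += 1
        for word in kmers(seq, k):
            word_counts[taxon][word] += 1
    return seq_counts, word_counts


def classify(query, model, k):
    """Pick the taxon maximizing the sum of log P(word | taxon), equal priors assumed."""
    seq_counts, word_counts = model
    words = kmers(query, k)
    best_taxon, best_score = None, -math.inf
    for taxon in sorted(seq_counts):
        total = seq_counts[taxon]
        # Add-half smoothing keeps unseen words from giving log(0).
        score = sum(
            math.log((word_counts[taxon].get(w, 0) + 0.5) / (total + 1.0))
            for w in words
        )
        if score > best_score:
            best_taxon, best_score = taxon, score
    return best_taxon


# Toy training set (hypothetical sequences and taxon labels).
training = [
    ("ACGTACGTACGTAAGG", "TaxonA"),
    ("ACGTACGAACGTAAGG", "TaxonA"),
    ("TTGGCCAATTGGCCAA", "TaxonB"),
    ("TTGGCCTATTGGCCAA", "TaxonB"),
]
model = train(training, k=4)
label_a = classify("ACGTACGTAAG", model, k=4)
label_b = classify("GGCCAATTGG", model, k=4)
```

Because scoring only requires word lookups rather than alignments, this style of classifier scales well to large read sets, which is in line with the speed advantage over BLASTN reported in the abstract.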
Zhang Hua; Kuan, Wang; Song, Jian; Zhang, Yong; Huang, Ming; Huang, Jian; Zhu, Jing; Huang, Shan; Wang, Meng
2016-03-01
This paper used excitation-emission matrix spectroscopy (EEMs) to probe the fluorescence properties of dissolved organic matter (DOM) in the overlying water under different dissolved oxygen (DO) conditions, investigating the relationship between protein-like fluorescence intensity and total nitrogen concentration. The resulting fluorescence spectra revealed three protein-like components (high-excitation wavelength tyrosine, low-excitation wavelength tyrosine, low-excitation wavelength tryptophan) and two fulvic-like components (ultraviolet fulvic-like components, visible fulvic-like components) in the overlying water. Moreover, the protein-like components were dominant in the overlying water's DOM. The fluorescence intensity of the protein-like components decreased significantly after aeration. Two of the protein-like components (the low-excitation wavelength tyrosine and the low-excitation wavelength tryptophan) were more susceptible to microbial degradation than the high-excitation wavelength tyrosine. In contrast, the ultraviolet and visible fulvic-like fluorescence intensity increased with increasing DO concentration, indicating that the fulvic-like components were part of the refractory organics. The fluorescence indices of the DOM in the overlying water were between 1.65 and 1.80, suggesting that the sources of the DOM were related to terrigenous sediments and microbial metabolic processes, with the primary source being the contribution from microbial metabolism. The fluorescence indices increased with increasing DO, which showed that microbial biomass and microbial activity gradually increased with increasing DO while microbial metabolism also improved, which in turn increased the biogenic components in the overlying water.
The fluorescence intensity of the high-excitation wavelength tyrosine peak A showed a good linear relationship with the total nitrogen concentration at higher DO concentrations of 2.5, 3.5, and 5.5 mg·L⁻¹, with r² being 0.956, 0.946, and 0.953, respectively. This study demonstrated that excitation-emission matrix spectroscopy can distinguish the transformation characteristics of the DOM and identify the linear relationship between the fluorescence intensity of the high-excitation wavelength tyrosine peak A and total nitrogen concentration, thus providing a quick and effective technique and theoretical support for river water monitoring and water restoration.
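The abstract above does not define its fluorescence index. A commonly used definition is the ratio of emission intensity at 450 nm to that at 500 nm for excitation at 370 nm, and the sketch below assumes that definition. The function names and the emission scan are hypothetical and synthetic; linear interpolation is used in case the instrument grid does not land exactly on 450 or 500 nm.

```python
def interpolate(wavelengths, intensities, target):
    """Linearly interpolate intensity at `target` nm from an ascending emission scan."""
    points = list(zip(wavelengths, intensities))
    for (w0, i0), (w1, i1) in zip(points, points[1:]):
        if w0 <= target <= w1:
            return i0 + (i1 - i0) * (target - w0) / (w1 - w0)
    raise ValueError("target wavelength outside the scanned range")


def fluorescence_index(em_wavelengths, intensities):
    """FI = I(Em 450 nm) / I(Em 500 nm), for a scan assumed to be taken at Ex 370 nm."""
    i450 = interpolate(em_wavelengths, intensities, 450.0)
    i500 = interpolate(em_wavelengths, intensities, 500.0)
    return i450 / i500


# Synthetic emission scan at Ex 370 nm (NOT data from the study).
em_nm = [440.0, 460.0, 480.0, 500.0, 520.0]
signal = [80.0, 90.0, 70.0, 50.0, 35.0]
fi = fluorescence_index(em_nm, signal)
```

Under this assumed definition, lower index values are usually read as more terrestrially derived material and higher values as more microbially derived material, which matches the way the abstract interprets its 1.65 to 1.80 range.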
Frank, Simon; Goeppert, Nadine; Goldscheider, Nico
2018-02-15
Karst springs, especially in alpine regions, are important for drinking water supply but also vulnerable to contamination, especially after rainfall events. This high variability of water quality requires rapid quantification of contamination parameters. Here, we used a fluorescence-based multi-parameter approach to characterize the dynamics of organic carbon, faecal bacteria, and particles at three alpine karst springs. We used excitation emission matrices (EEMs) to identify fluorescent dissolved organic material (FDOM). At the first system, peak A fluorescence and total organic carbon (TOC) were strongly correlated (Spearman's r_s of 0.949), indicating that a large part of the organic matter is related to humic-like substances. Protein-like fluorescence and cultivation-based determination of coliform bacteria also had a significant correlation with r_s = 0.734, indicating that protein-like fluorescence is directly related to faecal pollution. At the second system, which has two spring outlets, the absolute values of all measured water-quality parameters were lower; there was a significant correlation between TOC and humic-like fluorescence (r_s = 0.588-0.689) but coliform bacteria and protein-like fluorescence at these two springs were not correlated. Additionally, there was a strong correlation (r_s = 0.571-0.647) between small particle fractions (1.0 and 2.0 μm), a secondary turbidity peak and bacteria. At one of these springs, discharge was constant despite the reaction of all other parameters to the rainfall event.
Our results demonstrated that i) all three springs showed fast and marked responses of all investigated water-quality parameters after rain events; ii) a constant discharge does not necessarily mean constant water quality; iii) at high contamination levels, protein-like fluorescence is a good indicator of bacterial contamination, while at low contamination levels no correlation between protein-like fluorescence and bacterial values was detected; and iv) a combination of fluorescence measurements and particle-size analysis is a promising approach for a rapid assessment of organic contamination, especially relative to time-consuming conventional bacterial determination methods. Copyright © 2017 Elsevier B.V. All rights reserved.
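Spearman's rank correlation, the statistic reported in the abstract above, is the Pearson correlation computed on ranks, with tied values given their average rank. The sketch below is a minimal illustration using hypothetical function names and made-up numbers, not the spring data.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def spearman(xs, ys):
    """Spearman's r_s: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Made-up paired observations (e.g. a fluorescence signal and a bacterial count).
signal = [1, 2, 3, 4, 5]
counts = [5, 6, 7, 8, 7]
r_s = spearman(signal, counts)
```

Because only ranks enter the calculation, the statistic captures monotonic rather than strictly linear association, which suits parameters such as bacterial counts that can span orders of magnitude during a rain event.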
Optical and size characterization of dissolved organic matter from the lower Yukon River
NASA Astrophysics Data System (ADS)
Guo, L.; Lin, H.
2017-12-01
The Arctic rivers have experienced significant climate and environmental changes over the last several decades and their export fluxes and environmental fate of dissolved organic matter (DOM) have received considerable attention. Monthly or bimonthly water samples were collected from the Yukon River, one of the Arctic rivers, between July 2004 and September 2005 for size fractionation to isolate low-molecular-weight (LMW, <1 kDa) and high-molecular-weight (HMW, >1 kDa) DOM. The freeze-dried HMW-DOM was then characterized for its optical properties using fluorescence spectroscopy and for its colloidal size spectra using asymmetrical flow field-flow fractionation techniques. Ratios of biological index (BIX) to humification index (HIX) show a seasonal change, with lower values in river open seasons and higher values under the ice, and the influence of river discharge. Three major fluorescence DOM components were identified, including two humic-like components (Ex/Em at 260/480 nm and 250/420 nm, respectively) and one protein-like component (Ex/Em=250/330). The ratio of protein-like to humic-like components was broadly correlated with discharge, with low values during spring freshet and high values under the ice. The relatively high protein-like/humic-like ratio during the ice-covered season suggested sources from macro-organisms and/or ice-algae. Both protein-like and humic-like colloidal fluorophores were partitioned mostly in the 1-5 kDa size fraction, although the protein-like fluorophores in some samples also included larger colloidal sizes. The relationship between chemical/biological reactivity and size/optical characteristics of DOM needs to be further investigated.
Maqbool, Tahir; Quang, Viet Ly; Cho, Jinwoo; Hur, Jin
2016-06-01
In this study, we successfully tracked the dynamic changes in different constituents of bound extracellular polymeric substances (bEPS), soluble microbial products (SMP), and permeate during the operation of bench scale membrane bioreactors (MBRs) via fluorescence excitation-emission matrix (EEM) combined with parallel factor analysis (PARAFAC). Three fluorescent groups were identified, including two protein-like components (tryptophan-like C1 and tyrosine-like C2) and one microbial humic-like component (C3). In bEPS, protein-like components were consistently more dominant than C3 during the MBR operation, while their relative abundance in SMP depended on aeration intensities. C1 of bEPS exhibited a linear correlation (R² = 0.738; p<0.01) with bEPS amounts in sludge, and C2 was closely related to the stability of sludge. The protein-like components were more responsible for membrane fouling. Our study suggests that EEM-PARAFAC can be a promising monitoring tool to provide further insight into process evaluation and membrane fouling during MBR operation. Copyright © 2016 Elsevier Ltd. All rights reserved.
Bokulich, Nicholas A; Kaehler, Benjamin D; Rideout, Jai Ram; Dillon, Matthew; Bolyen, Evan; Knight, Rob; Huttley, Gavin A; Gregory Caporaso, J
2018-05-17
Taxonomic classification of marker-gene sequences is an important step in microbiome analysis. We present q2-feature-classifier ( https://github.com/qiime2/q2-feature-classifier ), a QIIME 2 plugin containing several novel machine-learning and alignment-based methods for taxonomy classification. We evaluated and optimized several commonly used classification methods implemented in QIIME 1 (RDP, BLAST, UCLUST, and SortMeRNA) and several new methods implemented in QIIME 2 (a scikit-learn naive Bayes machine-learning classifier, and alignment-based taxonomy consensus methods based on VSEARCH, and BLAST+) for classification of bacterial 16S rRNA and fungal ITS marker-gene amplicon sequence data. The naive-Bayes, BLAST+-based, and VSEARCH-based classifiers implemented in QIIME 2 meet or exceed the species-level accuracy of other commonly used methods designed for classification of marker gene sequences that were evaluated in this work. These evaluations, based on 19 mock communities and error-free sequence simulations, including classification of simulated "novel" marker-gene sequences, are available in our extensible benchmarking framework, tax-credit ( https://github.com/caporaso-lab/tax-credit-data ). Our results illustrate the importance of parameter tuning for optimizing classifier performance, and we make recommendations regarding parameter choices for these classifiers under a range of standard operating conditions. q2-feature-classifier and tax-credit are both free, open-source, BSD-licensed packages available on GitHub.
Song, Fanhao; Wu, Fengchang; Guo, Fei; Wang, Hao; Feng, Weiying; Zhou, Min; Deng, Yanghui; Bai, Yingchen; Xing, Baoshan; Giesy, John P
2017-12-15
In aquatic environments, pH can control environmental behaviors of fulvic acid (FA) via regulating hydrolysis of functional groups. Sub-fractions of FA, eluted using pyrophosphate buffers with initial pHs of 3.0 (FA3), 5.0 (FA5), 7.0 (FA7), 9.0 (FA9) and 13.0 (FA13), were used to explore interactions between the various, operationally defined, FA fractions and protons, by use of EEM-PARAFAC analysis. Splitting of peaks (FA3 and FA13), merging of peaks (FA7), disappearance of peaks (FA9 and FA13), and red/blue-shifting of peaks were observed during fluorescence titration. Fulvic-like components were identified from FA3-FA13, and protein-like components were observed in fractions FA9 and FA13. Three primary compounds (carboxylic-like, phenolic-like, and protein-like chromophores) in PARAFAC components were distinguished based on acid-base properties. Dissociation constants (pKa) for fulvic-like components with protons ranged from 2.43 to 4.13 at acidic pH and from 9.95 to 11.27 at basic pH. These results might be due to protonation of di-carboxylate and phenolic functional groups. At basic pH, pKa values of protein-like components (9.77-10.13) were similar to those of amino acids. However, at acidic pH, pKa values of protein-like components, which ranged from 3.33 to 4.22, were 1-2 units greater than those of amino acids. The results presented here will benefit understanding of environmental behaviors of FA, as well as interactions of FA with environmental contaminants. Copyright © 2017 Elsevier B.V. All rights reserved.
Ji, Fang-ying; Li, Si; Zhou, Guang-ming; Yu, Dan-ni; Wang, Tu-jin; Cao, Lin; Tan, Xue-mei; Yang, Da-cheng; Zhou, Xiao-yi
2010-01-01
Fluorescence emission and excitation-emission matrix (EEM) techniques were used to characterize the dissolved organic matter (DOM) in the waters of the Yangtze River and the Jialing River around the Chongqing urban area from April to August 2008. The sampling period covered the Wenchuan earthquake in May, the effects of the Tangjiashan barrier lake in June, and the summer high-water period in July and August. From the EEMs obtained at each sampling station and time, the composition, distribution and changing features of the DOM in the two rivers were investigated in combination with environmental parameters of the water samples, such as pH, DO and DOC, and with EEM fingerprint features such as f(450/500). Finally, the bio-environmental behavior of the three types of fluorescence peaks (humic-like, fulvic-like, and protein-like) in the EEMs from the five sampling stations over the five months is described in detail. The results show that, during the five-month investigation, the fluorescence peaks in the two rivers around the Chongqing urban area were mainly composed of two types of fluorophores, humic-like and protein-like. The protein-like peak values in the Jialing River were higher than those in the Yangtze River, and all the fluorescence peaks in the two rivers decreased to some extent after the rivers join at the Chun Tan sampling station. The protein-like peak, which mainly originated from terrestrial sources, was notably higher in the period after the "5.12" earthquake (May and June) and in the high-water period, but its intensity decreased markedly where the waters of the two rivers join at the Chao Tianmen and Chun Tan sampling stations.
Valenzuela-González, Fabiola; Martínez-Porchas, Marcel; Villalpando-Canchola, Enrique; Vargas-Albores, Francisco
2016-03-01
Ultrafast-metagenomic sequence classification using exact alignments (Kraken) is a novel approach to classify 16S rDNA sequences. The classifier is based on mapping short sequences to the lowest ancestor and performing alignments to form subtrees with specific weights in each taxon node. This study aimed to evaluate the classification performance of Kraken with long 16S rDNA random environmental sequences produced by cloning and then Sanger sequenced. A total of 480 clones were isolated and expanded, and 264 of these clones formed contigs (1352 ± 153 bp). The same sequences were analyzed using the Ribosomal Database Project (RDP) classifier. Deeper classification performance was achieved by Kraken than by the RDP: 73% of the contigs were classified up to the species or variety levels, whereas 67% of these contigs were classified no further than the genus level by the RDP. The results also demonstrated that unassembled sequences analyzed by Kraken provide similar or even deeper information. Moreover, sequences that did not form contigs, which are usually discarded by other programs, provided meaningful information when analyzed by Kraken. Finally, it appears that the assembly step for Sanger sequences can be eliminated when using Kraken. Kraken accumulates the information from both sequence strands, providing additional elements for the classification. In conclusion, the results demonstrate that Kraken is an excellent choice for the taxonomic assignment of sequences obtained by Sanger sequencing or by third generation sequencing, the main goal of which is to generate longer sequences. Copyright © 2016 Elsevier B.V. All rights reserved.
Yu, Min-Da; He, Xiao-Song; Xi, Bei-Dou; Gao, Ru-Tai; Zhao, Xian-Wei; Zhang, Hui; Huang, Cai-Hong; Tan, Wenbing
2018-03-01
Fluorescence excitation-emission matrix (EEM) spectroscopy combined with principal component analysis (PCA) and parallel factor analysis (PARAFAC) were used to investigate the compositional characteristics of dissolved and particulate/colloidal organic matter and its correlations with nitrogen, phosphorus, and heavy metals in an effluent-dominated stream, Northern China. The results showed that dissolved organic matter (DOM) was comprised of fulvic-like, humic-like, and protein-like components in the water samples, and fulvic-like substances were the main fraction of DOM among them. Particulate/colloidal organic matter (PcOM) consisted of fulvic-like and protein-like matter. Fulvic-like substances existed in the larger molecular form in PcOM, and they comprised a large amount of nitrogen and polar functional groups. On the other hand, protein-like components in PcOM were low in benzene ring and bound to heavy metals. It could be concluded that nitrogen, phosphorus, and heavy metals in effluent had an effect on the compositional characteristics of natural DOM and PcOM, which may deepen our understanding about the environmental behaviors of organic matter in effluent.
Zhao, Jie
2010-01-01
Arabinogalactan proteins (AGPs) comprise a family of hydroxyproline-rich glycoproteins that are implicated in plant growth and development. In this study, 69 AGPs are identified from the rice genome, including 13 classical AGPs, 15 arabinogalactan (AG) peptides, three non-classical AGPs, three early nodulin-like AGPs (eNod-like AGPs), eight non-specific lipid transfer protein-like AGPs (nsLTP-like AGPs), and 27 fasciclin-like AGPs (FLAs). The results from expressed sequence tags, microarrays, and massively parallel signature sequencing tags are used to analyse the expression of AGP-encoding genes, which is confirmed by real-time PCR. The results reveal that several rice AGP-encoding genes are predominantly expressed in anthers and display differential expression patterns in response to abscisic acid, gibberellic acid, and abiotic stresses. Based on the results obtained from this analysis, an attempt has been made to link the protein structures and expression patterns of rice AGP-encoding genes to their functions. Taken together, the genome-wide identification and expression analysis of the rice AGP gene family might facilitate further functional studies of rice AGPs. PMID:20423940
NASA Astrophysics Data System (ADS)
Raczkowska, A.; Kowalczuk, P.; Sagan, S.; Zabłocka, M.; Pavlov, A. K.; Granskog, M. A.; Stedmon, C. A.
2016-02-01
Observations of Colored Dissolved Organic Matter absorption (CDOM) and fluorescence (FDOM) from water samples and an in situ fluorometer and of Inherent Optical Properties (IOP; light absorption and scattering) were carried out along a section across Fram Strait at 79°N. A 3 channel Wetlabs Wetstar fluorometer, with channels for humic- and protein-like DOM, was deployed and used to assess the distribution of different FDOM fractions. A relationship between fluorescence intensity of the protein-like fraction of FDOM and chlorophyll a fluorescence was found and indicated the importance of phytoplankton biomass in West Spitsbergen Current waters as a significant source of protein-like FDOM. East Greenland Current waters had low concentrations of chlorophyll a and were characterized by high humic-like FDOM fluorescence. An empirical relationship between humic-like FDOM fluorescence intensity and CDOM absorption was derived and confirms the dominance of terrigenous-like CDOM in the composition of DOM in the East Greenland Current. These high resolution profile data offer a simple approach to fractionate the contribution of these two DOM sources to DOM across the Fram Strait and may help refine estimates of DOC fluxes in and out of the Arctic through this region.
Toward understanding the role of individual fluorescent components in DOM-metal binding.
Wu, Jun; Zhang, Hua; Yao, Qi-Sheng; Shao, Li-Ming; He, Pin-Jing
2012-05-15
Knowledge on the function of individual fractions in dissolved organic matter (DOM) is essential for understanding the impact of DOM on metal speciation and migration. Herein, fluorescence excitation-emission matrix quenching and parallel factor (PARAFAC) analysis were adopted for bulk DOM and chemically isolated fractions from landfill leachate, i.e., humic acids (HA), fulvic acids and hydrophilic (HyI) fraction, to elucidate the role of individual fluorescent components in metal binding (Cu(II) and Cd(II)). Three components were identified by PARAFAC model, including one humic substance (HS)-like, one protein-like and one component highly correlated with the HyI fraction. Among them, the HS-like and protein-like components were responsible for Cu(II) binding, while the protein-like component was the only fraction involved in Cd(II) complexation. It was further identified that the slight quenching effect of HA fraction by Cd(II) was induced by the presence of proteinaceous materials in HA. Fluorescent substances in the HyI fraction of landfill leachate did not play as important a role as HS did. Therefore, it was suggested that the potential risk of aged leachate (more humified) as a carrier of heavy metal should not be overlooked. Copyright © 2012 Elsevier B.V. All rights reserved.
He, Xiao-Song; Fan, Qin-Dong
2016-11-01
To investigate the effect of landfill leachate on the characteristics of organic matter in groundwater, groundwater samples were collected near and in a landfill site, and dissolved organic matter (DOM) was extracted from the groundwater samples and characterized by excitation-emission matrix (EEM) fluorescence spectra combined with fluorescence regional integration (FRI) and self-organizing map (SOM). The results showed that the groundwater DOM comprised humic-, fulvic-, and protein-like substances. The concentration of humic-like matter showed no obvious variation for all groundwater except the sample collected in the landfill site. Fulvic-like substance content decreased when the groundwater was polluted by landfill leachates. There were two kinds of protein-like matter in the groundwater. One kind was bound to humic-like substances, and its content did not change along with groundwater pollution. However, the other kind was present as "free" molecules or else bound in proteins, and its concentration increased significantly when the groundwater was polluted by landfill leachates. Both the FRI and SOM methods can characterize the composition and evolution of DOM in the groundwater. However, the SOM analysis can identify whether protein-like moieties were bound to humic-like matter.
Three-dimensional fluorescence analysis of chernozem humic acids and their electrophoretic fractions
NASA Astrophysics Data System (ADS)
Trubetskoi, O. A.; Trubetskaya, O. E.
2017-09-01
Polyacrylamide gel electrophoresis in combination with size-exclusion chromatography (SEC-PAGE) has been used to obtain stable electrophoretic fractions of different molecular size (MS) from chernozem humic acids (HAs). Three-dimensional fluorescence charts of chernozem HAs and their fractions have been obtained for the first time, and all fluorescence excitation-emission maxima have been identified in the excitation wavelength range of 250-500 nm. It has been found that fractionation by the SEC-PAGE method results in a nonuniform distribution of protein- and humin-like fluorescence of the original HA preparation among the electrophoretic fractions. The electrophoretic fractions of the highest and medium MSs have only the main protein-like fluorescence maximum and traces of humin-like fluorescence. In the electrophoretic fraction of the lowest MS, the intensity of protein-like fluorescence is low, but the major part of humin-like fluorescence is localized there. Relationships between the intensity of protein-like fluorescence and the weight distribution of amino acids have been revealed, as well as between the degree of aromaticity and the intensity of humin-like fluorescence in electrophoretic fractions of different MSs. The obtained relationships can be useful in the interpretation of the spatial structural organization and ecological functions of soil HAs.
Zhang, Shurong; Bai, Yijuan; Wen, Xin; Ding, Aizhong; Zhi, Jianhui
2018-04-22
Human activities impose important disturbances on both organic and inorganic chemistry in fluvial systems. In this study, we investigated the intra-annual and downstream variations of dissolved organic carbon (DOC), dissolved organic matter (DOM) excitation-emission matrix fluorescence (EEM) with parallel factor analysis (PARAFAC), major ions, and dissolved inorganic nitrogen (DIN) species in a mountainous tributary of the Yellow River, China. Both DOM quantity and quality, as represented by DOC and DOM fluorescence respectively, changed spatially and seasonally in the studied region. The fluorescence intensity of the tryptophan-like component (C3) was found to be much higher in the populated downstream regions than in the undisturbed forested upstream regions. Seasonally, stronger fluorescence intensity of protein-like components (C3 and C4) was observed in the low-flow period (December) and in the medium-flow period (March) than in the high-flow period (May), particularly for the downstream reaches, reflecting the dominant impacts of wastewater pollution in the downstream regions. In contrast to the protein-like fluorescence, humic-like fluorescence components C1 and C2 exhibited distinctly higher intensity in the high-flow period with smaller spatial variation, indicating a strong flushing effect of increasing water discharge on terrestrial-sourced humic-like materials in the high-flow period. Pollution-affected dissolved inorganic ions, particularly Na+, Cl-, and NH4+-N, showed spatial and seasonal variations similar to those of the protein-like fluorescence of DOM. The significant positive correlations between protein-like fluorescence of DOM and pollution-affected ions, particularly Na+, Cl-, and NH4+-N, suggested that there were similar pollution sources and transportation pathways of both inorganic and organic pollutants in the region. The combination of DOM fluorescence properties and inorganic ions could provide an important reference for the pollution source characterization and river basin management.
Mycofier: a new machine learning-based classifier for fungal ITS sequences.
Delgado-Serrano, Luisa; Restrepo, Silvia; Bustos, Jose Ricardo; Zambrano, Maria Mercedes; Anzola, Juan Manuel
2016-08-11
The taxonomic and phylogenetic classification based on sequence analysis of the ITS1 genomic region has become a crucial component of fungal ecology and diversity studies. Nowadays, there is no accurate alignment-free classification tool for fungal ITS1 sequences for large environmental surveys. This study describes the development of a machine learning-based classifier for the taxonomical assignment of fungal ITS1 sequences at the genus level. A fungal ITS1 sequence database was built using curated data. Training and test sets were generated from it. A Naïve Bayesian classifier was built using features from the primary sequence with an accuracy of 87 % in the classification at the genus level. The final model was based on a Naïve Bayes algorithm using ITS1 sequences from 510 fungal genera. This classifier, denoted as Mycofier, provides similar classification accuracy compared to BLASTN, but the database used for the classification contains curated data and the tool, independent of alignment, is more efficient and contributes to the field, given the lack of an accurate classification tool for large data from fungal ITS1 sequences. The software and source code for Mycofier are freely available at https://github.com/ldelgado-serrano/mycofier.git .
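As a rough illustration of the alignment-free idea in the abstract above, the sketch below trains a multinomial Naive Bayes model on k-mer counts and assigns a query sequence to a genus. It is not Mycofier's implementation: the k-mer size, the Laplace smoothing, the toy sequences, the genus labels, and the function names `kmer_counts`, `train_nb`, and `classify` are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

def kmer_counts(seq, k=3):
    """Count overlapping k-mers in a DNA sequence (k=3 is an assumed setting)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def train_nb(labelled_seqs, k=3):
    """Build per-genus k-mer counts and log priors from (genus, sequence) pairs."""
    counts = defaultdict(Counter)
    n_seqs = Counter()
    for genus, seq in labelled_seqs:
        counts[genus].update(kmer_counts(seq, k))
        n_seqs[genus] += 1
    total = sum(n_seqs.values())
    priors = {g: math.log(n / total) for g, n in n_seqs.items()}
    return {"k": k, "counts": counts, "priors": priors}

def classify(model, seq):
    """Return the genus with the highest Laplace-smoothed log posterior."""
    k = model["k"]
    vocab = 4 ** k  # number of possible DNA k-mers
    query = kmer_counts(seq, k)
    best, best_score = None, -math.inf
    for genus, prior in model["priors"].items():
        c = model["counts"][genus]
        denom = sum(c.values()) + vocab
        score = prior + sum(
            n * math.log((c[kmer] + 1) / denom) for kmer, n in query.items()
        )
        if score > best_score:
            best, best_score = genus, score
    return best

# Toy training data; genus names and sequences are invented for illustration.
training = [
    ("GenusA", "ACGTACGTACGTACGTACGT"),
    ("GenusA", "ACGTACGAACGTACGTACGA"),
    ("GenusB", "TTGGCCTTGGCCTTGGCCTT"),
    ("GenusB", "TTGGCCTTGGACTTGGCCTT"),
]
model = train_nb(training)
print(classify(model, "ACGTACGTACGA"))
print(classify(model, "GGCCTTGGCCTT"))
```

Because the model scores k-mer composition only, no alignment against reference sequences is needed at classification time, which is the efficiency argument the abstract makes.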
ElGokhy, Sherin M; ElHefnawi, Mahmoud; Shoukry, Amin
2014-05-06
MicroRNAs (miRNAs) are endogenous ∼22 nt RNAs that are identified in many species as powerful regulators of gene expression. Experimental identification of miRNAs is still slow since miRNAs are difficult to isolate by cloning due to their low expression, low stability, tissue specificity and the high cost of the cloning procedure. Thus, computational identification of miRNAs from genomic sequences provides a valuable complement to cloning. Different approaches for identification of miRNAs have been proposed based on homology, thermodynamic parameters, and cross-species comparisons. The present paper focuses on the integration of miRNA classifiers in a meta-classifier and the identification of miRNAs from metagenomic sequences collected from different environments. An ensemble of classifiers is proposed for miRNA hairpin prediction based on four well-known classifiers (Triplet-SVM, MiPred, Virgo and EumiR), with non-identical features, and which have been trained on different data. Their decisions are combined using a single hidden layer neural network to increase the accuracy of the predictions. Our ensemble classifier achieved 89.3% accuracy, 82.2% F-measure, 74% sensitivity, 97% specificity, 92.5% precision and 88.2% negative predictive value when tested on real miRNA and pseudo sequence data. The area under the receiver operating characteristic curve of our classifier is 0.9, which represents a high performance index. The proposed classifier yields a significant performance improvement relative to Triplet-SVM, Virgo and EumiR and a minor refinement over MiPred. The developed ensemble classifier is used for miRNA prediction in mine drainage, groundwater and marine metagenomic sequences downloaded from the NCBI Sequence Read Archive. By consulting the miRBase repository, 179 miRNAs have been identified as highly probable miRNAs. Our new approach could thus be used for mining metagenomic sequences and finding new and homologous miRNAs.
The paper investigates a computational tool for miRNA prediction in genomic or metagenomic data. It has been applied to three metagenomic samples from different environments (mine drainage, groundwater and marine metagenomic sequences). The prediction results provide a set of highly promising candidate miRNA hairpins for cloning prediction methods. The ensemble predictions include pre-miRNA candidates that were validated against miRBase although some of the base classifiers did not recognize them.
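The performance figures quoted in this abstract are related to one another through the confusion matrix, and a short sketch makes that relationship explicit. The counts below are hypothetical: they were chosen for this example because they reproduce the reported percentages, and they are not the study's actual test-set sizes. The function name `classifier_metrics` is likewise an assumption.

```python
def classifier_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)  # also called recall
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),
        "precision": precision,
        "npv": tn / (tn + fn),
        # F-measure is the harmonic mean of precision and sensitivity.
        "f_measure": 2 * precision * sensitivity / (precision + sensitivity),
    }

# Hypothetical counts (100 positives, 200 negatives), not taken from the paper.
counts = dict(tp=74, fp=6, tn=194, fn=26)
for name, value in classifier_metrics(**counts).items():
    print(f"{name}: {100 * value:.1f}%")
```

With these assumed counts the function yields 89.3% accuracy, 74.0% sensitivity, 97.0% specificity, 92.5% precision, 88.2% negative predictive value and 82.2% F-measure, so the reported values are mutually consistent for a test set with roughly one positive per two negatives.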
A Bioinformatics Classifier and Database for Heme-Copper Oxygen Reductases
Sousa, Filipa L.; Alves, Renato J.; Pereira-Leal, José B.; Teixeira, Miguel; Pereira, Manuela M.
2011-01-01
Background: Heme-copper oxygen reductases (HCOs) are the last enzymatic complexes of most aerobic respiratory chains, reducing dioxygen to water and translocating up to four protons across the inner mitochondrial membrane (eukaryotes) or cytoplasmatic membrane (prokaryotes). The number of completely sequenced genomes is expanding exponentially, and concomitantly, the number and taxonomic distribution of HCO sequences. These enzymes were initially classified into three different types, a classification that has recently been challenged. Methodology: We reanalyzed the classification scheme and developed a new bioinformatics classifier for the HCO and nitric oxide reductases (NOR), which we benchmark against a manually derived gold standard sequence set. It is able to classify any given sequence of subunit I from HCO and NOR with a global recall and precision both of 99.8%. We use this tool to classify this protein family in 552 completely sequenced genomes. Conclusions: We concluded that the new and broader data set supports three functional and evolutionary groups of HCOs. Homology between NORs and HCOs is shown, and the closest relationship of NORs is demonstrated to be with C-type HCOs. We established and made available a classification web tool and an integrated Heme-Copper Oxygen reductase and NOR protein database (www.evocell.org/hco). PMID:21559461
Wang, Zhiwei; Wu, Zhichao; Tang, Shujuan
2009-04-01
Three-dimensional excitation-emission matrix (EEM) fluorescence spectroscopy was employed to characterize dissolved organic matter (DOM) in a submerged membrane bioreactor (MBR). Three fluorescence peaks could be identified from the EEM fluorescence spectra of the DOM samples in the MBR. Two peaks were associated with the protein-like fluorophores, and the third was related to the visible humic acid-like fluorophores. Only two main peaks were observed in the EEM fluorescence spectra of the extracellular polymeric substance (EPS) samples, which were due to the fluorescence of protein-like and humic acid-like matters, respectively. However, the EEM fluorescence spectra of membrane foulants were observed to have three peaks. It was also found that the dominant fluorescence substances in membrane foulants were protein-like substances, which might be due to the retention of proteins in the DOM and/or EPS in the MBR by the fine pores of the membrane. Quantitative analysis of the fluorescence spectra including peak locations, fluorescence intensity, and different peak intensity ratios and the fluorescence regional integration (FRI) analysis were also carried out in order to better understand the similarities and differences among the EEM spectra of the DOM, EPS, and membrane foulant samples and to further provide an insight into membrane fouling caused by the fluorescence substances in the DOM in submerged MBRs.
Pandey, Gaurav; Pandey, Om P; Rogers, Angela J; Ahsen, Mehmet E; Hoffman, Gabriel E; Raby, Benjamin A; Weiss, Scott T; Schadt, Eric E; Bunyavanich, Supinda
2018-06-11
Asthma is a common, under-diagnosed disease affecting all ages. We sought to identify a nasal brush-based classifier of mild/moderate asthma. 190 subjects with mild/moderate asthma and controls underwent nasal brushing and RNA sequencing of nasal samples. A machine learning-based pipeline identified an asthma classifier consisting of 90 genes interpreted via an L2-regularized logistic regression classification model. This classifier performed with strong predictive value and sensitivity across eight test sets, including (1) a test set of independent asthmatic and control subjects profiled by RNA sequencing (positive and negative predictive values of 1.00 and 0.96, respectively; AUC of 0.994), (2) two independent case-control cohorts of asthma profiled by microarray, and (3) five cohorts with other respiratory conditions (allergic rhinitis, upper respiratory infection, cystic fibrosis, smoking), where the classifier had a low to zero misclassification rate. Following validation in large, prospective cohorts, this classifier could be developed into a nasal biomarker of asthma.
Ye, Wenwu; Wang, Yang; Shen, Danyu; Li, Delong; Pu, Tianhuizi; Jiang, Zide; Zhang, Zhengguang; Zheng, Xiaobo; Tyler, Brett M; Wang, Yuanchao
2016-07-01
On the basis of its downy mildew-like morphology, the litchi downy blight pathogen was previously named Peronophythora litchii. Recently, however, it was proposed to transfer this pathogen to Phytophthora clade 4. To better characterize this unusual oomycete species and important fruit pathogen, we obtained the genome sequence of Phytophthora litchii and compared it to those from other oomycete species. P. litchii has a small genome with tightly spaced genes. On the basis of a multilocus phylogenetic analysis, the placement of P. litchii in the genus Phytophthora is strongly supported. Effector proteins predicted included 245 RxLR, 30 necrosis-and-ethylene-inducing protein-like, and 14 crinkler proteins. The motifs, phylogenies, and activities of these effectors were typical for a Phytophthora species. However, like the genome features of the analyzed downy mildews, P. litchii exhibited a streamlined genome with a relatively small number of genes in both core and species-specific protein families. The low GC content and slight codon preferences of P. litchii sequences were similar to those of the analyzed downy mildews and a subset of Phytophthora species. Taken together, these observations suggest that P. litchii is a Phytophthora pathogen that is in the process of acquiring downy mildew-like genomic and morphological features. Thus P. litchii may provide a novel model for investigating morphological development and genomic adaptation in oomycete pathogens.
Schmidt-Chanasit, Jonas; Bialonski, Alexandra; Heinemann, Patrick; Ulrich, Rainer G; Günther, Stephan; Rabenau, Holger F; Doerr, Hans Wilhelm
2010-07-01
Recently, two different herpes simplex virus type 2 (HSV-2) clades (A and B) were described on the basis of DNA sequence data of the glycoprotein E (gE), G (gG) and I (gI) genes. The objective was to type the circulating HSV-2 wild-type strains in Germany by a novel approach and to monitor potential changes in the molecular epidemiology between 1997 and 2008. A total of 64 clinical HSV-2 isolates were analyzed by a novel approach using the DNA sequences of the complete open reading frames of glycoprotein B (gB) and gG. Recombination analysis of the gB and gG gene sequences was performed to reveal intragenic recombinants. Based on the phylogenetic analysis of the gB coding DNA sequence, 8 of 64 (12%) isolates were classified as clade A strains and 56 of 64 (88%) isolates were classified as clade B strains. Analysis of the gG coding DNA sequence classified 4 (6%) isolates as clade A strains and 60 (94%) isolates as clade B strains. In comparison, the 8 isolates classified as clade A strains using the gB sequence data were classified as clade B strains when using the gG coding DNA sequence, suggesting intergenic recombination events. Intragenic recombination events were not detected. The first molecular survey of clinical HSV-2 isolates from Germany demonstrated the circulation of clade A and B strains and of intergenic recombinants over a period of 12 years. Copyright (c) 2010 Elsevier B.V. All rights reserved.
Chen, Yen-Kuang; Li, Kuo-Bin
2013-02-07
The type information of un-annotated membrane proteins provides an important hint for their biological functions. The experimental determination of membrane protein types, despite being more accurate and reliable, is not always feasible due to the costly laboratory procedures, thereby creating a need for the development of bioinformatics methods. This article describes a novel computational classifier for the prediction of membrane protein types using protein sequences. The classifier, comprising a collection of one-versus-one support vector machines, makes use of the following sequence attributes: (1) the cationic patch sizes, the orientation, and the topology of transmembrane segments; (2) the amino acid physicochemical properties; (3) the presence of signal peptides or anchors; and (4) the specific protein motifs. A new voting scheme was implemented to cope with the multi-class prediction. Both the training and the testing sequences were collected from SwissProt. Homologous proteins were removed such that there is no pair of sequences left in the datasets with a sequence identity higher than 40%. The performance of the classifier was evaluated by a Jackknife cross-validation and an independent testing experiment. Results show that the proposed classifier outperforms earlier predictors in prediction accuracy in seven of the eight membrane protein types. The overall accuracy was increased from 78.3% to 88.2%. Unlike earlier approaches, which largely depend on position-specific substitution matrices and amino acid compositions, most of the sequence attributes implemented in the proposed classifier are supported by evidence in the literature. The classifier has been deployed as a web server and can be accessed at http://bsaltools.ym.edu.tw/predmpt. Copyright © 2012 Elsevier Ltd. All rights reserved.
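The abstract does not describe its new voting scheme, so the sketch below shows only the conventional baseline it builds on: plain majority voting over one-versus-one binary classifiers. The class labels, the single toy feature (a count of predicted transmembrane segments), the threshold rules standing in for trained SVMs, and the name `one_vs_one_predict` are all assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

def one_vs_one_predict(classes, pairwise, x):
    """Majority vote over all class pairs; ties go to the class listed first."""
    votes = Counter()
    for a, b in combinations(classes, 2):
        # Each pairwise classifier returns one of its two class labels.
        votes[pairwise[(a, b)](x)] += 1
    return max(classes, key=lambda c: votes[c])

# Hypothetical labels and decision rules; real systems would use trained SVMs.
classes = ["single-pass", "multi-pass", "peripheral"]
pairwise = {
    ("single-pass", "multi-pass"): lambda tm: "multi-pass" if tm >= 2 else "single-pass",
    ("single-pass", "peripheral"): lambda tm: "single-pass" if tm >= 1 else "peripheral",
    ("multi-pass", "peripheral"): lambda tm: "multi-pass" if tm >= 1 else "peripheral",
}

for tm_segments in (0, 1, 7):
    print(tm_segments, one_vs_one_predict(classes, pairwise, tm_segments))
```

With K classes this scheme needs K(K-1)/2 binary classifiers, so the eight membrane protein types mentioned in the abstract would require 28 pairwise models.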
Recognising promoter sequences using an artificial immune system
DOE Office of Scientific and Technical Information (OSTI.GOV)
Cooke, D.E.; Hunt, J.E.
1995-12-31
We have developed an artificial immune system (AIS) which is based on the human immune system. The AIS possesses an adaptive learning mechanism which enables antibodies to emerge which can be used for classification tasks. In this paper, we describe how the AIS has been used to evolve antibodies which can classify promoter containing and promoter negative DNA sequences. The DNA sequences used for teaching were 57 nucleotides in length and contained procaryotic promoters. The system classified previously unseen DNA sequences with an accuracy of approximately 90%.
ERIC Educational Resources Information Center
Alcock, Lara; Simpson, Adrian
2017-01-01
This paper describes a study in which we investigated relationships between defining mathematical concepts--increasing and decreasing infinite sequences--explaining their meanings and classifying consistently with formal definitions. We explored the effect of defining, explaining or studying a definition on subsequent classification, and the…
BASiNET-BiologicAl Sequences NETwork: a case study on coding and non-coding RNAs identification.
Ito, Eric Augusto; Katahira, Isaque; Vicente, Fábio Fernandes da Rocha; Pereira, Luiz Filipe Protasio; Lopes, Fabrício Martins
2018-06-05
With the emergence of Next Generation Sequencing (NGS) technologies, a large volume of sequence data, in particular from de novo sequencing, has been rapidly produced at relatively low cost. In this context, computational tools are increasingly important to assist in the identification of relevant information to understand the functioning of organisms. This work introduces BASiNET, an alignment-free tool for classifying biological sequences based on the feature extraction from complex network measurements. The method first transforms the sequences and represents them as complex networks. Then it extracts topological measures and constructs a feature vector that is used to classify the sequences. The method was evaluated in the classification of coding and non-coding RNAs of 13 species and compared to the CNCI, PLEK and CPC2 methods. BASiNET outperformed all compared methods in all adopted organisms and datasets. BASiNET classified sequences in all organisms with high accuracy and low standard deviation, showing that the method is robust and not biased by the organism. The proposed methodology is implemented as open source in the R language and is freely available for download at https://cran.r-project.org/package=BASiNET.
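The two steps named in this abstract, mapping a sequence to a network and extracting topological measures as a feature vector, can be sketched as follows. This is not the BASiNET package (which is written in R) and does not claim to reproduce its construction: the word length, the step size, the rule that consecutive words are linked, the chosen measures, and the function names are assumptions for this example.

```python
from collections import defaultdict

def sequence_to_network(seq, word=3, step=1):
    """Undirected weighted graph: nodes are words, edges join consecutive words."""
    seq = seq.upper()
    words = [seq[i:i + word] for i in range(0, len(seq) - word + 1, step)]
    edges = defaultdict(int)
    for a, b in zip(words, words[1:]):
        if a != b:  # ignore self-loops
            edges[frozenset((a, b))] += 1
    return set(words), dict(edges)

def topological_features(nodes, edges):
    """A small feature vector of simple network measures."""
    degree = defaultdict(int)
    for pair in edges:
        for node in pair:
            degree[node] += 1
    n, m = len(nodes), len(edges)
    return {
        "nodes": n,
        "edges": m,
        "mean_degree": 2 * m / n if n else 0.0,
        "max_degree": max(degree.values(), default=0),
        "density": 2 * m / (n * (n - 1)) if n > 1 else 0.0,
    }

# Toy sequence invented for illustration.
nodes, edges = sequence_to_network("ATGGCCATGGCC")
print(topological_features(nodes, edges))
```

Feature vectors of this kind, computed per sequence, could then be passed to any standard classifier to separate coding from non-coding RNAs.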
Peng, Mingguo; Li, Huajie; Li, Dongdong; Du, Erdeng; Li, Zhihong
2017-06-01
Carbon nanotubes (CNTs) were utilized to adsorb DOM in micro-polluted water. The characteristics of DOM adsorption on CNTs were investigated based on UV254, TOC, and fluorescence spectrum measurements. Based on PARAFAC (parallel factor) analysis, four fluorescent components were extracted, including one protein-like component (C4) and three humic acid-like components (C1, C2, and C3). The adsorption isotherms, kinetics, and thermodynamics of DOM adsorption on CNTs were further investigated. A Freundlich isotherm model fit the adsorption data well, with high correlation coefficients. As a type of macro-porous and meso-porous adsorbent, CNTs preferentially adsorb humic acid-like substances rather than protein-like substances. Increasing temperature speeds up the adsorption process. The self-organizing map (SOM) analysis further explains the fluorescent properties of water samples. The results provide a new insight into the adsorption behaviour of DOM fluorescent components on CNTs.
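For readers unfamiliar with the Freundlich model mentioned above, it relates the adsorbed amount q to the equilibrium concentration C as q = K_F · C^(1/n), which becomes a straight line after taking logarithms: ln q = ln K_F + (1/n) ln C. The sketch below fits that linearized form by ordinary least squares. The data are synthetic and noise-free, generated from assumed parameters rather than taken from the study, and the function name `fit_freundlich` is hypothetical.

```python
import math

def fit_freundlich(c_eq, q_eq):
    """Fit q = K_F * C**(1/n) via least squares on ln q = ln K_F + (1/n) ln C.

    Returns (K_F, n, r_squared) for the linearized fit.
    """
    xs = [math.log(c) for c in c_eq]
    ys = [math.log(q) for q in q_eq]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                    # slope equals 1/n
    intercept = mean_y - slope * mean_x  # intercept equals ln K_F
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return math.exp(intercept), 1.0 / slope, 1.0 - ss_res / ss_tot

# Synthetic data from assumed parameters (not measured values).
TRUE_KF, TRUE_N = 2.0, 2.5
c_eq = [0.5, 1.0, 2.0, 4.0, 8.0]                    # equilibrium concentration
q_eq = [TRUE_KF * c ** (1 / TRUE_N) for c in c_eq]  # adsorbed amount

k_f, n, r2 = fit_freundlich(c_eq, q_eq)
print(f"K_F = {k_f:.3f}, n = {n:.3f}, R^2 = {r2:.4f}")
```

On real measurements the R^2 of this linearized fit is one common way to report how well the Freundlich model describes the data, which is presumably what the abstract's correlation values refer to.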
Jin, Haibao; Jiao, Fang; Daily, Michael D.; Chen, Yulin; Yan, Feng; Ding, Yan-Huai; Zhang, Xin; Robertson, Ellen J.; Baer, Marcel D.; Chen, Chun-Long
2016-01-01
An ability to develop sequence-defined synthetic polymers that both mimic lipid amphiphilicity for self-assembly of highly stable membrane-mimetic 2D nanomaterials and exhibit protein-like functionality would revolutionize the development of biomimetic membranes. Here we report the assembly of lipid-like peptoids into highly stable, crystalline, free-standing and self-repairing membrane-mimetic 2D nanomaterials through a facile crystallization process. Both experimental and molecular dynamics simulation results show that peptoids assemble into membranes through an anisotropic formation process. We further demonstrated the use of peptoid membranes as a robust platform to incorporate and pattern functional objects through large side-chain diversity and/or co-crystallization approaches. Similar to lipid membranes, peptoid membranes exhibit changes in thickness upon exposure to external stimuli; they can coat surfaces in single layers and self-repair. We anticipate that this new class of membrane-mimetic 2D nanomaterials will provide a robust matrix for development of biomimetic membranes tailored to specific applications. PMID:27402325
NASA Astrophysics Data System (ADS)
Kim, T.; Kwon, E.; Kim, G.
2011-12-01
In order to determine the origin of dissolved organic matter (DOM) in the subterranean estuary (STE), the mixing zone of fresh terrestrial groundwater and recirculating seawater in a coastal permeable aquifer, we conducted water sampling from two STEs with different geological settings: (1) Jeju Island beaches (Hwasun and Samyang), which are composed of volcanic rocks and sandy sediments, and (2) Hampyeong beach, which is located in a large intertidal, sandy flat zone. The distributions of salinity, total hydrolysable amino acids (THAA), dissolved organic carbon (DOC), and colored DOM (CDOM) were measured for groundwater samples in these STEs. In the Hwasun STE, the humic-like peak decreases with increasing salinity, whereas the protein-like peak does not show a clear relationship with salinity. In contrast, in the Samyang STE, both humic-like peak and protein-like peak increase with increasing salinity. These contrasting results indicate that DOM in the Hwasun STE originates mainly from terrestrial inputs, while that in the Samyang STE originates mainly from biological and/or microbial activities. In the Hampyeong STE, we observed good correlations among the biodegradation index, alanine D/L ratios, THAA concentrations, DOC, and CDOM index (both humic-like and protein-like). Together with their geographical distribution patterns, these correlations indicate that DOM in the Hampyeong STE is mainly derived from marine sediments in the course of seawater recirculation. Our study shows that CDOM and amino acids are excellent tracers of DOM in the STE where DOM is derived from diverse sources.
Huang, Linxian; Li, Meilin; Si, Guangchao; Wei, Jinglin; Ngo, Huu Hao; Guo, Wenshan; Xu, Weiying; Du, Bin; Wei, Qin; Wei, Dong
2018-05-18
In the present study, the responses of microbial products in the biosorption process of Cu(II) onto aerobic granular sludge were evaluated by using batch and spectroscopic approaches. Batch experimental data showed that extracellular polymeric substances (EPSs) contributed to Cu(II) removal from an aqueous solution, especially when treating low metal concentrations, whereas soluble microbial products (SMPs) were released under the metal stress during the biosorption process. A three-dimensional excitation-emission matrix (3D-EEM) identified four main fluorescence peaks in the EPS, i.e., tryptophan protein-like, aromatic protein-like, humic-like and fulvic acid-like substances, and their fluorescence intensities decreased gradually in the presence of Cu(II) during the sorption process. Particularly, tryptophan protein-like substances quenched the Cu(II) binding to a much higher extent through a static quenching process with less than one class of binding sites. According to the synchronous fluorescence spectra, the overall fluorescence intensity of released SMP samples showed an increasing trend, to differing degrees, with contact time. Two-dimensional correlation spectroscopy (2D-COS) suggested that the fulvic-like fluorescence fraction might be more susceptible to metal exposure than other fractions. The molecular weight distribution results demonstrated that the SMPs released from the biosorption process differed significantly according to contact time. The results obtained could provide new insights into the responses of microbial products from aerobic granular sludge with heavy metal treatment. Copyright © 2018. Published by Elsevier Inc.
Shape-specific nanostructured protein mimics from de novo designed chimeric peptides.
Jiang, Linhai; Yang, Su; Lund, Reidar; Dong, He
2018-01-30
Natural proteins self-assemble into highly-ordered nanoscaled architectures to perform specific functions. The intricate functions of proteins have provided great impetus for researchers to develop strategies for designing and engineering synthetic nanostructures as protein mimics. Compared to the success in engineering fibrous protein mimetics, the design of discrete globular protein-like nanostructures has been challenging mainly due to the lack of precise control over geometric packing and intermolecular interactions among synthetic building blocks. In this contribution, we report an effective strategy to construct shape-specific nanostructures based on the self-assembly of chimeric peptides consisting of a coiled coil dimer and a collagen triple helix folding motif. Under salt-free conditions, we showed spontaneous self-assembly of the chimeric peptides into monodisperse, trigonal bipyramidal-like nanoparticles with precise control over the stoichiometry of two folding motifs and the geometrical arrangements relative to one another. Three coiled coil dimers are interdigitated on the equatorial plane while the two collagen triple helices are located in the axial position, perpendicular to the coiled coil plane. A detailed molecular model was proposed and further validated by small angle X-ray scattering experiments and molecular dynamics (MD) simulation. The results from this study indicated that the molecular folding of each motif within the chimeric peptides and their geometric packing played important roles in the formation of discrete protein-like nanoparticles. The peptide design and self-assembly mechanism may open up new routes for the construction of highly organized, discrete self-assembling protein-like nanostructures with greater levels of control over assembly accuracy.
NASA Astrophysics Data System (ADS)
Inamdar, S. P.; Singh, S.
2013-12-01
Understanding how dissolved organic matter (DOM) varies spatially in catchments and the processes and mechanisms that regulate this variation is critical for developing accurate and reliable models of DOM. We determined the concentrations and composition of DOM at multiple locations along a stream drainage network in a 79 ha forested Piedmont watershed in Maryland, USA. DOM concentrations and composition were compared for five stream locations during baseflow (drainage areas of 0.62, 3.5, 4.5, 12 and 79 ha) and three locations (3.5, 12, 79 ha) for storm flow. Sampling was conducted by manual grab samples and automated ISCO samplers. DOM composition was characterized using a suite of spectrofluorometric indices, including HIX, a254, and FI. A site-specific PARAFAC model was also developed for DOM fluorescence to determine the humic-, fulvic-, and protein-like DOM constituents. Hydrologic flow paths during baseflow and stormflow were characterized for all stream locations using an end-member mixing model (EMMA). DOM varied notably across the sampled positions for baseflow and stormflow. During baseflow, mean DOC concentrations for the sampled locations ranged between 0.99-3.1 mg/L whereas for stormflow the range was 5.22-8.11 mg/L. Not surprisingly, DOM was more humic and aromatic during stormflow versus baseflow. The 3.5 ha stream drainage location that contained a large wetland yielded the highest DOC concentration as well as the most humic and aromatic DOM, during both baseflow and stormflow. In contrast, a headwater stream location (0.62 ha) that received runoff from a groundwater seep registered the highest mean value for % protein-like DOM (30%) and the lowest index for aromaticity (mean a254 = 6.52) during baseflow. During stormflow, the mean % protein-like DOM was highest at the largest 79 ha drainage location (mean = 11.8%) and this site also registered the lowest mean value for a254 (46.3).
Stream drainage locations that received a larger proportion of runoff along surficial flow paths produced a more aromatic and humic DOM with high DOC concentrations; whereas those with a greater proportion of groundwater contributions produced DOM with greater % of protein-like content. Overall, our observations suggest that occurrence of wetlands and the nature of hydrologic flow paths were the key determinants for the spatial pattern of DOM.
Process of labeling specific chromosomes using recombinant repetitive DNA
Moyzis, R.K.; Meyne, J.
1988-02-12
Chromosome preferential nucleotide sequences are first determined from a library of recombinant DNA clones having families of repetitive sequences. Library clones are identified with a low homology with a sequence of repetitive DNA families to which the first clones respectively belong and variant sequences are then identified by selecting clones having a pattern of hybridization with genomic DNA dissimilar to the hybridization pattern shown by the respective families. In another embodiment, variant sequences are selected from a sequence of a known repetitive DNA family. The selected variant sequence is classified as chromosome specific, chromosome preferential, or chromosome nonspecific. Sequences which are classified as chromosome preferential are further sequenced and regions are identified having a low homology with other regions of the chromosome preferential sequence or with known sequences of other family members and consensus sequences of the repetitive DNA families for the chromosome preferential sequences. The selected low homology regions are then hybridized with chromosomes to determine those low homology regions hybridized with a specific chromosome under normal stringency conditions.
Classifying short genomic fragments from novel lineages using composition and homology
2011-01-01
Background: The assignment of taxonomic attributions to DNA fragments recovered directly from the environment is a vital step in metagenomic data analysis. Assignments can be made using rank-specific classifiers, which assign reads to taxonomic labels from a predetermined level such as named species or strain, or rank-flexible classifiers, which choose an appropriate taxonomic rank for each sequence in a data set. The choice of rank typically depends on the optimal model for a given sequence and on the breadth of taxonomic groups seen in a set of close-to-optimal models. Homology-based (e.g., LCA) and composition-based (e.g., PhyloPythia, TACOA) rank-flexible classifiers have been proposed, but there is at present no hybrid approach that utilizes both homology and composition. Results: We first develop a hybrid, rank-specific classifier based on BLAST and Naïve Bayes (NB) that has comparable accuracy and a faster running time than the current best approach, PhymmBL. By substituting LCA for BLAST or allowing the inclusion of suboptimal NB models, we obtain a rank-flexible classifier. This hybrid classifier outperforms established rank-flexible approaches on simulated metagenomic fragments of length 200 bp to 1000 bp and is able to assign taxonomic attributions to a subset of sequences with few misclassifications. We then demonstrate the performance of different classifiers on an enhanced biological phosphorus removal metagenome, illustrating the advantages of rank-flexible classifiers when representative genomes are absent from the set of reference genomes. Application to a glacier ice metagenome demonstrates that similar taxonomic profiles are obtained across a set of classifiers which are increasingly conservative in their classification. Conclusions: Our NB-based classification scheme is faster than the current best composition-based algorithm, Phymm, while providing equally accurate predictions.
The rank-flexible variant of NB, which we term ε-NB, is complementary to LCA and can be combined with it to yield conservative prediction sets of very high confidence. The simple parameterization of LCA and ε-NB allows for tuning of the balance between more predictions and increased precision, allowing the user to account for the sensitivity of downstream analyses to misclassified or unclassified sequences. PMID:21827705
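The homology-based LCA idea mentioned in the abstract above can be illustrated with a minimal Python sketch: a fragment is assigned to the deepest taxonomic rank shared by all of its close-to-optimal hits. The function name, the root-to-leaf list representation of a lineage, and the example taxa are assumptions made for illustration; they are not taken from the paper or from any particular LCA tool.

```python
from typing import List, Sequence


def lowest_common_ancestor(lineages: Sequence[Sequence[str]]) -> List[str]:
    """Return the longest root-to-leaf prefix shared by every lineage."""
    if not lineages:
        return []
    shared: List[str] = []
    # zip stops at the shortest lineage, so the prefix never overruns one.
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            shared.append(ranks[0])
        else:
            break
    return shared


# Hypothetical lineages of three near-optimal hits for one fragment.
hits = [
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Escherichia"],
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Salmonella"],
    ["Bacteria", "Proteobacteria", "Betaproteobacteria", "Neisseria"],
]
assignment = lowest_common_ancestor(hits)
print(assignment)  # ['Bacteria', 'Proteobacteria']
```

The more the candidate hits disagree, the shallower the returned rank, which is the sense in which such a classifier is rank-flexible and conservative.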
Chromosome specific repetitive DNA sequences
Moyzis, Robert K.; Meyne, Julianne
1991-01-01
A method is provided for determining specific nucleotide sequences useful in forming a probe which can identify specific chromosomes, preferably through in situ hybridization within the cell itself. In one embodiment, chromosome preferential nucleotide sequences are first determined from a library of recombinant DNA clones having families of repetitive sequences. Library clones are identified with a low homology with a sequence of repetitive DNA families to which the first clones respectively belong and variant sequences are then identified by selecting clones having a pattern of hybridization with genomic DNA dissimilar to the hybridization pattern shown by the respective families. In another embodiment, variant sequences are selected from a sequence of a known repetitive DNA family. The selected variant sequence is classified as chromosome specific, chromosome preferential, or chromosome nonspecific. Sequences which are classified as chromosome preferential are further sequenced and regions are identified having a low homology with other regions of the chromosome preferential sequence or with known sequences of other family members [...] This invention is the result of a contract with the Department of Energy (Contract No. W-7405-ENG-36).
CRF: detection of CRISPR arrays using random forest.
Wang, Kai; Liang, Chun
2017-01-01
CRISPRs (clustered regularly interspaced short palindromic repeats) are particular repeat sequences found in a wide range of bacterial and archaeal genomes. Several tools are available for detecting CRISPR arrays in the genomes of both domains. Here we developed a new web-based CRISPR detection tool named CRF (CRISPR Finder by Random Forest). Unlike other CRISPR detection tools, CRF uses a random forest classifier to filter out invalid CRISPR arrays from all putative candidates, which enhances detection accuracy. In particular, triplet elements that combine both sequence content and structure information were extracted from CRISPR repeats for classifier training. The classifier achieved high accuracy and sensitivity. Moreover, CRF offers a highly interactive web interface for robust data visualization that is not available among other CRISPR detection tools. After detection, the query sequence, CRISPR array architecture, and the sequences and secondary structures of CRISPR repeats and spacers can be visualized for visual examination and validation. CRF is freely available at http://bioinfolab.miamioh.edu/crf/home.php.
Peiris, Ramila H; Ignagni, Nicholas; Budman, Hector; Moresoli, Christine; Legge, Raymond L
2012-09-15
Characterization of the interactions between natural colloidal/particulate- and protein-like matter is important for understanding their contribution to different physicochemical phenomena such as membrane fouling, adsorption of bacteria onto surfaces and various applications of nanoparticles in nanomedicine and nanotoxicology. Precise interpretation of the extent of such interactions is, however, hindered because most characterization methods do not allow rapid, sensitive and accurate measurements. Here we report on a fluorescence-based excitation-emission matrix (EEM) approach in combination with principal component analysis (PCA) to extract information related to the interaction between natural colloidal/particulate- and protein-like matter. Surface plasmon resonance (SPR) analysis and fiber-optic probe based surface fluorescence measurements were used to confirm that the proposed approach can be used to characterize colloidal/particulate-protein interactions at the physical level. This method has potential to be a fundamental measurement of these interactions with the advantage that it can be performed rapidly and with high sensitivity. Copyright © 2012 Elsevier B.V. All rights reserved.
Monodisperse self-assembly in a model with protein-like interactions
NASA Astrophysics Data System (ADS)
Wilber, Alex W.; Doye, Jonathan P. K.; Louis, Ard A.; Lewis, Anna C. F.
2009-11-01
We study the self-assembly behavior of patchy particles with "proteinlike" interactions that can be considered as a minimal model for the assembly of viral capsids and other shell-like protein complexes. We thoroughly explore the thermodynamics and dynamics of self-assembly as a function of the parameters of the model and find robust assembly of all target structures considered. Optimal assembly occurs in the region of parameter space where a free energy barrier regulates the rate of nucleation, thus preventing the premature exhaustion of the supply of monomers that can lead to the formation of incomplete shells. The interactions also need to be specific enough to prevent the assembly of malformed shells while maintaining kinetic accessibility. Free energy landscapes computed for our model have a funnel-like topography guiding the system to form the target structure and show that the torsional component of the interparticle interactions prevents the formation of disordered aggregates that would otherwise act as kinetic traps.
Nakamura, Yoichi; Taniguchi, Hirokazu; Mizoguchi, Kosuke; Ikeda, Takaya; Motoshima, Kohei; Yamaguchi, Hiroyuki; Nagashima, Seiji; Nakatomi, Katsumi; Soda, Manabu; Mano, Hiroyuki; Kohno, Shigeru
2014-06-01
It is widely recognized that the risk of secondary neoplasms increases as childhood cancer survivors progress through adulthood. These are mainly hematological malignancies, and recurrent chromosome translocations are commonly detected in such cases. On the other hand, while secondary epithelial malignancies have sometimes been reported, chromosome translocations in these epithelial malignancies have not. A 33-year-old man who had been diagnosed with acute lymphoblastic leukemia and treated with chemotherapy almost 20 years earlier was diagnosed with lung adenocarcinoma. After chromosomal rearrangement of echinoderm microtubule-associated protein-like 4 gene and the anaplastic lymphoma kinase gene was detected in this adenocarcinoma, he responded to treatment with crizotinib. It was therefore concluded that this echinoderm microtubule-associated protein-like 4 gene-anaplastic lymphoma kinase gene-positive lung adenocarcinoma was a secondary epithelial malignancy. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.
Fluorescent water-soluble organic aerosols in the High Arctic atmosphere
Fu, Pingqing; Kawamura, Kimitaka; Chen, Jing; Qin, Mingyue; Ren, Lujie; Sun, Yele; Wang, Zifa; Barrie, Leonard A.; Tachibana, Eri; Ding, Aijun; Yamashita, Youhei
2015-01-01
Organic aerosols are ubiquitous in the earth’s atmosphere. They have been extensively studied in urban, rural and marine environments. However, little is known about the fluorescence properties of water-soluble organic carbon (WSOC) or their transport to and distribution in the polar regions. Here, we present evidence that fluorescent WSOC is a substantial component of High Arctic aerosols. The ratios of fluorescence intensity of protein-like peak to humic-like peak generally increased from dark winter to early summer, indicating an enhanced contribution of protein-like organics from the ocean to Arctic aerosols after the polar sunrise. Such a seasonal pattern is in agreement with an increase of stable carbon isotope ratios of total carbon (δ13CTC) from −26.8‰ to −22.5‰. Our results suggest that Arctic aerosols are derived from a combination of the long-range transport of terrestrial organics and local sea-to-air emission of marine organics, with an estimated contribution from the latter of 8.7–77% (mean 45%). PMID:25920042
Tian, Yu; Li, Zhipeng; Lu, Yaobin
2012-10-01
The study focused on the membrane fouling mitigation observed in a membrane bioreactor (MBR) coupled with a worm reactor system. During the operation time of 100 days, the transmembrane pressure (TMP) in the combined system remained below 5 kPa, while the final TMP in the Control-MBR increased to 30 kPa. The changes in properties of soluble microbial products (SMP) and extracellular polymeric substances (EPS) after worm predation were investigated by means of various analytical techniques. It was found that, owing to worm predation, the decrease in EPS far exceeded the increase in SMP, leading to a significant decrease in protein-like substances, which were dominant in the membrane foulants. In addition to decreasing their content, worm predation destroyed the functional groups of simple aromatic proteins and tryptophan protein-like substances in EPS, giving them a lower tendency to attach to the membrane in the combined system. Copyright © 2012 Elsevier Ltd. All rights reserved.
Deterministic folding: The role of entropic forces and steric specificities
NASA Astrophysics Data System (ADS)
da Silva, Roosevelt A.; da Silva, M. A. A.; Caliri, A.
2001-03-01
The inverse folding problem of proteinlike macromolecules is studied by using a lattice Monte Carlo (MC) model in which steric specificities (nearest-neighbors constraints) are included and the hydrophobic effect is treated explicitly by considering interactions between the chain and solvent molecules. Chemical attributes and steric peculiarities of the residues are encoded in a 10-letter alphabet and a correspondent "syntax" is provided in order to write suitable sequences for the specified target structures; twenty-four target configurations, chosen in order to cover all possible values of the average contact order χ (0.2381⩽χ⩽0.4947 for this system), were encoded and analyzed. The results, obtained by MC simulations, are strongly influenced by geometrical properties of the native configuration, namely χ and the relative number φ of crankshafts-type structures: For χ<0.35 the folding is deterministic, that is, the syntax is able to encode successful sequences: The system presents larger encodability, minimum sequence-target degeneracies and smaller characteristic folding time τf. For χ⩾0.35 the above results are not reproduced any more: The folding success is severely reduced, showing strong correlation with φ. Additionally, the existence of distinct characteristic folding times suggests that different mechanisms are acting at the same time in the folding process. The results (all obtained from the same single model, under the same "physiological conditions") resemble some general features of the folding problem, supporting the premise that the steric specificities, in association with the entropic forces (hydrophobic effect), are basic ingredients in the protein folding process.
Jin, Haibao; Jiao, Fang; Daily, Michael D.; ...
2016-07-12
Two-dimensional (2D) materials with molecular-scale thickness have attracted increasing interest for separation, electronic, catalytic, optical, energy and biomedical applications. Although extensive research on 2D materials, such as graphene and graphene oxide, has been performed in recent years, progress is limited on self-assembly of 2D materials from sequence-specific macromolecules, especially from synthetic sequences that could exhibit lipid-like self-assembly of bilayer sheets and mimic membrane proteins for functions. The creation of such a new class of materials could enable development of highly stable biomimetic membranes that exhibit cell-membrane-like molecular transport with exceptional selectivity and high transport rates. Here we demonstrate self-assembly of lipid-like 12-mer peptoids into extremely stable, crystalline, flexible and free-standing 2D membrane materials. As with cell membranes, upon exposure to external stimuli, these materials exhibit changes in thickness, varying from 3.5 nm to 5.6 nm. We find that self-assembly occurs through a facile crystallization process, in which inter-peptoid hydrogen bonds and enhanced hydrophobic interactions drive the formation of a highly-ordered structure. Molecular simulation confirms this is the energetically favored structure. Displaying functional groups at arbitrary locations of membrane-forming peptoids produces membranes with similar structures. This research further shows that single-layer membranes can be coated onto substrate surfaces. Moreover, membranes with mechanically-induced defects can self-repair. Given that peptoids are sequence-specific and exhibit protein-like molecular recognition with enhanced stability, we anticipate our membranes to be a robust platform tailored to specific applications.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jin, Haibao; Jiao, Fang; Daily, Michael D.
Two-dimensional (2D) materials with molecular-scale thickness have attracted increasing interest for separation, electronic, catalytic, optical, energy and biomedical applications. Although extensive research on 2D materials, such as graphene and graphene oxide, has been performed in recent years, progress is limited on self-assembly of 2D materials from sequence-specific macromolecules, especially from synthetic sequences that could exhibit lipid-like self-assembly of bilayer sheets and mimic membrane proteins for functions. The creation of such a new class of materials could enable development of highly stable biomimetic membranes that exhibit cell-membrane-like molecular transport with exceptional selectivity and high transport rates. Here we demonstrate self-assembly of lipid-like 12-mer peptoids into extremely stable, crystalline, flexible and free-standing 2D membrane materials. As with cell membranes, upon exposure to external stimuli, these materials exhibit changes in thickness, varying from 3.5 nm to 5.6 nm. We find that self-assembly occurs through a facile crystallization process, in which inter-peptoid hydrogen bonds and enhanced hydrophobic interactions drive the formation of a highly-ordered structure. Molecular simulation confirms this is the energetically favored structure. Displaying functional groups at arbitrary locations of membrane-forming peptoids produces membranes with similar structures. This research further shows that single-layer membranes can be coated onto substrate surfaces. Moreover, membranes with mechanically-induced defects can self-repair. Given that peptoids are sequence-specific and exhibit protein-like molecular recognition with enhanced stability, we anticipate our membranes to be a robust platform tailored to specific applications.
Bulashevska, Alla; Eils, Roland
2006-06-14
The subcellular location of a protein is closely related to its function. It would be worthwhile to develop a method to predict the subcellular location for a given protein when only the amino acid sequence of the protein is known. Although many efforts have been made to predict subcellular location from sequence information only, there is a need for further research to improve the accuracy of prediction. A novel method called HensBC is introduced to predict protein subcellular location. HensBC is a recursive algorithm which constructs a hierarchical ensemble of classifiers. The classifiers used are Bayesian classifiers based on Markov chain models. We tested our method on six different datasets, among them a Gram-negative bacteria dataset, data for discriminating outer membrane proteins, and an apoptosis proteins dataset. We observed that our method can predict the subcellular location with high accuracy. Another advantage of the proposed method is that it can improve the accuracy of the prediction of some classes with few sequences in training and is therefore useful for datasets with imbalanced distribution of classes. This study introduces an algorithm which uses only the primary sequence of a protein to predict its subcellular location. The proposed recursive scheme represents an interesting methodology for learning and combining classifiers. The method is computationally efficient and competitive with the previously reported approaches in terms of prediction accuracies as empirical results indicate. The code for the software is available upon request.
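To make the phrase "Bayesian classifiers based on Markov chain models" concrete, here is a minimal Python sketch of a single such base classifier: one first-order Markov chain is fitted per class, and a sequence is assigned to the class with the highest posterior. It does not reproduce HensBC's recursive, hierarchical ensemble. The class labels, the toy sequences, the add-one smoothing and the chain order are all assumptions chosen for illustration.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


class MarkovChainModel:
    """First-order Markov chain over residues with add-one smoothing."""

    def __init__(self) -> None:
        self.start = defaultdict(int)        # counts of first residues
        self.trans = defaultdict(int)        # counts of (previous, next) pairs
        self.prev_totals = defaultdict(int)  # transitions leaving each residue
        self.n_seqs = 0

    def fit(self, seqs: Iterable[str]) -> "MarkovChainModel":
        for s in seqs:
            if not s:
                continue
            self.n_seqs += 1
            self.start[s[0]] += 1
            for a, b in zip(s, s[1:]):
                self.trans[(a, b)] += 1
                self.prev_totals[a] += 1
        return self

    def log_likelihood(self, s: str) -> float:
        if not s:
            return 0.0
        k = len(ALPHABET)
        ll = math.log((self.start[s[0]] + 1) / (self.n_seqs + k))
        for a, b in zip(s, s[1:]):
            ll += math.log((self.trans[(a, b)] + 1) / (self.prev_totals[a] + k))
        return ll


def train(labelled: Dict[str, List[str]]) -> Tuple[Dict[str, MarkovChainModel], Dict[str, float]]:
    """Fit one chain per class and compute log class priors."""
    total = sum(len(seqs) for seqs in labelled.values())
    models = {c: MarkovChainModel().fit(seqs) for c, seqs in labelled.items()}
    log_priors = {c: math.log(len(seqs) / total) for c, seqs in labelled.items()}
    return models, log_priors


def classify(seq: str, models: Dict[str, MarkovChainModel], log_priors: Dict[str, float]) -> str:
    """Bayes rule: pick the class maximizing log prior + log likelihood."""
    return max(models, key=lambda c: log_priors[c] + models[c].log_likelihood(seq))


# Hypothetical toy training data; not real protein sequences.
training = {
    "membrane": ["LLVVAILLVA", "AVLLIVVLLA"],
    "soluble": ["KDEEKRDEKD", "EEKKDDRREK"],
}
models, log_priors = train(training)
print(classify("LVLLAVIL", models, log_priors))  # membrane
print(classify("EKDEEK", models, log_priors))    # soluble
```

Because every class keeps its own prior, the same scoring rule still applies when classes are unevenly represented, although a real imbalanced dataset would need the ensemble strategy described in the abstract rather than this single classifier.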
Using simple artificial intelligence methods for predicting amyloidogenesis in antibodies
2010-01-01
Background: All polypeptide backbones have the potential to form amyloid fibrils, which are associated with a number of degenerative disorders. However, the likelihood that amyloidosis would actually occur under physiological conditions depends largely on the amino acid composition of a protein. We explore using a naive Bayesian classifier and a weighted decision tree for predicting the amyloidogenicity of immunoglobulin sequences. Results: The average accuracy based on leave-one-out (LOO) cross validation of a Bayesian classifier generated from 143 amyloidogenic sequences is 60.84%. This is consistent with the average accuracy of 61.15% for a holdout test set comprised of 103 AM and 28 non-amyloidogenic sequences. The LOO cross validation accuracy increases to 81.08% when the training set is augmented by the holdout test set. In comparison, the average classification accuracy for the holdout test set obtained using a decision tree is 78.64%. Non-amyloidogenic sequences are predicted with average LOO cross validation accuracies between 74.05% and 77.24% using the Bayesian classifier, depending on the training set size. The accuracy for the holdout test set was 89%. For the decision tree, the non-amyloidogenic prediction accuracy is 75.00%. Conclusions: This exploratory study indicates that both classification methods may be promising in providing straightforward predictions on the amyloidogenicity of a sequence. Nevertheless, the number of available sequences that satisfy the premises of this study is limited, and is consequently smaller than the ideal training set size. Increasing the size of the training set clearly increases the accuracy, and the expansion of the training set to include not only more derivatives, but more alignments, would make the method more sound. The accuracy of the classifiers may also be improved when additional factors, such as structural and physico-chemical data, are considered.
The development of this type of classifier has significant applications in evaluating engineered antibodies, and may be adapted for evaluating engineered proteins in general. PMID:20144194
Using simple artificial intelligence methods for predicting amyloidogenesis in antibodies.
David, Maria Pamela C; Concepcion, Gisela P; Padlan, Eduardo A
2010-02-08
All polypeptide backbones have the potential to form amyloid fibrils, which are associated with a number of degenerative disorders. However, the likelihood that amyloidosis would actually occur under physiological conditions depends largely on the amino acid composition of a protein. We explore using a naive Bayesian classifier and a weighted decision tree for predicting the amyloidogenicity of immunoglobulin sequences. The average accuracy based on leave-one-out (LOO) cross validation of a Bayesian classifier generated from 143 amyloidogenic sequences is 60.84%. This is consistent with the average accuracy of 61.15% for a holdout test set comprised of 103 AM and 28 non-amyloidogenic sequences. The LOO cross validation accuracy increases to 81.08% when the training set is augmented by the holdout test set. In comparison, the average classification accuracy for the holdout test set obtained using a decision tree is 78.64%. Non-amyloidogenic sequences are predicted with average LOO cross validation accuracies between 74.05% and 77.24% using the Bayesian classifier, depending on the training set size. The accuracy for the holdout test set was 89%. For the decision tree, the non-amyloidogenic prediction accuracy is 75.00%. This exploratory study indicates that both classification methods may be promising in providing straightforward predictions on the amyloidogenicity of a sequence. Nevertheless, the number of available sequences that satisfy the premises of this study is limited, and is consequently smaller than the ideal training set size. Increasing the size of the training set clearly increases the accuracy, and the expansion of the training set to include not only more derivatives, but more alignments, would make the method more sound. The accuracy of the classifiers may also be improved when additional factors, such as structural and physico-chemical data, are considered.
The development of this type of classifier has significant applications in evaluating engineered antibodies, and may be adapted for evaluating engineered proteins in general.
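The leave-one-out (LOO) cross validation used in this study can be sketched in a few lines of Python. Each sequence is held out once, a classifier is trained on the rest, and the held-out sequence is then predicted. The multinomial naive Bayes on residue counts shown here is a stand-in chosen for brevity and is not claimed to match the authors' feature encoding; the sequences and labels are invented toy data, not immunoglobulin sequences.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
Model = Dict[str, Tuple[float, Dict[str, float]]]


def fit_nb(seqs: List[str], labels: List[str]) -> Model:
    """Multinomial naive Bayes on residue counts with add-one smoothing."""
    model: Model = {}
    for c in set(labels):
        counts: Counter = Counter()
        n_class = 0
        for s, y in zip(seqs, labels):
            if y == c:
                counts.update(s)
                n_class += 1
        total = sum(counts[r] for r in ALPHABET)
        log_theta = {
            r: math.log((counts[r] + 1) / (total + len(ALPHABET)))
            for r in ALPHABET
        }
        model[c] = (math.log(n_class / len(labels)), log_theta)
    return model


def predict_nb(model: Model, seq: str) -> str:
    def score(c: str) -> float:
        log_prior, log_theta = model[c]
        # Residues outside the 20-letter alphabet are ignored.
        return log_prior + sum(log_theta[r] for r in seq if r in log_theta)

    # sorted() makes tie-breaking deterministic.
    return max(sorted(model), key=score)


def leave_one_out_accuracy(seqs: List[str], labels: List[str]) -> float:
    """Hold out each sample once, train on the rest, and score the held-out one."""
    correct = 0
    for i in range(len(seqs)):
        train_x = seqs[:i] + seqs[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit_nb(train_x, train_y)
        if predict_nb(model, seqs[i]) == labels[i]:
            correct += 1
    return correct / len(seqs)


# Hypothetical toy data for illustration only.
seqs = ["VVILFVIV", "ILVVFLIV", "FVVIILVL", "KDEEKRDE", "EEKDKRDK", "DKEERKDE"]
labels = ["amyloidogenic"] * 3 + ["non-amyloidogenic"] * 3
print(leave_one_out_accuracy(seqs, labels))
```

With N sequences the loop trains N models, which is affordable for training sets of the size reported in the abstract (on the order of a hundred sequences).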
Locating Encrypted Data Hidden Among Non-Encrypted Data Using Statistical Tools
2007-03-01
length of a compressed sequence). If a bit sequence can be significantly compressed, then it is not random. Lempel-Ziv Compression Test This test...communication, targeting, and a host of other tasks. This software will most assuredly contain classified data or algorithms requiring protection in...containing the classified data and algorithms. As the program is executed the soldier would have access to the common unclassified tasks, however, to
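The compression heuristic quoted in this excerpt (data that compresses significantly is not random) can be tried with a short Python sketch. It uses the standard-library zlib module, whose DEFLATE format combines LZ77 (a Lempel-Ziv-family method) with Huffman coding, so it is a related stand-in and not the specific Lempel-Ziv compression test the report refers to. The 0.95 threshold and the function names are arbitrary assumptions, and seeded pseudo-random bytes stand in for ciphertext.

```python
import random
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size; values near 1 mean incompressible."""
    if not data:
        raise ValueError("data must be non-empty")
    return len(zlib.compress(data, 9)) / len(data)


def looks_random(data: bytes, threshold: float = 0.95) -> bool:
    """Flag a block as random-looking when it barely compresses (assumed threshold)."""
    return compression_ratio(data) >= threshold


rng = random.Random(0)  # fixed seed so the example is reproducible
random_like = bytes(rng.getrandbits(8) for _ in range(4096))
plain_like = b"the quick brown fox jumps over the lazy dog " * 100

print(looks_random(random_like))  # True
print(looks_random(plain_like))   # False
```

Already-compressed files would also be flagged by this check, so on its own it separates high-entropy from low-entropy data rather than encrypted from non-encrypted data.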
A novel EML4-ALK variant: exon 6 of EML4 fused to exon 19 of ALK.
Penzel, Roland; Schirmacher, Peter; Warth, Arne
2012-07-01
Cytotoxic chemotherapy remains the mainstay of treatment for most patients with advanced disease. Recently, anaplastic lymphoma kinase (ALK) expression as a major target for successful treatment with ALK inhibitors was detected in a subset of non-small-cell lung carcinomas, usually as a result of echinoderm microtubule-associated protein-like 4 (EML4)-ALK rearrangements. Although the chromosomal breakpoint within the EML4 gene varied, the breakpoint within ALK was most frequently reported within intron 19 or rarely in exon 20. Therefore, the different EML4-ALK variants so far contain the same 3' portion of ALK starting with exon 20. Here, we report a novel EML4-ALK variant detected by reverse transcription polymerase chain reaction analysis. Subsequent sequencing revealed an EML4-ALK fusion variant in which exon 6 of EML4 was fused to exon 19 of ALK. It occurred in a predominant solid pulmonary adenocarcinoma of a 65-year-old woman with a clear split signal of ALK in fluorescence in situ hybridization analysis and a weakly homogeneous ALK expression in immunohistochemical staining. Because of the growing number of fusion variants a primary reverse transcription polymerase chain reaction-based screening for ALK-positive non-small-cell lung carcinoma patients may not be sufficient for predictive diagnostics but transcript-based approaches and sequencing of ALK fusion variants might finally contribute to an optimized selection of patients.
Mutations in the calcium-related gene IL1RAPL1 are associated with autism.
Piton, Amélie; Michaud, Jacques L; Peng, Huashan; Aradhya, Swaroop; Gauthier, Julie; Mottron, Laurent; Champagne, Nathalie; Lafrenière, Ronald G; Hamdan, Fadi F; Joober, Ridha; Fombonne, Eric; Marineau, Claude; Cossette, Patrick; Dubé, Marie-Pierre; Haghighi, Pejmun; Drapeau, Pierre; Barker, Philip A; Carbonetto, Salvatore; Rouleau, Guy A
2008-12-15
In a systematic sequencing screen of synaptic genes on the X chromosome, we have identified an autistic female without mental retardation (MR) who carries a de novo frameshift Ile367SerfsX6 mutation in Interleukin-1 Receptor Accessory Protein-Like 1 (IL1RAPL1), a gene implicated in calcium-regulated vesicle release and dendrite differentiation. We showed that the function of the resulting truncated IL1RAPL1 protein is severely altered in hippocampal neurons, by measuring its effect on neurite outgrowth activity. We also sequenced the coding region of the closely related member IL1RAPL2 and of NCS-1/FREQ, which physically interacts with IL1RAPL1, in a cohort of subjects with autism. The screening failed to identify any non-synonymous variant in IL1RAPL2, whereas a rare missense variant (R102Q) in NCS-1/FREQ was identified in one autistic patient. Furthermore, we identified by comparative genomic hybridization a large intragenic deletion of exons 3-7 of IL1RAPL1 in three brothers with autism and/or MR. This deletion causes a frameshift and the introduction of a premature stop codon, Ala28GlufsX15, at the very beginning of the protein. Altogether, our results indicate that mutations in IL1RAPL1 cause a spectrum of neurological impairments ranging from MR to high functioning autism.
Moszczynska, Anna; Burghardt, Kyle J.; Yu, Dongyue
2017-01-01
Short interspersed elements (SINEs) are typically silenced by DNA hypermethylation in somatic cells, but can retrotranspose in proliferating cells during adult neurogenesis. Hypomethylation caused by disease pathology or genotoxic stress leads to genomic instability of SINEs. The goal of the present investigation was to determine whether neurotoxic doses of binge or chronic methamphetamine (METH) trigger retrotransposition of the identifier (ID) element, a member of the rat SINE family, in the dentate gyrus genomic DNA. Adult male Sprague-Dawley rats were treated with saline or high doses of binge or chronic METH and sacrificed at three different time points thereafter. DNA methylation analysis, immunohistochemistry and next-generation sequencing (NGS) were performed on the dorsal dentate gyrus samples. Binge METH triggered hypomethylation, while chronic METH triggered hypermethylation of the CpG-2 site. Both METH regimens were associated with increased intensities in poly(A)-binding protein 1 (PABP1, a SINE regulatory protein)-like immunohistochemical staining in the dentate gyrus. The amplification of several ID element sequences was significantly higher in the chronic METH group than in the control group a week after METH, and they mapped to genes coding for proteins regulating cell growth and proliferation, transcription, protein function as well as for a variety of transporters. The results suggest that chronic METH induces ID element retrotransposition in the dorsal dentate gyrus and may affect hippocampal neurogenesis. PMID:28272323
Protein binding hot spots prediction from sequence only by a new ensemble learning method.
Hu, Shan-Shan; Chen, Peng; Wang, Bing; Li, Jinyan
2017-10-01
Hot spots are interfacial core areas of binding proteins, which have been applied as targets in drug design. Locating hot spot areas experimentally is costly in both time and expense. Recently, in silico computational methods have been widely used for hot spot prediction through sequence or structure characterization. Because protein structures are not always solved, hot spot identification from amino acid sequences alone is more useful for real-life applications. This work proposes a new sequence-based model that combines physicochemical features with the relative accessible surface area of amino acid sequences for hot spot prediction. The model consists of 83 classifiers involving the IBk (Instance-based k means) algorithm, where instances are encoded by important properties extracted from a total of 544 properties in the AAindex1 (Amino Acid Index) database. Then top-performance classifiers are selected to form an ensemble by a majority voting technique. The ensemble classifier outperforms the state-of-the-art computational methods, yielding an F1 score of 0.80 on the benchmark binding interface database (BID) test set. http://www2.ahu.edu.cn/pchen/web/HotspotEC.htm .
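The majority-voting step described above can be sketched as follows, assuming each base classifier is a small nearest-neighbour predictor that looks at its own subset of features. This Python example is illustrative only: the feature values, the feature subsets and k = 3 are invented, the selection of top-performing classifiers is omitted, and nothing here is taken from the authors' software.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Sample = Tuple[Vector, int]  # (feature vector, label); 1 = hot spot, 0 = not


def knn_predict(train: List[Sample], x: Vector, k: int = 3) -> int:
    """Label of the majority among the k training points closest to x."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def project(v: Vector, idx: Sequence[int]) -> List[float]:
    """Keep only the features a given base classifier is allowed to see."""
    return [v[i] for i in idx]


def ensemble_predict(
    train: List[Sample],
    x: Vector,
    feature_subsets: Sequence[Sequence[int]],
    k: int = 3,
) -> int:
    """Each base classifier votes; the most common vote is returned."""
    votes = [
        knn_predict([(project(v, idx), y) for v, y in train], project(x, idx), k)
        for idx in feature_subsets
    ]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical data: features 0 and 1 are informative, feature 2 is noise.
train_set: List[Sample] = [
    ([0.9, 0.8, 0.1], 1), ([0.8, 0.9, 0.2], 1), ([0.7, 0.7, 0.9], 1),
    ([0.1, 0.2, 0.8], 0), ([0.2, 0.1, 0.9], 0), ([0.3, 0.2, 0.1], 0),
]
subsets = [(0,), (1,), (2,)]  # one base classifier per feature

print(ensemble_predict(train_set, [0.85, 0.75, 0.85], subsets))  # 1
print(ensemble_predict(train_set, [0.15, 0.15, 0.15], subsets))  # 0
```

In both queries the classifier built on the noisy third feature votes against the other two and is outvoted. Using an odd number of voters with binary labels avoids tied votes.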
Classifying Noisy Protein Sequence Data: A Case Study of Immunoglobulin Light Chains
2005-01-01
collected from patients with and without amyloidosis, and indicates that the proposed modified classifiers are more robust to sequence variability than...piled from patients with and without amyloidosis provides unique features to serve as a model system, not only for conformational disease studies but...produced by patients with amyloidosis. SVMs have been used recently in a wide variety of applications in computational biology (Noble, 2004). Most
NASA Astrophysics Data System (ADS)
Liu, J.; Lan, T.; Qin, H.
2017-10-01
Traditional data cleaning identifies dirty data by classifying original data sequences, which is a class-imbalanced problem since the proportion of incorrect data is much less than the proportion of correct data for most diagnostic systems in Magnetic Confinement Fusion (MCF) devices. When using machine learning algorithms to classify diagnostic data based on a class-imbalanced training set, most classifiers are biased towards the major class and show very poor classification rates on the minor class. By transforming the direct classification problem about original data sequences into a classification problem about the physical similarity between data sequences, the class-balanced effect of the Time-Domain Global Similarity (TDGS) method on training set structure is investigated in this paper. Meanwhile, the impact of improved training set structure on the data cleaning performance of the TDGS method is demonstrated with an application example in the EAST POlarimetry-INTerferometry (POINT) system.
Mapping membrane activity in undiscovered peptide sequence space using machine learning
Fulan, Benjamin M.; Wong, Gerard C. L.
2016-01-01
There are some ∼1,100 known antimicrobial peptides (AMPs), which permeabilize microbial membranes but have diverse sequences. Here, we develop a support vector machine (SVM)-based classifier to investigate α-helical AMPs and the interrelated nature of their functional commonality and sequence homology. SVM is used to search the undiscovered peptide sequence space and identify Pareto-optimal candidates that simultaneously maximize the distance σ from the SVM hyperplane (thus maximize its “antimicrobialness”) and its α-helicity, but minimize mutational distance to known AMPs. By calibrating SVM machine learning results with killing assays and small-angle X-ray scattering (SAXS), we find that the SVM metric σ correlates not with a peptide’s minimum inhibitory concentration (MIC), but rather its ability to generate negative Gaussian membrane curvature. This surprising result provides a topological basis for membrane activity common to AMPs. Moreover, we highlight an important distinction between the maximal recognizability of a sequence to a trained AMP classifier (its ability to generate membrane curvature) and its maximal antimicrobial efficacy. As mutational distances are increased from known AMPs, we find AMP-like sequences that are increasingly difficult for nature to discover via simple mutation. Using the sequence map as a discovery tool, we find an unexpectedly diverse taxonomy of sequences that are just as membrane-active as known AMPs, but with a broad range of primary functions distinct from AMP functions, including endogenous neuropeptides, viral fusion proteins, topogenic peptides, and amyloids. The SVM classifier is useful as a general detector of membrane activity in peptide sequences. PMID:27849600
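The Pareto-optimal selection mentioned above can be sketched as a simple non-dominated filter. This is an illustration only: each candidate is assumed to be a hypothetical tuple (sigma, helicity, mutational_distance), where the first two objectives are maximized and the third is minimized, and the numbers below are made up.

```python
def dominates(a, b):
    """True if candidate a is at least as good as b on every objective
    and strictly better on at least one.

    Candidates are (sigma, helicity, mutational_distance) tuples:
    sigma and helicity are maximized, mutational distance is minimized.
    """
    no_worse = a[0] >= b[0] and a[1] >= b[1] and a[2] <= b[2]
    strictly_better = a[0] > b[0] or a[1] > b[1] or a[2] < b[2]
    return no_worse and strictly_better

def pareto_front(candidates):
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]

# Toy usage with invented scores.
toy_candidates = [(1.0, 0.9, 3), (0.5, 0.5, 5), (1.2, 0.4, 2)]
toy_front = pareto_front(toy_candidates)
```

The quadratic scan is adequate for small candidate sets; larger searches would need a more efficient non-dominated sort.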
Mitochondrial iron-sulfur cluster biogenesis from molecular understanding to clinical disease
Alfadhel, Majid; Nashabat, Marwan; Ali, Qais Abu; Hundallah, Khalid
2017-01-01
Iron–sulfur clusters (ISCs) are known to play a major role in various protein functions. Located in the mitochondria, cytosol, endoplasmic reticulum and nucleus, they contribute to various core cellular functions. Until recently, only a few human diseases related to mitochondrial ISC biogenesis defects have been described. Such diseases include Friedreich ataxia, combined oxidative phosphorylation deficiency 19, infantile complex II/III deficiency defect, hereditary myopathy with lactic acidosis and mitochondrial muscle myopathy, lipoic acid biosynthesis defects, multiple mitochondrial dysfunctions syndromes and nonketotic hyperglycinemia due to glutaredoxin 5 gene defect. Disorders of mitochondrial import, export and translation, including sideroblastic anemia with ataxia, EVEN-PLUS syndrome and mitochondrial complex I deficiency due to nucleotide-binding protein-like protein gene defect, have also been implicated in ISC biogenesis defects. With advances in next generation sequencing technologies, more disorders related to ISC biogenesis defects are expected to be elucidated. In this article, we aim to shed light on mitochondrial ISC biogenesis, related proteins and their function, pathophysiology, clinical phenotypes of related disorders, diagnostic approach, and future implications. PMID:28064324
Luévano-Martínez, Luis A; Barba-Ostria, Carlos; Araiza-Olivera, Daniela; Chiquete-Félix, Natalia; Guerrero-Castillo, Sergio; Rial, Eduardo; Georgellis, Dimitris; Uribe-Carvajal, Salvador
2012-04-01
The mitochondrial Oac (oxaloacetate carrier) found in some fungi and plants catalyses the uptake of oxaloacetate, malonate and sulfate. Despite their sequence similarity, transport specificity varies considerably between Oacs. Indeed, whereas ScOac (Saccharomyces cerevisiae Oac) is a specific anion-proton symporter, the YlOac (Yarrowia lipolytica Oac) has the added ability to transport protons, behaving as a UCP (uncoupling protein). Significantly, we identified two amino acid changes at the matrix gate of YlOac and ScOac, tyrosine to phenylalanine and methionine to leucine. We studied the role of these amino acids by expressing both wild-type and specifically mutated Oacs in an Oac-null S. cerevisiae strain. No phenotype could be associated with the methionine to leucine substitution, whereas UCP-like activity was dependent on the presence of the tyrosine residue normally expressed in the YlOac, i.e. Tyr-ScOac mediated proton transport, whereas Phe-YlOac lost its protonophoric activity. These findings indicate that the UCP-like activity of YlOac is determined by the tyrosine residue at position 146.
Species classifier choice is a key consideration when analysing low-complexity food microbiome data.
Walsh, Aaron M; Crispie, Fiona; O'Sullivan, Orla; Finnegan, Laura; Claesson, Marcus J; Cotter, Paul D
2018-03-20
The use of shotgun metagenomics to analyse low-complexity microbial communities in foods has the potential to be of considerable fundamental and applied value. However, there is currently no consensus with respect to choice of species classification tool, platform, or sequencing depth. Here, we benchmarked the performances of three high-throughput short-read sequencing platforms, the Illumina MiSeq, NextSeq 500, and Ion Proton, for shotgun metagenomics of food microbiota. Briefly, we sequenced six kefir DNA samples and a mock community DNA sample, the latter constructed by evenly mixing genomic DNA from 13 food-related bacterial species. A variety of bioinformatic tools were used to analyse the data generated, and the effects of sequencing depth on these analyses were tested by randomly subsampling reads. Compositional analysis results were consistent between the platforms at divergent sequencing depths. However, we observed pronounced differences in the predictions from species classification tools. Indeed, PERMANOVA indicated that there were no significant differences between the compositional results generated by the different sequencers (p = 0.693, R² = 0.011), but there was a significant difference between the results predicted by the species classifiers (p = 0.01, R² = 0.127). The relative abundances predicted by the classifiers, apart from MetaPhlAn2, were apparently biased by reference genome sizes. Additionally, we observed varying false-positive rates among the classifiers. MetaPhlAn2 had the lowest false-positive rate, whereas SLIMM had the greatest false-positive rate. Strain-level analysis results were also similar across platforms. Each platform correctly identified the strains present in the mock community, but accuracy was improved slightly with greater sequencing depth. Notably, PanPhlAn detected the dominant strains in each kefir sample above 500,000 reads per sample.
Again, the outputs from functional profiling analysis using SUPER-FOCUS were generally accordant between the platforms at different sequencing depths. Finally, and expectedly, metagenome assembly completeness was significantly lower on the MiSeq than either on the NextSeq (p = 0.03) or the Proton (p = 0.011), and it improved with increased sequencing depth. Our results demonstrate a remarkable similarity in the results generated by the three sequencing platforms at different sequencing depths, and, in fact, the choice of bioinformatics methodology had a more evident impact on results than the choice of sequencer did.
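The random subsampling of reads used to test the effect of sequencing depth can be sketched as follows. This is a minimal illustration that assumes the reads fit in memory as a list; the function name and the seed parameter are hypothetical, and the study itself may well have used a dedicated tool.

```python
import random

def subsample_reads(reads, depth, seed=0):
    """Draw `depth` reads without replacement, reproducibly.

    `reads` is a list of read records (any objects); `seed` fixes the
    random draw so that the same subsample can be regenerated.
    """
    if depth > len(reads):
        raise ValueError("requested depth exceeds the number of reads")
    rng = random.Random(seed)
    return rng.sample(reads, depth)

# Toy usage: 100 placeholder reads subsampled to a depth of 10.
toy_reads = [f"read{i}" for i in range(100)]
toy_subsample = subsample_reads(toy_reads, depth=10)
```

Repeating the draw at several depths, with the same seed per depth, gives comparable inputs for each platform.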
Molecular Characterization of Hypoderma SPP. in Domestic Ruminants from Turkey and Pakistan.
Ahmed, Haroon; Simsek, Sami; Saki, Cem Ecmel; Kesik, Harun Kaya; Kilinc, Seyma Gunyakti
2017-08-01
The aim of this study was to determine the morphological and molecular characterization of Hypoderma spp. in cattle and yak from provinces in Turkey and Pakistan. In total, 78 Hypoderma larvae were collected from slaughtered animals in Turkey and Pakistan from October 2015 to January 2016. Thirty-eight of these 78 Hypoderma larvae were morphologically classified as third instar larvae (L3s) of Hypoderma bovis, 37 were classified as Hypoderma lineatum, and 3 were classified as suspected or unidentified. The restriction enzyme TaqI was used to differentiate the Hypoderma spp. by polymerase chain reaction (PCR)-restriction fragment length polymorphism (RFLP). According to the sequences and the PCR-RFLP results, all larval samples from cattle from Turkey were classified as H. bovis, except for 1 sample classified as H. lineatum. All Hypoderma larvae from Pakistan were classified as H. lineatum from cattle and as Hypoderma sinense from yak. This study provides the first molecular characterization of H. lineatum (cattle) and H. sinense (yak) in Pakistan based on PCR-RFLP and sequencing results.
Wang, Pengfei; Wang, Yingfang; Duan, Guangcai; Xue, Zerun; Wang, Linlin; Guo, Xiangjiao; Yang, Haiyan; Xi, Yuanlin
2015-04-01
This study aimed to explore the features of clustered regularly interspaced short palindromic repeats (CRISPR) structures in Shigella by using bioinformatics. We used bioinformatics methods, including BLAST, alignment and RNA structure prediction, to analyze the CRISPR structures of Shigella genomes. The results showed that CRISPRs existed in the four groups of Shigella, and the flanking sequences upstream of the CRISPRs could be classified into the same groups as those downstream. We also found some relatively conserved palindromic motifs in the leader sequences. Repeat sequences fell into the same groups as their corresponding flanking sequences, and could be classified into two different types by their RNA secondary structures, which contain "stem" and "ring". Some spacers were found to be homologous to partial sequences of plasmids or phages. The study indicated that there were correlations between repeat sequences and flanking sequences, and the repeats might act as a kind of recognition mechanism to mediate the interaction between foreign genetic elements and Cas proteins.
Pagnuco, Inti Anabela; Revuelta, María Victoria; Bondino, Hernán Gabriel; Brun, Marcel; Ten Have, Arjen
2018-01-01
Protein superfamilies can be divided into subfamilies of proteins with different functional characteristics. Their sequences can be classified hierarchically, which is part of sequence function assignation. Typically, there are no clear subfamily hallmarks that would allow pattern-based function assignation by which this task is mostly achieved based on the similarity principle. This is hampered by the lack of a score cut-off that is both sensitive and specific. HMMER Cut-off Threshold Tool (HMMERCTTER) adds a reliable cut-off threshold to the popular HMMER. Using a high quality superfamily phylogeny, it clusters a set of training sequences such that the cluster-specific HMMER profiles show cluster or subfamily member detection with 100% precision and recall (P&R), thereby generating a specific threshold as inclusion cut-off. Profiles and thresholds are then used as classifiers to screen a target dataset. Iterative inclusion of novel sequences to groups and the corresponding HMMER profiles results in high sensitivity while specificity is maintained by imposing 100% P&R self detection. In three presented case studies of protein superfamilies, classification of large datasets with 100% precision was achieved with over 95% recall. Limits and caveats are presented and explained. HMMERCTTER is a promising protein superfamily sequence classifier provided high quality training datasets are used. It provides a decision support system that aids in the difficult task of sequence function assignation in the twilight zone of sequence similarity. All relevant data and source codes are available from the Github repository at the following URL: https://github.com/BBCMdP/HMMERCTTER.
Pagnuco, Inti Anabela; Revuelta, María Victoria; Bondino, Hernán Gabriel; Brun, Marcel
2018-01-01
Background: Protein superfamilies can be divided into subfamilies of proteins with different functional characteristics. Their sequences can be classified hierarchically, which is part of sequence function assignation. Typically, there are no clear subfamily hallmarks that would allow pattern-based function assignation by which this task is mostly achieved based on the similarity principle. This is hampered by the lack of a score cut-off that is both sensitive and specific. Results: HMMER Cut-off Threshold Tool (HMMERCTTER) adds a reliable cut-off threshold to the popular HMMER. Using a high quality superfamily phylogeny, it clusters a set of training sequences such that the cluster-specific HMMER profiles show cluster or subfamily member detection with 100% precision and recall (P&R), thereby generating a specific threshold as inclusion cut-off. Profiles and thresholds are then used as classifiers to screen a target dataset. Iterative inclusion of novel sequences to groups and the corresponding HMMER profiles results in high sensitivity while specificity is maintained by imposing 100% P&R self detection. In three presented case studies of protein superfamilies, classification of large datasets with 100% precision was achieved with over 95% recall. Limits and caveats are presented and explained. Conclusions: HMMERCTTER is a promising protein superfamily sequence classifier provided high quality training datasets are used. It provides a decision support system that aids in the difficult task of sequence function assignation in the twilight zone of sequence similarity. All relevant data and source codes are available from the Github repository at the following URL: https://github.com/BBCMdP/HMMERCTTER. PMID:29579071
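The idea of a cut-off that gives 100% precision and recall can be sketched as follows. This is not HMMERCTTER's code: the function name is hypothetical, and the inputs are assumed to be profile scores for the sequences inside a cluster and for all other training sequences. A perfect cut-off exists only when every member scores above every non-member.

```python
def inclusion_cutoff(member_scores, nonmember_scores):
    """Return a score threshold with 100% precision and recall, or None.

    Sequences scoring at or above the returned threshold are accepted.
    The lowest member score is used as the threshold, which is valid only
    if it lies strictly above the highest non-member score.
    """
    lowest_member = min(member_scores)
    highest_nonmember = max(nonmember_scores, default=float("-inf"))
    if lowest_member > highest_nonmember:
        return lowest_member
    return None

# Toy usage with invented scores.
clean_cutoff = inclusion_cutoff([120.0, 95.5], [40.2, 60.0])
no_cutoff = inclusion_cutoff([120.0, 50.0], [60.0])
```

When no such threshold exists, the cluster would have to be redefined before its profile can serve as a classifier.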
NASA Astrophysics Data System (ADS)
Dal Molin, J. P.; Caliri, A.
2018-01-01
Here we focus on the conformational search for the native structure when it is ruled by the hydrophobic effect and steric specificities coming from amino acids. Our main tool of investigation is a 3D lattice model provided by a ten-letter alphabet, the stereochemical model. This minimalist model was conceived for Monte Carlo (MC) simulations when one keeps in mind the kinetic behavior of protein-like chains in solution. We have three central goals here. The first one is to characterize the folding time (τ) by two distinct sampling methods, so we present two sets of 10³ MC simulations for a fast protein-like sequence. The resulting sets of characteristic folding times, τ and τ_q, were obtained by the application of the standard Metropolis algorithm (MA), as well as by an enhanced algorithm (M_qA). The finding for τ_q shows two things: (i) the chain-solvent hydrophobic interactions {h_k} plus a set of inter-residue steric constraints {c_i,j} are able to emulate the conformational search for the native structure, since for each of the 10³ MC simulations performed the target is always found within a finite time window; (ii) the ratio τ_q/τ ≅ 1/10 suggests that the effect of local thermal fluctuations, encompassed by the Tsallis weight, provides the chain with an innate efficiency to escape from energetic and steric traps. To attest this first result, we performed additional MC simulations with variations of our design rule, applying both algorithms (the MA and the M_qA) to a restricted set of targets; a physical insight is provided. Our second finding was obtained by a set of 600 independent MC simulations, performed only with the M_qA applied to an extended set of 200 representative targets, our native structures. The results show how structural patterns should modulate τ_q, which covers four orders of magnitude; this finding is our second goal.
The third and last result was obtained with a special kind of simulation performed to explore a possible connection between the hydrophobic component of protein stability and the native structural topology. We simulated those same 200 targets again, with the M_qA only. This time, however, we evaluated the relative frequency {ϕ_q} with which each target visits its corresponding native structure along an appropriate simulation time. Due to the presence of the hydrophobic effect in our approach we obtained a strong correlation between the stability and the folding rate (R = 0.85). So, the faster a sequence finds its target, the larger the hydrophobic component of its stability. The strong correlation fulfills our last goal. This final finding suggests that the hydrophobic effect could not be a general stabilizing factor for proteins.
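The contrast between the standard Metropolis rule and a Tsallis-weighted rule can be sketched with the two acceptance probabilities below. The generalized form used here, [1 - (1 - q) ΔE/T]^(1/(1 - q)), is the common textbook q-exponential and is an assumption; the abstract does not give the exact weight or the value of q used by the authors.

```python
import math

def metropolis_accept(delta_e, temperature):
    """Standard Metropolis acceptance probability min(1, exp(-dE/T))."""
    if delta_e <= 0:
        return 1.0
    return math.exp(-delta_e / temperature)

def tsallis_accept(delta_e, temperature, q):
    """Acceptance probability based on a q-exponential (Tsallis) weight.

    Assumed form: min(1, [1 - (1 - q) * dE / T] ** (1 / (1 - q))).
    It reduces to the Metropolis rule in the limit q -> 1.
    """
    if delta_e <= 0:
        return 1.0
    if abs(q - 1.0) < 1e-12:
        return metropolis_accept(delta_e, temperature)
    base = 1.0 - (1.0 - q) * delta_e / temperature
    if base <= 0.0:
        return 0.0  # the weight is cut off when the base is non-positive
    return base ** (1.0 / (1.0 - q))

# Toy comparison for an uphill move with dE = T = 1 and q = 1.5.
p_standard = metropolis_accept(1.0, 1.0)
p_generalized = tsallis_accept(1.0, 1.0, 1.5)
```

For q > 1 the weight decays as a power law rather than exponentially, so uphill moves are accepted more often, which is one way a chain could escape traps faster.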
NASA Astrophysics Data System (ADS)
D'Sa, E. J.; Kim, H. C.; Ha, S. Y.
2016-12-01
Colored dissolved organic matter (CDOM) spectral absorption and excitation-emission matrix (EEMs) fluorescence with parallel factor analysis (PARAFAC) were examined in the Ross Sea during a survey conducted on board the R/V Araon in the austral summer of 14/15. CDOM absorption at 355 nm ranged from 0.06 to 1.14 m⁻¹ while spectral slope S calculated between 275-295 nm wavelength ranged from 18.83 to 33.32 µm⁻¹ with water masses playing an important role in its variability. Spectral slope S decreased with increasing CDOM absorption indicating the strong role of photo-oxidation on CDOM abundance during the summer. PARAFAC analysis of EEM data identified two humic-like (terrestrial and marine-like) and a protein-like (tryptophan-like) component. The two humic-like components were well correlated with little variability spatially and across the water column (0-100 m) likely indicating more refractory material. The protein-like fluorescent component was relatively quite variable supporting the autochthonous production of this fluorescent component in the highly productive Ross Sea waters.
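A spectral slope S over the 275-295 nm window is commonly estimated by assuming an exponential decay of absorption with wavelength, a(λ) = a(λ0) exp(-S (λ - λ0)), and fitting a straight line to ln a(λ). The sketch below follows that assumption with a log-linear least-squares fit; the abstract does not say which fitting method the authors used, and the function names and synthetic spectrum are illustrative.

```python
import math

def synthetic_spectrum(wavelengths_nm, a_ref, slope_per_nm, ref_nm):
    """Generate a(λ) = a_ref * exp(-S * (λ - ref_nm)) for testing."""
    return [a_ref * math.exp(-slope_per_nm * (w - ref_nm)) for w in wavelengths_nm]

def spectral_slope_per_um(wavelengths_nm, absorption_per_m):
    """Estimate S (in µm⁻¹) from a log-linear least-squares fit.

    The slope of ln a(λ) versus λ (nm) is -S in nm⁻¹; multiplying by
    1000 converts nm⁻¹ to µm⁻¹, the unit quoted in the abstract.
    """
    xs = list(wavelengths_nm)
    ys = [math.log(a) for a in absorption_per_m]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return -(sxy / sxx) * 1000.0

# Toy usage: a noiseless spectrum with S = 0.025 nm⁻¹ (25 µm⁻¹).
toy_wavelengths = [275, 280, 285, 290, 295]
toy_absorption = synthetic_spectrum(toy_wavelengths, 1.0, 0.025, 275)
toy_slope = spectral_slope_per_um(toy_wavelengths, toy_absorption)
```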
Insights into the redox components of dissolved organic matters during stabilization process.
Yuan, Ying; Xi, Bei-Dou; He, Xiao-Song; Ma, Yan; Zhang, Hui; Li, Dan; Zhao, Xin-Yu
2018-05-01
Changes in dissolved organic matter (DOM) components during the stabilization process significantly affect its redox properties but are rarely reported. Composting is a stabilization process of DOM, during which both the components and electron transfer capacities (ETCs) of DOM change. The redox components within compost-derived DOM during the stabilization process are investigated in this study. The results show that compost-derived DOM contained protein-like, fulvic-like, and humic-like components. The protein-like component decreases during composting, whereas the fulvic- and humic-like components increase during the process. The electron-donating capacity (EDC), electron-accepting capacity (EAC), and ETC of compost-derived DOM all increase during composting, but their correlations with the components differed significantly. The humic-like components were the main functional component responsible for both EDC and ETC, whereas the protein- and fulvic-like components showed negative relationships with the EAC, EDC, and ETC, suggesting that the components within DOM have specific redox properties during the stabilization process. These findings are very meaningful for better understanding the geochemical behaviors of DOM in the environment.
Dong, Qian-Qian; Zhang, Ai; Li, Yong-Mei; Chen, Ling; Huang, Qing-Hui
2014-03-01
Surface water samples from the Huangpu River were filtered to measure the UV absorption and fluorescence spectrum. Dissolved organic carbon (DOC), N-nitrosodimethylamine (NDMA), and its formation potential (NDMA-FP) were also analyzed to explore relationships between the properties of dissolved organic matter (DOM) and the formation potential of the disinfection byproduct NDMA in the Huangpu River. The study found that NDMA-FP concentration increased with increasing DOC concentration (r = 0.487, P < 0.01), but had negative relationships with SUVA254 and HIX (r = -0.605, P < 0.01; r = -0.396, P < 0.01). NDMA-FP concentration had positive relationships with the fluorescence intensity of protein-like substances such as low-molecular-weight (LMW) tyrosine-like and tryptophan-like substances (r = 0.421, P < 0.01; r = 0.426, P < 0.01), but had a negative relationship with humic-like substance (r = -0.422, P < 0.01). Therefore, NDMA formation potential increases with increasing DOM content in the Huangpu River, which is significantly related to the protein-like substances, but decreases with increasing aromaticity and humification of DOM.
Ding, An; Liang, Heng; Qu, Fangshu; Bai, Langming; Li, Guibai; Ngo, Huu Hao; Guo, Wenshan
2014-11-01
To mitigate membrane fouling of membrane-coupled anaerobic process, granular activated carbon (GAC: 50 g/L) was added into an expanded granular sludge bed (EGSB). A short-term ultrafiltration test was investigated for analyzing membrane fouling potential and underlying fouling mechanisms. The results showed that adding GAC into the EGSB not only improved the COD removal efficiency, but also alleviated membrane fouling efficiently because GAC could help to reduce soluble microbial products, polysaccharides and proteins by 26.8%, 27.8% and 24.7%, respectively, compared with the control system. Furthermore, excitation emission matrix (EEM) fluorescence spectroscopy analysis revealed that GAC addition mainly reduced tryptophan protein-like, aromatic protein-like and fulvic-like substances. In addition, the resistance distribution analysis demonstrated that adding GAC primarily decreased the cake layer resistance by 53.5%. The classic filtration mode analysis showed that cake filtration was the major fouling mechanism for membrane-coupled EGSB process regardless of the GAC addition.
Promoter Sequences Prediction Using Relational Association Rule Mining
Czibula, Gabriela; Bocicor, Maria-Iuliana; Czibula, Istvan Gergely
2012-01-01
In this paper we approach, from a computational perspective, the problem of promoter sequence prediction, an important problem within the field of bioinformatics. As the conditions for a DNA sequence to function as a promoter are not known, machine learning based classification models are still being developed to approach the problem of promoter identification in DNA. We propose a classification model based on relational association rule mining. Relational association rules are a particular type of association rule and describe numerical orderings between attributes that commonly occur over a data set. Our classifier is based on the discovery of relational association rules for predicting whether or not a DNA sequence contains a promoter region. An experimental evaluation of the proposed model and a comparison with similar existing approaches are provided. The obtained results show that our classifier outperforms the existing techniques for identifying promoter sequences, confirming the potential of our proposal. PMID:22563233
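To make the notion of a relational association rule more concrete, the sketch below mines the simplest possible case: binary rules of the form "attribute i < attribute j" that hold in at least a given fraction of records. The function name, the support threshold and the toy records are hypothetical; the authors' mining algorithm and their numerical encoding of DNA are not described in the abstract.

```python
from itertools import permutations

def ordering_rules(records, min_support=0.9):
    """Find attribute pairs (i, j) with record[i] < record[j] in most records.

    `records` is a list of equal-length numeric tuples. Returns a list of
    (i, j, support) triples whose support is at least `min_support`.
    """
    n_attributes = len(records[0])
    rules = []
    for i, j in permutations(range(n_attributes), 2):
        support = sum(r[i] < r[j] for r in records) / len(records)
        if support >= min_support:
            rules.append((i, j, support))
    return rules

# Toy usage: only "attribute 0 < attribute 1" holds in every record.
toy_records = [(1, 2, 3), (2, 3, 1), (0, 5, 9)]
toy_rules = ordering_rules(toy_records, min_support=1.0)
```

A classifier could then compare how well a new sequence satisfies the rules mined from promoter and non-promoter training sets.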
Predicting Flavonoid UGT Regioselectivity
Jackson, Rhydon; Knisley, Debra; McIntosh, Cecilia; Pfeiffer, Phillip
2011-01-01
Machine learning was applied to a challenging and biologically significant protein classification problem: the prediction of flavonoid UGT acceptor regioselectivity from primary sequence. Novel indices characterizing graphical models of residues were proposed and found to be widely distributed among existing amino acid indices and to cluster residues appropriately. UGT subsequences biochemically linked to regioselectivity were modeled as sets of index sequences. Several learning techniques incorporating these UGT models were compared with classifications based on standard sequence alignment scores. These techniques included an application of time series distance functions to protein classification. Time series distances defined on the index sequences were used in nearest neighbor and support vector machine classifiers. Additionally, Bayesian neural network classifiers were applied to the index sequences. The experiments identified improvements over the nearest neighbor and support vector machine classifications relying on standard alignment similarity scores, as well as strong correlations between specific subsequences and regioselectivities. PMID:21747849
Characterization of circulating transfer RNA-derived RNA fragments in cattle
Casas, Eduardo; Cai, Guohong; Neill, John D.
2015-01-01
The objective was to characterize naturally occurring circulating transfer RNA-derived RNA fragments (tRFs) in cattle. Serum from eight clinically normal adult dairy cows was collected, and small non-coding RNAs were extracted immediately after collection and sequenced by Illumina MiSeq. Sequences aligned to transfer RNA (tRNA) genes or their flanking sequences were characterized. Sequences aligned to the beginning of the 5′ end of the mature tRNA were classified as tRF5; those aligned to the 3′ end of mature tRNA were classified as tRF3; and those aligned to the beginning of the 3′ end flanking sequences were classified as tRF1. There were 3,190,962 sequences that mapped to transfer RNA and small non-coding RNAs in the bovine genome. Of these, 2,323,520 were identified as tRF5s, 562 were tRF3s, and 81 were tRF1s. There were 866,799 sequences identified as other small non-coding RNAs (microRNA, rRNA, snoRNA, etc.) and were excluded from the study. The tRF5s ranged from 28 to 40 nucleotides; and 98.7% ranged from 30 to 34 nucleotides in length. The tRFs with the greatest number of sequences were derived from tRNA of histidine, glutamic acid, lysine, glycine, and valine. There was no association between number of codons for each amino acid and number of tRFs in the samples. The reason for tRF5s being the most abundant can only be explained if these sequences are associated with function within the animal. PMID:26379699
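The three-way classification rule and the reported counts can be illustrated with a short sketch. The coordinate convention is an assumption made for this example: positions are 0-based and measured from the first base of the mature tRNA, with an exclusive end, so a read starting at `trna_length` begins in the 3′ flanking sequence. The counts are the ones given in the abstract.

```python
def classify_trf(read_start, read_end, trna_length):
    """Classify an aligned read as tRF5, tRF3, tRF1 or other.

    Hypothetical convention: 0-based start, exclusive end, both relative
    to the first base of the mature tRNA of length `trna_length`.
    """
    if read_start == 0 and read_end <= trna_length:
        return "tRF5"   # starts at the 5' end of the mature tRNA
    if read_start > 0 and read_end == trna_length:
        return "tRF3"   # ends at the 3' end of the mature tRNA
    if read_start == trna_length:
        return "tRF1"   # starts at the beginning of the 3' flank
    return "other"

# Counts reported in the abstract; they should add up to the mapped total.
reported_counts = {
    "tRF5": 2_323_520,
    "tRF3": 562,
    "tRF1": 81,
    "other small non-coding RNA": 866_799,
}
reported_total = sum(reported_counts.values())
```

The four categories sum to 3,190,962, which matches the number of mapped sequences stated in the abstract.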
Graphical classification of DNA sequences of HLA alleles by deep learning.
Miyake, Jun; Kaneshita, Yuhei; Asatani, Satoshi; Tagawa, Seiichi; Niioka, Hirohiko; Hirano, Takashi
2018-04-01
Alleles of human leukocyte antigen (HLA)-A DNAs are classified and expressed graphically by using artificial intelligence "Deep Learning (Stacked autoencoder)". Nucleotide sequence data corresponding to the length of 822 bp, collected from the Immuno Polymorphism Database, were compressed to 2-dimensional representation and were plotted. Profiles of the two-dimensional plots indicate that the alleles can be classified as clusters are formed. The two-dimensional plot of HLA-A DNAs gives a clear outlook for characterizing the various alleles.
Shen, Hong-Bin; Chou, Kuo-Chen
2005-11-25
The nucleus is the brain of eukaryotic cells that guides the life processes of the cell by issuing key instructions. For in-depth understanding of the biochemical process of the nucleus, the knowledge of localization of nuclear proteins is very important. With the avalanche of protein sequences generated in the post-genomic era, it is highly desired to develop an automated method for fast annotating the subnuclear locations for numerous newly found nuclear protein sequences so as to be able to timely utilize them for basic research and drug discovery. In view of this, a novel approach is developed for predicting the protein subnuclear location. It is featured by introducing a powerful classifier, the optimized evidence-theoretic K-nearest classifier, and using the pseudo amino acid composition [K.C. Chou, PROTEINS: Structure, Function, and Genetics, 43 (2001) 246], which can incorporate a considerable amount of sequence-order effects, to represent protein samples. As a demonstration, identifications were performed for 370 nuclear proteins among the following 9 subnuclear locations: (1) Cajal body, (2) chromatin, (3) heterochromatin, (4) nuclear diffuse, (5) nuclear pore, (6) nuclear speckle, (7) nucleolus, (8) PcG body, and (9) PML body. The overall success rates thus obtained by both the re-substitution test and jackknife cross-validation test are significantly higher than those by existing classifiers on the same working dataset. It is anticipated that the powerful approach may also become a useful high throughput vehicle to bridge the huge gap occurring in the post-genomic era between the number of gene sequences in databases and the number of gene products that have been functionally characterized. The OET-KNN classifier will be available at www.pami.sjtu.edu.cn/people/hbshen.
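The jackknife (leave-one-out) test mentioned above can be sketched with a plain k-nearest-neighbour classifier. This is a deliberate simplification: it uses ordinary Euclidean KNN rather than the optimized evidence-theoretic classifier, and the two-dimensional toy features stand in for pseudo amino acid composition vectors. All names and data are illustrative.

```python
import math
from collections import Counter

def knn_predict(training, query, k=1):
    """Predict a label by majority vote among the k nearest training samples.

    `training` is a list of (feature_tuple, label) pairs.
    """
    neighbours = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

def jackknife_accuracy(samples, k=1):
    """Leave each sample out in turn, predict it from the rest, and
    return the fraction of correct predictions."""
    correct = 0
    for i, (features, label) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        if knn_predict(rest, features, k) == label:
            correct += 1
    return correct / len(samples)

# Toy usage: two well-separated classes of invented feature vectors.
toy_samples = [((0, 0), "nucleolus"), ((0, 1), "nucleolus"),
               ((5, 5), "chromatin"), ((5, 6), "chromatin")]
toy_accuracy = jackknife_accuracy(toy_samples)
```

In contrast to a re-substitution test, each sample is excluded from the training data when it is predicted, so the estimate is less optimistic.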
Yuan, Xiao Chun; Chen, Yue Min; Yuan, Shuo; Zheng, Wei; Si, You Tao; Yuan, Zhi Peng; Lin, Wei Sheng; Yang, Yu Sheng
2017-01-01
To study the effects of nitrogen deposition on the concentration and spectral characteristics of dissolved organic matter (DOM) in the forest soil solution from a subtropical Cunninghamia lanceolata plantation, the dynamics of DOM in soil solutions from the 0-15 and 15-30 cm soil layers were monitored for two years using a negative pressure sampling method, and the spectroscopic features of DOM were analyzed. The results showed that nitrogen deposition significantly reduced the concentration of dissolved organic carbon (DOC), and increased the aromatic index (AI) and the humic index (HIX), but had no significant effect on dissolved organic nitrogen (DON) concentration in both soil layers. There was obvious seasonal variation in DOM concentration of the soil solution, which was prominently higher in summer and autumn than in spring and winter. Fourier-transform infrared (FTIR) absorption spectrometry indicated that the DOM in forest soil solution had absorption peaks in the similar position of six regions, being the highest in wave number of 1145-1149 cm⁻¹. Three-dimensional fluorescence spectra indicated that DOM consisted mainly of protein-like substances (Ex/Em=230 nm/300 nm) and microbial degradation products (Ex/Em=275 nm/300 nm). The availability of protein-like substances from the 0-15 cm soil layer was reduced in the nitrogen treatments. Nitrogen deposition significantly reduced the concentration of DOC in soil solution, maybe largely by reducing soil pH, inhibiting soil carbon mineralization and stimulating plant growth. In particular, the decline of DOC concentration in the surface layer was due to the production inhibition of the protein-like substances and carboxylic acids. Short-term nitrogen deposition might be beneficial to the maintenance of soil fertility, while the long-term accumulation of nitrogen deposition might lead to the hard utilization of soil nutrients.
Promoter classifier: software package for promoter database analysis.
Gershenzon, Naum I; Ioshikhes, Ilya P
2005-01-01
Promoter Classifier is a package of seven stand-alone Windows-based C++ programs allowing the following basic manipulations with a set of promoter sequences: (i) calculation of positional distributions of nucleotides averaged over all promoters of the dataset; (ii) calculation of the averaged occurrence frequencies of the transcription factor binding sites and their combinations; (iii) division of the dataset into subsets of sequences containing or lacking certain promoter elements or combinations; (iv) extraction of the promoter subsets containing or lacking CpG islands around the transcription start site; and (v) calculation of spatial distributions of the promoter DNA stacking energy and bending stiffness. All programs have a user-friendly interface and provide the results in a convenient graphical form. The Promoter Classifier package is an effective tool for various basic manipulations with eukaryotic promoter sequences that usually are necessary for analysis of large promoter datasets. The program Promoter Divider is described in more detail as a representative component of the package.
On-Line Detection and Segmentation of Sports Motions Using a Wearable Sensor.
Kim, Woosuk; Kim, Myunggyu
2018-03-19
In sports motion analysis, observation is a prerequisite for understanding the quality of motions. This paper introduces a novel approach to detect and segment sports motions using a wearable sensor for supporting systematic observation. The main goal is, for convenient analysis, to automatically provide motion data, which are temporally classified according to the phase definition. For explicit segmentation, a motion model is defined as a sequence of sub-motions with boundary states. A sequence classifier based on deep neural networks is designed to detect sports motions from continuous sensor inputs. The evaluation on two types of motions (soccer kicking and two-handed ball throwing) verifies that the proposed method is successful for the accurate detection and segmentation of sports motions. By developing a sports motion analysis system using the motion model and the sequence classifier, we show that the proposed method is useful for observation of sports motions by automatically providing relevant motion data for analysis.
CATCh, an Ensemble Classifier for Chimera Detection in 16S rRNA Sequencing Studies
Mysara, Mohamed; Saeys, Yvan; Leys, Natalie; Raes, Jeroen
2014-01-01
In ecological studies, microbial diversity is nowadays mostly assessed via the detection of phylogenetic marker genes, such as 16S rRNA. However, PCR amplification of these marker genes produces a significant amount of artificial sequences, often referred to as chimeras. Different algorithms have been developed to remove these chimeras, but efforts to combine different methodologies are limited. Therefore, two machine learning classifiers (reference-based and de novo CATCh) were developed by integrating the output of existing chimera detection tools into a new, more powerful method. When comparing our classifiers with existing tools in either the reference-based or de novo mode, a higher performance of our ensemble method was observed on a wide range of sequencing data, including simulated, 454 pyrosequencing, and Illumina MiSeq data sets. Since our algorithm combines the advantages of different individual chimera detection tools, our approach produces more robust results when challenged with chimeric sequences having a low parent divergence, short length of the chimeric range, and various numbers of parents. Additionally, it could be shown that integrating CATCh in the preprocessing pipeline has a beneficial effect on the quality of the clustering in operational taxonomic units. PMID:25527546
Bernal, María; Casero, David; Singh, Vasantika; Wilson, Grandon T.; Grande, Arne; Yang, Huijun; Dodani, Sheel C.; Pellegrini, Matteo; Huijser, Peter; Connolly, Erin L.; Merchant, Sabeeha S.; Krämer, Ute
2012-01-01
The transition metal copper (Cu) is essential for all living organisms but is toxic when present in excess. To identify Cu deficiency responses comprehensively, we conducted genome-wide sequencing-based transcript profiling of Arabidopsis thaliana wild-type plants and of a mutant defective in the gene encoding SQUAMOSA PROMOTER BINDING PROTEIN-LIKE7 (SPL7), which acts as a transcriptional regulator of Cu deficiency responses. In response to Cu deficiency, FERRIC REDUCTASE OXIDASE5 (FRO5) and FRO4 transcript levels increased strongly, in an SPL7-dependent manner. Biochemical assays and confocal imaging of a Cu-specific fluorophore showed that high-affinity root Cu uptake requires prior FRO5/FRO4-dependent Cu(II)-specific reduction to Cu(I) and SPL7 function. Plant iron (Fe) deficiency markers were activated in Cu-deficient media, in which reduced growth of the spl7 mutant was partially rescued by Fe supplementation. Cultivation in Cu-deficient media caused a defect in root-to-shoot Fe translocation, which was exacerbated in spl7 and associated with a lack of ferroxidase activity. This is consistent with a possible role for a multicopper oxidase in Arabidopsis Fe homeostasis, as previously described in yeast, humans, and green algae. These insights into root Cu uptake and the interaction between Cu and Fe homeostasis will advance plant nutrition, crop breeding, and biogeochemical research. PMID:22374396
Fiannaca, Antonino; La Rosa, Massimo; Rizzo, Riccardo; Urso, Alfonso
2015-07-01
In this paper, an alignment-free method for DNA barcode classification that is based on both a spectral representation and a neural gas network for unsupervised clustering is proposed. In the proposed methodology, distinctive words are identified from a spectral representation of DNA sequences. A taxonomic classification of the DNA sequence is then performed using the sequence signature, i.e., the smallest set of k-mers that can assign a DNA sequence to its proper taxonomic category. Experiments were then performed to compare our method with other supervised machine learning classification algorithms, such as support vector machine, random forest, ripper, naïve Bayes, ridor, and classification tree, which also consider short DNA sequence fragments of 200 and 300 base pairs (bp). The experimental tests were conducted over 10 real barcode datasets belonging to different animal species, which were provided by the on-line resource "Barcode of Life Database". The experimental results showed that our k-mer-based approach is directly comparable, in terms of accuracy, recall and precision metrics, with the other classifiers when considering full-length sequences. In addition, we demonstrate the robustness of our method when a classification task is performed with a set of short DNA sequences that were randomly extracted from the original data. For example, the proposed method can reach an accuracy of 64.8% at the species level with 200-bp fragments. Under the same conditions, the best other classifier (random forest) reaches an accuracy of 20.9%. Our results indicate that we obtained a clear improvement over the other classifiers for the study of short DNA barcode sequence fragments. Copyright © 2015 Elsevier B.V. All rights reserved.
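The spectral (k-mer) representation that this abstract builds on can be sketched as follows. This is only an assumed, minimal version of the idea: it counts overlapping k-mers into a fixed-order frequency vector and compares two vectors by cosine similarity. The neural gas clustering and the signature selection described in the abstract are not reproduced, and the names `kmer_spectrum` and `cosine_similarity` are hypothetical.

```python
from collections import Counter
from itertools import product
import math


def kmer_spectrum(sequence, k=3):
    """Frequency vector over all 4**k DNA k-mers, in lexicographic order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Only k-mers made of A, C, G, T contribute to the normalization.
    total = sum(counts[km] for km in kmers)
    return [counts[km] / total if total else 0.0 for km in kmers]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is null)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy fragments (invented); real barcodes would be a few hundred bp long.
spectrum = kmer_spectrum("ACGTACGT", k=2)
other = kmer_spectrum("ACGTTTTT", k=2)
print(cosine_similarity(spectrum, other))
```

Because the vector has a fixed length regardless of sequence length, short fragments and full-length barcodes can be compared without an alignment, which is the property the abstract relies on.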
USDA-ARS's Scientific Manuscript database
Chitin-binding proteins (CBPs) exist in various species and are involved in different biological processes. In the present study, we cloned a full-length cDNA of chitin-binding protein-like (PpCBP-like) from Pteromalus puparum, a pupal endoparasitoid of Pieris rapae. PpCBP-like encoded a 96 putative amin...
Molecular epidemiological characterization of poultry red mite, Dermanyssus gallinae, in Japan
CHU, Thi Thanh Huong; MURANO, Takako; UNO, Yukiko; USUI, Tatsufumi; YAMAGUCHI, Tsuyoshi
2015-01-01
Dermanyssus gallinae, the poultry red mite, is an obligatory blood-sucking ectoparasite. The genetic diversity of D. gallinae has been examined in some countries, but so far not in Asian countries. Here, we sequenced a part of the mitochondrial cytochrome oxidase subunit I (COI) and 16S rRNA genes and the nuclear internal transcribed spacers (ITS) region in 239 mite samples collected from 40 prefectures throughout Japan. The COI and 16S rRNA nucleotide sequences were classified into 28 and 26 haplotypes, respectively. In phylogenetic trees, the haplotypes clustered into 2 haplogroups corresponding to haplogroups A and B, which were previously reported. Haplogroups A and B were further subdivided into sub-haplogroups AJ1 and AJ2, and BJ1 and BJ2, respectively. In both trees, the sequences of haplotypes in AJ1 and BJ2 were relatively distant from those reported in other countries, while some sequences in AJ2 and BJ1 were identical to those in Europe. In addition, the ITS sequences were classified into two sequences, and both sequences were closely related to the sequences found in European countries. These findings indicate a possibility of international overseas transmission of D. gallinae. PMID:26074251
Molecular epidemiological characterization of poultry red mite, Dermanyssus gallinae, in Japan.
Chu, Thi Thanh Huong; Murano, Takako; Uno, Yukiko; Usui, Tatsufumi; Yamaguchi, Tsuyoshi
2015-11-01
Dermanyssus gallinae, the poultry red mite, is an obligatory blood-sucking ectoparasite. The genetic diversity of D. gallinae has been examined in some countries, but so far not in Asian countries. Here, we sequenced a part of the mitochondrial cytochrome oxidase subunit I (COI) and 16S rRNA genes and the nuclear internal transcribed spacers (ITS) region in 239 mite samples collected from 40 prefectures throughout Japan. The COI and 16S rRNA nucleotide sequences were classified into 28 and 26 haplotypes, respectively. In phylogenetic trees, the haplotypes clustered into 2 haplogroups corresponding to haplogroups A and B, which were previously reported. Haplogroups A and B were further subdivided into sub-haplogroups AJ1 and AJ2, and BJ1 and BJ2, respectively. In both trees, the sequences of haplotypes in AJ1 and BJ2 were relatively distant from those reported in other countries, while some sequences in AJ2 and BJ1 were identical to those in Europe. In addition, the ITS sequences were classified into two sequences, and both sequences were closely related to the sequences found in European countries. These findings indicate a possibility of international overseas transmission of D. gallinae.
A taxonomy for mechanical ventilation: 10 fundamental maxims.
Chatburn, Robert L; El-Khatib, Mohamad; Mireles-Cabodevila, Eduardo
2014-11-01
The American Association for Respiratory Care has declared a benchmark for competency in mechanical ventilation that includes the ability to "apply to practice all ventilation modes currently available on all invasive and noninvasive mechanical ventilators." This level of competency presupposes the ability to identify, classify, compare, and contrast all modes of ventilation. Unfortunately, current educational paradigms do not supply the tools to achieve such goals. To fill this gap, we expand and refine a previously described taxonomy for classifying modes of ventilation and explain how it can be understood in terms of 10 fundamental constructs of ventilator technology: (1) defining a breath, (2) defining an assisted breath, (3) specifying the means of assisting breaths based on control variables specified by the equation of motion, (4) classifying breaths in terms of how inspiration is started and stopped, (5) identifying ventilator-initiated versus patient-initiated start and stop events, (6) defining spontaneous and mandatory breaths, (7) defining breath sequences, (8) combining control variables and breath sequences into ventilatory patterns, (9) describing targeting schemes, and (10) constructing a formal taxonomy for modes of ventilation composed of control variable, breath sequence, and targeting schemes. Having established the theoretical basis of the taxonomy, we demonstrate a step-by-step procedure to classify any mode on any mechanical ventilator. Copyright © 2014 by Daedalus Enterprises.
Role of Autism Susceptibility Gene, CNTNAP2, in Neural Circuitry for Vocal Communication
2012-10-01
experiments designed to test this hypothesis, we have now shown that FoxP2 protein in RA follows the mRNA pattern, with a striking change between...associated protein-like 2 (Cntnap2) is an exciting molecule for the study of the genetic basis of language. In humans, Cntnap2 is a target of the FOXP2
Gao, Lei; Fan, Daidu; Li, Daoji; Cai, Jingong
2010-04-01
Twenty-eight surface water samples from rivers, muddy intertidal flats, sand shores, and bedrock coasts were collected along the Zhejiang coastline in southeast China. In addition, three samples from the Changjiang (Yangtze River) were collected for comparison. CDOM (chromophoric dissolved organic matter) absorption and fluorescence excitation-emission matrix (EEM) spectroscopy, as well as nutrients and DOC were measured in these samples. According to salinity, nutrient, and DOC constituents, the 28 Zhejiang samples were categorized into four groups, i.e. highly-polluted, river derived, muddy-flat derived, and saltwater dominated ones. Among the six parameters (two humic-like and two protein-like peak intensities in fluorescence EEM contours, absorption at 300 nm, and DOC concentration) for the Zhejiang samples, any two of them were positively correlated. The submarine groundwater discharge, rather than local rivers, might have provided most of the freshwater that interacted with the saltwater during the mixing process. The high protein-like EEM peaks in samples from muddy salt marshes and rivers were probably caused by terrestrial inputs, land-based pollution, and local biological activities in combination. Copyright 2009. Published by Elsevier Ltd.
NASA Astrophysics Data System (ADS)
Lee, Shin-Ah; Kim, Guebuem
2018-02-01
We monitored seasonal variations in dissolved organic carbon (DOC), the stable carbon isotope of DOC (δ13C-DOC), and fluorescent dissolved organic matter (FDOM) in water samples from a fixed station in the Nakdong River Estuary, Korea. Sampling was performed every hour during spring tide once a month from October 2014 to August 2015. The concentrations of DOC and humic-like FDOM showed significant negative correlations against salinity (r2 = 0.42-0.98, p < 0.0001), indicating that the river-originated DOM components were the major source and behave conservatively in the estuarine mixing zone. The extrapolated δ13C-DOC values (-27.5 to -24.5 ‰) in fresh water confirm that both components are mainly of terrestrial origin. The slopes of humic-like FDOM against salinity were 60-80 % higher in the summer and fall due to higher terrestrial production of humic-like FDOM. The slopes of protein-like FDOM against salinity, however, were 70-80 % higher in spring due to higher biological production in river water. Our results suggest that there are large seasonal changes in riverine fluxes of humic- and protein-like FDOM to the ocean.
Jiang, Yulin; Zhao, Jianfu; Li, Penghui; Huang, Qinghui
2016-10-12
Because of the significance in photosynthesis, nutrient dynamics, trophodynamics and biological activity, dissolved organic matter (DOM) is important to the microbial community in the coastal plume zone. In this study, we investigated the hydrodynamic processes, photodegradation and biodegradation of DOM at the Yangtze River plume in the East China Sea through analyzing water quality and optical properties of DOM. Surface water samples were collected to examine water quality and fluorescence properties of fluorescent dissolved organic matter (FDOM). The results indicated that dilution was the key factor in the multiple processes, and the mixing process gradually increased from nearshore to offshore in coastal water. Four components of FDOM representing humic-like substances (C1 & C4) and protein-like substances (C2 & C3) were identified, and all components showed nearly conservative behaviors. Protein-like substances were more mutable compared to humic-like substances. The photodegradation of humic-like substances caused brown algae blooms to some extent. The molecular weight of humic substances gradually decreased along the mixing process. FDOM in the plume zone was both of terrigenous and autochthonous origins, and the characteristic of terrigenous origin was obvious compared to that of autochthonous origin.
Zhu, Long-Ji; Zhao, Yue; Chen, Yan-Ni; Cui, Hong-Yang; Wei, Yu-Quan; Liu, Hai-Long; Chen, Xiao-Meng; Wei, Zi-Min
2018-01-01
Atrazine is widely used in agriculture. In this study, dissolved organic matter (DOM) from soils under four types of land use (forest (F), meadow (M), cropland (C) and wetland (W)) was used to investigate the binding characteristics of atrazine. Fluorescence excitation-emission matrix-parallel factor (EEM-PARAFAC) analysis, two-dimensional correlation spectroscopy (2D-COS) and Stern-Volmer model were combined to explore the complexation between DOM and atrazine. The EEM-PARAFAC indicated that DOM from different sources had different structures, and humic-like components had more obvious quenching effects than protein-like components. The Stern-Volmer model combined with correlation analysis showed that log K values of PARAFAC components had a significant correlation with the humification of DOM, especially for C3 component, and they were all in the same order as follows: meadow soil (5.68)>wetland soil (5.44)>cropland soil (5.35)>forest soil (5.04). The 2D-COS further confirmed that humic-like components firstly combined with atrazine followed by protein-like components. These findings suggest that DOM components can significantly influence the bioavailability, mobility and migration of atrazine in different land uses. Copyright © 2016 Elsevier Inc. All rights reserved.
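The log K values quoted above come from a Stern-Volmer analysis of fluorescence quenching. The abstract does not state which form of the model was fitted, so the sketch below assumes the standard linear relation F0/F = 1 + K[Q], where F0 and F are the fluorescence intensities without and with quencher at concentration [Q]. The function name, the synthetic data and the constant `k_true` are all invented for illustration and are not values from the study.

```python
import math


def stern_volmer_constant(f0, intensities, quencher_conc):
    """Least-squares slope of (F0/F - 1) against [Q], with the line forced through the origin."""
    ys = [f0 / f - 1.0 for f in intensities]
    numerator = sum(q * y for q, y in zip(quencher_conc, ys))
    denominator = sum(q * q for q in quencher_conc)
    return numerator / denominator


# Synthetic, noise-free titration generated from an assumed K (L/mol).
k_true = 2.0e5
f0 = 100.0
concentrations = [1e-6, 2e-6, 4e-6]  # mol/L, hypothetical
intensities = [f0 / (1.0 + k_true * q) for q in concentrations]

k_fit = stern_volmer_constant(f0, intensities, concentrations)
log_k = math.log10(k_fit)
print(k_fit, log_k)
```

With real titration data the points scatter around the line, and a larger fitted K (hence larger log K) indicates stronger quenching of that fluorescent component by the added compound.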
Size distribution of absorbing and fluorescing DOM in Beaufort Sea, Canada Basin
NASA Astrophysics Data System (ADS)
Gao, Zhiyuan; Guéguen, Céline
2017-03-01
The molecular weight (MW) of dissolved organic matter (DOM) is considered as an important factor affecting the bioavailability of organic matter and associated chemical species. Colored DOM (CDOM) MW distribution was determined, for the first time, in the Beaufort Sea (Canada Basin) by asymmetrical flow field-flow fractionation (AF4) coupled with online diode array ultra violet-visible photometer and offline fluorescence detectors. The apparent MW ranged from 1.07 to 1.45 kDa, congruent with previous studies using high performance size exclusion chromatography and tangential flow filtration. Interestingly, a minimum in MW was associated with the Pacific Summer Waters (PSW), while higher MW was associated with the Pacific Winter Waters (PWW). The Arctic Intermediate Waters (AIW) did not show any significant change in MW and fluorescence intensities distribution between stations, suggesting homogeneous DOM composition in deep waters. Three fluorescence components including two humic-like components and one protein-like component were PARAFAC-validated. With the increase of MW, protein-like fluorescence component became more dominant while the majority remained as marine/microbially derived humic-like components. Overall, it is concluded that water mass origin influenced DOM MW distribution in the Arctic Ocean.
Baker, Andy; Ward, David; Lieten, Shakti H; Periera, Ryan; Simpson, Ellie C; Slater, Malcolm
2004-07-01
Protein-like fluorescence intensity in rivers increases with increasing anthropogenic DOM inputs from sewerage and farm wastes. Here, a portable luminescence spectrophotometer was used to investigate whether this technology could provide both field scientists with a rapid pollution monitoring tool and process control engineers with a portable waste water monitoring device, through the measurement of river and waste water tryptophan-like fluorescence from a range of rivers in NE England and from effluents from within two waste water treatment plants. The portable spectrophotometer determined that waste waters and sewerage effluents had the highest tryptophan-like fluorescence intensity, urban streams had an intermediate tryptophan-like fluorescence intensity, and the upstream river samples of good water quality the lowest tryptophan-like fluorescence intensity. Replicate samples demonstrated that fluorescence intensity is reproducible to +/- 20% for low fluorescence, 'clean' river water samples and +/- 5% for urban water and waste waters. The correlation between fluorescence measured by the portable spectrophotometer and by a conventional bench machine was 0.91 (Spearman's rho, n = 143), demonstrating that the portable spectrophotometer does correlate with tryptophan-like fluorescence intensity measured using the bench spectrophotometer.
Chen, Peng; Li, Jinyan
2010-05-17
Prediction of long-range inter-residue contacts is an important topic in bioinformatics research. It is helpful for determining protein structures, understanding protein foldings, and therefore advancing the annotation of protein functions. In this paper, we propose a novel ensemble of genetic algorithm classifiers (GaCs) to address the long-range contact prediction problem. Our method is based on the key idea called sequence profile centers (SPCs). Each SPC is the average sequence profiles of residue pairs belonging to the same contact class or non-contact class. GaCs train on multiple but different pairs of long-range contact data (positive data) and long-range non-contact data (negative data). The negative data sets, having roughly the same sizes as the positive ones, are constructed by random sampling over the original imbalanced negative data. As a result, about 21.5% long-range contacts are correctly predicted. We also found that the ensemble of GaCs indeed makes an accuracy improvement by around 5.6% over the single GaC. Classifiers with the use of sequence profile centers may advance the long-range contact prediction. In line with this approach, key structural features in proteins would be determined with high efficiency and accuracy.
NASA Astrophysics Data System (ADS)
Kwon, Hyeong Kyu; Kim, Guebuem; Lim, Weol Ae; Park, Jong Woo
2018-04-01
We investigated phytoplankton pigments, dissolved organic carbon (DOC), and fluorescent dissolved organic matter (FDOM) during the summers of 2013 and 2016 in the coastal area of Tongyeong, Korea, where Cochlodinium polykrikoides blooms often occur. The density of red tides was evaluated using a dinoflagellate pigment, peridinin. The concentrations of peridinin and DOC in the patch areas were 15- and 4-fold higher than those in the non-patch areas. The parallel factor analysis (PARAFAC) model identified one protein-like FDOM (FDOMT) and two humic-like FDOM, classically classified as marine FDOM (FDOMM) and terrestrial FDOM (FDOMC). The concentrations of FDOMT in the patch areas were 5-fold higher than those in the non-patch areas, likely associated with biological production. In general, FDOMM and FDOMC are known to be dependent exclusively on salinity in any surface waters of the coastal ocean. However, in this study, we observed strikingly enhanced FDOMC concentration over that expected from the salinity mixing, whereas FDOMM increases were not clear. These FDOMC concentrations showed a significant positive correlation against peridinin, indicating that the production of FDOMC is associated with the red tide blooms. Our results suggest that FDOMC can be naturally enriched by some phytoplankton species, without FDOMM enrichment. Such naturally produced FDOM may play a critical role in biological production as well as biogeochemical cycle in red tide regions.
Borozan, Ivan; Watt, Stuart; Ferretti, Vincent
2015-05-01
Alignment-based sequence similarity searches, while accurate for some types of sequences, can produce incorrect results when used on more divergent but functionally related sequences that have undergone the sequence rearrangements observed in many bacterial and viral genomes. Here, we propose a classification model that exploits the complementary nature of alignment-based and alignment-free similarity measures with the aim to improve the accuracy with which DNA and protein sequences are characterized. Our model classifies sequences using a combined sequence similarity score calculated by adaptively weighting the contribution of different sequence similarity measures. Weights are determined independently for each sequence in the test set and reflect the discriminatory ability of individual similarity measures in the training set. Because the similarity between some sequences is determined more accurately with one type of measure rather than another, our classifier allows different sets of weights to be associated with different sequences. Using five different similarity measures, we show that our model significantly improves the classification accuracy over the current composition- and alignment-based models, when predicting the taxonomic lineage for both short viral sequence fragments and complete viral sequences. We also show that our model can be used effectively for the classification of reads from a real metagenome dataset as well as protein sequences. All the datasets and the code used in this study are freely available at https://collaborators.oicr.on.ca/vferretti/borozan_csss/csss.html. Contact: ivan.borozan@gmail.com. Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press.
Borozan, Ivan; Watt, Stuart; Ferretti, Vincent
2015-01-01
Motivation: Alignment-based sequence similarity searches, while accurate for some types of sequences, can produce incorrect results when used on more divergent but functionally related sequences that have undergone the sequence rearrangements observed in many bacterial and viral genomes. Here, we propose a classification model that exploits the complementary nature of alignment-based and alignment-free similarity measures with the aim to improve the accuracy with which DNA and protein sequences are characterized. Results: Our model classifies sequences using a combined sequence similarity score calculated by adaptively weighting the contribution of different sequence similarity measures. Weights are determined independently for each sequence in the test set and reflect the discriminatory ability of individual similarity measures in the training set. Because the similarity between some sequences is determined more accurately with one type of measure rather than another, our classifier allows different sets of weights to be associated with different sequences. Using five different similarity measures, we show that our model significantly improves the classification accuracy over the current composition- and alignment-based models, when predicting the taxonomic lineage for both short viral sequence fragments and complete viral sequences. We also show that our model can be used effectively for the classification of reads from a real metagenome dataset as well as protein sequences. Availability and implementation: All the datasets and the code used in this study are freely available at https://collaborators.oicr.on.ca/vferretti/borozan_csss/csss.html. Contact: ivan.borozan@gmail.com Supplementary information: Supplementary data are available at Bioinformatics online. PMID:25573913
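The combined score described in this abstract can be illustrated with a small weighted-sum sketch. It assumes that each similarity measure has already produced a score per candidate label for one query sequence and that the per-query weights are given; how the authors derive those weights from the training set is not reproduced here. The function names, measure names, labels and numbers are all hypothetical.

```python
def combined_scores(similarities, weights):
    """Weighted average of per-measure similarity scores for every candidate label.

    similarities: {measure_name: {label: score}}
    weights:      {measure_name: weight} for this particular query sequence
    """
    total_weight = sum(weights[m] for m in similarities)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    labels = set().union(*(scores.keys() for scores in similarities.values()))
    return {
        label: sum(weights[m] * similarities[m].get(label, 0.0) for m in similarities)
        / total_weight
        for label in labels
    }


def classify(similarities, weights):
    """Return the label with the highest combined score."""
    scores = combined_scores(similarities, weights)
    return max(scores, key=scores.get)


# Invented scores for one query: the two measures disagree about the best label.
similarities = {
    "alignment": {"virus_A": 0.9, "virus_B": 0.4},
    "kmer": {"virus_A": 0.3, "virus_B": 0.8},
}
trust_kmer = {"alignment": 0.2, "kmer": 0.8}
trust_alignment = {"alignment": 0.8, "kmer": 0.2}
print(classify(similarities, trust_kmer), classify(similarities, trust_alignment))
```

The same scores yield different labels under the two weight sets, which shows why assigning weights per query sequence, rather than globally, can change the outcome for divergent or rearranged sequences.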
Kurgan, Lukasz; Cios, Krzysztof; Chen, Ke
2008-05-01
Protein structure prediction methods provide accurate results when a homologous protein is predicted, while poorer predictions are obtained in the absence of homologous templates. However, some protein chains that share twilight-zone pairwise identity can form similar folds and thus determining structural similarity without the sequence similarity would be desirable for the structure prediction. The folding type of a protein or its domain is defined as the structural class. Current structural class prediction methods that predict the four structural classes defined in SCOP provide up to 63% accuracy for the datasets in which sequence identity of any pair of sequences belongs to the twilight-zone. We propose SCPRED method that improves prediction accuracy for sequences that share twilight-zone pairwise similarity with sequences used for the prediction. SCPRED uses a support vector machine classifier that takes several custom-designed features as its input to predict the structural classes. Based on extensive design that considers over 2300 index-, composition- and physicochemical properties-based features along with features based on the predicted secondary structure and content, the classifier's input includes 8 features based on information extracted from the secondary structure predicted with PSI-PRED and one feature computed from the sequence. Tests performed with datasets of 1673 protein chains, in which any pair of sequences shares twilight-zone similarity, show that SCPRED obtains 80.3% accuracy when predicting the four SCOP-defined structural classes, which is superior when compared with over a dozen recent competing methods that are based on support vector machine, logistic regression, and ensemble of classifiers predictors. The SCPRED can accurately find similar structures for sequences that share low identity with sequence used for the prediction. 
The high predictive accuracy achieved by SCPRED is attributed to the design of the features, which are capable of separating the structural classes in spite of their low dimensionality. We also demonstrate that the SCPRED's predictions can be successfully used as a post-processing filter to improve performance of modern fold classification methods.
Kurgan, Lukasz; Cios, Krzysztof; Chen, Ke
2008-01-01
Background Protein structure prediction methods provide accurate results when a homologous protein is predicted, while poorer predictions are obtained in the absence of homologous templates. However, some protein chains that share twilight-zone pairwise identity can form similar folds and thus determining structural similarity without the sequence similarity would be desirable for the structure prediction. The folding type of a protein or its domain is defined as the structural class. Current structural class prediction methods that predict the four structural classes defined in SCOP provide up to 63% accuracy for the datasets in which sequence identity of any pair of sequences belongs to the twilight-zone. We propose SCPRED method that improves prediction accuracy for sequences that share twilight-zone pairwise similarity with sequences used for the prediction. Results SCPRED uses a support vector machine classifier that takes several custom-designed features as its input to predict the structural classes. Based on extensive design that considers over 2300 index-, composition- and physicochemical properties-based features along with features based on the predicted secondary structure and content, the classifier's input includes 8 features based on information extracted from the secondary structure predicted with PSI-PRED and one feature computed from the sequence. Tests performed with datasets of 1673 protein chains, in which any pair of sequences shares twilight-zone similarity, show that SCPRED obtains 80.3% accuracy when predicting the four SCOP-defined structural classes, which is superior when compared with over a dozen recent competing methods that are based on support vector machine, logistic regression, and ensemble of classifiers predictors. Conclusion The SCPRED can accurately find similar structures for sequences that share low identity with sequence used for the prediction. 
The high predictive accuracy achieved by SCPRED is attributed to the design of the features, which are capable of separating the structural classes in spite of their low dimensionality. We also demonstrate that the SCPRED's predictions can be successfully used as a post-processing filter to improve performance of modern fold classification methods. PMID:18452616
Park, Doori; Park, Su-Hyun; Ban, Yong Wook; Kim, Youn Shic; Park, Kyoung-Cheul; Kim, Nam-Soo; Kim, Ju-Kon; Choi, Ik-Young
2017-08-15
Genetically modified crops (GM crops) have been developed to improve the agricultural traits of modern crop cultivars. Safety assessments of GM crops are of paramount importance in research at developmental stages and before releasing transgenic plants into the marketplace. Sequencing technology is developing rapidly, with higher output and labor efficiencies, and will eventually replace existing methods for the molecular characterization of genetically modified organisms. To detect the transgenic insertion locations in the three GM rice genomes, Illumina sequencing reads were mapped and classified against the rice genome and the plasmid sequence. Reads that mapped to both were classified to characterize the junction site between plant and transgene sequence by sequence alignment. Herein, we present a next generation sequencing (NGS)-based molecular characterization method, using transgenic rice plants SNU-Bt9-5, SNU-Bt9-30, and SNU-Bt9-109. Specifically, using bioinformatics tools, we detected the precise insertion locations and copy numbers of transfer DNA, genetic rearrangements, and the absence of backbone sequences, which were equivalent to results obtained from Southern blot analyses. NGS methods have been suggested as an effective means of characterizing and detecting transgenic insertion locations in genomes. Our results demonstrate the use of a combination of NGS technology and bioinformatics approaches that offers cost- and time-effective methods for assessing the safety of transgenic plants.
Xiao, Chao-Ting; Halbur, Patrick G; Opriessnig, Tanja
2015-07-01
The oldest porcine circovirus type 2 (PCV2) sequence dates back to 1962 and is among several hundreds of publicly available PCV2 sequences. Despite this resource, few studies have investigated the global genetic diversity of PCV2. To evaluate the phylogenetic relationship of PCV2 strains, 1680 PCV2 open reading frame 2 (ORF2) sequences were compared and analysed by methods of neighbour-joining, maximum-likelihood, Bayesian inference and network analysis. Four distinct clades were consistently identified and included PCV2a, PCV2b, PCV2c and PCV2d; the p-distance between PCV2d and PCV2b was 0.055±0.008, larger than the PCV2 genotype-definition cut-off of 0.035, supporting PCV2d as an independent genotype. Among the 1680 sequences, 278-285 (16.5-17 %) were classified as PCV2a, 1007-1058 (59.9-63 %) as PCV2b, three (0.2 %) as PCV2c and 322-323 (19.2 %) as PCV2d, with the remaining 12-78 sequences (0.7-4.6 %) classified as intermediate clades or strains by the various methods. Classification of strains to genotypes differed based on the number of sequences used for the analysis, indicating that sample size is important when determining classification and assessing PCV2 trends and shifts. PCV2d was initially identified in 1999 in samples collected in Switzerland, now appears to be widespread in China and has been present in North America since 2012. During 2012-2013, 37 % of all investigated PCV2 sequences from US pigs were classified as PCV2d and overall data analysis suggests an ongoing genotype shift from PCV2b towards PCV2d. The present analyses indicate that PCV2d emerged approximately 20 years ago.
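The genotype comparison above rests on the p-distance, the proportion of aligned sites at which two sequences differ. A minimal sketch follows, assuming pre-aligned sequences of equal length and pairwise deletion of gapped sites (both are assumptions made here, not details from the abstract); the function names are hypothetical, and only the 0.035 cut-off comes from the text.

```python
GENOTYPE_CUTOFF = 0.035  # PCV2 genotype-definition cut-off quoted in the abstract


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # assumption: sites with a gap in either sequence are skipped
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


def same_genotype(seq_a: str, seq_b: str, cutoff: float = GENOTYPE_CUTOFF) -> bool:
    """True when the p-distance does not exceed the genotype cut-off."""
    return p_distance(seq_a, seq_b) <= cutoff
```

For the reported PCV2d versus PCV2b distance of 0.055, a check of this kind would place the two in separate genotypes, which is the argument the abstract makes.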
Gifford, Robert J.; Rhee, Soo-Yon; Eriksson, Nicolas; Liu, Tommy F.; Kiuchi, Mark; Das, Amar K.; Shafer, Robert W.
2008-01-01
Design Promiscuous guanine (G) to adenine (A) substitutions catalysed by apolipoprotein B RNA-editing catalytic component (APOBEC) enzymes are observed in a proportion of HIV-1 sequences in vivo and can introduce artifacts into some genetic analyses. The potential impact of undetected lethal editing on genotypic estimation of transmitted drug resistance was assessed. Methods Classifiers of lethal, APOBEC-mediated editing were developed by analysis of lentiviral pol gene sequence variation and evaluated using control sets of HIV-1 sequences. The potential impact of sequence editing on genotypic estimation of drug resistance was assessed in sets of sequences obtained from 77 studies of 25 or more therapy-naive individuals, using mixture modelling approaches to determine the maximum likelihood classification of sequences as lethally edited as opposed to viable. Results Analysis of 6437 protease and reverse transcriptase sequences from therapy-naive individuals using a novel classifier of lethal, APOBEC3G-mediated sequence editing, the ‘polypeptide-like 3G (APOBEC3G)-mediated defectives (A3GD) index’, detected lethal editing in association with spurious ‘transmitted drug resistance’ in nearly 3% of proviral sequences obtained from whole blood and 0.2% of samples obtained from plasma. Conclusion Screening for lethally edited sequences in datasets containing a proportion of proviral DNA, such as those likely to be obtained for epidemiological surveillance of transmitted drug resistance in the developing world, can eliminate rare but potentially significant errors in genotypic estimation of transmitted drug resistance. PMID:18356601
USDA-ARS's Scientific Manuscript database
Among the mechanisms controlling copper homeostasis in plants is the regulation of its uptake and tissue partitioning. Here we characterized a newly identified member of the conserved CTR/COPT family of copper transporters in Arabidopsis thaliana, COPT6. We showed that COPT6 resides at the plasma me...
The Aged Microenvironment Influences Prostate Carcinogenesis
2008-12-01
binding protein-like +36 nucleic acid binding Serpinb5 serine (or cysteine) peptidase inhibitor, clade +9 serine-type endopeptidase inhibitor activity...synthase ( phosphatidate +1.9 phosphatidate cytidylyltransferase activity Car1 carbonic anhydrase 1 +1.9 carbonate dehydratase activity;zinc ion...activity Wdr45l Wdr45 like +1.7 acid phosphatase activity;molecular_function unknown Perp PERP, TP53 apoptosis effector +1.7 structural constituent of
2014-05-01
target of myb1 (chicken)-like 1 7994559 7.85 9.14 -2.44 0.044964 0.999947 LOC100507607 nuclear pore complex-interacting protein-like 2-like 7894184... [Figure residue: axis ticks from 200 to 2200 in steps of 250; label fragment "Day 0 flutamide or"]
NASA Astrophysics Data System (ADS)
Sardana, A.; Aziz, T. N.; Cottrell, B. A.
2017-12-01
In this presentation we will discuss our ongoing work to characterize the photochemical behavior of dissolved organic matter (DOM) from wastewater treated in constructed wetlands. We have used a suite of spectroscopic and chromatographic techniques to characterize the DOM and to quantify the potential production of reactive oxygen species (ROS). In the present study, DOM was fractionated based on its hydrophobicity and both the natural water isolates and fractionated DOM were characterized using SUVA254, spectral slope ratios, excitation emission matrix fluorescence spectroscopy (EEMs) and proton nuclear magnetic resonance (1H NMR). Photodegradation of wetland DOM and the formation of the hydroxyl radical (*OH), singlet oxygen (1O2), and the triplet-excited state (3DOM*) were also determined to assess the reactivity of DOM. EEM spectra exhibited the four main fluorescence peaks that are characteristic of DOM: Peak A (humic-like DOM), Peak C (fulvic or chromophoric DOM), Peak M (marine-like DOM), and Peak T (tryptophan or protein-like absorbance). Two additional observed peaks with shorter emission wavelengths (A' Ex/Em = 243/278 nm and T' Ex/Em = 272/319 nm) were attributed to the microbial DOM in wastewater effluent. The spectral slope ratios decreased from 1.46 at the wetland inlet to 0.89 at the wetland outlet. The protein-like Peak T fluorescence decreased from 50% at the wetland inlet to 6.7% at the Wetland 2 outlet. A negative correlation between the percent fluorescence of Peak T and Peaks A, C and M confirmed the transition from the spectrum of pure wastewater with a primarily protein-like signature to a spectrum characteristic of terrestrially derived DOM. This transition coincided with enhanced formation rates and steady state concentrations of photochemically produced reactive intermediates (PPRIs). Size Exclusion Chromatography demonstrated that the influent wastewater had a lower molecular weight as compared to downstream wetland locations.
Fractionation of DOM based on hydrophobicity followed by 1H NMR analysis indicated an increase in the complexity and composition of wetland effluent DOM. This presentation will summarize these findings and present results from our new microcosm constructed wetlands built to develop insights into DOM production and photochemical characteristics.
NASA Astrophysics Data System (ADS)
Hariharan, Harishwaran; Aklaghi, Nima; Baker, Clayton A.; Rangwala, Huzefa; Kosecka, Jana; Sikdar, Siddhartha
2016-04-01
In spite of major advances in biomechanical design of upper extremity prosthetics, these devices continue to lack intuitive control. Conventional myoelectric control strategies typically utilize electromyography (EMG) signal amplitude sensed from forearm muscles. EMG has limited specificity in resolving deep muscle activity and poor signal-to-noise ratio. We have been investigating alternative control strategies that rely on real-time ultrasound imaging that can overcome many of the limitations of EMG. In this work, we present an ultrasound image sequence classification method that utilizes spatiotemporal features to describe muscle activity and classify motor intent. Ultrasound images of the forearm muscles were obtained from able-bodied subjects and a trans-radial amputee while they attempted different hand movements. A grid-based approach is used to test the feasibility of using spatio-temporal features by classifying hand motions performed by the subjects. Using leave-one-out cross validation on image sequences acquired from able-bodied subjects, we observe that the grid-based approach is able to discern four hand motions with 95.31% accuracy. In the case of the trans-radial amputee, we are able to discern three hand motions with 80% accuracy. In a second set of experiments, we study classification accuracy by extracting spatio-temporal sub-sequences that depict activity due to the motion of local anatomical interfaces. Short time and space limited cuboidal sequences are initially extracted and assigned an optical flow behavior label, based on a response function. The image space is clustered based on the location of cuboids and features calculated from the cuboids in each cluster. Using sequences of known motions, we extract feature vectors that describe said motion. A K-nearest neighbor classifier is designed for classification experiments.
Using leave-one-out cross validation on image sequences for an amputee subject, we demonstrate that the classifier is able to discern three important hand motions with 93.33% accuracy, 91-100% precision and 80-100% recall rate. We anticipate that ultrasound imaging based methods will address some limitations of conventional myoelectric sensing, while adding advantages inherent to ultrasound imaging.
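The evaluation scheme described above, a K-nearest neighbor classifier scored by leave-one-out cross validation, can be sketched in a few lines. This is an illustration only: the two-dimensional feature vectors and the motion labels "fist" and "point" are made-up toy data, Euclidean distance is an assumption, and nothing here reproduces the authors' spatio-temporal features.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Majority label among the k training samples closest to `query`.

    `train` is a list of (feature_vector, label) pairs.
    """
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(samples, k=3):
    """Hold out each sample once, train on the rest, and report the hit rate."""
    correct = 0
    for i, (features, label) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        if knn_predict(rest, features, k) == label:
            correct += 1
    return correct / len(samples)


# Toy data (hypothetical): two well-separated clusters of feature vectors.
samples = [
    ((0.0, 0.1), "fist"), ((0.1, 0.0), "fist"), ((0.2, 0.1), "fist"),
    ((5.0, 5.1), "point"), ((5.1, 5.0), "point"), ((5.2, 5.1), "point"),
]
```

Because every sample is held out exactly once, the accuracy is computed on data the classifier never saw during that round, which suits small cohorts such as a single amputee subject.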
Classification of Ancient Mammal Individuals Using Dental Pulp MALDI-TOF MS Peptide Profiling
Tran, Thi-Nguyen-Ny; Aboudharam, Gérard; Gardeisen, Armelle; Davoust, Bernard; Bocquet-Appel, Jean-Pierre; Flaudrops, Christophe; Belghazi, Maya; Raoult, Didier; Drancourt, Michel
2011-01-01
Background The classification of ancient animal corpses at the species level remains a challenging task for forensic scientists and anthropologists. Severe damage and mixed, tiny pieces originating from several skeletons may render morphological classification virtually impossible. Standard approaches are based on sequencing mitochondrial and nuclear targets. Methodology/Principal Findings We present a method that can accurately classify mammalian species using dental pulp and mass spectrometry peptide profiling. Our work was organized into three successive steps. First, after extracting proteins from the dental pulp collected from 37 modern individuals representing 13 mammalian species, trypsin-digested peptides were used for matrix-assisted laser desorption/ionization time-of-flight mass spectrometry analysis. The resulting peptide profiles accurately classified every individual at the species level in agreement with the parallel cytochrome b gene sequencing gold standard. Second, using a 279–modern spectrum database, we blindly classified 33 of 37 teeth collected in 37 modern individuals (89.1%). Third, we classified 10 of 18 teeth (56%) collected in 15 ancient individuals representing five mammal species including human, from five burial sites dating back 8,500 years. Further comparison with an upgraded database comprising ancient specimen profiles yielded 100% classification in ancient teeth. Peptide sequencing yielded 4 and 16 different non-keratin proteins including collagen (alpha-1 type I and alpha-2 type I) in human ancient and modern dental pulp, respectively. Conclusions/Significance Mass spectrometry peptide profiling of the dental pulp is a new approach that can be added to the arsenal of species classification tools for forensics and anthropology as a complementary method to DNA sequencing. The dental pulp is a new source for collagen and other proteins for the species classification of modern and ancient mammal individuals. PMID:21364886
Peptoid nanosheets exhibit a new secondary-structure motif.
Mannige, Ranjan V; Haxton, Thomas K; Proulx, Caroline; Robertson, Ellen J; Battigelli, Alessia; Butterfoss, Glenn L; Zuckermann, Ronald N; Whitelam, Stephen
2015-10-15
A promising route to the synthesis of protein-mimetic materials that are capable of complex functions, such as molecular recognition and catalysis, is provided by sequence-defined peptoid polymers--structural relatives of biologically occurring polypeptides. Peptoids, which are relatively non-toxic and resistant to degradation, can fold into defined structures through a combination of sequence-dependent interactions. However, the range of possible structures that are accessible to peptoids and other biological mimetics is unknown, and our ability to design protein-like architectures from these polymer classes is limited. Here we use molecular-dynamics simulations, together with scattering and microscopy data, to determine the atomic-resolution structure of the recently discovered peptoid nanosheet, an ordered supramolecular assembly that extends macroscopically in only two dimensions. Our simulations show that nanosheets are structurally and dynamically heterogeneous, can be formed only from peptoids of certain lengths, and are potentially porous to water and ions. Moreover, their formation is enabled by the peptoids' adoption of a secondary structure that is not seen in the natural world. This structure, a zigzag pattern that we call a Σ('sigma')-strand, results from the ability of adjacent backbone monomers to adopt opposed rotational states, thereby allowing the backbone to remain linear and untwisted. Linear backbones tiled in a brick-like way form an extended two-dimensional nanostructure, the Σ-sheet. The binary rotational-state motif of the Σ-strand is not seen in regular protein structures, which are usually built from one type of rotational state. We also show that the concept of building regular structures from multiple rotational states can be generalized beyond the peptoid nanosheet system.
Sharma, Neelam; Park, Sang-Wook; Vepachedu, Ramarao; Barbieri, Luigi; Ciani, Marialibera; Stirpe, Fiorenzo; Savary, Brett J; Vivanco, Jorge M
2004-01-01
Ribosome-inactivating proteins (RIPs) are N-glycosidases that remove a specific adenine from the sarcin/ricin loop of the large rRNA, thus arresting protein synthesis at the translocation step. In the present study, a protein termed tobacco RIP (TRIP) was isolated from tobacco (Nicotiana tabacum) leaves and purified using ion exchange and gel filtration chromatography in combination with yeast ribosome depurination assays. TRIP has a molecular mass of 26 kD as evidenced by sodium dodecyl sulfate-polyacrylamide gel electrophoresis and showed strong N-glycosidase activity as manifested by the depurination of yeast rRNA. Purified TRIP showed immunoreactivity with antibodies of RIPs from Mirabilis expansa. TRIP released smaller amounts of adenine residues from ribosomal (Artemia sp. and rat ribosomes) and non-ribosomal substrates (herring sperm DNA, rRNA, and tRNA) compared with other RIPs. TRIP inhibited translation in wheat (Triticum aestivum) germ more efficiently than in rabbit reticulocytes, showing an IC50 at 30 ng in the former system. Antimicrobial assays using highly purified TRIP (50 μg mL⁻¹) conducted against various fungi and bacterial pathogens showed the strongest inhibitory activity against Trichoderma reesei and Pseudomonas solanacearum. A 15-amino acid internal polypeptide sequence of TRIP was identical with the internal sequences of the iron-superoxide dismutase (Fe-SOD) from wild tobacco (Nicotiana plumbaginifolia), Arabidopsis, and potato (Solanum tuberosum). Purified TRIP showed SOD activity, and Escherichia coli Fe-SOD was observed to have RIP activity too. Thus, TRIP may be considered a dual activity enzyme showing RIP-like activity and Fe-SOD characteristics.
Kringel, D; Ultsch, A; Zimmermann, M; Jansen, J-P; Ilias, W; Freynhagen, R; Griessinger, N; Kopf, A; Stein, C; Doehring, A; Resch, E; Lötsch, J
2017-01-01
Next-generation sequencing (NGS) provides unrestricted access to the genome, but it produces ‘big data’ exceeding in amount and complexity the classical analytical approaches. We introduce a bioinformatics-based classifying biomarker that uses emergent properties in genetics to separate pain patients requiring extremely high opioid doses from controls. Following precisely calculated selection of the 34 most informative markers in the OPRM1, OPRK1, OPRD1 and SIGMAR1 genes, pattern of genotypes belonging to either patient group could be derived using a k-nearest neighbor (kNN) classifier that provided a diagnostic accuracy of 80.6±4%. This outperformed alternative classifiers such as reportedly functional opioid receptor gene variants or complex biomarkers obtained via multiple regression or decision tree analysis. The accumulation of several genetic variants with only minor functional influences may result in a qualitative consequence affecting complex phenotypes, pointing at emergent properties in genetics. PMID:27139154
Kringel, D; Ultsch, A; Zimmermann, M; Jansen, J-P; Ilias, W; Freynhagen, R; Griessinger, N; Kopf, A; Stein, C; Doehring, A; Resch, E; Lötsch, J
2017-10-01
Next-generation sequencing (NGS) provides unrestricted access to the genome, but it produces 'big data' exceeding in amount and complexity the classical analytical approaches. We introduce a bioinformatics-based classifying biomarker that uses emergent properties in genetics to separate pain patients requiring extremely high opioid doses from controls. Following precisely calculated selection of the 34 most informative markers in the OPRM1, OPRK1, OPRD1 and SIGMAR1 genes, pattern of genotypes belonging to either patient group could be derived using a k-nearest neighbor (kNN) classifier that provided a diagnostic accuracy of 80.6±4%. This outperformed alternative classifiers such as reportedly functional opioid receptor gene variants or complex biomarkers obtained via multiple regression or decision tree analysis. The accumulation of several genetic variants with only minor functional influences may result in a qualitative consequence affecting complex phenotypes, pointing at emergent properties in genetics.
Thomas, Paul D; Kejariwal, Anish; Campbell, Michael J; Mi, Huaiyu; Diemer, Karen; Guo, Nan; Ladunga, Istvan; Ulitsky-Lazareva, Betty; Muruganujan, Anushya; Rabkin, Steven; Vandergriff, Jody A; Doremieux, Olivier
2003-01-01
The PANTHER database was designed for high-throughput analysis of protein sequences. One of the key features is a simplified ontology of protein function, which allows browsing of the database by biological functions. Biologist curators have associated the ontology terms with groups of protein sequences rather than individual sequences. Statistical models (Hidden Markov Models, or HMMs) are built from each of these groups. The advantage of this approach is that new sequences can be automatically classified as they become available. To ensure accurate functional classification, HMMs are constructed not only for families, but also for functionally distinct subfamilies. Multiple sequence alignments and phylogenetic trees, including curator-assigned information, are available for each family. The current version of the PANTHER database includes training sequences from all organisms in the GenBank non-redundant protein database, and the HMMs have been used to classify gene products across the entire genomes of human and Drosophila melanogaster. The ontology terms and protein families and subfamilies, as well as Drosophila gene classifications, can be browsed and searched for free. Due to outstanding contractual obligations, access to human gene classifications and to protein family trees and multiple sequence alignments will temporarily require a nominal registration fee. PANTHER is publicly available on the web at http://panther.celera.com.
Efficient use of unlabeled data for protein sequence classification: a comparative study.
Kuksa, Pavel; Huang, Pai-Hsi; Pavlovic, Vladimir
2009-04-29
Recent studies in computational primary protein sequence analysis have leveraged the power of unlabeled data. For example, predictive models based on string kernels trained on sequences known to belong to particular folds or superfamilies, the so-called labeled data set, can attain significantly improved accuracy if this data is supplemented with protein sequences that lack any class tags, the unlabeled data. In this study, we present a principled and biologically motivated computational framework that more effectively exploits the unlabeled data by only using the sequence regions that are more likely to be biologically relevant for better prediction accuracy. As overly-represented sequences in large uncurated databases may bias the estimation of computational models that rely on unlabeled data, we also propose a method to remove this bias and improve performance of the resulting classifiers. Combined with state-of-the-art string kernels, our proposed computational framework achieves very accurate semi-supervised protein remote fold and homology detection on three large unlabeled databases. It outperforms current state-of-the-art methods and exhibits significant reduction in running time. The unlabeled sequences used under the semi-supervised setting resemble unpolished gemstones; when used as-is, they may carry unnecessary features and hence compromise the classification accuracy, but once cut and polished, they improve the accuracy of the classifiers considerably.
Genomic characterization of two new enterovirus types, EV-A114 and EV-A121.
Deshpande, Jagadish M; Sharma, Deepa K; Saxena, Vinay K; Shetty, Sushmitha A; Qureshi, Tarique Husain I H; Nalavade, Uma P
2016-12-01
Enteroviruses cause a variety of illnesses of the gastrointestinal tract, central nervous system and cardiovascular system. Phylogenetic analysis of VP1 sequences has identified 106 different human enteroviruses classified into four enterovirus species within the genus Enterovirus of the family Picornaviridae. It is likely that not all enterovirus types have been discovered. Between September 2013 and October 2014, stool samples of 6274 apparently healthy children of up to 5 years of age residing in Gorakhpur district, Uttar Pradesh, India were screened for enteroviruses. Virus isolates obtained in RD and Hep-2c cells were identified by complete VP1 sequencing. Enteroviruses were isolated from 3042 samples. A total of 87 different enterovirus types were identified. Two isolates with 71 and 74 % nucleotide sequence similarity to all other known enteroviruses were recognized as novel types. In this paper we report identification and complete genome sequence analysis of these two isolates classified as EV-A114 and EV-A121.
Prestat, Emmanuel; David, Maude M.; Hultman, Jenni; ...
2014-09-26
A new functional gene database, FOAM (Functional Ontology Assignments for Metagenomes), was developed to screen environmental metagenomic sequence datasets. FOAM provides a new functional ontology dedicated to classify gene functions relevant to environmental microorganisms based on Hidden Markov Models (HMMs). Sets of aligned protein sequences (i.e. ‘profiles’) were tailored to a large group of target KEGG Orthologs (KOs) from which HMMs were trained. The alignments were checked and curated to make them specific to the targeted KO. Within this process, sequence profiles were enriched with the most abundant sequences available to maximize the yield of accurate classifier models. An associated functional ontology was built to describe the functional groups and hierarchy. FOAM allows the user to select the target search space before HMM-based comparison steps and to easily organize the results into different functional categories and subcategories. FOAM is publicly available at http://portal.nersc.gov/project/m1317/FOAM/.
Kraken: ultrafast metagenomic sequence classification using exact alignments
2014-01-01
Kraken is an ultrafast and highly accurate program for assigning taxonomic labels to metagenomic DNA sequences. Previous programs designed for this task have been relatively slow and computationally expensive, forcing researchers to use faster abundance estimation programs, which only classify small subsets of metagenomic data. Using exact alignment of k-mers, Kraken achieves classification accuracy comparable to the fastest BLAST program. In its fastest mode, Kraken classifies 100 base pair reads at a rate of over 4.1 million reads per minute, 909 times faster than Megablast and 11 times faster than the abundance estimation program MetaPhlAn. Kraken is available at http://ccb.jhu.edu/software/kraken/. PMID:24580807
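The core idea named above, classifying a read by exact matches of its k-mers against a reference index, can be illustrated with a toy sketch. It is a deliberate simplification and not Kraken's implementation: the flat dictionary, the rule of discarding k-mers shared between taxa, and the majority vote are all assumptions made here for brevity, and the taxon names and sequences are invented.

```python
from collections import Counter


def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(references, k):
    """Map each k-mer to the one taxon that contains it.

    k-mers found in more than one taxon are marked None (ambiguous)
    and ignored later; this is a simplification chosen for the sketch.
    """
    index = {}
    for taxon, genome in references.items():
        for kmer in set(kmers(genome, k)):
            if kmer in index and index[kmer] != taxon:
                index[kmer] = None
            else:
                index[kmer] = taxon
    return index


def classify_read(read, index, k):
    """Label a read by majority vote over its exactly matching k-mers."""
    hits = Counter(
        index[km] for km in kmers(read, k) if index.get(km) is not None
    )
    if not hits:
        return None  # unclassified: no unambiguous k-mer matched
    return hits.most_common(1)[0][0]


# Toy references (hypothetical sequences and taxon names).
references = {"taxon_A": "ACGTACGTTGCA", "taxon_B": "TTGCAGGCCTAA"}
index = build_index(references, k=4)
```

Each k-mer costs one dictionary lookup and no alignment is computed, which gives a rough sense of why exact k-mer matching can be much faster than alignment-based search.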
Detection and molecular characterization of Cryptosporidium and Eimeria species in Philippine bats.
Murakoshi, Fumi; Recuenco, Frances C; Omatsu, Tsutomu; Sano, Kaori; Taniguchi, Satoshi; Masangkay, Joseph S; Alviola, Philip; Eres, Eduardo; Cosico, Edison; Alvarez, James; Une, Yumi; Kyuwa, Shigeru; Sugiura, Yuki; Kato, Kentaro
2016-05-01
The genus Cryptosporidium, which is an obligate intracellular parasite, infects various vertebrates and causes a diarrheal disease known as cryptosporidiosis. Bats are naturally infected with zoonotic pathogens; thus, they are potential reservoirs of parasites. We investigated the species and genotype distribution as well as prevalence of Cryptosporidium and Eimeria in Philippine bats. We captured and examined 45 bats; four were positive for Cryptosporidium spp. and seven were positive for Eimeria spp. We detected Cryptosporidium bat genotype II from Ptenochirus jagori. Three other Cryptosporidium sequences, detected from Rhinolophus inops, Cynopterus brachyotis, and Eonycteris spelaea, could not be classified as any known species or genotype; we therefore propose the novel genotypes Cryptosporidium bat genotypes V, VI, and VII. Bat genotype V is associated with the human cryptosporidiosis clade, and therefore, this genotype may be transmissible to humans. Among the Eimeria sequences, BE3 detected from Scotophilus kuhlii was classified with known bat and rodent clades; however, other sequences detected from C. brachyotis, E. spelaea, Rousettus amplexicaudatus, and R. inops could not be classified with known Eimeria species. These isolates might represent a new genotype. Our findings demonstrate that the bats of the Philippines represent a reservoir of multiple Cryptosporidium and Eimeria spp.
Diversity of Bacteria at Healthy Human Conjunctiva
Dong, Qunfeng; Brulc, Jennifer M.; Iovieno, Alfonso; Bates, Brandon; Garoutte, Aaron; Miller, Darlene; Revanna, Kashi V.; Gao, Xiang; Antonopoulos, Dionysios A.; Slepak, Vladlen Z.
2011-01-01
Purpose. Ocular surface (OS) microbiota contributes to infectious and autoimmune diseases of the eye. Comprehensive analysis of microbial diversity at the OS has been impossible because of the limitations of conventional cultivation techniques. This pilot study aimed to explore true diversity of human OS microbiota using DNA sequencing-based detection and identification of bacteria. Methods. Composition of the bacterial community was characterized using deep sequencing of the 16S rRNA gene amplicon libraries generated from total conjunctival swab DNA. The DNA sequences were classified and the diversity parameters measured using bioinformatics software ESPRIT and MOTHUR and tools available through the Ribosomal Database Project-II (RDP-II). Results. Deep sequencing of conjunctival rDNA from four subjects yielded a total of 115,003 quality DNA reads, corresponding to 221 species-level phylotypes per subject. The combined bacterial community classified into 5 phyla and 59 distinct genera. However, 31% of all DNA reads belonged to unclassified or novel bacteria. The intersubject variability of individual OS microbiomes was very significant. Regardless, 12 genera (Pseudomonas, Propionibacterium, Bradyrhizobium, Corynebacterium, Acinetobacter, Brevundimonas, Staphylococci, Aquabacterium, Sphingomonas, Streptococcus, Streptophyta, and Methylobacterium) were ubiquitous among the analyzed cohort and represented the putative “core” of conjunctival microbiota. The other 47 genera accounted for <4% of the classified portion of this microbiome. Unexpectedly, healthy conjunctiva contained many genera that are commonly identified as ocular surface pathogens. Conclusions. The first DNA sequencing-based survey of bacterial population at the conjunctiva has revealed an unexpectedly diverse microbial community. All analyzed samples contained ubiquitous (core) genera that included commensal, environmental, and opportunistic pathogenic bacteria. PMID:21571682
Probabilistic topic modeling for the analysis and classification of genomic sequences
2015-01-01
Background Studies on genomic sequences for classification and taxonomic identification have a leading role in the biomedical field and in the analysis of biodiversity. These studies are focusing on the so-called barcode genes, representing a well defined region of the whole genome. Recently, alignment-free techniques are gaining more importance because they are able to overcome the drawbacks of sequence alignment techniques. In this paper a new alignment-free method for DNA sequences clustering and classification is proposed. The method is based on k-mers representation and text mining techniques. Methods The presented method is based on Probabilistic Topic Modeling, a statistical technique originally proposed for text documents. Probabilistic topic models are able to find in a document corpus the topics (recurrent themes) characterizing classes of documents. This technique, applied to DNA sequences representing the documents, exploits the frequency of fixed-length k-mers and builds a generative model for a training group of sequences. This generative model, obtained through the Latent Dirichlet Allocation (LDA) algorithm, is then used to classify a large set of genomic sequences. Results and conclusions We performed classification of over 7000 16S DNA barcode sequences taken from Ribosomal Database Project (RDP) repository, training probabilistic topic models. The proposed method is compared to the RDP tool and Support Vector Machine (SVM) classification algorithm in an extensive set of trials using both complete sequences and short sequence snippets (from 400 bp to 25 bp). Our method reaches very similar results to RDP classifier and SVM for complete sequences. The most interesting results are obtained when short sequence snippets are considered. In these conditions the proposed method outperforms RDP and SVM with ultra short sequences and it exhibits a smooth decrease of performance, at every taxonomic level, when the sequence length is decreased. PMID:25916734
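Treating a DNA sequence as a text document, as described above, starts by turning it into counts of fixed-length k-mer "words". The sketch below shows only that representation step, assuming an ACGT alphabet and overlapping k-mers; the function name is hypothetical, and the LDA model that would be trained on the resulting vectors is not implemented here.

```python
from collections import Counter
from itertools import product


def kmer_count_vector(seq, k, alphabet="ACGT"):
    """Count overlapping k-mers, ordered by a fixed lexicographic vocabulary.

    The vocabulary has len(alphabet) ** k entries, so the vector length
    is the same for every sequence regardless of its length.
    """
    vocabulary = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts[word] for word in vocabulary]


# One row of a document-term matrix for a short toy sequence.
row = kmer_count_vector("ACGTAC", k=2)
```

Stacking one such row per sequence gives the document-term matrix that a topic model expects as input, and no alignment between sequences is needed to build it.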
USDA-ARS's Scientific Manuscript database
A variant (rs3812316, C771G, and Gln241His) in the MLXIPL (Max-like protein X interacting protein-like) gene encoding the carbohydrate response element binding protein has been associated with lower triglycerides. However, its association with cardiovascular diseases and gene-diet interactions modul...
Protein-like Nanoparticles Based on Orthogonal Self-Assembly of Chimeric Peptides.
Jiang, Linhai; Xu, Dawei; Namitz, Kevin E; Cosgrove, Michael S; Lund, Reidar; Dong, He
2016-10-01
A novel two-component self-assembling chimeric peptide is designed where two orthogonal protein folding motifs are linked side by side with precisely defined position relative to one another. The self-assembly is driven by a combination of symmetry controlled molecular packing, intermolecular interactions, and geometric constraint to limit the assembly into compact dodecameric protein nanoparticles. © 2016 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Automatic classification of protein structures using physicochemical parameters.
Mohan, Abhilash; Rao, M Divya; Sunderrajan, Shruthi; Pennathur, Gautam
2014-09-01
Protein classification is the first step to functional annotation; SCOP and Pfam databases are currently the most relevant protein classification schemes. However, the disproportion in the number of three dimensional (3D) protein structures generated versus their classification into relevant superfamilies/families emphasizes the need for automated classification schemes. Predicting function of novel proteins based on sequence information alone has proven to be a major challenge. The present study focuses on the use of physicochemical parameters in conjunction with machine learning algorithms (Naive Bayes, Decision Trees, Random Forest and Support Vector Machines) to classify proteins into their respective SCOP superfamily/Pfam family, using sequence derived information. Spectrophores™, a 1D descriptor of the 3D molecular field surrounding a structure was used as a benchmark to compare the performance of the physicochemical parameters. The machine learning algorithms were modified to select features based on information gain for each SCOP superfamily/Pfam family. The effect of combining physicochemical parameters and spectrophores on classification accuracy (CA) was studied. Machine learning algorithms trained with the physicochemical parameters consistently classified SCOP superfamilies and Pfam families with a classification accuracy above 90%, while spectrophores performed with a CA of around 85%. Feature selection improved classification accuracy for both physicochemical parameters and spectrophores based machine learning algorithms. Combining both attributes resulted in a marginal loss of performance. Physicochemical parameters were able to classify proteins from both schemes with classification accuracy ranging from 90-96%. These results suggest the usefulness of this method in classifying proteins from amino acid sequences.
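Feature selection by information gain, mentioned above, ranks each feature by how much knowing its value reduces the entropy of the class labels. A minimal sketch follows, assuming the features have already been discretized into categories (continuous physicochemical parameters would need binning first, which is not shown); the data in the usage lines are invented.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(labels).values()
    )


def information_gain(feature_values, labels):
    """Entropy of the labels minus their entropy conditioned on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(
        len(group) / total * entropy(group) for group in groups.values()
    )
    return entropy(labels) - conditional


# Toy example (hypothetical): one informative and one uninformative feature.
labels = ["family_1", "family_1", "family_2", "family_2"]
informative = ["high", "high", "low", "low"]
uninformative = ["high", "low", "high", "low"]
```

A feature that splits the classes perfectly recovers the full label entropy, while one that is independent of the classes scores zero, so sorting features by this value gives a simple per-family ranking.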
Chiu, Shih-Hau; Chen, Chien-Chi; Yuan, Gwo-Fang; Lin, Thy-Hou
2006-06-15
The number of sequences compiled in many genome projects is growing exponentially, but most of them have not been characterized experimentally. An automatic annotation scheme is urgently needed to reduce the gap between the number of new sequences produced and reliable functional annotation. This work proposes rules for automatically classifying fungal genes. The approach involves elucidating the enzyme classification rules hidden in the UniProt protein knowledgebase and then applying them for classification. The association algorithm, Apriori, is utilized to mine the relationship between the enzyme class and significant InterPro entries. The candidate rules are evaluated for their classificatory capacity. Five datasets were collected from Swiss-Prot for establishing the annotation rules. These were treated as the training sets. The TrEMBL entries were treated as the testing set. A correct enzyme classification rate of 70% was obtained for the prokaryote datasets and a similar rate of about 80% was obtained for the eukaryote datasets. The fungus training dataset which lacks an enzyme class description was also used to evaluate the fungus candidate rules. A total of 88 out of 5085 test entries were matched with the fungus rule set. These were otherwise poorly annotated using their functional descriptions. The feasibility of using the method presented here to classify enzyme classes based on the enzyme domain rules is evident. The rules may also be employed by protein annotators in manual annotation or implemented in an automatic annotation flowchart.
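Association rules of the kind described above are usually scored by support and confidence. The sketch below computes both for a single candidate rule "domain set -> enzyme class"; it omits Apriori's candidate-generation step, and the domain identifiers and class labels are placeholders rather than real InterPro or EC entries.

```python
def rule_stats(entries, antecedent, enzyme_class):
    """Return (support, confidence) of the rule `antecedent -> enzyme_class`.

    `entries` is a list of (set_of_domain_ids, enzyme_class) pairs.
    """
    antecedent = set(antecedent)
    # Classes of all entries that contain every domain in the antecedent.
    matching = [ec for domains, ec in entries if antecedent <= domains]
    hits = sum(1 for ec in matching if ec == enzyme_class)
    support = hits / len(entries)
    confidence = hits / len(matching) if matching else 0.0
    return support, confidence


# Hypothetical training entries (placeholder identifiers).
training = [
    ({"IPR_A", "IPR_B"}, "EC1"),
    ({"IPR_A"}, "EC1"),
    ({"IPR_A", "IPR_C"}, "EC2"),
    ({"IPR_C"}, "EC2"),
]

support, confidence = rule_stats(training, {"IPR_A"}, "EC1")
```

A rule would typically be kept only if both values exceed chosen cut-offs; the abstract does not give the cut-offs used in the study.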
Peleato, Nicolás M; Andrews, Robert C
2015-01-01
This work investigated the application of several fluorescence excitation-emission matrix analysis methods as natural organic matter (NOM) indicators for use in predicting the formation of trihalomethanes (THMs) and haloacetic acids (HAAs). Waters from four different sources (two rivers and two lakes) were subjected to jar testing followed by 24 h disinfection by-product formation tests using chlorine. NOM was quantified using three common measures: dissolved organic carbon, ultraviolet absorbance at 254 nm, and specific ultraviolet absorbance as well as by principal component analysis, peak picking, and parallel factor analysis of fluorescence spectra. Based on multi-linear modeling of THMs and HAAs, principal component (PC) scores resulted in the lowest mean squared prediction error of cross-folded test sets (THMs: 43.7 (μg/L)², HAAs: 233.3 (μg/L)²). Inclusion of principal components representative of protein-like material significantly decreased prediction error for both THMs and HAAs. Parallel factor analysis did not identify a protein-like component and resulted in prediction errors similar to traditional NOM surrogates as well as fluorescence peak picking. These results support the value of fluorescence excitation-emission matrix-principal component analysis as a suitable NOM indicator in predicting the formation of THMs and HAAs for the water sources studied. Copyright © 2014. Published by Elsevier B.V.
Chen, Wei; Gao, Xiaohong; Xu, Hang; Cai, Yan; Cui, Jianfeng
2017-03-01
Extracellular polymeric substances (EPS) are high molecular weight polymers and play a significant role in floc stability, floc size, bioflocculation and sludge settleability. The destruction and reconstruction of EPS improve the performance of solid-water separation processes. In this study, the influence of combined ultrasound pretreatment and chemical re-flocculation on the spatial distribution and composition of EPS was examined. Settleability efficiency demonstrated that the optimal operating condition was an ultrasound pretreatment time of 15 min at pH 6. Sludge particles were greatly disintegrated and the protein-like substances were converted into smaller molecules after ultrasound treatment, and pH had important effects on solubilization and degradation of protein-like substances. The flocs of sludge water after addition of polyacrylamide were larger in size and denser in structure than those resulting from addition of polyaluminium chloride. However, polyaluminium chloride had a better capacity for degrading EPS, especially at a dosage of 1.2 g/g total suspended solids. The results of this research show that the combination of ultrasonication and chemical re-flocculation is effective in treating sludge water from a drinking water treatment plant. Copyright © 2016 Elsevier Ltd. All rights reserved.
Song, Wenjuan; Zhao, Chenxi; Zhang, Daoyong; Mu, Shuyong; Pan, Xiangliang
2016-01-01
The effects of UV-B radiation (UVBR) on photosynthetic activity (Fv/Fm) of aquatic Synechocystis sp. and desert Chroococcus minutus and effects on composition and fluorescence properties of extracellular polymeric substances (EPSs) from Synechocystis sp. and C. minutus were comparatively investigated. The desert cyanobacterium species C. minutus showed higher tolerance of PSII activity (Fv/Fm) to UVBR than the aquatic Synechocystis sp., and the inhibited PSII activity of C. minutus could be fully recovered while that of Synechocystis sp. could be partly recovered. UVBR had a significant effect on the yield and biochemical composition of EPS of both species. Protein-like and humic acid-like substances were detected in EPS from Synechocystis sp., and protein-like and phenol-like fluorescent compounds were detected in EPS from C. minutus. Proteins in EPS of desert and aquatic species were significantly decomposed under UVBR, and the latter was more easily decomposed. The polysaccharides were much more resistant to UVBR than the proteins for both species. Polysaccharides of Synechocystis sp. were degraded slightly, whereas those of C. minutus were barely decomposed. The higher tolerance to UVBR of the desert cyanobacterium can be attributed to the higher resistance of its EPS to photodegradation induced by UVBR in comparison with the aquatic species. PMID:27597841
Guanine nucleotide binding protein-like 3 is a potential prognosis indicator of gastric cancer.
Chen, Jing; Dong, Shuang; Hu, Jiangfeng; Duan, Bensong; Yao, Jian; Zhang, Ruiyun; Zhou, Hongmei; Sheng, Haihui; Gao, Hengjun; Li, Shunlong; Zhang, Xianwen
2015-01-01
Guanine nucleotide binding protein-like 3 (GNL3) is a GTP-binding nuclear protein that has been reported to be involved in various biological processes, including cell proliferation, cellular senescence and tumorigenesis. This study aimed to investigate the expression level of GNL3 in gastric cancer and to evaluate the relationship between its expression and clinical variables and overall survival of gastric cancer patients. The expression level of GNL3 was examined in 89 human gastric cancer samples using immunohistochemistry (IHC) staining. GNL3 in gastric cancer tissues was significantly upregulated compared with paracancerous tissues. GNL3 expression in adjacent non-cancerous tissues was associated with sex and tumor size. Survival analyses showed that GNL3 expression in both gastric cancer and adjacent non-cancerous tissues was not related to overall survival. However, in the subgroup of patients with larger tumor size (≥ 6 cm), a close association was found between GNL3 expression in gastric cancer tissues and overall survival. GNL3-positive patients had a shorter survival than GNL3-negative patients. Our study suggests that GNL3 might play an important role in the progression of gastric cancer and serve as a biomarker for poor prognosis in gastric cancer patients.
Porter, Teresita M.; Golding, G. Brian
2012-01-01
Nuclear large subunit ribosomal DNA is widely used in fungal phylogenetics and to an increasing extent also amplicon-based environmental sequencing. The relatively short reads produced by next-generation sequencing, however, make primer choice and sequence error important variables for obtaining accurate taxonomic classifications. In this simulation study we tested the performance of three classification methods: 1) a similarity-based method (BLAST + Metagenomic Analyzer, MEGAN); 2) a composition-based method (Ribosomal Database Project naïve Bayesian classifier, NBC); and, 3) a phylogeny-based method (Statistical Assignment Package, SAP). We also tested the effects of sequence length, primer choice, and sequence error on classification accuracy and perceived community composition. Using a leave-one-out cross validation approach, results for classifications to the genus rank were as follows: BLAST + MEGAN had the lowest error rate and was particularly robust to sequence error; SAP accuracy was highest when long LSU query sequences were classified; and, NBC runs significantly faster than the other tested methods. All methods performed poorly with the shortest 50–100 bp sequences. Increasing simulated sequence error reduced classification accuracy. Community shifts were detected due to sequence error and primer selection even though there was no change in the underlying community composition. Short read datasets from individual primers, as well as pooled datasets, appear to only approximate the true community composition. We hope this work informs investigators of some of the factors that affect the quality and interpretation of their environmental gene surveys. PMID:22558215
Moskalev, Evgeny A; Frohnauer, Judith; Merkelbach-Bruse, Sabine; Schildhaus, Hans-Ulrich; Dimmler, Arno; Schubert, Thomas; Boltze, Carsten; König, Helmut; Fuchs, Florian; Sirbu, Horia; Rieker, Ralf J; Agaimy, Abbas; Hartmann, Arndt; Haller, Florian
2014-06-01
Recurrent gene fusions of anaplastic lymphoma receptor tyrosine kinase (ALK) and echinoderm microtubule-associated protein-like 4 (EML4) have been recently identified in ∼5% of non-small cell lung cancers (NSCLCs) and are targets for selective tyrosine kinase inhibitors. While fluorescent in situ hybridization (FISH) is the current gold standard for detection of EML4-ALK rearrangements, several limitations exist including high costs, time-consuming evaluation and somewhat equivocal interpretation of results. In contrast, targeted massive parallel sequencing has been introduced as a powerful method for simultaneous and sensitive detection of multiple somatic mutations even in limited biopsies, and is currently evolving as the method of choice for molecular diagnostic work-up of NSCLCs. We developed a novel approach for indirect detection of EML4-ALK rearrangements based on 454 massive parallel sequencing after reverse transcription and subsequent multiplex amplification (multiplex ALK RNA-seq) which takes advantage of unbalanced expression of the 5' and 3' ALK mRNA regions. Two lung cancer cell lines and a selected series of 32 NSCLC samples including 11 cases with EML4-ALK rearrangement were analyzed with this novel approach in comparison to ALK FISH, ALK qRT-PCR and EML4-ALK RT-PCR. The H2228 cell line with known EML4-ALK rearrangement showed 171 and 729 reads for 5' and 3' ALK regions, respectively, demonstrating a clearly unbalanced expression pattern. In contrast, the H1299 cell line with ALK wildtype status displayed no reads for both ALK regions. Considering a threshold of 100 reads for 3' ALK region as indirect indicator of EML4-ALK rearrangement, there was 100% concordance between the novel multiplex ALK RNA-seq approach and ALK FISH among all 32 NSCLC samples. 
Multiplex ALK RNA-seq is a sensitive and specific method for indirect detection of EML4-ALK rearrangements, and can be easily implemented in panel based molecular diagnostic work-up of NSCLCs by massive parallel sequencing. Copyright © 2014 Elsevier Ireland Ltd. All rights reserved.
Yadav, Saurabh; Kumari, Pragati; Kushwaha, Hemant Ritturaj
2013-01-01
Glutaredoxins are small, ubiquitous, glutathione-dependent enzymatic antioxidants classified under the thioredoxin-fold superfamily. Glutaredoxins are classified into two types: dithiol and monothiol. Monothiol glutaredoxins, which carry the signature "CGFS" as a redox-active motif, are known for their role in oxidative stress inside the cell. In the present analysis, the 138-amino-acid-long monothiol glutaredoxin AgGRX1 from Ashbya gossypii was identified and analyzed. The multiple sequence alignment of the AgGRX1 protein sequence revealed the characteristic motif of typical monothiol glutaredoxin as observed in various other organisms. The proposed structure of the AgGRX1 protein was used to analyze signature folds related to the thioredoxin superfamily. Further, the study highlighted the structural features pertaining to the complex mechanism of glutathione docking and interacting residues.
Vipsita, Swati; Rath, Santanu Kumar
2015-01-01
Protein superfamily classification deals with the problem of predicting the family membership of newly discovered amino acid sequences. Although many trivial alignment methods have already been developed by previous researchers, the present trend demands the application of computationally intelligent techniques. With the exponential growth in the size of biological databases, retrieval and inference of essential knowledge in the biological domain has become a very cumbersome task. This problem can be handled using intelligent techniques because of their tolerance for imprecision, uncertainty, approximate reasoning, and partial truth. This paper discusses the various global and local features extracted from full-length protein sequences, which are used for the approximation and generalisation of the classifier. The various parameters used for evaluating the performance of the classifiers are also discussed. This review article can therefore point present researchers in the right direction for improving on the existing methods.
Three former taxa of Cucurbitaria and considerations on Petrakia in the Melanommataceae
Jaklitsch, Walter M.; Voglmayr, Hermann
2017-01-01
Based on phylogenetic analyses of an ITS-LSU-SSU-rpb2-tef1 sequence data matrix three taxa once classified in Cucurbitaria are referred to the Melanommataceae. Cucurbitaria rhododendri, also known as Melanomma rhododendri, is not congeneric with the generic type of Melanomma, M. pulvis-pyrius and thus classified in the new genus Alpinaria. Cucurbitaria piceae, known as Gemmamyces piceae, the cause of the Gemmamyces bud blight of Picea spp., belongs also to the Melanommataceae. The name Gemmamyces is conserved. For Cucurbitaria obducens, also known as Teichospora obducens, the new genus Praetumpfia is described, as it cannot be accommodated in any known genus. All species are redescribed and epitypified. Based on sequence data and morphology, Blastostroma, Mycodidymella and Xenostigmina are synonyms of Petrakia. The genus Petrakia is emended. We also provide sequences of additional markers for Beverwykella pulmonaria, Melanomma pulvis-pyrius, Petrakia echinata and Pseudotrichia mutabilis. PMID:29104325
Mining sequential patterns for protein fold recognition.
Exarchos, Themis P; Papaloukas, Costas; Lampros, Christos; Fotiadis, Dimitrios I
2008-02-01
Protein data contain discriminative patterns that can be used in many beneficial applications if they are defined correctly. In this work sequential pattern mining (SPM) is utilized for sequence-based fold recognition. Protein classification in terms of fold recognition plays an important role in computational protein analysis, since it can contribute to the determination of the function of a protein whose structure is unknown. Specifically, one of the most efficient SPM algorithms, cSPADE, is employed for the analysis of protein sequence. A classifier uses the extracted sequential patterns to classify proteins in the appropriate fold category. For training and evaluating the proposed method we used the protein sequences from the Protein Data Bank and the annotation of the SCOP database. The method exhibited an overall accuracy of 25% in a classification problem with 36 candidate categories. The classification performance reaches up to 56% when the five most probable protein folds are considered.
Franz J. St John; Javier M. Gonzalez; Edwin Pozharski
2010-01-01
In this work glycosyl hydrolase (GH) family 30 (GH30) is analyzed and shown to consist of its currently classified member sequences as well as several homologous sequence groups currently assigned within family GH5. A large scale amino acid sequence alignment and a phylogenetic tree were generated and GH30 groups and subgroups were designated. A partial rearrangement...
Neuwald, Andrew F
2009-08-01
The patterns of sequence similarity and divergence present within functionally diverse, evolutionarily related proteins contain implicit information about corresponding biochemical similarities and differences. A first step toward accessing such information is to statistically analyze these patterns, which, in turn, requires that one first identify and accurately align a very large set of protein sequences. Ideally, the set should include many distantly related, functionally divergent subgroups. Because it is extremely difficult, if not impossible for fully automated methods to align such sequences correctly, researchers often resort to manual curation based on detailed structural and biochemical information. However, multiply-aligning vast numbers of sequences in this way is clearly impractical. This problem is addressed using Multiply-Aligned Profiles for Global Alignment of Protein Sequences (MAPGAPS). The MAPGAPS program uses a set of multiply-aligned profiles both as a query to detect and classify related sequences and as a template to multiply-align the sequences. It relies on Karlin-Altschul statistics for sensitivity and on PSI-BLAST (and other) heuristics for speed. Using as input a carefully curated multiple-profile alignment for P-loop GTPases, MAPGAPS correctly aligned weakly conserved sequence motifs within 33 distantly related GTPases of known structure. By comparison, the sequence- and structurally based alignment methods hmmalign and PROMALS3D misaligned at least 11 and 23 of these regions, respectively. When applied to a dataset of 65 million protein sequences, MAPGAPS identified, classified and aligned (with comparable accuracy) nearly half a million putative P-loop GTPase sequences. A C++ implementation of MAPGAPS is available at http://mapgaps.igs.umaryland.edu. Supplementary data are available at Bioinformatics online.
Protein sequences clustering of herpes virus by using Tribe Markov clustering (Tribe-MCL)
NASA Astrophysics Data System (ADS)
Bustamam, A.; Siswantining, T.; Febriyani, N. L.; Novitasari, I. D.; Cahyaningrum, R. D.
2017-07-01
The herpes virus is widespread, and one of its important characteristics is its ability to cause acute and chronic infection, which can lead to severe complications. The herpes virus is composed of DNA containing protein and wrapped by glycoproteins. In this work, the herpes virus family is classified and analyzed by clustering its protein sequences using the Tribe Markov Clustering (Tribe-MCL) algorithm. Tribe-MCL is an efficient clustering method, based on the theory of Markov chains, that classifies protein families from protein sequences using pre-computed sequence similarity information. We implement the Tribe-MCL algorithm in the open-source program R. We select 24 protein sequences of herpes virus obtained from the NCBI database. The dataset consists of three types of glycoprotein: B, F, and H. Each type is represented by eight herpes viruses that infect humans. In our simulations with inflation factors r = 1.5, 2, and 3, we obtain different numbers of clusters. The greater the inflation factor, the greater the number of clusters. Each protein is grouped together with proteins of the same type.
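The Markov clustering step that Tribe-MCL builds on alternates "expansion" (squaring a column-stochastic matrix) with "inflation" (raising entries to the power r and renormalizing). The sketch below is a plain-Python illustration under the assumption that similarities are given as a symmetric matrix; it is not the authors' R implementation, and the toy similarity values are invented.

```python
def normalize_columns(m):
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Square-matrix product on nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mcl(similarity, inflation=2.0, iterations=50):
    """Cluster items from a pairwise similarity matrix with basic MCL."""
    n = len(similarity)
    # Self-loops keep every column sum positive.
    m = [[similarity[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        m = matmul(m, m)                                   # expansion
        m = normalize_columns([[v ** inflation for v in row] for row in m])  # inflation
    # Each non-empty row lists the items attracted to that row's node.
    clusters = set()
    for row in m:
        members = frozenset(j for j, v in enumerate(row) if v > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Hypothetical similarities: items 0-2 resemble each other, as do items 3-4.
toy = [
    [0.0, 0.9, 0.9, 0.0, 0.0],
    [0.9, 0.0, 0.9, 0.0, 0.0],
    [0.9, 0.9, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.8],
    [0.0, 0.0, 0.0, 0.8, 0.0],
]
clusters = mcl(toy, inflation=2.0)
```

The `inflation` argument plays the role of the factor r varied in the abstract; on real similarity data, larger values tend to split clusters more finely.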
Information entropy of humpback whale songs.
Suzuki, Ryuji; Buck, John R; Tyack, Peter L
2006-03-01
The structure of humpback whale (Megaptera novaeangliae) songs was examined using information theory techniques. The song is an ordered sequence of individual sound elements separated by gaps of silence. Song samples were converted into sequences of discrete symbols by both human and automated classifiers. This paper analyzes the song structure in these symbol sequences using information entropy estimators and autocorrelation estimators. Both parametric and nonparametric entropy estimators are applied to the symbol sequences representing the songs. The results provide quantitative evidence consistent with the hierarchical structure proposed for these songs by Payne and McVay [Science 173, 587-597 (1971)]. Specifically, this analysis demonstrates that: (1) There is a strong structural constraint, or syntax, in the generation of the songs, and (2) the structural constraints exhibit periodicities with periods of 6-8 and 180-400 units. This implies that no empirical Markov model is capable of representing the songs' structure. The results are robust to the choice of either human or automated song-to-symbol classifiers. In addition, the entropy estimates indicate that the maximum amount of information that could be communicated by the sequence of sounds made is less than 1 bit per second.
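The entropy analysis described above can be illustrated with the simplest nonparametric (plug-in) estimators applied to a symbol sequence. The sketch below computes the zeroth-order entropy and the first-order conditional entropy; the toy "song" is invented, and the estimators actually used in the paper may differ.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Plug-in entropy estimate in bits per symbol."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def conditional_entropy(symbols):
    """Estimate H(X_t | X_{t-1}) as H(adjacent pairs) - H(first elements)."""
    pairs = list(zip(symbols, symbols[1:]))
    return shannon_entropy(pairs) - shannon_entropy([a for a, _ in pairs])


# Hypothetical symbol sequence with a strict repeating order.
song = list("ABCABCABCABC")
h0 = shannon_entropy(song)       # ignores ordering
h1 = conditional_entropy(song)   # accounts for the preceding symbol
```

A large drop from `h0` to `h1` is the kind of quantitative signal that indicates a strong sequential constraint, since each symbol is then largely predictable from its predecessor.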
Human action classification using procrustes shape theory
NASA Astrophysics Data System (ADS)
Cho, Wanhyun; Kim, Sangkyoon; Park, Soonyoung; Lee, Myungeun
2015-02-01
In this paper, we propose a new method that classifies a human action using Procrustes shape theory. First, we extract a pre-shape configuration vector of landmarks from each frame of an image sequence representing an arbitrary human action, and then derive the Procrustes fit vector for the pre-shape configuration vector. Second, we extract a set of pre-shape vectors from training samples stored in a database, and compute a Procrustes mean shape vector for these pre-shape vectors. Third, we extract a sequence of pre-shape vectors from the input video, and project this sequence onto the tangent space whose pole is the sequence of mean shape vectors corresponding to a target video. We then calculate the Procrustes distance between the sequence of projected pre-shape vectors on the tangent space and the sequence of mean shape vectors. Finally, we classify the input video into the human action class with the minimum Procrustes distance. We assess the performance of the proposed method using one public dataset, namely the Weizmann human action dataset. Experimental results reveal that the proposed method performs very well on this dataset.
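The core quantity in the pipeline above is a Procrustes distance between landmark configurations. For planar landmarks it can be sketched with complex arithmetic: center and scale each configuration to a pre-shape, then take the full Procrustes distance sqrt(1 - |<z1, z2>|^2). The example below is a simplification that classifies a single frame against one template per class; it skips the mean-shape estimation and tangent-space projection of the paper, and the landmark coordinates and class names are invented.

```python
import math


def pre_shape(landmarks):
    """Center 2D landmarks (as complex numbers) and scale them to unit norm."""
    z = [complex(x, y) for x, y in landmarks]
    centroid = sum(z) / len(z)
    centered = [p - centroid for p in z]
    norm = math.sqrt(sum(abs(p) ** 2 for p in centered))
    return [p / norm for p in centered]


def full_procrustes_distance(a, b):
    """Distance invariant to translation, scale and rotation (2D case)."""
    za, zb = pre_shape(a), pre_shape(b)
    inner = sum(p.conjugate() * q for p, q in zip(za, zb))
    return math.sqrt(max(0.0, 1.0 - abs(inner) ** 2))


def classify(query, templates):
    """Return the label of the template closest to `query`."""
    return min(templates,
               key=lambda label: full_procrustes_distance(query, templates[label]))


# Hypothetical three-landmark templates for two action classes.
templates = {
    "walk": [(0, 0), (1, 0), (0, 1)],
    "bend": [(0, 0), (2, 0), (0, 1)],
}
# The "walk" template rotated by 90 degrees, doubled in size and shifted.
query = [(3, 4), (3, 6), (1, 4)]

distance_same = full_procrustes_distance(templates["walk"], query)
distance_diff = full_procrustes_distance(templates["walk"], templates["bend"])
label = classify(query, templates)
```

Because the distance ignores position, size and orientation, the transformed query still matches its original template.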
Alahmadi, Hanin H; Shen, Yuan; Fouad, Shereen; Luft, Caroline Di B; Bentham, Peter; Kourtzi, Zoe; Tino, Peter
2016-01-01
Early diagnosis of dementia is critical for assessing disease progression and potential treatment. State-of-the-art machine learning techniques have been increasingly employed to take on this diagnostic task. In this study, we employed Generalized Matrix Learning Vector Quantization (GMLVQ) classifiers to discriminate patients with Mild Cognitive Impairment (MCI) from healthy controls based on their cognitive skills. Further, we adopted a "Learning with privileged information" approach to combine cognitive and fMRI data for the classification task. The resulting classifier operates solely on the cognitive data while it incorporates the fMRI data as privileged information (PI) during training. This novel classifier is of practical use as the collection of brain imaging data is not always possible with patients and older participants. MCI patients and healthy age-matched controls were trained to extract structure from temporal sequences. We ask whether machine learning classifiers can be used to discriminate patients from controls and whether differences between these groups relate to individual cognitive profiles. To this end, we tested participants in four cognitive tasks: working memory, cognitive inhibition, divided attention, and selective attention. We also collected fMRI data before and after training on a probabilistic sequence learning task and extracted fMRI responses and connectivity as features for machine learning classifiers. Our results show that the PI guided GMLVQ classifiers outperform the baseline classifier that only used the cognitive data. In addition, we found that for the baseline classifier, divided attention is the only relevant cognitive feature. When PI was incorporated, divided attention remained the most relevant feature while cognitive inhibition also became relevant for the task. 
Interestingly, this analysis for the fMRI GMLVQ classifier suggests that (1) when overall fMRI signal is used as inputs to the classifier, the post-training session is most relevant; and (2) when the graph feature reflecting underlying spatiotemporal fMRI pattern is used, the pre-training session is most relevant. Taken together these results suggest that brain connectivity before training and overall fMRI signal after training are both diagnostic of cognitive skills in MCI.
Efficient use of unlabeled data for protein sequence classification: a comparative study
Kuksa, Pavel; Huang, Pai-Hsi; Pavlovic, Vladimir
2009-01-01
Background Recent studies in computational primary protein sequence analysis have leveraged the power of unlabeled data. For example, predictive models based on string kernels trained on sequences known to belong to particular folds or superfamilies, the so-called labeled data set, can attain significantly improved accuracy if this data is supplemented with protein sequences that lack any class tags–the unlabeled data. In this study, we present a principled and biologically motivated computational framework that more effectively exploits the unlabeled data by only using the sequence regions that are more likely to be biologically relevant for better prediction accuracy. As overly-represented sequences in large uncurated databases may bias the estimation of computational models that rely on unlabeled data, we also propose a method to remove this bias and improve performance of the resulting classifiers. Results Combined with state-of-the-art string kernels, our proposed computational framework achieves very accurate semi-supervised protein remote fold and homology detection on three large unlabeled databases. It outperforms current state-of-the-art methods and exhibits significant reduction in running time. Conclusion The unlabeled sequences used under the semi-supervised setting resemble the unpolished gemstones; when used as-is, they may carry unnecessary features and hence compromise the classification accuracy but once cut and polished, they improve the accuracy of the classifiers considerably. PMID:19426450
Merson, Samuel D.; Ouwerkerk, Diane; Gulino, Lisa-Maree; Klieve, Athol; Bonde, Robert K.; Burgess, Elizabeth A.; Lanyon, Janet M.
2014-01-01
The Florida manatee, Trichechus manatus latirostris, is a hindgut-fermenting herbivore. In winter, manatees migrate to warm water overwintering sites where they undergo dietary shifts and may suffer from cold-induced stress. Given these seasonally induced changes in diet, the present study aimed to examine variation in the hindgut bacterial communities of wild manatees overwintering at Crystal River, west Florida. Faeces were sampled from 36 manatees of known sex and body size in early winter when manatees were newly arrived and then in mid-winter and late winter when diet had probably changed and environmental stress may have increased. Concentrations of faecal cortisol metabolite, an indicator of a stress response, were measured by enzyme immunoassay. Using 454-pyrosequencing, 2027 bacterial operational taxonomic units were identified in manatee faeces following amplicon pyrosequencing of the 16S rRNA gene V3/V4 region. Classified sequences were assigned to eight previously described bacterial phyla; only 0.36% of sequences could not be classified to phylum level. Five core phyla were identified in all samples. The majority (96.8%) of sequences were classified as Firmicutes (77.3 ± 11.1% of total sequences) or Bacteroidetes (19.5 ± 10.6%). Alpha-diversity measures trended towards higher diversity of hindgut microbiota in manatees in mid-winter compared to early and late winter. Beta-diversity measures, analysed through permanova, also indicated significant differences in bacterial communities based on the season.
USDA-ARS's Scientific Manuscript database
Due to the biennial generation time of onion, classical crossing takes at least four years to classify cytoplasms as normal (N) male-fertile or male-sterile (S). Molecular markers in the organellar DNAs that distinguish N and S cytoplasms are useful to reduce the time required to classify onion cytopla...
Circular RNA - New member of noncoding RNA with novel functions.
Hsiao, Kuei-Yang; Sun, H Sunny; Tsai, Shaw-Jenq
2017-06-01
A growing body of evidence indicates that circular RNAs are not simply a side product of splicing but a new class of noncoding RNAs in higher eukaryotes. The progression for the studies of circular RNAs is accelerated by combination of several advanced technologies such as next generation sequencing, gene silencing (small interfering RNAs) and editing (CRISPR/Cas9). More and more studies showed that dysregulated expression of circular RNAs plays critical roles during the development of several human diseases. Herein, we review the current advance of circular RNAs for their biosynthesis, molecular functions, and implications in human diseases. Impact statement The accumulating evidence indicate that circular RNA (circRNA) is a novel class of noncoding RNA with diverse molecular functions. Our review summarizes the current hypotheses for the models of circRNA biosynthesis including the direct interaction between upstream and downstream introns and lariat-driven circularization. In addition, molecular functions such as a decoy of microRNA (miRNA) termed miRNA sponge, transcriptional regulator, and protein-like modulator are also discussed. Finally, we reviewed the potential roles of circRNAs in neural system, cardiovascular system as well as cancers. These should provide insightful information for studying the regulation and functions of circRNA in other model of human diseases.
Nucleolar Trafficking of Nucleostemin Family Proteins: Common versus Protein-Specific Mechanisms
Meng, Lingjun; Zhu, Qubo; Tsai, Robert Y. L.
2007-01-01
The nucleolus has begun to emerge as a subnuclear organelle capable of modulating the activities of nuclear proteins in a dynamic and cell type-dependent manner. It remains unclear whether one can extrapolate a rule that predicts the nucleolar localization of multiple proteins based on protein sequence. Here, we address this issue by determining the shared and unique mechanisms that regulate the static and dynamic distributions of a family of nucleolar GTP-binding proteins, consisting of nucleostemin (NS), guanine nucleotide binding protein-like 3 (GNL3L), and Ngp1. The nucleolar residence of GNL3L is short and primarily controlled by its basic-coiled-coil domain, whereas the nucleolar residence of NS and Ngp1 is long and requires the basic and the GTP-binding domains, the latter of which functions as a retention signal. All three proteins contain a nucleoplasmic localization signal (NpLS) that prevents their nucleolar accumulation. Unlike that of the basic domain, the activity of NpLS is dynamically controlled by the GTP-binding domain. The nucleolar retention and the NpLS-regulating functions of the G domain involve specific residues that cannot be predicted by overall protein homology. This work reveals common and protein-specific mechanisms underlying the nucleolar movement of NS family proteins. PMID:17923687
Roy, Somak; Durso, Mary Beth; Wald, Abigail; Nikiforov, Yuri E; Nikiforova, Marina N
2014-01-01
A wide repertoire of bioinformatics applications exist for next-generation sequencing data analysis; however, certain requirements of the clinical molecular laboratory limit their use: i) comprehensive report generation, ii) compatibility with existing laboratory information systems and computer operating system, iii) knowledgebase development, iv) quality management, and v) data security. SeqReporter is a web-based application developed using ASP.NET framework version 4.0. The client-side was designed using HTML5, CSS3, and Javascript. The server-side processing (VB.NET) relied on interaction with a customized SQL server 2008 R2 database. Overall, 104 cases (1062 variant calls) were analyzed by SeqReporter. Each variant call was classified into one of five report levels: i) known clinical significance, ii) uncertain clinical significance, iii) pending pathologists' review, iv) synonymous and deep intronic, and v) platform and panel-specific sequence errors. SeqReporter correctly annotated and classified 99.9% (859 of 860) of sequence variants, including 68.7% synonymous single-nucleotide variants, 28.3% nonsynonymous single-nucleotide variants, 1.7% insertions, and 1.3% deletions. One variant of potential clinical significance was re-classified after pathologist review. Laboratory information system-compatible clinical reports were generated automatically. SeqReporter also facilitated quality management activities. SeqReporter is an example of a customized and well-designed informatics solution to optimize and automate the downstream analysis of clinical next-generation sequencing data. We propose it as a model that may envisage the development of a comprehensive clinical informatics solution. Copyright © 2014 American Society for Investigative Pathology and the Association for Molecular Pathology. Published by Elsevier Inc. All rights reserved.
Steel, Olivia; Kraberger, Simona; Sikorski, Alyssa; Young, Laura M; Catchpole, Ryan J; Stevens, Aaron J; Ladley, Jenny J; Coray, Dorien S; Stainton, Daisy; Dayaram, Anisha; Julian, Laurel; van Bysterveldt, Katherine; Varsani, Arvind
2016-09-01
In recent years, innovations in molecular techniques and sequencing technologies have resulted in a rapid expansion in the number of known viral sequences, in particular those with circular replication-associated protein (Rep)-encoding single-stranded (CRESS) DNA genomes. CRESS DNA viruses are present in the virome of many ecosystems and are known to infect a wide range of organisms. A large number of the recently identified CRESS DNA viruses cannot be classified into any known viral families, indicating that the current view of CRESS DNA viral sequence space is greatly underestimated. Animal faecal matter has proven to be a particularly useful source for sampling CRESS DNA viruses in an ecosystem, as it is cost-effective and non-invasive. In this study a viral metagenomic approach was used to explore the diversity of CRESS DNA viruses present in the faeces of domesticated and wild animals in New Zealand. Thirty-eight complete CRESS DNA viral genomes and two circular molecules (that may be defective molecules or single components of multicomponent genomes) were identified from forty-nine individual animal faecal samples. Based on shared genome organisations and sequence similarities, eighteen of the isolates were classified as gemycircularviruses and twelve isolates were classified as smacoviruses. The remaining eight isolates lack significant sequence similarity with any members of known CRESS DNA virus groups. This research adds significantly to our knowledge of CRESS DNA viral diversity in New Zealand, emphasising the prevalence of CRESS DNA viruses in nature, and reinforcing the suggestion that a large proportion of CRESS DNA viruses are yet to be identified. Copyright © 2016 Elsevier B.V. All rights reserved.
Impact of training sets on classification of high-throughput bacterial 16s rRNA gene surveys
Werner, Jeffrey J; Koren, Omry; Hugenholtz, Philip; DeSantis, Todd Z; Walters, William A; Caporaso, J Gregory; Angenent, Largus T; Knight, Rob; Ley, Ruth E
2012-01-01
Taxonomic classification of the thousands–millions of 16S rRNA gene sequences generated in microbiome studies is often achieved using a naïve Bayesian classifier (for example, the Ribosomal Database Project II (RDP) classifier), due to favorable trade-offs among automation, speed and accuracy. The resulting classification depends on the reference sequences and taxonomic hierarchy used to train the model; although the influence of primer sets and classification algorithms have been explored in detail, the influence of training set has not been characterized. We compared classification results obtained using three different publicly available databases as training sets, applied to five different bacterial 16S rRNA gene pyrosequencing data sets generated (from human body, mouse gut, python gut, soil and anaerobic digester samples). We observed numerous advantages to using the largest, most diverse training set available, that we constructed from the Greengenes (GG) bacterial/archaeal 16S rRNA gene sequence database and the latest GG taxonomy. Phylogenetic clusters of previously unclassified experimental sequences were identified with notable improvements (for example, 50% reduction in reads unclassified at the phylum level in mouse gut, soil and anaerobic digester samples), especially for phylotypes belonging to specific phyla (Tenericutes, Chloroflexi, Synergistetes and Candidate phyla TM6, TM7). Trimming the reference sequences to the primer region resulted in systematic improvements in classification depth, and greatest gains at higher confidence thresholds. Phylotypes unclassified at the genus level represented a greater proportion of the total community variation than classified operational taxonomic units in mouse gut and anaerobic digester samples, underscoring the need for greater diversity in existing reference databases. PMID:21716311
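The dependence of a naïve Bayesian classifier on its training set can be illustrated with a minimal sketch. It is not the RDP classifier itself: the word size `k`, the smoothing constants, the taxon labels and the toy reference sequences are all assumptions chosen for illustration.

```python
import math
from collections import defaultdict


def kmers(seq, k):
    """Return the set of overlapping k-mers ("words") found in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def train(references, k):
    """Count, per taxon, how many training sequences contain each word."""
    word_counts = defaultdict(lambda: defaultdict(int))
    n_seqs = defaultdict(int)
    for taxon, seq in references:
        n_seqs[taxon] += 1
        for word in kmers(seq, k):
            word_counts[taxon][word] += 1
    return word_counts, n_seqs


def classify(query, model, k):
    """Assign the query to the taxon with the highest summed log-probability."""
    word_counts, n_seqs = model
    scores = {}
    for taxon, n in n_seqs.items():
        # Simple smoothing (an assumption) so unseen words do not give log(0).
        scores[taxon] = sum(
            math.log((word_counts[taxon].get(word, 0) + 0.5) / (n + 1))
            for word in kmers(query, k)
        )
    return max(scores, key=scores.get)


# Hypothetical toy training set; real training sets hold full 16S sequences.
references = [
    ("Firmicutes", "ACGTACGTAC"),
    ("Firmicutes", "ACGTACGGAC"),
    ("Proteobacteria", "TTGCATTGCA"),
    ("Proteobacteria", "TTGCAATGCA"),
]
model = train(references, k=3)
prediction = classify("ACGTACG", model, k=3)
```

Because every score is computed from the words present in `references`, swapping in a different reference set changes the result for the same query, which is the effect the study quantifies.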
Cinelli, Mattia; Sun, Yuxin; Best, Katharine; Heather, James M; Reich-Zeliger, Shlomit; Shifrut, Eric; Friedman, Nir; Shawe-Taylor, John; Chain, Benny
2017-04-01
Somatic DNA recombination, the hallmark of vertebrate adaptive immunity, has the potential to generate a vast diversity of antigen receptor sequences. How this diversity captures antigen specificity remains incompletely understood. In this study we use high throughput sequencing to compare the global changes in T cell receptor β chain complementarity determining region 3 (CDR3β) sequences following immunization with ovalbumin administered with complete Freund's adjuvant (CFA) or CFA alone. The CDR3β sequences were deconstructed into short stretches of overlapping contiguous amino acids. The motifs were ranked according to a one-dimensional Bayesian classifier score comparing their frequency in the repertoires of the two immunization classes. The top ranking motifs were selected and used to create feature vectors which were used to train a support vector machine. The support vector machine achieved high classification scores in a leave-one-out validation test reaching >90% in some cases. The study describes a novel two-stage classification strategy combining a one-dimensional Bayesian classifier with a support vector machine. Using this approach we demonstrate that the frequency of a small number of linear motifs three amino acids in length can accurately identify a CD4 T cell response to ovalbumin against a background response to the complex mixture of antigens which characterize Complete Freund's Adjuvant. The sequence data is available at www.ncbi.nlm.nih.gov/sra/?term=SRP075893 . The Decombinator package is available at github.com/innate2adaptive/Decombinator . The R package e1071 is available at the CRAN repository https://cran.r-project.org/web/packages/e1071/index.html . Contact: b.chain@ucl.ac.uk. Supplementary data are available at Bioinformatics online. © The Author 2017. Published by Oxford University Press.
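The first stage of this strategy (deconstructing CDR3β sequences into overlapping triplets, ranking them, and building feature vectors) can be sketched as follows. The abstract does not give the formula of the one-dimensional Bayesian score, so a smoothed log frequency ratio is used here as a stand-in; the function names, the pseudocount and the toy CDR3 sequences are likewise assumptions. The support vector machine stage is omitted.

```python
import math
from collections import Counter


def triplets(cdr3, k=3):
    """Split a CDR3 sequence into overlapping contiguous k-mers."""
    return [cdr3[i:i + k] for i in range(len(cdr3) - k + 1)]


def motif_counts(repertoire, k=3):
    """Count every motif across a repertoire and return (counts, total)."""
    counts = Counter(m for seq in repertoire for m in triplets(seq, k))
    return counts, sum(counts.values())


def rank_motifs(class_a, class_b, k=3, pseudocount=1.0):
    """Rank motifs by |log frequency ratio| between two classes.

    The log ratio is a stand-in for the paper's Bayesian score.
    """
    counts_a, total_a = motif_counts(class_a, k)
    counts_b, total_b = motif_counts(class_b, k)
    vocab = set(counts_a) | set(counts_b)
    scores = {}
    for m in vocab:
        p_a = (counts_a[m] + pseudocount) / (total_a + pseudocount * len(vocab))
        p_b = (counts_b[m] + pseudocount) / (total_b + pseudocount * len(vocab))
        scores[m] = math.log(p_a / p_b)
    ranked = sorted(vocab, key=lambda m: abs(scores[m]), reverse=True)
    return ranked, scores


def feature_vector(repertoire, selected, k=3):
    """Relative frequency of each selected motif in one repertoire."""
    counts, total = motif_counts(repertoire, k)
    return [counts[m] / total for m in selected]


# Hypothetical toy repertoires for the two immunization classes.
ova_cfa = ["CASSLGQ", "CASSLGA"]
cfa_only = ["CASSPDR", "CASSPDT"]
ranked, scores = rank_motifs(ova_cfa, cfa_only)
features = feature_vector(ova_cfa, ["SSL", "SSP"])
```

Each repertoire's `feature_vector` would then be one training example for the second-stage classifier; motifs shared equally by both classes (here `CAS`) score zero and fall to the bottom of the ranking.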
Hayat, Maqsood; Khan, Asifullah
2011-02-21
Membrane proteins are a vital type of protein that serve as channels, receptors, and energy transducers in a cell. Prediction of membrane protein types is an important research area in bioinformatics. Knowledge of membrane protein types provides some valuable information for predicting novel examples of the membrane protein types. However, classification of membrane protein types can be both time consuming and susceptible to errors due to the inherent similarity of membrane protein types. In this paper, a neural network-based membrane protein type prediction system is proposed. Composite protein sequence representation (CPSR) is used to extract the features of a protein sequence, which include seven feature sets: amino acid composition, sequence length, 2 gram exchange group frequency, hydrophobic group, electronic group, sum of hydrophobicity, and R-group. Principal component analysis is then employed to reduce the dimensionality of the feature vector. The probabilistic neural network (PNN), generalized regression neural network, and support vector machine (SVM) are used as classifiers. A high success rate of 86.01% is obtained using SVM for the jackknife test. In the independent dataset test, PNN yields the highest accuracy of 95.73%. These classifiers exhibit improved performance using other performance measures such as sensitivity, specificity, Matthews correlation coefficient, and F-measure. The experimental results show that the prediction performance of the proposed scheme for classifying membrane protein types is the best reported so far. This performance improvement may largely be credited to the learning capabilities of neural networks and the composite feature extraction strategy, which exploits seven different properties of protein sequences. The proposed Mem-Predictor can be accessed at http://111.68.99.218/Mem-Predictor. Copyright © 2010 Elsevier Ltd. All rights reserved.
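Two of the seven CPSR feature sets, amino acid composition and sequence length, are simple enough to sketch directly. The function names and the example peptide are hypothetical, and the remaining five feature sets, the PCA step and the classifiers are not shown.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(seq):
    """Fraction of the sequence made up by each of the 20 standard residues."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]


def simple_features(seq):
    """Concatenate composition (20 values) with sequence length (1 value)."""
    return amino_acid_composition(seq) + [float(len(seq))]


# Hypothetical short peptide used only to show the shape of the output.
features = simple_features("MKKLLA")
```

The resulting 21-element vector is the kind of input that would be concatenated with the other feature sets before dimensionality reduction.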
Bacterial Community Analysis of Drinking Water Biofilms in Southern Sweden
Lührig, Katharina; Canbäck, Björn; Paul, Catherine J.; Johansson, Tomas; Persson, Kenneth M.; Rådström, Peter
2015-01-01
Next-generation sequencing of the V1–V2 and V3 variable regions of the 16S rRNA gene generated a total of 674,116 reads that described six distinct bacterial biofilm communities from both water meters and pipes. A high degree of reproducibility was demonstrated for the experimental and analytical work-flow by analyzing the communities present in parallel water meters, the rare occurrence of biological replicates within a working drinking water distribution system. The communities observed in water meters from households that did not complain about their drinking water were defined by sequences representing Proteobacteria (82–87%), with 22–40% of all sequences being classified as Sphingomonadaceae. However, a water meter biofilm community from a household with consumer reports of red water and flowing water containing elevated levels of iron and manganese had fewer sequences representing Proteobacteria (44%); only 0.6% of all sequences were classified as Sphingomonadaceae; and, in contrast to the other water meter communities, markedly more sequences represented Nitrospira and Pedomicrobium. The biofilm communities in pipes were distinct from those in water meters, and contained sequences that were identified as Mycobacterium, Nocardia, Desulfovibrio, and Sulfuricurvum. The approach employed in the present study resolved the bacterial diversity present in these biofilm communities as well as the differences that occurred in biofilms within a single distribution system, and suggests that next-generation sequencing of 16S rRNA amplicons can show changes in bacterial biofilm communities associated with different water qualities. PMID:25739379
Ali, Safdar; Majid, Abdul
2015-04-01
The diagnosis of human breast cancer is an intricate process, and specific indicators may produce negative results. In order to avoid misleading results, an accurate and reliable diagnostic system for breast cancer is indispensable. Recently, several interesting machine-learning (ML) approaches have been proposed for prediction of breast cancer. To this end, we developed a novel classifier stacking based evolutionary ensemble system "Can-Evo-Ens" for predicting amino acid sequences associated with breast cancer. In this paper, we first selected four diverse types of ML algorithms, Naïve Bayes, K-Nearest Neighbor, Support Vector Machines, and Random Forest, as base-level classifiers. These classifiers are trained individually in different feature spaces using physicochemical properties of amino acids. In order to exploit the decision spaces, the preliminary predictions of base-level classifiers are stacked. Genetic programming (GP) is then employed to develop a meta-classifier that optimally combines the predictions of the base classifiers. The most suitable threshold value of the best-evolved predictor is computed using the Particle Swarm Optimization technique. Our experiments have demonstrated the robustness of the Can-Evo-Ens system on an independent validation dataset. The proposed system has achieved the highest value of Area Under Curve (AUC) of ROC Curve of 99.95% for cancer prediction. The comparative results revealed that the proposed approach is better than individual ML approaches and conventional ensemble approaches of AdaBoostM1, Bagging, GentleBoost, and Random Subspace. It is expected that the proposed novel system would have a major impact on the fields of Biomedical, Genomics, Proteomics, Bioinformatics, and Drug Development. Copyright © 2015 Elsevier Inc. All rights reserved.
DeepGene: an advanced cancer type classifier based on deep learning and somatic point mutations.
Yuan, Yuchen; Shi, Yi; Li, Changyang; Kim, Jinman; Cai, Weidong; Han, Zeguang; Feng, David Dagan
2016-12-23
With the development of DNA sequencing technology, large amounts of sequencing data have become available in recent years and provide unprecedented opportunities for advanced association studies between somatic point mutations and cancer types/subtypes, which may contribute to more accurate somatic point mutation based cancer classification (SMCC). However, in existing SMCC methods, issues like high data sparsity, small sample size, and the application of simple linear classifiers are major obstacles to improving the classification performance. To address the obstacles in existing SMCC studies, we propose DeepGene, an advanced deep neural network (DNN) based classifier, that consists of three steps: firstly, the clustered gene filtering (CGF) concentrates the gene data by mutation occurrence frequency, filtering out the majority of irrelevant genes; secondly, the indexed sparsity reduction (ISR) converts the gene data into indexes of its non-zero elements, thereby significantly suppressing the impact of data sparsity; finally, the data after CGF and ISR is fed into a DNN classifier, which extracts high-level features for accurate classification. Experimental results on our curated TCGA-DeepGene dataset, which is a reformulated subset of the TCGA dataset containing 12 selected types of cancer, show that CGF, ISR and DNN all contribute to improving the overall classification performance. We further compare DeepGene with three widely adopted classifiers and demonstrate that DeepGene has at least 24% performance improvement in terms of testing accuracy. Based on deep learning and somatic point mutation data, we devise DeepGene, an advanced cancer type classifier, which addresses the obstacles in existing SMCC studies. Experiments indicate that DeepGene outperforms three widely adopted existing classifiers, which is mainly attributed to its deep learning module that is able to extract the high level features between combinatorial somatic point mutations and cancer types.
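The indexed sparsity reduction step, as described, replaces a sparse mutation vector with the indexes of its non-zero elements. A minimal sketch follows; the 1-based indexing, the zero padding to a fixed length and the function name are assumptions, since the abstract does not specify how variable-length index lists are handled.

```python
def indexed_sparsity_reduction(mutation_vector, length):
    """Return the 1-based indexes of non-zero entries, zero-padded to `length`.

    Indexes start at 1 so that 0 can serve as the padding value; extra
    indexes beyond `length` are truncated (both choices are assumptions).
    """
    indexes = [i + 1 for i, value in enumerate(mutation_vector) if value != 0]
    indexes = indexes[:length]
    return indexes + [0] * (length - len(indexes))


# Hypothetical sample: genes 2, 5 and 6 carry a somatic point mutation.
sample = [0, 1, 0, 0, 1, 1, 0]
reduced = indexed_sparsity_reduction(sample, length=4)
```

The dense, fixed-length output is what makes this representation convenient as network input compared with a mostly-zero vector over all genes.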
On multi-site damage identification using single-site training data
NASA Astrophysics Data System (ADS)
Barthorpe, R. J.; Manson, G.; Worden, K.
2017-11-01
This paper proposes a methodology for developing multi-site damage location systems for engineering structures that can be trained using single-site damaged state data only. The methodology involves training a sequence of binary classifiers based upon single-site damage data and combining the developed classifiers into a robust multi-class damage locator. In this way, the multi-site damage identification problem may be decomposed into a sequence of binary decisions. In this paper Support Vector Classifiers are adopted as the means of making these binary decisions. The proposed methodology represents an advancement on the state of the art in the field of multi-site damage identification, which requires either: (1) full damaged state data from single- and multi-site damage cases or (2) the development of a physics-based model to make multi-site model predictions. The potential benefit of the proposed methodology is that a significantly reduced number of recorded damage states may be required in order to train a multi-site damage locator without recourse to physics-based model predictions. In this paper it is first demonstrated that Support Vector Classification represents an appropriate approach to the multi-site damage location problem, with methods for combining binary classifiers discussed. Next, the proposed methodology is demonstrated and evaluated through application to a real engineering structure - a Piper Tomahawk trainer aircraft wing - with its performance compared to classifiers trained using the full damaged-state dataset.
Complete genome sequence of a recent panzootic virulent Newcastle disease virus from Pakistan
USDA-ARS's Scientific Manuscript database
Complete genome sequence of a new strain of Newcastle disease virus (NDV) (chicken/Pak/Lahore-611/2013) is reported. The strain was isolated from a vaccinated chicken flock in Pakistan in 2013 and has panzootic features. The genome is 15192 nucleotides in length and is classified as sub-genotype V...
Detection of distorted frames in retinal video-sequences via machine learning
NASA Astrophysics Data System (ADS)
Kolar, Radim; Liberdova, Ivana; Odstrcilik, Jan; Hracho, Michal; Tornow, Ralf P.
2017-07-01
This paper describes the detection of distorted frames in retinal sequences based on a set of global features extracted from each frame. The feature vector is subsequently used in the classification step, in which three types of classifiers are tested. The best classification accuracy, 96%, was achieved with the support vector machine approach.
Genotype diversity of hepatitis C virus (HCV) in HCV-associated liver disease patients in Indonesia.
Utama, Andi; Tania, Navessa Padma; Dhenni, Rama; Gani, Rino Alvani; Hasan, Irsan; Sanityoso, Andri; Lelosutan, Syafruddin A R; Martamala, Ruswhandi; Lesmana, Laurentius Adrianus; Sulaiman, Ali; Tai, Susan
2010-09-01
Hepatitis C virus (HCV) genotype distribution in Indonesia has been reported. However, the identification of HCV genotype was based on 5'-UTR or NS5B sequence. This study aimed to observe HCV core sequence variation among HCV-associated liver disease patients in Jakarta, and to analyse the HCV genotype diversity based on the core sequence. Sixty-eight chronic hepatitis (CH), 48 liver cirrhosis (LC) and 34 hepatocellular carcinoma (HCC) patients were included in this study. HCV core variation was analysed by direct sequencing. Alignment of HCV core sequences demonstrated that the core sequence was relatively varied among the genotypes. Indeed, 237 bases of the core sequence could classify the HCV subtype; however, 236 bases failed to differentiate several subtypes. Based on 237 bases of the core sequences, the HCV strains were classified into genotypes 1 (subtypes 1a, 1b and 1c), 2 (subtypes 2a, 2e and 2f) and 3 (subtypes 3a and 3k). HCV 1b (47.3%) was the most prevalent, followed by subtypes 1c (18.7%), 3k (10.7%), 2a (10.0%), 1a (6.7%), 2e (5.3%), 2f (0.7%) and 3a (0.7%). HCV 1b was the most common in all patients, and the prevalence increased with the severity of liver disease (36.8% in CH, 54.2% in LC and 58.8% in HCC). These results were similar to a previous report based on NS5B sequence analysis. The hepatitis C virus core sequence (237 bases) could identify the HCV subtype, and the prevalence of HCV subtypes based on the core sequence was similar to that based on the NS5B region.
Choi, Yoonha; Liu, Tiffany Ting; Pankratz, Daniel G; Colby, Thomas V; Barth, Neil M; Lynch, David A; Walsh, P Sean; Raghu, Ganesh; Kennedy, Giulia C; Huang, Jing
2018-05-09
We developed a classifier using RNA sequencing data that identifies the usual interstitial pneumonia (UIP) pattern for the diagnosis of idiopathic pulmonary fibrosis. We addressed significant challenges, including limited sample size, biological and technical sample heterogeneity, and reagent and assay batch effects. We identified inter- and intra-patient heterogeneity, particularly within the non-UIP group. The models classified UIP on transbronchial biopsy samples with a receiver-operating characteristic area under the curve of ~ 0.9 in cross-validation. Using in silico mixed samples in training, we prospectively defined a decision boundary to optimize specificity at ≥85%. The penalized logistic regression model showed greater reproducibility across technical replicates and was chosen as the final model. The final model showed sensitivity of 70% and specificity of 88% in the test set. We demonstrated that the suggested methodologies appropriately addressed challenges of the sample size, disease heterogeneity and technical batch effects and developed a highly accurate and robust classifier leveraging RNA sequencing for the classification of UIP.
Layered classification techniques for remote sensing applications
NASA Technical Reports Server (NTRS)
Swain, P. H.; Wu, C. L.; Landgrebe, D. A.; Hauska, H.
1975-01-01
The single-stage method of pattern classification utilizes all available features in a single test which assigns the unknown to a category according to a specific decision strategy (such as the maximum likelihood strategy). The layered classifier classifies the unknown through a sequence of tests, each of which may be dependent on the outcome of previous tests. Although the layered classifier was originally investigated as a means of improving classification accuracy and efficiency, it was found that in the context of remote sensing data analysis, other advantages also accrue due to many of the special characteristics of both the data and the applications pursued. The layered classifier method and several of the diverse applications of this approach are discussed.
Chiu, Shih-Hau; Chen, Chien-Chi; Yuan, Gwo-Fang; Lin, Thy-Hou
2006-01-01
Background: The number of sequences compiled in many genome projects is growing exponentially, but most of them have not been characterized experimentally. An automatic annotation scheme is urgently needed to reduce the gap between the number of new sequences produced and reliable functional annotation. This work proposes rules for automatically classifying fungal genes. The approach involves elucidating the enzyme classifying rule that is hidden in the UniProt protein knowledgebase and then applying it for classification. The association algorithm, Apriori, is utilized to mine the relationship between the enzyme class and significant InterPro entries. The candidate rules are evaluated for their classificatory capacity. Results: Five datasets were collected from Swiss-Prot for establishing the annotation rules. These were treated as the training sets. The TrEMBL entries were treated as the testing set. A correct enzyme classification rate of 70% was obtained for the prokaryote datasets and a similar rate of about 80% was obtained for the eukaryote datasets. The fungus training dataset, which lacks an enzyme class description, was also used to evaluate the fungus candidate rules. A total of 88 out of 5085 test entries were matched with the fungus rule set. These were otherwise poorly annotated using their functional descriptions. Conclusion: The feasibility of using the method presented here to classify enzyme classes based on the enzyme domain rules is evident. The rules may also be employed by protein annotators in manual annotation or implemented in an automatic annotation flowchart. PMID:16776838
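Mining rules of the form "InterPro entries imply enzyme class" with Apriori can be sketched with a small level-wise implementation. The thresholds, the `EC:` label prefix, the placeholder entry names (`IPR_A` and so on, not real InterPro accessions) and the toy annotations are assumptions for illustration.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise (Apriori) search for itemsets meeting the support threshold."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    level = {frozenset([item]) for t in transactions for item in t}
    level = {s for s in level if support(s) >= min_support}
    frequent = {}
    k = 1
    while level:
        for itemset in level:
            frequent[itemset] = support(itemset)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
        }
        level = {c for c in candidates if support(c) >= min_support}
    return frequent


def class_rules(frequent, min_confidence):
    """Keep rules 'InterPro entries -> one enzyme class' with enough confidence."""
    rules = []
    for itemset, sup in frequent.items():
        labels = [item for item in itemset if item.startswith("EC:")]
        if len(labels) != 1 or len(itemset) < 2:
            continue
        antecedent = itemset - set(labels)
        confidence = sup / frequent[antecedent]
        if confidence >= min_confidence:
            rules.append((antecedent, labels[0], sup, confidence))
    return rules


# Hypothetical annotated proteins: InterPro-like entries plus an enzyme class.
proteins = [
    frozenset({"IPR_A", "IPR_B", "EC:3"}),
    frozenset({"IPR_A", "IPR_B", "EC:3"}),
    frozenset({"IPR_A", "EC:3"}),
    frozenset({"IPR_C", "EC:2"}),
    frozenset({"IPR_C", "IPR_A", "EC:2"}),
]
rules = class_rules(frequent_itemsets(proteins, min_support=0.4), min_confidence=0.9)
```

In this toy set `IPR_A` alone does not yield a rule because it co-occurs with both classes, while `IPR_B` and `IPR_C` each point to a single class; applying such rules to unannotated entries corresponds to the testing step described above.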
Survey of genome sequences in a wild sweet potato, Ipomoea trifida (H. B. K.) G. Don
Hirakawa, Hideki; Okada, Yoshihiro; Tabuchi, Hiroaki; Shirasawa, Kenta; Watanabe, Akiko; Tsuruoka, Hisano; Minami, Chiharu; Nakayama, Shinobu; Sasamoto, Shigemi; Kohara, Mitsuyo; Kishida, Yoshie; Fujishiro, Tsunakazu; Kato, Midori; Nanri, Keiko; Komaki, Akiko; Yoshinaga, Masaru; Takahata, Yasuhiro; Tanaka, Masaru; Tabata, Satoshi; Isobe, Sachiko N.
2015-01-01
Ipomoea trifida (H. B. K.) G. Don. is the most likely diploid ancestor of the hexaploid sweet potato, I. batatas (L.) Lam. To assist in analysis of the sweet potato genome, de novo whole-genome sequencing was performed with two lines of I. trifida, namely the selfed line Mx23Hm and the highly heterozygous line 0431-1, using the Illumina HiSeq platform. We classified the sequences thus obtained as either ‘core candidates’ (common to the two lines) or ‘line specific’. The total length of the assembled sequences of Mx23Hm (ITR_r1.0) was 513 Mb, while that of 0431-1 (ITRk_r1.0) was 712 Mb. Of the assembled sequences, 240 Mb (Mx23Hm) and 353 Mb (0431-1) were classified into core candidate sequences. A total of 62,407 (62.4 Mb) and 109,449 (87.2 Mb) putative genes were identified, respectively, in the genomes of Mx23Hm and 0431-1, of which 11,823 were derived from core sequences of Mx23Hm, while 28,831 were from the core candidate sequence of 0431-1. There were a total of 1,464,173 single-nucleotide polymorphisms and 16,682 copy number variations (CNVs) in the two assembled genomic sequences (under the condition of log2 ratio of >1 and CNV size >1,000 bases). The results presented here are expected to contribute to the progress of genomic and genetic studies of I. trifida, as well as studies of the sweet potato and the genus Ipomoea in general. PMID:25805887
Classification of HCV and HIV-1 Sequences with the Branching Index
Hraber, Peter; Kuiken, Carla; Waugh, Mark; Geer, Shaun; Bruno, William J.; Leitner, Thomas
2009-01-01
SUMMARY Classification of viral sequences should be fast, objective, accurate, and reproducible. Most methods that classify sequences use either pairwise distances or phylogenetic relations, but cannot discern when a sequence is unclassifiable. The branching index (BI) combines distance and phylogeny methods to compute a ratio that quantifies how closely a query sequence clusters with a subtype clade. In the hypothesis-testing framework of statistical inference, the BI is compared with a threshold to test whether sufficient evidence exists for the query sequence to be classified among known sequences. If above the threshold, the null hypothesis of no support for the subtype relation is rejected and the sequence is taken as belonging to the subtype clade with which it clusters on the tree. This study evaluates statistical properties of the branching index for subtype classification in HCV and HIV-1. Pairs of BI values with known positive and negative test results were computed from 10,000 random fragments of reference alignments. Sampled fragments were of sufficient length to contain phylogenetic signal that groups reference sequences together properly into subtype clades. For HCV, a threshold BI of 0.71 yields 95.1% agreement with reference subtypes, with equal false positive and false negative rates. For HIV-1, a threshold of 0.66 yields 93.5% agreement. Higher thresholds can be used where lower false positive rates are required. In synthetic recombinants, regions without breakpoints are recognized accurately; regions with breakpoints do not uniquely represent any known subtype. Web-based services for viral subtype classification with the branching index are available online. PMID:18753218
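The decision rule applied once a branching index has been computed can be written down directly. The sketch below uses the two thresholds reported above; the function name and the example BI values are hypothetical, and the computation of the BI itself from distances and tree topology is not shown because the abstract does not define it.

```python
# Thresholds reported in the abstract for each virus.
BI_THRESHOLDS = {"HCV": 0.71, "HIV-1": 0.66}


def assign_subtype(virus, nearest_clade, branching_index):
    """Return the clade the query clusters with if its BI is above the
    threshold for that virus, otherwise None (sequence left unclassified)."""
    if branching_index > BI_THRESHOLDS[virus]:
        return nearest_clade
    return None


# Hypothetical query results.
confident_call = assign_subtype("HCV", "1b", 0.80)
unclassified = assign_subtype("HCV", "1b", 0.60)
```

Raising a value in `BI_THRESHOLDS` would make the rule more conservative, matching the note that higher thresholds can be used where lower false positive rates are required.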
Guo, Xinwei; Ma, Zeyang; Zhang, Zhonghui; Cheng, Lailiang; Zhang, Xiuren; Li, Tianhong
2017-01-01
Transition from vegetative to floral buds is a critical physiological change during flower induction that determines fruit productivity. Small non-coding RNAs (sRNAs) including microRNAs (miRNAs) and small interfering RNAs (siRNAs) are pivotal regulators of plant growth and development. Although the key role of sRNAs in flowering regulation has been well described in Arabidopsis and some other annual plants, their relevance to vegetative-to-floral transition (hereafter referred to as floral transition) in perennial woody trees remains poorly defined. Here, we performed Illumina sequencing of sRNA libraries prepared from vegetative and floral buds during flower induction in apple trees. A large number of sRNAs, exemplified by 33 previously annotated miRNAs and six novel members, display significant differential expression (DE) patterns. Notably, most of these DE-miRNAs in floral transition displayed opposite expression changes in reported phase transition in apple trees. Bioinformatics analysis suggests most of the DE-miRNAs targeted transcripts involved in SQUAMOSA PROMOTER BINDING PROTEIN-LIKE (SPL) gene regulation, stress responses, and auxin and gibberellin (GA) pathways, with further suggestion that there is an inherent link between physiological stress response and metabolism reprogramming during floral transition. We also observed significant changes in 24 nucleotide (nt) sRNAs that are hallmarks of the RNA-dependent DNA methylation (RdDM) pathway, suggestive of a correlation between epigenetic modifications and the floral transition. The study not only provides new insight into our understanding of the fundamental mechanism of the poorly studied floral transition in apple and other woody plants, but also presents an important sRNA resource for future in-depth research in apple flowering physiology. PMID:28611800
Peng, Rongxue; Zhang, Rui; Lin, Guigao; Yang, Xin; Li, Ziyang; Zhang, Kuo; Zhang, Jiawei; Li, Jinming
2017-09-01
The echinoderm microtubule-associated protein-like 4 and anaplastic lymphoma kinase (ALK) receptor tyrosine kinase (EML4-ALK) rearrangement is an important biomarker that plays a pivotal role in therapeutic decision making for non-small-cell lung cancer (NSCLC) patients. Ensuring accuracy and reproducibility of EML4-ALK testing by fluorescence in situ hybridization, immunohistochemistry, RT-PCR, and next-generation sequencing requires reliable reference materials for monitoring assay sensitivity and specificity. Herein, we developed novel reference materials for various kinds of EML4-ALK testing. CRISPR/Cas9 was used to edit various NSCLC cell lines containing EML4-ALK rearrangement variants 1, 2, and 3a/b. After s.c. inoculation, the formalin-fixed, paraffin-embedded (FFPE) samples from xenografts were prepared and tested for suitability as candidate reference materials by fluorescence in situ hybridization, immunohistochemistry, RT-PCR, and next-generation sequencing. Sample validation and commutability assessments showed that all types of FFPE samples derived from xenograft tumors have typical histological structures, and EML4-ALK testing results were similar to the clinical ALK-positive NSCLC specimens. Among the four methods for EML4-ALK detection, the validation test showed 100% concordance. Furthermore, these novel FFPE reference materials showed good stability and homogeneity. Without limitations on variant types and production, our novel FFPE samples based on CRISPR/Cas9 editing and xenografts are suitable as candidate reference materials for the validation, verification, internal quality control, and proficiency testing of EML4-ALK detection. Copyright © 2017 American Society for Investigative Pathology and the Association for Molecular Pathology. Published by Elsevier Inc. All rights reserved.
Fustiñana, Maria Sol; Ariel, Pablo; Federman, Noel; Freudenthal, Ramiro; Romano, Arturo
2010-09-01
Human β-amyloid, the main component in the neuritic plaques found in patients with Alzheimer's disease, is generated by cleavage of the β-amyloid precursor protein. Beyond their role in pathology, members of this protein family are synaptic proteins and have been associated with synaptogenesis, neuronal plasticity and memory, both in vertebrates and in invertebrates. Consolidation is necessary to convert a short-term labile memory to a long-term and stable form. During consolidation, gene expression and de novo protein synthesis are regulated in order to produce key proteins for the maintenance of plastic changes produced during the acquisition of new information. Here we partially cloned and sequenced the beta-amyloid precursor protein-like gene homologue in the crab Chasmagnathus (cappl), showing 37% identity with the fruit fly Drosophila melanogaster homologue and 23% with Homo sapiens, but with a much higher degree of sequence similarity in certain regions. We observed a wide distribution of cappl mRNA in the nervous system as well as in muscle and gills. The protein localized in all tissues analyzed with the exception of muscle. Immunofluorescence revealed localization of cAPPL in associative and sensory brain areas. We studied gene and protein expression during long-term memory consolidation using a well characterized memory model: the context-signal associative memory in this crab species. mRNA levels varied at different time points during long-term memory consolidation and correlated with cAPPL protein levels. cAPPL mRNA and protein are widely distributed in the central nervous system of the crab, and the time course of expression suggests a role of cAPPL during long-term memory formation.
Srinivasulu, Yerukala Sathipati; Wang, Jyun-Rong; Hsu, Kai-Ti; Tsai, Ming-Ju; Charoenkwan, Phasit; Huang, Wen-Lin; Huang, Hui-Ling; Ho, Shinn-Ying
2015-01-01
Background Protein-protein interactions (PPIs) are involved in various biological processes, and underlying mechanism of the interactions plays a crucial role in therapeutics and protein engineering. Most machine learning approaches have been developed for predicting the binding affinity of protein-protein complexes based on structure and functional information. This work aims to predict the binding affinity of heterodimeric protein complexes from sequences only. Results This work proposes a support vector machine (SVM) based binding affinity classifier, called SVM-BAC, to classify heterodimeric protein complexes based on the prediction of their binding affinity. SVM-BAC identified 14 of 580 sequence descriptors (physicochemical, energetic and conformational properties of the 20 amino acids) to classify 216 heterodimeric protein complexes into low and high binding affinity. SVM-BAC yielded the training accuracy, sensitivity, specificity, AUC and test accuracy of 85.80%, 0.89, 0.83, 0.86 and 83.33%, respectively, better than existing machine learning algorithms. The 14 features and support vector regression were further used to estimate the binding affinities (Pkd) of 200 heterodimeric protein complexes. Prediction performance of a Jackknife test was the correlation coefficient of 0.34 and mean absolute error of 1.4. We further analyze three informative physicochemical properties according to their contribution to prediction performance. Results reveal that the following properties are effective in predicting the binding affinity of heterodimeric protein complexes: apparent partition energy based on buried molar fractions, relations between chemical structure and biological activity in principal component analysis IV, and normalized frequency of beta turn. 
Conclusions The proposed sequence-based prediction method SVM-BAC uses an optimal feature selection method to identify 14 informative features to classify and predict binding affinity of heterodimeric protein complexes. The characterization analysis revealed that the average numbers of beta turns and hydrogen bonds at protein-protein interfaces in high binding affinity complexes are more than those in low binding affinity complexes. PMID:26681483
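The accuracy, sensitivity and specificity quoted for the classifier are standard confusion-matrix ratios. The following is a minimal sketch of how such figures are computed; the function name and the counts are hypothetical and are not taken from the paper.

```python
def binary_metrics(tp, fn, tn, fp):
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts.

    tp/fn: high-affinity complexes predicted correctly / missed
    tn/fp: low-affinity complexes predicted correctly / flagged as high
    """
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)   # true-positive rate
    specificity = tn / (tn + fp)   # true-negative rate
    return accuracy, sensitivity, specificity


# Hypothetical counts for illustration only (not the SVM-BAC results).
acc, sens, spec = binary_metrics(tp=40, fn=10, tn=45, fp=5)
print(f"accuracy={acc:.2f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

Reporting sensitivity and specificity alongside accuracy matters when the low- and high-affinity classes are not the same size, since accuracy alone can hide poor performance on the smaller class.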
Protein interface classification by evolutionary analysis
2012-01-01
Background Distinguishing biologically relevant interfaces from lattice contacts in protein crystals is a fundamental problem in structural biology. Despite efforts towards the computational prediction of interface character, many issues are still unresolved. Results We present here a protein-protein interface classifier that relies on evolutionary data to detect the biological character of interfaces. The classifier uses a simple geometric measure, number of core residues, and two evolutionary indicators based on the sequence entropy of homolog sequences. Both aim at detecting differential selection pressure between interface core and rim or rest of surface. The core residues, defined as fully buried residues (>95% burial), appear to be fundamental determinants of biological interfaces: their number is in itself a powerful discriminator of interface character and together with the evolutionary measures it is able to clearly distinguish evolved biological contacts from crystal ones. We demonstrate that this definition of core residues leads to distinctively better results than earlier definitions from the literature. The stringent selection and quality filtering of structural and sequence data was key to the success of the method. Most importantly we demonstrate that a more conservative selection of homolog sequences - with relatively high sequence identities to the query - is able to produce a clearer signal than previous attempts. Conclusions An evolutionary approach like the one presented here is key to the advancement of the field, which so far was missing an effective method exploiting the evolutionary character of protein interfaces. Its coverage and performance will only improve over time thanks to the incessant growth of sequence databases. Currently our method reaches an accuracy of 89% in classifying interfaces of the Ponstingl 2003 datasets and it lends itself to a variety of useful applications in structural biology and bioinformatics. 
We made the corresponding software implementation available to the community as an easy-to-use graphical web interface at http://www.eppic-web.org. PMID:23259833
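As an illustration of an entropy-based evolutionary indicator of the kind described above, the sketch below computes the Shannon entropy of alignment columns and compares interface-core columns with rim columns. The ratio definition, the function names and the toy alignment are assumptions made for illustration; they are not the scoring actually implemented in the published software.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of one alignment column, ignoring gaps."""
    residues = [c for c in column if c != "-"]
    n = len(residues)
    counts = Counter(residues)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())


def core_to_rim_ratio(alignment, core_cols, rim_cols):
    """Mean entropy of core columns divided by mean entropy of rim columns.

    A ratio below 1 means the core is more conserved than the rim
    (assumed reading: stronger selection pressure on the core).
    """
    def mean_entropy(cols):
        return sum(column_entropy([seq[i] for seq in alignment]) for i in cols) / len(cols)

    return mean_entropy(core_cols) / mean_entropy(rim_cols)


# Toy alignment of four homologs (hypothetical data).
alignment = ["ACDE", "ACDF", "ACEG", "ACEH"]
ratio = core_to_rim_ratio(alignment, core_cols=[0, 2], rim_cols=[3])
print(ratio)
```

The sketch assumes every selected column contains at least one non-gap residue and that the rim columns are not all perfectly conserved; a production implementation would need to guard both cases.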
On Burst Detection and Prediction in Retweeting Sequence
2015-05-22
We conduct a comprehensive empirical analysis of a large microblogging dataset collected from the Sina Weibo and report our observations of burst...whether and how accurate we can predict bursts using classifiers based on the extracted features. Our empirical study of the Sina Weibo data shows the...feasibility of burst prediction using appropriately extracted features and classic classifiers. 1 Introduction Microblogging, such as Twitter and Sina
Diaz, Naryttza N; Krause, Lutz; Goesmann, Alexander; Niehaus, Karsten; Nattkemper, Tim W
2009-01-01
Background Metagenomics, or the sequencing and analysis of collective genomes (metagenomes) of microorganisms isolated from an environment, promises direct access to the "unculturable majority". This emerging field offers the potential to lay a solid basis for our understanding of the entire living world. However, taxonomic classification is an essential task in the analysis of metagenomics data sets that is still far from being solved. We present a novel strategy to predict the taxonomic origin of environmental genomic fragments. The proposed classifier combines the idea of the k-nearest neighbor with strategies from kernel-based learning. Results Our novel strategy was extensively evaluated using the leave-one-out cross validation strategy on fragments of variable length (800 bp – 50 Kbp) from 373 completely sequenced genomes. TACOA is able to classify genomic fragments of length 800 bp and 1 Kbp with high accuracy down to rank class. For longer fragments ≥ 3 Kbp accurate predictions are made at even deeper taxonomic ranks (order and genus). Remarkably, TACOA also produces reliable results when the taxonomic origin of a fragment is not represented in the reference set, classifying such fragments into their known broader taxonomic class or simply as "unknown". We compared the classification accuracy of TACOA with the latest intrinsic classifier PhyloPythia using 63 recently published complete genomes. For fragments of length 800 bp and 1 Kbp the overall accuracy of TACOA is higher than that obtained by PhyloPythia at all taxonomic ranks. For all fragment lengths, both methods achieved comparably high specificity up to rank class, and low false negative rates were also obtained. Conclusion An accurate multi-class taxonomic classifier was developed for environmental genomic fragments. TACOA can predict with high reliability the taxonomic origin of genomic fragments as short as 800 bp. 
The proposed method is transparent, fast and accurate, and the reference set can be easily updated as newly sequenced genomes become available. Moreover, the method proved to be competitive when compared to the most current classifier PhyloPythia and has the advantage that it can be locally installed and the reference set can be kept up-to-date. PMID:19210774
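The combination of k-nearest neighbor with kernel-based learning can be pictured as a kernel-weighted vote over the nearest reference genomes. The sketch below is only an illustration of that general idea: the dinucleotide features, the Gaussian kernel, the bandwidth, the `min_score` cut-off for "unknown" and the toy sequences are all assumptions, not details of TACOA itself.

```python
import math
from collections import Counter
from itertools import product


def kmer_profile(seq, k=2):
    """Normalized k-mer frequency vector over the alphabet ACGT."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers) or 1
    return [counts[m] / total for m in kmers]


def classify(fragment, references, k_neighbors=3, bandwidth=0.1, min_score=0.0):
    """Kernel-weighted k-NN vote; returns 'unknown' if the winning score is too low."""
    query = kmer_profile(fragment)
    scored = []
    for label, ref_seq in references:
        ref = kmer_profile(ref_seq)
        dist2 = sum((a - b) ** 2 for a, b in zip(query, ref))
        # Gaussian kernel: close reference profiles get weights near 1.
        scored.append((math.exp(-dist2 / (2 * bandwidth ** 2)), label))
    scored.sort(reverse=True)
    votes = Counter()
    for weight, label in scored[:k_neighbors]:
        votes[label] += weight
    label, score = votes.most_common(1)[0]
    return label if score >= min_score else "unknown"


# Toy reference set with hypothetical class labels.
references = [
    ("classA", "ATATATATATTTAAATAT"),
    ("classA", "AATTAATTATATAT"),
    ("classB", "GCGCGGCCGCGCGG"),
    ("classB", "GGCCGCGCGCCG"),
]
predicted = classify("GCGCGCGGCC", references)
rejected = classify("GCGCGCGGCC", references, min_score=10.0)
print(predicted, rejected)
```

Weighting votes by a kernel rather than counting neighbors equally lets a distant neighbor contribute almost nothing, which is one plausible way to fall back to "unknown" when no reference is close.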
Xiao, Sa; Paldurai, Anandan; Nayak, Baibaswata; Samuel, Arthur; Bharoto, Eny E.; Prajitno, Teguh Y.; Collins, Peter L.
2012-01-01
Eight highly virulent Newcastle disease virus (NDV) strains were isolated from vaccinated commercial chickens in Indonesia during outbreaks in 2009 and 2010. The complete genome sequences of two NDV strains and the sequences of the surface protein genes (F and HN) of six other strains were determined. Phylogenetic analysis classified them into two new subgroups of genotype VII in the class II cluster that were genetically distinct from vaccine strains. This is the first report of complete genome sequences of NDV strains isolated from chickens in Indonesia. PMID:22532534
NASA Astrophysics Data System (ADS)
Xiaona, W.; Bao, H.; Wu, Y.
2013-12-01
The Changjiang is one of the largest rivers in the world, and studying the properties of its dissolved organic matter can help reveal changes in terrestrial organic matter in a typical large subtropical river system. Samples collected from the mid-lower reaches of the Changjiang and its main tributaries/lakes in July 2010 and August 2012 were analysed for dissolved organic carbon (DOC), dissolved lignin phenols and chromophoric dissolved organic matter (CDOM). Based on the hydrological conditions, both cruises took place in the flood season, the latter during an extreme flood season. The hydrological conditions can affect the signal of dissolved lignin phenols as well as DOC. The DOC concentration is similar for both cruises, with an average of 139±21 μM in 2010 and 130±36 μM in 2012. The dissolved lignin phenols, however, show an obvious difference: the concentration is 13.6±3.4 μg/L and 12.7±5.2 μg/L for the main stream and tributaries/lakes in 2010, respectively, but it decreases to 8.7±2.5 μg/L and 6.5±3.5 μg/L in 2012. The dissolved lignin phenols show a positive correlation with DOC in August 2012, but no similar trend is observed in 2010. Excitation-emission matrix fluorescence spectroscopy combined with parallel factor analysis (EEMs-PARAFAC) decomposes the fluorescence matrices of CDOM into three humic-like (H1: 315(250)/400 nm, H2: 350(280)/460 nm, H3: 250/450~485 nm) and two protein-like (P1: 270/315 nm, P2: 285/350 nm) components. Good linear correlations are observed among the three humic-like components and between the two protein-like components, indicating that components of the same type (humic-like or protein-like) have similar origins and geochemical behaviors. However, the two kinds of components show different tendencies. The total content of dissolved lignin phenols is correlated with the absorption at 280 nm, indicating that the optical properties of CDOM are related to its structure. 
Many factors affect the composition of dissolved organic matter in a large river system like the Changjiang. We find that the biomarkers show variable geochemical behaviors under different hydrological conditions. The variation of the biomarkers can reveal alterations in hydrological factors.
Li, Chengcheng; Cabassud, Corinne; Reboul, Bernard; Guigui, Christelle
2015-02-01
Membrane bioreactors (MBRs) are increasingly used for municipal wastewater treatment and reuse, and great concern has been raised over the last decade about emerging trace pollutants found in the aquatic environment, notably pharmaceuticals. As a consequence, the removal of pharmaceutical micropollutants by MBRs has been extensively investigated. However, there is still a lack of knowledge on the effects of the presence of pharmaceutical micropollutants in domestic wastewaters on MBR fouling. Among the different pharmaceuticals, it was decided to focus on carbamazepine (CBZ), an anti-epileptic drug, because of its occurrence in domestic wastewaters and its persistence in biological processes including MBRs. This paper focuses on the effects of continuous carbamazepine pollution on MBR fouling. A continuous introduction of CBZ into the MBR via the feed (about 90 μg L⁻¹ CBZ in the feed) provoked a jump in transmembrane pressure (TMP). The jump occurred just 1 day after the addition of CBZ to the MBR, and from that point the rate of TMP increase was significantly higher than before the addition of CBZ. This indicates that the pharmaceutical stress induced by CBZ causes more severe membrane fouling. Addition of CBZ was shown to induce a significant increase in the concentration of proteins in the supernatant during the first several days, after which it stabilized at the original level, whereas no significant change was found for polysaccharides. HPLC-SEC analysis showed that addition of CBZ induced a decrease in 100-1000 kDa protein-like SMPs and a more significant increase in 10-100 kDa protein-like SMPs in the supernatant. Moreover, it was found that addition of CBZ to the MBR affected the sludge microbial activities, as a slight inhibition (about 20%) of the exogenous respiration rate was observed. The increased membrane fouling could be related to the change in biomass characteristics and supernatant quality after addition of CBZ to the MBR. 
This study also suggests that 10-100 kDa protein-like SMPs might accumulate inside the biocake formed on the membrane surface during MBR operation and play an important role in the TMP jump phenomenon. Copyright © 2014 Elsevier Ltd. All rights reserved.
Chaotic particle swarm optimization with mutation for classification.
Assarzadeh, Zahra; Naghsh-Nilchi, Ahmad Reza
2015-01-01
In this paper, a chaotic particle swarm optimization with mutation-based classifier particle swarm optimization is proposed to classify patterns of different classes in the feature space. The introduced mutation operators and chaotic sequences allow us to overcome the problem of early convergence to a local minimum associated with particle swarm optimization algorithms. That is, the mutation operator sharpens the convergence and tunes the best possible solution. Furthermore, to remove irrelevant data and reduce the dimensionality of medical datasets, a feature selection approach using a binary version of the proposed particle swarm optimization is introduced. In order to demonstrate the effectiveness of the proposed classifier, mutation-based classifier particle swarm optimization, it is tested on three classification data sets, namely Wisconsin diagnostic breast cancer, Wisconsin breast cancer and heart-statlog, with different feature vector dimensions. The proposed algorithm is compared with different classifier algorithms including k-nearest neighbor, as a conventional classifier, and particle swarm-classifier, genetic algorithm, and Imperialist competitive algorithm-classifier, as more sophisticated ones. The performance of each classifier was evaluated by calculating the accuracy, sensitivity, specificity and Matthews's correlation coefficient. The experimental results show that the mutation-based classifier particle swarm optimization unequivocally performs better than all the compared algorithms.
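To make the roles of the chaotic sequence and the mutation operator concrete, here is a minimal sketch of a particle swarm optimizer in which a logistic map replaces the usual uniform random coefficients and a Gaussian mutation occasionally perturbs a particle. It minimizes a generic objective rather than training a classifier, and the choice of map, the inertia and acceleration constants, the mutation rate and the function names are all assumptions, not the authors' settings.

```python
import random


def logistic_map(x):
    """One step of the logistic map at r = 4, a common chaotic sequence generator."""
    return 4.0 * x * (1.0 - x)


def chaotic_pso(objective, dim, n_particles=20, iters=100, seed=0):
    rng = random.Random(seed)
    chaos = 0.7  # assumed starting state, away from the map's fixed points
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]               # personal bests
    best_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=best_val.__getitem__)
    g_pos, g_val = best_pos[g][:], best_val[g]   # global best
    history = [g_val]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                # Chaotic values stand in for the uniform random coefficients.
                chaos = logistic_map(chaos)
                r1 = chaos
                chaos = logistic_map(chaos)
                r2 = chaos
                vel[i][d] = (0.7 * vel[i][d]
                             + 1.5 * r1 * (best_pos[i][d] - pos[i][d])
                             + 1.5 * r2 * (g_pos[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            # Mutation: occasionally perturb one coordinate to escape stagnation.
            if rng.random() < 0.1:
                d = rng.randrange(dim)
                pos[i][d] += rng.gauss(0.0, 0.5)
            val = objective(pos[i])
            if val < best_val[i]:
                best_val[i], best_pos[i] = val, pos[i][:]
                if val < g_val:
                    g_val, g_pos = val, pos[i][:]
        history.append(g_val)
    return g_pos, g_val, history


def sphere(x):
    """Simple convex test objective with its minimum at the origin."""
    return sum(v * v for v in x)


solution, value, history = chaotic_pso(sphere, dim=3)
print(value)
```

Because the global best is only replaced by a strictly better value, the recorded history never increases; in a classifier setting the objective would instead be a misclassification cost over the training patterns.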
Badaut, Cyril; Bertin, Gwladys; Rustico, Tatiana; Fievet, Nadine; Massougbodji, Achille; Gaye, Alioune; Deloron, Philippe
2010-01-01
Background Placental malaria is a disease linked to the sequestration of Plasmodium falciparum infected red blood cells (IRBC) in the placenta, leading to reduced materno-fetal exchanges and to local inflammation. One of the virulence factors of P. falciparum involved in cytoadherence to chondroitin sulfate A, its placental receptor, is the adhesive protein VAR2CSA. Its localisation on the surface of IRBC makes it accessible to the immune system. VAR2CSA contains six DBL domains. The DBL6ε domain is the most variable. High variability constitutes a means for the parasite to evade the host immune response. The DBL6ε domain could constitute a very attractive basis for a vaccine candidate but its reported variability necessitates, for antigenic characterisations, identifying and classifying commonalities across isolates. Methodology/Principal Findings Local alignment analysis of the DBL6ε domain revealed that it is not as variable as previously described. Variability is concentrated in seven regions present on the surface of the DBL6ε domain. The main goal of our work is to classify and group variable sequences in a way that will simplify further research to determine dominant epitopes. Firstly, variable sequences were grouped according to their average percent pairwise identity (APPI). Groups comprising many variable sequences sharing low variability were found. Secondly, ELISA experiments following the IgG recognition of a recombinant DBL6ε domain, and of peptides mimicking its seven variable blocks, made it possible to determine an APPI cut-off and to isolate groups represented by a single consensus sequence. Conclusions/Significance A new sequence approach is used to compare variable regions in sequences that have extensive segmental gene relationship. Using this approach, the VAR2CSA DBL6 domain is composed of 7 variable blocks with limited polymorphism. Each variable block is composed of a limited number of consensus types. 
Based on peptide based ELISA, variable blocks with 85% or greater sequence identity are expected to be recognized equally well by antibody and can be considered the same consensus type. Therefore, the analysis of the antibody response against the classified small number of sequences should be helpful to determine epitopes. PMID:20585655
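The average percent pairwise identity (APPI) used for grouping can be sketched as the mean identity over all pairs of aligned, equal-length sequences. The implementation below is an assumed, simplified reading of that measure (no gap handling), and the example peptides are made up for illustration.

```python
from itertools import combinations


def percent_identity(a, b):
    """Percentage of positions at which two aligned sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def appi(seqs):
    """Average percent pairwise identity over all unordered pairs."""
    pairs = list(combinations(seqs, 2))
    return sum(percent_identity(a, b) for a, b in pairs) / len(pairs)


# Hypothetical variable-block peptides (not from the study).
block = ["KLTE", "KLTD", "KLSD"]
score = appi(block)
same_consensus_type = score >= 85.0  # 85% cut-off taken from the abstract
print(round(score, 2), same_consensus_type)
```

With the three toy peptides the pairwise identities are 75%, 50% and 75%, so the group falls below the 85% cut-off and would not be treated as a single consensus type.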
Ning, Yi; Li, Yan-Ling; Zhou, Guo-Ying; Yang, Lu-Cun; Xu, Wen-Hua
2016-04-01
High-throughput sequencing technology, also called next-generation sequencing (NGS), can sequence hundreds of thousands of sequences in different samples at the same time. In the present study, culture-independent high-throughput sequencing was applied to sequence the fungal internal transcribed spacer 1 (ITS1) region of the fungal metagenomic DNA in the root of Sinopodophyllum hexandrum. After quality control, 22 565 reads remained. Cluster analysis based on 97% sequence similarity yielded 517 OTUs for the three samples (LD1, LD2 and LD3). The fungi identified from the OTU reads with the RDP classifier at a classification threshold of 0.8 were assigned to 13 classes, 35 orders, 44 families and 55 genera. Among these genera, Tetracladium was the dominant genus in all samples (35.49%, 68.55% and 12.96%). The Shannon diversity indices and the Simpson indices of the endophytic fungi in the samples ranged from 1.75 to 2.92 and from 0.11 to 0.32, respectively. This is the first application of high-throughput sequencing technology to analyze the community composition and diversity of endophytic fungi in the medicinal plant, and the results showed high diversity and high community composition complexity of endophytic fungi in the root of S. hexandrum. The results also show that high-throughput sequencing technology has a great advantage for analyzing the community composition and diversity of endophytes in plants. Copyright© by the Chinese Pharmaceutical Association.
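The two diversity indices reported above can be computed directly from per-OTU read counts. The sketch below assumes the natural-log form of the Shannon index and the dominance form of the Simpson index (sum of squared proportions), which is consistent with the small values quoted but is not stated in the abstract; the counts are hypothetical.

```python
import math


def shannon_index(counts):
    """Shannon diversity H' = -sum(p * ln p) over OTUs with non-zero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def simpson_index(counts):
    """Simpson dominance D = sum(p^2); lower values mean higher diversity."""
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)


# Hypothetical read counts for four equally abundant OTUs.
otu_counts = [250, 250, 250, 250]
h = shannon_index(otu_counts)
d = simpson_index(otu_counts)
print(round(h, 3), round(d, 3))
```

For a perfectly even community of N OTUs the Shannon index equals ln N and the Simpson dominance equals 1/N, which gives a quick sanity check on any implementation.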
He, Xiao-Song; Yu, Jing; Xi, Bei-Dou; Jiang, Yong-Hai; Zhang, Jin-Bao; Li, Dan; Pan, Hong-Wei; Liu, Hong-Liang
2012-09-01
In order to investigate the removal characteristics of dissolved organic matter in landfill leachate, leachates were sampled during the treatment process (i.e., adjusting tank, anaerobic zone, oxidation ditch and MBR processing). Dissolved organic matter was extracted and its content and structure were characterized by fluorescence excitation-emission matrix spectra, UV-Vis spectra and FTIR spectra. The results showed that 377.6 mg L⁻¹ of dissolved organic carbon (DOC) was removed during the whole treatment process, and the total removal rate was up to 78.34%. Of the DOC in the adjusting tank, 25.56% was removed in the anaerobic zone; 41.58% of the DOC in the anaerobic effluent was removed in the oxidation ditch, while 50.19% of the DOC in the oxidation ditch effluent was removed in the MBR process. The anaerobic process increased the content of unsaturated compounds and polysaccharides in leachate DOM, which improved the biochemical characteristics of the leachate. The unsaturated compounds and polysaccharides were removed effectively in the oxidation ditch. Protein-like and humic-like fluorescence peaks were observed in the adjusting tank and anaerobic zone, while only humic-like fluorescence peaks were present in the oxidation ditch and MBR processing. Protein-like and fulvic-like substances were biodegraded in the adjusting tank and anaerobic zone, while humic-like materials were removed in the MBR process.
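The stage-wise removal percentages can be checked against the reported overall figure. The short calculation below assumes that each percentage refers to the DOC entering that stage, so the remaining fractions multiply; the implied influent concentration is a derived value under that assumption, not a number given in the abstract.

```python
# Fraction of incoming DOC removed at each stage (values from the abstract).
stage_removal = {
    "anaerobic zone": 0.2556,
    "oxidation ditch": 0.4158,
    "MBR": 0.5019,
}

remaining = 1.0
for stage, fraction in stage_removal.items():
    remaining *= 1.0 - fraction  # DOC left after this stage, relative to influent

total_removal = 1.0 - remaining
# Influent DOC implied by the 377.6 mg/L removed overall (derived, assumed).
implied_influent = 377.6 / total_removal

print(f"total removal: {100 * total_removal:.2f}%")
print(f"implied influent DOC: {implied_influent:.1f} mg/L")
```

Compounding the three stages reproduces the reported total removal of 78.34%, which supports reading the percentages as sequential rather than as shares of the original DOC.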
Self-consistent-field calculations of proteinlike incorporations in polyelectrolyte complex micelles
NASA Astrophysics Data System (ADS)
Lindhoud, Saskia; Stuart, Martien A. Cohen; Norde, Willem; Leermakers, Frans A. M.
2009-11-01
Self-consistent field theory is applied to model the structure and stability of polyelectrolyte complex micelles with incorporated protein (molten globule) molecules in the core. The electrostatic interactions that drive the micelle formation are mimicked by nearest-neighbor interactions using Flory-Huggins χ parameters. The strong qualitative comparison with experimental data proves that the Flory-Huggins approach is reasonable. The free energy of insertion of a proteinlike molecule into the micelle is nonmonotonic: there is (i) a small repulsion when the protein is inside the corona; the height of the insertion barrier is determined by the local osmotic pressure and the elastic deformation of the core, (ii) a local minimum when the protein molecule is at the core-corona interface; the depth (a few kBT) is related to the interfacial tension at the core-corona interface, and (iii) a steep repulsion (several kBT) when part of the protein molecule is dragged into the core. Hence, the protein molecules reside preferentially at the core-corona interface and the absorption as well as the release of the protein molecules has annealed rather than quenched characteristics. Upon an increase of the ionic strength it is possible to reach a critical micellization ionic (CMI) strength. With increasing ionic strength the aggregation numbers decrease strongly and only few proteins remain associated with the micelles near the CMI.
Li, Wen-Tao; Jin, Jing; Li, Qiang; Wu, Chen-Fei; Lu, Hai; Zhou, Qing; Li, Ai-Min
2016-04-15
Online monitoring of dissolved organic matter (DOM) is urgently needed for water treatment management. In this study, high performance size exclusion chromatography with multi-UV absorbance and multi-emission fluorescence scans was applied to spectrally characterize samples from 16 drinking water sources across the Yangzi River and Huai River watersheds. The UV absorbance indices at 254 nm and 280 nm referred to the same DOM components and concentration, and the 280 nm UV light could excite both protein-like and humic-like fluorescence. Hence a novel UV fluorescence sensor was developed using only one UV280 light-emitting diode (LED) as the light source. For all samples, enhanced coagulation was mainly effective for large molecular weight biopolymers, while anion exchange further substantially removed humic substances. During chlorination tests, UVA280 and UVA254 showed similar correlations with yields of disinfection byproducts (DBPs); the humic-like fluorescence obtained from LED sensors correlated well with both trihalomethane and haloacetic acid yields, while the correlation between protein-like fluorescence and trihalomethanes was relatively poor. Anion exchange achieved a greater reduction of DBP yields as well as of UV absorbance and fluorescence signals than enhanced coagulation. The results suggest that the LED UV fluorescence sensors are very promising for online monitoring of DOM and for predicting DBP formation potential during water treatment. Copyright © 2016 Elsevier Ltd. All rights reserved.
Concepts in receptor optimization: targeting the RGD peptide.
Chen, Wei; Chang, Chia-en; Gilson, Michael K
2006-04-12
Synthetic receptors have a wide range of potential applications, but it has been difficult to design low molecular weight receptors that bind ligands with high, "proteinlike" affinities. This study uses novel computational methods to understand why it is hard to design a high-affinity receptor and to explore the limits of affinity, with the bioactive peptide RGD as a model ligand. The M2 modeling method is found to yield excellent agreement with experiment for a known RGD receptor and then is used to analyze a series of receptors generated in silico with a de novo design algorithm. Forces driving binding are found to be systematically opposed by proportionate repulsions due to desolvation and entropy. In particular, strong correlations are found between Coulombic attractions and the electrostatic desolvation penalty and between the mean energy change on binding and the cost in configurational entropy. These correlations help explain why it is hard to achieve high affinity. The change in surface area upon binding is found to correlate poorly with affinity within this series. Measures of receptor efficiency are formulated that summarize how effectively a receptor uses surface area, total energy, and Coulombic energy to achieve affinity. Analysis of the computed efficiencies suggests that a low molecular weight receptor can achieve proteinlike affinity. It is also found that macrocyclization of a receptor can, unexpectedly, increase the entropy cost of binding because the macrocyclic structure further restricts ligand motion.
Lu, Zhuoyang; Reddy, M. V. V. V. Sekhar; Liu, Jianfang; Kalichava, Ana; Liu, Jiankang; Zhang, Lei; Chen, Fang; Wang, Yun; Holthauzen, Luis Marcelo F.; White, Mark A.; Seshadrinathan, Suchithra; Zhong, Xiaoying; Ren, Gang; Rudenko, Gabby
2016-01-01
Contactin-associated protein-like 2 (CNTNAP2) is a large multidomain neuronal adhesion molecule implicated in a number of neurological disorders, including epilepsy, schizophrenia, autism spectrum disorder, intellectual disability, and language delay. We reveal here by electron microscopy that the architecture of CNTNAP2 is composed of a large, medium, and small lobe that flex with respect to each other. Using epitope labeling and fragments, we assign the F58C, L1, and L2 domains to the large lobe, the FBG and L3 domains to the middle lobe, and the L4 domain to the small lobe of the CNTNAP2 molecular envelope. Our data reveal that CNTNAP2 has a very different architecture compared with neurexin 1α, a fellow member of the neurexin superfamily and a prototype, suggesting that CNTNAP2 uses a different strategy to integrate into the synaptic protein network. We show that the ectodomains of CNTNAP2 and contactin 2 (CNTN2) bind directly and specifically, with low nanomolar affinity. We show further that mutations in CNTNAP2 implicated in autism spectrum disorder are not segregated but are distributed over the whole ectodomain. The molecular shape and dimensions of CNTNAP2 place constraints on how CNTNAP2 integrates in the cleft of axo-glial and neuronal contact sites and how it functions as an organizing and adhesive molecule. PMID:27621318
Wang, Ying; Zhang, Manman; Fu, Jun; Li, Tingting; Wang, Jinggang; Fu, Yingyu
2016-10-01
The interaction between carbamazepine (CBZ) and dissolved organic matter (DOM) from three zones (the nearshore, the river channel, and the coastal areas) in the Yangtze Estuary was investigated using fluorescence quenching titration combined with excitation emission matrix spectra and parallel factor analysis (PARAFAC). The complexation between CBZ and DOM was demonstrated by the increase in hydrogen bonding and the disappearance of the C=O stretch obtained from the Fourier transform infrared spectroscopy analysis. The results indicated that two protein-like substances (components 2 and 3) and two humic-like substances (components 1 and 4) were identified in the DOM from the Yangtze Estuary. The fluorescence quenching curves of each component with the addition of CBZ and the Ryan and Weber model calculation results both demonstrated that the different components exhibited different complexation activities with CBZ. The protein-like components had a stronger affinity with CBZ than did the humic-like substances. On the other hand, the autochthonous tyrosine-like C2 played an important role in the complexation with DOM from the river channel and coastal areas, while C3, influenced by anthropogenic activities, showed an obvious effect in the nearshore area. DOM from the river channel had the highest binding capacity for CBZ, which may be ascribed to the relatively high content of phenolic groups in the DOM.
Li, Kun; Jiang, Chao; Wang, Jianxing; Wei, Yuansong
2016-01-01
A combination of membrane bioreactor (MBR) and nanofiltration (NF) was tested at pilot scale for treating textile wastewater from the wastewater treatment station of a textile mill in Wuqing District of Tianjin (China). The MBR-NF process showed a much better treatment efficiency in the removal of chemical oxygen demand, total organic carbon, color and turbidity in comparison with the conventional processes. The water recovery rate was enhanced to over 90% through the recycling of NF concentrate to the MBR, while the MBR-NF showed a stable permeate water quality that met the standards and could be directly discharged or further reused. The recycled NF concentrate caused an accumulation of refractory compounds in the MBR, which significantly influenced the treatment efficiency of the MBR. However, the sludge characteristics showed that the activated sludge activity was not obviously inhibited. The results of fluorescence spectra and molecular weight distribution indicated that those recalcitrant pollutants were mostly protein-like substances and a small amount of humic acid-like substances (650-6,000 Da), which contributed to membrane fouling of NF. Although the penetrated protein-like substances caused the residual color in NF permeate, the MBR-NF process was suitable for the advanced treatment and reclamation of textile wastewater under high water yield.
Guinoiseau, Thibault; Moreau, Alain; Hohnadel, Guillaume; Ngo-Giang-Huong, Nicole; Brulard, Celine; Vourc’h, Patrick; Goudeau, Alain; Gaudy-Graffin, Catherine
2017-01-01
Hepatitis C virus (HCV) evolves rapidly in a single host and circulates as a quasispecies, which is a complex mixture of genetically distinct but closely related viruses, namely variants. To identify intra-individual diversity and investigate the functional properties of the variants in vitro, it is necessary to define the quasispecies composition and isolate the HCV variants. This is possible using single genome amplification (SGA). This technique, based on serially diluted cDNA to amplify a single cDNA molecule (clonal amplicon), has already been used to determine individual HCV diversity. In these studies, positive PCR reactions from SGA were directly sequenced using Sanger technology. Non-clonal amplicons must be detected so that they can be excluded, which facilitates further functional analysis. Here, we compared Next Generation Sequencing (NGS) with De Novo assembly and Sanger sequencing for their ability to distinguish clonal and non-clonal amplicons after SGA on one plasma specimen. All amplicons (n = 42) classified as clonal by NGS were also classified as clonal by Sanger sequencing. No double peaks were seen on electropherograms for non-clonal amplicons with position-specific nucleotide variation below 15% by NGS. Altogether, NGS circumvented many of the difficulties encountered when using Sanger sequencing after SGA and is an appropriate tool to reliably select clonal amplicons for further functional studies. PMID:28362878
Multiple isoforms for the catalytic subunit of PKA in the basal fungal lineage Mucor circinelloides.
Fernández Núñez, Lucas; Ocampo, Josefina; Gottlieb, Alexandra M; Rossi, Silvia; Moreno, Silvia
2016-12-01
Protein kinase A (PKA) activity is involved in dimorphism of the basal fungal lineage Mucor. From the recently sequenced genome of Mucor circinelloides we could predict ten catalytic subunits of PKA. From sequence alignment and structural prediction we conclude that the catalytic core of the isoforms is conserved, and the difference between them resides in their amino termini. This high number of isoforms is maintained in the subdivision Mucoromycotina. Each paralogue, when compared to the ones from other fungi, is more homologous to one of its orthologs than to its paralogs. None of these fungal isoforms can be included in class I or II, the classes into which fungal protein kinases have been classified. mRNA levels for each isoform were measured during aerobic and anaerobic growth. The expression of each isoform is differential and associated with a particular growth stage. We reanalyzed the sequence of PKAC (GI 20218944), the only cloned sequence available until now for a catalytic subunit of M. circinelloides. PKAC cannot be classified as a PKA because of its difference in the conserved C-tail; it shares with PKB a conserved C2 domain in the N-terminus. No catalytic activity could be measured for this protein nor predicted bioinformatically. It can thus be classified as a pseudokinase. Its importance should not be underestimated, since it is expressed at the mRNA level in different stages of growth and its deletion is lethal. Copyright © 2016 British Mycological Society. Published by Elsevier Ltd. All rights reserved.
Merson, Samuel D; Ouwerkerk, Diane; Gulino, Lisa-Maree; Klieve, Athol; Bonde, Robert K; Burgess, Elizabeth A; Lanyon, Janet M
2014-03-01
The Florida manatee, Trichechus manatus latirostris, is a hindgut-fermenting herbivore. In winter, manatees migrate to warm water overwintering sites where they undergo dietary shifts and may suffer from cold-induced stress. Given these seasonally induced changes in diet, the present study aimed to examine variation in the hindgut bacterial communities of wild manatees overwintering at Crystal River, west Florida. Faeces were sampled from 36 manatees of known sex and body size in early winter when manatees were newly arrived and then in mid-winter and late winter when diet had probably changed and environmental stress may have increased. Concentrations of faecal cortisol metabolite, an indicator of a stress response, were measured by enzyme immunoassay. Using 454-pyrosequencing, 2027 bacterial operational taxonomic units were identified in manatee faeces following amplicon pyrosequencing of the 16S rRNA gene V3/V4 region. Classified sequences were assigned to eight previously described bacterial phyla; only 0.36% of sequences could not be classified to phylum level. Five core phyla were identified in all samples. The majority (96.8%) of sequences were classified as Firmicutes (77.3 ± 11.1% of total sequences) or Bacteroidetes (19.5 ± 10.6%). Alpha-diversity measures trended towards higher diversity of hindgut microbiota in manatees in mid-winter compared to early and late winter. Beta-diversity measures, analysed through PERMANOVA, also indicated significant differences in bacterial communities based on the season. © 2013 Federation of European Microbiological Societies. Published by John Wiley & Sons Ltd. All rights reserved.
Mining SNPs from EST sequences using filters and ensemble classifiers.
Wang, J; Zou, Q; Guo, M Z
2010-05-04
Abundant single nucleotide polymorphisms (SNPs) provide the most complete information for genome-wide association studies. However, due to the bottleneck of manual discovery of putative SNPs and the inaccessibility of the original sequencing reads, it is essential to develop a more efficient and accurate computational method for automated SNP detection. We propose a novel computational method to rapidly find true SNPs in publicly available EST (expressed sequence tag) databases; this method is implemented as SNPDigger. EST sequences are clustered and aligned. SNP candidates are then obtained according to a measure of redundant frequency. Several new informative biological features, such as the structural neighbor profiles and the physical position of the SNP, were extracted from EST sequences, and the effectiveness of these features was demonstrated. An ensemble classifier, which employs a carefully selected feature set, was included to handle the imbalanced training data. The sensitivity and specificity of our method both exceeded 80% for human genetic data in cross-validation. Our method enables detection of SNPs from the user's own EST dataset and can be used on species for which there is no genome data. Our tests showed that this method can effectively guide SNP discovery in ESTs and will help reduce the cost of biological analyses.
Automatic seed selection for segmentation of liver cirrhosis in laparoscopic sequences
NASA Astrophysics Data System (ADS)
Sinha, Rahul; Marcinczak, Jan Marek; Grigat, Rolf-Rainer
2014-03-01
For computer aided diagnosis based on laparoscopic sequences, image segmentation is one of the basic steps which define the success of all further processing. However, many image segmentation algorithms require prior knowledge which is given by interaction with the clinician. We propose an automatic seed selection algorithm for segmentation of liver cirrhosis in laparoscopic sequences which assigns each pixel a probability of being cirrhotic liver tissue or background tissue. Our approach is based on a trained classifier using SIFT and RGB features with PCA. Due to the unique illumination conditions in laparoscopic sequences of the liver, a very low dimensional feature space can be used for classification via logistic regression. The methodology is evaluated on 718 cirrhotic liver and background patches that are taken from laparoscopic sequences of 7 patients. Using a linear classifier we achieve a precision of 91% in a leave-one-patient-out cross-validation. Furthermore, we demonstrate that with logistic probability estimates, seeds with high certainty of being cirrhotic liver tissue can be obtained. For example, our precision of liver seeds increases to 98.5% if only seeds with more than 95% probability of being liver are used. Finally, these automatically selected seeds can be used as priors in Graph Cuts which is demonstrated in this paper.
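The thresholding idea in the abstract above (keep only seeds whose logistic probability of being liver exceeds 95%) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the two-dimensional feature vectors, the weights and the bias are made-up values standing in for a trained logistic regression on PCA-reduced features.

```python
import math

def logistic(z):
    """Standard logistic function mapping a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def select_seeds(features, weights, bias, threshold=0.95):
    """Return indices of patches whose estimated liver probability exceeds threshold."""
    seeds = []
    for i, x in enumerate(features):
        z = bias + sum(w * v for w, v in zip(weights, x))
        if logistic(z) > threshold:
            seeds.append(i)
    return seeds

# Hypothetical low-dimensional features and hypothetical model parameters.
patches = [(2.0, 1.0), (0.1, -0.2), (-1.5, 0.3), (3.0, 2.0)]
weights = (1.5, 1.0)
bias = -0.5

print(select_seeds(patches, weights, bias))  # indices of high-certainty seeds
```

Raising the threshold trades the number of seeds for certainty, which is the effect the abstract reports when precision rises for seeds above 95% probability.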
Discovery of a novel iflavirus sequence in the eastern paralysis tick Ixodes holocyclus.
O'Brien, Caitlin A; Hall-Mendelin, Sonja; Hobson-Peters, Jody; Deliyannis, Georgia; Allen, Andy; Lew-Tabor, Ala; Rodriguez-Valle, Manuel; Barker, Dayana; Barker, Stephen C; Hall, Roy A
2018-05-11
Ixodes holocyclus, the eastern paralysis tick, is a significant parasite in Australia in terms of animal and human health. However, very little is known about its virome. In this study, next-generation sequencing of I. holocyclus salivary glands yielded a full-length genome sequence which phylogenetically groups with viruses classified in the Iflaviridae family and shares 45% amino acid similarity with its closest relative Bole hyalomma asiaticum virus 1. The sequence of this virus, provisionally named Ixodes holocyclus iflavirus (IhIV) has been identified in tick populations from northern New South Wales and Queensland, Australia and represents the first virus sequence reported from I. holocyclus.
Johansen, Morten Bo; Izarzugaza, Jose M. G.; Brunak, Søren; Petersen, Thomas Nordahl; Gupta, Ramneek
2013-01-01
We have developed a sequence conservation-based artificial neural network predictor called NetDiseaseSNP which classifies nsSNPs as disease-causing or neutral. Our method uses the excellent alignment generation algorithm of SIFT to identify related sequences and a combination of 31 features assessing sequence conservation and the predicted surface accessibility to produce a single score which can be used to rank nsSNPs based on their potential to cause disease. NetDiseaseSNP successfully classifies disease-causing and neutral mutations. In addition, we show that NetDiseaseSNP discriminates cancer driver and passenger mutations satisfactorily. Our method outperforms other state-of-the-art methods on several disease/neutral datasets as well as on cancer driver/passenger mutation datasets and can thus be used to pinpoint and prioritize plausible disease candidates among nsSNPs for further investigation. NetDiseaseSNP is publicly available as an online tool as well as a web service: http://www.cbs.dtu.dk/services/NetDiseaseSNP PMID:23935863
Ordeig, Laura; Garcia-Cehic, Damir; Gregori, Josep; Soria, Maria Eugenia; Nieto-Aponte, Leonardo; Perales, Celia; Llorens, Meritxell; Chen, Qian; Riveiro-Barciela, Mar; Buti, Maria; Esteban, Rafael; Esteban, Juan Ignacio; Rodriguez-Frias, Francisco; Quer, Josep
2018-01-01
Hepatitis C virus (HCV) is a highly divergent virus currently classified into seven major genotypes and 86 subtypes (ICTV, June 2017), which can have differing responses to therapy. Accurate genotyping/subtyping using high-resolution HCV subtyping enables confident subtype identification, identifies mixed infections and allows detection of new subtypes. During routine genotyping/subtyping, one sample from an Equatorial Guinea patient could not be classified into any of the subtypes. The complete genomic sequence was compared to reference sequences by phylogenetic and sliding window analysis. Resistance-associated substitutions (RASs) were assessed by deep sequencing. The unclassified HCV genome did not belong to any of the existing genotype 1 (G1) subtypes. Sliding window analysis along the complete genome ruled out recombination phenomena suggesting that it belongs to a new HCV G1 subtype. Two NS5A RASs (L31V+Y93H) were found to be naturally combined in the genome which could limit treatment possibilities in patients infected with this subtype.
Palmer, Lance E; Dejori, Mathaeus; Bolanos, Randall; Fasulo, Daniel
2010-01-15
With the rapid expansion of DNA sequencing databases, it is now feasible to identify relevant information from prior sequencing projects and completed genomes and apply it to de novo sequencing of new organisms. As an example, this paper demonstrates how such extra information can be used to improve de novo assemblies by augmenting the overlapping step. Finding all pairs of overlapping reads is a key task in many genome assemblers, and to this end, highly efficient algorithms have been developed to find alignments in large collections of sequences. It is well known that due to repeated sequences, many aligned pairs of reads nevertheless do not overlap. But no overlapping algorithm to date takes a rigorous approach to separating aligned but non-overlapping read pairs from true overlaps. We present an approach that extends the Minimus assembler by a data driven step to classify overlaps as true or false prior to contig construction. We trained several different classification models within the Weka framework using various statistics derived from overlaps of reads available from prior sequencing projects. These statistics included percent mismatch and k-mer frequencies within the overlaps as well as a comparative genomics score derived from mapping reads to multiple reference genomes. We show that in real whole-genome sequencing data from the E. coli and S. aureus genomes, by providing a curated set of overlaps to the contigging phase of the assembler, we nearly doubled the median contig length (N50) without sacrificing coverage of the genome or increasing the number of mis-assemblies. Machine learning methods that use comparative and non-comparative features to classify overlaps as true or false can be used to improve the quality of a sequence assembly.
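For readers unfamiliar with the N50 statistic mentioned in the abstract above, the sketch below computes it under its usual definition: the contig length at which the cumulative sum of lengths, taken from longest to shortest, first reaches half of the total assembled bases. The contig lengths are invented for illustration and are not data from the paper.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    raise ValueError("n50 is undefined for an empty list of contigs")

# Hypothetical assemblies with the same total size (300 bases) before and
# after some contigs are joined by removing false overlaps.
before = [100, 80, 50, 30, 20, 10, 10]
after = [180, 80, 30, 10]

print(n50(before), n50(after))
```

Because both toy assemblies contain the same number of bases, the change in N50 reflects contiguity only, which is the sense in which the abstract reports improvement without loss of coverage.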
Identification and correction of systematic error in high-throughput sequence data
2011-01-01
Background: A feature common to all DNA sequencing technologies is the presence of base-call errors in the sequenced reads. The implications of such errors are application specific, ranging from minor informatics nuisances to major problems affecting biological inferences. Recently developed "next-gen" sequencing technologies have greatly reduced the cost of sequencing, but have been shown to be more error prone than previous technologies. Both position specific (depending on the location in the read) and sequence specific (depending on the sequence in the read) errors have been identified in Illumina and Life Technology sequencing platforms. We describe a new type of systematic error that manifests as statistically unlikely accumulations of errors at specific genome (or transcriptome) locations. Results: We characterize and describe systematic errors using overlapping paired reads from high-coverage data. We show that such errors occur in approximately 1 in 1000 base pairs, and that they are highly replicable across experiments. We identify motifs that are frequent at systematic error sites, and describe a classifier that distinguishes heterozygous sites from systematic error. Our classifier is designed to accommodate data from experiments in which the allele frequencies at heterozygous sites are not necessarily 0.5 (such as in the case of RNA-Seq), and can be used with single-end datasets. Conclusions: Systematic errors can easily be mistaken for heterozygous sites in individuals, or for SNPs in population analyses. Systematic errors are particularly problematic in low coverage experiments, or in estimates of allele-specific expression from RNA-Seq data. Our characterization of systematic error has allowed us to develop a program, called SysCall, for identifying and correcting such errors. We conclude that correction of systematic errors is important to consider in the design and interpretation of high-throughput sequencing experiments. PMID:22099972
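One simple way to make "statistically unlikely accumulations of errors" concrete is a binomial tail probability: given a per-base error rate, how likely is it to see at least k mismatching reads among n reads covering one position? The sketch below is an illustration under that assumption only; it is not the SysCall classifier, and the coverage, mismatch count and error rate are hypothetical.

```python
from math import comb

def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p): chance of k or more errors in n reads."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Hypothetical site: 100 overlapping reads, 10 mismatches, 1% random error rate.
p_value = binomial_tail(100, 10, 0.01)
print(f"P(at least 10 errors by chance) = {p_value:.2e}")
```

A very small tail probability says that independent random errors are an implausible explanation for the site; deciding between a heterozygous site and a systematic error then needs additional evidence, which is the role of the classifier described in the abstract.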
Khodakov, Dmitriy; Wang, Chunyan; Zhang, David Yu
2016-10-01
Nucleic acid sequence variations have been implicated in many diseases, and reliable detection and quantitation of DNA/RNA biomarkers can inform effective therapeutic action, enabling precision medicine. Nucleic acid analysis technologies being translated into the clinic can broadly be classified into hybridization, PCR, and sequencing, as well as their combinations. Here we review the molecular mechanisms of popular commercial assays, and their progress in translation into in vitro diagnostics. Copyright © 2016 The Authors. Published by Elsevier B.V. All rights reserved.
Bulashevska, Alla; Stein, Martin; Jackson, David; Eils, Roland
2009-12-01
Accurate computational methods that can help to predict biological function of a protein from its sequence are of great interest to research biologists and pharmaceutical companies. One approach to inferring the function of proteins is to predict the interactions between proteins and other molecules. In this work, we propose a machine learning method that uses the primary sequence of a domain to predict its propensity for interaction with small molecules. By curating the Pfam database with respect to the small molecule binding ability of its component domains, we have constructed a dataset of small molecule binding and non-binding domains. This dataset was then used as a training set to learn a Bayesian classifier, which should distinguish members of each class. The domain sequences of both classes are modelled with Markov chains. In a jackknife test, our classification procedure achieved predictive accuracies of 77.2% and 66.7% for the binding and non-binding classes, respectively. We demonstrate the applicability of our classifier by using it to identify previously unknown small molecule binding domains. Our predictions are available as supplementary material and can provide very useful information to drug discovery specialists. Given the ubiquitous and essential role small molecules play in biological processes, our method is important for identifying pharmaceutically relevant components of complete proteomes. The software is available from the author upon request.
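The combination described above, a Bayesian classifier whose class-conditional sequence models are Markov chains, can be sketched in a few lines. The version below assumes first-order chains over the 20 standard amino acids, add-one smoothing and equal class priors; the order of the chains, the smoothing and the toy training sequences are all assumptions made for illustration, not details taken from the paper.

```python
import math
from collections import defaultdict

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids (assumed alphabet)

def train_markov(seqs, pseudo=1.0):
    """Estimate smoothed first-order transition log-probabilities from sequences."""
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for a, b in zip(s, s[1:]):
            counts[a][b] += 1
    model = {}
    for a in ALPHABET:
        total = sum(counts[a][b] for b in ALPHABET) + pseudo * len(ALPHABET)
        model[a] = {b: math.log((counts[a][b] + pseudo) / total) for b in ALPHABET}
    return model

def log_likelihood(model, seq):
    """Sum of transition log-probabilities along the sequence."""
    return sum(model[a][b] for a, b in zip(seq, seq[1:]))

def classify(seq, binding_model, nonbinding_model, log_prior_ratio=0.0):
    """Bayes decision: pick the class with the larger posterior log-score."""
    score = (log_likelihood(binding_model, seq)
             - log_likelihood(nonbinding_model, seq)
             + log_prior_ratio)
    return "binding" if score > 0 else "non-binding"

# Toy training sets (hypothetical sequences, chosen only to differ in composition).
binding = train_markov(["GKSGKSGKS", "GKTGKSGKT"])
nonbinding = train_markov(["LLALLALLA", "ALLALLLAL"])

print(classify("GKSGKT", binding, nonbinding))
print(classify("LLALL", binding, nonbinding))
```

The `log_prior_ratio` argument is where unequal class priors would enter; sequences containing characters outside `ALPHABET` are not handled in this sketch.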
Molecular biological researches of Kuro-Koji molds, their classification and safety.
Yamada, Osamu; Takara, Ryo; Hamada, Ryoko; Hayashi, Risa; Tsukahara, Masatoshi; Mikami, Shigeaki
2011-09-01
To assess the position of Kuro-Koji molds in black Aspergillus, we performed sequence analysis of approximately 2500 nucleotides of partial gene fragments, such as histone 3, on a total of 57 Aspergillus strains, including Aspergillus kawachii NBRC 4308, 12 Kuro-Koji molds isolated from awamori breweries in Japan, Aspergillus niger ATCC 1015, and A. tubingensis ATCC 10550. Sequence results showed that all black Aspergillus strains could be classified into 3 types, type N which includes A. niger ATCC 1015, type T which includes A. tubingensis ATCC 10550, and type L which includes A. kawachii NBRC 4308. Phylogenetic analysis showed these three types belong to different clusters. All 12 Kuro-Koji molds isolated from awamori breweries were classified as type L, thus we concluded type L represents the industrial Kuro-Koji molds. We found all type L strains lack the An15g07920 gene which is required for ochratoxin A biosynthesis in black Aspergillus. This sequence is present in the genome of A. niger CBS 513.88 and has homology to the polyketide synthase fragment of A. ochraceus which is involved in ochratoxin A biosynthesis. Based on the industrial importance and the safety of Kuro-Koji molds, we propose to classify the type L strains as Aspergillus luchuensis, as initially reported by Dr. Inui. Copyright © 2011 The Society for Biotechnology, Japan. Published by Elsevier B.V. All rights reserved.
ERIC Educational Resources Information Center
Douglass, Claudia B.
The primary purpose of the reported study was to identify a possible interaction between the cognitive style of the students and the instructional sequence of the materials and their combined effect on achievement. The subjects were 627 biology students from six midwestern high schools. The students were ranked and classified as field-dependent…
USDA-ARS's Scientific Manuscript database
The first complete genome sequence of a strain of Newcastle disease virus from genotype XIV is reported here. Strain duck/Nigeria/NG-695/KG.LOM.11-16/2009 was isolated from an apparently healthy domestic duck from a live bird market in Kogi State, Nigeria, in 2009. This strain is classified as a m...
Xiao, Sa; Paldurai, Anandan; Nayak, Baibaswata; Mirande, Armando; Collins, Peter L.
2013-01-01
The complete genome sequence was determined for a highly virulent Newcastle disease virus strain from vaccinated chicken farms in Mexico during outbreaks in 2010. On the basis of phylogenetic analysis this strain was classified into genotype V in the class II cluster that was closely related to Mexican strains that appeared in 2004–2006. PMID:23409252
Ibrahim, Wisam; Abadeh, Mohammad Saniee
2017-05-21
Protein fold recognition is an important problem in bioinformatics for predicting the three-dimensional structure of a protein. One of the most challenging tasks in the protein fold recognition problem is the extraction of efficient features from the amino-acid sequences to obtain better classifiers. In this paper, we propose six descriptors to extract features from protein sequences. These descriptors are applied in the first stage of a three-stage framework, PCA-DELM-LDA, to extract feature vectors from the amino-acid sequences. Principal Component Analysis (PCA) has been implemented to reduce the number of extracted features. The extracted feature vectors have been used with the original features to improve the performance of the Deep Extreme Learning Machine (DELM) in the second stage. Four new features have been extracted from the second stage and used in the third stage by Linear Discriminant Analysis (LDA) to classify the instances into 27 folds. The proposed framework is implemented on the independent and combined feature sets in SCOP datasets. The experimental results show that the feature vectors extracted in the first stage could improve the performance of DELM in extracting new useful features in the second stage. Copyright © 2017 Elsevier Ltd. All rights reserved.
Chaotic Particle Swarm Optimization with Mutation for Classification
Assarzadeh, Zahra; Naghsh-Nilchi, Ahmad Reza
2015-01-01
In this paper, a chaotic particle swarm optimization with mutation, the mutation-based classifier particle swarm optimization, is proposed to classify patterns of different classes in the feature space. The introduced mutation operators and chaotic sequences allow us to overcome the problem of early convergence to a local minimum associated with particle swarm optimization algorithms. That is, the mutation operator sharpens the convergence and tunes the best possible solution. Furthermore, to remove the irrelevant data and reduce the dimensionality of medical datasets, a feature selection approach using a binary version of the proposed particle swarm optimization is introduced. In order to demonstrate the effectiveness of the proposed classifier, it is tested on three classification datasets, namely Wisconsin diagnostic breast cancer, Wisconsin breast cancer and heart-statlog, with different feature vector dimensions. The proposed algorithm is compared with different classifier algorithms including k-nearest neighbor, as a conventional classifier, and particle swarm-classifier, genetic algorithm, and imperialist competitive algorithm-classifier, as more sophisticated ones. The performance of each classifier was evaluated by calculating the accuracy, sensitivity, specificity and Matthews correlation coefficient. The experimental results show that the mutation-based classifier particle swarm optimization unequivocally performs better than all the compared algorithms. PMID:25709937
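The four evaluation measures named in the abstract above have standard definitions in terms of a binary confusion matrix, sketched below. The counts in the example are hypothetical and are not results from the paper; the function also assumes that both classes are present, so that the sensitivity and specificity denominators are non-zero.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity and Matthews correlation coefficient."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the coefficient is set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, sensitivity, specificity, mcc

# Hypothetical confusion matrix for a two-class medical dataset.
acc, sens, spec, mcc = binary_metrics(tp=90, fp=5, tn=95, fn=10)
print(f"accuracy={acc:.3f} sensitivity={sens:.3f} specificity={spec:.3f} MCC={mcc:.3f}")
```

The Matthews coefficient is useful alongside accuracy because it stays informative when the two classes are of very different sizes.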
Li, Chang-Lin; Li, Kai-Cheng; Wu, Dan; Chen, Yan; Luo, Hao; Zhao, Jing-Rong; Wang, Sa-Shuang; Sun, Ming-Ming; Lu, Ying-Jin; Zhong, Yan-Qing; Hu, Xu-Ye; Hou, Rui; Zhou, Bei-Bei; Bao, Lan; Xiao, Hua-Sheng; Zhang, Xu
2016-01-01
Sensory neurons are distinguished by distinct signaling networks and receptive characteristics. Thus, sensory neuron types can be defined by linking transcriptome-based neuron typing with the sensory phenotypes. Here we classify somatosensory neurons of the mouse dorsal root ganglion (DRG) by high-coverage single-cell RNA-sequencing (10 950 ± 1 218 genes per neuron) and neuron size-based hierarchical clustering. Moreover, single DRG neurons responding to cutaneous stimuli are recorded using an in vivo whole-cell patch clamp technique and classified by neuron-type genetic markers. Small diameter DRG neurons are classified into one type of low-threshold mechanoreceptor and five types of mechanoheat nociceptors (MHNs). Each of the MHN types is further categorized into two subtypes. Large DRG neurons are categorized into four types, including neurexophilin 1-expressing MHNs and mechanical nociceptors (MNs) expressing BAI1-associated protein 2-like 1 (Baiap2l1). Mechanoreceptors expressing trafficking protein particle complex 3-like and Baiap2l1-marked MNs are subdivided into two subtypes each. These results provide a new system for cataloging somatosensory neurons and their transcriptome databases. PMID:26691752
Ong, Lee-Ling S; Xinghua Zhang; Kundukad, Binu; Dauwels, Justin; Doyle, Patrick; Asada, H Harry
2016-08-01
An approach to automatically detect bacterial division with temporal models is presented. To understand how bacteria migrate and proliferate to form complex multicellular behaviours such as biofilms, it is desirable to track individual bacteria and detect cell division events. Unlike eukaryotic cells, prokaryotic cells such as bacteria lack distinctive features, making bacterial division difficult to detect in a single image frame. Furthermore, bacteria may detach, migrate close to other bacteria and orient themselves at an angle to the horizontal plane. Our system trains a hidden conditional random field (HCRF) model from tracked and aligned bacterial division sequences. The HCRF model classifies a set of image frames as division or otherwise. The performance of our HCRF model is compared with a Hidden Markov Model (HMM). The results show that an HCRF classifier outperforms an HMM classifier. From 2D bright field microscopy data, it is a challenge to separate individual bacteria and associate observations to tracks. Automatic detection of sequences with bacterial division will improve tracking accuracy.
NASA Technical Reports Server (NTRS)
Paradella, W. R. (Principal Investigator); Vitorello, I.; Monteiro, M. D.
1984-01-01
Enhancement techniques and thematic classifications were applied to the metasediments of Bambui Super Group (Upper Proterozoic) in the Region of Serra do Ramalho, SW of the state of Bahia. Linear contrast stretch, band-ratios with contrast stretch, and color-composites allow lithological discriminations. The effects of human activities and of vegetation cover mask and limit, in several ways, the lithological discrimination with digital MSS data. Principal component images and color composites of linear contrast stretches of these products show lithological discrimination through tonal gradations. This set of products allows the delineation of several metasedimentary sequences to a level superior to reconnaissance mapping. Supervised (maximum likelihood classifier) and nonsupervised (K-Means classifier) classifications of the limestone sequence, host to fluorite mineralization, show satisfactory results.
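A linear contrast stretch, the first enhancement named in the abstract above, rescales the observed range of digital numbers in a band to the full display range. The sketch below shows the basic min-max form on a flat list of pixel values; the pixel values are hypothetical, and practical implementations often clip at chosen percentiles rather than at the absolute minimum and maximum.

```python
def linear_stretch(pixels, out_min=0, out_max=255):
    """Map the observed [min, max] range of pixel values linearly onto [out_min, out_max]."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        # A constant band carries no contrast to stretch.
        return [out_min for _ in pixels]
    scale = (out_max - out_min) / (hi - lo)
    return [round(out_min + (p - lo) * scale) for p in pixels]

# Hypothetical low-contrast band occupying only digital numbers 20 to 60.
band = [20, 30, 45, 60]
print(linear_stretch(band))
```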
Yoo, Ran Hee; Lee, Seung-Won; Lim, Seungmo; Zhao, Fumei; Igori, Davaajargal; Baek, Dasom; Hong, Jin-Sung; Lee, Su-Heon; Moon, Jae Sun
2017-12-01
Two novel viruses, isolated in Bonghwa, Republic of Korea, from an Ixeridium dentatum plant with yellowing mottle symptoms, have been provisionally named Ixeridium yellow mottle-associated virus 1 (IxYMaV-1) and Ixeridium yellow mottle-associated virus 2 (IxYMaV-2). IxYMaV-1 has a genome of 6,017 nucleotides sharing a 56.4% sequence identity with that of cucurbit aphid-borne yellows virus (genus Polerovirus). The IxYMaV-2 genome of 4,196 nucleotides has a sequence identity of less than 48.3% with other species classified within the genus Umbravirus. Genome properties and phylogenetic analysis suggested that IxYMaV-1 and -2 are representative isolates of new species classifiable within the genera Polerovirus and Umbravirus, respectively.
Walia, Rasna R; Caragea, Cornelia; Lewis, Benjamin A; Towfic, Fadi; Terribilini, Michael; El-Manzalawy, Yasser; Dobbs, Drena; Honavar, Vasant
2012-05-10
RNA molecules play diverse functional and structural roles in cells. They function as messengers for transferring genetic information from DNA to proteins, as the primary genetic material in many viruses, as catalysts (ribozymes) important for protein synthesis and RNA processing, and as essential and ubiquitous regulators of gene expression in living organisms. Many of these functions depend on precisely orchestrated interactions between RNA molecules and specific proteins in cells. Understanding the molecular mechanisms by which proteins recognize and bind RNA is essential for comprehending the functional implications of these interactions, but the recognition 'code' that mediates interactions between proteins and RNA is not yet understood. Success in deciphering this code would dramatically impact the development of new therapeutic strategies for intervening in devastating diseases such as AIDS and cancer. Because of the high cost of experimental determination of protein-RNA interfaces, there is an increasing reliance on statistical machine learning methods for training predictors of RNA-binding residues in proteins. However, because of differences in the choice of datasets, performance measures, and data representations used, it has been difficult to obtain an accurate assessment of the current state of the art in protein-RNA interface prediction. We provide a review of published approaches for predicting RNA-binding residues in proteins and a systematic comparison and critical assessment of protein-RNA interface residue predictors trained using these approaches on three carefully curated non-redundant datasets. We directly compare two widely used machine learning algorithms (Naïve Bayes (NB) and Support Vector Machine (SVM)) using three different data representations in which features are encoded using either sequence- or structure-based windows. 
Our results show that (i) Sequence-based classifiers that use a position-specific scoring matrix (PSSM)-based representation (PSSMSeq) outperform those that use an amino acid identity based representation (IDSeq) or a smoothed PSSM (SmoPSSMSeq); (ii) Structure-based classifiers that use smoothed PSSM representation (SmoPSSMStr) outperform those that use PSSM (PSSMStr) as well as sequence identity based representation (IDStr). PSSMSeq classifiers, when tested on an independent test set of 44 proteins, achieve performance that is comparable to that of three state-of-the-art structure-based predictors (including those that exploit geometric features) in terms of Matthews Correlation Coefficient (MCC), although the structure-based methods achieve substantially higher Specificity (albeit at the expense of Sensitivity) compared to sequence-based methods. We also find that the expected performance of the classifiers on a residue level can be markedly different from that on a protein level. Our experiments show that the classifiers trained on three different non-redundant protein-RNA interface datasets achieve comparable cross-validation performance. However, we find that the results are significantly affected by differences in the distance threshold used to define interface residues. Our results demonstrate that protein-RNA interface residue predictors that use a PSSM-based encoding of sequence windows outperform classifiers that use other encodings of sequence windows. While structure-based methods that exploit geometric features can yield significant increases in the Specificity of protein-RNA interface residue predictions, such increases are offset by decreases in Sensitivity. 
These results underscore the importance of comparing alternative methods using rigorous statistical procedures, multiple performance measures, and datasets that are constructed based on several alternative definitions of interface residues and redundancy cutoffs as well as including evaluations on independent test sets into the comparisons.
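The performance measures named in the abstract above (Sensitivity, Specificity, and the Matthews Correlation Coefficient) can all be computed from a binary confusion matrix. The following is a minimal sketch using the standard textbook definitions; the function name and the example counts are hypothetical and are not taken from the study.

```python
import math


def confusion_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, mcc) for binary predictions.

    tp/fp/tn/fn are counts of true positives, false positives,
    true negatives and false negatives (e.g. per residue).
    """
    # Sensitivity: fraction of true interface residues that were found.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: fraction of non-interface residues correctly rejected.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # MCC balances all four cells; defined as 0 when the denominator vanishes.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, mcc


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(confusion_metrics(tp=50, fp=10, tn=30, fn=10))
```

Because the MCC uses all four cells of the confusion matrix, it can stay roughly constant while Sensitivity and Specificity trade off against each other, which is the pattern the abstract reports for structure-based versus sequence-based predictors.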
An information-based network approach for protein classification
Wan, Xiaogeng; Zhao, Xin; Yau, Stephen S. T.
2017-01-01
Protein classification is one of the critical problems in bioinformatics. Early studies used geometric distances and polygenetic-tree to classify proteins. These methods use binary trees to present protein classification. In this paper, we propose a new protein classification method, whereby theories of information and networks are used to classify the multivariate relationships of proteins. In this study, protein universe is modeled as an undirected network, where proteins are classified according to their connections. Our method is unsupervised, multivariate, and alignment-free. It can be applied to the classification of both protein sequences and structures. Nine examples are used to demonstrate the efficiency of our new method. PMID:28350835
Classifying next-generation sequencing data using a zero-inflated Poisson model.
Zhou, Yan; Wan, Xiang; Zhang, Baoxue; Tong, Tiejun
2018-04-15
With the development of high-throughput techniques, RNA-sequencing (RNA-seq) is becoming increasingly popular as an alternative for gene expression analysis, such as RNA profiling and classification. Identifying which type of disease a new patient has from RNA-seq data has been recognized as a vital problem in medical research. As RNA-seq data are discrete, statistical methods developed for classifying microarray data cannot be readily applied for RNA-seq data classification. Witten proposed a Poisson linear discriminant analysis (PLDA) to classify the RNA-seq data in 2011. Note, however, that the count datasets are frequently characterized by excess zeros in real RNA-seq or microRNA sequence data (e.g. when the sequencing depth is insufficient, or for small RNAs with a length of 18-30 nucleotides). Therefore, it is desirable to develop a new model to analyze RNA-seq data with an excess of zeros. In this paper, we propose a Zero-Inflated Poisson Logistic Discriminant Analysis (ZIPLDA) for RNA-seq data with an excess of zeros. The new method assumes that the data are from a mixture of two distributions: one is a point mass at zero, and the other follows a Poisson distribution. We then consider a logistic relation between the probability of observing zeros and the mean of the genes and the sequencing depth in the model. Simulation studies show that the proposed method performs better than, or at least as well as, the existing methods in a wide range of settings. Two real datasets, including a breast cancer RNA-seq dataset and a microRNA-seq dataset, are also analyzed, and they agree with the simulation results in that our proposed method outperforms the existing competitors. Availability: The software is available at http://www.math.hkbu.edu.hk/~tongt. Contact: xwan@comp.hkbu.edu.hk or tongt@hkbu.edu.hk. Supplementary data are available at Bioinformatics online.
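The mixture assumed in the abstract above (a point mass at zero combined with a Poisson distribution) corresponds to the standard zero-inflated Poisson probability mass function. The sketch below shows only that mixture; the names `zip_pmf`, `lam` and `pi0` are hypothetical, and `pi0` is treated here as a fixed number, whereas the paper relates the zero probability to the gene mean and sequencing depth through a logistic function whose exact form is not given in the abstract.

```python
import math


def zip_pmf(k, lam, pi0):
    """Zero-inflated Poisson probability of observing count k.

    lam : Poisson mean of the non-inflated component (lam > 0)
    pi0 : probability of the point mass at zero (0 <= pi0 <= 1)
    """
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    if k == 0:
        # A zero can come from the point mass or from the Poisson part.
        return pi0 + (1.0 - pi0) * poisson
    # Non-zero counts can only come from the Poisson part.
    return (1.0 - pi0) * poisson


if __name__ == "__main__":
    # Illustrative parameters only: excess zeros inflate P(0) above exp(-lam).
    print(zip_pmf(0, lam=2.0, pi0=0.3), math.exp(-2.0))
```

Setting `pi0 = 0` recovers the ordinary Poisson model, which is one way to see why a Poisson-only classifier can fit poorly when the data contain many excess zeros.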
Cinelli, Mattia; Sun, Yuxin; Best, Katharine; Heather, James M.; Reich-Zeliger, Shlomit; Shifrut, Eric; Friedman, Nir; Shawe-Taylor, John; Chain, Benny
2017-01-01
Abstract Motivation: Somatic DNA recombination, the hallmark of vertebrate adaptive immunity, has the potential to generate a vast diversity of antigen receptor sequences. How this diversity captures antigen specificity remains incompletely understood. In this study we use high throughput sequencing to compare the global changes in T cell receptor β chain complementarity determining region 3 (CDR3β) sequences following immunization with ovalbumin administered with complete Freund’s adjuvant (CFA) or CFA alone. Results: The CDR3β sequences were deconstructed into short stretches of overlapping contiguous amino acids. The motifs were ranked according to a one-dimensional Bayesian classifier score comparing their frequency in the repertoires of the two immunization classes. The top ranking motifs were selected and used to create feature vectors which were used to train a support vector machine. The support vector machine achieved high classification scores in a leave-one-out validation test reaching >90% in some cases. Summary: The study describes a novel two-stage classification strategy combining a one-dimensional Bayesian classifier with a support vector machine. Using this approach we demonstrate that the frequency of a small number of linear motifs three amino acids in length can accurately identify a CD4 T cell response to ovalbumin against a background response to the complex mixture of antigens which characterize Complete Freund’s Adjuvant. Availability and implementation: The sequence data is available at www.ncbi.nlm.nih.gov/sra/?term=SRP075893. The Decombinator package is available at github.com/innate2adaptive/Decombinator. The R package e1071 is available at the CRAN repository https://cran.r-project.org/web/packages/e1071/index.html. Contact: b.chain@ucl.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online. PMID:28073756
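The first step described in the abstract above, deconstructing each CDR3β sequence into overlapping contiguous stretches of amino acids, can be sketched as a sliding window. This is an illustration only: the function name and the example sequences are hypothetical, the window length of three follows the motif length mentioned in the abstract, and the Bayesian ranking and support vector machine stages are not reproduced here.

```python
from collections import Counter


def motif_counts(sequences, k=3):
    """Count overlapping contiguous k-residue motifs across sequences."""
    counts = Counter()
    for seq in sequences:
        # Slide a window of width k one residue at a time.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


if __name__ == "__main__":
    # Made-up CDR3-like strings, not data from the study.
    repertoire = ["CASSL", "CASSF"]
    print(motif_counts(repertoire))
```

Motif frequencies of this kind, computed separately for each repertoire, are the sort of feature that could then be ranked and passed to a classifier.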
Genomic analyses of Clostridium perfringens isolates from five toxinotypes.
Hassan, Karl A; Elbourne, Liam D H; Tetu, Sasha G; Melville, Stephen B; Rood, Julian I; Paulsen, Ian T
2015-05-01
Clostridium perfringens can be isolated from a range of environments, including soil, marine and fresh water sediments, and the gastrointestinal tracts of animals and humans. Some C. perfringens strains have attractive industrial applications, e.g., in the degradation of waste products or the production of useful chemicals. However, C. perfringens has been most studied as the causative agent of a range of enteric and soft tissue infections of varying severities in humans and animals. Host preference and disease type in C. perfringens are intimately linked to the production of key extracellular toxins and on this basis toxigenic C. perfringens strains have been classified into five toxinotypes (A-E). To date, twelve genome sequences have been generated for a diverse collection of C. perfringens isolates, including strains associated with human and animal infections, a human commensal strain, and a strain with potential industrial utility. Most of the sequenced strains are classified as toxinotype A. However, genome sequences of representative strains from each of the other four toxinotypes have also been determined. Analysis of this collection of sequences has highlighted a lack of features differentiating toxinotype A strains from the other isolates, indicating that the primary defining characteristic of toxinotype A strains is their lack of key plasmid-encoded extracellular toxin genes associated with toxinotype B to E strains. The representative B-E strains sequenced to date each harbour many unique genes. Additional genome sequences are needed to determine if these genes are characteristic of their respective toxinotypes. Copyright © 2014. Published by Elsevier Masson SAS.
Liu, An-An; Li, Kang; Kanade, Takeo
2012-02-01
We propose a semi-Markov model trained in a max-margin learning framework for mitosis event segmentation in large-scale time-lapse phase contrast microscopy image sequences of stem cell populations. Our method consists of three steps. First, we apply a constrained optimization based microscopy image segmentation method that exploits phase contrast optics to extract candidate subsequences in the input image sequence that contain mitosis events. Then, we apply a max-margin hidden conditional random field (MM-HCRF) classifier learned from human-annotated mitotic and nonmitotic sequences to classify each candidate subsequence as a mitosis or not. Finally, a max-margin semi-Markov model (MM-SMM) trained on manually-segmented mitotic sequences is utilized to reinforce the mitosis classification results, and to further segment each mitosis into four predefined temporal stages. The proposed method outperforms the event-detection CRF model recently reported by Huh as well as several other competing methods in very challenging image sequences of multipolar-shaped C3H10T1/2 mesenchymal stem cells. For mitosis detection, an overall precision of 95.8% and a recall of 88.1% were achieved. For mitosis segmentation, the mean and standard deviation for the localization errors of the start and end points of all mitosis stages were well below 1 and 2 frames, respectively. In particular, an overall temporal location error of 0.73 ± 1.29 frames was achieved for locating daughter cell birth events.
Classification of a set of vectors using self-organizing map- and rule-based technique
NASA Astrophysics Data System (ADS)
Ae, Tadashi; Okaniwa, Kaishirou; Nosaka, Kenzaburou
2005-02-01
Various objects, such as pictures, music, and texts, exist in our environment. We form a view of these objects by looking, reading, or listening. This view is deeply connected to our behavior and is important for understanding it. We form a view of an object and, based on that view, decide the next action (data selection, etc.). Such a series of actions constitutes a sequence. Therefore, we propose a method that acquires a view as a vector from several words describing the view, and we apply the vector to sequence generation. We focus on sequences of data that a user selects from a multimedia database containing pictures, music, movies, etc. These data cannot be stereotyped because the view of them differs from user to user. Therefore, we represent the structure of the multimedia database as the vector representing the user's view together with the stereotyped vector, and acquire sequences containing the structure as elements. Such a vector can be classified by a SOM (Self-Organizing Map). The Hidden Markov Model (HMM) is a method to generate sequences. Therefore, we use an HMM in which each state corresponds to the representative vector of the user's view, and acquire sequences containing the change of the user's view. We call it the Vector-state Markov Model (VMM). We introduce rough set theory as a rule-based technique, which plays the role of classifying sets of data such as the sets of "Tour".
Preconception Carrier Screening by Genome Sequencing: Results from the Clinical Laboratory.
Punj, Sumit; Akkari, Yassmine; Huang, Jennifer; Yang, Fei; Creason, Allison; Pak, Christine; Potter, Amiee; Dorschner, Michael O; Nickerson, Deborah A; Robertson, Peggy D; Jarvik, Gail P; Amendola, Laura M; Schleit, Jennifer; Simpson, Dana Kostiner; Rope, Alan F; Reiss, Jacob; Kauffman, Tia; Gilmore, Marian J; Himes, Patricia; Wilfond, Benjamin; Goddard, Katrina A B; Richards, C Sue
2018-06-07
Advances in sequencing technologies permit the analysis of a larger selection of genes for preconception carrier screening. The study was designed as a sequential carrier screen using genome sequencing to analyze 728 gene-disorder pairs for carrier and medically actionable conditions in 131 women and their partners (n = 71) who were planning a pregnancy. We report here on the clinical laboratory results from this expanded carrier screening program. Variants were filtered and classified using the latest American College of Medical Genetics and Genomics (ACMG) guideline; only pathogenic and likely pathogenic variants were confirmed by orthologous methods before being reported. Novel missense variants were classified as variants of uncertain significance. We reported 304 variants in 202 participants. Twelve carrier couples (12/71 couples tested) were identified for common conditions; eight were carriers for hereditary hemochromatosis. Although both known and novel variants were reported, 48% of all reported variants were missense. For novel splice-site variants, RNA-splicing assays were performed to aid in classification. We reported ten copy-number variants and five variants in non-coding regions. One novel variant was reported in F8, associated with hemophilia A; prenatal testing showed that the male fetus harbored this variant and the neonate suffered a life-threatening hemorrhage which was anticipated and appropriately managed. Moreover, 3% of participants had variants that were medically actionable. Compared with targeted mutation screening, genome sequencing improves the sensitivity of detecting clinically significant variants. While certain novel variant interpretation remains challenging, the ACMG guidelines are useful to classify variants in a healthy population. Copyright © 2018 American Society of Human Genetics. Published by Elsevier Inc. All rights reserved.
Porter, Teresita M; Gibson, Joel F; Shokralla, Shadi; Baird, Donald J; Golding, G Brian; Hajibabaei, Mehrdad
2014-01-01
Current methods to identify unknown insect (class Insecta) cytochrome c oxidase (COI barcode) sequences often rely on thresholds of distances that can be difficult to define, sequence similarity cut-offs, or monophyly. Some of the most commonly used metagenomic classification methods do not provide a measure of confidence for the taxonomic assignments they provide. The aim of this study was to use a naïve Bayesian classifier (Wang et al. Applied and Environmental Microbiology, 2007; 73: 5261) to automate taxonomic assignments for large batches of insect COI sequences such as data obtained from high-throughput environmental sequencing. This method provides rank-flexible taxonomic assignments with an associated bootstrap support value, and it is faster than the blast-based methods commonly used in environmental sequence surveys. We have developed and rigorously tested the performance of three different training sets using leave-one-out cross-validation, two field data sets, and targeted testing of Lepidoptera, Diptera and Mantodea sequences obtained from the Barcode of Life Data system. We found that type I error rates, incorrect taxonomic assignments with a high bootstrap support, were already relatively low but could be lowered further by ensuring that all query taxa are actually present in the reference database. Choosing bootstrap support cut-offs according to query length and summarizing taxonomic assignments to more inclusive ranks can also help to reduce error while retaining the maximum number of assignments. Additionally, we highlight gaps in the taxonomic and geographic representation of insects in public sequence databases that will require further work by taxonomists to improve the quality of assignments generated using any method.
Liu, Bin; Wang, Xiaolong; Lin, Lei; Dong, Qiwen; Wang, Xuan
2008-12-01
Protein remote homology detection and fold recognition are central problems in bioinformatics. Currently, discriminative methods based on support vector machine (SVM) are the most effective and accurate methods for solving these problems. A key step to improve the performance of the SVM-based methods is to find a suitable representation of protein sequences. In this paper, a novel building block of proteins called Top-n-grams is presented, which contains the evolutionary information extracted from the protein sequence frequency profiles. The protein sequence frequency profiles are calculated from the multiple sequence alignments outputted by PSI-BLAST and converted into Top-n-grams. The protein sequences are transformed into fixed-dimension feature vectors by the occurrence times of each Top-n-gram. The training vectors are evaluated by SVM to train classifiers which are then used to classify the test protein sequences. We demonstrate that the prediction performance of remote homology detection and fold recognition can be improved by combining Top-n-grams and latent semantic analysis (LSA), which is an efficient feature extraction technique from natural language processing. When tested on superfamily and fold benchmarks, the method combining Top-n-grams and LSA gives significantly better results compared to related methods. The method based on Top-n-grams significantly outperforms the methods based on many other building blocks including N-grams, patterns, motifs and binary profiles. Therefore, Top-n-gram is a good building block of the protein sequences and can be widely used in many tasks of the computational biology, such as the sequence alignment, the prediction of domain boundary, the designation of knowledge-based potentials and the prediction of protein binding sites.
Distribution of hepatitis B virus subgenotype F2a in São Paulo, Brazil.
Alvarado-Mora, Mónica V; Botelho-Lima, Livia S; Santana, Rubia A; Sitnik, Roberta; Ferreira, Paulo Abrão; do Amaral Mello, Francisco; Mangueira, Cristovão P; Carrilho, Flair J; Rebello Pinho, João R
2013-10-21
HBV genotype F is primarily found in indigenous populations from South America and is classified in four subgenotypes (F1 to F4). Subgenotype F2a is the most common in Brazil among genotype F cases. The aim of this study was to characterize HBV genotype F2a circulating in 16 patients from São Paulo, Brazil. Samples were collected between 2006 and 2012 and sent to Hospital Israelita Albert Einstein. A fragment of 1306 bp partially comprising HBsAg and DNA polymerase coding regions was amplified and sequenced. Viral sequences were genotyped by phylogenetic analysis using reference sequences from GenBank (n=198), including 80 classified as subgenotype F2a. Bayesian Markov chain Monte Carlo simulation implemented in BEAST v.1.5.4 was applied to obtain the best possible estimates using the model of nucleotide substitutions GTR+G+I. Three groups of sequences of subgenotype F2a were identified: 1) 10 sequences from São Paulo state; 2) 3 sequences from Rio de Janeiro and one from São Paulo states; 3) 8 sequences from the West Amazon Basin. These results show for the first time the distribution of the F2a subgenotype in Brazil. Understanding the spread and dynamics of subgenotype F2a in Brazil requires the study of a larger number of samples from different regions, as it is found in almost all Brazilian populations studied so far. We cannot infer with certainty the origin of these different groups due to the lack of available sequences. Nevertheless, our data suggest that the common origin of these groups probably occurred a long time ago.
DNA unzipping phase diagram calculated via replica theory.
Roland, C Brian; Hatch, Kristi Adamson; Prentiss, Mara; Shakhnovich, Eugene I
2009-05-01
We show how single-molecule unzipping experiments can provide strong evidence that the zero-force melting transition of long molecules of natural dsDNA should be classified as a phase transition of the higher-order type (continuous). Toward this end, we study a statistical-mechanics model for the fluctuating structure of a long molecule of dsDNA, and compute the equilibrium phase diagram for the experiment in which the molecule is unzipped under applied force. We consider a perfect-matching dsDNA model, in which the loops are volume-excluding chains with arbitrary loop exponent c. We include stacking interactions, hydrogen bonds, and main-chain entropy. We include sequence heterogeneity at the level of random sequences; in particular, there is no correlation in the base-pairing (bp) energy from one sequence position to the next. We present heuristic arguments to demonstrate that the low-temperature macrostate does not exhibit degenerate ergodicity breaking. We use this claim to understand the results of our replica-theoretic calculation of the equilibrium properties of the system. As a function of temperature, we obtain the minimal force at which the molecule separates completely. This critical-force curve is a line in the temperature-force phase diagram that separates the region where the molecule exists primarily as a double helix from the region where the molecule exists as two separate strands. We compare our random-sequence model to magnetic tweezer experiments performed on the 48 502 bp genome of bacteriophage lambda. We find good agreement with the experimental data, which is restricted to temperatures between 24 and 50 degrees C. At higher temperatures, the critical-force curve of our random-sequence model is very different from that of the homogeneous-sequence version of our model. For both sequence models, the critical force falls to zero at the melting temperature T_{c} like |T-T_{c}|^{alpha}.
For the homogeneous-sequence model, alpha = 1/2 almost exactly, while for the random-sequence model, alpha is approximately 0.9. Importantly, the shape of the critical-force curve is connected, via our theory, to the manner in which the helix fraction falls to zero at T_{c}. The helix fraction is the property that is used to classify the melting transition as a type of phase transition. In our calculation, the shape of the critical-force curve holds strong evidence that the zero-force melting transition of long natural dsDNA should be classified as a higher-order (continuous) phase transition. Specifically, the order is 3rd or greater.
Dehbandi, Behdad; Barachant, Alexandre; Smeragliuolo, Anna H.; Long, John Davis; Bumanlag, Silverio Joseph; He, Victor; Lampe, Anna; Putrino, David
2017-01-01
The objective of this study was to determine whether kinematic data collected by the Microsoft Kinect 2 (MK2) could be used to quantify postural stability in healthy subjects. Twelve subjects were recruited for the project, and were instructed to perform a sequence of simple postural stability tasks. The movement sequence was performed as subjects were seated on top of a force platform, and the MK2 was positioned in front of them. This sequence of tasks was performed by each subject under three different postural conditions: “both feet on the ground” (1), “One foot off the ground” (2), and “both feet off the ground” (3). We compared force platform and MK2 data to quantify the degree to which the MK2 was returning reliable data across subjects. We then applied a novel machine-learning paradigm to the MK2 data in order to determine the extent to which data from the MK2 could be used to reliably classify different postural conditions. Our initial comparison of force plate and MK2 data showed a strong agreement between the two devices, with strong Pearson correlations between the trunk centroids “Spine_Mid” (0.85 ± 0.06), “Neck” (0.86 ± 0.07) and “Head” (0.87 ± 0.07), and the center of pressure centroid inferred by the force platform. Mean accuracy for the machine learning classifier from MK2 was 97.0%, with a specific classification accuracy breakdown of 90.9%, 100%, and 100% for conditions 1 through 3, respectively. Mean accuracy for the machine learning classifier derived from the force platform data was lower at 84.4%. We conclude that data from the MK2 has sufficient information content to allow us to classify sequences of tasks being performed under different levels of postural stability. Future studies will focus on validating this protocol on large populations of individuals with actual balance impairments in order to create a toolkit that is clinically validated and available to the medical community. PMID:28196139
Kulmanov, Maxat; Khan, Mohammed Asif; Hoehndorf, Robert; Wren, Jonathan
2018-02-15
A large number of protein sequences are becoming available through the application of novel high-throughput sequencing technologies. Experimental functional characterization of these proteins is time-consuming and expensive, and is often only done rigorously for a few selected model organisms. Computational function prediction approaches have been suggested to fill this gap. The functions of proteins are classified using the Gene Ontology (GO), which contains over 40 000 classes. Additionally, proteins have multiple functions, making function prediction a large-scale, multi-class, multi-label problem. We have developed a novel method to predict protein function from sequence. We use deep learning to learn features from protein sequences as well as a cross-species protein-protein interaction network. Our approach specifically outputs information in the structure of the GO and utilizes the dependencies between GO classes as background information to construct a deep learning model. We evaluate our method using the standards established by the Computational Assessment of Function Annotation (CAFA) and demonstrate a significant improvement over baseline methods such as BLAST, in particular for predicting cellular locations. Web server: http://deepgo.bio2vec.net, Source code: https://github.com/bio-ontology-research-group/deepgo. Contact: robert.hoehndorf@kaust.edu.sa. Supplementary data are available at Bioinformatics online. © The Author(s) 2017. Published by Oxford University Press.
Zeng, Qiwei; Chen, Hongyu; Zhang, Chao; Han, Minjing; Li, Tian; Qi, Xiwu; Xiang, Zhonghuai; He, Ningjia
2015-01-01
Mulberry, belonging to the order Rosales, family Moraceae, and genus Morus, has received attention because of both its economic and medicinal value, as well as for its important ecological function. The genus Morus has a worldwide distribution, however, its taxonomy remains complex and disputed. Many studies have attempted to classify Morus species, resulting in varied numbers of designated Morus spp. To address this issue, we used information from internal transcribed spacer (ITS) genetic sequences to study the taxonomy of all the members of generally accepted genus Morus. We found that intraspecific 5.8S rRNA sequences were identical but that interspecific 5.8S sequences were diverse. M. alba and M. notabilis showed the shortest (215 bp) and the longest (233 bp) ITS1 sequence length, respectively. With the completion of the mulberry genome, we could identify single nucleotide polymorphisms within the ITS locus in the M. notabilis genome. From reconstruction of a phylogenetic tree based on the complete ITS data, we propose that the Morus genus should be classified into eight species, including M. alba, M. nigra, M. notabilis, M. serrata, M. celtidifolia, M. insignis, M. rubra, and M. mesozygia. Furthermore, the classification of the ITS sequences of known interspecific hybrid clones into both paternal and maternal clades indicated that ITS variation was sufficient to distinguish interspecific hybrids in the genus Morus. PMID:26266951
A potential functional association between mutant BMPR2 and primary ovarian insufficiency.
Patiño, Liliana Catherine; Silgado, Daniel; Laissue, Paul
2017-06-01
Primary ovarian insufficiency (POI) affects ~1% of women in the general population. Despite numerous attempts at identifying POI genetic aetiology, coding mutations in only a few genes have been functionally related to POI pathogenesis. It has been suggested that mutant BMPR2 might contribute towards the phenotype. Several coding mutations in human BMP15 (a BMPR2 ligand) have been related to POI pathogenesis. The BMPR2 p.Ser987Phe mutation, previously identified in a woman with POI, might therefore lead to cellular dysfunction contributing to the phenotype. To explore such an assumption, the present study assessed potential pathogenic subcellular localization/aggregation patterns associated with the p.Ser987Phe mutant form of BMPR2 in a relevant model for studying ovarian function. A significant increase in protein-like aggregation patterns was identified at the endoplasmic reticulum (ER), which permitted us to establish, for the first time, a potential functional association between mutant BMPR2 and POI aetiology. Since BMPR2 mutant forms were previously related to idiopathic pulmonary arterial hypertension, BMPR2 mutations may be related to an as-yet-undescribed syndromic form of POI involving pulmonary dysfunction. Additional assays are necessary to confirm that BMPR2 abnormal subcellular patterns are composed of aggregates. Abbreviations: POI: primary ovarian insufficiency; ER: endoplasmic reticulum; NGS: next generation sequencing.
Halogen Bonding: A Powerful Tool for Modulation of Peptide Conformation
2017-01-01
Halogen bonding is a weak chemical force that has so far mostly found applications in crystal engineering. Despite its potential for use in drug discovery, as a new molecular tool in the direction of molecular recognition events, it has rarely been assessed in biopolymers. Motivated by this fact, we have developed a peptide model system that permits the quantitative evaluation of weak forces in a biologically relevant proteinlike environment and have applied it for the assessment of a halogen bond formed between two amino acid side chains. The influence of a single weak force is measured by detection of the extent to which it modulates the conformation of a cooperatively folding system. We have optimized the amino acid sequence of the model peptide on analogues with a hydrogen bond-forming site as a model for the intramolecular halogen bond to be studied, demonstrating the ability of the technique to provide information about any type of weak secondary interaction. A combined solution nuclear magnetic resonance spectroscopic and computational investigation demonstrates that an interstrand halogen bond is capable of conformational stabilization of a β-hairpin foldamer comparable to an analogous hydrogen bond. This is the first report of incorporation of a conformation-stabilizing halogen bond into a peptide/protein system, and the first quantification of a chlorine-centered halogen bond in a biologically relevant system in solution. PMID:28581720
Stec, James; Wang, Jing; Coombes, Kevin; Ayers, Mark; Hoersch, Sebastian; Gold, David L.; Ross, Jeffrey S; Hess, Kenneth R.; Tirrell, Stephen; Linette, Gerald; Hortobagyi, Gabriel N.; Symmans, W. Fraser; Pusztai, Lajos
2005-01-01
We examined how well differentially expressed genes and multigene outcome classifiers retain their class-discriminating values when tested on data generated by different transcriptional profiling platforms. RNA from 33 stage I-III breast cancers was hybridized to both Affymetrix GeneChip and Millennium Pharmaceuticals cDNA arrays. Only 30% of all corresponding gene expression measurements on the two platforms had a Pearson correlation coefficient r ≥ 0.7 when UniGene was used to match probes. There was substantial variation in correlation between different Affymetrix probe sets matched to the same cDNA probe. When cDNA and Affymetrix probes were matched by basic local alignment search tool (BLAST) sequence identity, the correlation increased substantially. We identified 182 genes in the Affymetrix and 45 in the cDNA data (including 17 common genes) that accurately separated 91% of cases in supervised hierarchical clustering in each data set. Cross-platform testing of these informative genes resulted in lower clustering accuracy of 45 and 79%, respectively. Several sets of accurate five-gene classifiers were developed on each platform using linear discriminant analysis. The best 100 classifiers showed an average misclassification error rate of 2% on the original data that rose to 19.5% when tested on data from the other platform. Random five-gene classifiers showed a misclassification error rate of 33%. We conclude that multigene predictors optimized for one platform lose accuracy when applied to data from another platform due to missing genes and sequence differences in probes that result in differing measurements for the same gene. PMID:16049308
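The cross-platform concordance figure quoted above (the share of matched measurements with r ≥ 0.7) can be illustrated with a small sketch. It assumes each platform is represented as a mapping from a gene identifier to its expression values over the same ordered set of samples; the names `pearson_r`, `fraction_concordant`, and the toy values are hypothetical and are not taken from the study.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant profile: r is undefined, treated as 0 in this sketch
    return sxy / sqrt(sxx * syy)


def fraction_concordant(platform_a, platform_b, threshold=0.7):
    """Fraction of genes present on both platforms with r >= threshold."""
    shared = sorted(set(platform_a) & set(platform_b))
    if not shared:
        return 0.0
    hits = sum(pearson_r(platform_a[g], platform_b[g]) >= threshold for g in shared)
    return hits / len(shared)


# Toy data: gene -> expression across the same four samples on each platform.
affy = {"G1": [1.0, 2.0, 3.0, 4.0], "G2": [2.0, 1.0, 4.0, 3.0], "G3": [5.0, 5.5, 6.0, 7.0]}
cdna = {"G1": [1.1, 2.1, 2.9, 4.2], "G2": [3.0, 4.0, 1.0, 2.0], "G3": [7.0, 6.0, 5.5, 5.0]}
share = fraction_concordant(affy, cdna)  # only G1 agrees between the toy platforms
```

How probes are matched (by UniGene cluster or by sequence identity) decides which keys end up shared between the two mappings, which is why the matching method changes the measured concordance.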
Bai, Yingchen; Wu, Fengchang; Xing, Baoshan; Meng, Wei; Shi, Guolan; Ma, Yan; Giesy, John P
2015-03-04
An XAD-8 adsorption technique coupled with stepwise elution using pyrophosphate buffers with initial pH values of 3, 5, 7, 9, and 13 was developed to isolate Chinese standard fulvic acid (FA) and then separate the FA into five sub-fractions: FApH3, FApH5, FApH7, FApH9 and FApH13, respectively. Mass percentages of FApH3-FApH13 decreased from 42% to 2.5%, and the recovery ratios ranged from 99.0% to 99.5%. Earlier eluting sub-fractions contained greater proportions of carboxylic groups with greater polarity and molecular mass, and later eluting sub-fractions had greater phenolic and aliphatic content. Protein-like components, as well as amorphous and crystalline poly(methylene)-containing components, were enriched using neutral and basic buffers. Three main mechanisms likely affect stepwise elution of humic components from XAD-8 resin with pyrophosphate buffers: (1) the carboxylic-rich sub-fractions are deprotonated at lower pH values and eluted earlier, while phenolic-rich sub-fractions are deprotonated at greater pH values and eluted later; (2) protein or protein-like components can be desorbed and eluted by stepwise elution as progressively greater pH values exceed their isoelectric points; and (3) size exclusion affects elution of FA sub-fractions. Successful isolation of FA sub-fractions will benefit exploration of their origin, structure, and evolution, and the investigation of their interactions with environmental contaminants.
Wu, Jun; Zhang, Hua; He, Pin-Jing; Shao, Li-Ming
2011-02-01
Dissolved organic matter (DOM) plays an important role in heavy metal migration from municipal solid waste (MSW) to aquatic environments via the leachate pathway. In this study, fluorescence excitation-emission matrix (EEM) quenching combined with parallel factor (PARAFAC) analysis was adopted to characterize the binding properties of four heavy metals (Cu, Pb, Zn and Cd) and DOM in MSW leachate. Nine leachate samples were collected from various stages of MSW management, including collection, transportation, incineration, landfill and subsequent leachate treatment. Three humic-like components and one protein-like component were identified in the MSW-derived DOM by PARAFAC. Significant differences in quenching effects were observed between components and metal ions, and a relatively consistent trend in metal quenching curves was observed among various leachate samples. Among the four heavy metals, Cu(II) titration led to fluorescence quenching of all four PARAFAC-derived components. Additionally, strong quenching effects were only observed in protein-like and fulvic acid (FA)-like components with the addition of Pb(II), which suggested that these fractions are mainly responsible for Pb(II) binding in MSW-derived DOM. Moreover, the significant quenching effects of the FA-like component by the four heavy metals revealed that the FA-like fraction in MSW-derived DOM plays an important role in heavy metal speciation; therefore, it may be useful as an indicator to assess the potential ability of heavy metal binding and migration. © 2010 Elsevier Ltd. All rights reserved.
Lu, Zhuoyang; Reddy, M V V V Sekhar; Liu, Jianfang; Kalichava, Ana; Liu, Jiankang; Zhang, Lei; Chen, Fang; Wang, Yun; Holthauzen, Luis Marcelo F; White, Mark A; Seshadrinathan, Suchithra; Zhong, Xiaoying; Ren, Gang; Rudenko, Gabby
2016-11-11
Contactin-associated protein-like 2 (CNTNAP2) is a large multidomain neuronal adhesion molecule implicated in a number of neurological disorders, including epilepsy, schizophrenia, autism spectrum disorder, intellectual disability, and language delay. We reveal here by electron microscopy that the architecture of CNTNAP2 is composed of a large, medium, and small lobe that flex with respect to each other. Using epitope labeling and fragments, we assign the F58C, L1, and L2 domains to the large lobe, the FBG and L3 domains to the middle lobe, and the L4 domain to the small lobe of the CNTNAP2 molecular envelope. Our data reveal that CNTNAP2 has a very different architecture compared with neurexin 1α, a fellow member of the neurexin superfamily and a prototype, suggesting that CNTNAP2 uses a different strategy to integrate into the synaptic protein network. We show that the ectodomains of CNTNAP2 and contactin 2 (CNTN2) bind directly and specifically, with low nanomolar affinity. We show further that mutations in CNTNAP2 implicated in autism spectrum disorder are not segregated but are distributed over the whole ectodomain. The molecular shape and dimensions of CNTNAP2 place constraints on how CNTNAP2 integrates in the cleft of axo-glial and neuronal contact sites and how it functions as an organizing and adhesive molecule. © 2016 by The American Society for Biochemistry and Molecular Biology, Inc.
Preston, Jill C; Jorgensen, Stacy A; Orozco, Rebecca; Hileman, Lena C
2016-02-01
Duplicated petunia clade-VI SPL genes differentially promote the timing of inflorescence and flower development, and leaf initiation rate. The timing of plant reproduction relative to favorable environmental conditions is a critical component of plant fitness, and is often associated with variation in plant architecture and habit. Recent studies have shown that overexpression of the microRNA miR156 in distantly related annual species results in plants with perennial characteristics, including late flowering, weak apical dominance, and abundant leaf production. These phenotypes are largely mediated through the negative regulation of a subset of genes belonging to the SQUAMOSA PROMOTER BINDING PROTEIN-LIKE (SPL) family of transcription factors. In order to determine how and to what extent paralogous SPL genes have partitioned their roles in plant growth and development, we functionally characterized petunia clade-VI SPL genes under different environmental conditions. Our results demonstrate that PhSBP1 and PhSBP2 differentially promote discrete stages of the reproductive transition, and that PhSBP1, and possibly PhCNR, accelerates leaf initiation rate. In contrast to the closest homologs in annual Arabidopsis thaliana and Mimulus guttatus, PhSBP1 and PhSBP2 transcription is not mediated by the gibberellic acid pathway, but is positively correlated with photoperiod and developmental age. The developmental functions of clade-VI SPL genes have, thus, evolved following both gene duplication and speciation within the core eudicots, likely through differential regulation and incomplete sub-functionalization.
Bai, Yingchen; Wu, Fengchang; Xing, Baoshan; Meng, Wei; Shi, Guolan; Ma, Yan; Giesy, John P.
2015-01-01
An XAD-8 adsorption technique coupled with stepwise elution using pyrophosphate buffers with initial pH values of 3, 5, 7, 9, and 13 was developed to isolate Chinese standard fulvic acid (FA) and then separate the FA into five sub-fractions: FApH3, FApH5, FApH7, FApH9 and FApH13, respectively. Mass percentages of FApH3-FApH13 decreased from 42% to 2.5%, and the recovery ratios ranged from 99.0% to 99.5%. Earlier eluting sub-fractions contained greater proportions of carboxylic groups with greater polarity and molecular mass, and later eluting sub-fractions had greater phenolic and aliphatic content. Protein-like components, as well as amorphous and crystalline poly(methylene)-containing components, were enriched using neutral and basic buffers. Three main mechanisms likely affect stepwise elution of humic components from XAD-8 resin with pyrophosphate buffers: (1) the carboxylic-rich sub-fractions are deprotonated at lower pH values and eluted earlier, while phenolic-rich sub-fractions are deprotonated at greater pH values and eluted later; (2) protein or protein-like components can be desorbed and eluted by stepwise elution as progressively greater pH values exceed their isoelectric points; and (3) size exclusion affects elution of FA sub-fractions. Successful isolation of FA sub-fractions will benefit exploration of their origin, structure, and evolution, and the investigation of their interactions with environmental contaminants. PMID:25735451
DOE Office of Scientific and Technical Information (OSTI.GOV)
Lu, Zhuoyang; Reddy, M. V. V. V. Sekhar; Liu, Jianfang; ...
2016-09-12
Contactin-associated protein-like 2 (CNTNAP2) is a large multidomain neuronal adhesion molecule implicated in a number of neurological disorders, including epilepsy, schizophrenia, autism spectrum disorder, intellectual disability, and language delay. We reveal in this paper by electron microscopy that the architecture of CNTNAP2 is composed of a large, medium, and small lobe that flex with respect to each other. Using epitope labeling and fragments, we assign the F58C, L1, and L2 domains to the large lobe, the FBG and L3 domains to the middle lobe, and the L4 domain to the small lobe of the CNTNAP2 molecular envelope. Our data reveal that CNTNAP2 has a very different architecture compared with neurexin 1α, a fellow member of the neurexin superfamily and a prototype, suggesting that CNTNAP2 uses a different strategy to integrate into the synaptic protein network. We show that the ectodomains of CNTNAP2 and contactin 2 (CNTN2) bind directly and specifically, with low nanomolar affinity. We show further that mutations in CNTNAP2 implicated in autism spectrum disorder are not segregated but are distributed over the whole ectodomain. Finally, the molecular shape and dimensions of CNTNAP2 place constraints on how CNTNAP2 integrates in the cleft of axo-glial and neuronal contact sites and how it functions as an organizing and adhesive molecule.
Yang, Xiaofang; Zhou, Zhongbo; Raju, Maddela Naga; Cai, Xiaoxuan; Meng, Fangang
2017-07-01
Effluent organic matter (EfOM) from municipal wastewater treatment plants potentially has a detrimental effect on both aquatic organisms and humans. This study evaluated the removal and transformation of chromophoric dissolved organic matter (CDOM) and fluorescent dissolved organic matter (FDOM) in a full-scale wastewater treatment plant (WWTP) in different seasons. The results showed that bio-treatment was more efficient in removing bulk DOM (in terms of dissolved organic carbon, DOC) than CDOM and FDOM, in contrast to the disinfection process. CDOM and FDOM were selectively removed at various stages during the treatment. Typically, the low molecular weight fractions of CDOM and protein-like FDOM were more efficiently removed during the bio-treatment process, whereas the humic-like FDOM exhibited comparable decreases in both bio-treatment and disinfection processes. Overall, the performance of the WWTP was weak in terms of CDOM and FDOM removal, resulting in enrichment of CDOM and FDOM in effluent. Moreover, the total removal of the bulk DOM (P<0.05) and the protein-like FDOM (P<0.05) displayed a significant seasonal variation, with higher removal efficiencies in summer, whereas removal of CDOM and the humic-like FDOM showed little difference between summer and winter. In all, the results provide useful information for understanding the fate and transformation of DOM, illustrating that sub-fractions of DOM could be selectively removed depending on treatment processes and seasonality. Copyright © 2016. Published by Elsevier B.V.
Chen, Wei; Liu, Xiao-Yang; Yu, Han-Qing
2017-03-01
Temperature variation caused by climate change, seasonal variation and geographic location affects the physicochemical composition of chromophoric dissolved organic matter (CDOM), resulting in differences in the fates of CDOM-related environmental pollutants. Exploration of the thermally induced structural transition of CDOM can help to better understand its environmental impacts, but information on this aspect is still lacking. Through integrating fluorescence excitation-emission matrix coupled parallel factor analysis with synchronous fluorescence two-dimensional correlation spectroscopy, this study provides an in-depth insight into the temperature-dependent conformational transitions of CDOM and their impact on its hydrophobic interaction with persistent organic pollutants (with phenanthrene as an example) in water. The fluorescence components in CDOM change linearly with water temperature, to different extents and over different temperature ranges. The thermally induced transition priority in CDOM is protein-like component → fulvic-like component → humic-like component. Furthermore, the impact of the thermally induced conformational transition of CDOM on its hydrophobic interaction with phenanthrene is observed and explored. The fluorescence-based analytic results reveal that the conjugation degree of the aromatic groups in the fulvic- and humic-like substances, and the unfolding of the secondary structure in the protein-like substances with aromatic structure, contribute to the conformation variation. This integrated approach enhances the characterization of temperature-dependent conformational variation of CDOM, and provides a promising way to elucidate the environmental behaviours of CDOM. Copyright © 2017 Elsevier Ltd. All rights reserved.
Huang, Tianpei; Zhang, Xiaojuan; Pan, Jieru; Su, Xiaoyu; Jin, Xin; Guan, Xiong
2016-01-01
Bacillus thuringiensis (Bt), one of the most successful biopesticides, may expand its potential by producing bacteriocins (thuricins). The aim of this study was to investigate the antimicrobial potential of a novel Bt bacteriocin, thuricin BtCspB, produced by Bt BRC-ZYR2. The results showed that this bacteriocin has a high similarity with cold-shock protein B (CspB). BtCspB lost its activity after proteinase K treatment; however, it was active at 60 °C for 30 min and was stable in the pH range 5–7. The partial loss of activity after treatment with lipase II and catalase was likely due to the change in BtCspB structure and the partial degradation of BtCspB, respectively. The loss of activity at high temperatures and the activity variation at different pHs were not due to degradation or large conformational change. BtCspB did not inhibit four probiotics. It was only active against B. cereus strains 0938 and ATCC 10987 with MIC values of 3.125 μg/mL and 0.781 μg/mL, and MBC values of 12.5 μg/mL and 6.25 μg/mL, respectively. Taken together, these results provide new insights into a novel cold shock protein-like bacteriocin, BtCspB, which displayed promise for its use in food preservation and treatment of B. cereus-associated diseases. PMID:27762322
Huang, Tianpei; Zhang, Xiaojuan; Pan, Jieru; Su, Xiaoyu; Jin, Xin; Guan, Xiong
2016-10-20
Bacillus thuringiensis (Bt), one of the most successful biopesticides, may expand its potential by producing bacteriocins (thuricins). The aim of this study was to investigate the antimicrobial potential of a novel Bt bacteriocin, thuricin BtCspB, produced by Bt BRC-ZYR2. The results showed that this bacteriocin has a high similarity with cold-shock protein B (CspB). BtCspB lost its activity after proteinase K treatment; however, it was active at 60 °C for 30 min and was stable in the pH range 5-7. The partial loss of activity after treatment with lipase II and catalase was likely due to the change in BtCspB structure and the partial degradation of BtCspB, respectively. The loss of activity at high temperatures and the activity variation at different pHs were not due to degradation or large conformational change. BtCspB did not inhibit four probiotics. It was only active against B. cereus strains 0938 and ATCC 10987 with MIC values of 3.125 μg/mL and 0.781 μg/mL, and MBC values of 12.5 μg/mL and 6.25 μg/mL, respectively. Taken together, these results provide new insights into a novel cold shock protein-like bacteriocin, BtCspB, which displayed promise for its use in food preservation and treatment of B. cereus-associated diseases.
Zhou, Zhiwei; Yang, Yanling; Li, Xing; Ji, Siyang; Zhang, Hao; Wang, Shuai; Zeng, Qingping; Han, Xinghang
2016-01-01
To clarify the role of solubilized organics derived from drinking water treatment sludge (DWTS) in the elimination of natural organic matter (NOM) in the DWTS recycling process, a probe sonoreactor at a frequency of 25 kHz was used to solubilize the organics at varied specific energies. The coagulation behavior related to NOM removal in recycling the sonicated DWTS with and without solubilized organics was evaluated, and the effect on organic fractionations in coagulated water was determined. The study results could provide useful implications in designing DWTS recycling processes that avoid the enrichment of organic matter. Our results indicate that DWTS was disrupted through a low release of soluble chemical oxygen demand (SCOD) and proteins, which could deteriorate the coagulated water quality under the specific energy of 37.87-1212.1 kW h/kg TS. The optimal coagulation behavior for NOM removal was achieved by recycling the sonicated DWTS without solubilized organics at 151.5 kW h/kg TS specific energy. Recycling the sonicated DWTS could increase the enrichment potential of weakly hydrophobic acid, hydrophilic matter, and <3 kDa fractions; the enrichment risks could be reduced by discharging the solubilized organics. Fluorescent characteristic analysis indicated that when recycling the sonicated DWTS without solubilized organics, the removal of humic-like substances was limited, whereas removal of protein-like substances was enhanced, lowering the enrichment potential of protein-like substances. Copyright © 2015. Published by Elsevier B.V.
Multi-Modal Curriculum Learning for Semi-Supervised Image Classification.
Gong, Chen; Tao, Dacheng; Maybank, Stephen J; Liu, Wei; Kang, Guoliang; Yang, Jie
2016-07-01
Semi-supervised image classification aims to classify a large quantity of unlabeled images by typically harnessing scarce labeled images. Existing semi-supervised methods often suffer from inadequate classification accuracy when encountering difficult yet critical images, such as outliers, because they treat all unlabeled images equally and conduct classifications in an imperfectly ordered sequence. In this paper, we employ the curriculum learning methodology by investigating the difficulty of classifying every unlabeled image. The reliability and the discriminability of these unlabeled images are particularly investigated for evaluating their difficulty. As a result, an optimized image sequence is generated during the iterative propagations, and the unlabeled images are logically classified from simple to difficult. Furthermore, since images are usually characterized by multiple visual feature descriptors, we associate each kind of feature with a teacher, and design a multi-modal curriculum learning (MMCL) strategy to integrate the information from different feature modalities. In each propagation, each teacher analyzes the difficulties of the currently unlabeled images from its own modality viewpoint. A consensus is subsequently reached among all the teachers, determining the currently simplest images (i.e., a curriculum), which are to be reliably classified by the multi-modal learner. This well-organized propagation process leveraging multiple teachers and one learner enables our MMCL to outperform five state-of-the-art methods on eight popular image data sets.
Austin, Peter C; Lee, Douglas S
2011-01-01
Purpose: Classification trees are increasingly being used to classify patients according to the presence or absence of a disease or health outcome. A limitation of classification trees is their limited predictive accuracy. In the data-mining and machine learning literature, boosting has been developed to improve classification. Boosting with classification trees iteratively grows classification trees in a sequence of reweighted datasets. In a given iteration, subjects that were misclassified in the previous iteration are weighted more highly than subjects that were correctly classified. Classifications from each of the classification trees in the sequence are combined through a weighted majority vote to produce a final classification. The authors' objective was to examine whether boosting improved the accuracy of classification trees for predicting outcomes in cardiovascular patients. Methods: We examined the utility of boosting classification trees for classifying 30-day mortality outcomes in patients hospitalized with either acute myocardial infarction or congestive heart failure. Results: Improvements in the misclassification rate using boosted classification trees were at best minor compared to when conventional classification trees were used. Minor to modest improvements to sensitivity were observed, with only a negligible reduction in specificity. For predicting cardiovascular mortality, boosted classification trees had high specificity, but low sensitivity. Conclusions: Gains in predictive accuracy for predicting cardiovascular outcomes were less impressive than gains in performance observed in the data mining literature. PMID:22254181
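The reweighting and weighted-majority-vote steps described in this abstract can be sketched with a minimal AdaBoost-style loop. This is an illustration only: it uses one-split trees ("stumps") on a single numeric feature with labels coded as +1/-1, which is an assumption made for brevity and not the authors' clinical models; all function names and the toy data are hypothetical.

```python
import math


def stump_predict(x, threshold, polarity):
    """One-split tree: predict `polarity` at or above the threshold, else the opposite."""
    return polarity if x >= threshold else -polarity


def fit_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best stump for these weights."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for w, x, y in zip(weights, xs, ys)
                      if stump_predict(x, threshold, polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best


def adaboost(xs, ys, rounds=3):
    """Grow a sequence of stumps on reweighted data; return (alpha, threshold, polarity) triples."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, threshold, polarity = fit_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)      # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)    # vote weight of this stump
        ensemble.append((alpha, threshold, polarity))
        # Misclassified subjects are weighted up, correctly classified ones down.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Weighted majority vote over all stumps in the sequence."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1


# Toy data that no single stump can separate (positive, negative, positive bands).
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
single = adaboost(xs, ys, rounds=1)
boosted = adaboost(xs, ys, rounds=3)
```

On this toy set a single stump misclassifies two of the six points, while three boosted stumps fit the training data exactly. Training fit alone says nothing about out-of-sample accuracy, which is the quantity the study above evaluates.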
De Jonckheere, Johan F; Gryseels, Sophie; Eddyani, Miriam
2012-08-01
We have isolated several free-living amoeba strains from the environment in Ghana, which have internal transcribed spacers, including the 5.8S rDNA, sequences similar to sequences attributed to Vahlkampfiidae (Heterolobosea) in databases. However, morphological examination shows that the isolates belong to the Hartmannellidae (Amoebozoa). We provide evidence that the sequences in the databases are wrongly classified as belonging to a genus or species of the Vahlkampfiidae, but rather belong to strains of the genus Hartmannella. Copyright © 2012 Elsevier GmbH. All rights reserved.
Sankari, E Siva; Manimegalai, D
2017-12-21
Predicting membrane protein types is an important and challenging research area in bioinformatics and proteomics. Traditional biophysical methods are used to classify membrane protein types. Because of the large number of uncharacterized protein sequences in databases, traditional methods are very time consuming, expensive and susceptible to errors. Hence, it is highly desirable to develop a robust, reliable, and efficient method to predict membrane protein types. Imbalanced datasets and large datasets are often handled well by decision tree classifiers. Since imbalanced datasets are used, the performance of various decision tree classifiers, such as Decision Tree (DT), Classification And Regression Tree (CART), C4.5, Random tree and REP (Reduced Error Pruning) tree, and of ensemble methods, such as Adaboost, RUS (Random Under Sampling) boost, Rotation forest and Random forest, is analysed. Among the various decision tree classifiers, Random forest performs well in less time, with a good accuracy of 96.35%. Another inference is that the RUS boost decision tree classifier is able to classify one or two samples in the class with very few samples, whereas the other classifiers such as DT, Adaboost, Rotation forest and Random forest are not sensitive to the classes with fewer samples. The performance of decision tree classifiers is also compared with SVM (Support Vector Machine) and Naive Bayes classifiers. Copyright © 2017 Elsevier Ltd. All rights reserved.
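The "RUS" in RUS boost refers to random undersampling: before a learner is trained, larger classes are randomly cut down so that rare classes are not swamped. The sketch below shows only that resampling step, under the assumption that every class is reduced to the size of the smallest one; the function name `random_undersample`, the class labels, and the sample identifiers are hypothetical.

```python
import random
from collections import Counter, defaultdict


def random_undersample(samples, labels, seed=0):
    """Return (sample, label) pairs with every class cut to the smallest class size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    n_min = min(len(items) for items in by_class.values())
    balanced = []
    for label, items in by_class.items():
        # Draw n_min members of this class without replacement.
        balanced.extend((sample, label) for sample in rng.sample(items, n_min))
    rng.shuffle(balanced)
    return balanced


# Toy imbalanced set: eight samples of one class, two of another.
toy_samples = ["s%d" % i for i in range(10)]
toy_labels = ["type_I"] * 8 + ["type_II"] * 2
balanced = random_undersample(toy_samples, toy_labels)
class_counts = Counter(label for _, label in balanced)
```

In a boosting setting this resampling would be repeated in every round, so different members of the large classes are seen over the course of training.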
NASA Astrophysics Data System (ADS)
Zhao, Y.; Song, K.; Wen, Z.; Li, L.; Zang, S.; Shao, T.; Li, S.; Du, J.
2015-04-01
The seasonal characteristics of fluorescence components in CDOM for lakes in the semi-arid region of Northeast China were examined by excitation-emission matrix fluorescence and parallel factor analysis (EEM-PARAFAC). Two humic-like peaks, C1 (Ex/Em = 230, 300/425 nm) and C2 (Ex/Em = 255, 350/460 nm), and two protein-like peaks, B (Ex/Em = 220, 275/320 nm) and T (Ex/Em = 225, 290/360 nm), were identified using PARAFAC. The average fluorescence intensity of the four components varied seasonally from June and August 2013 to February and April 2014. The total fluorescence intensity varied significantly, from 2.54 ± 0.68 nm⁻¹ in June to a mean of 1.93 ± 0.70 nm⁻¹ in August 2013, then increased to 2.34 ± 0.92 nm⁻¹ in February and fell to its lowest value, 1.57 ± 0.55 nm⁻¹, in April 2014. In general, the fluorescence intensity was dominated by peak C1, indicating that most of the CDOM in the inland waters investigated in this study originated from phytoplankton degradation. The lowest, C2, represents only a small portion of CDOM, from terrestrial organic matter imported to water bodies through rainwash and soil leaching. The two protein-like components (B and T), formed in situ through microbial activity, have almost the same intensity. In August 2013 and February 2014 in particular, the two protein-like peaks differed markedly from other seasons, and the highest C1 (1.02 nm⁻¹) was present in February 2014. Components 1 and 2 exhibited a strong linear correlation (R² = 0.633). There were significantly positive linear relationships between CDOM absorption coefficients a(254) (R² = 0.72, 0.46, p < 0.01), a(280) (R² = 0.77, 0.47, p < 0.01), a(350) (R² = 0.76, 0.78, p < 0.01) and Fmax for the two humic-like components (C1 and C2), respectively. A close relationship (R² = 0.931) was found between salinity and DOC. However, almost no obvious correlation was found between salinity and EEM-PARAFAC extracted components except for C3 (R² = 0.469). Results from this investigation demonstrate that the EEM-PARAFAC technique can be used to evaluate the seasonal dynamics of CDOM fluorescence components for inland waters in semi-arid regions of Northeast China.
Ji, Hongwei; He, Jiangping; Yang, Xin; Deklerck, Rudi; Cornelis, Jan
2013-05-01
In this paper, we present an autocontext model (ACM)-based automatic liver segmentation algorithm, which combines ACM, multiatlases, and mean-shift techniques to segment the liver from 3-D CT images. Our algorithm is a learning-based method and can be divided into two stages. At the first stage, i.e., the training stage, ACM is performed to learn a sequence of classifiers in each atlas space (based on each atlas and other aligned atlases). With the use of multiple atlases, multiple sequences of ACM-based classifiers are obtained. At the second stage, i.e., the segmentation stage, the test image is segmented in each atlas space by applying each sequence of ACM-based classifiers. The final segmentation result is obtained by fusing segmentation results from all atlas spaces via a multiclassifier fusion technique. Specifically, in order to speed up segmentation, given a test image, we first use an improved mean-shift algorithm to perform over-segmentation and then implement region-based image labeling instead of the original inefficient pixel-based image labeling. The proposed method is evaluated on the datasets of the MICCAI 2007 liver segmentation challenge. The experimental results show that the average volume overlap error and the average surface distance achieved by our method are 8.3% and 1.5 mm, respectively, which are comparable to the results reported in the existing state-of-the-art work on liver segmentation.
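The fusion and evaluation steps described above can be illustrated with a minimal sketch. It assumes binary liver/background labels stored as flat 0/1 sequences, uses simple majority voting as a stand-in for the unspecified multiclassifier fusion technique, and computes the volume overlap error as 100 × (1 − |A ∩ B| / |A ∪ B|). The function names `majority_vote` and `volume_overlap_error` are hypothetical and not taken from the paper.

```python
def majority_vote(label_maps):
    """Fuse several binary label maps (equal-length 0/1 sequences).

    Assumption: plain majority voting; the abstract does not say which
    fusion rule the authors actually use.
    """
    n = len(label_maps)
    return [1 if 2 * sum(votes) > n else 0 for votes in zip(*label_maps)]


def volume_overlap_error(seg, ref):
    """Volume overlap error in percent: 100 * (1 - |A n B| / |A u B|)."""
    if len(seg) != len(ref):
        raise ValueError("label maps must have the same length")
    inter = sum(1 for s, r in zip(seg, ref) if s and r)
    union = sum(1 for s, r in zip(seg, ref) if s or r)
    if union == 0:
        return 0.0  # both maps empty: treat as perfect agreement
    return 100.0 * (1.0 - inter / union)


# Toy example: three per-atlas results for a four-voxel image.
per_atlas = [[1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 1, 0]]
fused = majority_vote(per_atlas)
reference = [1, 1, 1, 0]
voe = volume_overlap_error(fused, reference)
```

A lower value means better agreement with the reference segmentation; 0% is a perfect overlap.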
Classifying proteins into functional groups based on all-versus-all BLAST of 10 million proteins.
Kolker, Natali; Higdon, Roger; Broomall, William; Stanberry, Larissa; Welch, Dean; Lu, Wei; Haynes, Winston; Barga, Roger; Kolker, Eugene
2011-01-01
To address the monumental challenge of assigning function to millions of sequenced proteins, we completed the first-of-a-kind all-versus-all sequence alignments using BLAST for 9.9 million proteins in the UniRef100 database. Microsoft Windows Azure produced over 3 billion filtered records in 6 days using 475 eight-core virtual machines. Protein classification into functional groups was then performed using Hive and custom jars implemented on top of Apache Hadoop utilizing the MapReduce paradigm. First, using the Clusters of Orthologous Genes (COG) database, a length normalized bit score (LNBS) was determined to be the best similarity measure for classification of proteins. LNBS achieved sensitivity and specificity of 98% each. Second, out of 5.1 million bacterial proteins, about two-thirds were assigned to significantly extended COG groups, encompassing 30 times more assigned proteins. Third, the remaining proteins were classified into protein functional groups using an innovative implementation of a single-linkage algorithm on an in-house Hadoop compute cluster. This implementation significantly reduces the run time for nonindexed queries and optimizes efficient clustering on a large scale. The performance was also verified on Amazon Elastic MapReduce. This clustering assigned nearly 2 million proteins to approximately half a million different functional groups. A similar approach was applied to classify 2.8 million eukaryotic sequences, resulting in over 1 million proteins being assigned to existing KOG groups and the remainder clustered into 100,000 functional groups.
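The single-linkage step can be sketched in a few lines on a single machine. The abstract does not define LNBS, so the sketch assumes it is the BLAST bit score divided by the length of the shorter sequence; that definition, the threshold value, and the function names are illustrative assumptions. The authors' implementation runs on Hadoop; this version uses an in-memory union-find structure instead.

```python
def lnbs(bit_score, len_a, len_b):
    """Hypothetical length-normalized bit score (assumed definition)."""
    return bit_score / min(len_a, len_b)


def single_linkage(hits, lengths, threshold):
    """Single-linkage clustering: two proteins share a cluster if a chain
    of pairwise hits with LNBS >= threshold connects them.

    hits:    iterable of (protein_a, protein_b, bit_score)
    lengths: dict mapping protein id -> sequence length
    """
    parent = {p: p for p in lengths}

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, bits in hits:
        if lnbs(bits, lengths[a], lengths[b]) >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = {}
    for p in lengths:
        clusters.setdefault(find(p), set()).add(p)
    return sorted(sorted(c) for c in clusters.values())


lengths = {"A": 100, "B": 120, "C": 90, "D": 200}
hits = [("A", "B", 150), ("B", "C", 135), ("C", "D", 45)]
groups = single_linkage(hits, lengths, threshold=1.0)
```

Here A and C end up in the same group even though they have no direct hit, because both are linked to B; that chaining is the defining property of single linkage.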
An, Ji-Yong; Meng, Fan-Rong; You, Zhu-Hong; Fang, Yu-Hong; Zhao, Yu-Jun; Zhang, Ming
2016-01-01
We propose a novel computational method known as RVM-LPQ that combines the Relevance Vector Machine (RVM) model and Local Phase Quantization (LPQ) to predict PPIs from protein sequences. The main improvements are the results of representing protein sequences using the LPQ feature representation on a Position Specific Scoring Matrix (PSSM), reducing the influence of noise using a Principal Component Analysis (PCA), and using a Relevance Vector Machine (RVM) based classifier. We perform 5-fold cross-validation experiments on Yeast and Human datasets, and we achieve very high accuracies of 92.65% and 97.62%, respectively, which is significantly better than previous works. To further evaluate the proposed method, we compare it with the state-of-the-art support vector machine (SVM) classifier on the Yeast dataset. The experimental results demonstrate that our RVM-LPQ method is obviously better than the SVM-based method. The promising experimental results show the efficiency and simplicity of the proposed method, which can be an automatic decision support tool for future proteomics research.
Terminator Detection by Support Vector Machine Utilizing a Stochastic Context-Free Grammar
DOE Office of Scientific and Technical Information (OSTI.GOV)
Francis-Lyon, Patricia; Cristianini, Nello; Holbrook, Stephen
2006-12-30
A 2-stage detector was designed to find rho-independent transcription terminators in the Escherichia coli genome. The detector includes a Stochastic Context Free Grammar (SCFG) component and a Support Vector Machine (SVM) component. To find terminators, the SCFG searches the intergenic regions of nucleotide sequence for local matches to a terminator grammar that was designed and trained utilizing examples of known terminators. The grammar selects sequences that are the best candidates for terminators and assigns them a prefix, stem-loop, suffix structure using the Cocke-Younger-Kasami (CYK) algorithm, modified to incorporate energy effects of base pairing. The parameters from this inferred structure are passed to the SVM classifier, which distinguishes terminators from non-terminators that score high according to the terminator grammar. The SVM was trained with negative examples drawn from intergenic sequences that include both featureless and RNA gene regions (which were assigned prefix, stem-loop, suffix structure by the SCFG), so that it successfully distinguishes terminators from either of these. The classifier was found to be 96.4% successful during testing.
Siew, Joyce Phui Yee; Khan, Asif M; Tan, Paul T J; Koh, Judice L Y; Seah, Seng Hong; Koo, Chuay Yeng; Chai, Siaw Ching; Armugam, Arunmozhiarasi; Brusic, Vladimir; Jeyaseelan, Kandiah
2004-12-12
Sequence annotations, functional and structural data on snake venom neurotoxins (svNTXs) are scattered across multiple databases and literature sources. Sequence annotations and structural data are available in the public molecular databases, while functional data are almost exclusively available in the published articles. There is a need for a specialized svNTX database that contains NTX entries that are organized, well annotated and classified in a systematic manner. We have systematically analyzed svNTXs and classified them using structure-function groups based on their structural, functional and phylogenetic properties. Using conserved motifs in each phylogenetic group, we built an intelligent module for the prediction of structural and functional properties of unknown NTXs. We also developed an annotation tool to aid the functional prediction of newly identified NTXs as an additional resource for the venom research community. We created a searchable online database of NTX protein sequences (http://research.i2r.a-star.edu.sg/Templar/DB/snake_neurotoxin). This database can also be found under the Swiss-Prot Toxin Annotation Project website (http://www.expasy.org/sprot/).
Anisimov, Andrey P; Panfertsev, Evgeniy A; Svetoch, Tat'yana E; Dentovskaya, Svetlana V
2007-01-01
Sequencing of lcrV genes and comparison of the deduced amino acid sequences from ten Y. pestis strains belonging mostly to the group of atypical rhamnose-positive isolates (non-pestis subspecies or pestoides group) showed that the LcrV proteins analyzed could be classified into five sequence types. This classification was based on major amino acid polymorphisms among LcrV proteins in the four "hot points" of the protein sequences. Some additional minor polymorphisms were found throughout these sequence types. The "hot points" corresponded to amino acids 18 (Lys --> Asn), 72 (Lys --> Arg), 273 (Cys --> Ser), and 324-326 (Ser-Gly-Lys --> Arg) in the LcrV sequence of the reference Y. pestis strain CO92. One possible explanation for polymorphism in amino acid sequences of LcrV among different strains is that strain-specific variation resulted from adaptation of the plague pathogen to different rodent and lagomorph hosts.
Repliscan: a tool for classifying replication timing regions.
Zynda, Gregory J; Song, Jawon; Concia, Lorenzo; Wear, Emily E; Hanley-Bowdoin, Linda; Thompson, William F; Vaughn, Matthew W
2017-08-07
Replication timing experiments that use label incorporation and high throughput sequencing produce peaked data similar to ChIP-Seq experiments. However, the differences in experimental design, coverage density, and possible results make traditional ChIP-Seq analysis methods inappropriate for use with replication timing. To accurately detect and classify regions of replication across the genome, we present Repliscan. Repliscan robustly normalizes, automatically removes outlying and uninformative data points, and classifies Repli-seq signals into discrete combinations of replication signatures. The quality control steps and self-fitting methods make Repliscan generally applicable and more robust than previous methods that classify regions based on thresholds. Repliscan is simple and effective to use on organisms with different genome sizes. Even with analysis window sizes as small as 1 kilobase, reliable profiles can be generated with as little as 2.4x coverage.
Shittu, Ismaila; Sharma, Poonam; Volkening, Jeremy D.; Solomon, Ponman; Sulaiman, Lanre K.; Joannis, Tony M.; Williams-Coplin, Dawn; Miller, Patti J.; Dimitrov, Kiril M.
2016-01-01
The first complete genome sequence of a strain of Newcastle disease virus (NDV) from genotype XIV is reported here. Strain duck/Nigeria/NG-695/KG.LOM.11-16/2009 was isolated from an apparently healthy domestic duck from a live bird market in Kogi State, Nigeria, in 2009. This strain is classified as a member of subgenotype XIVb of class II. PMID:26823576
Complete Genome Sequence of Genotype VI Newcastle Disease Viruses Isolated from Pigeons in Pakistan
Wajid, Abdul; Rehmani, Shafqat Fatima; Sharma, Poonam; Goraichuk, Iryna V.; Dimitrov, Kiril M.
2016-01-01
Two complete genome sequences of Newcastle disease virus (NDV) are described here. Virulent isolates pigeon/Pakistan/Lahore/21A/2015 and pigeon/Pakistan/Lahore/25A/2015 were obtained from racing pigeons sampled in the Pakistani province of Punjab during 2015. Phylogenetic analysis of the fusion protein genes and complete genomes classified the isolates as members of NDV class II, genotype VI. PMID:27540069
An investigation of Hebbian phase sequences as assembly graphs
Almeida-Filho, Daniel G.; Lopes-dos-Santos, Vitor; Vasconcelos, Nivaldo A. P.; Miranda, José G. V.; Tort, Adriano B. L.; Ribeiro, Sidarta
2014-01-01
Hebb proposed that synapses between neurons that fire synchronously are strengthened, forming cell assemblies and phase sequences. The former, on a shorter scale, are ensembles of synchronized cells that function transiently as a closed processing system; the latter, on a larger scale, correspond to the sequential activation of cell assemblies able to represent percepts and behaviors. Nowadays, the recording of large neuronal populations allows for the detection of multiple cell assemblies. Within Hebb's theory, the next logical step is the analysis of phase sequences. Here we detected phase sequences as consecutive assembly activation patterns, and then analyzed their graph attributes in relation to behavior. We investigated action potentials recorded from the adult rat hippocampus and neocortex before, during and after novel object exploration (experimental periods). Within assembly graphs, each assembly corresponded to a node, and each edge corresponded to the temporal sequence of consecutive node activations. The sum of all assembly activations was proportional to firing rates, but the activity of individual assemblies was not. Assembly repertoire was stable across experimental periods, suggesting that novel experience does not create new assemblies in the adult rat. Assembly graph attributes, on the other hand, varied significantly across behavioral states and experimental periods, and were separable enough to correctly classify experimental periods (Naïve Bayes classifier; maximum AUROCs ranging from 0.55 to 0.99) and behavioral states (waking, slow wave sleep, and rapid eye movement sleep; maximum AUROCs ranging from 0.64 to 0.98). Our findings agree with Hebb's view that assemblies correspond to primitive building blocks of representation, nearly unchanged in the adult, while phase sequences are labile across behavioral states and change after novel experience. The results are compatible with a role for phase sequences in behavior and cognition. 
PMID:24782715
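The assembly-graph construction described in this abstract (each assembly a node, each edge a pair of consecutive activations) can be sketched as follows. The input format, a time-ordered list of assembly labels, and the function name `assembly_graph` are assumptions made for illustration; the paper's actual detection pipeline and graph attributes are not reproduced here.

```python
from collections import Counter


def assembly_graph(activations):
    """Build a directed, weighted graph from a time-ordered list of
    assembly activations.

    Nodes are the distinct assemblies; the weight of edge (u, v) counts
    how often an activation of v immediately followed an activation of u.
    """
    nodes = set(activations)
    edges = Counter(zip(activations, activations[1:]))
    return nodes, edges


# Hypothetical activation sequence of three assemblies.
sequence = ["A1", "A2", "A1", "A2", "A3"]
nodes, edges = assembly_graph(sequence)
```

Graph attributes such as the number of distinct edges or the edge-weight distribution can then be read off `edges` and compared across experimental periods.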
Daniel, Hubert D-J; David, Joel; Raghuraman, Sukanya; Gnanamony, Manu; Chandy, George M; Sridharan, Gopalan; Abraham, Priya
2017-05-01
Based on genetic heterogeneity, hepatitis C virus (HCV) is classified into seven major genotypes and 64 subtypes. In spite of the sequence heterogeneity, all genotypes share an identical complement of colinear genes within the large open reading frame. The genetic interrelationships between these genes are consistent among genotypes. Due to this property, complete sequencing of the HCV genome is not required. HCV genotypes along with subtypes are critical for planning antiviral therapy. Certain genotypes are also associated with higher progression to liver cirrhosis. In this study, 100 blood samples were collected from individuals who came for routine HCV genotype identification. These samples were used for the comparison of two different genotyping methods (5'NCR PCR-RFLP and HCV core type-specific PCR) with NS5b sequencing. Of the 100 samples genotyped using 5'NCR PCR-RFLP and HCV core type-specific PCR, 90% (κ = 0.913, P < 0.00) and 96% (κ = 0.794, P < 0.00) correlated with NS5b sequencing, respectively. Sixty percent and 75% of discordant samples by 5'NCR PCR-RFLP and HCV core type-specific PCR, respectively, belonged to genotype 6. All the HCV genotype 1 subtypes were classified accurately by both the methods. This study shows that the 5'NCR-based PCR-RFLP and the HCV core type-specific PCR-based assays correctly identified HCV genotypes except genotype 6 from this region. Direct sequencing of the HCV core region was able to identify all the genotype 6 from this region and serves as an alternative to NS5b sequencing. © 2016 Wiley Periodicals, Inc.
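The agreement statistic κ reported above is Cohen's kappa, which corrects the raw percentage agreement between two genotyping methods for the agreement expected by chance. The sketch below shows the standard calculation on made-up genotype calls; the data and the function name `cohen_kappa` are illustrative and not taken from the study.

```python
from collections import Counter


def cohen_kappa(calls_a, calls_b):
    """Cohen's kappa for two equal-length lists of categorical calls."""
    if len(calls_a) != len(calls_b) or not calls_a:
        raise ValueError("need two non-empty lists of equal length")
    n = len(calls_a)
    # Observed agreement: fraction of samples with identical calls.
    observed = sum(a == b for a, b in zip(calls_a, calls_b)) / n
    # Chance agreement: product of the marginal frequencies per category.
    freq_a, freq_b = Counter(calls_a), Counter(calls_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both methods always give the same single category
    return (observed - expected) / (1.0 - expected)


# Made-up genotype calls from a reference method and a test assay.
reference = ["1", "1", "3", "3", "6", "6"]
assay = ["1", "1", "3", "3", "6", "3"]
kappa = cohen_kappa(reference, assay)
```

In this toy case the raw agreement is 5/6, the chance agreement is 1/3, and kappa is 0.75.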
Vision-based posture recognition using an ensemble classifier and a vote filter
NASA Astrophysics Data System (ADS)
Ji, Peng; Wu, Changcheng; Xu, Xiaonong; Song, Aiguo; Li, Huijun
2016-10-01
Posture recognition is an important means of Human-Robot Interaction (HRI). To segment the effective posture from an image, we propose an improved region grow algorithm combined with the Single Gauss Color Model. Experiments show that the improved region grow algorithm yields a more complete and accurate posture than the traditional Single Gauss Model and region grow algorithm, and that it can eliminate similar regions from the background at the same time. In the posture recognition part, to improve the recognition rate, we propose a CNN ensemble classifier; to reduce misjudgments during continuous gesture control, a vote filter is proposed and applied to the sequence of recognition results. The proposed CNN ensemble classifier yields a 96.27% recognition rate, which is better than that of a single CNN classifier, and the proposed vote filter can improve the recognition result and reduce misjudgments during consecutive gesture switches.
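A vote filter over a sequence of recognition results can be sketched as a sliding-window majority vote. The abstract gives no details of the authors' filter, so the window length, the tie-breaking behaviour, and the name `vote_filter` are assumptions for illustration only.

```python
from collections import Counter, deque


def vote_filter(predictions, window=5):
    """Smooth a stream of per-frame labels by majority vote over the
    most recent `window` predictions (fewer at the start of the stream).

    Ties are broken in favour of the label that entered the window first.
    """
    recent = deque(maxlen=window)
    smoothed = []
    for label in predictions:
        recent.append(label)
        smoothed.append(Counter(recent).most_common(1)[0][0])
    return smoothed


# A single spurious "palm" frame is suppressed; the real switch to
# "palm" is accepted once it holds the majority of the window.
frames = ["fist", "fist", "palm", "fist", "fist", "palm", "palm", "palm"]
stable = vote_filter(frames, window=3)
```

The trade-off is latency: a longer window rejects more isolated misclassifications but delays the response to a genuine gesture switch.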
Castejon, Maria; Menéndez, Maria Carmen; Comas, Iñaki; Vicente, Ana; Garcia, Maria J
2018-06-01
Bacterial whole-genome sequences contain informative features of their evolutionary pathways. Comparison of whole-genome sequences has become the method of choice for classification of prokaryotes, thus allowing the identification of bacteria from an evolutionary perspective, and providing data to resolve some current controversies. Currently, controversy exists about the assignment of members of the Mycobacterium avium complex, as is the case for Mycobacterium yongonense and 'Mycobacterium indicus pranii'. These two mycobacteria, closely related to Mycobacterium intracellulare on the basis of standard phenotypic and single-gene sequence comparisons, were not considered members of that species on the basis of some particular differences displayed by a single strain. Whole-genome sequence comparison procedures, namely the average nucleotide identity and the genome distance, showed that those two mycobacteria should be considered members of the species M. intracellulare. The results were confirmed with other supplementary whole-genome comparison methods. According to the data provided, Mycobacterium yongonense and 'Mycobacterium indicus pranii' should be renamed and included as members of M. intracellulare. This study highlights the problems caused when a novel species is accepted on the basis of a single strain, as was the case for M. yongonense. Based mainly on whole-genome sequence analysis, we conclude that M. yongonense should be reclassified as a subspecies of Mycobacterium intracellulare, as Mycobacterium intracellulare subsp. yongonense, and 'Mycobacterium indicus pranii' classified in the same subspecies as the type strain of Mycobacterium intracellulare, as Mycobacterium intracellulare subsp. intracellulare.
Characterization of occult hepatitis B virus infection among HIV positive patients in Cameroon.
Gachara, George; Magoro, Tshifhiwa; Mavhandu, Lufuno; Lum, Emmaculate; Kimbi, Helen K; Ndip, Roland N; Bessong, Pascal O
2017-03-08
Occult hepatitis B infection (OBI) among HIV positive patients varies widely in different geographic regions. We undertook a study to determine the prevalence of occult hepatitis B infection among HIV infected individuals visiting a health facility in South West Cameroon and characterized occult HBV strains based on sequence analyses. Plasma samples (n = 337), which previously tested negative for hepatitis B surface antigen (HBsAg), were screened for antibodies against hepatitis B core (anti-HBc) and surface (anti-HBs) antigens followed by DNA extraction. A 366 bp region covering the overlapping surface/polymerase gene of HBV was then amplified in a nested PCR and the amplicons sequenced using Sanger sequencing. The resulting sequences were then analyzed for genotypes and for escape and drug resistance mutations. Twenty samples were HBV DNA positive and were classified as OBI giving a prevalence of 5.9%. Out of these, 9 (45%) were anti-HBs positive, while 10 (52.6%) were anti-HBc positive. Additionally, 2 had dual anti-HBs and anti-HBc reactivity, while 6 had no detectable HBV antibodies. Out of the ten samples that were successfully sequenced, nine were classified as genotype E and one as genotype A. Three sequences possessed mutations associated with lamivudine resistance. We detected a number of mutations within the major hydrophilic region of the surface gene where most immune escape mutations occur. Findings from this study show the presence of hepatitis B in patients without any of the HBV serological markers. Further prospective studies are required to determine the risk factors and markers of OBI.
Tsuji, K; Tsien, H C; Hanson, R S; DePalma, S R; Scholtz, R; LaRoche, S
1990-01-01
16S ribosomal RNAs (rRNA) of 12 methylotrophic bacteria have been almost completely sequenced to establish their phylogenetic relationships. Methylotrophs that are physiologically related are phylogenetically diverse and are scattered among the purple eubacteria (class Proteobacteria). Group I methylotrophs can be classified in the beta- and the gamma-subdivisions and group II methylotrophs in the alpha-subdivision of the purple eubacteria, respectively. Pink-pigmented facultative and non-pigmented obligate group II methylotrophs form two distinctly separate branches within the alpha-subdivision. The secondary structures of the 16S rRNA sequences of 'Methylocystis parvus' strain OBBP, 'Methylosinus trichosporium' strain OB3b, 'Methylosporovibrio methanica' strain 81Z and Hyphomicrobium sp. strain DM2 are similar, and these non-pigmented obligate group II methylotrophs form one tight cluster in the alpha-subdivision. The pink-pigmented facultative methylotrophs, Methylobacterium extorquens strain AM1, Methylobacterium sp. strain DM4 and Methylobacterium organophilum strain XX form another cluster within the alpha-subdivision. Although similar in phenotypic characteristics, Methylobacterium organophilum strain XX and Methylobacterium extorquens strain AM1 are clearly distinguishable by their 16S rRNA sequences. The group I methylotrophs, Methylophilus methylotrophus strain AS1 and methylotrophic species DM11, which do not utilize methane, are similar in 16S rRNA sequence to bacteria in the beta-subdivision. The methane-utilizing, obligate group I methanotrophs, Methylococcus capsulatus strain BATH and Methylomonas methanica, are placed in the gamma-subdivision. The results demonstrate that it is possible to distinguish and classify the methylotrophic bacteria using 16S rRNA sequence analysis.
USDA-ARS's Scientific Manuscript database
MOCASSIN-prot is a software tool, implemented in Perl and Matlab, for constructing protein similarity networks to classify proteins. Both domain composition and quantitative sequence similarity information are utilized in constructing the directed protein similarity networks. For each reference protein i...
A comprehensive simulation study on classification of RNA-Seq data.
Zararsız, Gökmen; Goksuluk, Dincer; Korkmaz, Selcuk; Eldem, Vahap; Zararsiz, Gozde Erturk; Duru, Izzet Parug; Ozturk, Ahmet
2017-01-01
RNA sequencing (RNA-Seq) is a powerful technique for the gene-expression profiling of organisms that uses the capabilities of next-generation sequencing technologies. Developing gene-expression-based classification algorithms is an emerging powerful method for diagnosis, disease classification and monitoring at the molecular level, as well as for providing potential markers of diseases. Most of the statistical methods proposed for the classification of gene-expression data are either based on a continuous scale (e.g., microarray data) or require a normal distribution assumption. Hence, these methods cannot be directly applied to RNA-Seq data since they violate both data structure and distributional assumptions. However, it is possible to apply these algorithms with appropriate modifications to RNA-Seq data. One way is to develop count-based classifiers, such as Poisson linear discriminant analysis (PLDA) and negative binomial linear discriminant analysis (NBLDA). Another way is to bring the data closer to microarrays and apply microarray-based classifiers. In this study, we compared several classifiers including PLDA with and without power transformation, NBLDA, single SVM, bagging SVM (bagSVM), classification and regression trees (CART), and random forests (RF). We also examined the effect of several parameters such as overdispersion, sample size, number of genes, number of classes, differential-expression rate, and the transformation method on model performances. A comprehensive simulation study is conducted and the results are compared with the results of two miRNA and two mRNA experimental datasets. The results revealed that increasing the sample size and differential-expression rate and decreasing the dispersion parameter and number of groups lead to an increase in classification accuracy. As in differential-expression studies, the classification of RNA-Seq data requires careful attention when handling data overdispersion.
We conclude that, as a count-based classifier, the power transformed PLDA and, as a microarray-based classifier, vst or rlog transformed RF and SVM classifiers may be a good choice for classification. An R/BIOCONDUCTOR package, MLSeq, is freely available at https://www.bioconductor.org/packages/release/bioc/html/MLSeq.html.
ITS2 data corroborate a monophyletic chlorophycean DO-group (Sphaeropleales)
2008-01-01
Background Within Chlorophyceae the ITS2 secondary structure shows an unbranched helix I, except for the 'Hydrodictyon' and the 'Scenedesmus' clade having a ramified first helix. The latter two are classified within the Sphaeropleales, characterised by directly opposed basal bodies in their flagellar apparatuses (DO-group). Previous studies could not resolve the taxonomic position of the 'Sphaeroplea' clade within the Chlorophyceae without ambiguity and two pivotal questions remain open: (1) Is the DO-group monophyletic and (2) is a branched helix I an apomorphic feature of the DO-group? In the present study we analysed the secondary structure of three newly obtained ITS2 sequences classified within the 'Sphaeroplea' clade and resolved sphaeroplealean relationships by applying different phylogenetic approaches based on a combined sequence-structure alignment. Results The newly obtained ITS2 sequences of Ankyra judayi, Atractomorpha porcata and Sphaeroplea annulina of the 'Sphaeroplea' clade do not show any branching in the secondary structure of their helix I. All applied phylogenetic methods highly support the 'Sphaeroplea' clade as a sister group to the 'core Sphaeropleales'. Thus, the DO-group is monophyletic. Furthermore, based on characteristics in the sequence-structure alignment one is able to distinguish distinct lineages within the green algae. Conclusion In green algae, a branched helix I in the secondary structure of the ITS2 evolves past the 'Sphaeroplea' clade. A branched helix I is an apomorph characteristic within the monophyletic DO-group. Our results corroborate the fundamental relevance of including the secondary structure in sequence analysis and phylogenetics. PMID:18655698
Bào, Yīmíng; Kuhn, Jens H
2018-01-01
During the last decade, genome sequence-based classification of viruses has become increasingly prominent. Viruses can even be classified based on coding-complete genome sequence data alone. Nevertheless, classification remains arduous as experts are required to establish phylogenetic trees to depict the evolutionary relationships of such sequences for preliminary taxonomic placement. Pairwise sequence comparison (PASC) of genomes is one of several novel methods for establishing relationships among viruses. This method, provided by the US National Center for Biotechnology Information as an open-access tool, circumvents phylogenetics, and yet PASC results are often in agreement with those of phylogenetic analyses. Computationally inexpensive, PASC can be easily performed by non-taxonomists. Here we describe how to use the PASC tool for the preliminary classification of novel viral hemorrhagic fever-causing viruses.
Adapting Pipeline Architectures to Track Developing Aftershock Sequences and Recurrent Explosions
2014-02-14
Sumatra earthquake was used to study the performance of subspace detectors to detect and classify events from within a very large (Area = ~250,000 km2... detectors to identify and organize repeating waveforms discovered in multichannel seismic data streams. The framework has been tested and evaluated on...a variety of different test cases from mining blasts in Central Asia to moderate and large earthquake aftershock sequences. The framework performs
Wajid, Abdul; Rehmani, Shafqat F.; Wasim, Muhammad; Basharat, Asma; Bibi, Tasra; Arif, Saima; Dimitrov, Kiril M.
2016-01-01
Here, we report the complete genome sequence of a virulent Newcastle disease virus (vNDV) strain, duck/Pakistan/Lahore/AW-123/2015, isolated from apparently healthy laying ducks (Anas platyrhynchos domesticus) from the province of Punjab, Pakistan. The virus has a genome length of 15,192 nucleotides and is classified as member of subgenotype VIIi, class II. PMID:27469959
USDA-ARS's Scientific Manuscript database
This chapter describes the ascomycetous yeast genus Naumovozyma, which was recognized from multigene deoxyribonucleic acid (DNA) sequence analysis. The genus has two described species, which were formerly classified in the genus Saccharomyces. The species reproduce by multilateral budding but do not...
Novel heterozygous NOTCH3 pathogenic variant found in two Chinese patients with CADASIL.
Li, Shufeng; Chen, Yifan; Shan, Haitao; Ma, Fang; Shi, Minke; Xue, Jun
2017-12-01
NOTCH3 mutations have been described to cause cerebral autosomal dominant arteriopathy with subcortical infarcts and leukoencephalopathy (CADASIL). Here, we report 2 CADASIL patients from a Chinese family. Whole genome sequencing was performed on the two CADASIL patients. The novel variant c.128G>C in exon 2 of NOTCH3 was identified and confirmed through PCR-Sanger sequencing (Human Genome Variation Society nomenclature: HGVS: NOTCH3 c.128G>C; p.Cys43Ser). The heterozygous NOTCH3 variant causes a cysteine to serine substitution at codon 43. According to the variant interpretation guideline of the American College of Medical Genetics and Genomics (ACMG), this variant was classified as "pathogenic". Other variants in HTRA1, COL4A1 and COL4A2 were also found; they were classified as "benign". Copyright © 2017 Elsevier Ltd. All rights reserved.
Lysine acetylation sites prediction using an ensemble of support vector machine classifiers.
Xu, Yan; Wang, Xiao-Bo; Ding, Jun; Wu, Ling-Yun; Deng, Nai-Yang
2010-05-07
Lysine acetylation is an essentially reversible and highly regulated post-translational modification which regulates diverse protein properties. Experimental identification of acetylation sites is laborious and expensive. Hence, there is significant interest in the development of computational methods for reliable prediction of acetylation sites from amino acid sequences. In this paper we use an ensemble of support vector machine classifiers to perform this work. The experimentally determined lysine acetylation sites are extracted from the Swiss-Prot database and the scientific literature. Experimental results show that an ensemble of support vector machine classifiers outperforms a single support vector machine classifier and other computational methods such as PAIL and LysAcet on the problem of predicting lysine acetylation sites. The resulting method has been implemented in EnsemblePail, a web server for lysine acetylation sites prediction available at http://www.aporc.org/EnsemblePail/. Copyright (c) 2010 Elsevier Ltd. All rights reserved.
Combining MLC and SVM Classifiers for Learning Based Decision Making: Analysis and Evaluations
Zhang, Yi; Ren, Jinchang; Jiang, Jianmin
2015-01-01
Maximum likelihood classifier (MLC) and support vector machines (SVM) are two commonly used approaches in machine learning. MLC is based on Bayesian theory in estimating parameters of a probabilistic model, whilst SVM is an optimization based nonparametric method in this context. Recently, it is found that SVM in some cases is equivalent to MLC in probabilistically modeling the learning process. In this paper, MLC and SVM are combined in learning and classification, which helps to yield probabilistic output for SVM and facilitate soft decision making. In total four groups of data are used for evaluations, covering sonar, vehicle, breast cancer, and DNA sequences. The data samples are characterized in terms of Gaussian/non-Gaussian distributed and balanced/unbalanced samples which are then further used for performance assessment in comparing the SVM and the combined SVM-MLC classifier. Interesting results are reported to indicate how the combined classifier may work under various conditions. PMID:26089862
Cho, Ming-Yuan; Hoang, Thi Thom
2017-01-01
Fast and accurate fault classification is essential to power system operations. In this paper, in order to classify electrical faults in radial distribution systems, a particle swarm optimization (PSO) based support vector machine (SVM) classifier has been proposed. The proposed PSO based SVM classifier is able to select appropriate input features and optimize SVM parameters to increase classification accuracy. Further, a time-domain reflectometry (TDR) method with a pseudorandom binary sequence (PRBS) stimulus has been used to generate a dataset for purposes of classification. The proposed technique has been tested on a typical radial distribution network to identify ten different types of faults considering 12 given input features generated by using Simulink software and MATLAB Toolbox. The success rate of the SVM classifier is over 97%, which demonstrates the effectiveness and high efficiency of the developed method.
Cruz, V P; Oliveira, C; Foresti, F
2015-01-01
5S rDNA genes of the stingray Potamotrygon motoro were PCR replicated, purified, cloned and sequenced. Two distinct classes of segments of different sizes were obtained. The smallest, with 342 bp units, was classified as class I, and the largest, with 1900 bp units, was designated as class II. Alignment with the consensus sequences for both classes showed changes in a few bases in the 5S rDNA genes. TATA-like sequences were detected in the nontranscribed spacer (NTS) regions of class I and a microsatellite (GCT)10 sequence was detected in the NTS region of class II. The results obtained can help to understand the molecular organization of ribosomal genes and the mechanism of gene dispersion.
Prediction and Identification of Krüppel-Like Transcription Factors by Machine Learning Method.
Liao, Zhijun; Wang, Xinrui; Chen, Xingyong; Zou, Quan
2017-01-01
The Krüppel-like factors (KLFs) are a family of zinc finger (ZF) motif-containing transcription factors with 18 members in the human genome; among them, KLF18 is predicted by bioinformatics. KLFs have various physiological functions and are involved in a number of cancers and other diseases. Here we perform a binary classification of KLFs and non-KLFs by machine learning methods. The protein sequences of KLFs and non-KLFs were retrieved from UniProt and randomly separated into a training dataset (containing positive and negative sequences) and a test dataset (containing only negative sequences). After extracting 188-dimensional (188D) feature vectors, we carried out classification with four classifiers (GBDT, libSVM, RF, and k-NN). For the human KLFs, we further examined the evolutionary relationships and motif distribution, and finally analyzed the conserved amino acid residues of the three zinc fingers. The classifier models built from the training dataset were well constructed; the highest specificity (Sp) was 99.83%, obtained with a library for support vector machines (libSVM), and all correct classification rates were over 70% for 10-fold cross-validation on the test dataset. The 18 human KLFs can be further divided into 7 groups. The zinc finger domains are located at the carboxyl terminus and contain many conserved amino acid residues, including cysteine and histidine, and the span and interval between them are consistent across the three ZF domains. Two classification models for KLF prediction have been built by novel machine learning methods.
A smart phone-based pocket fall accident detection, positioning, and rescue system.
Kau, Lih-Jen; Chen, Chih-Sheng
2015-01-01
In this paper we propose a novel algorithm and architecture for fall accident detection and a corresponding wide-area rescue system based on a smart phone and third generation (3G) networks. To realize the fall detection algorithm, the angles acquired by the electronic compass (ecompass) and the waveform sequence of the triaxial accelerometer on the smart phone are used as the system inputs. The acquired signals are used to generate an ordered feature sequence, which is then examined sequentially by the proposed cascade classifier for recognition. Once the corresponding feature is verified by the classifier at the current state, the system proceeds to the next state; otherwise, it resets to the initial state and waits for the appearance of another feature sequence. Once a fall accident event is detected, the user's position can be acquired by the global positioning system (GPS) or assisted GPS and sent to the rescue center via the 3G communication network so that the user can get medical help immediately. With the proposed cascaded classification architecture, the computational burden and power consumption on the smart phone can be alleviated. Moreover, the experiments show that the proposed cascaded classifier achieves a fall detection accuracy of up to 92% in sensitivity and 99.75% in specificity on a set of 450 test actions covering nine different kinds of activities, which supports the effectiveness of the proposed algorithm.
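The state-by-state verification described in this abstract can be sketched as a small state machine. This is only an illustrative sketch, not the authors' implementation: the function name `run_cascade`, the use of a single acceleration magnitude per sample, and the three threshold checks are all hypothetical and are not taken from the paper.

```python
from typing import Callable, Sequence

# Hypothetical stage check: takes one feature value and returns True when the
# feature expected at that stage of the cascade is verified.
Stage = Callable[[float], bool]


def run_cascade(features: Sequence[float], stages: Sequence[Stage]) -> bool:
    """Return True if every stage is verified, in order, on consecutive features."""
    state = 0  # index of the stage currently awaited
    for value in features:
        if stages[state](value):
            state += 1  # feature verified: proceed to the next state
            if state == len(stages):
                return True  # all stages passed in order: event detected
        else:
            state = 0  # verification failed: reset to the initial state
    return False


# Illustrative stages only (acceleration magnitude in g, thresholds invented):
# free fall, then impact, then rest.
example_stages = [
    lambda a: a < 0.5,
    lambda a: a > 2.5,
    lambda a: 0.8 <= a <= 1.2,
]

detected = run_cascade([1.0, 0.3, 3.1, 1.0], example_stages)
not_detected = run_cascade([1.0, 0.3, 1.0, 3.1], example_stages)
```

Because later stages are evaluated only after earlier ones pass, most ordinary samples are rejected by the first check alone, which is one way a cascade can keep computation low on a phone.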
Stream Dissolved Organic Matter Quantity and Quality Along a Wetland-Cropland Catchment Gradient
NASA Astrophysics Data System (ADS)
McDonough, O.; Hosen, J. D.; Lang, M. W.; Oesterling, R.; Palmer, M.
2012-12-01
Wetlands may be critical sources of dissolved organic matter (DOM) to stream networks. Yet, more than half of wetlands in the continental United States have been lost since European settlement, with the majority of loss attributed to agriculture. The degree to which agricultural loss of wetlands impacts stream DOM is largely unknown and may have important ecological implications. Using twenty headwater catchments on the Delmarva Peninsula (Maryland, USA), we investigated the seasonal influence of wetland and cropland coverage on downstream DOM quantity and quality. In addition to quantifying bulk downstream dissolved organic carbon (DOC) concentration, we used a suite of DOM UV-absorbance metrics and parallel factor analysis (PARAFAC) modeling of excitation-emission fluorescence spectra (EEMs) to characterize DOM composition. Percent bioavailable DOC (%BDOC) was measured during the Spring sampling using a 28-day incubation. Percent wetland coverage and % cropland within the watersheds were significantly negatively correlated (r = -0.93, p < 0.001). Results show that % wetland coverage was positively correlated with stream DOM concentration, molecular weight, aromaticity, humic-like fluorescence, and allochthonous origin. Conversely, increased wetland coverage was negatively correlated with stream DOM protein-like fluorescence. Percent BDOC decreased with DOM humic-like fluorescence and increased with protein-like fluorescence. We observed minimal seasonal interaction between % wetland coverage and DOM concentration and composition across Spring, Fall, and Winter sampling seasons. However, principal component analysis suggested more pronounced seasonal differences exist in stream DOM. This study highlights the influence of wetlands on downstream DOM in agriculturally impacted landscapes where loss of wetlands to cultivation may significantly alter stream DOM quantity and quality.
Li, Kun; Wang, Jianxing; Liu, Jibao; Wei, Yuansong; Chen, Meixue
2016-05-01
Municipal sewage from an oxidation ditch was treated for reuse by nanofiltration (NF) in this study. The NF performance was optimized, and its fouling characteristics after different operational durations (i.e., 48 and 169 hr) were analyzed to investigate the applicability of nanofiltration for water reuse. The optimum performance was achieved at a transmembrane pressure of 12 bar, pH 4 and a flow rate of 8 L/min using a GE membrane. The permeate water quality could satisfy the requirements of water reclamation for different uses and local standards for water reuse in Beijing. Flux decline in the fouling experiments could be divided into a rapid flux decline and a quasi-steady state. The boundary flux theory was used to predict the evolution of permeate flux. The expected operational duration based on the 169-hr experiment was 392.6 hr, which is 175% longer than that of the 48-hr one. High molecular weight (MW) protein-like substances were suggested to be the dominant foulants after an extended period based on the MW distribution and the fluorescence characteristics. The analyses of infrared spectra and extracellular polymeric substances revealed that the roles of both humic- and polysaccharide-like substances diminished, while that of protein-like substances strengthened, in their contribution to membrane fouling over time. Inorganic salts were found to have only a marginal influence on membrane fouling. Additionally, alkali washing was more efficient at removing organic foulants in the long term, and a combination of water flushing and alkali washing was appropriate for NF fouling control in municipal sewage treatment.
Owada, Yuki; Yonechi, Atsushi; Higuchi, Mitsunori; Suzuki, Hiroyuki
2016-03-10
Ground-glass nodules on CT images have been thought to indicate less aggressive tumors in lung cancer. Echinoderm microtubule-associated protein-like 4-anaplastic lymphoma kinase (EML4-ALK)-positive lung cancer presenting with ground-glass nodules (GGNs) is relatively rare, and few such cases have been reported. An asymptomatic 56-year-old woman exhibited a 1.1-cm GGN in the lower lobe of the left lung on computed tomography during a medical checkup. Positron emission tomography showed no difference in uptake by the nodule compared with other organs. We elected to perform surgery because the nodule included a solid component and had grown only slightly during the last 2 years according to thin-section computed tomography. Partial resection of the lower left lung was performed by video-assisted thoracic surgery. Pathological examination revealed mucus-producing high columnar epithelium forming an irregular tubular-acinar-like structure partly replacing the alveolar epithelium on hematoxylin and eosin staining. More than 50% of the tumor demonstrated a lepidic growth pattern. The tumor was negative for epidermal growth factor receptor mutation but positive for the EML4-ALK fusion oncogene according to fluorescence in situ hybridization. We herein report a case of EML4-ALK-positive lung cancer presenting with a GGN along with a review of the relevant literature, including histopathological findings and imaging features. We consider that EML4-ALK-positive lung cancer is often highly progressive and that careful follow-up is therefore essential in these patients.
Yang, Liyang; Hur, Jin; Zhuang, Wane
2015-05-01
Fluorescence excitation emission matrices-parallel factor analysis (EEM-PARAFAC) is a powerful tool for characterizing dissolved organic matter (DOM), and it is applied in a rapidly growing number of studies on drinking water and wastewater treatments. This paper presents an overview of recent findings about the occurrence and behavior of PARAFAC components in drinking water and wastewater treatments, as well as their feasibility for assessing the treatment performance and water quality including disinfection by-product formation potentials (DBPs FPs). A variety of humic-like, protein-like, and unique (e.g., pyrene-like) fluorescent components have been identified, providing valuable insights into the chemical composition of DOM and the effects of various treatment processes in engineered systems. Coagulation/flocculation-clarification preferentially removes humic-like components, and additional treatments such as biological activated carbon filtration, anion exchange, and UV irradiation can further remove DOM from drinking water. In contrast, biological treatments are more effective for protein-like components in wastewater treatments. PARAFAC components have proven valuable as surrogates for conventional water quality parameters to track the changes of organic matter quantity and quality in drinking water and wastewater treatments. They are also feasible for assessing the formation of trihalomethanes and other DBPs and for evaluating treatment system performance. Further studies of EEM-PARAFAC for assessing the effects of the raw water quality and variable treatment conditions on the removal of DOM, and the formation potentials of various emerging DBPs, are essential for optimizing the treatment processes to ensure treated water quality.
Cheng, Yuan-yue; Guo, Wei-dong; Long, Ai-min; Chen, Shao-yong
2010-09-01
The optical characteristics of chromophoric dissolved organic matter (CDOM) were determined in rain samples collected on Xiamen Island during a rainy season in 2007, using fluorescence excitation-emission matrix spectroscopy (EEMs) together with UV-Vis absorbance spectra. Results showed that the absorbance of CDOM in rain samples decreased exponentially with wavelength. The absorbance coefficient at 300 nm [a(300)], which can serve as an index of CDOM abundance, ranged from 0.27 to 3.45 m⁻¹, with a mean value of 1.08 m⁻¹. The CDOM content was higher in the earlier stage of precipitation events than in the later stage, which implied that anthropogenic sources, atmospheric pollution, or air mass types were important contributors to CDOM levels in precipitation. The EEMs showed 4 types of fluorescence signals (2 humic-like fluorescence peaks and 2 protein-like fluorescence peaks) in rainwater samples, and there were significant positive correlations of peak A with C and peak B with S, indicating common sources or some relationship between the two humic-like substances and between the two protein-like substances. The strong positive correlations of the two humic-like fluorescence peaks with a(300) suggested that the chromophores responsible for absorbance might be the same as the fluorophores responsible for fluorescence. Results showed that the presence of highly absorbing and fluorescing CDOM in rainwater is of significant importance in atmospheric chemistry and might play a previously unrecognized role in the wavelength-dependent spectral attenuation of solar radiation by atmospheric waters.
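The exponential decrease of CDOM absorbance with wavelength reported above is commonly written in the following form. This is the standard single-exponential model rather than a formula quoted from the abstract; the spectral slope $S$ and the choice of 300 nm as reference wavelength $\lambda_0$ are assumptions made here for illustration, and no value of $S$ is reported in the abstract.

```latex
a(\lambda) = a(\lambda_0)\, e^{-S(\lambda - \lambda_0)}, \qquad \lambda_0 = 300~\text{nm}
```

Here $a(\lambda)$ is the absorbance coefficient in m$^{-1}$ and $S$ (in nm$^{-1}$) describes how quickly absorbance falls off toward longer wavelengths.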
Su, Yaling; Chen, Feizhou; Liu, Zhengwen
2015-05-01
Here we investigated absorption and fluorescence properties of chromophoric dissolved organic matter (CDOM) in 15 alpine lakes located below or above the tree line to determine its source and composition. The results indicate that the concentrations of CDOM in below-tree-line lakes are significantly higher than in above-tree-line lakes, as evidenced from the absorption coefficients of a250 and a365. The intensities of the protein-like and humic-like fluorescence in below-tree-line lakes are higher than in above-tree-line lakes as well. Three fluorescent components were identified using parallel factor analysis (PARAFAC) modelling. Component 1 is probably associated with biological degradation of terrestrial humic component. The terrestrial humic-like component 2 is only found in below-tree-line lakes. The protein-like or phenolic component 3 is dominant in above-tree-line lakes, which is probably more derived from autochthonous origin. In this study, (1) higher a250/a365 and S275-295 values indicate smaller molecular weights of CDOM in above-tree-line lakes than in below-tree-line lakes, and smaller molecular weights at the surface than at 2.0 m depth; (2) SUVA254 and FI255 results provide evidence of lower percent aromaticity of CDOM in above-tree-line lakes; and (3) FI310 and FI370 suggest a strong allochthonous origin at the surface in below-tree-line lakes, and more contribution from autochthonous biological and aquatic bacterial origin in above-tree-line lakes.
Zhuo, Jian-Fu; Guo, Wei-Dong; Deng, Xun; Zhang, Zhi-Ying; Xu, Jing; Huang, Ling-Feng
2010-06-01
Fluorescence excitation-emission matrix spectroscopy (EEMs) combined with absorption spectroscopy was applied to study the optical properties of CDOM samples from the highly polluted Yundang Lagoon in Xiamen in order to demonstrate the feasibility of using these spectral properties as a tracer of the degree of organic pollution in similar polluted coastal waters. Surface water samples were collected from 13 stations 4 times during April and May, 2008. A parallel factor analysis (PARAFAC) model was used to resolve the EEMs of CDOM. Five separate fluorescent components were identified, including two humic-like components (C1: 240, 325/422 nm; C5: 260, 380/474 nm), two protein-like components (C2: 225, 275/350 nm; C4: 240, 300/354 nm) and one xenobiotic-like component (C3: 225/342 nm), which could be used as a good tracer for the input of anthropogenic organic pollutants. The concentrations of component C3 and dissolved organic carbon (DOC) are much higher near the inlet of sewage discharge, demonstrating that the discharge of surrounding sewage is a major source of organic pollutants in Yundang Lagoon. CDOM absorption coefficient alpha (280) and the score of humic-like component C1 showed significant linear relationships with COD(Mn), and a strong positive correlation was also found between the score of protein-like component C2 and BOD5. This suggested that the optical properties of CDOM may provide a fast in-situ way to monitor the variation of the water quality in Yundang Lagoon and in similar polluted coastal waters.
NASA Astrophysics Data System (ADS)
Lajtha, K.; Lee, B. S.
2015-12-01
Dissolved organic matter (DOM) is a critical component of the carbon cycle linking terrestrial and aquatic ecosystems, yet DOM composition representative of DOM sources at headwater catchments in the western U.S. is poorly understood. This study examined the effect of forest management history and hydrologic patterns on DOM chemistry at nine experimental watersheds located in the H.J. Andrews Long Term Ecological Research Experimental Forest of the Oregon Cascades. Stream water samples representing a three-week composite of each watershed were collected between May 2013 and February 2015 (32 events). DOM chemistry was characterized by examining UV and fluorescent properties of stream samples. Specific UV absorbance at 254 nm (SUVA254; Weishaar et al. 2003), generally indicative of aromaticity, showed the lowest value at the high elevation clear-cut site (watershed 6, 1,030 m) and the highest value at the low elevation clear-cut site (watershed 10, 680 m) throughout the study period. DOM fluorescent components, identified by this study using a multivariate statistical model, Parallel Factor Analysis (PARAFAC), did not differ significantly among experimental watersheds with varying forest management history. However, a protein-like DOM component exhibited temporal variations. Correlation analysis between the protein-like DOM and hydrologic patterns indicates that stream water during dry seasons comes from protein-rich groundwater sources. This study shows that UV and fluorescence spectroscopic characterization of DOM is a viable fingerprinting method to detect DOM sources in pristine headwater streams in the western Cascades of Oregon, where characterization of the stream water source with low DOC and DON concentrations is difficult.
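For readers unfamiliar with the SUVA254 index cited above, it is conventionally defined as the UV absorbance at 254 nm normalized to path length and dissolved organic carbon concentration. The symbols below ($A_{254}$, $\ell$, $[\mathrm{DOC}]$) are notation chosen here and do not appear in the abstract.

```latex
\mathrm{SUVA}_{254} = \frac{A_{254}}{\ell \,[\mathrm{DOC}]}
\quad \left[\mathrm{L\;mg\,C^{-1}\;m^{-1}}\right]
```

where $A_{254}$ is the decadic absorbance at 254 nm, $\ell$ is the optical path length in m, and $[\mathrm{DOC}]$ is the dissolved organic carbon concentration in mg C L$^{-1}$. Higher values generally indicate more aromatic DOM.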
Yan, Caixia; Liu, Huihui; Sheng, Yanru; Huang, Xian; Nie, Minghua; Huang, Qi; Baalousha, Mohammed
2018-10-01
Characterization of natural colloids is key to understanding pollutant fate and transport in the environment. The present study investigates the relationship between size and fluorescence properties of colloidal organic matter (COM) from five tributaries of Poyang Lake. Colloids were size-fractionated using cross-flow ultrafiltration and their fluorescence properties were measured by three-dimensional excitation-emission matrix fluorescence spectroscopy (3D-EEM). Parallel factor analysis (PARAFAC) and/or self-organizing map (SOM) were applied to assess fluorescence properties as proxy indicators for the different sizes of colloids. PARAFAC analysis identified four fluorescence components including three humic-like components (C1-C3) and a protein-like component (C4). These four fluorescence components, and in particular the protein-like component, are primarily present in the <1 kDa phase. For the colloidal fractions (1-10 kDa, 10-100 kDa, and 100 kDa-0.7 μm), the majority of fluorophores are associated with the smallest size fraction. SOM analysis demonstrated that relatively high fluorescence intensity and aromaticity occur primarily in the <1 kDa phase, followed by 1-10 kDa colloids. Coupling PARAFAC and SOM facilitates the visualization and interpretation of the relationship between colloidal size and fluorescence properties with fewer input variables, shorter running time, higher reliability, and nondestructive results. Fluorescence indices analysis reveals that the smallest colloidal fraction (1-10 kDa) was dominated by more humified and less autochthonous COM.
Bastiaansen, Anna E M; van Sonderen, Agnes; Titulaer, Maarten J
2017-06-01
Twenty years after the discovery of voltage-gated potassium channel (VGKC)-related autoimmunity, it is now known that the antibodies are not directed at the VGKC itself but at two closely associated proteins, leucine-rich glioma-inactivated 1 (LGI1) and contactin-associated protein-like 2 (Caspr2). Antibodies to LGI1 and Caspr2 give well-described clinical phenotypes. Anti-LGI1 encephalitis patients mostly have limbic symptoms, and anti-Caspr2 patients have variable syndromes with both central and peripheral symptoms. A large group of patients with heterogeneous symptoms are VGKC positive but do not have antibodies against LGI1 or Caspr2. The clinical relevance of VGKC positivity in these 'double-negative' patients is questionable. This review focusses on these three essentially different subgroups. The clinical phenotypes of anti-LGI1 encephalitis and anti-Caspr2 encephalitis have been described in more detail including data on treatment and long-term follow-up. A specific human leukocyte antigen (HLA) association was found in nontumor anti-LGI1 encephalitis, but not clearly in those with tumors. There has been increasing interest in the VGKC patients without LGI1/Caspr2 antibodies, questioning the relevance of VGKC positivity in clinical practice. Anti-LGI1 encephalitis and anti-Caspr2 encephalitis are separate clinical entities. Early recognition and treatment is necessary and rewarding. The term VGKC-complex antibodies, lumping patients with anti-LGI1, anti-Caspr2 antibodies or lacking both, should be considered obsolete.
Comparing K-mer based methods for improved classification of 16S sequences.
Vinje, Hilde; Liland, Kristian Hovde; Almøy, Trygve; Snipen, Lars
2015-07-01
The need for precise and stable taxonomic classification is highly relevant in modern microbiology. Parallel to the explosion in the amount of sequence data accessible, there has also been a shift in focus for classification methods. Previously, alignment-based methods were the most applicable tools. Now, methods based on counting K-mers by sliding windows are the most interesting classification approach with respect to both speed and accuracy. Here, we present a systematic comparison of five different K-mer based classification methods for the 16S rRNA gene. The methods differ from each other both in data usage and modelling strategies. We have based our study on the commonly known and well-used naïve Bayes classifier from the RDP project, and four other methods were implemented and tested on two different data sets, on full-length sequences as well as fragments of typical read-length. The differences in classification error obtained by the methods seemed to be small, but they were stable for both data sets tested. The Preprocessed nearest-neighbour (PLSNN) method performed best for full-length 16S rRNA sequences, significantly better than the naïve Bayes RDP method. On fragmented sequences the naïve Bayes Multinomial method performed best, significantly better than all other methods. For both data sets explored, and on both full-length and fragmented sequences, all five methods reached an error plateau. We conclude that no K-mer based method is universally best for classifying both full-length sequences and fragments (reads). All methods approach an error plateau, indicating that improved training data is needed to improve classification from here. Classification errors occur most frequently for genera with few sequences present. For improving the taxonomy and testing new classification methods, a better, more universal and more robust training data set is crucial.
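The "counting K-mers by sliding windows" step that all five compared methods build on can be sketched as follows. This is a generic illustration, not code from the study: the function name `count_kmers` and the decision to skip windows containing non-ACGT symbols are assumptions made for this example.

```python
from collections import Counter


def count_kmers(sequence: str, k: int) -> Counter:
    """Count overlapping K-mers by sliding a window of width k along the sequence."""
    seq = sequence.upper()
    counts: Counter = Counter()
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        # Assumption for this sketch: ignore windows with ambiguous symbols.
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


# Hypothetical toy fragment, not a real 16S sequence.
profile = count_kmers("ACGTAC", 2)
```

A classifier such as naïve Bayes would then be trained on these count vectors (or on presence/absence of each K-mer), with one vector per reference sequence.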
Rainetová, P; Jiřincová, H; Musílek, M; Nováková, L; Vodičková, I; Štruncová, V; Švecová, M; Pazdiora, P; Piskunová, N; Trubač, P; Zajíc, T; Havlíčková, M
2015-06-01
The aim was to introduce enterovirus sequencing as an advanced approach to classify the isolated viruses according to the novel nomenclature and to characterize isolates in detail. Seventy-five specimens collected from 64 patients in two hospitals, Liberec Regional Hospital and Plzeň University Hospital, were analyzed. The study patients' age ranged from four to 54 years, with a median of 15 years in males and 16 years in females. In most patients, the reasons for admission were intense headache, fever, vomiting, tiredness, meningeal symptoms, intestinal symptoms (in two patients), and skin symptoms (in one patient). The specimens collected were rectal and throat swabs, cerebrospinal fluid (CSF) and stool specimens. Molecular detection and typing were performed using the RT-PCR method. A segment of the 5′ non-coding RNA was selected for typing. Specimens were amplified using single-step PCR with external primers and with the same primers extended to include M13 sequences (Generi-Biotech). The LASERGENE software (DIASTAR) was used in sequence editing, alignment, and quality check. The sequences obtained were checked against the central GenBank sequence database using the BLAST algorithm. The identification of the study isolates resulted in 61 ECHO viruses 30, three coxsackie viruses B1, one coxsackie virus B3, one coxsackie virus A9, one enterovirus 86, one enterovirus 71, two ECHO viruses 13/coxsackie virus B5, one ECHO virus 7/30/coxsackie virus B4, one coxsackie virus B4/enterovirus B, one enterovirus 87/ECHO virus 30/enterovirus B, and one ECHO virus 3. All viruses isolated, except enterovirus 71 classified into group A, were of group B. The enteroviruses were identified unambiguously, although the sequencing only targeted a short, conserved segment that showed considerable variability. The sequencing was an effective alternative to enterovirus identification by the neutralisation test and allowed for detailed characterization of the isolates. The predominance of ECHO 30 as the cause of aseptic meningitis is in accordance with the literature data.
Carroll, Laura M.; Miller, Rachel A.; Wiedmann, Martin
2017-01-01
ABSTRACT The Bacillus cereus group comprises nine species, several of which are pathogenic. Differentiating between isolates that may cause disease and those that do not is a matter of public health and economic importance, but it can be particularly challenging due to the high genomic similarity within the group. To this end, we have developed BTyper, a computational tool that employs a combination of (i) virulence gene-based typing, (ii) multilocus sequence typing (MLST), (iii) panC clade typing, and (iv) rpoB allelic typing to rapidly classify B. cereus group isolates using nucleotide sequencing data. BTyper was applied to a set of 662 B. cereus group genome assemblies to (i) identify anthrax-associated genes in non-B. anthracis members of the B. cereus group, and (ii) identify assemblies from B. cereus group strains with emetic potential. With BTyper, the anthrax toxin genes cya, lef, and pagA were detected in 8 genomes classified by the NCBI as B. cereus that clustered into two distinct groups using k-medoids clustering, while either the B. anthracis poly-γ-d-glutamate capsule biosynthesis genes capABCDE or the hyaluronic acid capsule hasA gene was detected in an additional 16 assemblies classified as either B. cereus or Bacillus thuringiensis isolated from clinical, environmental, and food sources. The emetic toxin genes cesABCD were detected in 24 assemblies belonging to panC clades III and VI that had been isolated from food, clinical, and environmental settings. The command line version of BTyper is available at https://github.com/lmc297/BTyper. In addition, BMiner, a companion application for analyzing multiple BTyper output files in aggregate, can be found at https://github.com/lmc297/BMiner. IMPORTANCE Bacillus cereus is a foodborne pathogen that is estimated to cause tens of thousands of illnesses each year in the United States alone. Even with molecular methods, it can be difficult to distinguish nonpathogenic B. cereus group isolates from their pathogenic counterparts, including the human pathogen Bacillus anthracis, which is responsible for anthrax, as well as the insect pathogen B. thuringiensis. By using the variety of typing schemes employed by BTyper, users can rapidly classify, characterize, and assess the virulence potential of any isolate using its nucleotide sequencing data. PMID:28625989
SVM-Fold: a tool for discriminative multi-class protein fold and superfamily recognition
Melvin, Iain; Ie, Eugene; Kuang, Rui; Weston, Jason; Stafford, William Noble; Leslie, Christina
2007-01-01
Background Predicting a protein's structural class from its amino acid sequence is a fundamental problem in computational biology. Much recent work has focused on developing new representations for protein sequences, called string kernels, for use with support vector machine (SVM) classifiers. However, while some of these approaches exhibit state-of-the-art performance at the binary protein classification problem, i.e. discriminating between a particular protein class and all other classes, few of these studies have addressed the real problem of multi-class superfamily or fold recognition. Moreover, there are only limited software tools and systems for SVM-based protein classification available to the bioinformatics community. Results We present a new multi-class SVM-based protein fold and superfamily recognition system and web server called SVM-Fold, which can be found at . Our system uses an efficient implementation of a state-of-the-art string kernel for sequence profiles, called the profile kernel, where the underlying feature representation is a histogram of inexact matching k-mer frequencies. We also employ a novel machine learning approach to solve the difficult multi-class problem of classifying a sequence of amino acids into one of many known protein structural classes. Binary one-vs-the-rest SVM classifiers that are trained to recognize individual structural classes yield prediction scores that are not comparable, so that standard "one-vs-all" classification fails to perform well. Moreover, SVMs for classes at different levels of the protein structural hierarchy may make useful predictions, but one-vs-all does not try to combine these multiple predictions. To deal with these problems, our method learns relative weights between one-vs-the-rest classifiers and encodes information about the protein structural hierarchy for multi-class prediction. 
In large-scale benchmark results based on the SCOP database, our code weighting approach significantly improves on the standard one-vs-all method for both the superfamily and fold prediction in the remote homology setting and on the fold recognition problem. Moreover, our code weight learning algorithm strongly outperforms nearest-neighbor methods based on PSI-BLAST in terms of prediction accuracy on every structure classification problem we consider. Conclusion By combining state-of-the-art SVM kernel methods with a novel multi-class algorithm, the SVM-Fold system delivers efficient and accurate protein fold and superfamily recognition. PMID:17570145
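The point about non-comparable one-vs-the-rest scores can be illustrated with a toy prediction step. This is a deliberately simplified sketch and not the SVM-Fold code weighting algorithm: it only rescales each class score by a per-class weight before taking the maximum, the weights are supplied by hand rather than learned, the structural hierarchy is ignored, and all names and numbers are hypothetical.

```python
from typing import Dict, Optional


def predict_class(scores: Dict[str, float],
                  weights: Optional[Dict[str, float]] = None) -> str:
    """Pick the class with the highest (optionally re-weighted) one-vs-rest score."""
    weights = weights or {}
    # A missing weight defaults to 1.0, which reduces to plain one-vs-all.
    weighted = {label: weights.get(label, 1.0) * s for label, s in scores.items()}
    return max(weighted, key=weighted.get)


# Hypothetical raw SVM scores for two fold classes.
raw_scores = {"fold_a": 2.0, "fold_b": 1.5}

plain_choice = predict_class(raw_scores)
reweighted_choice = predict_class(raw_scores, {"fold_a": 0.5, "fold_b": 1.0})
```

With equal weights the classifier with the largest raw margin wins; once the scores are rescaled, a class whose classifier tends to produce inflated margins no longer dominates.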
A Comparison of Two Measures of HIV Diversity in Multi-Assay Algorithms for HIV Incidence Estimation
Cousins, Matthew M.; Konikoff, Jacob; Sabin, Devin; Khaki, Leila; Longosz, Andrew F.; Laeyendecker, Oliver; Celum, Connie; Buchbinder, Susan P.; Seage, George R.; Kirk, Gregory D.; Moore, Richard D.; Mehta, Shruti H.; Margolick, Joseph B.; Brown, Joelle; Mayer, Kenneth H.; Kobin, Beryl A.; Wheeler, Darrell; Justman, Jessica E.; Hodder, Sally L.; Quinn, Thomas C.; Brookmeyer, Ron; Eshleman, Susan H.
2014-01-01
Background Multi-assay algorithms (MAAs) can be used to estimate HIV incidence in cross-sectional surveys. We compared the performance of two MAAs that use HIV diversity as one of four biomarkers for analysis of HIV incidence. Methods Both MAAs included two serologic assays (LAg-Avidity assay and BioRad-Avidity assay), HIV viral load, and an HIV diversity assay. HIV diversity was quantified using either a high resolution melting (HRM) diversity assay that does not require HIV sequencing (HRM score for a 239 base pair env region) or sequence ambiguity (the percentage of ambiguous bases in a 1,302 base pair pol region). Samples were classified as MAA positive (likely from individuals with recent HIV infection) if they met the criteria for all of the assays in the MAA. The following performance characteristics were assessed: (1) the proportion of samples classified as MAA positive as a function of duration of infection, (2) the mean window period, (3) the shadow (the time period before sample collection that is being assessed by the MAA), and (4) the accuracy of cross-sectional incidence estimates for three cohort studies. Results The proportion of samples classified as MAA positive as a function of duration of infection was nearly identical for the two MAAs. The mean window period was 141 days for the HRM-based MAA and 131 days for the sequence ambiguity-based MAA. The shadows for both MAAs were <1 year. Both MAAs provided cross-sectional HIV incidence estimates that were very similar to longitudinal incidence estimates based on HIV seroconversion. Conclusions MAAs that include the LAg-Avidity assay, the BioRad-Avidity assay, HIV viral load, and HIV diversity can provide accurate HIV incidence estimates. Sequence ambiguity measures obtained using a commercially-available HIV genotyping system can be used as an alternative to HRM scores in MAAs for cross-sectional HIV incidence estimation. PMID:24968135
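The classification rule described above (a sample is MAA positive only if it meets the criteria for all assays) can be sketched as follows. The cutoff values and the direction of each comparison are assumptions chosen for illustration; the abstract does not state them, and the names `Sample`, `CUTOFFS` and `is_maa_positive` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    lag_avidity: float     # LAg-Avidity assay result
    biorad_avidity: float  # BioRad-Avidity index
    viral_load: float      # HIV RNA copies/mL
    diversity: float       # HRM score or % ambiguous bases


# Illustrative cutoffs only; they are NOT taken from the study.
CUTOFFS = {"lag_avidity": 3.0, "biorad_avidity": 90.0,
           "viral_load": 400.0, "diversity": 5.0}


def is_maa_positive(s: Sample, c: dict = CUTOFFS) -> bool:
    """A sample is MAA positive only if every assay criterion is met."""
    return all([
        s.lag_avidity < c["lag_avidity"],        # assumed: low avidity
        s.biorad_avidity < c["biorad_avidity"],  # assumed: low avidity
        s.viral_load > c["viral_load"],          # assumed: detectable load
        s.diversity < c["diversity"],            # assumed: low diversity
    ])
```

Swapping the diversity measure (HRM score versus sequence ambiguity) would change only the `diversity` field and its cutoff, which is the comparison the study makes.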
Durviaux, Serge; Treanor, John; Beran, Jiri; Duval, Xavier; Esen, Meral; Feldman, Gregory; Frey, Sharon E.; Launay, Odile; Leroux-Roels, Geert; McElhaney, Janet E.; Nowakowski, Andrzej; Ruiz-Palacios, Guillermo M.; van Essen, Gerrit A.; Oostvogels, Lidia; Devaster, Jeanne-Marie
2014-01-01
Estimations of the effectiveness of vaccines against seasonal influenza virus are guided by comparisons of the antigenicities between influenza virus isolates from clinical breakthrough cases with strains included in a vaccine. This study examined whether the prediction of antigenicity using a sequence analysis of the hemagglutinin (HA) gene-encoded HA1 domain is a simpler alternative to using the conventional hemagglutination inhibition (HI) assay, which requires influenza virus culturing. Specimens were taken from breakthrough cases that occurred in a trivalent influenza virus vaccine efficacy trial involving >43,000 participants during the 2008-2009 season. A total of 498 influenza viruses were successfully subtyped as A(H3N2) (380 viruses), A(H1N1) (29 viruses), B(Yamagata) (23 viruses), and B(Victoria) (66 viruses) from 603 PCR- or culture-confirmed specimens. Unlike the B strains, most A(H3N2) (377 viruses) and all A(H1N1) viruses were classified as homologous to the respective vaccine strains based on their HA1 domain nucleic acid sequence. HI titers relative to the respective vaccine strains and PCR subtyping were determined for 48% (182/380) of A(H3N2) and 86% (25/29) of A(H1N1) viruses. Eighty-four percent of the A(H3N2) and A(H1N1) viruses classified as homologous by sequence were matched to the respective vaccine strains by HI testing. However, these homologous A(H3N2) and A(H1N1) viruses displayed a wide range of relative HI titers. Therefore, although PCR is a sensitive diagnostic method for confirming influenza virus cases, HA1 sequence analysis appeared to be of limited value in accurately predicting antigenicity; hence, it may be inappropriate to classify clinical specimens as homologous or heterologous to the vaccine strain for estimating vaccine efficacy in a prospective clinical trial. PMID:24371255
Ali, Safdar; Majid, Abdul; Khan, Asifullah
2014-04-01
Development of an accurate and reliable intelligent decision-making method for the construction of a cancer diagnosis system is one of the fast-growing research areas of health sciences. Such a decision-making system can provide adequate information for cancer diagnosis and drug discovery. Descriptors derived from physicochemical properties of protein sequences are very useful for classifying cancerous proteins. Recently, several interesting research studies have been reported on breast cancer classification. To this end, we propose the exploitation of the physicochemical properties of amino acids in protein primary sequences such as hydrophobicity (Hd) and hydrophilicity (Hb) for breast cancer classification. Hd and Hb properties of amino acids, in recent literature, are reported to be quite effective in characterizing the constituent amino acids and are used to study protein folding, interactions, structures, and sequence-order effects. In particular, using these physicochemical properties, we observed that the amino acids proline, serine, tyrosine, cysteine, arginine, and asparagine offer high discrimination between cancerous and healthy proteins. In addition, unlike traditional ensemble classification approaches, the proposed 'IDM-PhyChm-Ens' method was developed by combining the decision spaces of a specific classifier trained on different feature spaces. The different feature spaces used were amino acid composition, split amino acid composition, and pseudo amino acid composition. Consequently, we have exploited different feature spaces using Hd and Hb properties of amino acids to develop an accurate method for classification of cancerous protein sequences. We developed ensemble classifiers using diverse learning algorithms such as random forest (RF), support vector machines (SVM), and K-nearest neighbor (KNN) trained on different feature spaces. We observed that ensemble-RF, in the case of cancer classification, performed better than ensemble-SVM and ensemble-KNN. 
Our analysis demonstrates that ensemble-RF, ensemble-SVM and ensemble-KNN are more effective than their individual counterparts. The proposed 'IDM-PhyChm-Ens' method has shown improved performance compared to existing techniques.
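Of the feature spaces named above, amino acid composition is the simplest to write down. The sketch below is a generic version of that descriptor, not the authors' code; the function name `amino_acid_composition` and the choice to skip non-standard residues are assumptions.

```python
from collections import Counter
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(seq: str) -> List[float]:
    """Return the fraction of each standard amino acid, in AMINO_ACIDS order."""
    counts = Counter(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[a] / total for a in AMINO_ACIDS]
```

A split amino acid composition could be obtained under the same assumptions by calling this function separately on segments of the sequence and concatenating the resulting vectors.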
The Sirenomelia Sequence: A Case History
Fadhlaoui, Anis; Khrouf, Mohamed; Gaigi, Soumaya; Zhioua, Fethi; Chaker, Anis
2010-01-01
We report a case of sirenomelia sequence observed in an incident of preterm labor during the 29th gestational week. According to some authors, this syndrome should be classified separately from caudal regression syndrome and is likely to be the result of an abnormality taking place during the fourth gestational week, causing developmental abnormalities in the lower extremities, pelvis, genitalia, urinary tract and digestive organs. Despite recent progress in pathology, the etiopathogenesis of sirenomelia is still debated. PMID:21769253
Shittu, Ismaila; Sharma, Poonam; Joannis, Tony M.; Volkening, Jeremy D.; Odaibo, Georgina N.; Olaleye, David O.; Williams-Coplin, Dawn; Solomon, Ponman; Abolnik, Celia; Miller, Patti J.; Dimitrov, Kiril M.
2016-01-01
The first complete genome sequence of a strain of Newcastle disease virus (NDV) of genotype XVII is described here. A velogenic strain (duck/Nigeria/903/KUDU-113/1992) was isolated from an apparently healthy free-roaming domestic duck sampled in Kuru, Nigeria, in 1992. Phylogenetic analysis of the fusion protein gene and complete genome classified the isolate as a member of NDV class II, genotype XVII. PMID:26847901
Sirius PSB: a generic system for analysis of biological sequences.
Koh, Chuan Hock; Lin, Sharene; Jedd, Gregory; Wong, Limsoon
2009-12-01
Computational tools are essential components of modern biological research. For example, BLAST searches can be used to identify related proteins based on sequence homology, or when a new genome is sequenced, prediction models can be used to annotate functional sites such as transcription start sites, translation initiation sites and polyadenylation sites and to predict protein localization. Here we present Sirius Prediction Systems Builder (PSB), a new computational tool for sequence analysis, classification and searching. Sirius PSB has four main operations: (1) Building a classifier, (2) Deploying a classifier, (3) Search for proteins similar to query proteins, (4) Preliminary and post-prediction analysis. Sirius PSB supports all these operations via a simple and interactive graphical user interface. Besides being a convenient tool, Sirius PSB has also introduced two novelties in sequence analysis. Firstly, genetic algorithm is used to identify interesting features in the feature space. Secondly, instead of the conventional method of searching for similar proteins via sequence similarity, we introduced searching via features' similarity. To demonstrate the capabilities of Sirius PSB, we have built two prediction models - one for the recognition of Arabidopsis polyadenylation sites and another for the subcellular localization of proteins. Both systems are competitive against current state-of-the-art models based on evaluation of public datasets. More notably, the time and effort required to build each model is greatly reduced with the assistance of Sirius PSB. Furthermore, we show that under certain conditions when BLAST is unable to find related proteins, Sirius PSB can identify functionally related proteins based on their biophysical similarities. Sirius PSB and its related supplements are available at: http://compbio.ddns.comp.nus.edu.sg/~sirius.
Environmental distribution, abundance and activity of the Miscellaneous Crenarchaeotal Group
NASA Astrophysics Data System (ADS)
Lloyd, K. G.; Biddle, J.; Teske, A.
2011-12-01
Many marine sedimentary microbes have only been identified by 16S rRNA sequences. Consequently, little is known about the types of metabolism, activity levels, or relative abundance of these groups in marine sediments. We found that one of these uncultured groups, called the Miscellaneous Crenarchaeotal Group (MCG), dominated clone libraries made from reverse transcribed 16S rRNA, and 454 pyrosequenced 16S rRNA genes, in the White Oak River estuary. Primers suitable for quantitative PCR were developed for MCG and used to show that 16S rRNA DNA copy numbers from MCG account for nearly all the archaeal 16S rRNA genes present. RT-qPCR shows much less MCG rRNA than total archaeal rRNA, but comparisons of different primers for each group suggest bias in the RNA-based work relative to the DNA-based work. There is no evidence of a population shift with depth below the sulfate-methane transition zone, suggesting that the metabolism of MCG may not be tied to sulfur or methane cycles. We classified 2,771 new sequences within the SSU Silva 106 database that, along with the classified sequences in the Silva database was used to make an MCG database of 4,646 sequences that allowed us to increase the named subgroups of MCG from 7 to 19. Percent terrestrial sequences in each subgroup is positively correlated with percent of the marine sequences that are nearshore, suggesting that membership in the different subgroups is not random, but dictated by environmental selective pressures. Given their high phylogenetic diversity, ubiquitous distribution in anoxic environments, and high DNA copy number relative to total archaea, members of MCG are most likely anaerobic heterotrophs who are integral to the post-depositional marine carbon cycle.
Cenik, Can; Chua, Hon Nian; Singh, Guramrit; Akef, Abdalla; Snyder, Michael P; Palazzo, Alexander F; Moore, Melissa J; Roth, Frederick P
2017-03-01
Introns are found in 5' untranslated regions (5'UTRs) for 35% of all human transcripts. These 5'UTR introns are not randomly distributed: Genes that encode secreted, membrane-bound and mitochondrial proteins are less likely to have them. Curiously, transcripts lacking 5'UTR introns tend to harbor specific RNA sequence elements in their early coding regions. To model and understand the connection between coding-region sequence and 5'UTR intron status, we developed a classifier that can predict 5'UTR intron status with >80% accuracy using only sequence features in the early coding region. Thus, the classifier identifies transcripts with 5' proximal-intron-minus-like-coding regions ("5IM" transcripts). Unexpectedly, we found that the early coding sequence features defining 5IM transcripts are widespread, appearing in 21% of all human RefSeq transcripts. The 5IM class of transcripts is enriched for non-AUG start codons, more extensive secondary structure both preceding the start codon and near the 5' cap, greater dependence on eIF4E for translation, and association with ER-proximal ribosomes. 5IM transcripts are bound by the exon junction complex (EJC) at noncanonical 5' proximal positions. Finally, N1-methyladenosines are specifically enriched in the early coding regions of 5IM transcripts. Taken together, our analyses point to the existence of a distinct 5IM class comprising ∼20% of human transcripts. This class is defined by depletion of 5' proximal introns, presence of specific RNA sequence features associated with low translation efficiency, N1-methyladenosines in the early coding region, and enrichment for noncanonical binding by the EJC. © 2017 Cenik et al.; Published by Cold Spring Harbor Laboratory Press for the RNA Society.
Ishiguro, Naotaka; Inoshima, Yasuo; Yanai, Tokuma; Sasaki, Motoki; Matsui, Akira; Kikuchi, Hiroki; Maruyama, Masashi; Hongo, Hitomi; Vostretsov, Yuri E; Gasilin, Viatcheslav; Kosintsev, Pavel A; Quanjia, Chen; Chunxue, Wang
2016-02-01
The mitochondrial DNA (mtDNA) control region (198- to 598-bp) of four ancient Canis specimens (two Canis mandibles, a cranium, and a first phalanx) was examined, and each specimen was genetically identified as Japanese wolf. Two unique nucleotide substitutions, the 78-C insertion and the 482-G deletion, both of which are specific for Japanese wolf, were observed in each sample. Based on the mtDNA sequences analyzed, these four specimens and 10 additional Japanese wolf samples could be classified into two groups- Group A (10 samples) and Group B (4 samples)-which contain or lack an 8-bp insertion/deletion (indel), respectively. Interestingly, three dogs (Akita-b, Kishu 25, and S-husky 102) that each contained Japanese wolf-specific features were also classified into Group A or B based on the 8-bp indel. To determine the origin or ancestor of the Japanese wolf, mtDNA control regions of ancient continental Canis specimens were examined; 84 specimens were from Russia, and 29 were from China. However, none of these 113 specimens contained Japanese wolf-specific sequences. Moreover, none of 426 Japanese modern hunting dogs examined contained these Japanese wolf-specific mtDNA sequences. The mtDNA control region sequences of Groups A and B appeared to be unique to grey wolf and dog populations.
Discovering Deeply Divergent RNA Viruses in Existing Metatranscriptome Data with Machine Learning
NASA Astrophysics Data System (ADS)
Rivers, A. R.
2016-02-01
Most sampling of RNA viruses and phages has been directed toward a narrow range of hosts and environments. Several marine metagenomic studies have examined the RNA viral fraction in aquatic samples and found a number of picornaviruses and uncharacterized sequences. The lack of homology to known protein families has limited the discovery of new RNA viruses. We developed a computational method for identifying RNA viruses that relies on information in the codon transition probabilities of viral sequences to train a classifier. This approach does not rely on homology, but it has higher information content than other reference-free methods such as tetranucleotide frequency. Training and validation with RefSeq data gave true positive and true negative rates of 99.6% and 99.5% on the highly imbalanced validation sets (0.2% viruses) that, like the metatranscriptomes themselves, contain mostly non-viral sequences. To further test the method, a validation dataset of putative RNA virus genomes was identified in metatranscriptomes by the presence of RNA-dependent RNA polymerase, an essential gene for RNA viruses. The classifier successfully identified 99.4% of those contigs as viral. This approach is currently being extended to screen all metatranscriptome data sequenced at the DOE Joint Genome Institute, presently 4.5 Gb of assembled data from 504 public projects representing a wide range of marine, aquatic and terrestrial environments.
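The abstract above does not say how the codon transition probabilities are computed, so the following is only one plausible reading: count how often each codon is followed by each other codon in a fixed reading frame and normalise each row. The function name `codon_transition_probs` and the `frame` parameter are hypothetical.

```python
from collections import defaultdict
from typing import Dict


def codon_transition_probs(seq: str, frame: int = 0) -> Dict[str, Dict[str, float]]:
    """Estimate P(next codon | current codon) from one sequence and frame."""
    seq = seq.upper()
    # Non-overlapping codons starting at `frame`; a trailing partial codon is dropped.
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    counts = defaultdict(lambda: defaultdict(int))
    for current, following in zip(codons, codons[1:]):
        counts[current][following] += 1
    probs = {}
    for current, nxt in counts.items():
        total = sum(nxt.values())
        probs[current] = {codon: n / total for codon, n in nxt.items()}
    return probs
```

Flattening the resulting table into a fixed-length vector (64 x 64 entries for unambiguous DNA) would give a homology-free feature representation that a classifier could be trained on.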
Enhanced Regulatory Sequence Prediction Using Gapped k-mer Features
Mohammad-Noori, Morteza; Beer, Michael A.
2014-01-01
Oligomers of length k, or k-mers, are convenient and widely used features for modeling the properties and functions of DNA and protein sequences. However, k-mers suffer from the inherent limitation that if the parameter k is increased to resolve longer features, the probability of observing any specific k-mer becomes very small, and k-mer counts approach a binary variable, with most k-mers absent and a few present once. Thus, any statistical learning approach using k-mers as features becomes susceptible to noisy training set k-mer frequencies once k becomes large. To address this problem, we introduce alternative feature sets using gapped k-mers, a new classifier, gkm-SVM, and a general method for robust estimation of k-mer frequencies. To make the method applicable to large-scale genome wide applications, we develop an efficient tree data structure for computing the kernel matrix. We show that compared to our original kmer-SVM and alternative approaches, our gkm-SVM predicts functional genomic regulatory elements and tissue specific enhancers with significantly improved accuracy, increasing the precision by up to a factor of two. We then show that gkm-SVM consistently outperforms kmer-SVM on human ENCODE ChIP-seq datasets, and further demonstrate the general utility of our method using a Naïve-Bayes classifier. Although developed for regulatory sequence analysis, these methods can be applied to any sequence classification problem. PMID:25033408
Enhanced regulatory sequence prediction using gapped k-mer features.
Ghandi, Mahmoud; Lee, Dongwon; Mohammad-Noori, Morteza; Beer, Michael A
2014-07-01
Oligomers of length k, or k-mers, are convenient and widely used features for modeling the properties and functions of DNA and protein sequences. However, k-mers suffer from the inherent limitation that if the parameter k is increased to resolve longer features, the probability of observing any specific k-mer becomes very small, and k-mer counts approach a binary variable, with most k-mers absent and a few present once. Thus, any statistical learning approach using k-mers as features becomes susceptible to noisy training set k-mer frequencies once k becomes large. To address this problem, we introduce alternative feature sets using gapped k-mers, a new classifier, gkm-SVM, and a general method for robust estimation of k-mer frequencies. To make the method applicable to large-scale genome wide applications, we develop an efficient tree data structure for computing the kernel matrix. We show that compared to our original kmer-SVM and alternative approaches, our gkm-SVM predicts functional genomic regulatory elements and tissue specific enhancers with significantly improved accuracy, increasing the precision by up to a factor of two. We then show that gkm-SVM consistently outperforms kmer-SVM on human ENCODE ChIP-seq datasets, and further demonstrate the general utility of our method using a Naïve-Bayes classifier. Although developed for regulatory sequence analysis, these methods can be applied to any sequence classification problem.
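To make the gapped k-mer feature set concrete, here is a brute-force sketch that counts every pattern with k informative positions inside each window of length l. It deliberately ignores the efficient tree data structure mentioned in the abstract and is not the gkm-SVM code; the function name `gapped_kmer_counts`, the parameter names `l` and `k`, and the "-" gap symbol are assumptions.

```python
from collections import Counter
from itertools import combinations


def gapped_kmer_counts(seq: str, l: int, k: int) -> Counter:
    """Count gapped k-mers: l-long words with k kept positions, gaps shown as '-'."""
    counts = Counter()
    position_sets = [set(p) for p in combinations(range(l), k)]
    for start in range(len(seq) - l + 1):
        word = seq[start:start + l]
        for kept in position_sets:
            pattern = "".join(ch if i in kept else "-"
                              for i, ch in enumerate(word))
            counts[pattern] += 1
    return counts
```

Because each window contributes to several gapped patterns, counts are spread over fewer, more frequently observed features than with ungapped k-mers of the same length, which is the sparsity problem the abstract describes.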
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gao, Jian; Luo, Mao; Zhu, Ye
2015-03-27
Viola yedoensis Makino is an important Chinese traditional medicine plant adapted to cadmium (Cd)-polluted regions. Illumina sequencing technology was used to sequence the transcriptome of V. yedoensis Makino. We sequenced Cd-treated (VIYCd) and untreated (VIYCK) samples of V. yedoensis, and obtained 100,410,834 and 83,587,676 high quality reads, respectively. After de novo assembly and quantitative assessment, 109,800 unigenes were finally generated with an average length of 661 bp. We then obtained functional annotations by aligning unigenes with public protein databases including NR, NT, SwissProt, KEGG and COG. In addition, 892 differentially expressed genes (DEGs) were investigated between the two libraries of untreated (VIYCK) and Cd-treated (VIYCd) plants. Moreover, 15 randomly selected DEGs were further validated with qRT-PCR and the results were highly accordant with the Solexa analysis. This study generated the first successful global analysis of the V. yedoensis transcriptome and will provide a basis for further studies on gene expression, genomics, and functional genomics in Violaceae. - Highlights: • A de novo assembly generated 109,800 unigenes and 5,4479 of them were annotated. • 31,285 could be classified into 26 COG categories. • 263 biosynthesis pathways were predicted and classified into five categories. • 892 DEGs were detected and 15 of them were validated by qRT-PCR.
Wang, Liyan; Ma, Lina; Liu, Yongan; Gao, Pengcheng; Li, Youquan; Li, Xuerui; Liu, Yongsheng
2016-10-01
Haemophilus parasuis is the etiological agent of Glässers disease, which causes high morbidity and mortality in swine herds. Although H. parasuis strains can be classified into 15 serovars with the Kielstein-Rapp-Gabrielson serotyping scheme, a large number of isolates cannot be classified and have been designated 'nontypeable' strains. In this study, multilocus sequence typing (MLST) of H. parasuis was used to analyze 48 H. parasuis field strains isolated in China and two strains from Australia. Twenty-six new alleles and 29 new sequence types (STs) were detected, enriching the H. parasuis MLST databases. A BURST analysis indicated that H. parasuis lacks stable population structure and is highly heterogeneous, and that there is no association between STs and geographic area. When an UPGMA dendrogram was constructed, two major clades, clade A and clade B, were defined. Animal experiments, in which guinea pigs were challenged intraperitoneally with the bacterial isolates, supported the hypothesis that the H. parasuis STs in clade A are generally avirulent or weakly virulent, whereas the STs in clade B tend to be virulent. Copyright © 2016 Elsevier B.V. All rights reserved.
Computer-Vision-Assisted Palm Rehabilitation With Supervised Learning.
Vamsikrishna, K M; Dogra, Debi Prosad; Desarkar, Maunendra Sankar
2016-05-01
Physical rehabilitation supported by a computer-assisted interface is gaining popularity among the health-care fraternity. In this paper, we propose a computer-vision-assisted contactless methodology to facilitate palm and finger rehabilitation. A Leap Motion controller has been interfaced with a computing device to record parameters describing 3-D movements of the palm of a user undergoing rehabilitation. We have proposed an interface built on the Unity3D development platform. Our interface is capable of analyzing intermediate steps of rehabilitation without the help of an expert, and it can provide online feedback to the user. Isolated gestures are classified using linear discriminant analysis (DA) and support vector machines (SVM). Finally, a set of discrete hidden Markov models (HMM) has been used to classify the gesture sequences performed during rehabilitation. Experimental validation using a large number of samples collected from healthy volunteers reveals that DA and SVM perform similarly when applied to isolated gesture recognition. We have compared the results of HMM-based sequence classification with conditional random field (CRF)-based techniques. Our results confirm that both HMM and CRF perform quite similarly when tested on gesture sequences. The proposed system can be used for home-based palm or finger rehabilitation in the absence of experts.
Blouin, Arnaud G; Chooi, Kar Mun; Warren, Ben; Napier, Kathryn R; Barrero, Roberto A; MacDiarmid, Robin M
2018-05-01
A novel virus, with characteristics of viruses classified within the genus Vitivirus, was identified from a sample of Vitis vinifera cv. Chardonnay in New Zealand. The virus was detected with high throughput sequencing (small RNA and total RNA) and its sequence was confirmed by Sanger sequencing. Its genome is 7507 nt long (excluding the polyA tail) with an organisation similar to that described for other classifiable members of the genus Vitivirus. The closest relative of the virus is grapevine virus E (GVE) with 65% aa identity in ORF1 (65% nt identity) and 63% aa identity in the coat protein (66% nt identity). The relationship with GVE was confirmed with phylogenetic analysis, showing the new virus branching with GVE, Agave tequilina leaf virus and grapevine virus G (GVG). A limited survey revealed the presence of this virus in multiple plants from the same location where the newly described GVG was discovered, and in most cases both viruses were detected as co-infections. The genetic characteristics of this virus suggest it represents an isolate of a new species within the genus Vitivirus and following the current nomenclature, we propose the name "Grapevine virus I".
Lu, Hui-Meng; Yin, Da-Chuan; Ye, Ya-Jing; Luo, Hui-Min; Geng, Li-Qiang; Li, Hai-Sheng; Guo, Wei-Hong; Shang, Peng
2009-01-01
As the most widely utilized technique to determine the 3-dimensional structure of protein molecules, X-ray crystallography can provide structure of the highest resolution among the developed techniques. The resolution obtained via X-ray crystallography is known to be influenced by many factors, such as the crystal quality, diffraction techniques, and X-ray sources, etc. In this paper, the authors found that the protein sequence could also be one of the factors. We extracted information of the resolution and the sequence of proteins from the Protein Data Bank (PDB), classified the proteins into different clusters according to the sequence similarity, and statistically analyzed the relationship between the sequence similarity and the best resolution obtained. The results showed that there was a pronounced correlation between the sequence similarity and the obtained resolution. These results indicate that protein structure itself is one variable that may affect resolution when X-ray crystallography is used.
2013-01-01
Background Plastids are an important component of plant cells, being the site of manufacture and storage of chemical compounds used by the cell, and contain pigments such as those used in photosynthesis, starch synthesis/storage, cell color etc. They are essential organelles of the plant cell, also present in algae. Recent advances in genomic technology and sequencing efforts is generating a huge amount of DNA sequence data every day. The predicted proteome of these genomes needs annotation at a faster pace. In view of this, one such annotation need is to develop an automated system that can distinguish between plastid and non-plastid proteins accurately, and further classify plastid-types based on their functionality. We compared the amino acid compositions of plastid proteins with those of non-plastid ones and found significant differences, which were used as a basis to develop various feature-based prediction models using similarity-search and machine learning. Results In this study, we developed separate Support Vector Machine (SVM) trained classifiers for characterizing the plastids in two steps: first distinguishing the plastid vs. non-plastid proteins, and then classifying the identified plastids into their various types based on their function (chloroplast, chromoplast, etioplast, and amyloplast). Five diverse protein features: amino acid composition, dipeptide composition, the pseudo amino acid composition, Nterminal-Center-Cterminal composition and the protein physicochemical properties are used to develop SVM models. Overall, the dipeptide composition-based module shows the best performance with an accuracy of 86.80% and Matthews Correlation Coefficient (MCC) of 0.74 in phase-I and 78.60% with a MCC of 0.44 in phase-II. On independent test data, this model also performs better with an overall accuracy of 76.58% and 74.97% in phase-I and phase-II, respectively. 
The similarity-based PSI-BLAST module shows very low performance with about 50% prediction accuracy for distinguishing plastid vs. non-plastids and only 20% in classifying various plastid-types, indicating the need and importance of machine learning algorithms. Conclusion The current work is a first attempt to develop a methodology for classifying various plastid-type proteins. The prediction modules have also been made available as a web tool, PLpred available at http://bioinfo.okstate.edu/PLpred/ for real time identification/characterization. We believe this tool will be very useful in the functional annotation of various genomes. PMID:24266945
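The best-performing feature in the abstract above, dipeptide composition, and its reported evaluation metric, the Matthews Correlation Coefficient, both have standard definitions that can be sketched briefly. This is a generic illustration rather than the PLpred code; the function names and the handling of non-standard residues are assumptions.

```python
from itertools import product
from math import sqrt
from typing import Dict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 pairs


def dipeptide_composition(seq: str) -> Dict[str, float]:
    """Fraction of each of the 400 adjacent residue pairs in the sequence."""
    seq = seq.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    # Assumption: pairs containing a non-standard residue are skipped.
    valid = [p for p in pairs if p[0] in AMINO_ACIDS and p[1] in AMINO_ACIDS]
    if not valid:
        raise ValueError("sequence contains no standard dipeptides")
    return {d: valid.count(d) / len(valid) for d in DIPEPTIDES}


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient; returns 0.0 when it is undefined."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

The 400-dimensional vector captures local residue order that plain amino acid composition discards, which may be one reason it performed best among the five features tested.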
Sentence Combining: A Sequence for Instruction.
ERIC Educational Resources Information Center
Lawlor, Joseph
1983-01-01
Classifies various syntactic structures normally included in sentence-combining instruction into five categories: coordinates, adverbials, restrictive noun modifiers, noun substitutes, and free modifiers. Within each category, structures are further divided into three levels to provide teachers with guidelines for planning instruction. (RH)
Khan, Abdul Latif; Asaf, Sajjad; Khan, Abdur Rahim; Al-Harrasi, Ahmed; Al-Rawahi, Ahmed; Lee, In-Jung
2016-05-10
Preussia sp. BSL10, family Sporormiaceae, actively produced a phytohormone (indole-3-acetic acid) and extracellular enzymes (phosphatases and glucosidases). The fungus also promoted the growth of the arid-land tree Boswellia sacra. Given these prospects, we sequenced its draft genome for the first time. The Illumina-based sequence analysis reveals an approximate genome size of 31.4 Mbp for Preussia sp. BSL10. Based on ab initio gene prediction, a total of 32,312 coding sequences were annotated, consisting of 11,967 coding genes, pseudogenes, and 221 tRNA genes. Furthermore, 321 carbohydrate-active enzymes were predicted and classified into many functional families. Copyright © 2016 Elsevier B.V. All rights reserved.
Zhang, Longlong; Yue, Qinyan; Yang, Kunlun; Zhao, Pin; Gao, Baoyu
2018-02-01
Extracellular polymeric substances (EPS) and the ciprofloxacin-degrading microbial community in the combined Fe-C micro-electrolysis and up-flow biological aerated filter (UBAF) process for the treatment of high-level ciprofloxacin (CIP) were analyzed. The research demonstrated a great potential of Fe-C micro-electrolysis-UBAF for the elimination of high-level CIP. Above 90% of CIP removal was achieved through the combined process at 100 mg L⁻¹ of CIP loading. In UBAF, the pollutants were mainly removed at 0-70 cm heights. Three-dimensional fluorescence spectrum (3D-EEM) was used to characterize the chemical structure of loosely bound EPS (LB-EPS) and tightly bound EPS (TB-EPS) extracted from biofilm samples in UBAF. The results showed that the protein-like substances in LB-EPS and TB-EPS had no clear change in the study. Nevertheless, an obvious release of polysaccharides in EPSs was observed during long-term exposure to CIP, which was considered a protective response of the microbes to CIP toxicity. The high-throughput sequencing results revealed that the biodiversity of the bacterial community became increasingly rich with gradual ciprofloxacin biodegradation in UBAF. The ciprofloxacin-degrading microbial community was mainly dominated by Proteobacteria and Bacteroidetes. Microorganisms from genera Dechloromonas, Brevundimonas, Flavobacterium, Sphingopyxis and Bosea might play a major role in ciprofloxacin degradation. This study provides deep theoretical guidance for real CIP wastewater treatment. Copyright © 2017. Published by Elsevier Ltd.
Zhang, Huiyong; Zhao, Xin; Li, Jigang; Cai, Huaqing; Deng, Xing Wang; Li, Lei
2014-01-01
Light and copper are important environmental determinants of plant growth and development. Despite the wealth of knowledge on both light and copper signaling, the molecular mechanisms that integrate the two pathways remain poorly understood. Here, we use Arabidopsis thaliana to demonstrate an interaction between SQUAMOSA PROMOTER BINDING PROTEIN-LIKE7 (SPL7) and ELONGATED HYPOCOTYL5 (HY5), which mediate copper and light signaling, respectively. Through whole-genome chromatin immunoprecipitation and RNA sequencing analyses, we elucidated the SPL7 regulon and compared it with that of HY5. We found that the two transcription factors coregulate many genes, including those involved in anthocyanin accumulation and photosynthesis. Moreover, SPL7 and HY5 act coordinately to transcriptionally regulate MIR408, which results in differential expression of microRNA408 (miR408) and its target genes in response to changing light and copper conditions. We demonstrate that this regulation is tied to copper allocation to the chloroplast and plastocyanin levels. Finally, we found that constitutively activated miR408 rescues the distinct developmental defects of the hy5, spl7, and hy5 spl7 mutants. These findings revealed the existence of crosstalk between light and copper, mediated by a HY5-SPL7 network. Furthermore, integration of transcriptional and posttranscriptional regulation is critical for governing proper metabolism and development in response to combined copper and light signaling. PMID:25516599
SPL13 regulates shoot branching and flowering time in Medicago sativa.
Gao, Ruimin; Gruber, Margaret Y; Amyot, Lisa; Hannoufa, Abdelali
2018-01-01
Our results show SPL13 plays a crucial role in regulating vegetative and reproductive development in Medicago sativa L. (alfalfa), and that MYB112 is targeted and downregulated by SPL13 in alfalfa. We previously showed that transgenic Medicago sativa (alfalfa) plants overexpressing microRNA156 (miR156) show a bushy phenotype, reduced internodal length, delayed flowering time, and enhanced biomass yield. In alfalfa, transcripts of seven SQUAMOSA-PROMOTER BINDING PROTEIN-LIKE (SPL) transcription factors, including SPL13, are targeted for cleavage by miR156. Thus, association of each target SPL gene to a trait or set of traits is essential for developing molecular markers for alfalfa breeding. In this study, we investigated SPL13 function using SPL13 overexpression and silenced alfalfa plants. Severe growth retardation, distorted branches and up-curled leaves were observed in miR156-impervious 35S::SPL13m over-expression plants. In contrast, more lateral branches and delayed flowering time were observed in SPL13 silenced plants. SPL13 transcripts were predominantly present in the plant meristems, indicating that SPL13 is involved in regulating shoot branch development. Accordingly, the shoot branching-related CAROTENOID CLEAVAGE DIOXYGENASE 8 gene was found to be significantly downregulated in SPL13 RNAi silencing plants. A R2R3-MYB gene MYB112 was also identified as being directly silenced by SPL13 based on Next Generation Sequencing-mediated transcriptome analysis and chromatin immunoprecipitation assays, suggesting that MYB112 may be involved in regulating alfalfa vegetative growth.
Douville, Christopher; Masica, David L.; Stenson, Peter D.; Cooper, David N.; Gygax, Derek M.; Kim, Rick; Ryan, Michael
2015-01-01
ABSTRACT Insertion/deletion variants (indels) alter protein sequence and length, yet are highly prevalent in healthy populations, presenting a challenge to bioinformatics classifiers. Commonly used features—DNA and protein sequence conservation, indel length, and occurrence in repeat regions—are useful for inference of protein damage. However, these features can cause false positives when predicting the impact of indels on disease. Existing methods for indel classification suffer from low specificities, severely limiting clinical utility. Here, we further develop our variant effect scoring tool (VEST) to include the classification of in‐frame and frameshift indels (VEST‐indel) as pathogenic or benign. We apply 24 features, including a new “PubMed” feature, to estimate a gene's importance in human disease. When compared with four existing indel classifiers, our method achieves a drastically reduced false‐positive rate, improving specificity by as much as 90%. This approach of estimating gene importance might be generally applicable to missense and other bioinformatics pathogenicity predictors, which often fail to achieve high specificity. Finally, we tested all possible meta‐predictors that can be obtained from combining the four different indel classifiers using Boolean conjunctions and disjunctions, and derived a meta‐predictor with improved performance over any individual method. PMID:26442818
Douville, Christopher; Masica, David L; Stenson, Peter D; Cooper, David N; Gygax, Derek M; Kim, Rick; Ryan, Michael; Karchin, Rachel
2016-01-01
Insertion/deletion variants (indels) alter protein sequence and length, yet are highly prevalent in healthy populations, presenting a challenge to bioinformatics classifiers. Commonly used features--DNA and protein sequence conservation, indel length, and occurrence in repeat regions--are useful for inference of protein damage. However, these features can cause false positives when predicting the impact of indels on disease. Existing methods for indel classification suffer from low specificities, severely limiting clinical utility. Here, we further develop our variant effect scoring tool (VEST) to include the classification of in-frame and frameshift indels (VEST-indel) as pathogenic or benign. We apply 24 features, including a new "PubMed" feature, to estimate a gene's importance in human disease. When compared with four existing indel classifiers, our method achieves a drastically reduced false-positive rate, improving specificity by as much as 90%. This approach of estimating gene importance might be generally applicable to missense and other bioinformatics pathogenicity predictors, which often fail to achieve high specificity. Finally, we tested all possible meta-predictors that can be obtained from combining the four different indel classifiers using Boolean conjunctions and disjunctions, and derived a meta-predictor with improved performance over any individual method. © 2015 The Authors. Human Mutation published by Wiley Periodicals, Inc.
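The final step described in this abstract, combining several binary classifiers with Boolean conjunctions and disjunctions, can be illustrated with a small sketch. This is not the authors' code: the classifier names `A` to `D`, the toy calls, and the restriction to pure AND and pure OR combinations over subsets (rather than every Boolean formula) are assumptions made for illustration only.

```python
from itertools import combinations

def sensitivity(pred, truth):
    """Fraction of truly pathogenic variants (truth True) that are called pathogenic."""
    positives = [p for p, t in zip(pred, truth) if t]
    return sum(1 for p in positives if p) / len(positives)

def specificity(pred, truth):
    """Fraction of truly benign variants (truth False) that are called benign."""
    negatives = [p for p, t in zip(pred, truth) if not t]
    return sum(1 for p in negatives if not p) / len(negatives)

def meta_predictors(calls):
    """Yield (name, predictions) for AND and OR over every subset of two or more classifiers."""
    names = sorted(calls)
    for size in range(2, len(names) + 1):
        for subset in combinations(names, size):
            columns = list(zip(*(calls[n] for n in subset)))
            yield " AND ".join(subset), [all(c) for c in columns]
            yield " OR ".join(subset), [any(c) for c in columns]

# Hypothetical toy data: True means "pathogenic".
truth = [True, True, True, False, False, False]
calls = {
    "A": [True, True, False, True, False, False],
    "B": [True, True, True, False, True, False],
    "C": [True, False, True, False, False, True],
    "D": [True, True, True, True, True, False],
}

results = {
    name: (sensitivity(pred, truth), specificity(pred, truth))
    for name, pred in meta_predictors(calls)
}
```

A conjunction can only lower the false-positive count of its members, at the cost of sensitivity, while a disjunction does the opposite, which is why enumerating both kinds is a natural way to search for a better trade-off.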
NASA Astrophysics Data System (ADS)
Liao, Zhijun; Wang, Xinrui; Zeng, Yeting; Zou, Quan
2016-12-01
The Dishevelled/EGL-10/Pleckstrin (DEP) domain-containing (DEPDC) proteins have seven members. However, whether this superfamily can be distinguished from other proteins based only on amino acid sequences remains unknown. Here, we describe a computational method to segregate DEPDCs and non-DEPDCs. First, we examined the Pfam numbers of the known DEPDCs and used the longest sequences for each Pfam to construct a phylogenetic tree. Subsequently, we extracted 188-dimensional (188D) and 20D features of DEPDCs and non-DEPDCs and classified them with a random forest classifier. We also mined the motifs of human DEPDCs to find the related domains. Finally, we designed experimental verification methods for human DEPDC expression at the mRNA level in hepatocellular carcinoma (HCC) and adjacent normal tissues. The phylogenetic analysis showed that the DEPDC superfamily can be divided into three clusters. Moreover, the 188D and 20D features can both be used to effectively distinguish the two protein types. Motif analysis revealed that the DEP and RhoGAP domains were common in human DEPDCs, and that human HCC and the adjacent tissues widely expressed DEPDCs. However, their regulation was not identical. In conclusion, we successfully constructed a binary classifier for DEPDCs and experimentally verified their expression in human HCC tissues.
Nikolaidis, Nikolas; Nei, Masatoshi
2004-03-01
We have identified the Hsp70 gene superfamily of the nematode Caenorhabditis briggsae and investigated the evolution of these genes in comparison with Hsp70 genes from C. elegans, Drosophila, and yeast. The Hsp70 genes are classified into three monophyletic groups according to their subcellular localization, namely, cytoplasm (CYT), endoplasmic reticulum (ER), and mitochondria (MT). The Hsp110 genes can be classified into the polyphyletic CYT group and the monophyletic ER group. The different Hsp70 and Hsp110 groups appeared to evolve following the model of divergent evolution. This model can also explain the evolution of the ER and MT genes. On the other hand, the CYT genes are divided into heat-inducible and constitutively expressed genes. The constitutively expressed genes have evolved more or less following the birth-and-death process, and the rates of gene birth and gene death are different between the two nematode species. By contrast, some heat-inducible genes show an intraspecies phylogenetic clustering. This suggests that they are subject to sequence homogenization resulting from gene conversion-like events. In addition, the heat-inducible genes show high levels of sequence conservation in both intra-species and inter-species comparisons, and in most cases, amino acid sequence similarity is higher than nucleotide sequence similarity. This indicates that purifying selection also plays an important role in maintaining high sequence similarity among paralogous Hsp70 genes. Therefore, we suggest that the CYT heat-inducible genes have been subjected to a combination of purifying selection, birth-and-death process, and gene conversion-like events.
Delineating slowly and rapidly evolving fractions of the Drosophila genome.
Keith, Jonathan M; Adams, Peter; Stephen, Stuart; Mattick, John S
2008-05-01
Evolutionary conservation is an important indicator of function and a major component of bioinformatic methods to identify non-protein-coding genes. We present a new Bayesian method for segmenting pairwise alignments of eukaryotic genomes while simultaneously classifying segments into slowly and rapidly evolving fractions. We also describe an information criterion similar to the Akaike Information Criterion (AIC) for determining the number of classes. Working with pairwise alignments enables detection of differences in conservation patterns among closely related species. We analyzed three whole-genome and three partial-genome pairwise alignments among eight Drosophila species. Three distinct classes of conservation level were detected. Sequences comprising the most slowly evolving component were consistent across a range of species pairs, and constituted approximately 62-66% of the D. melanogaster genome. Almost all (>90%) of the aligned protein-coding sequence is in this fraction, suggesting much of it (comprising the majority of the Drosophila genome, including approximately 56% of non-protein-coding sequences) is functional. The size and content of the most rapidly evolving component was species dependent, and varied from 1.6% to 4.8%. This fraction is also enriched for protein-coding sequence (while containing significant amounts of non-protein-coding sequence), suggesting it is under positive selection. We also classified segments according to conservation and GC content simultaneously. This analysis identified numerous sub-classes of those identified on the basis of conservation alone, but was nevertheless consistent with that classification. Software, data, and results available at www.maths.qut.edu.au/-keithj/. Genomic segments comprising the conservation classes available in BED format.
Genomewide Function Conservation and Phylogeny in the Herpesviridae
Albà, M. Mar; Das, Rhiju; Orengo, Christine A.; Kellam, Paul
2001-01-01
The Herpesviridae are a large group of well-characterized double-stranded DNA viruses for which many complete genome sequences have been determined. We have extracted protein sequences from all predicted open reading frames of 19 herpesvirus genomes. Sequence comparison and protein sequence clustering methods have been used to construct herpesvirus protein homologous families. This resulted in 1692 proteins being clustered into 243 multiprotein families and 196 singleton proteins. Predicted functions were assigned to each homologous family based on genome annotation and published data and each family classified into seven broad functional groups. Phylogenetic profiles were constructed for each herpesvirus from the homologous protein families and used to determine conserved functions and genomewide phylogenetic trees. These trees agreed with molecular-sequence-derived trees and allowed greater insight into the phylogeny of ungulate and murine gammaherpesviruses. PMID:11156614
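The idea of a phylogenetic profile in this abstract, a presence/absence vector over homologous protein families for each genome, can be sketched as follows. The family and genome names, the toy data, and the choice of Jaccard distance are hypothetical and are not taken from the paper.

```python
def build_profiles(families, genomes):
    """Return {genome: [0/1 per family]} with families in sorted-name order."""
    family_names = sorted(families)
    return {
        g: [1 if g in families[f] else 0 for f in family_names]
        for g in genomes
    }

def jaccard_distance(a, b):
    """1 - |shared families| / |families present in either genome|."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 1.0 - both / either if either else 0.0

def core_families(families, genomes):
    """Families found in every genome, i.e. candidate conserved functions."""
    return sorted(f for f, members in families.items() if set(genomes) <= set(members))

# Hypothetical toy input: family name -> set of genomes that encode a member.
genomes = ["v1", "v2", "v3"]
families = {
    "famA": {"v1", "v2", "v3"},
    "famB": {"v1", "v2"},
    "famC": {"v3"},
}

profiles = build_profiles(families, genomes)
core = core_families(families, genomes)
```

A matrix of pairwise distances between such profiles is one possible input to a distance-based tree-building method; the abstract does not state which method the authors used.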
Search for soliton modes in helical poly-γ-benzyl-l-glutamate
NASA Astrophysics Data System (ADS)
Renthal, Robert; Taboada, J.
1989-07-01
Solid α-helical poly(γ-benzyl-L-glutamate) was examined at low temperature for evidence of the unusual temperature-dependent vibrational mode found by Careri and co-workers in solid acetanilide and attributed to a soliton wave trapped in protein-like hydrogen bonds. We have confirmed the anomaly in acetanilide; however, a similar temperature-dependent mode was not observed in poly(γ-benzyl-L-glutamate). These results indicate that anharmonic amide modes may only be present in certain α-helical structures. Two new low-frequency modes (180 and 90 cm⁻¹) are observed for poly(γ-benzyl-L-glutamate).
NASA Astrophysics Data System (ADS)
Raczkowska, Anna; Kowalczuk, Piotr; Sagan, Slawomir; Zablocka, Monika; Stedmon, Colin; Granskog, Mats
2017-04-01
Water mass exchange between the Atlantic Ocean and the Arctic Ocean occurs in the Nordic Seas, and this process represents a crucial component of the northern hemisphere climate system. The Nordic Seas are dominated by Atlantic Waters (AW) and Polar Waters (PW), and by water formed in the mixing process or by local modifications such as precipitation and sea-ice melt. Classification of water masses only on the basis of temperature, salinity or density does not take into account the different sources of fresh water in the Nordic Seas. In this study we propose that the signal measured by the in situ three-channel WET Star fluorometer could be a useful tool for characterization of dissolved organic matter (DOM) and refinement of water mass classification. Spectral properties of Chromophoric Dissolved Organic Matter and Fluorescent Dissolved Organic Matter (CDOM and FDOM) were characterized in different water masses along a section across the Fram Strait at 79°N as well as in the Nordic Seas in 2014 and 2015. Observations of CDOM and FDOM were carried out with an in situ three-channel WET Labs WET Star fluorometer and with Excitation Emission Matrix spectra (EEMs) measured in water samples. The WET Labs WET Star three-channel in situ fluorometer was designed to measure emission of humic and protein-like FDOM fractions. The instrument's output was calibrated against the respective fluorescence intensity of EEMs measured with an Aqualog fluorometer (Horiba Scientific) at excitation and emission ranges corresponding to the in situ fluorometer channels. The correctness of the calibration was confirmed by an empirical linear relationship between WET Star in situ fluorescence intensities and aCDOM(350) derived from water samples. The measured WET Star fluorometer signal made it possible to assess the distribution of different FDOM fractions in the Nordic Seas. 
The distribution of humic-like fluorescence intensity as a function of salinity revealed three distinct mixing curves: the first indicates mixing between surface PW diluted by sea-ice melt and the core of PW from the East Greenland Current, the second implies a transition from PW to AW, and the third indicates modification of AW by sea-ice melt in the area of the Western and Northern Spitsbergen Shelf. Furthermore, the fluorescence intensity of the humic-like DOM fraction is very low and remains practically constant in the core of AW. In the AW there is a strong subsurface maximum of chlorophyll a fluorescence, which was aligned with the protein-like fraction of DOM. The linear relationship between phytoplankton fluorescence and the fluorescence intensity of the protein-like DOM fraction showed that phytoplankton was the primary source of the protein-like fraction of DOM in the AW.
Candidate new rotavirus species in Schreiber's bats, Serbia.
Bányai, Krisztián; Kemenesi, Gábor; Budinski, Ivana; Földes, Fanni; Zana, Brigitta; Marton, Szilvia; Varga-Kugler, Renáta; Oldal, Miklós; Kurucz, Kornélia; Jakab, Ferenc
2017-03-01
The genus Rotavirus comprises eight species designated A to H and one tentative species, Rotavirus I. In a virus metagenomic analysis of Schreiber's bats sampled in Serbia in 2014, we obtained sequences likely representing a novel rotavirus species. Whole genome sequencing and phylogenetic analysis classified the representative strain into a tentative tenth rotavirus species, which we provisionally called Rotavirus J. The novel virus shared a maximum of 50% amino acid sequence identity within the VP6 gene with currently known members of the genus. This study extends our understanding of the genetic diversity of rotaviruses in bats. Copyright © 2016 Elsevier B.V. All rights reserved.
Huang, Yin-Fu; Wang, Chia-Ming; Liou, Sing-Wu
2013-01-01
A hybrid self-adaptive harmony search and back-propagation mining system was proposed to discover weighted patterns in human intron sequences. By testing the weights under a lazy nearest neighbor classifier, the numerical results revealed the significance of these weighted patterns. Comparing these weighted patterns with the popular intron consensus model, it is clear that the discovered weighted patterns make the originally ambiguous 5SS and 3SS header patterns more specific and concrete.
Wang, Chia-Ming; Liou, Sing-Wu
2013-01-01
A hybrid self-adaptive harmony search and back-propagation mining system was proposed to discover weighted patterns in human intron sequences. By testing the weights under a lazy nearest neighbor classifier, the numerical results revealed the significance of these weighted patterns. Comparing these weighted patterns with the popular intron consensus model, it is clear that the discovered weighted patterns make the originally ambiguous 5SS and 3SS header patterns more specific and concrete. PMID:23737711
NASA Astrophysics Data System (ADS)
Maier, Oskar; Wilms, Matthias; von der Gablentz, Janina; Krämer, Ulrike; Handels, Heinz
2014-03-01
Automatic segmentation of ischemic stroke lesions in magnetic resonance (MR) images is important in clinical practice and for neuroscientific trials. The key problem is to detect largely inhomogeneous regions of varying sizes, shapes and locations. We present a stroke lesion segmentation method based on local features extracted from multi-spectral MR data that are selected to model a human observer's discrimination criteria. A support vector machine classifier is trained on expert-segmented examples and then used to classify formerly unseen images. Leave-one-out cross validation on eight datasets with lesions of varying appearances is performed, showing our method to compare favourably with other published approaches in terms of accuracy and robustness. Furthermore, we compare a number of feature selectors and closely examine each feature's and MR sequence's contribution.
Porras-Alfaro, Andrea; Liu, Kuan-Liang; Kuske, Cheryl R; Xie, Gary
2014-02-01
We compared the classification accuracy of two sections of the fungal internal transcribed spacer (ITS) region, individually and combined, and the 5' section (about 600 bp) of the large-subunit rRNA (LSU), using a naive Bayesian classifier and BLASTN. A hand-curated ITS-LSU training set of 1,091 sequences and a larger training set of 8,967 ITS region sequences were used. Of the factors evaluated, database composition and quality had the largest effect on classification accuracy, followed by fragment size and use of a bootstrap cutoff to improve classification confidence. The naive Bayesian classifier and BLASTN gave similar results at higher taxonomic levels, but the classifier was faster and more accurate at the genus level when a bootstrap cutoff was used. All of the ITS and LSU sections performed well (>97.7% accuracy) at higher taxonomic ranks from kingdom to family, and differences between them were small at the genus level (within 0.66 to 1.23%). When full-length sequence sections were used, the LSU outperformed the ITS1 and ITS2 fragments at the genus level, but the ITS1 and ITS2 showed higher accuracy when smaller fragment sizes of the same length and a 50% bootstrap cutoff were used. In a comparison using the larger ITS training set, ITS1 and ITS2 had very similar accuracy classification for fragments between 100 and 200 bp. Collectively, the results show that any of the ITS or LSU sections we tested provided comparable classification accuracy to the genus level and underscore the need for larger and more diverse classification training sets.
Liu, Kuan-Liang; Kuske, Cheryl R.
2014-01-01
We compared the classification accuracy of two sections of the fungal internal transcribed spacer (ITS) region, individually and combined, and the 5′ section (about 600 bp) of the large-subunit rRNA (LSU), using a naive Bayesian classifier and BLASTN. A hand-curated ITS-LSU training set of 1,091 sequences and a larger training set of 8,967 ITS region sequences were used. Of the factors evaluated, database composition and quality had the largest effect on classification accuracy, followed by fragment size and use of a bootstrap cutoff to improve classification confidence. The naive Bayesian classifier and BLASTN gave similar results at higher taxonomic levels, but the classifier was faster and more accurate at the genus level when a bootstrap cutoff was used. All of the ITS and LSU sections performed well (>97.7% accuracy) at higher taxonomic ranks from kingdom to family, and differences between them were small at the genus level (within 0.66 to 1.23%). When full-length sequence sections were used, the LSU outperformed the ITS1 and ITS2 fragments at the genus level, but the ITS1 and ITS2 showed higher accuracy when smaller fragment sizes of the same length and a 50% bootstrap cutoff were used. In a comparison using the larger ITS training set, ITS1 and ITS2 had very similar accuracy classification for fragments between 100 and 200 bp. Collectively, the results show that any of the ITS or LSU sections we tested provided comparable classification accuracy to the genus level and underscore the need for larger and more diverse classification training sets. PMID:24242255
Chumbe, Ana; Izquierdo-Lara, Ray; Tataje, Luis; Gonzalez, Rosa; Cribillero, Giovana; González, Armando E; Fernández-Díaz, Manolo; Icochea, Eliana
2017-03-01
Infections of poultry with virulent strains of avian paramyxovirus 1 (APMV-1), also known as Newcastle disease viruses (NDVs), cause Newcastle disease (ND). This highly contagious disease affects poultry and many other species of birds worldwide. In countries where the disease is prevalent, constant monitoring and characterization of isolates causing outbreaks are necessary. In this study, we report the results of pathogenicity testing and phylogenetic analyses of seven NDVs isolated from several regions of Peru between 2004 and 2015. Six viruses had intracerebral pathogenicity indices (ICPIs) of between 1.75 and 1.88, corresponding to a velogenic pathotype. The remaining virus had an ICPI of 0.00, corresponding to a lentogenic pathotype. These results were consistent with amino acid sequences at the fusion protein (F) cleavage site. All velogenic isolates had the polybasic amino acid sequence 112 RRQKR↓F 117 at the F cleavage site. Phylogenetic analyses of complete F gene sequences showed that all isolates are classified in class II of APMV-1. The velogenic viruses are classified in genotype XII, while the lentogenic virus is classified in genotype II, closely related to the LaSota vaccine strain. Moreover, tree topology, bootstrap values, and genetic distances observed within genotype XII resulted in the identification of novel subgenotypes XIIa (in South America) and XIIb (in China) and possibly two clades within genotype XIIa. All velogenic Peruvian viruses belonged to subgenotype XIIa. Overall, our results confirm the presence of genotype XII in Peru and suggest that it is the prevalent genotype currently circulating in our country. The phylogenetic characterization of these isolates helps to characterize the evolution of NDV and may help with the development of vaccines specific to our regional necessities.
Zhao, Jiaduo; Gong, Weiguo; Tang, Yuzhen; Li, Weihong
2016-01-20
In this paper, we propose an effective human and nonhuman pyroelectric infrared (PIR) signal recognition method to reduce PIR detector false alarms. First, using the mathematical model of the PIR detector, we analyze the physical characteristics of the human and nonhuman PIR signals; second, based on the analysis results, we propose an empirical mode decomposition (EMD)-based symbolic dynamic analysis method for the recognition of human and nonhuman PIR signals. In the proposed method, first, we extract the detailed features of a PIR signal into five symbol sequences using an EMD-based symbolization method, then, we generate five feature descriptors for each PIR signal through constructing five probabilistic finite state automata with the symbol sequences. Finally, we use a weighted voting classification strategy to classify the PIR signals with their feature descriptors. Comparative experiments show that the proposed method can effectively classify the human and nonhuman PIR signals and reduce PIR detector's false alarms.
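Two pieces of the pipeline described above lend themselves to a short sketch: estimating transition probabilities from a symbol sequence (a simple stand-in for a probabilistic finite state automaton) and combining per-descriptor decisions by weighted voting. The first-order state definition, the function names, the weights and the labels below are all assumptions for illustration; the abstract does not give the authors' actual construction.

```python
from collections import Counter, defaultdict

def transition_probabilities(symbols):
    """Estimate P(next symbol | current symbol) from one symbol sequence."""
    counts = defaultdict(Counter)
    for current, following in zip(symbols, symbols[1:]):
        counts[current][following] += 1
    return {
        state: {nxt: n / sum(c.values()) for nxt, n in c.items()}
        for state, c in counts.items()
    }

def weighted_vote(votes, weights):
    """Return the label whose supporting voters have the largest total weight."""
    totals = {}
    for label, weight in zip(votes, weights):
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=totals.get)

# Hypothetical example: five descriptors each cast one vote.
probs = transition_probabilities("aabab")
decision = weighted_vote(
    ["human", "nonhuman", "human", "nonhuman", "nonhuman"],
    [0.3, 0.1, 0.3, 0.1, 0.1],
)
```

In this toy case the two heavily weighted descriptors outvote the three lightly weighted ones, which is the point of weighting: a simple majority would have returned the other label.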
Liu, Jiemeng; Wang, Haifeng; Yang, Hongxing; Zhang, Yizhe; Wang, Jinfeng; Zhao, Fangqing; Qi, Ji
2013-01-01
Compared with traditional algorithms for long metagenomic sequence classification, characterizing microorganisms’ taxonomic and functional abundance based on tens of millions of very short reads is much more challenging. We describe an efficient composition and phylogeny-based algorithm [Metagenome Composition Vector (MetaCV)] to classify very short metagenomic reads (75–100 bp) into specific taxonomic and functional groups. We applied MetaCV to the Meta-HIT data (371-Gb 75-bp reads of 109 human gut metagenomes), and this single-read-based, instead of assembly-based, classification has a high resolution to characterize the composition and structure of human gut microbiota, especially for low abundance species. Most strikingly, it took MetaCV only 10 days to do all the computation work on a server with five 24-core nodes. To our knowledge, MetaCV, benefiting from the strategy of composition comparison, is the first algorithm that can classify millions of very short reads within affordable time. PMID:22941634
Skeleton-based human action recognition using multiple sequence alignment
NASA Astrophysics Data System (ADS)
Ding, Wenwen; Liu, Kai; Cheng, Fei; Zhang, Jin; Li, YunSong
2015-05-01
Human action recognition and analysis has been an active research topic in computer vision for many years. This paper presents a method to represent human actions based on trajectories consisting of 3D joint positions. The method first decomposes an action into a sequence of meaningful atomic actions (actionlets), and then labels the actionlets with letters of the English alphabet according to the Davies-Bouldin index value. An action can therefore be represented as a sequence of actionlet symbols, which preserves the temporal order of occurrence of the actionlets. Finally, we employ sequence comparison to classify multiple actions using a string matching algorithm (Needleman-Wunsch). The effectiveness of the proposed method is evaluated on datasets captured by commodity depth cameras. Experiments on three challenging 3D action datasets show promising results.
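Once actions are encoded as strings of actionlet symbols, classification by sequence comparison can be sketched with a plain Needleman-Wunsch global alignment score. The scoring values (match +1, mismatch -1, gap -1), the template strings and the nearest-template rule below are illustrative assumptions and are not parameters reported by the authors.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty string costs one gap per symbol.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[-1][-1]

def classify(query, templates):
    """Assign the label of the template string that aligns best with the query."""
    return max(templates, key=lambda label: needleman_wunsch(query, templates[label]))

# Hypothetical actionlet strings for two action classes.
templates = {"wave": "ABCE", "kick": "XYZ"}
label = classify("ABCD", templates)
```

Because the alignment tolerates insertions and deletions, two performances of the same action that differ by a skipped or repeated actionlet can still score highly against each other.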
A laboratory information management system for DNA barcoding workflows.
Vu, Thuy Duong; Eberhardt, Ursula; Szöke, Szániszló; Groenewald, Marizeth; Robert, Vincent
2012-07-01
This paper presents a laboratory information management system for DNA sequences (LIMS) created and based on the needs of a DNA barcoding project at the CBS-KNAW Fungal Biodiversity Centre (Utrecht, the Netherlands). DNA barcoding is a global initiative for species identification through simple DNA sequence markers. We aim at generating barcode data for all strains (or specimens) included in the collection (currently ca. 80 k). The LIMS has been developed to better manage large amounts of sequence data and to keep track of the whole experimental procedure. The system has allowed us to classify strains more efficiently as the quality of sequence data has improved, and as a result, up-to-date taxonomic names have been given to strains and more accurate correlation analyses have been carried out.
Automating document classification for the Immune Epitope Database
Wang, Peng; Morgan, Alexander A; Zhang, Qing; Sette, Alessandro; Peters, Bjoern
2007-01-01
Background: The Immune Epitope Database contains information on immune epitopes curated manually from the scientific literature. As in similar projects in other knowledge domains, significant effort is spent on identifying which articles are relevant for this purpose. Results: We here report our experience in automating this process using Naïve Bayes classifiers trained on 20,910 abstracts classified by domain experts. Improvements on the basic classifier performance were made by a) utilizing information stored in PubMed beyond the abstract itself, b) applying standard feature selection criteria, and c) extracting domain-specific feature patterns that, e.g., identify peptide sequences. We have implemented the classifier in the curation process, determining whether abstracts are clearly relevant, clearly irrelevant, or cannot be classified with certainty, in which case the abstracts are manually classified. Testing this classification scheme on an independent dataset, we achieve 95% sensitivity and specificity in the 51.1% of abstracts that were automatically classified. Conclusion: By implementing text classification, we have sped up the reference selection process without sacrificing the sensitivity or specificity of the human expert classification. This study provides both practical recommendations for users of text classification tools and a large dataset that can serve as a benchmark for tool developers. PMID:17655769
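The three-way routing described in this abstract (clearly relevant, clearly irrelevant, or sent to a human curator) can be sketched with a tiny multinomial Naive Bayes model and two probability thresholds. The class `TinyNaiveBayes`, the thresholds 0.9 and 0.1, and the four training snippets are hypothetical; the abstract does not report the cutoffs or features actually used.

```python
import math
from collections import Counter

class TinyNaiveBayes:
    """Multinomial Naive Bayes over whitespace tokens with add-one smoothing."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.prior = {c: labels.count(c) / len(labels) for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.counts[label].update(doc.lower().split())
        self.vocab = set().union(*self.counts.values())
        return self

    def posterior(self, doc, target):
        """Return P(target | doc); words outside the training vocabulary are ignored."""
        log_scores = {}
        for c in self.classes:
            total = sum(self.counts[c].values()) + len(self.vocab)
            log_p = math.log(self.prior[c])
            for word in doc.lower().split():
                if word in self.vocab:
                    log_p += math.log((self.counts[c][word] + 1) / total)
            log_scores[c] = log_p
        peak = max(log_scores.values())  # subtract the peak for numerical stability
        norm = sum(math.exp(s - peak) for s in log_scores.values())
        return math.exp(log_scores[target] - peak) / norm

def triage(p_relevant, upper=0.9, lower=0.1):
    """Route an abstract by classifier confidence (thresholds are assumptions)."""
    if p_relevant >= upper:
        return "relevant"
    if p_relevant <= lower:
        return "irrelevant"
    return "manual review"

docs = [
    "epitope peptide binding assay",
    "t cell epitope mapping peptide",
    "galaxy cluster redshift survey",
    "stellar mass galaxy survey",
]
labels = ["relevant", "relevant", "irrelevant", "irrelevant"]
model = TinyNaiveBayes().fit(docs, labels)
```

Widening the gap between the two thresholds sends more abstracts to manual review and raises accuracy on the automatically classified remainder, which is the trade-off behind the 51.1% automatic coverage reported above.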
Algorithms exploiting ultrasonic sensors for subject classification
NASA Astrophysics Data System (ADS)
Desai, Sachi; Quoraishee, Shafik
2009-09-01
We propose a series of techniques that exploit micro-Doppler ultrasonic sensors to characterize detected mammalian targets from their physiological movements, captured as a series of robust features. A combination of novel and conventional digital signal processing techniques is arranged so that it can classify a series of walkers. The feature extraction processes develop a robust feature space that discriminates between movements generated by bipeds and quadrupeds, each further subdivided into large or small. These movements can be exploited to provide specific information about a given signature, dividing it into a series of subset signatures and using wavelets to generate start/stop times. A series of spectrograms of the signature shows distinct differences, and using kurtosis we generate an envelope detector capable of isolating each of the step cycles generated during a walk. The walk cycle is defined as one complete sequence of walking or running, beginning when the foot pushes off the ground and concluding when it returns to the ground. This time information segments the events, which are readily seen in the spectrogram but obscured in the temporal domain, into individual walk sequences. Each walking sequence is then translated into a three-dimensional waterfall plot defining the expected energy value associated with the motion at a particular instant of time and frequency. This value is repeatable for each class and can be used to discriminate the events. Highly reliable classification is realized with a classifier trained on a candidate sample space derived from the gyrations created by the motion of actors of interest. 
The classifier developed herein can classify events as adult humans, children, horses, or dogs at potentially high rates based on the tested sample space. The algorithm developed and described here lends utility to a sensor modality that is underused for human intrusion detection because of its currently high false alarm rate. The active ultrasonic sensor, coupled in a multi-modal sensor suite with binary, less descriptive sensors such as seismic devices, can realize a greater accuracy rate for detection of persons of interest for homeland purposes.
Analysis of composition-based metagenomic classification.
Higashi, Susan; Barreto, André da Motta Salles; Cantão, Maurício Egidio; de Vasconcelos, Ana Tereza Ribeiro
2012-01-01
An essential step of a metagenomic study is the taxonomic classification, that is, the identification of the taxonomic lineage of the organisms in a given sample. The taxonomic classification process involves a series of decisions. Currently, in the context of metagenomics, such decisions are usually based on empirical studies that consider one specific type of classifier. In this study we propose a general framework for analyzing the impact that several decisions can have on the classification problem. Instead of focusing on any specific classifier, we define a generic score function that provides a measure of the difficulty of the classification task. Using this framework, we analyze the impact of the following parameters on the taxonomic classification problem: (i) the length of n-mers used to encode the metagenomic sequences, (ii) the similarity measure used to compare sequences, and (iii) the type of taxonomic classification, which can be conventional or hierarchical, depending on whether the classification process occurs in a single shot or in several steps according to the taxonomic tree. We defined a score function that measures the degree of separability of the taxonomic classes under a given configuration induced by the parameters above. We conducted an extensive computational experiment and found out that reasonable values for the parameters of interest could be (i) intermediate values of n, the length of the n-mers; (ii) any similarity measure, because all of them resulted in similar scores; and (iii) the hierarchical strategy, which performed better in all of the cases. As expected, short n-mers generate lower configuration scores because they give rise to frequency vectors that represent distinct sequences in a similar way. On the other hand, large values for n result in sparse frequency vectors that represent differently metagenomic fragments that are in fact similar, also leading to low configuration scores. 
Regarding the similarity measure, in contrast to our expectations, the variation of the measures did not change the configuration scores significantly. Finally, the hierarchical strategy was more effective than the conventional strategy, which suggests that, instead of using a single classifier, one should adopt multiple classifiers organized as a hierarchy.
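The effect of n-mer length noted in this abstract can be illustrated with a toy sketch. The helper names and the choice of cosine similarity are assumptions made for illustration; the study compares several similarity measures and does not prescribe this code.

```python
from collections import Counter
from itertools import product
import math


def nmer_frequencies(seq, n, alphabet="ACGT"):
    """Relative frequency of every possible n-mer, in a fixed lexicographic order."""
    seq = seq.upper()
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    keys = ["".join(p) for p in product(alphabet, repeat=n)]
    total = sum(counts[k] for k in keys)
    if total == 0:
        return [0.0] * len(keys)
    return [counts[k] / total for k in keys]


def cosine_similarity(u, v):
    """Cosine of the angle between two frequency vectors (0.0 if either is all zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

For the toy sequences "ACGT" and "TGCA", the 1-mer vectors are identical, while the 2-mer vectors share no entries. This is a small instance of short n-mers representing distinct sequences in a similar way.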
Classification of cardiac arrhythmias using competitive networks.
Leite, Cicilia R M; Martin, Daniel L; Sizilio, Glaucia R A; Dos Santos, Keylly E A; de Araujo, Bruno G; Valentim, Ricardo A M; Neto, Adriao D D; de Melo, Jorge D; Guerreiro, Ana M G
2010-01-01
The information generated by sensors that collect a patient's vital signs forms continuous and unbounded data sequences. Traditionally, monitoring this information requires special equipment and programs, which process and react to the continuous entry of data from different origins. The purpose of this study is to analyze the data produced by such biomedical devices, in this case the electrocardiogram (ECG). Processing uses a neural classifier, a Kohonen competitive neural network, to detect whether the ECG shows any cardiac arrhythmia. The results indicate that it is possible to classify an ECG signal and thereby detect whether or not it exhibits any alteration relative to normality.
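As a rough illustration of competitive learning, the sketch below performs one winner-take-all update: the prototype closest to the input is moved toward it. The function name, the learning rate, and the two-dimensional toy vectors are assumptions; the abstract does not describe the network's features or parameters.

```python
def competitive_step(weights, x, lr=0.1):
    """One winner-take-all update on a list of prototype vectors.

    `weights` is modified in place; the index of the winning unit is returned.
    """
    # Squared Euclidean distance from the input to every prototype.
    dists = [sum((wi - xi) ** 2 for wi, xi in zip(w, x)) for w in weights]
    winner = dists.index(min(dists))
    # Only the winning prototype moves, by a fraction `lr` of the gap.
    weights[winner] = [wi + lr * (xi - wi) for wi, xi in zip(weights[winner], x)]
    return winner
```

After training, an input could be labelled by the class associated with its winning unit, which is one common way such networks are used for classification.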
Protein classification using sequential pattern mining.
Exarchos, Themis P; Papaloukas, Costas; Lampros, Christos; Fotiadis, Dimitrios I
2006-01-01
Protein classification in terms of fold recognition can be employed to determine the structural and functional properties of a newly discovered protein. In this work sequential pattern mining (SPM) is utilized for sequence-based fold recognition. One of the most efficient SPM algorithms, cSPADE, is employed for protein primary structure analysis. Then a classifier uses the extracted sequential patterns for classifying proteins of unknown structure in the appropriate fold category. The proposed methodology exhibited an overall accuracy of 36% in a multi-class problem of 17 candidate categories. The classification performance reaches up to 65% when the three most probable protein folds are considered.
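The two figures in this abstract correspond to top-1 and top-3 accuracy. A minimal sketch of that metric follows, assuming each protein comes with a dictionary of per-fold scores; the data layout and function name are hypothetical.

```python
def top_k_accuracy(score_rows, true_labels, k):
    """Fraction of cases whose true label is among the k highest-scoring labels.

    `score_rows` is a list of {label: score} dictionaries, one per case.
    """
    hits = 0
    for scores, truth in zip(score_rows, true_labels):
        ranked = sorted(scores, key=scores.get, reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(true_labels)
```

Top-k accuracy can only stay the same or grow as k increases, which is why the three-candidate figure is higher than the single-prediction figure.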
Li, Lei; Liu, Ming; Wu, Meng; Jiang, Chunyu; Chen, Xiaofen; Ma, Xiaoyan; Liu, Jia; Li, Weitao; Tang, Xiaoxue; Li, Zhongpei
2017-05-01
The swine effluent studied was collected from large-scale pig farms located in Yujiang County, Jiangxi Province, China, and duckweed (Spirodela polyrrhiza) was selected to treat the effluent. The purpose of this study was to elucidate the effects of duckweed growth on the composition of dissolved organic matter in swine effluent. Throughout the experimental period, the concentrations of organic matter were determined regularly, and three-dimensional excitation-emission matrix (3DEEM) spectroscopy was used to characterize the fluorescent components. Compared with the no-duckweed treatments (controls), duckweed phytoremediation increased the specific ultraviolet absorbance at 254 nm (SUVA254) by a final average of 34.4% and the removal rate of DOC by a final average of 28.0%. Four fluorescent components were identified in the swine effluent: two protein-like (tryptophan, tyrosine) and two humic-like (fulvic acids, humic acids) components. For all treatments, the concentrations of protein-like components decreased by a final average of 69.0%. With duckweed growth, the concentrations of humic-like components increased by a final average of 123.5% relative to controls. Significant positive correlations were observed between SUVA254 and the humic-like components. Compared with the controls, the humification index (HIX) increased by a final average of 9.0% in the duckweed treatments. Meanwhile, duckweed growth led to a lower biological index (BIX) and a higher proportion of microbially derived fulvic acids than in controls. In conclusion, duckweed remediation not only enhanced the removal rate of organic matter in swine effluent but also increased the proportion of humic substances. Copyright © 2016. Published by Elsevier B.V.
Pisani, Oliva; Yamashita, Youhei; Jaffé, Rudolf
2011-07-01
This study shows that light exposure of flocculent material (floc) from the Florida Coastal Everglades (FCE) results in significant dissolved organic matter (DOM) generation through photo-dissolution processes. Floc was collected at two sites along the Shark River Slough (SRS) and irradiated with artificial sunlight. The DOM generated was characterized using elemental analysis and excitation emission matrix fluorescence coupled with parallel factor analysis. To investigate the seasonal variations of DOM photo-generation from floc, this experiment was performed in typical dry (April) and wet (October) seasons for the FCE. Our results show that the dissolved organic carbon (DOC) for samples incubated under dark conditions displayed a relatively small increase, suggesting that microbial processes and/or leaching might be minor processes in comparison to photo-dissolution for the generation of DOM from floc. On the other hand, DOC increased substantially (as much as 259 mgC gC(-1)) for samples exposed to artificial sunlight, indicating the release of DOM through photo-induced alterations of floc. The fluorescence intensity of both humic-like and protein-like components also increased with light exposure. Terrestrial humic-like components were found to be the main contributors (up to 70%) to the chromophoric DOM (CDOM) pool, while protein-like components comprised a relatively small percentage (up to 16%) of the total CDOM. Simultaneously to the generation of DOC, both total dissolved nitrogen and soluble reactive phosphorus also increased substantially during the photo-incubation period. Thus, the photo-dissolution of floc can be an important source of DOM to the FCE environment, with the potential to influence nutrient dynamics in this system. Copyright © 2011 Elsevier Ltd. All rights reserved.
Li, Penghui; Chen, Ling; Zhang, Wen; Huang, Qinghui
2015-01-01
To investigate the seasonal and interannual dynamics of dissolved organic matter (DOM) in the Yangtze Estuary, surface and bottom water samples in the Yangtze Estuary and its adjacent sea were collected and characterized using fluorescence excitation-emission matrices (EEMs) and parallel factor analysis (PARAFAC) in both dry and wet seasons in 2012 and 2013. Two protein-like components and three humic-like components were identified. Three humic-like components decreased linearly with increasing salinity (r>0.90, p<0.001), suggesting their distribution could primarily be controlled by physical mixing. By contrast, two protein-like components fell below the theoretical mixing line, largely due to microbial degradation and removal during mixing. Higher concentrations of humic-like components found in 2012 could be attributed to higher freshwater discharge relative to 2013. There was a lack of systematic patterns for three humic-like components between seasons and years, probably due to variations of other factors such as sources and characteristics. Highest concentrations of fluorescent components, observed in estuarine turbidity maximum (ETM) region, could be attributed to sediment resuspension and subsequent release of DOM, supported by higher concentrations of fluorescent components in bottom water than in surface water at two stations where sediments probably resuspended. Meanwhile, photobleaching could be reflected from the changes in the ratios between fluorescence intensity (Fmax) of humic-like components and chromophoric DOM (CDOM) absorption coefficient (a355) along the salinity gradient. This study demonstrates the abundance and composition of DOM in estuaries are controlled not only by hydrological conditions, but also by its sources, characteristics and related estuarine biogeochemical processes. PMID:26107640
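The "theoretical mixing line" mentioned in this abstract is the straight line between the river and sea end-members on a concentration versus salinity plot. The sketch below computes it and the deviation of an observation from it; the function names and the end-member numbers used in testing are hypothetical, not data from the study.

```python
def conservative_mix(salinity, s_river, c_river, s_sea, c_sea):
    """Concentration expected from pure physical mixing of two end-members."""
    if s_sea == s_river:
        raise ValueError("end-member salinities must differ")
    frac_sea = (salinity - s_river) / (s_sea - s_river)
    return c_river + (c_sea - c_river) * frac_sea


def mixing_deviation(observed, salinity, s_river, c_river, s_sea, c_sea):
    """Observed minus expected; negative values suggest removal during mixing."""
    return observed - conservative_mix(salinity, s_river, c_river, s_sea, c_sea)
```

A component that tracks the line behaves conservatively, as reported for the humic-like components, while values below the line are consistent with removal, as reported for the protein-like components.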
Phong, Diep Dinh; Hur, Jin
2015-12-15
Photocatalytic degradation of dissolved organic matter (DOM) using TiO2 as a catalyst and UVA as a light source was examined under various experimental settings with different TiO2 doses, solution pH, and the light intensities. The changes in UV absorbance and fluorescence with the irradiation time followed a pseudo-first order model much better than those of dissolved organic carbon. In general, the degradation rates were increased by higher TiO2 doses and light intensities. However, the exact photocatalytic responses of DOM to the irradiation were affected by many other factors such as aggregation of TiO2, light scattering, hydroxyl radicals produced, and DOM sorption on TiO2. Fluorescence excitation-emission matrix (EEM) coupled with parallel factor analysis (PARAFAC) revealed that the DOM changes in fluorescence could be described by the combinations of four dissimilar components including one protein-like, two humic-like, and one terrestrial humic-like components, each of which followed well the pseudo-first order model. The photocatalytic degradation rates were higher for protein-like versus humic-like component, whereas the opposite order was displayed for the degradation rates in the absence of TiO2, suggesting different dominant mechanisms operating between the systems with and without TiO2. Our results based on EEM-PARAFAC provided new insights into the underlying mechanisms associated with the photocatalytic degradation of DOM as well as the potential environmental impact of the treated water. This study demonstrated a successful application of EEM-PARAFAC for photocatalytic systems via directly comparing the kinetic rates of the individual DOM components with different compositions. Copyright © 2015 Elsevier Ltd. All rights reserved.
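A pseudo-first-order model assumes C(t) = C(0) exp(-k t), so ln(C(t)/C(0)) falls linearly with time and k is the negative slope. The sketch below estimates k by least squares through the origin. It assumes the first sample is taken at t = 0 and that all values are positive; the function name and fitting choice are illustrative, not the authors' procedure.

```python
import math


def pseudo_first_order_k(times, values):
    """Estimate k in C(t) = C(0) * exp(-k * t) from a time series.

    Assumes times[0] == 0 and strictly positive values; fits
    ln(C/C0) = -k * t by least squares through the origin.
    """
    c0 = values[0]
    if any(v <= 0 for v in values):
        raise ValueError("values must be strictly positive")
    ys = [math.log(v / c0) for v in values]
    denom = sum(t * t for t in times)
    if denom == 0:
        raise ValueError("need at least one sample with t > 0")
    return -sum(t * y for t, y in zip(times, ys)) / denom
```

Fitting each PARAFAC component separately in this way would give one rate constant per component, which is the kind of comparison the abstract describes.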
Jiang, De-gang; Huang, Qing-hui; Li, Jian-hua
2010-07-01
As an important component of dissolved organic matter (DOM), chromophoric dissolved organic matter (CDOM) plays a central role in the global biogeochemical carbon cycle. Macroalgae are essential producers in aquatic ecosystems and can release a considerable part of their photosynthetic products as CDOM. Changes in the optical properties of CDOM were therefore studied by spectrometry, under natural field conditions, for the filamentous green macroalga Chadophorasle found in the tidal flats of the brackish Lake Beihu. Humic-like and protein-like fluorescence peaks detected in fluorescence excitation-emission matrix spectra (EEMS) changed little in the control experiment but increased dramatically in the incubation experiment. Parallel factor analysis (PARAFAC) of the excitation-emission matrices resolved four CDOM components (C1, C2, C3 and C4), related to the humic-like fluorescence peaks A(C) and M and the protein-like fluorescence peaks B and T, respectively. In the incubation experiment the four components increased by 211.5%, 255.8%, 75.3% and 129.3%, respectively, whereas in the control experiment the components changed little, except for C1, which decreased by 34.3%. The absorption coefficient alpha(355) increased by 92.9% and correlated significantly and positively (P < 0.01) with the four components in the incubation experiment, whereas in the control experiment alpha(355) decreased by 59.8% and correlated only with C1 (P < 0.05). The M and S values, parameters representing CDOM molecular weight and composition, were smaller in the incubation experiment than in the control experiment, which indicates that aromatic, macromolecular CDOM is produced during the growth of Chadophorasle. All results indicate that the growth of Chadophorasle can change the content and composition of CDOM.
Zhang, Yunlin; Yin, Yan; Feng, Longqing; Zhu, Guangwei; Shi, Zhiqiang; Liu, Xiaohan; Zhang, Yuanzhi
2011-10-15
Chromophoric dissolved organic matter (CDOM) is an important optically active substance that transports nutrients, heavy metals, and other pollutants from terrestrial to aquatic systems and is used as a measure of water quality. To investigate how the source and composition of CDOM changes in both space and time, we used chemical, spectroscopic, and fluorescence analyses to characterize CDOM in Lake Tianmuhu (a drinking water source) and its catchment in China. Parallel factor analysis (PARAFAC) identified three individual fluorophore moieties that were attributed to humic-like and protein-like materials in 224 water samples collected between December 2008 and September 2009. The upstream rivers contained significantly higher concentrations of CDOM than did the lake water (a(350) of 4.27±2.51 and 2.32±0.59 m(-1), respectively), indicating that the rivers carried a substantial load of organic matter to the lake. Of the three main rivers that flow into Lake Tianmuhu, the Pingqiao River brought in the most CDOM from the catchment to the lake. CDOM absorption and the microbial and terrestrial humic-like components, but not the protein-like component, were significantly higher in the wet season than in other seasons, indicating that the frequency of rainfall and runoff could significantly impact the quantity and quality of CDOM collected from the catchment. The different relationships between the maximum fluorescence intensities of the three PARAFAC components, CDOM absorption, and chemical oxygen demand (COD) concentration in riverine and lake water indicated the difference in the composition of CDOM between Lake Tianmuhu and the rivers that feed it. This study demonstrates the utility of combining excitation-emission matrix fluorescence and PARAFAC to study CDOM dynamics in inland waters. Copyright © 2011 Elsevier Ltd. All rights reserved.
Zhou, Lei; Zhou, Yongqiang; Hu, Yang; Cai, Jian; Bai, Chengrong; Shao, Keqiang; Gao, Guang; Zhang, Yunlin; Jeppesen, Erik; Tang, Xiangming
2017-12-01
Lake Bosten is the largest oligosaline lake in arid northwestern China, and water from its tributaries and evaporation control the water balance of the lake. In this study, water quality and chromophoric dissolved organic matter (CDOM) absorption and fluorescence were investigated in different seasons to elucidate how hydraulic connectivity and evaporation may affect the water quality and variability of CDOM in the lake. Mean suspended solids and turbidity were significantly higher in the upstream tributaries than in the lake, the difference being notably more pronounced in the wet than in the dry season. A markedly higher mean first principal component (PC1) score, which was significantly positively related to protein-like components, and a considerably lower fluorescence peak integration ratio - I C :I T , indicative of the terrestrial humic-like CDOM contribution percentage, were observed in the lake than in the upstream tributaries. Correspondingly, notably higher contribution percentages of terrestrial humic-like components were observed in the river mouth areas than in the remaining lake regions. Furthermore, significantly higher mean turbidity, and notably lower mean conductivity and salinity, were recorded in the southwestern Kaidu river mouth than in the remaining lake regions in the wet season. Notably higher mean salinity is recorded in Lake Bosten than in upstream tributaries. Autochthonous protein-like associated amino-acids and also PC1 scores increased significantly with increasing salinity. We conclude that the dynamics of water quality and CDOM composition in remote arid Lake Bosten are strongly driven by evaporation and also the hydraulic connectivity between the upstream tributaries and the downstream lake. Copyright © 2017 Elsevier Ltd. All rights reserved.
Lei, Kai-Jian; Lin, Ya-Ming; Ren, Jing; Bai, Ling; Miao, Yu-Chen; An, Guo-Yong; Song, Chun-Peng
2016-01-01
The microRNA156 (miR156)-modulated SQUAMOSA PROMOTER BINDING PROTEIN-LIKE (SPL) is involved in diverse biological processes that include growth, development and metabolism. Here, we report that the Arabidopsis miR156 and SPL3 as regulators play important roles in phosphate (Pi) deficiency response. MiR156 was induced during Pi starvation whereas SPL3 expression was repressed. Phenotypes of reduced rhizosphere acidification and decreased anthocyanin accumulation were observed in 35S:MIM156 (via target mimicry) transgenic plants under Pi deficiency. The content and uptake of Pi in 35S:MIM156 Arabidopsis plants were increased compared with wild-type (Col-0 ecotype) plants. 35S:rSPL3 seedlings showed similar anthocyanin accumulation and Pi content phenotypes to those of 35S:MIM156 plants. Chromatin immunoprecipitation and an electrophoretic mobility shift assay indicated that the SPL3 protein directly bound to GTAC motifs in the PLDZ2, Pht1;5 and miR399f promoters. The expression of several Pi starvation-induced genes was increased in 35S:MIM156 and 35S:rSPL3 plants, including high-affinity Pi transporters, Mt4/TPS1-like genes and phosphatases. Collectively, our results suggest that the miR156-SPL3-Pht1;5 (-PLDZ2 and -miR399f) pathways constitute a component of the Pi deficiency-induced regulatory mechanism of Arabidopsis. © The Author 2015. Published by Oxford University Press on behalf of Japanese Society of Plant Physiologists. All rights reserved. For permissions, please email: journals.permissions@oup.com.
Jian, Qianyun; Boyer, Treavor H; Yang, Xiuhong; Xia, Beicheng; Yang, Xin
2016-06-01
Dissolved organic matter (DOM) was leached from leaves of two trees commonly grown in subtropical regions, Pinus elliottii (commonly known as slash pine) and Schima superba (S. superba), and its degradation pattern and potential for forming disinfection byproducts (DBPs) were evaluated. The leaves were exposed in the field for up to one year before leaching. The DOM leached from slash pine litter contained on average 10.4 mg of dissolved organic carbon (DOC) per gram of dry weight; for S. superba the average was 37.2 mg-DOC/g-dry weight. Ultraviolet and visible light absorbance, fluorescence, and molecular weight analysis indicated that more aromatic/humic and higher molecular weight compounds are formed as leaf litter ages. A 4-component parallel factor analysis of the fluorescence data showed that the intensity of peaks related with protein-like components decreased gradually during biodegradation, while that of peaks attributed to humic-acid-like components increased continuously. Fresh slash pine leachates formed on average 40.0 μg of trihalomethane (THM) per milligram of DOC, while S. superba leachates formed 45.6 μg. THM formation showed peak values of 55.7 μg/mg DOC for slash pine and 74.9 μg/mg DOC for S. superba after 8 months of aging. The formation of haloacetonitrile (HAN) and trichloronitromethane (TCNM) increased with increasing leaf age, while chloral hydrate (CH) formation did not show such a trend. Specific UV absorbance showed some positive correlation with DBPs, but humic-acid-like and protein-like absorbance peaks correlated with CH and TCNM yields in only some leaf samples. Copyright © 2016 Elsevier Ltd. All rights reserved.
Lee, Lobin A; Arvai, Kevin J; Jones, Dan
2015-07-01
As DNA sequencing of multigene panels becomes routine for cancer samples in the clinical laboratory, an efficient process for classifying variants has become more critical. Determining which germline variants are significant for cancer disposition and which somatic mutations are integral to cancer development or therapy response remains difficult, even for well-studied genes such as BRCA1 and TP53. We compare and contrast the general principles and lines of evidence commonly used to distinguish the significance of cancer-associated germline and somatic genetic variants. The factors important in each step of the analysis pipeline are reviewed, as are some of the publicly available annotation tools. Given the range of indications and uses of cancer sequencing assays, including diagnosis, staging, prognostication, theranostics, and residual disease detection, the need for flexible methods for scoring of variants is discussed. The usefulness of protein prediction tools and multimodal risk-based or Bayesian approaches are highlighted. Using TET2 variants encountered in hematologic neoplasms, several examples of this multifactorial approach to classifying sequence variants of unknown significance are presented. Although there are still significant gaps in the publicly available data for many cancer genes that limit the broad application of explicit algorithms for variant scoring, the elements of a more rigorous model are outlined. Copyright © 2015 American Society for Investigative Pathology and the Association for Molecular Pathology. Published by Elsevier Inc. All rights reserved.
Landone Vescovo, Ignacio A; Golemba, Marcelo D; Di Lello, Federico A; Culasso, Andrés C A; Levin, Gustavo; Ruberto, Lucas; Mac Cormack, Walter P; López, José L
2014-01-01
Bacterial richness in maritime Antarctica has been poorly described to date. The phylogenetic affiliation of free-living seawater microbial assemblages was studied at three locations near the Argentinean Jubany Station during two Antarctic summers. Sixty cloned 16S rRNA sequences were analyzed: they were phylogenetically affiliated to Alphaproteobacteria (30/60 clones), Gammaproteobacteria (19/60 clones), Betaproteobacteria (2/60 clones) and Cytophaga-Flavobacteriia-Bacteroides (CFB; 3/60 clones), while six of the 60 clones could not be classified. Both Alphaproteobacteria and Gammaproteobacteria showed several endemic and previously undescribed sequences. Moreover, the absence of Cyanobacteria sequences in our samples is remarkable. In conclusion, we report a rich sequence assemblage composed of isolates that are widely divergent from one another and distant from the most closely related sequences currently deposited in data banks. Copyright © 2014 Asociación Argentina de Microbiología. Published by Elsevier España. All rights reserved.
Li, Zhongshan; Liu, Zhenwei; Jiang, Yi; Chen, Denghui; Ran, Xia; Sun, Zhong Sheng; Wu, Jinyu
2017-01-01
Exome sequencing has been widely used to identify the genetic variants underlying human genetic disorders for clinical diagnosis, but identifying pathogenic sequence variants among the huge number of benign ones is complicated and challenging. Here, we describe a new Web server named mirVAFC for prioritizing pathogenic sequence variants from clinical exome sequencing (CES) variant data of a single individual or a family. mirVAFC comprehensively annotates sequence variants, filters out most irrelevant variants using custom criteria, classifies variants into categories by estimated pathogenicity, and finally prioritizes pathogenic variants based on these classifications and mutation effects. Case studies using different types of datasets for different diseases, drawn from publications and our in-house data, show that mirVAFC efficiently identifies the same pathogenic candidates as the original work in each case. Overall, the mirVAFC Web server is specifically developed for identifying pathogenic sequence variants from family-based CES data using classification-based prioritization. The mirVAFC Web server is freely accessible at https://www.wzgenomics.cn/mirVAFC/. © 2016 WILEY PERIODICALS, INC.
Randomized Prediction Games for Adversarial Machine Learning.
Rota Bulo, Samuel; Biggio, Battista; Pillai, Ignazio; Pelillo, Marcello; Roli, Fabio
In spam and malware detection, attackers exploit randomization to obfuscate malicious data and increase their chances of evading detection at test time, e.g., malware code is typically obfuscated using random strings or byte sequences to hide known exploits. Interestingly, randomization has also been proposed to improve security of learning algorithms against evasion attacks, as it results in hiding information about the classifier to the attacker. Recent work has proposed game-theoretical formulations to learn secure classifiers, by simulating different evasion attacks and modifying the classification function accordingly. However, both the classification function and the simulated data manipulations have been modeled in a deterministic manner, without accounting for any form of randomization. In this paper, we overcome this limitation by proposing a randomized prediction game, namely, a noncooperative game-theoretic formulation in which the classifier and the attacker make randomized strategy selections according to some probability distribution defined over the respective strategy set. We show that our approach allows one to improve the tradeoff between attack detection and false alarms with respect to the state-of-the-art secure classifiers, even against attacks that are different from those hypothesized during design, on application examples including handwritten digit recognition, spam, and malware detection.
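The idea of randomized strategy selection can be illustrated with a toy finite game in which each player draws a strategy from its own probability distribution. The payoff matrix, function names, and the zero-sum setting used in testing are assumptions for illustration; the paper's formulation is more general and is not reproduced here.

```python
import random


def expected_payoff(payoff, p, q):
    """Expected payoff to the row player when rows are drawn from p and columns from q."""
    return sum(
        p[i] * payoff[i][j] * q[j]
        for i in range(len(p))
        for j in range(len(q))
    )


def sample_strategy(probs, rng):
    """Draw one strategy index according to the mixed strategy `probs`."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

In a matching-pennies style game, a deterministic player is fully predictable, whereas a uniform mixed strategy makes the expected payoff independent of what the opponent does.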
Maillard, J C; Martinez, D; Bensaid, A
1996-07-23
One hundred and twenty-seven Brahman cattle from several locations in Martinique (FWI), reared under different environmental conditions, were followed over three years and checked for clinical signs of dermatophilosis. To confirm that these animals had been in contact with the pathogen Dermatophilus congolensis, their sera were tested by ELISA. On the basis of this epidemiological study, 12 animals were classified as resistant (seropositive without clinical signs), belonging to herds in which the prevalence of the disease ranged from 25 to nearly 98%. Eighteen animals classified as highly susceptible displayed severe characteristic skin lesions. These 30 selected animals were typed for class I antigens of the major histocompatibility complex (MHC). MHC class II genes were analyzed using the polymerase chain reaction (PCR) and restriction fragment length polymorphism (RFLP) techniques on exon 2 of the bovine leucocyte antigen (BoLA) DRB3 gene. Several alleles were found, according to the patterns provided by the restriction enzymes used: Fnu 4HI, Dpn II, Hae III, and Rsa I. A particular sequence "EIAY" at amino acid positions 66/67/74/78, located in the antigen recognition sites (ARS), was found in the 12 animals classified as resistant, and 10 of them also displayed class I BoLA-A8 specificity. On the other hand, only 3 of the 18 susceptible animals showed both the BoLA-DRB3 "EIAY" sequence and BoLA-A8 specificity. Interestingly, a serine residue at position 30 of the ARS was found in 8 of the susceptible animals and was completely absent from all resistant animals. Furthermore, within the same animal, the serine at position 30 and the EIAY sequence were never found on the same haplotype. These results show a strong correlation between resistance to dermatophilosis and the association of MHC haplotypes: the BoLA-A8 specificity and the BoLA-DRB3 "EIAY" sequence at ARS positions 66/67/74/78, with the lack of serine at position 30. To confirm these results, family segregation studies are in progress and some interesting observations have been obtained.
Piégu, Benoît; Bire, Solenne; Arensburger, Peter; Bigot, Yves
2015-05-01
The increase of publicly available sequencing data has allowed for rapid progress in our understanding of genome composition. As new information becomes available we should constantly be updating and reanalyzing existing and newly acquired data. In this report we focus on transposable elements (TEs) which make up a significant portion of nearly all sequenced genomes. Our ability to accurately identify and classify these sequences is critical to understanding their impact on host genomes. At the same time, as we demonstrate in this report, problems with existing classification schemes have led to significant misunderstandings of the evolution of both TE sequences and their host genomes. In a pioneering publication Finnegan (1989) proposed classifying all TE sequences into two classes based on transposition mechanisms and structural features: the retrotransposons (class I) and the DNA transposons (class II). We have retraced how ideas regarding TE classification and annotation in both prokaryotic and eukaryotic scientific communities have changed over time. This has led us to observe that: (1) a number of TEs have convergent structural features and/or transposition mechanisms that have led to misleading conclusions regarding their classification, (2) the evolution of TEs is similar to that of viruses by having several unrelated origins, (3) there might be at least 8 classes and 12 orders of TEs including 10 novel orders. In an effort to address these classification issues we propose: (1) the outline of a universal TE classification, (2) a set of methods and classification rules that could be used by all scientific communities involved in the study of TEs, and (3) a 5-year schedule for the establishment of an International Committee for Taxonomy of Transposable Elements (ICTTE). Copyright © 2015 Elsevier Inc. All rights reserved.
Altet, Laura; Francino, Olga; Solano-Gallego, Laia; Renier, Corinne; Sánchez, Armand
2002-01-01
The NRAMP1 gene (Slc11a1) encodes an ion transporter protein involved in the control of intraphagosomal replication of parasites and in macrophage activation. It has been described in mice as the determinant of natural resistance or susceptibility to infection with antigenically unrelated pathogens, including Leishmania. Our aims were to sequence and map the canine Slc11a1 gene and to identify mutations that may be associated with resistance or susceptibility to Leishmania infection. The canine Slc11a1 gene has been mapped to dog chromosome CFA37 and covers 9 kb, including a 700-bp promoter region, 15 exons, and a polymorphic microsatellite in intron 1. It encodes a 547-amino-acid protein that has over 87% identity with the Slc11a1 proteins of different mammalian species. A case-control study with 33 resistant and 84 susceptible dogs showed an association between allele 145 of the microsatellite and susceptible dogs. Sequence variant analysis was performed by direct sequencing of the cDNA and the promoter region of four unrelated beagles experimentally infected with Leishmania infantum to search for possible functional mutations. Two of the dogs were classified as susceptible and the other two were classified as resistant based on their immune responses. Two important mutations were found in susceptible dogs: a G-rich region in the promoter that was common to both animals and a complete deletion of exon 11, which encodes the consensus transport motif of the protein, in the unique susceptible dog that needed an additional and prolonged treatment to avoid continuous relapses. A study with a larger dog population would be required to prove the association of these sequence variants with disease susceptibility. PMID:12010961
Effective Feature Selection for Classification of Promoter Sequences.
K, Kouser; P G, Lavanya; Rangarajan, Lalitha; K, Acharya Kshitish
2016-01-01
Exploring novel computational methods in making sense of biological data has not only been a necessity, but also productive. A part of this trend is the search for more efficient in silico methods/tools for analysis of promoters, which are parts of DNA sequences that are involved in regulation of expression of genes into other functional molecules. Promoter regions vary greatly in their function based on the sequence of nucleotides and the arrangement of protein-binding short-regions called motifs. In fact, the regulatory nature of the promoters seems to be largely driven by the selective presence and/or the arrangement of these motifs. Here, we explore computational classification of promoter sequences based on the pattern of motif distributions, as such classification can pave a new way of functional analysis of promoters and to discover the functionally crucial motifs. We make use of Position Specific Motif Matrix (PSMM) features for exploring the possibility of accurately classifying promoter sequences using some of the popular classification techniques. The classification results on the complete feature set are low, perhaps due to the huge number of features. We propose two ways of reducing features. Our test results show improvement in the classification output after the reduction of features. The results also show that decision trees outperform SVM (Support Vector Machine), KNN (K Nearest Neighbor) and ensemble classifier LibD3C, particularly with reduced features. The proposed feature selection methods outperform some of the popular feature transformation methods such as PCA and SVD. Also, the methods proposed are as accurate as MRMR (feature selection method) but much faster than MRMR. Such methods could be useful to categorize new promoters and explore regulatory mechanisms of gene expressions in complex eukaryotic species.
PreCisIon: PREdiction of CIS-regulatory elements improved by gene's positION.
Elati, Mohamed; Nicolle, Rémy; Junier, Ivan; Fernández, David; Fekih, Rim; Font, Julio; Képès, François
2013-02-01
Conventional approaches to predict transcriptional regulatory interactions usually rely on the definition of a shared motif sequence on the target genes of a transcription factor (TF). These efforts have been frustrated by the limited availability and accuracy of TF binding site motifs, usually represented as position-specific scoring matrices, which may match large numbers of sites and produce an unreliable list of target genes. To improve the prediction of binding sites, we propose to additionally use the unrelated knowledge of the genome layout. Indeed, it has been shown that co-regulated genes tend to be either neighbors or periodically spaced along the whole chromosome. This study demonstrates that respective gene positioning carries significant information. This novel type of information is combined with traditional sequence information by a machine learning algorithm called PreCisIon. To optimize this combination, PreCisIon builds a strong gene target classifier by adaptively combining weak classifiers based on either local binding sequence or global gene position. This strategy generically paves the way to the optimized incorporation of any future advances in gene target prediction based on local sequence, genome layout or on novel criteria. With the current state of the art, PreCisIon consistently improves methods based on sequence information only. This is shown by implementing a cross-validation analysis of the 20 major TFs from two phylogenetically remote model organisms. For Bacillus subtilis and Escherichia coli, respectively, PreCisIon achieves on average an area under the receiver operating characteristic curve of 70 and 60%, a sensitivity of 80 and 70% and a specificity of 60 and 56%. The newly predicted gene targets are demonstrated to be functionally consistent with previously known targets, as assessed by analysis of Gene Ontology enrichment or of the relevant literature and databases.
supernovae: Photometric classification of supernovae
NASA Astrophysics Data System (ADS)
Charnock, Tom; Moss, Adam
2017-05-01
Supernovae classifies supernovae using their light curves directly as inputs to a deep recurrent neural network, which learns information from the sequence of observations. Observational time and filter fluxes are used as inputs; since the inputs are agnostic, additional data such as host galaxy information can also be included.
ProDeGe: A computational protocol for fully automated decontamination of genomes
Tennessen, Kristin; Andersen, Evan; Clingenpeel, Scott; ...
2015-06-09
Single amplified genomes and genomes assembled from metagenomes have enabled the exploration of uncultured microorganisms at an unprecedented scale. However, both these types of products are plagued by contamination. Since these genomes are now being generated in a high-throughput manner and sequences from them are propagating into public databases to drive novel scientific discoveries, rigorous quality controls and decontamination protocols are urgently needed. Here, we present ProDeGe (Protocol for fully automated Decontamination of Genomes), the first computational protocol for fully automated decontamination of draft genomes. ProDeGe classifies sequences into two classes, clean and contaminant, using a combination of homology and feature-based methodologies. On average, 84% of sequence from the non-target organism is removed from the data set (specificity) and 84% of the sequence from the target organism is retained (sensitivity). Lastly, the procedure operates successfully at a rate of ~0.30 CPU core hours per megabase of sequence and can be applied to any type of genome sequence.
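The two figures quoted in this abstract can be read as base-pair-weighted rates: sensitivity is the share of target sequence retained, and specificity is the share of non-target sequence removed. The sketch below only illustrates that arithmetic; the `Contig` tuple layout, the function name and the example numbers are assumptions made for illustration and are not part of ProDeGe.

```python
from typing import Iterable, Tuple

# Hypothetical record: (length in bp, truly from target organism?, kept by the filter?)
Contig = Tuple[int, bool, bool]


def decontamination_metrics(contigs: Iterable[Contig]) -> Tuple[float, float]:
    """Return (sensitivity, specificity), both weighted by sequence length."""
    target_total = target_kept = 0
    contam_total = contam_removed = 0
    for length, is_target, kept in contigs:
        if is_target:
            target_total += length
            if kept:
                target_kept += length
        else:
            contam_total += length
            if not kept:
                contam_removed += length
    if target_total == 0 or contam_total == 0:
        raise ValueError("need both target and contaminant sequence")
    return target_kept / target_total, contam_removed / contam_total


# Made-up example data, not taken from the paper.
example = [
    (5000, True, True),
    (3000, True, True),
    (2000, True, False),   # target contig wrongly discarded
    (4000, False, False),
    (1000, False, True),   # contaminant contig wrongly kept
]
sensitivity, specificity = decontamination_metrics(example)
print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f}")
```

Weighting by length rather than by contig count follows the abstract's wording ("84% of sequence"), but whether the authors computed it exactly this way is an assumption.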
Kickoff to Conflict: A Sequence Analysis of Intra-State Conflict-Preceding Event Structures
D'Orazio, Vito; Yonamine, James E.
2015-01-01
While many studies have suggested or assumed that the periods preceding the onset of intra-state conflict are similar across time and space, few have empirically tested this proposition. Using the Integrated Crisis Early Warning System's domestic event data in Asia from 1998–2010, we subject this proposition to empirical analysis. We code the similarity of government-rebel interactions in sequences preceding the onset of intra-state conflict to those preceding further periods of peace using three different metrics: Euclidean, Levenshtein, and mutual information. These scores are then used as predictors in a bivariate logistic regression to forecast whether we are likely to observe conflict in neither, one, or both of the states. We find that our model accurately classifies cases where both sequences precede peace, but struggles to distinguish between cases in which one sequence escalates to conflict and where both sequences escalate to conflict. These findings empirically suggest that generalizable patterns exist between event sequences that precede peace. PMID:25951105
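Of the three similarity metrics named in this abstract, the Levenshtein distance is the easiest to show on event sequences. A minimal sketch follows, assuming each period is encoded as a list of event codes; the codes used here are hypothetical placeholders and are not the Integrated Crisis Early Warning System coding.

```python
from typing import Sequence


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # delete item_a
                               current[j - 1] + 1,       # insert item_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


# Hypothetical event codes for two pre-onset periods.
seq_one = ["protest", "arrest", "protest", "clash"]
seq_two = ["protest", "protest", "clash"]
print(levenshtein(seq_one, seq_two))
```

A distance of this kind can then serve as one predictor in a regression, as the abstract describes; the regression itself is not sketched here.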
Genome-Wide Comparative Gene Family Classification
Frech, Christian; Chen, Nansheng
2010-01-01
Correct classification of genes into gene families is important for understanding gene function and evolution. Although gene families of many species have been resolved both computationally and experimentally with high accuracy, gene family classification in most newly sequenced genomes has not been done with the same high standard. This project has been designed to develop a strategy to effectively and accurately classify gene families across genomes. We first examine and compare the performance of computer programs developed for automated gene family classification. We demonstrate that some programs, including the hierarchical average-linkage clustering algorithm MC-UPGMA and the popular Markov clustering algorithm TRIBE-MCL, can reconstruct manual curation of gene families accurately. However, their performance is highly sensitive to parameter setting, i.e. different gene families require different program parameters for correct resolution. To circumvent the problem of parameterization, we have developed a comparative strategy for gene family classification. This strategy takes advantage of existing curated gene families of reference species to find suitable parameters for classifying genes in related genomes. To demonstrate the effectiveness of this novel strategy, we use TRIBE-MCL to classify chemosensory and ABC transporter gene families in C. elegans and its four sister species. We conclude that fully automated programs can establish biologically accurate gene families if parameterized accordingly. Comparative gene family classification finds optimal parameters automatically, thus allowing rapid insights into gene families of newly sequenced species. PMID:20976221
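The parameter sensitivity mentioned in this abstract is easier to see with the Markov clustering procedure written out. The sketch below is a plain-Python version of the generic MCL loop (expansion, then inflation) on a small similarity matrix. It is not the TRIBE-MCL program, and the matrix values, iteration count and pruning cut-off are illustrative assumptions.

```python
def normalize_columns(m):
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    out = [[0.0] * n for _ in range(n)]
    for j in range(n):
        total = sum(m[i][j] for i in range(n))
        for i in range(n):
            out[i][j] = m[i][j] / total
    return out


def mcl(similarity, inflation=2.0, iterations=50):
    """Cluster nodes of a symmetric similarity matrix with a basic MCL loop."""
    n = len(similarity)
    m = [[float(similarity[i][j]) for j in range(n)] for i in range(n)]
    for i in range(n):
        m[i][i] = max(m[i][i], 1.0)  # self-loops keep every column non-empty
    m = normalize_columns(m)
    for _ in range(iterations):
        # Expansion: one more random-walk step (matrix squared).
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: strengthen strong flows, weaken weak ones, renormalize.
        m = normalize_columns([[x ** inflation for x in row] for row in m])
    clusters = set()
    for row in m:
        members = frozenset(j for j, x in enumerate(row) if x > 1e-6)
        if members:  # non-empty rows belong to attractor nodes
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Two hypothetical gene groups joined by one weak similarity (nodes 2 and 3).
weakly_linked = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0.1, 0, 0],
    [0, 0, 0.1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
for value in (1.4, 2.0, 4.0):
    print(value, mcl(weakly_linked, inflation=value))
```

The comparative strategy in the abstract amounts to choosing the `inflation` value whose clusters best reproduce curated families in a reference species and then reusing it on related genomes.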
Chen, Peng; Li, Jinyan; Wong, Limsoon; Kuwahara, Hiroyuki; Huang, Jianhua Z; Gao, Xin
2013-08-01
Hot spot residues of proteins are fundamental interface residues that help proteins perform their functions. Detecting hot spots by experimental methods is costly and time-consuming. Sequential and structural information has been widely used in the computational prediction of hot spots. However, structural information is not always available. In this article, we investigated the problem of identifying hot spots using only physicochemical characteristics extracted from amino acid sequences. We first extracted 132 relatively independent physicochemical features from a set of the 544 properties in AAindex1, an amino acid index database. Each feature was utilized to train a classification model with a novel encoding schema for hot spot prediction by the IBk algorithm, an extension of the K-nearest neighbor algorithm. The combinations of the individual classifiers were explored and the classifiers that appeared frequently in the top performing combinations were selected. The hot spot predictor was built based on an ensemble of these classifiers and to work in a voting manner. Experimental results demonstrated that our method effectively exploited the feature space and allowed flexible weights of features for different queries. On the commonly used hot spot benchmark sets, our method significantly outperformed other machine learning algorithms and state-of-the-art hot spot predictors. The program is available at http://sfb.kaust.edu.sa/pages/software.aspx. Copyright © 2013 Wiley Periodicals, Inc.
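The ensemble described here, one nearest-neighbor classifier per physicochemical feature combined by voting, can be sketched as follows. This is a simplified illustration only: the feature values, labels and the choice of k are made up, and neither the paper's encoding schema nor its classifier-selection step is reproduced.

```python
from collections import Counter


def knn_predict(train, query, k=3):
    """1-D k-nearest-neighbor vote; train is a list of (value, label) pairs."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def ensemble_predict(train_rows, labels, query_row, feature_ids, k=3):
    """Train one single-feature kNN per feature id and take the majority vote."""
    votes = []
    for f in feature_ids:
        train = [(row[f], label) for row, label in zip(train_rows, labels)]
        votes.append(knn_predict(train, query_row[f], k))
    return Counter(votes).most_common(1)[0][0]


# Hypothetical residues described by three made-up feature values each.
train_rows = [
    [0.10, 1.0, 5.0], [0.20, 1.1, 4.8], [0.15, 0.9, 5.2],
    [0.90, 2.0, 1.0], [0.80, 2.2, 1.2], [0.85, 1.9, 0.9],
]
labels = ["non-hot"] * 3 + ["hot"] * 3
query = [0.82, 2.1, 4.9]
print(ensemble_predict(train_rows, labels, query, feature_ids=[0, 1, 2]))
```

Using an odd k and an odd number of single-feature classifiers avoids tied votes in the two-class case.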
Yamada, Kazuhiko; Kamimura, Eikichi; Kondo, Mariko; Tsuchiya, Kimiyuki; Nishida-Umehara, Chizuko; Matsuda, Yoichi
2006-02-01
We molecularly cloned new families of site-specific repetitive DNA sequences from BglII- and EcoRI-digested genomic DNA of the Syrian hamster (Mesocricetus auratus, Cricetinae, Rodentia) and characterized them by chromosome in situ hybridization and filter hybridization. They were classified into six different types of repetitive DNA sequence families according to chromosomal distribution and genome organization. The hybridization patterns of the sequences were consistent with the distribution of C-positive bands and/or Hoechst-stained heterochromatin. The centromeric major satellite DNA and sex chromosome-specific and telomeric region-specific repetitive sequences were conserved in the same genus (Mesocricetus) but divergent in different genera. The chromosome-2-specific sequence was conserved in two genera, Mesocricetus and Cricetulus, and a low copy number of repetitive sequences on the heterochromatic chromosome arms were conserved in the subfamily Cricetinae but not in the subfamily Calomyscinae. By contrast, the other type of repetitive sequences on the heterochromatic chromosome arms, which had sequence similarities to a LINE sequence of rodents, was conserved through the three subfamilies, Cricetinae, Calomyscinae and Murinae. The nucleotide divergence of the repetitive sequences of heterochromatin was well correlated with the phylogenetic relationships of the Cricetinae species, and each sequence has been independently amplified and diverged in the same genome.
von Kohn, Christopher; Kiełkowska, Agnieszka; Havey, Michael J
2013-12-01
Male-sterile (S) cytoplasm of onion is an alien cytoplasm introgressed into onion in antiquity and is widely used for hybrid seed production. Owing to the biennial generation time of onion, classical crossing takes at least 4 years to classify cytoplasms as S or normal (N) male-fertile. Molecular markers in the organellar DNAs that distinguish N and S cytoplasms are useful to reduce the time required to classify onion cytoplasms. In this research, we completed next-generation sequencing of the chloroplast DNAs of N- and S-cytoplasmic onions; we assembled and annotated the genomes in addition to identifying polymorphisms that distinguish these cytoplasms. The sizes (153 538 and 153 355 base pairs) and GC contents (36.8%) were very similar for the chloroplast DNAs of N and S cytoplasms, respectively, as expected given their close phylogenetic relationship. The size difference was primarily due to small indels in intergenic regions and a deletion in the accD gene of N-cytoplasmic onion. The structures of the onion chloroplast DNAs were similar to those of most land plants with large and small single copy regions separated by inverted repeats. Twenty-eight single nucleotide polymorphisms, two polymorphic restriction-enzyme sites, and one indel distributed across 20 chloroplast genes in the large and small single copy regions were selected and validated using diverse onion populations previously classified as N or S cytoplasmic using restriction fragment length polymorphisms. Although cytoplasmic male sterility is likely associated with the mitochondrial DNA, maternal transmission of the mitochondrial and chloroplast DNAs allows for polymorphisms in either genome to be useful for classifying onion cytoplasms to aid the development of hybrid onion cultivars.
NASA Technical Reports Server (NTRS)
Malone, T. B.; Micocci, A.
1975-01-01
The alternate methods of conducting a man-machine interface evaluation are classified as static and dynamic, and are evaluated. A dynamic evaluation tool is presented to provide for a determination of the effectiveness of the man-machine interface in terms of the sequence of operations (task and task sequences) and in terms of the physical characteristics of the interface. This dynamic checklist approach is recommended for shuttle and shuttle payload man-machine interface evaluations based on reduced preparation time, reduced data, and increased sensitivity of critical problems.
Walking Objectively Measured: Classifying Accelerometer Data with GPS and Travel Diaries
Kang, Bumjoon; Moudon, Anne V.; Hurvitz, Philip M.; Reichley, Lucas; Saelens, Brian E.
2013-01-01
Purpose: This study developed and tested an algorithm to classify accelerometer data as walking or non-walking using either GPS or travel diary data within a large sample of adults under free-living conditions. Methods: Participants wore an accelerometer and a GPS unit, and concurrently completed a travel diary for 7 consecutive days. Physical activity (PA) bouts were identified using accelerometry count sequences. PA bouts were then classified as walking or non-walking based on a decision-tree algorithm consisting of 7 classification scenarios. Algorithm reliability was examined relative to two independent analysts’ classification of a 100-bout verification sample. The algorithm was then applied to the entire set of PA bouts. Results: The 706 participants (mean age 51 years, 62% female, 80% non-Hispanic white, 70% college graduate or higher) yielded 4,702 person-days of data and a total of 13,971 PA bouts. The algorithm showed a mean agreement of 95% with the independent analysts. It classified physical activity into 8,170 (58.5%) walking bouts and 5,337 (38.2%) non-walking bouts; 464 (3.3%) bouts were not classified for lack of GPS and diary data. Nearly 70% of the walking bouts and 68% of the non-walking bouts were classified using only the objective accelerometer and GPS data. Travel diary data helped classify 30% of all bouts with no GPS data. The mean duration of PA bouts classified as walking was 15.2 min (SD=12.9). On average, participants had 1.7 walking bouts and 25.4 total walking minutes per day. Conclusions: GPS and travel diary information can be helpful in classifying most accelerometer-derived PA bouts into walking or non-walking behavior. PMID:23439414
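The first step in the Methods, identifying PA bouts from accelerometry count sequences, can be illustrated with a simple run-length scan. The threshold and minimum bout length below are hypothetical placeholders, since the abstract does not state the study's actual criteria, and the seven-scenario decision tree is not reproduced.

```python
def find_bouts(counts, threshold, min_epochs):
    """Return (start, end) index pairs, end exclusive, for runs of epochs
    whose count is >= threshold and that last at least min_epochs epochs."""
    bouts = []
    start = None
    for i, count in enumerate(counts):
        if count >= threshold:
            if start is None:
                start = i          # a candidate bout begins here
        elif start is not None:
            if i - start >= min_epochs:
                bouts.append((start, i))
            start = None           # run ended, long enough or not
    if start is not None and len(counts) - start >= min_epochs:
        bouts.append((start, len(counts)))  # run reaches the end of the record
    return bouts


# Made-up per-epoch counts; threshold and min_epochs are assumptions.
counts = [0, 600, 700, 650, 0, 900, 0, 800, 820, 810, 830]
print(find_bouts(counts, threshold=500, min_epochs=3))
```

Each detected bout would then be passed to the GPS- and diary-based classification step described in the abstract.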
voomDDA: discovery of diagnostic biomarkers and classification of RNA-seq data.
Zararsiz, Gokmen; Goksuluk, Dincer; Klaus, Bernd; Korkmaz, Selcuk; Eldem, Vahap; Karabulut, Erdem; Ozturk, Ahmet
2017-01-01
RNA-Seq is a recent and efficient technique that uses the capabilities of next-generation sequencing technology for characterizing and quantifying transcriptomes. One important task using gene-expression data is to identify a small subset of genes that can be used to build diagnostic classifiers particularly for cancer diseases. Microarray based classifiers are not directly applicable to RNA-Seq data due to its discrete nature. Overdispersion is another problem that requires careful modeling of mean and variance relationship of the RNA-Seq data. In this study, we present voomDDA classifiers: variance modeling at the observational level (voom) extensions of the nearest shrunken centroids (NSC) and the diagonal discriminant classifiers. VoomNSC is one of these classifiers and brings voom and NSC approaches together for the purpose of gene-expression based classification. For this purpose, we propose weighted statistics and put these weighted statistics into the NSC algorithm. The VoomNSC is a sparse classifier that models the mean-variance relationship using the voom method and incorporates voom's precision weights into the NSC classifier via weighted statistics. A comprehensive simulation study was designed and four real datasets are used for performance assessment. The overall results indicate that voomNSC performs as the sparsest classifier. It also provides the most accurate results together with power-transformed Poisson linear discriminant analysis, rlog transformed support vector machines and random forests algorithms. In addition to prediction purposes, the voomNSC classifier can be used to identify the potential diagnostic biomarkers for a condition of interest. Through this work, statistical learning methods proposed for microarrays can be reused for RNA-Seq data. An interactive web application is freely available at http://www.biosoft.hacettepe.edu.tr/voomDDA/.
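The sparsity of a nearest-shrunken-centroids classifier comes from soft-thresholding each gene's standardized class offsets: genes whose offsets all shrink to zero drop out of the model. A minimal sketch of that shrinkage step follows. The gene names and offset values are invented, and the voom precision weights and weighted statistics that define voomNSC are not modeled here.

```python
import math


def soft_threshold(d, delta):
    """Shrink d toward zero by delta; values within delta of zero become zero."""
    return math.copysign(max(abs(d) - delta, 0.0), d)


def shrink_centroid_offsets(offsets, delta):
    """offsets maps gene -> list of standardized class-centroid differences.
    Returns the shrunken offsets and the genes that remain in the classifier."""
    shrunken = {gene: [soft_threshold(d, delta) for d in ds]
                for gene, ds in offsets.items()}
    selected = [gene for gene, ds in shrunken.items()
                if any(d != 0.0 for d in ds)]
    return shrunken, selected


# Hypothetical two-class example.
offsets = {
    "geneA": [2.5, -2.5],
    "geneB": [0.3, -0.4],
    "geneC": [-1.2, 1.6],
}
shrunken, selected = shrink_centroid_offsets(offsets, delta=1.0)
print(selected)
```

Raising `delta` removes more genes, which is how a classifier of this family ends up with a small candidate biomarker set.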
Morvan syndrome: a rare cause of syndrome of inappropriate antidiuretic hormone secretion
DEMIRBAS, SEREF; AYKAN, MUSA BARIS; ZENGIN, HAYDAR; MAZMAN, SEMIR; SAGLAM, KENAN
2017-01-01
The syndrome of inappropriate antidiuretic hormone secretion (SIADH) accounts for an important part of hyponatremia cases. The cause of SIADH can almost always be identified. As a rare disorder, Morvan syndrome can be defined by the combination of peripheral nerve hyperexcitability, autonomic instability and neuropsychiatric features. Antibodies to voltage-gated potassium channels (anti-VGKC-Ab), including contactin-associated protein-like 2 antibodies (CASPR2-Ab) and leucine-rich glioma inactivated protein 1 antibodies (LGI1-Ab), were previously known for their potential association with this condition. We present a case of Morvan syndrome in a patient who presented with various neuropsychiatric symptoms and SIADH. PMID:28781533
Using RNA Sequencing to Classify Organisms into Three Primary Kingdoms.
ERIC Educational Resources Information Center
Evans, Robert H.
1983-01-01
Using the biochemical record to class archaebacteria, eukaryotes, and eubacteria involves abstractions difficult for the concrete learner. Therefore, a method is provided in which students discover some basic tenets of biochemical classification and apply them in a "hands-on" classification problem. The method involves use of RNA…
A Case for a Process Approach: The Warwick Experience.
ERIC Educational Resources Information Center
Screen, P.
1988-01-01
Describes the cyclical nature of a problem-solving sequence produced from observing children involved in the process. Discusses the generic qualities of science: (1) observing; (2) inferring; (3) classifying; (4) predicting; (5) controlling variables; and (6) hypothesizing. Explains the processes in use and advantages of a process-led course. (RT)
Survey on Classifying Human Actions Through Visual Sensors
2011-05-04
[47] Herrera, A., Beck, A., Bell, D., Miller, P., Wu, Q., Yan, W., “Behaviour Analysis and Prediction in Image Sequences Using Rough Sets...report TR-97-021, University of Berkeley, 1998 [83] DARPA Mind’s Eye Broad Agency Announcement, DARPA-BAA-10-53, 2010 www.darpa.mil/tcto/docs
USDA-ARS's Scientific Manuscript database
Strains from a collection of 3,639 diverse Bacillus thuringiensis isolates were classified based on phenotypic profiles resulting from six biochemical tests, including production of amylase (T), lecithinase (L), urease (U), acid from sucrose (S) and salicin (A), and the hydrolysis of esculin (E). St...
USDA-ARS's Scientific Manuscript database
A PCR-based method was used to classify 90 samples of nucleopolyhedrovirus (NPV; Baculoviridae: Alphabaculovirus) obtained worldwide from larvae of Heliothis virescens, Helicoverpa zea, and Helicoverpa armigera. Partial nucleotide sequencing and phylogenetic analysis of three highly conserved genes...
Activities to Promote Critical Thinking. Classroom Practices in Teaching English, 1986.
ERIC Educational Resources Information Center
National Council of Teachers of English, Urbana, IL.
Intended to involve students in language and communication study in such a way that significant thinking occurs, this collection of teaching ideas outlines ways to teach literature and composition that engage the students in such thinking processes as inferring, sequencing, predicting, classifying, problem solving, and synthesizing. The activities…
Supervised DNA Barcodes species classification: analysis, comparisons and results
2014-01-01
Background: Specific fragments, coming from short portions of DNA (e.g., mitochondrial, nuclear, and plastid sequences), have been defined as DNA Barcode and can be used as markers for organisms of the main life kingdoms. Species classification with DNA Barcode sequences has been proven effective on different organisms. Indeed, specific gene regions have been identified as Barcode: COI in animals, rbcL and matK in plants, and ITS in fungi. The classification problem assigns an unknown specimen to a known species by analyzing its Barcode. This task has to be supported with reliable methods and algorithms. Methods: In this work the efficacy of supervised machine learning methods to classify species with DNA Barcode sequences is shown. The Weka software suite, which includes a collection of supervised classification methods, is adopted to address the task of DNA Barcode analysis. Classifier families are tested on synthetic and empirical datasets belonging to the animal, fungus, and plant kingdoms. In particular, the function-based method Support Vector Machines (SVM), the rule-based RIPPER, the decision tree C4.5, and the Naïve Bayes method are considered. Additionally, the classification results are compared with respect to ad-hoc and well-established DNA Barcode classification methods. Results: Software that converts the DNA Barcode FASTA sequences to the Weka format is released, to adapt different input formats and to allow the execution of the classification procedure. The analysis of results on synthetic and real datasets shows that SVM and Naïve Bayes outperform on average the other considered classifiers, although they do not provide a human interpretable classification model. Rule-based methods have slightly inferior classification performances, but deliver the species specific positions and nucleotide assignments. On synthetic data the supervised machine learning methods obtain superior classification performances with respect to the traditional DNA Barcode classification methods. On empirical data their classification performances are at a comparable level to the other methods. Conclusions: The classification analysis shows that supervised machine learning methods are promising candidates for successfully handling the DNA Barcoding species classification problem, obtaining excellent performances. To conclude, a powerful tool to perform species identification is now available to the DNA Barcoding community. PMID:24721333
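The conversion step mentioned in the Results, from barcode FASTA records to Weka's ARFF input, might look roughly like the sketch below. It is not the authors' released converter. It assumes pre-aligned sequences of equal length, headers of the hypothetical form `species|specimen_id` with no spaces in the species name, and it writes any character other than A, C, G or T as `?`, the ARFF missing-value marker.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def fasta_to_arff(text, relation="barcodes"):
    """Encode each alignment position as a nominal attribute, species as class."""
    records = parse_fasta(text)
    if not records:
        raise ValueError("no FASTA records found")
    length = len(records[0][1])
    if any(len(seq) != length for _, seq in records):
        raise ValueError("sequences must be aligned to equal length")
    species = sorted({header.split("|")[0] for header, _ in records})
    lines = [f"@relation {relation}"]
    for pos in range(1, length + 1):
        lines.append(f"@attribute pos{pos} {{A,C,G,T}}")
    lines.append("@attribute species {" + ",".join(species) + "}")
    lines.append("@data")
    for header, seq in records:
        bases = [ch if ch in "ACGT" else "?" for ch in seq]
        lines.append(",".join(bases) + "," + header.split("|")[0])
    return "\n".join(lines)


# Two made-up specimens; real barcodes are several hundred bases long.
demo = ">Homo_sapiens|s1\nACGT\n>Pan_troglodytes|s2\nAC-T\n"
print(fasta_to_arff(demo))
```

With one nominal attribute per position, a rule-based learner can report the diagnostic positions and nucleotides directly, which matches the interpretability point made in the Results.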
A gradient-boosting approach for filtering de novo mutations in parent-offspring trios.
Liu, Yongzhuang; Li, Bingshan; Tan, Renjie; Zhu, Xiaolin; Wang, Yadong
2014-07-01
Whole-genome and -exome sequencing on parent-offspring trios is a powerful approach to identifying disease-associated genes by detecting de novo mutations in patients. Accurate detection of de novo mutations from sequencing data is a critical step in trio-based genetic studies. Existing bioinformatic approaches usually yield high error rates due to sequencing artifacts and alignment issues, which may either miss true de novo mutations or call too many false ones, making downstream validation and analysis difficult. In particular, current approaches have much worse specificity than sensitivity, and developing effective filters to discriminate genuine from spurious de novo mutations remains an unsolved challenge. In this article, we curated 59 sequence features in whole genome and exome alignment context which are considered to be relevant to discriminating true de novo mutations from artifacts, and then employed a machine-learning approach to classify candidates as true or false de novo mutations. Specifically, we built a classifier, named De Novo Mutation Filter (DNMFilter), using gradient boosting as the classification algorithm. We built the training set using experimentally validated true and false de novo mutations as well as collected false de novo mutations from an in-house large-scale exome-sequencing project. We evaluated DNMFilter's theoretical performance and investigated relative importance of different sequence features on the classification accuracy. Finally, we applied DNMFilter on our in-house whole exome trios and one CEU trio from the 1000 Genomes Project and found that DNMFilter could be coupled with commonly used de novo mutation detection approaches as an effective filtering approach to significantly reduce false discovery rate without sacrificing sensitivity. The software DNMFilter implemented using a combination of Java and R is freely available from the website at http://humangenome.duke.edu/software. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
Rai, Shesh N; Trainor, Patrick J; Khosravi, Farhad; Kloecker, Goetz; Panchapakesan, Balaji
2016-01-01
The development of biosensors that produce time series data will facilitate improvements in biomedical diagnostics and in personalized medicine. The time series produced by these devices often contain characteristic features arising from biochemical interactions between the sample and the sensor. To use such characteristic features for determining sample class, similarity-based classifiers can be utilized. However, the construction of such classifiers is complicated by the variability in the time domains of such series, which renders traditional distance metrics such as Euclidean distance ineffective in distinguishing between biological variance and time domain variance. The dynamic time warping (DTW) algorithm is a sequence alignment algorithm that can be used to align two or more series to facilitate quantifying similarity. In this article, we evaluated the performance of DTW distance-based similarity classifiers for classifying time series that mimic electrical signals produced by nanotube biosensors. Simulation studies demonstrated the positive performance of such classifiers in discriminating between time series containing characteristic features that are obscured by noise in the intensity and time domains. We then applied a DTW distance-based k-nearest neighbors classifier to distinguish the presence/absence of mesenchymal biomarker in cancer cells in buffy coats in a blinded test. Using a train-test approach, we find that the classifier had high sensitivity (90.9%) and specificity (81.8%) in differentiating between EpCAM-positive MCF7 cells spiked in buffy coats and those in plain buffy coats.
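The reason DTW helps where Euclidean distance fails can be shown with a time-shifted copy of the same signal. The sketch below implements the textbook DTW recurrence with an absolute-difference cost and a nearest-neighbor vote on top of it. The toy series and labels are invented, and the paper's exact DTW variant and its constraints are not known from the abstract.

```python
import math
from collections import Counter


def dtw_distance(s, t):
    """Classic dynamic time warping distance with |x - y| as the local cost."""
    n, m = len(s), len(t)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(s[i - 1] - t[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # stretch t
                                     cost[i][j - 1],      # stretch s
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def knn_classify(train, query, k=1):
    """train is a list of (series, label); vote among the k DTW-nearest series."""
    ranked = sorted(train, key=lambda item: dtw_distance(item[0], query))
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]


# Hypothetical sensor traces: a peak means the biomarker is present.
train = [
    ([0, 1, 2, 1, 0], "present"),
    ([0, 0, 0, 0, 0], "absent"),
]
delayed_peak = [0, 0, 1, 2, 1, 0]  # same peak, one sample later and one longer
print(dtw_distance(train[0][0], delayed_peak))
print(knn_classify(train, delayed_peak))
```

Because the warping path may repeat samples, the delayed peak aligns with the reference peak at zero cost, whereas a point-by-point Euclidean comparison is not even defined for series of different lengths.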
ConSpeciFix: Classifying prokaryotic species based on gene flow.
Bobay, Louis-Marie; Ellis, Brian Shin-Hua; Ochman, Howard
2018-05-16
Classification of prokaryotic species is usually based on sequence similarity thresholds, which are easy to apply but lack a biologically-relevant foundation. Here, we present ConSpeciFix, a program that classifies prokaryotes into species using criteria set forth by the Biological Species Concept, thereby unifying species definition in all domains of life. ConSpeciFix's webserver is freely available at www.conspecifix.com. The local version of the program can be freely downloaded from https://github.com/Bobay-Ochman/ConSpeciFix. ConSpeciFix is written in Python 2.7 and requires the following dependencies: Usearch, MCL, MAFFT and RAxML.
Tracing Primordial Protein Evolution through Structurally Guided Stepwise Segment Elongation
Watanabe, Hideki; Yamasaki, Kazuhiko; Honda, Shinya
2014-01-01
The understanding of how primordial proteins emerged has been a fundamental and longstanding issue in biology and biochemistry. For a better understanding of primordial protein evolution, we synthesized an artificial protein on the basis of an evolutionary hypothesis, segment-based elongation starting from an autonomously foldable short peptide. A 10-residue protein, chignolin, the smallest foldable polypeptide ever reported, was used as a structural support to facilitate higher structural organization and gain-of-function in the development of an artificial protein. Repetitive cycles of segment elongation and subsequent phage display selection successfully produced a 25-residue protein, termed AF.2A1, with nanomolar affinity against the Fc region of immunoglobulin G. AF.2A1 shows exquisite molecular recognition ability such that it can distinguish conformational differences of the same molecule. The structure determined by NMR measurements demonstrated that AF.2A1 forms a globular protein-like conformation with the chignolin-derived β-hairpin and a tryptophan-mediated hydrophobic core. Using sequence analysis and a mutation study, we discovered that the structural organization and gain-of-function emerged from the vicinity of the chignolin segment, revealing that the structural support served as the core in both structural and functional development. Here, we propose an evolutionary model for primordial proteins in which a foldable segment serves as the evolving core to facilitate structural and functional evolution. This study provides insights into primordial protein evolution and also presents a novel methodology for designing small sized proteins useful for industrial and pharmaceutical applications. PMID:24356963
Song, Yuhyun; Leman, Scotland; Monteil, Caroline L.; Heath, Lenwood S.; Vinatzer, Boris A.
2014-01-01
A broadly accepted and stable biological classification system is a prerequisite for biological sciences. It provides the means to describe and communicate about life without ambiguity. Current biological classification and nomenclature use the species as the basic unit and require lengthy and laborious species descriptions before newly discovered organisms can be assigned to a species and be named. The current system is thus inadequate to classify and name the immense genetic diversity within species that is now being revealed by genome sequencing on a daily basis. To address this lack of a general intra-species classification and naming system adequate for today’s speed of discovery of new diversity, we propose a classification and naming system that is exclusively based on genome similarity and that is suitable for automatic assignment of codes to any genome-sequenced organism without requiring any phenotypic or phylogenetic analysis. We provide examples demonstrating that genome similarity-based codes largely align with current taxonomic groups at many different levels in bacteria, animals, humans, plants, and viruses. Importantly, the proposed approach is only slightly affected by the order of code assignment and can thus provide codes that reflect similarity between organisms and that do not need to be revised upon discovery of new diversity. We envision genome similarity-based codes to complement current biological nomenclature and to provide a universal means to communicate unambiguously about any genome-sequenced organism in fields as diverse as biodiversity research, infectious disease control, human and microbial forensics, animal breed and plant cultivar certification, and human ancestry research. PMID:24586551
Informative priors based on transcription factor structural class improve de novo motif discovery.
Narlikar, Leelavati; Gordân, Raluca; Ohler, Uwe; Hartemink, Alexander J
2006-07-15
An important problem in molecular biology is to identify the locations at which a transcription factor (TF) binds to DNA, given a set of DNA sequences believed to be bound by that TF. In previous work, we showed that information in the DNA sequence of a binding site is sufficient to predict the structural class of the TF that binds it. In particular, this suggests that we can predict which locations in any DNA sequence are more likely to be bound by certain classes of TFs than others. Here, we argue that traditional methods for de novo motif finding can be significantly improved by adopting an informative prior probability that a TF binding site occurs at each sequence location. To demonstrate the utility of such an approach, we present priority, a powerful new de novo motif finding algorithm. Using data from TRANSFAC, we train three classifiers to recognize binding sites of basic leucine zipper, forkhead, and basic helix loop helix TFs. These classifiers are used to equip priority with three class-specific priors, in addition to a default prior to handle TFs of other classes. We apply priority and a number of popular motif finding programs to sets of yeast intergenic regions that are reported by ChIP-chip to be bound by particular TFs. priority identifies motifs the other methods fail to identify, and correctly predicts the structural class of the TF recognizing the identified binding sites. Supplementary material and code can be found at http://www.cs.duke.edu/~amink/.
Leo, Michael C; McMullen, Carmit; Wilfond, Benjamin S; Lynch, Frances L; Reiss, Jacob A; Gilmore, Marian J; Himes, Patricia; Kauffman, Tia L; Davis, James V; Jarvik, Gail P; Berg, Jonathan S; Harding, Cary; Kennedy, Kathleen A; Simpson, Dana Kostiner; Quigley, Denise I; Richards, C Sue; Rope, Alan F; Goddard, Katrina A B
2016-03-01
Advances in genome sequencing and gene discovery have created opportunities to efficiently assess more genetic conditions than ever before. Given the large number of conditions that can be screened, the implementation of expanded carrier screening using genome sequencing will require practical methods of simplifying decisions about the conditions for which patients want to be screened. One method to simplify decision making is to generate a taxonomy based on expert judgment. However, expert perceptions of condition attributes used to classify these conditions may differ from those used by patients. To understand whether expert and patient perceptions differ, we asked women who had received preconception genetic carrier screening in the last 3 years to fill out a survey to rate the attributes (predictability, controllability, visibility, and severity) of several autosomal recessive or X-linked genetic conditions. These conditions were classified into one of five taxonomy categories developed by subject experts (significantly shortened lifespan, serious medical problems, mild medical problems, unpredictable medical outcomes, and adult-onset conditions). A total of 193 women provided 739 usable ratings across 20 conditions. The mean ratings and correlations demonstrated that participants made distinctions across both attributes and categories. Aggregated mean attribute ratings across categories demonstrated logical consistency between the key features of each attribute and category, although participants perceived little difference between the mild and serious categories. This study provides empirical evidence for the validity of our proposed taxonomy, which will simplify patient decisions for results they would like to receive from preconception carrier screening via genome sequencing. © 2016 Wiley Periodicals, Inc.
The most conserved genome segments for life detection on Earth and other planets.
Isenbarger, Thomas A; Carr, Christopher E; Johnson, Sarah Stewart; Finney, Michael; Church, George M; Gilbert, Walter; Zuber, Maria T; Ruvkun, Gary
2008-12-01
On Earth, very simple but powerful methods to detect and classify broad taxa of life by the polymerase chain reaction (PCR) are now standard practice. Using DNA primers corresponding to the 16S ribosomal RNA gene, one can survey a sample from any environment for its microbial inhabitants. Due to massive meteoritic exchange between Earth and Mars (as well as other planets), a reasonable case can be made for life on Mars or other planets to be related to life on Earth. In this case, the supremely sensitive technologies used to study life on Earth, including in extreme environments, can be applied to the search for life on other planets. Though the 16S gene has become the standard for life detection on Earth, no genome comparisons have established that the ribosomal genes are, in fact, the most conserved DNA segments across the kingdoms of life. We present here a computational comparison of full genomes from 13 diverse organisms from the Archaea, Bacteria, and Eucarya to identify genetic sequences conserved across the widest divisions of life. Our results identify the 16S and 23S ribosomal RNA genes as well as other universally conserved nucleotide sequences in genes encoding particular classes of transfer RNAs and within the nucleotide binding domains of ABC transporters as the most conserved DNA sequence segments across phylogeny. This set of sequences defines a core set of DNA regions that have changed the least over billions of years of evolution and provides a means to identify and classify divergent life, including ancestrally related life on other planets.
Genotyping of Chromobacterium violaceum isolates by recA PCR-RFLP analysis.
Scholz, Holger Christian; Witte, Angela; Tomaso, Herbert; Al Dahouk, Sascha; Neubauer, Heinrich
2005-03-15
Intraspecies variation of Chromobacterium violaceum was examined by comparative sequence analysis and by restriction fragment length polymorphism analysis of the recombinase A gene (recA-PCR-RFLP). Primers deduced from the known recA gene sequence of the type strain C. violaceum ATCC 12472(T) allowed the specific amplification of a 1040 bp recA fragment from each of the 13 C. violaceum strains investigated, whereas other closely related organisms tested negative. HindII-PstI-recA RFLP patterns generated from the 13 representative C. violaceum strains enabled us to identify at least three different genospecies. In conclusion, analysis of the recA gene provides a rapid and robust nucleotide sequence-based approach to specifically identify and classify C. violaceum at the genospecies level.
MR Persei - A new rotating, spotted flare star
NASA Technical Reports Server (NTRS)
Honeycutt, R. K.; Turner, G. W.; Vesper, D. N.; Schlegel, E. M.
1992-01-01
Spectroscopy and photometry are used to show that MR Persei, an object originally classified as a dwarf nova, is in fact a flare star. The automated CCD photometry consists of sequences of exposures within a single night as well as long-term photometry over a five-month interval. One sequence shows a 30-min flare, accompanied by post-flare 'dips'. A 0.2 mag variation with a period of about one-half day is also seen in this sequence. The long-term photometry is used to refine the period to 0.45483 d, which we attribute to the rotation of a spotted star. Evidence for membership of MR Per in the young Alpha Per cluster is considered, and found to be inconclusive.
Simões-Araújo, Jean Luiz; Rumjanek, Norma Gouvêa; Xavier, Gustavo Ribeiro; Zilli, Jerri Édson
The strain BR 3351T (Bradyrhizobium manausense) was obtained from nodules of cowpea (Vigna unguiculata L. Walp) growing in soil collected from the Amazon rainforest. The strain was also observed to have a high capacity to fix nitrogen in symbiosis with cowpea. We report here the draft genome sequence of strain BR 3351T. The information presented will be important for comparative analysis of nodulation and nitrogen fixation in diazotrophic bacteria. A draft genome of 9,145,311 bp with 62.9% GC content was assembled into 127 scaffolds from 100 bp paired-end reads generated on the Illumina MiSeq system. The RAST annotation identified 8603 coding sequences and 51 RNA genes, classified in 504 subsystems. Published by Elsevier Editora Ltda.
Protein Sequence Classification with Improved Extreme Learning Machine Algorithms
2014-01-01
Precisely classifying a protein sequence from a large biological protein sequence database plays an important role in developing competitive pharmacological products. Conventional methods, which compare the unseen sequence with all identified protein sequences and return the category index of the protein with the highest similarity score, are usually time-consuming. Therefore, it is urgent and necessary to build an efficient protein sequence classification system. In this paper, we study the performance of protein sequence classification using single hidden layer feedforward networks (SLFNs). The recent efficient extreme learning machine (ELM) and its variants are utilized as the training algorithms. The optimally pruned ELM (OP-ELM) is first employed for protein sequence classification in this paper. To further enhance the performance, an ensemble-based SLFN structure is constructed in which multiple SLFNs with the same number of hidden nodes and the same activation function are used as ensemble members. For each ensemble, the same training algorithm is adopted. The final category index is derived using the majority voting method. Two approaches, namely, the basic ELM and the OP-ELM, are adopted for the ensemble-based SLFNs. The performance is analyzed and compared with several existing methods using datasets obtained from the Protein Information Resource center. The experimental results show the superiority of the proposed algorithms. PMID:24795876
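The majority-voting step that combines the ensemble members can be sketched as follows. This is only an illustration of the voting rule: the trained SLFNs are replaced by hypothetical stand-in functions, and the tie-breaking behavior (first label seen wins) is an assumption, since the abstract does not specify one.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the most ensemble members.

    Ties go to the label that appears first in `predictions`
    (an assumption; the source does not state a tie-break rule).
    """
    counts = Counter(predictions)
    return max(counts, key=counts.get)


def ensemble_predict(classifiers, sequence):
    """Ask every member for a category index and combine by voting."""
    return majority_vote([clf(sequence) for clf in classifiers])


# Hypothetical stand-ins for trained networks: each maps a protein
# sequence string to a category index using a toy rule.
members = [
    lambda s: 1 if s.count("K") > 2 else 0,
    lambda s: 1 if len(s) > 10 else 0,
    lambda s: 1 if s.startswith("M") else 0,
]
print(ensemble_predict(members, "MKKLLKAAGHK"))
```

In the setting of the abstract, each member would be an SLFN trained with the same algorithm (basic ELM or OP-ELM) but different random hidden-layer weights, so that their individual errors partly cancel in the vote.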
Texture analysis of common renal masses in multiple MR sequences for prediction of pathology
NASA Astrophysics Data System (ADS)
Hoang, Uyen N.; Malayeri, Ashkan A.; Lay, Nathan S.; Summers, Ronald M.; Yao, Jianhua
2017-03-01
This pilot study performs texture analysis on multiple magnetic resonance (MR) images of common renal masses for differentiation of renal cell carcinoma (RCC). Bounding boxes are drawn around each mass on one axial slice in T1 delayed sequence to use for feature extraction and classification. All sequences (T1 delayed, venous, arterial, pre-contrast phases, T2, and T2 fat saturated sequences) are co-registered and texture features are extracted from each sequence simultaneously. Random forest is used to construct models to classify lesions on 96 normal regions, 87 clear cell RCCs, 8 papillary RCCs, and 21 renal oncocytomas; ground truths are verified through pathology reports. The highest performance is seen in random forest model when data from all sequences are used in conjunction, achieving an overall classification accuracy of 83.7%. When using data from one single sequence, the overall accuracies achieved for T1 delayed, venous, arterial, and pre-contrast phase, T2, and T2 fat saturated were 79.1%, 70.5%, 56.2%, 61.0%, 60.0%, and 44.8%, respectively. This demonstrates promising results of utilizing intensity information from multiple MR sequences for accurate classification of renal masses.
Using random forests for assistance in the curation of G-protein coupled receptor databases.
Shkurin, Aleksei; Vellido, Alfredo
2017-08-18
Biology is experiencing a gradual but fast transformation from a laboratory-centred science towards a data-centred one. As such, it requires robust data engineering and the use of quantitative data analysis methods as part of database curation. This paper focuses on G protein-coupled receptors, a large and heterogeneous super-family of cell membrane proteins of interest to biology in general. One of its families, Class C, is of particular interest to pharmacology and drug design. This family is quite heterogeneous on its own, and the discrimination of its several sub-families is a challenging problem. In the absence of known crystal structure, such discrimination must rely on their primary amino acid sequences. We are interested not as much in achieving maximum sub-family discrimination accuracy using quantitative methods, but in exploring sequence misclassification behavior. Specifically, we are interested in isolating those sequences showing consistent misclassification, that is, sequences that are very often misclassified and almost always to the same wrong sub-family. Random forests are used for this analysis due to their ensemble nature, which makes them naturally suited to gauge the consistency of misclassification. This consistency is here defined through the voting scheme of their base tree classifiers. Detailed consistency results for the random forest ensemble classification were obtained for all receptors and for all data transformations of their unaligned primary sequences. Shortlists of the most consistently misclassified receptors for each subfamily and transformation, as well as an overall shortlist including those cases that were consistently misclassified across transformations, were obtained. The latter should be referred to experts for further investigation as a data curation task. The automatic discrimination of the Class C sub-families of G protein-coupled receptors from their unaligned primary sequences shows clear limits. 
This study has investigated in some detail the consistency of their misclassification using random forest ensemble classifiers. Different sub-families have been shown to display very different discrimination consistency behaviors. The individual identification of consistently misclassified sequences should provide a tool for quality control to GPCR database curators.
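The notion of misclassification consistency defined through the votes of the base trees can be made concrete with a small sketch. The function name, the sub-family labels, and the use of a simple vote fraction are assumptions for illustration; the paper's exact consistency measure is not given in the abstract.

```python
from collections import Counter


def misclassification_consistency(tree_votes, true_label):
    """Summarize how consistently an ensemble votes for one wrong class.

    Returns (most_voted_wrong_label, fraction_of_all_votes), or
    (None, 0.0) when every tree voted for the true label.
    """
    wrong = Counter(v for v in tree_votes if v != true_label)
    if not wrong:
        return None, 0.0
    label, count = wrong.most_common(1)[0]
    return label, count / len(tree_votes)


# Invented votes from a 10-tree forest for one receptor sequence whose
# database label is "sub_A" (placeholder sub-family names).
votes = ["sub_A"] * 2 + ["sub_B"] * 7 + ["sub_C"] * 1
print(misclassification_consistency(votes, "sub_A"))
```

A sequence for which most trees agree on the same wrong sub-family is a candidate for the shortlist referred to database curators, whereas votes scattered over several wrong sub-families suggest an ambiguous rather than a consistently mislabeled sequence.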
Costa-Alcalde, José Javier; Barbeito-Castiñeiras, Gema; González-Alba, José María; Aguilera, Antonio; Galán, Juan Carlos; Pérez-Del-Molino, María Luisa
2018-06-02
The American Thoracic Society and the Infectious Diseases Society of America recommend that clinically significant non-tuberculous mycobacteria (NTM) should be identified to the species level in order to determine their clinical significance. The aim of this study was to evaluate identification of rapidly growing NTM (RGM) isolated from clinical samples by using MALDI-TOF MS and a commercial molecular system. The results were compared with identification using a reference method. We included 46 clinical isolates of RGM and identified them using the commercial molecular system GenoType® CM/AS (Hain Lifescience, Germany), MALDI-TOF MS (Bruker) and, as reference method, partial rpoβ gene sequencing followed by BLAST and phylogenetic analysis with the 1093 sequences available in GenBank. The degree of agreement of GenoType® and MALDI-TOF MS with the reference method, partial rpoβ sequencing, was 27/43 (62.8%) and 38/43 cases (88.3%), respectively. For all the samples correctly classified by GenoType®, we obtained the same result with MALDI-TOF MS (27/27). However, MALDI-TOF MS also correctly identified 68.75% (11/16) of the samples that GenoType® had misclassified (p=0.005). MALDI-TOF MS classified significantly better than GenoType®. When a MALDI-TOF MS score >1.85 was achieved, MALDI-TOF MS and partial rpoβ gene sequencing were equivalent. GenoType® was not able to distinguish between species belonging to the M. fortuitum complex. MALDI-TOF MS methodology is simple, rapid and associated with lower consumable costs than GenoType®. The partial rpoβ sequencing methods with BLAST and phylogenetic analysis were not able to identify some RGM unequivocally. Therefore, sequencing of additional regions would be indicated in these cases. Copyright © 2018 Elsevier España, S.L.U. and Sociedad Española de Enfermedades Infecciosas y Microbiología Clínica. All rights reserved.
USDA-ARS's Scientific Manuscript database
Phytoplasmas classified in group 16SrXII infect a wide range of plants and are transmitted by polyphagous planthoppers of the family Cixiidae. Based on 16S rRNA gene sequence identity and biological properties, group 16SrXII encompasses several species, including ‘Candidatus Phytoplasma australiens...
Task Selection, Task Switching and Multitasking during Computer-Based Independent Study
ERIC Educational Resources Information Center
Judd, Terry
2015-01-01
Detailed logs of students' computer use, during independent study sessions, were captured in an open-access computer laboratory. Each log consisted of a chronological sequence of tasks representing either the application or the Internet domain displayed in the workstation's active window. Each task was classified using a three-tier schema…
ERIC Educational Resources Information Center
Tou, Erik R
2013-01-01
This project classifies groups of small order using a group's center as the key feature. Groups of a given order "n" are typed based on the order of each group's center. Students are led through a sequence of exercises that combine proof-writing, independent research, and an analysis of specific classes of finite groups…
USDA-ARS's Scientific Manuscript database
A PCR-based method was used to classify 109 isolates of nucleopolyhedrovirus (NPV; Baculoviridae: Alphabaculovirus) collected worldwide from larvae of Heliothis virescens, Helicoverpa zea, and Helicoverpa armigera. Partial nucleotide sequencing and phylogenetic analysis of three highly conserved ge...
USDA-ARS's Scientific Manuscript database
Lymantria dispar multiple nucleopolyhedrovirus (LdMNPV) has been formulated and applied to control outbreaks of the gypsy moth, L. dispar. To classify and determine the degree of genetic variation among isolates of L. dispar NPVs from different parts of the range of the gypsy moth, partial sequence...
A Case Study in Using Explicit Instruction to Teach Young Children Counting Skills
ERIC Educational Resources Information Center
Hinton, Vanessa; Stroizer, Shaunita; Flores, Margaret
2015-01-01
Number sense is one's ability to understand what numbers mean, perform mental mathematics, and look at the world and make comparisons. Researchers show instruction that teaches children how to classify numbers, put numbers in sequence, conserve numbers effectively, and count builds their number sense skills. Targeted instruction that teaches…
Prediction of Nucleotide Binding Peptides Using Star Graph Topological Indices.
Liu, Yong; Munteanu, Cristian R; Fernández Blanco, Enrique; Tan, Zhiliang; Santos Del Riego, Antonino; Pazos, Alejandro
2015-11-01
The nucleotide binding proteins are involved in many important cellular processes, such as transmission of genetic information or energy transfer and storage. Therefore, the screening of new peptides for this biological function is an important research topic. The current study proposes a mixed methodology to obtain the first classification model that is able to predict new nucleotide binding peptides, using only the amino acid sequence. Thus, the methodology uses a Star graph molecular descriptor of the peptide sequences and the Machine Learning technique for the best classifier. The best model represents a Random Forest classifier based on two features of the embedded and non-embedded graphs. The performance of the model is excellent, considering similar models in the field, with an Area Under the Receiver Operating Characteristic Curve (AUROC) value of 0.938 and true positive rate (TPR) of 0.886 (test subset). The prediction of new nucleotide binding peptides with this model could be useful for drug target studies in drug development. © 2015 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Rotation invariant features for wear particle classification
NASA Astrophysics Data System (ADS)
Arof, Hamzah; Deravi, Farzin
1997-09-01
This paper investigates the ability of a set of rotation invariant features to classify images of wear particles found in used lubricating oil of machinery. The rotation invariant attribute of the features is derived from the property of the magnitudes of Fourier transform coefficients that do not change with spatial shift of the input elements. By analyzing individual circular neighborhoods centered at every pixel in an image, local and global texture characteristics of an image can be described. A number of input sequences are formed by the intensities of pixels on concentric rings of various radii measured from the center of each neighborhood. Fourier transforming the sequences would generate coefficients whose magnitudes are invariant to rotation. Rotation invariant features extracted from these coefficients were utilized to classify wear particle images that were obtained from a number of different particles captured at different orientations. In an experiment involving images of 6 classes, the circular neighborhood features obtained a 91% recognition rate which compares favorably to a 76% rate achieved by features of a 6 by 6 co-occurrence matrix.
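The invariance property this abstract relies on can be checked numerically: rotating an image about a neighborhood center circularly shifts the intensities sampled on a ring, and the magnitudes of the discrete Fourier transform do not change under a circular shift. The sketch below uses invented ring intensities and a direct (slow) DFT; it illustrates the property only, not the paper's feature set.

```python
import cmath


def dft_magnitudes(samples):
    """Magnitudes of the discrete Fourier transform of a 1-D sequence."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(samples)))
        for k in range(n)
    ]


# Invented pixel intensities sampled at 8 angles on one ring.
ring = [10, 12, 50, 52, 11, 9, 8, 10]
# Rotating the image by 3 angular steps circularly shifts the sequence.
rotated = ring[3:] + ring[:3]

original_mags = dft_magnitudes(ring)
rotated_mags = dft_magnitudes(rotated)
print(all(abs(a - b) < 1e-9 for a, b in zip(original_mags, rotated_mags)))
```

Only the phases of the coefficients change under the shift, so features computed from the magnitudes alone describe the ring's texture independently of particle orientation, up to the angular sampling resolution.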
Improving tRNAscan-SE Annotation Results via Ensemble Classifiers.
Zou, Quan; Guo, Jiasheng; Ju, Ying; Wu, Meihong; Zeng, Xiangxiang; Hong, Zhiling
2015-11-01
tRNAscan-SE is a tRNA detection program that is widely used for tRNA annotation; however, the false positive rate of tRNAscan-SE is unacceptable for large sequences. Here, we used a machine learning method to try to improve the tRNAscan-SE results. A new predictor, tRNA-Predict, was designed. We obtained real and pseudo-tRNA sequences as training data sets using tRNAscan-SE and constructed three different tRNA feature sets. We then set up an ensemble classifier, LibMutil, to predict tRNAs from the training data. The positive data set of 623 tRNA sequences was obtained from tRNAdb 2009 and the negative data set was the false positive tRNAs predicted by tRNAscan-SE. Our in silico experiments revealed a prediction accuracy rate of 95.1% for tRNA-Predict using 10-fold cross-validation. tRNA-Predict was developed to distinguish functional tRNAs from pseudo-tRNAs rather than to predict tRNAs from a genome-wide scan. However, tRNA-Predict can work with the output of tRNAscan-SE, which is a genome-wide scanning method, to improve the tRNAscan-SE annotation results. The tRNA-Predict web server is accessible at http://datamining.xmu.edu.cn/∼gjs/tRNA-Predict. © 2015 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Prevalence of BK virus subtype I in Germany.
Krumbholz, Andi; Zell, Roland; Egerer, Renate; Sauerbrei, Andreas; Helming, Andrea; Gruhn, Bernd; Wutzler, Peter
2006-12-01
The primary infection with human polyomavirus BK (BKV) occurs in early childhood and leads to viral latency within the urogenital tract. Up to 90% of the adult population are seropositive. In immunosuppressed patients, the BKV may be reactivated resulting in typical disease patterns like hemorrhagic cystitis and tubulointerstitial nephritis. Based on serological and molecular methods, BKV isolates were classified into four subtypes previously. Sixty specimens obtained from German renal and bone marrow transplant recipients were analyzed to gain data on the prevalence of BKV subtypes in Germany. With 90.9%, BKV subtype I was found to be predominant in both patient groups. 6.1% of BKV strains were classified as subtype IV. This pattern of phylogenetic distribution is similar to that demonstrated previously in England, Tanzania, the United States and Japan. Remarkably, there was one German BKV virus with a sequence which clusters together with strain SB in subtype II. The BKV subtype I was found to consist of at least three subgroups designated as Ia, Ib, and Ic. While the majority of the German sequences represent subgroup Ic, most of the Japanese sequences are clearly distinct. These findings support the hypothesis of distinct geographical prevalence of BKV subgroups. For the genotyping region, a relationship of BKV subgroups to disease patterns like hemorrhagic cystitis or tubulointerstitial nephritis could not be demonstrated. (c) 2006 Wiley-Liss, Inc.
Friesen, Vicki L.; Piatt, John F.; Baker, Allan J.
1996-01-01
Marbled Murrelets (Brachyramphus marmoratus) are coastal seabirds that breed predominantly in old-growth forest throughout the North Pacific. Presently they are classified into two phenotypically distinct subspecies: one in North America (B. m. marmoratus) and one in Asia (B. m. perdix). The Asian form was classified as a separate species in 1811, but was lumped with B. marmoratus during the 20th century. Populations of both types are considered threatened or endangered and information about the extent of genetic differentiation among birds from different sites is required for their conservation. We compared variation in 1,045 base pairs of the mitochondrial cytochrome b gene and 39 allozyme loci among Marbled Murrelets and the closely related Kittlitz's Murrelets (B. brevirostris) from throughout the North Pacific. All analyses indicated that North American and Asian Marbled Murrelets are genetically distinct: cytochrome b sequences were highly divergent, fixed allele differences occurred at two allozyme loci, and estimated gene flow was essentially zero. Phylogenetic analyses of cytochrome b sequences and allozymes both provided strong support for a monophyletic relationship among North American Marbled Murrelets and Kittlitz's Murrelets, with Long-billed Murrelets forming the basal lineage. Long-billed and North American Marbled Murrelets clearly represent distinct species by any definition, and must be managed independently. Significant genetic differentiation also was found among both Marbled and Kittlitz's Murrelets from different sites within North America.
Li, Yunhai; Lee, Kee Khoon; Walsh, Sean; Smith, Caroline; Hadingham, Sophie; Sorefan, Karim; Cawley, Gavin; Bevan, Michael W
2006-03-01
Establishing transcriptional regulatory networks by analysis of gene expression data and promoter sequences shows great promise. We developed a novel promoter classification method using a Relevance Vector Machine (RVM) and Bayesian statistical principles to identify discriminatory features in the promoter sequences of genes that can correctly classify transcriptional responses. The method was applied to microarray data obtained from Arabidopsis seedlings treated with glucose or abscisic acid (ABA). Of those genes showing >2.5-fold changes in expression level, approximately 70% were correctly predicted as being up- or down-regulated (under 10-fold cross-validation), based on the presence or absence of a small set of discriminative promoter motifs. Many of these motifs have known regulatory functions in sugar- and ABA-mediated gene expression. One promoter motif that was not known to be involved in glucose-responsive gene expression was identified as the strongest classifier of glucose-up-regulated gene expression. We show it confers glucose-responsive gene expression in conjunction with another promoter motif, thus validating the classification method. We were able to establish a detailed model of glucose and ABA transcriptional regulatory networks and their interactions, which will help us to understand the mechanisms linking metabolism with growth in Arabidopsis. This study shows that machine learning strategies coupled to Bayesian statistical methods hold significant promise for identifying functionally significant promoter sequences.
Ninomiya, M; Takahashi, M; Shimosegawa, T; Okamoto, H
2007-01-01
Recently, we identified a novel human virus with a circular DNA genome of 3.2 kb, tentatively designated as torque teno midi virus (TTMDV), with a genomic organization resembling those of torque teno virus (TTV) of 3.8-3.9 kb and torque teno mini virus (TTMV) of 2.8-2.9 kb. To investigate the extent of genomic variability of TTMDV genomes, the full-length sequence was determined for 15 TTMDV isolates obtained from viremic individuals in Japan. The 15 TTMDV isolates comprised 3175-3230 bases and shared 67.0-90.3% identities with each other, and were only 68.4-73.0% identical to the 3 reported TTMDV isolates over the entire genome. TTMDV possessed a genomic organization with four open reading frames (ORF1-ORF4) with characteristic sequence motifs and stem and loop structures with high GC content, similar to TTV and TTMV. The total of 18 TTMDV genomes differed by up to 60.7% from each other in the amino acid sequence of ORF1 (658-677 amino acids), but segregated phylogenetically into the same cluster, which was distantly related to the TTVs and TTMVs. These results indicate that TTMDV with a circular DNA genome of 3.2 kb, has an extremely high degree of genomic variability, and is classifiable into a third group in the genus Anellovirus.
Controlling the Display of Capsule Endoscopy Video for Diagnostic Assistance
NASA Astrophysics Data System (ADS)
Vu, Hai; Echigo, Tomio; Sagawa, Ryusuke; Yagi, Keiko; Shiba, Masatsugu; Higuchi, Kazuhide; Arakawa, Tetsuo; Yagi, Yasushi
Interpretations by physicians of capsule endoscopy image sequences captured over periods of 7-8 hours usually require 45 to 120 minutes of extreme concentration. This paper describes a novel method to reduce diagnostic time by automatically controlling the display frame rate. Unlike existing techniques, this method displays original images with no skipping of frames. The sequence can be played at a high frame rate in stable regions to save time. Then, in regions with rough changes, the speed is decreased to more conveniently ascertain suspicious findings. To realize such a system, cue information about the disparity of consecutive frames, including color similarity and motion displacements, is extracted. A decision tree utilizes these features to classify the states of the image acquisitions. For each classified state, the delay time between frames is calculated by parametric functions. A scheme selecting the optimal parameter set determined from assessments by physicians is deployed. Experiments involved clinical evaluations to investigate the effectiveness of this method compared to a standard-view using an existing system. Results from logged action based analysis show that compared with an existing system the proposed method reduced diagnostic time to around 32.5 ± minutes per full sequence while the number of abnormalities found was similar. As well, physicians needed less effort because of the system's efficient operability. The results of the evaluations should convince physicians that they can safely use this method and obtain reduced diagnostic times.
Tripathi, Pooja; Pandey, Paras N
2017-07-07
The present work employs pseudo amino acid composition (PseAAC) to encode protein sequences in numeric form. The encoded sequences are then arranged in a similarity matrix, which serves as input for a spectral graph clustering method. Spectral methods have been used previously for clustering protein sequences, but they use pairwise alignment scores of protein sequences in the similarity matrix. The alignment score depends on the length of the sequences, so clustering short and long sequences together may not be a good idea. This motivated the idea of combining PseAAC with a spectral clustering algorithm. We extensively tested our method and compared its performance with other existing machine learning methods. We consistently observed that the number of clusters obtained for a given set of proteins is close to the number of superfamilies in that set, and that PseAAC combined with spectral graph clustering shows the best classification results. Copyright © 2017 Elsevier Ltd. All rights reserved.
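The length-independence argument in this abstract can be illustrated with a minimal sketch. It uses plain amino acid composition, which is a simplification of PseAAC (the sequence-order terms are omitted), and cosine similarity as an assumed similarity measure; the toy sequences and function names are hypothetical and not taken from the paper. The resulting matrix is the kind of input a spectral clustering step would consume.

```python
from collections import Counter
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq):
    """Return the 20-dimensional amino acid frequency vector of a sequence."""
    counts = Counter(seq.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard residues")
    return [counts[a] / total for a in AMINO_ACIDS]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm

def similarity_matrix(seqs):
    """Pairwise composition similarity; independent of sequence length."""
    vecs = [composition(s) for s in seqs]
    return [[cosine(a, b) for b in vecs] for a in vecs]

# Hypothetical toy sequences: a short one, a 10x longer one with the
# same composition, and an unrelated one with no shared residues.
short = "ACDKLLAG"
long_ = short * 10
other = "WWYYFFHHPP"
S = similarity_matrix([short, long_, other])
```

Here the short and long sequences are scored as essentially identical despite the tenfold length difference, whereas a raw alignment score would scale with length.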
Compositional segmentation and complexity measurement in stock indices
NASA Astrophysics Data System (ADS)
Wang, Haifeng; Shang, Pengjian; Xia, Jianan
2016-01-01
In this paper, we introduce a complexity measure based on entropic segmentation, called sequence compositional complexity (SCC), into the analysis of financial time series. SCC was first used to deal directly with the complex heterogeneity in nonstationary DNA sequences. SCC is known to be higher in sequences with long-range correlation than in those with weak long-range correlation, especially in DNA sequences. Applying this method to financial index data, we find that the values of SCC of some mature stock indices, such as S & P 500 (simplified with S & P in the following) and HSI, are likely to be lower than the SCC value of Chinese index data (such as SSE). What is more, we find that, if we classify the indices with the method of SCC, the financial market of Hong Kong has more similarities with mature foreign markets than Chinese ones. So we believe that a good correspondence is found between the SCC of the index sequence and the complexity of the market involved.
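Entropic segmentation is commonly built on the Jensen-Shannon divergence between the symbol compositions of two adjacent segments. The sketch below shows only a single cut step under that assumption; it is not the full SCC computation, and for index data the returns would first have to be discretized into symbols, a step that is assumed here and not shown. The function names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(counts, n):
    """Shannon entropy (bits) of a symbol count table over n symbols."""
    return -sum((c / n) * log2(c / n) for c in counts.values() if c > 0)

def best_cut(symbols):
    """Return (position, divergence) of the cut that maximizes the
    Jensen-Shannon divergence between the left and right segments."""
    n = len(symbols)
    total = Counter(symbols)
    h_total = entropy(total, n)
    left = Counter()
    best_pos, best_js = None, -1.0
    for i in range(1, n):
        left[symbols[i - 1]] += 1
        right = total - left  # Counter subtraction drops zero counts
        js = (h_total
              - (i / n) * entropy(left, i)
              - ((n - i) / n) * entropy(right, n - i))
        if js > best_js:
            best_pos, best_js = i, js
    return best_pos, best_js

# Two perfectly homogeneous halves: the best cut is exactly in the middle.
pos, js = best_cut("AAAAAABBBBBB")
```

A full segmentation would apply this step recursively to each segment while the divergence remains statistically significant.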
Muangkram, Yuttamol; Amano, Akira; Wajjwalku, Worawidh; Pinyopummintr, Tanu; Thongtip, Nikorn; Kaolim, Nongnid; Sukmak, Manakorn; Kamolnorranath, Sumate; Siriaroonrat, Boripat; Tipkantha, Wanlaya; Maikaew, Umaporn; Thomas, Warisara; Polsrila, Kanda; Dongsaard, Kwanreaun; Sanannu, Saowaphang; Wattananorrasate, Anuwat
2017-07-01
The Asian tapir (Tapirus indicus) has been classified as Endangered on the IUCN Red List of Threatened Species (2008). Genetic diversity data provide important information for the management of captive breeding and conservation of this species. We analyzed mitochondrial control region (CR) sequences from 37 captive Asian tapirs in Thailand. Multiple alignments of the full-length CR sequences sized 1268 bp comprised three domains as described in other mammal species. Analysis of 16 parsimony-informative variable sites revealed 11 haplotypes. Furthermore, the phylogenetic analysis using median-joining network clearly showed three clades correlated with our earlier cytochrome b gene study in this endangered species. The repetitive motif is located between first and second conserved sequence blocks, similar to the Brazilian tapir. The highest polymorphic site was located in the extended termination associated sequences domain. The results could be applied for future genetic management based in captivity and wild that shows stable populations.
Structural features of the rice chromosome 4 centromere.
Zhang, Yu; Huang, Yuchen; Zhang, Lei; Li, Ying; Lu, Tingting; Lu, Yiqi; Feng, Qi; Zhao, Qiang; Cheng, Zhukuan; Xue, Yongbiao; Wing, Rod A; Han, Bin
2004-01-01
A complete sequence of a chromosome centromere is necessary for fully understanding centromere function. We reported the sequence structures of the first complete rice chromosome centromere through sequencing a large insert bacterial artificial chromosome clone-based contig, which covered the rice chromosome 4 centromere. Complete sequencing of the 124-kb rice chromosome 4 centromere revealed that it consisted of 18 tracts of 379 tandemly arrayed repeats known as CentO and a total of 19 centromeric retroelements (CRs) but no unique sequences were detected. Four tracts, composed of 65 CentO repeats, were located in the opposite orientation, and 18 CentO tracts were flanked by 19 retroelements. The CRs were classified into four types, and the type I retroelements appeared to be more specific to rice centromeres. The preferential insert of the CRs among CentO repeats indicated that the centromere-specific retroelements may contribute to centromere expansion during evolution. The presence of three intact retrotransposons in the centromere suggests that they may be responsible for functional centromere initiation through a transcription-mediated mechanism.
Characterization of GM events by insert knowledge adapted re-sequencing approaches
Yang, Litao; Wang, Congmao; Holst-Jensen, Arne; Morisset, Dany; Lin, Yongjun; Zhang, Dabing
2013-01-01
Detection methods and data from molecular characterization of genetically modified (GM) events are needed by stakeholders such as public risk assessors and regulators. Generally, the molecular characteristics of GM events are incompletely revealed by current approaches and biased towards detecting transformation vector derived sequences. GM events are classified based on available knowledge of the sequences of vectors and inserts (insert knowledge). Herein we present three insert knowledge-adapted approaches for the characterization of GM events (TT51-1 and T1c-19 rice as examples) based on paired-end re-sequencing with the advantages of comprehensiveness, accuracy, and automation. The comprehensive molecular characteristics of two rice events were revealed with additional unintended insertions compared with the results from PCR and Southern blotting. Comprehensive transgene characterization of TT51-1 and T1c-19 is shown to be independent of a priori knowledge of the insert and vector sequences employing the developed approaches. This provides an opportunity to also identify and characterize unknown GM events. PMID:24088728
Centrifuge: rapid and sensitive classification of metagenomic sequences
Song, Li; Breitwieser, Florian P.
2016-01-01
Centrifuge is a novel microbial classification engine that enables rapid, accurate, and sensitive labeling of reads and quantification of species on desktop computers. The system uses an indexing scheme based on the Burrows-Wheeler transform (BWT) and the Ferragina-Manzini (FM) index, optimized specifically for the metagenomic classification problem. Centrifuge requires a relatively small index (4.2 GB for 4078 bacterial and 200 archaeal genomes) and classifies sequences at very high speed, allowing it to process the millions of reads from a typical high-throughput DNA sequencing run within a few minutes. Together, these advances enable timely and accurate analysis of large metagenomics data sets on conventional desktop computers. Because of its space-optimized indexing schemes, Centrifuge also makes it possible to index the entire NCBI nonredundant nucleotide sequence database (a total of 109 billion bases) with an index size of 69 GB, in contrast to k-mer-based indexing schemes, which require far more extensive space. PMID:27852649
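The BWT and FM-index combination named in this abstract can be sketched in a few lines. This is a naive, uncompressed illustration of backward search that counts exact occurrences of a pattern; it is not Centrifuge's implementation, and the function names and example text are hypothetical. A real index would replace the linear `count` calls with sampled rank tables.

```python
def bwt(text):
    """Burrows-Wheeler transform via a naively sorted suffix array."""
    text += "$"  # sentinel, sorts before A/C/G/T
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return "".join(text[i - 1] for i in sa)

def fm_count(bwt_str, pattern):
    """Count exact occurrences of pattern using FM-index backward search."""
    # C[c] = number of characters in the text strictly smaller than c
    C, total = {}, 0
    for c in sorted(set(bwt_str)):
        C[c] = total
        total += bwt_str.count(c)

    def occ(c, i):
        """Occurrences of c in bwt_str[:i] (linear scan for clarity)."""
        return bwt_str[:i].count(c)

    lo, hi = 0, len(bwt_str)  # current suffix-array interval [lo, hi)
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + occ(c, lo)
        hi = C[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo

index = bwt("GATTACATTAG")
hits = fm_count(index, "TTA")
```

Each pattern character narrows the suffix-array interval, so the search cost depends on the pattern length and not on the size of the indexed text, which is what makes this kind of index attractive for read classification.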
DOE Office of Scientific and Technical Information (OSTI.GOV)
Klenk, Hans-Peter; Held, Brittany; Lucas, Susan
Saccharomonospora azurea Runmao et al. 1987 is a member of the genomically so far poorly characterized genus Saccharomonospora in the family Pseudonocardiaceae. Members of the genus Saccharomonospora are of interest because they originate from diverse habitats, such as leaf litter, manure, compost, surface of peat, moist and over-heated grain, where they might play a role in the primary degradation of plant material by attacking hemicellulose. They are Gram-negative staining organisms classified among the usually Gram-positive actinomycetes. Next to S. viridis, S. azurea is only the second member of the genus Saccharomonospora for which a completely sequenced type strain genome will be published. Here we describe the features of this organism, together with the complete genome sequence with project status 'permanent draft', and annotation. The 4,763,832 bp long chromosome with its 4,472 protein-coding and 58 RNA genes was sequenced as part of the DOE funded Community Sequencing Program (CSP) 2010 at the Joint Genome Institute (JGI).
Simões-Araújo, Jean Luiz; Leite, Jakson; Marie Rouws, Luc Felicianus; Passos, Samuel Ribeiro; Xavier, Gustavo Ribeiro; Rumjanek, Norma Gouvêa; Zilli, Jerri Édson
The strain BR 3262 was isolated from a nodule of cowpea (Vigna unguiculata L. Walp) growing in soil of the Atlantic Forest area in Brazil and is reported as an efficient nitrogen-fixing bacterium associated with cowpea. This strain was first assigned to Bradyrhizobium elkanii; however, a more detailed genetic and molecular characterization has recently indicated that it could be a Bradyrhizobium pachyrhizi species. We report here the draft genome sequence of B. pachyrhizi strain BR 3262, an elite bacterium used as inoculant for cowpea. The whole genome with 116 scaffolds, 8,965,178 bp and 63.8% of C+G content for BR 3262 was obtained using Illumina MiSeq sequencing technology. Annotation by the RAST prokaryotic genome annotation service showed 8369 coding sequences and 52 RNA genes, classified in 504 subsystems. Published by Elsevier Editora Ltda.
He, Shui-Lian; Yang, Yang; Morrell, Peter L; Yi, Ting-Shuang
2015-01-01
Foxtail millet (Setaria italica (L.) Beauv) is one of the earliest domesticated grains, which has been cultivated in northern China by 8,700 years before present (YBP) and across Eurasia by 4,000 YBP. Owing to a small genome and diploid nature, foxtail millet is a tractable model crop for studying functional genomics of millets and bioenergy grasses. In this study, we examined nucleotide sequence diversity, geographic structure, and levels of linkage disequilibrium at four nuclear loci (ADH1, G3PDH, IGS1 and TPI1) in representative samples of 311 landrace accessions across its cultivated range. Higher levels of nucleotide sequence and haplotype diversity were observed in samples from China relative to other sampled regions. Genetic assignment analysis classified the accessions into seven clusters based on nucleotide sequence polymorphisms. Intralocus LD decayed rapidly to half the initial value within ~1.2 kb or less.
Genomic Diversity and Evolution of the Lyssaviruses
Delmas, Olivier; Holmes, Edward C.; Talbi, Chiraz; Larrous, Florence; Dacheux, Laurent; Bouchier, Christiane; Bourhy, Hervé
2008-01-01
Lyssaviruses are RNA viruses with single-strand, negative-sense genomes responsible for rabies-like diseases in mammals. To date, genomic and evolutionary studies have most often utilized partial genome sequences, particularly of the nucleoprotein and glycoprotein genes, with little consideration of genome-scale evolution. Herein, we report the first genomic and evolutionary analysis using complete genome sequences of all recognised lyssavirus genotypes, including 14 new complete genomes of field isolates from 6 genotypes and one genotype that is completely sequenced for the first time. In doing so we significantly increase the extent of genome sequence data available for these important viruses. Our analysis of these genome sequence data reveals that all lyssaviruses have the same genomic organization. A phylogenetic analysis reveals strong geographical structuring, with the greatest genetic diversity in Africa, and an independent origin for the two known genotypes that infect European bats. We also suggest that multiple genotypes may exist within the diversity of viruses currently classified as ‘Lagos Bat’. In sum, we show that rigorous phylogenetic techniques based on full length genome sequence provide the best discriminatory power for genotype classification within the lyssaviruses. PMID:18446239
Bolzán, Alejandro D
2017-07-01
By definition, telomeric sequences are located at the very ends or terminal regions of chromosomes. However, several vertebrate species show blocks of (TTAGGG)n repeats present in non-terminal regions of chromosomes, the so-called interstitial telomeric sequences (ITSs), interstitial telomeric repeats or interstitial telomeric bands, which include those intrachromosomal telomeric-like repeats located near (pericentromeric ITSs) or within the centromere (centromeric ITSs) and those telomeric repeats located between the centromere and the telomere (i.e., truly interstitial telomeric sequences) of eukaryotic chromosomes. According to their sequence organization, localization and flanking sequences, ITSs can be classified into four types: 1) short ITSs, 2) subtelomeric ITSs, 3) fusion ITSs, and 4) heterochromatic ITSs. The first three types have been described mainly in the human genome, whereas heterochromatic ITSs have been found in several vertebrate species but not in humans. Several lines of evidence suggest that ITSs play a significant role in genome instability and evolution. This review aims to summarize our current knowledge about the origin, function, instability and evolution of these telomeric-like repeats in vertebrate chromosomes. Copyright © 2017 Elsevier B.V. All rights reserved.
A machine learning approach for viral genome classification.
Remita, Mohamed Amine; Halioui, Ahmed; Malick Diouara, Abou Abdallah; Daigle, Bruno; Kiani, Golrokh; Diallo, Abdoulaye Baniré
2017-04-11
Advances in cloning and sequencing technology are yielding a massive number of viral genomes. The classification and annotation of these genomes constitute important assets in the discovery of genomic variability, taxonomic characteristics and disease mechanisms. Existing classification methods are often designed for specific well-studied family of viruses. Thus, the viral comparative genomic studies could benefit from more generic, fast and accurate tools for classifying and typing newly sequenced strains of diverse virus families. Here, we introduce a virus classification platform, CASTOR, based on machine learning methods. CASTOR is inspired by a well-known technique in molecular biology: restriction fragment length polymorphism (RFLP). It simulates, in silico, the restriction digestion of genomic material by different enzymes into fragments. It uses two metrics to construct feature vectors for machine learning algorithms in the classification step. We benchmark CASTOR for the classification of distinct datasets of human papillomaviruses (HPV), hepatitis B viruses (HBV) and human immunodeficiency viruses type 1 (HIV-1). Results reveal true positive rates of 99%, 99% and 98% for HPV Alpha species, HBV genotyping and HIV-1 M subtyping, respectively. Furthermore, CASTOR shows a competitive performance compared to well-known HIV-1 specific classifiers (REGA and COMET) on whole genomes and pol fragments. The performance of CASTOR, its genericity and robustness could permit to perform novel and accurate large scale virus studies. The CASTOR web platform provides an open access, collaborative and reproducible machine learning classifiers. CASTOR can be accessed at http://castor.bioinfo.uqam.ca .
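The in silico restriction digestion described in this abstract can be sketched as follows. The recognition sites and cut offsets for EcoRI, HindIII and BamHI are the standard ones, but everything else is an assumption made for illustration: the genome is treated as linear, and the two per-enzyme features used here (fragment count and mean fragment length) are hypothetical stand-ins, since the abstract does not say which two metrics CASTOR uses.

```python
# Recognition site and cut offset within the site (top strand).
ENZYMES = {
    "EcoRI": ("GAATTC", 1),    # G^AATTC
    "HindIII": ("AAGCTT", 1),  # A^AGCTT
    "BamHI": ("GGATCC", 1),    # G^GATCC
}

def digest(seq, site, offset):
    """Cut a linear sequence at every occurrence of a recognition site."""
    fragments, start = [], 0
    pos = seq.find(site)
    while pos != -1:
        cut = pos + offset
        fragments.append(seq[start:cut])
        start = cut
        pos = seq.find(site, pos + 1)
    fragments.append(seq[start:])
    return fragments

def feature_vector(seq):
    """Per enzyme (in name order): fragment count and mean fragment length.
    These two metrics are illustrative assumptions, not CASTOR's own."""
    features = []
    for name in sorted(ENZYMES):
        site, offset = ENZYMES[name]
        lengths = [len(f) for f in digest(seq, site, offset)]
        features += [len(lengths), sum(lengths) / len(lengths)]
    return features

# Hypothetical toy genome with two EcoRI sites and one BamHI site.
genome = "AAGAATTCTTGGATCCAAGAATTCAA"
ecori_fragments = digest(genome, *ENZYMES["EcoRI"])
vector = feature_vector(genome)
```

Vectors of this kind, computed for every genome in a labelled set, could then be passed to any standard classifier.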
Predicting permanent and transient protein-protein interfaces.
La, David; Kong, Misun; Hoffman, William; Choi, Youn Im; Kihara, Daisuke
2013-05-01
Protein-protein interactions (PPIs) are involved in diverse functions in a cell. To optimize functional roles of interactions, proteins interact with a spectrum of binding affinities. Interactions are conventionally classified into permanent and transient, where the former denotes tight binding between proteins that results in strong complexes, whereas the latter are composed of relatively weak interactions that can dissociate after binding to regulate functional activity at specific time points. Knowing the type of interactions has significant implications for understanding the nature and function of PPIs. In this study, we constructed amino acid substitution models that capture mutation patterns at permanent and transient types of protein interfaces, which were found to be different with statistical significance. Using the substitution models, we developed a novel computational method that predicts permanent and transient protein binding interfaces (PBIs) in protein surfaces. Without knowledge of the interacting partner, the method uses a single query protein structure and a multiple sequence alignment of the sequence family. Using a large dataset of permanent and transient proteins, we show that our method, BindML+, performs very well in protein interface classification. A very high area under the curve (AUC) value of 0.957 was observed when predicted protein binding sites were classified. Remarkably, near-perfect accuracy was achieved with an AUC of 0.991 when actual binding sites were classified. The developed method will also be useful for protein design of permanent and transient PBIs. Copyright © 2013 Wiley Periodicals, Inc.
Xie, L; Lv, M-F; Yang, J; Chen, J-P; Zhang, H-M
Maize rough dwarf disease (MRDD) has long been known as one of the most devastating viral diseases of maize worldwide and is caused by single or complex infection by four fijiviruses: Maize rough dwarf virus (MRDV) in Europe and the Middle East, Mal de Rio Cuarto virus (MRCV) in South America, rice black-streaked dwarf virus (RBSDV), and Southern rice black-streaked dwarf virus (SRBSDV or Rice black-streaked dwarf virus 2, RBSDV-2) in East Asia. These are currently classified as four distinct species in the genus Fijivirus, family Reoviridae, but their taxonomic status has been questioned. To help resolve this, the nucleotide sequences of the ten genomic segments of an Italian isolate of MRDV have been determined, providing the first complete genomic sequence of this virus. Its genome has 29144 nucleotides and is similar in organization to those of RBSDV, SRBSDV, and MRCV. The 13 ORFs always share highest identities (81.3-97.2%) with the corresponding ORFs of RBSDV and phylogenetic analyses of the different genome segments and ORFs all confirm that MRDV clusters most closely with RBSDV and that MRCV and SRBSDV are slightly more distantly related. The results suggest that MRDV and RBSDV should be classified as different geographic strains of the same virus species and we suggest the name cereal black-streaked dwarf fijivirus (CBSDV) for consideration.
Koopman, W J; Guetta, E; van de Wiel, C C; Vosman, B; van den Berg, R G
1998-11-01
Internal transcribed spacer (ITS-1) sequences from 97 accessions representing 23 species of Lactuca and related genera were determined and used to evaluate species relationships of Lactuca sensu lato (s.l.). The ITS-1 phylogenies, calculated using PAUP and PHYLIP, correspond better to the classification of Feráková than to other classifications evaluated, although the inclusion of sect. Lactuca subsect. Cyanicae is not supported. Therefore, exclusion of subsect. Cyanicae from Lactuca sensu Feráková is proposed. The amended genus contains the entire gene pool (sensu Harlan and De Wet) of cultivated lettuce (Lactuca sativa). The position of the species in the amended classification corresponds to their position in the lettuce gene pool. In the ITS-1 phylogenies, a clade with L. sativa, L. serriola, L. dregeana, L. altaica, and L. aculeata represents the primary gene pool. L. virosa and L. saligna, branching off closest to this clade, encompass the secondary gene pool. L. virosa is possibly of hybrid origin. The primary and secondary gene pool species are classified in sect. Lactuca subsect. Lactuca. The species L. quercina, L. viminea, L. sibirica, and L. tatarica, branching off next, represent the tertiary gene pool. They are classified in Lactuca sect. Lactucopsis, sect. Phaenixopus, and sect. Mulgedium, respectively. L. perennis and L. tenerrima, classified in sect. Lactuca subsect. Cyanicae, form clades with species from related genera and are not part of the lettuce gene pool.
Machine learning for classifying tuberculosis drug-resistance from DNA sequencing data.
Yang, Yang; Niehaus, Katherine E; Walker, Timothy M; Iqbal, Zamin; Walker, A Sarah; Wilson, Daniel J; Peto, Tim E A; Crook, Derrick W; Smith, E Grace; Zhu, Tingting; Clifton, David A
2018-05-15
Correct and rapid determination of Mycobacterium tuberculosis (MTB) resistance against available tuberculosis (TB) drugs is essential for the control and management of TB. Conventional molecular diagnostic tests assume that the presence of any well-studied single nucleotide polymorphism is sufficient to cause resistance, which yields low sensitivity for resistance classification. Given the availability of DNA sequencing data from MTB, we developed machine learning models for a cohort of 1839 UK bacterial isolates to classify MTB resistance against eight anti-TB drugs (isoniazid, rifampicin, ethambutol, pyrazinamide, ciprofloxacin, moxifloxacin, ofloxacin, streptomycin) and to classify multi-drug resistance. Compared to the previous rules-based approach, the sensitivities from the best-performing models increased by 2-4% for isoniazid, rifampicin and ethambutol to 97% (P < 0.01), respectively; for ciprofloxacin and multi-drug resistant TB, they increased to 96%. For moxifloxacin and ofloxacin, sensitivities increased by 12 and 15% from 83 and 81% based on existing known resistance alleles to 95% and 96% (P < 0.01), respectively. Particularly, our models improved sensitivities compared to the previous rules-based approach by 15 and 24% to 84 and 87% for pyrazinamide and streptomycin (P < 0.01), respectively. The best-performing models increase the area-under-the-ROC curve by 10% for pyrazinamide and streptomycin (P < 0.01), and 4-8% for other drugs (P < 0.01). The details of source code are provided at http://www.robots.ox.ac.uk/~davidc/code.php. Contact: david.clifton@eng.ox.ac.uk. Supplementary data are available at Bioinformatics online.
Predicting novel microRNA: a comprehensive comparison of machine learning approaches.
Stegmayer, Georgina; Di Persia, Leandro E; Rubiolo, Mariano; Gerard, Matias; Pividori, Milton; Yones, Cristian; Bugnon, Leandro A; Rodriguez, Tadeo; Raad, Jonathan; Milone, Diego H
2018-05-23
The importance of microRNAs (miRNAs) is widely recognized in the community nowadays because these short segments of RNA can play several roles in almost all biological processes. The computational prediction of novel miRNAs involves training a classifier for identifying sequences having the highest chance of being precursors of miRNAs (pre-miRNAs). The big issue with this task is that well-known pre-miRNAs are usually few in comparison with the hundreds of thousands of candidate sequences in a genome, which results in high class imbalance. This imbalance has a strong influence on most standard classifiers, and if not properly addressed in the model and the experiments, not only performance reported can be completely unrealistic but also the classifier will not be able to work properly for pre-miRNA prediction. Besides, another important issue is that for most of the machine learning (ML) approaches already used (supervised methods), it is necessary to have both positive and negative examples. The selection of positive examples is straightforward (well-known pre-miRNAs). However, it is difficult to build a representative set of negative examples because they should be sequences with hairpin structure that do not contain a pre-miRNA. This review provides a comprehensive study and comparative assessment of methods from these two ML approaches for dealing with the prediction of novel pre-miRNAs: supervised and unsupervised training. We present and analyze the ML proposals that have appeared during the past 10 years in literature. They have been compared in several prediction tasks involving two model genomes and increasing imbalance levels. This work provides a review of existing ML approaches for pre-miRNA prediction and fair comparisons of the classifiers with same features and data sets, instead of just a revision of published software tools. 
The results and the discussion can help the community to select the most adequate bioinformatics approach according to the prediction task at hand. The comparative results obtained suggest that from low to mid-imbalance levels between classes, supervised methods can be the best. However, at very high imbalance levels, closer to real case scenarios, models including unsupervised and deep learning can provide better performance.
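The warning about unrealistic performance under class imbalance can be made concrete with a small sketch. The class sizes and predictions below are hypothetical and are not taken from the review; they only show how accuracy can look excellent while no pre-miRNA is found, and why sensitivity, precision and the geometric mean of sensitivity and specificity are more informative in this setting.

```python
def evaluate(y_true, y_pred):
    """Basic binary metrics; label 1 = pre-miRNA, 0 = other hairpin."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "g_mean": (sensitivity * specificity) ** 0.5,
    }

# Hypothetical 1:99 imbalance: 10 true pre-miRNAs among 1000 candidates.
y_true = [1] * 10 + [0] * 990

# A trivial classifier that rejects every candidate.
trivial = evaluate(y_true, [0] * 1000)

# A classifier that recovers 8 of 10 positives with 50 false alarms.
useful = evaluate(y_true, [1] * 8 + [0] * 2 + [1] * 50 + [0] * 940)
```

The trivial classifier scores higher on accuracy than the useful one, yet its sensitivity and geometric mean are zero, which is the kind of misleading result the review cautions against.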
matK-QR classifier: a patterns based approach for plant species identification.
More, Ravi Prabhakar; Mane, Rupali Chandrashekhar; Purohit, Hemant J
2016-01-01
DNA barcoding is a widely used and highly efficient approach that facilitates rapid and accurate identification of plant species based on a short standardized segment of the genome. The nucleotide sequences of the maturaseK (matK) and ribulose-1,5-bisphosphate carboxylase (rbcL) marker loci are commonly used in plant species identification. Here, we present a new and highly efficient approach for identifying a unique set of discriminating nucleotide patterns to generate a signature (i.e. regular expression) for plant species identification. In order to generate molecular signatures, we used matK and rbcL loci datasets, which encompass 125 plant species in 52 genera reported by the CBOL plant working group. Initially, we performed Multiple Sequence Alignment (MSA) of all species followed by Position Specific Scoring Matrix (PSSM) for both loci to achieve a percentage of discrimination among species. Further, we detected Discriminating Patterns (DP) at genus and species level using PSSM for the matK dataset. Combining DP and consecutive pattern distances, we generated molecular signatures for each species. Finally, we performed a comparative assessment of these signatures with the existing methods including BLASTn, Support Vector Machines (SVM), Jrip-RIPPER, J48 (C4.5 algorithm), and the Naïve Bayes (NB) methods against the NCBI-GenBank matK dataset. Due to the higher discrimination success obtained with matK as compared to rbcL, we selected the matK gene for signature generation. We generated signatures for 60 species based on identified discriminating patterns at genus and species level.
Our comparative assessment results suggest that a total of 46 out of 60 species could be correctly identified using generated signatures, followed by BLASTn (34 species), SVM (18 species), C4.5 (7 species), NB (4 species) and RIPPER (3 species) methods. As a final outcome of this study, we converted signatures into QR codes and developed a software tool, matK-QR Classifier (http://www.neeri.res.in/matk_classifier/index.htm), which searches for signatures in the query matK gene sequences and predicts the corresponding plant species. This novel approach of employing pattern-based signatures opens new avenues for the classification of species. In addition to existing methods, we believe that matK-QR Classifier would be a valuable tool for molecular taxonomists enabling precise identification of plant species.
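The idea of a signature as a regular expression built from discriminating patterns and the distances between consecutive patterns can be sketched as below. The species names, motifs and gap lengths are invented for illustration and do not come from the matK dataset; the way the signature is assembled is likewise an assumption about one plausible encoding, not the tool's actual format.

```python
import re

def build_signature(patterns, gaps):
    """Join motifs with fixed-length wildcards: p1 .{g1} p2 .{g2} p3 ..."""
    if len(gaps) != len(patterns) - 1:
        raise ValueError("need exactly one gap between consecutive patterns")
    parts = [re.escape(patterns[0])]
    for gap, pattern in zip(gaps, patterns[1:]):
        parts.append(".{%d}" % gap)
        parts.append(re.escape(pattern))
    return re.compile("".join(parts))

# Hypothetical signatures: both species share the first motif and differ
# in the second one, located 4 bases downstream.
SIGNATURES = {
    "species_A": build_signature(["ATGGC", "TTAGC"], [4]),
    "species_B": build_signature(["ATGGC", "CCGTA"], [4]),
}

def identify(seq):
    """Return the names of all species whose signature occurs in seq."""
    return sorted(name for name, sig in SIGNATURES.items() if sig.search(seq))

match_a = identify("GGATGGCAAAATTAGCTT")
match_b = identify("ATGGCTTTTCCGTA")
no_match = identify("ACGTACGT")
```

Because a signature is a plain string, it can be serialized into any text carrier, which is consistent with the abstract's conversion of signatures into QR codes.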
Evaluation of exome variants using the Ion Proton Platform to sequence error-prone regions.
Seo, Heewon; Park, Yoomi; Min, Byung Joo; Seo, Myung Eui; Kim, Ju Han
2017-01-01
The Ion Proton sequencer from Thermo Fisher accurately determines sequence variants from target regions with a rapid turnaround time at a low cost. However, misleading variant-calling errors can occur. We performed a systematic evaluation and manual curation of read-level alignments for the 675 ultrarare variants that were reported by the Ion Proton sequencer from 27 whole-exome sequencing data sets but are not present in either the 1000 Genomes Project or the Exome Aggregation Consortium. We classified positive variant calls into 393 highly likely false positives, 126 likely false positives, and 156 likely true positives, which comprised 58.2%, 18.7%, and 23.1% of the variants, respectively. We identified four distinct error patterns of variant calling that may be bioinformatically corrected when using different strategies: simplicity region, SNV cluster, peripheral sequence read, and base inversion. Local de novo assembly successfully corrected 201 (38.7%) of the 519 highly likely or likely false positives. We also demonstrate that the two sequencing kits from Thermo Fisher (the Ion PI Sequencing 200 kit V3 and the Ion PI Hi-Q kit) exhibit different error profiles across different error types. A refined calling algorithm with better polymerase may improve the performance of the Ion Proton sequencing platform.
Event-related potential correlates of declarative and non-declarative sequence knowledge.
Ferdinand, Nicola K; Rünger, Dennis; Frensch, Peter A; Mecklinger, Axel
2010-07-01
The goal of the present study was to demonstrate that declarative and non-declarative knowledge acquired in an incidental sequence learning task contributes differentially to memory retrieval and leads to dissociable ERP signatures in a recognition memory task. For this purpose, participants performed a sequence learning task and were classified as verbalizers, partial verbalizers, or nonverbalizers according to their ability to verbally report the systematic response sequence. Thereafter, ERPs were recorded in a recognition memory task time-locked to sequence triplets that were either part of the previously learned sequence or not. Although all three groups executed old sequence triplets faster than new triplets in the recognition memory task, qualitatively distinct ERP patterns were found for participants with and without reportable knowledge. Verbalizers and, to a lesser extent, partial verbalizers showed an ERP correlate of recollection for parts of the incidentally learned sequence. In contrast, nonverbalizers showed a different ERP effect with a reverse polarity that might reflect priming. This indicates that an ensemble of qualitatively different processes is at work when declarative and non-declarative sequence knowledge is retrieved. By this, our findings favor a multiple-systems view postulating that explicit and implicit learning are supported by different and functionally independent systems. Copyright (c) 2010 Elsevier Ltd. All rights reserved.
Stranieri, Andrew; Abawajy, Jemal; Kelarev, Andrei; Huda, Shamsul; Chowdhury, Morshed; Jelinek, Herbert F
2013-07-01
This article addresses the problem of determining optimal sequences of tests for the clinical assessment of cardiac autonomic neuropathy (CAN). We investigate the accuracy of using only one of the recommended Ewing tests to classify CAN and the additional accuracy obtained by adding the remaining tests of the Ewing battery. This is important as not all five Ewing tests can always be applied in each situation in practice. We used a new and unique database of the diabetes screening research initiative project, which is more than ten times larger than the data set used by Ewing in his original investigation of CAN. We utilized decision trees and the optimal decision path finder (ODPF) procedure for identifying optimal sequences of tests. We present experimental results on the accuracy of using each one of the recommended Ewing tests to classify CAN and the additional accuracy that can be achieved by adding the remaining tests of the Ewing battery. We found the best sequences of tests for a cost-function equal to the number of tests. For 2, 3 and 4 categories of CAN, the accuracies achieved by the initial segments of the optimal sequences are, respectively, 80.80, 91.33, 93.97 and 94.14; 79.86, 89.29, 91.16 and 91.76; and 78.90, 86.21, 88.15 and 88.93. They show significant improvement compared to the sequence considered previously in the literature and the mathematical expectations of the accuracies of a random sequence of tests. The complete outcomes obtained for all subsets of the Ewing features are required for determining optimal sequences of tests for any cost-function with the use of the ODPF procedure. We have also found the two most significant additional features that can increase the accuracy when some of the Ewing attributes cannot be obtained. The outcomes obtained can be used to determine the optimal sequences of tests for each individual cost-function by following the ODPF procedure.
The results show that the best single Ewing test for diagnosing CAN is the deep breathing heart rate variation test. Optimal sequences found for the cost-function equal to the number of tests guarantee that the best accuracy is achieved after any number of tests and provide an improvement in comparison with the previous ordering of tests or a random sequence. Copyright © 2013 Elsevier B.V. All rights reserved.
MIPS: a database for protein sequences, homology data and yeast genome information.
Mewes, H W; Albermann, K; Heumann, K; Liebl, S; Pfeiffer, F
1997-01-01
The MIPS group (Martinsried Institute for Protein Sequences) at the Max-Planck-Institute for Biochemistry, Martinsried near Munich, Germany, collects, processes and distributes protein sequence data within the framework of the tripartite association of the PIR-International Protein Sequence Database. MIPS contributes nearly 50% of the data input to the PIR-International Protein Sequence Database. The database is distributed on CD-ROM together with PATCHX, an exhaustive supplement of unique, unverified protein sequences from external sources compiled by MIPS. Through its WWW server (http://www.mips.biochem.mpg.de/) MIPS permits internet access to sequence databases, homology data and to yeast genome information. (i) Sequence similarity results from the FASTA program are stored in the FASTA database for all proteins from PIR-International and PATCHX. The database is dynamically maintained and permits instant access to FASTA results. (ii) Starting with FASTA database queries, proteins have been classified into families and superfamilies (PROT-FAM). (iii) The HPT (hashed position tree) data structure developed at MIPS is a new approach for rapid sequence and pattern searching. (iv) MIPS provides access to the sequence and annotation of the complete yeast genome, the functional classification of yeast genes (FunCat) and its graphical display, the 'Genome Browser'. A CD-ROM based on the JAVA programming language providing dynamic interactive access to the yeast genome and the related protein sequences has been compiled and is available on request. PMID:9016498
Maximizing lipocalin prediction through balanced and diversified training set and decision fusion.
Nath, Abhigyan; Subbiah, Karthikeyan
2015-12-01
Lipocalins are short in sequence length and perform several important biological functions. These proteins have less than 20% sequence similarity among paralogs. Experimentally identifying them is an expensive and time-consuming process. Computational methods based on sequence similarity for allocating putative members to this family are also largely ineffective because of the low sequence similarity among the members of this family. Consequently, machine learning methods become a viable alternative for their prediction, using the underlying sequence- and structure-derived features as input. Ideally, any machine learning based prediction method must be trained with all possible variations in the input feature vector (all the sub-class input patterns) to achieve perfect learning. Near-perfect learning can be achieved by training the model with diverse types of input instances belonging to the different regions of the entire input space. Furthermore, the prediction performance can be improved by balancing the training set, as imbalanced data sets tend to bias prediction towards the majority class and its sub-classes. This paper aims to achieve (i) high generalization ability without any classification bias through diversified and balanced training sets and (ii) enhanced prediction accuracy by combining the results of individual classifiers with an appropriate fusion scheme. Instead of creating the training set randomly, we first used the unsupervised Kmeans clustering algorithm to create diversified clusters of input patterns and then created the diversified and balanced training set by selecting an equal number of patterns from each of these clusters.
Finally, a probability-based classifier fusion scheme was applied to the boosted random forest algorithm (which produced greater sensitivity) and the K nearest neighbour algorithm (which produced greater specificity) to achieve better predictive performance than that of the individual base classifiers. The performance of the learned models trained on the Kmeans-preprocessed training set is far better than that of models trained on randomly generated training sets. The proposed method achieved a sensitivity of 90.6%, specificity of 91.4% and accuracy of 91.0% on the first test set and sensitivity of 92.9%, specificity of 96.2% and accuracy of 94.7% on the second blind test set. These results have established that diversifying the training set improves the performance of predictive models through superior generalization ability and that balancing the training set improves prediction accuracy. For smaller data sets, unsupervised Kmeans-based sampling can be a more effective technique for increasing generalization than the usual random splitting method. Copyright © 2015 Elsevier Ltd. All rights reserved.
Mizianty, Marcin J; Kurgan, Lukasz
2009-12-13
Knowledge of structural class is used by numerous methods for identification of structural/functional characteristics of proteins and could be used for the detection of remote homologues, particularly for chains that share twilight-zone similarity. In contrast to existing sequence-based structural class predictors, which target four major classes and which are designed for high identity sequences, we predict seven classes from sequences that share twilight-zone identity with the training sequences. The proposed MODular Approach to Structural class prediction (MODAS) method is unique as it allows for selection of any subset of the classes. MODAS is also the first to utilize a novel, custom-built feature-based sequence representation that combines evolutionary profiles and predicted secondary structure. The features quantify information relevant to the definition of the classes including conservation of residues and arrangement and number of helix/strand segments. Our comprehensive design considers 8 feature selection methods and 4 classifiers to develop Support Vector Machine-based classifiers that are tailored for each of the seven classes. Tests on 5 twilight-zone and 1 high-similarity benchmark datasets and comparison with over two dozen modern competing predictors show that MODAS provides the best overall accuracy that ranges between 80% and 96.7% (83.5% for the twilight-zone datasets), depending on the dataset. This translates into 19% and 8% error rate reduction when compared against the best performing competing method on the two largest datasets. The proposed predictor provides accurate predictions at 58% accuracy for the membrane protein class, which is not considered by the majority of existing methods, even though this class accounts for only 2% of the data. Our predictive model is analyzed to demonstrate how and why the input features are associated with the corresponding classes.
The improved predictions stem from the novel features that express collocation of the secondary structure segments in the protein sequence and that combine evolutionary and secondary structure information. Our work demonstrates that conservation and arrangement of the secondary structure segments predicted along the protein chain can successfully predict structural classes which are defined based on the spatial arrangement of the secondary structures. A web server is available at http://biomine.ece.ualberta.ca/MODAS/.
2009-01-01
Background Knowledge of structural class is used by numerous methods for identification of structural/functional characteristics of proteins and could be used for the detection of remote homologues, particularly for chains that share twilight-zone similarity. In contrast to existing sequence-based structural class predictors, which target four major classes and which are designed for high identity sequences, we predict seven classes from sequences that share twilight-zone identity with the training sequences. Results The proposed MODular Approach to Structural class prediction (MODAS) method is unique as it allows for selection of any subset of the classes. MODAS is also the first to utilize a novel, custom-built feature-based sequence representation that combines evolutionary profiles and predicted secondary structure. The features quantify information relevant to the definition of the classes including conservation of residues and arrangement and number of helix/strand segments. Our comprehensive design considers 8 feature selection methods and 4 classifiers to develop Support Vector Machine-based classifiers that are tailored for each of the seven classes. Tests on 5 twilight-zone and 1 high-similarity benchmark datasets and comparison with over two dozen modern competing predictors show that MODAS provides the best overall accuracy that ranges between 80% and 96.7% (83.5% for the twilight-zone datasets), depending on the dataset. This translates into 19% and 8% error rate reduction when compared against the best performing competing method on the two largest datasets. The proposed predictor provides accurate predictions at 58% accuracy for the membrane protein class, which is not considered by the majority of existing methods, even though this class accounts for only 2% of the data. Our predictive model is analyzed to demonstrate how and why the input features are associated with the corresponding classes.
Conclusions The improved predictions stem from the novel features that express collocation of the secondary structure segments in the protein sequence and that combine evolutionary and secondary structure information. Our work demonstrates that conservation and arrangement of the secondary structure segments predicted along the protein chain can successfully predict structural classes which are defined based on the spatial arrangement of the secondary structures. A web server is available at http://biomine.ece.ualberta.ca/MODAS/. PMID:20003388
Coexistence of Native and Denatured Phases in a Single Proteinlike Molecule
NASA Astrophysics Data System (ADS)
Du, Rose; Grosberg, Alexander Yu.; Tanaka, Toyoichi
1999-11-01
In order to understand the nuclei which develop during the course of protein folding and unfolding, we examine equilibrium coexistence of phases within a single heteropolymer chain. We computationally generate the phase segregation by applying a "folding pressure," or adding an energetic bonus for native monomer-monomer contacts. The computer models reveal that in a polymer system some nuclei hinder folding via topological constraints. Using this insight, we show that the critical nucleus size is of the order of the entire chain and that unfolding time scales as $\exp(cN^{2/3})$, in the large N limit, N and c being the chain length and a constant, respectively.
Niv, Masha Y.; Skrabanek, Lucy; Roberts, Richard J.; Scheraga, Harold A.; Weinstein, Harel
2008-01-01
Restriction endonucleases (REases) are DNA-cleaving enzymes that have become indispensable tools in molecular biology. Type II REases are highly divergent in sequence despite their common structural core, function and, in some cases, common specificities towards DNA sequences. This makes it difficult to identify and classify them functionally based on sequence, and has hampered the efforts of specificity-engineering. Here, we define novel REase sequence motifs, which extend beyond the PD-(D/E)XK hallmark, and incorporate secondary structure information. The automated search using these motifs is carried out with a newly developed fast regular expression matching algorithm that accommodates long patterns with optional secondary structure constraints. Using this new tool, named Scan2S, motifs derived from REases with specificity towards GATC- and CCGG-containing DNA sequences successfully identify REases of the same specificity. Notably, some of these sequences are not identified by standard sequence detection tools. The new motifs highlight potential specificity-determining positions that do not fully overlap for the GATC- and the CCGG-recognizing REases and are candidates for specificity re-engineering. PMID:17972284
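The idea of a regular-expression motif search with an optional secondary structure constraint can be illustrated with a minimal sketch. This is not Scan2S or its pattern syntax: the simplified hallmark pattern, the spacer bounds, the function name `find_hallmark` and the rule that the final lysine must sit in a given secondary structure state are all assumptions made for illustration.

```python
import re

# Simplified, assumed form of the PD-(D/E)XK hallmark: "PD", a variable
# spacer, then D or E, any residue, and K. The spacer bounds are illustrative.
HALLMARK = re.compile(r"PD.{5,30}?[DE].K")


def find_hallmark(sequence, secondary=None, required="E"):
    """Return (start, end) spans of hallmark matches in a protein sequence.

    If `secondary` (one character per residue, e.g. 'H', 'E' or 'C') is
    given, keep only matches whose final lysine lies in the `required`
    state. This constraint is hypothetical, not taken from the paper.
    """
    hits = []
    for match in HALLMARK.finditer(sequence):
        lysine_index = match.end() - 1
        if secondary is not None and secondary[lysine_index] != required:
            continue
        hits.append((match.start(), match.end()))
    return hits


# Toy sequence: "PD", eight alanines as spacer, then "ELK".
toy_sequence = "MKT" + "PD" + "A" * 8 + "ELK" + "GG"
toy_strand = "C" * 15 + "E" + "CC"   # lysine predicted as strand
toy_coil = "C" * 18                  # no strand predicted anywhere

unconstrained = find_hallmark(toy_sequence)
with_strand = find_hallmark(toy_sequence, toy_strand)
with_coil = find_hallmark(toy_sequence, toy_coil)
```

Here the sequence-only search and the search with a matching structure string both report the motif, while the search against the all-coil string rejects it.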
Li, Fan; Ma, Liying; Feng, Yi; Hu, Jing; Ni, Na; Ruan, Yuhua; Shao, Yiming
2017-06-01
HIV-1 transmission in intravenous drug users (IDUs) has been characterized by high genetic multiplicity, which suggests a greater challenge for blocking HIV-1 infection. We investigated a total of 749 sequences of full-length gp160 gene obtained by single genome sequencing (SGS) from 22 HIV-1 early infected IDUs in Xinjiang province, northwest China, and generated a transmitted and founder virus (T/F virus) consensus sequence (IDU.CON). The T/F virus was classified as subtype CRF07_BC and predicted to be CCR5-tropic virus. The variable region (V1, V2, and V4 loop) of IDU.CON showed length variation compared with the heterosexual T/F virus consensus sequence (HSX.CON) and homosexual T/F virus consensus sequence (MSM.CON). A total of 26 N-linked glycosylation sites were discovered in the IDU.CON sequence, fewer than in MSM.CON and HSX.CON. Characterization of T/F virus from IDUs highlights the genetic make-up and complexity of virus near the moment of transmission or in early infection preceding systemic dissemination and is important for the development of effective HIV-1 preventive methods, including vaccines.
Protein Information Resource: a community resource for expert annotation of protein data
Barker, Winona C.; Garavelli, John S.; Hou, Zhenglin; Huang, Hongzhan; Ledley, Robert S.; McGarvey, Peter B.; Mewes, Hans-Werner; Orcutt, Bruce C.; Pfeiffer, Friedhelm; Tsugita, Akira; Vinayaka, C. R.; Xiao, Chunlin; Yeh, Lai-Su L.; Wu, Cathy
2001-01-01
The Protein Information Resource, in collaboration with the Munich Information Center for Protein Sequences (MIPS) and the Japan International Protein Information Database (JIPID), produces the most comprehensive and expertly annotated protein sequence database in the public domain, the PIR-International Protein Sequence Database. To provide timely and high quality annotation and promote database interoperability, the PIR-International employs rule-based and classification-driven procedures based on controlled vocabulary and standard nomenclature and includes status tags to distinguish experimentally determined from predicted protein features. The database contains about 200 000 non-redundant protein sequences, which are classified into families and superfamilies and their domains and motifs identified. Entries are extensively cross-referenced to other sequence, classification, genome, structure and activity databases. The PIR web site features search engines that use sequence similarity and database annotation to facilitate the analysis and functional identification of proteins. The PIR-International databases and search tools are accessible on the PIR web site at http://pir.georgetown.edu/ and at the MIPS web site at http://www.mips.biochem.mpg.de. The PIR-International Protein Sequence Database and other files are also available by FTP. PMID:11125041
Niv, Masha Y; Skrabanek, Lucy; Roberts, Richard J; Scheraga, Harold A; Weinstein, Harel
2008-05-01
Restriction endonucleases (REases) are DNA-cleaving enzymes that have become indispensable tools in molecular biology. Type II REases are highly divergent in sequence despite their common structural core, function and, in some cases, common specificities towards DNA sequences. This makes it difficult to identify and classify them functionally based on sequence, and has hampered the efforts of specificity-engineering. Here, we define novel REase sequence motifs, which extend beyond the PD-(D/E)XK hallmark, and incorporate secondary structure information. The automated search using these motifs is carried out with a newly developed fast regular expression matching algorithm that accommodates long patterns with optional secondary structure constraints. Using this new tool, named Scan2S, motifs derived from REases with specificity towards GATC- and CCGG-containing DNA sequences successfully identify REases of the same specificity. Notably, some of these sequences are not identified by standard sequence detection tools. The new motifs highlight potential specificity-determining positions that do not fully overlap for the GATC- and the CCGG-recognizing REases and are candidates for specificity re-engineering.
Length-independent structural similarities enrich the antibody CDR canonical class model.
Nowak, Jaroslaw; Baker, Terry; Georges, Guy; Kelm, Sebastian; Klostermann, Stefan; Shi, Jiye; Sridharan, Sudharsan; Deane, Charlotte M
2016-01-01
Complementarity-determining regions (CDRs) are antibody loops that make up the antigen binding site. Here, we show that all CDR types have structurally similar loops of different lengths. Based on these findings, we created length-independent canonical classes for the non-H3 CDRs. Our length variable structural clusters show strong sequence patterns suggesting either that they evolved from the same original structure or result from some form of convergence. We find that our length-independent method not only clusters a larger number of CDRs, but also predicts canonical class from sequence better than the standard length-dependent approach. To demonstrate the usefulness of our findings, we predicted cluster membership of CDR-L3 sequences from 3 next-generation sequencing datasets of the antibody repertoire (over 1,000,000 sequences). Using the length-independent clusters, we can structurally classify an additional 135,000 sequences, which represents a ∼20% improvement over the standard approach. This suggests that our length-independent canonical classes might be a highly prevalent feature of antibody space, and could substantially improve our ability to accurately predict the structure of novel CDRs identified by next-generation sequencing.
Language Classification using N-grams Accelerated by FPGA-based Bloom Filters
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jacob, A; Gokhale, M
N-Gram (n-character sequences in text documents) counting is a well-established technique used in classifying the language of text in a document. In this paper, n-gram processing is accelerated through the use of reconfigurable hardware on the XtremeData XD1000 system. Our design employs parallelism at multiple levels, with parallel Bloom Filters accessing on-chip RAM, parallel language classifiers, and parallel document processing. In contrast to another hardware implementation (HAIL algorithm) that uses off-chip SRAM for lookup, our highly scalable implementation uses only on-chip memory blocks. Our implementation of end-to-end language classification runs at 85x comparable software and 1.45x the competing hardware design.
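A software-only sketch can make the n-gram and Bloom filter idea concrete. It is not the FPGA design or the HAIL algorithm described above: the filter size, the number of hash functions, the use of SHA-256 as the hash, the 4-character n-grams and the two toy training sentences are all assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array with several hash positions per item."""

    def __init__(self, size_bits=1 << 14, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive independent positions by salting one hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def ngrams(text, n=4):
    """Return all overlapping n-character substrings of the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train(samples, n=4):
    """Build one Bloom filter of n-grams per language."""
    filters = {}
    for language, text in samples.items():
        bloom = BloomFilter()
        for gram in ngrams(text, n):
            bloom.add(gram)
        filters[language] = bloom
    return filters


def classify(text, filters, n=4):
    """Pick the language whose filter reports the most n-gram hits."""
    scores = {
        language: sum(gram in bloom for gram in ngrams(text, n))
        for language, bloom in filters.items()
    }
    return max(scores, key=scores.get)


# Toy training data (hypothetical, far smaller than a real language profile).
samples = {
    "english": "the quick brown fox jumps over the lazy dog",
    "german": "der schnelle braune fuchs springt ueber den faulen hund",
}
filters = train(samples)
```

Each language filter can be queried independently of the others, which is the property a parallel implementation would exploit; a Bloom filter may report false positives but never misses an n-gram it was trained on.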
Surgical and molecular pathology of pancreatic neoplasms.
Hackeng, Wenzel M; Hruban, Ralph H; Offerhaus, G Johan A; Brosens, Lodewijk A A
2016-06-07
Histologic characteristics have proven to be very useful for classifying different types of tumors of the pancreas. As a result, the major tumor types in the pancreas have long been classified based on their microscopic appearance. Recent advances in whole exome sequencing, gene expression profiling, and knowledge of tumorigenic pathways have deepened our understanding of the underlying biology of pancreatic neoplasia. These advances have not only confirmed the traditional histologic classification system, but also opened new doors to early diagnosis and targeted treatment. This review discusses the histopathology, genetic and epigenetic alterations and potential treatment targets of the five major malignant pancreatic tumors - pancreatic ductal adenocarcinoma, pancreatic neuroendocrine tumor, solid-pseudopapillary neoplasm, acinar cell carcinoma and pancreatoblastoma.
Piao, Yongjun; Piao, Minghao; Ryu, Keun Ho
2017-01-01
Cancer classification has been a crucial topic of research in cancer treatment. In the last decade, messenger RNA (mRNA) expression profiles have been widely used to classify different types of cancers. With the discovery of a new class of small non-coding RNAs, known as microRNAs (miRNAs), various studies have shown that the expression patterns of miRNA can also accurately classify human cancers. Therefore, there is a great demand for the development of machine learning approaches to accurately classify various types of cancers using miRNA expression data. In this article, we propose a feature subset-based ensemble method in which each model is learned from a different projection of the original feature space to classify multiple cancers. In our method, the feature relevance and redundancy are considered to generate multiple feature subsets, the base classifiers are learned from each independent miRNA subset, and the average posterior probability is used to combine the base classifiers. To test the performance of our method, we used bead-based and sequence-based miRNA expression datasets and conducted 10-fold and leave-one-out cross validations. The experimental results show that the proposed method yields good results and has higher prediction accuracy than popular ensemble methods. The Java program and source code of the proposed method and the datasets in the experiments are freely available at https://sourceforge.net/projects/mirna-ensemble/. Copyright © 2016 Elsevier Ltd. All rights reserved.
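Combining base classifiers by average posterior probability, with each classifier seeing a different feature subset, can be sketched as follows. The nearest-centroid base learner, the hand-picked feature subsets and the toy data are stand-ins assumed for illustration; the abstract does not specify the base classifiers, and the relevance and redundancy based subset selection is not reproduced here.

```python
import math


def centroid_posterior(train_x, train_y, x, features):
    """Stand-in base classifier restricted to the given feature indices.

    Scores each class by exp(-squared distance to its centroid) and
    normalises the scores so that they sum to one. Suitable for small toy
    values only, since very large distances would underflow to zero.
    """
    scores = {}
    for label in sorted(set(train_y)):
        rows = [r for r, y in zip(train_x, train_y) if y == label]
        centroid = [sum(r[f] for r in rows) / len(rows) for f in features]
        dist = sum((x[f] - m) ** 2 for f, m in zip(features, centroid))
        scores[label] = math.exp(-dist)
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}


def ensemble_predict(train_x, train_y, x, feature_subsets):
    """Average the posteriors of one base classifier per feature subset."""
    posteriors = [
        centroid_posterior(train_x, train_y, x, subset)
        for subset in feature_subsets
    ]
    averaged = {
        label: sum(p[label] for p in posteriors) / len(posteriors)
        for label in posteriors[0]
    }
    return max(averaged, key=averaged.get), averaged


# Toy expression profiles with three features and two classes (hypothetical).
train_x = [[0.0, 0.0, 0.0], [0.2, 0.1, 0.0], [1.0, 1.0, 1.0], [0.9, 1.1, 1.0]]
train_y = ["A", "A", "B", "B"]
subsets = [[0, 1], [1, 2], [0, 2]]

label, averaged = ensemble_predict(train_x, train_y, [0.1, 0.0, 0.1], subsets)
```

Averaging keeps the combined output a valid probability distribution, so the ensemble decision is simply the class with the largest mean posterior.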
Detecting falls with wearable sensors using machine learning techniques.
Özdemir, Ahmet Turan; Barshan, Billur
2014-06-18
Falls are a serious public health problem and possibly life threatening for people in fall risk groups. We develop an automated fall detection system with wearable motion sensor units fitted to the subjects' body at six different positions. Each unit comprises three tri-axial devices (accelerometer, gyroscope, and magnetometer/compass). Fourteen volunteers perform a standardized set of movements including 20 voluntary falls and 16 activities of daily living (ADLs), resulting in a large dataset with 2520 trials. To reduce the computational complexity of training and testing the classifiers, we focus on the raw data for each sensor in a 4 s time window around the point of peak total acceleration of the waist sensor, and then perform feature extraction and reduction. Most earlier studies on fall detection employ rule-based approaches that rely on simple thresholding of the sensor outputs. We successfully distinguish falls from ADLs using six machine learning techniques (classifiers): the k-nearest neighbor (k-NN) classifier, least squares method (LSM), support vector machines (SVM), Bayesian decision making (BDM), dynamic time warping (DTW), and artificial neural networks (ANNs). We compare the performance and the computational complexity of the classifiers and achieve the best results with the k-NN classifier and LSM, with sensitivity, specificity, and accuracy all above 99%. These classifiers also have acceptable computational requirements for training and testing. Our approach would be applicable in real-world scenarios where data records of indeterminate length, containing multiple activities in sequence, are recorded.
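The windowing step described above, taking a 4 s segment centred on the peak total acceleration, can be sketched in a few lines. The sampling rate, the function names and the particular summary features are assumptions for illustration; the abstract does not list the features the authors extracted.

```python
import math


def total_acceleration(sample):
    """Magnitude of one tri-axial accelerometer sample (ax, ay, az)."""
    ax, ay, az = sample
    return math.sqrt(ax * ax + ay * ay + az * az)


def window_around_peak(samples, rate_hz, window_s=4.0):
    """Return (peak index, samples within window_s centred on the peak).

    The window is clipped at the ends of the recording.
    """
    half = int(round(window_s * rate_hz / 2))
    magnitudes = [total_acceleration(s) for s in samples]
    peak = max(range(len(magnitudes)), key=magnitudes.__getitem__)
    start = max(0, peak - half)
    end = min(len(samples), peak + half + 1)
    return peak, samples[start:end]


def simple_features(window):
    """A few illustrative summary features of the acceleration magnitude."""
    mags = [total_acceleration(s) for s in window]
    mean = sum(mags) / len(mags)
    variance = sum((m - mean) ** 2 for m in mags) / len(mags)
    return {"min": min(mags), "max": max(mags), "mean": mean, "variance": variance}


# Toy waist-sensor recording: 1 g at rest with a single spike (hypothetical).
recording = [(0.0, 0.0, 1.0)] * 200
recording[120] = (3.0, 4.0, 0.0)

peak_index, window = window_around_peak(recording, rate_hz=25)
features = simple_features(window)
```

With an assumed 25 Hz rate, the window holds 50 samples on each side of the peak plus the peak itself, and the resulting feature dictionary is what a classifier such as k-NN would take as input.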
SINEBase: a database and tool for SINE analysis.
Vassetzky, Nikita S; Kramerov, Dmitri A
2013-01-01
SINEBase (http://sines.eimb.ru) integrates the revisited body of knowledge about short interspersed elements (SINEs). A set of formal definitions concerning SINEs was introduced. All available sequence data were screened through these definitions and the genetic elements misidentified as SINEs were discarded. As a result, 175 SINE families have been recognized in animals, flowering plants and green algae. These families were classified by the modular structure of their nucleotide sequences and the frequencies of different patterns were evaluated. These data formed the basis for the database of SINEs. The SINEBase website can be used in two ways: first, to explore the database of SINE families, and second, to analyse candidate SINE sequences using specifically developed tools. This article presents an overview of the database and the process of SINE identification and analysis.
SINEBase: a database and tool for SINE analysis
Vassetzky, Nikita S.; Kramerov, Dmitri A.
2013-01-01
SINEBase (http://sines.eimb.ru) integrates the revisited body of knowledge about short interspersed elements (SINEs). A set of formal definitions concerning SINEs was introduced. All available sequence data were screened through these definitions and the genetic elements misidentified as SINEs were discarded. As a result, 175 SINE families have been recognized in animals, flowering plants and green algae. These families were classified by the modular structure of their nucleotide sequences and the frequencies of different patterns were evaluated. These data formed the basis for the database of SINEs. The SINEBase website can be used in two ways: first, to explore the database of SINE families, and second, to analyse candidate SINE sequences using specifically developed tools. This article presents an overview of the database and the process of SINE identification and analysis. PMID:23203982
Types of diaphragmatic motion during hepatic angiography.
Katsuda, T; Kuroda, C; Fujita, M
1997-01-01
To determine the types and causes of diaphragmatic motion during hepatic angiography, the authors used transarterial cut-film portography (TAP) to study movement of the diaphragm during breath-holding. Thirty-three TAP sequences were studied, and the patients' diaphragmatic motions were classified into four categories according to the distance their diaphragms moved. Results showed that the diaphragm was stationary in 33% of the TAP studies, while perpetual motion occurred in 15% of the studies, early-phase motion occurred in 12% and late-phase motion occurred in 40%. Ten sequences showed diaphragmatic motion of more than 10 mm, with eight sequences showing caudal motion and two showing cranial motion. This article discusses the cause of diaphragmatic motion during breath-holding for hepatic angiography and presents suggestions to reduce motion artifacts during the exam.
Chiaki Hori; Jill Gaskell; Kiyohiko Igarashi; Masahiro Samejima; David Hibbett; Bernard Henrissat; Dan Cullen
2013-01-01
To degrade the polysaccharides, wood-decay fungi secrete a variety of glycoside hydrolases (GHs) and carbohydrate esterases (CEs) classified into various sequence-based families of carbohydrate-active enzymes (CAZys) and their appended carbohydrate-binding modules (CBM). Oxidative enzymes, such as cellobiose dehydrogenase (CDH) and lytic polysaccharide monooxygenase (...
Robert L. Harrison; Melody A. Keena; Daniel L. Rowley
2014-01-01
Lymantria dispar multiple nucleopolyhedrovirus (LdMNPV) has been formulated and applied to control outbreaks of the gypsy moth, L. dispar. To classify and determine the degree of genetic variation among isolates of L. dispar NPVs from different parts of the range of the gypsy moth, partial sequences of the
P-type ATPase superfamily: evidence for critical roles for kingdom evolution.
Okamura, Hideyuki; Denawa, Masatsugu; Ohniwa, Ryosuke; Takeyasu, Kunio
2003-04-01
The P-type ATPase has become a protein superfamily. On the basis of sequence similarities, the phylogenetic analyses, and substrate specificities, this superfamily can be classified into 5 families and 11 subfamilies. A comparative phylogenetic analysis demonstrates the relationship between the molecular evolution of these subfamilies and the establishment of the kingdoms of living things.
Easy-to-use phylogenetic analysis system for hepatitis B virus infection.
Sugiyama, Masaya; Inui, Ayano; Shin-I, Tadasu; Komatsu, Haruki; Mukaide, Motokazu; Masaki, Naohiko; Murata, Kazumoto; Ito, Kiyoaki; Nakanishi, Makoto; Fujisawa, Tomoo; Mizokami, Masashi
2011-10-01
Molecular phylogenetic analysis has been broadly applied in clinical and virological studies. However, choosing appropriate settings and calculation parameters is difficult for non-specialists in molecular genetics. In the present study, a phylogenetic analysis tool was developed for the easy determination of genotypes and transmission routes. A total of 23 patients from 10 families infected with hepatitis B virus (HBV), in whom intrafamilial transmission was suspected, were enrolled. The extracted HBV DNA was amplified and sequenced in a region of the S gene. Software to automatically classify query sequences was constructed and installed on the Hepatitis Virus Database (HVDB). Reference sequences, which covered the major genotypes from A to H, were retrieved from HVDB. Multiple alignments using CLUSTAL W were performed before the genetic distance matrix was calculated with the six-parameter method. The phylogenetic tree was output by the neighbor-joining method. A user interface based on a WWW browser was also developed for intuitive control. This system was named the easy-to-use phylogenetic analysis system (E-PAS). Twenty-three sera from 10 families were analyzed to evaluate E-PAS. The queries obtained from nine families were genotype C and were located in one cluster per family. However, one patient was classified into a cluster different from that of her family, suggesting that E-PAS detected a sample whose transmission route was distinct from that of her family. E-PAS was designed so that sequence data are the only material required to output a phylogenetic tree. E-PAS could be extended to determine HBV genotypes as well as transmission routes. © 2011 The Japan Society of Hepatology.
Qiu, Wang-Ren; Jiang, Shi-Yu; Sun, Bi-Qian; Xiao, Xuan; Cheng, Xiang; Chou, Kuo-Chen
2017-01-01
As a kind of post-transcriptional modification (PTCM) in RNA, 2'-O-methylation occurs in the processes of both development and disease formation. Accordingly, from the angles of both basic research and drug development, we are facing a challenging problem: given an uncharacterized RNA sequence formed by many nucleotides of A (adenine), C (cytosine), G (guanine), and U (uracil), which ones can undergo 2'-O-methylation, and which ones cannot? Unfortunately, so far no computational method whatsoever has been developed to address such a problem. To fill this gap, we propose a predictor called iRNA-2methyl. It is formed by incorporating a series of sequence-coupled factors into the general PseKNC (pseudo nucleotide composition), followed by fusing 12 basic random forest classifiers into four ensemble predictors, each aimed at identifying the cases of A, C, G, and U, respectively, along the RNA sequence concerned. Rigorous jackknife cross-validations have indicated that the success rates are very high (>93%). For the convenience of most experimental scientists, a user-friendly web-server for iRNA-2methyl has been established at http://www.jci-bioinfo.cn/iRNA-2methyl, by which users can easily obtain their desired results without the need to go through the complicated mathematical equations involved. The proposed predictor iRNA-2methyl will become a very useful bioinformatics tool for medicinal chemistry, helping to design effective drugs against the diseases related to 2'-O-methylation. Copyright © Bentham Science Publishers.
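As a rough illustration of the kind of sequence encoding that pseudo nucleotide composition builds on, the sketch below computes a plain k-mer frequency vector for an RNA sequence. It covers only the composition part; the sequence-coupled factors that iRNA-2methyl incorporates are not reproduced here, and the function name and default k are assumptions made for the example.

```python
from itertools import product


def kmer_composition(seq, k=2, alphabet="ACGU"):
    """Return normalized k-mer frequencies in a fixed lexicographic order.

    Windows containing characters outside the alphabet are ignored.
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:
            counts[window] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]


vector = kmer_composition("ACGU", k=2)
print(len(vector))  # one entry per possible dinucleotide
```

A vector like this could serve as input to a random forest classifier; how the twelve base classifiers are fused into four ensembles is not specified in the abstract and is not modeled here.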
Camps, Carme; Petousi, Nayia; Bento, Celeste; Cario, Holger; Copley, Richard R.; McMullin, Mary Frances; van Wijk, Richard; Ratcliffe, Peter J.; Robbins, Peter A.; Taylor, Jenny C.
2016-01-01
Erythrocytosis is a rare disorder characterized by increased red cell mass and elevated hemoglobin concentration and hematocrit. Several genetic variants have been identified as causes for erythrocytosis in genes belonging to different pathways including oxygen sensing, erythropoiesis and oxygen transport. However, despite clinical investigation and screening for these mutations, the cause of disease cannot be found in a considerable number of patients, who are classified as having idiopathic erythrocytosis. In this study, we developed a targeted next-generation sequencing panel encompassing the exonic regions of 21 genes from relevant pathways (~79 Kb) and sequenced 125 patients with idiopathic erythrocytosis. The panel effectively screened 97% of coding regions of these genes, with an average coverage of 450×. It identified 51 different rare variants, all leading to alterations of protein sequence, with 57 out of 125 cases (45.6%) having at least one of these variants. Ten of these were known erythrocytosis-causing variants, which had been missed following existing diagnostic algorithms. Twenty-two were novel variants in erythrocytosis-associated genes (EGLN1, EPAS1, VHL, BPGM, JAK2, SH2B3) and in novel genes included in the panel (e.g. EPO, EGLN2, HIF3A, OS9), some with a high likelihood of functionality, for which future segregation, functional and replication studies will be useful to provide further evidence for causality. The rest were classified as polymorphisms. Overall, these results demonstrate the benefits of using a gene panel rather than existing methods in which focused genetic screening is performed depending on biochemical measurements: the gene panel improves diagnostic accuracy and provides the opportunity for discovery of novel variants. PMID:27651169
Camps, Carme; Petousi, Nayia; Bento, Celeste; Cario, Holger; Copley, Richard R; McMullin, Mary Frances; van Wijk, Richard; Ratcliffe, Peter J; Robbins, Peter A; Taylor, Jenny C
2016-11-01
Erythrocytosis is a rare disorder characterized by increased red cell mass and elevated hemoglobin concentration and hematocrit. Several genetic variants have been identified as causes for erythrocytosis in genes belonging to different pathways including oxygen sensing, erythropoiesis and oxygen transport. However, despite clinical investigation and screening for these mutations, the cause of disease cannot be found in a considerable number of patients, who are classified as having idiopathic erythrocytosis. In this study, we developed a targeted next-generation sequencing panel encompassing the exonic regions of 21 genes from relevant pathways (~79 Kb) and sequenced 125 patients with idiopathic erythrocytosis. The panel effectively screened 97% of coding regions of these genes, with an average coverage of 450×. It identified 51 different rare variants, all leading to alterations of protein sequence, with 57 out of 125 cases (45.6%) having at least one of these variants. Ten of these were known erythrocytosis-causing variants, which had been missed following existing diagnostic algorithms. Twenty-two were novel variants in erythrocytosis-associated genes (EGLN1, EPAS1, VHL, BPGM, JAK2, SH2B3) and in novel genes included in the panel (e.g. EPO, EGLN2, HIF3A, OS9), some with a high likelihood of functionality, for which future segregation, functional and replication studies will be useful to provide further evidence for causality. The rest were classified as polymorphisms. Overall, these results demonstrate the benefits of using a gene panel rather than existing methods in which focused genetic screening is performed depending on biochemical measurements: the gene panel improves diagnostic accuracy and provides the opportunity for discovery of novel variants. Copyright© Ferrata Storti Foundation.
André, Nicole M.
2018-01-01
ABSTRACT The difficulties related to virus taxonomy have been amplified by recent advances in next-generation sequencing and metagenomics, prompting the field to revisit the question of what constitutes a useful viral classification. Here, taking a challenging classification found in coronaviruses, we argue that consideration of biological properties in addition to sequence-based demarcations is critical for generating useful taxonomy that recapitulates complex evolutionary histories. Within the Alphacoronavirus genus, the Alphacoronavirus 1 species encompasses several biologically distinct viruses. We carried out functionally based phylogenetic analysis, centered on the spike gene, which encodes the main surface antigen and primary driver of tropism and pathogenesis. Within the Alphacoronavirus 1 species, we identify clade A (encompassing serotype I feline coronavirus [FCoV] and canine coronavirus [CCoV]) and clade B (grouping serotype II FCoV and CCoV and transmissible gastroenteritis virus [TGEV]-like viruses). We propose this clade designation, along with the newly proposed Alphacoronavirus 2 species, as an improved way to classify the Alphacoronavirus genus. IMPORTANCE Our work focuses on improving the classification of the Alphacoronavirus genus. The Alphacoronavirus 1 species groups viruses of veterinary importance that infect distinct mammalian hosts and includes canine and feline coronaviruses and transmissible gastroenteritis virus. It is the prototype species of the Alphacoronavirus genus; however, it encompasses biologically distinct viruses. To better characterize this prototypical species, we performed phylogenetic analyses based on the sequences of the spike protein, one of the main determinants of tropism and pathogenesis, and reveal the existence of two subgroups or clades that fit with previously established serotype demarcations. We propose a new clade designation to better classify Alphacoronavirus 1 members. PMID:29299531
Pettersson, B; Kodjo, A; Ronaghi, M; Uhlén, M; Tønjum, T
1998-01-01
Thirty-three strains previously classified into 11 species in the bacterial family Moraxellaceae were subjected to phylogenetic analysis based on 16S rRNA sequences. The family Moraxellaceae formed a distinct clade consisting of four phylogenetic groups as judged from branch lengths, bootstrap values and signature nucleotides. Group I contained the classical moraxellae and strains of the coccal moraxellae, previously known as Branhamella, with 16S rRNA similarity of ≥ 95%. A further division of group I into five tentative clusters is discussed. Group II consisted of two strains representing Moraxella atlantae and Moraxella osloensis. These strains were only distantly related to each other (93.4%) and also to the other members of the Moraxellaceae (≤ 93%). Therefore, reasons for reclassification of these species into separate and new genera are discussed. Group III harboured strains of the genus Psychrobacter and strain 752/52 of [Moraxella] phenylpyruvica. This strain of [M.] phenylpyruvica formed an early branch from the group III line of descent. Interestingly, a distant relationship was found between Psychrobacter phenylpyruvicus strain ATCC 23333T (formerly classified as [M.] phenylpyruvica) and [M.] phenylpyruvica strain 752/52, exhibiting less than 96% nucleotide similarity between their 16S rRNA sequences. The establishment of a new genus for [M.] phenylpyruvica strain 752/52 is therefore suggested. Group IV contained only two strains of the genus Acinetobacter. Strategies for the development of diagnostic probes and distinctive sequences for 16S rRNA-based species-specific assays within group I are suggested. Although these findings add to the classificatory placements within the Moraxellaceae, analysis of a more comprehensive selection of strains is still needed to obtain a complete classification system within this family.
2015-01-01
Abstract Trees contribute to enormous plant oil reserves because many trees contain 50%–80% of oil (triacylglycerols, TAGs) in the fruits and kernels. TAGs accumulate in subcellular structures called oil bodies/droplets, in which TAGs are covered by low-molecular-mass hydrophobic proteins called oleosins (OLEs). The OLEs/TAGs ratio determines the size and shape of intracellular oil bodies. There is a lack of comprehensive sequence analysis and structural information of OLEs among diverse trees. The objectives of this study were to identify OLEs from 22 tree species (e.g., tung tree, tea-oil tree, castor bean), perform genome-wide analysis of OLEs, classify OLEs, identify conserved sequence motifs and amino acid residues, and predict secondary and three-dimensional structures in tree OLEs and OLE subfamilies. Data mining identified 65 OLEs with perfect conservation of the “proline knot” motif (PX5SPX3P) from 19 trees. These OLEs contained >40% hydrophobic amino acid residues. They displayed similar properties and amino acid composition. Genome-wide phylogenetic analysis and multiple sequence alignment demonstrated that these proteins could be classified into five OLE subfamilies. There were distinct patterns of sequence conservation among the OLE subfamilies and within individual tree species. Computational modeling indicated that OLEs were composed of at least three α-helixes connected with short coils without any β-strand and that they exhibited distinct 3D structures and ligand binding sites. These analyses provide fundamental information in the similarity and specificity of diverse OLE isoforms within the same subfamily and among the different species, which should facilitate studying the structure-function relationship and identify critical amino acid residues in OLEs for metabolic engineering of tree TAGs. PMID:26258573
Cao, Heping
2015-09-01
Trees contribute to enormous plant oil reserves because many trees contain 50%-80% of oil (triacylglycerols, TAGs) in the fruits and kernels. TAGs accumulate in subcellular structures called oil bodies/droplets, in which TAGs are covered by low-molecular-mass hydrophobic proteins called oleosins (OLEs). The OLEs/TAGs ratio determines the size and shape of intracellular oil bodies. There is a lack of comprehensive sequence analysis and structural information of OLEs among diverse trees. The objectives of this study were to identify OLEs from 22 tree species (e.g., tung tree, tea-oil tree, castor bean), perform genome-wide analysis of OLEs, classify OLEs, identify conserved sequence motifs and amino acid residues, and predict secondary and three-dimensional structures in tree OLEs and OLE subfamilies. Data mining identified 65 OLEs with perfect conservation of the "proline knot" motif (PX5SPX3P) from 19 trees. These OLEs contained >40% hydrophobic amino acid residues. They displayed similar properties and amino acid composition. Genome-wide phylogenetic analysis and multiple sequence alignment demonstrated that these proteins could be classified into five OLE subfamilies. There were distinct patterns of sequence conservation among the OLE subfamilies and within individual tree species. Computational modeling indicated that OLEs were composed of at least three α-helixes connected with short coils without any β-strand and that they exhibited distinct 3D structures and ligand binding sites. These analyses provide fundamental information in the similarity and specificity of diverse OLE isoforms within the same subfamily and among the different species, which should facilitate studying the structure-function relationship and identify critical amino acid residues in OLEs for metabolic engineering of tree TAGs.
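The "proline knot" pattern PX5SPX3P quoted above (P, any five residues, S, P, any three residues, P) translates directly into a regular expression. The sketch below scans a protein sequence for it; the function name and the toy sequence are hypothetical and not taken from the study.

```python
import re

# P, five arbitrary residues, S, P, three arbitrary residues, P.
# The lookahead lets overlapping occurrences be reported as well.
PROLINE_KNOT = re.compile(r"(?=P.{5}SP.{3}P)")


def find_proline_knots(protein):
    """Return 0-based start positions of every PX5SPX3P match."""
    return [m.start() for m in PROLINE_KNOT.finditer(protein)]


# Toy sequence (hypothetical): the motif starts at index 2.
print(find_proline_knots("MAPAAAAASPAAAPK"))
```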
Newell, Nicholas E
2011-12-15
The extraction of the set of features most relevant to function from classified biological sequence sets is still a challenging problem. A central issue is the determination of expected counts for higher order features so that artifact features may be screened. Cascade detection (CD), a new algorithm for the extraction of localized features from sequence sets, is introduced. CD is a natural extension of the proportional modeling techniques used in contingency table analysis into the domain of feature detection. The algorithm is successfully tested on synthetic data and then applied to feature detection problems from two different domains to demonstrate its broad utility. An analysis of HIV-1 protease specificity reveals patterns of strong first-order features that group hydrophobic residues by side chain geometry and exhibit substantial symmetry about the cleavage site. Higher order results suggest that favorable cooperativity is weak by comparison and broadly distributed, but indicate possible synergies between negative charge and hydrophobicity in the substrate. Structure-function results for the Schellman loop, a helix-capping motif in proteins, contain strong first-order features and also show statistically significant cooperativities that provide new insights into the design of the motif. These include a new 'hydrophobic staple' and multiple amphipathic and electrostatic pair features. CD should prove useful not only for sequence analysis, but also for the detection of multifactor synergies in cross-classified data from clinical studies or other sources. Availability: Windows XP/7 application and data files are available at https://sites.google.com/site/cascadedetect/home. Contact: nacnewell@comcast.net. Supplementary information is available at Bioinformatics online.
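The idea of comparing observed and expected counts for a higher-order feature can be illustrated with the textbook independence expectation from contingency-table analysis. The sketch below is not the cascade detection algorithm itself, whose details are not given in the abstract; it only shows, for a pair of positions, how an expected count follows from the two first-order counts. All names and the toy sequences are hypothetical.

```python
def observed_pair_count(seqs, i, a, j, b):
    """Count sequences with residue a at position i and residue b at position j."""
    return sum(1 for s in seqs if s[i] == a and s[j] == b)


def expected_pair_count(seqs, i, a, j, b):
    """Expected pair count if the two positions were independent.

    Under independence: N * (count_a / N) * (count_b / N) = count_a * count_b / N.
    """
    n = len(seqs)
    if n == 0:
        raise ValueError("need at least one sequence")
    count_a = sum(1 for s in seqs if s[i] == a)
    count_b = sum(1 for s in seqs if s[j] == b)
    return count_a * count_b / n


# Toy two-residue "sequences" (hypothetical data).
seqs = ["AL", "AL", "GV", "GL"]
print(observed_pair_count(seqs, 0, "A", 1, "L"), expected_pair_count(seqs, 0, "A", 1, "L"))
```

An observed count well above the expectation would hint at cooperativity between the two positions; deciding whether the excess is significant needs a statistical test that this sketch leaves out.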
Whittaker, Gary R; André, Nicole M; Millet, Jean Kaoru
2018-01-01
The difficulties related to virus taxonomy have been amplified by recent advances in next-generation sequencing and metagenomics, prompting the field to revisit the question of what constitutes a useful viral classification. Here, taking a challenging classification found in coronaviruses, we argue that consideration of biological properties in addition to sequence-based demarcations is critical for generating useful taxonomy that recapitulates complex evolutionary histories. Within the Alphacoronavirus genus, the Alphacoronavirus 1 species encompasses several biologically distinct viruses. We carried out functionally based phylogenetic analysis, centered on the spike gene, which encodes the main surface antigen and primary driver of tropism and pathogenesis. Within the Alphacoronavirus 1 species, we identify clade A (encompassing serotype I feline coronavirus [FCoV] and canine coronavirus [CCoV]) and clade B (grouping serotype II FCoV and CCoV and transmissible gastroenteritis virus [TGEV]-like viruses). We propose this clade designation, along with the newly proposed Alphacoronavirus 2 species, as an improved way to classify the Alphacoronavirus genus. IMPORTANCE Our work focuses on improving the classification of the Alphacoronavirus genus. The Alphacoronavirus 1 species groups viruses of veterinary importance that infect distinct mammalian hosts and includes canine and feline coronaviruses and transmissible gastroenteritis virus. It is the prototype species of the Alphacoronavirus genus; however, it encompasses biologically distinct viruses. To better characterize this prototypical species, we performed phylogenetic analyses based on the sequences of the spike protein, one of the main determinants of tropism and pathogenesis, and reveal the existence of two subgroups or clades that fit with previously established serotype demarcations. We propose a new clade designation to better classify Alphacoronavirus 1 members.
CANDELS Visual Classifications: Scheme, Data Release, and First Results
NASA Technical Reports Server (NTRS)
Kartaltepe, Jeyhan S.; Mozena, Mark; Kocevski, Dale; McIntosh, Daniel H.; Lotz, Jennifer; Bell, Eric F.; Faber, Sandy; Ferguson, Henry; Koo, David; Bassett, Robert;
2014-01-01
We have undertaken an ambitious program to visually classify all galaxies in the five CANDELS fields down to H <24.5 involving the dedicated efforts of 65 individual classifiers. Once completed, we expect to have detailed morphological classifications for over 50,000 galaxies spanning 0 < z < 4 over all the fields. Here, we present our detailed visual classification scheme, which was designed to cover a wide range of CANDELS science goals. This scheme includes the basic Hubble sequence types, but also includes a detailed look at mergers and interactions, the clumpiness of galaxies, k-corrections, and a variety of other structural properties. In this paper, we focus on the first field to be completed - GOODS-S, which has been classified at various depths. The wide area coverage spanning the full field (wide+deep+ERS) includes 7634 galaxies that have been classified by at least three different people. In the deep area of the field, 2534 galaxies have been classified by at least five different people at three different depths. With this paper, we release to the public all of the visual classifications in GOODS-S along with the Perl/Tk GUI that we developed to classify galaxies. We present our initial results here, including an analysis of our internal consistency and comparisons among multiple classifiers as well as a comparison to the Sersic index. We find that the level of agreement among classifiers is quite good and depends on both the galaxy magnitude and the galaxy type, with disks showing the highest level of agreement and irregulars the lowest. A comparison of our classifications with the Sersic index and restframe colors shows a clear separation between disk and spheroid populations. Finally, we explore morphological k-corrections between the V-band and H-band observations and find that a small fraction (84 galaxies in total) are classified as being very different between these two bands. 
These galaxies typically have very clumpy and extended morphology or are very faint in the V-band.
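One simple way to quantify agreement among several people classifying the same galaxy is the fraction of classifiers who chose the most common label. The sketch below uses that measure purely as an illustration; the abstract does not state which agreement statistic the CANDELS team used, and the function name and labels are assumptions.

```python
from collections import Counter


def agreement_fraction(labels):
    """Fraction of classifiers that voted for the most common label."""
    if not labels:
        raise ValueError("need at least one classification")
    most_common_votes = Counter(labels).most_common(1)[0][1]
    return most_common_votes / len(labels)


# Three classifiers, two of whom agree (hypothetical labels).
print(agreement_fraction(["disk", "disk", "spheroid"]))
```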
Szymańska-Czerwińska, Monika; Mitura, Agata; Niemczuk, Krzysztof; Zaręba, Kinga; Jodełko, Agnieszka; Pluta, Aneta; Scharf, Sabine; Vitek, Bailey; Aaziz, Rachid; Vorimore, Fabien; Laroucau, Karine; Schnee, Christiane
2017-01-01
Wild birds are considered a reservoir for avian chlamydiosis, posing a potential infectious threat to domestic poultry and humans. Analysis of 894 cloacal or fecal swabs from free-living birds in Poland revealed an overall Chlamydiaceae prevalence of 14.8% (n = 132), with the highest prevalence noted in Anatidae (19.7%) and Corvidae (13.4%). Further testing conducted with species-specific real-time PCR showed that 65 samples (49.2%) were positive for C. psittaci whereas only one was positive for C. avium. To classify the non-identified chlamydial agents and to genotype the C. psittaci and C. avium-positive samples, specimens were subjected to ompA-PCR and sequencing (n = 83). The ompA-based NJ dendrogram revealed that only 23 out of 83 sequences were assigned to C. psittaci, in particular to four clades representing the previously described C. psittaci genotypes B, C, Mat116 and 1V. The 59 remaining sequences were assigned to two new clades, named G1 and G2, each including sequences recently obtained from chlamydiae detected in Swedish wetland birds. G1 (18 samples from Anatidae and Rallidae) grouped closely together with genotype 1V and in relative proximity to several C. abortus isolates, and G2 (41 samples from Anatidae and Corvidae) grouped closely to C. psittaci strains of the classical ABE cluster, Mat116 and M56. Finally, deep molecular analysis of four representative isolates of genotypes 1V, G1 and G2 based on 16S rRNA, IGS and partial 23S rRNA sequences as well as MLST clearly classified these isolates within the C. abortus species. Consequently, we propose an expansion of the C. abortus species to include not only the classical isolates of mammalian origin, but also avian isolates so far referred to as atypical C. psittaci or C. psittaci/C. abortus intermediates.
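The percentages and clade counts quoted above can be cross-checked with a few lines of arithmetic. The sketch below only recomputes figures stated in the abstract; the helper name is hypothetical.

```python
def percent(part, whole):
    """Percentage of part in whole, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# 132 Chlamydiaceae-positive swabs out of 894 tested.
overall_prevalence = percent(132, 894)

# 65 of the 132 positive samples were C. psittaci by real-time PCR.
psittaci_share = percent(65, 132)

# The new clades G1 (18 samples) and G2 (41 samples) together
# account for the 59 sequences not assigned to C. psittaci genotypes.
new_clade_total = 18 + 41

print(overall_prevalence, psittaci_share, new_clade_total)
```

Note that 23 + 59 = 82 of the 83 sequenced samples; the abstract does not say explicitly where the remaining sequence was placed.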
Searching for δ Scuti-type pulsation and characterising northern pre-main-sequence field stars
NASA Astrophysics Data System (ADS)
Díaz-Fraile, D.; Rodríguez, E.; Amado, P. J.
2014-08-01
Context. Pre-main-sequence (PMS) stars are objects evolving from the birthline to the zero-age main sequence (ZAMS). Given a mass range near the ZAMS, the temperatures and luminosities of PMS and main-sequence stars are very similar. Moreover, their evolutionary tracks intersect one another, causing some ambiguity in the determination of their evolutionary status. In this context, the detection and study of pulsations in PMS stars is crucial for differentiating between both types of stars by obtaining information on their interiors via asteroseismic techniques. Aims: A photometric variability study of a sample of northern field stars, which were previously classified as either PMS or Herbig Ae/Be objects, has been undertaken with the purpose of detecting δ Scuti-type pulsations. Determination of physical parameters for these stars has also been carried out to locate them on the Hertzsprung-Russell diagram and check the instability strip for this type of pulsator. Methods: Multichannel photomultiplier and CCD time series photometry in the uvby Strömgren and BVI Johnson bands were obtained during four consecutive years from 2007 to 2010. The light curves have been analysed, and a variability criterion has been established. Among the objects classified as variable stars, we have selected those which present periodicities above 4 d⁻¹, which was established as the lowest limit for δ Scuti-type pulsations in this investigation. Finally, these variable stars have been placed in a colour-magnitude diagram using the physical parameters derived with the collected uvbyβ Strömgren-Crawford photometry. Results: Five PMS δ Scuti- and three probable β Cephei-type stars have been detected. Two additional PMS δ Scuti stars are also confirmed in this work. Moreover, three new δ Scuti- and two γ Doradus-type stars have been detected among the main-sequence objects used as comparison or check stars.
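The selection criterion stated in the Methods (periodicities above 4 cycles per day) amounts to a simple threshold on frequency, or equivalently periods shorter than 6 hours. A minimal sketch follows, assuming periods are supplied in hours; the constant and function names are hypothetical, and the variability criterion applied beforehand is not modeled.

```python
# Lower frequency limit for delta Scuti-type pulsation used in the study,
# in cycles per day.
DELTA_SCUTI_MIN_FREQ = 4.0


def frequency_per_day(period_hours):
    """Convert a pulsation period in hours to a frequency in cycles per day."""
    if period_hours <= 0:
        raise ValueError("period must be positive")
    return 24.0 / period_hours


def passes_frequency_cut(period_hours):
    """True if the periodicity lies strictly above the 4 cycles/day limit."""
    return frequency_per_day(period_hours) > DELTA_SCUTI_MIN_FREQ


print(passes_frequency_cut(2.0), passes_frequency_cut(12.0))
```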
2013-01-01
Background Significant efforts have been made to address the problem of identifying short genes in prokaryotic genomes. However, most known methods are not effective in detecting short genes. Because of the limited information contained in short DNA sequences, it is very difficult to accurately distinguish between protein coding and non-coding sequences in prokaryotic genomes. We have developed a new Iteratively Adaptive Sparse Partial Least Squares (IASPLS) algorithm as the classifier to improve the accuracy of the identification process. Results For testing, we chose the short coding and non-coding sequences from seven prokaryotic organisms. We used seven feature sets (including GC content, Z-curve, etc.) of short genes. In comparison with the GeneMarkS, Metagene, Orphelia, and Heuristic Approaches methods, our model achieved the best prediction performance in identification of short prokaryotic genes. Even when we focused on the very short length group ([60–100 nt)), our model provided sensitivity as high as 83.44% and specificity as high as 92.8%. These values are two or three times higher than those of three of the other methods, while Metagene fails to recognize genes in this length range. The experiments also showed that IASPLS can improve the identification accuracy in comparison with other widely used classifiers, i.e. Logistic, Random Forest (RF) and K nearest neighbors (KNN). The accuracy with IASPLS improved by 5.90% or more in comparison with the other methods. In addition to the improvements in accuracy, IASPLS required ten times less computer time than KNN or RF. Conclusions We conclude that our method is preferable for application as an automated method of short gene classification. Its linearity and easily optimized parameters make it practicable for predicting short genes of newly-sequenced or under-studied species.
Reviewers This article was reviewed by Alexey Kondrashov, Rajeev Azad (nominated by Dr J.Peter Gogarten) and Yuriy Fofanov (nominated by Dr Janet Siefert). PMID:24067167
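Two of the feature sets named above, GC content and the Z-curve, are straightforward to compute from a DNA sequence. The sketch below uses the standard Z-curve component definitions (purine/pyrimidine, amino/keto, weak/strong hydrogen bonding), normalized by sequence length. How the authors windowed or combined these features is not stated in the abstract, so this is an illustration rather than their feature pipeline; the function names are hypothetical.

```python
def gc_content(seq):
    """Fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)


def z_curve(seq):
    """Length-normalized Z-curve components (x, y, z) of a DNA sequence.

    x: purines (A, G) minus pyrimidines (C, T)
    y: amino bases (A, C) minus keto bases (G, T)
    z: weak H-bond bases (A, T) minus strong H-bond bases (G, C)
    """
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    a, c, g, t = (seq.count(base) for base in "ACGT")
    n = len(seq)
    x = ((a + g) - (c + t)) / n
    y = ((a + c) - (g + t)) / n
    z = ((a + t) - (g + c)) / n
    return x, y, z


print(gc_content("AACG"), z_curve("AACG"))
```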
Oliveira, Marília Barros; Junior, Murillo Lobo; Grossi-de-Sá, Maria Fátima; Petrofeza, Silvana
2015-06-15
Sclerotinia sclerotiorum (Lib.) de Bary is a necrotrophic fungal pathogen that causes a disease known as white mold, which is a major problem for dry bean (Phaseolus vulgaris L.) and other crops in many growing areas in Brazil. To investigate the role of methyl jasmonate (MeJA) in defending dry bean plants against S. sclerotiorum, we used suppression subtractive hybridization (SSH) of cDNA and identified genes that are differentially expressed during plant-pathogen interactions after treatment. Exogenous MeJA application enhanced resistance to the pathogen, and SSH analyses led to the identification of 94 unigenes, presumably involved in a variety of functions, which were classified into several functional categories, including metabolism, signal transduction, protein biogenesis and degradation, and cell defense and rescue. Using RT-qPCR, some unigenes were found to be differentially expressed in a time-dependent manner in dry bean plants during the interaction with S. sclerotiorum after MeJA treatment, including the pathogenesis-related protein PR3 (chitinase), PvCallose (callose synthase), PvNBS-LRR (NBS-LRR resistance-like protein), PvF-box (F-box family protein-like), and a polygalacturonase inhibitor protein (PGIP). Based on these expression data, the putative roles of differentially expressed genes were discussed in relation to the disease and MeJA resistance induction. Changes in the activity of the pathogenesis-related proteins β-1,3-glucanase, chitinase, phenylalanine ammonia-lyase, and peroxidase in plants after MeJA treatment and following inoculation of the pathogen were also investigated as molecular markers of induced resistance. Foliar application of MeJA induced partial resistance against S. sclerotiorum in plants as well as a consistent increase in pathogenesis-related protein activities. Our findings provide new insights into the physiological and molecular mechanisms of resistance induced by MeJA in the P. vulgaris-S. sclerotiorum pathosystem. 
Copyright © 2015 Elsevier GmbH. All rights reserved.
Yanagi, Tomohiro; Shirasawa, Kenta; Terachi, Mayuko; Isobe, Sachiko
2017-01-01
Cultivated strawberry (Fragaria × ananassa Duch.) has homoeologous chromosomes because of allo-octoploidy. For example, two homoeologous chromosomes that belong to different sub-genomes of allopolyploids have similar base sequences. Thus, when conducting de novo assembly of DNA sequences, it is difficult to determine whether these sequences are derived from the same chromosome. To avoid the difficulties associated with homoeologous chromosomes and demonstrate the possibility of sequencing allopolyploids using single chromosomes, we conducted sequence analysis using microdissected single somatic chromosomes of cultivated strawberry. Three hundred and ten somatic chromosomes of the Japanese octoploid strawberry 'Reiko' were individually selected under a light microscope using a microdissection system. DNA from 288 of the dissected chromosomes was successfully amplified using a DNA amplification kit. Using next-generation sequencing, we decoded the base sequences of the amplified DNA segments, and on the basis of mapping, we identified DNA sequences from 144 samples that were best matched to the reference genomes of the octoploid strawberry, F. × ananassa, and the diploid strawberry, F. vesca. The 144 samples were classified into seven pseudo-molecules of F. vesca. The coverage rates of the DNA sequences from the single chromosome onto all pseudo-molecular sequences varied from 3 to 29.9%. We demonstrated an efficient method for sequence analysis of allopolyploid plants using microdissected single chromosomes. On the basis of our results, we believe that whole-genome analysis of allopolyploid plants can be enhanced using methodology that employs microdissected single chromosomes.
Signal-3L: A 3-layer approach for predicting signal peptides.
Shen, Hong-Bin; Chou, Kuo-Chen
2007-11-16
Functioning as an "address tag" that directs nascent proteins to their proper cellular and extracellular locations, signal peptides have become a crucial tool in finding new drugs or reprogramming cells for gene therapy. To use such a tool effectively and in a timely manner, however, the first requirement is an automated method for rapidly and accurately identifying the signal peptide of a given nascent protein. With the avalanche of new protein sequences generated in the post-genomic era, the challenge has become even more urgent and critical. In this paper, we have developed a novel method for predicting signal peptide sequences and their cleavage sites in human, plant, animal, eukaryotic, Gram-positive, and Gram-negative protein sequences, respectively. The new predictor, called Signal-3L, consists of three prediction engines working, respectively, for the following three progressively deepening layers: (1) identifying a query protein as secretory or non-secretory by an ensemble classifier formed by fusing many individual OET-KNN (optimized evidence-theoretic K nearest neighbor) classifiers operated in various dimensions of PseAA (pseudo amino acid) composition spaces; (2) selecting a set of candidates for the possible signal peptide cleavage sites of a query secretory protein by a subsite-coupled discrimination algorithm; (3) determining the final cleavage site by fusing the global sequence alignment outcome for each of the aforementioned candidates through a voting system. Signal-3L features high prediction success rates with short computational time, and hence is particularly useful for the analysis of large-scale datasets.
Signal-3L is freely available as a web-server at http://chou.med.harvard.edu/bioinf/Signal-3L/ or http://202.120.37.186/bioinf/Signal-3L, where, to further support the demand of the related areas, the signal peptides identified by Signal-3L for all the protein entries in Swiss-Prot databank that do not have signal peptide annotations or are annotated with uncertain terms but are classified by Signal-3L as secretory proteins are provided in a downloadable file. The large-scale file is prepared with Microsoft Excel and named "Tab-Signal-3L.xls", and will be updated once a year to include new protein entries and reflect the continuous development of Signal-3L.
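The voting idea in layer (3) can be illustrated with a minimal sketch. This is not the Signal-3L implementation: the function name `fuse_by_vote`, the tie-breaking rule, and the example positions are all assumptions made for illustration. Each hypothetical voter proposes one candidate cleavage position, and the position with the most votes wins.

```python
from collections import Counter


def fuse_by_vote(votes):
    """Return the candidate cleavage position that received the most votes.

    `votes` holds one proposed position per (hypothetical) voter.
    Ties are broken in favour of the smallest position; this rule is an
    arbitrary assumption, not something stated in the abstract.
    """
    if not votes:
        raise ValueError("no votes to fuse")
    counts = Counter(votes)
    best = max(counts.values())
    return min(pos for pos, n in counts.items() if n == best)


# Illustrative, made-up votes from six voters.
winner = fuse_by_vote([22, 24, 22, 19, 22, 24])
```

A plain majority vote is only one possible fusion rule; a weighted vote would need per-voter weights, which the abstract does not give.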
Gustaf: Detecting and correctly classifying SVs in the NGS twilight zone.
Trappe, Kathrin; Emde, Anne-Katrin; Ehrlich, Hans-Christian; Reinert, Knut
2014-12-15
The landscape of structural variation (SV) including complex duplication and translocation patterns is far from resolved. SV detection tools usually exhibit low agreement, are often geared toward certain types or size ranges of variation and struggle to correctly classify the type and exact size of SVs. We present Gustaf (Generic mUlti-SpliT Alignment Finder), a sound generic multi-split SV detection tool that detects and classifies deletions, inversions, dispersed duplications and translocations of ≥ 30 bp. Our approach is based on a generic multi-split alignment strategy that can identify SV breakpoints with base pair resolution. We show that Gustaf correctly identifies SVs, especially in the range from 30 to 100 bp, which we call the next-generation sequencing (NGS) twilight zone of SVs, as well as larger SVs >500 bp. Gustaf performs better than similar tools in our benchmark and is furthermore able to correctly identify size and location of dispersed duplications and translocations, which otherwise might be wrongly classified, for example, as large deletions.
The Cervical Microbiome over 7 Years and a Comparison of Methodologies for Its Characterization
Smith, Benjamin C.; McAndrew, Thomas; Chen, Zigui; Harari, Ariana; Barris, David M.; Viswanathan, Shankar; Rodriguez, Ana Cecilia; Castle, Phillip; Herrero, Rolando; Schiffman, Mark; Burk, Robert D.
2012-01-01
Background: The rapidly expanding field of microbiome studies offers investigators a large choice of methods for each step in the process of determining the microorganisms in a sample. The human cervicovaginal microbiome affects female reproductive health, susceptibility to and natural history of many sexually transmitted infections, including human papillomavirus (HPV). At present, long-term behavior of the cervical microbiome in early sexual life is poorly understood. Methods: The V6 and V6–V9 regions of the 16S ribosomal RNA gene were amplified from DNA isolated from exfoliated cervical cells. Specimens from 10 women participating in the Natural History Study of HPV in Guanacaste, Costa Rica were sampled successively over a period of 5–7 years. We sequenced amplicons using 3 different platforms (Sanger, Roche 454, and Illumina HiSeq 2000) and analyzed sequences using pipelines based on 3 different classification algorithms (usearch, RDP Classifier, and pplacer). Results: Usearch and pplacer provided consistent microbiome classifications for all sequencing methods, whereas RDP Classifier deviated significantly when characterizing Illumina reads. Comparing across sequencing platforms indicated 7%–41% of the reads were reclassified, while comparing across software pipelines reclassified up to 32% of the reads. Variability in classification was shown not to be due to a difference in read lengths. Six cervical microbiome community types were observed and are characterized by a predominance of either G. vaginalis or Lactobacillus spp. Over the 5–7 year period, subjects displayed fluctuation between community types. A PERMANOVA analysis on pairwise Kantorovich-Rubinstein distances between the microbiota of all samples yielded an F-test ratio of 2.86 (p < 0.01), indicating a significant difference comparing within and between subjects’ microbiota. 
Conclusions: Amplification and sequencing methods affected the characterization of the microbiome more than classification algorithms. Pplacer and usearch performed consistently with all sequencing methods. The analyses identified 6 community types consistent with those previously reported. The long-term behavior of the cervical microbiome indicated that fluctuations were subject dependent. PMID:22792313
Identification and genomic characterization of a novel porcine parvovirus (PPV6) in China.
Ni, Jianqiang; Qiao, Caixia; Han, Xue; Han, Tao; Kang, Wenhua; Zi, Zhanchao; Cao, Zhen; Zhai, Xinyan; Cai, Xuepeng
2014-12-02
Parvoviruses are classified into two subfamilies based on their host range: the Parvovirinae, which infect vertebrates, and the Densovirinae, which mainly infect insects and other arthropods. In recent years, a number of novel parvoviruses belonging to the subfamily Parvovirinae have been identified from various animal species and humans, including human parvovirus 4 (PARV4), porcine hokovirus, ovine partetravirus, porcine parvovirus 4 (PPV4), and porcine parvovirus 5 (PPV5). Using sequence-independent single primer amplification (SISPA), a novel parvovirus within the subfamily Parvovirinae that was distinct from any known parvoviruses was identified and five full-length genome sequences were determined and analyzed. A novel porcine parvovirus, provisionally named PPV6, was initially identified from aborted pig fetuses in China. Retrospective studies revealed that the prevalence of PPV6 in aborted pig fetuses and piglets (50% and 75%, respectively) was apparently higher than that in finishing pigs and sows (15.6% and 3.8%, respectively). Furthermore, the prevalence of PPV6 in finishing pigs was similar in affected and unaffected farms (i.e., 16.7% vs. 13.6%-21.7%). This finding indicates that animal age, perhaps due to increased innate immune resistance, strongly influences the level of PPV6 viremia. Complete genome sequencing and multiple alignments showed that the nearly full-length genome sequences were approximately 6,100 nucleotides in length and shared 20.5%-42.6% DNA sequence identity with other members of the Parvovirinae subfamily. Phylogenetic analysis showed that PPV6 was significantly distinct from other known parvoviruses and was most closely related to PPV4. Our findings and review of published parvovirus sequences suggested that a novel porcine parvovirus is currently circulating in China and might be classified into the novel genus Copiparvovirus within the subfamily Parvovirinae. 
However, the clinical manifestations of PPV6 remain unknown, because the prevalence of PPV6 was similar between healthy and sick pigs in a retrospective epidemiological study. The identification of PPV6 within the subfamily Parvovirinae provides further insight into the viral and genetic diversity of parvoviruses.
Korak, Julie A; Wert, Eric C; Rosario-Ortiz, Fernando L
2015-01-01
Intracellular organic matter (IOM) from cyanobacteria may be released into natural waters following cell death in aquatic ecosystems and during oxidation processes in drinking water treatment plants. Fluorescence spectroscopy was evaluated to identify the presence of IOM from three cyanobacteria species during simulated release into natural water and following oxidation processes (i.e. ozone, free chlorine, chloramine, chlorine dioxide). Peak picking and the fluorescence index (FI) were explored to determine which IOM components (e.g., pigments) provide unique and persistent fluorescence signatures with minimal interferences from the background dissolved organic matter (DOM) found in Colorado River water (CRW). When IOM was added to ultrapure water, the fluorescence signature of the three cyanobacteria species showed similarities to each other. Each IOM exhibited a strong protein-like fluorescence and fluorescence at Ex 370 nm and Em 460 nm (FDOM), where commercial fluorescence sensors monitor. All species also had strong phycobiliprotein fluorescence (i.e. phycocyanin or phycoerythrin) in the higher excitation range (500-650 nm). All three IOM isolates had FI values greater than 2. When IOM was added to CRW, phycobiliprotein fluorescence was quenched through interactions between IOM and CRW-DOM. Mixing IOM and CRW demonstrated that protein-like and FDOM intensity responses were not a simple superposition of the starting material intensities, indicating that interactions between IOM and CRW-DOM fluorescing moieties were important. Fluorescence intensity in all regions decreased with exposure to ozone, free chlorine, and chlorine dioxide, but the FI still indicated compositional differences compared to CRW-DOM. The phycobiliproteins in IOM are not promising as a surrogate for IOM release, because their fluorescence intensity is quenched by interactions with DOM and decreased during oxidation processes. 
Increases in both FDOM intensity and FI are viable qualitative indicators of IOM release in natural waters and following oxidation and may provide a more robust real-time indication of the presence of IOM than conventional dissolved organic carbon or UV absorbance measurements.
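The fluorescence index (FI) used above is commonly computed as a ratio of emission intensities at two wavelengths on an emission scan taken at 370 nm excitation. The sketch below assumes the 470 nm / 520 nm definition; the abstract does not state which wavelengths the authors used, and the function names and intensity values are made up for illustration.

```python
from bisect import bisect_left


def interpolate(xs, ys, x):
    """Linearly interpolate y at x; xs must be sorted in ascending order."""
    if not xs[0] <= x <= xs[-1]:
        raise ValueError("wavelength outside the scanned range")
    i = bisect_left(xs, x)
    if xs[i] == x:
        return ys[i]
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def fluorescence_index(emission_nm, intensity, em_num=470.0, em_den=520.0):
    """Ratio of emission intensities at em_num and em_den.

    The scan is assumed to have been recorded at Ex 370 nm; the default
    wavelengths are an assumed convention, not taken from the abstract.
    """
    numerator = interpolate(emission_nm, intensity, em_num)
    denominator = interpolate(emission_nm, intensity, em_den)
    return numerator / denominator


# Hypothetical emission scan (arbitrary intensity units).
emission = [460.0, 480.0, 500.0, 520.0, 540.0]
signal = [100.0, 90.0, 70.0, 45.0, 30.0]
fi = fluorescence_index(emission, signal)
```

With these invented values the interpolated intensity at 470 nm is 95 and the intensity at 520 nm is 45, so the ratio is above 2, the range the abstract reports for the IOM isolates.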
[Evolution of Dissolved Organic Matter Properties in a Constructed Wetland of Xiao River, Hebei].
Ma, Li-na; Zhang, Hui; Tan, Wen-bing; Yu, Min-da; Huang, Zhi-gang; Gao, Ru-tai; Xi, Bei-dou; He, Xiao-song
2016-01-01
The evolution of water DOC and COD, and the source, chemical structure, humification degree and redox properties of dissolved organic matter (DOM) in a constructed wetland of Xiao River, Hebei, were investigated by 3D excitation-emission matrix fluorescence spectroscopy coupled with ultraviolet spectroscopy and chemical reduction, in order to explore the geochemical processes and environmental effects of DOM. Although DOC contributes at least 60% to COD, its decrease in the constructed wetland is mainly caused by the more extensive degradation of elements N, H, S, and P than C in DOM, and 65% is contributed by the former. Based on the proxies f470/520 and BIX, DOM consists mainly of microbial products, indicating that DOM in water is apparently affected by microbial degradation. The result based on the PARAFAC model shows that DOM in the constructed wetland contains protein-like and humus-like components, and fulvic- and humic-like components are relatively easier to degrade than protein-like components. Fulvic- and humic-like components undergo similar decomposition in the constructed wetland. A common source of chromophoric dissolved organic matter (CDOM) and fluorescent dissolved organic matter (FDOM) exists; both CDOM and FDOM are mainly composed of a humus-like material and do not exhibit selective degradation in the constructed wetland. The proxies E2/E3, A240-400, r(A, C) and HIX do not change after the water flows into the constructed wetland, implying that the humification degree of DOM in water is hardly affected by the constructed wetland. However, the constructed wetland environment is not only beneficial in forming the reduced state of DOM, but also facilitates the reduction of ferric iron. It can also improve the capability of DOM to function as an electron shuttle. This result may be related to the condition that the aromatic carbon of DOM can be stabilized well in the constructed wetland.
NASA Astrophysics Data System (ADS)
Zhao, Ying; Song, Kaishan; Shang, Yingxin; Shao, Tiantian; Wen, Zhidan; Lv, Lili
2017-08-01
The spatial characteristics of fluorescent dissolved organic matter (FDOM) components in river waters in China were first examined by excitation-emission matrix spectra and fluorescence regional integration (FRI) with the data collected during September to November between 2013 and 2015. One tyrosine-like (R1), one tryptophan-like (R2), one fulvic-like (R3), one microbial protein-like (R4), and one humic-like (R5) component were identified by the FRI method. Principal component analysis (PCA) was conducted to assess variations in the five FDOM components (FRi (i = 1, 2, 3, 4, and 5)) and the humification index for all 194 river water samples. The average fluorescence intensities of the five fluorescent components and the total fluorescence intensities FSUM differed spatially among the seven major river basins (Songhua, Liao, Hai, Yellow and Huai, Yangtze, Pearl, and Inflow Rivers) in China. When all the river water samples were pooled together, the fulvic-like FR3 and the humic-like FR5 showed a strong positive linear relationship (R² = 0.90, n = 194), indicating that the two allochthonous FDOM components R3 and R5 may originate from similar sources. There is a moderately strong positive correlation between the tryptophan-like FR2 and the microbial protein-like FR4 (R² = 0.71, n = 194), suggesting that parts of the two autochthonous FDOM components R2 and R4 are likely from some common sources. However, the total allochthonous substances FR(3+5) and the total autochthonous substances FR(1+2+4) exhibited a weak correlation (R² = 0.40, n = 194). Significant positive linear relationships between FR3 (R² = 0.69, n = 194), FR5 (R² = 0.79, n = 194), and chromophoric DOM (CDOM) absorption coefficient a(254) were observed, which demonstrated that the CDOM absorption was dominated by the allochthonous FDOM components R3 and R5.
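The R² values quoted above describe how much of the variance in one component a straight-line fit to another explains. As a minimal sketch, assuming a simple least-squares fit of one variable on the other (the abstract does not describe the fitting procedure), R² equals the squared Pearson correlation. The function name and the toy data below are illustrative only and are not taken from the study.

```python
def r_squared(x, y):
    """Coefficient of determination for a simple linear fit of y on x.

    For a one-predictor least-squares line this equals the squared
    Pearson correlation coefficient.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long samples with at least 2 points")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant sample has no defined correlation")
    return sxy * sxy / (sxx * syy)


# Toy data, not from the study: a weakly related pair of variables.
weak = r_squared([1.0, 2.0, 3.0], [1.0, 3.0, 2.0])
# Toy data: an exactly linear pair of variables.
perfect = r_squared([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
```

Applied to paired intensities such as FR3 and FR5 for each sample, a function like this would yield the kind of value reported in the abstract.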
Guo, Wei-Dong; Huang, Jian-Ping; Hong, Hua-Sheng; Xu, Jing; Deng, Xun
2010-06-01
The distribution and estuarine behavior of fluorescent components of chromophoric dissolved organic matter (CDOM) from Jiulong Estuary were determined by fluorescence excitation emission matrix spectroscopy (EEMs) combined with parallel factor analysis (PARAFAC). The feasibility of these components as tracers for organic pollution in estuarine environments was also evaluated. Four separate fluorescent components were identified by PARAFAC, including three humic-like components (C1: 240, 310/382 nm; C2: 230, 250, 340/422 nm; C4: 260, 390/482 nm) and one protein-like component (C3: 225, 275/342 nm). These results indicated that the UV humic-like peak A area designated by the traditional "peak-picking method" was not a single peak but actually a combination of several fluorescent components, and it also had inherent links to the so-called marine humic-like peak M or terrestrial humic-like peak C. Component C2, which includes peak M, decreased with increasing salinity in Jiulong Estuary, demonstrating that peak M cannot be regarded as a specific indicator of the "marine" humic-like component. The two humic-like components C1 and C2 showed additional behavior in the turbidity maximum region (salinity < 6) and then conservative mixing behavior for the rest of the estuarine region, while humic-like component C4 showed conservative mixing behavior for the whole estuarine region. However, the protein-like component C3 showed nonconservative mixing behavior, suggesting it had an autochthonous estuarine origin. EEMs-PARAFAC can provide a fluorescent fingerprint to differentiate the DOM features of the three tributaries of Jiulong River. The observed linear relationships between humic-like components and absorption coefficient a(280) with chemical oxygen demand (COD) and biological oxygen demand (BOD5) suggest that the optical properties of CDOM may provide a fast in-situ way to monitor the variation of the degree of organic pollution in estuarine environments.
[Removal of DON in micro-polluted raw water by coagulation and adsorption using activated carbon].
Liu, Bing; Yu, Guo-Zhong; Gu, Li; Zhao, Cheng-Mei; Li, Qing-Fei; Zhai, Hui-Min
2013-04-01
Dissolved organic nitrogen (DON), as a precursor of new types of nitrogenous disinfection by-products in drinking water, has gradually attracted the attention of scholars all over the world. In order to explore the mechanism of DON removal in micro-polluted raw water by coagulation and adsorption, water quality parameters, such as DON, DOC, NH4⁺-N, UV254, pH and dissolved oxygen, were determined in raw water and the molecular weight distribution of the DON and DOC was investigated. The variations in DON, DOC and UV254 in the coagulation and adsorption tests were investigated, and the changes of DON in raw water were characterized using three-dimensional fluorescence spectroscopy. The results showed that DON, DOC and UV254 were 1.28 mg·L⁻¹, 8.56 mg·L⁻¹ and 0.16 cm⁻¹, and DOC/DON and SUVA were 6.69 mg·mg⁻¹ and 1.87 m⁻¹·(mg·L⁻¹)⁻¹ in raw water, respectively. The molecular weight distribution of the DON in raw water showed a bimodal distribution. The small molecular weight (< 6 000) fractions accounted for a high proportion of 68% and the large (> 20 000) fractions accounted for about 22%. The removal of DON, DOC and UV254 was about 20%, 26% and 70%, respectively, in the coagulation test, at a coagulant dosage of 10 mg·L⁻¹. The removal of DON, DOC and UV254 was about 60%, 35% and 100%, respectively, in the adsorption test, at an activated carbon dosage of 1.0 g. In the combination of coagulation and adsorption, the removal of DON and DOC reached approximately 82% and 64%, respectively. 3DEEM revealed that the variation of DON in the coagulation and adsorption tests depended closely on tryptophan protein-like substances, aromatic protein-like substances and fulvic acid-like substances.
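The two derived quantities reported above follow directly from the measured values. The short sketch below reproduces the arithmetic, assuming the usual definition of SUVA as UV254 (converted from cm⁻¹ to m⁻¹) divided by DOC in mg·L⁻¹. The function names are illustrative.

```python
def doc_to_don_ratio(doc_mg_per_l, don_mg_per_l):
    """Mass ratio of dissolved organic carbon to dissolved organic nitrogen."""
    return doc_mg_per_l / don_mg_per_l


def suva(uv254_per_cm, doc_mg_per_l):
    """Specific UV absorbance in L·mg⁻¹·m⁻¹.

    UV254 is converted from cm⁻¹ to m⁻¹ (factor 100) before dividing by DOC.
    """
    return 100.0 * uv254_per_cm / doc_mg_per_l


# Raw-water values quoted in the abstract.
DON = 1.28    # mg/L
DOC = 8.56    # mg/L
UV254 = 0.16  # 1/cm

ratio = doc_to_don_ratio(DOC, DON)  # about 6.69, matching the reported value
suva_raw = suva(UV254, DOC)         # about 1.87, matching the reported value
```

Both results agree with the figures in the abstract to the two decimal places reported there.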
Liu, Yu-Tao; Shi, Yuan-Kai; Hao, Xue-Zhi; Wang, Lin; Li, Jun-Ling; Han, Xiao-Hong; Li, Dan; Zhou, Yu-Jie; Tang, Le
2014-01-01
Background: The echinoderm microtubule-associated protein-like-4-anaplastic lymphoma kinase (EML4-ALK) fusion gene defines a novel molecular subset of non-small-cell lung cancer (NSCLC). However, the clinicopathological features of patients with the EML4-ALK fusion gene have not been defined completely. Methods: Clinicopathological data of 200 Chinese patients with advanced NSCLC were analyzed retrospectively to explore their possible correlations with EML4-ALK fusions. Results: The EML4-ALK fusion gene was detected in 56 (28.0%) of the 200 NSCLC patients, and undetected in 22 (11.0%) patients because of an insufficient amount of pathological tissue. The median age of the patients with positive and negative EML4-ALK was 48 and 55 years, respectively. Patients with the EML4-ALK fusion gene were significantly younger (P < 0.001). The detection rate of the EML4-ALK fusion gene in patients who received primary tumor or metastatic lymph node resection was significantly higher than in patients who received fine-needle biopsy (P = 0.003). The detection rate of the EML4-ALK fusion gene in patients with a time lag from obtainment of the pathological tissue to EML4-ALK fusion gene detection ≤48 months was significantly higher than in patients >48 months (P = 0.020). The occurrence of the EML4-ALK fusion gene in patients with wild-type epidermal growth factor receptor (EGFR) was significantly higher than in patients with mutant-type EGFR (42.5% [37/87] vs. 6.3% [1/16], P = 0.005). Conclusions: Younger age and wild-type EGFR were identified as clinicopathological characteristics of patients with advanced NSCLC who harbored the EML4-ALK fusion gene. The optimal time lag from the obtainment of the pathological tissue to the time of EML4-ALK fusion gene detection is ≤48 months. PMID:26767009
NASA Astrophysics Data System (ADS)
Kellerman, A.; Hawkings, J.; Marshall, M.; Spencer, R.; Wadham, J.
2017-12-01
The Greenland Ice Sheet (GrIS) is losing mass at a remarkable rate. This loss of mass coincides with the export of dissolved organic matter (DOM) and other nutrients from the ice sheet and exerts a primary control on secondary production in downstream ecosystems. However, little is known about the source and composition of DOM exported from these dilute, yet immense, systems. Samples were collected from May 11, 2015 to July 29, 2015 from the outflow of Leverett Glacier, a large, land-terminating glacier of the southwest GrIS. Dissolved organic carbon (DOC) concentrations were measured and the optical properties of DOM were characterized using absorbance and fluorescence spectroscopy. At the beginning of the season, when discharge is <5 m³ s⁻¹, red-shifted fluorescence suggests terrestrial inputs from either overridden soils or proglacial inputs dominate the DOM pool. With the onset of melt, after an initial pulse in both DOC quantity and red-shifted fluorescence intensity, the DOC concentration and fluorescence intensity are diluted, with little change in DOM composition. The terrestrial signal is lost with the first outburst event in late June, and a single protein-like fluorophore is exhibited for three weeks. On July 10th, a fourth outburst event introduces a second protein-like fluorophore, indicative of production on the ice sheet, and this signature is maintained until the end of July. These results suggest that subglacial drainage flowpaths and water source influence the exported DOC concentration and DOM composition over a summer melt season. As glacial outflow shifts from higher DOC concentrations early in the season to low DOC concentrations later in the summer, these results impact estimates of carbon export from glaciers. 
Furthermore, as composition is related to reactivity, the compositional changes observed may indicate shifts in the bioavailability of the DOM upon delivery to coastal systems, a result of changing DOM sources over the course of the season.
Zhang, Yunlin; Liu, Xiaohan; Osburn, Christopher L; Wang, Mingzhu; Qin, Boqiang; Zhou, Yongqiang
2013-01-01
The CDOM biogeochemical cycle is driven by several physical and biological processes, such as river input, biogeneration and photobleaching, that act as primary sinks and sources of CDOM. Watershed-derived allochthonous (WDA) and phytoplankton-derived autochthonous (PDA) CDOM were exposed to 9 days of natural solar radiation to assess the photobleaching response of different CDOM sources, using absorption and fluorescence (excitation-emission matrix) spectroscopy. Our results showed a marked decrease in total dissolved nitrogen (TDN) concentration under natural sunlight exposure for both WDA and PDA CDOM, indicating photoproduction of ammonium from TDN. In contrast, photobleaching caused a marked increase in total dissolved phosphorus (TDP) concentration for both WDA and PDA CDOM. Thus TDN:TDP ratios decreased significantly for both WDA and PDA CDOM, which partially explained the seasonal dynamics of the TDN:TDP ratio in Lake Taihu. The photobleaching rate of CDOM absorption, a(254), was 0.032 m/MJ for WDA CDOM and 0.051 m/MJ for PDA CDOM from days 0-9, indicating that phototransformations were initially more rapid for the newly produced CDOM from phytoplankton than for the river CDOM. Extrapolation of these values to the field indicated that 3.9%-5.1% of CDOM at the water surface was photobleached and mineralized every day in summer in Lake Taihu. Photobleaching caused an increase in spectral slope, spectral slope ratio and molecular size, indicating a decrease in CDOM mean molecular weight, which was favorable to further microbial degradation and mineralization. Three fluorescent components were validated in parallel factor analysis models calculated separately for WDA and PDA CDOM. 
Our study suggests that the humic-like fluorescent materials could be rapidly and easily photobleached for both WDA and PDA CDOM, whereas the protein-like fluorescent materials were not photobleached and even increased through the transformation of humic-like fluorescent substances into protein-like fluorescent substances. Photobleaching was an important driver of CDOM and nutrient biogeochemistry in lake water.