Smith, R F; Wiese, B A; Wojzynski, M K; Davison, D B; Worley, K C
1996-05-01
The BCM Search Launcher is an integrated set of World Wide Web (WWW) pages that organize molecular biology-related search and analysis services available on the WWW by function, and provide a single point of entry for related searches. The Protein Sequence Search Page, for example, provides a single sequence entry form for submitting sequences to WWW servers that offer remote access to a variety of different protein sequence search tools, including BLAST, FASTA, Smith-Waterman, BEAUTY, PROSITE, and BLOCKS searches. Other Launch pages provide access to (1) nucleic acid sequence searches, (2) multiple and pair-wise sequence alignments, (3) gene feature searches, (4) protein secondary structure prediction, and (5) miscellaneous sequence utilities (e.g., six-frame translation). The BCM Search Launcher also provides a mechanism to extend the utility of other WWW services by adding supplementary hypertext links to results returned by remote servers. For example, links to the NCBI's Entrez data base and to the Sequence Retrieval System (SRS) are added to search results returned by the NCBI's WWW BLAST server. These links provide easy access to auxiliary information, such as Medline abstracts, that can be extremely helpful when analyzing BLAST data base hits. For new or infrequent users of sequence data base search tools, we have preset the default search parameters to provide the most informative first-pass sequence analysis possible. We have also developed a batch client interface for Unix and Macintosh computers that allows multiple input sequences to be searched automatically as a background task, with the results returned as individual HTML documents directly to the user's system. The BCM Search Launcher and batch client are available on the WWW at URL http:@gc.bcm.tmc.edu:8088/search-launcher.html.
The Human Transcript Database: A Catalogue of Full Length cDNA Inserts
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bouckk John; Michael McLeod; Kim Worley
1999-09-10
The BCM Search Launcher provided improved access to web-based sequence analysis services during the granting period and beyond. The Search Launcher web site grouped analysis procedures by function and provided default parameters that provided reasonable search results for most applications. For instance, most queries were automatically masked for repeat sequences prior to sequence database searches to avoid spurious matches. In addition to the web-based access and arrangements that were made using the functions easier, the BCM Search Launcher provided unique value-added applications like the BEAUTY sequence database search tool that combined information about protein domains and sequence database search resultsmore » to give an enhanced, more complete picture of the reliability and relative value of the information reported. This enhanced search tool made evaluating search results more straight-forward and consistent. Some of the favorite features of the web site are the sequence utilities and the batch client functionality that allows processing of multiple samples from the command line interface. One measure of the success of the BCM Search Launcher is the number of sites that have adopted the models first developed on the site. The graphic display on the BLAST search from the NCBI web site is one such outgrowth, as is the display of protein domain search results within BLAST search results, and the design of the Biology Workbench application. The logs of usage and comments from users confirm the great utility of this resource.« less
cuBLASTP: Fine-Grained Parallelization of Protein Sequence Search on CPU+GPU.
Zhang, Jing; Wang, Hao; Feng, Wu-Chun
2017-01-01
BLAST, short for Basic Local Alignment Search Tool, is a ubiquitous tool used in the life sciences for pairwise sequence search. However, with the advent of next-generation sequencing (NGS), whether at the outset or downstream from NGS, the exponential growth of sequence databases is outstripping our ability to analyze the data. While recent studies have utilized the graphics processing unit (GPU) to speedup the BLAST algorithm for searching protein sequences (i.e., BLASTP), these studies use coarse-grained parallelism, where one sequence alignment is mapped to only one thread. Such an approach does not efficiently utilize the capabilities of a GPU, particularly due to the irregularity of BLASTP in both execution paths and memory-access patterns. To address the above shortcomings, we present a fine-grained approach to parallelize BLASTP, where each individual phase of sequence search is mapped to many threads on a GPU. This approach, which we refer to as cuBLASTP, reorders data-access patterns and reduces divergent branches of the most time-consuming phases (i.e., hit detection and ungapped extension). In addition, cuBLASTP optimizes the remaining phases (i.e., gapped extension and alignment with trace back) on a multicore CPU and overlaps their execution with the phases running on the GPU.
Embedding strategies for effective use of information from multiple sequence alignments.
Henikoff, S.; Henikoff, J. G.
1997-01-01
We describe a new strategy for utilizing multiple sequence alignment information to detect distant relationships in searches of sequence databases. A single sequence representing a protein family is enriched by replacing conserved regions with position-specific scoring matrices (PSSMs) or consensus residues derived from multiple alignments of family members. In comprehensive tests of these and other family representations, PSSM-embedded queries produced the best results overall when used with a special version of the Smith-Waterman searching algorithm. Moreover, embedding consensus residues instead of PSSMs improved performance with readily available single sequence query searching programs, such as BLAST and FASTA. Embedding PSSMs or consensus residues into a representative sequence improves searching performance by extracting multiple alignment information from motif regions while retaining single sequence information where alignment is uncertain. PMID:9070452
Kaindl, H; Kainz, G; Radda, K
2001-01-01
Most of the work on search in artificial intelligence (AI) deals with one search direction only-mostly forward search-although it is known that a structural asymmetry of the search graph causes differences in the efficiency of searching in the forward or the backward direction, respectively. In the case of symmetrical graph structure, however, current theory would not predict such differences in efficiency. In several classes of job sequencing problems, we observed a phenomenon of asymmetry in search that relates to the distribution of the are costs in the search graph. This phenomenon can be utilized for improving the search efficiency by a new algorithm that automatically selects the search direction. We demonstrate fur a class of job sequencing problems that, through the utilization of this phenomenon, much more difficult problems can be solved-according to our best knowledge-than by the best published approach, and on the same problems, the running time is much reduced. As a consequence, we propose to check given problems for asymmetrical distribution of are costs that may cause asymmetry in search.
SS-Wrapper: a package of wrapper applications for similarity searches on Linux clusters.
Wang, Chunlin; Lefkowitz, Elliot J
2004-10-28
Large-scale sequence comparison is a powerful tool for biological inference in modern molecular biology. Comparing new sequences to those in annotated databases is a useful source of functional and structural information about these sequences. Using software such as the basic local alignment search tool (BLAST) or HMMPFAM to identify statistically significant matches between newly sequenced segments of genetic material and those in databases is an important task for most molecular biologists. Searching algorithms are intrinsically slow and data-intensive, especially in light of the rapid growth of biological sequence databases due to the emergence of high throughput DNA sequencing techniques. Thus, traditional bioinformatics tools are impractical on PCs and even on dedicated UNIX servers. To take advantage of larger databases and more reliable methods, high performance computation becomes necessary. We describe the implementation of SS-Wrapper (Similarity Search Wrapper), a package of wrapper applications that can parallelize similarity search applications on a Linux cluster. Our wrapper utilizes a query segmentation-search (QS-search) approach to parallelize sequence database search applications. It takes into consideration load balancing between each node on the cluster to maximize resource usage. QS-search is designed to wrap many different search tools, such as BLAST and HMMPFAM using the same interface. This implementation does not alter the original program, so newly obtained programs and program updates should be accommodated easily. Benchmark experiments using QS-search to optimize BLAST and HMMPFAM showed that QS-search accelerated the performance of these programs almost linearly in proportion to the number of CPUs used. We have also implemented a wrapper that utilizes a database segmentation approach (DS-BLAST) that provides a complementary solution for BLAST searches when the database is too large to fit into the memory of a single node. Used together, QS-search and DS-BLAST provide a flexible solution to adapt sequential similarity searching applications in high performance computing environments. Their ease of use and their ability to wrap a variety of database search programs provide an analytical architecture to assist both the seasoned bioinformaticist and the wet-bench biologist.
SS-Wrapper: a package of wrapper applications for similarity searches on Linux clusters
Wang, Chunlin; Lefkowitz, Elliot J
2004-01-01
Background Large-scale sequence comparison is a powerful tool for biological inference in modern molecular biology. Comparing new sequences to those in annotated databases is a useful source of functional and structural information about these sequences. Using software such as the basic local alignment search tool (BLAST) or HMMPFAM to identify statistically significant matches between newly sequenced segments of genetic material and those in databases is an important task for most molecular biologists. Searching algorithms are intrinsically slow and data-intensive, especially in light of the rapid growth of biological sequence databases due to the emergence of high throughput DNA sequencing techniques. Thus, traditional bioinformatics tools are impractical on PCs and even on dedicated UNIX servers. To take advantage of larger databases and more reliable methods, high performance computation becomes necessary. Results We describe the implementation of SS-Wrapper (Similarity Search Wrapper), a package of wrapper applications that can parallelize similarity search applications on a Linux cluster. Our wrapper utilizes a query segmentation-search (QS-search) approach to parallelize sequence database search applications. It takes into consideration load balancing between each node on the cluster to maximize resource usage. QS-search is designed to wrap many different search tools, such as BLAST and HMMPFAM using the same interface. This implementation does not alter the original program, so newly obtained programs and program updates should be accommodated easily. Benchmark experiments using QS-search to optimize BLAST and HMMPFAM showed that QS-search accelerated the performance of these programs almost linearly in proportion to the number of CPUs used. We have also implemented a wrapper that utilizes a database segmentation approach (DS-BLAST) that provides a complementary solution for BLAST searches when the database is too large to fit into the memory of a single node. Conclusions Used together, QS-search and DS-BLAST provide a flexible solution to adapt sequential similarity searching applications in high performance computing environments. Their ease of use and their ability to wrap a variety of database search programs provide an analytical architecture to assist both the seasoned bioinformaticist and the wet-bench biologist. PMID:15511296
Methods and statistics for combining motif match scores.
Bailey, T L; Gribskov, M
1998-01-01
Position-specific scoring matrices are useful for representing and searching for protein sequence motifs. A sequence family can often be described by a group of one or more motifs, and an effective search must combine the scores for matching a sequence to each of the motifs in the group. We describe three methods for combining match scores and estimating the statistical significance of the combined scores and evaluate the search quality (classification accuracy) and the accuracy of the estimate of statistical significance of each. The three methods are: 1) sum of scores, 2) sum of reduced variates, 3) product of score p-values. We show that method 3) is superior to the other two methods in both regards, and that combining motif scores indeed gives better search accuracy. The MAST sequence homology search algorithm utilizing the product of p-values scoring method is available for interactive use and downloading at URL http:/(/)www.sdsc.edu/MEME.
Googling DNA sequences on the World Wide Web.
Hajibabaei, Mehrdad; Singer, Gregory A C
2009-11-10
New web-based technologies provide an excellent opportunity for sharing and accessing information and using web as a platform for interaction and collaboration. Although several specialized tools are available for analyzing DNA sequence information, conventional web-based tools have not been utilized for bioinformatics applications. We have developed a novel algorithm and implemented it for searching species-specific genomic sequences, DNA barcodes, by using popular web-based methods such as Google. We developed an alignment independent character based algorithm based on dividing a sequence library (DNA barcodes) and query sequence to words. The actual search is conducted by conventional search tools such as freely available Google Desktop Search. We implemented our algorithm in two exemplar packages. We developed pre and post-processing software to provide customized input and output services, respectively. Our analysis of all publicly available DNA barcode sequences shows a high accuracy as well as rapid results. Our method makes use of conventional web-based technologies for specialized genetic data. It provides a robust and efficient solution for sequence search on the web. The integration of our search method for large-scale sequence libraries such as DNA barcodes provides an excellent web-based tool for accessing this information and linking it to other available categories of information on the web.
A pluggable framework for parallel pairwise sequence search.
Archuleta, Jeremy; Feng, Wu-chun; Tilevich, Eli
2007-01-01
The current and near future of the computing industry is one of multi-core and multi-processor technology. Most existing sequence-search tools have been designed with a focus on single-core, single-processor systems. This discrepancy between software design and hardware architecture substantially hinders sequence-search performance by not allowing full utilization of the hardware. This paper presents a novel framework that will aid the conversion of serial sequence-search tools into a parallel version that can take full advantage of the available hardware. The framework, which is based on a software architecture called mixin layers with refined roles, enables modules to be plugged into the framework with minimal effort. The inherent modular design improves maintenance and extensibility, thus opening up a plethora of opportunities for advanced algorithmic features to be developed and incorporated while routine maintenance of the codebase persists.
Worley, K C; Wiese, B A; Smith, R F
1995-09-01
BEAUTY (BLAST enhanced alignment utility) is an enhanced version of the NCBI's BLAST data base search tool that facilitates identification of the functions of matched sequences. We have created new data bases of conserved regions and functional domains for protein sequences in NCBI's Entrez data base, and BEAUTY allows this information to be incorporated directly into BLAST search results. A Conserved Regions Data Base, containing the locations of conserved regions within Entrez protein sequences, was constructed by (1) clustering the entire data base into families, (2) aligning each family using our PIMA multiple sequence alignment program, and (3) scanning the multiple alignments to locate the conserved regions within each aligned sequence. A separate Annotated Domains Data Base was constructed by extracting the locations of all annotated domains and sites from sequences represented in the Entrez, PROSITE, BLOCKS, and PRINTS data bases. BEAUTY performs a BLAST search of those Entrez sequences with conserved regions and/or annotated domains. BEAUTY then uses the information from the Conserved Regions and Annotated Domains data bases to generate, for each matched sequence, a schematic display that allows one to directly compare the relative locations of (1) the conserved regions, (2) annotated domains and sites, and (3) the locally aligned regions matched in the BLAST search. In addition, BEAUTY search results include World-Wide Web hypertext links to a number of external data bases that provide a variety of additional types of information on the function of matched sequences. This convenient integration of protein families, conserved regions, annotated domains, alignment displays, and World-Wide Web resources greatly enhances the biological informativeness of sequence similarity searches. BEAUTY searches can be performed remotely on our system using the "BCM Search Launcher" World-Wide Web pages (URL is < http:/ /gc.bcm.tmc.edu:8088/ search-launcher/launcher.html > ).
BEAUTY-X: enhanced BLAST searches for DNA queries.
Worley, K C; Culpepper, P; Wiese, B A; Smith, R F
1998-01-01
BEAUTY (BLAST Enhanced Alignment Utility) is an enhanced version of the BLAST database search tool that facilitates identification of the functions of matched sequences. Three recent improvements to the BEAUTY program described here make the enhanced output (1) available for DNA queries, (2) available for searches of any protein database, and (3) more up-to-date, with periodic updates of the domain information. BEAUTY searches of the NCBI and EMBL non-redundant protein sequence databases are available from the BCM Search Launcher Web pages (http://gc.bcm.tmc. edu:8088/search-launcher/launcher.html). BEAUTY Post-Processing of submitted search results is available using the BCM Search Launcher Batch Client (version 2.6) (ftp://gc.bcm.tmc. edu/pub/software/search-launcher/). Example figures are available at http://dot.bcm.tmc. edu:9331/papers/beautypp.html (kworley,culpep)@bcm.tmc.edu
ASGARD: an open-access database of annotated transcriptomes for emerging model arthropod species.
Zeng, Victor; Extavour, Cassandra G
2012-01-01
The increased throughput and decreased cost of next-generation sequencing (NGS) have shifted the bottleneck genomic research from sequencing to annotation, analysis and accessibility. This is particularly challenging for research communities working on organisms that lack the basic infrastructure of a sequenced genome, or an efficient way to utilize whatever sequence data may be available. Here we present a new database, the Assembled Searchable Giant Arthropod Read Database (ASGARD). This database is a repository and search engine for transcriptomic data from arthropods that are of high interest to multiple research communities but currently lack sequenced genomes. We demonstrate the functionality and utility of ASGARD using de novo assembled transcriptomes from the milkweed bug Oncopeltus fasciatus, the cricket Gryllus bimaculatus and the amphipod crustacean Parhyale hawaiensis. We have annotated these transcriptomes to assign putative orthology, coding region determination, protein domain identification and Gene Ontology (GO) term annotation to all possible assembly products. ASGARD allows users to search all assemblies by orthology annotation, GO term annotation or Basic Local Alignment Search Tool. User-friendly features of ASGARD include search term auto-completion suggestions based on database content, the ability to download assembly product sequences in FASTA format, direct links to NCBI data for predicted orthologs and graphical representation of the location of protein domains and matches to similar sequences from the NCBI non-redundant database. ASGARD will be a useful repository for transcriptome data from future NGS studies on these and other emerging model arthropods, regardless of sequencing platform, assembly or annotation status. This database thus provides easy, one-stop access to multi-species annotated transcriptome information. We anticipate that this database will be useful for members of multiple research communities, including developmental biology, physiology, evolutionary biology, ecology, comparative genomics and phylogenomics. Database URL: asgard.rc.fas.harvard.edu.
NASA Astrophysics Data System (ADS)
Lee, Feifei; Kotani, Koji; Chen, Qiu; Ohmi, Tadahiro
2010-02-01
In this paper, a fast search algorithm for MPEG-4 video clips from video database is proposed. An adjacent pixel intensity difference quantization (APIDQ) histogram is utilized as the feature vector of VOP (video object plane), which had been reliably applied to human face recognition previously. Instead of fully decompressed video sequence, partially decoded data, namely DC sequence of the video object are extracted from the video sequence. Combined with active search, a temporal pruning algorithm, fast and robust video search can be realized. The proposed search algorithm has been evaluated by total 15 hours of video contained of TV programs such as drama, talk, news, etc. to search for given 200 MPEG-4 video clips which each length is 15 seconds. Experimental results show the proposed algorithm can detect the similar video clip in merely 80ms, and Equal Error Rate (ERR) of 2 % in drama and news categories are achieved, which are more accurately and robust than conventional fast video search algorithm.
TALE proteins search DNA using a rotationally decoupled mechanism.
Cuculis, Luke; Abil, Zhanar; Zhao, Huimin; Schroeder, Charles M
2016-10-01
Transcription activator-like effector (TALE) proteins are a class of programmable DNA-binding proteins used extensively for gene editing. Despite recent progress, however, little is known about their sequence search mechanism. Here, we use single-molecule experiments to study TALE search along DNA. Our results show that TALEs utilize a rotationally decoupled mechanism for nonspecific search, despite remaining associated with DNA templates during the search process. Our results suggest that the protein helical structure enables TALEs to adopt a loosely wrapped conformation around DNA templates during nonspecific search, facilitating rapid one-dimensional (1D) diffusion under a range of solution conditions. Furthermore, this model is consistent with a previously reported two-state mechanism for TALE search that allows these proteins to overcome the search speed-stability paradox. Taken together, our results suggest that TALE search is unique among the broad class of sequence-specific DNA-binding proteins and supports efficient 1D search along DNA.
muBLASTP: database-indexed protein sequence search on multicore CPUs.
Zhang, Jing; Misra, Sanchit; Wang, Hao; Feng, Wu-Chun
2016-11-04
The Basic Local Alignment Search Tool (BLAST) is a fundamental program in the life sciences that searches databases for sequences that are most similar to a query sequence. Currently, the BLAST algorithm utilizes a query-indexed approach. Although many approaches suggest that sequence search with a database index can achieve much higher throughput (e.g., BLAT, SSAHA, and CAFE), they cannot deliver the same level of sensitivity as the query-indexed BLAST, i.e., NCBI BLAST, or they can only support nucleotide sequence search, e.g., MegaBLAST. Due to different challenges and characteristics between query indexing and database indexing, the existing techniques for query-indexed search cannot be used into database indexed search. muBLASTP, a novel database-indexed BLAST for protein sequence search, delivers identical hits returned to NCBI BLAST. On Intel Haswell multicore CPUs, for a single query, the single-threaded muBLASTP achieves up to a 4.41-fold speedup for alignment stages, and up to a 1.75-fold end-to-end speedup over single-threaded NCBI BLAST. For a batch of queries, the multithreaded muBLASTP achieves up to a 5.7-fold speedups for alignment stages, and up to a 4.56-fold end-to-end speedup over multithreaded NCBI BLAST. With a newly designed index structure for protein database and associated optimizations in BLASTP algorithm, we re-factored BLASTP algorithm for modern multicore processors that achieves much higher throughput with acceptable memory footprint for the database index.
Discrete Cosine Transform Image Coding With Sliding Block Codes
NASA Astrophysics Data System (ADS)
Divakaran, Ajay; Pearlman, William A.
1989-11-01
A transform trellis coding scheme for images is presented. A two dimensional discrete cosine transform is applied to the image followed by a search on a trellis structured code. This code is a sliding block code that utilizes a constrained size reproduction alphabet. The image is divided into blocks by the transform coding. The non-stationarity of the image is counteracted by grouping these blocks in clusters through a clustering algorithm, and then encoding the clusters separately. Mandela ordered sequences are formed from each cluster i.e identically indexed coefficients from each block are grouped together to form one dimensional sequences. A separate search ensues on each of these Mandela ordered sequences. Padding sequences are used to improve the trellis search fidelity. The padding sequences absorb the error caused by the building up of the trellis to full size. The simulations were carried out on a 256x256 image ('LENA'). The results are comparable to any existing scheme. The visual quality of the image is enhanced considerably by the padding and clustering.
Tandem mass spectrometry spectral libraries and library searching.
Deutsch, Eric W
2011-01-01
Spectral library searching in the field of proteomics has been gaining visibility and use in the last few years, primarily due to the expansion of public proteomics data repositories and the large spectral libraries that can be generated from them. Spectral library searching has several advantages over conventional sequence searching: it is generally much faster, and has higher specificity and sensitivity. The speed increase is primarily, due to having a smaller, fully indexable search space of real spectra that are known to be observable. The increase in specificity and sensitivity is primarily due to the ability of a search engine to utilize the known intensities of the fragment ions, rather than just comparing with theoretical spectra as is done with sequence searching. The main disadvantage of spectral library searching is that one can only identify peptide ions that have been seen before and are stored in the spectral library. In this chapter, an overview of spectral library searching and the libraries currently available are presented.
ESTree db: a Tool for Peach Functional Genomics
Lazzari, Barbara; Caprera, Andrea; Vecchietti, Alberto; Stella, Alessandra; Milanesi, Luciano; Pozzi, Carlo
2005-01-01
Background The ESTree db represents a collection of Prunus persica expressed sequenced tags (ESTs) and is intended as a resource for peach functional genomics. A total of 6,155 successful EST sequences were obtained from four in-house prepared cDNA libraries from Prunus persica mesocarps at different developmental stages. Another 12,475 peach EST sequences were downloaded from public databases and added to the ESTree db. An automated pipeline was prepared to process EST sequences using public software integrated by in-house developed Perl scripts and data were collected in a MySQL database. A php-based web interface was developed to query the database. Results The ESTree db version as of April 2005 encompasses 18,630 sequences representing eight libraries. Contig assembly was performed with CAP3. Putative single nucleotide polymorphism (SNP) detection was performed with the AutoSNP program and a search engine was implemented to retrieve results. All the sequences and all the contig consensus sequences were annotated both with blastx against the GenBank nr db and with GOblet against the viridiplantae section of the Gene Ontology db. Links to NiceZyme (Expasy) and to the KEGG metabolic pathways were provided. A local BLAST utility is available. A text search utility allows querying and browsing the database. Statistics were provided on Gene Ontology occurrences to assign sequences to Gene Ontology categories. Conclusion The resulting database is a comprehensive resource of data and links related to peach EST sequences. The Sequence Report and Contig Report pages work as the web interface core structures, giving quick access to data related to each sequence/contig. PMID:16351742
ESTree db: a tool for peach functional genomics.
Lazzari, Barbara; Caprera, Andrea; Vecchietti, Alberto; Stella, Alessandra; Milanesi, Luciano; Pozzi, Carlo
2005-12-01
The ESTree db http://www.itb.cnr.it/estree/ represents a collection of Prunus persica expressed sequenced tags (ESTs) and is intended as a resource for peach functional genomics. A total of 6,155 successful EST sequences were obtained from four in-house prepared cDNA libraries from Prunus persica mesocarps at different developmental stages. Another 12,475 peach EST sequences were downloaded from public databases and added to the ESTree db. An automated pipeline was prepared to process EST sequences using public software integrated by in-house developed Perl scripts and data were collected in a MySQL database. A php-based web interface was developed to query the database. The ESTree db version as of April 2005 encompasses 18,630 sequences representing eight libraries. Contig assembly was performed with CAP3. Putative single nucleotide polymorphism (SNP) detection was performed with the AutoSNP program and a search engine was implemented to retrieve results. All the sequences and all the contig consensus sequences were annotated both with blastx against the GenBank nr db and with GOblet against the viridiplantae section of the Gene Ontology db. Links to NiceZyme (Expasy) and to the KEGG metabolic pathways were provided. A local BLAST utility is available. A text search utility allows querying and browsing the database. Statistics were provided on Gene Ontology occurrences to assign sequences to Gene Ontology categories. The resulting database is a comprehensive resource of data and links related to peach EST sequences. The Sequence Report and Contig Report pages work as the web interface core structures, giving quick access to data related to each sequence/contig.
He, Ji; Dai, Xinbin; Zhao, Xuechun
2007-02-09
BLAST searches are widely used for sequence alignment. The search results are commonly adopted for various functional and comparative genomics tasks such as annotating unknown sequences, investigating gene models and comparing two sequence sets. Advances in sequencing technologies pose challenges for high-throughput analysis of large-scale sequence data. A number of programs and hardware solutions exist for efficient BLAST searching, but there is a lack of generic software solutions for mining and personalized management of the results. Systematically reviewing the results and identifying information of interest remains tedious and time-consuming. Personal BLAST Navigator (PLAN) is a versatile web platform that helps users to carry out various personalized pre- and post-BLAST tasks, including: (1) query and target sequence database management, (2) automated high-throughput BLAST searching, (3) indexing and searching of results, (4) filtering results online, (5) managing results of personal interest in favorite categories, (6) automated sequence annotation (such as NCBI NR and ontology-based annotation). PLAN integrates, by default, the Decypher hardware-based BLAST solution provided by Active Motif Inc. with a greatly improved efficiency over conventional BLAST software. BLAST results are visualized by spreadsheets and graphs and are full-text searchable. BLAST results and sequence annotations can be exported, in part or in full, in various formats including Microsoft Excel and FASTA. Sequences and BLAST results are organized in projects, the data publication levels of which are controlled by the registered project owners. In addition, all analytical functions are provided to public users without registration. PLAN has proved a valuable addition to the community for automated high-throughput BLAST searches, and, more importantly, for knowledge discovery, management and sharing based on sequence alignment results. The PLAN web interface is platform-independent, easily configurable and capable of comprehensive expansion, and user-intuitive. PLAN is freely available to academic users at http://bioinfo.noble.org/plan/. The source code for local deployment is provided under free license. Full support on system utilization, installation, configuration and customization are provided to academic users.
He, Ji; Dai, Xinbin; Zhao, Xuechun
2007-01-01
Background BLAST searches are widely used for sequence alignment. The search results are commonly adopted for various functional and comparative genomics tasks such as annotating unknown sequences, investigating gene models and comparing two sequence sets. Advances in sequencing technologies pose challenges for high-throughput analysis of large-scale sequence data. A number of programs and hardware solutions exist for efficient BLAST searching, but there is a lack of generic software solutions for mining and personalized management of the results. Systematically reviewing the results and identifying information of interest remains tedious and time-consuming. Results Personal BLAST Navigator (PLAN) is a versatile web platform that helps users to carry out various personalized pre- and post-BLAST tasks, including: (1) query and target sequence database management, (2) automated high-throughput BLAST searching, (3) indexing and searching of results, (4) filtering results online, (5) managing results of personal interest in favorite categories, (6) automated sequence annotation (such as NCBI NR and ontology-based annotation). PLAN integrates, by default, the Decypher hardware-based BLAST solution provided by Active Motif Inc. with a greatly improved efficiency over conventional BLAST software. BLAST results are visualized by spreadsheets and graphs and are full-text searchable. BLAST results and sequence annotations can be exported, in part or in full, in various formats including Microsoft Excel and FASTA. Sequences and BLAST results are organized in projects, the data publication levels of which are controlled by the registered project owners. In addition, all analytical functions are provided to public users without registration. Conclusion PLAN has proved a valuable addition to the community for automated high-throughput BLAST searches, and, more importantly, for knowledge discovery, management and sharing based on sequence alignment results. The PLAN web interface is platform-independent, easily configurable and capable of comprehensive expansion, and user-intuitive. PLAN is freely available to academic users at . The source code for local deployment is provided under free license. Full support on system utilization, installation, configuration and customization are provided to academic users. PMID:17291345
ESTuber db: an online database for Tuber borchii EST sequences.
Lazzari, Barbara; Caprera, Andrea; Cosentino, Cristian; Stella, Alessandra; Milanesi, Luciano; Viotti, Angelo
2007-03-08
The ESTuber database (http://www.itb.cnr.it/estuber) includes 3,271 Tuber borchii expressed sequence tags (EST). The dataset consists of 2,389 sequences from an in-house prepared cDNA library from truffle vegetative hyphae, and 882 sequences downloaded from GenBank and representing four libraries from white truffle mycelia and ascocarps at different developmental stages. An automated pipeline was prepared to process EST sequences using public software integrated by in-house developed Perl scripts. Data were collected in a MySQL database, which can be queried via a php-based web interface. Sequences included in the ESTuber db were clustered and annotated against three databases: the GenBank nr database, the UniProtKB database and a third in-house prepared database of fungi genomic sequences. An algorithm was implemented to infer statistical classification among Gene Ontology categories from the ontology occurrences deduced from the annotation procedure against the UniProtKB database. Ontologies were also deduced from the annotation of more than 130,000 EST sequences from five filamentous fungi, for intra-species comparison purposes. Further analyses were performed on the ESTuber db dataset, including tandem repeats search and comparison of the putative protein dataset inferred from the EST sequences to the PROSITE database for protein patterns identification. All the analyses were performed both on the complete sequence dataset and on the contig consensus sequences generated by the EST assembly procedure. The resulting web site is a resource of data and links related to truffle expressed genes. The Sequence Report and Contig Report pages are the web interface core structures which, together with the Text search utility and the Blast utility, allow easy access to the data stored in the database.
Discovering discovery patterns with Predication-based Semantic Indexing.
Cohen, Trevor; Widdows, Dominic; Schvaneveldt, Roger W; Davies, Peter; Rindflesch, Thomas C
2012-12-01
In this paper we utilize methods of hyperdimensional computing to mediate the identification of therapeutically useful connections for the purpose of literature-based discovery. Our approach, named Predication-based Semantic Indexing, is utilized to identify empirically sequences of relationships known as "discovery patterns", such as "drug x INHIBITS substance y, substance y CAUSES disease z" that link pharmaceutical substances to diseases they are known to treat. These sequences are derived from semantic predications extracted from the biomedical literature by the SemRep system, and subsequently utilized to direct the search for known treatments for a held out set of diseases. Rapid and efficient inference is accomplished through the application of geometric operators in PSI space, allowing for both the derivation of discovery patterns from a large set of known TREATS relationships, and the application of these discovered patterns to constrain search for therapeutic relationships at scale. Our results include the rediscovery of discovery patterns that have been constructed manually by other authors in previous research, as well as the discovery of a set of previously unrecognized patterns. The application of these patterns to direct search through PSI space results in better recovery of therapeutic relationships than is accomplished with models based on distributional statistics alone. These results demonstrate the utility of efficient approximate inference in geometric space as a means to identify therapeutic relationships, suggesting a role of these methods in drug repurposing efforts. In addition, the results provide strong support for the utility of the discovery pattern approach pioneered by Hristovski and his colleagues. Copyright © 2012 Elsevier Inc. All rights reserved.
Discovering discovery patterns with predication-based Semantic Indexing
Cohen, Trevor; Widdows, Dominic; Schvaneveldt, Roger W.; Davies, Peter; Rindflesch, Thomas C.
2012-01-01
In this paper we utilize methods of hyperdimensional computing to mediate the identification of therapeutically useful connections for the purpose of literature-based discovery. Our approach, named Predication-based Semantic Indexing, is utilized to identify empirically sequences of relationships known as “discovery patterns”, such as “drug x INHIBITS substance y, substance y CAUSES disease z” that link pharmaceutical substances to diseases they are known to treat. These sequences are derived from semantic predications extracted from the biomedical literature by the SemRep system, and subsequently utilized to direct the search for known treatments for a held out set of diseases. Rapid and efficient inference is accomplished through the application of geometric operators in PSI space, allowing for both the derivation of discovery patterns from a large set of known TREATS relationships, and the application of these discovered patterns to constrain search for therapeutic relationships at scale. Our results include the rediscovery of discovery patterns that have been constructed manually by other authors in previous research, as well as the discovery of a set of previously unrecognized patterns. The application of these patterns to direct search through PSI space results in better recovery of therapeutic relationships than is accomplished with models based on distributional statistics alone. These results demonstrate the utility of efficient approximate inference in geometric space as a means to identify therapeutic relationships, suggesting a role of these methods in drug repurposing efforts. In addition, the results provide strong support for the utility of the discovery pattern approach pioneered by Hristovski and his colleagues. PMID:22841748
Secure Genomic Computation through Site-Wise Encryption
Zhao, Yongan; Wang, XiaoFeng; Tang, Haixu
2015-01-01
Commercial clouds provide on-demand IT services for big-data analysis, which have become an attractive option for users who have no access to comparable infrastructure. However, utilizing these services for human genome analysis is highly risky, as human genomic data contains identifiable information of human individuals and their disease susceptibility. Therefore, currently, no computation on personal human genomic data is conducted on public clouds. To address this issue, here we present a site-wise encryption approach to encrypt whole human genome sequences, which can be subject to secure searching of genomic signatures on public clouds. We implemented this method within the Hadoop framework, and tested it on the case of searching disease markers retrieved from the ClinVar database against patients’ genomic sequences. The secure search runs only one order of magnitude slower than the simple search without encryption, indicating our method is ready to be used for secure genomic computation on public clouds. PMID:26306278
Secure Genomic Computation through Site-Wise Encryption.
Zhao, Yongan; Wang, XiaoFeng; Tang, Haixu
2015-01-01
Commercial clouds provide on-demand IT services for big-data analysis, which have become an attractive option for users who have no access to comparable infrastructure. However, utilizing these services for human genome analysis is highly risky, as human genomic data contains identifiable information of human individuals and their disease susceptibility. Therefore, currently, no computation on personal human genomic data is conducted on public clouds. To address this issue, here we present a site-wise encryption approach to encrypt whole human genome sequences, which can be subject to secure searching of genomic signatures on public clouds. We implemented this method within the Hadoop framework, and tested it on the case of searching disease markers retrieved from the ClinVar database against patients' genomic sequences. The secure search runs only one order of magnitude slower than the simple search without encryption, indicating our method is ready to be used for secure genomic computation on public clouds.
Hydra: a scalable proteomic search engine which utilizes the Hadoop distributed computing framework
2012-01-01
Background For shotgun mass spectrometry based proteomics the most computationally expensive step is in matching the spectra against an increasingly large database of sequences and their post-translational modifications with known masses. Each mass spectrometer can generate data at an astonishingly high rate, and the scope of what is searched for is continually increasing. Therefore solutions for improving our ability to perform these searches are needed. Results We present a sequence database search engine that is specifically designed to run efficiently on the Hadoop MapReduce distributed computing framework. The search engine implements the K-score algorithm, generating comparable output for the same input files as the original implementation. The scalability of the system is shown, and the architecture required for the development of such distributed processing is discussed. Conclusion The software is scalable in its ability to handle a large peptide database, numerous modifications and large numbers of spectra. Performance scales with the number of processors in the cluster, allowing throughput to expand with the available resources. PMID:23216909
Hydra: a scalable proteomic search engine which utilizes the Hadoop distributed computing framework.
Lewis, Steven; Csordas, Attila; Killcoyne, Sarah; Hermjakob, Henning; Hoopmann, Michael R; Moritz, Robert L; Deutsch, Eric W; Boyle, John
2012-12-05
For shotgun mass spectrometry based proteomics the most computationally expensive step is in matching the spectra against an increasingly large database of sequences and their post-translational modifications with known masses. Each mass spectrometer can generate data at an astonishingly high rate, and the scope of what is searched for is continually increasing. Therefore solutions for improving our ability to perform these searches are needed. We present a sequence database search engine that is specifically designed to run efficiently on the Hadoop MapReduce distributed computing framework. The search engine implements the K-score algorithm, generating comparable output for the same input files as the original implementation. The scalability of the system is shown, and the architecture required for the development of such distributed processing is discussed. The software is scalable in its ability to handle a large peptide database, numerous modifications and large numbers of spectra. Performance scales with the number of processors in the cluster, allowing throughput to expand with the available resources.
Rapid search for tertiary fragments reveals protein sequence–structure relationships
Zhou, Jianfu; Grigoryan, Gevorg
2015-01-01
Finding backbone substructures from the Protein Data Bank that match an arbitrary query structural motif, composed of multiple disjoint segments, is a problem of growing relevance in structure prediction and protein design. Although numerous protein structure search approaches have been proposed, methods that address this specific task without additional restrictions and on practical time scales are generally lacking. Here, we propose a solution, dubbed MASTER, that is both rapid, enabling searches over the Protein Data Bank in a matter of seconds, and provably correct, finding all matches below a user-specified root-mean-square deviation cutoff. We show that despite the potentially exponential time complexity of the problem, running times in practice are modest even for queries with many segments. The ability to explore naturally plausible structural and sequence variations around a given motif has the potential to synthesize its design principles in an automated manner; so we go on to illustrate the utility of MASTER to protein structural biology. We demonstrate its capacity to rapidly establish structure–sequence relationships, uncover the native designability landscapes of tertiary structural motifs, identify structural signatures of binding, and automatically rewire protein topologies. Given the broad utility of protein tertiary fragment searches, we hope that providing MASTER in an open-source format will enable novel advances in understanding, predicting, and designing protein structure. PMID:25420575
2010-01-01
Background Comparative genomics methods such as phylogenetic profiling can mine powerful inferences from inherently noisy biological data sets. We introduce Sites Inferred by Metabolic Background Assertion Labeling (SIMBAL), a method that applies the Partial Phylogenetic Profiling (PPP) approach locally within a protein sequence to discover short sequence signatures associated with functional sites. The approach is based on the basic scoring mechanism employed by PPP, namely the use of binomial distribution statistics to optimize sequence similarity cutoffs during searches of partitioned training sets. Results Here we illustrate and validate the ability of the SIMBAL method to find functionally relevant short sequence signatures by application to two well-characterized protein families. In the first example, we partitioned a family of ABC permeases using a metabolic background property (urea utilization). Thus, the TRUE set for this family comprised members whose genome of origin encoded a urea utilization system. By moving a sliding window across the sequence of a permease, and searching each subsequence in turn against the full set of partitioned proteins, the method found which local sequence signatures best correlated with the urea utilization trait. Mapping of SIMBAL "hot spots" onto crystal structures of homologous permeases reveals that the significant sites are gating determinants on the cytosolic face rather than, say, docking sites for the substrate-binding protein on the extracellular face. In the second example, we partitioned a protein methyltransferase family using gene proximity as a criterion. In this case, the TRUE set comprised those methyltransferases encoded near the gene for the substrate RF-1. SIMBAL identifies sequence regions that map onto the substrate-binding interface while ignoring regions involved in the methyltransferase reaction mechanism in general. Neither method for training set construction requires any prior experimental characterization. Conclusions SIMBAL shows that, in functionally divergent protein families, selected short sequences often significantly outperform their full-length parent sequence for making functional predictions by sequence similarity, suggesting avenues for improved functional classifiers. When combined with structural data, SIMBAL affords the ability to localize and model functional sites. PMID:20102603
Protein structure determination by exhaustive search of Protein Data Bank derived databases.
Stokes-Rees, Ian; Sliz, Piotr
2010-12-14
Parallel sequence and structure alignment tools have become ubiquitous and invaluable at all levels in the study of biological systems. We demonstrate the application and utility of this same parallel search paradigm to the process of protein structure determination, benefitting from the large and growing corpus of known structures. Such searches were previously computationally intractable. Through the method of Wide Search Molecular Replacement, developed here, they can be completed in a few hours with the aide of national-scale federated cyberinfrastructure. By dramatically expanding the range of models considered for structure determination, we show that small (less than 12% structural coverage) and low sequence identity (less than 20% identity) template structures can be identified through multidimensional template scoring metrics and used for structure determination. Many new macromolecular complexes can benefit significantly from such a technique due to the lack of known homologous protein folds or sequences. We demonstrate the effectiveness of the method by determining the structure of a full-length p97 homologue from Trichoplusia ni. Example cases with the MHC/T-cell receptor complex and the EmoB protein provide systematic estimates of minimum sequence identity, structure coverage, and structural similarity required for this method to succeed. We describe how this structure-search approach and other novel computationally intensive workflows are made tractable through integration with the US national computational cyberinfrastructure, allowing, for example, rapid processing of the entire Structural Classification of Proteins protein fragment database.
Optimizing high performance computing workflow for protein functional annotation.
Stanberry, Larissa; Rekepalli, Bhanu; Liu, Yuan; Giblock, Paul; Higdon, Roger; Montague, Elizabeth; Broomall, William; Kolker, Natali; Kolker, Eugene
2014-09-10
Functional annotation of newly sequenced genomes is one of the major challenges in modern biology. With modern sequencing technologies, the protein sequence universe is rapidly expanding. Newly sequenced bacterial genomes alone contain over 7.5 million proteins. The rate of data generation has far surpassed that of protein annotation. The volume of protein data makes manual curation infeasible, whereas a high compute cost limits the utility of existing automated approaches. In this work, we present an improved and optmized automated workflow to enable large-scale protein annotation. The workflow uses high performance computing architectures and a low complexity classification algorithm to assign proteins into existing clusters of orthologous groups of proteins. On the basis of the Position-Specific Iterative Basic Local Alignment Search Tool the algorithm ensures at least 80% specificity and sensitivity of the resulting classifications. The workflow utilizes highly scalable parallel applications for classification and sequence alignment. Using Extreme Science and Engineering Discovery Environment supercomputers, the workflow processed 1,200,000 newly sequenced bacterial proteins. With the rapid expansion of the protein sequence universe, the proposed workflow will enable scientists to annotate big genome data.
Optimizing high performance computing workflow for protein functional annotation
Stanberry, Larissa; Rekepalli, Bhanu; Liu, Yuan; Giblock, Paul; Higdon, Roger; Montague, Elizabeth; Broomall, William; Kolker, Natali; Kolker, Eugene
2014-01-01
Functional annotation of newly sequenced genomes is one of the major challenges in modern biology. With modern sequencing technologies, the protein sequence universe is rapidly expanding. Newly sequenced bacterial genomes alone contain over 7.5 million proteins. The rate of data generation has far surpassed that of protein annotation. The volume of protein data makes manual curation infeasible, whereas a high compute cost limits the utility of existing automated approaches. In this work, we present an improved and optmized automated workflow to enable large-scale protein annotation. The workflow uses high performance computing architectures and a low complexity classification algorithm to assign proteins into existing clusters of orthologous groups of proteins. On the basis of the Position-Specific Iterative Basic Local Alignment Search Tool the algorithm ensures at least 80% specificity and sensitivity of the resulting classifications. The workflow utilizes highly scalable parallel applications for classification and sequence alignment. Using Extreme Science and Engineering Discovery Environment supercomputers, the workflow processed 1,200,000 newly sequenced bacterial proteins. With the rapid expansion of the protein sequence universe, the proposed workflow will enable scientists to annotate big genome data. PMID:25313296
De novo characterization of Lentinula edodes C(91-3) transcriptome by deep Solexa sequencing.
Zhong, Mintao; Liu, Ben; Wang, Xiaoli; Liu, Lei; Lun, Yongzhi; Li, Xingyun; Ning, Anhong; Cao, Jing; Huang, Min
2013-02-01
Lentinula edodes, has been utilized as food, as well as, in popular medicine, moreover, its extract isolated from its mycelium and fruiting body have shown several therapeutic properties. Yet little is understood about its genes involved in these properties, and the absence of L.edodes genomes has been a barrier to the development of functional genomics research. However, high throughput sequencing technologies are now being widely applied to non-model species. To facilitate research on L.edodes, we leveraged Solexa sequencing technology in de novo assembly of L.edodes C(91-3) transcriptome. In a single run, we produced more than 57 million sequencing reads. These reads were assembled into 28,923 unigene sequences (mean size=689bp) including 18,120 unigenes with coding sequence (CDS). Based on similarity search with known proteins, assembled unigene sequences were annotated with gene descriptions, gene ontology (GO) and clusters of orthologous group (COG) terms. Our data provides the first comprehensive sequence resource available for functional genomics studies in L.edodes, and demonstrates the utility of Illumina/Solexa sequencing for de novo transcriptome characterization and gene discovery in a non-model mushroom. Copyright © 2012 Elsevier Inc. All rights reserved.
Simrank: Rapid and sensitive general-purpose k-mer search tool
2011-01-01
Background Terabyte-scale collections of string-encoded data are expected from consortia efforts such as the Human Microbiome Project http://nihroadmap.nih.gov/hmp. Intra- and inter-project data similarity searches are enabled by rapid k-mer matching strategies. Software applications for sequence database partitioning, guide tree estimation, molecular classification and alignment acceleration have benefited from embedded k-mer searches as sub-routines. However, a rapid, general-purpose, open-source, flexible, stand-alone k-mer tool has not been available. Results Here we present a stand-alone utility, Simrank, which allows users to rapidly identify database strings the most similar to query strings. Performance testing of Simrank and related tools against DNA, RNA, protein and human-languages found Simrank 10X to 928X faster depending on the dataset. Conclusions Simrank provides molecular ecologists with a high-throughput, open source choice for comparing large sequence sets to find similarity. PMID:21524302
Tiered Human Integrated Sequence Search Databases for Shotgun Proteomics.
Deutsch, Eric W; Sun, Zhi; Campbell, David S; Binz, Pierre-Alain; Farrah, Terry; Shteynberg, David; Mendoza, Luis; Omenn, Gilbert S; Moritz, Robert L
2016-11-04
The results of analysis of shotgun proteomics mass spectrometry data can be greatly affected by the selection of the reference protein sequence database against which the spectra are matched. For many species there are multiple sources from which somewhat different sequence sets can be obtained. This can lead to confusion about which database is best in which circumstances-a problem especially acute in human sample analysis. All sequence databases are genome-based, with sequences for the predicted gene and their protein translation products compiled. Our goal is to create a set of primary sequence databases that comprise the union of sequences from many of the different available sources and make the result easily available to the community. We have compiled a set of four sequence databases of varying sizes, from a small database consisting of only the ∼20,000 primary isoforms plus contaminants to a very large database that includes almost all nonredundant protein sequences from several sources. This set of tiered, increasingly complete human protein sequence databases suitable for mass spectrometry proteomics sequence database searching is called the Tiered Human Integrated Search Proteome set. In order to evaluate the utility of these databases, we have analyzed two different data sets, one from the HeLa cell line and the other from normal human liver tissue, with each of the four tiers of database complexity. The result is that approximately 0.8%, 1.1%, and 1.5% additional peptides can be identified for Tiers 2, 3, and 4, respectively, as compared with the Tier 1 database, at substantially increasing computational cost. This increase in computational cost may be worth bearing if the identification of sequence variants or the discovery of sequences that are not present in the reviewed knowledge base entries is an important goal of the study. We find that it is useful to search a data set against a simpler database, and then check the uniqueness of the discovered peptides against a more complex database. We have set up an automated system that downloads all the source databases on the first of each month and automatically generates a new set of search databases and makes them available for download at http://www.peptideatlas.org/thisp/ .
Tiered Human Integrated Sequence Search Databases for Shotgun Proteomics
Deutsch, Eric W.; Sun, Zhi; Campbell, David S.; Binz, Pierre-Alain; Farrah, Terry; Shteynberg, David; Mendoza, Luis; Omenn, Gilbert S.; Moritz, Robert L.
2016-01-01
The results of analysis of shotgun proteomics mass spectrometry data can be greatly affected by the selection of the reference protein sequence database against which the spectra are matched. For many species there are multiple sources from which somewhat different sequence sets can be obtained. This can lead to confusion about which database is best in which circumstances – a problem especially acute in human sample analysis. All sequence databases are genome-based, with sequences for the predicted gene and their protein translation products compiled. Our goal is to create a set of primary sequence databases that comprise the union of sequences from many of the different available sources and make the result easily available to the community. We have compiled a set of four sequence databases of varying sizes, from a small database consisting of only the ~20,000 primary isoforms plus contaminants to a very large database that includes almost all non-redundant protein sequences from several sources. This set of tiered, increasingly complete human protein sequence databases suitable for mass spectrometry proteomics sequence database searching is called the Tiered Human Integrated Search Proteome set. In order to evaluate the utility of these databases, we have analyzed two different data sets, one from the HeLa cell line and the other from normal human liver tissue, with each of the four tiers of database complexity. The result is that approximately 0.8%, 1.1%, and 1.5% additional peptides can be identified for Tiers 2, 3, and 4, respectively, as compared with the Tier 1 database, at substantially increasing computational cost. This increase in computational cost may be worth bearing if the identification of sequence variants or the discovery of sequences that are not present in the reviewed knowledge base entries is an important goal of the study. We find that it is useful to search a data set against a simpler database, and then check the uniqueness of the discovered peptides against a more complex database. We have set up an automated system that downloads all the source databases on the first of each month and automatically generates a new set of search databases and makes them available for download at http://www.peptideatlas.org/thisp/. PMID:27577934
Design of nucleic acid sequences for DNA computing based on a thermodynamic approach
Tanaka, Fumiaki; Kameda, Atsushi; Yamamoto, Masahito; Ohuchi, Azuma
2005-01-01
We have developed an algorithm for designing multiple sequences of nucleic acids that have a uniform melting temperature between the sequence and its complement and that do not hybridize non-specifically with each other based on the minimum free energy (ΔGmin). Sequences that satisfy these constraints can be utilized in computations, various engineering applications such as microarrays, and nano-fabrications. Our algorithm is a random generate-and-test algorithm: it generates a candidate sequence randomly and tests whether the sequence satisfies the constraints. The novelty of our algorithm is that the filtering method uses a greedy search to calculate ΔGmin. This effectively excludes inappropriate sequences before ΔGmin is calculated, thereby reducing computation time drastically when compared with an algorithm without the filtering. Experimental results in silico showed the superiority of the greedy search over the traditional approach based on the hamming distance. In addition, experimental results in vitro demonstrated that the experimental free energy (ΔGexp) of 126 sequences correlated well with ΔGmin (|R| = 0.90) than with the hamming distance (|R| = 0.80). These results validate the rationality of a thermodynamic approach. We implemented our algorithm in a graphic user interface-based program written in Java. PMID:15701762
Zhao, Panpan; Zhong, Jiayong; Liu, Wanting; Zhao, Jing; Zhang, Gong
2017-12-01
Multiple search engines based on various models have been developed to search MS/MS spectra against a reference database, providing different results for the same data set. How to integrate these results efficiently with minimal compromise on false discoveries is an open question due to the lack of an independent, reliable, and highly sensitive standard. We took the advantage of the translating mRNA sequencing (RNC-seq) result as a standard to evaluate the integration strategies of the protein identifications from various search engines. We used seven mainstream search engines (Andromeda, Mascot, OMSSA, X!Tandem, pFind, InsPecT, and ProVerB) to search the same label-free MS data sets of human cell lines Hep3B, MHCCLM3, and MHCC97H from the Chinese C-HPP Consortium for Chromosomes 1, 8, and 20. As expected, the union of seven engines resulted in a boosted false identification, whereas the intersection of seven engines remarkably decreased the identification power. We found that identifications of at least two out of seven engines resulted in maximizing the protein identification power while minimizing the ratio of suspicious/translation-supported identifications (STR), as monitored by our STR index, based on RNC-Seq. Furthermore, this strategy also significantly improves the peptides coverage of the protein amino acid sequence. In summary, we demonstrated a simple strategy to significantly improve the performance for shotgun mass spectrometry by protein-level integrating multiple search engines, maximizing the utilization of the current MS spectra without additional experimental work.
NASA Technical Reports Server (NTRS)
Wheeler, Ward C.
2003-01-01
A method to align sequence data based on parsimonious synapomorphy schemes generated by direct optimization (DO; earlier termed optimization alignment) is proposed. DO directly diagnoses sequence data on cladograms without an intervening multiple-alignment step, thereby creating topology-specific, dynamic homology statements. Hence, no multiple-alignment is required to generate cladograms. Unlike general and globally optimal multiple-alignment procedures, the method described here, implied alignment (IA), takes these dynamic homologies and traces them back through a single cladogram, linking the unaligned sequence positions in the terminal taxa via DO transformation series. These "lines of correspondence" link ancestor-descendent states and, when displayed as linearly arrayed columns without hypothetical ancestors, are largely indistinguishable from standard multiple alignment. Since this method is based on synapomorphy, the treatment of certain classes of insertion-deletion (indel) events may be different from that of other alignment procedures. As with all alignment methods, results are dependent on parameter assumptions such as indel cost and transversion:transition ratios. Such an IA could be used as a basis for phylogenetic search, but this would be questionable since the homologies derived from the implied alignment depend on its natal cladogram and any variance, between DO and IA + Search, due to heuristic approach. The utility of this procedure in heuristic cladogram searches using DO and the improvement of heuristic cladogram cost calculations are discussed. c2003 The Willi Hennig Society. Published by Elsevier Science (USA). All rights reserved.
Yates, John R
2015-11-01
Advances in computer technology and software have driven developments in mass spectrometry over the last 50 years. Computers and software have been impactful in three areas: the automation of difficult calculations to aid interpretation, the collection of data and control of instruments, and data interpretation. As the power of computers has grown, so too has the utility and impact on mass spectrometers and their capabilities. This has been particularly evident in the use of tandem mass spectrometry data to search protein and nucleotide sequence databases to identify peptide and protein sequences. This capability has driven the development of many new approaches to study biological systems, including the use of "bottom-up shotgun proteomics" to directly analyze protein mixtures. Graphical Abstract ᅟ.
Piehowski, Paul D; Petyuk, Vladislav A; Sandoval, John D; Burnum, Kristin E; Kiebel, Gary R; Monroe, Matthew E; Anderson, Gordon A; Camp, David G; Smith, Richard D
2013-03-01
For bottom-up proteomics, there are wide variety of database-searching algorithms in use for matching peptide sequences to tandem MS spectra. Likewise, there are numerous strategies being employed to produce a confident list of peptide identifications from the different search algorithm outputs. Here we introduce a grid-search approach for determining optimal database filtering criteria in shotgun proteomics data analyses that is easily adaptable to any search. Systematic Trial and Error Parameter Selection--referred to as STEPS--utilizes user-defined parameter ranges to test a wide array of parameter combinations to arrive at an optimal "parameter set" for data filtering, thus maximizing confident identifications. The benefits of this approach in terms of numbers of true-positive identifications are demonstrated using datasets derived from immunoaffinity-depleted blood serum and a bacterial cell lysate, two common proteomics sample types. © 2013 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Liu, Min; Zhang, Zhongqi; Zang, Tianzhu; Spahr, Chris; Cheetham, Janet; Ren, Da; Sunny Zhou, Zhaohui
2013-01-01
Characterization of protein crosslinking, particularly without prior knowledge of the chemical nature and site of crosslinking, poses a significant challenge due to their intrinsic structural complexity and the lack of a comprehensive analytical approach. Towards this end, we have developed a generally applicable workflow—XChem-Finder that involves four stages. (1) Detection of crosslinked peptides via 18O-labeling at C-termini. (2) Determination of the putative partial sequences of each crosslinked peptide pair using a fragment ion mass database search against known protein sequences coupled with a de novo sequence tag search. (3) Extension to full sequences based on protease specificity, the unique combination of mass, and other constraints. (4) Deduction of crosslinking chemistry and site. The mass difference between the sum of two putative full-length peptides and the crosslinked peptide provides the formulas (elemental composition analysis) for the functional groups involved in each cross- linking. Combined with sequence restraint from MS/MS data, plausible crosslinking chemistry and site were inferred, and ultimately, confirmed by matching with all data. Applying our approach to a stressed IgG2 antibody, ten cross-linked peptides were discovered and found to be connected via thioether originating from disulfides at locations that had not been previously recognized. Furthermore, once the crosslink chemistry was revealed, a targeted crosslink search yielded four additional crosslinked peptides that all contain the C-terminus of the light chain. PMID:23634697
Kamatuka, Kenta; Hattori, Masahiro; Sugiyama, Tomoyasu
2016-12-01
RNA interference (RNAi) screening is extensively used in the field of reverse genetics. RNAi libraries constructed using random oligonucleotides have made this technology affordable. However, the new methodology requires exploration of the RNAi target gene information after screening because the RNAi library includes non-natural sequences that are not found in genes. Here, we developed a web-based tool to support RNAi screening. The system performs short hairpin RNA (shRNA) target prediction that is informed by comprehensive enquiry (SPICE). SPICE automates several tasks that are laborious but indispensable to evaluate the shRNAs obtained by RNAi screening. SPICE has four main functions: (i) sequence identification of shRNA in the input sequence (the sequence might be obtained by sequencing clones in the RNAi library), (ii) searching the target genes in the database, (iii) demonstrating biological information obtained from the database, and (iv) preparation of search result files that can be utilized in a local personal computer (PC). Using this system, we demonstrated that genes targeted by random oligonucleotide-derived shRNAs were not different from those targeted by organism-specific shRNA. The system facilitates RNAi screening, which requires sequence analysis after screening. The SPICE web application is available at http://www.spice.sugysun.org/.
Genome sequencing in microfabricated high-density picolitre reactors.
Margulies, Marcel; Egholm, Michael; Altman, William E; Attiya, Said; Bader, Joel S; Bemben, Lisa A; Berka, Jan; Braverman, Michael S; Chen, Yi-Ju; Chen, Zhoutao; Dewell, Scott B; Du, Lei; Fierro, Joseph M; Gomes, Xavier V; Godwin, Brian C; He, Wen; Helgesen, Scott; Ho, Chun Heen; Ho, Chun He; Irzyk, Gerard P; Jando, Szilveszter C; Alenquer, Maria L I; Jarvie, Thomas P; Jirage, Kshama B; Kim, Jong-Bum; Knight, James R; Lanza, Janna R; Leamon, John H; Lefkowitz, Steven M; Lei, Ming; Li, Jing; Lohman, Kenton L; Lu, Hong; Makhijani, Vinod B; McDade, Keith E; McKenna, Michael P; Myers, Eugene W; Nickerson, Elizabeth; Nobile, John R; Plant, Ramona; Puc, Bernard P; Ronan, Michael T; Roth, George T; Sarkis, Gary J; Simons, Jan Fredrik; Simpson, John W; Srinivasan, Maithreyan; Tartaro, Karrie R; Tomasz, Alexander; Vogt, Kari A; Volkmer, Greg A; Wang, Shally H; Wang, Yong; Weiner, Michael P; Yu, Pengguang; Begley, Richard F; Rothberg, Jonathan M
2005-09-15
The proliferation of large-scale DNA-sequencing projects in recent years has driven a search for alternative methods to reduce time and cost. Here we describe a scalable, highly parallel sequencing system with raw throughput significantly greater than that of state-of-the-art capillary electrophoresis instruments. The apparatus uses a novel fibre-optic slide of individual wells and is able to sequence 25 million bases, at 99% or better accuracy, in one four-hour run. To achieve an approximately 100-fold increase in throughput over current Sanger sequencing technology, we have developed an emulsion method for DNA amplification and an instrument for sequencing by synthesis using a pyrosequencing protocol optimized for solid support and picolitre-scale volumes. Here we show the utility, throughput, accuracy and robustness of this system by shotgun sequencing and de novo assembly of the Mycoplasma genitalium genome with 96% coverage at 99.96% accuracy in one run of the machine.
GuiTope: an application for mapping random-sequence peptides to protein sequences.
Halperin, Rebecca F; Stafford, Phillip; Emery, Jack S; Navalkar, Krupa Arun; Johnston, Stephen Albert
2012-01-03
Random-sequence peptide libraries are a commonly used tool to identify novel ligands for binding antibodies, other proteins, and small molecules. It is often of interest to compare the selected peptide sequences to the natural protein binding partners to infer the exact binding site or the importance of particular residues. The ability to search a set of sequences for similarity to a set of peptides may sometimes enable the prediction of an antibody epitope or a novel binding partner. We have developed a software application designed specifically for this task. GuiTope provides a graphical user interface for aligning peptide sequences to protein sequences. All alignment parameters are accessible to the user including the ability to specify the amino acid frequency in the peptide library; these frequencies often differ significantly from those assumed by popular alignment programs. It also includes a novel feature to align di-peptide inversions, which we have found improves the accuracy of antibody epitope prediction from peptide microarray data and shows utility in analyzing phage display datasets. Finally, GuiTope can randomly select peptides from a given library to estimate a null distribution of scores and calculate statistical significance. GuiTope provides a convenient method for comparing selected peptide sequences to protein sequences, including flexible alignment parameters, novel alignment features, ability to search a database, and statistical significance of results. The software is available as an executable (for PC) at http://www.immunosignature.com/software and ongoing updates and source code will be available at sourceforge.net.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Karpinets, Tatiana V; Park, Byung; Syed, Mustafa H
2010-01-01
The Carbohydrate-Active Enzyme (CAZy) database provides a rich set of manually annotated enzymes that degrade, modify, or create glycosidic bonds. Despite rich and invaluable information stored in the database, software tools utilizing this information for annotation of newly sequenced genomes by CAZy families are limited. We have employed two annotation approaches to fill the gap between manually curated high-quality protein sequences collected in the CAZy database and the growing number of other protein sequences produced by genome or metagenome sequencing projects. The first approach is based on a similarity search against the entire non-redundant sequences of the CAZy database. Themore » second approach performs annotation using links or correspondences between the CAZy families and protein family domains. The links were discovered using the association rule learning algorithm applied to sequences from the CAZy database. The approaches complement each other and in combination achieved high specificity and sensitivity when cross-evaluated with the manually curated genomes of Clostridium thermocellum ATCC 27405 and Saccharophagus degradans 2-40. The capability of the proposed framework to predict the function of unknown protein domains (DUF) and of hypothetical proteins in the genome of Neurospora crassa is demonstrated. The framework is implemented as a Web service, the CAZymes Analysis Toolkit (CAT), and is available at http://cricket.ornl.gov/cgi-bin/cat.cgi.« less
Park, Byung H; Karpinets, Tatiana V; Syed, Mustafa H; Leuze, Michael R; Uberbacher, Edward C
2010-12-01
The Carbohydrate-Active Enzyme (CAZy) database provides a rich set of manually annotated enzymes that degrade, modify, or create glycosidic bonds. Despite rich and invaluable information stored in the database, software tools utilizing this information for annotation of newly sequenced genomes by CAZy families are limited. We have employed two annotation approaches to fill the gap between manually curated high-quality protein sequences collected in the CAZy database and the growing number of other protein sequences produced by genome or metagenome sequencing projects. The first approach is based on a similarity search against the entire nonredundant sequences of the CAZy database. The second approach performs annotation using links or correspondences between the CAZy families and protein family domains. The links were discovered using the association rule learning algorithm applied to sequences from the CAZy database. The approaches complement each other and in combination achieved high specificity and sensitivity when cross-evaluated with the manually curated genomes of Clostridium thermocellum ATCC 27405 and Saccharophagus degradans 2-40. The capability of the proposed framework to predict the function of unknown protein domains and of hypothetical proteins in the genome of Neurospora crassa is demonstrated. The framework is implemented as a Web service, the CAZymes Analysis Toolkit, and is available at http://cricket.ornl.gov/cgi-bin/cat.cgi.
Virus Identification in Unknown Tropical Febrile Illness Cases Using Deep Sequencing
Balmaseda, Angel; Harris, Eva; DeRisi, Joseph L.
2012-01-01
Dengue virus is an emerging infectious agent that infects an estimated 50–100 million people annually worldwide, yet current diagnostic practices cannot detect an etiologic pathogen in ∼40% of dengue-like illnesses. Metagenomic approaches to pathogen detection, such as viral microarrays and deep sequencing, are promising tools to address emerging and non-diagnosable disease challenges. In this study, we used the Virochip microarray and deep sequencing to characterize the spectrum of viruses present in human sera from 123 Nicaraguan patients presenting with dengue-like symptoms but testing negative for dengue virus. We utilized a barcoding strategy to simultaneously deep sequence multiple serum specimens, generating on average over 1 million reads per sample. We then implemented a stepwise bioinformatic filtering pipeline to remove the majority of human and low-quality sequences to improve the speed and accuracy of subsequent unbiased database searches. By deep sequencing, we were able to detect virus sequence in 37% (45/123) of previously negative cases. These included 13 cases with Human Herpesvirus 6 sequences. Other samples contained sequences with similarity to sequences from viruses in the Herpesviridae, Flaviviridae, Circoviridae, Anelloviridae, Asfarviridae, and Parvoviridae families. In some cases, the putative viral sequences were virtually identical to known viruses, and in others they diverged, suggesting that they may derive from novel viruses. These results demonstrate the utility of unbiased metagenomic approaches in the detection of known and divergent viruses in the study of tropical febrile illness. PMID:22347512
Gulvik, Christopher A.; Effler, T. Chad; Wilhelm, Steven W.; Buchan, Alison
2012-01-01
Development and use of primer sets to amplify nucleic acid sequences of interest is fundamental to studies spanning many life science disciplines. As such, the validation of primer sets is essential. Several computer programs have been created to aid in the initial selection of primer sequences that may or may not require multiple nucleotide combinations (i.e., degeneracies). Conversely, validation of primer specificity has remained largely unchanged for several decades, and there are currently few available programs that allows for an evaluation of primers containing degenerate nucleotide bases. To alleviate this gap, we developed the program De-MetaST that performs an in silico amplification using user defined nucleotide sequence dataset(s) and primer sequences that may contain degenerate bases. The program returns an output file that contains the in silico amplicons. When De-MetaST is paired with NCBI’s BLAST (De-MetaST-BLAST), the program also returns the top 10 nr NCBI database hits for each recovered in silico amplicon. While the original motivation for development of this search tool was degenerate primer validation using the wealth of nucleotide sequences available in environmental metagenome and metatranscriptome databases, this search tool has potential utility in many data mining applications. PMID:23189198
Magnetic resonance imaging in active surveillance—a modern approach
Moore, Caroline M.
2018-01-01
In recent years, active surveillance has been increasingly adopted as a conservative management approach to low and sometimes intermediate risk prostate cancer, to avoid or delay treatment until there is evidence of higher risk disease. A number of studies have investigated the role of multiparametric magnetic resonance imaging (mpMRI) in this setting. MpMRI refers to the use of multiple MRI sequences (T2-weighted anatomical and functional imaging which can include diffusion-weighted imaging, dynamic contrast enhanced imaging, spectroscopy). Each of the parameters investigates different aspects of the prostate gland (anatomy, cellularity, vascularity, etc.). In addition to a qualitative assessment, the radiologist can also extrapolate quantitative imaging biomarkers from these sequences, for example the apparent diffusion coefficient from diffusion-weighted imaging. There are many different types of articles (e.g., reviews, commentaries, consensus meetings, etc.) that address the use of mpMRI in men on active surveillance for prostate cancer. In this paper, we compare original articles that investigate the role of the different mpMRI sequences in men on active surveillance for prostate cancer, in order to discuss the relative utility of the different sequences, and combinations of sequences. We searched MEDLINE/PubMed for manuscripts published from inception to 1st December 2017. The search terms used were (prostate cancer or prostate adenocarcinoma or prostatic carcinoma or prostate carcinoma or prostatic adenocarcinoma) and (MRI or NMR or magnetic resonance imaging or mpMRI or multiparametric MRI) and active surveillance. Overall, 425 publications were found. All abstracts were reviewed to identify papers with original data. Twenty-five papers were analysed and summarised. Some papers based their analysis only on one mpMRI sequence, while others assessed two or more. The evidence from this review suggests that qualitative assessments and quantitative data from different mpMRI sequences hold promise in the management of men on active surveillance for prostate cancer. Both qualitative and quantitative approaches should be considered when assessing mpMRI of the prostate. There is a need for robust studies assessing the relative utility of different combinations of sequences in a systematic manner to determine the most efficient use of mpMRI in men on active surveillance. PMID:29594026
Memory-efficient dynamic programming backtrace and pairwise local sequence alignment.
Newberg, Lee A
2008-08-15
A backtrace through a dynamic programming algorithm's intermediate results in search of an optimal path, or to sample paths according to an implied probability distribution, or as the second stage of a forward-backward algorithm, is a task of fundamental importance in computational biology. When there is insufficient space to store all intermediate results in high-speed memory (e.g. cache) existing approaches store selected stages of the computation, and recompute missing values from these checkpoints on an as-needed basis. Here we present an optimal checkpointing strategy, and demonstrate its utility with pairwise local sequence alignment of sequences of length 10,000. Sample C++-code for optimal backtrace is available in the Supplementary Materials. Supplementary data is available at Bioinformatics online.
Schroeter, Elena R.; Feranec, Robert S.; Vashishth, Deepak
2016-01-01
Vertebrate fossils have been collected for hundreds of years and are stored in museum collections around the world. These remains provide a readily available resource to search for preserved proteins; however, the vast majority of palaeoproteomic studies have focused on relatively recently collected bones with a well-known handling history. Here, we characterize proteins from the nasal turbinates of the first Castoroides ohioensis skull ever discovered. Collected in 1845, this is the oldest museum-curated specimen characterized using palaeoproteomic tools. Our mass spectrometry analysis detected many collagen I peptides, a peptide from haemoglobin beta, and in vivo and diagenetic post-translational modifications. Additionally, the identified collagen I sequences provide enough resolution to place C. ohioensis within Rodentia. This study illustrates the utility of archived museum specimens for both the recovery of preserved proteins and phylogenetic analyses. PMID:27306052
Demircan, Turan; Keskin, Ilknur; Dumlu, Seda Nilgün; Aytürk, Nilüfer; Avşaroğlu, Mahmut Erhan; Akgün, Emel; Öztürk, Gürkan; Baykal, Ahmet Tarık
2017-01-01
Salamander axolotl has been emerging as an important model for stem cell research due to its powerful regenerative capacity. Several advantages, such as the high capability of advanced tissue, organ, and appendages regeneration, promote axolotl as an ideal model system to extend our current understanding on the mechanisms of regeneration. Acknowledging the common molecular pathways between amphibians and mammals, there is a great potential to translate the messages from axolotl research to mammalian studies. However, the utilization of axolotl is hindered due to the lack of reference databases of genomic, transcriptomic, and proteomic data. Here, we introduce the proteome analysis of the axolotl tail section searched against an mRNA-seq database. We translated axolotl mRNA sequences to protein sequences and annotated these to process the LC-MS/MS data and identified 1001 nonredundant proteins. Functional classification of identified proteins was performed by gene ontology searches. The presence of some of the identified proteins was validated by in situ antibody labeling. Furthermore, we have analyzed the proteome expressional changes postamputation at three time points to evaluate the underlying mechanisms of the regeneration process. Taken together, this work expands the proteomics data of axolotl to contribute to its establishment as a fully utilized model. © 2016 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Wang, Penghao; Wilson, Susan R
2013-01-01
Mass spectrometry-based protein identification is a very challenging task. The main identification approaches include de novo sequencing and database searching. Both approaches have shortcomings, so an integrative approach has been developed. The integrative approach firstly infers partial peptide sequences, known as tags, directly from tandem spectra through de novo sequencing, and then puts these sequences into a database search to see if a close peptide match can be found. However the current implementation of this integrative approach has several limitations. Firstly, simplistic de novo sequencing is applied and only very short sequence tags are used. Secondly, most integrative methods apply an algorithm similar to BLAST to search for exact sequence matches and do not accommodate sequence errors well. Thirdly, by applying these methods the integrated de novo sequencing makes a limited contribution to the scoring model which is still largely based on database searching. We have developed a new integrative protein identification method which can integrate de novo sequencing more efficiently into database searching. Evaluated on large real datasets, our method outperforms popular identification methods.
An approach to large scale identification of non-obvious structural similarities between proteins
Cherkasov, Artem; Jones, Steven JM
2004-01-01
Background A new sequence independent bioinformatics approach allowing genome-wide search for proteins with similar three dimensional structures has been developed. By utilizing the numerical output of the sequence threading it establishes putative non-obvious structural similarities between proteins. When applied to the testing set of proteins with known three dimensional structures the developed approach was able to recognize structurally similar proteins with high accuracy. Results The method has been developed to identify pathogenic proteins with low sequence identity and high structural similarity to host analogues. Such protein structure relationships would be hypothesized to arise through convergent evolution or through ancient horizontal gene transfer events, now undetectable using current sequence alignment techniques. The pathogen proteins, which could mimic or interfere with host activities, would represent candidate virulence factors. The developed approach utilizes the numerical outputs from the sequence-structure threading. It identifies the potential structural similarity between a pair of proteins by correlating the threading scores of the corresponding two primary sequences against the library of the standard folds. This approach allowed up to 64% sensitivity and 99.9% specificity in distinguishing protein pairs with high structural similarity. Conclusion Preliminary results obtained by comparison of the genomes of Homo sapiens and several strains of Chlamydia trachomatis have demonstrated the potential usefulness of the method in the identification of bacterial proteins with known or potential roles in virulence. PMID:15147578
Alignment of high-throughput sequencing data inside in-memory databases.
Firnkorn, Daniel; Knaup-Gregori, Petra; Lorenzo Bermejo, Justo; Ganzinger, Matthias
2014-01-01
In times of high-throughput DNA sequencing techniques, performance-capable analysis of DNA sequences is of high importance. Computer supported DNA analysis is still an intensive time-consuming task. In this paper we explore the potential of a new In-Memory database technology by using SAP's High Performance Analytic Appliance (HANA). We focus on read alignment as one of the first steps in DNA sequence analysis. In particular, we examined the widely used Burrows-Wheeler Aligner (BWA) and implemented stored procedures in both, HANA and the free database system MySQL, to compare execution time and memory management. To ensure that the results are comparable, MySQL has been running in memory as well, utilizing its integrated memory engine for database table creation. We implemented stored procedures, containing exact and inexact searching of DNA reads within the reference genome GRCh37. Due to technical restrictions in SAP HANA concerning recursion, the inexact matching problem could not be implemented on this platform. Hence, performance analysis between HANA and MySQL was made by comparing the execution time of the exact search procedures. Here, HANA was approximately 27 times faster than MySQL which means, that there is a high potential within the new In-Memory concepts, leading to further developments of DNA analysis procedures in the future.
Terminator Detection by Support Vector Machine Utilizing aStochastic Context-Free Grammar
DOE Office of Scientific and Technical Information (OSTI.GOV)
Francis-Lyon, Patricia; Cristianini, Nello; Holbrook, Stephen
2006-12-30
A 2-stage detector was designed to find rho-independent transcription terminators in the Escherichia coli genome. The detector includes a Stochastic Context Free Grammar (SCFG) component and a Support Vector Machine (SVM) component. To find terminators, the SCFG searches the intergenic regions of nucleotide sequence for local matches to a terminator grammar that was designed and trained utilizing examples of known terminators. The grammar selects sequences that are the best candidates for terminators and assigns them a prefix, stem-loop, suffix structure using the Cocke-Younger-Kasaami (CYK) algorithm, modified to incorporate energy affects of base pairing. The parameters from this inferred structure aremore » passed to the SVM classifier, which distinguishes terminators from non-terminators that score high according to the terminator grammar. The SVM was trained with negative examples drawn from intergenic sequences that include both featureless and RNA gene regions (which were assigned prefix, stem-loop, suffix structure by the SCFG), so that it successfully distinguishes terminators from either of these. The classifier was found to be 96.4% successful during testing.« less
BLAST and FASTA similarity searching for multiple sequence alignment.
Pearson, William R
2014-01-01
BLAST, FASTA, and other similarity searching programs seek to identify homologous proteins and DNA sequences based on excess sequence similarity. If two sequences share much more similarity than expected by chance, the simplest explanation for the excess similarity is common ancestry-homology. The most effective similarity searches compare protein sequences, rather than DNA sequences, for sequences that encode proteins, and use expectation values, rather than percent identity, to infer homology. The BLAST and FASTA packages of sequence comparison programs provide programs for comparing protein and DNA sequences to protein databases (the most sensitive searches). Protein and translated-DNA comparisons to protein databases routinely allow evolutionary look back times from 1 to 2 billion years; DNA:DNA searches are 5-10-fold less sensitive. BLAST and FASTA can be run on popular web sites, but can also be downloaded and installed on local computers. With local installation, target databases can be customized for the sequence data being characterized. With today's very large protein databases, search sensitivity can also be improved by searching smaller comprehensive databases, for example, a complete protein set from an evolutionarily neighboring model organism. By default, BLAST and FASTA use scoring strategies target for distant evolutionary relationships; for comparisons involving short domains or queries, or searches that seek relatively close homologs (e.g. mouse-human), shallower scoring matrices will be more effective. Both BLAST and FASTA provide very accurate statistical estimates, which can be used to reliably identify protein sequences that diverged more than 2 billion years ago.
Ramu, Chenna
2003-07-01
SIRW (http://sirw.embl.de/) is a World Wide Web interface to the Simple Indexing and Retrieval System (SIR) that is capable of parsing and indexing various flat file databases. In addition it provides a framework for doing sequence analysis (e.g. motif pattern searches) for selected biological sequences through keyword search. SIRW is an ideal tool for the bioinformatics community for searching as well as analyzing biological sequences of interest.
RNA Bricks—a database of RNA 3D motifs and their interactions
Chojnowski, Grzegorz; Waleń, Tomasz; Bujnicki, Janusz M.
2014-01-01
The RNA Bricks database (http://iimcb.genesilico.pl/rnabricks), stores information about recurrent RNA 3D motifs and their interactions, found in experimentally determined RNA structures and in RNA–protein complexes. In contrast to other similar tools (RNA 3D Motif Atlas, RNA Frabase, Rloom) RNA motifs, i.e. ‘RNA bricks’ are presented in the molecular environment, in which they were determined, including RNA, protein, metal ions, water molecules and ligands. All nucleotide residues in RNA bricks are annotated with structural quality scores that describe real-space correlation coefficients with the electron density data (if available), backbone geometry and possible steric conflicts, which can be used to identify poorly modeled residues. The database is also equipped with an algorithm for 3D motif search and comparison. The algorithm compares spatial positions of backbone atoms of the user-provided query structure and of stored RNA motifs, without relying on sequence or secondary structure information. This enables the identification of local structural similarities among evolutionarily related and unrelated RNA molecules. Besides, the search utility enables searching ‘RNA bricks’ according to sequence similarity, and makes it possible to identify motifs with modified ribonucleotide residues at specific positions. PMID:24220091
HepSEQ: International Public Health Repository for Hepatitis B
Gnaneshan, Saravanamuttu; Ijaz, Samreen; Moran, Joanne; Ramsay, Mary; Green, Jonathan
2007-01-01
HepSEQ is a repository for an extensive library of public health and molecular data relating to hepatitis B virus (HBV) infection collected from international sources. It is hosted by the Centre for Infections, Health Protection Agency (HPA), England, United Kingdom. This repository has been developed as a web-enabled, quality-controlled database to act as a tool for surveillance, HBV case management and for research. The web front-end for the database system can be accessed from . The format of the database system allows for comprehensive molecular, clinical and epidemiological data to be deposited into a functional database, to search and manipulate the stored data and to extract and visualize the information on epidemiological, virological, clinical, nucleotide sequence and mutational aspects of HBV infection through web front-end. Specific tools, built into the database, can be utilized to analyse deposited data and provide information on HBV genotype, identify mutations with known clinical significance (e.g. vaccine escape, precore and antiviral-resistant mutations) and carry out sequence homology searches against other deposited strains. Further mechanisms are also in place to allow specific tailored searches of the database to be undertaken. PMID:17130143
Yu, Jia; Blom, Jochen; Sczyrba, Alexander; Goesmann, Alexander
2017-09-10
The introduction of next generation sequencing has caused a steady increase in the amounts of data that have to be processed in modern life science. Sequence alignment plays a key role in the analysis of sequencing data e.g. within whole genome sequencing or metagenome projects. BLAST is a commonly used alignment tool that was the standard approach for more than two decades, but in the last years faster alternatives have been proposed including RapSearch, GHOSTX, and DIAMOND. Here we introduce HAMOND, an application that uses Apache Hadoop to parallelize DIAMOND computation in order to scale-out the calculation of alignments. HAMOND is fault tolerant and scalable by utilizing large cloud computing infrastructures like Amazon Web Services. HAMOND has been tested in comparative genomics analyses and showed promising results both in efficiency and accuracy. Copyright © 2017 The Author(s). Published by Elsevier B.V. All rights reserved.
Sequence search on a supercomputer.
Gotoh, O; Tagashira, Y
1986-01-10
A set of programs was developed for searching nucleic acid and protein sequence data bases for sequences similar to a given sequence. The programs, written in FORTRAN 77, were optimized for vector processing on a Hitachi S810-20 supercomputer. A search of a 500-residue protein sequence against the entire PIR data base Ver. 1.0 (1) (0.5 M residues) is carried out in a CPU time of 45 sec. About 4 min is required for an exhaustive search of a 1500-base nucleotide sequence against all mammalian sequences (1.2M bases) in Genbank Ver. 29.0. The CPU time is reduced to about a quarter with a faster version.
Binary Interval Search: a scalable algorithm for counting interval intersections.
Layer, Ryan M; Skadron, Kevin; Robins, Gabriel; Hall, Ira M; Quinlan, Aaron R
2013-01-01
The comparison of diverse genomic datasets is fundamental to understand genome biology. Researchers must explore many large datasets of genome intervals (e.g. genes, sequence alignments) to place their experimental results in a broader context and to make new discoveries. Relationships between genomic datasets are typically measured by identifying intervals that intersect, that is, they overlap and thus share a common genome interval. Given the continued advances in DNA sequencing technologies, efficient methods for measuring statistically significant relationships between many sets of genomic features are crucial for future discovery. We introduce the Binary Interval Search (BITS) algorithm, a novel and scalable approach to interval set intersection. We demonstrate that BITS outperforms existing methods at counting interval intersections. Moreover, we show that BITS is intrinsically suited to parallel computing architectures, such as graphics processing units by illustrating its utility for efficient Monte Carlo simulations measuring the significance of relationships between sets of genomic intervals. https://github.com/arq5x/bits.
POPcorn: An Online Resource Providing Access to Distributed and Diverse Maize Project Data.
Cannon, Ethalinda K S; Birkett, Scott M; Braun, Bremen L; Kodavali, Sateesh; Jennewein, Douglas M; Yilmaz, Alper; Antonescu, Valentin; Antonescu, Corina; Harper, Lisa C; Gardiner, Jack M; Schaeffer, Mary L; Campbell, Darwin A; Andorf, Carson M; Andorf, Destri; Lisch, Damon; Koch, Karen E; McCarty, Donald R; Quackenbush, John; Grotewold, Erich; Lushbough, Carol M; Sen, Taner Z; Lawrence, Carolyn J
2011-01-01
The purpose of the online resource presented here, POPcorn (Project Portal for corn), is to enhance accessibility of maize genetic and genomic resources for plant biologists. Currently, many online locations are difficult to find, some are best searched independently, and individual project websites often degrade over time-sometimes disappearing entirely. The POPcorn site makes available (1) a centralized, web-accessible resource to search and browse descriptions of ongoing maize genomics projects, (2) a single, stand-alone tool that uses web Services and minimal data warehousing to search for sequence matches in online resources of diverse offsite projects, and (3) a set of tools that enables researchers to migrate their data to the long-term model organism database for maize genetic and genomic information: MaizeGDB. Examples demonstrating POPcorn's utility are provided herein.
POPcorn: An Online Resource Providing Access to Distributed and Diverse Maize Project Data
Cannon, Ethalinda K. S.; Birkett, Scott M.; Braun, Bremen L.; Kodavali, Sateesh; Jennewein, Douglas M.; Yilmaz, Alper; Antonescu, Valentin; Antonescu, Corina; Harper, Lisa C.; Gardiner, Jack M.; Schaeffer, Mary L.; Campbell, Darwin A.; Andorf, Carson M.; Andorf, Destri; Lisch, Damon; Koch, Karen E.; McCarty, Donald R.; Quackenbush, John; Grotewold, Erich; Lushbough, Carol M.; Sen, Taner Z.; Lawrence, Carolyn J.
2011-01-01
The purpose of the online resource presented here, POPcorn (Project Portal for corn), is to enhance accessibility of maize genetic and genomic resources for plant biologists. Currently, many online locations are difficult to find, some are best searched independently, and individual project websites often degrade over time—sometimes disappearing entirely. The POPcorn site makes available (1) a centralized, web-accessible resource to search and browse descriptions of ongoing maize genomics projects, (2) a single, stand-alone tool that uses web Services and minimal data warehousing to search for sequence matches in online resources of diverse offsite projects, and (3) a set of tools that enables researchers to migrate their data to the long-term model organism database for maize genetic and genomic information: MaizeGDB. Examples demonstrating POPcorn's utility are provided herein. PMID:22253616
GGRNA: an ultrafast, transcript-oriented search engine for genes and transcripts
Naito, Yuki; Bono, Hidemasa
2012-01-01
GGRNA (http://GGRNA.dbcls.jp/) is a Google-like, ultrafast search engine for genes and transcripts. The web server accepts arbitrary words and phrases, such as gene names, IDs, gene descriptions, annotations of gene and even nucleotide/amino acid sequences through one simple search box, and quickly returns relevant RefSeq transcripts. A typical search takes just a few seconds, which dramatically enhances the usability of routine searching. In particular, GGRNA can search sequences as short as 10 nt or 4 amino acids, which cannot be handled easily by popular sequence analysis tools. Nucleotide sequences can be searched allowing up to three mismatches, or the query sequences may contain degenerate nucleotide codes (e.g. N, R, Y, S). Furthermore, Gene Ontology annotations, Enzyme Commission numbers and probe sequences of catalog microarrays are also incorporated into GGRNA, which may help users to conduct searches by various types of keywords. GGRNA web server will provide a simple and powerful interface for finding genes and transcripts for a wide range of users. All services at GGRNA are provided free of charge to all users. PMID:22641850
GGRNA: an ultrafast, transcript-oriented search engine for genes and transcripts.
Naito, Yuki; Bono, Hidemasa
2012-07-01
GGRNA (http://GGRNA.dbcls.jp/) is a Google-like, ultrafast search engine for genes and transcripts. The web server accepts arbitrary words and phrases, such as gene names, IDs, gene descriptions, annotations of gene and even nucleotide/amino acid sequences through one simple search box, and quickly returns relevant RefSeq transcripts. A typical search takes just a few seconds, which dramatically enhances the usability of routine searching. In particular, GGRNA can search sequences as short as 10 nt or 4 amino acids, which cannot be handled easily by popular sequence analysis tools. Nucleotide sequences can be searched allowing up to three mismatches, or the query sequences may contain degenerate nucleotide codes (e.g. N, R, Y, S). Furthermore, Gene Ontology annotations, Enzyme Commission numbers and probe sequences of catalog microarrays are also incorporated into GGRNA, which may help users to conduct searches by various types of keywords. GGRNA web server will provide a simple and powerful interface for finding genes and transcripts for a wide range of users. All services at GGRNA are provided free of charge to all users.
Mackey, Aaron J; Pearson, William R
2004-10-01
Relational databases are designed to integrate diverse types of information and manage large sets of search results, greatly simplifying genome-scale analyses. Relational databases are essential for management and analysis of large-scale sequence analyses, and can also be used to improve the statistical significance of similarity searches by focusing on subsets of sequence libraries most likely to contain homologs. This unit describes using relational databases to improve the efficiency of sequence similarity searching and to demonstrate various large-scale genomic analyses of homology-related data. This unit describes the installation and use of a simple protein sequence database, seqdb_demo, which is used as a basis for the other protocols. These include basic use of the database to generate a novel sequence library subset, how to extend and use seqdb_demo for the storage of sequence similarity search results and making use of various kinds of stored search results to address aspects of comparative genomic analysis.
Comet: an open-source MS/MS sequence database search tool.
Eng, Jimmy K; Jahan, Tahmina A; Hoopmann, Michael R
2013-01-01
Proteomics research routinely involves identifying peptides and proteins via MS/MS sequence database search. Thus the database search engine is an integral tool in many proteomics research groups. Here, we introduce the Comet search engine to the existing landscape of commercial and open-source database search tools. Comet is open source, freely available, and based on one of the original sequence database search tools that has been widely used for many years. © 2012 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Knutson, Stacy T.; Westwood, Brian M.; Leuthaeuser, Janelle B.; Turner, Brandon E.; Nguyendac, Don; Shea, Gabrielle; Kumar, Kiran; Hayden, Julia D.; Harper, Angela F.; Brown, Shoshana D.; Morris, John H.; Ferrin, Thomas E.; Babbitt, Patricia C.
2017-01-01
Abstract Protein function identification remains a significant problem. Solving this problem at the molecular functional level would allow mechanistic determinant identification—amino acids that distinguish details between functional families within a superfamily. Active site profiling was developed to identify mechanistic determinants. DASP and DASP2 were developed as tools to search sequence databases using active site profiling. Here, TuLIP (Two‐Level Iterative clustering Process) is introduced as an iterative, divisive clustering process that utilizes active site profiling to separate structurally characterized superfamily members into functionally relevant clusters. Underlying TuLIP is the observation that functionally relevant families (curated by Structure‐Function Linkage Database, SFLD) self‐identify in DASP2 searches; clusters containing multiple functional families do not. Each TuLIP iteration produces candidate clusters, each evaluated to determine if it self‐identifies using DASP2. If so, it is deemed a functionally relevant group. Divisive clustering continues until each structure is either a functionally relevant group member or a singlet. TuLIP is validated on enolase and glutathione transferase structures, superfamilies well‐curated by SFLD. Correlation is strong; small numbers of structures prevent statistically significant analysis. TuLIP‐identified enolase clusters are used in DASP2 GenBank searches to identify sequences sharing functional site features. Analysis shows a true positive rate of 96%, false negative rate of 4%, and maximum false positive rate of 4%. F‐measure and performance analysis on the enolase search results and comparison to GEMMA and SCI‐PHY demonstrate that TuLIP avoids the over‐division problem of these methods. Mechanistic determinants for enolase families are evaluated and shown to correlate well with literature results. PMID:28054422
Knutson, Stacy T; Westwood, Brian M; Leuthaeuser, Janelle B; Turner, Brandon E; Nguyendac, Don; Shea, Gabrielle; Kumar, Kiran; Hayden, Julia D; Harper, Angela F; Brown, Shoshana D; Morris, John H; Ferrin, Thomas E; Babbitt, Patricia C; Fetrow, Jacquelyn S
2017-04-01
Protein function identification remains a significant problem. Solving this problem at the molecular functional level would allow mechanistic determinant identification-amino acids that distinguish details between functional families within a superfamily. Active site profiling was developed to identify mechanistic determinants. DASP and DASP2 were developed as tools to search sequence databases using active site profiling. Here, TuLIP (Two-Level Iterative clustering Process) is introduced as an iterative, divisive clustering process that utilizes active site profiling to separate structurally characterized superfamily members into functionally relevant clusters. Underlying TuLIP is the observation that functionally relevant families (curated by Structure-Function Linkage Database, SFLD) self-identify in DASP2 searches; clusters containing multiple functional families do not. Each TuLIP iteration produces candidate clusters, each evaluated to determine if it self-identifies using DASP2. If so, it is deemed a functionally relevant group. Divisive clustering continues until each structure is either a functionally relevant group member or a singlet. TuLIP is validated on enolase and glutathione transferase structures, superfamilies well-curated by SFLD. Correlation is strong; small numbers of structures prevent statistically significant analysis. TuLIP-identified enolase clusters are used in DASP2 GenBank searches to identify sequences sharing functional site features. Analysis shows a true positive rate of 96%, false negative rate of 4%, and maximum false positive rate of 4%. F-measure and performance analysis on the enolase search results and comparison to GEMMA and SCI-PHY demonstrate that TuLIP avoids the over-division problem of these methods. Mechanistic determinants for enolase families are evaluated and shown to correlate well with literature results. © 2017 The Authors Protein Science published by Wiley Periodicals, Inc. on behalf of The Protein Society.
Chen, Ming; Henry, Nathan; Almsaeed, Abdullah; Zhou, Xiao; Wegrzyn, Jill; Ficklin, Stephen
2017-01-01
Abstract Tripal is an open source software package for developing biological databases with a focus on genetic and genomic data. It consists of a set of core modules that deliver essential functions for loading and displaying data records and associated attributes including organisms, sequence features and genetic markers. Beyond the core modules, community members are encouraged to contribute extension modules to build on the Tripal core and to customize Tripal for individual community needs. To expand the utility of the Tripal software system, particularly for RNASeq data, we developed two new extension modules. Tripal Elasticsearch enables fast, scalable searching of the entire content of a Tripal site as well as the construction of customized advanced searches of specific data types. We demonstrate the use of this module for searching assembled transcripts by functional annotation. A second module, Tripal Analysis Expression, houses and displays records from gene expression assays such as RNA sequencing. This includes biological source materials (biomaterials), gene expression values and protocols used to generate the data. In the case of an RNASeq experiment, this would reflect the individual organisms and tissues used to produce sequencing libraries, the normalized gene expression values derived from the RNASeq data analysis and a description of the software or code used to generate the expression values. The module will load data from common flat file formats including standard NCBI Biosample XML. Data loading, display options and other configurations can be controlled by authorized users in the Drupal administrative backend. Both modules are open source, include usage documentation, and can be found in the Tripal organization’s GitHub repository. Database URL: Tripal Elasticsearch module: https://github.com/tripal/tripal_elasticsearch Tripal Analysis Expression module: https://github.com/tripal/tripal_analysis_expression PMID:29220446
Exploiting proteomic data for genome annotation and gene model validation in Aspergillus niger.
Wright, James C; Sugden, Deana; Francis-McIntyre, Sue; Riba-Garcia, Isabel; Gaskell, Simon J; Grigoriev, Igor V; Baker, Scott E; Beynon, Robert J; Hubbard, Simon J
2009-02-04
Proteomic data is a potentially rich, but arguably unexploited, data source for genome annotation. Peptide identifications from tandem mass spectrometry provide prima facie evidence for gene predictions and can discriminate over a set of candidate gene models. Here we apply this to the recently sequenced Aspergillus niger fungal genome from the Joint Genome Institutes (JGI) and another predicted protein set from another A.niger sequence. Tandem mass spectra (MS/MS) were acquired from 1d gel electrophoresis bands and searched against all available gene models using Average Peptide Scoring (APS) and reverse database searching to produce confident identifications at an acceptable false discovery rate (FDR). 405 identified peptide sequences were mapped to 214 different A.niger genomic loci to which 4093 predicted gene models clustered, 2872 of which contained the mapped peptides. Interestingly, 13 (6%) of these loci either had no preferred predicted gene model or the genome annotators' chosen "best" model for that genomic locus was not found to be the most parsimonious match to the identified peptides. The peptides identified also boosted confidence in predicted gene structures spanning 54 introns from different gene models. This work highlights the potential of integrating experimental proteomics data into genomic annotation pipelines much as expressed sequence tag (EST) data has been. A comparison of the published genome from another strain of A.niger sequenced by DSM showed that a number of the gene models or proteins with proteomics evidence did not occur in both genomes, further highlighting the utility of the method.
Brody, Thomas; Yavatkar, Amarendra S; Kuzin, Alexander; Kundu, Mukta; Tyson, Leonard J; Ross, Jermaine; Lin, Tzu-Yang; Lee, Chi-Hon; Awasaki, Takeshi; Lee, Tzumin; Odenwald, Ward F
2012-01-01
Background: Phylogenetic footprinting has revealed that cis-regulatory enhancers consist of conserved DNA sequence clusters (CSCs). Currently, there is no systematic approach for enhancer discovery and analysis that takes full-advantage of the sequence information within enhancer CSCs. Results: We have generated a Drosophila genome-wide database of conserved DNA consisting of >100,000 CSCs derived from EvoPrints spanning over 90% of the genome. cis-Decoder database search and alignment algorithms enable the discovery of functionally related enhancers. The program first identifies conserved repeat elements within an input enhancer and then searches the database for CSCs that score highly against the input CSC. Scoring is based on shared repeats as well as uniquely shared matches, and includes measures of the balance of shared elements, a diagnostic that has proven to be useful in predicting cis-regulatory function. To demonstrate the utility of these tools, a temporally-restricted CNS neuroblast enhancer was used to identify other functionally related enhancers and analyze their structural organization. Conclusions: cis-Decoder reveals that co-regulating enhancers consist of combinations of overlapping shared sequence elements, providing insights into the mode of integration of multiple regulating transcription factors. The database and accompanying algorithms should prove useful in the discovery and analysis of enhancers involved in any developmental process. Developmental Dynamics 241:169–189, 2012. © 2011 Wiley Periodicals, Inc. Key findings A genome-wide catalog of Drosophila conserved DNA sequence clusters. cis-Decoder discovers functionally related enhancers. Functionally related enhancers share balanced sequence element copy numbers. Many enhancers function during multiple phases of development. PMID:22174086
Liu, Yu; Hong, Yang; Lin, Chun-Yuan; Hung, Che-Lun
2015-01-01
The Smith-Waterman (SW) algorithm has been widely utilized for searching biological sequence databases in bioinformatics. Recently, several works have adopted the graphic card with Graphic Processing Units (GPUs) and their associated CUDA model to enhance the performance of SW computations. However, these works mainly focused on the protein database search by using the intertask parallelization technique, and only using the GPU capability to do the SW computations one by one. Hence, in this paper, we will propose an efficient SW alignment method, called CUDA-SWfr, for the protein database search by using the intratask parallelization technique based on a CPU-GPU collaborative system. Before doing the SW computations on GPU, a procedure is applied on CPU by using the frequency distance filtration scheme (FDFS) to eliminate the unnecessary alignments. The experimental results indicate that CUDA-SWfr runs 9.6 times and 96 times faster than the CPU-based SW method without and with FDFS, respectively.
Genetic dissection of the consensus sequence for the class 2 and class 3 flagellar promoters
Wozniak, Christopher E.; Hughes, Kelly T.
2008-01-01
Summary Computational searches for DNA binding sites often utilize consensus sequences. These search models make assumptions that the frequency of a base pair in an alignment relates to the base pair’s importance in binding and presume that base pairs contribute independently to the overall interaction with the DNA binding protein. These two assumptions have generally been found to be accurate for DNA binding sites. However, these assumptions are often not satisfied for promoters, which are involved in additional steps in transcription initiation after RNA polymerase has bound to the DNA. To test these assumptions for the flagellar regulatory hierarchy, class 2 and class 3 flagellar promoters were randomly mutagenized in Salmonella. Important positions were then saturated for mutagenesis and compared to scores calculated from the consensus sequence. Double mutants were constructed to determine how mutations combined for each promoter type. Mutations in the binding site for FlhD4C2, the activator of class 2 promoters, better satisfied the assumptions for the binding model than did mutations in the class 3 promoter, which is recognized by the σ28 transcription factor. These in vivo results indicate that the activator sites within flagellar promoters can be modeled using simple assumptions but that the DNA sequences recognized by the flagellar sigma factor require more complex models. PMID:18486950
Using SQL Databases for Sequence Similarity Searching and Analysis.
Pearson, William R; Mackey, Aaron J
2017-09-13
Relational databases can integrate diverse types of information and manage large sets of similarity search results, greatly simplifying genome-scale analyses. By focusing on taxonomic subsets of sequences, relational databases can reduce the size and redundancy of sequence libraries and improve the statistical significance of homologs. In addition, by loading similarity search results into a relational database, it becomes possible to explore and summarize the relationships between all of the proteins in an organism and those in other biological kingdoms. This unit describes how to use relational databases to improve the efficiency of sequence similarity searching and demonstrates various large-scale genomic analyses of homology-related data. It also describes the installation and use of a simple protein sequence database, seqdb_demo, which is used as a basis for the other protocols. The unit also introduces search_demo, a database that stores sequence similarity search results. The search_demo database is then used to explore the evolutionary relationships between E. coli proteins and proteins in other organisms in a large-scale comparative genomic analysis. © 2017 by John Wiley & Sons, Inc. Copyright © 2017 John Wiley & Sons, Inc.
CBrowse: a SAM/BAM-based contig browser for transcriptome assembly visualization and analysis.
Li, Pei; Ji, Guoli; Dong, Min; Schmidt, Emily; Lenox, Douglas; Chen, Liangliang; Liu, Qi; Liu, Lin; Zhang, Jie; Liang, Chun
2012-09-15
To address the impending need for exploring rapidly increased transcriptomics data generated for non-model organisms, we developed CBrowse, an AJAX-based web browser for visualizing and analyzing transcriptome assemblies and contigs. Designed in a standard three-tier architecture with a data pre-processing pipeline, CBrowse is essentially a Rich Internet Application that offers many seamlessly integrated web interfaces and allows users to navigate, sort, filter, search and visualize data smoothly. The pre-processing pipeline takes the contig sequence file in FASTA format and its relevant SAM/BAM file as the input; detects putative polymorphisms, simple sequence repeats and sequencing errors in contigs and generates image, JSON and database-compatible CSV text files that are directly utilized by different web interfaces. CBowse is a generic visualization and analysis tool that facilitates close examination of assembly quality, genetic polymorphisms, sequence repeats and/or sequencing errors in transcriptome sequencing projects. CBrowse is distributed under the GNU General Public License, available at http://bioinfolab.muohio.edu/CBrowse/ liangc@muohio.edu or liangc.mu@gmail.com; glji@xmu.edu.cn Supplementary data are available at Bioinformatics online.
Madsen, James A.; Xu, Hua; Robinson, Michelle R.; Horton, Andrew P.; Shaw, Jared B.; Giles, David K.; Kaoud, Tamer S.; Dalby, Kevin N.; Trent, M. Stephen; Brodbelt, Jennifer S.
2013-01-01
The use of ultraviolet photodissociation (UVPD) for the activation and dissociation of peptide anions is evaluated for broader coverage of the proteome. To facilitate interpretation and assignment of the resulting UVPD mass spectra of peptide anions, the MassMatrix database search algorithm was modified to allow automated analysis of negative polarity MS/MS spectra. The new UVPD algorithms were developed based on the MassMatrix database search engine by adding specific fragmentation pathways for UVPD. The new UVPD fragmentation pathways in MassMatrix were rigorously and statistically optimized using two large data sets with high mass accuracy and high mass resolution for both MS1 and MS2 data acquired on an Orbitrap mass spectrometer for complex Halobacterium and HeLa proteome samples. Negative mode UVPD led to the identification of 3663 and 2350 peptides for the Halo and HeLa tryptic digests, respectively, corresponding to 655 and 645 peptides that were unique when compared with electron transfer dissociation (ETD), higher energy collision-induced dissociation, and collision-induced dissociation results for the same digests analyzed in the positive mode. In sum, 805 and 619 proteins were identified via UVPD for the Halobacterium and HeLa samples, respectively, with 49 and 50 unique proteins identified in contrast to the more conventional MS/MS methods. The algorithm also features automated charge determination for low mass accuracy data, precursor filtering (including intact charge-reduced peaks), and the ability to combine both positive and negative MS/MS spectra into a single search, and it is freely open to the public. The accuracy and specificity of the MassMatrix UVPD search algorithm was also assessed for low resolution, low mass accuracy data on a linear ion trap. Analysis of a known mixture of three mitogen-activated kinases yielded similar sequence coverage percentages for UVPD of peptide anions versus conventional collision-induced dissociation of peptide cations, and when these methods were combined into a single search, an increase of up to 13% sequence coverage was observed for the kinases. The ability to sequence peptide anions and cations in alternating scans in the same chromatographic run was also demonstrated. Because ETD has a significant bias toward identifying highly basic peptides, negative UVPD was used to improve the identification of the more acidic peptides in conjunction with positive ETD for the more basic species. In this case, tryptic peptides from the cytosolic section of HeLa cells were analyzed by polarity switching nanoLC-MS/MS utilizing ETD for cation sequencing and UVPD for anion sequencing. Relative to searching using ETD alone, positive/negative polarity switching significantly improved sequence coverages across identified proteins, resulting in a 33% increase in unique peptide identifications and more than twice the number of peptide spectral matches. PMID:23695934
Peroxidase gene discovery from the horseradish transcriptome.
Näätsaari, Laura; Krainer, Florian W; Schubert, Michael; Glieder, Anton; Thallinger, Gerhard G
2014-03-24
Horseradish peroxidases (HRPs) from Armoracia rusticana have long been utilized as reporters in various diagnostic assays and histochemical stainings. Regardless of their increasing importance in the field of life sciences and suggested uses in medical applications, chemical synthesis and other industrial applications, the HRP isoenzymes, their substrate specificities and enzymatic properties are poorly characterized. Due to lacking sequence information of natural isoenzymes and the low levels of HRP expression in heterologous hosts, commercially available HRP is still extracted as a mixture of isoenzymes from the roots of A. rusticana. In this study, a normalized, size-selected A. rusticana transcriptome library was sequenced using 454 Titanium technology. The resulting reads were assembled into 14871 isotigs with an average length of 1133 bp. Sequence databases, ORF finding and ORF characterization were utilized to identify peroxidase genes from the 14871 isotigs generated by de novo assembly. The sequences were manually reviewed and verified with Sanger sequencing of PCR amplified genomic fragments, resulting in the discovery of 28 secretory peroxidases, 23 of them previously unknown. A total of 22 isoenzymes including allelic variants were successfully expressed in Pichia pastoris and showed peroxidase activity with at least one of the substrates tested, thus enabling their development into commercial pure isoenzymes. This study demonstrates that transcriptome sequencing combined with sequence motif search is a powerful concept for the discovery and quick supply of new enzymes and isoenzymes from any plant or other eukaryotic organisms. Identification and manual verification of the sequences of 28 HRP isoenzymes do not only contribute a set of peroxidases for industrial, biological and biomedical applications, but also provide valuable information on the reliability of the approach in identifying and characterizing a large group of isoenzymes.
Peroxidase gene discovery from the horseradish transcriptome
2014-01-01
Background Horseradish peroxidases (HRPs) from Armoracia rusticana have long been utilized as reporters in various diagnostic assays and histochemical stainings. Regardless of their increasing importance in the field of life sciences and suggested uses in medical applications, chemical synthesis and other industrial applications, the HRP isoenzymes, their substrate specificities and enzymatic properties are poorly characterized. Due to lacking sequence information of natural isoenzymes and the low levels of HRP expression in heterologous hosts, commercially available HRP is still extracted as a mixture of isoenzymes from the roots of A. rusticana. Results In this study, a normalized, size-selected A. rusticana transcriptome library was sequenced using 454 Titanium technology. The resulting reads were assembled into 14871 isotigs with an average length of 1133 bp. Sequence databases, ORF finding and ORF characterization were utilized to identify peroxidase genes from the 14871 isotigs generated by de novo assembly. The sequences were manually reviewed and verified with Sanger sequencing of PCR amplified genomic fragments, resulting in the discovery of 28 secretory peroxidases, 23 of them previously unknown. A total of 22 isoenzymes including allelic variants were successfully expressed in Pichia pastoris and showed peroxidase activity with at least one of the substrates tested, thus enabling their development into commercial pure isoenzymes. Conclusions This study demonstrates that transcriptome sequencing combined with sequence motif search is a powerful concept for the discovery and quick supply of new enzymes and isoenzymes from any plant or other eukaryotic organisms. Identification and manual verification of the sequences of 28 HRP isoenzymes do not only contribute a set of peroxidases for industrial, biological and biomedical applications, but also provide valuable information on the reliability of the approach in identifying and characterizing a large group of isoenzymes. PMID:24666710
GWFASTA: server for FASTA search in eukaryotic and microbial genomes.
Issac, Biju; Raghava, G P S
2002-09-01
Similarity searches are a powerful method for solving important biological problems such as database scanning, evolutionary studies, gene prediction, and protein structure prediction. FASTA is a widely used sequence comparison tool for rapid database scanning. Here we describe the GWFASTA server that was developed to assist the FASTA user in similarity searches against partially and/or completely sequenced genomes. GWFASTA consists of more than 60 microbial genomes, eight eukaryote genomes, and proteomes of annotatedgenomes. Infact, it provides the maximum number of databases for similarity searching from a single platform. GWFASTA allows the submission of more than one sequence as a single query for a FASTA search. It also provides integrated post-processing of FASTA output, including compositional analysis of proteins, multiple sequences alignment, and phylogenetic analysis. Furthermore, it summarizes the search results organism-wise for prokaryotes and chromosome-wise for eukaryotes. Thus, the integration of different tools for sequence analyses makes GWFASTA a powerful toolfor biologists.
The HMMER Web Server for Protein Sequence Similarity Search.
Prakash, Ananth; Jeffryes, Matt; Bateman, Alex; Finn, Robert D
2017-12-08
Protein sequence similarity search is one of the most commonly used bioinformatics methods for identifying evolutionarily related proteins. In general, sequences that are evolutionarily related share some degree of similarity, and sequence-search algorithms use this principle to identify homologs. The requirement for a fast and sensitive sequence search method led to the development of the HMMER software, which in the latest version (v3.1) uses a combination of sophisticated acceleration heuristics and mathematical and computational optimizations to enable the use of profile hidden Markov models (HMMs) for sequence analysis. The HMMER Web server provides a common platform by linking the HMMER algorithms to databases, thereby enabling the search for homologs, as well as providing sequence and functional annotation by linking external databases. This unit describes three basic protocols and two alternate protocols that explain how to use the HMMER Web server using various input formats and user defined parameters. © 2017 by John Wiley & Sons, Inc. Copyright © 2017 John Wiley & Sons, Inc.
Berillo, Olga; Régnier, Mireille; Ivashchenko, Anatoly
2014-01-01
microRNAs are small RNA molecules that inhibit the translation of target genes. microRNA binding sites are located in the untranslated regions as well as in the coding domains. We describe TmiRUSite and TmiROSite scripts developed using python as tools for the extraction of nucleotide sequences for miRNA binding sites with their encoded amino acid residue sequences. The scripts allow for retrieving a set of additional sequences at left and at right from the binding site. The scripts presents all received data in table formats that are easy to analyse further. The predicted data finds utility in molecular and evolutionary biology studies. They find use in studying miRNA binding sites in animals and plants. TmiRUSite and TmiROSite scripts are available for free from authors upon request and at https: //sites.google.com/site/malaheenee/downloads for download.
Sousa, Filipa L; Parente, Daniel J; Hessman, Jacob A; Chazelle, Allen; Teichmann, Sarah A; Swint-Kruse, Liskin
2016-09-01
The AlloRep database (www.AlloRep.org) (Sousa et al., 2016) [1] compiles extensive sequence, mutagenesis, and structural information for the LacI/GalR family of transcription regulators. Sequence alignments are presented for >3000 proteins in 45 paralog subfamilies and as a subsampled alignment of the whole family. Phenotypic and biochemical data on almost 6000 mutants have been compiled from an exhaustive search of the literature; citations for these data are included herein. These data include information about oligomerization state, stability, DNA binding and allosteric regulation. Protein structural data for 65 proteins are presented as easily-accessible, residue-contact networks. Finally, this article includes example queries to enable the use of the AlloRep database. See the related article, "AlloRep: a repository of sequence, structural and mutagenesis data for the LacI/GalR transcription regulators" (Sousa et al., 2016) [1].
Binary Interval Search: a scalable algorithm for counting interval intersections
Layer, Ryan M.; Skadron, Kevin; Robins, Gabriel; Hall, Ira M.; Quinlan, Aaron R.
2013-01-01
Motivation: The comparison of diverse genomic datasets is fundamental to understand genome biology. Researchers must explore many large datasets of genome intervals (e.g. genes, sequence alignments) to place their experimental results in a broader context and to make new discoveries. Relationships between genomic datasets are typically measured by identifying intervals that intersect, that is, they overlap and thus share a common genome interval. Given the continued advances in DNA sequencing technologies, efficient methods for measuring statistically significant relationships between many sets of genomic features are crucial for future discovery. Results: We introduce the Binary Interval Search (BITS) algorithm, a novel and scalable approach to interval set intersection. We demonstrate that BITS outperforms existing methods at counting interval intersections. Moreover, we show that BITS is intrinsically suited to parallel computing architectures, such as graphics processing units by illustrating its utility for efficient Monte Carlo simulations measuring the significance of relationships between sets of genomic intervals. Availability: https://github.com/arq5x/bits. Contact: arq5x@virginia.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:23129298
Introduction to the fathead minnow genome browser and ...
Ab initio gene prediction and evidence alignment were used to produce the first annotations for the fathead minnow SOAPdenovo genome assembly. Additionally, a genome browser hosted at genome.setac.org provides simplified access to the annotation data in context with fathead minnow genomic sequence. This work is meant to extend the utility of fathead minnow genome as a resource and enable the continued development of this species as a model organism. The fathead minnow (Pimephales promelas) is a laboratory model organism widely used in regulatory toxicity testing and ecotoxicology research. Despite, the wealth of toxicological data for this organism, until recently genome scale information was lacking for the species, which limited the utility of the species for pathway-based toxicity testing and research. As part of a EPA Pathfinder Innovation Project, next generation sequencing was applied to generate a draft genome assembly, which was published in 2016. However, application of those genome-scale sequencing resources was still limited by the lack of available gene annotations for fathead minnow. Here we report on development of a first generation genome annotation for fathead minnow and the dissemination of that information through a web-based browser that makes it easy to search for genes of interest, extract the corresponding sequence, identify intron and exon boundaries and regulatory regions, and align the computationally predicted genes with other supporti
Exploiting proteomic data for genome annotation and gene model validation in Aspergillus niger
Wright, James C; Sugden, Deana; Francis-McIntyre, Sue; Riba-Garcia, Isabel; Gaskell, Simon J; Grigoriev, Igor V; Baker, Scott E; Beynon, Robert J; Hubbard, Simon J
2009-01-01
Background Proteomic data is a potentially rich, but arguably unexploited, data source for genome annotation. Peptide identifications from tandem mass spectrometry provide prima facie evidence for gene predictions and can discriminate over a set of candidate gene models. Here we apply this to the recently sequenced Aspergillus niger fungal genome from the Joint Genome Institutes (JGI) and another predicted protein set from another A.niger sequence. Tandem mass spectra (MS/MS) were acquired from 1d gel electrophoresis bands and searched against all available gene models using Average Peptide Scoring (APS) and reverse database searching to produce confident identifications at an acceptable false discovery rate (FDR). Results 405 identified peptide sequences were mapped to 214 different A.niger genomic loci to which 4093 predicted gene models clustered, 2872 of which contained the mapped peptides. Interestingly, 13 (6%) of these loci either had no preferred predicted gene model or the genome annotators' chosen "best" model for that genomic locus was not found to be the most parsimonious match to the identified peptides. The peptides identified also boosted confidence in predicted gene structures spanning 54 introns from different gene models. Conclusion This work highlights the potential of integrating experimental proteomics data into genomic annotation pipelines much as expressed sequence tag (EST) data has been. A comparison of the published genome from another strain of A.niger sequenced by DSM showed that a number of the gene models or proteins with proteomics evidence did not occur in both genomes, further highlighting the utility of the method. PMID:19193216
Research on parallel algorithm for sequential pattern mining
NASA Astrophysics Data System (ADS)
Zhou, Lijuan; Qin, Bai; Wang, Yu; Hao, Zhongxiao
2008-03-01
Sequential pattern mining is the mining of frequent sequences related to time or other orders from the sequence database. Its initial motivation is to discover the laws of customer purchasing in a time section by finding the frequent sequences. In recent years, sequential pattern mining has become an important direction of data mining, and its application field has not been confined to the business database and has extended to new data sources such as Web and advanced science fields such as DNA analysis. The data of sequential pattern mining has characteristics as follows: mass data amount and distributed storage. Most existing sequential pattern mining algorithms haven't considered the above-mentioned characteristics synthetically. According to the traits mentioned above and combining the parallel theory, this paper puts forward a new distributed parallel algorithm SPP(Sequential Pattern Parallel). The algorithm abides by the principal of pattern reduction and utilizes the divide-and-conquer strategy for parallelization. The first parallel task is to construct frequent item sets applying frequent concept and search space partition theory and the second task is to structure frequent sequences using the depth-first search method at each processor. The algorithm only needs to access the database twice and doesn't generate the candidated sequences, which abates the access time and improves the mining efficiency. Based on the random data generation procedure and different information structure designed, this paper simulated the SPP algorithm in a concrete parallel environment and implemented the AprioriAll algorithm. The experiments demonstrate that compared with AprioriAll, the SPP algorithm had excellent speedup factor and efficiency.
Seqenv: linking sequences to environments through text mining.
Sinclair, Lucas; Ijaz, Umer Z; Jensen, Lars Juhl; Coolen, Marco J L; Gubry-Rangin, Cecile; Chroňáková, Alica; Oulas, Anastasis; Pavloudi, Christina; Schnetzer, Julia; Weimann, Aaron; Ijaz, Ali; Eiler, Alexander; Quince, Christopher; Pafilis, Evangelos
2016-01-01
Understanding the distribution of taxa and associated traits across different environments is one of the central questions in microbial ecology. High-throughput sequencing (HTS) studies are presently generating huge volumes of data to address this biogeographical topic. However, these studies are often focused on specific environment types or processes leading to the production of individual, unconnected datasets. The large amounts of legacy sequence data with associated metadata that exist can be harnessed to better place the genetic information found in these surveys into a wider environmental context. Here we introduce a software program, seqenv, to carry out precisely such a task. It automatically performs similarity searches of short sequences against the "nt" nucleotide database provided by NCBI and, out of every hit, extracts-if it is available-the textual metadata field. After collecting all the isolation sources from all the search results, we run a text mining algorithm to identify and parse words that are associated with the Environmental Ontology (EnvO) controlled vocabulary. This, in turn, enables us to determine both in which environments individual sequences or taxa have previously been observed and, by weighted summation of those results, to summarize complete samples. We present two demonstrative applications of seqenv to a survey of ammonia oxidizing archaea as well as to a plankton paleome dataset from the Black Sea. These demonstrate the ability of the tool to reveal novel patterns in HTS and its utility in the fields of environmental source tracking, paleontology, and studies of microbial biogeography. To install seqenv, go to: https://github.com/xapple/seqenv.
Artificially designed pathogens - a diagnostic option for future military deployments.
Zautner, Andreas E; Masanta, Wycliffe O; Hinz, Rebecca; Hagen, Ralf Matthias; Frickmann, Hagen
2015-01-01
Diagnostic microbial isolates of bio-safety levels 3 and 4 are difficult to handle in medical field camps under military deployment settings. International transport of such isolates is challenging due to restrictions by the International Air Transport Association. An alternative option might be inactivation and sequencing of the pathogen at the deployment site with subsequent sequence-based revitalization in well-equipped laboratories in the home country for further scientific assessment. A literature review was written based on a PubMed search. First described for poliovirus in 2002, de novo synthesis of pathogens based on their sequence information has become a well-established procedure in science. Successful syntheses have been demonstrated for both viruses and prokaryotes. However, the technology is not yet available for routine diagnostic purposes. Due to the potential utility of diagnostic sequencing and sequence-based de novo synthesis of pathogens, it seems worthwhile to establish the technology for diagnostic purposes over the intermediate term. This is particularly true for resource-restricted deployment settings, where safe handling of harmful pathogens cannot always be guaranteed.
Faster sequence homology searches by clustering subsequences.
Suzuki, Shuji; Kakuta, Masanori; Ishida, Takashi; Akiyama, Yutaka
2015-04-15
Sequence homology searches are used in various fields. New sequencing technologies produce huge amounts of sequence data, which continuously increase the size of sequence databases. As a result, homology searches require large amounts of computational time, especially for metagenomic analysis. We developed a fast homology search method based on database subsequence clustering, and implemented it as GHOSTZ. This method clusters similar subsequences from a database to perform an efficient seed search and ungapped extension by reducing alignment candidates based on triangle inequality. The database subsequence clustering technique achieved an ∼2-fold increase in speed without a large decrease in search sensitivity. When we measured with metagenomic data, GHOSTZ is ∼2.2-2.8 times faster than RAPSearch and is ∼185-261 times faster than BLASTX. The source code is freely available for download at http://www.bi.cs.titech.ac.jp/ghostz/ akiyama@cs.titech.ac.jp Supplementary data are available at Bioinformatics online. © The Author 2014. Published by Oxford University Press.
Li, Guo-Zhong; Vissers, Johannes P C; Silva, Jeffrey C; Golick, Dan; Gorenstein, Marc V; Geromanos, Scott J
2009-03-01
A novel database search algorithm is presented for the qualitative identification of proteins over a wide dynamic range, both in simple and complex biological samples. The algorithm has been designed for the analysis of data originating from data independent acquisitions, whereby multiple precursor ions are fragmented simultaneously. Measurements used by the algorithm include retention time, ion intensities, charge state, and accurate masses on both precursor and product ions from LC-MS data. The search algorithm uses an iterative process whereby each iteration incrementally increases the selectivity, specificity, and sensitivity of the overall strategy. Increased specificity is obtained by utilizing a subset database search approach, whereby for each subsequent stage of the search, only those peptides from securely identified proteins are queried. Tentative peptide and protein identifications are ranked and scored by their relative correlation to a number of models of known and empirically derived physicochemical attributes of proteins and peptides. In addition, the algorithm utilizes decoy database techniques for automatically determining the false positive identification rates. The search algorithm has been tested by comparing the search results from a four-protein mixture, the same four-protein mixture spiked into a complex biological background, and a variety of other "system" type protein digest mixtures. The method was validated independently by data dependent methods, while concurrently relying on replication and selectivity. Comparisons were also performed with other commercially and publicly available peptide fragmentation search algorithms. The presented results demonstrate the ability to correctly identify peptides and proteins from data independent acquisition strategies with high sensitivity and specificity. They also illustrate a more comprehensive analysis of the samples studied; providing approximately 20% more protein identifications, compared to a more conventional data directed approach using the same identification criteria, with a concurrent increase in both sequence coverage and the number of modified peptides.
Identification and analysis of multigene families by comparison of exon fingerprints.
Brown, N P; Whittaker, A J; Newell, W R; Rawlings, C J; Beck, S
1995-06-02
Gene families are often recognised by sequence homology using similarity searching to find relationships, however, genomic sequence data provides gene architectural information not used by conventional search methods. In particular, intron positions and phases are expected to be relatively conserved features, because mis-splicing and reading frame shifts should be selected against. A fast search technique capable of detecting possible weak sequence homologies apparent at the intron/exon level of gene organization is presented for comparing spliceosomal genes and gene fragments. FINEX compares strings of exons delimited by intron/exon boundary positions and intron phases (exon fingerprint) using a global dynamic programming algorithm with a combined intron phase identity and exon size dissimilarity score. Exon fingerprints are typically two orders of magnitude smaller than their nucleic acid sequence counterparts giving rise to fast search times: a ranked search against a library of 6755 fingerprints for a typical three exon fingerprint completes in under 30 seconds on an ordinary workstation, while a worst case largest fingerprint of 52 exons completes in just over one minute. The short "sequence" length of exon fingerprints in comparisons is compensated for by the large exon alphabet compounded of intron phase types and a wide range of exon sizes, the latter contributing the most information to alignments. FINEX performs better in some searches than conventional methods, finding matches with similar exon organization, but low sequence homology. A search using a human serum albumin finds all members of the multigene family in the FINEX database at the top of the search ranking, despite very low amino acid percentage identities between family members. The method should complement conventional sequence searching and alignment techniques, offering a means of identifying otherwise hard to detect homologies where genomic data are available.
A Globally Optimal Particle Tracking Technique for Stereo Imaging Velocimetry Experiments
NASA Technical Reports Server (NTRS)
McDowell, Mark
2008-01-01
An important phase of any Stereo Imaging Velocimetry experiment is particle tracking. Particle tracking seeks to identify and characterize the motion of individual particles entrained in a fluid or air experiment. We analyze a cylindrical chamber filled with water and seeded with density-matched particles. In every four-frame sequence, we identify a particle track by assigning a unique track label for each camera image. The conventional approach to particle tracking is to use an exhaustive tree-search method utilizing greedy algorithms to reduce search times. However, these types of algorithms are not optimal due to a cascade effect of incorrect decisions upon adjacent tracks. We examine the use of a guided evolutionary neural net with simulated annealing to arrive at a globally optimal assignment of tracks. The net is guided both by the minimization of the search space through the use of prior limiting assumptions about valid tracks and by a strategy which seeks to avoid high-energy intermediate states which can trap the net in a local minimum. A stochastic search algorithm is used in place of back-propagation of error to further reduce the chance of being trapped in an energy well. Global optimization is achieved by minimizing an objective function, which includes both track smoothness and particle-image utilization parameters. In this paper we describe our model and present our experimental results. We compare our results with a nonoptimizing, predictive tracker and obtain an average increase in valid track yield of 27 percent
Li, Gao-Peng; Jiang, Liang; Ni, Jia-Zuan; Liu, Qiong; Zhang, Yan
2014-10-17
Selenium (Se) and sulfur (S) are closely related elements that exhibit similar chemical properties. Some genes related to S metabolism are also involved in Se utilization in many organisms. However, the evolutionary relationship between the two utilization traits is unclear. In this study, we conducted a comparative analysis of the selenophosphate synthetase (SelD) family, a key protein for all known Se utilization traits, in all sequenced archaea. Our search showed a very limited distribution of SelD and Se utilization in this kingdom. Interestingly, a SelD-like protein was detected in two orders of Crenarchaeota: Sulfolobales and Thermoproteales. Sequence and phylogenetic analyses revealed that SelD-like protein contains the same domain and conserved functional residues as those of SelD, and might be involved in S metabolism in these S-reducing organisms. Further genome-wide analysis of patterns of gene occurrence in different thermoproteales suggested that several genes, including SirA-like, Prx-like and adenylylsulfate reductase, were strongly related to SelD-like gene. Based on these findings, we proposed a simple model wherein SelD-like may play an important role in the biosynthesis of certain thiophosphate compound. Our data suggest novel genes involved in S metabolism in hyperthermophilic S-reducing archaea, and may provide a new window for understanding the complex relationship between Se and S metabolism in archaea.
RAPSearch: a fast protein similarity search tool for short reads
2011-01-01
Background Next Generation Sequencing (NGS) is producing enormous corpuses of short DNA reads, affecting emerging fields like metagenomics. Protein similarity search--a key step to achieve annotation of protein-coding genes in these short reads, and identification of their biological functions--faces daunting challenges because of the very sizes of the short read datasets. Results We developed a fast protein similarity search tool RAPSearch that utilizes a reduced amino acid alphabet and suffix array to detect seeds of flexible length. For short reads (translated in 6 frames) we tested, RAPSearch achieved ~20-90 times speedup as compared to BLASTX. RAPSearch missed only a small fraction (~1.3-3.2%) of BLASTX similarity hits, but it also discovered additional homologous proteins (~0.3-2.1%) that BLASTX missed. By contrast, BLAT, a tool that is even slightly faster than RAPSearch, had significant loss of sensitivity as compared to RAPSearch and BLAST. Conclusions RAPSearch is implemented as open-source software and is accessible at http://omics.informatics.indiana.edu/mg/RAPSearch. It enables faster protein similarity search. The application of RAPSearch in metageomics has also been demonstrated. PMID:21575167
DOE R&D Accomplishments Database
Chandonia, John-Marc; Hon, Gary; Walker, Nigel S.; Lo Conte, Loredana; Koehl, Patrice; Levitt, Michael; Brenner, Steven E.
2003-09-15
The ASTRAL compendium provides several databases and tools to aid in the analysis of protein structures, particularly through the use of their sequences. Partially derived from the SCOP database of protein structure domains, it includes sequences for each domain and other resources useful for studying these sequences and domain structures. The current release of ASTRAL contains 54,745 domains, more than three times as many as the initial release four years ago. ASTRAL has undergone major transformations in the past two years. In addition to several complete updates each year, ASTRAL is now updated on a weekly basis with preliminary classifications of domains from newly released PDB structures. These classifications are available as a stand-alone database, as well as available integrated into other ASTRAL databases such as representative subsets. To enhance the utility of ASTRAL to structural biologists, all SCOP domains are now made available as PDB-style coordinate files as well as sequences. In addition to sequences and representative subsets based on SCOP domains, sequences and subsets based on PDB chains are newly included in ASTRAL. Several search tools have been added to ASTRAL to facilitate retrieval of data by individual users and automated methods.
Nucleic Acid Extraction from Synthetic Mars Analog Soils for in situ Life Detection
NASA Astrophysics Data System (ADS)
Mojarro, Angel; Ruvkun, Gary; Zuber, Maria T.; Carr, Christopher E.
2017-08-01
Biological informational polymers such as nucleic acids have the potential to provide unambiguous evidence of life beyond Earth. To this end, we are developing an automated in situ life-detection instrument that integrates nucleic acid extraction and nanopore sequencing: the Search for Extra-Terrestrial Genomes (SETG) instrument. Our goal is to isolate and determine the sequence of nucleic acids from extant or preserved life on Mars, if, for example, there is common ancestry to life on Mars and Earth. As is true of metagenomic analysis of terrestrial environmental samples, the SETG instrument must isolate nucleic acids from crude samples and then determine the DNA sequence of the unknown nucleic acids. Our initial DNA extraction experiments resulted in low to undetectable amounts of DNA due to soil chemistry-dependent soil-DNA interactions, namely adsorption to mineral surfaces, binding to divalent/trivalent cations, destruction by iron redox cycling, and acidic conditions. Subsequently, we developed soil-specific extraction protocols that increase DNA yields through a combination of desalting, utilization of competitive binders, and promotion of anaerobic conditions. Our results suggest that a combination of desalting and utilizing competitive binders may establish a "universal" nucleic acid extraction protocol suitable for analyzing samples from diverse soils on Mars.
Nucleic Acid Extraction from Synthetic Mars Analog Soils for in situ Life Detection.
Mojarro, Angel; Ruvkun, Gary; Zuber, Maria T; Carr, Christopher E
2017-08-01
Biological informational polymers such as nucleic acids have the potential to provide unambiguous evidence of life beyond Earth. To this end, we are developing an automated in situ life-detection instrument that integrates nucleic acid extraction and nanopore sequencing: the Search for Extra-Terrestrial Genomes (SETG) instrument. Our goal is to isolate and determine the sequence of nucleic acids from extant or preserved life on Mars, if, for example, there is common ancestry to life on Mars and Earth. As is true of metagenomic analysis of terrestrial environmental samples, the SETG instrument must isolate nucleic acids from crude samples and then determine the DNA sequence of the unknown nucleic acids. Our initial DNA extraction experiments resulted in low to undetectable amounts of DNA due to soil chemistry-dependent soil-DNA interactions, namely adsorption to mineral surfaces, binding to divalent/trivalent cations, destruction by iron redox cycling, and acidic conditions. Subsequently, we developed soil-specific extraction protocols that increase DNA yields through a combination of desalting, utilization of competitive binders, and promotion of anaerobic conditions. Our results suggest that a combination of desalting and utilizing competitive binders may establish a "universal" nucleic acid extraction protocol suitable for analyzing samples from diverse soils on Mars. Key Words: Life-detection instruments-Nucleic acids-Mars-Panspermia. Astrobiology 17, 747-760.
A new method to improve network topological similarity search: applied to fold recognition
Lhota, John; Hauptman, Ruth; Hart, Thomas; Ng, Clara; Xie, Lei
2015-01-01
Motivation: Similarity search is the foundation of bioinformatics. It plays a key role in establishing structural, functional and evolutionary relationships between biological sequences. Although the power of the similarity search has increased steadily in recent years, a high percentage of sequences remain uncharacterized in the protein universe. Thus, new similarity search strategies are needed to efficiently and reliably infer the structure and function of new sequences. The existing paradigm for studying protein sequence, structure, function and evolution has been established based on the assumption that the protein universe is discrete and hierarchical. Cumulative evidence suggests that the protein universe is continuous. As a result, conventional sequence homology search methods may be not able to detect novel structural, functional and evolutionary relationships between proteins from weak and noisy sequence signals. To overcome the limitations in existing similarity search methods, we propose a new algorithmic framework—Enrichment of Network Topological Similarity (ENTS)—to improve the performance of large scale similarity searches in bioinformatics. Results: We apply ENTS to a challenging unsolved problem: protein fold recognition. Our rigorous benchmark studies demonstrate that ENTS considerably outperforms state-of-the-art methods. As the concept of ENTS can be applied to any similarity metric, it may provide a general framework for similarity search on any set of biological entities, given their representation as a network. Availability and implementation: Source code freely available upon request Contact: lxie@iscb.org PMID:25717198
Nonuniformity correction for an infrared focal plane array based on diamond search block matching.
Sheng-Hui, Rong; Hui-Xin, Zhou; Han-Lin, Qin; Rui, Lai; Kun, Qian
2016-05-01
In scene-based nonuniformity correction algorithms, artificial ghosting and image blurring degrade the correction quality severely. In this paper, an improved algorithm based on the diamond search block matching algorithm and the adaptive learning rate is proposed. First, accurate transform pairs between two adjacent frames are estimated by the diamond search block matching algorithm. Then, based on the error between the corresponding transform pairs, the gradient descent algorithm is applied to update correction parameters. During the process of gradient descent, the local standard deviation and a threshold are utilized to control the learning rate to avoid the accumulation of matching error. Finally, the nonuniformity correction would be realized by a linear model with updated correction parameters. The performance of the proposed algorithm is thoroughly studied with four real infrared image sequences. Experimental results indicate that the proposed algorithm can reduce the nonuniformity with less ghosting artifacts in moving areas and can also overcome the problem of image blurring in static areas.
Histoplasma capsulatum proteome response to decreased iron availability
Winters, Michael S; Spellman, Daniel S; Chan, Qilin; Gomez, Francisco J; Hernandez, Margarita; Catron, Brittany; Smulian, Alan G; Neubert, Thomas A; Deepe, George S
2008-01-01
Background A fundamental pathogenic feature of the fungus Histoplasma capsulatum is its ability to evade innate and adaptive immune defenses. Once ingested by macrophages the organism is faced with several hostile environmental conditions including iron limitation. H. capsulatum can establish a persistent state within the macrophage. A gap in knowledge exists because the identities and number of proteins regulated by the organism under host conditions has yet to be defined. Lack of such knowledge is an important problem because until these proteins are identified it is unlikely that they can be targeted as new and innovative treatment for histoplasmosis. Results To investigate the proteomic response by H. capsulatum to decreasing iron availability we have created H. capsulatum protein/genomic databases compatible with current mass spectrometric (MS) search engines. Databases were assembled from the H. capsulatum G217B strain genome using gene prediction programs and expressed sequence tag (EST) libraries. Searching these databases with MS data generated from two dimensional (2D) in-gel digestions of proteins resulted in over 50% more proteins identified compared to searching the publicly available fungal databases alone. Using 2D gel electrophoresis combined with statistical analysis we discovered 42 H. capsulatum proteins whose abundance was significantly modulated when iron concentrations were lowered. Altered proteins were identified by mass spectrometry and database searching to be involved in glycolysis, the tricarboxylic acid cycle, lysine metabolism, protein synthesis, and one protein sequence whose function was unknown. Conclusion We have created a bioinformatics platform for H. capsulatum and demonstrated the utility of a proteomic approach by identifying a shift in metabolism the organism utilizes to cope with the hostile conditions provided by the host. We have shown that enzyme transcripts regulated by other fungal pathogens in response to lowering iron availability are also regulated in H. capsulatum at the protein level. We also identified H. capsulatum proteins sensitive to iron level reductions which have yet to be connected to iron availability in other pathogens. These data also indicate the complexity of the response by H. capsulatum to nutritional deprivation. Finally, we demonstrate the importance of a strain specific gene/protein database for H. capsulatum proteomic analysis. PMID:19108728
LETTER TO THE EDITOR: Exhaustive search for low-autocorrelation binary sequences
NASA Astrophysics Data System (ADS)
Mertens, S.
1996-09-01
Binary sequences with low autocorrelations are important in communication engineering and in statistical mechanics as ground states of the Bernasconi model. Computer searches are the main tool in the construction of such sequences. Owing to the exponential size 0305-4470/29/18/005/img1 of the configuration space, exhaustive searches are limited to short sequences. We discuss an exhaustive search algorithm with run-time characteristic 0305-4470/29/18/005/img2 and apply it to compile a table of exact ground states of the Bernasconi model up to N = 48. The data suggest F > 9 for the optimal merit factor in the limit 0305-4470/29/18/005/img3.
Mantis: A Fast, Small, and Exact Large-Scale Sequence-Search Index.
Pandey, Prashant; Almodaresi, Fatemeh; Bender, Michael A; Ferdman, Michael; Johnson, Rob; Patro, Rob
2018-06-18
Sequence-level searches on large collections of RNA sequencing experiments, such as the NCBI Sequence Read Archive (SRA), would enable one to ask many questions about the expression or variation of a given transcript in a population. Existing approaches, such as the sequence Bloom tree, suffer from fundamental limitations of the Bloom filter, resulting in slow build and query times, less-than-optimal space usage, and potentially large numbers of false-positives. This paper introduces Mantis, a space-efficient system that uses new data structures to index thousands of raw-read experiments and facilitates large-scale sequence searches. In our evaluation, index construction with Mantis is 6× faster and yields a 20% smaller index than the state-of-the-art split sequence Bloom tree (SSBT). For queries, Mantis is 6-108× faster than SSBT and has no false-positives or -negatives. For example, Mantis was able to search for all 200,400 known human transcripts in an index of 2,652 RNA sequencing experiments in 82 min; SSBT took close to 4 days. Copyright © 2018 Elsevier Inc. All rights reserved.
Paparo, M.; Benko, J. M.; Hareter, M.; ...
2016-05-11
In this study, a sequence search method was developed to search the regular frequency spacing in δ Scuti stars through visual inspection and an algorithmic search. We searched for sequences of quasi-equally spaced frequencies, containing at least four members per sequence, in 90 δ Scuti stars observed by CoRoT. We found an unexpectedly large number of independent series of regular frequency spacing in 77 δ Scuti stars (from one to eight sequences) in the non-asymptotic regime. We introduce the sequence search method presenting the sequences and echelle diagram of CoRoT 102675756 and the structure of the algorithmic search. Four sequencesmore » (echelle ridges) were found in the 5–21 d –1 region where the pairs of the sequences are shifted (between 0.5 and 0.59 d –1) by twice the value of the estimated rotational splitting frequency (0.269 d –1). The general conclusions for the whole sample are also presented in this paper. The statistics of the spacings derived by the sequence search method, by FT (Fourier transform of the frequencies), and the statistics of the shifts are also compared. In many stars more than one almost equally valid spacing appeared. The model frequencies of FG Vir and their rotationally split components were used to formulate the possible explanation that one spacing is the large separation while the other is the sum of the large separation and the rotational frequency. In CoRoT 102675756, the two spacings (2.249 and 1.977 d –1) are in better agreement with the sum of a possible 1.710 d –1 large separation and two or one times, respectively, the value of the rotational frequency.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)
Paparo, M.; Benko, J. M.; Hareter, M.
In this study, a sequence search method was developed to search the regular frequency spacing in δ Scuti stars through visual inspection and an algorithmic search. We searched for sequences of quasi-equally spaced frequencies, containing at least four members per sequence, in 90 δ Scuti stars observed by CoRoT. We found an unexpectedly large number of independent series of regular frequency spacing in 77 δ Scuti stars (from one to eight sequences) in the non-asymptotic regime. We introduce the sequence search method presenting the sequences and echelle diagram of CoRoT 102675756 and the structure of the algorithmic search. Four sequencesmore » (echelle ridges) were found in the 5–21 d –1 region where the pairs of the sequences are shifted (between 0.5 and 0.59 d –1) by twice the value of the estimated rotational splitting frequency (0.269 d –1). The general conclusions for the whole sample are also presented in this paper. The statistics of the spacings derived by the sequence search method, by FT (Fourier transform of the frequencies), and the statistics of the shifts are also compared. In many stars more than one almost equally valid spacing appeared. The model frequencies of FG Vir and their rotationally split components were used to formulate the possible explanation that one spacing is the large separation while the other is the sum of the large separation and the rotational frequency. In CoRoT 102675756, the two spacings (2.249 and 1.977 d –1) are in better agreement with the sum of a possible 1.710 d –1 large separation and two or one times, respectively, the value of the rotational frequency.« less
Attractors in Sequence Space: Agent-Based Exploration of MHC I Binding Peptides.
Jäger, Natalie; Wisniewska, Joanna M; Hiss, Jan A; Freier, Anja; Losch, Florian O; Walden, Peter; Wrede, Paul; Schneider, Gisbert
2010-01-12
Ant Colony Optimization (ACO) is a meta-heuristic that utilizes a computational analogue of ant trail pheromones to solve combinatorial optimization problems. The size of the ant colony and the representation of the ants' pheromone trails is unique referring to the given optimization problem. In the present study, we employed ACO to generate novel peptides that stabilize MHC I protein on the plasma membrane of a murine lymphoma cell line. A jury of feedforward neural network classifiers served as fitness function for peptide design by ACO. Bioactive murine MHC I H-2K(b) stabilizing as well as nonstabilizing octapeptides were designed, synthesized and tested. These peptides reveal residue motifs that are relevant for MHC I receptor binding. We demonstrate how the performance of the implemented ACO algorithm depends on the colony size and the size of the search space. The actual peptide design process by ACO constitutes a search path in sequence space that can be visualized as trajectories on a self-organizing map (SOM). By projecting the sequence space on a SOM we visualize the convergence of the different solutions that emerge during the optimization process in sequence space. The SOM representation reveals attractors in sequence space for MHC I binding peptides. The combination of ACO and SOM enables systematic peptide optimization. This technique allows for the rational design of various types of bioactive peptides with minimal experimental effort. Here, we demonstrate its successful application to the design of MHC-I binding and nonbinding peptides which exhibit substantial bioactivity in a cell-based assay. Copyright © 2010 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
CLAST: CUDA implemented large-scale alignment search tool.
Yano, Masahiro; Mori, Hiroshi; Akiyama, Yutaka; Yamada, Takuji; Kurokawa, Ken
2014-12-11
Metagenomics is a powerful methodology to study microbial communities, but it is highly dependent on nucleotide sequence similarity searching against sequence databases. Metagenomic analyses with next-generation sequencing technologies produce enormous numbers of reads from microbial communities, and many reads are derived from microbes whose genomes have not yet been sequenced, limiting the usefulness of existing sequence similarity search tools. Therefore, there is a clear need for a sequence similarity search tool that can rapidly detect weak similarity in large datasets. We developed a tool, which we named CLAST (CUDA implemented large-scale alignment search tool), that enables analyses of millions of reads and thousands of reference genome sequences, and runs on NVIDIA Fermi architecture graphics processing units. CLAST has four main advantages over existing alignment tools. First, CLAST was capable of identifying sequence similarities ~80.8 times faster than BLAST and 9.6 times faster than BLAT. Second, CLAST executes global alignment as the default (local alignment is also an option), enabling CLAST to assign reads to taxonomic and functional groups based on evolutionarily distant nucleotide sequences with high accuracy. Third, CLAST does not need a preprocessed sequence database like Burrows-Wheeler Transform-based tools, and this enables CLAST to incorporate large, frequently updated sequence databases. Fourth, CLAST requires <2 GB of main memory, making it possible to run CLAST on a standard desktop computer or server node. CLAST achieved very high speed (similar to the Burrows-Wheeler Transform-based Bowtie 2 for long reads) and sensitivity (equal to BLAST, BLAT, and FR-HIT) without the need for extensive database preprocessing or a specialized computing platform. Our results demonstrate that CLAST has the potential to be one of the most powerful and realistic approaches to analyze the massive amount of sequence data from next-generation sequencing technologies.
Hawkins, Troy; Chitale, Meghana; Luban, Stanislav; Kihara, Daisuke
2009-02-15
Protein function prediction is a central problem in bioinformatics, increasing in importance recently due to the rapid accumulation of biological data awaiting interpretation. Sequence data represents the bulk of this new stock and is the obvious target for consideration as input, as newly sequenced organisms often lack any other type of biological characterization. We have previously introduced PFP (Protein Function Prediction) as our sequence-based predictor of Gene Ontology (GO) functional terms. PFP interprets the results of a PSI-BLAST search by extracting and scoring individual functional attributes, searching a wide range of E-value sequence matches, and utilizing conventional data mining techniques to fill in missing information. We have shown it to be effective in predicting both specific and low-resolution functional attributes when sufficient data is unavailable. Here we describe (1) significant improvements to the PFP infrastructure, including the addition of prediction significance and confidence scores, (2) a thorough benchmark of performance and comparisons to other related prediction methods, and (3) applications of PFP predictions to genome-scale data. We applied PFP predictions to uncharacterized protein sequences from 15 organisms. Among these sequences, 60-90% could be annotated with a GO molecular function term at high confidence (>or=80%). We also applied our predictions to the protein-protein interaction network of the Malaria plasmodium (Plasmodium falciparum). High confidence GO biological process predictions (>or=90%) from PFP increased the number of fully enriched interactions in this dataset from 23% of interactions to 94%. Our benchmark comparison shows significant performance improvement of PFP relative to GOtcha, InterProScan, and PSI-BLAST predictions. This is consistent with the performance of PFP as the overall best predictor in both the AFP-SIG '05 and CASP7 function (FN) assessments. PFP is available as a web service at http://dragon.bio.purdue.edu/pfp/. (c) 2008 Wiley-Liss, Inc.
Gärling, T
1996-09-01
How people choose between sequences of actions was investigated in an everyday errand-planning task. In this task subjects chose the preferred sequence of performing a number of errands in a fictitious environment. Two experiments were conducted with undergraduate students serving as subjects. One group searched information about each alternative. The same information was directly available to another group. In Experiment 1 the results showed that for two errands subjects took into account all attributes describing the errands, thus suggesting a tradeoff between priority, wait time, and travel distance with priority being the most important. Consistent with this finding predominantly intraalternative information search was observed. These results were replicated in Experiment 2 for three errands. In addition choice outcomes, information search, and sequence of responding suggested that for more than two actions sequence choices are made in stages.
Fan, Long; Hui, Jerome H L; Yu, Zu Guo; Chu, Ka Hou
2014-07-01
Species identification based on short sequences of DNA markers, that is, DNA barcoding, has emerged as an integral part of modern taxonomy. However, software for the analysis of large and multilocus barcoding data sets is scarce. The Basic Local Alignment Search Tool (BLAST) is currently the fastest tool capable of handling large databases (e.g. >5000 sequences), but its accuracy is a concern and has been criticized for its local optimization. However, current more accurate software requires sequence alignment or complex calculations, which are time-consuming when dealing with large data sets during data preprocessing or during the search stage. Therefore, it is imperative to develop a practical program for both accurate and scalable species identification for DNA barcoding. In this context, we present VIP Barcoding: a user-friendly software in graphical user interface for rapid DNA barcoding. It adopts a hybrid, two-stage algorithm. First, an alignment-free composition vector (CV) method is utilized to reduce searching space by screening a reference database. The alignment-based K2P distance nearest-neighbour method is then employed to analyse the smaller data set generated in the first stage. In comparison with other software, we demonstrate that VIP Barcoding has (i) higher accuracy than Blastn and several alignment-free methods and (ii) higher scalability than alignment-based distance methods and character-based methods. These results suggest that this platform is able to deal with both large-scale and multilocus barcoding data with accuracy and can contribute to DNA barcoding for modern taxonomy. VIP Barcoding is free and available at http://msl.sls.cuhk.edu.hk/vipbarcoding/. © 2014 John Wiley & Sons Ltd.
ProGeRF: Proteome and Genome Repeat Finder Utilizing a Fast Parallel Hash Function
Moraes, Walas Jhony Lopes; Rodrigues, Thiago de Souza; Bartholomeu, Daniella Castanheira
2015-01-01
Repetitive element sequences are adjacent, repeating patterns, also called motifs, and can be of different lengths; repetitions can involve their exact or approximate copies. They have been widely used as molecular markers in population biology. Given the sizes of sequenced genomes, various bioinformatics tools have been developed for the extraction of repetitive elements from DNA sequences. However, currently available tools do not provide options for identifying repetitive elements in the genome or proteome, displaying a user-friendly web interface, and performing-exhaustive searches. ProGeRF is a web site for extracting repetitive regions from genome and proteome sequences. It was designed to be efficient, fast, and accurate and primarily user-friendly web tool allowing many ways to view and analyse the results. ProGeRF (Proteome and Genome Repeat Finder) is freely available as a stand-alone program, from which the users can download the source code, and as a web tool. It was developed using the hash table approach to extract perfect and imperfect repetitive regions in a (multi)FASTA file, while allowing a linear time complexity. PMID:25811026
Searching RNA motifs and their intermolecular contacts with constraint networks.
Thébault, P; de Givry, S; Schiex, T; Gaspin, C
2006-09-01
Searching RNA gene occurrences in genomic sequences is a task whose importance has been renewed by the recent discovery of numerous functional RNA, often interacting with other ligands. Even if several programs exist for RNA motif search, none exists that can represent and solve the problem of searching for occurrences of RNA motifs in interaction with other molecules. We present a constraint network formulation of this problem. RNA are represented as structured motifs that can occur on more than one sequence and which are related together by possible hybridization. The implemented tool MilPat is used to search for several sRNA families in genomic sequences. Results show that MilPat allows to efficiently search for interacting motifs in large genomic sequences and offers a simple and extensible framework to solve such problems. New and known sRNA are identified as H/ACA candidates in Methanocaldococcus jannaschii. http://carlit.toulouse.inra.fr/MilPaT/MilPat.pl.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Paparó, M.; Benkő, J. M.; Hareter, M.
A sequence search method was developed to search the regular frequency spacing in δ Scuti stars through visual inspection and an algorithmic search. We searched for sequences of quasi-equally spaced frequencies, containing at least four members per sequence, in 90 δ Scuti stars observed by CoRoT . We found an unexpectedly large number of independent series of regular frequency spacing in 77 δ Scuti stars (from one to eight sequences) in the non-asymptotic regime. We introduce the sequence search method presenting the sequences and echelle diagram of CoRoT 102675756 and the structure of the algorithmic search. Four sequences (echelle ridges)more » were found in the 5–21 d{sup −1} region where the pairs of the sequences are shifted (between 0.5 and 0.59 d{sup −1}) by twice the value of the estimated rotational splitting frequency (0.269 d{sup −1}). The general conclusions for the whole sample are also presented in this paper. The statistics of the spacings derived by the sequence search method, by FT (Fourier transform of the frequencies), and the statistics of the shifts are also compared. In many stars more than one almost equally valid spacing appeared. The model frequencies of FG Vir and their rotationally split components were used to formulate the possible explanation that one spacing is the large separation while the other is the sum of the large separation and the rotational frequency. In CoRoT 102675756, the two spacings (2.249 and 1.977 d{sup −1}) are in better agreement with the sum of a possible 1.710 d{sup −1} large separation and two or one times, respectively, the value of the rotational frequency.« less
Cho, Jin-Young; Lee, Hyoung-Joo; Jeong, Seul-Ki; Paik, Young-Ki
2017-12-01
Mass spectrometry (MS) is a widely used proteome analysis tool for biomedical science. In an MS-based bottom-up proteomic approach to protein identification, sequence database (DB) searching has been routinely used because of its simplicity and convenience. However, searching a sequence DB with multiple variable modification options can increase processing time, false-positive errors in large and complicated MS data sets. Spectral library searching is an alternative solution, avoiding the limitations of sequence DB searching and allowing the detection of more peptides with high sensitivity. Unfortunately, this technique has less proteome coverage, resulting in limitations in the detection of novel and whole peptide sequences in biological samples. To solve these problems, we previously developed the "Combo-Spec Search" method, which uses manually multiple references and simulated spectral library searching to analyze whole proteomes in a biological sample. In this study, we have developed a new analytical interface tool called "Epsilon-Q" to enhance the functions of both the Combo-Spec Search method and label-free protein quantification. Epsilon-Q performs automatically multiple spectral library searching, class-specific false-discovery rate control, and result integration. It has a user-friendly graphical interface and demonstrates good performance in identifying and quantifying proteins by supporting standard MS data formats and spectrum-to-spectrum matching powered by SpectraST. Furthermore, when the Epsilon-Q interface is combined with the Combo-Spec search method, called the Epsilon-Q system, it shows a synergistic function by outperforming other sequence DB search engines for identifying and quantifying low-abundance proteins in biological samples. The Epsilon-Q system can be a versatile tool for comparative proteome analysis based on multiple spectral libraries and label-free quantification.
Query-seeded iterative sequence similarity searching improves selectivity 5–20-fold
Li, Weizhong; Lopez, Rodrigo
2017-01-01
Abstract Iterative similarity search programs, like psiblast, jackhmmer, and psisearch, are much more sensitive than pairwise similarity search methods like blast and ssearch because they build a position specific scoring model (a PSSM or HMM) that captures the pattern of sequence conservation characteristic to a protein family. But models are subject to contamination; once an unrelated sequence has been added to the model, homologs of the unrelated sequence will also produce high scores, and the model can diverge from the original protein family. Examination of alignment errors during psiblast PSSM contamination suggested a simple strategy for dramatically reducing PSSM contamination. psiblast PSSMs are built from the query-based multiple sequence alignment (MSA) implied by the pairwise alignments between the query model (PSSM, HMM) and the subject sequences in the library. When the original query sequence residues are inserted into gapped positions in the aligned subject sequence, the resulting PSSM rarely produces alignment over-extensions or alignments to unrelated sequences. This simple step, which tends to anchor the PSSM to the original query sequence and slightly increase target percent identity, can reduce the frequency of false-positive alignments more than 20-fold compared with psiblast and jackhmmer, with little loss in search sensitivity. PMID:27923999
Jones, Bethan M; Edwards, Richard J; Skipp, Paul J; O'Connor, C David; Iglesias-Rodriguez, M Debora
2011-06-01
Emiliania huxleyi is a unicellular marine phytoplankton species known to play a significant role in global biogeochemistry. Through the dual roles of photosynthesis and production of calcium carbonate (calcification), carbon is transferred from the atmosphere to ocean sediments. Almost nothing is known about the molecular mechanisms that control calcification, a process that is tightly regulated within the cell. To initiate proteomic studies on this important and phylogenetically remote organism, we have devised efficient protein extraction protocols and developed a bioinformatics pipeline that allows the statistically robust assignment of proteins from MS/MS data using preexisting EST sequences. The bioinformatics tool, termed BUDAPEST (Bioinformatics Utility for Data Analysis of Proteomics using ESTs), is fully automated and was used to search against data generated from three strains. BUDAPEST increased the number of identifications over standard protein database searches from 37 to 99 proteins when data were amalgamated. Proteins involved in diverse cellular processes were uncovered. For example, experimental evidence was obtained for a novel type I polyketide synthase and for various photosystem components. The proteomic and bioinformatic approaches developed in this study are of wider applicability, particularly to the oceanographic community where genomic sequence data for species of interest are currently scarce.
Multi-source and ontology-based retrieval engine for maize mutant phenotypes
Green, Jason M.; Harnsomburana, Jaturon; Schaeffer, Mary L.; Lawrence, Carolyn J.; Shyu, Chi-Ren
2011-01-01
Model Organism Databases, including the various plant genome databases, collect and enable access to massive amounts of heterogeneous information, including sequence data, gene product information, images of mutant phenotypes, etc, as well as textual descriptions of many of these entities. While a variety of basic browsing and search capabilities are available to allow researchers to query and peruse the names and attributes of phenotypic data, next-generation search mechanisms that allow querying and ranking of text descriptions are much less common. In addition, the plant community needs an innovative way to leverage the existing links in these databases to search groups of text descriptions simultaneously. Furthermore, though much time and effort have been afforded to the development of plant-related ontologies, the knowledge embedded in these ontologies remains largely unused in available plant search mechanisms. Addressing these issues, we have developed a unique search engine for mutant phenotypes from MaizeGDB. This advanced search mechanism integrates various text description sources in MaizeGDB to aid a user in retrieving desired mutant phenotype information. Currently, descriptions of mutant phenotypes, loci and gene products are utilized collectively for each search, though expansion of the search mechanism to include other sources is straightforward. The retrieval engine, to our knowledge, is the first engine to exploit the content and structure of available domain ontologies, currently the Plant and Gene Ontologies, to expand and enrich retrieval results in major plant genomic databases. Database URL: http:www.PhenomicsWorld.org/QBTA.php PMID:21558151
Structator: fast index-based search for RNA sequence-structure patterns
2011-01-01
Background The secondary structure of RNA molecules is intimately related to their function and often more conserved than the sequence. Hence, the important task of searching databases for RNAs requires to match sequence-structure patterns. Unfortunately, current tools for this task have, in the best case, a running time that is only linear in the size of sequence databases. Furthermore, established index data structures for fast sequence matching, like suffix trees or arrays, cannot benefit from the complementarity constraints introduced by the secondary structure of RNAs. Results We present a novel method and readily applicable software for time efficient matching of RNA sequence-structure patterns in sequence databases. Our approach is based on affix arrays, a recently introduced index data structure, preprocessed from the target database. Affix arrays support bidirectional pattern search, which is required for efficiently handling the structural constraints of the pattern. Structural patterns like stem-loops can be matched inside out, such that the loop region is matched first and then the pairing bases on the boundaries are matched consecutively. This allows to exploit base pairing information for search space reduction and leads to an expected running time that is sublinear in the size of the sequence database. The incorporation of a new chaining approach in the search of RNA sequence-structure patterns enables the description of molecules folding into complex secondary structures with multiple ordered patterns. The chaining approach removes spurious matches from the set of intermediate results, in particular of patterns with little specificity. In benchmark experiments on the Rfam database, our method runs up to two orders of magnitude faster than previous methods. Conclusions The presented method's sublinear expected running time makes it well suited for RNA sequence-structure pattern matching in large sequence databases. RNA molecules containing several stem-loop substructures can be described by multiple sequence-structure patterns and their matches are efficiently handled by a novel chaining method. Beyond our algorithmic contributions, we provide with Structator a complete and robust open-source software solution for index-based search of RNA sequence-structure patterns. The Structator software is available at http://www.zbh.uni-hamburg.de/Structator. PMID:21619640
Divide and Conquer (DC) BLAST: fast and easy BLAST execution within HPC environments
Yim, Won Cheol; Cushman, John C.
2017-07-22
Bioinformatics is currently faced with very large-scale data sets that lead to computational jobs, especially sequence similarity searches, that can take absurdly long times to run. For example, the National Center for Biotechnology Information (NCBI) Basic Local Alignment Search Tool (BLAST and BLAST+) suite, which is by far the most widely used tool for rapid similarity searching among nucleic acid or amino acid sequences, is highly central processing unit (CPU) intensive. While the BLAST suite of programs perform searches very rapidly, they have the potential to be accelerated. In recent years, distributed computing environments have become more widely accessible andmore » used due to the increasing availability of high-performance computing (HPC) systems. Therefore, simple solutions for data parallelization are needed to expedite BLAST and other sequence analysis tools. However, existing software for parallel sequence similarity searches often requires extensive computational experience and skill on the part of the user. In order to accelerate BLAST and other sequence analysis tools, Divide and Conquer BLAST (DCBLAST) was developed to perform NCBI BLAST searches within a cluster, grid, or HPC environment by using a query sequence distribution approach. Scaling from one (1) to 256 CPU cores resulted in significant improvements in processing speed. Thus, DCBLAST dramatically accelerates the execution of BLAST searches using a simple, accessible, robust, and parallel approach. DCBLAST works across multiple nodes automatically and it overcomes the speed limitation of single-node BLAST programs. DCBLAST can be used on any HPC system, can take advantage of hundreds of nodes, and has no output limitations. Thus, this freely available tool simplifies distributed computation pipelines to facilitate the rapid discovery of sequence similarities between very large data sets.« less
Divide and Conquer (DC) BLAST: fast and easy BLAST execution within HPC environments
DOE Office of Scientific and Technical Information (OSTI.GOV)
Yim, Won Cheol; Cushman, John C.
Bioinformatics is currently faced with very large-scale data sets that lead to computational jobs, especially sequence similarity searches, that can take absurdly long times to run. For example, the National Center for Biotechnology Information (NCBI) Basic Local Alignment Search Tool (BLAST and BLAST+) suite, which is by far the most widely used tool for rapid similarity searching among nucleic acid or amino acid sequences, is highly central processing unit (CPU) intensive. While the BLAST suite of programs perform searches very rapidly, they have the potential to be accelerated. In recent years, distributed computing environments have become more widely accessible andmore » used due to the increasing availability of high-performance computing (HPC) systems. Therefore, simple solutions for data parallelization are needed to expedite BLAST and other sequence analysis tools. However, existing software for parallel sequence similarity searches often requires extensive computational experience and skill on the part of the user. In order to accelerate BLAST and other sequence analysis tools, Divide and Conquer BLAST (DCBLAST) was developed to perform NCBI BLAST searches within a cluster, grid, or HPC environment by using a query sequence distribution approach. Scaling from one (1) to 256 CPU cores resulted in significant improvements in processing speed. Thus, DCBLAST dramatically accelerates the execution of BLAST searches using a simple, accessible, robust, and parallel approach. DCBLAST works across multiple nodes automatically and it overcomes the speed limitation of single-node BLAST programs. DCBLAST can be used on any HPC system, can take advantage of hundreds of nodes, and has no output limitations. Thus, this freely available tool simplifies distributed computation pipelines to facilitate the rapid discovery of sequence similarities between very large data sets.« less
Geoseq: a tool for dissecting deep-sequencing datasets.
Gurtowski, James; Cancio, Anthony; Shah, Hardik; Levovitz, Chaya; George, Ajish; Homann, Robert; Sachidanandam, Ravi
2010-10-12
Datasets generated on deep-sequencing platforms have been deposited in various public repositories such as the Gene Expression Omnibus (GEO), Sequence Read Archive (SRA) hosted by the NCBI, or the DNA Data Bank of Japan (ddbj). Despite being rich data sources, they have not been used much due to the difficulty in locating and analyzing datasets of interest. Geoseq http://geoseq.mssm.edu provides a new method of analyzing short reads from deep sequencing experiments. Instead of mapping the reads to reference genomes or sequences, Geoseq maps a reference sequence against the sequencing data. It is web-based, and holds pre-computed data from public libraries. The analysis reduces the input sequence to tiles and measures the coverage of each tile in a sequence library through the use of suffix arrays. The user can upload custom target sequences or use gene/miRNA names for the search and get back results as plots and spreadsheet files. Geoseq organizes the public sequencing data using a controlled vocabulary, allowing identification of relevant libraries by organism, tissue and type of experiment. Analysis of small sets of sequences against deep-sequencing datasets, as well as identification of public datasets of interest, is simplified by Geoseq. We applied Geoseq to, a) identify differential isoform expression in mRNA-seq datasets, b) identify miRNAs (microRNAs) in libraries, and identify mature and star sequences in miRNAS and c) to identify potentially mis-annotated miRNAs. The ease of using Geoseq for these analyses suggests its utility and uniqueness as an analysis tool.
KONAGAbase: a genomic and transcriptomic database for the diamondback moth, Plutella xylostella.
Jouraku, Akiya; Yamamoto, Kimiko; Kuwazaki, Seigo; Urio, Masahiro; Suetsugu, Yoshitaka; Narukawa, Junko; Miyamoto, Kazuhisa; Kurita, Kanako; Kanamori, Hiroyuki; Katayose, Yuichi; Matsumoto, Takashi; Noda, Hiroaki
2013-07-09
The diamondback moth (DBM), Plutella xylostella, is one of the most harmful insect pests for crucifer crops worldwide. DBM has rapidly evolved high resistance to most conventional insecticides such as pyrethroids, organophosphates, fipronil, spinosad, Bacillus thuringiensis, and diamides. Therefore, it is important to develop genomic and transcriptomic DBM resources for analysis of genes related to insecticide resistance, both to clarify the mechanism of resistance of DBM and to facilitate the development of insecticides with a novel mode of action for more effective and environmentally less harmful insecticide rotation. To contribute to this goal, we developed KONAGAbase, a genomic and transcriptomic database for DBM (KONAGA is the Japanese word for DBM). KONAGAbase provides (1) transcriptomic sequences of 37,340 ESTs/mRNAs and 147,370 RNA-seq contigs which were clustered and assembled into 84,570 unigenes (30,695 contigs, 50,548 pseudo singletons, and 3,327 singletons); and (2) genomic sequences of 88,530 WGS contigs with 246,244 degenerate contigs and 106,455 singletons from which 6,310 de novo identified repeat sequences and 34,890 predicted gene-coding sequences were extracted. The unigenes and predicted gene-coding sequences were clustered and 32,800 representative sequences were extracted as a comprehensive putative gene set. These sequences were annotated with BLAST descriptions, Gene Ontology (GO) terms, and Pfam descriptions, respectively. KONAGAbase contains rich graphical user interface (GUI)-based web interfaces for easy and efficient searching, browsing, and downloading sequences and annotation data. Five useful search interfaces consisting of BLAST search, keyword search, BLAST result-based search, GO tree-based search, and genome browser are provided. KONAGAbase is publicly available from our website (http://dbm.dna.affrc.go.jp/px/) through standard web browsers. KONAGAbase provides DBM comprehensive transcriptomic and draft genomic sequences with useful annotation information with easy-to-use web interfaces, which helps researchers to efficiently search for target sequences such as insect resistance-related genes. KONAGAbase will be continuously updated and additional genomic/transcriptomic resources and analysis tools will be provided for further efficient analysis of the mechanism of insecticide resistance and the development of effective insecticides with a novel mode of action for DBM.
G2LC: Resources Autoscaling for Real Time Bioinformatics Applications in IaaS.
Hu, Rongdong; Liu, Guangming; Jiang, Jingfei; Wang, Lixin
2015-01-01
Cloud computing has started to change the way how bioinformatics research is being carried out. Researchers who have taken advantage of this technology can process larger amounts of data and speed up scientific discovery. The variability in data volume results in variable computing requirements. Therefore, bioinformatics researchers are pursuing more reliable and efficient methods for conducting sequencing analyses. This paper proposes an automated resource provisioning method, G2LC, for bioinformatics applications in IaaS. It enables application to output the results in a real time manner. Its main purpose is to guarantee applications performance, while improving resource utilization. Real sequence searching data of BLAST is used to evaluate the effectiveness of G2LC. Experimental results show that G2LC guarantees the application performance, while resource is saved up to 20.14%.
Design study of a HEAO-C spread spectrum transponder telemetry system for use with the TDRSS subnet
NASA Technical Reports Server (NTRS)
Weathers, G.
1975-01-01
The results of a design study of a spread spectrum transponder for use on the HEAO-C satellite were given. The transponder performs the functions of code turn-around for ground range and range-rate determination, ground command receiver, and telemetry data transmitter. The spacecraft transponder and associated communication system components will allow the HEAO-C satellite to utilize the Tracking and Data Relay Satellite System (TDRSS) subnet of the post 1978 STDN. The following areas were discussed in the report: TDRSS Subnet Description, TDRSS-HEAO-C System Configuration, Gold Code Generator, Convolutional Encoder Design and Decoder Algorithm, High Speed Sequence Generators, Statistical Evaluation of Candidate Code Sequences using Amplitude and Phase Moments, Code and Carrier Phase Lock Loops, Total Spread Spectrum Transponder System, and Reference Literature Search.
G2LC: Resources Autoscaling for Real Time Bioinformatics Applications in IaaS
Hu, Rongdong; Liu, Guangming; Jiang, Jingfei; Wang, Lixin
2015-01-01
Cloud computing has started to change the way how bioinformatics research is being carried out. Researchers who have taken advantage of this technology can process larger amounts of data and speed up scientific discovery. The variability in data volume results in variable computing requirements. Therefore, bioinformatics researchers are pursuing more reliable and efficient methods for conducting sequencing analyses. This paper proposes an automated resource provisioning method, G2LC, for bioinformatics applications in IaaS. It enables application to output the results in a real time manner. Its main purpose is to guarantee applications performance, while improving resource utilization. Real sequence searching data of BLAST is used to evaluate the effectiveness of G2LC. Experimental results show that G2LC guarantees the application performance, while resource is saved up to 20.14%. PMID:26504488
Biosequence Similarity Search on the Mercury System
Krishnamurthy, Praveen; Buhler, Jeremy; Chamberlain, Roger; Franklin, Mark; Gyang, Kwame; Jacob, Arpith; Lancaster, Joseph
2007-01-01
Biosequence similarity search is an important application in modern molecular biology. Search algorithms aim to identify sets of sequences whose extensional similarity suggests a common evolutionary origin or function. The most widely used similarity search tool for biosequences is BLAST, a program designed to compare query sequences to a database. Here, we present the design of BLASTN, the version of BLAST that searches DNA sequences, on the Mercury system, an architecture that supports high-volume, high-throughput data movement off a data store and into reconfigurable hardware. An important component of application deployment on the Mercury system is the functional decomposition of the application onto both the reconfigurable hardware and the traditional processor. Both the Mercury BLASTN application design and its performance analysis are described. PMID:18846267
Nucleic Acid Extraction from Synthetic Mars Analog Soils for in situ Life Detection
Mojarro, Angel; Ruvkun, Gary; Zuber, Maria T.
2017-01-01
Abstract Biological informational polymers such as nucleic acids have the potential to provide unambiguous evidence of life beyond Earth. To this end, we are developing an automated in situ life-detection instrument that integrates nucleic acid extraction and nanopore sequencing: the Search for Extra-Terrestrial Genomes (SETG) instrument. Our goal is to isolate and determine the sequence of nucleic acids from extant or preserved life on Mars, if, for example, there is common ancestry to life on Mars and Earth. As is true of metagenomic analysis of terrestrial environmental samples, the SETG instrument must isolate nucleic acids from crude samples and then determine the DNA sequence of the unknown nucleic acids. Our initial DNA extraction experiments resulted in low to undetectable amounts of DNA due to soil chemistry–dependent soil-DNA interactions, namely adsorption to mineral surfaces, binding to divalent/trivalent cations, destruction by iron redox cycling, and acidic conditions. Subsequently, we developed soil-specific extraction protocols that increase DNA yields through a combination of desalting, utilization of competitive binders, and promotion of anaerobic conditions. Our results suggest that a combination of desalting and utilizing competitive binders may establish a “universal” nucleic acid extraction protocol suitable for analyzing samples from diverse soils on Mars. Key Words: Life-detection instruments—Nucleic acids—Mars—Panspermia. Astrobiology 17, 747–760. PMID:28704064
Using the TIGR gene index databases for biological discovery.
Lee, Yuandan; Quackenbush, John
2003-11-01
The TIGR Gene Index web pages provide access to analyses of ESTs and gene sequences for nearly 60 species, as well as a number of resources derived from these. Each species-specific database is presented using a common format with a homepage. A variety of methods exist that allow users to search each species-specific database. Methods implemented currently include nucleotide or protein sequence queries using WU-BLAST, text-based searches using various sequence identifiers, searches by gene, tissue and library name, and searches using functional classes through Gene Ontology assignments. This protocol provides guidance for using the Gene Index Databases to extract information.
Reneker, Jeff; Shyu, Chi-Ren; Zeng, Peiyu; Polacco, Joseph C.; Gassmann, Walter
2004-01-01
We have developed a web server for the life sciences community to use to search for short repeats of DNA sequence of length between 3 and 10 000 bases within multiple species. This search employs a unique and fast hash function approach. Our system also applies information retrieval algorithms to discover knowledge of cross-species conservation of repeat sequences. Furthermore, we have incorporated a part of the Gene Ontology database into our information retrieval algorithms to broaden the coverage of the search. Our web server and tutorial can be found at http://acmes.rnet.missouri.edu. PMID:15215469
Impact of Predicting Health Care Utilization Via Web Search Behavior: A Data-Driven Analysis.
Agarwal, Vibhu; Zhang, Liangliang; Zhu, Josh; Fang, Shiyuan; Cheng, Tim; Hong, Chloe; Shah, Nigam H
2016-09-21
By recent estimates, the steady rise in health care costs has deprived more than 45 million Americans of health care services and has encouraged health care providers to better understand the key drivers of health care utilization from a population health management perspective. Prior studies suggest the feasibility of mining population-level patterns of health care resource utilization from observational analysis of Internet search logs; however, the utility of the endeavor to the various stakeholders in a health ecosystem remains unclear. The aim was to carry out a closed-loop evaluation of the utility of health care use predictions using the conversion rates of advertisements that were displayed to the predicted future utilizers as a surrogate. The statistical models to predict the probability of user's future visit to a medical facility were built using effective predictors of health care resource utilization, extracted from a deidentified dataset of geotagged mobile Internet search logs representing searches made by users of the Baidu search engine between March 2015 and May 2015. We inferred presence within the geofence of a medical facility from location and duration information from users' search logs and putatively assigned medical facility visit labels to qualifying search logs. We constructed a matrix of general, semantic, and location-based features from search logs of users that had 42 or more search days preceding a medical facility visit as well as from search logs of users that had no medical visits and trained statistical learners for predicting future medical visits. We then carried out a closed-loop evaluation of the utility of health care use predictions using the show conversion rates of advertisements displayed to the predicted future utilizers. In the context of behaviorally targeted advertising, wherein health care providers are interested in minimizing their cost per conversion, the association between show conversion rate and predicted utilization score, served as a surrogate measure of the model's utility. We obtained the highest area under the curve (0.796) in medical visit prediction with our random forests model and daywise features. Ablating feature categories one at a time showed that the model performance worsened the most when location features were dropped. An online evaluation in which advertisements were served to users who had a high predicted probability of a future medical visit showed a 3.96% increase in the show conversion rate. Results from our experiments done in a research setting suggest that it is possible to accurately predict future patient visits from geotagged mobile search logs. Results from the offline and online experiments on the utility of health utilization predictions suggest that such prediction can have utility for health care providers.
Impact of Predicting Health Care Utilization Via Web Search Behavior: A Data-Driven Analysis
Zhang, Liangliang; Zhu, Josh; Fang, Shiyuan; Cheng, Tim; Hong, Chloe; Shah, Nigam H
2016-01-01
Background By recent estimates, the steady rise in health care costs has deprived more than 45 million Americans of health care services and has encouraged health care providers to better understand the key drivers of health care utilization from a population health management perspective. Prior studies suggest the feasibility of mining population-level patterns of health care resource utilization from observational analysis of Internet search logs; however, the utility of the endeavor to the various stakeholders in a health ecosystem remains unclear. Objective The aim was to carry out a closed-loop evaluation of the utility of health care use predictions using the conversion rates of advertisements that were displayed to the predicted future utilizers as a surrogate. The statistical models to predict the probability of user’s future visit to a medical facility were built using effective predictors of health care resource utilization, extracted from a deidentified dataset of geotagged mobile Internet search logs representing searches made by users of the Baidu search engine between March 2015 and May 2015. Methods We inferred presence within the geofence of a medical facility from location and duration information from users’ search logs and putatively assigned medical facility visit labels to qualifying search logs. We constructed a matrix of general, semantic, and location-based features from search logs of users that had 42 or more search days preceding a medical facility visit as well as from search logs of users that had no medical visits and trained statistical learners for predicting future medical visits. We then carried out a closed-loop evaluation of the utility of health care use predictions using the show conversion rates of advertisements displayed to the predicted future utilizers. In the context of behaviorally targeted advertising, wherein health care providers are interested in minimizing their cost per conversion, the association between show conversion rate and predicted utilization score, served as a surrogate measure of the model’s utility. Results We obtained the highest area under the curve (0.796) in medical visit prediction with our random forests model and daywise features. Ablating feature categories one at a time showed that the model performance worsened the most when location features were dropped. An online evaluation in which advertisements were served to users who had a high predicted probability of a future medical visit showed a 3.96% increase in the show conversion rate. Conclusions Results from our experiments done in a research setting suggest that it is possible to accurately predict future patient visits from geotagged mobile search logs. Results from the offline and online experiments on the utility of health utilization predictions suggest that such prediction can have utility for health care providers. PMID:27655225
Utilization of a radiology-centric search engine.
Sharpe, Richard E; Sharpe, Megan; Siegel, Eliot; Siddiqui, Khan
2010-04-01
Internet-based search engines have become a significant component of medical practice. Physicians increasingly rely on information available from search engines as a means to improve patient care, provide better education, and enhance research. Specialized search engines have emerged to more efficiently meet the needs of physicians. Details about the ways in which radiologists utilize search engines have not been documented. The authors categorized every 25th search query in a radiology-centric vertical search engine by radiologic subspecialty, imaging modality, geographic location of access, time of day, use of abbreviations, misspellings, and search language. Musculoskeletal and neurologic imagings were the most frequently searched subspecialties. The least frequently searched were breast imaging, pediatric imaging, and nuclear medicine. Magnetic resonance imaging and computed tomography were the most frequently searched modalities. A majority of searches were initiated in North America, but all continents were represented. Searches occurred 24 h/day in converted local times, with a majority occurring during the normal business day. Misspellings and abbreviations were common. Almost all searches were performed in English. Search engine utilization trends are likely to mirror trends in diagnostic imaging in the region from which searches originate. Internet searching appears to function as a real-time clinical decision-making tool, a research tool, and an educational resource. A more thorough understanding of search utilization patterns can be obtained by analyzing phrases as actually entered as well as the geographic location and time of origination. This knowledge may contribute to the development of more efficient and personalized search engines.
Panagopoulos, Ioannis; Gorunova, Ludmila; Bjerkehagen, Bodil; Heim, Sverre
2014-01-01
Whole transcriptome sequencing was used to study a small round cell tumor in which a t(4;19)(q35;q13) was part of the complex karyotype but where the initial reverse transcriptase PCR (RT-PCR) examination did not detect a CIC-DUX4 fusion transcript previously described as the crucial gene-level outcome of this specific translocation. The RNA sequencing data were analysed using the FusionMap, FusionFinder, and ChimeraScan programs which are specifically designed to identify fusion genes. FusionMap, FusionFinder, and ChimeraScan identified 1017, 102, and 101 fusion transcripts, respectively, but CIC-DUX4 was not among them. Since the RNA sequencing data are in the fastq text-based format, we searched the files using the "grep" command-line utility. The "grep" command searches the text for specific expressions and displays, by default, the lines where matches occur. The "specific expression" was a sequence of 20 nucleotides from the coding part of the last exon 20 of CIC (Reference Sequence: NM_015125.3) chosen since all the so far reported CIC breakpoints have occurred here. Fifteen chimeric CIC-DUX4 cDNA sequences were captured and the fusion between the CIC and DUX4 genes was mapped precisely. New primer combinations were constructed based on these findings and were used together with a polymerase suitable for amplification of GC-rich DNA templates to amplify CIC-DUX4 cDNA fragments which had the same fusion point found with "grep". In conclusion, FusionMap, FusionFinder, and ChimeraScan generated a plethora of fusion transcripts but did not detect the biologically important CIC-DUX4 chimeric transcript; they are generally useful but evidently suffer from imperfect both sensitivity and specificity. The "grep" command is an excellent tool to capture chimeric transcripts from RNA sequencing data when the pathological and/or cytogenetic information strongly indicates the presence of a specific fusion gene.
Processing Dynamic Image Sequences from a Moving Sensor.
1984-02-01
65 Roadsign Image Sequence ..... ................ ... 70 Roadsign Sequence with Redundant Features .. ........ . 79 Roadsign Subimage...Selected Feature Error Values .. ........ 66 2c. Industrial Image Selected Feature Local Search Values. .. .... 67 3ab. Roadsign Image Error Values...72 3c. Roadsign Image Local Search Values ............. 73 4ab. Roadsign Redundant Feature Error Values. ............ 8 4c. Roadsign
galaxie--CGI scripts for sequence identification through automated phylogenetic analysis.
Nilsson, R Henrik; Larsson, Karl-Henrik; Ursing, Björn M
2004-06-12
The prevalent use of similarity searches like BLAST to identify sequences and species implicitly assumes the reference database to be of extensive sequence sampling. This is often not the case, restraining the correctness of the outcome as a basis for sequence identification. Phylogenetic inference outperforms similarity searches in retrieving correct phylogenies and consequently sequence identities, and a project was initiated to design a freely available script package for sequence identification through automated Web-based phylogenetic analysis. Three CGI scripts were designed to facilitate qualified sequence identification from a Web interface. Query sequences are aligned to pre-made alignments or to alignments made by ClustalW with entries retrieved from a BLAST search. The subsequent phylogenetic analysis is based on the PHYLIP package for inferring neighbor-joining and parsimony trees. The scripts are highly configurable. A service installation and a version for local use are found at http://andromeda.botany.gu.se/galaxiewelcome.html and http://galaxie.cgb.ki.se
Bringing the fathead minnow into the genomic era | Science ...
The fathead minnow is a well-established ecotoxicological model organism that has been widely used for regulatory ecotoxicity testing and research for over a half century. While a large amount of molecular information has been gathered on the fathead minnow over the years, the lack of genomic sequence data has limited the utility of the fathead minnow for certain applications. To address this limitation, high-throughput Illumina sequencing technology was employed to sequence the fathead minnow genome. Approximately 100X coverage was achieved by sequencing several libraries of paired-end reads with differing genome insert sizes. Two draft genome assemblies were generated using the SOAPdenovo and String Graph Assembler (SGA) methods, respectively. When these were compared, the SOAPdenovo assembly had a higher scaffold N50 value of 60.4 kbp versus 15.4 kbp, and it also performed better in a Core Eukaryotic Genes Mapping Analysis (CEGMA), mapping 91% versus 67% of genes. As such, this assembly was selected for further development and annotation. The foundation for genome annotation was generated using AUGUSTUS, an ab initio method for gene prediction. A total of 43,345 potential coding sequences were predicted on the genome assembly. These predicted sequences were translated to peptides and queried in a BLAST search against all vertebrates, with 28,290 of these sequences corresponding to zebrafish peptides and 5,242 producing no significant alignments. Additional ty
Complete mitochondrial genomes of eleven extinct or possibly extinct bird species.
Anmarkrud, Jarl A; Lifjeld, Jan T
2017-03-01
Natural history museum collections represent a vast source of ancient and historical DNA samples from extinct taxa that can be utilized by high-throughput sequencing tools to reveal novel genetic and phylogenetic information about them. Here, we report on the successful sequencing of complete mitochondrial genome sequences (mitogenomes) from eleven extinct bird species, using de novo assembly of short sequences derived from toepad samples of degraded DNA from museum specimens. For two species (the Passenger Pigeon Ectopistes migratorius and the South Island Piopio Turnagra capensis), whole mitogenomes were already available from recent studies, whereas for five others (the Great Auk Pinguinis impennis, the Imperial Woodpecker Campehilus imperialis, the Huia Heteralocha acutirostris, the Kauai Oo Moho braccathus and the South Island Kokako Callaeas cinereus), there were partial mitochondrial sequences available for comparison. For all seven species, we found sequence similarities of >98%. For the remaining four species (the Kamao Myadestes myadestinus, the Paradise Parrot Psephotellus pulcherrimus, the Ou Psittirostra psittacea and the Lesser Akialoa Akialoa obscura), there was no sequence information available for comparison, so we conducted blast searches and phylogenetic analyses to determine their phylogenetic positions and identify their closest extant relatives. These mitogenomes will be valuable for future analyses of avian phylogenetics and illustrate the importance of museum collections as repositories for genomics resources. © 2016 John Wiley & Sons Ltd.
Verheggen, Kenneth; Raeder, Helge; Berven, Frode S; Martens, Lennart; Barsnes, Harald; Vaudel, Marc
2017-09-13
Sequence database search engines are bioinformatics algorithms that identify peptides from tandem mass spectra using a reference protein sequence database. Two decades of development, notably driven by advances in mass spectrometry, have provided scientists with more than 30 published search engines, each with its own properties. In this review, we present the common paradigm behind the different implementations, and its limitations for modern mass spectrometry datasets. We also detail how the search engines attempt to alleviate these limitations, and provide an overview of the different software frameworks available to the researcher. Finally, we highlight alternative approaches for the identification of proteomic mass spectrometry datasets, either as a replacement for, or as a complement to, sequence database search engines. © 2017 Wiley Periodicals, Inc.
Finding similar nucleotide sequences using network BLAST searches.
Ladunga, Istvan
2009-06-01
The Basic Local Alignment Search Tool (BLAST) is a keystone of bioinformatics due to its performance and user-friendliness. Beginner and intermediate users will learn how to design and submit blastn and Megablast searches on the Web pages at the National Center for Biotechnology Information. We map nucleic acid sequences to genomes, find identical or similar mRNA, expressed sequence tag, and noncoding RNA sequences, and run Megablast searches, which are much faster than blastn. Understanding results is assisted by taxonomy reports, genomic views, and multiple alignments. We interpret expected frequency thresholds, biological significance, and statistical significance. Weak hits provide no evidence, but hints for further analyses. We find genes that may code for homologous proteins by translated BLAST. We reduce false positives by filtering out low-complexity regions. Parsed BLAST results can be integrated into analysis pipelines. Links in the output connect to Entrez, PUBMED, structural, sequence, interaction, and expression databases. This facilitates integration with a wide spectrum of biological knowledge.
NASA Astrophysics Data System (ADS)
Ferreira, M.; Creveling, J.; Hilburn, I.; Karlsson, E.; Pepe-Ranney, C.; Spear, J.; Dawson, S.; Geobio2008, I.
2008-12-01
Silicified structures that exhibit a putative biologic component in their formation permeate the rock record as stromatolites. We have studied a silicified microbial structure from a hot spring in Yellowstone National Park using phenotypic, phylogenetic, and metagenomic analyses to determine microbial carbon metabolic pathways and the phylogenetic affiliations of microbes present in this unique structure. In this multi-faceted approach, dominant physiologies, specifically with regards to anaerobic and aerobic metabolisms, were inferred from 16S rRNA gene sequences and 454 sequencing data from bulk DNA samples of the structure. Carbon utilization as indicated by ECO Biolog plates showed abundant heterotrophy and heterotrophic diversity throughout the microbial structure. Microbes within the structure are able to utilize all tested sources of carbohydrates, lipids/fatty acids, and protein/amino acids as carbon sources. ECO plate testing of the hot spring water yielded considerable less carbohydrate consumption (only 4 out of 13 tested carbohydrates) and similar lipids/fatty acids and protein/amino acids consumption (2 out of 3 and 5 out of 5 tested sources respectively). Full length 16S rRNA gene sequences and metagenomic 454 pyrosequencing of community DNA showed limited diversity among primary producers. From the 16S data, the majority of the autotrophs are inferred to utilize the Calvin cycle for CO2 fixation, followed by 3-hydroxypropionate/4- hydroxybutyrate CO2 fixation. However, an analysis of the metagenomic data compared to the KEGG database does not show genes directly involved with Calvin cycle carbon fixation. Further BLAST searches of our data failed to find significant matches within our 6514 metagenomic sequences to known RuBisCo sequences taken from the NCBI database. This is likely due to a far under-sampled dataset of metagenomic sequences, and the low number (958) that had matches to the KEGG pathways database. Anaerobic versus aerobic physiology also can be estimated from the 16S clone libraries. Phylogenetic analysis of recovered 16S sequences suggests that 15% of the 16S sequences can be attributed to anaerobic microbes while 42% likely come from aerobes. The remaining 43% of 16S rRNA gene sequences belong to metabolically unassigned phyla both known and novel. This preliminary study demonstrates that the small spatially stratified silicified microbial structure present on the margins of a hot spring contains a rich and complex microbial community with different trophic levels and enzymatic pathways.
Yu, Dongliang; Meng, Yijun; Zuo, Ziwei; Xue, Jie; Wang, Huizhong
2016-01-01
Nat-siRNAs (small interfering RNAs originated from natural antisense transcripts) are a class of functional small RNA (sRNA) species discovered in both plants and animals. These siRNAs are highly enriched within the annealed regions of the NAT (natural antisense transcript) pairs. To date, great research efforts have been taken for systematical identification of the NATs in various organisms. However, developing a freely available and easy-to-use program for NAT prediction is strongly demanded by researchers. Here, we proposed an integrative pipeline named NATpipe for systematical discovery of NATs from de novo assembled transcriptomes. By utilizing sRNA sequencing data, the pipeline also allowed users to search for phase-distributed nat-siRNAs within the perfectly annealed regions of the NAT pairs. Additionally, more reliable nat-siRNA loci could be identified based on degradome sequencing data. A case study on the non-model plant Dendrobium officinale was performed to illustrate the utility of NATpipe. Finally, we hope that NATpipe would be a useful tool for NAT prediction, nat-siRNA discovery, and related functional studies. NATpipe is available at www.bioinfolab.cn/NATpipe/NATpipe.zip. PMID:26858106
Stavropoulos, Dimitri J; Merico, Daniele; Jobling, Rebekah; Bowdin, Sarah; Monfared, Nasim; Thiruvahindrapuram, Bhooma; Nalpathamkalam, Thomas; Pellecchia, Giovanna; Yuen, Ryan K C; Szego, Michael J; Hayeems, Robin Z; Shaul, Randi Zlotnik; Brudno, Michael; Girdea, Marta; Frey, Brendan; Alipanahi, Babak; Ahmed, Sohnee; Babul-Hirji, Riyana; Porras, Ramses Badilla; Carter, Melissa T; Chad, Lauren; Chaudhry, Ayeshah; Chitayat, David; Doust, Soghra Jougheh; Cytrynbaum, Cheryl; Dupuis, Lucie; Ejaz, Resham; Fishman, Leona; Guerin, Andrea; Hashemi, Bita; Helal, Mayada; Hewson, Stacy; Inbar-Feigenberg, Michal; Kannu, Peter; Karp, Natalya; Kim, Raymond; Kronick, Jonathan; Liston, Eriskay; MacDonald, Heather; Mercimek-Mahmutoglu, Saadet; Mendoza-Londono, Roberto; Nasr, Enas; Nimmo, Graeme; Parkinson, Nicole; Quercia, Nada; Raiman, Julian; Roifman, Maian; Schulze, Andreas; Shugar, Andrea; Shuman, Cheryl; Sinajon, Pierre; Siriwardena, Komudi; Weksberg, Rosanna; Yoon, Grace; Carew, Chris; Erickson, Raith; Leach, Richard A; Klein, Robert; Ray, Peter N; Meyn, M Stephen; Scherer, Stephen W; Cohn, Ronald D; Marshall, Christian R
2016-01-13
The standard of care for first-tier clinical investigation of the etiology of congenital malformations and neurodevelopmental disorders is chromosome microarray analysis (CMA) for copy number variations (CNVs), often followed by gene(s)-specific sequencing searching for smaller insertion-deletions (indels) and single nucleotide variant (SNV) mutations. Whole genome sequencing (WGS) has the potential to capture all classes of genetic variation in one experiment; however, the diagnostic yield for mutation detection of WGS compared to CMA, and other tests, needs to be established. In a prospective study we utilized WGS and comprehensive medical annotation to assess 100 patients referred to a paediatric genetics service and compared the diagnostic yield versus standard genetic testing. WGS identified genetic variants meeting clinical diagnostic criteria in 34% of cases, representing a 4-fold increase in diagnostic rate over CMA (8%) (p-value = 1.42e-05) alone and >2-fold increase in CMA plus targeted gene sequencing (13%) (p-value = 0.0009). WGS identified all rare clinically significant CNVs that were detected by CMA. In 26 patients, WGS revealed indel and missense mutations presenting in a dominant (63%) or a recessive (37%) manner. We found four subjects with mutations in at least two genes associated with distinct genetic disorders, including two cases harboring a pathogenic CNV and SNV. When considering medically actionable secondary findings in addition to primary WGS findings, 38% of patients would benefit from genetic counseling. Clinical implementation of WGS as a primary test will provide a higher diagnostic yield than conventional genetic testing and potentially reduce the time required to reach a genetic diagnosis.
Transterm—extended search facilities and improved integration with other databases
Jacobs, Grant H.; Stockwell, Peter A.; Tate, Warren P.; Brown, Chris M.
2006-01-01
Transterm has now been publicly available for >10 years. Major changes have been made since its last description in this database issue in 2002. The current database provides data for key regions of mRNA sequences, a curated database of mRNA motifs and tools to allow users to investigate their own motifs or mRNA sequences. The key mRNA regions database is derived computationally from Genbank. It contains 3′ and 5′ flanking regions, the initiation and termination signal context and coding sequence for annotated CDS features from Genbank and RefSeq. The database is non-redundant, enabling summary files and statistics to be prepared for each species. Advances include providing extended search facilities, the database may now be searched by BLAST in addition to regular expressions (patterns) allowing users to search for motifs such as known miRNA sequences, and the inclusion of RefSeq data. The database contains >40 motifs or structural patterns important for translational control. In this release, patterns from UTRsite and Rfam are also incorporated with cross-referencing. Users may search their sequence data with Transterm or user-defined patterns. The system is accessible at . PMID:16381889
Fast online and index-based algorithms for approximate search of RNA sequence-structure patterns
2013-01-01
Background It is well known that the search for homologous RNAs is more effective if both sequence and structure information is incorporated into the search. However, current tools for searching with RNA sequence-structure patterns cannot fully handle mutations occurring on both these levels or are simply not fast enough for searching large sequence databases because of the high computational costs of the underlying sequence-structure alignment problem. Results We present new fast index-based and online algorithms for approximate matching of RNA sequence-structure patterns supporting a full set of edit operations on single bases and base pairs. Our methods efficiently compute semi-global alignments of structural RNA patterns and substrings of the target sequence whose costs satisfy a user-defined sequence-structure edit distance threshold. For this purpose, we introduce a new computing scheme to optimally reuse the entries of the required dynamic programming matrices for all substrings and combine it with a technique for avoiding the alignment computation of non-matching substrings. Our new index-based methods exploit suffix arrays preprocessed from the target database and achieve running times that are sublinear in the size of the searched sequences. To support the description of RNA molecules that fold into complex secondary structures with multiple ordered sequence-structure patterns, we use fast algorithms for the local or global chaining of approximate sequence-structure pattern matches. The chaining step removes spurious matches from the set of intermediate results, in particular of patterns with little specificity. In benchmark experiments on the Rfam database, our improved online algorithm is faster than the best previous method by up to factor 45. Our best new index-based algorithm achieves a speedup of factor 560. Conclusions The presented methods achieve considerable speedups compared to the best previous method. This, together with the expected sublinear running time of the presented index-based algorithms, allows for the first time approximate matching of RNA sequence-structure patterns in large sequence databases. Beyond the algorithmic contributions, we provide with RaligNAtor a robust and well documented open-source software package implementing the algorithms presented in this manuscript. The RaligNAtor software is available at http://www.zbh.uni-hamburg.de/ralignator. PMID:23865810
Code of Federal Regulations, 2011 CFR
2011-04-01
... unsuccessful searches; utilization of Debt Collection Act. 2002.13 Section 2002.13 Housing and Urban... interest and for unsuccessful searches; utilization of Debt Collection Act. (a) Charging interest. HUD will... time will be assessed when the records requested are not found or when the records located are withheld...
SW#db: GPU-Accelerated Exact Sequence Similarity Database Search.
Korpar, Matija; Šošić, Martin; Blažeka, Dino; Šikić, Mile
2015-01-01
In recent years we have witnessed a growth in sequencing yield, the number of samples sequenced, and as a result-the growth of publicly maintained sequence databases. The increase of data present all around has put high requirements on protein similarity search algorithms with two ever-opposite goals: how to keep the running times acceptable while maintaining a high-enough level of sensitivity. The most time consuming step of similarity search are the local alignments between query and database sequences. This step is usually performed using exact local alignment algorithms such as Smith-Waterman. Due to its quadratic time complexity, alignments of a query to the whole database are usually too slow. Therefore, the majority of the protein similarity search methods prior to doing the exact local alignment apply heuristics to reduce the number of possible candidate sequences in the database. However, there is still a need for the alignment of a query sequence to a reduced database. In this paper we present the SW#db tool and a library for fast exact similarity search. Although its running times, as a standalone tool, are comparable to the running times of BLAST, it is primarily intended to be used for exact local alignment phase in which the database of sequences has already been reduced. It uses both GPU and CPU parallelization and was 4-5 times faster than SSEARCH, 6-25 times faster than CUDASW++ and more than 20 times faster than SSW at the time of writing, using multiple queries on Swiss-prot and Uniref90 databases.
Improve homology search sensitivity of PacBio data by correcting frameshifts.
Du, Nan; Sun, Yanni
2016-09-01
Single-molecule, real-time sequencing (SMRT) developed by Pacific BioSciences produces longer reads than secondary generation sequencing technologies such as Illumina. The long read length enables PacBio sequencing to close gaps in genome assembly, reveal structural variations, and identify gene isoforms with higher accuracy in transcriptomic sequencing. However, PacBio data has high sequencing error rate and most of the errors are insertion or deletion errors. During alignment-based homology search, insertion or deletion errors in genes will cause frameshifts and may only lead to marginal alignment scores and short alignments. As a result, it is hard to distinguish true alignments from random alignments and the ambiguity will incur errors in structural and functional annotation. Existing frameshift correction tools are designed for data with much lower error rate and are not optimized for PacBio data. As an increasing number of groups are using SMRT, there is an urgent need for dedicated homology search tools for PacBio data. In this work, we introduce Frame-Pro, a profile homology search tool for PacBio reads. Our tool corrects sequencing errors and also outputs the profile alignments of the corrected sequences against characterized protein families. We applied our tool to both simulated and real PacBio data. The results showed that our method enables more sensitive homology search, especially for PacBio data sets of low sequencing coverage. In addition, we can correct more errors when comparing with a popular error correction tool that does not rely on hybrid sequencing. The source code is freely available at https://sourceforge.net/projects/frame-pro/ yannisun@msu.edu. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
Practical and Efficient Searching in Proteomics: A Cross Engine Comparison
Paulo, Joao A.
2014-01-01
Background Analysis of large datasets produced by mass spectrometry-based proteomics relies on database search algorithms to sequence peptides and identify proteins. Several such scoring methods are available, each based on different statistical foundations and thereby not producing identical results. Here, the aim is to compare peptide and protein identifications using multiple search engines and examine the additional proteins gained by increasing the number of technical replicate analyses. Methods A HeLa whole cell lysate was analyzed on an Orbitrap mass spectrometer for 10 technical replicates. The data were combined and searched using Mascot, SEQUEST, and Andromeda. Comparisons were made of peptide and protein identifications among the search engines. In addition, searches using each engine were performed with incrementing number of technical replicates. Results The number and identity of peptides and proteins differed across search engines. For all three search engines, the differences in proteins identifications were greater than the differences in peptide identifications indicating that the major source of the disparity may be at the protein inference grouping level. The data also revealed that analysis of 2 technical replicates can increase protein identifications by up to 10-15%, while a third replicate results in an additional 4-5%. Conclusions The data emphasize two practical methods of increasing the robustness of mass spectrometry data analysis. The data show that 1) using multiple search engines can expand the number of identified proteins (union) and validate protein identifications (intersection), and 2) analysis of 2 or 3 technical replicates can substantially expand protein identifications. Moreover, information can be extracted from a dataset by performing database searching with different engines and performing technical repeats, which requires no additional sample preparation and effectively utilizes research time and effort. PMID:25346847
Practical and Efficient Searching in Proteomics: A Cross Engine Comparison.
Paulo, Joao A
2013-10-01
Analysis of large datasets produced by mass spectrometry-based proteomics relies on database search algorithms to sequence peptides and identify proteins. Several such scoring methods are available, each based on different statistical foundations and thereby not producing identical results. Here, the aim is to compare peptide and protein identifications using multiple search engines and examine the additional proteins gained by increasing the number of technical replicate analyses. A HeLa whole cell lysate was analyzed on an Orbitrap mass spectrometer for 10 technical replicates. The data were combined and searched using Mascot, SEQUEST, and Andromeda. Comparisons were made of peptide and protein identifications among the search engines. In addition, searches using each engine were performed with incrementing number of technical replicates. The number and identity of peptides and proteins differed across search engines. For all three search engines, the differences in proteins identifications were greater than the differences in peptide identifications indicating that the major source of the disparity may be at the protein inference grouping level. The data also revealed that analysis of 2 technical replicates can increase protein identifications by up to 10-15%, while a third replicate results in an additional 4-5%. The data emphasize two practical methods of increasing the robustness of mass spectrometry data analysis. The data show that 1) using multiple search engines can expand the number of identified proteins (union) and validate protein identifications (intersection), and 2) analysis of 2 or 3 technical replicates can substantially expand protein identifications. Moreover, information can be extracted from a dataset by performing database searching with different engines and performing technical repeats, which requires no additional sample preparation and effectively utilizes research time and effort.
Tachyon search speeds up retrieval of similar sequences by several orders of magnitude.
Tan, Joshua; Kuchibhatla, Durga; Sirota, Fernanda L; Sherman, Westley A; Gattermayer, Tobias; Kwoh, Chia Yee; Eisenhaber, Frank; Schneider, Georg; Maurer-Stroh, Sebastian
2012-06-15
The usage of current sequence search tools becomes increasingly slower as databases of protein sequences continue to grow exponentially. Tachyon, a new algorithm that identifies closely related protein sequences ~200 times faster than standard BLAST, circumvents this limitation with a reduced database and oligopeptide matching heuristic. The tool is publicly accessible as a webserver at http://tachyon.bii.a-star.edu.sg and can also be accessed programmatically through SOAP.
Raja, Huzefa A; Baker, Timothy R; Little, Jason G; Oberlies, Nicholas H
2017-01-01
One challenge in the dietary supplement industry is confirmation of species identity for processed raw materials, i.e. those modified by milling, drying, or extraction, which move through a multilevel supply chain before reaching the finished product. This is particularly difficult for samples containing fungal mycelia, where processing removes morphological characteristics, such that they do not present sufficient variation to differentiate species by traditional techniques. To address this issue, we have demonstrated the utility of DNA barcoding to verify the taxonomic identity of fungi found commonly in the food and dietary supplement industry; such data are critical for protecting consumer health, by assuring both safety and quality. By using DNA barcoding of nuclear ribosomal internal transcribed spacer (ITS) of the rRNA gene with fungal specific ITS primers, ITS barcodes were generated for 33 representative fungal samples, all of which could be used by consumers for food and/or dietary supplement purposes. In the majority of cases, we were able to sequence the ITS region from powdered mycelium samples, grocery store mushrooms, and capsules from commercial dietary supplements. After generating ITS barcodes utilizing standard procedures accepted by the Consortium for the Barcode of Life, we tested their utility by performing a BLAST search against authenticate published ITS sequences in GenBank. In some cases, we also downloaded published, homologous sequences of the ITS region of fungi inspected in this study and examined the phylogenetic relationships of barcoded fungal species in light of modern taxonomic and phylogenetic studies. We anticipate that these data will motivate discussions on DNA barcoding based species identification as applied to the verification/certification of mushroom-containing dietary supplements. Copyright © 2016 The Authors. Published by Elsevier Ltd.. All rights reserved.
G-Hash: Towards Fast Kernel-based Similarity Search in Large Graph Databases.
Wang, Xiaohong; Smalter, Aaron; Huan, Jun; Lushington, Gerald H
2009-01-01
Structured data including sets, sequences, trees and graphs, pose significant challenges to fundamental aspects of data management such as efficient storage, indexing, and similarity search. With the fast accumulation of graph databases, similarity search in graph databases has emerged as an important research topic. Graph similarity search has applications in a wide range of domains including cheminformatics, bioinformatics, sensor network management, social network management, and XML documents, among others.Most of the current graph indexing methods focus on subgraph query processing, i.e. determining the set of database graphs that contains the query graph and hence do not directly support similarity search. In data mining and machine learning, various graph kernel functions have been designed to capture the intrinsic similarity of graphs. Though successful in constructing accurate predictive and classification models for supervised learning, graph kernel functions have (i) high computational complexity and (ii) non-trivial difficulty to be indexed in a graph database.Our objective is to bridge graph kernel function and similarity search in graph databases by proposing (i) a novel kernel-based similarity measurement and (ii) an efficient indexing structure for graph data management. Our method of similarity measurement builds upon local features extracted from each node and their neighboring nodes in graphs. A hash table is utilized to support efficient storage and fast search of the extracted local features. Using the hash table, a graph kernel function is defined to capture the intrinsic similarity of graphs and for fast similarity query processing. We have implemented our method, which we have named G-hash, and have demonstrated its utility on large chemical graph databases. Our results show that the G-hash method achieves state-of-the-art performance for k-nearest neighbor (k-NN) classification. Most importantly, the new similarity measurement and the index structure is scalable to large database with smaller indexing size, faster indexing construction time, and faster query processing time as compared to state-of-the-art indexing methods such as C-tree, gIndex, and GraphGrep.
Renard, Bernhard Y.; Xu, Buote; Kirchner, Marc; Zickmann, Franziska; Winter, Dominic; Korten, Simone; Brattig, Norbert W.; Tzur, Amit; Hamprecht, Fred A.; Steen, Hanno
2012-01-01
Currently, the reliable identification of peptides and proteins is only feasible when thoroughly annotated sequence databases are available. Although sequencing capacities continue to grow, many organisms remain without reliable, fully annotated reference genomes required for proteomic analyses. Standard database search algorithms fail to identify peptides that are not exactly contained in a protein database. De novo searches are generally hindered by their restricted reliability, and current error-tolerant search strategies are limited by global, heuristic tradeoffs between database and spectral information. We propose a Bayesian information criterion-driven error-tolerant peptide search (BICEPS) and offer an open source implementation based on this statistical criterion to automatically balance the information of each single spectrum and the database, while limiting the run time. We show that BICEPS performs as well as current database search algorithms when such algorithms are applied to sequenced organisms, whereas BICEPS only uses a remotely related organism database. For instance, we use a chicken instead of a human database corresponding to an evolutionary distance of more than 300 million years (International Chicken Genome Sequencing Consortium (2004) Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature 432, 695–716). We demonstrate the successful application to cross-species proteomics with a 33% increase in the number of identified proteins for a filarial nematode sample of Litomosoides sigmodontis. PMID:22493179
Metagenomics and Bioinformatics in Microbial Ecology: Current Status and Beyond.
Hiraoka, Satoshi; Yang, Ching-Chia; Iwasaki, Wataru
2016-09-29
Metagenomic approaches are now commonly used in microbial ecology to study microbial communities in more detail, including many strains that cannot be cultivated in the laboratory. Bioinformatic analyses make it possible to mine huge metagenomic datasets and discover general patterns that govern microbial ecosystems. However, the findings of typical metagenomic and bioinformatic analyses still do not completely describe the ecology and evolution of microbes in their environments. Most analyses still depend on straightforward sequence similarity searches against reference databases. We herein review the current state of metagenomics and bioinformatics in microbial ecology and discuss future directions for the field. New techniques will allow us to go beyond routine analyses and broaden our knowledge of microbial ecosystems. We need to enrich reference databases, promote platforms that enable meta- or comprehensive analyses of diverse metagenomic datasets, devise methods that utilize long-read sequence information, and develop more powerful bioinformatic methods to analyze data from diverse perspectives.
Faint Debris Detection by Particle Based Track-Before-Detect Method
NASA Astrophysics Data System (ADS)
Uetsuhara, M.; Ikoma, N.
2014-09-01
This study proposes a particle method to detect faint debris, which is hardly seen in single frame, from an image sequence based on the concept of track-before-detect (TBD). The most widely used detection method is detect-before-track (DBT), which firstly detects signals of targets from single frame by distinguishing difference of intensity between foreground and background then associate the signals for each target between frames. DBT is capable of tracking bright targets but limited. DBT is necessary to consider presence of false signals and is difficult to recover from false association. On the other hand, TBD methods try to track targets without explicitly detecting the signals followed by evaluation of goodness of each track and obtaining detection results. TBD has an advantage over DBT in detecting weak signals around background level in single frame. However, conventional TBD methods for debris detection apply brute-force search over candidate tracks then manually select true one from the candidates. To reduce those significant drawbacks of brute-force search and not-fully automated process, this study proposes a faint debris detection algorithm by a particle based TBD method consisting of sequential update of target state and heuristic search of initial state. The state consists of position, velocity direction and magnitude, and size of debris over the image at a single frame. The sequential update process is implemented by a particle filter (PF). PF is an optimal filtering technique that requires initial distribution of target state as a prior knowledge. An evolutional algorithm (EA) is utilized to search the initial distribution. The EA iteratively applies propagation and likelihood evaluation of particles for the same image sequences and resulting set of particles is used as an initial distribution of PF. This paper describes the algorithm of the proposed faint debris detection method. The algorithm demonstrates performance on image sequences acquired during observation campaigns dedicated to GEO breakup fragments, which would contain a sufficient number of faint debris images. The results indicate the proposed method is capable of tracking faint debris with moderate computational costs at operational level.
Yap, Choon-Kong; Eisenhaber, Birgit; Eisenhaber, Frank; Wong, Wing-Cheong
2016-11-29
While the local-mode HMMER3 is notable for its massive speed improvement, the slower glocal-mode HMMER2 is more exact for domain annotation by enforcing full domain-to-sequence alignments. Since a unit of domain necessarily implies a unit of function, local-mode HMMER3 alone remains insufficient for precise function annotation tasks. In addition, the incomparable E-values for the same domain model by different HMMER builds create difficulty when checking for domain annotation consistency on a large-scale basis. In this work, both the speed of HMMER3 and glocal-mode alignment of HMMER2 are combined within the xHMMER3x2 framework for tackling the large-scale domain annotation task. Briefly, HMMER3 is utilized for initial domain detection so that HMMER2 can subsequently perform the glocal-mode, sequence-to-full-domain alignments for the detected HMMER3 hits. An E-value calibration procedure is required to ensure that the search space by HMMER2 is sufficiently replicated by HMMER3. We find that the latter is straightforwardly possible for ~80% of the models in the Pfam domain library (release 29). However in the case of the remaining ~20% of HMMER3 domain models, the respective HMMER2 counterparts are more sensitive. Thus, HMMER3 searches alone are insufficient to ensure sensitivity and a HMMER2-based search needs to be initiated. When tested on the set of UniProt human sequences, xHMMER3x2 can be configured to be between 7× and 201× faster than HMMER2, but with descending domain detection sensitivity from 99.8 to 95.7% with respect to HMMER2 alone; HMMER3's sensitivity was 95.7%. At extremes, xHMMER3x2 is either the slow glocal-mode HMMER2 or the fast HMMER3 with glocal-mode. Finally, the E-values to false-positive rates (FPR) mapping by xHMMER3x2 allows E-values of different model builds to be compared, so that any annotation discrepancies in a large-scale annotation exercise can be flagged for further examination by dissectHMMER. The xHMMER3x2 workflow allows large-scale domain annotation speed to be drastically improved over HMMER2 without compromising for domain-detection with regard to sensitivity and sequence-to-domain alignment incompleteness. The xHMMER3x2 code and its webserver (for Pfam release 27, 28 and 29) are freely available at http://xhmmer3x2.bii.a-star.edu.sg/ . Reviewed by Thomas Dandekar, L. Aravind, Oliviero Carugo and Shamil Sunyaev. For the full reviews, please go to the Reviewers' comments section.
Glycan fragment database: a database of PDB-based glycan 3D structures.
Jo, Sunhwan; Im, Wonpil
2013-01-01
The glycan fragment database (GFDB), freely available at http://www.glycanstructure.org, is a database of the glycosidic torsion angles derived from the glycan structures in the Protein Data Bank (PDB). Analogous to protein structure, the structure of an oligosaccharide chain in a glycoprotein, referred to as a glycan, can be characterized by the torsion angles of glycosidic linkages between relatively rigid carbohydrate monomeric units. Knowledge of accessible conformations of biologically relevant glycans is essential in understanding their biological roles. The GFDB provides an intuitive glycan sequence search tool that allows the user to search complex glycan structures. After a glycan search is complete, each glycosidic torsion angle distribution is displayed in terms of the exact match and the fragment match. The exact match results are from the PDB entries that contain the glycan sequence identical to the query sequence. The fragment match results are from the entries with the glycan sequence whose substructure (fragment) or entire sequence is matched to the query sequence, such that the fragment results implicitly include the influences from the nearby carbohydrate residues. In addition, clustering analysis based on the torsion angle distribution can be performed to obtain the representative structures among the searched glycan structures.
An intuitive graphical webserver for multiple-choice protein sequence search.
Banky, Daniel; Szalkai, Balazs; Grolmusz, Vince
2014-04-10
Every day tens of thousands of sequence searches and sequence alignment queries are submitted to webservers. The capitalized word "BLAST" becomes a verb, describing the act of performing sequence search and alignment. However, if one needs to search for sequences that contain, for example, two hydrophobic and three polar residues at five given positions, the query formation on the most frequently used webservers will be difficult. Some servers support the formation of queries with regular expressions, but most of the users are unfamiliar with their syntax. Here we present an intuitive, easily applicable webserver, the Protein Sequence Analysis server, that allows the formation of multiple choice queries by simply drawing the residues to their positions; if more than one residue are drawn to the same position, then they will be nicely stacked on the user interface, indicating the multiple choice at the given position. This computer-game-like interface is natural and intuitive, and the coloring of the residues makes possible to form queries requiring not just certain amino acids in the given positions, but also small nonpolar, negatively charged, hydrophobic, positively charged, or polar ones. The webserver is available at http://psa.pitgroup.org. Copyright © 2014 Elsevier B.V. All rights reserved.
WormBase ParaSite - a comprehensive resource for helminth genomics.
Howe, Kevin L; Bolt, Bruce J; Shafie, Myriam; Kersey, Paul; Berriman, Matthew
2017-07-01
The number of publicly available parasitic worm genome sequences has increased dramatically in the past three years, and research interest in helminth functional genomics is now quickly gathering pace in response to the foundation that has been laid by these collective efforts. A systematic approach to the organisation, curation, analysis and presentation of these data is clearly vital for maximising the utility of these data to researchers. We have developed a portal called WormBase ParaSite (http://parasite.wormbase.org) for interrogating helminth genomes on a large scale. Data from over 100 nematode and platyhelminth species are integrated, adding value by way of systematic and consistent functional annotation (e.g. protein domains and Gene Ontology terms), gene expression analysis (e.g. alignment of life-stage specific transcriptome data sets), and comparative analysis (e.g. orthologues and paralogues). We provide several ways of exploring the data, including genome browsers, genome and gene summary pages, text search, sequence search, a query wizard, bulk downloads, and programmatic interfaces. In this review, we provide an overview of the back-end infrastructure and analysis behind WormBase ParaSite, and the displays and tools available to users for interrogating helminth genomic data. Copyright © 2016 The Authors. Published by Elsevier B.V. All rights reserved.
WEB-server for search of a periodicity in amino acid and nucleotide sequences
NASA Astrophysics Data System (ADS)
E Frenkel, F.; Skryabin, K. G.; Korotkov, E. V.
2017-12-01
A new web server (http://victoria.biengi.ac.ru/splinter/login.php) was designed and developed to search for periodicity in nucleotide and amino acid sequences. The web server operation is based upon a new mathematical method of searching for multiple alignments, which is founded on the position weight matrices optimization, as well as on implementation of the two-dimensional dynamic programming. This approach allows the construction of multiple alignments of the indistinctly similar amino acid and nucleotide sequences that accumulated more than 1.5 substitutions per a single amino acid or a nucleotide without performing the sequences paired comparisons. The article examines the principles of the web server operation and two examples of studying amino acid and nucleotide sequences, as well as information that could be obtained using the web server.
Computationally mapping sequence space to understand evolutionary protein engineering.
Armstrong, Kathryn A; Tidor, Bruce
2008-01-01
Evolutionary protein engineering has been dramatically successful, producing a wide variety of new proteins with altered stability, binding affinity, and enzymatic activity. However, the success of such procedures is often unreliable, and the impact of the choice of protein, engineering goal, and evolutionary procedure is not well understood. We have created a framework for understanding aspects of the protein engineering process by computationally mapping regions of feasible sequence space for three small proteins using structure-based design protocols. We then tested the ability of different evolutionary search strategies to explore these sequence spaces. The results point to a non-intuitive relationship between the error-prone PCR mutation rate and the number of rounds of replication. The evolutionary relationships among feasible sequences reveal hub-like sequences that serve as particularly fruitful starting sequences for evolutionary search. Moreover, genetic recombination procedures were examined, and tradeoffs relating sequence diversity and search efficiency were identified. This framework allows us to consider the impact of protein structure on the allowed sequence space and therefore on the challenges that each protein presents to error-prone PCR and genetic recombination procedures.
EGenBio: A Data Management System for Evolutionary Genomics and Biodiversity
Nahum, Laila A; Reynolds, Matthew T; Wang, Zhengyuan O; Faith, Jeremiah J; Jonna, Rahul; Jiang, Zhi J; Meyer, Thomas J; Pollock, David D
2006-01-01
Background Evolutionary genomics requires management and filtering of large numbers of diverse genomic sequences for accurate analysis and inference on evolutionary processes of genomic and functional change. We developed Evolutionary Genomics and Biodiversity (EGenBio; ) to begin to address this. Description EGenBio is a system for manipulation and filtering of large numbers of sequences, integrating curated sequence alignments and phylogenetic trees, managing evolutionary analyses, and visualizing their output. EGenBio is organized into three conceptual divisions, Evolution, Genomics, and Biodiversity. The Genomics division includes tools for selecting pre-aligned sequences from different genes and species, and for modifying and filtering these alignments for further analysis. Species searches are handled through queries that can be modified based on a tree-based navigation system and saved. The Biodiversity division contains tools for analyzing individual sequences or sequence alignments, whereas the Evolution division contains tools involving phylogenetic trees. Alignments are annotated with analytical results and modification history using our PRAED format. A miscellaneous Tools section and Help framework are also available. EGenBio was developed around our comparative genomic research and a prototype database of mtDNA genomes. It utilizes MySQL-relational databases and dynamic page generation, and calls numerous custom programs. Conclusion EGenBio was designed to serve as a platform for tools and resources to ease combined analysis in evolution, genomics, and biodiversity. PMID:17118150
NASA Astrophysics Data System (ADS)
Singh, Ranjan Kumar; Rinawa, Moti Lal
2018-04-01
The residual stresses arising in fiber-reinforced laminates during their curing in closed molds lead to changes in the composites after their removal from the molds and cooling. One of these dimensional changes of angle sections is called springback. The parameters such as lay-up, stacking sequence, material system, cure temperature, thickness etc play important role in it. In present work, it is attempted to optimize lay-up and stacking sequence for maximization of flexural stiffness and minimization of springback angle. The search algorithms are employed to obtain best sequence through repair strategy such as swap. A new search algorithm, termed as lay-up search algorithm (LSA) is also proposed, which is an extension of permutation search algorithm (PSA). The efficacy of PSA and LSA is tested on the laminates with a range of lay-ups. A computer code is developed on MATLAB implementing the above schemes. Also, the strategies for multi objective optimization using search algorithms are suggested and tested.
SANSparallel: interactive homology search against Uniprot
Somervuo, Panu; Holm, Liisa
2015-01-01
Proteins evolve by mutations and natural selection. The network of sequence similarities is a rich source for mining homologous relationships that inform on protein structure and function. There are many servers available to browse the network of homology relationships but one has to wait up to a minute for results. The SANSparallel webserver provides protein sequence database searches with immediate response and professional alignment visualization by third-party software. The output is a list, pairwise alignment or stacked alignment of sequence-similar proteins from Uniprot, UniRef90/50, Swissprot or Protein Data Bank. The stacked alignments are viewed in Jalview or as sequence logos. The database search uses the suffix array neighborhood search (SANS) method, which has been re-implemented as a client-server, improved and parallelized. The method is extremely fast and as sensitive as BLAST above 50% sequence identity. Benchmarks show that the method is highly competitive compared to previously published fast database search programs: UBLAST, DIAMOND, LAST, LAMBDA, RAPSEARCH2 and BLAT. The web server can be accessed interactively or programmatically at http://ekhidna2.biocenter.helsinki.fi/cgi-bin/sans/sans.cgi. It can be used to make protein functional annotation pipelines more efficient, and it is useful in interactive exploration of the detailed evidence supporting the annotation of particular proteins of interest. PMID:25855811
RNA motif search with data-driven element ordering.
Rampášek, Ladislav; Jimenez, Randi M; Lupták, Andrej; Vinař, Tomáš; Brejová, Broňa
2016-05-18
In this paper, we study the problem of RNA motif search in long genomic sequences. This approach uses a combination of sequence and structure constraints to uncover new distant homologs of known functional RNAs. The problem is NP-hard and is traditionally solved by backtracking algorithms. We have designed a new algorithm for RNA motif search and implemented a new motif search tool RNArobo. The tool enhances the RNAbob descriptor language, allowing insertions in helices, which enables better characterization of ribozymes and aptamers. A typical RNA motif consists of multiple elements and the running time of the algorithm is highly dependent on their ordering. By approaching the element ordering problem in a principled way, we demonstrate more than 100-fold speedup of the search for complex motifs compared to previously published tools. We have developed a new method for RNA motif search that allows for a significant speedup of the search of complex motifs that include pseudoknots. Such speed improvements are crucial at a time when the rate of DNA sequencing outpaces growth in computing. RNArobo is available at http://compbio.fmph.uniba.sk/rnarobo .
Mixed Sequence Reader: A Program for Analyzing DNA Sequences with Heterozygous Base Calling
Chang, Chun-Tien; Tsai, Chi-Neu; Tang, Chuan Yi; Chen, Chun-Houh; Lian, Jang-Hau; Hu, Chi-Yu; Tsai, Chia-Lung; Chao, Angel; Lai, Chyong-Huey; Wang, Tzu-Hao; Lee, Yun-Shien
2012-01-01
The direct sequencing of PCR products generates heterozygous base-calling fluorescence chromatograms that are useful for identifying single-nucleotide polymorphisms (SNPs), insertion-deletions (indels), short tandem repeats (STRs), and paralogous genes. Indels and STRs can be easily detected using the currently available Indelligent or ShiftDetector programs, which do not search reference sequences. However, the detection of other genomic variants remains a challenge due to the lack of appropriate tools for heterozygous base-calling fluorescence chromatogram data analysis. In this study, we developed a free web-based program, Mixed Sequence Reader (MSR), which can directly analyze heterozygous base-calling fluorescence chromatogram data in .abi file format using comparisons with reference sequences. The heterozygous sequences are identified as two distinct sequences and aligned with reference sequences. Our results showed that MSR may be used to (i) physically locate indel and STR sequences and determine STR copy number by searching NCBI reference sequences; (ii) predict combinations of microsatellite patterns using the Federal Bureau of Investigation Combined DNA Index System (CODIS); (iii) determine human papilloma virus (HPV) genotypes by searching current viral databases in cases of double infections; (iv) estimate the copy number of paralogous genes, such as β-defensin 4 (DEFB4) and its paralog HSPDP3. PMID:22778697
Thanh Mai Pham, Le; Kim, Yong Hwan
2016-01-01
Using bioinformatic homology search tools, this study utilized sequence phylogeny, gene organization and conserved motifs to identify members of the family of O-methyltransferases from lignin-degrading fungus Phanerochaete chrysosporium. The heterologous expression and characterization of O-methyltransferases from P. chrysosporium were studied. The expressed protein utilized S-(5'-adenosyl)-L-methionine p-toluenesulfonate salt (SAM) and methylated various free-hydroxyl phenolic compounds at both meta and para site. In the same motif, O-methyltransferases were also identified in other white-rot fungi including Bjerkandera adusta, Ceriporiopsis (Gelatoporia) subvermispora B, and Trametes versicolor. As free-hydroxyl phenolic compounds have been known as inhibitors for lignin peroxidase, the presence of O-methyltransferases in white-rot fungi suggested their biological functions in accelerating lignin degradation in white-rot basidiomycetes by converting those inhibitory groups into non-toxic methylated phenolic ones. Copyright © 2015 Elsevier Inc. All rights reserved.
Shilov, Ignat V; Seymour, Sean L; Patel, Alpesh A; Loboda, Alex; Tang, Wilfred H; Keating, Sean P; Hunter, Christie L; Nuwaysir, Lydia M; Schaeffer, Daniel A
2007-09-01
The Paragon Algorithm, a novel database search engine for the identification of peptides from tandem mass spectrometry data, is presented. Sequence Temperature Values are computed using a sequence tag algorithm, allowing the degree of implication by an MS/MS spectrum of each region of a database to be determined on a continuum. Counter to conventional approaches, features such as modifications, substitutions, and cleavage events are modeled with probabilities rather than by discrete user-controlled settings to consider or not consider a feature. The use of feature probabilities in conjunction with Sequence Temperature Values allows for a very large increase in the effective search space with only a very small increase in the actual number of hypotheses that must be scored. The algorithm has a new kind of user interface that removes the user expertise requirement, presenting control settings in the language of the laboratory that are translated to optimal algorithmic settings. To validate this new algorithm, a comparison with Mascot is presented for a series of analogous searches to explore the relative impact of increasing search space probed with Mascot by relaxing the tryptic digestion conformance requirements from trypsin to semitrypsin to no enzyme and with the Paragon Algorithm using its Rapid mode and Thorough mode with and without tryptic specificity. Although they performed similarly for small search space, dramatic differences were observed in large search space. With the Paragon Algorithm, hundreds of biological and artifact modifications, all possible substitutions, and all levels of conformance to the expected digestion pattern can be searched in a single search step, yet the typical cost in search time is only 2-5 times that of conventional small search space. Despite this large increase in effective search space, there is no drastic loss of discrimination that typically accompanies the exploration of large search space.
Neuropeptidomics of the Mosquito Aedes Aegypti
2010-01-01
translational processing ( pyroglutamate formation) was detected for AST-C and CAPA-PVK-2. For the first time in insects, we succeeded in the direct...hormones, trace DNA sequences generated by TIGR and the Broad Institute were first searched by TBLASTN24 using amino acid sequences of candidate peptides...previously described.1 TBLASTN searches, using the amino acid sequences of putative Ae. aegypti neuropeptide and peptide hormone orthologs identified in
Techniques for automatic large scale change analysis of temporal multispectral imagery
NASA Astrophysics Data System (ADS)
Mercovich, Ryan A.
Change detection in remotely sensed imagery is a multi-faceted problem with a wide variety of desired solutions. Automatic change detection and analysis to assist in the coverage of large areas at high resolution is a popular area of research in the remote sensing community. Beyond basic change detection, the analysis of change is essential to provide results that positively impact an image analyst's job when examining potentially changed areas. Present change detection algorithms are geared toward low resolution imagery, and require analyst input to provide anything more than a simple pixel level map of the magnitude of change that has occurred. One major problem with this approach is that change occurs in such large volume at small spatial scales that a simple change map is no longer useful. This research strives to create an algorithm based on a set of metrics that performs a large area search for change in high resolution multispectral image sequences and utilizes a variety of methods to identify different types of change. Rather than simply mapping the magnitude of any change in the scene, the goal of this research is to create a useful display of the different types of change in the image. The techniques presented in this dissertation are used to interpret large area images and provide useful information to an analyst about small regions that have undergone specific types of change while retaining image context to make further manual interpretation easier. This analyst cueing to reduce information overload in a large area search environment will have an impact in the areas of disaster recovery, search and rescue situations, and land use surveys among others. By utilizing a feature based approach founded on applying existing statistical methods and new and existing topological methods to high resolution temporal multispectral imagery, a novel change detection methodology is produced that can automatically provide useful information about the change occurring in large area and high resolution image sequences. The change detection and analysis algorithm developed could be adapted to many potential image change scenarios to perform automatic large scale analysis of change.
Cumulative Weighing of Time in Intertemporal Tradeoffs
2016-01-01
We examine preferences for sequences of delayed monetary gains. In the experimental literature, two prominent models have been advanced as psychological descriptions of preferences for sequences. In one model, the instantaneous utilities of the outcomes in a sequence are discounted as a function of their delays, and assembled into a discounted utility of the sequence. In the other model, the accumulated utility of the outcomes in a sequence is considered along with utility or disutility from improvement in outcome utilities and utility or disutility from the spreading of outcome utilities. Drawing on three threads of evidence concerning preferences for sequences of monetary gains, we propose that the accumulated utility of the outcomes in a sequence is traded off against the duration of utility accumulation. In our first experiment, aggregate choice behavior provides qualitative support for the tradeoff model. In three subsequent experiments, one of which incentivized, disaggregate choice behavior provides quantitative support for the tradeoff model in Bayesian model contests. One thread of evidence motivating the tradeoff model is that, when, in the choice between two single dated outcomes, it is conveyed that receiving less sooner means receiving nothing later, preference for receiving more later increases, but when it is conveyed that receiving more later means receiving nothing sooner, preference is left unchanged. Our results show that this asymmetric hidden-zero effect is indeed driven by those supporting the tradeoff model. The tradeoff model also accommodates all remaining evidence on preferences for sequences of monetary gains. PMID:27560853
Cluster-Based Multipolling Sequencing Algorithm for Collecting RFID Data in Wireless LANs
NASA Astrophysics Data System (ADS)
Choi, Woo-Yong; Chatterjee, Mainak
2015-03-01
With the growing use of RFID (Radio Frequency Identification), it is becoming important to devise ways to read RFID tags in real time. Access points (APs) of IEEE 802.11-based wireless Local Area Networks (LANs) are being integrated with RFID networks that can efficiently collect real-time RFID data. Several schemes, such as multipolling methods based on the dynamic search algorithm and random sequencing, have been proposed. However, as the number of RFID readers associated with an AP increases, it becomes difficult for the dynamic search algorithm to derive the multipolling sequence in real time. Though multipolling methods can eliminate the polling overhead, we still need to enhance the performance of the multipolling methods based on random sequencing. To that extent, we propose a real-time cluster-based multipolling sequencing algorithm that drastically eliminates more than 90% of the polling overhead, particularly so when the dynamic search algorithm fails to derive the multipolling sequence in real time.
THGS: a web-based database of Transmembrane Helices in Genome Sequences
Fernando, S. A.; Selvarani, P.; Das, Soma; Kumar, Ch. Kiran; Mondal, Sukanta; Ramakumar, S.; Sekar, K.
2004-01-01
Transmembrane Helices in Genome Sequences (THGS) is an interactive web-based database, developed to search the transmembrane helices in the user-interested gene sequences available in the Genome Database (GDB). The proposed database has provision to search sequence motifs in transmembrane and globular proteins. In addition, the motif can be searched in the other sequence databases (Swiss-Prot and PIR) or in the macromolecular structure database, Protein Data Bank (PDB). Further, the 3D structure of the corresponding queried motif, if it is available in the solved protein structures deposited in the Protein Data Bank, can also be visualized using the widely used graphics package RASMOL. All the sequence databases used in the present work are updated frequently and hence the results produced are up to date. The database THGS is freely available via the world wide web and can be accessed at http://pranag.physics.iisc.ernet.in/thgs/ or http://144.16.71.10/thgs/. PMID:14681375
2017-01-01
Abstract Target search as performed by DNA-binding proteins is a complex process, in which multiple factors contribute to both thermodynamic discrimination of the target sequence from overwhelmingly abundant off-target sites and kinetic acceleration of dynamic sequence interrogation. TRF1, the protein that binds to telomeric tandem repeats, faces an intriguing variant of the search problem where target sites are clustered within short fragments of chromosomal DNA. In this study, we use extensive (>0.5 ms in total) MD simulations to study the dynamical aspects of sequence-specific binding of TRF1 at both telomeric and non-cognate DNA. For the first time, we describe the spontaneous formation of a sequence-specific native protein–DNA complex in atomistic detail, and study the mechanism by which proteins avoid off-target binding while retaining high affinity for target sites. Our calculated free energy landscapes reproduce the thermodynamics of sequence-specific binding, while statistical approaches allow for a comprehensive description of intermediate stages of complex formation. PMID:28633355
Establishing homologies in protein sequences
NASA Technical Reports Server (NTRS)
Dayhoff, M. O.; Barker, W. C.; Hunt, L. T.
1983-01-01
Computer-based statistical techniques used to determine homologies between proteins occurring in different species are reviewed. The technique is based on comparison of two protein sequences, either by relating all segments of a given length in one sequence to all segments of the second or by finding the best alignment of the two sequences. Approaches discussed include selection using printed tabulations, identification of very similar sequences, and computer searches of a database. The use of the SEARCH, RELATE, and ALIGN programs (Dayhoff, 1979) is explained; sample data are presented in graphs, diagrams, and tables and the construction of scoring matrices is considered.
Reinharz, Vladimir; Ponty, Yann; Waldispühl, Jérôme
2013-07-01
The design of RNA sequences folding into predefined secondary structures is a milestone for many synthetic biology and gene therapy studies. Most of the current software uses similar local search strategies (i.e. a random seed is progressively adapted to acquire the desired folding properties) and more importantly do not allow the user to control explicitly the nucleotide distribution such as the GC-content in their sequences. However, the latter is an important criterion for large-scale applications as it could presumably be used to design sequences with better transcription rates and/or structural plasticity. In this article, we introduce IncaRNAtion, a novel algorithm to design RNA sequences folding into target secondary structures with a predefined nucleotide distribution. IncaRNAtion uses a global sampling approach and weighted sampling techniques. We show that our approach is fast (i.e. running time comparable or better than local search methods), seedless (we remove the bias of the seed in local search heuristics) and successfully generates high-quality sequences (i.e. thermodynamically stable) for any GC-content. To complete this study, we develop a hybrid method combining our global sampling approach with local search strategies. Remarkably, our glocal methodology overcomes both local and global approaches for sampling sequences with a specific GC-content and target structure. IncaRNAtion is available at csb.cs.mcgill.ca/incarnation/. Supplementary data are available at Bioinformatics online.
Duan, Zhigui; Cao, Rui; Jiang, Liping; Liang, Songping
2013-01-14
In past years, spider venoms have attracted increasing attention due to their extraordinary chemical and pharmacological diversity. The recently popularized proteomic method highly improved our ability to analyze the proteins in the venom. However, the lack of information about isolated venom proteins sequences dramatically limits the ability to confidently identify venom proteins. In the present paper, the venom from Araneus ventricosus was analyzed using two complementary approaches: 2-DE/Shotgun-LC-MS/MS coupled to MASCOT search and 2-DE/Shotgun-LC-MS/MS coupled to manual de novo sequencing followed by local venom protein database (LVPD) search. The LVPD was constructed with toxin-like protein sequences obtained from the analysis of cDNA library from A. ventricosus venom glands. Our results indicate that a total of 130 toxin-like protein sequences were unambiguously identified by manual de novo sequencing coupled to LVPD search, accounting for 86.67% of all toxin-like proteins in LVPD. Thus manual de novo sequencing coupled to LVPD search was proved an extremely effective approach for the analysis of venom proteins. In addition, the approach displays impeccable advantage in validating mutant positions of isoforms from the same toxin-like family. Intriguingly, methyl esterifcation of glutamic acid was discovered for the first time in animal venom proteins by manual de novo sequencing. Crown Copyright © 2012. Published by Elsevier B.V. All rights reserved.
PARTS: Probabilistic Alignment for RNA joinT Secondary structure prediction
Harmanci, Arif Ozgun; Sharma, Gaurav; Mathews, David H.
2008-01-01
A novel method is presented for joint prediction of alignment and common secondary structures of two RNA sequences. The joint consideration of common secondary structures and alignment is accomplished by structural alignment over a search space defined by the newly introduced motif called matched helical regions. The matched helical region formulation generalizes previously employed constraints for structural alignment and thereby better accommodates the structural variability within RNA families. A probabilistic model based on pseudo free energies obtained from precomputed base pairing and alignment probabilities is utilized for scoring structural alignments. Maximum a posteriori (MAP) common secondary structures, sequence alignment and joint posterior probabilities of base pairing are obtained from the model via a dynamic programming algorithm called PARTS. The advantage of the more general structural alignment of PARTS is seen in secondary structure predictions for the RNase P family. For this family, the PARTS MAP predictions of secondary structures and alignment perform significantly better than prior methods that utilize a more restrictive structural alignment model. For the tRNA and 5S rRNA families, the richer structural alignment model of PARTS does not offer a benefit and the method therefore performs comparably with existing alternatives. For all RNA families studied, the posterior probability estimates obtained from PARTS offer an improvement over posterior probability estimates from a single sequence prediction. When considering the base pairings predicted over a threshold value of confidence, the combination of sensitivity and positive predictive value is superior for PARTS than for the single sequence prediction. PARTS source code is available for download under the GNU public license at http://rna.urmc.rochester.edu. PMID:18304945
PubDNA Finder: a web database linking full-text articles to sequences of nucleic acids.
García-Remesal, Miguel; Cuevas, Alejandro; Pérez-Rey, David; Martín, Luis; Anguita, Alberto; de la Iglesia, Diana; de la Calle, Guillermo; Crespo, José; Maojo, Víctor
2010-11-01
PubDNA Finder is an online repository that we have created to link PubMed Central manuscripts to the sequences of nucleic acids appearing in them. It extends the search capabilities provided by PubMed Central by enabling researchers to perform advanced searches involving sequences of nucleic acids. This includes, among other features (i) searching for papers mentioning one or more specific sequences of nucleic acids and (ii) retrieving the genetic sequences appearing in different articles. These additional query capabilities are provided by a searchable index that we created by using the full text of the 176 672 papers available at PubMed Central at the time of writing and the sequences of nucleic acids appearing in them. To automatically extract the genetic sequences occurring in each paper, we used an original method we have developed. The database is updated monthly by automatically connecting to the PubMed Central FTP site to retrieve and index new manuscripts. Users can query the database via the web interface provided. PubDNA Finder can be freely accessed at http://servet.dia.fi.upm.es:8080/pubdnafinder
SA-Search: a web tool for protein structure mining based on a Structural Alphabet
Guyon, Frédéric; Camproux, Anne-Claude; Hochez, Joëlle; Tufféry, Pierre
2004-01-01
SA-Search is a web tool that can be used to mine for protein structures and extract structural similarities. It is based on a hidden Markov model derived Structural Alphabet (SA) that allows the compression of three-dimensional (3D) protein conformations into a one-dimensional (1D) representation using a limited number of prototype conformations. Using such a representation, classical methods developed for amino acid sequences can be employed. Currently, SA-Search permits the performance of fast 3D similarity searches such as the extraction of exact words using a suffix tree approach, and the search for fuzzy words viewed as a simple 1D sequence alignment problem. SA-Search is available at http://bioserv.rpbs.jussieu.fr/cgi-bin/SA-Search. PMID:15215446
SA-Search: a web tool for protein structure mining based on a Structural Alphabet.
Guyon, Frédéric; Camproux, Anne-Claude; Hochez, Joëlle; Tufféry, Pierre
2004-07-01
SA-Search is a web tool that can be used to mine for protein structures and extract structural similarities. It is based on a hidden Markov model derived Structural Alphabet (SA) that allows the compression of three-dimensional (3D) protein conformations into a one-dimensional (1D) representation using a limited number of prototype conformations. Using such a representation, classical methods developed for amino acid sequences can be employed. Currently, SA-Search permits the performance of fast 3D similarity searches such as the extraction of exact words using a suffix tree approach, and the search for fuzzy words viewed as a simple 1D sequence alignment problem. SA-Search is available at http://bioserv.rpbs.jussieu.fr/cgi-bin/SA-Search.
Evaluating the protein coding potential of exonized transposable element sequences
Piriyapongsa, Jittima; Rutledge, Mark T; Patel, Sanil; Borodovsky, Mark; Jordan, I King
2007-01-01
Background Transposable element (TE) sequences, once thought to be merely selfish or parasitic members of the genomic community, have been shown to contribute a wide variety of functional sequences to their host genomes. Analysis of complete genome sequences have turned up numerous cases where TE sequences have been incorporated as exons into mRNAs, and it is widely assumed that such 'exonized' TEs encode protein sequences. However, the extent to which TE-derived sequences actually encode proteins is unknown and a matter of some controversy. We have tried to address this outstanding issue from two perspectives: i-by evaluating ascertainment biases related to the search methods used to uncover TE-derived protein coding sequences (CDS) and ii-through a probabilistic codon-frequency based analysis of the protein coding potential of TE-derived exons. Results We compared the ability of three classes of sequence similarity search methods to detect TE-derived sequences among data sets of experimentally characterized proteins: 1-a profile-based hidden Markov model (HMM) approach, 2-BLAST methods and 3-RepeatMasker. Profile based methods are more sensitive and more selective than the other methods evaluated. However, the application of profile-based search methods to the detection of TE-derived sequences among well-curated experimentally characterized protein data sets did not turn up many more cases than had been previously detected and nowhere near as many cases as recent genome-wide searches have. We observed that the different search methods used were complementary in the sense that they yielded largely non-overlapping sets of hits and differed in their ability to recover known cases of TE-derived CDS. The probabilistic analysis of TE-derived exon sequences indicates that these sequences have low protein coding potential on average. In particular, non-autonomous TEs that do not encode protein sequences, such as Alu elements, are frequently exonized but unlikely to encode protein sequences. Conclusion The exaptation of the numerous TE sequences found in exons as bona fide protein coding sequences may prove to be far less common than has been suggested by the analysis of complete genomes. We hypothesize that many exonized TE sequences actually function as post-transcriptional regulators of gene expression, rather than coding sequences, which may act through a variety of double stranded RNA related regulatory pathways. Indeed, their relatively high copy numbers and similarity to sequences dispersed throughout the genome suggests that exonized TE sequences could serve as master regulators with a wide scope of regulatory influence. Reviewers: This article was reviewed by Itai Yanai, Kateryna D. Makova, Melissa Wilson (nominated by Kateryna D. Makova) and Cedric Feschotte (nominated by John M. Logsdon Jr.). PMID:18036258
SANSparallel: interactive homology search against Uniprot.
Somervuo, Panu; Holm, Liisa
2015-07-01
Proteins evolve by mutations and natural selection. The network of sequence similarities is a rich source for mining homologous relationships that inform on protein structure and function. There are many servers available to browse the network of homology relationships but one has to wait up to a minute for results. The SANSparallel webserver provides protein sequence database searches with immediate response and professional alignment visualization by third-party software. The output is a list, pairwise alignment or stacked alignment of sequence-similar proteins from Uniprot, UniRef90/50, Swissprot or Protein Data Bank. The stacked alignments are viewed in Jalview or as sequence logos. The database search uses the suffix array neighborhood search (SANS) method, which has been re-implemented as a client-server, improved and parallelized. The method is extremely fast and as sensitive as BLAST above 50% sequence identity. Benchmarks show that the method is highly competitive compared to previously published fast database search programs: UBLAST, DIAMOND, LAST, LAMBDA, RAPSEARCH2 and BLAT. The web server can be accessed interactively or programmatically at http://ekhidna2.biocenter.helsinki.fi/cgi-bin/sans/sans.cgi. It can be used to make protein functional annotation pipelines more efficient, and it is useful in interactive exploration of the detailed evidence supporting the annotation of particular proteins of interest. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
Jagtap, Pratik; Goslinga, Jill; Kooren, Joel A; McGowan, Thomas; Wroblewski, Matthew S; Seymour, Sean L; Griffin, Timothy J
2013-04-01
Large databases (>10(6) sequences) used in metaproteomic and proteogenomic studies present challenges in matching peptide sequences to MS/MS data using database-search programs. Most notably, strict filtering to avoid false-positive matches leads to more false negatives, thus constraining the number of peptide matches. To address this challenge, we developed a two-step method wherein matches derived from a primary search against a large database were used to create a smaller subset database. The second search was performed against a target-decoy version of this subset database merged with a host database. High confidence peptide sequence matches were then used to infer protein identities. Applying our two-step method for both metaproteomic and proteogenomic analysis resulted in twice the number of high confidence peptide sequence matches in each case, as compared to the conventional one-step method. The two-step method captured almost all of the same peptides matched by the one-step method, with a majority of the additional matches being false negatives from the one-step method. Furthermore, the two-step method improved results regardless of the database search program used. Our results show that our two-step method maximizes the peptide matching sensitivity for applications requiring large databases, especially valuable for proteogenomics and metaproteomics studies. © 2013 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Real-Time Ligand Binding Pocket Database Search Using Local Surface Descriptors
Chikhi, Rayan; Sael, Lee; Kihara, Daisuke
2010-01-01
Due to the increasing number of structures of unknown function accumulated by ongoing structural genomics projects, there is an urgent need for computational methods for characterizing protein tertiary structures. As functions of many of these proteins are not easily predicted by conventional sequence database searches, a legitimate strategy is to utilize structure information in function characterization. Of a particular interest is prediction of ligand binding to a protein, as ligand molecule recognition is a major part of molecular function of proteins. Predicting whether a ligand molecule binds a protein is a complex problem due to the physical nature of protein-ligand interactions and the flexibility of both binding sites and ligand molecules. However, geometric and physicochemical complementarity is observed between the ligand and its binding site in many cases. Therefore, ligand molecules which bind to a local surface site in a protein can be predicted by finding similar local pockets of known binding ligands in the structure database. Here, we present two representations of ligand binding pockets and utilize them for ligand binding prediction by pocket shape comparison. These representations are based on mapping of surface properties of binding pockets, which are compactly described either by the two dimensional pseudo-Zernike moments or the 3D Zernike descriptors. These compact representations allow a fast real-time pocket searching against a database. Thorough benchmark study employing two different datasets show that our representations are competitive with the other existing methods. Limitations and potentials of the shape-based methods as well as possible improvements are discussed. PMID:20455259
Real-time ligand binding pocket database search using local surface descriptors.
Chikhi, Rayan; Sael, Lee; Kihara, Daisuke
2010-07-01
Because of the increasing number of structures of unknown function accumulated by ongoing structural genomics projects, there is an urgent need for computational methods for characterizing protein tertiary structures. As functions of many of these proteins are not easily predicted by conventional sequence database searches, a legitimate strategy is to utilize structure information in function characterization. Of particular interest is prediction of ligand binding to a protein, as ligand molecule recognition is a major part of molecular function of proteins. Predicting whether a ligand molecule binds a protein is a complex problem due to the physical nature of protein-ligand interactions and the flexibility of both binding sites and ligand molecules. However, geometric and physicochemical complementarity is observed between the ligand and its binding site in many cases. Therefore, ligand molecules which bind to a local surface site in a protein can be predicted by finding similar local pockets of known binding ligands in the structure database. Here, we present two representations of ligand binding pockets and utilize them for ligand binding prediction by pocket shape comparison. These representations are based on mapping of surface properties of binding pockets, which are compactly described either by the two-dimensional pseudo-Zernike moments or the three-dimensional Zernike descriptors. These compact representations allow a fast real-time pocket searching against a database. Thorough benchmark studies employing two different datasets show that our representations are competitive with the other existing methods. Limitations and potentials of the shape-based methods as well as possible improvements are discussed.
Henderson, James B.; Sellas, Anna B.; Fuchs, Jérôme; Bowie, Rauri C.K.; Dumbacher, John P.
2017-01-01
We report here the successful assembly of the complete mitochondrial genomes of the northern spotted owl (Strix occidentalis caurina) and the barred owl (S. varia). We utilized sequence data from two sequencing methodologies, Illumina paired-end sequence data with insert lengths ranging from approximately 250 nucleotides (nt) to 9,600 nt and read lengths from 100–375 nt and Sanger-derived sequences. We employed multiple assemblers and alignment methods to generate the final assemblies. The circular genomes of S. o. caurina and S. varia are comprised of 19,948 nt and 18,975 nt, respectively. Both code for two rRNAs, twenty-two tRNAs, and thirteen polypeptides. They both have duplicated control region sequences with complex repeat structures. We were not able to assemble the control regions solely using Illumina paired-end sequence data. By fully spanning the control regions, Sanger-derived sequences enabled accurate and complete assembly of these mitochondrial genomes. These are the first complete mitochondrial genome sequences of owls (Aves: Strigiformes) possessing duplicated control regions. We searched the nuclear genome of S. o. caurina for copies of mitochondrial genes and found at least nine separate stretches of nuclear copies of gene sequences originating in the mitochondrial genome (Numts). The Numts ranged from 226–19,522 nt in length and included copies of all mitochondrial genes except tRNAPro, ND6, and tRNAGlu. Strix occidentalis caurina and S. varia exhibited an average of 10.74% (8.68% uncorrected p-distance) divergence across the non-tRNA mitochondrial genes. PMID:29038757
Mahajan, Gaurang; Mande, Shekhar C
2017-04-04
A comprehensive map of the human-M. tuberculosis (MTB) protein interactome would help fill the gaps in our understanding of the disease, and computational prediction can aid and complement experimental studies towards this end. Several sequence-based in silico approaches tap the existing data on experimentally validated protein-protein interactions (PPIs); these PPIs serve as templates from which novel interactions between pathogen and host are inferred. Such comparative approaches typically make use of local sequence alignment, which, in the absence of structural details about the interfaces mediating the template interactions, could lead to incorrect inferences, particularly when multi-domain proteins are involved. We propose leveraging the domain-domain interaction (DDI) information in PDB complexes to score and prioritize candidate PPIs between host and pathogen proteomes based on targeted sequence-level comparisons. Our method picks out a small set of human-MTB protein pairs as candidates for physical interactions, and the use of functional meta-data suggests that some of them could contribute to the in vivo molecular cross-talk between pathogen and host that regulates the course of the infection. Further, we present numerical data for Pfam domain families that highlights interaction specificity on the domain level. Not every instance of a pair of domains, for which interaction evidence has been found in a few instances (i.e. structures), is likely to functionally interact. Our sorting approach scores candidates according to how "distant" they are in sequence space from known examples of DDIs (templates). Thus, it provides a natural way to deal with the heterogeneity in domain-level interactions. Our method represents a more informed application of local alignment to the sequence-based search for potential human-microbial interactions that uses available PPI data as a prior. Our approach is somewhat limited in its sensitivity by the restricted size and diversity of the template dataset, but, given the rapid accumulation of solved protein complex structures, its scope and utility are expected to keep steadily improving.
The Giardia genome project database.
McArthur, A G; Morrison, H G; Nixon, J E; Passamaneck, N Q; Kim, U; Hinkle, G; Crocker, M K; Holder, M E; Farr, R; Reich, C I; Olsen, G E; Aley, S B; Adam, R D; Gillin, F D; Sogin, M L
2000-08-15
The Giardia genome project database provides an online resource for Giardia lamblia (WB strain, clone C6) genome sequence information. The database includes edited single-pass reads, the results of BLASTX searches, and details of progress towards sequencing the entire 12 million-bp Giardia genome. Pre-sorted BLASTX results can be retrieved based on keyword searches and BLAST searches of the high throughput Giardia data can be initiated from the web site or through NCBI. Descriptions of the genomic DNA libraries, project protocols and summary statistics are also available. Although the Giardia genome project is ongoing, new sequences are made available on a bi-monthly basis to ensure that researchers have access to information that may assist them in the search for genes and their biological function. The current URL of the Giardia genome project database is www.mbl.edu/Giardia.
Yasui, Yasuo; Hirakawa, Hideki; Ueno, Mariko; Matsui, Katsuhiro; Katsube-Tanaka, Tomoyuki; Yang, Soo Jung; Aii, Jotaro; Sato, Shingo; Mori, Masashi
2016-01-01
Buckwheat (Fagopyrum esculentum Moench; 2n = 2x = 16) is a nutritionally dense annual crop widely grown in temperate zones. To accelerate molecular breeding programmes of this important crop, we generated a draft assembly of the buckwheat genome using short reads obtained by next-generation sequencing (NGS), and constructed the Buckwheat Genome DataBase. After assembling short reads, we determined 387,594 scaffolds as the draft genome sequence (FES_r1.0). The total length of FES_r1.0 was 1,177,687,305 bp, and the N50 of the scaffolds was 25,109 bp. Gene prediction analysis revealed 286,768 coding sequences (CDSs; FES_r1.0_cds) including those related to transposable elements. The total length of FES_r1.0_cds was 212,917,911 bp, and the N50 was 1,101 bp. Of these, the functions of 35,816 CDSs excluding those for transposable elements were annotated by BLAST analysis. To demonstrate the utility of the database, we conducted several test analyses using BLAST and keyword searches. Furthermore, we used the draft genome as a reference sequence for NGS-based markers, and successfully identified novel candidate genes controlling heteromorphic self-incompatibility of buckwheat. The database and draft genome sequence provide a valuable resource that can be used in efforts to develop buckwheat cultivars with superior agronomic traits. PMID:27037832
CMD: a Cotton Microsatellite Database resource for Gossypium genomics
Blenda, Anna; Scheffler, Jodi; Scheffler, Brian; Palmer, Michael; Lacape, Jean-Marc; Yu, John Z; Jesudurai, Christopher; Jung, Sook; Muthukumar, Sriram; Yellambalase, Preetham; Ficklin, Stephen; Staton, Margaret; Eshelman, Robert; Ulloa, Mauricio; Saha, Sukumar; Burr, Ben; Liu, Shaolin; Zhang, Tianzhen; Fang, Deqiu; Pepper, Alan; Kumpatla, Siva; Jacobs, John; Tomkins, Jeff; Cantrell, Roy; Main, Dorrie
2006-01-01
Background The Cotton Microsatellite Database (CMD) is a curated and integrated web-based relational database providing centralized access to publicly available cotton microsatellites, an invaluable resource for basic and applied research in cotton breeding. Description At present CMD contains publication, sequence, primer, mapping and homology data for nine major cotton microsatellite projects, collectively representing 5,484 microsatellites. In addition, CMD displays data for three of the microsatellite projects that have been screened against a panel of core germplasm. The standardized panel consists of 12 diverse genotypes including genetic standards, mapping parents, BAC donors, subgenome representatives, unique breeding lines, exotic introgression sources, and contemporary Upland cottons with significant acreage. A suite of online microsatellite data mining tools are accessible at CMD. These include an SSR server which identifies microsatellites, primers, open reading frames, and GC-content of uploaded sequences; BLAST and FASTA servers providing sequence similarity searches against the existing cotton SSR sequences and primers, a CAP3 server to assemble EST sequences into longer transcripts prior to mining for SSRs, and CMap, a viewer for comparing cotton SSR maps. Conclusion The collection of publicly available cotton SSR markers in a centralized, readily accessible and curated web-enabled database provides a more efficient utilization of microsatellite resources and will help accelerate basic and applied research in molecular breeding and genetic mapping in Gossypium spp. PMID:16737546
Fujibuchi, Wataru; Anderson, John S. J.; Landsman, David
2001-01-01
Consensus pattern and matrix-based searches designed to predict cis-acting transcriptional regulatory sequences have historically been subject to large numbers of false positives. We sought to decrease false positives by incorporating expression profile data into a consensus pattern-based search method. We have systematically analyzed the expression phenotypes of over 6000 yeast genes, across 121 expression profile experiments, and correlated them with the distribution of 14 known regulatory elements over sequences upstream of the genes. Our method is based on a metric we term probabilistic element assessment (PEA), which is a ranking of potential sites based on sequence similarity in the upstream regions of genes with similar expression phenotypes. For eight of the 14 known elements that we examined, our method had a much higher selectivity than a naïve consensus pattern search. Based on our analysis, we have developed a web-based tool called PROSPECT, which allows consensus pattern-based searching of gene clusters obtained from microarray data. PMID:11574681
Gentle Masking of Low-Complexity Sequences Improves Homology Search
Frith, Martin C.
2011-01-01
Detection of sequences that are homologous, i.e. descended from a common ancestor, is a fundamental task in computational biology. This task is confounded by low-complexity tracts (such as atatatatatat), which arise frequently and independently, causing strong similarities that are not homologies. There has been much research on identifying low-complexity tracts, but little research on how to treat them during homology search. We propose to find homologies by aligning sequences with “gentle” masking of low-complexity tracts. Gentle masking means that the match score involving a masked letter is , where is the unmasked score. Gentle masking slightly but noticeably improves the sensitivity of homology search (compared to “harsh” masking), without harming specificity. We show examples in three useful homology search problems: detection of NUMTs (nuclear copies of mitochondrial DNA), recruitment of metagenomic DNA reads to reference genomes, and pseudogene detection. Gentle masking is currently the best way to treat low-complexity tracts during homology search. PMID:22205972
Universal Keyword Classifier on Public Key Based Encrypted Multikeyword Fuzzy Search in Public Cloud
Munisamy, Shyamala Devi; Chokkalingam, Arun
2015-01-01
Cloud computing has pioneered the emerging world by manifesting itself as a service through internet and facilitates third party infrastructure and applications. While customers have no visibility on how their data is stored on service provider's premises, it offers greater benefits in lowering infrastructure costs and delivering more flexibility and simplicity in managing private data. The opportunity to use cloud services on pay-per-use basis provides comfort for private data owners in managing costs and data. With the pervasive usage of internet, the focus has now shifted towards effective data utilization on the cloud without compromising security concerns. In the pursuit of increasing data utilization on public cloud storage, the key is to make effective data access through several fuzzy searching techniques. In this paper, we have discussed the existing fuzzy searching techniques and focused on reducing the searching time on the cloud storage server for effective data utilization. Our proposed Asymmetric Classifier Multikeyword Fuzzy Search method provides classifier search server that creates universal keyword classifier for the multiple keyword request which greatly reduces the searching time by learning the search path pattern for all the keywords in the fuzzy keyword set. The objective of using BTree fuzzy searchable index is to resolve typos and representation inconsistencies and also to facilitate effective data utilization. PMID:26380364
Munisamy, Shyamala Devi; Chokkalingam, Arun
2015-01-01
Cloud computing has pioneered the emerging world by manifesting itself as a service through internet and facilitates third party infrastructure and applications. While customers have no visibility on how their data is stored on service provider's premises, it offers greater benefits in lowering infrastructure costs and delivering more flexibility and simplicity in managing private data. The opportunity to use cloud services on pay-per-use basis provides comfort for private data owners in managing costs and data. With the pervasive usage of internet, the focus has now shifted towards effective data utilization on the cloud without compromising security concerns. In the pursuit of increasing data utilization on public cloud storage, the key is to make effective data access through several fuzzy searching techniques. In this paper, we have discussed the existing fuzzy searching techniques and focused on reducing the searching time on the cloud storage server for effective data utilization. Our proposed Asymmetric Classifier Multikeyword Fuzzy Search method provides classifier search server that creates universal keyword classifier for the multiple keyword request which greatly reduces the searching time by learning the search path pattern for all the keywords in the fuzzy keyword set. The objective of using BTree fuzzy searchable index is to resolve typos and representation inconsistencies and also to facilitate effective data utilization.
Real-time Automatic Search for Multi-wavelength Counterparts of DWF Transients
NASA Astrophysics Data System (ADS)
Murphy, Christopher; Cucchiara, Antonino; Andreoni, Igor; Cooke, Jeff; Hegarty, Sarah
2018-01-01
The Deeper Wider Faster (DWF) survey aims to find and classify the fastest transients in the Universe. DWF utilizes the Dark Energy Camera (DECam), collecting a continuous sequence of 20s images over a 3 square degree field of view.Once an interesting transient is detected during DWF observations, the DWF collaboration has access to several facilities for rapid follow-up in multiple wavelengths (from gamma to radio).An online web tool has been designed to help with real-time visual classification of possible astrophysical transients in data collected by the DWF observing program. The goal of this project is to create a python-based code to improve the classification process by querying several existing archive databases. Given the DWF transient location and search radius, the developed code will extract a list of possible counterparts and all available information (e.g. magnitude, radio fluxes, distance separation).Thanks to this tool, the human classifier can make a quicker decision in order to trigger the collaboration rapid-response resources.
Liu, Yongchao; Maskell, Douglas L; Schmidt, Bertil
2009-01-01
Background The Smith-Waterman algorithm is one of the most widely used tools for searching biological sequence databases due to its high sensitivity. Unfortunately, the Smith-Waterman algorithm is computationally demanding, which is further compounded by the exponential growth of sequence databases. The recent emergence of many-core architectures, and their associated programming interfaces, provides an opportunity to accelerate sequence database searches using commonly available and inexpensive hardware. Findings Our CUDASW++ implementation (benchmarked on a single-GPU NVIDIA GeForce GTX 280 graphics card and a dual-GPU GeForce GTX 295 graphics card) provides a significant performance improvement compared to other publicly available implementations, such as SWPS3, CBESW, SW-CUDA, and NCBI-BLAST. CUDASW++ supports query sequences of length up to 59K and for query sequences ranging in length from 144 to 5,478 in Swiss-Prot release 56.6, the single-GPU version achieves an average performance of 9.509 GCUPS with a lowest performance of 9.039 GCUPS and a highest performance of 9.660 GCUPS, and the dual-GPU version achieves an average performance of 14.484 GCUPS with a lowest performance of 10.660 GCUPS and a highest performance of 16.087 GCUPS. Conclusion CUDASW++ is publicly available open-source software. It provides a significant performance improvement for Smith-Waterman-based protein sequence database searches by fully exploiting the compute capability of commonly used CUDA-enabled low-cost GPUs. PMID:19416548
The annotation-enriched non-redundant patent sequence databases.
Li, Weizhong; Kondratowicz, Bartosz; McWilliam, Hamish; Nauche, Stephane; Lopez, Rodrigo
2013-01-01
The EMBL-European Bioinformatics Institute (EMBL-EBI) offers public access to patent sequence data, providing a valuable service to the intellectual property and scientific communities. The non-redundant (NR) patent sequence databases comprise two-level nucleotide and protein sequence clusters (NRNL1, NRNL2, NRPL1 and NRPL2) based on sequence identity (level-1) and patent family (level-2). Annotation from the source entries in these databases is merged and enhanced with additional information from the patent literature and biological context. Corrections in patent publication numbers, kind-codes and patent equivalents significantly improve the data quality. Data are available through various user interfaces including web browser, downloads via FTP, SRS, Dbfetch and EBI-Search. Sequence similarity/homology searches against the databases are available using BLAST, FASTA and PSI-Search. In this article, we describe the data collection and annotation and also outline major changes and improvements introduced since 2009. Apart from data growth, these changes include additional annotation for singleton clusters, the identifier versioning for tracking entry change and the entry mappings between the two-level databases. Database URL: http://www.ebi.ac.uk/patentdata/nr/
The Annotation-enriched non-redundant patent sequence databases
Li, Weizhong; Kondratowicz, Bartosz; McWilliam, Hamish; Nauche, Stephane; Lopez, Rodrigo
2013-01-01
The EMBL-European Bioinformatics Institute (EMBL-EBI) offers public access to patent sequence data, providing a valuable service to the intellectual property and scientific communities. The non-redundant (NR) patent sequence databases comprise two-level nucleotide and protein sequence clusters (NRNL1, NRNL2, NRPL1 and NRPL2) based on sequence identity (level-1) and patent family (level-2). Annotation from the source entries in these databases is merged and enhanced with additional information from the patent literature and biological context. Corrections in patent publication numbers, kind-codes and patent equivalents significantly improve the data quality. Data are available through various user interfaces including web browser, downloads via FTP, SRS, Dbfetch and EBI-Search. Sequence similarity/homology searches against the databases are available using BLAST, FASTA and PSI-Search. In this article, we describe the data collection and annotation and also outline major changes and improvements introduced since 2009. Apart from data growth, these changes include additional annotation for singleton clusters, the identifier versioning for tracking entry change and the entry mappings between the two-level databases. Database URL: http://www.ebi.ac.uk/patentdata/nr/ PMID:23396323
Taffe, Michael A.; Taffe, William J.
2011-01-01
Several nonhuman primate species have been reported to employ a distance-minimizing, traveling salesman-like, strategy during foraging as well as in experimental spatial search tasks involving lesser amounts of locomotion. Spatial sequencing may optimize performance by reducing reference or episodic memory loads, locomotor costs, competition or other demands. A computerized self-ordered spatial search (SOSS) memory task has been adapted from a human neuropsychological testing battery (CANTAB, Cambridge Cognition, Ltd) for use in monkeys. Accurate completion of a trial requires sequential responses to colored boxes in two or more spatial locations without repetition of a previous location. Marmosets have been reported to employ a circling pattern of search, suggesting spontaneous adoption of a strategy to reduce working memory load. In this study the SOSS performance of rhesus monkeys was assessed to determine if the use of a distance-minimizing search path enhances accuracy. A novel strategy score, independent of the trial difficulty and arrangement of boxes, has been devised. Analysis of the performance of 21 monkeys trained on SOSS over two years shows that a distance-minimizing search strategy is associated with improved accuracy. This effect is observed within individuals as they improve over many cumulative sessions of training on the task and across individuals at any given level of training. Erroneous trials were associated with a failure to deploy the strategy. It is concluded that the effect of utilizing the strategy on this locomotion-free, laboratory task is to enhance accuracy by reducing demands on spatial working memory resources. PMID:21840507
Epistemic Search Sequences in Peer Interaction in a Content-Based Language Classroom
ERIC Educational Resources Information Center
Jakonen, Teppo; Morton, Tom
2015-01-01
Epistemics in interaction refers to how participants display, manage, and orient to their own and others' states of knowledge. This article applies recent conversation analytical work on epistemics to classrooms where language and content instruction are combined. It focuses on Epistemic Search Sequences (ESSs) through which students in peer…
Tachyon search speeds up retrieval of similar sequences by several orders of magnitude
Tan, Joshua; Kuchibhatla, Durga; Sirota, Fernanda L.; Sherman, Westley A.; Gattermayer, Tobias; Kwoh, Chia Yee; Eisenhaber, Frank; Schneider, Georg; Maurer-Stroh, Sebastian
2012-01-01
Summary: The usage of current sequence search tools becomes increasingly slower as databases of protein sequences continue to grow exponentially. Tachyon, a new algorithm that identifies closely related protein sequences ~200 times faster than standard BLAST, circumvents this limitation with a reduced database and oligopeptide matching heuristic. Availability and implementation: The tool is publicly accessible as a webserver at http://tachyon.bii.a-star.edu.sg and can also be accessed programmatically through SOAP. Contact: sebastianms@bii.a-star.edu.sg Supplementary information: Supplementary data are available at the Bioinformatics online. PMID:22531216
SEQATOMS: a web tool for identifying missing regions in PDB in sequence context.
Brandt, Bernd W; Heringa, Jaap; Leunissen, Jack A M
2008-07-01
With over 46 000 proteins, the Protein Data Bank (PDB) is the most important database with structural information of biological macromolecules. PDB files contain sequence and coordinate information. Residues present in the sequence can be absent from the coordinate section, which means their position in space is unknown. Similarity searches are routinely carried out against sequences taken from PDB SEQRES. However, there no distinction is made between residues that have a known or unknown position in the 3D protein structure. We present a FASTA sequence database that is produced by combining the sequence and coordinate information. All residues absent from the PDB coordinate section are masked with lower-case letters, thereby providing a view of these residues in the context of the entire protein sequence, which facilitates inspecting 'missing' regions. We also provide a masked version of the CATH domain database. A user-friendly BLAST interface is available for similarity searching. In contrast to standard (stand-alone) BLAST output, which only contains upper-case letters, our output retains the lower-case letters of the masked regions. Thus, our server can be used to perform BLAST searching case-sensitively. Here, we have applied it to the study of missing regions in their sequence context. SEQATOMS is available at http://www.bioinformatics.nl/tools/seqatoms/.
An improved stochastic fractal search algorithm for 3D protein structure prediction.
Zhou, Changjun; Sun, Chuan; Wang, Bin; Wang, Xiaojun
2018-05-03
Protein structure prediction (PSP) is a significant area for biological information research, disease treatment, and drug development and so on. In this paper, three-dimensional structures of proteins are predicted based on the known amino acid sequences, and the structure prediction problem is transformed into a typical NP problem by an AB off-lattice model. This work applies a novel improved Stochastic Fractal Search algorithm (ISFS) to solve the problem. The Stochastic Fractal Search algorithm (SFS) is an effective evolutionary algorithm that performs well in exploring the search space but falls into local minimums sometimes. In order to avoid the weakness, Lvy flight and internal feedback information are introduced in ISFS. In the experimental process, simulations are conducted by ISFS algorithm on Fibonacci sequences and real peptide sequences. Experimental results prove that the ISFS performs more efficiently and robust in terms of finding the global minimum and avoiding getting stuck in local minimums.
Bindewald, Eckart; Grunewald, Calvin; Boyle, Brett; O'Connor, Mary; Shapiro, Bruce A
2008-10-01
One approach to designing RNA nanoscale structures is to use known RNA structural motifs such as junctions, kissing loops or bulges and to construct a molecular model by connecting these building blocks with helical struts. We previously developed an algorithm for detecting internal loops, junctions and kissing loops in RNA structures. Here we present algorithms for automating or assisting many of the steps that are involved in creating RNA structures from building blocks: (1) assembling building blocks into nanostructures using either a combinatorial search or constraint satisfaction; (2) optimizing RNA 3D ring structures to improve ring closure; (3) sequence optimisation; (4) creating a unique non-degenerate RNA topology descriptor. This effectively creates a computational pipeline for generating molecular models of RNA nanostructures and more specifically RNA ring structures with optimized sequences from RNA building blocks. We show several examples of how the algorithms can be utilized to generate RNA tecto-shapes.
Bindewald, Eckart; Grunewald, Calvin; Boyle, Brett; O’Connor, Mary; Shapiro, Bruce A.
2013-01-01
One approach to designing RNA nanoscale structures is to use known RNA structural motifs such as junctions, kissing loops or bulges and to construct a molecular model by connecting these building blocks with helical struts. We previously developed an algorithm for detecting internal loops, junctions and kissing loops in RNA structures. Here we present algorithms for automating or assisting many of the steps that are involved in creating RNA structures from building blocks: (1) assembling building blocks into nanostructures using either a combinatorial search or constraint satisfaction; (2) optimizing RNA 3D ring structures to improve ring closure; (3) sequence optimisation; (4) creating a unique non-degenerate RNA topology descriptor. This effectively creates a computational pipeline for generating molecular models of RNA nanostructures and more specifically RNA ring structures with optimized sequences from RNA building blocks. We show several examples of how the algorithms can be utilized to generate RNA tecto-shapes. PMID:18838281
Guarnieri, Michael T.; Nag, Ambarish; Smolinski, Sharon L.; Darzins, Al; Seibert, Michael; Pienkos, Philip T.
2011-01-01
Biofuels derived from algal lipids represent an opportunity to dramatically impact the global energy demand for transportation fuels. Systems biology analyses of oleaginous algae could greatly accelerate the commercialization of algal-derived biofuels by elucidating the key components involved in lipid productivity and leading to the initiation of hypothesis-driven strain-improvement strategies. However, higher-level systems biology analyses, such as transcriptomics and proteomics, are highly dependent upon available genomic sequence data, and the lack of these data has hindered the pursuit of such analyses for many oleaginous microalgae. In order to examine the triacylglycerol biosynthetic pathway in the unsequenced oleaginous microalga, Chlorella vulgaris, we have established a strategy with which to bypass the necessity for genomic sequence information by using the transcriptome as a guide. Our results indicate an upregulation of both fatty acid and triacylglycerol biosynthetic machinery under oil-accumulating conditions, and demonstrate the utility of a de novo assembled transcriptome as a search model for proteomic analysis of an unsequenced microalga. PMID:22043295
Radiation resistance of biological reagents for in situ life detection.
Carr, Christopher E; Rowedder, Holli; Vafadari, Cyrus; Lui, Clarissa S; Cascio, Ethan; Zuber, Maria T; Ruvkun, Gary
2013-01-01
Life on Mars, if it exists, may share a common ancestry with life on Earth derived from meteoritic transfer of microbes between the planets. One means to test this hypothesis is to isolate, detect, and sequence nucleic acids in situ on Mars, then search for similarities to known common features of life on Earth. Such an instrument would require biological and chemical components, such as polymerase and fluorescent dye molecules. We show that reagents necessary for detection and sequencing of DNA survive several analogues of the radiation expected during a 2-year mission to Mars, including proton (H-1), heavy ion (Fe-56, O-18), and neutron bombardment. Some reagents have reduced performance or fail at higher doses. Overall, our findings suggest it is feasible to utilize space instruments with biological components, particularly for mission durations of up to several years in environments without large accumulations of charged particles, such as the surface of Mars, and have implications for the meteoritic transfer of microbes between planets.
Gacesa, Ranko; Zucko, Jurica; Petursdottir, Solveig K; Gudmundsdottir, Elisabet Eik; Fridjonsson, Olafur H; Diminic, Janko; Long, Paul F; Cullum, John; Hranueli, Daslav; Hreggvidsson, Gudmundur O; Starcevic, Antonio
2017-06-01
The MEGGASENSE platform constructs relational databases of DNA or protein sequences. The default functional analysis uses 14 106 hidden Markov model (HMM) profiles based on sequences in the KEGG database. The Solr search engine allows sophisticated queries and a BLAST search function is also incorporated. These standard capabilities were used to generate the SCATT database from the predicted proteome of Streptomyces cattleya . The implementation of a specialised metagenome database (AMYLOMICS) for bioprospecting of carbohydrate-modifying enzymes is described. In addition to standard assembly of reads, a novel 'functional' assembly was developed, in which screening of reads with the HMM profiles occurs before the assembly. The AMYLOMICS database incorporates additional HMM profiles for carbohydrate-modifying enzymes and it is illustrated how the combination of HMM and BLAST analyses helps identify interesting genes. A variety of different proteome and metagenome databases have been generated by MEGGASENSE.
Kadumuri, Rajashekar Varma; Vadrevu, Ramakrishna
2017-10-01
Due to their crucial role in function, folding, and stability, protein loops are being targeted for grafting/designing to create novel or alter existing functionality and improve stability and foldability. With a view to facilitate a thorough analysis and effectual search options for extracting and comparing loops for sequence and structural compatibility, we developed, LoopX a comprehensively compiled library of sequence and conformational features of ∼700,000 loops from protein structures. The database equipped with a graphical user interface is empowered with diverse query tools and search algorithms, with various rendering options to visualize the sequence- and structural-level information along with hydrogen bonding patterns, backbone φ, ψ dihedral angles of both the target and candidate loops. Two new features (i) conservation of the polar/nonpolar environment and (ii) conservation of sequence and conformation of specific residues within the loops have also been incorporated in the search and retrieval of compatible loops for a chosen target loop. Thus, the LoopX server not only serves as a database and visualization tool for sequence and structural analysis of protein loops but also aids in extracting and comparing candidate loops for a given target loop based on user-defined search options.
Sirius PSB: a generic system for analysis of biological sequences.
Koh, Chuan Hock; Lin, Sharene; Jedd, Gregory; Wong, Limsoon
2009-12-01
Computational tools are essential components of modern biological research. For example, BLAST searches can be used to identify related proteins based on sequence homology, or when a new genome is sequenced, prediction models can be used to annotate functional sites such as transcription start sites, translation initiation sites and polyadenylation sites and to predict protein localization. Here we present Sirius Prediction Systems Builder (PSB), a new computational tool for sequence analysis, classification and searching. Sirius PSB has four main operations: (1) Building a classifier, (2) Deploying a classifier, (3) Search for proteins similar to query proteins, (4) Preliminary and post-prediction analysis. Sirius PSB supports all these operations via a simple and interactive graphical user interface. Besides being a convenient tool, Sirius PSB has also introduced two novelties in sequence analysis. Firstly, genetic algorithm is used to identify interesting features in the feature space. Secondly, instead of the conventional method of searching for similar proteins via sequence similarity, we introduced searching via features' similarity. To demonstrate the capabilities of Sirius PSB, we have built two prediction models - one for the recognition of Arabidopsis polyadenylation sites and another for the subcellular localization of proteins. Both systems are competitive against current state-of-the-art models based on evaluation of public datasets. More notably, the time and effort required to build each model is greatly reduced with the assistance of Sirius PSB. Furthermore, we show that under certain conditions when BLAST is unable to find related proteins, Sirius PSB can identify functionally related proteins based on their biophysical similarities. Sirius PSB and its related supplements are available at: http://compbio.ddns.comp.nus.edu.sg/~sirius.
MICA: desktop software for comprehensive searching of DNA databases
Stokes, William A; Glick, Benjamin S
2006-01-01
Background Molecular biologists work with DNA databases that often include entire genomes. A common requirement is to search a DNA database to find exact matches for a nondegenerate or partially degenerate query. The software programs available for such purposes are normally designed to run on remote servers, but an appealing alternative is to work with DNA databases stored on local computers. We describe a desktop software program termed MICA (K-Mer Indexing with Compact Arrays) that allows large DNA databases to be searched efficiently using very little memory. Results MICA rapidly indexes a DNA database. On a Macintosh G5 computer, the complete human genome could be indexed in about 5 minutes. The indexing algorithm recognizes all 15 characters of the DNA alphabet and fully captures the information in any DNA sequence, yet for a typical sequence of length L, the index occupies only about 2L bytes. The index can be searched to return a complete list of exact matches for a nondegenerate or partially degenerate query of any length. A typical search of a long DNA sequence involves reading only a small fraction of the index into memory. As a result, searches are fast even when the available RAM is limited. Conclusion MICA is suitable as a search engine for desktop DNA analysis software. PMID:17018144
Smith, Hadley Stevens; Swint, J Michael; Lalani, Seema R; Yamal, Jose-Miguel; de Oliveira Otto, Marcia C; Castellanos, Stephan; Taylor, Amy; Lee, Brendan H; Russell, Heidi V
2018-05-14
Availability of clinical genomic sequencing (CGS) has generated questions about the value of genome and exome sequencing as a diagnostic tool. Analysis of reported CGS application can inform uptake and direct further research. This scoping literature review aims to synthesize evidence on the clinical and economic impact of CGS. PubMed, Embase, and Cochrane were searched for peer-reviewed articles published between 2009 and 2017 on diagnostic CGS for infant and pediatric patients. Articles were classified according to sample size and whether economic evaluation was a primary research objective. Data on patient characteristics, clinical setting, and outcomes were extracted and narratively synthesized. Of 171 included articles, 131 were case reports, 40 were aggregate analyses, and 4 had a primary economic evaluation aim. Diagnostic yield was the only consistently reported outcome. Median diagnostic yield in aggregate analyses was 33.2% but varied by broad clinical categories and test type. Reported CGS use has rapidly increased and spans diverse clinical settings and patient phenotypes. Economic evaluations support the cost-saving potential of diagnostic CGS. Multidisciplinary implementation research, including more robust outcome measurement and economic evaluation, is needed to demonstrate clinical utility and cost-effectiveness of CGS.
Protein structural similarity search by Ramachandran codes
Lo, Wei-Cheng; Huang, Po-Jung; Chang, Chih-Hung; Lyu, Ping-Chiang
2007-01-01
Background Protein structural data has increased exponentially, such that fast and accurate tools are necessary to access structure similarity search. To improve the search speed, several methods have been designed to reduce three-dimensional protein structures to one-dimensional text strings that are then analyzed by traditional sequence alignment methods; however, the accuracy is usually sacrificed and the speed is still unable to match sequence similarity search tools. Here, we aimed to improve the linear encoding methodology and develop efficient search tools that can rapidly retrieve structural homologs from large protein databases. Results We propose a new linear encoding method, SARST (Structural similarity search Aided by Ramachandran Sequential Transformation). SARST transforms protein structures into text strings through a Ramachandran map organized by nearest-neighbor clustering and uses a regenerative approach to produce substitution matrices. Then, classical sequence similarity search methods can be applied to the structural similarity search. Its accuracy is similar to Combinatorial Extension (CE) and works over 243,000 times faster, searching 34,000 proteins in 0.34 sec with a 3.2-GHz CPU. SARST provides statistically meaningful expectation values to assess the retrieved information. It has been implemented into a web service and a stand-alone Java program that is able to run on many different platforms. Conclusion As a database search method, SARST can rapidly distinguish high from low similarities and efficiently retrieve homologous structures. It demonstrates that the easily accessible linear encoding methodology has the potential to serve as a foundation for efficient protein structural similarity search tools. These search tools are supposed applicable to automated and high-throughput functional annotations or predictions for the ever increasing number of published protein structures in this post-genomic era. PMID:17716377
HOWDY: an integrated database system for human genome research
Hirakawa, Mika
2002-01-01
HOWDY is an integrated database system for accessing and analyzing human genomic information (http://www-alis.tokyo.jst.go.jp/HOWDY/). HOWDY stores information about relationships between genetic objects and the data extracted from a number of databases. HOWDY consists of an Internet accessible user interface that allows thorough searching of the human genomic databases using the gene symbols and their aliases. It also permits flexible editing of the sequence data. The database can be searched using simple words and the search can be restricted to a specific cytogenetic location. Linear maps displaying markers and genes on contig sequences are available, from which an object can be chosen. Any search starting point identifies all the information matching the query. HOWDY provides a convenient search environment of human genomic data for scientists unsure which database is most appropriate for their search. PMID:11752279
Discovery of Escherichia coli CRISPR sequences in an undergraduate laboratory.
Militello, Kevin T; Lazatin, Justine C
2017-05-01
Clustered regularly interspaced short palindromic repeats (CRISPRs) represent a novel type of adaptive immune system found in eubacteria and archaebacteria. CRISPRs have recently generated a lot of attention due to their unique ability to catalog foreign nucleic acids, their ability to destroy foreign nucleic acids in a mechanism that shares some similarity to RNA interference, and the ability to utilize reconstituted CRISPR systems for genome editing in numerous organisms. In order to introduce CRISPR biology into an undergraduate upper-level laboratory, a five-week set of exercises was designed to allow students to examine the CRISPR status of uncharacterized Escherichia coli strains and to allow the discovery of new repeats and spacers. Students started the project by isolating genomic DNA from E. coli and amplifying the iap CRISPR locus using the polymerase chain reaction (PCR). The PCR products were analyzed by Sanger DNA sequencing, and the sequences were examined for the presence of CRISPR repeat sequences. The regions between the repeats, the spacers, were extracted and analyzed with BLASTN searches. Overall, CRISPR loci were sequenced from several previously uncharacterized E. coli strains and one E. coli K-12 strain. Sanger DNA sequencing resulted in the discovery of 36 spacer sequences and their corresponding surrounding repeat sequences. Five of the spacers were homologous to foreign (non-E. coli) DNA. Assessment of the laboratory indicates that improvements were made in the ability of students to answer questions relating to the structure and function of CRISPRs. Future directions of the laboratory are presented and discussed. © 2016 by The International Union of Biochemistry and Molecular Biology, 45(3):262-269, 2017. © 2016 The International Union of Biochemistry and Molecular Biology.
Nilsson, R Henrik; Kristiansson, Erik; Ryberg, Martin; Larsson, Karl-Henrik
2005-07-18
During the last few years, DNA sequence analysis has become one of the primary means of taxonomic identification of species, particularly so for species that are minute or otherwise lack distinct, readily obtainable morphological characters. Although the number of sequences available for comparison in public databases such as GenBank increases exponentially, only a minuscule fraction of all organisms have been sequenced, leaving taxon sampling a momentous problem for sequence-based taxonomic identification. When querying GenBank with a set of unidentified sequences, a considerable proportion typically lack fully identified matches, forming an ever-mounting pile of sequences that the researcher will have to monitor manually in the hope that new, clarifying sequences have been submitted by other researchers. To alleviate these concerns, a project to automatically monitor select unidentified sequences in GenBank for taxonomic progress through repeated local BLAST searches was initiated. Mycorrhizal fungi--a field where species identification often is prohibitively complex--and the much used ITS locus were chosen as test bed. A Perl script package called emerencia is presented. On a regular basis, it downloads select sequences from GenBank, separates the identified sequences from those insufficiently identified, and performs BLAST searches between these two datasets, storing all results in an SQL database. On the accompanying web-service http://emerencia.math.chalmers.se, users can monitor the taxonomic progress of insufficiently identified sequences over time, either through active searches or by signing up for e-mail notification upon disclosure of better matches. Other search categories, such as listing all insufficiently identified sequences (and their present best fully identified matches) publication-wise, are also available. The ever-increasing use of DNA sequences for identification purposes largely falls back on the assumption that public sequence databases contain a thorough sampling of taxonomically well-annotated sequences. Taxonomy, held by some to be an old-fashioned trade, has accordingly never been more important. emerencia does not automate the taxonomic process, but it does allow researchers to focus their efforts elsewhere than countless manual BLAST runs and arduous sieving of BLAST hit lists. The emerencia system is available on an open source basis for local installation with any organism and gene group as targets.
White, Ryen W; Horvitz, Eric
2014-01-01
Objective To better understand the relationship between online health-seeking behaviors and in-world healthcare utilization (HU) by studies of online search and access activities before and after queries that pursue medical professionals and facilities. Materials and methods We analyzed data collected from logs of online searches gathered from consenting users of a browser toolbar from Microsoft (N=9740). We employed a complementary survey (N=489) to seek a deeper understanding of information-gathering, reflection, and action on the pursuit of professional healthcare. Results We provide insights about HU through the survey, breaking out its findings by different respondent marginalizations as appropriate. Observations made from search logs may be explained by trends observed in our survey responses, even though the user populations differ. Discussion The results provide insights about how users decide if and when to utilize healthcare resources, and how online health information seeking transitions to in-world HU. The findings from both the survey and the logs reveal behavioral patterns and suggest a strong relationship between search behavior and HU. Although the diversity of our survey respondents is limited and we cannot be certain that users visited medical facilities, we demonstrate that it may be possible to infer HU from long-term search behavior by the apparent influence that health concerns and professional advice have on search activity. Conclusions Our findings highlight different phases of online activities around queries pursuing professional healthcare facilities and services. We also show that it may be possible to infer HU from logs without tracking people's physical location, based on the effect of HU on pre- and post-HU search behavior. This allows search providers and others to develop more robust models of interests and preferences by modeling utilization rather than simply the intention to utilize that is expressed in search queries. PMID:23666794
USDA-ARS?s Scientific Manuscript database
A bacterial artificial chromosome (BAC) library and BAC-end sequences for Gossypium hirsutum L. have recently been developed. Here we report on genomic-based genome-wide SNP mining utilizing re-sequencing data with a BAC-end sequence reference for twelve G. hirsutum L. lines, one G. barbadense L. li...
SinEx DB: a database for single exon coding sequences in mammalian genomes.
Jorquera, Roddy; Ortiz, Rodrigo; Ossandon, F; Cárdenas, Juan Pablo; Sepúlveda, Rene; González, Carolina; Holmes, David S
2016-01-01
Eukaryotic genes are typically interrupted by intragenic, noncoding sequences termed introns. However, some genes lack introns in their coding sequence (CDS) and are generally known as 'single exon genes' (SEGs). In this work, a SEG is defined as a nuclear, protein-coding gene that lacks introns in its CDS. Whereas, many public databases of Eukaryotic multi-exon genes are available, there are only two specialized databases for SEGs. The present work addresses the need for a more extensive and diverse database by creating SinEx DB, a publicly available, searchable database of predicted SEGs from 10 completely sequenced mammalian genomes including human. SinEx DB houses the DNA and protein sequence information of these SEGs and includes their functional predictions (KOG) and the relative distribution of these functions within species. The information is stored in a relational database built with My SQL Server 5.1.33 and the complete dataset of SEG sequences and their functional predictions are available for downloading. SinEx DB can be interrogated by: (i) a browsable phylogenetic schema, (ii) carrying out BLAST searches to the in-house SinEx DB of SEGs and (iii) via an advanced search mode in which the database can be searched by key words and any combination of searches by species and predicted functions. SinEx DB provides a rich source of information for advancing our understanding of the evolution and function of SEGs.Database URL: www.sinex.cl. © The Author(s) 2016. Published by Oxford University Press.
Salehi, Mojtaba; Bahreininejad, Ardeshir
2011-08-01
Optimization of process planning is considered as the key technology for computer-aided process planning which is a rather complex and difficult procedure. A good process plan of a part is built up based on two elements: (1) the optimized sequence of the operations of the part; and (2) the optimized selection of the machine, cutting tool and Tool Access Direction (TAD) for each operation. In the present work, the process planning is divided into preliminary planning, and secondary/detailed planning. In the preliminary stage, based on the analysis of order and clustering constraints as a compulsive constraint aggregation in operation sequencing and using an intelligent searching strategy, the feasible sequences are generated. Then, in the detailed planning stage, using the genetic algorithm which prunes the initial feasible sequences, the optimized operation sequence and the optimized selection of the machine, cutting tool and TAD for each operation based on optimization constraints as an additive constraint aggregation are obtained. The main contribution of this work is the optimization of sequence of the operations of the part, and optimization of machine selection, cutting tool and TAD for each operation using the intelligent search and genetic algorithm simultaneously.
Salehi, Mojtaba
2010-01-01
Optimization of process planning is considered as the key technology for computer-aided process planning which is a rather complex and difficult procedure. A good process plan of a part is built up based on two elements: (1) the optimized sequence of the operations of the part; and (2) the optimized selection of the machine, cutting tool and Tool Access Direction (TAD) for each operation. In the present work, the process planning is divided into preliminary planning, and secondary/detailed planning. In the preliminary stage, based on the analysis of order and clustering constraints as a compulsive constraint aggregation in operation sequencing and using an intelligent searching strategy, the feasible sequences are generated. Then, in the detailed planning stage, using the genetic algorithm which prunes the initial feasible sequences, the optimized operation sequence and the optimized selection of the machine, cutting tool and TAD for each operation based on optimization constraints as an additive constraint aggregation are obtained. The main contribution of this work is the optimization of sequence of the operations of the part, and optimization of machine selection, cutting tool and TAD for each operation using the intelligent search and genetic algorithm simultaneously. PMID:21845020
Katahira, Satoshi; Muramoto, Nobuhiko; Moriya, Shigeharu; Nagura, Risa; Tada, Nobuki; Yasutani, Noriko; Ohkuma, Moriya; Onishi, Toru; Tokuhiro, Kenro
2017-01-01
The yeast Saccharomyces cerevisiae , a promising host for lignocellulosic bioethanol production, is unable to metabolize xylose. In attempts to confer xylose utilization ability in S. cerevisiae , a number of xylose isomerase (XI) genes have been expressed heterologously in this yeast. Although several of these XI encoding genes were functionally expressed in S. cerevisiae , the need still exists for a S. cerevisiae strain with improved xylose utilization ability for use in the commercial production of bioethanol. Although currently much effort has been devoted to achieve the objective, one of the solutions is to search for a new XI gene that would confer superior xylose utilization in S. cerevisiae . Here, we searched for novel XI genes from the protists residing in the hindgut of the termite Reticulitermes speratus . Eight novel XI genes were obtained from a cDNA library, prepared from the protists of the R. speratus hindgut, by PCR amplification using degenerated primers based on highly conserved regions of amino acid sequences of different XIs. Phylogenetic analysis classified these cloned XIs into two groups, one showed relatively high similarities to Bacteroidetes and the other was comparatively similar to Firmicutes . The growth rate and the xylose consumption rate of the S. cerevisiae strain expressing the novel XI, which exhibited highest XI activity among the eight XIs, were superior to those exhibited by the strain expressing the XI gene from Piromyces sp. E2. Substitution of the asparagine residue at position 337 of the novel XI with a cysteine further improved the xylose utilization ability of the yeast strain. Interestingly, introducing point mutations in the corresponding asparagine residues in XIs originated from other organisms, such as Piromyces sp. E2 or Clostridium phytofermentans , similarly improved xylose utilization in S. cerevisiae . A novel XI gene conferring superior xylose utilization in S. cerevisiae was successfully isolated from the protists in the termite hindgut. Isolation of this XI gene and identification of the point mutation described in this study might contribute to improving the productivity of industrial bioethanol.
Using search engine query data to track pharmaceutical utilization: a study of statins.
Schuster, Nathaniel M; Rogers, Mary A M; McMahon, Laurence F
2010-08-01
To examine temporal and geographic associations between Google queries for health information and healthcare utilization benchmarks. Retrospective longitudinal study. Using Google Trends and Google Insights for Search data, the search terms Lipitor (atorvastatin calcium; Pfizer, Ann Arbor, MI) and simvastatin were evaluated for change over time and for association with Lipitor revenues. The relationship between query data and community-based resource use per Medicare beneficiary was assessed for 35 US metropolitan areas. Google queries for Lipitor significantly decreased from January 2004 through June 2009 and queries for simvastatin significantly increased (P <.001 for both), particularly after Lipitor came off patent (P <.001 for change in slope). The mean number of Google queries for Lipitor correlated (r = 0.98) with the percentage change in Lipitor global revenues from 2004 to 2008 (P <.001). Query preference for Lipitor over simvastatin was positively associated (r = 0.40) with a community's use of Medicare services. For every 1% increase in utilization of Medicare services in a community, there was a 0.2-unit increase in the ratio of Lipitor queries to simvastatin queries in that community (P = .02). Specific search engine queries for medical information correlate with pharmaceutical revenue and with overall healthcare utilization in a community. This suggests that search query data can track community-wide characteristics in healthcare utilization and have the potential for informing payers and policy makers regarding trends in utilization.
compomics-utilities: an open-source Java library for computational proteomics.
Barsnes, Harald; Vaudel, Marc; Colaert, Niklaas; Helsens, Kenny; Sickmann, Albert; Berven, Frode S; Martens, Lennart
2011-03-08
The growing interest in the field of proteomics has increased the demand for software tools and applications that process and analyze the resulting data. And even though the purpose of these tools can vary significantly, they usually share a basic set of features, including the handling of protein and peptide sequences, the visualization of (and interaction with) spectra and chromatograms, and the parsing of results from various proteomics search engines. Developers typically spend considerable time and effort implementing these support structures, which detracts from working on the novel aspects of their tool. In order to simplify the development of proteomics tools, we have implemented an open-source support library for computational proteomics, called compomics-utilities. The library contains a broad set of features required for reading, parsing, and analyzing proteomics data. compomics-utilities is already used by a long list of existing software, ensuring library stability and continued support and development. As a user-friendly, well-documented and open-source library, compomics-utilities greatly simplifies the implementation of the basic features needed in most proteomics tools. Implemented in 100% Java, compomics-utilities is fully portable across platforms and architectures. Our library thus allows the developers to focus on the novel aspects of their tools, rather than on the basic functions, which can contribute substantially to faster development, and better tools for proteomics.
2011-01-01
Background Sequence homology considerations widely used to transfer functional annotation to uncharacterized protein sequences require special precautions in the case of non-globular sequence segments including membrane-spanning stretches composed of non-polar residues. Simple, quantitative criteria are desirable for identifying transmembrane helices (TMs) that must be included into or should be excluded from start sequence segments in similarity searches aimed at finding distant homologues. Results We found that there are two types of TMs in membrane-associated proteins. On the one hand, there are so-called simple TMs with elevated hydrophobicity, low sequence complexity and extraordinary enrichment in long aliphatic residues. They merely serve as membrane-anchoring device. In contrast, so-called complex TMs have lower hydrophobicity, higher sequence complexity and some functional residues. These TMs have additional roles besides membrane anchoring such as intra-membrane complex formation, ligand binding or a catalytic role. Simple and complex TMs can occur both in single- and multi-membrane-spanning proteins essentially in any type of topology. Whereas simple TMs have the potential to confuse searches for sequence homologues and to generate unrelated hits with seemingly convincing statistical significance, complex TMs contain essential evolutionary information. Conclusion For extending the homology concept onto membrane proteins, we provide a necessary quantitative criterion to distinguish simple TMs (and a sufficient criterion for complex TMs) in query sequences prior to their usage in homology searches based on assessment of hydrophobicity and sequence complexity of the TM sequence segments. Reviewers This article was reviewed by Shamil Sunyaev, L. Aravind and Arcady Mushegian. PMID:22024092
MassSieve: Panning MS/MS peptide data for proteins
Slotta, Douglas J.; McFarland, Melinda A.; Markey, Sanford P.
2010-01-01
We present MassSieve, a Java-based platform for visualization and parsimony analysis of single and comparative LC-MS/MS database search engine results. The success of mass spectrometric peptide sequence assignment algorithms has led to the need for a tool to merge and evaluate the increasing data set sizes that result from LC-MS/MS-based shotgun proteomic experiments. MassSieve supports reports from multiple search engines with differing search characteristics, which can increase peptide sequence coverage and/or identify conflicting or ambiguous spectral assignments. PMID:20564260
Simple tools for assembling and searching high-density picolitre pyrophosphate sequence data.
Parker, Nicolas J; Parker, Andrew G
2008-04-18
The advent of pyrophosphate sequencing makes large volumes of sequencing data available at a lower cost than previously possible. However, the short read lengths are difficult to assemble and the large dataset is difficult to handle. During the sequencing of a virus from the tsetse fly, Glossina pallidipes, we found the need for tools to search quickly a set of reads for near exact text matches. A set of tools is provided to search a large data set of pyrophosphate sequence reads under a "live" CD version of Linux on a standard PC that can be used by anyone without prior knowledge of Linux and without having to install a Linux setup on the computer. The tools permit short lengths of de novo assembly, checking of existing assembled sequences, selection and display of reads from the data set and gathering counts of sequences in the reads. Demonstrations are given of the use of the tools to help with checking an assembly against the fragment data set; investigating homopolymer lengths, repeat regions and polymorphisms; and resolving inserted bases caused by incomplete chain extension. The additional information contained in a pyrophosphate sequencing data set beyond a basic assembly is difficult to access due to a lack of tools. The set of simple tools presented here would allow anyone with basic computer skills and a standard PC to access this information.
Padliya, Neerav D; Garrett, Wesley M; Campbell, Kimberly B; Tabb, David L; Cooper, Bret
2007-11-01
LC-MS/MS has demonstrated potential for detecting plant pathogens. Unlike PCR or ELISA, LC-MS/MS does not require pathogen-specific reagents for the detection of pathogen-specific proteins and peptides. However, the MS/MS approach we and others have explored does require a protein sequence reference database and database-search software to interpret tandem mass spectra. To evaluate the limitations of database composition on pathogen identification, we analyzed proteins from cultured Ustilago maydis, Phytophthora sojae, Fusarium graminearum, and Rhizoctonia solani by LC-MS/MS. When the search database did not contain sequences for a target pathogen, or contained sequences to related pathogens, target pathogen spectra were reliably matched to protein sequences from nontarget organisms, giving an illusion that proteins from nontarget organisms were identified. Our analysis demonstrates that when database-search software is used as part of the identification process, a paradox exists whereby additional sequences needed to detect a wide variety of possible organisms may lead to more cross-species protein matches and misidentification of pathogens.
Efficient search, mapping, and optimization of multi-protein genetic systems in diverse bacteria
Farasat, Iman; Kushwaha, Manish; Collens, Jason; Easterbrook, Michael; Guido, Matthew; Salis, Howard M
2014-01-01
Developing predictive models of multi-protein genetic systems to understand and optimize their behavior remains a combinatorial challenge, particularly when measurement throughput is limited. We developed a computational approach to build predictive models and identify optimal sequences and expression levels, while circumventing combinatorial explosion. Maximally informative genetic system variants were first designed by the RBS Library Calculator, an algorithm to design sequences for efficiently searching a multi-protein expression space across a > 10,000-fold range with tailored search parameters and well-predicted translation rates. We validated the algorithm's predictions by characterizing 646 genetic system variants, encoded in plasmids and genomes, expressed in six gram-positive and gram-negative bacterial hosts. We then combined the search algorithm with system-level kinetic modeling, requiring the construction and characterization of 73 variants to build a sequence-expression-activity map (SEAMAP) for a biosynthesis pathway. Using model predictions, we designed and characterized 47 additional pathway variants to navigate its activity space, find optimal expression regions with desired activity response curves, and relieve rate-limiting steps in metabolism. Creating sequence-expression-activity maps accelerates the optimization of many protein systems and allows previous measurements to quantitatively inform future designs. PMID:24952589
Yasui, Yasuo; Hirakawa, Hideki; Ueno, Mariko; Matsui, Katsuhiro; Katsube-Tanaka, Tomoyuki; Yang, Soo Jung; Aii, Jotaro; Sato, Shingo; Mori, Masashi
2016-06-01
Buckwheat (Fagopyrum esculentum Moench; 2n = 2x = 16) is a nutritionally dense annual crop widely grown in temperate zones. To accelerate molecular breeding programmes of this important crop, we generated a draft assembly of the buckwheat genome using short reads obtained by next-generation sequencing (NGS), and constructed the Buckwheat Genome DataBase. After assembling short reads, we determined 387,594 scaffolds as the draft genome sequence (FES_r1.0). The total length of FES_r1.0 was 1,177,687,305 bp, and the N50 of the scaffolds was 25,109 bp. Gene prediction analysis revealed 286,768 coding sequences (CDSs; FES_r1.0_cds) including those related to transposable elements. The total length of FES_r1.0_cds was 212,917,911 bp, and the N50 was 1,101 bp. Of these, the functions of 35,816 CDSs excluding those for transposable elements were annotated by BLAST analysis. To demonstrate the utility of the database, we conducted several test analyses using BLAST and keyword searches. Furthermore, we used the draft genome as a reference sequence for NGS-based markers, and successfully identified novel candidate genes controlling heteromorphic self-incompatibility of buckwheat. The database and draft genome sequence provide a valuable resource that can be used in efforts to develop buckwheat cultivars with superior agronomic traits. © The Author 2016. Published by Oxford University Press on behalf of Kazusa DNA Research Institute.
Annotation, submission and screening of repetitive elements in Repbase: RepbaseSubmitter and Censor.
Kohany, Oleksiy; Gentles, Andrew J; Hankus, Lukasz; Jurka, Jerzy
2006-10-25
Repbase is a reference database of eukaryotic repetitive DNA, which includes prototypic sequences of repeats and basic information described in annotations. Updating and maintenance of the database requires specialized tools, which we have created and made available for use with Repbase, and which may be useful as a template for other curated databases. We describe the software tools RepbaseSubmitter and Censor, which are designed to facilitate updating and screening the content of Repbase. RepbaseSubmitter is a java-based interface for formatting and annotating Repbase entries. It eliminates many common formatting errors, and automates actions such as calculation of sequence lengths and composition, thus facilitating curation of Repbase sequences. In addition, it has several features for predicting protein coding regions in sequences; searching and including Pubmed references in Repbase entries; and searching the NCBI taxonomy database for correct inclusion of species information and taxonomic position. Censor is a tool to rapidly identify repetitive elements by comparison to known repeats. It uses WU-BLAST for speed and sensitivity, and can conduct DNA-DNA, DNA-protein, or translated DNA-translated DNA searches of genomic sequence. Defragmented output includes a map of repeats present in the query sequence, with the options to report masked query sequence(s), repeat sequences found in the query, and alignments. Censor and RepbaseSubmitter are available as both web-based services and downloadable versions. They can be found at http://www.girinst.org/repbase/submission.html (RepbaseSubmitter) and http://www.girinst.org/censor/index.php (Censor).
Tandem Mass Spectrum Sequencing: An Alternative to Database Search Engines in Shotgun Proteomics.
Muth, Thilo; Rapp, Erdmann; Berven, Frode S; Barsnes, Harald; Vaudel, Marc
2016-01-01
Protein identification via database searches has become the gold standard in mass spectrometry based shotgun proteomics. However, as the quality of tandem mass spectra improves, direct mass spectrum sequencing gains interest as a database-independent alternative. In this chapter, the general principle of this so-called de novo sequencing is introduced along with pitfalls and challenges of the technique. The main tools available are presented with a focus on user friendly open source software which can be directly applied in everyday proteomic workflows.
Uhrig, R Glen; Kerk, David; Moorhead, Greg B
2013-12-01
Protein phosphorylation is a reversible regulatory process catalyzed by the opposing reactions of protein kinases and phosphatases, which are central to the proper functioning of the cell. Dysfunction of members in either the protein kinase or phosphatase family can have wide-ranging deleterious effects in both metazoans and plants alike. Previously, three bacterial-like phosphoprotein phosphatase classes were uncovered in eukaryotes and named according to the bacterial sequences with which they have the greatest similarity: Shewanella-like (SLP), Rhizobiales-like (RLPH), and ApaH-like (ALPH) phosphatases. Utilizing the wealth of data resulting from recently sequenced complete eukaryotic genomes, we conducted database searching by hidden Markov models, multiple sequence alignment, and phylogenetic tree inference with Bayesian and maximum likelihood methods to elucidate the pattern of evolution of eukaryotic bacterial-like phosphoprotein phosphatase sequences, which are predominantly distributed in photosynthetic eukaryotes. We uncovered a pattern of ancestral mitochondrial (SLP and RLPH) or archaeal (ALPH) gene entry into eukaryotes, supplemented by possible instances of lateral gene transfer between bacteria and eukaryotes. In addition to the previously known green algal and plant SLP1 and SLP2 protein forms, a more ancestral third form (SLP3) was found in green algae. Data from in silico subcellular localization predictions revealed class-specific differences in plants likely to result in distinct functions, and for SLP sequences, distinctive and possibly functionally significant differences between plants and nonphotosynthetic eukaryotes. Conserved carboxyl-terminal sequence motifs with class-specific patterns of residue substitutions, most prominent in photosynthetic organisms, raise the possibility of complex interactions with regulatory proteins.
Long frame sync words for binary PSK telemetry
NASA Technical Reports Server (NTRS)
Levitt, B. K.
1975-01-01
Correlation criteria have previously been established for identifying whether a given binary sequence would be a good frame sync word for phase-shift keyed telemetry. In the past, the search for a good K-bit sync word has involved the application of these criteria to the entire set of 2 exponent K binary K-tuples. It is shown that restricting this search to a much smaller subset consisting of K-bit prefixes of pseudonoise sequences results in sync words of comparable quality, with greatly reduced computer search times for larger values of K. As an example, this procedure is used to find good sync words of length 16-63; from a storage viewpoint, each of these sequences can be generated by a 5- or 6-bit linear feedback shift register.
Kimura, M; Tani, S; Watanabe, H; Naito, Y; Sakusabe, T; Watanabe, H; Nakaya, J; Sasaki, F; Numano, T; Furuta, T; Furuta, T
2008-01-01
This paper illustrates a high speed clinical data retrieving system, from 10 years of data of operating hospital information system for the purposes of research, evidence creation, patient safety, etc., even incorporating time sequence of causal relations. Total of 73,709,298 records of 10 years at Hamamatsu University Hospital (as of June 2008) are sent from HIS to retrieval system in HL7 v2.5 format. Hierarchical variable length database is used to install them. A search for "listing patients who were prescribed Pravastatin (Mevalotin and generic drugs, any titer)" took 1.92 seconds. "Pravastatin (any) prescribed and recorded AST >150 within two weeks" took 112.22 seconds. Searching conditions can be set to be more complex, connected by Boolean operator and/or. This system called D*D is in operation at Hamamatsu University Hospital since August 2002. It is used for 48,518 times (monthly average of 703 searches). Neither searching, nor background export of data from HIS caused delay of routine operating CPOE. Search database outside of routine operating CPOE, with daily export of order data in HL7 v2.5 format, is proved to provide excellent search environment without causing trouble. Hierarchical representation gives high-speed search response, especially with time sequence of events.
The EMBL-EBI bioinformatics web and programmatic tools framework.
Li, Weizhong; Cowley, Andrew; Uludag, Mahmut; Gur, Tamer; McWilliam, Hamish; Squizzato, Silvano; Park, Young Mi; Buso, Nicola; Lopez, Rodrigo
2015-07-01
Since 2009 the EMBL-EBI Job Dispatcher framework has provided free access to a range of mainstream sequence analysis applications. These include sequence similarity search services (https://www.ebi.ac.uk/Tools/sss/) such as BLAST, FASTA and PSI-Search, multiple sequence alignment tools (https://www.ebi.ac.uk/Tools/msa/) such as Clustal Omega, MAFFT and T-Coffee, and other sequence analysis tools (https://www.ebi.ac.uk/Tools/pfa/) such as InterProScan. Through these services users can search mainstream sequence databases such as ENA, UniProt and Ensembl Genomes, utilising a uniform web interface or systematically through Web Services interfaces (https://www.ebi.ac.uk/Tools/webservices/) using common programming languages, and obtain enriched results with novel visualisations. Integration with EBI Search (https://www.ebi.ac.uk/ebisearch/) and the dbfetch retrieval service (https://www.ebi.ac.uk/Tools/dbfetch/) further expands the usefulness of the framework. New tools and updates such as NCBI BLAST+, InterProScan 5 and PfamScan, new categories such as RNA analysis tools (https://www.ebi.ac.uk/Tools/rna/), new databases such as ENA non-coding, WormBase ParaSite, Pfam and Rfam, and new workflow methods, together with the retirement of depreciated services, ensure that the framework remains relevant to today's biological community. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
Kangaroo – A pattern-matching program for biological sequences
2002-01-01
Background Biologists are often interested in performing a simple database search to identify proteins or genes that contain a well-defined sequence pattern. Many databases do not provide straightforward or readily available query tools to perform simple searches, such as identifying transcription binding sites, protein motifs, or repetitive DNA sequences. However, in many cases simple pattern-matching searches can reveal a wealth of information. We present in this paper a regular expression pattern-matching tool that was used to identify short repetitive DNA sequences in human coding regions for the purpose of identifying potential mutation sites in mismatch repair deficient cells. Results Kangaroo is a web-based regular expression pattern-matching program that can search for patterns in DNA, protein, or coding region sequences in ten different organisms. The program is implemented to facilitate a wide range of queries with no restriction on the length or complexity of the query expression. The program is accessible on the web at http://bioinfo.mshri.on.ca/kangaroo/ and the source code is freely distributed at http://sourceforge.net/projects/slritools/. Conclusion A low-level simple pattern-matching application can prove to be a useful tool in many research settings. For example, Kangaroo was used to identify potential genetic targets in a human colorectal cancer variant that is characterized by a high frequency of mutations in coding regions containing mononucleotide repeats. PMID:12150718
Fischer, Carlos N; Campos, Victor De A; Barella, Victor H
2018-05-01
Profile hidden Markov models (pHMMs) have been used to search for transposable elements (TEs) in genomes. For the learning of pHMMs aimed to search for TEs of the retrotransposon class, the conventional protocol is to use the whole internal nucleotide portions of these elements as representative sequences. To further explore the potential of pHMMs in such a search, we propose five alternative ways to obtain the sets of representative sequences of TEs other than the conventional protocol. In this study, we are interested in Bel-PAO, Copia, Gypsy, and DIRS superfamilies from the retrotransposon class. We compared the pHMMs of all six protocols. The test results show that, for each TE superfamily, the pHMMs of at least two of the proposed protocols performed better than the conventional one and that the number of correct predictions provided by the latter can be improved by considering together the results of one or more of the alternative protocols.
Querying Event Sequences by Exact Match or Similarity Search: Design and Empirical Evaluation
Wongsuphasawat, Krist; Plaisant, Catherine; Taieb-Maimon, Meirav; Shneiderman, Ben
2012-01-01
Specifying event sequence queries is challenging even for skilled computer professionals familiar with SQL. Most graphical user interfaces for database search use an exact match approach, which is often effective, but near misses may also be of interest. We describe a new similarity search interface, in which users specify a query by simply placing events on a blank timeline and retrieve a similarity-ranked list of results. Behind this user interface is a new similarity measure for event sequences which the users can customize by four decision criteria, enabling them to adjust the impact of missing, extra, or swapped events or the impact of time shifts. We describe a use case with Electronic Health Records based on our ongoing collaboration with hospital physicians. A controlled experiment with 18 participants compared exact match and similarity search interfaces. We report on the advantages and disadvantages of each interface and suggest a hybrid interface combining the best of both. PMID:22379286
Hahn, Lars; Leimeister, Chris-André; Ounit, Rachid; Lonardi, Stefano; Morgenstern, Burkhard
2016-10-01
Many algorithms for sequence analysis rely on word matching or word statistics. Often, these approaches can be improved if binary patterns representing match and don't-care positions are used as a filter, such that only those positions of words are considered that correspond to the match positions of the patterns. The performance of these approaches, however, depends on the underlying patterns. Herein, we show that the overlap complexity of a pattern set that was introduced by Ilie and Ilie is closely related to the variance of the number of matches between two evolutionarily related sequences with respect to this pattern set. We propose a modified hill-climbing algorithm to optimize pattern sets for database searching, read mapping and alignment-free sequence comparison of nucleic-acid sequences; our implementation of this algorithm is called rasbhari. Depending on the application at hand, rasbhari can either minimize the overlap complexity of pattern sets, maximize their sensitivity in database searching or minimize the variance of the number of pattern-based matches in alignment-free sequence comparison. We show that, for database searching, rasbhari generates pattern sets with slightly higher sensitivity than existing approaches. In our Spaced Words approach to alignment-free sequence comparison, pattern sets calculated with rasbhari led to more accurate estimates of phylogenetic distances than the randomly generated pattern sets that we previously used. Finally, we used rasbhari to generate patterns for short read classification with CLARK-S. Here too, the sensitivity of the results could be improved, compared to the default patterns of the program. We integrated rasbhari into Spaced Words; the source code of rasbhari is freely available at http://rasbhari.gobics.de/.
Fast parallel tandem mass spectral library searching using GPU hardware acceleration.
Baumgardner, Lydia Ashleigh; Shanmugam, Avinash Kumar; Lam, Henry; Eng, Jimmy K; Martin, Daniel B
2011-06-03
Mass spectrometry-based proteomics is a maturing discipline of biologic research that is experiencing substantial growth. Instrumentation has steadily improved over time with the advent of faster and more sensitive instruments collecting ever larger data files. Consequently, the computational process of matching a peptide fragmentation pattern to its sequence, traditionally accomplished by sequence database searching and more recently also by spectral library searching, has become a bottleneck in many mass spectrometry experiments. In both of these methods, the main rate-limiting step is the comparison of an acquired spectrum with all potential matches from a spectral library or sequence database. This is a highly parallelizable process because the core computational element can be represented as a simple but arithmetically intense multiplication of two vectors. In this paper, we present a proof of concept project taking advantage of the massively parallel computing available on graphics processing units (GPUs) to distribute and accelerate the process of spectral assignment using spectral library searching. This program, which we have named FastPaSS (for Fast Parallelized Spectral Searching), is implemented in CUDA (Compute Unified Device Architecture) from NVIDIA, which allows direct access to the processors in an NVIDIA GPU. Our efforts demonstrate the feasibility of GPU computing for spectral assignment, through implementation of the validated spectral searching algorithm SpectraST in the CUDA environment.
Nishimoto, Ryu; Tani, Jun
2004-09-01
This study shows how sensory-action sequences of imitating finite state machines (FSMs) can be learned by utilizing the deterministic dynamics of recurrent neural networks (RNNs). Our experiments indicated that each possible combinatorial sequence can be recalled by specifying its respective initial state value and also that fractal structures appear in this initial state mapping after the learning converges. We also observed that the sequences of mimicking FSMs are encoded utilizing the transient regions rather than the invariant sets of the evolved dynamical systems of the RNNs.
Liu, Wanting; Xiang, Lunping; Zheng, Tingkai; Jin, Jingjie
2018-01-01
Abstract Translation is a key regulatory step, linking transcriptome and proteome. Two major methods of translatome investigations are RNC-seq (sequencing of translating mRNA) and Ribo-seq (ribosome profiling). To facilitate the investigation of translation, we built a comprehensive database TranslatomeDB (http://www.translatomedb.net/) which provides collection and integrated analysis of published and user-generated translatome sequencing data. The current version includes 2453 Ribo-seq, 10 RNC-seq and their 1394 corresponding mRNA-seq datasets in 13 species. The database emphasizes the analysis functions in addition to the dataset collections. Differential gene expression (DGE) analysis can be performed between any two datasets of same species and type, both on transcriptome and translatome levels. The translation indices translation ratios, elongation velocity index and translational efficiency can be calculated to quantitatively evaluate translational initiation efficiency and elongation velocity, respectively. All datasets were analyzed using a unified, robust, accurate and experimentally-verifiable pipeline based on the FANSe3 mapping algorithm and edgeR for DGE analyzes. TranslatomeDB also allows users to upload their own datasets and utilize the identical unified pipeline to analyze their data. We believe that our TranslatomeDB is a comprehensive platform and knowledgebase on translatome and proteome research, releasing the biologists from complex searching, analyzing and comparing huge sequencing data without needing local computational power. PMID:29106630
Hiessl, Sebastian; Schuldes, Jörg; Thürmer, Andrea; Halbsguth, Tobias; Bröker, Daniel; Angelov, Angel; Liebl, Wolfgang; Daniel, Rolf
2012-01-01
The increasing production of synthetic and natural poly(cis-1,4-isoprene) rubber leads to huge challenges in waste management. Only a few bacteria are known to degrade rubber, and little is known about the mechanism of microbial rubber degradation. The genome of Gordonia polyisoprenivorans strain VH2, which is one of the most effective rubber-degrading bacteria, was sequenced and annotated to elucidate the degradation pathway and other features of this actinomycete. The genome consists of a circular chromosome of 5,669,805 bp and a circular plasmid of 174,494 bp with average GC contents of 67.0% and 65.7%, respectively. It contains 5,110 putative protein-coding sequences, including many candidate genes responsible for rubber degradation and other biotechnically relevant pathways. Furthermore, we detected two homologues of a latex-clearing protein, which is supposed to be a key enzyme in rubber degradation. The deletion of these two genes for the first time revealed clear evidence that latex-clearing protein is essential for the microbial utilization of rubber. Based on the genome sequence, we predict a pathway for the microbial degradation of rubber which is supported by previous and current data on transposon mutagenesis, deletion mutants, applied comparative genomics, and literature search. PMID:22327575
Borman, Andrew M.; Linton, Christopher J.; Oliver, Debra; Palmer, Michael D.; Szekely, Adrien; Johnson, Elizabeth M.
2010-01-01
Rapid identification of yeast species isolates from clinical samples is particularly important given their innately variable antifungal susceptibility profiles. Here, we have evaluated the utility of pyrosequencing analysis of a portion of the internal transcribed spacer 2 region (ITS2) for identification of pathogenic yeasts. A total of 477 clinical isolates encompassing 43 different fungal species were subjected to pyrosequencing analysis in a strictly blinded study. The molecular identifications produced by pyrosequencing were compared with those obtained using conventional biochemical tests (AUXACOLOR2) and following PCR amplification and sequencing of the D1-D2 portion of the nuclear 28S large rRNA gene. More than 98% (469/477) of isolates encompassing 40 of the 43 fungal species tested were correctly identified by pyrosequencing of only 35 bp of ITS2. Moreover, BLAST searches of the public synchronized databases with the ITS2 pyrosequencing signature sequences revealed that there was only minimal sequence redundancy in the ITS2 under analysis. In all cases, the pyrosequencing signature sequences were unique to the yeast species (or species complex) under investigation. Finally, when pyrosequencing was combined with the Whatman FTA paper technology for the rapid extraction of fungal genomic DNA, molecular identification could be accomplished within 6 h from the time of starting from pure cultures. PMID:20702674
SMM-system: A mining tool to identify specific markers in Salmonella enterica.
Yu, Shuijing; Liu, Weibing; Shi, Chunlei; Wang, Dapeng; Dan, Xianlong; Li, Xiao; Shi, Xianming
2011-03-01
This report presents SMM-system, a software package that implements various personalized pre- and post-BLASTN tasks for mining specific markers of microbial pathogens. The main functionalities of SMM-system are summarized as follows: (i) converting multi-FASTA file, (ii) cutting interesting genomic sequence, (iii) automatic high-throughput BLASTN searches, and (iv) screening target sequences. The utility of SMM-system was demonstrated by using it to identify 214 Salmonella enterica-specific protein-coding sequences (CDSs). Eighteen primer pairs were designed based on eighteen S. enterica-specific CDSs, respectively. Seven of these primer pairs were validated with PCR assay, which showed 100% inclusivity for the 101 S. enterica genomes and 100% exclusivity of 30 non-S. enterica genomes. Three specific primer pairs were chosen to develop a multiplex PCR assay, which generated specific amplicons with a size of 180bp (SC1286), 238bp (SC1598) and 405bp (SC4361), respectively. This study demonstrates that SMM-system is a high-throughput specific marker generation tool that can be used to identify genus-, species-, serogroup- and even serovar-specific DNA sequences of microbial pathogens, which has a potential to be applied in food industries, diagnostics and taxonomic studies. SMM-system is freely available and can be downloaded from http://foodsafety.sjtu.edu.cn/SMM-system.html. Copyright © 2011 Elsevier B.V. All rights reserved.
Mandal, Bijoy Kumar; Kim, Tai-hoon
2013-01-01
We design an Algorithm for bioengine. As a program are enable optimal alignments searching between two sequences, the host sequence (normal plant) as well as query sequence (virus). Searching for homologues has become a routine operation of biological sequences in 4 × 4 combination with different subsequence (word size). This program takes the advantage of the high degree of homology between such sequences to construct an alignment of the matching regions. There is a main aim which is to detect the overlapping reading frames. This program also enables to find out the highly infected colones selection highest matching region with minimum gap or mismatch zones and unique virus colones matches. This is a small, portable, interactive, front-end program intended to be used to find out the regions of matching between host sequence and query subsequences. All the operations are carried out in fraction of seconds, depending on the required task and on the sequence length. PMID:24000321
Fourment, Mathieu; Gibbs, Mark J
2008-02-05
Viruses of the Bunyaviridae have segmented negative-stranded RNA genomes and several of them cause significant disease. Many partial sequences have been obtained from the segments so that GenBank searches give complex results. Sequence databases usually use HTML pages to mediate remote sorting, but this approach can be limiting and may discourage a user from exploring a database. The VirusBanker database contains Bunyaviridae sequences and alignments and is presented as two spreadsheets generated by a Java program that interacts with a MySQL database on a server. Sequences are displayed in rows and may be sorted using information that is displayed in columns and includes data relating to the segment, gene, protein, species, strain, sequence length, terminal sequence and date and country of isolation. Bunyaviridae sequences and alignments may be downloaded from the second spreadsheet with titles defined by the user from the columns, or viewed when passed directly to the sequence editor, Jalview. VirusBanker allows large datasets of aligned nucleotide and protein sequences from the Bunyaviridae to be compiled and winnowed rapidly using criteria that are formulated heuristically.
Detecting false positive sequence homology: a machine learning approach.
Fujimoto, M Stanley; Suvorov, Anton; Jensen, Nicholas O; Clement, Mark J; Bybee, Seth M
2016-02-24
Accurate detection of homologous relationships of biological sequences (DNA or amino acid) amongst organisms is an important and often difficult task that is essential to various evolutionary studies, ranging from building phylogenies to predicting functional gene annotations. There are many existing heuristic tools, most commonly based on bidirectional BLAST searches that are used to identify homologous genes and combine them into two fundamentally distinct classes: orthologs and paralogs. Due to only using heuristic filtering based on significance score cutoffs and having no cluster post-processing tools available, these methods can often produce multiple clusters constituting unrelated (non-homologous) sequences. Therefore sequencing data extracted from incomplete genome/transcriptome assemblies originated from low coverage sequencing or produced by de novo processes without a reference genome are susceptible to high false positive rates of homology detection. In this paper we develop biologically informative features that can be extracted from multiple sequence alignments of putative homologous genes (orthologs and paralogs) and further utilized in context of guided experimentation to verify false positive outcomes. We demonstrate that our machine learning method trained on both known homology clusters obtained from OrthoDB and randomly generated sequence alignments (non-homologs), successfully determines apparent false positives inferred by heuristic algorithms especially among proteomes recovered from low-coverage RNA-seq data. Almost ~42 % and ~25 % of predicted putative homologies by InParanoid and HaMStR respectively were classified as false positives on experimental data set. Our process increases the quality of output from other clustering algorithms by providing a novel post-processing method that is both fast and efficient at removing low quality clusters of putative homologous genes recovered by heuristic-based approaches.
Nascimento, Anna Christina C; Zanotta, Lanuse C; Kyaw, Cynthia M; Schwartz, Elisabeth N F; Schwartz, Carlos A; Sebben, Antonio; Sousa, Marcelo V; Fontes, Wagner; Castro, Mariana S
2004-11-01
The emergence, in recent years, of microbial resistance to commonly used antibiotics has aroused a search for new naturally occurring bactericidal and fungicidal agents that may have clinical utility. In the present study, three new antimicrobial peptides were purified from the electrical-stimulated skin secretion of the South American frog Leptodactylus ocellatus by reversed-phase chromatographic procedures. Ocellatin 1 (1GVVDILKGAGKDLLAHLVGKISEKV25-CONH2), ocellatin 2 (1GVLDIFKDAAKQILAHAAEKQI25-CONH2) and ocellatin 3 (1GVLDILKNAAKNILAHAAEQI21-CONH2) are structurally related peptides. These peptides present hemolytic activity against human erythrocytes and are also active against Escherichia coli. Ocellatins exhibit significant sequence similarity to other amphibian antimicrobial peptides, mainly to brevinin 2ED from Rana esculenta.
Improving the CellSearch® system.
Swennenhuis, J F; van Dalum, G; Zeune, L L; Terstappen, L W M M
2016-12-01
The CellSearch® CTC test enumerates tumor cells present in 7.5 ml blood of cancer patients. improvements, extensions and different utilities of the cellsearch system are discussed in this paper. Areas covered: This paper describes work performed with the CellSearch system, which go beyond the normal scope of the test. All results from searches with the search term 'CellSearch' from Web of Science and PubMed were categorized and discussed. Expert commentary: The CellSearch Circulating Tumor Cell test captures and identifies tumor cells in blood that are associated with poor clinical outcome. How to best use CTC in clinical practice is being explored in many clinical trials. The ability to extract information from the CTC to guide therapy will expand the potential clinical utility of CTC.
Geisler, Christoph
2018-02-07
Adventitious viral contamination in cell substrates used for biologicals production is a major safety concern. A powerful new approach that can be used to identify adventitious viruses is a combination of bioinformatics tools with massively parallel sequencing technology. Typically, this involves mapping or BLASTN searching individual reads against viral nucleotide databases. Although extremely sensitive for known viruses, this approach can easily miss viruses that are too dissimilar to viruses in the database. Moreover, it is computationally intensive and requires reference cell genome databases. To avoid these drawbacks, we set out to develop an alternative approach. We reasoned that searching genome and transcriptome assemblies for adventitious viral contaminants using TBLASTN with a compact viral protein database covering extant viral diversity as the query could be fast and sensitive without a requirement for high performance computing hardware. We tested our approach on Spodoptera frugiperda Sf-RVN, a recently isolated insect cell line, to determine if it was contaminated with one or more adventitious viruses. We used Illumina reads to assemble the Sf-RVN genome and transcriptome and searched them for adventitious viral contaminants using TBLASTN with our viral protein database. We found no evidence of viral contamination, which was substantiated by the fact that our searches otherwise identified diverse sequences encoding virus-like proteins. These sequences included Maverick, R1 LINE, and errantivirus transposons, all of which are common in insect genomes. We also identified previously described as well as novel endogenous viral elements similar to ORFs encoded by diverse insect viruses. Our results demonstrate TBLASTN searching massively parallel sequencing (MPS) assemblies with a compact, manually curated viral protein database is more sensitive for adventitious virus detection than BLASTN, as we identified various sequences that encoded virus-like proteins, but had no similarity to viral sequences at the nucleotide level. Moreover, searches were fast without requiring high performance computing hardware. Our study also documents the enhanced biosafety profile of Sf-RVN as compared to other Sf cell lines, and supports the notion that Sf-RVN is highly suitable for the production of safe biologicals.
Large-Scale Concatenation cDNA Sequencing
Yu, Wei; Andersson, Björn; Worley, Kim C.; Muzny, Donna M.; Ding, Yan; Liu, Wen; Ricafrente, Jennifer Y.; Wentland, Meredith A.; Lennon, Greg; Gibbs, Richard A.
1997-01-01
A total of 100 kb of DNA derived from 69 individual human brain cDNA clones of 0.7–2.0 kb were sequenced by concatenated cDNA sequencing (CCS), whereby multiple individual DNA fragments are sequenced simultaneously in a single shotgun library. The method yielded accurate sequences and a similar efficiency compared with other shotgun libraries constructed from single DNA fragments (>20 kb). Computer analyses were carried out on 65 cDNA clone sequences and their corresponding end sequences to examine both nucleic acid and amino acid sequence similarities in the databases. Thirty-seven clones revealed no DNA database matches, 12 clones generated exact matches (≥98% identity), and 16 clones generated nonexact matches (57%–97% identity) to either known human or other species genes. Of those 28 matched clones, 8 had corresponding end sequences that failed to identify similarities. In a protein similarity search, 27 clone sequences displayed significant matches, whereas only 20 of the end sequences had matches to known protein sequences. Our data indicate that full-length cDNA insert sequences provide significantly more nucleic acid and protein sequence similarity matches than expressed sequence tags (ESTs) for database searching. [All 65 cDNA clone sequences described in this paper have been submitted to the GenBank data library under accession nos. U79240–U79304.] PMID:9110174
Dong, Guangheng; Yang, Lizhu; Shen, Yue
2009-08-21
The present study investigated the course of visual searching to a target in a fixed location, using an emotional flanker task. Event-related potentials (ERPs) were recorded while participants performed the task. Emotional facial expressions were used as emotion-eliciting triggers. The course of visual searching was analyzed through the emotional effects arising from these emotion-eliciting stimuli. The flanker stimuli showed effects at about 150-250 ms following the stimulus onset, while the effect of target stimuli showed effects at about 300-400 ms. The visual search sequence in an emotional flanker task moved from a whole overview to a specific target, even if the target always appeared at a known location. The processing sequence was "parallel" in this task. The results supported the feature integration theory of visual search.
Ambroggio, Xavier I; Dommer, Jennifer; Gopalan, Vivek; Dunham, Eleca J; Taubenberger, Jeffery K; Hurt, Darrell E
2013-06-18
Influenza A viruses possess RNA genomes that mutate frequently in response to immune pressures. The mutations in the hemagglutinin genes are particularly significant, as the hemagglutinin proteins mediate attachment and fusion to host cells, thereby influencing viral pathogenicity and species specificity. Large-scale influenza A genome sequencing efforts have been ongoing to understand past epidemics and pandemics and anticipate future outbreaks. Sequencing efforts thus far have generated nearly 9,000 distinct hemagglutinin amino acid sequences. Comparative models for all publicly available influenza A hemagglutinin protein sequences (8,769 to date) were generated using the Rosetta modeling suite. The C-alpha root mean square deviations between a randomly chosen test set of models and their crystallographic templates were less than 2 Å, suggesting that the modeling protocols yielded high-quality results. The models were compiled into an online resource, the Hemagglutinin Structure Prediction (HASP) server. The HASP server was designed as a scientific tool for researchers to visualize hemagglutinin protein sequences of interest in a three-dimensional context. With a built-in molecular viewer, hemagglutinin models can be compared side-by-side and navigated by a corresponding sequence alignment. The models and alignments can be downloaded for offline use and further analysis. The modeling protocols used in the HASP server scale well for large amounts of sequences and will keep pace with expanded sequencing efforts. The conservative approach to modeling and the intuitive search and visualization interfaces allow researchers to quickly analyze hemagglutinin sequences of interest in the context of the most highly related experimental structures, and allow them to directly compare hemagglutinin sequences to each other simultaneously in their two- and three-dimensional contexts. The models and methodology have shown utility in current research efforts and the ongoing aim of the HASP server is to continue to accelerate influenza A research and have a positive impact on global public health.
A comprehensive and scalable database search system for metaproteomics.
Chatterjee, Sandip; Stupp, Gregory S; Park, Sung Kyu Robin; Ducom, Jean-Christophe; Yates, John R; Su, Andrew I; Wolan, Dennis W
2016-08-16
Mass spectrometry-based shotgun proteomics experiments rely on accurate matching of experimental spectra against a database of protein sequences. Existing computational analysis methods are limited in the size of their sequence databases, which severely restricts the proteomic sequencing depth and functional analysis of highly complex samples. The growing amount of public high-throughput sequencing data will only exacerbate this problem. We designed a broadly applicable metaproteomic analysis method (ComPIL) that addresses protein database size limitations. Our approach to overcome this significant limitation in metaproteomics was to design a scalable set of sequence databases assembled for optimal library querying speeds. ComPIL was integrated with a modified version of the search engine ProLuCID (termed "Blazmass") to permit rapid matching of experimental spectra. Proof-of-principle analysis of human HEK293 lysate with a ComPIL database derived from high-quality genomic libraries was able to detect nearly all of the same peptides as a search with a human database (~500x fewer peptides in the database), with a small reduction in sensitivity. We were also able to detect proteins from the adenovirus used to immortalize these cells. We applied our method to a set of healthy human gut microbiome proteomic samples and showed a substantial increase in the number of identified peptides and proteins compared to previous metaproteomic analyses, while retaining a high degree of protein identification accuracy and allowing for a more in-depth characterization of the functional landscape of the samples. The combination of ComPIL with Blazmass allows proteomic searches to be performed with database sizes much larger than previously possible. These large database searches can be applied to complex meta-samples with unknown composition or proteomic samples where unexpected proteins may be identified. The protein database, proteomic search engine, and the proteomic data files for the 5 microbiome samples characterized and discussed herein are open source and available for use and additional analysis.
Sequence tagging reveals unexpected modifications in toxicoproteomics
Dasari, Surendra; Chambers, Matthew C.; Codreanu, Simona G.; Liebler, Daniel C.; Collins, Ben C.; Pennington, Stephen R.; Gallagher, William M.; Tabb, David L.
2010-01-01
Toxicoproteomic samples are rich in posttranslational modifications (PTMs) of proteins. Identifying these modifications via standard database searching can incur significant performance penalties. Here we describe the latest developments in TagRecon, an algorithm that leverages inferred sequence tags to identify modified peptides in toxicoproteomic data sets. TagRecon identifies known modifications more effectively than the MyriMatch database search engine. TagRecon outperformed state of the art software in recognizing unanticipated modifications from LTQ, Orbitrap, and QTOF data sets. We developed user-friendly software for detecting persistent mass shifts from samples. We follow a three-step strategy for detecting unanticipated PTMs in samples. First, we identify the proteins present in the sample with a standard database search. Next, identified proteins are interrogated for unexpected PTMs with a sequence tag-based search. Finally, additional evidence is gathered for the detected mass shifts with a refinement search. Application of this technology on toxicoproteomic data sets revealed unintended cross-reactions between proteins and sample processing reagents. Twenty five proteins in rat liver showed signs of oxidative stress when exposed to potentially toxic drugs. These results demonstrate the value of mining toxicoproteomic data sets for modifications. PMID:21214251
Computational mining for hypothetical patterns of amino acid side chains in protein data bank (PDB)
NASA Astrophysics Data System (ADS)
Ghani, Nur Syatila Ab; Firdaus-Raih, Mohd
2018-04-01
The three-dimensional structure of a protein can provide insights regarding its function. Functional relationship between proteins can be inferred from fold and sequence similarities. In certain cases, sequence or fold comparison fails to conclude homology between proteins with similar mechanism. Since the structure is more conserved than the sequence, a constellation of functional residues can be similarly arranged among proteins of similar mechanism. Local structural similarity searches are able to detect such constellation of amino acids among distinct proteins, which can be useful to annotate proteins of unknown function. Detection of such patterns of amino acids on a large scale can increase the repertoire of important 3D motifs since available known 3D motifs currently, could not compensate the ever-increasing numbers of uncharacterized proteins to be annotated. Here, a computational platform for an automated detection of 3D motifs is described. A fuzzy-pattern searching algorithm derived from IMagine an Amino Acid 3D Arrangement search EnGINE (IMAAAGINE) was implemented to develop an automated method for searching of hypothetical patterns of amino acid side chains in Protein Data Bank (PDB), without the need for prior knowledge on related sequence or structure of pattern of interest. We present an example of the searches, which is the detection of a hypothetical pattern derived from known structural motif of C2H2 structural pattern from zinc fingers. The conservation of particular patterns of amino acid side chains in unrelated proteins is highlighted. This approach can act as a complementary method for available structure- and sequence-based platforms and may contribute in improving functional association between proteins.
Adaptive rood pattern search for fast block-matching motion estimation.
Nie, Yao; Ma, Kai-Kuang
2002-01-01
In this paper, we propose a novel and simple fast block-matching algorithm (BMA), called adaptive rood pattern search (ARPS), which consists of two sequential search stages: 1) initial search and 2) refined local search. For each macroblock (MB), the initial search is performed only once at the beginning in order to find a good starting point for the follow-up refined local search. By doing so, unnecessary intermediate search and the risk of being trapped into local minimum matching error points could be greatly reduced in long search case. For the initial search stage, an adaptive rood pattern (ARP) is proposed, and the ARP's size is dynamically determined for each MB, based on the available motion vectors (MVs) of the neighboring MBs. In the refined local search stage, a unit-size rood pattern (URP) is exploited repeatedly, and unrestrictedly, until the final MV is found. To further speed up the search, zero-motion prejudgment (ZMP) is incorporated in our method, which is particularly beneficial to those video sequences containing small motion contents. Extensive experiments conducted based on the MPEG-4 Verification Model (VM) encoding platform show that the search speed of our proposed ARPS-ZMP is about two to three times faster than that of the diamond search (DS), and our method even achieves higher peak signal-to-noise ratio (PSNR) particularly for those video sequences containing large and/or complex motion contents.
Heuristic reusable dynamic programming: efficient updates of local sequence alignment.
Hong, Changjin; Tewfik, Ahmed H
2009-01-01
Recomputation of the previously evaluated similarity results between biological sequences becomes inevitable when researchers realize errors in their sequenced data or when the researchers have to compare nearly similar sequences, e.g., in a family of proteins. We present an efficient scheme for updating local sequence alignments with an affine gap model. In principle, using the previous matching result between two amino acid sequences, we perform a forward-backward alignment to generate heuristic searching bands which are bounded by a set of suboptimal paths. Given a correctly updated sequence, we initially predict a new score of the alignment path for each contour to select the best candidates among them. Then, we run the Smith-Waterman algorithm in this confined space. Furthermore, our heuristic alignment for an updated sequence shows that it can be further accelerated by using reusable dynamic programming (rDP), our prior work. In this study, we successfully validate "relative node tolerance bound" (RNTB) in the pruned searching space. Furthermore, we improve the computational performance by quantifying the successful RNTB tolerance probability and switch to rDP on perturbation-resilient columns only. In our searching space derived by a threshold value of 90 percent of the optimal alignment score, we find that 98.3 percent of contours contain correctly updated paths. We also find that our method consumes only 25.36 percent of the runtime cost of sparse dynamic programming (sDP) method, and to only 2.55 percent of that of a normal dynamic programming with the Smith-Waterman algorithm.
SIMBAD : a sequence-independent molecular-replacement pipeline
Simpkin, Adam J.; Simkovic, Felix; Thomas, Jens M. H.; ...
2018-06-08
The conventional approach to finding structurally similar search models for use in molecular replacement (MR) is to use the sequence of the target to search against those of a set of known structures. Sequence similarity often correlates with structure similarity. Given sufficient similarity, a known structure correctly positioned in the target cell by the MR process can provide an approximation to the unknown phases of the target. An alternative approach to identifying homologous structures suitable for MR is to exploit the measured data directly, comparing the lattice parameters or the experimentally derived structure-factor amplitudes with those of known structures. Here,more » SIMBAD , a new sequence-independent MR pipeline which implements these approaches, is presented. SIMBAD can identify cases of contaminant crystallization and other mishaps such as mistaken identity (swapped crystallization trays), as well as solving unsequenced targets and providing a brute-force approach where sequence-dependent search-model identification may be nontrivial, for example because of conformational diversity among identifiable homologues. The program implements a three-step pipeline to efficiently identify a suitable search model in a database of known structures. The first step performs a lattice-parameter search against the entire Protein Data Bank (PDB), rapidly determining whether or not a homologue exists in the same crystal form. The second step is designed to screen the target data for the presence of a crystallized contaminant, a not uncommon occurrence in macromolecular crystallography. Solving structures with MR in such cases can remain problematic for many years, since the search models, which are assumed to be similar to the structure of interest, are not necessarily related to the structures that have actually crystallized. To cater for this eventuality, SIMBAD rapidly screens the data against a database of known contaminant structures. Where the first two steps fail to yield a solution, a final step in SIMBAD can be invoked to perform a brute-force search of a nonredundant PDB database provided by the MoRDa MR software. Through early-access usage of SIMBAD , this approach has solved novel cases that have otherwise proved difficult to solve.« less
SIMBAD : a sequence-independent molecular-replacement pipeline
DOE Office of Scientific and Technical Information (OSTI.GOV)
Simpkin, Adam J.; Simkovic, Felix; Thomas, Jens M. H.
The conventional approach to finding structurally similar search models for use in molecular replacement (MR) is to use the sequence of the target to search against those of a set of known structures. Sequence similarity often correlates with structure similarity. Given sufficient similarity, a known structure correctly positioned in the target cell by the MR process can provide an approximation to the unknown phases of the target. An alternative approach to identifying homologous structures suitable for MR is to exploit the measured data directly, comparing the lattice parameters or the experimentally derived structure-factor amplitudes with those of known structures. Here,more » SIMBAD , a new sequence-independent MR pipeline which implements these approaches, is presented. SIMBAD can identify cases of contaminant crystallization and other mishaps such as mistaken identity (swapped crystallization trays), as well as solving unsequenced targets and providing a brute-force approach where sequence-dependent search-model identification may be nontrivial, for example because of conformational diversity among identifiable homologues. The program implements a three-step pipeline to efficiently identify a suitable search model in a database of known structures. The first step performs a lattice-parameter search against the entire Protein Data Bank (PDB), rapidly determining whether or not a homologue exists in the same crystal form. The second step is designed to screen the target data for the presence of a crystallized contaminant, a not uncommon occurrence in macromolecular crystallography. Solving structures with MR in such cases can remain problematic for many years, since the search models, which are assumed to be similar to the structure of interest, are not necessarily related to the structures that have actually crystallized. To cater for this eventuality, SIMBAD rapidly screens the data against a database of known contaminant structures. Where the first two steps fail to yield a solution, a final step in SIMBAD can be invoked to perform a brute-force search of a nonredundant PDB database provided by the MoRDa MR software. Through early-access usage of SIMBAD , this approach has solved novel cases that have otherwise proved difficult to solve.« less
Explaining the harmonic sequence paradox.
Schmidt, Ulrich; Zimper, Alexander
2012-05-01
According to the harmonic sequence paradox, an expected utility decision maker's willingness to pay for a gamble whose expected payoffs evolve according to the harmonic series is finite if and only if his marginal utility of additional income becomes zero for rather low payoff levels. Since the assumption of zero marginal utility is implausible for finite payoff levels, expected utility theory - as well as its standard generalizations such as cumulative prospect theory - are apparently unable to explain a finite willingness to pay. This paper presents first an experimental study of the harmonic sequence paradox. Additionally, it demonstrates that the theoretical argument of the harmonic sequence paradox only applies to time-patient decision makers, whereas the paradox is easily avoided if time-impatience is introduced. ©2011 The British Psychological Society.
Vassy, Jason L; Christensen, Kurt D; Slashinski, Melody J; Lautenbach, Denise M; Raghavan, Sridharan; Robinson, Jill Oliver; Blumenthal-Barby, Jennifer; Feuerman, Lindsay Zausmer; Lehmann, Lisa Soleymani; Murray, Michael F; Green, Robert C; McGuire, Amy L
2015-01-01
Aim To describe practicing physicians’ perceived clinical utility of genome sequencing. Materials & methods We conducted a mixed-methods analysis of data from 18 primary care physicians and cardiologists in a study of the clinical integration of whole-genome sequencing. Physicians underwent brief genomics continuing medical education before completing surveys and semi-structured interviews. Results Physicians described sequencing as currently lacking clinical utility because of its uncertain interpretation and limited impact on clinical decision-making, but they expressed the idea that its clinical integration was inevitable. Potential clinical uses for sequencing included complementing other clinical information, risk stratification, motivating patient behavior change and pharmacogenetics. Conclusion Physicians given genomics continuing medical education use the language of both evidence-based and personalized medicine in describing the utility of genome-wide testing in patient care. PMID:25642274
Sequence analysis reveals genomic factors affecting EST-SSR primer performance and polymorphism
USDA-ARS?s Scientific Manuscript database
Search for simple sequence repeat (SSR) motifs and design of flanking primers in expressed sequence tag (EST) sequences can be easily done at a large scale using bioinformatics programs. However, failed amplification and/or detection, along with lack of polymorphism, is often seen among randomly sel...
Specialized microbial databases for inductive exploration of microbial genome sequences
Fang, Gang; Ho, Christine; Qiu, Yaowu; Cubas, Virginie; Yu, Zhou; Cabau, Cédric; Cheung, Frankie; Moszer, Ivan; Danchin, Antoine
2005-01-01
Background The enormous amount of genome sequence data asks for user-oriented databases to manage sequences and annotations. Queries must include search tools permitting function identification through exploration of related objects. Methods The GenoList package for collecting and mining microbial genome databases has been rewritten using MySQL as the database management system. Functions that were not available in MySQL, such as nested subquery, have been implemented. Results Inductive reasoning in the study of genomes starts from "islands of knowledge", centered around genes with some known background. With this concept of "neighborhood" in mind, a modified version of the GenoList structure has been used for organizing sequence data from prokaryotic genomes of particular interest in China. GenoChore , a set of 17 specialized end-user-oriented microbial databases (including one instance of Microsporidia, Encephalitozoon cuniculi, a member of Eukarya) has been made publicly available. These databases allow the user to browse genome sequence and annotation data using standard queries. In addition they provide a weekly update of searches against the world-wide protein sequences data libraries, allowing one to monitor annotation updates on genes of interest. Finally, they allow users to search for patterns in DNA or protein sequences, taking into account a clustering of genes into formal operons, as well as providing extra facilities to query sequences using predefined sequence patterns. Conclusion This growing set of specialized microbial databases organize data created by the first Chinese bacterial genome programs (ThermaList, Thermoanaerobacter tencongensis, LeptoList, with two different genomes of Leptospira interrogans and SepiList, Staphylococcus epidermidis) associated to related organisms for comparison. PMID:15698474
Hadad, Bat-Sheva; Kimchi, Ruth
2006-11-01
In two experiments, visual search was used to study the grouping of shape on the basis of perceptual closure among participants 5-23 years of age. We first showed that young children, like adults, demonstrate an efficient search for a concave target among convex distractors for closed connected stimuli but an inefficient search for open stimuli. Reliable developmental differences, however, were observed in search for fragmented stimuli as a function of spatial proximity and collinearity between the closure-inducing fragments. When only closure was available, search for all the age groups was equally efficient for spatially close fragments and equally inefficient for spatially distant fragments. When closure and collinearity were available, search for spatially close fragments was equally efficient for all the age groups, but search for spatially distant fragments was inefficient for younger children and improved significantly between ages 5 and 10. These findings suggest that young children can utilize closure as efficiently as can adults for the grouping of shape for closed or nearly closed stimuli. When the closure-inducing fragments are spatially distant, only older children and adults, but not 5-year-olds, can utilize collinearity to enhance closure for the perceptual grouping of shape.
Improved hybrid optimization algorithm for 3D protein structure prediction.
Zhou, Changjun; Hou, Caixia; Wei, Xiaopeng; Zhang, Qiang
2014-07-01
A new improved hybrid optimization algorithm - PGATS algorithm, which is based on toy off-lattice model, is presented for dealing with three-dimensional protein structure prediction problems. The algorithm combines the particle swarm optimization (PSO), genetic algorithm (GA), and tabu search (TS) algorithms. Otherwise, we also take some different improved strategies. The factor of stochastic disturbance is joined in the particle swarm optimization to improve the search ability; the operations of crossover and mutation that are in the genetic algorithm are changed to a kind of random liner method; at last tabu search algorithm is improved by appending a mutation operator. Through the combination of a variety of strategies and algorithms, the protein structure prediction (PSP) in a 3D off-lattice model is achieved. The PSP problem is an NP-hard problem, but the problem can be attributed to a global optimization problem of multi-extremum and multi-parameters. This is the theoretical principle of the hybrid optimization algorithm that is proposed in this paper. The algorithm combines local search and global search, which overcomes the shortcoming of a single algorithm, giving full play to the advantage of each algorithm. In the current universal standard sequences, Fibonacci sequences and real protein sequences are certified. Experiments show that the proposed new method outperforms single algorithms on the accuracy of calculating the protein sequence energy value, which is proved to be an effective way to predict the structure of proteins.
Caught in the act: the lifetime of synaptic intermediates during the search for homology on DNA
Mani, Adam; Braslavsky, Ido; Arbel-Goren, Rinat; Stavans, Joel
2010-01-01
Homologous recombination plays pivotal roles in DNA repair and in the generation of genetic diversity. To locate homologous target sequences at which strand exchange can occur within a timescale that a cell’s biology demands, a single-stranded DNA-recombinase complex must search among a large number of sequences on a genome by forming synapses with chromosomal segments of DNA. A key element in the search is the time it takes for the two sequences of DNA to be compared, i.e. the synapse lifetime. Here, we visualize for the first time fluorescently tagged individual synapses formed by RecA, a prokaryotic recombinase, and measure their lifetime as a function of synapse length and differences in sequence between the participating DNAs. Surprisingly, lifetimes can be ∼10 s long when the DNAs are fully heterologous, and much longer for partial homology, consistently with ensemble FRET measurements. Synapse lifetime increases rapidly as the length of a region of full homology at either the 3′- or 5′-ends of the invading single-stranded DNA increases above 30 bases. A few mismatches can reduce dramatically the lifetime of synapses formed with nearly homologous DNAs. These results suggest the need for facilitated homology search mechanisms to locate homology successfully within the timescales observed in vivo. PMID:20044347
Holmes, Anne; Allison, Lesley; Ward, Melissa; Dallman, Timothy J; Clark, Richard; Fawkes, Angie; Murphy, Lee; Hanson, Mary
2015-11-01
Detailed laboratory characterization of Escherichia coli O157 is essential to inform epidemiological investigations. This study assessed the utility of whole-genome sequencing (WGS) for outbreak detection and epidemiological surveillance of E. coli O157, and the data were used to identify discernible associations between genotypes and clinical outcomes. One hundred five E. coli O157 strains isolated over a 5-year period from human fecal samples in Lothian, Scotland, were sequenced with the Ion Torrent Personal Genome Machine. A total of 8,721 variable sites in the core genome were identified among the 105 isolates; 47% of the single nucleotide polymorphisms (SNPs) were attributable to six "atypical" E. coli O157 strains and included recombinant regions. Phylogenetic analyses showed that WGS correlated well with the epidemiological data. Epidemiological links existed between cases whose isolates differed by three or fewer SNPs. WGS also correlated well with multilocus variable-number tandem repeat analysis (MLVA) typing data, with only three discordant results observed, all among isolates from cases not known to be epidemiologically related. WGS produced a better-supported, higher-resolution phylogeny than MLVA, confirming that the method is more suitable for epidemiological surveillance of E. coli O157. A combination of in silico analyses (VirulenceFinder, ResFinder, and local BLAST searches) were used to determine stx subtypes, multilocus sequence types (15 loci), and the presence of virulence and acquired antimicrobial resistance genes. There was a high level of correlation between the WGS data and our routine typing methods, although some discordant results were observed, mostly related to the limitation of short sequence read assembly. The data were used to identify sublineages and clades of E. coli O157, and when they were correlated with the clinical outcome data, they showed that one clade, Ic3, was significantly associated with severe disease. Together, the results show that WGS data can provide higher resolution of the relationships between E. coli O157 isolates than that provided by MLVA. The method has the potential to streamline the laboratory workflow and provide detailed information for the clinical management of patients and public health interventions. Copyright © 2015, Holmes et al.
Allison, Lesley; Ward, Melissa; Dallman, Timothy J.; Clark, Richard; Fawkes, Angie; Murphy, Lee; Hanson, Mary
2015-01-01
Detailed laboratory characterization of Escherichia coli O157 is essential to inform epidemiological investigations. This study assessed the utility of whole-genome sequencing (WGS) for outbreak detection and epidemiological surveillance of E. coli O157, and the data were used to identify discernible associations between genotypes and clinical outcomes. One hundred five E. coli O157 strains isolated over a 5-year period from human fecal samples in Lothian, Scotland, were sequenced with the Ion Torrent Personal Genome Machine. A total of 8,721 variable sites in the core genome were identified among the 105 isolates; 47% of the single nucleotide polymorphisms (SNPs) were attributable to six “atypical” E. coli O157 strains and included recombinant regions. Phylogenetic analyses showed that WGS correlated well with the epidemiological data. Epidemiological links existed between cases whose isolates differed by three or fewer SNPs. WGS also correlated well with multilocus variable-number tandem repeat analysis (MLVA) typing data, with only three discordant results observed, all among isolates from cases not known to be epidemiologically related. WGS produced a better-supported, higher-resolution phylogeny than MLVA, confirming that the method is more suitable for epidemiological surveillance of E. coli O157. A combination of in silico analyses (VirulenceFinder, ResFinder, and local BLAST searches) were used to determine stx subtypes, multilocus sequence types (15 loci), and the presence of virulence and acquired antimicrobial resistance genes. There was a high level of correlation between the WGS data and our routine typing methods, although some discordant results were observed, mostly related to the limitation of short sequence read assembly. The data were used to identify sublineages and clades of E. coli O157, and when they were correlated with the clinical outcome data, they showed that one clade, Ic3, was significantly associated with severe disease. Together, the results show that WGS data can provide higher resolution of the relationships between E. coli O157 isolates than that provided by MLVA. The method has the potential to streamline the laboratory workflow and provide detailed information for the clinical management of patients and public health interventions. PMID:26354815
Goonesekere, Nalin Cw
2009-01-01
The large numbers of protein sequences generated by whole genome sequencing projects require rapid and accurate methods of annotation. The detection of homology through computational sequence analysis is a powerful tool in determining the complex evolutionary and functional relationships that exist between proteins. Homology search algorithms employ amino acid substitution matrices to detect similarity between proteins sequences. The substitution matrices in common use today are constructed using sequences aligned without reference to protein structure. Here we present amino acid substitution matrices constructed from the alignment of a large number of protein domain structures from the structural classification of proteins (SCOP) database. We show that when incorporated into the homology search algorithms BLAST and PSI-blast, the structure-based substitution matrices enhance the efficacy of detecting remote homologs.
NASA Technical Reports Server (NTRS)
Chang, Chi-Yung (Inventor); Fang, Wai-Chi (Inventor); Curlander, John C. (Inventor)
1995-01-01
A system for data compression utilizing systolic array architecture for Vector Quantization (VQ) is disclosed for both full-searched and tree-searched. For a tree-searched VQ, the special case of a Binary Tree-Search VQ (BTSVQ) is disclosed with identical Processing Elements (PE) in the array for both a Raw-Codebook VQ (RCVQ) and a Difference-Codebook VQ (DCVQ) algorithm. A fault tolerant system is disclosed which allows a PE that has developed a fault to be bypassed in the array and replaced by a spare at the end of the array, with codebook memory assignment shifted one PE past the faulty PE of the array.
A proteomic map of the unsequenced kala-azar vector Phlebotomus papatasi using cell line.
Pawar, Harsh; Chavan, Sandip; Mahale, Kiran; Khobragade, Sweta; Kulkarni, Aditi; Patil, Arun; Chaphekar, Deepa; Varriar, Pratyasha; Sudeep, Anakkathil; Pai, Kalpana; Prasad, T S K; Gowda, Harsha; Patole, Milind S
2015-12-01
The debilitating disease kala-azar or visceral leishmaniasis is caused by the kinetoplastid protozoan parasite Leishmania donovani. The parasite is transmitted by the hematophagous sand fly vector of the genus Phlebotomus in the old world and Lutzomyia in the new world. The predominant Phlebotomine species associated with the transmission of kala-azar are Phlebotomus papatasi and Phlebotomus argentipes. Understanding the molecular interaction of the sand fly and Leishmania, during the development of parasite within the sand fly gut is crucial to the understanding of the parasite life cycle. The complete genome sequences of sand flies (Phlebotomus and Lutzomyia) are currently not available and this hinders identification of proteins in the sand fly vector. The current study utilizes a three frame translated transcriptomic data of P. papatasi in the absence of genomic sequences to analyze the mass spectrometry data of P. papatasi cell line using a proteogenomic approach. Additionally, we have carried out the proteogenomic analysis of P. papatasi by comparative homology-based searches using related sequenced dipteran protein data. This study resulted in the identification of 1313 proteins from P. papatasi based on homology. Our study demonstrates the power of proteogenomic approaches in mapping the proteomes of unsequenced organisms. Copyright © 2015 Elsevier B.V. All rights reserved.
ESTs from Seeds to Assist the Selective Breeding of Jatropha curcas L. for Oil and Active Compounds
Gomes, Kleber A; Almeida, Tiago C; Gesteira, Abelmon S; Lôbo, Ivon P; Guimarães, Ana Carolina R; de Miranda, Antonio B; Van Sluys, Marie-Anne; da Cruz, Rosenira S; Cascardo, Júlio CM; Carels, Nicolas
2010-01-01
We report here on the characterization of a cDNA library from seeds of Jatropha curcas L. at three stages of fruit maturation before yellowing. We sequenced a total of 2200 clones and obtained a set of 931 non-redundant sequences (unigenes) after trimming and quality control, ie, 140 contigs and 791 singlets with PHRED quality ≥10. We found low levels of sequence redundancy and extensive metabolic coverage by homology comparison to GO. After comparison of 5841 non-redundant ESTs from a total of 13193 reads from GenBank with KEGG, we identified tags with nucleotide variations among J. curcas accessions for genes of fatty acid, terpene, alkaloid, quinone and hormone pathways of biosynthesis. More specifically, the expression level of four genes (palmitoyl-acyl carrier protein thioesterase, 3-ketoacyl-CoA thiolase B, lysophosphatidic acid acyltransferase and geranyl pyrophosphate synthase) measured by real-time PCR proved to be significantly different between leaves and fruits. Since the nucleotide polymorphism of these tags is associated to higher level of gene expression in fruits compared to leaves, we propose this approach to speed up the search for quantitative traits in selective breeding of J. curcas. We also discuss its potential utility for the selective breeding of economically important traits in J. curcas. PMID:26217103
Tang, Lixia; Wang, Xiong; Ru, Beibei; Sun, Hengfei; Huang, Jian; Gao, Hui
2014-06-01
Recent computational and bioinformatics advances have enabled the efficient creation of novel biocatalysts by reducing amino acid variability at hot spot regions. To further expand the utility of this strategy, we present here a tool called Multi-site Degenerate Codon Analyzer (MDC-Analyzer) for the automated design of intelligent mutagenesis libraries that can completely cover user-defined randomized sequences, especially when multiple contiguous and/or adjacent sites are targeted. By initially defining an objective function, the possible optimal degenerate PCR primer profiles could be automatically explored using the heuristic approach of Greedy Best-First-Search. Compared to the previously developed DC-Analyzer, MDC-Analyzer allows for the existence of a small amount of undesired sequences as a tradeoff between the number of degenerate primers and the encoded library size while still providing all the benefits of DC-Analyzer with the ability to randomize multiple contiguous sites. MDC-Analyzer was validated using a series of randomly generated mutation schemes and experimental case studies on the evolution of halohydrin dehalogenase, which proved that the MDC methodology is more efficient than other methods and is particularly well-suited to exploring the sequence space of proteins using data-driven protein engineering strategies.
Gruber, Andreas R
2014-07-10
RNA Polymerase III is a highly specialized enzyme complex responsible for the transcription of a very distinct set of housekeeping noncoding RNAs including tRNAs, 7SK snRNA, Y RNAs, U6 snRNA, and the RNA components of RNaseP and RNaseMRP. In this work we have utilized the conserved promoter structure of known RNA Polymerase III transcripts consisting of characteristic sequence elements termed proximal sequence elements (PSE) A and B and a TATA-box to uncover a novel RNA Polymerase III-transcribed, noncoding RNA family found to be conserved in Caenorhabditis as well as other clade V nematode species. Homology search in combination with detailed sequence and secondary structure analysis revealed that members of this novel ncRNA family evolve rapidly, and only maintain a potentially functional small stem structure that links the 5' end to the very 3' end of the transcript and a small hairpin structure at the 3' end. This is most likely required for efficient transcription termination. In addition, our study revealed evidence that canonical C/D box snoRNAs are also transcribed from a PSE A-PSE B-TATA-box promoter in Caenorhabditis elegans. Copyright © 2014 Elsevier B.V. All rights reserved.
Turning gold into ‘junk’: transposable elements utilize central proteins of cellular networks
Abrusán, György; Szilágyi, András; Zhang, Yang; Papp, Balázs
2013-01-01
The numerous discovered cases of domesticated transposable element (TE) proteins led to the recognition that TEs are a significant source of evolutionary innovation. However, much less is known about the reverse process, whether and to what degree the evolution of TEs is influenced by the genome of their hosts. We addressed this issue by searching for cases of incorporation of host genes into the sequence of TEs and examined the systems-level properties of these genes using the Saccharomyces cerevisiae and Drosophila melanogaster genomes. We identified 51 cases where the evolutionary scenario was the incorporation of a host gene fragment into a TE consensus sequence, and we show that both the yeast and fly homologues of the incorporated protein sequences have central positions in the cellular networks. An analysis of selective pressure (Ka/Ks ratio) detected significant selection in 37% of the cases. Recent research on retrovirus-host interactions shows that virus proteins preferentially target hubs of the host interaction networks enabling them to take over the host cell using only a few proteins. We propose that TEs face a similar evolutionary pressure to evolve proteins with high interacting capacities and take some of the necessary protein domains directly from their hosts. PMID:23341038
Fast parallel tandem mass spectral library searching using GPU hardware acceleration
Baumgardner, Lydia Ashleigh; Shanmugam, Avinash Kumar; Lam, Henry; Eng, Jimmy K.; Martin, Daniel B.
2011-01-01
Mass spectrometry-based proteomics is a maturing discipline of biologic research that is experiencing substantial growth. Instrumentation has steadily improved over time with the advent of faster and more sensitive instruments collecting ever larger data files. Consequently, the computational process of matching a peptide fragmentation pattern to its sequence, traditionally accomplished by sequence database searching and more recently also by spectral library searching, has become a bottleneck in many mass spectrometry experiments. In both of these methods, the main rate limiting step is the comparison of an acquired spectrum with all potential matches from a spectral library or sequence database. This is a highly parallelizable process because the core computational element can be represented as a simple but arithmetically intense multiplication of two vectors. In this paper we present a proof of concept project taking advantage of the massively parallel computing available on graphics processing units (GPUs) to distribute and accelerate the process of spectral assignment using spectral library searching. This program, which we have named FastPaSS (for Fast Parallelized Spectral Searching) is implemented in CUDA (Compute Unified Device Architecture) from NVIDIA which allows direct access to the processors in an NVIDIA GPU. Our efforts demonstrate the feasibility of GPU computing for spectral assignment, through implementation of the validated spectral searching algorithm SpectraST in the CUDA environment. PMID:21545112
Makita, Yuko; Kawashima, Mika; Lau, Nyok Sean; Othman, Ahmad Sofiman; Matsui, Minami
2018-01-19
Natural rubber is an economically important material. Currently the Pará rubber tree, Hevea brasiliensis is the main commercial source. Little is known about rubber biosynthesis at the molecular level. Next-generation sequencing (NGS) technologies brought draft genomes of three rubber cultivars and a variety of RNA sequencing (RNA-seq) data. However, no current genome or transcriptome databases (DB) are organized by gene. A gene-oriented database is a valuable support for rubber research. Based on our original draft genome sequence of H. brasiliensis RRIM600, we constructed a rubber tree genome and transcriptome DB. Our DB provides genome information including gene functional annotations and multi-transcriptome data of RNA-seq, full-length cDNAs including PacBio Isoform sequencing (Iso-Seq), ESTs and genome wide transcription start sites (TSSs) derived from CAGE technology. Using our original and publically available RNA-seq data, we calculated co-expressed genes for identifying functionally related gene sets and/or genes regulated by the same transcription factor (TF). Users can access multi-transcriptome data through both a gene-oriented web page and a genome browser. For the gene searching system, we provide keyword search, sequence homology search and gene expression search; users can also select their expression threshold easily. The rubber genome and transcriptome DB provides rubber tree genome sequence and multi-transcriptomics data. This DB is useful for comprehensive understanding of the rubber transcriptome. This will assist both industrial and academic researchers for rubber and economically important close relatives such as R. communis, M. esculenta and J. curcas. The Rubber Transcriptome DB release 2017.03 is accessible at http://matsui-lab.riken.jp/rubber/ .
Database resources of the National Center for Biotechnology Information.
Sayers, Eric W; Barrett, Tanya; Benson, Dennis A; Bolton, Evan; Bryant, Stephen H; Canese, Kathi; Chetvernin, Vyacheslav; Church, Deanna M; Dicuccio, Michael; Federhen, Scott; Feolo, Michael; Fingerman, Ian M; Geer, Lewis Y; Helmberg, Wolfgang; Kapustin, Yuri; Krasnov, Sergey; Landsman, David; Lipman, David J; Lu, Zhiyong; Madden, Thomas L; Madej, Tom; Maglott, Donna R; Marchler-Bauer, Aron; Miller, Vadim; Karsch-Mizrachi, Ilene; Ostell, James; Panchenko, Anna; Phan, Lon; Pruitt, Kim D; Schuler, Gregory D; Sequeira, Edwin; Sherry, Stephen T; Shumway, Martin; Sirotkin, Karl; Slotta, Douglas; Souvorov, Alexandre; Starchenko, Grigory; Tatusova, Tatiana A; Wagner, Lukas; Wang, Yanli; Wilbur, W John; Yaschenko, Eugene; Ye, Jian
2012-01-01
In addition to maintaining the GenBank® nucleic acid sequence database, the National Center for Biotechnology Information (NCBI) provides analysis and retrieval resources for the data in GenBank and other biological data made available through the NCBI Website. NCBI resources include Entrez, the Entrez Programming Utilities, MyNCBI, PubMed, PubMed Central (PMC), Gene, the NCBI Taxonomy Browser, BLAST, BLAST Link (BLink), Primer-BLAST, COBALT, Splign, RefSeq, UniGene, HomoloGene, ProtEST, dbMHC, dbSNP, dbVar, Epigenomics, Genome and related tools, the Map Viewer, Model Maker, Evidence Viewer, Trace Archive, Sequence Read Archive, BioProject, BioSample, Retroviral Genotyping Tools, HIV-1/Human Protein Interaction Database, Gene Expression Omnibus (GEO), Probe, Online Mendelian Inheritance in Animals (OMIA), the Molecular Modeling Database (MMDB), the Conserved Domain Database (CDD), the Conserved Domain Architecture Retrieval Tool (CDART), Biosystems, Protein Clusters and the PubChem suite of small molecule databases. Augmenting many of the Web applications are custom implementations of the BLAST program optimized to search specialized data sets. All of these resources can be accessed through the NCBI home page at www.ncbi.nlm.nih.gov.
Database resources of the National Center for Biotechnology Information
Acland, Abigail; Agarwala, Richa; Barrett, Tanya; Beck, Jeff; Benson, Dennis A.; Bollin, Colleen; Bolton, Evan; Bryant, Stephen H.; Canese, Kathi; Church, Deanna M.; Clark, Karen; DiCuccio, Michael; Dondoshansky, Ilya; Federhen, Scott; Feolo, Michael; Geer, Lewis Y.; Gorelenkov, Viatcheslav; Hoeppner, Marilu; Johnson, Mark; Kelly, Christopher; Khotomlianski, Viatcheslav; Kimchi, Avi; Kimelman, Michael; Kitts, Paul; Krasnov, Sergey; Kuznetsov, Anatoliy; Landsman, David; Lipman, David J.; Lu, Zhiyong; Madden, Thomas L.; Madej, Tom; Maglott, Donna R.; Marchler-Bauer, Aron; Karsch-Mizrachi, Ilene; Murphy, Terence; Ostell, James; O'Sullivan, Christopher; Panchenko, Anna; Phan, Lon; Pruitt, Don Preussm Kim D.; Rubinstein, Wendy; Sayers, Eric W.; Schneider, Valerie; Schuler, Gregory D.; Sequeira, Edwin; Sherry, Stephen T.; Shumway, Martin; Sirotkin, Karl; Siyan, Karanjit; Slotta, Douglas; Soboleva, Alexandra; Soussov, Vladimir; Starchenko, Grigory; Tatusova, Tatiana A.; Trawick, Bart W.; Vakatov, Denis; Wang, Yanli; Ward, Minghong; John Wilbur, W.; Yaschenko, Eugene; Zbicz, Kerry
2014-01-01
In addition to maintaining the GenBank® nucleic acid sequence database, the National Center for Biotechnology Information (NCBI, http://www.ncbi.nlm.nih.gov) provides analysis and retrieval resources for the data in GenBank and other biological data made available through the NCBI Web site. NCBI resources include Entrez, the Entrez Programming Utilities, MyNCBI, PubMed, PubMed Central, PubReader, Gene, the NCBI Taxonomy Browser, BLAST, BLAST Link, Primer-BLAST, COBALT, RefSeq, UniGene, HomoloGene, ProtEST, dbMHC, dbSNP, dbVar, Epigenomics, the Genetic Testing Registry, Genome and related tools, the Map Viewer, Trace Archive, Sequence Read Archive, BioProject, BioSample, ClinVar, MedGen, HIV-1/Human Protein Interaction Database, Gene Expression Omnibus, Probe, Online Mendelian Inheritance in Animals, the Molecular Modeling Database, the Conserved Domain Database, the Conserved Domain Architecture Retrieval Tool, Biosystems, Protein Clusters and the PubChem suite of small molecule databases. Augmenting many of the Web applications are custom implementations of the BLAST program optimized to search specialized data sets. All these resources can be accessed through the NCBI home page. PMID:24259429
NASA Astrophysics Data System (ADS)
Sivarami Reddy, N.; Ramamurthy, D. V., Dr.; Prahlada Rao, K., Dr.
2017-08-01
This article addresses simultaneous scheduling of machines, AGVs and tools where machines are allowed to share the tools considering transfer times of jobs and tools between machines, to generate best optimal sequences that minimize makespan in a multi-machine Flexible Manufacturing System (FMS). Performance of FMS is expected to improve by effective utilization of its resources, by proper integration and synchronization of their scheduling. Symbiotic Organisms Search (SOS) algorithm is a potent tool which is a better alternative for solving optimization problems like scheduling and proven itself. The proposed SOS algorithm is tested on 22 job sets with makespan as objective for scheduling of machines and tools where machines are allowed to share tools without considering transfer times of jobs and tools and the results are compared with the results of existing methods. The results show that the SOS has outperformed. The same SOS algorithm is used for simultaneous scheduling of machines, AGVs and tools where machines are allowed to share tools considering transfer times of jobs and tools to determine the best optimal sequences that minimize makespan.
Fourment, Mathieu; Gibbs, Mark J
2008-01-01
Background Viruses of the Bunyaviridae have segmented negative-stranded RNA genomes and several of them cause significant disease. Many partial sequences have been obtained from the segments so that GenBank searches give complex results. Sequence databases usually use HTML pages to mediate remote sorting, but this approach can be limiting and may discourage a user from exploring a database. Results The VirusBanker database contains Bunyaviridae sequences and alignments and is presented as two spreadsheets generated by a Java program that interacts with a MySQL database on a server. Sequences are displayed in rows and may be sorted using information that is displayed in columns and includes data relating to the segment, gene, protein, species, strain, sequence length, terminal sequence and date and country of isolation. Bunyaviridae sequences and alignments may be downloaded from the second spreadsheet with titles defined by the user from the columns, or viewed when passed directly to the sequence editor, Jalview. Conclusion VirusBanker allows large datasets of aligned nucleotide and protein sequences from the Bunyaviridae to be compiled and winnowed rapidly using criteria that are formulated heuristically. PMID:18251994
HMMER web server: 2018 update.
Potter, Simon C; Luciani, Aurélien; Eddy, Sean R; Park, Youngmi; Lopez, Rodrigo; Finn, Robert D
2018-06-14
The HMMER webserver [http://www.ebi.ac.uk/Tools/hmmer] is a free-to-use service which provides fast searches against widely used sequence databases and profile hidden Markov model (HMM) libraries using the HMMER software suite (http://hmmer.org). The results of a sequence search may be summarized in a number of ways, allowing users to view and filter the significant hits by domain architecture or taxonomy. For large scale usage, we provide an application programmatic interface (API) which has been expanded in scope, such that all result presentations are available via both HTML and API. Furthermore, we have refactored our JavaScript visualization library to provide standalone components for different result representations. These consume the aforementioned API and can be integrated into third-party websites. The range of databases that can be searched against has been expanded, adding four sequence datasets (12 in total) and one profile HMM library (6 in total). To help users explore the biological context of their results, and to discover new data resources, search results are now supplemented with cross references to other EMBL-EBI databases.
Vander Lugt correlation of DNA sequence data
NASA Astrophysics Data System (ADS)
Christens-Barry, William A.; Hawk, James F.; Martin, James C.
1990-12-01
DNA, the molecule containing the genetic code of an organism, is a linear chain of subunits. It is the sequence of subunits, of which there are four kinds, that constitutes the unique blueprint of an individual. This sequence is the focus of a large number of analyses performed by an army of geneticists, biologists, and computer scientists. Most of these analyses entail searches for specific subsequences within the larger set of sequence data. Thus, most analyses are essentially pattern recognition or correlation tasks. Yet, there are special features to such analysis that influence the strategy and methods of an optical pattern recognition approach. While the serial processing employed in digital electronic computers remains the main engine of sequence analyses, there is no fundamental reason that more efficient parallel methods cannot be used. We describe an approach using optical pattern recognition (OPR) techniques based on matched spatial filtering. This allows parallel comparison of large blocks of sequence data. In this study we have simulated a Vander Lugt1 architecture implementing our approach. Searches for specific target sequence strings within a block of DNA sequence from the Co/El plasmid2 are performed.
Search for Variables in the Kepler Field on DASCH Plates
NASA Astrophysics Data System (ADS)
Tang, Sumin; Grindlay, J.; Los, E.; Servillat, M.
2011-01-01
The Digital Access to a Sky Century @ Harvard (DASCH) is a project to digitize the half a million glass photographic plates over the period 1880s-1980s. This 100 year coverage is a unique resource for studying temporal variations in the universe. Here we present our variable search algorithms and variable catalog in the Kepler fields based on 3000 scanned plates. We use the KIC spectral classifications to search for long-term variability of any main sequence stars, particularly M dwarfs. We apply a variability search technique developed for DASCH and set limits on the fraction of main sequence stars, by spectral type, which show detectable (>0.2mag) variability on timescales 10-100y. Such limits are of particular interest for M dwarfs given the recent discoveries of their planet systems.
Genome Sequencing Technologies and Nursing: What Are the Roles of Nurses and Nurse Scientists?
Taylor, Jacquelyn Y; Wright, Michelle L; Hickey, Kathleen T; Housman, David E
Advances in DNA sequencing technology have resulted in an abundance of personalized data with challenging clinical utility and meaning for clinicians. This wealth of data has potential to dramatically impact the quality of healthcare. Nurses are at the focal point in educating patients regarding relevant healthcare needs; therefore, an understanding of sequencing technology and utilizing these data are critical. The objective of this study was to explicate the role of nurses and nurse scientists as integral members of healthcare teams in improving understanding of DNA sequencing data and translational genomics for patients. A history of the nurse role in newborn screening is used as an exemplar. This study serves as an exemplar on how genome sequencing has been utilized in nursing science and incorporates linkages of other omics approaches used by nurses that are included in this special issue. This special issue showcased nurse scientists conducting multi-omic research from various methods, including targeted candidate genes, pharmacogenomics, proteomics, epigenomics, and the microbiome. From this vantage point, we provide an overview of the roles of nurse scientists in genome sequencing research and provide recommendations for the best utilization of nurses and nurse scientists related to genome sequencing.
NASA Technical Reports Server (NTRS)
Caillault, Jean-Pierre; Magnani, Loris
1990-01-01
The preliminary results are reported of a survey of every EINSTEIN image which overlaps any high-latitude molecular cloud in a search for X-ray emitting pre-main sequence stars. This survey, together with complementary KPNO and IRAS data, will allow the determination of how prevalent low mass star formation is in these clouds in general and, particularly, in the translucent molecular clouds.
GPU-Acceleration of Sequence Homology Searches with Database Subsequence Clustering.
Suzuki, Shuji; Kakuta, Masanori; Ishida, Takashi; Akiyama, Yutaka
2016-01-01
Sequence homology searches are used in various fields and require large amounts of computation time, especially for metagenomic analysis, owing to the large number of queries and the database size. To accelerate computing analyses, graphics processing units (GPUs) are widely used as a low-cost, high-performance computing platform. Therefore, we mapped the time-consuming steps involved in GHOSTZ, which is a state-of-the-art homology search algorithm for protein sequences, onto a GPU and implemented it as GHOSTZ-GPU. In addition, we optimized memory access for GPU calculations and for communication between the CPU and GPU. As per results of the evaluation test involving metagenomic data, GHOSTZ-GPU with 12 CPU threads and 1 GPU was approximately 3.0- to 4.1-fold faster than GHOSTZ with 12 CPU threads. Moreover, GHOSTZ-GPU with 12 CPU threads and 3 GPUs was approximately 5.8- to 7.7-fold faster than GHOSTZ with 12 CPU threads.
NASA Astrophysics Data System (ADS)
Liu, Qiong; Wang, Wen-xi; Zhu, Ke-ren; Zhang, Chao-yong; Rao, Yun-qing
2014-11-01
Mixed-model assembly line sequencing is significant in reducing the production time and overall cost of production. To improve production efficiency, a mathematical model aiming simultaneously to minimize overtime, idle time and total set-up costs is developed. To obtain high-quality and stable solutions, an advanced scatter search approach is proposed. In the proposed algorithm, a new diversification generation method based on a genetic algorithm is presented to generate a set of potentially diverse and high-quality initial solutions. Many methods, including reference set update, subset generation, solution combination and improvement methods, are designed to maintain the diversification of populations and to obtain high-quality ideal solutions. The proposed model and algorithm are applied and validated in a case company. The results indicate that the proposed advanced scatter search approach is significant for mixed-model assembly line sequencing in this company.
PhAST: pharmacophore alignment search tool.
Hähnke, Volker; Hofmann, Bettina; Grgat, Tomislav; Proschak, Ewgenij; Steinhilber, Dieter; Schneider, Gisbert
2009-04-15
We present a ligand-based virtual screening technique (PhAST) for rapid hit and lead structure searching in large compound databases. Molecules are represented as strings encoding the distribution of pharmacophoric features on the molecular graph. In contrast to other text-based methods using SMILES strings, we introduce a new form of text representation that describes the pharmacophore of molecules. This string representation opens the opportunity for revealing functional similarity between molecules by sequence alignment techniques in analogy to homology searching in protein or nucleic acid sequence databases. We favorably compared PhAST with other current ligand-based virtual screening methods in a retrospective analysis using the BEDROC metric. In a prospective application, PhAST identified two novel inhibitors of 5-lipoxygenase product formation with minimal experimental effort. This outcome demonstrates the applicability of PhAST to drug discovery projects and provides an innovative concept of sequence-based compound screening with substantial scaffold hopping potential. 2008 Wiley Periodicals, Inc.
Dhir, Somdutta; Pacurar, Mircea; Franklin, Dino; Gáspári, Zoltán; Kertész-Farkas, Attila; Kocsor, András; Eisenhaber, Frank; Pongor, Sándor
2010-11-01
SBASE is a project initiated to detect known domain types and predicting domain architectures using sequence similarity searching (Simon et al., Protein Seq Data Anal, 5: 39-42, 1992, Pongor et al, Nucl. Acids. Res. 21:3111-3115, 1992). The current approach uses a curated collection of domain sequences - the SBASE domain library - and standard similarity search algorithms, followed by postprocessing which is based on a simple statistics of the domain similarity network (http://hydra.icgeb.trieste.it/sbase/). It is especially useful in detecting rare, atypical examples of known domain types which are sometimes missed even by more sophisticated methodologies. This approach does not require multiple alignment or machine learning techniques, and can be a useful complement to other domain detection methodologies. This article gives an overview of the project history as well as of the concepts and principles developed within this the project.
Variable speed wind turbine generator with zero-sequence filter
Muljadi, Eduard
1998-01-01
A variable speed wind turbine generator system to convert mechanical power into electrical power or energy and to recover the electrical power or energy in the form of three phase alternating current and return the power or energy to a utility or other load with single phase sinusoidal waveform at sixty (60) hertz and unity power factor includes an excitation controller for generating three phase commanded current, a generator, and a zero sequence filter. Each commanded current signal includes two components: a positive sequence variable frequency current signal to provide the balanced three phase excitation currents required in the stator windings of the generator to generate the rotating magnetic field needed to recover an optimum level of real power from the generator; and a zero frequency sixty (60) hertz current signal to allow the real power generated by the generator to be supplied to the utility. The positive sequence current signals are balanced three phase signals and are prevented from entering the utility by the zero sequence filter. The zero sequence current signals have zero phase displacement from each other and are prevented from entering the generator by the star connected stator windings. The zero sequence filter allows the zero sequence current signals to pass through to deliver power to the utility.
Variable Speed Wind Turbine Generator with Zero-sequence Filter
Muljadi, Eduard
1998-08-25
A variable speed wind turbine generator system to convert mechanical power into electrical power or energy and to recover the electrical power or energy in the form of three phase alternating current and return the power or energy to a utility or other load with single phase sinusoidal waveform at sixty (60) hertz and unity power factor includes an excitation controller for generating three phase commanded current, a generator, and a zero sequence filter. Each commanded current signal includes two components: a positive sequence variable frequency current signal to provide the balanced three phase excitation currents required in the stator windings of the generator to generate the rotating magnetic field needed to recover an optimum level of real power from the generator; and a zero frequency sixty (60) hertz current signal to allow the real power generated by the generator to be supplied to the utility. The positive sequence current signals are balanced three phase signals and are prevented from entering the utility by the zero sequence filter. The zero sequence current signals have zero phase displacement from each other and are prevented from entering the generator by the star connected stator windings. The zero sequence filter allows the zero sequence current signals to pass through to deliver power to the utility.
Variable speed wind turbine generator with zero-sequence filter
Muljadi, E.
1998-08-25
A variable speed wind turbine generator system to convert mechanical power into electrical power or energy and to recover the electrical power or energy in the form of three phase alternating current and return the power or energy to a utility or other load with single phase sinusoidal waveform at sixty (60) hertz and unity power factor includes an excitation controller for generating three phase commanded current, a generator, and a zero sequence filter. Each commanded current signal includes two components: a positive sequence variable frequency current signal to provide the balanced three phase excitation currents required in the stator windings of the generator to generate the rotating magnetic field needed to recover an optimum level of real power from the generator; and a zero frequency sixty (60) hertz current signal to allow the real power generated by the generator to be supplied to the utility. The positive sequence current signals are balanced three phase signals and are prevented from entering the utility by the zero sequence filter. The zero sequence current signals have zero phase displacement from each other and are prevented from entering the generator by the star connected stator windings. The zero sequence filter allows the zero sequence current signals to pass through to deliver power to the utility. 14 figs.
ActiviTree: interactive visual exploration of sequences in event-based data using graph similarity.
Vrotsou, Katerina; Johansson, Jimmy; Cooper, Matthew
2009-01-01
The identification of significant sequences in large and complex event-based temporal data is a challenging problem with applications in many areas of today's information intensive society. Pure visual representations can be used for the analysis, but are constrained to small data sets. Algorithmic search mechanisms used for larger data sets become expensive as the data size increases and typically focus on frequency of occurrence to reduce the computational complexity, often overlooking important infrequent sequences and outliers. In this paper we introduce an interactive visual data mining approach based on an adaptation of techniques developed for web searching, combined with an intuitive visual interface, to facilitate user-centred exploration of the data and identification of sequences significant to that user. The search algorithm used in the exploration executes in negligible time, even for large data, and so no pre-processing of the selected data is required, making this a completely interactive experience for the user. Our particular application area is social science diary data but the technique is applicable across many other disciplines.
Discovering Motifs in Biological Sequences Using the Micron Automata Processor.
Roy, Indranil; Aluru, Srinivas
2016-01-01
Finding approximately conserved sequences, called motifs, across multiple DNA or protein sequences is an important problem in computational biology. In this paper, we consider the (l, d) motif search problem of identifying one or more motifs of length l present in at least q of the n given sequences, with each occurrence differing from the motif in at most d substitutions. The problem is known to be NP-complete, and the largest solved instance reported to date is (26,11). We propose a novel algorithm for the (l,d) motif search problem using streaming execution over a large set of non-deterministic finite automata (NFA). This solution is designed to take advantage of the micron automata processor, a new technology close to deployment that can simultaneously execute multiple NFA in parallel. We demonstrate the capability for solving much larger instances of the (l, d) motif search problem using the resources available within a single automata processor board, by estimating run-times for problem instances (39,18) and (40,17). The paper serves as a useful guide to solving problems using this new accelerator technology.
The Yak genome database: an integrative database for studying yak biology and high-altitude adaption
2012-01-01
Background The yak (Bos grunniens) is a long-haired bovine that lives at high altitudes and is an important source of milk, meat, fiber and fuel. The recent sequencing, assembly and annotation of its genome are expected to further our understanding of the means by which it has adapted to life at high altitudes and its ecologically important traits. Description The Yak Genome Database (YGD) is an internet-based resource that provides access to genomic sequence data and predicted functional information concerning the genes and proteins of Bos grunniens. The curated data stored in the YGD includes genome sequences, predicted genes and associated annotations, non-coding RNA sequences, transposable elements, single nucleotide variants, and three-way whole-genome alignments between human, cattle and yak. YGD offers useful searching and data mining tools, including the ability to search for genes by name or using function keywords as well as GBrowse genome browsers and/or BLAST servers, which can be used to visualize genome regions and identify similar sequences. Sequence data from the YGD can also be downloaded to perform local searches. Conclusions A new yak genome database (YGD) has been developed to facilitate studies on high-altitude adaption and bovine genomics. The database will be continuously updated to incorporate new information such as transcriptome data and population resequencing data. The YGD can be accessed at http://me.lzu.edu.cn/yak. PMID:23134687
Drory Retwitzer, Matan; Polishchuk, Maya; Churkin, Elena; Kifer, Ilona; Yakhini, Zohar; Barash, Danny
2015-01-01
Searching for RNA sequence-structure patterns is becoming an essential tool for RNA practitioners. Novel discoveries of regulatory non-coding RNAs in targeted organisms and the motivation to find them across a wide range of organisms have prompted the use of computational RNA pattern matching as an enhancement to sequence similarity. State-of-the-art programs differ by the flexibility of patterns allowed as queries and by their simplicity of use. In particular—no existing method is available as a user-friendly web server. A general program that searches for RNA sequence-structure patterns is RNA Structator. However, it is not available as a web server and does not provide the option to allow flexible gap pattern representation with an upper bound of the gap length being specified at any position in the sequence. Here, we introduce RNAPattMatch, a web-based application that is user friendly and makes sequence/structure RNA queries accessible to practitioners of various background and proficiency. It also extends RNA Structator and allows a more flexible variable gaps representation, in addition to analysis of results using energy minimization methods. RNAPattMatch service is available at http://www.cs.bgu.ac.il/rnapattmatch. A standalone version of the search tool is also available to download at the site. PMID:25940619
Database Search Engines: Paradigms, Challenges and Solutions.
Verheggen, Kenneth; Martens, Lennart; Berven, Frode S; Barsnes, Harald; Vaudel, Marc
2016-01-01
The first step in identifying proteins from mass spectrometry based shotgun proteomics data is to infer peptides from tandem mass spectra, a task generally achieved using database search engines. In this chapter, the basic principles of database search engines are introduced with a focus on open source software, and the use of database search engines is demonstrated using the freely available SearchGUI interface. This chapter also discusses how to tackle general issues related to sequence database searching and shows how to minimize their impact.
Databases and Associated Tools for Glycomics and Glycoproteomics.
Lisacek, Frederique; Mariethoz, Julien; Alocci, Davide; Rudd, Pauline M; Abrahams, Jodie L; Campbell, Matthew P; Packer, Nicolle H; Ståhle, Jonas; Widmalm, Göran; Mullen, Elaine; Adamczyk, Barbara; Rojas-Macias, Miguel A; Jin, Chunsheng; Karlsson, Niclas G
2017-01-01
The access to biodatabases for glycomics and glycoproteomics has proven to be essential for current glycobiological research. This chapter presents available databases that are devoted to different aspects of glycobioinformatics. This includes oligosaccharide sequence databases, experimental databases, 3D structure databases (of both glycans and glycorelated proteins) and association of glycans with tissue, disease, and proteins. Specific search protocols are also provided using tools associated with experimental databases for converting primary glycoanalytical data to glycan structural information. In particular, researchers using glycoanalysis methods by U/HPLC (GlycoBase), MS (GlycoWorkbench, UniCarb-DB, GlycoDigest), and NMR (CASPER) will benefit from this chapter. In addition we also include information on how to utilize glycan structural information to query databases that associate glycans with proteins (UniCarbKB) and with interactions with pathogens (SugarBind).
Structural determination of intact proteins using mass spectrometry
Kruppa, Gary [San Francisco, CA; Schoeniger, Joseph S [Oakland, CA; Young, Malin M [Livermore, CA
2008-05-06
The present invention relates to novel methods of determining the sequence and structure of proteins. Specifically, the present invention allows for the analysis of intact proteins within a mass spectrometer. Therefore, preparatory separations need not be performed prior to introducing a protein sample into the mass spectrometer. Also disclosed herein are new instrumental developments for enhancing the signal from the desired modified proteins, methods for producing controlled protein fragments in the mass spectrometer, eliminating complex microseparations, and protein preparatory chemical steps necessary for cross-linking based protein structure determination.Additionally, the preferred method of the present invention involves the determination of protein structures utilizing a top-down analysis of protein structures to search for covalent modifications. In the preferred method, intact proteins are ionized and fragmented within the mass spectrometer.
Tedersoo, Leho; Abarenkov, Kessy; Nilsson, R. Henrik; Schüssler, Arthur; Grelet, Gwen-Aëlle; Kohout, Petr; Oja, Jane; Bonito, Gregory M.; Veldre, Vilmar; Jairus, Teele; Ryberg, Martin; Larsson, Karl-Henrik; Kõljalg, Urmas
2011-01-01
Sequence analysis of the ribosomal RNA operon, particularly the internal transcribed spacer (ITS) region, provides a powerful tool for identification of mycorrhizal fungi. The sequence data deposited in the International Nucleotide Sequence Databases (INSD) are, however, unfiltered for quality and are often poorly annotated with metadata. To detect chimeric and low-quality sequences and assign the ectomycorrhizal fungi to phylogenetic lineages, fungal ITS sequences were downloaded from INSD, aligned within family-level groups, and examined through phylogenetic analyses and BLAST searches. By combining the fungal sequence database UNITE and the annotation and search tool PlutoF, we also added metadata from the literature to these accessions. Altogether 35,632 sequences belonged to mycorrhizal fungi or originated from ericoid and orchid mycorrhizal roots. Of these sequences, 677 were considered chimeric and 2,174 of low read quality. Information detailing country of collection, geographical coordinates, interacting taxon and isolation source were supplemented to cover 78.0%, 33.0%, 41.7% and 96.4% of the sequences, respectively. These annotated sequences are publicly available via UNITE (http://unite.ut.ee/) for downstream biogeographic, ecological and taxonomic analyses. In European Nucleotide Archive (ENA; http://www.ebi.ac.uk/ena/), the annotated sequences have a special link-out to UNITE. We intend to expand the data annotation to additional genes and all taxonomic groups and functional guilds of fungi. PMID:21949797
Reefgenomics.Org - a repository for marine genomics data.
Liew, Yi Jin; Aranda, Manuel; Voolstra, Christian R
2016-01-01
Over the last decade, technological advancements have substantially decreased the cost and time of obtaining large amounts of sequencing data. Paired with the exponentially increased computing power, individual labs are now able to sequence genomes or transcriptomes to investigate biological questions of interest. This has led to a significant increase in available sequence data. Although the bulk of data published in articles are stored in public sequence databases, very often, only raw sequencing data are available; miscellaneous data such as assembled transcriptomes, genome annotations etc. are not easily obtainable through the same means. Here, we introduce our website (http://reefgenomics.org) that aims to centralize genomic and transcriptomic data from marine organisms. Besides providing convenient means to download sequences, we provide (where applicable) a genome browser to explore available genomic features, and a BLAST interface to search through the hosted sequences. Through the interface, multiple datasets can be queried simultaneously, allowing for the retrieval of matching sequences from organisms of interest. The minimalistic, no-frills interface reduces visual clutter, making it convenient for end-users to search and explore processed sequence data. DATABASE URL: http://reefgenomics.org. © The Author(s) 2016. Published by Oxford University Press.
Boehm; Gibson; Lubzens
2000-01-01
This study was initiated to search for species-specific and strain-specific satellite DNA sequences for which oligonucleotide primers could be designed to differentiate between various commercially important strains of the marine monogonont rotifers Brachionus rotundiformis and Brachionus plicatilis. Two unrelated, highly reiterated satellite sequences were cloned and characterized. The eight sequenced monomers from B. rotundiformis and six from B. plicatilis had low intrarepeat variability and were similar in their overall lengths, A + T compositions, and high degrees of repeated motif substructure. However, hybridizations to 19 representative strains, sequence characterizations, and GenBank searches indicated that these two satellites are morphotype-specific and population-specific, respectively, and share little homology to each other or to other characterized sequences in the database. Primer pairs designed for the B. rotundiformis satellite confirmed hybridization specificities on polymerase chain reaction and could serve as a useful molecular diagnostic tool to identify strains belonging to the SS morphotype, which are gaining widespread usage as first feeds for marine fish in commercial production.
Evolutionary optimization of biopolymers and sequence structure maps
DOE Office of Scientific and Technical Information (OSTI.GOV)
Reidys, C.M.; Kopp, S.; Schuster, P.
1996-06-01
Searching for biopolymers having a predefined function is a core problem of biotechnology, biochemistry and pharmacy. On the level of RNA sequences and their corresponding secondary structures we show that this problem can be analyzed mathematically. The strategy will be to study the properties of the RNA sequence to secondary structure mapping that is essential for the understanding of the search process. We show that to each secondary structure s there exists a neutral network consisting of all sequences folding into s. This network can be modeled as a random graph and has the following generic properties: it is densemore » and has a giant component within the graph of compatible sequences. The neutral network percolates sequence space and any two neutral nets come close in terms of Hamming distance. We investigate the distribution of the orders of neutral nets and show that above a certain threshold the topology of neutral nets allows to find practically all frequent secondary structures.« less
A novel directional asymmetric sampling search algorithm for fast block-matching motion estimation
NASA Astrophysics Data System (ADS)
Li, Yue-e.; Wang, Qiang
2011-11-01
This paper proposes a novel directional asymmetric sampling search (DASS) algorithm for video compression. Making full use of the error information (block distortions) of the search patterns, eight different direction search patterns are designed for various situations. The strategy of local sampling search is employed for the search of big-motion vector. In order to further speed up the search, early termination strategy is adopted in procedure of DASS. Compared to conventional fast algorithms, the proposed method has the most satisfactory PSNR values for all test sequences.
Screening for Reading Problems: The Utility of SEARCH.
ERIC Educational Resources Information Center
Morrison, Delmont; And Others
1988-01-01
The accuracy of SEARCH for identifying children at risk for developing learning disabilities was evaluated with 1,107 kindergarten children. Children identified as at risk were of average intelligence. SEARCH scores were significantly correlated with sequential and simultaneous information processing skills. SEARCH predicted adequacy of…
A unified architecture for biomedical search engines based on semantic web technologies.
Jalali, Vahid; Matash Borujerdi, Mohammad Reza
2011-04-01
There is a huge growth in the volume of published biomedical research in recent years. Many medical search engines are designed and developed to address the over growing information needs of biomedical experts and curators. Significant progress has been made in utilizing the knowledge embedded in medical ontologies and controlled vocabularies to assist these engines. However, the lack of common architecture for utilized ontologies and overall retrieval process, hampers evaluating different search engines and interoperability between them under unified conditions. In this paper, a unified architecture for medical search engines is introduced. Proposed model contains standard schemas declared in semantic web languages for ontologies and documents used by search engines. Unified models for annotation and retrieval processes are other parts of introduced architecture. A sample search engine is also designed and implemented based on the proposed architecture in this paper. The search engine is evaluated using two test collections and results are reported in terms of precision vs. recall and mean average precision for different approaches used by this search engine.
Evolving discriminators for querying video sequences
NASA Astrophysics Data System (ADS)
Iyengar, Giridharan; Lippman, Andrew B.
1997-01-01
In this paper we present a framework for content based query and retrieval of information from large video databases. This framework enables content based retrieval of video sequences by characterizing the sequences using motion, texture and colorimetry cues. This characterization is biologically inspired and results in a compact parameter space where every segment of video is represented by an 8 dimensional vector. Searching and retrieval is done in real- time with accuracy in this parameter space. Using this characterization, we then evolve a set of discriminators using Genetic Programming Experiments indicate that these discriminators are capable of analyzing and characterizing video. The VideoBook is able to search and retrieve video sequences with 92% accuracy in real-time. Experiments thus demonstrate that the characterization is capable of extracting higher level structure from raw pixel values.
Partial DNA sequencing of Douglas-fir cDNAs used in RFLP mapping
K.D. Jermstad; D.L. Bassoni; C.S. Kinlaw; D.B. Neale
1998-01-01
DNA sequences from 87 Douglas-fir (Pseudotsuga menziesii [Mirb.] Franco) cDNA RFLP probes were determined. Sequences were submitted to the GenBank dbEST database and searched for similarity against nucleotide and protein databases using the BLASTn and BLASTx programs. Twenty-one sequences (24%) were assigned putative functions; 18 of which...
Liu, Wanting; Xiang, Lunping; Zheng, Tingkai; Jin, Jingjie; Zhang, Gong
2018-01-04
Translation is a key regulatory step, linking transcriptome and proteome. Two major methods of translatome investigations are RNC-seq (sequencing of translating mRNA) and Ribo-seq (ribosome profiling). To facilitate the investigation of translation, we built a comprehensive database TranslatomeDB (http://www.translatomedb.net/) which provides collection and integrated analysis of published and user-generated translatome sequencing data. The current version includes 2453 Ribo-seq, 10 RNC-seq and their 1394 corresponding mRNA-seq datasets in 13 species. The database emphasizes the analysis functions in addition to the dataset collections. Differential gene expression (DGE) analysis can be performed between any two datasets of same species and type, both on transcriptome and translatome levels. The translation indices translation ratios, elongation velocity index and translational efficiency can be calculated to quantitatively evaluate translational initiation efficiency and elongation velocity, respectively. All datasets were analyzed using a unified, robust, accurate and experimentally-verifiable pipeline based on the FANSe3 mapping algorithm and edgeR for DGE analyzes. TranslatomeDB also allows users to upload their own datasets and utilize the identical unified pipeline to analyze their data. We believe that our TranslatomeDB is a comprehensive platform and knowledgebase on translatome and proteome research, releasing the biologists from complex searching, analyzing and comparing huge sequencing data without needing local computational power. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.
Ong, Wen Dee; Voo, Lok-Yung Christopher; Kumar, Vijay Subbiah
2012-01-01
Pineapple (Ananas comosus var. comosus), is an important tropical non-climacteric fruit with high commercial potential. Understanding the mechanism and processes underlying fruit ripening would enable scientists to enhance the improvement of quality traits such as, flavor, texture, appearance and fruit sweetness. Although, the pineapple is an important fruit, there is insufficient transcriptomic or genomic information that is available in public databases. Application of high throughput transcriptome sequencing to profile the pineapple fruit transcripts is therefore needed. To facilitate this, we have performed transcriptome sequencing of ripe yellow pineapple fruit flesh using Illumina technology. About 4.7 millions Illumina paired-end reads were generated and assembled using the Velvet de novo assembler. The assembly produced 28,728 unique transcripts with a mean length of approximately 200 bp. Sequence similarity search against non-redundant NCBI database identified a total of 16,932 unique transcripts (58.93%) with significant hits. Out of these, 15,507 unique transcripts were assigned to gene ontology terms. Functional annotation against Kyoto Encyclopedia of Genes and Genomes pathway database identified 13,598 unique transcripts (47.33%) which were mapped to 126 pathways. The assembly revealed many transcripts that were previously unknown. The unique transcripts derived from this work have rapidly increased of the number of the pineapple fruit mRNA transcripts as it is now available in public databases. This information can be further utilized in gene expression, genomics and other functional genomics studies in pineapple.
Ong, Wen Dee; Voo, Lok-Yung Christopher; Kumar, Vijay Subbiah
2012-01-01
Background Pineapple (Ananas comosus var. comosus), is an important tropical non-climacteric fruit with high commercial potential. Understanding the mechanism and processes underlying fruit ripening would enable scientists to enhance the improvement of quality traits such as, flavor, texture, appearance and fruit sweetness. Although, the pineapple is an important fruit, there is insufficient transcriptomic or genomic information that is available in public databases. Application of high throughput transcriptome sequencing to profile the pineapple fruit transcripts is therefore needed. Methodology/Principal Findings To facilitate this, we have performed transcriptome sequencing of ripe yellow pineapple fruit flesh using Illumina technology. About 4.7 millions Illumina paired-end reads were generated and assembled using the Velvet de novo assembler. The assembly produced 28,728 unique transcripts with a mean length of approximately 200 bp. Sequence similarity search against non-redundant NCBI database identified a total of 16,932 unique transcripts (58.93%) with significant hits. Out of these, 15,507 unique transcripts were assigned to gene ontology terms. Functional annotation against Kyoto Encyclopedia of Genes and Genomes pathway database identified 13,598 unique transcripts (47.33%) which were mapped to 126 pathways. The assembly revealed many transcripts that were previously unknown. Conclusions The unique transcripts derived from this work have rapidly increased of the number of the pineapple fruit mRNA transcripts as it is now available in public databases. This information can be further utilized in gene expression, genomics and other functional genomics studies in pineapple. PMID:23091603
Cloning and characterization of an autonomous replication sequence from Coxiella burnetii.
Suhan, M; Chen, S Y; Thompson, H A; Hoover, T A; Hill, A; Williams, J C
1994-01-01
A Coxiella burnetii chromosomal fragment capable of functioning as an origin for the replication of a kanamycin resistance (Kanr) plasmid was isolated by use of origin search methods utilizing an Escherichia coli host. The 5.8-kb fragment was subcloned into phagemid vectors and was deleted progressively by an exonuclease III-S1 technique. Plasmids containing progressively shorter DNA fragments were then tested for their capability to support replication by transformation of an E. coli polA strain. A minimal autonomous replication sequence (ARS) was delimited to 403 bp. Sequencing of the entire 5.8-kb region revealed that the minimal ARS contained two consensus DnaA boxes, three A + T-rich 21-mers, a transcriptional promoter leading rightwards, and potential integration host factor and factor of inversion stimulation binding sites. Database comparisons of deduced amino acid sequences revealed that open reading frames located around the ARS were homologous to genes often, but not always, found near bacterial chromosomal origins; these included identities with rpmH and rnpA in E. coli and identities with the 9K protein and 60K membrane protein in E. coli and Pseudomonas species. These and direct hybridization data suggested that the ARS was chromosomal and not associated with the resident plasmid QpH1. Two-dimensional agarose gel electrophoresis did not reveal the presence of initiating intermediates, indicating that the ARS did not initiate chromosome replication during laboratory growth of C. burnetii. Images PMID:8071197
Workflow and web application for annotating NCBI BioProject transcriptome data
Vera Alvarez, Roberto; Medeiros Vidal, Newton; Garzón-Martínez, Gina A.; Barrero, Luz S.; Landsman, David
2017-01-01
Abstract The volume of transcriptome data is growing exponentially due to rapid improvement of experimental technologies. In response, large central resources such as those of the National Center for Biotechnology Information (NCBI) are continually adapting their computational infrastructure to accommodate this large influx of data. New and specialized databases, such as Transcriptome Shotgun Assembly Sequence Database (TSA) and Sequence Read Archive (SRA), have been created to aid the development and expansion of centralized repositories. Although the central resource databases are under continual development, they do not include automatic pipelines to increase annotation of newly deposited data. Therefore, third-party applications are required to achieve that aim. Here, we present an automatic workflow and web application for the annotation of transcriptome data. The workflow creates secondary data such as sequencing reads and BLAST alignments, which are available through the web application. They are based on freely available bioinformatics tools and scripts developed in-house. The interactive web application provides a search engine and several browser utilities. Graphical views of transcript alignments are available through SeqViewer, an embedded tool developed by NCBI for viewing biological sequence data. The web application is tightly integrated with other NCBI web applications and tools to extend the functionality of data processing and interconnectivity. We present a case study for the species Physalis peruviana with data generated from BioProject ID 67621. Database URL: http://www.ncbi.nlm.nih.gov/projects/physalis/ PMID:28605765
Registration of Panoramic/Fish-Eye Image Sequence and LiDAR Points Using Skyline Features
Zhu, Ningning; Jia, Yonghong; Ji, Shunping
2018-01-01
We propose utilizing a rigorous registration model and a skyline-based method for automatic registration of LiDAR points and a sequence of panoramic/fish-eye images in a mobile mapping system (MMS). This method can automatically optimize original registration parameters and avoid the use of manual interventions in control point-based registration methods. First, the rigorous registration model between the LiDAR points and the panoramic/fish-eye image was built. Second, skyline pixels from panoramic/fish-eye images and skyline points from the MMS’s LiDAR points were extracted, relying on the difference in the pixel values and the registration model, respectively. Third, a brute force optimization method was used to search for optimal matching parameters between skyline pixels and skyline points. In the experiments, the original registration method and the control point registration method were used to compare the accuracy of our method with a sequence of panoramic/fish-eye images. The result showed: (1) the panoramic/fish-eye image registration model is effective and can achieve high-precision registration of the image and the MMS’s LiDAR points; (2) the skyline-based registration method can automatically optimize the initial attitude parameters, realizing a high-precision registration of a panoramic/fish-eye image and the MMS’s LiDAR points; and (3) the attitude correction values of the sequences of panoramic/fish-eye images are different, and the values must be solved one by one. PMID:29883431
Li, Xiaobai; Jin, Feng; Jin, Liang; Jackson, Aaron; Huang, Cheng; Li, Kehu; Shu, Xiaoli
2014-12-05
Cymbidium is a genus of 68 species in the orchid family, with extremely high ornamental value. Marker-assisted selection has proven to be an effective strategy in accelerating plant breeding for many plant species. Analysis of cymbidiums genetic background by molecular markers can be of great value in assisting parental selection and breeding strategy design, however, in plants such as cymbidiums limited genomic resources exist. In order to obtain efficient markers, we deep sequenced the C. ensifolium transcriptome to identify simple sequence repeats derived from gene regions (genic-SSR). The 7,936 genic-SSR markers were identified. A total of 80 genic-SSRs were selected, and primers were designed according to their flanking sequences. Of the 80 genic-SSR primer sets, 62 were amplified in C. ensifolium successfully, and 55 showed polymorphism when cross-tested among 9 Cymbidium species comprising 59 accessions. Unigenes containing the 62 genic-SSRs were searched against Non-redundant (Nr), Gene Ontology database (GO), eukaryotic orthologous groups (KOGs) and Kyoto Encyclopedia of Genes and Genomes (KEGG) database. The search resulted in 53 matching Nr sequences, of which 39 had GO terms, 18 were assigned to KOGs, and 15 were annotated with KEGG. Genetic diversity and population structure were analyzed based on 55 polymorphic genic-SSR data among 59 accessions. The genetic distance averaged 0.3911, ranging from 0.016 to 0.618. The polymorphic index content (PIC) of 55 polymorphic markers averaged 0.407, ranging from 0.033 to 0.863. A model-based clustering analysis revealed that five genetic groups existed in the collection. Accessions from the same species were typically grouped together; however, C. goeringii accessions did not always form a separate cluster, suggesting that C. goeringii accessions were polyphyletic. The genic-SSR identified in this study constitute a set of markers that can be applied across multiple Cymbidium species and used for the evaluation of genetic relationships as well as qualitative and quantitative trait mapping studies. Genic-SSR's coupled with the functional annotations provided by the unigenes will aid in mapping candidate genes of specific function.
Kemme, Catherine A.; Marquez, Rolando; Luu, Ross H.
2017-01-01
Abstract Eukaryotic genomes contain numerous non-functional high-affinity sequences for transcription factors. These sequences potentially serve as natural decoys that sequester transcription factors. We have previously shown that the presence of sequences similar to the target sequence could substantially impede association of the transcription factor Egr-1 with its targets. In this study, using a stopped-flow fluorescence method, we examined the kinetic impact of DNA methylation of decoys on the search process of the Egr-1 zinc-finger protein. We analyzed its association with an unmethylated target site on fluorescence-labeled DNA in the presence of competitor DNA duplexes, including Egr-1 decoys. DNA methylation of decoys alone did not affect target search kinetics. In the presence of the MeCP2 methyl-CpG-binding domain (MBD), however, DNA methylation of decoys substantially (∼10-30-fold) accelerated the target search process of the Egr-1 zinc-finger protein. This acceleration did not occur when the target was also methylated. These results suggest that when decoys are methylated, MBD proteins can block them and thereby allow Egr-1 to avoid sequestration in non-functional locations. This effect may occur in vivo for DNA methylation outside CpG islands (CGIs) and could facilitate localization of some transcription factors within regulatory CGIs, where DNA methylation is rare. PMID:28486614
USDA-ARS?s Scientific Manuscript database
The concept of utilizing putative and unique gene sequences for the design of species specific probes was tested. The abundance profile of assigned functions within the Lactobacillus plantarum genome was used for the identification of the putative and unique gene sequence, csh. The targeted gene (cs...
The PARIGA server for real time filtering and analysis of reciprocal BLAST results.
Orsini, Massimiliano; Carcangiu, Simone; Cuccuru, Gianmauro; Uva, Paolo; Tramontano, Anna
2013-01-01
BLAST-based similarity searches are commonly used in several applications involving both nucleotide and protein sequences. These applications span from simple tasks such as mapping sequences over a database to more complex procedures as clustering or annotation processes. When the amount of analysed data increases, manual inspection of BLAST results become a tedious procedure. Tools for parsing or filtering BLAST results for different purposes are then required. We describe here PARIGA (http://resources.bioinformatica.crs4.it/pariga/), a server that enables users to perform all-against-all BLAST searches on two sets of sequences selected by the user. Moreover, since it stores the two BLAST output in a python-serialized-objects database, results can be filtered according to several parameters in real-time fashion, without re-running the process and avoiding additional programming efforts. Results can be interrogated by the user using logical operations, for example to retrieve cases where two queries match same targets, or when sequences from the two datasets are reciprocal best hits, or when a query matches a target in multiple regions. The Pariga web server is designed to be a helpful tool for managing the results of sequence similarity searches. The design and implementation of the server renders all operations very fast and easy to use.
A search for structurally similar cellular internal ribosome entry sites
Baird, Stephen D.; Lewis, Stephen M.; Turcotte, Marcel; Holcik, Martin
2007-01-01
Internal ribosome entry sites (IRES) allow ribosomes to be recruited to mRNA in a cap-independent manner. Some viruses that impair cap-dependent translation initiation utilize IRES to ensure that the viral RNA will efficiently compete for the translation machinery. IRES are also employed for the translation of a subset of cellular messages during conditions that inhibit cap-dependent translation initiation. IRES from viruses like Hepatitis C and Classical Swine Fever virus share a similar structure/function without sharing primary sequence similarity. Of the cellular IRES structures derived so far, none were shown to share an overall structural similarity. Therefore, we undertook a genome-wide search of human 5′UTRs (untranslated regions) with an empirically derived structure of the IRES from the key inhibitor of apoptosis, X-linked inhibitor of apoptosis protein (XIAP), to identify novel IRES that share structure/function similarity. Three of the top matches identified by this search that exhibit IRES activity are the 5′UTRs of Aquaporin 4, ELG1 and NF-kappaB repressing factor (NRF). The structures of AQP4 and ELG1 IRES have limited similarity to the XIAP IRES; however, they share trans-acting factors that bind the XIAP IRES. We therefore propose that cellular IRES are not defined by overall structure, as viral IRES, but are instead dependent upon short motifs and trans-acting factors for their function. PMID:17591613
Goloboff, Pablo A
2014-10-01
Three different types of data sets, for which the uniquely most parsimonious tree can be known exactly but is hard to find with heuristic tree search methods, are studied. Tree searches are complicated more by the shape of the tree landscape (i.e. the distribution of homoplasy on different trees) than by the sheer abundance of homoplasy or character conflict. Data sets of Type 1 are those constructed by Radel et al. (2013). Data sets of Type 2 present a very rugged landscape, with narrow peaks and valleys, but relatively low amounts of homoplasy. For such a tree landscape, subjecting the trees to TBR and saving suboptimal trees produces much better results when the sequence of clipping for the tree branches is randomized instead of fixed. An unexpected finding for data sets of Types 1 and 2 is that starting a search from a random tree instead of a random addition sequence Wagner tree may increase the probability that the search finds the most parsimonious tree; a small artificial example where these probabilities can be calculated exactly is presented. Data sets of Type 3, the most difficult data sets studied here, comprise only congruent characters, and a single island with only one most parsimonious tree. Even if there is a single island, missing entries create a very flat landscape which is difficult to traverse with tree search algorithms because the number of equally parsimonious trees that need to be saved and swapped to effectively move around the plateaus is too large. Minor modifications of the parameters of tree drifting, ratchet, and sectorial searches allow travelling around these plateaus much more efficiently than saving and swapping large numbers of equally parsimonious trees with TBR. For these data sets, two new related criteria for selecting taxon addition sequences in Wagner trees (the "selected" and "informative" addition sequences) produce much better results than the standard random or closest addition sequences. These new methods for Wagner trees and for moving around plateaus can be useful when analyzing phylogenomic data sets formed by concatenation of genes with uneven taxon representation ("sparse" supermatrices), which are likely to present a tree landscape with extensive plateaus. Copyright © 2014 Elsevier Inc. All rights reserved.
Decision theory for computing variable and value ordering decisions for scheduling problems
NASA Technical Reports Server (NTRS)
Linden, Theodore A.
1993-01-01
Heuristics that guide search are critical when solving large planning and scheduling problems, but most variable and value ordering heuristics are sensitive to only one feature of the search state. One wants to combine evidence from all features of the search state into a subjective probability that a value choice is best, but there has been no solid semantics for merging evidence when it is conceived in these terms. Instead, variable and value ordering decisions should be viewed as problems in decision theory. This led to two key insights: (1) The fundamental concept that allows heuristic evidence to be merged is the net incremental utility that will be achieved by assigning a value to a variable. Probability distributions about net incremental utility can merge evidence from the utility function, binary constraints, resource constraints, and other problem features. The subjective probability that a value is the best choice is then derived from probability distributions about net incremental utility. (2) The methods used for rumor control in Bayesian Networks are the primary way to prevent cycling in the computation of probable net incremental utility. These insights lead to semantically justifiable ways to compute heuristic variable and value ordering decisions that merge evidence from all available features of the search state.
GeneBee-net: Internet-based server for analyzing biopolymers
DOE Office of Scientific and Technical Information (OSTI.GOV)
Brodsky, L.I.; Ivanov, V.V.; Nikolaev, V.K.
This work describes a network server for searching databanks of biopolymer structures and performing other biocomputing procedures; it is available via direct Internet connection. Basic server procedures are dedicated to homology (similarity) search of sequence and 3D structure of proteins. The homologies found could be used to build multiple alignments, predict protein and RNA secondary structure, and construct phylogenetic trees. In addition to traditional methods of sequence similarity search, the authors propose {open_quotes}non-matrix{close_quotes} (correlational) search. An analogous approach is used to identify regions of similar tertiary structure of proteins. Algorithm concepts and usage examples are presented for new methods. Servicemore » logic is based upon interaction of a client program and server procedures. The client program allows the compilation of queries and the processing of results of an analysis.« less
NASA Technical Reports Server (NTRS)
Sridhar, K. R.; Finn, J. E.
2000-01-01
The primary objectives of the Mars exploration program are to collect data for planetary science in a quest to answer questions related to Origins, to search for evidence of extinct and extant life, and to expand the human presence in the solar system. The public and political engagement that is critical for support of a Mars exploration program is based on all of these objectives. In order to retain and to build public and political support, it is important for NASA to have an integrated Mars exploration plan, not separate robotic and human plans that exist in parallel or in sequence. The resolutions stemming from the current architectural review and prioritization of payloads may be pivotal in determining whether NASA will have such a unified plan and retain public support. There are several potential scientific and technological links between the robotic-only missions that have been flown and planned to date, and the combined robotic and human missions that will come in the future. Taking advantage of and leveraging those links are central to the idea of a unified Mars exploration plan. One such link is in situ resource utilization (ISRU) as an enabling technology to provide consumables such as fuels, oxygen, sweep and utility gases from the Mars atmosphere.
ERIC Educational Resources Information Center
Saracevic, Tefko
2000-01-01
Summarizes a presentation that discussed findings and implications of research projects using an Internet search service and Internet-accessible vendor databases, representing the two sides of public database searching: query formulation and resource utilization. Presenters included: Tefko Saracevic, Amanda Spink, Dietmar Wolfram and Hong Xie.…
Parson, W; Gusmão, L; Hares, D R; Irwin, J A; Mayr, W R; Morling, N; Pokorak, E; Prinz, M; Salas, A; Schneider, P M; Parsons, T J
2014-11-01
The DNA Commission of the International Society of Forensic Genetics (ISFG) regularly publishes guidelines and recommendations concerning the application of DNA polymorphisms to the question of human identification. Previous recommendations published in 2000 addressed the analysis and interpretation of mitochondrial DNA (mtDNA) in forensic casework. While the foundations set forth in the earlier recommendations still apply, new approaches to the quality control, alignment and nomenclature of mitochondrial sequences, as well as the establishment of mtDNA reference population databases, have been developed. Here, we describe these developments and discuss their application to both mtDNA casework and mtDNA reference population databasing applications. While the generation of mtDNA for forensic casework has always been guided by specific standards, it is now well-established that data of the same quality are required for the mtDNA reference population data used to assess the statistical weight of the evidence. As a result, we introduce guidelines regarding sequence generation, as well as quality control measures based on the known worldwide mtDNA phylogeny, that can be applied to ensure the highest quality population data possible. For both casework and reference population databasing applications, the alignment and nomenclature of haplotypes is revised here and the phylogenetic alignment proffered as acceptable standard. In addition, the interpretation of heteroplasmy in the forensic context is updated, and the utility of alignment-free database searches for unbiased probability estimates is highlighted. Finally, we discuss statistical issues and define minimal standards for mtDNA database searches. Copyright © 2014 Elsevier Ireland Ltd. All rights reserved.
Genome Sequencing Technologies and Nursing: What Are the Roles of Nurses and Nurse Scientists?
Taylor, Jacquelyn Y.; Wright, Michelle L.; Hickey, Kathleen T.; Housman, David
2016-01-01
Background Advances in DNA sequencing technology have resulted in an abundance of personalized data with challenging clinical utility and meaning for clinicians. This wealth of data has potential to dramatically impact the quality of healthcare. Nurses are at the focal point in educating patients regarding relevant healthcare needs; therefore, an understanding of sequencing technology and utilizing these data are critical. Aim The objective of this paper is to explicate the role of nurses and nurse scientists as integral members of healthcare teams in improving understanding of DNA sequencing data and translational genomics for patients. Approach A history of the nurse role in newborn screening is used as an exemplar. Discussion This paper serves as an exemplar on how genome sequencing has been utilized in nursing science and incorporates linkages of other omics approaches used by nurses that are included in this special issue. This special issue showcased nurse scientists conducting multi-omic research from various methods, including targeted candidate genes, pharmacogenomics, proteomics, epigenomics and the microbiome. From this vantage point, we provide an overview of the roles of nurse scientists in genome sequencing research and provide recommendations for the best utilization of nurses and nurse scientists related to genome sequencing. PMID:28252579
Harper, Angela F; Leuthaeuser, Janelle B; Babbitt, Patricia C; Morris, John H; Ferrin, Thomas E; Poole, Leslie B; Fetrow, Jacquelyn S
2017-02-01
Peroxiredoxins (Prxs or Prdxs) are a large protein superfamily of antioxidant enzymes that rapidly detoxify damaging peroxides and/or affect signal transduction and, thus, have roles in proliferation, differentiation, and apoptosis. Prx superfamily members are widespread across phylogeny and multiple methods have been developed to classify them. Here we present an updated atlas of the Prx superfamily identified using a novel method called MISST (Multi-level Iterative Sequence Searching Technique). MISST is an iterative search process developed to be both agglomerative, to add sequences containing similar functional site features, and divisive, to split groups when functional site features suggest distinct functionally-relevant clusters. Superfamily members need not be identified initially-MISST begins with a minimal representative set of known structures and searches GenBank iteratively. Further, the method's novelty lies in the manner in which isofunctional groups are selected; rather than use a single or shifting threshold to identify clusters, the groups are deemed isofunctional when they pass a self-identification criterion, such that the group identifies itself and nothing else in a search of GenBank. The method was preliminarily validated on the Prxs, as the Prxs presented challenges of both agglomeration and division. For example, previous sequence analysis clustered the Prx functional families Prx1 and Prx6 into one group. Subsequent expert analysis clearly identified Prx6 as a distinct functionally relevant group. The MISST process distinguishes these two closely related, though functionally distinct, families. Through MISST search iterations, over 38,000 Prx sequences were identified, which the method divided into six isofunctional clusters, consistent with previous expert analysis. The results represent the most complete computational functional analysis of proteins comprising the Prx superfamily. The feasibility of this novel method is demonstrated by the Prx superfamily results, laying the foundation for potential functionally relevant clustering of the universe of protein sequences.
Babbitt, Patricia C.; Ferrin, Thomas E.
2017-01-01
Peroxiredoxins (Prxs or Prdxs) are a large protein superfamily of antioxidant enzymes that rapidly detoxify damaging peroxides and/or affect signal transduction and, thus, have roles in proliferation, differentiation, and apoptosis. Prx superfamily members are widespread across phylogeny and multiple methods have been developed to classify them. Here we present an updated atlas of the Prx superfamily identified using a novel method called MISST (Multi-level Iterative Sequence Searching Technique). MISST is an iterative search process developed to be both agglomerative, to add sequences containing similar functional site features, and divisive, to split groups when functional site features suggest distinct functionally-relevant clusters. Superfamily members need not be identified initially—MISST begins with a minimal representative set of known structures and searches GenBank iteratively. Further, the method’s novelty lies in the manner in which isofunctional groups are selected; rather than use a single or shifting threshold to identify clusters, the groups are deemed isofunctional when they pass a self-identification criterion, such that the group identifies itself and nothing else in a search of GenBank. The method was preliminarily validated on the Prxs, as the Prxs presented challenges of both agglomeration and division. For example, previous sequence analysis clustered the Prx functional families Prx1 and Prx6 into one group. Subsequent expert analysis clearly identified Prx6 as a distinct functionally relevant group. The MISST process distinguishes these two closely related, though functionally distinct, families. Through MISST search iterations, over 38,000 Prx sequences were identified, which the method divided into six isofunctional clusters, consistent with previous expert analysis. The results represent the most complete computational functional analysis of proteins comprising the Prx superfamily. The feasibility of this novel method is demonstrated by the Prx superfamily results, laying the foundation for potential functionally relevant clustering of the universe of protein sequences. PMID:28187133
Improving Grasp Skills Using Schema Structured Learning
NASA Technical Reports Server (NTRS)
Platt, Robert; Grupen, ROderic A.; Fagg, Andrew H.
2006-01-01
Abstract In the control-based approach to robotics, complex behavior is created by sequencing and combining control primitives. While it is desirable for the robot to autonomously learn the correct control sequence, searching through the large number of potential solutions can be time consuming. This paper constrains this search to variations of a generalized solution encoded in a framework known as an action schema. A new algorithm, SCHEMA STRUCTURED LEARNING, is proposed that repeatedly executes variations of the generalized solution in search of instantiations that satisfy action schema objectives. This approach is tested in a grasping task where Dexter, the UMass humanoid robot, learns which reaching and grasping controllers maximize the probability of grasp success.
Welker, F
2018-02-20
The study of ancient protein sequences is increasingly focused on the analysis of older samples, including those of ancient hominins. The analysis of such ancient proteomes thereby potentially suffers from "cross-species proteomic effects": the loss of peptide and protein identifications at increased evolutionary distances due to a larger number of protein sequence differences between the database sequence and the analyzed organism. Error-tolerant proteomic search algorithms should theoretically overcome this problem at both the peptide and protein level; however, this has not been demonstrated. If error-tolerant searches do not overcome the cross-species proteomic issue then there might be inherent biases in the identified proteomes. Here, a bioinformatics experiment is performed to test this using a set of modern human bone proteomes and three independent searches against sequence databases at increasing evolutionary distances: the human (0 Ma), chimpanzee (6-8 Ma) and orangutan (16-17 Ma) reference proteomes, respectively. Incorrectly suggested amino acid substitutions are absent when employing adequate filtering criteria for mutable Peptide Spectrum Matches (PSMs), but roughly half of the mutable PSMs were not recovered. As a result, peptide and protein identification rates are higher in error-tolerant mode compared to non-error-tolerant searches but did not recover protein identifications completely. Data indicates that peptide length and the number of mutations between the target and database sequences are the main factors influencing mutable PSM identification. The error-tolerant results suggest that the cross-species proteomics problem is not overcome at increasing evolutionary distances, even at the protein level. Peptide and protein loss has the potential to significantly impact divergence dating and proteome comparisons when using ancient samples as there is a bias towards the identification of conserved sequences and proteins. Effects are minimized between moderately divergent proteomes, as indicated by almost complete recovery of informative positions in the search against the chimpanzee proteome (≈90%, 6-8 Ma). This provides a bioinformatic background to future phylogenetic and proteomic analysis of ancient hominin proteomes, including the future description of novel hominin amino acid sequences, but also has negative implications for the study of fast-evolving proteins in hominins, non-hominin animals, and ancient bacterial proteins in evolutionary contexts.
Dasari, Surendra; Chambers, Matthew C.; Martinez, Misti A.; Carpenter, Kristin L.; Ham, Amy-Joan L.; Vega-Montoto, Lorenzo J.; Tabb, David L.
2012-01-01
Spectral libraries have emerged as a viable alternative to protein sequence databases for peptide identification. These libraries contain previously detected peptide sequences and their corresponding tandem mass spectra (MS/MS). Search engines can then identify peptides by comparing experimental MS/MS scans to those in the library. Many of these algorithms employ the dot product score for measuring the quality of a spectrum-spectrum match (SSM). This scoring system does not offer a clear statistical interpretation and ignores fragment ion m/z discrepancies in the scoring. We developed a new spectral library search engine, Pepitome, which employs statistical systems for scoring SSMs. Pepitome outperformed the leading library search tool, SpectraST, when analyzing data sets acquired on three different mass spectrometry platforms. We characterized the reliability of spectral library searches by confirming shotgun proteomics identifications through RNA-Seq data. Applying spectral library and database searches on the same sample revealed their complementary nature. Pepitome identifications enabled the automation of quality analysis and quality control (QA/QC) for shotgun proteomics data acquisition pipelines. PMID:22217208
The LANL hemorrhagic fever virus database, a new platform for analyzing biothreat viruses.
Kuiken, Carla; Thurmond, Jim; Dimitrijevic, Mira; Yoon, Hyejin
2012-01-01
Hemorrhagic fever viruses (HFVs) are a diverse set of over 80 viral species, found in 10 different genera comprising five different families: arena-, bunya-, flavi-, filo- and togaviridae. All these viruses are highly variable and evolve rapidly, making them elusive targets for the immune system and for vaccine and drug design. About 55,000 HFV sequences exist in the public domain today. A central website that provides annotated sequences and analysis tools will be helpful to HFV researchers worldwide. The HFV sequence database collects and stores sequence data and provides a user-friendly search interface and a large number of sequence analysis tools, following the model of the highly regarded and widely used Los Alamos HIV database [Kuiken, C., B. Korber, and R.W. Shafer, HIV sequence databases. AIDS Rev, 2003. 5: p. 52-61]. The database uses an algorithm that aligns each sequence to a species-wide reference sequence. The NCBI RefSeq database [Sayers et al. (2011) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res., 39, D38-D51.] is used for this; if a reference sequence is not available, a Blast search finds the best candidate. Using this method, sequences in each genus can be retrieved pre-aligned. The HFV website can be accessed via http://hfv.lanl.gov.
Just, Rebecca S; Irwin, Jodi A
2018-05-01
Some of the expected advantages of next generation sequencing (NGS) for short tandem repeat (STR) typing include enhanced mixture detection and genotype resolution via sequence variation among non-homologous alleles of the same length. However, at the same time that NGS methods for forensic DNA typing have advanced in recent years, many caseworking laboratories have implemented or are transitioning to probabilistic genotyping to assist the interpretation of complex autosomal STR typing results. Current probabilistic software programs are designed for length-based data, and were not intended to accommodate sequence strings as the product input. Yet to leverage the benefits of NGS for enhanced genotyping and mixture deconvolution, the sequence variation among same-length products must be utilized in some form. Here, we propose use of the longest uninterrupted stretch (LUS) in allele designations as a simple method to represent sequence variation within the STR repeat regions and facilitate - in the nearterm - probabilistic interpretation of NGS-based typing results. An examination of published population data indicated that a reference LUS region is straightforward to define for most autosomal STR loci, and that using repeat unit plus LUS length as the allele designator can represent greater than 80% of the alleles detected by sequencing. A proof of concept study performed using a freely available probabilistic software demonstrated that the LUS length can be used in allele designations when a program does not require alleles to be integers, and that utilizing sequence information improves interpretation of both single-source and mixed contributor STR typing results as compared to using repeat unit information alone. The LUS concept for allele designation maintains the repeat-based allele nomenclature that will permit backward compatibility to extant STR databases, and the LUS lengths themselves will be concordant regardless of the NGS assay or analysis tools employed. Further, these biologically based, easy-to-derive designations uphold clear relationships between parent alleles and their stutter products, enabling analysis in fully continuous probabilistic programs that model stutter while avoiding the algorithmic complexities that come with string based searches. Though using repeat unit plus LUS length as the allele designator does not capture variation that occurs outside of the core repeat regions, this straightforward approach would permit the large majority of known STR sequence variation to be used for mixture deconvolution and, in turn, result in more informative mixture statistics in the near term. Ultimately, the method could bridge the gap from current length-based probabilistic systems to facilitate broader adoption of NGS by forensic DNA testing laboratories. Copyright © 2018 The Authors. Published by Elsevier B.V. All rights reserved.
NASA Astrophysics Data System (ADS)
Bulan, Orhan; Bernal, Edgar A.; Loce, Robert P.; Wu, Wencheng
2013-03-01
Video cameras are widely deployed along city streets, interstate highways, traffic lights, stop signs and toll booths by entities that perform traffic monitoring and law enforcement. The videos captured by these cameras are typically compressed and stored in large databases. Performing a rapid search for a specific vehicle within a large database of compressed videos is often required and can be a time-critical life or death situation. In this paper, we propose video compression and decompression algorithms that enable fast and efficient vehicle or, more generally, event searches in large video databases. The proposed algorithm selects reference frames (i.e., I-frames) based on a vehicle having been detected at a specified position within the scene being monitored while compressing a video sequence. A search for a specific vehicle in the compressed video stream is performed across the reference frames only, which does not require decompression of the full video sequence as in traditional search algorithms. Our experimental results on videos captured in a local road show that the proposed algorithm significantly reduces the search space (thus reducing time and computational resources) in vehicle search tasks within compressed video streams, particularly those captured in light traffic volume conditions.
Cantalupo, Paul G.; Katz, Joshua P.
2015-01-01
ABSTRACT We searched The Cancer Genome Atlas (TCGA) database for viruses by comparing non-human reads present in transcriptome sequencing (RNA-Seq) and whole-exome sequencing (WXS) data to viral sequence databases. Human papillomavirus 18 (HPV18) is an etiologic agent of cervical cancer, and as expected, we found robust expression of HPV18 genes in cervical cancer samples. In agreement with previous studies, we also found HPV18 transcripts in non-cervical cancer samples, including those from the colon, rectum, and normal kidney. However, in each of these cases, HPV18 gene expression was low, and single-nucleotide variants and positions of genomic alignments matched the integrated portion of HPV18 present in HeLa cells. Chimeric reads that match a known virus-cell junction of HPV18 integrated in HeLa cells were also present in some samples. We hypothesize that HPV18 sequences in these non-cervical samples are due to nucleic acid contamination from HeLa cells. This finding highlights the problems that contamination presents in computational virus detection pipelines. IMPORTANCE Viruses associated with cancer can be detected by searching tumor sequence databases. Several studies involving searches of the TCGA database have reported the presence of HPV18, a known cause of cervical cancer, in a small number of additional cancers, including those of the rectum, kidney, and colon. We have determined that the sequences related to HPV18 in non-cervical samples are due to nucleic acid contamination from HeLa cells. To our knowledge, this is the first report of the misidentification of viruses in next-generation sequencing data of tumors due to contamination with a cancer cell line. These results raise awareness of the difficulty of accurately identifying viruses in human sequence databases. PMID:25631090
ERIC Educational Resources Information Center
Wernersson, Rasmus
2007-01-01
An important part of teaching students how to use the BLAST tool for searching large sequence databases, is to train the students to think critically about the quality of the sequence hits found--both in terms of the statistical significance and how informative the individual hits are. This paper describes how generating truly random sequences by…
Dai, Qi; Yang, Yanchun; Wang, Tianming
2008-10-15
Many proposed statistical measures can efficiently compare biological sequences to further infer their structures, functions and evolutionary information. They are related in spirit because all the ideas for sequence comparison try to use the information on the k-word distributions, Markov model or both. Motivated by adding k-word distributions to Markov model directly, we investigated two novel statistical measures for sequence comparison, called wre.k.r and S2.k.r. The proposed measures were tested by similarity search, evaluation on functionally related regulatory sequences and phylogenetic analysis. This offers the systematic and quantitative experimental assessment of our measures. Moreover, we compared our achievements with these based on alignment or alignment-free. We grouped our experiments into two sets. The first one, performed via ROC (receiver operating curve) analysis, aims at assessing the intrinsic ability of our statistical measures to search for similar sequences from a database and discriminate functionally related regulatory sequences from unrelated sequences. The second one aims at assessing how well our statistical measure is used for phylogenetic analysis. The experimental assessment demonstrates that our similarity measures intending to incorporate k-word distributions into Markov model are more efficient.
Huang, Yin-Fu; Wang, Chia-Ming; Liou, Sing-Wu
2013-01-01
A hybrid self-adaptive harmony search and back-propagation mining system was proposed to discover weighted patterns in human intron sequences. By testing the weights under a lazy nearest neighbor classifier, the numerical results revealed the significance of these weighted patterns. Comparing these weighted patterns with the popular intron consensus model, it is clear that the discovered weighted patterns make originally the ambiguous 5SS and 3SS header patterns more specific and concrete.
Wang, Chia-Ming; Liou, Sing-Wu
2013-01-01
A hybrid self-adaptive harmony search and back-propagation mining system was proposed to discover weighted patterns in human intron sequences. By testing the weights under a lazy nearest neighbor classifier, the numerical results revealed the significance of these weighted patterns. Comparing these weighted patterns with the popular intron consensus model, it is clear that the discovered weighted patterns make originally the ambiguous 5SS and 3SS header patterns more specific and concrete. PMID:23737711
Chamrad, Daniel C; Körting, Gerhard; Schäfer, Heike; Stephan, Christian; Thiele, Herbert; Apweiler, Rolf; Meyer, Helmut E; Marcus, Katrin; Blüggel, Martin
2006-09-01
A novel software tool named PTM-Explorer has been applied to LC-MS/MS datasets acquired within the Human Proteome Organisation (HUPO) Brain Proteome Project (BPP). PTM-Explorer enables automatic identification of peptide MS/MS spectra that were not explained in typical sequence database searches. The main focus was detection of PTMs, but PTM-Explorer detects also unspecific peptide cleavage, mass measurement errors, experimental modifications, amino acid substitutions, transpeptidation products and unknown mass shifts. To avoid a combinatorial problem the search is restricted to a set of selected protein sequences, which stem from previous protein identifications using a common sequence database search. Prior to application to the HUPO BPP data, PTM-Explorer was evaluated on excellently manually characterized and evaluated LC-MS/MS data sets from Alpha-A-Crystallin gel spots obtained from mouse eye lens. Besides various PTMs including phosphorylation, a wealth of experimental modifications and unspecific cleavage products were successfully detected, completing the primary structure information of the measured proteins. Our results indicate that a large amount of MS/MS spectra that currently remain unidentified in standard database searches contain valuable information that can only be elucidated using suitable software tools.
Gapped Spectral Dictionaries and Their Applications for Database Searches of Tandem Mass Spectra*
Jeong, Kyowon; Kim, Sangtae; Bandeira, Nuno; Pevzner, Pavel A.
2011-01-01
Generating all plausible de novo interpretations of a peptide tandem mass (MS/MS) spectrum (Spectral Dictionary) and quickly matching them against the database represent a recently emerged alternative approach to peptide identification. However, the sizes of the Spectral Dictionaries quickly grow with the peptide length making their generation impractical for long peptides. We introduce Gapped Spectral Dictionaries (all plausible de novo interpretations with gaps) that can be easily generated for any peptide length thus addressing the limitation of the Spectral Dictionary approach. We show that Gapped Spectral Dictionaries are small thus opening a possibility of using them to speed-up MS/MS searches. Our MS-GappedDictionary algorithm (based on Gapped Spectral Dictionaries) enables proteogenomics applications (such as searches in the six-frame translation of the human genome) that are prohibitively time consuming with existing approaches. MS-GappedDictionary generates gapped peptides that occupy a niche between accurate but short peptide sequence tags and long but inaccurate full length peptide reconstructions. We show that, contrary to conventional wisdom, some high-quality spectra do not have good peptide sequence tags and introduce gapped tags that have advantages over the conventional peptide sequence tags in MS/MS database searches. PMID:21444829
Yu, Qiang; Wei, Dingbang; Huo, Hongwei
2018-06-18
Given a set of t n-length DNA sequences, q satisfying 0 < q ≤ 1, and l and d satisfying 0 ≤ d < l < n, the quorum planted motif search (qPMS) finds l-length strings that occur in at least qt input sequences with up to d mismatches and is mainly used to locate transcription factor binding sites in DNA sequences. Existing qPMS algorithms have been able to efficiently process small standard datasets (e.g., t = 20 and n = 600), but they are too time consuming to process large DNA datasets, such as ChIP-seq datasets that contain thousands of sequences or more. We analyze the effects of t and q on the time performance of qPMS algorithms and find that a large t or a small q causes a longer computation time. Based on this information, we improve the time performance of existing qPMS algorithms by selecting a sample sequence set D' with a small t and a large q from the large input dataset D and then executing qPMS algorithms on D'. A sample sequence selection algorithm named SamSelect is proposed. The experimental results on both simulated and real data show (1) that SamSelect can select D' efficiently and (2) that the qPMS algorithms executed on D' can find implanted or real motifs in a significantly shorter time than when executed on D. We improve the ability of existing qPMS algorithms to process large DNA datasets from the perspective of selecting high-quality sample sequence sets so that the qPMS algorithms can find motifs in a short time in the selected sample sequence set D', rather than take an unfeasibly long time to search the original sequence set D. Our motif discovery method is an approximate algorithm.
NASA Technical Reports Server (NTRS)
Wheeler, Ward C.
2003-01-01
The problem of determining the minimum cost hypothetical ancestral sequences for a given cladogram is known to be NP-complete (Wang and Jiang, 1994). Traditionally, point estimations of hypothetical ancestral sequences have been used to gain heuristic, upper bounds on cladogram cost. These include procedures with such diverse approaches as non-additive optimization of multiple sequence alignment, direct optimization (Wheeler, 1996), and fixed-state character optimization (Wheeler, 1999). A method is proposed here which, by extending fixed-state character optimization, replaces the estimation process with a search. This form of optimization examines a diversity of potential state solutions for cost-efficient hypothetical ancestral sequences and can result in greatly more parsimonious cladograms. Additionally, such an approach can be applied to other NP-complete phylogenetic optimization problems such as genomic break-point analysis. c2003 The Willi Hennig Society. Published by Elsevier Science (USA). All rights reserved.
NABIC: A New Access Portal to Search, Visualize, and Share Agricultural Genomics Data.
Seol, Young-Joo; Lee, Tae-Ho; Park, Dong-Suk; Kim, Chang-Kug
2016-01-01
The National Agricultural Biotechnology Information Center developed an access portal to search, visualize, and share agricultural genomics data with a focus on South Korean information and resources. The portal features an agricultural biotechnology database containing a wide range of omics data from public and proprietary sources. We collected 28.4 TB of data from 162 agricultural organisms, with 10 types of omics data comprising next-generation sequencing sequence read archive, genome, gene, nucleotide, DNA chip, expressed sequence tag, interactome, protein structure, molecular marker, and single-nucleotide polymorphism datasets. Our genomic resources contain information on five animals, seven plants, and one fungus, which is accessed through a genome browser. We also developed a data submission and analysis system as a web service, with easy-to-use functions and cutting-edge algorithms, including those for handling next-generation sequencing data.
Raising the IQ in full-text searching via intelligent querying
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kero, R.; Russell, L.; Swietlik, C.
1994-11-01
Current Information Retrieval (IR) technologies allow for efficient access to relevant information, provided that user selected query terms coincide with the specific linguistical choices made by the authors whose works constitute the text-base. Therefore, the challenge is to enhance the limited searching capability of state-of-the-practice IR. This can be done either with augmented clients that overcome current server searching deficiencies, or with added capabilities that can augment searching algorithms on the servers. The technology being investigated is that of deductive databases, with a set of new techniques called cooperative answering. This technology utilizes semantic networks to allow for navigation betweenmore » possible query search term alternatives. The augmented search terms are passed to an IR engine and the results can be compared. The project utilizes the OSTI Environment, Safety and Health Thesaurus to populate the domain specific semantic network and the text base of ES&H related documents from the Facility Profile Information Management System as the domain specific search space.« less
Falkner, Jayson; Andrews, Philip
2005-05-15
Comparing tandem mass spectra (MSMS) against a known dataset of protein sequences is a common method for identifying unknown proteins; however, the processing of MSMS by current software often limits certain applications, including comprehensive coverage of post-translational modifications, non-specific searches and real-time searches to allow result-dependent instrument control. This problem deserves attention as new mass spectrometers provide the ability for higher throughput and as known protein datasets rapidly grow in size. New software algorithms need to be devised in order to address the performance issues of conventional MSMS protein dataset-based protein identification. This paper describes a novel algorithm based on converting a collection of monoisotopic, centroided spectra to a new data structure, named 'peptide finite state machine' (PFSM), which may be used to rapidly search a known dataset of protein sequences, regardless of the number of spectra searched or the number of potential modifications examined. The algorithm is verified using a set of commercially available tryptic digest protein standards analyzed using an ABI 4700 MALDI TOFTOF mass spectrometer, and a free, open source PFSM implementation. It is illustrated that a PFSM can accurately search large collections of spectra against large datasets of protein sequences (e.g. NCBI nr) using a regular desktop PC; however, this paper only details the method for identifying peptide and subsequently protein candidates from a dataset of known protein sequences. The concept of using a PFSM as a peptide pre-screening technique for MSMS-based search engines is validated by using PFSM with Mascot and XTandem. Complete source code, documentation and examples for the reference PFSM implementation are freely available at the Proteome Commons, http://www.proteomecommons.org and source code may be used both commercially and non-commercially as long as the original authors are credited for their work.
Stansfield, Claire; O'Mara-Eves, Alison; Thomas, James
2017-09-01
Using text mining to aid the development of database search strings for topics described by diverse terminology has potential benefits for systematic reviews; however, methods and tools for accomplishing this are poorly covered in the research methods literature. We briefly review the literature on applications of text mining for search term development for systematic reviewing. We found that the tools can be used in 5 overarching ways: improving the precision of searches; identifying search terms to improve search sensitivity; aiding the translation of search strategies across databases; searching and screening within an integrated system; and developing objectively derived search strategies. Using a case study and selected examples, we then reflect on the utility of certain technologies (term frequency-inverse document frequency and Termine, term frequency, and clustering) in improving the precision and sensitivity of searches. Challenges in using these tools are discussed. The utility of these tools is influenced by the different capabilities of the tools, the way the tools are used, and the text that is analysed. Increased awareness of how the tools perform facilitates the further development of methods for their use in systematic reviews. Copyright © 2017 John Wiley & Sons, Ltd.
Hernández, Yözen; Bernstein, Rocky; Pagan, Pedro; Vargas, Levy; McCaig, William; Ramrattan, Girish; Akther, Saymon; Larracuente, Amanda; Di, Lia; Vieira, Filipe G; Qiu, Wei-Gang
2018-03-02
Automated bioinformatics workflows are more robust, easier to maintain, and results more reproducible when built with command-line utilities than with custom-coded scripts. Command-line utilities further benefit by relieving bioinformatics developers to learn the use of, or to interact directly with, biological software libraries. There is however a lack of command-line utilities that leverage popular Open Source biological software toolkits such as BioPerl ( http://bioperl.org ) to make many of the well-designed, robust, and routinely used biological classes available for a wider base of end users. Designed as standard utilities for UNIX-family operating systems, BpWrapper makes functionality of some of the most popular BioPerl modules readily accessible on the command line to novice as well as to experienced bioinformatics practitioners. The initial release of BpWrapper includes four utilities with concise command-line user interfaces, bioseq, bioaln, biotree, and biopop, specialized for manipulation of molecular sequences, sequence alignments, phylogenetic trees, and DNA polymorphisms, respectively. Over a hundred methods are currently available as command-line options and new methods are easily incorporated. Performance of BpWrapper utilities lags that of precompiled utilities while equivalent to that of other utilities based on BioPerl. BpWrapper has been tested on BioPerl Release 1.6, Perl versions 5.10.1 to 5.25.10, and operating systems including Apple macOS, Microsoft Windows, and GNU/Linux. Release code is available from the Comprehensive Perl Archive Network (CPAN) at https://metacpan.org/pod/Bio::BPWrapper . Source code is available on GitHub at https://github.com/bioperl/p5-bpwrapper . BpWrapper improves on existing sequence utilities by following the design principles of Unix text utilities such including a concise user interface, extensive command-line options, and standard input/output for serialized operations. Further, dozens of novel methods for manipulation of sequences, alignments, and phylogenetic trees, unavailable in existing utilities (e.g., EMBOSS, Newick Utilities, and FAST), are provided. Bioinformaticians should find BpWrapper useful for rapid prototyping of workflows on the command-line without creating custom scripts for comparative genomics and other bioinformatics applications.
de la Torre-Díez, Isabel; López-Coronado, Miguel; Vaca, Cesar; Aguado, Jesús Saez; de Castro, Carlos
2015-02-01
A systematic review of cost-utility and cost-effectiveness research works of telemedicine, electronic health (e-health), and mobile health (m-health) systems in the literature is presented. Academic databases and systems such as PubMed, Scopus, ISI Web of Science, and IEEE Xplore were searched, using different combinations of terms such as "cost-utility" OR "cost utility" AND "telemedicine," "cost-effectiveness" OR "cost effectiveness" AND "mobile health," etc. In the articles searched, there were no limitations in the publication date. The search identified 35 relevant works. Many of the articles were reviews of different studies. Seventy-nine percent concerned the cost-effectiveness of telemedicine systems in different specialties such as teleophthalmology, telecardiology, teledermatology, etc. More articles were found between 2000 and 2013. Cost-utility studies were done only for telemedicine systems. There are few cost-utility and cost-effectiveness studies for e-health and m-health systems in the literature. Some cost-effectiveness studies demonstrate that telemedicine can reduce the costs, but not all. Among the main limitations of the economic evaluations of telemedicine systems are the lack of randomized control trials, small sample sizes, and the absence of quality data and appropriate measures.
Sastre, Elizabeth Ann; Denny, Joshua C; McCoy, Jacob A; McCoy, Allison B; Spickard, Anderson
2011-01-01
Effective teaching of evidence-based medicine (EBM) to medical students is important for lifelong self-directed learning. We implemented a brief workshop designed to teach literature searching skills to third-year medical students. We assessed its impact on students' utilization of EBM resources during their clinical rotation and the quality of EBM integration in inpatient notes. We developed a physician-led, hands-on workshop to introduce EBM resources to all internal medicine clerks. Pre- and post-workshop measures included student's attitudes to EBM, citations of EBM resources in their clinical notes, and quality of the EBM component of the discussion in the note. Computer log analysis recorded students' online search attempts. After the workshop, students reported improved comfort using EBM and increased utilization of EBM resources. EBM integration into the discussion component of the notes also showed significant improvement. Computer log analysis of students' searches demonstrated increased utilization of EBM resources following the workshop. We describe the successful implementation of a workshop designed to teach third-year medical students how to perform an efficient EBM literature search. We demonstrated improvements in students' confidence regarding EBM, increased utilization of EBM resources, and improved integration of EBM into inpatient notes.
GPU-Acceleration of Sequence Homology Searches with Database Subsequence Clustering
Suzuki, Shuji; Kakuta, Masanori; Ishida, Takashi; Akiyama, Yutaka
2016-01-01
Sequence homology searches are used in various fields and require large amounts of computation time, especially for metagenomic analysis, owing to the large number of queries and the database size. To accelerate computing analyses, graphics processing units (GPUs) are widely used as a low-cost, high-performance computing platform. Therefore, we mapped the time-consuming steps involved in GHOSTZ, which is a state-of-the-art homology search algorithm for protein sequences, onto a GPU and implemented it as GHOSTZ-GPU. In addition, we optimized memory access for GPU calculations and for communication between the CPU and GPU. As per results of the evaluation test involving metagenomic data, GHOSTZ-GPU with 12 CPU threads and 1 GPU was approximately 3.0- to 4.1-fold faster than GHOSTZ with 12 CPU threads. Moreover, GHOSTZ-GPU with 12 CPU threads and 3 GPUs was approximately 5.8- to 7.7-fold faster than GHOSTZ with 12 CPU threads. PMID:27482905
de Andrade, Roberto R S; Vaslin, Maite F S
2014-03-07
Next-generation parallel sequencing (NGS) allows the identification of viral pathogens by sequencing the small RNAs of infected hosts. Thus, viral genomes may be assembled from host immune response products without prior virus enrichment, amplification or purification. However, mapping of the vast information obtained presents a bioinformatics challenge. In order to by pass the need of line command and basic bioinformatics knowledge, we develop a mapping software with a graphical interface to the assemblage of viral genomes from small RNA dataset obtained by NGS. SearchSmallRNA was developed in JAVA language version 7 using NetBeans IDE 7.1 software. The program also allows the analysis of the viral small interfering RNAs (vsRNAs) profile; providing an overview of the size distribution and other features of the vsRNAs produced in infected cells. The program performs comparisons between each read sequenced present in a library and a chosen reference genome. Reads showing Hamming distances smaller or equal to an allowed mismatched will be selected as positives and used to the assemblage of a long nucleotide genome sequence. In order to validate the software, distinct analysis using NGS dataset obtained from HIV and two plant viruses were used to reconstruct viral whole genomes. SearchSmallRNA program was able to reconstructed viral genomes using NGS of small RNA dataset with high degree of reliability so it will be a valuable tool for viruses sequencing and discovery. It is accessible and free to all research communities and has the advantage to have an easy-to-use graphical interface. SearchSmallRNA was written in Java and is freely available at http://www.microbiologia.ufrj.br/ssrna/.
2014-01-01
Background Next-generation parallel sequencing (NGS) allows the identification of viral pathogens by sequencing the small RNAs of infected hosts. Thus, viral genomes may be assembled from host immune response products without prior virus enrichment, amplification or purification. However, mapping of the vast information obtained presents a bioinformatics challenge. Methods In order to by pass the need of line command and basic bioinformatics knowledge, we develop a mapping software with a graphical interface to the assemblage of viral genomes from small RNA dataset obtained by NGS. SearchSmallRNA was developed in JAVA language version 7 using NetBeans IDE 7.1 software. The program also allows the analysis of the viral small interfering RNAs (vsRNAs) profile; providing an overview of the size distribution and other features of the vsRNAs produced in infected cells. Results The program performs comparisons between each read sequenced present in a library and a chosen reference genome. Reads showing Hamming distances smaller or equal to an allowed mismatched will be selected as positives and used to the assemblage of a long nucleotide genome sequence. In order to validate the software, distinct analysis using NGS dataset obtained from HIV and two plant viruses were used to reconstruct viral whole genomes. Conclusions SearchSmallRNA program was able to reconstructed viral genomes using NGS of small RNA dataset with high degree of reliability so it will be a valuable tool for viruses sequencing and discovery. It is accessible and free to all research communities and has the advantage to have an easy-to-use graphical interface. Availability and implementation SearchSmallRNA was written in Java and is freely available at http://www.microbiologia.ufrj.br/ssrna/. PMID:24607237
A VLSI chip set for real time vector quantization of image sequences
NASA Technical Reports Server (NTRS)
Baker, Richard L.
1989-01-01
The architecture and implementation of a VLSI chip set that vector quantizes (VQ) image sequences in real time is described. The chip set forms a programmable Single-Instruction, Multiple-Data (SIMD) machine which can implement various vector quantization encoding structures. Its VQ codebook may contain unlimited number of codevectors, N, having dimension up to K = 64. Under a weighted least squared error criterion, the engine locates at video rates the best code vector in full-searched or large tree searched VQ codebooks. The ability to manipulate tree structured codebooks, coupled with parallelism and pipelining, permits searches in as short as O (log N) cycles. A full codebook search results in O(N) performance, compared to O(KN) for a Single-Instruction, Single-Data (SISD) machine. With this VLSI chip set, an entire video code can be built on a single board that permits realtime experimentation with very large codebooks.
Sevy, Alexander M.; Jacobs, Tim M.; Crowe, James E.; Meiler, Jens
2015-01-01
Computational protein design has found great success in engineering proteins for thermodynamic stability, binding specificity, or enzymatic activity in a ‘single state’ design (SSD) paradigm. Multi-specificity design (MSD), on the other hand, involves considering the stability of multiple protein states simultaneously. We have developed a novel MSD algorithm, which we refer to as REstrained CONvergence in multi-specificity design (RECON). The algorithm allows each state to adopt its own sequence throughout the design process rather than enforcing a single sequence on all states. Convergence to a single sequence is encouraged through an incrementally increasing convergence restraint for corresponding positions. Compared to MSD algorithms that enforce (constrain) an identical sequence on all states the energy landscape is simplified, which accelerates the search drastically. As a result, RECON can readily be used in simulations with a flexible protein backbone. We have benchmarked RECON on two design tasks. First, we designed antibodies derived from a common germline gene against their diverse targets to assess recovery of the germline, polyspecific sequence. Second, we design “promiscuous”, polyspecific proteins against all binding partners and measure recovery of the native sequence. We show that RECON is able to efficiently recover native-like, biologically relevant sequences in this diverse set of protein complexes. PMID:26147100
Genome-Wide SNP Genotyping to Infer the Effects on Gene Functions in Tomato
Hirakawa, Hideki; Shirasawa, Kenta; Ohyama, Akio; Fukuoka, Hiroyuki; Aoki, Koh; Rothan, Christophe; Sato, Shusei; Isobe, Sachiko; Tabata, Satoshi
2013-01-01
The genotype data of 7054 single nucleotide polymorphism (SNP) loci in 40 tomato lines, including inbred lines, F1 hybrids, and wild relatives, were collected using Illumina's Infinium and GoldenGate assay platforms, the latter of which was utilized in our previous study. The dendrogram based on the genotype data corresponded well to the breeding types of tomato and wild relatives. The SNPs were classified into six categories according to their positions in the genes predicted on the tomato genome sequence. The genes with SNPs were annotated by homology searches against the nucleotide and protein databases, as well as by domain searches, and they were classified into the functional categories defined by the NCBI's eukaryotic orthologous groups (KOG). To infer the SNPs' effects on the gene functions, the three-dimensional structures of the 843 proteins that were encoded by the genes with SNPs causing missense mutations were constructed by homology modelling, and 200 of these proteins were considered to carry non-synonymous amino acid substitutions in the predicted functional sites. The SNP information obtained in this study is available at the Kazusa Tomato Genomics Database (http://plant1.kazusa.or.jp/tomato/). PMID:23482505
Brezina, Paul R; Anchan, Raymond; Kearns, William G
2016-07-01
The purpose of the review was to define the various diagnostic platforms currently available to perform preimplantation genetic testing for aneuploidy and describe in a clear and balanced manner the various strengths and weaknesses of these technologies. A systematic literature review was conducted. We used the terms "preimplantation genetic testing," "preimplantation genetic diagnosis," "preimplantation genetic screening," "preimplantation genetic diagnosis for aneuploidy," "PGD," "PGS," and "PGD-A" to search through PubMed, ScienceDirect, and Google Scholar from the year 2000 to April 2016. Bibliographies of articles were also searched for relevant studies. When possible, larger randomized controlled trials were used. However, for some emerging data, only data from meeting abstracts were available. PGS is emerging as one of the most valuable tools to enhance pregnancy success with assisted reproductive technologies. While all of the current diagnostic platforms currently available have various advantages and disadvantages, some platforms, such as next-generation sequencing (NGS), are capable of evaluating far more data points than has been previously possible. The emerging complexity of different technologies, especially with the utilization of more sophisticated tools such as NGS, requires an understanding by clinicians in order to request the best test for their patients.. Ultimately, the choice of which diagnostic platform is utilized should be individualized to the needs of both the clinic and the patient. Such a decision must incorporate the risk tolerance of both the patient and provider, fiscal considerations, and other factors such as the ability to counsel patients on their testing results and how these may or may not impact clinical outcomes.
Kemme, Catherine A; Marquez, Rolando; Luu, Ross H; Iwahara, Junji
2017-07-27
Eukaryotic genomes contain numerous non-functional high-affinity sequences for transcription factors. These sequences potentially serve as natural decoys that sequester transcription factors. We have previously shown that the presence of sequences similar to the target sequence could substantially impede association of the transcription factor Egr-1 with its targets. In this study, using a stopped-flow fluorescence method, we examined the kinetic impact of DNA methylation of decoys on the search process of the Egr-1 zinc-finger protein. We analyzed its association with an unmethylated target site on fluorescence-labeled DNA in the presence of competitor DNA duplexes, including Egr-1 decoys. DNA methylation of decoys alone did not affect target search kinetics. In the presence of the MeCP2 methyl-CpG-binding domain (MBD), however, DNA methylation of decoys substantially (∼10-30-fold) accelerated the target search process of the Egr-1 zinc-finger protein. This acceleration did not occur when the target was also methylated. These results suggest that when decoys are methylated, MBD proteins can block them and thereby allow Egr-1 to avoid sequestration in non-functional locations. This effect may occur in vivo for DNA methylation outside CpG islands (CGIs) and could facilitate localization of some transcription factors within regulatory CGIs, where DNA methylation is rare. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.
Li, Xiaobin; Xie, Yingzhou; Liu, Meng; Tai, Cui; Sun, Jingyong; Deng, Zixin; Ou, Hong-Yu
2018-05-04
oriTfinder is a web server that facilitates the rapid identification of the origin of transfer site (oriT) of a conjugative plasmid or chromosome-borne integrative and conjugative element. The utilized back-end database oriTDB was built upon more than one thousand known oriT regions of bacterial mobile genetic elements (MGEs) as well as the known MGE-encoding relaxases and type IV coupling proteins (T4CP). With a combination of similarity searches for the oriTDB-archived oriT nucleotide sequences and the co-localization of the flanking relaxase homologous genes, the oriTfinder can predict the oriT region with high accuracy in the DNA sequence of a bacterial plasmid or chromosome in minutes. The server also detects the other transfer-related modules, including the potential relaxase gene, T4CP gene and the type IV secretion system gene cluster, and the putative genes coding for virulence factors and acquired antibiotic resistance determinants. oriTfinder may contribute to meeting the increasing demands of re-annotations for bacterial conjugative, mobilizable or non-transferable elements and aid in the rapid risk accession of disease-relevant trait dissemination in pathogenic bacteria of interest. oriTfinder is freely available to all users without any login requirement at http://bioinfo-mml.sjtu.edu.cn/oriTfinder.
Li, Honglan; Joh, Yoon Sung; Kim, Hyunwoo; Paek, Eunok; Lee, Sang-Won; Hwang, Kyu-Baek
2016-12-22
Proteogenomics is a promising approach for various tasks ranging from gene annotation to cancer research. Databases for proteogenomic searches are often constructed by adding peptide sequences inferred from genomic or transcriptomic evidence to reference protein sequences. Such inflation of databases has potential of identifying novel peptides. However, it also raises concerns on sensitive and reliable peptide identification. Spurious peptides included in target databases may result in underestimated false discovery rate (FDR). On the other hand, inflation of decoy databases could decrease the sensitivity of peptide identification due to the increased number of high-scoring random hits. Although several studies have addressed these issues, widely applicable guidelines for sensitive and reliable proteogenomic search have hardly been available. To systematically evaluate the effect of database inflation in proteogenomic searches, we constructed a variety of real and simulated proteogenomic databases for yeast and human tandem mass spectrometry (MS/MS) data, respectively. Against these databases, we tested two popular database search tools with various approaches to search result validation: the target-decoy search strategy (with and without a refined scoring-metric) and a mixture model-based method. The effect of separate filtering of known and novel peptides was also examined. The results from real and simulated proteogenomic searches confirmed that separate filtering increases the sensitivity and reliability in proteogenomic search. However, no one method consistently identified the largest (or the smallest) number of novel peptides from real proteogenomic searches. We propose to use a set of search result validation methods with separate filtering, for sensitive and reliable identification of peptides in proteogenomic search.
Geisler, Christoph; Jarvis, Donald L
2016-07-01
Spodoptera frugiperda (Sf) cell lines are used to produce several biologicals for human and veterinary use. Recently, it was discovered that all tested Sf cell lines are persistently infected with Sf-rhabdovirus, a novel rhabdovirus. As part of an effort to search for other adventitious viruses, we searched the Sf cell genome and transcriptome for sequences related to Sf-rhabdovirus. To our surprise, we found intact Sf-rhabdovirus N- and P-like ORFs, and partial Sf-rhabdovirus G- and L-like ORFs. The transcribed and genomic sequences matched, indicating the transcripts were derived from the genomic sequences. These appear to be endogenous viral elements (EVEs), which result from the integration of partial viral genetic material into the host cell genome. It is theoretically impossible for the Sf-rhabdovirus-like EVEs to produce infectious virus particles as 1) they are disseminated across 4 genomic loci, 2) the G and L ORFs are incomplete, and 3) the M ORF is missing. Our finding of transcribed virus-like sequences in Sf cells underscores that MPS-based searches for adventitious viruses in cell substrates used to manufacture biologics should take into account both genomic and transcribed sequences to facilitate the identification of transcribed EVE's, and to avoid false positive detection of replication-competent adventitious viruses. Copyright © 2016 International Alliance for Biological Standardization. Published by Elsevier Ltd. All rights reserved.
Geisler, Christoph; Jarvis, Donald L.
2016-01-01
Spodoptera frugiperda (Sf) cell lines are used to produce several biologicals for human and veterinary use. Recently, it was discovered that all tested Sf cell lines are persistently infected with Sf-rhabdovirus, a novel rhabdovirus. As part of an effort to search for other adventitious viruses, we searched the Sf cell genome and transcriptome for sequences related to Sf-rhabdovirus. To our surprise, we found intact Sf-rhabdovirus N- and P-like ORFs, and partial Sf-rhabdovirus G- and L-like ORFs. The transcribed and genomic sequences matched, indicating the transcripts were derived from the genomic sequences. These appear to be endogenous viral elements (EVEs), which result from the integration of partial viral genetic material into the host cell genome. It is theoretically impossible for the Sf-rhabdovirus-like EVEs to produce infectious virus particles as 1) they are disseminated across 4 genomic loci, 2) the G and L ORFs are incomplete, and 3) the M ORF is missing. Our finding of transcribed virus-like sequences in Sf cells underscores that MPS-based searches for adventitious viruses in cell substrates used to manufacture biologics should take into account both genomic and transcribed sequences to facilitate the identification of transcribed EVE's, and to avoid false positive detection of replication-competent adventitious viruses. PMID:27236849
Pleurochrysome: A Web Database of Pleurochrysis Transcripts and Orthologs Among Heterogeneous Algae
Fujiwara, Shoko; Takatsuka, Yukiko; Hirokawa, Yasutaka; Tsuzuki, Mikio; Takano, Tomoyuki; Kobayashi, Masaaki; Suda, Kunihiro; Asamizu, Erika; Yokoyama, Koji; Shibata, Daisuke; Tabata, Satoshi; Yano, Kentaro
2016-01-01
Pleurochrysis is a coccolithophorid genus, which belongs to the Coccolithales in the Haptophyta. The genus has been used extensively for biological research, together with Emiliania in the Isochrysidales, to understand distinctive features between the two coccolithophorid-including orders. However, molecular biological research on Pleurochrysis such as elucidation of the molecular mechanism behind coccolith formation has not made great progress at least in part because of lack of comprehensive gene information. To provide such information to the research community, we built an open web database, the Pleurochrysome (http://bioinf.mind.meiji.ac.jp/phapt/), which currently stores 9,023 unique gene sequences (designated as UNIGENEs) assembled from expressed sequence tag sequences of P. haptonemofera as core information. The UNIGENEs were annotated with gene sequences sharing significant homology, conserved domains, Gene Ontology, KEGG Orthology, predicted subcellular localization, open reading frames and orthologous relationship with genes of 10 other algal species, a cyanobacterium and the yeast Saccharomyces cerevisiae. This sequence and annotation information can be easily accessed via several search functions. Besides fundamental functions such as BLAST and keyword searches, this database also offers search functions to explore orthologous genes in the 12 organisms and to seek novel genes. The Pleurochrysome will promote molecular biological and phylogenetic research on coccolithophorids and other haptophytes by helping scientists mine data from the primary transcriptome of P. haptonemofera. PMID:26746174
USDA-ARS?s Scientific Manuscript database
Modern day genomics holds the promise of solving the complexities of basic plant sciences, and of catalyzing practical advances in plant breeding. While contiguous, "base perfect" deep sequencing is a key module of any genome project, recent advances in parallel next generation sequencing technologi...
Blank-Landeshammer, Bernhard; Kollipara, Laxmikanth; Biß, Karsten; Pfenninger, Markus; Malchow, Sebastian; Shuvaev, Konstantin; Zahedi, René P; Sickmann, Albert
2017-09-01
Complex mass spectrometry based proteomics data sets are mostly analyzed by protein database searches. While this approach performs considerably well for sequenced organisms, direct inference of peptide sequences from tandem mass spectra, i.e., de novo peptide sequencing, oftentimes is the only way to obtain information when protein databases are absent. However, available algorithms suffer from drawbacks such as lack of validation and often high rates of false positive hits (FP). Here we present a simple method of combining results from commonly available de novo peptide sequencing algorithms, which in conjunction with minor tweaks in data acquisition ensues lower empirical FDR compared to the analysis using single algorithms. Results were validated using state-of-the art database search algorithms as well specifically synthesized reference peptides. Thus, we could increase the number of PSMs meeting a stringent FDR of 5% more than 3-fold compared to the single best de novo sequencing algorithm alone, accounting for an average of 11 120 PSMs (combined) instead of 3476 PSMs (alone) in triplicate 2 h LC-MS runs of tryptic HeLa digestion.
Accelerating Information Retrieval from Profile Hidden Markov Model Databases.
Tamimi, Ahmad; Ashhab, Yaqoub; Tamimi, Hashem
2016-01-01
Profile Hidden Markov Model (Profile-HMM) is an efficient statistical approach to represent protein families. Currently, several databases maintain valuable protein sequence information as profile-HMMs. There is an increasing interest to improve the efficiency of searching Profile-HMM databases to detect sequence-profile or profile-profile homology. However, most efforts to enhance searching efficiency have been focusing on improving the alignment algorithms. Although the performance of these algorithms is fairly acceptable, the growing size of these databases, as well as the increasing demand for using batch query searching approach, are strong motivations that call for further enhancement of information retrieval from profile-HMM databases. This work presents a heuristic method to accelerate the current profile-HMM homology searching approaches. The method works by cluster-based remodeling of the database to reduce the search space, rather than focusing on the alignment algorithms. Using different clustering techniques, 4284 TIGRFAMs profiles were clustered based on their similarities. A representative for each cluster was assigned. To enhance sensitivity, we proposed an extended step that allows overlapping among clusters. A validation benchmark of 6000 randomly selected protein sequences was used to query the clustered profiles. To evaluate the efficiency of our approach, speed and recall values were measured and compared with the sequential search approach. Using hierarchical, k-means, and connected component clustering techniques followed by the extended overlapping step, we obtained an average reduction in time of 41%, and an average recall of 96%. Our results demonstrate that representation of profile-HMMs using a clustering-based approach can significantly accelerate data retrieval from profile-HMM databases.
Investigation of negative-parity states in Dy 156 : Search for evidence of tetrahedral symmetry
Hartley, D. J.; Riedinger, L. L.; Janssens, R. V. F.; ...
2017-01-01
An experiment populating low/medium-spin states in 156Dy was performed to investigate the possibility of tetrahedral symmetry in this nucleus. In particular, focus was placed on the low-spin, negative-parity states since recent theoretical studies suggest that these may be good candidates for this high-rank symmetry. The states were produced in the 148Nd( 12C,4 n) reaction and the Gammasphere array was utilized to detect the emitted rays. B(E 2) /B(E1) ratios of transition probabilities from the low-spin, negative-parity bands were determined and used to interpret whether these structures are best associated with tetrahedral symmetry or, as previously assigned, to octupole vibrations. Additionally,more » several other negative-parity structures were observed to higher spin and two new sequences were established« less
Analysis of the archaeal sub-seafloor community at Suiyo Seamount on the Izu-Bonin Arc.
Hara, Kurt; Kakegawa, Takeshi; Yamashiro, Kan; Maruyama, Akihiko; Ishibashi, Jun-Ichiro; Marumo, Katsumi; Urabe, Tetsuro; Yamagishi, Akihiko
2005-01-01
A sub-surface archaeal community at the Suiyo Seamount in the Western Pacific Ocean was investigated by 16S rRNA gene sequence and whole-cell in situ hybridization analyses. In this study, we drilled and cased holes at the hydrothermal area of the seamount to minimize contamination of the hydrothermal fluid in the sub-seafloor by penetrating seawater. PCR clone analysis of the hydrothermal fluid samples collected from a cased hole indicated the presence of chemolithoautotrophic primary biomass producers of Archaeoglobales and the Methanococcales-related archaeal HTE1 group, both of which can utilize hydrogen as an electron donor. We discuss the implication of the microbial community on the early history of life and on the search for extraterrestrial life. c2005 COSPAR. Published by Elsevier Ltd. All rights reserved.
Investigation of negative-parity states in Dy 156 : Search for evidence of tetrahedral symmetry
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hartley, D. J.; Riedinger, L. L.; Janssens, R. V. F.
2017-01-01
An experiment populating low/medium-spin states in 156 Dy was performed to investigate the possibility of tetrahedral symmetry in this nucleus. In particular, focus was placed on the low-spin, negative-parity states since recent theoretical studies suggest that these may be good candidates for this high-rank symmetry. The states were produced in the 148 Nd ( 12 C , 4 n ) reaction and the Gammasphere array was utilized to detect the emitted γ rays. B ( E 2 ) / B ( E 1 ) ratios of transition probabilities from the low-spin, negative-parity bands were determined and used to interpret whethermore » these structures are best associated with tetrahedral symmetry or, as previously assigned, to octupole vibrations. In addition, several other negative-parity structures were observed to higher spin and two new sequences were established.« less
Moriyama, Koichi; Fukui, Ken–ichi; Numao, Masayuki
2015-01-01
Background Non-medical professionals (consumers) are increasingly using the Internet to support their health information needs. However, the cognitive effort required to perform health information searches is affected by the consumer’s familiarity with health topics. Consumers may have different levels of familiarity with individual health topics. This variation in familiarity may cause misunderstandings because the information presented by search engines may not be understood correctly by the consumers. Objective As a first step toward the improvement of the health information search process, we aimed to examine the effects of health topic familiarity on health information search behaviors by identifying the common search activity patterns exhibited by groups of consumers with different levels of familiarity. Methods Each participant completed a health terminology familiarity questionnaire and health information search tasks. The responses to the familiarity questionnaire were used to grade the familiarity of participants with predefined health topics. The search task data were transcribed into a sequence of search activities using a coding scheme. A computational model was constructed from the sequence data using a Markov chain model to identify the common search patterns in each familiarity group. Results Forty participants were classified into L1 (not familiar), L2 (somewhat familiar), and L3 (familiar) groups based on their questionnaire responses. They had different levels of familiarity with four health topics. The video data obtained from all of the participants were transcribed into 4595 search activities (mean 28.7, SD 23.27 per session). The most frequent search activities and transitions in all the familiarity groups were related to evaluations of the relevancy of selected web pages in the retrieval results. However, the next most frequent transitions differed in each group and a chi-squared test confirmed this finding (P<.001). Next, according to the results of a perplexity evaluation, the health information search patterns were best represented as a 5-gram sequence pattern. The most common patterns in group L1 were frequent query modifications, with relatively low search efficiency, and accessing and evaluating selected results from a health website. Group L2 performed frequent query modifications, but with better search efficiency, and accessed and evaluated selected results from a health website. Finally, the members of group L3 successfully discovered relevant results from the first query submission, performed verification by accessing several health websites after they discovered relevant results, and directly accessed consumer health information websites. Conclusions Familiarity with health topics affects health information search behaviors. Our analysis of state transitions in search activities detected unique behaviors and common search activity patterns in each familiarity group during health information searches. PMID:25783222
Literature-based condition-specific miRNA-mRNA target prediction.
Oh, Minsik; Rhee, Sungmin; Moon, Ji Hwan; Chae, Heejoon; Lee, Sunwon; Kang, Jaewoo; Kim, Sun
2017-01-01
miRNAs are small non-coding RNAs that regulate gene expression by binding to the 3'-UTR of genes. Many recent studies have reported that miRNAs play important biological roles by regulating specific mRNAs or genes. Many sequence-based target prediction algorithms have been developed to predict miRNA targets. However, these methods are not designed for condition-specific target predictions and produce many false positives; thus, expression-based target prediction algorithms have been developed for condition-specific target predictions. A typical strategy to utilize expression data is to leverage the negative control roles of miRNAs on genes. To control false positives, a stringent cutoff value is typically set, but in this case, these methods tend to reject many true target relationships, i.e., false negatives. To overcome these limitations, additional information should be utilized. The literature is probably the best resource that we can utilize. Recent literature mining systems compile millions of articles with experiments designed for specific biological questions, and the systems provide a function to search for specific information. To utilize the literature information, we used a literature mining system, BEST, that automatically extracts information from the literature in PubMed and that allows the user to perform searches of the literature with any English words. By integrating omics data analysis methods and BEST, we developed Context-MMIA, a miRNA-mRNA target prediction method that combines expression data analysis results and the literature information extracted based on the user-specified context. In the pathway enrichment analysis using genes included in the top 200 miRNA-targets, Context-MMIA outperformed the four existing target prediction methods that we tested. In another test on whether prediction methods can re-produce experimentally validated target relationships, Context-MMIA outperformed the four existing target prediction methods. In summary, Context-MMIA allows the user to specify a context of the experimental data to predict miRNA targets, and we believe that Context-MMIA is very useful for predicting condition-specific miRNA targets.
Protein Information Resource: a community resource for expert annotation of protein data
Barker, Winona C.; Garavelli, John S.; Hou, Zhenglin; Huang, Hongzhan; Ledley, Robert S.; McGarvey, Peter B.; Mewes, Hans-Werner; Orcutt, Bruce C.; Pfeiffer, Friedhelm; Tsugita, Akira; Vinayaka, C. R.; Xiao, Chunlin; Yeh, Lai-Su L.; Wu, Cathy
2001-01-01
The Protein Information Resource, in collaboration with the Munich Information Center for Protein Sequences (MIPS) and the Japan International Protein Information Database (JIPID), produces the most comprehensive and expertly annotated protein sequence database in the public domain, the PIR-International Protein Sequence Database. To provide timely and high quality annotation and promote database interoperability, the PIR-International employs rule-based and classification-driven procedures based on controlled vocabulary and standard nomenclature and includes status tags to distinguish experimentally determined from predicted protein features. The database contains about 200 000 non-redundant protein sequences, which are classified into families and superfamilies and their domains and motifs identified. Entries are extensively cross-referenced to other sequence, classification, genome, structure and activity databases. The PIR web site features search engines that use sequence similarity and database annotation to facilitate the analysis and functional identification of proteins. The PIR-International databases and search tools are accessible on the PIR web site at http://pir.georgetown.edu/ and at the MIPS web site at http://www.mips.biochem.mpg.de. The PIR-International Protein Sequence Database and other files are also available by FTP. PMID:11125041
Liang, Chanjuan; van Dijk, Jeroen P; Scholtens, Ingrid M J; Staats, Martijn; Prins, Theo W; Voorhuijzen, Marleen M; da Silva, Andrea M; Arisi, Ana Carolina Maisonnave; den Dunnen, Johan T; Kok, Esther J
2014-04-01
The growing number of biotech crops with novel genetic elements increasingly complicates the detection of genetically modified organisms (GMOs) in food and feed samples using conventional screening methods. Unauthorized GMOs (UGMOs) in food and feed are currently identified through combining GMO element screening with sequencing the DNA flanking these elements. In this study, a specific and sensitive qPCR assay was developed for vip3A element detection based on the vip3Aa20 coding sequences of the recently marketed MIR162 maize and COT102 cotton. Furthermore, SiteFinding-PCR in combination with Sanger, Illumina or Pacific BioSciences (PacBio) sequencing was performed targeting the flanking DNA of the vip3Aa20 element in MIR162. De novo assembly and Basic Local Alignment Search Tool searches were used to mimic UGMO identification. PacBio data resulted in relatively long contigs in the upstream (1,326 nucleotides (nt); 95 % identity) and downstream (1,135 nt; 92 % identity) regions, whereas Illumina data resulted in two smaller contigs of 858 and 1,038 nt with higher sequence identity (>99 % identity). Both approaches outperformed Sanger sequencing, underlining the potential for next-generation sequencing in UGMO identification.
Xia, Jia; Yang, Lili; Chen, Jialin; Wu, Yuping; Yi, Meisheng
2013-01-01
Background The Indo-Pacific humpback dolphin (Sousa chinensis), a marine mammal species inhabited in the waters of Southeast Asia, South Africa and Australia, has attracted much attention because of the dramatic decline in population size in the past decades, which raises the concern of extinction. So far, this species is poorly characterized at molecular level due to little sequence information available in public databases. Recent advances in large-scale RNA sequencing provide an efficient approach to generate abundant sequences for functional genomic analyses in the species with un-sequenced genomes. Principal Findings We performed a de novo assembly of the Indo-Pacific humpback dolphin leucocyte transcriptome by Illumina sequencing. 108,751 high quality sequences from 47,840,388 paired-end reads were generated, and 48,868 and 46,587 unigenes were functionally annotated by BLAST search against the NCBI non-redundant and Swiss-Prot protein databases (E-value<10−5), respectively. In total, 16,467 unigenes were clustered into 25 functional categories by searching against the COG database, and BLAST2GO search assigned 37,976 unigenes to 61 GO terms. In addition, 36,345 unigenes were grouped into 258 KEGG pathways. We also identified 9,906 simple sequence repeats and 3,681 putative single nucleotide polymorphisms as potential molecular markers in our assembled sequences. A large number of unigenes were predicted to be involved in immune response, and many genes were predicted to be relevant to adaptive evolution and cetacean-specific traits. Conclusion This study represented the first transcriptome analysis of the Indo-Pacific humpback dolphin, an endangered species. The de novo transcriptome analysis of the unique transcripts will provide valuable sequence information for discovery of new genes, characterization of gene expression, investigation of various pathways and adaptive evolution, as well as identification of genetic markers. PMID:24015242
Gui, Duan; Jia, Kuntong; Xia, Jia; Yang, Lili; Chen, Jialin; Wu, Yuping; Yi, Meisheng
2013-01-01
The Indo-Pacific humpback dolphin (Sousa chinensis), a marine mammal species inhabited in the waters of Southeast Asia, South Africa and Australia, has attracted much attention because of the dramatic decline in population size in the past decades, which raises the concern of extinction. So far, this species is poorly characterized at molecular level due to little sequence information available in public databases. Recent advances in large-scale RNA sequencing provide an efficient approach to generate abundant sequences for functional genomic analyses in the species with un-sequenced genomes. We performed a de novo assembly of the Indo-Pacific humpback dolphin leucocyte transcriptome by Illumina sequencing. 108,751 high quality sequences from 47,840,388 paired-end reads were generated, and 48,868 and 46,587 unigenes were functionally annotated by BLAST search against the NCBI non-redundant and Swiss-Prot protein databases (E-value<10(-5)), respectively. In total, 16,467 unigenes were clustered into 25 functional categories by searching against the COG database, and BLAST2GO search assigned 37,976 unigenes to 61 GO terms. In addition, 36,345 unigenes were grouped into 258 KEGG pathways. We also identified 9,906 simple sequence repeats and 3,681 putative single nucleotide polymorphisms as potential molecular markers in our assembled sequences. A large number of unigenes were predicted to be involved in immune response, and many genes were predicted to be relevant to adaptive evolution and cetacean-specific traits. This study represented the first transcriptome analysis of the Indo-Pacific humpback dolphin, an endangered species. The de novo transcriptome analysis of the unique transcripts will provide valuable sequence information for discovery of new genes, characterization of gene expression, investigation of various pathways and adaptive evolution, as well as identification of genetic markers.
Finding functional features in Saccharomyces genomes by phylogenetic footprinting.
Cliften, Paul; Sudarsanam, Priya; Desikan, Ashwin; Fulton, Lucinda; Fulton, Bob; Majors, John; Waterston, Robert; Cohen, Barak A; Johnston, Mark
2003-07-04
The sifting and winnowing of DNA sequence that occur during evolution cause nonfunctional sequences to diverge, leaving phylogenetic footprints of functional sequence elements in comparisons of genome sequences. We searched for such footprints among the genome sequences of six Saccharomyces species and identified potentially functional sequences. Comparison of these sequences allowed us to revise the catalog of yeast genes and identify sequence motifs that may be targets of transcriptional regulatory proteins. Some of these conserved sequence motifs reside upstream of genes with similar functional annotations or similar expression patterns or those bound by the same transcription factor and are thus good candidates for functional regulatory sequences.
The LANL hemorrhagic fever virus database, a new platform for analyzing biothreat viruses
Kuiken, Carla; Thurmond, Jim; Dimitrijevic, Mira; Yoon, Hyejin
2012-01-01
Hemorrhagic fever viruses (HFVs) are a diverse set of over 80 viral species, found in 10 different genera comprising five different families: arena-, bunya-, flavi-, filo- and togaviridae. All these viruses are highly variable and evolve rapidly, making them elusive targets for the immune system and for vaccine and drug design. About 55 000 HFV sequences exist in the public domain today. A central website that provides annotated sequences and analysis tools will be helpful to HFV researchers worldwide. The HFV sequence database collects and stores sequence data and provides a user-friendly search interface and a large number of sequence analysis tools, following the model of the highly regarded and widely used Los Alamos HIV database [Kuiken, C., B. Korber, and R.W. Shafer, HIV sequence databases. AIDS Rev, 2003. 5: p. 52–61]. The database uses an algorithm that aligns each sequence to a species-wide reference sequence. The NCBI RefSeq database [Sayers et al. (2011) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res., 39, D38–D51.] is used for this; if a reference sequence is not available, a Blast search finds the best candidate. Using this method, sequences in each genus can be retrieved pre-aligned. The HFV website can be accessed via http://hfv.lanl.gov. PMID:22064861
Comparison of the Heme Iron Utilization Systems of Pathogenic Vibrios
O’Malley, S. M.; Mouton, S. L.; Occhino, D. A.; Deanda, M. T.; Rashidi, J. R.; Fuson, K. L.; Rashidi, C. E.; Mora, M. Y.; Payne, S. M.; Henderson, D. P.
1999-01-01
Vibrio alginolyticus, Vibrio fluvialis, and Vibrio parahaemolyticus utilized heme and hemoglobin as iron sources and contained chromosomal DNA similar to several Vibrio cholerae heme iron utilization genes. A V. parahaemolyticus gene that performed the function of V. cholerae hutA was isolated. A portion of the tonB1 locus of V. parahaemolyticus was sequenced and found to encode proteins similar in amino acid sequence to V. cholerae HutW, TonB1, and ExbB1. A recombinant plasmid containing the V. cholerae tonB1 and exbB1D1 genes complemented a V. alginolyticus heme utilization mutant. These data suggest that the heme iron utilization systems of the pathogenic vibrios tested, particularly V. parahaemolyticus and V. alginolyticus, are similar at the DNA level, the functional level, and, in the case of V. parahaemolyticus, the amino acid sequence or protein level to that of V. cholerae. PMID:10348876
Hinton, Elizabeth G; Oelschlegel, Sandra; Vaughn, Cynthia J; Lindsay, J Michael; Hurst, Sachiko M; Earl, Martha
2013-01-01
This study utilizes an informatics tool to analyze a robust literature search service in an academic medical center library. Structured interviews with librarians were conducted focusing on the benefits of such a tool, expectations for performance, and visual layout preferences. The resulting application utilizes Microsoft SQL Server and .Net Framework 3.5 technologies, allowing for the use of a web interface. Customer tables and MeSH terms are included. The National Library of Medicine MeSH database and entry terms for each heading are incorporated, resulting in functionality similar to searching the MeSH database through PubMed. Data reports will facilitate analysis of the search service.
Pattern similarity study of functional sites in protein sequences: lysozymes and cystatins
Nakai, Shuryo; Li-Chan, Eunice CY; Dou, Jinglie
2005-01-01
Background Although it is generally agreed that topography is more conserved than sequences, proteins sharing the same fold can have different functions, while there are protein families with low sequence similarity. An alternative method for profile analysis of characteristic conserved positions of the motifs within the 3D structures may be needed for functional annotation of protein sequences. Using the approach of quantitative structure-activity relationships (QSAR), we have proposed a new algorithm for postulating functional mechanisms on the basis of pattern similarity and average of property values of side-chains in segments within sequences. This approach was used to search for functional sites of proteins belonging to the lysozyme and cystatin families. Results Hydrophobicity and β-turn propensity of reference segments with 3–7 residues were used for the homology similarity search (HSS) for active sites. Hydrogen bonding was used as the side-chain property for searching the binding sites of lysozymes. The profiles of similarity constants and average values of these parameters as functions of their positions in the sequences could identify both active and substrate binding sites of the lysozyme of Streptomyces coelicolor, which has been reported as a new fold enzyme (Cellosyl). The same approach was successfully applied to cystatins, especially for postulating the mechanisms of amyloidosis of human cystatin C as well as human lysozyme. Conclusion Pattern similarity and average index values of structure-related properties of side chains in short segments of three residues or longer were, for the first time, successfully applied for predicting functional sites in sequences. This new approach may be applicable to studying functional sites in un-annotated proteins, for which complete 3D structures are not yet available. PMID:15904486
User Guidelines for the Brassica Database: BRAD.
Wang, Xiaobo; Cheng, Feng; Wang, Xiaowu
2016-01-01
The genome sequence of Brassica rapa was first released in 2011. Since then, further Brassica genomes have been sequenced or are undergoing sequencing. It is therefore necessary to develop tools that help users to mine information from genomic data efficiently. This will greatly aid scientific exploration and breeding application, especially for those with low levels of bioinformatic training. Therefore, the Brassica database (BRAD) was built to collect, integrate, illustrate, and visualize Brassica genomic datasets. BRAD provides useful searching and data mining tools, and facilitates the search of gene annotation datasets, syntenic or non-syntenic orthologs, and flanking regions of functional genomic elements. It also includes genome-analysis tools such as BLAST and GBrowse. One of the important aims of BRAD is to build a bridge between Brassica crop genomes with the genome of the model species Arabidopsis thaliana, thus transferring the bulk of A. thaliana gene study information for use with newly sequenced Brassica crops.
Update on Genomic Databases and Resources at the National Center for Biotechnology Information.
Tatusova, Tatiana
2016-01-01
The National Center for Biotechnology Information (NCBI), as a primary public repository of genomic sequence data, collects and maintains enormous amounts of heterogeneous data. Data for genomes, genes, gene expressions, gene variation, gene families, proteins, and protein domains are integrated with the analytical, search, and retrieval resources through the NCBI website, text-based search and retrieval system, provides a fast and easy way to navigate across diverse biological databases.Comparative genome analysis tools lead to further understanding of evolution processes quickening the pace of discovery. Recent technological innovations have ignited an explosion in genome sequencing that has fundamentally changed our understanding of the biology of living organisms. This huge increase in DNA sequence data presents new challenges for the information management system and the visualization tools. New strategies have been designed to bring an order to this genome sequence shockwave and improve the usability of associated data.
Motion Estimation Using the Firefly Algorithm in Ultrasonic Image Sequence of Soft Tissue
Chao, Chih-Feng; Horng, Ming-Huwi; Chen, Yu-Chan
2015-01-01
Ultrasonic image sequence of the soft tissue is widely used in disease diagnosis; however, the speckle noises usually influenced the image quality. These images usually have a low signal-to-noise ratio presentation. The phenomenon gives rise to traditional motion estimation algorithms that are not suitable to measure the motion vectors. In this paper, a new motion estimation algorithm is developed for assessing the velocity field of soft tissue in a sequence of ultrasonic B-mode images. The proposed iterative firefly algorithm (IFA) searches for few candidate points to obtain the optimal motion vector, and then compares it to the traditional iterative full search algorithm (IFSA) via a series of experiments of in vivo ultrasonic image sequences. The experimental results show that the IFA can assess the vector with better efficiency and almost equal estimation quality compared to the traditional IFSA method. PMID:25873987
Motion estimation using the firefly algorithm in ultrasonic image sequence of soft tissue.
Chao, Chih-Feng; Horng, Ming-Huwi; Chen, Yu-Chan
2015-01-01
Ultrasonic image sequence of the soft tissue is widely used in disease diagnosis; however, the speckle noises usually influenced the image quality. These images usually have a low signal-to-noise ratio presentation. The phenomenon gives rise to traditional motion estimation algorithms that are not suitable to measure the motion vectors. In this paper, a new motion estimation algorithm is developed for assessing the velocity field of soft tissue in a sequence of ultrasonic B-mode images. The proposed iterative firefly algorithm (IFA) searches for few candidate points to obtain the optimal motion vector, and then compares it to the traditional iterative full search algorithm (IFSA) via a series of experiments of in vivo ultrasonic image sequences. The experimental results show that the IFA can assess the vector with better efficiency and almost equal estimation quality compared to the traditional IFSA method.
NABIC: A New Access Portal to Search, Visualize, and Share Agricultural Genomics Data
Seol, Young-Joo; Lee, Tae-Ho; Park, Dong-Suk; Kim, Chang-Kug
2016-01-01
The National Agricultural Biotechnology Information Center developed an access portal to search, visualize, and share agricultural genomics data with a focus on South Korean information and resources. The portal features an agricultural biotechnology database containing a wide range of omics data from public and proprietary sources. We collected 28.4 TB of data from 162 agricultural organisms, with 10 types of omics data comprising next-generation sequencing sequence read archive, genome, gene, nucleotide, DNA chip, expressed sequence tag, interactome, protein structure, molecular marker, and single-nucleotide polymorphism datasets. Our genomic resources contain information on five animals, seven plants, and one fungus, which is accessed through a genome browser. We also developed a data submission and analysis system as a web service, with easy-to-use functions and cutting-edge algorithms, including those for handling next-generation sequencing data. PMID:26848255
SalmonDB: a bioinformatics resource for Salmo salar and Oncorhynchus mykiss
Di Génova, Alex; Aravena, Andrés; Zapata, Luis; González, Mauricio; Maass, Alejandro; Iturra, Patricia
2011-01-01
SalmonDB is a new multiorganism database containing EST sequences from Salmo salar, Oncorhynchus mykiss and the whole genome sequence of Danio rerio, Gasterosteus aculeatus, Tetraodon nigroviridis, Oryzias latipes and Takifugu rubripes, built with core components from GMOD project, GOPArc system and the BioMart project. The information provided by this resource includes Gene Ontology terms, metabolic pathways, SNP prediction, CDS prediction, orthologs prediction, several precalculated BLAST searches and domains. It also provides a BLAST server for matching user-provided sequences to any of the databases and an advanced query tool (BioMart) that allows easy browsing of EST databases with user-defined criteria. These tools make SalmonDB database a valuable resource for researchers searching for transcripts and genomic information regarding S. salar and other salmonid species. The database is expected to grow in the near feature, particularly with the S. salar genome sequencing project. Database URL: http://genomicasalmones.dim.uchile.cl/ PMID:22120661
SalmonDB: a bioinformatics resource for Salmo salar and Oncorhynchus mykiss.
Di Génova, Alex; Aravena, Andrés; Zapata, Luis; González, Mauricio; Maass, Alejandro; Iturra, Patricia
2011-01-01
SalmonDB is a new multiorganism database containing EST sequences from Salmo salar, Oncorhynchus mykiss and the whole genome sequence of Danio rerio, Gasterosteus aculeatus, Tetraodon nigroviridis, Oryzias latipes and Takifugu rubripes, built with core components from GMOD project, GOPArc system and the BioMart project. The information provided by this resource includes Gene Ontology terms, metabolic pathways, SNP prediction, CDS prediction, orthologs prediction, several precalculated BLAST searches and domains. It also provides a BLAST server for matching user-provided sequences to any of the databases and an advanced query tool (BioMart) that allows easy browsing of EST databases with user-defined criteria. These tools make SalmonDB database a valuable resource for researchers searching for transcripts and genomic information regarding S. salar and other salmonid species. The database is expected to grow in the near feature, particularly with the S. salar genome sequencing project. Database URL: http://genomicasalmones.dim.uchile.cl/
Breast cancer patients' search for meaning.
Pintado, Sheila
2018-04-25
The aim of this study was to analyze the search for meaning in women with breast cancer and its relationship with the emotional well-being. One hundred thirty-one breast cancer survivors were assessed using a mixed-method. Results showed that the meaning of suffer cancer can be explained by nine categories; and the utility of the suffering experienced was divided in seven categories. Moreover, the results showed a significant correlation between the meaning and the utility of suffering cancer, and the emotional well-being. The search for meaning in breast cancer women affects the emotional well-being, so it is necessary to attend it.
Searching for resistance genes to Bursaphelenchus xylophilus using high throughput screening.
Santos, Carla S; Pinheiro, Miguel; Silva, Ana I; Egas, Conceição; Vasconcelos, Marta W
2012-11-07
Pine wilt disease (PWD), caused by the pinewood nematode (PWN; Bursaphelenchus xylophilus), damages and kills pine trees and is causing serious economic damage worldwide. Although the ecological mechanism of infestation is well described, the plant's molecular response to the pathogen is not well known. This is due mainly to the lack of genomic information and the complexity of the disease. High throughput sequencing is now an efficient approach for detecting the expression of genes in non-model organisms, thus providing valuable information in spite of the lack of the genome sequence. In an attempt to unravel genes potentially involved in the pine defense against the pathogen, we hereby report the high throughput comparative sequence analysis of infested and non-infested stems of Pinus pinaster (very susceptible to PWN) and Pinus pinea (less susceptible to PWN). Four cDNA libraries from infested and non-infested stems of P. pinaster and P. pinea were sequenced in a full 454 GS FLX run, producing a total of 2,083,698 reads. The putative amino acid sequences encoded by the assembled transcripts were annotated according to Gene Ontology, to assign Pinus contigs into Biological Processes, Cellular Components and Molecular Functions categories. Most of the annotated transcripts corresponded to Picea genes-25.4-39.7%, whereas a smaller percentage, matched Pinus genes, 1.8-12.8%, probably a consequence of more public genomic information available for Picea than for Pinus. The comparative transcriptome analysis showed that when P. pinaster was infested with PWN, the genes malate dehydrogenase, ABA, water deficit stress related genes and PAR1 were highly expressed, while in PWN-infested P. pinea, the highly expressed genes were ricin B-related lectin, and genes belonging to the SNARE and high mobility group families. Quantitative PCR experiments confirmed the differential gene expression between the two pine species. Defense-related genes triggered by nematode infestation were detected in both P. pinaster and P. pinea transcriptomes utilizing 454 pyrosequencing technology. P. pinaster showed higher abundance of genes related to transcriptional regulation, terpenoid secondary metabolism (including some with nematicidal activity) and pathogen attack. P. pinea showed higher abundance of genes related to oxidative stress and higher levels of expression in general of stress responsive genes. This study provides essential information about the molecular defense mechanisms utilized by P. pinaster and P. pinea against PWN infestation and contributes to a better understanding of PWD.
PatGen--a consolidated resource for searching genetic patent sequences.
Rouse, Richard J D; Castagnetto, Jesus; Niedner, Roland H
2005-04-15
Compared to the wealth of online resources covering genomic, proteomic and derived data the Bioinformatics community is rather underserved when it comes to patent information related to biological sequences. The current online resources are either incomplete or rather expensive. This paper describes, PatGen, an integrated database containing data from bioinformatic and patent resources. This effort addresses the inconsistency of publicly available genetic patent data coverage by providing access to a consolidated dataset. PatGen can be searched at http://www.patgendb.com rjdrouse@patentinformatics.com.
Automated search method for AFM and profilers
NASA Astrophysics Data System (ADS)
Ray, Michael; Martin, Yves C.
2001-08-01
A new automation software creates a search model as an initial setup and searches for a user-defined target in atomic force microscopes or stylus profilometers used in semiconductor manufacturing. The need for such automation has become critical in manufacturing lines. The new method starts with a survey map of a small area of a chip obtained from a chip-design database or an image of the area. The user interface requires a user to point to and define a precise location to be measured, and to select a macro function for an application such as line width or contact hole. The search algorithm automatically constructs a range of possible scan sequences within the survey, and provides increased speed and functionality compared to the methods used in instruments to date. Each sequence consists in a starting point relative to the target, a scan direction, and a scan length. The search algorithm stops when the location of a target is found and criteria for certainty in positioning is met. With today's capability in high speed processing and signal control, the tool can simultaneously scan and search for a target in a robotic and continuous manner. Examples are given that illustrate the key concepts.
Database resources of the National Center for Biotechnology Information
Sayers, Eric W.; Barrett, Tanya; Benson, Dennis A.; Bolton, Evan; Bryant, Stephen H.; Canese, Kathi; Chetvernin, Vyacheslav; Church, Deanna M.; DiCuccio, Michael; Federhen, Scott; Feolo, Michael; Fingerman, Ian M.; Geer, Lewis Y.; Helmberg, Wolfgang; Kapustin, Yuri; Krasnov, Sergey; Landsman, David; Lipman, David J.; Lu, Zhiyong; Madden, Thomas L.; Madej, Tom; Maglott, Donna R.; Marchler-Bauer, Aron; Miller, Vadim; Karsch-Mizrachi, Ilene; Ostell, James; Panchenko, Anna; Phan, Lon; Pruitt, Kim D.; Schuler, Gregory D.; Sequeira, Edwin; Sherry, Stephen T.; Shumway, Martin; Sirotkin, Karl; Slotta, Douglas; Souvorov, Alexandre; Starchenko, Grigory; Tatusova, Tatiana A.; Wagner, Lukas; Wang, Yanli; Wilbur, W. John; Yaschenko, Eugene; Ye, Jian
2012-01-01
In addition to maintaining the GenBank® nucleic acid sequence database, the National Center for Biotechnology Information (NCBI) provides analysis and retrieval resources for the data in GenBank and other biological data made available through the NCBI Website. NCBI resources include Entrez, the Entrez Programming Utilities, MyNCBI, PubMed, PubMed Central (PMC), Gene, the NCBI Taxonomy Browser, BLAST, BLAST Link (BLink), Primer-BLAST, COBALT, Splign, RefSeq, UniGene, HomoloGene, ProtEST, dbMHC, dbSNP, dbVar, Epigenomics, Genome and related tools, the Map Viewer, Model Maker, Evidence Viewer, Trace Archive, Sequence Read Archive, BioProject, BioSample, Retroviral Genotyping Tools, HIV-1/Human Protein Interaction Database, Gene Expression Omnibus (GEO), Probe, Online Mendelian Inheritance in Animals (OMIA), the Molecular Modeling Database (MMDB), the Conserved Domain Database (CDD), the Conserved Domain Architecture Retrieval Tool (CDART), Biosystems, Protein Clusters and the PubChem suite of small molecule databases. Augmenting many of the Web applications are custom implementations of the BLAST program optimized to search specialized data sets. All of these resources can be accessed through the NCBI home page at www.ncbi.nlm.nih.gov. PMID:22140104
Database resources of the National Center for Biotechnology Information
2013-01-01
In addition to maintaining the GenBank® nucleic acid sequence database, the National Center for Biotechnology Information (NCBI, http://www.ncbi.nlm.nih.gov) provides analysis and retrieval resources for the data in GenBank and other biological data made available through the NCBI web site. NCBI resources include Entrez, the Entrez Programming Utilities, MyNCBI, PubMed, PubMed Central, Gene, the NCBI Taxonomy Browser, BLAST, BLAST Link (BLink), Primer-BLAST, COBALT, Splign, RefSeq, UniGene, HomoloGene, ProtEST, dbMHC, dbSNP, dbVar, Epigenomics, the Genetic Testing Registry, Genome and related tools, the Map Viewer, Model Maker, Evidence Viewer, Trace Archive, Sequence Read Archive, BioProject, BioSample, Retroviral Genotyping Tools, HIV-1/Human Protein Interaction Database, Gene Expression Omnibus, Probe, Online Mendelian Inheritance in Animals, the Molecular Modeling Database, the Conserved Domain Database, the Conserved Domain Architecture Retrieval Tool, Biosystems, Protein Clusters and the PubChem suite of small molecule databases. Augmenting many of the web applications are custom implementations of the BLAST program optimized to search specialized data sets. All of these resources can be accessed through the NCBI home page. PMID:23193264
GMDD: a database of GMO detection methods.
Dong, Wei; Yang, Litao; Shen, Kailin; Kim, Banghyun; Kleter, Gijs A; Marvin, Hans J P; Guo, Rong; Liang, Wanqi; Zhang, Dabing
2008-06-04
Since more than one hundred events of genetically modified organisms (GMOs) have been developed and approved for commercialization in global area, the GMO analysis methods are essential for the enforcement of GMO labelling regulations. Protein and nucleic acid-based detection techniques have been developed and utilized for GMOs identification and quantification. However, the information for harmonization and standardization of GMO analysis methods at global level is needed. GMO Detection method Database (GMDD) has collected almost all the previous developed and reported GMOs detection methods, which have been grouped by different strategies (screen-, gene-, construct-, and event-specific), and also provide a user-friendly search service of the detection methods by GMO event name, exogenous gene, or protein information, etc. In this database, users can obtain the sequences of exogenous integration, which will facilitate PCR primers and probes design. Also the information on endogenous genes, certified reference materials, reference molecules, and the validation status of developed methods is included in this database. Furthermore, registered users can also submit new detection methods and sequences to this database, and the newly submitted information will be released soon after being checked. GMDD contains comprehensive information of GMO detection methods. The database will make the GMOs analysis much easier.
Barta, Endre; Sebestyén, Endre; Pálfy, Tamás B.; Tóth, Gábor; Ortutay, Csaba P.; Patthy, László
2005-01-01
DoOP (http://doop.abc.hu/) is a database of eukaryotic promoter sequences (upstream regions) aiming to facilitate the recognition of regulatory sites conserved between species. The annotated first exons of human and Arabidopsis thaliana genes were used as queries in BLAST searches to collect the most closely related orthologous first exon sequences from Chordata and Viridiplantae species. Up to 3000 bp DNA segments upstream from these first exons constitute the clusters in the chordate and plant sections of the Database of Orthologous Promoters. Release 1.0 of DoOP contains 21 061 chordate clusters from 284 different species and 7548 plant clusters from 269 different species. The database can be used to find and retrieve promoter sequences of a given gene from various species and it is also suitable to see the most trivial conserved sequence blocks in the orthologous upstream regions. Users can search DoOP with either sequence or text (annotation) to find promoter clusters of various genes. In addition to the sequence data, the positions of the conserved sequence blocks derived from multiple alignments, the positions of repetitive elements and the positions of transcription start sites known from the Eukaryotic Promoter Database (EPD) can be viewed graphically. PMID:15608291
Barta, Endre; Sebestyén, Endre; Pálfy, Tamás B; Tóth, Gábor; Ortutay, Csaba P; Patthy, László
2005-01-01
DoOP (http://doop.abc.hu/) is a database of eukaryotic promoter sequences (upstream regions) aiming to facilitate the recognition of regulatory sites conserved between species. The annotated first exons of human and Arabidopsis thaliana genes were used as queries in BLAST searches to collect the most closely related orthologous first exon sequences from Chordata and Viridiplantae species. Up to 3000 bp DNA segments upstream from these first exons constitute the clusters in the chordate and plant sections of the Database of Orthologous Promoters. Release 1.0 of DoOP contains 21,061 chordate clusters from 284 different species and 7548 plant clusters from 269 different species. The database can be used to find and retrieve promoter sequences of a given gene from various species and it is also suitable to see the most trivial conserved sequence blocks in the orthologous upstream regions. Users can search DoOP with either sequence or text (annotation) to find promoter clusters of various genes. In addition to the sequence data, the positions of the conserved sequence blocks derived from multiple alignments, the positions of repetitive elements and the positions of transcription start sites known from the Eukaryotic Promoter Database (EPD) can be viewed graphically.
From biomedicine to natural history research: EST resources for ambystomatid salamanders
Putta, Srikrishna; Smith, Jeramiah J; Walker, John A; Rondet, Mathieu; Weisrock, David W; Monaghan, James; Samuels, Amy K; Kump, Kevin; King, David C; Maness, Nicholas J; Habermann, Bianca; Tanaka, Elly; Bryant, Susan V; Gardiner, David M; Parichy, David M; Voss, S Randal
2004-01-01
Background Establishing genomic resources for closely related species will provide comparative insights that are crucial for understanding diversity and variability at multiple levels of biological organization. We developed ESTs for Mexican axolotl (Ambystoma mexicanum) and Eastern tiger salamander (A. tigrinum tigrinum), species with deep and diverse research histories. Results Approximately 40,000 quality cDNA sequences were isolated for these species from various tissues, including regenerating limb and tail. These sequences and an existing set of 16,030 cDNA sequences for A. mexicanum were processed to yield 35,413 and 20,599 high quality ESTs for A. mexicanum and A. t. tigrinum, respectively. Because the A. t. tigrinum ESTs were obtained primarily from a normalized library, an approximately equal number of contigs were obtained for each species, with 21,091 unique contigs identified overall. The 10,592 contigs that showed significant similarity to sequences from the human RefSeq database reflected a diverse array of molecular functions and biological processes, with many corresponding to genes expressed during spinal cord injury in rat and fin regeneration in zebrafish. To demonstrate the utility of these EST resources, we searched databases to identify probes for regeneration research, characterized intra- and interspecific nucleotide polymorphism, saturated a human – Ambystoma synteny group with marker loci, and extended PCR primer sets designed for A. mexicanum / A. t. tigrinum orthologues to a related tiger salamander species. Conclusions Our study highlights the value of developing resources in traditional model systems where the likelihood of information transfer to multiple, closely related taxa is high, thus simultaneously enabling both laboratory and natural history research. PMID:15310388
Stavropoulos, Dimitri J; Merico, Daniele; Jobling, Rebekah; Bowdin, Sarah; Monfared, Nasim; Thiruvahindrapuram, Bhooma; Nalpathamkalam, Thomas; Pellecchia, Giovanna; Yuen, Ryan K C; Szego, Michael J; Hayeems, Robin Z; Shaul, Randi Zlotnik; Brudno, Michael; Girdea, Marta; Frey, Brendan; Alipanahi, Babak; Ahmed, Sohnee; Babul-Hirji, Riyana; Porras, Ramses Badilla; Carter, Melissa T; Chad, Lauren; Chaudhry, Ayeshah; Chitayat, David; Doust, Soghra Jougheh; Cytrynbaum, Cheryl; Dupuis, Lucie; Ejaz, Resham; Fishman, Leona; Guerin, Andrea; Hashemi, Bita; Helal, Mayada; Hewson, Stacy; Inbar-Feigenberg, Michal; Kannu, Peter; Karp, Natalya; Kim, Raymond H; Kronick, Jonathan; Liston, Eriskay; MacDonald, Heather; Mercimek-Mahmutoglu, Saadet; Mendoza-Londono, Roberto; Nasr, Enas; Nimmo, Graeme; Parkinson, Nicole; Quercia, Nada; Raiman, Julian; Roifman, Maian; Schulze, Andreas; Shugar, Andrea; Shuman, Cheryl; Sinajon, Pierre; Siriwardena, Komudi; Weksberg, Rosanna; Yoon, Grace; Carew, Chris; Erickson, Raith; Leach, Richard A; Klein, Robert; Ray, Peter N; Meyn, M Stephen; Scherer, Stephen W; Cohn, Ronald D; Marshall, Christian R
2016-01-01
The standard of care for first-tier clinical investigation of the aetiology of congenital malformations and neurodevelopmental disorders is chromosome microarray analysis (CMA) for copy-number variations (CNVs), often followed by gene(s)-specific sequencing searching for smaller insertion–deletions (indels) and single-nucleotide variant (SNV) mutations. Whole-genome sequencing (WGS) has the potential to capture all classes of genetic variation in one experiment; however, the diagnostic yield for mutation detection of WGS compared to CMA, and other tests, needs to be established. In a prospective study we utilised WGS and comprehensive medical annotation to assess 100 patients referred to a paediatric genetics service and compared the diagnostic yield versus standard genetic testing. WGS identified genetic variants meeting clinical diagnostic criteria in 34% of cases, representing a fourfold increase in diagnostic rate over CMA (8%; P value=1.42E−05) alone and more than twofold increase in CMA plus targeted gene sequencing (13%; P value=0.0009). WGS identified all rare clinically significant CNVs that were detected by CMA. In 26 patients, WGS revealed indel and missense mutations presenting in a dominant (63%) or a recessive (37%) manner. We found four subjects with mutations in at least two genes associated with distinct genetic disorders, including two cases harbouring a pathogenic CNV and SNV. When considering medically actionable secondary findings in addition to primary WGS findings, 38% of patients would benefit from genetic counselling. Clinical implementation of WGS as a primary test will provide a higher diagnostic yield than conventional genetic testing and potentially reduce the time required to reach a genetic diagnosis. PMID:28567303
Rath, Matthias; Jenssen, Sönke E; Schwefel, Konrad; Spiegler, Stefanie; Kleimeier, Dana; Sperling, Christian; Kaderali, Lars; Felbor, Ute
2017-09-01
Cerebral cavernous malformations (CCM) are vascular lesions of the central nervous system that can cause headaches, seizures and hemorrhagic stroke. Disease-associated mutations have been identified in three genes: CCM1/KRIT1, CCM2 and CCM3/PDCD10. The precise proportion of deep-intronic variants in these genes and their clinical relevance is yet unknown. Here, a long-range PCR (LR-PCR) approach for target enrichment of the entire genomic regions of the three genes was combined with next generation sequencing (NGS) to screen for coding and non-coding variants. NGS detected all six CCM1/KRIT1, two CCM2 and four CCM3/PDCD10 mutations that had previously been identified by Sanger sequencing. Two of the pathogenic variants presented here are novel. Additionally, 20 stringently selected CCM index cases that had remained mutation-negative after conventional sequencing and exclusion of copy number variations were screened for deep-intronic mutations. The combination of bioinformatics filtering and transcript analyses did not reveal any deep-intronic splice mutations in these cases. Our results demonstrate that target enrichment by LR-PCR combined with NGS can be used for a comprehensive analysis of the entire genomic regions of the CCM genes in a research context. However, its clinical utility is limited as deep-intronic splice mutations in CCM1/KRIT1, CCM2 and CCM3/PDCD10 seem to be rather rare. Copyright © 2017 Elsevier Masson SAS. All rights reserved.
Madsen, James A.; Ko, Byoung Joon; Xu, Hua; Iwashkiw, Jeremy A.; Robotham, Scott A.; Shaw, Jared B.; Feldman, Mario F.; Brodbelt, Jennifer S.
2013-01-01
O -glycopeptides are often acidic owing to the frequent occurrence of acidic saccharides in the glycan, rendering traditional proteomic workflows that rely on positive mode tandem mass spectrometry (MS/MS) less effective. In this report, we demonstrate the utility of negative mode ultraviolet photodissociation (UVPD) MS for the characterization of acidic O-linked glycopeptide anions. This method was evaluated for a series of singly- and multiply-deprotonated glycopeptides from the model glycoprotein kappa casein, resulting in production of both peptide and glycan product ions that afforded 100% sequence coverage of the peptide and glycan moieties from a single MS/MS event. The most abundant and frequent peptide sequence ions were a/x-type products, which, importantly, were found to retain the labile glycan modifications. The glycan-specific ions mainly arose from glycosidic bond cleavages (B, Y, C, and Z ions) in addition to some less common cross-ring cleavages. Based on the UVPD fragmentation patterns, an automated database searching strategy (based on the MassMatrix algorithm) was designed that is specific for the analysis of glycopeptide anions by UVPD. This algorithm was used to identify glycopeptides from mixtures of glycosylated and non-glycosylated peptides, sequence both glycan and peptide moieties simultaneously, and pinpoint the correct site(s) of glycosylation. This methodology was applied to uncover novel site-specificity of the O-linked glycosylated OmpA/MotB from the “superbug” A. baumannii to help aid in the elucidation of the functional role that protein glycosylation plays in pathogenesis. PMID:24006841
Workflow and web application for annotating NCBI BioProject transcriptome data.
Vera Alvarez, Roberto; Medeiros Vidal, Newton; Garzón-Martínez, Gina A; Barrero, Luz S; Landsman, David; Mariño-Ramírez, Leonardo
2017-01-01
The volume of transcriptome data is growing exponentially due to rapid improvement of experimental technologies. In response, large central resources such as those of the National Center for Biotechnology Information (NCBI) are continually adapting their computational infrastructure to accommodate this large influx of data. New and specialized databases, such as Transcriptome Shotgun Assembly Sequence Database (TSA) and Sequence Read Archive (SRA), have been created to aid the development and expansion of centralized repositories. Although the central resource databases are under continual development, they do not include automatic pipelines to increase annotation of newly deposited data. Therefore, third-party applications are required to achieve that aim. Here, we present an automatic workflow and web application for the annotation of transcriptome data. The workflow creates secondary data such as sequencing reads and BLAST alignments, which are available through the web application. They are based on freely available bioinformatics tools and scripts developed in-house. The interactive web application provides a search engine and several browser utilities. Graphical views of transcript alignments are available through SeqViewer, an embedded tool developed by NCBI for viewing biological sequence data. The web application is tightly integrated with other NCBI web applications and tools to extend the functionality of data processing and interconnectivity. We present a case study for the species Physalis peruviana with data generated from BioProject ID 67621. URL: http://www.ncbi.nlm.nih.gov/projects/physalis/. Published by Oxford University Press 2017. This work is written by US Government employees and is in the public domain in the US.
From web search to healthcare utilization: privacy-sensitive studies from mobile data.
White, Ryen; Horvitz, Eric
2013-01-01
We explore relationships between health information seeking activities and engagement with healthcare professionals via a privacy-sensitive analysis of geo-tagged data from mobile devices. We analyze logs of mobile interaction data stripped of individually identifiable information and location data. The data analyzed consist of time-stamped search queries and distances to medical care centers. We examine search activity that precedes the observation of salient evidence of healthcare utilization (EHU) (ie, data suggesting that the searcher is using healthcare resources), in our case taken as queries occurring at or near medical facilities. We show that the time between symptom searches and observation of salient evidence of seeking healthcare utilization depends on the acuity of symptoms. We construct statistical models that make predictions of forthcoming EHU based on observations about the current search session, prior medical search activities, and prior EHU. The predictive accuracy of the models varies (65%-90%) depending on the features used and the timeframe of the analysis, which we explore via a sensitivity analysis. We provide a privacy-sensitive analysis that can be used to generate insights about the pursuit of health information and healthcare. The findings demonstrate how large-scale studies of mobile devices can provide insights on how concerns about symptomatology lead to the pursuit of professional care. We present new methods for the analysis of mobile logs and describe a study that provides evidence about how people transition from mobile searches on symptoms and diseases to the pursuit of healthcare in the world.
SAM: String-based sequence search algorithm for mitochondrial DNA database queries
Röck, Alexander; Irwin, Jodi; Dür, Arne; Parsons, Thomas; Parson, Walther
2011-01-01
The analysis of the haploid mitochondrial (mt) genome has numerous applications in forensic and population genetics, as well as in disease studies. Although mtDNA haplotypes are usually determined by sequencing, they are rarely reported as a nucleotide string. Traditionally they are presented in a difference-coded position-based format relative to the corrected version of the first sequenced mtDNA. This convention requires recommendations for standardized sequence alignment that is known to vary between scientific disciplines, even between laboratories. As a consequence, database searches that are vital for the interpretation of mtDNA data can suffer from biased results when query and database haplotypes are annotated differently. In the forensic context that would usually lead to underestimation of the absolute and relative frequencies. To address this issue we introduce SAM, a string-based search algorithm that converts query and database sequences to position-free nucleotide strings and thus eliminates the possibility that identical sequences will be missed in a database query. The mere application of a BLAST algorithm would not be a sufficient remedy as it uses a heuristic approach and does not address properties specific to mtDNA, such as phylogenetically stable but also rapidly evolving insertion and deletion events. The software presented here provides additional flexibility to incorporate phylogenetic data, site-specific mutation rates, and other biologically relevant information that would refine the interpretation of mitochondrial DNA data. The manuscript is accompanied by freeware and example data sets that can be used to evaluate the new software (http://stringvalidation.org). PMID:21056022
ERIC Educational Resources Information Center
Gabay, Yafit; Schiff, Rachel; Vakil, Eli
2012-01-01
Motor sequence learning has been studied extensively in Developmental dyslexia (DD). The purpose of the present research was to examine procedural learning of letter names and motor sequences in individuals with DD and control groups. Both groups completed the Serial Search Task which enabled the assessment of learning of letter names and motor…
USDA-ARS?s Scientific Manuscript database
We have previously published extensive genomic surveys [1-3], reporting NAT-homologous sequences in hundreds of sequenced bacterial, fungal and vertebrate genomes. We present here the results of our latest search of 2445 genomes, representing 1532 (70 archaeal, 1210 bacterial, 43 protist, 97 fungal,...
Predicting hospital visits from geo-tagged Internet search logs.
Agarwal, Vibhu; Han, Lichy; Madan, Isaac; Saluja, Shaurya; Shidham, Aaditya; Shah, Nigam H
2016-01-01
The steady rise in healthcare costs has deprived over 45 million Americans of healthcare services (1, 2) and has encouraged healthcare providers to look for opportunities to improve their operational efficiency. Prior studies have shown that evidence of healthcare seeking intent in Internet searches correlates well with healthcare resource utilization. Given the ubiquitous nature of mobile Internet search, we hypothesized that analyzing geo-tagged mobile search logs could enable us to machine-learn predictors of future patient visits. Using a de-identified dataset of geo-tagged mobile Internet search logs, we mined text and location patterns that are predictors of healthcare resource utilization and built statistical models that predict the probability of a user's future visit to a medical facility. Our efforts will enable the development of innovative methods for modeling and optimizing the use of healthcare resources-a crucial prerequisite for securing healthcare access for everyone in the days to come.
A parallelized binary search tree
USDA-ARS?s Scientific Manuscript database
PTTRNFNDR is an unsupervised statistical learning algorithm that detects patterns in DNA sequences, protein sequences, or any natural language texts that can be decomposed into letters of a finite alphabet. PTTRNFNDR performs complex mathematical computations and its processing time increases when i...
Speiser, Daniel I; Pankey, M Sabrina; Zaharoff, Alexander K; Battelle, Barbara A; Bracken-Grissom, Heather D; Breinholt, Jesse W; Bybee, Seth M; Cronin, Thomas W; Garm, Anders; Lindgren, Annie R; Patel, Nipam H; Porter, Megan L; Protas, Meredith E; Rivera, Ajna S; Serb, Jeanne M; Zigler, Kirk S; Crandall, Keith A; Oakley, Todd H
2014-11-19
Tools for high throughput sequencing and de novo assembly make the analysis of transcriptomes (i.e. the suite of genes expressed in a tissue) feasible for almost any organism. Yet a challenge for biologists is that it can be difficult to assign identities to gene sequences, especially from non-model organisms. Phylogenetic analyses are one useful method for assigning identities to these sequences, but such methods tend to be time-consuming because of the need to re-calculate trees for every gene of interest and each time a new data set is analyzed. In response, we employed existing tools for phylogenetic analysis to produce a computationally efficient, tree-based approach for annotating transcriptomes or new genomes that we term Phylogenetically-Informed Annotation (PIA), which places uncharacterized genes into pre-calculated phylogenies of gene families. We generated maximum likelihood trees for 109 genes from a Light Interaction Toolkit (LIT), a collection of genes that underlie the function or development of light-interacting structures in metazoans. To do so, we searched protein sequences predicted from 29 fully-sequenced genomes and built trees using tools for phylogenetic analysis in the Osiris package of Galaxy (an open-source workflow management system). Next, to rapidly annotate transcriptomes from organisms that lack sequenced genomes, we repurposed a maximum likelihood-based Evolutionary Placement Algorithm (implemented in RAxML) to place sequences of potential LIT genes on to our pre-calculated gene trees. Finally, we implemented PIA in Galaxy and used it to search for LIT genes in 28 newly-sequenced transcriptomes from the light-interacting tissues of a range of cephalopod mollusks, arthropods, and cubozoan cnidarians. Our new trees for LIT genes are available on the Bitbucket public repository ( http://bitbucket.org/osiris_phylogenetics/pia/ ) and we demonstrate PIA on a publicly-accessible web server ( http://galaxy-dev.cnsi.ucsb.edu/pia/ ). Our new trees for LIT genes will be a valuable resource for researchers studying the evolution of eyes or other light-interacting structures. We also introduce PIA, a high throughput method for using phylogenetic relationships to identify LIT genes in transcriptomes from non-model organisms. With simple modifications, our methods may be used to search for different sets of genes or to annotate data sets from taxa outside of Metazoa.
Scheduling with genetic algorithms
NASA Technical Reports Server (NTRS)
Fennel, Theron R.; Underbrink, A. J., Jr.; Williams, George P. W., Jr.
1994-01-01
In many domains, scheduling a sequence of jobs is an important function contributing to the overall efficiency of the operation. At Boeing, we develop schedules for many different domains, including assembly of military and commercial aircraft, weapons systems, and space vehicles. Boeing is under contract to develop scheduling systems for the Space Station Payload Planning System (PPS) and Payload Operations and Integration Center (POIC). These applications require that we respect certain sequencing restrictions among the jobs to be scheduled while at the same time assigning resources to the jobs. We call this general problem scheduling and resource allocation. Genetic algorithms (GA's) offer a search method that uses a population of solutions and benefits from intrinsic parallelism to search the problem space rapidly, producing near-optimal solutions. Good intermediate solutions are probabalistically recombined to produce better offspring (based upon some application specific measure of solution fitness, e.g., minimum flowtime, or schedule completeness). Also, at any point in the search, any intermediate solution can be accepted as a final solution; allowing the search to proceed longer usually produces a better solution while terminating the search at virtually any time may yield an acceptable solution. Many processes are constrained by restrictions of sequence among the individual jobs. For a specific job, other jobs must be completed beforehand. While there are obviously many other constraints on processes, it is these on which we focussed for this research: how to allocate crews to jobs while satisfying job precedence requirements and personnel, and tooling and fixture (or, more generally, resource) requirements.
An integrated SNP mining and utilization (ISMU) pipeline for next generation sequencing data.
Azam, Sarwar; Rathore, Abhishek; Shah, Trushar M; Telluri, Mohan; Amindala, BhanuPrakash; Ruperao, Pradeep; Katta, Mohan A V S K; Varshney, Rajeev K
2014-01-01
Open source single nucleotide polymorphism (SNP) discovery pipelines for next generation sequencing data commonly requires working knowledge of command line interface, massive computational resources and expertise which is a daunting task for biologists. Further, the SNP information generated may not be readily used for downstream processes such as genotyping. Hence, a comprehensive pipeline has been developed by integrating several open source next generation sequencing (NGS) tools along with a graphical user interface called Integrated SNP Mining and Utilization (ISMU) for SNP discovery and their utilization by developing genotyping assays. The pipeline features functionalities such as pre-processing of raw data, integration of open source alignment tools (Bowtie2, BWA, Maq, NovoAlign and SOAP2), SNP prediction (SAMtools/SOAPsnp/CNS2snp and CbCC) methods and interfaces for developing genotyping assays. The pipeline outputs a list of high quality SNPs between all pairwise combinations of genotypes analyzed, in addition to the reference genome/sequence. Visualization tools (Tablet and Flapjack) integrated into the pipeline enable inspection of the alignment and errors, if any. The pipeline also provides a confidence score or polymorphism information content value with flanking sequences for identified SNPs in standard format required for developing marker genotyping (KASP and Golden Gate) assays. The pipeline enables users to process a range of NGS datasets such as whole genome re-sequencing, restriction site associated DNA sequencing and transcriptome sequencing data at a fast speed. The pipeline is very useful for plant genetics and breeding community with no computational expertise in order to discover SNPs and utilize in genomics, genetics and breeding studies. The pipeline has been parallelized to process huge datasets of next generation sequencing. It has been developed in Java language and is available at http://hpc.icrisat.cgiar.org/ISMU as a standalone free software.
A structured review of health utility measures and elicitation in advanced/metastatic breast cancer.
Hao, Yanni; Wolfram, Verena; Cook, Jennifer
2016-01-01
Health utilities are increasingly incorporated in health economic evaluations. Different elicitation methods, direct and indirect, have been established in the past. This study examined the evidence on health utility elicitation previously reported in advanced/metastatic breast cancer and aimed to link these results to requirements of reimbursement bodies. Searches were conducted using a detailed search strategy across several electronic databases (MEDLINE, EMBASE, Cochrane Library, and EconLit databases), online sources (Cost-effectiveness Analysis Registry and the Health Economics Research Center), and web sites of health technology assessment (HTA) bodies. Publications were selected based on the search strategy and the overall study objectives. A total of 768 publications were identified in the searches, and 26 publications, comprising 18 journal articles and eight submissions to HTA bodies, were included in the evidence review. Most journal articles derived utilities from the European Quality of Life Five-Dimensions questionnaire (EQ-5D). Other utility measures, such as the direct methods standard gamble (SG), time trade-off (TTO), and visual analog scale (VAS), were less frequently used. Several studies described mapping algorithms to generate utilities from disease-specific health-related quality of life (HRQOL) instruments such as European Organization for Research and Treatment of Cancer Quality of Life Questionnaire - Core 30 (EORTC QLQ-C30), European Organization for Research and Treatment of Cancer Quality of Life Questionnaire - Breast Cancer 23 (EORTC QLQ-BR23), Functional Assessment of Cancer Therapy - General questionnaire (FACT-G), and Utility-Based Questionnaire-Cancer (UBQ-C); most used EQ-5D as the reference. Sociodemographic factors that affect health utilities, such as age, sex, income, and education, as well as disease progression, choice of utility elicitation method, and country settings, were identified within the journal articles. Most submissions to HTA bodies obtained utility values from the literature rather than exploring the HRQOL data obtained during clinical development. This was critiqued by the National Institute for Health and Clinical Excellence (NICE). Furthermore, the impact of age on utilities was highlighted by NICE and it was suggested that an age match of the study population should be attempted. Health utilities are recorded across the globe to varying extents and using differing elicitation methods. Manufacturers seeking reimbursement need to be aware of the country-specific requirements for elicitation of health utilities.
A structured review of health utility measures and elicitation in advanced/metastatic breast cancer
Hao, Yanni; Wolfram, Verena; Cook, Jennifer
2016-01-01
Background Health utilities are increasingly incorporated in health economic evaluations. Different elicitation methods, direct and indirect, have been established in the past. This study examined the evidence on health utility elicitation previously reported in advanced/metastatic breast cancer and aimed to link these results to requirements of reimbursement bodies. Methods Searches were conducted using a detailed search strategy across several electronic databases (MEDLINE, EMBASE, Cochrane Library, and EconLit databases), online sources (Cost-effectiveness Analysis Registry and the Health Economics Research Center), and web sites of health technology assessment (HTA) bodies. Publications were selected based on the search strategy and the overall study objectives. Results A total of 768 publications were identified in the searches, and 26 publications, comprising 18 journal articles and eight submissions to HTA bodies, were included in the evidence review. Most journal articles derived utilities from the European Quality of Life Five-Dimensions questionnaire (EQ-5D). Other utility measures, such as the direct methods standard gamble (SG), time trade-off (TTO), and visual analog scale (VAS), were less frequently used. Several studies described mapping algorithms to generate utilities from disease-specific health-related quality of life (HRQOL) instruments such as European Organization for Research and Treatment of Cancer Quality of Life Questionnaire – Core 30 (EORTC QLQ-C30), European Organization for Research and Treatment of Cancer Quality of Life Questionnaire – Breast Cancer 23 (EORTC QLQ-BR23), Functional Assessment of Cancer Therapy – General questionnaire (FACT-G), and Utility-Based Questionnaire-Cancer (UBQ-C); most used EQ-5D as the reference. Sociodemographic factors that affect health utilities, such as age, sex, income, and education, as well as disease progression, choice of utility elicitation method, and country settings, were identified within the journal articles. Most submissions to HTA bodies obtained utility values from the literature rather than exploring the HRQOL data obtained during clinical development. This was critiqued by the National Institute for Health and Clinical Excellence (NICE). Furthermore, the impact of age on utilities was highlighted by NICE and it was suggested that an age match of the study population should be attempted. Conclusion Health utilities are recorded across the globe to varying extents and using differing elicitation methods. Manufacturers seeking reimbursement need to be aware of the country-specific requirements for elicitation of health utilities. PMID:27382319
Hawley, Robert G; Chen, Yuzhong; Riz, Irene; Zeng, Chen
2012-05-04
In this study, we utilized an integrated bioinformatics and computational biology approach in search of new BH3-only proteins belonging to the BCL2 family of apoptotic regulators. The BH3 (BCL2 homology 3) domain mediates specific binding interactions among various BCL2 family members. It is composed of an amphipathic α-helical region of approximately 13 residues that has only a few amino acids that are highly conserved across all members. Using a generalized motif, we performed a genome-wide search for novel BH3-containing proteins in the NCBI Consensus Coding Sequence (CCDS) database. In addition to known pro-apoptotic BH3-only proteins, 197 proteins were recovered that satisfied the search criteria. These were categorized according to α-helical content and predictive binding to BCL-xL (encoded by BCL2L1) and MCL-1, two representative anti-apoptotic BCL2 family members, using position-specific scoring matrix models. Notably, the list is enriched for proteins associated with autophagy as well as a broad spectrum of cellular stress responses such as endoplasmic reticulum stress, oxidative stress, antiviral defense, and the DNA damage response. Several potential novel BH3-containing proteins are highlighted. In particular, the analysis strongly suggests that the apoptosis inhibitor and DNA damage response regulator, AVEN, which was originally isolated as a BCL-xL-interacting protein, is a functional BH3-only protein representing a distinct subclass of BCL2 family members.
Algorithms for database-dependent search of MS/MS data.
Matthiesen, Rune
2013-01-01
The frequent used bottom-up strategy for identification of proteins and their associated modifications generate nowadays typically thousands of MS/MS spectra that normally are matched automatically against a protein sequence database. Search engines that take as input MS/MS spectra and a protein sequence database are referred as database-dependent search engines. Many programs both commercial and freely available exist for database-dependent search of MS/MS spectra and most of the programs have excellent user documentation. The aim here is therefore to outline the algorithm strategy behind different search engines rather than providing software user manuals. The process of database-dependent search can be divided into search strategy, peptide scoring, protein scoring, and finally protein inference. Most efforts in the literature have been put in to comparing results from different software rather than discussing the underlining algorithms. Such practical comparisons can be cluttered by suboptimal implementation and the observed differences are frequently caused by software parameters settings which have not been set proper to allow even comparison. In other words an algorithmic idea can still be worth considering even if the software implementation has been demonstrated to be suboptimal. The aim in this chapter is therefore to split the algorithms for database-dependent searching of MS/MS data into the above steps so that the different algorithmic ideas become more transparent and comparable. Most search engines provide good implementations of the first three data analysis steps mentioned above, whereas the final step of protein inference are much less developed for most search engines and is in many cases performed by an external software. The final part of this chapter illustrates how protein inference is built into the VEMS search engine and discusses a stand-alone program SIR for protein inference that can import a Mascot search result.
A motif detection and classification method for peptide sequences using genetic programming.
Tomita, Yasuyuki; Kato, Ryuji; Okochi, Mina; Honda, Hiroyuki
2008-08-01
An exploration of common rules (property motifs) in amino acid sequences has been required for the design of novel sequences and elucidation of the interactions between molecules controlled by the structural or physical environment. In the present study, we developed a new method to search property motifs that are common in peptide sequence data. Our method comprises the following two characteristics: (i) the automatic determination of the position and length of common property motifs by calculating the physicochemical similarity of amino acids, and (ii) the quick and effective exploration of motif candidates that discriminates the positives and negatives by the introduction of genetic programming (GP). Our method was evaluated by two types of model data sets. First, the intentionally buried property motifs were searched in the artificially derived peptide data containing intentionally buried property motifs. As a result, the expected property motifs were correctly extracted by our algorithm. Second, the peptide data that interact with MHC class II molecules were analyzed as one of the models of biologically active peptides with buried motifs in various lengths. Twofold MHC class II binding peptides were identified with the rule using our method, compared to the existing scoring matrix method. In conclusion, our GP based motif searching approach enabled to obtain knowledge of functional aspects of the peptides without any prior knowledge.
qPMS9: An Efficient Algorithm for Quorum Planted Motif Search
NASA Astrophysics Data System (ADS)
Nicolae, Marius; Rajasekaran, Sanguthevar
2015-01-01
Discovering patterns in biological sequences is a crucial problem. For example, the identification of patterns in DNA sequences has resulted in the determination of open reading frames, identification of gene promoter elements, intron/exon splicing sites, and SH RNAs, location of RNA degradation signals, identification of alternative splicing sites, etc. In protein sequences, patterns have led to domain identification, location of protease cleavage sites, identification of signal peptides, protein interactions, determination of protein degradation elements, identification of protein trafficking elements, discovery of short functional motifs, etc. In this paper we focus on the identification of an important class of patterns, namely, motifs. We study the (l, d) motif search problem or Planted Motif Search (PMS). PMS receives as input n strings and two integers l and d. It returns all sequences M of length l that occur in each input string, where each occurrence differs from M in at most d positions. Another formulation is quorum PMS (qPMS), where the motif appears in at least q% of the strings. We introduce qPMS9, a parallel exact qPMS algorithm that offers significant runtime improvements on DNA and protein datasets. qPMS9 solves the challenging DNA (l, d)-instances (28, 12) and (30, 13). The source code is available at https://code.google.com/p/qpms9/.
GABI-Kat SimpleSearch: new features of the Arabidopsis thaliana T-DNA mutant database.
Kleinboelting, Nils; Huep, Gunnar; Kloetgen, Andreas; Viehoever, Prisca; Weisshaar, Bernd
2012-01-01
T-DNA insertion mutants are very valuable for reverse genetics in Arabidopsis thaliana. Several projects have generated large sequence-indexed collections of T-DNA insertion lines, of which GABI-Kat is the second largest resource worldwide. User access to the collection and its Flanking Sequence Tags (FSTs) is provided by the front end SimpleSearch (http://www.GABI-Kat.de). Several significant improvements have been implemented recently. The database now relies on the TAIRv10 genome sequence and annotation dataset. All FSTs have been newly mapped using an optimized procedure that leads to improved accuracy of insertion site predictions. A fraction of the collection with weak FST yield was re-analysed by generating new FSTs. Along with newly found predictions for older sequences about 20,000 new FSTs were included in the database. Information about groups of FSTs pointing to the same insertion site that is found in several lines but is real only in a single line are included, and many problematic FST-to-line links have been corrected using new wet-lab data. SimpleSearch currently contains data from ~71,000 lines with predicted insertions covering 62.5% of the 27,206 nuclear protein coding genes, and offers insertion allele-specific data from 9545 confirmed lines that are available from the Nottingham Arabidopsis Stock Centre.
miRNEST database: an integrative approach in microRNA search and annotation
Szcześniak, Michał Wojciech; Deorowicz, Sebastian; Gapski, Jakub; Kaczyński, Łukasz; Makałowska, Izabela
2012-01-01
Despite accumulating data on animal and plant microRNAs and their functions, existing public miRNA resources usually collect miRNAs from a very limited number of species. A lot of microRNAs, including those from model organisms, remain undiscovered. As a result there is a continuous need to search for new microRNAs. We present miRNEST (http://mirnest.amu.edu.pl), a comprehensive database of animal, plant and virus microRNAs. The core part of the database is built from our miRNA predictions conducted on Expressed Sequence Tags of 225 animal and 202 plant species. The miRNA search was performed based on sequence similarity and as many as 10 004 miRNA candidates in 221 animal and 199 plant species were discovered. Out of them only 299 have already been deposited in miRBase. Additionally, miRNEST has been integrated with external miRNA data from literature and 13 databases, which includes miRNA sequences, small RNA sequencing data, expression, polymorphisms and targets data as well as links to external miRNA resources, whenever applicable. All this makes miRNEST a considerable miRNA resource in a sense of number of species (544) that integrates a scattered miRNA data into a uniform format with a user-friendly web interface. PMID:22135287
MMDB: Entrez’s 3D-structure database
Wang, Yanli; Anderson, John B.; Chen, Jie; Geer, Lewis Y.; He, Siqian; Hurwitz, David I.; Liebert, Cynthia A.; Madej, Thomas; Marchler, Gabriele H.; Marchler-Bauer, Aron; Panchenko, Anna R.; Shoemaker, Benjamin A.; Song, James S.; Thiessen, Paul A.; Yamashita, Roxanne A.; Bryant, Stephen H.
2002-01-01
Three-dimensional structures are now known within many protein families and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez’s 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end Entrez’s search engine provides three powerful features. (i) Sequence and structure neighbors; one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases; one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization; identifying a homolog with known structure, one may view molecular-graphic and alignment displays, to infer approximate 3D structure. In this article we focus on two features of Entrez’s Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure. PMID:11752307
Differentially Private Frequent Sequence Mining via Sampling-based Candidate Pruning
Xu, Shengzhi; Cheng, Xiang; Li, Zhengyi; Xiong, Li
2016-01-01
In this paper, we study the problem of mining frequent sequences under the rigorous differential privacy model. We explore the possibility of designing a differentially private frequent sequence mining (FSM) algorithm which can achieve both high data utility and a high degree of privacy. We found, in differentially private FSM, the amount of required noise is proportionate to the number of candidate sequences. If we could effectively reduce the number of unpromising candidate sequences, the utility and privacy tradeoff can be significantly improved. To this end, by leveraging a sampling-based candidate pruning technique, we propose a novel differentially private FSM algorithm, which is referred to as PFS2. The core of our algorithm is to utilize sample databases to further prune the candidate sequences generated based on the downward closure property. In particular, we use the noisy local support of candidate sequences in the sample databases to estimate which sequences are potentially frequent. To improve the accuracy of such private estimations, a sequence shrinking method is proposed to enforce the length constraint on the sample databases. Moreover, to decrease the probability of misestimating frequent sequences as infrequent, a threshold relaxation method is proposed to relax the user-specified threshold for the sample databases. Through formal privacy analysis, we show that our PFS2 algorithm is ε-differentially private. Extensive experiments on real datasets illustrate that our PFS2 algorithm can privately find frequent sequences with high accuracy. PMID:26973430
Tempest: Accelerated MS/MS database search software for heterogeneous computing platforms
Adamo, Mark E.; Gerber, Scott A.
2017-01-01
MS/MS database search algorithms derive a set of candidate peptide sequences from in-silico digest of a protein sequence database, and compute theoretical fragmentation patterns to match these candidates against observed MS/MS spectra. The original Tempest publication described these operations mapped to a CPU-GPU model, in which the CPU generates peptide candidates that are asynchronously sent to a discrete GPU to be scored against experimental spectra in parallel (Milloy et al., 2012). The current version of Tempest expands this model, incorporating OpenCL to offer seamless parallelization across multicore CPUs, GPUs, integrated graphics chips, and general-purpose coprocessors. Three protocols describe how to configure and run a Tempest search, including discussion of how to leverage Tempest's unique feature set to produce optimal results. PMID:27603022
The Google Online Marketing Challenge: Real Clients, Real Money, Real Ads and Authentic Learning
ERIC Educational Resources Information Center
Miko, John S.
2014-01-01
Search marketing is the process of utilizing search engines to drive traffic to a Web site through both paid and unpaid efforts. One potential paid component of a search marketing strategy is the use of a pay-per-click (PPC) advertising campaign in which advertisers pay search engine hosts only when their advertisement is clicked. This paper…
Faster Smith-Waterman database searches with inter-sequence SIMD parallelisation
2011-01-01
Background The Smith-Waterman algorithm for local sequence alignment is more sensitive than heuristic methods for database searching, but also more time-consuming. The fastest approach to parallelisation with SIMD technology has previously been described by Farrar in 2007. The aim of this study was to explore whether further speed could be gained by other approaches to parallelisation. Results A faster approach and implementation is described and benchmarked. In the new tool SWIPE, residues from sixteen different database sequences are compared in parallel to one query residue. Using a 375 residue query sequence a speed of 106 billion cell updates per second (GCUPS) was achieved on a dual Intel Xeon X5650 six-core processor system, which is over six times more rapid than software based on Farrar's 'striped' approach. SWIPE was about 2.5 times faster when the programs used only a single thread. For shorter queries, the increase in speed was larger. SWIPE was about twice as fast as BLAST when using the BLOSUM50 score matrix, while BLAST was about twice as fast as SWIPE for the BLOSUM62 matrix. The software is designed for 64 bit Linux on processors with SSSE3. Source code is available from http://dna.uio.no/swipe/ under the GNU Affero General Public License. Conclusions Efficient parallelisation using SIMD on standard hardware makes it possible to run Smith-Waterman database searches more than six times faster than before. The approach described here could significantly widen the potential application of Smith-Waterman searches. Other applications that require optimal local alignment scores could also benefit from improved performance. PMID:21631914
Faster Smith-Waterman database searches with inter-sequence SIMD parallelisation.
Rognes, Torbjørn
2011-06-01
The Smith-Waterman algorithm for local sequence alignment is more sensitive than heuristic methods for database searching, but also more time-consuming. The fastest approach to parallelisation with SIMD technology has previously been described by Farrar in 2007. The aim of this study was to explore whether further speed could be gained by other approaches to parallelisation. A faster approach and implementation is described and benchmarked. In the new tool SWIPE, residues from sixteen different database sequences are compared in parallel to one query residue. Using a 375 residue query sequence a speed of 106 billion cell updates per second (GCUPS) was achieved on a dual Intel Xeon X5650 six-core processor system, which is over six times more rapid than software based on Farrar's 'striped' approach. SWIPE was about 2.5 times faster when the programs used only a single thread. For shorter queries, the increase in speed was larger. SWIPE was about twice as fast as BLAST when using the BLOSUM50 score matrix, while BLAST was about twice as fast as SWIPE for the BLOSUM62 matrix. The software is designed for 64 bit Linux on processors with SSSE3. Source code is available from http://dna.uio.no/swipe/ under the GNU Affero General Public License. Efficient parallelisation using SIMD on standard hardware makes it possible to run Smith-Waterman database searches more than six times faster than before. The approach described here could significantly widen the potential application of Smith-Waterman searches. Other applications that require optimal local alignment scores could also benefit from improved performance.
SearchGUI: A Highly Adaptable Common Interface for Proteomics Search and de Novo Engines.
Barsnes, Harald; Vaudel, Marc
2018-05-25
Mass-spectrometry-based proteomics has become the standard approach for identifying and quantifying proteins. A vital step consists of analyzing experimentally generated mass spectra to identify the underlying peptide sequences for later mapping to the originating proteins. We here present the latest developments in SearchGUI, a common open-source interface for the most frequently used freely available proteomics search and de novo engines that has evolved into a central component in numerous bioinformatics workflows.
Sharma, Parichit; Mantri, Shrikant S
2014-01-01
The function of a newly sequenced gene can be discovered by determining its sequence homology with known proteins. BLAST is the most extensively used sequence analysis program for sequence similarity search in large databases of sequences. With the advent of next generation sequencing technologies it has now become possible to study genes and their expression at a genome-wide scale through RNA-seq and metagenome sequencing experiments. Functional annotation of all the genes is done by sequence similarity search against multiple protein databases. This annotation task is computationally very intensive and can take days to obtain complete results. The program mpiBLAST, an open-source parallelization of BLAST that achieves superlinear speedup, can be used to accelerate large-scale annotation by using supercomputers and high performance computing (HPC) clusters. Although many parallel bioinformatics applications using the Message Passing Interface (MPI) are available in the public domain, researchers are reluctant to use them due to lack of expertise in the Linux command line and relevant programming experience. With these limitations, it becomes difficult for biologists to use mpiBLAST for accelerating annotation. No web interface is available in the open-source domain for mpiBLAST. We have developed WImpiBLAST, a user-friendly open-source web interface for parallel BLAST searches. It is implemented in Struts 1.3 using a Java backbone and runs atop the open-source Apache Tomcat Server. WImpiBLAST supports script creation and job submission features and also provides a robust job management interface for system administrators. It combines script creation and modification features with job monitoring and management through the Torque resource manager on a Linux-based HPC cluster. Use case information highlights the acceleration of annotation analysis achieved by using WImpiBLAST. Here, we describe the WImpiBLAST web interface features and architecture, explain design decisions, describe workflows and provide a detailed analysis.
Sharma, Parichit; Mantri, Shrikant S.
2014-01-01
The function of a newly sequenced gene can be discovered by determining its sequence homology with known proteins. BLAST is the most extensively used sequence analysis program for sequence similarity search in large databases of sequences. With the advent of next generation sequencing technologies it has now become possible to study genes and their expression at a genome-wide scale through RNA-seq and metagenome sequencing experiments. Functional annotation of all the genes is done by sequence similarity search against multiple protein databases. This annotation task is computationally very intensive and can take days to obtain complete results. The program mpiBLAST, an open-source parallelization of BLAST that achieves superlinear speedup, can be used to accelerate large-scale annotation by using supercomputers and high performance computing (HPC) clusters. Although many parallel bioinformatics applications using the Message Passing Interface (MPI) are available in the public domain, researchers are reluctant to use them due to lack of expertise in the Linux command line and relevant programming experience. With these limitations, it becomes difficult for biologists to use mpiBLAST for accelerating annotation. No web interface is available in the open-source domain for mpiBLAST. We have developed WImpiBLAST, a user-friendly open-source web interface for parallel BLAST searches. It is implemented in Struts 1.3 using a Java backbone and runs atop the open-source Apache Tomcat Server. WImpiBLAST supports script creation and job submission features and also provides a robust job management interface for system administrators. It combines script creation and modification features with job monitoring and management through the Torque resource manager on a Linux-based HPC cluster. Use case information highlights the acceleration of annotation analysis achieved by using WImpiBLAST. Here, we describe the WImpiBLAST web interface features and architecture, explain design decisions, describe workflows and provide a detailed analysis. PMID:24979410
Asgari, Samira; McLaren, Paul J; Peake, Jane; Wong, Melanie; Wong, Richard; Bartha, Istvan; Francis, Joshua R; Abarca, Katia; Gelderman, Kyra A; Agyeman, Philipp; Aebi, Christoph; Berger, Christoph; Fellay, Jacques; Schlapbach, Luregn J
2016-01-01
One out of three pediatric sepsis deaths in high income countries occur in previously healthy children. Primary immunodeficiencies (PIDs) have been postulated to underlie fulminant sepsis, but this concept remains to be confirmed in clinical practice. Pseudomonas aeruginosa ( P. aeruginosa ) is a common bacterium mostly associated with health care-related infections in immunocompromised individuals. However, in rare cases, it can cause sepsis in previously healthy children. We used exome sequencing and bioinformatic analysis to systematically search for genetic factors underpinning severe P. aeruginosa infection in the pediatric population. We collected blood samples from 11 previously healthy children, with no family history of immunodeficiency, who presented with severe sepsis due to community-acquired P. aeruginosa bacteremia. Genomic DNA was extracted from blood or tissue samples obtained intravitam or postmortem. We obtained high-coverage exome sequencing data and searched for rare loss-of-function variants. After rigorous filtrations, 12 potentially causal variants were identified. Two out of eight (25%) fatal cases were found to carry novel pathogenic variants in PID genes, including BTK and DNMT3B . This study demonstrates that exome sequencing allows to identify rare, deleterious human genetic variants responsible for fulminant sepsis in apparently healthy children. Diagnosing PIDs in such patients is of high relevance to survivors and affected families. We propose that unusually severe and fatal sepsis cases in previously healthy children should be considered for exome/genome sequencing to search for underlying PIDs.
Asgari, Samira; McLaren, Paul J.; Peake, Jane; Wong, Melanie; Wong, Richard; Bartha, Istvan; Francis, Joshua R.; Abarca, Katia; Gelderman, Kyra A.; Agyeman, Philipp; Aebi, Christoph; Berger, Christoph; Fellay, Jacques; Schlapbach, Luregn J.; Posfay-Barbe, Klara
2016-01-01
One out of three pediatric sepsis deaths in high income countries occur in previously healthy children. Primary immunodeficiencies (PIDs) have been postulated to underlie fulminant sepsis, but this concept remains to be confirmed in clinical practice. Pseudomonas aeruginosa (P. aeruginosa) is a common bacterium mostly associated with health care-related infections in immunocompromised individuals. However, in rare cases, it can cause sepsis in previously healthy children. We used exome sequencing and bioinformatic analysis to systematically search for genetic factors underpinning severe P. aeruginosa infection in the pediatric population. We collected blood samples from 11 previously healthy children, with no family history of immunodeficiency, who presented with severe sepsis due to community-acquired P. aeruginosa bacteremia. Genomic DNA was extracted from blood or tissue samples obtained intravitam or postmortem. We obtained high-coverage exome sequencing data and searched for rare loss-of-function variants. After rigorous filtrations, 12 potentially causal variants were identified. Two out of eight (25%) fatal cases were found to carry novel pathogenic variants in PID genes, including BTK and DNMT3B. This study demonstrates that exome sequencing allows to identify rare, deleterious human genetic variants responsible for fulminant sepsis in apparently healthy children. Diagnosing PIDs in such patients is of high relevance to survivors and affected families. We propose that unusually severe and fatal sepsis cases in previously healthy children should be considered for exome/genome sequencing to search for underlying PIDs. PMID:27703454
Fault trees and sequence dependencies
NASA Technical Reports Server (NTRS)
Dugan, Joanne Bechta; Boyd, Mark A.; Bavuso, Salvatore J.
1990-01-01
One of the frequently cited shortcomings of fault-tree models, their inability to model so-called sequence dependencies, is discussed. Several sources of such sequence dependencies are discussed, and new fault-tree gates to capture this behavior are defined. These complex behaviors can be included in present fault-tree models because they utilize a Markov solution. The utility of the new gates is demonstrated by presenting several models of the fault-tolerant parallel processor, which include both hot and cold spares.
Ghouila, Amel; Florent, Isabelle; Guerfali, Fatma Zahra; Terrapon, Nicolas; Laouini, Dhafer; Yahia, Sadok Ben; Gascuel, Olivier; Bréhélin, Laurent
2014-01-01
Identification of protein domains is a key step for understanding protein function. Hidden Markov Models (HMMs) have proved to be a powerful tool for this task. The Pfam database notably provides a large collection of HMMs which are widely used for the annotation of proteins in sequenced organisms. This is done via sequence/HMM comparisons. However, this approach may lack sensitivity when searching for domains in divergent species. Recently, methods for HMM/HMM comparisons have been proposed and proved to be more sensitive than sequence/HMM approaches in certain cases. However, these approaches are usually not used for protein domain discovery at a genome scale, and the benefit that could be expected from their utilization for this problem has not been investigated. Using proteins of P. falciparum and L. major as examples, we investigate the extent to which HMM/HMM comparisons can identify new domain occurrences not already identified by sequence/HMM approaches. We show that although HMM/HMM comparisons are much more sensitive than sequence/HMM comparisons, they are not sufficiently accurate to be used as a standalone complement of sequence/HMM approaches at the genome scale. Hence, we propose to use domain co-occurrence--the general domain tendency to preferentially appear along with some favorite domains in the proteins--to improve the accuracy of the approach. We show that the combination of HMM/HMM comparisons and co-occurrence domain detection boosts protein annotations. At an estimated False Discovery Rate of 5%, it revealed 901 and 1098 new domains in Plasmodium and Leishmania proteins, respectively. Manual inspection of part of these predictions shows that it contains several domain families that were missing in the two organisms. All new domain occurrences have been integrated in the EuPathDomains database, along with the GO annotations that can be deduced.
Ghouila, Amel; Florent, Isabelle; Guerfali, Fatma Zahra; Terrapon, Nicolas; Laouini, Dhafer; Yahia, Sadok Ben; Gascuel, Olivier; Bréhélin, Laurent
2014-01-01
Identification of protein domains is a key step for understanding protein function. Hidden Markov Models (HMMs) have proved to be a powerful tool for this task. The Pfam database notably provides a large collection of HMMs which are widely used for the annotation of proteins in sequenced organisms. This is done via sequence/HMM comparisons. However, this approach may lack sensitivity when searching for domains in divergent species. Recently, methods for HMM/HMM comparisons have been proposed and proved to be more sensitive than sequence/HMM approaches in certain cases. However, these approaches are usually not used for protein domain discovery at a genome scale, and the benefit that could be expected from their utilization for this problem has not been investigated. Using proteins of P. falciparum and L. major as examples, we investigate the extent to which HMM/HMM comparisons can identify new domain occurrences not already identified by sequence/HMM approaches. We show that although HMM/HMM comparisons are much more sensitive than sequence/HMM comparisons, they are not sufficiently accurate to be used as a standalone complement of sequence/HMM approaches at the genome scale. Hence, we propose to use domain co-occurrence — the general domain tendency to preferentially appear along with some favorite domains in the proteins — to improve the accuracy of the approach. We show that the combination of HMM/HMM comparisons and co-occurrence domain detection boosts protein annotations. At an estimated False Discovery Rate of 5%, it revealed 901 and 1098 new domains in Plasmodium and Leishmania proteins, respectively. Manual inspection of part of these predictions shows that it contains several domain families that were missing in the two organisms. All new domain occurrences have been integrated in the EuPathDomains database, along with the GO annotations that can be deduced. PMID:24901648
Genomics approach to the environmental community of microorganisms
NASA Astrophysics Data System (ADS)
Kawarabayasi, Y.; Maruyama, A.
2004-12-01
It was indicated by microscopic observation or comparison of 16S rDNA sequence that many extremophiles were surviving in many hydrothermal environments. But it is generally said that over 99% of total microbes are now uncultivable. Thus, we planned to identify uncultivable microbes through direct sequencing of environmental DNA. At first, shotgun plasmid libraries were directly constructed with the DNA molecules prepared from mixed microbes collected from low-temperature hydrothermal water at RM24 in the Southern East Pacific Rise (S-EPR). It was shown that the sequences of some number of clones indicated the similar feature to the intron in eukaryote or tandem repetitive sequence identified in some human familiar diseases. The results indicated that many microorganisms with eukaryotic feature were dominant in low temperature water of S-EPR. Secondly, shotgun plasmid libraries were constructed from the environmental DNA prepared from Beppu hot springs. The ORFs were easily identified all clones determined entire sequence. Thus it can be said that hot springs is good resources for searching novel genes. At last, the mixed microbes isolated from Suiyo seamount were used for construction of shotgun library. The clones in this library contained the ORFs. From some clones in hot spring and Suiyo sample, aminoacyl-tRNA synthatase, which is generally present in all organisms, was isolated by similarity. The phylogenetic analysis of aminoacyl-tRNA synthetase identified indicated that novel and unidentified microorganisms should be present in hot spring or Suiyo seamount. The novel genes identified from Suiyo seamount were also utilized for expression in E. coli. Some gene products were successfully obtained from the E. coli cells as soluble proteins. Some protein indicated the thermostability up to 70_E#8249;C, meaning that the original host cell of this gene should be stable up to the same temperature. Our work indicates that environmental genomics, including the direct cloning, sequencing of environmental DNA and expression of gene identified, is powerful approach to collect novel uncultivable microbes or novel active genes.
Rhythm sensitivity in macaque monkeys
Selezneva, Elena; Deike, Susann; Knyazeva, Stanislava; Scheich, Henning; Brechmann, André; Brosch, Michael
2013-01-01
This study provides evidence that monkeys are rhythm sensitive. We composed isochronous tone sequences consisting of repeating triplets of two short tones and one long tone which humans perceive as repeating triplets of two weak and one strong beat. This regular sequence was compared to an irregular sequence with the same number of randomly arranged short and long tones with no such beat structure. To search for indication of rhythm sensitivity we employed an oddball paradigm in which occasional duration deviants were introduced in the sequences. In a pilot study on humans we showed that subjects more easily detected these deviants when they occurred in a regular sequence. In the monkeys we searched for spontaneous behaviors the animals executed concomitant with the deviants. We found that monkeys more frequently exhibited changes of gaze and facial expressions to the deviants when they occurred in the regular sequence compared to the irregular sequence. In addition we recorded neuronal firing and local field potentials from 175 sites of the primary auditory cortex during sequence presentation. We found that both types of neuronal signals differentiated regular from irregular sequences. Both signals were stronger in regular sequences and occurred after the onset of the long tones, i.e., at the position of the strong beat. Local field potential responses were also significantly larger for the durational deviants in regular sequences, yet in a later time window. We speculate that these temporal pattern-selective mechanisms with a focus on strong beats and their deviants underlie the perception of rhythm in the chosen sequences. PMID:24046732
STEM Education Related Dissertation Abstracts: A Bounded Qualitative Meta-Study
ERIC Educational Resources Information Center
Banning, James; Folkestad, James E.
2012-01-01
This article utilizes a bounded qualitative meta-study framework to examine the 101 dissertation abstracts found by searching the ProQuest Dissertation and Theses[TM] digital database for dissertations abstracts from 1990 through 2010 using the search terms education, science, technology, engineer, and STEM/SMET. Professional search librarians…
Complexity: an internet resource for analysis of DNA sequence complexity
Orlov, Y. L.; Potapov, V. N.
2004-01-01
The search for DNA regions with low complexity is one of the pivotal tasks of modern structural analysis of complete genomes. The low complexity may be preconditioned by strong inequality in nucleotide content (biased composition), by tandem or dispersed repeats or by palindrome-hairpin structures, as well as by a combination of all these factors. Several numerical measures of textual complexity, including combinatorial and linguistic ones, together with complexity estimation using a modified Lempel–Ziv algorithm, have been implemented in a software tool called ‘Complexity’ (http://wwwmgs.bionet.nsc.ru/mgs/programs/low_complexity/). The software enables a user to search for low-complexity regions in long sequences, e.g. complete bacterial genomes or eukaryotic chromosomes. In addition, it estimates the complexity of groups of aligned sequences. PMID:15215465
Group I introns are widespread in archaea.
Nawrocki, Eric P; Jones, Thomas A; Eddy, Sean R
2018-05-18
Group I catalytic introns have been found in bacterial, viral, organellar, and some eukaryotic genomes, but not in archaea. All known archaeal introns are bulge-helix-bulge (BHB) introns, with the exception of a few group II introns. It has been proposed that BHB introns arose from extinct group I intron ancestors, much like eukaryotic spliceosomal introns are thought to have descended from group II introns. However, group I introns have little sequence conservation, making them difficult to detect with standard sequence similarity searches. Taking advantage of recent improvements in a computational homology search method that accounts for both conserved sequence and RNA secondary structure, we have identified 39 group I introns in a wide range of archaeal phyla, including examples of group I introns and BHB introns in the same host gene.
LigSearch: a knowledge-based web server to identify likely ligands for a protein target
DOE Office of Scientific and Technical Information (OSTI.GOV)
Beer, Tjaart A. P. de; Laskowski, Roman A.; Duban, Mark-Eugene
LigSearch is a web server for identifying ligands likely to bind to a given protein. Identifying which ligands might bind to a protein before crystallization trials could provide a significant saving in time and resources. LigSearch, a web server aimed at predicting ligands that might bind to and stabilize a given protein, has been developed. Using a protein sequence and/or structure, the system searches against a variety of databases, combining available knowledge, and provides a clustered and ranked output of possible ligands. LigSearch can be accessed at http://www.ebi.ac.uk/thornton-srv/databases/LigSearch.
Winnowing sequences from a database search.
Berman, P; Zhang, Z; Wolf, Y I; Koonin, E V; Miller, W
2000-01-01
In database searches for sequence similarity, matches to a distinct sequence region (e.g., protein domain) are frequently obscured by numerous matches to another region of the same sequence. In order to cope with this problem, algorithms are developed to discard redundant matches. One model for this problem begins with a list of intervals, each with an associated score; each interval gives the range of positions in the query sequence that align to a database sequence, and the score is that of the alignment. If interval I is contained in interval J, and I's score is less than J's, then I is said to be dominated by J. The problem is then to identify each interval that is dominated by at least K other intervals, where K is a given level of "tolerable redundancy." An algorithm is developed to solve the problem in O(N log N) time and O(N*) space, where N is the number of intervals and N* is a precisely defined value that never exceeds N and is frequently much smaller. This criterion for discarding database hits has been implemented in the Blast program, as illustrated herein with examples. Several variations and extensions of this approach are also described.
Characterizing the D2 statistic: word matches in biological sequences.
Forêt, Sylvain; Wilson, Susan R; Burden, Conrad J
2009-01-01
Word matches are often used in sequence comparison methods, either as a measure of sequence similarity or in the first search steps of algorithms such as BLAST or BLAT. The D2 statistic is the number of matches of words of k letters between two sequences. Recent advances have been made in the characterization of this statistic and in the approximation of its distribution. Here, these results are extended to the case of approximate word matches. We compute the exact value of the variance of the D2 statistic for the case of a uniform letter distribution, and introduce a method to provide accurate approximations of the variance in the remaining cases. This enables the distribution of D2 to be approximated for typical situations arising in biological research. We apply these results to the identification of cis-regulatory modules, and show that this method detects such sequences with a high accuracy. The ability to approximate the distribution of D2 for both exact and approximate word matches will enable the use of this statistic in a more precise manner for sequence comparison, database searches, and identification of transcription factor binding sites.
NASA Astrophysics Data System (ADS)
Ivanov, Mark V.; Lobas, Anna A.; Levitsky, Lev I.; Moshkovskii, Sergei A.; Gorshkov, Mikhail V.
2018-02-01
In a proteogenomic approach based on tandem mass spectrometry analysis of proteolytic peptide mixtures, customized exome or RNA-seq databases are employed for identifying protein sequence variants. However, the problem of variant peptide identification without personalized genomic data is important for a variety of applications. Following the recent proposal by Chick et al. (Nat. Biotechnol. 33, 743-749, 2015) on the feasibility of such variant peptide search, we evaluated two available approaches based on the previously suggested "open" search and the "brute-force" strategy. To improve the efficiency of these approaches, we propose an algorithm for exclusion of false variant identifications from the search results involving analysis of modifications mimicking single amino acid substitutions. Also, we propose a de novo based scoring scheme for assessment of identified point mutations. In the scheme, the search engine analyzes y-type fragment ions in MS/MS spectra to confirm the location of the mutation in the variant peptide sequence.
Temporal and peripheral extraction of contextual cues from scenes during visual search.
Koehler, Kathryn; Eckstein, Miguel P
2017-02-01
Scene context is known to facilitate object recognition and guide visual search, but little work has focused on isolating image-based cues and evaluating their contributions to eye movement guidance and search performance. Here, we explore three types of contextual cues (a co-occurring object, the configuration of other objects, and the superordinate category of background elements) and assess their joint contributions to search performance in the framework of cue-combination and the temporal unfolding of their extraction. We also assess whether observers' ability to extract each contextual cue in the visual periphery is a bottleneck that determines the utilization and contribution of each cue to search guidance and decision accuracy. We find that during the first four fixations of a visual search task observers first utilize the configuration of objects for coarse eye movement guidance and later use co-occurring object information for finer guidance. In the absence of contextual cues, observers were suboptimally biased to report the target object as being absent. The presence of the co-occurring object was the only contextual cue that had a significant effect in reducing decision bias. The early influence of object-based cues on eye movements is corroborated by a clear demonstration of observers' ability to extract object cues up to 16° into the visual periphery. The joint contributions of the cues to decision search accuracy approximates that expected from the combination of statistically independent cues and optimal cue combination. Finally, the lack of utilization and contribution of the background-based contextual cue to search guidance cannot be explained by the availability of the contextual cue in the visual periphery; instead it is related to background cues providing the least inherent information about the precise location of the target in the scene.
A robust and cost-effective approach to sequence and analyze complete genomes of small RNA viruses
USDA-ARS?s Scientific Manuscript database
Background: Next-generation sequencing (NGS) allows ultra-deep sequencing of nucleic acids. The use of sequence-independent amplification of viral nucleic acids without utilization of target-specific primers provides advantages over traditional sequencing methods and allows detection of unsuspected ...
De Novo Peptide Sequencing: Deep Mining of High-Resolution Mass Spectrometry Data.
Islam, Mohammad Tawhidul; Mohamedali, Abidali; Fernandes, Criselda Santan; Baker, Mark S; Ranganathan, Shoba
2017-01-01
High resolution mass spectrometry has revolutionized proteomics over the past decade, resulting in tremendous amounts of data in the form of mass spectra, being generated in a relatively short span of time. The mining of this spectral data for analysis and interpretation though has lagged behind such that potentially valuable data is being overlooked because it does not fit into the mold of traditional database searching methodologies. Although the analysis of spectra by de novo sequences removes such biases and has been available for a long period of time, its uptake has been slow or almost nonexistent within the scientific community. In this chapter, we propose a methodology to integrate de novo peptide sequencing using three commonly available software solutions in tandem, complemented by homology searching, and manual validation of spectra. This simplified method would allow greater use of de novo sequencing approaches and potentially greatly increase proteome coverage leading to the unearthing of valuable insights into protein biology, especially of organisms whose genomes have been recently sequenced or are poorly annotated.
The SUPERFAMILY database in 2004: additions and improvements.
Madera, Martin; Vogel, Christine; Kummerfeld, Sarah K; Chothia, Cyrus; Gough, Julian
2004-01-01
The SUPERFAMILY database provides structural assignments to protein sequences and a framework for analysis of the results. At the core of the database is a library of profile Hidden Markov Models that represent all proteins of known structure. The library is based on the SCOP classification of proteins: each model corresponds to a SCOP domain and aims to represent an entire superfamily. We have applied the library to predicted proteins from all completely sequenced genomes (currently 154), the Swiss-Prot and TrEMBL databases and other sequence collections. Close to 60% of all proteins have at least one match, and one half of all residues are covered by assignments. All models and full results are available for download and online browsing at http://supfam.org. Users can study the distribution of their superfamily of interest across all completely sequenced genomes, investigate with which other superfamilies it combines and retrieve proteins in which it occurs. Alternatively, concentrating on a particular genome as a whole, it is possible first, to find out its superfamily composition, and secondly, to compare it with that of other genomes to detect superfamilies that are over- or under-represented. In addition, the webserver provides the following standard services: sequence search; keyword search for genomes, superfamilies and sequence identifiers; and multiple alignment of genomic, PDB and custom sequences.
From web search to healthcare utilization: privacy-sensitive studies from mobile data
Horvitz, Eric
2013-01-01
Objective We explore relationships between health information seeking activities and engagement with healthcare professionals via a privacy-sensitive analysis of geo-tagged data from mobile devices. Materials and methods We analyze logs of mobile interaction data stripped of individually identifiable information and location data. The data analyzed consist of time-stamped search queries and distances to medical care centers. We examine search activity that precedes the observation of salient evidence of healthcare utilization (EHU) (ie, data suggesting that the searcher is using healthcare resources), in our case taken as queries occurring at or near medical facilities. Results We show that the time between symptom searches and observation of salient evidence of seeking healthcare utilization depends on the acuity of symptoms. We construct statistical models that make predictions of forthcoming EHU based on observations about the current search session, prior medical search activities, and prior EHU. The predictive accuracy of the models varies (65%–90%) depending on the features used and the timeframe of the analysis, which we explore via a sensitivity analysis. Discussion We provide a privacy-sensitive analysis that can be used to generate insights about the pursuit of health information and healthcare. The findings demonstrate how large-scale studies of mobile devices can provide insights on how concerns about symptomatology lead to the pursuit of professional care. Conclusion We present new methods for the analysis of mobile logs and describe a study that provides evidence about how people transition from mobile searches on symptoms and diseases to the pursuit of healthcare in the world. PMID:22661560
DNA Repair: The Search for Homology.
Haber, James E
2018-05-01
The repair of chromosomal double-strand breaks (DSBs) by homologous recombination is essential to maintain genome integrity. The key step in DSB repair is the RecA/Rad51-mediated process to match sequences at the broken end to homologous donor sequences that can be used as a template to repair the lesion. Here, in reviewing research about DSB repair, I consider the many factors that appear to play important roles in the successful search for homology by several homologous recombination mechanisms. See also the video abstract here: https://youtu.be/vm7-X5uIzS8. © 2018 WILEY Periodicals, Inc.
RAG-3D: A search tool for RNA 3D substructures
Zahran, Mai; Sevim Bayrak, Cigdem; Elmetwaly, Shereef; ...
2015-08-24
In this study, to address many challenges in RNA structure/function prediction, the characterization of RNA's modular architectural units is required. Using the RNA-As-Graphs (RAG) database, we have previously explored the existence of secondary structure (2D) submotifs within larger RNA structures. Here we present RAG-3D—a dataset of RNA tertiary (3D) structures and substructures plus a web-based search tool—designed to exploit graph representations of RNAs for the goal of searching for similar 3D structural fragments. The objects in RAG-3D consist of 3D structures translated into 3D graphs, cataloged based on the connectivity between their secondary structure elements. Each graph is additionally describedmore » in terms of its subgraph building blocks. The RAG-3D search tool then compares a query RNA 3D structure to those in the database to obtain structurally similar structures and substructures. This comparison reveals conserved 3D RNA features and thus may suggest functional connections. Though RNA search programs based on similarity in sequence, 2D, and/or 3D structural elements are available, our graph-based search tool may be advantageous for illuminating similarities that are not obvious; using motifs rather than sequence space also reduces search times considerably. Ultimately, such substructuring could be useful for RNA 3D structure prediction, structure/function inference and inverse folding.« less
RAG-3D: a search tool for RNA 3D substructures
Zahran, Mai; Sevim Bayrak, Cigdem; Elmetwaly, Shereef; Schlick, Tamar
2015-01-01
To address many challenges in RNA structure/function prediction, the characterization of RNA's modular architectural units is required. Using the RNA-As-Graphs (RAG) database, we have previously explored the existence of secondary structure (2D) submotifs within larger RNA structures. Here we present RAG-3D—a dataset of RNA tertiary (3D) structures and substructures plus a web-based search tool—designed to exploit graph representations of RNAs for the goal of searching for similar 3D structural fragments. The objects in RAG-3D consist of 3D structures translated into 3D graphs, cataloged based on the connectivity between their secondary structure elements. Each graph is additionally described in terms of its subgraph building blocks. The RAG-3D search tool then compares a query RNA 3D structure to those in the database to obtain structurally similar structures and substructures. This comparison reveals conserved 3D RNA features and thus may suggest functional connections. Though RNA search programs based on similarity in sequence, 2D, and/or 3D structural elements are available, our graph-based search tool may be advantageous for illuminating similarities that are not obvious; using motifs rather than sequence space also reduces search times considerably. Ultimately, such substructuring could be useful for RNA 3D structure prediction, structure/function inference and inverse folding. PMID:26304547
RAG-3D: A search tool for RNA 3D substructures
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zahran, Mai; Sevim Bayrak, Cigdem; Elmetwaly, Shereef
In this study, to address many challenges in RNA structure/function prediction, the characterization of RNA's modular architectural units is required. Using the RNA-As-Graphs (RAG) database, we have previously explored the existence of secondary structure (2D) submotifs within larger RNA structures. Here we present RAG-3D—a dataset of RNA tertiary (3D) structures and substructures plus a web-based search tool—designed to exploit graph representations of RNAs for the goal of searching for similar 3D structural fragments. The objects in RAG-3D consist of 3D structures translated into 3D graphs, cataloged based on the connectivity between their secondary structure elements. Each graph is additionally describedmore » in terms of its subgraph building blocks. The RAG-3D search tool then compares a query RNA 3D structure to those in the database to obtain structurally similar structures and substructures. This comparison reveals conserved 3D RNA features and thus may suggest functional connections. Though RNA search programs based on similarity in sequence, 2D, and/or 3D structural elements are available, our graph-based search tool may be advantageous for illuminating similarities that are not obvious; using motifs rather than sequence space also reduces search times considerably. Ultimately, such substructuring could be useful for RNA 3D structure prediction, structure/function inference and inverse folding.« less
Iterative Repair Planning for Spacecraft Operations Using the Aspen System
NASA Technical Reports Server (NTRS)
Rabideau, G.; Knight, R.; Chien, S.; Fukunaga, A.; Govindjee, A.
2000-01-01
This paper describes the Automated Scheduling and Planning Environment (ASPEN). ASPEN encodes complex spacecraft knowledge of operability constraints, flight rules, spacecraft hardware, science experiments and operations procedures to allow for automated generation of low level spacecraft sequences. Using a technique called iterative repair, ASPEN classifies constraint violations (i.e., conflicts) and attempts to repair each by performing a planning or scheduling operation. It must reason about which conflict to resolve first and what repair method to try for the given conflict. ASPEN is currently being utilized in the development of automated planner/scheduler systems for several spacecraft, including the UFO-1 naval communications satellite and the Citizen Explorer (CX1) satellite, as well as for planetary rover operations and antenna ground systems automation. This paper focuses on the algorithm and search strategies employed by ASPEN to resolve spacecraft operations constraints, as well as the data structures for representing these constraints.
NASA Astrophysics Data System (ADS)
Rosenberg, Jake; Parker, W. Ryan; Cammarata, Michael B.; Brodbelt, Jennifer S.
2018-04-01
UV-POSIT (Ultraviolet Photodissociation Online Structure Interrogation Tools) is a suite of web-based tools designed to facilitate the rapid interpretation of data from native mass spectrometry experiments making use of 193 nm ultraviolet photodissociation (UVPD). The suite includes four separate utilities which assist in the calculation of fragment ion abundances as a function of backbone cleavage sites and sequence position; the localization of charge sites in intact proteins; the calculation of hydrogen elimination propensity for a-type fragment ions; and mass-offset searching of UVPD spectra to identify unknown modifications and assess false positive fragment identifications. UV-POSIT is implemented as a Python/Flask web application hosted at http://uv-posit.cm.utexas.edu. UV-POSIT is available under the MIT license, and the source code is available at https://github.com/jarosenb/UV_POSIT. [Figure not available: see fulltext.
Rosenberg, Jake; Parker, W Ryan; Cammarata, Michael B; Brodbelt, Jennifer S
2018-06-01
UV-POSIT (Ultraviolet Photodissociation Online Structure Interrogation Tools) is a suite of web-based tools designed to facilitate the rapid interpretation of data from native mass spectrometry experiments making use of 193 nm ultraviolet photodissociation (UVPD). The suite includes four separate utilities which assist in the calculation of fragment ion abundances as a function of backbone cleavage sites and sequence position; the localization of charge sites in intact proteins; the calculation of hydrogen elimination propensity for a-type fragment ions; and mass-offset searching of UVPD spectra to identify unknown modifications and assess false positive fragment identifications. UV-POSIT is implemented as a Python/Flask web application hosted at http://uv-posit.cm.utexas.edu . UV-POSIT is available under the MIT license, and the source code is available at https://github.com/jarosenb/UV_POSIT . Graphical Abstract.
Multiple objects tracking with HOGs matching in circular windows
NASA Astrophysics Data System (ADS)
Miramontes-Jaramillo, Daniel; Kober, Vitaly; Díaz-Ramírez, Víctor H.
2014-09-01
In recent years tracking applications with development of new technologies like smart TVs, Kinect, Google Glass and Oculus Rift become very important. When tracking uses a matching algorithm, a good prediction algorithm is required to reduce the search area for each object to be tracked as well as processing time. In this work, we analyze the performance of different tracking algorithms based on prediction and matching for a real-time tracking multiple objects. The used matching algorithm utilizes histograms of oriented gradients. It carries out matching in circular windows, and possesses rotation invariance and tolerance to viewpoint and scale changes. The proposed algorithm is implemented in a personal computer with GPU, and its performance is analyzed in terms of processing time in real scenarios. Such implementation takes advantage of current technologies and helps to process video sequences in real-time for tracking several objects at the same time.
Developments in Ocular Genetics: 2013 Annual Review
Aboobakar, Inas F.; Allingham, R. Rand
2014-01-01
Purpose To highlight major advancements in ocular genetics from the year 2013. Design Literature review. Methods A literature search was conducted on PubMed to identify articles pertaining to genetic influences on human eye diseases. This review focuses on manuscripts published in print or online in the English language between January 1, 2013 and December 31, 2013. A total of 120 papers from 2013 were included in this review. Results Significant progress has been made in our understanding of the genetic basis of a broad group of ocular disorders, including glaucoma, age-related macular degeneration, cataract, diabetic retinopathy, keratoconus, Fuchs’ endothelial dystrophy, and refractive error. Conclusions The latest next-generation sequencing technologies have become extremely effective tools for identifying gene mutations associated with ocular disease. These technological advancements have also paved the way for utilization of genetic information in clinical practice, including disease diagnosis, prediction of treatment response and molecular interventions guided by gene-based knowledge. PMID:25097799
Towards computational improvement of DNA database indexing and short DNA query searching.
Stojanov, Done; Koceski, Sašo; Mileva, Aleksandra; Koceska, Nataša; Bande, Cveta Martinovska
2014-09-03
In order to facilitate and speed up the search of massive DNA databases, the database is indexed at the beginning, employing a mapping function. By searching through the indexed data structure, exact query hits can be identified. If the database is searched against an annotated DNA query, such as a known promoter consensus sequence, then the starting locations and the number of potential genes can be determined. This is particularly relevant if unannotated DNA sequences have to be functionally annotated. However, indexing a massive DNA database and searching an indexed data structure with millions of entries is a time-demanding process. In this paper, we propose a fast DNA database indexing and searching approach, identifying all query hits in the database, without having to examine all entries in the indexed data structure, limiting the maximum length of a query that can be searched against the database. By applying the proposed indexing equation, the whole human genome could be indexed in 10 hours on a personal computer, under the assumption that there is enough RAM to store the indexed data structure. Analysing the methodology proposed by Reneker, we observed that hits at starting positions [Formula: see text] are not reported, if the database is searched against a query shorter than [Formula: see text] nucleotides, such that [Formula: see text] is the length of the DNA database words being mapped and [Formula: see text] is the length of the query. A solution of this drawback is also presented.
Budiman, Muhammad A.; Mao, Long; Wood, Todd C.; Wing, Rod A.
2000-01-01
Recently a new strategy using BAC end sequences as sequence-tagged connectors (STCs) was proposed for whole-genome sequencing projects. In this study, we present the construction and detailed characterization of a 15.0 haploid genome equivalent BAC library for the cultivated tomato, Lycopersicon esculentum cv. Heinz 1706. The library contains 129,024 clones with an average insert size of 117.5 kb and a chloroplast content of 1.11%. BAC end sequences from 1490 ends were generated and analyzed as a preliminary evaluation for using this library to develop an STC framework to sequence the tomato genome. A total of 1205 BAC end sequences (80.9%) were obtained, with an average length of 360 high-quality bases, and were searched against the GenBank database. Using a cutoff expectation value of <10−6, and combining the results from BLASTN, BLASTX, and TBLASTX searches, 24.3% of the BAC end sequences were similar to known sequences, of which almost half (48.7%) share sequence similarities to retrotransposons and 7% to known genes. Some of the transposable element sequences were the first reported in tomato, such as sequences similar to maize transposon Activator (Ac) ORF and tobacco pararetrovirus-like sequences. Interestingly, there were no BAC end sequences similar to the highly repeated TGRI and TGRII elements. However, the majority (70.3%) of STCs did not share significant sequence similarities to any sequences in GenBank at either the DNA or predicted protein levels, indicating that a large portion of the tomato genome is still unknown. Our data demonstrate that this BAC library is suitable for developing an STC database to sequence the tomato genome. The advantages of developing an STC framework for whole-genome sequencing of tomato are discussed. [The BAC end sequences described in this paper have been deposited in the GenBank data library under accession nos. AQ367111–AQ368361.] PMID:10645957
NASA Astrophysics Data System (ADS)
dias, S. B.; Yang, C.; Li, Z.; XIA, J.; Liu, K.; Gui, Z.; Li, W.
2013-12-01
Global climate change has become one of the biggest concerns for human kind in the 21st century due to its broad impacts on society and ecosystems across the world. Arctic has been observed as one of the most vulnerable regions to the climate change. In order to understand the impacts of climate change on the natural environment, ecosystems, biodiversity and others in the Arctic region, and thus to better support the planning and decision making process, cross-disciplinary researches are required to monitor and analyze changes of Arctic regions such as water, sea level, biodiversity and so on. Conducting such research demands the efficient utilization of various geospatially referenced data, web services and information related to Arctic region. In this paper, we propose a cloud-enabled and service-oriented Spatial Web Portal (SWP) to support the discovery, integration and utilization of Arctic related geospatial resources, serving as a building block of polar CI. This SWP leverages the following techniques: 1) a hybrid searching mechanism combining centralized local search, distributed catalogue search and specialized Internet search for effectively discovering Arctic data and web services from multiple sources; 2) a service-oriented quality-enabled framework for seamless integration and utilization of various geospatial resources; and 3) a cloud-enabled parallel spatial index building approach to facilitate near-real time resource indexing and searching. A proof-of-concept prototype is developed to demonstrate the feasibility of the proposed SWP, using an example of analyzing the Arctic snow cover change over the past 50 years.
A Next-Generation Sequencing Primer—How Does It Work and What Can It Do?
Alekseyev, Yuriy O.; Fazeli, Roghayeh; Yang, Shi; Basran, Raveen; Miller, Nancy S.
2018-01-01
Next-generation sequencing refers to a high-throughput technology that determines the nucleic acid sequences and identifies variants in a sample. The technology has been introduced into clinical laboratory testing and produces test results for precision medicine. Since next-generation sequencing is relatively new, graduate students, medical students, pathology residents, and other physicians may benefit from a primer to provide a foundation about basic next-generation sequencing methods and applications, as well as specific examples where it has had diagnostic and prognostic utility. Next-generation sequencing technology grew out of advances in multiple fields to produce a sophisticated laboratory test with tremendous potential. Next-generation sequencing may be used in the clinical setting to look for specific genetic alterations in patients with cancer, diagnose inherited conditions such as cystic fibrosis, and detect and profile microbial organisms. This primer will review DNA sequencing technology, the commercialization of next-generation sequencing, and clinical uses of next-generation sequencing. Specific applications where next-generation sequencing has demonstrated utility in oncology are provided. PMID:29761157
Bradshaw, Charles Richard; Surendranath, Vineeth; Henschel, Robert; Mueller, Matthias Stefan; Habermann, Bianca Hermine
2011-03-10
Conserved domains in proteins are one of the major sources of functional information for experimental design and genome-level annotation. Though search tools for conserved domain databases such as Hidden Markov Models (HMMs) are sensitive in detecting conserved domains in proteins when they share sufficient sequence similarity, they tend to miss more divergent family members, as they lack a reliable statistical framework for the detection of low sequence similarity. We have developed a greatly improved HMMerThread algorithm that can detect remotely conserved domains in highly divergent sequences. HMMerThread combines relaxed conserved domain searches with fold recognition to eliminate false positive, sequence-based identifications. With an accuracy of 90%, our software is able to automatically predict highly divergent members of conserved domain families with an associated 3-dimensional structure. We give additional confidence to our predictions by validation across species. We have run HMMerThread searches on eight proteomes including human and present a rich resource of remotely conserved domains, which adds significantly to the functional annotation of entire proteomes. We find ∼4500 cross-species validated, remotely conserved domain predictions in the human proteome alone. As an example, we find a DNA-binding domain in the C-terminal part of the A-kinase anchor protein 10 (AKAP10), a PKA adaptor that has been implicated in cardiac arrhythmias and premature cardiac death, which upon stress likely translocates from mitochondria to the nucleus/nucleolus. Based on our prediction, we propose that with this HLH-domain, AKAP10 is involved in the transcriptional control of stress response. Further remotely conserved domains we discuss are examples from areas such as sporulation, chromosome segregation and signalling during immune response. The HMMerThread algorithm is able to automatically detect the presence of remotely conserved domains in proteins based on weak sequence similarity. Our predictions open up new avenues for biological and medical studies. Genome-wide HMMerThread domains are available at http://vm1-hmmerthread.age.mpg.de.
Bradshaw, Charles Richard; Surendranath, Vineeth; Henschel, Robert; Mueller, Matthias Stefan; Habermann, Bianca Hermine
2011-01-01
Conserved domains in proteins are one of the major sources of functional information for experimental design and genome-level annotation. Though search tools for conserved domain databases such as Hidden Markov Models (HMMs) are sensitive in detecting conserved domains in proteins when they share sufficient sequence similarity, they tend to miss more divergent family members, as they lack a reliable statistical framework for the detection of low sequence similarity. We have developed a greatly improved HMMerThread algorithm that can detect remotely conserved domains in highly divergent sequences. HMMerThread combines relaxed conserved domain searches with fold recognition to eliminate false positive, sequence-based identifications. With an accuracy of 90%, our software is able to automatically predict highly divergent members of conserved domain families with an associated 3-dimensional structure. We give additional confidence to our predictions by validation across species. We have run HMMerThread searches on eight proteomes including human and present a rich resource of remotely conserved domains, which adds significantly to the functional annotation of entire proteomes. We find ∼4500 cross-species validated, remotely conserved domain predictions in the human proteome alone. As an example, we find a DNA-binding domain in the C-terminal part of the A-kinase anchor protein 10 (AKAP10), a PKA adaptor that has been implicated in cardiac arrhythmias and premature cardiac death, which upon stress likely translocates from mitochondria to the nucleus/nucleolus. Based on our prediction, we propose that with this HLH-domain, AKAP10 is involved in the transcriptional control of stress response. Further remotely conserved domains we discuss are examples from areas such as sporulation, chromosome segregation and signalling during immune response. The HMMerThread algorithm is able to automatically detect the presence of remotely conserved domains in proteins based on weak sequence similarity. Our predictions open up new avenues for biological and medical studies. Genome-wide HMMerThread domains are available at http://vm1-hmmerthread.age.mpg.de. PMID:21423752
Characterization of tannase protein sequences of bacteria and fungi: an in silico study.
Banerjee, Amrita; Jana, Arijit; Pati, Bikash R; Mondal, Keshab C; Das Mohapatra, Pradeep K
2012-04-01
The tannase protein sequences of 149 bacteria and 36 fungi were retrieved from NCBI database. Among them only 77 bacterial and 31 fungal tannase sequences were taken which have different amino acid compositions. These sequences were analysed for different physical and chemical properties, superfamily search, multiple sequence alignment, phylogenetic tree construction and motif finding to find out the functional motif and the evolutionary relationship among them. The superfamily search for these tannase exposed the occurrence of proline iminopeptidase-like, biotin biosynthesis protein BioH, O-acetyltransferase, carboxylesterase/thioesterase 1, carbon-carbon bond hydrolase, haloperoxidase, prolyl oligopeptidase, C-terminal domain and mycobacterial antigens families and alpha/beta hydrolase superfamily. Some bacterial and fungal sequence showed similarity with different families individually. The multiple sequence alignment of these tannase protein sequences showed conserved regions at different stretches with maximum homology from amino acid residues 389-469 and 482-523 which could be used for designing degenerate primers or probes specific for tannase producing bacterial and fungal species. Phylogenetic tree showed two different clusters; one has only bacteria and another have both fungi and bacteria showing some relationship between these different genera. Although in second cluster near about all fungal species were found together in a corner which indicates the sequence level similarity among fungal genera. The distributions of fourteen motifs analysis revealed Motif 1 with a signature amino acid sequence of 29 amino acids, i.e. GCSTGGREALKQAQRWPHDYDGIIANNPA, was uniformly observed in 83.3 % of studied tannase sequences representing its participation with the structure and enzymatic function.
DOE Office of Scientific and Technical Information (OSTI.GOV)
White, Richard A.; Panyala, Ajay R.; Glass, Kevin A.
MerCat is a parallel, highly scalable and modular property software package for robust analysis of features in next-generation sequencing data. MerCat inputs include assembled contigs and raw sequence reads from any platform resulting in feature abundance counts tables. MerCat allows for direct analysis of data properties without reference sequence database dependency commonly used by search tools such as BLAST and/or DIAMOND for compositional analysis of whole community shotgun sequencing (e.g. metagenomes and metatranscriptomes).
NASA Astrophysics Data System (ADS)
Jitsuhiro, Takatoshi; Toriyama, Tomoji; Kogure, Kiyoshi
We propose a noise suppression method based on multi-model compositions and multi-pass search. In real environments, input speech for speech recognition includes many kinds of noise signals. To obtain good recognized candidates, suppressing many kinds of noise signals at once and finding target speech is important. Before noise suppression, to find speech and noise label sequences, we introduce multi-pass search with acoustic models including many kinds of noise models and their compositions, their n-gram models, and their lexicon. Noise suppression is frame-synchronously performed using the multiple models selected by recognized label sequences with time alignments. We evaluated this method using the E-Nightingale task, which contains voice memoranda spoken by nurses during actual work at hospitals. The proposed method obtained higher performance than the conventional method.
orthoFind Facilitates the Discovery of Homologous and Orthologous Proteins.
Mier, Pablo; Andrade-Navarro, Miguel A; Pérez-Pulido, Antonio J
2015-01-01
Finding homologous and orthologous protein sequences is often the first step in evolutionary studies, annotation projects, and experiments of functional complementation. Despite all currently available computational tools, there is a requirement for easy-to-use tools that provide functional information. Here, a new web application called orthoFind is presented, which allows a quick search for homologous and orthologous proteins given one or more query sequences, allowing a recurrent and exhaustive search against reference proteomes, and being able to include user databases. It addresses the protein multidomain problem, searching for homologs with the same domain architecture, and gives a simple functional analysis of the results to help in the annotation process. orthoFind is easy to use and has been proven to provide accurate results with different datasets. Availability: http://www.bioinfocabd.upo.es/orthofind/.
Reactive Searching and Infotaxis in Odor Source Localization
Voges, Nicole; Chaffiol, Antoine; Lucas, Philippe; Martinez, Dominique
2014-01-01
Male moths aiming to locate pheromone-releasing females rely on stimulus-adapted search maneuvers complicated by a discontinuous distribution of pheromone patches. They alternate sequences of upwind surge when perceiving the pheromone and cross- or downwind casting when the odor is lost. We compare four search strategies: three reactive versus one cognitive. The former consist of pre-programmed movement sequences triggered by pheromone detections while the latter uses Bayesian inference to build spatial probability maps. Based on the analysis of triphasic responses of antennal lobe neurons (On, inhibition, Off), we propose three reactive strategies. One combines upwind surge (representing the On response to a pheromone detection) and spiral casting, only. The other two additionally include crosswind (zigzag) casting representing the Off phase. As cognitive strategy we use the infotaxis algorithm which was developed for searching in a turbulent medium. Detection events in the electroantennogram of a moth attached to a robot indirectly control this cyborg, depending on the strategy in use. The recorded trajectories are analyzed with regard to success rates, efficiency, and other features. In addition, we qualitatively compare our robotic trajectories to behavioral search paths. Reactive searching is more efficient (yielding shorter trajectories) for higher pheromone doses whereas cognitive searching works better for lower doses. With respect to our experimental conditions (2 m from starting position to pheromone source), reactive searching with crosswind zigzag yields the shortest trajectories (for comparable success rates). Assuming that the neuronal Off response represents a short-term memory, zigzagging is an efficient movement to relocate a recently lost pheromone plume. Accordingly, such reactive strategies offer an interesting alternative to complex cognitive searching. PMID:25330317
Reactive searching and infotaxis in odor source localization.
Voges, Nicole; Chaffiol, Antoine; Lucas, Philippe; Martinez, Dominique
2014-10-01
Male moths aiming to locate pheromone-releasing females rely on stimulus-adapted search maneuvers complicated by a discontinuous distribution of pheromone patches. They alternate sequences of upwind surge when perceiving the pheromone and cross- or downwind casting when the odor is lost. We compare four search strategies: three reactive versus one cognitive. The former consist of pre-programmed movement sequences triggered by pheromone detections while the latter uses Bayesian inference to build spatial probability maps. Based on the analysis of triphasic responses of antennal lobe neurons (On, inhibition, Off), we propose three reactive strategies. One combines upwind surge (representing the On response to a pheromone detection) and spiral casting, only. The other two additionally include crosswind (zigzag) casting representing the Off phase. As cognitive strategy we use the infotaxis algorithm which was developed for searching in a turbulent medium. Detection events in the electroantennogram of a moth attached to a robot indirectly control this cyborg, depending on the strategy in use. The recorded trajectories are analyzed with regard to success rates, efficiency, and other features. In addition, we qualitatively compare our robotic trajectories to behavioral search paths. Reactive searching is more efficient (yielding shorter trajectories) for higher pheromone doses whereas cognitive searching works better for lower doses. With respect to our experimental conditions (2 m from starting position to pheromone source), reactive searching with crosswind zigzag yields the shortest trajectories (for comparable success rates). Assuming that the neuronal Off response represents a short-term memory, zigzagging is an efficient movement to relocate a recently lost pheromone plume. Accordingly, such reactive strategies offer an interesting alternative to complex cognitive searching.
USDA-ARS?s Scientific Manuscript database
DNA barcoding revealed the presence of the polyphagous leafminer pest Liriomyza sativae Blanchard in Bangladesh. DNA barcode sequences for mitochondrial COI were generated for Agromyzidae larvae, pupae and adults collected from field populations across Bangladesh. BLAST sequence similarity searches ...
2014-01-01
Background A rare neuro-ichthyotic disorder characterized by ichthyosis, spastic quadriplegia and intellectual disability and caused by recessive mutations in ELOVL4, encoding elongase-4 protein has recently been described. The objective of the study was to search for sequence variants in the gene ELOVL4 in three affected individuals of a consanguineous Pakistani family exhibiting features of neuro-ichthyotic disorder. Methods Linkage in the family was searched by genotyping microsatellite markers linked to the gene ELOVL4, mapped at chromosome 6p14.1. Exons and splice junction sites of the gene ELOVL4 were polymerase chain reaction amplified and sequenced in an automated DNA sequencer. Results DNA sequence analysis revealed a novel homozygous nonsense mutation (c.78C > G; p.Tyr26*). Conclusions Our report further confirms the recently described ELOVL4-related neuro-ichthyosis and shows that the neurological phenotype can be absent in some individuals. PMID:24571530
The BaMM web server for de-novo motif discovery and regulatory sequence analysis.
Kiesel, Anja; Roth, Christian; Ge, Wanwan; Wess, Maximilian; Meier, Markus; Söding, Johannes
2018-05-28
The BaMM web server offers four tools: (i) de-novo discovery of enriched motifs in a set of nucleotide sequences, (ii) scanning a set of nucleotide sequences with motifs to find motif occurrences, (iii) searching with an input motif for similar motifs in our BaMM database with motifs for >1000 transcription factors, trained from the GTRD ChIP-seq database and (iv) browsing and keyword searching the motif database. In contrast to most other servers, we represent sequence motifs not by position weight matrices (PWMs) but by Bayesian Markov Models (BaMMs) of order 4, which we showed previously to perform substantially better in ROC analyses than PWMs or first order models. To address the inadequacy of P- and E-values as measures of motif quality, we introduce the AvRec score, the average recall over the TP-to-FP ratio between 1 and 100. The BaMM server is freely accessible without registration at https://bammmotif.mpibpc.mpg.de.
Communication Webagogy 2.0: More Click, Less Drag.
ERIC Educational Resources Information Center
Radford, Marie L.; Wagner, Kurt W.
2000-01-01
Argues that, because of the chaotic nature of the Web and the competing searching software, no single search tool will suffice. Lists and discusses meta search engines; communication meta-sites and subject directories, all indexed by humans; teaching resources for communication courses that utilize the unique features of the Web; and web sites…
Predicting the performance of fingerprint similarity searching.
Vogt, Martin; Bajorath, Jürgen
2011-01-01
Fingerprints are bit string representations of molecular structure that typically encode structural fragments, topological features, or pharmacophore patterns. Various fingerprint designs are utilized in virtual screening and their search performance essentially depends on three parameters: the nature of the fingerprint, the active compounds serving as reference molecules, and the composition of the screening database. It is of considerable interest and practical relevance to predict the performance of fingerprint similarity searching. A quantitative assessment of the potential that a fingerprint search might successfully retrieve active compounds, if available in the screening database, would substantially help to select the type of fingerprint most suitable for a given search problem. The method presented herein utilizes concepts from information theory to relate the fingerprint feature distributions of reference compounds to screening libraries. If these feature distributions do not sufficiently differ, active database compounds that are similar to reference molecules cannot be retrieved because they disappear in the "background." By quantifying the difference in feature distribution using the Kullback-Leibler divergence and relating the divergence to compound recovery rates obtained for different benchmark classes, fingerprint search performance can be quantitatively predicted.
A global sampling approach to designing and reengineering RNA secondary structures.
Levin, Alex; Lis, Mieszko; Ponty, Yann; O'Donnell, Charles W; Devadas, Srinivas; Berger, Bonnie; Waldispühl, Jérôme
2012-11-01
The development of algorithms for designing artificial RNA sequences that fold into specific secondary structures has many potential biomedical and synthetic biology applications. To date, this problem remains computationally difficult, and current strategies to address it resort to heuristics and stochastic search techniques. The most popular methods consist of two steps: First a random seed sequence is generated; next, this seed is progressively modified (i.e. mutated) to adopt the desired folding properties. Although computationally inexpensive, this approach raises several questions such as (i) the influence of the seed; and (ii) the efficiency of single-path directed searches that may be affected by energy barriers in the mutational landscape. In this article, we present RNA-ensign, a novel paradigm for RNA design. Instead of taking a progressive adaptive walk driven by local search criteria, we use an efficient global sampling algorithm to examine large regions of the mutational landscape under structural and thermodynamical constraints until a solution is found. When considering the influence of the seeds and the target secondary structures, our results show that, compared to single-path directed searches, our approach is more robust, succeeds more often and generates more thermodynamically stable sequences. An ensemble approach to RNA design is thus well worth pursuing as a complement to existing approaches. RNA-ensign is available at http://csb.cs.mcgill.ca/RNAensign.
A global sampling approach to designing and reengineering RNA secondary structures
Levin, Alex; Lis, Mieszko; Ponty, Yann; O’Donnell, Charles W.; Devadas, Srinivas; Berger, Bonnie; Waldispühl, Jérôme
2012-01-01
The development of algorithms for designing artificial RNA sequences that fold into specific secondary structures has many potential biomedical and synthetic biology applications. To date, this problem remains computationally difficult, and current strategies to address it resort to heuristics and stochastic search techniques. The most popular methods consist of two steps: First a random seed sequence is generated; next, this seed is progressively modified (i.e. mutated) to adopt the desired folding properties. Although computationally inexpensive, this approach raises several questions such as (i) the influence of the seed; and (ii) the efficiency of single-path directed searches that may be affected by energy barriers in the mutational landscape. In this article, we present RNA-ensign, a novel paradigm for RNA design. Instead of taking a progressive adaptive walk driven by local search criteria, we use an efficient global sampling algorithm to examine large regions of the mutational landscape under structural and thermodynamical constraints until a solution is found. When considering the influence of the seeds and the target secondary structures, our results show that, compared to single-path directed searches, our approach is more robust, succeeds more often and generates more thermodynamically stable sequences. An ensemble approach to RNA design is thus well worth pursuing as a complement to existing approaches. RNA-ensign is available at http://csb.cs.mcgill.ca/RNAensign. PMID:22941632
GENOME-ENABLED DISCOVERY OF CARBON SEQUESTRATION GENES IN POPLAR
DOE Office of Scientific and Technical Information (OSTI.GOV)
DAVIS J M
2007-10-11
Plants utilize carbon by partitioning the reduced carbon obtained through photosynthesis into different compartments and into different chemistries within a cell and subsequently allocating such carbon to sink tissues throughout the plant. Since the phytohormones auxin and cytokinin are known to influence sink strength in tissues such as roots (Skoog & Miller 1957, Nordstrom et al. 2004), we hypothesized that altering the expression of genes that regulate auxin-mediated (e.g., AUX/IAA or ARF transcription factors) or cytokinin-mediated (e.g., RR transcription factors) control of root growth and development would impact carbon allocation and partitioning belowground (Fig. 1 - Renewal Proposal). Specifically, themore » ARF, AUX/IAA and RR transcription factor gene families mediate the effects of the growth regulators auxin and cytokinin on cell expansion, cell division and differentiation into root primordia. Invertases (IVR), whose transcript abundance is enhanced by both auxin and cytokinin, are critical components of carbon movement and therefore of carbon allocation. Thus, we initiated comparative genomic studies to identify the AUX/IAA, ARF, RR and IVR gene families in the Populus genome that could impact carbon allocation and partitioning. Bioinformatics searches using Arabidopsis gene sequences as queries identified regions with high degrees of sequence similarities in the Populus genome. These Populus sequences formed the basis of our transgenic experiments. Transgenic modification of gene expression involving members of these gene families was hypothesized to have profound effects on carbon allocation and partitioning.« less
Dickinson, Peter; Xiong, Anqi; York, Daniel; Jayashankar, Kartika; Pielberg, Gerli; Koltookian, Michele; Murén, Eva; Fuxelius, Hans-Henrik; Weishaupt, Holger; Andersson, Göran; Hedhammar, Åke; Bongcam-Rudloff, Erik; Forsberg-Nilsson, Karin
2016-01-01
Gliomas are the most common form of malignant primary brain tumors in humans and second most common in dogs, occurring with similar frequencies in both species. Dogs are valuable spontaneous models of human complex diseases including cancers and may provide insight into disease susceptibility and oncogenesis. Several brachycephalic breeds such as Boxer, Bulldog and Boston Terrier have an elevated risk of developing glioma, but others, including Pug and Pekingese, are not at higher risk. To identify glioma-associated genetic susceptibility factors, an across-breed genome-wide association study (GWAS) was performed on 39 dog glioma cases and 141 controls from 25 dog breeds, identifying a genome-wide significant locus on canine chromosome (CFA) 26 (p = 2.8 x 10−8). Targeted re-sequencing of the 3.4 Mb candidate region was performed, followed by genotyping of the 56 SNVs that best fit the association pattern between the re-sequenced cases and controls. We identified three candidate genes that were highly associated with glioma susceptibility: CAMKK2, P2RX7 and DENR. CAMKK2 showed reduced expression in both canine and human brain tumors, and a non-synonymous variant in P2RX7, previously demonstrated to have a 50% decrease in receptor function, was also associated with disease. Thus, one or more of these genes appear to affect glioma susceptibility. PMID:27171399
Knowlton, Michelle N; Li, Tongbin; Ren, Yongliang; Bill, Brent R; Ellis, Lynda Bm; Ekker, Stephen C
2008-01-07
The zebrafish is a powerful model vertebrate amenable to high throughput in vivo genetic analyses. Examples include reverse genetic screens using morpholino knockdown, expression-based screening using enhancer trapping and forward genetic screening using transposon insertional mutagenesis. We have created a database to facilitate web-based distribution of data from such genetic studies. The MOrpholino DataBase is a MySQL relational database with an online, PHP interface. Multiple quality control levels allow differential access to data in raw and finished formats. MODBv1 includes sequence information relating to almost 800 morpholinos and their targets and phenotypic data regarding the dose effect of each morpholino (mortality, toxicity and defects). To improve the searchability of this database, we have incorporated a fixed-vocabulary defect ontology that allows for the organization of morpholino affects based on anatomical structure affected and defect produced. This also allows comparison between species utilizing Phenotypic Attribute Trait Ontology (PATO) designated terminology. MODB is also cross-linked with ZFIN, allowing full searches between the two databases. MODB offers users the ability to retrieve morpholino data by sequence of morpholino or target, name of target, anatomical structure affected and defect produced. MODB data can be used for functional genomic analysis of morpholino design to maximize efficacy and minimize toxicity. MODB also serves as a template for future sequence-based functional genetic screen databases, and it is currently being used as a model for the creation of a mutagenic insertional transposon database.
Global analysis of gene expression profiles in developing physic nut (Jatropha curcas L.) seeds.
Jiang, Huawu; Wu, Pingzhi; Zhang, Sheng; Song, Chi; Chen, Yaping; Li, Meiru; Jia, Yongxia; Fang, Xiaohua; Chen, Fan; Wu, Guojiang
2012-01-01
Physic nut (Jatropha curcas L.) is an oilseed plant species with high potential utility as a biofuel. Furthermore, following recent sequencing of its genome and the availability of expressed sequence tag (EST) libraries, it is a valuable model plant for studying carbon assimilation in endosperms of oilseed plants. There have been several transcriptomic analyses of developing physic nut seeds using ESTs, but they have provided limited information on the accumulation of stored resources in the seeds. We applied next-generation Illumina sequencing technology to analyze global gene expression profiles of developing physic nut seeds 14, 19, 25, 29, 35, 41, and 45 days after pollination (DAP). The acquired profiles reveal the key genes, and their expression timeframes, involved in major metabolic processes including: carbon flow, starch metabolism, and synthesis of storage lipids and proteins in the developing seeds. The main period of storage reserves synthesis in the seeds appears to be 29-41 DAP, and the fatty acid composition of the developing seeds is consistent with relative expression levels of different isoforms of acyl-ACP thioesterase and fatty acid desaturase genes. Several transcription factor genes whose expression coincides with storage reserve deposition correspond to those known to regulate the process in Arabidopsis. The results will facilitate searches for genes that influence de novo lipid synthesis, accumulation and their regulatory networks in developing physic nut seeds, and other oil seeds. Thus, they will be helpful in attempts to modify these plants for efficient biofuel production.
Angiuoli, Samuel V; Matalka, Malcolm; Gussman, Aaron; Galens, Kevin; Vangala, Mahesh; Riley, David R; Arze, Cesar; White, James R; White, Owen; Fricke, W Florian
2011-08-30
Next-generation sequencing technologies have decentralized sequence acquisition, increasing the demand for new bioinformatics tools that are easy to use, portable across multiple platforms, and scalable for high-throughput applications. Cloud computing platforms provide on-demand access to computing infrastructure over the Internet and can be used in combination with custom built virtual machines to distribute pre-packaged with pre-configured software. We describe the Cloud Virtual Resource, CloVR, a new desktop application for push-button automated sequence analysis that can utilize cloud computing resources. CloVR is implemented as a single portable virtual machine (VM) that provides several automated analysis pipelines for microbial genomics, including 16S, whole genome and metagenome sequence analysis. The CloVR VM runs on a personal computer, utilizes local computer resources and requires minimal installation, addressing key challenges in deploying bioinformatics workflows. In addition CloVR supports use of remote cloud computing resources to improve performance for large-scale sequence processing. In a case study, we demonstrate the use of CloVR to automatically process next-generation sequencing data on multiple cloud computing platforms. The CloVR VM and associated architecture lowers the barrier of entry for utilizing complex analysis protocols on both local single- and multi-core computers and cloud systems for high throughput data processing.
Low validity of Google Trends for behavioral forecasting of national suicide rates.
Tran, Ulrich S; Andel, Rita; Niederkrotenthaler, Thomas; Till, Benedikt; Ajdacic-Gross, Vladeta; Voracek, Martin
2017-01-01
Recent research suggests that search volumes of the most popular search engine worldwide, Google, provided via Google Trends, could be associated with national suicide rates in the USA, UK, and some Asian countries. However, search volumes have mostly been studied in an ad hoc fashion, without controls for spurious associations. This study evaluated the validity and utility of Google Trends search volumes for behavioral forecasting of suicide rates in the USA, Germany, Austria, and Switzerland. Suicide-related search terms were systematically collected and respective Google Trends search volumes evaluated for availability. Time spans covered 2004 to 2010 (USA, Switzerland) and 2004 to 2012 (Germany, Austria). Temporal associations of search volumes and suicide rates were investigated with time-series analyses that rigorously controlled for spurious associations. The number and reliability of analyzable search volume data increased with country size. Search volumes showed various temporal associations with suicide rates. However, associations differed both across and within countries and mostly followed no discernable patterns. The total number of significant associations roughly matched the number of expected Type I errors. These results suggest that the validity of Google Trends search volumes for behavioral forecasting of national suicide rates is low. The utility and validity of search volumes for the forecasting of suicide rates depend on two key assumptions ("the population that conducts searches consists mostly of individuals with suicidal ideation", "suicide-related search behavior is strongly linked with suicidal behavior"). We discuss strands of evidence that these two assumptions are likely not met. Implications for future research with Google Trends in the context of suicide research are also discussed.
Memetic algorithms for de novo motif-finding in biomedical sequences.
Bi, Chengpeng
2012-09-01
The objectives of this study are to design and implement a new memetic algorithm for de novo motif discovery, which is then applied to detect important signals hidden in various biomedical molecular sequences. In this paper, memetic algorithms are developed and tested in de novo motif-finding problems. Several strategies in the algorithm design are employed that are to not only efficiently explore the multiple sequence local alignment space, but also effectively uncover the molecular signals. As a result, there are a number of key features in the implementation of the memetic motif-finding algorithm (MaMotif), including a chromosome replacement operator, a chromosome alteration-aware local search operator, a truncated local search strategy, and a stochastic operation of local search imposed on individual learning. To test the new algorithm, we compare MaMotif with a few of other similar algorithms using simulated and experimental data including genomic DNA, primary microRNA sequences (let-7 family), and transmembrane protein sequences. The new memetic motif-finding algorithm is successfully implemented in C++, and exhaustively tested with various simulated and real biological sequences. In the simulation, it shows that MaMotif is the most time-efficient algorithm compared with others, that is, it runs 2 times faster than the expectation maximization (EM) method and 16 times faster than the genetic algorithm-based EM hybrid. In both simulated and experimental testing, results show that the new algorithm is compared favorably or superior to other algorithms. Notably, MaMotif is able to successfully discover the transcription factors' binding sites in the chromatin immunoprecipitation followed by massively parallel sequencing (ChIP-Seq) data, correctly uncover the RNA splicing signals in gene expression, and precisely find the highly conserved helix motif in the transmembrane protein sequences, as well as rightly detect the palindromic segments in the primary microRNA sequences. The memetic motif-finding algorithm is effectively designed and implemented, and its applications demonstrate it is not only time-efficient, but also exhibits excellent performance while compared with other popular algorithms. Copyright © 2012 Elsevier B.V. All rights reserved.
Suckau, Detlev; Resemann, Anja
2009-12-01
The ability to match Top-Down protein sequencing (TDS) results by MALDI-TOF to protein sequences by classical protein database searching was evaluated in this work. Resulting from these analyses were the protein identity, the simultaneous assignment of the N- and C-termini and protein sequences of up to 70 residues from either terminus. In combination with de novo sequencing using the MALDI-TDS data, even fusion proteins were assigned and the detailed sequence around the fusion site was elucidated. MALDI-TDS allowed to efficiently match protein sequences quickly and to validate recombinant protein structures-in particular, protein termini-on the level of undigested proteins.
Finding Protein and Nucleotide Similarities with FASTA
Pearson, William R.
2016-01-01
The FASTA programs provide a comprehensive set of rapid similarity searching tools ( fasta36, fastx36, tfastx36, fasty36, tfasty36), similar to those provided by the BLAST package, as well as programs for slower, optimal, local and global similarity searches ( ssearch36, ggsearch36) and for searching with short peptides and oligonucleotides ( fasts36, fastm36). The FASTA programs use an empirical strategy for estimating statistical significance that accommodates a range of similarity scoring matrices and gap penalties, improving alignment boundary accuracy and search sensitivity (Unit 3.5). The FASTA programs can produce “BLAST-like” alignment and tabular output, for ease of integration into existing analysis pipelines, and can search small, representative databases, and then report results for a larger set of sequences, using links from the smaller dataset. The FASTA programs work with a wide variety of database formats, including mySQL and postgreSQL databases (Unit 9.4). The programs also provide a strategy for integrating domain and active site annotations into alignments and highlighting the mutational state of functionally critical residues. These protocols describe how to use the FASTA programs to characterize protein and DNA sequences, using protein:protein, protein:DNA, and DNA:DNA comparisons. PMID:27010337
Finding Protein and Nucleotide Similarities with FASTA.
Pearson, William R
2016-03-24
The FASTA programs provide a comprehensive set of rapid similarity searching tools (fasta36, fastx36, tfastx36, fasty36, tfasty36), similar to those provided by the BLAST package, as well as programs for slower, optimal, local, and global similarity searches (ssearch36, ggsearch36), and for searching with short peptides and oligonucleotides (fasts36, fastm36). The FASTA programs use an empirical strategy for estimating statistical significance that accommodates a range of similarity scoring matrices and gap penalties, improving alignment boundary accuracy and search sensitivity. The FASTA programs can produce "BLAST-like" alignment and tabular output, for ease of integration into existing analysis pipelines, and can search small, representative databases, and then report results for a larger set of sequences, using links from the smaller dataset. The FASTA programs work with a wide variety of database formats, including mySQL and postgreSQL databases. The programs also provide a strategy for integrating domain and active site annotations into alignments and highlighting the mutational state of functionally critical residues. These protocols describe how to use the FASTA programs to characterize protein and DNA sequences, using protein:protein, protein:DNA, and DNA:DNA comparisons. Copyright © 2016 John Wiley & Sons, Inc.
Auditory Attentional Capture: Effects of Singleton Distractor Sounds
ERIC Educational Resources Information Center
Dalton, Polly; Lavie, Nilli
2004-01-01
The phenomenon of attentional capture by a unique yet irrelevant singleton distractor has typically been studied in visual search. In this article, the authors examine whether a similar phenomenon occurs in the auditory domain. Participants searched sequences of sounds for targets defined by frequency, intensity, or duration. The presence of a…
Latorre, Mariano; Silva, Herman; Saba, Juan; Guziolowski, Carito; Vizoso, Paula; Martinez, Veronica; Maldonado, Jonathan; Morales, Andrea; Caroca, Rodrigo; Cambiazo, Veronica; Campos-Vargas, Reinaldo; Gonzalez, Mauricio; Orellana, Ariel; Retamales, Julio; Meisel, Lee A
2006-11-23
Expressed sequence tag (EST) analyses provide a rapid and economical means to identify candidate genes that may be involved in a particular biological process. These ESTs are useful in many Functional Genomics studies. However, the large quantity and complexity of the data generated during an EST sequencing project can make the analysis of this information a daunting task. In an attempt to make this task friendlier, we have developed JUICE, an open source data management system (Apache + PHP + MySQL on Linux), which enables the user to easily upload, organize, visualize and search the different types of data generated in an EST project pipeline. In contrast to other systems, the JUICE data management system allows a branched pipeline to be established, modified and expanded, during the course of an EST project. The web interfaces and tools in JUICE enable the users to visualize the information in a graphical, user-friendly manner. The user may browse or search for sequences and/or sequence information within all the branches of the pipeline. The user can search using terms associated with the sequence name, annotation or other characteristics stored in JUICE and associated with sequences or sequence groups. Groups of sequences can be created by the user, stored in a clipboard and/or downloaded for further analyses. Different user profiles restrict the access of each user depending upon their role in the project. The user may have access exclusively to visualize sequence information, access to annotate sequences and sequence information, or administrative access. JUICE is an open source data management system that has been developed to aid users in organizing and analyzing the large amount of data generated in an EST Project workflow. JUICE has been used in one of the first functional genomics projects in Chile, entitled "Functional Genomics in nectarines: Platform to potentiate the competitiveness of Chile in fruit exportation". However, due to its ability to organize and visualize data from external pipelines, JUICE is a flexible data management system that should be useful for other EST/Genome projects. The JUICE data management system is released under the Open Source GNU Lesser General Public License (LGPL). JUICE may be downloaded from http://genoma.unab.cl/juice_system/ or http://www.genomavegetal.cl/juice_system/.
Latorre, Mariano; Silva, Herman; Saba, Juan; Guziolowski, Carito; Vizoso, Paula; Martinez, Veronica; Maldonado, Jonathan; Morales, Andrea; Caroca, Rodrigo; Cambiazo, Veronica; Campos-Vargas, Reinaldo; Gonzalez, Mauricio; Orellana, Ariel; Retamales, Julio; Meisel, Lee A
2006-01-01
Background Expressed sequence tag (EST) analyses provide a rapid and economical means to identify candidate genes that may be involved in a particular biological process. These ESTs are useful in many Functional Genomics studies. However, the large quantity and complexity of the data generated during an EST sequencing project can make the analysis of this information a daunting task. Results In an attempt to make this task friendlier, we have developed JUICE, an open source data management system (Apache + PHP + MySQL on Linux), which enables the user to easily upload, organize, visualize and search the different types of data generated in an EST project pipeline. In contrast to other systems, the JUICE data management system allows a branched pipeline to be established, modified and expanded, during the course of an EST project. The web interfaces and tools in JUICE enable the users to visualize the information in a graphical, user-friendly manner. The user may browse or search for sequences and/or sequence information within all the branches of the pipeline. The user can search using terms associated with the sequence name, annotation or other characteristics stored in JUICE and associated with sequences or sequence groups. Groups of sequences can be created by the user, stored in a clipboard and/or downloaded for further analyses. Different user profiles restrict the access of each user depending upon their role in the project. The user may have access exclusively to visualize sequence information, access to annotate sequences and sequence information, or administrative access. Conclusion JUICE is an open source data management system that has been developed to aid users in organizing and analyzing the large amount of data generated in an EST Project workflow. JUICE has been used in one of the first functional genomics projects in Chile, entitled "Functional Genomics in nectarines: Platform to potentiate the competitiveness of Chile in fruit exportation". However, due to its ability to organize and visualize data from external pipelines, JUICE is a flexible data management system that should be useful for other EST/Genome projects. The JUICE data management system is released under the Open Source GNU Lesser General Public License (LGPL). JUICE may be downloaded from or . PMID:17123449
Efficient privacy-preserving string search and an application in genomics.
Shimizu, Kana; Nuida, Koji; Rätsch, Gunnar
2016-06-01
Personal genomes carry inherent privacy risks and protecting privacy poses major social and technological challenges. We consider the case where a user searches for genetic information (e.g. an allele) on a server that stores a large genomic database and aims to receive allele-associated information. The user would like to keep the query and result private and the server the database. We propose a novel approach that combines efficient string data structures such as the Burrows-Wheeler transform with cryptographic techniques based on additive homomorphic encryption. We assume that the sequence data is searchable in efficient iterative query operations over a large indexed dictionary, for instance, from large genome collections and employing the (positional) Burrows-Wheeler transform. We use a technique called oblivious transfer that is based on additive homomorphic encryption to conceal the sequence query and the genomic region of interest in positional queries. We designed and implemented an efficient algorithm for searching sequences of SNPs in large genome databases. During search, the user can only identify the longest match while the server does not learn which sequence of SNPs the user queried. In an experiment based on 2184 aligned haploid genomes from the 1000 Genomes Project, our algorithm was able to perform typical queries within [Formula: see text] 4.6 s and [Formula: see text] 10.8 s for client and server side, respectively, on laptop computers. The presented algorithm is at least one order of magnitude faster than an exhaustive baseline algorithm. https://github.com/iskana/PBWT-sec and https://github.com/ratschlab/PBWT-sec shimizu-kana@aist.go.jp or Gunnar.Ratsch@ratschlab.org Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.
Efficient privacy-preserving string search and an application in genomics
Shimizu, Kana; Nuida, Koji; Rätsch, Gunnar
2016-01-01
Motivation: Personal genomes carry inherent privacy risks and protecting privacy poses major social and technological challenges. We consider the case where a user searches for genetic information (e.g. an allele) on a server that stores a large genomic database and aims to receive allele-associated information. The user would like to keep the query and result private and the server the database. Approach: We propose a novel approach that combines efficient string data structures such as the Burrows–Wheeler transform with cryptographic techniques based on additive homomorphic encryption. We assume that the sequence data is searchable in efficient iterative query operations over a large indexed dictionary, for instance, from large genome collections and employing the (positional) Burrows–Wheeler transform. We use a technique called oblivious transfer that is based on additive homomorphic encryption to conceal the sequence query and the genomic region of interest in positional queries. Results: We designed and implemented an efficient algorithm for searching sequences of SNPs in large genome databases. During search, the user can only identify the longest match while the server does not learn which sequence of SNPs the user queried. In an experiment based on 2184 aligned haploid genomes from the 1000 Genomes Project, our algorithm was able to perform typical queries within ≈ 4.6 s and ≈ 10.8 s for client and server side, respectively, on laptop computers. The presented algorithm is at least one order of magnitude faster than an exhaustive baseline algorithm. Availability and implementation: https://github.com/iskana/PBWT-sec and https://github.com/ratschlab/PBWT-sec. Contacts: shimizu-kana@aist.go.jp or Gunnar.Ratsch@ratschlab.org Supplementary information: Supplementary data are available at Bioinformatics online. PMID:27153731
Protein structure prediction with local adjust tabu search algorithm
2014-01-01
Background Protein folding structure prediction is one of the most challenging problems in the bioinformatics domain. Because of the complexity of the realistic protein structure, the simplified structure model and the computational method should be adopted in the research. The AB off-lattice model is one of the simplification models, which only considers two classes of amino acids, hydrophobic (A) residues and hydrophilic (B) residues. Results The main work of this paper is to discuss how to optimize the lowest energy configurations in 2D off-lattice model and 3D off-lattice model by using Fibonacci sequences and real protein sequences. In order to avoid falling into local minimum and faster convergence to the global minimum, we introduce a novel method (SATS) to the protein structure problem, which combines simulated annealing algorithm and tabu search algorithm. Various strategies, such as the new encoding strategy, the adaptive neighborhood generation strategy and the local adjustment strategy, are adopted successfully for high-speed searching the optimal conformation corresponds to the lowest energy of the protein sequences. Experimental results show that some of the results obtained by the improved SATS are better than those reported in previous literatures, and we can sure that the lowest energy folding state for short Fibonacci sequences have been found. Conclusions Although the off-lattice models is not very realistic, they can reflect some important characteristics of the realistic protein. It can be found that 3D off-lattice model is more like native folding structure of the realistic protein than 2D off-lattice model. In addition, compared with some previous researches, the proposed hybrid algorithm can more effectively and more quickly search the spatial folding structure of a protein chain. PMID:25474708