A Gene-Phenotype Relationship Extraction Pipeline from the Biomedical Literature Using a Representation Learning Approach
Motivation: A fundamental challenge of modern genetic analysis is to establish gene-phenotype correlations, which are often reported across a large body of publications. Because the lexical features of genes are relatively regular in text, the main challenge of this relation extraction is phenotype recognition. Because phenotypic descriptions are often study- or author-specific, few lexicons can effectively identify the full range of phenotypic expressions in text, especially for plants.
Methods: We propose a pipeline for extracting phenotypes, genes, and their relations from the biomedical literature. By combining abbreviation revision and sentence template extraction, we improve an unsupervised word-embedding-to-sentence-embedding cascaded approach, used as representation learning, to recognize the broad variety of phenotypic information in the literature. In addition, a dictionary- and rule-based method is applied for gene recognition. Finally, we integrate OLLIE, a well-known information extraction system, to identify gene-phenotype relations.
Results: To demonstrate the applicability of the pipeline, we conducted two types of comparison experiments using the model organism Arabidopsis thaliana. In the comparison with state-of-the-art baselines, our approach obtained the best performance (F1-measure of 66.83%). We also applied the pipeline to 481 full-text articles from the TAIR manual gene-phenotype relationship dataset to assess its validity. The results showed that the proposed pipeline covers 70.94% of the original dataset and adds 373 new relations to expand it.
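The word-embedding-to-sentence-embedding idea mentioned in the Methods can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a sentence embedding is simply the mean of its word vectors and that candidate phrases are scored by cosine similarity to a seed phenotype phrase. The toy vectors in `WORD_VECTORS` and the function names are hypothetical.

```python
import math

# Hypothetical toy word vectors; a real pipeline would learn these from a corpus.
WORD_VECTORS = {
    "late": [0.9, 0.1, 0.0],
    "flowering": [0.8, 0.2, 0.1],
    "dwarf": [0.7, 0.3, 0.0],
    "plants": [0.6, 0.3, 0.2],
    "pcr": [0.0, 0.1, 0.9],
    "amplification": [0.1, 0.0, 0.8],
}


def sentence_embedding(tokens):
    """Average the vectors of known tokens (assumed composition rule)."""
    vectors = [WORD_VECTORS[t] for t in tokens if t in WORD_VECTORS]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def phenotype_score(tokens, seed_tokens):
    """Score a candidate phrase against a seed phenotype phrase."""
    candidate = sentence_embedding(tokens)
    seed = sentence_embedding(seed_tokens)
    if candidate is None or seed is None:
        return 0.0
    return cosine(candidate, seed)


seed = ["late", "flowering"]
print(phenotype_score(["dwarf", "plants"], seed))
print(phenotype_score(["pcr", "amplification"], seed))
```

With these toy vectors, the phenotype-like phrase scores higher against the seed than the methods-like phrase, which is the behaviour a similarity-based recognizer relies on.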
Phenotype Extraction Based on Word Embedding to Sentence Embedding Cascaded Approach
Identifying Overlapping Protein Complexes in Yeast Protein Interaction Network via Fuzzy Clustering
Bar charts detection and analysis in biomedical publication
Bar charts offer a concise way of summarizing and communicating multi-faceted data sets in biomedical publications.
Complex relationships, including quantitative information, shown by bar charts are of great interest to scientists and
practitioners, which makes it valuable to parse bar charts. This fact, together with the abundance of bar chart images
and their shared common patterns, makes them good candidates for automated image mining and parsing. We
demonstrate a workflow to analyze bar charts and give a few feasible solutions for applying it. We are able to detect bar
segments and panels with high accuracy and recall, and we present preliminary results for the identification of entities in
these images. While we cannot provide a complete instance of the application using our method, we present evidence
that this kind of image mining is feasible.
Development and validation of InDel markers for identification of QTL underlying flowering time in soybean
Quantitative Trait Locus Mapping of Soybean Maturity Gene E6
Natural variation at the soybean J locus improves adaptation to the tropics and enhances yield
GmILPA1, Encoding an APC8-like Protein, Controls Leaf Petiole Angle in Soybean
Efficiently predicting large-scale protein-protein interactions using MapReduce
Multiscale Crossing Representation Using Combined Feature of Contour and Venation for Leaf Image Identification
InDel marker detection by integration of multiple softwares using machine learning techniques
Background: In biological experiments on soybean, molecular markers are widely used to verify the soybean genome or construct its genetic map. Among the variety of molecular markers, insertions and deletions (InDels) are preferred for their wide distribution and high density at the whole-genome level. Hence, detecting InDels from next-generation sequencing data is of great importance for the design of InDel markers. To tackle this problem, this paper integrated machine learning techniques with existing software and developed two algorithms for InDel detection: the best F-score method (BF-M) and the Support Vector Machine (SVM) method (SVM-M), which is based on the classical SVM model.
Results: The experimental results show that the performance of BF-M was promising as indicated by the high precision and recall scores, whereas SVM-M yielded the best performance in terms of recall and F-score. Moreover, based on the InDel markers detected by SVM-M from soybeans that were collected from 56 different regions, highly polymorphic loci were selected to construct an InDel marker database for soybean.
Conclusions: Compared to existing software tools, the two algorithms proposed in this work produced substantially higher precision and recall scores, and remained stable in various types of genomic regions. Moreover, based on SVM-M, we have constructed a database for soybean InDel markers and published it for academic research.
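The precision, recall, and F-score figures referred to above can be made concrete with a small sketch. The abstract does not describe how BF-M works, so the code below is only an illustration under stated assumptions: InDel calls from several tools are combined by voting, and the vote threshold with the best F-score against a truth set is kept. The function names, the `(chromosome, position)` call format, and the example calls are all hypothetical.

```python
from collections import Counter


def precision_recall_f1(predicted, truth):
    """Return (precision, recall, F1) for two sets of InDel calls."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def best_vote_threshold(calls_per_tool, truth):
    """Pick the minimum number of agreeing tools that maximizes F1.

    This voting scheme is an assumption for illustration, not the published BF-M.
    """
    votes = Counter(call for calls in calls_per_tool for call in set(calls))
    best = None
    for k in range(1, len(calls_per_tool) + 1):
        predicted = {call for call, n in votes.items() if n >= k}
        f1 = precision_recall_f1(predicted, truth)[2]
        if best is None or f1 > best[1]:
            best = (k, f1)
    return best


# Hypothetical calls as (chromosome, position) pairs from three tools.
tool_a = {("Chr01", 100), ("Chr01", 200), ("Chr01", 300), ("Chr01", 400)}
tool_b = {("Chr01", 200), ("Chr01", 300), ("Chr01", 500)}
tool_c = {("Chr01", 300), ("Chr01", 600)}
truth = {("Chr01", 200), ("Chr01", 300)}

print(best_vote_threshold([tool_a, tool_b, tool_c], truth))
```

In this toy example, requiring agreement between at least two tools recovers exactly the truth set, whereas accepting any single call lowers precision and requiring all three lowers recall.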
Transcriptome Sequencing Identified Genes and Gene Ontologies Associated with Early Freezing Tolerance in Maize
A Global Analysis of the Polygalacturonase Gene Family in Soybean (Glycine max)
Identification of additional QTLs for flowering time by removing the effect of the maturity gene E1 in soybean
Sequence composition of BAC clones and SSR markers mapped to Upland cotton chromosomes 11 and 21 targeting resistance to soil-borne pathogens
GmCOL1a and GmCOL1b function as flowering repressor in soybean under long-day conditions
The functions of legume CO genes in controlling flowering remain unknown. Here, we analyze the
expression patterns of E1, E2 and GmCOL1a/1b using near-isogenic lines (NILs), and we further
analyze flowering-related genes in gmcol1b mutants and GmCOL1a-overexpressing plants. Our data
showed that both E3 and E4 up-regulate E1 expression, with the effect of E3 on E1 being greater than
the effect of E4 on E1. E2 was up-regulated by E3 and E4 but down-regulated by E1. GmCOL1a/1b
were up-regulated by E1, E2, E3 and E4. Although the spatial and temporal patterns of GmCOL1a/1b
expression were more similar to those of AtCOL2 than to those of AtCO, gmcol1b mutants flowered
earlier than wild-type plants under long-day (LD) conditions, and the overexpression of GmCOL1a
caused late flowering under LD or natural conditions. In addition, GmFT2a/5a, E1 and E2 were
down-regulated in GmCOL1a-overexpressing plants under LD conditions. Because E1/2 influences the
expression of GmCOL1a, and vice versa, we conclude that these genes may function as part of a
negative feedback loop, and GmCOL1a/b genes may serve as suppressors in photoperiodic flowering
in soybean under LD conditions.
GmmiR156b overexpression delays flowering time in soybean
QTL mapping for flowering time in different latitude in soybean
PopGeV: A Web-based Large-scale Population Genome Browser
Dual functions of GmTOE4a in the regulation of photoperiod-mediated flowering and plant morphology in soybean
QTLMiner: QTL database curation by mining tables in literature
Availability: QTLMiner is available at www.soyomics.com/qtlminer/.
A New Dominant Gene E9 Conditions Early Flowering and Maturity in Soybean
Genetic Variation in Soybean at the Maturity Locus E4 Is Involved in Adaptation to Long Days at High Latitudes
Molecular identification of genes controlling flowering time, maturity, and photoperiod response in soybean
HETEROGENEITY-INDUCED SPOT DYNAMICS FOR A THREE-COMPONENT REACTION-DIFFUSION SYSTEM
Cloning and Expression Analysis of GmMYB Genes Induced by Abiotic Stresses
Overview of Mollisols in the world: Distribution, land use and management
Onset of unidirectional pulse propagation in an excitable medium with asymmetric heterogeneity
Dynamics of traveling pulses in heterogeneous media
Heterogeneity-induced defect bifurcation and pulse dynamics for a three-component reaction-diffusion system
Particle swarm optimization with multiscale searching method