Based on Scientific Data Engineering Pipelines, Language Modeling, Correlation Matrix Datasets & Visualization.
Relationships can be validated based on historical time-series analysis or used to predict a context-dependent connection between proteins or drug compounds.
Roughly 2,000 new peer-reviewed papers are published every 24 hours through sources such as the National Library of Medicine’s PubMed. Real-time updates keep the visualization current: as correlations change, the distance between protein nodes changes to reflect the updated strength of each relationship.
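As a rough illustration of how an updated correlation could move nodes in the layout, the sketch below maps a correlation score in [0, 1] to a display distance, so that stronger correlations pull nodes closer together. The linear mapping `1 - correlation`, the function name `correlation_to_distance` and the example scores are assumptions made for illustration; the mapping actually used by the visualization is not specified here.

```python
def correlation_to_distance(correlation: float, max_distance: float = 1.0) -> float:
    """Map a correlation score in [0, 1] to a layout distance.

    Assumed linear mapping: a correlation of 1.0 gives distance 0,
    a correlation of 0.0 gives max_distance.
    """
    if not 0.0 <= correlation <= 1.0:
        raise ValueError("correlation must be in [0, 1]")
    return max_distance * (1.0 - correlation)


# Hypothetical scores for one protein pair before and after a literature update.
before = correlation_to_distance(0.25)
after = correlation_to_distance(0.75)
print(f"distance before update: {before}, after update: {after}")
```

With this mapping, a pair whose correlation rises after new papers are ingested is drawn closer together on the next refresh.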
Visualize changes between proteins based on real-time published literature and context-dependent relationships.
A ‘sunburst’ visualization around each protein node enables interpretation of known and hidden relationships to references, proteins, pathways, diseases and drug compounds.
Real-time correlation matrix datasets are available via standard REST APIs and can be used to generate new kinds of clusters, graph networks and relationship networks based on language modeling.
Node | Protein A | Protein B | Drug Compound |
---|---|---|---|
NME4 | 0.27 | 0.107 | 0.839 |
OPA1 | 0.109 | 0.612 | 0.967 |
Cardiolipin | 0.848 | 0.487 | 0.957 |
... | ... | ... | ... |
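A minimal sketch of how a matrix like the sample above could be turned into graph edges. It assumes the first cell of each row is a node label and the remaining three cells are scores against Protein A, Protein B and Drug Compound. The function name `build_edges` and the 0.8 cutoff are illustrative assumptions, and the REST call is left out because the endpoint is not given here; the values are simply copied from the sample rows.

```python
# Sample scores copied from the table above; the column meanings are assumed.
SCORES = {
    "NME4": {"Protein A": 0.27, "Protein B": 0.107, "Drug Compound": 0.839},
    "OPA1": {"Protein A": 0.109, "Protein B": 0.612, "Drug Compound": 0.967},
    "Cardiolipin": {"Protein A": 0.848, "Protein B": 0.487, "Drug Compound": 0.957},
}


def build_edges(scores, threshold):
    """Return (row, column, score) edges whose score meets the threshold,
    sorted from strongest to weakest."""
    edges = [
        (row, col, score)
        for row, cols in scores.items()
        for col, score in cols.items()
        if score >= threshold
    ]
    return sorted(edges, key=lambda edge: edge[2], reverse=True)


edges = build_edges(SCORES, threshold=0.8)  # 0.8 is an arbitrary cutoff
for row, col, score in edges:
    print(f"{row} -- {col}: {score}")
```

The resulting edge list can be handed to a graph or clustering library of choice; lowering the threshold produces a denser network.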
A chain of evidence can be used to validate and explain how two protein nodes are related, how strong the relationship is, and in what context it exists. Data provenance, governance, lineage and security are handled on-chain based on a utility-token, wallet-enabled API.
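On-chain provenance schemes commonly record a cryptographic digest of a dataset rather than the data itself, so that anyone holding the dataset can later check it against the recorded value. The sketch below shows that idea only, assuming SHA-256 over a canonical JSON encoding. The function name `dataset_digest`, the choice of hash and the matrix values are hypothetical; the chain, token and API used by the platform are not described here.

```python
import hashlib
import json


def dataset_digest(matrix: dict) -> str:
    """Return a SHA-256 hex digest of a canonical JSON encoding of the dataset.

    Sorting keys and fixing separators makes the digest independent of
    dictionary insertion order.
    """
    canonical = json.dumps(matrix, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical correlation values, for illustration only.
matrix = {"NME4": {"OPA1": 0.5}, "OPA1": {"NME4": 0.5}}
reordered = {"OPA1": {"NME4": 0.5}, "NME4": {"OPA1": 0.5}}
changed = {"NME4": {"OPA1": 0.6}, "OPA1": {"NME4": 0.6}}

print(dataset_digest(matrix) == dataset_digest(reordered))  # same content, same digest
print(dataset_digest(matrix) == dataset_digest(changed))    # any edit changes the digest
```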
Today, language models from efforts such as DeepMind's AlphaFold2 and OpenAI's GPT-3 are at the core of recent breakthroughs and discoveries in computational biology, ranging from predicting the mathematics of knots to predicting how a protein folds from its sequence of amino acids (Jumper, J., Evans, R., Pritzel, A. et al. Nature 596, 583–589, 2021; K. L. Saar, 2021). Language models such as geneBERT are now being adapted to enable new insights and interpretations in multi-omics data. Language models are accelerating the generation of new hypotheses and discoveries in the biosciences, including drug discovery (arXiv:2201.09647), drug repurposing/repositioning and the prediction of new chemical compounds.
Language modeling can enable the detection of hidden relationships between human proteins associated with stressors resulting from microgravity, such as anemia and mitochondrial stress (Trudel, G., Shahin, N., Ramsay, T. et al. Nature Medicine 2022) and muscle atrophy and ocular stress (Blakely, E 2015). For example, although two proteins may never be mentioned together in the same research paper, language modeling can still reveal a well-defined implicit relationship between them. This relationship can remain hidden in the literature and data sources until a researcher makes the connection. Hidden relationships can also be detected between drug compounds, drug cocktails and micronutrients associated with the development of countermeasures for stressors during spaceflight.
Our network of data engineering pipelines uses an ensemble of language models based on Natural Language Understanding (NLU) to generate feature vectors of key/value pairs representing scored attributes. These vectors are transformed into correlation matrix datasets, which are used to generate clusters, graph networks and relationship networks. Hidden relationships between protein nodes are represented in a visualization. A hidden relationship is defined as two protein nodes whose features intersect in the vector space even though the two nodes never appear together in the same fixed window or paper during the language modeling process. Datasets are hashed on the blockchain, enabling data provenance and security.
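The definition of a hidden relationship can be sketched in a few lines. The example assumes feature vectors stored as dictionaries of scored attributes, uses cosine similarity as a stand-in for "intersecting features" and tracks co-occurrence as a set of pairs. The node names `P1` to `P3`, the attribute names, the scores and the 0.5 threshold are all hypothetical placeholders, not output of the pipeline described above.

```python
import math
from itertools import combinations


def cosine(u: dict, v: dict) -> float:
    """Cosine similarity between two sparse feature vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def hidden_relationships(vectors, cooccurring, threshold):
    """Return pairs with similar features that never co-occur in a window or paper."""
    hidden = []
    for a, b in combinations(sorted(vectors), 2):
        if frozenset((a, b)) in cooccurring:
            continue  # an explicit co-mention is not a hidden relationship
        score = cosine(vectors[a], vectors[b])
        if score >= threshold:
            hidden.append((a, b, round(score, 3)))
    return hidden


# Placeholder nodes and scored attributes, for illustration only.
vectors = {
    "P1": {"mitochondria": 0.9, "membrane": 0.4},
    "P2": {"mitochondria": 0.8, "membrane": 0.5},
    "P3": {"retina": 0.7},
}
cooccurring = {frozenset(("P1", "P3"))}  # pairs seen together in some paper

result = hidden_relationships(vectors, cooccurring, threshold=0.5)
print(result)
```

Here `P1` and `P2` share strongly weighted attributes but are never mentioned together, so they are the only pair reported; `P1` and `P3` are skipped because they co-occur, and `P2` and `P3` share no features.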