As Data Flows Surge, Data Infrastructure Groans


To prevent data gridlock, data management platforms aren’t just expanding, they’re also instituting new rules of the road

Infrastructure tends to be taken for granted until it shows signs of strain. These signs include bumper-to-bumper traffic on our roadways, canceled flights at our airports, and—perhaps most vexingly—unchanging status indicators on our computer monitors. When infrastructure performance (or rather the lack of it) finally comes to our attention, we may hope to rid ourselves of our old infrastructure and start afresh. Yet we often find it necessary to keep old infrastructure elements and build new ones around them. We add lanes, expand terminals, and integrate new hardware and software systems.

Even thoroughly renovated and seemingly new data infrastructure is likely to be a mix of old and new, complicating data management. This issue was recognized by the scientists who formulated the FAIR Guiding Principles. The FAIR principles—which encompass Findability, Accessibility, Interoperability, and Reusability for scientific data management and stewardship—were formally introduced in an article that appeared in Scientific Data in 2016.

“Good data management is not a goal in itself,” the article stated. “[It] is the key conduit leading to knowledge discovery and innovation, and to subsequent data and knowledge integration and reuse by the community after the data publication process.”

Although the FAIR principles originally emphasized implementations by open-source repositories or community-level initiatives, they were also seen as being applicable everywhere big data reigns, supporting “integrative research, reproducibility, and reuse in general.”

Today, big data is bigger than ever. And as the FAIR team anticipated, it has also become “more diverse and less integrated … exacerbating the discovery and reusability problem for both human and computational stakeholders.” How might these stakeholders cope with their big data challenges? For answers to that question, GEN spoke with several of the luminaries who planned to deliver talks at the 20th Anniversary Bio-IT Conference and Expo, which was held May 3–5.

In the biopharmaceutical industry, practically everyone is aware that data typically resides in “silos,” where its value may be limited simply because it is difficult to access. Even if data is generated in giga-, tera-, and petabyte amounts, it may be accessible only to those stakeholders directly involved in its creation. Even favored stakeholders may reach for data only to find that it has disappeared.

To overcome the data silo problem, it is necessary to “bring data together,” says Daniel Herzig-Sommer, PhD, the COO of Metaphacts, a software company based in Germany that develops technology for knowledge graphs, which are also known as semantic networks. Knowledge graphs, the company explains, are becoming “both the repository for organization-wide master data (ontological schema and static reference knowledge) as well as the integration hub for various legacy data sources (for example, relational databases or data streams).”

Knowledge graph technology can “drive knowledge democratization and decision intelligence” in the pharmaceutical and life sciences industries, Herzig-Sommer maintains. “One of the strong points of the knowledge graph approach is that you can link data and make it ‘FAIR.’”
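As a rough illustration of what “linking data” in a knowledge graph looks like in practice, the sketch below uses the open-source rdflib library in Python to merge records from two hypothetical silos into a single RDF graph and then query across them with SPARQL. The namespace, identifiers, and properties are invented for the example; Metaphacts’ own platform is not shown here.

```python
# Minimal sketch (not Metaphacts' platform): linking records from two
# hypothetical silos in one RDF knowledge graph, then querying across them.
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/")      # invented namespace for the example
g = Graph()
g.bind("ex", EX)

# Silo 1: a compound registered by a chemistry team
compound = EX["compound/C-0001"]
g.add((compound, RDF.type, EX.Compound))
g.add((compound, EX.label, Literal("aspirin")))

# Silo 2: an assay result produced by a biology team, linked to the same compound
assay = EX["assay/IC50-0001"]
g.add((assay, RDF.type, EX.AssayResult))
g.add((assay, EX.measures, compound))      # the link that breaks the silo
g.add((assay, EX.ic50_nM, Literal(42.0)))

# One SPARQL query now spans both former silos
results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?label ?ic50 WHERE {
        ?a ex:measures ?c ;
           ex:ic50_nM ?ic50 .
        ?c ex:label ?label .
    }
""")
for label, ic50 in results:
    print(label, ic50)
```

Because both teams refer to the compound by the same URI, the query traverses the link without bespoke integration code, and a new use case can add its own triples to the same graph rather than spawning another silo.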

At Bio-IT World, Herzig-Sommer described how knowledge graph–driven FAIR data platforms can empower end users and machines to access and consume knowledge “intuitively and in context.” He also discussed best practices for building a semantic layer atop a data mesh to enable domain experts.

“[Adhering to the FAIR principles] allows you to bring data together and interlink it,” he asserts. “One of the benefits is the breaking up of data silos.” Another benefit is the conservation of data. “Data is usually created and prepared for a specific use case,” he explains. “If the use case ceases to exist, the data is discarded or not reused anymore.

“Our approach allows you to reuse the data, build on top of it, and bring another use case, enriching the data and making the knowledge graph grow use case by use case. And you will still have the interconnected, integrated data available.”

Although data silos are frowned upon, they will continue to exist. Indeed, there are, Herzig-Sommer acknowledges, “natural” data silos. They may occur wherever divisions are found—across hallways, across universities, across companies, across borders, and especially across clinical trials. Consequently, it will always be necessary to reach across these divisions, especially if the biopharmaceutical industry is to become more collaborative.

“You have to be very clever about which tests, which assays, you actually use in the laboratory,” says Christof Gänzler, PhD, biology domain product marketing manager, PerkinElmer Informatics. “If your assays are to be part of a big data scenario, they will need to be automated. They will also need to be associated with each other.”

In other words, assay automation isn’t just for the assay itself, but also for the data transformation and analysis workflow. Unfortunately, this broader conception of automation is hard to sustain if data generated by laboratory instruments is directed to generic software tools like Excel and GraphPad Prism.

“Every time data comes in from an instrument, it comes in an Excel format, text files, or some other format which scientists have to parse into a spreadsheet and add equations and metadata to start analyzing their experiment results,” Gänzler stated in a PerkinElmer white paper. “They repeat this process every time they come up with new data. There is often no automated process. The data is not searchable. You can’t use it to run further analyses, and you can’t share how you arrived at your conclusions so others can repeat your data analysis.”
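What automating that instrument-to-analysis step might look like, in very reduced form, is sketched below: one Python function parses each raw export, attaches metadata, applies the same transformation every time, and writes the results to a searchable store. The export layout, column names, and normalization formula are hypothetical.

```python
# A reduced sketch of automating the instrument-to-analysis step; the export
# layout, column names, and normalization formula are hypothetical.
from pathlib import Path
import pandas as pd

def ingest_plate_export(path: Path, assay_id: str, operator: str) -> pd.DataFrame:
    """Parse one raw plate-reader export and attach the metadata that would
    otherwise be pasted into a spreadsheet by hand."""
    df = pd.read_csv(path)                     # instrument drops one CSV per run
    df["assay_id"] = assay_id
    df["operator"] = operator
    df["source_file"] = path.name              # keeps every result traceable
    # Same transformation every time: percent inhibition vs. plate controls
    neg = df.loc[df["well_type"] == "neg_control", "signal"].mean()
    pos = df.loc[df["well_type"] == "pos_control", "signal"].mean()
    df["pct_inhibition"] = 100 * (neg - df["signal"]) / (neg - pos)
    return df

# Every new export flows through the same code path, so the results stay
# searchable and anyone can rerun the analysis to reproduce a conclusion.
runs = [ingest_plate_export(p, assay_id="IC50-panel", operator="cg")
        for p in Path("exports").glob("*.csv")]
pd.concat(runs, ignore_index=True).to_parquet("ic50_panel.parquet")
```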

Gänzler proposes that data interpretation should be standardized across all platforms, assays, and users. This task can be facilitated by the centralization of data. Indeed, this approach can make high-level programming almost obsolete. For example, scientists interested in running a data analysis wouldn’t have to rely on complex, custom-designed code. Instead, they could simply query a database and receive a clean, understandable return.
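A minimal sketch of that query-instead-of-code idea, assuming assay results have already been centralized in a relational store with a hypothetical schema:

```python
# Sketch of the query-instead-of-code idea, assuming assay results have been
# centralized in a relational store; table and column names are hypothetical.
import sqlite3

conn = sqlite3.connect("assay_results.db")
rows = conn.execute(
    """
    SELECT compound_id,
           AVG(pct_inhibition) AS mean_inhibition,
           COUNT(*)            AS n_wells
    FROM results
    WHERE assay_id = ?
    GROUP BY compound_id
    ORDER BY mean_inhibition DESC
    """,
    ("IC50-panel",),
).fetchall()

for compound_id, mean_inhibition, n_wells in rows:
    print(f"{compound_id}: {mean_inhibition:.1f}% inhibition across {n_wells} wells")
```

The scientist gets a clean, understandable return without writing or maintaining custom analysis code.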

According to Gänzler, the benefits of centralized data management may become evident even before a query is processed. For example, a scientist-user could see the number of databases that will respond to the query, as well as an estimate of the quality of the response the query will generate. The scientist-user could even adjust their query before formally submitting it.

Another approach to data management is followed by Genestack, a U.K.-based life sciences research and development informatics firm. Genestack’s flagship product, Omics Data Manager (ODM), is designed to help organizations create a FAIR catalogue of multiomics investigation elements (studies, samples, and data). It provides tools for curating rich, standardized metadata in bulk, as well as optimized Representational State Transfer–compliant (RESTful) APIs for scalable, integrative cross-study, cross-omics search.
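The sketch below suggests what a cross-study metadata search against such a RESTful catalogue might look like from a client’s perspective. The host, endpoint, parameters, and response fields are placeholders for illustration only; they are not ODM’s documented API.

```python
# Hypothetical client-side sketch of a cross-study metadata search against a
# RESTful catalogue. The host, endpoint, parameters, and response fields are
# placeholders for illustration; they are not ODM's documented API.
import requests

BASE_URL = "https://odm.example.org/api"   # placeholder host
TOKEN = "REPLACE_WITH_PLATFORM_TOKEN"

resp = requests.get(
    f"{BASE_URL}/studies/search",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"query": "disease:rheumatoid arthritis AND assay:RNA-seq",
            "page_size": 50},
    timeout=30,
)
resp.raise_for_status()
for study in resp.json().get("studies", []):
    print(study.get("accession"), study.get("title"))
```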

Recently, Genestack personnel (Kevin Dialdestoro, head of data science consulting, and Kelsey Luu, an artificial intelligence/machine learning engineer intern) posted an article on the company’s website describing how artificial intelligence may be used to derive pathway insights from multiomics datasets.

“The model was applied on curated/re-processed public datasets spanning multiple tissue types and autoimmune diseases, revealing relevant and important biological pathways,” they wrote. “Benchmarking simulations demonstrated that our approach is more robust than standard methods.

“Currently, the standard approach for identifying perturbed pathways from gene expression data is a disjointed analysis of differential expression followed by pathway enrichment. However, recent publications suggest that more sophisticated artificial intelligence approaches demonstrate promise as a means for modeling high-complexity systems like biological networks.”
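For readers unfamiliar with that “standard approach,” the toy sketch below walks through its two steps, per-gene differential expression followed by a pathway enrichment test, using synthetic data and SciPy.

```python
# Toy walkthrough of the two disjointed steps: per-gene differential
# expression, then a hypergeometric pathway-enrichment test. The expression
# matrix and the pathway gene set are synthetic.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
genes = [f"gene{i}" for i in range(200)]
case = rng.normal(0, 1, size=(200, 10))    # 200 genes x 10 case samples
ctrl = rng.normal(0, 1, size=(200, 10))    # 200 genes x 10 control samples
case[:20] += 1.5                           # perturb the first 20 genes

# Step 1: differential expression (per-gene t-test)
pvals = stats.ttest_ind(case, ctrl, axis=1).pvalue
significant = {g for g, p in zip(genes, pvals) if p < 0.01}

# Step 2: is a pathway over-represented among the significant genes?
pathway = set(genes[:30])                  # toy pathway of 30 genes
overlap = len(significant & pathway)
p_enrich = stats.hypergeom.sf(overlap - 1, len(genes),
                              len(pathway), len(significant))
print(f"{overlap} pathway genes among {len(significant)} DE genes, "
      f"enrichment p = {p_enrich:.2e}")
```

The two steps run independently, which is the disjointed quality the Genestack authors contrast with models that treat biological networks jointly.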

It is a good time to explore how artificial intelligence models may derive more robust and novel pathways from multiomics datasets. According to Misha Kapushesky, Genestack’s founder and CEO, the availability of standardized, integrated, and well-annotated omics data is growing. He adds, “We’ve already reached a point where artificial intelligence is delivering interesting molecules.” Finally, he suggests that modern data analysis has become easier and more user friendly over the last two decades. Instead of having to write custom data analytics from scratch, it is possible, he says, to “write three lines of code and shove it into an algorithm” like TensorFlow.
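The “three lines of code” remark is only slightly hyperbolic. A minimal, illustrative Keras model over a synthetic expression matrix might look like the sketch below; the data, labels, and architecture are placeholders, not Genestack’s pipeline.

```python
# Illustrative only: a minimal Keras classifier over a synthetic expression
# matrix (samples x genes). The data, labels, and architecture are
# placeholders, not Genestack's pipeline.
import numpy as np
import tensorflow as tf

X = np.random.rand(100, 2000).astype("float32")  # 100 samples, 2,000 genes
y = np.random.randint(0, 2, size=100)            # e.g., disease vs. control

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=16, verbose=0)
```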

In their article, Dialdestoro and Luu identified a number of challenges: harmonizing metadata and expression data across your private/public experiments; tracking diverse subject-sample-data relationships; and integrating and indexing molecular and phenotypic data in a scalable manner.

“For a long time, genomics has [driven the development and adoption of] computational technology in the life sciences,” says Ari E. Berman, PhD, the CEO at BioTeam, a life sciences consulting firm. “‘High-performance everything’ started to become required just because of the amount of data that was being generated. And it became clear, after not too long of a time, that doing genomics on your laptop just wasn’t feasible anymore. Two full human genomes would fill up your laptop.

“There are newer technologies like light sheet microscopes and lattice light sheet microscopes that capture an entire organism at once, in multiple wavelengths, and in shockingly high detail. Those devices can generate 25 terabytes of data.”

He adds that the management of vast amounts of data requires more than just sheer computational firepower: “It’s more complex than [just buying more computers and more storage capabilities]. The problem isn’t just having equipment, the problem is how you handle the data.”

Data management still needs to respect the old adage, “Garbage in, garbage out.” It applies whether sequencing data or multi-terabyte cryo-electron microscopy data is being collected. Data management also has to ensure that elements such as inputs, interpretation processes, and even file extensions comply with widely accepted standards. (Notice that the qualification “widely accepted” precedes the word “standards.”)
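One small, concrete expression of the “garbage in, garbage out” point is an intake check that rejects files whose formats fall outside an agreed-upon set before they enter a pipeline. The accepted list below is illustrative only, not a formal standard.

```python
# Illustrative intake check: reject files whose formats fall outside an
# agreed set before they enter the pipeline. The accepted list is an example,
# not an official standard.
from pathlib import Path

ACCEPTED_SUFFIXES = (".fastq.gz", ".bam", ".cram", ".vcf.gz", ".mrc")

def rejected_inputs(directory: str) -> list[Path]:
    """Return incoming files that do not match any accepted format."""
    bad = []
    for path in Path(directory).iterdir():
        if path.is_file() and not path.name.lower().endswith(ACCEPTED_SUFFIXES):
            bad.append(path)                   # garbage stays out of the pipeline
    return bad

for path in rejected_inputs("incoming"):
    print(f"rejected: {path.name} (unrecognized format)")
```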

“The problem is how the data is treated,” Berman explains. “Right now, it’s the Wild West.” In other words, multiple standards are vying for acceptance. He indicates that standardization issues arise in every aspect of the biopharma world. These aspects include patient stratification criteria, electronic health record formats, and workflows for the analysis of sequencing information.
