Data preparation, cataloging, and feature management for a massive genomic dataset containing sensitive information
At the Karolinska Institute’s center for cervical cancer prevention, sequencing machines have generated more than 800 TB of next-generation sequencing data, which requires both low-cost storage and secure, large-scale processing by researchers.
Researchers at the center use large amounts of data from omics analyses to gain new insights into the biology of viruses. For an individual research group, or even a single university, volumes this large would be very difficult or even impossible to process. Providing standardized solutions at this scale requires international collaboration.
The Aim:
The organisation uses large-scale processing on Apache Spark and deep learning on TensorFlow to analyze these large, sensitive datasets in order to identify novel viruses, perform large cohort studies, and identify genetic mutations that cause disease. However:
- Neither Kubernetes-based nor Hadoop-based platforms support storing and processing sensitive data on a shared cluster in the way research studies require: study data must not be cross-linked with data outside the study, and data must not be copied into or out of a study. Running one cluster per research study would introduce excessive cost and administrative overhead.
- The infrastructure was too complex and expensive to administer without a dedicated IT operations team.
- Researchers required a data science platform that let them do everything from small-scale analyses in Python notebooks, to large-scale processing with Spark/PySpark, to deep learning on GPUs.
Why Hopsworks?
Karolinska Institute deployed Hopsworks to manage genomic data and conduct secure research studies. Hopsworks was built around projects, providing a GDPR-compliant environment that enables secure collaboration between researchers on medical studies within a shared cluster.
Hopsworks is optimized for commodity hardware and runs in any data center. Clusters can be expanded by adding capacity when needed, enabling a low-cost solution for up to petabytes of data. Similarly, Hopsworks supports commodity or enterprise GPUs that can be used for deep learning.
Hopsworks’ user-friendly web interface enables researchers to run, manage and access data and programs without software administration knowledge and skills.
Results
Hopsworks’ multi-tenant security model enabled researchers at Karolinska Institute to collaborate on managing, sharing, and using genomic data without compromising data security or GDPR compliance.
90% Cost Reduction
Cost savings on storing large volumes of data, as well as on the compute resources (CPUs) and graphics processing units (GPUs) used to process it.
Integrated Data Science Platform
Easy collaboration between researchers when managing, sharing, and processing genomic data.
Faster Data Processing
Massively parallel data processing pipeline for massive genomic datasets.
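To give a rough sense of what a massively parallel pipeline over sequencing reads looks like, here is a minimal sketch of the map-and-merge pattern. It is not the Karolinska pipeline and it does not use Spark: it swaps in Python's standard-library thread pool so the example is self-contained, and a thread pool only illustrates the structure rather than delivering real cluster-scale speedup. The function names, the k-mer counting task, and the toy reads are all assumptions made for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_kmers(reads, k):
    """Count every length-k substring (k-mer) in one partition of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def parallel_kmer_count(partitions, k, workers=4):
    """Map count_kmers over the partitions, then merge the partial counts."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each partition is processed independently (the "map" step).
        partials = pool.map(lambda part: count_kmers(part, k), partitions)
        # Partial results are combined into one table (the "reduce" step).
        for partial in partials:
            total.update(partial)
    return total


# Toy reads, invented for illustration only.
partitions = [["ACGTAC", "GTACGT"], ["ACGACG"]]
kmer_counts = parallel_kmer_count(partitions, k=3)
```

On a Spark cluster the same shape would typically be expressed as a per-record map followed by a keyed reduction, with the partitions spread across worker nodes instead of threads.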