Architecture beyond buildings

Hi there! Could you please tell us a little bit about yourself and what you do at C3G?

My name is Ksenia Zaytseva and I am a Data Architect within the Data Team at C3G. I have a Master’s degree in Information Science. Before C3G, I worked across a variety of academic research disciplines, building data and metadata management systems, mostly for their Open Data initiatives.

In the Data Team, we work on the development of various tools and services for data infrastructures. We organize experimental metadata or patients’ clinical and phenotypic data following international data standards devised by the biomedical and genomics health domains. We also build systems and user interfaces for exploring the data and making large-scale genomics data analysis possible.

What kind of projects are you involved in?

One of the main projects I am part of is the Bento portal, a platform for sharing and exploring –omics data. My role on this project is to develop clinical and phenotypic data management services based on the existing standards in genomics, healthcare and information science. Bento is a suite of microservices where each microservice addresses a specific problem. The advantage of this approach is that, depending on a project’s specific requirements, each microservice can be used separately and plugged into other software architectures. I have been working on the Katsu service, an API service with a database backend used to store phenotypic metadata about patients and/or biosamples and their related genomic and disease information. The service is partly based on the GA4GH Phenopackets standard. It also stores experiment metadata, administrative metadata about the dataset itself (e.g. provenance, access rights) and reference resources (e.g. what ontologies and controlled vocabularies are used to annotate the data).

We aim to implement Bento as a generic platform for various projects in genomics. This approach, and our data model’s adoption of standards, enables us to set up a project portal and to transform and import the project-specific data relatively quickly. It also provides the possibility for integrative and federated data analysis in the future. Currently, Bento is deployed in several projects, among them iCHANGE, Signature and BQC19.

Another project I am involved in is the Canadian Distributed Infrastructure for Genomics (CanDIG). As with Bento, my work there focuses on a clinical metadata service, in this case one that uses the OMOP data model. Besides genomics projects, I am also part of the Canadian Open Neuroscience Portal (CONP) project. The CONP is an open data portal for datasets and pipelines in neuroscience. I developed and maintain its metadata validation tool. When a dataset is submitted to the portal, the tool checks whether it contains all the required data descriptions, for example, information about its creators or the license under which the dataset has been made available. I have also worked on implementing semantic web technologies within CONP, such as making its metadata discoverable through Google Dataset Search and providing SPARQL endpoint access. I am currently working on integrating CONP terms into the Neuroscience Information Data Model.
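
To make the validation step described above more concrete, here is a minimal sketch of a required-field check in Python. It is an illustration only: the field names (title, creators, license) and the function find_missing_fields are hypothetical and are not taken from the actual CONP tool.

```python
from typing import List

# Hypothetical required fields; the real CONP requirements may differ.
REQUIRED_FIELDS = ("title", "creators", "license")


def find_missing_fields(description: dict) -> List[str]:
    """Return the required fields that are absent or empty in a dataset description."""
    missing = []
    for field_name in REQUIRED_FIELDS:
        value = description.get(field_name)
        # Treat None, empty strings and empty lists as missing.
        if value is None or value == "" or value == []:
            missing.append(field_name)
    return missing


example = {
    "title": "Example neuroimaging dataset",
    "creators": [{"name": "A. Researcher"}],
    # "license" is intentionally omitted to trigger the check
}

print(find_missing_fields(example))  # ['license']
```

A submission would pass such a check only when the returned list is empty.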

What do you enjoy most about your work?

Besides developing as a technology professional, my favorite part of my work at C3G is that I get to learn a lot about human genomics, different sequencing technologies and methods, and how other things work in healthcare and biomedical research. I find it personally very interesting as a general context for my work.

Ksenia tells us about data and metadata standards in the health domain and some of the challenges in that field.

All standards or data models can be divided into two groups: first, those that apply to the data itself and, second, those that apply to the metadata (the data about the data). The first group includes definitions and relationships among biomedical and health concepts – for example, the definitions of Individual, Biosample/Specimen, Condition, how those elements are related to each other and what properties they have. The standards I am working with are GA4GH Phenopackets for phenotypic data in genomics, the OMOP Common Data Model for observational medical data, HL7 FHIR for healthcare record exchange, and mCODE data elements for oncology-related data.
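
As a rough sketch of what "definitions and relationships" can look like in practice, the Python example below models an Individual linked to Biosamples and Conditions. The class and attribute names are simplified assumptions chosen for illustration; they do not reproduce the actual Phenopackets, OMOP, FHIR or mCODE schemas.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Condition:
    # An ontology-style identifier plus a human-readable label.
    term_id: str
    label: str


@dataclass
class Biosample:
    sample_id: str
    sampled_tissue: str


@dataclass
class Individual:
    individual_id: str
    sex: str
    # One individual can be linked to many biosamples and many conditions.
    biosamples: List[Biosample] = field(default_factory=list)
    conditions: List[Condition] = field(default_factory=list)


# Illustrative values only, not real patient data.
patient = Individual(individual_id="P001", sex="FEMALE")
patient.biosamples.append(Biosample(sample_id="S001", sampled_tissue="blood"))
patient.conditions.append(Condition(term_id="HP:0001250", label="Seizure"))

print(len(patient.biosamples), patient.conditions[0].label)
```

The point of a shared model like this is that every project describing an individual uses the same elements and the same relationships between them.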

The second group includes metadata standards and models for describing the meta information about the data. For example, provenance describes how and when the data was collected or produced (e.g. by an agent or through a machine or software). Also described are the creators or authors of the data, applicable access rights, where the data is stored (e.g. repository or archive) and what the data is about. I work specifically with the DATS model for dataset descriptions as well as schema.org and W3C standards (e.g. PROV-O).

Besides data standards, there are many reference resources (ontologies and controlled vocabularies) and databases in the biomedical field that we use to reconcile our data descriptors. Some of these ontologies are SNOMED CT, the Human Phenotype Ontology (HP), the National Cancer Institute Thesaurus (NCIT) and the Uber-anatomy ontology (UBERON), among others.

What are the challenges of working with health science data?

The main challenge is that there is no one-size-fits-all data model that satisfies every use case. Most projects have their own systems and data requirements related to their research questions and goals. Using data and metadata standards and interoperability guidelines allows us to bring data together by identifying common data elements, which in turn facilitates large-scale data aggregation, analysis and new findings. That’s why it’s important to build a community of researchers, clinicians and data experts who pool their knowledge and expertise to develop common data solutions that can cover various use cases and better prepare us for unforeseen challenges, such as the current pandemic.
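
The idea of bringing data together through common data elements can be sketched as follows. This is a simplified illustration in Python: the two project schemas, the field names and the function to_common_elements are all hypothetical.

```python
# Hypothetical mappings from two projects' field names to shared data elements.
PROJECT_A_MAPPING = {"patient_id": "individual_id", "gender": "sex"}
PROJECT_B_MAPPING = {"subject": "individual_id", "sex_at_birth": "sex"}


def to_common_elements(record: dict, mapping: dict) -> dict:
    """Rename project-specific fields to common data elements, dropping unmapped fields."""
    return {mapping[key]: value for key, value in record.items() if key in mapping}


record_a = {"patient_id": "A-01", "gender": "female", "site": "Montreal"}
record_b = {"subject": "B-17", "sex_at_birth": "male"}

# Once both records use the same element names, they can be aggregated.
combined = [
    to_common_elements(record_a, PROJECT_A_MAPPING),
    to_common_elements(record_b, PROJECT_B_MAPPING),
]
print(combined)
```

In real projects the mapping also has to reconcile the values themselves (for example, against a shared ontology), which this sketch leaves out.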