At Hopsworks, the FAIR Guiding Principles for scientific data management and stewardship have been a cornerstone of our approach to build a better machine learning platform. F.A.I.R. principles initially became prevalent in academia and diverse fields of research in an effort to make sure that the ever growing amount of data could still be usable and beneficial for the society, and it has since been widely adopted. However, few people mention them in the context of machine learning systems and data management. Yet those principles are even more relevant today in the fast moving AI and LLMs landscape, where new legislation is changing the rules of the game.
AI professionals should consider how questions of ethics, data management, and open frameworks may influence their choice of tools and machine learning platforms when implementing modern ML systems. In Hopsworks, we follow the F.A.I.R. principles in the design of a platform for managing machine learning data and infrastructure.
Findable; referring to mechanics to make the data easily searchable and findable. Infrastructure, stakeholders, and projects need easy-to-use functionality for data discovery.
Accessible; allow access not only to the data but the provenance of the data and metadata for the data.
Interoperable; data should be easily shared between different computer systems. This is achieved by implementing open standards and formats for data
Reusable; data produced by one system should be easy to reuse in downstream systems, without copying the data. In order to reuse data, it’s important to include metadata related to the data licenses,, provenance, community standards, and custom metadata that will allow other institutions, teams or groups to be able to reuse the data.
Some of the FAIR principles are directly applicable in the context of machine learning systems: there are lots of open source frameworks, file systems, and programming languages that are used for the operation of AI products and services. Still, some very serious challenges do emerge that are specifically due to the way any ML System needs to operate.
Findable; while strategies that apply metadata and clear nomenclature can be applied in the context of operational machine learning systems, practitioners will find it challenging to create a clear centralized logic between the different data sources and databases needed to operate such services; a modern ML system might need to be connected to multiple sources, some of which may be real-time, or vector databases for large languages. Making a clear structure for the assets and the metadata becomes a complex endeavor without a centralized solution capable of catering to the different scenarios.
Interoperable & Accessible; When open frameworks and open file formats are used; core challenges in regards to accessibility and interoperability should be easier to resolve; in which case it becomes important to consider open standards, compute engines and avoid DSLs. One additional challenge that can span from the very nature of the underlying data is to make it accessible for auditing (for example; what was the data that the model in production last year trained on?), review and debugging whilst the systems continuously updates and appends data.
Reusability; Finally a fundamental characteristic of machine learning models is that some of them require the data processing to be directly tied to the model that will be trained; we call these model-dependent transformations. This process essentially compromises the integrity of the data and the underlying datasets can’t be re-used in a different scenario. And not only does it prevent the reuse of the data itself, it is also harder to understand for a human. This leads to significant holds on the ability of any organization to reuse their data in different models, leading to deduplication and the creation of monolithic pipelines that are notoriously harder to scale from.
At Hopsworks, we have a strong heritage working with academia and research, participating in projects such as HEAP (Human Exposome Assessment Project) that manages personal data from numerous medical institutes across the world. We have always been mindful of the evident privacy and security concerns and needs of efficiency in managing data following FAIR principles; when approaching such project; we consider those principles as a blueprint on how to refine our own software;
Additionally, striving to build strong abstractions and APIs that enable users and organizations to have a better understanding of the models they are building and more flexibility in reusing their data pipelines. Those are core aspects of the Hopsworks platform, which we believe all state-of-the-art ML platforms should follow to be within the FAIR framework.