A Talk by Paco Nathan
Founder / Author,
Derwen.ai / O'Reilly
About this talk
Graph applications were once considered “exotic” and expensive. Until recently, few software engineers had much experience putting graphs to work. However, the use cases are now becoming more commonplace.
This talk explores a practical use case, one which addresses key issues of data governance and reproducible research, and depends on sophisticated use of graph technology.
Consider: some academic disciplines such as astronomy enjoy a wealth of data — mostly open data. Popular machine learning algorithms, open source Python libraries, and distributed systems all owe much to those disciplines and their history of big data.
Other disciplines require strong guarantees for privacy and security. Datasets used in social science research involve confidential details about human subjects: medical histories, wages, home addresses for family members, police records, etc.
These datasets cannot be shared openly, which keeps researchers from learning about related work by others and limits both the reproducibility of research and the pace of science in general. Nonetheless, social science research is vital for civil governance, especially for evidence-based policymaking (US federal law since 2018).
Even when data is too sensitive to share openly, the metadata often can be shared. Constructing knowledge graphs of metadata about datasets, along with metadata about authors, their published research, methods used, data providers, data stewards, and so on, provides an effective means to tackle hard problems in data governance.
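To make the idea concrete, here is a minimal sketch of a metadata-only knowledge graph stored as subject/predicate/object triples. Every identifier, predicate name, and helper function below is a hypothetical illustration, not taken from the talk or from any particular system; the point is only that questions such as "who else has worked with this dataset?" can be answered without touching the confidential records themselves.

```python
from collections import defaultdict

# Hypothetical metadata triples (subject, predicate, object).
# Only metadata appears here; no confidential records are stored.
TRIPLES = [
    ("dataset:wage_records", "providedBy", "provider:state_labor_dept"),
    ("dataset:wage_records", "stewardedBy", "steward:data_office"),
    ("paper:001", "usesDataset", "dataset:wage_records"),
    ("paper:001", "authoredBy", "author:a_researcher"),
    ("paper:002", "usesDataset", "dataset:wage_records"),
    ("paper:002", "authoredBy", "author:b_researcher"),
    ("paper:002", "usesMethod", "method:regression"),
]


def build_index(triples):
    """Index subjects by (predicate, object) for reverse lookups."""
    index = defaultdict(set)
    for s, p, o in triples:
        index[(p, o)].add(s)
    return index


def objects_of(triples, subject, predicate):
    """Return every object linked from `subject` via `predicate`."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def researchers_using(triples, dataset):
    """Find authors of any paper that uses the given dataset."""
    index = build_index(triples)
    authors = set()
    for paper in index[("usesDataset", dataset)]:
        authors |= objects_of(triples, paper, "authoredBy")
    return authors


if __name__ == "__main__":
    print(sorted(researchers_using(TRIPLES, "dataset:wage_records")))
```

A production system would more likely use an open standard such as RDF with a shared vocabulary rather than ad hoc tuples; plain Python is used here only to keep the sketch self-contained.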
Knowledge graph work supports use cases such as entity linking, discovery and recommendations, and axioms for inferring compliance. This talk reviews the Rich Context AI competition and the related ADRF framework, now used by more than 15 federal agencies in the US.
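As a rough illustration of inferring compliance from graph metadata, the sketch below applies a single rule to a small set of facts: a request is compliant if the dataset is open or if the researcher has an approval on record. The rule, the predicate names, and the data are all assumptions made for this example; they do not describe how ADRF or the Rich Context work actually encodes policy.

```python
# Hypothetical facts as (subject, predicate, object) triples.
FACTS = {
    ("dataset:sky_survey", "accessLevel", "open"),
    ("dataset:wage_records", "accessLevel", "restricted"),
    ("researcher:a", "approvedFor", "dataset:wage_records"),
    ("researcher:a", "requests", "dataset:wage_records"),
    ("researcher:b", "requests", "dataset:wage_records"),
    ("researcher:b", "requests", "dataset:sky_survey"),
}


def infer_compliance(facts):
    """Apply one illustrative rule and return the inferred triples.

    Rule (an assumption for this sketch): a request is compliant when
    the dataset is open, or when the researcher is approved for it.
    """
    access = {s: o for s, p, o in facts if p == "accessLevel"}
    approved = {(s, o) for s, p, o in facts if p == "approvedFor"}

    inferred = set()
    for s, p, o in facts:
        if p != "requests":
            continue
        ok = access.get(o) == "open" or (s, o) in approved
        label = "compliantAccess" if ok else "nonCompliantAccess"
        inferred.add((s, label, o))
    return inferred


if __name__ == "__main__":
    for triple in sorted(infer_compliance(FACTS)):
        print(triple)
```

Real deployments would typically express such rules declaratively in a reasoner or a shapes language rather than in application code; the loop above only shows the shape of the inference.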
We’ll explore knowledge graph use cases, use of open standards and open source, and how this enhances reproducible research. Social science research for the public sector has much in common with data use in industry.
Issues of privacy, security, and compliance overlap, pointing toward what will be required of banks, media channels, etc., and which technologies apply. We’ll look at comparable work emerging in other parts of industry: open source projects, emerging open standards, and in particular a new set of features in Project Jupyter that support knowledge graphs about data governance.