Schema
Last updated
A schema is a "blueprint" of what data looks like. More formally, it's an expression of descriptive and structural metadata with defined semantics. A schema is a powerful communication tool, as it provides a clear and well-encapsulated expression of what data you have (or need).
By "defined semantics" we mean that it is expressed in a particular schema language, the choice of which can be highly nuanced depending on your application.
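As a rough illustration of the definition above, a schema can be sketched as a small structure that carries both structural metadata (field names and types) and descriptive metadata (what each field means). Everything named below (`READING_SCHEMA`, its fields, and the `check_record` helper) is a hypothetical example, loosely modelled on JSON Schema conventions and not taken from any particular schema language.

```python
# Hypothetical schema for a sensor reading (illustrative only).
READING_SCHEMA = {
    "title": "SensorReading",
    "description": "One temperature reading from a single sensor.",
    "fields": {
        "sensor_id": {"type": str, "description": "Unique sensor identifier."},
        "timestamp": {"type": str, "description": "ISO 8601 time of the reading."},
        "temperature_c": {"type": float, "description": "Temperature in degrees Celsius."},
    },
}


def check_record(record, schema):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    # Structural check: every declared field is present with the declared type.
    for name, spec in schema["fields"].items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], spec["type"]):
            expected = spec["type"].__name__
            actual = type(record[name]).__name__
            problems.append(f"{name}: expected {expected}, got {actual}")
    # Anything the schema does not describe is flagged too.
    for name in record:
        if name not in schema["fields"]:
            problems.append(f"unexpected field: {name}")
    return problems
```

Calling `check_record({"sensor_id": "a1"}, READING_SCHEMA)` would report the two missing fields, which is the kind of concrete feedback a written-down schema makes possible.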
Purpose
Schemas allow you to:
de-risk projects involving data,
build robust data transformations and pipelines, and
communicate with other stakeholders about the data you require (or provide).
Tip
The simple act of writing down what's in the data is far more important than the choice of schema language.
Schemas can (in general) be translated between languages relatively easily; the real value lies in getting the data structure and description written down and communicated in the first place.
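To make the translation point concrete, here is a minimal sketch that renders a flat, JSON Schema-style description as a SQL `CREATE TABLE` statement. The type mapping, the `json_schema_to_ddl` function, and the `order_schema` example are assumptions made for illustration. The sketch only handles primitive, non-nested fields, so real translations may need more care.

```python
# Assumed mapping from JSON Schema primitive types to generic SQL column types.
JSON_TO_SQL = {
    "string": "TEXT",
    "integer": "INTEGER",
    "number": "REAL",
    "boolean": "BOOLEAN",
}


def json_schema_to_ddl(table, schema):
    """Render a flat JSON Schema-style object as a CREATE TABLE statement."""
    required = set(schema.get("required", []))
    columns = []
    for name, spec in schema["properties"].items():
        column = f"{name} {JSON_TO_SQL[spec['type']]}"
        if name in required:
            column += " NOT NULL"  # required properties become NOT NULL columns
        columns.append(column)
    return f"CREATE TABLE {table} (\n  " + ",\n  ".join(columns) + "\n);"


# Hypothetical example schema.
order_schema = {
    "type": "object",
    "properties": {
        "order_id": {"type": "integer"},
        "customer": {"type": "string"},
        "total": {"type": "number"},
    },
    "required": ["order_id", "customer"],
}

print(json_schema_to_ddl("orders", order_schema))
```

The structural content carries over mechanically. The part that cannot be generated is the knowledge of which fields exist and what they mean, which is why writing it down matters more than the target language.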
Disambiguation
In philosophy, a schema is a representation of a plan or theory in the form of an outline or model: "a schema of scientific reasoning". This is a much broader definition than we use for our purposes here.
Schemas are an incredibly powerful de-risking tool for project management!
In any data-driven project with several stakeholders, many months can be spent in communication on how teams are going to work together. Example chunks of data are sent back and forth in CSV or Excel files, and it's frequently unclear what expectations are and where boundaries lie between teams. The meanings of particular columns are queried by email, just when people go out on holiday. Things need fixing when it turns out the real data is a bit different. And so on, and so on, as the critical path grows ever longer...
At the beginning of such projects, defining a set of schemas at the boundaries where data is exchanged between teams:
Clarifies initial expectations
Encourages disciplined, effective communication between the teams as the project and its data evolve
This works well even if you don't yet have a clear understanding of your data (when schemas are little more than a wild guess), because it introduces a framework for communication at the start.
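Even a first-guess schema can be checked against the sample files teams send each other. The sketch below compares the header of a CSV sample with an agreed column list and reports the differences. The column names, the sample data, and the `compare_header` helper are hypothetical.

```python
import csv
import io

# Hypothetical columns agreed between two teams at a data boundary.
EXPECTED_COLUMNS = ["site", "date", "output_kwh"]


def compare_header(csv_text, expected):
    """Report columns that are missing from, or unexpected in, a CSV sample."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty file yields an empty header
    return {
        "missing": [c for c in expected if c not in header],
        "unexpected": [c for c in header if c not in expected],
    }


# A sample where the sender used a different unit (and column name) than agreed.
sample = "site,date,output_mwh\nA,2024-01-01,1.2\n"
print(compare_header(sample, EXPECTED_COLUMNS))
# {'missing': ['output_kwh'], 'unexpected': ['output_mwh']}
```

A mismatch like this surfaces on the first exchanged file, not by email weeks later.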
Industry standards that specify data contents frequently emerge, often for stable industrial systems whose parameters and data outputs are well known.