Schema
Last updated
A schema is a "blueprint" of what data looks like. More formally, it's an expression of descriptive and structural metadata with defined semantics. A schema is a powerful communication tool, as it provides a clear and well-encapsulated expression of what data you have (or need).
By "defined semantics" we mean that it is expressed in a particular , the choice of which can be highly nuanced depending on your application.
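As a rough illustration of what "descriptive and structural metadata" can look like, a schema can be sketched as a plain data structure. The class names (`Field`, `Schema`), their attributes and the wind-speed example below are hypothetical, chosen only for illustration; they are not taken from any particular schema language.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """One column of a dataset (hypothetical structure, for illustration)."""
    name: str              # structural: what the column is called
    dtype: type            # structural: the type its values take
    unit: str = ""         # descriptive: physical unit, if any
    description: str = ""  # descriptive: human-readable meaning


@dataclass(frozen=True)
class Schema:
    """A titled collection of fields describing one dataset."""
    title: str
    fields: tuple[Field, ...] = ()

    def field_names(self) -> list[str]:
        return [f.name for f in self.fields]


# Example values are assumptions, not a real standard.
wind_schema = Schema(
    title="Met mast time series",
    fields=(
        Field("timestamp", str, description="ISO 8601 time of the reading"),
        Field("wind_speed", float, unit="m/s", description="Mean wind speed"),
    ),
)

print(wind_schema.field_names())
```

Even a sketch this small gives two teams something concrete to point at when they disagree about what a column means.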
Purpose
Schema allow you to:
de-risk projects involving data,
build robust data and , and
communicate with other stakeholders about the data you require (or provide).
Disambiguation
In philosophy, a schema is a representation of a plan or theory in the form of an outline or model: "a schema of scientific reasoning". This is a much broader definition than we use for our purposes here.
Schema are an incredibly powerful de-risking tool for project management!
In any data-driven project with several stakeholders, many months can be spent in communication on how teams are going to work together. Example chunks of data are sent back and forth in CSV or Excel files, and it's frequently unclear what expectations are and where boundaries lie between teams. The meanings of particular columns are queried by email, just when people go out on holiday. Things need fixing when it turns out the real data is a bit different. And so on, and so on, as the critical path grows ever longer...
At the beginning of such projects, defining a set of schema at the boundaries where data is exchanged between teams:
Clarifies initial expectations
Encourages disciplined, effective communication between the teams as the project and its data evolve
This works well even if you don't yet have a clear understanding of your data (when schema are little more than a wild guess!), because it introduces a framework for communication at the start.
Industrial standards that specify data contents frequently emerge, often for stable industrial systems whose parameters and data outputs are well known.
For many kinds of , it's imperative that the input data has some kind of characteristic. For example, if you run a wind resource analysis, your input data must contain at least some information about the wind at a site!
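One way to picture this requirement is a small check that runs before the analysis starts. The sketch below assumes the input arrives as a list of dictionaries and that the required columns are called `timestamp` and `wind_speed`; both the function name `missing_columns` and those column names are hypothetical.

```python
def missing_columns(rows, required):
    """Return the required column names that are absent (or None) in any row."""
    missing = set()
    for row in rows:
        for name in required:
            if row.get(name) is None:
                missing.add(name)
    return sorted(missing)


# Assumed requirement for a wind resource analysis (illustrative only).
REQUIRED = ("timestamp", "wind_speed")

good = [{"timestamp": "2024-01-01T00:00:00Z", "wind_speed": 7.2}]
bad = [{"timestamp": "2024-01-01T00:00:00Z", "temperature": 4.1}]

print(missing_columns(good, REQUIRED))  # nothing missing
print(missing_columns(bad, REQUIRED))   # reports the absent wind column
```

Rejecting unsuitable input at the boundary, with a message naming the missing column, is usually cheaper than discovering the problem halfway through a calculation.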
A schema for a relational database (such as ) describes how data is stored, by specifying tables, columns, column types and relations.
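A minimal sketch of such a schema is shown below, using Python's built-in `sqlite3` module purely as a convenient stand-in for a relational database. The `site` and `reading` tables and their columns are hypothetical; the point is only to show tables, columns, column types and a relation declared in one place.

```python
import sqlite3

# In-memory database, so the example leaves nothing behind.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # ask SQLite to enforce relations

# Hypothetical schema: two tables, typed columns, one foreign-key relation.
conn.executescript("""
CREATE TABLE site (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE reading (
    id         INTEGER PRIMARY KEY,
    site_id    INTEGER NOT NULL REFERENCES site(id),
    timestamp  TEXT NOT NULL,
    wind_speed REAL NOT NULL
);
""")

# Read the schema back: each row of table_info is
# (cid, name, type, notnull, dflt_value, pk).
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(reading)")]
print(columns)
```

Reading the schema back from the database, as the last lines do, is a simple way to confirm that what is stored matches what was agreed.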