
Show me the data (dictionary)!

We're working to make field definitions easier to discover and manage

What does that field mean? That’s an oft-asked question that can normally be answered by a little documentation. For San Francisco’s open data, finding that documentation can be, ahem, inconsistent, and it’s high time we address that for everyone’s sanity! We are beginning a project to make the collection, maintenance and dissemination of data documentation consistent for data publishers, DataSF staff and data users.

Where we are today: a confusing array of options

Across the open data portal, there are many ways to access and find documentation about a dataset. These include:

In the description of the dataset. This is rare, but sometimes a dataset will list field definitions in the description or link to a place that contains definitions.
As descriptions within the open data platform. You can access these descriptions by mousing over the (i)nfo icon on the column name.
As an attachment. This will be the case for almost all datasets published in the past year or so. As part of the updated DataSF publishing process, we generate a data dictionary template and attach it to each dataset.
None at all. For some datasets posted in the early days of the portal, there may be no documentation. We’re fixing that through this process.

The problem is that users don’t know which one to expect for a given dataset. That’s a pain. We get it.

Where we’re headed: one-stop documentation

When we started, data dictionary templates were the best solution. They served us well at first, but they don’t scale.

Imagine getting a question about a dataset, answering the question, updating an attachment and re-uploading it. Now imagine that across 400+ datasets, 52 data coordinators and thousands of users.

So we’re centralizing and systematizing field documentation so that it can be improved continuously. We’ll do this work in phases, but when complete, we will have:

A single point of access for all field definitions and eventually profiled information about those fields.
Consistent, printer-friendly documentation to accompany datasets.
Global field definitions written once and propagated across datasets. (There’s no reason we should write the definition of Supervisor District more than once.)
A consistent way to administer changes and updates to the master field definitions.
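
To make the "written once and propagated" idea concrete, here is a minimal Python sketch. It is an illustration only, not DataSF tooling: the record layout, the GLOBAL_DEFINITIONS lookup, the placeholder definition text, and the propagate_global_definitions helper are all hypothetical, and the sketch assumes a field counts as global when its normalized name matches an entry in the shared lookup.

```python
from typing import Dict, List, Optional

# Hypothetical master list of global definitions, keyed by normalized field name.
# The definition text is a placeholder, not an official DataSF definition.
GLOBAL_DEFINITIONS: Dict[str, str] = {
    "supervisor_district": "PLACEHOLDER: definition of Supervisor District",
}


def normalize(name: str) -> str:
    """Normalize a field name so 'Supervisor District' matches 'supervisor_district'."""
    return name.strip().lower().replace(" ", "_")


def propagate_global_definitions(
    fields: List[Dict[str, Optional[str]]],
    global_defs: Dict[str, str],
) -> List[Dict[str, Optional[str]]]:
    """Return copies of the field records, filling in missing definitions
    from the global lookup. Existing definitions are left untouched."""
    updated = []
    for field in fields:
        definition = field.get("definition")
        if not definition:
            # Only undocumented fields fall back to the global definition.
            definition = global_defs.get(normalize(field["field_name"] or ""))
        updated.append({**field, "definition": definition})
    return updated
```

With a lookup like this, the same Supervisor District entry would fill in every dataset that carries the field, and a change to the master entry would only need to be made in one place.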

Write once, read anywhere

One of our primary goals is to reduce the overhead of maintaining and accessing field documentation. We know that data stewards spend a good bit of time explaining data to users. Maybe something gets written down, but many times, the documentation hasn’t been systematic. Or, if it has, it’s not easily accessible.

As of today, we’re releasing a working dataset of all fields stored in the open data portal. We include field definitions where available. We also provide a link to data dictionary attachments where they exist. We have a bit of work to do to document all of the fields, but you’ll be able to track our progress (see below).

The finished dataset will ultimately power a more user-friendly documentation interface. Don’t worry, we won’t expect everyone to go to the dataset to look up field definitions forever. It will also enable meta-analyses of fields. For example, we anticipate it informing our efforts around consistent data publishing practices.

Follow along with us

We have over 7000 fields. About one third of them are documented either within the open data platform or through the template attachments. That leaves us with nearly 5000 fields with no documentation. That may sound like a lot, but we’re breaking it down into smaller pieces:

Define unique, undocumented fields. We’ll rely on the data stewards to submit definitions for the currently undocumented fields that aren’t global. This averages to about 25 fields per steward.
Define global fields. There are many fields that show up in multiple datasets. We can define these once and propagate them.
Migrate documentation from templates. Of the roughly 32% of fields that are documented, a subset is documented only in template attachments and other documents rather than in the platform. We’ll script what we can and systematically enter the rest.
Deal with the rest as needed. Even if we don’t get to full coverage using the three tactics above, we’re okay with handling the remaining fields as needs arise. We can prioritize based on where we’re seeing confusion through our support portal.
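
The arithmetic behind these numbers can be sketched in a few lines of Python. The counts below are round figures taken from this post (about 7,000 fields, roughly 32% documented), not exact values from the dataset, and coverage_summary is a hypothetical helper, not part of any dashboard code.

```python
def coverage_summary(total_fields: int, documented_fields: int) -> dict:
    """Summarize documentation coverage for a set of fields."""
    if total_fields <= 0:
        raise ValueError("total_fields must be positive")
    if not 0 <= documented_fields <= total_fields:
        raise ValueError("documented_fields must be between 0 and total_fields")
    return {
        "documented": documented_fields,
        "undocumented": total_fields - documented_fields,
        "percent_documented": round(100 * documented_fields / total_fields, 1),
    }


# Round figures from the post: about 7,000 fields, roughly 32% documented.
TOTAL_FIELDS = 7000
DOCUMENTED_FIELDS = TOTAL_FIELDS * 32 // 100

summary = coverage_summary(TOTAL_FIELDS, DOCUMENTED_FIELDS)
print(summary)
```

Under those assumptions, the undocumented count comes out a little under 5,000, which matches the "nearly 5000" figure above.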

To track our progress, we’ve created a simple dashboard. It’s linked to the field definition dataset and will update automatically. Follow along with us toward a brighter documentation future!