Blog Post

Part 3: How to tame open data with standards

What's a centerline node network (CNN) identifier? The new data standards reference from DataSF sets out to clarify common definitions and references as well as set publishing standards.

In celebration of Open Data Day on March 3, 2018, we are releasing a 4 part series on how to manage an open data program.

Part 1: DataSF’s operating manual for open data
Part 2: How to monitor your open data portal
Part 3: How to tame open data with standards
Part 4: Why you need to profile your open data

Over the past 4 years, we’ve added well over 200 new datasets to the portal, many of them automated and considered highly valuable, such as building permits or medical and fire emergency calls.

But across these many datasets, inconsistencies have invariably emerged in how the data is published. For example, neighborhood columns that don’t clearly identify their associated boundary or different ways of encoding a parcel number.

The missing manual: Taming our datasets with standards

To begin taming these issues, we’ve launched a Data Standards Reference Handbook. This is our “missing manual” that focuses on data publishing decisions regarding:

Formats and data structure. How the structure of columns and rows should be handled across any dataset (e.g. when providing a date/time field or a coordinate).
Standard references. Which datasets represent standard references within the city and how those should be used in any referring datasets (e.g. parcels or department codes).

As a practical matter it becomes a guide for DataSF staff and department publishers as they’re publishing new data. It also provides a baseline reference for some common questions we get about data across datasets, like:

What’s a CNN? (hint: it’s not the Cable News Network)
What are the fields block, lot and apn? Are they related?
What’s a mapblocklot?
Is there an official reference of department codes somewhere?

A little note on implementation

Any of our enterprising data users will probably be thinking: well this is great, but the existing datasets haven’t changed, so what? This handbook is just a starting point. Going forward, new data will be published per the handbook. For existing data, we’ll use our comprehensive dataset field profiles, web analytics and other signals to prioritize resets over time.

We’re excited as this marks another milestone for managing data as an important City asset. We hope over time these efforts at standardization will bring down barriers to use and promote novel analysis across related datasets.

Special thanks

There are many reviewers who have helped flesh out this document, but we want to give a special shout out to our friends working on open data in Singapore. They had a handy data quality guide that we lifted heavily for our section on data structure and formats.