
Compendium: My Guest Posts for the BaleFire Global Open Data Blog

The Three Phases of Open Data Quality Control

By Dennis D. McDonald

Introduction

In my previous post about open data quality, the suggested solutions related not just to adhering to standards but also to making sure that the processes by which open data are published and maintained are managed efficiently and effectively. In this post I drill down a bit more on those management processes.

Three Phases

When discussing open data it helps to look at open data projects with tasks divided into at least three related phases:

  1. Assessment and planning
  2. Data preparation and publishing
  3. Ongoing maintenance and support

Different tools and processes are relevant to each phase, and each phase can have an impact both on the actual quality of the data and on its perceived quality.

Phase 1. Assessment and planning

Critical to data quality at this first phase of an open data project is an understanding of the “who, where, how, how much, and why” of the data. If the goals of the project include making data from multiple systems and departments accessible and reusable, there’s no substitute for having a good understanding of what the source data actually look like early in the project. Developing an understanding of the level of effort involved in preparing the data for public access is critical. Understanding who will be responsible for making changes and corrections on an ongoing basis will also be important.

Data issues (e.g., missing data, lack of standard identifiers, transposed fields, even outright errors) that may have limited impact on traditional internal users may loom large when the data are made public. Data inconsistencies that matter little internally, even if they are not outright errors, may cause embarrassment and can be labeled as “errors” by those whose understanding of data and data management is meager.

This is not to say that outright errors aren’t important; of course they are. But nuances such as the distinction between an “outlier” and an error, or the significance of inconsistently tagged or labeled fields, may be lost on some members of the public or press. Variations in data management literacy should be expected and planned for.

Given the effort required by data preparation work (see Phase 2), there’s no substitute for taking the time during Phase 1 to perform an objective sampling of the source data, including, where possible, test runs to see how the tools to be used in managing and accessing the data will behave when faced with live data. Validation tools that check for data formatting and standards compliance will be very useful at this stage. If the data are “clean” and error-free, data preparation in Phase 2 will run smoothly. If there are significant issues with the data, the earlier you know about them the better.
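
As an illustration, here is a minimal sketch (in Python) of the kind of sampling-and-validation pass described above. The field names, expected formats, and sample size are hypothetical placeholders; an actual project would substitute its own source fields and its own data standards.

    import csv
    import random
    import re
    from collections import Counter

    # Hypothetical expectations about the source data; a real project would
    # derive these from its own field definitions and data standards.
    EXPECTED_FIELDS = ["record_id", "department", "amount", "date"]
    ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

    def validate_row(row):
        """Return a list of issue labels found in one sampled row."""
        issues = []
        for field in EXPECTED_FIELDS:
            if not (row.get(field) or "").strip():
                issues.append("missing:" + field)
        date_value = (row.get("date") or "").strip()
        if date_value and not ISO_DATE.match(date_value):
            issues.append("bad_date_format")
        amount_value = (row.get("amount") or "").strip()
        if amount_value:
            try:
                float(amount_value)
            except ValueError:
                issues.append("non_numeric_amount")
        return issues

    def sample_and_report(path, sample_size=500, seed=42):
        """Sample rows from a source extract and tally validation issues."""
        with open(path, newline="", encoding="utf-8") as f:
            rows = list(csv.DictReader(f))
        random.seed(seed)
        sample = random.sample(rows, min(sample_size, len(rows)))
        tally = Counter()
        for row in sample:
            for issue in validate_row(row):
                tally[issue] += 1
        print(f"Sampled {len(sample)} of {len(rows)} rows")
        for issue, count in tally.most_common():
            print(f"  {issue}: {count}")

    # Example call with a hypothetical file name:
    # sample_and_report("expenditures_extract.csv")

Even a crude tally like this, run early, gives a realistic picture of how much data preparation effort Phase 2 will actually require.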

Phase 2. Data preparation and publishing

This is the “production” phase of the project where plans are put in motion and initial releases of the data are prepared along with the web-based tools that link users with the data. For large volumes of data it’s not unusual at this stage for contractors to be involved with initial extract, transform, and load activities as well as programming and API development tasks. Appropriate testing tools and techniques can answer questions such as these (a sketch of a simple check covering the first and last questions follows the list):

  1. Was the number of records extracted from the source system the same as the number of records loaded into the open data portal?
  2. Are predefined filters or data visualization features behaving correctly with varying types and volumes of data?
  3. Are data anonymization strategies impacting the types of analyses that can be conducted with the data?
  4. Are basic statistics being calculated correctly, and are missing or incorrectly coded data being tagged for special processing?
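
To make those two questions concrete, here is a minimal sketch of a post-load reconciliation check. It assumes both the source extract and a portal download are available as CSV files; the file and field names are hypothetical, and an actual portal would more likely be queried through its API.

    import csv

    def count_csv_records(path):
        """Count data rows (excluding the header) in a CSV file."""
        with open(path, newline="", encoding="utf-8") as f:
            return sum(1 for _ in csv.DictReader(f))

    def reconcile(source_path, portal_path, required_fields):
        """Compare source and portal record counts, then flag portal rows
        with missing values in the required fields."""
        source_count = count_csv_records(source_path)
        with open(portal_path, newline="", encoding="utf-8") as f:
            portal_rows = list(csv.DictReader(f))

        if source_count != len(portal_rows):
            print(f"MISMATCH: source={source_count} portal={len(portal_rows)}")
        else:
            print(f"OK: {source_count} records in both source and portal")

        flagged = [
            row for row in portal_rows
            if any(not (row.get(field) or "").strip() for field in required_fields)
        ]
        print(f"{len(flagged)} portal rows have missing required values")
        return flagged

    # Example call with hypothetical file and field names:
    # reconcile("source_extract.csv", "portal_download.csv",
    #           required_fields=["record_id", "amount", "date"])

A check like this is cheap to run after every load and catches one of the most embarrassing classes of problem: records that silently disappear between source and portal.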

Making extensive amounts of data available for public scrutiny may mean that some data context will be missing. Because of this some users may lack an understanding of how to interpret the data and may not understand what’s significant and what isn’t. Something that looks like an anomaly or error might actually be correct.

Supplying such context has less to do with quality control than with how well equipped the user is to make sense of the data.  If two different departments use two different address formats or two different expenditure categories for check writing, data files combining these two sources without some indication of such contextual information may lead to a perception of error even though the source data are technically correct. 

Detecting the possibility of such inconsistencies is a Phase 1 task. Resolving such inconsistencies on a production or volume basis will be a Phase 2 task and may involve manual and automated processes as well as the development of ancillary services such as help files or even online support resources.

Phase 3. Ongoing maintenance and support

Once the open data service goes “live” there need to be ongoing quality management processes that monitor and report to management on the condition of the data.  Error detection and error correction systems and processes need to be in place, including a channel for users to provide feedback and corrections.  This feedback mechanism is important given that one of the guiding principles of the open data movement is that users are free to use data as they please.  Some of these uses may never have been anticipated or tested for and may reveal data issues that need to be addressed. 

Finally, ongoing monitoring of source data is needed to remain aware of possible changes to the source data that might have an impact later on. Some upgrades to source data systems, even when basic formats are controlled by well-accepted data standards, might introduce format or encoding changes that have downstream impacts.
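
One way to stay aware of such changes is to keep a simple “fingerprint” of each source extract and compare new deliveries against it. The sketch below is a minimal, hypothetical example: it records only column names and how often each column is empty, and the file names and ten percent threshold are assumptions rather than recommendations.

    import csv
    import json

    def fingerprint(path, sample_rows=1000):
        """Summarize an extract: its column names plus the share of empty
        values in each column across the first sample_rows rows."""
        with open(path, newline="", encoding="utf-8") as f:
            reader = csv.DictReader(f)
            columns = reader.fieldnames or []
            empty = {c: 0 for c in columns}
            seen = 0
            for row in reader:
                if seen >= sample_rows:
                    break
                seen += 1
                for c in columns:
                    if not (row.get(c) or "").strip():
                        empty[c] += 1
        rates = {c: (empty[c] / seen if seen else 0.0) for c in columns}
        return {"columns": columns, "empty_rates": rates}

    def compare(old_fp, new_fp, threshold=0.10):
        """Report column additions, removals, and large shifts in emptiness."""
        old_cols, new_cols = set(old_fp["columns"]), set(new_fp["columns"])
        for col in sorted(new_cols - old_cols):
            print(f"New column appeared: {col}")
        for col in sorted(old_cols - new_cols):
            print(f"Column disappeared: {col}")
        for col in sorted(old_cols & new_cols):
            delta = new_fp["empty_rates"][col] - old_fp["empty_rates"][col]
            if abs(delta) > threshold:
                print(f"Share of empty values in {col} changed by {delta:+.0%}")

    # Example usage with hypothetical file names:
    # with open("last_fingerprint.json") as f:
    #     previous = json.load(f)
    # current = fingerprint("this_month_extract.csv")
    # compare(previous, current)
    # with open("last_fingerprint.json", "w") as f:
    #     json.dump(current, f)

Stored alongside each delivery, fingerprints like these make format or encoding drift visible before users report it.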

Summary and conclusions

Data quality management in the context of open data programs should not be considered as something “extra” but as part of the ongoing program management process.  Outright data errors must be stamped out as early as possible before they have a chance to proliferate. 

Much of the data provided in open data programs are the byproduct of human activities that have a natural tendency to change over time. This raises the possibility that errors and inconsistencies will arise in even well managed data programs.  The solution: pay attention to quality management details at all stages of the process so that good data are provided and costs associated with error correction are minimized.

Related reading:

How Important Is Open Data Quality?

By Dennis D. McDonald

At risk?

Martin Doyle’s Is Open Data at Risk from Poor Data Quality is a thoughtful piece but doesn’t address this question: 

Should data quality standards observed in open data programs be any different from the data quality standards observed in any other programs that produce or consume data?

My first response is to answer with a definite “No!” but I think the question is worth discussing. Data quality is a complex issue that people have been wrestling with for a long time. I remember way back in graduate school doing a class project on measuring “error rates” in how metadata were assigned to technical documentation that originated from multiple sources.  

Just defining what we meant by “error” was an intellectually challenging exercise that introduced me to the complexities of defining quality as well as the impacts quality variations can have on information system cost and performance.

Reading Doyle’s article reminded me of that early experience and how complex quality measurement can be in programs that are designed to make data accessible and re-usable.

Trade-offs

One way to look at such questions is in terms of trade-offs, i.e., would we gain more benefit by exposing potentially faulty data files to public scrutiny and re-use now than by delaying and spending more time and resources to "clean up" the data before opening it to public access?

Setting aside for the moment how we define “quality” — and not all the points made by Doyle are directly related to quality — database managers have always had to concern themselves with standards, data cleaning, corrections, and manual versus automated approaches to data quality control. I’m not convinced that “adding” quality control measures throughout the process will significantly add to costs; such measures may actually reduce costs, especially when data errors are caught early on.

Ripple effects

One thing to consider when deciding whether or not to release data that may not be 100% “clean” — whatever that is defined to mean — is that it is a basic principle of open data that others will be free to use and re-use the data. If data are distributed with flaws, will re-use of that data compound those flaws by impacting other systems and processes that may be beyond the control of the issuer? Plus, if downstream problems do occur, who gets to pay for the cleanup?

Quality expectations

Data quality concerns have become central to commerce and communication in our lifetimes as transactions of all types have moved online. Buying, selling, and managing personal shopping, health, and financial affairs all place a premium on data accuracy, completeness, and reliability. We’re used to that. Our high expectations of data quality are at least partly due to how purpose-built systems with specific functions are expected to operate by delivering support for specified transactions. Data quality issues can directly impact system performance (and sometimes profitability) in immediately measurable ways. The same can also be true of data provided by open data programs.

Solutions

As Doyle reports, the move to more open data by governments (he focuses specifically on the U.K. but I think his observations are widely relevant) exposes issues with how some open data programs are managed and governed. Such issues can include variations in data standards, inconsistent or incompatible business processes, variations in software, and, occasionally, outright sloppiness.

Even when you do everything “right,” though, you may still run into problems. I remember once I was managing a large software-and-database consolidation project where, even after many hours of analysis and data transformation programming and testing, there still remained a group of financial transactions that were unable to make it from System A to System B without manual processing. It turned out that there were basic incompatibilities in the two systems’ underlying data models due to their having been built on different accounting assumptions. What one system considered to be an outright “error,” the other system considered to be 100% correct.

What’s the remedy for situations where there are potentially so many areas where errors and data quality variations can creep into the system? Doyle’s solution list is logical and straightforward:

  • Investing in standards that make data consistent
  • Ensuring encoding methods are used and checked
  • Ensuring duplicate data is always removed during frequent data quality checks (a minimal sketch of such a check follows this list)
  • Removing dependency on software that produces inconsistent or proprietary results
  • Ensuring governance that avoids confusion
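
To make the duplicate-removal item concrete, here is a minimal sketch of a routine de-duplication check. It assumes duplicates can be identified by exact matches on a chosen set of key fields; real-world duplicate detection often needs fuzzier matching rules, and the file and field names here are hypothetical.

    import csv
    from collections import defaultdict

    def find_duplicates(path, key_fields):
        """Group rows by a normalized tuple of key field values and report
        any group that contains more than one row."""
        groups = defaultdict(list)
        with open(path, newline="", encoding="utf-8") as f:
            # Row 1 is the header, so data rows start at line 2.
            for line_number, row in enumerate(csv.DictReader(f), start=2):
                key = tuple((row.get(k) or "").strip().lower() for k in key_fields)
                groups[key].append(line_number)
        duplicates = {k: v for k, v in groups.items() if len(v) > 1}
        for key, lines in duplicates.items():
            print(f"Possible duplicate {key} at rows {lines}")
        print(f"{len(duplicates)} duplicate groups found")
        return duplicates

    # Example call with a hypothetical file and key fields:
    # find_duplicates("payments.csv",
    #                 key_fields=["vendor_name", "payment_date", "amount"])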

It’s not all about standards

Note that the above solutions are not all about formally developed and de facto “standards” — though obviously standards are important. What is also needed is a recognition that publishing data is part of an ongoing process. Success depends not only on the adoption of standards but also on the ability to manage or at least coordinate the people, processes, and technologies that need to mesh together to make open data programs effective and sustainable. Quality variations occurring at one point in the chain may have no impact at that point and may not show up until later.

Who’s in charge?

Blaming one company or product line for open data program failures, as seems to be the case in ComputerWeekly.com’s Microsoft gets flak over “rubbish” UK data, is simplistic. A lot of gears have to work together in an open data program that depends on technology, software, and organizations working together. One crucial question concerns the last thing mentioned in Doyle’s list: governance. Who’s in charge? Who has authority, responsibility, accountability? Ultimately, where’s the money to come from? And, who is responsible for managing expectations about how the system will perform?

Open data programs wherever they occur can involve many different players who have to work together even though their loyalties lie with different organizations. If participants don’t share a common purpose and strong central or top down leadership is lacking or unavailable, does that mean that open data programs and their emphasis on standards are doomed?

Of course not. Support for open data at all levels of government is still strong. Planning exercises such as the World Bank’s Open Data Readiness Assessment Tool explicitly recognize policy and governance issues as being important to the success of open data efforts.

Still, there certainly are real issues that need to be managed. These include not only the types of data problems mentioned in the ComputerWeekly.com article but also process changes associated with data standardization that I have written about before.

Sharing as platform

Open data system development involving multiple systems and organizations is manageable when people are willing to work together in a collaborative fashion to pursue realistic and sustainable goals and benefits. Participants also need to share information about what they are doing, including the provision of “data about the data” as provided in the Open Data Institute’s Data Set Certificates.

My own reading of the situation is that such sharing is occurring, partly as an outgrowth of the “open data” movement itself, and partly as an outgrowth of the increasingly social nature of work as more people become accustomed to information sharing via modern tools, relationships, and networks. 

While social networking and sharing are no substitute for leadership, they do provide a platform for collaboration in all the business and technical areas relevant to open data.

Regarding the question, “How Important Is Open Data Quality?” my answer is “Very important.” One of our challenges, then, is to make sure that everyone involved in the process sees — and understands — how what they do along the way has an impact on open data quality.

Related reading:

How Cost Impacts Open Data Program Planning – And Vice Versa

By Dennis D. McDonald

Introduction

How important are costs when you are planning an open data program? Are they, as suggested by Rebecca Merrett in Addressing cost and privacy issues with open data in government, the “… elephant in the room,” especially when data anonymization costs are being considered? Or are such costs just a normal consideration when planning any project where quantities of different types of data have to be manipulated and delivered?

It’s difficult to make a generalization about this. Open data program costs can fall along at least three general dimensions:

  1. Controlled versus uncontrolled
  2. Known versus unknown
  3. Startup versus ongoing

1. Controlled versus uncontrolled

Why worry about costs you can’t control? The answer: because they can impact your program whether you control them or not. Examples of uncontrolled costs might be:

  • Taxes, licensing, insurance, and registration fees.
  • Staff salaries that can’t be reassigned to other programs or cost centers.
  • Maintenance and support costs for systems that will be incurred regardless of how data are managed or used.
  • Costs related to unanticipated changes to source data formats.

Examples of controlled costs might be:

  • Contractors that can be terminated or reassigned to another cost center at the end of the project.
  • Staffers whose chargeback amounts vary by the number of hours they record in the organization’s time tracking system.
  • Other costs (for example, postage, communication, printing) that are driven by how incoming requests or inquiries are processed and handled.

2. Known versus unknown

As Thufir Hawat told Paul Atreides in David Lynch’s Dune, “Remember … the first step in avoiding a trap is knowing of its existence.” So it is with the data related costs associated with open data programs.

It can be troublesome (or at least costlier and more time-consuming) to erroneously assume that data associated with a source program are sufficiently “clean” for publication, even if the data are coming from a system or database that has operated successfully for many years. Older systems designed for batch processing might rarely if ever touch records that contain errors or out-of-range values that might “choke” the intake process of the open data platform. Newer or online systems might automatically exclude such values from processing, but they might still get passed across and displayed openly in a system designed for public scrutiny, possibly causing misunderstanding or embarrassment.

How to avoid such “unknowns” that might lead to unexpected costs? The answer: sample, test, and start small. Be aware of data cleanup and standardization costs before committing to a budget and schedule. Use this information to prioritize the processing and release of files. Then continually feed the results from actual experience back into the schedule.

Be aware of the options available for anonymizing data and how they will impact data visualization. For some crime statistics, for example, it may be undesirable (or even illegal) to pinpoint actual incident locations on a neighborhood “heat map,” so the exact location may need to be blurred (say, to the block rather than the residence level), a strategy that might itself lead to misunderstanding or interpretation errors. Knowing about such issues in advance will help you avoid the “trap” of unanticipated costs and schedule delays.
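
As an illustration of that trade-off, here is a minimal sketch of one common blurring approach: truncating coordinate precision so an incident can be mapped only to an approximate area rather than to an exact address. The precision levels and the sample record are hypothetical; the appropriate level of blurring is a policy and legal decision, not a technical one.

    def blur_location(latitude, longitude, decimals=3):
        """Round coordinates to a coarser precision before publication.
        Roughly speaking, three decimal places of latitude is on the order
        of a city block, while two is closer to a neighborhood."""
        return round(latitude, decimals), round(longitude, decimals)

    # Hypothetical incident record: publish only the blurred point and keep
    # the exact coordinates in the internal system of record.
    incident = {"type": "burglary", "lat": 38.889484, "lon": -77.035278}
    published_lat, published_lon = blur_location(incident["lat"], incident["lon"])
    print(published_lat, published_lon)  # 38.889 -77.035

Even this simple approach illustrates the interpretation risk mentioned above: once blurred, several distinct incidents may appear to share a single point on the map.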

3. Startup versus ongoing costs

Understanding the costs associated with starting a program up (developing a strategy, building a governance group, prototyping, contracting, internal selling, website modification, etc.) and maintaining the program (keeping data updated, adding new files and services, responding to public comments and criticisms, etc.) will influence the sustainability of the open data program.

Knowing about both the one-time startup costs and the recurring, ongoing costs will be important. Managing these costs and the labor and non-labor resources associated with them over time will require strong and consistent leadership and governance to address important questions such as:

  1. Can improvements in operational efficiencies and standardization counteract cost increases related to additional data files being added to the program?
  2. Can maintenance and support costs be reduced through outsourcing?
  3. Can business processes associated with ongoing data extraction and transformation be centralized or standardized?
  4. Does it make sense to associate ongoing open data program costs with costs incurred by other operating departments?

An example of the last item is the possible trade-off involved in shifting from delivering data in response to manually processed Freedom of Information requests to providing more open access through the open data program. This was discussed in Does Replacing Freedom of Information Request Handling with Open Data Based Self Service Reduce Operating Costs? Whether such a shift might result in real offsets is a question answered by analyzing real cost data. (An assumption is that mechanisms actually exist for tracking such costs on a consistent basis and that cost, in fact, is an important variable in planning and program management.)

Conclusion

Regardless of how program costs are treated from an accounting perspective, the need for resource tracking will be strong given the “matrix organization” approach that some open data programs employ. Staff who support multiple programs, including the open data program, will need regular communication with program management in order to keep the program moving forward.

Maintaining the efficiency of such a distributed operation poses special challenges, not the least of which is cost control. A related challenge will be maintaining proficiency and efficiency of how open data program related tasks are performed when work is distributed across individuals who participate only infrequently.

Related reading:

Three Things about Open Data Programs That Make Them Special

By Dennis D. McDonald

During the brainstorming session at the inaugural meeting of the Open Data Enthusiasts meet up last week in Washington DC, attendee David Luria commented that we need to do a better job of understanding, defining, and communicating the objectives of open data programs if we want them to be successful.

I couldn’t agree more. Program objectives need to be clearly defined and shared with stakeholders and program participants so that everyone is marching in the same direction. If we don’t understand and agree on our objectives how can we establish requirements and metrics to measure what we’re trying to accomplish?

Admittedly the above principle is straight out of Project Management 101 and describes the initial steps you need to take in planning and documenting any project, not just those involving open data. Still, what I have noticed after involvement with many data related projects is that there are some special challenges associated with open data programs that we do need to pay special attention to early in program planning.

One challenge is the distributed and collaborative nature of many open data programs and how they are organized. For example, while top management in a government agency or department might initiate and support development of an open data program, it will be up to middle-management – potentially representing many different organizations and interests — to actually implement the plan. They need clarity about the program’s objectives and how it relates to their own programs and responsibilities, especially if they’re expected to supply or process data differently from how they traditionally operate.

Another difference between open data and more “traditional” data management projects concerns how the data will be used. When developing a data supported software system in a traditional fashion the processes and decisions to be supported by data can be defined in detail; data will be shaped and delivered specifically to support selected processes or decisions. With some open data programs, a stated objective will be to make data available so that innovative or yet to be determined applications can be created by others. Organizations accustomed to focusing their attention on supporting specific programs or projects may find it challenging to change their processes — or incur extra costs — to support what they might consider to be less clearly defined objectives.

A third challenge that can differentiate open data programs from more traditional data management projects is that they may need to provide access to data to users with a wide range of data handling skills. While some users will be comfortable with using supplied API’s to download selected data for sophisticated analysis or modeling, for others basic spreadsheet operations may be a challenge. Simply providing well cataloged data files in downloadable form will not be enough. Some users will need help with analysis, visualization, and storytelling.

Which brings us back to defining, clarifying, and communicating the open data program’s objectives. Whether program planning and goal clarification is led by someone with policy, IT, customer service, or business expertise is not the issue. What is needed from management is the ability to define program objectives – including how different types of users and uses will be supported — in a way that (a) takes into account all of the stakeholders’ and participants’ interests and (b) engages with them to move the program forward. This takes planning, communication, collaboration, and leadership. Again, these requirements are not unique to open data programs and projects.

Perhaps the single most important feature of open data programs is just that — they focus on making data “open” and accessible to many different users who may previously not have had much experience in accessing or using such data.

Perhaps one of the most important suggestions for making data open and accessible is that the process by which open data projects and programs are planned and managed should also be open and accessible. Care should be taken right from the start to engage with and listen to all potential stakeholders and participants, especially in cases where the organization’s underlying culture has not traditionally valued openness or transparency. For some organizations this approach to project management might require significant changes if information sharing and collaboration are not already practiced. 

Related reading:

Open Data Program Managers Need Both Analytical and Structural Data Skills

By Dennis D. McDonald

Introduction

In Management Needs Data Literacy to Run Open Data Programs I addressed the question of how much “data literacy” open data program managers need. I outlined a series of topics, corresponding to different parts of the data management lifecycle, that program managers need to be familiar with. While I certainly don’t believe it is necessary for program managers to be “data scientists” in order to manage open data programs effectively, I do think there are certain data-related skills that managers need. One of the most important is the ability to think about data from both analytical and structural perspectives.

The analytical perspective

Analytically, managers need to understand that useful data are not just random collections of numbers but represent patterns and trends that can be used to tell stories about the objects or events with which the numbers are associated. The range of tools now available for analyzing and visualizing data is truly impressive, including systems that are capable of processing and recognizing patterns and trends in huge volumes of data. Sometimes this is referred to as “big data” analytics, especially when we’re discussing the volumes of data that can be produced by organizations such as government agencies and public utilities.

What is also impressive to me, though, is the other end of the spectrum. As a BaleFire consultant involved with the implementation of open data portals using tools such as those provided by companies such as Socrata I am truly impressed with the visualization and analytical power available to those interested in discovering and exploring trends and patterns in everyday data such as crime statistics, municipal operational expenditures, and restaurant inspections.

Despite the simplicity and ease of use of such systems, though, open data managers need to be sensitive to the opportunities such tools provide and should be able to perform basic analyses on their own. One of the most important skills the open data manager can possess will be the ability to think in terms of the stories the data can “tell” and plan accordingly.

The structural perspective

From a structural perspective managers need to understand that, despite the availability of easy-to-use file management, navigation, and visualization tools, data need to be viewed as building blocks that require cleaning, quality control, and standards. Sophisticated processing tools can be relied on to do some cleaning and standardization at the time of data intake, when a file is first made available on a web portal. Sustaining consistent data quality over time, however, requires constant monitoring and some changes to current processes if the data are to be updated in a timely fashion. This “extra layer” of management requires resources, someone to oversee it, and — perhaps most importantly — someone to defend it over time.

Summary

Effectively performing both of these roles requires an understanding of the data’s analytical potential as well as an appreciation of what it takes to keep the data flowing!

Related reading

What does the term “program alignment” mean when applied to open data programs?

By Dennis D. McDonald

“Program alignment” has long been a meat-and-potatoes term for consultants involved with strategic planning. The basic idea is that the initiatives you plan and carry out need to be “aligned” with (i.e., supportive of or in line with) your organization’s goals and objectives.

What does “alignment” mean in practice? How do you measure whether or not your activities are aligned with your goals and objectives? And what does this mean when the concept is applied to open data programs?

Here are some of the things to look for, including a caveat about applying this term too loosely to open data programs.

The basic conditions for determining whether any initiative (for example, a program, a project, a purchase, a reorganization, a new product, a new hiring initiative, etc.) is aligned with your organization’s goals and objectives are the following:

  1. You know what your organization’s goals and objectives are.
  2. You know how to measure whether these goals and objectives are being accomplished.
  3. You and your management agree that your open data program supports one or more of your organization’s goals and objectives.
  4. You know how to measure the relationship between your open data program and accomplishments of your organization’s goals and objectives.

Items one and two don’t have anything to do directly with open data programs. They are the basis for ensuring that any program or initiative is aligned with what the parent organization is attempting to accomplish.

If the organization is able to articulate its goals and objectives but can’t measure their accomplishment you have your work cut out for you if you decide to take on both implementing an open data program as well as measuring its performance against the organization’s strategic objectives. It’s not impossible to do both but at minimum more executive buy-in and involvement – and resources — may be required.

There are many ways to measure accomplishments. Program impact measures range from purely qualitative to rigorously quantitative. Don’t think that benefits always need to be measured in dollars and cents for management to be convinced that a program delivers useful results. At minimum you will need convincing anecdotal evidence of positive impacts that management will understand and accept.

I mentioned a caveat concerning the ability to demonstrate alignment between an open data program and a sponsoring organization’s goals and objectives.

Don’t assume that you can predict, much less measure, all the uses made of the data your open data program provides. It’s in the nature of open data programs to make data available for planned as well as new, unanticipated, and innovative uses. This should not deter you from proposing an open data program but should instead cause you and your organization to rethink how the organization can support accomplishment of its goals and objectives, even with partners and users you may not have worked with in the past.

Related reading:

Needed: an understanding of the data environment in which your open data program’s users operate

By Dennis D. McDonald

It makes sense that, if you devote time and energy to designing, ramping up, and managing an open data program, you’re doing so for a reason. In What does the term “program alignment” mean when applied to open data programs? I made the assumption that you will want to align your open data program with the sponsoring organization’s goals and objectives and then measure the open data program’s performance by whether or not these goals and objectives are supported.

I did mention a caveat: you can’t always predict how the data provided through your open data program will be used, what all the uses of your data end up being, who the users are, and what the benefits of these uses might be, given your lack of control over how your data might be re-used and re-shared.

If this is so, how concerned should you be about the secondary and tertiary uses made of your program’s open data?

Perhaps the data you’re distributing describe the participants and transactions associated with programs you are responsible for managing and you make the data available on a regular basis via your own web portal as well as third-party services. They in turn support access to and reuse of your program’s data via search, file downloads, and API’s that can link your data with data from other sources.

I maintain that, if you are running a government program that is serving a particular need of a particular constituency, it’s your business to know not only how well your program is serving that constituency directly but also how the services of others are impacting them as well. That means you may need to track not only the direct impact of your services (for example, income supplements, health services, training, public safety, environmental cleanup, sanitation, housing, education, etc.) but also the impacts of usage of related data and services obtained from sources external to your own.

This is not as far-fetched as it may sound at first. If your program is providing services to a particular constituency, you need to understand the context in which those services are delivered. This includes knowing how other services operate and how your own users compare your services with services provided by others.

In the commercial world this means understanding your “competition.” There are some similarities with how government programs operate since people can take advantage of a variety of related or overlapping services from different sources. The potential for data re-sharing and re-use by others adds an additional complication since establishing a reasonable chain of events linking data access to positive (or negative) outcomes becomes even more challenging.

An important first step is to gain an understanding of the “data environment” in which your program’s target users already operate. Who else is providing similar or related services and data? What is the relative popularity (or unpopularity) of these different data sources?

If, for example, you are a municipal government and want to begin publishing geographically tagged and visualized crime data on your agency’s own data portal, what other data sources on crime are available to your target constituents? Is there overlap? How will you be adding value to the mix?

How will you be measuring the impact of publishing such crime statistics, and who benefits? Will such data be viewed as a help or a hindrance by your own law-enforcement professionals? Will certain neighborhoods be flagged as more or less “desirable” in terms of crime, and what would the impact of such characterizations be? If you do begin providing crime data by neighborhood and a third party combines these data with other real estate data via a system that allows for custom neighborhood “redlining,” will you be held responsible?

In summary, you need to think about going beyond basic usage and consumption data to understand the potential impacts of your open data program. You also need an understanding of the communities and the environments in which your users are operating. Then you can begin to really understand the good your open data program is capable of.

Related reading:

 ————————————————————————————————

Copyright (c) 2014 by Dennis D. McDonald