Getting beyond the term "unstructured data"
Back when the earth was still cooling I worked for a company that used its proprietary software to build electronic encyclopedias, product catalogs, scientific journals, and code repositories. A really big deal was our ability to do word and phrase searches of very large collections of “unstructured text.”
Even then the term “unstructured” was something of a misnomer. What we really meant by “unstructured” was that the data weren’t arrayed in the nicely structured rows, columns, and fields of a traditional database. Instead, data were organized along the lines of how people write, speak, or read.
Fast forward to today. Our current software tools, AI and otherwise, can search for and manipulate “structure.” If structure isn’t present, or doesn’t align with a pre-existing schema, we can create a new metadata schema or adapt an existing one so that analysis and reporting make sense to human or automated users. This new “structure” may have little to do with how the mechanisms that created the data operated. Or it can attempt to recreate the original structure or language used to create the data.
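As a rough illustration of imposing a schema after the fact, here is a minimal Python sketch that derives a small metadata record from a free-text passage. The `TextMetadata` fields, the `describe` function, and the regular-expression heuristics are all assumptions made up for this example; they don't describe any particular tool, and real text would need far more careful handling.

```python
import re
from dataclasses import dataclass, field


@dataclass
class TextMetadata:
    """Hypothetical schema applied to a passage after it was written."""
    word_count: int
    sentence_count: int
    years: list[str] = field(default_factory=list)


def describe(text: str) -> TextMetadata:
    """Build a metadata record from raw text using crude heuristics."""
    # Treat whitespace that follows ., ! or ? as a sentence boundary.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Count whitespace-separated tokens as words.
    words = text.split()
    # Treat four-digit numbers starting with 19 or 20 as years.
    years = re.findall(r"\b(?:19|20)\d{2}\b", text)
    return TextMetadata(
        word_count=len(words),
        sentence_count=len(sentences),
        years=years,
    )


sample = (
    "James Joyce published Ulysses in 1922. "
    "Readers still argue about its structure."
)
print(describe(sample))
```

The record this produces says nothing about how Joyce wrote the book. It is a structure chosen by whoever does the analysis, which is exactly why it can drift from how the text originated.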
Either way, the processes may introduce deviations (some might say “errors”) that don’t align perfectly with how the data originated. This is typically where we emphasize the importance of human review of text generated by LLMs in order to avoid obvious errors or “hallucinations.”
Perhaps we should re-think what we mean by “unstructured” data. Even the stream-of-consciousness approach used by James Joyce in ULYSSES contains vast and complex structures associated with the English language. Just because the text can’t be easily mapped to a traditionally organized object-oriented database schema doesn’t make it “unstructured.”
What term should we use? Does the question even matter? Perhaps how we approach the question will have something to do with what we are trying to communicate—the data’s meaning—via our data transformation and manipulation processes.
Copyright (c) 2023 by Dennis D. McDonald. The above image was created by Bing’s “Copilot” in response to the prompt “generate a line drawing that expresses the complexity of analyzing and reporting on large unstructured data sets.”