Understanding Data Pedigree: Key Insights

August 16, 2023 | By Mitch Carmen

Anyone who has studied statistics can relate to this: you can’t understand the data unless you first understand the process that produced it. One of my favorite instructors during graduate school at The University of Chicago would begin every class by asking us, “What is the data generating process?” He did this at each class because it is such an important and fundamental piece of information for everything downstream. By the data generating process, he meant how, when, why, and where the data was actually created and collected. If we’re going to analyze the data of a system, we must understand how that data was generated in the first place.

Investing in conservation projects, calculating the value of a natural asset, and evaluating carbon credits all rely on data. But what if the data is suspect or non-existent? In this article, we will look at why data pedigrees matter and at examples that demonstrate why a robust data pedigree is standard practice in many industries.

What exactly does a data pedigree mean?

The word pedigree, for me, brings to mind thoroughbred horses or the Westminster Kennel Club Dog Show. These horse and dog breeds are highly sought after, and their lineage has to be well documented. The pedigree encompasses information about the family that came before them as well as information about that particular animal at this moment in time.

In the same way, the data related to natural assets should have a robust pedigree. When thinking about data pedigrees, there are several aspects to consider:

  • Data source: What is the source of the data? Was it collected internally or obtained from external sources?
  • Credibility & reliability: What is the reputation of the data source provider and their methodology?
  • Data collection process: What tools and techniques were used? Was the data collected by humans or by sensors? What units of measurement are used? (NASA learned this the hard way when the Mars Climate Orbiter was lost in 1999 after one team worked in imperial units and another in metric.)
  • Data quality: How complete, accurate, consistent, and valid is the data? Are there inconsistencies or missing values that could impact the data’s integrity? Were any data cleansing or preprocessing steps performed to improve data quality?
  • Data transformation: Was the data aggregated, filtered, normalized, or transformed in any way, and if so, what is the impact on interpreting the data? (An interesting example of why this matters is presented below.)
  • Data governance: Are there established policies, procedures, and controls in place to ensure data integrity, security, and privacy? Who is responsible for this?
  • Data documentation: Does the data include data definitions and metadata to provide a clear understanding of the data’s structure, variables, and any assumptions/limitations associated with it?
  • Updates and versioning: Is the data regularly updated, and are different versions tracked? How often is it refreshed?
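To make the checklist above concrete, here is a minimal sketch of how some of these aspects could be captured as a metadata record that travels with a dataset. The `PedigreeRecord` class, its field names, and the `undocumented` helper are hypothetical and chosen only for illustration; they do not describe any particular platform.

```python
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class PedigreeRecord:
    """Hypothetical metadata record; field names are illustrative only."""
    source: Optional[str] = None             # who produced the data
    collection_method: Optional[str] = None  # e.g. "sensor" or "human survey"
    units: Optional[str] = None              # units of measurement
    transformations: list = field(default_factory=list)  # steps applied, in order
    steward: Optional[str] = None            # who is responsible for governance
    version: Optional[str] = None            # dataset version label
    last_refreshed: Optional[str] = None     # ISO date of the latest update


def undocumented(record: PedigreeRecord) -> list:
    """Return the names of fields that were left empty (None)."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]


# Example values are made up for this sketch.
record = PedigreeRecord(source="external vendor", units="metric tons CO2e")
print(undocumented(record))
# → ['collection_method', 'steward', 'version', 'last_refreshed']
```

A check like this does not say whether the data is good. It only shows which pedigree questions nobody has answered yet, which is often the first warning sign.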

Data Pedigree in Environmental Intelligence

As you can see, there is a lot to consider when deciding if the dataset you have is of the best pedigree. In Environmental Intelligence, data pedigree can make the difference in deciding which conservation/restoration projects to invest in or in setting the pricing tier of carbon credits, to name two examples. A carbon credit whose underlying data has a poor pedigree will likely be priced significantly lower than the same carbon credit backed by data with a strong pedigree.

At Laconic, the SADAR™ platform has been designed specifically to meet the most demanding aspects of the data pedigree as described above for the purpose of creating trust in the underlying data.

Designed with Data Pedigree in Mind: Laconic SADAR™

Within the SADAR™ platform, all data collected has a clear and understandable lineage. You’ll know whether the data was sourced via drones, human surveys, sensors, or aquatic/sub-aquatic sensors, for example. Any external data added to the platform is thoroughly vetted and noted as coming from an external source. When and where the data was collected is documented, and Laconic outlines the units of measure with a well-structured, referenceable taxonomy that we call the Laconic Universal Environmental Identifier (LUEI).

Within the SADAR™ platform, we adhere to these 10 data pedigree standards for all raw data that enters the platform:

Completeness: Completeness ensures that all relevant aspects or dimensions of the phenomenon being studied are represented in the data. Laconic uses 6 intelligence collection platforms (satellites, drones, ground-based sensors, aquatic surface sensors, sub-surface sensors, and ground survey teams). Additionally, Laconic draws on more than 24 survey types, 54 environmental data variables, 10 data pedigree elements, and 6 Ecosystem thematic accounts.

Accuracy: Accuracy is vital in generating reliable insights. It measures the degree to which the data reflects the reality it is intended to represent. The SADAR™ multi-modal collection platforms ensure that data is accurate.

Timeliness: Timeliness is critical because it focuses on when the data was collected and ensures that the data is relevant and applicable to the current context. Each collection is documented with the time and version of the data.

Collection processes: Includes the instruments, tools, and techniques used.

Source: Provides an understanding of the sources of data whether from satellite images, IoT devices, or human surveys.

Consistency: Aligns data with other sources of information or user expectations to ensure that the insights generated are reliable.

Integrity: The degree to which data has been preserved and protected from changes or modifications that could compromise its quality or reliability.

Confidentiality: Focuses on the degree to which data is protected from unauthorized access or disclosure, safeguarding sensitive information and ensuring data privacy.

Security: The measures that Laconic has taken to protect the data from threats or vulnerabilities such as cyber-attacks, data breaches or unauthorized access.

Traceability: Focuses on the ability to trace the data back to its source so that its history and lineage are understood, which speaks to Laconic’s transparency and accountability.

The result is a refined dataset that has adhered to our data pedigree process and is ready to inform our Environmental Decision Support practice.
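As one illustration of the integrity standard above, a common general-purpose technique is to record a cryptographic fingerprint of raw data at collection time and compare it again later. The sketch below is an assumption made for illustration, not a description of how SADAR™ works; the function names and the sample record are invented.

```python
import hashlib


def fingerprint(raw: bytes) -> str:
    """Return the SHA-256 digest of the raw data as a hex string."""
    return hashlib.sha256(raw).hexdigest()


def is_unchanged(raw: bytes, recorded: str) -> bool:
    """Check the data against the fingerprint stored when it was collected."""
    return fingerprint(raw) == recorded


# Invented sample record, fingerprinted at "collection time".
original = b"site=A1,temp_c=14.2\n"
recorded = fingerprint(original)

print(is_unchanged(original, recorded))                  # → True
print(is_unchanged(b"site=A1,temp_c=41.2\n", recorded))  # → False
```

Even a two-digit transposition in a single reading changes the fingerprint, so silent modifications become detectable as long as the recorded fingerprint itself is stored safely.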

Data Pedigree Blunders

While data pedigree in Environmental Intelligence is a critical component in investment decisions, in some industries it can be the difference between life and death. Here are a few examples of how poor data pedigree practices led to big blunders, sometimes with a tragic end.

Challenger Space Shuttle Disaster, 1986

Abnormally low temperatures at Cape Canaveral led to a conference call to discuss whether it was safe to launch the shuttle. Scientists turned to the data to inform their decision. They looked at the available data on the relationship between temperature and O-ring failures, since O-rings had been a cause of concern for years before the launch. However, the observations from flights that had no failures had been removed from the data they examined. A later review of the full dataset concluded that any reasonable analysis would have shown that launching at that temperature was extremely dangerous. The launch went ahead, and sadly, the entire crew of seven died.
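The effect of dropping the no-failure observations can be illustrated with a small sketch. The records below are invented for illustration and are NOT the actual flight data; the temperature bands, counts, and function names are all assumptions.

```python
# Invented (temperature_band, had_incident) records, NOT real flight data.
synthetic_launches = (
    [("cold", True)] * 4
    + [("warm", True)] * 3
    + [("warm", False)] * 17
)


def incident_counts(records):
    """Count incidents per temperature band."""
    counts = {}
    for band, had_incident in records:
        if had_incident:
            counts[band] = counts.get(band, 0) + 1
    return counts


def incident_rates(records):
    """Share of launches with an incident, per temperature band."""
    totals, incidents = {}, {}
    for band, had_incident in records:
        totals[band] = totals.get(band, 0) + 1
        incidents[band] = incidents.get(band, 0) + int(had_incident)
    return {band: incidents[band] / totals[band] for band in totals}


# The filtered view keeps only launches that had an incident.
filtered = [r for r in synthetic_launches if r[1]]

print(incident_counts(filtered))           # → {'cold': 4, 'warm': 3}
print(incident_rates(synthetic_launches))  # → {'cold': 1.0, 'warm': 0.15}
```

In the filtered view, incidents appear at both cold and warm temperatures in similar numbers, so temperature looks unimportant. Only the full dataset, which includes the incident-free launches, reveals how different the rates are.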

Court-rendered Judgements

We’ve all seen TV shows that are based on the legal system or contain trials where evidence is produced for or against the defendant. The chain of custody for any evidence (data) collected has to be documented and handled correctly. In the murder trial of former football star O.J. Simpson, multiple issues involving the chain of custody for the forensic evidence surfaced immediately. The defense was able to cast enough doubt on the evidence that the jury returned a not-guilty verdict. The LA Times outlined all the evidence-collecting blunders by one of the investigators.

Pharmaceuticals

The FDA is charged with approving pharmaceuticals as safe for the public in the US. It in turn requires pharmaceutical manufacturers to provide metadata that gives the context needed to interpret the data presented and to assess its integrity. The FDA also requires an electronic audit trail showing when data records were created, modified, or deleted, including attempts to access the system that contains the data and any attempts to rename or delete files. In a case brought by the US Department of Justice against GlaxoSmithKline (GSK) involving the antidepressant drug Paxil, allegations were made of data mishandling and fraudulent practices (including failure to disclose unfavorable safety data and manipulated trial results). GSK settled the case and paid substantial fines (US$3 billion).
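The idea of an electronic audit trail can be sketched in a few lines. The example below is a generic illustration, not the FDA's specification or any vendor's implementation: the `AuditTrail` class, its field names, and the sample users and file name are hypothetical. Each entry stores a hash of the previous entry, so editing an earlier record after the fact breaks the chain.

```python
import hashlib
import json
from datetime import datetime, timezone


class AuditTrail:
    """Hypothetical append-only log in which each entry is chained to the last."""

    def __init__(self):
        self.entries = []

    @staticmethod
    def _digest(entry):
        # Hash every field except the stored hash itself.
        body = {k: v for k, v in entry.items() if k != "hash"}
        return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

    def record(self, user, action, target):
        """Append an entry such as a create, modify, delete, or access attempt."""
        entry = {
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "action": action,
            "target": target,
            "prev_hash": self.entries[-1]["hash"] if self.entries else "",
        }
        entry["hash"] = self._digest(entry)
        self.entries.append(entry)

    def verify(self):
        """Return True only if no entry was altered, removed, or reordered."""
        prev_hash = ""
        for entry in self.entries:
            if entry["prev_hash"] != prev_hash or entry["hash"] != self._digest(entry):
                return False
            prev_hash = entry["hash"]
        return True


trail = AuditTrail()
trail.record("analyst1", "create", "trial_results.csv")
trail.record("analyst2", "modify", "trial_results.csv")
print(trail.verify())  # → True

trail.entries[0]["user"] = "someone_else"  # simulate tampering with history
print(trail.verify())  # → False
```

A production system would also need to protect the log itself from being replaced wholesale, for example by storing it where ordinary users cannot write.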

Laconic takes data pedigree very seriously. From the collection of raw data to decision-support services, you can be assured that the standards we adhere to provide you with data you can rely on when making environmental conservation decisions, valuing natural assets or evaluating carbon credits.
