Structured What? Leveraging structured data to build a better story

Tuesday, October 13, 2015 - 10:09

Let’s talk about data – structured data. Any IT analyst will tell you that all data is structured. The analyst will note, for example, that your Microsoft Word file is highly structured and adheres to rigorous format rules, and that these rules are what define and differentiate MS Word files, Excel spreadsheets, PDFs, CAD drawings, etc. from one another. The analyst will tell you that without this structure, the data would be meaningless and almost impossible to interpret. While true, that isn’t the type of structured data that we are here to discuss, and the fact that IT and legal use different terminology – so different that we can’t even agree on what “structured” means – highlights the issues inherent in this topic. Given that, a good place to start is to define what we mean by “structured data.”

What is Structured Data?

In the electronic discovery world, we classify data into two broad categories: structured and unstructured. The unstructured category encompasses user-created and -controlled data such as Word, PDF, Excel and PowerPoint files. Structured data refers to large corporate data stores such as HR systems, GPS tracking systems, payroll and timecard systems, finance and point of sale (POS) systems, and other large databases used to operate a business. These systems are typically built on relational databases, which organize data into tables of interrelated rows and columns. For example, a GPS system might store information such as latitude, longitude, date, time and user/vehicle identification. There may be millions of data points in a single database spanning multiple years and hundreds or thousands of users. A GPS system can easily store the location of every vehicle in a company fleet for a period of months, or even years, in five-minute increments. The amount of raw data can be staggering, yet it contains knowledge that would be well worth the effort required to obtain it.
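To make the GPS example concrete, here is a minimal sketch in Python using the standard-library sqlite3 module. The table name gps_pings, its column names, the sample rows and the fleet size of 500 are all hypothetical, chosen only to mirror the fields listed above; a real fleet-tracking system will have its own schema.

```python
import sqlite3

# Hypothetical schema mirroring the fields named in the text; real systems differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE gps_pings (
        vehicle_id  TEXT NOT NULL,   -- user/vehicle identification
        recorded_at TEXT NOT NULL,   -- date and time (ISO 8601, UTC assumed)
        latitude    REAL NOT NULL,
        longitude   REAL NOT NULL
    )
    """
)

# Made-up sample rows: one vehicle reporting every five minutes.
sample_rows = [
    ("TRUCK-001", "2015-07-15T16:46:00Z", 38.9072, -77.0369),
    ("TRUCK-001", "2015-07-15T16:51:00Z", 38.9075, -77.0371),
    ("TRUCK-001", "2015-07-15T16:56:00Z", 38.9101, -77.0402),
]
conn.executemany("INSERT INTO gps_pings VALUES (?, ?, ?, ?)", sample_rows)

ping_count = conn.execute("SELECT COUNT(*) FROM gps_pings").fetchone()[0]

# Back-of-the-envelope volume: one ping every five minutes, around the clock.
pings_per_vehicle_per_year = (60 // 5) * 24 * 365
fleet_size = 500  # assumed fleet size, for illustration only
pings_per_fleet_per_year = pings_per_vehicle_per_year * fleet_size

print(ping_count)                  # 3
print(pings_per_vehicle_per_year)  # 105120
print(pings_per_fleet_per_year)    # 52560000
```

Under these assumptions, a single vehicle produces 105,120 rows per year and a 500-vehicle fleet roughly 52.6 million, which is in line with the “millions of data points” described above.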

In a litigation context, structured data systems represent a risk, as well as a potentially significant advantage. The risk, as with most electronically stored information (ESI), is the potential for spoliation. Many structured data platforms are dynamic, and the data within can be updated, edited or deleted. Others are static and retain all data indefinitely, including change logs and detailed transaction history; however, these are generally the exception rather than the rule. Often, databases have predefined retention periods; they delete historical data regularly and are open to ad hoc modification by administrators or users, with few restrictions and little or no logging.

As an attorney, to mitigate the risk of spoliation, you should be aware of the structured data in the custody and control of your clients. You should know the purpose of such data and its relevance to litigation. Just as you would with traditional ESI, you should issue hold notices for retention of the data. Preservation can be as simple as setting aside a single tape or creating a one-time backup of the database, or as complex as sifting through years of tapes and migration history. When preserving the data, first consider the future use and utility of the data as preserved. Preserving data in a manner that fully retains everything, but is complex and difficult to work with in the future, can be a costly and unnecessary mistake. Likewise, preserving data in a manner that is ad hoc and undocumented can introduce litigation issues that are difficult to address.

Knowing how different applications interrelate and how upstream and downstream systems interact with identified data sources can be beneficial. In the unfortunate situation of a missed preservation opportunity, upstream or downstream data can be of use in countering spoliation charges. These applications may contain similar data for use in responding to discovery requests. For example, a ticket-tracking system may only retain data for a short term; if relevant data is not preserved, the loss of data could become the basis for a spoliation argument. However, if the system generates a series of emails for each ticket, the downstream email platform might be an alternate source of the information required to respond to the discovery request, providing a remedy to spoliation.

The potential advantage presented by structured data is its ability to shed light on the merits of a case. As an example, in a recent product liability litigation, the plaintiffs were unable to produce receipts. However, they produced multiple eyewitness declarations stating that the product was purchased at a certain time at a particular store operated by the defendants. Before reaching merit arguments, iDS was able to work with the defendants to leverage their structured data systems, including inventory, POS and historical transactions, to prove the claim to be patently false. The data clearly showed that the defendants had not sold that product model from that specific location during the time period represented by the plaintiffs. In fact, the data showed the decline of sales of the product over several years preceding the alleged purchase, and that no matching products had been sold within a year at the identified location or at other stores within the same state or nationally. Structured data told a clear, consistent and compelling story of lack of product sales – at the store level, state level and national level. 
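The kind of store-level, state-level and national-level check described above can be sketched as a handful of queries. This is an illustration only, not the analysis performed in that matter: the pos_sales table, its columns, the SKU names, the store IDs, the dates and every row of data below are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE pos_sales (
        sale_date   TEXT NOT NULL,  -- ISO 8601 date
        store_id    TEXT NOT NULL,
        state       TEXT NOT NULL,
        product_sku TEXT NOT NULL,
        quantity    INTEGER NOT NULL
    )
    """
)
# Invented transactions: sales of MODEL-X taper off and stop in 2012.
rows = [
    ("2011-03-02", "S-100", "VA", "MODEL-X", 4),
    ("2012-06-18", "S-100", "VA", "MODEL-X", 1),
    ("2012-11-30", "S-200", "MD", "MODEL-X", 1),
    ("2014-05-10", "S-100", "VA", "MODEL-Y", 7),  # a different product
]
conn.executemany("INSERT INTO pos_sales VALUES (?, ?, ?, ?, ?)", rows)


def units_sold(sku, start, end, store_id=None, state=None):
    """Total units of sku sold between start and end (inclusive),
    optionally narrowed to one store or one state."""
    sql = (
        "SELECT COALESCE(SUM(quantity), 0) FROM pos_sales "
        "WHERE product_sku = ? AND sale_date BETWEEN ? AND ?"
    )
    params = [sku, start, end]
    if store_id is not None:
        sql += " AND store_id = ?"
        params.append(store_id)
    if state is not None:
        sql += " AND state = ?"
        params.append(state)
    return conn.execute(sql, params).fetchone()[0]


# Assumed window: the year before a hypothetical alleged purchase date.
window = ("2013-06-01", "2014-06-01")
store_units = units_sold("MODEL-X", *window, store_id="S-100")
state_units = units_sold("MODEL-X", *window, state="VA")
national_units = units_sold("MODEL-X", *window)

# Year-by-year totals show the decline leading up to the window.
yearly_trend = conn.execute(
    "SELECT substr(sale_date, 1, 4) AS year, SUM(quantity) FROM pos_sales "
    "WHERE product_sku = ? GROUP BY year ORDER BY year",
    ("MODEL-X",),
).fetchall()

print(store_units, state_units, national_units)  # 0 0 0
print(yearly_trend)  # [('2011', 4), ('2012', 2)]
```

The same function answers the question at each level of granularity, which is what lets the result be presented as one consistent story rather than three separate analyses.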

What does this tell us about structured data?

It Isn’t the Data You’re Looking For

Structured data is just that: data. Data by itself is fairly worthless, usually quite voluminous, and inherently difficult for humans to read and understand, much less explain to others. The goal, then, should not be to have “The Data,” but rather to glean knowledge from that data relevant to the merits of your case – and to be able to rely on that knowledge, and convince others to rely on it, to support your arguments.

Structured data systems store data in a predefined and consistent – or “structured” – manner. That makes it ideal for computers to read and process, but not for people to comprehend. Data from structured systems must be converted from raw data into knowledge so you can understand and work with it. 

If knowledge differs from data, how do we distill large volumes of data into knowledge? Let’s use an example. Take the number 1656. This is raw data and, by itself, it could be practically anything: a year, a bank account balance, a distance, a weight, etc. Adding information – meaning – to the number can tell us what it represents. In this instance, it represents a time: 16:56 hours or 4:56 p.m. Now it has meaning, but there are still many open questions. If we add more information to the equation, we may learn that it’s a timestamp in UTC[1]. It’s now 4:56 p.m. UTC, or for those of us on the East Coast, 11:56 a.m. EST. This is still somewhat ambiguous, especially if the time of day is important to the case. If it’s during the summer months, 11:56 a.m. EST is a misleading interpretation. The correct time on the East Coast during summer is 12:56 p.m. EDT, as Daylight Saving Time would be in effect.
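The conversion just described can be sketched in a few lines of Python with the standard-library zoneinfo module (Python 3.9 or later, with time zone data available on the machine). The function name to_eastern and the two calendar dates are assumptions made for illustration; the text does not give the actual day in question.

```python
from datetime import date, datetime, timezone
from zoneinfo import ZoneInfo

EASTERN = ZoneInfo("America/New_York")


def to_eastern(raw_hhmm: str, day: date) -> datetime:
    """Interpret a raw HHMM value as a UTC time on the given day
    and convert it to U.S. Eastern time, honoring daylight saving rules."""
    clock = datetime.strptime(raw_hhmm, "%H%M").time()
    utc_moment = datetime.combine(day, clock, tzinfo=timezone.utc)
    return utc_moment.astimezone(EASTERN)


# Hypothetical dates: one in summer, one in winter.
summer = to_eastern("1656", date(2015, 7, 15))
winter = to_eastern("1656", date(2015, 1, 15))

print(summer.strftime("%H:%M %Z"))  # 12:56 EDT
print(winter.strftime("%H:%M %Z"))  # 11:56 EST
```

The same raw value yields two different local times depending on the date, which is why the date and the time zone have to travel with the number before anyone relies on it.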

Our raw data, 1656, has now been converted to information, 12:56 p.m. EDT, but we still don’t know what it means. Without proper context, this time could represent anything from the sent date on an email, to a creation date for a document, to a system login time or the time from a swipe card accessing a building. We need to add context to transform the information into knowledge. The context is that it’s the time GPS records indicate the plaintiff’s cell phone left a McDonald’s restaurant. We can use this knowledge to assess the merits of our case. The plaintiff was at McDonald’s at 12:56 p.m. EDT on the day in question – not in the office as claimed.

This example represents a single instance for a single plaintiff on a single day. We could pull millions of data points together to paint a tapestry of events – a compelling story of how people behaved in general that we could compare to how they behaved in specific instances. Google, Facebook and legions of Internet advertisers use this type of data to predict what we will read, what we will buy and where we will go. However, our goal wouldn’t be to use the data predictively, but to report historically: this is what you read, this is what you bought (or didn’t buy, as illustrated by our product liability example) and this is where you were (McDonald’s, not the office). 

To Boldly Go . . .

What does the future hold? Exponential increases in storage and retention periods, more aggregation, additional analytics and the high probability that strictly structured data will gradually become less structured or even unstructured. iDiscovery Solutions is already seeing an increase in use of NoSQL (non-relational structured data) databases (e.g., MongoDB and Apache Cassandra), indicating that while structured data use is on the rise, it’s becoming more flexible to meet changing business needs. As an attorney, you’ll need to be more agile and rigorous in understanding the data and the story it tells because, in the end, all data tells a story and that story can often make a huge difference in terms of a positive outcome for your client.

 

[1] Coordinated Universal Time (UTC). “Prior to 1972, this time was called Greenwich Mean Time (GMT) but is now referred to as Coordinated Universal Time or Universal Time Coordinated (UTC). It is a coordinated time scale, maintained by the Bureau International des Poids et Mesures (BIPM). It is also known as ‘Z time’ or ‘Zulu Time.’” Source: http://www.nhc.noaa.gov/aboututc.shtml

Charles Platt, a senior managing consultant at iDiscovery Solutions (iDS) in Washington, D.C., has over 25 years’ experience consulting with corporations and clients on information systems development, infrastructure and analysis, digital forensics, cybersecurity and incident response, database administration, e-Discovery cases, software analysis and development, and project management. He has consulted on projects ranging from large-scale forensics investigations to highly complex intellectual property and systems analysis cases in the legal and e-Discovery industries. He can be reached at cplatt@idiscoverysolutions.com.

Brian Y. Kim is also a senior managing consultant with iDiscovery Solutions in Washington, D.C. He brings over 12 years of experience in project and engagement management and business development at discovery services providers and law firms. Kim’s expertise spans all phases of the Electronic Discovery Reference Model (EDRM) life cycle, with a broad range of skills in client infrastructure and e-Discovery practices. His portfolio includes advising clients on best practices under the Federal Rules of Civil Procedure, including legal hold, data collection, data preservation, review and production. He can be reached at bkim@idiscoverysolutions.com.