Defensible Clean-Up: Addressing the Dark Side of Big Data

Wednesday, December 19, 2012 - 10:46

The Editor interviews Mary Mack, Enterprise Technology Counsel, ZyLAB. Our readers may wish to note that ZyLAB and the Association of Certified E-Discovery Specialists (ACEDS) are organizing an hour-long webinar to be presented on January 9 at 12:00 p.m. EST. The webinar – titled “The Dark Side of Big Data” – will offer best practices and clear guidelines for the defensible disposition of data in a manner that protects the company, mitigates risk and offers business benefits.



Editor: ZyLAB has been warning companies about the “dark side of Big Data.” Please start our discussion by telling us what big data is.

Mack: Big data can be described as having volume, velocity and variety. The estimates of how much data is created on a daily basis are gargantuan, and it’s even more unbelievable when you look at yearly volumes. IBM is now exploring moving data with light, so today’s Internet speeds will soon seem like a snail’s pace. The “Internet of Things,” where sensors collect and move data without human intervention, spawns new and different data stores that we don’t even see.

Big data is in constant motion between devices, and whole systems and companies are moving to the cloud. We hear about the jurisdictional implications of data held in the cloud – around issues like privacy, intellectual property, taxation and nexus. Adding to the complexity, as data continues to migrate to the cloud, virtual systems will follow. Virtual systems aren’t necessarily tied to a physical device or operating system, so several operating systems may exist on one physical device. Further, when a file is moved, all of its inherent and user-created metadata must travel with it.

Editor: What exactly do you mean by the “dark side” of big data?

Mack: The business value of big data – largely tied to marketing, customer or operational data – is accompanied by e-discovery and regulatory responsibilities. These include the need to identify, collect, reduce, review and produce the big data, and they involve huge costs and risks. We call it the “dark side” because these requirements are not always visible in a meaningful way to the corporate officers who need to make decisions in a legal, regulatory or compliance context.

Editor: How should companies handle this issue?

Mack: The first step is to assign someone to be accountable for the data. Right now, information governance responsibilities are shared between the chief information officer (CIO) and the chief legal officer (CLO). In most organizations, this responsibility is not formally recognized, and key performance indicators haven’t been developed for this role. Gartner foresees the appointment of a “chief data officer”[1] with information governance responsibilities similar to those described for technology counsel.[2] Our clients are now asking for defensible legacy data clean-up projects, which implies that someone within the organization has the skills or foresight to see the need for this action. Such projects can spark the kind of role modification I am speaking of.

Editor: Please elaborate on how technology can be used to manage the data.

Mack: E-discovery systems are already designed to collect a variety of unstructured and structured data. For example, unstructured data might be word-processing files, while structured data could be database files; these systems gather data and normalize it for tagging and analysis. Advanced search methods like conceptual search, entity extraction, rules-based coding and predictive coding can enrich the data and provide visualizations. The robust reporting that is required for e-discovery and the sampling technology that’s emerging both allow for defensible decision making. Finally, the actions and results in e-discovery systems are logged; therefore, if you’re using an e-discovery system for a clean-up project, data will also be logged for future defensibility.
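Normalization can be pictured as reducing each collected file, whatever its source, to a flat record of common fields that tagging, search and analysis can then operate on uniformly. A minimal sketch in Python (the field names are illustrative assumptions, not any particular vendor’s schema):

```python
from datetime import datetime, timezone
from pathlib import Path

def normalize(path: str) -> dict:
    """Reduce a collected file to one flat record so documents from
    disparate sources can be tagged and searched uniformly.
    (Field names are illustrative; real e-discovery schemas vary.)"""
    p = Path(path)
    stat = p.stat()
    return {
        "name": p.name,
        "extension": p.suffix.lower(),          # normalized file type
        "size_bytes": stat.st_size,
        "modified_utc": datetime.fromtimestamp(  # normalized timestamp
            stat.st_mtime, tz=timezone.utc
        ).isoformat(),
        "tags": [],  # populated later by review or rules-based coding
    }
```

Once every document lives in one record shape, the same tagging rules and reports can be applied across word-processing files, email exports and database extracts alike.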

Editor: In your recent article with us, you mentioned that referenceable case law should be involved in the process of establishing defensible data policies.

Mack: Yes, we talked about the Rambus[3] case in that article. Some of this is common sense. It’s not OK, for example, to destroy data with a guilty mind; to do it strategically in advance of anticipated litigation; or to do it in an uneven fashion. So it’s important that the process involve a solid protocol – one that both harmonizes with the organization’s data-retention policy and ties practices to that policy. Many organizations that are under constant legal hold don’t enforce their own retention policy. As a result, if they subsequently rely on that policy to destroy data, they run the risk that doing so is not a routine operation, which is one criterion that FRCP 37(e) requires to escape sanctions. So the idea of legacy data clean-up is to put together a process that will be used consistently over time.

Editor: Beyond ensuring the integrity of data for litigation purposes, please talk about data that simply may be useful in the context of business intelligence.

Mack: It’s interesting you mention business intelligence because there is an emerging consensus that information has value. In fact, Debra Logan, a research vice president from Gartner, is anticipating that information assets – meaning data that’s managed well – will actually have value on a company’s balance sheet. So yes, there certainly are useful aspects of this data for sales and marketing. Those benefits may extend into the realm of intellectual property data, either patentable or trade secret material that is involved with “know how,” which may be buried in someone’s shared drive somewhere. Unless you can find it, it’s basically lost to the company.

Editor: So there may be additional business functions, other than litigation, that this process can engage with.

Mack: Yes; however, for most of our clients, the triggers continue to be litigation and regulatory matters, which involve a pre-defined need and a legal spend that can be measured and reduced. Further, corporate sales and marketing areas have been investing in business intelligence tools, and it is anticipated that technology spend will increasingly appear on the chief marketing officer’s (CMO’s) budget, not on the CIO’s.

It’s a recent development that CLOs and CIOs have developed working relationships around information governance and e-discovery. Now, a new role is introduced into the equation. Naturally, the CMO has a job to do and may not be focused on the legal implications of using business intelligence tools, which aren’t necessarily designed to preserve data or to produce it in a timely manner for legal or regulatory matters. So this development creates a whole new set of challenges for the CLO in protecting the company’s legal interests.

Editor: ZyLAB has consistently advocated taking a proactive approach to e-discovery. What developments do you see coming from the evolution toward true information governance?

Mack: The financial crisis led to the elimination of budgets for proactive litigation management, though that trend is starting to reverse. We expect to see greater use of early case assessment (ECA), which assesses the electronic evidence and the value of the case, pulling in factors like big data, jurisdiction, and the behavior of opponents or a specific bench. The goal is to formulate a response that is proportional to the litigation. The results will be reduced costs and a trend toward making litigation response an ordinary business function that does not disrupt operations or budgets.

Editor: How does predictive coding fit in with this development?

Mack: Predictive coding systems should comprehend the “variety” aspect of data. This includes the ability to recognize and derive text from pictures, such as company assets that are contained in technical drawings or contracts that may not have machine-readable text. Prior to the application of a predictive coding algorithm, the data must be normalized. Different languages need to be automatically translated, and multinational companies must have Unicode-compliant systems.

Predictive coding technology can consistently code documents, and it can derive relationships and surface documents that weren’t under consideration. In fact, there is increasing and often enthusiastic judicial acceptance of predictive coding, which is significant considering that it took several years to gain widespread acceptance of search terms. Such acceptance of predictive coding may reflect an acknowledgement from judges that e-discovery has become onerously expensive and that dockets need to be moved along more quickly.
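The core idea behind predictive coding is that a classifier learns from a small reviewed seed set and then codes the remaining documents consistently. As a rough illustration only (commercial systems use far more sophisticated models and workflows), here is a minimal naive Bayes sketch in Python:

```python
import math
from collections import Counter

def train(seed_docs):
    """Learn word statistics from reviewed seed documents.
    seed_docs: list of (text, label) pairs, label in {'relevant', 'not'}."""
    counts = {"relevant": Counter(), "not": Counter()}
    totals = Counter()
    for text, label in seed_docs:
        counts[label].update(text.lower().split())
        totals[label] += 1
    return counts, totals

def code(model, text):
    """Assign the more probable label to an unreviewed document,
    using naive Bayes with add-one smoothing."""
    counts, totals = model
    vocab = set(counts["relevant"]) | set(counts["not"])
    words = text.lower().split()
    best, best_lp = None, None
    for label in ("relevant", "not"):
        lp = math.log(totals[label] / sum(totals.values()))  # prior
        n = sum(counts[label].values())
        for w in words:
            lp += math.log((counts[label][w] + 1) / (n + len(vocab)))
        if best_lp is None or lp > best_lp:
            best, best_lp = label, lp
    return best
```

Even this toy version shows why the technology codes consistently: the same statistics are applied to every document, so two near-identical documents cannot receive contradictory calls from tired reviewers.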

Editor: Please talk about ZyLAB’s use of an XML archive for document storage. What other features ensure that you stay on the cutting edge for your clients?

Mack: Some service providers use proprietary formats for data storage. ZyLAB made a different decision by opting to use XML, an open standard whose tools can evolve without requiring massive upgrades to the company’s data. In other words, our unified system eliminates the need to move information from database to database, which may be required in systems that involve disparate components. Keeping data in one place makes it easier to normalize it and facilitates more accurate and productive inquiries.

Other features include the fact that we were the first to support Microsoft Office 365 with Exchange Email and SharePoint in the cloud. We also were the first to support Gmail and Google Docs, and we’re now supporting the newly announced Google Drive.

ZyLAB clients are taking full advantage of our database, content-management and social-media collectors, which are specific components that can be plugged in as needed and customized to a client’s file format. We also offer a nice phonetic-based search and classification method for voice files, eliminating the need to create transcripts before searching. The point is that new varieties of big data are being created every day, so we focus on using our flexibility to ingest that data.

Finally, ZyLAB is committed to participating at the forefront in the dialogue surrounding the legal implications of big data. Certainly, we encourage your readers to sign up for the webcast about the Dark Side of Big Data. I also would like to make them aware of a whitepaper by our chairman and chief strategy officer, Johannes Scholtes, on handling this Dark Side of Big Data.[4] Another interesting paper is a report written by electronic discovery and automated litigation support expert George Socha.[5] In this paper Socha discusses how to bring e-discovery in-house and presents a new model that expands the electronic discovery reference model (EDRM) to also include the information governance reference model.

Editor: Do you have any final thoughts?

Mack: I would like to acknowledge the efforts of organizations like Lawyers for Civil Justice and the Defense Research Institute (DRI), which seek meaningful changes in the Federal Rules of Civil Procedure, including clarification of sanctions for inadvertent destruction. Additionally, New York and Illinois have pilot programs to reduce e-discovery costs, and courts are moving toward a bad intent and proof of damage requirement for sanctions around preservation. These are positive developments, and ZyLAB’s part is to enable companies to help themselves by cleaning up their data, particularly before they make the move to the cloud.

[1] Mark Raskino and Debra Logan, CEO Advisory: Chief Data Officers Are Foresight, Not Fad, Gartner, December 11, 2012.

[2] Carole Basri, Mary Mack and Ronald J. Hedges, Chapter 18, Creating an Effective Electronic Discovery Compliance Program Including the Office of Technology Counsel, eDiscovery for Corporate Counsel, Thomson Reuters West, 2012 Ed.

[4] Whitepaper available here:

[5] Report available here:


Please email the interviewee at with questions about this interview. Also please visit ZyLAB at Legal Tech New York on January 29–31, booth #325.