Predictive Coding Gaining Acceptance As A Defensible eDiscovery Tool

Tuesday, January 24, 2012 - 08:51

The Editor interviews Warwick Sharp, Vice President, Marketing and Business Development, Equivio

Editor: Please provide our readers with some background on yourself and Equivio.

Sharp: I learned the software business at Amdocs, the global telecom billing giant, where I served as vice president of product marketing. By the time I left Amdocs to found Equivio, the company had grown from 400 to 10,000 employees. This was a very intensive learning experience.

I have had the good fortune to be a co-founder of Equivio – a unique opportunity to build a company from scratch, with close friends, and work with a fantastic team of talented and creative software technologists. We established Equivio in 2004, focusing on analytical software for eDiscovery. Our first product was near-duplicate detection, and this was followed by email threading and predictive coding.

We recently launched Zoom, an eDiscovery platform for predictive coding and analytics. With Zoom, we have taken our proven analytical applications, loaded them onto an integrated web platform, and then added a whole set of new functionality, such as data import, text extract, metadata analysis, early case assessment and language detection.

Editor: Do you find more resistance to technology in the legal market than in other markets in which you have been active?

Sharp: That was probably the case until about 2008, when everything changed. Until then, the litigation industry was suffering from acute “billable hour myopia,” but this pathology has cleared up over the past few years. The litigation industry is focusing much more on providing value for money and ensuring customer loyalty.

We’re certainly seeing this change in the uptake of predictive coding technology. Law firms are playing a very proactive role in adopting predictive coding with the aim of enhancing productivity, reducing costs and providing more value for their clients. Resistance to cost-reducing technology has been evaporating rapidly. This change seems to be motivated by competition among law firms that want to provide the best value and outcomes for their clients. The obvious way to accomplish that goal is through the adoption of analytical technology.

Editor: In eDiscovery circles, there is currently a lot of talk about predictive coding. Is this yet another eDiscovery fad that will evaporate as fast as it appeared?

Sharp: Certainly, fads have come and gone – who can forget the hype a couple of years ago around early case assessment? But predictive coding appears to be different. Litigation review is still a very labor-intensive process, and its pain and cost have reached intolerable levels. This has created a rare coalition of stakeholders. Everyone is involved, from corporations and law firms to the courts and litigation service providers. Predictive coding is emerging as the accepted technology of choice to re-engineer this business process and make litigation affordable again.

In any technology adoption cycle, it’s always a question of catching the right wave. So first you need a wave, not a hyped ripple that is no more than a figment of someone’s imagination. The wave of predictive coding is gathering force steadily, morphing the human task of review into a merged human and machine task.

Editor: There seems to be confusion in the market. Can you provide some guidance around all the terminology: predictive coding, predictive tagging, technology-assisted review, computer-assisted review, etc.?

Sharp: They are all one and the same. Everybody seems to have adopted different terminologies, but they all essentially refer to a software technology that can be trained to identify the relevance of documents. Essentially what you are trying to do is to encode an experienced, intelligent litigator’s understanding of the case into the software. It typically works like this: you provide a sample of documents to an expert – an attorney who is familiar with the case – who then tags and submits them to the software. The software uses these exemplars to learn the attributes of relevant documents. When sufficient document samples have been tagged by the expert, the system is able to accurately assess document relevance across the broader document population.
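The training flow Sharp describes can be illustrated with a toy relevance scorer. This is a minimal sketch in Python, not Equivio's actual technology: it learns per-word weights from expert-tagged exemplars and uses them to score documents from the broader population.

```python
from collections import Counter

def train(exemplars):
    """Learn per-word relevance weights from expert-tagged exemplars.

    exemplars: list of (text, tag) pairs, where tag 1 = relevant, 0 = not.
    Words seen mostly in relevant documents receive weights near 1.
    """
    relevant, irrelevant = Counter(), Counter()
    for text, tag in exemplars:
        (relevant if tag else irrelevant).update(text.lower().split())
    vocab = set(relevant) | set(irrelevant)
    # Additive smoothing so a single exemplar cannot pin a weight to 0 or 1
    return {w: (relevant[w] + 1) / (relevant[w] + irrelevant[w] + 2)
            for w in vocab}

def score(weights, text):
    """Average relevance weight over the words the model has seen before."""
    known = [weights[w] for w in text.lower().split() if w in weights]
    return sum(known) / len(known) if known else 0.5  # neutral if all unknown
```

A real system would use far richer features and statistical models, but the shape of the workflow is the same: expert tags in, relevance scores out.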

Obviously, while the same fundamental principles underlie all predictive coding technologies, there are significant differences between offerings in the market. These differences are manifest in the validity of the training process, the quality and defensibility of the results, the ability of the tools to quantify outcomes, statistical veracity, and the capacity of the tools to verify output and perform quality assurance.

Editor: Can the technology adapt in situations where the discovery process leads to new information – possibly a new legal theory – about the case? What if, suddenly, you’re looking for something altogether different?

Sharp: If the issue shifts, then the software must be retrained. For example, let’s assume you trained the system to find documents about Michael Jordan. If the topic subsequently changes to Kobe Bryant, you will obviously need to retrain the system. 

This can also happen with rolling loads. The learning from an initial load of documents may or may not be applicable to subsequent loads. Let’s take an extreme example: you train the system to find documents about King James on an initial load of documents written in English. Then you receive a second load of documents in Navajo. Clearly, the system’s “knowledge” acquired from the first load will be of no use. You, or a linguistically talented colleague, will need to retrain the system to find documents about King James in the Navajo collection.

In real life, it’s obviously much more nuanced, and subsequent loads often comprise similar, yet distinct, content. Your predictive coding system needs to be able to analyze the incremental material to statistically verify that the system’s knowledge to date is a good match for the new load. If not, retraining will be required. The ability to manage incremental loads in this way is obviously an important defensibility requirement.
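The incremental-load check described above can be sketched as a simple agreement test: sample the new load, obtain fresh expert tags for the sample, and compare them against the existing model's calls. The function signature, the 0.5 decision threshold, and the agreement cutoff here are all illustrative assumptions, not Equivio's method.

```python
import random

def check_new_load(model_score, new_load, expert_tag,
                   sample_size=50, min_agreement=0.8):
    """Decide whether the trained model is a good match for a new load.

    model_score: callable, document text -> probability of relevance
    expert_tag:  callable, document text -> 1 (relevant) or 0 (not)
    Returns (agreement_rate, needs_retraining).
    """
    sample = random.sample(new_load, min(sample_size, len(new_load)))
    agree = sum((model_score(doc) >= 0.5) == bool(expert_tag(doc))
                for doc in sample)
    rate = agree / len(sample)
    return rate, rate < min_agreement
```

In the King James example, an English-trained model scoring Navajo documents would agree with the expert no better than chance, and the check would flag retraining.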

Editor: Does the court take a favorable view of predictive coding?

Sharp: This is another key development. There has been a lot of discussion about judicial views on both predictive coding and on the legacy approach based on keyword searching. In October, Hon. Andrew J. Peck, United States magistrate judge for the Southern District of New York, published an article on predictive coding, noting: “Until there is a judicial opinion approving (or even critiquing) the use of predictive coding, counsel will just have to rely on this article as a sign of judicial approval. In my opinion, computer-assisted coding should be used in those cases where it will help ‘secure the just, speedy, and inexpensive’ determination of cases in our e-discovery world.” (Legal Technology News, October 2011). In my view, this statement is nothing short of historic.

Editor: What is the impact of predictive coding on improving business processes with respect to eDiscovery?

Sharp: Predictive coding is spawning a whole series of new models in eDiscovery. A good example is early case assessment, or ECA. Predictive coding puts the “assessment” back into ECA by enabling users to zoom in on the most relevant documents and make informed assessments of the winnability and potential cost of the case. This eliminates a lot of the risk in the fight-or-flee decision.

Culling has also been re-engineered based on predictive coding. All the studies, including the TREC project over the past few years, show that keyword search yields about 20 to 30 percent of the relevant documents in a collection. Obviously, this ought to be a concern to litigators charged with ensuring the defensibility of the eDiscovery process. Predictive coding, by contrast, is able to retrieve between 70 and 90 percent of the relevant documents, while reducing the volume of documents that need to be reviewed.

Also notable is how this technology has influenced the review process. For example, law firms are accelerating case development by prioritizing document review – starting with the documents with the highest relevancy scores, and then progressively working back. Some firms have adopted a stratified review strategy. For example, high-scoring documents might be assigned for review by senior reviewers, while low-scoring, low-potential documents will be reviewed by lower-cost contract reviewers. In so doing, the firm can balance risk and cost. It’s an exciting time to be in the industry. The whole notion of stratified review didn’t even exist a year ago.
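Stratified review of this kind reduces to routing documents by their relevance scores. A minimal sketch, with hypothetical cutoff values that a firm would in practice set from the tool's statistics:

```python
def assign_review_tiers(scored_docs, senior_cutoff=0.7, cull_cutoff=0.2):
    """Route documents to review tiers by predicted relevance score.

    scored_docs: list of (doc_id, score) pairs.
    High scorers go to senior reviewers, the middle band to lower-cost
    contract reviewers, and low-potential documents are culled.
    """
    tiers = {"senior": [], "contract": [], "culled": []}
    # Sort descending so each tier is already in review-priority order
    for doc_id, s in sorted(scored_docs, key=lambda d: d[1], reverse=True):
        if s >= senior_cutoff:
            tiers["senior"].append(doc_id)
        elif s >= cull_cutoff:
            tiers["contract"].append(doc_id)
        else:
            tiers["culled"].append(doc_id)
    return tiers
```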

Predictive coding also allows users to take a more systematic approach to Quality Assurance (“QA”) in litigation review. Rather than doing a simple random test, the technology allows the firm to cross-match the software’s relevance designations against those of the human review. QA then focuses on the “discrepancy” documents, where the software and humans did not agree. This allows the firm to systemize the whole quality process.
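The discrepancy-focused QA step amounts to a cross-match of the two sets of relevance designations. An illustrative sketch, not the product's implementation:

```python
def discrepancy_set(machine_calls, human_calls):
    """Cross-match machine and human relevance designations.

    Both arguments map doc_id -> 1 (relevant) or 0 (not relevant).
    Returns the documents where the two disagree, which become the
    focus of quality assurance.
    """
    return sorted(doc for doc in machine_calls
                  if doc in human_calls
                  and machine_calls[doc] != human_calls[doc])
```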

Given the two-year timeline in which all this has been happening, the change inspired by predictive coding technology is simply phenomenal. We’re seeing this technology spawn a whole series of new and creative business models that are reinventing the business process of eDiscovery.

Editor: Where are we on the adoption cycle? Are we close to the tipping point?

Sharp: We’re certainly past the early adoption stage and are now seeing the first signs that predictive coding is gaining mainstream acceptance. The Equivio solution alone has been used in over a thousand cases. Some of the largest law firms have adopted the software and we’ve seen it used in Second Requests and Department of Justice engagements. Tipping points are hard to predict, but it now seems clear that we have reached the point of no return. Predictive coding has shown the way forward, and there’s no way back to the so-called good ol’ days of “eyes on every document.” Given a legitimate alternative, it’s no longer possible to justify the outrageous costs of traditional linear review.

Editor: What obstacles do you encounter in the market’s take-up of the technology?

Sharp: This is a new technology, so education is challenge number one. The Sedona Group and the Georgetown conference are great examples of how the community is working together to facilitate this process. Beyond education, there are also implementation and usability challenges. These were key factors for Equivio in the design of our new Zoom platform. The whole concept of the Zoom platform is to insulate end users from the underlying complexity so they can use the technology in a manner that is simple, natural and intuitive.

Editor: What are the key lessons learned from your experience to date in developing and implementing predictive coding systems?

Sharp: Defensibility is key. The training flow needs to be sensible and reasonable. For example, the seeding approach – where you feed in 10, 20 or 200 documents that you know to be relevant and ask the system to “find similar” – is problematic. This approach skews the results based on what you know. But what you don’t know may be even more important. So a valid sampling approach is very important. Statistics are also critical. The tool needs to provide a statistically valid approach for monitoring the training process, quantifying results, and, as mentioned earlier, managing incremental loads. It also needs to offer facilities for self-testing, measurement and verification, so that you can be sure the results are not skewed and can confirm that the system has done what it says it has done.

Usability considerations will also come into play. Here you are looking at things like language support, multi-issue cases, and the efficiency of the training process, that is, how many documents are needed to train the system. Some systems require 10,000 or more exemplars, and that’s just for one issue. Others require just 1,000-2,000. Predictive coding applications also need to provide decision-support capabilities for constructing your review set. For example, in order to decide which documents will be culled, the system needs to be able to tell you, with statistical veracity, that, say, the top-scoring 20 percent of the collection will yield, say, 84 percent of the responsive documents.
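The decision-support figure Sharp describes – the top-scoring 20 percent of the collection yielding, say, 84 percent of responsive documents – is a point on a gain curve. A toy computation of that point, assuming validated relevance labels are available (in production they would come from a statistical sample, not full ground truth):

```python
def recall_at_cutoff(scores, labels, fraction):
    """Share of responsive documents captured in the top-scoring fraction.

    scores: predicted relevance score per document
    labels: 1 = responsive, 0 = not (from a validation sample)
    fraction: portion of the collection retained, e.g. 0.2 for the top 20%
    """
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    cutoff = int(len(ranked) * fraction)
    found = sum(label for _, label in ranked[:cutoff])
    total = sum(labels)
    return found / total if total else 0.0
```

Sweeping `fraction` from 0 to 1 traces the full gain curve, which is what lets the system quantify the trade-off behind a culling decision.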

Editor: Where do you think we will be in five years with predictive coding?

Sharp: Predictive coding is generating huge change in the litigation industry. Perhaps the most telling indication of the scope of change is the fact that some firms are already using predictive coding as a replacement for first-pass review. This can work, but the industry will need time to crystallize and assimilate a set of best practices that can facilitate the universalization of this trend. It will be interesting to see how far we are down this path in five years’ time.

Please email the interviewee with questions about this interview.