Better Analytics Drives Excellent E-Discovery Solutions

Friday, May 18, 2012 - 16:14

The Editor interviews Kurt Michel, President and CEO, Content Analyst Company, LLC.

Editor: Please tell us about your background and about Content Analyst.

Michel: I am the president and CEO of Content Analyst, which is a developer of advanced analytics solutions and enabling technology for use by our partners. Content Analyst is not an e-discovery company and we do not prescribe a particular e-discovery process. Our technology has more global applications. Our partners are the e-discovery solution providers, and they use our enabling technology to develop innovative services that deliver value to their end users.

I saw the value of early-stage optical and electronic imaging solutions during my college years and decided to focus my professional career on changing the world of information management. Working for numerous Enterprise Content Management companies, including Oracle and Stellent, for almost twenty years, I learned how to intelligently apply technology in a well-designed workflow to improve the quality and efficiency of existing business processes. When I was introduced to Content Analyst’s technology, I saw the potential it had to improve a wide variety of business processes, including e-discovery, and decided it was the best place to focus my professional efforts.

Editor: Please describe the Content Analyst Analytical Technology (CAAT) enabling technology.

Michel: CAAT enables companies to adopt multiple analytic capabilities for managing information governance and e-discovery, including organizing large document collections and identifying information of relevance to the specific task at hand.

CAAT offers basic organization capabilities, such as email threading and near-duplicate detection, and the dynamic clustering function allows users to organize documents based on conceptual meaning. Our powerful conceptual search technology overcomes many limitations of traditional Boolean search and allows users to query for an idea and find related documents, regardless of the actual words in those documents. The technology leverages critical human expertise to train the engine on issues of interest and then finds responsive documents at a level of precision and speed that humans can’t match.
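
The interview does not describe how CAAT performs email threading. As a rough illustration only, one common approach follows the standard Message-ID and In-Reply-To email headers to group replies under their original message. The message data and the `thread_root` helper below are made up for this sketch.

```python
from collections import defaultdict

# Made-up messages: (message_id, in_reply_to, subject).
messages = [
    ("<1@example.com>", None, "Contract draft"),
    ("<2@example.com>", "<1@example.com>", "RE: Contract draft"),
    ("<3@example.com>", "<2@example.com>", "RE: Contract draft"),
    ("<4@example.com>", None, "Lunch Friday"),
]

# Map each message to the message it replies to (None for a new conversation).
parent = {mid: ref for mid, ref, _ in messages}

def thread_root(mid):
    """Follow reply links upward until reaching a message with no known parent."""
    seen = set()  # guards against malformed, circular reply chains
    while parent.get(mid) in parent and mid not in seen:
        seen.add(mid)
        mid = parent[mid]
    return mid

# Group every message under the root of its conversation.
threads = defaultdict(list)
for mid, _, subject in messages:
    threads[thread_root(mid)].append(mid)

for root, members in threads.items():
    print(root, "->", members)
```

Here the three "Contract draft" messages end up in one thread and the unrelated message in another, which is the grouping a reviewer would want to see as a single conversation.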

Editor: How can companies access your platform of capabilities?

Michel: Companies gain access through our partners. We license our technology to partners who access CAAT’s functionality through application programming interfaces (APIs) and then integrate it into their own solutions. We provide an extensive library of APIs, which are designed to be very flexible and used in an iterative fashion; thus, in creating these solutions, our partners are limited only by their own imagination and vision. For instance, a partner might want to offer a workflow that includes conceptual search, followed by document clustering, then categorization and then conceptual search again. Partners can call those specific functions at any time during their product’s designed workflow and really differentiate their solutions based on the capabilities they select. 

CAAT is a popular choice because the technology is effective and scalable. We place great emphasis on providing well-formed, intuitive APIs that allow our partners to integrate CAAT smoothly and with minimum labor, and our ContentCare program helps partners understand the technology, while leveraging our extensive knowledge of best practices to design a solution. We take the partnering concept a step further by doing joint marketing and sales training, and that’s why we’ve been successful.

Editor: Please describe CAAT’s Latent Semantic Indexing (LSI) technology.

Michel: As a mathematical approach for organizing and deriving insight from large document collections, LSI offers a major benefit over linguistic approaches because it always produces a definitive answer. As a result, answers are consistent, repeatable and defensible.

Further, LSI can uncover latent relationships up to the fifth order in unstructured text, making correlations among words and terms that are used in a non-standard way. CAAT’s technology was first used in the U.S. intelligence community specifically because of its ability to understand and infer the hidden meaning and relationships of terms in large document collections.
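
To make the idea of a mathematical, repeatable approach concrete, here is a minimal textbook sketch of LSI using a truncated singular value decomposition. It is not CAAT's implementation, which the interview does not detail. The four-document corpus, the choice of `k = 2`, and the `query_scores` helper are assumptions made for illustration, and the sketch relies on NumPy.

```python
import numpy as np

# Toy corpus (illustrative only): two documents about vehicles, two about plants.
docs = [
    "car engine",
    "automobile engine",
    "flower petal",
    "flower garden",
]

# Build a term-document count matrix A (rows = terms, columns = documents).
vocab = sorted({w for d in docs for w in d.split()})
row = {term: i for i, term in enumerate(vocab)}
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        A[row[w], j] += 1.0

# Truncated SVD: keep only the k strongest latent dimensions.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k = U[:, :k], s[:k]
V_k = Vt[:k, :].T  # one k-dimensional row per document

def query_scores(text):
    """Fold a query into the latent space; return one cosine score per document."""
    q = np.zeros(len(vocab))
    for w in text.split():
        if w in row:  # words outside the indexed vocabulary are ignored
            q[row[w]] += 1.0
    q_hat = (q @ U_k) / s_k
    denom = np.linalg.norm(V_k, axis=1) * np.linalg.norm(q_hat)
    return (V_k @ q_hat) / denom

scores = query_scores("car")
for j in np.argsort(-scores):
    print(f"{scores[j]:+.2f}  {docs[j]}")
```

In this toy index, the query "car" scores "automobile engine" about as highly as "car engine", even though that document never contains the word "car", because both documents share the term "engine". The two plant documents score near zero. The same input always yields the same decomposition and scores, which is the repeatability the answer above refers to.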

One partner used our engine to analyze open-source newspaper content and was able to uncover terrorist communications. Using the CAAT engine, they identified the use of the term "wedding invitations" by the terrorists to communicate planned activity. They ultimately uncovered planned attacks on individuals identified in the invitations. Because our mathematical approach is language-agnostic, CAAT can decode messages in any language supported in Unicode (the computing industry standard for representing languages).

The practical applications of this technology are impressive and can be used proactively in many contexts. For example, litigators trying to develop discovery strategies in advance of the meet and confer can use CAAT to expand the universe of required terms in a keyword agreement.

While LSI technology is well recognized by academics for being precise, there have been two persistent concerns: scalability and updatability. Along with our partners, Content Analyst has indexed collections of hundreds of millions of documents, making us the only LSI-based solution to resolve the scale issue to such a great extent. Updatability pertains to e-discovery processes that involve rolling collections, i.e., adding new document collections dynamically, and CAAT offers the capability to update an existing index without having to start over each time.
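
The interview does not say how CAAT updates an existing index. One textbook technique for rolling collections is "folding in": a new document is projected onto the latent dimensions already computed, so no new decomposition is needed. The sketch below assumes NumPy; the terms, counts, and the `fold_in` helper are invented for illustration.

```python
import numpy as np

# Existing index: a small term-document matrix (rows = terms, columns = documents).
# Terms and counts are made up for illustration.
terms = ["contract", "merger", "invoice", "payment"]
A = np.array([
    [2.0, 1.0, 0.0],   # contract
    [1.0, 2.0, 0.0],   # merger
    [0.0, 0.0, 2.0],   # invoice
    [0.0, 0.0, 1.0],   # payment
])

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k = U[:, :k], s[:k]
doc_vectors = Vt[:k, :].T  # one k-dimensional row per indexed document

def fold_in(term_counts):
    """Project a new document into the existing latent space (no new SVD)."""
    return (term_counts @ U_k) / s_k

# A newly collected document that mentions "invoice" and "payment" once each.
new_doc = np.array([0.0, 0.0, 1.0, 1.0])
doc_vectors = np.vstack([doc_vectors, fold_in(new_doc)])

print(doc_vectors.shape)  # (4, 2)
```

Folding in is an approximation: the latent dimensions still reflect only the original documents, so an index that has absorbed many additions may eventually need to be rebuilt.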

Editor: What metrics or visuals does CAAT offer?

Michel: As with all of CAAT’s functionality, the metrics and visuals available to end-users are selected by our partners. CAAT tracks a great deal of information about an index, including the number of unique documents/terms and configuration data, such that we can rebuild a given index and prove that our results are consistent, repeatable and defensible.

CAAT also returns relevance rankings so users can assess both which documents are most closely correlated to a query and documents that are related but yielded low scores. While the former delivers obvious benefits, the latter can be very useful in assessing whether any important query terms were omitted.

Another important visual is clustering related documents, the unattended process of grouping documents by their conceptual meaning, regardless of a specific matter or the intent of a given user. The powerful thing about clustering documents and batching them for review is that it allows reviewers to compare apples to apples and focus on one topic at a time, which in turn increases speed and accuracy.

Behind the scenes, the system is tracking information so the indexing process can be repeated: end-users and the judiciary both demand this capability. A key factor here is that Content Analyst’s strong development team – comprising engineers and mathematicians who live and breathe the LSI technology – is well equipped to train our partners on how the technology works and foster confidence in choosing CAAT as the advanced analytics technology solution.

Editor: Why is it important for companies to be able to leverage multiple capabilities for data management and analysis?

Michel: In broad terms, thought leaders like Gartner are adding clarity to terms, such as "Big Data," that cover issues of great importance to global businesses. They have identified “Information Governance” as an overarching issue that includes e-discovery as a subset (compliance and records management are examples of other subsets). As such, it is important to have organizational capabilities that allow businesses to manage issues at a high level.

Fundamental organizational capabilities, such as email threading and textual near-duplication detection, make particular sense during e-discovery review processes because they foster accurate results and create efficiencies. Email threading recreates the consolidated string of messages that allows the reviewer to assess the entire “conversation” and make holistic coding decisions. Textual near-duplication organizes and finds all versions of a document so users can focus on differences that are highlighted by the system. Clustering and batching documents, as discussed above, also fits into this category.
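
As a hedged illustration of textual near-duplicate detection (a generic technique, not necessarily the one CAAT uses), the sketch below compares documents by their overlapping three-word "shingles" using Jaccard similarity, then highlights the differing words with Python's standard `difflib` module. The sample texts and the `THRESHOLD` value are assumptions.

```python
import difflib

def shingles(text, size=3):
    """Return the set of overlapping word n-grams (shingles) in the text."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    return len(a & b) / len(a | b) if (a | b) else 1.0

# Made-up sample texts: two versions of one clause and an unrelated message.
v1 = "The supplier shall deliver the goods within 30 days of the order date"
v2 = "The supplier shall deliver the goods within 45 days of the order date"
other = "Lunch on Friday? The new place downtown looks good"

THRESHOLD = 0.5  # hypothetical cut-off for calling two texts near-duplicates

sim = jaccard(shingles(v1), shingles(v2))
print(f"v1 vs v2:    {sim:.2f}")
print(f"v1 vs other: {jaccard(shingles(v1), shingles(other)):.2f}")

if sim >= THRESHOLD:
    # Show only the words that differ between the two versions.
    for token in difflib.ndiff(v1.split(), v2.split()):
        if token[0] in "+-":
            print(token)
```

The two contract versions share most of their shingles and are flagged as near-duplicates, after which only the changed words ("30" versus "45") are shown. That mirrors the review benefit described above: the reader looks at the difference rather than rereading the whole document.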

Importantly, through all of these processes, no documents have been eliminated from the system. These capabilities are aimed at organizing documents intelligently and allowing litigation professionals to make better, more efficient decisions. Attorneys, for example, often get involved with the conceptual search process, which can help early on with developing litigation strategies.

These flexible capabilities address the specific workflow and business process needs, and the benefits work in both directions to determine high or low relevance. Given today’s blurred line between work and personal email, assigning lower priority to personal junk mail can be done in an efficient, automated process before expert resources are tapped during actual litigation.

Editor: Does your system allow dynamic adoption of technologies as issues arise or develop?

Michel: Yes, absolutely. CAAT has a wide set of capabilities that can be integrated into the partner’s solution and then offered in real time. End-users can access a dynamic workflow that will organize documents and then intelligently batch them for efficient review. CAAT can also leverage documents tagged as exemplars to categorize documents with a high level of precision in predictive coding applications. 
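
The exemplar-based categorization mentioned above can be sketched in a few lines. This is a simplified stand-in, not CAAT's method: it compares raw word counts with cosine similarity rather than LSI concept vectors, and the exemplar texts, labels, and `min_score` threshold are all invented for illustration.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts (a simple stand-in for a concept vector)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Reviewer-tagged exemplar documents (made-up text and labels).
exemplars = {
    "privileged": [
        "advice from outside counsel regarding the pending litigation",
        "attorney client communication about legal strategy",
    ],
    "not_privileged": [
        "quarterly sales figures for the northeast region",
        "schedule for the team offsite next month",
    ],
}

def centroid(texts):
    """Sum the term counts of all exemplars in one category."""
    total = Counter()
    for t in texts:
        total.update(vectorize(t))
    return total

centroids = {label: centroid(texts) for label, texts in exemplars.items()}

def categorize(text, min_score=0.2):
    """Return (label, score) for the closest category, or (None, score) if too weak."""
    vec = vectorize(text)
    scored = ((label, cosine(vec, c)) for label, c in centroids.items())
    label, score = max(scored, key=lambda pair: pair[1])
    return (label, score) if score >= min_score else (None, score)

print(categorize("forwarding the legal advice from counsel about the litigation"))
```

The untagged message lands in the "privileged" category because it is closest to those exemplars. A document that resembles no exemplar falls below the threshold and is left uncategorized for a person to decide.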

Privilege is an important concern, as evidenced by production issues in the Oracle v. Google case. CAAT can search a production set and identify documents that are highly correlated to those already tagged as privileged, at which point the litigation professional can make an informed decision on the document. Dynamic adoption of analytics technology can address time-sensitive developments, such as an opponent’s discovery production, facilitating a speedy response. The point here is that the technology is available at any given stage to make the process more efficient and reliable. 

Editor: Can CAAT manage the evolving nature of language – or even multiple languages?

Michel: Yes. This capability is at the heart of CAAT’s value proposition and mathematical approach. For example, the engine would interpret the word “spam” as a canned meat product in documents that were indexed 20 years ago, but it would comprehend a newer meaning, i.e., “junk email,” in a more contemporary index. The engine can infer which meaning applies depending on the context, and it can follow the evolution of new words like “google” (used as a verb) or the concept of “friending” someone on Facebook.

As noted above, CAAT is language-agnostic, which enables corporations to manage global communications. It can batch or index documents by language and search for concepts in one language to find relevant documents in another. While the patented ability to perform cross-lingual searching does require behind-the-scenes work by the Content Analyst partner, those efforts are invisible to the end-user. The results are impressive. CAAT’s engine is a self-learning system that automatically derives comprehension from the document collections in any language – no translation required.

Editor: What cost benefits does CAAT offer?

Michel: The cost benefit of using advanced analytics in e-discovery and other governance applications involves their ability to automate the process and to improve the quality of the results; thus, all the functions we have been discussing can produce efficiencies and cost savings. The technology offers capabilities that (1) facilitate document management from broad data organization to e-discovery document review; (2) assist with developing proactive litigation and governance strategies; and (3) perform what used to be time-consuming conceptual searches with a simple cut and paste.

In essence, CAAT adds a “super reviewer” to litigation and document management teams, one that works 24 hours per day and can read, cull and dynamically index millions of documents before humans get involved. Finally, the technology allows partners to develop solutions that allow organizations to leverage the knowledge gained during e-discovery back into the corporate repository so that all tagging is preserved and the work product can be reused for future litigation.

Editor: What questions should companies ask prospective data management service providers?

Michel: In screening potential solution providers, look for firms that use a consultative approach to understanding the client’s specific needs and then assess the full spectrum of their offerings. Providers offering predictive coding should have a deep understanding of statistical sampling methodologies and experience in the actual use of advanced analytics technology, not simply offer it as a menu item. It is important that they use proven technology that is up to the task. This is where CAAT receives a “Good Housekeeping” seal of approval.
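
As one small example of the statistical sampling knowledge mentioned above, the standard formula for estimating a proportion gives the number of documents to sample for a chosen confidence level and margin of error. The function name and default values below are assumptions for illustration; the interview does not prescribe a sampling protocol.

```python
import math
from statistics import NormalDist

def sample_size(confidence=0.95, margin=0.02, p=0.5, population=None):
    """Documents to sample so a proportion estimate falls within +/- margin.

    p=0.5 is the most conservative guess for the unknown proportion.
    If population is given, the finite population correction is applied.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    n = z * z * p * (1 - p) / (margin * margin)
    if population is not None:
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)

print(sample_size())                    # 2401
print(sample_size(population=100_000))  # slightly fewer for a finite collection
```

At 95 percent confidence and a two-point margin of error, about 2,400 randomly selected documents are enough regardless of how large the collection is, which is why sampling can validate a review of millions of documents at modest cost.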
