Successful Predictive Coding In An Unfamiliar Linguistic Landscape

Multinational companies engage in cross-border litigation, which means that electronically stored information (ESI) collected for those matters involves multiple languages.

To effectively navigate e-discovery in unfamiliar languages, you must call upon an experienced vendor that can comprehend the diverse meanings and cultural connotations found in other languages. Without this specialization, your costs will increase and your evidence may be incomplete.

Predictive Coding Saves Time And Money

The legal industry has now begun to realize the inefficiency and cost bloat that come with linear document review. Today’s litigation matters involve such an unwieldy volume of ESI that no human could hope to gaze upon every file or document. That’s where technology-assisted review (TAR) can help, through approaches such as predictive coding.

The 2012 RAND study entitled “Where the Money Goes: Understanding Litigant Expenditures for Producing Electronic Discovery” found that roughly 73 cents of every dollar spent on e-discovery goes to document review[1]. With TAR tools, document review can be more accurate, faster and less expensive. This helps to “secure the just, speedy, and inexpensive” determination of cases, the goal stated in Rule 1 of the Federal Rules of Civil Procedure[2].

Large-scale, human-based, linear document review would work wonderfully if you didn’t have to account for human opinion. In a group of ten human document reviewers, there will be a small set of documents that all ten agree are responsive. But what about all of the other documents? If eight or nine reviewers agree that a document is responsive, there’s a good chance that it actually is. But if only three or four of the ten declare a document responsive, is it actually responsive? Would a more experienced lawyer have to intervene at that point to make the final call?
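The reviewer-agreement problem can be made concrete with a short sketch. The function name and the 80 percent and 20 percent cut-offs below are hypothetical, chosen only to mirror the "eight or nine out of ten" and "three or four out of ten" cases described above. They are not drawn from any real review protocol.

```python
def triage(votes, high=0.8, low=0.2):
    """Sort a document by the share of reviewers who marked it responsive.

    votes: list of booleans, one per reviewer (True = responsive).
    high / low: hypothetical cut-offs for strong agreement either way.
    """
    if not votes:
        raise ValueError("at least one reviewer vote is required")
    share = sum(votes) / len(votes)
    if share >= high:
        return "responsive"
    if share <= low:
        return "not responsive"
    # Reviewers are split, so a more experienced lawyer makes the call.
    return "escalate"


if __name__ == "__main__":
    print(triage([True] * 9 + [False] * 1))   # strong agreement
    print(triage([True] * 3 + [False] * 7))   # split opinion
```

Every document that lands in the middle band requires a second, more expensive look, which is where the cost of linear review accumulates.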

Predictive coding tools work in a similar fashion, but they reduce the elements of human error, human opinion and human distraction. An experienced lawyer identifies a “seed set” of highly responsive documents, and then a computer uses an algorithm to compare those documents to the greater body of collected ESI. The tool assigns a score to each document based on its similarity to the seed set. A high score means there’s a strong chance the document is responsive. A low score means the document can be set aside, since there’s a low probability that it is responsive.
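As a rough illustration of the scoring idea, the sketch below ranks documents by cosine similarity between simple word counts and a profile built from the seed set. This is an assumption made for illustration only: commercial predictive coding tools use their own, more sophisticated algorithms, and the function names here are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn a document into a bag-of-words count (whitespace tokens)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors, from 0.0 to 1.0."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_documents(seed_set, collection):
    """Score each collected document against the combined seed set.

    Returns (score, document) pairs, highest score first.
    """
    profile = Counter()
    for doc in seed_set:
        profile.update(vectorize(doc))
    scored = [(cosine(profile, vectorize(doc)), doc) for doc in collection]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    seeds = ["price fixing agreement with competitor",
             "agreed price with competitor"]
    docs = ["lunch menu for friday", "competitor price agreement draft"]
    for score, doc in score_documents(seeds, docs):
        print(f"{score:.2f}  {doc}")
```

Note that this sketch splits text on spaces, which works for English but not for languages written without spaces between words.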

The more experienced lawyers in the matter can then start their investigation by looking at the documents the predictive coding tool deemed to be more responsive. Then, if necessary (and only if necessary), the lawyers can proceed to examine the documents deemed to be less or non-responsive.

The Requirements For A Predictive Coding Tool

When considering a predictive coding tool, you should ask whether the tool supports both “supervised” and “active” learning processes. A reasonable predictive coding plan will require several iterative learning cycles in which the technology identifies responsive files and a lawyer then rates how well the tool performed. Based on that feedback, the technology improves the accuracy of its responsiveness predictions.

It is also imperative that the predictive coding tool be able to produce detailed reports and metrics at every stage of the project. Not only will these reports assist in defending the overall approach, they will also help to validate the results throughout the project’s lifecycle.

Lastly, you should inquire as to all forms of technology-assisted review that a vendor offers since it may not be enough to simply ask for “predictive coding.” UBIC’s Lit i View™, for example, offers a clustering scheme based on predictive coding that helps to prioritize documents for human review. This unique technology combination could provide a distinct advantage in certain matters.

Predictive Coding With English ESI

If all of your collected ESI is in the English language, your TAR options are wide open. The vast majority of vendors that offer TAR options are based in the United States and can easily process English-based ESI. Additionally, English-speaking lawyers can certainly read and comprehend documents that are composed in the English language.

But litigation matters are busting borders more and more these days, which means that you will inevitably encounter ESI composed in languages other than English. How do English-speaking practitioners read through non-English ESI? Merely translating the ESI into English is a woefully inadequate approach because important cultural expressions will be completely lost in translation.

E-discovery is experiencing an increase in ESI composed in Chinese, Japanese and Korean, collectively known as “CJK.” Whether the cause is an increase in cross-border litigation or foreign corporations using the U.S. judicial system, the influx of CJK languages in e-discovery presents distinct challenges for both vendors and parties.

The Cultural Challenges Of Collecting ESI In CJK Languages

A few of the cultural complexities associated with e-discovery in CJK languages are multiple word meanings and societal peculiarities evident only to native language speakers. Misunderstanding a cultural idiom could be disastrous when trying to understand a state of mind or the intent of someone authoring an e-mail or document. 

Successfully overcoming these challenges requires a vendor that is completely fluent in CJK languages and cultures. Anything less, and the comprehension of your collected ESI will be severely lacking.

It’s not just the language barriers that pose a hurdle. Vendors such as UBIC understand the cross-border privacy issues and how to diplomatically interact with IT administrators in other countries where security and encryption are handled differently than in the United States. Having a local support team stationed in other countries is essential to successful and defensible data collection.

It is dangerous to dismiss the importance of comprehending the cultural dynamics involved with ESI composed in CJK languages. If you refuse to engage the services of a native speaker, your e-discovery project will be fraught with misinformation and erroneous assumptions.

The Technical Trials Of Processing ESI In CJK Languages

On the technical side, CJK languages rely on different character encoding schemes than most vendors typically process. An encoding scheme allows a computer to use a code to represent the textual characters we read in e-mail messages and documents (similar to how Morse code uses dots and dashes to represent letters). Though Unicode has become the industry standard, a great deal of CJK text is still stored in other encoding schemes such as Shift-JIS (Japanese) and EUC-KR (Korean). Vendors unfamiliar with the specifics of CJK encoding schemes will produce incorrect or incomplete data.
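The effect of an encoding mismatch is easy to demonstrate. The sketch below uses Python's built-in `shift_jis`, `euc_kr` and `utf-8` codecs. The helper name `to_unicode` is hypothetical, and the sample strings are the Japanese and Korean words for "contract document."

```python
def to_unicode(raw, encoding):
    """Decode raw bytes with the given encoding.

    Returns the decoded text, or None if the bytes are not valid
    in that encoding (a sign that the wrong scheme was assumed).
    """
    try:
        return raw.decode(encoding, errors="strict")
    except UnicodeDecodeError:
        return None


if __name__ == "__main__":
    japanese = "契約書".encode("shift_jis")   # bytes as a legacy system might store them
    korean = "계약서".encode("euc_kr")

    # Decoding with the correct scheme recovers the original text.
    print(to_unicode(japanese, "shift_jis"))
    print(to_unicode(korean, "euc_kr"))

    # Assuming Unicode (UTF-8) for the same bytes fails outright.
    print(to_unicode(japanese, "utf-8"))
    print(to_unicode(korean, "utf-8"))
```

A failed decode is the fortunate case, because it is at least visible. Some byte sequences are valid in more than one scheme, so a wrong assumption can also yield silently garbled text that no keyword search will ever match.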

Word spacing is also unique. The English language uses spaces to separate words. Chinese and Japanese sentences do not require breaks between words, which confounds standard indexing technology. The Korean language does use spaces, but its spacing system differs from the common break structures in Western languages.
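A small sketch shows why space-based indexing breaks down. Splitting on whitespace works for an English sentence but returns a Japanese sentence as a single unsearchable token. Overlapping character pairs (bigrams) are shown here as one common, general-purpose workaround. That choice is an assumption for illustration, not a description of any particular vendor's indexing method, and the function names are hypothetical.

```python
def whitespace_tokens(text):
    """Tokenize the way a simple space-based indexer would."""
    return text.split()


def char_bigrams(text):
    """Index overlapping two-character sequences, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]


if __name__ == "__main__":
    english = "the contract was signed"
    japanese = "契約書に署名した"   # "signed the contract"

    print(whitespace_tokens(english))    # four separate words
    print(whitespace_tokens(japanese))   # one undivided token
    print(char_bigrams(japanese))        # includes 署名 ("signature")
```

Bigrams make the text searchable but also produce fragments that are not real words, which is one reason language-aware analysis matters for accurate results.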

It’s not just unfamiliar character encoding schemes and language idiosyncrasies that create challenges with CJK. E-mail collected in the U.S. almost universally comes from either a Microsoft Exchange/Outlook (.PST) or Lotus Notes (.NSF) environment. But e-mail collected in CJK countries might come from numerous other e-mail systems that demand a familiarity rarely found in vendors outside of Asia. Each e-mail platform has its own unique architecture, source code and storage protocols, and anyone unfamiliar with those systems will fail to adequately collect and process the ESI retrieved from them.

There are even differences in how the Windows operating system and Office-suite software are set up to handle CJK input that would be completely mysterious to inexperienced vendors. A vendor that does not understand the complexities of collecting CJK ESI from these systems will deliver an incomplete collection, one that falls below the level of due diligence required by your collection and preservation duties. All of these considerations have major consequences if they are not factored into an overall plan for collecting, processing and reviewing CJK language-based ESI.

It is essential that native language speakers design, implement and troubleshoot the technology necessary to properly review ESI in its native form. The critical element is engaging a vendor who can effectively guide you through an e-discovery project involving CJK. The vendor must understand both the cultural and technical challenges involved with multi-language e-discovery, or else you will end up with a garbled mess of unrecognized files and inaccurate searches.

Utilizing TAR With CJK Languages

UBIC is a pioneer in CJK TAR™ and uniquely positioned to provide both solutions (end-to-end EDRM support) and services for CJK-based e-discovery.

UBIC employs native-speaking, CJK language specialists who can successfully work with IT professionals in Asian countries to collect and preserve relevant ESI. UBIC can process the collected ESI using its proprietary, integrated technology platform, Lit i View, so that the CJK characters are accurately extracted and displayed for review.

Lastly, UBIC’s experienced professionals know how to set up a TAR workflow that takes into consideration the challenges for ESI in CJK languages. UBIC’s Lit i View platform is the only TAR tool that was designed specifically to be used with CJK languages. More importantly, UBIC can assist early in an investigation or litigation matter to establish defensible early case assessment exercises through document review and on through to production.

Modern TAR tools “learn” from sample training data provided by experienced lawyers. The tool then uses morphological analysis and statistical algorithms to find similar documents in the remaining ESI. In matters involving CJK languages, it is critical for both the software and the legal team to recognize the linguistic anomalies that are encountered.

There are a variety of TAR and predictive coding tools on the market, but many of them are ineffective when they encounter the linguistic challenges posed by CJK languages. In these scenarios, it is important to use a hybrid approach from a vendor such as UBIC that can provide a sophisticated, multi-language tool such as Lit i View combined with the cultural familiarity found only in native CJK language speakers.

[1] Nicholas M. Pace & Laura Zakaras, Where the Money Goes: Understanding Litigant Expenditures for Producing Electronic Discovery, RAND Institute for Civil Justice, 2012.

[2] See Andrew Peck, United States Magistrate Judge for the Southern District of New York, “Search, Forward,” October 1, 2011: “In my opinion, computer assisted coding should be used in those cases where it will help ‘secure the just, speedy, and inexpensive’ (Fed. R. Civ. P. 1) determination of cases in our e-discovery world.”


