Editor: Please describe your role at UBIC.
Starrett: As its counsel and chief global risk officer, I pretty much limit my legal role to risk management. I administer that effort as well as maintain contacts with international local counsel to whom we reach out on a periodic basis.
Editor: Tell us about UBIC and the unique services that it provides.
Starrett: UBIC is a full-service company. We offer services across the Electronic Discovery Reference Model (EDRM), which is essentially the five-step lifecycle of electronic evidence in the discovery process. We cover every aspect, from collection of data, through processing of that data for review, to production of the data to opposing counsel.
We use our predictive coding platform and can apply our data analytics tools to everything from mergers and acquisitions to investigations. We can also perform sentiment analysis of data to determine whether people are angry, disgruntled, fearful, or feel guilty. We can also use our technology for business intelligence, and our Xaminer tool uses predictive analytics for forensics investigations.
Editor: I gather that predictive coding is still somewhat controversial.
Starrett: The controversy is due in large part to limited understanding of the technology. If you and I are involved in a lawsuit, I am required to give you all of my relevant data and vice versa. If you are using predictive coding, and I, as your opponent, want everything that is relevant in your possession, I might feel that predictive coding is too black-box and that I might be missing data that would otherwise be due to me. Nevertheless, it is faster, cheaper, and more accurate in finding relevant data than human review. I am also chair of the new Big Data Committee of the American Bar Association, where we plan to address this issue for predictive coding as well as establish best practices for all issues related to the legal profession that concern data science.
Editor: What is Big Data?
Starrett: This unfortunately is a fuzzy topic. It is where the data is of sufficient size or complexity that it exceeds the capabilities of conventional methodologies or technologies. It is very contextual. In some applications, the data is moving at a very high speed, such as when you are trying to find patterns of credit card fraud on sites where there are hundreds of thousands of transactions being reviewed in a very short time. The size becomes big because of the large number of computations that have to happen in a very short period of time. Hundreds of gigabytes, depending on the context, could be considered Big Data.
Editor: What are the sources of Big Data?
Starrett: I think where you find it most is on the web and social media because of the enormity of what is out there. Also, the databases that enterprises have behind their firewalls are large sources of data that have to be looked at. Emails constitute about 90 percent of all evidence used in lawsuits, so they would certainly be a candidate.
Editor: How do you process Big Data?
Starrett: There are mainly two categories of analytics. There is descriptive analytics and there is inferential analytics.
Descriptive analytics enables you to find patterns in data that you might otherwise never know about. For example, let’s assume that you have 100,000 emails. What descriptive analytics will do is to look for documents that are similar to others and then group them into similar clusters or concepts. Thus, it might find an email that is, for example, for a company picnic. It then finds all the company picnic emails and puts them into a cluster of documents. It might find a research and development project that you did not know existed. It will find those and group them together creating a map of all the data, which enables you to find things that you had no idea were there because the software found patterns that were intrinsic to the data. You can then make correlations and further assess what is present in the data.
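[Editor's illustration] The grouping described above can be sketched in a few lines of Python. This is a minimal sketch and not UBIC's method: it assumes documents are compared by the share of words they have in common (Jaccard similarity) and grouped greedily. The function names, the sample emails, and the 0.3 threshold are all hypothetical.

```python
import re

def tokens(text):
    """Return the set of lowercase words in one document."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Share of words two documents have in common (0.0 to 1.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def cluster(docs, threshold=0.3):
    """Greedy single-pass clustering.

    Each document joins the first cluster whose seed document is similar
    enough; otherwise it starts a new cluster. The threshold is an
    assumption chosen for this toy example.
    """
    clusters = []  # list of (seed_tokens, [document indices])
    for i, doc in enumerate(docs):
        t = tokens(doc)
        for seed, members in clusters:
            if jaccard(seed, t) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((t, [i]))
    return [members for _, members in clusters]

# Hypothetical emails: two about a picnic, two about a research project.
emails = [
    "Company picnic this Saturday, bring your family",
    "Reminder: company picnic this Saturday at the park",
    "Prototype results for the research project",
    "Research project prototype results are attached",
]
groups = cluster(emails)
print(groups)  # [[0, 1], [2, 3]]
```

No one told the code that picnics or research projects exist. The two groups emerge from word overlap alone, which is the sense in which the patterns are intrinsic to the data.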
Inferential analytics is basically predictive coding. It takes a small sample of a large amount of data and draws conclusions from that. An example is a poll that says a particular percent of voters are going to vote for a particular politician. It is based on samples that very carefully replicate the population as a whole. And if you can do that, then you can make a very close prediction on how the larger population will behave. The same approach is used to draw conclusions from data. If you take small samples of data that are representative of a much larger data set, you can then draw conclusions about the Big Data.
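[Editor's illustration] The polling analogy can be made concrete with a short Python sketch. It is an illustration under stated assumptions, not a description of any predictive coding product: the collection of 100,000 documents, the 20 percent relevance rate, and the sample size of 2,000 are hypothetical, and the margin of error uses the standard normal approximation for a proportion.

```python
import math
import random

def estimate_rate(population, sample_size, rng):
    """Estimate the share of relevant documents from a simple random sample."""
    sample = rng.sample(population, sample_size)
    p = sum(sample) / sample_size
    # Normal-approximation 95% margin of error for a proportion.
    margin = 1.96 * math.sqrt(p * (1 - p) / sample_size)
    return p, margin

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical collection: 100,000 documents, 20 percent relevant (1 = relevant).
population = [1] * 20_000 + [0] * 80_000
true_rate = sum(population) / len(population)

# Review only 2,000 documents and infer the rate for the whole collection.
estimate, margin = estimate_rate(population, 2_000, rng)
print(f"estimated {estimate:.3f} +/- {margin:.3f}, actual {true_rate:.3f}")
```

Reviewing 2 percent of the collection yields an estimate that should land within a couple of percentage points of the true rate, provided the sample is drawn at random so that it is representative of the whole.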
Editor: Does Big Data sometimes include Chinese, Japanese and Korean (CJK) characters?
Starrett: Yes, definitely. As we become more international, the borders start to blur, and there is a lot of activity with Asian countries – contracts, mergers and acquisitions, lawsuits, government investigations – all within the borders of China, Japan, Korea, Taiwan and other countries. Because of that, the evidence that you are looking for and the data that is relevant are largely in these other languages. Sometimes it is Big Data in CJK characters. Some emails may be multilingual with three or four different languages in an email or a document.
In order to properly identify those CJK characters, you have to understand the encoding because if you do not map the ones and zeros in the computer’s memory to the proper character type, it will throw everything off. Properly and accurately identifying the character encoding is key. The particular language used is listed in the document itself. So when our software finds an email, in the vast majority of cases it knows the language being used. Where it cannot do that, we have people who are multilingual who can say, for example, this is Japanese and not Korean.
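[Editor's illustration] The encoding problem described above can be shown with Python's built-in codecs. This is a minimal sketch: the sample text and the list of candidate encodings are assumptions chosen for the example, and `decode_with` is a hypothetical helper, not part of any UBIC tool.

```python
def decode_with(raw, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None

text = "日本語"  # "Japanese language"
raw = text.encode("shift_jis")  # the ones and zeros as stored on disk

# Try the same bytes under several candidate encodings.
results = {enc: decode_with(raw, enc) for enc in ("shift_jis", "utf-8", "euc_kr")}

for enc, decoded in results.items():
    status = "matches original" if decoded == text else "wrong or undecodable"
    print(f"{enc}: {status}")
```

Only the correct encoding recovers the original characters. A wrong guess either fails outright or yields unrelated characters, and a decode that succeeds without error is not by itself proof that the guess was right, which is why human review remains the fallback.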
The technical challenges I just mentioned are those UBIC is uniquely qualified to handle because we started in Japan and the Asian continent 10 years ago rather than starting here in the U.S. and going there.
Editor: In legislative and regulatory proceedings, it is frequently helpful for lawyers to locate Big Data in support of their positions. Can UBIC help in finding such data where CJK characters are involved?
Starrett: What we would do depends on where the data is located. Different types of data are handled differently when it comes to their collection and review by attorneys and then their production for a particular purpose. There are different system nuances depending on the purpose. These are all issues that we can handle, and they come up often.
Editor: In many cases, the correlations resulting from processing Big Data do not reveal the personal identification of those who provide such data or of those to whom such data pertains. Is there a danger that the language of U.S. and foreign privacy and data protection laws may prevent the data from being collected and processed even though it is not associated with any kind of personal information?
Starrett: Privacy is one of the biggest concerns that we face with cross-border efforts to collect, investigate or process data, particularly in European Union countries. There are restrictions on transporting the data across borders. What you usually have to do is get in touch with the authorities who administer the privacy regulations and work with them. It is frequently possible to use data analytics to identify what is private and what is not. If it is not private or is anonymous (meaning there is no personal information involved), we are free to use it. We are careful to comply with all applicable privacy laws.
Editor: Without identifying any specific clients, what do you find are the most frequent uses of Big Data?
Starrett: Most Big Data is unstructured data found, for example, on social media sites. Most activity in Big Data now is in the unstructured context – 70 percent is a number that I have heard. I think you will find a lot of this is happening in compliance and regulatory environments where they need to be able to find private data on an ongoing basis or detect terrorist threats. With all this stuff that comes across the web, emails and blogs and information on social media sites, they need to look at all that data in a very short period of time. And there are data analytics that can crunch those large amounts of data in short periods of time looking for threats so that they can defuse something before it happens.
Editor: So if a lobbyist came to you in order to make an argument to a regulatory agency or to a legislature, you could also use the kind of analytics you have been talking about to assist them in making their case.
Starrett: It very much depends on the specifics as far as the time and the budget of the professionals that would be needed at the outset and subsequently throughout the process. It is a matter of understanding the problem and then using the analytics tools I discussed to identify the data that would be most helpful to them in making their case. So yes, it would be possible to use our technology and consulting services to collect the information needed.
Editor: Do your firm’s services for lawyers go beyond e-discovery?
Starrett: Absolutely, and we have an effort underway now, a fairly significant one, to do just that. We are initially looking at areas that are more focused on the legal profession – mergers and acquisitions – and sentiment analysis to look for threats. But this technology, as long as you understand the type of data you need to review and apply the proper workflow, can be applied to just about anything. We do not see any necessary boundary to where we might take the technology and do not plan on holding ourselves back.