Deconstructing Big Data Through Data Science

Monday, December 22, 2014 - 08:51

“Big data” is a key buzz term circulating among legal professionals. But do we have a proper grasp of what big data really means? Do we have a solid enough 30,000-foot view to know how best to tackle a big data problem? Rather than risk a misunderstanding, let’s consider three important foundational topics:

  • What is “big data”?
  • Understanding “unconventional” methods – data science
  • Remember… it’s all forensic

Defining Big Data

While the definition of big data varies greatly, it certainly covers more than just “big.” A definition that covers adequate ground is, “Data that is too large, complex or fast moving for conventional methods to handle.”

Data that is “too large” is rather easy to grasp, though there is no particular data volume that demarcates “large.” It is contextual. Complex data need not be “big.” A spreadsheet with several hundred thousand rows might only be a few megabytes. However, it may contain data so varied and complex that conventional approaches cannot handle it. Data moving across a network may not be large or complex but may move at such high speeds that it is difficult to track. Data in an actively used database may change drastically in a short period of time. Thus, “big” means many things in the data world.

Data Science – an “Unconventional” Solution

What is meant by “conventional” methods? Again, this is contextual, but an example may prove helpful. As we will see later, the discipline of data science comes to the rescue in the form of a series of “unconventional” solutions.

Legal professionals are well aware of e-discovery. Generally, when a plaintiff files a lawsuit, discovery rules require each party to provide its opponent with all relevant evidence it possesses. Electronically stored information (ESI) nowadays can prove to be the lion’s share of relevant data. Indeed, one of the most expensive aspects of this process is having scores of attorneys plod through voluminous data to identify what is relevant for “production” to their opponent. This is arguably a conventional method.

Many are familiar with predictive analytics (predictive coding). Here, relatively small “samples” of the total to-be-reviewed data population are selected. Then, these same attorneys review only the sample data for relevant items. Their decisions are used to train machine-learning techniques that identify similar relevant documents within the total data population (rather than manually reviewing the entire data population). While oversimplified, the point is that machine learning is generally a much faster, less costly and often more accurate data review method.
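
To make the idea concrete, here is a minimal sketch of that workflow in Python. It assumes a simple word-count (naive Bayes) classifier, which is only one of many possible techniques and is not necessarily what any commercial predictive coding product uses. The sample documents, the function names and the scoring scheme are all hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    # Simplistic tokenizer (assumption): lowercase and split on whitespace.
    return text.lower().split()

def train(labeled_docs):
    """labeled_docs: (text, is_relevant) pairs produced by attorney review.
    Assumes the sample contains at least one document of each kind."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, label in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocab

def relevance_score(text, model):
    """Log-odds that a document is relevant (positive means likely relevant)."""
    word_counts, doc_counts, vocab = model
    total_rel = sum(word_counts[True].values())
    total_not = sum(word_counts[False].values())
    score = math.log(doc_counts[True] / doc_counts[False])
    for word in tokenize(text):
        if word not in vocab:
            continue  # ignore words never seen in the reviewed sample
        # Laplace smoothing avoids zero probabilities for rare words.
        p_rel = (word_counts[True][word] + 1) / (total_rel + len(vocab))
        p_not = (word_counts[False][word] + 1) / (total_not + len(vocab))
        score += math.log(p_rel / p_not)
    return score

# Hypothetical sample already coded by attorneys.
reviewed_sample = [
    ("merger pricing agreement draft", True),
    ("pricing terms for the merger", True),
    ("lunch menu for friday", False),
    ("office holiday party schedule", False),
]
model = train(reviewed_sample)

# Hypothetical unreviewed documents, ranked so likely-relevant ones come first.
unreviewed = ["revised merger agreement", "friday party menu"]
ranked = sorted(unreviewed, key=lambda d: relevance_score(d, model), reverse=True)
print(ranked)
```

In practice the reviewed sample would contain hundreds or thousands of documents, and the ranking would be validated against further attorney review before anyone relied on it.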

To complete this example, suppose it costs $500,000 and one month for twenty attorneys to review 200,000 documents – we could call this an adequate data review using “conventional” methods. However, if we cut the budget or the completion time by half, we could not complete the same review with this “conventional” method. Here, predictive coding is the data science approach that solves our big data problem. The data became “big” because of budgetary and scheduling constraints. This is a prime example of where predictive analytic tools can change the equation.
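
The arithmetic behind this example can be sketched as follows. The sketch assumes, purely for illustration, that cost and throughput scale linearly with the number of documents; the function name `reviewable` is hypothetical.

```python
# Figures taken from the example above.
total_cost = 500_000      # dollars
documents = 200_000
attorneys = 20
months = 1

cost_per_document = total_cost / documents                    # 2.5 dollars
docs_per_attorney_month = documents / (attorneys * months)    # 10,000 documents

def reviewable(budget, attorneys, months):
    """Documents a manual review can cover under a budget and a deadline,
    assuming linear cost and throughput (an illustrative simplification)."""
    by_budget = budget / cost_per_document
    by_time = attorneys * months * docs_per_attorney_month
    return int(min(by_budget, by_time))

print(reviewable(500_000, 20, 1))    # 200000: the original review
print(reviewable(250_000, 20, 1))    # 100000: budget cut in half
print(reviewable(500_000, 20, 0.5))  # 100000: deadline cut in half
```

Under these assumptions, halving either the budget or the deadline leaves 100,000 documents unreviewed by manual methods. That gap is what predictive coding is meant to close.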

Next, we need to have some grasp of data science – our solution toolkit for big data. The definition of data science has many variants and is not easy to encapsulate. However, we can capture the essence of data science through this description:

Data science consists of techniques or approaches found in either descriptive analytics or predictive analytics (inferential analytics). With descriptive analytics, previously unknown information about a large data population is uncovered. A typical approach is “clustering” (e.g., k-means and hierarchical clustering). This method looks for previously unknown patterns within the data and groups similar items into clusters. With some methods, such as k-means, it’s possible to specify how many clusters to identify, and the program attempts to break down the data accordingly. The more clusters requested, the more granular the clustering. After clustering is complete, unwanted data can be removed and more important data can be reviewed first. In this way, clustering, an “unconventional” approach, brings new meaning to a large volume of data.
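
As an illustration, the following is a bare-bones k-means sketch in Python. It assumes each document has already been reduced to a pair of numbers (hypothetical feature values), and it starts from the first k points, a naive choice made only to keep the example short. Production tools use more careful starting points and far richer document features.

```python
import math

def k_means(points, k, max_iterations=100):
    """Group points into k clusters; returns (labels, centroids)."""
    centroids = [points[i] for i in range(k)]  # naive start: first k points
    labels = [None] * len(points)
    for _ in range(max_iterations):
        # Assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we are done
        labels = new_labels
        # Move each centroid to the average of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return labels, centroids

# Hypothetical documents, each described by two made-up feature values.
docs = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]

labels_2, centroids_2 = k_means(docs, k=2)
print(labels_2)  # [0, 1, 0, 1, 0, 1]

# Requesting more clusters splits the same data more finely.
labels_3, centroids_3 = k_means(docs, k=3)
print(labels_3)
```

With two clusters, the documents near (1, 1) and those near (9, 9) fall into separate groups. A reviewer could then set one group aside or prioritize the other.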

With predictive analytics, the task is to make predictions about larger data sets from smaller ones (or yet-to-be-encountered data). We saw this with the e-discovery review above. Properly obtained samples are used to make predictions about what is in larger data sets. This makes certain research tasks much more efficient – the analytics does the heavy lifting where conventional methods fail.
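
One simple form of this is estimating how much of a large collection is relevant by reviewing only a random sample. The sketch below assumes a simple random sample and a normal-approximation 95% confidence interval. The function name, the sample size and the stand-in for attorney judgment are hypothetical.

```python
import math
import random

def estimate_prevalence(population, sample_size, is_relevant, seed=0):
    """Estimate the share of relevant items from a simple random sample.
    Returns (estimate, lower bound, upper bound) for an approximate 95% interval."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sample = rng.sample(population, sample_size)
    hits = sum(1 for doc in sample if is_relevant(doc))
    p = hits / sample_size
    margin = 1.96 * math.sqrt(p * (1 - p) / sample_size)
    return p, max(0.0, p - margin), min(1.0, p + margin)

# Hypothetical collection of 200,000 document IDs. For illustration only,
# every tenth document is treated as relevant; in a real matter this
# judgment would come from attorneys reviewing the sampled documents.
collection = range(200_000)
p, low, high = estimate_prevalence(collection, 1_500, lambda doc_id: doc_id % 10 == 0)
print(f"Estimated relevant share: {p:.1%} (roughly {low:.1%} to {high:.1%})")
```

Reviewing 1,500 documents instead of 200,000 gives an estimate with a known margin of error, provided the sample is truly random and has not been skewed by earlier filtering.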

Other areas of study that either encompass descriptive and predictive analytics or are used in conjunction with them include math, statistics, database engineering, text and data mining, natural language processing and computer programming, among others.

Data Science Is Forensic

Data science is, frankly, no different from the application of other sciences in a legal context – it is “forensic.” The following forensic issues stand out as among the more important:

  • Definitely use qualified data scientists. Many of the methods and approaches can steer one far off course if applied without the requisite skill.
  • Understand where data has been. Has it been filtered using search terms? Has extraneous data been removed? Has data been aggregated or transformed? These and other actions can throw off data science applications significantly.
  • Do not assume that any legal matter will involve only one data science area. A matter may span multiple data types, and the data science approaches will change accordingly.
  • Know when conventional methods are useful and when they need to be supplemented or replaced by a data science approach. Cost, timelines and available skill levels often guide decisions.

Paul Starrett, Esq., is the General Counsel of UBIC North America, Inc. 
