Foreign Language Documents? Limited Data Set? Today’s E-discovery is Unfazed

Though still trying to adapt to a world of social media and emojis, modern e-discovery technology has progressed by leaps and bounds. Document review, for instance, has reached a level of automation that was once unthinkable. By using technology assisted review (TAR) and new machine learning techniques, even the review of small foreign language document sets, with sparse relevant documents, can be automated.

To be sure, foreign language documents do not present much of a challenge to e-discovery technology. Yes, to review different languages, e-discovery platforms first need to be able to render their unique scripts, some of which lack spacing between words. But most of the time, this can be solved by adding support for the relevant Unicode scripts and implementing a tokenization system to delineate boundaries between characters.
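As a rough illustration, the sketch below shows one way a platform might fall back to character n-grams when text, such as Japanese, carries no whitespace between words. The length threshold and the bigram size here are illustrative assumptions, not details of any particular product.

```python
# Minimal sketch: handling unsegmented (e.g., Japanese) text by falling back
# to character n-grams when whitespace tokenization yields nothing useful.
# The length threshold and n-gram size are illustrative assumptions.

def tokenize(text: str, ngram: int = 2) -> list[str]:
    """Split on whitespace when the text is space-delimited;
    otherwise emit overlapping character n-grams."""
    words = text.split()
    if len(words) > 1 and len(text) / len(words) < 12:  # looks space-delimited
        return words
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + ngram]) for i in range(max(1, len(chars) - ngram + 1))]

print(tokenize("the quick brown fox"))  # ['the', 'quick', 'brown', 'fox']
print(tokenize("株式会社の契約書"))        # ['株式', '式会', '会社', '社の', 'の契', '契約', '約書']
```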

Still, rendering foreign languages on screen is far different from reviewing them, a task that can be made harder by the fact that certain languages include many homophones, words that sound alike but have different meanings, as well as a variety of writing systems. But like any machine learning tool, TAR can be taught to “read” almost any foreign language.

“The same techniques that work in English also work in Japanese,” said Jeremy Pickens, chief scientist at Catalyst Repository Systems. He explained that the way TAR predicts relevant documents “is all statistical. It is all just relative counts and algorithms, and the machine no more knows it’s Japanese than it does if it’s English.”

This is because all TAR tools are trained through supervised machine learning. A user, usually a subject matter expert or senior attorney who is able to read the documents’ language, will manually review and rank a core set of documents for relevancy. This core set will then be used as the learning model for TAR, in essence helping it understand what to flag as relevant when reviewing the broader data set.
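In code, that supervised setup looks roughly like the sketch below: a human-coded seed set trains a text classifier, which then scores the unreviewed collection. The library (scikit-learn), the toy documents, and the character n-gram features are assumptions made for illustration, not Catalyst’s actual implementation.

```python
# Sketch of the supervised-learning setup described above: a reviewer codes a
# seed set, a classifier is fit to those decisions, and the model then scores
# the remaining collection. Library and parameters are illustrative choices.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

seed_docs = ["契約の解除条項について", "ランチのお誘い", "損害賠償の請求", "週末の予定"]
seed_labels = [1, 0, 1, 0]  # 1 = relevant, 0 = not relevant, coded by a human reviewer
unreviewed = ["賠償額の算定に関するメモ", "飲み会の案内"]

# Character n-grams sidestep word segmentation, so the same pipeline handles
# Japanese and English alike: "it is all just relative counts and algorithms."
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3))
model = LogisticRegression()
model.fit(vectorizer.fit_transform(seed_docs), seed_labels)

# Rank the unreviewed documents by predicted probability of relevance.
scores = model.predict_proba(vectorizer.transform(unreviewed))[:, 1]
for doc, score in sorted(zip(unreviewed, scores), key=lambda pair: -pair[1]):
    print(f"{score:.2f}  {doc}")
```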

Because of its reliance on a core learning model, TAR 1.0 often requires that one first have a high volume of documents and, within that, a fair amount of relevant material. That is clearly not the case with the continuous active learning protocols of TAR 2.0. But what happens if the data set is small, with sparse relevant material?

It was a situation Pickens and his colleague, Thomas Gricks, managing director of professional services at Catalyst, recently came across when a client approached Catalyst’s Tokyo offices to test the efficiency of using TAR to review a small set of Japanese language documents.

Pickens and Gricks were able to run a simulation that showed that the data set could be reviewed with a high level of accuracy. They achieved this accuracy by using continuous active learning (CAL), an advanced machine learning technique that is catching on among e-discovery providers, some of whom refer to it as TAR 2.0.

Pickens explained that CAL is different than “TAR 1.0 because instead of stopping after 2,000 or so documents, [the human reviewer] continues to feed back the documents as they are being reviewed. So there is no training phase followed by a review phase; training is review and review is training.”

Gricks added that with CAL, “every single coding decision you make on the document is used to consistently train the machine.”
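A minimal sketch of such a CAL loop, under the same illustrative assumptions as above, might look like this. The `human_review` callback and the batch size are hypothetical stand-ins for the reviewer and the platform’s queueing; the key point is that there is no separate training phase, since every coded batch is folded back into the model before the next batch is chosen.

```python
# Sketch of a continuous active learning (CAL) loop: every coding decision is
# fed back into the model, which is refit before selecting the next batch.
# `human_review` and `batch_size` are hypothetical stand-ins.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def cal_review(documents, human_review, seed_labels, batch_size=10):
    """documents: list of strings; human_review(doc) returns 1 (relevant) or 0;
    seed_labels: dict of index -> label that must contain both classes."""
    vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3))
    X = vec.fit_transform(documents)

    labeled = dict(seed_labels)  # index -> 0/1, starting from a small seed
    while len(labeled) < len(documents):
        model = LogisticRegression().fit(X[list(labeled)], list(labeled.values()))
        unreviewed = [i for i in range(len(documents)) if i not in labeled]

        # Relevance feedback: queue the highest-scoring unreviewed documents.
        scores = model.predict_proba(X[unreviewed])[:, 1]
        batch = [unreviewed[j] for j in np.argsort(-scores)[:batch_size]]

        # Training is review and review is training: each decision goes
        # straight back into the training set for the next iteration.
        for i in batch:
            labeled[i] = human_review(documents[i])
    return labeled
```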

Such a technique makes it possible to automate review in smaller data sets. But while many have developed CAL tools, including e-discovery providers and university researchers, not all such tools are made the same. Pickens, for example, noted that there is no one silver bullet to making CAL more efficient, and in his experience, it took years of research and fine-tuning to get his tool to a particular level of accuracy.

Some of this fine-tuning over the years has turned up interesting findings. Pickens noted, for instance, that “counterintuitively, everyone wants to have email threading in their [e-discovery] systems… but email threading with CAL makes it work less effectively.”

Overall, what his research has shown is that “knowing to turn off certain selected pieces of technology” is as important with CAL as enabling them.

And though there are many factors to this fine-tuning, for Gricks, there are two overarching things that will almost always improve CAL’s accuracy.

The first is reinforcement learning, or letting the platform know when it accurately flags a relevant document throughout the continuous review process. The second is what he calls “contextual diversity,” which means having the CAL tool “review all types of documents in a collection, and not just the stuff you are looking for, which would be a complete relevancy feedback approach.” Training the tool on a larger data set, especially one that has stark examples of what is not relevant, he explained, gives it a deeper understanding of what is relevant.
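One plausible way to combine those two ideas, relevance feedback plus what Gricks calls contextual diversity, is sketched below: most of each batch comes from the top of the relevance ranking, while a slice is reserved for documents least similar to anything already coded. The 80/20 split and the cosine-similarity measure are illustrative assumptions, not Catalyst’s actual algorithm.

```python
# One plausible reading of "contextual diversity": alongside the top-scored
# documents, also queue documents least similar to anything already reviewed,
# so the model sees stark examples of what is not relevant.
# The batch split and similarity measure are illustrative assumptions.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def select_batch(X, labeled_idx, unreviewed_idx, scores, batch_size=10, diversity_share=0.2):
    """X: document-term matrix; scores: model relevance scores for unreviewed_idx."""
    n_diverse = int(batch_size * diversity_share)
    n_relevant = batch_size - n_diverse

    # Relevance feedback portion: highest-scored unreviewed documents.
    by_score = [unreviewed_idx[j] for j in np.argsort(-np.asarray(scores))]
    batch = by_score[:n_relevant]

    # Contextual diversity portion: documents least similar to anything
    # already coded, regardless of their predicted relevance.
    novelty = cosine_similarity(X[unreviewed_idx], X[labeled_idx]).max(axis=1)
    by_novelty = [unreviewed_idx[j] for j in np.argsort(novelty)]
    batch += [i for i in by_novelty if i not in batch][:n_diverse]
    return batch
```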
