Bridging The Gap in eDiscovery - The Emergence of Conceptual Semantic Search
by Jeffrey Parkhurst
Much has been written lately about the volume of data that must be sifted through in today’s litigation. For decades the “gold standard” for sifting through document populations has been keyword searching. Of late there has been a movement to understand the limitations of keyword searching and to replace it with more advanced and cost-effective methods of data gathering and parsing.
Recently, the focus has been on the positive impact that Technology Assisted Review, and specifically Predictive Coding (a more refined type of TAR), can have on our industry. These technologies are not without controversy. The theory is easy: Predictive Coding involves the use of software to help identify potentially relevant documents, thus reducing the volume of documents that need to be examined.
The Courts are finally beginning to weigh in on technologies that augment human document review and keyword searching to find responsive information. Corporate data volumes are estimated to be growing at 40% a year. As expected, ESI volumes are also growing exponentially, yet lawyers must remain in compliance with the Federal Rules of Civil Procedure, which govern process to “... secure the just, speedy, and inexpensive determination of every action and proceeding”.
The only way to achieve this is to look outside the box and identify technologies that can bring greater proportionality and cost savings to the discovery process. While Predictive Coding is the most often discussed alternative today, it may also be somewhat outdated in its current iterations. Given that there are virtually limitless variations of training algorithms used in predictive coding, and vendors frequently tout their approach as being superior to other variations, arriving at a “standard” model of predictive coding that produces consistent results has been difficult.
This article is about data analytics. Data analytics have been around for decades. Virtually all corporations and consulting organizations have analytics technology. Certainly it is being used to sample and review the efficacy of functions within organizations.
However, most of the data being analyzed is not currently text-based. Text or semantic data analytics have been around in one form or another for well over a decade, but have largely gone unimplemented because the need was not there.
In a world that creates 1.3 zettabytes of new text annually, we now face a crisis of "not knowing what we know." With the rise of keyword indexing decades ago in corporate America, our law firms and our EDiscovery populations have deployed keyword searching to solve our knowledge problems.
Therefore, when I read Judge Shira Scheindlin’s recent opinion, which is described below, I saw the opportunity to open up the discussion to consider a better technology. I have chosen to focus on conceptual semantic search technology as a more comprehensive search technology that extends beyond keyword searching and predictive coding tools.
When conceptual semantic search is used in conjunction with predictive coding methodology and document review software systems, it forms the foundation for a very strong platform in EDiscovery that should improve everyone’s understanding of the search results and help identify the truly important documents. In effect, what is the best way to enhance existing systems and help alleviate some of the concerns that surround current keyword search term limitations and predictive coding complexities?
Keyword Searching and the Courts
U.S. Magistrate Judge Andrew J. Peck’s opinions on keyword searching continue to be at the forefront of EDiscovery. Discussions go beyond his landmark Da Silva Moore v. Publicis Groupe opinion approving of the use of predictive tagging technology. Judge Peck recently further refined his opinion when he said:
"In too many cases, however, the way lawyers choose keywords is the equivalent of the child's game of 'Go Fish' ... keyword searches usually are not very effective."
In July, U.S. District Court Judge Shira Scheindlin of the Southern District of New York quoted Judge Peck in her opinion regarding National Day Laborer Organizing Network v. United States Immigration and Customs Enforcement Agency, No. 10 Civ. 3488 (SAS), 2012 U.S. Dist. LEXIS 97863 (S.D.N.Y. July 13, 2012):
"Simple keyword searching is often not enough: 'Even in the simplest case requiring a search of on-line e-mail, there is no guarantee that using keywords will always prove sufficient.' There is increasingly strong evidence that '[k]eyword search[ing] is not nearly as effective at identifying relevant information as many lawyers would like to believe.'"
The opinion offered further advice on "emerging best practices," encouraging collaboration between parties and technology-assisted review. Judge Scheindlin continued, citing the "shortcomings" of keyword searching:
"There is a 'need for careful thought, quality control, testing, and cooperation with opposing counsel in designing search terms or keywords to be used to produce emails or other electronically stored information.' And beyond the use of keyword search, parties can (and frequently should) rely on latent semantic indexing, statistical probability models, and machine learning tools to find responsive documents."
The Current State of Predictive Coding
Predictive coding sounds promising on its own merits, but the judiciary, lawyers and technologists have still been unable to come together to completely accept this newer technology. Why? I believe that there are a number of reasons:
1. Having been involved in the legal industry for 30 years, I have concluded that the legal system as a whole is not a leader or a fast adopter of new technology. "Precedent" is strongly woven into more than just the judicial interpretation of a set of facts; it is part of the fabric of every stage of the practice of law. Change is difficult to implement.
2. Predictive coding is highly scientific and difficult for many lawyers to understand. It is based on math, with complicated algorithms and advanced statistics. As many of my lawyer friends are fond of saying, "I became a lawyer because I wasn’t good at math".
3. It is proprietary in nature. Companies have spent millions creating software that they feel represents the best approach to the problem based on their experience and the interpretation of that experience by highly intelligent program designers. So no one wants to open up the "black box" and reveal exactly how their system operates, requiring us to take it on faith that the results are valid and accurate.
4. Each company uses algorithms in a slightly different manner to produce a set of documents that they are sure are better than the competition. This means that no two products will produce the same set of search result documents, which is troubling to attorneys and judges when discussing the completeness of document productions.
5. Predictive coding involves implementing a series of complex steps leading to the production of a final set of responsive documents. Some of the steps include identifying a small set of training documents (also known as seed documents), reviewing that subset using humans to train the machine, and then evaluating the system results, including complicated sampling and confidence ranking… and then repeating the process with further refinement. It can be time-consuming and requires attorneys to learn how to effectively use the software to obtain good results.
How Can We Improve on the Existing Model?
I believe that while legal professionals are focused on EDiscovery and how to improve document retrieval, we need to recognize that what we are quickly encountering is a problem with ‘Big Data’. The number of documents which must be quickly analyzed in the discovery process has grown dramatically over the last few years. It goes far beyond the documents created or received by a few key custodians.
In addition to the documents that may be within the direct control of a corporation and its employees, it has grown to include vast quantities of email, social media, document repositories and other information that are exchanged on the Internet and often controlled by third parties.
It has truly become a world of Big Data that keyword search engines just can’t handle effectively or accurately. As the data volume expands, it becomes harder for attorneys to know what documents to look for in the first place, in order to find seed documents. We don’t know what we need to know in order to make the process work effectively.
Conceptual semantic searching offers some of the best approaches to early data analytics, giving users a distinctive visual overview of unstructured data and a way to analyze it. It is very visual and fast, and it provides an abundance of information about your documents in the early phases of EDiscovery.
The goal of analyzing data is to simply identify information that is responsive to your request, quickly, effectively and efficiently. In fact, semantic searching should help educate you about what might be important about a topic, rather than the other way around.
What is Conceptual Semantic Search?
Semantic search goes far beyond keyword search. It seeks to improve search accuracy by understanding the intent of the searcher and the contextual meaning of terms as they appear in a searchable text. Semantic search processes consider multiple points of information simultaneously, including context of search, location, intent, word variation, synonyms, generalized and specialized queries, concept matching and translation of natural language queries, to provide a result set of documents. In summary, while full-text searching lets you query exact words in a document, semantic search lets you query the underlying meaning of the document.
What does this really mean? Using semantic search, there is no particular document that the user knows about that they are trying to uncover. Rather, the user is trying to locate documents that, when examined, will give them the concepts that they are trying to find. The goal of semantic search is to deliver targeted information queried by concept rather than have a user sort through a list of documents loosely related by only the presence of a keyword. This is what has been missing from EDiscovery tools: the ability to quickly and effectively perform complex conceptual analytics on document populations and to help a company and its legal counsel determine what information they have. This conceptual analysis helps litigants determine what data they have, what it means and whether they should litigate or settle the case.
The advantage of semantic search originates in the engine’s ability to match on the meaning of words regardless of what words are used in a user’s query or in a source document. In short, semantic search engines are able to go beyond keyword matching and match on concepts.
Most of the semantic search engines can produce relevant search results that do not contain any of the original query words. In addition, many of the linguistic challenges that typically wreak havoc on keyword engines like polysemy (words with multiple meanings) and synonymy (multiple words with the same meaning) are handled intelligently and naturally by semantic engines.
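To make the contrast concrete, here is a minimal Python sketch. It is a toy illustration of the synonymy problem only, not how any commercial engine works. The CONCEPTS table and the function names are hypothetical; a real semantic engine would derive its concept groupings from the documents rather than from a hand-built list, and handling polysemy would require looking at context, which this sketch does not attempt.

```python
import re

# Hypothetical, hand-built concept table. A real semantic engine would
# learn these groupings from the document population itself.
CONCEPTS = {
    "vehicle": {"car", "automobile", "truck", "sedan"},
    "payment": {"payment", "invoice", "remittance", "wire"},
}

def tokens(text):
    """Lowercase a string and return its set of word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def keyword_search(query, docs):
    """Return ids of documents containing every literal query word."""
    q = tokens(query)
    return [doc_id for doc_id, text in docs.items() if q <= tokens(text)]

def concepts_of(words):
    """Map a set of words to the names of the concepts they belong to."""
    return {name for name, members in CONCEPTS.items() if words & members}

def concept_search(query, docs):
    """Return ids of documents sharing at least one concept with the query."""
    q = concepts_of(tokens(query))
    return [doc_id for doc_id, text in docs.items()
            if q & concepts_of(tokens(text))]

docs = {
    "A": "The car was leased in March.",
    "B": "He drove the company automobile to the site.",
    "C": "Lunch receipts are attached.",
}

print(keyword_search("car", docs))  # ['A']: only the literal word matches
print(concept_search("car", docs))  # ['A', 'B']: the synonym is found too
```

Document B never uses the word "car", so the keyword search misses it, while the concept search returns it because both words map to the same concept.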
While it is much more complicated than I have just written here (doctoral dissertations and lengthy white papers have been written on semantic search technology), the essence is that semantic search differs greatly from keyword search. It is a new way of identifying useful and relevant information.
How is Semantic Search Different?
Keyword searching involves linking keyword queries to taxonomies, lexicons and thesauri as a way to provide a form of structured query. This technique for recognizing relationships in data has been around for years with every new list creator claiming a new breakthrough in learning.
The reality is that these structured systems utilize manually created (but machine generated) lexicons in an attempt to understand meaning and improve search results. In fact, they are not constantly ‘learning systems’ but rather static, non-dynamic systems that require constant updating (re-indexing) in order to produce any type of accurate results. In today’s world of Big Data, it is not possible to constantly update large federated index lists at the same rate that information changes and new documents are added to a dataset for analysis. So the resulting searches become less effective over time.
In summary, semantic search generally does not suffer from the same limitations of keyword indexing technologies. Perhaps most importantly, semantic search is dynamic, with the ability to continually update concepts as new information is introduced to the system.
Semantic search improves your search results by adding concepts that relate to your original search query. It is not limited to keyword lists. You can then instantly review the concepts the system suggests, assign weighting measures marking the relative importance of the documents, eliminate concepts you don’t want to examine at that time and add more concepts of your own for another round of analytics by pressing a button.
Shortly after loading text searchable documents into the system, you can begin to examine them to help you determine what you need to know about your case. Semantic searches actually produce the seed concepts and documents for you by grouping document concept clusters. You can then determine which documents are of most interest and prioritize a review of that material first, immediately examining cluster contents to determine their utility for your case.
Some Current Limitations of Semantic Search Products
Semantic technologies have been around for years. Most technologies deployed today have some limitations depending on the underlying algorithms. Semantic search engines offer a variety of potential benefits to search technology in applications; however there have been a number of early products that contain limitations on their utility. These limitations include things such as:
•A lack of transparency: Most eDiscovery semantic search products are "Black Boxes", with the internal search generation approach hidden from the end users. Search results remain difficult to understand, describe and defend.
•Defensibility: It is very difficult to provide a defense for a set of search results when it is not clear exactly what that result set is based on. If the contents of the Black Box are not opened for examination, it is difficult to convince opposing counsel and the judge that you have provided all the information.
•Lack of Control: In addition to being black boxes, many semantic search engines do not provide the searcher with the easy ability to immediately enhance or interact with the search result or the search query.
•Index Scaling: Some semantic search engines can only search the documents they index and further do not allow other engines to search their indexes. As the volume of content continues to grow exponentially, it is difficult to re-index data sets.
•Intelligence Scaling: Semantic indexes are generally larger than keyword indexes, typically reside in RAM and cannot scale to learn from tens of millions of documents, making it difficult to handle large data sets.
These limitations, together with an understanding of how litigation proceeds, have created the opportunity for a semantic search solution that can overcome one of the most difficult problems facing litigators: the need for an easy to use system that augments the human ability to make legal determinations.
What does "BrainSpace" by Pure Discovery Bring to the Table?
I have been attending Legal Tech New York for well over 18 years. I scour the floor each year looking for something "new", which is not just an updated version of an existing product or a slick repackaging of an old one. About every 4-5 years (my own Moore’s Law for EDiscovery), there seems to be a quantum leap by a forward thinking company that creates what I would consider to be a watershed moment in the legal market. In 2012, that discovery came while I was attending a session on semantic review by a company called Pure Discovery, where I test-drove a product called LegalSuite that deploys their technology called BrainSpace.
Whereas predictive coding technologies rely on linking together pieces of data within specific indexes using industry-specific ontologies, BrainSpace doesn’t try to index data at all. It certainly recognizes when documents contain similar words or concepts, but it also has a wider capability. BrainSpace actually learns about documents the same way that we as humans learn, by processing relationships. This learning includes its ability to determine concepts, interests, people and perspectives.
Since full-text searching engines entered the legal industry in the 1980s, our analysis has focused on "what" was in a document, partly because that was a limitation of the technology. However, today we live in a new "social" world where networks are employed at virtually every client.
We are all wired to our co-workers in multiple ways. We interact with email, documents, data and social media, and material is no longer attributed directly to us; rather, many people in an organization touch it along the way. With data existing in a social world, so does evidence.
The answer is no longer simply what, but perhaps even more importantly, who. BrainSpace provides machine learning of your data, providing information on both what and who.
BrainSpace, through the Pure Discovery Legal Suite (PDLS), acts as a device that exists between a user’s cognitive thoughts and the data sources where the data resides. It is "post-search" technology that presents you conceptual data in a way that is tunable, fast and easy-to-understand. BrainSpace reads your query and analyzes it against what it has learned. BrainSpace also allows you to tune or refine your conceptual search results and outputs a standard Boolean statement showing how the weighted search results were arrived at.
This overcomes the mystery black box result set that plagues many semantic and predictive coding technologies. The system presents the user with a weighted result set which then allows the user to adjust the weight of the various lesser included concepts to get the most complete results possible. This is done through the use of slider bars in the results box.
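The general idea of turning weighted concepts into a reviewable Boolean statement can be sketched in a few lines of Python. This is only an illustration of the principle under my own assumptions: the function name, the example terms and the term^weight notation are hypothetical and are not BrainSpace's actual syntax or implementation.

```python
def to_boolean(weighted_terms, min_weight=0.0):
    """Render weighted concept terms as a Boolean OR statement.

    Terms whose weight is at or below min_weight are dropped, which mimics
    a slider that has been dragged all the way down. The term^weight
    notation is an assumption made for illustration.
    """
    kept = sorted(
        ((term, w) for term, w in weighted_terms.items() if w > min_weight),
        key=lambda pair: (-pair[1], pair[0]),  # heaviest first, then A-Z
    )
    parts = []
    for term, w in kept:
        # Quote multi-word phrases so they read as a single operand.
        quoted = f'"{term}"' if " " in term else term
        parts.append(f"{quoted}^{w:.1f}")
    return " OR ".join(parts)

query = {"kickback": 1.0, "side payment": 0.8, "rebate": 0.4, "lunch": 0.0}
print(to_boolean(query))
# kickback^1.0 OR "side payment"^0.8 OR rebate^0.4

query["rebate"] = 0.0  # the reviewer slides "rebate" down to zero
print(to_boolean(query))
# kickback^1.0 OR "side payment"^0.8
```

Because the output is an ordinary string, it can be saved alongside the result set and shown to opposing counsel or the court, which is the transparency benefit described above.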
What are the Benefits of BrainSpace Semantic Search?
• Saves time : You enter what you do know about the case or topic you’re researching in your search query. The semantic technology uncovers related concepts and terms, educating you about the documents in the database.
• Immediate control over the process
Because the system quickly returns easy to understand visual maps of the contents of your data, you can review, edit, tag and parse information as your knowledge of the contents increases.
• Improved document recall
Focused semantic search quickly returns more results that are related to your search query, whether or not you used the specific terms in your initial search.
• Greater precision
By combining Boolean and focused semantic search technology, your results will be highly relevant to the topics that are important to you.
• Increased transparency
The terms and concepts suggested by the semantic technology are returned to you for your review as the actual search is run. You know exactly what the system is searching on.
Some BrainSpace Considerations
1. The current version of BrainSpace requires that the documents be in a text readable format; it does not read or process native files or non-text documents.
2. BrainSpace requires you to point it at a data set, but it doesn’t care how large or what type of data set it examines. Semantic queries can be formulated to run across virtually any document populations including intranets, extranets, enterprise content management systems, portals, and email archiving.
3. BrainSpace does not create an index. Rather, it creates containers of document "intelligence" with similar concepts which allow you to quickly examine them for relevance.
4. Within approximately an hour of "brain building" (loading) a dataset that includes 1 million documents, you could begin to examine the contents by clustered document container.
5. The BrainSpace query takes a plain language question that you are asking, converts it to a conceptual Boolean search statement and then searches the data set. Results are displayed in concept clusters which allow you to easily identify documents that are responsive to questions you didn’t even know to ask.
6. BrainSpace maintains the Boolean statements it created so that results can be repeated and described to the opposition or the court, revealing how the results were obtained in a defensible manner. It makes transparent to the user what search statement is being sent to the data set.
7. Functionality built into the product:
a. The Semantic Near Dupe Identification Engine detects and groups near duplicate documents, identifying redundant documents that differ only slightly, which reduces review time.
b. It also contains text-based deduping, which goes beyond hash value in identifying duplicate documents by comparing the text of documents, exclusive of metadata differences.
c. Concept clustering is displayed as a "Focus" wheel of relevant containers that can be continually parsed into concept subsets, all the time displaying the concept that brings the documents together. Users can quickly generate visual maps of responsive documents, identify and tag key areas of hot documents and store and/or export these subsets for immediate review by a team.
d. BrainSpace assigns a PDID Document Tagger number that goes beyond Bates numbering. It is a semantic Bates number because it assigns documents with contextual similarity a similar PDID number. This allows the user to sort documents from related containers by using the PDID number.
e. Users are given the ability to add, delete, increase or decrease the importance of all query words in a unique visual query interface as they are examining results.
8. Containers of concept-related documents that have been quickly reviewed, found relevant and judged likely to need further analysis can be tagged and then exported into any document review platform or predictive coding system for a secondary analysis.
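The difference between text-based deduping and near duplicate detection (items 7a and 7b above) can be illustrated with a short Python sketch. This shows two generic, well-known techniques, hashing normalized text and Jaccard similarity over word shingles, and is not a description of how BrainSpace does it. The function names and the 0.7 threshold are my own assumptions.

```python
import hashlib
import re

def normalize(text):
    """Lowercase and collapse whitespace so layout differences vanish."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def text_hash(text):
    """Hash the normalized text only, ignoring any file metadata."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()

def shingles(text, k=3):
    """Return the set of k-word shingles in the normalized text."""
    words = normalize(text).split()
    return {" ".join(words[i:i + k])
            for i in range(max(len(words) - k + 1, 1))}

def jaccard(a, b):
    """Share of shingles two texts have in common (0.0 to 1.0)."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

original = "Please wire the funds to the escrow account by Friday."
reflowed = "Please  wire the funds to the\nescrow account by Friday."
edited = "Please wire the funds to the escrow account by Monday."

# Same words, different layout: a text-based duplicate.
print(text_hash(original) == text_hash(reflowed))  # True

# One word changed: not a duplicate, but a near duplicate.
print(text_hash(original) == text_hash(edited))    # False
NEAR_DUPE_THRESHOLD = 0.7  # hypothetical cut-off, tune per matter
print(jaccard(original, edited) >= NEAR_DUPE_THRESHOLD)  # True
```

A hash of the raw file would treat all three versions as different documents. Hashing the normalized text catches the reflowed copy, and the shingle comparison groups the lightly edited copy with the original so a reviewer only needs to look at what changed.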
Why is this type of analysis different?
One of the most important features of the software is not really a feature, but rather forms the basis for a new approach to using semantic search. BrainSpace has been designed to be an interactive experience or conversation between the user of the system and the machine learning concept search.
The goal of the system is to encourage continued interaction with the data so that the search results continue to educate you about the information in the documents, quickly increasing your level of knowledge. It is less transactional and more interactive and hands on.
Creating a tool that is easy to use and understand means that attorneys can easily spend time with the data, quickly educating them as to concepts and documents that may be relevant to their ongoing discovery. The interaction creates a place to work and learn, rather than a long result set of documents that are similar to a keyword that was preselected.
This level of understanding is made possible by transforming your queries into a QueryCloud, which is a visual portrayal of the newly generated semantic query. It effectively places the user in the center of the transaction, encouraging interaction between the query and the data. Each user query is transformed into a list that shows the most relevant extracted and inferred words and phrases.
The goal of litigators in handling the large volume of data in today’s discovery is to provide the most cost-effective and comprehensive solution to analyzing the data that is potentially involved in EDiscovery and discover what is relevant and why. And the earlier this is done in the process, the better! When used in conjunction with other data management and review tools, semantic search can improve the state of EDiscovery. I have listed 4 key factors that indicate why and how semantic search can be used to improve your handling of Big Data in EDiscovery. It is time to take a long look at how this can impact EDiscovery:
1. Know your data – you have to be aware of what data you have, what it means and how it might impact your case as early in the process as possible. Your knowledge may result in your pursuit of a settlement rather than proceeding to trial based on what you learn. Including semantic searching in your plan dramatically reduces your learning curve by pointing you towards information that is likely relevant, more quickly and easily than other methods.
2. Semantic search improves your results – Semantic search queries take plain language questions that you are asking and convert them to a conceptual Boolean search statement which then examines the data set.
3. Explain your approach – you need to provide an explanation to the opposition and the judge about how you have achieved your search results and why the document population you are turning over is in fact responsive and relevant to the discovery request. This level of search transparency is at the heart of the semantic search product from Pure Discovery, which turns all the plain English search requests into a conceptual Boolean statement that can be clearly understood and replicated when necessary.
4. Be transparent and cooperative – Judges require parties to come to the meet and confer with definitive plans that have been worked out between the parties. They are looking for reasonable and well thought out approaches to discovery that are based on some degree of proportionality.
Results are displayed in concept clusters which allow you to easily identify documents that are responsive to questions you didn’t even know to ask. Semantic searches are dynamic, with the ability to continually update results as new information is introduced to the system. The better knowledge you have, the better you are able to negotiate during the meet and confer to limit document production, understand your case and determine litigation strategies.
Using semantic search as part of your overall preliminary document strategy will help improve your knowledge about the document population and allow you to improve everyone’s understanding of what and how documents have been selected. You will not be taken by surprise at the meet and confer since you will be in control of the information on behalf of your client.
1) The Basics of Predictive Coding
Without going into a complete discussion about predictive coding, the essential element that is relevant to understand is that predictive coding is based on some type of document seeding in order for the machine to "learn" what kinds of things you are interested in finding. The legal team puts together several representative populations of documents dealing with key areas of interest and the machine begins to locate documents of a similar nature. Predictive Coding requires:
• Input from case experts: both substantive legal issue and software consultants
• Keyword analytics to first locate important documents and create seed sets for the machine to use as their matching sets.
• A defined workflow that includes strong statistical sampling analysis to help ensure accurate results
• Iterative rounds of machine "learning" (augmented by software and case experts) to find other documents that are "like this" based on keywords and some concepts.
Predictive coding is not designed to replace human review of documents; it is meant to optimize the review and help reduce the volume of documents that must be examined during discovery. The output from predictive coding during discovery is the set of documents the computers identify as "related" to an issue identified by the case experts, ranked and tagged so that they can be reviewed by humans for relevance and responsiveness.
One of the advantages of this technology is that you are using human decisions to "teach" the computer to locate documents, increasing the accuracy and relevancy of search sets over time. Whether you call it predictive coding, computer-aided review, or technology-assisted review, it employs a combination of human beings and computer algorithms that are used to determine relevant documents by creating "seed sets" -- and then using the seeds (controlled by algorithms) to have computers produce subsets of responsive documents.
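The statistical sampling step mentioned above usually comes down to two measures, precision and recall, estimated from a random sample that humans have also reviewed. The Python sketch below shows the arithmetic under simple assumptions: the sample counts are made up for illustration, the function names are hypothetical, and the margin of error uses the standard normal approximation for a proportion rather than any particular vendor's method.

```python
import math

def precision_recall(sample):
    """sample: list of (machine_says_relevant, human_says_relevant) pairs."""
    tp = sum(1 for m, h in sample if m and h)      # both say relevant
    fp = sum(1 for m, h in sample if m and not h)  # machine over-includes
    fn = sum(1 for m, h in sample if not m and h)  # machine misses
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def margin_of_error(p, n, z=1.96):
    """Normal-approximation margin of error for a proportion p over n
    items; z=1.96 corresponds to roughly 95% confidence."""
    return z * math.sqrt(p * (1 - p) / n)

# Made-up sample of 200 documents reviewed by both machine and human.
sample = (
    [(True, True)] * 40      # machine and human agree: relevant
    + [(True, False)] * 10   # machine flagged, human disagreed
    + [(False, True)] * 10   # human found relevant, machine missed
    + [(False, False)] * 140 # both agree: not relevant
)

precision, recall = precision_recall(sample)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")

# Recall is measured over the 50 documents the human marked relevant.
relevant_in_sample = sum(1 for _, h in sample if h)
moe = margin_of_error(recall, relevant_in_sample)
print(f"recall margin of error = +/- {moe:.2f}")
```

Precision answers "how much of what the machine flagged was actually relevant", and recall answers "how much of what was relevant did the machine find". Note that the margin of error for recall depends on the number of relevant documents in the sample, not the full sample size, which is one reason sample sizes need careful thought when relevant documents are rare.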
About the Author:
Jeffrey Parkhurst, Consultant, Studeo Legal
Jeff provides consulting and business development leadership to legal service providers regarding new business opportunities and increased service offerings. He delivers consulting to clients on EDiscovery procedures, processes and software alternatives to process discovery data. He writes a weekly blog, Support for Litigation, on EDiscovery and litigation support issues, highlighting discovery trends, consulting services and the impact of recent court rulings on the practice of law.