Sunday, March 13, 2011

Text Mining, Legal Research and Google Books

What does text have to do with quantitative methods?

Text mining is the use of computers and algorithms to classify, associate, or search large amounts of unstructured text data. This is a quantitative method because it involves pattern recognition… a traditionally quantitative subject. Further, the website of one text mining company, Autonomy, states that their software was “built on the seminal works of Thomas Bayes and Claude Shannon…” the fathers of Bayesian statistics and information theory, respectively.

TF-IDF and PageRank

To further demonstrate the quantitative nature of text mining, I present the ‘Term Frequency – Inverse Document Frequency’ (TF-IDF) metric. Documents can be grouped according to the similarity of their contents using this single metric… based strictly on word counts and the count of documents in which each word appears. If a word appears in every document (such as ‘and’, ‘or’, ‘the’, etc.), this metric gives that word a weight of zero, effectively ignoring it. If a word appears in only two documents and appears often in both (such as ‘mince’ in cookbooks, ‘precedent’ in legal texts, or ‘love’ in romance novels), then the word gets a high weight in both, and the metric says those two documents are very similar.
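
To make the arithmetic concrete, here is a minimal sketch of TF-IDF in plain Python. It assumes one common variant of the formula (term frequency as a share of the document's words, times the natural log of total documents over documents containing the word) and compares documents with cosine similarity; the function names `tf_idf` and `cosine` and the three toy documents are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: weight} dict per document (illustrative variant)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    df = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            # term frequency * inverse document frequency
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights

def cosine(a, b):
    """Cosine similarity between two sparse {word: weight} vectors."""
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical toy corpus: two recipes and one legal snippet.
docs = [
    "mince the garlic and mince the onion",
    "mince the beef and the pork",
    "the court and the precedent",
]
w = tf_idf(docs)
recipe_vs_recipe = cosine(w[0], w[1])
recipe_vs_legal = cosine(w[0], w[2])
```

In this toy corpus ‘the’ and ‘and’ appear in every document, so their weight is log(3/3) = 0 and they drop out. The two recipes share the weighted word ‘mince’, so `recipe_vs_recipe` is positive, while the recipe and the legal snippet share only zero-weight words, so `recipe_vs_legal` is 0.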
If you still don’t believe that text mining is quantitative, then check out the Wikipedia entry for PageRank (the link-analysis algorithm behind Google Search). Google prioritizes websites based on the words you pick, the words on each website, and how the sites link to one another, and it uses a great deal of math to do so.
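
The link-based part of that math can be sketched in a few lines. This is a simplified power-iteration version of PageRank as it is commonly described, not Google's production system; the damping factor of 0.85, the fixed iteration count, and the four-page link graph are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to; every
    target must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal rank
    for _ in range(iterations):
        # Every page gets a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            # A page with no outlinks spreads its rank over all pages.
            targets = outlinks if outlinks else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical four-page web: three pages link to "c", nobody links to "d".
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
ranks = pagerank(links)
top_page = max(ranks, key=ranks.get)
```

Here `top_page` comes out as "c", the page with the most incoming links, and "d", which nothing links to, ends up with the lowest rank.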

Existing Applications

The most useful specializations of text mining concern information extraction (How can I determine the type of crime from a police report?) and information retrieval (Which documents in the library are relevant?). These technologies are revolutionizing knowledge work: Google Books makes it possible to search millions of documents for specific phrases and word combinations, Gmail uses email text to target advertising, and the FBI’s Carnivore program was built to scan email traffic passing through Internet service providers for indications of criminal or terrorist activity. “We’re at the beginning of a 10-year period where we’re going to transition from computers that can’t understand language to a point where computers can understand quite a bit about language.”

A Few Other Existing Applications

ClearForest – Software that combs through financial newsfeeds to detect news that will affect company stock prices. Wall Street firms can then write software that, for example, automatically sells an insurance firm’s stock if an earthquake occurs in its coverage zone.
Teneros Social Sentry – Large companies purchase this service, which scans the Twitter, Facebook, LinkedIn, MySpace, Orkut, and blog postings of all employees for criticism of the employer or its customers, inappropriate behavior, prospecting, and potential data loss.
Email Filtering – Your inbox would be full of spam if companies couldn’t filter it out based on textual analysis.
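
As an illustration of the email filtering item above, here is a minimal sketch of a naive Bayes spam classifier, one well-known textual approach (real filters combine many more signals, and I am not claiming any particular provider works this way). The helper names `train` and `classify` and the four training messages are invented for this example.

```python
import math
from collections import Counter

def train(messages):
    """messages: list of (text, label) pairs, label is 'spam' or 'ham'."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for text, label in messages:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def classify(text, word_counts, label_counts):
    """Return the label with the higher log-probability score.

    Assumes the training data contained at least one message per label.
    """
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    total = sum(label_counts.values())
    scores = {}
    for label in ("spam", "ham"):
        n_words = sum(word_counts[label].values())
        # Prior: how common is this label overall?
        score = math.log(label_counts[label] / total)
        for word in text.lower().split():
            # Add-one (Laplace) smoothing so unseen words don't zero out the score.
            score += math.log((word_counts[label][word] + 1) / (n_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical training set.
training = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]
word_counts, label_counts = train(training)
first = classify("free money", word_counts, label_counts)
second = classify("monday meeting", word_counts, label_counts)
```

With this toy data, `first` is "spam" and `second` is "ham": the classifier simply asks which pile of past messages the new words look more like.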

Benefits and Value Quantification
Scalable – Information overload is no longer a limitation.
Breadth – You can search every document ever written.
Uniformity of Approach – Whereas two legal researchers’ criteria may differ, an algorithm is applied uniformly for all instances.
Accuracy – Neither fatigue, distraction, nor habituation will impair an algorithm’s accuracy.
Unbiased – Subjectivity isn’t a problem.
Speed – “Clearwell’s software was used… to search through a half-million documents… [it] analyzed and sorted 570,000 documents… in two days. [It] used just one more day to identify 3,070 documents that were relevant to the court-ordered discovery motion.”
Cost – “In 1978, six television networks paid $2.2 million dollars ($7,752,096 in 2011 dollars) in legal fees to review 6 million documents.” This now costs less than $400,000… roughly a 95% price reduction in inflation-adjusted terms through text mining algorithms and software.
Focus on analysis, not data collection – The founder of Autonomy “estimated that the shift from manual document discovery to e-discovery” would allow one lawyer to do the work previously done by 500, and that the next generation of software will again double productivity.
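
The price reduction quoted in the Cost item can be checked with a quick calculation. This sketch uses only the two dollar figures given in that item; the variable names are mine, and $400,000 is treated as the current cost even though the text says “less than”, so the true reduction would be at least this large.

```python
# Figures taken from the Cost item; both are in 2011 dollars.
cost_1978_in_2011_dollars = 7_752_096
cost_today = 400_000  # upper bound: the text says "less than $400,000"

reduction = 1 - cost_today / cost_1978_in_2011_dollars
print(f"Price reduction: {reduction:.1%}")
```

The result lands just under 95%, which matches the rounded figure above.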


If useful information is an iron needle and information is the haystack, then Google, Autonomy, LexisNexis and their competitors are very powerful electromagnets. What opportunities are created by the ability to find the needle of information in the haystack? Will library science exist a decade from now?

If knowledge is power, how valuable is technology that fights information overload?

What happens in a world where knowledge is instantly accessible from any source? Speeches, webpages, every book ever written… and all of it already prioritized based on your search? With information extraction essentially free, the future will go to those who are the most creative at integrating that information (more to follow on this topic in my blog entry).

“Armies of Expensive Lawyers, Replaced by Cheaper Software.” By John Markoff. New York Times. March 5, 2011.

“Falling Demand for Brains?” By Paul Krugman. New York Times. March 5, 2011.

Text Mining: Predictive Methods for Analyzing Unstructured Information. By Sholom Weiss, Nitin Indurkhya, Tong Zhang, and Fred Damerau. Springer Science+Business Media Inc. 2005. Pp. 85, 178-182.