Kathan Roberts
An emerging area of legal analysis, corpus linguistics, is introducing new methods of quantitative legal reasoning. Corpus linguistics is the use of computers to analyze a large set of texts in order to understand how a certain word or phrase is used within them. A classic example is to survey 18th-century texts to determine the original meaning of “keep and bear arms” at the time the Second Amendment was adopted.[1] Much of the existing corpus linguistics work has been in the areas of statutory and constitutional interpretation, but recent work in trademark law has shown that corpus linguistics can be useful anywhere the law asks questions about language.
For a tidy example of what rudimentary corpus linguistics analysis looks like, consider United States v. Costello.[2] The court had to determine whether simply providing shelter to an illegal immigrant constituted “harboring.”[3] Dissatisfied with the lack of context in standard dictionaries, Judge Posner conducted several Google searches for phrases containing the word “harboring.”[4] He found that it appeared much more frequently in the context of hiding someone from the authorities than in the context of simply sheltering them. Thus, he decided that the statute at issue did not apply to sheltering.[5] Corpus linguistics enabled the judge to survey a wider variety and a larger number of texts than he otherwise could have, and to come to a highly nuanced understanding of the statutory language.[6]
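To make the idea of counting usage concrete, here is a minimal sketch in Python. It is not the method used in Costello, which relied on Google searches. The four sentences in TOY_CORPUS are invented for illustration, and the function name collocate_counts and its window parameter are hypothetical. The sketch simply tallies which words follow a target word, which is one basic way a corpus tool can summarize how a word is used.

```python
import re
from collections import Counter

# Toy corpus invented for illustration only; a real study would use a
# large, documented collection of texts.
TOY_CORPUS = [
    "He was charged with harboring a fugitive from the police.",
    "The family was accused of harboring an escaped prisoner.",
    "She denied harboring a fugitive in her basement.",
    "They were harboring refugees during the storm.",
]


def collocate_counts(texts, target, window=3):
    """Count the words that appear within `window` tokens after `target`."""
    counts = Counter()
    for text in texts:
        # Lowercase and split into simple word tokens.
        tokens = re.findall(r"[a-z']+", text.lower())
        for i, token in enumerate(tokens):
            if token == target:
                counts.update(tokens[i + 1 : i + 1 + window])
    return counts


counts = collocate_counts(TOY_CORPUS, "harboring")
# List the words that most often follow the target word.
for word, n in counts.most_common(5):
    print(word, n)
```

Counts like these say nothing by themselves about legal meaning. They only describe the chosen texts, so the choice of corpus and the choice of what to count both remain matters of judgment.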
The premise of corpus linguistics is that it can provide hard empirical evidence for the otherwise subjective task of linguistic interpretation. It also has the advantage of scale. A computer can easily analyze more texts than a human reader can, making the computer’s insights that much more authoritative.
Earlier this year, the Journal of Law & the Arts published an article arguing that corpus linguistics can assess whether a trademark is famous enough to receive anti-dilution protection.[7] This represents a promising frontier in corpus linguistics that goes beyond simply interpreting the law itself. The fields of copyright and trademark law are particularly amenable to corpus linguistics analysis because they frequently raise questions of language use and interpretation.
The results given by corpus linguistics techniques should be seen not as definitive answers but simply as supporting evidence. In the case of Judge Posner’s Google searches, the computer did not tell him the meaning of “harboring”; it gave him statistics about the ways in which “harboring” was commonly used. The judge decided that these statistics were sufficiently clear and persuasive to support a particular ruling, and a reviewing court could easily examine his reasoning. Lawyers advancing corpus linguistics-based arguments will need to convince the court that their metrics are the right ones to consider, and that the set of texts they are analyzing is comprehensive and unbiased.
If corpus linguistics proves its worth in legal analysis, the legal profession will need some new skills. Recently, the Harvard Journal of Law & Public Policy published an article about corpus linguistics, with the following editor’s note appearing at the beginning: “JLPP’s editorial staff has not independently reviewed the corpus linguistics analysis presented herein.”[8] In other words, you will just have to trust the authors.
As corpus linguistics gains prominence, lawyers and judges will naturally become more familiar with its techniques. Law schools should give students at least a basic understanding of common techniques and their limitations. Corpus linguistics promises to be a useful tool in a lawyer’s kit, and students should be prepared to work with it.
[1] John S. Ehrett, Against Corpus Linguistics, 108 Geo. L.J. Online 50, 52-53 (2019).
[2] United States v. Costello, 666 F.3d 1040 (7th Cir. 2012).
[3] Id. at 1043.
[4] Id. at 1044.
[5] Id.
[6] See Ehrett, supra note 1 (detailing Judge Posner’s corpus linguistics analysis in this case).
[7] Jake Linford & Kyra Nelson, Trademark Fame and Corpus Linguistics, 45 Colum. J.L. & Arts 171 (2022).
[8] John K. Bush & A.J. Jeffries, The Horseless Carriage of Constitutional Interpretation: Corpus Linguistics and the Meaning of “Direct Taxes” in Hylton v. United States, 45 Harv. J.L. & Pub. Pol’y 523 (2022).