An algorithm, designed to probe a database containing all personal data available to the government, sees that you have recently bought some fertilizer and a large truck, and that you have emailed someone with a .lb (Lebanon) email address. Seeing this red flag pop up on his computer, a government agent pulls your bank records, your Amazon and iTunes purchases, the domain names that you’ve recently visited, and a list of everyone you have recently phoned or emailed. Inspecting all of these records, the agent determines that you were merely asking your Lebanese uncle for advice on expanding your farm and makes a notation in his database. (He is also able to determine your religious affiliation, that you have an affinity for Steven Seagal movies, and that you have been having an affair, but is less interested in all of that.)
This example of “data mining” is the future, if not the present, of law enforcement.1 Data mining both offers enormous possibilities for law enforcement and national security — in the scenario above, the fertilizer and truck could have been intended for a far less innocuous use — and radically undermines the notion that one’s interests, affiliations, and legal activities can be kept private from the government. Such concerns have led to significant public debate over the proper scope of surveillance, prompted in particular by Edward Snowden’s recent disclosures.
Yet despite the obvious privacy implications of data mining, traditional Fourth Amendment doctrine offers relatively little to help constrain such activity. The Supreme Court has held that one cannot have a reasonable expectation of privacy in information that is given to third parties2 or made accessible to the public.3 In the modern era, this doctrine covers an enormous amount of activity: commercial interactions are known to credit card companies;4 financial records are in the hands of banks;5 phone calls and emails entail offering telecommunications companies the numbers and addresses necessary to route the information properly;6 and even a cell phone’s location may be known to the phone company at all times.7 Such information, under extant Supreme Court doctrine, arguably falls outside the scope of the Fourth Amendment’s protections. Accordingly, the government can compile and analyze it largely free from constitutional scrutiny.8
Commentators have responded to this apparent deficiency by suggesting that individual sources of information might be protected, Facebook and other social networks being one widely debated example.9 But the reality of modern data mining is that removing isolated sources from the flood of “public” information will do little to stop the government from divining a detailed portrait from the information that remains available. Whatever restraints are imposed on the collection of individual sources, it continues to be true that “most government data mining today occurs in a legal vacuum outside the scope of the Fourth Amendment and without a statutory or regulatory framework.”10
Taking another tack, five Justices of the Supreme Court have signaled a willingness to move away from the piece-by-piece analysis toward a “mosaic theory” of the Fourth Amendment.11 In United States v. Jones,12 the majority decided that long-term surveillance via a GPS beacon attached to a car bumper constituted a search due to the physical trespass upon the bumper.13 Yet Justice Sotomayor concurring and Justice Alito — joined by Justices Ginsburg, Breyer, and Kagan — concurring in the judgment suggested that the collection of sufficiently large amounts of information might amount to a search (thus implicating the Fourth Amendment) regardless of physical trespass.14 However, the question as framed in Jones — to what degree a given type of information can be analyzed — is likely to offer little guidance to courts struggling to determine the validity of investigations using multiple types of rapidly evolving information-gathering techniques. More promisingly, scholars have noted the possibility of building legal protections into algorithms and databases in order to protect privacy while still enhancing law enforcement capabilities.
This Note argues that a properly designed algorithmic search, with features corresponding to the Fourth Amendment’s dog-sniffing doctrine, can offer a potential constitutional solution to the privacy pitfalls of data mining. In a series of cases the Court has stated that the use of drug-sniffing dogs does not constitute a search under certain conditions. From these cases the following requirements can be identified: that the dogs access the scents without intruding into a constitutionally protected area, that they recognize only illegal activity, that humans do not see any private information until probable cause has been established by the dog’s bark, and that the dogs have a low false-positive rate. These features map roughly onto the characteristics of a well-designed algorithmic search.
This Note begins by discussing data mining: its definition, its utility, and the threat it presents to traditional notions of privacy. Part II analyzes the extent to which data mining can be regulated under established Fourth Amendment doctrine, agreeing with the scholarly consensus that it largely falls outside the traditional scope of a search. Part III explores some of the alternatives that have been put forward, finding some promise in regulating access to certain types of information and less in the mosaic theory hinted at in recent cases. Part IV presents this Note’s alternative, arguing that it flows logically from dog-sniff doctrine and answers the most serious objections to data mining. Part V concludes.
I. Data Mining’s Promises and Pitfalls
The quantity of information collected about U.S. citizens, both privately and publicly, is expanding at a prodigious rate.15 The government has direct access to an enormous amount of information collected by various agencies: payroll records, political contribution disclosure regimes, birth certificates, marriage licenses, and more.16 The federal government maintained more than 2,000 databases over a decade ago,17 a number that surely understates today’s figures.18 In addition to the numerous public-sector sources of information, the private sector has amassed considerable information about consumers.19 Part of this trend can be traced to the increasing number of interactions and transactions that occur online and electronically, as email replaces mail, Amazon.com replaces storefronts, credit cards replace cash, and Facebook replaces conversation. This proliferation of available data — combined with the demand for such data from both public and private sources20 — has led to “the creation of a new industry: the database industry.”21 This industry “provides data to companies for marketing, to the government for law enforcement purposes, to private investigators for investigating individuals, to creditors for credit checks, and to employers for background checks.”22
And what is being done with these data? They are examined, either by people or by algorithms, for patterns of useful information in a process termed “data mining.” Data mining, for the purposes of constitutional analysis of government surveillance, can be defined as “searches of one or more electronic databases of information concerning U.S. persons, by or on behalf of an agency or employee of the government.”23 Data mining, of course, can also be carried out by private parties — one famous example involved Target’s analyzing the shopping habits of its customers to identify those who had recently become pregnant, and preemptively targeting them with baby-product advertisements.24 More sophisticated uses involve massive databases compiled by both governments and private companies from a wide variety of sources that can be used to target advertising or law enforcement resources.25
More specifically, data mining can be grouped into two broad categories: “subject-based,” which involves pulling together and analyzing information about a previously identified individual, and “pattern-based,” which involves analyzing information on nonsuspect individuals to identify patterns of transactions or behaviors that correlate with suspect activity.26 While subject-based data mining may raise constitutional concerns of its own, this Note focuses primarily on pattern-based data mining. Such data mining in the absence of individualized suspicion differs in kind, not merely degree, from traditional government investigatory techniques.
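The difference between the two categories can be illustrated with a minimal Python sketch. The record fields, the sample data, and the fertilizer-and-truck pattern (borrowed from this Note's opening scenario) are illustrative assumptions and do not describe any actual government system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    person: str  # hypothetical identifier for an individual
    item: str    # hypothetical label for a transaction


# Hypothetical sample data, invented for illustration only.
RECORDS = [
    Record("alice", "fertilizer"),
    Record("alice", "large truck"),
    Record("bob", "fertilizer"),
    Record("carol", "paperback"),
]


def subject_based(records, person):
    """Pull together every record about a previously identified individual."""
    return [r for r in records if r.person == person]


def pattern_based(records, pattern):
    """Scan everyone, suspect or not, and return each person whose
    records contain every item in the pattern."""
    items_by_person = {}
    for r in records:
        items_by_person.setdefault(r.person, set()).add(r.item)
    return sorted(p for p, items in items_by_person.items() if pattern <= items)


print(subject_based(RECORDS, "bob"))
print(pattern_based(RECORDS, {"fertilizer", "large truck"}))
```

The subject-based query starts from a name and returns that person's records; the pattern-based query starts from a pattern and must read every person's records to find a match, which is the feature that distinguishes it from traditional investigation.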
Data mining holds undeniable promise for law enforcement: it can “turn low-level data, usually too voluminous to understand, into higher forms (information or knowledge) that might be more compact (for example, a summary), more abstract (for example, a descriptive model), or more useful (for example, a predictive model).”27 Just as Target was able to predict pregnancy, a government could in theory identify transactions indicative of tax fraud or drug dealing, or of terrorist attacks in the making.28 Data mining’s greatest advantage over traditional forms of surveillance is that it does not require ex ante individualized suspicion: law enforcement could identify a past (or even future) wrongdoer whom the government would otherwise never have suspected. In theory, law enforcement could also become more efficient, in terms of both cost and burden on citizens. If police could identify criminals through data mining and disrupt embryonic terror attacks, one could envision a future where passengers can wear shoes through airport security. While there are debates about exactly how effective data mining can be for law enforcement and national security purposes,29 law enforcement and national security agencies are rapidly expanding their efforts and capabilities to gather information and analyze it on a mass scale.30
Yet data mining’s promise for law enforcement comes paired with significant privacy concerns. The privacy concerns attendant to data mining (as opposed to information-gathering more generally) can be grouped into “those that arise from the aggregation (or integration) of data and those that arise from the automated analysis of data that may not be based on any individualized suspicion.”31 The first concern is that discussed by Justice Sotomayor in Jones — “GPS monitoring generates a precise, comprehensive record of a person’s public movements that reflects a wealth of detail about her familial, political, professional, religious, and sexual associations.”32 This concern becomes even more severe when GPS data is combined with credit card transactions, online activities, and other forms of data.
The second concern, meanwhile, is rooted in the unnerving fact that this intimate, invasive surveillance is targeted at everyone. More traditional government surveillance may not require a warrant until reaching the level of a search; yet presumably law enforcement is not investing the resources necessary to surveil round the clock and track down every piece of information without at least some whiff of wrongdoing. The average citizen can take comfort in the assumption that she will not incur such close scrutiny, a comfort that is not afforded by pattern-based data mining.
These concerns have most recently come to the fore due to the disclosures of former National Security Agency (NSA) contractor Edward Snowden. Snowden’s revelations about the scope of NSA surveillance have prompted a wave of privacy concerns and a renewed debate over the tradeoffs between privacy and security attendant to data mining.33 It is as yet unclear what, if any, legislative action will result from these revelations; in the meantime, courts have struggled with the constitutional implications of such programs.
II. Data Mining Under the Fourth Amendment
Fourth Amendment doctrine rests upon two assumptions that data mining exposes as particularly ill-suited to the modern age: that physical intrusions will correspond to the most serious invasions of privacy, and that the inability of government to invade privacy on a mass scale will offer practical obscurity. The first assumption takes root in the fact that opening one’s mail, entering one’s property, or rooting through one’s belongings all involve clear lines of physical intrusion that courts can easily police. Yet as technology advances, the policing of physical intrusion starts to look very much like the Maginot line: impregnable against frontal assault while far more serious invasions of privacy flow around it unimpeded. The Court is thus adamant that to set foot on private property is to trigger a search,34 yet relatively unconcerned about helicopters (and soon drones, no doubt) hovering close above one’s property with high-resolution cameras.35 Such vigilance offers little comfort in a world where one’s intimate transactions occur in a space where no physical intrusion is required to access them, and the Court has begun to react with trepidation to the conflict between these traditional assumptions and the modern world.
Two major cases in the Fourth Amendment canon have left a vast amount of data constitutionally unprotected. First, the Supreme Court declared in California v. Greenwood36 that one does not have a privacy interest in garbage placed out on the street for collection,37 and more generally that the Fourth Amendment does not protect that which “could have been observed by any member of the public.”38 Thus one’s public movements and actions, prior to Jones, were thought not to receive Fourth Amendment protection.39
Second, and more problematic to scholars,40 the Court stated in Smith v. Maryland41 that an individual has no “legitimate expectation of privacy in information he voluntarily turns over to third parties.”42 The paradigmatic examples of this principle are bank records43 and telephone numbers dialed.44 Today, this third-party doctrine appears to extend as far as recording one’s “IP address, to/from address for e-mails, and volume sent from the account.”45 While the contents of emails might receive protection,46 the lines of Fourth Amendment searches as set by the Court’s application of analog doctrines bear virtually no resemblance to society’s current expectations of privacy.47
The cumulative effect of the public exposure and third-party doctrines renders data mining largely “outside the scope of the Fourth Amendment.”48 While there are statutory restrictions on certain types of surveillance, most notably the Stored Communications Act,49 the Fourth Amendment leaves unprotected any information that has fallen or could legally fall into the hands of a private third party. Accordingly, a staggering amount of information generally considered quite personal can be collected with limited constitutional restriction.
These two doctrines interact problematically with another core assumption of the Fourth Amendment: that law enforcement has limited resources and cannot be in all places at all times. This assumption has meant that the Court has yet to recognize that both the extent to which data are analyzed and the scope of their collection have constitutional implications. Courts have traditionally assumed a degree of practical obscurity: even if one cannot guarantee the privacy of one’s transactions against the watchful eye of the state, one can reasonably expect that government agents will not follow one’s public movements, collect receipts at every vendor one visits, and check the address on every letter one sends or receives.50
Courts have most fully articulated this principle in the context of public movement. Discussing information-gathering police stops, the Court has relied upon “limited police resources” along with other practical constraints to inhibit “an unreasonable proliferation of police checkpoints.”51 Judge Posner, meanwhile, has distinguished between the police’s ability to follow a single driver through public streets and the possibility of mass observation through technology (and even analysis of movement patterns via algorithm).52
While the Court has recognized that the elimination of such practical barriers to information gathering can independently raise Fourth Amendment concerns,53 it continues to rely upon the default assumption of practical obscurity. In fact, such an assumption underlies the fundamental mode of Fourth Amendment analysis: each step in a search is to be analyzed independently for any constitutional violation, regardless of the number of steps or searches put together.54 While individualized analysis might make sense when each element of a search requires an investment of significant resources, it seems hopelessly outdated when thousands of micro-searches can be effortlessly amalgamated.55
The Court is hardly unaware of the challenges that technological development has posed to its traditional Fourth Amendment assumptions. Though the Court has yet to encounter data mining directly,56 in a series of recent cases it has expressed trepidation about uninhibited adoption of technologically dated Fourth Amendment precedents.
First, the Court has hesitated to allow the search of electronic communications stored on a third party’s servers. In City of Ontario v. Quon,57 the Court was faced with the question of whether an employee could have a reasonable expectation of privacy in text messages stored on a government employer’s servers.58 Yet rather than address the question head on, the Court ruled that the search was reasonable regardless of the employee’s privacy interest.59 Accounting for the Court’s reticence, Justice Kennedy wrote that “[t]he Court must proceed with care . . . . The judiciary risks error by elaborating too fully on the Fourth Amendment implications of emerging technology before its role in society has become clear.”60 While this case could be read as a simple application of the canon of constitutional avoidance, the Court has often cast such modesty aside in the field of criminal procedure.61 Rather, City of Ontario may indicate that the Court is reluctant to follow Smith all the way down the rabbit hole when it comes to electronic communications.
Second, in United States v. Jones, the Court confronted the use of a GPS tracking device to surveil a suspect for four weeks. Decades earlier, in United States v. Knotts,62 the Court had held that the use of a locating “beeper” was constitutionally permissible because “[a] person traveling in an automobile on public thoroughfares has no reasonable expectation of privacy in his movements from one place to another.”63 The Jones majority distinguished this seemingly controlling precedent by finding that while in Knotts the suspect had voluntarily placed a bugged package in his car, in Jones the government trespassed upon the suspect’s rear bumper in placing the device.64 Yet for the five Justices concurring, it was not the origin of the device but the extent of its information gathering that was most troubling. Justice Alito dismissed the attachment of the device as “trivial,” and argued that the length of the surveillance passed an as-yet unidentified threshold marking the bounds between a search and a non-search.65 Justice Sotomayor went even further, specifically calling into question the viability of the third-party doctrine in
“the digital age, in which people reveal a great deal of information about themselves to third parties in the course of carrying out mundane tasks. People disclose the phone numbers that they dial or text to their cellular providers; the URLs that they visit and the e-mail addresses with which they correspond to their Internet service providers; and the books, groceries, and medications they purchase to online retailers.”66
Last Term, the Court struck an even more direct blow at technological neutrality — the notion that the Fourth Amendment should translate seamlessly from the analog to the digital.67 In Riley v. California,68 the Court unanimously refused to extend the traditional search-incident-to-arrest exception — by which arresting officers could rifle through the effects of an arrestee without Fourth Amendment scrutiny69 — to the search of an arrestee’s cell phone. Chief Justice Roberts explained that to compare the search of a cell phone to that of a wallet or a purse “is like saying a ride on horseback is materially indistinguishable from a flight to the moon. . . . Modern cell phones, as a category, implicate privacy concerns far beyond those implicated by the search of a cigarette pack, a wallet, or a purse. . . . [A]ny extension of that reasoning to digital data has to rest on its own bottom.”70
These cases suggest that the Court is aware that modern surveillance technologies represent a problem for traditional Fourth Amendment doctrine, but is still casting about for a solution that might prove workable in the context of data mining. In the next Part, this Note examines the alternatives that have been put forward.
III. Fourth Amendment Alternatives
Several proposals have been floated to address the mounting unease with the mass collection and analysis of data that, while (arguably) innocuous in pieces, in combination can reveal a discomfiting amount about a person’s life. Broadly speaking, these proposals can be grouped into three categories: those that restrict what types of information can be gathered, those that restrict how much of it can be put together, and those that restrict how it can be analyzed. Though the first and second categories are important, this Note focuses on the third category as offering the most potential for systematic judicial regulation of data mining.
Much of the scholarly attention has focused on restricting the types of data that can be collected. Some critics have attacked the third-party doctrine directly, arguing either that the entire edifice is built upon a mistake,71 or that it should distinguish information that is exposed to a third party only by passing through an automated conduit to another private party (so that, for example, emails that pass through third-party servers would not lose their protected character).72 Without wading too deep into the continuing vitality of the third-party doctrine, it is worth noting that such proposals run squarely into Smith, which solidified the third-party doctrine as applicable to information collected in the course of automated communications.73 Other scholars have focused on those types of information that implicate other constitutional interests, such as associational or interpersonal privacy. The question of how much privacy one is entitled to in the information one posts on Facebook has generated its own small field of constitutional scholarship.74 Additional scholarship has focused on location tracking, arguing that the pervasive surveillance of one’s public movements could offend the Constitution.75
Regulation of what can legitimately be collected is undoubtedly important. Even if legal restrictions are placed on the scope of data analysis, it would offend the Constitution if the inputs into a data-mining program included intimate conversations within the marital bedroom. Yet analyzing each source of information smacks of attempting to hold back the flood by plugging each leak in the dam as it appears. Justice Sotomayor noted in Jones the enormous amount of personal information that could be garnered from GPS tracking alone.76 Yet at the same time, excluding location data would hardly prevent the government from generating much the same record by looking solely at one’s email exchanges, browser history, or credit card transactions: “[I]t will often be unnecessary for the government to track us, because for most of us much of our lives are already described in transactional databases.”77 Unless every meaningful source of information is to be regulated, a more systematic approach is needed.
An alternative (or additional) approach to the regulation of data mining is to look not merely at sources, but at the amount of information that is accumulated. Professor Orin Kerr describes this as the “mosaic theory,” and notes its endorsement by the D.C. Circuit in United States v. Maynard78 and by the concurrences in Jones. The mosaic theory “considers whether a set of nonsearches aggregated together amount to a search because their collection and subsequent analysis creates a revealing mosaic.”79 Such an approach accords with our intuitions and expectations about privacy: the government may be entitled to examine a particular commercial transaction, or to find out where a suspect is at a given moment, but should not be able to piece together her entire life without first seeking a warrant.
Yet Kerr is right to note the significant difficulties involved in setting forth a predictable standard for the mosaic theory.80 First, the three relevant opinions — Judge Ginsburg’s in Maynard, Justice Sotomayor’s Jones concurrence, and Justice Alito’s Jones concurrence — put forward three divergent variations of the mosaic theory test, each different in important respects from the others.81 More troublingly, it is difficult to see how any standard could reliably apply either within types of surveillance (three days of GPS tracking is acceptable, but is four days too many?) or across types (bank records are okay, as are email addresses, but do the two combined create a search?). Unless one imagines each type of nonsearch being assigned a point value that can accumulate to a search, the mosaic theory is not likely to lend itself to stable solutions, but rather to frustrate equally both government investigators and privacy advocates.
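The line-drawing problem can be seen by actually writing down the point system imagined above. The following Python sketch is purely hypothetical: the point values and the threshold are arbitrary assumptions chosen for illustration, and no court has adopted anything like them.

```python
# Hypothetical point values per type of nonsearch. Nothing in the case law
# supplies these numbers; their arbitrariness is the point of the sketch.
POINTS = {"gps_day": 1, "bank_records": 3, "email_addresses": 2}

# Hypothetical total at which accumulated nonsearches become a "search".
SEARCH_THRESHOLD = 4


def total_points(nonsearches):
    """Sum the points for a mapping of nonsearch type to count."""
    return sum(POINTS[kind] * count for kind, count in nonsearches.items())


def is_search(nonsearches):
    """Apply the hypothetical threshold to the accumulated points."""
    return total_points(nonsearches) >= SEARCH_THRESHOLD


print(is_search({"gps_day": 3}))  # three days of GPS tracking
print(is_search({"gps_day": 4}))  # four days of GPS tracking
print(is_search({"bank_records": 1, "email_addresses": 1}))  # two types combined
```

Under these assumed numbers, three days of tracking is not a search while four days is, and bank records and email addresses are each permissible alone but a search together. Any other choice of values would move the lines without making them less arbitrary.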
If regulation of individual sources of information at the point of collection is insufficient (though indispensable), and regulation of the gross extent of analysis is likely to result in endless confusion, perhaps one should examine the method of analysis. Some scholars who focus on the method of analysis have looked to the distinction identified above between subject-based data mining (the examination of accumulated information on a pre-identified individual) and pattern-based data mining (the suspicionless examination of large numbers of individuals for indicative patterns of behavior). While subject-based data mining may be a logical extension of ordinary investigative techniques, pattern-based data mining has drawn particular ire: such analysis, divorced from particularized suspicion, is viewed as hostile to both “the constitutional presumption of innocence and the Fourth Amendment principle that the government must have individualized suspicion before it can conduct a search.”82 Yet it is precisely the ability to investigate in the absence of preexisting suspicion that offers data mining’s greatest promise: the possibility of putting together disconnected facts to point the finger at a suspect whom the government would not otherwise have suspected.
Rather than throw the baby out with the bathwater, a more promising avenue is to regulate the analysis of the data in a manner that comports with constitutional principles. K.A. Taipale has discussed this possibility at length, arguing that “security with privacy can be achieved by employing value-sensitive technology development strategies that take privacy concerns into account during development, in particular, by building in rule-based processing, selective revelation, and strong credential and audit features.”83 Taipale analyzes each of these features in depth: rule-based processing allows data to be labeled and categorized in order to ensure that it will not be accessed improperly;84 selective revelation “uses an iterative, layered structure that reveals personal data partially and incrementally in order to maintain subject anonymity”;85 and strong credentialing and audit features avoid insider abuse of information by restricting and monitoring access.86 Taipale demonstrates the viability of these features as a technological matter and argues convincingly that they can allow data mining to accord with privacy intuitions.
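A rough sense of how these three features might fit together is given by the Python sketch below. It is an illustration under stated assumptions, not Taipale's design: the class name, the record fields, the label scheme, and the `authorization_id` parameter are all hypothetical, and the truncated hash used as a pseudonym is a stand-in that would not provide real anonymity.

```python
import hashlib


class SelectiveRevelationStore:
    """Hypothetical store combining labeled data, pseudonymous results,
    and an audit log of every access attempt."""

    def __init__(self, records, authorized_users):
        self._records = records              # dicts with "name", "label", "detail"
        self._authorized = set(authorized_users)
        self.audit_log = []                  # (user, action, allowed) tuples

    @staticmethod
    def _pseudonym(name):
        # Illustrative only: an unsalted, truncated hash is not real anonymity.
        return hashlib.sha256(name.encode("utf-8")).hexdigest()[:8]

    def _check(self, user, action):
        # Credentialing and audit: log every attempt, then refuse outsiders.
        allowed = user in self._authorized
        self.audit_log.append((user, action, allowed))
        if not allowed:
            raise PermissionError(f"{user} is not authorized")

    def query(self, user, allowed_labels):
        # Rule-based processing: only records labeled for this use come back,
        # and they come back under pseudonyms rather than names.
        self._check(user, "query")
        return [
            {"subject": self._pseudonym(r["name"]), "detail": r["detail"]}
            for r in self._records
            if r["label"] in allowed_labels
        ]

    def reveal(self, user, pseudonym, authorization_id):
        # Selective revelation: identity is disclosed only as a later step,
        # and only when some further authorization is presented.
        self._check(user, f"reveal:{pseudonym}")
        if not authorization_id:
            raise PermissionError("further authorization required")
        for r in self._records:
            if self._pseudonym(r["name"]) == pseudonym:
                return r["name"]
        return None


store = SelectiveRevelationStore(
    [
        {"name": "Jane Roe", "label": "transaction", "detail": "bought fertilizer"},
        {"name": "Jane Roe", "label": "medical", "detail": "prescription"},
    ],
    authorized_users={"agent1"},
)
rows = store.query("agent1", {"transaction"})
print(rows)
print(store.reveal("agent1", rows[0]["subject"], "AUTH-001"))
```

The analyst's first view contains no names and no records outside the permitted label; the name appears only at the second step, and both steps leave an entry in the audit log.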
Whereas Taipale focuses his attention on the feasibility and social desirability of certain features of data mining, however, this Note is more concerned with “how these particular technologies fit within the current legal structure.”87 As the Riley Court made clear, the adoption of privacy-enhancing protocols is to be encouraged, but constitutional scrutiny remains indispensable: “[T]he Government proposes that law enforcement agencies ‘develop protocols to address’ concerns raised by cloud computing. Probably a good idea, but the Founders did not fight a revolution to gain the right to government agency protocols.”88 Indeed they did not. But, as argued in Parts II and III, neither the status quo nor the solutions so far offered are likely to provide a coherent and satisfactory accommodation between competing constitutional concerns. In the next Part, this Note thus seeks to fill a gap by identifying one model of constitutional oversight of data mining and demonstrating its congruence with existing Fourth Amendment doctrine.
IV. The Crime-Sniffing Algorithm
This Part examines the Court’s treatment of the use of drug- and explosive-sniffing dogs under the Fourth Amendment. While such cases have generally been relegated to a niche, the elements of the doctrine map surprisingly well onto the constitutional issues posed by data mining. Analogizing from the cases determining whether dog-sniffing creates Fourth Amendment concerns, this Note lays out the elements that a data-mining algorithm would have to satisfy: the initial search must be performed by a computer upon a database of traditionally unprotected information; the algorithm must not identify protected (noncriminal) activity; human interaction with the data must occur only after the algorithm has demonstrated probable cause; and the algorithm must have a sufficiently low false-positive rate.
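The four elements can be sketched as a gate that sits between the database and any human investigator. The Python below is a minimal illustration under stated assumptions: the class name, the 5% ceiling on the false-positive rate, the sample records, and the detector function are all hypothetical, and the false-positive rate is assumed to have been measured elsewhere (in the statistical sense of the share of innocent cases wrongly flagged).

```python
class SniffGate:
    """Hypothetical gate: the algorithm alone reads the records, reports only
    a binary alert, and humans see records only after an alert."""

    MAX_FALSE_POSITIVE_RATE = 0.05  # assumed ceiling, not drawn from any case

    def __init__(self, records, detects_illegal_pattern, measured_fpr):
        # Element four: refuse to run a detector that is not reliable enough.
        if measured_fpr > self.MAX_FALSE_POSITIVE_RATE:
            raise ValueError("false-positive rate too high to support an alert")
        # Element one: records are assumed to be lawfully obtained,
        # traditionally unprotected information (person -> set of items).
        self._records = records
        self._detector = detects_illegal_pattern

    def alerts(self):
        # Element two: the output is binary. It names who triggered an alert
        # and reveals nothing else the algorithm encountered along the way.
        return {p for p, items in self._records.items() if self._detector(items)}

    def human_review(self, person):
        # Element three: no human access to records before an alert.
        if person not in self.alerts():
            raise PermissionError("no alert: records stay unseen")
        return self._records[person]


# Hypothetical data and detector, invented for illustration only.
records = {
    "alice": {"sale of stolen card numbers", "paperback"},
    "bob": {"prescription refill", "paperback"},
}


def detector(items):
    return "sale of stolen card numbers" in items


gate = SniffGate(records, detector, measured_fpr=0.01)
print(gate.alerts())
print(sorted(gate.human_review("alice")))
```

Python does not enforce the privacy of `_records`, so the sketch shows the intended order of operations rather than a security boundary. A real system would need the access controls to be enforced outside the analyst's reach.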
In four cases spread decades apart, the Supreme Court confirmed that the use of a drug-sniffing dog, in a manner that did not involve additional intrusion beyond that already constitutionally permissible, did not constitute a search under the Fourth Amendment and that the dog’s reaction could provide probable cause for a search. In United States v. Place,89 the Court established this rule in upholding a sniff test of luggage pursuant to a valid Terry stop,90 and in Illinois v. Caballes91 the Court confirmed it with regard to a sniff test of a vehicle’s exterior, again pursuant to a valid Terry stop.92 Most recently, in Florida v. Jardines,93 the Court held that stepping onto the curtilage of a home with a drug-sniffing dog constituted a Fourth Amendment violation.94 The same Term, in Florida v. Harris,95 the Court confirmed that “[a] sniff is up to snuff” in establishing probable cause.96 From these cases four important features of the doctrine can be drawn: the sniff must only analyze information that is legally obtained; the sniff must only detect illegal activity; humans must not participate in any search until probable cause has been established by the sniff; and the sniff must have a low false-positive rate.
The first crucial feature is that the dog, because it does not physically intrude into the bag or car, conducts its detection from a nontrespassory vantage point. In Place, the Court identified this feature by pointing out that a “‘canine sniff’ by a well-trained narcotics detection dog . . . does not require opening the luggage.”97 The Caballes Court agreed, applying the logic to a sniff of a car’s exterior.98 In Jardines, the fact that a government agent had stepped onto the property with the drug-sniffing dog provided the critical distinction from Place and Caballes, as the Court deemed the activity a search on trespass grounds.99 Put another way, dog sniffs are permissible so long as they gather data where it has emerged from a constitutionally protected space into a constitutionally unprotected space. That the information is no longer technically within the home is not itself sufficient — in Kyllo v. United States100 the Court found a search where police used a thermal imaging device from across the street that detected heat radiating from the home101 — but it is necessary that obtaining the information to be analyzed not involve an independent constitutional violation.
The second important feature of dog sniffing identified by the Court is that dogs are trained to react only to illegal activity. As the Court stated in Caballes, “governmental conduct that only reveals the possession of contraband ‘compromises no legitimate privacy interest.’”102 The Court relied on a previous case holding that chemical analysis of white powder for the presence of cocaine did not constitute a search.103 This principle seems simple enough, but there is a critical distinction to be drawn between what the dog detects and what the dog reacts to. The dog (and even more so the chemical field test) reacts only in a binary manner: drugs or no drugs (or perhaps drugs and/or explosives, or neither). However, the dog and the test necessarily encounter scents and substances that are not only innocent, but potentially highly personal: a dog trained to do so could surely identify the scent of one’s soiled undergarments or a mistress’s perfume, while a field test could as easily identify one’s medication. Yet because the dog and field test are trained and designed to respond in distinctive ways only to specific objects,104 any private information they come across is meaningless to them. It is thus crucial that, while the dog may encounter private activity, it recognizes and reports only illegal activity.
The third important feature of the dog sniff is its place in the overall search process: that is, the dog must establish probable cause before a human can encounter any private information. This point was critical in Place, where Justice O’Connor noted that a dog sniff “does not expose noncontraband items that otherwise would remain hidden from public view, as does, for example, an officer’s rummaging through the contents of the luggage. Thus, the manner in which information is obtained through this investigative technique is much less intrusive than a typical search.”105 The Court recently reaffirmed the importance of the dog’s function in establishing probable cause in Harris, holding that “a probable-cause hearing focusing on a dog’s alert should proceed much like any other.”106
These last two features (that the dog reacts only to the presence of contraband and that a human does not become involved until probable cause is established) depend on the fourth feature of drug-sniffing dogs: a low false-positive rate. Justice Stevens was careful to note this feature of the drug-sniffing dog in Caballes.107 Justice Souter vehemently contested whether the dogs actually performed as advertised, finding a range of reported false-positive rates between seven and sixty percent.108 While one might argue that “[t]he infallible dog . . . is a creature of legal fiction,”109 the point remains that a dog with a high false-positive rate is legally distinct from one with a low false-positive rate.110 Thus, in Harris, the Court acknowledged that a dog’s record of accuracy and reliability is critical to its utilization in establishing probable cause.111
Whatever the empirical truth of the propositions, the dog-sniffing cases suggest that a sniff properly should analyze only legally obtained information and detect only illicit activities; that officers should not act before probable cause has been provided by a dog’s alert; and that the sniff should seldom produce false positives.
In its dog-sniffing cases, the Court described the drug-sniffing dog as “sui generis,”112 but as Justice Kagan noted, the highly trained dogs at issue are no different than any other “specialized device for discovering objects not in plain view.”113 The underlying logic of the dog-sniff cases fits neatly with the issues posed by pattern-based data mining. Given courts’ fondness for reasoning by analogy in Fourth Amendment cases involving technological developments,114 it should be possible to design an automated search that replicates the core features identified in the dog-sniff cases: analysis only of legally obtained information; exclusive focus on detecting illegal activity; no human observation without prior probable cause; and low error rates.
The first feature — that the database contain only legally obtained information — pertains to the database rather than the algorithm, and it is important to note that this feature falls into the category of “what can be collected” discussed above.115 As acknowledged, the inquiry into what individual data points are or are not private is critical. One hopes that the government is already in compliance with this feature as reflected in current Fourth Amendment doctrine. A promising indicator is that, based on public accounts, the NSA program drawing the most scrutiny analyzes telephonic metadata (numbers dialed, length of call, etc.) rather than the contents of calls themselves.116 While the dataset available to NSA algorithms vastly outstrips that available to drug-sniffing dogs, the two are comparable in legal terms. In the dog-sniff context, the information is legally obtained so long as the police and dog do not trespass while obtaining the scents. Similarly, a properly designed algorithm would analyze information that has been turned over or exposed to third parties, rather than intrude into personal computers or the content of email. Accordingly, so long as the database subject to pattern-based data mining only includes information that is gathered in accordance with current constitutional doctrine, the searching algorithm will have access to a dataset roughly analogous to that of the drug-sniffing dog.
The second feature pertains to the algorithm: it would have to be programmed to recognize only patterns indicative of illegal activity. This feature sounds deceptively simple. Ordinarily a regression will take certain inputs (locations, purchases, patterns of communication) and turn them into a probabilistic output. To simplify to near the point of absurdity, the algorithm might render a result of “p(terrorist) = 0.9.” However, given the enormous complexity of a database that would compile all data available to the government, a sophisticated algorithm would have to create new, intermediate inputs after analyzing the initial variables.117 Imagine, for example, that there is a pattern of activity that adherents of a certain right-wing group tend to follow prior to committing acts of violence, but such a pattern can also be displayed innocuously by nonmembers; a sophisticated algorithm might be designed to identify both the pattern of activity and the group membership, and flag only those individuals who fit both criteria. Or perhaps sexual orientation could help distinguish between patrons of a given brothel and shoppers at the grocery store on the ground floor. Such intermediate inputs raise questions that the drug-sniffing dog does not pose (imagine, by way of comparison, the drug-sniffing dog thinking to itself, “well that kind of smells like cocaine, and he’s Colombian, so . . .”). Yet concerns over impermissibly biased analyses can be partially answered by the third and fourth features: that is, so long as the intermediate inputs are not revealed to a human until there has been a highly reliable indication of probable cause for illegal activity, the evaluative process of the algorithm is less significant.
The third feature is that human interaction occurs only at the point of probable cause. Kerr has argued that a computer’s analysis of private information is irrelevant to the Fourth Amendment; a Fourth Amendment search should be found to occur only at the moment that a human interacts with private information.118 Courts could apply the dog-sniffing doctrine to demand a roughly similar process, so that first the algorithm “barks” to indicate probable cause, and then the human search of the information that aroused suspicion follows. For the purposes of this Note, it is assumed that a magistrate would be interposed between the algorithm and the human search119: if an individual met the requisite threshold for probable cause, the program could give an anonymized summary of the incriminating information to an agent, who could then seek a warrant for more thorough exploration of the collected data (and potentially other sources of data not accessible to the algorithm, such as email content).120
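The sequence just described (alert first, anonymized summary second, warrant third, human review last) can be sketched as a simple gate. The Python below is a minimal illustration under stated assumptions: the names `Summary`, `bark`, and `open_records`, the opaque case token, and the 0.9 threshold are all hypothetical, and the warrant decision is reduced to a boolean supplied by the caller.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Summary:
    case_token: str  # opaque token standing in for the person; not a name
    reasons: tuple   # only the criteria that triggered the alert


def bark(case_token: str, score: float, reasons, threshold: float = 0.9) -> Optional[Summary]:
    """Return an anonymized summary only if the score clears the threshold."""
    if score < threshold:
        return None  # no alert: nothing reaches a human
    return Summary(case_token, tuple(reasons))


def open_records(summary: Optional[Summary], warrant_granted: bool, vault: dict):
    """Release the underlying records only after an alert and a warrant."""
    if summary is None or not warrant_granted:
        raise PermissionError("no human access without an alert and a warrant")
    return vault[summary.case_token]
```

The design choice worth noticing is that the agent never holds the records and the identity at the same time before the warrant issues; the summary carries only the token and the triggering criteria.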
The final critical feature would be the false-positive rate. The appropriate false-positive rate is a minefield well outside the scope of this Note, and might raise questions about whether probable cause is variable across classes of crimes (one imagines that courts would be more accepting of a higher false-positive rate for terrorism than for illegal downloading of copyrighted films). It is worth noting that the consequences of a false positive might be regarded as significantly more detrimental to one’s privacy in the data-mining context than in the dog-sniffing context. Yet, though a search of one’s luggage or car does not yield the same depth of information as a comprehensive examination of one’s metadata, the Court has recognized the privacy interests connected to the former as worthy of protection.121 It is thus not too great a stretch to require the algorithm to provide evidence of reliability sufficient to parallel the faith the Court has placed in the drug-sniffing dog.
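How a court might audit reliability can be illustrated with simple arithmetic. The Python sketch below assumes one common reading of "false-positive rate" in the dog-sniff cases, namely the share of alerts that a follow-up search fails to confirm. The function names and the `max_share` parameter are hypothetical, and the acceptable ceiling is deliberately left to the caller because this Note does not propose one.

```python
def false_alert_share(outcomes):
    """Share of alerts that a follow-up search did not confirm.

    `outcomes` holds one boolean per past alert: True if the search
    confirmed wrongdoing, False if it turned up nothing.
    """
    if not outcomes:
        raise ValueError("no alert history to evaluate")
    return outcomes.count(False) / len(outcomes)


def meets_standard(outcomes, max_share):
    """Compare the track record against a ceiling chosen by the caller."""
    return false_alert_share(outcomes) <= max_share
```

For example, a record of four alerts with one unconfirmed yields a share of 0.25, which would satisfy a hypothetical ceiling of 0.30 but not one of 0.10.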
The algorithm described here may seem something of a deus ex machina solution to the tradeoffs between law enforcement and privacy that data mining poses. Yet such a program is technologically feasible. A drug-sniffing dog need not be perfect; neither data demonstrating significant error rates122 nor a concrete instance of a dog’s misidentification123 has prevented the use of drug-sniffing dogs in providing probable cause for a more intrusive search. Rather, the Court has made clear that a dog’s alert must merely lead a reasonable person to conclude that a search would reveal evidence of wrongdoing.124
An automated algorithm can meet this threshold. As Taipale has noted, “[t]he use of probabilistic models developed through data mining can substantially improve human decision-making in some contexts.”125 A crude data-mining program could simply query a database for individuals who match a set of criteria that have been deemed sufficiently indicative of wrongful activity. A sophisticated database could return a list of persons who had purchased certain chemical compounds, recently participated in certain kinds of international financial transactions, and visited identified extremist websites. Comparable mechanical filters are already in place in government communications infrastructure,126 and there is little reason to think that the NSA is less capable of designing criteria supplying probable cause than Target is of designing criteria suggesting pregnancy.
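The "crude" program described above amounts to a conjunctive filter. As a minimal sketch, assuming a hypothetical record format in which each person's lawfully collected data has already been reduced to a set of named criteria, such a query might look like the following in Python. The criterion names and the `Record` type are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    person_id: str
    flags: frozenset  # names of the criteria this person's records satisfy


# Hypothetical criteria mirroring the example in the text.
REQUIRED = frozenset({
    "bought_listed_chemicals",
    "intl_financial_transfer",
    "visited_flagged_site",
})


def crude_match(records, required=REQUIRED):
    """Return the IDs of people whose records satisfy every required criterion."""
    return [r.person_id for r in records if required <= r.flags]
```

A person who satisfies only two of the three criteria is not returned, so the filter reports nothing about partial matches or about any flags outside the required set.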
That a cleverly designed data-mining program might be able to pass constitutional muster is not necessarily a ringing endorsement. The interposition of the algorithm between the dataset and human eyes does not fundamentally alter either of the two core privacy concerns raised by data mining: aggregation of data and investigation without particularized suspicion.127 The availability of massive amounts of data to the government, subject only to the promise that nobody will look at one’s intimate secrets until the algorithm lets them, will inevitably set off alarm bells for those inclined to be suspicious of government. While such concerns are not unreasonable, the alternatives — either not collecting the data or compartmentalizing them in order to inhibit the construction of detailed, intimate portraits — involve largely forfeiting the potential law enforcement gains from data mining. Abandoning pattern-based data mining as a law enforcement tool is appropriately within the scope of public debate, but this Note argues that it is hardly a constitutional necessity.
As for investigation without particularized suspicion, to some extent the algorithm does not change the fact that everyone is subjected to a form of surveillance. Yet the existence of drug-sniffing dogs in Penn Station does not seem to alarm civil libertarians. The critical features of such dogs, as identified by the Court, are that they can process only information about whether one is engaging in an illegal activity, and that they do so without intruding into what one has not already exposed. If an algorithm, with sufficient oversight, could become as innocuous to the public as the dogs — recognizing that HAL and Lassie are not exactly neck and neck in vying for the public’s affection — there is no reason why we could not come to regard such light surveillance of our public activities as ordinary.
“In Algorithm We Trust” is not a slogan likely to catch fire among civil libertarians any time soon. Yet courts must navigate the tradeoff between data mining’s boost to law enforcement efficiency and efficacy on the one hand, and its threat to privacy on the other hand. Interposing a carefully designed and regulated machine between the data and the human observer offers the potential to capture significant benefits while satisfying a number of both constitutional and privacy concerns. Close review of the Court’s dog-sniffing cases provides four principles by which pattern-based data mining should be regulated: analysis only of legally obtained information; exclusive focus on detecting illegal activity; no human observation without prior probable cause; and low error rates. Even if courts do not apply these features precisely — there are some relevant differences between canines and computers — this Note at a minimum demonstrates that dog-sniffing cases provide a template for judicial oversight of pattern-based data mining that involves neither complete dereliction of constitutional oversight nor free-form weighing of costs and benefits, as in the context of the Fourth Amendment’s “special needs” doctrine.128 Rather, courts should carefully scrutinize the dataset, the method and sequence of analysis, and the reliability of the results.