By now many lawyers and business managers have heard the term “Big Data,” but many may not understand exactly what it refers to, and still more likely do not know how it will affect their clients and businesses (or perhaps how it already has). Big Data is everywhere (quite literally). We see it drive the creative processes used by entertainment companies to construct the perfect television series based on their customers’ specific preferences. We see Big Data in action when data brokers collect detailed employment information concerning 190 million persons (including salary information) and sell it to debt collectors, financial institutions and other entities. Big Data is in play when retailers can determine that their customers are pregnant without being told, and send them marketing materials early on in order to win business. Big Data may also eventually help find cures for cancer and other diseases.

The potential uses and benefits of Big Data are endless.  Unfortunately, Big Data also poses some risk to both the companies seeking to unlock its potential, and the individuals whose information is now continuously being collected, combined, mined, analyzed, disclosed and acted upon.  This post explores the concept of Big Data and some of the privacy-related legal issues and risks associated with it.

1.0 What is “Big Data”?

To understand the legal issues associated with Big Data it is important to understand the meaning of the term.  Wikipedia (part of the Big Data phenomenon itself) defines Big Data as follows:

Big Data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, analysis, and visualization.

While the Wikipedia definition highlights the challenges associated with large data sets and understanding the data contained in those sets, a definition by the TechAmerica Foundation also captures the opportunities associated with Big Data:

Big Data is a term that describes large volumes of high velocity, complex and variable data that require advanced techniques and technologies to enable the capture, storage, distribution, management, and analysis of the information.

The Foundation stresses Big Data solutions as part of its attempt to define the term:

Big Data Solutions: Advanced techniques and technologies to enable the capture, storage, distribution, management and analysis of information.

According to the TechAmerica Foundation, Big Data is characterized by three factors: volume, velocity, and variety:

Characteristic | Description
Volume | The sheer amount of data generated or data intensity that must be ingested, analyzed, and managed to make decisions based on complete data analysis
Velocity | How fast data is being produced and changed and the speed with which data must be received, understood, and processed
Variety | The rise of information coming from new sources both inside and outside the walls of the enterprise or organization creates integration, management, governance, and architectural pressures on IT


While these definitions and attributes of Big Data may be helpful, they are still rather abstract. Perhaps the better question to ask is “what does Big Data mean to companies or other organizations?” Using this filter, Big Data and its use can be viewed as a business process or a supplement to existing business processes. Big Data in the business context means or encompasses the following:

  • The ability of the organization to access unimaginable amounts of structured and unstructured data (much more of it likely in the unstructured category) both internally and through external resources (e.g. data brokers, affiliates or partners).
  • A realization (or hope) that by capturing, structuring and analyzing these huge volumes of data, and understanding the relationships within and between data, the company may gain valuable insights (often precise and non-obvious) that may significantly improve how the company does business.
  • The need to leverage specialized tools and specialized employees (e.g. data scientists) to enable the capture, curation, storage, search, sharing and analysis of the data in a way that is valuable to the organization.
  • Analyzing and addressing the potential limitations and legal risks and issues associated with the collection, analysis and use of Big Data (and the insights derived from it).

While the specific applications of Big Data analysis will vary depending on the industry, the availability of data and the goals of a particular organization (and some of those practical applications are summarized above), many organizations will use Big Data to better understand and market to their customers (both individual and corporate).

2.0 Big Data and Privacy

When it comes to consumer marketing, the potential for Big Data is enormous (and some would argue that the confluence of online marketing and Big Data represents the “Holy Grail” of marketing). Big Data can allow marketers to target customers precisely and efficiently by providing advertising, and product and service offers, that are specifically tailored to a particular individual based on his or her attributes. Big Data combined with the use of mobile devices can result in offers to individuals that are highly relevant, delivered at the right time and (with mobile and geo-location tracking) at the right place. However, one of the most significant legal challenges associated with Big Data, especially on the consumer marketing side, is privacy.

2.1 Big Data and Notice/Consent

In the United States, pursuant to the Fair Information Practice Principles, the foundation of privacy protection includes the concepts of notice/awareness and choice/consent.    To satisfy the principle of notice and awareness, the data subject from whom data will be collected must be made aware of the uses to which his or her personal information will be put, and to whom such personal information will be disclosed.  The notice is intended to allow the data subject to make an informed choice as to the collection and use of the subject’s personal information, and to consent (or not) to that collection and use.

In a Big Data world, some contend that the goals of notice/consent may be circumvented due to the complexity of the Big Data ecosystem and practical limitations related to the use of written privacy policies. Privacy advocates believe that, in some cases, a person who reads a privacy policy and agrees that his or her personal information can be collected, used and disclosed for “marketing purposes” may not understand that such personal information may end up residing in the database of a data broker, where it may be combined and disclosed in ways not apparent in or contemplated by the privacy policy. For example, if an ecommerce vendor disclosed to a marketer that an individual customer purchased a deep fryer, such information could be combined into a profile about the individual in a database owned by a data broker. If the data broker later sells access to the database to a health insurance company whose algorithms put people who purchase deep fryers into a high-risk category, then the initial, relatively innocuous data disclosure (which was consented to) could suddenly serve as the basis to deny a person health care coverage (or result in higher health care rates).

The problem here is twofold. First, the consumer may not understand where his or her personal information may end up, and that it could be combined with other existing profile data in a manner that reveals more about the person than contemplated at the time of disclosure. Further onward transfer and combining with yet more databases could reveal even more. Second, the data subject lacks an understanding of the interpretations, inferences, and/or deductions that may be drawn from his or her combined data using Big Data mining techniques and analytics. As such, in a Big Data world, some would argue that data subjects have even less awareness and ability to provide meaningful consent.

2.2 Big Data and Access/Participation

Another area of privacy concern related to Big Data deals with the principle of “access/participation”.  This principle deals with a data subject’s ability to access his or her personal data in order to ascertain whether it is accurate and complete.  This principle is necessary to allow individuals to correct inaccurate information about them.

This principle has been incorporated into the Fair Credit Reporting Act (FCRA), which requires credit reporting agencies to provide consumers with access to their credit reports so they can have inaccuracies corrected. In the Big Data context, satisfying the access/participation principle poses significant challenges. Except for the established and highly visible players, the general public does not know what entities may be collecting information about them and creating profiles. While data subjects may be able to identify companies to whom they have provided personal information, and may have a direct relationship with such companies, the same is not true in the case of data brokers. In most cases data subjects do not have a direct relationship with data brokers, and these brokers typically do not receive information directly from the data subjects. Even if a consumer can identify a data broker that holds his or her profile, without a contract the consumer may have no legal recourse that would require the broker to provide access to his or her personal information. While some data brokers may be acting as “consumer reporting agencies” (the term the FCRA uses) and may therefore be subject to the FCRA, many take steps to avoid that status.

Based on concerns over access and transparency, the Federal Trade Commission has indicated a desire to consider additional regulatory scrutiny over data brokers:

To address the invisibility of, and consumers’ lack of control over, data brokers’ collection and use of consumer information, the Commission supports targeted legislation – similar to that contained in several of the data security bills introduced in the 112th Congress – that would provide consumers with access to information about them held by a data broker. To further increase transparency, the Commission calls on data brokers that compile data for marketing purposes to explore creating a centralized website where data brokers could (1) identify themselves to consumers and describe how they collect and use consumer data and (2) detail the access rights and other choices they provide with respect to the consumer data they maintain.

More recently, in December 2012, the FTC launched an investigation to study the data broker industry’s collection and use of consumer information.  Moreover, much of the privacy-related legislation proposed in Congress has included provisions related to the regulation and oversight of data brokers (although none has passed to date).  Overall, this is an area that is ripe for an increased regulatory response and potentially Federal and/or State legislation.

2.3 Big Data and Do Not Target / Do Not Collect

Another privacy-related area impacted by Big Data is the “do not track” (DNT) debate. For many in the advertising industry, “do not track” means only that consumer data will not be used for purposes of targeted advertising. In contrast, the FTC and privacy advocates believe the concept of DNT encompasses not only targeting of individuals, but also collection of personal information from individuals (“Do Not Collect”). Recent regulatory emphasis on Do Not Collect stems in part from concerns surrounding Big Data. With the pervasive and constant collection of information about individuals from multiple sources, many data brokers are able to pinpoint a user’s identity and specific preferences without having any information traditionally considered personally identifiable information. As discussed further below, common methods for de-identifying personal information may not be effective if the unique identifier of the computer or mobile device used to access a website, when combined with specific behavioral and other data, supplies enough information to identify a person individually. This may lead to heightened regulatory scrutiny of Big Data practices, specifically where the collection and aggregation of seemingly harmless data about a person can be used to reveal sensitive information (e.g. health status, sexual orientation and financial status).

2.4 Anonymization and Big Data

One technique for mitigating privacy-related risks associated with Big Data is de-identification or anonymization. Data sets that are de-identified have had key information stripped away in order to prevent others from individually identifying the persons to whom the data set relates. This technique allows organizations to work with Big Data sets while mitigating privacy concerns, and has been used in many realms, including healthcare, banking and finance and online advertising.
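As a rough illustration of what stripping away key information can look like in practice, the following Python sketch drops a set of direct-identifier fields from a single record. The field names, the sample record and the list of identifiers are all assumptions made for this example; they are not a legal standard, and removing these fields alone does not establish that a data set is anonymous.

```python
# Hypothetical direct identifiers chosen for illustration only;
# the fields that must be removed depend on the applicable law.
DIRECT_IDENTIFIERS = {"name", "email", "street_address", "account_number"}


def strip_direct_identifiers(record):
    """Return a copy of the record without the direct-identifier fields."""
    return {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}


# Fabricated sample record.
customer = {
    "name": "Jane Doe",
    "email": "jane@example.com",
    "zip_code": "02138",
    "birth_date": "1970-01-15",
    "purchase": "deep fryer",
}

deidentified = strip_direct_identifiers(customer)
print(deidentified)
```

Note that fields such as ZIP code and birth date remain in the output. Whether such remaining fields can still point to an individual is a separate question from whether the obvious identifiers have been removed.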

In fact, many regulatory regimes recognize the concept of de-identified personal information.  Under regulations promulgated pursuant to Gramm-Leach-Bliley (regulating the privacy and security of financial data) “personally identifiable financial information” does not include information that does not identify a consumer, “such as aggregate information or blind data that does not contain personal identifiers such as account numbers, names, or addresses.” The Office for Civil Rights of the Department of Health and Human Services has issued extensive guidance concerning de-identification of health data, and sets forth two methods to achieve de-identification under HIPAA:  expert determination and “safe harbor” de-identification (which involves removing eighteen types of identifiers from health data).  Under European data protection laws, to achieve legally permissible de-identification, “anonymization of data should exclude any possibility of individuals to be identified, even by combining anonymized information.”

However, organizations relying on de-identification to circumvent privacy issues (and liability) must proceed carefully.  If de-identification is not performed properly, it may be possible to re-identify individuals in an anonymized data set.  There have been several real-life instances where re-identification has occurred, and researchers have also been able to demonstrate methods for identifying individuals from data that appeared anonymous on its face.

In one infamous example, as part of a contest to create a better movie recommendation engine, Netflix released an anonymized data set containing the movie ratings of approximately 480,000 of its customers. Researchers established that they could re-identify some of the Netflix customers at issue by comparing the data set with publicly available movie ratings posted by those customers. The Netflix contest eventually led to a lawsuit against the company and regulatory scrutiny from the Federal Trade Commission. In another example, a researcher showed how she could re-identify persons with data in an anonymous healthcare database by using publicly available voter records (in this case she was able to re-identify the information of the governor of Massachusetts).
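The mechanics of this kind of re-identification can be sketched in a few lines of Python. The example below is a simplified illustration, not a reconstruction of either study: all records are fabricated, and the matching fields (assumed here to be ZIP code, birth date and sex) are hypothetical stand-ins for whatever attributes two data sets happen to share.

```python
# Fabricated example records; no real data is used here.
deidentified_health = [
    {"zip_code": "02138", "birth_date": "1945-07-31", "sex": "M", "diagnosis": "condition A"},
    {"zip_code": "02139", "birth_date": "1980-03-02", "sex": "F", "diagnosis": "condition B"},
]

public_voter_roll = [
    {"name": "Pat Example", "zip_code": "02138", "birth_date": "1945-07-31", "sex": "M"},
    {"name": "Sam Sample", "zip_code": "02140", "birth_date": "1980-03-02", "sex": "F"},
]

# Assumed shared fields that are not names but may still single someone out.
QUASI_IDENTIFIERS = ("zip_code", "birth_date", "sex")


def key(record):
    """Build a lookup key from the shared fields."""
    return tuple(record[f] for f in QUASI_IDENTIFIERS)


def link(deidentified, identified):
    """Attach a name to each de-identified record whose shared fields
    match exactly one person in the identified data set."""
    by_key = {}
    for person in identified:
        by_key.setdefault(key(person), []).append(person["name"])
    matches = []
    for rec in deidentified:
        names = by_key.get(key(rec), [])
        if len(names) == 1:  # a unique match re-identifies the record
            matches.append({"name": names[0], **rec})
    return matches


reidentified = link(deidentified_health, public_voter_roll)
for row in reidentified:
    print(row["name"], "->", row["diagnosis"])
```

In this fabricated data, one of the two "anonymous" health records matches exactly one voter and is therefore linked to a name, even though the health data set itself contains no names.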

The risk of re-identification of Big Data sets using contextual “micro data” is a significant concern for organizations working with de-identified data sets.  If the de-identification is not done properly, third parties with access to de-identified data sets may be able to re-identify individuals, and that re-identification could expose the individuals at issue or constitute a data breach under existing data breach notification laws, and could lead to litigation or regulatory scrutiny.  Organizations desiring to de-identify and anonymize their data sets should consider several questions to help understand and mitigate potential privacy and organizational risks, including:

  • What are the purposes, risks and benefits of de-identifying and using or disclosing the data, and do the benefits outweigh the risks?
  • Will the third parties and/or service providers at issue use any data (aggregate, de-identified, etc.) for their own purposes?  Do they have any contractual rights to use the data or engage in their own aggregation or anonymization of data?
  • Is the data truly anonymized? How can the company be sure? What information will be exposed if the data is re-identified?  Is it worth investing effort to verify anonymization?
  • What is the risk to the business if the data is re-identified?   Data breach notification?  Lawsuits?  Regulatory investigations or actions?
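One way an organization might begin to test the question "is the data truly anonymized?" is to count how many records share each combination of potentially identifying fields. This is a simple form of a check commonly called k-anonymity, a technique this article does not otherwise discuss. The Python sketch below uses fabricated records and hypothetical field names; a high group size does not prove that data is anonymous, but a group of size one is a warning sign that a record may be unique.

```python
from collections import Counter

# Hypothetical fields that could be combined with outside data.
QUASI_IDENTIFIERS = ("zip_code", "birth_year", "sex")


def smallest_group_size(records, fields=QUASI_IDENTIFIERS):
    """Return the size of the smallest group of records that share
    the same values for the given fields (0 for an empty data set)."""
    counts = Counter(tuple(r[f] for f in fields) for r in records)
    return min(counts.values()) if counts else 0


# Fabricated records with partially masked ZIP codes.
records = [
    {"zip_code": "021**", "birth_year": 1970, "sex": "F"},
    {"zip_code": "021**", "birth_year": 1970, "sex": "F"},
    {"zip_code": "021**", "birth_year": 1970, "sex": "F"},
    {"zip_code": "021**", "birth_year": 1985, "sex": "M"},
]

k = smallest_group_size(records)
print("smallest group size:", k)
```

Here the last record is the only one with its combination of values, so the smallest group size is 1 and that record stands out from the rest of the data set.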

Engaging in the analysis above can be very helpful in mitigating risks.  However, companies need to be aware that the very nature of Big Data makes true anonymization more difficult.  With reams of detailed data now available and accessible, and sophisticated algorithms that allow data mining, it is arguably easier to re-identify individuals.  The analysis and combination of anonymized data sets with data sets containing identified individuals is largely unpredictable, and yet can potentially result in an organization getting into legal trouble.

3.0 Conclusion

The Big Data era is upon us, and it will become increasingly common for companies to collect, data mine and analyze large data sets in order to further their business interests.  Big Data analytics is already the norm for many organizations, and this trend will only continue over time as more and more data is collected, and stronger and more predictive tools and processes are developed to understand that data.  As companies rush headlong into the Big Data space, they would be wise to step back and contemplate the potential privacy implications of their activities, and consider steps to address privacy concerns.  Proactively dealing with the privacy issues discussed in this article can help organizations safely leverage Big Data while still retaining customers, and avoiding reputational harm, litigation and regulatory scrutiny.