Unique in the shopping mall: On the reidentifiability of credit card metadata


Science  30 Jan 2015:
Vol. 347, Issue 6221, pp. 536-539
DOI: 10.1126/science.1256297
  • Fig. 1 Financial traces in a simply anonymized data set such as the one we use for this work.

    Arrows represent the temporal sequence of transactions for user 7abc1a23, and prices are grouped in bins of increasing size (29).

  • Fig. 2 The unicity ε of the credit card data set given p points.

    The green bars represent unicity when spatiotemporal tuples are known. This shows that four spatiotemporal points taken at random (p = 4) are enough to uniquely characterize 90% of individuals. The blue bars represent unicity when using spatial-temporal-price triples (a = 0.50) and show that adding the approximate price of a transaction significantly increases the likelihood of reidentification. Error bars denote the 95% confidence interval on the mean.

  • Fig. 3 Unicity (ε4) when we lower the resolution of the data set on any or all of the three dimensions; with four spatiotemporal tuples [(A), no price] and with four spatiotemporal-price triples [(B), a = 0.75; (C), a = 0.50].

    Although unicity decreases with the resolution of the data, the decrease is easily overcome by collecting a few more points. Even at very low resolution (h = 15 days, v = 350 shops, price a = 0.50), we have more than an 80% chance of reidentifying an individual with 10 points (ε10 > 0.8) (table S1).

  • Fig. 4 Unicity for different categories of users (v = 1, h = 1).

    (A) It is significantly easier to reidentify women (ε4 = 0.93) than men (ε4 = 0.89). (B) The higher a person’s income is, the easier he or she is to reidentify. High-income people (ε4 = 0.93) are significantly easier to reidentify than medium-income people (ε4 = 0.91), and medium-income people are themselves significantly easier to reidentify than low-income people (ε4 = 0.88). Significance levels were tested with a one-tailed t test (P < 0.05). Error bars denote the 95% confidence interval on the mean.

  • Fig. 5 Distributions of the financial records.

    (A) Probability density function of the price of a transaction in dollar equivalent. (B) Probability density function of the spatial distance between two consecutive transactions of the same user. The best fits of a power law (dotted line) and an exponential distribution (dot-dashed line) are given for reference. The dashed lines mark the diameters of the largest and second-largest cities in the country. Thirty percent of a user's successive transactions are less than 1 km apart (the shaded area); the density then drops by an order of magnitude to a plateau between 2 and 20 km, roughly the radius of the two largest cities in the country. This shows that financial metadata differ from mobility data: the likelihood of a short travel distance is very high and then plateaus, and the overall distribution follows neither a power law nor an exponential distribution.
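
The unicity ε_p reported in Figs. 2 to 4 can be illustrated with a small sketch. It is not the authors' Algorithms S1 and S2, which are not reproduced on this page. It assumes that each user's trace is a set of (day, shop) tuples, that unicity is estimated by drawing p points at random from a randomly chosen user and checking whether that user's trace is the only one containing all of them, and that lowering resolution means grouping days into windows of h days and shops into clusters. The names `unicity`, `coarsen_traces`, and `shop_to_cluster`, as well as the toy data, are hypothetical.

```python
import random


def coarsen_traces(traces, h=1, shop_to_cluster=None):
    """Lower the resolution of every trace (assumed scheme, for illustration).

    Days are grouped into windows of h days; shops are mapped to a cluster
    label through the hypothetical shop_to_cluster dictionary, if given.
    """
    mapping = shop_to_cluster or {}
    return {
        user: {(day // h, mapping.get(shop, shop)) for day, shop in points}
        for user, points in traces.items()
    }


def unicity(traces, p, trials=1000, seed=0):
    """Estimate the fraction of draws in which p known points single out one user."""
    rng = random.Random(seed)
    # Only users with at least p points can contribute a draw.
    eligible = [user for user, points in traces.items() if len(points) >= p]
    if not eligible:
        raise ValueError("no user has at least p points")
    unique = 0
    for _ in range(trials):
        user = rng.choice(eligible)
        known = set(rng.sample(sorted(traces[user]), p))
        # Count every trace that contains all p known points.
        matches = sum(1 for points in traces.values() if known <= points)
        if matches == 1:
            unique += 1
    return unique / trials


# Toy data: three users, three transactions each.
traces = {
    "a": {(1, "bakery"), (2, "cafe"), (3, "gym")},
    "b": {(1, "bakery"), (2, "cafe"), (4, "bar")},
    "c": {(1, "bakery"), (5, "shoes"), (6, "cinema")},
}

full_resolution = unicity(traces, p=3)

# Collapse all shops into one cluster and all days into one 10-day window.
one_cluster = {shop: "mall" for points in traces.values() for _, shop in points}
coarse = coarsen_traces(traces, h=10, shop_to_cluster=one_cluster)
low_resolution = unicity(coarse, p=1)
```

In this toy example every user is unique once all three of their points are known, while collapsing every transaction into a single time window and a single shop cluster makes all traces identical, so no draw singles out a user. Real data fall between these extremes, which is the trade-off that Fig. 3 explores.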

Supplementary Materials

  • Unique in the shopping mall: On the reidentifiability of credit card metadata

    Yves-Alexandre de Montjoye, Laura Radaelli, Vivek Kumar Singh, Alex “Sandy” Pentland

    Materials/Methods, Supplementary Text, Tables, Figures, and/or References

    • Materials and Methods
    • Figs. S1 to S5
    • Tables S1 and S2
    • Algorithms S1 and S2
    • Full Reference List
    Subsampled Data
