On Monday, University of Aarhus researcher Emil O. W. Kirkegaard and University of Aalborg researcher Julius D. Bjerrekær published a very large dataset of OKCupid profile metadata, which was scraped from the service. OKCupid, a well-known dating website, allows users to create profiles that contain information about the user’s age, their location, their demographics, and the ages of the partners they are seeking.
Looking at the age ranges of preferred dating partners relative to the user’s own age could be interesting. At the least, analyzing the data will give me an idea of whether the dataset is sufficiently robust for further blog posts.
I’ve extracted and cleaned up the profile metadata to make it easier to process, leaving valid profiles for 66,365 OKCupid users. You can view this data in this Google Sheet.
Kirkegaard and Bjerrekær noted in the paper included in the dataset release that the distribution of ages of OKCupid users is skewed toward younger users. They provided this density plot:
The dataset contains three relevant columns:
- d_age, the user’s provided age,
- lf_max_age, the maximum age the user is seeking for a partner, and
- lf_min_age, the minimum age.
Let’s sanity check the data by plotting user age against the maximum and minimum ages of the preferred age range, with the latter two variables colored accordingly.
The sanity check was a good idea because there’s a lot going on here.
- Both maximum age range and minimum age range are positively correlated with the user’s current age, as you would expect.
- As the age distribution chart showed, there is very little data available for users older than 60, and we may have to remove that in later analysis.
- Users tend to set their desired age ranges to clean multiples of 5 (the solid horizontal lines), which is an interesting psychological bias. Many users also set their minimum age preference to the minimum possible value of 18 years old…
- …conversely, many users set their maximum age to 99/100 years old, essentially saying that anything goes as far as age. Unfortunately, these are invalid values for this analysis, and users who have declared this preference must be removed.
- I have absolutely no idea why there is a solid perfectly linear diagonal line of maximum age values evident in the data. Perhaps that offset is the default maximum age range and users have not changed it?
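The removal of the "anything goes" profiles described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code used for this post: the records are made-up toy values, and treating any lf_max_age of 99 or more as "no upper limit" is an assumption based on the 99/100 values seen in the chart.

```python
# Toy records using the dataset's column names; the values are made up.
profiles = [
    {"d_age": 23, "lf_min_age": 23, "lf_max_age": 27},
    {"d_age": 31, "lf_min_age": 18, "lf_max_age": 99},
    {"d_age": 45, "lf_min_age": 30, "lf_max_age": 100},
    {"d_age": 38, "lf_min_age": 30, "lf_max_age": 45},
]

# Assumption: a maximum of 99 or more means "no upper limit" rather than
# a real preference, so those profiles are excluded from the analysis.
NO_LIMIT_AGE = 99


def has_real_upper_bound(profile):
    """Return True if the profile states a usable maximum partner age."""
    return profile["lf_max_age"] < NO_LIMIT_AGE


valid_profiles = [p for p in profiles if has_real_upper_bound(p)]
print(len(valid_profiles))  # 2
```

The threshold is kept as a named constant so it is easy to adjust if a different cutoff turns out to fit the data better.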
The chart is a bit messy for further analysis. Let’s try another approach by normalizing the max/min age ranges relative to the user’s age, and plotting these deltas instead. For example, if a 23-year-old OKCupid user is looking for someone aged 23-27, their lf_max_delta = 27 - 23 = 4, while lf_min_delta = 23 - 23 = 0.
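As a rough sketch, the normalization is just a subtraction of the user’s own age from each bound. The function name age_deltas is hypothetical; the arguments mirror the dataset’s column names.

```python
def age_deltas(d_age, lf_min_age, lf_max_age):
    """Return (lf_min_delta, lf_max_delta) relative to the user's own age."""
    lf_min_delta = lf_min_age - d_age
    lf_max_delta = lf_max_age - d_age
    return lf_min_delta, lf_max_delta


# The 23-year-old from the example, looking for partners aged 23-27.
lf_min_delta, lf_max_delta = age_deltas(23, 23, 27)
print(lf_min_delta, lf_max_delta)  # 0 4
```

A negative delta means the bound is younger than the user, and a positive delta means it is older.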
And just for fun, let’s compare the age ranges with the infamous half-your-age-plus-seven rule, a formula that gives the minimum socially acceptable age of a dating partner for a given age (for example, if a user is 30, the minimum dating age is 30 / 2 + 7 = 22, and the corresponding delta is 22 - 30 = -8). On the charts, this boundary will be represented with a red line.
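The rule and its delta are simple enough to write down directly. This is only a sketch: the function names are hypothetical, and treating a minimum preference strictly below the rule’s value as a "violation" is my reading of the rule rather than anything defined in the dataset.

```python
def min_acceptable_age(age):
    """Half-your-age-plus-seven: the youngest 'acceptable' partner age."""
    return age / 2 + 7


def rule_delta(age):
    """The rule's boundary expressed as a delta from the user's own age."""
    return min_acceptable_age(age) - age


def violates_rule(d_age, lf_min_age):
    """Assumption: a violation is a minimum preference below the boundary."""
    return lf_min_age < min_acceptable_age(d_age)


print(min_acceptable_age(30))  # 22.0
print(rule_delta(30))          # -8.0
print(violates_rule(40, 25))   # True, since 40 / 2 + 7 = 27
```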
Putting it all together:
As a person gets older, the gap between their own age and the maximum partner age they will accept decreases, but the gap between their own age and the minimum partner age increases. At age 40, users begin to violate the half-age-plus-seven rule.
There are definitely a few latent factors affecting the results, so let’s see if we can identify them.
As noted in the paper, the gender breakdown of OKCupid users in the dataset is 60.6% Male, 39.1% Female, which is important as preferences of males and females in dating are very different. A 2011 blog post by Dr. Benjamin Le for Science Of Relationships, using 1992 data from Kenrick and Keefe, asserts that there is a difference due to a change in reproductive strategies, and notes that there is indeed some truth to the half-age-plus-seven rule. The visualizations he provides follow his hypothesis.
Let’s try to reproduce those visualizations, although given the changes in dating culture over the last two decades, we should not expect an exact match. As with the Science Of Relationships image, we can bucket the user ages into 20s, 30s, etc. to help consolidate the data, since a continuous line of highly variable age ranges is harder to read than a line chart with a few points. For each age bucket, we can calculate the medians of the deltas for both the maximum age and the minimum age, and draw a line between the bucket points for each (we can also represent the upper bounds with upward arrows, and the lower bounds with downward arrows).
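The bucketing and median step can be sketched with the standard library alone. Treat this as an illustration under assumptions: the "gender" key is a hypothetical field name (the real column may be named differently), the records are made up, and bucketing by decade means 18- and 19-year-olds land in a separate "10" bucket.

```python
from collections import defaultdict
from statistics import median


def age_bucket(age):
    """Map an age to the start of its decade, e.g. 27 -> 20, 34 -> 30."""
    return (age // 10) * 10


def median_deltas_by_bucket(profiles):
    """Return {(gender, bucket): (median lf_min_delta, median lf_max_delta)}."""
    groups = defaultdict(lambda: {"min": [], "max": []})
    for p in profiles:
        key = (p["gender"], age_bucket(p["d_age"]))
        groups[key]["min"].append(p["lf_min_age"] - p["d_age"])
        groups[key]["max"].append(p["lf_max_age"] - p["d_age"])
    return {
        key: (median(deltas["min"]), median(deltas["max"]))
        for key, deltas in groups.items()
    }


# Made-up records; "gender" is a hypothetical field name.
sample = [
    {"gender": "Male", "d_age": 24, "lf_min_age": 20, "lf_max_age": 30},
    {"gender": "Male", "d_age": 28, "lf_min_age": 22, "lf_max_age": 32},
    {"gender": "Male", "d_age": 26, "lf_min_age": 21, "lf_max_age": 35},
    {"gender": "Female", "d_age": 33, "lf_min_age": 30, "lf_max_age": 42},
]

result = median_deltas_by_bucket(sample)
print(result[("Male", 20)])    # (-5, 6)
print(result[("Female", 30)])  # (-3, 9)
```

Using the median rather than the mean keeps a handful of extreme preferences in a small bucket from dragging the whole point up or down.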
How do things look when the age ranges are separated by gender and separated into age buckets?
There are a few similarities to the Science Of Relationships chart: the lower bound for male users is always below the half-age-plus-seven line, and the lower bound for female users is initially above the line. For males, the upper bound decreases with age, while for females, the upper bound stays constant.
Since we have many other user attributes available with the OKCupid profiles, let’s check socioeconomic and cultural indicators in conjunction with faceting by gender.
The rules of conduct for dating vary all over the world. The places with the largest number of OKCupid profiles in this particular dataset are the United Kingdom, California, New York, and Texas. Let’s see if the shape of the age ranges varies by location.
Not much difference. However, in both California and Texas, women in their 50s want to date others 30 years older! (I double-checked the data; the number of women in their 50s is just 15 at each of those locations, which is why age bucketing and using the medians is important.)
Ethnicity can be a factor in relationships. Let’s check against the four most prevalent ethnicities in the dataset: White, Black, Hispanic / Latin, and Asian:
Zero discernible difference in dating age preferences between the four groups.
The “evolutionary perspective” argument presented in the Science of Relationships article assumes that the relationships are heterosexual, which is not always true. We should be checking every permutation of relationships: not just Male/Female and Female/Male, but Male users on OKCupid who are looking for other Male users, Female/Female, Male/Everyone, and Female/Everyone.
Again, not much difference, which casts some doubt on the reproduction argument, although it’s worth noting that the lower bound of Female/Female and Female/Everyone relationships is slightly lower than the Female/Male bound, and overlaps the half-age-plus-seven line.
There are other interesting attributes in OKCupid profiles, but they are self-reported (e.g. survey responses) and require additional care in order to be used. While I wasn’t able to perform statistical testing to confirm my observations (Pearson’s chi-square test does not work easily in this case since it can only be used on frequencies), I am interested in revisiting dating age range data in the future.