Ancestry.com, the world’s largest online family history resource, uses machine learning and several other statistical techniques to provide services such as ancestry information and DNA sequencing to its users.
According to Chief Technology Officer Scott Sorensen, Ancestry.com has more than 12 billion records that are part of a 10-petabyte (or 10-million-gigabyte) data store. A search for “John Smith,” he explained, will likely yield about 80 million results for “Smith” and about 4 million for “John Smith,” but you are interested only in the handful that are relevant to your John Smith. For Ancestry.com, data is highly strategic. As Sorensen explains, there are five fundamental ways the company uses data to enhance the customer experience. These include:
- With more than 30,000 record collections in their data store, including birth, death, census, military, and immigration records, they mine this data using patterns in search behavior to communicate with their more than 2 million subscribers and tens of millions of registered users in a more relevant way. For instance, only a subset of their users will be interested in newly released Mexican census data.
- They mine their data to provide product development direction to the product team. Analyzing search behavior can show where a subscriber might be stuck or where they leave the service and therefore where new content could be created.
- They rely on big data stores to develop new statistical approaches for algorithms such as record linking and search relevance. Today, the vast majority of user discoveries come from Ancestry.com hints derived from strategically linked records and past search behavior (e.g., Charles ‘Westman’ is the same person as Charles ‘Westmont’). Two years ago, the majority of discoveries were based on user-initiated search.
- Advanced data forensics is used to mine data for security purposes to ensure appropriate use of their information.
- DNA genotyping to provide information about genetic genealogy is a new area of focus. Customers spit in a tube and send the package to Ancestry.com, where molecular tests and computational analyses are performed to predict a person’s ethnicity and identify relatives in the database. For every AncestryDNA customer, 700,000 SNPs (single-nucleotide polymorphisms, distinct variable positions in DNA) are measured and analyzed, resulting in 10 million cousin predictions for users to date.
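To make the record-linking idea concrete (treating Charles ‘Westman’ and Charles ‘Westmont’ as the same person), here is a minimal sketch in Python. It is not Ancestry.com’s algorithm, which the article does not describe. It assumes a simple rule: identical given names, a fuzzy surname comparison using the standard library’s `difflib.SequenceMatcher`, and birth years within two years of each other. The record fields (`given`, `surname`, `birth_year`), the function names, and the 0.7 threshold are all hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0 and 1 for two names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def likely_same_person(rec_a: dict, rec_b: dict, threshold: float = 0.7) -> bool:
    """Hypothetical linking rule, not a production algorithm.

    Two records are linked when the given names match exactly, the birth
    years differ by at most two, and the surnames are similar enough.
    """
    if rec_a["given"].lower() != rec_b["given"].lower():
        return False
    if abs(rec_a["birth_year"] - rec_b["birth_year"]) > 2:
        return False
    return name_similarity(rec_a["surname"], rec_b["surname"]) >= threshold


# Illustrative records; the birth years are invented for this example.
census = {"given": "Charles", "surname": "Westman", "birth_year": 1871}
immigration = {"given": "Charles", "surname": "Westmont", "birth_year": 1872}
unrelated = {"given": "Charles", "surname": "Smith", "birth_year": 1871}

print(likely_same_person(census, immigration))
print(likely_same_person(census, unrelated))
```

A real system would presumably weigh many more fields (places, dates, relatives) and learn its thresholds from data such as past search behavior; the point here is only that near-identical spellings can be linked automatically where an exact-match search would miss them.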
A portion of Ancestry.com’s data is processed on three clusters using MapR as the Hadoop distribution. One cluster is for DNA matching, another is for machine learning, and the third, which is still being built out, is for data mining. Massively parallel distributed processing is required to mine 10 petabytes of data and the large quantities of DNA data. Ancestry.com runs batch jobs and wants to run the DNA pipeline constantly with no interruptions, so high availability is very important. MapR’s high-availability JobTracker enabled the company to run different tasks on the same cluster. The company has also been pleased with MapR’s service and support, and with the ability to get everything up and running quickly using the graphical user interface and client configuration.