How to quickly remove duplicates from MongoDB

The fastest way to deal with duplicates in MongoDB is to fix whatever is adding them in the first place, but if you’re reading this you probably want to remove the duplicates that are already there.

I had this problem myself and Google wasn’t much help, so here’s my solution.

From here on I’m assuming you’re using MongoDB and NodeJS, but if not you can apply the same approach in any language.

  1. Hash your documents

    As a human, I would find duplicates by looking for documents that have exactly the same content. The easiest way for a computer to do this is to hash each document. A hash is a short numerical ‘digest’ of an input: the same input always produces the same output, so two identical documents will have identical hashes. (The reverse isn’t guaranteed; two different inputs can occasionally produce the same hash, which is known as a collision.) Using this knowledge we can find documents that share a hash and group them together.

    A quick Google search suggested that MurmurHash3 was a good trade-off between speed and collision rate. I didn’t analyze its runtime characteristics, but it worked well for this; you’re free to choose whatever algorithm you’d like.

    Here’s the source for the version I used:

function murmurhash3_32_gc(key, seed) {
	var remainder, bytes, h1, h1b, c1, c1b, c2, c2b, k1, i;
	
	remainder = key.length & 3; // key.length % 4
	bytes = key.length - remainder;
	h1 = seed;
	c1 = 0xcc9e2d51;
	c2 = 0x1b873593;
	i = 0;
	
	while (i < bytes) {
	  	k1 = 
	  	  ((key.charCodeAt(i) & 0xff)) |
	  	  ((key.charCodeAt(++i) & 0xff) << 8) |
	  	  ((key.charCodeAt(++i) & 0xff) << 16) |
	  	  ((key.charCodeAt(++i) & 0xff) << 24);
		++i;
		
		k1 = ((((k1 & 0xffff) * c1) + ((((k1 >>> 16) * c1) & 0xffff) << 16))) & 0xffffffff;
		k1 = (k1 << 15) | (k1 >>> 17);
		k1 = ((((k1 & 0xffff) * c2) + ((((k1 >>> 16) * c2) & 0xffff) << 16))) & 0xffffffff;

		h1 ^= k1;
        h1 = (h1 << 13) | (h1 >>> 19);
		h1b = ((((h1 & 0xffff) * 5) + ((((h1 >>> 16) * 5) & 0xffff) << 16))) & 0xffffffff;
		h1 = (((h1b & 0xffff) + 0x6b64) + ((((h1b >>> 16) + 0xe654) & 0xffff) << 16));
	}
	
	k1 = 0;
	
	switch (remainder) {
		case 3: k1 ^= (key.charCodeAt(i + 2) & 0xff) << 16;
		case 2: k1 ^= (key.charCodeAt(i + 1) & 0xff) << 8;
		case 1: k1 ^= (key.charCodeAt(i) & 0xff);
		
		k1 = (((k1 & 0xffff) * c1) + ((((k1 >>> 16) * c1) & 0xffff) << 16)) & 0xffffffff;
		k1 = (k1 << 15) | (k1 >>> 17);
		k1 = (((k1 & 0xffff) * c2) + ((((k1 >>> 16) * c2) & 0xffff) << 16)) & 0xffffffff;
		h1 ^= k1;
	}
	
	h1 ^= key.length;

	h1 ^= h1 >>> 16;
	h1 = (((h1 & 0xffff) * 0x85ebca6b) + ((((h1 >>> 16) * 0x85ebca6b) & 0xffff) << 16)) & 0xffffffff;
	h1 ^= h1 >>> 13;
	h1 = ((((h1 & 0xffff) * 0xc2b2ae35) + ((((h1 >>> 16) * 0xc2b2ae35) & 0xffff) << 16))) & 0xffffffff;
	h1 ^= h1 >>> 16;

	return h1 >>> 0;
}
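
A quick usage sketch (the document below is made up; the seed can be any fixed number, as long as you use the same one for every document):

//Hash the JSON form of a document; identical input and seed always
//produce the identical 32-bit unsigned integer
var hash = murmurhash3_32_gc(JSON.stringify({ sku: "ABC-1", price: 10 }), 0xad4f3b4f);

console.log(hash); //the same number every time for this input and seed
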
  2. Loop through and hash the documents

    Now, if you’re using NodeJS and Mongoose, there’s a nice, fast way to get documents into your script: create a readable stream, which looks like this:

var mongoose = require('mongoose'),
	Documents = mongoose.model('Documents', require('./documents').schema);

//Assumes mongoose.connect(...) has already been called elsewhere
var documentStream = Documents.find().stream();

documentStream.on('data', function(doc) {
	//Process your document here
});

documentStream.on('close', function() {
	//All the documents have been iterated over
});

This is faster and far lighter on memory than querying for all the records and waiting for the entire result set to be returned, because the stream’s data handler is called as each document arrives from the database.
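
For contrast, here’s a minimal sketch of the non-streaming approach (using the same Documents model as above), which buffers every matching document in memory before any processing starts:

//Buffered approach: the callback only fires once every matching document
//has been loaded into memory
Documents.find().exec(function(err, docs) {
	if (err) throw err;

	docs.forEach(function(doc) {
		//Process your document here
	});
});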

In our case we want to hash each document and group the documents by their hash, which looks like this:

//_ can be lodash or underscore here (either provides _.clone);
//util is Node's built-in util module used for the progress output below
var _ = require('lodash'),
	util = require('util');

var group = {},
	processed = 0;

documentStream.on('data', function(doc){
	//Make a copy of the doc; with a full Mongoose document you may want
	//doc.toObject() here so you get a plain object before deleting fields
	var copy = _.clone(doc);

	//Delete the ID (since it's unique)
	delete copy._id;

	//Hash the object
	var key = murmurhash3_32_gc(JSON.stringify(copy), 0xad4f3b4f);

	//Ensure the key's value is an array so we can push 
	//objects into it
	if (typeof group[key] == 'undefined'){
		group[key] = [];
	}

	//Group the documents by their hash
	group[key].push(doc);

	//Show how many documents have been processed so far
	process.stdout.write(util.format(" %d processed so far", ++processed) + "\r");
})

One caveat to the process is that we have to make a copy of the document and delete its _id. Even between exact duplicates the _id will be unique, so hashing a document with its _id included would never produce a match. Therefore we hash the content of the document without the _id, to test whether the content itself is unique.
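
To illustrate with made-up values: two duplicates differ only by their _id, so once the _id is stripped their serialized content, and therefore their hashes, will match:

//Hypothetical duplicates: identical content, different _id
var docA = { _id: "id-one", sku: "ABC-1", price: 10 },
	docB = { _id: "id-two", sku: "ABC-1", price: 10 };

delete docA._id;
delete docB._id;

//With the _id removed the JSON strings are identical, so the hashes are equal
var hashA = murmurhash3_32_gc(JSON.stringify(docA), 0xad4f3b4f),
	hashB = murmurhash3_32_gc(JSON.stringify(docB), 0xad4f3b4f);

console.log(hashA === hashB); //true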

As you can see, we build an object where each key is a hash and the value is an array of the documents that produced that hash. This lets us group the duplicates together for processing later.
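
To make the shape of that structure concrete, here’s a hypothetical snapshot of the group object once the stream has finished (the hash keys and field values are made up):

//Hypothetical snapshot of `group`: one unique document, one pair of duplicates
var group = {
	"1326696120": [{ _id: "id-one", sku: "ABC-1" }],
	"3735928559": [{ _id: "id-two", sku: "XYZ-9" }, { _id: "id-three", sku: "XYZ-9" }]
};

//Any hash whose array holds more than one document marks a set of duplicates
var duplicateHashes = Object.keys(group).filter(function(key) {
	return group[key].length > 1;
});

console.log(duplicateHashes); //[ '3735928559' ]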

Once this process is done and the database signals that all the documents have been iterated over, the close handler is called and we can do something with our grouped data. In my case it looked like this:

//Once all the objects have been hashed, delete the duplicates
documentStream.on('close', function(){
	var toRemove = [];

	//For each hash in the group of hashes
	for (var key in group){
		//If this array has more than one document
		if (group[key].length > 1){
			//Intentionally stop before index 0 so the first document in each
			//group is kept; only the extra copies are queued for removal
			for (var i = group[key].length - 1; i > 0; i--) {
				//Add the _id to the list of IDs to remove
				toRemove.push(group[key][i]._id);
			};
		}
	}

	//Once we’ve built up the list, let’s remove them, woo!
	//Query on _id, the field we collected above
	Documents.remove({_id: {$in: toRemove}}, function(err){
		if (err){
			console.error(err);
			process.exit(1);
		}

		console.log("duplicates removed successfully");
		process.exit(0);
	});
})

We loop over all the hashes in the group, and if a hash has more than one document we add the _ids of the extras to the toRemove array. Once those are all collected, we tell Mongoose to find all the documents with those _ids and remove them. Easy.

And that’s it. Using this technique I removed a few thousand duplicates from the database in just a few seconds.

If you’re having trouble keeping duplicates out of your database in the first place, you can combine this technique with a “hash” field and a unique index on that field, a technique I’ll demonstrate in an upcoming blog post.
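
As a rough sketch of that idea (the field name and schema here are assumptions, not necessarily what that post will use): store the content hash on every document and let a unique index reject any insert whose hash already exists.

var mongoose = require('mongoose');

//Sketch only: persist the content hash and let MongoDB enforce uniqueness
//via a unique index on the hash field
var documentSchema = new mongoose.Schema({
	//...your existing fields here...
	hash: { type: Number, unique: true }
});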

If you have any comments or questions let me know.