Posts Tagged ‘flickr’

InnoDB Primary Key Clustering In Action @ Flickr

Sunday, May 8th, 2011

InnoDB clustered indexes is one of the most powerful, yet poorly understood optimizations.

The idea is that InnoDB will try to place items in primary key sort order together on disk. Thus, if your primary key is a numeric column called id, rows with id 1, 2, and 3 will be adjacent on disk, as will rows with id 21, 22, 23 etc. If you’re fetching rows that are adjacent when sorted on primary key, they will be adjacent on disk, resulting in sequential disk access. This sort of thinking is important when you make the decision to put on your big-boy pants and not just put your entire data set in memory.

The part that people don’t often think about is that composite primary keys follow the same rule. Tim Denike, Flickr’s DBA, demonstrates just how powerful the effect of clustered indexes can be:

The trick, as explained in the comments on the photo, is that most queries on Flickr end up being queries for multiple photos by the same owner (I suspect narcissism has a lot to do with that..) Thus if you have an index on owner_id, photo_id, the fact that the owner_id is the prefix causes photos from the same owner to be placed together on disk.

Obviously, since not all the photos are written to the database at the same time, some fragmentation is inevitable in the real world; nonetheless, it’s clearly a huge performance win. I believe some routine maintenance commands (OPTIMIZE? I can never remember all the different commands, someone will have to fact-check this..) will effectively “defrag” the table, moving all the rows into sequential primary-key order.

InnoDB clustering is covered in great detail in Ch. 3 of “High Performance MySQL

“Fighting Spam at Flickr” at Web2.0Expo

Saturday, May 15th, 2010

I recently had the giddy honor of speaking at the 2010 Web2.0Expo in San Francisco. The topic was simple – spam. I shared some insights (or I hope they were insights, anyway) about combating the spam problem on a social website – something I had been doing quite a lot of since joining Flickr. The slides are now on Slideshare and embedded below.

Thanks to Brady and the rest of the w2e team for putting together a great conference. I didn’t get to go to as many sessions as I would have liked due to having to spend most of my time in the speakers lounge preparing, but the ones I did go to were excellent.

Things I forgot to say in the talk/slides that are important:

  • Keep track of recent rates for ALL activity that your users do. This gets a bit expensive in terms of storage, but if you prune the data furiously, it can be made sustainable. Having that information is key – it can be used at pretty much every step of spam mitigation. Also, be smart about this – if messages can be deleted from a table, don’t use that table to do the counting. Nobody I know has EVER done that……

  • Rate limit everything. There’s usually a sweetspot right between what 99% of real users will actually ever do and spam-land.

Anyway, here are the slides. Enjoy!