The other day I was tasked with adding tagging functionality to one of the projects I'm working on. While this is quite basic functionality it comes with a few challenges - from properly configuring relations in the DB and the ORM, through properly handling operations on tags to properly handling connections between tags and tagged entities.
How tagging (usually) works?
If I were writing a blogging platform I'd do the following to implement tagging posts:
- Create posts table,
- Create tags table to store all the tags,
- Create posts_to_tags table to store connections between posts and tags,
- Configure the ORM to see many-to-many relationship between posts and tags,
- Write application code to handle tags and post-to-tags connections.
To do this right, I'd have to put a lot of consideration and care into managing tags (e.g. handling duplicate tags) and post-to-tags connections (e.g. removing the existing connections when updating a post).
When displaying posts I'd have to issue a separate query to load tags for each post I'd like to display. That'd require more resources for my app and put the DB server(s) under heavier load.
When searching for posts by tag I'd have to issue quite an expensive query with two joins in it.
Caching posts with their tags and search by tag results would require writing additional glue code in the application.
There must be an easier way to achieve this.
How tagging could work while being simple?
How about saving tags with every post? Here's where PostgreSQL's ARRAY data type comes into play. To add tags to posts I'd just add tags column to posts table and set it to store an array of strings. Then, when saving a post I'd simply take whatever tags the user entered and save them in the array.
When displaying posts I'd simply display the array of tags stored in the column.
When searching for posts by tag I'd simply add a WHERE condition to the query.
Caching posts and search by tag results would be no different from caching any other query results.
Sounds nice and easy but would it work? How would it affect the performance?
Turns out array based tags performed better in my totally unreliable test. To be honest, I was a bit suprised when I saw how good the array filtering performance was. I expected the result to be quite the opposite.
ARRAY data type in PostgreSQL is a nice thing to have and to use. If I were to implement the tagging functionality I mentioned in the beggining of this post again I'd use an array columns. If I were to write, say, an Instagram clone I'd go with different tagging implementaion. The thing is that array column would work until there's a small number of entries. In the tests I performed while playing around I found that 20+ items in the tags array caused a major performance hit.
Discuss this post on Reddit.