When I visited Japan, I would frequently stumble across reviews like this on Tabelog:
This place was life changing! I mean literally, they saved my mom’s life, the employee performed CPR. They also gave me a free meal for my troubles. 4 stars.
Meanwhile in the U.S. on Google Maps, I see reviews like this:
This place sucks! I mean literally, the employee killed my mom with some peanuts. They didn’t even comp the meal. 4 stars.
Okay, I might be exaggerating a tad. But rating norms are currently all over the place.
This doesn’t help us communicate about important issues like which movies, books, and restaurants we should waste our time and money on.
So let’s design a better system. We’ll stick to the five-star scale, half stars included, since that’s what most platforms use.
Why and how to rate
Rating is fundamentally a compression problem. We’re taking the full richness of our experience and cramming it into a handful of discrete buckets. The question is where to draw the bucket boundaries – and like any compression problem, the answer depends on what information we care about preserving.
This matters because ratings serve two purposes: personal tracking and communicating with others. The better our compression scheme captures our actual experience, the more useful our ratings become for both.
Let’s formalize this.
Think of rating as projecting our personal percentile rankings onto a 10-point scale (five stars in half-star steps). I’ll use movies as the running example, but the same logic applies to restaurants, books, or anything else you might rate. We’ll take selection bias as given – a “50th percentile movie” means the median among films we’ve watched, not all films ever made.
To maximize the entropy of our ratings, we can simply use a uniform distribution, where each of the ten ratings covers an equal 10% slice of the percentile range:
While this is optimal from an information-theoretic perspective, it is not optimal for personal tracking. There is no need to distinguish precisely between a 40th percentile movie and a 50th percentile movie, nor does that distinction help us recommend and discuss movies with others. Who cares whether Transformers was a 4 or a 5 out of 10?
At the top, though, we care a lot about those extra 10 percentile points.
An 89th percentile film is just great, whereas a 99th percentile film will be one of the top few films you’ll ever see. This difference is profound and worth tracking.
So we only care about maximizing information where it matters, and this points towards compressing away the middle part of our distribution in order to gain granularity at the top end.
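To put rough numbers on that trade-off, here is a minimal Python sketch. The `entropy_bits` helper and the skewed bucket shares are made up for illustration and not taken from any real rating data. The only point is that the uniform scheme carries the most bits overall, and any reshaping spends some of those bits to buy resolution where we want it.

```python
import math

def entropy_bits(shares):
    """Shannon entropy, in bits, of a distribution over rating buckets."""
    total = sum(shares)
    return -sum((s / total) * math.log2(s / total) for s in shares if s > 0)

# Ten buckets (0.5 to 5 stars), each holding 10% of what we've watched.
uniform = [10] * 10

# Hypothetical skewed shares (percent per bucket, lowest rating first):
# wide buckets in the middle, narrow ones at the extremes.
skewed = [2, 8, 15, 20, 25, 12, 8, 5, 3, 2]

print(f"uniform: {entropy_bits(uniform):.2f} bits")
print(f"skewed:  {entropy_bits(skewed):.2f} bits")
```

The uniform scheme comes out at log2(10), about 3.32 bits per rating, and any uneven split lands below that.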
Redistribution
If we aggressively compress the middle and shift that granularity upward, we get something like this:
Our top end looks much better. Any movie that is 70th percentile or above will be classified pretty accurately!
But this still isn’t great. We want to be able to distinguish between a 40th and a 69th percentile movie. Let’s smooth out the distribution a bit, and also give ourselves some granularity at the bottom end so that we can distinguish the truly horrid movies:
This looks pretty reasonable. Of course, individuals are free to deviate based on personal preference and on the kind of thing they are rating. For example:
- Someone who enjoys hate-watching movies might desire more granularity at the bottom end of the scale
- Someone rating restaurants might prefer less aggressive compression to reflect higher uncertainty, since a single visit only samples a slice of what a restaurant offers
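As a rough sketch of what a curve like this looks like in code, here is a percentile-to-stars lookup in Python. The cut points are assumptions: only the 2nd and 70th percentile boundaries match the anchor examples given later in this post, and the rest are placeholders I picked to get the general shape, not values read off the charts. `CUT_POINTS` and `stars_for_percentile` are hypothetical names.

```python
from bisect import bisect_right

# Percentile cut points between adjacent ratings, lowest first.
# Only 2 (0.5 | 1 star) and 70 (2.5 | 3 stars) come from the post;
# every other value is a made-up placeholder to tune to taste.
CUT_POINTS = [2, 10, 25, 45, 70, 82, 90, 95, 98]

def stars_for_percentile(percentile, cut_points=CUT_POINTS):
    """Map a personal percentile (0-100) to a rating from 0.5 to 5 stars."""
    if not 0 <= percentile <= 100:
        raise ValueError("percentile must be between 0 and 100")
    # Number of cut points at or below the percentile: 0 .. len(cut_points)
    bucket = bisect_right(cut_points, percentile)
    return 0.5 * (bucket + 1)

for p in (40, 69, 89, 99):
    print(p, stars_for_percentile(p))
```

With these placeholder cut points, the 40th and 69th percentiles land half a star apart, while the 89th and 99th end up a star and a half apart. Swapping in a different list of nine cut points is all it takes to express a personal variation.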
Even without our precious half stars, we can still come up with a similarly shaped curve:
Implementation
“But wait,” you say, “this is all well and good in theory, but when I walk out of a restaurant, I don’t have a percentile ranking computed. I just have… vibes.”
This is valid, and it’s a real problem with the current system too – we just don’t think about it. When you rate a restaurant, you are implicitly claiming that it is better than some percentage of restaurants that you have been to. You’re just doing the percentile calculation intuitively.
Here is how to actually implement this:
- Calibrate. For your first ~50 ratings in a domain, don’t stress about getting each one right. Just rate things, then go back and shuffle stuff around until your distribution matches your target curve.
- Extract reference points. Once you have a calibrated set of ratings, identify “anchor” items that you know really well and can use as boundaries. For me, Halal Guys is my 2nd-percentile restaurant – the boundary between 0.5 and 1 star. The local Lebanese place I’ve been to dozens of times is my 70th-percentile restaurant – the boundary between 2.5 and 3 stars. And so on.
- Rate by comparison. When you encounter a new restaurant, don’t try to compute a percentile. Just make a few comparisons to your reference points and you are done.
- Recalibrate periodically. Your reference points will drift. And if you get a new job and start eating at fancier places, your old anchors will become outdated. Just notice when things feel off and recalibrate.
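If you like seeing this kind of thing as code, the calibration check and the comparison step can be sketched in a few lines of Python. Everything here is an assumption for illustration: `distribution_gap` and `rate_by_comparison` are hypothetical helpers, the `beats` callback stands in for your own gut judgment, and the search assumes your comparisons are consistent (if you beat an anchor, you beat every worse anchor too).

```python
from collections import Counter

def distribution_gap(ratings, target_shares):
    """Actual minus target share for each rating (positive means overused)."""
    if not ratings:
        raise ValueError("need at least one rating")
    counts = Counter(ratings)
    return {stars: counts[stars] / len(ratings) - share
            for stars, share in target_shares.items()}

def rate_by_comparison(beats, anchors, lowest=0.5, highest=5.0, step=0.5):
    """Binary-search a new item against anchors sorted from worst to best.

    anchors: list of (name, stars), where stars is the rating that begins
    at that anchor. beats(name) answers "was the new place at least as
    good as this anchor?". Returns the (min, max) rating still possible.
    """
    lo, hi = 0, len(anchors)
    while lo < hi:
        mid = (lo + hi) // 2
        if beats(anchors[mid][0]):
            lo = mid + 1   # at least as good as this anchor: look higher
        else:
            hi = mid       # worse than this anchor: look lower
    floor = anchors[lo - 1][1] if lo > 0 else lowest
    ceiling = anchors[lo][1] - step if lo < len(anchors) else highest
    return floor, ceiling

# The two anchors named in the list above; a full set would have nine.
anchors = [("Halal Guys", 1.0), ("local Lebanese place", 3.0)]

# A new place that beats Halal Guys but not the Lebanese place.
print(rate_by_comparison(lambda name: name == "Halal Guys", anchors))  # (1.0, 2.5)
```

With only two anchors the result is a range rather than a single rating. Each extra anchor narrows it, and with one anchor per boundary the range collapses to a single value after three or four comparisons.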
What about everyone else?
You might worry about coordination – if everyone else rates “good” as 4 stars and you rate it 2.5, you look like a pretentious snob. But this matters less than it seems. Friends will learn your scale. Platforms like Letterboxd publish your distribution. And on platforms that don’t show distributions, your individual ratings are just noise in the aggregate anyway.
◆ ◆ ◆
Thanks to Michael for helping me iterate on the implementation details.