Saturday 08/24/2024 by phishnet

PHISH.NET SHOW RATINGS, PART 3: VARIANCE AND BIAS IN RATINGS: WHAT'S THE PROBLEM?

[We would like to thank Paul Jakus (@paulj) of the Dept. of Applied Economics at Utah State University for this summary of research presented at the 2024 Phish Studies Conference. -Ed.]

The first two blogposts in this series can be found here and here. This post will address the statistical biases believed to be present in the data, and how anomalous raters may contribute to bias.

Statistically, a show rating represents our best point estimate of an unobservable theoretical construct: the “true” show rating. To the degree that an estimated show rating deviates from its true value, the error is composed of sampling variance and bias. In the figure below, think of the bullseye as the true show rating, and the red dots as our estimates (best guesses) of the true value.

[Graph 1: bullseye targets illustrating the four combinations of low/high bias and low/high variance]
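(For readers who want the formal version: the figure corresponds to the standard decomposition of an estimator's mean squared error into squared bias plus variance. The notation below is generic textbook notation, not anything specific to the ratings model.)

    \mathbb{E}\big[(\hat{r} - r)^2\big] \;=\; \underbrace{\big(\mathbb{E}[\hat{r}] - r\big)^2}_{\text{bias}^2} \;+\; \underbrace{\mathbb{E}\big[(\hat{r} - \mathbb{E}[\hat{r}])^2\big]}_{\text{variance}}

where r is the true show rating and r-hat is the rating we actually compute from submitted votes.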

Variance acknowledges an unavoidable aspect of any statistic based on sampling: we’ll probably get it wrong because we can’t ask everybody about their rating of every show. That said, if we have a large, unbiased sample, we’ll come really close to hitting the bullseye (the lower left target). A smaller but still unbiased sample will miss by a larger amount but, on average, the misses will be randomly spread around the true value (the lower right target).
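A minimal simulation makes the point concrete. The numbers below are made up for illustration (they are not drawn from the actual ratings database); the only thing to notice is how the spread of the estimated rating shrinks as the sample of raters grows, while its center stays on the true value.

    import numpy as np

    rng = np.random.default_rng(42)

    # Hypothetical population of fan opinions about one show, "true" mean rating near 4.0
    population = np.clip(rng.normal(loc=4.0, scale=0.8, size=100_000), 1, 5)
    true_mean = population.mean()

    for n in (10, 100, 1000):
        # 2,000 independent samples of n raters, each producing one estimated show rating
        estimates = np.array([rng.choice(population, size=n).mean() for _ in range(2000)])
        print(f"n={n:5d}  center of estimates={estimates.mean():.3f}  "
              f"spread (std) of estimates={estimates.std():.3f}  true mean={true_mean:.3f}")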

Bias measures the degree to which we systematically miss the true value. That is, there’s something about the people who rate shows, or how they rate shows, that means we will always miss the bullseye regardless of the amount of variance (the targets in the top row). We have two major forms of bias:

Sampling Biases

We do not have a truly random sample because people who rate shows—and the shows they choose to rate—are self-selected.

1. Potential raters are not drawn with equal probability from the population of Phish fans. Instead, all raters must be registered users of Phish.net, who are likely to be more enthusiastic than the typical Phish fan. Even then, only a fraction of all registered users have ever rated a show. The ratings database is unlikely to be representative of the general population of fans.

2. Most Phish.net raters do not rate all the shows they’ve heard on tape, seen on video, or attended in person. If people selectively rate only those shows they consider the best, or those shows they have heard are the best, or only those shows they have attended, then the shows raters choose to rate are no longer random, and we are likely to systematically miss the true show rating.
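Here is a sketch of how self-selection alone can pull the estimate off the bullseye, using made-up numbers rather than the actual .Net data: suppose the probability that a fan bothers to rate a show rises with how much they liked it.

    import numpy as np

    rng = np.random.default_rng(7)

    # Hypothetical honest opinions of 50,000 fans about one show (true mean about 3.5)
    opinions = np.clip(rng.normal(loc=3.5, scale=1.0, size=50_000), 1, 5)

    # Self-selection: the more you liked the show, the more likely you are to rate it
    p_rate = np.clip(0.05 + 0.10 * (opinions - 1), 0, 1)
    rated = rng.random(50_000) < p_rate

    print(f"true mean opinion       : {opinions.mean():.2f}")
    print(f"mean of submitted votes : {opinions[rated].mean():.2f}  (n = {rated.sum()})")
    # Every individual vote is honest, yet the submitted average sits above the true mean.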

Response Biases

1. Attendance bias. Ratings for shows that one has attended are alleged to be biased upwards (fluffed) because attending a show is the best way to enjoy Phish.

2. Recency bias. Rating a show too soon after its conclusion is alleged to bias show ratings upward (fluffed) because the “warm glow” of a solid performance has not yet faded.

3. Herding effect. Phish.net shows any rater the average show rating of all previous raters. For example, you think a show earned a ‘3’, but then you see on .Net that 165 people have rated that show, on average, a ‘4.3’. The product marketing literature has found that people adjust their initial rating so as to follow the herd (i.e., maybe you submit a ‘4’ instead of a ‘3’); a small simulation after this list sketches the effect.

4. Deliberate distortion. Here we’re talking about people who rate almost every show a ‘1’ (bombers) or a ‘5’ (fluffers) regardless of show quality. Are these people simply telling their own truth, or are their ratings intended to bias the rating estimate away from its true value?
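Here is the herding sketch promised above. The 30% pull toward the displayed average is an assumption chosen purely for illustration, not an estimate taken from the marketing literature or from .Net data.

    import numpy as np

    rng = np.random.default_rng(3)

    # Hypothetical stream of honest opinions about one show (true mean about 3.0)
    honest = np.clip(rng.normal(loc=3.0, scale=1.0, size=500), 1, 5)

    weight = 0.3                      # assumed pull toward the displayed average
    submitted = [honest[0]]
    for opinion in honest[1:]:
        displayed_avg = np.mean(submitted)            # the average the rater sees before voting
        submitted.append((1 - weight) * opinion + weight * displayed_avg)

    print(f"mean of honest opinions   : {honest.mean():.2f}")
    print(f"mean of submitted ratings : {np.mean(submitted):.2f}")
    print(f"std of honest opinions    : {honest.std():.2f}")
    print(f"std of submitted ratings  : {np.std(submitted):.2f}")
    # Herding compresses the spread of submitted ratings and anchors them to the earliest votes.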

While we can do little to control for sampling biases, we may be able to use the average deviation and entropy metrics to mitigate the various response biases—if these biases are associated with anomalous rating behavior. That will be the topic of the fourth, and final, blogpost.
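For the curious, here is a minimal sketch of how rater-level flags along these lines could be computed. The formulas below are simplified stand-ins (entropy over a rater's own distribution of 1–5 scores, and the rater's average gap from the site-wide show averages); the precise definitions used in the research are the ones given in the earlier posts.

    import numpy as np
    from collections import Counter

    def rater_metrics(rater_scores, site_means):
        """rater_scores: one rater's 1-5 scores; site_means: the corresponding show averages."""
        counts = np.array(list(Counter(rater_scores).values()), dtype=float)
        probs = counts / counts.sum()
        entropy = -np.sum(probs * np.log2(probs))   # 0 for someone who gives every show the same score
        avg_dev = np.mean(np.array(rater_scores) - np.array(site_means))
        return entropy, avg_dev

    # Hypothetical raters: a "fluffer" who rates every show a 5, and a more typical rater
    site_means = [3.8, 4.2, 3.5, 4.6, 4.0]
    print(rater_metrics([5, 5, 5, 5, 5], site_means))   # entropy 0.0, deviation about +1.0
    print(rater_metrics([4, 4, 3, 5, 4], site_means))   # entropy about 1.4, deviation near 0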

If you liked this blog post, one way you could "like" it is to make a donation to The Mockingbird Foundation, the sponsor of Phish.net. Support music education for children, and you just might change the world.


Comments

comment by Multibeast_Rider
I'm curious if there are any shows that are particularly anomalous. I always wondered how 7-14-19's distribution looked because I know at one point it had by far the most ratings of any show ever.
comment by paulj
That's a good question. In the Discussion Thread that accompanies these posts, a couple of other netters have asked about the distribution (variance) of ratings for any one show. There's actually a (small) literature about what happens when people see the distribution of ratings for a given product. One study of online book sales found that for any two books with the same average rating, the book with the wider distribution of ratings had higher sales. In the context of Phish shows, I think this might mean that, of two shows with the same rating, more people would be driven to listen to the show that had a wider variety of ratings.

I've not done anything with "within show" variance of ratings, but that could be really interesting.
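If anyone wants to poke at it, the calculation itself is trivial; with made-up numbers (not the real ratings table) it's just:

    import numpy as np

    # Two hypothetical shows with the same average rating but very different agreement
    show_a = np.array([4, 4, 4, 4, 4, 4, 4, 4])        # everyone agrees
    show_b = np.array([5, 5, 5, 5, 3, 3, 3, 3])        # same mean, polarized opinions

    for name, ratings in (("show A", show_a), ("show B", show_b)):
        print(f"{name}: mean = {ratings.mean():.2f}, within-show std = {ratings.std():.2f}")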
comment by mschoobs
I am sure I will get blasted for this, but it has been bothering me ever since I saw these articles. I love this site, so please take this with positive intent.

I think all of these articles are interesting but ultimately irrelevant. I respect what you are trying to do as I am a data person myself.

Better definition is required to get what you are after. What is the reason we have ratings? What makes a great vs good show? Who should get to decide that?

There is no such thing as a true show rating. Without objective guidelines for scoring you cannot have a true show rating. What we have is a popular opinion score based on subjective feelings. There is no right or wrong, there is only opinion. Certainly people will disagree on what is "true" or "right," but those are just opinions. My wife and I have completely different opinions on show ratings, but that doesn't make her wrong. Deciding whose vote should count and whose should not is control.

Trying to control opinions is making yourself the arbiter of truth. That is OK; I am fine with it. It is your site and you are entitled to do what you want with it. Jam charts work this way and I accept it. I don't agree with Jam charts all the time and that is OK.

These articles are a justification for control. I really don't care if you control the scores. You don't need to justify your reasons either.

If we are after a democratic scoring system (I don't know what you are after with the "true value"), then everyone should have their vote count as 1, regardless of their stance. Each person should get one vote, which should be achievable.

End of rant. Thank you for all of the information, but I think you are missing the forest for the trees.
comment by paulj
@mschoobs said:
I am sure I will get blasted for this
I sincerely hope that no one is blasting anybody. Everyone should rate shows any way they want to.

Better definition is required to get what you are after. What is the reason we have ratings? What makes a great vs good show? Who should get to decide that?

There is no such thing as a true show rating.
The things that make a good show versus a great show are entirely up to the rater, and raters may differ on what those things are. Like you, I am not aware of any formal definition of what the show rating is supposed to measure, but I believe there is an informal consensus. The average show rating represents what we, as a community composed of vastly different people, think of a given Phish performance.

Is there a “true show rating”? Yes, in the theoretical sense that (1) everyone who hears a show has an opinion about it, and (2) we can measure a numeric rating that represents a mix of all of our opinions. My work examines how we use those opinions (numeric ratings) to construct a summary measure that best captures what the community thinks. This, of course, moves us out of the theoretical realm and into the empirical.

Without objective guidelines for scoring you cannot have a true show rating. What we have is a popular opinion score based on subjective feelings.

Yes, the average show rating is a popular opinion score based on subjective feelings. But the key element of this research is that we can link show ratings to the things we, as fans, profess to love about Phish. Do we like jamming? Do we like silky segues between songs? Do we like to hear rarely performed songs?

I believe the answer to all of these questions (and more) is, “yes.” If that’s the case, then we can connect the subjective show rating to objective (quantifiable) measures of show quality. If show ratings are correlated with quantitative measures drawn from the setlist, then we have established the empirical validity of show ratings.
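The check itself is simple; with made-up numbers standing in for the real setlist metrics, it's just a correlation:

    import numpy as np

    # Hypothetical data: average show rating and total minutes of extended jamming per show
    ratings     = np.array([3.2, 3.8, 4.1, 4.5, 2.9, 4.7, 3.5])
    jam_minutes = np.array([12, 25, 31, 48, 8, 55, 20])

    r = np.corrcoef(ratings, jam_minutes)[0, 1]
    print(f"correlation between show rating and jam minutes: {r:.2f}")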

My research question is, “Given all of the information we have on raters and ratings, is the simple mean rating of a show the best way to measure fans’ collective opinion?”

Trying to control opinions is making yourself the arbiter of truth. That is OK; I am fine with it.
I am absolutely NOT okay with controlling opinions and, in my numerous discussions with fans regarding ratings, I have yet to encounter anyone who is. Please, rate any show any way you want, according to any criteria you want. But I firmly believe that any one person's show ratings, and average show ratings, should be correlated with what happens onstage at a Phish show.
comment by mschoobs
You are after a noble goal. IMO, the work you are doing is great but your intended use is not correct. I think using it to educate people so that we get better reviews more consistently is the answer. Designing a system that decides how ratings count is control.

My research question is, “Given all of the information we have on raters and ratings, is the simple mean rating of a show the best way to measure fans’ collective opinion?”
I think that is great question. I also believe what you learn could be put to wonderful use. Educating the community may encourage more and better ratings.

But I firmly believe that any one person's show ratings, and average show ratings, should be correlated with what happens onstage at a Phish show.
Implementing a system that does this is control.

I know you do not intend it. That is a by-product of the quest for perfection. Variables need to be controlled to get a better signal in the data.

If you want to change the ratings, we should completely change them and leave the existing ratings alone. I would introduce a new rating that is well defined and understood in terms of its calculation.

Thank you for the discourse. Please keep up the good work, I do really appreciate the thought that has gone into this.