If you are comparing two cards whose spread in 3-0% is less than the spread between the Elves, it's probably not that useful.
There’s a more principled way of doing this in statistics. You define what’s called a “null distribution” and ask if differences between cards (or absolute %’s) could be explained by this null distribution. This is usually translated into something called a p-value, which is indicative of how significant any given result is. This is something I plan to develop over the course of this project, but you’re right - differences between functionally identical cards serve as a good benchmark.
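To make the null-distribution idea concrete, here's a minimal sketch (every number below is made up for illustration; this isn't the project's actual code). It simulates how often two functionally identical cards would show a gap in 3-0 rate as large as an observed one, purely by chance:

```python
import random

# Hypothetical setup: two identical cards, each picked in n_drafts drafts.
# We ask how often chance alone produces a 3-0-rate gap as large as the
# one observed. All counts/rates here are invented for illustration.
random.seed(0)

n_drafts = 40          # drafts in which each card was picked (assumed)
observed_gap = 0.15    # observed difference in 3-0 rate (assumed)
true_rate = 0.30       # shared "null" 3-0 rate for identical cards (assumed)

def simulated_gap():
    # Simulate each card's 3-0 rate under the shared null rate.
    a = sum(random.random() < true_rate for _ in range(n_drafts)) / n_drafts
    b = sum(random.random() < true_rate for _ in range(n_drafts)) / n_drafts
    return abs(a - b)

null_gaps = [simulated_gap() for _ in range(10_000)]
# p-value: fraction of null draws at least as extreme as what we saw.
p_value = sum(g >= observed_gap for g in null_gaps) / len(null_gaps)
print(f"p-value: {p_value:.3f}")  # a large p means noise alone explains the gap
```

With samples this small, even a 15-point gap between identical cards comes out non-significant, which is exactly the benchmark you're describing.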
The card that provides amazing value in a losing effort gets ignored. The card that provides poor (or no) value in a winning deck wielded by a skilled player gets credit for winning a draft. Without feedback from players regarding card performance, the data in and of itself doesn't carry much weight at all.
The nice thing about random variation like this is that it theoretically goes away with sample size. The card that provides amazing value in one losing draft should eventually (absent bias) increase in frequency in winning decklists because it is a good card - the reverse is true for bad cards that sometimes go in good decks. This is true regardless of player skill. If our dataset had 1 million decklists in it, we could easily make card-vs-card comparisons (as long as the cards occupied similar niches in the cube).
Now, will our dataset ever approach the requisite sample size to really answer specific card comparison questions? Probably not, but that doesn't mean it theoretically can't. In any case, what I’m more excited to investigate are deck qualities. How many reanimation spells does the average winning reanimator deck have? How many tinker targets/fodder does the average tinker deck have? The nice thing about these questions is that they dodge some of the build-around bias that LucidVision mentioned (Tinker and NO having lower win rates because they’re build-arounds, for example), and they’re less susceptible to noise.
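As a toy illustration of what a deck-quality question looks like computationally (the card names and decklists below are invented placeholders, not project data), counting reanimation spells per winning deck is just tag-and-average:

```python
from statistics import mean

# Toy sketch: "How many reanimation spells does the average winning
# reanimator deck run?" The tag set and decklists are invented.
REANIMATION_SPELLS = {"Reanimate", "Animate Dead", "Exhume"}

winning_decklists = [
    ["Reanimate", "Exhume", "Griselbrand", "Entomb"],
    ["Animate Dead", "Reanimate", "Inkwell Leviathan"],
]

def reanimation_count(deck):
    # Count how many cards in the deck carry the "reanimation" tag.
    return sum(card in REANIMATION_SPELLS for card in deck)

avg = mean(reanimation_count(d) for d in winning_decklists)
print(f"average reanimation spells: {avg:.1f}")  # -> 2.0
```

Because each deck contributes a count rather than a single card appearance, questions like this average over many cards at once, which is part of why they're less noise-prone.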
If your best player likes Orzhov, and your worst one prefers playing other colours, then your data will always be skewed towards white and black.
This is 100% true. Steveman is the one who contributes winning decklists the most (partly because he’s a good player and partly because he hosts the most drafts), and he has a tendency to draft 3-4 color midrange piles. I, on the other hand, draft aggressive decks almost entirely.
This is a source of systematic bias that will not go away as we gather more decklists. One way I’m hoping to fix this (and get more data) is to have decklist submissions from other users in the community. It’s also fairly easy to keep these decklists separate during analysis, so I can return stats specific to anyone’s cube (as well as merge their decklists into the general pool). These lists don’t even have to be just 3-0 lists.
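A quick sketch of how submissions could stay separable (the field names here are assumptions for illustration, not the project's actual schema): each decklist record carries its submitter and record, so per-cube stats and a merged pool are both one filter away.

```python
# Each record carries who submitted it and how the deck finished, so
# per-submitter stats and the merged pool are both simple filters.
# Field names are hypothetical, not the project's real schema.
decklists = [
    {"submitter": "steveman", "record": "3-0", "cards": ["Wasteland"]},
    {"submitter": "parker",   "record": "2-1", "cards": ["Strip Mine"]},
]

def lists_for(submitter):
    # Stats specific to one person's cube.
    return [d for d in decklists if d["submitter"] == submitter]

def winning_lists(pool):
    # Restrict any pool (one cube or the merged set) to 3-0 decks.
    return [d for d in pool if d["record"] == "3-0"]

print(len(lists_for("steveman")), len(winning_lists(decklists)))
```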
If anyone is interested in contributing decklists to the project, let me know!
My Cubes - The Busted Cube. A fully functional, almost 100% custom cube. The project started out by asking "What if other colors got cards on the power level of Mana Drain, Ancestral Recall, and Time Walk?" Draft and enjoy!
I’m the person who’s currently running the analysis on these decklists. If you’re interested in contributing (from a coding/statistics perspective, or by contributing your own decklists), let me know! I'm a fiend for data, so if you like keeping track of how decks in your cube do, I'd love to talk. We’re hoping to turn this into a larger project (and even grow our dataset by using submissions from verified cubers) and do some cool analyses, like using the data to write a decent drafting AI.
LucidVisions makes some good points. There are important questions of sample sizes, biases, and statistical strength in any data collection effort. I want to address them now as we move forward with this project. At some point, I’ll do a write up of the setup of the project (including discussions of sample size and bias), how we’re analyzing the data, and what we hope to do moving forward.
While there is a correlation between a card's true power level and how many 3-0 decks it shows up in, the correlation is not 100%.
While it’s true that this correlation is not 100%, I’d argue it’s much higher than you think. The only reason to draft cards and put them in your decks is that they win games of Magic. Cards do this in different ways (Karn Liberated outright wins you the game, while Volcanic Island improves deck consistency), but the output is the same: a higher win rate. In fact, I would argue that the ONLY reliable metric for a card's power level is how it contributes to win rates. While assaying this contribution by looking at 3-0 decklists isn’t perfect, good cards lead to higher chances of 3-0 decklists. It would be nice to have all 8 decklists and their records from every draft, but this isn’t currently feasible.
The sample required to get a very accurate picture is obscenely high.
Undoubtedly true. If you gave this dataset to anyone who does data science or machine learning, you’d be laughed out of the room - the dimensionality of the data relative to our sample size is absurd.
But we’re not even looking for a very accurate picture - we’re looking for a general one. As of right now, there aren’t many conclusions that you can draw from this data that experienced cubers don’t already know (Fractured Identity is good, lands are important, blue decks are good), but having a falsifiable basis to make these claims is something the cube community has lacked for some time.
As someone who’s worked on low-level data projects before, I can say there’s a surprising number of conclusions you can draw from this dataset. I personally believe that Land Tax isn’t very good, but seeing it so high on the White list makes me question my beliefs. As long as the questions you ask aren’t very complex, this dataset might do better than you think.
You can see that by comparing Strip Mine and Wasteland: one card is strictly superior to the other, yet it shows up in 15% fewer 3-0 decks. City of Brass and Mana Confluence are functionally identical, yet are almost as far apart as Elspeth and Gideon.
This gets at an important point, and it’s one that I’m sure will come up often as we work on this project. One factor is intrinsic bias in drafting (and therefore in the dataset). One explanation for why Wasteland does better than Strip Mine is that drafters underrate Wasteland and overrate Strip Mine. Experienced drafters (those more likely to 3-0) end up with Wasteland a higher proportion of the time as a result.
One could also validly argue, as you do, that it’s simply due to noise. This is especially true for your comparison of Mana Confluence and City of Brass. These high-variance points are undoubtedly present in the data, and this will be true of almost any dataset. This leads nicely into the next two points.
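For a rough sense of how large these noise effects are, here's a standard two-proportion z-test (with invented counts, since I don't have the real ones at hand). It shows that a gap like the Strip Mine/Wasteland one is often well within random variation at small sample sizes:

```python
import math

# Two-sided z-test for a difference in two proportions (normal
# approximation). Counts below are invented for illustration only.
def two_proportion_p(k1, n1, k2, n2):
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Convert |z| to a two-sided p-value via the normal CDF (erf).
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# e.g. card A in 12 of 30 sampled 3-0 decks vs card B in 8 of 30:
# a 13-point gap, but nowhere near significant at this sample size.
print(f"p = {two_proportion_p(12, 30, 8, 30):.2f}")
```

Even a double-digit percentage gap between two cards can easily be noise until the deck counts get much larger than anything we have now.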
I’m just worried that people will misinterpret these results
We’re just presenting the data and noting things that are potentially interesting - how you interpret them is up to you. But I would argue that having some basis to argue for a card's inclusion is better than no basis at all. All too often the cube community relies on the testing results of high-profile members to see whether a card is good, but this doesn’t always work and isn’t verifiable. And many times, people don't cube enough to test things themselves. Some experienced cubers swear by Reveillark; others who are equally experienced hate it. This project hopes to begin providing some verifiable, statistical basis for evaluating a cube card (and the characteristics of winning decks in general).
I suggest being tepid in drawing any conclusions that Card A is better than Card B
Couldn’t agree more. I’m hoping to construct a statistical framework for this (if you’re familiar with statistics: defining null distributions, p-values, etc.). But even without doing this, I can tell you that the ability to directly compare the performance of two cards will almost always be well outside the reach of this dataset, and that’s important to acknowledge. I would caution anyone who is seeking to compare two cards (and that includes Steveman and his claims above).
In summary / TL;DR
This dataset, as it currently stands, is question-generating, not question-answering. When the data tells us something that doesn’t agree with our experience, it might be time to start questioning the reliability of those experiences.
How you choose to interpret it is up to you, but I see this as a developing tool for a community that historically has based card choices entirely on personal experience.