Friday, December 17, 2010

How many instances are needed when considering study results?

This post is the 2nd part of a series I started a few weeks ago that will discuss using quantifiable edges to your advantage.  Today I'll discuss a common question I get about the studies.  How many instances are needed for valid and usable results?  It will lead into "What makes a study compelling?" in the next post.

Many of the posts I put on the blog are what I refer to as studies. In this previous post I showed the layout of the studies. A study is simply test results of an idea. Most of the time the idea is based in technical analysis. It looks to answer the question, “How has the market performed in the past after…”

Some studies are fairly general. For instance, I might look at how the market performs after it has traded down 3 days in a row. Others are more specific, with added filters. Perhaps I notice that not only is the SPX down 3 days in a row, but it is also trading at a 10-day low, is above its 200-day moving average, and has seen volume increase on each of the last 3 days.

Both studies could tell me something about the market in relation to its current condition (assuming I’m describing current conditions, which is typically my approach). If I am able to describe conditions that more closely match the current market then I have a better shot at seeing behavior over the next several days match up with the study results. Of course there is a trade-off between general and specific, and that is the number of instances.

A general test may have hundreds or thousands of instances to draw on in order to generate expectations. A very specific test may have an extremely low number of instances. If the number of instances is too low then the results may have little or no meaning. For instance, if I run my parameters and find that the market had set up in a similar manner only 1 other time over my test period, is it reasonable to assume that the market will act the same way this time? Most people would correctly answer “no”. What if there were 2 instances and they both had similar reactions in the past? Could I assume this suggests a directional edge? 3 instances? 4? 10? 30? 50? More? How many instances is “enough” to have some level of confidence that your results are actually suggesting an edge and are not the result of luck?

Before answering, let me address 1 common misconception people have about statistical testing. That misconception is that you need 30 instances in order to demonstrate statistical significance. This idea originates in the rule of thumb that a sample size of about 30 is needed before a Z-score can be relied on (chi-square tests carry a similar small-sample caution). The reason is that Z-scores assume a normal probability distribution. With fewer than about 30 instances, the sample does not approximate the normal distribution closely enough to make certain statistical measures valid. One thing traders should be aware of is that stock market returns do not follow a normal distribution anyway. They have “fat tails”. In other words, there are more outliers present in stock market movements than one would expect under a normally distributed curve. So relying on standard statistical measures and assuming a normal distribution could expose a trader to more risk than his results would imply.

Still, these tests are helpful in determining whether your results were likely due to a real edge or whether there is a high risk that luck played a big part. But what if you don’t have 30 instances? In that case you could run a t-test and look up the result in a t-table.

To better understand statistical significance and see how to run some of these tests I’ll refer you to the below post from a couple of years back:

Note that this post also contains a t-table. One interesting thing we can see when looking at a t-table is the minimum number of instances you would need to have different confidence levels that your edge is actually an edge and not due to luck. For instance, if all instances were followed by a market rise, you would want at least 6 instances in order to be 95% confident that there was an actual edge. A 99.9% confidence would be reached if you had 11 instances that all resulted in a rise over the next X days.
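For readers who want to check those thresholds themselves, here is a minimal sketch. It does not reproduce the t-table; it swaps in a simpler two-sided sign test, which assumes that with no edge each instance has a 50% chance of being followed by a rise. The function names are hypothetical and chosen only for illustration.

```python
def two_sided_p_all_same(n: int) -> float:
    """Chance that n fair coin flips all land the same way (all up or all down)."""
    return min(1.0, 2 * 0.5 ** n)


def min_instances(confidence: float) -> int:
    """Smallest number of instances at which a perfect record clears `confidence`.

    Assumption: a 50/50 null hypothesis and a two-sided test.
    """
    alpha = 1.0 - confidence
    n = 1
    while two_sided_p_all_same(n) > alpha:
        n += 1
    return n


for level in (0.95, 0.99, 0.999):
    print(f"{level:.1%} confidence: {min_instances(level)} instances")
```

Under these assumptions the sketch arrives at 6 instances for 95% confidence and 11 instances for 99.9% confidence, in line with the figures above.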

So if you look back at the study I showed Wednesday, SPY only set up in that pattern 12 times in the past, but every time it was trading higher 5 days later. Statistically, this gives about 99.9% confidence that the positive results were due to more than luck, and that there has in fact been a real edge in that pattern in the past. Does this mean there is a 100% chance it will be higher 5 days after the setup? No! Not even close. A high degree of confidence means there is likely some kind of an edge. It doesn’t mean the past winning % or net expectations are likely to persist indefinitely.
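The 12-for-12 record can be sanity-checked with a short calculation. This is only a sketch under an assumption of mine: it uses an exact two-sided binomial (sign) test against a 50/50 null rather than a t-table, and it looks only at win/loss counts, not at the size of the moves. The function name is hypothetical.

```python
from math import comb


def sign_test_p_value(wins: int, instances: int) -> float:
    """Two-sided exact binomial p-value against a 50/50 null hypothesis."""
    if not 0 <= wins <= instances:
        raise ValueError("wins must be between 0 and instances")

    def prob(k: int) -> float:
        # Probability of exactly k wins in `instances` fair coin flips.
        return comb(instances, k) / 2 ** instances

    upper = sum(prob(k) for k in range(wins, instances + 1))  # this many wins or more
    lower = sum(prob(k) for k in range(0, wins + 1))          # this many wins or fewer
    return min(1.0, 2 * min(upper, lower))


print(sign_test_p_value(12, 12))  # perfect record, as in Wednesday's study
print(sign_test_p_value(10, 12))  # hypothetical record with 2 losers, for comparison
```

A smaller p-value means the record is harder to explain by luck alone. It says nothing about whether the edge will persist going forward.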

So how many instances do I require before I’m willing to accept a study as part of my analysis and place it on my active list? It varies depending on things like the strength of previous reactions and other stats I’ll get into in my next post, but I’ll generally use a t-table to help me decide. Will I incorporate a study with only 10 or 11 instances? Yes, but it will have to have strong win/loss stats and a high win %. Personally, I tend to favor studies that have somewhere between 20 and 70 instances. Too few and they are less reliable. Too many and the setup is often too broad to have much meaning.

I’ve spent far more space discussing this than I wanted, but it is an issue that has come up time and again with readers, so I wanted to be somewhat thorough.

In fact, of the list of things I look at in a study to help me decide whether it is compelling or not, the number of instances (assuming it isn’t minuscule) is near the bottom.

I intend to accelerate this series of posts over the next couple of weeks and I’m sorry it’s taken so long to get rolling. In the next post I will discuss a list of other things I examine when determining whether I find a study compelling.

1 comment:

leo00o83 said...

Hi Rob,

I found one thing confusing about your post when you said: "Too high and the setup is often too broad to have much meaning.", yet in your post from '08 "Significance" you mention that

"To be helpful, statisticians set up “confidence levels.” If the result could have occurred by chance once in twenty repetitions of the record, you can have 95% confidence that the result isn’t just luck. This level has been called “probably significant.”

If the result could be expected by chance once in a hundred repetitions, you can have 99% confidence; this level has been called “significant.”

If the expectation is once in a thousand repetitions, you can have 99.9% confidence that the result wasn’t a lucky record. This level has been called “highly significant.”"

So what range of numbers of repetitions is over the top?