Thanks. I decided to do a little objective testing. I can't think of an easy way to test for the correct volume, but there are some easy tests of how accurate the price data is. First, I counted how many times each data provider reported obviously incorrect prices, i.e., the Low is greater than the High, the Open is greater than the High, the Close is greater than the High, the Open is less than the Low, or the Close is less than the Low. Second, I counted how many times the data provider reported bars on dates that the markets were closed. Finally, I counted how many times the data provider was missing bars on dates that the markets were open.
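For anyone who wants to try the same three checks outside the platform, here is a minimal Python sketch. It is not the code from this post. The `Bar` type, the `audit` function, and the caller-supplied `trading_days` set are all hypothetical names, and the sample prices are made up purely for illustration.

```python
from datetime import date
from typing import NamedTuple


class Bar(NamedTuple):
    """One daily bar (hypothetical structure, not the platform's own type)."""
    day: date
    open: float
    high: float
    low: float
    close: float


def has_bad_prices(bar: Bar) -> bool:
    """True if the bar breaks the rule Low <= Open, Close <= High."""
    return (
        bar.low > bar.high
        or bar.open > bar.high
        or bar.close > bar.high
        or bar.open < bar.low
        or bar.close < bar.low
    )


def audit(bars: list[Bar], trading_days: set[date]) -> dict[str, list[date]]:
    """Run the three checks for one symbol.

    trading_days is assumed to hold every date the market was open
    in the period being tested; the caller has to supply it.
    """
    seen = {bar.day for bar in bars}
    return {
        "bad_prices": [bar.day for bar in bars if has_bad_prices(bar)],
        "closed_day_bars": sorted(seen - trading_days),
        "missing_bars": sorted(trading_days - seen),
    }


# Made-up sample: one bad bar, one bar on a holiday, one missing day.
trading_days = {date(2014, 12, 29), date(2014, 12, 30), date(2014, 12, 31)}
bars = [
    Bar(date(2014, 12, 29), 10.0, 11.0, 9.0, 10.5),  # consistent bar
    Bar(date(2014, 12, 30), 12.0, 11.0, 9.0, 10.0),  # Open above High
    Bar(date(2015, 1, 1), 10.0, 11.0, 9.0, 10.0),    # market holiday
]
report = audit(bars, trading_days)
for check, days in report.items():
    print(check, [d.isoformat() for d in days])
```

Running `audit` once per symbol and summing the list lengths gives per-provider totals of the kind discussed in this post.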
I moved my existing data and downloaded new daily data this morning for the stocks in the S&P 500 index as of January 1, 2015. (Some symbols are no longer trading, so data is not available for them from some providers.) The providers were Fidelity, Google, NASDAQ, QuoteMedia and Yahoo. (I couldn't get MSN to update the data.) Finally, I ran the code below, which is designed to report the three things described above, for the 10-year period 1/1/2005 to 12/31/2014. I ran it against the S&P 500 index on the assumption that these are the most heavily traded stocks and, therefore, would have the best data.
The results are attached. The most surprising thing to me was the number of errors in the NASDAQ data. I assumed it would be perfect, but it had 3 obviously incorrect prices and 4 missing bars. Google and QuoteMedia were by far the least accurate data providers, in terms of the number of bad symbols and the number of incorrect prices. (They were slightly better than Fidelity in terms of the number of missing bars.) The only thing QuoteMedia has going for it is the ability to download symbols that no longer trade. Yahoo had no incorrect prices or missing bars, but was the only data provider to have bars on days the markets were closed. Fidelity and NASDAQ were very similar. NASDAQ had one more incorrect price than Fidelity (3 instead of 2), but far fewer missing bars.
To put these results in perspective, I ran the strategy on approximately 2,500 bars of data for each symbol (250 bars per year x 10 years) and approximately 1,250,000 bars of data in total for each data provider (250 bars per year x 10 years x 500 symbols). Therefore, the numbers of incorrect prices for Fidelity (2), NASDAQ (3) and Yahoo (0) are not really significant.
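The same arithmetic can be written out as an error rate. This is only a sketch: the bar counts are the approximations used above (250 bars per year), and the function name `bad_prices_per_million` is mine.

```python
# Approximate bar counts used in this post.
BARS_PER_YEAR = 250
YEARS = 10
SYMBOLS = 500

bars_per_symbol = BARS_PER_YEAR * YEARS      # about 2,500 bars
total_bars = bars_per_symbol * SYMBOLS       # about 1,250,000 bars

# Incorrect-price counts reported above for the S&P 500 run.
bad_price_counts = {"Fidelity": 2, "NASDAQ": 3, "Yahoo": 0}


def bad_prices_per_million(count: int, total: int) -> float:
    """Incorrect prices per million bars tested."""
    return count / total * 1_000_000


for provider, count in bad_price_counts.items():
    rate = bad_prices_per_million(count, total_bars)
    print(f"{provider}: {rate:.1f} incorrect prices per million bars")
```

Even the worst of these three works out to only a few bad bars per million.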
I then repeated this exercise for the Russell 1000 as of June 1, 2015. There were several surprises. First, the number of Fidelity missing bars jumped from 135 to 2,614! Second, the number of errors in the Google and QuoteMedia data didn't change! Finally, the number of Yahoo bars on days the markets were closed dropped from 60 to 1! The reason is that for all data providers, the errors come from a limited number of symbols, rather than being evenly distributed among all the symbols.
Obviously, you get what you pay for, and none of the data is perfect without manually correcting it. I'm sticking with Yahoo, since it is easy to eliminate the bars on the dates the markets were closed. With that fix, on these three benchmarks it is fine.
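Dropping the closed-day bars can be as simple as a filter against a list of trading days. A rough sketch, assuming each bar is a tuple whose first element is its date and that you already have a set of dates the market was open (`drop_closed_day_bars` and `trading_days` are hypothetical names, and the sample values are made up):

```python
from datetime import date


def drop_closed_day_bars(bars, trading_days):
    """Keep only bars dated on a day the market was open.

    bars: iterable of tuples whose first element is a datetime.date.
    trading_days: set of dates the market was open (caller-supplied).
    """
    return [bar for bar in bars if bar[0] in trading_days]


# Made-up sample: (date, open, high, low, close)
trading_days = {date(2014, 12, 31), date(2015, 1, 2)}
bars = [
    (date(2014, 12, 31), 10.0, 11.0, 9.0, 10.5),
    (date(2015, 1, 1), 10.5, 10.5, 10.5, 10.5),  # holiday bar to drop
    (date(2015, 1, 2), 10.6, 11.2, 10.1, 11.0),
]
clean_bars = drop_closed_day_bars(bars, trading_days)
print(len(bars) - len(clean_bars), "bar(s) removed")
```

The filter is only as good as the trading-day list, so an incomplete holiday calendar will either leave bad bars in or throw good ones out.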
Needless to say, I encourage everyone to run this code on ALL DATA for ALL symbols, and either manually correct the data or understand where there may be problems when you backtest. If you have Yahoo data going back a long way, you will have to replace the markets.xml file in order to get very old market holidays.