MAS5.0 Discrepancies

It has been noted that the Bioconductor (and other) implementations of the MAS5.0 algorithm differ somewhat from the Affymetrix implementation. A numerical description of the Bioconductor discrepancies with MAS5.0 exists at http://bmbolstad.com/misc/MAS5diff/Mas5difference.html. The differences stem from the lack of open source code for the algorithm, many small implementation decisions, and a sharp discontinuity in one of its defining functions. Here we address some of the key areas in which the discrepancies arise, although a complete fix is too time-consuming to undertake.

Affymetrix provides specific details of its MAS5.0 algorithm in a whitepaper: the Statistical Algorithms Description Document (SADD).

Implementation Details

While a description of the algorithm exists within the Affymetrix whitepaper, there are many minor considerations that can significantly impact results. We provide one example that was uncovered because the Bioconductor implementation source code was available for comparison.

In the case of libaffy, we found that our results differed from those of Bioconductor by 20 or more intensity units simply because background correction was computed with 0-based indexing (e.g. C indexes start at 0) in one case and 1-based indexing in the other. This discrepancy was minor overall, and in fact the Affymetrix CEL locations are based on 0 indexing, so it is natural to assume that index strategy. The Bioconductor implementation uses a 0-based index scheme to determine the location of centroids for individual zones (refer to the SADD for the definition of these terms). Using floating point variables, the affy implementors added 0.5 to the center coordinates to round up. However, we investigated the possibility of simply calculating center coordinates in 1-based indexing. In other words, the location (0,0) would be (1,1) in 1-based indexing, or more appropriately (50,50) would become (51,51).

As the libaffy correctness results point out, this simple change to the code provides a small overall effect towards closer agreement with the original Affymetrix code.
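
To make the effect of the indexing choice concrete, here is a minimal sketch, not the libaffy or Bioconductor code. It assumes the distance-weighted zone smoothing described in the SADD, with weights of the form 1/(d^2 + smooth) and a smoothing constant of 100. The chip size, the zone backgrounds, and the function names zone_centers and smoothed_background are all hypothetical and chosen only for illustration.

```python
SMOOTH = 100.0  # smoothing constant in the distance weights (assumed from the SADD)

def zone_centers(n_cells, n_zones, base):
    """Centers of equal-width zones along one axis; base is 0 or 1."""
    width = n_cells / n_zones
    return [base + (k + 0.5) * width for k in range(n_zones)]

def smoothed_background(x, y, centers, zone_bg):
    """Distance-weighted average of the per-zone backgrounds at cell (x, y)."""
    num = den = 0.0
    for i, cx in enumerate(centers):
        for j, cy in enumerate(centers):
            w = 1.0 / ((x - cx) ** 2 + (y - cy) ** 2 + SMOOTH)
            num += w * zone_bg[i][j]
            den += w
    return num / den

# Hypothetical 400x400 chip with 4x4 zones and made-up zone backgrounds.
zone_bg = [[100.0 + 20.0 * i + 5.0 * j for j in range(4)] for i in range(4)]
centers0 = zone_centers(400, 4, base=0)  # first centroid at (50, 50)
centers1 = zone_centers(400, 4, base=1)  # first centroid at (51, 51)

# Same cell, same zone backgrounds; only the centroid convention differs.
bg0 = smoothed_background(10, 10, centers0, zone_bg)
bg1 = smoothed_background(10, 10, centers1, zone_bg)
print(centers0[0], centers1[0])
print(bg0, bg1, bg1 - bg0)
```

The two background estimates for the same cell differ slightly because every distance to a centroid changes when the centroids shift by one. On real data the size of the shift depends on the actual zone backgrounds.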

Algorithm Discussion

The key shortcoming of the Affymetrix MAS5.0 algorithm is worth pointing out explicitly. After a great deal of work comparing results across systems, it is clear that without the source code for the original implementation the results can never be completely replicated. This outcome would be acceptable but for one key limitation: different numerical methods are used based on hard thresholds. It seems that most of the disagreement between implementations arises from the definition of an "Ideal Mismatch" or IM.

When the MM of a probe pair is greater than the PM, an estimate based on a proportion of the PM is taken as the IM. This is in contrast to using the MM value itself. Consider the following scenario: a probeset consists of probe pairs with very large differences between PM and IM (Tukey's biweight average FC of ~4-fold).

Example (PM,MM) pairs: (40,10), (400,100), (400,399.998)

One probe pair, however, does not reliably detect the gene in question (this happens reasonably often). Depending on the implementation details (as described above), either PM-IM is very small (0.002, since PM-MM is small but positive) or PM-IM is very large (~300, since IM is then calculated as PM/2^SB, or roughly PM/4 in this case). A function this discontinuous under a change of 0.002 in its input is problematic, particularly in the context of numerical computation.
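
The jump can be reproduced with a short sketch. It assumes the IM rule and the one-step Tukey biweight as we read them from the SADD, with the constants contrast tau = 0.03, scale tau = 10, c = 5 and epsilon = 0.0001 taken as assumed defaults. The function names are hypothetical, and this is an illustration rather than the Affymetrix, Bioconductor, or libaffy code.

```python
import math
from statistics import median

TAU = 0.03        # contrast tau (assumed SADD default)
SCALE_TAU = 10.0  # scale tau (assumed SADD default)

def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight: a robust, outlier-resistant average."""
    m = median(values)
    s = median([abs(v - m) for v in values])  # median absolute deviation
    num = den = 0.0
    for v in values:
        u = (v - m) / (c * s + eps)
        w = (1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0
        num += w * v
        den += w
    return num / den

def ideal_mismatch(pm, mm, sb):
    """IM for one probe pair, given the probeset's specific background SB."""
    if mm < pm:
        return mm                    # MM is usable as-is
    if sb > TAU:
        return pm / 2.0 ** sb        # MM >= PM: fall back to a fraction of PM
    return pm / 2.0 ** (TAU / (1.0 + (TAU - sb) / SCALE_TAU))

def pm_minus_im(pairs):
    """PM - IM for every (PM, MM) pair of one probeset."""
    sb = tukey_biweight([math.log2(pm / mm) for pm, mm in pairs])
    return [pm - ideal_mismatch(pm, mm, sb) for pm, mm in pairs]

# The third MM sits just below PM in one case and just above it in the other.
below = pm_minus_im([(40, 10), (400, 100), (400, 399.998)])
above = pm_minus_im([(40, 10), (400, 100), (400, 400.002)])
print(below[2], above[2])
```

With MM just below PM the third pair contributes a difference of about 0.002. With MM just above PM the same pair contributes about 300, because SB evaluates to 2 for this probeset and IM becomes PM/4.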

Detailed Implementation Decisions

SB: Although not explicitly described, it appears that the computation of the Specific Background (SB) occurs from all probes in a probeset, whether or not they have been masked in a particular CEL file.

Probes for Background Correction: Only valid probe locations on the chip are used for background correction. Other probe locations have intensity values, but are excluded from the calculation of the lower 2% of intensities. This is a bit unusual if the point is to determine background from the entire region. However, since we do not know what is at the locations in question, it is possible that those areas tend to be higher or lower than random.
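
As an illustration of why this choice matters, the sketch below averages the lowest 2% of intensities in a single zone, with and without the non-probe locations. The function name zone_background, the data, and the way the 2% count is truncated are assumptions made for the example. The original implementation may round differently.

```python
def zone_background(intensities, is_valid, fraction=0.02):
    """Mean of the lowest `fraction` of intensities at valid probe locations."""
    valid = sorted(v for v, ok in zip(intensities, is_valid) if ok)
    if not valid:
        raise ValueError("zone has no valid probe locations")
    n_low = max(1, int(len(valid) * fraction))  # truncation is an assumption
    return sum(valid[:n_low]) / n_low

# Hypothetical zone: 100 valid probes (101..200) plus 5 dim non-probe cells.
intensities = [float(v) for v in range(101, 201)] + [20.0] * 5
is_valid = [True] * 100 + [False] * 5

masked = zone_background(intensities, is_valid)        # valid probes only
unmasked = zone_background(intensities, [True] * 105)  # every location
print(masked, unmasked)
```

In this made-up zone, a handful of dim non-probe cells is enough to dominate the lowest 2%, so including them would change the background estimate substantially.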

Calculating Zones: As described earlier, centroids appear to use a 1-based index.

Conclusion

MAS5.0 is a technical artifact. Certainly more sensitive expression computations have since been developed, including by Affymetrix. Therefore, the minor differences that currently exist between implementations are probably not important, particularly when it comes to identifying patterns within gene expression.

However, the mantra of science is reproducibility. The R project (and the Bioconductor project) have been excellent examples of this philosophy in providing open software implementations that can be examined by all. We feel that the best algorithmic description of numerical methods is done via code, a technique that is common within statistics and computer science.