
doublecheck

Preliminary double-check done already

Computing resources available in clouds vary widely in performance, price and lifetime. To save cost, I had to continuously seek out inexpensive resources and create instances on a daily basis. The total number of instances ever created exceeded five hundred. I conducted the following checks as a minimal verification of the resources:

  • Every time a new instance was created, a benchmark program that counts a small subset was executed, and both the elapsed time and the result were recorded. If the answer was wrong, that resource was never used. I actually encountered a few such resources. Notably, some versions of the Docker environment with multiple RTX-4090s produced wrong answers due to a failure of inter-GPU atomic transactions. I therefore avoided having multiple 4090s work together and used them separately instead.
  • As a postmortem verification, an independent re-count was performed for every instance at a sampling rate of once per day. If any wrong result was found, all results produced by that resource were considered unreliable.
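
The two checks above can be sketched roughly as follows. This is only an illustration of the policy, not the code actually used; the names `Instance`, `admit` and `spot_check` and the data layout are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """Hypothetical record for one cloud instance."""
    name: str
    accepted: bool = False   # passed the benchmark at creation time
    trusted: bool = True     # no wrong result found by sampling so far
    results: dict = field(default_factory=dict)  # work-unit id -> count


def admit(instance, benchmark_count, reference_count):
    """Accept a new instance only if its benchmark answer matches the known one."""
    instance.accepted = (benchmark_count == reference_count)
    return instance.accepted


def spot_check(instance, unit_id, recount):
    """Compare one sampled result against an independent re-count.

    A single mismatch marks everything the instance produced as unreliable.
    """
    if instance.results[unit_id] != recount:
        instance.trusted = False
    return instance.trusted
```

Under this sketch, results from an instance would be kept only while both `accepted` and `trusted` remain true.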

The thorough double-check in progress

If every subtotal is calculated twice and the two results match, the counts should be considered correct (provided the code is correct). The re-counting is in progress and is 70% completed as of 2024/05/12.
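
As a rough sketch of this matching step, assuming each run is stored as a mapping from a subtotal identifier to its count (the function name `compare_runs` and the data layout are hypothetical, not taken from the actual code):

```python
def compare_runs(first, second):
    """Split subtotals into those confirmed by a matching second run
    and those still pending (not yet re-counted, or mismatching)."""
    confirmed = {}
    pending = []
    for unit_id, count in first.items():
        if unit_id in second and second[unit_id] == count:
            confirmed[unit_id] = count
        else:
            pending.append(unit_id)
    return confirmed, sorted(pending)


def progress(first, second):
    """Fraction of first-run subtotals confirmed by the second run."""
    confirmed, _ = compare_runs(first, second)
    return len(confirmed) / len(first) if first else 0.0
```

A mismatch alone does not say which of the two runs is wrong, so in this sketch a mismatching subtotal simply stays pending until it is counted again.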

Errors Found

updated on 2024.02.17

Another erroneous instance was found. It ran with an RTX-4090 for about one month and produced about 19,000 sub-subtotals, of which 6 were incorrect. All of the incorrect results were produced in the last hour of the instance's lifetime. After this erroneous behavior, the GPU of the instance became unusable, reporting the error message “invalid memory access”.

As a result of the correction, the number increased by 7 959 250 368 (331 635 432 × 24).

updated on 2023.09.07

During the thorough double-check, it was discovered that a portion of the results generated by an instance was incorrect. The instance ran with two RTX-4090s for 60 hours and generated 3,771 sub-subtotals. Out of those sub-subtotals only 12 were incorrect and all incorrect results were generated by only one of the two RTX-4090s. It is unlikely that these errors are due to logical flaws or coding mistakes. Hardware defects or instability are the most probable causes.

As a result of the correction, the number increased by 960 (40 × 24).

While these errors have not damaged my confidence in the logic and the code used in the calculation, errors of a similar nature may still remain in the result. Therefore, the results should be considered unconfirmed until the thorough double-check is completed.

doublecheck.txt · Last modified: 2024/05/13 18:10 by mino
