Hardlimit test bench

whoololon

Ah, so.

Thanks for the clarification.

For the next version, in the final message when collecting data, fix "Retriving data".

cobito

I have a question about some results. Let's see if anyone knows why this could happen.

The other day, @krampak uploaded some results for the Ryzen 7 2700X. I was particularly intrigued by one of them, run with 32 threads. This model has 8 cores with SMT. Compared with a 16-thread validation uploaded by @krampak on the same day, presumably from the same machine, the results are curious.

Leaving aside the memory tests, in the single-thread tests the 32-thread validation scores 1.2% higher than the 16-thread one. From that, we know that the conditions were the same (same background load), and we also know that both tests were run at stock frequency.

In contrast, in the multi-thread tests, the 32-thread validation scores 5.3% higher than the 16-thread one. The way the benchmark works in multi-thread mode is very simple: it launches one process per thread, each process runs its operations and calculates its score, and finally the scores of all the processes are added up. That is, if it got more points, it is because it was able to do more operations in the same time.
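To make the scoring described above concrete, here is a minimal sketch in Python. It is not the benchmark's actual code: the `WorkerResult` structure, the operations-per-second score and all the numbers are assumptions chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class WorkerResult:
    """Outcome of one benchmark process (hypothetical structure)."""
    operations: int  # operations completed during the timed window
    seconds: float   # length of the timed window


def worker_score(result: WorkerResult) -> float:
    # Assumption: a score is simply operations per second.
    return result.operations / result.seconds


def multithread_score(results: list[WorkerResult]) -> float:
    # The total is the plain sum of the per-process scores.
    return sum(worker_score(r) for r in results)


# Hypothetical numbers: two workers, same 2-second window.
baseline = [WorkerResult(1000, 2.0), WorkerResult(1000, 2.0)]
faster = [WorkerResult(1100, 2.0), WorkerResult(1100, 2.0)]
print(multithread_score(baseline))  # 1000.0
print(multithread_score(faster))    # 1100.0
```

Under this model, a higher total can only come from more operations completed in the same window, which is why the 5.3% gap is surprising.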

I can only think of two possibilities:

· That when running 32 threads, the benchmark is monopolizing a greater share of the CPU, taking processing time away from background programs.
· That it is making better use of the processor's pipelining.

If I remember correctly, @kynes also noticed this behavior a while ago. A third possibility is a bug in the program, but after reviewing the code I can't see how it could be happening. Besides, the result of the integer test is practically the same in both cases; with a bug in the program, the behavior would have had to show up in all the tests.

Anyway, I leave it there in case you feel like pondering for a while.

whoololon

What I'm saying is that, with all the fuss about UserBenchmark, you need a serious site to compare processors... that's it.

cobito

As I mentioned in the other thread, the program is signed with a CA approved by Microsoft. So from now on, this becomes something serious.

The Windows SmartScreen warning still appears. In theory, the program (and especially the certificate) has to gain reputation. There are three ways to gain reputation:
· Leave the executable in a public place. Over time, it will gain points. This is done.
· Download it from different sites. The more it is downloaded and executed, the more points it will gain. I read that it is better if this is done from Internet Explorer or Edge, but in reality I think it makes little difference. The point is simply to download and run it (no need to run the test or validate). This is where you can lend a hand.
· Leave the executable in any folder (like the downloads folder) so that the Windows telemetry can see it.

In this way, in a matter of days, the warning message will disappear.

@whoololon said in Hardlimit test bench:

What I'm saying is that, with the fuss that's been made about the UserBenchmark, a serious site is needed to compare processors... that's all I'm saying.

Well yes, it may be a good time to get things moving. I have some ideas...
By the way, the typo in the text that you pointed out has already been corrected.

Regarding the central, there have been a couple of minor updates. The most important ones (the details are in the first post of the thread):
· It now shows the OC ranking on each model's page.
· It now shows the Spanish version if a browser set to Catalan, Galician, Basque, Asturian or Occitan is detected.

whoololon

I've been taking a look, and I'm not clear whether the results shown, both in the processor description and in the ranking table, are the best scores (processor with OC up to the hilt in a configuration specific for tests, like Xevipiu), or the average of all validated results for that processor, or only those run at stock frequency...

Thanks in advance.

cobito

@whoololon Both in the processor description (cpu.php) and in the different processor and architecture rankings, only results without OC (stock frequency) are taken into account for the average. The overclocked results appear in a separate table within each model's page (if there are overclocked results).

In the results of a validation (result.php), the data corresponds to the validation in question, without taking into account other validations. The user ranking table that appears both on the home page and in the validation result takes into account individual validations, including overclocked processors and without making averages.

In summary: the model pages and rankings calculate an average of the validations at stock frequency. If there are overclocked results for a model, they are shown separately on its page. The user rankings show individual results (without averaging), including overclocked validations.

I don't know if that was what you were asking.

whoololon

Yes, that was it; thanks for the clarification.

kynes

@cobito There is something that has me puzzled. My laptop's processor has 4 cores and 4 threads, but it gains about 25,000 points in multithreading if I set 8 threads instead of 4. I understand that this must be because it monopolizes more processor time that way, but then the results would not be fully consistent unless the maximum possible number of threads is used. Would there be any way to test with more than 8 threads, to see what result it gives? If it is marginally higher or similar, it is a matter of processor usage. If it is much higher, it must be a bug in the benchmark.

cobito

@kynes First of all, there is a discrepancy between the program's results and the central's that has not been corrected yet, because I am still thinking about how to give the most reliable result possible. The program uses the old method, which takes the maximum result of each test. The central averages the 10 samples per test. As a result, the program shows larger variations between runs, while in the central those differences (which can be caused by background processes) are filtered out and are less noticeable.
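The difference between the two criteria can be sketched as follows. The function names and the ten sample values are hypothetical, chosen so that the first sample carries a peak; they are not taken from any real validation.

```python
from statistics import mean


def program_score(samples: list[float]) -> float:
    # Old method used by the program: keep the maximum sample of the test.
    return max(samples)


def central_score(samples: list[float]) -> float:
    # Method used by the central: average the 10 samples of the test.
    return mean(samples)


# Hypothetical samples for one test, with a peak in the first one.
samples = [1200.0, 1000.0, 1010.0, 990.0, 1000.0,
           1005.0, 995.0, 1000.0, 1000.0, 1000.0]

print(program_score(samples))  # 1200.0
print(central_score(samples))  # 1020.0
```

A single high sample fully determines the maximum, while it shifts the average by only a tenth of its excess.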

Having said that, of the 4 results you have sent (2 with 4 threads and 2 with 8 threads), I have chosen the extremes to have the worst possible case: the one that gave the lowest score in 4 threads and the one that gave the highest score in 8 threads. The difference in the total multithread score is 4.7%. For reference, the differences between the two results at 4 threads and at 8 threads are 1.3% and 1.2% respectively.

Personally, a 4.7% difference between the extremes, against a 1.2% difference between validations with the same number of threads, seems normal to me given how Windows 10 has a hundred things running in the background.

But something that could be a fault in the program (and one that would be as difficult to diagnose as to correct, if it really were a fault) is that in the tests with 8 threads there is a peak in the scores of the first sample. That is surely why you measured such a large difference in the program's results, where those peaks were taken as the score, whereas in the central's results the difference is much smaller because the average was calculated.

I will see if I have a moment to prepare the version without a thread limit that you mention, so that you can test it.

cobito

@kynes Here is a modified version without a thread limit. In general, it seems that the multi-threaded result is proportional to the number of cores regardless of the excess threads, although there is a slight improvement when the number of threads is higher than the processor's. But there is one machine where thread synchronization failed without being detected, generating a meaningless result. That PC has 4 cores with HT, and from 32 threads onward it seems to fail.

This version only works in FPU and AVX 2 mode and the results are not valid.

kynes

I understand that if you took the average of the threads, the result would be coherent, but there is one that is going off the rails. I'm going to try with 128 to see what happens.

kynes

With 128 threads I think I hold the world record in multithreading:

Well, and if I don't have it, I'll try it with 256 threads to see what happens

cobito

@kynes There is a clear difference here. I am also seeing it in my case. I suppose that in the end I will have to apply a kind of truncated mean: something like discarding the two highest values and the two lowest, and averaging the remaining 6. It is clear that the outliers at the beginning distort the measurement.
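A minimal sketch of that truncated mean, assuming 10 samples per test and a trim of two values at each end. The function name and the sample values are hypothetical.

```python
def truncated_mean(samples: list[float], trim: int = 2) -> float:
    """Average the samples after dropping the `trim` lowest and highest."""
    if len(samples) <= 2 * trim:
        raise ValueError("not enough samples to trim")
    ordered = sorted(samples)
    kept = ordered[trim:len(ordered) - trim]
    return sum(kept) / len(kept)


# Hypothetical samples for one test, with a peak in the first one.
samples = [1200.0, 1000.0, 1010.0, 990.0, 1000.0,
           1005.0, 995.0, 1000.0, 1000.0, 1000.0]

# Drops 990 and 995 at the low end, 1010 and 1200 at the high end,
# and averages the remaining 6 values.
print(truncated_mean(samples))
```

With these values the plain average is 1020, while the truncated mean stays close to 1000, because the peak is discarded instead of being averaged in.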

I am also going to review the synchronization mechanism, to see if the fault was there.

By the way, be careful with the 256 threads: if the system crashes and the processes lose communication with each other, they can remain waiting forever while consuming 100% of all the cores, and you will need to either close each process manually or restart the PC.

cobito

@kynes said in Hardlimit Test Bench:

With 128 threads I think I have the world record in multithreading:

Well, and if I don't have it, I'll test it on 256 threads to see what happens

There the synchronization has failed. Basically, you are running a handful of threads at different times and the scores are being added up as if they had all run at once.
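Why that inflates the total can be shown with an idealized model. Everything here is an assumption made for illustration: every hardware thread delivers the same rate, workers running at the same time share the hardware evenly, and the batch size and numbers are invented. It is not how the program is implemented.

```python
def synchronized_total(workers: int, hw_threads: int, rate: float) -> float:
    # All workers run at once, so each gets only a share of a hardware thread.
    share = min(1.0, hw_threads / workers)
    return workers * rate * share


def desynchronized_total(workers: int, hw_threads: int, rate: float,
                         batch: int) -> float:
    # Workers run in separate batches; each one only competes with its batch,
    # yet all the scores are summed as if they had run together.
    share = min(1.0, hw_threads / batch)
    return workers * rate * share


# Hypothetical machine: 8 hardware threads, 100 points per thread.
print(synchronized_total(128, 8, 100.0))       # 800.0
print(desynchronized_total(128, 8, 100.0, 8))  # 12800.0
```

In this model the synchronized total is capped by the hardware, while the desynchronized total grows with the number of workers launched.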

cobito

Some conclusions I draw from this:

· When the test bench runs with a thread count 4 times the processor's, the program fails and gives absurd results. This does not worry me because the count will be limited to double the processor's threads.
· When it runs with a thread count higher than the processor's (but below an absurd amount), spurious peaks usually appear in the first sample. Here are results with double the threads on three different models:

Core i7-6820HQ

Pentium N3540

Core i5-7300HQ

That this happens on three different models makes it clear that it is a general behavior. Curiously, these peaks are seen in tests 1, 2 and 4, but not in test 3. That makes it hard to attribute to a fault in the program, but it cannot be ruled out yet.

Below are the same models with a number of threads equal to the CPU's:

Core i7-6820HQ
Pentium N3540
Core i5-7300HQ

If there are peaks, they are practically imperceptible.

In the i5 and i7, a sustained improvement is seen throughout the test. Here there are only two options: either a larger share of the CPU is being monopolized or pipelining is being better utilized. The Core i7-6820HQ is a PC with many programs running in the background along with an antivirus. The Pentium N3540 runs Windows 10 with nothing else in the background and no antivirus. Perhaps for this reason, doubling the number of threads does not give it a sustained improvement.

In general

What distorts the result shown in the program with double the threads is the initial peak. The problem is that I don't know why it occurs. If it were the program, I would expect a peak in all tests or a peak at the beginning and another at the end, but it doesn't happen that way. Could it be a trick of the cache? Honestly, I have no idea. But it is strange and the program will need to be reviewed.

krampak

And why don't we remove the option to specify the number of threads and force it to always run with the number of cores available on the CPU? I say this to avoid unnecessary differences between users.

cobito

@krampak Initially, the reason for having the freedom to specify the number of threads was in case HT/SMT detection failed. So far, that hasn't happened once. So there's no reason to keep it.

From the point of view of measuring processor performance, choosing the number of threads is useful for seeing how a processor performs at half load, which is actually how processors are used most of the time. But if it is already difficult to receive validations with the default configuration, it is much more so with slightly exotic configurations.

Here a solution, as you say, is either to remove the option to choose the number of threads or perhaps to cap it at the processor's thread count, to leave that possibility open.

Another possibility (which is complementary) is to do a truncated mean, something that will eventually be applied because this would correct all the results that have been sent so far.

And another possibility is to directly ignore the first sample; a simple but not elegant solution.
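As a rough sketch of that simpler option, with a hypothetical function name and invented sample values:

```python
def drop_first_mean(samples: list[float]) -> float:
    """Average all samples except the first one."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    rest = samples[1:]
    return sum(rest) / len(rest)


# Hypothetical runs of 10 samples: a peak at the start versus a peak mid-run.
early_peak = [1200.0] + [1000.0] * 9
late_peak = [1000.0] * 4 + [1200.0] + [1000.0] * 5

print(drop_first_mean(early_peak))  # 1000.0
print(drop_first_mean(late_peak))   # the mid-run peak is still averaged in
```

This only helps when the outlier is the first sample; a peak anywhere else in the run still raises the result.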

The method for calculating the final score is something that I have been thinking about for a long time (hence the program and the central use different criteria). The truncated mean is winning because it avoids this type of problem of unknown origin and because it filters the result on PCs with moderate background load.

The test bench has several mechanisms to prevent tampering with results, and at the beginning of development that was the part that took the most hours. Fortunately, this fault has a retroactive solution. One thing is clear: one of the objectives is to measure the real performance of the machine, without a particular program configuration yielding a performance advantage that does not exist.

I say the latter to make it clear that this is not a trivial fault and that it will be solved. Until then, I remain open to suggestions.

whoololon

Let me see if I understand it correctly: the program executed by default shows reliable results, but when adjusting the parameter of how many threads we want it to use, it is prone to showing "unusual" results...

If that is the case, personally I would only allow results obtained with the default settings to be validated, at least until the issue is resolved... the last thing we need is an army of show-offs cheating at solitaire and, along the way, falsifying the ranking (this is the most serious thing for me; after all, I think the intention is for the table to be reliable).

The option to choose the number of threads can be maintained, warning that it may give erroneous results and that they are not "official", for those who like tinkering.

cobito

Most likely by tomorrow or the day after (from there, depending on how long it takes Microsoft to certify it), version 1.4 will be ready, which will come with the issue of falsified scores fixed, among other changes.

If any of you have a current and powerful processor, I could use a screenshot of the CPU tab to put on the Store page, since I only have a few older PCs around here.

whoololon

This is the most powerful thing I have in my house, I don't know if it's what you're looking for.
