Your code is faulty: in the DO loop, when i = 1 the statement in the loop references qout(0), which is a bounds violation. You can fix this bug easily.

Please define what you mean by “best way” and what you mean by “mimic Matlab”.

Your code is faulty: in the DO loop, when i = 1 the statement in the loop references qout(0), which is a bounds violation. You can fix this bug easily.

Please define what you mean by “best way” and what you mean by “mimic Matlab”.

1 Like

Thank you. I updated my description.

Thank you for pointing out the potential issue, my code is more for illustration, I am sorry it is a little sloppy. I would require A and B in the input start from index 1.

I would actually wish A can be any 1D array starting from any index, does not have to start from 1. In the i loop, i should start from the lbound(A), end at lbound(A) + size(A) -1. But that is not the most relevant for now.

I wish to find the ‘fastest’ way to do cumsum.

```
cumsum = [(sum(a(1:i)), i = 1, size(a))]
```

The compiler should be allowed to vectorize, parallelize, do whatever optimizations it likes with that.

1 Like

Correction inside loop:

```
do i = 2, size(A)
B(i) = A(i-1) + A(i)
enddo
```

Also, the most elegant way is how @everythingfunctional suggested, but with `cumsum`

corrected as an array not scalar.

```
pure function cumsum(a) result (r)
real(dp), intent(in) :: a(:)
real(dp) :: r(size(a))
integer :: i
r(:) = [(sum(a(1:i)),i=1,size(a))]
end function cumsum
```

then you simply call it as you initially intended: `b=cumsum(a)`

here is a working example link

1 Like

Thank you so much @stavros Yes, this is truly an elegant one line solution.

Indeed, let me say that I do not mean to offend anyone, nothing personal.

Just from business point of view, this elegant code may not be efficient enough. The cost is O( (size(a))^2 ).

The code I show is O( size(a) ) which is a factor of size(a) faster (however I wish to further improve it and make it faster.)

```
subroutine cumsum(a,b)
real(kind=r8) :: a(:),b(size(a))
integer(kind=i8) :: i
b(1)=a(1)
do i = 2, size(a)
b(i) = b(i-1) + a(i)
enddo
return
end subroutine cumsum
```

You could check the speed difference with a array whose size(a) = 1000000.

Note that I did not set the size = 1000000 just to show that the elegant code is slow. I set it as 1000000 is because I need this number of particles in my particle filter (sequential Monte Carlo).

I can use

a + random array

to propose the move of all the particles randomly and fully using vectorization. This may be faster than do loop which loop over 1000000 particles.

Slightly more detailed, in the below code, np = 1000000,

```
noise = matmul(reshape(gaussian(gbl%nx*np),[np,gbl%nx]),sq)
x0tmp = spread(gbl%x0,1,np)
call mf(x0tmp,gbl%kel)
x = x0tmp + noise ! this is the a + random array
subroutine mf(x,kel)
real(kind=r8) :: x(:,:),kel
x(:,1) = exp(-0.2_r8*kel)*x(:,1)
return
end subroutine mf
subroutine sysresample(q,i)
use random
integer(kind=i8) :: m,j,k,l
real(kind=r8) :: q(:),qc(size(q)),rn(1),u(size(q))
integer(kind=i8) :: i(size(q))
call cumsum(q,qc)
!qc=cumsum(q)
m = size(q)
rn=randn(1)
u = ( [(l, l = 0, m-1)] + rn(1))/m
i=0; k=1
do j=1,m
do while (qc(k)<u(j))
k = k + 1
enddo
i(j) = k
enddo
return
end subroutine sysresample
subroutine cumsum(a,b)
real(kind=r8) :: a(:),b(size(a))
integer(kind=i8) :: i
b(1)=a(1)
do i = 2, size(a)
b(i) = b(i-1) + a(i)
enddo
return
end subroutine cumsum
```

Note that subroutine sysresample may not be the most efficient. It was a translation from Schon’s matlab code,

http://user.it.uu.se/~thosc112/research/rao-blackwellized-particle.html

For those who are interested, can check a paper

See these parts in the paper

Oh, no offense at all!!! I am actually agreeing with you. My comment was only about *elegant* In my actual code, which by the way is in a similar field to yours (Monte Carlo and particles) I use nothing functional within the main loops, perhaps only during the preprocessor part where speed doesn’t really matter, and even then not for large arrays.

Of course, there is nothing inherently wrong with the functional style, the problem is on the compiler’s side. Such expressions should be automatically vectorized/unrolled. What really happens is that every time such expression is called a temporary array is created. The difference in speed is in orders of magnitudes.

By the way, to further optimize use only subroutines, for *some reason code generated from subroutines is better optimized resulting in minor speed improvement, it is noticeable in large sizes.

*Extra code is generated to reallocate the allocatable array.

2 Likes

3 Likes

CRquantum, I am afraid that this thread illustrates why we should heed Knuth’s warning, “Premature optimization is the root of all evil”.

You focused on a small part (i.e., the formation of **cumsum**) of the overall algorithm, and until you posted #7, we did not know what you planned to do with the computed cumsum array. But now, it turns out that even if you could compute that array in *zero time,* your overall calculation is going to suffer from the slowness of **linear search**. The cumsum array is already in increasing order (strictly so, if none of the elements of q are zero), so you could do a **binary search** for u(j) in the array qc. Further study of the values that you may expect for the arrays u and qc could enable the binary search to be conducted with a segment of the array qc rather than with the entire array, …

4 Likes

@mecej4 Thanks. I got you. The piece I show is a small piece in my code.

True, I totally agree, one should not spend 90% of the time optimizing a part of the code which at most give like 1% speedup.

Uhm, but if there is a chance to speedup each piece and learn something from you guys, why not?

The thing I mainly focused, actually is something else. But that may be a little off topic now (I may put these issues into smaller specific question and ask you guys one by one, as @VladimirF always reminds me).

If I may,

The true issue, is that perhaps you may ask, why do you need 10^6 particles to represent a distribution?

In the code, bringing down the number of particle from 10^6 to 10^3 or so, is where true speedup occur. Because in my code that reduces the number of stochastic differential equations (SDE) to be solved after all.

The subroutine sysresample is subroutine for resampling.

```
sysresample(q,i)
```

q is an array size 10^6, each of its value means the normalized weight of a particle. I need to resample the new 10^6 particles according to weights. The i is the index to help me choose particles to form the new resampled array x from the old x. like below,

```
q = q/sum(q)
call sysresample(q,idx)
x = x(idx,:) ! the x on the LHS is the resampled x, using the index from the x on the RHS.
```

I believe the sysresample subroutine I took from Thomas Schon, Software and data sets | Thomas Schön

is at most N*log(N) algorithm (which is not too bad) where N mean the number of particles which is the size of the array.

The Carpenter etal paper I showed in thread 7 is a perhaps O(N) algorithm.

Resampling, or sometimes people call bootstrap sampling or importance-sampling-resampling (SIR). Eg, see a paper,

The binary search you mentioned is like perhaps bisection method or something like that, is to find some particular values in an array. I could be wrong, but perhaps it is not exactly what I needed here now for resampling. But I do thank you very much indeed for your thoughts which is indeed very helpful, not only for me, but also for everyone here.

I think you missed the point here. The semantics of the one-line expression using *implied do* and the *sum* intrinsic contain sufficient information, such that an optimizing compiler could recognize this expression as the *cumulative sum* operation, and choose the fastest set of machine instructions for the target architecture.

At least for the hardware (x86-64) and compilers (`gfortran`

and `ifort`

) available to me, this does not appear to be the case today. A blog post from Daniel Lemire on “Is C++ worth it” provides an anecdote about cumulative sum performance in C++ and Java with a somewhat-related takeaway message.

Now to receive the answer you are looking for, you probably need to ask the more accurate question

How to obtain good performance for the cumulative sum for double precision reals on hardware A with compiler B?

@ivanpribec One line method is elegant that is obvious. I learned from this line a lot.

However, please try the one line method, and the regular do loop cumsum, for size=1000000 array.

On the one hand, on my laptop one-line method for size=1000000 array, after I hit enter it does not seem to give me answer.

Modern laptop’s CPU is Ghz level, so roughly 10^9 operations per second.

For the one line method, 10^6*10^6=10^12 operations will likely take 10^3 seconds. Considering some vectorization, it may took 10 to 10^2 seconds or so. That is my rough estimation.

I am afraid the Intel OneAPI compiler I use may not be that smart to magically reduce a N^2 method to N.

On the other hand, the do loop one stupid version gives me answer in less than 0.1 second.

I think my question and description is not too unclear for most people, and I am afraid I am not very capable of understanding your accurate question.

Perhaps ‘King’ at S.O gives me answer already,

Again, I can be absolutely wrong and I may indeed miss the point, but if you find the one-line method is indeed faster than the stupid do loop subroutine, please let me know with your concrete timing. I feel that may help more.

I accept all of your criticism, and I admit most people here have much higher ability in Fortran than me. That is why I come here and learn. But I also think that, if we can just focus on this small topic, how to write cumsum as fast as possible from technical point of view, it may be more positive and useful. But anyway, since there may not be too much room to improve for this cumsum, we may just move on to new topics in other threads.

I think @ivanpribec said that the one-liner is indeed slower (note especially the last sentence below) when he wrote

The semantics of the one-line expression using

implied doand thesumintrinsic contain sufficient information, such that an optimizing compiler could recognize this expression as thecumulative sumoperation, and choose the fastest set of machine instructions for the target architecture.At least for the hardware (x86-64) and compilers (

`gfortran`

and`ifort`

) available to me, this does not appear to be the case today

1 Like

That statement opens up a teachable moment. The answer is, “Because Worse is Better”.

In other words, if I have to spend considerable amounts of time describing and learning good techniques for each sub-task, by the time I learn all the techniques and put together a program to solve the complete problem, I may be too old to care or someone else may have solved it much sooner.

2 Likes

Thank you @Beliavsky .

If @ivanpribec could simply add However, and delete At least, in the beginning of the 2nd paragraph, such that

"

…

However, for the hardware (x86-64), …

…

"

Then the logic seems more clear and I can understand where my point is missing more easily.

So he is saying that he believe the one-liner gives the compiler enough information such that the compiler ‘intelligently’ know this is ‘cumulative sum’. However, in reality compilers are not that smart.

Actually I am not missing the point entirely. But feel free to say anything @ivanpribec , I am not really bothered by any criticism, as long as the logic is perhaps a little bit more clear. (9th floor the comment by @mecej4 is more clear to me.)

I know one-liner concise code is elegant and usually gives better performance, particularly when deal with array operations which can be vectorized.

See here,

This part by veryreverie

```
do i=1,4
ts(i) = t + sum(as(:i-1,i)) * h
xs(i) = x + dot_product(as(:i-1,i), ks(:i-1))
ks(i) = h * (fi(xs(i)) + gi(xs(i))*warray(i))
enddo
```

I understand dot_product(as(:i-1,i), ks(:i-1)) is efficient. In that code, nothing is repeatedly calcuated more than twice.

But in the cumsum case, because you know there is some dependence there, like in order to go to step k+1, need to calculate step k first. Therefore although I wish, I do not know how the compiler is intelligent enough to realize it is a cumsum.

Unless the engineer perhaps particularly hard-code something in the compiler.

But that can be another topic for compiler engineer how to more intelligently optimized user’s code, particularly realize the logic behind the code. However, just for cumsum problem, even if the compiler realize it is cumsum, it will still calculate the cumsum use the stupid loop method perhaps. So it may come back the question again, how to write the fastest cumsum.

I have spent two evenings with the cumulative sum in hopes of providing some closure to this thread. To this effect I’ve ignored the warning from Knuth and the many earlier threads on Discourse which have discussed how micro-benchmarking rarely provides good evidence.

The way the question from the OP is framed,

is there faster way to do cumsum what perhaps fully utilize vectorization or something?

tells us that he’s interested in architecture-specific solutions. The answer will obviously depend upon the hardware, compiler, compiler settings, array size, memory layout, and other factors making it difficult to give any kind of general answer. Best I can do is try with experimentation.

In post #10, @mecej4 has brought up Knuth’s warning, which also appears as the epigraph at the beginning of Chapter 1 in the book “Scientific Software Design” by Rouson, Xia & Xu. Reading the chapter further, Rouson & coauthors introduce the central thesis of the book:

Your Time is Worth More Than Your Computer’s TimeTotal solution time and cost can be reduced in greater proportion by reducing > development time and costs than by reducing computing time and costs.

If you are developing an application where performance is critical, I can only suggest you first profile your code using a tool like gprof or the Intel® VTune™ Profiler. This will help you to identify the sections of your code which occupy most of the runtime, and where you should focus your efforts to gain speed-up.

Before you reject the canonical implemention with the do loop, it’s fair to do a comparison with other programming languages. I’ve chosen the following three:

- C++:
`std::partial_sum`

- Python:
`numpy.cumsum`

- MATLAB:
`cumsum`

I’ve placed all the benchmark codes I used here:

GitHub - ivan-pi/cumsum_benchmark: Some benchmarks for the cumulative sum of an array

Results I got for C++, Python, and MATLAB are the following:

As you can see the C++ `std::partial_sum`

implementation leads with just over 1000 million doubles per second. MATLAB is not trailing very far behind.

For Fortran `cumsum`

, I’ve prepared some variations following the C++ codes of Prof. Daniel Lemire. Hopefully, there are no mistakes left.

With gfortran I get the following results (columns correspond to array sizes N = 1000, 10000, 100000, 1000000, rows are distinct trials):

```
# Compiler version: GCC version 10.3.0
# Compiler options: -mtune=generic -march=x86-64 -O3 -Wall
# do_loop
391.389 828.912 1024.947 749.708
994.036 1066.553 1028.987 748.018
979.432 1006.239 967.090 753.746
1037.344 1070.320 1039.696 757.796
1018.330 1073.768 1041.558 753.090
1057.082 1074.691 1025.178 759.203
1009.082 1071.582 1033.506 747.828
# step 2
1592.357 1639.882 1334.971 746.350
1515.152 1682.369 1351.461 748.210
1605.136 1554.243 1352.174 665.873
1420.455 1618.123 1245.051 750.325
1517.451 1632.120 1347.073 744.820
1305.483 1500.600 1364.647 751.274
1515.152 1670.844 1322.576 735.441
# step 2 unrolled
1027.749 1040.908 1029.569 639.030
1001.001 1024.695 1014.641 701.242
1025.641 1073.537 1016.085 627.270
1018.330 996.711 933.158 623.454
1003.009 1032.418 864.655 686.328
1061.571 1066.780 1027.359 749.817
1060.445 1041.775 1036.506 753.150
```

before experiencing a crash with the 3-step method. As you can see the simple do loop provides equal performance as the C++ standard library. The `step2`

version is able to achieve a 50 % performance increase for small array sizes, but with large array sizes, the advantage seems to vanish.

The second two step version `step2_unrolled`

provides no benefit compared to the canonical do loop. Moral of the story is - don’t try to be smarter than the compiler, unless you are a low-level performance tuning engineer with intimate knowledge of the cache organization and CPU instructions sets.

I also made a run with the Intel Fortran compiler:

```
# Compiler version: Intel(R) Fortran Intel(R) 64 Compiler Classic for applicati
ons running on Intel(R) 64, Version 2021.3.0 Build 20210609_000000
# Compiler options: -warn all -O2 -o f_cumsum
# do_loop
280.269 443.027 524.172 458.147
532.481 532.992 514.099 490.898
513.611 517.813 510.485 496.566
534.474 522.166 524.929 475.125
518.941 517.518 506.827 497.843
533.333 519.292 516.054 484.909
533.333 522.384 516.316 482.826
# step 2
1216.545 1280.082 1071.432 719.369
1485.884 1548.947 1321.528 707.293
1479.290 1604.879 1374.627 652.008
1519.757 1534.684 1351.936 683.041
1519.757 1546.551 1353.858 742.416
1524.390 1540.357 1306.626 708.752
1483.680 1586.798 1366.139 651.351
# step 2 unrolled
326.584 333.400 344.710 334.904
350.877 336.655 345.354 334.240
340.252 351.136 352.399 338.091
356.379 344.187 345.642 331.063
344.353 344.982 344.578 328.011
339.789 335.435 328.517 325.161
344.234 333.901 332.197 323.531
# step 3
993.049 987.069 900.163 623.206
1199.041 1139.601 935.121 632.438
1447.178 1176.471 972.328 697.442
1552.795 1265.342 1035.540 691.362
1468.429 1104.850 1039.685 653.648
1108.647 1159.420 940.230 632.437
1335.113 1144.427 1054.863 708.983
# step four
585.823 659.979 614.881 286.593
683.527 708.617 635.300 294.837
796.813 726.111 624.836 288.141
772.798 738.007 595.745 275.712
816.327 745.268 649.199 255.521
653.595 667.824 539.110 264.546
788.644 745.935 642.422 282.847
```

Here the `step2_unrolled`

version gives performance roughly equal to numpy. The basic `do_loop`

version is 50 % slower than gfortran. The `step2`

and `step3`

versions appear to perform reasonably well.

The `step_four`

version is taken from the Stack Exchange thread linked in the original post, but it also doesn’t provide much benefit against the simple do loop.

In conclusion based upon my limited and imperfect evidence, I agree with @mecej4’s wisdom, “Because Worse is Better”. Using the simple `do`

loop is a good return on the invested coding effort. Given the 6 hours I just invested in preparing the code for this post, I don’t think the 50 % increase in two-step unrolled version is really worth it. I would probably prefer to wait a few years, and buy a new CPU.

@CRquantum, please feel free to expand the code I put on GitHub with your own `cumsum`

variations or borrow it for other performance-related questions.

Concerning the implied do loop suggested by Brad (@everythingfunctional), the performance is in fact terribly slow, that I could not afford to wait for the results of the N > 10^4 cases. As indicated by @CRquantum, the work appears to scale as O(N^2) introducing a severe performance penalty. For N = 1000 the implied do version is roughly 500 times slower for the smallest array size, and drops another order of magnitude for N = 10000:

```
# implied_do
2.250 0.211
2.245 0.208
2.166 0.206
2.148 0.208
2.086 0.210
2.204 0.212
2.238 0.210
```

Edit: the results were measured on a ThinkPad T530 laptop with an Intel(R) Core™ i5-3320M (first released 2012). It would be interesting to learn what speed new processors achieve, and whether the gains already outperform the tweaks like loop unrolling.

6 Likes

Thank you @ivanpribec that is truly remarkable work!

My motivation is just that I saw matlab have cumsum, but cumsum in matlab is just like an API, I do not know its source code. So I wonder in Fortran can we have the fastest cunsum algorithm. I just feel that a fastest algorithm overall, should not depend on hardware too much.

I have downloaded your code, and I will read your code carefully.

Right now, you may use

```
procedure() ::
```

so that there is no need to write the interface blocks.

Eg, in your fortran file line 198,

```
subroutine run_test(ntrials,nreps,n,cumsum)
integer, intent(in) :: ntrials, nreps
integer, intent(in) :: n(:)
interface
subroutine cumsum(a,b)
import dp
real(dp), intent(in) :: a(:)
real(dp), intent(out) :: b(:)
end subroutine
end interface
```

I believe you may do

```
subroutine run_test(ntrials,nreps,n,cumsum)
use cumsum_mod
integer, intent(in) :: ntrials, nreps
integer, intent(in) :: n(:)
procedure(cumsum_do_loop) :: cumsum
```

The one line

```
procedure(cumsum_do_loop) :: cumsum
```

will automatically take care of the interface for the cumsum.

Actually I think you may pick any name of your cumsum subroutines inside the () in procedure(), because they have the same interface.

I just ran your code, here is the results. I am using Thinkpad P72 with Xeon-2186M CPU, perhaps it may of some help for your to see how the performance of CPU increased over the years (I do not really think single core performance increased much since Intel’s

Haswell architecture)

Your subroutine cumsum_step2(a, b) looks like the best cumsum. The block trick is new to me, and it looks very interesting. Perhaps it fully takes advantage of AVX2, perhaps for CPU with AVX512, step2 may be even better.

```
subroutine cumsum_step2(a, b)
real(dp), intent(in) :: a(:)
real(dp), intent(out) :: b(:)
real(dp) :: s0, s1
integer :: i, n
s0 = 0.0_dp
s1 = 0.0_dp
n = size(a)
do i = 1, n, 2
block
real(dp) :: x0, x1
x0 = a(i)
x1 = a(i+1)
s0 = s1 + x0
s1 = s1 + (x1 + x0)
b(i) = s0
b(i+1) = s1
end block
end do
if (mod(n,2) /= 0) then
b(n) = b(n-1) + a(n)
end if
end subroutine
```

Here is a modified version of the main loop in CRQuantum’s sysresample() subroutine in his post #7,

```
k=1
qcum = q(k)
do j=1,m
do while (qcum < u(j))
k = k + 1
qcum = qcum+q(k)
enddo
i(j) = k
enddo
```

*I have not tested the code, so it may need correction*.

Note that it makes the **qc** array and, therefore, the function (or subroutine) **cumsum** superfluous. If this version works well, we have in this thread discussed some great answers to the wrong question as far as the task of doing CRQuantum’s Monte Carlo simulation is concerned.

On the other hand, if there is sufficient demand for **cumsum**, it could be added to the list of Fortran intrinsic functions.

2 Likes

@mecej4 Thank you! Your code is correct.

Yes we indeed prevented the use of cumsum function/subroutine explicitly. Uhm, however the calculation of cumsum is still there in the loop, and the time cost is the same if not more. But nonetheless your code/suggestion is insightful!

The reason I am curious about cumsum is from Thomas Schon’s Matlab code,

http://user.it.uu.se/~thosc112/research/rao-blackwellized-particle.html

In the pf.m, he uses his sysresample function which uses cumsum, Matlab code is below,

```
function i=sysresample(q)
qc=cumsum(q); M=length(q);
u=([0:M-1]+rand(1))/M;
i=zeros(1,M); k=1;
for j=1:M
while (qc(k)<u(j))
k=k+1;
end
i(j)=k;
end
end
```

I support putting cumsum into Fortran’s intrinsic function like the Matlab one, Cumulative sum - MATLAB cumsum

I found a link, leonfoks seems already did something about this,

https://leonfoks.github.io/coretran/interface/cumsum.html

PS.

Just to be clear, I do not have any intention to waste your guy’s time to solve my problem for me.

In this post I was simply just asking about cumsum function because it is simple enough.

I know cumsum not something really slow down my code because it is O(N) which is not too bad after all.

Just in case someone want to see a minimal working example, I put my not even baked MWE in the link, Files · master · Chen Rong / SDE-PF · GitLab

If you have visual studio + intel OneAPI installed, just click the .sln file and it will open the project, just build and run, it should take 0.15 sec or less.

main.f90 is main program. The cumsum is in the file pf.f90, that is the module pf (particle filter), in the sysresample subroutine. around line 153. Above it, which is commented now is my translated version with cumsum, below line 153 is the one @mecej4 suggested without explicit cumsum.

The very naive SDE solver (which is not used in this example in this thread) I slightly wrote is based on Kasdin’s paper and John Burkardt’s code

Note that John’s stochastic_RK is a scalar version which only solve 1 equation. While Kasdin’s paper provides a general solution which can solve a system of SDEs, so it is actually a vector version.

I plan to learn more from FLINT ODE code by @prince_mahajan

and make my SDE solver much better than it is now.

About SDE and Kasdin’s method, a relevant S.O post is,

The comments there by Lutz Lehmann about SDE is very useful.

I also asked Kasdin’s method at

Chris Rack told me that Kasdin’s Runge-Kutta method is like order 0.5 than order 4. But nonetheless here I will begin with Kasdin’s method.

Thing is, in Fortran, we does not seem to have some good SDE solver now it seems.

I don’t see why it would cost more.

By using a scalar variable `qcum`

to store the partial sum, the code from mecej4 got rid of a reference to the `qc`

array. Instead of needing to fetch both `qc`

and `u`

values from the cache, only `u`

will be fetched and `qcum`

will likely remain in one the registers. I would expect this to provide a small but perhaps significant speed-up to the `sysresample`

routine. I am happy to be proven wrong.

Of course we can only be certain after performing an experiment and/or analyzing the output assembly code. This reminds me of some projects I came across recently:

2 Likes

For the record, the block construct doesn’t seem to play any role. Also the `step2`

version **does not** appear to utilize any advanced vector extensions (AVX) due to data dependencies. This can be inferred from the Intel Optimization report:

Source code:

```
!> Canonical cumsum
subroutine cumsum_do_loop(a,b)
real(dp), intent(in) :: a(:)
real(dp), intent(out) :: b(:)
integer :: i
real(dp) :: sm
b(1) = a(1)
do i = 2, size(a)
b(i) = b(i-1) + a(i)
end do
end subroutine
```

Compile instructions:

```
ifort -warn all -O3 -xHost -qopt-report-phase=loop,vec -qopt-report=2 cumsum_benchmark.f90 -o f_cumsum
```

Optimization report:

```
Begin optimization report for: CUMSUM_DO_LOOP
Report from: Loop nest & Vector optimizations [loop, vec]
LOOP BEGIN at cumsum_benchmark.f90(19,5)
<Multiversioned v1>
remark #25233: Loop multiversioned for stride tests on Assumed shape arrays
remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
remark #15346: vector dependence: assumed FLOW dependence between B(I) (20:7) and B(I-1) (20:7)
remark #25439: unrolled with remainder by 2
LOOP END
LOOP BEGIN at cumsum_benchmark.f90(19,5)
<Remainder, Multiversioned v1>
LOOP END
LOOP BEGIN at cumsum_benchmark.f90(19,5)
<Multiversioned v2>
remark #15304: loop was not vectorized: non-vectorizable loop instance from multiversioning
remark #25439: unrolled with remainder by 2
LOOP END
LOOP BEGIN at cumsum_benchmark.f90(19,5)
<Remainder, Multiversioned v2>
LOOP END
```

Source code:

```
!> Two-step cumsum
!> Source: https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/tree/master/2012/07/23
subroutine cumsum_step2(a, b)
real(dp), intent(in) :: a(:)
real(dp), intent(out) :: b(:)
real(dp) :: s0, s1
integer :: i, n
s0 = 0.0_dp
s1 = 0.0_dp
n = size(a)
do i = 1, n, 2
block
real(dp) :: x0, x1
x0 = a(i)
x1 = a(i+1)
s0 = s1 + x0
s1 = s1 + (x1 + x0)
b(i) = s0
b(i+1) = s1
end block
end do
if (mod(n,2) /= 0) then
b(n) = b(n-1) + a(n)
end if
end subroutine
```

Compile instructions:

```
ifort -warn all -O3 -xHost -qopt-report-phase=loop,vec -qopt-report=2 cumsum_benchmark.f90 -o f_cumsum
```

Optimization report:

```
Begin optimization report for: CUMSUM_STEP2
Report from: Loop nest & Vector optimizations [loop, vec]
LOOP BEGIN at cumsum_benchmark.f90(76,5)
remark #25084: Preprocess Loopnests: Moving Out Store [ cumsum_benchmark.f90(81,9) ]
remark #25084: Preprocess Loopnests: Moving Out Store [ cumsum_benchmark.f90(82,9) ]
remark #25084: Preprocess Loopnests: Moving Out Store [ cumsum_benchmark.f90(86,5) ]
remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
remark #15346: vector dependence: assumed ANTI dependence between cumsum_step2 (81:9) and cumsum_step2 (82:9)
remark #25439: unrolled with remainder by 2
LOOP END
LOOP BEGIN at cumsum_benchmark.f90(76,5)
<Remainder>
LOOP END
```

I’ve tried to take a look at the differences with Godbolt (Compiler Explorer), but the reasons why the second executes faster for small workloads is way beyond my knowledge.

Here are the sections inside the loop body (for single precision):

```
..B2.6: # Preds ..B2.4 ..B2.6
movss xmm1, DWORD PTR [r8+rdx*8] #19.14
movss xmm0, DWORD PTR [8+rax+rdx*8] #19.23
addss xmm1, DWORD PTR [4+rax+rdx*8] #19.7
movss DWORD PTR [4+r8+rdx*8], xmm1 #19.7
addss xmm1, xmm0 #19.7
movss DWORD PTR [8+r8+rdx*8], xmm1 #19.7
inc rdx #18.5
cmp rdx, rcx #18.5
jb ..B2.6 # Prob 63% #18.5
lea ebx, DWORD PTR [1+rdx+rdx]
```

As you can see the loop has been unrolled, and two additions are performed before jumping back to the start.

```
..B3.4: # Preds ..B3.4 ..B3.3
movss xmm2, DWORD PTR [r12+r8*4] #39.9
movaps xmm3, xmm0 #41.9
movss xmm1, DWORD PTR [r14+r8*4] #40.9
inc r9 #36.5
movss xmm5, DWORD PTR [rdx+r8*4] #39.9
addss xmm3, xmm2 #41.9
addss xmm2, xmm1 #42.23
movss xmm4, DWORD PTR [rsi+r8*4] #40.9
movaps xmm6, xmm5 #41.9
movss DWORD PTR [r10+rbx*4], xmm3 #43.9
add r8, r11 #36.5
addss xmm0, xmm2 #42.9
addss xmm5, xmm4 #42.23
addss xmm6, xmm0 #41.9
movss DWORD PTR [r13+rbx*4], xmm0 #44.9
addss xmm0, xmm5 #42.9
movss DWORD PTR [rax+rbx*4], xmm6 #43.9
movss DWORD PTR [rcx+rbx*4], xmm0 #44.9
add rbx, rbp #36.5
cmp r9, r15 #36.5
jb ..B3.4 # Prob 63% #36.5
mov r15d, DWORD PTR [-48+rsp] #[spill]
```