Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

0 votes
139 views
in Technique by (71.8m points)

r - Speeding up an extremely slow for-loop

This is my first question on stackoverflow, so feel free to criticize the question.

For every row in a data set, I would like to sum the rows that:

  • have identical 'team', 'season' and 'simulation_ID'.
  • have 'match_ID' smaller than (and not equal to) the current 'match_ID'.

such that I find the accumulated number of points up to that match, for that team, season and simulation_ID, i.e. cumsum(simulation$team_points).

I am struggling to implement the second condition without resorting to an extremely slow for-loop.

The data looks like this:

match_ID  season     simulation_ID  home_team  team       match_result  team_points
2084      2020-2021  1              TRUE       Liverpool  Away win      0
2084      2020-2021  2              TRUE       Liverpool  Draw          1
2084      2020-2021  3              TRUE       Liverpool  Away win      0
2084      2020-2021  4              TRUE       Liverpool  Away win      0
2084      2020-2021  5              TRUE       Liverpool  Home win      3
2084      2020-2021  1              FALSE      Burnley    Home win      0
2084      2020-2021  2              FALSE      Burnley    Draw          1
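For reproducibility, the sample rows above can be entered as a data frame (here named `df`, which is also the name the answer below assumes):

```r
# Reproducible version of the sample data shown above
df <- data.frame(
  match_ID      = rep(2084, 7),
  season        = rep("2020-2021", 7),
  simulation_ID = c(1, 2, 3, 4, 5, 1, 2),
  home_team     = c(rep(TRUE, 5), rep(FALSE, 2)),
  team          = c(rep("Liverpool", 5), rep("Burnley", 2)),
  match_result  = c("Away win", "Draw", "Away win", "Away win",
                    "Home win", "Home win", "Draw"),
  team_points   = c(0, 1, 0, 0, 3, 0, 1)
)
```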
question from:https://stackoverflow.com/questions/65873007/speeding-up-an-extremely-slow-for-loop


1 Answer

0 votes
by (71.8m points)

For-loops are often slow in interpreted languages like R, because every iteration is executed by the interpreter one element at a time, and they are best avoided where possible. The alternative is "vectorized operations", which apply a function to a whole vector rather than to each element separately. Native functions in R and popular packages rely on optimized C/C++ code and linear algebra libraries under the hood, so the loop effectively runs in compiled code and becomes much faster than the equivalent loop written in R; on top of that, the CPU can often process several vector elements at once instead of going one by one. You can find more information about vectorization in this question.
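As a minimal illustration of the difference, here is a cumulative sum written as an explicit R loop next to the vectorized `cumsum()`; both produce the same result, but `cumsum()` runs its loop in compiled code:

```r
x <- c(0, 1, 0, 0, 3)

# Loop version: updates one element at a time in interpreted R code
loop_cumsum <- function(v) {
  out <- numeric(length(v))
  total <- 0
  for (i in seq_along(v)) {
    total <- total + v[i]
    out[i] <- total
  }
  out
}

loop_cumsum(x)  # 0 1 1 1 4
cumsum(x)       # identical result, much faster on long vectors
```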

In your specific case, you could use dplyr to transform your data:

library(dplyr)

df %>%
  # perform the same operation within each team/season/simulation group
  group_by(team, season, simulation_ID) %>%
  # within each group, order the rows by match_ID (ascending);
  # .by_group = TRUE makes arrange() respect the grouping
  arrange(match_ID, .by_group = TRUE) %>%
  # cumulative points from strictly earlier matches only:
  # lag() shifts team_points down one row, so the current match is excluded
  mutate(points = cumsum(lag(team_points, default = 0))) %>%
  ungroup()

The code above essentially decomposes the team_points column into one vector per group that you care about, then applies a single, highly optimized operation to each of them. The `lag()` call implements your second condition: the running total for a match only counts matches with a strictly smaller match_ID.
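If you prefer to avoid extra packages, the same idea can be sketched in base R with `ave()`, which applies a function within each group. This uses a tiny stand-in data frame (not your full data) just to show the mechanics:

```r
# Minimal stand-in for the question's data: two simulation runs, two matches each
df <- data.frame(
  match_ID      = c(2084, 2085, 2084, 2085),
  season        = "2020-2021",
  simulation_ID = c(1, 1, 2, 2),
  team          = "Liverpool",
  team_points   = c(0, 3, 1, 3)
)

# Order rows so the cumulative sum runs in match order within each group
df <- df[order(df$team, df$season, df$simulation_ID, df$match_ID), ]

# head(c(0, cumsum(x)), -1) is the cumulative sum over strictly earlier rows
df$points <- ave(
  df$team_points,
  df$team, df$season, df$simulation_ID,
  FUN = function(x) head(c(0, cumsum(x)), -1)
)

df$points  # 0 3 0 1: each simulation's first match starts at 0
```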

