Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

0 votes
139 views
in Technique by (71.8m points)

r - Speeding up an extremely slow for-loop

This is my first question on stackoverflow, so feel free to criticize the question.

For every row in a data set, I would like to sum the rows that:

  • have identical 'team', 'season' and 'simulation_ID'.
  • have 'match_ID' smaller than (and not equal to) the current 'match_ID'.

such that I find the accumulated number of points up to that match, for that team, season and simulation_ID, i.e. cumsum(simulation$team_points).

I am struggling to implement the second condition without resorting to an extremely slow for-loop.

The data looks like this:

match_ID  season     simulation_ID  home_team  team       match_result  team_points
2084      2020-2021  1              TRUE       Liverpool  Away win      0
2084      2020-2021  2              TRUE       Liverpool  Draw          1
2084      2020-2021  3              TRUE       Liverpool  Away win      0
2084      2020-2021  4              TRUE       Liverpool  Away win      0
2084      2020-2021  5              TRUE       Liverpool  Home win      3
2084      2020-2021  1              FALSE      Burnley    Home win      0
2084      2020-2021  2              FALSE      Burnley    Draw          1
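For reproducibility, the sample rows above can be entered as a data frame (here named `df`, which is also the name the answer below assumes):

```r
# Reproducible version of the sample data shown above
df <- data.frame(
  match_ID      = rep(2084, 7),
  season        = rep("2020-2021", 7),
  simulation_ID = c(1, 2, 3, 4, 5, 1, 2),
  home_team     = c(rep(TRUE, 5), rep(FALSE, 2)),
  team          = c(rep("Liverpool", 5), rep("Burnley", 2)),
  match_result  = c("Away win", "Draw", "Away win", "Away win",
                    "Home win", "Home win", "Draw"),
  team_points   = c(0, 1, 0, 0, 3, 0, 1)
)
```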
question from:https://stackoverflow.com/questions/65873007/speeding-up-an-extremely-slow-for-loop


1 Answer

0 votes
by (71.8m points)

For-loops are often slow in interpreted languages like R, because every iteration is executed by the interpreter one element at a time, and they are best avoided where possible. The alternative is "vectorized operations", which apply a function to a whole vector rather than to each element separately. Native functions in R and popular packages rely on optimized C/C++ code and linear algebra libraries under the hood, so the loop effectively runs in compiled code and becomes much faster than the equivalent loop written in R; on top of that, the CPU can often process several vector elements at once instead of going one by one. You can find more information about vectorization in this question.
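As a minimal illustration of the difference, here is a cumulative sum written as an explicit R loop next to the vectorized `cumsum()`; both produce the same result, but `cumsum()` runs its loop in compiled code:

```r
x <- c(0, 1, 0, 0, 3)

# Loop version: updates one element at a time in interpreted R code
loop_cumsum <- function(v) {
  out <- numeric(length(v))
  total <- 0
  for (i in seq_along(v)) {
    total <- total + v[i]
    out[i] <- total
  }
  out
}

loop_cumsum(x)  # 0 1 1 1 4
cumsum(x)       # identical result, much faster on long vectors
```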

In your specific case, you could use dplyr to transform your data:

library(dplyr)

df %>%
  # perform the same operation within each team/season/simulation group
  group_by(team, season, simulation_ID) %>%
  # within each group, order the rows by match_ID (ascending);
  # .by_group = TRUE makes arrange() respect the grouping
  arrange(match_ID, .by_group = TRUE) %>%
  # cumulative points from strictly earlier matches only:
  # lag() shifts team_points down one row, so the current match is excluded
  mutate(points = cumsum(lag(team_points, default = 0))) %>%
  ungroup()

The code above essentially decomposes the team_points column into one vector per group that you care about, then applies a single, highly optimized operation to each of them. The `lag()` call implements your second condition: the running total for a match only counts matches with a strictly smaller match_ID.
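If you prefer to avoid extra packages, the same idea can be sketched in base R with `ave()`, which applies a function within each group. This uses a tiny stand-in data frame (not your full data) just to show the mechanics:

```r
# Minimal stand-in for the question's data: two simulation runs, two matches each
df <- data.frame(
  match_ID      = c(2084, 2085, 2084, 2085),
  season        = "2020-2021",
  simulation_ID = c(1, 1, 2, 2),
  team          = "Liverpool",
  team_points   = c(0, 3, 1, 3)
)

# Order rows so the cumulative sum runs in match order within each group
df <- df[order(df$team, df$season, df$simulation_ID, df$match_ID), ]

# head(c(0, cumsum(x)), -1) is the cumulative sum over strictly earlier rows
df$points <- ave(
  df$team_points,
  df$team, df$season, df$simulation_ID,
  FUN = function(x) head(c(0, cumsum(x)), -1)
)

df$points  # 0 3 0 1: each simulation's first match starts at 0
```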

