optimization - Why is ''.join() faster than += in Python?

Question

asked Oct 17, 2021 in Technique by 深蓝 (71.8m points)

I can find a bevy of information online (on Stack Overflow and elsewhere) saying that using + or += for string concatenation in Python is inefficient and bad practice.

I can't seem to find WHY += is so inefficient. Outside of a mention here that "it's been optimized for 20% improvement in certain cases" (still not clear what those cases are), I can't find any additional information.

What is happening on a more technical level that makes ''.join() superior to other Python concatenation methods?

1 Answer

深蓝 · Answer 1 · 2021-10-17T01:04:00+0000

Let's say you have this code to build up a string from three strings:

x = 'foo'
x += 'bar'  # 'foobar'
x += 'baz'  # 'foobarbaz'

In this case, Python first needs to allocate and create 'foobar' before it can allocate and create 'foobarbaz'.

So for each += that gets called, the entire contents of the string and whatever is being added to it need to be copied into an entirely new memory buffer. In other words, if you have N strings to be joined, you need to allocate approximately N temporary strings, and the first substring gets copied ~N times. The last substring only gets copied once, but on average, each substring gets copied ~N/2 times. For pieces of similar size, the total amount of copying therefore grows roughly with N squared rather than with N.
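To make the copy counts above concrete, here is a small sketch that only counts characters copied under the naive model described in this answer (every += builds a brand-new string, starting from an empty one). The function names are made up for illustration, and this models the cost; it does not measure what CPython actually does.

```python
def copies_with_plus_equals(pieces):
    """Count characters copied if every += allocates a brand-new string."""
    copied = 0
    length = 0  # length of the string built so far
    for piece in pieces:
        # Naive model: the old contents plus the new piece are both
        # copied into a freshly allocated buffer.
        length += len(piece)
        copied += length
    return copied


def copies_with_join(pieces):
    """Count characters copied if each piece is written once into one buffer."""
    return sum(len(piece) for piece in pieces)


pieces = ["foo", "bar", "baz"]
print(copies_with_plus_equals(pieces))  # 3 + 6 + 9 = 18
print(copies_with_join(pieces))         # 9

many = ["a"] * 1000
print(copies_with_plus_equals(many))    # 1 + 2 + ... + 1000 = 500500
print(copies_with_join(many))           # 1000
```

With three pieces the difference is negligible, but with 1000 one-character pieces the naive model copies about 500 times as much data.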

With .join, Python can play a number of tricks, since the intermediate strings do not need to be created. CPython figures out how much memory it needs up front and allocates a correctly sized buffer. It then copies each piece into the new buffer, so each piece is copied only once.
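The two-pass idea (measure first, then copy once) can be sketched in pure Python. This is only an illustration of the strategy, not CPython's actual implementation, which is written in C. The name join_sketch is hypothetical, and the sketch assumes ASCII-only input to keep the buffer handling simple.

```python
def join_sketch(pieces):
    """Illustrative two-pass join for ASCII strings (not CPython's real code)."""
    encoded = [piece.encode("ascii") for piece in pieces]

    # Pass 1: work out the total size before allocating anything.
    total = sum(len(chunk) for chunk in encoded)

    # One allocation of exactly the right size.
    buffer = bytearray(total)

    # Pass 2: copy each piece into its final position exactly once.
    offset = 0
    for chunk in encoded:
        buffer[offset:offset + len(chunk)] = chunk
        offset += len(chunk)

    return buffer.decode("ascii")


print(join_sketch(["foo", "bar", "baz"]))  # foobarbaz
```

In real code you would just call ''.join(pieces); the sketch is only there to show why no intermediate strings are needed.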

There are other viable approaches that could lead to better performance for += in some cases. For example, the internal string representation could be a rope, or the runtime could be smart enough to figure out that the temporary strings are of no use to the program and optimize them away.

However, CPython certainly does not do these optimizations reliably (though it may for a few corner cases), and since it is the most common implementation in use, many best practices are based on what works well for CPython. Having a standardized set of norms also makes it easier for other implementations to focus their optimization efforts.
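If you want to see how the two approaches compare on your own interpreter, a rough timing sketch with the standard timeit module could look like the following. The helper names and the input sizes are arbitrary choices for illustration. Timings depend heavily on the Python version and implementation, and because CPython may optimize += in some cases, the gap can be smaller than the copy-count argument suggests.

```python
import timeit


def build_with_plus_equals(pieces):
    """Build the result by repeated += in a loop."""
    result = ""
    for piece in pieces:
        result += piece
    return result


def build_with_join(pieces):
    """Build the result with a single ''.join() call."""
    return "".join(pieces)


# 10,000 pieces of 10 characters each (arbitrary sizes for the experiment).
pieces = ["x" * 10] * 10_000

# Both approaches must produce the same string.
assert build_with_plus_equals(pieces) == build_with_join(pieces)

for fn in (build_with_plus_equals, build_with_join):
    seconds = timeit.timeit(lambda: fn(pieces), number=20)
    print(f"{fn.__name__}: {seconds:.4f} s for 20 runs")
```

Treat the output as a measurement of one machine and one interpreter, not as a general ranking.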
