Notice
See also this answer: https://stackoverflow.com/a/21708215 which was the base for the EDIT 2 at the bottom here.
I've augmented the loop to 1000000 to get a better timing measure.
This is my Python timing:
real 0m2.038s
user 0m2.009s
sys 0m0.024s
Here's an equivalent of your code, just a bit prettier:
#include <regex>
#include <vector>
#include <string>
std::vector<std::string> split(const std::string &s, const std::regex &r)
{
return {
std::sregex_token_iterator(s.begin(), s.end(), r, -1),
std::sregex_token_iterator()
};
}
int main()
{
const std::regex r(" +");
for(auto i=0; i < 1000000; ++i)
split("a b c", r);
return 0;
}
Timing:
real 0m5.786s
user 0m5.779s
sys 0m0.005s
This is an optimization to avoid construction/allocation of vector and string objects:
#include <regex>
#include <vector>
#include <string>
void split(const std::string &s, const std::regex &r, std::vector<std::string> &v)
{
auto rit = std::sregex_token_iterator(s.begin(), s.end(), r, -1);
auto rend = std::sregex_token_iterator();
v.clear();
while(rit != rend)
{
v.push_back(*rit);
++rit;
}
}
int main()
{
const std::regex r(" +");
std::vector<std::string> v;
for(auto i=0; i < 1000000; ++i)
split("a b c", r, v);
return 0;
}
Timing:
real 0m3.034s
user 0m3.029s
sys 0m0.004s
This is near a 100% performance improvement.
The vector is created before the loop, and can grow its memory in the first iteration. Afterwards there's no memory deallocation by clear()
, the vector maintains the memory and construct strings in-place.
Another performance increase would be to avoid construction/destruction std::string
completely, and hence, allocation/deallocation of its objects.
This is a tentative in this direction:
#include <regex>
#include <vector>
#include <string>
void split(const char *s, const std::regex &r, std::vector<std::string> &v)
{
auto rit = std::cregex_token_iterator(s, s + std::strlen(s), r, -1);
auto rend = std::cregex_token_iterator();
v.clear();
while(rit != rend)
{
v.push_back(*rit);
++rit;
}
}
Timing:
real 0m2.509s
user 0m2.503s
sys 0m0.004s
An ultimate improvement would be to have a std::vector
of const char *
as return, where each char pointer would point to a substring inside the original s
c string itself. The problem is that, you can't do that because each of them would not be null terminated (for this, see usage of C++1y string_ref
in a later sample).
This last improvement could also be achieved with this:
#include <regex>
#include <vector>
#include <string>
void split(const std::string &s, const std::regex &r, std::vector<std::string> &v)
{
auto rit = std::cregex_token_iterator(s.data(), s.data() + s.length(), r, -1);
auto rend = std::cregex_token_iterator();
v.clear();
while(rit != rend)
{
v.push_back(*rit);
++rit;
}
}
int main()
{
const std::regex r(" +");
std::vector<std::string> v;
for(auto i=0; i < 1000000; ++i)
split("a b c", r, v); // the constant string("a b c") should be optimized
// by the compiler. I got the same performance as
// if it was an object outside the loop
return 0;
}
I've built the samples with clang 3.3 (from trunk) with -O3. Maybe other regex libraries are able to perform better, but in any case, allocations/deallocations are frequently a performance hit.
Boost.Regex
This is the boost::regex
timing for the c string arguments sample:
real 0m1.284s
user 0m1.278s
sys 0m0.005s
Same code, boost::regex
and std::regex
interface in this sample are identical, just needed to change the namespace and include.
Best wishes for it to get better over time, C++ stdlib regex implementations are in their infancy.
EDIT
For sake of completion, I've tried this (the above mentioned "ultimate improvement" suggestion) and it didn't improved performance of the equivalent std::vector<std::string> &v
version in anything:
#include <regex>
#include <vector>
#include <string>
template<typename Iterator> class intrusive_substring
{
private:
Iterator begin_, end_;
public:
intrusive_substring(Iterator begin, Iterator end) : begin_(begin), end_(end) {}
Iterator begin() {return begin_;}
Iterator end() {return end_;}
};
using intrusive_char_substring = intrusive_substring<const char *>;
void split(const std::string &s, const std::regex &r, std::vector<intrusive_char_substring> &v)
{
auto rit = std::cregex_token_iterator(s.data(), s.data() + s.length(), r, -1);
auto rend = std::cregex_token_iterator();
v.clear(); // This can potentially be optimized away by the compiler because
// the intrusive_char_substring destructor does nothing, so
// resetting the internal size is the only thing to be done.
// Formerly allocated memory is maintained.
while(rit != rend)
{
v.emplace_back(rit->first, rit->second);
++rit;
}
}
int main()
{
const std::regex r(" +");
std::vector<intrusive_char_substring> v;
for(auto i=0; i < 1000000; ++i)
split("a b c", r, v);
return 0;
}
This has to do with the array_ref and string_ref proposal. Here's a sample code using it:
#include <regex>
#include <vector>
#include <string>
#include <string_ref>
void split(const std::string &s, const std::regex &r, std::vector<std::string_ref> &v)
{
auto rit = std::cregex_token_iterator(s.data(), s.data() + s.length(), r, -1);
auto rend = std::cregex_token_iterator();
v.clear();
while(rit != rend)
{
v.emplace_back(rit->first, rit->length());
++rit;
}
}
int main()
{
const std::regex r(" +");
std::vector<std::string_ref> v;
for(auto i=0; i < 1000000; ++i)
split("a b c", r, v);
return 0;
}
It will also be cheaper to return a vector of string_ref
rather than string
copies for the case of split
with vector return.
EDIT 2
This new solution is able to get output by return. I have used Marshall Clow's string_view
(string_ref
got renamed) libc++ implementation found at https://github.com/mclow/string_view.
#include <string>
#include <string_view>
#include <boost/regex.hpp>
#include <boost/range/iterator_range.hpp>
#include <boost/iterator/transform_iterator.hpp>
using namespace std;
using namespace std::experimental;
using namespace boost;
string_view stringfier(const cregex_token_iterator::value_type &match) {
return {match.first, static_cast<size_t>(match.length())};
}
using string_view_iterator =
transform_iterator<decltype(&stringfier), cregex_token_iterator>;
iterator_range<string_view_iterator> split(string_view s, const regex &r) {
return {
string_view_iterator(
cregex_token_iterator(s.begin(), s.end(), r, -1),
stringfier
),
string_view_iterator()
};
}
int main() {
const regex r(" +");
for (size_t i = 0; i < 1000000; ++i) {
split("a b c", r);
}
}
Timing:
real 0m0.385s
user 0m0.385s
sys 0m0.000s
Note how faster this is compared to previous results. Of course, it's not filling a vector
inside the loop (nor matching anything in advance probably too), but you get a range anyway, which you can range over with range-based for
, or even use it to fill a vector
.
As ranging over the iterator_range
creates string_view
s over an original string
(or a null terminated string), this gets very lightweight, never generating unnecessary string allocations.
Just to compare using this split
implementation but actually filling a vector
we could do this:
int main() {
const regex r(" +");
vector<string_view> v;
v.reserve(10);
for (size_t i = 0; i < 1000000; ++i) {
copy(split("a b c", r), back_inserter(v));
v.clear();
}
}
This uses boost range copy algorithm to fill the vector in each iteration, the timing is:
real 0m1.002s
user 0m0.997s
sys 0m0.004s
As can be seen, no much difference in comparison with the optimized string_view
output param version.
Note also there's a proposal for a std::split
that would work like this.