Although StringBuilder.replace()
is a huge improvement compared to String.replace()
, it is still very far from being optimal.
The problem with StringBuilder.replace()
is that if the replacement has different length than the replaceable part (applies to our case), a bigger internal char
array might have to be allocated, and the content has to be copied, and then the replace will occur (which also involves copying).
Imagine this: You have a text with 10.000 characters. If you want to replace the "XY"
substring found at position 1
(2nd character) to "ABC"
, the implementation has to reallocate a char
buffer which is at least larger by 1, has to copy the old content to the new array, and it has to copy 9.997 characters (starting at position 3
) to the right by 1 to fit "ABC"
into the place of "XY"
, and finally characters of "ABC"
are copied to the starter position 1
. This has to be done for every replace! This is slow.
Faster Solution: Building Output On-The-Fly
We can build the output on-the-fly: parts that don't contain replaceable texts can simply be appended to the output, and if we find a replaceable fragment, we append the replacement instead of it. Theoretically it's enough to loop over the input only once to generate the output. Sounds simple, and it's not that hard to implement it.
Implementation:
We will use a Map
preloaded with mappings of the replaceable-replacement strings:
Map<String, String> map = new HashMap<>();
map.put("<h1>", "<big><big><big><b>");
map.put("</h1>", "</b></big></big></big>");
map.put("<h2>", "<big><big>");
map.put("</h2>", "</big></big>");
map.put("<h3>", "<big>");
map.put("</h3>", "</big>");
map.put("<h4>", "<b>");
map.put("</h4>", "</b>");
map.put("<h5>", "<small><b>");
map.put("</h5>", "</b></small>");
map.put("<h6>", "<small>");
map.put("</h6>", "</small>");
And using this, here is the replacer code: (more explanation after the code)
public static String replaceTags(String src, Map<String, String> map) {
StringBuilder sb = new StringBuilder(src.length() + src.length() / 2);
for (int pos = 0;;) {
int ltIdx = src.indexOf('<', pos);
if (ltIdx < 0) {
// No more '<', we're done:
sb.append(src, pos, src.length());
return sb.toString();
}
sb.append(src, pos, ltIdx); // Copy chars before '<'
// Check if our hit is replaceable:
boolean mismatch = true;
for (Entry<String, String> e : map.entrySet()) {
String key = e.getKey();
if (src.regionMatches(ltIdx, key, 0, key.length())) {
// Match, append the replacement:
sb.append(e.getValue());
pos = ltIdx + key.length();
mismatch = false;
break;
}
}
if (mismatch) {
sb.append('<');
pos = ltIdx + 1;
}
}
}
Testing it:
String in = "Yo<h1>TITLE</h1><h3>Hi!</h3>Nice day.<h6>Hi back!</h6>End";
System.out.println(in);
System.out.println(replaceTags(in, map));
Output: (wrapped to avoid scroll bar)
Yo<h1>TITLE</h1><h3>Hi!</h3>Nice day.<h6>Hi back!</h6>End
Yo<big><big><big><b>TITLE</b></big></big></big><big>Hi!</big>Nice day.
<small>Hi back!</small>End
This solution is faster than using regular expressions as that involves much overhead, like compiling a Pattern
, creating a Matcher
etc. and regexp is also much more general. It also creates many temporary objects under the hood which are thrown away after the replace. Here I only use a StringBuilder
(plus char
array under its hood) and the code iterates over the input String
only once. Also this solution is much faster that using StringBuilder.replace()
as detailed at the top of this answer.
Notes and Explanation
I initialized the StringBuilder
in the replaceTags()
method like this:
StringBuilder sb = new StringBuilder(src.length() + src.length() / 2);
So basically I created it with an initial capacity of 150% of the length of the original String
. This is because our replacements are longer than the replaceable texts, so if replacing occurs, the output will obviously be longer than the input. Giving a larger initial capacity to StringBuilder
will result in no internal char[]
reallocation at all (of course the required initial capacity depends on the replaceable-replacement pairs and their frequency/occurrence in the input, but this +50% is a good upper estimation).
I also utilized the fact that all replaceable strings start with a '<'
character, so finding the next potential replaceable position becomes blazing-fast:
int ltIdx = src.indexOf('<', pos);
It's just a simple loop and char
comparisons inside String
, and since it always starts searching from pos
(and not from the start of the input), overall the code iterates over the input String
only once.
And finally to tell if a replaceable String
does occur at the potential position, we use the String.regionMatches()
method to check the replaceable stings which is also blazing-fast as all it does is just compares char
values in a loop and returns at the very first mismatching character.
And a PLUS:
The question doesn't mention it, but our input is an HTML document. HTML tags are case-insensitive which means the input might contain <H1>
instead of <h1>
.
To this algorithm this is not a problem. The regionMatches()
in the String
class has an overload which supports case-insensitive comparison:
boolean regionMatches(boolean ignoreCase, int toffset, String other,
int ooffset, int len);
So if we want to modify our algorithm to also find and replace input tags which are the same but are written using different letter case, all we have to modify is this one line:
if (src.regionMatches(true, ltIdx, key, 0, key.length())) {
Using this modified code, replaceable tags become case-insensitive:
Yo<H1>TITLE</H1><h3>Hi!</h3>Nice day.<H6>Hi back!</H6>End
Yo<big><big><big><b>TITLE</b></big></big></big><big>Hi!</big>Nice day.
<small>Hi back!</small>End