This is hard to do in one step. Writing a single regex that does that is virtually impossible.
Try a two-step approach.
- Put a link around every "Paris" there is, regardless of whether it is already inside another link.
- Find all incorrectly nested links (<a href="..."><a href="...">Paris</a></a>), and eliminate the inner link.
Regex for step one is dead-simple:
Paris
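Regex for step two is slightly more complex:
(<a[^>]+>(?:(?!</a>).)*?)<a[^>]+>(Paris)</a>
Use that one on the whole string and replace each match with the content of match groups 1 and 2, effectively removing the surplus inner link.
Explanation of regex #2 in plain words:
- Find an opening link tag (<a[^>]+>), optionally followed by any characters, as long as none of them starts a closing link tag ((?:(?!</a>).)*?). Save it into match group 1. The look-ahead has to be checked before every character; checked only once at the end, it would let group 1 run across a closing </a> and strip the link from an unrelated, non-nested "Paris".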
- Now look for the next link (<a[^>]+>). Make sure it is there, but do not save it.
- Now look for the word Paris. Save it into match group 2.
- Look for a closing link (</a>). Make sure it is there, but don't save it.
- Replace everything with the content of groups 1 and 2, thereby losing everything you did not save.
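The two steps can be sketched in Python roughly as follows. This is a minimal sketch, not a definitive implementation: the link target PARIS_LINK and the function name link_paris are made up for illustration, and the pattern checks the look-ahead before every character, (?:(?!</a>).)*?, on the assumption that group 1 is never meant to cross a closing </a>. It also assumes "Paris" does not occur inside tag attributes.

```python
import re

# Hypothetical link target; the original does not say where "Paris" should point.
PARIS_LINK = '<a href="/paris.html">Paris</a>'

# Step two pattern: an opening <a ...>, then characters none of which starts
# "</a>", then a nested <a ...>Paris</a>. Groups 1 and 2 are kept.
NESTED = re.compile(r'(<a[^>]+>(?:(?!</a>).)*?)<a[^>]+>(Paris)</a>')


def link_paris(html: str) -> str:
    # Step one: wrap every "Paris", even where a link is already present.
    html = html.replace("Paris", PARIS_LINK)
    # Step two: drop the inner link of each nested pair. Repeat until nothing
    # changes, since one pass removes at most one inner link per outer link.
    while True:
        html, count = NESTED.subn(r"\1\2", html)
        if count == 0:
            return html


result = link_paris('Visit Paris. <a href="/fr">See Paris now</a>')
```

Here the free-standing "Paris" keeps its new link, while the one inside the existing /fr link is unwrapped again.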
The approach assumes these side conditions:
- Your input HTML is not horribly broken.
- Your regex flavor supports non-greedy quantifiers (.*?) and zero-width negative look-ahead assertions ((?!...)).
- In step one you wrap the word "Paris" in a link only, with no additional characters. Every "Paris" becomes "<a href="...">Paris</a>", or step two will fail (until you change the second regex).
BTW: regex #2 explicitly allows for constructs like this:
<a href="">in the <b>capital of France</b>, <a href="">Paris</a></a>
The surplus inner link comes from step one; the replacement result of step two will be:
<a href="">in the <b>capital of France</b>, Paris</a>
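As a quick check of this example, here is a small Python sketch that runs step two alone on the nested construct and on two sibling links. The variable names are illustrative, and the pattern is the per-character look-ahead variant, (?:(?!</a>).)*?, which assumes that group 1 must not run past a closing </a>.

```python
import re

# Step two only: keep groups 1 and 2, drop the inner <a ...> and its </a>.
NESTED = re.compile(r'(<a[^>]+>(?:(?!</a>).)*?)<a[^>]+>(Paris)</a>')

nested = '<a href="">in the <b>capital of France</b>, <a href="">Paris</a></a>'
siblings = '<a href="">London</a> and <a href="">Paris</a>'

fixed = NESTED.sub(r"\1\2", nested)        # inner link removed
untouched = NESTED.sub(r"\1\2", siblings)  # not nested, so left alone
```

Other inline tags such as <b> inside the outer link do not get in the way, because only a closing </a> stops group 1.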