regex - Regular expression replace a word by a link

Question

asked Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

I want to write a regular expression that replaces the word Paris with a link, but only where the word is not already part of a link.

Example:

    i'm living <a href="Paris" atl="Paris link">in Paris</a>, near Paris <a href="gare">Gare du Nord</a>,  i love Paris.

would become

    i'm living.........near <a href="">Paris</a>..........i love <a href="">Paris</a>.

1 Answer

深蓝 · Answer 1 · 2021-10-23T19:56:05+0000

This is hard to do in one step. Writing a single regex that does that is virtually impossible.

Try a two-step approach.

1. Put a link around every "Paris" there is, regardless of whether it already sits inside another link.
2. Find all incorrectly nested links (<a href="..."><a href="...">Paris</a></a>), and eliminate the inner link.

Regex for step one is dead-simple:

    Paris
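
A minimal Python sketch of step one, assuming the standard `re` module. The helper name `wrap_paris` and the empty `href` are illustrative choices. The look-ahead `(?![^<>]*>)` is an addition that is not part of the answer's plain `Paris` pattern: without it, occurrences inside attribute values such as `href="Paris"` would also be wrapped, which would break the tag.

```python
import re

def wrap_paris(html: str) -> str:
    """Step 1: wrap every 'Paris' in a link, nested or not (hypothetical helper)."""
    # (?![^<>]*>) is an assumed refinement: it skips any 'Paris' that sits
    # inside a tag, i.e. where the next angle bracket is a closing '>'.
    return re.sub(r'Paris(?![^<>]*>)', '<a href="">Paris</a>', html)

sample = '<a href="Paris">in Paris</a>, near Paris'
print(wrap_paris(sample))
# <a href="Paris">in <a href="">Paris</a></a>, near <a href="">Paris</a>
```

The first occurrence in the text is now nested inside the existing link, which is exactly what step two has to clean up.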

Regex for step two is slightly more complex:

    (<a[^>]+>(?:(?!</a>).)*?)<a[^>]+>(Paris)</a>

Use that one on the whole string and replace it with the content of match groups 1 and 2, effectively removing the surplus inner link.

Explanation of regex #2 in plain words:

1. Find an opening link tag (<a[^>]+>), optionally followed by any run of characters that contains no closing link tag ((?:(?!</a>).)*?). The look-ahead is tested before every single character, so the match can never run past the end of one link and into the next. Save all of it into match group 1.
2. Now look for the next opening link tag (<a[^>]+>). Make sure it is there, but do not save it.
3. Now look for the word Paris. Save it into match group 2.
4. Look for a closing link tag (</a>). Make sure it is there, but do not save it.
5. Replace everything with the content of groups 1 and 2, thereby losing everything you did not save.
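
The two steps can be sketched together in Python as follows. This is an illustration under stated assumptions, not a definitive implementation: the names `LINK`, `NESTED` and `link_paris` are made up, the look-ahead is written in its per-character form `(?:(?!</a>).)*?`, step one skips `Paris` inside tag attributes (an added refinement), and step two is repeated until nothing changes so that an outer link holding several inner links is fully cleaned.

```python
import re

LINK = '<a href="">Paris</a>'  # illustrative replacement, empty href as in the question

NESTED = re.compile(r'''
    (                    # group 1 starts
      <a[^>]+>           # an opening link tag
      (?:(?!</a>).)*?    # any characters, none of them starting a closing tag
    )                    # group 1 ends
    <a[^>]+>             # the surplus inner opening tag (not saved)
    (Paris)              # group 2: the word itself
    </a>                 # the inner closing tag (not saved)
''', re.VERBOSE)

def link_paris(html: str) -> str:
    # Step 1: link every Paris that is not inside a tag (assumed refinement).
    html = re.sub(r'Paris(?![^<>]*>)', LINK, html)
    # Step 2: keep groups 1 and 2 only, which drops the inner link.
    # Repeat until no nested link is left.
    while True:
        html, count = NESTED.subn(r'\1\2', html)
        if count == 0:
            return html

text = ('i\'m living <a href="Paris" atl="Paris link">in Paris</a>, near Paris '
        '<a href="gare">Gare du Nord</a>,  i love Paris.')
print(link_paris(text))
```

On the question's example, only the two occurrences outside the existing link end up wrapped; "in Paris" keeps its original link.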

The approach assumes these side conditions:

- Your input HTML is not horribly broken.
- Your regex flavor supports non-greedy quantifiers (*?) and zero-width negative look-ahead assertions ((?!...)).
- Step 1 wraps the word "Paris" in a link and nothing else. Every "Paris" becomes <a href="...">Paris</a>; otherwise step two will fail (unless you adapt the second regex).

BTW: regex #2 explicitly allows for constructs like this:

    <a href="">in the <b>capital of France</b>, <a href="">Paris</a></a>

The surplus inner link comes from step one; after step two the result is:

    <a href="">in the <b>capital of France</b>, Paris</a>
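
As a quick check of this case, here is a short Python sketch, assuming the `re` module and the look-ahead written in its per-character form `(?:(?!</a>).)*?`. The variable names are illustrative.

```python
import re

nested = '<a href="">in the <b>capital of France</b>, <a href="">Paris</a></a>'
pattern = r'(<a[^>]+>(?:(?!</a>).)*?)<a[^>]+>(Paris)</a>'

# Keep groups 1 and 2; the inner <a href=""> and its </a> are dropped.
result = re.sub(pattern, r'\1\2', nested)
print(result)
# <a href="">in the <b>capital of France</b>, Paris</a>
```

Other markup inside the outer link, such as the `<b>` element here, passes through untouched because only a closing `</a>` stops group 1.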