regex - Java code/library for generating slugs (for use in pretty URLs)

asked Oct 24, 2021 in Technique by 深蓝 (71.8m points)

Web frameworks such as Rails and Django have built-in support for "slugs", which are used to generate readable and SEO-friendly URLs.

A slug string typically consists only of the characters a-z, 0-9 and -, and can hence be written without URL-escaping (think "foo%20bar").

I'm looking for a Java slug function that, given any valid Unicode string, will return a slug representation (a-z, 0-9 and -).

A trivial slug function would be something along the lines of:

return input.toLowerCase().replaceAll("[^a-z0-9-]", "");

However, this implementation would not handle internationalization and accents (é > e). One way around this would be to enumerate all special cases, but that would not be very elegant. I'm looking for something more well thought out and general.

My question:

What is the most general/practical way to generate Django/Rails type slugs in Java?

1 Answer

深蓝 · Answer 1 · 2021-10-23T17:44:45+0000

Normalize your string using canonical decomposition:

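  import java.text.Normalizer;
  import java.text.Normalizer.Form;
  import java.util.Locale;
  import java.util.regex.Pattern;

  // Anything that is not a word character (a-z, A-Z, 0-9, _) or a hyphen.
  private static final Pattern NONLATIN = Pattern.compile("[^\\w-]");
  private static final Pattern WHITESPACE = Pattern.compile("[\\s]");

  public static String toSlug(String input) {
    String nowhitespace = WHITESPACE.matcher(input).replaceAll("-");
    // NFD splits each accented letter into a base letter plus a combining mark.
    String normalized = Normalizer.normalize(nowhitespace, Form.NFD);
    // Combining marks are not word characters, so they are stripped here.
    String slug = NONLATIN.matcher(normalized).replaceAll("");
    return slug.toLowerCase(Locale.ENGLISH);
  }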
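
To make the behaviour concrete, here is a small self-contained harness around the same steps. The class name SlugDemo and the sample inputs are only illustrative. It also shows two side effects worth knowing about: each whitespace character becomes its own hyphen, and underscores survive because \w matches them.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical demo class; the toSlug body follows the steps described above.
class SlugDemo {
    private static final Pattern NONLATIN = Pattern.compile("[^\\w-]");
    private static final Pattern WHITESPACE = Pattern.compile("[\\s]");

    static String toSlug(String input) {
        String nowhitespace = WHITESPACE.matcher(input).replaceAll("-");
        String normalized = Normalizer.normalize(nowhitespace, Form.NFD);
        String slug = NONLATIN.matcher(normalized).replaceAll("");
        return slug.toLowerCase(Locale.ENGLISH);
    }

    public static void main(String[] args) {
        // Punctuation is dropped, the space becomes a hyphen.
        System.out.println(toSlug("Hello, World!"));               // hello-world
        // Accented letters (written as Unicode escapes) lose their accents.
        System.out.println(toSlug("Cr\u00E8me Br\u00FBl\u00E9e")); // creme-brulee
        // Two spaces produce two hyphens; nothing collapses them.
        System.out.println(toSlug("a  b"));                        // a--b
        // Underscore is a word character, so it is kept.
        System.out.println(toSlug("snake_case title"));            // snake_case-title
    }
}
```

If repeated hyphens or underscores are unwanted in your URLs, an extra replaceAll pass would be needed; that is a design choice the code above does not make for you.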

This is still a fairly naive process, though. It isn't going to do anything for s-sharp (ß - used in German), or any non-Latin-based alphabet (Greek, Cyrillic, CJK, etc).
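
One possible way to cover a handful of such letters is a small replacement table applied before normalization. The sketch below is only an assumption about how that might look: the class name, the table and its entries (ß to ss, æ to ae, ø to o) are illustrative choices, not a complete or authoritative mapping, and it does nothing for Greek, Cyrillic or CJK text, which would need a real transliteration step.

```java
import java.text.Normalizer;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical variant: same pipeline, plus a hand-written table for letters
// that canonical decomposition (NFD) leaves untouched.
class SlugWithSpecialCases {
    private static final Pattern NONLATIN = Pattern.compile("[^\\w-]");
    private static final Pattern WHITESPACE = Pattern.compile("[\\s]");

    // Illustrative entries only; extend to suit the languages you expect.
    private static final Map<String, String> SPECIAL = new LinkedHashMap<>();
    static {
        SPECIAL.put("\u00DF", "ss"); // sharp s
        SPECIAL.put("\u00E6", "ae"); // lowercase ae ligature
        SPECIAL.put("\u00C6", "AE"); // uppercase ae ligature
        SPECIAL.put("\u00F8", "o");  // lowercase o with stroke
        SPECIAL.put("\u00D8", "O");  // uppercase o with stroke
    }

    static String toSlug(String input) {
        String s = input;
        // Apply the special cases first, while the letters are still present.
        for (Map.Entry<String, String> e : SPECIAL.entrySet()) {
            s = s.replace(e.getKey(), e.getValue());
        }
        s = WHITESPACE.matcher(s).replaceAll("-");
        s = Normalizer.normalize(s, Normalizer.Form.NFD);
        s = NONLATIN.matcher(s).replaceAll("");
        return s.toLowerCase(Locale.ENGLISH);
    }

    public static void main(String[] args) {
        // Input spelled with escapes: "Stra(sharp s)e nach (AE)r(o-stroke)".
        System.out.println(toSlug("Stra\u00DFe nach \u00C6r\u00F8")); // strasse-nach-aero
    }
}
```

Without the table, the same input would lose those three letters entirely, because none of them decomposes into an ASCII base letter under NFD.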

Be careful when changing the case of a string. Upper and lower case forms depend on the alphabet. In Turkish, the capitalization of U+0069 (i) is U+0130 (İ), not U+0049 (I), and conversely the lowercase of U+0049 (I) is U+0131 (ı), not U+0069 (i). You therefore risk introducing a non-Latin-1 character back into your string if you use String.toLowerCase() under a Turkish locale.
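
The locale effect is easy to reproduce. The following sketch (class and method names are hypothetical) lowercases the same string under a Turkish locale and under Locale.ROOT, which is why passing an explicit locale, as toSlug does with Locale.ENGLISH, is safer than the no-argument toLowerCase().

```java
import java.util.Locale;

// Hypothetical demo of locale-sensitive lowercasing.
class TurkishCaseDemo {
    // Lowercases with an explicit locale, so the result does not depend
    // on the JVM's default locale.
    static String lower(String s, Locale locale) {
        return s.toLowerCase(locale);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");
        // Under Turkish rules, I lowercases to dotless i (U+0131).
        String tr = lower("TITLE", turkish);
        // Under the root locale, I lowercases to plain ASCII i.
        String root = lower("TITLE", Locale.ROOT);

        System.out.println(tr.equals("t\u0131tle")); // true
        System.out.println(root.equals("title"));    // true
    }
}
```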
