Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
490 views
in Technique[技术] by (71.8m points)

language agnostic - Can you programmatically detect pluralizations of English words, and derive the singular form?

Given some (English) word that we shall assume is a plural, is it possible to derive the singular form? I'd like to avoid lookup/dictionary tables if possible.

Some examples:

Examples  -> Example    a simple 's' suffix
Glitch    -> Glitches   'es' suffix, as opposed to above
Countries -> Country    'ies' suffix.
Sheep     -> Sheep      no change: possible fallback for indeterminate values

Or, this seems to be a fairly exhaustive list.

Suggestions of libraries in language x are fine, as long as they are open-source (ie, so that someone can examine them to determine how to do it in language y)

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

It really depends on what you mean by 'programmatically'. Part of English works on easy to understand rules, and part doesn't. It has to do mainly with frequency. For a brief overview, you can read Pinker's "Words and Rules", but do yourself a favor and don't take the whole generative theory of linguistics entirely to heart. There's a lot more empiricism there than that school of thought really lends to the pursuit.

A lot of English can be statistically lemmatized. By the way, stemming or lemmatization is the term you're looking for. One of the most effective lemmatizers which work off of statistical rules bootstrapped with frequency-based exceptions is the Morpha Lemmatizer. You can give this a shot if you have a project that requires this type of simplification of strings which represent specific terms in English.

There are even more naive approaches that accomplish much with respect to normalizing related terms. Take a look at the Porter Stemmer, which is effective enough to cluster together most terms in English.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...