java - Finding all the common substrings of given two strings

Question

Welcome To Ask or Share your Answers For Others

java - Finding all the common substrings of given two strings

asked Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

java - Finding all the common substrings of given two strings

I have come across a problem statement to find the all the common sub-strings between the given two sub-strings such a way that in every case you have to print the longest sub-string. The problem statement is as follows:

Write a program to find the common substrings between the two given strings. However, do not include substrings that are contained within longer common substrings.

For example, given the input strings eatsleepnightxyz and eatsleepabcxyz, the results should be:

eatsleep (due to eatsleepnightxyz eatsleepabcxyz)

xyz (due to eatsleepnightxyz eatsleepabcxyz)

a (due to eatsleepnightxyz eatsleepabcxyz)

t (due to eatsleepnightxyz eatsleepabcxyz)

However, the result set should not include e from eatsleepnightxyz eatsleepabcxyz, because both es are already contained in the eatsleep mentioned above. Nor should you include ea, eat, ats, etc., as those are also all covered by eatsleep.

In this, you don't have to make use of String utility methods like: contains, indexOf, StringTokenizer, split and replace.

My Algorithm is as follows: I am starting with brute force and will switch to more optimized solution when I improve my basic understanding.

 For String S1:
     Find all the substrings of S1 of all the lengths
     While doing so: Check if it is also a substring of 
     S2.

Attempt to figure out the time complexity of my approach.

Let the two given strings be n1-String and n2-String

The number of substrings of S1 is clearly n1(n1+1)/2.
But we have got to find the average length a substring of S1.
Let’s say it is m. We’ll find m separately.
Time Complexity to check whether an m-String is a substring of an n-String is O(n*m).
Now, we are checking for each m-String is a substring of S2, which is an n2-String.
This, as we have seen above, is an O(n² m) algorithm.
The time required by the overall algorithm then is
Tn=(Number of substrings in S1) * (average substring lengthtime for character comparison procedure)
By performing certain calculations, I came to conclusion that the time complexity is O(n³ m²)
Now, our job is to find m in terms of n1.

Attempt to find m in terms of n1.

T_n = (n)(1) + (n-1)(2) + (n-2)(3) + ..... + (2)(n-1) + (1)(n)
where T_n is the sum of lengths of all the substrings.

Average will be the division of this sum by the total number of Substrings produced.

This, simply is a summation and division problem whose solution is as follows O(n)

Therefore...

Running time of my algorithm is O(n^5).

With this in mind I wrote the following code:

 package pack.common.substrings;

 import java.util.ArrayList;
 import java.util.LinkedHashSet;
 import java.util.List;
 import java.util.Set;

 public class FindCommon2 {
    public static final Set<String> commonSubstrings = new      LinkedHashSet<String>();

 public static void main(String[] args) {
    printCommonSubstrings("neerajisgreat", "neerajisnotgreat");
    System.out.println(commonSubstrings);
}

 public static void printCommonSubstrings(String s1, String s2) {
    for (int i = 0; i < s1.length();) {
        List<String> list = new ArrayList<String>();
        for (int j = i; j < s1.length(); j++) {
            String subStr = s1.substring(i, j + 1);
            if (isSubstring(subStr, s2)) {
                list.add(subStr);
            }
        }
        if (!list.isEmpty()) {
            String s = list.get(list.size() - 1);
            commonSubstrings.add(s);
            i += s.length();
        }
    }
 }

 public static boolean isSubstring(String s1, String s2) {
    boolean isSubstring = true;
    int strLen = s2.length();
    int strToCheckLen = s1.length();
    if (strToCheckLen > strLen) {
        isSubstring = false;
    } else {
        for (int i = 0; i <= (strLen - strToCheckLen); i++) {
            int index = i;
            int startingIndex = i;
            for (int j = 0; j < strToCheckLen; j++) {
                if (!(s1.charAt(j) == s2.charAt(index))) {
                    break;
                } else {
                    index++;
                }
            }
            if ((index - startingIndex) < strToCheckLen) {
                isSubstring = false;
            } else {
                isSubstring = true;
                break;
            }
        }
    }
    return isSubstring;
 }
}

Explanation for my code:

 printCommonSubstrings: Finds all the substrings of S1 and 
                        checks if it is also a substring of 
                        S2.
 isSubstring : As the name suggests, it checks if the given string 
               is a substring of the other string.

Issue: Given the inputs

  S1 = “neerajisgreat”;
  S2 = “neerajisnotgreat”
  S3 = “rajeatneerajisnotgreat”

In case of S1 and S2, the output should be: neerajis and great but in case of S1 and S3, the output should have been: neerajis, raj, great, eat but still I am getting neerajis and great as output. I need to figure this out.

How should I design my code?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Answer

深蓝 · Answer 1 · 2021-10-23T18:28:23+0000

You would be better off with a proper algorithm for the task rather than a brute-force approach. Wikipedia describes two common solutions to the longest common substring problem: suffix-tree and dynamic-programming.

The dynamic programming solution takes O(n m) time and O(n m) space. This is pretty much a straightforward Java translation of the Wikipedia pseudocode for the longest common substring:

public static Set<String> longestCommonSubstrings(String s, String t) {
    int[][] table = new int[s.length()][t.length()];
    int longest = 0;
    Set<String> result = new HashSet<>();

    for (int i = 0; i < s.length(); i++) {
        for (int j = 0; j < t.length(); j++) {
            if (s.charAt(i) != t.charAt(j)) {
                continue;
            }

            table[i][j] = (i == 0 || j == 0) ? 1
                                             : 1 + table[i - 1][j - 1];
            if (table[i][j] > longest) {
                longest = table[i][j];
                result.clear();
            }
            if (table[i][j] == longest) {
                result.add(s.substring(i - longest + 1, i + 1));
            }
        }
    }
    return result;
}

Now, you want all of the common substrings, not just the longest. You can enhance this algorithm to include shorter results. Let's examine the table for the example inputs eatsleepnightxyz and eatsleepabcxyz:

  e a t s l e e p a b c x y z
e 1 0 0 0 0 1 1 0 0 0 0 0 0 0
a 0 2 0 0 0 0 0 0 1 0 0 0 0 0
t 0 0 3 0 0 0 0 0 0 0 0 0 0 0
s 0 0 0 4 0 0 0 0 0 0 0 0 0 0
l 0 0 0 0 5 0 0 0 0 0 0 0 0 0
e 1 0 0 0 0 6 1 0 0 0 0 0 0 0
e 1 0 0 0 0 1 7 0 0 0 0 0 0 0
p 0 0 0 0 0 0 0 8 0 0 0 0 0 0
n 0 0 0 0 0 0 0 0 0 0 0 0 0 0
i 0 0 0 0 0 0 0 0 0 0 0 0 0 0
g 0 0 0 0 0 0 0 0 0 0 0 0 0 0
h 0 0 0 0 0 0 0 0 0 0 0 0 0 0
t 0 0 1 0 0 0 0 0 0 0 0 0 0 0
x 0 0 0 0 0 0 0 0 0 0 0 1 0 0
y 0 0 0 0 0 0 0 0 0 0 0 0 2 0
z 0 0 0 0 0 0 0 0 0 0 0 0 0 3

The eatsleep result is obvious: that's the 12345678 diagonal streak at the top-left.
The xyz result is the 123 diagonal at the bottom-right.
The a result is indicated by the 1 near the top (second row, ninth column).
The t result is indicated by the 1 near the bottom left.

What about the other 1s at the left, the top, and next to the 6 and 7? Those don't count because they appear within the rectangle formed by the 12345678 diagonal — in other words, they are already covered by eatsleep.

I recommend doing one pass doing nothing but building the table. Then, make a second pass, iterating backwards from the bottom-right, to gather the result set.

Categories

java - Finding all the common substrings of given two strings

java - Finding all the common substrings of given two strings

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags