This is for fun right?
SQL is all about processing sets of rows, so if we can convert a 'word' into a set of characters as rows then we can use the 'group' functions to do useful stuff.
Using a 'relational database engine' to do simple character manipulation feels wrong. Still, is it possible to answer your question with just SQL? Yes it is...
Now, i always have a table that has one integer column that has about 500 rows in it that has the ascending sequence 1 .. 500. It is called 'integerseries'. It is a really small table that used a lot so it gets cached in memory. It is designed to replace the from 'select 1 ... union ...
text in queries.
It is useful for generating sequential rows (a table) of anything that you can calculate that is based on a integer by using it in a cross join
(also any inner join
). I use it for generating days for a year, parsing comma delimited strings etc.
Now, the sql mid
function can be used to return the character at a given position. By using the 'integerseries' table i can 'easily' convert a 'word' into a characters table with one row per character. Then use the 'group' functions...
SET @word='Hello World';
SELECT charAtIdx, COUNT(charAtIdx)
FROM (SELECT charIdx.id,
MID(@word, charIdx.id, 1) AS charAtIdx
FROM integerseries AS charIdx
WHERE charIdx.id <= LENGTH(@word)
ORDER BY charIdx.id ASC
) wordLetters
GROUP BY
wordLetters.charAtIdx
ORDER BY charAtIdx ASC
Output:
charAtIdx count(charAtIdx)
--------- ------------------
1
d 1
e 1
H 1
l 3
o 2
r 1
W 1
Note: The number of rows in the output is the number of different characters in the string. So, if the number of output rows is counted then the number of 'different letters' will be known.
This observation is used in the final query.
The final query:
The interesting point here is to move the 'integerseries' 'cross join' restrictions (1 .. length(word)) into the actual 'join' rather than do it in the where
clause. This provides the optimizer with clues as to how to restrict the data produced when doing the join
.
SELECT
wordLetterCounts.wordId,
wordLetterCounts.word,
COUNT(wordLetterCounts.wordId) AS letterCount
FROM
(SELECT words.id AS wordId,
words.word AS word,
iseq.id AS charPos,
MID(words.word, iseq.id, 1) AS charAtPos,
COUNT(MID(words.word, iseq.id, 1)) AS charAtPosCount
FROM
words
JOIN integerseries AS iseq
ON iseq.id BETWEEN 1 AND words.wordlen
GROUP BY
words.id,
MID(words.word, iseq.id, 1)
) AS wordLetterCounts
GROUP BY
wordLetterCounts.wordId
Output:
wordId word letterCount
------ -------------------- -------------
1 3333333333 1
2 1113333333 2
3 1112222444 3
4 Hello World 8
5 funny - not so much? 13
Word Table and Data:
CREATE TABLE `words` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`word` varchar(128) COLLATE utf8mb4_unicode_ci NOT NULL,
`wordlen` int(11) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=6 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
/*Data for the table `words` */
insert into `words`(`id`,`word`,`wordlen`) values (1,'3333333333',10);
insert into `words`(`id`,`word`,`wordlen`) values (2,'1113333333',10);
insert into `words`(`id`,`word`,`wordlen`) values (3,'1112222444',10);
insert into `words`(`id`,`word`,`wordlen`) values (4,'Hello World',11);
insert into `words`(`id`,`word`,`wordlen`) values (5,'funny - not so much?',20);
Integerseries table: range 1 .. 30 for this example.
CREATE TABLE `integerseries` (
`id` int(11) unsigned NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=500 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci