This has nothing to do with Ruby internals; you are comparing apples with oranges. In your first example you group the 700-character string 100,000 times and then take the counts, so the problem is in your logic, not in the counting. In the second approach you only count. Both approaches ultimately rely on count.
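For reference, the slow version presumably looks something like this (my reconstruction from the description above, not your exact code): the group_by runs inside the loop, so the grouping work is repeated 100,000 times.

def count_frequency_using_chars(sequence)
  100000.times do
    # the grouping is redone on every iteration -- this is the expensive part
    sequence.chars.group_by{|x| x}.map{|letter, array| [letter, array.count]}
  end
end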
Just change the first example like this:
def count_frequency_using_chars(sequence)
  grouped = sequence.chars.group_by{|x| x}
  100000.times do
    grouped.map{|letter, array| [letter, array.count]}
  end
end
And it is as fast as your second approach.

Edit

This approach is about 3x faster than count_frequency_using_count; check the benchmarks:
def count_frequency_using_chars_with_single_group(sequence)
  grouped = sequence.chars.group_by{|x| x}
  100000.times do
    grouped.map{|letter, array| [letter, array.count]}
  end
end

def count_frequency_using_count(sequence)
  100000.times do
    ["a", "c", "g", "t"].map{|letter| sequence.count(letter)}
  end
end
Benchmark.bm do |benchmark|
  benchmark.report do
    pp count_frequency_using_chars_with_single_group(sequence)
  end
  benchmark.report do
    pp count_frequency_using_count(sequence)
  end
end
user system total real
0.410000 0.000000 0.410000 ( 0.419100) #group_by approach
1.330000 0.000000 1.330000 ( 1.324431) #simple char count approach
Andrew, to your comments:

"measuring the character composition of 100000 sequences once each, not the character composition of one sequence 100000 times"

your count approach is still much slower than the group_by approach. I benchmarked the large strings as you suggested:
seq = "gattaca" * 10000
#seq length is 70000
arr_seq = (1..10).map {|x| seq}
#10 seq items
and changed the methods to handle multiple sequences:
def count_frequency_using_chars_with_single_group(sequences)
  sequences.each do |sequence|
    grouped = sequence.chars.group_by{|x| x}
    100000.times do
      grouped.map{|letter, array| [letter, array.count]}
    end
  end
end
def count_frequency_using_count(sequences)
  sequences.each do |sequence|
    100000.times do
      ["a", "c", "g", "t"].map{|letter| sequence.count(letter)}
    end
  end
end
Benchmark.bm do |benchmark|
  benchmark.report do
    pp count_frequency_using_chars_with_single_group(arr_seq)
  end
  benchmark.report do
    pp count_frequency_using_count(arr_seq)
  end
end
For processing each of 10 sequences of length 70,000, 100,000 times:
user system total real
3.710000 0.040000 3.750000 ( 23.452889) #modified group_by approach
1039.180000 6.920000 1046.100000 (1071.374730) #simple char count approach
Your simple char count approach is roughly 45 times slower (by real time) than the modified group_by approach for these large strings. I ran the above benchmark for just 10 sequences of 70,000 characters each; assume this for 100 or 1000 sequences, and the simple count approach would never be an option, right?
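For example, scaling the same setup to 100 sequences only requires changing how arr_seq is built (a hypothetical setup, not something I benchmarked here):

seq = "gattaca" * 10000          # length 70000
arr_seq = (1..100).map { seq }   # 100 sequences instead of 10

Benchmark.bm do |benchmark|
  benchmark.report { count_frequency_using_chars_with_single_group(arr_seq) }
  benchmark.report { count_frequency_using_count(arr_seq) }
end

Given the numbers above, the gap between the two approaches would only widen as the number of sequences grows.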