Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
297 views
in Technique[技术] by (71.8m points)

aggregation framework - MongoDB Document Re-shaping

This question comes out of (as mine usually do) perusing the questions asked on SO and as such, raising another question for myself. So apart from the learning exercise in working towards a solution for a problem, I find that another question pops up, such as this.

The original question as yet remains unaccepted by the OP, and indeed has not been clarified as to what "they" wanted to achieve. But I did give my interpretation, in both the simple and long forms of arriving at a solution.

The process, in the end, has left me wondering that considering the long form of the solution, would there be some new feature to be introduced in the next (Currently expecting 2.6) MongoDB release, using the additional aggregation operators that have been introduced.

So the case is as follows:

Sample Documents

{
    "tracked_item_type" : "Software",
    "tracked_item_name" : "Word",
    "duration" : 9540
}
{
    "tracked_item_type" : "Software",
    "tracked_item_name" : "Excel",
    "duration" : 4000
}
{
    "tracked_item_type" : "Software",
    "tracked_item_name" : "Notepad",
    "duration" : 4000
}
{
    "tracked_item_type" : "Site",
    "tracked_item_name" : "Facebook",
    "duration" : 7920
}
{
    "tracked_item_type" : "Site",
    "tracked_item_name" : "Twitter",
    "duration" : 5555
}
{
    "tracked_item_type" : "Site",
    "tracked_item_name" : "Digital Blasphemy",
    "duration" : 8000
}

Desired Result

The top two results by each type, ordered by the total duration. Even though this is a small sample, duration is considered to be a $sum of many items.

{ 
    "tracked_item_type": "Site",
    "tracked_item_name": "Digital Blasphemy",
    "duration" : 8000
}
{ 
    "tracked_item_type": "Site",
    "tracked_item_name": "Facebook",
    "duration" : 7920
}
{ 
    "tracked_item_type": "Software",
    "tracked_item_name": "Word",
    "duration" : 9540
}
{ 
    "tracked_item_type": "Software",
    "tracked_item_name": "Notepad",
    "duration" : 4000
}

Aggregate Solution

This was my lengthy way to solve the problem

db.collection.aggregate([

    // Group on the types and "sum" of duration
    {"$group": {
        "_id": {
            "tracked_item_type": "$tracked_item_type",
            "tracked_item_name": "$tracked_item_name"
         },
        "duration": {"$sum": "$duration"}
    }},

    // Sort by type and duration descending
    {"$sort": { "_id.tracked_item_type": 1, "duration": -1 }},

    /* The fun part */

    // Re-shape results to "sites" and "software" arrays 
    {"$group": { 
        "_id": null,
        "sites": {"$push":
            {"$cond": [
                {"$eq": ["$_id.tracked_item_type", "Site" ]},
                { "_id": "$_id", "duration": "$duration" },
                null
            ]}
        },
        "software": {"$push":
            {"$cond": [
                {"$eq": ["$_id.tracked_item_type", "Software" ]},
                { "_id": "$_id", "duration": "$duration" },
                null
            ]}
        }
    }},


    // Remove the null values for "software"
    {"$unwind": "$software"},
    {"$match": { "software": {"$ne": null} }},
    {"$group": { 
        "_id": "$_id",
        "software": {"$push": "$software"}, 
        "sites": {"$first": "$sites"} 
    }},

    // Remove the null values for "sites"
    {"$unwind": "$sites"},
    {"$match": { "sites": {"$ne": null} }},
    {"$group": { 
        "_id": "$_id",
        "software": {"$first": "$software"},
        "sites": {"$push": "$sites"} 
    }},


    // Project out software and limit to the *top* 2 results
    {"$unwind": "$software"},
    {"$project": { 
        "_id": 0,
        "_id": { "_id": "$software._id", "duration": "$software.duration" },
        "sites": "$sites"
    }},
    {"$limit" : 2},


    // Project sites, grouping multiple software per key, requires a sort
    // then limit the *top* 2 results
    {"$unwind": "$sites"},
    {"$group": {
        "_id": { "_id": "$sites._id", "duration": "$sites.duration" },
        "software": {"$push": "$_id" }
    }},
    {"$sort": { "_id.duration": -1 }},
    {"$limit": 2}

])

The "Not quite there yet" Output

And the point where aggregation falls short of getting to the final result. At least to my current understanding.

{
    "result" : [
        {
            "_id" : {
                "_id" : {
                    "tracked_item_type" : "Site",
                    "tracked_item_name" : "Digital Blasphemy"
                 },
                 "duration" : 8000
           },
            "software" : [
                {
                    "_id" : {
                        "tracked_item_type" : "Software",
                        "tracked_item_name" : "Word"
                    },
                    "duration" : 9540
                },

                {
                    "_id" : {
                        "tracked_item_type" : "Software",
                        "tracked_item_name" : "Notepad"
                    },
                    "duration" : 4000
                }
            ]
        },
        {
            "_id" : {
                "_id" : {
                    "tracked_item_type" : "Site",
                    "tracked_item_name" : "Facebook"
                },
                "duration" : 7920
            },
            "software" : [
                {
                    "_id" : {
                        "tracked_item_type" : "Software",
                        "tracked_item_name" : "Word"
                    },
                    "duration" : 9540
                },
                {
                    "_id" : {
                        "tracked_item_type" : "Software",
                        "tracked_item_name" : "Notepad"
                    },
                    "duration" : 4000
                }
            ]
        }
    ],
    "ok" : 1
}

This all seemed very reasonable (to Me anyway) that the result, while not complete could be post-processed in code in order to massage it into the desired form.

But indeed, it seems an exercise, and a point of intrigue as to whether this could be achieved with the use of any upcoming features for aggregation (or possibly another technique that has eluded me) to get to the desired result form.

So feel free to answer with any, suggestions / pointers as to how this could be achieved.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

Here is an aggregation that finds the top two by duration in each category (it does break "ties" arbitrarily, which seems to be in line with your sample output):

var pregroup = { "$group" : {
        "_id" : {
            "type" : "$tracked_item_type",
            "name" : "$tracked_item_name"
        },
        "duration" : {
            "$sum" : "$duration"
        }
    }
};
var sort = { "$sort" : { "_id.type" : 1, "duration" : -1 } };
var group1 = { "$group" : {
        "_id" : "$_id.type",
        "num1" : {
            "$first" : {
                "name" : "$_id.name",
                "dur" : "$duration"
            }
        },
        "other" : {
            "$push" : {
                "name" : "$_id.name",
                "dur" : "$duration"
            }
        },
    "all" : {
        "$push" : {
            "name" : "$_id.name",
            "dur" : "$duration"
        }
    }
    }
};
var unwind = { "$unwind" : "$other" };
project = {
    "$project" : {
        "keep" : {
            "$ne" : [
                "$num1.name",
                "$other.name"
            ]
        },
        "num1" : 1,
        "all" : 1,
        "other" : 1
    }
};
var match = { "$match" : { "keep" : true } };
var sort2 = { "$sort" : { "_id" : 1, "other.dur" : -1 } };
var group2 = { "$group" : {
        "_id" : "$_id",
        "numberOne" : {
            "$first" : "$num1"
        },
        "numberTwo" : {
            "$first" : "$other"
        },
    "all" : {
        "$first" : "$all"
    }
    }
};
unwind2 = { "$unwind" : "$all" };
project2 = { "$project" : {
    "_id" : 0,
    "tracked_item_type" : "$_id",
    "tracked_item_name" : {
        "$cond" : [
            {
                "$or" : [
                    {
                        "$eq" : [
                            "$all.name",
                            "$numberOne.name"
                        ]
                    },
                    {
                        "$eq" : [
                            "$all.name",
                            "$numberTwo.name"
                        ]
                    }
                ]
            },
            "$all.name",
            null
        ]
    },
    "duration" : {
        "$cond" : [
            {
                "$or" : [
                    {
                        "$eq" : [
                            "$all.name",
                            "$numberOne.name"
                        ]
                    },
                    {
                        "$eq" : [
                            "$all.name",
                            "$numberTwo.name"
                        ]
                    }
                ]
            },
            "$all.dur",
            null
        ]
    }
}
}
match2 = { "$match" : { "tracked_item_name" : { "$ne" : null } } };

Running this with your sample data:

db.top2.aggregate(pregroup, sort, group1, unwind, project, match, sort2, group2, unwind2, project2, match2).toArray()
[
    {
        "tracked_item_type" : "Software",
        "tracked_item_name" : "Word",
        "duration" : 9540
    },
    {
        "tracked_item_type" : "Software",
        "tracked_item_name" : "Notepad",
        "duration" : 4000
    },
    {
        "tracked_item_type" : "Site",
        "tracked_item_name" : "Digital Blasphemy",
        "duration" : 8000
    },
    {
        "tracked_item_type" : "Site",
        "tracked_item_name" : "Facebook",
        "duration" : 7920
    }
]

This will work with arbitrary number of domains (different tracked item type values) and you don't need to know all their names in advance. However, to generalize it to top three, top four, top five, etc. will add four more stages for each additional top "N" value - not very practical or pretty.

Please vote up this jira ticket to get a more native implementation of "top N" functionality in the aggregation framework.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

2.1m questions

2.1m answers

60 comments

56.9k users

...