I need to cluster some data, but I have two constraints:
- All clusters must have exactly two members
- Each cluster must pair one row of type "1" with one row of type "2"
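To make the two constraints concrete, here is a minimal sketch of a check that any proposed grouping would have to pass. `PairConstraints` and `IsValid` are hypothetical names for illustration and are not part of ML.NET; the input is assumed to be a list of (cluster id, type) pairs, one per row.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class PairConstraints
{
    // Hypothetical helper: each element is the cluster id assigned to a row
    // together with that row's "type" value (1 or 2).
    public static bool IsValid(IEnumerable<(uint Cluster, int Type)> assignments)
    {
        return assignments
            .GroupBy(a => a.Cluster)
            .All(g =>
            {
                var types = g.Select(a => a.Type).OrderBy(t => t).ToArray();
                // Exactly two members: one of type 1 and one of type 2.
                return types.Length == 2 && types[0] == 1 && types[1] == 2;
            });
    }
}
```

A grouping with a three-member cluster, or with a cluster holding two rows of the same type, fails this check.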
According to the documentation, ML.NET uses the K-means algorithm to cluster data, and in my case I always know the number of clusters in advance.
My dataset looks something like this:
description,quantity,value,type
270128-MARLBORO SOF MI (0.20) 10X20,1,77.99,1
270129-SHELTON SLIMS BOX MI (0.40) 10X20,2,137.62,1
270174-MARLBORO GOL BOX MI (0.20) 10X20,1,84.87,1
270175-MARLBORO KS BOX MI (0.60) 10X20,3,254.61,1
270176-SHELTON ORIGINAL BOX MI (0.20) 10X20,1,68.81,1
270177-L&M BLUE BOX MI (0.80) 10X20,4,293.6,1
270180-MARLBORO BLUE ICE BOX MI (0.40) 10X20,2,169.74,1
270190-MARLBORO FILTER PLUS MI (0.40) 10X20,2,169.74,1
270193-CHESTERFIELD REMIX BOX MI (0.40) 10X20,2,119.28,1
270196-A SAMPOERNA MENTHOL BOX MI (3.20) 10X20,16,1504.64,1
270197-CHESTERFIELD ORIGINAL BOX MI (0.20) 10X20,1,52.76,1
270200-MARLBORO DOUBLE FUSION KS BOX MI (0.20) 10X20,1,89.46,1
270202-CHESTERFIELD SILVER BOX MI (0.20) 10X20,1,52.75,1
270203-CHESTERFIELD BLUE BOX MI (1.60) 10X20,8,422.08,1
CIG SAMPOERNA MENTHOL,16,1504.64,2
CIG CHESTERFIELD REMIX,2,119.28,2
CIG MARLBORO FILTER PLUS BOX,2,169.74,2
CIG MARLBORO BLUE ICE BOX,2,169.74,2
CIG L & M AZUL BOX,4,293.6,2
CIG SHELTON CURTO BOX,1,68.81,2
CIG MARLBORO VERMELHO BOX,3,254.61,2
CIG MARLBORO GOLD BOX,1,84.87,2
CIG SHELTON LONGO BOX,2,137.62,2
CIG MARLBORO VERMELHO MACO,1,77.99,2
CIG CHESTERFIELD AZUL MACO,8,422.08,2
CIG CHESTERFIELD PRATA,1,52.75,2
CIG MARLBORO DOUBLE MIX BOX,1,89.46,2
CIG CHESTERFIELD AZUL MACO,1,52.76,2
As I said earlier, I need to cluster in pairs, so with 28 rows the K value in this case is 14.
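The value of K follows from the row counts: the sample has 14 rows of each type, so a perfect pairing gives 14 clusters. A minimal sketch of that arithmetic, assuming the "type" column has already been read into an array; `ClusterCount.PairCount` is a hypothetical helper, not an ML.NET API.

```csharp
using System;
using System.Linq;

public static class ClusterCount
{
    // Hypothetical helper: returns the number of type 1 / type 2 pairs,
    // or throws if the rows cannot be split evenly between the two types.
    public static int PairCount(int[] types)
    {
        int ones = types.Count(t => t == 1);
        int twos = types.Count(t => t == 2);
        if (ones != twos || ones + twos != types.Length)
            throw new ArgumentException("Rows cannot be split into type 1 / type 2 pairs.");
        return ones;
    }
}
```

The result could be passed as `numberOfClusters` instead of hard-coding 14.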
Here's my test code:
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

class Program2
{
    private const string _path = "datasetPath";

    static void Main(string[] args)
    {
        var mlContext = new MLContext();
        // The file is comma-separated and has a header row.
        var data = mlContext.Data.LoadFromTextFile<Input>(_path, separatorChar: ',', hasHeader: true);
        var featuresColumnName = "Features";
        var pipeLine = mlContext.Transforms.Categorical.OneHotEncoding("DescriptionOHE", "Description")
            .Append(mlContext.Transforms.Concatenate(featuresColumnName, "DescriptionOHE", "Quantity", "Value"))
            .Append(mlContext.Clustering.Trainers.KMeans(featuresColumnName, numberOfClusters: 14));
        var test = pipeLine.Preview(data);
        // Grouping by assigned cluster number, looked up by column name rather than by position
        var results = test.RowView.GroupBy(x => x.Values.First(v => v.Key == "PredictedLabel").Value).ToList();
    }
}

public class Input
{
    [LoadColumn(0)]
    public string Description { get; set; }
    [LoadColumn(1)]
    public float Quantity { get; set; }
    [LoadColumn(2)]
    public float Value { get; set; }
}

public class Prediction
{
    [ColumnName("PredictedLabel")]
    public uint Cluster { get; set; }
    [ColumnName("Score")]
    public float[] Distance { get; set; }
}
For now I'm not even considering the "type" column, because when I included it as a feature earlier it messed up all the clusters, for obvious reasons.
If you run the code above, you'll see that some clusters have 3 or even 4 members.
Is there a way to enforce the two constraints I mentioned earlier using ML.NET?
If not, what other resources are worth trying to achieve that?
question from:
https://stackoverflow.com/questions/65852527/is-it-possible-to-set-constraints-in-ml-net-k-means-clustering