I don't know where your SMOTE()
and Fisher()
functions are coming from, but the answer is yes you can definitely do this. In order to do so you will need to write a wrapper class around those functions though. The easiest way to this is inherit sklearn's BaseEstimator
and TransformerMixin
classes, see this for an example: http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html
If this isn't making sense to you, post the details of at least one of your functions (the library it comes from or your code if you wrote it yourself) and we can go from there.
EDIT:
I apologize, I didn't look at your functions closely enough to realize that they transform your target in addition to your training data (i.e. both X and y). Pipeline does not support transformations to your target so you will have do them prior as you originally were. For your reference, here is what it would look like to write your custom class for your Fisher process which would work if the function itself did not need to affect your target variable.
>>> from sklearn.base import BaseEstimator, TransformerMixin
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.svm import SVC
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.grid_search import GridSearchCV
>>> from sklearn.datasets import load_iris
>>>
>>> class Fisher(BaseEstimator, TransformerMixin):
... def __init__(self,percentile=0.95):
... self.percentile = percentile
... def fit(self, X, y):
... from numpy import shape, argsort, ceil
... X_pos, X_neg = X[y==1], X[y==0]
... X_mean = X.mean(axis=0)
... X_pos_mean, X_neg_mean = X_pos.mean(axis=0), X_neg.mean(axis=0)
... deno = (1.0/(shape(X_pos)[0]-1))*X_pos.var(axis=0) + (1.0/(shape(X_neg)[0]-1))*X_neg.var(axis=0)
... num = (X_pos_mean - X_mean)**2 + (X_neg_mean - X_mean)**2
... F = num/deno
... sort_F = argsort(F)[::-1]
... n_feature = (float(self.percentile)/100)*shape(X)[1]
... self.ind_feature = sort_F[:ceil(n_feature)]
... return self
... def transform(self, x):
... return x[self.ind_feature,:]
...
>>>
>>> data = load_iris()
>>>
>>> pipeline = Pipeline([
... ('fisher', Fisher()),
... ('normal',StandardScaler()),
... ('svm',SVC(class_weight='auto'))
... ])
>>>
>>> grid = {
... 'fisher__percentile':[0.75,0.50],
... 'svm__C':[1,2]
... }
>>>
>>> model = GridSearchCV(estimator = pipeline, param_grid=grid, cv=2)
>>> model.fit(data.data,data.target)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/grid_search.py", line 596, in fit
return self._fit(X, y, ParameterGrid(self.param_grid))
File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/grid_search.py", line 378, in _fit
for parameters in parameter_iterable
File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 653, in __call__
self.dispatch(function, args, kwargs)
File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 400, in dispatch
job = ImmediateApply(func, args, kwargs)
File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 138, in __init__
self.results = func(*args, **kwargs)
File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1239, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/pipeline.py", line 130, in fit
self.steps[-1][-1].fit(Xt, y, **fit_params)
File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/svm/base.py", line 149, in fit
(X.shape[0], y.shape[0]))
ValueError: X and y have incompatible shapes.
X has 1 samples, but y has 75.