This is a continuously updated repository that documents a personal journey of learning data science and machine learning related topics.
Goal: Introduce machine learning content in Jupyter Notebook format. The content aims to strike a good balance between mathematical notation, educational from-scratch implementations using Python's scientific stack (numpy, numba, scipy, pandas, matplotlib, pyspark, etc.), and usage of open-source libraries such as scikit-learn, fasttext, huggingface, onnx, xgboost, lightgbm, pytorch, keras, tensorflow, gensim, h2o, ortools, ray tune, etc.
Documentation Listings
model deployment
FastAPI & Azure Kubernetes Cluster. End to end example of training a model and hosting it as a service. [folder]
Quick Intro to Gradient Boosted Tree Inferencing. [nbviewer][html]
Finetuning Pre-trained BERT Model on Text Classification Task And Inferencing with ONNX Runtime. [nbviewer][html]
operation_research
Operation Research Quick Intro Via Ortools. [nbviewer][html]
reinforcement learning
Introduction to Multi-armed Bandits. [nbviewer][html]
ad
Notes related to advertising domain.
Quick introduction to generalized second price auction. [nbviewer][html]
search
Information retrieval. Some examples are demonstrated using ElasticSearch.
Introduction to BM25 (Best Match). [nbviewer][html]
time series
Forecasting methods for timeseries-based data.
Getting started with time series analysis with Exponential Smoothing (Holt-Winters). [nbviewer][html]
Framing time series problem as supervised-learning. [nbviewer][html]
First Foray Into Discrete/Fast Fourier Transformation. [nbviewer][html]
projects
End-to-end projects including data preprocessing and model building.
A/B testing, a.k.a. experimental design. Includes: a quick review of the necessary statistical concepts; methods and workflow/thought process for conducting the test; and caveats to look out for.
Frequentist A/B testing (includes a quick review of concepts such as p-value, confidence interval). [nbviewer][html]
Quantile Regression and its application in A/B testing.
Quick Introduction to Quantile Regression. [nbviewer][html]
Quantile Regression's application in A/B testing. [nbviewer][html]
Quick introduction to difference in difference. [nbviewer][html]
model selection
Methods for selecting, improving, evaluating models/algorithms.
K-fold cross validation, grid/random search from scratch. [nbviewer][html]
AUC (Area under the ROC curve and precision/recall curve) from scratch (includes the process of building a custom scikit-learn transformer). [nbviewer][html]
Evaluation metrics for imbalanced dataset. [nbviewer][html]
Detecting collinearity amongst features (Variance Inflation Factor for numeric features and Cramer's V statistics for categorical features), also introduces Linear Regression from a Maximum Likelihood perspective and the R-squared evaluation metric. [nbviewer][html]
Curated tips and tricks for technical and soft skills. [nbviewer][html]
Principal Component Analysis (PCA) from scratch. [nbviewer][html]
Introduction to Singular Value Decomposition (SVD), also known as Latent Semantic Analysis/Indexing (LSA/LSI). [nbviewer][html]
recsys
Recommendation systems with a focus on matrix factorization methods. Newcomers to the field should go through the first notebook to understand the basics of matrix factorization methods.
Alternating Least Squares with Weighted Regularization (ALS-WR) from scratch. [nbviewer][html]
ALS-WR for implicit feedback data from scratch & Mean Average Precision at k (mapk) and Normalized Discounted Cumulative Gain (ndcg) evaluation. [nbviewer][html]
Bayesian Personalized Ranking (BPR) from scratch & AUC evaluation. [nbviewer][html]
WARP (Weighted Approximate-Rank Pairwise) Loss using lightfm. [nbviewer][html]
Factorization Machine from scratch. [nbviewer][html]
Content-Based Recommenders:
(Text) Content-Based Recommenders. Introducing Approximate Nearest Neighbor (ANN) - Locality Sensitive Hashing (LSH) for cosine distance from scratch. [nbviewer][html]
Approximate Nearest Neighbor (ANN):
Benchmarking ANN implementations (nmslib). [nbviewer][html]
Calibrated Recommendation for reducing bias/increasing diversity in recommendation. [nbviewer][html]
Maximum Inner Product for Speeding Up Generating Recommendations. [nbviewer][html]
trees
Tree-based models for both regression and classification tasks.
Influence Maximization from scratch. Includes discussion of Independent Cascade (IC) and Submodular Optimization algorithms including Greedy and Lazy Greedy, a.k.a. Cost-Effective Lazy Forward (CELF). [nbviewer][html]
Choosing the optimal cutoff value for logistic regression when the cost of misclassification differs between the two classes (cost-sensitive mistakes) and the dataset consists of imbalanced binary classes, e.g. the majority of the data points have a positive outcome while few have a negative one, or vice versa. The notion extends to any other classification algorithm that can predict class probabilities; this documentation uses logistic regression for illustration purposes only.
Visualizes the standard two-by-two confusion matrix and the ROC curve with costs using ggplot2.
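The cost-sensitive cutoff idea described above can be sketched in a few lines. This is a minimal illustration in Python with numpy, not the R code from the linked documentation. The function name `choose_cutoff`, the cost values, and the toy labels and probabilities are all hypothetical and chosen only for demonstration.

```python
import numpy as np


def choose_cutoff(y_true, y_prob, cost_fp=1.0, cost_fn=5.0, cutoffs=None):
    """Scan candidate cutoffs and return the one with the lowest total cost.

    cost_fp and cost_fn are illustrative per-mistake costs for false
    positives and false negatives; they are assumptions, not values
    taken from the original documentation.
    """
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    if cutoffs is None:
        # evenly spaced candidate cutoffs between 0 and 1
        cutoffs = np.linspace(0.0, 1.0, 101)

    costs = []
    for cutoff in cutoffs:
        # predict the positive class when the probability reaches the cutoff
        y_pred = (y_prob >= cutoff).astype(int)
        n_fp = np.sum((y_pred == 1) & (y_true == 0))
        n_fn = np.sum((y_pred == 0) & (y_true == 1))
        costs.append(cost_fp * n_fp + cost_fn * n_fn)

    costs = np.asarray(costs)
    best = int(np.argmin(costs))  # first cutoff that attains the minimum cost
    return cutoffs[best], costs[best]


# toy example: four negatives, two positives, made-up predicted probabilities
y_true = [0, 0, 0, 0, 1, 1]
y_prob = [0.05, 0.15, 0.25, 0.65, 0.35, 0.95]
best_cutoff, best_cost = choose_cutoff(y_true, y_prob, cost_fp=1.0, cost_fn=5.0)
print(best_cutoff, best_cost)
```

Because a false negative is assumed to cost five times as much as a false positive here, the selected cutoff falls below the default 0.5: the scan accepts one extra false positive to avoid missing a positive case. Any classifier that outputs class probabilities could supply `y_prob`.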
A collection of scattered old clustering documents in R.
Toy sample code for the LDA algorithm (Gibbs sampling) and the topicmodels library. [Rmarkdown]
k-shingle, Minhash and Locality Sensitive Hashing for solving the problem of finding textually similar documents. [Rmarkdown]
Introducing tf-idf (term frequency-inverse document frequency), a text mining technique. Also uses it to perform text clustering via hierarchical clustering. [Rmarkdown]
Some useful evaluations when working with hierarchical clustering and K-means clustering (K-means++ is used here), including the Calinski-Harabasz index for determining the right K (number of clusters) and bootstrap evaluation of the stability of the clustering result. [Rmarkdown]
linear regression
Training Linear Regression with gradient descent in R. Also briefly covers the interpretation and visualization of linear regression's summary output. [Rmarkdown]