Forge ML/DL API Walk-Through Tutorial
Using the Titanic dataset as the example (because it’s absurdly tiny)
ForgeBox is the code API for the Forge WebUI; it offers a convenient API when you’re coding machine learning in Python.
For now it works seamlessly with PyTorch; I hope APIs for other frameworks, like TensorFlow, will emerge.
If this is the first time you’re running anything with Forge, you’ll have to run the command forge
in your terminal first to initialize the database.
Without extra configuration, the database will be $HOME/data/forge.db
. Alternatively, you can point it to another SQL database of your choice, like MySQL or PostgreSQL, in config.py
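The connection string follows SQLAlchemy’s URL format, as the log line below shows. A minimal sketch of what such an override might look like, assuming config.py exposes a DATABASE setting (the actual variable name may differ in your version):

# config.py (hypothetical override; check your installed config.py for the real setting name)
# default: DATABASE = "sqlite:////Users/salvor/data/forge.db"
DATABASE = "postgresql://user:password@localhost:5432/forge"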
import pandas as pd
from forgebox.ftorch import FG
loading configs from /etc/forgebox.cfg
[Connecting to] sqlite:////Users/salvor/data/forge.db
Set a name for your training task, without spaces or funny characters.
A task name is only registered once; you can run your Jupyter notebook or Python script repeatedly and it will still be considered a single task.
fg = FG("titanic2")
p = fg.p
==========hyper params==========
{'bench_mark': 0.6, 'dim': 30, 'hidden': 512}
Set the hyperparameters
BENCH_MARK = p("bench_mark",.6) # benchmark above which a prediction is treated as positive
DIM = p("dim",30) # latent dimension for the embeddings
HIDDEN = p("hidden",512) # hidden size for the neural network
p("hidden")
will return 512, this is how we can retrieve our hyper-param from the config later
p("hidden",1024)
to set hidden size to 1024, meanwhile, the expression will return 1024.
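Putting the two together, a quick sketch of p()’s read/write semantics (the values here just mirror the calls above):

HIDDEN = p("hidden", 512)  # first call registers the default and returns 512
p("hidden")                # read it back later -> 512
p("hidden", 1024)          # overwrite the stored value; the call also returns 1024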
Now we load the CSV files
train_df = pd.read_csv("/data/titanic/Data/train.csv")
test_df = pd.read_csv("/data/titanic/Data/test.csv")
train_df = train_df.fillna(0)
test_df = test_df.fillna(0)
train_df["AgeZero"] = train_df["Age"].apply(lambda x:(int(x==0))*1)
test_df["AgeZero"] = test_df["Age"].apply(lambda x:(int(x==0))*1)
train_df.sample(10)
Preprocessing tabular data with forgebox
The forgebox preprocessing package contains several modules that speed up preprocessing on structured data.
from forgebox.ftorch.prepro import categorical,minmax,tabulate,categorical_idx
categorical_idx
can transform categorical fields into indices, e.g. 0 for male, 1 for female, 2 for unknown.
The indices are assigned by frequency count, from the most common category to the least common.
The build
function calculates and records the meta information of this preprocessing step, based on a pandas Series. You can specify the max_
argument in the build
function to cap the number of categories; the least frequent categories will be lumped together as other
. The default max_
is 20.
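For instance, a hypothetical build call with max_ (Cabin is just an illustrative column, and passing max_ as a keyword to build is an assumption about its signature):

Cabin = categorical_idx("Cabin")
Cabin.build(train_df.Cabin, max_=10)  # keep the 10 most frequent cabins, fold the rest into "other"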
Pclass = categorical_idx("Pclass")
Pclass.build(train_df.Pclass)
Sex = categorical_idx("Sex")
Sex.build(train_df.Sex)
Embarked = categorical_idx("Embarked")
Embarked.build(train_df.Embarked)
AgeZero = categorical_idx("AgeZero")
AgeZero.build(train_df.AgeZero)
Pclass
3 491
1 216
2 184
Sex
male 577
female 314
Embarked
S 644
C 168
Q 77
0 2
AgeZero
0 714
1 177
minmax rescales the range of a data series: any value in the column is capped to the recorded min/max, then scaled up or down to a range friendly to machine learning models.
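Conceptually it behaves something like this (a minimal numpy sketch of the idea, not forgebox’s actual implementation):

import numpy as np

def minmax_sketch(col, min_, max_):
    # cap the values to [min_, max_], then scale them into [0, 1]
    clipped = np.clip(col, min_, max_)
    return (clipped - min_) / (max_ - min_)

# e.g. with Age's recorded min_=0 and max_=80, an age of 40 maps to 0.5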
Age = minmax("Age")
Age.build(train_df.Age)
SibSp = minmax("SibSp")
SibSp.build(train_df.SibSp)
Parch = minmax("Parch")
Parch.build(train_df.Parch)
Fare = minmax("Fare")
Fare.build(train_df.Fare)
min_:0.000 max_:80.000 range:80.000
min_:0.000 max_:8.000 range:8.000
min_:0.000 max_:6.000 range:6.000
min_:0.000 max_:512.329 range:512.329
tabulate
is how we combine several column preprocessing channels into one.
Note that you can use another tabulate
as a column, and combine it further with other columns.
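For example, a hypothetical nested combination (assuming, per the note above, that a tabulate can be passed to build like any other channel):

numeric = tabulate("numeric")
numeric.build(Age, SibSp, Parch, Fare)  # group the continuous columns
x_nested = tabulate("x_nested")
x_nested.build(Pclass, Sex, Embarked, AgeZero, numeric)  # then nest the group alongside the rest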
x_pre = tabulate("x_pre")
x_pre.build(Pclass,Sex,Embarked,AgeZero,Age,SibSp,Parch,Fare)
Here we preprocess our “y” (target) data
Survived = minmax("Survived")
Survived.build(train_df["Survived"])
min_:0.000 max_:1.000 range:1.000
y_pre = tabulate("y_pre")
y_pre.build(Survived)
Now we check our preprocessing channels against the dataframe.
x_pre.prepro(train_df).shape, y_pre.prepro(train_df).shape
((891, 8), (891, 1))
Titanic is a tiny dataset; if you are using forgebox on a real project, the dataframe will be much bigger, so you can use x_pre.prepro(train_df.head(20))
instead to spot-check.
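A quick spot-check on a small slice (the expected shape follows from the full-dataframe shapes above):

sample_x = x_pre.prepro(train_df.head(20))
print(sample_x.shape)  # (20, 8): 4 categorical index columns + 4 scaled numeric columns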
Build a PyTorch model
import torch
from torch import nn
class ti_mlp(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb_p = nn.Embedding(len(Pclass.cate_list), DIM)
        self.emb_s = nn.Embedding(len(Sex.cate_list), DIM)
        self.emb_e = nn.Embedding(len(Embarked.cate_list), DIM)
        self.emb_a = nn.Embedding(2, DIM)
        self.mlp = nn.Sequential(*[
            nn.Linear(DIM*4+4, HIDDEN, bias=False),
            nn.BatchNorm1d(HIDDEN),
            nn.ReLU(),
            nn.Linear(HIDDEN, HIDDEN, bias=False),
            nn.BatchNorm1d(HIDDEN),
            nn.ReLU(),
            nn.Linear(HIDDEN, 1, bias=False),
            nn.Sigmoid(),
        ])

    def forward(self, x):
        p, s, e, a, conti = x[:, :1].long(), x[:, 1:2].long(), x[:, 2:3].long(), x[:, 3:4].long(), x[:, 4:].float()
        x_ = torch.cat([self.emb_p(p).squeeze(1),
                        self.emb_s(s).squeeze(1),
                        self.emb_e(e).squeeze(1),
                        self.emb_a(a).squeeze(1),
                        conti], dim=1)
        return self.mlp(x_)
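A quick shape sanity check with dummy data (just a sketch; the 8 columns match the x_pre channel built above, with zeros standing in for valid category indices):

m = ti_mlp()
m.eval()  # eval mode so BatchNorm uses its running statistics
dummy = torch.cat([torch.zeros(5, 4), torch.rand(5, 4)], dim=1)
print(m(dummy).shape)  # torch.Size([5, 1])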
from forgebox.ftorch.prepro import DF_Dataset,fuse
from forgebox.ftorch.train import Trainer
from forgebox.ftorch.metrics import accuracy,recall,precision
The following makes a PyTorch dataset, a familiar term for PyTorch users; it’s just another PyTorch dataset and can be put into a PyTorch dataloader.
But forgebox puts the batch-size configuration in the dataset instead of the dataloader.
The reason: tabular datasets usually run well beyond hundreds of rows, and if the batch size lived in the dataloader, each row would be fetched by a separate non-parallel Python-level iteration, which is sssllooowww, hence every Python coder’s big no-no. We use numpy/pandas slicing to replace the simple “for” loop, as the sketch below shows.
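The idea in two lines: one pandas slice yields a whole batch at once, instead of a row-by-row loop:

batch_rows = train_df.iloc[0:64]    # one vectorized slice = one batch of 64 rows
x_batch = x_pre.prepro(batch_rows)  # preprocess the whole batch in a single call, shape (64, 8)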
train_ds_x = DF_Dataset(train_df,prepro=x_pre.prepro,bs=64,shuffle=False)
train_ds_y = DF_Dataset(train_df,prepro=y_pre.prepro,bs=64,shuffle=False)
train_ds = fuse(train_ds_x,train_ds_y)
valid_ds_x = DF_Dataset(test_df,prepro=x_pre.prepro,bs=64,shuffle=False)
valid_ds_y = DF_Dataset(test_df,prepro=y_pre.prepro,bs=64,shuffle=False)
valid_ds = fuse(valid_ds_x,valid_ds_y)
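You can peek at one fused batch to confirm the shapes (assuming the fused dataset yields an (x, y) pair per index, which is how it is consumed in the training step below):

x0, y0 = train_ds[0]
print(x0.shape, y0.shape)  # expect (64, 8) and (64, 1)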
from forgebox.ftorch.callbacks import print_all,recorddf,mars,stat
model = ti_mlp()
from torch.optim import Adam
opt = Adam(model.parameters())
loss_func = nn.BCELoss()
trainer = Trainer(train_ds, val_dataset=valid_ds,  # datasets
                  batch_size=1,  # batch size in the dataloader; it has to be 1 here, since DF_Dataset already batches
                  fg=fg,
                  print_on=2,  # print on every 2 steps
                  callbacks=[print_all,  # print all metrics
                             stat,
                             fg.logs(),     # save the training log
                             fg.metrics(),  # save the metrics
                             fg.weights(model)  # save the model weights
                             ])
Define a training step and a validation step
Here we don’t bake the optimizer or the model into the Trainer; that’s how we can use several different optimizers to update several models in a free-style way, entirely as you dictate, as the sketch below illustrates.
What you decide is what will happen within an iteration, given a batch of data.
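For instance, a minimal sketch of updating two models with two optimizers in one step (the models, optimizers, and loss here are purely illustrative, not part of this tutorial’s pipeline):

model_a, model_b = nn.Linear(8, 4), nn.Linear(4, 1)
opt_a, opt_b = Adam(model_a.parameters()), Adam(model_b.parameters())

def two_model_step(x, y):
    opt_a.zero_grad(); opt_b.zero_grad()
    loss = nn.MSELoss()(model_b(model_a(x)), y)  # chain the two models
    loss.backward()
    opt_a.step(); opt_b.step()  # each optimizer updates its own model's weights
    return loss.item()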
@trainer.step_train
def action(batch):
    x, y = batch.data
    if batch.i == 0:
        model.train()
    x = x.squeeze(0)  # drop the dataloader's extra batch dim: (1, 64, 8) -> (64, 8)
    y = y.squeeze(0).float()
    opt.zero_grad()
    y_ = model(x)
    loss = loss_func(y_, y)
    acc = accuracy(y_, y.long())
    rec = recall(y_, y.long())
    prec = precision(y_, y.long())
    f1 = 2*(rec*prec)/(rec+prec)  # harmonic mean of precision and recall
    loss.backward()
    opt.step()
    return {"loss": loss.item(), "acc": acc.item(), "rec": rec.item(),
            "prec": prec.item(), "f1": f1.item()}
@trainer.step_val
def val_action(batch):
    x, y = batch.data
    if batch.i == 0:
        model.eval()
    x = x.squeeze(0)
    y = y.squeeze(0).float()
    y_ = model(x)
    loss = loss_func(y_, y)
    acc = accuracy(y_, y.long())
    rec = recall(y_, y.long())
    prec = precision(y_, y.long())
    f1 = 2*(rec*prec)/(rec+prec)
    return {"loss": loss.item(), "acc": acc.item(), "rec": rec.item(),
            "prec": prec.item(), "f1": f1.item()}
Train the model!
trainer.train(3)