
2. Functions used for the ML Model

2.1. How To Use

** Under Construction **

2.2. Modules inside the Model package

Model.PreProcessingData.PreProcessingData(DataSet, OutputPath, NormStd=None, SaveToCsv=True, Verbose=False)

Reformats the full dataset so that it has one row per student, with the following columns:

  • Id Unico (Unique Id)

  • Sexo (Gender)

  • Semestre (Semester)

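  • Promedio General (Overall Average)

  • Materias Aprobadas Historicas (Courses Passed to Date)

  • Materias No Aprobadas Historicas (Courses Failed to Date)

  • Numero de Faltas Semestre En Curso (Number of Absences in the Current Semester)

  • Repite Materia (Repeats a Course)

  • Abandono (Dropout)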

The parameters for this module are:

  • DataSet: Either the path to a CSV file or a pandas DataFrame.

  • OutputPath: Path where the CSV files will be stored. It is used only if SaveToCsv = True.

  • NormStd: This parameter accepts three different values, which determine how the output dataframe is returned:
    • None: the output dataframe is not modified.

    • Norm: the output dataframe is normalized.

    • Std: the output dataframe is standardized.

  • SaveToCsv: If True, the output dataframe is saved in the path specified by OutputPath.

  • Verbose: If True, the script prints information about what it is doing.
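The difference between the Norm and Std options can be illustrated with a small pandas sketch. It rests on assumptions: that "normalize" means min-max scaling to the range [0, 1], that "standardize" means z-score scaling, and that the options are passed as the strings "Norm" and "Std". The documentation above does not state which formulas PreProcessingData applies, and the helper name scale_frame and the sample values are hypothetical.

```python
import pandas as pd


def scale_frame(df, norm_std=None):
    """Hypothetical helper: scale columns the way NormStd is assumed to work."""
    if norm_std is None:
        return df.copy()  # leave the data untouched
    if norm_std == "Norm":
        # Assumed meaning: min-max scaling, each column mapped to [0, 1]
        return (df - df.min()) / (df.max() - df.min())
    if norm_std == "Std":
        # Assumed meaning: z-score scaling, zero mean and unit population variance
        return (df - df.mean()) / df.std(ddof=0)
    raise ValueError(f"Unsupported NormStd value: {norm_std!r}")


# Made-up sample values, for illustration only
sample = pd.DataFrame({
    "Promedio General": [6.0, 8.0, 10.0],
    "Numero de Faltas Semestre En Curso": [0, 5, 10],
})
normalized = scale_frame(sample, "Norm")
standardized = scale_frame(sample, "Std")
print(normalized)
print(standardized)
```

A column with a single constant value would cause a division by zero in both scaling branches, so real code would need to guard against that case.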


Model.TrainingModel.TrainingModel(DataSet, OutputPathModel, Verbose=True)

Function used to train and save an ML model, in this case a Random Forest Classifier from sklearn.

Because the dataset used to train the model is imbalanced, it first has to be balanced. For that job, the SMOTETomek class from the imblearn package is used. SMOTETomek combines SMOTE oversampling of the minority class with undersampling based on Tomek links.

The train_test_split function from sklearn.model_selection is also used to split the original dataset into two datasets, one for training and one for testing.

Finally, the confusion_matrix and classification_report functions from sklearn.metrics are used to evaluate and print the performance of the Random Forest Classifier.
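The split, training, and evaluation steps can be sketched together as shown here. The synthetic dataset, the 25% test size, and the hyperparameters are assumptions chosen for illustration; the values actually used by TrainingModel are not stated in this documentation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the (already balanced) student dataset
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Hold out 25% of the rows for testing; stratify keeps class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes
matrix = confusion_matrix(y_test, y_pred)
print(matrix)
print(classification_report(y_test, y_pred))
```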

The parameters used are:

  • DataSet: The input dataset, either a path to a CSV file or a pandas DataFrame.

  • OutputPathModel: Path and filename where the trained model will be saved. The model is saved with pickle.

  • Verbose: If True, the script prints information about what it is doing.
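Saving a trained model to the location given by OutputPathModel, and loading it back later, could look like the sketch below. The temporary directory, the file name model.pkl, and the tiny synthetic model are hypothetical and only serve to make the example self-contained.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic model standing in for the trained classifier
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Hypothetical output location, playing the role of OutputPathModel
output_path_model = Path(tempfile.mkdtemp()) / "model.pkl"

# Serialize the fitted model to disk
with open(output_path_model, "wb") as handle:
    pickle.dump(model, handle)

# Load it back and check that it predicts the same labels
with open(output_path_model, "rb") as handle:
    restored = pickle.load(handle)

same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(same_predictions)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.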