# Uni Rostock PyTorch Framework
The `uros_pf` framework intends to reduce the setup time for new machine 
learning projects based on PyTorch. 
It unifies the standard procedures that recur in every project, such as 
configuration, input/output handling, training and logging.

## Basic principles
- Files in the `uros_pf` package are of general interest. They must
not be adapted to a specific project.
- Project-specific source code goes into the `scenario` folder.
- Data, configuration files, model checkpoints etc. are stored in a 
separate workdir folder.
- To implement a new scenario, at least three files have to be written: the 
`input_processor` (preparing the inputs and targets), the `model` 
(connecting the trainer, optimizer and model) and the `module` (containing 
the neural network or the ML method).
- The input processor, the model and all hyperparameters are configured
in the config file (a `yaml` file). It is usually stored in the workdir and 
passed to the trainer with `-cn config_name` (`config_name` without the `.yaml` extension).

## Sample project
A working configuration file (named `ag_linear_config.yaml`) looks like this:
```yaml
hydra:
    run:
        dir: ${trainer.model_path}
builder:
    input: "scenario.ag_news_corpus.ag_ip.AGInputProcessor"
    model: "scenario.ag_news_corpus.ag_simple_model.SimpleModel"
trainer:
    epochs: 20
    model_path: models/${now:%Y-%m-%d}/${now:%H-%M-%S}
input:
    feature_size: 1000
    train_file: "data/ag_news_corpus/train.csv"
    val_file: "data/ag_news_corpus/test.csv"
    batch_size: 10
    samples_per_epoch: 12000
    label2id: "1:0,2:1,3:2,4:3"
    TfIdfTransform:
        stopword: english
        feature_size: 1000
    CSVTransform:
        input_cols: "1,2"
        target_cols: "0"
model:
    num_of_classes: 4
    loss_fn: "torch.nn.MSELoss"
    metric_fns: "uros_pf.metrics.accuracy.Accuracy"
    module_cls: "scenario.ag_news_corpus.ag_least_square_module.AGLeastSquareModule"
    optimizer: "torch.optim.SGD"
    lr: 0.01
    scheduler_class: "torch.optim.lr_scheduler.LambdaLR"
    scheduler:
        lr_lambda: 0.9
    module:
        feature_size: ${input.feature_size}
        num_of_classes: ${model.num_of_classes}
```
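
The config refers to classes by dotted paths such as `torch.optim.SGD` or `scenario.ag_news_corpus.ag_ip.AGInputProcessor`. How `uros_pf` resolves these strings is not shown here; the following is only a minimal sketch of the usual technique, and the helper name `load_class` is hypothetical. A standard-library class stands in for the real entries so the example runs without PyTorch.

```python
import importlib


def load_class(dotted_path: str):
    """Resolve a string like 'package.module.ClassName' to the class object.

    Hypothetical helper for illustration; not part of uros_pf.
    """
    module_name, _, class_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.ClassName', got {dotted_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Standard-library stand-in for config values such as "torch.optim.SGD".
ordered_dict_cls = load_class("collections.OrderedDict")
print(ordered_dict_cls.__name__)
```

With PyTorch installed, `load_class("torch.nn.MSELoss")()` would create the loss object named in the `model` section in the same way.
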
The structure of the workdir is: 
```
├── data
│   ├── ag_news_corpus
│   │   ├── test.csv
│   │   ├── train.csv
├── config
│   ├── ag_linear_config.yaml
```
The data is taken from 
[mhjabreel's GitHub repository](https://github.com/mhjabreel/CharCnn_Keras/tree/master/data/ag_news_csv).
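
To illustrate how the `CSVTransform` and `label2id` settings in the config relate to these files, here is a minimal sketch. It assumes the CSV rows follow the usual AG News layout (class index, title, description, no header row). The helper names are hypothetical and the sample row is made up; the actual `AGInputProcessor` may work differently.

```python
import csv
import io


def parse_label_map(spec: str) -> dict[str, int]:
    """Turn a string like '1:0,2:1' into {'1': 0, '2': 1}."""
    pairs = (item.split(":") for item in spec.split(","))
    return {raw.strip(): int(idx) for raw, idx in pairs}


def parse_cols(spec: str) -> list[int]:
    """Turn a string like '1,2' into [1, 2]."""
    return [int(col) for col in spec.split(",")]


def read_rows(handle, input_cols, target_cols, label2id):
    """Yield (text, class id) pairs from CSV rows."""
    for row in csv.reader(handle):
        text = " ".join(row[i] for i in input_cols)
        label = label2id[row[target_cols[0]]]
        yield text, label


label2id = parse_label_map("1:0,2:1,3:2,4:3")  # value from the config above
sample = io.StringIO('"3","Example title","Example description"\n')  # made-up row
rows = list(read_rows(sample, parse_cols("1,2"), parse_cols("0"), label2id))
print(rows)  # [('Example title Example description', 2)]
```

The mapping shifts the labels 1 to 4 used in the files to the zero-based class ids 0 to 3, matching `num_of_classes: 4`.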

To run the example project, execute 
`python3 path/to/src/uros_pf/trainer/trainer.py -cn ag_linear_config`
from the workdir.
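
Each run writes to the directory given by `trainer.model_path`, which the config builds with Hydra's `now` resolver. That resolver formats the current time with `strftime`, so the resulting path can be previewed with the standard library alone. The function name below is hypothetical and the timestamp is fixed only to make the output reproducible.

```python
from datetime import datetime


def expand_model_path(now: datetime) -> str:
    """Mimic models/${now:%Y-%m-%d}/${now:%H-%M-%S} from the sample config."""
    return f"models/{now.strftime('%Y-%m-%d')}/{now.strftime('%H-%M-%S')}"


# Fixed timestamp for illustration; a real run uses the current time.
print(expand_model_path(datetime(2024, 1, 31, 9, 5, 7)))  # models/2024-01-31/09-05-07
```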

## Install
- Operating system: Windows 7, 8, 10 or 11 (64-bit) or Linux (ideally Ubuntu 23.04)
- Processor: x86 processor architecture (no ARM support)
- Hard disk: 5 GB free disk space
### General
The framework is written in Python 3.11. To install the required Python version and pip packages without conflicting with Python and module versions 
that already exist or may be installed later, we recommend using **Anaconda** (or its minimal variant **Miniconda**) on Windows systems. It lets 
you run independent Python installations in **virtual environments**, so you can still create your own projects with other 
Python versions, modules or module versions after this workshop.

As a code editor, we recommend **PyCharm**, which offers various tools for developing 
Python programmes.
The central Python package we use for fine-tuning large language models is **PyTorch**. 
The **Uni Rostock PyTorch Framework (uros_pf)** provides a framework that 
simplifies workflows built on PyTorch. As shown above, the framework also ships a simple sample 
project that trains a classifier to sort newspaper texts into the 
genres *World*, *Sports*, *Business* and *Sci/Tech*. The end of these installation 
instructions explains in more detail how to start this sample application as a PyCharm project.
### Ubuntu / Debian
Open a terminal and execute the following command lines while substituting 
`path/2/virtual/environment` and `path/to/project/dir` with your own folders.

```
sudo apt-get update
sudo apt install python3-venv python3-pip
ENV_HOME=path/2/virtual/environment
python3 -m venv $ENV_HOME
source $ENV_HOME/bin/activate
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
pip install tensorboard transformers[torch] evaluate datasets scikit-learn tokenizers tqdm
pip install hydra-core --upgrade
```
To install the project framework and the PyCharm environment, download and install PyCharm from 
https://www.jetbrains.com/de-de/pycharm/download/other.html and run:
```
cd path/to/project/dir
git clone https://citlab0.math.uni-rostock.de/gitlab/shared/uros_pf
```
(tested with Ubuntu 23.04)
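
After the `pip install` commands, a short script can confirm that the environment is usable. This is only a sketch: the list below contains the import names of the packages installed above (for example `sklearn` for `scikit-learn` and `hydra` for `hydra-core`), and the function name is hypothetical. It works the same way in any activated virtual environment.

```python
import importlib.util
import sys

# Import names of the pip packages listed in the install commands.
REQUIRED_MODULES = [
    "torch", "torchvision", "torchaudio", "tensorboard", "transformers",
    "evaluate", "datasets", "sklearn", "tokenizers", "tqdm", "hydra",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if sys.version_info < (3, 11):
    print("Warning: the framework is written for Python 3.11")

missing = missing_modules(REQUIRED_MODULES)
print("Missing modules:", ", ".join(missing) if missing else "none")
```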
### Windows
#### Virtual environment
1. Install [Miniconda](https://docs.conda.io/en/latest/miniconda.html#windows-installers):
Use the installer for 64-bit Windows systems. During installation, keep the suggested default 
settings. If no shortcut to the Anaconda prompt has been created on the desktop after the 
installation, create one yourself: Create a new shortcut with the target 
`cmd /c start "Anaconda Prompt" "C:\path\to\Miniconda3\Scripts\activate"`, replacing 
`C:\path\to\` with the correct path to the Miniconda installation.
2. Expand the sources that Miniconda can access to install software in virtual environments: 
By default, Anaconda often only has access to older versions of software packages. To widen the 
selection, we add another community-maintained source, **Conda-Forge**. To do this, open the Anaconda 
prompt and type: `conda config --add channels conda-forge`. Conda-Forge should now appear in 
the list of channels, which you can check with `conda config --show channels`. 
3. Create a new virtual environment: In the Anaconda prompt, you should see the prefix `(base)` 
in the line before the cursor, which means that you are currently in the base environment of Anaconda. 
To create the new environment, type: `conda create -n llm_workshop python=3.11 git`. You can replace 
`llm_workshop` with a name of your choice. 
4. Activate the new virtual environment: `conda activate llm_workshop`. If you forget the name of the 
environment or want an overview of all existing virtual environments, run `conda env list`.
5. Update the packages *pip* and *setuptools* installed with Python: 
`python -m pip install -U pip && pip install -U setuptools`.
6. Install the Python modules needed for the workshop: 
`pip install torch torchvision torchaudio transformers[torch] tensorboard evaluate datasets scikit-learn notebook hydra-core --upgrade`.
#### Workshop folder and Uni Rostock PyTorch Framework repository (uros_pf)
7. Create a project folder on your system. Change to this project folder in the Anaconda prompt. 
8. Clone the `uros_pf` repository into the project folder: 
`git clone https://citlab0.math.uni-rostock.de/gitlab/shared/uros_pf.git`. 
9. Create the folder `workdir` in the project folder, so that the project folder now contains 
two folders: `workdir` and `uros_pf`. In `workdir` we will store data, 
configurations and models.
#### PyCharm
10. Install [PyCharm](https://www.jetbrains.com/de-de/pycharm/download/other.html) 
(Community Edition 2023.2.1 for Windows).
11. Start PyCharm and open the repository folder `uros_pf` in the project folder: Click on the 
main menu icon in the upper left corner, click on `Open` and select the repository folder. Answer 
the following security question with `Trust Project`.
12. Activate your virtual environment in PyCharm: PyCharm can use the virtual environment 
created in step 3 as its working environment, and with it the Python 3.11 interpreter it contains. 
Click on the Python interpreter selector in the lower right corner. If your Miniconda 
environment is not listed yet, select `Add New Interpreter` and then `Add Local Interpreter`. 
In the `Add Python Interpreter` dialog that opens, select `Conda Environment` as the type on the left. 
Then enter the path to `conda.exe` of your Miniconda installation on the right: `C:\path\to\Miniconda3\Scripts\conda.exe`. Then activate `Use existing environment` and select 
your virtual environment from the dropdown menu, where it should now appear. 
Confirm with `OK`. PyCharm can now use the Python modules you installed in the virtual environment.
### Uni Rostock PyTorch Framework (uros_pf)
13. To finally start the sample application of the Uni Rostock PyTorch Framework, 
a folder structure must first be created in the currently empty folder `workdir`:
```
├── data
│ ├── ag_news_corpus
│ │ ├── test.csv
│ │ ├── train.csv
├── config
│ ├── ag_linear_config.yaml
```
14. Download the training data from [mhjabreel's GitHub repository](https://github.com/mhjabreel/CharCnn_Keras/tree/master/data/ag_news_csv).
15. The configuration file is simply the one from section [Sample project](#sample-project).
16. To start the training, set up a **Run / Debug Configuration** in PyCharm. 
This allows a `.py` file to be run directly with specified parameters. To set 
up a new configuration, click on the dropdown menu `Current File` in the top centre 
and select `Edit Configurations`. Click on `Add new` and select `Python`. Give 
the configuration a name, e.g. `Run Trainer`. Select the file `uros_pf/trainer/trainer.py` 
as **script**. Set the folder `workdir` in the project folder as **working 
directory** and enter `-cn ag_linear_config` in the **Script parameters** field. Save the 
changes and start the scenario by clicking on `Run`. A terminal opens in which the 
training process can be observed.
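
The folder layout from step 13 can also be created and checked with a short script. This is a sketch only: the function names are hypothetical, and the demonstration runs in a temporary directory so that nothing is written to your real project folder.

```python
import tempfile
from pathlib import Path

# Paths relative to the workdir, taken from the tree in step 13.
FOLDERS = ["data/ag_news_corpus", "config"]
FILES = [
    "data/ag_news_corpus/train.csv",
    "data/ag_news_corpus/test.csv",
    "config/ag_linear_config.yaml",
]


def create_folders(workdir: Path) -> None:
    """Create the expected sub-folders; existing folders are left untouched."""
    for folder in FOLDERS:
        (workdir / folder).mkdir(parents=True, exist_ok=True)


def missing_files(workdir: Path) -> list[str]:
    """Return the expected files that are not present yet."""
    return [name for name in FILES if not (workdir / name).is_file()]


# Demonstrated in a temporary directory; point `workdir` at your real folder instead.
with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp) / "workdir"
    create_folders(workdir)
    print(missing_files(workdir))  # all three files are still missing at this point
```

Once the CSV files and the configuration file are in place, `missing_files` returns an empty list.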

This concludes the preparatory installations.