# Uni Rostock PyTorch Framework

The `uros_pf` framework intends to reduce the setup time for new machine learning projects based on PyTorch. It unifies standard procedures that are used in all projects, such as configuration, input/output handling, training and logging.

## Basic principles

- Files in the `uros_pf` package are of general interest. They must not be adapted to a certain project.
- Project-specific source code goes into the `scenario` folder.
- Data, configuration files, model checkpoints etc. are stored in a separate workdir folder.
- To implement a new scenario, at least three files have to be coded:
  - the `input_processor` (preparing the input and targets),
  - the `model` (connecting the trainer, optimizer and model), and
  - the `module` (containing the neural network or the ML method).
- The input processor, the model and all hyperparameters are configured in the config file (a `yaml` file). It is usually stored in the workdir and passed to the trainer with `-cn config_name` (`config_name` without the extension `.yaml`).

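
The config file refers to the input processor, the model and other components by dotted import paths, for example `torch.optim.SGD` in the sample configuration. How `uros_pf` resolves these strings internally is not documented here. The following is only a minimal sketch of one common approach, assuming plain `importlib`; the helper name `load_class` is hypothetical and not part of the framework.

```python
import importlib


def load_class(dotted_path: str):
    """Resolve a dotted path such as 'torch.optim.SGD' to the object it names.

    Hypothetical helper for illustration; not the uros_pf implementation.
    """
    # Split "package.module.Name" into the module path and the attribute name.
    module_name, _, attr_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'package.module.Name', got {dotted_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


# Standard-library example so that the sketch runs without PyTorch installed.
ordered_dict_cls = load_class("collections.OrderedDict")
print(ordered_dict_cls.__name__)
```

With PyTorch installed, the same helper would accept a value such as `torch.optim.SGD` taken from the config.
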
## Sample project

A working configuration file (named `ag_linear_config.yaml`) is

```yaml
hydra:
  run:
    dir: ${trainer.model_path}

builder:
  input: "scenario.ag_news_corpus.ag_ip.AGInputProcessor"
  model: "scenario.ag_news_corpus.ag_simple_model.SimpleModel"

trainer:
  epochs: 20
  model_path: models/${now:%Y-%m-%d}/${now:%H-%M-%S}

input:
  feature_size: 1000
  train_file: "data/ag_news_corpus/train.csv"
  val_file: "data/ag_news_corpus/test.csv"
  batch_size: 10
  samples_per_epoch: 12000
  TfIdfTransform:
    stopword: english
    feature_size: 1000
  CSVTransform:
    input_cols: "1,2"
    target_cols: "0"

model:
  num_of_classes: 4
  loss_fn: "torch.nn.MSELoss"
  metric_fns: "uros_pf.metrics.accuracy.Accuracy"
  module_cls: "scenario.ag_news_corpus.ag_least_square_module.AGLeastSquareModule"
  optimizer: "torch.optim.SGD"
  lr: 0.01
  scheduler_class: "torch.optim.lr_scheduler.LambdaLR"
  scheduler:
    lr_lambda: 0.9

module:
  feature_size: ${input.feature_size}
  num_of_classes: ${model.num_of_classes}
```

The structure of the workdir is:

```commandline
├── data
│   ├── ag_news_corpus
│   │   ├── test.csv
│   │   ├── train.csv
├── config
│   ├── ag_linear_config.yaml
```

The data is taken from [mhjabreel's GitHub repository](https://github.com/mhjabreel/CharCnn_Keras/tree/master/data/ag_news_csv).

To run the example project, execute `python3 path/to/src/uros_pf/trainer/trainer.py -cn ag_linear_config` from the workdir.

## Install

### Requirements

- Operating system: Windows 7, 8, 10 or 11 (64-bit) or Linux (ideally Ubuntu 23.04)
- Processor: x86 processor architecture (no ARM support)
- Hard disk: 5 GB free disk space

### General

The framework is written in Python 3.11. To install the required Python version and pip packages without conflicts with Python and module versions that already exist or that may be installed later, we recommend using **Anaconda** (or the minimised variant **Miniconda**) on Windows systems.
With this software it is possible to run independent Python installations in **virtual environments**. So if you want to create your own projects with other Python versions, modules or module versions after this workshop, you can do so with Anaconda without conflicts.

As a code editor, we recommend **PyCharm**, which offers various tools for the development of Python programmes.

The central Python package we use for fine-tuning large language models is **PyTorch**. The **Uni Rostock PyTorch Framework (uros_pf)** provides a framework that simplifies processes based on it. As shown above, the framework also offers a simple sample project that trains a classifier to assign newspaper texts to the genres *World*, *Sports*, *Business* and *Sci/Tech*. The end of these installation instructions explains in more detail how this sample application can be started as a PyCharm project.

### Ubuntu / Debian

Open a terminal and execute the following commands, substituting `path/2/virtual/environment` and `path/to/project/dir` with your own folders.

```commandline
sudo apt-get update
sudo apt install python3-venv python3-pip
ENV_HOME=path/2/virtual/environment
python3 -m venv $ENV_HOME
source $ENV_HOME/bin/activate
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
pip install tensorboard transformers[torch] evaluate datasets scikit-learn tokenizers tqdm
pip install hydra-core --upgrade
```

To install the project framework and the PyCharm environment, download and install PyCharm from https://www.jetbrains.com/de-de/pycharm/download/other.html and run:

```
cd path/to/project/dir
git clone https://citlab0.math.uni-rostock.de/gitlab/shared/uros_pf
```

(tested with Ubuntu 23.04)

### Windows

#### Virtual environment

1. Install [Miniconda](https://docs.conda.io/en/latest/miniconda.html#windows-installers): Use the installer for Windows 64-bit systems.
   In the installation process, select the suggested default settings. If no shortcut to the Anaconda prompt has been created on the desktop after the installation, create one yourself: Create a new shortcut with the target `cmd /c start "Anaconda Prompt" "C:\path\to\Miniconda3\Scripts\activate"`, replacing `C:\path\to\` with the correct path to the Miniconda installation.
2. Expand the sources that Miniconda can access to install software in virtual environments: By default, Anaconda often only has access to older versions of software packages. To expand the scope, we add another community-maintained source, **Conda-Forge**. To do this, open the Anaconda prompt and type: `conda config --add channels conda-forge`. Conda-Forge should now be an entry in the list of channels for obtaining software packages; you can check this with `conda config --show channels`.
3. Create a new virtual environment: In the Anaconda prompt, you should see the prefix `(base)` in the line before the cursor, which means that you are currently in the base environment of Anaconda. To create the new environment, type: `conda create -n llm_workshop python=3.11 git`. You can replace `llm_workshop` with a name of your choice.
4. Activate the new virtual environment: `conda activate llm_workshop`. If you forget the name of the environment or want to see an overview of all created virtual environments, type: `conda env list`.
5. Update the packages *pip* and *setuptools* installed with Python: `python -m pip install -U pip && pip install -U setuptools`.
6. Install the Python modules needed for the workshop: `pip install torch torchvision torchaudio transformers[torch] tensorboard evaluate datasets scikit-learn notebook hydra-core --upgrade`.

#### Workshop folder and Uni Rostock PyTorch Framework repository (uros_pf)

7. Create a project folder on your system. Change to this project folder in the Anaconda prompt.
8. Clone the `uros_pf` repository into the project folder: `git clone https://citlab0.math.uni-rostock.de/gitlab/shared/uros_pf.git`.
9. Create the folder `workdir` in the project folder, so that the project folder now contains two folders: `workdir` and `uros_pf`. In `workdir` we will store data, configurations and models.

#### PyCharm

10. Install [PyCharm](https://www.jetbrains.com/de-de/pycharm/download/other.html) (community version 2023.2.1 for Windows).
11. Start PyCharm and open the repository folder `uros_pf` in the project folder: Click on the main menu icon in the upper left corner, click on `Open` and select the repository folder. Answer the following security question with `Trust Project`.
12. Activate your virtual environment in PyCharm: PyCharm can use the virtual environment created in step 3 as a working environment, and with it the Python 3.11 interpreter it contains. Click on the Python interpreter selection in the lower right corner. If your Miniconda environment is not yet available, select `Add New Interpreter` and then `Add Local Interpreter`. In the `Add Python Interpreter` dialog that opens, select `Conda Environment` as the type on the left. Then enter the path to `conda.exe` of your Miniconda installation on the right: `C:\path\to\Miniconda3\Scripts\conda.exe`. Then activate `Use existing environment` and select your virtual environment, which should now appear in the dropdown menu. Confirm with `OK`. PyCharm can now use the Python modules you installed in the virtual environment in step 6.

### Uni Rostock PyTorch Framework (uros_pf)

13. To start the sample application of the Uni Rostock PyTorch Framework, first create the following folder structure in the currently empty folder `workdir`:

    ```
    ├── data
    │   ├── ag_news_corpus
    │   │   ├── test.csv
    │   │   ├── train.csv
    ├── config
    │   ├── ag_linear_config.yaml
    ```

14. Download the training data (`train.csv` and `test.csv`) from [mhjabreel's GitHub repository](https://github.com/mhjabreel/CharCnn_Keras/tree/master/data/ag_news_csv) and place it in `data/ag_news_corpus`.
15. Use the configuration file from the section [Sample project](#sample-project) and save it as `config/ag_linear_config.yaml`.
16. To start the training, set up a **Run / Debug Configuration** in PyCharm: This allows certain `.py` files to be run directly with specified parameters. To set up a new configuration, click on the dropdown menu `Current File` in the top centre and select `Edit Configurations`. Now click on `Add new` and select `Python`. Give the configuration a name: `Run Trainer`. Select the file `uros_pf/trainer/trainer.py` as **script**. Define the folder `workdir` in the project folder as **working directory**. Enter `-cn ag_linear_config` in the **Script parameters** field. Save the changes and start the scenario by clicking on `Run`. A terminal will open and the training process can be observed.

This concludes the preparatory installations.
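
The folder layout from step 13 can also be created with a short script. The sketch below is not part of `uros_pf`: the function names `create_workdir_layout` and `missing_files` are hypothetical, and the demo runs in a temporary directory so that it does not touch an existing project.

```python
import tempfile
from pathlib import Path


def create_workdir_layout(workdir: Path) -> list[Path]:
    """Create the data and config folders of the sample project under workdir."""
    folders = [workdir / "data" / "ag_news_corpus", workdir / "config"]
    for folder in folders:
        # parents=True also creates workdir itself; exist_ok makes reruns harmless.
        folder.mkdir(parents=True, exist_ok=True)
    return folders


def missing_files(workdir: Path) -> list[Path]:
    """Return the sample project files that are not yet present in workdir."""
    expected = [
        workdir / "data" / "ag_news_corpus" / "train.csv",
        workdir / "data" / "ag_news_corpus" / "test.csv",
        workdir / "config" / "ag_linear_config.yaml",
    ]
    return [path for path in expected if not path.is_file()]


# Demo in a temporary directory; the folders are created, the files are not.
with tempfile.TemporaryDirectory() as tmp:
    demo_workdir = Path(tmp) / "workdir"
    create_workdir_layout(demo_workdir)
    for path in missing_files(demo_workdir):
        print("still missing:", path.relative_to(demo_workdir))
```

In a real project, pass the path to your own `workdir` instead of the temporary directory, then add the data and the configuration file as described in steps 14 and 15.
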