DNBCFoodTextClassifier

About the model

DNBCFoodTextClassifier is a RoBERTa-base model for multi-label classification of food entities in Danish text. It has been trained to recognise 17 major categories and 62 subcategories of food items using data from the Danish National Birth Cohort (DNBC). Specifically, the model was trained on a dataset of 38,936 free-text responses to questions about specific foods introduced and avoided during pregnancy, and tested on an additional 4,326 responses from the same cohort.

You can explore the performance metrics for this model through the DNBCFoodTextClassifier Shiny App.

This model and its performance metrics for each category have been developed as part of the study: Food choices during pregnancy: evidence from 63,405 Danish women by Erica Elizabeth Eberl, Anne Ahrendt Bjerregaard, Arnór Ingi Sigurdsson, Siddhi Yash Jain, Ann-Marie Hellerung Christiansen, Charlotta Granström, Matthew Paul Gillum, Thorhallur Ingi Halldórsson, Simon Rasmussen, Sjurdur Frodi Olsen, Ruth J.F. Loos, and Marta Guasch-Ferré.

Food Classification Model - Prediction Tutorial

This tutorial demonstrates how to use the pre-trained food classification models for Danish food text classification. You can find the pretrained models here: https://huggingface.co/arnorsig/danish-food-classification

✏️ Note: The tutorial currently only supports Linux/MacOS.

Available Models

Two pre-trained models are available:

  1. Major Category Model (majorcateg_run3jseq): Classifies food into 17 major categories (Alcohol, CoffeeTea, Dairy, Egg, Fastfood, Fish, Fruit, Grains, MeatPoultry, MixedDishes, etc.)
  2. Subcategory Model (subcateg_run3jseq): Classifies food into 63 detailed subcategories (CultMilk, UnflavMilk, RedMeat, WhiteMeat, SafeFish, Wholegrain, etc.)

Both models use a Danish RoBERTa-based sequence model (DDSC/roberta-base-danish) for text encoding.

Model Structure

Each model is approximately 497-543 MB and has the following structure:

model_name/
└── training_output/
    ├── meta/
    │   └── eir_version.txt
    ├── model_info.txt
    ├── saved_models/
    │   └── output_model_*.pt
    └── serializations/
        ├── configs_stripped/
        ├── sequence_input_serializations/
        ├── tabular_output_serializations/
        └── transformers/

Prerequisites

Step 1: Install UV Package Manager

If you don’t already have UV installed:

curl -LsSf https://astral.sh/uv/install.sh | sh

Verify installation:

uv --version

Step 2: Install EIR

Create a Python 3.12 virtual environment and install EIR:

uv venv --python 3.12
source .venv/bin/activate  
uv pip install git+https://github.com/arnor-sigurdsson/EIR.git@0.13.x-maintenance

Note: We install from the GitHub maintenance branch due to a dependency issue with the PyPI release.

Verify installation:

eirpredict --help

Step 3: Download scripts and model files

Clone the GitHub repository for scripts:

git clone https://github.com/LoosTeam/DNBC_PregnancyFoodChoices.git

Navigate to the prediction_tutorial directory:

cd prediction_tutorial
mkdir models

Download model files from Hugging Face:

uvx hf download arnorsig/danish-food-classification --local-dir models/

Step 4: Prepare Your Input Data

Create a CSV file with two columns: ID and Sequence. The Sequence column should contain Danish food text descriptions.

An example file (example_inputs.csv) is included in the package:

ID,Sequence
sample_1,Skyr med blåbær og granola
sample_2,Ribeye bøf med kartofler og grøntsager
sample_3,Laks filet med citron
sample_4,Æblemost uden sukker
sample_5,Hvidt brød med smør og ost
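If you are generating input files programmatically, the same two-column layout can be written with Python's csv module (the file name my_inputs.csv is just an example, not one the tutorial requires):

```python
import csv

rows = [
    ("sample_1", "Skyr med blåbær og granola"),
    ("sample_2", "Ribeye bøf med kartofler og grøntsager"),
]

# Write the two required columns: ID and Sequence.
with open("my_inputs.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["ID", "Sequence"])
    writer.writerows(rows)
```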

Step 5: Configuration Files

Three configuration files are included in the package:

  - predict_global_config.yaml - Global settings (batch size, device, etc.)
  - predict_input_config.yaml - Input data configuration
  - predict_output_config.yaml - Output categories configuration

5.1 Input Configuration

The package includes predict_input_config.yaml template. To use your own data, update the input_source field:

input_info:
  input_inner_key: null
  input_name: foodtext
  input_source: example_inputs.csv  # Change this to your CSV file path
  input_type: sequence
input_type_info:
  adaptive_tokenizer_max_vocab_size: null
  max_length: 81
  min_freq: 10
  mixing_subtype: mixup
  modality_dropout_rate: 0.0
  sampling_strategy_if_longer: uniform
  split_on: ' '
  tokenizer: null
  tokenizer_language: null
  vocab_file: null
interpretation_config:
  interpretation_sampling_strategy: first_n
  manual_samples_to_interpret: null
  num_samples_to_interpret: 10
model_config:
  embedding_dim: 64
  freeze_pretrained_model: false
  model_init_config: {}
  model_type: DDSC/roberta-base-danish
  pool: avg
  position: embed
  position_dropout: 0.1
  pretrained_model: true
  window_size: 0
pretrained_config: null
tensor_broker_config: null

Important: Update input_source to point to your CSV file (use absolute path).

5.2 Output Configuration

You can copy this directly from the model’s serialization folder. For the subcategory model, you can use subcateg_run3jseq/training_output/serializations/configs_stripped/output_configs.yaml.

The key difference for prediction is setting output_source: null (no labels needed):

output_info:
  output_inner_key: null
  output_name: foodcateg_output
  output_source: null
  output_type: tabular

5.3 Global Configuration

Create predict_global_config.yaml:

attribution_analysis:
  compute_attributions: false
basic_experiment:
  batch_size: 64
  dataloader_workers: 0
  device: cpu
  memory_dataset: true
visualization_logging:
  log_level: info
  no_pbar: false

Note: Set device: cuda if you have a GPU available for faster predictions.

Step 6: Run Predictions

Major Category Model

eirpredict \
  --global_configs configs/predict_global_config.yaml \
  --input_configs configs/predict_input_config.yaml \
  --output_configs configs/predict_output_majorcateg_config.yaml \
  --model_path models/majorcateg_run3jseq/training_output/saved_models/output_model_5400_perf-average=0.9686.pt \
  --output_folder predictions_output

Subcategory Model

eirpredict \
  --global_configs configs/predict_global_config.yaml \
  --input_configs configs/predict_input_config.yaml \
  --output_configs configs/predict_output_subcateg_config.yaml \
  --model_path models/subcateg_run3jseq/training_output/saved_models/output_model_6300_perf-average=0.9312.pt \
  --output_folder predictions_output

Step 7: Understanding the Results

Result Structure

Predictions are organized in a hierarchical folder structure:

predictions_output/
└── foodcateg_output/
    ├── CultMilk/
    │   └── predictions.csv
    ├── RedMeat/
    │   └── predictions.csv
    ├── SafeFish/
    │   └── predictions.csv
    └── ... (one folder per category)

Prediction Format

Each predictions.csv contains binary classification logits:

ID,0,1
sample_1,4.7698545,-5.208173
sample_2,-1.7498194,2.2772772
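The two columns are the raw logits for the negative (0) and positive (1) class. If you prefer probabilities, a softmax over the logit pair (equivalently, a sigmoid of the logit margin) converts them. A minimal sketch using the sample logits above:

```python
import math

def positive_probability(logit_neg: float, logit_pos: float) -> float:
    """Softmax over two binary logits, returning P(class 1)."""
    m = max(logit_neg, logit_pos)  # subtract max for numerical stability
    e0 = math.exp(logit_neg - m)
    e1 = math.exp(logit_pos - m)
    return e1 / (e0 + e1)

print(positive_probability(4.7698545, -5.208173))   # sample_1: near 0
print(positive_probability(-1.7498194, 2.2772772))  # sample_2: near 1
```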

Aggregating Predictions

To get a readable summary, use this Python script:

import pandas as pd
from pathlib import Path

predictions_dir = Path("predictions_output/foodcateg_output")
categories = [d.name for d in predictions_dir.iterdir() if d.is_dir()]

all_predictions = {}

for category in sorted(categories):
    csv_path = predictions_dir / category / "predictions.csv"
    df = pd.read_csv(csv_path)

    # A category is predicted when the class-1 logit exceeds the class-0
    # logit; the confidence score is the logit margin.
    df['prediction'] = (df['1'] > df['0']).astype(int)
    df['confidence'] = df['1'] - df['0']

    for _, row in df.iterrows():
        sample_id = row['ID']
        if sample_id not in all_predictions:
            all_predictions[sample_id] = {}
        all_predictions[sample_id][category] = {
            'predicted': row['prediction'],
            'confidence': row['confidence']
        }

for sample_id in sorted(all_predictions.keys()):
    print(f"\n{sample_id}:")
    predictions = all_predictions[sample_id]

    positive_predictions = [(cat, info['confidence'])
                           for cat, info in predictions.items()
                           if info['predicted'] == 1]

    positive_predictions.sort(key=lambda x: x[1], reverse=True)

    print(f"  Predicted categories ({len(positive_predictions)}):")
    for cat, conf in positive_predictions:
        print(f"    - {cat}: {conf:.2f}")
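If you would rather have one wide table (one row per sample, one 0/1 column per category) than printed output, the per-sample dictionary can be pivoted into a DataFrame. A minimal sketch with hypothetical values, shaped like the all_predictions dict built by the aggregation script:

```python
import pandas as pd

# Hypothetical per-sample results, shaped like the all_predictions
# dict built by the aggregation script (values here are made up).
all_predictions = {
    "sample_1": {"CultMilk": {"predicted": 1, "confidence": 2.1},
                 "RedMeat": {"predicted": 0, "confidence": -3.4}},
    "sample_2": {"CultMilk": {"predicted": 0, "confidence": -1.0},
                 "RedMeat": {"predicted": 1, "confidence": 4.0}},
}

# One row per sample, one 0/1 column per category.
wide = pd.DataFrame(
    {sid: {cat: info["predicted"] for cat, info in cats.items()}
     for sid, cats in all_predictions.items()}
).T.fillna(0).astype(int)

wide.to_csv("predictions_summary.csv")
```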

Example Output

sample_1:
  Predicted categories (3):
    - TropFruit: 1.78
    - StoneFruit: 0.46
    - SoftCheese: 0.32

sample_2:
  Predicted categories (2):
    - RootTuber: 6.54
    - RedMeat: 4.03

sample_3:
  Predicted categories (2):
    - SafeFish: 3.29
    - TropFruit: 1.76

Interpreting Confidence Scores

In the aggregation script above, the confidence score is the margin between the two logits (df['1'] - df['0']). A positive margin means the category is predicted, and larger margins indicate stronger model confidence; applying the sigmoid function to the margin gives the positive-class probability.

Model Performance

These models were trained on Danish food text data using EIR version 0.13.9.

Troubleshooting

Version Warning

You may see warnings about version mismatches (e.g., trained on 0.13.9, running on 0.13.11). This is expected when using the maintenance branch and should not cause issues.

GPU/CPU

If you encounter memory issues with CPU, reduce batch_size in the global config. If you have a GPU, set device: cuda for faster predictions.

Missing Dependencies

If you encounter issues installing from GitHub, ensure you have git installed and network access to GitHub.

Step 8: Serving the Model as a Web Service

You can deploy the model as a REST API using eirserve:

Starting the Server

eirserve --model-path models/subcateg_run3jseq/training_output/saved_models/output_model_6300_perf-average=0.9312.pt

The server will start on http://localhost:8000 with:

  - OpenAPI documentation: http://localhost:8000/docs (interactive API explorer)
  - Model info: http://localhost:8000/info (input/output specifications)
  - ReDoc: http://localhost:8000/redoc (alternative API documentation)

Making Predictions via API

Here’s a Python example for sending requests:

import requests

url = "http://localhost:8000/predict"

payload = [{
    "foodtext": "Ribeye bøf med kartofler og grøntsager"
}]

response = requests.post(url, json=payload)
predictions = response.json()

print(predictions)

Example Response:

{
  "result": [
    {
      "foodcateg_output": {
        "RedMeat": 0.982,
        "RootTuber": 0.998,
        "ProcVeg": 0.156,
        ...
      }
    }
  ]
}
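To turn a response like this into a list of predicted categories, you can threshold the scores and sort by probability. A sketch using the example response above (the 0.5 cutoff is an assumption for illustration, not a value prescribed by the model):

```python
# Response dict shaped like the example above (truncated categories).
response_json = {
    "result": [
        {
            "foodcateg_output": {
                "RedMeat": 0.982,
                "RootTuber": 0.998,
                "ProcVeg": 0.156,
            }
        }
    ]
}

scores = response_json["result"][0]["foodcateg_output"]

# Categories scoring at or above the (assumed) 0.5 threshold, strongest first.
predicted = sorted(
    (cat for cat, p in scores.items() if p >= 0.5),
    key=scores.get,
    reverse=True,
)
print(predicted)  # ['RootTuber', 'RedMeat']
```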

Using the Interactive API

  1. Navigate to http://localhost:8000/docs
  2. Click on the /predict endpoint
  3. Click “Try it out”
  4. Enter your food text in the JSON payload
  5. Click “Execute”

The OpenAPI interface provides:

  - Interactive testing
  - Request/response examples
  - Schema documentation
  - Easy integration testing

Batch Predictions

Send multiple samples in one request:

payload = [
    {"foodtext": "Skyr med blåbær"},
    {"foodtext": "Laks filet"},
    {"foodtext": "Kaffe"}
]

response = requests.post("http://localhost:8000/predict", json=payload)

This is ideal for:

  - Integrating with web applications
  - Real-time predictions
  - Building food logging apps
  - API-based microservices

Citation

If you use these models, please cite the paper:

Food choices during pregnancy: evidence from 63,405 Danish women by Erica Elizabeth Eberl, Anne Ahrendt Bjerregaard, Arnór Ingi Sigurdsson, Siddhi Yash Jain, Ann-Marie Hellerung Christiansen, Charlotta Granström, Matthew Paul Gillum, Thorhallur Ingi Halldórsson, Simon Rasmussen, Sjurdur Frodi Olsen, Ruth J.F. Loos, and Marta Guasch-Ferré.

and the EIR framework:

EIR Framework: https://github.com/arnor-sigurdsson/EIR
Model trained using EIR version 0.13.9