This folder contains tools and notebooks for scraping, cleaning and analysing Autoscout (as24) listings used in the EuropeVans project.
- as24_scraper.ipynb: scraping pipeline and utilities. Collects listing URLs, scrapes each URL for full listing details, and stores results under data/. All of the above use concurrent workers.
- as24_clean.ipynb: data cleaning. Parses raw pages or exported tables into structured columns (price, mileage, power, dimensions, equipment), encodes categorical fields, and produces cleaned parquet/csv files under data/.
- as24_analysis.ipynb: analysis and modeling. Loads cleaned data, trains or loads a CatBoost price model from catboost_info/ and models/, evaluates predictions, and produces plots and diagnostics.
- as24_scraper.ipynb: Jupyter notebook for scraping/autodata extraction.
- as24_clean.ipynb: Jupyter notebook for cleaning and feature engineering (StockCleaner pipeline).
- as24_analysis.ipynb: Jupyter notebook for analysis, feature inspection, model training and prediction.
- Archive/: raw exported datasets, historical scraped pages and intermediate parquet files.
- data/: cleaned datasets ready for modeling (contains .gitkeep to keep the folder tracked).
- catboost_info/: CatBoost training artifacts, logs and .gitkeep.
- models/: trained model binaries (.cbm) and .gitkeep.
- auto24-api/: helper package and utilities used by the scraper.
- Python 3.8+ (notebooks use typical data stack; CatBoost for model tasks)
- Open as24_scraper.ipynb and run the cells to collect listings. Adjust the target search filters inside the notebook.
- The scraper writes raw outputs to data/ (parquet).
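
The scraping steps are described as using concurrent workers. The following is a minimal sketch of that pattern, assuming a thread pool from the standard library. `fetch_listing` and `scrape_all` are hypothetical names used for illustration and are not taken from the notebook; the placeholder does no network access.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_listing(url):
    # Hypothetical placeholder: the real notebook would download and
    # parse the listing page here. This stub only echoes the URL.
    return {"url": url, "status": "ok"}


def scrape_all(urls, fetch=fetch_listing, max_workers=8):
    """Fetch every URL with a pool of worker threads.

    Returns (results, failures). A failing URL is recorded and skipped
    so that one bad listing does not abort the whole run.
    """
    results, failures = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its URL so failures can be reported.
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results.append(future.result())
            except Exception as exc:
                failures.append((url, repr(exc)))
    return results, failures
```

Results arrive in completion order, not input order, so each result should carry its own URL if it needs to be joined back to the listing index later.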
- Open as24_clean.ipynb.
- Run the cleaning pipeline cells to parse raw HTML/rows into structured columns. The notebook creates dummies, numeric conversions and feature groups used for modeling.
- The resulting cleaned datasets are saved to data/ (.parquet, .csv).
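
As an illustration of the numeric conversions mentioned above, here is a small sketch that turns raw listing strings into integers. The input formats (for example "€ 12.500,-" or "85.000 km"), the column names, and the helpers `parse_int` and `clean_row` are assumptions made for this example; the notebook's actual pipeline may differ.

```python
import re

# First run of digits, allowing '.' inside it as a thousands separator.
_NUMBER = re.compile(r"\d[\d.]*")


def parse_int(text):
    """Return the first number in text as an int, or None if there is none.

    Assumes '.' is only ever a thousands separator (no decimal part).
    """
    if not text:
        return None
    match = _NUMBER.search(text)
    if match is None:
        return None
    return int(match.group().replace(".", ""))


def clean_row(raw):
    """Turn one raw listing (a dict of strings) into typed columns."""
    return {
        "price_eur": parse_int(raw.get("price")),
        "mileage_km": parse_int(raw.get("mileage")),
        "power_kw": parse_int(raw.get("power")),
    }
```

Returning None for unparseable values keeps missing data explicit, so a later step can decide how to impute it.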
- Open as24_analysis.ipynb.
- The notebook loads the cleaned data and then either loads an existing CatBoost model (from models/ or catboost_info/), trains a new one, or both.
- Use the provided evaluation and prediction cells. Ensure the features_cb ordering matches the model when predicting.
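
One way to enforce the features_cb ordering before prediction is sketched below with pandas. `align_features` and its `defaults` argument are hypothetical helpers, not part of the notebook. The sketch raises an error when a missing column has no explicit default, so that nothing is silently zero-filled.

```python
import pandas as pd


def align_features(df, features_cb, defaults=None):
    """Return a copy of df with exactly the columns in features_cb, in order.

    Columns missing from df are created from `defaults`. A missing column
    without a default raises KeyError instead of being filled implicitly.
    Columns not listed in features_cb are dropped.
    """
    defaults = defaults or {}
    out = df.copy()
    for col in features_cb:
        if col not in out.columns:
            if col not in defaults:
                raise KeyError(f"no default for missing feature {col!r}")
            out[col] = defaults[col]
    return out[list(features_cb)]
```

For example, a frame that lacks a categorical feature could be aligned with `align_features(new_df, features_cb, defaults={"fuel": "unknown"})`, where the column name and default value are made up for illustration.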
- .gitignore excludes the contents of data/, catboost_info/, models/ and Archive/; placeholder .gitkeep files are used to keep the directories tracked.
- When predicting with CatBoost, make sure input columns are in the same order as the training features_cb. Fill any missing columns (from new datasets) with sensible defaults rather than zeros when appropriate.
- Consider comparing feature distributions between training and target datasets before predicting, since distribution shift is a common source of bias.
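
A rough way to compare feature distributions is a standardized mean difference per numeric feature, sketched here with the standard library only. This is one simple heuristic among many, not the project's method; the function names and the 0.5 threshold are arbitrary assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def standardized_mean_diff(train_values, target_values):
    """(mean(target) - mean(train)) / std(train).

    If the training values have no spread, returns 0.0 when the means
    match and +inf otherwise.
    """
    mu_train, mu_target = mean(train_values), mean(target_values)
    sigma = pstdev(train_values)
    if sigma == 0:
        return 0.0 if mu_train == mu_target else float("inf")
    return (mu_target - mu_train) / sigma


def flag_shifted(train, target, threshold=0.5):
    """Return {feature: score} for features whose |score| exceeds threshold.

    train and target map feature name -> list of numeric values.
    Only features present in both are compared.
    """
    scores = {
        name: standardized_mean_diff(train[name], target[name])
        for name in train
        if name in target
    }
    return {name: s for name, s in scores.items() if abs(s) > threshold}
```

This only looks at means, so it can miss changes in spread or shape; flagged features are candidates for a closer look, for instance with histograms.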
- Add a requirements.txt.
- Add unit tests for the cleaning functions and a small smoke test for prediction.