This project scrapes car offers from the Otomoto website and builds ML models with Natural Language Processing to predict the car price from the offer details.

Advanced-Data-Mining/Otomoto_data_mining

Otomoto Data Mining

We want to scrape car advertisements and build models capable of predicting price based on car details.

Data Storage

Data is saved as Parquet files in the data/ directory, with each page of results stored as page_XXX.parquet for efficient storage and analysis.
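The naming scheme above could be handled with a small helper like the sketch below. The names `page_path` and `load_all_pages` are hypothetical, and the three-digit zero padding is an assumption read off the `XXX` placeholder. Reading the files assumes pandas with a Parquet engine such as pyarrow is installed.

```python
from pathlib import Path

DATA_DIR = Path("data")  # directory named in the text above


def page_path(page: int, data_dir: Path = DATA_DIR) -> Path:
    """Return the Parquet path for one results page, e.g. data/page_007.parquet.

    The three-digit zero padding is an assumption based on the XXX placeholder.
    """
    if page < 0:
        raise ValueError("page must be non-negative")
    return data_dir / f"page_{page:03d}.parquet"


def load_all_pages(data_dir: Path = DATA_DIR):
    """Concatenate every page_*.parquet file into a single DataFrame."""
    # Imported lazily so the path helper works without pandas installed.
    import pandas as pd

    files = sorted(data_dir.glob("page_*.parquet"))
    if not files:
        raise FileNotFoundError(f"no page_*.parquet files in {data_dir}")
    return pd.concat((pd.read_parquet(f) for f in files), ignore_index=True)
```

Zero-padded page numbers keep the files in page order under a plain lexicographic sort, which is why `sorted` is enough here.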

Results

  • number of price classes for each linear bin step:
    • 20_000 step -> 14 classes,
    • 30_000 step -> 10 classes,
    • 40_000 step -> 7 classes,
  • min_price=20_000
  • max_price=300_000
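
The class counts listed above follow from the price range and the step. A minimal sketch of that arithmetic, assuming the count is rounded up so a partial last bin still counts as a class (which matches the 10 classes at the 30_000 step). The function names are hypothetical, and clipping out-of-range prices to the first or last class is also an assumption:

```python
import math

MIN_PRICE = 20_000
MAX_PRICE = 300_000


def n_classes(step: int, min_price: int = MIN_PRICE, max_price: int = MAX_PRICE) -> int:
    """Number of linear bins of width `step` needed to cover the price range."""
    return math.ceil((max_price - min_price) / step)


def linear_class(price: float, step: int,
                 min_price: int = MIN_PRICE, max_price: int = MAX_PRICE) -> int:
    """Index of the linear bin that contains `price`.

    Prices outside [min_price, max_price) are clipped to the first or last
    class; this clipping is an assumption, not taken from the project.
    """
    last = n_classes(step, min_price, max_price) - 1
    idx = int((price - min_price) // step)
    return max(0, min(last, idx))


for step in (20_000, 30_000, 40_000):
    print(step, n_classes(step))
```
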

| Model | Linear bins step* | Accuracy |
| --- | --- | --- |
| DecisionTree + TFIDF | 20_000 | 40% |
| DecisionTree + TFIDF | 30_000 | 46% |
| DecisionTree + TFIDF | 40_000 | 52% |
| LogisticRegression + TFIDF | 20_000 | 49% |
| LogisticRegression + TFIDF | 30_000 | 55% |
| LogisticRegression + TFIDF | 40_000 | 63% |
| RF + TFIDF (only descr) | 20_000 | 51% |
| RF + TFIDF (only descr) | 30_000 | 55% |
| RF + TFIDF (only descr) | 40_000 | 62% |
| RFECV (num only) | 20_000 | 52% |
| RFECV (num only) | 30_000 | 59% |
| RFECV (num only) | 40_000 | 71% |
| RF + RFECV + TFIDF (descr + num) | 20_000 | 53% |
| RF + RFECV + TFIDF (descr + num) | 30_000 | 59% |
| RF + RFECV + TFIDF (descr + num) | 40_000 | 68% |
| HerBERT fine-tuned | 20_000 | 56% |
| HerBERT (Near-miss=1) | 20_000 | 85% |

*The steps above refer to linear bins. The final models use a logarithmic split instead, with the number of bins taken from the corresponding linear split.
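
One way to build such a logarithmic split is sketched below: the bin count comes from the linear split (for example 14 for the 20_000 step), and the edges are spaced evenly in log-price between min_price and max_price. This is an illustration under those assumptions, not necessarily the exact scheme used in the project. The names `log_bin_edges` and `log_class` are hypothetical.

```python
import math
from bisect import bisect_right


def log_bin_edges(n_bins: int, min_price: float = 20_000,
                  max_price: float = 300_000) -> list[float]:
    """Return n_bins + 1 edges spaced evenly in log-price."""
    log_min, log_max = math.log(min_price), math.log(max_price)
    width = (log_max - log_min) / n_bins
    return [math.exp(log_min + i * width) for i in range(n_bins + 1)]


def log_class(price: float, edges: list[float]) -> int:
    """Index of the bin containing `price`, clipped to the valid class range."""
    idx = bisect_right(edges, price) - 1
    return max(0, min(len(edges) - 2, idx))


edges = log_bin_edges(14)  # 14 classes, as for the 20_000 linear step
print([round(e) for e in edges])
print(log_class(25_000, edges))
```

With log spacing each bin spans the same price ratio rather than the same price difference, so cheap cars get narrower bins than expensive ones.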
