If you read this file, you have passed our initial screening. Well done! 👏 👏 👏
🚀 The next step to join the Data Science team of xtream is this assignment. You will find several datasets: please choose only one. For each dataset, we propose several challenges. You do not need to complete all of them, but rather only the ones you feel comfortable about or the ones that interest you.
✨ Choose what really makes you shine!
⌚ The deadline for submission is 10 days after you are provided with the link to this repository, so that you can move at your own pace.
❗ Important: you might feel the tasks are too broad, or the requirements are not fully elicited. This is done on purpose: we wish to let you take your own way in extracting value from the data and in developing your own solutions.
Please fork this repository and work on it as if you were taking on a real-world project. On the deadline, we will check out your work.
❗ Important: At the end of this README, you will find a blank "How to run" section. Please write there instructions on how to run your code.
Your work will be assessed according to several criteria, for instance:
- Work Method
- Understanding of the business problem
- Understanding of the data
- Correctness, completeness, and clarity of the results
- Quality of the codebase
- Documentation
❗ Important: this is not a Kaggle competition, and we do not care about model performance. There is no need to get the best possible model: focus on showing your method and why you would be able to get there, given enough time and support.
Problem type: regression
Dataset description: Diamonds readme
Don Francesco runs a jewelry store. He is a very rich fellow, but his past is shady: be sure not to make him angry. Over the years, he has collected data on 5000 diamonds. The dataset provides physical features of the stones, as well as their value, as estimated by a respected expert.
Francesco wants to know which factors influence the value of a diamond: he is not an expert, he wants simple and clear messages. However, he trusts no one, and he hired another data scientist to get a second opinion on your work. Create a Jupyter notebook to explain what Francesco should look at and why. Your code should be understandable by a data scientist, but your text should be clear for a layman.
Then, Francesco tells you that the expert providing him with the stone valuations disappeared. He wants you to develop a model to predict the value of a new diamond given its characteristics. He insists on one point: his customers are not easy-going, so he wants to know why a stone is given a certain value. Create a Jupyter notebook to meet Francesco's request.
Francesco likes your model! Now he wants to use it. To improve the model, Francesco is open to hiring a new expert and letting him value more stones. Create an automatic pipeline capable of training a new instance of your model from the raw dataset.
Python script for model training
Finally, Francesco wants to embed your model in a web application, to allow for easy use by his employees. Develop a REST API to expose the model predictions.
- Make sure you have Python 3.11 installed.
- It's recommended to use a virtual environment for project dependencies.
- Clone this repository.
git clone <REPO_URL>
cd <REPO_DIRECTORY>
- Install the required packages.
pip install -r requirements.txt
Use the train.py script to train the model on your dataset.
python train.py --data_path /path/to/diamonds.csv
Additional parameters such as model depth, learning rate, etc., can be set using the command-line arguments. Run python train.py -h to see all available options.
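As a rough illustration of how such command-line options could be declared, here is a minimal sketch using the standard library's argparse. Only --data_path is documented above; the --max_depth and --learning_rate flag names and their defaults are assumptions made for this example, so check python train.py -h for the real ones.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser similar to what a training script might expose."""
    parser = argparse.ArgumentParser(description="Train a diamond price model.")
    # --data_path is the only flag documented in this README.
    parser.add_argument("--data_path", required=True, help="Path to the diamonds CSV file.")
    # Hypothetical flags: names and defaults are assumptions for illustration.
    parser.add_argument("--max_depth", type=int, default=6, help="Model depth.")
    parser.add_argument("--learning_rate", type=float, default=0.1, help="Learning rate.")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["--data_path", "diamonds.csv", "--max_depth", "4"])
print(args.data_path, args.max_depth, args.learning_rate)
```

Passing an explicit list to parse_args keeps the example independent of how the script is launched; a real script would normally call parse_args() with no arguments.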
Start the FastAPI server with:
uvicorn api:app --reload
Once the API is running, you can:
- Send a POST request to http://127.0.0.1:8000/predict/ with diamond data in JSON format to get a price prediction. Example with curl:
curl -X POST http://127.0.0.1:8000/predict/ \
  -H "Content-Type: application/json" \
  -d '{ "carat": 0.3, "cut": "Good", "color": "E", "clarity": "SI2", "depth": 61.5, "table": 55, "x": 4.29, "y": 4.31, "z": 2.63 }'
- Send a POST request to http://127.0.0.1:8000/predict_csv/ with a CSV file containing diamond data to get predictions for each row. Example with curl:
curl -X POST http://127.0.0.1:8000/predict_csv/ \
  -H "Content-Type: multipart/form-data" \
  -F "file=@/path/to/test_data.csv"
- Train a new model by sending a POST request to http://127.0.0.1:8000/train/ with a CSV file containing diamond data.
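For callers who prefer Python over curl, the single-prediction request can be sketched with the standard library alone. The helper name build_predict_request is hypothetical, and the response format is not documented here, so the sketch only builds the request and leaves the actual call commented out.

```python
import json
import urllib.request

# Base URL taken from the instructions above; adjust if the server runs elsewhere.
BASE_URL = "http://127.0.0.1:8000"


def build_predict_request(diamond: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the /predict/ endpoint."""
    body = json.dumps(diamond).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/predict/",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Same example diamond as in the curl command.
diamond = {
    "carat": 0.3, "cut": "Good", "color": "E", "clarity": "SI2",
    "depth": 61.5, "table": 55, "x": 4.29, "y": 4.31, "z": 2.63,
}
request = build_predict_request(diamond)

# Sending requires the API to be running. The response schema is an unknown
# here, so the raw body would simply be printed:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```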
- Ensure that the diamond data provided has all the necessary columns (carat, cut, color, clarity, depth, table, x, y, z).
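One way to check a CSV file before uploading it is to compare its header against the column list above. This is a minimal sketch using only the standard library; the function name missing_columns is hypothetical, and the API itself may validate its input differently.

```python
import csv
import io

# Column names as listed in the note above.
REQUIRED_COLUMNS = ["carat", "cut", "color", "clarity", "depth", "table", "x", "y", "z"]


def missing_columns(csv_file) -> list:
    """Return the required columns absent from the CSV header row."""
    reader = csv.reader(csv_file)
    header = next(reader, [])  # an empty file yields an empty header
    present = {name.strip() for name in header}
    return [name for name in REQUIRED_COLUMNS if name not in present]


# In-memory sample that lacks the "z" column.
sample = io.StringIO(
    "carat,cut,color,clarity,depth,table,x,y\n"
    "0.3,Good,E,SI2,61.5,55,4.29,4.31\n"
)
print(missing_columns(sample))  # ['z']
```

For a file on disk, the same function can be called on a handle opened with open(path, newline=""); an empty result means every required column is present.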