Commit e549889

Some improvements to config docs
1 parent 66227c5 commit e549889

2 files changed: +41 -16 lines changed

quickstart/config.md

Lines changed: 38 additions & 13 deletions
@@ -1,22 +1,41 @@
 
 # Configuration file
 
-The configuration file - example at `configs/config.yml` - is used to store all relevant configurations, like paths to the datasets and spark parameters. It should be appropriately edited before executing the code. <br>
+The configuration file - example at `configs/config.yml` - is used to store all relevant configurations, like paths to the datasets and spark parameters. It should be appropriately edited before executing the code. It is written in [YAML](https://yaml.org/).<br>
 
-The first parameters to set are those related to spark:
+## Spark parameters
+Spark parameters are set under the `spark` heading. Syntax for specifying Spark parameters derives from [Spark's own property names](https://spark.apache.org/docs/latest/configuration.html#available-properties). For example, to configure the parameter `spark.app.name` in Cider config, we'd use
+```
+spark:
+  app:
+    name: "my_first_cider_app"
+```
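To illustrate the mapping described in the added paragraph, here is a minimal Python sketch of how a nested `spark` mapping could be flattened into dotted Spark property names. This is an assumption about the general approach, not Cider's actual loader; the function name `flatten_spark_config` is hypothetical.

```python
from typing import Any, Dict


def flatten_spark_config(node: Dict[str, Any], prefix: str = "spark") -> Dict[str, Any]:
    """Flatten a nested mapping into dotted keys, e.g. spark.app.name.

    Hypothetical helper for illustration only; not part of Cider.
    """
    flat: Dict[str, Any] = {}
    for key, value in node.items():
        dotted = f"{prefix}.{key}"
        if isinstance(value, dict):
            # Descend into nested headings, extending the dotted prefix.
            flat.update(flatten_spark_config(value, dotted))
        else:
            # A leaf value becomes one property.
            flat[dotted] = value
    return flat


# The parsed contents of the `spark` heading from the example above.
spark_section = {"app": {"name": "my_first_cider_app"}}
print(flatten_spark_config(spark_section))
```

Each level of YAML nesting contributes one dot-separated segment, so the nesting in the config has to mirror the property name exactly.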
+
+Here is a more complete example config. It's not meant as an endorsement of these specific config values; optimal choices vary greatly based on your environment and use case.
 
 ```
-spark:
-  app_name: "mm"
-  files:
-    max_partition_bytes: 67108864
+spark:
+  app:
+    name: "my_first_cider_app"
+  master: "local[*]"
+  sql:
+    shuffle:
+      partitions: 144
   driver:
-    memory: "8g" // driver memory
-    max_result_size: "2g" // maximum result size when collecting to driver
-  loglevel: "ERROR"
+    memory: "8G"
+    maxResultSize: "2G"
+    supervise: true
+  executor:
+    memory: "8G"
+  rpc:
+    askTimeout: "600s"
+  loglevel: "WARN"
+  logConf: true
 ```
 
-Next we specify folder and file locations. File subpaths are given relative to a "parent" directory: Either the `input_data` directory or the `working` directory (file subpaths should not have leading slashes). The locations of the parent directories must be specified with absolute paths (with leading slashes). Cider will not modify files under the `input_data` directory. It will use the `working` directory for program outputs, some of which may act as inputs for later steps. For example, the featurizer writes features to the `working` directory, and then the ml module reads features back in from that directory. At present, file names/sub-paths written programmatically under the `working` directory are hard-coded and can't be specified in config.
+## File and folder locations
+
+Under the `path` heading, we specify folder and file locations. File subpaths are given relative to a "parent" directory: Either the `input_data` directory or the `working` directory (if you'd rather specify an absolute path, use a leading slash). The locations of the parent directories must be specified with absolute paths (with leading slashes). Cider will not modify files under the `input_data` directory. It will use the `working` directory for program outputs, some of which may act as inputs for later steps. For example, the featurizer writes features to the `working` directory, and then the ml module reads features back in (from that same directory, unless a different one is specified as input). At present, file names/sub-paths written programmatically under the `working` directory are hard-coded and can't be specified in config.
 
 ```
 path:
@@ -42,7 +61,9 @@ path:
 directory_path: "/Users/example/Documents/GD/cider/working_directory/"
 ```
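The path rule in the revised paragraph (subpaths are relative to a parent directory unless they start with a slash, and parent directories must be absolute) can be sketched in Python as follows. This is a minimal illustration under those stated rules, not Cider's implementation; the helper name `resolve_config_path` and the file name `features.csv` are hypothetical.

```python
from pathlib import PurePosixPath


def resolve_config_path(parent_dir: str, subpath: str) -> PurePosixPath:
    """Resolve a configured file path against its parent directory.

    Hypothetical helper mirroring the rule in the docs; not Cider's code.
    """
    parent = PurePosixPath(parent_dir)
    if not parent.is_absolute():
        # Parent directories must be given as absolute paths.
        raise ValueError(f"parent directory must be absolute: {parent_dir!r}")
    candidate = PurePosixPath(subpath)
    # A leading slash marks an absolute path, which is used as-is;
    # anything else is treated as relative to the parent directory.
    return candidate if candidate.is_absolute() else parent / candidate


working = "/Users/example/Documents/GD/cider/working_directory/"
print(resolve_config_path(working, "features.csv"))
print(resolve_config_path(working, "/tmp/other/features.csv"))
```

The first call resolves under the `working` directory, while the second keeps the absolute path unchanged.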
 
-The featurizer module expects certain column and column names, and we can define them in the following section of the config file:
+## Column names
+
+Cider expects certain columns to be present, and we can specify their names under the `col_names` heading (this is not a complete list):
 
 ```
 col_names:
@@ -81,7 +102,9 @@ col_names:
 geo: "tower_id"
 ```
 
-We also have to set a few parameters that will affect the behaviour of some modules:
+## Miscellaneous parameters
+
+We also have to set a few parameters that will affect the behaviour of some modules, under the `params` heading:
 
 ```
 params:
@@ -104,7 +127,9 @@ params:
 opt_in_default: false // if true opt-in is set as default, i.e. all users give their consent unless they opt-out
 ```
 
-Finally, we can set the hyper-parameters that will be tested during a grid-search performed by the ML module:
+## ML tuning parameters
+
+Under the `hyperparams` heading, we set the hyper-parameters that will be tested during a grid-search performed by the ML module:
 
 ```
 hyperparams:

quickstart/installation.md

Lines changed: 3 additions & 3 deletions
@@ -2,8 +2,8 @@
 
 To install, and manage dependencies and virtual environments this project uses Poetry. Follow the [instructions](https://python-poetry.org/docs/) to install Poetry.
 
-From the root directory `poetry update` followed by `poetry install` - this will establish a venv with all the needed dependencies.
+From the root directory `poetry update` followed by `poetry install` - this will establish a virtual environment with all the needed dependencies.
 
-Once your venv is made you can use `poetry run [command]` to run a single CLI command inside the venv.
+Once your virtual environment is made you can use `poetry run [command]` to run a single CLI command inside the virtual environment.
 
-You can use `poetry shell` to enter into the venv.
+You can use `poetry shell` to enter into the virtual environment.
