quickstart/config.md: 38 additions & 13 deletions
@@ -1,22 +1,41 @@
 
 # Configuration file
 
-The configuration file - example at `configs/config.yml` - is used to store all relevant configurations, like paths to the datasets and spark parameters. It should be appropriately edited before executing the code. <br>
+The configuration file - example at `configs/config.yml` - is used to store all relevant configurations, like paths to the datasets and spark parameters. It should be appropriately edited before executing the code. It is written in [YAML](https://yaml.org/).<br>
 
-The first parameters to set are those related to spark:
+## Spark parameters
+Spark parameters are set under the `spark` heading. The syntax for specifying Spark parameters derives from [Spark's own property names](https://spark.apache.org/docs/latest/configuration.html#available-properties). For example, to configure the parameter `spark.app.name` in the Cider config, we'd use
+```
+spark:
+  app:
+    name: "my_first_cider_app"
+```
+
+Here is a more complete example config. It's not meant as an endorsement of these specific config values; optimal choices vary greatly based on your environment and use case.
 
 ```
-spark:
-  app_name: "mm"
-  files:
-    max_partition_bytes: 67108864
+spark:
+  app:
+    name: "my_first_cider_app"
+  master: "local[*]"
+  sql:
+    shuffle:
+      partitions: 144
   driver:
-    memory: "8g" // driver memory
-    max_result_size: "2g" // maximum result size when collecting to driver
-  loglevel: "ERROR"
+    memory: "8G"
+    maxResultSize: "2G"
+    supervise: true
+  executor:
+    memory: "8G"
+  rpc:
+    askTimeout: "600s"
+  loglevel: "WARN"
+  logConf: true
 ```
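
To illustrate how the nested keys in this example line up with Spark's dotted property names, here is a minimal Python sketch. It is not Cider's actual loader: the function name `flatten_spark_config` is hypothetical, the `spark` section is written out as the dict a YAML parser would typically return, and `loglevel` is left out on the assumption that it is handled separately from ordinary Spark properties.

```python
from typing import Any, Dict


def flatten_spark_config(node: Dict[str, Any], prefix: str = "spark") -> Dict[str, str]:
    """Flatten nested config keys into dotted names such as `spark.app.name`.

    Hypothetical helper for illustration only; not part of Cider or Spark.
    """
    flat: Dict[str, str] = {}
    for key, value in node.items():
        name = f"{prefix}.{key}"
        if isinstance(value, dict):
            # Recurse into nested sections, extending the dotted prefix.
            flat.update(flatten_spark_config(value, name))
        elif isinstance(value, bool):
            # Render booleans as lowercase strings ("true"/"false").
            flat[name] = str(value).lower()
        else:
            flat[name] = str(value)
    return flat


# The `spark` section of the example config, as a plain dict.
# `loglevel` is omitted here (assumption: it is not a regular Spark property).
spark_section = {
    "app": {"name": "my_first_cider_app"},
    "master": "local[*]",
    "sql": {"shuffle": {"partitions": 144}},
    "driver": {"memory": "8G", "maxResultSize": "2G", "supervise": True},
    "executor": {"memory": "8G"},
    "rpc": {"askTimeout": "600s"},
    "logConf": True,
}

properties = flatten_spark_config(spark_section)
for name, value in sorted(properties.items()):
    print(f"{name} = {value}")
```

Each resulting key, for example `spark.sql.shuffle.partitions`, can be looked up directly in Spark's list of available properties.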
 
-Next we specify folder and file locations. File subpaths are given relative to a "parent" directory: Either the `input_data` directory or the `working` directory (file subpaths should not have leading slashes). The locations of the parent directories must be specified with absolute paths (with leading slashes). Cider will not modify files under the `input_data` directory. It will use the `working` directory for program outputs, some of which may act as inputs for later steps. For example, the featurizer writes features to the `working` directory, and then the ml module reads features back in from that directory. At present, file names/sub-paths written programmatically under the `working` directory are hard-coded and can't be specified in config.
+## File and folder locations
+
+Under the `path` heading, we specify folder and file locations. File subpaths are given relative to a "parent" directory: Either the `input_data` directory or the `working` directory (if you'd rather specify an absolute path, use a leading slash). The locations of the parent directories must be specified with absolute paths (with leading slashes). Cider will not modify files under the `input_data` directory. It will use the `working` directory for program outputs, some of which may act as inputs for later steps. For example, the featurizer writes features to the `working` directory, and then the ml module reads features back in (from that same directory, unless a different one is specified as input). At present, file names/sub-paths written programmatically under the `working` directory are hard-coded and can't be specified in config.
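
The path rule described in the new paragraph (subpaths are relative to a parent directory unless they start with a slash) can be sketched in Python as follows. This is an illustration under stated assumptions, not Cider's implementation: `resolve_path` and the directory and file names used below are hypothetical.

```python
from pathlib import PurePosixPath


def resolve_path(parent_dir: str, subpath: str) -> PurePosixPath:
    """Resolve a configured subpath against its parent directory.

    Hypothetical helper for illustration only; not part of Cider.
    """
    parent = PurePosixPath(parent_dir)
    if not parent.is_absolute():
        # Parent directories must be given as absolute paths.
        raise ValueError(f"parent directory must be absolute: {parent_dir}")
    # pathlib discards the left-hand side when the right-hand side is
    # absolute, so a subpath with a leading slash is returned unchanged.
    return parent / subpath


# Hypothetical locations, chosen only to show both cases.
relative_case = resolve_path("/data/input_data", "cdr/calls.csv")
absolute_case = resolve_path("/data/input_data", "/mnt/shared/calls.csv")

print(relative_case)
print(absolute_case)
```

Relying on the path-joining semantics of `pathlib` keeps the leading-slash override to a single expression, at the cost of making the rule implicit; an explicit `if` on `subpath.startswith("/")` would be an equally reasonable design.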
quickstart/installation.md: 3 additions & 3 deletions
@@ -2,8 +2,8 @@
 
 To install, and manage dependencies and virtual environments this project uses Poetry. Follow the [instructions](https://python-poetry.org/docs/) to install Poetry.
 
-From the root directory `poetry update` followed by `poetry install` - this will establish a venv with all the needed dependencies.
+From the root directory `poetry update` followed by `poetry install` - this will establish a virtual environment with all the needed dependencies.
 
-Once your venv is made you can use `poetry run [command]` to run a single CLI command inside the venv.
+Once your virtual environment is made you can use `poetry run [command]` to run a single CLI command inside the virtual environment.
 
-You can use `poetry shell` to enter into the venv.
+You can use `poetry shell` to enter into the virtual environment.