
Commit 2232633

2 parents: b2b57b8 + b0324e0

9 files changed: 121 additions, 31 deletions

README.md

Lines changed: 47 additions & 1 deletion
@@ -62,8 +62,54 @@ pip install -e .
```

If you want to evaluate each modality of data, please use the following commands:
+<details>
+<summary>
+<b>text data eval</b>
+</summary>
+<p>
+
+```bash
+pip install -e .[text]
+pip install flash-attn==2.6.3
+python -m spacy download en_core_web_sm
+```
+
+</p>
+</details>
+
+<details>
+<summary>
+<b>image data eval</b>
+</summary>
+<p>
+
+```bash
+pip install -e .[image]
+pip install pyiqa==0.1.12
+pip install transformers==4.44.2
+```
+
+</p>
+</details>
+
+
+<details>
+<summary>
+<b>video data eval</b>
+</summary>
+<p>
+
+```bash
+pip install -e .[video]
+```
+When evaluating video-caption data, please run the following command to install the modified CLIP for EMScore:
+```
+pip install git+https://github.com/MOLYHECI/CLIP.git
+```
+
+</p>
+</details>

-All dependencies can be installed by:
<details>
<summary>
<b>All dependencies</b>

README.zh-CN.md

Lines changed: 48 additions & 6 deletions
@@ -56,18 +56,64 @@ DataFlow-Eval 是一个数据质量评估系统,可以从多个维度评估数

You can set up the conda environment with the following commands
```
-
conda create -n dataflow python=3.9

conda activate dataflow

pip install -e .
-
```


If you want to evaluate data of a single modality, you can use the installation commands below 👇

+<details>
+<summary>
+<b>text data eval</b>
+</summary>
+<p>
+
+```bash
+pip install -e .[text]
+pip install flash-attn==2.6.3
+python -m spacy download en_core_web_sm
+```
+
+</p>
+</details>
+
+<details>
+<summary>
+<b>image data eval</b>
+</summary>
+<p>
+
+```bash
+pip install -e .[image]
+pip install pyiqa==0.1.12
+pip install transformers==4.44.2
+```
+
+</p>
+</details>
+
+
+<details>
+<summary>
+<b>video data eval</b>
+</summary>
+<p>
+
+```bash
+pip install -e .[video]
+```
+When evaluating video-caption data, please run the following command to install the modified CLIP for EMScore:
+```
+pip install git+https://github.com/MOLYHECI/CLIP.git
+```
+
+</p>
+</details>
+
<details>
<summary>
<b>All dependencies</b>
@@ -84,14 +130,10 @@ pip install transformers==4.44.2
</p>
</details>

-
-
Please refer to the [data evaluation documentation](#数据评估文档) for the parameter usage rules. Data evaluation can be completed using only the yaml parameters:

```
-
python test.py --config [your config file]
-
```
<p align="center">
<img src="./static/images/example_1.png">

configs/text_scorer_pt.yaml

Lines changed: 5 additions & 5 deletions
@@ -5,10 +5,10 @@ dependencies: [text]

data:
  text:
-    use_hf: False # Whether to use huggingface_dataset, if used, ignore the local data path below
-    dataset_name: 'yahma/alpaca-cleaned'
-    dataset_split: 'train'
-    name: 'default'
+    use_hf: False # Whether to use an online Huggingface dataset; if used, the local data path below is ignored
+    dataset_name: 'yahma/alpaca-cleaned' # Huggingface dataset: dataset name
+    dataset_split: 'train' # Huggingface dataset: dataset split
+    name: 'default' # Huggingface dataset: subset name

    data_path: 'demos/text_eval/fineweb_5_samples.json' # Local data path, supports json, jsonl, parquet formats
    formatter: "TextFormatter" # Data loader type
@@ -31,4 +31,4 @@ scorers: # You can select multiple text scorers from all_scorers.yaml and put th
    - educational_value
  PresidioScorer:
    language: 'en'
-    device: 'cuda:0'
+    device: 'cuda:0'
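
The new inline comments document how the Huggingface fields fit together. As a rough, hypothetical sketch (not the repository's actual TextFormatter code; the PyYAML import and the branching are assumptions), the `use_hf` switch and the fields above could be consumed like this:

```python
# Hypothetical sketch: how a config like the one above might be consumed.
# `use_hf` selects between the online Huggingface dataset and the local file;
# dataset_name / name / dataset_split map onto datasets.load_dataset arguments.
import json

import datasets  # Huggingface `datasets` library
import yaml      # PyYAML, assumed available

with open("configs/text_scorer_pt.yaml") as f:
    cfg = yaml.safe_load(f)["data"]["text"]

if cfg["use_hf"]:
    ds = datasets.load_dataset(
        cfg["dataset_name"],          # dataset name, e.g. 'yahma/alpaca-cleaned'
        name=cfg["name"],             # subset name, e.g. 'default'
        split=cfg["dataset_split"],   # dataset split, e.g. 'train'
    )
    samples = list(ds)
else:
    # Only the json case is shown; jsonl/parquet handling is omitted.
    with open(cfg["data_path"]) as f:
        samples = json.load(f)

print(f"Loaded {len(samples)} samples")
```

With `use_hf: False`, as in this config, only `data_path` matters and the three Huggingface fields are ignored.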

configs/text_scorer_sft.yaml

Lines changed: 5 additions & 5 deletions
@@ -5,10 +5,10 @@ dependencies: [text]

data:
  text:
-    use_hf: True # Whether to use huggingface_dataset, if used, ignore the local data path below
-    dataset_name: 'yahma/alpaca-cleaned'
-    dataset_split: 'train'
-    name: 'default'
+    use_hf: False # Whether to use an online Huggingface dataset; if used, the local data path below is ignored
+    dataset_name: 'yahma/alpaca-cleaned' # Huggingface dataset: dataset name
+    dataset_split: 'train' # Huggingface dataset: dataset split
+    name: 'default' # Huggingface dataset: subset name

    data_path: 'demos/text_eval/alpaca_5_samples.json' # Local data path, supports json, jsonl, parquet formats
    formatter: "TextFormatter" # Data loader type
@@ -19,4 +19,4 @@ scorers: # You can select multiple text scorers from all_scorers.yaml and put th
  DeitaQualityScorer:
    device: 'cuda:0'
    model_name: 'hkust-nlp/deita-quality-scorer'
-    max_length: 512
+    max_length: 512

dataflow/Eval/Text/README.md

Lines changed: 5 additions & 5 deletions
@@ -12,10 +12,10 @@ model_cache_path: '../ckpt' # cache path for models

data:
  text:
-    use_hf: False # Whether to use huggingface_dataset, if used, ignore the local data path below
-    dataset_name: 'yahma/alpaca-cleaned'
-    dataset_split: 'train'
-    name: 'default'
+    use_hf: False # Whether to use an online Huggingface dataset; if used, the local data path below is ignored
+    dataset_name: 'yahma/alpaca-cleaned' # Huggingface dataset: dataset name
+    dataset_split: 'train' # Huggingface dataset: dataset split
+    name: 'default' # Huggingface dataset: subset name

    data_path: 'demos/text_eval/fineweb_5_samples.json' # Local data path, supports json, jsonl, parquet formats
    formatter: "TextFormatter" # Data loader type
@@ -152,4 +152,4 @@ calculate_score(save_path='./scores.json')
}
}
}
-```
+```

dataflow/Eval/Text/README.zh-CN.md

Lines changed: 5 additions & 5 deletions
@@ -13,10 +13,10 @@ model_cache_path: '../ckpt' # 模型默认缓存路径

data:
  text:
-    use_hf: False # Whether to use huggingface_dataset; if used, the local data path below is ignored
-    dataset_name: 'yahma/alpaca-cleaned'
-    dataset_split: 'train'
-    name: 'default'
+    use_hf: False # Whether to use an online Huggingface dataset; if used, the local data path below is ignored
+    dataset_name: 'yahma/alpaca-cleaned' # Huggingface dataset: dataset name
+    dataset_split: 'train' # Huggingface dataset: dataset split
+    name: 'default' # Huggingface dataset: subset name

    data_path: 'demos/text_eval/fineweb_5_samples.json' # Local data path, supports json, jsonl, parquet formats
    formatter: "TextFormatter" # Data loader type
@@ -155,4 +155,4 @@ calculate_score(save_path='./scores.json')
}
}
}
-```
+```

dataflow/__init__.py

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
from .config import *
from .utils import *
-# from .Eval import *
+from .Eval import *
from .format import *

-from .utils.utils import list_image_eval_metrics, get_scorer
+from .utils.utils import list_image_eval_metrics, get_scorer
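
Un-commenting `from .Eval import *` means that importing the top-level package now also imports `dataflow.Eval`, so whatever that module defines or registers becomes reachable right after `import dataflow`. A minimal, hedged way to observe the effect (nothing here relies on dataflow-specific API beyond the plain import):

```python
# Minimal check: after this commit, importing the package also imports
# dataflow.Eval, so its public names (and any registry side effects) are
# available as soon as the top-level package is imported.
import dataflow

# Inspect what the top-level namespace exposes; the exact names depend on
# what dataflow.Eval defines, which is not shown in this diff.
public_names = [name for name in dir(dataflow) if not name.startswith("_")]
print(public_names)
```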

dataflow/format/text_formatter.py

Lines changed: 2 additions & 2 deletions
@@ -1,4 +1,4 @@
-from datasets import load_dataset
+import datasets
import json
import pyarrow.parquet as pq
from dataflow.utils.registry import FORMATTER_REGISTRY
@@ -38,7 +38,7 @@ def load_hf_dataset(self, dataset_name, dataset_split=None, name=None, keys=None
            "name": name
        }

-        dataset = load_dataset(**{k: v for k, v in load_kwargs.items() if v is not None})
+        dataset = datasets.load_dataset(**{k: v for k, v in load_kwargs.items() if v is not None})

        metadata = {
            "description": dataset.info.description if hasattr(dataset, "info") else None,

dataflow/utils/utils.py

Lines changed: 2 additions & 0 deletions
@@ -69,6 +69,8 @@ def recursive_len(scores: dict):
            return recursive_len(v)
        elif isinstance(v, np.ndarray):
            return v.shape[0]
+        elif isinstance(v, list):
+            return len(v)
        else:
            raise ValueError(f"Invalid scores type {type(v)} returned")
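
The two added lines let `recursive_len` count plain Python lists in addition to numpy arrays. A reduced, runnable sketch follows; only the `isinstance` branches are taken from the hunk, and the dict-walking loop around them is an assumption:

```python
import numpy as np


def recursive_len(scores: dict):
    # Assumed wrapper: walk the (possibly nested) scores dict and report the
    # length of the first array-like value found.
    for v in scores.values():
        if isinstance(v, dict):
            return recursive_len(v)
        elif isinstance(v, np.ndarray):
            return v.shape[0]
        elif isinstance(v, list):   # new branch added by this commit
            return len(v)
        else:
            raise ValueError(f"Invalid scores type {type(v)} returned")


# Scorers that return plain Python lists no longer raise:
print(recursive_len({"SomeScorer": {"Default": [0.1, 0.8, 0.3]}}))  # -> 3
```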
