PyJedAI - Entity Matching

List of works using pyJedAI

The following repositories leverage pyJedAI in the context of the STELAR project:

The code and experiments on AutoER are available here.
The code for Disintegration is available here.
The code for Generalized Supervised Meta-blocking is available here.
The code for the Evaluation of Supervised Entity Matching is available here.
The code for SMBench is available here.
The code for Prompting 7B LLMs for Entity Matching is available here.
The code for AvenER is available here.
The code for the Experimental Analysis of Pre-trained Embeddings is available here.
The code for Progressive Entity Matching is available here.

Entity Matching Process

This README guides you through the whole Entity Matching process using pyJedAI.

💡 Tip: Find JSON examples here.

💡 Tip: If you want to learn more about pyJedAI, read the docs here.

Input

For all key attributes in JSON, exactly one file path must be provided.

Attributes	Info	Value Type	Required
`dataset_1`	`.csv` format	`list`	✔
`dataset_2`	`.csv` format	`list`
`ground_truth`	`.csv` format	`list`
`embeddings_dataset_1`	Used for loading embeddings in `EmbeddingsNNWorkflow`; `.npy` format	`list`
`embeddings_dataset_2`	Used for loading embeddings in `EmbeddingsNNWorkflow`; `.npy` format	`list`

{
    "inputs": {
        "dataset_1": [
            "d5e730ba-c1d5-4ec1-ae95-88a637204c19"
        ],
        "dataset_2": [
            "cb37e262-a606-4d82-9712-b80e8f4d723d"
        ],
        "ground_truth": [
            "db006da0-16ed-4ef5-bf1e-d142488d533e"
        ]
    }
}

💡 Tip: If dataset_2 is provided, matches will only be of type (e_1, e_2), where e_1 is an entity in dataset_1 and e_2 is an entity in dataset_2.

💡 Tip: If ground_truth is provided, metrics will be returned.
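
As a rough illustration of the rules above, the `inputs` section could be checked before a run along the following lines. This is a minimal sketch, not the repository's own validation: the helper name `validate_inputs` and its messages are hypothetical, and it assumes that "exactly one file path" means every provided key holds a one-element list.

```python
# Hypothetical helper; not part of pyJedAI or of this repository.
ALLOWED_KEYS = {
    "dataset_1",
    "dataset_2",
    "ground_truth",
    "embeddings_dataset_1",
    "embeddings_dataset_2",
}
REQUIRED_KEYS = {"dataset_1"}


def validate_inputs(inputs: dict) -> list:
    """Return a list of problems found in the `inputs` section (empty if none)."""
    problems = []
    for key in sorted(REQUIRED_KEYS - set(inputs)):
        problems.append(f"missing required key: {key}")
    for key, value in inputs.items():
        if key not in ALLOWED_KEYS:
            problems.append(f"unknown key: {key}")
        elif not isinstance(value, list) or len(value) != 1:
            # Assumption: "exactly one file path" means a one-element list.
            problems.append(f"{key} must be a list with exactly one entry")
    return problems


example_inputs = {
    "dataset_1": ["d5e730ba-c1d5-4ec1-ae95-88a637204c19"],
    "dataset_2": ["cb37e262-a606-4d82-9712-b80e8f4d723d"],
    "ground_truth": ["db006da0-16ed-4ef5-bf1e-d142488d533e"],
}
print(validate_inputs(example_inputs))
```

Returning a list of problems instead of raising on the first one lets a caller report every issue in a single pass.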

Parameters

Concerning the input, additional information must be provided.

Attributes	Info	Value Type	Required
`dataset_1`	Provide info for dataset to be processed correctly	dataset_object	✔
`dataset_2`	Provide info for dataset to be processed correctly	dataset_object
`ground_truth`	Provide info for dataset to be processed correctly	ground_truth_object
`workflow`	Select your preferred workflow: `BlockingBasedWorkflow`, `EmbeddingsNNWorkflow`, or `JoinWorkflow`	`string`	✔
`block_building`	Block building method and parameters; used only for `BlockingBasedWorkflow` and `EmbeddingsNNWorkflow`	block_building_object	✔
`block_cleaning`	Block cleaning method and parameters; used only for `BlockingBasedWorkflow`. More than one `block_cleaning` method can be used	block_cleaning_object or `list` of block_cleaning_object
`comparison_cleaning`	Comparison cleaning method and parameters; used only for `BlockingBasedWorkflow`	comparison_cleaning_object
`entity_matching`	Entity Matching method and parameters; used only for `BlockingBasedWorkflow`	entity_matching_object	✔
`clustering`	Clustering method and parameters; used for `BlockingBasedWorkflow`, `EmbeddingsNNWorkflow` or `JoinWorkflow`	clustering_object
`join`	Join method and parameters; used only for `JoinWorkflow`	join_object	✔

💡 Tip: JoinWorkflow does not contain block_building step.
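
The table above implies that the set of required step keys depends on the selected workflow. A minimal sketch of such a check is shown below. The mapping is one reading of the table (a step marked required is assumed to apply only to the workflows it names), and `missing_steps` is a hypothetical helper, not a function of this repository.

```python
# Required step keys per workflow, as read from the parameters table.
# Assumption: a step marked required applies only to the workflows it lists.
REQUIRED_STEPS = {
    "BlockingBasedWorkflow": {"block_building", "entity_matching"},
    "EmbeddingsNNWorkflow": {"block_building"},
    "JoinWorkflow": {"join"},
}


def missing_steps(parameters: dict) -> set:
    """Return the required keys that are absent for the selected workflow."""
    workflow = parameters.get("workflow")
    if workflow not in REQUIRED_STEPS:
        raise ValueError(f"unknown workflow: {workflow!r}")
    # `dataset_1` is marked required regardless of the workflow.
    required = REQUIRED_STEPS[workflow] | {"dataset_1"}
    return required - set(parameters)


params = {
    "workflow": "JoinWorkflow",
    "dataset_1": {"separator": "|", "id_column_name": "id"},
}
print(sorted(missing_steps(params)))
```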

Dataset

Attributes of keys: dataset_1, dataset_2

Attributes	Info	Value Type	Required
`separator`	Character separating values in csv	`char`	✔
`id_column_name`	Name of Dataset's id column	`string`	✔
`dataset_name`	Name of Dataset	`string`
`attributes`	Columns to be used for matching	`list`

Ground Truth

Attributes of key: ground_truth

Attributes	Info	Value Type	Required
`separator`	Character separating values in csv	`char`	✔

Input Examples

"parameters" : {
        "dataset_1" : {
            "separator" : "|",
            "id_column_name" : "id",
            "dataset_name" : "abt"                    
        },
        "dataset_2" : {
            "separator" : "|", 
            "id_column_name" : "id",
            "dataset_name" : "buy"
        },
        "ground_truth" : {
            "separator" : "|"
        },                
        "workflow": "BlockingBasedWorkflow",
        "block_building": {
              "method": "StandardBlocking",
              "attributes_1" : ["name"],
              "attributes_2" : ["first_name"]
        },
        "block_cleaning" : [
            {
                "method" : "BlockFiltering", 
                "params" : { "ratio" : 0.7 }
            }
        ],
        "comparison_cleaning": {
            "method": "BLAST"
        },
        "entity_matching" : { 
            "method" : "EntityMatching",
            "params" : {
                "similarity_threshold" : 0.8
            }
        },
        "clustering" : {
            "method" : "UniqueMappingClustering",
            "params" : {
                "similarity_threshold" : 0.1
            }
        }
}
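
In the example above, `block_cleaning` is a list with a single step, while the parameters table also allows a single object. Code that consumes the configuration can normalize both shapes to a list first. The sketch below shows one way to do that; `as_step_list` is a hypothetical helper, not part of this repository.

```python
def as_step_list(step_config):
    """Normalize a step given as one object or a list of objects to a list."""
    if step_config is None:
        return []
    if isinstance(step_config, dict):
        return [step_config]
    return list(step_config)


single = {"method": "BlockFiltering", "params": {"ratio": 0.7}}

# Both shapes are iterated the same way after normalization.
for step in as_step_list(single):
    print(step["method"], step.get("params", {}))
for step in as_step_list([single]):
    print(step["method"], step.get("params", {}))
```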

"parameters" : {           
        "workflow": "EmbeddingsNNWorkflow",
        "block_building": 
        {
            "method" : "EmbeddingsNNBlockBuilding",
            "params" : {
                "vectorizer" : "st5"
            }
        },
        "clustering": {
            "method" : "UniqueMappingClustering",
            "params" : {
                "similarity_threshold": 0.4
            }
        }
     ....     
    }

"parameters" : {
        "workflow": "JoinWorkflow",
        "join":
        {
            "method" : "TopKJoin",
            "params" : {
                "metrics" : "cosine",
                "tokenization": "qgrams",
                "reverse_order": "False"
            },
            "attributes_1": ["name"],
            "attributes_2" : ["name"]
        },
        "clustering": {
            "method" : "UniqueMappingClustering",
            "params" : {
                "similarity_threshold": 0.4
            }
        }
     ....     
    }

Output

For all key attributes in JSON, exactly one file path must be provided.

Attributes	Info	Value Type
`metrics`	Creates a file with F1, Recall and Precision metrics if ground truth exists; `.csv` format	`path`
`pairs`	Creates a file with the ids of the pairs; `.csv` format	`path`
`entities`	Creates a file with all the matched entities; `.csv` format	`path`

{
  "outputs": {
        "metrics" : "s3://klms-bucket/pyjedai-output/metrics.csv",
        "pairs" : "s3://klms-bucket/pyjedai-output/pairs.csv",
        "entities" : "s3://klms-bucket/pyjedai-output/entities_df.csv"
  }
}
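
For reference, the three metrics written to the `metrics` file can be computed from the predicted pairs and the ground-truth pairs as sketched below. This is an illustration under stated assumptions, not the repository's implementation: pairs are treated as ordered `(id_1, id_2)` tuples, the helper names are hypothetical, and the column layout of the CSV is an assumption because this README does not specify it.

```python
import csv
import io


def pair_metrics(predicted, ground_truth):
    """Compute precision, recall and F1 over sets of (id_1, id_2) pairs."""
    predicted, ground_truth = set(predicted), set(ground_truth)
    true_positives = len(predicted & ground_truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"f1": f1, "recall": recall, "precision": precision}


def metrics_to_csv(metrics):
    """Serialize the metrics as a header row plus one value row (assumed layout)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(metrics))
    writer.writeheader()
    writer.writerow(metrics)
    return buffer.getvalue()


# Made-up ids: one of the two predicted pairs is correct.
predicted = [("a1", "b1"), ("a2", "b3")]
truth = [("a1", "b1"), ("a2", "b2")]
print(metrics_to_csv(pair_metrics(predicted, truth)))
```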
