
pyJedAI Entity Matching version integrated with the KLMS


stelar-eu/pyjedai-em

 
 


PyJedAI - Entity Matching

List of works using pyJedAI

The following repositories leverage pyJedAI in the context of the STELAR project:

  1. The code and experiments on AutoER are available here.
  2. The code for Disintegration is available here.
  3. The code for Generalized Supervised Meta-blocking is available here.
  4. The code for the Evaluation of Supervised Entity Matching is available here.
  5. The code for SMBench is available here.
  6. The code for Prompting 7B LLMs for Entity Matching is available here.
  7. The code for AvenER is available here.
  8. The code for the Experimental Analysis of Pre-trained Embeddings is available here.
  9. The code for Progressive Entity Matching is available here.

Entity Matching Process

The following README will guide you through the whole process of Entity Matching using pyJedAI.

💡 Tip: Find JSON examples here.

💡 Tip: To learn more about pyJedAI, read the docs here.

Input

For all key attributes in JSON, exactly one file path must be provided.

| Attributes | Info | Value Type | Required |
|---|---|---|---|
| dataset_1 | .csv format | list | |
| dataset_2 | .csv format | list | |
| ground_truth | .csv format | list | |
| embeddings_dataset_1 | Used for loading embeddings in EmbeddingsNNWorkflow (.npy format) | list | |
| embeddings_dataset_2 | Used for loading embeddings in EmbeddingsNNWorkflow (.npy format) | list | |

{
    "inputs": {
        "dataset_1": [
            "d5e730ba-c1d5-4ec1-ae95-88a637204c19"
        ],
        "dataset_2": [
            "cb37e262-a606-4d82-9712-b80e8f4d723d"
        ],
        "ground_truth": [
            "db006da0-16ed-4ef5-bf1e-d142488d533e"
        ]
    }
}

💡 Tip: If dataset_2 is provided, matches will only be of type (e_1, e_2), where e_1 is an entity in dataset_1 and e_2 is an entity in dataset_2.

💡 Tip: If ground_truth is provided, evaluation metrics will be returned.
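
The "exactly one entry per key" rule can be checked before a job is submitted. The following is a minimal sketch, not part of pyJedAI or the KLMS: the function name `check_inputs` and the error messages are hypothetical, and the set of allowed keys is simply transcribed from the Input table.

```python
import json

# Keys transcribed from the Input table. The validation rule itself is an
# assumption based on "exactly one file path must be provided".
INPUT_KEYS = {
    "dataset_1",
    "dataset_2",
    "ground_truth",
    "embeddings_dataset_1",
    "embeddings_dataset_2",
}


def check_inputs(inputs: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for key, value in inputs.items():
        if key not in INPUT_KEYS:
            problems.append(f"unknown input key: {key}")
        elif not isinstance(value, list) or len(value) != 1:
            problems.append(f"{key} must be a list with exactly one entry")
    return problems


payload = json.loads("""
{
    "inputs": {
        "dataset_1": ["d5e730ba-c1d5-4ec1-ae95-88a637204c19"],
        "dataset_2": ["cb37e262-a606-4d82-9712-b80e8f4d723d"],
        "ground_truth": ["db006da0-16ed-4ef5-bf1e-d142488d533e"]
    }
}
""")
print(check_inputs(payload["inputs"]))  # prints [] for this payload
```

Returning a list of problems rather than raising on the first one lets a caller report every issue in a single pass.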

Parameters

Concerning the inputs, additional information must be provided.

| Attributes | Info | Value Type | Required |
|---|---|---|---|
| dataset_1 | Provide info for dataset to be processed correctly | dataset_object | |
| dataset_2 | Provide info for dataset to be processed correctly | dataset_object | |
| ground_truth | Provide info for dataset to be processed correctly | ground_truth_object | |
| workflow | Select your preferred workflow: BlockingBasedWorkflow, EmbeddingsNNWorkflow, or JoinWorkflow | string | |
| block_building | Block building method and parameters, used only for BlockingBasedWorkflow and EmbeddingsNNWorkflow | block_building_object | |
| block_cleaning | Block cleaning method and parameters, used only for BlockingBasedWorkflow. More than one block_cleaning method can be used | block_cleaning_object or list of block_cleaning_object | |
| comparison_cleaning | Comparison cleaning method and parameters, used only for BlockingBasedWorkflow | comparison-cleaning-object | |
| entity_matching | Entity Matching method and parameters, used only for BlockingBasedWorkflow | entity-matching-object | |
| clustering | Clustering method and parameters, used for BlockingBasedWorkflow, EmbeddingsNNWorkflow, or JoinWorkflow | clustering-object | |
| join | Join method and parameters, used only for JoinWorkflow | join-object | |

💡 Tip: JoinWorkflow does not contain block_building step.
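
Since each step applies only to certain workflows, it can help to flag step keys that the selected workflow does not use. This is a minimal sketch under the assumption that the Parameters table is the complete mapping. The names `STEPS_BY_WORKFLOW` and `unused_steps` are hypothetical and do not come from pyJedAI.

```python
# Step keys per workflow, transcribed from the Parameters table above.
STEPS_BY_WORKFLOW = {
    "BlockingBasedWorkflow": {
        "block_building",
        "block_cleaning",
        "comparison_cleaning",
        "entity_matching",
        "clustering",
    },
    "EmbeddingsNNWorkflow": {"block_building", "clustering"},
    "JoinWorkflow": {"join", "clustering"},
}
ALL_STEPS = set().union(*STEPS_BY_WORKFLOW.values())


def unused_steps(parameters: dict) -> list:
    """Return step keys present in `parameters` that the workflow does not use."""
    workflow = parameters.get("workflow")
    if workflow not in STEPS_BY_WORKFLOW:
        raise ValueError(f"unknown workflow: {workflow!r}")
    allowed = STEPS_BY_WORKFLOW[workflow]
    # Non-step keys such as dataset_1 or ground_truth are ignored here.
    return sorted(k for k in parameters if k in ALL_STEPS and k not in allowed)


params = {
    "workflow": "EmbeddingsNNWorkflow",
    "block_building": {"method": "EmbeddingsNNBlockBuilding"},
    "entity_matching": {"method": "EntityMatching"},
    "clustering": {"method": "UniqueMappingClustering"},
}
print(unused_steps(params))  # prints ['entity_matching']
```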

Dataset

Attributes of keys: dataset_1, dataset_2

| Attributes | Info | Value Type | Required |
|---|---|---|---|
| separator | Character separating values in csv | char | |
| id_column_name | Name of Dataset's id column | string | |
| dataset_name | Name of Dataset | string | |
| attributes | Columns to be used for matching | list | |
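
To illustrate what `separator` and `id_column_name` describe, the sketch below reads a small pipe-separated dataset with Python's standard `csv` module and indexes the rows by their id column. The sample rows and the helper name `load_entities` are made up for illustration. pyJedAI itself may load the files differently.

```python
import csv
import io


def load_entities(text: str, separator: str, id_column_name: str) -> dict:
    """Parse delimited text and return {entity_id: {column: value}}."""
    reader = csv.DictReader(io.StringIO(text), delimiter=separator)
    entities = {}
    for row in reader:
        # Remove the id from the row so only matching attributes remain.
        entity_id = row.pop(id_column_name)
        entities[entity_id] = row
    return entities


# Hypothetical sample data using "|" as the separator.
sample = "id|name|price\n0|Sony TV|499\n1|Canon camera|299\n"
entities = load_entities(sample, separator="|", id_column_name="id")
print(entities["1"]["name"])  # prints Canon camera
```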

Ground Truth

Attributes of key: ground_truth

| Attributes | Info | Value Type | Required |
|---|---|---|---|
| separator | Character separating values in csv | char | |

Input Examples

"parameters" : {
        "dataset_1" : {
            "separator" : "|",
            "id_column_name" : "id",
            "dataset_name" : "abt"                    
        },
        "dataset_2" : {
            "separator" : "|", 
            "id_column_name" : "id",
            "dataset_name" : "buy"
        },
        "ground_truth" : {
            "separator" : "|"
        },                
        "workflow": "BlockingBasedWorkflow",
        "block_building": {
              "method": "StandardBlocking",
              "attributes_1" : ["name"],
              "attributes_2" : ["first_name"]
        },
        "block_cleaning" : [
            {
                "method" : "BlockFiltering", 
                "params" : { "ratio" : 0.7 }
            }
        ],
        "comparison_cleaning": {
            "method": "BLAST"
        },
        "entity_matching" : { 
            "method" : "EntityMatching",
            "params" : {
                "similarity_threshold" : 0.8
            }
        },
        "clustering" : {
            "method" : "UniqueMappingClustering",
            "params" : {
                "similarity_threshold" : 0.1
            }
        }
}

"parameters" : {           
        "workflow": "EmbeddingsNNWorkflow",
        "block_building": 
        {
            "method" : "EmbeddingsNNBlockBuilding",
            "params" : {
                "vectorizer" : "st5"
            }
        },
        "clustering": {
            "method" : "UniqueMappingClustering",
            "params" : {
                "similarity_threshold": 0.4
            }
        }
     ....     
    }

"parameters" : {           
        "workflow": "JoinWorkflow",
        "block_building": 
        {
            "method" : "TopKJoin",
            "params" : {
                "metrics" : "cosine",
                "tokenization": "qgrams",
                "reverse_order": "False"
            },
            "attributes_1": ["name"],
            "attributes_2" : ["name"]
        },
        "clustering": {
            "method" : "UniqueMappingClustering",
            "params" : {
                "similarity_threshold": 0.4
            }
        }
     ....     
    }    
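
All three examples end with UniqueMappingClustering and a `similarity_threshold`. The idea behind unique mapping clustering is that each entity is matched at most once: candidate pairs are visited in decreasing similarity, and a pair is kept only if neither entity is already matched. The sketch below is a simplified illustration of that idea, not the pyJedAI implementation. The sample scores are invented, and treating a score equal to the threshold as a match is an assumption.

```python
def unique_mapping_clustering(scored_pairs, similarity_threshold):
    """Greedy one-to-one matching over (id_1, id_2, score) triples."""
    matched_1, matched_2 = set(), set()
    result = []
    # Visit the highest-scoring candidate pairs first.
    for id_1, id_2, score in sorted(scored_pairs, key=lambda p: p[2], reverse=True):
        if score < similarity_threshold:
            break  # every remaining pair scores even lower
        if id_1 in matched_1 or id_2 in matched_2:
            continue  # one of the two entities is already matched
        matched_1.add(id_1)
        matched_2.add(id_2)
        result.append((id_1, id_2))
    return result


# Hypothetical candidate pairs between dataset_1 (a*) and dataset_2 (b*).
candidates = [
    ("a1", "b1", 0.9),
    ("a1", "b2", 0.8),
    ("a2", "b2", 0.6),
    ("a3", "b3", 0.3),
]
print(unique_mapping_clustering(candidates, similarity_threshold=0.4))
# prints [('a1', 'b1'), ('a2', 'b2')]
```

A higher threshold discards more low-confidence pairs and a lower one keeps more of them, so the value is usually tuned per dataset.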

Output

For all key attributes in JSON, exactly one file path must be provided.

| Attributes | Info | Value Type | Required |
|---|---|---|---|
| metrics | Creates a file with F1, Recall, Precision metrics if ground truth exists (.csv format) | path | |
| pairs | Creates a file with the ids of pairs (.csv format) | path | |
| entities | Creates a file with all the matched entities (.csv format) | list | |

{
  "outputs": {
        "metrics" : "s3://klms-bucket/pyjedai-output/metrics.csv",
        "pairs" : "s3://klms-bucket/pyjedai-output/pairs.csv",
        "entities" : "s3://klms-bucket/pyjedai-output/entities_df.csv"
  }
}
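
For reference, the three values in the metrics file follow the standard definitions over matched pairs. The sketch below computes them from a set of predicted pairs and a set of ground-truth pairs. It assumes pairs are compared as exact (id_1, id_2) tuples. The sample ids and the function name `pair_metrics` are hypothetical, and the layout of the actual metrics file is not shown here.

```python
def pair_metrics(predicted, ground_truth):
    """Compute precision, recall and F1 over sets of (id_1, id_2) pairs."""
    predicted, ground_truth = set(predicted), set(ground_truth)
    true_positives = len(predicted & ground_truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        # F1 is the harmonic mean of precision and recall.
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical pairs: two of the four predictions are correct,
# and one true match is missed.
predicted = [("0", "10"), ("1", "11"), ("2", "12"), ("3", "13")]
truth = [("0", "10"), ("1", "11"), ("4", "14")]
metrics = pair_metrics(predicted, truth)
print(metrics)
```

Here precision is 2/4 and recall is 2/3, because two of the four predicted pairs appear in the ground truth and one of the three true pairs is missed.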
