Skip to content

vilalali/word-alignment

Folders and files

NameName
Last commit message
Last commit date

Latest commit

┬а

History

2 Commits
┬а
┬а
┬а
┬а
┬а
┬а
┬а
┬а
┬а
┬а

Repository files navigation

	================= Word alignment using GIZA++ ===================
	################## INSTALLATION GUIDE##########
	1=> $ git clone https://github.com/vilalali/word-alignment.git
	2=> $ cd word-alignment
		$ ls -ltr
			GIZA++-v2
			mkcls-v2
	3=> Run the following Command to Install GIZA++-v2 and mkcls-v2:
	a=> Installation of GIZA++-v2:
	$ cd GIZA++-v2
	$ make clean
	$ make
	b=> Inastallation of mkcls-v2:
	$ cd mkcls-v2
	$ make clean
	$ make

	################ Word Alignment Guide ################
	1=> Create Source file:
		$ vi en1.txt
		Coppy the following text:
			India is a country of ancient culture and civilization
			Narendra Modi is the prime minister of India
			India is member of United Nations
	2=> Create the Targated File 
		$ vi hi1.txt
		Copy the following text:
			рднрд╛рд░рдд рдкреНрд░рд╛рдЪреАрди рд╕рдВрд╕реНрдХреГрддрд┐ рдФрд░ рд╕рднреНрдпрддрд╛ рдХрд╛ рджреЗрд╢ рд╣реИ
			рдирд░реЗрдВрджреНрд░ рдореЛрджреА рднрд╛рд░рдд рдХреЗ рдкреНрд░рдзрд╛рди рдордВрддреНрд░реА рд╣реИрдВ
			рдирд░реЗрдВрджреНрд░ рдореЛрджреА рднрд╛рд░рдд рдХреЗ рдкреНрд░рдзрд╛рди рдордВрддреНрд░реА рд╣реИрдВ  
	3=> Now Run the following Command:
		$ ./GIZA++-v2/plain2snt.out [source_language_corpus] [target_language_corpus]
		$ ./GIZA++-v2/plain2snt.out en1.txt hi1.txt

	4=> Output Generated FIles:
		It will generate 4 files:
		a) hi1.vcb, hi1_en1.snt, en1.vcb, en1_hi1.snt
		b) en1.vcb will represent the following data:
		c) en1.vcb: this file contains the eng word its id and frequency. format is [ID word frequency] 

	a) The output of en1.vcb contains the french words its id and frequency. format is [ID word frequency] 
	2 India 3
	3 is 3
	4 a 1
	5 country 1
	6 of 3
	7 ancient 1
	8 culture 1
	9 and 1
	10 civilization 1
	11 Narendra 1
	12 Modi 1
	13 the 1
	14 prime 1
	15 minister 1
	16 member 1
	17 United 1
	18 Nations 1

	d) Similarly the hi1.vcb contains the french words its id and frequency. format is [ID word frequency] 
	2 рднрд╛рд░рдд 3
	3 рдкреНрд░рд╛рдЪреАрди 1
	4 рд╕рдВрд╕реНрдХреГрддрд┐ 1
	5 рдФрд░ 1
	6 рд╕рднреНрдпрддрд╛ 1
	7 рдХрд╛ 1
	8 рджреЗрд╢ 1
	9 рд╣реИ 1
	10 рдирд░реЗрдВрджреНрд░ 2
	11 рдореЛрджреА 2
	12 рдХреЗ 2
	13 рдкреНрд░рдзрд╛рди 2
	14 рдордВрддреНрд░реА 2
	15 рд╣реИрдВ 2

	e) en1_hi1.snt : It contains the parallel sentences (source sentence Ids according to the alignment with target sentence Ids
	1
	2 3 4 5 6 7 8 9 10 
	2 3 4 5 6 7 8 9 
	1
	11 12 3 13 14 15 6 2 
	10 11 2 12 13 14 15 
	1
	2 3 16 6 17 18 
	10 11 2 12 13 14 15 

	f) hi1_en1.snt : It contains the parallel sentences (target sentence Ids according to the alignment with source sentence Ids
	1
	2 3 4 5 6 7 8 9 
	2 3 4 5 6 7 8 9 10 
	1
	10 11 2 12 13 14 15 
	11 12 3 13 14 15 6 2 
	1
	10 11 2 12 13 14 15 
	2 3 16 6 17 18 

	5=> Type following commands for Making class and cooccurrence:
		$ ./mkcls-v2/mkcls -p[source_language_corpus] -V[source_language_corpus].vcb.classes
		$ ./mkcls-v2/mkcls -p[target_language_corpus] -V[target_language_corpus].vcb.classes

	a) Example:
		$./mkcls-v2/mkcls -pen1.txt -Ven1.txt.vcb.classes
		$./mkcls-v2/mkcls -phi1.txt -Vhi1.txt.vcb.classes
	b) These commands will generate following four files:
		en1.txt.vcb.classes.cats, en1.txt.vcb.classes, hi1.txt.vcb.classes.cats, hi1.txt.vcb.classes
	c) '*.vcb.classes' file contains the class category Id associated with each of the tokens which is taken from the *.vcb.classes.cat file.

	6=> create output directory using command 
		$mkdir myout

	7=> Now use GIZA++ to build your dictionary: (-S is the source language, -T is the target language, -C is the generated aligned text file, and -o is the output file prefix): 
		./GIZA++ -S [source_language_corpus].vcb -T [target_language_corpus].vcb -C [target_language_corpus]_[source_language_corpus].snt -o [prefix] -outputpath [output_folder]
	Example. : 
		$ ./GIZA++ -S hi1.vcb -T en1.vcb -C hi1_en1.snt -outputpath myout -o test
		If you see ERROR: NO COOCURRENCE FILE GIVEN! when running GIZA++, you may need to change Makefile to compile GIZA++

		Before:

		CFLAGS_OPT = $(CFLAGS) -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE 
			-DBINARY_SEARCH_FOR_TTABLE -DWORDINDEX_WITH_4_BYTE

		After:
		CFLAGS_OPT = $(CFLAGS) -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE

	8=> It will generate the output files in myout/ directory and out of the variuos files file with name [prefix].actual.ti.final. In our case the final output file will be in output dir the wit the name: 
	
	test.actual.ti.final
	It contains the alignment of source and target words according to their probability value:
	FINAL Output Will Be:
	
	test.actual.ti.final:
	
	рдордВрддреНрд░реА Nations 0.5
	рд╣реИрдВ United 0.5
	рдирд░реЗрдВрджреНрд░ member 0.5
	рдордВрддреНрд░реА minister 0.5
	рд╕рднреНрдпрддрд╛ civilization 1
	рдФрд░ culture 0.995885
	рд╣реИрдВ prime 0.5
	рдкреНрд░рд╛рдЪреАрди NULL 8.25337e-06
	рдирд░реЗрдВрджреНрд░ Narendra 0.5
	рднрд╛рд░рдд NULL 0.00130641
	рднрд╛рд░рдд is 0.0465133
	рдореЛрджреА Modi 0.264116
	рднрд╛рд░рдд of 0.652697
	рднрд╛рд░рдд India 0.081918
	рдХреЗ NULL 0.358273
	рд╕рдВрд╕реНрдХреГрддрд┐ NULL 3.06593e-05
	рдХреЗ India 0.641727
	рдкреНрд░рдзрд╛рди NULL 1
	рджреЗрд╢ civilization 1
	рд╕рдВрд╕реНрдХреГрддрд┐ ancient 0.999956
	рд╣реИ and 1
	рдФрд░ ancient 0.00411528
	рдореЛрджреА is 0.735884
	рдХрд╛ civilization 1
	рднрд╛рд░рдд the 0.217566
	рдкреНрд░рд╛рдЪреАрди country 0.999992
	рд╕рдВрд╕реНрдХреГрддрд┐ country 1.38068e-05

About

Word alignment using GIZA++ For bilingual data

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages