A functional data processing pipeline built in F# for linguistic research applications. Originally developed to support the ECODOC project at the University of Bordeaux for processing French text corpora.
ListProcessor transforms collections of text files into structured CSV datasets suitable for linguistic analysis. The tool handles French-specific processing, including accent preservation, culture-aware text normalisation, and tokenisation optimised for Romance language characteristics.
- Batch text processing - Processes all .txt files in a specified directory
- Linguistic tokenisation - Handles French contractions, hyphens, and word boundaries
- Cultural localisation - Uses French locale for proper case conversion and sorting
- UTF-8 accent preservation - Maintains linguistic integrity of accented characters (à-ÿ, À-Ÿ)
- Deduplication and sorting - Produces clean, alphabetically ordered word lists
- CSV output - Research-ready format with file provenance tracking
- Place your text files in the samples/ directory
- Run the application: dotnet run
- Find your processed data in output.csv
The tool generates a CSV file where each line represents one input file:
filename.txt,mot1,mot2,mot3,...
The first column contains the source filename, followed by all unique words found in that file, alphabetically sorted.
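As a rough illustration of the row layout described above, the following sketch joins a filename and its words into one CSV line. The helper name `toCsvLine` and the sample data are assumptions made for this example; they are not taken from Program.fs.

```fsharp
/// Builds one output row: the source filename followed by its words.
/// `toCsvLine` is a hypothetical helper name used only for illustration.
let toCsvLine (fileName: string) (words: seq<string>) : string =
    String.concat "," (Seq.append [ fileName ] words)

// Sample data, assumed to be already deduplicated and sorted.
let line = toCsvLine "discours.txt" [ "climat"; "forêt"; "océan" ]
printfn "%s" line
// prints: discours.txt,climat,forêt,océan
```

Because symbol removal leaves only letters in each word, the words themselves cannot contain commas. A filename that contains a comma would still need quoting, which this sketch does not attempt.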
The application follows a functional pipeline approach:
- File Loading - Reads text files with UTF-8 encoding
- Normalisation - Converts to lowercase using French cultural rules
- Tokenisation - Splits text into words, handling French-specific punctuation
- Symbol Removal - Removes non-alphabetic characters whilst preserving accents
- Deduplication - Eliminates duplicate entries
- Sorting - Alphabetically orders words using French collation rules
- CSV Export - Outputs structured data for analysis
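The text-processing steps above could be sketched roughly as follows. This is a minimal illustration, not the code in Program.fs: the function name `extractWords` and the separator set are assumptions, while the culture and the regex pattern are the ones shown in the configuration examples.

```fsharp
open System
open System.Globalization
open System.Text.RegularExpressions

let culture = CultureInfo("fr-FR")
let nonLetters = Regex("[^a-zA-Zà-ÿÀ-Ÿ]", RegexOptions.Compiled)

// Assumed separators: whitespace, straight and typographic apostrophes, hyphens.
let separators = [| ' '; '\t'; '\r'; '\n'; '\''; '’'; '-' |]

/// Hypothetical helper: normalise, tokenise, clean, deduplicate and sort one text.
let extractWords (text: string) : string list =
    let lowered = text.ToLower(culture)                       // Normalisation
    lowered.Split(separators, StringSplitOptions.RemoveEmptyEntries) // Tokenisation
    |> Seq.map (fun token -> nonLetters.Replace(token, ""))   // Symbol removal
    |> Seq.filter (fun word -> word <> "")                    // Drop tokens that were only symbols
    |> Seq.distinct                                           // Deduplication
    |> Seq.sortWith (fun a b -> String.Compare(a, b, culture, CompareOptions.None)) // Sorting
    |> Seq.toList

let words = extractWords "L'écologie et l'économie : l'Écologie politique."
printfn "%A" words
```

With this separator set, an elided article such as "l'" survives as the one-letter word "l". Whether such fragments should be kept or filtered is a decision for the linguistic context.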
To process text in a different language, modify the culture setting in Program.fs:
let culture = CultureInfo("en-GB") // For British English
let culture = CultureInfo("de-DE") // For German
let culture = CultureInfo("es-ES") // For Spanish
Adjust the file paths as needed:
let inputFolder = "./samples" // Input directory
let outputFile = "./output.csv" // Output file path
The regex pattern for character removal can be modified to include additional character sets:
let regex = Regex("[^a-zA-Zà-ÿÀ-Ÿ]", RegexOptions.Compiled)
- .NET 8.0 or later
- Input files must be UTF-8 encoded text files (.txt)
This tool is particularly useful for:
- Corpus linguistics research - Preparing text corpora for frequency analysis
- Digital humanities projects - Processing historical or literary texts
- Language learning applications - Extracting vocabulary from texts
- Lexicographic research - Building word lists from document collections
- Text mining preprocessing - Normalising data before analysis
The application uses F#'s lazy sequence evaluation (seq) for memory-efficient processing of large files. Files are processed one at a time to avoid loading entire datasets into memory simultaneously.
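One way to get this lazy, file-at-a-time behaviour is to combine `Directory.EnumerateFiles` with `File.ReadLines`, both of which return lazy sequences. The sketch below is an assumption about how it might look; `readFilesLazily` and the temporary demo folder are hypothetical and not taken from the project.

```fsharp
open System
open System.IO
open System.Text

/// Lazily pairs each .txt file name with a lazy sequence of its lines.
/// Nothing is read from disk until the sequences are consumed.
let readFilesLazily (folder: string) : seq<string * seq<string>> =
    Directory.EnumerateFiles(folder, "*.txt")
    |> Seq.map (fun path -> (Path.GetFileName(path), File.ReadLines(path, Encoding.UTF8)))

// Demo on a throwaway folder so the sketch is runnable on its own.
let demoFolder =
    Path.Combine(Path.GetTempPath(), "listprocessor-demo-" + Guid.NewGuid().ToString("N"))
Directory.CreateDirectory(demoFolder) |> ignore
File.WriteAllText(Path.Combine(demoFolder, "a.txt"), "la forêt\nla mer")

let lineCounts =
    readFilesLazily demoFolder
    |> Seq.map (fun (name, lines) -> (name, Seq.length lines))
    |> Seq.toList

printfn "%A" lineCounts
```

Each file's lines are streamed only when `Seq.length` walks them, so at most one file is being read at any moment.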
The tool is designed specifically for French text processing:
- Preserves all French accented characters (à, é, è, ç, etc.)
- Handles French punctuation conventions (apostrophes in contractions)
- Uses French cultural rules for case conversion and sorting
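To illustrate why culture-aware sorting matters for accented words, the sketch below compares an ordinal sort with a sort using the fr-FR culture. The word list is invented for the example, and the culture-aware result assumes the runtime has culture data available (that is, it is not running in globalization-invariant mode).

```fsharp
open System
open System.Globalization

let french = CultureInfo("fr-FR")
let sample = [ "zone"; "école"; "eau"; "été" ]

// Ordinal comparison orders by code point, so "é" (U+00E9) sorts after "z".
let ordinalSorted =
    sample |> List.sortWith (fun a b -> String.CompareOrdinal(a, b))

// Culture-aware comparison treats "é" as a variant of "e".
let frenchSorted =
    sample |> List.sortWith (fun a b -> String.Compare(a, b, french, CompareOptions.None))

printfn "ordinal: %A" ordinalSorted
printfn "fr-FR:   %A" frenchSorted
```

F#'s default `List.sort` compares strings ordinally, so a pipeline that needs dictionary order has to pass the culture explicitly, as the second sort does.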
The current implementation assumes well-formed UTF-8 input files. For production use with untrusted data, consider adding:
- File encoding detection
- Malformed input handling
- Empty file validation
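As a starting point for the malformed-input and empty-file checks listed above, a strict UTF-8 decoder can reject invalid byte sequences instead of silently replacing them. This is a sketch under stated assumptions: `tryReadText` and the demo files are hypothetical, and it only validates UTF-8; it does not detect other encodings.

```fsharp
open System
open System.IO
open System.Text

// throwOnInvalidBytes = true makes malformed input raise DecoderFallbackException.
let strictUtf8 = UTF8Encoding(false, true)

/// Hypothetical helper: returns the file's text, or a message explaining why it was rejected.
let tryReadText (path: string) : Result<string, string> =
    try
        // GetString keeps a leading byte-order mark, so strip it explicitly.
        let text = strictUtf8.GetString(File.ReadAllBytes(path)).TrimStart('\uFEFF')
        if String.IsNullOrWhiteSpace text then
            Error $"{path}: file is empty"
        else
            Ok text
    with
    | :? DecoderFallbackException -> Error $"{path}: not valid UTF-8"
    | :? IOException as ex -> Error $"{path}: {ex.Message}"

// Demo files in a throwaway folder.
let demoDir =
    Path.Combine(Path.GetTempPath(), "listprocessor-utf8-" + Guid.NewGuid().ToString("N"))
Directory.CreateDirectory(demoDir) |> ignore
let validPath = Path.Combine(demoDir, "valid.txt")
let invalidPath = Path.Combine(demoDir, "invalid.txt")
let emptyPath = Path.Combine(demoDir, "empty.txt")
File.WriteAllBytes(validPath, Encoding.UTF8.GetBytes("été"))
File.WriteAllBytes(invalidPath, [| 0xC3uy; 0x28uy |]) // 0xC3 must be followed by a continuation byte
File.WriteAllBytes(emptyPath, Array.empty<byte>)

for path in [ validPath; invalidPath; emptyPath ] do
    printfn "%A" (tryReadText path)
```

Returning a `Result` lets the caller decide whether to skip a rejected file, log it, or abort the whole run.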
This tool was developed for specific research needs. If you're adapting it for other linguistic research:
- Test with your target language's character set
- Verify cultural sorting behaviour meets your requirements
- Consider whether the tokenisation rules suit your linguistic context
Developed for academic research purposes. Please credit the original work if adapting for other research projects.
Originally created to support text corpus analysis for the EcoDOC project at the University of Bordeaux, focusing on environmental and ecological discourse in French academic literature.
Built with F# for functional, reliable text processing in academic research environments.