This script normalizes company names in a CSV file, attributing patents to canonical company names. The normalization process includes handling whitespace, punctuation, legal structure variations, and fuzzy matching.
Prerequisites:
- Python (version 3.x)
- Pip (Python package installer)
1. Clone the repository:
   ```bash
   git clone https://github.com/xvimnt/MTKLabsInterview.git
   ```
2. Navigate to the project directory:
   ```bash
   cd MTKLabsInterview
   ```
3. Install the required Python packages:
   ```bash
   pip install -r requirements.txt
   ```
Run the script using the following command:
```bash
python company_name_normalizer.py input_file.csv output_file.csv
```
Replace `input_file.csv` with the path to your input CSV file and `output_file.csv` with the desired output CSV file.
- `input_file.csv`: Path to the input CSV file.
- `output_file.csv`: Path to the output CSV file.
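
As a rough illustration of how two positional arguments like these are commonly handled, here is a minimal sketch using the standard `argparse` and `csv` modules. The function names `parse_args` and `copy_rows` and the `transform` hook are hypothetical and are not taken from `company_name_normalizer.py`; the actual script may be organized differently.

```python
import argparse
import csv


def parse_args(argv=None):
    """Parse the two positional arguments described above."""
    parser = argparse.ArgumentParser(
        description="Normalize company names in a CSV file."
    )
    parser.add_argument("input_file", help="Path to the input CSV file.")
    parser.add_argument("output_file", help="Path to the output CSV file.")
    return parser.parse_args(argv)


def copy_rows(input_path, output_path, transform=lambda row: row):
    """Read each CSV row, apply `transform` (hypothetical hook), write it out."""
    with open(input_path, newline="", encoding="utf-8") as src, \
         open(output_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        # Fall back to an empty header if the input file has no rows at all.
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames or [])
        writer.writeheader()
        for row in reader:
            writer.writerow(transform(row))
```

In this sketch, calling `parse_args()` with no arguments reads from the command line, and a normalization step would be passed in as `transform`.
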
- The algorithm normalizes company names for whitespace, punctuation, and legal structure variations before addressing misspellings.
- It uses fuzzy matching to compare the similarity of two company names but relies on other attributes in the file to rule out potential false positives.
- The script assumes that the country field contains no misspellings, but that the city field may.
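
The notes above can be sketched roughly as follows. This is an illustrative approximation, not the script's actual code: the suffix list, the column names (`name`, `city`, `country`), the thresholds, and the use of `difflib.SequenceMatcher` from the standard library as the fuzzy matcher are all assumptions made for the example.

```python
import re
from difflib import SequenceMatcher

# Hypothetical list of legal-structure suffixes; the real script may use another.
LEGAL_SUFFIXES = {
    "inc", "incorporated", "corp", "corporation",
    "co", "company", "ltd", "limited", "llc", "gmbh",
}


def normalize_name(name: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace,
    and drop trailing legal-structure suffixes."""
    cleaned = re.sub(r"[^\w\s]", " ", name.lower())
    tokens = cleaned.split()
    while len(tokens) > 1 and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two strings."""
    return SequenceMatcher(None, a, b).ratio()


def same_company(rec_a: dict, rec_b: dict,
                 name_threshold: float = 0.85,
                 city_threshold: float = 0.8) -> bool:
    """Decide whether two records likely refer to the same company."""
    # Country is assumed to be free of misspellings, so require an exact match.
    if rec_a["country"].strip().lower() != rec_b["country"].strip().lower():
        return False
    # City may be misspelled, so compare it fuzzily.
    city_a = rec_a["city"].strip().lower()
    city_b = rec_b["city"].strip().lower()
    if similarity(city_a, city_b) < city_threshold:
        return False
    # Compare names only after normalization.
    name_a = normalize_name(rec_a["name"])
    name_b = normalize_name(rec_b["name"])
    return similarity(name_a, name_b) >= name_threshold
```

Checking country and city before the name lets the other attributes rule out pairs whose names look alike but belong to different companies. The thresholds here are arbitrary starting points and would need tuning against real data.
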
If you find issues or have suggestions for improvements, please open an issue or submit a pull request.