sms-mms-deduplication's Introduction

Hi! I'm Ryan

I am currently working as a Quantitative Analyst at Wells Fargo, primarily focusing on risk model research, development, maintenance, and monitoring. In much of my free time, I work on mathematical, statistical, and programming hobby projects.

Feel free to reach out with any questions, comments, or ideas and I'll try to respond reasonably quickly!

https://ryanagibson.com

Some Projects

Feel free to look through my repositories list (many are private), but I've linked some projects in the images below!

Steganography, ModularityPruning, FPGA-Asteroids, DRRRT-motion-planning

sms-mms-deduplication's People

Contributors

ragibson

Forkers

jcpeterson

sms-mms-deduplication's Issues

MMS duplicates are only deduplicated in pairs?

I've received an external report that when a message has many simultaneous duplicates, they might not all be removed in a single run and may instead require multiple passes of deduplication.

E.g., if there are four occurrences of the same text, two will remain after running the tool.

I'm not sure if I thought about this case when initially developing the tool, so I want to check what actually happens here in both the SMS and MMS cases.
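For reference, a minimal sketch of single-pass deduplication that keeps one copy no matter how many duplicates exist. It assumes messages are plain dictionaries with hypothetical "address", "date", and "body" keys. This is not the tool's actual data structure or algorithm, only an illustration of the expected behavior.

```python
def deduplicate(messages):
    """Keep the first occurrence of each message and drop every later copy."""
    seen = set()
    unique = []
    for message in messages:
        # Hypothetical identity key; the real tool compares more fields.
        key = (message["address"], message["date"], message["body"])
        if key in seen:
            continue  # any number of repeats is skipped in the same pass
        seen.add(key)
        unique.append(message)
    return unique


# Four identical messages should collapse to one after a single run.
four_copies = [
    {"address": "1231231234", "date": "1683594284000", "body": "hello"}
    for _ in range(4)
]
remaining = deduplicate(four_copies)
```

Because every message is checked against the set of all keys seen so far, rather than against a single neighbor, the number of duplicates should not matter.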

RuntimeError: Encountered SMIL data not captured by existing check?

python3 ./dedupe_texts.py --ignore-date-milliseconds --ignore-whitespace-differences ~/sms-2023-05-09_01-04-44.xml/sms-2023-05-09_01-04-44.xml ~/sms-2023-05-09_01-04-44.xml/cleaned.xml ~/sms-2023-05-09_01-04-44.xml/log.log
Reading '/home/USERNAME/sms-2023-05-09_01-04-44.xml/sms-2023-05-09_01-04-44.xml'... Done in 160.8 s.
Preparing log file '/home/USERNAME/sms-2023-05-09_01-04-44.xml/log.log'.
Searching for duplicates... Traceback (most recent call last):
  File "/home/USERNAME/Sources/SMS-MMS-deduplication/./dedupe_texts.py", line 295, in <module>
    output_tree, input_message_counts, output_message_counts = deduplicate_messages_in_tree(input_tree, log_file,
                                                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/USERNAME/Sources/SMS-MMS-deduplication/./dedupe_texts.py", line 211, in deduplicate_messages_in_tree
    child_tag, child_attributes = retrieve_message_properties_and_tag(child, args)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/USERNAME/Sources/SMS-MMS-deduplication/./dedupe_texts.py", line 188, in retrieve_message_properties_and_tag
    child_tag, child_attributes = child.tag, retrieve_message_properties(child, args)
                                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/USERNAME/Sources/SMS-MMS-deduplication/./dedupe_texts.py", line 131, in retrieve_message_properties
    result = tuple(item for element in [child] + list(child.iter()) for item in compile_relevant_fields(element))
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/USERNAME/Sources/SMS-MMS-deduplication/./dedupe_texts.py", line 131, in <genexpr>
    result = tuple(item for element in [child] + list(child.iter()) for item in compile_relevant_fields(element))
                                                                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/USERNAME/Sources/SMS-MMS-deduplication/./dedupe_texts.py", line 123, in compile_relevant_fields
    return tuple(
           ^^^^^^
  File "/home/USERNAME/Sources/SMS-MMS-deduplication/./dedupe_texts.py", line 128, in <genexpr>
    and not contains_smil(element.attrib[field])
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/USERNAME/Sources/SMS-MMS-deduplication/./dedupe_texts.py", line 74, in contains_smil
    raise RuntimeError(f"Encountered SMIL data not captured by existing check? {repr(s)}")
RuntimeError: Encountered SMIL data not captured by existing check? '<?xml version="1.0" encoding="UTF-8"?><smil>\r\n<head>\r\n<layout>\r\n<root-layout/>\r\n<region id="Text" top="70%" left="0%" height="30%" width="100%" fit="scroll"/>\r\n<region id="Image" top="0%" left="0%" height="70%" width="100%" fit="meet"/>\r\n</layout>\r\n</head>\r\n<body>\r\n<par dur="10s">\r\n<img src="imagejpeg_0.jpg" region="Image"/>\r\n</par>\r\n</body>\r\n</smil>'

This is reading an 18 GB-ish backup that has been handed across from Windows Phone 8 -> Windows Phone 8.1 -> Windows 10 Mobile -> Android 9 -> Android 10 -> Android 11 -> Android 12 (LineageOS).

I am not surprised this backup isn't playing nice, but it does work well on-device, if that helps at all.
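The tool's actual contains_smil check is not shown in this report, so the following is only a hypothetical sketch of a looser test that would recognize the string in the error above. It assumes SMIL data can be identified by an optional XML declaration followed by an opening smil tag; the name looks_like_smil and the pattern are illustrative, not part of the tool.

```python
import re

# Optional XML declaration, then an opening <smil> tag (assumed heuristic).
SMIL_PATTERN = re.compile(r"^\s*(<\?xml[^>]*\?>\s*)?<smil[\s>/]", re.IGNORECASE)


def looks_like_smil(s):
    """Return True if the string appears to start with a SMIL document."""
    return bool(SMIL_PATTERN.match(s))


# Shortened version of the value from the traceback above.
sample = (
    '<?xml version="1.0" encoding="UTF-8"?><smil>\r\n<head>\r\n<layout>\r\n'
    '<root-layout/>\r\n</layout>\r\n</head>\r\n<body>\r\n</body>\r\n</smil>'
)
detected = looks_like_smil(sample)
```

Ordinary message text that merely mentions the word "smil" later in the body would not match, since the pattern is anchored at the start of the string.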

Deduplication misses messages with inconsistent country code

Apparently some recovery agents inconsistently apply the country code to the addresses, which can cause some duplicates to be missed.

E.g., +1 1231231234 and 1231231234 are the same phone number in North American countries. This should be recognized in the deduplication.
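One possible approach, sketched under the assumption that only North American numbers need handling, is to normalize each address before comparing. The function name normalize_nanp_address is hypothetical and not taken from the tool.

```python
import re


def normalize_nanp_address(address):
    """Reduce a phone number to bare digits, dropping a leading country code 1."""
    # Leave non-numeric senders (e.g. alphanumeric IDs, emails) untouched.
    if not re.fullmatch(r"[+\d\s().-]+", address):
        return address
    digits = re.sub(r"\D", "", address)
    # Assumption: an 11-digit number starting with 1 carries the NANP country code.
    if len(digits) == 11 and digits.startswith("1"):
        return digits[1:]
    return digits


with_code = normalize_nanp_address("+1 1231231234")
without_code = normalize_nanp_address("1231231234")
```

With this normalization both forms compare equal, while short codes and non-numeric senders pass through unchanged. Numbers from other regions would need their own rules.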

Deduplication misses messages with inconsistent timestamp precision

It looks like some backup and/or recovery agents trim timestamp precision to the seconds level only (rather than millisecond).

That can cause two copies of the same message to be treated as different, since they technically appear to have been received at different times, but the extra copy should be removed anyway.
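A minimal sketch of one way to compare timestamps at second-level precision, assuming the date field is a string of milliseconds since the Unix epoch. The helper name date_key and its parameter are hypothetical.

```python
def date_key(date_ms, ignore_milliseconds=True):
    """Return a comparison key for a millisecond timestamp string."""
    timestamp = int(date_ms)
    if ignore_milliseconds:
        # Integer division drops the millisecond component.
        return timestamp // 1000
    return timestamp


full_precision = date_key("1683594284123")
trimmed = date_key("1683594284000")
```

Here the full-precision and trimmed timestamps map to the same key, so the two copies would be recognized as duplicates. The trade-off is that two genuinely distinct messages with identical content sent within the same second would also be merged.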

UnicodeEncodeError

I encountered this error when running your script:

Reading 'sms-20240121003902.xml'... Done in 17.7 s.
Preparing log file 'sms-20240121003902_deduplication.log'.
Searching for duplicates... Traceback (most recent call last):
  File "C:\Users\cfben\OneDrive\Apps\SMS Backup and Restore\dedupe_texts.py", line 210, in <module>
    output_tree, input_message_counts, output_message_counts = deduplicate_messages_in_tree(input_tree, log_file)
                                                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\cfben\OneDrive\Apps\SMS Backup and Restore\dedupe_texts.py", line 132, in deduplicate_messages_in_tree
    remove_element(child, child_tag, child_attributes, child_attributes)
  File "C:\Users\cfben\OneDrive\Apps\SMS Backup and Restore\dedupe_texts.py", line 120, in remove_element
    log_file.write(removal_summary(element_tag, element_attributes, attribute_match))
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.12_3.12.496.0_x64__qbz5n2kfra8p0\Lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f44d' in position 214: character maps to <undefined>

I was able to fix it by changing line 207 from with open(log_fp, "w") as log_file: to with open(log_fp, "w", encoding="utf-8") as log_file:.
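The fix can be illustrated in isolation. On Windows, open() without an explicit encoding typically falls back to a legacy code page such as cp1252, which cannot represent emoji. The sketch below uses a hypothetical write_log helper and a temporary file; it is not the script's actual logging code.

```python
import os
import tempfile


def write_log(log_fp, lines):
    """Write log lines using UTF-8 regardless of the platform default encoding."""
    with open(log_fp, "w", encoding="utf-8") as log_file:
        for line in lines:
            log_file.write(line + "\n")


# A message body containing the character from the error above (U+1F44D).
log_path = os.path.join(tempfile.mkdtemp(), "deduplication.log")
write_log(log_path, ["Removed duplicate: sounds good \U0001f44d"])

with open(log_path, encoding="utf-8") as log_file:
    contents = log_file.read()
```

Reading the log back also requires encoding="utf-8" on platforms where the default differs.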
