
ruby-mbox's Introduction

mbox - A simple mbox parser.

I just needed a mbox parser for a notifier.

Yes, this is an overkill solution.

ruby-mbox's People

Contributors

alwahsh, brettporter, meh, shurizzle

ruby-mbox's Issues

File descriptor leak (never closed) - need close, or autoclose.

After the mailbox is read, file descriptors are left open and never closed. While probably not a big issue for small scripts, this could be a problem if Mbox is used by a long-running (server) process.

To make sure, I did

(1 .. 100).each do
  mbox = Mbox.open(file_name)
end

sleep 120

and had a look at the pertinent /proc/N/fd (Linux here).
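
One way to avoid this kind of leak is the block form familiar from File.open. The following is only a sketch of that pattern using plain File objects; with_stream is a hypothetical helper, not something the mbox gem is known to provide.

```ruby
require 'tempfile'

# Hypothetical helper (not part of the mbox gem): yields an open stream
# and guarantees that it is closed when the block finishes or raises.
def with_stream(path)
  io = File.open(path, 'r')
  yield io
ensure
  io.close if io && !io.closed?
end

# A tiny stand-in mailbox so the sketch is self-contained.
mailbox = Tempfile.new('mbox-demo')
mailbox.write("From alice@example.com Sun Aug 16 13:08:26 2015\nSubject: hi\n\nbody\n")
mailbox.flush

handles = []
first_lines = (1..100).map do
  with_stream(mailbox.path) do |io|
    handles << io # kept only so the sketch can inspect the handles afterwards
    io.gets
  end
end

all_closed = handles.all?(&:closed?)
puts "opened #{handles.size} streams, all closed: #{all_closed}"
```

With this shape, repeating the open 100 times leaves no descriptors behind, because each one is closed before the next is opened.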

Feature request: Different regular expression

The regular expression

/^From [^\s]+ .{24}/

is hard-coded in many places in mail.rb.

The mailbox that I am trying to parse needs a different regular expression than that, namely plain

/^From /

I would like to enhance the code so that the regular expression can be handed over as an option.

My normal approach would be a data member of some object holding the regexp.

I find that more difficult than I would like it to be, because the code calls class methods.

What would you think about reorganizing the code to make much less use of class methods?
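
To illustrate the idea, here is a minimal sketch in which the separator is instance state. MessageSplitter and its separator: keyword are hypothetical names invented for this example; they are not the gem's API.

```ruby
# Hypothetical class: the separator regexp is held by the instance
# instead of being hard-coded in class methods.
class MessageSplitter
  DEFAULT_SEPARATOR = /^From [^\s]+ .{24}/

  def initialize(separator: DEFAULT_SEPARATOR)
    @separator = separator
  end

  # Returns an array of raw message strings, each starting at a separator line.
  def split(text)
    messages = []
    current = nil
    text.each_line do |line|
      if line =~ @separator
        messages << current if current
        current = +''
      end
      current << line if current
    end
    messages << current if current
    messages
  end
end

strict_text = "From alice@example.com Sun Aug 16 13:08:26 2015\nSubject: one\n\nbody one\n" \
              "From bob@example.com Mon Aug 17 09:00:00 2015\nSubject: two\n\nbody two\n"
loose_text  = "From -\nSubject: one\n\nbody one\nFrom -\nSubject: two\n\nbody two\n"

strict_count       = MessageSplitter.new.split(strict_text).length
loose_with_default = MessageSplitter.new.split(loose_text).length
loose_with_option  = MessageSplitter.new(separator: /^From /).split(loose_text).length
```

The default separator finds nothing in the second mailbox, while passing /^From / as an option splits it as intended.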

nice to have a keys function on headers...

While introspecting the mbox format, it would be nice to know what keys are available to query the headers. (to, from, etc are pretty obvious and standard, but other headers may be custom inserted by other clients, etc., so it's nice to have a handy dynamic reference of the keys available from a given message.)

here is the idea quick and dirty as a refinement:

class Mbox::Mail::Headers
  def keys
    @data.keys.map(&:to_sym).sort
  end
end

and an example of the output from iterating over messages from a gmail mbox export:

[1] pry(main)> m.headers.keys
=> [:authentication_results,
 :content_length,
 :content_type,
 :date,
 :delivered_to,
 :dkim_signature,
 :from,
 :in_reply_to,
 :message_id,
 :mime_version,
 :received,
 :received_spf,
 :references,
 :reply_to,
 :return_path,
 :subject,
 :to,
 :x_gm_thrid,
 :x_gmail_labels,
 :x_received,
 :x_yahoo_newman_id,
 :x_yahoo_newman_property,
 :x_ymail_osg]

Once I know the header keys, I can take advantage of the overloaded [] operator thus:

[2] pry(main)> m.headers[:date]
=> "Sun, 16 Aug 2015 13:08:26 +0000 (UTC)"

no time to add a pull request for this now, but maybe later.

Any documentation?

For people who do not program in Ruby, it would be useful to have documentation on how to use this without having to learn Ruby first.

Incorrect charset handling

Hi,

the mbox gem assumes the full mbox file to be UTF-8 encoded. But if there is a mail encoded differently, e.g. ISO-8859-1 (quite common in Europe), mbox aborts with

Error invalid byte sequence in UTF-8

(A problem that probably never occurs in countries whose everyday language needs only ASCII.)

mbox needs to check the charset given in the header and read it appropriately (or read it as binary).
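
A rough sketch of the "read as binary, then decode" approach, using only Ruby's standard String encoding methods. decode_bytes is a hypothetical helper, and the charset is passed in by hand here; in practice it would come from the message's Content-Type header.

```ruby
# Bytes as they might sit in an mbox file: "Grüße" encoded as ISO-8859-1.
raw = "Gr\xFC\xDFe".b

# Treating those bytes as UTF-8 is what leads to "invalid byte sequence in UTF-8".
looks_valid_as_utf8 = raw.dup.force_encoding('UTF-8').valid_encoding?

# Hypothetical helper: reinterpret the bytes in the declared charset,
# then transcode to UTF-8, replacing anything that cannot be converted.
def decode_bytes(bytes, charset)
  bytes.dup.force_encoding(charset).encode('UTF-8', invalid: :replace, undef: :replace)
end

decoded = decode_bytes(raw, 'ISO-8859-1')
puts decoded
```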

Headers are case-insensitive

For one example:

Should I have a message with headers

Delivered-to: [email protected]
Delivered-To: [email protected]

this should be rolled up as two values of one header.

For another example, I should be able to grab ...headers->['message-id'] and get the message ID, regardless of whether the individual message called it Message-id, Message-ID, or Message-Id.
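
A minimal sketch of what such a lookup could look like. CaseInsensitiveHeaders is a hypothetical stand-in written for this example, not the gem's Mbox::Mail::Headers class, and the addresses are placeholders.

```ruby
# Hypothetical container: header names are normalized to lower case,
# and repeated headers are collected as multiple values of one key.
class CaseInsensitiveHeaders
  def initialize
    @data = Hash.new { |hash, key| hash[key] = [] }
  end

  def add(name, value)
    @data[normalize(name)] << value
  end

  # Returns nil for a missing header, a single value for one occurrence,
  # and an array when the header appeared more than once.
  def [](name)
    values = @data.fetch(normalize(name), [])
    values.length <= 1 ? values.first : values
  end

  def keys
    @data.keys.sort
  end

  private

  # Accepts "Message-ID", "message-id", or :message_id alike.
  def normalize(name)
    name.to_s.downcase.tr('_', '-')
  end
end

headers = CaseInsensitiveHeaders.new
headers.add('Delivered-to', 'first@example.com')
headers.add('Delivered-To', 'second@example.com')
headers.add('Message-ID', '<1@example.com>')

delivered  = headers['delivered-to']
message_id = headers['message-id']
```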

More examples for noobs?

Howdy,

Found this while looking around for something to parse some large mbox files. I'm very new to ruby, and I can't even get the example that counts messages to run. Any chance of putting some more docs/examples in git?

spork@linux-wbsc:~/archives> ruby -v
ruby 1.8.7 (2010-01-10 patchlevel 249) [i586-linux]
spork@linux-wbsc:~/archives> gem list -l ruby-mbox

*** LOCAL GEMS ***

ruby-mbox (0.0.2)

spork@linux-wbsc:~/archives> ./foo ./freebsd-stable-current 
/usr/lib/ruby/gems/1.8/gems/ruby-mbox-0.0.2/lib/mbox/mbox.rb:42:in `reopen': can't convert File into String (TypeError)
from /usr/lib/ruby/gems/1.8/gems/ruby-mbox-0.0.2/lib/mbox/mbox.rb:42:in `initialize'
from ./foo:10:in `new'
from ./foo:10

spork@linux-wbsc:~/archives> cat foo
#! /usr/bin/env ruby
require 'rubygems'
require 'mbox'

if ARGV.length < 1
    puts "You have to pass the mbox."
    exit
end

mbox = Mbox.new(File.new(ARGV.shift))

puts mbox.length

needs license notification

What is the license under which this software can be distributed? If you are undecided, I would suggest using a two-clause BSD License, which is the license under which the main Ruby implementation itself will be distributed as of version 1.9.3, or (my personal favorite) the Open Works License.

Dead code in parse() in mbox.rb

I am staring at this code in mbox.rb:

while true
    if @internal[counter]
        Mail.seek(@stream, 1, IO::SEEK_CUR)
        next
    end
    # Stuff omitted
end

I think that entire if block is dead code that cannot possibly be executed.

I reason as follows: if the Mail.seek(...) were ever executed, nothing here could change the value of counter or the outcome of if @internal[counter].

The stuff that I have omitted does not get executed in that case, so whatever is there cannot change that either.

So we would see an endless loop immediately.

As long as we do not see the endless loop in reality, this code must be dead.

Am I missing something?
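
The argument can be reproduced with a stripped-down loop of the same shape. This sketch does not use the gem's code: internal and counter are plain local stand-ins, and the iteration cap exists only so the demonstration terminates.

```ruby
internal = { 0 => :already_parsed } # stand-in for @internal
counter = 0
iterations = 0

while true
  iterations += 1
  break if iterations >= 1_000 # safety cap for the demonstration only

  if internal[counter]
    # Nothing in this branch changes counter, so the same test
    # succeeds again on every following pass.
    next
  end

  counter += 1 # stand-in for the omitted part; never reached here
end

puts "iterations: #{iterations}, counter: #{counter}"
```

Without the cap, the loop would never leave the if branch once it had entered it.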

type.boundary may be nil

In real-world spam e-mails, type.boundary is sometimes nil. This in turn means that this check

if type && type.mime && matches = type.mime.match(%r{multipart/(\w+)})

should contain

&& type.boundary

, else

.split("--#{type.boundary}\n") 

means

.split("--\n")

which will then split along HTML comments like

<!-- some comment -->

, which is nonsense.
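
The effect is easy to reproduce with plain strings. In this sketch, split_parts is a hypothetical guard written for illustration, and the HTML body uses a comment opener at the end of a line, since that is where the literal "--\n" sequence occurs.

```ruby
boundary = nil
separator = "--#{boundary}\n" # nil interpolates as an empty string, giving "--\n"

html_body = "<p>one</p>\n<!--\nsome comment\n-->\n<p>two</p>\n"
accidental_parts = html_body.split(separator)

# Hypothetical guard: only split when a usable boundary is present.
def split_parts(text, boundary)
  return [text] if boundary.nil? || boundary.empty?

  text.split("--#{boundary}\n")
end

guarded_parts = split_parts(html_body, boundary)
```

Without the guard the body is cut in the middle of the comment; with it, the text is left in one piece.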

Version confusion on Rubygems.org?

Hey. I just did gem install mbox today, and it installed from rubygems as per normal. I think that "master" in this repository hasn't been pushed to rubygems recently, even though the version in master says "0.1.0", which is what I have. Can you bump it to "0.2.0", since you added things, and then push it?

Error that caused me to realize this:

require 'mbox'
some_mbox = Mbox.open '/path/to/mbox'
some_mbox.class.name     #=> "Mbox"
some_mbox[3].class.name #=> "Mbox::Mail"
some_mbox[3].subject        #=> NoMethodError: undefined method `subject' for #<Mail:"[email protected]">

Thanks!

Examples: Return should be != 0 on failure

A program that fails should return a nonzero exit status to the shell, or to whoever called it.

The two example programs don't do that when called with no args.

Trivial to fix. Expect a pull request within a few minutes.
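
A sketch of the pattern, under the assumption that the examples are restructured around a function returning the status. The run function and its out:/err: parameters are hypothetical and exist only so the behaviour can be checked without terminating the process.

```ruby
require 'stringio'

# Hypothetical entry point: returns the exit status instead of calling exit,
# so the usage error can be distinguished from success by the caller.
def run(argv, out: $stdout, err: $stderr)
  if argv.empty?
    err.puts 'You have to pass the mbox.'
    return 1
  end

  out.puts "would parse #{argv.first}"
  0
end

# In a real script the last line would be: exit run(ARGV)
failure_status = run([], err: StringIO.new)
success_status = run(['some.mbox'], out: StringIO.new)
```

Writing the usage message to standard error also keeps it out of any output that a caller might be piping elsewhere.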

performance not so great on large files

Parsing a 32 MB mbox file promptly hung on this line (from a Ruby trace):

#0:/Library/Ruby/Gems/2.0.0/gems/mbox-0.1.0/lib/mbox/mail/content.rb:38:Mbox::Mail::Content:-:      if type && type.mime && type.boundary && matches = type.mime.match(%r{multipart/(\w+)})
#0:/Library/Ruby/Gems/2.0.0/gems/mbox-0.1.0/lib/mbox/mail/content.rb:39:Mbox::Mail::Content:-:          text.sub(/^.*?--#{type.boundary}\n/m, '').sub(/--#{type.boundary}--$/m, '').split("--#{type.boundary}\n").each {|part|
^C#0:/Library/Ruby/Gems/2.0.0/gems/mbox-0.1.0/lib/mbox/mail/content.rb:39:String:^:             text.sub(/^.*?--#{type.boundary}\n/m, '').sub(/--#{type.boundary}--$/m, '').split("--#{type.boundary}\n").each {|part|
#0:/Library/Ruby/Gems/2.0.0/gems/mbox-0.1.0/lib/mbox/mail/content.rb:39:Mbox::Mail::Content:<:          text.sub(/^.*?--#{type.boundary}\n/m, '').sub(/--#{type.boundary}--$/m, '').split("--#{type.boundary}\n").each {|part|
#

It's not clear from the doc whether there is a way of paging over messages with large attachments. I'm using the stream interface, but it appears to want to parse the entire email message on most operations.

Running it through the tracer, it seems that the split line is creating a large number of substrings, maybe from an attached PDF or something.

Going to take a look at node-mbox in the meantime for a faster streaming solution.

Feature request: Use less RAM, stream mails to `each` as read.

Do not insist on storing essentially the entire mbox in RAM.

The each method could stream one mail object at a time to a code block while parsing proceeds, rather than serving it from memory later. That would be much more RAM-efficient if only one pass through the mbox is needed.

Suggestion for an option:

Mbox.new(file, :stream => 1)
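
What streaming could look like, as a sketch over any IO-like object. each_message is a hypothetical function, not the gem's each; it yields raw message text rather than parsed mail objects, and it assumes that "From " at the start of a line only ever marks a new message.

```ruby
require 'stringio'

# Hypothetical streaming reader: holds at most one message in memory
# and hands it to the block as soon as the next separator line is seen.
def each_message(io, separator = /^From /)
  current = nil
  io.each_line do |line|
    if line =~ separator
      yield current if current
      current = +''
    end
    current << line if current
  end
  yield current if current
end

mailbox = StringIO.new(
  "From alice@example.com Sun Aug 16 13:08:26 2015\nSubject: one\n\nbody one\n" \
  "From bob@example.com Mon Aug 17 09:00:00 2015\nSubject: two\n\nbody two\n"
)

subjects = []
each_message(mailbox) { |message| subjects << message[/^Subject: (.*)$/, 1] }
```

Because the reader only needs each_line, the same code works on a File opened from disk, so memory use is bounded by the largest single message rather than by the size of the mailbox.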
