mbox - A simple mbox parser.
I just needed a mbox parser for a notifier.
Yes, this is an overkill solution.
A Ruby mbox parser.
License: GNU Affero General Public License v3.0
While trying to parse a relatively large mailing list, I got this error.
I worked around it by wrapping that line in a begin/rescue/end block.
Where does the 24 here come from: https://github.com/meh/ruby-mbox/blob/82cf262/lib/mbox/mail/metadata.rb#L32 ?
I have dates that are 30 characters long, and they are currently getting cut off.
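One plausible reading, offered as an assumption and not confirmed by the source: 24 is the length of an asctime-style timestamp, which is what traditional mbox separator lines carry. The sketch below only checks the arithmetic; the 30-character sample date is made up for illustration.

```ruby
# Assumption for illustration: the separator line ends with an
# asctime-style timestamp such as "Sun Aug 16 13:08:26 2015".
stamp = Time.utc(2015, 8, 16, 13, 8, 26).strftime('%a %b %e %H:%M:%S %Y')
stamp.length # 24

# A made-up 30-character date carrying a numeric zone overflows .{24}.
longer = 'Sun Aug 16 13:08:26 +0000 2015'
longer.length # 30
longer[0, 24] # the part a fixed .{24} capture would keep
```

If that reading is right, any date format longer than the asctime layout will be truncated by a fixed-width capture.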
After the mailbox is read, file descriptors are left open and never closed. While probably not a big issue for small scripts, this could be a problem if Mbox is used by a long-running (server) process.
To make sure, I did
(1 .. 100).each do
  mbox = Mbox.open(file_name)
end
sleep 120
and had a look at the pertinent /proc/N/fd (Linux here).
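A minimal sketch of the usual Ruby remedy, using plain File objects and not the gem itself: open the file inside a helper that closes it in an ensure clause. The helper name with_mailbox_file is hypothetical, and whether Mbox.open could accept a block like this is an assumption.

```ruby
require 'tempfile'

# Hypothetical helper (not part of the mbox gem): opens a mailbox file and
# guarantees the descriptor is closed, even if the block raises.
def with_mailbox_file(path)
  file = File.open(path, 'rb')
  yield file
ensure
  file.close if file && !file.closed?
end

# Demonstration with a throwaway file standing in for a real mbox.
sample = Tempfile.new('sample-mbox')
sample.write("From alice@example.org Sun Aug 16 13:08:26 2015\nSubject: hi\n\nbody\n")
sample.close

handles = []
100.times do
  with_mailbox_file(sample.path) { |f| handles << f; f.read }
end

# Number of handles still open after the loop.
leaked = handles.count { |f| !f.closed? }
```

With this pattern the descriptor count of the process should stay flat no matter how many times the mailbox is opened.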
The regular expression /^From [^\s]+ .{24}/ is hard-coded in many different places in mail.rb.
The mailbox that I am trying to parse needs a different regular expression, namely plain /^From /.
I would like to enhance the code so that the regular expression can be handed over as an option.
My normal approach would be a data member of some object holding the regexp.
I find that more difficult than I would like it to be, because class methods are being called.
What would you think about reorganizing the code to make much less use of class methods?
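A rough sketch of the instance-data approach described above. The class SeparatorMatcher and its :separator option are hypothetical names, not the gem's API; the default pattern is the one quoted in this report.

```ruby
# Hypothetical class, not the gem's API: the separator is instance data,
# so each reader can carry its own regular expression.
class SeparatorMatcher
  # Default copied from the pattern quoted in the report.
  DEFAULT = /^From [^\s]+ .{24}/

  attr_reader :separator

  def initialize(options = {})
    @separator = options.fetch(:separator, DEFAULT)
  end

  # True when the line starts a new message under this matcher's rule.
  def separator?(line)
    !!(line =~ @separator)
  end
end

strict  = SeparatorMatcher.new
relaxed = SeparatorMatcher.new(:separator => /^From /)

line = "From list-owner\n"
strict.separator?(line)   # false: nothing follows the sender
relaxed.separator?(line)  # true
```

Because the pattern lives on the object, two mailboxes with different conventions can be read in the same process without touching any class-level state.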
While introspecting the mbox format, it would be nice to know what keys are available to query the headers. (to, from, etc are pretty obvious and standard, but other headers may be custom inserted by other clients, etc., so it's nice to have a handy dynamic reference of the keys available from a given message.)
here is the idea quick and dirty as a refinement:
class Mbox::Mail::Headers
  def keys
    @data.keys.map(&:to_sym).sort
  end
end
and an example of the output from iterating over messages from a gmail mbox export:
[1] pry(main)> m.headers.keys
=> [:authentication_results,
:content_length,
:content_type,
:date,
:delivered_to,
:dkim_signature,
:from,
:in_reply_to,
:message_id,
:mime_version,
:received,
:received_spf,
:references,
:reply_to,
:return_path,
:subject,
:to,
:x_gm_thrid,
:x_gmail_labels,
:x_received,
:x_yahoo_newman_id,
:x_yahoo_newman_property,
:x_ymail_osg]
Once I know the header keys, I can take advantage of the overloaded [] operator thus:
[2] pry(main)> m.headers[:date]
=> "Sun, 16 Aug 2015 13:08:26 +0000 (UTC)"
No time to open a pull request for this now, but maybe later.
For people who do not program in Ruby, it would be useful to know how to use this without having to learn Ruby.
Hi,
the mbox gem assumes the full mbox file to be UTF-8 encoded. But if there is a mail encoded differently, e.g. ISO-8859-1 (quite common in Europe), mbox aborts with
Error invalid byte sequence in UTF-8
(A problem that probably never occurs in countries whose everyday language fits in ASCII.)
mbox needs to check the charset given in the header and read the mail accordingly (or read it as binary).
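A minimal sketch of that idea, assuming the charset label has already been extracted from the Content-Type header. The helper decode_to_utf8 is hypothetical and not part of the gem; it uses only core String and Encoding methods.

```ruby
# Hypothetical helper, not part of the mbox gem: treat the raw bytes as
# binary first, then transcode to UTF-8 using the declared charset.
def decode_to_utf8(raw_bytes, charset)
  text = raw_bytes.dup.force_encoding(Encoding::BINARY)
  source = begin
    Encoding.find(charset.to_s)
  rescue ArgumentError
    Encoding::BINARY # unknown or missing charset label
  end
  # Bytes that cannot be converted become "?" instead of raising.
  text.force_encoding(source)
      .encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: '?')
end

latin1 = "Gr\xFC\xDFe".b # "Grüße" as ISO-8859-1 bytes
greeting = decode_to_utf8(latin1, 'ISO-8859-1')
```

Reading the file in binary mode and decoding per message avoids the "invalid byte sequence" abort, at the cost of replacing bytes whose charset is unknown.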
For one example:
Should I have a message with headers
Delivert-to: [email protected]
Delivert-To: [email protected]
this should be rolled up as two values of one header.
For another example, I should be able to grab ...headers->['message-id'] and get the message id, regardless of whether the individual message called it Message-id, Message-ID, or Message-Id.
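Both requests can be sketched with a small container that normalizes header names and collects repeated headers into a list. HeaderBag is a hypothetical class, not the gem's Headers implementation, and the addresses are placeholders.

```ruby
# Hypothetical container, not the gem's Headers class: names are compared
# case-insensitively and repeated headers accumulate as a list of values.
class HeaderBag
  def initialize
    @data = Hash.new { |hash, key| hash[key] = [] }
  end

  def add(name, value)
    @data[normalize(name)] << value
    self
  end

  # Returns every value recorded for the header, in arrival order.
  def [](name)
    @data.fetch(normalize(name), [])
  end

  private

  def normalize(name)
    name.to_s.strip.downcase
  end
end

headers = HeaderBag.new
headers.add('Message-ID', '<1@example.org>')
headers.add('Delivered-To', 'a@example.org')
headers.add('delivered-to', 'b@example.org')

headers['message-id']    # ["<1@example.org>"]
headers['DELIVERED-TO']  # ["a@example.org", "b@example.org"]
```

Returning an array for every header keeps the interface uniform, so callers do not need to know in advance which headers may repeat.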
Howdy,
Found this while looking around for something to parse some large mbox files. I'm very new to ruby, and I can't even get the example that counts messages to run. Any chance of putting some more docs/examples in git?
spork@linux-wbsc:~/archives> ruby -v
ruby 1.8.7 (2010-01-10 patchlevel 249) [i586-linux]
spork@linux-wbsc:~/archives> gem list -l ruby-mbox
*** LOCAL GEMS ***
ruby-mbox (0.0.2)
spork@linux-wbsc:~/archives> ./foo ./freebsd-stable-current
/usr/lib/ruby/gems/1.8/gems/ruby-mbox-0.0.2/lib/mbox/mbox.rb:42:in `reopen': can't convert File into String (TypeError)
from /usr/lib/ruby/gems/1.8/gems/ruby-mbox-0.0.2/lib/mbox/mbox.rb:42:in `initialize'
from ./foo:10:in `new'
from ./foo:10
spork@linux-wbsc:~/archives> cat foo
#! /usr/bin/env ruby
require 'rubygems'
require 'mbox'
if ARGV.length < 1
puts "You have to pass the mbox."
exit
end
mbox = Mbox.new(File.new(ARGV.shift))
puts mbox.length
Are there plans to hand off the control of this rubygem to one of the other github users who seem to be committing on this codebase?
https://github.com/meh/ruby-mbox/network
Or will you take pull requests here?
What is the license under which this software can be distributed? If you are undecided, I would suggest using a two-clause BSD License, which is the license under which the main Ruby implementation itself will be distributed as of version 1.9.3, or (my personal favorite) the Open Works License.
I am staring at this code in mbox.rb:
while true
  if @internal[counter]
    Mail.seek(@stream, 1, IO::SEEK_CUR)
    next
  end
  # Stuff omitted
end
I think that entire if block is dead code that cannot possibly be executed.
I reason as follows: if the Mail.seek(...) were ever executed, there is nothing here that could possibly change the value of counter or the outcome of if @internal[counter].
The stuff that I have omitted does not get executed in that case, so whatever is there cannot possibly change that either.
So we would immediately see an endless loop.
As long as we do not see the endless loop in reality, this code must be dead.
Am I missing something?
In real-world spam e-mails, it happens that type.boundary == nil. This in turn means that this check (ruby-mbox/lib/mbox/mail/content.rb, line 38 in 9d6e3a6)
&& type.boundary
is needed, else
.split("--#{type.boundary}\n")
means
.split("--\n")
which will then split along HTML comments like
<!-- some comment -->
which is nonsense.
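The guard can be sketched in isolation as follows. split_parts is a hypothetical helper, not the gem's code; it only illustrates why a nil boundary must short-circuit the split.

```ruby
# Hypothetical helper, not the gem's code: split a multipart body only when
# a non-empty boundary is known; otherwise keep the body as a single part.
def split_parts(text, boundary)
  return [text] if boundary.nil? || boundary.empty?

  text.split("--#{boundary}\n")
end

# An HTML body whose comment opener ends its line, so it contains "--\n".
html = "<p>hello</p>\n<!--\nhidden\n-->\n<p>bye</p>\n"

guarded   = split_parts(html, nil)   # body kept whole
unguarded = html.split("--#{nil}\n") # same as split("--\n"), cuts the comment

multipart = "preamble\n--XYZ\npart one\n--XYZ\npart two\n"
parts = split_parts(multipart, 'XYZ')
```

Here guarded holds one element while unguarded holds two, which is the spurious split described in this report.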
Hey. I just did gem install mbox today, and it installed from rubygems as normal. I think that master here hasn't been pushed to rubygems recently, even though the version in master says "0.1.0", which is what I have. Can you bump it to "0.2.0", since you added things, and then push it?
Error that caused me to realize this:
require 'mbox'
some_mbox = Mbox.open '/path/to/mbox'
some_mbox.class.name #=> "Mbox"
some_mbox[3].class.name #=> "Mbox::Mail"
some_mbox[3].subject #=> NoMethodError: undefined method `subject' for #<Mail:"[email protected]">
Thanks!
The content of a valid Mbox::Mail can't be converted to a String.
.to_s returns [#<File:>]
Ruby Version:
ruby 1.9.3p194 (2012-04-20 revision 35410) [x86_64-darwin11.4.2]
A program that fails had better return something other than 0 to the shell, or to whoever called it.
The two example programs don't do that when called with no arguments.
Trivial to fix. Expect a pull request within a few minutes.
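A minimal sketch of such a fix, reusing the message from the example script. The helper name require_mbox_argument is hypothetical; the point is only that the message goes to stderr and the exit status is non-zero.

```ruby
# Hypothetical helper for the example scripts: complain on stderr and
# leave with a non-zero status when the mbox argument is missing.
def require_mbox_argument(argv)
  if argv.empty?
    warn 'You have to pass the mbox.'
    exit 1
  end
  argv.first
end

# In a script this would be called as:
#   path = require_mbox_argument(ARGV)
```

Kernel#abort with a message would have the same effect in a single call, since it prints to stderr and exits with a failure status.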
parsing a 32M mbox file promptly hung on this line (from ruby trace):
#0:/Library/Ruby/Gems/2.0.0/gems/mbox-0.1.0/lib/mbox/mail/content.rb:38:Mbox::Mail::Content:-: if type && type.mime && type.boundary && matches = type.mime.match(%r{multipart/(\w+)})
#0:/Library/Ruby/Gems/2.0.0/gems/mbox-0.1.0/lib/mbox/mail/content.rb:39:Mbox::Mail::Content:-: text.sub(/^.*?--#{type.boundary}\n/m, '').sub(/--#{type.boundary}--$/m, '').split("--#{type.boundary}\n").each {|part|
^C#0:/Library/Ruby/Gems/2.0.0/gems/mbox-0.1.0/lib/mbox/mail/content.rb:39:String:^: text.sub(/^.*?--#{type.boundary}\n/m, '').sub(/--#{type.boundary}--$/m, '').split("--#{type.boundary}\n").each {|part|
#0:/Library/Ruby/Gems/2.0.0/gems/mbox-0.1.0/lib/mbox/mail/content.rb:39:Mbox::Mail::Content:<: text.sub(/^.*?--#{type.boundary}\n/m, '').sub(/--#{type.boundary}--$/m, '').split("--#{type.boundary}\n").each {|part|
#
It's not clear from the doc whether there is a way of paging over messages with large attachments. I'm using the stream interface, but it appears to want to parse the entire email message on most operations.
Running it through the tracer, it seems that the split line is creating a large number of substrings... maybe an attached PDF or something.
Going to take a look at node-mbox in the meantime for a faster streaming solution.
Do not insist on storing essentially the entire mbox in RAM.
The each method could stream one mail object at a time to a code block while parsing, rather than serving it from memory later. This is much more RAM-efficient if only one pass through the mbox is needed.
Suggestion for an option:
Mbox.new(file, :stream => 1)
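A rough sketch of what such streaming could look like, independent of the gem. The method each_raw_message and its default separator /^From / are assumptions for illustration; it yields raw message text, not Mbox::Mail objects.

```ruby
require 'stringio'

# Hypothetical streaming reader, not the gem's API: yields the raw text of
# one message at a time and keeps only the current message in memory.
def each_raw_message(io, separator = /^From /)
  return enum_for(:each_raw_message, io, separator) unless block_given?

  current = nil
  io.each_line do |line|
    if line =~ separator
      yield current if current
      current = line.dup
    elsif current
      current << line # lines before the first separator are skipped
    end
  end
  yield current if current
end

mbox = StringIO.new(<<~MBOX)
  From a@example.org Sun Aug 16 13:08:26 2015
  Subject: one

  first body
  From b@example.org Sun Aug 16 13:09:00 2015
  Subject: two

  second body
MBOX

subjects = []
each_raw_message(mbox) { |raw| subjects << raw[/^Subject: (.*)$/, 1] }
```

Memory use is bounded by the largest single message, not by the whole file. A body line that itself starts with "From " would be treated as a separator here, so a real implementation would need a stricter pattern or escaping.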