mongo-connector's Introduction

mongo-connector

The mongo-connector project originated as a MongoDB mongo-labs project and is now community-maintained under the custody of YouGov, Plc.

For complete documentation, check out the Mongo Connector Wiki.

System Overview

mongo-connector creates a pipeline from a MongoDB cluster to one or more target systems, such as Solr, Elasticsearch, or another MongoDB cluster. It synchronizes data in MongoDB to the target, then tails the MongoDB oplog to keep up with operations in MongoDB in real time. Detailed documentation is available on the wiki.
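Conceptually, tailing the oplog reduces to dispatching each entry by its operation code to the target-system adapter (a "doc manager"). Below is a simplified sketch of that dispatch: the entry fields follow the oplog format, but the three-method doc-manager interface shown here is illustrative, not mongo-connector's actual API.

```python
def route_oplog_entry(entry, doc_manager):
    """Forward one MongoDB oplog entry to a target-system adapter.

    Simplified sketch: real oplog entries carry more fields, and the
    upsert/update/remove interface here is illustrative.
    """
    op, ns, ts = entry["op"], entry["ns"], entry["ts"]
    if op == "i":                      # insert: 'o' is the new document
        doc_manager.upsert(entry["o"], ns, ts)
    elif op == "u":                    # update: 'o2' identifies the document
        doc_manager.update(entry["o2"]["_id"], entry["o"], ns, ts)
    elif op == "d":                    # delete: 'o' holds the _id
        doc_manager.remove(entry["o"]["_id"], ns, ts)
```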

Getting Started

mongo-connector supports Python 3.4+ and MongoDB versions 3.4 and 3.6.

Installation

To install mongo-connector with the MongoDB doc manager suitable for replicating data to MongoDB, use pip:

pip install mongo-connector

The install command can be customized to include the Doc Managers and any extra dependencies for the target system.

Target System                       Install Command
MongoDB                             pip install mongo-connector
Elasticsearch 1.x                   pip install 'mongo-connector[elastic]'
Amazon Elasticsearch 1.x Service    pip install 'mongo-connector[elastic-aws]'
Elasticsearch 2.x                   pip install 'mongo-connector[elastic2]'
Amazon Elasticsearch 2.x Service    pip install 'mongo-connector[elastic2-aws]'
Elasticsearch 5.x                   pip install 'mongo-connector[elastic5]'
Solr                                pip install 'mongo-connector[solr]'

You may have to run pip with sudo, depending on where you're installing mongo-connector and what privileges you have.

System V Service

Mongo Connector can install and uninstall itself as a service daemon under System V init on Linux. After installing the package, install or uninstall the service with the following command:

$ python -m mongo_connector.service.system-v [un]install

Development

You can also install the development version of mongo-connector manually:

git clone https://github.com/yougov/mongo-connector.git
pip install ./mongo-connector

Using mongo-connector

mongo-connector replicates operations from the MongoDB oplog, so a replica set must be running before startup. For development purposes, you may find it convenient to run a one-node replica set (note that this is not recommended for production):

mongod --replSet myDevReplSet

To initialize your server as a replica set, run the following command in the mongo shell:

rs.initiate()

Once the replica set is running, you may start mongo-connector. The simplest invocation resembles the following:

mongo-connector -m <mongodb server hostname>:<replica set port> \
                -t <replication endpoint URL, e.g. http://localhost:8983/solr> \
                -d <name of doc manager, e.g., solr_doc_manager>

mongo-connector has many other options besides those demonstrated above. To get a full listing with descriptions, try mongo-connector --help. You can also use mongo-connector with a configuration file.
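A configuration file can replace a long command line. Below is a minimal JSON sketch; the key names follow the sample config documented on the wiki, but verify them against your installed version:

```json
{
  "mainAddress": "localhost:27017",
  "oplogFile": "/var/log/mongo-connector/oplog.timestamp",
  "verbosity": 1,
  "docManagers": [
    {
      "docManager": "solr_doc_manager",
      "targetURL": "http://localhost:8983/solr",
      "autoCommitInterval": 0
    }
  ]
}
```

The connector is then started with mongo-connector -c config.json.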

If you want to jump-start into using mongo-connector with a particular target system, check out:

Doc Managers

Elasticsearch 1.x: https://github.com/yougov/elastic-doc-manager

Elasticsearch 2.x and 5.x: https://github.com/yougov/elastic2-doc-manager

Solr: https://github.com/yougov/solr-doc-manager

The MongoDB doc manager comes packaged with the mongo-connector project.

Troubleshooting/Questions

Having trouble with installation? Have a question about Mongo Connector? Your question or problem may be answered in the FAQ or in the wiki. If you can't find the answer there, feel free to open an issue on Mongo Connector's GitHub page.

mongo-connector's People

Contributors

10kc-awright, aayushu, adgaudio, aganapat, aherlihy, alikrubin, anuragkapur, apanimesh061, asparagirl, aviflax, bdeeney, behackett, bobend, estobbart, honzakral, ianwhalen, jaraco, jaredkipe, jgrivolla, llovett, makhdumi, malekascha, martinnowak, redox, sdz-dalbrecht, shaneharvey, stbrody, stedile, xmasotto, yeroon

mongo-connector's Issues

Does the connector handle document references?

Is it possible to save a document that includes a manual document reference? E.g., I have a book doc that references a publisher doc, and I want the connector to send them to Solr so that the book/publisher are stored in one doc:

{
    "_id": 1,
    "name": "a publisher"
}

{
    "_id": 10,
    "name": "a book",
    "publisher_id": 1
}

Then the doc in Solr would look like:

{
    "_id": 10,
    "name": "a book",
    "publisher_id": 1,
    "publisher_name": "a publisher"
}
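mongo-connector replicates documents one-for-one and has no built-in join, so combining the two documents would mean denormalizing in a custom doc manager before indexing. A sketch of such a merge step; the function name and arguments are hypothetical, not part of mongo-connector:

```python
def denormalize_book(book, publishers):
    """Embed the referenced publisher's name into a book document.

    `publishers` maps publisher _id -> publisher document. This helper
    is illustrative; a real doc manager would look the publisher up in
    MongoDB (or a cache) at upsert time.
    """
    merged = dict(book)  # copy so the source document is untouched
    publisher = publishers.get(book.get("publisher_id"))
    if publisher is not None:
        merged["publisher_name"] = publisher["name"]
    return merged
```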

Possible bug caused by bson ObjectIds?

The connector inserts one document and then barfs. I tried pyelasticsearch to populate the index manually and received the same error until I cast the ObjectId to a string.

Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 552, in __bootstrap_inner
    self.run()
  File "/usr/local/lib/python2.7/dist-packages/mongo-connector/oplog_manager.py", line 101, in run
    cursor = self.init_cursor()
  File "/usr/local/lib/python2.7/dist-packages/mongo-connector/oplog_manager.py", line 298, in init_cursor
    timestamp = self.dump_collection()
  File "/usr/local/lib/python2.7/dist-packages/mongo-connector/oplog_manager.py", line 268, in dump_collection
    self.doc_manager.upsert(doc)
  File "./doc_managers/elastic_doc_manager.py", line 77, in upsert
    self.elastic.index(doc, index, doc_type, doc_id)
  File "/usr/local/lib/python2.7/dist-packages/pyes/es.py", line 1142, in index
    return self._send_request(request_method, path, doc, querystring_args)
  File "/usr/local/lib/python2.7/dist-packages/pyes/es.py", line 574, in _send_request
    body = json.dumps(body, cls=self.encoder)
  File "/usr/lib/python2.7/json/__init__.py", line 240, in dumps
    **kw).encode(obj)
  File "/usr/lib/python2.7/json/encoder.py", line 203, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib/python2.7/json/encoder.py", line 266, in iterencode
    return _iterencode(o, 0)
ValueError: Circular reference detected
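Whatever the exact trigger for the error above, the workaround the reporter found (casting ObjectId to str) can be applied generically through json.dumps's default hook. A sketch with a stand-in ObjectId class, since bson may not be installed:

```python
import json

class FakeObjectId:
    """Stand-in for bson.ObjectId so this sketch runs without pymongo."""
    def __init__(self, hex_str):
        self.hex_str = hex_str
    def __str__(self):
        return self.hex_str

def to_json(doc):
    # default=str stringifies anything json can't encode natively
    # (ObjectId, datetime, ...) -- the same cast that fixed the
    # reporter's error, applied automatically.
    return json.dumps(doc, default=str)
```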

Bug in default -o config.txt file without correct permissions

If you start mongo_connector.py -m "" -t "" without specifying -o, it defaults to config.txt. But if you don't have write permissions on config.txt, it doesn't error; it performs a new full dump as if it were initializing the connector again.

If you specify -o filename.txt and don't have permissions, an error is printed, but mongo_connector.py doesn't kill itself. <-- not sure what you want the behavior to be here?

For the first example I would assume raising an error would work just fine? I was going to add it and open a pull request, but wasn't sure if that was the expected behavior.

how to deal with nested objects

Hi folks,
I'm having trouble importing nested objects into Solr. My MongoDB object has this form:

rs0:PRIMARY> x
{
    "_id" : ObjectId("50c44df9058f6fe4cb69335e"),
    "date" : ISODate("2012-12-09T05:00:00Z"),
    "nid" : 3,
    "sid" : 17411,
    "siteRef" : "3-17411",
    "total" : {
        "a" : 311119,
        "ab" : 248929,
        "aw" : 4308,
        "c" : 1.9857989999999885,
        "r" : 0.2903688648648634
    },
    "web" : {
        "a" : 311119,
        "ab" : 248929,
        "aw" : 4308,
        "c" : 1.9857989999999885,
        "r" : 0.2903688648648634
    }
}

SEVERE: org.apache.solr.common.SolrException: ERROR: [doc=50c44df9058f6fe4cb69335e] Error adding field 'total'='[a, c, ab, r, aw]'

This is the error from Solr, but it's only shown to highlight that Solr sees "total" as an array when it's supposed to be a JSON object.

About schema.xml in Solr: I deliberately put a bad parameter to raise an error and reveal that structure. But with the following configuration:

there is no error, yet the "total" and "web" structures don't show up in Solr.
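Solr's flat schema cannot store nested objects as-is; a common workaround is to flatten nested documents into dotted field names before indexing (with matching dynamic fields declared in schema.xml). Below is a sketch of the flattening step; this is not necessarily what your mongo-connector version does.

```python
def flatten(doc, prefix=""):
    """Flatten nested dicts into dotted keys.

    E.g. {'total': {'a': 1}} becomes {'total.a': 1}, which a flat Solr
    schema can index via dynamic fields. Lists are left untouched here.
    """
    flat = {}
    for key, value in doc.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))   # recurse into sub-documents
        else:
            flat[name] = value
    return flat
```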

mongo-connector doesn't read oplog_progress.txt

hello

I have a question about the oplog_progress.txt file.

On the first run, mongo-connector writes oplog_progress.txt like this:

["Collection(Database(Connection([u'mongodb_s8:20212', u'mongodb_s7:20212']), u'local'), u'oplog.rs')", 5821688115139444745]
["Collection(Database(Connection([u'mongodb_s3:20212', u'mongodb_s4:20212']), u'local'), u'oplog.rs')", 5821688656305324034]
["Collection(Database(Connection([u'mongodb_s6:20212', u'mongodb_s5:20212']), u'local'), u'oplog.rs')", 5821688630535520257]
["Collection(Database(Connection([u'mongodb_s2:20212', u'mongodb_s1:20212']), u'local'), u'oplog.rs')", 5821688110844477469]

all in one file.

On the second run, mongo-connector can't read the oplog_progress.txt file, and a full re-dump starts.

How can I use this in a sharded environment without any problems?

Issue with non-replica-set mongo setups.

I think the assumption on line 222 of connector.py is wrong:

try:
    main_conn.admin.command("isdbgrid")
except pymongo.errors.OperationFailure:
    conn_type = "REPLSET"

I tested this on a mongo server without replication, and although it raises an OperationFailure in the code above, it then goes on to fail again on line 232:

repl_set = prim_admin.command("replSetGetStatus")['set']

this time without any exception handling.
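A defensive shape for this check would wrap both topology probes in exception handling. The sketch below uses an injected command callable and a stand-in exception type; neither is mongo-connector's actual code.

```python
class CommandError(Exception):
    """Stand-in for pymongo.errors.OperationFailure in this sketch."""

def classify_connection(admin_command):
    """Classify the connected server as sharded, replica set, or standalone.

    `admin_command` is a callable shaped like main_conn.admin.command that
    raises CommandError when a command is unsupported (illustrative interface).
    """
    try:
        admin_command("isdbgrid")          # only mongos accepts this
        return "SHARDED"
    except CommandError:
        pass
    try:
        admin_command("replSetGetStatus")  # fails on a standalone mongod
        return "REPLSET"
    except CommandError:
        return "STANDALONE"
```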

ImportError: No module named 'mongo_connector.util'

When running mongo_connector.py with Python 3.3 on Windows 8, I get the following stacktrace:

2013-10-06 13:26:05,021 - INFO - Beginning Mongo Connector
Traceback (most recent call last):
  File "", line 1521, in _find_and_load_unlocked
AttributeError: 'module' object has no attribute 'path'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\doc\mongo-connector-master\mongo-connector-master\mongo_connector\mongo_connector.py", line 481, in <module>
    main()
  File "C:\doc\mongo-connector-master\mongo-connector-master\mongo_connector\mongo_connector.py", line 466, in main
    auth_username=options.admin_name)
  File "C:\doc\mongo-connector-master\mongo-connector-master\mongo_connector\mongo_connector.py", line 53, in __init__
    doc_manager = imp.load_source('DocManager', doc_manager)
  File "C:\Python33\lib\imp.py", line 114, in load_source
    _LoadSourceCompatibility(name, pathname, file).load_module(name)
  File "", line 586, in _check_name_wrapper
  File "", line 1024, in load_module
  File "", line 1005, in load_module
  File "", line 562, in module_for_loader_wrapper
  File "", line 870, in _load_module
  File "", line 313, in _call_with_frames_removed
  File "C:\doc\mongo-connector-master\mongo-connector-master\mongo_connector\doc_managers\solr_doc_manager.py", line 32, in <module>
    from mongo_connector.util import verify_url, retry_until_ok
ImportError: No module named 'mongo_connector.util'; mongo_connector is not a package

I saw that the import mechanism was modified in Python 3.3, and I was able to make mongo-connector work by replacing line 32 in solr_doc_manager.py with: "from util import verify_url, retry_until_ok".
I also tried to run it with Python 3.2, but apparently there is a compatibility problem with simplejson.

mongo connector inserts extra ns and _ts fields into mongo documents when used mongo-to-mongo

The mongo document manager assumes that an ns field is passed in the doc. In fact both an ns and a _ts field are passed in the doc to the mongo document manager, and the manager then inserts the entire doc into the target mongo collection. This results in documents having ns and _ts fields where they did not before.

I have fixed this; the fix is awaiting review in pull request #45.

OperationFailure: database error: getMore: cursor didn't exist on server, possible restart or timeout?

I read about this tool recently and tried to sync two mongodb sharded clusters with mongo-connector today. This somehow does not work, and crashes with a traceback.


$ python mongo_connector.py -m XXX.edelight.net:27017 -t mongodb://localhost -d ./doc_managers/mongo_doc_manager.py
2012-08-16 08:48:04,332 - INFO - Beginning Mongo Connector
2012-08-16 08:48:05,377 - INFO - MongoConnector: Empty oplog progress file.
2012-08-16 08:48:05,440 - INFO - OplogManager: Initializing oplog thread
2012-08-16 08:48:05,444 - INFO - MongoConnector: Starting connection thread Connection([u's1.edelight.net:27017', u's2.edelight.net:27017', u's3.edelight.net:27017'])
2012-08-16 08:48:05,472 - INFO - OplogManager: Initializing oplog thread
2012-08-16 08:48:05,475 - INFO - MongoConnector: Starting connection thread Connection([u's4.edelight.net:27017', u's5.edelight.net:27017', u'6.edelight.net:27017'])
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python2.6/threading.py", line 532, in __bootstrap_inner
    self.run()
  File "/home/mkorn/mongo-connector/oplog_manager.py", line 101, in run
    cursor = self.init_cursor()
  File "/home/mkorn/mongo-connector/oplog_manager.py", line 322, in init_cursor
    timestamp = self.dump_collection()
  File "/home/mkorn/mongo-connector/oplog_manager.py", line 273, in dump_collection
    for doc in cursor:
  File "/usr/local/lib/python2.6/dist-packages/pymongo-2.2-py2.6-linux-x86_64.egg/pymongo/cursor.py", line 747, in next
    if len(self.__data) or self._refresh():
  File "/usr/local/lib/python2.6/dist-packages/pymongo-2.2-py2.6-linux-x86_64.egg/pymongo/cursor.py", line 711, in _refresh
    limit, self.__id))
  File "/usr/local/lib/python2.6/dist-packages/pymongo-2.2-py2.6-linux-x86_64.egg/pymongo/cursor.py", line 657, in __send_message
    self.__tz_aware)
  File "/usr/local/lib/python2.6/dist-packages/pymongo-2.2-py2.6-linux-x86_64.egg/pymongo/helpers.py", line 102, in _unpack_response
    error_object["$err"])
OperationFailure: database error: getMore: cursor didn't exist on server, possible restart or timeout?

Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/lib/python2.6/threading.py", line 532, in __bootstrap_inner
    self.run()
  File "/home/mkorn/mongo-connector/oplog_manager.py", line 101, in run
    cursor = self.init_cursor()
  File "/home/mkorn/mongo-connector/oplog_manager.py", line 322, in init_cursor
    timestamp = self.dump_collection()
  File "/home/mkorn/mongo-connector/oplog_manager.py", line 273, in dump_collection
    for doc in cursor:
  File "/usr/local/lib/python2.6/dist-packages/pymongo-2.2-py2.6-linux-x86_64.egg/pymongo/cursor.py", line 747, in next
    if len(self.__data) or self._refresh():
  File "/usr/local/lib/python2.6/dist-packages/pymongo-2.2-py2.6-linux-x86_64.egg/pymongo/cursor.py", line 711, in _refresh
    limit, self.__id))
  File "/usr/local/lib/python2.6/dist-packages/pymongo-2.2-py2.6-linux-x86_64.egg/pymongo/cursor.py", line 657, in __send_message
    self.__tz_aware)
  File "/usr/local/lib/python2.6/dist-packages/pymongo-2.2-py2.6-linux-x86_64.egg/pymongo/helpers.py", line 102, in _unpack_response
    error_object["$err"])
OperationFailure: database error: getMore: cursor didn't exist on server, possible restart or timeout?

At this point it hangs forever, and the target system does not get any data.

XXX.edelight.net points to a mongos in the old cluster. On this server I see incoming connections in the mongodb logs.

If you need further information I'm happy to provide it; mongo-connector seems like a promising solution for various scenarios to me.

Markus

batch upsert method for DocManagers

The upsert method is currently defined to take only a single document and upsert it into the target system. When the target system offers a batch API, a DocManager for that system should be able to take advantage of it for better performance. This change will probably involve writing and documenting a new method that can do batch upserts. It should be valid for the method to be absent, so older DocManagers still work correctly.
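One backward-compatible shape for this: probe for the batch method with getattr and fall back to per-document upserts when a DocManager doesn't define it. A sketch with illustrative method names (the eventual API may differ):

```python
def bulk_upsert(doc_manager, docs):
    """Upsert many documents, using a batch API when the DocManager has one.

    'bulk_upsert' is the hypothetical optional method name; older
    DocManagers that only define 'upsert' keep working unchanged.
    """
    batch = getattr(doc_manager, "bulk_upsert", None)
    if callable(batch):
        batch(docs)              # one round trip for the whole batch
    else:
        for doc in docs:         # legacy path: one round trip per document
            doc_manager.upsert(doc)
```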

Version on PyPI broken

The version of mongo-connector on PyPI currently does not install the mongo-connector script. I had to clone this repo manually and run setup.py install to get a working mongo-connector.

Mapping nested documents for Solr

I may be completely wrong, but as I understand it, mongo-connector sends all documents to the target system (in my case Solr) as-is. Is there any way to map the collection and pass only the necessary fields to Solr?

unexpected keyword argument 'namespace_set'

Hi I am trying to get started with mongo-connector, but I am hitting the following error:

Traceback (most recent call last):
  File "/usr/local/bin/mongo-connector", line 9, in <module>
    load_entry_point('mongo-connector==1.1.1-', 'console_scripts', 'mongo-connector')()
  File "build/bdist.linux-x86_64/egg/mongo_connector/connector.py", line 484, in main
  File "build/bdist.linux-x86_64/egg/mongo_connector/connector.py", line 110, in __init__
TypeError: __init__() got an unexpected keyword argument 'namespace_set'

Move code from main path to class set up path

Currently most of the setup occurs in the module when a test file is run. This code should really exist inside of the actual test case, not in the module that contains it. Note that this should also include moving configuration from being hardcoded in to it being stored in config files.

AttributeError: 'NoneType' object has no attribute 'count'

I got this crash after running mongo_connector for a while. I'm using the latest 1.0.0 version.

mkorn@srv00196:/usr/local/lib/python2.6/dist-packages/mongo-connector$ sudo python mongo_connector.py -m srv00044.edelight.net:27017 -t mongodb://localhost -d ./doc_managers/mongo_doc_manager.py
[sudo] password for mkorn:
2012-08-17 13:00:58,159 - INFO - Beginning Mongo Connector
2012-08-17 13:00:59,164 - INFO - MongoConnector: Empty oplog progress file.
2012-08-17 13:00:59,177 - INFO - OplogManager: Initializing oplog thread
2012-08-17 13:00:59,180 - INFO - MongoConnector: Starting connection thread Connection([u'srv00076.edelight.net:27017', u'srv00048.edelight.net:27017', u'srv00053.edelight.net:27017'])
2012-08-17 13:00:59,183 - INFO - OplogManager: Initializing oplog thread
2012-08-17 13:00:59,188 - INFO - MongoConnector: Starting connection thread Connection([u'srv00067.edelight.net:27017', u'srv00092.edelight.net:27017', u'srv00164.edelight.net:27017'])
2012-08-17 17:48:42,241 - ERROR - OplogManager: Failed during dump collection cannot recover! Collection(Database(Connection([u'srv00076.edelight.net:27017', u'srv00048.edelight.net:27017', u'srv00053.edelight.net:27017']), u'local'), u'oplog.rs')
2012-08-17 17:48:42,241 - INFO - OplogManager: Collection(Database(Connection([u'srv00076.edelight.net:27017', u'srv00048.edelight.net:27017', u'srv00053.edelight.net:27017']), u'local'), u'oplog.rs') Dumped collection into target system
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python2.6/threading.py", line 532, in __bootstrap_inner
    self.run()
  File "/usr/local/lib/python2.6/dist-packages/mongo-connector/oplog_manager.py", line 112, in run
    if util.retry_until_ok(cursor.count) == 1:
AttributeError: 'NoneType' object has no attribute 'count'

2012-08-17 17:48:42,265 - ERROR - OplogManager: Failed during dump collection cannot recover! Collection(Database(Connection([u'srv00067.edelight.net:27017', u'srv00092.edelight.net:27017', u'srv00164.edelight.net:27017']), u'local'), u'oplog.rs')
2012-08-17 17:48:42,265 - INFO - OplogManager: Collection(Database(Connection([u'srv00067.edelight.net:27017', u'srv00092.edelight.net:27017', u'srv00164.edelight.net:27017']), u'local'), u'oplog.rs') Dumped collection into target system
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/lib/python2.6/threading.py", line 532, in __bootstrap_inner
    self.run()
  File "/usr/local/lib/python2.6/dist-packages/mongo-connector/oplog_manager.py", line 112, in run
    if util.retry_until_ok(cursor.count) == 1:
AttributeError: 'NoneType' object has no attribute 'count'

2012-08-17 17:48:42,366 - ERROR - MongoConnector: OplogThread <OplogThread(Thread-3, stopped 139709445224192)> unexpectedly stopped! Shutting down
2012-08-17 17:48:42,366 - INFO - MongoConnector: Stopping all OplogThreads

I'm happy to provide more information if needed.

Thanks for your help.
Markus

unable to install mongo-connector

Installation using pip is not an option (the server is not connected to the internet).
So I did it the "long" way from the readme file:
python setup.py install
This results in an error:

# python setup.py install
Traceback (most recent call last):
  File "setup.py", line 32, in <module>
    from ez_setup import setup
ImportError: cannot import name setup

I am running Python 2.6 on SuSE Linux Enterprise Server 11.2.

Missing keyFile support

I would like to be able to specify a key file. I can't connect to a replica set that uses keyFile.

Are key files supported in the PyMongo driver?

Connecting to a replica set which uses keyFile throws the following error:

Exception:

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib64/python2.6/threading.py", line 532, in __bootstrap_inner
    self.run()
  File "mongo_connector.py", line 204, in run
    repl_set = prim_admin.command("replSetGetStatus")['set']
  File "/usr/lib64/python2.6/site-packages/pymongo/database.py", line 395, in command
    msg, allowable_errors)
  File "/usr/lib64/python2.6/site-packages/pymongo/helpers.py", line 144, in _check_command_response
    raise OperationFailure(msg % details["errmsg"])
OperationFailure: command SON([('replSetGetStatus', 1)]) failed: need to login

Performance Issue

Hi,
I'm running some tests to index a MongoDB database into Solr 4.4. The database contains 10,000,000 documents, and importing them takes a long time: mongo-connector pushes 120,000 documents per hour to Solr, so a full import would need 83 hours.
Using the data import handler with the same documents from MySQL, the whole import needs 3 hours.
I think one of the main reasons for this slow import is that each document is imported individually and followed by its own commit, which is a performance overkill.
So the question is: isn't it possible to push a set of documents to Solr followed by one commit for all of the docs? Or is there an "initial sync" feature based on a MongoDB query rather than the oplog?
There is a second problem: because the oplog is a capped collection, there can be more documents in MongoDB than in the oplog. So how can I index those documents which are not in the oplog (anymore)?

Possible to manually install mongo-connector?

Hello,

My site does not allow installation via easy_install or pip. Is there a reasonable set of steps for a Python noob (on Windows 7, Python 2.7) to get this installed?

Thanks,
-Mike

command-line running

Could you please amend the documentation to show command-line usage for non-Python-savvy users?

I have tried the latest version of the code, running python mongo-connector as in the readme multiple times (which worked in previous versions), and the response is "unable to find module util".
Amending the import to import util rather than mongo_connector.util appears to make it work.

I am thinking I am doing something wrong to need to do this.

Pep-8 compliant

Currently the code would not pass pylint and is not compliant with PEP 8 requirements.

pip install complains about a syntax error in elastic_doc_manager.py

There is a syntax error in mongo-connector/doc_managers/elastic_doc_manager.py at line 131:

$ sudo pip install mongo-connector
Downloading/unpacking mongo-connector
  Downloading mongo-connector-1.0.0.tar.gz
[...]
  Running setup.py install for importlib
  Running setup.py install for mongo-connector
    SyntaxError: ('invalid syntax', ('/usr/local/lib/python2.6/dist-packages/mongo-connector/doc_managers/elastic_doc_manager.py', 131, 65, "        result = self.elastic.search(q, size=1, sort={'_ts:desc'})\n"))

  Running setup.py install for pyes
    warning: no files found matching 'pavement.py'
    warning: no files found matching '*' under directory 'bin'
    warning: no files found matching '*' under directory 'examples'
    no previously-included directories found matching 'docs/*'
[...]

mongo_connector breaks down when trying to import more than 34000 records

After 34,000+ documents were imported, mongo_connector breaks down with this error:

ERROR - Failed to connect to server at 'http://localhost:8983/solr/update/?commit=false', are you sure that URL is correct? Checking it in a browser might help: HTTPConnectionPool(host='localhost', port=8983): Max retries exceeded with url: /solr/update/?commit=false (Caused by <class 'socket.error'>: [Errno 99] Cannot assign requested address)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/pysolr.py", line 274, in _send_request
    timeout=self.timeout)
  File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 88, in post
    return request('post', url, data=data, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 44, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 354, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 460, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/adapters.py", line 246, in send
    raise ConnectionError(e)
ConnectionError: HTTPConnectionPool(host='localhost', port=8983): Max retries exceeded with url: /solr/update/?commit=false (Caused by <class 'socket.error'>: [Errno 99] Cannot assign requested address)
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 552, in __bootstrap_inner
    self.run()
  File "/opt/solr/mongo_connector/mongo_connector/oplog_manager.py", line 105, in run
    cursor = self.init_cursor()
  File "/opt/solr/mongo_connector/mongo_connector/oplog_manager.py", line 304, in init_cursor
    timestamp = self.dump_collection()
  File "/opt/solr/mongo_connector/mongo_connector/oplog_manager.py", line 274, in dump_collection
    self.doc_manager.upsert(doc)
  File "./doc_managers/solr_doc_manager.py", line 120, in upsert
    self.solr.add([doc], commit=False)
  File "/usr/local/lib/python2.7/dist-packages/pysolr.py", line 749, in add
    return self._update(m, commit=commit, waitFlush=waitFlush, waitSearcher=waitSearcher)
  File "/usr/local/lib/python2.7/dist-packages/pysolr.py", line 359, in _update
    return self._send_request('post', path, message,
        {'Content-type': 'text/xml; charset=utf-8'}
    )
  File "/usr/local/lib/python2.7/dist-packages/pysolr.py", line 283, in _send_request
    raise SolrError(error_message % params)
SolrError: Failed to connect to server at 'http://localhost:8983/solr/update/?commit=false', are you sure that URL is correct? Checking it in a browser might help: HTTPConnectionPool(host='localhost', port=8983): Max retries exceeded with url: /solr/update/?commit=false (Caused by <class 'socket.error'>: [Errno 99] Cannot assign requested address)

I reproduced the error several times; it breaks down after 34,050-34,150 records have been imported. I can still run queries or ping Solr while the import is in progress.
I use mongo 2.4.5, Ubuntu 12.04 (mongo_connector), Ubuntu 11.10 (mongodb).

Is this a mongo_connector error, or probably a Solr (system) misconfiguration?

Better commit behavior

Each DocManager has an auto_commit keyword that specifies whether it should periodically ask the destination database to persist any changes. Currently, some DocManagers default this to True and some to False. The default setting should be consistent, and it should be possible to override it via a command-line option when mongo-connector is started.

Secondly, the upsert and delete methods in every DocManager automatically commit the change made. This is unnecessary, as discussed in #53.

Plugin with entry points

Extend the plugin functionality to work with python entry points to make it even easier for users to create their own mongo-connector plugins.

Script gets stuck when a thread hits an error.

I checked on mongo-connector and saw that it was idle, then looked at the log and found this error. It looks like the script just stopped, but the process was still registering 13% CPU. Not quite sure what happened. Let me know if you need anything from me to look into this. Thanks,

Sam

log file:
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib64/python2.6/threading.py", line 532, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.6/site-packages/mongo_connector-1.1.1-py2.6.egg/mongo-connector/oplog_manager.py", line 101, in run
    cursor = self.init_cursor()
  File "/usr/lib/python2.6/site-packages/mongo_connector-1.1.1-py2.6.egg/mongo-connector/oplog_manager.py", line 298, in init_cursor
    timestamp = self.dump_collection()
  File "/usr/lib/python2.6/site-packages/mongo_connector-1.1.1-py2.6.egg/mongo-connector/oplog_manager.py", line 268, in dump_collection
    self.doc_manager.upsert(doc)
  File "doc_managers/mongo_doc_manager.py", line 67, in upsert
    self.mongo[db][coll].save(doc)
  File "/usr/lib/python2.6/site-packages/pymongo-2.5.2-py2.6-linux-x86_64.egg/pymongo/collection.py", line 269, in save
    manipulate, safe, check_keys=check_keys, **kwargs)
  File "/usr/lib/python2.6/site-packages/pymongo-2.5.2-py2.6-linux-x86_64.egg/pymongo/collection.py", line 479, in update
    check_keys, self.__uuid_subtype), safe)
  File "/usr/lib/python2.6/site-packages/pymongo-2.5.2-py2.6-linux-x86_64.egg/pymongo/message.py", line 110, in update
    encoded = bson.BSON.encode(doc, check_keys, uuid_subtype)
  File "/usr/lib/python2.6/site-packages/pymongo-2.5.2-py2.6-linux-x86_64.egg/bson/__init__.py", line 567, in encode
    return cls(_dict_to_bson(document, check_keys, uuid_subtype))
  File "/usr/lib/python2.6/site-packages/pymongo-2.5.2-py2.6-linux-x86_64.egg/bson/__init__.py", line 476, in _dict_to_bson
    elements.append(_element_to_bson(key, value, check_keys, uuid_subtype))
  File "/usr/lib/python2.6/site-packages/pymongo-2.5.2-py2.6-linux-x86_64.egg/bson/__init__.py", line 406, in _element_to_bson
    return BSONOBJ + name + _dict_to_bson(value, check_keys, uuid_subtype, False)
  File "/usr/lib/python2.6/site-packages/pymongo-2.5.2-py2.6-linux-x86_64.egg/bson/__init__.py", line 476, in _dict_to_bson
    elements.append(_element_to_bson(key, value, check_keys, uuid_subtype))
  File "/usr/lib/python2.6/site-packages/pymongo-2.5.2-py2.6-linux-x86_64.egg/bson/__init__.py", line 353, in _element_to_bson
    raise InvalidDocument("key %r must not contain '.'" % key)
InvalidDocument: key u'Music.com' must not contain '.'

'bytes' object has no attribute 'encode'

Hi,
I am wrestling with this issue and can't find the cause.
I installed mongo-connector through pip on Python3. Everything went ok.
I started a local mongo and a local solr, using standard options.
When I start the mongo-connector, it crashes immediately. See stack trace below.

This connector is exactly what I need for my project, I hope someone will rescue me on this! I am sure it is trivial, but I am not expert enough in Python to solve it.

c:\Program Files\python3\Lib\site-packages\mongo-connector>python mongo_connector.py -m localhost:27017 -t http://localhost:8080/solr/ -d ./doc_managers/solr_doc_manager.py
2013-05-21 12:58:01,131 - INFO - Beginning Mongo Connector
Traceback (most recent call last):
File "mongo_connector.py", line 441, in <module>
auth_username=options.admin_name)
File "mongo_connector.py", line 100, in __init__
unique_key=u_key)
File "./doc_managers/solr_doc_manager.py", line 54, in __init__
self.run_auto_commit()
File "./doc_managers/solr_doc_manager.py", line 95, in run_auto_commit
self.solr.commit()
File "C:\Program Files\python3\lib\site-packages\pysolr.py", line 802, in commit
return self._update(msg, waitFlush=waitFlush, waitSearcher=waitSearcher)
File "C:\Program Files\python3\lib\site-packages\pysolr.py", line 359, in _update
return self._send_request('post', path, message, {'Content-type': 'text/xml; charset=utf-8'})
File "C:\Program Files\python3\lib\site-packages\pysolr.py", line 274, in _send_request
timeout=self.timeout)
File "C:\Program Files\python3\lib\site-packages\requests\api.py", line 88, in post
return request('post', url, data=data, **kwargs)
File "C:\Program Files\python3\lib\site-packages\requests\api.py", line 44, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Program Files\python3\lib\site-packages\requests\sessions.py", line 336, in request
prep = req.prepare()
File "C:\Program Files\python3\lib\site-packages\requests\models.py", line 223, in prepare
p.prepare_headers(self.headers)
File "C:\Program Files\python3\lib\site-packages\requests\models.py", line 340, in prepare_headers
headers = dict((name.encode('ascii'), value) for name, value in headers.items())
File "C:\Program Files\python3\lib\site-packages\requests\models.py", line 340, in <genexpr>
headers = dict((name.encode('ascii'), value) for name, value in headers.items())
AttributeError: 'bytes' object has no attribute 'encode'

Load options from configuration file

New features in Mongo Connector are making the command-line options increasingly lengthy. There should be a configuration file (not the same as the timestamp "config" file) that provides the equivalent of the command-line options to run with.

Some specs:

  • there should be an additional command-line option providing the path to the config file
  • there should be a documented default place to look for the config file
  • options specified in the config file can be overridden by explicit command-line options
  • a default configuration file should be provided in the repository with sensible options
  • all current command-line options should be settable in the config file

Considering the widespread usage of the JSON format in the mongoverse, let's use JSON as the config file format.
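A minimal sketch of how the specs above could fit together: defaults, then a JSON config file, then explicit command-line overrides. All option names and flags here are illustrative, not an existing mongo-connector schema.

```python
import argparse
import json

# Hypothetical defaults mirroring a couple of command-line options.
DEFAULTS = {"mainAddress": "localhost:27017", "targetURL": None}

def load_options(argv):
    """Merge defaults, a JSON config file, and command-line options,
    with the command line taking highest precedence."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-c", "--config-file", default=None)
    parser.add_argument("-m", "--main", dest="mainAddress", default=None)
    parser.add_argument("-t", "--target", dest="targetURL", default=None)
    args = parser.parse_args(argv)

    options = dict(DEFAULTS)
    if args.config_file:
        with open(args.config_file) as f:
            options.update(json.load(f))  # config file overrides defaults
    # Explicit command-line options override the config file.
    for key in ("mainAddress", "targetURL"):
        value = getattr(args, key)
        if value is not None:
            options[key] = value
    return options
```

A config file for this sketch would simply be a flat JSON object, e.g. `{"mainAddress": "db.example.com:27017"}`.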

Error handling data with unrecognized character-set

self.doc_manager.upsert(doc) from time to time generates an error like this:

ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

This is happening inside of the lxml module, but this exception should be handled properly (in the upsert method) so that the indexing is not stopped.
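One way the upsert call could be guarded, sketched below: strip the XML-illegal control characters before indexing and log (rather than crash on) anything that still fails. The names sanitize and safe_upsert are illustrative, not part of mongo-connector.

```python
import logging
import re

# Characters illegal in XML 1.0: NULL and other control characters
# below 0x20, except tab (\x09), newline (\x0a), and carriage return (\x0d).
_ILLEGAL_XML = re.compile("[\x00-\x08\x0b\x0c\x0e-\x1f]")

def sanitize(value):
    """Recursively strip XML-incompatible control characters from strings."""
    if isinstance(value, str):
        return _ILLEGAL_XML.sub("", value)
    if isinstance(value, dict):
        return {k: sanitize(v) for k, v in value.items()}
    if isinstance(value, list):
        return [sanitize(v) for v in value]
    return value

def safe_upsert(doc_manager, doc):
    """Upsert a document, skipping bad data instead of stopping indexing."""
    try:
        doc_manager.upsert(sanitize(doc))
    except ValueError:
        logging.error("Skipping document with XML-incompatible data: %r",
                      doc.get("_id"))
```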

better tests for sharded environments

test_oplog_manager_sharded.py covers only the most basic cases. There need to be tests that cover a sharded collection and many more possible failure conditions. The sharding tests should cover at least the following cases:

  • replicating a sharded collection
  • replicating during a chunk migration
  • failover of a primary with sharded collection
  • replicating a collection spread over a very large (> 100) number of shards

There are probably many more cases to cover as well, and they should be added to this issue as necessary.

Python 2/3 compatibility

Is mongo-connector compatible with both Python 2 and 3? If not, it should be updated to be.

Elastic Doc Manager

Currently does not pass the get_last_doc test; in particular, it does not retrieve newly inserted documents correctly.

doc managers should return generators, not lists

There could be quite a few documents returned by DocManager methods like search() and _search(). It would be much better to have these methods return generators instead of lists to conserve memory.
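The difference can be sketched with an in-memory stand-in for a DocManager's document store (the function names and the `_ts` filter are illustrative, not the actual DocManager API):

```python
def search_list(docs, start_ts, end_ts):
    """Current style: builds the whole result list in memory at once."""
    return [d for d in docs if start_ts <= d["_ts"] <= end_ts]

def search_gen(docs, start_ts, end_ts):
    """Generator style: yields matches lazily, one at a time, so the
    caller never holds more than one document of the result in memory."""
    for d in docs:
        if start_ts <= d["_ts"] <= end_ts:
            yield d
```

Callers that iterate over the result work unchanged with either version; only callers that index into the result or take its length would need `list(...)` around the generator.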

Incorrect values inserted in many circumstances (race condition)

There is a race condition (sort of) when inserting or updating documents. The problem can be found in oplog_manager.py, here (starting at line 139):

elif operation == 'i' or operation == 'u':
    doc = self.retrieve_doc(entry)
    if doc is not None:
        doc['_ts'] = util.bson_ts_to_long(entry['ts'])
        doc['ns'] = entry['ns']
        try:
            self.doc_manager.upsert(doc)
        except SystemError:
            logging.error("Unable to insert %s" % (doc))

This calls retrieve_doc(entry), which in turn fetches the actual document in question from the primary host / mongos. The problem is that it always grabs the latest version of the document, which is then passed to the upsert.

For example, consider (python):

inserted_obj = self.p_db.insert({'name': 'test'})
same_obj = self.p_db.update({'_id': ObjectId(inserted_obj)},
                            {'$set': {'count': 1}})
same_obj = self.p_db.update({'_id': ObjectId(inserted_obj)},
                            {'$set': {'asdf': 2}})

Now the oplog entries will look like:

  {u'h': 8083363948423806993L, u'ts': Timestamp(1373934783, 1), u'o': {u'_id': ObjectId('51e494bfece1bb5523c89761'), u'name': u'test'}, u'v': 2, u'ns': u'test.test', u'op': u'i'}
  {u'h': 5712661147719603205L, u'ts': Timestamp(1373934783, 2), u'o': {u'$set': {u'count': 1}}, u'v': 2, u'ns': u'test.test', u'o2': {u'_id': ObjectId('51e494bfece1bb5523c89761')}, u'op': u'u'}
  {u'h': 1926329060389529522L, u'ts': Timestamp(1373934783, 3), u'o': {u'$set': {u'asdf': 2}}, u'v': 2, u'ns': u'test.test', u'o2': {u'_id': ObjectId('51e494bfece1bb5523c89761')}, u'op': u'u'}

So what you would expect, as these entries are replayed into a target system, is that fetching the documents from mongo would give you dictionaries like:

  {'_id' : ObjectId('51e494bfece1bb5523c89761'), 'name' : 'test'}
  {'_id' : ObjectId('51e494bfece1bb5523c89761'), 'name' : 'test', 'count' : 1}
  {'_id' : ObjectId('51e494bfece1bb5523c89761'), 'name' : 'test', 'count' : 1, 'asdf' : 2}

But what actually happens is that you just get the last one, which is:

  {'_id' : ObjectId('51e494bfece1bb5523c89761'), 'name' : 'test', 'count' : 1, 'asdf' : 2}

So when you process an insert or update oplog operation, it is entirely possible that you will get a document that is too up-to-date; that is, it already reflects oplog entries that have not yet been processed.

You can really goof this up by altering the above sequence of events to:

inserted_obj = self.p_db.insert({'name': 'test'})
same_obj = self.p_db.update({'_id': ObjectId(inserted_obj)},
                            {'$set': {'count': 1}})
same_obj = self.p_db.update({'_id': ObjectId(inserted_obj)},
                            {'$set': {'asdf': 2}})
self.p_db.remove({'name': 'test'})

This will cause it to search for a document that doesn't exist, so when you go to process the oplog operation that corresponds to a delete, it is entirely possible that it won't know what to delete.

Depending on your doc manager, it is possible that this may not be a huge issue, but chances are it probably is. This is a major bug and needs to be addressed accordingly.
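One direction for a fix is to apply the operation recorded in the oplog entry itself to the target system, instead of re-fetching the document from MongoDB. A minimal sketch, using a plain dict as a stand-in for a DocManager and handling only inserts, $set/$unset updates, and deletes (a real fix would have to cover every update operator and full-document replacements):

```python
def apply_oplog_entry(target, entry):
    """Apply one oplog entry directly to `target`, a dict mapping
    _id -> document, so the target sees every intermediate state
    rather than only the latest version of the document."""
    op = entry["op"]
    if op == "i":  # insert: the entry carries the whole document
        doc = dict(entry["o"])
        target[doc["_id"]] = doc
    elif op == "u":  # update: apply the recorded modifiers in order
        doc = target.get(entry["o2"]["_id"])
        if doc is None:
            return
        update = entry["o"]
        for key, value in update.get("$set", {}).items():
            doc[key] = value
        for key in update.get("$unset", {}):
            doc.pop(key, None)
    elif op == "d":  # delete: the entry identifies the document to remove
        target.pop(entry["o"]["_id"], None)
```

Replaying the three example entries above through this function yields each intermediate document in turn, and the trailing remove works because the delete entry carries the _id itself.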

Keyboard interrupt doesn't actually cause exit

A control-c should terminate the process but it does not.

2013-11-19 22:08:00,011 - INFO - Caught keyboard interrupt, exiting!
2013-11-19 22:08:00,116 - INFO - MongoConnector: Stopping all OplogThreads

And the process just hangs from there.
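A hang like this is typical when a worker thread blocks forever and never observes the stop request. One pattern that avoids it, sketched below with an illustrative class (not the actual OplogThread code): wait with a timeout on a shared event so the thread re-checks the stop flag regularly.

```python
import threading

class OplogWorker(threading.Thread):
    """Sketch of a worker that polls a stop event so shutdown can
    complete. A thread blocked indefinitely (e.g. on a tailable
    cursor) never sees the stop request; waiting with a timeout
    lets it check the flag on every iteration."""

    def __init__(self):
        super().__init__()
        self._stop_event = threading.Event()
        self.iterations = 0

    def run(self):
        while not self._stop_event.is_set():
            self.iterations += 1  # stand-in for one unit of oplog work
            # Wait with a timeout instead of blocking forever.
            self._stop_event.wait(0.05)

    def join(self, timeout=None):
        self._stop_event.set()  # request shutdown before joining
        super().join(timeout)
```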

PyMongo compatibility

To support older and newer versions of PyMongo, code that looks like this:

import pymongo
...
conn = pymongo.Connection(<options>)

should change to:

try:
    from pymongo import MongoClient as Connection
except ImportError:
    from pymongo import Connection
...
conn = Connection(<options>)

'Connector' object has no attribute 'doc_manager'

I downloaded the connector using pip; the installation passed without errors. But when I try to start the connector using this command, I get an error.

sudo python mongo_connector.py -m localhost:27017 -t localhost:9200 -d ./doc_managers/elastic_doc_manager.py

2013-03-20 11:53:41,456 - INFO - Beginning Mongo Connector
2013-03-20 11:53:42,473 - CRITICAL - MongoConnector: Bad target system URL!
2013-03-20 11:53:42,476 - INFO - MongoConnector: Empty oplog progress file.
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
    self.run()
  File "mongo_connector.py", line 210, in run
    False, self.doc_manager,
AttributeError: 'Connector' object has no attribute 'doc_manager'

Where could the problem be?

Thank you!

Solr 4.0 w/ multiple cores: target URL doesn't validate

With Solr 4.0 configured with multiple cores, the target URL specified with -t doesn't validate, e.g. -t http://localhost:8983/solr/core01. While http://localhost:8983/solr/core01 returns a 404 Not Found status, http://localhost:8983/solr/core01/update returns 200 OK.

Work-around: When uncommenting the URL validity lines in the solr_doc_manager.py the connector performs fine.

This was tested using a Jetty 8 setup, with 2 cores under Solr.
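Rather than removing the check, the validation could probe a path that actually reflects core health. Solr exposes a ping handler under /admin/ping; a sketch of such a check (the function names here are illustrative, not mongo-connector code):

```python
import urllib.error
import urllib.request

def ping_url(solr_url):
    """URL of Solr's ping handler for a given core URL."""
    return solr_url.rstrip("/") + "/admin/ping"

def core_is_reachable(solr_url, timeout=5):
    """Validate a core via its ping handler rather than the bare core
    URL, which can return 404 under a multi-core setup even when the
    core is healthy."""
    try:
        with urllib.request.urlopen(ping_url(solr_url),
                                    timeout=timeout) as resp:
            return resp.getcode() == 200
    except (urllib.error.URLError, OSError):
        return False
```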

solr not being updated

I have MongoDB and Solr both running locally. When I fire up mongo-connector, I see the connections being made to both, but any inserts/updates I make in MongoDB are not being seen by mongo-connector and not passed along to Solr. What should I be looking at?

Do the "core" and "database/collection" names need to match?

Mongo_Connector not working after replica-reconfiguration in mongodb

Hi,

I had mongo-connector working with my MongoDB configuration, but after some issues I did a replica set reconfiguration. After running rs.initiate() again and restarting mongo-connector, I am getting the following error:

ubuntu@rhino:/home/ubuntu/mongo-connector/mongo-connector# ./mongo_connector.py -m localhost:27017 -t http://127.0.0.1:8080/solr/ -d ./doc_managers/solr_doc_manager.py -n test.employees
2012-12-11 21:20:35,660 - INFO - Beginning Mongo Connector
2012-12-11 21:20:36,685 - INFO - Finished '127.0.0.1:8080//solr/update/?commit=true' (POST) with body '' in 0.011 seconds.
2012-12-11 21:20:36,690 - INFO - OplogManager: Initializing oplog thread
2012-12-11 21:20:36,691 - INFO - MongoConnector: Starting connection thread Connection('localhost', 27017)
2012-12-11 21:20:36,693 - ERROR - OplogManager: Last entry no longer in oplog cannot recover! Collection(Database(Connection('localhost', 27017), u'local'), u'oplog.rs')
2012-12-11 21:20:37,694 - ERROR - MongoConnector: OplogThread <OplogThread(Thread-3, stopped 140131090495232)> unexpectedly stopped! Shutting down.
2012-12-11 21:20:37,694 - INFO - MongoConnector: Stopping all OplogThreads
2012-12-11 21:20:37,697 - INFO - Finished '127.0.0.1:8080//solr/update/?commit=true' (POST) with body '' in 0.010 seconds.

I can see the new entries added in oplog, but mongo-connector says "Last entry no longer in oplog"

Any help recovering from this issue would be greatly appreciated!

Thanks
