anlessini's Issues

Caching Strategy

The current implementation caches the contents of a file if the file size is < 512 MB; however, we do not set a limit on the total number of bytes we keep in memory. For large indexes (e.g., msmarco-doc ~30 GB and msmarco-passage ~4 GB) we could end up holding a very large amount of data in memory, which may exceed the Lambda memory limit.

Per-container caching also means that the first call to each Lambda container will always be a cache miss, and we have very little control over how long a container lives, since Lambda containers are fully managed.

What if we bring in a centralized caching system like Redis? It would be less serverless, but I imagine we would have fewer cache misses (especially since we are dealing with immutable indexes, which in theory never expire).
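
Independently of whether we bring in Redis, we could at least cap the total number of cached bytes per container and evict least-recently-used blobs once the cap is exceeded. A minimal sketch of such a byte-bounded cache (the class and its names are hypothetical, not the actual S3Directory internals):

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Minimal sketch of a byte-bounded LRU cache, assuming whole file blobs are
 * cached and keyed by file name. Names are illustrative only.
 */
public class BoundedBlobCache {
  private final long maxTotalBytes;  // e.g., some fraction of the Lambda memory limit
  private long currentBytes = 0;

  // access-order LinkedHashMap keeps least-recently-used entries at the head
  private final LinkedHashMap<String, byte[]> cache = new LinkedHashMap<>(16, 0.75f, true);

  public BoundedBlobCache(long maxTotalBytes) {
    this.maxTotalBytes = maxTotalBytes;
  }

  public synchronized byte[] get(String fileName) {
    return cache.get(fileName);
  }

  public synchronized void put(String fileName, byte[] blob) {
    if (blob.length > maxTotalBytes) {
      return;  // never cache a blob larger than the whole budget
    }
    byte[] previous = cache.put(fileName, blob);
    currentBytes += blob.length - (previous == null ? 0 : previous.length);
    // evict least-recently-used blobs until we are back under the byte budget
    Iterator<Map.Entry<String, byte[]>> it = cache.entrySet().iterator();
    while (currentBytes > maxTotalBytes && it.hasNext()) {
      Map.Entry<String, byte[]> eldest = it.next();
      if (eldest.getKey().equals(fileName)) {
        continue;  // keep the blob we just inserted
      }
      currentBytes -= eldest.getValue().length;
      it.remove();
    }
  }
}
```

The byte budget would presumably be set to some fraction of the function's configured memory, leaving headroom for the searcher itself.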

Replicate serverless MS MARCO passage

Let's see if we can replicate serverless MS MARCO passage. The tricky bit here is ingesting the documents into DynamoDB in a reasonable amount of time and in a robust manner...

DynamoDB's primary partition/sort key

Currently, we map a document's id (not to be confused with the numerical docid used for document lookup in the index) to DynamoDB's primary partition/sort key for lookups.

The problem is that for some corpora, such as CORD-19, the document id is not unique: https://github.com/allenai/cord19#why-can-the-same-cord_uid-appear-in-multiple-rows. In the case of CORD-19, when two documents have the same cord_uid, they refer to the same paper but were possibly submitted by different sources.

Logically speaking, this should not make much of a difference for the Lambda logic, but ImportCollection needs to be careful about duplicate ids, since BatchWriteItem will fail if a batch contains two items with the same partition/sort key.
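
One low-effort guard, assuming we are fine with last-write-wins semantics for duplicate cord_uids, is to deduplicate on the partition key within each batch before calling batchWriteItem. A rough sketch with the SDK's document API (table and attribute names are placeholders, not necessarily what ImportCollection uses today):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import com.amazonaws.services.dynamodbv2.document.DynamoDB;
import com.amazonaws.services.dynamodbv2.document.Item;
import com.amazonaws.services.dynamodbv2.document.TableWriteItems;

public class BatchDedupSketch {
  // Deduplicate items on the partition key so a single BatchWriteItem request
  // never contains two items with the same key (which DynamoDB rejects).
  static TableWriteItems dedupBatch(String tableName, String keyAttr, List<Item> batch) {
    Map<String, Item> unique = new LinkedHashMap<>();
    for (Item item : batch) {
      unique.put(item.getString(keyAttr), item);  // last occurrence wins
    }
    return new TableWriteItems(tableName)
        .withItemsToPut(unique.values().toArray(new Item[0]));
  }

  static void writeBatch(DynamoDB dynamoDB, String tableName, List<Item> batch) {
    // "id" is a placeholder for whatever attribute is mapped to the partition key;
    // unprocessed items returned in the outcome would still need to be retried.
    dynamoDB.batchWriteItem(dedupBatch(tableName, "id", batch));
  }
}
```

Note that this only resolves collisions within a single batch; duplicates landing in different batches would silently overwrite each other, so if we actually want to keep every version we would need a disambiguating sort key instead.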

Fields to store in DynamoDB

Currently, the experiment has been completed for the AclAnthology collection, and we are storing more than just docid and contents in DynamoDB:

[screenshot: DynamoDB item attributes stored for the AclAnthology collection]

Are these fields arbitrarily chosen, or are they tied to the frontend logic? @shaneding It looks like these fields are returned directly to the frontend as JSON strings: https://github.com/castorini/anlessini/blob/master/SearchLambdaFunction/src/main/java/io/anlessini/SearchLambda.java#L112

I am not very comfortable with coupling frontend logic to the backend storage structure directly. For instance, what happens if we decide to add another field in the UI? In the current implementation, that would require a database migration.

Alternatively, what if we just store the raw contents and the ids in DynamoDB, and query for any additional fields from the index? The frontend can specify the fields it wants, and the Lambda can answer by doing searcher.doc(docid).getFields(fieldnames). The performance implications of such a strategy have yet to be examined.
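
For reference, the index-side lookup could look roughly like the following with Lucene's stored-fields API (a sketch, not the current SearchLambda code; the requested field names would come from the frontend request):

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexableField;
import org.apache.lucene.search.IndexSearcher;

public class FieldLookupSketch {
  // Pull the requested stored fields for a hit straight from the Lucene index,
  // so DynamoDB only needs to hold the id and raw contents.
  static Map<String, String> getFields(IndexSearcher searcher, int docid, String[] fieldNames)
      throws IOException {
    Document doc = searcher.doc(docid);           // load the stored document
    Map<String, String> result = new HashMap<>();
    for (String name : fieldNames) {
      IndexableField field = doc.getField(name);  // null if the field is not stored
      if (field != null) {
        result.put(name, field.stringValue());
      }
    }
    return result;
  }
}
```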

Not all bytes were read from the S3ObjectInputStream

S3 complained about S3Directory attempting to retrieve a binary blob that is too large and returned a partial byte stream.

S3Directory and S3IndexInput might need to consider using ranged GETs, as recommended by S3 (see the sketch after the log below): https://docs.aws.amazon.com/AmazonS3/latest/dev/GettingObjectsUsingAPIs.html

io.anlessini.store.S3Directory - Opened S3Directory under adamyy-anlessini/lucene-index-cord19-abstract-2020-07-16
08:50:44.596 [main] INFO  io.anlessini.store.S3Directory - [openInput] segments_1, size = 137 - caching!
08:50:44.711 [main] INFO  io.anlessini.store.S3Directory - [openInput] _0.si, size = 535 - caching!
08:50:44.766 [main] INFO  io.anlessini.store.S3Directory - [openInput] _0.fnm, size = 1848 - caching!
08:50:44.859 [main] INFO  io.anlessini.store.S3Directory - [openInput] _0_Lucene50_0.doc, size = 57259531 - caching!
08:50:46.082 [main] INFO  io.anlessini.store.S3Directory - [openInput] _0_Lucene50_0.pos, size = 55447893 - caching!
08:50:47.283 [main] INFO  io.anlessini.store.S3Directory - [openInput] _0_Lucene50_0.tim, size = 23942714 - caching!
08:50:47.609 [main] INFO  io.anlessini.store.S3Directory - [openInput] _0_Lucene50_0.tip, size = 468494 - caching!
08:50:47.736 [main] INFO  io.anlessini.store.S3Directory - [openInput] _0.nvm, size = 283 - caching!
08:50:47.763 [main] INFO  io.anlessini.store.S3Directory - [openInput] _0.nvd, size = 1154813 - caching!
08:50:47.898 [main] INFO  io.anlessini.store.S3Directory - [openInput] _0.fdx, size = 347835 - caching!
08:50:48.074 [main] INFO  io.anlessini.store.S3Directory - [openInput] _0.fdt, size = 2551741000 - not caching!
[main] WARN com.amazonaws.services.s3.internal.S3AbortableInputStream - Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
[main] WARN com.amazonaws.services.s3.internal.S3AbortableInputStream - Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
misplaced codec footer (file extended?): remaining=4294967312, expected=16, fp=-1743226312 (resource=_0.fdt): org.apache.lucene.index.CorruptIndexException
org.apache.lucene.index.CorruptIndexException: misplaced codec footer (file extended?): remaining=4294967312, expected=16, fp=-1743226312 (resource=_0.fdt)
	at org.apache.lucene.codecs.CodecUtil.validateFooter(CodecUtil.java:497)
	at org.apache.lucene.codecs.CodecUtil.retrieveChecksum(CodecUtil.java:487)
	at org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader.<init>(CompressingStoredFieldsReader.java:175)
	at org.apache.lucene.codecs.compressing.CompressingStoredFieldsFormat.fieldsReader(CompressingStoredFieldsFormat.java:121)
	at org.apache.lucene.codecs.lucene50.Lucene50StoredFieldsFormat.fieldsReader(Lucene50StoredFieldsFormat.java:173)
	at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:127)
	at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:84)
	at org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:69)
	at org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:61)
	at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:680)
	at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:84)
	at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:76)
	at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:64)
	at io.anlessini.SearchLambda.<init>(SearchLambda.java:49)
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
	at java.base/java.lang.reflect.Constructor.newInstance(Unknown Source)
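
The warning and the CorruptIndexException above both stem from pulling the ~2.5 GB _0.fdt through a single GetObject call. A sketch of what a ranged read could look like with the v1 SDK (method and parameter names here are illustrative, not the existing S3IndexInput API):

```java
import java.io.IOException;
import java.io.InputStream;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3Object;
import com.amazonaws.util.IOUtils;

public class RangedGetSketch {
  // Fetch only [offset, offset + length) of an S3 object instead of the whole
  // blob, which avoids aborted connections and lets a large .fdt file be read
  // in chunks as the reader seeks.
  static byte[] readRange(AmazonS3 s3, String bucket, String key, long offset, int length)
      throws IOException {
    GetObjectRequest request = new GetObjectRequest(bucket, key)
        .withRange(offset, offset + length - 1);  // the range is inclusive on both ends
    try (S3Object object = s3.getObject(request);
         InputStream in = object.getObjectContent()) {
      return IOUtils.toByteArray(in);             // drains the stream, so no warning on close
    }
  }
}
```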

DynamoDB's maximum item size limit

DynamoDB imposes a strict item size limit of 400 KB, which includes both attribute names and values. This might be something worth looking out for.

https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Limits.html#limits-items

I actually ran into this error when trying to import the 2020-09-25 CORD-19 collection.

com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException: Item size has exceeded the maximum allowed size (Service: AmazonDynamoDBv2; Status Code: 400; Error Code: ValidationException; Request ID: TMITLUJHMUQQMO31D0G4FNR4NRVV4KQNSO5AEMVJF66Q9ASUAAJG; Proxy: null)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1811) ~[aws-java-sdk-core-1.11.848.jar:?]
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleServiceErrorResponse(AmazonHttpClient.java:1395) ~[aws-java-sdk-core-1.11.848.jar:?]
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1371) ~[aws-java-sdk-core-1.11.848.jar:?]
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1145) ~[aws-java-sdk-core-1.11.848.jar:?]
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:802) ~[aws-java-sdk-core-1.11.848.jar:?]
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:770) ~[aws-java-sdk-core-1.11.848.jar:?]
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:744) ~[aws-java-sdk-core-1.11.848.jar:?]
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:704) ~[aws-java-sdk-core-1.11.848.jar:?]
        at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:686) ~[aws-java-sdk-core-1.11.848.jar:?]
        at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:550) ~[aws-java-sdk-core-1.11.848.jar:?]
        at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:530) ~[aws-java-sdk-core-1.11.848.jar:?]
        at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.doInvoke(AmazonDynamoDBClient.java:5136) ~[aws-java-sdk-dynamodb-1.11.848.jar:?]
        at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.invoke(AmazonDynamoDBClient.java:5103) ~[aws-java-sdk-dynamodb-1.11.848.jar:?]
        at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.executeBatchWriteItem(AmazonDynamoDBClient.java:691) ~[aws-java-sdk-dynamodb-1.11.848.jar:?]
        at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.batchWriteItem(AmazonDynamoDBClient.java:657) ~[aws-java-sdk-dynamodb-1.11.848.jar:?]
        at com.amazonaws.services.dynamodbv2.document.internal.BatchWriteItemImpl.doBatchWriteItem(BatchWriteItemImpl.java:113) ~[aws-java-sdk-dynamodb-1.11.848.jar:?]
        at com.amazonaws.services.dynamodbv2.document.internal.BatchWriteItemImpl.batchWriteItem(BatchWriteItemImpl.java:53) ~[aws-java-sdk-dynamodb-1.11.848.jar:?]
        at com.amazonaws.services.dynamodbv2.document.DynamoDB.batchWriteItem(DynamoDB.java:178) ~[aws-java-sdk-dynamodb-1.11.848.jar:?]
        at io.anlessini.utils.ImportCollection$ImporterThread.sendBatchRequest(ImportCollection.java:296) ~[utils-0.1.0-SNAPSHOT.jar:?]
        at io.anlessini.utils.ImportCollection$ImporterThread.run(ImportCollection.java:247) [utils-0.1.0-SNAPSHOT.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:834) [?:?]

Unfortunately, AWS's recommendation for circumventing the large-item issue is to store the values/items in S3 and keep only the object identifier in DynamoDB.

Arguably, the largest field in all of the collections would be raw (more investigation required). Perhaps it would alleviate the issue if we stored the document's raw field in S3 (keyed by the docid) and stored the rest in DynamoDB.
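
A possible shape for that split, sketched with the v1 SDKs (bucket, table, and attribute names are placeholders):

```java
import java.nio.charset.StandardCharsets;

import com.amazonaws.services.dynamodbv2.document.DynamoDB;
import com.amazonaws.services.dynamodbv2.document.Item;
import com.amazonaws.services.dynamodbv2.document.Table;
import com.amazonaws.services.s3.AmazonS3;

public class SplitStoreSketch {
  // Keep the potentially huge "raw" field in S3 (keyed by docid) and store only
  // the small fields plus a pointer in DynamoDB, staying under the 400 KB item limit.
  static void storeDocument(AmazonS3 s3, DynamoDB dynamoDB,
                            String bucket, String tableName,
                            String docid, String contents, String raw) {
    String rawKey = "raw/" + docid;
    s3.putObject(bucket, rawKey, raw);            // the raw field lives in S3

    Item item = new Item()
        .withPrimaryKey("id", docid)
        .withString("contents", contents)
        .withString("raw_s3_key", rawKey);        // pointer back to the S3 object
    // Rough guard only: DynamoDB's own accounting (attribute names + values)
    // differs slightly from the serialized JSON length.
    if (item.toJSON().getBytes(StandardCharsets.UTF_8).length > 400 * 1024) {
      throw new IllegalStateException("Item still too large for DynamoDB: " + docid);
    }
    Table table = dynamoDB.getTable(tableName);
    table.putItem(item);
  }
}
```

The Lambda would then fetch raw from S3 only when the frontend actually asks for it.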
