anlessini's Issues

Caching Strategy

The current implementation caches the contents of a file if the file size is < 512 MB; however, we do not set a limit on the total number of bytes we keep in memory. For large indexes (e.g., msmarco-doc ~30 GB and msmarco-passage ~4 GB) we could end up holding a very large amount of data in memory, which may exceed the Lambda memory limit.

Per-container caching also means that the first call to each Lambda container will always be a cache miss, and we have very little control over how long a container lives, since Lambda containers are fully managed.

What if we bring in a centralized caching system like Redis? It would be less serverless, but I imagine we would have fewer cache misses (especially since we are dealing with immutable indexes, which in theory never expire).
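
Independently of whether we bring in Redis, we could at least cap the total number of cached bytes per container and evict least-recently-used blobs once the cap is exceeded. A minimal sketch of such a byte-bounded cache (the class and its names are hypothetical, not the actual S3Directory internals):

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Minimal sketch of a byte-bounded LRU cache, assuming whole file blobs are
 * cached and keyed by file name. Names are illustrative only.
 */
public class BoundedBlobCache {
  private final long maxTotalBytes;  // e.g., some fraction of the Lambda memory limit
  private long currentBytes = 0;

  // access-order LinkedHashMap keeps least-recently-used entries at the head
  private final LinkedHashMap<String, byte[]> cache = new LinkedHashMap<>(16, 0.75f, true);

  public BoundedBlobCache(long maxTotalBytes) {
    this.maxTotalBytes = maxTotalBytes;
  }

  public synchronized byte[] get(String fileName) {
    return cache.get(fileName);
  }

  public synchronized void put(String fileName, byte[] blob) {
    if (blob.length > maxTotalBytes) {
      return;  // never cache a blob larger than the whole budget
    }
    byte[] previous = cache.put(fileName, blob);
    currentBytes += blob.length - (previous == null ? 0 : previous.length);
    // evict least-recently-used blobs until we are back under the byte budget
    Iterator<Map.Entry<String, byte[]>> it = cache.entrySet().iterator();
    while (currentBytes > maxTotalBytes && it.hasNext()) {
      Map.Entry<String, byte[]> eldest = it.next();
      if (eldest.getKey().equals(fileName)) {
        continue;  // keep the blob we just inserted
      }
      currentBytes -= eldest.getValue().length;
      it.remove();
    }
  }
}
```

The byte budget would presumably be set to some fraction of the function's configured memory, leaving headroom for the searcher itself.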

Replicate serverless MS MARCO passage

Let's see if we can replicate serverless MS MARCO passage. The tricky bit here is ingesting the documents into DynamoDB in a reasonable amount of time and in a robust manner...

DynamoDB's primary partition/sort key

Currently, we map a document's id (not to be confused with the numerical docid used for document lookup in the index) to DynamoDB's primary partition/sort key for lookups.

The problem is that for some corpora, such as CORD-19, the document id is not unique: https://github.com/allenai/cord19#why-can-the-same-cord_uid-appear-in-multiple-rows. In the case of CORD-19, when two documents have the same cord_uid, they refer to the same paper but were possibly submitted by different sources.

Logically speaking, this should not make much of a difference for the Lambda logic, but ImportCollection needs to be careful about duplicate ids, since BatchWriteItem will fail if a batch contains two items with the same partition/sort key.
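
One low-effort guard, assuming we are fine with last-write-wins semantics for duplicate cord_uids, is to deduplicate on the partition key within each batch before calling batchWriteItem. A rough sketch with the SDK's document API (table and attribute names are placeholders, not necessarily what ImportCollection uses today):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import com.amazonaws.services.dynamodbv2.document.DynamoDB;
import com.amazonaws.services.dynamodbv2.document.Item;
import com.amazonaws.services.dynamodbv2.document.TableWriteItems;

public class BatchDedupSketch {
  // Deduplicate items on the partition key so a single BatchWriteItem request
  // never contains two items with the same key (which DynamoDB rejects).
  static TableWriteItems dedupBatch(String tableName, String keyAttr, List<Item> batch) {
    Map<String, Item> unique = new LinkedHashMap<>();
    for (Item item : batch) {
      unique.put(item.getString(keyAttr), item);  // last occurrence wins
    }
    return new TableWriteItems(tableName)
        .withItemsToPut(unique.values().toArray(new Item[0]));
  }

  static void writeBatch(DynamoDB dynamoDB, String tableName, List<Item> batch) {
    // "id" is a placeholder for whatever attribute is mapped to the partition key;
    // unprocessed items returned in the outcome would still need to be retried.
    dynamoDB.batchWriteItem(dedupBatch(tableName, "id", batch));
  }
}
```

Note that this only resolves collisions within a single batch; duplicates landing in different batches would silently overwrite each other, so if we actually want to keep every version we would need a disambiguating sort key instead.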

Fields to store in DynamoDB

Currently, the experiment has been completed for the AclAnthology collection, and we are storing more than just docid and contents in DynamoDB:

[screenshot: DynamoDB item attributes stored for the AclAnthology collection]

Are these fields arbitrarily chosen, or are they tied to the frontend logic? @shaneding It looks like these fields are returned directly to the frontend as JSON strings: https://github.com/castorini/anlessini/blob/master/SearchLambdaFunction/src/main/java/io/anlessini/SearchLambda.java#L112

I am not very comfortable with coupling frontend logic to the backend storage structure directly. For instance, what happens if we decide to add another field in the UI? In the current implementation, that would require a database migration.

Alternatively, what if we just store the raw contents and the ids in DynamoDB, and query for any additional fields from the index? The frontend can specify the fields it wants, and the Lambda can answer by doing searcher.doc(docid).getFields(fieldnames). The performance implications of such a strategy have yet to be examined.
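
For reference, the index-side lookup could look roughly like the following with Lucene's stored-fields API (a sketch, not the current SearchLambda code; the requested field names would come from the frontend request):

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexableField;
import org.apache.lucene.search.IndexSearcher;

public class FieldLookupSketch {
  // Pull the requested stored fields for a hit straight from the Lucene index,
  // so DynamoDB only needs to hold the id and raw contents.
  static Map<String, String> getFields(IndexSearcher searcher, int docid, String[] fieldNames)
      throws IOException {
    Document doc = searcher.doc(docid);           // load the stored document
    Map<String, String> result = new HashMap<>();
    for (String name : fieldNames) {
      IndexableField field = doc.getField(name);  // null if the field is not stored
      if (field != null) {
        result.put(name, field.stringValue());
      }
    }
    return result;
  }
}
```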

Not all bytes were read from the S3ObjectInputStream

S3 complained about S3Directory attempting to retrieve a binary blob that is too large and returned a partial byte stream.

S3Directory and S3IndexInput might need to consider using ranged GETs, as recommended by S3 (see the sketch after the log below): https://docs.aws.amazon.com/AmazonS3/latest/dev/GettingObjectsUsingAPIs.html

io.anlessini.store.S3Directory - Opened S3Directory under adamyy-anlessini/lucene-index-cord19-abstract-2020-07-16
08:50:44.596 [main] INFO  io.anlessini.store.S3Directory - [openInput] segments_1, size = 137 - caching!
08:50:44.711 [main] INFO  io.anlessini.store.S3Directory - [openInput] _0.si, size = 535 - caching!
08:50:44.766 [main] INFO  io.anlessini.store.S3Directory - [openInput] _0.fnm, size = 1848 - caching!
08:50:44.859 [main] INFO  io.anlessini.store.S3Directory - [openInput] _0_Lucene50_0.doc, size = 57259531 - caching!
08:50:46.082 [main] INFO  io.anlessini.store.S3Directory - [openInput] _0_Lucene50_0.pos, size = 55447893 - caching!
08:50:47.283 [main] INFO  io.anlessini.store.S3Directory - [openInput] _0_Lucene50_0.tim, size = 23942714 - caching!
08:50:47.609 [main] INFO  io.anlessini.store.S3Directory - [openInput] _0_Lucene50_0.tip, size = 468494 - caching!
08:50:47.736 [main] INFO  io.anlessini.store.S3Directory - [openInput] _0.nvm, size = 283 - caching!
08:50:47.763 [main] INFO  io.anlessini.store.S3Directory - [openInput] _0.nvd, size = 1154813 - caching!
08:50:47.898 [main] INFO  io.anlessini.store.S3Directory - [openInput] _0.fdx, size = 347835 - caching!
08:50:48.074 [main] INFO  io.anlessini.store.S3Directory - [openInput] _0.fdt, size = 2551741000 - not caching!
[main] WARN com.amazonaws.services.s3.internal.S3AbortableInputStream - Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
[main] WARN com.amazonaws.services.s3.internal.S3AbortableInputStream - Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.
misplaced codec footer (file extended?): remaining=4294967312, expected=16, fp=-1743226312 (resource=_0.fdt): org.apache.lucene.index.CorruptIndexException
org.apache.lucene.index.CorruptIndexException: misplaced codec footer (file extended?): remaining=4294967312, expected=16, fp=-1743226312 (resource=_0.fdt)
	at org.apache.lucene.codecs.CodecUtil.validateFooter(CodecUtil.java:497)
	at org.apache.lucene.codecs.CodecUtil.retrieveChecksum(CodecUtil.java:487)
	at org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader.<init>(CompressingStoredFieldsReader.java:175)
	at org.apache.lucene.codecs.compressing.CompressingStoredFieldsFormat.fieldsReader(CompressingStoredFieldsFormat.java:121)
	at org.apache.lucene.codecs.lucene50.Lucene50StoredFieldsFormat.fieldsReader(Lucene50StoredFieldsFormat.java:173)
	at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:127)
	at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:84)
	at org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:69)
	at org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:61)
	at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:680)
	at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:84)
	at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:76)
	at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:64)
	at io.anlessini.SearchLambda.<init>(SearchLambda.java:49)
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
	at java.base/java.lang.reflect.Constructor.newInstance(Unknown Source)
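
The warning and the CorruptIndexException above both stem from pulling the ~2.5 GB _0.fdt through a single GetObject call. A sketch of what a ranged read could look like with the v1 SDK (method and parameter names here are illustrative, not the existing S3IndexInput API):

```java
import java.io.IOException;
import java.io.InputStream;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3Object;
import com.amazonaws.util.IOUtils;

public class RangedGetSketch {
  // Fetch only [offset, offset + length) of an S3 object instead of the whole
  // blob, which avoids aborted connections and lets a large .fdt file be read
  // in chunks as the reader seeks.
  static byte[] readRange(AmazonS3 s3, String bucket, String key, long offset, int length)
      throws IOException {
    GetObjectRequest request = new GetObjectRequest(bucket, key)
        .withRange(offset, offset + length - 1);  // the range is inclusive on both ends
    try (S3Object object = s3.getObject(request);
         InputStream in = object.getObjectContent()) {
      return IOUtils.toByteArray(in);             // drains the stream, so no warning on close
    }
  }
}
```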

DynamoDB's maximum item size limit

DynamoDB imposes a strict item size limit of 400 KB, which includes both attribute names and values. This might be something worth looking out for.

https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Limits.html#limits-items

I actually ran into this error when trying to import the 2020-09-25 CORD-19 collection.

com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException: Item size has exceeded the maximum allowed size (Service: AmazonDynamoDBv2; Status Code: 400; Error Code: ValidationException; Request ID: TMITLUJHMUQQMO31D0G4FNR4NRVV4KQNSO5AEMVJF66Q9ASUAAJG; Proxy: null)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1811) ~[aws-java-sdk-core-1.11.848.jar:?]
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleServiceErrorResponse(AmazonHttpClient.java:1395) ~[aws-java-sdk-core-1.11.848.jar:?]
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1371) ~[aws-java-sdk-core-1.11.848.jar:?]
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1145) ~[aws-java-sdk-core-1.11.848.jar:?]
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:802) ~[aws-java-sdk-core-1.11.848.jar:?]
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:770) ~[aws-java-sdk-core-1.11.848.jar:?]
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:744) ~[aws-java-sdk-core-1.11.848.jar:?]
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:704) ~[aws-java-sdk-core-1.11.848.jar:?]
        at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:686) ~[aws-java-sdk-core-1.11.848.jar:?]
        at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:550) ~[aws-java-sdk-core-1.11.848.jar:?]
        at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:530) ~[aws-java-sdk-core-1.11.848.jar:?]
        at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.doInvoke(AmazonDynamoDBClient.java:5136) ~[aws-java-sdk-dynamodb-1.11.848.jar:?]
        at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.invoke(AmazonDynamoDBClient.java:5103) ~[aws-java-sdk-dynamodb-1.11.848.jar:?]
        at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.executeBatchWriteItem(AmazonDynamoDBClient.java:691) ~[aws-java-sdk-dynamodb-1.11.848.jar:?]
        at com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.batchWriteItem(AmazonDynamoDBClient.java:657) ~[aws-java-sdk-dynamodb-1.11.848.jar:?]
        at com.amazonaws.services.dynamodbv2.document.internal.BatchWriteItemImpl.doBatchWriteItem(BatchWriteItemImpl.java:113) ~[aws-java-sdk-dynamodb-1.11.848.jar:?]
        at com.amazonaws.services.dynamodbv2.document.internal.BatchWriteItemImpl.batchWriteItem(BatchWriteItemImpl.java:53) ~[aws-java-sdk-dynamodb-1.11.848.jar:?]
        at com.amazonaws.services.dynamodbv2.document.DynamoDB.batchWriteItem(DynamoDB.java:178) ~[aws-java-sdk-dynamodb-1.11.848.jar:?]
        at io.anlessini.utils.ImportCollection$ImporterThread.sendBatchRequest(ImportCollection.java:296) ~[utils-0.1.0-SNAPSHOT.jar:?]
        at io.anlessini.utils.ImportCollection$ImporterThread.run(ImportCollection.java:247) [utils-0.1.0-SNAPSHOT.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:834) [?:?]

Unfortunately, AWS's recommendation for circumventing the large-item issue is to store the values/items in S3 and keep only the object identifier in DynamoDB.

Arguably, the largest field in all of the collections would be raw (more investigation required). Perhaps it would alleviate the issue if we stored the document's raw field in S3 (keyed by the docid) and stored the rest in DynamoDB.
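
A possible shape for that split, sketched with the v1 SDKs (bucket, table, and attribute names are placeholders):

```java
import java.nio.charset.StandardCharsets;

import com.amazonaws.services.dynamodbv2.document.DynamoDB;
import com.amazonaws.services.dynamodbv2.document.Item;
import com.amazonaws.services.dynamodbv2.document.Table;
import com.amazonaws.services.s3.AmazonS3;

public class SplitStoreSketch {
  // Keep the potentially huge "raw" field in S3 (keyed by docid) and store only
  // the small fields plus a pointer in DynamoDB, staying under the 400 KB item limit.
  static void storeDocument(AmazonS3 s3, DynamoDB dynamoDB,
                            String bucket, String tableName,
                            String docid, String contents, String raw) {
    String rawKey = "raw/" + docid;
    s3.putObject(bucket, rawKey, raw);            // the raw field lives in S3

    Item item = new Item()
        .withPrimaryKey("id", docid)
        .withString("contents", contents)
        .withString("raw_s3_key", rawKey);        // pointer back to the S3 object
    // Rough guard only: DynamoDB's own accounting (attribute names + values)
    // differs slightly from the serialized JSON length.
    if (item.toJSON().getBytes(StandardCharsets.UTF_8).length > 400 * 1024) {
      throw new IllegalStateException("Item still too large for DynamoDB: " + docid);
    }
    Table table = dynamoDB.getTable(tableName);
    table.putItem(item);
  }
}
```

The Lambda would then fetch raw from S3 only when the frontend actually asks for it.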
