freelawproject / eyecite
Find legal citations in any block of text
Home Page: https://freelawproject.github.io/eyecite/
License: BSD 2-Clause "Simplified" License
Re: openjournals/joss-reviews#3617
Once or twice I had the feeling that the paper switches between topics or mixes things. As an example, under 'statement of need' it switches from explaining the need directly to explaining how eyecite itself works.
Maybe this could be its own section?
According to the grand daddy of parentheticals, apparently CA uses square brackets instead of parentheses:
https://twitter.com/tweetatpablo/status/1392920773824630784
I guess this means our regex needs a tweak, but it'd be nice to see this in the wild before we permanently add code based on a tweet.
And...I guess this means parentheticals should be called something else. "ParenBracketTheticals," perhaps.
What do we think is the right thing to do with text like this?
In [2]: get_citations("foo1 U.S. 1bar")
Out[2]: [FullCaseCitation(token=CitationToken(data='1 U.S. 1bar', start=3, end=14, volume='1', reporter='U.S.', page='1bar'...
We're currently permissive about finding cites with abutting characters, which I can imagine being overinclusive with reporter strings like "foo1 Or 2bar" or "foo1 P 2bar" or "foo1 mt 2bar" etc.
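To make the tradeoff concrete, here's a toy comparison (simplified, hypothetical patterns, not eyecite's actual regexes):

```python
import re

# Hypothetical simplified cite pattern (NOT eyecite's actual regex).
# The permissive version tolerates abutting characters and captures
# letter-suffixed pages via \d{1,6}[-]?[a-zA-Z]{0,6}.
permissive = re.compile(r"(\d{1,6}) (U\.S\.) (\d{1,6}[-]?[a-zA-Z]{0,6})")

# A stricter version requiring (^| ) before the volume and ($|[ ,.;])
# after the page.
strict = re.compile(r"(?:^| )(\d{1,6}) (U\.S\.) (\d{1,6})(?=$|[ ,.;])")

text = "foo1 U.S. 1bar"
print(permissive.search(text).group(3))  # '1bar' -- today's behavior
print(strict.search(text))               # None -- boundaries fail
print(strict.search("foo 1 U.S. 1 bar").groups())  # ('1', 'U.S.', '1')
```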
I guess this breaks down into a few separate things:
Should we require a leading boundary like (^| ) before the cite? I'm thinking probably yes.
Should we require a trailing boundary like ($|[ ,.;]) after it, and a few more along those lines? Less sure about that one.
Should the page be captured as page='1'? We currently capture 1bar because of the page number regex \d{1,6}[-]?[a-zA-Z]{1,6}, which is there to handle a special Connecticut or Illinois number, e.g., "13301-M". So we could add that regex specifically to CT and IL in reporters-db and avoid capturing random cruft in other reporters. I'm leaning toward that being a good idea.
I would like to use the handy object-creation factories that @jcushman created (https://github.com/freelawproject/eyecite/blob/master/tests/factories.py) to streamline CL's tests as well.
However, we (rightly imo) exclude the tests directory when we package eyecite for distribution (https://github.com/freelawproject/eyecite/blob/master/setup.py#L36).
I propose either:
moving the factories.py file into the main eyecite directory; or …
Thoughts?
RE: openjournals/joss-reviews#3617
I'd like to see both usage examples in the tutorial and tests that use longer text documents -- full cases or other documents similar to what you expect to be parsed. The examples and the tests seem to use single-line snippets with few exceptions, which I don't think reflect actual expected usage.
Longer examples would help people see how to use the software. Would also be nice to have some example texts that people can run to get started with the software.
Longer text for the tests could reveal issues that other tests don't -- for example, what is the full expected set of citations out of a document? I think the software also does some co-referencing, yes? So it would be good to also see tests for that in some of the more complicated cases.
Back in 2014, in freelawproject/courtlistener#1601, @brianwc lamented that we don't handle partial citations very well. His example was:
The Supreme Court often cites its recent cases that don't yet have a full U.S. citation like so:
Bullcoming v. New Mexico, 564 U. S. ___ (2011).
We have Bullcoming v. New Mexico, know it is a SCOTUS case, and know it was filed in 2011.
Alas, we never got the citator working on these. I'm not sure it's worth the effort to do so now, but I thought I'd file an issue so we could track it as a gap in functionality.
Here's a gist: https://gist.github.com/mattdahl/21080e7aafd0a28c3ef0ebf9d9d13b0c
In short, when use_dmp=True, the annotation is inserted at the wrong location. When use_dmp=False, the location is correct, but it takes about 30 seconds to compute on my machine.
Sorry for the spam today, but this one seems pretty important. While rebasing my code on master to incorporate the #64 refactor, I noticed my tests that should be failing were suddenly passing.
The problem (introduced in #64), it would seem, is forgetting to return a value from get_comparison_attrs, resulting in the test runner asserting None == None repeatedly and passing. To illustrate this, here's a commit that adds an obviously wrong test that should fail, but instead passes all CI checks: lexeme-dev@bcf6367
And here's a commit that comments out that intentionally broken test and fixes get_comparison_attrs so that assertions are actually made: lexeme-dev@f0bc7fe
Fixing the method results in a significant number of test failures, most of which seem to be related to an index mismatch.
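For anyone skimming, the failure mode looks roughly like this (hypothetical helper, not the actual eyecite code):

```python
# Hypothetical illustration of the bug class: a comparison helper that
# forgets its return statement makes every equality assertion compare
# None == None, which always passes.

def get_attrs_broken(cite):
    attrs = (cite["volume"], cite["page"])
    # BUG: missing `return attrs` -- the function implicitly returns None

def get_attrs_fixed(cite):
    return (cite["volume"], cite["page"])

a = {"volume": "1", "page": "1"}
b = {"volume": "2", "page": "99"}

# Vacuously true, even though the cites clearly differ:
assert get_attrs_broken(a) == get_attrs_broken(b)
# The fixed helper actually detects the mismatch:
assert get_attrs_fixed(a) != get_attrs_fixed(b)
```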
For a cite like "1 U.S. 1, 2 S. Ct. 2 (1999) (overruling ...)" we extract "1 U.S. 1" and "2 S. Ct. 2" as separate cites that both have the parenthetical "overruling ...". If you later report the parentheticals somehow you double up, or if you use a resolver that knows those are the same case, you double-count the weight of that citation. It would be good if we detected this and linked the two cites as parallel to each other, so the weight and parenthetical could only be counted once.
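A rough sketch of the linking heuristic (toy dicts standing in for eyecite's citation objects, which differ):

```python
# Toy model: each extracted cite carries its character span and any
# trailing parenthetical. Treat adjacent cites separated only by a
# comma as parallel, so shared parentheticals count once.
text = "1 U.S. 1, 2 S. Ct. 2 (1999) (overruling ...)"
cites = [
    {"cite": "1 U.S. 1", "span": (0, 8), "parenthetical": "overruling ..."},
    {"cite": "2 S. Ct. 2", "span": (10, 20), "parenthetical": "overruling ..."},
]

def link_parallel(cites, text):
    groups = []
    for cite in cites:  # assumes cites are in document order
        if groups:
            prev_end = groups[-1][-1]["span"][1]
            between = text[prev_end:cite["span"][0]]
            if between.strip() == ",":  # only a comma separates the spans
                groups[-1].append(cite)
                continue
        groups.append([cite])
    return groups

print(len(link_parallel(cites, text)))  # 1: both cites land in one group
```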
Do you all have any insight about cleaning text for non-ascii characters? We have two parts of this in play for CAP:
Quotes and dashes (and maybe others?) can come in as curly quotes or mdashes or whatever. Some set of replacements should probably be made on our text, like ‘ ’ ´ “ ” – -> ' ' ' " " -; I don't know if there's a good complete list. This one probably applies to most text.
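A sketch of that replacement map (just the characters listed above, so certainly incomplete) using str.translate:

```python
# Partial smart-quote/dash replacement map -- only the characters
# mentioned above; a complete list would need more research.
SMART_CHARS = str.maketrans({
    "\u2018": "'", "\u2019": "'", "\u00b4": "'",  # curly/acute single quotes
    "\u201c": '"', "\u201d": '"',                 # curly double quotes
    "\u2013": "-",                                # en dash
})

print("“Foo” v. ‘Bar’ – 1 U.S. 1".translate(SMART_CHARS))
# "Foo" v. 'Bar' - 1 U.S. 1
```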
OCR'd cites can come in with accents and umlauts and such, so for OCR'd English text we probably want to replace é and ü and so on with English-language ascii lookalikes. This might be less generally applicable.
I'm thinking of throwing everything through https://pypi.org/project/Unidecode/, which I think will do both of those things:
from unidecode import unidecode
print(unidecode('‘’´“”–éü'))
# '''""-eu
I haven't measured performance yet though; might be overkill. Any other suggestions? And does some form of this want to make it into the built-in eyecite cleaners? That part doesn't matter for CAP's purposes, just curious if it'd be helpful.
Full citation not understood. Still investigating, but a number of citations have a "P"-prefixed number (page?).
See examples:
Metzler v. Arcadian Corp. 1997 OSHD (CCH) P31,311
CCH OSHD P 20,091 (1975)
As I described in issues 1338 and 1344, I'm using a portion of the CourtListener code to write a standalone citation finder. This component is an NLP mention finder, so the extent of the component turns out to be important. I know that this isn't the intention of the citation finder in CourtListener, but in some cases the goals may overlap. I'd be surprised if this issue describes one of them, but @mlissner invited me to submit the issue, so here goes.
The general strategy in find_citations.py is to search through the list of tokens in the document and look for "anchors" for a citation: a reporter for the full and short citations, "Id." and "Ibid." and "supra" for other cases, and the section symbol ("§") for non-opinion citations. Once the anchor is found, functions dedicated to the individual full, short and supra types are called to "build out" the citation to the left and right to capture the relevant information.
This is a pretty clever strategy, and I haven't changed it in my version of the code. However, because of the way it's written, it's possible to end up with citations which overlap in token extent with the citations to the left and right of it. For my purposes, this is pretty disastrous; for yours, probably not so much, although there may be cases where it leads to the wrong peripheral information, eventually.
The solution I've come up with requires a major refactor of the code, because after each citation is found, I need to see whether it overlaps with its predecessor, and in some circumstances it will result in my rebuilding the citations with start and/or end token limits which can't be exceeded, or simply discarding the citation entirely.
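The refactor amounts to something like this (toy token-extent model, purely illustrative):

```python
# Toy model: each citation records start/end token indices. When a new
# citation's extent overlaps its predecessor's, truncate the predecessor
# at the new citation's start. (A real rebuild could instead re-run the
# build-out with hard token limits, or drop the citation entirely.)

def resolve_overlaps(citations):
    resolved = []
    for cite in sorted(citations, key=lambda c: c["start"]):
        if resolved and cite["start"] <= resolved[-1]["end"]:
            resolved[-1]["end"] = cite["start"] - 1
        resolved.append(cite)
    return resolved

cites = [
    {"anchor": "U.S.", "start": 5, "end": 14},
    {"anchor": "S.Ct.", "start": 12, "end": 18},  # overlaps its predecessor
]
print(resolve_overlaps(cites))
# [{'anchor': 'U.S.', 'start': 5, 'end': 11}, {'anchor': 'S.Ct.', 'start': 12, 'end': 18}]
```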
A particularly perverse example is:
Reeves v. Sanderson Plumbing Prods., 530 U.S. 133, 148, 120 S.Ct. 2097,147 L.Ed.2d 105 (2000).
Note the lack of a space in 2097,147
(this may be an artifact of extraction with Apache Tika, or it may be in the original, I haven't checked). The impact of this is that there are three separate reporters, U.S., S.Ct. and L.Ed.2d, all of which grab this entire text sequence.
In my solution, I'm keeping track of the start and end tokens for each citation, so I know when I hit this problem, and one thing I've done is introduce an additional notion of the minimal start and end token, so that reanalyses have the option of ignoring some peripheral information. So, e.g., in the case above, I can reanalyze the citation anchored on the U.S. reporter to exclude everything after "133," (obviously, it should end after "148", but that would require a more sophisticated page parser, which I haven't tackled yet). So the citation anchored on the S.Ct. reporter starts at "148". Unfortunately, because of that missing space in "2097,147", there's no way to create two citations out of the remainder of the string, and one of them ends up being dropped.
It should be obvious from this description that find_citations.py would need to be doing a lot more, and a lot more different, work, and it's not clear, as I said, that it matters for your purposes. I report this for the sake of completeness.
Eyecite thinks that South Carolina citations are SCOTUS citations:
from eyecite import get_citations
text = 'Lee County School Dist. No. 1 v. Gardner, 263 F.Supp. 26 (SC 1967)'
cites = get_citations(text)
cites[0].metadata.court
# prints 'scotus'
The "SC" in the year could be ambiguous, but the F.Supp. reporter should automatically rule SCOTUS out as a possibility for the court here.
Once I get eyecite running on the CAP corpus, I'd like to look at all of the extracted citations where the page number was supposedly a roman numeral, and figure out what the error rate is and if we can filter out common false matches or restrict roman numeral matches to some reporters and volumes.
Here are false positive matches I've collected so far:
(Moving this conversation over from #18 to focus specifically on roman numerals.)
Related to #11, I don't think it makes sense to pull in dependabot pull requests like this one for lxml. The minimum lxml version doesn't want to be 4.6.3, it wants to be whatever minimum version works.
The dependabot updates marked "[security]" might make sense to pull in? Even there I might be tempted to eyeball them and see if the security issue is relevant ... but on the other hand any github project that depends on eyecite will get the same security update prompt, so maybe fine to use those ones to set minimum requirements anyway.
In this opinion, there are a number of citations that contain HTML:
https://www.courtlistener.com/opinion/1338566/martin-v-henson/
For example, one of them is roughly (paraphrasing):
22 <i>Ga. App.</i> 33
We use regular expressions to make the links, but the HTML in there makes regexes basically a non-starter. The best solution is probably to clean up the citations as a first pass through the text.
The good news is that we do identify these citations properly and they factor into pagerank and whatnot. Only thing is they don't become links.
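For a first pass, even a naive tag stripper illustrates the idea (illustrative only; a real cleaner needs to handle entities and whitespace):

```python
import re

def strip_html(text):
    """Naive first-pass tag stripper: remove simple open/close tags so the
    citation regexes see contiguous text. Real cleanup should also handle
    entities, attributes with '>' in them, whitespace, etc."""
    return re.sub(r"</?[a-zA-Z][^>]*>", "", text)

print(strip_html("22 <i>Ga. App.</i> 33"))
# 22 Ga. App. 33
```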
See examples:
2015 0667 (La.App. 1 Cir. 02/04/16); Court of Appeal of Louisiana, First Circuit
2011 2269 (La.App. 1 Cir. 11/29/12); Court of Appeal of Louisiana, First Circuit
2007 0889 (La.App. 4 Cir. 01/23/08); Court of Appeal of Louisiana, Fourth Circuit
Also in two-digit-year mode:
08 1119 (La.App. 3 Cir. 03/04/09); Court of Appeal of Louisiana, Third Circuit
Example:
71A A.F.T.R.2d (RIA) 3011 fails citation parsing because volume must be a digit.
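Illustrating with hypothetical, simplified volume patterns (reporters-db's real regexes differ):

```python
import re

# Hypothetical volume patterns: digits-only (the current assumption)
# vs. digits plus an optional trailing letter, as in "71A".
digits_only = re.compile(r"^\d{1,4}$")
with_suffix = re.compile(r"^\d{1,4}[A-Z]?$")

print(bool(digits_only.match("71A")))  # False: rejected today
print(bool(with_suffix.match("71A")))  # True
```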
NonopinionCitation seems like a misnomer now that we have FullLawCitation and FullJournalCitation abstractions (which are obviously not opinions). The point of NonopinionCitation is to serve as a naive catch-all for any citation that can't be otherwise parsed, so I think it should be renamed to something like UnknownCitation.
We don't currently handle this citation format:
"People v. Brislin, 80 Ill. 423; Lehmer v. The People, id. 601; Prout v. The People, 83 id. 154; C. & N. W. Ry. Co. v. The People, id. 467; Andrews v. The People, id. 529; Gage v. Parker, 103 id. 528; Blake v. The People, 109 id. 504; Riverside v. Howell, 113 id. 256; Schertz v. The People, 105 id. 27; Murphy v. The People, 120 id. 234; Riebling v. People, 145 id. 120."
It's used both with the same volume, like "id. 601" for "80 Ill. 601", and with different volumes, like "83 id. 154" for "83 Ill. 154".
(I don't have any great ideas about how to handle this and I'm not sure how common it is; just documenting.)
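For documentation's sake, a naive resolution could look like this (hypothetical helper; nothing to do with eyecite's actual tokenizer):

```python
import re

def expand_id_reporters(cite_strings):
    """Resolve 'id.' reporter references against the most recently seen
    reporter. A bare 'id. N' keeps the prior volume; 'V id. N' supplies
    its own volume. Purely illustrative; real text needs real parsing."""
    last_volume = last_reporter = None
    expanded = []
    for cite in cite_strings:
        m = re.match(r"(?:(\d+) )?(id\.|[A-Za-z. &]+?) (\d+)$", cite)
        volume, reporter, page = m.groups()
        if reporter == "id.":
            reporter = last_reporter
            volume = volume or last_volume  # "id. 601" keeps prior volume
        last_volume, last_reporter = volume, reporter
        expanded.append(f"{volume} {reporter} {page}")
    return expanded

print(expand_id_reporters(["80 Ill. 423", "id. 601", "83 id. 154"]))
# ['80 Ill. 423', '80 Ill. 601', '83 Ill. 154']
```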
I was doing a code review in preparation for the fix I was asked to work on regarding "scotus" being assigned inappropriately to certain cases. While reading the code I noticed the following line in resolve.py
MAX_OPINION_PAGE_COUNT = 150
I was wondering why the 150-page limit when there are cases like McConnell v. FEC, 251 F.Supp.2d 176 (D.C. 2003)
that are 750+ pages long. While probably not the majority of cases, some of the more important cases are well in excess of 150 pages and we might be missing out on citations to them by bailing out if the pincite is > page+150 in the _has_invalid_pin_cite function?
I'm properly baffled what's going on here. I could speculate, but I'm not sure it would be helpful. For some reason, sometimes when we run tests, they just don't work:
https://github.com/freelawproject/eyecite/runs/2027638177?check_suite_focus=true
I fixed this last time by purging the cache so that the deps were installed manually, but that's not something we can do every time. It's weird because sometimes when we restore from cache, it works fine:
https://github.com/freelawproject/eyecite/runs/2018816067?check_suite_focus=true
But other times (as above), it fails. When it fails or when it works, the cache seems to be loaded fine. If you look in the cache log for the failed action, it shows something just like what you see in the successful one. Both show:
Received 34564094 of 34564094 (100.0%), 90.8 MBs/sec
Cache Size: ~33 MB (34564094 B)
/usr/bin/tar --use-compress-program zstd -d -xf /home/runner/work/_temp/a4410fb8-75c9-4ce9-8a0d-a47c8182eb8a/cache.tzst -P -C /home/runner/work/eyecite/eyecite
Cache restored from key: venv-Linux-3.8-f11576093fd505fc160dc88e640b075f5961ced6301bbe880e6ba9d9d0aba930
So what's going on? I'm totally unsure. Anybody have ideas?
Here's an example of statutory short cites:
Business activities of national banks are controlled by the National Bank Act (NBA or Act), 12 U. S. C. § 1 et seq., and regulations promulgated thereunder by the Office of the Comptroller of the Currency (OCC). See §§24, 93a, 371(a). As the agency charged by Congress with supervision of the NBA, OCC oversees the operations of national banks and their interactions with customers. See NationsBank of N. C., N. A. v. Variable Annuity Life Ins. Co., 513 U. S. 251, 254, 256 (1995). The agency exercises visitorial powers, including the authority to audit the bank’s books and records, largely to the exclusion of other governmental entities, state or federal. See § 484(a); 12 CFR § 7.4000 (2006).
The NBA specifically authorizes federally chartered banks to engage in real estate lending. 12 U. S. C. § 371. It also provides that banks shall have power “[t]o exercise ... all such incidental powers as shall be necessary to carry on the business of banking.” §24 Seventh. Among incidental powers, national banks may conduct certain activities through “operating subsidiaries,” discrete entities authorized to engage solely in activities the bank itself could undertake, and subject to the same terms and conditions as those applicable to the bank. See § 24a(g)(3)(A); 12 CFR § 5.34(e) (2006).
When we see just "§ " we should potentially fill in the part before § with the previous cite containing §, so "§§ 24" becomes "12 U. S. C §§ 24".
This is interesting because it's not a short cite for clustering purposes ... we want to fill in what is probably the completion of the citation, but not treat them like citations to the same document.
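A minimal sketch of the fill-in idea (hypothetical helper and regex; not eyecite's resolver):

```python
import re

def expand_section_cites(text):
    """Prefix bare '§'/'§§' cites with the title/code of the most recent
    full statutory cite seen in the text. Toy regex, U.S.C. only."""
    last_prefix = None
    out = []
    pattern = re.compile(r"(?:(\d+ U\. ?S\. ?C\.) )?(§§? ?[\w().]+)")
    for m in pattern.finditer(text):
        prefix, section = m.groups()
        if prefix:
            last_prefix = prefix
            out.append(f"{prefix} {section}")
        elif last_prefix:
            out.append(f"{last_prefix} {section}")
    return out

print(expand_section_cites("12 U. S. C. § 1 et seq. ... See §§ 24, 93a, 371(a)."))
# ['12 U. S. C. § 1', '12 U. S. C. §§ 24']
```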
While writing some tests to expose the issues pointed out by @jcushman in #62, I noticed that eyecite was capturing the comma separating the defendant of a case and the reporter citation and making it part of the defendant string. Checked some real text we're working with, same issue.
Here's a minimum reproducible example:
import eyecite
input_str = 'foo v. bar, 1 U.S. 1 (2021)'
citations = eyecite.get_citations(input_str)
print(citations[0].defendant) # Prints 'bar,'
If this behavior isn't intended and we can safely assume that party names don't end in commas, it can be relatively trivially fixed by simply stripping commas (as well as whitespace) off the defendant name:
Lines 121 to 123 in 586cbb4, with the proposed fix:
from string import whitespace
# ...
citation.metadata.defendant = "".join(
    str(w) for w in words[start_index : citation.index]
).strip(whitespace + ",")
Happy to create a new PR with this fix and a test to cover it, or fold it into #62. Just wanted to check if this is on purpose first.
"Ridgely's Notes," "Wilson's Red Book," etc. is how the Delaware Supreme Court cites its old precedents. They don't have numbers before and after so they are not parsed by the current citation detection method. This could be added.
Also, apart from running the included tests, do you have a test dataset you can recommend?
Originally posted by @step21 in #86 (comment)
Eyecite is finding the wrong reporter for a citation - but alternating between the right and wrong.
When I run this code six times:
citations_objs = eyecite.get_citations("2013 Ark. App. 459")
cite_type_str = citations_objs[0].exact_editions[0].reporter.cite_type
print(cite_type_str)
I get:
state
state
neutral
neutral
state
neutral
This is confusing for me because I just updated reporters-db with the following reporter.
"Ark. App.": [
{
"cite_type": "state",
"editions": {
"Ark. App.": {
"end": "2008-12-31T00:00:00",
"regexes": [
"(?P<volume>\\d{1,3}) $reporter $page"
],
"start": "1981-01-01T00:00:00"
}
},
"examples": [
"84 Ark. App. 412"
],
"mlz_jurisdiction": [
"us:ar;appeals.court"
],
"name": "Arkansas Appellate Reports",
"variations": {
"Ak. App.": "Ark. App.",
"Ark.App.": "Ark. App."
}
},
{
"cite_type": "neutral",
"editions": {
"Ark. App.": {
"end": null,
"regexes": [
"$volume_year $reporter $page"
],
"start": "2009-01-01T00:00:00"
}
},
"examples": [
"2013 Ark. App. 5"
],
"mlz_jurisdiction": [
"us:ar;appeals.court"
],
"name": "Arkansas Appellate Reports",
"variations": {
"Ak. App.": "Ark. App.",
"Ark.App.": "Ark. App."
}
}
]
Arkansas switched its reports, supreme and appellate, to online publication with a volume-year format. So I was under the impression that, one, the regex pattern for $volume_year would indeed find those citations, and two, the regex pattern requiring 1-3 digits would keep these two editions separate.
I would perhaps expect the citations_objs[0].exact_editions to not have a specific order, but I don't know why it's bringing back the edition whose regex would exclude it.
RE: openjournals/joss-reviews#3617
I see the tutorial in the repo readme, but is there reference/API documentation somewhere -- something that lists the functions, classes, etc. in the package, with parameters and so on (the kind generally generated from docstrings)?
I expect this type of documentation for packages (or, alternatively, much more extensive usage guides).
See Example:
TRAVELERS INDEM. CO. v. HYLTON, 1972 U.S. Dist. LEXIS 12735
1972 Auto. Cas. (CCH) P7530
This is slightly different from the other format, which has PXX, ZZZ.
For example,
from eyecite import annotate, clean_text, get_citations
s = 'foo <i>Ibid.</i> bar'
s_cleaned = clean_text(s, ['html', 'all_whitespace'])
annotate(
plain_text=s_cleaned,
annotations=[[get_citations(s_cleaned)[0].span(), 'A', 'Z']],
source_text=s,
unbalanced_tags='skip'
)
# returns 'foo <i>Ibid.</i> bar' with no annotation
There is HTML here, yes, but it does not bisect the substring ("Ibid.") to be annotated -- thus, one would expect the annotation to stick even with unbalanced_tags='skip'.
I think this has to do with the way the diff between the plain text and the source text is calculated. Once the new offsets for the citation are calculated here (https://github.com/freelawproject/eyecite/blob/master/eyecite/annotate.py#L59), eyecite (erroneously) thinks that the closing </i>
tag is part of the substring to be annotated, (rightfully) detects it as unbalanced, and (rightfully) declines to do so.
I'm not immediately sure how to fix. My diagnosis of the problem may also not be complete.
We can improve cite clustering by excluding short cites with implausible pin cites. Example:
"1 U.S. 200. blah blah. 2 We Missed This 20. blah blah. Id. at 22."
We'll currently cluster "Id. at 22" with "1 U.S. 200", but we could refuse on the basis that int(id_cite.metadata.pin_cite) < int(us_cite.groups['page']). And then I think we'd ignore the Id. cite entirely, since it must be the product of some error or other.
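In sketch form (hypothetical function; the real objects and field names differ):

```python
# Sketch of the plausibility check: an "Id. at N" pin cite pointing
# before the antecedent's first page can't belong to that citation.

def implausible_id_pin_cite(antecedent_page, id_pin_cite):
    try:
        return int(id_pin_cite) < int(antecedent_page)
    except (TypeError, ValueError):
        return False  # non-numeric pages (e.g. roman numerals): don't guess

print(implausible_id_pin_cite("200", "22"))   # True -> refuse to cluster
print(implausible_id_pin_cite("200", "222"))  # False -> plausible
```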
RE: openjournals/joss-reviews#3617
There aren't performance claims in the paper, but I think maybe there should be -- in the paper or in the tutorial. If I'm evaluating whether to use this package, I have no idea how good it is. Is there a standard corpus (or can you make one?) where you can run your code and report some basic stats on how many citations you extract, how many errors, etc? It doesn't even have to be that big of a set of documents -- but some indication of the types of text/documents you've tested against, and how well it does on those.
I get that this package may be the best available for the task, but I have no idea what that means in practice -- can I rely on the results, or does it miss a lot?
I don't know how the citator would handle these presently, but in freelawproject/reporter-db#9 we discovered that there are a number of citations in the corpus like:
XX Pac. (2d) XX
We need to identify:
Naturally, we'll want some tests for these as well so that we capture them in the future too.
We use Sentry to get stacktraces and variable values on CourtListener. This is coming from a live user on CourtListener:
TypeError: %d format: a number is required, not NoneType
(5 additional frame(s) were not displayed)
...
File "cl/search/views.py", line 449, in show_results
render_dict.update(do_search(request.GET.copy()))
File "cl/search/views.py", line 193, in do_search
query_citation = get_query_citation(cd)
File "cl/lib/search_utils.py", line 124, in get_query_citation
matches = match_citation(citations[0])
File "cl/citations/match_citations.py", line 140, in match_citation
main_params["fq"].append('citation:("%s")' % citation.base_citation())
File "eyecite/models.py", line 62, in base_citation
return "%d %s %s" % (self.volume, self.reporter, self.page)
I'm on vacation this week (last one for a bit, promise), so I won't look at this too much, but it seems to be because the user queried using a Supra that got parsed incorrectly such that the Volume was None.
The query that triggered it is:
https://www.courtlistener.com/?q=Williamson%20v.%20Tucker%2C%20supra%2C%20645%20F.2d%20
Sentry Issue: COURTLISTENER-17K
@mattdahl, any chance you have a minute and want to take a look? Inside CL, this is coming from a feature that looks for citations inside people's queries, so it can give them an info box. You can see an example of it working normally here:
https://www.courtlistener.com/?q=558%20U.S.%20310
And I'm filing in eyecite, but I don't actually know if this is CL or eyecite. Of course we could work around it in CL, but maybe it's worth pushing upstream.
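For reference, the crash is easy to reproduce outside CL, and guarding the format call is one possible upstream fix (hypothetical; the real base_citation may want different behavior):

```python
# Minimal reproduction of the crash: %d on a None volume, as in the
# Sentry trace from models.py's base_citation.
volume, reporter, page = None, "F.2d", "645"

try:
    "%d %s %s" % (volume, reporter, page)
except TypeError as e:
    print(e)  # raises, as in the Sentry trace

# One defensive option: stringify and skip missing pieces instead of %d.
parts = [str(p) for p in (volume, reporter, page) if p is not None]
print(" ".join(parts))  # F.2d 645
```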
If an "id" or "ibid" or "supra" reference is preceded by a stop word token, the former are not properly tokenized as their respective token types:
Compare
from eyecite.tokenizers import default_tokenizer
list(default_tokenizer.tokenize('see id. at 577.'))
# returns [StopWordToken(data='see', start=0, end=3, stop_word='see'), 'id.', 'at', '577.']
with
from eyecite.tokenizers import default_tokenizer
list(default_tokenizer.tokenize('see foo id. at 577.'))
# returns [StopWordToken(data='see', start=0, end=3, stop_word='see'), 'foo', IdToken(data='id.', start=8, end=11), 'at', '577.']
The consequence of this is that those would-be citations are not extracted at all downstream.
I haven't really attempted to debug this, though I suspect it may have something to do with how the tokenizer deals with overlapping matches. (Just a guess.) @jcushman, do you have another intuition?
This project is BSD-licensed, but it depends on @asciimoo's https://github.com/asciimoo/exrex, which uses the AGPL... my understanding is that the AGPL may extend here too. @asciimoo, have you considered the usage of your tool as a library and its license impact?
These should be set to minimum values instead of exact values in requirements.txt to avoid conflicts with other libraries:
courts-db==0.9.7
lxml==4.6.2
reporters-db==2.0.5
six==1.15.0
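i.e., the pins could become minimums like:

```
courts-db>=0.9.7
lxml>=4.6.2
reporters-db>=2.0.5
six>=1.15.0
```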
See: Williams v. IRS, 2007-2 U.S. Tax Cas. (CCH) P50,568 (E.D. Mo. 2007)
Bill, do you think you could set these up and do a first deploy?
Does it make sense to include something like this in our new project template? I can't decide if it'd be helpful or annoying.
See:
McCahon 166
Armstrong v. Wyandotte Bridge Co., McCahon 166
Dallam 614
Allen v. Scott, Dallam 614
During my review for JOSS, I noticed that hyperscan seems to be unwilling to compile on anything non-Linux.
This is fine, but the eyecite readme only says 'x86', which normally implies it should at least run on macOS x86_64 and/or Windows x86_64, so I think this should be clarified.
openjournals/joss-reviews#3617
Something like
In re Gault, 387 U.S. 1, 13, 87 S.Ct. 1428, 18 L.Ed.2d 527 (1967) (establishing that "neither the Fourteenth Amendment nor the Bill of Rights is for adults alone")
is counted as 3 in the citation depth rather than 1, and we pick up the same parenthetical thrice.
RE: openjournals/joss-reviews#3617
It would help in the tutorial/documentation/readme to have a more concise list of what the functionality of the package is -- what are the main things it does. This could be as simple as a table of contents for the tutorial sections, if all of the functionality is represented by a section.
RE: openjournals/joss-reviews#3617
There are references to a "database" in the tutorial - I don't know what these mean. What database? Are you talking about linking to external legal databases of some type? Are these just adding URLs?
It also says things like "fetched from our database" -- but what is "our database"?
A researcher is looking at old SCOTUS stuff, and noticed that in U.S. v. Hamilton there's a citation like:
Perris v. Hexamer,99 U.S. (9 Otto) 674, 675-76, 25 L. Ed. 308 (1878)
To nobody's surprise, we're not picking this up because 99 U.S. (9 Otto) 674 is a hot mess. It might be doable to work around this with some clever regex work without adding too much technical or processing overhead. Worth investigating.
RE: openjournals/joss-reviews#3617
I don't see any of the expected community guidelines:
Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support
First, thanks for eyecite. @kfunk074 and I are historians working on American legal history, and we intend to use eyecite for a project in progress.
eyecite does very well with citations from the twentieth century on (post-Bluebook?) but it does not detect citations from case reporters in the nineteenth century. To give one example: before Georgia created official case reports, the de facto standard reporter for Georgia's case law was Kelly's Reports. When Georgia began official reports, it adopted the first five volumes of Kelly as its official reports. So, 1 Kelly 254 = 1 Ga. 254
and so on. Of course citations before the Georgia reports all go to Kelly, but even after the official reports, Kelly might still be cited directly. eyecite will detect the Georgia reports, but not Kelly.
The same is true for basically every state jurisdiction in the U.S. I believe there are issues on this repository that are subsets of this problem. I suspect, e.g., that the reporters listed in this issue (#27) are the same kind of problem as described for Georgia. And it depends on the corpus, of course, but such citations can be a substantial body that are missed by eyecite.
We are currently compiling a list of these "antique" reporters. We would like to contribute a pull request that adds these reporters to eyecite. A few questions.
For example, such a citation would standardly be written as 3 Or. 534 for Oregon. But historically, it's common for such citations to be written as 3 Oreg. 534. We'd also like to contribute some variant abbreviations, and aren't sure what the best way to do that is.