freelawproject / eyecite Goto Github PK

View Code? Open in Web Editor NEW

115.0 115.0 28.0 5.15 MB

Find legal citations in any block of text

Home Page: https://freelawproject.github.io/eyecite/

License: BSD 2-Clause "Simplified" License

Python 89.34% Shell 0.27% Jupyter Notebook 10.39%

eyecite's People

Contributors

Stargazers

Watchers

eyecite's Issues

Paper structure

Re: openjournals/joss-reviews#3617

Once or twice I had the feeling that the paper switches between topics or mixes things. As example, under 'statement of need' it switches from explaining the need directly to explaining how eyecite itself works.
Maybe this could be its own section?

California parentheticals use square brackets??

According to the grand daddy of parentheticals, apparently CA uses square brackets instead of parentheses:

https://twitter.com/tweetatpablo/status/1392920773824630784

I guess this means our regex needs a tweak, but it'd be nice to see this in the wild before we permanently add code based on a tweet.

And...I guess this means parentheticals should be called something else. "ParenBracketTheticals," perhaps.

Correct way to handle text abutting citation?

What do we think is the right thing to do with text like this?

In [2]: get_citations("foo1 U.S. 1bar")
Out[2]: [FullCaseCitation(token=CitationToken(data='1 U.S. 1bar', start=3, end=14, volume='1', reporter='U.S.', page='1bar'...

We're currently permissive about finding cites with abutting characters, which I can imagine being overinclusive with reporter strings like "foo1 Or 2bar" or "foo1 P 2bar" or "foo1 mt 2bar" etc.

I guess this breaks down into a few separate things:

Should there be a defined list of characters that can come before the start of a cite, like (^| )? I'm thinking probably yes?
Should there be a defined list of characters that can come after a cite? This seems harder to make complete, but might be like ($|[ ,.;]) and a few more along those lines. Less sure about that one.
Should we narrow the definition of page numbers so that example would come back as page='1'? We currently capture 1bar because of the page number regex \d{1,6}[-]?[a-zA-Z]{1,6}, which is there to handle A special Connecticut or Illinois number. E.g., "13301-M". So we could add that regex specifically to CT and IL in reporters-db and avoid capturing random cruft in other reporters. I'm leaning toward that being a good idea.

Expose test factories

I would like to use the handy object-creation factories that @jcushman created (https://github.com/freelawproject/eyecite/blob/master/tests/factories.py) to streamline CL's tests as well.

However, we (rightly imo) exclude the tests directory when we package eyecite for distribution (https://github.com/freelawproject/eyecite/blob/master/setup.py#L36).

I propose either:

Moving the factories.py file into the main eyecite directory; or
Putting the factory functions onto the Citation models directly as class methods (sort of a pseudo-constructor specifically for creating mock objects)

Thoughts?

Full Text Examples and Tests

RE: openjournals/joss-reviews#3617

I'd like to see both usage examples in the tutorial and tests that use longer text documents -- full cases or other documents similar to what you expect to be parsed. The examples and the tests seem to use single-line snippets with few exceptions, which I don't think reflect actual expected usage.

Longer examples would help people see how to use the software. Would also be nice to have some example texts that people can run to get started with the software.

Longer text for the tests could reveal issues that other tests don't -- for example what is the full expected set of citations out of a document. I think the software also does some co-referencing, yes? So to also see the tests for that in some of the more complicated cases

Find partial citations for decisions that don't yet have full citations, e.g. 22 U.S. ____ (2002)

Back in 2014, in freelawproject/courtlistener#1601, @brianwc lamented that we don't handle partial citations very well. His example was:

The Supreme Court often cites its recent cases that don't yet have a full U.S. citation like so:

Bullcoming v. New Mexico, 564 U. S. ___ (2011).

We have Bullcoming v. New Mexico, know it is a SCOTUS case, and know it was filed in 2011.

Alas, we never got the citator working on these. I'm not sure it's worth the effort to do so now, but I thought I'd file an issue so we could track it as a gap in functionality.

Annotating large strings is slow/incorrect

Here's a gist: https://gist.github.com/mattdahl/21080e7aafd0a28c3ef0ebf9d9d13b0c

In short, when use_dmp=True, the annotation is inserted at the wrong location. When use_dmp=False, the location is correct, but it takes about 30 seconds to compute on my machine.

Silent FindTest test failures

Sorry for the spam today, but this one seems pretty important. While rebasing my code on master to incorporate the #64 refactor, I noticed my tests that should be failing were suddenly passing.

The problem (introduced in #64), it would seem, is forgetting to return a value from get_comparison_attrs, resulting in the test runner asserting None == None repeatedly and passing. To illustrate this, here's a commit that adds an obviously wrong test that should fail instead passing all CI checks: lexeme-dev@bcf6367

And here's a commit that comments out that intentionally broken test and fixes get_comparison_attrs so that assertions are actually made: lexeme-dev@f0bc7fe

Fixing the method results in a significant number of test failures, most of which seem to be related to an index mismatch.

More obscure joke?

CAP's extraction task choked on a few cases because of the joke_cite, since we extract citations paragraph by paragraph and have some paragraphs that are just the word "this", and the code wasn't expected a cite with reporter_found=None. Maybe time for a new joke?

Link parallel cites

For a cite like "1 U.S. 1, 2 S. Ct. 2 (1999) (overruling ...)" we extract "1 U.S. 1" and "2 S. Ct. 2" as separate cites that both have the parenthetical "overruling ...". If you later report the parentheticals somehow you double up, or if you use a resolver that knows those are the same case, you double-count the weight of that citation. It would be good if we detected this and linked the two cites as parallel to each other, so the weight and parenthetical could only be counted once.

Cleaner to normalize unicode glyphs?

Do you all have any insight about cleaning text for non-ascii characters? We have two parts of this in play for CAP:

Quotes and dashes (and maybe others?) can come in as curly quotes or mdashes or whatever. Some set of replacements should probably be made on our text like ‘ ’ ´ “ ” – -> ' ' ' " " -; don't know if there's a good complete list. This one probably applies to most text.
OCR'd cites can come in with accents and umlauts and such, so for OCR'd English text we probably want to replace é and ü and so on with English-language ascii lookalikes. This might be less generally applicable.

I'm thinking of throwing everything through https://pypi.org/project/Unidecode/ , which I think will do both of those things:

> print(unidecode('‘’´“”–éü'))
'''""-eu

I haven't measured performance yet though; might be overkill. Any other suggestions? And does some form of this want to make it into the built-in eyecite cleaners? That part doesn't matter for CAP's purposes, just curious if it'd be helpful.

Citation parser fails for OSHD cases

Full citation not understood. Still investigating but a number of citations have (page?)

See examples:

Metzler v. Arcadian Corp.  1997 OSHD (CCH) P31,311
CCH OSHD P 20,091 (1975)

Avoiding overlapping citation extents in find_citations.py

As I described in issues 1338 and 1344, I'm using a portion of the CourtListener code to write a standalone citation finder. This component is an NLP mention finder, so the extent of the component turns out to be important. I know that this isn't the intention of the citation finder in CourtListener, but in some cases the goals may overlap. I'd be surprised if this issue describes one of them, but @mlissner invited me to submit the issue, so here goes.

The general strategy in find_citations.py is to search through the list of tokens in the document and look for "anchors" for a citation: a reporter for the full and short citations, "Id." and "Ibid." and "supra" for other cases, and the sigma for non-opinion citations. Once the anchor is found, functions dedicated to the individual full, short and supra types are called to "build out" the citation to the left and right to capture the relevant information.

This is a pretty clever strategy, and I haven't changed it in my version of the code. However, because of the way it's written, it's possible to end up with citations which overlap in token extent with the citations to the left and right of it. For my purposes, this is pretty disastrous; for yours, probably not so much, although there may be cases where it leads to the wrong peripheral information, eventually.

The solution I've come up with requires a major refactor of the code, because after each citation is found, I need to see whether it overlaps with its predecessor, and in some circumstances it will result in my rebuilding the citations with start and/or end token limits which can't be exceeded, or simply discarding the citation entirely.

A particularly perverse example is:

Reeves v. Sanderson Plumbing Prods., 530 U.S. 133, 148, 120 S.Ct. 2097,147 L.Ed.2d 105 (2000).

Note the lack of a space in 2097,147 (this may be an artifact of extraction with Apache Tika, or it may be in the original, I haven't checked). The impact of this is that there are three separate reporters, U.S., S.Ct. and L.Ed.2d, all of which grab this entire text sequence.

In my solution, I'm keeping track of the start and end tokens for each citation, and so I know when I hit this problem, and one thing I've done is introduce an additional notion of the minimal start and end token, so that reanalyses have the option of ignoring some peripheral information. So, e.g., in the case above, I can reanalyze the citation anchored on the U.S. reporter to exclude everything after 133, (obviously, it should after 148, but that would require a more sophisticated page parser, which I haven't tackled yet). So the citation anchored on the S.Ct. reporter starts at 148. Unfortunately, because of that missing space in 2097,147, there's no way to create two citations out of the remainder of the string, and one of them ends up being dropped.

It should be obvious from this description that find_citations.py would need to be doing a lot more, and a lot more different, work, and it's not clear, as I said, that it matters for your purposes. I report this for the sake of completeness.

Don't set court='scotus' for South Carolina citations

Eyecite thinks that South Carolina citations are SCOTUS citations:

from eyecite import get_citations
text = 'Lee County School Dist. No. 1 v. Gardner,  263 F.Supp. 26 (SC 1967)'
cites = get_citations(text)
cites[0].metadata.court

# prints 'scotus'

The SC in the year could be ambiguous, but the F.Supp. reporter should automatically rule SCOTUS out as a possibility for the court here.

Tighten matches for roman numeral page numbers

Once I get eyecite running on the CAP corpus, I'd like to look at all of the extracted citations where the page number was supposedly a roman numeral, and figure out what the error rate is and if we can filter out common false matches or restrict roman numeral matches to some reporters and volumes.

Here are false positive matches I've collected so far:

49 Col. L. Rev. 875
65 Mich. L. Rev. 477
37 Taylor v. Bd. of Educ., 240 Fed. Appx. 717 (6th Cir. 2007)
11 U. S. C. § 701

(Moving this conversation over from #18 to focus specifically on roman numerals.)

Disable dependabot PRs

Related to #11, I don't think it makes sense to pull in dependabot pull requests like this one for lxml. The minimum lxml version doesn't want to be 4.6.3, it wants to be whatever minimum version works.

The dependabot updates marked "[security]" might make sense to pull in? Even there I might be tempted to eyeball them and see if the security issue is relevant ... but on the other hand any github project that depends on eyecite will get the same security update prompt, so maybe fine to use those ones to set minimum requirements anyway.

Citations containing HTML not made into links by citation parser

In this opinion, there are a number of citations that contain HTML:

https://www.courtlistener.com/opinion/1338566/martin-v-henson/

For example, one of them is roughly (paraphrasing):

22 <i>Ga. App.</i> 33

We use regular expressions to make the links, but the HTML in there makes regexes basically a non-starter. The best solution is probably to clean up the citations as a first pass through the text.

The good news is that we do identify these citations properly and they factor into pagerank and whatnot. Only thing is they don't become links.

Citation parser fails for LA citations starting in mid 1990s

See examples:

2015 0667 (La.App. 1 Cir. 02/04/16);    Court of Appeal of Louisiana, First Circuit
2011 2269 (La.App. 1 Cir. 11/29/12);       Court of Appeal of Louisiana, First Circuit
2007 0889 (La.App. 4 Cir. 01/23/08);      Court of Appeal of Louisiana, Fourth Circuit

also in two digit year mode

08 1119 (La.App. 3 Cir. 03/04/09);         Court of Appeal of Louisiana, Third Circuit

Citation parsing Fails if Letter in Volume

Example:
71A A.F.T.R.2d (RIA) 3011 fails citation parsing because volume must be a digit.

Rename NonopinionCitation to UnknownCitation

NonopinionCitation seems like a misnomer now that we have FullLawCitation and FullJournalCitation abstractions (which are obviously not opinions). The point of NonopinionCitation is to serve as a naive catch-all for any citation that can't be otherwise parsed, so I think it should be renamed to something like UnknownCitation.

Volume and reporter id. cites

We don't currently handle this citation format:

"People v. Brislin, 80 Ill. 423; Lehmer v. The People, id. 601; Prout v. The People, 83 id. 154; C. & N. W. Ry. Co. v. The People, id. 467; Andrews v. The People, id. 529; Gage v. Parker, 103 id. 528; Blake v. The People, 109 id. 504; Riverside v. Howell, 113 id. 256; Schertz v. The People, 105 id. 27; Murphy v. The People, 120 id. 234; Riebling v. People, 145 id. 120."

It's used both with the same volume, like "id. 601" for "80 Ill. 601", and with different volumes, like "83 id. 154" for "83 Ill. 154".

(I don't have any great ideas about how to handle this and I'm not sure how common it is; just documenting.)

MAX_OPINION_PAGE_COUNT ?

I was doing a code review in preparation for the fix I was asked to work on regarding "scotus" being assigned inappropriately to certain cases. While reading the code I noticed the following line in resolve.py

MAX_OPINION_PAGE_COUNT = 150

I was wondering why the 150-page limit when there are cases like McConnell v. FEC, 251 F.Supp.2d 176 (D.C. 2003)
that are 750+ pages long. While probably not the majority of cases, some of the more important cases are well in excess of 150 pages and we might be missing out on citations to them by bailing out if the pincite is > page+150 in the _has_invalid_pin_cite function?

CI failing

I'm properly baffled what's going on here. I could speculate, but I'm not sure it would be helpful. For some reason, sometimes when we run tests, they just don't work:

https://github.com/freelawproject/eyecite/runs/2027638177?check_suite_focus=true

I fixed this last time by purging the cache so that the deps were installed manually, but that's not something we can do every time. It's weird because sometimes when we restore from cache, it works fine:

https://github.com/freelawproject/eyecite/runs/2018816067?check_suite_focus=true

But other times (as above), it fails. When it fails or when it works, the cache seems to be loaded fine. If you look in the cache log for the failed action, it shows something just like what you see in the successful one. Both show:

Received 34564094 of 34564094 (100.0%), 90.8 MBs/sec
Cache Size: ~33 MB (34564094 B)
/usr/bin/tar --use-compress-program zstd -d -xf /home/runner/work/_temp/a4410fb8-75c9-4ce9-8a0d-a47c8182eb8a/cache.tzst -P -C /home/runner/work/eyecite/eyecite
Cache restored from key: venv-Linux-3.8-f11576093fd505fc160dc88e640b075f5961ced6301bbe880e6ba9d9d0aba930

So what's going on? I'm totally unsure. Anybody have ideas?

Handle statutory short cites

Here's an example of statutory short cites:

Business activities of national banks are controlled by the National Bank Act (NBA or Act), 12 U. S. C. § 1 et seq., and regulations promulgated thereunder by the Office of the Comptroller of the Currency (OCC). See §§24, 93a, 371(a). As the agency charged by Congress with supervision of the NBA, OCC oversees the operations of national banks and their interactions with customers. See NationsBank of N. C., N. A. v. Variable Annuity Life Ins. Co., 513 U. S. 251, 254, 256 (1995). The agency exercises visitorial powers, including the authority to audit the bank’s books and records, largely to the exclusion of other governmental entities, state or federal. See § 484(a); 12 CFR § 7.4000 (2006).

The NBA specifically authorizes federally chartered banks to engage in real estate lending. 12 U. S. C. § 371. It also provides that banks shall have power “[t]o exercise ... all such incidental powers as shall be necessary to carry on the business of banking.” §24 Seventh. Among incidental powers, national banks may conduct certain activities through “operating subsidiaries,” discrete entities authorized to engage solely in activities the bank itself could undertake, and subject to the same terms and conditions as those applicable to the bank. See § 24a(g)(3)(A); 12 CFR § 5.34(e) (2006).

When we see just "§ " we should potentially fill in the part before § with the previous cite containing §, so "§§ 24" becomes "12 U. S. C §§ 24".

This is interesting because it's not a short cite for clustering purposes ... we want to fill in what is probably the completion of the citation, but not treat them like citations to the same document.

Trailing comma in defendant field of full case citation

While writing some tests to expose the issues pointed out by @jcushman in #62, I noticed that eyecite was capturing the comma separating the defendant of a case and the reporter citation and making it part of the defendant string. Checked some real text we're working with, same issue.

Here's a minimum reproducible example:

import eyecite

input_str = 'foo v. bar, 1 U.S. 1 (2021)'
citations = eyecite.get_citations(input_str)
print(citations[0].defendant) # Prints 'bar,'

If this behavior isn't intended and we can safely assume that party names don't end in commas, it can be relatively trivially fixed by simply stripping commas (as well as whitespace) off the defendant name:

eyecite/eyecite/helpers.py

Lines 121 to 123 in 586cbb4

 citation.metadata.defendant = "".join( 

 str(w) for w in words[start_index : citation.index] 

 ).strip()

from string import whitespace
# ...
citation.metadata.defendant = "".join(
    str(w) for w in words[start_index : citation.index]
).strip(whitespace + ',')

Happy to create a new PR with this fix and a test to cover it, or fold it into #62. Just wanted to check if this is on purpose first.

Add detection of "Ridgely's Notes" and other irregular reporters

"Ridgely's Notes," "Wilson's Red Book," etc. is how the Delaware Supreme Court cites its old precedents. They don't have numbers before and after so they are not parsed by the current citation detection method. This could be added.

Discuss test data sets

Also, apart from running the included tests, do you have a test dataset you can recommend?

Originally posted by @step21 in #86 (comment)

get_citations alternating between reporters

Eyecite is finding the wrong reporter for a citation - but alternating between the right and wrong.

When I run this code six times I get

citations_objs = eyecite.get_citations("2013 Ark. App. 459")
cite_type_str = citations_objs[0].exact_editions[0].reporter.cite_type
print(cite_type_str)

state
state
neutral
neutral 
state
neutral

This is confusing for me because I just updated reporters-db with the following reporter.

"Ark. App.": [
        {
            "cite_type": "state",
            "editions": {
                "Ark. App.": {
                    "end": "2008-12-31T00:00:00",
                    "regexes": [
                        "(?P<volume>\\d{1,3}) $reporter $page"
                    ],
                    "start": "1981-01-01T00:00:00"
                }
            },
            "examples": [
                "84 Ark. App. 412"
            ],
            "mlz_jurisdiction": [
                "us:ar;appeals.court"
            ],
            "name": "Arkansas Appellate Reports",
            "variations": {
                "Ak. App.": "Ark. App.",
                "Ark.App.": "Ark. App."
            }
        },
        {
            "cite_type": "neutral",
            "editions": {
                "Ark. App.": {
                    "end": null,
                    "regexes": [
                        "$volume_year $reporter $page"
                    ],
                    "start": "2009-01-01T00:00:00"
                }
            },
            "examples": [
                "2013 Ark. App. 5"
            ],
            "mlz_jurisdiction": [
                "us:ar;appeals.court"
            ],
            "name": "Arkansas Appellate Reports",
            "variations": {
                "Ak. App.": "Ark. App.",
                "Ark.App.": "Ark. App."
            }
        }
    ]

Arkansas switched Arkansas reports, supreme and appellate to online with a VOLUME_YEAR. So I was under the impression that one, the regex pattern for volume year would indeed find those citations and the regex pattern requiring 1,3 digits would keep these two separate.

I would perhaps expect the citations_objs[0].exact_editions to not have a specific order but I dont know why its bringing back the edition that has a regex that would exclude it.

API/Reference Documentation

RE: openjournals/joss-reviews#3617

I see the tutorial in the repo readme, but is there reference/API documentation somewhere -- that lists the functions, classes, etc. in the package, with parameters, etc. (the type generally generated from docstrings)

I expect this type of documentation for packages (or alternatively much more extensive usage guides)

Citation fails if P in page section

See Example:
RAVELERS INDEM. CO. v. HYLTON, 1972 U.S. Dist. LEXIS 12735

1972 Auto. Cas. (CCH) P7530

This is slightly different than the other format that has PXX, ZZZ

unbalanced_tags="skip" erroneously skips tightly wrapped, balanced HTML

For example,

from eyecite import annotate, clean_text, get_citations

s = 'foo <i>Ibid.</i> bar'
s_cleaned = clean_text(s, ['html', 'all_whitespace'])
annotate(
   plain_text=s_cleaned,
   annotations=[[get_citations(s_cleaned)[0].span(), 'A', 'Z']],
   source_text=s,
   unbalanced_tags='skip'
)
# returns 'foo <i>Ibid.</i> bar' with no annotation

There is HTML here, yes, but it does not bisect the the substring ("Ibid.") to be annotated -- thus, one would expect the annotation to stick even with unbalanced_tags='skip'.

I think this has to do with the way the diff between the plain text and the source text is calculated. Once the new offsets for the citation are calculated here (https://github.com/freelawproject/eyecite/blob/master/eyecite/annotate.py#L59), eyecite (erroneously) thinks that the closing </i> tag is part of the substring to be annotated, (rightfully) detects it as unbalanced, and (rightfully) declines to do so.

I'm not immediately sure how to fix. My diagnosis of the problem may also not be complete.

Use pin cite to filter out short cites in clustering

We can improve cite clustering by excluding short cites with implausible pin cites. Example:

"1 U.S. 200. blah blah. 2 We Missed This 20. blah blah. Id. at 22."

We'll currently cluster "Id. at 22" with "1 U.S. 200," but we could refuse on the basis that int(id_cite.metadata.pin_cite) < int(us_cite.groups['page']). And then I think ignore the Id. cite entirely since it must be the product of some error or other.

Performance evaluation?

RE: openjournals/joss-reviews#3617

There aren't performance claims in the paper, but I think maybe there should be -- in the paper or in the tutorial. If I'm evaluating whether to use this package, I have no idea how good it is. Is there a standard corpus (or can you make one?) where you can run your code and report some basic stats on how many citations you extract, how many errors, etc? It doesn't even have to be that big of a set of documents -- but some indication of the types of text/documents you've tested against, and how well it does on those.

I get that this package may be the best available for the task, but I have no idea what that means in practice -- can I rely on the results, or does it miss a lot?

Investigate and Fix Citations with Parentheses like XX Pac. (2d) XX

I don't know how the citator would handle these presently, but in freelawproject/reporter-db#9 we discovered that there a number of citations in the corpus like:

XX Pac. (2d) XX

We need to identify:

Does the citator already capture these? My guess is no.
If it doesn't capture these, do they need to be added to the variations dictionary, or does the parser need to updated?

Naturally, we'll want some tests for these as well so that we capture them in the future too.

TypeError: %d format: a number is required, not NoneType

We use Sentry to get stacktraces and variable values on CourtListener. This is coming from a live user on CourtListener:

TypeError: %d format: a number is required, not NoneType
(5 additional frame(s) were not displayed)
...
  File "cl/search/views.py", line 449, in show_results
    render_dict.update(do_search(request.GET.copy()))
  File "cl/search/views.py", line 193, in do_search
    query_citation = get_query_citation(cd)
  File "cl/lib/search_utils.py", line 124, in get_query_citation
    matches = match_citation(citations[0])
  File "cl/citations/match_citations.py", line 140, in match_citation
    main_params["fq"].append('citation:("%s")' % citation.base_citation())
  File "eyecite/models.py", line 62, in base_citation
    return "%d %s %s" % (self.volume, self.reporter, self.page)

I'm on vacation this week (last one for a bit, promise), so I won't look at this too much, but it seems to be because the user queried using a Supra that got parsed incorrectly such that the Volume was None.

The query that triggered it is:

https://www.courtlistener.com/?q=Williamson%20v.%20Tucker%2C%20supra%2C%20645%20F.2d%20

Sentry Issue: COURTLISTENER-17K

@mattdahl, any chance you want to take a look and have a minute? Inside CL, this is coming from a feature that looks for citations inside people's queries, so it can give them an info box. You can see an example of it working normally here:

https://www.courtlistener.com/?q=558%20U.S.%20310

And I'm filing in eyecite, but I don't actually know if this is CL or eyecite. Of course we could work around it in CL, but maybe it's worth pushing upstream.

Tokenizer fails to tokenize adjacent StopWord, Id, and Supra tokens

If an "id" or "ibid" or "supra" reference is preceded by a stop word token, the former are not properly tokenized as their respective token types:

Compare

from eyecite.tokenizers import default_tokenizer
list(default_tokenizer.tokenize('see id. at 577.'))
# returns [StopWordToken(data='see', start=0, end=3, stop_word='see'), 'id.', 'at', '577.']

with

from eyecite.tokenizers import default_tokenizer
list(default_tokenizer.tokenize('see foo id. at 577.'))
[StopWordToken(data='see', start=0, end=3, stop_word='see'), 'foo', IdToken(data='id.', start=8, end=11), 'at', '577.']

The consequence of this is that those would-be citations are not extracted at all downstream.

I haven't really attempted to debug this, though I suspect it may have something to do with how the tokenizer deals with overlapping matches. (Just a guess.) @jcushman, do you have another intuition?

License of deps

This project is BSD-licensed, but it depends on @asciimoo https://github.com/asciimoo/exrex which is using the AGPL.... my understanding is that the AGPL may extend here too. @asciimoo have you consider the usage of your tool as a library and its license impact?

Unpin requirements

These should be set to minimum values instead of exact values in requirements.txt to avoid conflicts with other libraries:

courts-db==0.9.7
lxml==4.6.2
reporters-db==2.0.5
six==1.15.0

Citation Parser Fails for hyphenated volumes

See: Williams v. IRS, 2007-2 U.S. Tax Cas. (CCH) P50,568 (E.D. Mo. 2007)

Microscope needs automated deploys to pypi

Bill, do you think you could set these up and do a first deploy?

Does it make sense to include something like this in our new project template? I can't decide if it'd be helpful or annoying.

Logo design feedback

We're doing a homepage revamp and had some logos made for the projects that we want to advertise on free.law.

The logos we've done so far were very cheap, and I don't think they're ready or very good yet, but here's the one for eyecite.

Comments and feedback very welcome!

Citation parser fails when cite is two parts not three

See:

McCahon 166 
Armstrong v. Wyandotte Bridge Co., McCahon 166

Dallam 614
Allen v. Scott, Dallam 614

Docs should say that hyperscan is currently linux only, not only x86_64 only.

During my review for JOSS, I noticed that hyperscan seems to be unwilling to compile on anything non-linux.
This is fine, but in the eyecite readme at least it only says 'x86' which normally means it should at least run on macos x86_64 or/and Windows x86_64, so I think this should be clarified.
openjournals/joss-reviews#3617

Parallel citations are detected as separate citations

Something like

In re Gault, 387 U.S. 1, 13, 87 S.Ct. 1428, 18 L.Ed.2d 527 (1967) (establishing that "neither the Fourteenth Amendment nor the Bill of Rights is for adults alone")

is counted as 3 in the citation depth rather than 1, and we pick up the same parenthetical thrice.

More concise list of functionality

RE: openjournals/joss-reviews#3617

It would help in the tutorial/documentation/readme to have a more concise list of what the functionality of the package is -- what are the main things it does. This could be as simple as a table of contents for the tutorial sections, if all of the functionality is represented by a section.

References to "database" in tutorial

RE: openjournals/joss-reviews#3617

There are references to a "database" in the tutorial - I don't know what these mean. What database? Are you talking about linking to external legal databases of some type? Are these just adding URLs?

It also says things like "fetched from our database" -- but what is "our database"?

Citator doesn't pick up old SCOTUS parallel citations like "99 U.S. (9 Otto) 674, 675-76"

A researcher is looking at old SCOTUS stuff, and noticed that in U.S. v. Hamilton there's a citation like:

Perris v. Hexamer,99 U.S. (9 Otto) 674, 675-76, 25 L. Ed. 308 (1878)

To nobody's surprise, we're not picking this up because 99 U.S. (9 Otto) 674 is a hot mess. It might be doable to work around this with some clever regex work without adding too much technical or processing overhead. Worth investigating.

Project does not have community guidelines

RE: openjournals/joss-reviews#3617

I don't see any of the expected community guidelines:
Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Detect historical or non-standard citations

First, thanks for eyecite. @kfunk074 and I are historians working on American legal history, and we intend to use eyecite for a project in progress.

eyecite does very well with citations from the twentieth century on (post-Bluebook?) but it does not detect citations from case reporters in the nineteenth century. To give one example: before Georgia created official case reports, the de facto standard reporter for Georgia's case law was Kelly's Reports. When Georgia began official reports, it adopted the first five volumes of Kelly as its official reports. So, 1 Kelly 254 = 1 Ga. 254 and so on. Of course citations before the Georgia reports all go to Kelly, but even after the official reports, Kelly might still be cited directly. eyecite will detect the Georgia reports, but not Kelly.

The same is true for basically every state jurisdiction in the U.S. I believe there are issues on this repository that are subsets of this problem. I suspect, e.g., that the reporters listed in this issue (#27) are the same kind of problem as described for Georgia. And it depends on the corpus, of course, but such citations can be a substantial body that are missed by eyecite.

We are currently compiling a list of these "antique" reporters. We would like to contribute a pull request that adds these reporters to eyecite. A few questions.

Would such a pull request be welcome?
If so, could you please give us some guidance about how best to do that. I've looked through the eyecite code, though not in great detail just yet. My understanding is that we would contribute the data to the reporters-db repo, but we could use some advice about how best to do so.
On a secondary issue. Bluebook usually standardizes citations, e.g., to 3 Or. 534 for Oregon. But historically, it's common for such citations to be written as 3 Oreg. 534. We'd also like to contribute some variant abbreviations, and aren't sure what the best way to do that is.

Skip supra cites to current document

SCOTUS cases apparently cite themselves as "supra":

Abbott v. Abbott, 560 U.S. 1, 35.

It would be good to detect these somehow ... maybe every cite like [.;] <signal>? supra isn't real? On the other hand maybe they're effectively filtered out later since the antecedents won't match.

	citation.metadata.defendant = "".join(
	str(w) for w in words[start_index : citation.index]
	).strip()

freelawproject / eyecite Goto Github PK

eyecite's People

Contributors

Stargazers

Watchers

Forkers

eyecite's Issues

Recommend Projects

Recommend Topics

Recommend Org