Commit da4d5d9

Merge pull request #99 from z3c0/test
v0.9.72
2 parents c584766 + 09f545c commit da4d5d9

24 files changed: +223138 -128 lines changed

MANIFEST.in

Lines changed: 2 additions & 1 deletion
@@ -1,2 +1,3 @@
 include vistos/VERSION
-include vistos/src/gpo/index/congress/*.bgmap
+include vistos/src/gpo/index/congress/*.bgmap
+include vistos/src/gpo/index/bills/*.bgmap

README.md

Lines changed: 3 additions & 20 deletions
@@ -227,7 +227,7 @@ The `terms` property returns a `list` of `BioguideTermRecord` objects describing

 `Congress` is used to query a single congress, and takes either a year or number to determine which congress to return.

-For example, the following Congress objects all return the 116<sup>th</sup> U. S. Congress:
+For example, the following `Congress` objects all return the 116<sup>th</sup> U. S. Congress:

 ``` python
 a = v.Congress(116)
@@ -260,7 +260,7 @@ Calling `get_member_bioguide()` returns a `BioguideMemberRecord` corresponding t

 #### `.get_member_govinfo(bioguide_id: str)` <a name="get_member_govinfo"></a>

-Calling `get_member_bioguide()` returns a `dict` containing the GovInfo data corresponding to the given Bioguide ID.
+Calling `get_member_govinfo()` returns a `dict` containing the GovInfo data corresponding to the given Bioguide ID.

 #### `.number` <a name="congress_number"></a>

@@ -289,7 +289,7 @@ print(c.bioguide)

 #### `.govinfo` <a name="congress_govinfo"></a>

-The `govinfo` property returns GovInfo data as `GovInfoCongressRecord` .
+The `govinfo` property returns GovInfo data as a `GovInfoCongressRecord` .

 #### `.members` <a name="congress_members"></a>

@@ -521,21 +521,4 @@ If you'd like to contribute to the project, or know of a useful data source, fee
 1. GovInfo data only goes as far back as the 105<sup>th</sup> Congress

 The GovInfo API makes congress persons' data available via "Congressional Directories", which are only provided starting with the 105<sup>th</sup> Congress. If data for an earlier congress is needed, use Bioguide data instead.
-
-1. Downloading bills is very slow
-
-The GovInfo API is geared towards bulk data and does not function efficiently for low-granularity queries. To download the bills for a single congress, V may have to send requests to as many as twenty-thousand different endpoints, taking as long as an hour to download the full dataset. To understand why this is, a deeper explanation of the GovInfo API is needed.
-
-Firstly, GovInfo datasets are organized by collections, which contain packages. A package is a snapshotted version of a given dataset. For example, in the Congressional Directory collection (denoted as CDIR), each package represents a unique version of a directory. Each time a new directory is created or an existing one is updated, it is made available in the CDIR collection under a new modified date. To get the most recent Congressional Directory for a given congress, you would need to look for the package with the most recent modified date.
-
-Bills are a unique collection, which are queryable by four parameters: a start date, an end date, the congress number, and the class of the documents you're looking for (in the case of bills, this could be Senate Bills, House Joint Resolutions, Senate Concurrent Resolutions, etc.) Filtering down by any combination of the latter two (congress number and document class) can still result in thousands of records. For example, the 115<sup>th</sup> Congress had 10,740 House bills.
-
-The maximum dataset size that can be downloaded from a single endpoint is 10,000 records, so in order to download all of the House bills for the 115<sup>th</sup> congress, the start date and end date parameters would have to be used to limit the size of the dataset. However, these date parameters do not use the date that the bills were issued, as one might expect. Instead, they use the last modified date of the packages. This is made even more difficult by the fact that a bill package can be modified outside of the term that it was issued during, so incrementally searching the dates between the beginning and end of the congress you're querying does not work.
-
-If that didn't make matters difficult enough, a large amount of records have modified dates occurring on the same day, meaning that once you've found the right date to query, you'd have to segment your time window even further to accomodate the 10,000-record-limit
-
-To work around these limitations, V begins searching for bills by doing an "open query" for one record and checking the header information for the total amount of expected records. Using that total amount, V then begins to work its way backwards over each year, until finding records. If the amount of records enocountered is larger than the record limit, V begins searching the months of the year to segment the data futher. If the dataset for a month is larger than the record limit, V searches the days. It repeats this pattern until it finds a unit of time small enough to segment the data to a size below the record limit, all the way down to seconds. V continues this recursize search until it downloads the total expected records. V might end up sending hundreds of requests before even being able to download data, and if 10,000 records were ever modfied in a single second, V would break, as seconds are the maximum depth by which V searches. Obviously, this is not the ideal approach, but it's an approach that works (mostly.)
-
-It may seem abhorrent - in this era of "big data" and numerous tools capable of acting on hundreds of millions of records in a few seconds - that a dataset in the tens of thousands could take so long to download. However, this approach is a necessary evil until the design of the GovInfo API is improved.
-

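For readers of this hunk: the README section removed here described splitting a query window until each piece fits under the 10,000-record cap. The sketch below illustrates that idea under stated assumptions and is not the code in this repository. It halves a window of whole seconds instead of stepping through years, months, and days as the removed text describes, and `split_windows`, `make_counter`, and the `count_records` callback are hypothetical names standing in for whatever request reports the number of packages modified inside a window.

```python
from bisect import bisect_left
from typing import Callable, List, Sequence, Tuple

RECORD_LIMIT = 10_000  # per-endpoint cap quoted in the removed README text

Window = Tuple[int, int]  # half-open [start, end) in whole epoch seconds


def split_windows(start: int, end: int,
                  count_records: Callable[[int, int], int],
                  limit: int = RECORD_LIMIT) -> List[Window]:
    """Return windows that each contain at most `limit` records.

    `count_records(start, end)` is a hypothetical callback; a real client
    would issue a request and read the expected total from the response.
    """
    total = count_records(start, end)
    if total == 0:
        return []  # nothing was modified in this window, skip it
    if total <= limit:
        return [(start, end)]
    if end - start <= 1:
        # Mirrors the limitation in the removed text: one second is the
        # smallest unit searched, so an overfull second cannot be split.
        raise RuntimeError("more than `limit` records modified in one second")
    midpoint = (start + end) // 2
    return (split_windows(start, midpoint, count_records, limit)
            + split_windows(midpoint, end, count_records, limit))


def make_counter(timestamps: Sequence[int]) -> Callable[[int, int], int]:
    """Build an in-memory stand-in for the remote count, for experimenting."""
    ordered = sorted(timestamps)

    def count(start: int, end: int) -> int:
        return bisect_left(ordered, end) - bisect_left(ordered, start)

    return count
```

Each returned window could then be downloaded with a single request. Halving keeps the sketch short; stepping by calendar units, as the removed text describes, would produce windows that line up with years, months, and days.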

tests/test_integration.py

Lines changed: 1 addition & 1 deletion
@@ -101,7 +101,7 @@ def test_parameterless_congress_query(self):
         self.assertEqual(congress_a.number, congress_b.number)
         self.assertEqual(congress_a.start_year, congress_b.start_year)
         self.assertEqual(congress_a.end_year, congress_b.end_year)
-        self.assertEqual(congress_a.bioguide, congress_b.bioguide)
+        self.assertEqual(len(congress_a.bioguide), len(congress_b.bioguide))

     def test_govinfo_congress_query(self):
         """Validate requesting govinfo data with a Congress object"""

vistos/VERSION

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-0.9.65
+0.9.72

vistos/src/duo.py

Lines changed: 2 additions & 2 deletions
@@ -42,7 +42,7 @@ def search_govinfo_members(govinfo_api_key, first_name=None, last_name=None,
 class CongressBills(list):
     """An object for downloading bills for a single Congress"""

-    def __init__(self, congress_number, govinfo_api_key, bill_type=None,
+    def __init__(self, congress_number, govinfo_api_key,
                  load_immediately=True):
         self._bills = None

@@ -293,7 +293,7 @@ def __init__(self, number_or_year=None, govinfo_api_key=None,

         if govinfo_bills_data_exists:
             self._bills = \
-                CongressBills(self._number, govinfo_api_key, None, False)
+                CongressBills(self._number, govinfo_api_key, False)
         else:
             include_bioguide = True

vistos/src/gpo/bioguideretro.py

Lines changed: 24 additions & 3 deletions
@@ -3,9 +3,12 @@
 import json as _json
 import time as _time
 import re as _re
+import sys as _sys
 from typing import List, Optional, Callable
 # from xml.etree import ElementTree as XML
 from defusedxml import ElementTree as _XML
+from queue import PriorityQueue
+from threading import Thread

 import requests as _requests
 from bs4 import BeautifulSoup as _BeautifulSoup
@@ -507,9 +510,27 @@ def _query_members_by_id(bioguide_ids: list) -> BioguideMemberList:
     """Gets a BioguideMemberList object corresponding
     to the given list of bioguide IDs"""
     member_records = list()
-    for bioguide_id in bioguide_ids:
-        member_record = _query_member_by_id(bioguide_id)
-        member_records.append(member_record)
+
+    def _get_members_concurrently():
+        while True:
+            bioguide_id = q.get()
+            member_record = _query_member_by_id(bioguide_id)
+            member_records.append(member_record)
+            q.task_done()
+
+    q = PriorityQueue(_util.NUMBER_OF_THREADS * 2)
+    for _ in range(_util.NUMBER_OF_THREADS):
+        t = Thread(target=_get_members_concurrently)
+        t.daemon = True
+        t.start()
+
+    try:
+        for bioguide_id in bioguide_ids:
+            q.put(bioguide_id)
+
+        q.join()
+    except KeyboardInterrupt:
+        _sys.exit(1)

     return BioguideMemberList(member_records)

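A note on the pattern in this hunk: the change replaces a sequential loop with daemon worker threads that pull IDs from a bounded queue. The standalone sketch below shows the same producer/worker shape with one variation that is not in the commit: `task_done()` is called in a `finally` block, so a failing lookup cannot leave `join()` waiting forever. The names `map_concurrently` and `fetch` are hypothetical, and a plain `queue.Queue` is used in place of `PriorityQueue`.

```python
from queue import Queue
from threading import Thread
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def map_concurrently(fetch: Callable[[T], R], items: Iterable[T],
                     workers: int = 4) -> List[R]:
    """Apply `fetch` to every item using a small pool of daemon threads.

    Results are collected in completion order, not input order.
    """
    results: List[R] = []
    errors: List[BaseException] = []
    q: Queue = Queue(workers * 2)  # bounded, so the producer cannot run far ahead

    def worker() -> None:
        while True:
            item = q.get()
            try:
                results.append(fetch(item))
            except Exception as exc:  # remember the failure and keep draining
                errors.append(exc)
            finally:
                q.task_done()  # always signal, or q.join() would block forever

    for _ in range(workers):
        # Daemon threads stay parked on q.get() and end with the process.
        Thread(target=worker, daemon=True).start()

    for item in items:
        q.put(item)
    q.join()  # wait until every queued item has been processed

    if errors:
        raise errors[0]
    return results
```

Because results arrive in completion order, two runs over the same IDs can return the same records in different orders. That may be why the integration test in this commit now compares lengths and no longer compares the lists directly, though the commit itself does not say so.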
