NASA Enterprise Directory
Data and Resources
- 51949EC4-BBD5-4170-A209-17782B54DB3F.zip.zip
  Archive containing all captured raw data, scripts used for data extraction,...
- NASA_Directory.csv (CSV)
  List of names and contact information of NASA employees and contractors in...
- bag-info.txt.txt
- bagit.txt.txt
- manifest-md5.txt.txt
- tagmanifest-md5.txt.txt
- 51949EC4-BBD5-4170-A209-17782B54DB3F.html (HTML)
  NASA directory search
- 51949EC4-BBD5-4170-A209-17782B54DB3F.json (JSON)
- NASA_Directory.csv (CSV)
- raw_pages_email.zip (ZIP)
- raw_pages_first_name.zip (ZIP)
- raw_pages_last_name.zip (ZIP)
- raw_pages_phone.zip (ZIP)
- track.gif (GIF)
- people.nasa.gov.1
- index.html.1
- base.css (text/css)
- ned.css (text/css)
- reset-fonts-grids.css (text/css)
- nasa_header_161616.png (PNG)
- nebula.jpg (JPEG)
- ned_logo_161616.png (PNG)
- search
- people.nasa.gov-2017-02-28-aa1b78a0-00000.warc
- people.nasa.gov-2017-02-28-aa1b78a0-00000.warc.gz
- 01_scrape_script_email.py (text/x-python)
- 02_scrape_script_last_name.py (text/x-python)
- 03_scrape_script_first_name.py (text/x-python)
- 04_scrape_script_phone.py (text/x-python)
- 05_extracting_table_data.py (text/x-python)
Additional Info
Field | Value |
---|---|
Source | https://people.nasa.gov |
Version | |
Author | |
Author Email | |
Maintainer | |
Maintainer Email | |
Shared (this field will be removed in the future) | Open |
IB1 Sensitivity Class | |
IB1 Trust Framework | |
IB1 Dataset Assurance | |
Free text description of capture process | Python: Selenium and PhantomJS for scraping, LXML for parsing. Ran an exhaustive series of searches by constructing URLs. Began by searching the email field for all valid two-character combinations, each followed by the wildcard '*'. If a search returned too many results to display on one page (more than 100), appended each possible additional character in the next round, and so on. The process ended when no search returned too many results to display on a single page. To find directory listings without email addresses, I repeated the process for last names, first names, and phone numbers. If a field included more than 100 identical entries, I constructed additional search loops on a case-by-case basis, all of which are included in the attached scripts. Because pages were rendered using JavaScript, I used a headless browser via Selenium and PhantomJS in Python to convert pages to static HTML. I parsed the resulting HTML files using LXML in Python, then wrote all data to a CSV file using the package unicodecsv. |
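
The prefix-refinement search described in the capture notes can be sketched roughly as follows. This is a minimal illustration, not the code from the attached scripts: the names `crawl_prefixes` and `make_fake_search` are hypothetical, the `search` callable stands in for one wildcard query against the directory, and the sketch assumes an overflowing query can be detected by counting its results. The in-memory stand-in replaces the real Selenium and PhantomJS page fetch so the logic can be run offline.

```python
from itertools import product


def make_fake_search(entries):
    """Hypothetical stand-in for one directory query such as 'ab*'."""
    def search(prefix):
        return [entry for entry in entries if entry.startswith(prefix)]
    return search


def crawl_prefixes(search, alphabet, limit=100, start_len=2, max_len=8):
    """Refine wildcard prefixes until every result set fits on one page.

    Returns (captured, unresolved): a dict mapping each prefix whose
    results fit within `limit` to those results, and a list of prefixes
    that still overflowed at `max_len` (for example, more than `limit`
    identical values), which would need case-by-case handling.
    """
    # Start with every combination of `start_len` characters.
    pending = ["".join(chars) for chars in product(alphabet, repeat=start_len)]
    captured, unresolved = {}, []
    while pending:
        prefix = pending.pop()
        rows = search(prefix)
        if len(rows) <= limit:
            # Fits on one page: keep it (skip prefixes with no matches).
            if rows:
                captured[prefix] = rows
        elif len(prefix) >= max_len:
            # Still too many results and no further refinement allowed.
            unresolved.append(prefix)
        else:
            # Too many results: try every one-character extension next.
            pending.extend(prefix + ch for ch in alphabet)
    return captured, unresolved


# Small demonstration with a two-letter alphabet and a page limit of 2.
demo_entries = ["aaaa", "aaab", "aaba", "abaa", "bbab"]
demo_captured, demo_unresolved = crawl_prefixes(
    make_fake_search(demo_entries), alphabet="ab", limit=2
)
```

The `max_len` guard is an assumption added here so the loop terminates when a field holds more than `limit` identical values, the situation the capture notes say was handled with additional case-by-case loops. One limitation of this sketch: a value exactly equal to an overflowing prefix is not matched by any longer prefix, so it would need a separate exact-match query.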