Week #6
I was going to continue my eye detection project for the midterm, but the presentation Jiaqi and I gave last class seemed to engage and interest people, so instead of stopping there we decided to improve it for our midterm.
We started by refining our goal: raising awareness of how seemingly innocuous personal data may be used to manipulate web users. We chose to do that by showing people all the (somewhat easily) obtainable information we can get on them just from them entering our website. (Side note: we didn't actually show all the possible information because a lot of it didn't seem very useful for the way we used it. That said, using all the data points in conjunction with one another might well prove very useful in a different setup.) We then took all of that data, sent it to ChatGPT, and asked it to use that information to infer anything it could about the user.
Some improvements from last week: we came up with a "report" design that lists all the data, with the ChatGPT response shown as the "conclusion" to that report. We also looked for additional ways to gather information. We incorporated cookies to show how many times the client has visited the site. We also made use of the Whois API to get information about the network the client is using. This is immensely useful in some cases. For example, when someone accesses the web at school you can actually tell that the client is on NYU's network, which could hint at their affiliation, occupation, physical whereabouts and more. We also added a button for downloading the report as a PDF file and, of course, refined the design...
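The visit counter mentioned above can be sketched roughly like this. This is only an illustration, not our actual code: the cookie name `visitCount` and both helper functions are made up for the example, and the one-year lifetime is an arbitrary choice.

```javascript
// Hypothetical cookie name; the post does not say what the real one is called.
const VISIT_COOKIE = "visitCount";

// Turn a cookie string like "a=1; b=2" into a plain object.
function parseCookieString(cookieString) {
  const cookies = {};
  for (const pair of cookieString.split(";")) {
    const index = pair.indexOf("=");
    if (index === -1) continue; // skip empty or malformed fragments
    const name = pair.slice(0, index).trim();
    const value = pair.slice(index + 1).trim();
    cookies[name] = decodeURIComponent(value);
  }
  return cookies;
}

// Read the previous count (0 if absent) and build the cookie for this visit.
function nextVisit(cookieString) {
  const previous = parseInt(parseCookieString(cookieString)[VISIT_COOKIE], 10);
  const count = (Number.isNaN(previous) ? 0 : previous) + 1;
  const oneYearInSeconds = 60 * 60 * 24 * 365;
  return {
    count,
    cookie: `${VISIT_COOKIE}=${count}; max-age=${oneYearInSeconds}; path=/`,
  };
}

// In a browser this would be wired up roughly as:
//   const visit = nextVisit(document.cookie);
//   document.cookie = visit.cookie;
console.log(nextVisit("theme=dark; visitCount=3").count);
```

Keeping the parsing separate from `document.cookie` means the logic can be tried outside the browser, which might have helped with some of the debugging described below.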
It took a while to figure out how to incorporate both the ChatGPT API and the Whois API, but once we did, that part worked relatively quickly. There wasn't one particular thing causing a lot of issues, but rather many small things that worked, then didn't, then worked again. I think the main problem was that at some point the code got quite long and cluttered, and it was hard to tell what led to what and in what order, especially as we were working together on the same files at the same time. Most of our issues were the kind where something is accidentally "undefined", suggesting data that isn't being sent, or isn't being sent correctly. Those were relatively easy to fix, as we just followed the path of the data and figured out where the problem was.
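One way to follow the path of the data is to check, at each step, which fields have not arrived yet. The helper below is a hypothetical sketch of that idea (the function name and the sample keys are made up for illustration, not taken from our project):

```javascript
// Hypothetical debugging helper: list the expected keys whose value is
// undefined or null at this point in the data's path.
function findMissingFields(data, expectedKeys) {
  return expectedKeys.filter(
    (key) => data[key] === undefined || data[key] === null
  );
}

// Made-up sample: netName was set to undefined, city was never added.
const sampleData = { browserLanguage: "en-US", windowWidth: 1280, netName: undefined };
console.log(
  findMissingFields(sampleData, ["browserLanguage", "windowWidth", "netName", "city"])
);
```

Calling something like this right before the data is sent and right after it is received would show on which side a field goes missing.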
The issues that were more difficult to solve were things like internal server errors where we couldn't tell what caused the issue. At some point we had to start just removing lines of code until we didn't see the error anymore, then redo some parts and unfortunately give up on others.
We made an object for each client, made up of key-value pairs for all the data points. The data isn't all gathered in the same manner or at the same time, so the information is added to that object in several different parts of the code. For example, this is the data gathered from the browser upon connection to the web socket:
clientDataCollection = {
  browser: navigator.userAgent,
  browserName: navigator.appName, // deprecated; modern browsers all report "Netscape"
  browserEngine: navigator.product, // deprecated; always "Gecko"
  browserVersion: navigator.userAgent, // same string as browser above
  browserLanguage: navigator.language,
  scrColorDepth: screen.colorDepth,
  windowWidth: window.innerWidth,
  windowHeight: window.innerHeight,
  timeOpened: new Date(),
  timezone: new Date().getTimezoneOffset() / 60,
  previousSites: history.length, // number of entries in this tab's session history, including the current page
};
* Unfortunately we had to get rid of the geolocation data. It was causing permission issues and sometimes crashing the entire thing. In any case, it requires asking the user for permission, which makes it less interesting to show since it's clearly consensual, and the point was to show the more "invisible" stuff anyway.
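A note on the timezone field in the object above: getTimezoneOffset() returns minutes and is positive when local time is behind UTC, so dividing by 60 gives 5 for New York in winter rather than -5. A small sketch of turning that value into a more readable label (the function name is hypothetical and not part of our code):

```javascript
// Convert the value returned by Date.prototype.getTimezoneOffset() (minutes,
// positive when local time is behind UTC) into a label such as "UTC-05:00".
function utcOffsetLabel(offsetMinutes) {
  const sign = offsetMinutes > 0 ? "-" : "+";
  const absolute = Math.abs(offsetMinutes);
  const hours = String(Math.floor(absolute / 60)).padStart(2, "0");
  const minutes = String(absolute % 60).padStart(2, "0");
  return `UTC${sign}${hours}:${minutes}`;
}

console.log(utcOffsetLabel(300)); // New York in winter
console.log(utcOffsetLabel(-330)); // India
console.log(utcOffsetLabel(new Date().getTimezoneOffset())); // wherever this runs
```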
All the network data (things we can know based on the IP address) is gathered through the whois API like so:
function parseWhoisData(whoisData) {
  let parsedData = {};
  let netNameMatch = whoisData.match(/NetName:\s*(.+)/);
  if (netNameMatch) {
    parsedData.netName = netNameMatch[1];
    document.getElementById("network-name").innerHTML = netNameMatch[1];
  }
  let orgNameMatch = whoisData.match(/OrgName:\s*(.+)/);
  if (orgNameMatch) {
    parsedData.orgName = orgNameMatch[1];
    // Note: this writes to the same "org-id" element as the OrgId block below,
    // so the name is overwritten on the page whenever an OrgId line is also found.
    document.getElementById("org-id").innerHTML = orgNameMatch[1];
  }
  let netRangeMatch = whoisData.match(/NetRange:\s*(.+)/);
  if (netRangeMatch) {
    parsedData.netRange = netRangeMatch[1];
    document.getElementById("network-range").innerHTML = netRangeMatch[1];
  }
  let cidrMatch = whoisData.match(/CIDR:\s*(.+)/);
  if (cidrMatch) {
    parsedData.cidr = cidrMatch[1];
    document.getElementById("cidr").innerHTML = cidrMatch[1];
  }
  let netTypeMatch = whoisData.match(/NetType:\s*(.+)/);
  if (netTypeMatch) {
    parsedData.netType = netTypeMatch[1];
    document.getElementById("network-type").innerHTML = netTypeMatch[1];
  }
  let orgMatch = whoisData.match(/Organization:\s*(.+)/);
  if (orgMatch) {
    parsedData.organization = orgMatch[1];
    document.getElementById("organization").innerHTML = orgMatch[1];
  }
  let regDateMatch = whoisData.match(/RegDate:\s*(.+)/);
  if (regDateMatch) {
    parsedData.regDate = regDateMatch[1];
    document.getElementById("reg-date").innerHTML = regDateMatch[1];
  }
  let updatedMatch = whoisData.match(/Updated:\s*(.+)/);
  if (updatedMatch) {
    parsedData.updated = updatedMatch[1];
    document.getElementById("updated").innerHTML = updatedMatch[1];
  }
  let orgIdMatch = whoisData.match(/OrgId:\s*(.+)/);
  if (orgIdMatch) {
    parsedData.orgId = orgIdMatch[1];
    document.getElementById("org-id").innerHTML = orgIdMatch[1];
  }
  let addressMatch = whoisData.match(/Address:\s*(.+)/);
  if (addressMatch) {
    parsedData.address = addressMatch[1];
    document.getElementById("address").innerHTML = addressMatch[1];
  }
  let cityMatch = whoisData.match(/City:\s*(.+)/);
  if (cityMatch) {
    parsedData.city = cityMatch[1];
    document.getElementById("city").innerHTML = cityMatch[1];
  }
  let stateProvMatch = whoisData.match(/StateProv:\s*(.+)/);
  if (stateProvMatch) {
    parsedData.state = stateProvMatch[1];
    document.getElementById("state").innerHTML = stateProvMatch[1];
  }
  let postalCodeMatch = whoisData.match(/PostalCode:\s*(.+)/);
  if (postalCodeMatch) {
    parsedData.postalCode = postalCodeMatch[1];
    document.getElementById("postal-code").innerHTML = postalCodeMatch[1];
  }
  let countryMatch = whoisData.match(/Country:\s*(.+)/);
  if (countryMatch) {
    parsedData.country = countryMatch[1];
    document.getElementById("country").innerHTML = countryMatch[1];
  }
  return parsedData;
}
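Since every block in the function above follows the same pattern, the parsing part could probably be written as a loop over a table of labels. The sketch below is one possible variation, not what we actually shipped: the names `WHOIS_FIELDS` and `parseWhoisFields` and the sample text are invented for illustration, it leaves out the page updates, and it assumes each label sits at the start of a line and contains no special regex characters.

```javascript
// Output key -> label as it appears in the raw Whois text (subset for brevity).
const WHOIS_FIELDS = {
  netName: "NetName",
  orgName: "OrgName",
  netRange: "NetRange",
  cidr: "CIDR",
  city: "City",
  country: "Country",
};

function parseWhoisFields(whoisData, fields = WHOIS_FIELDS) {
  const parsed = {};
  for (const [key, label] of Object.entries(fields)) {
    // The "m" flag lets ^ match at the start of every line, so a label is
    // only found when it begins a line (an assumption about the raw format).
    const match = whoisData.match(new RegExp(`^${label}:\\s*(.+)`, "m"));
    if (match) parsed[key] = match[1].trim();
  }
  return parsed;
}

// Made-up sample text using a documentation-only IP range.
const sampleWhois = [
  "NetRange:       192.0.2.0 - 192.0.2.255",
  "CIDR:           192.0.2.0/24",
  "NetName:        EXAMPLE-NET",
  "OrgName:        Example University",
  "City:           New York",
  "Country:        US",
].join("\n");

console.log(parseWhoisFields(sampleWhois));
```

With the parsing kept free of page updates, filling in the HTML elements could be a second loop over the returned object, which might also make it harder for two fields to end up writing to the same element by accident.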
A quick overview of the weird-looking match method. Take this line for example:
let countryMatch = whoisData.match(/Country:\s*(.+)/);
whoisData.match() uses a regular expression (/Country:\s*(.+)/) to search within the whoisData string. In the pattern /Country:\s*(.+)/:
Country: looks for this exact text.
\s* matches zero or more whitespace characters following "Country:".
(.+) is a capturing group that matches one or more characters after the whitespace, up to the end of the line. This is the actual country name.
If the pattern is found, countryMatch will be an array where:
countryMatch[0] contains the entire matched string (like "Country: United States").
countryMatch[1] contains the first capturing group, which is the country name (like "United States").
If the pattern isn't found, match() returns null, which is why every block checks the result with an if before using it.
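To see the match behaviour described above in isolation, here is a tiny standalone example. The input strings are made up; only the regular expression comes from our code.

```javascript
// A single made-up line in the same "Label: value" shape as the Whois text.
const sampleLine = "Country:        US";

const found = sampleLine.match(/Country:\s*(.+)/);
console.log(found[0]); // the entire matched text, label included
console.log(found[1]); // only the captured value after the whitespace

// When the label is not present, match() returns null instead of an array.
const notFound = "City: New York".match(/Country:\s*(.+)/);
console.log(notFound);
```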
The raw Whois data comes as one long string, so we need to extract the information we're interested in and parse it into a usable form. We then need to add each piece of data to the clientDataCollection object as key-value pairs. The code above shows how we look for the specific data points we're interested in and also set their values in the corresponding HTML elements to show them on the page. The parsedData object is then sent to another function that adds the new information to the clientDataCollection object.
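The step of adding new information to the client object could look something like the sketch below. The function name `addToClientData` and the sample values are hypothetical; our real function may differ.

```javascript
// Hypothetical helper: copy every key of newData onto the shared client object.
// Object.assign mutates and returns its first argument; if a key already
// exists, the newer value replaces the older one.
function addToClientData(clientData, newData) {
  return Object.assign(clientData, newData);
}

// Made-up values standing in for data gathered at different moments.
const clientData = { browserLanguage: "en-US", windowWidth: 1280 };
addToClientData(clientData, { netName: "EXAMPLE-NET", country: "US" });
addToClientData(clientData, { visitCount: 4 });

console.log(clientData);
```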
Of course, once we have the complete clientDataCollection, we convert the object to a string so it can be sent to ChatGPT along with instructions on what it should do with that information.
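A minimal sketch of that conversion, assuming JSON.stringify is used. The instruction text and the function name `buildPrompt` are invented for this example, and the actual request to the ChatGPT API is left out.

```javascript
// Hypothetical instruction text; the post does not quote the actual prompt.
const INSTRUCTIONS =
  "Here is data collected from a website visitor. Describe what can be inferred about them.";

function buildPrompt(clientData) {
  // JSON.stringify drops keys whose value is undefined and turns Date
  // objects into ISO 8601 strings.
  const dataAsText = JSON.stringify(clientData, null, 2);
  return `${INSTRUCTIONS}\n\n${dataAsText}`;
}

const examplePrompt = buildPrompt({
  browserLanguage: "en-US",
  timeOpened: new Date(0),
  netName: undefined,
});
console.log(examplePrompt);
```

Note that any field that is still undefined at this point silently disappears from the string, so ChatGPT never sees it.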
It's important to note that the network information depends on the IP address of the network the client is connecting from, and for that reason isn't always useful. But (and it's a big but) when used in conjunction with cookies (which might be phased out soon, though) and other data points, it might be possible to tell that it's the same client even when they connect from different locations, which in turn might be used to infer even more about the user. And this is just the "simple" stuff anyone with a website can get...