Datasets by Status

cryptowanderer · June 13, 2018, 4:31pm

There have been two initiatives recently to gather data from the Ethereum community in the form of text-based questions and interviews. Status has also been doing a lot of UX research and gathering all kinds of user data.

All this data needs to be analysed closely and shared with the public in an easy-to-understand way such that everyone interested in Ethereum can see directly what the biggest needs are, what really matters to people in the community, where the best opportunities are, what the most widely-used tools are, and where to find the most insightful and educational resources.

We have begun that here.

When it comes to data visualisation, you have several ways to analyze text.

Word clouds: great for analyzing frequency and sharing punchy insights.
Topic modelling, which can be broken down into three sub-fields:
1. Supervised: A human says “This is a dog, this is a cat etc”. Generally gives the most accurate results, but is time consuming and can be very biased.
2. Unsupervised: The machine comes up by itself with the cluster. A humans says, “We have this kind of text, find some kind of structure”.
  Say we want to analyse visitors to website coming between 9-2pm, or 5-8pm from US or Canada. The model can do this - and find patterns we may not have been expecting at all - but it is then up to us to interpret why we have these kind of patterns.
  This is what we are trying to do with the interactive visualisations.
3. Semi-supervised: A human specifies, “There is a business, there is something called ‘dinosaurs’” and then tells the model: “try to find something that matches”. This is what we are doing for the word clouds: i.e. telling the model to focus on important words like “web3”, or teaching it that “VS == Visual Studio” and then setting it off to calculate the frequency and patterns by itself.
For more numerical datasets, there are a tonne of other graph-like visualisations to play around with. We’re not sure that many of these could be applied to such text-heavy and unstructured data, but are thinking about how to show in an interactive way the key things that were said about certain topics.
This means choosing something like “Truffle” and hopefully being able to show both that testing and prototyping is easy, but that adapting it for complex use cases is harder.

So, how do we actually do this for EIP0 and ETHPrize (and all the other datasets to come)?

Sentences are broken down into “tokens” (i.e. individual words), which are organised in associative arrays. We then calculate the frequencies of these tokens and assign weights to them probabilistically to map out the interactive visualisation showing circles of linked ideas.

You might be asking, “What do the positions on the graph really indicate? And why do they become red when I hover over them and show up different bars on the accompanying chart?”

The size of any given circle indicates the weight of the topic in the whole dataset (closely related to the frequency with which certain terms were mentioned, shown in the red bars on the chart). You can use these frequencies, when the data is highly correlated, to decide on names and categories with which to understand the patterns discovered in the data more easily. Data still needs interpretation…

The actual positions of the circles on the graph have to do with something called the Principal Component Analysis.

You transform the data into its “principal components” - which means changing the base along which you view it. You can split the data along any number of axes, but 2 or 3 are obviously easiest to visualise. The idea when doing this is to split the axes in such a way that it captures the most variance between the data.
A full description of what Principal Component Analysis can be found on Wikipedia but I recommend this visual example.
The circles are then mapped in the same plane as the principal components.
pyLDAvis is used for the visualisation.

Even though this is cutting edge stuff, there are no fancy neural networks at this stage. It’s also just difficult to handle the data, because it is totally non-numeric and unstructured. However, there are still some really cool things we can do.

We first need to do some manual tagging that will help the model identify what is important and focus its search for patterns around certain things we have taught it.

We use a library called Spacy, which is a neural network model trained on a dataset called Common Crawl from all web pages that are available in the world (literally growing by terabytes every month!). It’s made up of blogs, news and comments in English, all tagged manually (somehow?).

By tagging the data and training the model a bit and by using awesome libraries like Spacy, we can, for instance, build things that recognise only names in a text, and include them in a word cloud as opposed to the (prettier, but more scattered) one Alex van de Sande did.

Which is cool, if you understand Ethereum, but not so helpful without a lot of context, which is exactly what we want to avoid. So we tag the data, run it through a neural net and produce the below:

The Really Big Idea

Even more exciting, we can create a dataset trained on GitHub, ethresear.ch, ethereum-magicians.org, Medium, developer news and tooling sites etc. and continually pull data in to analyze, as this sort of methodology scales very well.

Decentral-AI-ze all the things!!

hester · June 14, 2018, 3:02pm

This is amazing work! Still figuring out how to give +20 likes on this status discussion forum.

ale · June 18, 2018, 10:56am

Andy, this is great! I love the data visualization field and wonder how we might use the machine-generated digests you’re proposing in combination with Edward Tufte’s viz methods to empower people to spot patterns and learn visually.

The machine learning processing approach could provide valuable insights when coupled with human feedback, and it’s very important to formulate the questions we want to answer very precisely. For example, if we wanted a list of qualities that people appreciate in role models, we could run a Spacy-powered name counter to get the top five role models on the question Who are 3 or more people in the Ethereum Community that make good role models? What do you admire in them?. Once we have five names, someone could look at all the qualities associated with these people and organize them semantically, which is very tough for machines to do at the moment.

The default Github CSV view doesn’t lend itself for qualitative analysis, so here’s a more human-readable version of the EIP0 Shared Values answers.

I want to reiterate (and with your help, refine) Andy’s goals for the analysis of these datasets.

To answer the following questions for those new to the Ethereum ecosystem:

what are Ethereum’s biggest needs?
what matters most to the Ethereum community?
where are the best opportunities in Ethereum?
what are the most widely-used tools in Ethereum?
where do you find the most insightful and educational resources?

cryptowanderer · June 20, 2018, 6:20pm

Hey @ale - thanks so much for this feedback! Really appreciate it, esepcially the advice on how to tag - it’s still something Mamy and I have been discussing (his focus is on Nimbus, so he’s doing this in his spare time).

I think it’s important to keep the EIP0 stuff separate from the ETHPrize stuff for now, although there is some overlap, the goals are largely different. Best to ask @jarradhope for his thoughts on on goals for EIP0.

As for me, I want to know:

What are the most common development/tooling problems?
What are the most frequently mentioned tools?
2a. In what context are these tools mentioned: a positive or negative one?
What are the best educational resources?
What potential bounties are mentioned most often?
4a. How good is each bounty? (yes, “good” is a slippery word, likely worth doing this manually).
What are people’s biggest frustrations in general?