Cytoscape StringApp

In these exercises, we will use the stringApp for Cytoscape to retrieve molecular networks from the STRING and STITCH databases. The exercises will teach you how to:

  • retrieve networks for proteins or small-molecule compounds of interest
  • retrieve networks for a disease or an arbitrary topic in PubMed
  • lay out and visually style the resulting networks
  • import external data and map them onto a network
  • perform enrichment analyses and visualize the results
  • merge and compare networks
  • select proteins by attributes
  • identify functional modules through network clustering




The original version of this tutorial was developed by Lars Juhl Jensen of the Novo Nordisk Center for Protein Research at the University of Copenhagen. We thank Professor Jensen for graciously allowing us to repackage the content for delivery as a Cytoscape tutorial.

Prerequisites

To follow the exercises, make sure that you have the latest version of Cytoscape installed. The exercises also require you to have certain Cytoscape apps installed: stringApp, enhancedGraphics and clusterMaker2, as well as the yFiles layout algorithms.

  • Go to Apps → App Store → Show App Store and use the search field at the top to search for one of the apps, for example stringApp.
  • The corresponding page at the Cytoscape App Store will open in a browser. Click the Install button.
  • In the Cytoscape interface, the app should be shown with a check mark in the App Store in the Control Panel.
  • Repeat with the other apps: enhancedGraphics, clusterMaker2 and yFiles layout algorithms.

If you are not already familiar with the STRING database, we highly recommend that you go through the short STRING exercises provided by the Jensen lab to learn about the underlying data before working with them in these exercises.

Exercise 1

In this exercise, we will perform some simple queries to retrieve molecular networks based on a protein, a small-molecule compound, a disease, and a topic in PubMed.

1.1 Protein queries

  1. In the Network tab of the Control Panel, select STRING protein query in the drop-down and type in a protein name, for example SORCS2. Alternatively, use File → Import → Network from Public Databases....
  2. Click the More Options... button ☰ to view settings. Make sure the appropriate organism is selected (e.g. Homo sapiens).
  3. Maximum number of interactors determines how many interaction partners of your protein(s) of interest will be added to the network. The default setting is 10, which we will keep for this example.
  4. Click the 🔍 button to start the search.

Unless the name(s) you entered give unambiguous matches, a disambiguation dialog will be shown next. It lists all the matches that the stringApp finds for each query term and selects the first one for each. Select the one(s) you meant and continue by pressing the Import button.

How many nodes are in the resulting network? How does this compare to the maximum number of interactors you specified? What types of information does the Node Table provide?

1.2 Compound queries

  1. In the Network tab of the Control Panel, select STITCH compound query in the drop-down and type in your favorite compound, for example imatinib.
  2. You can select the organism and number of additional interactors just like for the protein query above, and the disambiguation dialog also works the same way.
  3. Click the 🔍 button to start the search.

How is this network different from the protein-only network with respect to node types and the information provided in the Node Table?

1.3 Disease queries

  1. In the Network tab of the Control Panel, select STRING disease query in the drop-down and type in a disease of interest, for example Alzheimer’s disease. The stringApp will retrieve a STRING network for the top-N proteins (by default 100) associated with the disease.
  2. Click the 🔍 button to start the search.
  3. The next dialog shows all the matches that the stringApp finds for your disease query and selects the first one. Make sure to select the intended disease before pressing the Import button to continue.

Which additional attribute column do you get in the Node Table for a disease query compared to a protein query? (Hint: check the last column.)

1.4 PubMed queries

  1. In the Network tab of the Control Panel, select STRING PubMed query in the drop-down and type in a query representing a topic of interest, for example jet-lag. You can use any query that would work on the PubMed website, but it should be a topic with related genes or proteins. The stringApp will query PubMed for the abstracts, find the top-N proteins (by default 100) associated with these abstracts, and retrieve a STRING network for them.
  2. Click the 🔍 button to start the search.

Which attribute column do you get in the Node Table for a PubMed query compared to a disease query? (Hint: check the last columns.)
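The four queries in this exercise can also be scripted through Cytoscape's command interface rather than the search bar. The sketch below only assembles the command strings and does not contact Cytoscape. The command names (`string protein query` and so on) and the argument names (`query`, `disease`, `pubmed`, `species`, `limit`) are assumptions about the stringApp command interface; check them with `help string` in the Cytoscape command tool before relying on them. The helper `string_query_command` is a hypothetical name introduced for illustration.

```python
def string_query_command(kind, term, species="Homo sapiens", limit=None):
    """Assemble a stringApp query command string (argument names are assumed)."""
    # Assumed name of the argument that carries the search term for each query type.
    term_argument = {
        "protein": "query",
        "compound": "query",
        "disease": "disease",
        "pubmed": "pubmed",
    }
    if kind not in term_argument:
        raise ValueError(f"unknown query type: {kind}")
    parts = [
        f"string {kind} query",
        f'{term_argument[kind]}="{term}"',
        f'species="{species}"',
    ]
    # limit: extra interactors (protein, compound) or number of proteins (disease, PubMed).
    if limit is not None:
        parts.append(f"limit={limit}")
    return " ".join(parts)


# The example terms are the ones used in sections 1.1 to 1.4.
for kind, term, limit in [
    ("protein", "SORCS2", 10),
    ("compound", "imatinib", 10),
    ("disease", "Alzheimer's disease", 100),
    ("pubmed", "jet-lag", 100),
]:
    print(string_query_command(kind, term, limit=limit))
```

If the argument names hold, each printed line could be pasted into the Cytoscape command tool to reproduce the corresponding query.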

Exercise 2

In this exercise, we are going to use the stringApp to query the DISEASES database for proteins associated with epithelial ovarian cancer (EOC), retrieve a STRING network for them, and explore the resulting network.

2.1 Disease network retrieval

  1. Close the current session in Cytoscape from the menu File → Close Session.
  2. In the Network tab of the Control Panel, select STRING disease query in the drop-down and type in ovary epithelial cancer.
  3. Set the Maximum number of proteins option to 250.
  4. Click the 🔍 button to start the search.
  5. Once the network appears, go to the menu View → Always Show Graphics Details to see the individual nodes and edges.

2.2 Work with node attributes

Note that the retrieved network contains a lot of additional information associated with the nodes and edges, such as the protein sequence, tissue expression data, subcellular localization, disease score (Node Table) as well as the confidence scores for the different interaction evidences (Edge Table). In the following, we will explore these data using Cytoscape.

  1. Find the disease score column in the node attributes table (look at the last columns). Sort it by values to see the highest and lowest disease scores.
  2. Highlight the corresponding nodes by selecting rows in the table, bringing up the context menu (right-click the selected rows) and choosing the Select nodes from selected rows option. Use one of the icons in the menu to zoom into the selected node.

Give an example of a node with the highest disease score and one with the lowest.

2.3 Inspect subcellular localization data

The stringApp automatically retrieves information about which compartments the proteins are located in from the COMPARTMENTS database. We will take a look at this database first to better understand the data.

  1. Go to COMPARTMENTS and enter ARID1A into the search box. The resulting page will show all matches for the query ARID1A.
  2. After selecting the human gene, you will see a schematic of where in the cell it is located and below it tables containing the specific lines of evidence that contribute to the overall score.

What compartments is ARID1A present in with a confidence of 5? What source do these interactions come from? Hint: you can see what the abbreviations for different evidence types mean here.

2.4 Continuous color mapping

Cytoscape allows you to map attributes of the nodes and edges to visual properties such as node color and edge width. Here, we will map the subcellular localization data for nucleus to the node color.

  1. From the Control Panel, select Style. Click on the button to the right of the property you want to change, in this case Fill Color and set Column to the node column containing the data that you want to use (nucleus).
  2. Since this is a numeric value, we will use the Continuous Mapping as the Mapping Type, and set a color gradient for how likely each protein is to be located in the nucleus. The default Cytoscape yellow–purple color gradient already gives a nice visualization of the confidence of being located in this compartment.

Many proteins are strongly associated with the nucleus – they will be purple.

2.5 Select proteins located in the nucleus

Because many proteins are located in the nucleus, we will identify the proteins with the highest confidence score of 5. One way to do this is to use the COMPARTMENTS sliders in the STRING Results panel on the right side.

  1. Go to the Node tab and expand the group of Compartment filters by clicking the small triangle.
  2. To hide all nodes with a compartments score below 5, find the slider for nucleus and set the low bound to 5 by entering the number.

How many proteins are found in the nucleus with a confidence of 5? And in mitochondrion? Hint: You can see the number of hidden nodes in the light grey panel bar on the bottom-right part of the network view panel, just above the Table panel.

Important: Move the filter back to 0 before continuing with the next exercise.
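The effect of the compartment slider can be pictured as a simple threshold on one Node Table column. The sketch below uses hypothetical node names and scores (not values from the actual network) and splits the nodes into shown and hidden sets for a given lower bound; it illustrates the idea and is not how the stringApp implements the filter.

```python
# Hypothetical nucleus confidence scores on the 0 to 5 scale.
nucleus_score = {
    "protein_A": 5.0,
    "protein_B": 4.6,
    "protein_C": 5.0,
    "protein_D": 1.2,
}


def split_by_low_bound(scores, low_bound):
    """Return (shown, hidden) node names for a slider set to low_bound."""
    shown = sorted(name for name, score in scores.items() if score >= low_bound)
    hidden = sorted(name for name, score in scores.items() if score < low_bound)
    return shown, hidden


shown, hidden = split_by_low_bound(nucleus_score, 5)
print(f"{len(shown)} shown, {len(hidden)} hidden")

# Setting the bound back to 0 shows every node again.
all_shown, none_hidden = split_by_low_bound(nucleus_score, 0)
```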

Exercise 3

In this exercise, we will work with a list of 541 proteins associated with epithelial ovarian cancer (EOC) as identified by phosphoproteomics in the study by Francavilla et al. An adapted, simplified version of their results table can be downloaded here. Download the file, and open it in Excel or a similar tool.

3.1 Protein network retrieval

  1. In the Network tab of the Control Panel, select STRING protein query in the drop-down and paste the list of UniProt accession numbers from the UniProt column in the table.
  2. Leave the default value for Maximum number of interactors.
  3. The disambiguation dialog shows all STRING proteins that cannot be matched to the query terms uniquely, with the first protein for each query term automatically selected. This default is fine for this exercise.
  4. Click the 🔍 button to start the search.
  5. Once the network appears, go to the menu View → Always Show Graphics Details to see the individual nodes and edges.

How many nodes and edges are there in the resulting network? Do the proteins all form a connected network? Why?

Cytoscape provides several visualization options under the Layout menu. Try the Degree Sorted Circle Layout, the Prefuse Force Directed Layout with score as edge weight, and yFiles Organic Layout.

Can you find a layout that allows you to easily recognize patterns in the network? What about the Edge-weighted Spring Embedded Layout with the attribute ‘score’, which is the combined STRING interaction score?

3.2 Discrete color mapping

Cytoscape allows you to map attributes of the nodes and edges to visual properties such as node color and edge width. Here, we will map drug target family data from the Pharos database to the node color. This data is contained in the node attribute called target family.

  1. Select Style from the Control Panel.
  2. Click the button to the right of the property you want to change, in this case Node Fill Color, and change Column from name to family, which is the node column containing the data that you want to use.
  3. This action will remove the rainbow coloring of the nodes and present you with a list of all the different values of the attributes that exist in the network.

Which target families are present in the network?

  1. To color the corresponding proteins, first click the field to the right of an attribute value, e.g. GPCR or Kinase, then click the ⋯ button and choose a color from the color selection dialog.
  2. You can also set a default color, e.g. for all nodes that do not have a target family annotation from Pharos, by clicking on the grey button in the first column of the same row.

How many of the proteins in the network are ion channels or GPCRs?

There are many kinases in the network. We can avoid counting them manually by creating a selection filter.

  1. Click on the Filter tab in the Control Panel.
  2. Click the + button and choose Column filter from the drop-down menu. Then, find and select the attribute Node: family. Write kinase in the text field to select all nodes with this annotation.

How many kinases are in the network?
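The discrete mapping and the column filter from this section can be pictured as a lookup table and a text match, respectively. The sketch below uses hypothetical nodes, family labels and colors; the case-insensitive "contains" match is an assumption about how the column filter treats the text you type.

```python
from collections import Counter

# Hypothetical node -> target family annotation; None means no annotation.
family = {
    "node_1": "Kinase",
    "node_2": "GPCR",
    "node_3": None,
    "node_4": "Kinase",
    "node_5": "Ion channel",
}

# Discrete mapping: one color per attribute value, plus a default color.
color_for_family = {"Kinase": "#1f78b4", "GPCR": "#33a02c", "Ion channel": "#e31a1c"}
DEFAULT_COLOR = "#cccccc"
node_color = {
    node: color_for_family.get(value, DEFAULT_COLOR) for node, value in family.items()
}

# Column filter: select nodes whose family contains the typed text.
kinase_nodes = sorted(
    node for node, value in family.items() if value and "kinase" in value.lower()
)

# Counting every family at once avoids building one filter per value.
family_counts = Counter(value for value in family.values() if value is not None)
print(kinase_nodes, dict(family_counts))
```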

3.3 Data import

Network nodes and edges can have additional information associated with them that we can load into Cytoscape and use for visualization. We will import the data from the text file.

  1. To import the node attributes file into Cytoscape, go to File → Import → Table from File. The preview in the import dialog will show how the file is interpreted given the current settings and will update automatically when you change them.
  2. To change the default selection, click the arrow in the column heading. For example, you can decide whether the column is imported or not by changing the Meaning of the column (hover over each symbol with the mouse to see what they mean). This column-specific dialog will also allow you to change the column name and type.

Now you need to map unique identifiers between the entries in the data and the nodes in the network. The key point of this is to identify which nodes in the network are equivalent to which entries in the table. This enables mapping of data values into visual properties like Fill Color and Shape. This kind of mapping is typically done by comparing the unique identifier attribute value for each node (Key Column for Network) with the unique identifier value for each data value (key symbol). As a default, Cytoscape looks for an attribute value of ‘ID’ in the network and a user-supplied Key in the dataset.

The Key Column for Network can be changed using a combo box and allows you to set the node attribute column that is to be used as key to map to.

  1. In this case, we will use query term because this attribute contains the UniProt accession numbers you entered when retrieving the network.
  2. You can also change the Key by pressing the key button for the column that is to be used as key for mapping values in the dataset. In this case it is the first column in the table called UniProt, from where you copied the identifiers.
  3. Click OK to import the data.

If there is a match between the value of a Key in the dataset and the value of the Key Column for Network field in the network, all attribute–value pairs associated with the element in the dataset are assigned to the matching node in the network. You will find the imported columns at the end of the Node Table.
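The key-based matching described above is essentially a join between two tables. The following sketch, using hypothetical accessions and column values, copies every non-key column of a matching table row onto the node whose key column holds the same identifier. It illustrates the logic only and is not Cytoscape's import code; `import_table` is a name made up for this example.

```python
# Hypothetical Node Table: node name -> attributes, including the "query term" key.
node_table = {
    "node_1": {"query term": "ACC001"},
    "node_2": {"query term": "ACC002"},
    "node_3": {"query term": "ACC003"},
}

# Hypothetical rows of the data file, keyed by the "UniProt" column.
rows = [
    {"UniProt": "ACC001", "log ratio": 2.5},
    {"UniProt": "ACC003", "log ratio": -1.0},
    {"UniProt": "ACC999", "log ratio": 0.3},  # no matching node, so it is ignored
]


def import_table(node_table, rows, network_key, table_key):
    """Copy table columns onto nodes whose network_key equals the row's table_key."""
    rows_by_key = {row[table_key]: row for row in rows}
    matched = 0
    for attributes in node_table.values():
        row = rows_by_key.get(attributes.get(network_key))
        if row is None:
            continue  # node has no entry in the data file
        for column, value in row.items():
            if column != table_key:
                attributes[column] = value
        matched += 1
    return matched


matched = import_table(node_table, rows, network_key="query term", table_key="UniProt")
print(matched, "nodes received imported values")
```

Nodes without a matching row simply keep empty cells in the imported columns, which is why choosing the right key column on both sides matters.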

3.4 Continuous color mapping

Now, we want to color the nodes according to the quantitative phosphorylation data (log ratio) between disease and healthy tissues for the most significant site for each protein.

  1. From the Control Panel, select Style. Then click on the button to the right of the property you want to change, for example Node Fill Color.
  2. Next, set Column to the node column containing the data that you want to use (EOC vs EOS&FTE).
  3. Since this is a numeric value, we will use the Continuous Mapping as the Mapping Type, and set a color gradient for how abundant each protein is. The default Cytoscape color gradient blue–white–red already gives a nice visualization of the log ratio.

Are the up-regulated nodes grouped together? Do you see any issues with the color gradient?

  1. To change the colors, double-click on the color gradient in order to bring up the Continuous Mapping Editor window and edit the colors for the continuous mapping.
  2. In the mapping editor dialog, the color that will be used for the minimum value is on the left, and the maximum is on the right. Double-click on the triangles on the top and sides of the gradient to change the colors.
  3. The triangles on the top represent the values at which the data will be clipped; anything above the right triangle will be set to the max value. This is useful if you have a small number of values that are significantly higher than the median. As you move the triangles and change the color, the display in the network pane will automatically update – this is all easier to do than to explain!
  4. If at any point it does not seem to work as expected, it is easiest to just delete the mapping and start again.

Can you improve the color mapping such that it is easier to see which nodes have a log ratio below -4 and above 4?
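One way to think about the gradient and its clipping handles is as piecewise linear interpolation between three colors, with values outside the handles pinned to the end colors. The sketch below assumes handles at -4, 0 and 4 and a blue, white, red palette; it is an approximation of the idea, not Cytoscape's own mapping code.

```python
BLUE, WHITE, RED = (0, 0, 255), (255, 255, 255), (255, 0, 0)


def blend(start, end, fraction):
    """Linearly interpolate between two RGB colors (fraction in 0..1)."""
    return tuple(round(a + (b - a) * fraction) for a, b in zip(start, end))


def diverging_color(value, low=-4.0, mid=0.0, high=4.0):
    """Map a log ratio to a color; values beyond low/high are clipped."""
    clipped = max(low, min(high, value))
    if clipped <= mid:
        return blend(BLUE, WHITE, (clipped - low) / (mid - low))
    return blend(WHITE, RED, (clipped - mid) / (high - mid))


for log_ratio in (-6.0, -4.0, 0.0, 4.0, 7.5):
    print(log_ratio, diverging_color(log_ratio))
```

With handles at -4 and 4, every node beyond those values gets the fully saturated end color, so a few extreme values no longer wash out the rest of the gradient.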

3.5 Network clustering

Next, we will use the MCL algorithm to identify clusters of tightly connected proteins within the network. Go to the menu Apps → clusterMaker Cluster Network → MCL Cluster. Set the Granularity parameter (inflation value) to 4 and choose the stringdb::score attribute (i.e. the overall STRING confidence score) as Array Source, select the option Create new clustered network, and click OK to start the clustering. The app will now run the algorithm and automatically create a network showing the clusters.

How many clusters have at least 10 nodes?
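To give a feel for what the inflation value does, here is a small textbook-style sketch of Markov clustering on a toy graph: two triangles joined by one weak edge. The weights are hypothetical, and the code is a simplified illustration rather than the clusterMaker2 implementation, which may differ in details such as pruning and convergence checks.

```python
def normalize_columns(matrix):
    """Scale each column in place so that it sums to 1."""
    size = len(matrix)
    for col in range(size):
        total = sum(matrix[row][col] for row in range(size))
        for row in range(size):
            matrix[row][col] /= total


def mcl(weights, inflation=4.0, iterations=30, eps=1e-6):
    """Cluster a symmetric weight matrix with a basic MCL loop."""
    size = len(weights)
    # Add self-loops, then turn columns into transition probabilities.
    m = [[weights[i][j] + (1.0 if i == j else 0.0) for j in range(size)]
         for i in range(size)]
    normalize_columns(m)
    for _ in range(iterations):
        # Expansion: square the matrix (flow spreads along two-step walks).
        m = [[sum(m[i][k] * m[k][j] for k in range(size)) for j in range(size)]
             for i in range(size)]
        # Inflation: raise entries to a power, which strengthens strong flow.
        m = [[value ** inflation for value in row] for row in m]
        normalize_columns(m)
    # Read clusters as connected groups of nodes that still share flow.
    seen, clusters = set(), []
    for start in range(size):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(other for other in range(size)
                         if m[node][other] > eps or m[other][node] > eps)
        seen |= component
        clusters.append(sorted(component))
    return clusters


# Toy graph: triangles 0-1-2 and 3-4-5 (weight 1.0) joined by a weak edge 2-3.
size = 6
weights = [[0.0] * size for _ in range(size)]
for a, b in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    weights[a][b] = weights[b][a] = 1.0
weights[2][3] = weights[3][2] = 0.1

clusters = mcl(weights, inflation=4.0)
print(clusters)
```

Higher inflation values cut weak connections more aggressively and therefore tend to produce more, smaller clusters.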

We will work with the largest cluster in the network (it should be in the upper left corner). Select the nodes of this cluster by holding down the modifier key (Shift on Windows, Ctrl or Command on Mac) and then left-clicking and dragging to select multiple nodes. Then, create a new network by clicking on the New Network from Selection (All Edges) button.

How many nodes and edges are there in this cluster?

3.6 Functional enrichment and enriched publications

Next, we will retrieve functional enrichment for the proteins in our network of the largest cluster.

After making sure that no nodes are selected in the network, go to the menu Apps → STRING Enrichment → Retrieve functional enrichment and keep the default settings. A new STRING Enrichment tab will appear in the Table Panel on the bottom. It contains a table of enriched terms and corresponding information for each enrichment category. You can see which proteins are annotated with a given term by selecting the term in the STRING Enrichment panel.

Which are the four most statistically significant terms? Do the UniProt and GO Process terms agree with each other, i.e., annotate the same set of nodes?
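Enrichment p-values of this kind are commonly computed with a hypergeometric test: given how many proteins in the background carry a term, how surprising is the number found in your protein set? The sketch below shows that calculation with made-up numbers. Whether it matches the exact procedure behind the STRING Enrichment table is an assumption, and it omits the multiple-testing correction that a real analysis needs.

```python
from math import comb


def hypergeom_enrichment_p(background, annotated, selected, overlap):
    """P(at least `overlap` annotated proteins among `selected` random draws).

    background: number of proteins in the background set
    annotated:  background proteins that carry the term
    selected:   size of the protein set being tested
    overlap:    proteins in the set that carry the term
    """
    upper = min(annotated, selected)
    favorable = sum(
        comb(annotated, k) * comb(background - annotated, selected - k)
        for k in range(overlap, upper + 1)
    )
    return favorable / comb(background, selected)


# Made-up example: 3 of 10 background proteins carry the term,
# and all 3 proteins in the tested set carry it as well.
p_value = hypergeom_enrichment_p(background=10, annotated=3, selected=3, overlap=3)
print(p_value)
```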

Next, we will visualize the top-5 enriched terms in the network using split charts. Click the colorful chart icon to show the terms as charts on the network. You can manually change the layout of the network to improve the visualization. First apply the yFiles Organic Layout and then scale the network to reduce the overlap of the charts using the Node Layout Tools (Layout → Layout Tools).

To retrieve a list of publications that are enriched for the proteins in the network, go to the menu Apps → STRING Enrichment → Retrieve enriched publications. A new STRING Publications tab will appear in the Table Panel on the bottom. It contains a table of enriched publications and associated information such as how many of the network proteins were mentioned in each publication.

What is the title of the most recent publication?

To save the list of enriched terms and associated p-values as a text file, go to Apps → STRING Enrichment → Export enrichment results.

3.7 Overlap networks

Cytoscape provides functionality to merge two or more networks, building either their union, intersection or difference. We will now merge the EOC network we have from the DISEASES query with the one we have from the data, so that we can identify the overlap between them. Use the Merge tool (Tools → Merge → Networks…) and select the Intersection button. Then, select the two STRING networks from the Available Networks list (‘String Network - ovary epithelial cancer’ and ‘String Network’). Click on > to add them to the list of Networks to Merge and click Merge.

How many nodes are in the intersection?

3.8 Integrate networks

Now we will make the union of the intersection network, which contains the disease scores, and the experimental network. Use the Merge tool again to make the Union of ‘Merged Network’ and ‘String Network’. Make sure that the new merged network has the same number of nodes and edges as ‘String Network’, and that some nodes have a disease score.
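The two merges in sections 3.7 and 3.8 can be sketched with set operations on node names. The example below uses hypothetical nodes and attribute values and assumes that nodes are matched by a shared name; it also shows why the union of the intersection with the experimental network has the same nodes as the experimental network while gaining disease scores.

```python
# Hypothetical node tables: node name -> attributes.
disease_net = {
    "P1": {"disease score": 3.2},
    "P2": {"disease score": 1.1},
}
data_net = {
    "P2": {"log ratio": -1.5},
    "P3": {"log ratio": 2.0},
}


def merge(net_a, net_b, operation):
    """Merge two node tables by name, combining attributes of shared nodes."""
    if operation == "intersection":
        names = net_a.keys() & net_b.keys()
    elif operation == "union":
        names = net_a.keys() | net_b.keys()
    else:
        raise ValueError(f"unsupported operation: {operation}")
    return {
        name: {**net_a.get(name, {}), **net_b.get(name, {})}
        for name in sorted(names)
    }


overlap = merge(disease_net, data_net, "intersection")  # section 3.7
integrated = merge(overlap, data_net, "union")          # section 3.8
print(len(overlap), "shared node(s);", len(integrated), "node(s) after the union")
```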

Now we can change the visualization of the merged network to make proteins with a high disease score easy to identify. Specifically, we will change the size of the nodes as a function of their disease score. Select Style in the Control Panel. Click on the Lock node width and height option to enable it, so that the nodes have a single Size attribute instead of separate Height and Width attributes. Click on the ◀ button to add a continuous mapping of the Size attribute using the disease score. Modify the values so that the default node size is 30 and the mapping goes from 35 to 50. To change the default value, click on the default 35.0 value to the left of the Size attribute. To change the mapping values, double-click on the chart, then double-click on the square corresponding to the value you want to modify and enter the new value (35 and 50).
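The size mapping described above amounts to a linear rescaling of the disease score onto the range 35 to 50, with the default size of 30 for nodes that have no score. A minimal sketch, assuming a hypothetical score range of 0 to 5:

```python
DEFAULT_SIZE = 30.0


def node_size(score, min_score, max_score, min_size=35.0, max_size=50.0):
    """Map a disease score linearly onto a node size; None gets the default."""
    if score is None:
        return DEFAULT_SIZE  # node without a disease score
    if max_score == min_score:
        return min_size  # avoid division by zero when all scores are equal
    fraction = (score - min_score) / (max_score - min_score)
    return min_size + fraction * (max_size - min_size)


# Hypothetical scores; None marks a node that is not in the disease network.
for score in (None, 0.0, 2.5, 5.0):
    print(score, node_size(score, min_score=0.0, max_score=5.0))
```

Because the default (30) is smaller than the smallest mapped size (35), every node with a disease score stands out from the nodes without one.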

Supporting literature

Doncheva NT, Morris JH, Gorodkin J and Jensen LJ (2018). Cytoscape stringApp: Network analysis and visualization of proteomics data.
Preprint

CC BY 4.0