Practical exercises – Bioinformatics teachings (Olivier Croce)

Using the « CLC genomics » software for genome finishing

=> Connecting to the bioserver of the lab

The connection uses the « ssh » protocol. SSH lets you work on a remote computer as if it were your own.

Under Windows, a simple solution is to use SSH through a graphical client such as « Putty ».

Start “Xming”, which works alongside Putty to display remote graphical windows on your computer (Xming should be in the Windows “start menu”, under the name “xming”).

Launch « Putty », which should already be installed on your computer. Putty should be somewhere on the Windows desktop (look inside the “Logiciels” directory first, if it exists), in the « start menu », or in the “Program Files” folder.

In the options menu, look for a check box named « X11 » and tick it. « X11 » refers to the graphical display, which must be forwarded through SSH so that you can see the graphical interface of CLC.

Then fill in the connection parameters (“session” menu), using « 139.124.153.20 » as the server address, and click “open”:

A terminal (a window with a black background) will open and prompt you for a login and a password: enter « user01 » as login and « user01 » as password. You will type all further commands in this window.
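On Linux or macOS, where Putty and Xming are usually not needed, the same connection can be sketched with the command-line « ssh » client. The snippet below is only an illustration, assuming an OpenSSH client is installed: the helper name build_ssh_command is hypothetical, and « -X » is the OpenSSH option that enables X11 forwarding (the counterpart of the « X11 » check box in Putty). The command is printed, not executed.

```python
import shlex


def build_ssh_command(user, host, x11=True):
    """Return the argument list for an OpenSSH connection.

    build_ssh_command is a hypothetical helper written for this handout.
    """
    command = ["ssh"]
    if x11:
        # -X asks the OpenSSH client to forward X11 (graphical) windows.
        command.append("-X")
    command.append(f"{user}@{host}")
    return command


if __name__ == "__main__":
    # Login and server address as given in this handout.
    cmd = build_ssh_command("user01", "139.124.153.20")
    # Show the command to type in a local terminal; nothing is run here.
    print(shlex.join(cmd))
```

Typing the printed command in a local terminal should prompt for the password, as in the Putty window.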

=> Launch CLC

Once you are connected, you can launch the CLC software:

- type « clcgenomicswb4 »

Note that on a Linux server like this one, you can type just the first letters of a command and press « tab » to complete it automatically (press « tab » twice to list the possibilities when several commands match).

While CLC is starting, some “update” windows may appear; you can close them.

=> Create your own folder

Because many people share the same « user01 » account, each of you must create your own directory in CLC to avoid mixing up files. This directory will be your working space (where you will put your own files). Name it after yourself, for example.

Create the folder by right-clicking in the left menu (navigation area):

=> Import an assembled sequence (= a reference sequence = a genome) :

First, import the genome sequence you wish to work with. The file can be in fasta format:

An example file, my_genome.fna, is located in ./formation_bioinfo/. Select it and click « next » to import this sequence.
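As a reminder of what a fasta file contains: each record starts with a header line beginning with « > », followed by one or more lines of sequence. The minimal reader below is a sketch for inspecting such a file outside CLC; the function name read_fasta and the toy records are illustrative and do not reflect the actual content of my_genome.fna.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record begins: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines are concatenated into a single string.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


if __name__ == "__main__":
    # Toy data, not the real example genome.
    example = [">toy_genome", "ACGTNNNN", "ACGT"]
    for name, seq in read_fasta(example):
        print(name, len(seq))
```

To read a real file, pass an open file object instead of the list, for example read_fasta(open(path)) with a path of your choice.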

Then select your own folder (the one bearing your name) as the destination for this sequence:

Double-click on the imported sequence in the “navigation area” to open it in the main window:

=> Import your reads from a sequencing run

In the same way, we will import the reads (= very short sequences produced by NGS). The file containing the reads can also be in fasta format. Import an example file as follows:

and select the file named « reads.fasta » in « ./formation_bioinfo ».

Deselect the options as in the example below, then click « next »:

Again, select your own folder as destination for this file :

=> Map the reads on the reference sequence

The goal is to map (= align) the reads against your reference sequence. CLC provides a simple tool for this: in the « Toolbox » box at the bottom left (or in the main top menu), go to « High-Throughput sequencing » => « Map reads to Reference »:

Select the file containing the reads by moving it from the left panel to the right panel (use the left and right arrows), then click « next »:

Then select your reference genome (small icon with a folder and a magnifying glass), choose the file “my_genome”, and click « next »:

Keep the default options for mapping:

Choose the destination folder as usual, and click « finish ». A new file is created and you should see something like:

=> Close the gaps!

The example genome sequence contains some gaps, which are meant to be closed using the mapping information.

In this example there are 3 gaps, listed here from the beginning of the sequence to the end: a small gap that is easy to close by taking the consensus of the reads; a larger gap; and a medium-sized gap in which the number of unknown nucleotides (N) does not match the real number of missing nucleotides.
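To get an overview of where the gaps lie and how long they are, the runs of N can also be located programmatically. The sketch below uses a regular expression; the function name find_gaps and the toy sequence are illustrative, and the positions are reported as 1-based inclusive coordinates on the assumption that this matches what you read in a sequence viewer.

```python
import re


def find_gaps(sequence, min_length=1):
    """Return (start, end, length) for each run of N.

    Coordinates are 1-based and inclusive. Runs shorter than
    min_length are ignored.
    """
    gaps = []
    for match in re.finditer(r"N+", sequence.upper()):
        length = match.end() - match.start()
        if length >= min_length:
            gaps.append((match.start() + 1, match.end(), length))
    return gaps


if __name__ == "__main__":
    # Toy sequence with two gaps, not the real example genome.
    toy = "ACGTNNNACGTNNACGT"
    for start, end, length in find_gaps(toy):
        print(f"gap from {start} to {end} ({length} N)")
```

Keep in mind that the reported length is only the number of N written in the file, which may differ from the real number of missing nucleotides.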

=> How to proceed?

Step 1 : find gaps along the sequence

You can of course zoom in and use the scroll bar (or the arrow keys on the keyboard) to look for « NNN... » along the full sequence.

Another way is to use the « find » function. Go to the « find » section of the right menu, type a few « N » (for example « NN »), deselect every option except « include negative strand », and click « find ».

The view of your mapping window will then be positioned exactly on a gap:

Step 2 : copy the sequence from reads that seem to continue the contig through the gap.

You have to check by eye whether the mapped reads continue the contig through the gap with good accuracy, meaning that the majority of the reads share the same sequence (a kind of consensus). If so, copy that part of the read sequence and use it to replace the « NNN... » in the reference sequence.
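The visual check described above amounts to a majority vote at each position. The sketch below makes that idea explicit; it is not how CLC computes its consensus. The function name column_consensus, the representation of reads as (offset, sequence) pairs, and the 0.8 agreement threshold are all assumptions chosen for illustration.

```python
from collections import Counter


def column_consensus(reads, start, end, min_agreement=0.8):
    """Majority base for each reference position in [start, end).

    reads: list of (offset, sequence), offset being the 0-based position
    of the first base of the read on the reference.
    Returns a list of (base, agreement); base is "N" when no read covers
    the position or when agreement is below min_agreement.
    """
    result = []
    for pos in range(start, end):
        # Collect the base that each covering read shows at this position.
        counts = Counter(
            seq[pos - offset]
            for offset, seq in reads
            if offset <= pos < offset + len(seq)
        )
        if not counts:
            result.append(("N", 0.0))
            continue
        base, votes = counts.most_common(1)[0]
        agreement = votes / sum(counts.values())
        result.append((base if agreement >= min_agreement else "N", agreement))
    return result


if __name__ == "__main__":
    # Toy reads: the last one disagrees with the others at position 4.
    toy_reads = [(0, "ACGTAC"), (2, "GTACGG"), (3, "TACGG"), (3, "TTCGG")]
    for base, agreement in column_consensus(toy_reads, 3, 6):
        print(base, agreement)
```

A position that comes back as « N » is one where the reads disagree too much (or are absent) to be copied into the reference with confidence.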

By default, it is not possible to select and copy part of a read sequence. To do so, change the default view: in the right menu, set « read layout » => « compactness » to « low ».

The reads are now selectable: select part of a read, right-click on the selection, and choose « copy » to copy the nucleotides:

Step 3: edit your reference sequence

Return to the window showing your genome alone (it should be named “my_genome”), i.e. the single sequence, not the previous mapping window with the reads.

Find the appropriate gap, select the part of the gap to edit, right-click, and choose « edit ».

Then paste the sequence that you copied previously.

Now that you understand the main principle of finishing, you can fill the 3 gaps of this genome.

Be aware that the number of N may not match the real number of missing nucleotides. Make sure that the read sequence you add when closing the gap overlaps exactly the beginning of the next contig. If it does not, either the gap is shorter or longer than the run of N suggests, or the read sequence you added is wrong!
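The overlap rule can be written down as a small check. The sketch below replaces a run of N with a read-derived sequence only if the end of that sequence matches the start of the next contig exactly, and it does not duplicate the overlapping bases. The function name close_gap, the 0-based end-exclusive coordinates, and the minimum overlap of 5 bases are assumptions made for the example.

```python
def close_gap(reference, gap_start, gap_end, fill, min_overlap=5):
    """Replace reference[gap_start:gap_end] (the run of N) with fill.

    fill is the sequence copied from the reads. Its last bases must
    match the start of the next contig exactly over at least
    min_overlap bases; those bases are kept only once in the result.
    """
    right_flank = reference[gap_end:]
    # Longest suffix of fill that is also a prefix of the next contig.
    overlap = 0
    for k in range(min(len(fill), len(right_flank)), 0, -1):
        if fill[-k:] == right_flank[:k]:
            overlap = k
            break
    if overlap < min_overlap:
        raise ValueError(
            f"fill overlaps the next contig by only {overlap} base(s); "
            "the gap size or the copied read sequence may be wrong"
        )
    new_bases = fill[:len(fill) - overlap]
    return reference[:gap_start] + new_bases + right_flank


if __name__ == "__main__":
    # Toy case: 4 N in the file, but only 2 bases ("TA") are really missing.
    toy = "AAAACCCC" + "NNNN" + "GGGGTTTTAC"
    print(close_gap(toy, 8, 12, "TAGGGGT"))
```

In the toy case the closed sequence is shorter than the original, which is exactly the situation where the number of N does not match the real gap size.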

Once your genome is fully finished, you can export its sequence to a final file in fasta format: right-click on your sequence, then « export », and « save » as fasta.
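For reference, the exported file should look like the output of the sketch below: a header line followed by the sequence split over several lines. The function name format_fasta, the record name, and the width of 60 characters per line are assumptions; CLC may use a different line width.

```python
def format_fasta(name, sequence, width=60):
    """Return one FASTA record as text, wrapping the sequence at width."""
    lines = [f">{name}"]
    for i in range(0, len(sequence), width):
        lines.append(sequence[i:i + width])
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Toy sequence and hypothetical record name.
    finished = "AAAACCCCTAGGGGTTTTAC"
    print(format_fasta("my_genome_finished", finished, width=10), end="")
    # A finished genome should no longer contain any N.
    print("remaining N:", finished.count("N"))
```

Counting the remaining N, as in the last line, is a quick way to confirm that every gap has been closed before exporting.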