Tuesday, July 28, 2015, 5:39:48 PM

Static Pages for SEO

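###Why make a site static?
For search engines, the query-parameter mechanics of many dynamic pages make them hard to index, whereas static pages are indexed more readily. Static pages also improve page load speed, system performance, and stability to some degree. Take http://alpha.tech2ipo.com as an example: for search engine optimization, a static page is generated for every article, and when a crawler requests an article, the request is actually forwarded to that article's static page. The implementation has three parts. First, a Python script periodically generates or updates the sitemap file. Second, the site backend uses an EJS template to render each article as a static page in a fixed format. Finally, nginx.conf is modified so that search engine requests are forwarded to the static pages. The result is that a crawler requesting an article always receives the static page. The steps in detail: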

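###1. Generate the sitemap and sitemap index
The site's sitemap is generated by a Python script, following the sitemap format that Google documents. For the code that creates the sitemap, see the article on generating sitemap.xml with a Python script.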
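
The script itself is not reproduced here, so the following is only a minimal sketch of what such a generator might look like, assuming the standard sitemaps.org schema. The function names `build_sitemap` and `build_sitemap_index`, the article URL, and the sitemap file name are all hypothetical and not taken from the project.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XML_DECLARATION = '<?xml version="1.0" encoding="UTF-8"?>\n'


def build_sitemap(urls):
    """Return a <urlset> document for an iterable of (loc, lastmod) pairs."""
    root = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in urls:
        entry = ET.SubElement(root, "url")
        ET.SubElement(entry, "loc").text = loc
        ET.SubElement(entry, "lastmod").text = lastmod
    return XML_DECLARATION + ET.tostring(root, encoding="unicode")


def build_sitemap_index(sitemap_urls):
    """Return a <sitemapindex> document pointing at individual sitemap files."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for loc in sitemap_urls:
        entry = ET.SubElement(root, "sitemap")
        ET.SubElement(entry, "loc").text = loc
    return XML_DECLARATION + ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical article URL and sitemap file name, for illustration only.
    print(build_sitemap([("http://alpha.tech2ipo.com/1001", "2015-07-28")]))
    print(build_sitemap_index(["http://alpha.tech2ipo.com/sitemap1.xml"]))
```

Running a script like this on a schedule (for example from cron) and writing the output to the web root would match the "generate or update periodically" step described earlier.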

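###2. Generate the static pages
The site backend runs on Node.js and Express with the EJS template engine, and stores its data in LeanCloud. Different search engines may need different templates. Taking Baidu as an example: once the EJS template is defined, the backend fetches the article content from LeanCloud and renders the template to produce the static page.
Here is the custom EJS template: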

<!DOCTYPE html>
<head>
<meta charset=utf-8>
<title><%= post_title %></title>
<link rel=stylesheet href=//dn-s798.qbox.me/static/css/init.147fb604.css>
<link rel=stylesheet href=//dn-s798.qbox.me/static/css/minisite/post_read.895c7e59.css>
<link rel=stylesheet href=//dn-s798.qbox.me/static/css/minisite/html.ba7fe361.css>
<meta name=viewport content="width=device-width,initial-scale=1,maximum-scale=1,user-scalable=no">
<link rel="shortcut icon" type=image/x-icon href="<%= site_favicon %>">
<link rel=canonical href="http://<%= default_host %>/<%= post_ID %>">

<% if (query_site == 'xiaozhi') { %>
<style>
.replyLi{display:none}
.post_read a{pointer-events:none;color:#000}
.post_read p{font-size:16px;text-indent:2em;}
</style>
<% } %>

<body>
<div class=phone>
    <div class=head>
        <h1 class=logo>
            <a href="/" class=name>
                <%= site_name %>
            </a>
            <div class=slogo>
                <%= site_slogo %>
            </div>
        </h1>
        <h1 class=title>
            <%= post_title %>
        </h1>
        <p class=author>
            <%= post_author %>
        <span>
            <%= post_time %>
        </span>
        </p>
    </div>
    <div class=post_read>
        <%- post_html %>
        <% if (post_txt.length) { %>
            <div class=replyLi>
                <div class=bar>
                    <%= post_txt.length %> comments
                </div>
                <% for (var i = 0; i < post_txt.length; ++i) { %>
                    <div class=C>
                        <%- post_txt[i].txt %>
                        <p class=author>
                        <%= post_txt[i].owner[1] %>
                        </p>
                    </div>
                <% } %>
            </div>
        <% } %>
    </div>
</div>
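
The template mixes two EJS output tags: `<%= ... %>` HTML-escapes the value, while `<%- ... %>` writes it out unchanged. That is why the already-rendered `post_html` and the Markdown-converted comments use `<%-`. The Python sketch below only illustrates that distinction. The helper names are hypothetical, the markup is a heavily reduced stand-in for the template, and `html.escape` is only roughly equivalent to EJS's escaping.

```python
import html


def render_escaped(value):
    """Mimic EJS <%= value %>: HTML-escape the value before output."""
    return html.escape(str(value))


def render_raw(value):
    """Mimic EJS <%- value %>: output the value as-is (trusted HTML only)."""
    return str(value)


def render_post(post_title, post_html):
    # Hypothetical, reduced version of the template: the title is escaped,
    # the article body is inserted as raw HTML.
    return "<h1 class=title>{}</h1><div class=post_read>{}</div>".format(
        render_escaped(post_title), render_raw(post_html)
    )


if __name__ == "__main__":
    print(render_post("Tips & <tricks>", "<p>Body rendered elsewhere</p>"))
```

Because raw output bypasses escaping, it is only safe for HTML the backend has produced or sanitized itself.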

The corresponding route handler:

app = require("app")
require "cloud/db/post"
marked = require('marked')
DB = require "cloud/_db"

marked.setOptions({
    renderer: new marked.Renderer()
    breaks: true
    sanitize: true
})

app.get('/post/:host/:post_ID', (request, res) ->
    host = request.params.host.toLowerCase()
    post_ID = request.params.post_ID
    query_site = request.query.site
    DB.Site.by_host(
        {host}
        success:(_site) ->
            if not _site
                res.status(404).send '404'
                return
            site = DB.Site(_site)

            DB.Post.by_id({
                ID:post_ID-0    # coerce the string URL parameter to a number
                host:host
            }, success:(post)->
                if post
                    DB.PostTxt.by_post({
                        post_id: post.id
                        site_id: site.id
                    }, success: (post_txt_list) ->
                            for i in post_txt_list
                                i.txt = marked(i.txt)
                            d = post.updatedAt
                            res.render(
                                'static',
                                {
                                    query_site: query_site
                                    site_name: site.name
                                    site_slogo: site.slogo
                                    site_favicon: site.favicon
                                    site_host: host

                                    post_ID: post_ID
                                    post_title: post.get('title')
                                    post_author: post.get('author')
                                    post_time: d.getFullYear() + '-' + (d.getMonth() + 1) + '-' + d.getDate()
                                    post_html: post.get('html')
                                    post_txt: post_txt_list
                                }
                            )

                )
                else
                    res.status(404).send '404'
            )
    )
)

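###3. Modify the nginx configuration
The last step is to modify the nginx configuration so that when a crawler requests an article, the request is forwarded to the static page. The configuration is as follows: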

location ~* ^/(\d+)\.html {
  proxy_pass http://${CONFIG.LEANCLOUD.HOST}/post/$host/$1;
}

location ~* ^/(\d+) {
  # Save the capture first: a regex match inside "if" can reset $1.
  set $post_id $1;
  if (
    $http_user_agent ~* "baiduspider|googlebot|360spider|qihoobot|mediapartners-google|adsbot-google|feedfetcher-google|yahoo! slurp|yahoo! slurp china|youdaobot|sosospider|sogou spider|sogou web spider|msnbot|ia_archiver|tomato bot|twitterbot|facebookexternalhit|yandex|yeti|gigabot|bingbot|developers\.google\.com"
  ) {
    proxy_pass http://${CONFIG.LEANCLOUD.HOST}/post/$host/$post_id;
  }

  rewrite ^/(.*) /${page}.html break;
}
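
To check which requests the two `location` blocks would send to the static route, the matching logic can be modelled outside nginx. The following Python sketch is only an approximation under stated assumptions: the function name `upstream_path` is hypothetical, the User-Agent strings in the example are illustrative, and the model ignores everything nginx does beyond the regex matching shown in the configuration.

```python
import re

# Same alternation as the nginx "if"; re.IGNORECASE mirrors the "~*" operator.
CRAWLER_RE = re.compile(
    r"baiduspider|googlebot|360spider|qihoobot|mediapartners-google"
    r"|adsbot-google|feedfetcher-google|yahoo! slurp|yahoo! slurp china"
    r"|youdaobot|sosospider|sogou spider|sogou web spider|msnbot|ia_archiver"
    r"|tomato bot|twitterbot|facebookexternalhit|yandex|yeti|gigabot|bingbot"
    r"|developers\.google\.com",
    re.IGNORECASE,
)
HTML_RE = re.compile(r"^/(\d+)\.html", re.IGNORECASE)  # first location block
POST_RE = re.compile(r"^/(\d+)", re.IGNORECASE)        # second location block


def upstream_path(path, user_agent, host):
    """Return the static-route path a request would be proxied to,
    or None if the request is served the normal page instead."""
    m = HTML_RE.match(path)
    if m:
        # /<id>.html is proxied for every client, crawler or not.
        return "/post/{}/{}".format(host, m.group(1))
    m = POST_RE.match(path)
    if m and CRAWLER_RE.search(user_agent or ""):
        # /<id> is proxied only when the User-Agent looks like a crawler.
        return "/post/{}/{}".format(host, m.group(1))
    return None


if __name__ == "__main__":
    host = "alpha.tech2ipo.com"
    print(upstream_path("/123", "Mozilla/5.0 (compatible; Baiduspider/2.0)", host))
    print(upstream_path("/123", "Mozilla/5.0 (Windows NT 6.1) Chrome/44.0", host))
```

A model like this is handy for unit-testing changes to the crawler list before reloading nginx.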

###Project repositories
For the full code, see the project repositories:
https://github.com/noman798/798
https://github.com/noman798/leancloud_798
